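I know there won't be a "premade" option for what I'm trying to do.
I have a series of text files like this one, put together with a combination of grep and sed from the output of other tools.
Example file "stacking-IVT7.dat" and its contents: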
./stacking_t13/ALL-stacking-13.dat #this is a line in the file, for disambiguation
==> stacking-count-11DG.dat <==
0.8822 Undefined
0.1178 stacked
==> stacking-count-12DT.dat <==
0.9321 Undefined
0.0679 stacked
==> stacking-count-14DG.dat <==
0.1701 Undefined
0.8299 stacked
I want to read them into a pd.DataFrame and structure it like this:
Interaction IVT7
13-vs-11DG 0.1178
13-vs-12DT 0.0679
13-vs-14DG 0.8299
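As you can see, I would be selectively "pulling" the left-hand column names from inside the file, and "pulling" the column header from the filename. This looks like a problem for a combination of pd.read_csv() and re.findall().
I don't know where to start, or how to combine the two in a meaningful way.
edit: I have searched and read quite a bit about pd.read_csv(), but it doesn't seem to be built to do what I want.
I can get it to import structured (csv-like) text files successfully, and I have written a script that does this well: https://github.com/PwnusMaximus/md_scripts/blob/0ad82d6dbc096af4422ea625c29f4c0b0bfb4b95/analysis/combine-hbond-avg.py
I also know (fairly crudely) how to pull this file apart with sed to do most of the cleanup I want. (I know this is inefficient.)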
sed -i '/Undefined/d' *.dat
sed -i 's/stacked//g' *.dat
sed -i 's/*[0-9]+[A-Z]+*/[0-9]+[A-Z]+/' *.dat
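However, when it comes to getting pd.read_csv() to actually import a file of this nature, I'm at a loss, and I haven't been able to get it to parse anything beyond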
df_final = pd.read_csv('super-duper-stacking-IVT7.dat', header=None)
edit2: clarified the file contents vs. the filename above
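You've correctly recognized that there's no ready-made solution for what you want to do. You'll have to read the file line by line and build a data structure containing the information you need.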
You can use a regular expression to extract, for example, the 11DG part of stacking-count-11DG.dat.
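As a quick illustration of just the extraction step, here is a minimal sketch run on one of the header lines from your file. The variable names `header` and `label` are only for this example. The dot before `dat` is escaped so that it matches a literal period rather than any character.

```python
import re

# Capture whatever sits between "stacking-count-" and the literal ".dat"
interaction_regex = re.compile(r"stacking-count-(.*?)\.dat")

header = "==> stacking-count-11DG.dat <=="
match = interaction_regex.search(header)

# search() returns None when the pattern is absent, so guard before using it
label = match.group(1) if match else None
print(label)  # 11DG
```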
Consider the following:
import re
import pandas as pd

# This regex captures anything after stacking-count- and before .dat
# (the dot is escaped so it only matches a literal ".")
interaction_regex = re.compile(r"stacking-count-(.*?)\.dat")

all_data = []  # Empty list to hold all data
current_interaction = ""

with open("stacking-IVT7.dat") as f:
    for line in f:
        line = line.strip()  # Strip the line
        if not line:
            continue  # If the line is empty, move to the next line

        # If the line begins and ends with arrows, it is a filename,
        # so try to extract the interaction from it
        if line.startswith("==>") and line.endswith("<=="):
            inter = interaction_regex.findall(line)
            if not inter:
                continue  # If inter is empty, go to the next line
            # If not, set the currently active interaction
            current_interaction = f"13-vs-{inter[0]}"
        # If the line doesn't begin and end with arrows, try to extract data from it,
        # but only if current_interaction is not empty
        elif current_interaction:
            file_row = line.split()  # Split the line on whitespace
            # Only keep rows whose second element is "stacked"
            if len(file_row) >= 2 and file_row[1] == "stacked":
                # Create a tuple containing the current_interaction and the number in this line
                df_row = (current_interaction, float(file_row[0]))
                all_data.append(df_row)  # Append the tuple to our list

# Create a dataframe using the data we read
df = pd.DataFrame(all_data, columns=["Interaction", "IVT7"])
This gives the following dataframe:
Interaction IVT7
0 13-vs-11DG 0.1178
1 13-vs-12DT 0.0679
2 13-vs-14DG 0.8299
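Since you mentioned having a series of these files and wanting the column header pulled from the filename, the same loop could be wrapped in a function and run once per file. This is only a sketch under assumptions: it treats whatever follows `stacking-` in the filename as the column header, and `read_stacking`, `combine`, and the `prefix` parameter are hypothetical names I made up for illustration, not part of pandas.

```python
import re
from pathlib import Path

import pandas as pd

HEADER_RE = re.compile(r"stacking-count-(.*?)\.dat")
# Assumption: the column header is the part of the filename after "stacking-"
COLUMN_RE = re.compile(r"stacking-(.+)\.dat$")


def read_stacking(path, prefix="13-vs-"):
    """Return the 'stacked' fractions of one file as a Series named after the file."""
    path = Path(path)
    found = COLUMN_RE.search(path.name)
    column = found.group(1) if found else path.stem  # fall back to the bare filename

    values = {}
    current = None
    with open(path) as f:
        for line in f:
            parts = line.split()
            if not parts:
                continue  # skip blank lines
            if parts[0] == "==>" and parts[-1] == "<==":
                # Header line: remember which interaction the next rows belong to
                header = HEADER_RE.search(line)
                current = prefix + header.group(1) if header else None
            elif current and len(parts) >= 2 and parts[1] == "stacked":
                values[current] = float(parts[0])
    return pd.Series(values, name=column)


def combine(paths):
    """Put one column per file side by side, aligned on the interaction label."""
    df = pd.concat([read_stacking(p) for p in paths], axis=1)
    df.index.name = "Interaction"
    return df
```

Calling `combine(["stacking-IVT7.dat"])` on the file above should give a one-column frame indexed by Interaction. With more files in the list, each one adds a column, and interactions missing from a file show up as NaN.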