Converting an irregular text file into an ordered DataFrame



I know there won't be a "premade" option for what I'm trying to do.

I have a series of text files like this one, produced by combining grep and sed on the output of other tools.

Example file "stacking-IVT7.dat" and its contents:

./stacking_t13/ALL-stacking-13.dat   #this is a line in the file, for disambiguation 
==> stacking-count-11DG.dat <==
0.8822 Undefined
0.1178 stacked
==> stacking-count-12DT.dat <==
0.9321 Undefined
0.0679 stacked
==> stacking-count-14DG.dat <==
0.1701 Undefined
0.8299 stacked

I want to read them into a pd.DataFrame and structure it like this:

Interaction IVT7
13-vs-11DG  0.1178 
13-vs-12DT  0.0679
13-vs-14DG  0.8299

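As you can see, I want to selectively "pull" the left-hand row names from inside the file and "pull" the column header from the file name. This seems to be a problem that combines pd.read_csv() and re.findall().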

I don't know where to start, or how to combine these two pieces of functionality in a meaningful way.

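edit: I have searched and read quite a bit about pd.read_csv(), but it does not seem to be built to do what I want.
I can get it to import structured (csv-like) text files successfully, and I wrote a script here that works well: https://github.com/PwnusMaximus/md_scripts/blob/0ad82d6dbc096af4422ea625c29f4c0b0bfb4b95/analysis/combine-hbond-avg.py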

I also know (fairly crudely) how to use sed to pick this file apart and do most of the cleanup I want. (I know this is inefficient.)

sed -i '/Undefined/d' *.dat 
sed -i 's/stacked//g' *.dat 
sed -i 's/*[0-9]+[A-Z]+*/[0-9]+[A-Z]+/' *.dat 

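However, I'm at a loss as to how to get pd.read_csv() to actually import a file of this nature, and I haven't been able to get it to parse anything beyond this: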

df_final = pd.read_csv('super-duper-stacking-IVT7.dat', header=None)

edit2: clarified the file contents versus the file name above.

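You have correctly recognized that there is no ready-made solution for what you want to do. You have to read the file line by line and build a data structure containing the information you need.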

You can use a regular expression to extract, for example, the 11DG part of stacking-count-11DG.dat.
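As a quick check of that idea in isolation, here is a minimal sketch that runs the pattern against one header line and one data line copied from the sample file. The variable name header is only for illustration; the dot before dat is escaped so that it matches a literal period.

```python
import re

# Capture whatever sits between "stacking-count-" and ".dat".
# The dot is escaped so it only matches a literal "." character.
interaction_regex = re.compile(r"stacking-count-(.*?)\.dat")

header = "==> stacking-count-11DG.dat <=="  # a header line from the sample file
header_match = interaction_regex.findall(header)
data_match = interaction_regex.findall("0.8822 Undefined")

print(header_match)  # ['11DG']
print(data_match)    # [] because data lines do not contain the pattern
```

findall returns an empty list when nothing matches, which is why a plain truthiness check is enough to skip lines that are not headers.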

Consider the following:

import re
import pandas as pd

# This regex captures anything after "stacking-count-" and before ".dat"
# (the dot is escaped so it matches a literal period)
interaction_regex = re.compile(r"stacking-count-(.*?)\.dat")
all_data = []  # Empty list to hold all data
current_interaction = ""
with open("stacking-IVT7.dat") as f:
    for line in f:
        line = line.strip()  # Strip leading and trailing whitespace
        if not line:
            continue  # If the line is empty, move to the next line
        # If the line begins and ends with arrows, it is a filename,
        # so try to extract the interaction from it
        if line.startswith("==>") and line.endswith("<=="):
            inter = interaction_regex.findall(line)
            if not inter:
                continue  # If inter is empty, go to the next line
            current_interaction = f"13-vs-{inter[0]}"  # Otherwise, set the currently active interaction
        # If the line doesn't begin and end with arrows, try to extract data from it,
        # but only if current_interaction is not empty
        elif current_interaction:
            file_row = line.split()  # Split the line on whitespace
            # Guard against one-word lines before checking the second element
            if len(file_row) > 1 and file_row[1] == "stacked":
                # Create a tuple containing the current_interaction and the number in this line
                df_row = (current_interaction, float(file_row[0]))
                all_data.append(df_row)  # Append the tuple to our list

df = pd.DataFrame(all_data, columns=["Interaction", "IVT7"])  # Create a dataframe using the data we read

This gives the following dataframe:

Interaction    IVT7
0  13-vs-11DG  0.1178
1  13-vs-12DT  0.0679
2  13-vs-14DG  0.8299
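Since the question mentions a series of such files, with the column header taken from each file name, the same loop could be wrapped in a function and applied per file. The following is only a sketch under assumptions: the helper names parse_stacking_lines and read_stacking_files are hypothetical, the file names are assumed to end in stacking-<LABEL>.dat, and the "13-vs-" prefix is kept as a parameter because the code above hard-codes it.

```python
import re
from pathlib import Path

import pandas as pd

HEADER_RE = re.compile(r"stacking-count-(.*?)\.dat")
# Assumption: input files are named like "stacking-IVT7.dat"
COLUMN_RE = re.compile(r"stacking-(.+)\.dat$")


def parse_stacking_lines(lines, prefix="13-vs-"):
    """Return {interaction: stacked fraction} from an iterable of lines."""
    values = {}
    current = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith("==>") and line.endswith("<=="):
            match = HEADER_RE.search(line)
            current = f"{prefix}{match.group(1)}" if match else None
        elif current:
            parts = line.split()
            if len(parts) == 2 and parts[1] == "stacked":
                values[current] = float(parts[0])
    return values


def read_stacking_files(paths):
    """Build one DataFrame with a column per file, named from the file name."""
    columns = {}
    for path in map(Path, paths):
        match = COLUMN_RE.search(path.name)
        name = match.group(1) if match else path.stem  # fall back to the bare stem
        with open(path) as f:
            columns[name] = pd.Series(parse_stacking_lines(f), dtype=float)
    df = pd.DataFrame(columns)  # Series are aligned on their interaction labels
    df.index.name = "Interaction"
    return df
```

Calling read_stacking_files(["stacking-IVT7.dat"]) would then give a frame indexed by interaction with a single IVT7 column; passing more files adds one column per file, with NaN wherever a file lacks an interaction.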
