Converting an irregular text file into an ordered DataFrame

I know there won't be a "pre-made" option for what I'm trying to do.

I have a series of text files like this one, which were produced by a combination of grep and sed in other tools.

Example file "stacking-IVT7.dat" and its contents:

./stacking_t13/ALL-stacking-13.dat   #this is a line in the file, for disambiguation 
==> stacking-count-11DG.dat <==
0.8822 Undefined
0.1178 stacked
==> stacking-count-12DT.dat <==
0.9321 Undefined
0.0679 stacked
==> stacking-count-14DG.dat <==
0.1701 Undefined
0.8299 stacked

I would like to read them into a pd.DataFrame and structure it like this:

Interaction IVT7
13-vs-11DG  0.1178 
13-vs-12DT  0.0679
13-vs-14DG  0.8299

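As you can see, I want to selectively "pull" the left-hand row labels from inside the file and "pull" the column header from the file name. This looks like a job for a combination of pd.read_csv() and re.findall().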

I don't know where to start, or how to combine these two functions in a meaningful way.

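edit: I have searched and read quite a bit about pd.read_csv(), but it doesn't seem to be built to do what I want.
I can get it to import structured (csv-like) text files successfully, and I wrote a script here that works well: https://github.com/PwnusMaximus/md_scripts/blob/0ad82d6dbc096af4422ea625c29f4c0b0bfb4b95/analysis/combine-hbond-avg.py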

I also know (rather crudely) how to use sed to pull this file apart and do most of the cleanup I want. (I know this is very inefficient.)

sed -i '/Undefined/d' *.dat 
sed -i 's/stacked//g' *.dat 
sed -i 's/*[0-9]+[A-Z]+*/[0-9]+[A-Z]+/' *.dat

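However, when it comes to getting pd.read_csv() to actually import a file of this nature, I'm at a loss, and I haven't been able to get it to parse anything beyond this: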

df_final = pd.read_csv('super-duper-stacking-IVT7.dat', header=None)

edit2: clarified the file contents versus the file name above.

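You have correctly recognized that there is no ready-made solution for what you're trying to do. You will have to read the file line by line and build a data structure containing the information you need.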

You can use a regular expression to extract, for example, the 11DG part of stacking-count-11DG.dat.
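As a quick illustration of that extraction step on its own, here is a minimal sketch using re.findall on one header line copied from the sample file. Escaping the dot before "dat" is a small tightening I am assuming you want, so that it only matches a literal period.

```python
import re

# Capture whatever sits between "stacking-count-" and ".dat"
pattern = re.compile(r"stacking-count-(.*?)\.dat")

header = "==> stacking-count-11DG.dat <=="
labels = pattern.findall(header)  # list of captured groups, empty if no match
print(labels)
```

Because findall returns an empty list when nothing matches, a plain truthiness check is enough to skip lines that are not headers.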

Consider the following:

import re
import pandas as pd

# This regex captures anything after "stacking-count-" and before ".dat"
interaction_regex = re.compile(r"stacking-count-(.*?)\.dat")
all_data = []  # Empty list to hold all data
current_interaction = ""
with open("stacking-IVT7.dat") as f:
    for line in f:
        line = line.strip()  # Strip surrounding whitespace
        if not line:
            continue  # If the line is empty, move to the next line
        # If the line begins and ends with arrows, it is a filename, so try to extract the interaction from it
        if line.startswith("==>") and line.endswith("<=="):
            inter = interaction_regex.findall(line)
            if not inter:
                continue  # If inter is empty, go to the next line
            current_interaction = f"13-vs-{inter[0]}"  # If not, set the currently active interaction
        # If the line doesn't begin and end with arrows, try to extract data from it,
        # but only if current_interaction is not empty
        elif current_interaction:
            file_row = line.split()  # Split the line on whitespace
            if len(file_row) > 1 and file_row[1] == "stacked":
                # If the second element of the row is "stacked",
                # create a tuple containing the current_interaction and the number in this line
                df_row = (current_interaction, float(file_row[0]))
                all_data.append(df_row)  # Append the tuple to our list

df = pd.DataFrame(all_data, columns=["Interaction", "IVT7"])  # Create a dataframe using the data we read

This gives the following DataFrame:

  Interaction    IVT7
0  13-vs-11DG  0.1178
1  13-vs-12DT  0.0679
2  13-vs-14DG  0.8299
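Since you mention a series of such files and want the column header taken from the file name, the same loop could be wrapped in a function and the results combined. The following is only a sketch under assumptions: the function name read_stacking_file, the reference="13" parameter, the naming scheme stacking-<NAME>.dat, and the second file name IVT8 are all made up for illustration, and the demo writes the sample contents to a temporary directory so that it runs on its own.

```python
import re
import tempfile
from pathlib import Path

import pandas as pd

HEADER_RE = re.compile(r"stacking-count-(.*?)\.dat")
COLUMN_RE = re.compile(r"stacking-(.*?)\.dat")  # assumed naming: stacking-<NAME>.dat


def read_stacking_file(path, reference="13"):
    """Return the 'stacked' fractions of one file as a Series named after the file."""
    values = {}
    current = ""
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line.startswith("==>") and line.endswith("<=="):
                found = HEADER_RE.findall(line)
                current = f"{reference}-vs-{found[0]}" if found else ""
            elif current:
                parts = line.split()
                if len(parts) == 2 and parts[1] == "stacked":
                    values[current] = float(parts[0])
    # Column header from the file name; fall back to the bare stem if it does not match
    match = COLUMN_RE.findall(Path(path).name)
    column = match[0] if match else Path(path).stem
    return pd.Series(values, name=column)


# Demo data copied from the question (shortened to two interactions)
sample = """./stacking_t13/ALL-stacking-13.dat
==> stacking-count-11DG.dat <==
0.8822 Undefined
0.1178 stacked
==> stacking-count-12DT.dat <==
0.9321 Undefined
0.0679 stacked
"""

with tempfile.TemporaryDirectory() as tmp:
    paths = []
    for name in ("IVT7", "IVT8"):  # IVT8 is a hypothetical second file
        p = Path(tmp) / f"stacking-{name}.dat"
        p.write_text(sample)
        paths.append(p)
    # One Series per file, placed side by side and aligned on the interaction label
    combined = pd.concat([read_stacking_file(p) for p in paths], axis=1)
    combined.index.name = "Interaction"

print(combined)
```

With real data you would replace the temporary files with something like sorted(Path(".").glob("stacking-*.dat")). Concatenating along axis=1 aligns rows by interaction label, so an interaction missing from one file shows up as NaN in that column.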
