Converting a text file with a header to a pandas DataFrame



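I have been trying to convert a text file to a pandas DataFrame so that I can then run calculations on the values and plot the coordinates.

The text file has the following format, with a long header and many rows. Below is part of the header and a few example rows. I wrote a small script to find the first and last lines of the part of the table I am interested in.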

starfile_name:

# version 30001
data_particles
loop_ 
_rlnTomoParticleName #1 
_rlnTomoName #2 
_rlnNormCorrection #21 
_rlnLogLikeliContribution #22 
_rlnMaxValueProbDistribution #23 
_rlnNrOfSignificantSamples #24 
TS_002/1     TS_002            1            2            1  1733.000000  3485.000000   938.000000     -1.08872     -1.08872     0.411277   131.760000    89.920000    97.200000 PseudoSubtomo/job052/Subtomograms/TS_002/1_data.mrc PseudoSubtomo/job052/Subtomograms/TS_002/1_weights.mrc            1    92.905599    28.438417    57.199867     1.000000 1.128367e+06     0.017733          224 
TS_002/2     TS_002            1            1            1  1124.000000   693.000000  1096.000000     0.411277     -1.08872     -1.08872    79.270000    86.780000   100.730000 PseudoSubtomo/job052/Subtomograms/TS_002/2_data.mrc PseudoSubtomo/job052/Subtomograms/TS_002/2_weights.mrc            1   159.849821     4.120413   101.904501     1.000000 1.126854e+06     0.183934           37 
TS_002/3     TS_002            1            2            1  1694.000000  2329.000000  1378.000000     5.955277     -6.63272     -1.08872   -140.62000    88.860000    99.000000 PseudoSubtomo/job052/Subtomograms/TS_002/3_data.mrc PseudoSubtomo/job052/Subtomograms/TS_002/3_weights.mrc            1   127.794678     4.085294   168.730698     1.000000 1.124178e+06     0.184649           18 

I used the following lines to convert it to a DataFrame:

import pandas as pd

# skip is the line number where the header and irrelevant part of the table ends
# foot is the number of rows at the end of the table that I'm not interested in
pandas_table = pd.read_csv(starfile_name, engine='python', index_col=False, header=None, skiprows=int(skip), skipfooter=int(foot), sep="\t")
print(pandas_table)
df = pd.DataFrame(data=pandas_table)
df

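It seems the whole table is read as a single column. I tried supplying column labels, but they did not line up with the actual data. I also tried the str.split() and squeeze() options, but I kept getting errors.

Output: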

0
0     TS_002/1     TS_002            1            2 ...
1     TS_002/2     TS_002            1            1 ...
2     TS_002/3     TS_002            1            2 ...
3     TS_002/4     TS_002            1            1 ...
4     TS_002/5     TS_002            1            2 ...
...                                                 ...
1423  TS_002/1424     TS_002            1           ...
1424  TS_002/1425     TS_002            1           ...
1425  TS_002/1426     TS_002            1           ...
1426  TS_002/1427     TS_002            1           ...
1427  TS_002/1428     TS_002            1           ...
[1428 rows x 1 columns]
0
0   TS_002/1 TS_002 1 2 ...
1   TS_002/2 TS_002 1 1 ...
2   TS_002/3 TS_002 1 2 ...
3   TS_002/4 TS_002 1 1 ...
4   TS_002/5 TS_002 1 2 ...
...     ...
1423    TS_002/1424 TS_002 1 ...
1424    TS_002/1425 TS_002 1 ...
1425    TS_002/1426 TS_002 1 ...
1426    TS_002/1427 TS_002 1 ...
1427    TS_002/1428 TS_002 1 ...
1428 rows × 1 columns

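The columns in your file are separated by runs of spaces of varying length, so I think it will help to split on whitespace with the regular expression sep=r'\s+':

df = pd.read_csv(starfile_name,  ...., sep=r'\s+')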
print(df)
>>>
0       1   2   3   4       5       6       7         8        9   
0  TS_002/1  TS_002   1   2   1  1733.0  3485.0   938.0 -1.088720 -1.08872   
1  TS_002/2  TS_002   1   1   1  1124.0   693.0  1096.0  0.411277 -1.08872   
2  TS_002/3  TS_002   1   2   1  1694.0  2329.0  1378.0  5.955277 -6.63272   
...                                                 14  
0  ...  PseudoSubtomo/job052/Subtomograms/TS_002/1_dat...   
1  ...  PseudoSubtomo/job052/Subtomograms/TS_002/2_dat...   
2  ...  PseudoSubtomo/job052/Subtomograms/TS_002/3_dat...   
15  16          17  
0  PseudoSubtomo/job052/Subtomograms/TS_002/1_wei...   1   92.905599   
1  PseudoSubtomo/job052/Subtomograms/TS_002/2_wei...   1  159.849821   
2  PseudoSubtomo/job052/Subtomograms/TS_002/3_wei...   1  127.794678   
18          19   20         21        22   23  
0  28.438417   57.199867  1.0  1128367.0  0.017733  224  
1   4.120413  101.904501  1.0  1126854.0  0.183934   37  
2   4.085294  168.730698  1.0  1124178.0  0.184649   18  
[3 rows x 24 columns]
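
If you also want the column labels from the header instead of the numbers 0 to 23, one possible approach is sketched below. It is an illustration rather than a tested solution for your real file: the sample string is a hypothetical, shortened table with only three columns (renumbered #1 to #3), and it assumes that every label line starts with "_rln" and that the data rows begin directly after the last label line.

```python
import io

import pandas as pd

# Hypothetical, shortened STAR-like sample: 3 columns instead of the real 24.
sample = """\
# version 30001
data_particles
loop_
_rlnTomoParticleName #1
_rlnTomoName #2
_rlnNrOfSignificantSamples #3
TS_002/1     TS_002          224
TS_002/2     TS_002           37
TS_002/3     TS_002           18
"""

lines = sample.splitlines()

# Column labels: take the first token of each "_rln..." line, dropping the "#N" suffix.
names = [ln.split()[0] for ln in lines if ln.startswith("_rln")]

# Assumption: the data rows start right after the last label line.
skip = max(i for i, ln in enumerate(lines) if ln.startswith("_rln")) + 1

df = pd.read_csv(
    io.StringIO(sample),  # replace with starfile_name for a file on disk
    sep=r"\s+",           # split on runs of whitespace
    header=None,          # the table itself has no header row
    names=names,          # labels recovered from the header block
    skiprows=skip,        # skip the header block
)
print(df)
```

With a real file you would read the lines from disk first (for example with open(starfile_name)) and keep your own skipfooter value if there are trailing rows to drop; note that skipfooter needs engine='python'.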
