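I have been struggling to convert a text file into a pandas DataFrame so that I can then run calculations on the values and plot the coordinates.
The text file has the following format, with a long header and many rows. Below I have included part of the header and a few example rows. I wrote a small script to find the first and last lines of the table section I am interested in.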
starfile_name:
# version 30001
data_particles
loop_
_rlnTomoParticleName #1
_rlnTomoName #2
_rlnNormCorrection #21
_rlnLogLikeliContribution #22
_rlnMaxValueProbDistribution #23
_rlnNrOfSignificantSamples #24
TS_002/1 TS_002 1 2 1 1733.000000 3485.000000 938.000000 -1.08872 -1.08872 0.411277 131.760000 89.920000 97.200000 PseudoSubtomo/job052/Subtomograms/TS_002/1_data.mrc PseudoSubtomo/job052/Subtomograms/TS_002/1_weights.mrc 1 92.905599 28.438417 57.199867 1.000000 1.128367e+06 0.017733 224
TS_002/2 TS_002 1 1 1 1124.000000 693.000000 1096.000000 0.411277 -1.08872 -1.08872 79.270000 86.780000 100.730000 PseudoSubtomo/job052/Subtomograms/TS_002/2_data.mrc PseudoSubtomo/job052/Subtomograms/TS_002/2_weights.mrc 1 159.849821 4.120413 101.904501 1.000000 1.126854e+06 0.183934 37
TS_002/3 TS_002 1 2 1 1694.000000 2329.000000 1378.000000 5.955277 -6.63272 -1.08872 -140.62000 88.860000 99.000000 PseudoSubtomo/job052/Subtomograms/TS_002/3_data.mrc PseudoSubtomo/job052/Subtomograms/TS_002/3_weights.mrc 1 127.794678 4.085294 168.730698 1.000000 1.124178e+06 0.184649 18
I use the following lines to convert it to a DataFrame:
import pandas as pd

# skip is the line number where the header and irrelevant part of the table ends
# foot is the number of rows at the end of the table that I'm not interested in
pandas_table = pd.read_csv(starfile_name, engine='python', index_col=False, header=None, skiprows=int(skip), skipfooter=int(foot), sep="\t")
print(pandas_table)
df = pd.DataFrame(data=pandas_table)
df
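It seems the whole table is being read as a single column. I tried supplying column labels, but they did not line up with the actual data. I also tried the str.split() and squeeze() options, but I keep getting errors.
Output: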
0
0 TS_002/1 TS_002 1 2 ...
1 TS_002/2 TS_002 1 1 ...
2 TS_002/3 TS_002 1 2 ...
3 TS_002/4 TS_002 1 1 ...
4 TS_002/5 TS_002 1 2 ...
... ...
1423 TS_002/1424 TS_002 1 ...
1424 TS_002/1425 TS_002 1 ...
1425 TS_002/1426 TS_002 1 ...
1426 TS_002/1427 TS_002 1 ...
1427 TS_002/1428 TS_002 1 ...
[1428 rows x 1 columns]
I think it will help to split the columns on variable-length whitespace instead of tabs: use sep=r'\s+'.
df = pd.read_csv(starfile_name, ...., sep=r'\s+')
print(df)
>>>
0 1 2 3 4 5 6 7 8 9
0 TS_002/1 TS_002 1 2 1 1733.0 3485.0 938.0 -1.088720 -1.08872
1 TS_002/2 TS_002 1 1 1 1124.0 693.0 1096.0 0.411277 -1.08872
2 TS_002/3 TS_002 1 2 1 1694.0 2329.0 1378.0 5.955277 -6.63272
... 14
0 ... PseudoSubtomo/job052/Subtomograms/TS_002/1_dat...
1 ... PseudoSubtomo/job052/Subtomograms/TS_002/2_dat...
2 ... PseudoSubtomo/job052/Subtomograms/TS_002/3_dat...
15 16 17
0 PseudoSubtomo/job052/Subtomograms/TS_002/1_wei... 1 92.905599
1 PseudoSubtomo/job052/Subtomograms/TS_002/2_wei... 1 159.849821
2 PseudoSubtomo/job052/Subtomograms/TS_002/3_wei... 1 127.794678
18 19 20 21 22 23
0 28.438417 57.199867 1.0 1128367.0 0.017733 224
1 4.120413 101.904501 1.0 1126854.0 0.183934 37
2 4.085294 168.730698 1.0 1124178.0 0.184649 18
[3 rows x 24 columns]
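If you also want named columns rather than 0 to 23, one option is to take the names from the `_rln...` label lines in the header. The following is only a minimal sketch: `read_star_loop` and the four-column `STAR_TEXT` sample are hypothetical names and data invented for illustration, and it assumes the file holds a single `loop_` block, that every label line starts with `_`, and that no field value contains whitespace.

```python
import io

import pandas as pd

# Hypothetical miniature STAR file (4 labels, 2 rows), invented for illustration.
STAR_TEXT = """\
# version 30001
data_particles
loop_
_rlnTomoParticleName #1
_rlnTomoName #2
_rlnCoordinateX #3
_rlnNrOfSignificantSamples #4
TS_002/1 TS_002 1733.000000 224
TS_002/2 TS_002 1124.000000 37
"""


def read_star_loop(text):
    """Read the first loop_ table of STAR-style text into a DataFrame."""
    lines = text.splitlines()
    # Index of the first line after the "loop_" marker.
    start = next(i for i, line in enumerate(lines) if line.strip() == "loop_") + 1
    # Collect the label lines ("_rlnName #N"), keeping the name without "_".
    names = []
    for line in lines[start:]:
        if not line.startswith("_"):
            break
        names.append(line.split()[0].lstrip("_"))
    # Everything after the labels is treated as whitespace-separated data.
    data = "\n".join(lines[start + len(names):])
    return pd.read_csv(io.StringIO(data), sep=r"\s+", header=None, names=names)


df = read_star_loop(STAR_TEXT)
print(df.dtypes)
```

With a real file you would pass the file contents (for example `open(starfile_name).read()`) instead of `STAR_TEXT`. If the number of labels differs from the number of fields per row, the names will not line up with the data, so it is worth comparing `len(df.columns)` against a row of the file first.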