我有一个.txt数据文件,其中包含许多具有不同标题的列。我可以读取包含所有列和行的文件。但是,我的问题是该文件包含一个带有三列的附加标题,附加到初始标题数据的最后一行。如何将最后三列与第一列分开?此外,我想删除三列中的第一列,因为它是第一列的副本,并将其他两列逐列附加到文件顶部的列中。我用熊猫像这样阅读文件:
c = pd.read_csv('C:filepath.txt', sep=',',header=None,names=['<Title1>','<Title2>','<Title3>','<Title4>','<Title5>','<Title6>','<Title7>','<Title8>','<Title9>','<Title10>','<Title11>','<Title12>'],skiprows=[0,1])
结果是:
<Title1> ... <Title12>
134849000 -0.420384078515376 ... 244.507248
135016000 -0.406915327374619 ... 244.507248
135183000 -0.406915327374619 ... 244.507248
135349000 -0.406915327374619 ... 244.507248
135516000 -0.406915327374619 ... 244.507248
... ... ... ... <-- (somewhere in here there is a new header with three columns)
2316226000 0.349323222511261 ... NaN
2316393000 0.359268272664523 ... NaN
2316560000 0.346797179431672 ... NaN
2316726000 0.291363936474923 ... NaN
2316893000 0.256587672540276 ... NaN
[26188 rows x 12 columns]
可以看出,数据集的"4.th 困境"(或第 4-12 列,第 x 行,1 索引)包含 NaN 值,因为三列已附加到第一个标题的最后一行,因此它们保留空值,因为文件包含从顶部开始的 12 列。此外,两个标题都有两行,其中不需要第一行,所以我需要跳过这些行。
示例文件:
<Header1>
<Title1><Title2><Title3><Title4><Title5><Title6><Title7><Title8><Title9><Title10><Title11><Title12><Title13>
134849000,-0.420384078515376,-0.46532291072594,53.3941583535493,3.94861381238115,0.999999938482075,-0.000223083188831434,-0.000166347560402173,3.08661080398315E-06,304.11793518,274.23748016,189.97101594,244.50724792
135016000,-0.406915327374619,-0.433547012456629,53.3941583535493,3.94861381238115,0.999999910346576,-0.000180534505822662,-0.000206991530844074,2.40981161937076E-06,304.0821228,274.15297698,189.97101594,244.50724792
135183000,-0.406915327374619,-0.433547012456629,53.3941583535493,3.94861381238115,0.999999992511006,-0.000151940021895918,-0.000103313480817761,1.89050478219266E-06,304.0821228,274.15297698,189.97101594,244.50724792
135349000,-0.406915327374619,-0.433547012456629,53.3941583535493,3.94861381238115,0.999999945135159,-0.000162536174319313,-7.40562207892995E-05,2.04948428941809E-06,304.0821228,274.15297698,189.97101594,244.50724792
135516000,-0.406915327374619,-0.433547012456629,53.3941583535493,3.94861381238115,0.99999997640256,-0.000243086633501367,-6.9024988784798E-05,3.36047709420528E-06,304.0821228,274.15297698,189.97101594,244.50724792
135683000,-0.406915327374619,-0.433547012456629,53.3941583535493,3.94861381238115,0.99999997640256,-0.000243086633501367,-6.9024988784798E-05,3.36047709420528E-06,304.0821228,274.15297698,189.97101594,244.50724792
135849000,-0.406915327374619,-0.433547012456629,53.3941583535493,3.94861381238115,0.999999931122814,-0.000250245794219842,-0.000134729677676283,3.5093405085021E-06,304.0821228,274.15297698,189.97101594,244.50724792
136016000,-0.406915327374619,-0.433547012456629,53.3941583535493,3.94861381238115,0.999999952747184,-0.000248275760427849,-0.000209879516698194,3.49816745295883E-06,304.0821228,274.15297698,189.97101594,244.50724792
136183000,-0.420384078515376,-0.46532291072594,53.3941583535493,3.94861381238115,0.99999992607031,-0.000294028840627048,-0.000210060717325711,4.25711234103981E-06,304.11793518,274.23748016,189.97101594,244.50724792
136349000,-0.420916391233475,-0.442738942185795,53.3941583535493,3.94861381238115,0.999999919180309,-0.00029795985581717,-0.000124844955889991,4.29227691325224E-06,304.11935424,274.17742156,189.97101594,244.50724792
136516000,-0.420384078515376,-0.46532291072594,53.3941583535493,3.94861381238115,0.999999888009148,-0.000316878274912839,-3.29402653026431E-05,4.57532859246546E-06,304.11793518,274.23748016,189.97101594,244.50724792
136683000,-0.420916391233475,-0.442738942185795,53.3941583535493,3.94861381238115,0.999999944701863,-0.000302288971167524,-0.000119271820769005,4.36801259359743E-06,304.11935424,274.17742156,189.97101594,244.50724792
136849000,-0.405802775661793,-0.444669714471277,53.3941583535493,3.94861381238115,0.999999944701863,-0.000302288971167524,-0.000119271820769005,4.36801259359743E-06,304.0791626,274.18255616,189.97101594,244.50724792
137016000,-0.420916391233475,-0.442738942185795,53.3941583535493,3.94861381238115,0.99999991055272,-0.00029252456348538,-0.000168782643050744,4.22385527217017E-06,304.11935424,274.17742156,189.97101594,244.50724792
137183000,-0.412309946883439,-0.450987020223235,53.3941583535493,3.94861381238115,0.999999942521442,-0.000255490185269549,-0.00024667166566595,3.6414759449141E-06,304.09646606,274.19935608,189.97101594,244.50724792
137349000,-0.406915327374619,-0.433547012456629,53.3941583535493,3.94861381238115,0.999999876479583,-0.000264577733448331,-0.000298287883815869,3.80576077658318E-06,304.0821228,274.15297698,189.97101594,244.50724792
137516000,-0.406915327374619,-0.433547012456629,53.3941583535493,3.94861381238115,0.999999903983449,-0.000251750438760731,-0.000355224963982992,3.60887866227011E-06,304.0821228,274.15297698,189.97101594,244.50724792
137683000,-0.391801749871831,-0.435460567656641,53.3941583535493,3.94861381238115,0.999999885967664,-0.000231035684436353,-0.000293282668086245,3.24666448882349E-06,304.04193116,274.1580658,189.97101594,244.50724792
137849000,-0.406915327374619,-0.433547012456629,53.3941583535493,3.94861381238115,0.999999885967664,-0.000231035684436353,-0.000293282668086245,3.24666448882349E-06,304.0821228,274.15297698,189.97101594,244.50724792
<Header2>
<Title13(same as Title 1)><Title14><Title15>
134849000,0.120862187115588,0
135016000,0.171543242833847,0
135183000,0.146335932645973,0
135349000,0.09773669641824,0
135516000,0.0882672298282907,0
135683000,0.124406962864472,0
135849000,0.186013875486258,0
136016000,0.219045896500945,0
136183000,0.197246332120462,0
136349000,0.150083583561413,0
136516000,0.0838562129822536,0
136683000,0.00269632558524612,0
136849000,-0.0447052988191479,0
137016000,-0.00496292706410619,0
137183000,0.0799457149607322,0
137349000,0.137388731956788,0
137516000,0.142305654943302,0
137683000,0.115943857754048,0
137849000,0.0991913228381935,0
这里有两个解决方案,第一种方法生成一个新文件,第二种方法在read_csv操作期间修复标头。如果文件将被多次处理,则可以使用第一个,但至少需要读取所有行两次。如果只需要读取一次多个大文件,则首选第二种方法。
解决方案 1:预处理文件
解析一次文件以删除额外的标头。
# create second file with unique header
with open('file.csv', 'r') as f_in, open('file_single_header.csv', 'w') as f_out:
header = f_in.readlines(1)[0]
f_out.write(header)
for line in f_in.readlines():
if line != header:
f_out.write(line)
# then read corrected csv file
pd.read_csv('file_single_header.csv')
解决方案 2:将标题行视为注释并手动分配标题
# read first line of the file to get header and split names
import re
with open('file.csv', 'r') as f:
header = re.split('s+', f.readlines(1)[0].strip())
# exclude header lines and assign names manually
pd.read_csv('file.csv', comment='<', names= header)
注意。 从您的示例中不清楚您是否真的有逗号分隔的值,另一个分隔符。如果这是空间,则需要按如下方式调整read_csv。此外,如果索引位于 csv 文件中,则需要为选项 2 添加一个名称(此处为"无")。
# option 1
pd.read_csv('file_fixed_header.csv', sep='s+')
# option 2
pd.read_csv('file.csv',
comment='<',
names=[None] + header, # added None for index
sep='s+',
index_col=0
).dropna(axis=0, how='all')
解决方案3:修复错误的CSV文件
with open('file.csv', 'r') as f_in, open('file_single_header.csv', 'w') as f_out:
i = 0
for line in f_in.readlines():
if line.strip().startswith('<'):
if i == 1:
f_out.write(','.join(line.strip('<>n').split('><'))+'n')
i+=1
else:
f_out.write(line)
一个可能的解决方案:
临时更改索引
c.reset_index(inplace=True)
在第二个标题中找到新列的行
newcols = c.iloc[c[c.iloc[:, 1].isna()].index.min() + 2:, [1, 2]].reset_index(drop=True)
重命名新列
newcols.rename(columns={'<Title1>' : '<Title14>', '<Title2>' : '<Title15>'}, inplace=True)
添加新列,删除第二个标题中的行,并还原原始索引
c = pd.concat([c, newcols], axis=1).dropna().set_index('index')
我设法为这个特定的数据集找到了一个相对强大和简单的解决方案。
读取数据并跳过第一个标头后:
raw_data = pd.read_csv('C:datafile.txt', sep=',',header=None, skiprows=[0,1])
我检查第一列中的非数值,以找出下一个标题的位置:
a = pd.to_numeric(pd.to_numeric(raw_data[0], errors='coerce').isnull())
结果:
0 False
1 False
2 False
3 False
4 False
...
26183 False
26184 False
26185 False
26186 False
26187 False
Name: 0, Length: 26188, dtype: bool
然后我找到语句为真的索引:
a = np.where(a)[0]
结果:
[13093 13094]
从这里,我可以使用索引简单地索引两个标头的数据:
d = raw_data.iloc[:raw_data.index.get_loc(a[0])]
e = raw_data.iloc[raw_data.index.get_loc(a[0])+2:, :3]
在 e 中,我还确保索引列,因为我们只有三列用于第二个标题
结果:
d =
0 1 ... 11 12
0 134849000 -0.420384078515376 ... 189.971016 244.507248
1 135016000 -0.406915327374619 ... 189.971016 244.507248
2 135183000 -0.406915327374619 ... 189.971016 244.507248
3 135349000 -0.406915327374619 ... 189.971016 244.507248
4 135516000 -0.406915327374619 ... 189.971016 244.507248
... ... ... ... ... ...
13088 2316226000 -0.30945361835179 ... 188.914284 243.942856
13089 2316393000 -0.4099956694033 ... 188.914284 243.942856
13090 2316560000 -0.4099956694033 ... 188.914284 243.942856
13091 2316726000 -0.4099956694033 ... 188.914284 243.942856
13092 2316893000 -0.429752713005517 ... 188.914284 243.942856
[13093 rows x 13 columns]
e =
0 1 2
13095 134849000 0.120862187115588 0
13096 135016000 0.171543242833847 0
13097 135183000 0.146335932645973 0
13098 135349000 0.09773669641824 0
13099 135516000 0.0882672298282907 0
... ... ... ..
26183 2316226000 0.349323222511261 0
26184 2316393000 0.359268272664523 0
26185 2316560000 0.346797179431672 0
26186 2316726000 0.291363936474923 0
26187 2316893000 0.256587672540276 0
[13093 rows x 3 columns]
由于两个数据集都有一个公共列(每个标题的第一列),我使用 merge 将底部数据集附加到顶部数据集:
f = pd.merge(d,e, on=[0,0])
结果:
0 1_x ... 1_y 2_y
0 134849000 -0.420384078515376 ... 0.120862187115588 0
1 135016000 -0.406915327374619 ... 0.171543242833847 0
2 135183000 -0.406915327374619 ... 0.146335932645973 0
3 135349000 -0.406915327374619 ... 0.09773669641824 0
4 135516000 -0.406915327374619 ... 0.0882672298282907 0
... ... ... ... ... ..
13088 2316226000 -0.30945361835179 ... 0.349323222511261 0
13089 2316393000 -0.4099956694033 ... 0.359268272664523 0
13090 2316560000 -0.4099956694033 ... 0.346797179431672 0
13091 2316726000 -0.4099956694033 ... 0.291363936474923 0
13092 2316893000 -0.429752713005517 ... 0.256587672540276 0
[13093 rows x 15 columns]
现在我有了可以保存的正确数据集,用.to_csv定义我自己的标题!