读取文件中间带有新标头的数据文件

我有一个.txt数据文件，其中包含许多具有不同标题的列。我可以读取包含所有列和行的文件。但是，我的问题是该文件包含一个带有三列的附加标题，附加到初始标题数据的最后一行。如何将最后三列与第一列分开？此外，我想删除三列中的第一列，因为它是第一列的副本，并将其他两列逐列附加到文件顶部的列中。我用熊猫像这样阅读文件：

c = pd.read_csv('C:filepath.txt', sep=',',header=None,names=['<Title1>','<Title2>','<Title3>','<Title4>','<Title5>','<Title6>','<Title7>','<Title8>','<Title9>','<Title10>','<Title11>','<Title12>'],skiprows=[0,1])

结果是：

<Title1>  ... <Title12>
134849000   -0.420384078515376  ...    244.507248
135016000   -0.406915327374619  ...    244.507248
135183000   -0.406915327374619  ...    244.507248
135349000   -0.406915327374619  ...    244.507248
135516000   -0.406915327374619  ...    244.507248
...                        ...  ...           ... <-- (somewhere in here there is a new header with three columns)
2316226000   0.349323222511261  ...           NaN
2316393000   0.359268272664523  ...           NaN
2316560000   0.346797179431672  ...           NaN
2316726000   0.291363936474923  ...           NaN
2316893000   0.256587672540276  ...           NaN
[26188 rows x 12 columns]

可以看出，数据集的"4.th 困境"(或第 4-12 列，第 x 行，1 索引)包含 NaN 值，因为三列已附加到第一个标题的最后一行，因此它们保留空值，因为文件包含从顶部开始的 12 列。此外，两个标题都有两行，其中不需要第一行，所以我需要跳过这些行。

示例文件：

<Header1>
<Title1><Title2><Title3><Title4><Title5><Title6><Title7><Title8><Title9><Title10><Title11><Title12><Title13>
134849000,-0.420384078515376,-0.46532291072594,53.3941583535493,3.94861381238115,0.999999938482075,-0.000223083188831434,-0.000166347560402173,3.08661080398315E-06,304.11793518,274.23748016,189.97101594,244.50724792
135016000,-0.406915327374619,-0.433547012456629,53.3941583535493,3.94861381238115,0.999999910346576,-0.000180534505822662,-0.000206991530844074,2.40981161937076E-06,304.0821228,274.15297698,189.97101594,244.50724792
135183000,-0.406915327374619,-0.433547012456629,53.3941583535493,3.94861381238115,0.999999992511006,-0.000151940021895918,-0.000103313480817761,1.89050478219266E-06,304.0821228,274.15297698,189.97101594,244.50724792
135349000,-0.406915327374619,-0.433547012456629,53.3941583535493,3.94861381238115,0.999999945135159,-0.000162536174319313,-7.40562207892995E-05,2.04948428941809E-06,304.0821228,274.15297698,189.97101594,244.50724792
135516000,-0.406915327374619,-0.433547012456629,53.3941583535493,3.94861381238115,0.99999997640256,-0.000243086633501367,-6.9024988784798E-05,3.36047709420528E-06,304.0821228,274.15297698,189.97101594,244.50724792
135683000,-0.406915327374619,-0.433547012456629,53.3941583535493,3.94861381238115,0.99999997640256,-0.000243086633501367,-6.9024988784798E-05,3.36047709420528E-06,304.0821228,274.15297698,189.97101594,244.50724792
135849000,-0.406915327374619,-0.433547012456629,53.3941583535493,3.94861381238115,0.999999931122814,-0.000250245794219842,-0.000134729677676283,3.5093405085021E-06,304.0821228,274.15297698,189.97101594,244.50724792
136016000,-0.406915327374619,-0.433547012456629,53.3941583535493,3.94861381238115,0.999999952747184,-0.000248275760427849,-0.000209879516698194,3.49816745295883E-06,304.0821228,274.15297698,189.97101594,244.50724792
136183000,-0.420384078515376,-0.46532291072594,53.3941583535493,3.94861381238115,0.99999992607031,-0.000294028840627048,-0.000210060717325711,4.25711234103981E-06,304.11793518,274.23748016,189.97101594,244.50724792
136349000,-0.420916391233475,-0.442738942185795,53.3941583535493,3.94861381238115,0.999999919180309,-0.00029795985581717,-0.000124844955889991,4.29227691325224E-06,304.11935424,274.17742156,189.97101594,244.50724792
136516000,-0.420384078515376,-0.46532291072594,53.3941583535493,3.94861381238115,0.999999888009148,-0.000316878274912839,-3.29402653026431E-05,4.57532859246546E-06,304.11793518,274.23748016,189.97101594,244.50724792
136683000,-0.420916391233475,-0.442738942185795,53.3941583535493,3.94861381238115,0.999999944701863,-0.000302288971167524,-0.000119271820769005,4.36801259359743E-06,304.11935424,274.17742156,189.97101594,244.50724792
136849000,-0.405802775661793,-0.444669714471277,53.3941583535493,3.94861381238115,0.999999944701863,-0.000302288971167524,-0.000119271820769005,4.36801259359743E-06,304.0791626,274.18255616,189.97101594,244.50724792
137016000,-0.420916391233475,-0.442738942185795,53.3941583535493,3.94861381238115,0.99999991055272,-0.00029252456348538,-0.000168782643050744,4.22385527217017E-06,304.11935424,274.17742156,189.97101594,244.50724792
137183000,-0.412309946883439,-0.450987020223235,53.3941583535493,3.94861381238115,0.999999942521442,-0.000255490185269549,-0.00024667166566595,3.6414759449141E-06,304.09646606,274.19935608,189.97101594,244.50724792
137349000,-0.406915327374619,-0.433547012456629,53.3941583535493,3.94861381238115,0.999999876479583,-0.000264577733448331,-0.000298287883815869,3.80576077658318E-06,304.0821228,274.15297698,189.97101594,244.50724792
137516000,-0.406915327374619,-0.433547012456629,53.3941583535493,3.94861381238115,0.999999903983449,-0.000251750438760731,-0.000355224963982992,3.60887866227011E-06,304.0821228,274.15297698,189.97101594,244.50724792
137683000,-0.391801749871831,-0.435460567656641,53.3941583535493,3.94861381238115,0.999999885967664,-0.000231035684436353,-0.000293282668086245,3.24666448882349E-06,304.04193116,274.1580658,189.97101594,244.50724792
137849000,-0.406915327374619,-0.433547012456629,53.3941583535493,3.94861381238115,0.999999885967664,-0.000231035684436353,-0.000293282668086245,3.24666448882349E-06,304.0821228,274.15297698,189.97101594,244.50724792
<Header2>
<Title13(same as Title 1)><Title14><Title15>
134849000,0.120862187115588,0
135016000,0.171543242833847,0
135183000,0.146335932645973,0
135349000,0.09773669641824,0
135516000,0.0882672298282907,0
135683000,0.124406962864472,0
135849000,0.186013875486258,0
136016000,0.219045896500945,0
136183000,0.197246332120462,0
136349000,0.150083583561413,0
136516000,0.0838562129822536,0
136683000,0.00269632558524612,0
136849000,-0.0447052988191479,0
137016000,-0.00496292706410619,0
137183000,0.0799457149607322,0
137349000,0.137388731956788,0
137516000,0.142305654943302,0
137683000,0.115943857754048,0
137849000,0.0991913228381935,0

这里有两个解决方案，第一种方法生成一个新文件，第二种方法在read_csv操作期间修复标头。如果文件将被多次处理，则可以使用第一个，但至少需要读取所有行两次。如果只需要读取一次多个大文件，则首选第二种方法。

解决方案 1：预处理文件

解析一次文件以删除额外的标头。

# create second file with unique header
with open('file.csv', 'r') as f_in, open('file_single_header.csv', 'w') as f_out:
header = f_in.readlines(1)[0]
f_out.write(header)
for line in f_in.readlines():
if line != header:
f_out.write(line)
# then read corrected csv file
pd.read_csv('file_single_header.csv')

解决方案 2：将标题行视为注释并手动分配标题

# read first line of the file to get header and split names
import re
with open('file.csv', 'r') as f:
header = re.split('s+', f.readlines(1)[0].strip())
# exclude header lines and assign names manually
pd.read_csv('file.csv', comment='<', names= header)

注意。从您的示例中不清楚您是否真的有逗号分隔的值，另一个分隔符。如果这是空间，则需要按如下方式调整read_csv。此外，如果索引位于 csv 文件中，则需要为选项 2 添加一个名称(此处为"无")。

# option 1
pd.read_csv('file_fixed_header.csv', sep='s+')
# option 2
pd.read_csv('file.csv',
comment='<',
names=[None] + header, # added None for index
sep='s+',
index_col=0
).dropna(axis=0, how='all')

解决方案3：修复错误的CSV文件

with open('file.csv', 'r') as f_in, open('file_single_header.csv', 'w') as f_out:
i = 0
for line in f_in.readlines():
if line.strip().startswith('<'):
if i == 1:
f_out.write(','.join(line.strip('<>n').split('><'))+'n')
i+=1
else:
f_out.write(line)

一个可能的解决方案：

临时更改索引

c.reset_index(inplace=True)

在第二个标题中找到新列的行

newcols = c.iloc[c[c.iloc[:, 1].isna()].index.min() + 2:, [1, 2]].reset_index(drop=True)

重命名新列

newcols.rename(columns={'<Title1>' : '<Title14>', '<Title2>' : '<Title15>'}, inplace=True)

添加新列，删除第二个标题中的行，并还原原始索引

c = pd.concat([c, newcols], axis=1).dropna().set_index('index')

我设法为这个特定的数据集找到了一个相对强大和简单的解决方案。

读取数据并跳过第一个标头后：

raw_data = pd.read_csv('C:datafile.txt', sep=',',header=None, skiprows=[0,1])

我检查第一列中的非数值，以找出下一个标题的位置：

a = pd.to_numeric(pd.to_numeric(raw_data[0], errors='coerce').isnull())

结果：

0        False
1        False
2        False
3        False
4        False
...  
26183    False
26184    False
26185    False
26186    False
26187    False
Name: 0, Length: 26188, dtype: bool

然后我找到语句为真的索引：

a = np.where(a)[0]

结果：

[13093 13094]

从这里，我可以使用索引简单地索引两个标头的数据：

d = raw_data.iloc[:raw_data.index.get_loc(a[0])]
e = raw_data.iloc[raw_data.index.get_loc(a[0])+2:, :3]

在 e 中，我还确保索引列，因为我们只有三列用于第二个标题

结果：

d =

0                   1   ...          11          12
0       134849000  -0.420384078515376  ...  189.971016  244.507248
1       135016000  -0.406915327374619  ...  189.971016  244.507248
2       135183000  -0.406915327374619  ...  189.971016  244.507248
3       135349000  -0.406915327374619  ...  189.971016  244.507248
4       135516000  -0.406915327374619  ...  189.971016  244.507248
...           ...                 ...  ...         ...         ...
13088  2316226000   -0.30945361835179  ...  188.914284  243.942856
13089  2316393000    -0.4099956694033  ...  188.914284  243.942856
13090  2316560000    -0.4099956694033  ...  188.914284  243.942856
13091  2316726000    -0.4099956694033  ...  188.914284  243.942856
13092  2316893000  -0.429752713005517  ...  188.914284  243.942856
[13093 rows x 13 columns]

e =

0                   1  2
13095   134849000   0.120862187115588  0
13096   135016000   0.171543242833847  0
13097   135183000   0.146335932645973  0
13098   135349000    0.09773669641824  0
13099   135516000  0.0882672298282907  0
...           ...                 ... ..
26183  2316226000   0.349323222511261  0
26184  2316393000   0.359268272664523  0
26185  2316560000   0.346797179431672  0
26186  2316726000   0.291363936474923  0
26187  2316893000   0.256587672540276  0
[13093 rows x 3 columns]

由于两个数据集都有一个公共列(每个标题的第一列)，我使用 merge 将底部数据集附加到顶部数据集：

f = pd.merge(d,e, on=[0,0])

结果：

0                 1_x  ...                 1_y 2_y
0       134849000  -0.420384078515376  ...   0.120862187115588   0
1       135016000  -0.406915327374619  ...   0.171543242833847   0
2       135183000  -0.406915327374619  ...   0.146335932645973   0
3       135349000  -0.406915327374619  ...    0.09773669641824   0
4       135516000  -0.406915327374619  ...  0.0882672298282907   0
...           ...                 ...  ...                 ...  ..
13088  2316226000   -0.30945361835179  ...   0.349323222511261   0
13089  2316393000    -0.4099956694033  ...   0.359268272664523   0
13090  2316560000    -0.4099956694033  ...   0.346797179431672   0
13091  2316726000    -0.4099956694033  ...   0.291363936474923   0
13092  2316893000  -0.429752713005517  ...   0.256587672540276   0
[13093 rows x 15 columns]

现在我有了可以保存的正确数据集，用.to_csv定义我自己的标题！

解决方案 1：预处理文件

解决方案 2：将标题行视为注释并手动分配标题

解决方案3：修复错误的CSV文件

一个可能的解决方案：

相关内容

最新更新

热门标签：