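I have multiple CSV files (4000) in one folder. Each CSV file contains data like the sample below. The length of the data, the number of header rows, and the number of headers in each header row can differ from file to file. Each file holds several tables with their own headers, and every header row starts with the same column, "a". I want to get the table whose header contains "apple", together with its values.
Input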
a b c d e f g h i
1 2 3 4 5 6 7 8 9
a b1 c1 d1 e1 f1 g1
1 2 3 4 5 6 7
a b2 c2 d2 e2 f2 g2 h2 i2 k2 l2
3 5 6 7 3 4 5 6 7 7 0
a b3 d3 e3 g23 t4 apple r4 w2 r5 t6
1 2 3 4 5 6 7 8 9 1 1 2
1 2 3 4 5 6 7 8 9 10 1 2
1 2 3 4 5 6 7 8 9 11 1 2
1 2 3 4 5 6 7 8 9 12 1 2
1 2 3 4 5 6 7 8 9 13 1 2
1 2 3 4 5 6 7 8 9 14 1 2
1 2 3 4 5 6 7 8 9 15 1 2
1 2 3 4 5 6 7 8 9 16 1 2
1 2 3 4 5 6 7 8 9 17 1 2
1 2 3 4 5 6 7 8 9 18 1 2
a b c d e f g h i
1 2 3 4 5 6 7 8 9
Final output
a b3 d3 e3 g23 t4 apple r4 w2 r5 t6
1 2 3 4 5 6 7 8 9 1 1 2
1 2 3 4 5 6 7 8 9 10 1 2
1 2 3 4 5 6 7 8 9 11 1 2
1 2 3 4 5 6 7 8 9 12 1 2
1 2 3 4 5 6 7 8 9 13 1 2
1 2 3 4 5 6 7 8 9 14 1 2
1 2 3 4 5 6 7 8 9 15 1 2
1 2 3 4 5 6 7 8 9 16 1 2
1 2 3 4 5 6 7 8 9 17 1 2
1 2 3 4 5 6 7 8 9 18 1 2
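OK, from what I understand, you have to walk through every row of every file by hand until you find a row whose first column is exactly 'a' and which also contains the column 'apple'. At that point you know you have the right header, so you start storing that row and the value rows that follow it. The next time you see a row whose first column is exactly 'a', you know you have reached a new header and the table you want has ended.
pandas probably can't do this directly, so you'll have to do some manual string handling first.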
import pandas as pd
from io import StringIO

buffer = ''
with open('filename') as f:
    found_apple = False
    for row in f:
        # if a row starts with 'a,' it's a header row
        has_a = row.startswith('a,')
        if found_apple:
            # if the row is a header row, we're done with the table and should wrap up
            if has_a:
                break
            # else it's a row that should be part of our output, so store it in a buffer
            buffer += row  # row already ends with '\n'
        elif not has_a:
            # we aren't ready to look at values, and this row isn't a header row, so skip it
            continue
        elif 'apple' in row:
            # you might have to tweak this if there are headers that *contain* 'apple' but aren't the header you're looking for
            # we've found the start of the table we want, we're ready to start storing the value rows
            found_apple = True
            buffer += row

# buffer will be the table you want, as a string
# example:
# """a,apple
# 1,2"""
# if that's all you need, you can simply output buffer
# if you wanted to do other pandas stuff with that table, you can now pass buffer to pandas
table = pd.read_csv(StringIO(buffer))
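The "you might have to tweak this" case, where a header such as 'pineapple' merely contains 'apple', can be handled by splitting the header row into cells and comparing whole cells. Below is a minimal sketch of that idea, assuming comma-separated files; the function name extract_table and its parameters are made up for illustration and are not part of pandas or any other library.

```python
def extract_table(lines, column="apple", first_col="a", sep=","):
    """Return the first table whose header row has a cell equal to `column`.

    `lines` is any iterable of text lines (an open file works).
    A header row is assumed to be any row whose first cell equals `first_col`.
    Returns an empty string when no matching table is found.
    """
    kept = []
    for line in lines:
        cells = [cell.strip() for cell in line.rstrip("\n").split(sep)]
        is_header = cells[0] == first_col
        if kept:
            if is_header:
                break  # the next table starts here, so ours is complete
            kept.append(line)  # value row of the table we want
        elif is_header and column in cells:
            kept.append(line)  # exact cell match, so 'pineapple' is ignored
    return "".join(kept)
```

It could be called as extract_table(f) inside the with block, and the returned string can be passed to pd.read_csv(StringIO(...)) in the same way as buffer.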
Let me know if anything is unclear.
Edit: to loop over every file in a directory, just wrap the with block in another loop:
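import os

buffer = ''
for filename in os.listdir():
    # skip subdirectories and anything else that isn't a regular file
    if not os.path.isfile(filename):
        continue
    with open(filename) as f:
        ...  # same row-scanning loop as in the single-file version
    # compare by value; `is not ''` tests identity and is unreliable for strings
    if buffer != '':
        break  # stop at the first file that produced a table

# buffer will be the table you want, as a string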
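The loop above stops at the first file that yields a table. If a table is wanted from each of the files instead, one option is to collect the results in a dict keyed by file name. This is only a sketch under that assumption: collect_tables is a hypothetical helper, and extract stands for whatever callable wraps the row-scanning logic (it receives an open file and returns the table as a string, or '' if nothing was found).

```python
from pathlib import Path


def collect_tables(folder, extract, pattern="*.csv"):
    """Map each matching file name to the table text returned by `extract`.

    Files for which `extract` returns an empty string are left out.
    """
    tables = {}
    for path in sorted(Path(folder).glob(pattern)):
        if not path.is_file():
            continue  # skip directories that happen to match the pattern
        with path.open() as f:
            text = extract(f)
        if text:
            tables[path.name] = text
    return tables
```

Each value in the returned dict can then be handed to pd.read_csv(StringIO(text)) separately, which avoids mixing rows from different files in one buffer.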