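I have multiple CSV files. Each CSV file contains multiple tables with multiple headers. How do I get the table whose header contains a given value?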



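I have multiple CSV files (4000) in a folder. Each CSV file contains data like the sample below. The length of the data, the number of header rows, and the number of headers in each header row can differ from file to file. Each file holds several tables with their own headers, and every table's header starts with the same column, "a". I want to get the table whose header contains "apple", together with its values.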

Input


a   b   c   d   e   f   g   h   i           
1   2   3   4   5   6   7   8   9           
a   b1  c1  d1  e1  f1  g1                  
1   2   3   4   5   6   7                   
a   b2  c2  d2  e2  f2  g2  h2  i2  k2  l2  
3   5   6   7   3   4   5   6   7   7   0   
a   b3  d3  e3  g23 t4  apple   r4  w2  r5  t6  
1   2   3   4   5   6   7   8   9   1   1   2
1   2   3   4   5   6   7   8   9   10  1   2
1   2   3   4   5   6   7   8   9   11  1   2
1   2   3   4   5   6   7   8   9   12  1   2
1   2   3   4   5   6   7   8   9   13  1   2
1   2   3   4   5   6   7   8   9   14  1   2
1   2   3   4   5   6   7   8   9   15  1   2
1   2   3   4   5   6   7   8   9   16  1   2
1   2   3   4   5   6   7   8   9   17  1   2
1   2   3   4   5   6   7   8   9   18  1   2
a   b   c   d   e   f   g   h   i           
1   2   3   4   5   6   7   8   9           

Expected output

a   b3  d3  e3  g23 t4  apple   r4  w2  r5  t6
1   2   3   4   5   6   7   8   9   1   1   2
1   2   3   4   5   6   7   8   9   10  1   2
1   2   3   4   5   6   7   8   9   11  1   2
1   2   3   4   5   6   7   8   9   12  1   2
1   2   3   4   5   6   7   8   9   13  1   2
1   2   3   4   5   6   7   8   9   14  1   2
1   2   3   4   5   6   7   8   9   15  1   2
1   2   3   4   5   6   7   8   9   16  1   2
1   2   3   4   5   6   7   8   9   17  1   2
1   2   3   4   5   6   7   8   9   18  1   2

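OK, from what I gather, you have to manually go through each row of each file until you find a row whose first column is just a and that also contains a column apple. At that point you know you have the right header, so you start storing that row and the value rows that follow it. The next time you see a row whose first column is just a, you know you have reached a new header and the table has ended.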

pandas probably can't do this directly, so you will have to do some manual string handling.

buffer = ''
with open('filename') as f:
    found_apple = False
    for row in f:
        # if a row starts with 'a,' it's a header row
        has_a = row.startswith('a,')
        if found_apple:
            # if the row is a header row, we're done with the table and should wrap up
            if has_a:
                break
            # else it's a row that should be part of our output, so store it in a buffer
            buffer += row  # row will already have the \n
        elif not has_a:
            # we aren't ready to look at values, and this row isn't a header row, so skip it
            continue
        elif 'apple' in row:
            # you might have to tweak this if there are headers that *contain* 'apple' but aren't the header you're looking for
            # we've found the start of the table we want, we're ready to start storing the value rows
            found_apple = True
            buffer += row
# buffer will be the table you want, as a string
# example:
# """a,apple
# 1,2"""
# if that's all you need, you can simply output buffer
# if you wanted to do other pandas stuff with that table, you can now pass buffer to pandas
import pandas as pd
from io import StringIO
table = pd.read_csv(StringIO(buffer))
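
The substring check 'apple' in row also matches headers such as pineapple. A minimal sketch of a stricter variant follows, assuming the files are comma-separated and that a header row is any row whose first cell is exactly a. The function name extract_table and its parameters are hypothetical, chosen for illustration, and it uses the standard csv module so that header cells are compared whole.

```python
import csv
from io import StringIO


def extract_table(text, key="apple", first_col="a"):
    """Return the rows (header first) of the first table whose header
    has a cell exactly equal to `key`. Returns [] if no table matches.

    Assumption: a header row is any row whose first cell equals `first_col`.
    """
    rows = []
    collecting = False
    for row in csv.reader(StringIO(text)):
        cells = [c.strip() for c in row]
        if not cells:
            continue  # skip blank lines
        is_header = cells[0] == first_col
        if collecting:
            if is_header:
                break  # the next table starts here, so ours is complete
            rows.append(cells)
        elif is_header and key in cells:
            # exact cell match, so 'pineapple' does not count as 'apple'
            collecting = True
            rows.append(cells)
    return rows


sample = "a,b,c\n1,2,3\na,b3,apple\n4,5,6\n7,8,9\na,b,c\n1,2,3\n"
table = extract_table(sample)
```

Here table holds the apple header followed by its two value rows, as lists of strings. Comparing whole cells is a design choice that trades a little speed for fewer false matches.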

Let me know if anything is unclear.

Edit: to loop over every file in a directory, just wrap the with block in another loop.

import os
buffer = ''
for filename in os.listdir():
    if not os.path.isfile(filename):
        continue
    with open(filename) as f:
        ...
    # stop at the first file that produced a table
    if buffer != '':
        break
# buffer will be the table you want, as a string
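
The loop above stops at the first file that yields a table. With 4000 files you may instead want the matching table from every file. The following is a rough sketch of that variation, not the approach from the answer itself. The helper names table_from_file and tables_in_folder are hypothetical, and it assumes the files are comma-separated and end in .csv.

```python
from pathlib import Path


def table_from_file(path, key="apple", header_prefix="a,"):
    """Return the first table in `path` whose header has a cell equal to
    `key`, as a string, or '' if the file has no such table."""
    buffer = ""
    with open(path) as f:
        for row in f:
            is_header = row.startswith(header_prefix)
            if buffer:
                if is_header:
                    break  # next table begins, so ours is complete
                buffer += row  # row already ends with a newline
            elif is_header and key in row.rstrip("\r\n").split(","):
                buffer = row  # found the header we want
    return buffer


def tables_in_folder(folder, key="apple"):
    """Map each .csv file name in `folder` to its matching table.
    Files without a matching table are left out."""
    results = {}
    for path in sorted(Path(folder).glob("*.csv")):
        table = table_from_file(path, key)
        if table:
            results[path.name] = table
    return results
```

Each value in the returned dict is a CSV string, so it can be written out directly or passed to pd.read_csv(StringIO(...)) as shown earlier.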
