How to loop through text files, read the first n lines, and write them to a list or DataFrame



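I am working with roughly 3,700 text files and trying to get a better sense of what each one contains. Some files share an identical layout and differ only by quarter; others are different. My plan is to loop over the files, open them one at a time, and write the first 3, 4, or 5 lines of each to a list, so I can see which files share the same pattern. Here is the code I have put together.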

import pandas as pd
import csv
import glob
import os
results = pd.DataFrame([])
filelist = glob.glob(r"C:\Users\ryans\Downloads\*.txt")
number_of_lines = 3
for filename in filelist:
    for i in range(number_of_lines):
        print(filename)
        namedf = pd.read_csv(filename, skiprows=0, index_col=0)
        results = results.append(namedf)

Here is the full stack trace.

results = results.append(namedf)
C:\Users\ryans\Downloads\FFIEC CDR Call Bulk POR 03312001.txt
Traceback (most recent call last):
  File "<ipython-input-14-64ec4bc99b05>", line 12, in <module>
    namedf = pd.read_csv(filename, skiprows=0, index_col=0)
  File "C:\Users\ryans\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 686, in read_csv
    return _read(filepath_or_buffer, kwds)
  File "C:\Users\ryans\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 458, in _read
    data = parser.read(nrows)
  File "C:\Users\ryans\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 1196, in read
    ret = self._engine.read(nrows)
  File "C:\Users\ryans\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 2155, in read
    data = self._reader.read(nrows)
  File "pandas\_libs\parsers.pyx", line 847, in pandas._libs.parsers.TextReader.read
  File "pandas\_libs\parsers.pyx", line 862, in pandas._libs.parsers.TextReader._read_low_memory
  File "pandas\_libs\parsers.pyx", line 918, in pandas._libs.parsers.TextReader._read_rows
  File "pandas\_libs\parsers.pyx", line 905, in pandas._libs.parsers.TextReader._tokenize_rows
  File "pandas\_libs\parsers.pyx", line 2042, in pandas._libs.parsers.raise_parser_error
ParserError: Error tokenizing data. C error: Expected 1 fields in line 3, saw 2

How can I get this code to work? Also, is this the best approach, or is there a better way to handle these 3,700 text files?

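  1. If you need at most 5 lines per file, there is no point in reading each file in full.

    Use the nrows=5 option of pd.read_csv().

  2. The exception occurs because the code assumes every file is a valid CSV file. In your case, the failing file produced:

"Expected 1 fields in line 3, saw 2"

In other words, pandas inferred a single column from the start of the file and then found two fields on line 3. You should handle these cases with try/except and keep a list of the files that are not valid CSV.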
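The two points above can be combined into one loop. The following is only a sketch: the helper name read_heads is hypothetical, and the set of exceptions caught is an assumption about what these particular files might raise (a parse error, an empty file, or a decoding problem).

```python
import glob

import pandas as pd


def read_heads(paths, nrows=5):
    """Read at most `nrows` data rows from each file.

    Returns (heads, invalid): a dict mapping path -> DataFrame, and a list
    of (path, error message) pairs for files pandas could not parse.
    `read_heads` is a hypothetical helper name, not part of pandas.
    """
    heads, invalid = {}, []
    for path in paths:
        try:
            # nrows stops pandas from reading the rest of the file
            heads[path] = pd.read_csv(path, nrows=nrows)
        except (pd.errors.ParserError,
                pd.errors.EmptyDataError,
                UnicodeDecodeError) as exc:
            # Remember the failure and carry on with the next file
            invalid.append((path, str(exc)))
    return heads, invalid


if __name__ == "__main__":
    heads, invalid = read_heads(glob.glob(r"C:\Users\ryans\Downloads\*.txt"))
    print(len(heads), "parsed;", len(invalid), "failed")
```

The invalid list can then be inspected separately, for example to check whether those files use a different delimiter.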

import glob
results = []
filelist = glob.glob(r"C:\Users\ryans\Downloads\*.txt")
number_of_lines = 3
for filename in filelist:
    print(filename)
    # Open for reading ("r"); mode "w" would truncate the file
    with open(filename, "r") as f:
        lines = f.readlines()
    # Keep only the first few lines of each file
    results.append(lines[:number_of_lines])
print(results)
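Calling readlines() loads the whole file even though only the first few lines are kept. As a possible alternative, itertools.islice stops reading after n lines, and the leading lines can then serve as a key for grouping files that share the same pattern. This is a sketch only: the helper names first_lines and group_by_head are hypothetical, and the UTF-8 encoding with errors="replace" is an assumption about the files.

```python
from collections import defaultdict
from itertools import islice


def first_lines(path, n=3, encoding="utf-8"):
    """Return up to n lines from the start of path without reading the rest."""
    # errors="replace" is an assumption so that odd bytes do not abort the scan
    with open(path, "r", encoding=encoding, errors="replace") as f:
        return [line.rstrip("\r\n") for line in islice(f, n)]


def group_by_head(paths, n=1):
    """Map each distinct tuple of leading lines to the files that start with it."""
    groups = defaultdict(list)
    for path in paths:
        groups[tuple(first_lines(path, n))].append(path)
    return dict(groups)
```

With n=1 the grouping key is just the first line, which works if that line is a header row; files whose data starts on line 1 would each land in their own group, so a quick look at the group sizes shows which case applies.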
