Pandas read_csv() and dtype doubts

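I have a set of bulletins in txt format. They contain some bulk data (large headers, footers, etc.) that I am able to "clean" with pandas. I then have to append all the DataFrames into a new DataFrame so that I end up with a single new file, because I need to process about 10 years of data. The code is: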

import glob
import os

import pandas as pd

os.chdir(r'D:\Inves\Catalogs\OSC')
path = os.getcwd()
csv_files = glob.glob(os.path.join(path, "*.txt"))
new_data = []
for f in csv_files:
    df = pd.read_csv(f)
    print('Location File:', f)
    print('File Name:', f.split("\\")[-1])
    # the column names are on line 10 (0-based); fields are whitespace-separated
    df = pd.read_csv(f, header=10, sep=r'\s+')
    n = 2  # number of trailing rows to drop
    df.drop(df.tail(n).index, inplace=True)
    df = df[df.YYYY != '----']  # deleting the '----' row
    print('File Content:')

    print('...Appending...')
    print('...................')
    new_data.append(df)
new_data = pd.concat(new_data, ignore_index=True)
#new_data.dtypes
new_data.to_csv(r'D:\Inves\Catalogs\Full_1988-2008.csv',
                index=False, header=True, sep=',')

The CSV file "Full_1988-2008.csv" is about 10 MB (~173,395 rows), and the data in it looks like this:

YYYY,MM,JJ,HH,MI,SS,STIME,LAT,SLAT,LON,SLON,DEPTH,ML,ORID,RMS,Num,Fase
1988,07,05,03,01,44,.92,-16.420,"8,41",-68.810,"7,56",94.00,1.01,34,",4",6,
1988,07,05,03,45,00,1.70,-16.990,"10,57",-68.910,"10,15",65.00,-1.00,35,"1,12",11,
1988,07,05,04,40,00,.00,-999.000,0,-999.000,0,-999.00,-1.00,36,0,5,
1988,07,05,05,13,12,1.50,-16.600,"5,51",-68.550,"3,64",15.00,1.97,37,",92",10,
1988,07,05,06,25,45,1.21,-16.960,"4,27",-68.520,"5,92",2.00,2.03,38,",74",8,
1988,07,05,07,24,42,2.04,-19.410,"74,58",-68.910,"23,03",160.00,2.78,39,"1,18",8,
1988,07,05,09,03,00,.00,-999.000,0,-999.000,0,-999.00,-1.00,41,0,3,

I need to work with YYYY (year), LAT and LON (coordinates), DEPTH (depth), and ML (magnitude), so I do this:

DF = pd.read_csv(kat, sep=',',
                 usecols=(['YYYY', 'LAT', 'LON', 'DEPTH', 'ML']),
                 dtype={'YYYY': int, 'LAT': float, 'LON': float,
                        'DEPTH': float, 'ML': float})

But I get this error:

  File "pandas\_libs\parsers.pyx", line 1050, in pandas._libs.parsers.TextReader._convert_tokens
TypeError: Cannot cast array data from dtype('O') to dtype('int32') according to the rule 'safe'

During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "<ipython-input-13-b2a95a2d83fd>", line 46, in <module>
    'DEPTH': float, 'ML': float})
  File "C:\Users\Director\anaconda3\envs\obspy\lib\site-packages\pandas\io\parsers.py", line 610, in read_csv
    return _read(filepath_or_buffer, kwds)
  File "C:\Users\Director\anaconda3\envs\obspy\lib\site-packages\pandas\io\parsers.py", line 468, in _read
    return parser.read(nrows)
  File "C:\Users\Director\anaconda3\envs\obspy\lib\site-packages\pandas\io\parsers.py", line 1057, in read
    index, columns, col_dict = self._engine.read(nrows)
  File "C:\Users\Director\anaconda3\envs\obspy\lib\site-packages\pandas\io\parsers.py", line 2061, in read
    data = self._reader.read(nrows)
  File "pandas\_libs\parsers.pyx", line 756, in pandas._libs.parsers.TextReader.read
  File "pandas\_libs\parsers.pyx", line 771, in pandas._libs.parsers.TextReader._read_low_memory
  File "pandas\_libs\parsers.pyx", line 850, in pandas._libs.parsers.TextReader._read_rows
  File "pandas\_libs\parsers.pyx", line 982, in pandas._libs.parsers.TextReader._convert_column_data
  File "pandas\_libs\parsers.pyx", line 1056, in pandas._libs.parsers.TextReader._convert_tokens
ValueError: invalid literal for int() with base 10: 'YYYY'

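As I understand it, the header YYYY, LAT, LON, DEPTH, ML has become part of the data, so it cannot be cast to int or float. However, if I skip the header, I cannot get the data I need, because the header then becomes 1998, -16.65, -66.65, 12, 3.2.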
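One way this error could arise is sketched below, assuming a header line is repeated somewhere inside the combined CSV (an assumption for illustration, not something the traceback alone confirms). The sample text and the name `error_message` are hypothetical; only the column names come from the question.

```python
import io

import pandas as pd

# Hypothetical sample: a combined CSV in which the header line appears
# a second time in the middle of the data (assumed cause of the error).
sample = io.StringIO(
    "YYYY,LAT,LON,DEPTH,ML\n"
    "1988,-16.420,-68.810,94.00,1.01\n"
    "YYYY,LAT,LON,DEPTH,ML\n"
    "1989,-16.990,-68.910,65.00,-1.00\n"
)

try:
    # Forcing an integer dtype fails as soon as the parser meets the text 'YYYY'
    pd.read_csv(sample, dtype={"YYYY": int})
    error_message = None
except (ValueError, TypeError) as exc:
    error_message = str(exc)

print(error_message)
```

If this is what is happening, skipping the first header would not help, since the repeated header lines further down are the ones that break the cast.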

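Does anyone have any pointers for improving the way I process my data? I have attached two complete files in case you want to reproduce my error.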

https://drive.google.com/drive/folders/18xrDC7vqEm_pY3D2sxwou3dlBdkZ6nHF?usp=sharing

Your code works fine with the two files 1988.txt and 1989.txt. To debug, I suggest removing the casts from read_csv:

DF = pd.read_csv(kat, sep=',', usecols=(['YYYY', 'LAT', 'LON', 'DEPTH', 'ML']))

Now, check the values of the YYYY column:

new_data['YYYY'].unique()

You will probably see 'YYYY' among the values. To find those rows:

new_data[new_data['YYYY'] == 'YYYY']
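The debugging steps above can be tried on a small in-memory example. This is only a sketch: the sample text and the names `df` and `stray` are hypothetical, and it assumes the problem is a header line repeated inside the data.

```python
import io

import pandas as pd

# Hypothetical sample with a repeated header line in the middle
sample = io.StringIO(
    "YYYY,LAT,LON,DEPTH,ML\n"
    "1988,-16.420,-68.810,94.00,1.01\n"
    "YYYY,LAT,LON,DEPTH,ML\n"
    "1989,-16.990,-68.910,65.00,-1.00\n"
)

# No dtype is given, so a column that contains text is loaded as strings
df = pd.read_csv(sample, usecols=["YYYY", "LAT", "LON", "DEPTH", "ML"])

# Any non-numeric value stands out in the list of unique values
print(df["YYYY"].unique())

# Select the rows where the header text leaked into the data
stray = df[df["YYYY"] == "YYYY"]
print(stray.index.tolist())
```

The index of each stray row shows where in the combined file a header was repeated, which may help trace it back to the source bulletin.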

Following your suggestion, I added the following line to my code:

new_data_d= new_data[new_data.YYYY.str.contains('YYYY') == False]

After that, the YYYY, SS… rows were removed. The final code looks like:

import glob
import os

import pandas as pd

os.chdir(r'D:\Inves\Catalogs\OSC')
path = os.getcwd()
csv_files = glob.glob(os.path.join(path, "*.txt"))
new_data = []
for f in csv_files:
    df = pd.read_csv(f)
    print('Location File:', f)
    print('File Name:', f.split("\\")[-1])
    # the column names are on line 10 (0-based); fields are whitespace-separated
    df = pd.read_csv(f, header=10, sep=r'\s+')
    n = 2  # number of trailing rows to drop
    df.drop(df.tail(n).index, inplace=True)
    df = df[df.YYYY != '----']  # deleting the '----' row
    print('File Content:')

    print('...Appending...')
    print('...................')
    new_data.append(df)
new_data = pd.concat(new_data, ignore_index=True)
# drop the rows where a repeated header ended up in the data
new_data_d = new_data[new_data.YYYY.str.contains('YYYY') == False]
new_data_d.to_csv(r'D:\Inves\Catalogs\Full_1988-2008.csv',
                  index=False, header=True, sep=',')
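As a possible alternative to filtering on the text 'YYYY', the columns could be converted in memory with `pd.to_numeric(..., errors="coerce")`, which turns anything non-numeric into NaN so those rows can be dropped. This is a sketch rather than the approach used in the thread; the small `new_data` frame below is a hypothetical stand-in for the concatenated data, and `cols`, `numeric`, and `clean` are illustrative names.

```python
import pandas as pd

# Hypothetical stand-in for the concatenated data, all values read as text
new_data = pd.DataFrame({
    "YYYY": ["1988", "YYYY", "1989"],
    "LAT": ["-16.420", "LAT", "-16.990"],
    "LON": ["-68.810", "LON", "-68.910"],
    "DEPTH": ["94.00", "DEPTH", "65.00"],
    "ML": ["1.01", "ML", "-1.00"],
})

cols = ["YYYY", "LAT", "LON", "DEPTH", "ML"]

# Convert every column; values that are not numbers become NaN
numeric = new_data[cols].apply(pd.to_numeric, errors="coerce")

# Drop the rows whose year could not be parsed, then store the year as int
clean = numeric.dropna(subset=["YYYY"]).astype({"YYYY": int})

print(clean.dtypes)
```

Note that `errors="coerce"` silently turns any unparsable value into NaN, not only repeated headers, so it is worth counting the dropped rows before relying on the result.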
