I have a set of bulletins in txt format. These files contain some large blocks of extra data (big headers, footers, etc.) that I am able to "clean" with pandas. I then have to append all the DataFrames into a single new DataFrame so that I end up with one file, since I need to process about 10 years of data. The code is:
import os
import glob
import pandas as pd

os.chdir(r'D:\Inves\Catalogs\OSC')
path = os.getcwd()
csv_files = glob.glob(os.path.join(path, "*.txt"))

new_data = []
for f in csv_files:
    print('Location File:', f)
    print('File Name:', f.split("\\")[-1])
    # skip the 10-line header block; columns are whitespace-separated
    df = pd.read_csv(f, header=10, sep=r'\s+')
    # drop the last n footer lines
    n = 2
    df.drop(df.tail(n).index, inplace=True)
    df = df[df.YYYY != '----']  # deleting the '----' row
    print('...Appending...')
    new_data.append(df)

new_data = pd.concat(new_data, ignore_index=True)
#new_data.dtypes
new_data.to_csv(r'D:\Inves\Catalogs\Full_1988-2008.csv',
                index=False, header=True, sep=',')
The resulting CSV file Full_1988-2008.csv is about 10 MB (~173395 rows), and the data in the file looks like this:
YYYY,MM,JJ,HH,MI,SS,STIME,LAT,SLAT,LON,SLON,DEPTH,ML,ORID,RMS,Num,Fase
1988,07,05,03,01,44,.92,-16.420,"8,41",-68.810,"7,56",94.00,1.01,34,",4",6,
1988,07,05,03,45,00,1.70,-16.990,"10,57",-68.910,"10,15",65.00,-1.00,35,"1,12",11,
1988,07,05,04,40,00,.00,-999.000,0,-999.000,0,-999.00,-1.00,36,0,5,
1988,07,05,05,13,12,1.50,-16.600,"5,51",-68.550,"3,64",15.00,1.97,37,",92",10,
1988,07,05,06,25,45,1.21,-16.960,"4,27",-68.520,"5,92",2.00,2.03,38,",74",8,
1988,07,05,07,24,42,2.04,-19.410,"74,58",-68.910,"23,03",160.00,2.78,39,"1,18",8,
1988,07,05,09,03,00,.00,-999.000,0,-999.000,0,-999.00,-1.00,41,0,3,
From it I need YYYY (year), LAT & LON (coordinates), DEPTH and ML (magnitude), so I do:
DF = pd.read_csv(kat, sep=',',
usecols=(['YYYY', 'LAT', 'LON', 'DEPTH', 'ML']),
dtype={'YYYY': int, 'LAT': float, 'LON': float,
'DEPTH': float, 'ML': float})
But I get this error:
File "pandas_libsparsers.pyx", line 1050, in pandas._libs.parsers.TextReader._convert_tokens
TypeError: Cannot cast array data from dtype('O') to dtype('int32') according to the rule 'safe'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "<ipython-input-13-b2a95a2d83fd>", line 46, in <module>
'DEPTH': float, 'ML': float})
File "C:UsersDirectoranaconda3envsobspylibsite-packagespandasioparsers.py", line 610, in read_csv
return _read(filepath_or_buffer, kwds)
File "C:UsersDirectoranaconda3envsobspylibsite-packagespandasioparsers.py", line 468, in _read
return parser.read(nrows)
File "C:UsersDirectoranaconda3envsobspylibsite-packagespandasioparsers.py", line 1057, in read
index, columns, col_dict = self._engine.read(nrows)
File "C:UsersDirectoranaconda3envsobspylibsite-packagespandasioparsers.py", line 2061, in read
data = self._reader.read(nrows)
File "pandas_libsparsers.pyx", line 756, in pandas._libs.parsers.TextReader.read
File "pandas_libsparsers.pyx", line 771, in pandas._libs.parsers.TextReader._read_low_memory
File "pandas_libsparsers.pyx", line 850, in pandas._libs.parsers.TextReader._read_rows
File "pandas_libsparsers.pyx", line 982, in pandas._libs.parsers.TextReader._convert_column_data
File "pandas_libsparsers.pyx", line 1056, in pandas._libs.parsers.TextReader._convert_tokens
ValueError: invalid literal for int() with base 10: 'YYYY'
As I understand it, the header row (YYYY, LAT, LON, DEPTH, ML) has become part of the data and cannot be cast to int or float. However, if I skip the header, I can no longer select the columns I need, because the header then becomes 1998,-16.65,-66.65,12,3.2.
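A quick way to check this (a rough sketch, assuming the merged file path used above; check and coerced are just illustrative names) is to read the columns without forcing a dtype and coerce them afterwards, so any stray header rows surface as NaN:

import pandas as pd

# sketch: read without dtype, then coerce to numeric so that
# non-numeric cells (e.g. a repeated 'YYYY' header) become NaN
kat = r'D:\Inves\Catalogs\Full_1988-2008.csv'
check = pd.read_csv(kat, sep=',', usecols=['YYYY', 'LAT', 'LON', 'DEPTH', 'ML'])
coerced = check.apply(pd.to_numeric, errors='coerce')
print(check[coerced['YYYY'].isna()])  # rows that refuse to convert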
Does anyone have a clue about a better way to process my data? I have attached two complete files in case you want to reproduce my error:
https://drive.google.com/drive/folders/18xrDC7vqEm_pY3D2sxwou3dlBdkZ6nHF?usp=sharing
Your code handles the two files 1988.txt and 1989.txt just fine. For debugging, I suggest removing the cast from read_csv:
DF = pd.read_csv(kat, sep=',', usecols=(['YYYY', 'LAT', 'LON', 'DEPTH', 'ML']))
Now, check the values of the YYYY column:
DF['YYYY'].unique()
You will probably see 'YYYY' itself appearing as a value. To find those rows:
DF[DF['YYYY'] == 'YYYY']
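If those rows do turn out to be stray header lines, one way to handle them (a rough sketch, continuing from the DF above; clean is just an illustrative name, not part of your original code) is to drop them first and cast afterwards:

# sketch: drop rows whose YYYY cell is the literal header string,
# then cast the remaining columns to the intended numeric types
clean = DF[DF['YYYY'] != 'YYYY'].copy()
clean = clean.astype({'YYYY': int, 'LAT': float, 'LON': float,
                      'DEPTH': float, 'ML': float})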
Based on your suggestion, I added the following line to the code:
new_data_d = new_data[new_data.YYYY.str.contains('YYYY') == False]
and the stray YYYY,SS… rows were removed. The final code looks like this:
import os
import glob
import pandas as pd

os.chdir(r'D:\Inves\Catalogs\OSC')
path = os.getcwd()
csv_files = glob.glob(os.path.join(path, "*.txt"))

new_data = []
for f in csv_files:
    print('Location File:', f)
    print('File Name:', f.split("\\")[-1])
    # skip the 10-line header block; columns are whitespace-separated
    df = pd.read_csv(f, header=10, sep=r'\s+')
    # drop the last n footer lines
    n = 2
    df.drop(df.tail(n).index, inplace=True)
    df = df[df.YYYY != '----']  # deleting the '----' row
    print('...Appending...')
    new_data.append(df)

new_data = pd.concat(new_data, ignore_index=True)
# drop any stray header rows that slipped into the data
new_data_d = new_data[new_data.YYYY.str.contains('YYYY') == False]
new_data_d.to_csv(r'D:\Inves\Catalogs\Full_1988-2008.csv',
                  index=False, header=True, sep=',')
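With the stray header rows filtered out before writing, the selective read that failed earlier should now go through. A minimal sketch (assuming the same output path and column names as above):

import pandas as pd

# sketch: re-read only the needed columns from the cleaned catalogue,
# now that no repeated 'YYYY' header rows remain in the data
kat = r'D:\Inves\Catalogs\Full_1988-2008.csv'
DF = pd.read_csv(kat, sep=',',
                 usecols=['YYYY', 'LAT', 'LON', 'DEPTH', 'ML'],
                 dtype={'YYYY': int, 'LAT': float, 'LON': float,
                        'DEPTH': float, 'ML': float})
print(DF.dtypes)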