More efficient iteration in Pandas and Python



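I have hundreds of text files, each containing thousands of records that look like this (6 GB of weather data from NOAA): "15 + 0175690150931212019010100567 + 34300 - 116166 - fm - 06..." Each position in the string holds a specific piece of information, and I split it into columns with the function below: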

import pandas as pd

def convert_to_df(path):
    # Each line of the file is read as a single string in the 'Data' column.
    df = pd.read_csv(path, low_memory=False, header=None, names=['Data'])
    df_clean = pd.DataFrame(columns=['TVC','USAF','WBAN','DATE','TIME','SOURCE','LAT','LONG','TYPE','ELEV','FWSID',
                                     'MPOQC','WIND_ANGLE',...])
    for record, data in enumerate(df['Data']):
        # Slice each fixed-width field out of the current record.
        TVC = df['Data'][record][0:4]
        USAF = df['Data'][record][4:10]
        WBAN = df['Data'][record][10:15]
        DATE = df['Data'][record][15:23]
        TIME = df['Data'][record][23:27]
        ...
        ATM_PRESSURE = df['Data'][record][99:104]
        ATM_PRESSURE_QC = df['Data'][record][104:105]

        clean_dict = {'TVC':TVC,'USAF':USAF,'WBAN':WBAN,'DATE':DATE,
                      'TIME':TIME,'SOURCE':SOURCE,'LAT':LAT,'LONG':LONG,'TYPE':TYPE,'ELEV':ELEV,...}
        # Appending one row per iteration returns a new DataFrame each time
        # (DataFrame.append was removed in pandas 2.0).
        df_clean = df_clean.append(clean_dict, ignore_index=True)
    return df_clean

It ends up looking like this, though with many more rows and 31 columns:

TVC   USAF    WBAN   DATE      TIME
0175  690150  93121  20190101  0056
0175  690150  93121  20190101  0156

You can apply the string slicing to the whole initial DataFrame at once:

TVC = df["Data"].str[0:4]
USAF = df["Data"].str[4:10]
WBAN = df["Data"].str[10:15]
.
.
.

Writing that out for every field is tedious, so you can use lists of column names and slice boundaries instead:

columns = ['TVC', 'USAF', 'WBAN', 'DATE', 'TIME', 'SOURCE', ...]
slices = [0, 4, 10, 15, 23, 27, ...]
slices = zip(slices, slices[1:])  # (0, 4), (4, 10), (10, 15), ...
clean_dict = {}
for key, (start, end) in zip(columns, slices):
    clean_dict[key] = df["Data"].str[start:end]
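
To see the list-driven approach end to end, here is a minimal sketch. It assumes only the first five fields and their boundaries as given in the question (the remaining fields are left out), uses the two sample records shown earlier truncated to 27 characters, and wraps the loop in a hypothetical helper named `slice_fixed_width`, which is not part of pandas:

```python
import pandas as pd

# Two sample records, truncated to their first 27 characters for illustration.
raw = pd.DataFrame({"Data": ["017569015093121201901010056",
                             "017569015093121201901010156"]})

# Only the first five fields; the remaining column names and boundaries are omitted.
columns = ["TVC", "USAF", "WBAN", "DATE", "TIME"]
bounds = [0, 4, 10, 15, 23, 27]


def slice_fixed_width(df, columns, bounds, source="Data"):
    """Hypothetical helper: one output column per (start, end) boundary pair."""
    pairs = zip(bounds, bounds[1:])  # (0, 4), (4, 10), (10, 15), ...
    # Each .str[start:end] slices every row of the column in one vectorized call.
    return pd.DataFrame({name: df[source].str[start:end]
                         for name, (start, end) in zip(columns, pairs)})


df_clean = slice_fixed_width(raw, columns, bounds)
print(df_clean)
```

Passing the dictionary of Series to `pd.DataFrame` builds the whole result in one step, so no rows are appended inside a Python loop. The sliced values stay strings, so leading zeros such as those in `TIME` are preserved until you convert the columns yourself.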
