When I transpose a DataFrame containing one column and roughly 20 million rows, Pandas silently drops about 18 million cells, with no error or exception. Why?
I am not surprised that the DataFrame lost data (it may have run out of memory), but I expected an error. This makes me suspect I am missing something important about DataFrames. In the snippet below, you can see it display one column of data and sum it. It then transposes the column into a row, displays incorrect values, happily computes the wrong sum, and happily finishes with no indication of an error. Coincidentally, the two sums look similar because both start with 1.5, but look closely and you will see they differ by an order of magnitude. I am running Anaconda Python 3.7.8 in Visual Studio Code on 64-bit Windows 10.
import pandas as pd

# <Read in all of the Chicago taxi data from 2016 here...>
print('Isolating seconds...')
seconds = all_trips.iloc[:,3]
print(seconds)
print('Converting to int64...')
seconds = seconds.astype(dtype='Int64')
print(seconds)
print('Converting seconds series to data frame...')
pd_seconds = pd.DataFrame(seconds)
print(pd_seconds)
sum_seconds_column = pd_seconds.sum()
print('Sum of column: {0:.4e}.'.format(sum_seconds_column.values[0]))
print('Transposing column to row...')
# Loses data in the following line.
transpose_seconds = pd_seconds.T
print(transpose_seconds)
sum_seconds_row = transpose_seconds.sum(axis=1)
print('Sum of row: {0:.4e}.'.format(sum_seconds_row.values[0]))
The output is:
Isolating seconds...
0 180.0
1 240.0
...
1245710 2040.0
1245711 0.0
Name: trip_seconds, Length: 19866157, dtype: float64
Converting to int64...
0 180
1 240
...
1245710 2040
1245711 0
Name: trip_seconds, Length: 19866157, dtype: Int64
Converting seconds series to data frame...
0 180
1 240
... ...
1245710 2040
1245711 0
[19866157 rows x 1 columns]
Sum of column: 1.5235e+10.
Transposing column to row...
0 1 ... 1975106 1975107
trip_seconds 900 240 ... 180 0
[1 rows x 1975108 columns]
Sum of row: 1.5395e+09.
Thanks in advance.
For future readers: I found the actual problem and a solution, if not an explanation. Thanks to the commenters whose suggestions helped narrow the problem down. The code that created the large DataFrame looked like this:
january_trips_df = pd.read_csv('chicago_taxi_trips_2016_01.csv')
february_trips_df = pd.read_csv('chicago_taxi_trips_2016_02.csv')
march_trips_df = pd.read_csv('chicago_taxi_trips_2016_03.csv')
april_trips_df = pd.read_csv('chicago_taxi_trips_2016_04.csv')
may_trips_df = pd.read_csv('chicago_taxi_trips_2016_05.csv')
june_trips_df = pd.read_csv('chicago_taxi_trips_2016_06.csv')
july_trips_df = pd.read_csv('chicago_taxi_trips_2016_07.csv')
august_trips_df = pd.read_csv('chicago_taxi_trips_2016_08.csv')
september_trips_df = pd.read_csv('chicago_taxi_trips_2016_09.csv')
october_trips_df = pd.read_csv('chicago_taxi_trips_2016_10.csv')
november_trips_df = pd.read_csv('chicago_taxi_trips_2016_11.csv')
december_trips_df = pd.read_csv('chicago_taxi_trips_2016_12.csv')
all_trips = pd.concat([january_trips_df, february_trips_df, march_trips_df,
                       april_trips_df, may_trips_df, june_trips_df,
                       july_trips_df, august_trips_df, september_trips_df,
                       october_trips_df, november_trips_df, december_trips_df])
Isolating the problem revealed that the indexes were numbered inconsistently, from 1 to ~20,000,000: by default, concat() preserves each frame's own zero-based index, so the combined frame contained millions of duplicate index labels. To correct the index, I added the argument ignore_index=True to the call to concat(), and the code ran fine with no data loss.
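For reference, here is a minimal sketch of the corrected call. The loop over month numbers is my own condensation of the twelve read_csv() lines above, not the original code:

import pandas as pd

# Each read_csv() call produces a frame with its own zero-based RangeIndex.
frames = [pd.read_csv('chicago_taxi_trips_2016_{:02d}.csv'.format(month))
          for month in range(1, 13)]

# ignore_index=True discards the twelve overlapping per-month indexes and
# builds a single 0..N-1 RangeIndex for the concatenated result.
all_trips = pd.concat(frames, ignore_index=True)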
I tried to re-create the problem with a smaller data set, to no avail. I am not sure whether I stumbled onto a bug, or whether transposing with an inconsistent index has behavior that is (at least to me) unexpected.
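In the meantime, a quick sanity check before any reshape is to verify that the index is unique. This snippet is not from my original code, but it would have flagged the condition here:

if not all_trips.index.is_unique:
    # duplicated() marks repeated index labels; sum() counts them.
    print('Index has {} duplicate labels.'.format(
        all_trips.index.duplicated().sum()))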