Pandas: Filling NaN poor performance - avoid iterating rows?

I'm running into a performance problem when filling in missing values in a dataset. It is a 500 MB dataset with 5,000,000 rows (Kaggle: Expedia 2013).

Using df.fillna() would be the easiest, but it seems I can't use it to fill each NaN with a different value.

I created a lookup table:

srch_destination_id | Value
2        0.0110
3        0.0000
5        0.0207
7           NaN
8           NaN
9           NaN
10       0.1500
12       0.0114

For each srch_destination_id, this table contains the value that should replace NaN in the dataset.

# Iterate over dataset row per row. If missing value (NaN), fill in the min. val
# found in lookuptable.
for row in range(len(dataset)):
    if pd.isnull(dataset.iloc[row]['prop_location_score2']):
        cell = dataset.iloc[row]['srch_destination_id']
        df.set_value(row, 'prop_location_score2', lookuptable.loc[cell])

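This code works when iterating over 1000 rows, but when iterating over all 5 million rows my computer never finishes (I waited for hours).

Is there a better way to do what I'm doing? Did I make a mistake somewhere?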

pd.Series.fillna accepts a Series or a dictionary, as well as a scalar replacement value.
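As a small illustration of that behaviour, here is a minimal sketch on made-up data (the names `values`, `filler`, `filled_from_series` and `filled_from_dict` are hypothetical, not from the question). When the argument is a Series or a dict, each NaN is filled with the entry that carries the same index label:

```python
import numpy as np
import pandas as pd

# Toy Series with two missing entries (index labels 1 and 2).
values = pd.Series([1.0, np.nan, np.nan, 4.0])

# Series argument: each NaN is filled from the entry with the same index label.
# Non-missing entries in `values` are left untouched.
filler = pd.Series([10.0, 20.0, 30.0, 40.0])
filled_from_series = values.fillna(filler)

# Dict argument: keys are index labels, values are the replacements.
filled_from_dict = values.fillna({1: -1.0, 2: -2.0})

print(filled_from_series.tolist())
print(filled_from_dict.tolist())
```

The alignment is by index label, not by position, so the filler Series needs the same index as the Series being filled.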

So you can create a Series mapping from lookup:

s = lookup.set_index('srch_destination')['Value']

Then use it to fill in the NaN values in dataset:

dataset['prop_loc'] = dataset['prop_loc'].fillna(dataset['srch_destination'].map(s.get))

Note that the input to fillna is built by mapping the identifiers from dataset through s. We use pd.Series.map to perform that mapping.
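For reference, here is a self-contained sketch of the same approach using the full column names from the question. The two small frames are made-up stand-ins for the real data. It assumes each srch_destination_id appears only once in lookup, and it passes the Series `s` directly to `map`, which should be equivalent to `s.get` here:

```python
import numpy as np
import pandas as pd

# Hypothetical miniature versions of the lookup table and the dataset.
lookup = pd.DataFrame({
    "srch_destination_id": [2, 3, 5, 7],
    "Value": [0.0110, 0.0000, 0.0207, np.nan],
})
dataset = pd.DataFrame({
    "srch_destination_id": [2, 3, 5, 7, 2],
    "prop_location_score2": [np.nan, 0.5, np.nan, np.nan, 0.3],
})

# Series indexed by destination id, holding the replacement value.
s = lookup.set_index("srch_destination_id")["Value"]

# One replacement value per row of dataset, aligned on dataset's own index.
fill_values = dataset["srch_destination_id"].map(s)

# Only rows where prop_location_score2 is NaN receive a value from fill_values.
dataset["prop_location_score2"] = dataset["prop_location_score2"].fillna(fill_values)

print(dataset)
```

Rows whose lookup value is itself NaN (id 7 in this toy data) stay NaN, and existing scores are not overwritten. Both steps are vectorized, so there is no Python-level loop over the rows.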
