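Selecting entries from a pandas DataFrame is slow

I have two pandas DataFrames: one with the premium customers, df_premium_customer, and one with all sold items, df_sold, which has the columns "CustomerID" (containing the IDs of premium customers as well as other customers), "ArticleID", "Date" and several others.

This is what df_premium_customer looks like: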

Premium_CustomerID
------------------
34674324
18634345
99744336

This is what df_sold looks like:

CustomerID  ArticleID  Date
----------  ---------  --------
34674324    3467434    20140302
98674342    3454234    20140822
74644334    4444434    20150321

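For every customer I need to create a data structure (for now I chose a dictionary) that shows what has been sold to each premium customer.

So far I am using the following Python 3 code: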

sold_to_customer = {}
for customer in df_premium_customer["CustomerID"]:
    # generate the list of indexes where this customer appears in df_sold
    cust_index = df_sold.index[df_sold['CustomerID'] == customer].tolist()
    # add this customer as a key to the dict
    sold_to_customer[customer] = []
    for ind in cust_index:
        # add the article he bought and the date as values to this key
        sold_to_customer[customer].append(list(df_sold.loc[ind, ["ArticleID", "Date"]]))

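This is way too slow!

I let it run for a while and extrapolated that it would need about 16 hours to finish, since I have 300k premium customers and several million rows in the DataFrame of sold items.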

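I believe your problem comes from the way pandas is used here. Element-by-element access and repeated boolean masks are slow in pandas. You might get some speed-up with the merge or groupby methods, but I am not sure how much. I believe an easy way to get a speed-up is to do everything in numpy. I think the line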

cust_index = df_sold.index[df_sold['CustomerID'] == customer].tolist()

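costs you a lot, because you do it for every customer, and each evaluation compares the whole CustomerID column of df_sold against that one customer.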
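To make that cost concrete, here is a minimal sketch. The small frame is only a stand-in built from the three sample rows in the question, and total_comparisons is a hypothetical helper, not something from pandas. It shows that the comparison produces one boolean per row of df_sold regardless of how many rows actually match, so rebuilding it per customer multiplies the two sizes.

```python
import pandas as pd

# Toy stand-in for df_sold, built from the sample rows in the question.
df_sold = pd.DataFrame({
    "CustomerID": [34674324, 98674342, 74644334],
    "ArticleID": [3467434, 3454234, 4444434],
    "Date": [20140302, 20140822, 20150321],
})

# The comparison is evaluated against every row, whoever the customer is.
mask = df_sold["CustomerID"] == 34674324
rows_scanned = len(mask)     # one boolean per row of df_sold
matches = int(mask.sum())    # rows that actually belong to this customer


def total_comparisons(n_customers, n_rows):
    """Hypothetical helper: element comparisons if the mask is rebuilt per customer."""
    return n_customers * n_rows
```

With 300k premium customers, every million rows in df_sold adds about 3 × 10^11 element comparisons, which is in line with the runtime being measured in hours.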

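What you can do instead is create a dictionary containing all premium customer IDs and go through all the data once. To go through the data you can use a for loop, which is still slow, but I believe faster than what you are doing with pandas.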

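sold_to_customer = {}
for customer in df_premium_customer["CustomerID"]:
    # Initialize the dict with an empty list per premium customer
    sold_to_customer[customer] = []

# Keep only the three columns that are needed, as a plain numpy array
data = df_sold[["CustomerID", "ArticleID", "Date"]].values
for i, j, k in data:
    # df_sold also contains non-premium customers; skip them
    if i in sold_to_customer:
        sold_to_customer[i].append([j, k])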

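This way you go through the data only once, and since dictionary lookups should be fast, you should be fine. Let me know whether this speeds things up, whether it is fast enough, or whether it still needs to be optimized.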
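For comparison, the groupby route mentioned earlier could look roughly like the sketch below. It is an untested-on-real-data alternative, not a measured improvement: the two frames are toy data built from the sample rows in the question, and the column names are assumed to be "CustomerID", "ArticleID" and "Date" as in the code above. Whether it beats the plain loop on the full data set would have to be timed.

```python
import pandas as pd

# Toy frames built from the sample rows in the question (illustrative only).
df_premium_customer = pd.DataFrame({"CustomerID": [34674324, 18634345, 99744336]})
df_sold = pd.DataFrame({
    "CustomerID": [34674324, 98674342, 74644334],
    "ArticleID": [3467434, 3454234, 4444434],
    "Date": [20140302, 20140822, 20150321],
})

# Every premium customer starts with an empty list, as in the loop version.
sold_to_customer = {c: [] for c in df_premium_customer["CustomerID"]}

# One vectorized pass keeps only the rows that belong to premium customers.
is_premium = df_sold["CustomerID"].isin(df_premium_customer["CustomerID"])
premium_rows = df_sold[is_premium]

# One pass over the groups fills in the customers that bought something.
for customer, group in premium_rows.groupby("CustomerID"):
    sold_to_customer[customer] = group[["ArticleID", "Date"]].values.tolist()
```

Premium customers without any purchase keep their empty list, so the resulting dictionary has the same shape as the one built by the loop.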
