How to optimize a function that uses a for loop over a DataFrame with 20 million rows

I have a pandas DataFrame df that looks like this:

student_id   category_id  count
1              111        10
2              111        5
3              222        8
4              333        5
5              111        6

In the same way, I have 20 million rows.

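I want to compute a rating for each student_id. For example, consider category_id 111. This category contains 3 student_ids: 1, 2, and 5. student_id 1 has a count of 10, student_id 2 has a count of 5, and student_id 5 has a count of 6. The rating of each student_id within a category_id is given by the following formula: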

(count per student_id / total number of counts per category_id) * 5

For student_id 1 -> 10/21 * 5 = 2.38

For student_id 2 -> 5/21 * 5 = 1.19

For student_id 5 -> 6/21 * 5 = 1.43
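As a quick check of the arithmetic above, here is a minimal plain-Python sketch for category_id 111 only. The names counts, total, and ratings are made up for illustration and are not part of the original code.

```python
# Counts for category_id 111, taken from the sample rows above.
counts = {1: 10, 2: 5, 5: 6}  # student_id -> count

# Total count for the category: 10 + 5 + 6
total = sum(counts.values())

# rating = (count for the student / total count for the category) * 5
ratings = {sid: round(c / total * 5, 2) for sid, c in counts.items()}
print(ratings)
```

The three ratings add up to 5 (before rounding), since every category distributes a total of 5 across its students.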

Below are the functions I already have for the calculation:

# Collect the values of each group into lists
countPerStudentID = df.groupby('student_id').agg(list)
countPerCategoryID = df.groupby('category_id').agg(list)

# Cache: student_id -> total count
studentIDMap = dict()

def func1(student_id):
    if student_id in studentIDMap:
        return studentIDMap[student_id]
    runningSum = 0
    countList = countPerStudentID.loc[student_id, 'count']
    for count in countList:
        runningSum += count
    studentIDMap[student_id] = runningSum
    return studentIDMap[student_id]

# Similar to the above function
# Cache: category_id -> total count
categoryIDMap = dict()

def func2(category_id):
    if category_id in categoryIDMap:
        return categoryIDMap[category_id]
    runningSum = 0
    countList = countPerCategoryID.loc[category_id, 'count']
    for count in countList:
        runningSum += count
    categoryIDMap[category_id] = runningSum
    return categoryIDMap[category_id]

Finally, I call these two functions as follows:

# Calculating rating category-wise
rating = []
for index, row in df.iterrows():
    # func2 sums counts per category_id, func1 sums counts per student_id
    totalCountPerCategoryID = func2(row['category_id'])
    totalCountPerStudentID = func1(row['student_id'])
    rating.append((totalCountPerStudentID / totalCountPerCategoryID) * 5)
df['rating'] = rating

Desired output:

student_id   category_id  count   rating
1              111        10      2.38
2              111        5       1.19
3              222        8       5
4              333        5       5
5              111        6       1.43

Because the dataset is huge, this takes a very long time to run. I would like to know how to optimize this calculation.

Thanks in advance.

You don't need a loop; this is a case for groupby:

df['rating'] = df['count']/df.groupby('category_id')['count'].transform('sum') * 5

Output:

   student_id  category_id  count    rating
0           1          111     10  2.380952
1           2          111      5  1.190476
2           3          222      8  5.000000
3           4          333      5  5.000000
4           5          111      6  1.428571
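To make it clearer what transform('sum') contributes, here is a small self-contained sketch built on the five sample rows from the question. The intermediate names category_totals and total_per_row are only for illustration.

```python
import pandas as pd

# The five sample rows from the question.
df = pd.DataFrame({
    "student_id": [1, 2, 3, 4, 5],
    "category_id": [111, 111, 222, 333, 111],
    "count": [10, 5, 8, 5, 6],
})

# A plain groupby sum gives one total per category,
# as a Series indexed by category_id.
category_totals = df.groupby("category_id")["count"].sum()

# transform('sum') computes the same totals but returns them aligned
# with the original rows, so each row carries its own category's total.
total_per_row = df.groupby("category_id")["count"].transform("sum")

# Element-wise division of two same-length columns, no Python loop.
df["rating"] = df["count"] / total_per_row * 5

print(category_totals)
print(total_per_row.tolist())
print(df)
```

Because total_per_row has the same index as df, the division lines up row by row without any lookup dictionary or caching.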

Good grief, don't use iterrows and append, and certainly not together. No wonder your performance is crawling. With pandas, iterrows should be a last resort.

You should be able to do this with a vectorized approach:

>>> df['rating'] = df['count'].div(df.groupby('category_id')['count'].transform('sum')).mul(5)
>>> df
   student_id  category_id  count    rating
0           1          111     10  2.380952
1           2          111      5  1.190476
2           3          222      8  5.000000
3           4          333      5  5.000000
4           5          111      6  1.428571
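If you want to see the difference for yourself, a rough timing sketch like the one below may help. It uses synthetic data (20,000 rows and 50 categories, both arbitrary assumptions rather than the asker's dataset), and the row-by-row loop is a simplified stand-in for the original func1/func2 code. Actual timings depend on the machine and the pandas version, so none are quoted here.

```python
import time

import numpy as np
import pandas as pd

# Synthetic data (an assumption, not the asker's dataset).
rng = np.random.default_rng(0)
n_rows = 20_000
df = pd.DataFrame({
    "student_id": np.arange(1, n_rows + 1),
    "category_id": rng.integers(0, 50, size=n_rows),
    "count": rng.integers(1, 100, size=n_rows),
})

# Row-by-row version: look up a precomputed category total for every row.
totals = df.groupby("category_id")["count"].sum().to_dict()
start = time.perf_counter()
loop_rating = []
for _, row in df.iterrows():
    loop_rating.append(row["count"] / totals[row["category_id"]] * 5)
loop_seconds = time.perf_counter() - start

# Vectorized version: one groupby/transform over whole columns.
start = time.perf_counter()
vec_rating = df["count"].div(
    df.groupby("category_id")["count"].transform("sum")
).mul(5)
vec_seconds = time.perf_counter() - start

print(f"iterrows: {loop_seconds:.3f}s, vectorized: {vec_seconds:.3f}s")
print("same result:", np.allclose(loop_rating, vec_rating))
```

Both versions should produce the same ratings; the point of the sketch is only to compare how long each one takes as n_rows grows.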
