What is making this Python code so slow? How can I modify it to run faster?



I am writing a program in Python for a data analytics project involving ad performance data matched with ad characteristics, with the goal of identifying groups of high-performing ads that share n characteristics. The dataset I am using has individual ads as rows, with characteristic, summary, and performance data as columns. Below is my current code. The actual dataset I am using has 51 columns, 4 of which are excluded, so the outer loop runs over 47 choose 4, i.e. 178,365 iterations.
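As a quick sanity check on that count (math.comb is in the standard library from Python 3.8):

import math

print(math.comb(47, 4))  # 178365 column combinations in the outer loop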

Currently, this code takes around 2 hours to execute. I know that nested for loops can be the source of a problem like this, but I don't know why it is taking so long to run, nor am I sure how to modify the inner/outer for loops to improve performance. Any feedback on either of these topics would be greatly appreciated.

import itertools
import pandas as pd
import numpy as np

# Identify clusters of rows (ads) that have a KPI value above a certain threshold

def set_groups(df, n):
    """This function takes a dataframe and a number n, and returns a list of lists. Each list is a group of n columns.
    The list of lists will hold all size n combinations of the columns in the dataframe.
    """
    # Create a list of all relevant column names
    columns = list(df.columns[4:])  # exclude first 4 summary columns
    # Create a list of lists, where each list is a group of n columns
    groups = []
    vals_lst = list(map(list, itertools.product([True, False], repeat=n)))  # all possible combinations of True/False values
    for comb in itertools.combinations(columns, n):  # itertools.combinations returns an iterator of tuples
        groups.append([comb, vals_lst])
    groups = np.array(groups, dtype=object)
    return groups  # len(groups) = len(columns) choose n

def identify_clusters(df, KPI, KPI_threshhold, max_size, min_size, groups):
    """
    This function takes in a dataframe, a KPI, a threshhold value, a max and min size, and a list of lists of groupings.
    The function will identify groups of rows in the dataframe that have the same values for each column in each list of groupings.
    The function will return a list of lists with each list of groups, the values list, and the ad_ids in the cluster.
    """
    # Create a list to hold the results
    output = []
    # Iterate through each list of groups
    for group in groups:
        for vals_lst in group[1]:  # for each pairing of column group and value combination
            # Create a temporary dataframe holding only the rows with matching values for the columns in group
            temp_df = df
            for i in range(len(group[0])):
                temp_df = temp_df[(temp_df[group[0][i]] == vals_lst[i])]  # reduce temp_df to rows matching vals_lst for each column
            if temp_df[KPI].mean() > KPI_threshhold:  # if the mean KPI of temp_df is above the threshhold
                output.append([group, vals_lst, temp_df['ad_id'].values])  # record the group, vals_lst, and ad_ids
    print(output)
    return output

## Main
df = pd.read_excel('data.xlsx', sheet_name='name')
groups = set_groups(df, 4)
print(len(groups))
identify_clusters(df, 'KPI_var', 0.0015, 6, 4, groups)

Any insight into why the code takes so long to run, and/or any suggestions for improving its performance, would be very helpful.

I think your biggest problem is these lines:

temp_df = df
for i in range(len(group[0])):
    temp_df = temp_df[(temp_df[group[0][i]] == vals_lst[i])]

You are filtering the whole dataframe, when I think you are really only interested in the KPI and ad_id columns. You could instead build up a rolling mask, something like

mask = pd.Series(True, index=df.index)
for i in range(len(group[0])):
    mask = mask & (df[group[0][i]] == vals_lst[i])

You can then access your subsets as, for example, df[mask][KPI].mean() and df[mask]['ad_id'].values. If you do this, you avoid copying large amounts of data on every iteration.
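Put together, a minimal sketch of identify_clusters using the mask approach (same signature and group/vals_lst structure as your original; only the filtering changes):

def identify_clusters(df, KPI, KPI_threshhold, max_size, min_size, groups):
    output = []
    for group in groups:
        for vals_lst in group[1]:
            # Build up a boolean mask instead of repeatedly slicing the dataframe
            mask = pd.Series(True, index=df.index)
            for col, val in zip(group[0], vals_lst):
                mask &= (df[col] == val)
            # Only the KPI and ad_id columns are ever read from the subset
            if df.loc[mask, KPI].mean() > KPI_threshhold:
                output.append([group, vals_lst, df.loc[mask, 'ad_id'].values])
    return output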

I would also look to simplify the code a little. For example, I believe vals_lst = list(map(list, itertools.product([True, False], repeat=n))) is the same for every group, so I would compute it once and keep it as a standalone variable rather than attaching it to every group; that would clear out the group[0], group[1], and group[0][i] references, which are a little hard to follow on a first read of the code.
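For instance, a minimal sketch of that simplification (set_groups returning plain column tuples, with the value combinations computed once):

import itertools

def set_groups(df, n):
    """Return all size-n combinations of the relevant columns."""
    columns = list(df.columns[4:])  # exclude the first 4 summary columns
    return list(itertools.combinations(columns, n))

n = 4
groups = set_groups(df, n)
vals_lst = list(itertools.product([True, False], repeat=n))  # identical for every group, so computed once

# The loops then read naturally, without the group[0]/group[1] indexing:
# for cols in groups:
#     for vals in vals_lst:
#         ...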

Comparing the change from iterative filtering to a rolling mask: the mask approach always performs better, and the gap grows with the size of the data. At 10,000 rows the gap is:

method        time                  relative
original      2.900383699918166     2.8098094911581533
using mask    1.0322349993328       1.0
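For reference, a minimal sketch of how such a comparison can be timed (a hypothetical harness on a random 10,000-row dataframe; actual numbers will vary with hardware and data):

import timeit
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
cols = ['c0', 'c1', 'c2', 'c3']
df = pd.DataFrame(rng.integers(0, 2, size=(10_000, 4)).astype(bool), columns=cols)
df['KPI_var'] = rng.random(10_000)
vals = [True, False, True, True]

def original():
    temp_df = df
    for col, val in zip(cols, vals):
        temp_df = temp_df[temp_df[col] == val]
    return temp_df['KPI_var'].mean()

def using_mask():
    mask = pd.Series(True, index=df.index)
    for col, val in zip(cols, vals):
        mask &= (df[col] == val)
    return df.loc[mask, 'KPI_var'].mean()

print('original  ', timeit.timeit(original, number=1_000))
print('using mask', timeit.timeit(using_mask, number=1_000))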
