在改变DataFrame的值时增加嵌套循环的速度



我希望提高嵌套for循环的速度。

变量:

'dataframe' -我试图在第二个for循环中修改的数据帧。它由针对同一群人的大量培训课程组成。这是考勤文档,如果报告数据框中存在匹配项,则更改该文档。

'dictNewNames' -这是一个会话标题名称的字典。键是较长的会话标题名称,值是剥离后的会话标题名称。例如{'Week 1: Training': 'Training'}等。该键等于每行中的'Session Title'列,但该值用于在第二个for循环中搜索子字符串。

'reporting' -包含有关会议标题和出席情况的信息的数据框架。报告数据框架已经被过滤,所以"报告"数据框架中的每个人都应该在"数据框架"中获得信用。唯一需要注意的是,'search'名称嵌套在路径标题中。

dataframe = {
'Session Title': ['Organization Week 1: Train', 'Organization Week 2: Train', 'Organization Week 3: Train'],
'Attendee Email': ['name@gmail.com', 'name2@gmail.com', 'name3@gmail.com'],
'Completed': ['No', 'No', 'No'],
'Date Completed': ['','','']}
dictNewNames = { 'Organization Week 1: Train': 'Train', ' Organization Week 2: Train': 'Train', 'Organization Week 3: Train': 'Train' }

标题格式不正确(即:':' vs '-',如下面的路径标题所示)。数据在格式方面完全是到处都是。

reporting = {
'Pathway Title': ['Training 1 - Train', 'Training 2: Train', 'Training 3 - Train'],
'Email': ['name@gmail.com', 'name2@gmail.com', 'name3@gmail.com'],
'Date Completed': ['xx/yy/xx', 'yy/xx/zz', 'zz/xx/yy']}
expectedOuput = {
'Session Title': ['Organization Week 1: Train', 'Organization Week 2: Train', 'Organization Week 3: Train'],
'Attendee Email': ['name@gmail.com', 'name2@gmail.com', 'name3@gmail.com'],
'Completed': ['Yes', 'Yes', 'Yes'],
'Date Completed': ['xx/yy/xx', 'yy/xx/zz', 'zz/xx/yy']}

我代码:

def giveCredit(dataframe, dictNewNames, reporting):
for index, row in dataframe.iterrows():

temp = row['Session Title']
searchName = dictNewNames[temp]
attendeeEmail = row['Attendee: Email']

for index1, row1 in reporting.iterrows():
pathwayTitle = row1['Pathway Title']
Email = row1['Organization Email']
dateCompleted = row1['Date Completed']
if attendeeEmail == Email and searchName in pathwayTitle:
dataframe.at[index, 'Completed'] = 'Yes'
dataframe.at[index, 'Date Completed'] = dateCompleted
break
return dataframe

你的图案看起来像merge:

for loop1 on first dataframe:
for loop2 on second dataframe:
if conditions match between both dataframes:

:

# Create a common key Name based on dictNewNames
pat = fr"({'|'.join(dictNewNames.values())})"
name1 = dataframe['Session Title'].map(dictNewNames)
name2 = reporting['Pathway Title'].str.extract(pat)
# Merge dataframes based on this key and email
out = pd.merge(dataframe.assign(Name=name1),
reporting.assign(Name=name2),
left_on=['Name', 'Attendee Email'],
right_on=['Name', 'Email'],
how='left', suffixes=(None, '_'))
# Update the dataframe
out['Date Completed'] = out.pop('Date Completed_')
out['Completed'] = np.where(out['Date Completed'].notna(), 'Yes', 'No')
out = out[dataframe.columns]

输出:

>>> out
Session Title   Attendee Email Completed Date Completed
0  Week 1: Train 1   name@gmail.com       Yes       xx/yy/xx
1  Week 2: Train 2  name2@gmail.com       Yes       yy/xx/zz
2  Week 3: Train 3  name3@gmail.com       Yes       zz/xx/yy

这个解决方法将我的执行时间从460秒减少到10秒以下。


def giveCredit(dataframe, dictNewNames, reporting):
reporting['Date Completed'] = pd.to_datetime(reporting['Date Completed'])
for index1, row in dataframe.iterrows():
temp = row['Session Title']
numberList = re.findall('[0-9]+', temp)
finalNumber = str(numberList[0])

searchName = dictNewNames[temp]
attendeeEmail = row['Attendee: Email']
row = reporting.loc[(reporting['Pathway Title'].str.contains(searchName, case=False)) & (reporting['Organization Email'] == attendeeEmail)]
if len(row.index) != 0:
new_row = row.loc[(reporting['Pathway Title'].str.contains(finalNumber, case=False))]
if len(new_row.index) != 0:
dataframe = modifyFrame(dataframe, new_row, index1)
else:
dataframe= modifyFrame(dataframe, row, index1)
dataframe = dataframe.sort_values(["Completed", "Attendee"], ascending=[False, True])
return dataframe


def modifyFrame(frame, row, index1):
dateCompleted = row['Date Completed']
dateCompleted = dateCompleted.to_string(buf=None, header=False, index=False, length=False, name=False, max_rows=None).strip()
dataframe.at[index1, 'Completed'] = 'Yes'
dataframe.at[index1, 'Date Completed'] = dateCompleted
return dataframe