Pandas: create a new column with groupby and avoid loops



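I have a df containing customer_id, year, order, and a few other unimportant columns. Every time I get a new order, my code creates a new row, so each customer_id can have multiple rows. I want to create a new column 'actually' that contains 'True' if the customer_id bought in 2020 or 2021. My code is: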

import numpy as np

# Run through all customers and check if they bought in 2020 or 2021
investors = df["customer_id"].unique()
df["actually"] = np.nan
for i in investors:
    selected_df = df.loc[df["customer_id"] == i]
    for year in selected_df['year'].unique():
        if "2021" in str(year) or "2020" in str(year):
            df.loc[df["customer_id"] == i, "actually"] = "True"
            break
# Want just latest orders / customers
df = df.loc[df["actually"] == "True"]

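This works fine, but it is quite slow. I would like to use Pandas' groupby functionality, but so far I have not found a way that works. I would also like to avoid the loops. Does anyone have an idea?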

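You can create the 'actually' column like this:

# Customers with at least one order in 2020, and in 2021
list1 = df['customer_id'][df.year == 2020].unique()
list2 = df['customer_id'][df.year == 2021].unique()
# True for every row whose customer appears in either list
df['actually'] = df['customer_id'].apply(lambda x: x in list1 or x in list2)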
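The same idea can probably be written without `apply` by using `Series.isin` twice. This is only a sketch: the sample data is made up for illustration, and it assumes `year` is stored as integers.

```python
import pandas as pd

# Hypothetical sample data; column names follow the question.
df = pd.DataFrame({
    "customer_id": ["c1", "c2", "c1", "c4", "c3", "c3"],
    "year": [2019, 2018, 2021, 2012, 2020, 2021],
})

# Customers with at least one order in 2020 or 2021.
recent_customers = df.loc[df["year"].isin([2020, 2021]), "customer_id"].unique()

# Flag every row that belongs to one of those customers.
df["actually"] = df["customer_id"].isin(recent_customers)
```

Both steps are vectorized, so no Python-level function is called per row.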

Based on my understanding of your scenario, here is some simple code:

import pandas as pd
# Sample data to recreate the scenario
data = {'customer_id': ['c1','c2','c1','c4','c3','c3'], 'year': [2019, 2018,2021,2012,2020,2021], 'order': ['A1','A2','A3','A4','A5','A6']}
df = pd.DataFrame.from_dict(data)
# Creating the new column (initially all False)
df['actually'] = False
# Filling only required rows with True
df.loc[(df['year']==2020) | (df['year']==2021), 'actually'] = True
print(df)

This produces:

  customer_id  year order  actually
0          c1  2019    A1     False
1          c2  2018    A2     False
2          c1  2021    A3      True
3          c4  2012    A4     False
4          c3  2020    A5      True
5          c3  2021    A6      True
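Note that the output above flags individual rows, not customers: c1's 2019 order stays False even though c1 also bought in 2021. If the flag should apply to every row of such a customer, as the question seems to ask, one possible groupby-based sketch is to compute the row-level condition and then spread it over each customer with `transform("any")`. The data is the same sample as above; integer years are an assumption.

```python
import pandas as pd

# Same sample data as in the answer above.
df = pd.DataFrame({
    "customer_id": ["c1", "c2", "c1", "c4", "c3", "c3"],
    "year": [2019, 2018, 2021, 2012, 2020, 2021],
    "order": ["A1", "A2", "A3", "A4", "A5", "A6"],
})

# Row-level condition: was this particular order placed in 2020 or 2021?
row_flag = df["year"].isin([2020, 2021])

# Customer-level flag: True on every row of a customer with at least one such order.
df["actually"] = row_flag.groupby(df["customer_id"]).transform("any")

# Keep only the rows of those customers, as in the question's last step.
recent = df[df["actually"]]
```

With this version the column holds real booleans rather than the string "True", so the final filter is a plain boolean mask.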

You can use the apply method to avoid the explicit loops:

# astype(str) lets .str.contains work even if year is stored as integers
df['actually'] = df['customer_id'].apply(lambda x: df.loc[df.customer_id == x, 'year'].astype(str).str.contains('2020|2021').any())
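The `apply` call above rescans the whole frame once per row, so it may not be much faster than the original loops on large data. If only the rows of recent customers are needed in the end, a `groupby(...).filter(...)` sketch such as the following might be an alternative. The sample data and the helper name `bought_recently` are hypothetical, and the years are stored as strings here to mirror the substring check in the question.

```python
import pandas as pd

# Hypothetical sample data with years stored as strings.
df = pd.DataFrame({
    "customer_id": ["c1", "c2", "c1", "c4", "c3", "c3"],
    "year": ["2019", "2018", "2021", "2012", "2020", "2021"],
})

def bought_recently(group):
    # True if any year value of this customer contains "2020" or "2021".
    return bool(group["year"].astype(str).str.contains("2020|2021").any())

# Keep all rows of customers for which the check passes; original row order is preserved.
recent = df.groupby("customer_id").filter(bought_recently)
```

The function runs once per customer rather than once per row. For a very large number of customers, a fully vectorized mask is likely still faster.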
