Python function to flag the first 30% of transactions in a dataframe



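I need to create a function or loop over a dataframe with 7000 rows of transaction data so that I can find the transactions that occurred first, grouped into percentage brackets.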

The data is already sorted in ascending order by the date column, as shown below (using PySpark):

from pyspark.sql.functions import asc
sorted_df = df.orderBy(asc("date"))

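I now need a function that finds the first 30% of the rows in the dataframe and sets a flag in a new column; in this case that would be the first 2100 rows (7000 * 0.3). I would then like to extend the function to add further flags for the rows that fall into the 40%, 50% and 60% transaction brackets.

The next part of the problem is being able to apply this to a set of different months in the data (the df above is already filtered to a single month to make this easier). I am stuck here because I am new to writing functions and would like to use this as a learning opportunity. Many thanks.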
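To make the arithmetic concrete, here is a minimal sketch of where each bracket ends for 7000 rows. The names `n_rows`, `brackets` and `cutoffs` are illustrative only and do not come from the question.

```python
n_rows = 7000  # row count from the question
brackets = [30, 40, 50, 60]  # percentage brackets from the question

# Integer arithmetic avoids floating-point rounding at the bracket edges.
cutoffs = {pct: n_rows * pct // 100 for pct in brackets}
print(cutoffs)  # {30: 2100, 40: 2800, 50: 3500, 60: 4200}
```

Rows 0 to 2099 fall into the 30% bracket, rows 2100 to 2799 into the 40% bracket, and so on.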

Is this what you are looking for?

def flag_dataframe(df):
    df = df.reset_index(drop=True)  # renumber rows 0..n-1 so the index matches the sorted order
    df["Flag"] = None  # create the flag column
    flags = [30, 40, 50, 60, 70, 80, 90, 100]  # the flag percentages
    for i in df.index:  # i is the row position, thanks to the reset above
        for flag in flags:
            if i * 100 < flag * len(df):  # the first bracket this row falls into
                df.loc[i, "Flag"] = f"{flag}%"  # set the flag value of this row
                break  # stop here so a larger bracket does not overwrite the flag
    return df

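Usage example (the function expects a pandas dataframe, so a PySpark dataframe has to be converted with toPandas() first):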

df = flag_dataframe(df)

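You can improve this by removing the flags list from the function body and passing it in as a parameter with your own values. Here I simply used the flags you listed in your question.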
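As a rough sketch of that suggestion, the version below takes the brackets as a `flags` parameter and swaps the row-by-row loop for NumPy's `searchsorted`. The name `flag_rows` and the `amount` demo column are hypothetical, and the sketch assumes a pandas dataframe that is already sorted by date.

```python
import numpy as np
import pandas as pd


def flag_rows(df, flags=(30, 40, 50, 60, 70, 80, 90, 100)):
    """Label each row with the first percentage bracket it falls into."""
    out = df.reset_index(drop=True)  # assumes df is already sorted by date
    flags = sorted(flags)
    n = len(out)
    # The row at position p belongs to the first bracket with p * 100 < flag * n.
    thresholds = np.array(flags) * n
    bracket = np.searchsorted(thresholds, np.arange(n) * 100, side="right")
    # Rows past the largest bracket (only possible if it is below 100) get None.
    labels = np.array([f"{f}%" for f in flags] + [None], dtype=object)
    out["Flag"] = labels[bracket]
    return out


demo = pd.DataFrame({"amount": range(10)})  # hypothetical data
flagged = flag_rows(demo, flags=(30, 60))
print(flagged["Flag"].tolist())  # three "30%", three "60%", then None
```

Comparing `p * 100` against `flag * n` keeps everything in integers, so a row sitting exactly on a bracket edge is not shifted by floating-point rounding.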

As for your question about applying this to a selected set of rows (in this case, the records from the same month):

def flag_dataframe_by_month(df):
    df = df.reset_index(drop=True)  # renumber rows 0..n-1; drop=True avoids an extra "index" column
    df["Flag"] = None  # create the flag column
    flags = [30, 40, 50, 60, 70, 80, 90, 100]  # the flag percentages
    months = ["January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December"]
    for month in months:
        month_df = df[df["Month"] == month]  # all rows from one month, keeping the row labels set by reset_index
        # pos is the row's position within the month; i is its label in the original dataframe
        for pos, i in enumerate(month_df.index):
            for flag in flags:
                if pos * 100 < flag * len(month_df):  # the first bracket this row falls into
                    df.loc[i, "Flag"] = f"{flag}%"  # set the flag on this row of the original dataframe
                    break  # stop here so a larger bracket does not overwrite the flag
    return df

Usage is the same. If your month names are different, or your months are stored as numbers, just edit the entries in the months list.
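If editing the month list by hand becomes tedious, one possible alternative is to let pandas work out the groups. The sketch below is only an illustration: `flag_rows_by_group`, its `group_col` parameter and the demo data are hypothetical, and it assumes the dataframe is already sorted by date.

```python
import pandas as pd


def flag_rows_by_group(df, group_col="Month", flags=(30, 40, 50, 60, 70, 80, 90, 100)):
    """Flag rows by percentage bracket within each group (for example, each month)."""
    out = df.reset_index(drop=True)  # assumes df is already sorted by date
    pos = out.groupby(group_col).cumcount()  # 0-based position inside the group
    size = out[group_col].map(out[group_col].value_counts())  # number of rows in the group
    out["Flag"] = None
    # Apply the largest bracket first so that smaller brackets overwrite it.
    for flag in sorted(flags, reverse=True):
        out.loc[pos * 100 < flag * size, "Flag"] = f"{flag}%"
    return out


demo = pd.DataFrame({
    "Month": ["January"] * 10 + ["February"] * 5,  # hypothetical data
    "amount": range(15),
})
flagged = flag_rows_by_group(demo)
print(flagged.groupby("Month", sort=False)["Flag"].apply(list))
```

Because the groups come from the column values themselves, this works for any month labels without a hard-coded list.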

I also edited a few lines of my original answer, because they were producing warnings.
