Python nested loop over a pandas DataFrame



I have a pandas DataFrame with 30,000 rows that looks like this:

ID     year    month    var1-var300    test   
1111   2017    7        ...            1      
1111   2017    9        ...            0      
2222   2017    6        ...            1      
2222   2017    6        ...            0      
2222   2016    6        ...            0      
3333   2017    3        ...            1      
3333   2017    3        ...            0     
3333   2015    8        ...            0      
...

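Here is what I want to do for each row: if test=1, I want to extract the variables "ID", "year" and "month", loop over the entire DataFrame, and if this combination of values is found in any other row, assign 1 to a new variable "check". The final DataFrame should look like this: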

ID     year    month    var1-var300    test   check
1111   2017    7        ...            1      0
1111   2017    9        ...            0      0
2222   2017    6        ...            1      1
2222   2017    6        ...            0      0
2222   2016    6        ...            0      0
3333   2017    3        ...            1      1
3333   2017    3        ...            0      0
3333   2015    8        ...            0      0
...

Here is some pseudocode for what I have in mind:

for line in df:
    if line['test']=1:
        I=line['ID']
        Y=line['year']
        MO=line['month']
        for row in df:
            if row['ID']=I & row['year']=Y & row['month']=MO:
                line['check']=1
                break

Any idea how to do something like this in Python?

You should be able to invert your logic:

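  1. Group by ID, year, and month
  2. Perform the check within each group

def func(group):
    # A group with more than one row means the key appears in another row
    if len(group) > 1:
        group.loc[group['test'] == 1, 'check'] = 1
    return group

df['check'] = 0  # default for rows that are not flagged
df = df.groupby(['ID', 'year', 'month'], group_keys=False).apply(func)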

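I think you can use a transform to count the rows in each group and then derive the result from that count. It only takes two lines.

Here is my solution.

Create the test data: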

import pandas as pd
ID = [1111, 1111, 2222, 2222, 2222, 3333, 3333, 3333]
year = [2017, 2017, 2017, 2017, 2016, 2017, 2017, 2015]
month = [7, 9, 6, 6, 6, 3, 3, 8]
test = [1, 0, 1, 0, 0, 1, 0, 0]
df = pd.DataFrame({
    "ID": ID,
    "year": year,
    "month": month,
    "test": test
})

Get the result:

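# Count rows per (ID, year, month); selecting one column keeps the result a single Series
df["group_count"] = df.groupby(["ID", "year", "month"])["test"].transform("count")
# Flag rows with test > 0 whose key is shared by at least one other row
df["check"] = ((df["test"] > 0) & (df["group_count"] > 1)).astype(int)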
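As a quick sanity check, the sketch below rebuilds the sample rows from the question and applies the same count-then-flag idea end to end. It assumes the group size is taken with `transform("size")` on the `test` column; the name `group_size` is only illustrative, and the `expected` list is copied from the `check` column shown in the question.

```python
import pandas as pd

# Sample rows from the question (the var1-var300 columns are omitted)
df = pd.DataFrame({
    "ID":    [1111, 1111, 2222, 2222, 2222, 3333, 3333, 3333],
    "year":  [2017, 2017, 2017, 2017, 2016, 2017, 2017, 2015],
    "month": [7, 9, 6, 6, 6, 3, 3, 8],
    "test":  [1, 0, 1, 0, 0, 1, 0, 0],
})

# Number of rows that share each (ID, year, month) combination
group_size = df.groupby(["ID", "year", "month"])["test"].transform("size")

# Flag a row only if it has test == 1 and at least one other row shares its key
df["check"] = ((df["test"] == 1) & (group_size > 1)).astype(int)

expected = [0, 0, 1, 0, 0, 1, 0, 0]  # the check column shown in the question
print(df["check"].tolist() == expected)
```

Because the group size is computed once for the whole frame, no explicit Python loop over the 30,000 rows is needed.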

So you want a separate column indicating whether a row's ID, year, and month match the ID, year, and month of a row with test==1?

Iterate using iterrows():

to_check = []
for index, row in df.iterrows():
    if row['test']==1: # in your pseudocode, you use single =; that's for assigning variables
        to_check.append([row['ID'], row['year'], row['month']])
check = []
for index, row in df.iterrows():
    if [row['ID'], row['year'], row['month']] in to_check:
        check.append(1)
    else:
        check.append(0)
df["check"] = check
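With 30,000 rows, searching a list for every row can get slow. The following is a minimal sketch of the same two-pass idea using a set of tuples for the lookup, assuming the sample data from the question; the names `keys` and `flagged` are illustrative and not part of the original answer.

```python
import pandas as pd

# Sample rows from the question (the var1-var300 columns are omitted)
df = pd.DataFrame({
    "ID":    [1111, 1111, 2222, 2222, 2222, 3333, 3333, 3333],
    "year":  [2017, 2017, 2017, 2017, 2016, 2017, 2017, 2015],
    "month": [7, 9, 6, 6, 6, 3, 3, 8],
    "test":  [1, 0, 1, 0, 0, 1, 0, 0],
})

keys = ["ID", "year", "month"]

# First pass: collect the (ID, year, month) tuple of every row with test == 1
flagged = set(df.loc[df["test"] == 1, keys].itertuples(index=False, name=None))

# Second pass: mark each row whose key appears in that set
df["check"] = [
    int(key in flagged)
    for key in df[keys].itertuples(index=False, name=None)
]
```

Like the loop version, this marks every row that shares a key with a test==1 row, including the test==1 row itself, so the result differs from the table in the question, where only test==1 rows with a matching other row are flagged.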

You can make the following changes and try it:

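df['check'] = 0
# iterating over df directly yields column names, so use iterrows() to get rows
for i, line in df.iterrows():
    if line['test']==1:
        I=line['ID']
        Y=line['year']
        MO=line['month']
        for j, row in df.iterrows():
            # skip the row itself; only a match in another row counts
            if j != i and row['ID']==I and row['year']==Y and row['month']==MO:
                df.at[i, 'check']=1  # write to df; row copies from iterrows() do not write back
                break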
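A nested loop compares every test==1 row against all 30,000 rows, which can be slow. One possible vectorized sketch of the same rule (flag a test==1 row when another row shares its ID, year, and month) uses `DataFrame.duplicated` with `keep=False`. This swaps in a different technique from the loop above; the name `has_twin` is illustrative.

```python
import pandas as pd

# Sample rows from the question (the var1-var300 columns are omitted)
df = pd.DataFrame({
    "ID":    [1111, 1111, 2222, 2222, 2222, 3333, 3333, 3333],
    "year":  [2017, 2017, 2017, 2017, 2016, 2017, 2017, 2015],
    "month": [7, 9, 6, 6, 6, 3, 3, 8],
    "test":  [1, 0, 1, 0, 0, 1, 0, 0],
})

keys = ["ID", "year", "month"]

# True for every row whose (ID, year, month) key occurs in more than one row
has_twin = df.duplicated(subset=keys, keep=False)

# Flag only the test == 1 rows among those
df["check"] = ((df["test"] == 1) & has_twin).astype(int)
```

On the sample rows this gives the same `check` column as the table in the question.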
