I have a pandas DataFrame with 30,000 rows that looks like this:
ID year month var1-var300 test
1111 2017 7 ... 1
1111 2017 9 ... 0
2222 2017 6 ... 1
2222 2017 6 ... 0
2222 2016 6 ... 0
3333 2017 3 ... 1
3333 2017 3 ... 0
3333 2015 8 ... 0
...
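Here is what I want to do for each row: if test=1, I want to take that row's "ID-year-month" combination and loop over the whole DataFrame; if the same combination is found in any other row, assign 1 to a new variable "check". The final DataFrame should look like this: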
ID year month var1-var300 test check
1111 2017 7 ... 1 0
1111 2017 9 ... 0 0
2222 2017 6 ... 1 1
2222 2017 6 ... 0 0
2222 2016 6 ... 0 0
3333 2017 3 ... 1 1
3333 2017 3 ... 0 0
3333 2015 8 ... 0 0
...
Here is some pseudocode for what I have in mind:
    for line in df:
        if line['test']=1:
            I=line['ID']
            Y=line['year']
            MO=line['month']
            for row in df:
                if row['ID']=I & row['year']=Y & row['month']=MO:
                    line['check']=1
                    break
Any idea how to do something like this in Python?
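You should be able to invert your logic:
- Group by ID, year and month
- Do the check within each group
    def func(group):
        group = group.copy()
        group['check'] = 0  # default for every row
        if len(group) > 1:
            # another row shares this ID/year/month, so flag the test == 1 rows
            group.loc[group['test'] == 1, 'check'] = 1
        return group
    df = df.groupby(['ID', 'year', 'month'], group_keys=False).apply(func)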
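I think you can use transform to compute the size of each group, and then derive the result from it. It only takes two lines.
Here is my solution.
Create the test data: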
    import pandas as pd
    ID = [1111, 1111, 2222, 2222, 2222, 3333, 3333, 3333]
    year = [2017, 2017, 2017, 2017, 2016, 2017, 2017, 2015]
    month = [7, 9, 6, 6, 6, 3, 3, 8]
    test = [1, 0, 1, 0, 0, 1, 0, 0]
    df = pd.DataFrame({
        "ID": ID,
        "year": year,
        "month": month,
        "test": test
    })
Get the result:
    # number of rows sharing the same ID/year/month; selecting one column keeps the result 1-D
    df["group_count"] = df.groupby(["ID", "year", "month"])["test"].transform("size")
    df["check"] = ((df["test"] > 0) & (df["group_count"] > 1)).astype(int)
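As a variation on the group-size idea, `DataFrame.duplicated` with `keep=False` flags every row whose key occurs more than once, so no helper column is needed. A minimal sketch on the question's sample data, with var1-var300 left out (the name `has_twin` is made up for illustration):

```python
import pandas as pd

# Sample data copied from the question (var1-var300 omitted).
df = pd.DataFrame({
    "ID":    [1111, 1111, 2222, 2222, 2222, 3333, 3333, 3333],
    "year":  [2017, 2017, 2017, 2017, 2016, 2017, 2017, 2015],
    "month": [7, 9, 6, 6, 6, 3, 3, 8],
    "test":  [1, 0, 1, 0, 0, 1, 0, 0],
})

# True for every row whose (ID, year, month) appears more than once.
has_twin = df.duplicated(subset=["ID", "year", "month"], keep=False)

# check = 1 only for test == 1 rows that have at least one twin.
df["check"] = ((df["test"] == 1) & has_twin).astype(int)
print(df)
```

On this sample the `check` column matches the table in the question. Note that two rows with the same key and both with test == 1 would both be flagged, which seems to be what the question asks for.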
So you want a separate column indicating whether a row's ID, year and month match the ID, year and month of a row with test==1?
Iterate with iterrows():
    to_check = []
    for index, row in df.iterrows():
        if row['test'] == 1:  # in your pseudocode, you use single =; that's for assigning variables
            to_check.append([row['ID'], row['year'], row['month']])

    check = []
    for index, row in df.iterrows():
        # marks every row whose ID, year and month match a test == 1 row,
        # including the test == 1 row itself
        if [row['ID'], row['year'], row['month']] in to_check:
            check.append(1)
        else:
            check.append(0)

    df["check"] = check
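With 30,000 rows the two loops may be slow. Under the same reading of the question (mark every row whose ID, year and month match a test==1 row, including that row itself), a vectorized sketch with `merge` might look like this. The names `cols`, `keys` and `marked` are illustrative only, and the data is the question's sample without var1-var300:

```python
import pandas as pd

# Sample data copied from the question (var1-var300 omitted).
df = pd.DataFrame({
    "ID":    [1111, 1111, 2222, 2222, 2222, 3333, 3333, 3333],
    "year":  [2017, 2017, 2017, 2017, 2016, 2017, 2017, 2015],
    "month": [7, 9, 6, 6, 6, 3, 3, 8],
    "test":  [1, 0, 1, 0, 0, 1, 0, 0],
})

cols = ["ID", "year", "month"]

# Unique key combinations that occur on a test == 1 row, tagged with check = 1.
keys = df.loc[df["test"] == 1, cols].drop_duplicates().assign(check=1)

# A left merge keeps every row of df in its original order;
# rows without a matching key get NaN, which becomes 0.
marked = df.merge(keys, on=cols, how="left")
marked["check"] = marked["check"].fillna(0).astype(int)
print(marked)
```

This gives 1 on the test == 1 rows as well, so it differs from the table in the question, where only test == 1 rows with a twin elsewhere are flagged.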
You can make the following changes and try it:
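    df['check'] = 0
    for i, line in df.iterrows():  # iterrows() yields (index, row) pairs
        if line['test'] == 1:
            I = line['ID']
            Y = line['year']
            MO = line['month']
            for j, row in df.iterrows():
                # skip the row itself; look for the same ID/year/month elsewhere
                if j != i and row['ID'] == I and row['year'] == Y and row['month'] == MO:
                    df.at[i, 'check'] = 1  # write through df, not through the row copy
                    break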
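One thing worth checking when looping row by row: iterating over a DataFrame directly yields its column labels, not its rows, so a row-wise loop needs something like `iterrows()`. A small sketch on a made-up two-row frame (the names `labels` and `rows` are only for illustration):

```python
import pandas as pd

# Tiny made-up frame, just to show what each kind of iteration yields.
df = pd.DataFrame({"ID": [1111, 2222], "test": [1, 0]})

# Iterating over the DataFrame itself walks the column labels.
labels = [item for item in df]

# iterrows() yields (index, Series) pairs, one per row.
rows = [(idx, row["ID"], row["test"]) for idx, row in df.iterrows()]

# Assigning to the row Series may not update df, so write through df.at instead.
df["check"] = 0
df.at[0, "check"] = 1

print(labels)
print(rows)
print(df)
```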