I have a DataFrame as follows:
name country gender
wang ca 1
jay us 1
jay ca 0
jay ca 1
lisa en 0
lisa us 1
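I want to assign gender based on the country code "us". If the same name appears more than once, all of its genders should match the gender on the row whose code is "us". For names that are not duplicated, the row should be returned unchanged.
The expected result is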
name code gender
wang ca 1
jay us 1
lisa us 1
I am using
df.groupby(['name', 'country'])['gender'].transform()
Any suggestions on how to solve this?
# Get country and gender in separate lists for a name
a = df.groupby('name')['country'].apply(list).reset_index(name='country_list')
b = df.groupby('name')['gender'].apply(list).reset_index(name='gender_list')
# Merge
df2 = a.merge(b, on='name', how='left')
# Using apply get final required values
def get_val(x):
    cl, gl = x
    # default to the first row of the name; prefer the 'us' row if one exists
    final = [cl[0], gl[0]]
    for c, g in zip(cl, gl):
        if c == 'us':
            final = [c, g]
            break
    return final
df2['final_col'] = df2[['country_list', 'gender_list']].apply(get_val, axis=1)
df2['code'] = df2['final_col'].apply(lambda l: l[0])
df2['gender'] = df2['final_col'].apply(lambda l: l[1])
print(df2)
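The same one-row-per-name result can probably be reached without building intermediate lists. The following is only a minimal sketch, not part of the answer above. It assumes the sample data from the question and that a name has at most one "us" row. The helper column `is_us` is a made-up name used for illustration.

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "name": ["wang", "jay", "jay", "jay", "lisa", "lisa"],
    "country": ["ca", "us", "ca", "ca", "en", "us"],
    "gender": [1, 1, 0, 1, 0, 1],
})

result = (
    df.assign(is_us=df["country"].eq("us"))   # hypothetical helper column
      # put 'us' rows first; mergesort is stable, so ties keep their original order
      .sort_values("is_us", ascending=False, kind="mergesort")
      .drop_duplicates("name")                # keep the first row seen for each name
      .sort_index()                           # restore the original row order
      .drop(columns="is_us")
      .rename(columns={"country": "code"})
      .reset_index(drop=True)
)
print(result)
```

For a name without any "us" row, this keeps the first row of that name, which matches the fallback in `get_val`.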
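My approach is to use merge() first and then do a conditional replacement with np.where().
It is somewhat involved, but it follows the conditions you described rather than reproducing your sample output: every row is kept, and the gender on non-"us" rows is overwritten with the "us" gender for that name.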
import io
import numpy as np
import pandas as pd
df = pd.read_csv(io.StringIO("""name country gender
wang ca 1
jay us 1
jay ca 0
jay ca 1
lisa en 0
lisa us 1"""), sep=r"\s+")
# use "us" as basis for lookup. left merge on name only
df2 = (df.merge(df.query("country=='us'"),
                on=["name"], how="left", suffixes=("", "_new"))
       # replace only where it's not "us" and "us" has a different value
       .assign(gender=lambda x: np.where((x["country"]!="us")&
                                         (x["gender"]!=x["gender_new"])&
                                         ~(x["gender_new"].isna())
                                         # force type casting so it doesn't become float64 because of NaN
                                         , x["gender_new"].fillna(-1).astype("int64"),
                                         x["gender"]))
       # remove columns inserted by merge...
       .drop(columns=["country_new", "gender_new"])
       )
print(df2)
Output
name country gender
wang ca 1
jay us 1
jay ca 1
jay ca 1
lisa en 1
lisa us 1
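For comparison, the same row-preserving replacement might also be written as a lookup by name. This is only a sketch under an assumption that is not stated in the question: each name has at most one "us" row (otherwise `Series.map` would fail on the duplicated index). The names `us_gender` and `out` are made up for illustration.

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "name": ["wang", "jay", "jay", "jay", "lisa", "lisa"],
    "country": ["ca", "us", "ca", "ca", "en", "us"],
    "gender": [1, 1, 0, 1, 0, 1],
})

# Gender recorded on each name's 'us' row (assumes at most one such row per name)
us_gender = df.loc[df["country"].eq("us")].set_index("name")["gender"]

# Look up the 'us' gender by name; names without a 'us' row keep their own gender.
# The cast back to int64 is needed because the unmatched lookups introduce NaN.
out = df.assign(
    gender=df["name"].map(us_gender).fillna(df["gender"]).astype("int64")
)
print(out)
```

This avoids the temporary `_new` columns created by the merge, at the cost of relying on the one-"us"-row-per-name assumption.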