Return gender by country from my data frame



I have a data frame like this:

name   country    gender
wang    ca         1
jay     us         1
jay     ca         0
jay     ca         1
lisa    en         0
lisa    us         1

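I want to assign gender based on the country code "us". If a name appears more than once, all of its rows should take the gender given on that name's "us" row. For names that are not duplicated, the row is returned unchanged.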

The result should be:

name    code   gender
wang     ca     1
jay      us     1
lisa     us     1

I am using

df.groupby(['name', 'country'])['gender'].transform()

Any suggestions on how to solve this?

# Collect the countries and genders for each name into separate lists
a = df.groupby('name')['country'].apply(list).reset_index(name='country_list')
b = df.groupby('name')['gender'].apply(list).reset_index(name='gender_list')
# Merge the two list columns on name
df2 = a.merge(b, on='name', how='left')
# Pick the [country, gender] pair to keep for each name
def get_val(x):
    cl, gl = x
    # Default to the first row for the name
    final = [cl[0], gl[0]]
    for c, g in zip(cl, gl):
        if c == 'us':
            # Prefer the 'us' row when the name has one
            final = [c, g]
            break
    return final
df2['final_col'] = df2[['country_list', 'gender_list']].apply(get_val, axis=1)
df2['code'] = df2['final_col'].apply(lambda l: l[0])
df2['gender'] = df2['final_col'].apply(lambda l: l[1])
print(df2)
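The one-row-per-name result asked for in the question can also be sketched without building lists. This is only a minimal sketch under a few assumptions: the frame is rebuilt here from the question's sample data, the helper column `_not_us` is a hypothetical name, and when a name has several "us" rows the first one is kept. The idea is to sort the "us" rows to the front with a stable sort and then keep the first row for each name.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "name": ["wang", "jay", "jay", "jay", "lisa", "lisa"],
    "country": ["ca", "us", "ca", "ca", "en", "us"],
    "gender": [1, 1, 0, 1, 0, 1],
})

result = (
    # False sorts before True, so "us" rows move to the front;
    # mergesort is stable, so ties keep their original order
    df.assign(_not_us=df["country"].ne("us"))
      .sort_values("_not_us", kind="mergesort")
      # keep the first row per name: the "us" row if there is one
      .drop_duplicates("name")
      .drop(columns="_not_us")
      # restore the original row order, then tidy up the labels
      .sort_index()
      .rename(columns={"country": "code"})
      .reset_index(drop=True)
)
print(result)
```

A name with no "us" row simply keeps its first row, which matches the rule for non-duplicated names such as wang.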

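The approach I use is a merge() followed by a conditional replace (np.where()).

It is a little more involved, but it works from the stated conditions rather than only from your sample data.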

import io
import numpy as np
import pandas as pd
df = pd.read_csv(io.StringIO("""name   country    gender
wang    ca         1
jay     us         1
jay     ca         0
jay     ca         1
lisa    en         0
lisa    us         1"""), sep=r"\s+")
# use "us" as basis for lookup. left merge on name only
df2 = (df.merge(df.query("country=='us'"),
                on=["name"], how="left", suffixes=("", "_new"))
       # replace only where it's not "us" and "us" has a different value
       .assign(gender=lambda x: np.where((x["country"]!="us")&
                                         (x["gender"]!=x["gender_new"])&
                                         ~(x["gender_new"].isna()),
                                         # force type casting so it doesn't become float64 because of NaN
                                         x["gender_new"].fillna(-1).astype("int64"),
                                         x["gender"]))
       # remove columns inserted by merge...
       .drop(columns=["country_new", "gender_new"])
      )

Output

name country  gender
wang      ca       1
jay      us       1
jay      ca       1
jay      ca       1
lisa      en       1
lisa      us       1
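The same all-rows output can be reached without a merge by looking up each name's "us" gender directly. This is a hedged sketch rather than the answer's own method: it assumes the question's sample data, uses the hypothetical names `us_gender` and `out`, and takes the first "us" row when a name has more than one.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "name": ["wang", "jay", "jay", "jay", "lisa", "lisa"],
    "country": ["ca", "us", "ca", "ca", "en", "us"],
    "gender": [1, 1, 0, 1, 0, 1],
})

# One gender per name, taken from that name's "us" row
us_gender = (
    df.loc[df["country"].eq("us")]
      .drop_duplicates("name")
      .set_index("name")["gender"]
)

# Look up every row's name; names without a "us" row give NaN
# and fall back to the gender they already have
out = df.assign(
    gender=df["name"].map(us_gender).fillna(df["gender"]).astype("int64")
)
print(out)
```

Because `map` yields NaN for names that have no "us" row, the `fillna` step is what leaves rows such as wang's untouched, and the final cast keeps the column as integers.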
