I am having trouble filtering this CSV file.
Here are some entries from the CSV table:
Name Info Bio
Alice Woman: 21y (USA) Actress
Breonna Woman: (France) Singer
Carla Woman: 30y (Trinidad and Tobago) Actress
Diana Woman: (USA) Singer
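I am trying to filter the "Info" column to get a list of all countries and their frequencies. I am also trying to do the same with age. As you can see, not all of the women list their age.
I tried:
women = pd.read_csv('women.csv')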
women_count = pd.Series(' '.join(women.Info).split()).value_counts()
However, this splits everything apart and outputs:
Woman: 4
(USA) 2
21y 1
(Trinidad 1
and 1
Tobago) 1
30y 1
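I should add that I also tried women_filtered = women[women['Info'] == '(USA)'], but it doesn't work.
My questions are:
- How do I split the string so I can filter by country, especially since all the countries are in parentheses?
- How do I filter for entries that have no age?
Thanks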
import pandas as pd
df = pd.DataFrame(
{'Name':['Alice', 'Breonna', 'Carla', 'Diana'],
'Info':['Woman: 21y (USA)', 'Woman: (France)', 'Woman: 30y (Trinidad and Tobago)', 'Woman: (USA)'],
'Bio':['Actress', 'Singer', 'Actress', 'Singer']}
)
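# define new columns using regex
# country: the text between a literal '(' and ')'
df['country'] = df['Info'].str.extract(r'\(([^)]+)\)', expand=False)
# age: two digits followed by 'y', surrounded by whitespace; NaN when absent
df['age'] = df['Info'].str.extract(r'\s+(\d{2})y\s+', expand=False).astype(float)
# noage: 1 if no age was found, 0 otherwise
df['noage'] = df['age'].isnull().astype(int)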
# frequency of countries
sizes = df.groupby('country').size()
sizes
This outputs the frequencies:
country
France 1
Trinidad and Tobago 1
USA 2
dtype: int64
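The question also asks for the same kind of frequency count for ages. A minimal sketch of that, assuming the same sample data; the names `age` and `age_counts` and the simplified pattern `(\d+)y` are illustrative choices, not part of the answer above:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Breonna', 'Carla', 'Diana'],
    'Info': ['Woman: 21y (USA)', 'Woman: (France)',
             'Woman: 30y (Trinidad and Tobago)', 'Woman: (USA)'],
})

# Capture the digits that come right before a literal 'y'.
# Rows without an age produce no match and end up as NaN.
age = pd.to_numeric(df['Info'].str.extract(r'(\d+)y', expand=False))

# dropna=False keeps the missing ages as their own bucket in the counts.
age_counts = age.value_counts(dropna=False)
print(age_counts)
```

Passing `dropna=False` is optional; without it, the rows that have no age are simply left out of the counts.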
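I would recommend looking up how to write regex expressions, so that you can learn to extract information from strings yourself. Pythex.org is a good site for trying out regex expressions in Python, and it provides some useful hints.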
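To try the patterns outside pandas, here is a small sketch using Python's built-in `re` module on a single string. The sample string and the variable names are only for illustration, and `(\d+)y` is a simplified age pattern assumed here:

```python
import re

info = 'Woman: 30y (Trinidad and Tobago)'

# \( and \) match literal parentheses; ([^)]+) captures everything between them.
country = re.search(r'\(([^)]+)\)', info).group(1)

# (\d+) captures one or more digits; the trailing 'y' ties the digits to the age.
age_match = re.search(r'(\d+)y', info)
age = int(age_match.group(1)) if age_match else None

# A string without an age produces no match at all (re.search returns None).
no_age = re.search(r'(\d+)y', 'Woman: (France)')

print(country, age, no_age)
```

The parentheses have to be escaped with a backslash because an unescaped `(` starts a capture group instead of matching the character itself.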
print(df)
Name Info Bio
0 Alice Woman: 21y (USA) Actress
1 Carla 30y (Trinidad and Tobago) Singer
2 Breonna Woman: (France) Actress
3 Diana Woman: (USA) Singer
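# Solution
# Extract the age (digits followed by a non-digit) and the country (text inside the parentheses)
df = df.assign(Age=df.Info.str.extract(r'(\d+(?=\D))', expand=False), Countries=df.Info.str.extract(r'\((.*?)\)', expand=False))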
Name Info Bio Age Countries
0 Alice Woman: 21y (USA) Actress 21 USA
1 Carla 30y (Trinidad and Tobago) Singer 30 Trinidad and Tobago
2 Breonna Woman: (France) Actress NaN France
3 Diana Woman: (USA) Singer NaN USA
#Filter without Age
df[df.Age.isna()]
Name Info Bio Age Countries
2 Breonna Woman: (France) Actress NaN France
3 Diana Woman: (USA) Singer NaN USA
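The same idea covers filtering by country. A rough sketch, assuming the sample data from the question (the names `usa`, `usa_direct`, and `with_age` are hypothetical). The attempt `women['Info'] == '(USA)'` matches nothing because `==` compares the whole cell, and no cell consists of `(USA)` alone:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Breonna', 'Carla', 'Diana'],
    'Info': ['Woman: 21y (USA)', 'Woman: (France)',
             'Woman: 30y (Trinidad and Tobago)', 'Woman: (USA)'],
    'Bio': ['Actress', 'Singer', 'Actress', 'Singer'],
})

# Extract the country once, then compare against the clean value.
df['Countries'] = df['Info'].str.extract(r'\((.*?)\)', expand=False)
usa = df[df['Countries'] == 'USA']

# Alternative without a new column: a plain substring test.
# regex=False makes the parentheses literal characters.
usa_direct = df[df['Info'].str.contains('(USA)', regex=False)]

# The complement of the "no age" filter: rows that do list an age.
with_age = df[df['Info'].str.contains(r'\d+y')]

print(usa[['Name', 'Countries']])
```

Extracting the country into its own column is usually the more convenient option if you also need the frequencies, since `value_counts()` or `groupby` can then run on that column directly.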