Extracting a specific word with regex in Pandas



I am trying to extract the country names from the following DataFrame:

country
0   NaN
1   Country: America
2   Country: France ...More CountriesFranceNorwayP...
3   NaN
4   Country: India

using the following regular expression:

import re
regex = re.compile(
    r"Country: (?P<country>\w+)"
)
df['country'] = df['country'].str.extractall(regex).droplevel(1)

But it returns

country
0   NaN
1   NaN
2   NaN
3   NaN
4   NaN

instead of

country
0   NaN
1   America
2   France
3   NaN
4   India

What am I missing?

Please advise.

You can use extract, which returns the first match for each row and keeps the original index:

df['country'] = df['country'].str.extract(r'Country:\s*(\w+)')

Pandas test:

import pandas as pd
import numpy as np
df = pd.DataFrame({'country' : [np.nan, 'Country: America', 'Country France ... More countries...']})
df['country'].str.extract(r'Country:\s*(\w+)')
#          0
# 0      NaN
# 1  America
# 2      NaN
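
To make the difference between extract and extractall concrete, here is a minimal sketch. It assumes sample data modeled on the question (the long row is shortened) and reuses the named group from the question's pattern; expand=False is an optional choice added here so that the result is a Series rather than a one-column DataFrame.

```python
import numpy as np
import pandas as pd

# Sample data modeled on the question; the long row is shortened here.
df = pd.DataFrame({
    "country": [
        np.nan,
        "Country: America",
        "Country: France ... More countries...",
        np.nan,
        "Country: India",
    ]
})

pattern = r"Country:\s*(?P<country>\w+)"

# extract: first match per row, aligned with the input index.
# With a single capture group, expand=False yields a Series.
first = df["country"].str.extract(pattern, expand=False)

# extractall: one row per match, indexed by (original index, match number).
# Rows without any match do not appear in the result at all.
every = df["country"].str.extractall(pattern)

print(first)
print(every.index.tolist())
```

Because extract preserves the index, its result can be assigned straight back to the column without droplevel.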

You can also avoid regex altogether and use Series.str.split:

In [86]: df = pd.DataFrame({'country' : [np.nan, 'Country: America', 'Country: France ... More countries...', np.nan, 'Country: India']})
In [87]: df
Out[87]: 
country
0                                    NaN
1                       Country: America
2  Country: France ... More countries...
3                                    NaN
4                         Country: India
In [94]: df.country.str.split(':').str[1].str.split().str[0]
Out[94]: 
0        NaN
1    America
2     France
3        NaN
4      India
Name: country, dtype: object
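
The split chain can also be written step by step, which shows what happens to rows that have no colon. This is only a sketch: the Series s and its values are made up for illustration, and n=1 is an optional addition that limits the split to the first colon.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: one missing value, one row without a colon,
# and one row with extra text after the country name.
s = pd.Series([np.nan, "Country: America", "Country France", "Country: India now"])

# Split once on the first colon. Rows without a colon produce a
# one-element list, so .str[1] gives NaN for them.
after_colon = s.str.split(":", n=1).str[1]

# Split the remainder on whitespace and keep the first token.
names = after_colon.str.split().str[0]

print(names)
```

Rows that are missing or lack the "Country:" prefix simply come out as NaN, which matches the behavior of the regex version.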
