Keyword categorization from strings in a new pandas column



This is probably not the best approach, but here is what I have done so far.

I have this example df:

import pandas as pd

df = pd.DataFrame({
    'City': ['I lived Los Angeles', 'I visited London and Toronto', 'the best one is Toronto', 'business hub is in New York', ' Mexico city is stunning']
})
df

which gives:

    City
0   I lived Los Angeles
1   I visited London and Toronto
2   the best one is Toronto
3   business hub is in New York
4   Mexico city is stunning

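I am trying to match city names (case-insensitively) against a nested dict and create new columns, one per country name, holding an int value for statistical purposes.

Here is my nested dict, which maps countries to cities: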

country = {
    'US': ['New York', 'Los Angeles', 'San Diego'],
    'CA': ['Montreal', 'Toronto', 'Manitoba'],
    'UK': ['London', 'Liverpool', 'Manchester']
}

I wrote a function that is supposed to look up the cities in the df, match them against the dict, and then create a column named after the country:

def get_country(x):
    count = 0
    for k,v in country.items():
        for y in v:
            if y.lower() in x:
                df[k] = count + 1
            else:
                return None

I then applied it to the df:

df.City.apply(lambda x: get_country(x.lower()))

and got the following output:

    City                              US
0   I lived Los Angeles               1
1   I visited London and Toronto      1
2   the best one is Toronto           1
3   business hub is in New York       1
4   Mexico city is stunning           1

Expected output:

    City                              US    CA   UK
0   I lived Los Angeles               1      0    0
1   I visited London and Toronto      0      1    1
2   the best one is Toronto           0      1    0
3   business hub is in New York       1      0    0
4   Mexico city is stunning           0      0    0

Here is a solution based on your function. I renamed the variables to make it easier to read and understand.

import pandas as pd

df = pd.DataFrame({
    'City': ['I lived Los Angeles',
             'I visited London and Toronto',
             'the best one is Toronto',
             'business hub is in New York',
             ' Mexico city is stunning']
})

country_cities = {
    'US': ['New York', 'Los Angeles', 'San Diego'],
    'CA': ['Montreal', 'Toronto', 'Manitoba'],
    'UK': ['London', 'Liverpool', 'Manchester']
}

def get_country(text):
    text = text.lower()
    # Start every country at 0 so each row gets a value for every column
    country_counts = dict.fromkeys(country_cities, 0)

    for country, cities in country_cities.items():
        for city in cities:
            if city.lower() in text:
                country_counts[country] += 1

    # Returning a Series makes apply produce one column per country
    return pd.Series(country_counts)

df = df.join(df.City.apply(get_country))

Output:

City  US  CA  UK
0           I lived Los Angeles   1   0   0
1  I visited London and Toronto   0   1   1
2       the best one is Toronto   0   1   0
3   business hub is in New York   1   0   0
4       Mexico city is stunning   0   0   0

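Solution based on Series.str.count

A simpler solution is to use Series.str.count to count, for each country, the occurrences of the regex pattern city1|city2|etc (the pattern matches city1 or city2 or etc). Using the same setup as above: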

country_patterns = {country: '|'.join(cities) for country, cities in country_cities.items()}
for country, pat in country_patterns.items():
    df[country] = df['City'].str.count(pat)
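One caveat: Series.str.count is case-sensitive by default, while the question asks for case-insensitive matching. A minimal sketch of one way to handle that, assuming it is acceptable to pass flags=re.IGNORECASE and to escape the city names with re.escape (the mixed-case sample sentence is made up for illustration and is not from the question):

```python
import re

import pandas as pd

df = pd.DataFrame({
    'City': ['I lived Los Angeles',
             'I visited london and TORONTO',  # mixed case on purpose (illustrative)
             'business hub is in New York',
             ' Mexico city is stunning']
})

country_cities = {
    'US': ['New York', 'Los Angeles', 'San Diego'],
    'CA': ['Montreal', 'Toronto', 'Manitoba'],
    'UK': ['London', 'Liverpool', 'Manchester'],
}

for country, cities in country_cities.items():
    # re.escape guards against city names that contain regex metacharacters
    pattern = '|'.join(re.escape(city) for city in cities)
    # flags=re.IGNORECASE makes the count ignore letter case
    df[country] = df['City'].str.count(pattern, flags=re.IGNORECASE)

print(df)
```

Without the flag, the second row would get 0 for both CA and UK, because 'london' and 'TORONTO' do not match the capitalization used in the dict.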

Why doesn't your solution work?

if y.lower() in x:
    df[k] = count + 1
else:
    return None

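Your function does not produce the correct output because it returns None as soon as a city is not found in the text: the remaining cities and countries are never checked, because the return statement exits the function immediately.

As a result, only US cities are ever checked, and the line df[k] = 1 (here k = 'US') creates an entire column named k filled with the value 1. It does not set a single value for the current row; it creates or overwrites the whole column. When using apply, you want to compute a result for a single row or value (the function's input), so do not modify the main DataFrame directly inside the function.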
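The whole-column effect is easy to reproduce in isolation. The sketch below is only an illustration: mark_us, has_new_york and the US_flag column are hypothetical names, not part of the question's code. It contrasts assigning a scalar to df['US'] inside the applied function with returning a value per row:

```python
import pandas as pd

df = pd.DataFrame({'City': ['I lived Los Angeles', 'business hub is in New York']})

def mark_us(text):
    # Assigning a scalar to df['US'] fills EVERY row of the column,
    # not just the row currently being processed
    if 'new york' in text.lower():
        df['US'] = 1

df['City'].apply(mark_us)
print(df['US'].tolist())       # both rows end up with 1

def has_new_york(text):
    # Return a value for this row only and let apply assemble the column
    return int('new york' in text.lower())

df['US_flag'] = df['City'].apply(has_new_york)
print(df['US_flag'].tolist())  # only the second row is 1
```

Only the second sentence mentions New York, yet the first approach marks both rows, which is the same symptom as the column of 1s in the question's output.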

You can use a lambda function to check whether the string contains any of each country's cities, after first lower-casing the city names in country:

cl = {k: list(map(str.lower, v)) for k, v in country.items()}
for ctry, cities in cl.items():
    df[ctry] = df['City'].apply(lambda s: any(c in s.lower() for c in cities)).astype(int)

Output:

City  US  CA  UK
0           I lived Los Angeles   1   0   0
1  I visited London and Toronto   0   1   1
2       the best one is Toronto   0   1   0
3   business hub is in New York   1   0   0
4       Mexico city is stunning   0   0   0
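Note that the any(...) approach yields a 0/1 flag per country, whereas Series.str.count yields the number of matches. On the sample data the two agree, but they differ once a sentence mentions two cities of the same country. A rough sketch of the difference, using a made-up sentence and the illustrative names texts, us_flag and us_count (none of which appear in the original posts):

```python
import re

import pandas as pd

# Hypothetical input: the first sentence mentions two US cities
texts = pd.Series(['I flew from New York to Los Angeles',
                   'the best one is Toronto'])

us_pattern = '|'.join(map(re.escape, ['New York', 'Los Angeles', 'San Diego']))

# Presence flag: 1 if at least one US city appears, else 0
us_flag = texts.str.contains(us_pattern, case=False).astype(int)

# Match count: number of US city mentions in each sentence
us_count = texts.str.count(us_pattern, flags=re.IGNORECASE)

print(us_flag.tolist())
print(us_count.tolist())
```

Which one fits depends on whether the statistics should record that a country was mentioned at all or how many times its cities were mentioned.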
