This is not the best approach, but here is what I have so far.
I have this example df:
import pandas as pd

df = pd.DataFrame({
    'City': ['I lived Los Angeles', 'I visited London and Toronto', 'the best one is Toronto', 'business hub is in New York', ' Mexico city is stunning']
})
df
which gives:
City
0 I lived Los Angeles
1 I visited London and Toronto
2 the best one is Toronto
3 business hub is in New York
4 Mexico city is stunning
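I am trying to match city names (case-insensitively) against a nested dict and create a new column per country, named after the country and holding an int value, for statistical purposes.
Here is my nested dict, used as the reference for countries and cities: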
country = { 'US': ['New York','Los Angeles','San Diego'],
'CA': ['Montreal','Toronto','Manitoba'],
'UK': ['London','Liverpool','Manchester']
}
I created a function that should look up the cities in df, match them against the dict, and then create a column named after the country:
def get_country(x):
    count = 0
    for k,v in country.items():
        for y in v:
            if y.lower() in x:
                df[k] = count + 1
            else:
                return None
Then I apply it to df:
df.City.apply(lambda x: get_country(x.lower()))
which produces the following output:
City US
0 I lived Los Angeles 1
1 I visited London and Toronto 1
2 the best one is Toronto 1
3 business hub is in New York 1
4 Mexico city is stunning 1
Expected output:
City US CA UK
0 I lived Los Angeles 1 0 0
1 I visited London and Toronto 0 1 1
2 the best one is Toronto 0 1 0
3 business hub is in New York 1 0 0
4 Mexico city is stunning 0 0 0
Here is a solution based on your function. I renamed the variables to make it easier to read and understand.
df = pd.DataFrame({
'City': ['I lived Los Angeles',
'I visited London and Toronto',
'the best one is Toronto',
'business hub is in New York',
' Mexico city is stunning']
})
country_cities = {
'US': ['New York','Los Angeles','San Diego'],
'CA': ['Montreal','Toronto','Manitoba'],
'UK': ['London','Liverpool','Manchester']
}
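def get_country(text):
    # Lowercase once so every comparison is case-insensitive
    text = text.lower()
    # One counter per country, all starting at 0
    country_counts = dict.fromkeys(country_cities, 0)
    for country, cities in country_cities.items():
        for city in cities:
            if city.lower() in text:
                country_counts[country] += 1
    # Returning a Series makes apply produce one column per country
    return pd.Series(country_counts)

df = df.join(df.City.apply(get_country))
Output: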
City US CA UK
0 I lived Los Angeles 1 0 0
1 I visited London and Toronto 0 1 1
2 the best one is Toronto 0 1 0
3 business hub is in New York 1 0 0
4 Mexico city is stunning 0 0 0
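Solution based on Series.str.count
A simpler solution is to use Series.str.count to count, for each country, the occurrences of the regex pattern city1|city2|etc (a pattern that matches city1 or city2 or etc). Using the same setup as above: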
country_patterns = {country: '|'.join(cities) for country, cities in country_cities.items()}
for country, pat in country_patterns.items():
    df[country] = df['City'].str.count(pat)
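Note that Series.str.count is case-sensitive by default and treats each pattern as a regular expression. Since the question asks for case-insensitive matching, the following is a minimal sketch of one way to handle that, assuming you pass flags=re.IGNORECASE and escape the city names with re.escape. The sample sentences and the names counts and result are only illustrative and are not part of the original post.

```python
import re
import pandas as pd

# Illustrative data with mixed casing (not from the original post)
df = pd.DataFrame({
    'City': ['i lived in los angeles',
             'I visited LONDON and Toronto',
             'Mexico city is stunning']
})

country_cities = {
    'US': ['New York', 'Los Angeles', 'San Diego'],
    'CA': ['Montreal', 'Toronto', 'Manitoba'],
    'UK': ['London', 'Liverpool', 'Manchester']
}

# Escape each city so regex metacharacters in a name are matched literally
country_patterns = {
    country: '|'.join(re.escape(city) for city in cities)
    for country, cities in country_cities.items()
}

# Count matches per country, ignoring case
counts = pd.DataFrame({
    country: df['City'].str.count(pat, flags=re.IGNORECASE)
    for country, pat in country_patterns.items()
})

result = df.join(counts)
```

Building the counts in a separate DataFrame and joining once keeps the original df unchanged, which may be convenient if you want to rerun the matching with a different dict.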
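Why doesn't your solution work?
if y.lower() in x:
    df[k] = count + 1
else:
    return None
Your function does not produce the right output because it returns None as soon as a city is not found in the text: the remaining cities and countries are never checked, since a return statement exits the function immediately.
What happens is that only US cities are ever checked, and the line df[k] = count + 1 (which always evaluates to 1, with k = 'US' here) creates an entire column named after k, filled with the value 1. It does not set a single value for the current row; it creates or overwrites the whole column. When using apply, you want to transform a single row or value (the function's input), so do not modify the main DataFrame directly inside the function.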
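Both problems can be reproduced in isolation. The snippet below is only a small sketch for illustration: the helpers first_check_only and mark_us, and the two-row df, are hypothetical and do not appear in the question.

```python
import pandas as pd

def first_check_only(text):
    # Same if/else shape as the question: the else branch returns on the
    # first city that is NOT in the text, so later cities are never tried.
    for city in ['new york', 'los angeles']:
        if city in text:
            return 'found'
        else:
            return None

# 'los angeles' is in the text, but the loop exits on 'new york' first
early_exit_result = first_check_only('i lived los angeles')

df = pd.DataFrame({'City': ['I lived Los Angeles', 'the best one is Toronto']})

def mark_us(text):
    # Assigning a scalar to df['US'] broadcasts it to every row,
    # no matter which row's text was passed in.
    if 'los angeles' in text.lower():
        df['US'] = 1

df['City'].apply(mark_us)
us_values = df['US'].tolist()
```

Here early_exit_result is None even though the text mentions Los Angeles, and us_values is 1 for both rows even though only the first row mentions a US city.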
You can use a lambda function to check whether each string contains any of a country's cities, after first lowercasing the city names in country:
cl = { k : list(map(str.lower, v)) for k, v in country.items() }
for ctry, cities in cl.items():
    df[ctry] = df['City'].apply(lambda s: any(c in s.lower() for c in cities)).astype(int)
Output:
City US CA UK
0 I lived Los Angeles 1 0 0
1 I visited London and Toronto 0 1 1
2 the best one is Toronto 0 1 0
3 business hub is in New York 1 0 0
4 Mexico city is stunning 0 0 0