Existing DataFrame:
Id status countries
01 pass ['xyx','Indonesia','brazil']
02 fail ['PQ','XT','sri lanka']
03 pass ['spain', 'india','xtx']
Expected DataFrame:
Id status countries filtered_countries_name
01 pass ['xyx','Indonesia','brazil'] 'Indonesia','brazil'
02 fail ['PQ','XT','sri lanka'] 'sri lanka'
03 pass ['spain', 'india','xtx'] 'spain', 'india'
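I have a master list of specific countries (the ones I want to check for), and I compare the list in each row of the countries column against it.
My approach: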
countries_list = ['china', 'india', 'united states', 'indonesia', 'brazil', 'pakistan', 'nigeria', 'bangladesh', 'russia', 'japan', 'mexico', 'philippines', 'vietnam', 'ethiopia', 'egypt', 'germany', 'iran', 'turkey', 'democratic republic of the congo', 'thailand', 'france', 'united kingdom', 'italy', 'burma', 'south africa', 'south korea', 'colombia', 'spain', 'ukraine', 'tanzania', 'kenya', 'argentina', 'algeria', 'poland', 'sudan', 'uganda','Indonesia','brazil','spain','sri lanka']
import re
countries_re = '|'.join(str(v) for v in countries_list )
df['filtered_countries_name'] = df['countries'].str.extractall(countries_re)
but it does not work; I get the following error:
TypeError: incompatible index of inserted column with frame index
Any leads?
If the column holds lists, use a list comprehension with a set as the lookup reference for efficiency:
S = set(countries_list)
df['filtered_countries_name'] = [[c for c in l if c.lower() in S]
for l in df['countries']]
Output:
Id status countries filtered_countries_name
0 1 pass [xyx, Indonesia, brazil] [Indonesia, brazil]
1 2 fail [PQ, XT, sri lanka] [sri lanka]
2 3 pass [spain, india, xtx] [spain, india]
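As for the error in the question, a likely cause is that str.extractall returns a DataFrame with one row per match and a MultiIndex of (original row, match number), which cannot be assigned directly as a single column. The sketch below illustrates this on a small Series of plain strings; the string data and the shortened pattern are assumptions made for illustration, since the real column holds lists rather than strings.

```python
import re
import pandas as pd

# Hypothetical string version of the column, for illustration only
s = pd.Series(["xyx Indonesia brazil", "PQ XT sri lanka", "spain india xtx"])

# extractall needs at least one capture group in the pattern
pattern = r"(indonesia|brazil|sri lanka|spain|india)"
matches = s.str.extractall(pattern, flags=re.IGNORECASE)

# One row per match, indexed by (original row, match number)
print(matches.index.names)

# Collapse back to one list per original row so it aligns with the frame
per_row = matches[0].groupby(level=0).agg(list)
print(per_row)
```

Rows without any match are absent from per_row, so reindexing against the original index may be needed before assigning it back.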
Using set intersection:
import pandas as pd

df = pd.DataFrame({'Id': {0: 1, 1: 2, 2: 3},
'status': {0: 'pass', 1: 'fail', 2: 'pass'},
'countries': {0: ['xyx', 'Indonesia', 'brazil'],
1: ['PQ', 'XT', 'sri lanka'],
2: ['spain', 'india', 'xtx']}})
countries_list = ['china', 'india', 'united states', 'indonesia', 'brazil', 'pakistan', 'nigeria', 'bangladesh', 'russia', 'japan', 'mexico', 'philippines', 'vietnam', 'ethiopia', 'egypt', 'germany', 'iran', 'turkey', 'democratic republic of the congo', 'thailand', 'france', 'united kingdom', 'italy', 'burma', 'south africa', 'south korea', 'colombia', 'spain', 'ukraine', 'tanzania', 'kenya', 'argentina', 'algeria', 'poland', 'sudan', 'uganda','Indonesia','brazil','spain','sri lanka']
df["filtered_names"] = df["countries"].apply(lambda x: list(set(x) & set(countries_list)))
df
# Id status countries filtered_names
# 0 1 pass [xyx, Indonesia, brazil] [Indonesia, brazil]
# 1 2 fail [PQ, XT, sri lanka] [sri lanka]
# 2 3 pass [spain, india, xtx] [india, spain]
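Two caveats with the intersection are worth keeping in mind: sets do not guarantee element order (note [india, spain] in the last row), and the match is case-sensitive, so 'Indonesia' is only found because the master list also contains that exact spelling. A minimal sketch of an order-preserving, case-insensitive variant follows; the shortened lowercase set `allowed`, the helper `keep_known`, and the column `filtered_joined` are hypothetical names introduced for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "countries": [
        ["xyx", "Indonesia", "brazil"],
        ["PQ", "XT", "sri lanka"],
        ["spain", "india", "xtx"],
    ]
})

# Shortened, lowercase-only stand-in for the master list (assumption)
allowed = {"indonesia", "brazil", "spain", "india", "sri lanka"}

# Plain intersection is case-sensitive: 'Indonesia' is missed here
case_sensitive = set(df["countries"][0]) & allowed

def keep_known(names):
    """Keep names found in `allowed`, ignoring case and preserving order."""
    return [n for n in names if n.lower() in allowed]

df["filtered_names"] = df["countries"].apply(keep_known)

# Optional: a comma-separated string, closer to the expected frame
df["filtered_joined"] = df["filtered_names"].apply(", ".join)
print(df)
```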