This is the DataFrame:
import pandas as pd
data = {"Company" : [["ConsenSys"] , ["Cognizant"], ["IBM"], ["IBM"], ["Reddit, Inc"], ["Reddit, Inc"], ["IBM"]],
"skills" : [['services', 'scientist technical expertise', 'databases'], ['datacomputing tools experience', 'deep learning models', 'cloud services'], ['quantitative analytical projects', 'financial services', 'field experience'],
['filesystems server architectures', 'systems', 'statistical analysis', 'data analytics', 'workflows', 'aws cloud services'], ['aws services'], ['data mining statistics', 'statistical analysis', 'aws cloud', 'services', 'data discovery', 'visualization'], ['communication skills experience', 'services', 'manufacturing environment', 'sox compliance']]}
dff = pd.DataFrame(data)
dff
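- I need to create a new column, which I want to fill by taking specific words from the skills column.
- Rows that do not contain these specific words should be dropped. Keywords: "services", "statistical analysis". Expected output: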
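   Company        skills                                                                                                            new_col
0  [ConsenSys]    [services, scientist technical expertise, databases]                                                              [services]
1  [IBM]          [filesystems server architectures, systems, statistical analysis, data analytics, workflows, aws cloud services]  [services, statistical analysis]
2  [Reddit, Inc]  [data mining statistics, statistical analysis, aws cloud, services, data discovery, visualization]                [statistical analysis]
3  [IBM]          [communication skills experience, services, manufacturing environment, sox compliance]                            [services]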
You can use a lambda with a list comprehension:
words = ["services", "statistical analysis"]
dff["found"] = dff["skills"].apply(lambda x: ", ".join(set([i for i in x if i in words])).split(", "))
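One thing to watch with the line above: the join/split round trip turns a row with no match into [''] rather than an empty list, and going through a set does not guarantee the order of the matches. A minimal sketch of a variant that returns a plain list and then drops the unmatched rows, using a small stand-in frame (the shortened data and the name `result` are only for illustration):

```python
import pandas as pd

# Small stand-in frame; the full `dff` from the question behaves the same way.
dff = pd.DataFrame({
    "Company": [["ConsenSys"], ["Cognizant"], ["IBM"]],
    "skills": [
        ["services", "databases"],
        ["deep learning models", "cloud services"],
        ["statistical analysis", "services", "systems"],
    ],
})

words = ["services", "statistical analysis"]

# Keep matches in keyword order; a row with no exact match gets an empty list.
dff["found"] = dff["skills"].apply(lambda x: [w for w in words if w in x])

# An empty list is falsy, so map(bool) flags the rows with at least one match.
result = dff[dff["found"].map(bool)]
print(result)
```

Note that `w in x` tests for an exact list element, so "cloud services" does not count as "services" here.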
word = ['services', 'statistical analysis']
s1 = dff['skills'].apply(lambda x: [i for i in word if i in x])
Output (s1):
0 [services]
1 []
2 []
3 [statistical analysis]
4 []
5 [services, statistical analysis]
6 [services]
Name: skills, dtype: object
Assign s1 as new_col and filter with boolean indexing:
dff.assign(new_col=s1)[lambda x: x['new_col'].astype('bool')]
Result:
Company skills new_col
0 [ConsenSys] [services, scientist technical expertise, data... [services]
3 [IBM] [filesystems server architectures, systems, st... [statistical analysis]
5 [Reddit, Inc] [data mining statistics, statistical analysis,... [services, statistical analysis]
6 [IBM] [communication skills experience, services, ma... [services]
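The filtered frame keeps the original row labels (0, 3, 5, 6), while the expected output in the question is numbered 0 to 3. A sketch of the whole pipeline with `reset_index(drop=True)` chained on the end, assuming exact element matching is what is wanted (so "aws cloud services" does not count as "services"); the name `out` is only for illustration:

```python
import pandas as pd

data = {
    "Company": [["ConsenSys"], ["Cognizant"], ["IBM"], ["IBM"],
                ["Reddit, Inc"], ["Reddit, Inc"], ["IBM"]],
    "skills": [
        ["services", "scientist technical expertise", "databases"],
        ["datacomputing tools experience", "deep learning models", "cloud services"],
        ["quantitative analytical projects", "financial services", "field experience"],
        ["filesystems server architectures", "systems", "statistical analysis",
         "data analytics", "workflows", "aws cloud services"],
        ["aws services"],
        ["data mining statistics", "statistical analysis", "aws cloud",
         "services", "data discovery", "visualization"],
        ["communication skills experience", "services",
         "manufacturing environment", "sox compliance"],
    ],
}
dff = pd.DataFrame(data)
word = ["services", "statistical analysis"]

# Per-row list of the keywords that appear as exact elements of `skills`.
s1 = dff["skills"].apply(lambda x: [i for i in word if i in x])

out = (
    dff.assign(new_col=s1)
       .loc[lambda x: x["new_col"].astype(bool)]  # drop rows with an empty list
       .reset_index(drop=True)                    # renumber the rows from 0
)
print(out)
```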
I think you should make a simpler example.