How to keep only the rows that contain any value from a given list of tags

DataFrame

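I have a DataFrame that contains user ids and the tags associated with each user. What is the best way to keep only the rows whose tags include at least one tag from this list?

data_science = ["python", "r", "matlab", "sas", "excel", "sql"]

I tried the code below in pandas. It does filter to some extent, but it also returns tags that merely resemble the ones in the list. For example, for sql it also returns sql-server. Can you suggest a better approach?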

df_ds = df_combo[df_combo["Tag"].astype(str).str.contains('(python|excel|sql|matlab)', regex=True)]
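
To make the problem concrete, here is a minimal sketch with made-up data. The sample rows, the `user_id` column name, and the comma-separated tag format are assumptions, not the real data. It shows that `str.contains` does substring matching, so a row tagged only `sql-server` still matches the `sql` alternative.

```python
import pandas as pd

# Hypothetical sample data; the comma-separated "Tag" format is an assumption.
df_combo = pd.DataFrame({
    "user_id": [1, 2, 3],
    "Tag": ["python, pandas", "sql-server", "java"],
})

# A non-capturing group avoids pandas' warning about match groups in the pattern.
mask = df_combo["Tag"].astype(str).str.contains(r"(?:python|excel|sql|matlab)", regex=True)

matched_ids = df_combo.loc[mask, "user_id"].tolist()
print(matched_ids)  # user 2 is included even though its only tag is "sql-server"
```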

Here is an approach that I think is simpler, although somewhat verbose:

# create a set with the queried tags
tags = {'python', 'r', 'matlab', 'sas', 'excel', 'sql'}
# create an auxiliary column holding each row's tags as a set of stripped strings
# (assumes the tags in 'Tag' are separated by commas)
df_combo['Tag-set'] = df_combo['Tag'].str.split(',').apply(lambda x: {e.strip() for e in x})
# keep only the rows whose tag set shares at least one element with the queried tags
df_fd = df_combo[df_combo['Tag-set'].apply(lambda x: len(x & tags) > 0)]

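The idea is to use split and strip to turn each tag string into clean, individual tags, and then keep only the rows whose tag set has at least one element in common with the queried tags. Because whole tags are compared rather than substrings, sql no longer matches sql-server.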
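
The set-based idea can be tried end to end on a small example. This is only a sketch under stated assumptions: the sample rows, the `user_id` column name, and the helper `has_queried_tag` are hypothetical, and the tags are assumed to be comma-separated strings.

```python
import pandas as pd

# Hypothetical sample data; comma-separated tags in "Tag" is an assumption.
df_combo = pd.DataFrame({
    "user_id": [1, 2, 3, 4],
    "Tag": ["python, pandas", "sql-server", "java, sql", " r ,excel"],
})

tags = {"python", "r", "matlab", "sas", "excel", "sql"}


def has_queried_tag(tag_string: str) -> bool:
    """Return True if any exact, whitespace-stripped tag is in the queried set."""
    row_tags = {t.strip() for t in tag_string.split(",")}
    # isdisjoint is True when the two sets share no element
    return not row_tags.isdisjoint(tags)


# Keep only rows with at least one exact tag match.
df_fd = df_combo[df_combo["Tag"].apply(has_queried_tag)]

kept_ids = df_fd["user_id"].tolist()
print(kept_ids)
```

This variant skips the auxiliary column and builds each set on the fly, which may be preferable when the set column is not needed afterwards. The row tagged only `sql-server` is dropped because the comparison is on whole tags.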
