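I have two Excel sheets. One of them has four different categories, each listing keywords. I am using Python to find the keywords in the review data and match them to a category. I tried using pandas and DataFrames for the comparison, but I get errors like "DataFrame objects are mutable, thus they cannot be hashed". I am not sure whether there is a better approach, and I am new to pandas.
Here is an example: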
Categories sheet
Service | Experience |
---|---|
fast | bad |
slow | easy |
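For experimenting without the Excel files, the sample data can be rebuilt in memory. This is only a sketch: the keyword and review values are taken from the outputs shown in this thread, and the frame names cat and reviews follow the names used in the answer. Real data would more likely come from pd.read_excel.

```python
import pandas as pd

# In-memory stand-ins for the two Excel sheets (assumed layout:
# one column per category, one keyword per row).
cat = pd.DataFrame({
    "Service": ["fast", "slow"],
    "Experience": ["bad", "easy"],
})

# Review data, with the columns seen in the printed results.
reviews = pd.DataFrame({
    "Review #": [1, 2],
    "Location": ["New York", "Texas"],
    "Review": [
        "The service was fast and easy!",
        "Overall it was a bad experience for me",
    ],
})

print(cat)
print(reviews)
```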
One approach is to build a regular expression from the cat frame, with one named group per category column:
exp = '|'.join([rf'(?P<{col}>{"|".join(cat[col].dropna())})' for col in cat])
(?P<Service>fast|slow)|(?P<Experience>bad|easy)
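If the keywords might contain regex metacharacters (for example "+" or "."), it may be safer to escape them first. Below is a minimal sketch of the same pattern construction with re.escape. The helper name build_pattern is hypothetical, and the sample cat frame is assumed from the pattern shown above. Column names still have to be valid Python identifiers to serve as group names.

```python
import re
import pandas as pd

# Assumed sample of the categories frame.
cat = pd.DataFrame({"Service": ["fast", "slow"], "Experience": ["bad", "easy"]})

def build_pattern(frame, cols=None):
    """Return one named group per column, alternating over its keywords."""
    cols = list(frame.columns) if cols is None else cols
    groups = []
    for col in cols:
        # re.escape keeps keywords with special characters literal.
        keywords = "|".join(re.escape(str(k)) for k in frame[col].dropna())
        groups.append(f"(?P<{col}>{keywords})")
    return "|".join(groups)

exp = build_pattern(cat)
print(exp)
```

Passing cols=["Service"] restricts the pattern to a single category.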
Alternatively, replace cat with a list of the columns to test:
cols = ['Service']
exp = '|'.join([rf'(?P<{col}>{"|".join(cat[col].dropna())})' for col in cols])
(?P<Service>fast|slow|quick)
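Then use str.extractall to pull out every match, aggregate the matches into one summary row per review, and join the result back onto the reviews frame.
Aggregating into lists: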
reviews = reviews.join(
    reviews['Review'].str.extractall(exp).groupby(level=0).agg(
        lambda g: list(g.dropna()))
)
Review # Location Review Service Experience
0 1 New York The service was fast and easy! [fast] [easy]
1 2 Texas Overall it was a bad experience for me [] [bad]
Aggregating into strings:
reviews = reviews.join(
    reviews['Review'].str.extractall(exp).groupby(level=0).agg(
        lambda g: ', '.join(g.dropna()))
)
Review # Location Review Service Experience
0 1 New York The service was fast and easy! fast easy
1 2 Texas Overall it was a bad experience for me bad
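One detail worth knowing: str.extractall only returns rows for reviews that contain at least one match, so after the join a review with no keyword at all ends up with NaN rather than an empty string. A minimal end-to-end sketch of the string aggregation, with a hypothetical third review added to show that case and an assumed fillna step to tidy it up:

```python
import pandas as pd

reviews = pd.DataFrame({
    "Review": [
        "The service was fast and easy!",
        "Overall it was a bad experience for me",
        "Nothing to report",  # hypothetical row containing no keyword
    ],
})
exp = r"(?P<Service>fast|slow)|(?P<Experience>bad|easy)"

# One row per match, indexed by (original row, match number).
matches = reviews["Review"].str.extractall(exp)

# Collapse back to one row per review, joining the keywords found per category.
summary = matches.groupby(level=0).agg(lambda g: ", ".join(g.dropna()))

# Reviews without any match are absent from summary; fill them with "".
result = reviews.join(summary).fillna({"Service": "", "Experience": ""})
print(result)
```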
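Or, for an existence test, use any grouped on index level 0:
reviews = reviews.join(
    # any(level=0) was removed in pandas 2.0; group on the original row index instead
    reviews['Review'].str.extractall(exp).notna().groupby(level=0).any()
)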
Review # Location Review Service Experience
0 1 New York The service was fast and easy! True True
1 2 Texas Overall it was a bad experience for me False True
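As far as I know, the level argument of any was deprecated in pandas 1.3 and removed in 2.0, so on recent versions the existence test has to go through groupby. The sketch below does that and also reindexes the flags so that reviews with no match come out as False rather than NaN. The third review is a hypothetical addition to demonstrate that case.

```python
import pandas as pd

reviews = pd.DataFrame({
    "Review": [
        "The service was fast and easy!",
        "Overall it was a bad experience for me",
        "Nothing to report",  # hypothetical row containing no keyword
    ],
})
exp = r"(?P<Service>fast|slow)|(?P<Experience>bad|easy)"

matches = reviews["Review"].str.extractall(exp)

# notna() marks which named group matched; any() collapses the
# matches of each original row into a single boolean per category.
flags = matches.notna().groupby(level=0).any()

# Rows without matches are missing from flags; restore them as False.
flags = flags.reindex(reviews.index, fill_value=False)

result = reviews.join(flags)
print(result)
```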
Or iterate over the columns and use str.contains:
cols = cat.columns
for col in cols:
    reviews[col] = reviews['Review'].str.contains('|'.join(cat[col].dropna()))
Review # Location Review Service Experience
0 1 New York The service was fast and easy! True True
1 2 Texas Overall it was a bad experience for me False True
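A plain substring search will also flag "fast" inside "breakfast" and will miss "FAST". If that matters, one possible variation (an assumption about the desired matching rules, not something the question specifies) is to wrap the keywords in word boundaries and ignore case. The sample rows here are hypothetical and chosen to show both effects.

```python
import re
import pandas as pd

cat = pd.DataFrame({"Service": ["fast", "slow"], "Experience": ["bad", "easy"]})
reviews = pd.DataFrame({
    "Review": [
        "The service was FAST and easy!",
        "Overall it was a bad experience for me",
        "Breakfast was included",  # contains "fast" only as part of a word
    ],
})

for col in cat.columns:
    keywords = [re.escape(k) for k in cat[col].dropna()]
    # \b restricts matches to whole words; (?:...) avoids a capture group,
    # which str.contains would otherwise warn about.
    pattern = r"\b(?:" + "|".join(keywords) + r")\b"
    reviews[col] = reviews["Review"].str.contains(pattern, case=False, regex=True)

print(reviews)
```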