Python: categorize data in an Excel sheet based on keywords from another Excel sheet



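I have two Excel sheets. One of them lists keywords under four different categories. I'm using Python to look for those keywords in review data and match each review to a category. I tried using pandas and DataFrames for the comparison, but I get errors such as "DataFrame objects are mutable, thus they cannot be hashed". I'm not sure whether there is a better way to do this, but I'm new to pandas.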
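For context, that error usually means a whole DataFrame ended up somewhere that needs a hashable value, such as a set member or a dict key. A minimal sketch, assuming a hypothetical one-column cat frame; the exact message text varies by pandas version:

```python
import pandas as pd

# Hypothetical one-column category frame, for illustration only.
cat = pd.DataFrame({"Service": ["fast", "slow"]})

try:
    lookup = {cat}  # a set needs hashable members; a DataFrame is not hashable
except TypeError as err:
    error_message = str(err)
    print(f"TypeError: {error_message}")

# Plain strings are hashable, so pulling the keywords out of the frame works.
keywords = set(cat["Service"].dropna())
print(sorted(keywords))
```

Working with the keyword strings themselves, rather than the frame that holds them, avoids the hashing problem.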

Here is an example:

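Categories sheet

Service  Experience
fast     bad
slow     easy

Reviews sheet

Review #  Location  Review
1         New York  The service was fast and easy!
2         Texas     Overall it was a bad experience for me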
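The sample data used in this thread can be rebuilt in code, which makes the snippets that follow easy to try. This is only a sketch: the frame names cat and reviews match the ones used in the answer, and the file name in the comment is hypothetical.

```python
import pandas as pd

# Keyword sheet: one column per category, one keyword per cell.
# With a real workbook, pd.read_excel("categories.xlsx") would take the
# place of this literal (the file name is a made-up placeholder).
cat = pd.DataFrame({
    "Service": ["fast", "slow"],
    "Experience": ["bad", "easy"],
})

# Review sheet: free-text reviews to be categorized.
reviews = pd.DataFrame({
    "Review #": [1, 2],
    "Location": ["New York", "Texas"],
    "Review": [
        "The service was fast and easy!",
        "Overall it was a bad experience for me",
    ],
})

print(cat)
print(reviews)
```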

One approach is to build a regular expression from the cat frame, with one named group per category column:

exp = '|'.join([rf'(?P<{col}>{"|".join(cat[col].dropna())})' for col in cat])
(?P<Service>fast|slow)|(?P<Experience>bad|easy)
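The pattern above interpolates keywords verbatim, so a keyword containing regex metacharacters, or one that is a substring of a longer word ("fast" inside "breakfast"), can misfire. A hedged sketch of a stricter builder follows; build_pattern is a hypothetical helper name, not part of pandas, and it still assumes that column names are valid regex group names (letters, digits, underscores):

```python
import re

import pandas as pd

cat = pd.DataFrame({"Service": ["fast", "slow"], "Experience": ["bad", "easy"]})


def build_pattern(cat, cols=None):
    """Build one named group per category column, escaping each keyword."""
    cols = list(cat.columns) if cols is None else cols
    groups = []
    for col in cols:
        # re.escape neutralizes characters such as "." or "+" in keywords.
        words = [re.escape(str(word)) for word in cat[col].dropna()]
        # \b ... \b restricts matches to whole words.
        groups.append(rf"(?P<{col}>\b(?:{'|'.join(words)})\b)")
    return "|".join(groups)


exp = build_pattern(cat)
print(exp)
```

The resulting string can be passed to str.extractall in the same way as the simpler pattern.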

Or, replace cat with a list of the columns to test:

cols = ['Service']
exp = '|'.join([rf'(?P<{col}>{"|".join(cat[col].dropna())})' for col in cols])
(?P<Service>fast|slow)

Then use str.extractall to find the matches, aggregate them into a per-review summary, and join the summary back onto the reviews frame:

Aggregate into lists:

reviews = reviews.join(
    reviews['Review'].str.extractall(exp).groupby(level=0).agg(
        lambda g: list(g.dropna()))
)
   Review #  Location                                  Review Service Experience
0         1  New York          The service was fast and easy!  [fast]     [easy]
1         2     Texas  Overall it was a bad experience for me      []      [bad]

Aggregate into strings:

reviews = reviews.join(
    reviews['Review'].str.extractall(exp).groupby(level=0).agg(
        lambda g: ', '.join(g.dropna()))
)
   Review #  Location                                  Review Service Experience
0         1  New York          The service was fast and easy!    fast       easy
1         2     Texas  Overall it was a bad experience for me                bad

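Or, for an existence test, use any over index level 0:

# any(level=0) was removed in pandas 2.0, so group on index level 0 instead.
# notna() marks which named groups captured something in each match.
reviews = reviews.join(
    reviews['Review'].str.extractall(exp).notna().groupby(level=0).any()
)
   Review #  Location                                  Review  Service  Experience
0         1  New York          The service was fast and easy!     True        True
1         2     Texas  Overall it was a bad experience for me    False        True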
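One caveat with the extractall route: a review with no keyword at all produces no rows in the extractall result, so after the join its flags come back as NaN rather than False. A minimal sketch of one way to handle that, assuming a hypothetical third review that matches nothing:

```python
import pandas as pd

cat = pd.DataFrame({"Service": ["fast", "slow"], "Experience": ["bad", "easy"]})
reviews = pd.DataFrame({
    "Review": [
        "The service was fast and easy!",
        "Overall it was a bad experience for me",
        "Nothing special to report",  # hypothetical row with no keyword
    ],
})

# Same pattern construction as in the answer: one named group per column.
exp = "|".join(
    rf"(?P<{col}>{'|'.join(cat[col].dropna())})" for col in cat
)

matches = reviews["Review"].str.extractall(exp)

# True where a category captured at least one keyword for that review.
flags = matches.notna().groupby(level=0).any()

# Reviews without any match are missing from flags; add them back as False.
flags = flags.reindex(reviews.index, fill_value=False)

result = reviews.join(flags)
print(result)
```

Reindexing before the join keeps the flag columns boolean instead of letting NaN turn them into object columns.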

Or iterate over the columns and use str.contains:

cols = cat.columns
for col in cols:
    reviews[col] = reviews['Review'].str.contains('|'.join(cat[col].dropna()))
   Review #  Location                                  Review  Service  Experience
0         1  New York          The service was fast and easy!     True        True
1         2     Texas  Overall it was a bad experience for me    False        True
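The str.contains loop can be extended a little to cope with mixed case and missing reviews, and to collapse the flags into a single label column. The following is a sketch under assumptions: the "Categories" column name and the sample rows (including the empty review) are made up for illustration.

```python
import re

import pandas as pd

cat = pd.DataFrame({"Service": ["fast", "slow"], "Experience": ["bad", "easy"]})
reviews = pd.DataFrame({
    "Review": [
        "The service was FAST and easy!",
        "Overall it was a bad experience for me",
        None,  # hypothetical missing review
    ],
})

for col in cat.columns:
    # Escape each keyword so it is matched literally.
    pattern = "|".join(re.escape(word) for word in cat[col].dropna())
    # case=False ignores capitalization; na=False maps missing text to False.
    reviews[col] = reviews["Review"].str.contains(pattern, case=False, na=False)

# Join the names of all matching categories into one label per review.
reviews["Categories"] = reviews[list(cat.columns)].apply(
    lambda row: ", ".join(col for col in cat.columns if row[col]), axis=1
)
print(reviews)
```

A review that matches several categories gets all of them in its label; picking a single winner would need an explicit tie-breaking rule, which the question does not specify.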
