Python: categorize data in one Excel sheet based on keywords from another Excel sheet

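I have two Excel sheets. One of them lists keywords under four different categories. I'm using Python to find those keywords in the review data and match each review to a category. I tried comparing the two with pandas DataFrames, but I get errors such as "DataFrame objects are mutable, thus they cannot be hashed". I'm not sure whether there is a better way; I'm new to pandas.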
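The question does not show the failing code, so the following is only a guess at how such an error can arise: a DataFrame is not hashable, so using one as a dictionary key (or a set member) raises a TypeError. The frame contents below are illustrative.

```python
import pandas as pd

# Illustrative keyword table (a stand-in for the categories sheet).
cat = pd.DataFrame({"Service": ["fast", "slow"], "Experience": ["bad", "easy"]})

try:
    # Dict keys must be hashable, and a DataFrame is not.
    lookup = {cat: "categories"}
    error_message = None
except TypeError as err:
    error_message = str(err)

print(error_message)  # exact wording depends on the pandas version
```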

Here is an example:

Categories sheet

Service    Experience
fast       bad
slow       easy
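For experimenting without the Excel files, the two sheets can be sketched as in-memory frames. The names `cat` and `reviews` are the ones the answer uses; the keyword lists are taken from the regular expression printed in the answer, and the review rows from its output tables. The file names in the commented lines are hypothetical.

```python
import pandas as pd

# Keyword table: one column per category.
cat = pd.DataFrame({"Service": ["fast", "slow"], "Experience": ["bad", "easy"]})

# Review data, copied from the answer's output tables.
reviews = pd.DataFrame({
    "Review #": [1, 2],
    "Location": ["New York", "Texas"],
    "Review": [
        "The service was fast and easy!",
        "Overall it was a bad experience for me",
    ],
})

# Loading from Excel instead might look like this (hypothetical file names):
# cat = pd.read_excel("categories.xlsx")
# reviews = pd.read_excel("reviews.xlsx")
```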

One approach is to build a regular expression from the cat frame, with one named group per category column:

exp = '|'.join([rf'(?P<{col}>{"|".join(cat[col].dropna())})' for col in cat])

(?P<Service>fast|slow)|(?P<Experience>bad|easy)
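To see what the named groups do, the pattern can be tried with Python's `re` module directly. This is a small sketch using the same expression as above on an in-memory `cat` frame; note that the column names end up as group names, so they need to be valid Python identifiers.

```python
import re
import pandas as pd

cat = pd.DataFrame({"Service": ["fast", "slow"], "Experience": ["bad", "easy"]})

# Same construction as in the answer: one named group per category column.
exp = '|'.join([rf'(?P<{col}>{"|".join(cat[col].dropna())})' for col in cat])

# Each match fills exactly one group; the other group stays None.
match = re.search(exp, "The service was fast and easy!")
groups = match.groupdict()
print(groups)
```

The first match in the sentence is "fast", so only the Service group is filled for that match.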

Alternatively, replace cat with a list of the columns to test:

cols = ['Service']
exp = '|'.join([rf'(?P<{col}>{"|".join(cat[col].dropna())})' for col in cols])

(?P<Service>fast|slow|quick)
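As written, the pattern also matches keywords inside longer words ("fast" in "breakfast") and treats keywords as raw regex. A possible refinement, not part of the original answer, is to escape each keyword and wrap the alternatives in word boundaries. The helper name `build_pattern` and its parameters are hypothetical.

```python
import re
import pandas as pd

def build_pattern(cat, cols=None, whole_words=True):
    """Build one named group per category column from a keyword frame."""
    cols = list(cat.columns) if cols is None else list(cols)
    parts = []
    for col in cols:
        # Escape keywords so characters like '.' or '+' are matched literally.
        alternatives = "|".join(re.escape(str(k)) for k in cat[col].dropna())
        if whole_words:
            alternatives = rf"\b(?:{alternatives})\b"
        parts.append(rf"(?P<{col}>{alternatives})")
    return "|".join(parts)

cat = pd.DataFrame({"Service": ["fast", "slow"], "Experience": ["bad", "easy"]})
service_pattern = build_pattern(cat, ["Service"])
print(service_pattern)
```

The non-capturing group `(?:...)` keeps the number of capture groups unchanged, so the result can still be passed to str.extractall.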

Then use str.extractall, aggregate the matches per review, and join the summary back onto the reviews frame:

Aggregated into lists:

reviews = reviews.join(
    reviews['Review'].str.extractall(exp).groupby(level=0).agg(
        lambda g: list(g.dropna()))
)

Review #  Location                                  Review Service Experience
0         1  New York          The service was fast and easy!  [fast]     [easy]
1         2     Texas  Overall it was a bad experience for me      []      [bad]
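A different route to the same list columns, swapped in here as an alternative rather than taken from the answer, is to call str.findall once per category. It avoids the groupby step, and reviews with no match simply get an empty list.

```python
import re
import pandas as pd

cat = pd.DataFrame({"Service": ["fast", "slow"], "Experience": ["bad", "easy"]})
reviews = pd.DataFrame({
    "Review": [
        "The service was fast and easy!",
        "Overall it was a bad experience for me",
    ],
})

for col in cat.columns:
    # No capture groups here, so findall returns the matched keywords themselves.
    pattern = "|".join(re.escape(k) for k in cat[col].dropna())
    reviews[col] = reviews["Review"].str.findall(pattern)

print(reviews[["Service", "Experience"]])
```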

Aggregated into strings:

reviews = reviews.join(
    reviews['Review'].str.extractall(exp).groupby(level=0).agg(
        lambda g: ', '.join(g.dropna()))
)

Review #  Location                                  Review Service Experience
0         1  New York          The service was fast and easy!    fast       easy
1         2     Texas  Overall it was a bad experience for me                bad
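One detail worth checking: a review that matches no keyword at all produces no rows in the extractall result, so the join leaves NaN in the new columns for it. A minimal sketch of filling those gaps, assuming a third, made-up review with no keywords:

```python
import pandas as pd

reviews = pd.DataFrame({
    "Review": [
        "The service was fast and easy!",
        "Overall it was a bad experience for me",
        "Nothing special to report",  # made-up row with no keyword
    ],
})
exp = "(?P<Service>fast|slow)|(?P<Experience>bad|easy)"

matches = (
    reviews["Review"].str.extractall(exp)
    .groupby(level=0)
    .agg(lambda g: ", ".join(g.dropna()))
)

# Rows absent from `matches` come back as NaN after the join; fill them explicitly.
result = reviews.join(matches).fillna({"Service": "", "Experience": ""})
print(result)
```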

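Or, for an existence test, check whether each review has any match per category. Older pandas allowed any(level=0) here; the level argument was removed in pandas 2.0, so group on index level 0 instead:

reviews = reviews.join(
    reviews['Review'].str.extractall(exp).notna().groupby(level=0).any()
)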

Review #  Location                                  Review  Service  Experience
0         1  New York          The service was fast and easy!     True        True
1         2     Texas  Overall it was a bad experience for me    False        True
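If the goal is a single category label per review, as the question suggests, the boolean flags can be collapsed into one column. This is a sketch under assumptions: the column name `Categories` is hypothetical, the third review is made up, and a review matching several categories gets all of them listed.

```python
import pandas as pd

reviews = pd.DataFrame({
    "Review": [
        "The service was fast and easy!",
        "Overall it was a bad experience for me",
        "Nothing special to report",  # made-up row with no keyword
    ],
})
exp = "(?P<Service>fast|slow)|(?P<Experience>bad|easy)"

# True where a review has at least one match in the category.
flags = reviews["Review"].str.extractall(exp).notna().groupby(level=0).any()
# Reviews without any match are missing from `flags`; add them back as False.
flags = flags.reindex(reviews.index, fill_value=False)

reviews["Categories"] = [
    ", ".join(col for col in flags.columns if row[col])
    for _, row in flags.iterrows()
]
print(reviews)
```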

Or iterate over the columns and use str.contains:

cols = cat.columns
for col in cols:
    reviews[col] = reviews['Review'].str.contains('|'.join(cat[col].dropna()))

Review #  Location                                  Review  Service  Experience
0         1  New York          The service was fast and easy!     True        True
1         2     Texas  Overall it was a bad experience for me    False        True
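The loop above is case-sensitive and returns NaN for missing review text. A slightly more defensive variant, offered as a sketch rather than as the answer's method, passes `case=False` and `na=False` to str.contains and escapes the keywords. The third review is made up to show the case-insensitive match.

```python
import re
import pandas as pd

cat = pd.DataFrame({"Service": ["fast", "slow"], "Experience": ["bad", "easy"]})
reviews = pd.DataFrame({
    "Review": [
        "The service was fast and easy!",
        "Overall it was a bad experience for me",
        "FAST delivery, would order again",  # made-up row
    ],
})

for col in cat.columns:
    pattern = "|".join(re.escape(k) for k in cat[col].dropna())
    # case=False ignores capitalization; na=False maps missing text to False.
    reviews[col] = reviews["Review"].str.contains(pattern, case=False, na=False)

print(reviews)
```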
