Python pandas: remove the plural "s" from words to prepare for a word count

I have the following Python pandas DataFrame:

Question_ID | Customer_ID | Answer
    1           234         The team worked very hard ...
    2           234         All the teams have been working together ...

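I am going to use my own code to count the words in the Answer column. Before that, though, I want to remove the "s" from the word "teams", so that in the example above I count team: 2 instead of team: 1 and teams: 1.

How can I do this for all words?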

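You need a tokenizer (to break the sentence into words) and a lemmatizer (to normalize word forms), both of which are provided by the Natural Language Toolkit, nltk: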

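import nltk
# nltk.download('wordnet')  # one-time download of the WordNet data the lemmatizer needs
sentence = "All the teams have been working together"
wnl = nltk.WordNetLemmatizer()
[wnl.lemmatize(word) for word in nltk.wordpunct_tokenize(sentence)]
# ['All', 'the', 'team', 'have', 'been', 'working', 'together']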
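
To connect the normalization step to the word count the question asks for, here is a minimal sketch using only the standard library. The names `count_words` and `toy_normalize` are hypothetical, and `toy_normalize` is only a stand-in for a real lemmatizer; with NLTK installed, something like `lambda w: wnl.lemmatize(w.lower())` could be passed instead.

```python
import re
from collections import Counter

def count_words(answers, normalize=str.lower):
    """Tokenize each answer into alphabetic words, normalize each token, and tally."""
    counts = Counter()
    for text in answers:
        for token in re.findall(r"[A-Za-z]+", text):
            counts[normalize(token)] += 1
    return counts

# The two answers from the question's DataFrame.
answers = [
    "The team worked very hard ...",
    "All the teams have been working together ...",
]

# Stand-in normalizer for illustration only (an assumption, not a real lemmatizer):
# it lowercases the word and maps "teams" to "team".
def toy_normalize(word):
    word = word.lower()
    return {"teams": "team"}.get(word, word)

counts = count_words(answers, toy_normalize)
print(counts["team"])
```

Iterating over a pandas Series yields its values, so `df.Answer` could be passed as `answers` directly.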

Use str.replace to remove the s from any word of 3 or more letters that ends in 's'.

df.Answer.str.replace(r'(\w{2,})s\b', r'\1', regex=True)
0                  The team worked very hard ...
1    All the team have been working together ...
Name: Answer, dtype: object

'{2,}' specifies 2 or more. Combined with the 's', it ensures you skip 'is'. You can set it to '{3,}' to make sure you also skip 'its'.
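
The effect of the quantifier can be checked on a plain string with the standard `re` module. This is a small sketch; the helper name `strip_plural_s`, its `min_stem` parameter, and the sample sentence are made up for illustration.

```python
import re

def strip_plural_s(text, min_stem=2):
    """Drop a trailing 's' from words that keep at least min_stem word characters."""
    pattern = r'(\w{%d,})s\b' % min_stem
    return re.sub(pattern, r'\1', text)

sample = "All the teams think it is its best"

two = strip_plural_s(sample, 2)    # '{2,}': "is" survives, "its" loses its s
three = strip_plural_s(sample, 3)  # '{3,}': "is" and "its" both survive

print(two)    # All the team think it is it best
print(three)  # All the team think it is its best
```

Note that this is purely a pattern match: a non-plural word such as "this" would still be shortened to "thi" with either setting.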

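Try the NLTK toolkit, specifically stemming and lemmatization. I have never used it personally, but you can try it out here.

Here is an example with some tricky plurals,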

its it's his quizzes fishes maths mathematics

becomes

it it ' s hi quizz fish math mathemat

You can see it handles "his" (and "mathematics") poorly, but then again you could have lots of abbreviated "hellos". That is the nature of the beast.
