I am using the Porter and Lancaster stemmers, and I have observed the following:

Input: replied
Porter: repli
Lancaster: reply

Input: twice
Porter: twice
Lancaster: twic

Input: came
Porter: came
Lancaster: cam

Input: In
Porter: In
Lancaster: in

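My questions are:

Lancaster is supposed to be an "aggressive" stemmer, but it handled replied correctly. Why?
The word In stayed the same in Porter, still with an uppercase I. Why?
Notice that Lancaster strips the trailing e from words. Why?

I can't make sense of these concepts. Could you help?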

Q: Lancaster is supposed to be an "aggressive" stemmer, but it worked fine with `replied`. Why?

This is because of the Lancaster stemmer implementation in https://github.com/nltk/nltk/pull/1654

If we take a look at https://github.com/nltk/nltk/blob/develop/nltk/stem/lancaster.py#L62, there is a suffix rule that changes -ied > -y:

    default_rule_tuple = (
        "ai*2.",    # -ia > -   if intact
        "a*1.",     # -a > -    if intact
        "bb1.",     # -bb > -b
        "city3s.",  # -ytic > -ys
        "ci2>",     # -ic > -
        "cn1t>",    # -nc > -nt
        "dd1.",     # -dd > -d
        "dei3y>",   # -ied > -y
        ...)
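
To make the rule notation easier to read, here is a minimal sketch that decodes a single rule string and applies it once. It assumes the usual reading of the format (the ending spelled backwards, an optional `*` for "only if the word is intact", the number of characters to remove, optional letters to append, then `>` to continue or `.` to stop). `describe_rule` and `apply_rule_once` are hypothetical helper names, not part of NLTK, and the sketch skips any further checks the real stemmer performs.

```python
import re

# Reversed ending, optional "*", removal count, optional letters to append,
# then ">" (keep stemming) or "." (stop).
RULE_PATTERN = re.compile(r"^([a-z]+)(\*?)(\d)([a-z]*)([>.]?)$")


def describe_rule(rule):
    """Break a Lancaster-style rule string into its parts (illustrative only)."""
    match = RULE_PATTERN.match(rule)
    if match is None:
        raise ValueError("The rule {0} is invalid".format(rule))
    reversed_ending, intact, remove_count, append, terminator = match.groups()
    return {
        "ending": reversed_ending[::-1],  # rules store the ending backwards
        "intact_only": intact == "*",
        "remove": int(remove_count),
        "append": append,
        "continue": terminator == ">",
    }


def apply_rule_once(word, rule):
    """Apply one rule to a word if its ending matches; otherwise return the word."""
    info = describe_rule(rule)
    if not word.endswith(info["ending"]):
        return word
    return word[: len(word) - info["remove"]] + info["append"]


print(describe_rule("dei3y>"))
print(apply_rule_once("replied", "dei3y>"))
```

Read this way, `dei3y>` says: for words ending in `ied`, remove three characters and append `y`, which is how `replied` ends up as `reply`.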

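The feature allows users to supply their own rules. If no additional rules are given, it falls back to self.default_rule_tuple, and the resulting rule_tuple is applied in parseRules: https://github.com/nltk/nltk/blob/develop/nltk/stem/lancaster.py#L196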

    def parseRules(self, rule_tuple=None):
        """Validate the set of rules used in this stemmer.
        If this function is called as an individual method, without using stem
        method, rule_tuple argument will be compiled into self.rule_dictionary.
        If this function is called within stem, self._rule_tuple will be used.
        """
        # If there is no argument for the function, use class' own rule tuple.
        rule_tuple = rule_tuple if rule_tuple else self._rule_tuple
        valid_rule = re.compile(r"^[a-z]+\*?\d[a-z]*[>\.]?$")
        # Empty any old rules from the rule set before adding new ones
        self.rule_dictionary = {}
        for rule in rule_tuple:
            if not valid_rule.match(rule):
                raise ValueError("The rule {0} is invalid".format(rule))
            first_letter = rule[0:1]
            if first_letter in self.rule_dictionary:
                self.rule_dictionary[first_letter].append(rule)
            else:
                self.rule_dictionary[first_letter] = [rule]
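
As a rough illustration of why the rules are keyed by their first letter: each rule stores its ending backwards, so a rule's first letter is the last letter of the words it can match. The sketch below is standalone and uses a handful of sample rules taken from this answer. `group_rules` and `candidate_rules` are hypothetical names, and the lookup by last letter is my reading of how the dictionary is meant to be used, not a copy of NLTK's code.

```python
from collections import defaultdict

# A few rules quoted in this answer; the real default tuple is much longer.
sample_rules = ("bb1.", "dd1.", "dei3y>", "e1>")


def group_rules(rule_tuple):
    """Group rules by their first letter, mirroring the loop in parseRules."""
    rule_dictionary = defaultdict(list)
    for rule in rule_tuple:
        rule_dictionary[rule[0]].append(rule)
    return dict(rule_dictionary)


def candidate_rules(word, rule_dictionary):
    """Return the rules whose reversed ending starts with the word's last letter."""
    return rule_dictionary.get(word[-1], [])


rules_by_letter = group_rules(sample_rules)
print(candidate_rules("replied", rules_by_letter))
print(candidate_rules("came", rules_by_letter))
```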

The default_rule_tuple actually comes from whoosh's implementation of the Paice/Husk stemmer, which is also known as the Lancaster stemmer: https://github.com/nltk/nltk/pull/1661 =)

Q: The word In stays the same in Porter, still with an uppercase I. Why?

This is really interesting! It is most probably a bug.

    >>> from nltk.stem import PorterStemmer
    >>> porter = PorterStemmer()
    >>> porter.stem('In')
    'In'

If we look at the code, the first thing PorterStemmer.stem() does is lowercase the word: https://github.com/nltk/nltk/blob/develop/nltk/stem/porter.py#L651

    def stem(self, word):
        stem = word.lower()
        if self.mode == self.NLTK_EXTENSIONS and word in self.pool:
            return self.pool[word]
        if self.mode != self.ORIGINAL_ALGORITHM and len(word) <= 2:
            # With this line, strings of length 1 or 2 don't go through
            # the stemming process, although no mention is made of this
            # in the published algorithm.
            return word
        stem = self._step1a(stem)
        stem = self._step1b(stem)
        stem = self._step1c(stem)
        stem = self._step2(stem)
        stem = self._step3(stem)
        stem = self._step4(stem)
        stem = self._step5a(stem)
        stem = self._step5b(stem)
        return stem

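But looking at the code more closely, every other path returns stem, which is lowercased, while two if clauses return something derived from the original word, which has not been lowercased: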

    if self.mode == self.NLTK_EXTENSIONS and word in self.pool:
        return self.pool[word]
    if self.mode != self.ORIGINAL_ALGORITHM and len(word) <= 2:
        # With this line, strings of length 1 or 2 don't go through
        # the stemming process, although no mention is made of this
        # in the published algorithm.
        return word

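The first if clause checks whether the word is in self.pool, which holds irregular words and their stems.

The second checks whether len(word) <= 2 and, if so, returns the word in its original form. For "In", this second condition is True, so the original, non-lowercased form is returned.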
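
The control flow can be reproduced with a toy function, which also suggests a simple workaround: lowercase the word yourself before stemming. This is a sketch only. `buggy_stem` is a hypothetical stand-in that mimics the early return quoted above without doing any real stemming, and `lowercase_first` is a hypothetical wrapper that works with any stemming callable.

```python
def buggy_stem(word):
    """Toy stand-in for the quoted control flow; it does no actual stemming."""
    stem = word.lower()
    if len(word) <= 2:
        # Early exit returns the original word, so its casing is preserved.
        return word
    # The real stemmer would run its steps on `stem` here.
    return stem


def lowercase_first(stem_func, word):
    """Workaround: lowercase the input before handing it to the stemmer."""
    return stem_func(word.lower())


print(buggy_stem("In"))                   # short word keeps its casing
print(buggy_stem("Came"))                 # longer word comes back lowercased
print(lowercase_first(buggy_stem, "In"))  # casing is now consistent
```

With NLTK, the same wrapper could be called as `lowercase_first(porter.stem, 'In')`.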

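Q: Notice that Lancaster strips the trailing `e` from words such as "came". Why?

Not surprisingly, this also comes from the default_rule_tuple: at https://github.com/nltk/nltk/blob/develop/nltk/stem/lancaster.py#L67 there is a rule that changes -e > - =)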

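Q: How can I disable the `-e > -` rule from `default_rule_tuple`?

(Un-)fortunately, the LancasterStemmer._rule_tuple object is an immutable tuple, so we can't simply remove an item from it, but we can override it =)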

    >>> from nltk.stem import LancasterStemmer
    >>> lancaster = LancasterStemmer()
    >>> lancaster.stem('came')
    'cam'
    # Create a new stemmer object to refresh the cache.
    >>> lancaster = LancasterStemmer()
    # Find the 'e1>' rule.
    >>> lancaster._rule_tuple.index('e1>')
    12
    # Create a temporary rule list from the tuple.
    >>> temp_rule_list = list(lancaster._rule_tuple)
    # Remove the rule.
    >>> temp_rule_list.pop(12)
    'e1>'
    # Override the `._rule_tuple` variable.
    >>> lancaster._rule_tuple = tuple(temp_rule_list)
    # Et voila!
    >>> lancaster.stem('came')
    'came'
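
If you would rather not depend on the rule sitting at index 12, the same override can be built by filtering on the rule string. A minimal sketch, where `without_rule` is a hypothetical helper and `sample_rules` is a shortened stand-in for the real tuple:

```python
def without_rule(rule_tuple, unwanted):
    """Return a copy of rule_tuple with every occurrence of `unwanted` removed."""
    if unwanted not in rule_tuple:
        raise ValueError("The rule {0} is not in the rule tuple".format(unwanted))
    return tuple(rule for rule in rule_tuple if rule != unwanted)


# Shortened stand-in for LancasterStemmer._rule_tuple.
sample_rules = ("dd1.", "dei3y>", "e1>")
print(without_rule(sample_rules, "e1>"))
```

On a freshly created stemmer this would be used as `lancaster._rule_tuple = without_rule(lancaster._rule_tuple, 'e1>')`, before the first call to `stem`.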
