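I came up with the approach below. I have narrowed the problem down to not being able to capture one-word and two-word proper nouns at the same time.
(1) It would be great if I could add a condition so that, when choosing between two captures, the longer one is used by default.
and
(2) It would be great if I could tell the regex to consider this only when the string starts with a preposition, e.g. On|At|For. I was playing with something like this, but it doesn't work: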
(^On|^at)([A-Z][a-z]{3,15}$|[A-Z][a-z]{3,15}\s{0,1}[A-Z][a-z]{0,5})
How do I do 1 and 2?
My current regex:
r'([A-Z][a-z]{3,15}$|[A-Z][a-z]{3,15}\s{0,1}[A-Z][a-z]{0,15})'
I want to capture Ashoka, Shift Series, Compass Partners and Kenneth Cole from these strings:
#'On its 25th anniversary, Ashoka',
#'at the Shift Series national conference, Compass Partners and fashion designer Kenneth Cole',
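What you are trying to do here is known in natural language processing as "named entity recognition". If you really need a way to find proper nouns, you will probably have to step up to named entity recognition. Thankfully, the nltk library has some easy-to-use functions for it: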
import nltk
s2 = 'at the Shift Series national conference, Compass Partners and fashion designer Kenneth Cole'
tokens2 = nltk.word_tokenize(s2)
tags = nltk.pos_tag(tokens2)
res = nltk.ne_chunk(tags)
Result:
res.productions()
Out[8]:
[S -> ('at', 'IN') ('the', 'DT') ORGANIZATION ('national', 'JJ') ('conference', 'NN') (',', ',') ORGANIZATION ('and', 'CC') ('fashion', 'NN') ('designer', 'NN') PERSON,
ORGANIZATION -> ('Shift', 'NNP') ('Series', 'NNP'),
ORGANIZATION -> ('Compass', 'NNP') ('Partners', 'NNPS'),
PERSON -> ('Kenneth', 'NNP') ('Cole', 'NNP')]
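To get plain strings out of that result, one option is to walk the chunk tree and join the words of every labelled subtree. The following is only a sketch: `entity_names` is a hypothetical helper name, and the tree is rebuilt by hand from the output shown above so the snippet runs without downloading any NLTK models. With a real `res` from `nltk.ne_chunk`, the same function should apply unchanged.

```python
from nltk import Tree

# Hand-built copy of the chunk tree printed above (an assumption made so that
# the example runs without the tokenizer, tagger, or chunker models).
res = Tree('S', [
    ('at', 'IN'), ('the', 'DT'),
    Tree('ORGANIZATION', [('Shift', 'NNP'), ('Series', 'NNP')]),
    ('national', 'JJ'), ('conference', 'NN'), (',', ','),
    Tree('ORGANIZATION', [('Compass', 'NNP'), ('Partners', 'NNPS')]),
    ('and', 'CC'), ('fashion', 'NN'), ('designer', 'NN'),
    Tree('PERSON', [('Kenneth', 'NNP'), ('Cole', 'NNP')]),
])


def entity_names(tree):
    """Collect (label, text) pairs for every chunk below the root 'S' node."""
    names = []
    for subtree in tree.subtrees(lambda t: t.label() != 'S'):
        # Leaves are (word, tag) tuples; keep only the words.
        words = [word for word, _tag in subtree.leaves()]
        names.append((subtree.label(), ' '.join(words)))
    return names


print(entity_names(res))
```

Note that a one-word name such as Ashoka is only returned if the chunker labels it as an entity in the first place, which depends on the model and the sentence.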
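I would use an NLP tool; the most popular one in Python seems to be nltk. Regular expressions are really not the right approach here. There is an example on the front page of the nltk website, linked from an earlier answer, which is copied below: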
import nltk
sentence = """At eight o'clock on Thursday morning
Arthur didn't feel very good."""
tokens = nltk.word_tokenize(sentence)
print(tokens)
# ['At', 'eight', "o'clock", 'on', 'Thursday', 'morning',
#  'Arthur', 'did', "n't", 'feel', 'very', 'good', '.']
tagged = nltk.pos_tag(tokens)
entities = nltk.chunk.ne_chunk(tagged)
entities now holds the words with their Penn Treebank part-of-speech tags, grouped into named-entity chunks.
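This is not entirely correct, but it will match most of what you are looking for. The one exception is On, which it also picks up.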
import re
text = """
#'On its 25th anniversary, Ashoka',
#'at the Shift Series national conference, Compass Partners and fashion designer Kenneth Cole',
"""
# One capitalized word, optionally followed by whitespace and a second capitalized word
proper_noun_regex = r'([A-Z]{1}[a-z]{1,}(\s[A-Z]{1}[a-z]{1,})?)'
p = re.compile(proper_noun_regex)
matches = p.findall(text)
print(matches)
Output: [('On', ''), ('Ashoka', ''), ('Shift Series', ' Series'), ('Compass Partners', ' Partners'), ('Kenneth Cole', ' Cole')]
Then perhaps you can implement a filter that walks through this list:
def filter_false_positive(unfiltered_matches):
    filtered_matches = []
    black_list = ["an","on","in","foo","bar"] #etc
    for match in unfiltered_matches:
        if match.lower() not in black_list:
            filtered_matches.append(match)
    return filtered_matches
Or, because Python is cool:
def filter_false_positive(unfiltered_matches):
    black_list = ["an","on","in","foo","bar"] #etc
    return [match for match in unfiltered_matches if match.lower() not in black_list]
You can use it like this:
# CONTINUED FROM THE CODE ABOVE
matches = [i[0] for i in matches]
matches = filter_false_positive(matches)
print(matches)
This gives the final output:
['Ashoka', 'Shift Series', 'Compass Partners', 'Kenneth Cole']
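The two points in the question can also be handled directly in the pattern, as a rough sketch. The optional second word is greedy, so the longer two-word match is preferred whenever it is available (point 1), and a separate check on the start of the string restricts matching to sentences that open with a preposition (point 2). The names `PREPOSITION`, `PROPER_NOUN` and `extract_proper_nouns` are hypothetical, and the preposition list is just the On|At|For example from the question.

```python
import re

# Assumed preposition list, taken from the question's On|At|For example.
# \b keeps words such as "Forest" from counting as "For".
PREPOSITION = re.compile(r"(?:on|at|for)\b", re.IGNORECASE)

# The optional second word is greedy, so "Kenneth Cole" wins over "Kenneth".
PROPER_NOUN = re.compile(r"[A-Z][a-z]+(?:\s[A-Z][a-z]+)?")


def extract_proper_nouns(sentence):
    """Return proper-noun candidates only for sentences opening with a preposition."""
    opening = PREPOSITION.match(sentence)
    if opening is None:
        return []
    # Start searching after the preposition, so "On" itself is never a candidate.
    return PROPER_NOUN.findall(sentence, opening.end())


print(extract_proper_nouns('On its 25th anniversary, Ashoka'))
print(extract_proper_nouns('at the Shift Series national conference, '
                           'Compass Partners and fashion designer Kenneth Cole'))
```

Because the search starts after the leading preposition, the blacklist filter is not needed for that first word, though other capitalized non-names later in the sentence would still slip through.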
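Deciding whether a word is capitalized because it starts a sentence or because it is a proper noun is not that simple a problem.
'Kenneth Cole is a brand name.' vs. 'Can I eat something now?' vs. 'An English man had tea'
Cases like these are quite hard, so without some other criterion for recognizing a proper noun (a blacklist, a database, etc.) this will not be easy. Regex is great, but I don't think it can interpret English at the grammatical level in any trivial way...
That said, good luck!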