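NLP: generating collocations

I am working on a hands-on exercise, and the expected output is:

[('fans', 3), ('car', 3)]

['sports car', 'sports fans']

My code is below. I can get the first expected output, but I cannot get the second one correctly. Can anyone help me figure out what is wrong here?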

from nltk.tokenize import RegexpTokenizer
text='Thirty-five sports disciplines and four cultural activities will be offered during seven days of competitions. He skated with charisma, changing from one gear to another, from one direction to another, faster than a sports car. Armchair sports fans settling down to watch the Olympic Games could be for the high jump if they do not pay their TV licence fee. Such invitationals will attract more viewership for sports fans by sparking interest among sports fans. She barely noticed a flashy sports car almost run them over, until Eddie lunged forward and grabbed her body away. And he flatters the mother and she kind of gets prissy and he talks her into going for a ride in the sports car.'
word='sports'
tokenizedword = nltk.tokenize.regexp_tokenize(text, pattern=r'\w*', gaps=False)
#Step 2
tokenizedwords = [x.lower() for x in tokenizedword if x != '']
tokenizedwordsbigram=list(nltk.bigrams(tokenizedwords))
stop_words = set(stopwords.words('english')) 
filteredwords = []
for x in tokenizedwordsbigram:
    if x not in stop_words:
        filteredwords.append(x)

tokenizednonstopwordsbigram = nltk.ConditionalFreqDist(filteredwords)  
print(tokenizednonstopwordsbigram[word].most_common(3))
gen_text=nltk.Text(tokenizedwords)
print(gen_text.collocations())

Replace

print(gen_text.collocations())

with

print(gen_text.collocation_list())

and your program will run fine.

I ran your code after adding the required imports (import nltk and from nltk.corpus import stopwords) and got the following output.

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import RegexpTokenizer
# use to find bigrams, which are pairs of words
text = 'Thirty-five sports disciplines and four cultural activities will be offered during seven days of competitions. He skated with charisma, changing from one gear to another, from one direction to another, faster than a sports car. Armchair sports fans settling down to watch the Olympic Games could be for the high jump if they do not pay their TV licence fee. Such invitationals will attract more viewership for sports fans by sparking interest among sports fans. She barely noticed a flashy sports car almost run them over, until Eddie lunged forward and grabbed her body away. And he flatters the mother and she kind of gets prissy and he talks her into going for a ride in the sports car.'
word = 'sports'
tokenizedword = nltk.tokenize.regexp_tokenize(text, pattern=r'\w*',
                                              gaps=False)
# Step 2
tokenizedwords = [x.lower() for x in tokenizedword if x != '']
tokenizedwordsbigram = list(nltk.bigrams(tokenizedwords))
stop_words = set(stopwords.words('english'))
filteredwords = []
for x in tokenizedwordsbigram:
    if x not in stop_words:
        filteredwords.append(x)
tokenizednonstopwordsbigram = nltk.ConditionalFreqDist(filteredwords)
print(tokenizednonstopwordsbigram[word].most_common(3))
gen_text = nltk.Text(tokenizedwords)
print(gen_text.collocations())

This is the output:

[('car', 3), ('fans', 3), ('disciplines', 1)]
sports car; sports fans
None
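
The trailing None in the output above appears because Text.collocations() prints its result itself and returns None, so wrapping the call in print() prints that None as well. Below is a minimal sketch of the same effect. It uses two hypothetical stand-in functions (show_collocations and list_collocations are illustrative names, not part of NLTK) so that it runs without NLTK or its corpora installed:

```python
import io
from contextlib import redirect_stdout


def show_collocations(pairs):
    """Hypothetical stand-in: prints its result and implicitly returns None."""
    print("; ".join(pairs))


def list_collocations(pairs):
    """Hypothetical stand-in: returns its result instead of printing it."""
    return list(pairs)


pairs = ["sports car", "sports fans"]

# Capture everything that print(show_collocations(...)) writes to stdout.
buffer = io.StringIO()
with redirect_stdout(buffer):
    print(show_collocations(pairs))
captured = buffer.getvalue()

# The returning variant gives a value that can be printed or reused.
returned = list_collocations(pairs)

print(captured, end="")
print(returned)
```

The captured text contains the joined collocations followed by a line reading None, which mirrors the output shown above.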

gen_text = nltk.Text(tokenizedwords).collocation_list()

b = [i[0] + " " + i[1] for i in gen_text]

return b

You will get the output:

['sports car', 'sports fans']
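
The list comprehension above assumes that collocation_list() returns (word1, word2) tuples. As far as I know that is what recent NLTK releases do, while some older 3.4.x releases may have returned already-joined strings, in which case i[0] + " " + i[1] would join the first two characters instead of two words. A small defensive sketch that accepts either shape (join_collocations is a hypothetical helper name, not an NLTK function):

```python
def join_collocations(collocations):
    """Return 'word1 word2' strings from tuples or from already-joined strings."""
    joined = []
    for item in collocations:
        if isinstance(item, str):
            # Assumption: older releases may hand back ready-made strings.
            joined.append(item)
        else:
            # Tuples such as ('sports', 'car') are joined with a space.
            joined.append(" ".join(item))
    return joined


from_tuples = join_collocations([("sports", "car"), ("sports", "fans")])
from_strings = join_collocations(["sports car", "sports fans"])
print(from_tuples)
print(from_strings)
```

It would be called as join_collocations(nltk.Text(tokenizedwords).collocation_list()).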

import nltk
from nltk.corpus import stopwords


def performBigramsAndCollocations(textcontent, word):
    stop_words = set(stopwords.words('english'))
    # Tokenize into runs of word characters and lowercase every token.
    pattern = r'\w+'
    tokenizewords = nltk.regexp_tokenize(textcontent, pattern)
    tokenizewords = [w.lower() for w in tokenizewords]
    # Build bigrams and keep only pairs in which neither word is a stop word.
    tokenizewordsbiagrams = nltk.bigrams(tokenizewords)
    tokenizednonstopwordbigrams = [(w1, w2) for w1, w2 in tokenizewordsbiagrams
                                   if w1 not in stop_words and w2 not in stop_words]
    # Count which words follow `word` and take the three most frequent.
    cfd_bigrams = nltk.ConditionalFreqDist(tokenizednonstopwordbigrams)
    mostfrequentwordafter = cfd_bigrams[word].most_common(3)
    # collocation_list() returns the collocations instead of printing them.
    collocationwords = nltk.Text(tokenizewords).collocation_list()
    collocationwords = [i[0] + " " + i[1] for i in collocationwords]

    return mostfrequentwordafter, collocationwords

You can try this. It worked for me!
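
One detail worth spelling out: the function above tests each word of a bigram against the stop-word set, whereas the loop in the question tests the whole (w1, w2) tuple, which can never be a member of a set of strings, so nothing is filtered there. The counting step can be sketched with the standard library alone. This is only an approximation of what ConditionalFreqDist does, and STOP_WORDS is a tiny illustrative set (an assumption), not NLTK's English stop-word list; most_common_followers is a hypothetical name:

```python
import re
from collections import Counter, defaultdict

# Assumption: a tiny illustrative stop-word set, not NLTK's English list.
STOP_WORDS = {"a", "the", "and", "of", "for", "by", "than", "in"}


def most_common_followers(text, word, n=3):
    """Count the words that directly follow `word`, skipping stop-word pairs."""
    tokens = re.findall(r"\w+", text.lower())
    followers = defaultdict(Counter)
    for w1, w2 in zip(tokens, tokens[1:]):
        # Test each word separately; a tuple is never in a set of strings.
        if w1 not in STOP_WORDS and w2 not in STOP_WORDS:
            followers[w1][w2] += 1
    return followers[word].most_common(n)


# Shortened sample based on the sentences from the question.
sample = (
    "Thirty-five sports disciplines will be offered. "
    "He was faster than a sports car. "
    "Armchair sports fans watch the games. "
    "It attracts sports fans by sparking interest among sports fans. "
    "She noticed a flashy sports car and a ride in the sports car."
)

result = most_common_followers(sample, "sports")
tuple_is_filtered = ("a", "sports") in STOP_WORDS  # always False for a tuple
print(result)
print(tuple_is_filtered)
```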
