What is an efficient way to find the ADJ surrounding a target phrase in Python?



I am doing sentiment analysis on a given set of documents, and my goal is to find the adjectives that are closest to, or surround, the target phrase in each sentence. I do know how to extract the surrounding words with respect to the target phrase, but how can I find the adjective (or NNP, VBN, or another POS tag) that is relatively close or closest to the target phrase?

Here is a sketch of how I get the surrounding words with respect to my target phrase.

sentence_List= {"Obviously one of the most important features of any computer is the human interface.", "Good for everyday computing and web browsing.",
"My problem was with DELL Customer Service", "I play a lot of casual games online[comma] and the touchpad is very responsive"}
target_phraseList={"human interface","everyday computing","DELL Customer Service","touchpad"}

Note that my original dataset is given as a DataFrame, with a list of sentences and their corresponding target phrases. Here I am just simulating the data as follows:

import pandas as pd

df = pd.Series(sentence_List, index=target_phraseList)
df = pd.DataFrame(df)

Here I tokenize the sentences as follows:

from nltk.tokenize import word_tokenize

tokenized_sents = [word_tokenize(i) for i in sentence_List]
tokenized = [i for i in tokenized_sents]

Then I tried to find the surrounding words with respect to my target phrase by using a loop. However, I want to find the adjective, verb, or VBN that is relatively close or closest to my target phrase. How can I do that? Is there any way to accomplish this? Thanks.
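
For context, the loop I mean is roughly of this shape (only a simplified sketch; the window size of 4 and the plain whitespace splitting are illustrative assumptions, not my exact code):

# Simplified sketch of the surrounding-word loop (window size is an assumption)
window = 4
for sentence, target in zip(sentence_List, target_phraseList):
    words = sentence.split()
    first_word = target.split()[0]
    if first_word in words:
        idx = words.index(first_word)
        # grab up to `window` words on either side of the target phrase
        surrounding = words[max(0, idx - window): idx + len(target.split()) + window]
        print(target, '->', surrounding)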

Does the following work for you? I know some tweaking will be needed to make it completely robust (checking lower/upper case; in the event of a tie it also returns the word ahead of the phrase in the sentence rather than the one behind it), but hopefully it is useful enough to get you started:

import nltk
from nltk.tokenize import MWETokenizer


def smart_tokenizer(sentence, target_phrase):
    """
    Tokenize a sentence using a full target phrase.
    """
    tokenizer = MWETokenizer()
    target_tuple = tuple(target_phrase.split())
    tokenizer.add_mwe(target_tuple)
    token_sentence = nltk.pos_tag(tokenizer.tokenize(sentence.split()))

    # The MWETokenizer puts underscores to replace spaces, for some reason
    # So just identify what the phrase has been converted to
    temp_phrase = target_phrase.replace(' ', '_')
    target_index = [i for i, y in enumerate(token_sentence) if y[0] == temp_phrase]
    if len(target_index) == 0:
        return None, None
    else:
        return token_sentence, target_index[0]


def search(text_tag, tokenized_sentence, target_index):
    """
    Search for a part of speech (POS) nearest a target phrase of interest.
    """
    for i, entry in enumerate(tokenized_sentence):
        # entry[0] is the word; entry[1] is the POS
        ahead = target_index + i
        behind = target_index - i
        try:
            if (tokenized_sentence[ahead][1]) == text_tag:
                return tokenized_sentence[ahead][0]
        except IndexError:
            try:
                if (tokenized_sentence[behind][1]) == text_tag:
                    return tokenized_sentence[behind][0]
            except IndexError:
                continue


x, i = smart_tokenizer(sentence='My problem was with DELL Customer Service',
                       target_phrase='DELL Customer Service')
print(search('NN', x, i))

y, j = smart_tokenizer(sentence="Good for everyday computing and web browsing.",
                       target_phrase="everyday computing")
print(search('NN', y, j))
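
If you want to run this over your DataFrame, a minimal sketch along these lines may help (it assumes the layout from your simulation, i.e. the target phrases as the index and the sentences in column 0, and uses 'JJ' as the adjective tag):

# Sketch: apply the two functions row by row to the simulated DataFrame
# (assumes target phrases are the index and sentences are in column 0)
results = {}
for target_phrase, sentence in df[0].items():
    tagged, idx = smart_tokenizer(sentence, target_phrase)
    if tagged is None:
        results[target_phrase] = None          # phrase not found after tokenization
    else:
        results[target_phrase] = search('JJ', tagged, idx)  # 'JJ' = adjective
print(results)

Note that sentence.split() keeps punctuation attached to words, so a target phrase that ends right before a period (such as "human interface." above) will not be merged by the MWETokenizer and will come back as None.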

EDIT: I made some changes to handle target phrases of arbitrary length, as you can see in the smart_tokenizer function. The key is the nltk.tokenize.MWETokenizer class (for more information, see: Python: Tokenizing with phrases). Hopefully this helps. As an aside, I would question the idea that spaCy is necessarily more elegant; at some point somebody has to write the code to get the work done, and that is either the spaCy developers or you rolling your own solution. Their API is fairly involved, so I will leave that exercise to you.
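
For reference, here is a tiny standalone illustration of what the MWETokenizer does (using one of the sentences above; the underscore is the tokenizer's default separator):

from nltk.tokenize import MWETokenizer

# The multi-word expression is merged into a single token joined by '_'
tok = MWETokenizer([('DELL', 'Customer', 'Service')])
print(tok.tokenize('My problem was with DELL Customer Service'.split()))
# ['My', 'problem', 'was', 'with', 'DELL_Customer_Service']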
