I want to write topic lists to check whether a review touches on one of the defined topics. It is important to me to write the topic lists myself instead of using topic modeling to find possible topics.
I thought this was called dictionary analysis, but I couldn't find anything on it.
I have a dataframe with reviews from Amazon:
df = pd.DataFrame({'User': ['UserA', 'UserB', 'UserC'],
                   'text': ['Example text where he talks about a phone and his charging cable',
                            'Example text where he talks about a car with some wheels',
                            'Example text where he talks about a plane']})
Now I want to define the topic lists:
phone = ['phone', 'cable', 'charge', 'charging', 'call', 'telephone']
car = ['car', 'wheel', 'steering', 'seat', 'roof', 'other car related words']
plane = ['plane', 'wings', 'turbine', 'fly']
The result of the method for the first review should be 3/12 for the "phone" topic (3 words from the topic list are among the 12 words of the review) and 0 for the other two topics.
For the second review the result should be 2/11 for the "car" topic and 0 for the other topics, and for the third review 1/8 for the "plane" topic and 0 for the others.
The results should come out as lists:
phone_results = [0.25, 0, 0]
car_results = [0, 0.18181818182, 0]
plane_results = [0, 0, 0.125]
Of course I would only use the lowercase stems of the review words, which makes defining the topics easier, but that should not be a concern for now.
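For context, the preprocessing I have in mind would look roughly like this (a sketch using NLTK's PorterStemmer; any stemmer would do):

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
text = 'Example text where he talks about a phone and his charging cable'
# Lowercase and stem each token, so 'charging', 'charged' and 'charge' all collapse to 'charg'
stems = [stemmer.stem(word) for word in text.lower().split()]
print(stems)  # ['exampl', 'text', ..., 'charg', 'cabl']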
Is there an existing method for this, or do I have to write one myself? Thanks in advance!
NLP can go quite deep, but for a ratio of known words you could do something more basic. For example:
word_map = {
    'phone': ['phone', 'cable', 'charge', 'charging', 'call', 'telephone'],
    'car': ['car', 'wheels', 'steering', 'seat', 'roof', 'other car related words'],
    'plane': ['plane', 'wings', 'turbine', 'fly']
}
sentences = [
    'Example text where he talks about a phone and his charging cable',
    'Example text where he talks about a car with some wheels',
    'Example text where he talks about a plane'
]
for sentence in sentences:
    print('==== %s ====' % sentence)
    words = sentence.split()
    for prefix in word_map:
        match_score = 0
        for word in words:
            if word in word_map[prefix]:
                match_score += 1
        print('Prefix: %s | MatchScore: %.2f' % (prefix, float(match_score) / len(words)))
You'll get something like this:
==== Example text where he talks about a phone and his charging cable ====
Prefix: phone | MatchScore: 0.25
Prefix: car | MatchScore: 0.00
Prefix: plane | MatchScore: 0.00
==== Example text where he talks about a car with some wheels ====
Prefix: phone | MatchScore: 0.00
Prefix: car | MatchScore: 0.18
Prefix: plane | MatchScore: 0.00
==== Example text where he talks about a plane ====
Prefix: phone | MatchScore: 0.00
Prefix: car | MatchScore: 0.00
Prefix: plane | MatchScore: 0.12
Of course this is a basic example, and words don't always end with a space: it could be a comma, a period, and so on, so you will want to account for that. There are also tenses: I can "call" someone, or have "called" or be "calling", but we also don't want a word like "voice" getting mixed in. So it gets quite tricky around the edge cases, but for a very basic working(!) example, I would see whether you can do it in Python without natural-language libraries. Ultimately, if it doesn't fit your use case, you can start testing them.
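To illustrate the punctuation and casing point, here is a minimal sketch (the tokenize helper is just for this example) that strips punctuation and lowercases before counting, and collects the scores into per-topic lists like the ones in the question:

import string

def tokenize(sentence):
    # Lowercase and strip punctuation so that 'Cable' and 'cable,' both count as 'cable'
    return sentence.lower().translate(str.maketrans('', '', string.punctuation)).split()

results = {topic: [] for topic in word_map}
for sentence in sentences:
    words = tokenize(sentence)
    for topic, topic_words in word_map.items():
        matches = sum(1 for word in words if word in topic_words)
        results[topic].append(matches / len(words))

print(results['phone'])  # [0.25, 0.0, 0.0]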
Besides that, you could also look at things like Rasa NLU or nltk.
You could use RASA-NLU's pretrained intent classification models.
I thought I'd give back to the community and post my finished code, which builds on @David542's answer:
import pandas as pd
import numpy as np
import re

i = 0
# Iterates through the reviews
total_length = len(sentences)
print("Process started:")
s = 1
for sentence in sentences:
    # Splits a review text into single words
    words = sentence.split()
    previous_word = ""
    # Iterates through the topics; each is one column in a table
    for column in dictio:
        # Saves the topic words in the pattern list
        pattern = list(dictio[column])
        # Remove nan values
        clean_pattern = [x for x in pattern if str(x) != 'nan']
        match_score = 0
        # Iterates through each entry of the topic list
        for search_words in clean_pattern:
            # Iterates through each word of the review
            for word in words:
                # When the search entry consists of two consecutive words, this branch is used
                if len(search_words.split()) > 1:
                    pattern2 = r"( " + re.escape(search_words.split()[0]) + r"([a-z]+|) " + re.escape(search_words.split()[1]) + r"([a-z]+|))"
                    # The leading spaces are important so that "bedtime" doesn't match "time"
                    if re.search(pattern2, " " + previous_word + " " + word, re.IGNORECASE):
                        match_score += 1
                        #print(pattern2, " match ", previous_word, " ", word)
                if len(search_words.split()) == 1:
                    pattern1 = r" " + re.escape(search_words) + r"([a-z]+|)"
                    if re.search(pattern1, " " + word, re.IGNORECASE):
                        match_score += 1
                        #print(pattern1, " match ", word)
                # Saves the word for the next iteration, to be used as the previous word
                previous_word = word
        result = 0
        if match_score > 0:
            result = 1
        df.at[i, column] = int(result)
    i += 1
    # Status bar
    factor = round(s / total_length, 4)
    if factor % 0.05 == 0:
        print("Status: " + str(factor * 100) + "%")
    s += 1
The texts I want to analyze are in the list of strings sentences. The topics I want to search my texts for are in the dataframe dictio: each column starts with the topic name and contains rows of search words. The analysis takes one or two consecutive words and looks for them, with variable word endings, in each string. If the regex matches, the original dataframe df gets a "1" in the corresponding row of the column assigned to that topic. Unlike what I specified in my question, I did not compute word percentages, because I found it added no value for my analysis. Punctuation should be removed from the strings beforehand, but stemming is not necessary. If you have specific questions, please leave a comment and I will edit this code or answer your comment.
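For anyone trying to run this, here is a small, hypothetical setup showing the shape of the inputs the loop above expects (the topic words in dictio are my own examples):

import pandas as pd

sentences = ['Example text where he talks about a phone and his charging cable',
             'Example text where he talks about a car with some wheels']
# One column per topic; rows hold the search words (shorter columns are padded with NaN,
# which is why the loop filters out nan values)
dictio = pd.DataFrame({'phone': ['phone', 'charging cable', 'call'],
                       'car': ['car', 'wheel', 'steering']})
df = pd.DataFrame({'text': sentences})
# After running the loop above, df has one 0/1 column per topic:
#    text                                                phone  car
# 0  Example text where he talks about a phone and ...      1    0
# 1  Example text where he talks about a car with s...      0    1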