Comparing the texts in a Python list

I want to compare the texts in a Python list with one another. For example:

URL          | Text
www.xyz.com  | "hello bha njik **bhavd bhavd** bjavd manhbd kdkndsik wkjdk"
www.abc.com  | "bhavye jsbsdv sjbs jcsbjd adjbsd jdfhjdb jdshbjf jdsbjf"
www.lokj.com | "bsjgad adhuad jadshjasd kdashda kdajikd kdfsj **bhavd bhavd**"

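Now I want to compare the first text with every other row to find out how many words they have in common, then do the same for the second row against the rows that follow it, and so on...

What approach should I use? Which data structure should I use?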

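For Python 3:

As mentioned in the comments, we generate every possible pair of texts, turn each text into a set so that every word is counted once, and then count the unique words the two texts share. If your list of texts is structured somewhat differently, you may need to adapt this a little.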

import itertools
my_list = ["a text a", "an other text b", "a last text c and so on"]
def simil(text_a, text_b):
    # returns the number of unique words common to the two texts
    return len(set(text_a.split()).intersection(set(text_b.split())))
results = []
# for each unique combination of two texts
for pair in itertools.combinations(my_list, r=2):
    results.append(simil(*pair))
print(results)  # prints [1, 2, 1]

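Side note: depending on what you want to do, you might want to look at algorithms such as tf-idf (a simple tutorial) for text/document similarity, or one of the many others...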
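To give a rough idea of what the tf-idf suggestion involves, here is a minimal sketch using only the standard library. The helper names `tfidf_vectors` and `cosine` are made up for this illustration, and the weighting (raw count times log(N / df)) is just one simple variant; real libraries typically add smoothing and normalisation, so treat this as an assumption-laden sketch rather than a reference implementation.

```python
import math
from collections import Counter

def tfidf_vectors(texts):
    """Return one {word: weight} dict per text, weight = count * log(N / df)."""
    docs = [Counter(text.split()) for text in texts]
    n_docs = len(docs)
    # Document frequency: the number of texts in which each word appears.
    df = Counter(word for doc in docs for word in doc)
    return [
        {word: count * math.log(n_docs / df[word]) for word, count in doc.items()}
        for doc in docs
    ]

def cosine(vec_a, vec_b):
    """Cosine similarity between two sparse {word: weight} vectors."""
    dot = sum(weight * vec_b.get(word, 0.0) for word, weight in vec_a.items())
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

texts = ["a text a", "an other text b", "a last text c and so on"]
vectors = tfidf_vectors(texts)
print(cosine(vectors[0], vectors[1]))
print(cosine(vectors[0], vectors[2]))
```

Unlike the plain word count, a word that occurs in every text (here "text") gets weight zero, so it no longer makes two texts look similar.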

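A good approach may be to use OrderedDict(), which preserves insertion order when you read the dict keys.

By iterating over that dict and comparing the values, you will get the output.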
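The OrderedDict idea could look roughly like the sketch below. The mapping from URL to text mirrors the table in the question; the names `pages`, `common_words` and `overlaps` are assumptions introduced for this example only.

```python
from collections import OrderedDict
from itertools import combinations

# Data copied from the question's table (URL -> text); the names are hypothetical.
pages = OrderedDict([
    ("www.xyz.com", "hello bha njik bhavd bhavd bjavd manhbd kdkndsik wkjdk"),
    ("www.abc.com", "bhavye jsbsdv sjbs jcsbjd adjbsd jdfhjdb jdshbjf jdsbjf"),
    ("www.lokj.com", "bsjgad adhuad jadshjasd kdashda kdajikd kdfsj bhavd bhavd"),
])

def common_words(text_a, text_b):
    """Return the set of unique words that appear in both texts."""
    return set(text_a.split()) & set(text_b.split())

# combinations() follows insertion order: row 1 vs 2, row 1 vs 3, row 2 vs 3.
overlaps = OrderedDict()
for (url_a, text_a), (url_b, text_b) in combinations(pages.items(), 2):
    overlaps[(url_a, url_b)] = common_words(text_a, text_b)

for (url_a, url_b), words in overlaps.items():
    print(url_a, url_b, len(words), sorted(words))
```

Keying the result by the pair of URLs keeps track of which two rows each count belongs to. On Python 3.7 and later a plain dict also preserves insertion order, so OrderedDict is mainly useful here for making that intent explicit.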

One possible approach is to convert each string into a set of words and then take the intersection of the sets.

string_1 = "hello bha njik bhavd bhavd bjavd manhbd kdkndsik wkjdk"
string_2 = "bhavd dskghfski fjfbhskf ewkjhsdkifs fjuekdjsdf ue"
# First split your strings into sets of words
set_1 = set(string_1.split())
set_2 = set(string_2.split())
# Compare the sets to find where they both have the same value
print(set_1 & set_2)
print(set_1.intersection(set_2))
# Both print out {'bhavd'}
