Clustering lines that contain only a few words each, preferably with Levenshtein distance



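I have a text file in which every line contains a few words. I want to cluster the lines as whole units rather than splitting each line into individual words. I wrote some code, but the output is strange. My code: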

import numpy as np
import sklearn.cluster
import distance
f = open("names.txt", "r")
words = f.read().split(',')
#for line in f:
words = np.asarray(words) #So that indexing with a list will work
lev_similarity = -1*np.array([[distance.levenshtein(w1,w2) for w1 in words] for w2 in words])
affprop = sklearn.cluster.AffinityPropagation(affinity="precomputed", damping=0.5)
affprop.fit(lev_similarity)
for cluster_id in np.unique(affprop.labels_):
    exemplar = words[affprop.cluster_centers_indices_[cluster_id]]
    cluster = np.unique(words[np.nonzero(affprop.labels_==cluster_id)])
    cluster_str = ", ".join(cluster)
    print(" - *%s:* %s" % (exemplar, cluster_str))

Output:

- *BRAZEMAX ESTATYS:*  Inc.,  Inc.
BBAZEMAX ESTATES, BRAZEMAX ESTATYS
- * LTD
Gramkai Books
Bras5emax Estates:*  Jr
John Smith
PC Adelman
Gramkai,  LTD
BOZEMAN Ent.
Gramkat Estates,  LTD
Gramkai Books
Bras5emax Estates
- * L.T.D.
BOZEMAN Enterprises
BOZERMAN ENTERPRISES
Nadelman:*  Inc.
Bozeman Enterprises
Michele LTD
Gramkat,  L.T.D.
BOZEMAN Enterprises
BOZERMAN ENTERPRISES
Nadelman

File:

BRAZEMAX ESTATYS, LTD
Gramkai Books
Bras5emax Estates, L.T.D.
BOZEMAN Enterprises
BOZERMAN ENTERPRISES
Nadelman, Jr
John Smith
PC Adelman
Gramkai, Inc.
Bozeman Enterprises
Michele LTD
Gramkat, Inc.
BBAZEMAX ESTATES, LTD
BOZEMAN Ent.
Gramkat Estates, Inc.

What is going wrong here?

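You probably also need to remove the \n characters. Because the file is split on commas only, the words are still joined together by newlines, which is why you see multi-line entries in the output.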
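To make the cause concrete, here is a minimal sketch that uses only the first three lines of the sample file (hard-coded as a string, as `f.read()` would return them). It shows what `split(',')` alone produces.

```python
# First three lines of the sample file, as f.read() would return them.
text = "BRAZEMAX ESTATYS, LTD\nGramkai Books\nBras5emax Estates, L.T.D."

# Splitting on commas alone leaves the newlines inside the tokens.
tokens = text.split(',')
print(tokens)
# ['BRAZEMAX ESTATYS', ' LTD\nGramkai Books\nBras5emax Estates', ' L.T.D.']
```

The middle token spans three lines of the file, which matches the multi-line exemplar " LTD / Gramkai Books / Bras5emax Estates" in the question's output.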

You can update the code right after reading the file:

original_file = """BRAZEMAX ESTATYS, LTD
Gramkai Books
Bras5emax Estates, L.T.D.
BOZEMAN Enterprises
BOZERMAN ENTERPRISES
Nadelman, Jr
John Smith
PC Adelman
Gramkai, Inc.
Bozeman Enterprises
Michele LTD
Gramkat, Inc.
BBAZEMAX ESTATES, LTD
BOZEMAN Ent.
Gramkat Estates, Inc."""
original_file
'BRAZEMAX ESTATYS, LTD\nGramkai Books\nBras5emax Estates, L.T.D.\nBOZEMAN Enterprises\nBOZERMAN ENTERPRISES\nNadelman, Jr\nJohn Smith\nPC Adelman\nGramkai, Inc.\nBozeman Enterprises\nMichele LTD\nGramkat, Inc.\nBBAZEMAX ESTATES, LTD\nBOZEMAN Ent.\nGramkat Estates, Inc.'
import re
re.split('[\n,]', original_file)
['BRAZEMAX ESTATYS',
' LTD',
'Gramkai Books',
'Bras5emax Estates',
' L.T.D.',
'BOZEMAN Enterprises',
'BOZERMAN ENTERPRISES',
'Nadelman',
' Jr',
'John Smith',
'PC Adelman',
'Gramkai',
' Inc.',
'Bozeman Enterprises',
'Michele LTD',
'Gramkat',
' Inc.',
'BBAZEMAX ESTATES',
' LTD',
'BOZEMAN Ent.',
'Gramkat Estates',
' Inc.']

Now the words are split on both newlines and commas.
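Since the question asks to cluster whole lines, another option is to split on newlines only, so that "Gramkai, Inc." stays one item. The following is a minimal sketch under stated assumptions, not a definitive implementation: `levenshtein` is an illustrative pure-Python helper standing in for `distance.levenshtein`, the two-name sample string is made up from entries in the file, and lowercasing before comparison is an extra assumption that the original code does not make.

```python
def levenshtein(a, b):
    """Edit distance via dynamic programming (illustrative helper)."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

# Hypothetical sample: a few entries from the file, with a stray blank line.
sample = "BOZEMAN Enterprises\nBOZERMAN ENTERPRISES\n\nGramkai, Inc.\n"

# One item per line; surrounding whitespace and empty lines are dropped.
lines = [line.strip() for line in sample.splitlines() if line.strip()]

# Negated distances, so that a larger value means "more similar".
# Lowercasing is an assumption added here to ignore case differences.
similarity = [[-levenshtein(x.lower(), y.lower()) for x in lines]
              for y in lines]
```

The nested list can be converted with `np.array(similarity)` and passed to `AffinityPropagation(affinity="precomputed")` in the same way as `lev_similarity` in the question's code.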
