Applying more weight to numbers than to strings in cosine similarity



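I have a program that pulls addresses from the internet and checks them against a database. This works well, but I am now trying to introduce a similarity function that compares the addresses found online with the addresses in the database.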

I use the following script to check how similar the addresses are by cosine similarity:
import string

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

addresses = [
    '705 Sherlock House, 221B Baker Street, London NW1 6XE',
    '75 Sherlock House, 221B Baker Street, London NW1 6XE',
    'Apartment 704 Sherlock House, 221B Baker Street, London NW1 6XE',
    'Apartment 705 Sherlock House, 221B Baker Street, London NW1 6XE',
    '705, 221B Baker Street, London NW1 6XE',
    '75, 221B Baker Street, London NW1 6XE',
    '705 Watson House, 219 Baker Street, London NW1 6XE',
    '32 Baker Street, London NW1 6XE',
    '1060 West Addison, London, W2 6SR',
    '705 Sherlock Hse, Baker Street, London, NW1'
]


def clean_address(text):
    # Strip punctuation character by character, then lowercase.
    text = ''.join([char for char in text if char not in string.punctuation])
    return text.lower()


cleaned = list(map(clean_address, addresses))

# Bag-of-words counts: one row per address, one column per token.
vectorizer = CountVectorizer()
vectors = vectorizer.fit_transform(cleaned).toarray()


def cosine_sim_vectors(vec1, vec2):
    # cosine_similarity expects 2-D inputs, so turn each 1-D vector into a single row.
    vec1 = vec1.reshape(1, -1)
    vec2 = vec2.reshape(1, -1)
    return cosine_similarity(vec1, vec2)[0][0]


# Compare the first address against each of the others.
for other, vector in zip(addresses[1:], vectors[1:]):
    similarity = cosine_sim_vectors(vectors[0], vector)
    print("{} is {:.1f}% similar to {}".format(addresses[0], similarity * 100, other))

The output is:

705 Sherlock House, 221B Baker Street, London NW1 6XE is 88.9% similar to 75 Sherlock House, 221B Baker Street, London NW1 6XE
705 Sherlock House, 221B Baker Street, London NW1 6XE is 84.3% similar to Apartment 704 Sherlock House, 221B Baker Street, London NW1 6XE
705 Sherlock House, 221B Baker Street, London NW1 6XE is 94.9% similar to Apartment 705 Sherlock House, 221B Baker Street, London NW1 6XE
705 Sherlock House, 221B Baker Street, London NW1 6XE is 88.2% similar to 705, 221B Baker Street, London NW1 6XE
705 Sherlock House, 221B Baker Street, London NW1 6XE is 75.6% similar to 75, 221B Baker Street, London NW1 6XE
705 Sherlock House, 221B Baker Street, London NW1 6XE is 77.8% similar to 705 Watson House, 219 Baker Street, London NW1 6XE
705 Sherlock House, 221B Baker Street, London NW1 6XE is 68.0% similar to 32 Baker Street, London NW1 6XE
705 Sherlock House, 221B Baker Street, London NW1 6XE is 13.6% similar to 1060 West Addison, London, W2 6SR
705 Sherlock House, 221B Baker Street, London NW1 6XE is 75.6% similar to 705 Sherlock Hse, Baker Street, London, NW1

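It does a reasonable job, since I would probably take a closer look at anything above 60-70%, and I am impressed that it nearly caught my deliberate attempts to fool it with 705 Watson House and 705 Sherlock Hse. Still, I think the algorithm would improve if it recognised that, for example, 705 is more important than London or, given that I could simply remove London, than 6XE.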

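I am also open to using a different similarity function if there is a more suitable one, since I understand that cosine similarity converts the strings to vectors and treats every token essentially equally.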

Out of the box, cosine similarity is no good at putting more weight on one part of my address string.
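One way the weighting could look, as a minimal sketch rather than a recommended solution: scale the vector components of purely numeric tokens before taking the cosine. The helper names `weighted_counts` and `weighted_cosine`, the `number_weight` parameter, and the value 3.0 are all assumptions made for illustration. The sketch uses only the standard library, with a token pattern chosen to mirror the default of `CountVectorizer` (tokens of two or more word characters), so with `number_weight=1.0` it should reproduce the plain bag-of-words cosine.

```python
import math
import re
from collections import Counter

# Tokens of two or more word characters, mirroring CountVectorizer's default pattern.
TOKEN_RE = re.compile(r"\b\w\w+\b")


def weighted_counts(text, number_weight=1.0):
    """Token counts, with purely numeric tokens scaled by number_weight (hypothetical helper)."""
    counts = Counter(TOKEN_RE.findall(text.lower()))
    return {
        token: count * (number_weight if token.isdigit() else 1.0)
        for token, count in counts.items()
    }


def weighted_cosine(a, b, number_weight=1.0):
    """Cosine similarity of two strings over the weighted token counts."""
    va = weighted_counts(a, number_weight)
    vb = weighted_counts(b, number_weight)
    dot = sum(va[token] * vb[token] for token in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(value * value for value in va.values()))
    norm_b = math.sqrt(sum(value * value for value in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


target = "705 Sherlock House, 221B Baker Street, London NW1 6XE"
wrong_number = "75 Sherlock House, 221B Baker Street, London NW1 6XE"
same_number = "Apartment 705 Sherlock House, 221B Baker Street, London NW1 6XE"

for other in (wrong_number, same_number):
    plain = weighted_cosine(target, other)
    boosted = weighted_cosine(target, other, number_weight=3.0)  # 3.0 is an arbitrary choice
    print("{}: plain {:.1f}%, numbers weighted {:.1f}%".format(other, plain * 100, boosted * 100))
```

Only tokens made up entirely of digits (such as 705) are boosted here; mixed tokens such as 221b or 6xe keep a weight of 1. Whether those should count as numbers too is a judgment call that this sketch does not settle.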

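For this purpose, cosine similarity is a better algorithm than string edit distance, because "75 Sherlock House, 221B Baker Street, London NW1 6XE" is no more similar to "705 Sherlock House, 221B Baker Street, London NW1 6XE" than "Apartment 705 Sherlock House, 221B Baker Street, London NW1 6XE" is, and cosine similarity captures that intuition.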
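The edit-distance side of that comparison can be checked with a short sketch. The `levenshtein` function below is a plain textbook dynamic-programming implementation written for illustration, not taken from any particular library, and the variable names are assumptions.

```python
def levenshtein(a, b):
    """Edit distance counting single-character inserts, deletes, and substitutions."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, char_a in enumerate(a, start=1):
        curr = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            curr.append(min(
                prev[j] + 1,         # delete char_a
                curr[j - 1] + 1,     # insert char_b
                prev[j - 1] + cost,  # substitute, or keep when the characters match
            ))
        prev = curr
    return prev[-1]


target = "705 Sherlock House, 221B Baker Street, London NW1 6XE"
wrong_number = "75 Sherlock House, 221B Baker Street, London NW1 6XE"
same_number = "Apartment 705 Sherlock House, 221B Baker Street, London NW1 6XE"

distance_wrong = levenshtein(target, wrong_number)
distance_same = levenshtein(target, same_number)
print("wrong flat number:", distance_wrong)
print("same flat number with 'Apartment' prefix:", distance_same)
```

By edit distance the address with the wrong flat number is a single character away from the target, while the one that only adds the word "Apartment" is ten characters away. The cosine scores above rank them the other way round (88.9% against 94.9%).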
