What would be the best implementation of the __hash__ function if the __eq__ function determines equality using edit distance?

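I have a strange requirement: I need to find the common "customers" in two different, very large lists. Each entry in both lists is a Customer object containing the customer's first and last name and their address (broken down by address line, e.g. address_line1, address_line2, etc.). The problem is that the data in the two lists may be incomplete. For example, for one record in the first list the customer's first name may be missing, while in the second list the address (lines 2 and 3) of the same customer may be missing. What I need to do is find the customers that appear in both lists. One thing to note is that the lists may be very large. Another is that names and addresses may be semantically the same yet fail to match under an exact string comparison. For example, a customer's address might appear as B-502 ABC Street in the first list and as B 502 ABC Street in the second. I am using edit distance to account for user input errors in the lists and to handle other minor differences between the data in the two lists.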

What I have done is implement the __eq__ function in the Customer class as follows:

import re

import editdistance  # Using this: https://pypi.python.org/pypi/editdistance


class Customer:
    def __init__(self, fname, lname, address1, address2, address3, city):
        # Remove special characters from all arguments and convert them to lower case
        self.fname = re.sub(r"[^a-zA-Z0-9]", "", fname.lower())
        self.lname = re.sub(r"[^a-zA-Z0-9]", "", lname.lower())
        self.address1 = re.sub(r"[^a-zA-Z0-9]", "", address1.lower())
        self.address2 = re.sub(r"[^a-zA-Z0-9]", "", address2.lower())
        self.address3 = re.sub(r"[^a-zA-Z0-9]", "", address3.lower())
        self.city = re.sub(r"[^a-zA-Z0-9]", "", city.lower())

    def __eq__(self, other):
        # The last name must be present and match exactly
        if self.lname == "" or self.lname != other.lname:
            return False
        # Accumulate a similarity score from the remaining fields
        t = 0
        if self.fname != "" and other.fname != "" and self.fname[0] == other.fname[0]:
            t += 1
        if editdistance.eval(self.fname, other.fname) <= 2:
            t += 3
        if editdistance.eval(self.address1, other.address1) <= 3:
            t += 1
        if editdistance.eval(self.address2, other.address2) <= 3:
            t += 1
        if editdistance.eval(self.address3, other.address3) <= 3:
            t += 1
        if editdistance.eval(self.city, other.city) <= 2:
            t += 1
        return t >= 4

    def __hash__(self):
        # TODO: Have a robust implementation of a hash function here.
        # If two objects are "equal", their hashes should be the same.
        raise NotImplementedError

To get the customers that are present in both lists, I do the following:

set(first_list).intersection(set(second_list))

However, for this to work, the Customer objects need to be hashable.
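To illustrate the hashability requirement, here is a minimal sketch, assuming Python 3. The class name WithEqOnly is hypothetical and only stands in for a class that defines __eq__ but no __hash__; in that case Python sets __hash__ to None and building a set from such objects raises TypeError.

```python
class WithEqOnly:
    """Hypothetical stand-in: defines __eq__ but not __hash__."""

    def __init__(self, name):
        self.name = name

    def __eq__(self, other):
        return self.name == other.name


try:
    set([WithEqOnly("a")])
    hashable = True
except TypeError:
    # Python 3 sets __hash__ to None when only __eq__ is defined
    hashable = False
```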

Can someone help me with a good hashing mechanism?

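Your only option is to normalize the data. If you need to compare for equality and the data may come in different formats, the solution is normalization: transform everything so that it ends up in the same format in both lists.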
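A minimal sketch of what hashing on normalized data could look like. A threshold on edit distance is not transitive (a can be close to b, and b to c, while a is far from c), so no hash can stay consistent with it; exact equality on a normalized key does not have that problem. The names normalize and NormalizedCustomer, the reduced field list, and the sample records are assumptions made up for illustration. This sketch does not handle typos or missing fields, which would need a richer normalization step.

```python
import re


def normalize(text):
    """Lowercase and drop everything except ASCII letters and digits."""
    return re.sub(r"[^a-z0-9]", "", text.lower())


class NormalizedCustomer:
    """Hypothetical variant of Customer: equality is exact on normalized fields."""

    def __init__(self, fname, lname, address1, city):
        self.key = (
            normalize(fname),
            normalize(lname),
            normalize(address1),
            normalize(city),
        )

    def __eq__(self, other):
        if not isinstance(other, NormalizedCustomer):
            return NotImplemented
        return self.key == other.key

    def __hash__(self):
        # Equal objects have equal keys, so they also have equal hashes
        return hash(self.key)


# Made-up sample records; only the address strings come from the question
first_list = [NormalizedCustomer("John", "Smith", "B-502 ABC Street", "Springfield")]
second_list = [NormalizedCustomer("john", "SMITH", "B 502 ABC Street", "springfield")]
common = set(first_list).intersection(second_list)
```

With this key, both address spellings from the question collapse to the same string, so the set intersection finds the customer in both lists.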

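I have been working for several months on a normalization algorithm for addresses in Spain. The combinations of different user inputs for the same address are endless (I am working with a database of 7 million rows). That distance function may not be accurate enough.

The first key question is: what percentage of errors can you afford? With user input and large amounts of data you will always have some.

The next step would be to measure the error percentage you get with that distance algorithm (or any other). Choose the sample data carefully, so that the percentage does not change when you run on the complete data.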

If the percentage you get with that algorithm is acceptable, use it; if not, look for other algorithms and measure them too.
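The measurement step can be sketched roughly as follows, assuming you have a hand-labeled sample of record pairs. The function names error_rate and exact_after_cleanup, the placeholder matcher, and the three sample pairs are all made up for illustration; any matcher (edit distance or otherwise) can be passed in the same way.

```python
def error_rate(labeled_pairs, is_match):
    """Fraction of labeled pairs where is_match disagrees with the human label."""
    if not labeled_pairs:
        raise ValueError("need at least one labeled pair")
    errors = sum(
        1 for a, b, expected in labeled_pairs if is_match(a, b) != expected
    )
    return errors / len(labeled_pairs)


def exact_after_cleanup(a, b):
    """Placeholder matcher: compare case-insensitively, ignoring spaces and hyphens."""
    def clean(s):
        return s.lower().replace(" ", "").replace("-", "")

    return clean(a) == clean(b)


# Hypothetical hand-labeled sample: (address_a, address_b, same_customer)
sample = [
    ("B-502 ABC Street", "B 502 ABC Street", True),
    ("B-502 ABC Street", "B-503 ABC Street", False),
    ("B-502 ABC Street", "B502 ABC St.", True),
]
rate = error_rate(sample, exact_after_cleanup)
```

Running the same labeled sample through each candidate matcher gives comparable error percentages, which is the comparison the steps above call for.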
