Check if a string is in a list, depending on its last two characters

Set-up

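I'm using Scrapy to scrape housing ads. For each ad I retrieve a postal code, which consists of four digits followed by 2 letters, e.g. 1053ZM.

I have an Excel sheet linking districts to postal codes in the following way,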

district    postcode_min    postcode_max
A           1011AB          1011BD
A           1011BG          1011CE
A           1011CH          1011CZ

So the second line states that the postal codes 1011AB, 1011AC, ..., 1011AZ, 1011BA, ..., 1011BD belong to district A.

The actual list contains 1214 rows.

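Problem

I'd like to match each ad with its district, using the ad's postal code and the list.

I'm not sure what the best way to do this is, or how to do it.

I've come up with two different approaches: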

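1. Create all postal codes between postcode_min and postcode_max, assign all postal codes with their respective districts to a dictionary, and then match using a loop.

That is, create,

    d = {'A': ['1011AB','1011AC',...,'1011BD',
               '1011BG','1011BH',...,'1011CE',
               '1011CH','1011CI',...,'1011CZ'],
         'B': [...],
         }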

and then

    found = False
    for distr in d.keys(): # loop over districts
        for code in d[distr]: # loop over district's postal codes
            if postal_code in code: # assign if ad's postal code in code
                district = distr
                found = True
                break
            else:
                district = 'unknown'
        if found:
            break

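2. Let Python know that there is a range between postcode_min and postcode_max, assign the ranges with their respective districts to a dictionary, and match using a loop.

That is, something like,

    d = {'A': [range(1011AB,1011BD), range(1011BG,1011CE), range(1011CH,1011CZ)],
         'B': [...]
         }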

and then

    found = False
    for distr in d.keys(): # loop over districts
        for code_range in d[distr]: # loop over district's ranges
            if postal_code in code_range: # assign if ad's postal code in range
                district = distr
                found = True
                break
            else:
                district = 'unknown'
        if found:
            break

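Questions

For approach 1:

How do I create all the postal codes and assign them to a dictionary?

For approach 2:

I've used range() for explanatory purposes, but I know range() doesn't work like this.

What do I need to effectively get a range() like the one in the example above?
How do I correctly loop over these ranges?

I think I prefer approach 2, but I'm happy to work with either one. Or with another solution, if you have one.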
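For approach 1, the postal codes between two bounds can be enumerated by walking the two-letter suffixes in alphabetical order. The following is only a minimal sketch: the helper name `expand_postcodes`, the `rows` sample and the `postcode_to_district` dictionary are made up for illustration, and it assumes that both bounds of a row share the same four digits, as they do in the sample rows above.

```python
from itertools import product
from string import ascii_uppercase


def expand_postcodes(postcode_min, postcode_max):
    """Return every postcode from postcode_min to postcode_max, inclusive.

    Assumption: both bounds share the same four-digit prefix.
    """
    prefix = postcode_min[:4]
    if postcode_max[:4] != prefix:
        raise ValueError("bounds must share the same four digits")
    low, high = postcode_min[4:], postcode_max[4:]
    # product() yields the suffixes AA, AB, ..., ZZ in alphabetical order
    suffixes = ("".join(pair) for pair in product(ascii_uppercase, repeat=2))
    return [prefix + s for s in suffixes if low <= s <= high]


# Hypothetical rows in the shape of the table above
rows = [
    ("A", "1011AB", "1011BD"),
    ("A", "1011BG", "1011CE"),
    ("A", "1011CH", "1011CZ"),
]

# Map each postcode to its district, so a lookup is a single dict access
postcode_to_district = {
    code: district
    for district, low, high in rows
    for code in expand_postcodes(low, high)
}

print(postcode_to_district.get("1011AC", "unknown"))  # A
print(postcode_to_district.get("1011BE", "unknown"))  # unknown
```

Keying the dictionary by postcode rather than by district removes the nested loop from the lookup, at the cost of storing every individual code.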

You can collect the values from the Excel sheet like this

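    # Each district maps to a flat list of bounds: [min1, max1, min2, max2, ...]
    d = {'A': ['1011AB', '1011BD', '1011BG', '1011CE', '1011CH', '1011CZ'],
         'B': ['1061WB', '1061WB'],
         }

    def is_in_postcode_range(current_postcode, low, high):
        # Postcodes of equal length compare correctly as plain strings
        return low <= current_postcode <= high

    def get_district_by_post_code(postcode):
        for district, codes in d.items():
            first_code = codes[0]
            last_code = codes[-1]
            # Cheap pre-check against the district's overall span
            if is_in_postcode_range(postcode, first_code, last_code):
                # Step through the (min, max) pairs of this district
                if any(is_in_postcode_range(postcode, codes[i], codes[i+1])
                       for i in range(0, len(codes), 2)):
                    return district
        # No district matched; keep checking later districts before giving up
        return None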

Usage:

    print(get_district_by_post_code('1011AC'))  # A
    print(get_district_by_post_code('1011BE'))  # None
    print(get_district_by_post_code('1061WB'))  # B
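With 1214 rows, the same string comparison can also be combined with a binary search instead of a scan over every district. This is a sketch of a different technique (a sorted list plus the standard-library `bisect` module), not part of the answer above. The `ranges` sample and the name `find_district` are hypothetical, and it assumes the ranges do not overlap and that both bounds are inclusive.

```python
from bisect import bisect_right

# Hypothetical rows: (postcode_min, postcode_max, district), bounds inclusive
ranges = sorted([
    ("1011AB", "1011BD", "A"),
    ("1011BG", "1011CE", "A"),
    ("1011CH", "1011CZ", "A"),
    ("1061WB", "1061WB", "B"),
])
# Lower bounds only, in the same sorted order, for the binary search
starts = [low for low, _high, _district in ranges]


def find_district(postcode):
    """Return the district whose range contains postcode, or None."""
    # Index of the last range whose lower bound is <= postcode
    i = bisect_right(starts, postcode) - 1
    if i < 0:
        return None
    low, high, district = ranges[i]
    return district if low <= postcode <= high else None


print(find_district("1011AC"))  # A
print(find_district("1011BE"))  # None
print(find_district("1061WB"))  # B
```

Because the ranges are assumed not to overlap, only the one range that starts at or before the postcode needs to be checked.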

You can use an intervaltree to get better lookup speed, and interpret the postal codes as numbers in base 36 (10 digits and 26 letters).

    from intervaltree import IntervalTree

    t = IntervalTree()
    for district, postcode_min, postcode_max in your_district_table:
        # We read the postcode as a number in base 36
        postcode_min = int(postcode_min, 36)
        postcode_max = int(postcode_max, 36)
        t[postcode_min:postcode_max] = district

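If the postal codes are inclusive (that is, the "max" postcode itself belongs to the district), use the following inside the loop instead:

        t[postcode_min:postcode_max+1] = district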

Finally, you can look up the district for a post_code as follows:

    def get_district(post_code):
        intervals = t[int(post_code, 36)]
        if not intervals:
            return None
        # I assume you have only one district that matches a postal code
        # A point query returns a set of Interval objects, so take one rather than indexing
        return next(iter(intervals)).data # The value stored on the matching interval
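The base-36 reading works because digits sort before letters, so the numeric order matches the alphabetical order of the codes. The short check below uses only built-in `int()`; the helper name `to_number` is made up for illustration. It also shows why an inclusive upper bound needs the `+1`: intervaltree intervals are half-open, so the end value itself is not part of the interval.

```python
def to_number(postcode):
    """Read a postcode such as '1011AB' as a base-36 integer."""
    return int(postcode, 36)


# The numeric order matches the alphabetical order of the codes
codes = ["1011AB", "1011AZ", "1011BA", "1011BD"]
numbers = [to_number(c) for c in codes]
print(numbers == sorted(numbers))  # True

# Neighbouring letters differ by exactly 1
print(to_number("1011AC") - to_number("1011AB"))  # 1

# A half-open interval [begin, end) leaves out its upper bound, so a
# single-postcode row such as 1061WB..1061WB needs end = max + 1
begin = to_number("1061WB")
end_exclusive = to_number("1061WB") + 1
print(begin <= to_number("1061WB") < end_exclusive)  # True
```

Note that the number after 1011AZ is 1011B0 rather than 1011BA. Such values never occur as real postcodes, so the gaps are harmless for lookups.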
