如何部分匹配数字到字符串?



灵感来自转位和比较| Python,但现在我想介绍部分匹配的复杂性。

数据:

PreviousData = { 'Item' : ['abc-023','def-78','ghi-012','jkl-100','mno-01','pqr-890','stu-024','vwx-765','yza-789','uaza-400','fupa-499'],
'Summary' : ['party','weekend','food','school','tv','photo','camera','python','r','rstudio','spyder'],
'2022-01-01' : [1, np.nan, np.nan, 1.0, np.nan, 1.0, np.nan, np.nan, np.nan,np.nan,2],
'2022-02-01' : [1,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan],
'2022-03-01' : [np.nan,np.nan,np.nan,1,np.nan,np.nan,1,np.nan,np.nan,np.nan,np.nan],
'2022-04-01' : [np.nan,np.nan,3,np.nan,np.nan,3,np.nan,np.nan,np.nan,np.nan,np.nan],
'2022-05-01' : [np.nan,np.nan,np.nan,3,np.nan,np.nan,2,np.nan,np.nan,3,np.nan],
'2022-06-01' : [np.nan,np.nan,np.nan,np.nan,2,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan],
'2022-07-01' : [np.nan,1,np.nan,np.nan,np.nan,np.nan,1,np.nan,np.nan,np.nan,np.nan],
'2022-08-01' : [np.nan,np.nan,np.nan,1,np.nan,1,np.nan,np.nan,np.nan,np.nan,np.nan],
'2022-09-01' : [np.nan,1,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,1,np.nan],
'2022-10-01' : [np.nan,np.nan,1,np.nan,np.nan,1,np.nan,np.nan,np.nan,np.nan,np.nan],
'2022-11-01' : [np.nan,2,np.nan,np.nan,1,1,1,np.nan,np.nan,np.nan,np.nan],
'2022-12-01' : [np.nan,np.nan,np.nan,np.nan,3,np.nan,np.nan,2,np.nan,np.nan,np.nan],
'2023-01-01' : [np.nan,np.nan,1,np.nan,1,np.nan,np.nan,np.nan,2,np.nan,np.nan],
'2023-02-01' : [np.nan,np.nan,np.nan,2,np.nan,2,np.nan,np.nan,np.nan,np.nan,np.nan],
'2023-03-01' : [np.nan,3,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan],
'2023-04-01' : [np.nan,np.nan,np.nan,1,np.nan,np.nan,np.nan,1,np.nan,np.nan,np.nan],
'2023-05-01' : [np.nan,np.nan,2,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,2,np.nan],
'2023-06-01' : [1,1,np.nan,np.nan,9,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan],
'2023-07-01' : [np.nan,np.nan,np.nan,1,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan],
'2023-08-01' : [np.nan,1,np.nan,np.nan,1,np.nan,1,np.nan,np.nan,np.nan,np.nan],
'2023-09-01' : [np.nan,1,1,np.nan,np.nan,np.nan,np.nan,1,np.nan,np.nan,np.nan],
}
PreviousData = pd.DataFrame(PreviousData)
PreviousData


CurrentData = { 'Item' : ['ghi-012:XYZ','stu-024:Z','abc-023-100','mno-01-100:Z','jkl-100:Z-900','pqr-890-FR','def-78-RF-FR','vwx-765:NCVE','yza-789-YU'],
'Summary' : ['food','camera','party','tv','school','photo','weekend','python','r'],
'2022-01-01' : [3, np.nan, np.nan, 1.0, np.nan, 1.0, np.nan, np.nan, np.nan],
'2022-02-01' : [np.nan,1,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan],
'2022-03-01' : [np.nan,1,1,1,np.nan,np.nan,np.nan,np.nan,np.nan],
'2022-04-01' : [np.nan,np.nan,1,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan],
'2022-05-01' : [np.nan,np.nan,3,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan],
'2022-06-01' : [2,np.nan,np.nan,np.nan,4,np.nan,np.nan,np.nan,np.nan],
'2022-07-01' : [np.nan,np.nan,np.nan,np.nan,np.nan,4,np.nan,np.nan,np.nan],
'2022-08-01' : [np.nan,np.nan,3,np.nan,4,np.nan,np.nan,np.nan,np.nan],
'2022-09-01' : [np.nan,np.nan,3,3,3,np.nan,np.nan,5,5],
'2022-10-01' : [np.nan,np.nan,np.nan,np.nan,5,np.nan,np.nan,np.nan,np.nan],
'2022-11-01' : [np.nan,np.nan,np.nan,5,np.nan,np.nan,np.nan,np.nan,np.nan],
'2022-12-01' : [np.nan,4,np.nan,np.nan,np.nan,1,np.nan,np.nan,np.nan],
'2023-01-01' : [np.nan,np.nan,np.nan,np.nan,1,1,np.nan,np.nan,np.nan],
'2023-02-01' : [np.nan,np.nan,np.nan,2,1,np.nan,np.nan,np.nan,np.nan],
'2023-03-01' : [np.nan,np.nan,np.nan,np.nan,2,np.nan,2,np.nan,2],
'2023-04-01' : [np.nan,np.nan,np.nan,np.nan,np.nan,2,np.nan,np.nan,2],
}
CurrentData = pd.DataFrame(CurrentData)
CurrentData

部分匹配的例子有:abc-023 vs abc-023-100;stu-024 vs stu-024:Z等

代码尝试:

PreviousData_t = PreviousData.melt(id_vars=["Item", "Summary"], 
var_name="Date", 
value_name="value1")

CurrentData_t = CurrentData.melt(id_vars=["Item", "Summary"], 
var_name="Date", 
value_name="value2")

Compare = PreviousData_t.merge(CurrentData_t, on =['Date','Item','Summary'], how = 'left')
Compare['diff'] = np.where(Compare['value1']!=Compare['value2'], 1,0)
#Code Does Not Take Into Account for Partial Matches of Items

任何关于这方面的建议都是非常感谢的。

这是一种集群问题,我将提供一种解决方案。

写完这篇文章后,我想起来这正是Google refine解决的问题。你可以阅读关于开源版本打开细化这里:https://guides.library.illinois.edu/openrefine/clustering

无论如何,首先我将Item列中的所有字符串连接起来,并将它们保存在列表all_items中。

import pandas as pd
import numpy as np
prev = list(PreviousData.Item)
curr = list(CurrentData.Item)
all_items = prev+curr
all_items
['abc-023',
'def-78',
'ghi-012',
'jkl-100',
'mno-01',
'pqr-890',
'stu-024',
'vwx-765',
'yza-789',
'uaza-400',
'fupa-499',
'ghi-012:XYZ',
'stu-024:Z',
'abc-023-100',
'mno-01-100:Z',
'jkl-100:Z-900',
'pqr-890-FR',
'def-78-RF-FR',
'vwx-765:NCVE',
'yza-789-YU']

现在您想要将相似的字符串组合在一起,例如'abc-023''abc-023-100''pqr-890''dpqr-890-FR'。在all_items中,最多有两个类似的字符串,但一般来说,这是一个更复杂的问题,因为一个字符串可以有几个类似的字符串,如何决定哪个字符串是最好的选择?这个问题的解决方案叫做聚类

关于相似性函数:该示例似乎表明,如果一个字符串是另一个字符串的子字符串,则需要匹配两个字符串。一般来说,有许多相似函数,您可以选择最适合您的应用程序的一个。

我将展示一个解决方案,它使用来自sklearnDBSCAN集群和来自difflibSequenceMatcher集群。在这种情况下,这可能是过度的,但它可能对更大的数据集和更复杂的字符串匹配任务有用。

la = len(all_items)
from difflib import SequenceMatcher
from sklearn.cluster import DBSCAN
# this is the distance function between string
diff = lambda i,j: 1 - SequenceMatcher(None, all_items[i], all_items[j]).ratio()
# Note: SequenceMatcher ratio goes from 0 to 1, highest similarity is 1
# but since we’re building a distance matrix, highest similarity=minimal distance
diff_matrix = np.zeros((la, la))
for i in range(la):
for j in range(i, la):
diff_matrix[i,j] = diff(i,j)
diff_matrix[j,i] = diff_matrix[i,j]
pd.DataFrame(diff_matrix) # for pretty-printing (note: this is a symmetric matrix)
# all distances over 0.4 are too far (this means two strings match if SequenceMatcher ratio is >0.6)
db = DBSCAN(eps=0.4, min_samples=2, metric='precomputed').fit(diff_matrix)

现在我们已经聚集了字符串。有多少簇?

# number of clusters is the number of unique labels except for noise
labels = db.labels_
n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
n_noise_ = list(labels).count(-1)
n_clusters_
# 9

因为我们从20个项目开始,我们得到了9个标签,看起来大多数标签都是2对2匹配的。这些是字符串簇:

clusters = {'label'+str(k):[] for k in set(labels)}
for k,v in zip(labels, all_items):
clusters['label'+str(k)].append(v)
clusters
# {‘label0': ['abc-023', 'abc-023-100'],
#  'label1': ['def-78', 'def-78-RF-FR'],
#  'label2': ['ghi-012', 'ghi-012:XYZ'],
#  'label3': ['jkl-100', 'jkl-100:Z-900'],
#  'label4': ['mno-01', 'mno-01-100:Z'],
#  'label5': ['pqr-890', 'pqr-890-FR'],
#  'label6': ['stu-024', 'stu-024:Z'],
#  'label7': ['vwx-765', 'vwx-765:NCVE'],
#  'label8': ['yza-789', 'yza-789-YU'],
#  'label-1': ['uaza-400', 'fupa-499']}

9个字符串2对2匹配,2个字符串不匹配(label =-1)。

创建字典normalized_strings,用于将每个字符串转换为每个集群中的唯一值。我选择每组字符串中的第一个值(例如,在['abc-023', 'abc-023-100’]组中,我选择'abc-023’

normalized_strings = {all_items[k]: clusters['label'+str(labels[k])][0] if labels[k]>-1 else all_items[k] for k in range(len(all_items))}
normalized_strings
# {‘abc-023': 'abc-023',
#  'def-78': 'def-78',
#  'ghi-012': 'ghi-012',
#  'jkl-100': 'jkl-100',
#  'mno-01': 'mno-01',
#  'pqr-890': 'pqr-890',
#  'stu-024': 'stu-024',
#  'vwx-765': 'vwx-765',
#  'yza-789': 'yza-789',
#  'uaza-400': 'uaza-400',
#  'fupa-499': 'fupa-499',
#  'ghi-012:XYZ': 'ghi-012',
#  'stu-024:Z': 'stu-024',
#  'abc-023-100': 'abc-023',
#  'mno-01-100:Z': 'mno-01',
#  'jkl-100:Z-900': 'jkl-100',
#  'pqr-890-FR': 'pqr-890',
#  'def-78-RF-FR': 'def-78',
#  'vwx-765:NCVE': 'vwx-765',
#  'yza-789-YU': 'yza-789'}
有了这本字典,你现在可以"翻译"了。数据框架中的所有字符串。

关于diff功能:

diff = lambda i,j: 1 - SequenceMatcher(None, all_items[i], all_items[j]).ratio()
diff(0, 13)
# 0.2222222222222222

'abc-023''abc-023-100'非常相似,因此距离很小

如果SequenceMatcher开销太大,还可以定义一个更简单的字符串匹配函数,例如,如果一个字符串是另一个字符串的子字符串,则两个字符串匹配

diff = lambda i,j: 1 - int((all_items[i] in all_items[j]) or (all_items[j] in all_items[i]))
diff(0,13) # strings match, no difference
# 0
diff(0,12) # strings do not match
# 1

参见:

  • difflib。SequenceMatcher:一个灵活的类,用于比较任何类型的序列对,只要序列元素是可哈希的
  • sklearnDBSCAN
  • <
  • DBSCAN演示/gh>

最新更新