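I'm trying to write a Python script that compares a list of filenames against itself and pulls out any filenames that match exactly or differ by only one word.
Something like this: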
def FindCloseMatches(list_in_question):
    match_list = []
    list_one = [{x: x.split()} for x in list_in_question]
    list_two = [{x: x.split()} for x in list_in_question]
    # pseudo-ish
    for x, y in zip(list_one, list_two):
        if x.values in list_one match all but one of y.values in list_two:
            match_list.append((x, y))
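How can I compare two lists of filenames and find any filenames that differ by one word or less?
For example, if I have files named WaterServiceLines.pdf and CustomerWaterServiceLines.pdf (they are formatted differently in terms of spaces, underscores, and so on), those would be a match. But CCD_ 3 and CCD_ would not.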
Something like this?
This assumes every word in a filename starts with a capital letter.
import re

import Levenshtein


def FindCloseMatches(filenames):
    # remove file types from filenames (keeps the text before the first '.')
    filenames = [x.split('.')[0] for x in filenames]
    # split filenames into words at each capital letter
    for i in range(len(filenames)):
        filenames[i] = [s for s in re.split("([A-Z][^A-Z]*)", filenames[i]) if s != '']
    # compare every pair of filenames in the list
    # the distance between two word lists counts the words that must be
    # inserted, deleted, or replaced to turn one into the other
    # if that number is 1 or less, then it is a match
    matches = []
    for i in range(len(filenames)):
        for j in range(i + 1, len(filenames)):
            if Levenshtein.distance(filenames[i], filenames[j]) <= 1:
                # combine words into a string
                matches.append((''.join(filenames[i]), ''.join(filenames[j])))
    return matches


l = ['WaterServiceLines.pdf', 'CustomerWaterServiceLines.pdf', 'SewerMainLines.pdf', 'WaterServiceLines.pdf']
print(FindCloseMatches(l))
Output:
[('WaterServiceLines', 'CustomerWaterServiceLines'), ('WaterServiceLines', 'WaterServiceLines'), ('CustomerWaterServiceLines', 'WaterServiceLines')]
Install Levenshtein with pip install levenshtein
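The question mentions names that differ in spaces and underscores, while the split above only handles capitalised words. A minimal sketch of a more forgiving splitter follows. The helper name `normalize_words` is hypothetical, and lowercasing the words is an assumption that case should not count as a difference.

```python
import re


def normalize_words(filename):
    """Split a filename into lowercase words, ignoring the file type.

    Words may be separated by spaces, underscores, hyphens,
    or the start of a capital letter.
    """
    # drop the file type (text after the last '.')
    stem = filename.rsplit('.', 1)[0]
    words = []
    # break on explicit separators first
    for chunk in re.split(r'[\s_\-]+', stem):
        # then break each chunk at capital letters
        words.extend(re.findall(r'[A-Z][^A-Z]*|[^A-Z]+', chunk))
    # assumption: case differences should not count as different words
    return [w.lower() for w in words]


print(normalize_words('WaterServiceLines.pdf'))
print(normalize_words('water_service_lines.pdf'))
print(normalize_words('Customer Water-Service_Lines.pdf'))
```

The resulting word lists can be compared with Levenshtein.distance in place of the re.split result. Note that acronyms such as 'PDF' are split into single letters, just as with the original regular expression.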
If you want the file types in the output:
import re

import Levenshtein


def FindCloseMatches(filenames):
    # keep the original names (with file types) so they can be returned as-is
    # a dictionary keyed by the bare name would mix up files such as
    # 'Report.pdf' and 'Report.docx', so look names up by position instead
    originals = list(filenames)
    # remove file types from filenames (keeps the text before the first '.')
    names = [x.split('.')[0] for x in originals]
    # split filenames into words at each capital letter
    words = [[s for s in re.split("([A-Z][^A-Z]*)", name) if s != ''] for name in names]
    # compare every pair of filenames in the list
    # count the number of words that are different
    # if the number of words is 1 or less, then it is a match
    matches = []
    for i in range(len(words)):
        for j in range(i + 1, len(words)):
            if Levenshtein.distance(words[i], words[j]) <= 1:
                # report the original filenames, file types included
                matches.append((originals[i], originals[j]))
    return matches


l = ['WaterServiceLines.pdf', 'CustomerWaterServiceLines.pdf', 'SewerMainLines.pdf', 'WaterServiceLines.pdf']
print(FindCloseMatches(l))
Output:
[('WaterServiceLines.pdf', 'CustomerWaterServiceLines.pdf'), ('WaterServiceLines.pdf', 'WaterServiceLines.pdf'), ('CustomerWaterServiceLines.pdf', 'WaterServiceLines.pdf')]
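If installing a package is not an option, the same word-level comparison can be sketched with the standard library alone. The names `split_words`, `word_distance` and `find_close_matches` are hypothetical, and the distance here is a plain dynamic-programming edit distance over word lists, written for illustration rather than taken from the Levenshtein package.

```python
import re


def split_words(filename):
    """Drop the file type and split the name at each capital letter."""
    stem = filename.split('.')[0]
    return re.findall(r'[A-Z][^A-Z]*|[^A-Z]+', stem)


def word_distance(a, b):
    """Edit distance between two word lists (insert, delete, replace)."""
    previous = list(range(len(b) + 1))
    for i, word_a in enumerate(a, start=1):
        current = [i]
        for j, word_b in enumerate(b, start=1):
            cost = 0 if word_a == word_b else 1
            current.append(min(
                previous[j] + 1,         # delete word_a
                current[j - 1] + 1,      # insert word_b
                previous[j - 1] + cost,  # replace, or keep if equal
            ))
        previous = current
    return previous[-1]


def find_close_matches(filenames):
    """Return pairs of filenames whose word lists differ by at most one word."""
    words = [split_words(name) for name in filenames]
    return [(filenames[i], filenames[j])
            for i in range(len(words))
            for j in range(i + 1, len(words))
            if word_distance(words[i], words[j]) <= 1]


l = ['WaterServiceLines.pdf', 'CustomerWaterServiceLines.pdf', 'SewerMainLines.pdf', 'WaterServiceLines.pdf']
print(find_close_matches(l))
```

For the example list this should report the same three pairs as the output above. Like the code above, it compares every pair, so the running time grows quadratically with the number of filenames.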