Simplest way to find filenames that differ by only one word



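I am trying to write a Python script that compares a list of filenames against itself and pulls out any filenames that either match exactly or differ by only one word.

Something like this: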

def FindCloseMatches(list_in_question):
    match_list = []
    list_one = [{x: x.split()} for x in list_in_question]
    list_two = [{x: x.split()} for x in list_in_question]
    
    # pseudo-ish
    for x, y in zip(list_one, list_two):
        if x.values in list_one match all but one of y.values in list_two:
            match_list.append(x, y)

How can I compare two lists of filenames and find any names that differ by one word or less?

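For example, if I have files named WaterServiceLines.pdf and CustomerWaterServiceLines.pdf (the names differ in formatting, such as spaces and underscores), those would be a match. But [...] and [...] would not.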

Something like this?

This assumes that every word starts with a capital letter.

import Levenshtein
import re
def FindCloseMatches(filenames):
    # remove file types from filenames (split on the last dot only,
    # so names that contain extra dots keep their full base name)
    filenames = [x.rsplit('.', 1)[0] for x in filenames]
    # split filenames into words
    for i in range(len(filenames)):
        filenames[i] = [s for s in re.split("([A-Z][^A-Z]*)", filenames[i]) if s != '']
    # compare each element in the list to itself
    # count the number of words that are different
    # if the number of words is 1 or less, then it is a match
    matches = []
    for i in range(len(filenames)):
        for j in range(i + 1, len(filenames)):
            if Levenshtein.distance(filenames[i], filenames[j]) <= 1:
                # combine the words back into a single string
                matches.append((''.join(filenames[i]), ''.join(filenames[j])))
    return matches
l = ['WaterServiceLines.pdf', 'CustomerWaterServiceLines.pdf', 'SewerMainLines.pdf', 'WaterServiceLines.pdf']
print(FindCloseMatches(l))

Output:

[('WaterServiceLines', 'CustomerWaterServiceLines'), ('WaterServiceLines', 'WaterServiceLines'), ('CustomerWaterServiceLines', 'WaterServiceLines')]

Install Levenshtein with pip install levenshtein.
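If installing the package is not an option, the same word-level comparison can be sketched in plain Python. The helper name `word_edit_distance` is an assumption made for illustration and is not part of the answer above. It is the standard dynamic-programming edit distance, applied to lists of words instead of characters.

```python
def word_edit_distance(a, b):
    """Edit distance between two sequences of words.

    Counts the minimum number of word insertions, deletions and
    substitutions needed to turn `a` into `b`.
    """
    # previous[j] is the distance between the words of `a` seen so far
    # and the first j words of `b`
    previous = list(range(len(b) + 1))
    for i, word_a in enumerate(a, start=1):
        current = [i]
        for j, word_b in enumerate(b, start=1):
            cost = 0 if word_a == word_b else 1
            current.append(min(
                previous[j] + 1,         # delete word_a
                current[j - 1] + 1,      # insert word_b
                previous[j - 1] + cost,  # substitute (or keep) the word
            ))
        previous = current
    return previous[-1]


print(word_edit_distance(["Water", "Service", "Lines"],
                         ["Customer", "Water", "Service", "Lines"]))  # 1
print(word_edit_distance(["Water", "Service", "Lines"],
                         ["Sewer", "Main", "Lines"]))                 # 2
```

A result of 1 or less corresponds to the "one word or less" rule from the question.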

If you want the file types in the output:

import Levenshtein
import re
def FindCloseMatches(filenames):
    # keep the original names so each match is reported with its file type,
    # even when two files share a base name but differ in extension
    originals = list(filenames)

    # remove file types, then split each filename into words
    words = []
    for filename in originals:
        base = filename.rsplit('.', 1)[0]
        words.append([s for s in re.split("([A-Z][^A-Z]*)", base) if s != ''])

    # compare every pair of filenames once
    # a word-level edit distance of 1 or less counts as a match
    matches = []
    for i in range(len(words)):
        for j in range(i + 1, len(words)):
            if Levenshtein.distance(words[i], words[j]) <= 1:
                matches.append((originals[i], originals[j]))
    return matches
l = ['WaterServiceLines.pdf', 'CustomerWaterServiceLines.pdf', 'SewerMainLines.pdf', 'WaterServiceLines.pdf']
print(FindCloseMatches(l))

Output:

[('WaterServiceLines.pdf', 'CustomerWaterServiceLines.pdf'), ('WaterServiceLines.pdf', 'WaterServiceLines.pdf'), ('CustomerWaterServiceLines.pdf', 'WaterServiceLines.pdf')]
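The question mentions that the real filenames differ in formatting (spaces, underscores and so on), while the code above only splits on capital letters. A minimal sketch of a more forgiving splitter follows. The name `split_words` and the choice of separators (whitespace, underscores, hyphens) are assumptions for illustration, and the words are lowercased so that differences in case do not count as differences in wording.

```python
import re


def split_words(filename):
    """Split a filename into lowercase words (hypothetical helper)."""
    # drop the extension, if there is one
    stem = filename.rsplit(".", 1)[0]
    # put a space at every lowercase-to-uppercase boundary,
    # e.g. "waterService" becomes "water Service"
    stem = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", stem)
    # treat whitespace, underscores and hyphens as separators
    return [w.lower() for w in re.split(r"[\s_\-]+", stem) if w]


print(split_words("WaterServiceLines.pdf"))
# ['water', 'service', 'lines']
print(split_words("customer_water service-lines.pdf"))
# ['customer', 'water', 'service', 'lines']
```

The resulting word lists could be compared in place of the lists produced by the capital-letter split, so that WaterServiceLines.pdf and water_service_lines.pdf would be treated as the same name.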
