Generate a tree from a list of URLs



I have a list containing a number of URLs; some directories hold several files with different extensions, and so on. Example:

List = [
"http://www.example.com/folder1",
"http://www.example.com/folder1",
"http://www.example.com/folder1/folder2",
"http://www.example.com/folder1/folder2/folder3",
"http://www.example.com/folder1/folder2",
"http://www.example.com/folder1/folder2/image1.png",
"http://www.example.com/folder1/folder2/image2.png",
"http://www.example.com/folder1/folder2/file.txt",
"http://www.example.com/folder1/folder2/folder3",
"http://www.example.com/folder1/folder2/folder3/file1.txt",
"http://www.example.com/folder1/folder2/folder3/file2.txt",
"http://www.example.com/folder1/folder2/folder3/file3.txt",
...
]

What I'm trying to achieve is to filter these URLs down to a list that contains only the folder URLs plus one URL per distinct extension, like this:

List = [
"http://www.example.com/folder1",
"http://www.example.com/folder1/folder2",
"http://www.example.com/folder1/folder2/image1.png",
"http://www.example.com/folder1/folder2/file.txt",
"http://www.example.com/folder1/folder2/folder3",
"http://www.example.com/folder1/folder2/folder3/file1.txt",
...
]

At the moment I'm stuck on how to generate some kind of tree out of this, so that I can traverse it and drop the duplicate files.

I've tried a few different approaches, but I'm still fairly new to Python.

Thanks :)

You can use itertools.groupby together with recursion:

import itertools, re

data = ['http://www.example.com/folder1', 'http://www.example.com/folder1',
        'http://www.example.com/folder1/folder2', 'http://www.example.com/folder1/folder2/folder3',
        'http://www.example.com/folder1/folder2', 'http://www.example.com/folder1/folder2/image1.png',
        'http://www.example.com/folder1/folder2/image2.png', 'http://www.example.com/folder1/folder2/file.txt',
        'http://www.example.com/folder1/folder2/folder3', 'http://www.example.com/folder1/folder2/folder3/file1.txt',
        'http://www.example.com/folder1/folder2/folder3/file2.txt', 'http://www.example.com/folder1/folder2/folder3/file3.txt']

def group(d, path=[]):
    # group rows by their first path segment, keeping the remaining segments
    new_d = [[a, [j for _, *j in b]] for a, b in itertools.groupby(sorted(d, key=lambda x: x[0]), key=lambda x: x[0])]
    for a, c in new_d:
        _d, _fold, _path = [i[0] for i in c if len(i) == 1], [], []
        for i in _d:
            if not re.findall(r'\.\w+$', i):   # no extension -> treat as a folder
                if i not in _fold:
                    yield '/'.join(path + [a] + [i])
                    _fold.append(i)
            else:                              # file: keep one URL per extension
                if i.split('.')[-1] not in _path:
                    yield '/'.join(path + [a] + [i])
                    _path.append(i.split('.')[-1])
        r = [i for i in c if len(i) != 1]      # deeper paths: recurse one level down
        yield from group(r, path + [a])

# split each URL into [domain, segment, segment, ...]
_data = [[a, *b.split('/')] for a, b in map(lambda x: re.split(r'(?<=\.com)/', x), data)]
print(list(group(_data)))

Output:

['http://www.example.com/folder1', 
'http://www.example.com/folder1/folder2', 
'http://www.example.com/folder1/folder2/folder3', 
'http://www.example.com/folder1/folder2/image1.png', 
'http://www.example.com/folder1/folder2/file.txt', 
'http://www.example.com/folder1/folder2/folder3/file1.txt']
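Since the question asks specifically about building a tree, here is a minimal sketch of that alternative (the names build_tree and walk are purely illustrative, and it assumes the same heuristic as above: a path segment without an extension is a folder). It first builds a nested dict keyed by path segment, then walks it, yielding every folder and the first file of each distinct extension per directory:

import os
from urllib.parse import urlparse

def build_tree(urls):
    # Nested dict keyed by path segment; duplicate URLs collapse via setdefault.
    tree = {}
    for url in urls:
        parts = urlparse(url)
        node = tree.setdefault(parts.scheme + '://' + parts.netloc, {})
        for segment in filter(None, parts.path.split('/')):
            node = node.setdefault(segment, {})
    return tree

def walk(node, prefix):
    seen = set()  # extensions already emitted in this directory
    for name, children in node.items():
        path = prefix + '/' + name
        ext = os.path.splitext(name)[1]
        if not ext:                # folder: always keep it, then recurse
            yield path
            yield from walk(children, path)
        elif ext not in seen:      # file: keep only the first of each extension
            seen.add(ext)
            yield path

tree = build_tree(data)
print([url for domain, children in tree.items() for url in walk(children, domain)])

With the data list from above this produces the same set of URLs as the groupby version, though the relative order within a directory may differ, since the walk descends into a subfolder as soon as it reaches it.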

If your URLs follow this simple format, you can filter the list with a dict that tracks which extensions have already been seen in each directory:

import os

List = [
    "http://www.example.com/folder1",
    "http://www.example.com/folder1",
    "http://www.example.com/folder1/folder2",
    "http://www.example.com/folder1/folder2/folder3",
    "http://www.example.com/folder1/folder2",
    "http://www.example.com/folder1/folder2/image1.png",
    "http://www.example.com/folder1/folder2/image2.png",
    "http://www.example.com/folder1/folder2/file.txt",
    "http://www.example.com/folder1/folder2/folder3",
    "http://www.example.com/folder1/folder2/folder3/file1.txt",
    "http://www.example.com/folder1/folder2/folder3/file2.txt",
    "http://www.example.com/folder1/folder2/folder3/file3.txt",
    ...
]

dirnames = {}   # maps each directory to the extensions already kept there
filtered = []
for url in List:
    dirname = os.path.dirname(url)
    dirnames.setdefault(dirname, {})
    extension = os.path.splitext(url)[1]    # '' for folders, '.png', '.txt', ...
    if extension not in dirnames[dirname]:  # first URL with this extension here
        dirnames[dirname][extension] = True
        filtered.append(url)
print(filtered)
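
Run against the sample list above (with the trailing ... removed so the list is valid Python), filtered should contain the same URLs as the desired result, in order of first appearance in the input:

['http://www.example.com/folder1',
 'http://www.example.com/folder1/folder2',
 'http://www.example.com/folder1/folder2/folder3',
 'http://www.example.com/folder1/folder2/image1.png',
 'http://www.example.com/folder1/folder2/file.txt',
 'http://www.example.com/folder1/folder2/folder3/file1.txt']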
