How to traverse folders recursively in natural name order using os.walk or any alternative method



In Python, I use os.walk to iterate recursively over all folders and find every file that matches a given extension filter. This is my current code:

def get_data_paths(root_path, ext='*.jpg'):
    import os
    import fnmatch
    matches = []
    classes = []
    class_names = []
    for root, dirnames, filenames in os.walk(root_path):
        for filename in fnmatch.filter(filenames, ext):
            matches.append(os.path.join(root, filename))
            # The class name is the name of the folder that contains the file.
            class_name = os.path.basename(os.path.dirname(os.path.join(root, filename)))
            if class_name not in class_names:
                class_names.append(class_name)
            classes.append(class_names.index(class_name))
    print("There are %d files found" % len(matches))
    return matches, classes, class_names

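However, the problem is that this function visits the folders in a strange order. Instead, I want it to go through them from A to Z. How should I modify this code, or what alternative can I use, to do that?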

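By default, the topdown parameter of os.walk is True, so the triple for a directory is reported before the triples for its subdirectories. The documentation states: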

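The caller can modify the dirnames list in-place (perhaps using del or slice assignment), and walk() will only recurse into the subdirectories whose names remain in dirnames; this can be used to prune the search, impose a specific order of visiting, or even to inform walk() about directories the caller creates or renames before it resumes walk() again.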

The part that matters here is "impose a specific order of visiting". So all you need to do is:

import os
import natsort  # third-party package: pip install natsort

for root, dirnames, filenames in os.walk(root_path):
    dirnames[:] = natsort.natsorted(dirnames)
    # continue with other directory processing...

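The [:] slice notation is required because the list has to be edited in place. A plain assignment (dirnames = ...) would only rebind the local name, and os.walk would keep using its own, unsorted list.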
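If installing natsort is not an option, the same in-place trick also works with the standard library alone. The following is only a sketch: natural_key and walk_in_order are hypothetical helper names introduced here, not part of os or natsort, and the key is a simplified approximation of natural ordering (digit runs compared as integers, text compared case-insensitively), not a reimplementation of natsort.

```python
import os
import re


def natural_key(name):
    """Hypothetical helper: split a name into text and integer chunks.

    re.split with a capturing group alternates text, digits, text, ...
    so odd positions always hold the digit runs.
    """
    parts = re.split(r"(\d+)", name)
    return [int(part) if i % 2 else part.lower() for i, part in enumerate(parts)]


def walk_in_order(root_path, key=natural_key):
    """Yield (root, dirnames, filenames) like os.walk, but in sorted order."""
    for root, dirnames, filenames in os.walk(root_path):
        # Sort in place: os.walk reads this same list object to decide
        # which subdirectory to descend into next.
        dirnames.sort(key=key)
        filenames.sort(key=key)
        yield root, dirnames, filenames
```

Passing key=None falls back to the default string comparison, which gives plain A-Z order (case-sensitive, so uppercase names come first).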


Here is an example of os.walk in action. Given a directory tree that looks like this:

$ ls -RF cm3mm/SAM3/src
Applets/            RTC.cc          SAM3X/
DBGUWriteString.cc  SAM3A/          SMC.cc.in
EEFC.cc             SAM3N/          SoftBoot.cc
Memories.txt        SAM3S/
PIO.cc              SAM3U/
cm3mm/SAM3/src/Applets:
AppletAPI.cc   IntFlash.cc   Main.cc        MessageSink.cc  Runtime.cc
cm3mm/SAM3/src/SAM3A:
Map.txt     Pins.txt
cm3mm/SAM3/src/SAM3N:
Map.txt     Pins.txt
cm3mm/SAM3/src/SAM3S:
Map.txt     Pins.txt
cm3mm/SAM3/src/SAM3U:
Map.txt     Pins.txt
cm3mm/SAM3/src/SAM3X:
Map.txt     Pins.txt

Now, let's see what os.walk does:

>>> import os
>>> for root, dirnames, filenames in os.walk("cm3mm/SAM3/src"):
...     print("-----")
...     print("root =", root)
...     print("dirnames =", dirnames)
...     print("filenames =", filenames)
...
-----
root = cm3mm/SAM3/src
dirnames = ['Applets', 'SAM3A', 'SAM3N', 'SAM3S', 'SAM3U', 'SAM3X']
filenames = ['DBGUWriteString.cc', 'EEFC.cc', 'Memories.txt', 'PIO.cc', 'RTC.cc', 'SMC.cc.in', 'SoftBoot.cc']
-----
root = cm3mm/SAM3/src/Applets
dirnames = []
filenames = ['AppletAPI.cc', 'IntFlash.cc', 'Main.cc', 'MessageSink.cc', 'Runtime.cc']
-----
root = cm3mm/SAM3/src/SAM3A
dirnames = []
filenames = ['Map.txt', 'Pins.txt']
-----
root = cm3mm/SAM3/src/SAM3N
dirnames = []
filenames = ['Map.txt', 'Pins.txt']
-----
root = cm3mm/SAM3/src/SAM3S
dirnames = []
filenames = ['Map.txt', 'Pins.txt']
-----
root = cm3mm/SAM3/src/SAM3U
dirnames = []
filenames = ['Map.txt', 'Pins.txt']
-----
root = cm3mm/SAM3/src/SAM3X
dirnames = []
filenames = ['Map.txt', 'Pins.txt']

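Each pass through the loop gives you the subdirectories and files of a single directory. We always know exactly which file belongs to which folder: the files in filenames belong to the folder root.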
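Because every filename is reported together with its root, rebuilding full paths is a single os.path.join. As a small illustration (files_by_folder is a hypothetical name used only for this sketch), the walk can be collected into a mapping from each folder to the files directly inside it:

```python
import os


def files_by_folder(root_path):
    """Map each visited folder to the full paths of the files directly in it."""
    result = {}
    for root, dirnames, filenames in os.walk(root_path):
        dirnames.sort()  # visit subfolders in plain A-Z order
        result[root] = [os.path.join(root, name) for name in sorted(filenames)]
    return result
```

Folders without files still appear in the mapping, with an empty list.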

I modified the code like this:

def get_data_paths(root_path, ext='*.jpg'):
    import os
    import fnmatch
    import natsort  # import this
    matches = []
    classes = []
    class_names = []
    dir_list = natsort.natsorted(list(os.walk(root_path)))  # add this
    for root, dirnames, filenames in dir_list:
        for filename in fnmatch.filter(filenames, ext):
            matches.append(os.path.join(root, filename))
            # The class name is the name of the folder that contains the file.
            class_name = os.path.basename(os.path.dirname(os.path.join(root, filename)))
            if class_name not in class_names:
                class_names.append(class_name)
            classes.append(class_names.index(class_name))
    print("There are %d files found" % len(matches))
    return matches, classes, class_names
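Note that sorting list(os.walk(root_path)) first collects the entire walk in memory and then orders the resulting tuples, which is a different mechanism from reordering dirnames while the walk is running. For comparison, here is a sketch of the same function that uses the in-place approach. The name get_data_paths_sorted and the sort_func parameter are assumptions made for this example; sort_func defaults to the built-in sorted, so the block runs without any third-party package.

```python
import fnmatch
import os


def get_data_paths_sorted(root_path, ext="*.jpg", sort_func=sorted):
    """Collect matching files folder by folder, in the order given by sort_func."""
    matches, classes, class_names = [], [], []
    for root, dirnames, filenames in os.walk(root_path):
        # Slice assignment keeps the list object that os.walk recurses on.
        dirnames[:] = sort_func(dirnames)
        for filename in sort_func(fnmatch.filter(filenames, ext)):
            path = os.path.join(root, filename)
            matches.append(path)
            # Same class rule as the original: name of the containing folder.
            class_name = os.path.basename(os.path.dirname(path))
            if class_name not in class_names:
                class_names.append(class_name)
            classes.append(class_names.index(class_name))
    return matches, classes, class_names
```

For natural ordering, the call would be get_data_paths_sorted(root_path, sort_func=natsort.natsorted), assuming natsort is installed and imported.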
