Python / Filtering a list with os.path.getsize

I use the code below to print the sizes of the files in my directory:

from os import listdir
from os.path import isfile, join, getsize    
pdf_files = [f for f in listdir(folder_location) if isfile(join(folder_location, f)) & f.lower().endswith(".pdf")] 
for index, f2 in enumerate(pdf_files):
    print(index, f2, getsize(folder_location+f2))

The result is:

1 1-01-01-3.pdf 1191722
2 1-01-01-4.pdf 885649
3 1-01-01-5.pdf 254760
4 1-01-01-6.pdf 1425127
5 1-01-01-8.pdf 1456785
6 1-02-01-1.pdf 264252
7 1-02-01-2.pdf 1769278

(In Windows these show as 1164 Ko, 865 Ko, 249 Ko…)

Suppose I only want to select files larger than 18 Ko. When I use the following syntax, my list comes back empty:

pdf_files = [f for f in listdir(folder_location) if isfile(join(folder_location, f)) & f.lower().endswith(".pdf") & getsize(folder_location+f)>18000 ]

Use and for boolean logic, not &.

But the problem is here: getsize(folder_location+f)

This drops the path separator; use getsize(join(folder_location, f)) instead.

Your problem is that & has higher precedence than > (and than and). So when you write:

isfile(join(folder_location, f)) & f.lower().endswith(".pdf") & getsize(folder_location+f)>18000

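everything to the left of > is evaluated as a single value: the bitwise AND of the isfile, endswith(".pdf") and getsize results. Because the first two return only True or False, the bitwise result is always 0 or 1, both less than 18000, so nothing passes the test.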
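The effect is easy to reproduce with plain values. A minimal sketch, using two of the sizes from the question's output in place of the real getsize call; the helper name compare is made up for illustration:

```python
def compare(size):
    """Evaluate the size test three ways; True stands in for isfile/endswith."""
    # & binds tighter than >, so the next line is (True & True & size) > 18000
    bitwise_value = True & True & size
    unparenthesized = True & True & size > 18000
    # Parenthesizing the comparison gives the intended test
    parenthesized = True & True & (size > 18000)
    # Boolean and binds looser than >, so no parentheses are needed
    boolean = True and True and size > 18000
    return bitwise_value, unparenthesized, parenthesized, boolean


for size in (1191722, 1456785):  # sizes taken from the question's output
    print(size, compare(size))
```

The bitwise AND only ever keeps the lowest bit of the size, which is why the unparenthesized comparison can never succeed.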

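You have a secondary problem: if folder_location does not end with a slash, the getsize call will raise FileNotFoundError. You evidently do have the trailing slash, so it does not hurt this code, but in case folder_location is ever missing it, always use os.path.join, as you already do for isfile.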
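To see what goes wrong without the trailing separator, compare string concatenation with join. The sketch below uses posixpath.join (the POSIX flavor of os.path.join) so the output is the same on every platform; the folder /data/pdfs is a made-up path for illustration.

```python
import posixpath

name = "1-01-01-3.pdf"
results = {}
for folder in ("/data/pdfs/", "/data/pdfs"):  # hypothetical folder, with and without slash
    concatenated = folder + name              # relies on the trailing slash being there
    joined = posixpath.join(folder, name)     # inserts a separator only when needed
    results[folder] = (concatenated, joined)
    print(repr(folder), concatenated, joined)
```

Only the concatenated path changes between the two runs; join produces the same path either way.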

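Parentheses can fix the precedence problem between bitwise AND and >, and the result is logically valid here because every test returns True/False and you are not relying on "truthiness" (if one of the tests could return a value such as 10, the bitwise AND test would break). That would make it ... f.lower().endswith(".pdf") & (getsize(join(folder_location, f))>18000)] (note the parentheses around the getsize > 18000 comparison, and the use of join, as in the isfile test, to guarantee the correct directory separator). In practice, though, you want the boolean operators (not, and, or) rather than the bitwise operators (~, &, |): they work even when the results are not strictly 0/1/False/True, and they short-circuit, so when the cheap filename test fails you avoid the expensive stat system call needed to check the size. To fix and optimize this, you want: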

pdf_files = [f for f in listdir(folder_location)
             if f.lower().endswith(".pdf") and isfile(join(folder_location, f)) and
             getsize(join(folder_location, f)) > 18000]

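This performs the filename check first (effectively free compared with the other tests) and minimizes stat calls, because and short-circuits as soon as any test fails.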
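For a quick check of this comprehension, here is a sketch that wraps it in a function and runs it against a throwaway directory. The function name large_pdfs, the min_bytes parameter, and the test files are all made up for illustration.

```python
import tempfile
from os import listdir, mkdir
from os.path import getsize, isfile, join


def large_pdfs(folder_location, min_bytes=18000):
    """Return sorted names of .pdf files in folder_location larger than min_bytes."""
    return sorted(
        f for f in listdir(folder_location)
        if f.lower().endswith(".pdf")             # cheap string test first
        and isfile(join(folder_location, f))      # stat only for .pdf names
        and getsize(join(folder_location, f)) > min_bytes
    )


with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical test files: (name, size in bytes)
    for name, size in [("big.PDF", 20000), ("small.pdf", 100), ("notes.txt", 50000)]:
        with open(join(tmp, name), "wb") as fh:
            fh.write(b"\0" * size)
    mkdir(join(tmp, "folder.pdf"))  # a directory whose name ends in .pdf
    found = large_pdfs(tmp)
    print(found)
```

Only big.PDF survives: small.pdf is too small, notes.txt fails the name test before any stat call, and folder.pdf is rejected by isfile.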

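That said, this is a perfect case for replacing os.listdir (which eagerly returns bare file names, so you have to rejoin them to the directory and stat the file again for each test) with os.scandir (which returns file data lazily, works more efficiently on large folders, and yields DirEntry objects that cheaply give you both the name and the qualified path, plus some file information for free, with no stat required). On Windows in particular all of that information is free; for the parts that are not free on other systems, it caches a single stat call, cutting the work in your code from 1-3 stat calls per file to 0-1: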

from os import scandir  # scandir removes the need for any os.path imports at all
# This listcomp performs at most one stat system call per file on non-Windows,
# and only if the entry is a file and the name ends in .pdf, while
# on Windows it never performs a single stat call, ever
pdf_files = [e for e in scandir(folder_location)
             if e.name.lower().endswith(".pdf") and e.is_file() and
             e.stat().st_size > 18000]
# Every entry in pdf_files that needed to be stat-ed cached the results,
# so this loop involves no stats, on any OS; .stat() cached the result on the first call
for index, e in enumerate(pdf_files):
    print(index, e.name, e.stat().st_size)

For large directories this will be significantly faster, especially if the directory is NFS-mounted or sits on any other slow storage medium.
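One refinement worth considering: since Python 3.6 the iterator returned by scandir can be used as a context manager, which releases the directory handle as soon as the block exits. A minimal sketch, where large_pdf_entries, min_bytes, and the test files are illustrative names rather than anything from the code above:

```python
import os
import tempfile


def large_pdf_entries(folder_location, min_bytes=18000):
    """Return sorted (name, size) pairs for .pdf files larger than min_bytes."""
    with os.scandir(folder_location) as entries:
        # sorted() consumes the iterator before the with block closes it
        return sorted(
            (e.name, e.stat().st_size) for e in entries
            if e.name.lower().endswith(".pdf")
            and e.is_file()
            and e.stat().st_size > min_bytes   # stat result is cached on the entry
        )


with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical test files: (name, size in bytes)
    for name, size in [("big.pdf", 20000), ("small.pdf", 100)]:
        with open(os.path.join(tmp, name), "wb") as fh:
            fh.write(b"\0" * size)
    result = large_pdf_entries(tmp)
    print(result)
```

Returning plain (name, size) tuples instead of DirEntry objects keeps the results usable after the directory handle is closed.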
