How to process multiple files based on the date in the file name

Let's assume I have a structure like this:

Folder1
`XX_20201212.txt`
Folder1
`XX_20201212.txt`
Folder1
`XX_20201212.txt`

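My current script collects the 3 files (one from each folder), processes them, and produces a single file. So at the moment my script does this job for one date.

Now let's change the structure to this: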

Folder1
`XX_20201201.txt`
`XX_20201202.txt`
Folder1
`YY_20201201.txt`
`YY_20201202.txt`
Folder1
`ZZ_20201201.txt`
`ZZ_20201202.txt`
`ZZ_20201203.txt`

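I want my script to do the same thing now, but for multiple dates. I want it to check whether a file has a date in its name that is also present in a list called missing_dates, and whether a file for that date is available in every directory. If so, I want to collect those files and process them into one file. So, assuming 20201201, 20201202 and 20201203 are in missing_dates, the following needs to happen:

The script will process XX_20201201.txt, YY_20201201.txt and ZZ_20201201.txt into one file, because the date is present in missing_dates and exists in every directory.
The script will process XX_20201202.txt, YY_20201202.txt and ZZ_20201202.txt into one file, because the date is present in missing_dates and exists in every directory.
The script will NOT process ZZ_20201203.txt, because the date does not exist in every directory, even though it is present in missing_dates.

In short: 3 files with the same date (in 3 different directories) whose date is present in missing_dates = proceed.

Note that the code below, which processes the files into 1 file, already works. The underlying problem is that I have to adjust my loop so that it always handles more than 1 date, and I don't know how to do that.

This is the code that reads the files: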

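import csv
import os
import re

# counter_part, filter_row, location_token and vehicle_loc_dict are defined elsewhere in the script
for root, dirs, files in os.walk(counter_part):
    for file in files:
        # the digits between "_" and the file extension are the date
        date_files = re.search(r'_(\d+)\.', file).group(1)
        # build the full path; os.walk only yields bare file names
        file_path = os.path.join(root, file)
        with open(file_path, 'r') as my_file:
            reader = csv.reader(my_file, delimiter=',')
            next(reader)  # skip the header row
            for row in reader:
                if filter_row(row):
                    vehicle_loc_dict[(row[9], location_token(row))].append(row)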

With the tools in pathlib, this is fairly easy.

Given:

% tree /tmp/test
/tmp/test
├── dir_1
│   ├── XX_20201201.txt
│   └── XX_20201202.txt
├── dir_2
│   ├── YY_20201201.txt
│   └── YY_20201202.txt
└── dir_3
    ├── ZZ_20201201.txt
    ├── ZZ_20201202.txt
    └── ZZ_20201203.txt

3 directories, 7 files

You can do:

from pathlib import Path

root = Path('/tmp/test')
missing_dates = ['20201201']
for fn in (e for e in root.glob('**/*.txt')
           if e.is_file() and any(d in str(e) for d in missing_dates)):
    print(fn)
    # here do what you mean by 'proceed' with path fn

Prints:

/tmp/test/dir_2/YY_20201201.txt
/tmp/test/dir_3/ZZ_20201201.txt
/tmp/test/dir_1/XX_20201201.txt
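One caveat with the substring test `d in str(e)`: it also matches when the date happens to appear in a directory name. A stricter variant, as a minimal sketch, parses the date from the file name itself. It assumes the names follow a `<prefix>_<8 digits>.txt` pattern; `DATE_RE`, `file_date` and `matching_files` are hypothetical helper names, not part of pathlib.

```python
import re
from pathlib import Path

# Assumed naming scheme: <prefix>_<8 digits>.txt
DATE_RE = re.compile(r"_(\d{8})\.txt$")

def file_date(path):
    """Return the 8-digit date from the file name, or None if there is none."""
    m = DATE_RE.search(Path(path).name)
    return m.group(1) if m else None

def matching_files(root, missing_dates):
    """Yield .txt files under root whose file-name date is in missing_dates."""
    wanted = set(missing_dates)
    for p in sorted(Path(root).glob("**/*.txt")):
        if p.is_file() and file_date(p) in wanted:
            yield p
```

Because only `Path.name` is inspected, a directory called `dir_20201201` would not cause unrelated files inside it to match.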

Or, you can do:

missing_dates = ['20201201', '20201202']
for d in missing_dates:
    print(f"processing {d}")
    for fn in (e for e in root.glob(f"**/*_{d}.txt") if e.is_file()):
        print(fn)
        # here do what you mean by 'proceed'

Prints:

processing 20201201
/tmp/test/dir_2/YY_20201201.txt
/tmp/test/dir_3/ZZ_20201201.txt
/tmp/test/dir_1/XX_20201201.txt
processing 20201202
/tmp/test/dir_2/YY_20201202.txt
/tmp/test/dir_3/ZZ_20201202.txt
/tmp/test/dir_1/XX_20201202.txt

If you are only interested in groups of 3, you can do:

missing_dates = ['20201201', '20201202', '20201203']
for d in missing_dates:
    print(f"processing {d}")
    # collect every file for this date, across all directories
    files = [e for e in root.glob(f"**/*_{d}.txt") if e.is_file()]
    if len(files) == 3:
        print(files)

Prints:

processing 20201201
[PosixPath('/tmp/test/dir_2/YY_20201201.txt'), PosixPath('/tmp/test/dir_3/ZZ_20201201.txt'), PosixPath('/tmp/test/dir_1/XX_20201201.txt')]
processing 20201202
[PosixPath('/tmp/test/dir_2/YY_20201202.txt'), PosixPath('/tmp/test/dir_3/ZZ_20201202.txt'), PosixPath('/tmp/test/dir_1/XX_20201202.txt')]
processing 20201203
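Counting to 3 works for this layout, but it would also accept three files for one date that all sit in the same directory. If the requirement is "the date exists in every directory", a sketch along the following lines checks that directly. It assumes the data files sit exactly one level below the root, as in the tree above; `complete_groups` is a hypothetical helper name.

```python
from collections import defaultdict
from pathlib import Path

def complete_groups(root, missing_dates):
    """Map each date to its files, keeping only dates found in every subdirectory."""
    root = Path(root)
    # every immediate subdirectory must contribute a file for a date to count
    subdirs = {d for d in root.iterdir() if d.is_dir()}
    by_date = defaultdict(list)
    for d in missing_dates:
        for p in sorted(root.glob(f"*/*_{d}.txt")):
            if p.is_file():
                by_date[d].append(p)
    # keep a date only if its files cover all subdirectories
    return {d: files for d, files in by_date.items()
            if {p.parent for p in files} == subdirs}
```

Each value of the returned dict is the list of paths to hand to the existing "process into one file" step for that date.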

You can do the same thing with os.walk and glob.glob, but it is just more work...
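For comparison, here is a rough sketch of the os.walk route. It assumes the same `<prefix>_<digits>.txt` naming; `files_by_date` is a hypothetical helper name, and the result still needs the "one file per directory" check before processing.

```python
import os
import re
from collections import defaultdict

def files_by_date(root, missing_dates):
    """Walk root with os.walk and collect file paths per wanted date."""
    wanted = set(missing_dates)
    found = defaultdict(list)
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            # the digits between "_" and ".txt" are taken as the date
            m = re.search(r"_(\d+)\.txt$", name)
            if m and m.group(1) in wanted:
                found[m.group(1)].append(os.path.join(dirpath, name))
    return found
```

The regex and path joining that pathlib's `glob` handled implicitly now have to be written by hand, which is the extra work mentioned above.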
