Comparing strings from a txt file with bash or Python, ignoring a pattern

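I want to search a txt file for duplicate lines, ignoring [p] and the extension in the comparison. Once the matching lines are identified, only the line that does not contain [p] should be shown, with its extension. I have the following lines in test.txt: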

Peliculas/Desperados (2020)[p].mp4
Peliculas/La Duquesa (2008)[p].mp4
Peliculas/Nueva York Año 2012 (1975).mkv
Peliculas/Acoso en la noche (1980) .mkv
Peliculas/Angustia a Flor de Piel (1982).mkv
Peliculas/Desperados (2020).mkv
Peliculas/Angustia (1947).mkv
Peliculas/Días de radio (1987) BR1080[p].mp4
Peliculas/Mona Lisa (1986) BR1080[p].mp4
Peliculas/La decente (1970) FlixOle WEB-DL 1080p [Buzz][p].mp4
Peliculas/Mona Lisa (1986) BR1080.mkv

In this file, lines 1 and 6 and lines 9 and 11 are the same (ignoring the extension and [p]). Desired output:

Peliculas/Desperados (2020).mkv
Peliculas/Mona Lisa (1986) BR1080.mkv

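I tried the following, but it only shows the matching lines with the extension and the [p] pattern removed. I can't tell which original line each one came from, and I need the complete line.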

sed 's/\[p\]//' ./test.txt | sed 's/\.[^.]*$//' | sort | uniq -d

Incorrect output (the extension is missing):

Peliculas/Desperados (2020)
Peliculas/Mona Lisa (1986) BR1080

Since you mentioned bash...

Remove any line containing p:

cat test.txt | grep -v p
home/folder/house from earth.mkv
home/folder3/window 1.avi

Remove any line containing the literal [p]:

cat test.txt | grep -v '\[p\]'
home/folder/house from earth.mkv
home/folder3/window 1.avi
home/folder4/little mouse.mpg
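
Note that in a regular expression an unescaped [p] is a bracket expression that matches the single letter p; matching the literal text [p] requires escaping the brackets or treating the pattern as a fixed string. A minimal Python sketch of that distinction, using made-up sample lines (the list below is an assumption for illustration, not the contents of test.txt):

```python
import re

# Hypothetical sample lines, for illustration only.
lines = [
    "dir/one[p].mp4",
    "dir/one.mkv",
    "dir/two.mpg",
]

bracket_expr = re.compile(r"[p]")        # matches any single letter "p"
literal = re.compile(re.escape("[p]"))   # matches the literal text "[p]"

# Lines kept by the equivalent of `grep -v` for each pattern.
without_any_p = [s for s in lines if not bracket_expr.search(s)]
without_literal = [s for s in lines if not literal.search(s)]

print(without_any_p)     # only the line with no "p" at all
print(without_literal)   # also keeps "dir/two.mpg", which has "p" but not "[p]"
```

The second list is longer because "mpg" contains the letter p without containing the bracketed marker.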

Probably not what you need, but just in case: remove [p] from each line, then deduplicate:

cat test.txt | sed 's/\[p\]//g' | sort | uniq
home/folder/house from earth.mkv
home/folder/house from earth.mp4
home/folder2/test.mp4
home/folder3/window 1.avi
home/folder3/window 1.mp4
home/folder4/little mouse.mpg

If a two-pass solution (one that reads test.txt twice) is acceptable, would you please try:

declare -A ary                          # associate the base name with the filename
while IFS= read -r file; do
    if [[ $file != *"[p]"* ]]; then     # the filename does not include "[p]"
        base="${file%.*}"               # remove the extension
        ary[$base]="$file"              # create a map
    fi
done < test.txt
while IFS= read -r base; do
    echo "${ary[$base]}"
done < <(sed 's/\[p\]//' ./test.txt | sed 's/\.[^.]*$//' | sort | uniq -d)

Output:

Peliculas/Desperados (2020).mkv
Peliculas/Mona Lisa (1986) BR1080.mkv

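In the first pass, it reads the file line by line to build a map that associates each base name (without the extension) with its filename (with the extension), skipping lines that contain [p].
In the second pass, it replaces each duplicated base name printed by the sed pipeline with the corresponding full filename.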
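
The same two-pass idea can be sketched in Python. This is only an illustration under stated assumptions: the helper names strip_key and two_pass are hypothetical, the sample list is a subset of the lines from the question, and unlike the sed command it removes every occurrence of [p] rather than just the first.

```python
import re
from collections import Counter

def strip_key(line):
    """Return the line with the literal "[p]" and the extension removed."""
    return re.sub(r"\.[^.]*$", "", line.replace("[p]", ""))

def two_pass(lines):
    # Pass 1: map each base name to its full filename, skipping "[p]" lines.
    full_name = {}
    for line in lines:
        if "[p]" not in line:
            full_name[strip_key(line)] = line
    # Pass 2: find base names that occur more than once and map them back.
    counts = Counter(strip_key(line) for line in lines)
    return [full_name[key] for key in sorted(counts)
            if counts[key] > 1 and key in full_name]

# A subset of the lines from the question.
sample = [
    "Peliculas/Desperados (2020)[p].mp4",
    "Peliculas/Angustia (1947).mkv",
    "Peliculas/Desperados (2020).mkv",
    "Peliculas/Mona Lisa (1986) BR1080[p].mp4",
    "Peliculas/Mona Lisa (1986) BR1080.mkv",
]

for name in two_pass(sample):
    print(name)
```

Sorting the keys in the second pass mirrors the `sort | uniq -d` step, so the results come out in base-name order.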

If you prefer a one-pass solution (which will be faster), please try:

declare -A ary                  # associate the base name with the filename
declare -A count                # count the occurrences of each base name
while IFS= read -r file; do
    base="${file%.*}"           # remove the extension
    if [[ $base =~ (.*)\[p\](.*) ]]; then
        # "$base" contains the substring "[p]": strip it to get the key
        key="${BASH_REMATCH[1]}${BASH_REMATCH[2]}"
    else
        key="$base"
        ary[$key]="$file"       # map the base name to the filename
    fi
    count[$key]=$(( ${count[$key]:-0} + 1 ))    # increment the counter
done < test.txt
for base in "${!ary[@]}"; do    # loop over the keys of ${ary[@]}
    if (( ${count[$base]} > 1 )); then
        # the base name occurs more than once
        echo "${ary[$base]}"
    fi
done
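
For comparison, here is a rough Python sketch of the same single-pass counting approach. The function name one_pass is hypothetical, and the sample list is a subset of the lines from the question; treat it as an illustration rather than a drop-in replacement for the script above.

```python
def one_pass(lines):
    full_name = {}   # base name -> full filename (lines without "[p]" only)
    count = {}       # base name -> number of occurrences
    for line in lines:
        base = line.rsplit(".", 1)[0]          # remove the extension
        if "[p]" in base:
            key = base.replace("[p]", "")      # strip the marker to get the key
        else:
            key = base
            full_name[key] = line              # remember the full filename
        count[key] = count.get(key, 0) + 1     # increment the counter
    # Keep only the filenames whose base name was seen more than once.
    return [name for key, name in full_name.items() if count[key] > 1]

# A subset of the lines from the question.
sample = [
    "Peliculas/Desperados (2020)[p].mp4",
    "Peliculas/La Duquesa (2008)[p].mp4",
    "Peliculas/Desperados (2020).mkv",
    "Peliculas/Mona Lisa (1986) BR1080[p].mp4",
    "Peliculas/Mona Lisa (1986) BR1080.mkv",
]

for name in one_pass(sample):
    print(name)
```

Because Python dictionaries preserve insertion order, the results appear in the order in which the non-[p] lines occur in the input.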

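In Python, you can use itertools.groupby with a function that produces a key consisting of the filename with any [p] removed and the extension stripped.

For any group of size 2 or more, every filename that does not contain "[p]" is printed.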

import itertools
import re

def make_key(line):
    # Drop the literal "[p]" and the extension to build the comparison key.
    return re.sub(r'\.[^.]*$', '', line.replace('[p]', ''))

with open('test.txt') as f:
    lines = [line.strip() for line in f]

# groupby only groups consecutive items, so sort by the same key first.
for key, group in itertools.groupby(sorted(lines, key=make_key), make_key):
    files = list(group)
    if len(files) > 1:
        for file in files:
            if '[p]' not in file:
                print(file)

This gives:

home/folder/house from earth.mkv
home/folder3/window 1.avi
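
One caveat with itertools.groupby: it only groups consecutive items that share a key, so matching lines that are not adjacent end up in separate groups unless the input is sorted by the same key first. A small sketch of the difference, using three lines from the question (the helper group_sizes is a hypothetical name introduced for illustration):

```python
import itertools
import re

def make_key(line):
    # Drop the literal "[p]" and the extension to build the comparison key.
    return re.sub(r"\.[^.]*$", "", line.replace("[p]", ""))

def group_sizes(items):
    # Return (key, group size) pairs in the order groupby yields them.
    return [(key, len(list(group)))
            for key, group in itertools.groupby(items, make_key)]

lines = [
    "Peliculas/Desperados (2020)[p].mp4",
    "Peliculas/Angustia (1947).mkv",
    "Peliculas/Desperados (2020).mkv",
]

unsorted_groups = group_sizes(lines)
sorted_groups = group_sizes(sorted(lines, key=make_key))

print(unsorted_groups)   # three groups of size 1: the duplicate is missed
print(sorted_groups)     # the two Desperados lines land in one group
```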
