Why doesn't comparing two md5sum files work properly?



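I have two lists of files with their md5sum checksums, and the two lists use different paths for the same files.

Example of the contents of the first checksum file (server.list):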

2c03ff18a643a1437ec0cf051b8b7b9d  /tmp/fastq1_L001_R1_001.fastq.gz
c430f587aba1aa9f4fdf69aeb4526621  /tmp/fastq1_L001_R2_001.fastq.gz/
6e6bcd84f264233cf7c428c0cfdc0c03  tmp/fastq1_L002_R1_001.fastq.gz

Example of the contents of the second checksum file (downloaded.list):

2c03ff18a643a1437ec0cf051b8b7b9d  /home/projects/fastq1_L001_R1_001.fastq.gz
c430f587aba1aa9f4fdf69aeb4526621  /home/projects/fastq1_L001_R2_001.fastq.gz
6e6bcd84f264233cf7c428c0cfdc0c03  /home/projects/fastq1_L002_R1_001.fastq.gz

When I run the following command, I get this output:

awk -F"/" 'FNR==NR{filearray[$1]=$NF; next }!($1 in filearray){printf "%s has a different md5sum\n",$NF}' downloaded.list server.list
fastq1_L001_R1_001.fastq.gz has a different md5sum
fastq1_L001_R2_001.fastq.gz has a different md5sum
fastq1_L002_R2_001.fastq.gz has a different md5sum

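Why do I get this message when the first column is the same in both files? Can someone enlighten me on this?

Edit:

If I strip the paths and keep only the filenames, it works fine.

Edit 2:

As pointed out, there is another possible form of file path: one that does not start with /. In that case I cannot use / as the field separator.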
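To make the separator problem concrete, here is a small sketch that prints what awk puts in $1 when / is the field separator. The two input lines are copied from server.list above; the square brackets are only there to make trailing whitespace visible.

```shell
# Show the first field awk sees when "/" is the field separator.
out=$(printf '%s\n' \
  '2c03ff18a643a1437ec0cf051b8b7b9d  /tmp/fastq1_L001_R1_001.fastq.gz' \
  '6e6bcd84f264233cf7c428c0cfdc0c03  tmp/fastq1_L002_R1_001.fastq.gz' |
  awk -F"/" '{ printf "[%s]\n", $1 }')
printf '%s\n' "$out"
```

For the absolute path, $1 is the checksum followed by the two separating spaces. For the relative path, $1 also swallows the leading directory name (tmp), so it can never equal a key built from an absolute path.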

Assumptions:

  • the filename (sans path) and the md5sum must both match
  • filenames may not be listed in the same order
  • a filename may not exist in both files

Sample data:

$ head downloaded.list server.list
==> downloaded.list <==
2c03ff18a643a1437ec0cf051b8b7b9d  /home/projects/fastq1_L001_R1_001.fastq.gz   # match
YYYYf587aba1aa9f4fdf69aeb4526621  /home/projects/fastq1_L001_R5_911.fastq.gz   # different md5sum
c430f587aba1aa9f4fdf69aeb4526621  /home/projects/fastq1_L001_R2_001.fastq.gz   # match
MNOPf587aba1aa9f4fdf69aeb4526621  /home/projects/fastq1_L001_R8_abc.fastq.gz   # filename does not exist in other file
ABCDf587aba1aa9f4fdf69aeb4526621  /home/projects/fastq1_L001_R9_004.fastq.gz   # different filename but matching md5sum (vs last line of other file)
==> server.list <==
2c03ff18a643a1437ec0cf051b8b7b9d  /tmp/fastq1_L001_R1_001.fastq.gz             # match
c430f587aba1aa9f4fdf69aeb4526621  /tmp/fastq1_L001_R2_001.fastq.gz             # match
XXXXf587aba1aa9f4fdf69aeb4526621  /tmp/fastq1_L001_R5_911.fastq.gz             # different md5sum
TUVWff18a643a1437ec0cf051b8b7b9d  /tmp/fastq1_L999_R6_922.fastq.gz             # filename does not exist in other file
ABCDf587aba1aa9f4fdf69aeb4526621  /tmp/fastq1_L001_R7_933.fastq.gz             # different filename but matching md5sum (vs last line of other file)

One awk idea that addresses the whitespace issue and also verifies that the filenames match:

awk '                                    # stick with default field delimiter of white space but ...
{ md5sum=$1
  n=split($2,arr,"/")                    # split 2nd field on "/" delimiter
  fname=arr[n]                           # last piece is the filename without its path
  if (FNR==NR)
     filearray[fname]=md5sum             # 1st file: remember md5sum per filename
  else {
     if (fname in filearray && filearray[fname] == $1)
        next                             # same filename and same md5sum: nothing to report
     printf "%s has a different md5sum\n",fname
  }
}
' downloaded.list server.list

This generates:

fastq1_L001_R5_911.fastq.gz has a different md5sum
fastq1_L999_R6_922.fastq.gz has a different md5sum
fastq1_L001_R7_933.fastq.gz has a different md5sum
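Note that the message is the same whether the checksum differs or the filename is simply absent from the first list. If that distinction matters, a variant along the following lines could report the two cases separately. This is only a sketch: compare_lists is a hypothetical helper name, the wording of the second message is made up, and the input is a shortened copy of the sample data. Like the script above, it assumes paths contain no whitespace.

```shell
# Sketch: report checksum mismatches and missing filenames separately.
# compare_lists is a hypothetical helper, not part of the original answer.
compare_lists() {
  awk '
    {
      n = split($2, parts, "/")        # field 2 is the path; keep its last piece
      fname = parts[n]
    }
    FNR == NR { sum[fname] = $1; next }                # 1st file: store checksum per filename
    !(fname in sum) { printf "%s is not listed in the first file\n", fname; next }
    (sum[fname] "") != ($1 "") {                       # appending "" forces a string comparison
      printf "%s has a different md5sum\n", fname
    }
  ' "$1" "$2"
}

# Try it on a shortened copy of the sample data.
tmpdir=$(mktemp -d)
printf '%s\n' \
  '2c03ff18a643a1437ec0cf051b8b7b9d  /home/projects/fastq1_L001_R1_001.fastq.gz' \
  'YYYYf587aba1aa9f4fdf69aeb4526621  /home/projects/fastq1_L001_R5_911.fastq.gz' \
  > "$tmpdir/downloaded.list"
printf '%s\n' \
  '2c03ff18a643a1437ec0cf051b8b7b9d  /tmp/fastq1_L001_R1_001.fastq.gz' \
  'XXXXf587aba1aa9f4fdf69aeb4526621  /tmp/fastq1_L001_R5_911.fastq.gz' \
  'TUVWff18a643a1437ec0cf051b8b7b9d  /tmp/fastq1_L999_R6_922.fastq.gz' \
  > "$tmpdir/server.list"
result=$(compare_lists "$tmpdir/downloaded.list" "$tmpdir/server.list")
printf '%s\n' "$result"
rm -rf "$tmpdir"
```

With this input, the R5_911 file is reported as having a different md5sum and the L999_R6_922 file as not listed in the first file; the matching R1_001 line produces no output.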

The whitespace in $1, which is used as the array key, is causing the problem. Remove it:

awk -F"/" '{gsub(/ /, "", $1)}; FNR==NR{filearray[$1]=$NF; next }!($1 in filearray){printf "%s has a different md5sum\n",$NF}' list1.txt list2.txt
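A quick way to see the effect is to compare the length of $1 before and after the gsub. This sketch assumes the two lists differ in the amount of spacing after the checksum; the single-space line and the a.fastq.gz filename are made up for illustration and do not come from the question.

```shell
# Two lines with the same checksum but different spacing before the path
# (the second line and the a.fastq.gz name are made-up examples).
lines=$(printf '%s\n' \
  '2c03ff18a643a1437ec0cf051b8b7b9d  /home/projects/a.fastq.gz' \
  '2c03ff18a643a1437ec0cf051b8b7b9d /tmp/a.fastq.gz')

# Without stripping: the keys keep their trailing spaces and differ in length.
raw=$(printf '%s\n' "$lines" | awk -F"/" '{ print length($1) }')

# With gsub: both keys shrink to the bare 32-character checksum.
stripped=$(printf '%s\n' "$lines" | awk -F"/" '{ gsub(/ /, "", $1); print length($1) }')

echo "key lengths without gsub:"
echo "$raw"
echo "key lengths with gsub:"
echo "$stripped"
```

The raw keys come out as 34 and 33 characters, the stripped ones as 32 each. Keep in mind that gsub(/ /, "", $1) removes only space characters: a tab separator, or a relative path whose first directory ends up in $1, would still leave the keys unequal.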
