Perl: compare two files and print the matching lines

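All,

I have two files, SC and ID. ID has just two columns, separated by a space. SC has more columns, but its first two columns may match a pair that appears in ID.

For example: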

ID  
chain_0 123
chain_1 456
chain_2 789

SC  
chain_0 123 toronto ontario canada
chain_1 456 toronto New Delhi India 
chain_2 789 housing_crisis mortgage_rates first_time_buyers miserable

Now, I want to print the lines in SC that match lines in ID. I tried the following, but it doesn't work.

open(ID, '<', $id) or die $!;
while(<ID>){
my @array = split ' ', $_;
$output = `awk '$1 ~ /<$array[0]>/' scan_cells | awk '$2 ~ /<$array[1]>/'` ;
print "$output";
}
close(ID);

Thanks!!

One way that uses only grep, with bash process substitution and sed to massage the lines of ID into regular expressions that match only at the start of a line:

grep -f <(sed 's/^/^/; s/[[:space:]]/[[:space:]]/; s/$/[[:space:]]/' ID) SC
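To see what grep actually receives, it can help to run the sed stage on its own. The sketch below is only an illustration: the function name `id_patterns` and the scratch directory are assumptions made for the demo, while the sed expression is copied from the command above and the sample lines come from the question.

```shell
#!/bin/sh
# Hypothetical helper (not part of the answer): print the regular
# expressions that the sed command generates, one per line of the ID file.
id_patterns() {
    sed 's/^/^/; s/[[:space:]]/[[:space:]]/; s/$/[[:space:]]/' "$1"
}

# Sample ID data from the question, written to a scratch directory.
demo_dir=$(mktemp -d)
printf '%s\n' 'chain_0 123' 'chain_1 456' 'chain_2 789' > "$demo_dir/ID"

# The first line printed is: ^chain_0[[:space:]]123[[:space:]]
id_patterns "$demo_dir/ID"
```

Each pattern is anchored with `^` and ends in `[[:space:]]`, so `chain_0 123` cannot match a line that starts with `chain_0 1234`. Note that any regex metacharacters in ID would still be interpreted as such by grep.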

In Perl:

#!/usr/bin/env perl
use strict;
use warnings;

# Takes the files as command line arguments
my ($id_file, $sc_file) = @ARGV;
my %ids;

open my $ID, "<", $id_file or die "Unable to open $id_file: $!\n";
while (<$ID>) {
    # Just in case there's a tab instead of a single space between columns
    $_ = join(" ", split);
    $ids{$_} = 1;
}
close $ID;

open my $SC, "<", $sc_file or die "Unable to open $sc_file: $!\n";
while (<$SC>) {
    my @cols = split;
    # "@cols[0,1]" interpolates as the first two columns joined by a space
    print if exists $ids{"@cols[0,1]"};
}
close $SC;

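The idea here is to store each line of ID as a key in a hash table, then, for every line of SC, check whether its first two columns exist as a key in that table, and print the line if they do.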

The same approach can be written more concisely in awk, though:

awk 'FNR == NR { ids[$1,$2] = 1; next }
($1,$2) in ids' ID SC
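The `FNR == NR` test is what separates the two files: `NR` counts records across all input, while `FNR` restarts at 1 for each file, so the two are equal only while the first file is being read. A small sketch that makes this visible follows; the function name `show_counters` and the two short sample files are assumptions for the demo, not part of the answer.

```shell
#!/bin/sh
# Hypothetical demo: print NR and FNR for every line of two files, plus
# a label showing how the FNR == NR test evaluates on that line.
show_counters() {
    awk '{ print $1, "NR=" NR, "FNR=" FNR, (FNR == NR ? "first-file" : "later-file") }' "$1" "$2"
}

# Two small sample files in a scratch directory.
demo_dir=$(mktemp -d)
printf '%s\n' 'chain_0 123' 'chain_1 456' > "$demo_dir/ID"
printf '%s\n' 'chain_0 123 toronto' 'chain_9 999 x' > "$demo_dir/SC"

show_counters "$demo_dir/ID" "$demo_dir/SC"
```

One caveat of this idiom: if the first file is empty, `FNR == NR` is also true while the second file is read, so its lines would be loaded as keys rather than filtered.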

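Calling awk from a Perl program is almost always a mistake. Whatever you are doing with awk is probably easier to do in Perl itself.

Here is how I would approach your problem. Create a hash where the keys are the IDs and the values are some true value (1 is the simplest). Then iterate over the SC file and print a line only if its beginning matches a key in the hash.

Something like this: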

#!/usr/bin/perl
# Always :-)
use strict;
use warnings;
# Open the id file
open my $id, '<', 'id' or die $!;
# Read the ids in to an array
chomp( my @ids = <$id> );
# Convert the array into a hash
my %id = map { $_ => 1 } @ids;
# Read a line at a time from the file
# given on the command line.
while (<>) {
    # split the line into fields (on whitespace)
    my @data = split;
    # Print only if the first two fields match
    # a record in %id
    print if $id{"$data[0] $data[1]"};
}

This hard-codes the name of the ID file, but you pass the name of the SC file on the command line. If you call this program idfilter, you would run it like this:

$ ./idfilter sc

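Assuming the columns of interest in SC are also its first two columns, and the field separator (whitespace) is the same in both files, you can store each whole line of ID as a key in an array, a[$0].

While processing the second file, check whether column 1 joined to column 2 by the output field separator (OFS) appears in the array that holds all the entries from ID.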

awk 'FNR == NR{a[$0]; next} $1 OFS $2 in a' ID SC

Test contents of the files:

$ cat ID
chain_0 123
chain_1 456
chain_2 789
chain_9 999
$ cat SC
chain_0 123 toronto ontario canada
chain_1 456 toronto New Delhi India
chain_2 789 housing_crisis mortgage_rates first_time_buyers miserable
chain_3 999 housing_crisis mortgage_rates first_time_buyers miserable

Output

chain_0 123 toronto ontario canada
chain_1 456 toronto New Delhi India
chain_2 789 housing_crisis mortgage_rates first_time_buyers miserable
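The `a[$0]` approach above relies on ID using exactly one space between its columns. If that is not guaranteed, one possible variation is to build the key from the fields in both files, so tabs or repeated spaces in ID no longer matter. This is a sketch under that assumption, not the answer's own command; the function name `match_pairs` and the sample data with mixed whitespace are made up for the demo.

```shell
#!/bin/sh
# Hypothetical variant: key the array on "$1 OFS $2" for the first file
# as well, so the lookup does not depend on how ID separates its columns.
match_pairs() {
    awk 'FNR == NR { seen[$1 OFS $2]; next } ($1 OFS $2) in seen' "$1" "$2"
}

# Sample data: ID deliberately mixes a tab and several spaces.
demo_dir=$(mktemp -d)
printf 'chain_0\t123\nchain_1   456\n' > "$demo_dir/ID"
printf '%s\n' \
    'chain_0 123 toronto ontario canada' \
    'chain_1 456 toronto New Delhi India' \
    'chain_3 999 housing_crisis' > "$demo_dir/SC"

# Prints the chain_0 and chain_1 lines of SC; chain_3 has no pair in ID.
match_pairs "$demo_dir/ID" "$demo_dir/SC"
```

The parentheses around `$1 OFS $2` are not strictly needed, since concatenation binds tighter than `in`, but they make the intent easier to read.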

If the output field separator is different, you can also use a multidimensional array:

awk 'FNR==NR{a[$1, $2];next}
{
    for (pair in a) {
        split(pair, sep, SUBSEP);
        if ($1 == sep[1] && $2 == sep[2]) print
    }
}
' ID SC
