基于相同的键折叠行



我想根据第一列的相等性折叠行。然后将第二列的内容添加到新的折叠表中,以逗号分隔并带有额外的空格。此外,如果第二列的内容相同,请折叠它们,也就是说,如果"非毒性"在输出文件中出现两次,则只显示一次。

我在这里很新,请解释如何运行它。希望有人能帮助我!

输入(制表符分隔):

HS372_01446 non-virulent
HS372_01446 non-virulent
HS372_01446 lung
HS372_00498 non-virulent
HS372_00498 non-virulent
HS372_00498 non-virulent
HS372_00498 lung
HS372_00498 lung
HS372_00954 jointlungCNS
HS372_00954 non-virulent
HS372_00954 non-virulent
HS372_00954 moderadamentevirulenta(nose)
HS372_00954 lung

所需输出(制表符分隔):

HS372_01446 non-virulent, lung
HS372_00498 non-virulent, lung
HS372_00954 jointlungCNS, non-virulent, moderadamentevirulenta(nose), lung

来自命令行的 Perl,

perl -lane'
  ($n, $p) =@F;
  $s{$n}++ or push @r, $n;
  $c{$n}{$p}++ or push @{$h{$n}}, $p;
  END {
    $" = ",t";
    print "$_t@{$h{$_}}" for @r;
  }
' file

输出

HS372_01446     non-virulent,   lung
HS372_00498     non-virulent,   lung
HS372_00954     jointlungCNS,   non-virulent,   moderadamentevirulenta(nose),  lung

另一个Perl解决方案:

#!/usr/bin/perl
use strict;
use warnings;
use List::MoreUtils qw/uniq/;
my %hash;
while ( <DATA> )
{
    chomp;
    my ( $key, $value ) = split;
    push @{$hash{$key}}, $value;
}
while ( my ( $key, $values ) = each %hash )
{
    print "$keyt", join ', ', uniq @$values, "n";  
}
__DATA__
HS372_01446 non-virulent
HS372_01446 non-virulent
HS372_01446 lung
HS372_00498 non-virulent
HS372_00498 non-virulent
HS372_00498 non-virulent
HS372_00498 lung
HS372_00498 lung
HS372_00954 jointlungCNS
HS372_00954 non-virulent
HS372_00954 non-virulent
HS372_00954 moderadamentevirulenta(nose)
HS372_00954 lung

这将执行您的要求,此外,如果重要,还会使 ID 和描述保持它们在文件中出现的顺序相同:

use strict;
use warnings;
open my $fh, '<', 'diseases.txt';
my %diseases;
my @ids;
while (<$fh>) {
  my ($id, $desc) = split;
  if (not $diseases{$id}) {
    $diseases{$id}{list} = [$desc];
    $diseases{$id}{seen}{$desc} = 1;
    push @ids, $id;
  }
  elsif (not $diseases{$id}{seen}{$desc}) {
    push @{ $diseases{$id}{list} }, $desc;
    $diseases{$id}{seen}{$desc} = 1;
  }
}
for my $id (@ids) {
  printf "%s %sn", $id, join ', ', @{ $diseases{$id}{list} };
}

输出

HS372_01446 non-virulent, lung
HS372_00498 non-virulent, lung
HS372_00954 jointlungCNS, non-virulent, moderadamentevirulenta(nose), lung
from collections import defaultdict
a = """HS372_01446 non-virulent
HS372_01446 non-virulent
HS372_01446 lung
HS372_00498 non-virulent
HS372_00498 non-virulent
HS372_00498 non-virulent
HS372_00498 lung
HS372_00498 lung
HS372_00954 jointlungCNS
HS372_00954 non-virulent
HS372_00954 non-virulent
HS372_00954 moderadamentevirulenta(nose)
HS372_00954 lung""".split("n")
stuff = defaultdict(set)
for line in a:
    uid, symp = line.split(" ")
    stuff[uid].add(symp)
for uid, symps in stuff.iteritems():
    print "%s %s" % (uid, ", ".join(list(symps)))

Java:

爪哇崩溃.java

java 折叠输入.txt

import java.io.*;
import java.util.*;
public class Collapse {
    public static void main(String[] args) throws Exception {
        BufferedReader br = new BufferedReader(new InputStreamReader(new FileInputStream(args[0])));
        Map<String, Set<String>> output = new HashMap<String, Set<String>>();
        String line;
        while ((line = br.readLine()) != null) {
            StringTokenizer st = new StringTokenizer(line, "t");
            String key = st.nextToken();
            Set<String> set = output.get(key);
            if (set == null) {
                output.put(key, set = new LinkedHashSet<String>());
            }
            set.add(st.nextToken());
        }
        for (String key : output.keySet()) {
            StringBuilder sb = new StringBuilder();
            for (String value : output.get(key)) {
                if (sb.length() != 0) sb.append(", ");
                sb.append(value);
            }
            System.out.println(key + "t" + sb);
        }
    }
}

用于解析文本文件的标准 UNIX 工具很笨拙:

$ awk '!seen[$1,$2]++{a[$1]=(a[$1] ? a[$1]", " : "t") $2} END{for (i in a) print i a[i]}' file
HS372_00498     non-virulent, lung
HS372_00954     jointlungCNS, non-virulent, moderadamentevirulenta(nose), lung
HS372_01446     non-virulent, lung

如果您的数据来自 mysql 数据库(您可以将其导入到一个数据库中),您可以使用 group_concat 运算符。

查看此答案我可以将多个 MySQL 行连接成一个字段吗?

这目前标记为 431 个赞成票,所以你的问题是一个非常普遍的问题,答案显示了一个非常优雅的解决方案。

在 perl 中:

use warnings;
use strict; 
open my $input, '<', 'in.txt';
my %hash;
while (<$input>){
    chomp;
    my @split = split(' ');
    $hash{$split[0]}{$split[1]} = 1;
}
for my $key (keys %hash){
    print "$keyt";
        for my $info (keys $hash{$key}){
            print "$infot";
        }
    print "n";
} 

哪些打印:

HS372_01446 non-virulent    lung    
HS372_00954 non-virulent    moderadamentevirulenta(nose)    jointlungCNS    lung    
HS372_00498 non-virulent    lung

相关内容

  • 没有找到相关文章

最新更新