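I want to collapse rows based on equality of the first column. The contents of the second column should then be added to the new collapsed table, separated by a comma followed by a space. In addition, if values in the second column are identical, collapse them too; that is, if "non-virulent" would appear twice for one ID in the output file, show it only once.
I am new to this, so please also explain how to run the solution. I hope someone can help!
Input (tab-separated):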
HS372_01446 non-virulent
HS372_01446 non-virulent
HS372_01446 lung
HS372_00498 non-virulent
HS372_00498 non-virulent
HS372_00498 non-virulent
HS372_00498 lung
HS372_00498 lung
HS372_00954 jointlungCNS
HS372_00954 non-virulent
HS372_00954 non-virulent
HS372_00954 moderadamentevirulenta(nose)
HS372_00954 lung
Desired output (tab-separated):
HS372_01446 non-virulent, lung
HS372_00498 non-virulent, lung
HS372_00954 jointlungCNS, non-virulent, moderadamentevirulenta(nose), lung
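Perl from the command line (replace file with the name of your input file):
perl -lane'
  # -a splits each line on whitespace into @F
  ($n, $p) = @F;
  # remember each ID the first time it is seen, in input order
  $s{$n}++ or push @r, $n;
  # remember each value only once per ID, in input order
  $c{$n}{$p}++ or push @{$h{$n}}, $p;
  END {
    # $" is the separator used when an array is interpolated in a string
    $" = ", ";
    print "$_\t@{$h{$_}}" for @r;
  }
' file
Output: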
HS372_01446 non-virulent, lung
HS372_00498 non-virulent, lung
HS372_00954 jointlungCNS, non-virulent, moderadamentevirulenta(nose), lung
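Another Perl solution:
#!/usr/bin/perl
use strict;
use warnings;
# List::MoreUtils is not a core module and may need to be installed from CPAN
use List::MoreUtils qw/uniq/;
my %hash;
while ( <DATA> )
{
    chomp;
    my ( $key, $value ) = split;
    push @{$hash{$key}}, $value;
}
# each() returns keys in no particular order, so IDs may print in any order
while ( my ( $key, $values ) = each %hash )
{
    # parentheses keep the newline out of the list passed to uniq/join
    print "$key\t", join( ', ', uniq @$values ), "\n";
}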
__DATA__
HS372_01446 non-virulent
HS372_01446 non-virulent
HS372_01446 lung
HS372_00498 non-virulent
HS372_00498 non-virulent
HS372_00498 non-virulent
HS372_00498 lung
HS372_00498 lung
HS372_00954 jointlungCNS
HS372_00954 non-virulent
HS372_00954 non-virulent
HS372_00954 moderadamentevirulenta(nose)
HS372_00954 lung
This does what you ask and, in case it matters, also keeps the IDs and descriptions in the same order in which they appear in the file:
use strict;
use warnings;
open my $fh, '<', 'diseases.txt' or die "Cannot open diseases.txt: $!";
my %diseases;
my @ids;
while (<$fh>) {
    my ($id, $desc) = split;
    if (not $diseases{$id}) {
        # first time this ID is seen: start its list and record its position
        $diseases{$id}{list} = [$desc];
        $diseases{$id}{seen}{$desc} = 1;
        push @ids, $id;
    }
    elsif (not $diseases{$id}{seen}{$desc}) {
        # known ID, new description
        push @{ $diseases{$id}{list} }, $desc;
        $diseases{$id}{seen}{$desc} = 1;
    }
}
for my $id (@ids) {
    printf "%s\t%s\n", $id, join ', ', @{ $diseases{$id}{list} };
}
Output:
HS372_01446 non-virulent, lung
HS372_00498 non-virulent, lung
HS372_00954 jointlungCNS, non-virulent, moderadamentevirulenta(nose), lung
In Python:
from collections import defaultdict
a = """HS372_01446 non-virulent
HS372_01446 non-virulent
HS372_01446 lung
HS372_00498 non-virulent
HS372_00498 non-virulent
HS372_00498 non-virulent
HS372_00498 lung
HS372_00498 lung
HS372_00954 jointlungCNS
HS372_00954 non-virulent
HS372_00954 non-virulent
HS372_00954 moderadamentevirulenta(nose)
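HS372_00954 lung""".split("\n")
stuff = defaultdict(set)
for line in a:
    # split on any whitespace, so both tabs and spaces work
    uid, symp = line.split()
    # a set drops duplicates but does not keep the order of the values
    stuff[uid].add(symp)
for uid, symps in stuff.items():
    print("%s\t%s" % (uid, ", ".join(symps)))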
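Because the snippet above stores the values in a set, the order of the descriptions within each line is not guaranteed to match the input. A minimal order-preserving sketch is shown below, assuming Python 3.7 or later, where plain dicts keep insertion order. The function name collapse and the variable sample are made up for illustration.

```python
def collapse(lines):
    """Group second-column values by first column, dropping duplicates
    while keeping first-seen order for both IDs and values."""
    groups = {}  # plain dicts keep insertion order in Python 3.7+
    for line in lines:
        fields = line.split()  # splits on tabs or spaces
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        key, value = fields[0], fields[1]
        # The inner dict acts as an ordered set: only its keys matter.
        groups.setdefault(key, {})[value] = None
    return ["%s\t%s" % (key, ", ".join(values)) for key, values in groups.items()]


sample = """HS372_01446 non-virulent
HS372_01446 non-virulent
HS372_01446 lung
HS372_00498 non-virulent
HS372_00498 non-virulent
HS372_00498 non-virulent
HS372_00498 lung
HS372_00498 lung
HS372_00954 jointlungCNS
HS372_00954 non-virulent
HS372_00954 non-virulent
HS372_00954 moderadamentevirulenta(nose)
HS372_00954 lung""".splitlines()

for row in collapse(sample):
    print(row)
```

To use it on a real file, the list passed to collapse could come from open("input.txt") instead of the inline sample; the file name here is only a placeholder.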
In Java:
javac Collapse.java
java Collapse input.txt
import java.io.*;
import java.util.*;
public class Collapse {
    public static void main(String[] args) throws Exception {
        BufferedReader br = new BufferedReader(new InputStreamReader(new FileInputStream(args[0])));
        // LinkedHashMap keeps the IDs in the order they first appear in the file
        Map<String, Set<String>> output = new LinkedHashMap<String, Set<String>>();
        String line;
        while ((line = br.readLine()) != null) {
            // columns are separated by a tab character
            StringTokenizer st = new StringTokenizer(line, "\t");
            String key = st.nextToken();
            Set<String> set = output.get(key);
            if (set == null) {
                // LinkedHashSet drops duplicates and keeps insertion order
                output.put(key, set = new LinkedHashSet<String>());
            }
            set.add(st.nextToken());
        }
        br.close();
        for (String key : output.keySet()) {
            StringBuilder sb = new StringBuilder();
            for (String value : output.get(key)) {
                if (sb.length() != 0) sb.append(", ");
                sb.append(value);
            }
            System.out.println(key + "\t" + sb);
        }
    }
}
The standard UNIX tool for parsing text files is awk:
$ awk '!seen[$1,$2]++{a[$1]=(a[$1] ? a[$1]", " : "\t") $2} END{for (i in a) print i a[i]}' file
HS372_00498 non-virulent, lung
HS372_00954 jointlungCNS, non-virulent, moderadamentevirulenta(nose), lung
HS372_01446 non-virulent, lung
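If your data comes from a MySQL database (or you can import it into one), you can use the GROUP_CONCAT function.
See this answer: "Can I concatenate multiple MySQL rows into one field?"
It currently has 431 upvotes, so yours is a very common problem, and that answer shows a very elegant solution.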
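As a rough illustration of the database approach, the sketch below uses Python's built-in sqlite3 module as a stand-in for MySQL, since it needs no server. SQLite's group_concat plays the role of MySQL's GROUP_CONCAT here; in MySQL itself the aggregate would typically be written as GROUP_CONCAT(DISTINCT phenotype SEPARATOR ', '). The table name strains, the column names id and phenotype, and the function name collapse_with_sql are all assumptions made for this example.

```python
import sqlite3


def collapse_with_sql(rows):
    """Collapse (id, phenotype) pairs with an SQL aggregate.

    Uses an in-memory SQLite database; the table and column names
    are hypothetical and chosen only for this sketch.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE strains (id TEXT, phenotype TEXT)")
    conn.executemany("INSERT INTO strains VALUES (?, ?)", rows)
    # SQLite does not accept DISTINCT together with a custom separator,
    # so duplicates are removed in a subquery first.
    query = """
        SELECT id, group_concat(phenotype, ', ')
        FROM (SELECT DISTINCT id, phenotype FROM strains)
        GROUP BY id
    """
    result = conn.execute(query).fetchall()
    conn.close()
    return result


rows = [
    ("HS372_01446", "non-virulent"),
    ("HS372_01446", "non-virulent"),
    ("HS372_01446", "lung"),
    ("HS372_00498", "lung"),
]
for strain_id, phenotypes in collapse_with_sql(rows):
    print("%s\t%s" % (strain_id, phenotypes))
```

SQLite does not promise the order of the concatenated values, so this sketch should not be relied on when the order within a line matters.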
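In Perl:
use warnings;
use strict;
open my $input, '<', 'in.txt' or die "Cannot open in.txt: $!";
my %hash;
while (<$input>){
    chomp;
    my @split = split(' ');
    # nested hash: each ID maps to the set of descriptions seen for it
    $hash{$split[0]}{$split[1]} = 1;
}
# hash keys come back in no particular order
for my $key (keys %hash){
    print "$key\t";
    # dereference explicitly; keys on a hash reference no longer works in recent Perl versions
    for my $info (keys %{ $hash{$key} }){
        print "$info\t";
    }
    print "\n";
}
Which prints: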
HS372_01446 non-virulent lung
HS372_00954 non-virulent moderadamentevirulenta(nose) jointlungCNS lung
HS372_00498 non-virulent lung