AWK: looping over columns to count matches


I have a tab-delimited file that looks like this:

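SampleID    dbSNP   Min.alle    M.zygo  Sample1 Sample2 Sample3
311 rs1490413   A   Homo    G   A   G
311 rs730123    G   Homo    A   G   A
311 rs7532151   A   Homo    A   C   C
311 rs1434369   G   Homo    T   G   T
311 rs1563172   T   Homo    T   C   C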

Here is one possible solution:

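awk '
BEGIN {
    OFS = "\t"; printf "%s\t", "SampleID"
}
NR == 1 {                          # header row: label the sample columns
    for (i = 5; i <= NF; i++) {
        printf "Sample%s\t", (i - 4)
    }
}
NR > 1 {
    sample[$1]++                   # rows seen for this SampleID
    !sampleID[$1]++                # record each distinct SampleID
    for (i = 5; i <= NF; i++) {
        if ($3 == $i) {            # sample allele equals Min.alle (column 3)
            count[$1, i]++
        }
    }
}
END {
    for (j in sampleID) {
        print ""
        printf "%s\t", j
        for (i = 5; i <= NF; i++) {
            printf "%s\t", count[j, i] / sample[j]
        }
    }
}' inputfile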
SampleID  Sample1  Sample2  Sample3
311       0.4      0.6      0

Rather than dividing by (NR-1), this divides by the number of rows for each SampleID, so it still works if the file contains other sample IDs:

cat test.txt
SampleID    dbSNP   Min.alle    M.zygo  Sample1 Sample2 Sample3
311 rs1490413   A   Homo    G   A   G
311 rs730123    G   Homo    A   G   A
311 rs7532151   A   Homo    A   C   C
311 rs1434369   G   Homo    T   G   T
311 rs1563172   T   Homo    T   C   C
312 rs1490413   A   Homo    G   A   G
312 rs730123    G   Homo    A   G   A
312 rs7532151   A   Homo    A   C   C
312 rs1434369   G   Homo    T   G   T
312 rs1563172   G   Homo    T   C   C
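awk '
BEGIN {
    OFS = "\t"; printf "%s\t", "SampleID"
}
NR == 1 {                          # header row: label the sample columns
    for (i = 5; i <= NF; i++) {
        printf "Sample%s\t", (i - 4)
    }
}
NR > 1 {
    sample[$1]++                   # rows seen for this SampleID
    !sampleID[$1]++                # record each distinct SampleID
    for (i = 5; i <= NF; i++) {
        if ($3 == $i) {            # sample allele equals Min.alle (column 3)
            count[$1, i]++
        }
    }
}
END {
    for (j in sampleID) {
        print ""
        printf "%s\t", j
        for (i = 5; i <= NF; i++) {
            printf "%s\t", count[j, i] / sample[j]
        }
    }
}' test.txt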
SampleID    Sample1 Sample2 Sample3
311         0.4     0.6     0
312         0.2     0.6     0
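
To make the difference between the two denominators concrete, here is a minimal sketch that computes the Sample1 fraction both ways on the ten rows above. The function name `compare_denominators` and the temporary demo file are assumptions made for illustration; they are not part of the original answer.

```shell
#!/bin/sh
# Sketch only: compare_denominators is a hypothetical helper.
# For each SampleID it prints the Sample1 match fraction computed two ways:
#   column 2 = matches / rows for that SampleID
#   column 3 = matches / (NR - 1), i.e. all data rows in the file
compare_denominators() {
    awk '
    NR > 1 {
        rows[$1]++                    # rows belonging to this SampleID
        if ($3 == $5) hits[$1]++      # Sample1 (column 5) equals Min.alle (column 3)
    }
    END {
        for (id in rows)
            printf "%s\t%s\t%s\n", id, hits[id] / rows[id], hits[id] / (NR - 1)
    }' "$1" | sort
}

# Demo on a copy of the ten data rows shown above.
demo=$(mktemp)
cat > "$demo" <<'EOF'
SampleID dbSNP Min.alle M.zygo Sample1 Sample2 Sample3
311 rs1490413 A Homo G A G
311 rs730123 G Homo A G A
311 rs7532151 A Homo A C C
311 rs1434369 G Homo T G T
311 rs1563172 T Homo T C C
312 rs1490413 A Homo G A G
312 rs730123 G Homo A G A
312 rs7532151 A Homo A C C
312 rs1434369 G Homo T G T
312 rs1563172 G Homo T C C
EOF
compare_denominators "$demo"
rm -f "$demo"
```

With two sample IDs of five rows each, the per-ID column should give 0.4 and 0.2 for Sample1, while dividing by (NR-1) halves both values, because the denominator then counts the rows of the other sample ID as well.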

Depending on the size of the file, it may be worth looking at other languages for this task; for example, this is relatively trivial in R:

library(dplyr)
df <- read.table(text = "SampleID   dbSNP   Min.alle    M.zygo  Sample1 Sample2 Sample3
311 rs1490413   A   Homo    G   A   G
311 rs730123    G   Homo    A   G   A
311 rs7532151   A   Homo    A   C   C
311 rs1434369   G   Homo    T   G   T
311 rs1563172   T   Homo    T   C   C", header = TRUE)
df %>%
group_by(SampleID) %>%
summarise(across(starts_with("Sample"), ~mean(.x == Min.alle)))
#> # A tibble: 1 × 4
#>   SampleID Sample1 Sample2 Sample3
#>      <int>   <dbl>   <dbl>   <dbl>
#> 1      311     0.4     0.6       0

Edit

To print the column names (instead of "Sample_n"), you can use:

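awk '
BEGIN {
    OFS = "\t"; printf "%s\t", "SampleID"
}
NR == 1 {                          # header row: reuse the original column names
    for (i = 5; i <= NF; i++) {
        printf "%s\t", $i
    }
}
NR > 1 {
    sample[$1]++                   # rows seen for this SampleID
    !sampleID[$1]++                # record each distinct SampleID
    for (i = 5; i <= NF; i++) {
        if ($3 == $i) {            # sample allele equals Min.alle (column 3)
            count[$1, i]++
        }
    }
}
END {
    for (j in sampleID) {
        print ""
        printf "%s\t", j
        for (i = 5; i <= NF; i++) {
            printf "%s\t", count[j, i] / sample[j]
        }
    }
}' inputfile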
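
The script above reads NF inside END, leaves a trailing tab on each line, and prints the sample IDs in whatever order `for (j in ...)` happens to return. A possible variant, offered as a sketch rather than a replacement, stores the column count while reading the header and keeps the IDs in first-seen order. The function name `allele_fractions`, the variables `ncol` and `order`, and the demo file are assumptions, and the input is assumed to be truly tab-separated with the same column layout.

```shell
#!/bin/sh
# Sketch only: allele_fractions is a hypothetical name, not from the original answer.
# Assumes a tab-separated file with SampleID in column 1, Min.alle in column 3,
# and the sample columns starting at column 5.
allele_fractions() {
    awk '
    BEGIN { FS = OFS = "\t" }
    NR == 1 {
        ncol = NF                                # remember the column count for END
        line = $1
        for (i = 5; i <= ncol; i++) line = line OFS $i
        print line                               # header with the original column names
        next
    }
    {
        if (!($1 in rows)) order[++n] = $1       # keep first-seen order of SampleIDs
        rows[$1]++
        for (i = 5; i <= ncol; i++)
            if ($3 == $i) count[$1, i]++
    }
    END {
        for (k = 1; k <= n; k++) {
            id = order[k]
            line = id
            for (i = 5; i <= ncol; i++) line = line OFS (count[id, i] / rows[id])
            print line                           # no trailing tab, newline-terminated
        }
    }' "$1"
}

# Demo: build a tab-separated copy of the example data and run the function.
demo=$(mktemp)
tr ' ' '\t' > "$demo" <<'EOF'
SampleID dbSNP Min.alle M.zygo Sample1 Sample2 Sample3
311 rs1490413 A Homo G A G
311 rs730123 G Homo A G A
311 rs7532151 A Homo A C C
311 rs1434369 G Homo T G T
311 rs1563172 T Homo T C C
312 rs1490413 A Homo G A G
312 rs730123 G Homo A G A
312 rs7532151 A Homo A C C
312 rs1434369 G Homo T G T
312 rs1563172 G Homo T C C
EOF
allele_fractions "$demo"
rm -f "$demo"
```

Building each output line as a string and printing it once avoids the trailing separator, at the cost of a slightly longer script.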