How to combine columns with the same header within one file using Awk or Bash

I would like to know how to combine columns with duplicate headers in a file using bash/sed/awk, going from:

   x y  x  y
s1 3 4  6 10
s2 3 9 10  7
s3 7 1  3  2

to a single column per header, with the values of the duplicated columns summed row by row.

$ cat file
   x y  x  y
s1 3 4  6 10
s2 3 9 10  7
s3 7 1  3  2
$ cat tst.awk
NR==1 {
   for (i=1;i<=NF;i++) {
      flds[$i] = flds[$i] " " i+1
   }
   printf "%-3s",""
   for (hdr in flds) {
      printf "%3s",hdr
   }
   print ""
   next
}
{
   printf "%-3s",$1
   for (hdr in flds) {
      n = split(flds[hdr],fldNrs)
      sum = 0
      for (i=1; i<=n; i++) {
         sum += $(fldNrs[i])
      }
      printf "%3d",sum
   }
   print ""
}
$ awk -f tst.awk file
     x  y
s1   9 14
s2  13 16
s3  10  3
$ time awk -f ./tst.awk file
     x  y
s1   9 14
s2  13 16
s3  10  3
real    0m0.265s
user    0m0.030s
sys     0m0.108s

If you like, you can adjust the printf lines in the obvious way to get a different output format.
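As one illustration of such an adjustment, here is a minimal sketch (not the answer author's code) that emits tab-separated output and also keeps the headers in first-seen order, since the iteration order of `for (hdr in flds)` is unspecified in awk. The function name `sum_dup_cols` and the array names `idx`, `order`, and `col2idx` are assumptions made up for this example.

```shell
#!/usr/bin/env bash
# Sketch: sum columns that share a header, keeping headers in first-seen order.
# sum_dup_cols is a hypothetical helper name; it reads the table on stdin.
sum_dup_cols() {
   awk '
   BEGIN { OFS = "\t" }
   NR == 1 {
      # The header line has no row label, so header i sits above data field i+1.
      out = ""
      for (i = 1; i <= NF; i++) {
         if (!($i in idx)) {        # first time this header is seen
            idx[$i] = ++n
            order[n] = $i
            out = out OFS $i
         }
         col2idx[i + 1] = idx[$i]   # data field number -> output column number
      }
      print out
      next
   }
   {
      split("", sum)                # clear the per-row sums
      for (f = 2; f <= NF; f++) sum[col2idx[f]] += $f
      out = $1
      for (k = 1; k <= n; k++) out = out OFS sum[k]
      print out
   }'
}

# Usage: feed the sample table from the question on stdin.
sum_dup_cols <<'EOF'
   x y  x  y
s1 3 4  6 10
s2 3 9 10  7
s3 7 1  3  2
EOF
```

Reading from stdin keeps the helper usable in a pipeline; `sum_dup_cols < file` works equally well for the file shown above.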

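Here is a bash equivalent, in response to comments elsewhere in this thread. Don't use it; the awk solution is the right one. This is only to show how you should write it in bash if, for some inexplicable reason, you wanted to: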

$ cat tst.sh
declare -A flds
while IFS= read -r rec
do
   lineNr=$(( lineNr + 1 ))
   set -- $rec
   if (( lineNr == 1 ))
   then
      fldNr=1
      for fld
      do
         fldNr=$(( fldNr + 1 ))
         flds[$fld]+=" $fldNr"
      done
      printf "%-3s" ""
      for hdr in "${!flds[@]}"
      do
         printf "%3s" "$hdr"
      done
      printf "\n"
   else
      printf "%-3s" "$1"
      for hdr in "${!flds[@]}"
      do
         fldNrs=( ${flds[$hdr]} )
         sum=0
         for fldNr in "${fldNrs[@]}"
         do
            val=${!fldNr}   # indirect expansion: the positional parameter numbered $fldNr
            sum=$(( sum + val ))
         done
         printf "%3d" "$sum"
      done
      printf "\n"
   fi
done < "$1"
$
$ time ./tst.sh file
     x  y
s1   9 14
s2  13 16
s3  10  3
real    0m0.062s
user    0m0.031s
sys     0m0.046s

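Note that it runs in about the same time as the awk script (see comments elsewhere in this thread). Caveat: I never write bash scripts to process text files, so I am not claiming that the bash script above is perfect. It is only an example of how to approach this in bash, for comparison with the other scripts in this thread that I claim should be rewritten!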

This is not a one-liner. You can do it using Bash v4, Bash's dictionaries (associative arrays), and some shell tools.

Run the script below with the name of the file to process as its argument:

bash script_below.sh your_file

Here is the script:

declare -A coltofield
headerdone=0
# Take the first line of the input file and extract all fields 
# and their position. Start with position value 2 because of the 
# format of the following lines
while read line; do
    colnum=$(echo $line | cut -d "=" -f 1)
    field=$(echo $line | cut -d "=" -f 2)
    coltofield[$colnum]=$field
done < <(head -n 1 $1 | sed  -e 's/^[[:space:]]*//;' -e 's/[[:space:]]*$//;' -e 's/[[:space:]]\+/\n/g;' | nl -v 2 -n ln  | sed -e 's/[[:space:]]\+/=/g;')
# Read the rest of the file starting with the second line             
while read line; do
    declare -A computation
    declare varname

    # Turn the line into key=value pairs. The key is the position of
    # the value in the line
    while read value; do
        vcolnum=$(echo $value | cut -d "=" -f 1)
        vvalue=$(echo $value | cut -d "=" -f 2)
        # The first value is the line variable name 
        # (s1, s2)                                       
        if [[ $vcolnum == "1" ]]; then
            varname=$vvalue
            continue
        fi
        # Get the name of the field by the column 
        # position                                                     
        field=${coltofield[$vcolnum]}
        # Add the value to the current sum for this field
        computation[$field]=$((computation[$field]+${vvalue}))
    done < <(echo $line | sed  -e 's/^[[:space:]]*//;' -e 's/[[:space:]]*$//;' -e 's/[[:space:]]\+/\n/g;' | nl -n ln  | sed -e 's/[[:space:]]\+/=/g;')

    if [[ $headerdone == "0" ]]; then
        echo -e -n "\t"
        for key in ${!computation[@]}; do echo -n -e "$key\t" ; done; echo
        headerdone=1
    fi
    echo -n -e "$varname\t"
    for value in ${computation[@]}; do echo -n -e "$value\t"; done; echo
    computation=()
done < <(tail -n +2 $1)

Another AWK alternative:

$ cat f
   x y  x  y
s1 3 4  6 10
s2 3 9 10  7
s3 7 1  3  2
$ cat f.awk
BEGIN {
OFS="\t";
}
NR==1 {
  #need header for 1st column
  for(f=NF; f>=1; --f)
    $(f+1) = $f;
  $1="";
  for(f=1; f<=NF; ++f)
    fld2hdr[f]=$f;
}
{
  for(f=1; f<=NF; ++f)
    if($f ~ /^[0-9]/)
      colValues[fld2hdr[f]]+=$f;
    else
      colValues[fld2hdr[f]]=$f;
  for (i in colValues)
    row = row colValues[i] OFS;
  print row;
  split("", colValues);
  row=""
}
$ awk -f f.awk f
        x       y
s1      9       14
s2      13      16
s3      10      3

$ awk 'BEGIN{print "   x y"} a=$2+$4, b=$3+$5 {print $1, a, b}' file
   x y
s1 9 14
s2 13 16
s3 10 3

No doubt there is a better way to display the header, but my awk is a bit sketchy.
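One way to avoid hard-coding the header, sketched here under the same assumption as the one-liner (fields 2 and 4 belong to the first header, fields 3 and 5 to the second), is to take the names from the first line of the file. Handling the header with `NR == 1 ... next` also means the sums are no longer used as a pattern, so a row whose sum is 0 is still printed. The function name `sum_pairs` is made up for this example.

```shell
#!/usr/bin/env bash
# Sketch: same fixed column layout as the one-liner (fields 2+4 and 3+5),
# but the header names are read from the first line of the input.
# sum_pairs is a hypothetical helper name; it reads the table on stdin.
sum_pairs() {
   awk '
   NR == 1 { print "  ", $1, $2; next }   # header line: $1 and $2 are the first two names
   { print $1, $2 + $4, $3 + $5 }         # data line: $1 is the row label
   '
}

# Usage: feed the sample table from the question on stdin.
sum_pairs <<'EOF'
   x y  x  y
s1 3 4  6 10
s2 3 9 10  7
s3 7 1  3  2
EOF
```

This still assumes exactly two headers, each appearing twice in the order shown; for an arbitrary number of repeated headers, a header-to-field map like the one in the first answer is the more general approach.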

Here is a Perl solution, just for fun:

cat table.txt | perl -e'@h=grep{$_}split/\s+/,<>;while(@l=grep{$_}split/\s+/,<>){for$i(1..$#l){$t{$l[0]}{$h[$i-1]}+=$l[$i]}};printf "    %s\n",(join"  ",sort keys%{$t{(keys%t)[0]}});for$h(sort keys%t){printf"$h %s\n",(join " ",map{sprintf"%2d",$_}@{$t{$h}}{sort keys%{$t{$h}}})};'
