AWK根据file1中的部分头匹配选择file2中的列

  • 本文关键字:选择 file2 根据 file1 AWK bash awk
  • 更新时间 :
  • 英文 :


我有一个文件("File1")有~40-80k列和~10k行。File1中的列标题由一个唯一标识符(例如"4b_1.04:")和一个描述(例如"Colname_3")组成。File2包含一个唯一标识符列表(即不完全匹配)。是否有一种方法从File1中提取列,使用基于部分匹配的File2列标题列表?

例如:

"File1">

patient_ID,response,0_4: Number of Variants,0_6: Number of CDS Variants,3_2.83: Colname_1,3_8.5102: Colname_2,4b_1.04: Colname_3,4_1.0: Colname_4,4_7.7101: Colname_5
ID_237.vcf,Benefit,13008,4343,0.65,1.23,0.17,2.57,4.22
ID_841.vcf,Benefit,15127,2468,0.9,0.68,2.39,1.8,1.6
ID_767.vcf,Benefit,5190,3261,0.73,1.16,1.99,0.79,1.17
ID_263.vcf,Benefit,16888,9548,0.61,1.66,0.73,2.42,1.55
ID_179.vcf,Benefit,3545,842,0.22,0.67,0.48,3.9,3.95
ID_408.vcf,Benefit,1427,4583,0.92,0.76,0.17,0.8,1.27
ID_850.vcf,Benefit,13835,4682,0.8,1.21,0.05,1.74,4.61
ID_856.vcf,Benefit,8939,8435,0.31,0.99,2.5,1.36,0.74
ID_328.vcf,Benefit,14220,8481,0.23,0.22,0.79,0.14,1.08
ID_704.vcf,Benefit,18145,914,0.66,1.69,0.17,0.4,3.13
ID_828.vcf,No_Benefit,4798,8163,0.74,0.89,1.04,1.68,1.29
ID_16.vcf,No_Benefit,6472,528,0.47,1.5,1.74,0.19,3.54
ID_380.vcf,No_Benefit,9827,8359,0.86,1.59,2.41,0.11,3.71
ID_559.vcf,No_Benefit,10247,9150,0.68,0.78,1.02,0.69,1.31
ID_466.vcf,No_Benefit,11092,4078,0.16,0.03,0.4,1.51,2.86
ID_925.vcf,No_Benefit,4809,2908,0.01,1.49,2.32,2.35,4.58
ID_573.vcf,No_Benefit,4341,4307,0.87,0.14,2.63,1.35,3.54
ID_497.vcf,No_Benefit,18279,663,0.1,1.06,2.96,1.98,4.22
ID_830.vcf,No_Benefit,18505,456,0.31,0.25,1.96,3.01,4.6
ID_665.vcf,No_Benefit,15072,2962,0.43,1.35,0.76,0.68,1.47

"File2">

patient_ID
response
0_4:
0_6:
4b_1.04:
3_2.83:
3_8.5102:

NB。File2中的标识符与File1中的列标题的顺序不同,并且File1中的分隔符是制表符,而不是逗号(从SO复制粘贴时制表符会转换为空格吗?)

我的尝试:

awk 'NR==FNR{T[$1]=NR; next} FNR==1 {MX=NR-1; for (i=1; i<=NF; i++) if ($i in T) C[T[$i]] = i } {for (j=1; j<=MX; j++) printf "%s%s", $C[j], (j==MX)?RS:"t" }' File2 <(tr -s "," "t" < File1)

不幸的是,这打印了'部分'头-我想要完整的头-并且似乎与File1的顺序不同而挣扎。

预期结果(awk 'BEGIN{FS=","; OFS="t"}{print $1, $2, $3, $4, $7, $5, $6}' File1):

patient_ID  response    0_4: Number of Variants 0_6: Number of CDS Variants 4b_1.04: Colname_3  3_2.83: Colname_1   3_8.5102: Colname_2
ID_237.vcf  Benefit 13008   4343    0.17    0.65    1.23
ID_841.vcf  Benefit 15127   2468    2.39    0.9 0.68
ID_767.vcf  Benefit 5190    3261    1.99    0.73    1.16
ID_263.vcf  Benefit 16888   9548    0.73    0.61    1.66
ID_179.vcf  Benefit 3545    842 0.48    0.22    0.67
ID_408.vcf  Benefit 1427    4583    0.17    0.92    0.76
ID_850.vcf  Benefit 13835   4682    0.05    0.8 1.21
ID_856.vcf  Benefit 8939    8435    2.5 0.31    0.99
ID_328.vcf  Benefit 14220   8481    0.79    0.23    0.22
ID_704.vcf  Benefit 18145   914 0.17    0.66    1.69
ID_828.vcf  No_Benefit  4798    8163    1.04    0.74    0.89
ID_16.vcf   No_Benefit  6472    528 1.74    0.47    1.5
ID_380.vcf  No_Benefit  9827    8359    2.41    0.86    1.59
ID_559.vcf  No_Benefit  10247   9150    1.02    0.68    0.78
ID_466.vcf  No_Benefit  11092   4078    0.4 0.16    0.03
ID_925.vcf  No_Benefit  4809    2908    2.32    0.01    1.49
ID_573.vcf  No_Benefit  4341    4307    2.63    0.87    0.14
ID_497.vcf  No_Benefit  18279   663 2.96    0.1 1.06
ID_830.vcf  No_Benefit  18505   456 1.96    0.31    0.25
ID_665.vcf  No_Benefit  15072   2962    0.76    0.43    1.35

请尝试以下操作:

awk -F"t" '
NR==FNR {                                   # handle File2
partial[FNR] = $i                       # create a list of desired header (partial)
len = FNR                               # array lenth of "partial"
next
}
FNR==1 {                                    # handle header line of File1
ofs = line = ""
for (j = 1; j <= len; j++) {
for (i = 1; i <= NF; i++) {
if (index($i, partial[j]) == 1) {    # test the partial match
header[++n] = i             # if match, store the position
line = line ofs $i
ofs = "t"
}
}
}
print line                              # print the desired header (full)
}
FNR>1 {                                     # handle body lines of File1
ofs = line = ""
for (i = 1; i <= n; i++) {              # positions of desired columns
line = line ofs $header[i]
ofs = "t"
}
print line
}
' File2 File1

输出:

patient_ID  response    0_4: Number of Variants 0_6: Number of CDS Variants 4b_1.04: Colname_33_2.83: Colname_1 3_8.5102: Colname_2
ID_237.vcf  Benefit 13008   4343    0.17    0.65    1.23
ID_841.vcf  Benefit 15127   2468    2.39    0.9 0.68
ID_767.vcf  Benefit 5190    3261    1.99    0.73    1.16
ID_263.vcf  Benefit 16888   9548    0.73    0.61    1.66
ID_179.vcf  Benefit 3545    842 0.48    0.22    0.67
ID_408.vcf  Benefit 1427    4583    0.17    0.92    0.76
ID_850.vcf  Benefit 13835   4682    0.05    0.8 1.21
ID_856.vcf  Benefit 8939    8435    2.5 0.31    0.99
ID_328.vcf  Benefit 14220   8481    0.79    0.23    0.22
ID_704.vcf  Benefit 18145   914 0.17    0.66    1.69
ID_828.vcf  No_Benefit  4798    8163    1.04    0.74    0.89
ID_16.vcf   No_Benefit  6472    528 1.74    0.47    1.5
ID_380.vcf  No_Benefit  9827    8359    2.41    0.86    1.59
ID_559.vcf  No_Benefit  10247   9150    1.02    0.68    0.78
ID_466.vcf  No_Benefit  11092   4078    0.4 0.16    0.03
ID_925.vcf  No_Benefit  4809    2908    2.32    0.01    1.49
ID_573.vcf  No_Benefit  4341    4307    2.63    0.87    0.14
ID_497.vcf  No_Benefit  18279   663 2.96    0.1 1.06
ID_830.vcf  No_Benefit  18505   456 1.96    0.31    0.25
ID_665.vcf  No_Benefit  15072   2962    0.76    0.43    1.35