Filtering a file by a column value greater than a number (awk not working)



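I am trying to filter a file, keeping the lines whose value in column 8 is >= 10. I am using awk, but for some reason it is not working. Am I doing something wrong? What am I missing?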

head df_TPM.csv
LQNS02136402.1_14821_3p,12680.71611,11346.42368,11686.28693,9067.797819,7429.467928,5551.660333,3246.956281
LQNS02000137.1_325_3p,8342.540984,5905.726173,4503.363041,3616.191278,3142.965662,3678.829299,6288.621969
LQNS02278148.1_40791_3p,4921.502758,2461.882836,429.824973,261.273116,132.0239748,68.6191655,70.8815385
LQNS02278089.1_34112_3p,4246.71324,4584.529009,8687.922574,7570.83746,5801.384953,2870.020801,734.3131465
LQNS02278075.1_32377_5p,4143.547577,4093.91803,10804.12323,10062.99269,7925.240969,4712.484455,1080.915573
LQNS02138569.1_14892_3p,2668.27957,2160.173542,837.2584183,233.2310273,84.62362925,64.6037895,23.456714
LQNS02278075.1_32324_5p,2331.608924,491.8868983,1527.312199,881.8683105,747.1474225,347.397634,74.07259175
LQNS02278075.1_32382_3p,2140.686095,2439.122353,10837.38169,12569.95295,9385.530878,6022.323737,1705.900969
LQNS02000138.1_777_5p,1819.275149,1762.009649,8565.396754,33280.90019,32176.07604,15849.37306,11872.99383
LQNS02278186.1_47223_3p,1687.843418,728.4288968,1328.048172,1306.424238,2102.27342,14.78892225,9.92647375
#Extract column 1 and 8 and print if $8>=10
cat df_TPM.csv |awk -F"," '{print $1, $8}' | grep -E "^LQN" | awk -F " " '$2>= 10'
LQNS02276925.1_23356_5p 5.352369
LQNS02277221.1_25158_5p 2.82778125
LQNS02277812.1_29775_3p 11.1090745
LQNS02278074.1_32154_3p 6.124789
LQNS02278139.1_39525_5p 22.6656355
#As you can see lots of numbers shouldn't be there (ex: 2.82778125 < 10)

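Going by the OP's comments: if you do not want lines that start with the text LQN, and you want to check whether the 8th column is greater than or equal to 10, try the following. (To keep only the lines that do start with LQN, remove the ! from the code below.)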

awk -F"," '$8+0 >= 10 && !/^LQN/{print $1, $8}' df_TPM.csv

Or, to get the total number of matching lines, try the following. The counting can be done within a single awk.

awk -F"," '$8+0 >= 10 && !/^LQN/{count++} END{print count}' df_TPM.csv

Explanation: a detailed explanation of the above.

awk -F"," '               ##Starting awk program from here.
$8+0 >= 10 && !/^LQN/{    ##Checking whether the 8th field is greater than or equal to 10 and the line does NOT start with LQN.
count++                 ##Increasing count by 1 here.
}
END{                      ##Starting END block of this awk program from here.
print count             ##Printing count value here.
}
' df_TPM.csv              ##Mentioning Input_file name here.
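
One small refinement, offered as a sketch rather than as part of the original answer: when no line matches, `count` is never set and `print count` prints an empty line. Printing `count+0` yields 0 instead. The sample file and its contents below are made up for illustration.

```shell
# Hypothetical sample in which no line satisfies the condition
# (both lines start with LQN and have small values in column 8).
sample=$(mktemp)
printf 'LQN_a,1,2,3,4,5,6,2.5\nLQN_b,1,2,3,4,5,6,3.5\n' > "$sample"

# count is never incremented here, so "print count" would print an empty line;
# count+0 coerces the unset variable to the number 0.
count=$(awk -F, '$8+0 >= 10 && !/^LQN/ {count++} END{print count+0}' "$sample")
echo "matching lines: $count"
```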

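To handle Control-M characters (carriage returns) within the awk code itself, try the following, assuming you do not want Control-M characters in your Input_file: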

awk -F"," '{gsub(/\r/,"")} $8 >= 10 && !/^LQN/{count++} END{print count}' df_TPM.csv
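
As a variation on the same idea, the carriage return can be removed only where it appears at the end of the line, using an anchored `sub`. This is a minimal sketch under the assumption that the file has CRLF line endings; the sample IDs and values are invented.

```shell
# Hypothetical sample with CRLF (DOS) line endings.
sample=$(mktemp)
printf 'LQN_a,1,2,3,4,5,6,2.5\r\nXYZ_b,1,2,3,4,5,6,22.5\r\nXYZ_c,1,2,3,4,5,6,9.9\r\n' > "$sample"

# sub(/\r$/, "") strips a single trailing carriage return from the record;
# changing $0 makes awk split the fields again, so $8 no longer carries the \r.
count=$(awk -F, '{sub(/\r$/, "")} $8+0 >= 10 && !/^LQN/ {count++} END{print count+0}' "$sample")
echo "$count"
```

Only the XYZ_b line both lacks the LQN prefix and has an eighth field of at least 10.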

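You need to tell awk to treat $8 as a number by evaluating $8+0. If the file has DOS line endings, the last field ends in a carriage return, and some awk implementations may then compare it as a string rather than as a number. It is advisable to make sure you have GNU awk installed to avoid issues. You may also run dos2unix on the file before processing it to normalize the line endings.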
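
To check whether carriage returns are really the cause, and to see the effect of `$8+0`, something like the following can be run. It is a sketch on an invented two-line sample with CRLF endings, not on the OP's file. The result of the plain `$8 >= 10` comparison is left unstated on purpose because it can differ between awk implementations.

```shell
# Hypothetical sample with CRLF (DOS) line endings; IDs and values are made up.
sample=$(mktemp)
printf 'LQN_a,1,2,3,4,5,6,2.82778125\r\nLQN_b,1,2,3,4,5,6,22.6656355\r\n' > "$sample"

# Count the lines that end in a carriage return (non-zero suggests DOS line endings).
crlf_lines=$(awk '/\r$/ {n++} END{print n+0}' "$sample")

# Without +0: how "2.82778125\r" is compared depends on the awk implementation.
as_is=$(awk -F, '$8 >= 10 {n++} END{print n+0}' "$sample")

# With +0: the field is converted to a number before the comparison.
as_number=$(awk -F, '$8+0 >= 10 {n++} END{print n+0}' "$sample")

echo "lines ending in CR: $crlf_lines"
echo "matches without +0: $as_is"
echo "matches with +0:    $as_number"
```

If the count without `+0` is larger than the count with `+0`, the field was being compared as a string.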

The whole command can be written as

awk -F"," '/^LQN/ && $8+0 >= 10 {print $1, $8}' df_TPM.csv

See the online awk demo.

NOTE: To just count these lines, use

awk -F, '/^LQN/ && $8+0 >= 10 {cnt++} END{print cnt}' df_TPM.csv

To find the lines that do not start with LQN, just add the negation operator ! before /^LQN/:

awk -F, '!/^LQN/ && $8+0 >= 10 {cnt++} END{print cnt}' df_TPM.csv

Details

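  • -F"," (= -F,) - sets the field separator to a comma
  • /^LQN/ && $8+0 >= 10 - if the current line starts with LQN and the eighth field is greater than or equal to 10
  • !/^LQN/ && $8+0 >= 10 - if the current line does not start with LQN and the eighth field is greater than or equal to 10
  • {print $1, $8} - prints fields 1 and 8
  • {cnt++} - increments the cnt variable
  • END{print cnt} - prints the cnt variable once awk has finished processing the lines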
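
The pieces listed above can be tried end to end on two rows taken from the question's `head` output. This is only a sketch: the temporary file stands in for df_TPM.csv, and `cnt+0` is used so that a count of zero would print as 0.

```shell
# Two rows copied from the question; the temporary file stands in for df_TPM.csv.
sample=$(mktemp)
cat > "$sample" <<'EOF'
LQNS02136402.1_14821_3p,12680.71611,11346.42368,11686.28693,9067.797819,7429.467928,5551.660333,3246.956281
LQNS02278186.1_47223_3p,1687.843418,728.4288968,1328.048172,1306.424238,2102.27342,14.78892225,9.92647375
EOF

# Print fields 1 and 8 for lines that start with LQN and have an eighth field >= 10.
printed=$(awk -F, '/^LQN/ && $8+0 >= 10 {print $1, $8}' "$sample")

# Count the same lines instead of printing them.
counted=$(awk -F, '/^LQN/ && $8+0 >= 10 {cnt++} END{print cnt+0}' "$sample")

echo "$printed"
echo "$counted"
```

Only the first row passes the filter, since the eighth field of the second row is 9.92647375.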
