根据bash/posh中数组中的列匹配值，将CSV拆分为两个文件

我有一个输入CSV，我想将其拆分为两个CSV文件。如果第4列的值与WLTarray中的任何值匹配，则应将其放入输出文件1中，如果不匹配则应放入输出文件2中。

WLTarray:

"22532" "79994" "18809" "21032"

输入CSV文件：

header1,header2,header3,header4,header5,header6,header7,header8
"83","6344324","585677","22532","Entitlements","BX","22532:718","36721"
"83","1223432","616454","79994","Compliance Stuff","DR","79994:64703","206134"
"83","162217","616454","83223","Data Enrichment","IEO","83223:64701","206475"
"83","267216","616457","79994","Compliance Engine","ABC","79994:64703","206020"

输出CSV文件1:

header1,header2,header3,header4,header5,header6,header7,header8
"83","6344324","585677","22532","Entitlements","BX","22532:718","36721"
"83","1223432","616454","79994","Compliance Stuff","DR","79994:64703","206134"
"83","267216","616457","79994","Compliance Engine","ABC","79994:64703","206020"

输出CSV文件2:

header1,header2,header3,header4,header5,header6,header7,header8
"83","162217","616454","83223","Data Enrichment","IEO","83223:64701","206475"

我一直在寻找awk来过滤它(python和perl在我的环境中不是一个选项(，但我认为可能有一个更聪明的方法：

declare -a WLTarray=("22532" "79994" "18809" "21032")
for WLTvalue in "${WLTarray[@]}" #Everything in the WLTarray will go to $filename-WLT.tmp
do
awk -F, '($4=='$WLTvalue'){print}' $filename.tmp >> $filename-WLT.tmp #move the lines to the WLT file
# now filter to remove non matching values? why not just move the rows entirely?        
done

使用常规awk，您可以使用split和substr(处理双引号删除以进行比较(，并根据指示拆分csv文件。例如，您可以使用：

awk 'BEGIN { FS=","; s="22532 79994 18809 21032"
split (s,a," ")     # split s into array a
for (i in a)        # loop over each index in a
b[a[i]]=1       # use value in a as index for b
}
FNR == 1 {      # first record, write header to both output files
print $0 > "output1.csv"
print $0 > "output2.csv"
next
}
substr($4,2,length($4)-2) in b {    # 4th field w/o quotes in b?
print $0 > "output1.csv"        # write to output1.csv
next
}
{ print $0 > "output2.csv" }        # otherwise write to output2.csv
' input.csv

其中：

在BEGIN {...}规则中，您将字段分隔符(FS(设置为逗号分隔，并将包含所需output1.csv域4匹配项的字符串拆分到数组a中，然后使用a中的值作为数组b中的索引进行循环(以允许进行简单的i in b检查(
第一条规则被应用于文件中的第一条记录(头行(，该记录被简单地写入两个输出文件
下一个规则移除字段4周围的双引号，然后检查字段4中的数字是否与数组CCD_。如果是，则将记录写入output1.csv，否则将其写入output2.csv

示例输入文件

$ cat input.csv
header1,header2,header3,header4,header5,header6,header7,header8
"83","6344324","585677","22532","Entitlements","BX","22532:718","36721"
"83","1223432","616454","79994","Compliance Stuff","DR","79994:64703","206134"
"83","162217","616454","83223","Data Enrichment","IEO","83223:64701","206475"
"83","267216","616457","79994","Compliance Engine","ABC","79994:64703","206020"

结果输出文件

$ cat output1.csv
header1,header2,header3,header4,header5,header6,header7,header8
"83","6344324","585677","22532","Entitlements","BX","22532:718","36721"
"83","1223432","616454","79994","Compliance Stuff","DR","79994:64703","206134"
"83","267216","616457","79994","Compliance Engine","ABC","79994:64703","206020"
$ cat output2.csv
header1,header2,header3,header4,header5,header6,header7,header8
"83","162217","616454","83223","Data Enrichment","IEO","83223:64701","206475"

您可以这样使用gawk：

test.awk

#!/usr/bin/gawk -f
BEGIN {
split("22532 79994 18809 21032", a)
for(i in a) {
WLTarray[a[i]]
}
FPAT="[^",]+"
}
NR > 1 {
if ($4 in WLTarray) {
print >> "output1.csv"
} else {
print >> "output2.csv"
}
}

使其可执行并像这样运行：

chmod +x test.awk
./test.awk input.csv

使用带有过滤器文件的grep作为输入是最简单的答案。

declare -a WLTarray=("22532" "79994" "18809" "21032")
for WLTvalue in "${WLTarray[@]}" 
do
awkstring="'$4 == ""\"$WLTvalue\""" {print}'"
eval "awk -F, $awkstring input.csv >> output.WLT.csv"
done
grep -v -x -f output.WLT.csv input.csv > output.NonWLT.csv

相关内容

最新更新

热门标签：