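I have a large CSV file that I process with awk, with the field separator set to a comma. However, some fields are quoted and contain commas, which causes the following problem:
Original file: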
Downloads $ cat testfile.csv
"aaa","bbb","ccc","dddd"
"aaa","bbb","ccc","d,dd,d"
"aaa","bbb","ccc","dd,d,d"
This is what I am trying:
Downloads $ cat testfile.csv | awk -F "," '{ print $2","$3","$4 }'
"bbb","ccc","dddd"
"bbb","ccc","d
"bbb","ccc","dd
Expected result:
"bbb","ccc","dddd"
"bbb","ccc","d,dd,d"
"bbb","ccc","dd,d,d"
I would use a tool that parses CSV properly, such as xsv. With it, the command looks like this:
$ xsv select 2-4 testfile.csv
bbb,ccc,dddd
bbb,ccc,"d,dd,d"
bbb,ccc,"dd,d,d"
Or, if you really want every value quoted, add a second step:
$ xsv select 2-4 testfile.csv | xsv fmt --quote-always
"bbb","ccc","dddd"
"bbb","ccc","d,dd,d"
"bbb","ccc","dd,d,d"
Include the (escaped) quotes in the field separator, and add them back when printing the output fields:
cat testfile.csv | awk -F "\",\"" '{print "\""$2"\",\""$3"\",\""$4}'
Output:
"bbb","ccc","dddd"
"bbb","ccc","d,dd,d"
"bbb","ccc","dd,d,d"
If gawk (GNU awk) is available, you can use FPAT, which matches the fields themselves instead of splitting on a field separator.
awk -v FPAT='([^,]+)|("[^"]+")' -v OFS=, '{print $2, $3, $4}' testfile.csv
Result:
"bbb","ccc","dddd"
"bbb","ccc","d,dd,d"
"bbb","ccc","dd,d,d"
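The string ([^,]+)|("[^"]+") is a regular expression that matches either of the following:
([^,]+) ... matches a sequence of one or more characters other than a comma
("[^"]+") ... matches a string enclosed in double quotes (which may contain commas)
The parentheses around the two patterns are only for visual clarity; the regular expression works without them (as in FPAT='[^,]+|"[^"]+"'), because the alternation operator | has low precedence.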
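If gawk is not available, the same field pattern can probably be applied by hand with match() in a loop, which POSIX awk provides. This is only a sketch: the sample data is inlined, and the names data, f, n, and rest are made up for the example. Like the FPAT pattern, it skips empty fields and does not handle escaped quotes inside a field.

```shell
#!/bin/sh
# Sample rows from the question, inlined so the sketch is self-contained.
data='"aaa","bbb","ccc","dddd"
"aaa","bbb","ccc","d,dd,d"
"aaa","bbb","ccc","dd,d,d"'

result=$(printf '%s\n' "$data" | awk '
{
    split("", f)   # clear the field array left over from the previous row
    n = 0
    rest = $0
    # Repeatedly take the leftmost field: a quoted string or a run of non-commas.
    while (match(rest, /"[^"]+"|[^,]+/)) {
        f[++n] = substr(rest, RSTART, RLENGTH)
        rest = substr(rest, RSTART + RLENGTH)
    }
    print f[2] "," f[3] "," f[4]
}')

printf '%s\n' "$result"
```

This relies on match() returning the leftmost-longest match, so a field that starts with a double quote is taken up to its closing quote rather than being cut at the first comma.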