How to parse a CSV file in awk when quoted fields contain commas



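I have a large CSV file that I process with awk, with the field separator set to a comma. However, some fields are quoted and contain commas, which causes the following problem:

Original file: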

Downloads $  cat testfile.csv
"aaa","bbb","ccc","dddd"
"aaa","bbb","ccc","d,dd,d"
"aaa","bbb","ccc","dd,d,d"

This is what I am trying:

Downloads $  cat testfile.csv | awk -F "," '{ print $2","$3","$4 }'
"bbb","ccc","dddd"
"bbb","ccc","d
"bbb","ccc","dd

Expected result:

"bbb","ccc","dddd"
"bbb","ccc","d,dd,d"
"bbb","ccc","dd,d,d"

I would use a tool that parses CSV properly, such as xsv. With it, the command looks like this:

$ xsv select 2-4 testfile.csv 
bbb,ccc,dddd
bbb,ccc,"d,dd,d"
bbb,ccc,"dd,d,d"

Or, if you really want every value quoted, add a second step:

$ xsv select 2-4 testfile.csv | xsv fmt --quote-always
"bbb","ccc","dddd"
"bbb","ccc","d,dd,d"
"bbb","ccc","dd,d,d"

Include the (escaped) quotes in the field separator, and add them back around the printed output fields:

awk -F "\",\"" '{print "\""$2"\",\""$3"\",\""$4}' testfile.csv

Output:

"bbb","ccc","dddd"
"bbb","ccc","d,dd,d"
"bbb","ccc","dd,d,d"

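To see why this works, it can help to print the fields awk produces when the separator is the three-character string "," (quote, comma, quote). The sketch below is only an illustration; show_fields is a hypothetical helper name, not part of the answer above.

```shell
# Hypothetical helper: list the fields awk sees when FS is the string ","
show_fields() {
    printf '%s\n' "$1" | awk -F '","' '{
        # Print every field with its index so stray quotes are visible
        for (i = 1; i <= NF; i++)
            printf "%d: [%s]\n", i, $i
    }'
}

show_fields '"aaa","bbb","ccc","d,dd,d"'
```

This prints:

1: ["aaa]
2: [bbb]
3: [ccc]
4: [d,dd,d"]

The first field keeps its opening quote and the last field keeps its closing quote, which is why the print statement only has to add a leading quote. The approach assumes that every field is quoted and that no field itself contains the sequence ",".
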
If gawk (GNU awk) is available, you can use FPAT, which describes what the fields themselves look like instead of splitting on a field separator.

awk -v FPAT='([^,]+)|("[^"]+")' -v OFS=, '{print $2, $3, $4}' testfile.csv

Result:

"bbb","ccc","dddd"
"bbb","ccc","d,dd,d"
"bbb","ccc","dd,d,d"

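The string ([^,]+)|("[^"]+") is a regular expression that matches either of the following:

  • ([^,]+) ... matches a sequence of one or more characters other than a comma
  • ("[^"]+") ... matches a string enclosed in double quotes (which may contain commas)

The parentheses around the two alternatives are only for visual clarity; the regular expression works without them (as in FPAT='[^,]+|"[^"]+"'), because the alternation operator | has low precedence.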
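
The idea of matching fields rather than splitting on separators can be imitated in portable awk with match(), which may help when gawk is not installed. This is a rough sketch of the idea, not gawk's actual implementation; split_like_fpat is a hypothetical helper name, and the sketch assumes an awk whose match() returns the leftmost-longest match, as POSIX specifies.

```shell
# Hypothetical helper: emulate FPAT-style splitting with plain awk
split_like_fpat() {
    printf '%s\n' "$1" | awk '{
        rest = $0
        n = 0
        # Repeatedly take the leftmost-longest match of the field pattern
        while (match(rest, /[^,]+|"[^"]+"/)) {
            printf "%d: [%s]\n", ++n, substr(rest, RSTART, RLENGTH)
            # Continue scanning after the field that was just consumed
            rest = substr(rest, RSTART + RLENGTH)
        }
    }'
}

split_like_fpat '"aaa","bbb","ccc","d,dd,d"'
```

This prints:

1: ["aaa"]
2: ["bbb"]
3: ["ccc"]
4: ["d,dd,d"]

Like the FPAT pattern it imitates, the sketch skips empty fields and may mishandle escaped quotes inside a quoted field.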
