Retrieving data between semicolons that are outside double quotes

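I have the following data in a CSV file. Some columns contain data; a column with no data still gets a semicolon. This cannot be changed. Here are four sample rows: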

first;;;;;
;;Second;;;;
;;;"Third;Fourth";;;
;;"Fifth;Sixth";;;

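I want to get the rows where the number of semicolons is not equal to 6. In addition, I only want to count semicolons that are outside double quotes. So the third row should not be returned, because it has exactly 6 semicolons outside the quotes. The fourth row should be returned, because its number of semicolons outside the double quotes is not 6.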

I am using the following code:

TARGETFILE=data.csv
variable=$(awk -F ';' 'NF != 7' <$TARGETFILE)

How can I get the rows where the number of semicolons outside double quotes is not 6?

If you have GNU awk, this one-liner should do the job:

awk 'BEGIN { FPAT = "\"[^\"]*\"|[^;]*" } NF != 7' file

Alternatively, you can use the following sed solution:

sed 'h; s/"[^"]*"//g; s/[^;]//g; /^;;;;;;$/d; x' file

Using any awk:

$ awk '{x=$0; gsub(/"[^"]*"/,"",x)} gsub(/;/,"",x) != 6' file
first;;;;;
;;"Fifth;Sixth";;;

or

$ awk -F';' '{x=$0; gsub(/"[^"]*"/,"")} NF != 7{print x}' file
first;;;;;
;;"Fifth;Sixth";;;
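For readers who prefer a script, here is a minimal Python sketch of the same idea as the two awk commands above: delete every double-quoted section, then count the semicolons that remain. The function names are hypothetical, and it assumes quoted fields never span multiple lines.

```python
import re

# One double-quoted section; the same pattern the awk commands remove.
QUOTED = re.compile(r'"[^"]*"')

def semicolons_outside_quotes(line: str) -> int:
    """Count semicolons left after removing every quoted section."""
    return QUOTED.sub("", line).count(";")

def rows_without_six(lines):
    """Return the lines whose unquoted semicolon count is not 6."""
    return [ln for ln in lines if semicolons_outside_quotes(ln) != 6]

if __name__ == "__main__":
    sample = [
        'first;;;;;',
        ';;Second;;;;',
        ';;;"Third;Fourth";;;',
        ';;"Fifth;Sixth";;;',
    ]
    for row in rows_without_six(sample):
        print(row)
```

Run on the four sample rows from the question, this should print the first and fourth rows, matching the awk output above.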

If you only want the rows that have exactly six delimiter (non-data) semicolons, grep can handle that.

$: cat tst
;;;;;;
bad;;;;;
good;;;;;;
;;;;;;;bad
"is;ok";;;;;;
;;good;;;;
;;;;;;"is;ok"
;this;"not;ok";;;
;;;;;;fine
"not;ok";;;
;;"nope;again";;;
;;;;;;;;;"not;ok"
$: grep -En '^(("[^"]*"|[^;"]*)*;("[^"]*"|[^;"]*)*){6}$' tst
1:;;;;;;
3:good;;;;;;
5:"is;ok";;;;;;
6:;;good;;;;
7:;;;;;;"is;ok"
9:;;;;;;fine
$: grep -Env '^(("[^"]*"|[^;"]*)*;("[^"]*"|[^;"]*)*){6}$' tst
2:bad;;;;;
4:;;;;;;;bad
8:;this;"not;ok";;;
10:"not;ok";;;
11:;;"nope;again";;;
12:;;;;;;;;;"not;ok"

It even gives you the line numbers.

Borrowing Paul's example:

echo '
;;;;;;
bad;;;;;
good;;;;;;
;;;;;;;bad
"is;ok";;;;;;
;;good;;;;
;;;;;;"is;ok"
;this;"not;ok";;;
;;;;;;fine
"not;ok";;;
;;"nope;again";;;
;;;;;;;;;"not;ok"' | gcat -n |

awk -F';?"[^"]*";?|;' NF==7

1  ;;;;;;
3  good;;;;;;
5  "is;ok";;;;;;
6  ;;good;;;;
7  ;;;;;;"is;ok"
9  ;;;;;;fine

But for the original test sample, it has to be modified slightly

(NF-7 has the same effect as 'NF != 7' without needing shell quoting)

echo '
first;;;;;
;;Second;;;;
;;;"Third;Fourth";;;
;;"Fifth;Sixth";;;' |

awk -F';?"[^"]*"|;' NF-7

first;;;;;
;;"Fifth;Sixth";;;

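CSV is a much more complicated format than it first appears. For example, I believe the way to include a double quote inside a quoted string is to write two double quotes: "". I doubt the solutions above handle that, but I don't have the energy to analyze them right now. I would suggest that getting this right is hard enough that you really need a dedicated program that handles all the edge cases.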
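As one illustration of the "dedicated program" route, here is a short sketch using Python's standard csv module, which treats a doubled double quote inside a quoted field as a literal quote. The helper name field_count is hypothetical, and the sketch assumes each record sits on a single line.

```python
import csv

def field_count(line: str, sep: str = ";") -> int:
    """Parse one line as a CSV record and return its number of fields."""
    rows = list(csv.reader([line], delimiter=sep, quotechar='"'))
    return len(rows[0]) if rows else 0

if __name__ == "__main__":
    # A quoted field containing both a separator and doubled quotes.
    tricky = ';;"say ""hi""; ok";;;;'
    print(next(csv.reader([tricky], delimiter=";")))
    print(field_count(tricky))
```

A row with six separators outside quotes parses to seven fields, so filtering on field_count(line) != 7 would correspond to the test asked for in the question.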

Reusing Paul's file:

cat file
;;;;;;
bad;;;;;
good;;;;;;
;;;;;;;bad
"is;ok";;;;;;
;;good;;;;
;;;;;;"is;ok"
;this;"not;ok";;;
;;;;;;fine
"not;ok";;;
;;"nope;again";;;
;;;;;;;;;"not;ok"

You can use Ruby to count the fields:

ruby -r csv -e '$<.each{|line| 
len=CSV.parse(line, col_sep:";").flatten.length
puts "#{sprintf("%2s",$.)}: \"#{line.chomp}\" => #{len} fields" 
}' file

Prints:

1: ";;;;;;" => 7 fields
2: "bad;;;;;" => 6 fields
3: "good;;;;;;" => 7 fields
4: ";;;;;;;bad" => 8 fields
5: ""is;ok";;;;;;" => 7 fields
6: ";;good;;;;" => 7 fields
7: ";;;;;;"is;ok"" => 7 fields
8: ";this;"not;ok";;;" => 6 fields
9: ";;;;;;fine" => 7 fields
10: ""not;ok";;;" => 4 fields
11: ";;"nope;again";;;" => 6 fields
12: ";;;;;;;;;"not;ok"" => 10 fields

If you want to filter for the lines that have 7 fields:

ruby -r csv -e '$<.each{|line| 
len=CSV.parse(line, col_sep:";").flatten.length
if len==7 then puts line end
}' file

Prints:

;;;;;;
good;;;;;;
"is;ok";;;;;;
;;good;;;;
;;;;;;"is;ok"
;;;;;;fine

Note: the number of field separators is one less than the number of data fields:

1;2;3;4;"five; with sep";6 # six fields, five field separators...
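The off-by-one relationship can be checked with a small Python sketch on the data part of that example line. It counts fields with the standard csv module and counts separators by removing the quoted section first; both approaches are illustrative assumptions, not part of the Ruby answer.

```python
import csv
import re

line = '1;2;3;4;"five; with sep";6'

# Fields as a CSV parser sees them (the quoted semicolon is data).
fields = next(csv.reader([line], delimiter=";"))

# Separators: semicolons left after dropping quoted sections.
separators = re.sub(r'"[^"]*"', "", line).count(";")

print(len(fields), separators)
```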
