Parsing CSV text that contains commas

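I'm trying to write some RHEL security-hardening automation scripts, and I have a CSV file whose contents I want to turn into something readable. Here is what I have so far: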

#!/bin/bash
# loop through the file
while read line; do
    # get all of the content
    vulnid=`echo $line | cut -d',' -f1`
    ruleid=`echo $line | cut -d',' -f2`
    stigid=`echo $line | cut -d',' -f3`
    title=`echo $line | cut -d',' -f4`
    discussion=`echo $line | cut -d',' -f5`
    check=`echo $line | cut -d',' -f6`
    fix=`echo $line | cut -d',' -f7`
    # Format the content
    echo "########################################################"
    echo "# Vulnerability ID: $vulnid"
    echo "# Rule ID: $ruleid"
    echo "# STIG ID: $stigid"
    echo "#"
    echo "# Rule: $title"
    echo "#"
    echo "# Discussion:"
    echo "# $discussion"
    echo "# Check:"
    echo "# $check"
    echo "# Fix:"
    echo "# $fix"
    echo "########################################################"
    echo "# Start Check"
    echo
    echo "# Start Remediation"
    echo
    echo "########################################################"
done < STIG.csv

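The problem I'm running into is that the text in the CSV contains commas. This is actually perfectly fine and in accordance with the IETF standard (https://www.rfc-editor.org/rfc/rfc4180#page-2, Section 2.4). However, as you can imagine, the cut command doesn't look ahead to see whether there is a trailing space after the comma (as you would normally have in English). This is causing all of my fields to get messed up, and I can't figure out how to get it all working right.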
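To make the breakage concrete, here is a minimal sketch using a made-up record (the sample string is an assumption for illustration, not a line from STIG.csv). The fourth field is quoted and contains a comma, and cut splits it in two:

```shell
#!/bin/bash
# Hypothetical sample record; the fourth field is quoted and contains a comma.
sample='V-1,SV-1,RHEL-1,"A title, with a comma",discussion'

# cut treats every comma as a delimiter, including the one inside the quotes.
field4=$(printf '%s\n' "$sample" | cut -d',' -f4)
field5=$(printf '%s\n' "$sample" | cut -d',' -f5)

printf 'field 4: %s\n' "$field4"   # only the first half of the title
printf 'field 5: %s\n' "$field5"   # the second half of the title, not the discussion
```

Every field after the embedded comma is shifted by one, which is why the later fields come out scrambled.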

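Now, I have a feeling there is some magical regex I could use, like ',![:blank:]', but I'll be damned if I know how to apply it. I'm used to cut just because it's quick and dirty, but perhaps someone has a better suggestion using awk or sed. This is mainly to generate the bulk structure of my program, which repeats itself and is a ton of comments.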

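One more note: this has to run on a clean install of RHEL 6. I would happily write this in Ruby, Python, or whatever. However, most of those are extra packages that have to be installed. The environment this script will be deployed in is one where the machines have no internet access and no extra packages. Python 2.6 is on CentOS 6 by default, but I'm not sure about RHEL 6. Otherwise, trust me, I'd be writing this whole thing in Ruby.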

Here is a sample of the CSV:

V-38447,SV-50247r1_rule,RHEL-06-000519,The system package management tool must verify contents of all files associated with packages.,The hash on important files like system executables should match the information given by the RPM database. Executables with erroneous hashes could be a sign of nefarious activity on the system.,"The following command will list which files on the system have file hashes different from what is expected by the RPM database. # rpm -Va | grep '$1 ~ /..5/ && $2 != 'c''If there is output, this is a finding.","The RPM package management system can check the hashes of installed software packages, including many that are important to system security. Run the following command to list which files on the system have hashes that differ from what is expected by the RPM database: # rpm -Va | grep '^..5'A 'c' in the second column indicates that a file is a configuration file, which may appropriately be expected to change. If the file that has changed was not expected to then refresh from distribution media or online repositories. rpm -Uvh [affected_package]OR yum reinstall [affected_package]"

Also, in case anyone is curious, the whole project is published on GitHub.

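All the comments on your question are good. Bash has no built-in CSV support, so if you don't want to use a language like Python, Ruby, Erlang, or even Perl, you'll have to roll your own.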

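Note that while awk can use a comma as its field separator, it doesn't properly support CSV with commas embedded in quoted fields either. You can hack together a solution with a pattern, as Håkon suggested.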

But you don't need to do this in awk; you can do it in bash alone and avoid calling external tools. How about something like this?

#!/bin/bash
# Pop the first field off $line into $value (quotes are kept in $value).
nextfield () {
    case "$line" in
        \"*\",*)
            # Quoted field followed by more fields: split at the first `",`.
            value="${line%%\",*}\""
            line="${line#*\",}"
            ;;
        \"*)
            # Quoted field that is the last one on the line.
            value="$line"
            line=""
            ;;
        *)
            # Unquoted field: split at the first comma.
            value="${line%%,*}"
            line="${line#*,}"
            ;;
    esac
}
# loop through the file
while IFS= read -r line; do
    # get the content
    nextfield; vulnid="$value"
    nextfield; ruleid="$value"
    nextfield; stigid="$value"
    nextfield; title="$value"
    nextfield; discussion="$value"
    nextfield; check="$value"
    nextfield; fix="$value"
    # format the content
    printf "########################################################\n"
    printf "# Vulnerability ID: %s\n" "$vulnid"
    printf "# Rule ID: %s\n# STIG ID: %s\n#\n" "$ruleid" "$stigid"
    printf "# Rule: %s\n" "$title"
    printf "#\n# Discussion:\n"
    fmt -w68 <<<"$discussion" | sed 's/^/#   /'
    printf "# Check:\n"
    fmt -w68 <<<"$check" | sed 's/^/#   /'
    printf "# Fix:\n"
    fmt -w68 <<<"$fix" | sed 's/^/#   /'
    printf "########################################################\n"
    printf "# Start Check\n\n"
    printf "# Start Remediation\n\n"
    printf "########################################################\n"
done < STIG.csv

If you're doing this a lot, the speed advantage will be significant.

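Note the improved formatting courtesy of fmt. This somewhat kills the speed advantage of avoiding calls to external programs, but it does make your output easier to read. :)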
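If the calls to fmt and sed turn out to matter, the wrapping can in principle be done in bash itself. The following is only a sketch under stated assumptions: wrap_comment is a hypothetical helper name, it assumes the field is a single line of text, and it fills lines greedily, so its line breaks will not always match what fmt produces.

```shell
#!/bin/bash
# wrap_comment TEXT [WIDTH]: hypothetical helper, not part of the answer above.
# Prints TEXT as "#   "-prefixed lines, breaking between words so that the
# text portion of each line stays within WIDTH characters where possible.
# A single word longer than WIDTH is printed on its own line, unbroken.
wrap_comment () {
    local text=$1 width=${2:-68} out="" word
    local -a words
    # Split on whitespace using only shell builtins (no external program).
    read -r -a words <<<"$text"
    for word in "${words[@]}"; do
        if [[ -z $out ]]; then
            out=$word
        elif (( ${#out} + 1 + ${#word} <= width )); then
            out+=" $word"
        else
            printf '#   %s\n' "$out"
            out=$word
        fi
    done
    [[ -n $out ]] && printf '#   %s\n' "$out"
    return 0
}
```

With this in place, a line such as `fmt -w68 <<<"$discussion" | sed 's/^/#   /'` could be swapped for `wrap_comment "$discussion"`.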

With GNU Awk version 4, you can try:

gawk -f a.awk STIG.csv

where a.awk is:

BEGIN {
    # A field is either a run of non-commas or a double-quoted string.
    FPAT = "([^,]*)|(\"[^\"]+\")"
}
{
    for (i = 1; i <= NF; i++)
        print "$" i "=|" $i "|"
    print "# Rule: " $4
}

Output:

$ cat STIG.csv
vulnid,ruleid,stigid,"This is a title, hello","A discussion, ,,",check,fix
$ gawk -f a.awk STIG.csv
$1=|vulnid|
$2=|ruleid|
$3=|stigid|
$4=|"This is a title, hello"|
$5=|"A discussion, ,,"|
$6=|check|
$7=|fix|
# Rule: "This is a title, hello"
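As the output shows, quoted fields keep their surrounding quotes. If the script wants the bare text, something along these lines could strip them in bash. This is a sketch only: unquote is a hypothetical helper name, and the handling of doubled quotes follows the escaping rule in RFC 4180 rather than anything confirmed about the STIG file.

```shell
#!/bin/bash
# unquote FIELD: hypothetical helper. Removes one pair of enclosing double
# quotes, if present, and collapses doubled quotes ("") into a single quote.
# Fields that are not fully enclosed in quotes are printed unchanged.
unquote () {
    local field=$1 q='"'
    if [[ $field == "$q"*"$q" ]]; then
        field=${field#"$q"}        # drop the leading quote
        field=${field%"$q"}        # drop the trailing quote
        field=${field//"$q$q"/$q}  # "" inside a quoted field means "
    fi
    printf '%s\n' "$field"
}
```

For example, `title=$(unquote "$title")` would turn `"This is a title, hello"` into `This is a title, hello`.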

+1 to John Y's comment. Here is a Ruby example:

ruby -rcsv -e 'CSV.foreach("STIG.csv") do |row|
  (vulnid, ruleid, stigid, title, disc, check, fix) = row
  puts "#" * 40
  puts "# Vulnerability ID: #{vulnid}"
  puts "# Rule ID: #{ruleid}"
  puts "# STIG ID: #{stigid}"
  puts "#"
  puts "# Discussion:"
  puts "# #{disc}"
  puts "# Check:"
  puts "# #{check}"
  puts "# Fix:"
  puts "# #{fix}"
  puts "#" * 40
end'

If you want to wrap the long lines, do something like this:

puts fix.gsub(/(.{1,78})(?:\s+|\Z)/) {|s| "# " + s + "\n"}

The biggest problem is that fields may contain newlines. In that spirit, the advice to use a language with CSV support is the best solution.
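For anyone stuck with bash anyway, embedded newlines can at least be detected by counting quotes: a physical line with an odd number of double quotes ends inside an open quoted field. The sketch below joins such lines into one logical record. It is an illustration under assumptions, not a full CSV parser: read_record and the global variable record are hypothetical names, and the input is assumed to be well formed.

```shell
#!/bin/bash
# read_record: hypothetical helper. Reads one logical CSV record from stdin
# into the global variable "record", joining physical lines for as long as
# the number of double quotes seen so far is odd (a quoted field is open).
# Returns 1 when there is no more input.
read_record () {
    local line quotes q='"'
    IFS= read -r record || [[ -n $record ]] || return 1
    quotes=${record//[^$q]/}          # keep only the quote characters
    while (( ${#quotes} % 2 == 1 )); do
        # Odd count: pull in the next physical line, or stop at end of input.
        IFS= read -r line || [[ -n $line ]] || break
        record+=$'\n'$line
        quotes=${record//[^$q]/}
    done
    return 0
}
```

A loop of the form `while read_record; do ...; done < STIG.csv` would then see one record per iteration, with any embedded newlines still inside `$record`.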

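However, if your only problem is the commas (and you know there won't be any newlines in the fields), you can solve it easily in bash by temporarily replacing the comma-space sequence with an otherwise unused combination of characters of your choice, and replacing it back before output: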

#!/bin/bash
while IFS=',' read -r vulnid ruleid stigid title discussion check fix; do
    echo "# Vulnerability ID: $vulnid"
    ...
    echo "# Discussion:"
    echo "# $discussion"
    ...
done <<<"$(sed 's/, /COMMASPACE/g' <STIG.csv)" | sed 's/COMMASPACE/, /g'
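The trick only holds if the placeholder really is unused in the data, so it may be worth checking before relying on it. A minimal sketch, where placeholder_is_free is a hypothetical helper name:

```shell
#!/bin/bash
# placeholder_is_free FILE TOKEN: hypothetical helper. Succeeds only when
# grep reports "no match" (exit status 1) for the literal TOKEN in FILE.
# A grep error, such as a missing file, counts as failure.
placeholder_is_free () {
    grep -qF -- "$2" "$1"
    [ $? -eq 1 ]
}
```

It could be called as `placeholder_is_free STIG.csv COMMASPACE || exit 1` ahead of the loop.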

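Below is an improved version of my answer to "Count number of column in a pipe delimited file", tailored to this particular problem. A real CSV parser implementation would be best, but the awk hack below works as long as no field is split across multiple lines, which can happen when a field starts with a quote and runs on until a closing quote that is not on the same line. It also assumes the file it receives is already well formed. Its only problem is that it outputs OFS after the last field. In your particular case, that shouldn't be an issue.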

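Just add the following before the while loop above and change the value of OFS as needed, making sure to change the delimiter given to cut to match. OFS defaults to |, but you can override it with awk's -v option, as shown: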

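outfile="$(mktemp 2>/dev/null || printf '%s' "/tmp/STIG.$$")"
outdelim='|'
# The quoted 'EOF' keeps the shell from expanding $i inside the awk program;
# -f /dev/stdin makes awk read its program from the here-document.
awk -F',' -v OFS="$outdelim" -f /dev/stdin STIG.csv >"$outfile" <<'EOF'
# WARNING: outputs OFS after the last field, meaning an empty field is at the end.
BEGIN { if (OFS == "") OFS = "|" }
{
    for (i = 1; i <= NF; i++) {
        # A piece that opens a quote without closing it: glue the following
        # pieces back together with the original comma.
        if ($i ~ /^".*[^"]$/)
            for (; i <= NF && ($i !~ /.*"$/); i++) {
                printf("%s%s", $i, FS);
            }
        printf("%s%s", $i, OFS);
    }
    printf("\n");  # end the record so the loop below reads one per line
}
EOF
# loop through the file
while IFS= read -r line; do
    # get all of the content
    vulnid="$(echo "$line" | cut -d"$outdelim" -f1)"
    .
    .
    .
done < "$outfile"
rm -f "$outfile"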
