如何使用外壳脚本外联接两个CSV文件?



我有两个CSV文件,如下所示:

文件1.csv

label,"Part-A"
"ABC mn","2.0"
"XYZ","3.0"
"PQR SN","6"

文件2.csv

label,"Part-B"
"XYZ","4.0"
"LMN Wv","8"
"PQR SN","6"
"EFG","1.0"

期望的输出.csv

label,"Part-A","Part-B"
"ABC mn","2.0",NA
"EFG",NA,"1.0"
"LMN Wv",NA,"8"
"PQR SN","6","6"
"XYZ","3.0","4.0"

目前,使用以下awk命令,我可以组合匹配的命令,这些匹配的命令在PQR和XYZ等两个文件中都有标签条目,但无法附加两个文件中不存在标签值的条目:

awk -F, 'NR==FNR{a[$1]=substr($0,length($1)+2);next} ($1 in a){print $0","a[$1]}' file1.csv file2.csv

此解决方案使用任何 AWK 准确打印所需的结果。 请注意,排序算法取自 mawk 手册。

# SO71053039.awk
#-------------------------------------------------
# insertion sort of A[1..n]
function isort( A,A_SWAP,           n,i,j,hold ) {
n = 0
for (j in A)
A_SWAP[++n] = j
for( i = 2 ; i <= n ; i++)
{
hold = A_SWAP[j = i]
while ( A_SWAP[j-1] "" > "" hold )
{ j-- ; A_SWAP[j+1] = A_SWAP[j] }
A_SWAP[j] = hold
}
# sentinel A_SWAP[0] = "" will be created if needed
return n
}
BEGIN {
FS = OFS = ","
out = "Output.csv"
# read file 1
fnr = 0
while ((getline < ARGV[1]) > 0) {
++fnr
if (fnr == 1) {
for (i=1; i<=NF; i++)
FIELDBYNAME1[$i] = i # e.g. FIELDBYNAME1["label"] = 1
}
else {
LABEL_KEY[$FIELDBYNAME1["label"]]
LABEL_KEY1[$FIELDBYNAME1["label"]] = $FIELDBYNAME1[""Part-A""]
}
}
close(ARGV[1])
# read file2
fnr = 0
while ((getline < ARGV[2]) > 0) {
++fnr
if (fnr == 1) {
for (i=1; i<=NF; i++)
FIELDBYNAME2[$i] = i # e.g. FIELDBYNAME2["label"] = 1
}
else {
LABEL_KEY[$FIELDBYNAME2["label"]]
LABEL_KEY2[$FIELDBYNAME2["label"]] = $FIELDBYNAME2[""Part-B""]
}
}
close(ARGV[2])
# print the header
print "label" OFS ""Part-A"" OFS ""Part-B"" > out
# get the result
z = isort(LABEL_KEY, LABEL_KEY_SWAP)
for (i = 1; i <= z; i++) {
result_string = sprintf("%s", LABEL_KEY_SWAP[i])
if (LABEL_KEY_SWAP[i] in LABEL_KEY1)
result_string = sprintf("%s", result_string OFS LABEL_KEY1[LABEL_KEY_SWAP[i]] OFS (LABEL_KEY_SWAP[i] in LABEL_KEY2 ? LABEL_KEY2[LABEL_KEY_SWAP[i]] : "NA"))
else
result_string = sprintf("%s", result_string OFS "NA" OFS LABEL_KEY2[LABEL_KEY_SWAP[i]])
print result_string > out
}
}

叫:

awk -f SO71053039.awk file1.csv file2.csv
=> result file Output.csv with content:
label,"Part-A","Part-B"
"ABC mn","2.0",NA
"EFG",NA,"1.0"
"LMN Wv",NA,"8"
"PQR SN","6","6"
"XYZ","3.0","4.0"

我想把米勒介绍给你。它是一种工具,可以使用几种文件格式做一些事情,并且可以作为独立的二进制文件使用。您只需要下载存档,将mlr可执行文件放在某个地方(最好在您的 PATH 中),即可完成安装。

mlr --csv 
join -f file1.csv -j 'label' --ul --ur 
then 
unsparsify --fill-with 'NA' 
then 
sort -f 'label' 
file2.csv

命令部分:

  • mlr --csv
    表示您要读取 CSV 文件并输出 CSV 格式。作为另一个示例,如果要读取CSV文件并输出JSON格式,则mlr --icsv --ojson
  • join -f file1.csv -j 'label' --ul --ur...
    file2.csv表示在字段label上联接file1.csvfile2.csv,并发出两个文件的不匹配记录
  • then是米勒链接操作的方式
  • unsparsify --fill-with 'NA'
    意味着创建每个文件中不存在的字段并用NA填充它们。具有 uniq 标签的记录需要它

  • thensort -f 'label'表示对字段上的记录进行排序label

关于更新的问题:mlr自行处理CSV引用。与新的预期输出的唯一区别是它删除了多余的引号:

label,Part-A,Part-B
ABC mn,2.0,NA
EFG,NA,1.0
LMN Wv,NA,8
PQR SN,6,6
XYZ,3.0,4.0
awk -v OFS=, '{
if(!o1[$1]) { o1[$1]=$NF; o2[$1]="NA" } else { o2[$1]=$NF }
} 
END{
for(v in o1) { print v, o1[v], o2[v] }
}' file{1,2}
## output
LMN,8,NA
ABC,2,NA
PQR,6,6
EFG,1,NA
XYZ,3,4

我认为这会做得很好。

我们建议gawk标准的Linux脚本awk

脚本.awk

NR == FNR {
valsStr = sprintf("%s,%s", $2, "na");
rowsArr[$1] = valsStr;
}
NR != FNR && $1 in rowsArr {
split(rowsArr[$1],valsArr);
valsStr = sprintf("%s,%s", valsArr[1], $2);
rowsArr[$1] = valsStr;
next;
}
NR != FNR {
valsStr = sprintf("%s,%s", "na", $2);
rowsArr[$1] = valsStr;
}
END {
printf("%s,%sn", "label", rowsArr["label"]);
for (rowName in rowsArr) {
if (rowName == "label") continue;
printf("%s,%sn", rowName, rowsArr[rowName]);
}
}

输出:

awk -F, -f script.awk input.{1,2}.txt
label,Part-A,Part-B
LMN,na,8
ABC,2,na
PQR,6,6
EFG,na,1
XYZ,3,4

由于您的问题标题为"如何做......在 shell 脚本中?"不一定使用awk,我将推荐 GoCSV,这是一个命令行工具,其中包含几个用于处理 CSV(分隔文件)的子命令。

它没有一个命令可以完成您需要的内容,但您可以编写许多命令以获得正确的结果。

此解决方案的核心是join命令,它可以执行内部(默认)、左、右和外部连接;您希望外部连接保留非重叠元素:

gocsv join -c 'label' -outer file1.csv file2.csv > joined.csv
echo 'Joined'
gocsv view joined.csv
Joined
+-------+--------+-------+--------+
| label | Part-A | label | Part-B |
+-------+--------+-------+--------+
| ABC   | 2      |       |        |
+-------+--------+-------+--------+
| XYZ   | 3      | XYZ   | 4      |
+-------+--------+-------+--------+
| PQR   | 6      | PQR   | 6      |
+-------+--------+-------+--------+
|       |        | LMN   | 8      |
+-------+--------+-------+--------+
|       |        | EFG   | 1      |
+-------+--------+-------+--------+

数据部分是正确的,但需要一些工作才能使列正确,并获取其中的NA值。

下面是一个完整的管道:

gocsv join -c 'label' -outer file1.csv file2.csv 
| gocsv rename -c 1 -names 'Label_A' 
| gocsv rename -c 3 -names 'Label_B' 
| gocsv add -name 'label' -t '{{ list .Label_A .Label_B | compact | first }}' 
| gocsv select -c 'label','Part-A','Part-B' 
| gocsv replace -c 'Part-A','Part-B' -regex '^$' -repl 'NA' 
| gocsv sort -c 'label' 
> final.csv
echo 'Final'
gocsv view final.csv

这为我们提供了正确的最终文件:

Final pipeline
+-------+--------+--------+
| label | Part-A | Part-B |
+-------+--------+--------+
| ABC   | 2      | NA     |
+-------+--------+--------+
| EFG   | NA     | 1      |
+-------+--------+--------+
| LMN   | NA     | 8      |
+-------+--------+--------+
| PQR   | 6      | 6      |
+-------+--------+--------+
| XYZ   | 3      | 4      |
+-------+--------+--------+

该管道中发生了很多事情,重点是:

合并两个标签字段

| gocsv rename -c 1 -names 'Label_A' 
| gocsv rename -c 3 -names 'Label_B' 
| gocsv add -name 'label' -t '{{ list .Label_A .Label_B | compact | first }}' 

仅按顺序排列到所需的 3 列

| gocsv select -c 'label','Part-A','Part-B' 

添加 NA 值并按标签排序

| gocsv replace -c 'Part-A','Part-B' -regex '^$' -repl 'NA' 
| gocsv sort -c 'label' 

我已经在这个要点中做了一个分步的解释。

你提到加入我另一个答案的评论,我忘记了这个实用程序:

#!/bin/sh
rm -f *sorted.csv
# Join two files, normally inner-join only, but
# -  `-a 1 -a 2`:    include "unpaired lines" from file 1 and file 2
# -  `-1 1 -2 1`:    the first column from each is the "join column"
# -  `-o 0,1.2,2.2`: output the "join column" (0) and the second fields from files 1 and 2
join -a 1 -a 2 -1 1 -2 1 -o '0,1.2,2.2' -t, file1.csv file2.csv > joined.csv 
# Add NA values
cat joined.csv | sed 's/,,/,NA,/' | sed 's/,$/,NA/' > unsorted.csv
# Sort, pull out header first
head -n 1 unsorted.csv > sorted.csv
# Then sort remainder
tail -n +2 unsorted.csv | sort -t, -k 1 >> sorted.csv

而且,这是排序的.csv

+--------+--------+--------+
| label  | Part-A | Part-B |
+--------+--------+--------+
| ABC mn | 2.0    | NA     |
+--------+--------+--------+
| EFG    | NA     | 1.0    |
+--------+--------+--------+
| LMN Wv | NA     | 8      |
+--------+--------+--------+
| PQR SN | 6      | 6      |
+--------+--------+--------+
| XYZ    | 3.0    | 4.0    |
+--------+--------+--------+

正如@Fravadona在他的评论中正确指出的那样,对于可以包含分隔符、字段中换行符或双引号的 CSV 文件,需要适当的 CSV 解析器

实际上,只需要两个函数:一个用于将CSV字段取消引用到普通AWK字段,另一个用于引用AWK字段以将数据写回CSV字段。

我已经写了一个我之前的答案(https://stackoverflow.com/a/71056926/18135892)的变体,它使用Ed Morton的CSV解析器(https://stackoverflow.com/a/45420607/18135892 与适用于任何AWK版本的gsub变体)给出一个正确的CSV解析示例:

此解决方案打印使用任何AWK正确排序的所需结果。 请注意,排序算法取自 mawk 手册。

# SO71053039_2.awk
# unquote CSV:
# Ed Morton's CSV parser: https://stackoverflow.com/a/45420607/18135892
function buildRec(      fpat,fldNr,fldStr,done) {
CurrRec = CurrRec $0
if ( gsub(/"/,"&",CurrRec) % 2 ) {
# The string built so far in CurrRec has an odd number
# of "s and so is not yet a complete record.
CurrRec = CurrRec RS
done = 0
}
else {
# If CurrRec ended with a null field we would exit the
# loop below before handling it so ensure that cannot happen.
# We use a regexp comparison using a bracket expression here
# and in fpat so it will work even if FS is a regexp metachar
# or a multi-char string like "\\" for -separated fields.
CurrRec = CurrRec ( CurrRec ~ ("[" FS "]$") ? """" : "" )
$0 = ""
fpat = "([^" FS "]*)|("([^"]|"")+")"
while ( (CurrRec != "") && match(CurrRec,fpat) ) {
fldStr = substr(CurrRec,RSTART,RLENGTH)
# Convert <"foo"> to <foo> and <"foo""bar"> to <foo"bar>
if ( sub(/^"/,"",fldStr) && sub(/"$/,"",fldStr) ) {
gsub(/""/, """, fldStr)
}
$(++fldNr) = fldStr
CurrRec = substr(CurrRec,RSTART+RLENGTH+1)
}
CurrRec = ""
done = 1
}
return done
}
# quote CSV:
# Quote according to https://datatracker.ietf.org/doc/html/rfc4180 rules
function csvQuote(field, sep) {
if ((field ~ sep) || (field ~ /["rn]/)) {
gsub(/"/, """", field)
field = """ field """
}
return field
}
#-------------------------------------------------
# insertion sort of A[1..n]
function isort( A,A_SWAP,           n,i,j,hold ) {
n = 0
for (j in A)
A_SWAP[++n] = j
for( i = 2 ; i <= n ; i++)
{
hold = A_SWAP[j = i]
while ( A_SWAP[j-1] "" > "" hold )
{ j-- ; A_SWAP[j+1] = A_SWAP[j] }
A_SWAP[j] = hold
}
# sentinel A_SWAP[0] = "" will be created if needed
return n
}
BEGIN {
FS = OFS = ","
# read file 1
fnr = 0
while ((getline < ARGV[1]) > 0) {
if (! buildRec())
continue
++fnr
if (fnr == 1) {
for (i=1; i<=NF; i++)
FIELDBYNAME1[$i] = i # e.g. FIELDBYNAME1["label"] = 1
}
else {
LABEL_KEY[$FIELDBYNAME1["label"]]
LABEL_KEY1[$FIELDBYNAME1["label"]] = $FIELDBYNAME1["Part-A"]
}
}
close(ARGV[1])
# read file2
fnr = 0
while ((getline < ARGV[2]) > 0) {
if (! buildRec())
continue
++fnr
if (fnr == 1) {
for (i=1; i<=NF; i++)
FIELDBYNAME2[$i] = i # e.g. FIELDBYNAME2["label"] = 1
}
else {
LABEL_KEY[$FIELDBYNAME2["label"]]
LABEL_KEY2[$FIELDBYNAME2["label"]] = $FIELDBYNAME2["Part-B"]
}
}
close(ARGV[2])
# print the header
print "label" OFS "Part-A" OFS "Part-B"
# get the result
z = isort(LABEL_KEY, LABEL_KEY_SWAP)
for (i = 1; i <= z; i++) {
result_string = sprintf("%s", csvQuote(LABEL_KEY_SWAP[i], OFS))
if (LABEL_KEY_SWAP[i] in LABEL_KEY1)
result_string = sprintf("%s", result_string OFS csvQuote(LABEL_KEY1[LABEL_KEY_SWAP[i]], OFS) OFS (LABEL_KEY_SWAP[i] in LABEL_KEY2 ? csvQuote(LABEL_KEY2[LABEL_KEY_SWAP[i]], OFS) : "NA"))
else
result_string = sprintf("%s", result_string OFS "NA" OFS csvQuote(LABEL_KEY2[LABEL_KEY_SWAP[i]], OFS))
print result_string
}
}

叫:

awk -f SO71053039_2.awk file1.csv file2.csv
=> result (superfluous quotes according to CSV rules are omitted):
label,Part-A,Part-B
ABC mn,2.0,NA
EFG,NA,1.0
LMN Wv,NA,8
PQR SN,6,6
XYZ,3.0,4.0

最新更新