Bash: normalize a list column by duplicating the row for each item in the list



In Bash, I want to create a separate row for each item of a comma-separated list. For example, I want to turn this table:

TYPE  NAME  
Fruit  apple,strawberry
Vegetable  potato

Into this table:

TYPE  NAME  
Fruit  apple
Fruit  strawberry
Vegetable  potato

I tried this script:

#!/bin/bash
# define the name of the input file
input_file="plants.tsv"
# define the name of the output file
output_file="normalized_plants.tsv"
# define the index of the list column (counting from 1)
list_column=2
# create a new file with the headers for the output table
head -n 1 "$input_file" > "$output_file"
# read each line of the input file
tail -n +2 "$input_file" | while IFS=$'\t' read -r line; do
  # extract the values for the list column
  list_values=$(echo "$line" | awk -F$'\t' '{print $'"$list_column"'}' | tr ',' '\n')
  # iterate over each value in the list column
  echo "$line" | awk -F$'\t' -v OFS=$'\t' -v list_column="$list_column" -v list_values="$list_values" '
    NR == 1 { next } # skip the header row
    {
      split(list_values, values, "\n")
      for (i in values) {
        $list_column = values[i]
        print $0
      }
    }' >> "$output_file"
done

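But I get an empty output file. Do you know what is going wrong here, or is there perhaps a better way to achieve this? I am a beginner in Bash, so this may not be the best approach to normalization.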
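One likely cause, sketched under the assumption that plants.tsv is tab-separated as in the script: the loop starts a new awk process for every input line, so inside each process the line being handled is record number 1 and the rule NR == 1 { next } skips it. The header has already been dropped by tail -n +2, so that rule discards data rows instead. The snippet below reproduces the effect with a single made-up row; the variable names are only for illustration.

```shell
# Hypothetical single data row, tab-separated like plants.tsv.
line=$(printf 'Fruit\tapple,strawberry')

# Each awk process counts records from 1, so NR == 1 matches the only line it sees.
with_skip=$(printf '%s\n' "$line" | awk 'NR == 1 { next } { print }')

# Without that rule the row is passed through unchanged.
without_skip=$(printf '%s\n' "$line" | awk '{ print }')

printf 'with skip rule:    [%s]\n' "$with_skip"
printf 'without skip rule: [%s]\n' "$without_skip"
```

Removing the NR == 1 rule from the awk program inside the loop should therefore let rows through, although the single-awk approach is still simpler.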

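Don't use a shell read loop for this; see "Why is using a shell loop to process text considered bad practice?". A single awk script would run faster, be more portable, and be easier to write robustly. As it stands, you have two answers that use shell read loops, and both would fail if "type" contained a blank (one of them would also fail if the input contained any backslashes). Using any awk: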

$ cat tst.sh
#!/usr/bin/env bash
# define the name of the input file
input_file="plants.tsv"
# define the name of the output file
output_file="normalized_plants.tsv"
# define the index of the list column (counting from 1)
list_column=2
awk -v list_column="$list_column" '
    BEGIN { FS=OFS="\t" }
    {
        n = split($list_column,names,",")
        for ( i=1; i<=n; i++ ) {
            print $1, names[i]
        }
    }
' "$input_file" > "$output_file"

$ ./tst.sh

$ cat normalized_plants.tsv
TYPE    NAME
Fruit   apple
Fruit   strawberry
Vegetable       potato

I used for ( i=1; i<=n; i++ ) instead of for ( i in names ) to guarantee that the order of the names in the input is preserved in the output; see https://www.gnu.org/software/gawk/manual/gawk.html#Scanning-an-Array.
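If the table has more than two columns, the same indexed loop can overwrite the list column in place and print the whole record instead of $1, names[i]. This is only a sketch of that variation, not part of the script above: the function name normalize_column and the three-column sample data are made up for illustration.

```shell
# normalize_column COL: hypothetical helper that reads a tab-separated table
# on stdin and prints one row per comma-separated item found in column COL.
normalize_column() {
    awk -v col="$1" '
    BEGIN { FS = OFS = "\t" }
    {
        n = split($col, items, ",")
        if (n == 0) { print; next }      # empty cell: keep the row as-is
        for (i = 1; i <= n; i++) {       # indexed loop keeps the input order
            $col = items[i]              # overwrite only the list column
            print
        }
    }'
}

# Made-up sample with an extra third column.
result=$(printf 'TYPE\tNAME\tCOLOR\nFruit\tapple,strawberry\tred\nVegetable\tpotato\tbrown\n' | normalize_column 2)
printf '%s\n' "$result"
```

Every column other than the list column is copied unchanged onto each generated row.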

This answer just shows that your script can be condensed, using pure Bash, to:

#!/bin/bash
while read -r type names; do
  echo "$type"$'\t'"${names//,/$'\n'$type$'\t'}"
done < plants.tsv > normalized_plants.tsv

In general, the awk solution is preferred.
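One reason shows up as soon as a field contains a blank. With the default IFS, read -r type names splits on spaces as well as tabs, so a made-up row such as "Root vegetable<TAB>carrot,beet" is cut in the wrong place. A minimal sketch, assuming tab-separated input; setting IFS to a tab for read is one possible workaround and is not something the loop above does:

```shell
# Made-up row whose TYPE field contains a space.
row=$(printf 'Root vegetable\tcarrot,beet')
tab=$(printf '\t')

# Default IFS: read splits at the first blank, so type is cut short.
default_type=$(printf '%s\n' "$row" | { read -r type names; printf '%s' "$type"; })

# IFS set to a tab for this read only: the whole first field is kept.
tab_type=$(printf '%s\n' "$row" | { IFS="$tab" read -r type names; printf '%s' "$type"; })

printf 'default IFS: type=[%s]\n' "$default_type"
printf 'tab IFS:     type=[%s]\n' "$tab_type"
```

Tab still counts as IFS whitespace, so consecutive tabs (empty cells) would be collapsed by read, which is a further argument for letting awk do the field splitting.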

bash:

while read type name_list; do                    # Read the 2 fields into type and name_list.
    readarray -d, -t names <<< "$name_list,"     # Split name_list on commas and save the items in the names array.
    unset 'names[-1]'                            # Drop the last entry, which holds only the newline appended by <<<.
    for name in "${names[@]}"; do                # For each name, ...
        echo "$type $name"                       # ... print type and name.
    done
done < plants.tsv > output_plants.tsv            # Input and output file redirection.

awk version:

awk '{split($2, s, ","); for(i in s){print $1, s[i]}}' plants.tsv > output_plants.tsv

For variety, here is a simple string-processing solution using sed.

$: sed -E ':x ; s/^([^[:space:]]+)[[:space:]]+([^,]+),/\1\t\2\n\1\t/; t x' file
TYPE  NAME
Fruit   apple
Fruit   strawberry
Vegetable  potato

This works with the simple file given. Be sure to verify it against anything more complex.

echo '
TYPE  NAME  
Fruit  apple,strawberry,banana
Vegetable  potato' | 
mawk 'NR==!_ || $NF!~/,/ || gsub(",[^,]+", "\n"$!_ " &", $NF) + gsub(",",_)' 
TYPE  NAME  
Fruit apple
Fruit strawberry
Fruit banana
Vegetable  potato

If you want to be particular about the output spacing, then:

gawk 'NR==!_ ? OFS = substr($_, match($_, "[ \t]+"),RLENGTH) 
: $NF!~/,/ || gsub(",[^,]+", "\n" $!_ OFS "&", $NF) gsub(",",_)' 
TYPE  NAME  
Fruit  apple
Fruit  strawberry
Fruit  banana
Vegetable  potato
