In Bash, I want to create a separate row for each item in a comma-separated column. For example, I want to turn this table:
TYPE NAME
Fruit apple,strawberry
Vegetable potato
Into this table:
TYPE NAME
Fruit apple
Fruit strawberry
Vegetable potato
I tried this script:
#!/bin/bash
# define the name of the input file
input_file="plants.tsv"
# define the name of the output file
output_file="normalized_plants.tsv"
# define the index of the list column (counting from 1)
list_column=2
# create a new file with the headers for the output table
head -n 1 "$input_file" > "$output_file"
# read each line of the input file
tail -n +2 "$input_file" | while IFS=$'\t' read -r line; do
    # extract the values for the list column
    list_values=$(echo "$line" | awk -F$'\t' '{print $'"$list_column"'}' | tr ',' '\n')
    # iterate over each value in the list column
    echo "$line" | awk -F$'\t' -v OFS=$'\t' -v list_column="$list_column" -v list_values="$list_values" '
        NR == 1 { next } # skip the header row
        {
            split(list_values, values, "\n")
            for (i in values) {
                $list_column = values[i]
                print $0
            }
        }' >> "$output_file"
done
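But I get an empty output file. Do you know what is going wrong here, or is there a better way to achieve this? I'm a beginner in Bash, so this may not be the best way to do the normalization.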
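Don't use a shell read loop for this; see "Why is using a shell loop to process text considered bad practice". A single awk script will run faster, be more portable, and be easier to write robustly (for example, you currently have two answers using shell read loops that will fail if the "type" contains a blank, and one of them will also fail if the input contains any backslashes). For example, using any awk: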
$ cat tst.sh
#!/usr/bin/env bash
# define the name of the input file
input_file="plants.tsv"
# define the name of the output file
output_file="normalized_plants.tsv"
# define the index of the list column (counting from 1)
list_column=2
awk -v list_column="$list_column" '
    BEGIN { FS=OFS="\t" }
    {
        n = split($list_column,names,",")
        for ( i=1; i<=n; i++ ) {
            print $1, names[i]
        }
    }
' "$input_file" > "$output_file"
$ ./tst.sh
$ cat normalized_plants.tsv
TYPE NAME
Fruit apple
Fruit strawberry
Vegetable potato
I'm using for ( i=1; i<=n; i++ ) rather than for ( i in names ) above to guarantee that the input order of the names is preserved in the output, see https://www.gnu.org/software/gawk/manual/gawk.html#Scanning-an-Array.
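As for why the script in the question writes no data rows: each pass through the while loop starts a fresh awk that receives exactly one line, so NR is 1 on every call and the NR == 1 { next } rule discards that line. A minimal sketch of the effect, using a one-line sample input and the hypothetical function names with_header_skip and without_header_skip (neither appears in the question):

```shell
# Each function pipes a single line into its own awk process,
# just as the loop body in the question does.
with_header_skip() {
    printf 'Fruit\tapple,strawberry\n' |
        awk 'NR == 1 { next } { print "reached:", $0 }'
}

without_header_skip() {
    printf 'Fruit\tapple,strawberry\n' |
        awk '{ print "reached:", $0 }'
}

with_header_skip      # NR is 1 for the only line, so it is skipped
without_header_skip   # the line reaches the main block and is printed
```

The header is already removed by tail -n +2 before the loop, so the extra skip inside awk is not needed in the first place.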
This answer just shows that, using pure bash, your script can be condensed to:
#!/bin/bash
while read -r type names; do
    echo "$type"$'\t'"${names//,/$'\n'$type$'\t'}"
done < plants.tsv > normalized_plants.tsv
In general, the awk solution is preferred.
With bash:
while read type name_list; do                   # Read the two fields into type and name_list.
    readarray -d, -t names <<< "$name_list,"    # Split name_list on commas and store the parts in the names array.
    unset 'names[-1]'                           # The here-string appends a newline, which becomes a final entry; remove it.
    for name in "${names[@]}"; do               # For each name, ...
        echo "$type $name"                      # ... print type and name.
    done
done < plants.tsv > output_plants.tsv           # Input and output file redirection.
awk version:
awk '{split($2, s, ","); for(i in s){print $1, s[i]}}' plants.tsv > output_plants.tsv
For variety, here is a simple string-processing solution using sed.
$: sed -E ':x ; s/^([^[:space:]]+)[[:space:]]+([^,]+),/\1\t\2\n\1\t/; tx;' file
TYPE NAME
Fruit apple
Fruit strawberry
Vegetable potato
This works with the simple file given. Be sure to verify it against anything more complex.
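One such check is a row with more than two names. The sketch below assumes GNU sed (the \t and \n escapes in the replacement are GNU extensions) and branches back to the :x label so the substitution repeats until no comma is left; the function name split_with_sed and the sample input are made up for illustration:

```shell
split_with_sed() {
    # Assumes GNU sed: \t and \n in the replacement are GNU extensions.
    # "tx" jumps back to label x whenever the substitution succeeded,
    # so one comma is expanded per pass until none remain.
    sed -E ':x; s/^([^[:space:]]+)[[:space:]]+([^,]+),/\1\t\2\n\1\t/; tx'
}

printf 'TYPE\tNAME\nFruit\tapple,strawberry,banana\nVegetable\tpotato\n' |
    split_with_sed
```

Like the shell loops, this treats the first run of non-whitespace characters as the type, so a type containing a blank would not be handled.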
echo '
TYPE NAME
Fruit apple,strawberry,banana
Vegetable potato' |
mawk 'NR==!_ || $NF!~/,/ || gsub(",[^,]+", "\n"$!_ " &", $NF) + gsub(",",_)'
TYPE NAME
Fruit apple
Fruit strawberry
Fruit banana
Vegetable potato
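The one-liner above is terse because it leans on _ being an uninitialized variable, so !_ evaluates to 1 and _ to the empty string. A more readable sketch of roughly the same logic, for any POSIX awk, is shown here; the function name split_last_field is hypothetical, and it assumes the first field contains no & or backslash, since those are special in a gsub() replacement:

```shell
split_last_field() {
    awk '
        # Header line, or no comma in the last field: print unchanged.
        NR == 1 || $NF !~ /,/ { print; next }
        {
            # Turn every comma into "newline + first field + space".
            gsub(/,/, "\n" $1 " ", $NF)
            print
        }
    '
}

printf 'TYPE NAME\nFruit apple,strawberry,banana\nVegetable potato\n' |
    split_last_field
```

Assigning to $NF makes awk rebuild the record with OFS, which is a single space by default, so modified rows come out space-separated.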
If you want to be fastidious about the output spacing, then
gawk 'NR==!_ ? OFS = substr($_, match($_, "[ \t]+"),RLENGTH) : $NF!~/,/ || gsub(",[^,]+", "\n" $!_ OFS "&", $NF) gsub(",",_)'
TYPE NAME
Fruit apple
Fruit strawberry
Fruit banana
Vegetable potato