Reducing the processing time of a 'while read' loop



New to shell scripting…

I have a huge CSV file whose 11th field (f11) varies in length, like

000000aaad000000bhb200000uwwed…
000000aba200000bbrb2000000wwqr00000caba2000000bhbd000000qwew…
...

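After splitting the string into chunks of 10 characters, I need characters 6-9 of each chunk. Then I have to join them with the delimiter '|', like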

0aaa|0bhb|uwwe…
0aba|bbrb|0wwq|caba|0bhb|0qwe…

and join the processed f11 with the other fields.

Here is the time taken to process 10K records ->

real 4m43.506s
user 0m12.366s
sys 0m12.131s

20K records ->
real 5m20.244s
user 2m21.591s
sys 3m20.042s

80K records (around 3.7 million f11 splits and merges) ->

real 21m18.854s
user 9m41.944s
sys 13m29.019s

My target is 30 minutes to process 650K records (around 56 million f11 splits and merges). Is there any way to optimize this?

while read -r line1; do
    f10=$( echo $line1 | cut -d',' -f1,2,3,4,5,7,9,10)
    echo $f10 >> $path/other_fields

    f11=$( echo $line1 | cut -d',' -f11 )
    f11_trim=$(echo "$f11" | tr -d '"')
    echo $f11_trim | fold -w10 > $path/f11_extract
    cat $path/f11_extract | awk '{print $1}' | cut -c6-9 >> $path/str_list_trim

    arr=($(cat $path/str_list_trim))
    printf "%s|" ${arr[@]} >> $path/str_list_serialized
    printf '\n' >> $path/str_list_serialized
    arr=()

    rm $path/f11_extract
    rm $path/str_list_trim
done < $input
sed -i 's/.$//' $path/str_list_serialized
sed -i 's/\(.*\)/"\1"/g' $path/str_list_serialized
paste -d "," $path/other_fields $path/str_list_serialized > $path/final_out

Your code is not time-efficient because:

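  • It invokes multiple commands, including awk, inside the loop, so several processes are started for every input line.
  • It generates many intermediate temporary files.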
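To illustrate the first point, here is a minimal sketch of the per-line extraction done with shell parameter expansion only, so no external process is started for each line. The function name extract_chunks is hypothetical, and the code assumes f11 has already been stripped of quotes and is a multiple of 10 characters long (any shorter trailing remainder is ignored). It is written with POSIX expansions so it should run under bash or sh.

```shell
# Hypothetical helper: join characters 6-9 of each 10-character chunk with '|'.
# Only parameter expansion is used, so nothing is forked per chunk or per line.
# Note: the variables below are global (plain POSIX sh has no "local").
extract_chunks() {
    s=$1
    out=
    while [ "${#s}" -ge 10 ]; do
        rest=${s#??????????}        # everything after the first 10 characters
        chunk=${s%"$rest"}          # the first 10 characters
        piece=${chunk#?????}        # drop characters 1-5
        piece=${piece%?}            # drop character 10, leaving characters 6-9
        out="${out:+$out|}$piece"   # prepend '|' only when out is non-empty
        s=$rest
    done
    printf '%s\n' "$out"
}

extract_chunks "000000aaad000000bhb200000uwwed"   # prints 0aaa|0bhb|uwwe
```

Called directly inside a loop such as `while IFS= read -r f11; do extract_chunks "$f11"; done`, with the output redirected once after `done`, this avoids both the per-line processes and the temporary files. Even so, a shell loop is typically much slower than a single awk process on large inputs.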

You can do the whole job with awk alone:

awk -F, -v OFS="," '                                    # assign input/output field separator to a comma
{
    len = length($11)                                   # length of the 11th field
    s = ""; d = ""                                      # clear output string and the delimiter
    for (i = 1; i <= len / 10; i++) {                   # iterate over the 10-character chunks of the 11th field
        s = s d substr($11, (i - 1) * 10 + 6, 4)        # append characters 6-9 of the current chunk
        d = "|"                                         # set the delimiter to a pipe character
    }
    $11 = "\"" s "\""                                   # assign the 11th field to the generated string, wrapped in double quotes
} 1' "$input"                                           # the final "1" tells awk to print all fields

Sample input:

1,2,3,4,5,6,7,8,9,10,000000aaad000000bhb200000uwwed
1,2,3,4,5,6,7,8,9,10,000000aba200000bbrb2000000wwqr00000caba2000000bhbd000000qwew

Output:

1,2,3,4,5,6,7,8,9,10,"0aaa|0bhb|uwwe"
1,2,3,4,5,6,7,8,9,10,"0aba|bbrb|0wwq|caba|0bhb|0qwe"
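The awk command above prints all eleven fields and treats f11 as unquoted. The script in the question instead keeps only fields 1,2,3,4,5,7,9,10 and strips double quotes from f11 before splitting. A sketch of that variation follows; the wrapper name reshape_csv is hypothetical, and it assumes no field contains an embedded comma, since -F, splits on every comma.

```shell
# Hypothetical wrapper: reads CSV on stdin, writes the reduced CSV on stdout.
reshape_csv() {
    awk -F, -v OFS="," '
    {
        f11 = $11
        gsub(/"/, "", f11)                            # strip double quotes, like tr -d in the question
        s = ""; d = ""
        for (i = 1; i <= length(f11) / 10; i++) {
            s = s d substr(f11, (i - 1) * 10 + 6, 4)  # characters 6-9 of the current chunk
            d = "|"
        }
        # keep fields 1,2,3,4,5,7,9,10, then the rebuilt 11th field in double quotes
        print $1, $2, $3, $4, $5, $7, $9, $10, "\"" s "\""
    }'
}

printf '%s\n' '1,2,3,4,5,6,7,8,9,10,"000000aaad000000bhb200000uwwed"' | reshape_csv
# prints 1,2,3,4,5,7,9,10,"0aaa|0bhb|uwwe"
```

With the variables from the question, it would be run as `reshape_csv < "$input" > "$path/final_out"`, which replaces the loop, the two sed calls, and the paste step with a single pass.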
