I have an input text file:
EL.EEX.FRANCE.DELMONTHS.JAN2016.SPOT.VOL 15JAN2016
EL.EEX.GERMANY.DELMONTHS.JAN2016.SPOT.L 15JAN2016
EL.EEX.GERMANY.DELMONTHS.JAN2016.SPOT.H 15JAN2016
EL.EEX.GERMANY.DELMONTHS.JAN2016.SPOT.S 15JAN2016
EL.EEX.ITALY.DELMONTHS.JAN2016.FWD 15JAN2016
EL.EEX.ITALY.DELMONTHS.JAN2016.FWD 15JAN2016
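Given this sample data, for each unique series type (everything up to the last dot) we need one representative sample (the complete line), without the date. So the output would be: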
EL.EEX.FRANCE.DELMONTHS.JAN2016.SPOT.VOL
EL.EEX.GERMANY.DELMONTHS.JAN2016.SPOT.L
EL.EEX.ITALY.DELMONTHS.JAN2016.FWD
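(The order of the lines in the output does not matter.)
The program below works fine, but it generates many intermediate temporary files. How can we do this in the shell without them?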
#input file name and path assumed in current directory
file="./osc.txt"
resultfilepath="./OSCoutput.txt"
tmpfilepath="./OSCtempoutput.txt"
tmp1filepath="./OSCtemp1output.txt"
tmp2filepath="./OSCtemp2output.txt"
rm $resultfilepath
rm $tmpfilepath
#using awk to filter only series data without dates
awk -F' ' '{print $1}' $file >> $tmpfilepath
#getting all the unique dataclass_names at column 1
DATACLASSNAME=(`cut -f 1 -d'.' $tmpfilepath | sort | uniq`)
for i in "${DATACLASSNAME[@]}"; do
rm $tmp1filepath
#we are filtering the file with that dataclass
awk -F'.' -v awk_dataclassname="$i" '$1==awk_dataclassname' $tmpfilepath >> $tmp1filepath
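#also we are calculating the number of dot-delimited fields in each filtered record and sorting it
#(sort must come before uniq, because uniq only collapses adjacent duplicates)
COLCOUNT=($(awk -F'.' '{print NF}' "$tmp1filepath" | sort -n | uniq))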
for j in "${COLCOUNT[@]}"; do
rm $tmp2filepath
#in the filtered data we are taking series of a particular dimension length and dumping data
awk -F '.' -v awk_colcount="$j" '(NF==awk_colcount){print}' $tmp1filepath >> $tmp2filepath
#reducing column no by 1
newj=$((j - 1))
#removing last column(generally observation dimension) by cut column
GREPSAMPLE=(`cut -f -$newj -d'.' $tmp2filepath | uniq`)
SAMPLELENGTH=(`wc -l $tmp2filepath`)
#we are now taking unique series sample
for k in "${GREPSAMPLE[@]}"; do
#doing grep of unique sample but taking the whole line
grep -F -- "$k" "$tmp1filepath" | head -1 >> "$resultfilepath"
done
done
done
cat $resultfilepath
echo "processing finish"
The whole thing can be done with this single awk invocation.
awk '{
key = $0;
sub(/\.[^.]*$/, "", key); # Let key be everything up to the last dot
if (!seen[key]) { print $1 } # If key has not been seen, print 1st col
seen[key] = 1; # Mark the key as seen
}' "$file" > "$resultfilepath"
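To try the approach without touching any files, here is a minimal sketch that wraps the same idea in a shell function and feeds it the sample data from the question through a here-document. The function name first_per_group is hypothetical, and the sketch assumes the date is always the second whitespace-separated field, so the key can be taken from $1 alone.

```shell
#!/bin/sh
# first_per_group is a hypothetical helper name, not part of the original script.
# It reads "SERIES DATE" lines on stdin and prints the first series seen for
# each group, where a group is the series name minus its last dot-separated field.
first_per_group() {
    awk '{
        key = $1                     # series name without the date (assumed layout)
        sub(/\.[^.]*$/, "", key)     # drop the last dot and everything after it
        if (!(key in seen)) {        # first line of this group
            seen[key] = 1
            print $1
        }
    }'
}

# Sample data from the question, passed on stdin; no temporary files are created.
first_per_group <<'EOF'
EL.EEX.FRANCE.DELMONTHS.JAN2016.SPOT.VOL 15JAN2016
EL.EEX.GERMANY.DELMONTHS.JAN2016.SPOT.L 15JAN2016
EL.EEX.GERMANY.DELMONTHS.JAN2016.SPOT.H 15JAN2016
EL.EEX.GERMANY.DELMONTHS.JAN2016.SPOT.S 15JAN2016
EL.EEX.ITALY.DELMONTHS.JAN2016.FWD 15JAN2016
EL.EEX.ITALY.DELMONTHS.JAN2016.FWD 15JAN2016
EOF
```

Run on the sample, this prints the three lines listed as the desired output in the question. To process a real file, redirect it into the function instead, for example first_per_group < "$file" > "$resultfilepath".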
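In general, when you have a script that chains together lots of awk calls and other text-filtering commands, you can most likely write it as a single awk script instead.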