Pairwise trimming of files based on the max/min of their first/last records



I have a problem with some fairly large csv files. I can write simple bash/awk scripts, but this one is beyond my limited awk/bash programming experience.

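Problem:

  • All my files are in one folder. The folder contains an even number of csv files that need to be trimmed in pairs (explained below). The files are named f1L, f1R, f2L, f2R, f3L, f3R, ..., fnL, fnR.

  • The files need to be read in pairs, i.e. f1L with f1R, f2L with f2R, and so on.

  • Each file has two comma-separated fields. The start and end of f1L and f1R look like this: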

f1L (START)
1349971210, -0.984375 
1349971211, -1.000000 
f1R (START) 
1349971206, -0.015625
1349971207, 0.000000
f1L (END)
1350230398, 0.500000
1350230399, 0.515625
f1R (END) 
1350230402, 0.484375
1350230403, 0.515625

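What I would like to do with awk is:

  1. Read record 1, field 1 of f1L (i.e. 1349971210), then read record 1, field 1 of f1R (i.e. 1349971206). Then take the maximum of the two values (i.e. x1=1349971210).
  2. Read the last record, field 1 of f1L (i.e. 1350230399), then read the last record, field 1 of f1R (i.e. 1350230403). Then take the minimum of the two values (i.e. x2=1350230399).
  3. Then extract all lines of f1L and f1R whose first field is greater than or equal to x1 and less than or equal to x2, and re-save them under the same file names.
  4. Repeat this process for all pairs in my directory.

I was wondering if any of you could suggest a small bash/awk script to do the job.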

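Here is a straightforward way to do this in bash. No attempt at efficiency is made here. There is no error checking (well, only the mandatory bare minimum).

Name this script myscript. It takes two arguments (the files fxL and fxR).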

#!/bin/bash
tmpL=''
tmpR=''
die() {
    echo >&2 "$@"
    exit 1
}
on_exit() {
    [[ -f $tmpL ]] && rm -- "$tmpL"
    [[ -f $tmpR ]] && rm -- "$tmpR"
}
last_non_blank_line() {
   # remember each non-empty line in the hold space; print it at end of file
   sed -n -e $'/^$/ !h\n$ {x;p;}' "$1"
}
(($#==2)) || die "script takes two arguments"
fL=$1
fR=$2
[[ -r "$fL" && -w "$fL" ]] || die "problem with file '$fL'"
[[ -r "$fR" && -w "$fR" ]] || die "problem with file '$fR'"
# lower bound: the larger of the two first timestamps (field 1, line 1)
IFS=, read -r min _ < "$fL"
[[ $min =~ ^[[:digit:]]+$ ]] || die "first line of '$fL' has a bad record"
IFS=, read -r t _ < "$fR"
[[ $t =~ ^[[:digit:]]+$ ]] || die "first line of '$fR' has a bad record"
((t>min)) && ((min=t))
# upper bound: the smaller of the two last timestamps (field 1, last non-blank line)
IFS=, read -r max _ < <(last_non_blank_line "$fL")
[[ $max =~ ^[[:digit:]]+$ ]] || die "last line of '$fL' has a bad record"
IFS=, read -r t _ < <(last_non_blank_line "$fR")
[[ $t =~ ^[[:digit:]]+$ ]] || die "last line of '$fR' has a bad record"
((t<max)) && ((max=t))
# create tmp files
tmpL=$(mktemp --tmpdir) || die "can't create tmp file"
tmpR=$(mktemp --tmpdir) || die "can't create tmp file"
trap 'on_exit' EXIT
# Read fL line by line, and only keep those
# the first record of which is between min and max
while IFS=, read a b; do
    [[ $a =~ ^[[:digit:]]+$ ]] && ((a<=max)) && ((a>=min)) && echo "$a,$b"
done < "$fL" > "$tmpL"
mv -- "$tmpL" "$fL"
# Same with fR:
while IFS=, read a b; do
    [[ $a =~ ^[[:digit:]]+$ ]] && ((a<=max)) && ((a>=min)) && echo "$a,$b"
done < "$fR" > "$tmpR"
mv -- "$tmpR" "$fR"

And call it as:

$ myscript f1L f1R

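Use it on scratch files first! No warranty! Use at your own risk!

Caveat. Since the script uses bash arithmetic for the comparisons, it assumes that the first field of every line in each file is an integer within the range bash can handle.


Edit. Since your first fields are floating-point numbers, the method above, which relies on bash arithmetic, cannot be used. A fun alternative is to let bash do all the plumbing (getting the first line, the last line, opening files, ...) and to use bc for the arithmetic. That way you are not limited by the size of the numbers (bc uses arbitrary precision), and floats are welcome! For example: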

#!/bin/bash
tmpL=''
tmpR=''
die() {
    echo >&2 "$@"
    exit 1
}
on_exit() {
    [[ -f $tmpL ]] && rm -- "$tmpL"
    [[ -f $tmpR ]] && rm -- "$tmpR"
}
last_non_blank_line() {
   # remember each non-empty line in the hold space; print it at end of file
   sed -n -e $'/^$/ !h\n$ {x;p;}' "$1"
}
(($#==2)) || die "script takes two arguments"
fL=$1
fR=$2
[[ -r "$fL" && -w "$fL" ]] || die "problem with file '$fL'"
[[ -r "$fR" && -w "$fR" ]] || die "problem with file '$fR'"
# lower bound: the larger of the two first timestamps (field 1, line 1)
IFS=, read -r a _ < "$fL"
IFS=, read -r b _ < "$fR"
min=$(bc <<< "if($b>$a) { print \"$b\" } else { print \"$a\" }" 2> /dev/null)
[[ -z $min ]] && die "problem in first line of files '$fL' or '$fR'"
# upper bound: the smaller of the two last timestamps (field 1, last non-blank line)
IFS=, read -r a _ < <(last_non_blank_line "$fL")
IFS=, read -r b _ < <(last_non_blank_line "$fR")
max=$(bc <<< "if($b<$a) { print \"$b\" } else { print \"$a\" }" 2> /dev/null)
[[ -z $max ]] && die "problem in last line of files '$fL' or '$fR'"
# create tmp files
tmpL=$(mktemp --tmpdir) || die "can't create tmp file"
tmpR=$(mktemp --tmpdir) || die "can't create tmp file"
trap 'on_exit' EXIT
# Read fL line by line, and only keep those
# the first record of which is between min and max
# (each input line becomes one bc statement that prints the line if it is in range)
while IFS= read -r l; do
    [[ $l =~ ^[[:space:]]*$ ]] && continue
    r=${l%%,*}
    printf 'if(%s>=%s && %s<=%s) { print "%s\\n" }\n' "$r" "$min" "$r" "$max" "$l"
done < "$fL" | bc > "$tmpL" || die "Error in bc while doing file '$fL'"
# Same with fR:
while IFS= read -r l; do
    [[ $l =~ ^[[:space:]]*$ ]] && continue
    r=${l%%,*}
    printf 'if(%s>=%s && %s<=%s) { print "%s\\n" }\n' "$r" "$min" "$r" "$max" "$l"
done < "$fR" | bc > "$tmpR" || die "Error in bc while doing file '$fR'"
mv -- "$tmpL" "$fL"
mv -- "$tmpR" "$fR"
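
As a rough alternative sketch for the filtering step only: awk compares fields numerically as double-precision floats, so it copes with fractional timestamps without bc, at the cost of bc's arbitrary precision. The helper name trim_range and its argument order are made up for illustration and are not part of the scripts above.

```shell
# Hypothetical helper (not from the scripts above): print the lines of
# file "$3" whose first comma-separated field lies in the closed range
# [$1, $2]. Blank lines are dropped because they have no fields.
trim_range() {
    awk -F, -v lo="$1" -v hi="$2" '
        # adding 0 forces a numeric comparison rather than a string one
        NF >= 2 && ($1 + 0) >= (lo + 0) && ($1 + 0) <= (hi + 0)
    ' "$3"
}
```

It could be used along the lines of `trim_range "$min" "$max" "$fL" > "$tmpL"` in place of the while loop, assuming the values fit comfortably in a double.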

Using perl:

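use warnings;
use strict;
my $dir = @ARGV ? $ARGV[0] : ".";  # directory is argument (default: current)
for my $fL (glob "$dir/f[0-9]*L") {
    my ($n) = $fL =~ m{/f(\d+)L$} or next;  # pair number from the file name
    my $fR = "$dir/f${n}R";
    my ($dL, $dR) = (loadfile($fL), loadfile($fR));
    next unless @$dL && @$dR;               # nothing to trim in an empty file
    # x1 = larger of the two first timestamps, x2 = smaller of the two last ones
    my $x1 = max($dL->[0][0],  $dR->[0][0]);
    my $x2 = min($dL->[-1][0], $dR->[-1][0]);
    trimfile($fL, $dL, $x1, $x2);
    trimfile($fR, $dR, $x1, $x2);
}
# read a file into an array of [timestamp, value] rows, skipping blank lines
sub loadfile {
    my ($fname) = @_;
    my @d;
    open(my $fh, "<", $fname) or die "$fname: $!";
    while (my $line = <$fh>) {
        $line =~ s/\s+$//;                  # strip newline and trailing blanks
        next unless length $line;
        push @d, [ split /\s*,\s*/, $line ];
    }
    close $fh;
    return \@d;
}
# rewrite the file, keeping only the rows with x1 <= timestamp <= x2
sub trimfile {
    my ($fname, $data, $x1, $x2) = @_;
    open(my $fh, ">", $fname) or die "$fname: $!";
    print {$fh} join(", ", @$_), "\n"
        for grep { $_->[0] >= $x1 && $_->[0] <= $x2 } @$data;
    close $fh;
}
sub min { my ($p, $q) = @_; return $p < $q ? $p : $q; }
sub max { my ($p, $q) = @_; return $p > $q ? $p : $q; }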

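I tried to include all the necessary sanity checks and to minimize disk I/O (assuming your files are large enough that reading them is the time-limiting factor). Also, the files never need to be read into memory as a whole (assuming your files might even be larger than the available RAM).

However, this was only tried with very basic dummy input, so please test it and report any problems.

First, I wrote a script that trims one pair (identified by the name of the f…L file):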

#!/bin/bash
#############    
# trim_pair #
#-----------#############################
# given fXL file path, trim fXL and fXR #
#########################################
#---------------# 
# sanity checks #
#---------------#
# error function
error(){
 echo >&2 "$@"
 exit 1
}
# argument given?
[[ $# -eq 1 ]] || 
 error "usage: $0 <file>"
LFILE="$1"
# argument format valid?
[[ `basename "$LFILE" | egrep '^f[[:digit:]]+L$'` ]] || 
 error "invalid file name: $LFILE (has to match /^f[[:digit:]]+L$/)"
RFILE="${LFILE%L}R" # POSIX parameter expansion: swap the trailing L for R
# files exists?
[[ -e "$LFILE" ]] || 
 error "file does not exist: $LFILE"
[[ -e "$RFILE" ]] || 
 error "file does not exist: $RFILE"
# files readable?
[[ -r "$LFILE" ]] || 
 error "file not readable: $LFILE"
[[ -r "$RFILE" ]] || 
 error "file not readable: $RFILE"
# files writable?
[[ -w "$LFILE" ]] || 
 error "file not writable: $LFILE"
[[ -w "$RFILE" ]] || 
 error "file not writable: $RFILE"
#------------------#
# create tmp files #
# & ensure removal #
#------------------#
# cleanup function
cleanup(){
 [[ -e "$LTMP" ]] && rm -- "$LTMP"
 [[ -e "$RTMP" ]] && rm -- "$RTMP"
}
# cleanup on exit
trap 'cleanup' EXIT
#create tmp files
LTMP=`mktemp --tmpdir` || 
 error "tmp file creation failed"
RTMP=`mktemp --tmpdir` || 
 error "tmp file creation failed"
#----------------------#
# process both files   #
# prepended by their   #
# first and last lines #
#----------------------#
# extract first and last lines without reading the whole files twice
{
 head -q -n1 "$LFILE" "$RFILE"  # no need to read the whole files
 tail -q -n1 "$LFILE" "$RFILE"  # no need to read the whole files
} | awk -F, -v ltmp="$LTMP" -v rtmp="$RTMP" '
 NF!=2{
  print "incorrect file format: record "FNR" in file "FILENAME > "/dev/stderr"
  exit 1
 }
 FILENAME!="-"&&NR<5{           # stdin must have supplied 4 lines
  print "too few lines in input" > "/dev/stderr"
  exit 1
 }
 NR==1{                         # read record 1,
  x1=$1                         # field 1 of L,
  next                          # then read
 }
 NR==2{                         # record 1 of R,
  x1=($1>x1)?$1:x1              # field 1 & take the max,
  next                          # then
 }
 NR==3{                         # read last record,
  x2=$1                         # field 1 of L,
  next                          # then
 }
 NR==4{                         # last record of R
  x2=($1<x2)?$1:x2              # field 1 & take the min
  next
 }
 FNR==1{                        # first line of L or R: pick the tmp file
  outfile=(FILENAME~/L$/)?ltmp:rtmp
 }
 $1>=x1&&$1<=x2{
  print > outfile
 }
' - "$LFILE" "$RFILE" || 
 error "error while trimming"
#-----------------------#
# re-save trimmed files #
# under the same names  #
#-----------------------#
mv -- "$LTMP" "$LFILE" || 
 error "cannot re-save $LFILE"
mv -- "$RTMP" "$RFILE" || 
 error "cannot re-save $RFILE"

As you can see, the main idea is to use head and tail to prepend the important lines to the input, and then to process everything with awk as you requested.
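
The head/tail idea can be looked at in isolation. The following is only a minimal sketch of the bounds computation, assuming both files end with a newline and have a numeric first field; the function name pair_bounds is invented for this illustration and does not appear in the script above.

```shell
# Hypothetical helper: print "x1 x2" for a pair of files, where x1 is the
# larger of the two first timestamps and x2 the smaller of the two last ones.
# Assumes each file ends with a newline, so every line arrives separately.
pair_bounds() {
    {
        head -n 1 "$1"; head -n 1 "$2"   # records 1 and 2: first lines
        tail -n 1 "$1"; tail -n 1 "$2"   # records 3 and 4: last lines
    } | awk -F, '
        NR == 1 { x1 = $1 }
        NR == 2 { if ($1 + 0 > x1 + 0) x1 = $1 }
        NR == 3 { x2 = $1 }
        NR == 4 { if ($1 + 0 < x2 + 0) x2 = $1 }
        END     { if (NR != 4) exit 1; print x1, x2 }
    '
}
```

With the sample f1L and f1R from the question, `pair_bounds f1L f1R` prints `1349971210 1350230399`. Only four lines ever reach awk, so the cost does not depend on the file sizes.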

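To call that script for all files in some directory, you could use the following script (not as elaborate as the one above, but I guess you could have come up with something similar yourself):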

#!/bin/bash
############
# trim all #
#----------###################################
# find L files in current or given directory #
# and trim the corresponding file pairs      #
##############################################
TRIM_PAIR="trim_pair"   # path to the trim script for one pair
if [[ $# -eq 1 ]]
then
 WD="$1"
else
 WD="`pwd`"
fi
find "$WD" \
 -type f \
 -readable \
 -writable \
 -regextype posix-egrep \
 -regex "^$WD/"'f[[:digit:]]+L' \
 -exec "$TRIM_PAIR" "{}" \;

Note that you must either have the trim_pair script on your PATH or adjust the TRIM_PAIR variable in the trim_all script.
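
If GNU find is not available, a plain glob loop might do the same job. This is a sketch under the assumption that the pairs are named fNL/fNR with N made of digits only; for_each_pair is a made-up name, and the command it runs receives both file names, which suits a two-argument script such as myscript rather than trim_pair.

```shell
# Hypothetical driver: run a command once per fNL/fNR pair in a directory.
# Usage: for_each_pair DIR COMMAND [ARGS...]
for_each_pair() {
    local dir=$1 l r n
    shift
    for l in "$dir"/f*L; do
        [ -f "$l" ] || continue          # also covers an unmatched glob
        n=${l##*/}; n=${n#f}; n=${n%L}   # the part between "f" and "L"
        case $n in
            ''|*[!0-9]*) continue ;;     # keep digit-only pair numbers
        esac
        r="${l%L}R"
        if [ -f "$r" ]; then
            "$@" "$l" "$r"
        else
            echo "skipping $l: no matching $r" >&2
        fi
    done
}
```

For example, `for_each_pair . echo` lists the pairs that would be processed, which is a cheap dry run before pointing it at a script that rewrites files.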
