Calculating the median of a sliding window with awk



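I need to run a sliding window over a file with several million rows and compute the median of column 3. My data looks like this: column 1 is always the same, column 2 equals the row number, and column 3 holds the values whose median I need: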

HiC_scaffold_1  1   34
HiC_scaffold_1  2   34
HiC_scaffold_1  3   36
HiC_scaffold_1  4   37
HiC_scaffold_1  5   38
HiC_scaffold_1  6   39
HiC_scaffold_1  7   40
HiC_scaffold_1  8   40
HiC_scaffold_1  9   40
HiC_scaffold_1  10  41
HiC_scaffold_1  11  41
HiC_scaffold_1  12  41
HiC_scaffold_1  13  44
HiC_scaffold_1  14  44
HiC_scaffold_1  15  55

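I need a result like the following, assuming a sliding window of 4 and rounding to the nearest integer. In the real data set I will probably use a sliding window of 1000: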

HiC_scaffold_1  4   35
HiC_scaffold_1  5   37
HiC_scaffold_1  6   38
HiC_scaffold_1  7   39
HiC_scaffold_1  8   40
HiC_scaffold_1  9   40
HiC_scaffold_1  10  40
HiC_scaffold_1  11  41
HiC_scaffold_1  12  41
HiC_scaffold_1  13  41
HiC_scaffold_1  14  43
HiC_scaffold_1  15  44

I found the following script here, which does what I want, but for the mean rather than the median:

awk -v OFS="\t" 'BEGIN {
    window = 4
    slide = 1
}
{
    mod = NR % window
    if (NR <= window) {
        count++
    } else {
        sum -= array[mod]
    }
    sum += $3
    array[mod] = $3
}
(NR % slide) == 0 {
    print $1, NR, sum / count
}
' file.txt

And this script, which computes a median with awk:

sort -n -k3 file.txt |
awk '{
    arr[NR] = $3
}
END {
    if (NR % 2 == 1) {
        print arr[(NR + 1) / 2]
    } else {
        print $1 "\t" $2 "\t" (arr[NR / 2] + arr[NR / 2 + 1]) / 2
    }
}
'

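But I can't get them to work together. A further problem is that the median calculation needs sorted input. I also found this datamash solution, but I don't know how to make it work efficiently with a sliding window.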

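The following assumes the availability of the asort function, as provided by GNU awk (gawk). The program is parameterized by the window size, wsize, which is 4 here: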

gawk -v wsize=4 '
BEGIN {
    if (wsize % 2 == 0) { m1 = wsize/2; m2 = m1 + 1; } else { m1 = m2 = (wsize + 1)/2; }
}
function roundedmedian() {
    asort(window, a);
    return (m1 == m2) ? a[m1] : int(0.5 + ((a[m1] + a[m2]) / 2));
}
function push(value) {
    window[NR % wsize] = value;
}
NR < wsize { window[NR] = $3; next; }
{
    push($3);
    $3 = roundedmedian();
    print $0;
}'

Using GNU awk for asort():

$ cat tst.awk
BEGIN {
    OFS = "\t"
    window = 4
    # Indices of the middle element(s) of the sorted window;
    # they coincide when the window size is odd.
    befMid = int((window + 1) / 2)
    aftMid = int(window / 2) + 1
}
{ array[NR % window] = $3 }
NR >= window {
    asort(array, vals)
    print $1, $2, int( (vals[befMid] + vals[aftMid]) / 2 + 0.5 )
}

$ awk -f tst.awk file
HiC_scaffold_1  4       35
HiC_scaffold_1  5       37
HiC_scaffold_1  6       38
HiC_scaffold_1  7       39
HiC_scaffold_1  8       40
HiC_scaffold_1  9       40
HiC_scaffold_1  10      40
HiC_scaffold_1  11      41
HiC_scaffold_1  12      41
HiC_scaffold_1  13      41
HiC_scaffold_1  14      43
HiC_scaffold_1  15      44
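
The asort()-based scripts re-sort the whole window for every input row, which may become noticeable with a window of 1000 over millions of rows. One possible alternative, shown here only as a rough sketch and not taken from the answers, is to keep the window sorted at all times: drop the value that leaves the window and insert the new one in its place. The function name incremental_sliding_median and the arrays ring and sorted are hypothetical names chosen for this illustration, and the rounding assumes non-negative values.

```shell
# Hypothetical helper: reads rows on stdin, window size is the first argument.
incremental_sliding_median() {
    awk -v OFS='\t' -v window="${1:-4}" '
    BEGIN { n = 0 }                 # n = number of values currently in sorted[]
    {
        slot = NR % window          # ring-buffer slot that this row overwrites
        if (NR > window) {
            # Remove the value that is leaving the window from sorted[].
            old = ring[slot]
            i = 1
            while (sorted[i] != old) i++
            for (; i < n; i++) sorted[i] = sorted[i + 1]
            n--
        }
        # Insert the new value, shifting larger values one place to the right.
        val = $3 + 0
        ring[slot] = val
        for (i = n; i >= 1 && sorted[i] > val; i--) sorted[i + 1] = sorted[i]
        sorted[i + 1] = val
        n++
    }
    NR >= window {
        lo = int((window + 1) / 2)  # lower middle index
        hi = int(window / 2) + 1    # upper middle index (same as lo if odd)
        print $1, $2, int((sorted[lo] + sorted[hi]) / 2 + 0.5)
    }'
}
```

It would be called as incremental_sliding_median 1000 < file.txt. Each row then costs at most one pass over the window rather than a full sort, and the script uses no gawk-specific functions.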

The following GNU awk script seems to produce the output you provided:

awk -v OFS='\t' -v window=4 '
{
# I store the numbers in an array `nums` indexed with `1 ... window`
mod = NR % window + 1;
nums[mod] = $3;
}
# If the count of numbers is greater or equal the window,
# we can start calculating the median.
NR >= window {
# Copy the array nums, cause we need to sort it.
for (i = 1; i <= window; ++i) {
copy[i] = nums[i];
}
# Sort the copy.
# asort is a GNU extension if I remember.
# For non-gnu, write a sorting function yourself.
asort(copy);
# Calculate the median.
# I hope that is ok.
half = int( (window + 1) / 2 );
if (window % 2 == 0) {
# You seem to want to round 0.5 up.
# Just add 1 and round down.
median = int( (copy[half] + copy[half + 1] + 1) / 2 );
} else {
median = copy[half];
}
# Output
print $1, $2, median 
}'
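
For an awk without asort, the sorting function mentioned in the comments could look something like the sketch below. It swaps asort for a hand-written insertion sort and otherwise follows the same copy-then-sort approach. The names portable_sliding_median and isort are made up for this example, and the rounding assumes non-negative values.

```shell
# Hypothetical helper: reads rows on stdin, window size is the first argument.
portable_sliding_median() {
    awk -v OFS='\t' -v window="${1:-4}" '
    # Insertion sort of arr[1..n] in place, comparing values numerically.
    function isort(arr, n,    i, j, v) {
        for (i = 2; i <= n; i++) {
            v = arr[i]
            for (j = i - 1; j >= 1 && arr[j] + 0 > v + 0; j--)
                arr[j + 1] = arr[j]
            arr[j + 1] = v
        }
    }
    # Keep the last `window` values in nums[1..window].
    { nums[NR % window + 1] = $3 }
    NR >= window {
        # Sort a copy so the ring buffer itself stays untouched.
        for (i = 1; i <= window; i++) copy[i] = nums[i]
        isort(copy, window)
        lo = int((window + 1) / 2)  # lower middle index
        hi = int(window / 2) + 1    # upper middle index (same as lo if odd)
        print $1, $2, int((copy[lo] + copy[hi]) / 2 + 0.5)
    }'
}
```

Usage would be portable_sliding_median 4 < file.txt. Insertion sort is quadratic in the window size, so this is meant for small windows or as a fallback where gawk is not available.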
