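I've seen different versions of this question on Stack Overflow, but none that addresses this specific use case.
Goal
Find duplicate rows based on the whole row (not just one column), ignoring the value in the last column. Eliminate all but one row from each set of duplicates, but first sum the last-column values across the duplicates and show the total in the last column of the row that remains. I'd like to do this in Bash.
Use case
I have a table listing every page of a website with the number of views it received, plus some other metadata. However, some rows in the table represent the same page and differ only in the view count. Those counts need to be summed to show the all-time views for each page.
Example
Original file: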
url,title,tag,version,guide,views
"https://website.com/1-1/section/product/page-title","Page Title 1",tag-1,"1-1","guide-1",100
"https://website.com/2-2/section/product/page-title","Page Title 2",tag-2,"2-2","guide-2",5
"https://website.com/1-1/section/product/page-title","Page Title 1",tag-1,"1-1","guide-1",15
"https://website.com/3-3/section/product/page-title","Page Title 3",tag-3,"3-3","guide-3",10
"https://website.com/3-3/section/product/page-title","Page Title 3",tag-3,"3-3","guide-3",20
"https://website.com/4-4/section/product/page-title","Page Title 4",tag-4,"4-4","guide-4",7
"https://website.com/3-3/section/product/page-title","Page Title 3",tag-3,"3-3","guide-3",30
Desired file:
url,title,tag,version,guide,views
"https://website.com/1-1/section/product/page-title","Page Title 1",tag-1,"1-1","guide-1",115
"https://website.com/2-2/section/product/page-title","Page Title 2",tag-2,"2-2","guide-2",5
"https://website.com/3-3/section/product/page-title","Page Title 3",tag-3,"3-3","guide-3",60
"https://website.com/4-4/section/product/page-title","Page Title 4",tag-4,"4-4","guide-4",7
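What I'd like to do here is share every iteration of the script I've tried and break down what worked and what didn't. There have been so many, though, that I'd struggle even to do that. My process has been to take parts of the answers to similar Stack Overflow questions (all of them in awk, which makes sense to me) and change the columns being compared. But because some of those answers compare only one column, my modifications produced inconsistent and odd results, and the scripts are complex enough that I have a hard time understanding why.
- Sum duplicate row values with awk
- How do I sum the values of duplicate rows using awk?
Could anyone explain how I might go about finding the answer, or point me to an example that leads in the right direction? If so, thank you.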
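This will work whether or not any of the quoted fields can contain a comma (for example, if any of the fields holding the "Page Title 1" placeholder text were actually something like "I, Robot - Page 1"):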
$ awk '
BEGIN { FS=OFS="," }
NR==1 { print; next }
{ num=$NF; sub(/,[^,]*$/,""); sum[$0]+=num }
END { for (key in sum) print key, sum[key] }
' file
url,title,tag,version,guide,views
"https://website.com/2-2/section/product/page-title","Page Title 2",tag-2,"2-2","guide-2",5
"https://website.com/4-4/section/product/page-title","Page Title 4",tag-4,"4-4","guide-4",7
"https://website.com/1-1/section/product/page-title","Page Title 1",tag-1,"1-1","guide-1",115
"https://website.com/3-3/section/product/page-title","Page Title 3",tag-3,"3-3","guide-3",60
One way, using GNU datamash:
$ echo "url,title,tag,version,guide,views" && datamash --header-in -st, -g1,2,3,4,5 sum 6 < input.txt
url,title,tag,version,guide,views
"https://website.com/1-1/section/product/page-title","Page Title 1",tag-1,"1-1","guide-1",115
"https://website.com/2-2/section/product/page-title","Page Title 2",tag-2,"2-2","guide-2",5
"https://website.com/3-3/section/product/page-title","Page Title 3",tag-3,"3-3","guide-3",60
"https://website.com/4-4/section/product/page-title","Page Title 4",tag-4,"4-4","guide-4",7
Or with awk:
$ awk -F, 'NR==1 { print; next }
{ groups[$1 "," $2 "," $3 "," $4 "," $5] += $6 }
END { PROCINFO["sorted_in"] = "@ind_str_asc" # Sorted output when using GNU awk
for (g in groups) print g "," groups[g]
}' input.txt
url,title,tag,version,guide,views
"https://website.com/1-1/section/product/page-title","Page Title 1",tag-1,"1-1","guide-1",115
"https://website.com/2-2/section/product/page-title","Page Title 2",tag-2,"2-2","guide-2",5
"https://website.com/3-3/section/product/page-title","Page Title 3",tag-3,"3-3","guide-3",60
"https://website.com/4-4/section/product/page-title","Page Title 4",tag-4,"4-4","guide-4",7
Another awk:
$ awk -F, -v OFS=, 'NR==1 {print; next}
{v=$NF; NF--; a[$0]+=v}
END {for(k in a) print k,a[k] | "sort"}' file
url,title,tag,version,guide,views
"https://website.com/1-1/section/product/page-title","Page Title 1",tag-1,"1-1","guide-1",115
"https://website.com/2-2/section/product/page-title","Page Title 2",tag-2,"2-2","guide-2",5
"https://website.com/3-3/section/product/page-title","Page Title 3",tag-3,"3-3","guide-3",60
"https://website.com/4-4/section/product/page-title","Page Title 4",tag-4,"4-4","guide-4",7
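Explanation
Print the header line; save the value (the last field), decrement the number of fields so that the rest of the record becomes the key ($0), and add the value to the accumulator under that key (values with identical keys are added together). At the end, print the keys and values, and sort them.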
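As a variation on the steps just described, the sketch below keeps rows in the order their keys first appear in the input, which is the order shown in the desired file, instead of sorting them. It also avoids decrementing NF, since as far as I know not every awk is guaranteed to rebuild $0 when NF changes. The function name sum_keep_order and the array names are assumptions made for this example.

```shell
# sum_keep_order is a hypothetical name; it expects a CSV file path as $1.
sum_keep_order() {
    awk '
        NR == 1 { print; next }                  # header goes out first
        {
            value = $0; sub(/.*,/, "", value)    # text after the last comma
            key = $0;   sub(/,[^,]*$/, "", key)  # text before the last comma
            if (!(key in total)) order[++n] = key   # remember first appearance
            total[key] += value
        }
        END {
            for (i = 1; i <= n; i++)
                print order[i] "," total[order[i]]
        }
    ' "$1"
}
```

Called as `sum_keep_order file`, it prints the header followed by one line per distinct key, in first-seen order. Because the key is everything before the last comma, quoted fields that contain commas are handled the same way as in the sub()-based approach.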