How can I count the frequency of paired values regardless of column position/name?



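I want to count the frequency of paired values across two columns, ignoring which column each value sits in. In the example below, a plain aggregate or table call reports three distinct pairs, (-0.25, 0.9), (0.9, -0.25), and (-0.77, 2.9), but I only want two: (-0.25, 0.9) and (-0.77, 2.9). How should I change my approach so that pairs are counted regardless of column position/name?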

data <- data.frame(col1=c(-.25, 0.9, -.25, -.77, -.25),
                   col2=c(0.9, -.25, 0.9, 2.9, 0.9))

Update

Given the data data <- data.frame(col1 = c("a", "a", "c", "c", "a", "c"), col2 = c("c", "a", "a", "c", "c", "c")), we can try

aggregate(
  freq ~ .,
  transform(
    data,
    col1 = pmin(col1, col2),
    col2 = pmax(col1, col2),
    freq = 1
  ),
  sum
)

which gives

  col1 col2 freq
1    a    a    1
2    a    c    3
3    c    c    2
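The same pattern should carry over to the numeric data from the question. The following is a minimal sketch under that assumption; the variable name `res` is only illustrative and does not appear in the original answer.

```r
# Data from the question (numeric pairs).
data <- data.frame(col1 = c(-.25, 0.9, -.25, -.77, -.25),
                   col2 = c(0.9, -.25, 0.9, 2.9, 0.9))

# Put the smaller value of each row in col1 and the larger in col2,
# add a helper column of ones, then sum that column per (col1, col2) group.
# `res` is a hypothetical name used only for this sketch.
res <- aggregate(
  freq ~ .,
  transform(
    data,
    col1 = pmin(col1, col2),
    col2 = pmax(col1, col2),
    freq = 1
  ),
  sum
)
print(res)
```

Because transform() evaluates every expression against the original columns, col2 is still computed from the unmodified col1, so no temporary column is needed here.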

Try this (it keeps one row per unordered pair; it does not count them):

> data[!duplicated(cbind(do.call(pmax, data), do.call(pmin, data))), ]
   col1 col2
1 -0.25  0.9
4 -0.77  2.9
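The duplicated() approach above returns the distinct pairs but not how often each occurs. If the counts are needed, one possible sketch is to build an order-independent key from pmin() and pmax() and tabulate it. The names `lo`, `hi`, `key`, and `pair_counts` are illustrative assumptions, not part of the original answer.

```r
# Data from the question.
data <- data.frame(col1 = c(-.25, 0.9, -.25, -.77, -.25),
                   col2 = c(0.9, -.25, 0.9, 2.9, 0.9))

# Row-wise smaller and larger value, so (0.9, -0.25) and (-0.25, 0.9)
# map to the same (lo, hi) pair.
lo <- pmin(data$col1, data$col2)
hi <- pmax(data$col1, data$col2)

# Build one string key per row and count how often each key occurs.
key <- paste(lo, hi, sep = ", ")
pair_counts <- table(key)
print(pair_counts)
```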

Here is one solution. First, paste the two columns together:

paste(data$col1, data$col2)
[1] "-0.25 0.9" "0.9 -0.25" "-0.25 0.9" "-0.77 2.9" "-0.25 0.9"

Then split them into a list (str_split comes from the stringr package):

str_split(paste(data$col1, data$col2), " ")
[[1]]
[1] "-0.25" "0.9"  
[[2]]
[1] "0.9"   "-0.25"
[[3]]
[1] "-0.25" "0.9"  
[[4]]
[1] "-0.77" "2.9"  
[[5]]
[1] "-0.25" "0.9" 

Create a custom function that sorts the values and pastes them back together, then apply it to the list with sapply:

count_function = function(x) {
  x = sort(x)
  paste(x, collapse=", ")
}
sapply(str_split(paste(data$col1, data$col2), " "), count_function)
[1] "-0.25, 0.9" "-0.25, 0.9" "-0.25, 0.9" "-0.77, 2.9" "-0.25, 0.9"

Then tabulate this vector to get the frequency of each pair:

> table(sapply(str_split(paste(data$col1, data$col2), " "), count_function))
-0.25, 0.9 -0.77, 2.9 
         4          1 
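The paste/split round trip above sorts the values as strings. A possible shortcut, sketched here as an assumption rather than as the answerer's method, is to sort each row numerically with apply() and paste the result in one step, which also avoids the stringr dependency. The names `sorted_pairs` and `tab` are hypothetical.

```r
# Data from the question.
data <- data.frame(col1 = c(-.25, 0.9, -.25, -.77, -.25),
                   col2 = c(0.9, -.25, 0.9, 2.9, 0.9))

# apply() converts the data frame to a numeric matrix and passes each row
# to the function; sort() then orders the values numerically before they
# are pasted into a single key.
sorted_pairs <- apply(data, 1, function(r) paste(sort(r), collapse = ", "))

# Count how often each sorted pair occurs.
tab <- table(sorted_pairs)
print(tab)
```

Since each row is sorted as a whole, the same sketch would also work for more than two columns.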

Here is a tidyverse solution:

library(tidyverse)
data <- data.frame(col1=c(-.25, 0.9, -.25, -.77, -.25),
                   col2=c(0.9, -.25, 0.9, 2.9, 0.9))
data %>%
  # Create min, max (col1, col2 respectively); uses a temporary column to
  # not clobber the values in col2.
  mutate(
    col2.tmp = pmax(col1, col2),
    col1 = pmin(col1, col2),
    col2 = col2.tmp) %>%
  # Remove temporary column.
  select(-col2.tmp) %>%
  # Determine frequency of pair.
  count(col1, col2, name = "frequency") %>%
  arrange(desc(col1))
#    col1 col2 frequency
# 1 -0.25  0.9         4
# 2 -0.77  2.9         1
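Since dplyr's count() accepts computed grouping expressions, the temporary column can probably be avoided altogether. This is a hedged sketch assuming dplyr is installed; the column names `lo` and `hi` and the variable `pair_freq` are illustrative and not from the original answer.

```r
library(dplyr)

# Data from the question.
data <- data.frame(col1 = c(-.25, 0.9, -.25, -.77, -.25),
                   col2 = c(0.9, -.25, 0.9, 2.9, 0.9))

# Group directly on the row-wise minimum and maximum, so the order of the
# two columns does not matter, and count the rows in each group.
pair_freq <- data %>%
  count(lo = pmin(col1, col2), hi = pmax(col1, col2), name = "frequency")
print(pair_freq)
```

Both grouping expressions are evaluated against the original col1 and col2, because neither column is overwritten.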
