r语言 - 按列的阈值对数据帧进行子集化



>我有一个包含市政当局地方选举信息的数据框,我想在我的数据集中仅保留与市政当局投票最多的两名候选人有关的信息。

我的数据帧是使用如下信息创建的:

Candidates <- c('Liza S.', 'John R.', 'Peter F.', 'Ignacio R.', 'Hanna F.', 'Anthony E.', 'Andrew W.')`
Municipality <- ('A', 'A', 'A', 'B', 'B', 'B', 'B')
Votes <- c(100, 120, 140, 110, 335, 225, 400)
data <- data.frame(Candidates, Municipality, Votes)

看起来像这样:

Candidates     Municipality     Votes
Liza S.              A           100
John R.              A           120
Peter F.             A           140
Ignacio R.           B           110
Hanna F.             B           335
Anthony E.           B           225
Andrew W.            B           400

我希望我的数据帧如下所示:

Candidates     Municipality     Votes
John R.              A           120
Peter F              A           140
Hanna F.             B           335
Andrew W.            B           400

知道如何做到这一点吗?

您可以使用dplyr包来实现此目的:

library(dplyr)
data %>% 
group_by(Municipality) %>% 
top_n(n = 2, wt = Votes)

# A tibble: 4 x 3
# Groups:   Municipality [2]
Candidates Municipality Votes
<fct>      <fct>        <dbl>
1 John R.    A              120
2 Peter F.   A              140
3 Hanna F.   B              335
4 Andrew W.  B              400

您可以按市政当局特定(reversed(order()使用ave()进行子集。

dat[ave(dat$Votes, dat$Municipality, FUN=function(x) rev(order(x))) <= 2, ]
#   Candidates Municipality Votes
# 2    John R.            A   120
# 3   Peter F.            A   140
# 5   Hanna F.            B   335
# 7  Andrew W.            B   400

数据:

dat <- structure(list(Candidates = structure(c(6L, 5L, 7L, 4L, 3L, 2L, 
1L), .Label = c("Andrew W.", "Anthony E.", "Hanna F.", "Ignacio R.", 
"John R.", "Liza S.", "Peter F."), class = "factor"), Municipality = structure(c(1L, 
1L, 1L, 2L, 2L, 2L, 2L), .Label = c("A", "B"), class = "factor"), 
Votes = c(100, 120, 140, 110, 335, 225, 400)), class = "data.frame", row.names = c(NA, 
-7L))

这是使用 dplyr 包的一种方法;

library(dplyr)
# Creating data
Candidates <- c('Liza S.', 'John R.', 'Peter F.', 'Ignacio R.', 'Hanna F.', 'Anthony E.', 'Andrew W.')
Municipality <- c('A', 'A', 'A', 'B', 'B', 'B', 'B')
Votes <- c(100, 120, 140, 110, 335, 225, 400)
data <- data.frame(Candidates, Municipality, Votes)
data %>%
# Grouping by Municipality
group_by(Municipality) %>%
# Sorting the data desc by Votes
arrange(desc(Votes)) %>%
# Getting first and second highest values for each group
slice(1:2) %>%
ungroup() %>%
#   # A tibble: 4 x 3
#   # Groups:   Municipality [2]
#   Candidates Municipality Votes
# <fct>      <fct>        <dbl>
# Peter F.   A              140
# John R.    A              120
# Andrew W.  B              400
# Hanna F.   B              335

在基本 R 中:

do.call("rbind",
lapply(split.data.frame(data, Municipality), 
function(x) x[rev(order(x$Votes))[1:2],]))    
#    Candidates  Municipality Votes
# A.3   Peter F.            A   140
# A.2    John R.            A   120
# B.7  Andrew W.            B   400
# B.5   Hanna F.            B   335

最新更新