我正在使用该篮球游戏数据,其中包含约50,000行的数据帧游戏。我正在尝试比较每个游戏中每个团队(A和B)的统计数据。
我还有另一个名为TeamStats的数据框架,每个赛季都有大约3000行。
到目前为止,我已经组装了一个代码:
for (i in 1:nrow(games)) {
if (length(which(((teamStats$Year == games$Season[i])==1) & (teamStats$teamID == games$teamA[i]))) == 1) {
selectTeamA <- teamStats[which(((teamStats$Year == games$Season[i])==1) & (teamStats$teamID == games$teamA[i])),4:45]
} else {
selectTeamA <- as.numeric(rep(NA, ncol(differences)))
}
if (length(which(((teamStats$Year == games$Season[i])==1) & (teamStats$teamID == games$teamB[i]))) == 1) {
selectTeamB <- teamStats[which(((teamStats$Year == games$Season[i])==1) & (teamStats$teamID == games$teamB[i])),4:45]
} else {
selectTeamB <- as.numeric(rep(NA, ncol(differences)))
}
differences[i,] <- selectTeamA - selectTeamB
}
基本上,此代码在征服了正确的赛季后,为每个团队A和B搜索了正确的团队ID。由于TeamStats中没有每个赛季的每个球队,因此我现在已经用NA填补了失踪的行。"差异"数据框是一个空的数据框,它将填补我的for循环中A和B统计数据的差异。
让您了解数据:
游戏 - 前6行
Season teamA teamB winner scoreA scoreB
108123 2010 1143 1293 A 75 70
108124 2010 1198 1314 B 72 88
108125 2010 1108 1326 B 60 100
108126 2010 1107 1393 B 43 75
108127 2010 1143 1178 A 95 61
teamStats-前6行,仅用于空间的前6列 - 许多列在完整数据帧中的列表。该代码为TeamID找到正确的行,然后取得统计列,例如G w l etc
School Year teamID G W L
1 abilene christian 2018 1101 32 16 16
2 air force 2018 1102 31 12 19
3 akron 2018 1103 32 14 18
4 alabama a&m 2018 1105 31 3 28
5 alabama-birmingham 2018 1412 33 20 13
,要关闭这篇很长的帖子,我的问题。我的循环代码有效,并填充差异数据框。问题是运行此代码需要20-30分钟。我并不是很有经验来处理这么多数据。我不知道有一种技术吗?如何以更有效的方式重写此代码?
这是一种使用tidyverse
软件包的方法,我期望它应该比OP中的循环解决方案快得多。速度(我希望)来自更多地依靠数据库加入操作(例如基本merge
或DPLYR的left_join
)来连接两个表。
library(tidyverse)
# First, use the first few columns from the `games` table, and convert to long format with
# a row for each team, and a label column `team_cat` telling us if it's a teamA or teamB.
stat_differences <- games %>%
select(row, Season, teamA, teamB) %>%
gather(team_cat, teamID, teamA:teamB) %>%
# Join to the teamStats table to bring in the team's total stats for that year
left_join(teamStats %>% select(-row), # We don't care about this "row"
by = c("teamID", "Season" = "Year")) %>%
# Now I want to reverse the stats' sign if it's a teamB. To make this simpler, I gather
# all the stats into long format so that we can do the reversal on all of them, and
# then spread back out.
gather(stat, value, G:L) %>%
mutate(value = if_else(team_cat == "teamB", value * -1, value * 1)) %>%
spread(stat, value) %>%
# Get the difference in stats for each row in the original games table.
group_by(row) %>%
summarise_at(vars(G:W), sum)
# Finally, add the output to the original table
output <- games %>%
left_join(stat_differences)
为了测试此问题,我更改了给定的示例数据,以使两个表将彼此相关:
games <- read.table(header = T, stringsAsFactors = F,
text = "row Season teamA teamB winner scoreA scoreB
108123 2010 1143 1293 A 75 70
108124 2010 1198 1314 B 72 88
108125 2010 1108 1326 B 60 100")
teamStats <- read.table(header = T, stringsAsFactors = F,
text = "row School Year teamID G W L
1 abilene_christian 2010 1143 32 16 16
2 air_force 2010 1293 31 12 19
3 akron 2010 1314 32 14 18
4 alabama_a&m 2010 1198 31 3 28
5 alabama-birmingham 2010 1108 33 20 13
6 made_up_team 2018 1326 160 150 10 # To confirm getting right season
7 made_up_team 2010 1326 60 50 10"
)
然后我得到以下输出,这似乎很有意义。(我刚刚意识到我应用的聚集/突变/传播更改了列的顺序;如果我有时间,我可能会尝试使用mutate_if保留顺序。)
> output
row Season teamA teamB winner scoreA scoreB G L W
1 108123 2010 1143 1293 A 75 70 1 -3 4
2 108124 2010 1198 1314 B 72 88 -1 10 -11
3 108125 2010 1108 1326 B 60 100 -27 3 -30
一种方法是合并 games
和 teamStats
,作为跨行迭代的替代方法。
一些代码复制您的设置,以创建一个最小的工作示例:
library(dplyr)
library(purrr)
set.seed(123)
n_games <- 50000
n_teams <- 400
n_years <- 10
games <- data.frame(Season = rep(2005:(2005 + n_years - 1),
each = n_games / n_years)) %>%
mutate(teamA = sample(1000:(1000 + n_teams - 1), n_games, r = TRUE),
teamB = map_int(teamA, ~sample(setdiff(1000:(1000 + n_teams - 1), .), 1)),
scoreA = as.integer(rnorm(n_games, 80, 20)),
scoreB = as.integer(rnorm(n_games, 80, 20)),
scoreB = ifelse(scoreA == scoreB, scoreA + sample(c(-1, 1), n_games, r = TRUE), scoreB),
winner = ifelse(scoreA > scoreB, "A", "B"))
gen_random_string <- function(...) {
paste(sample(c(letters, " "), rpois(1, 10), r = TRUE), collapse = "")
}
schools_ids <- data.frame(teamID = 1000:(1000 + n_teams - 1)) %>%
mutate(School = map_chr(teamID, gen_random_string))
teamStats <- data.frame(Year = rep(2005:(2005 + n_years - 1), each = 300)) %>%
mutate(teamID = as.vector(replicate(n_years, sample(schools_ids$teamID, 300))),
G = 32, W = rpois(length(teamID), 16), L = G - W) %>%
left_join(schools_ids)
我们的games
具有50k行和3K行的TeamStats。现在,我们通过Year
和teamID
将teamStats
倒入tibble:
teamStats <- teamStats %>%
group_by(Year, teamID) %>%
nest()
# # A tibble: 3,000 x 3
# Year teamID data
# <int> <int> <list>
# 1 2005 1321 <tibble [1 x 4]>
# 2 2005 1192 <tibble [1 x 4]>
# 3 2005 1074 <tibble [1 x 4]>
# <snip>
做出较小的便利功能来计算差异:
calculate_diff <- function(x, y) {
if (is.null(x) | is.null(y)) {
data.frame(G = NA, W = NA, L = NA)
} else {
x[, 1:3] - y[, 1:3]
}
}
现在,我们(1)与teamStats
加入(或合并)games
,(2)使用加入的数据集计算差异,(3)unnest
(或Un-collapse)dataFrame。
start <- Sys.time()
differences <- games %>%
left_join(teamStats, c("Season" = "Year", "teamA" = "teamID")) %>%
rename(teamA_stats = data) %>%
left_join(teamStats, c("Season" = "Year", "teamB" = "teamID")) %>%
rename(teamB_stats = data) %>%
mutate(diff = map2(teamA_stats, teamB_stats, calculate_diff)) %>%
select(Season, teamA, teamB, diff) %>%
unnest()
difftime(Sys.time(), start)
# Time difference of 11.27832 secs
结果
head(differences)
# Season teamA teamB G W L
# 1 2005 1115 1085 NA NA NA
# 2 2005 1315 1177 NA NA NA
# 3 2005 1163 1051 0 -9 9
# 4 2005 1353 1190 0 -4 4
# 5 2005 1376 1286 NA NA NA
# 6 2005 1018 1362 0 -1 1