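I have a for loop that becomes very slow when I apply it to a dataset with 100k+ observations. The code uses the information in one column (df$country), which gives the country assigned to each ID (e.g., ID == 1, country == Japan), and sets the value in the column with the matching name (e.g., the column named "Japan") to 1 for that row.
Sample data (dput()):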
structure(list(id = c(1, 2, 3, 4, 5, 6), country = c("USA", "Japan", "Germany", "Japan", "Japan", "Germany"), USA = c(0, 0, 0, 0, 0, 0), Japan = c(0, 0, 0, 0, 0, 0), Germany = c(0, 0, 0, 0, 0, 0)), row.names = c(NA, 6L), class = "data.frame")
The code is as follows:
# Assign vector of column names of my dataframe,
# all named after countries (i.e. "Japan").
cols <- names(df[3:5])
# For each ID, for each column name,
# if ID == j and country == column name,
# change the entry in this row under that column name to 1.
for (j in df$id) {
  for (c in cols) {
    df[df$id == j & df$country == c, c] <- 1
  }
}
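The code is far too slow. It has been running for 20 minutes on 100k observations and still has not finished. Is there any way to speed this up? Thank you!
You can loop over the columns instead of the rows: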
for (col in cols) df[[col]] = +(df$country == col)
# id country USA Japan Germany
# 1 1 USA 1 0 0
# 2 2 Japan 0 1 0
# 3 3 Germany 0 0 1
# 4 4 Japan 0 1 0
# 5 5 Japan 0 1 0
# 6 6 Germany 0 0 1
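The unary `+` in front of the comparison coerces the logical vector to 0/1 integers, so each column is filled in a single vectorized pass rather than once per ID. A minimal sketch of the same idea, assuming the indicator columns do not exist yet and are derived from the distinct values of `country` (that derivation is an assumption, not part of the answer above):

```r
# Sample data from the question, without the pre-created indicator columns.
df <- data.frame(
  id = c(1, 2, 3, 4, 5, 6),
  country = c("USA", "Japan", "Germany", "Japan", "Japan", "Germany")
)

# Assumption: one indicator column per distinct country found in the data.
cols <- unique(df$country)

# One vectorized comparison per column; as.integer() does the same
# logical-to-0/1 conversion as the unary `+` used above.
for (col in cols) {
  df[[col]] <- as.integer(df$country == col)
}

df
```

The loop now runs once per country rather than once per ID and country pair, which is likely where most of the speedup comes from.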
R also has a function (model.matrix) that does this:
df[levels(factor(df$country))] = model.matrix(~country - 1, df)
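The left-hand side uses `levels(factor(df$country))` because `model.matrix` orders its columns by factor level (alphabetical by default), not by the order of the columns in `df`. A small sketch, assuming the indicator columns are not already present, that shows the column names `model.matrix` produces and attaches them with `cbind` (the `sub()` renaming step is an illustrative choice, not part of the answer above):

```r
# Sample data from the question, without the pre-created indicator columns.
df <- data.frame(
  id = c(1, 2, 3, 4, 5, 6),
  country = c("USA", "Japan", "Germany", "Japan", "Japan", "Germany")
)

# `- 1` drops the intercept so every country gets its own 0/1 column.
mm <- model.matrix(~ country - 1, df)

# Columns are named "country<level>" and follow the factor level order.
colnames(mm)

# Strip the "country" prefix so the columns are named after the countries.
colnames(mm) <- sub("^country", "", colnames(mm))

# Attach the indicator columns to the original data frame.
df <- cbind(df, mm)
df
```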
You can use pivot_wider to do it all in one step:
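library(tidyverse)
df |>
  # helper column that supplies the 1s for the indicator columns
  mutate(value = 1) |>
  # name every argument; recent tidyr versions deprecate a positional id_cols
  pivot_wider(id_cols = id,
              names_from = country,
              values_from = value,
              values_fill = 0) |>
  select(-id)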
Output:
# A tibble: 6 × 3
USA Japan Germany
<dbl> <dbl> <dbl>
1 1 0 0
2 0 1 0
3 0 0 1
4 0 1 0
5 0 1 0
6 0 0 1
Data:
df <- as.data.frame(structure(list(id = c(1, 2, 3, 4, 5, 6), country = c("USA", "Japan", "Germany", "Japan", "Japan", "Germany"))))
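The pipeline above returns only the indicator columns. If `id` and `country` should stay in the result, one option is to pivot on a copy of `country`, so the original column is kept as an identifier. This is a hedged sketch: the helper column `name` is an illustrative choice, and it relies on each row being uniquely identified by `id`.

```r
library(dplyr)
library(tidyr)

df <- data.frame(
  id = c(1, 2, 3, 4, 5, 6),
  country = c("USA", "Japan", "Germany", "Japan", "Japan", "Germany")
)

wide <- df |>
  # `name` is a copy of `country` that gets spread into column names;
  # `value` supplies the 1s.
  mutate(name = country, value = 1) |>
  # With id_cols left at its default, every remaining column
  # (here `id` and `country`) is kept as an identifier.
  pivot_wider(names_from = name, values_from = value, values_fill = 0)

wide
```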