R - Assigning column values in a for loop: too slow



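I have a for loop that runs quite slowly when I apply it to a dataset with 100k+ observations. The code takes the information in one column (df$country), which gives the country assigned to a particular ID (e.g., ID == 1, country == Japan), and sets the value in the column of the same name (e.g., the column named "Japan") to 1 for that row.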

Sample data (dput()):

df <- structure(list(id = c(1, 2, 3, 4, 5, 6), country = c("USA", "Japan", "Germany", "Japan", "Japan", "Germany"), USA = c(0, 0, 0, 0, 0, 0), Japan = c(0, 0, 0, 0, 0, 0), Germany = c(0, 0, 0, 0, 0, 0)), row.names = c(NA, 6L), class = "data.frame")

The code is as follows:

# Assign vector of column names of my dataframe,
# all named after countries (i.e. "Japan").
cols <- names(df[3:5])
# For each ID, for each column name,
# if ID == j and country == column name,
# change the entry in this row under that column name to 1.
for (j in df$id) {
  for (c in cols) {
    df[df$id == j & df$country == c, c] <- 1
  }
}

The code is far too slow. It has been running for 20 minutes on 100k observations and still has not finished. Is there any way to speed this up? Thank you!

You can loop over the columns instead of the rows:

for (col in cols) df[[col]] = +(df$country == col)
#   id country USA Japan Germany
# 1  1     USA   1     0       0
# 2  2   Japan   0     1       0
# 3  3 Germany   0     0       1
# 4  4   Japan   0     1       0
# 5  5   Japan   0     1       0
# 6  6 Germany   0     0       1
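
To see why this helps: the nested loop in the question evaluates `df$id == j & df$country == c` over every row for each (ID, column) pair, so roughly n × k full-length comparisons, while the loop over columns needs only k. The following is a rough sketch on synthetic data; the size `n`, the seed, and the helper names `slow_fill` and `fast_fill` are illustrative assumptions, not part of the question.

```r
# Synthetic data shaped like the question's example; n is kept small
# so that the slow version finishes quickly.
set.seed(1)
n <- 2000
cols <- c("USA", "Japan", "Germany")
df <- data.frame(id = seq_len(n),
                 country = sample(cols, n, replace = TRUE),
                 USA = 0, Japan = 0, Germany = 0,
                 stringsAsFactors = FALSE)

# The question's approach: one full-length logical subset per (id, column) pair.
slow_fill <- function(d) {
  for (j in d$id) {
    for (col in cols) {
      d[d$id == j & d$country == col, col] <- 1
    }
  }
  d
}

# The column-wise approach: one vectorised comparison per column.
fast_fill <- function(d) {
  for (col in cols) d[[col]] <- +(d$country == col)
  d
}

slow <- slow_fill(df)
fast <- fast_fill(df)

# Both versions should produce the same indicator columns.
same <- isTRUE(all.equal(slow[cols], fast[cols]))

# Timings depend on the machine, so none are quoted here.
system.time(slow_fill(df))
system.time(fast_fill(df))
```

The cost of the nested version grows with the number of rows squared, which is consistent with it not finishing on 100k observations.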

R also has a function (model.matrix) that does this:

df[levels(factor(df$country))] = model.matrix(~country - 1, df)
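
It may help to look at what `model.matrix` returns on its own. The sketch below uses the question's sample data; the variable names `mm` and `lv` are only illustrative. It shows that the columns come back in sorted factor-level order with a `country` prefix, which is why the assignment above targets `levels(factor(df$country))` rather than `cols`.

```r
# The question's sample data, with the indicator columns initialised to 0.
df <- data.frame(id = c(1, 2, 3, 4, 5, 6),
                 country = c("USA", "Japan", "Germany", "Japan", "Japan", "Germany"),
                 USA = 0, Japan = 0, Germany = 0,
                 stringsAsFactors = FALSE)

# One indicator column per country; "- 1" drops the intercept so that
# every level gets its own column.
mm <- model.matrix(~ country - 1, df)
colnames(mm)
# "countryGermany" "countryJapan" "countryUSA"

# The columns follow the sorted factor levels, not the order of appearance.
lv <- levels(factor(df$country))

# Assigning by level name puts each indicator under the matching country.
df[lv] <- mm
```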

You can use pivot_wider to do it all in one go:

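library(tidyverse)
df |>
  # Add a column of 1s to spread across the new country columns.
  mutate(value = 1) |>
  # Name id_cols explicitly; recent tidyr versions deprecate passing it by position.
  pivot_wider(id_cols = id,
              names_from = "country",
              values_from = "value",
              values_fill = 0) |>
  select(-id)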

Output:

# A tibble: 6 × 3
    USA Japan Germany
  <dbl> <dbl>   <dbl>
1     1     0       0
2     0     1       0
3     0     0       1
4     0     1       0
5     0     1       0
6     0     0       1

Data:

df <- as.data.frame(structure(list(id = c(1, 2, 3, 4, 5, 6), country = c("USA", "Japan", "Germany", "Japan", "Japan", "Germany"))))
