Is there an R function to make x rows equal to a specific row, and repeat that operation?



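Hello everyone!

As a beginner in R (I assume what I am asking is feasible in R), I would like to ask a question.

In a large Excel-type file, I have a column in which the value I care about appears only every 193 rows. I would like the preceding 192 rows to take the value found in the 193rd position, and so on until the end of the column.

Concretely, here is what I would like to obtain for this small example: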

Month    Fund_number    Cluster_ref_INPUT      Expected_output
1        1              1                      1
2        1              1                      1
3        1              3                      1
4        1              1                      1
1        3              2                      NA
2        3              NA                     NA
3        3              NA                     NA
4        3              NA                     NA
1        8              4                      5
2        8              5                      5
3        8              5                      5
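4        8              5                      5

The "Cluster_ref_INPUT" column is partitioned by the "Fund_number" column (one observation per fund per month, over 4 months). The value I care about in the INPUT column appears once every 4 observations (the month-4 value).

So, for each fund number, the "Expected_output" column holds the value found in the last row of the "Cluster_ref_INPUT" column for that fund (every 4 rows). I think we should partition by "Fund_number" and set all rows of each partition equal to its last row... something like that?

Do you know what code I should use to achieve this?

I hope this is clear enough. Please don't hesitate to ask if I need to clarify anything.

Many thanks in advance,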

Vanilla

Here is a one-line solution using data.table:

library(data.table)
exdata <- fread(text = "
Month    Fund_number    Cluster_ref_INPUT      Expected_output
1        1              1                      1
2        1              1                      1
3        1              3                      1
4        1              1                      1
1        2              2                      NA
2        2              NA                     NA
3        2              NA                     NA
4        2              NA                     NA
1        3              4                      5
2        3              5                      5
3        3              5                      5
4        3              5                      5")
# You can read your data directly as a data.table using fread(), or convert an existing data.frame using setDT(exdata)
exdata[, newvar := Cluster_ref_INPUT[.N], by = Fund_number]
exdata
#     Month Fund_number Cluster_ref_INPUT Expected_output newvar
#  1:     1           1                 1               1      1
#  2:     2           1                 1               1      1
#  3:     3           1                 3               1      1
#  4:     4           1                 1               1      1
#  5:     1           2                 2              NA     NA
#  6:     2           2                NA              NA     NA
#  7:     3           2                NA              NA     NA
#  8:     4           2                NA              NA     NA
#  9:     1           3                 4               5      5
# 10:     2           3                 5               5      5
# 11:     3           3                 5               5      5
# 12:     4           3                 5               5      5
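
The question also mentions a file where the value of interest simply sits at every 193rd row, possibly without a grouping column such as Fund_number. In that case one option is to build a block index from the row number first. The following is only a sketch under that assumption: the names `dt`, `value`, `block`, `filled` and `block_size` are made up for illustration, and the block size is set to 4 so that the toy data stays small (it would be 193 for the file described in the question).

```r
library(data.table)

# Hypothetical toy column: only every 4th value is of interest.
# block_size would be 193 for the file described in the question.
block_size <- 4L
dt <- data.table(value = c(1, 1, 3, 1, 2, NA, NA, NA, 4, 5, 5, 5))

# Rows 1-4 get block 0, rows 5-8 get block 1, and so on.
dt[, block := (seq_len(.N) - 1L) %/% block_size]

# Within each block, repeat the value found in the block's last row.
dt[, filled := value[.N], by = block]

dt$filled
```

This assumes the number of rows is an exact multiple of the block size; if it is not, the last (shorter) block is filled with the value from its own final row.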

A tidyverse solution would probably be much faster, but here is one using base R.

#Your data
df <- data.frame(Month = rep_len(c(1:4), 12),
                 Fund_number = rep(c(1:3), each = 4),
                 Cluster_ref_INPUT = c(1, 1, 3, 1, 2, NA, NA, NA, 4, 5, 5, 5),
                 stringsAsFactors = FALSE)
#Create an empty data frame in which the results will be stored
outdat <- data.frame(Month = c(), Fund_number = c(), Cluster_ref_INPUT = c(), expected_input = c(), stringsAsFactors = FALSE)
#Using a for loop
#Iterate through the list of unique Fund_number values
for(i in 1:length(unique(df$Fund_number))){
  #Subset data pertaining to each unique Fund_number
  curdat <- subset(df, df$Fund_number == unique(df$Fund_number)[i])
  #Take the value of Cluster_ref_INPUT from the last row
  #And set it as the value for expected_input column for all rows
  curdat$expected_input <- curdat$Cluster_ref_INPUT[nrow(curdat)]
  #Append this modified subset to the output container data frame
  outdat <- rbind(outdat, curdat)
  #Go to next iteration
}
#Remove non-essential looping variables
rm(curdat, i)
outdat
#    Month Fund_number Cluster_ref_INPUT expected_input
# 1      1           1                 1              1
# 2      2           1                 1              1
# 3      3           1                 3              1
# 4      4           1                 1              1
# 5      1           2                 2             NA
# 6      2           2                NA             NA
# 7      3           2                NA             NA
# 8      4           2                NA             NA
# 9      1           3                 4              5
# 10     2           3                 5              5
# 11     3           3                 5              5
# 12     4           3                 5              5
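
For what it's worth, base R can also do this without an explicit loop. The sketch below uses ave(), which applies a function within each group and writes the result back in the original row order. It is not part of the solution above and is not included in any timings; it only rebuilds the same small example data.

```r
# Same small example data as above
df <- data.frame(Month = rep_len(1:4, 12),
                 Fund_number = rep(1:3, each = 4),
                 Cluster_ref_INPUT = c(1, 1, 3, 1, 2, NA, NA, NA, 4, 5, 5, 5))

# For each Fund_number, take the last value of Cluster_ref_INPUT;
# ave() recycles that single value across all rows of the group.
df$expected_input <- ave(df$Cluster_ref_INPUT, df$Fund_number,
                         FUN = function(x) x[length(x)])

df$expected_input
```

Unlike the loop, this keeps the rows in their original order even when the rows of one fund are not contiguous.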

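Edit: additional solutions + benchmarks

Following the OP's comment on this answer, I have put together some faster solutions (dplyr, and the data.table solution from the other answer) and benchmarked them on a simulated dataset of 950,004 rows resembling the OP's example. Code and results are below; as long as the necessary libraries (microbenchmark, dplyr, data.table) and their dependencies are installed, you can copy, paste and run the whole code block as is. (If anyone knows an apply()-based solution, feel free to add it here.)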

rm(list = ls())
#Library for benchmarking
library(microbenchmark)
#Dplyr
library(dplyr)
#Data.table
library(data.table)
#Your data
df <- data.frame(Month = rep_len(c(1:12), 79167),
                 Fund_number = rep(c(1, 2, 5, 6, 8, 22), each = 158334),
                 Cluster_ref_INPUT = sample(letters, size = 950004, replace = TRUE),
                 stringsAsFactors = FALSE)
#Data in format for data.table
df_t <- data.table(Month = rep_len(c(1:12), 79167),
                   Fund_number = rep(c(1, 2, 5, 6, 8, 22), each = 158334),
                   Cluster_ref_INPUT = sample(letters, size = 950004, replace = TRUE),
                   stringsAsFactors = FALSE)
#----------------
#Base R solution

#Using a for loop
#Iterate through the list of unique Fund_number values
base_r_func <- function(df) {
  #Create an empty data frame in which the results will be stored
  outdat <- data.frame(Month = c(),
                       Fund_number = c(),
                       Cluster_ref_INPUT = c(),
                       expected_input = c(),
                       stringsAsFactors = FALSE)
  for(i in 1:length(unique(df$Fund_number))){
    #Subset data pertaining to each unique Fund_number
    curdat <- subset(df, df$Fund_number == unique(df$Fund_number)[i])
    #Take the value of Cluster_ref_INPUT from the last row
    #And set it as the value for expected_input column for all rows
    curdat$expected_input <- curdat$Cluster_ref_INPUT[nrow(curdat)]
    #Append this modified subset to the output container data frame
    outdat <- rbind(outdat, curdat)
    #Go to next iteration
  }
  #Remove non-essential looping variables
  rm(curdat, i)
  #This return is needed for the base_r_func function wrapper
  #this code is enclosed in (not necessary otherwise)
  return(outdat)
}

#----------------
#Tidyverse solution
dplyr_func <- function(df){
  df %>% #For actual use, replace this %>% with %<>% (from the magrittr package)
    #and it will write the output back to the input object
    #Group the data by Fund_number
    group_by(Fund_number) %>%
    #Create a new column populated w/ last value from Cluster_ref_INPUT
    mutate(expected_input = last(Cluster_ref_INPUT))
}
#----------------
#Data table solution
dt_func <- function(df_t){
  #For this function, we are using
  #df_t (created above)
  #Logic similar to dplyr solution
  df_t <- df_t[ , expected_output := Cluster_ref_INPUT[.N], by = Fund_number]
}
dt_func_conv <- function(df){
  #Converting data.frame to data.table format
  df_t <- data.table(df)
  #Logic similar to dplyr solution
  df_t <- df_t[ , expected_output := Cluster_ref_INPUT[.N], by = Fund_number]
}
#----------------
#Benchmarks
bm_vals <- microbenchmark(base_r_func(df),
                          dplyr_func(df),
                          dt_func(df_t),
                          dt_func_conv(df), times = 8)
bm_vals
# Unit: milliseconds
#              expr       min        lq      mean    median        uq       max neval
#   base_r_func(df) 618.58202 702.30019 721.90643 743.02018 754.87397 756.28077     8
#    dplyr_func(df) 119.18264 123.26038 128.04438 125.64418 133.37712 140.60905     8
#     dt_func(df_t)  38.06384  38.27545  40.94850  38.88269  43.58225  48.04335     8
#  dt_func_conv(df)  48.87009  51.13212  69.62772  54.36058  57.68829 181.78970     8
#----------------

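As can be seen, if speed matters, data.table is the way to go. Even after accounting for the overhead of converting a regular data.frame to a data.table, data.table is faster than both dplyr and base R (see the results for dt_func_conv()).

Edit: Following Carlos Eduardo Lagosta's comment, using setDT() to coerce the data.frame df to a data.table brings the overhead of that coercion close to zero. Code snippet and benchmark values below.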

#This version includes the time taken
#to coerce a data.frame to a data.table
dt_func_conv <- function(df){
  #Logic similar to dplyr solution
  #setDT() coerces data.frames to the data.table format
  setDT(df)[ , expected_output := Cluster_ref_INPUT[.N], by = Fund_number]
}
bm_vals
# Unit: milliseconds
#              expr       min        lq      mean    median        uq       max neval
#   base_r_func(df) 271.60196 344.47280 353.76204 348.53663 368.65696 435.16163     8
#    dplyr_func(df) 121.31239 122.67096 138.54481 128.78134 138.72509 206.69133     8
#     dt_func(df_t)  38.21601  38.57787  40.79427  39.53428  43.14732  45.61921     8
#  dt_func_conv(df)  41.11210  43.28519  46.72589  46.74063  50.16052  52.32235     8
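
To illustrate the difference between the two conversions on a small scale: as.data.table() (like data.table(df)) builds a new object and leaves the original data.frame alone, whereas setDT() converts the existing object in place. The snippet below is only a sketch on the 12-row example data, with `copy_dt` as a made-up name; it is not one of the benchmarked functions.

```r
library(data.table)

df <- data.frame(Fund_number = rep(1:3, each = 4),
                 Cluster_ref_INPUT = c(1, 1, 3, 1, 2, NA, NA, NA, 4, 5, 5, 5))

# Conversion by copy: df itself is still a plain data.frame afterwards.
copy_dt <- as.data.table(df)
is.data.table(df)

# Conversion by reference: df itself becomes a data.table, no copy is made.
setDT(df)
is.data.table(df)

# The grouped assignment can then be applied to df directly.
df[, expected_output := Cluster_ref_INPUT[.N], by = Fund_number]
df$expected_output
```

Because setDT() changes the object you pass in, use the copying form if you still need the original data.frame untouched.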

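For the OP specifically: whichever solution you prefer, the code you are looking for is in the body of the corresponding function. So, for example, if you want to use the dplyr solution, you would take this code and adapt it to your data object: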

df %>% #For actual use, replace this %>% with %<>% (from the magrittr package)
  #and it will write the output back to the input object
  #Group the data by Fund_number
  group_by(Fund_number) %>%
  #Create a new column populated w/ last value from Cluster_ref_INPUT
  mutate(expected_input = last(Cluster_ref_INPUT))
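
In case a self-contained version helps, here is the same pipeline run on the 12-row example from the question. This is just a sketch: `result` is a made-up name, and the trailing ungroup() is an addition of mine so that later operations on the result are not silently performed per fund.

```r
library(dplyr)

# The small example data from the question
df <- data.frame(Month = rep_len(1:4, 12),
                 Fund_number = rep(1:3, each = 4),
                 Cluster_ref_INPUT = c(1, 1, 3, 1, 2, NA, NA, NA, 4, 5, 5, 5))

result <- df %>%
  #Group the data by Fund_number
  group_by(Fund_number) %>%
  #Repeat the last Cluster_ref_INPUT value of each fund on all its rows
  mutate(expected_input = last(Cluster_ref_INPUT)) %>%
  #Drop the grouping again
  ungroup()

result$expected_input
```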
