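Hello everyone!
As a beginner with R (I believe what I need is feasible in this software), I have a question.
In a large Excel-type file, I have a column in which the values I am interested in appear only on every 193rd row. So I would like the preceding 192 rows to be equal to the value in the 193rd position, and so on until the end of the column.
Specifically, here is what I would like to obtain for this small example: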
Month Fund_number Cluster_ref_INPUT Expected_output
1 1 1 1
2 1 1 1
3 1 3 1
4 1 1 1
1 3 2 NA
2 3 NA NA
3 3 NA NA
4 3 NA NA
1 8 4 5
2 8 5 5
3 8 5 5
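4 8 5 5
The "Cluster_ref_INPUT" column is partitioned by the "Fund_number" column (one observation per fund per month, over 4 months). The value I am interested in within the INPUT column appears once every 4 observations (the value for month 4).
So, for each fund number, the "Expected_output" column holds the value found in the last row of the "Cluster_ref_INPUT" column for that fund (every 4 rows). I think we should partition by "Fund_number" and set all rows equal to the last row of each partition... something like that?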
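Do you know what code I should use to achieve this?
I hope this is clear enough. If anything needs clarifying, please do not hesitate to ask.
Many thanks in advance,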
Vanilla
Here is a one-line solution using data.table:
library(data.table)
exdata <- fread(text = "
Month Fund_number Cluster_ref_INPUT Expected_output
1 1 1 1
2 1 1 1
3 1 3 1
4 1 1 1
1 2 2 NA
2 2 NA NA
3 2 NA NA
4 2 NA NA
1 3 4 5
2 3 5 5
3 3 5 5
4 3 5 5")
# You can read your data directly as a data.table using fread(), or convert an existing data.frame using setDT(exdata)
exdata[, newvar := Cluster_ref_INPUT[.N], by = Fund_number]
> exdata
Month Fund_number Cluster_ref_INPUT Expected_output newvar
1: 1 1 1 1 1
2: 2 1 1 1 1
3: 3 1 3 1 1
4: 4 1 1 1 1
5: 1 2 2 NA NA
6: 2 2 NA NA NA
7: 3 2 NA NA NA
8: 4 2 NA NA NA
9: 1 3 4 5 5
10: 2 3 5 5 5
11: 3 3 5 5 5
12: 4 3 5 5 5
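If the real file has no grouping column like Fund_number and the value of interest simply sits on every 193rd row, the same idea can be sketched in base R by building a block index from the row position. This is only a minimal sketch: `block_size`, `x`, `block_id`, `last_pos` and `filled` are illustrative names that do not appear in the question, and the block size is set to 4 here to match the small example (it would be 193 for the file described).

```r
# Illustrative column: the value of interest sits on every 4th row
# (block_size would be 193 for the file described in the question)
block_size <- 4
x <- c(1, 1, 3, 1, 2, NA, NA, NA, 4, 5, 5, 5)

# Block index: rows 1-4 -> 1, rows 5-8 -> 2, and so on
block_id <- ceiling(seq_along(x) / block_size)

# Position of the last row of each block, capped at the column length
# so that a shorter final block still picks its own last row
last_pos <- pmin(block_id * block_size, length(x))

# Every row takes the value found at the end of its block
filled <- x[last_pos]
filled
```

With a grouping column available, as in the example, grouping by Fund_number as shown above is the safer choice, because it does not depend on every fund having exactly the same number of rows.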
A tidyverse solution would probably be much faster, but here is a solution using base R.
#Your data
df <- data.frame(Month = rep_len(c(1:4), 12),
Fund_number = rep(c(1:3), each = 4),
Cluster_ref_INPUT = c(1, 1, 3, 1, 2, NA, NA, NA, 4, 5, 5, 5),
stringsAsFactors = FALSE)
#Create an empty data frame in which the results will be stored
outdat <- data.frame(Month = c(), Fund_number = c(), Cluster_ref_INPUT = c(), expected_input = c(), stringsAsFactors = FALSE)
#Using a for loop
#Iterate through the list of unique Fund_number values
for(i in 1:length(unique(df$Fund_number))){
  #Subset data pertaining to each unique Fund_number
  curdat <- subset(df, df$Fund_number == unique(df$Fund_number)[i])
  #Take the value of Cluster_ref_Input from the last row
  #And set it as the value for expected_input column for all rows
  curdat$expected_input <- curdat$Cluster_ref_INPUT[nrow(curdat)]
  #Append this modified subset to the output container data frame
  outdat <- rbind(outdat, curdat)
  #Go to next iteration
}
#Remove non-essential looping variables
rm(curdat, i)
outdat
# Month Fund_number Cluster_ref_INPUT expected_input
# 1 1 1 1 1
# 2 2 1 1 1
# 3 3 1 3 1
# 4 4 1 1 1
# 5 1 2 2 NA
# 6 2 2 NA NA
# 7 3 2 NA NA
# 8 4 2 NA NA
# 9 1 3 4 5
# 10 2 3 5 5
# 11 3 3 5 5
# 12 4 3 5 5
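For what it is worth, the same grouping logic can probably be written in base R without an explicit loop by using ave(), which applies a function within each group and spreads the result back over the rows of that group. This is a sketch on the same small example, not something I have benchmarked; the anonymous function and the column name expected_input are just the choices made here.

```r
#Same small example as above
df <- data.frame(Month = rep_len(1:4, 12),
                 Fund_number = rep(1:3, each = 4),
                 Cluster_ref_INPUT = c(1, 1, 3, 1, 2, NA, NA, NA, 4, 5, 5, 5),
                 stringsAsFactors = FALSE)

#ave() applies FUN to Cluster_ref_INPUT within each Fund_number group;
#the single value returned (the last element) is recycled over the group
df$expected_input <- ave(df$Cluster_ref_INPUT, df$Fund_number,
                         FUN = function(v) v[length(v)])
df
```

Unlike the loop, this keeps the rows in their original order even when the rows of one fund are not contiguous.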
Edit: additional solutions + benchmarks
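Following OP's comment on this answer, I put together some faster solutions (dplyr, and the data.table solution from the other answer) and benchmarked them on a simulated 950,004-row dataset similar to OP's example. Code and results follow; as long as the necessary libraries (microbenchmark, dplyr, data.table) and their dependencies are installed, the whole code block can be copy-pasted and run as is. (If anyone knows an apply()-based solution, feel free to add it here.)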
rm(list = ls())
#Library for benchmarking
library(microbenchmark)
#Dplyr
library(dplyr)
#Data.table
library(data.table)
#Your data
df <- data.frame(Month = rep_len(c(1:12), 79167),
Fund_number = rep(c(1, 2, 5, 6, 8, 22), each = 158334),
Cluster_ref_INPUT = sample(letters, size = 950004, replace = TRUE),
stringsAsFactors = FALSE)
#Data in format for data.table
df_t <- data.table(Month = rep_len(c(1:12), 79167),
Fund_number = rep(c(1, 2, 5, 6, 8, 22), each = 158334),
Cluster_ref_INPUT = sample(letters, size = 950004, replace = TRUE),
stringsAsFactors = FALSE)
#----------------
#Base R solution
#Using a for loop
#Iterate through the list of unique Fund_number values
base_r_func <- function(df) {
  #Create an empty data frame in which the results will be stored
  outdat <- data.frame(Month = c(),
                       Fund_number = c(),
                       Cluster_ref_INPUT = c(),
                       expected_input = c(),
                       stringsAsFactors = FALSE)
  for(i in 1:length(unique(df$Fund_number))){
    #Subset data pertaining to each unique Fund_number
    curdat <- subset(df, df$Fund_number == unique(df$Fund_number)[i])
    #Take the value of Cluster_ref_Input from the last row
    #And set it as the value for expected_input column for all rows
    curdat$expected_input <- curdat$Cluster_ref_INPUT[nrow(curdat)]
    #Append this modified subset to the output container data frame
    outdat <- rbind(outdat, curdat)
    #Go to next iteration
  }
  #Remove non-essential looping variables
  rm(curdat, i)
  #This return is needed for the base_r_func function wrapper
  #this code is enclosed in (not necessary otherwise)
  return(outdat)
}
#----------------
#Tidyverse solution
dplyr_func <- function(df){
  df %>% #For actual use, replace this %>% with %<>% (needs library(magrittr))
    #and it will write the output back to the input object
    #Group the data by Fund_number
    group_by(Fund_number) %>%
    #Create a new column populated w/ last value from Cluster_ref_INPUT
    mutate(expected_input = last(Cluster_ref_INPUT))
}
#----------------
#Data table solution
dt_func <- function(df_t){
  #For this function, we are using
  #df_t (created above)
  #Logic similar to dplyr solution
  df_t <- df_t[ , expected_output := Cluster_ref_INPUT[.N], by = Fund_number]
}
dt_func_conv <- function(df){
  #Converting data.frame to data.table format
  df_t <- data.table(df)
  #Logic similar to dplyr solution
  df_t <- df_t[ , expected_output := Cluster_ref_INPUT[.N], by = Fund_number]
}
#----------------
#Benchmarks
bm_vals <- microbenchmark(base_r_func(df),
dplyr_func(df),
dt_func(df_t),
dt_func_conv(df), times = 8)
bm_vals
# Unit: milliseconds
# expr min lq mean median uq max neval
# base_r_func(df) 618.58202 702.30019 721.90643 743.02018 754.87397 756.28077 8
# dplyr_func(df) 119.18264 123.26038 128.04438 125.64418 133.37712 140.60905 8
# dt_func(df_t) 38.06384 38.27545 40.94850 38.88269 43.58225 48.04335 8
# dt_func_conv(df) 48.87009 51.13212 69.62772 54.36058 57.68829 181.78970 8
#----------------
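As can be seen, if speed matters, data.table is the way to go. Even after accounting for the overhead of converting a regular data.frame to a data.table, data.table is faster than dplyr and base R (see the results for dt_func_conv()).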
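On the question of an apply-family version: one possible sketch, which I have not benchmarked, is to compute one value per fund with vapply() over split() and then look that value up for every row by fund name. The names last_by_fund and expected_input are only illustrative, and the tiny data frame below stands in for the simulated one.

```r
#Small stand-in for the simulated data (character values, as in the benchmark)
df <- data.frame(Fund_number = rep(c(1, 2, 5), each = 4),
                 Cluster_ref_INPUT = c("a", "b", "c", "d",
                                       "e", "f", "g", "h",
                                       "i", "j", "k", "l"),
                 stringsAsFactors = FALSE)

#One value per fund: the last element of each group,
#returned as a character vector named by Fund_number
last_by_fund <- vapply(split(df$Cluster_ref_INPUT, df$Fund_number),
                       function(v) v[length(v)],
                       character(1))

#Look the per-fund value up by name and spread it over all rows
df$expected_input <- unname(last_by_fund[as.character(df$Fund_number)])
df
```

The character(1) template in vapply() assumes the column is character, as in the simulated data; for a numeric column it would need to be numeric(1).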
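Edit: following Carlos Eduardo Lagosta's comment, using setDT() to coerce df from a data.frame to a data.table brings the overhead of that coercion close to zero. Code snippet and benchmark values below.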
#This version includes the time taken
#to coerce a data.frame to a data.table
dt_func_conv <- function(df){
  #Logic similar to dplyr solution
  #setDT() coerces data.frames to the data.table format
  setDT(df)[ , expected_output := Cluster_ref_INPUT[.N], by = Fund_number]
}
bm_vals
# Unit: milliseconds
# expr min lq mean median uq max neval
# base_r_func(df) 271.60196 344.47280 353.76204 348.53663 368.65696 435.16163 8
# dplyr_func(df) 121.31239 122.67096 138.54481 128.78134 138.72509 206.69133 8
# dt_func(df_t) 38.21601 38.57787 40.79427 39.53428 43.14732 45.61921 8
# dt_func_conv(df) 41.11210 43.28519 46.72589 46.74063 50.16052 52.32235 8
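The reason the coercion becomes so cheap is, as far as I understand it, that setDT() converts the object in place instead of building a converted copy the way data.table(df) or as.data.table(df) do. A minimal sketch of the difference, using a made-up six-row data frame (dt_copy is just an illustrative name):

```r
library(data.table)

#Made-up example data
df <- data.frame(Fund_number = rep(c(1, 2), each = 3),
                 Cluster_ref_INPUT = c("a", "b", "c", "d", "e", "f"),
                 stringsAsFactors = FALSE)

#as.data.table() returns a converted copy; df itself is left untouched
dt_copy <- as.data.table(df)
is.data.table(df)   # FALSE

#setDT() converts df itself, by reference
setDT(df)
is.data.table(df)   # TRUE

#:= then adds the new column by reference as well
df[, expected_output := Cluster_ref_INPUT[.N], by = Fund_number]
df
```

One consequence to keep in mind: after setDT(df), df is a data.table everywhere it is used, so any later code that relies on plain data.frame behaviour may need setDF(df) to convert it back.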
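For OP specifically: whichever solution you choose, the code you are looking for is in the body of the corresponding function. So, for example, if you want to use the dplyr solution, you would take this code and adapt it to your data object: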
df %>% #For actual use, replace this %>% with %<>% (needs library(magrittr))
  #and it will write the output back to the input object
  #Group the data by Fund_number
  group_by(Fund_number) %>%
  #Create a new column populated w/ last value from Cluster_ref_INPUT
  mutate(expected_input = last(Cluster_ref_INPUT))
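If you would rather not load magrittr for %<>%, a plain assignment should do the same job. The sketch below runs on a small made-up data frame shaped like the example in the question; the added ungroup() is my own suggestion, so that later operations on the result are not silently performed per fund.

```r
library(dplyr)

#Made-up data shaped like the example in the question
df <- data.frame(Month = rep_len(1:4, 8),
                 Fund_number = rep(c(1, 3), each = 4),
                 Cluster_ref_INPUT = c(1, 1, 3, 1, 2, NA, NA, NA),
                 stringsAsFactors = FALSE)

#Assign the result back to df with a regular <-
df <- df %>%
  group_by(Fund_number) %>%
  #last() keeps a trailing NA, which matches the expected output for fund 3
  mutate(expected_input = last(Cluster_ref_INPUT)) %>%
  #Drop the grouping so that later steps work on the whole table
  ungroup()
df
```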