R: How to efficiently store in lists and retrieve the parameters/arguments used during data processing



I work with a large count table and, for my analyses, usually need to split this table into subsets based on observations, variables, values, or contextual information.

# generating toy data
count_df1 <- data.frame(
  column1 = c(1:50),
  column2 = runif(50, 1, 10),
  column3 = runif(50, 1, 10)
)
count_df2 <- data.frame(
  column1 = c(1:50),
  column2 = runif(50, 1.5, 9),
  column3 = runif(50, 1.5, 9)
)
list_count_df <- list(count_df1 = count_df1, count_df2 = count_df2)

I have learned to process all of the resulting subsets in the same way using lists and for loops. I prefer for loops over apply because I use the names of the objects (via a counter) to keep track of how I have modified them, and I don't know how to do that with, e.g., lapply.

# set values to iterate over
thresholds <- c(2, 4)
conditions <- c(TRUE, FALSE)
# perform some kind of subsetting and store the parameters used
output_list <- list()
counter <- 0
for (current_threshold in thresholds) {
  for (count_df in list_count_df) {
    counter <- counter + 1
    # modify the name to keep track of changes
    current_name <- paste(names(list_count_df)[counter], current_threshold, sep = "_")
    output_list[[current_name]] <- subset(count_df, column2 < current_threshold)
  }
  counter <- 0
}
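For comparison only (not part of the original question): a minimal sketch of the same subsetting done with Map(), which carries the parameters along so that the names never have to be rebuilt from a counter. It assumes the toy objects above (list_count_df, thresholds); grid and output_list2 are just illustrative names.

# build the full grid of (data frame, threshold) combinations up front
grid <- expand.grid(df = names(list_count_df), threshold = thresholds,
                    stringsAsFactors = FALSE)
# Map() passes data frame name and threshold together, so parameters stay attached
output_list2 <- Map(function(df_name, thr) subset(list_count_df[[df_name]], column2 < thr),
                    grid$df, grid$threshold)
names(output_list2) <- paste(grid$df, grid$threshold, sep = "_")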

Also, the time-consuming part is usually the main function in the body of the loop, so reducing the loop overhead by switching to apply would probably not save much time anyway (I am still open to it, though).

After preparing the various subsets and analysing them, I need to store the analysis results together with the parameters that accompany each subset. This is probably a common task.

# allocate a df to store the results
result_length <- length(output_list) * length(conditions)
df_headers <- c("Names", "Threshold", "Input_table", "Standard_deviation", "Scaling")
df_results <- setNames(data.frame(matrix(ncol = length(df_headers),
                                         nrow = result_length)), df_headers)
# perform some analyses (here: PCA) on the dfs while looping over
# analysis parameters and storing some results directly
iii <- 0
table_counter <- 0
for (item in output_list) {
  table_counter <- table_counter + 1
  for (condition in conditions) {
    iii <- iii + 1
    current_name <- paste(names(output_list)[table_counter], condition, sep = "_")
    tmp <- prcomp(item, scale = condition)
    # let's pretend we are only interested in the standard deviation per item
    df_results[iii, 1] <- current_name
    df_results[iii, 4] <- tmp$sdev[1]
    rm(tmp)
  }
}

However, I partly do this by extracting parts of the object names, which is highly repetitive, very custom, and has to be changed for every additional step included beforehand. Since I would like to turn this into my own package soon, other users would not be able to follow it easily.

# extract more values from the name of the former object
df_results$Threshold <- as.numeric(sapply(strsplit(as.character(df_results$Names), '_'), "[", 3))
df_results$Input_table <- as.factor(sapply(strsplit(as.character(df_results$Names), '_'), "[", 2))
df_results$Scaling <- as.factor(sapply(strsplit(as.character(df_results$Names), '_'), "[", 4))
df_results
# now I could convert this into long format, do plotting, etc.

Above I have given a short example of what such a workflow can look like. My questions are:

1) What are general good practices for storing the parameters used and for extracting them after processing?

2) If a solution is too case-specific for a general approach:

a) Any ideas on what to change here?

b) Are lists and/or for loops the way to go at all?

I do it this way because I don't know how to modify the names within lapply, and without that I would lose track of what is what. I also don't know how to handle all of these different subsets efficiently within one large data frame.

Please consider that my original data contains numeric, factor, and character columns, with 100 rows/observations and tens of thousands of columns/variables.

Honestly, there are many ways to do this and it will come down to personal preference. One common approach would be to define a class object that sets the standard for how you access information on it. Creating a class also means you can write S3 methods. This gives you more flexibility in how the class is generated, depending on whether you are working with a list, a df, or just a vector.

generate_foo <- function(x, ...){
  UseMethod("generate_foo")
}

generate_foo.default <- function(x, current_threshold, conditions, name = NULL){
  if(is.null(name)){
    name <- as.character(substitute(x))
  }
  x <- x[x[["column2"]] < current_threshold, ]
  tmp <- tryCatch({prcomp(x, scale = conditions)}, error = function(er){return("Error")})
  retval <- list(list(subset = x,
                      pcaObj = tmp, # could store the entire object or just the parts you care about
                      subsetparam = current_threshold,
                      condition = conditions,
                      name = name))
  class(retval) <- "foo"
  return(retval)
}

generate_foo.list <- function(x,
                              current_threshold,
                              conditions, name = NULL){
  if(is.null(name) || length(name) != length(x)){
    name <- names(x)
  }
  # Generate combinations (separate() comes from the tidyr package)
  combi <- separate( # generate all the possible combination indexes at once
    data.frame(
      indx = levels(suppressWarnings(interaction(1:length(x),
                                                 1:length(current_threshold),
                                                 1:length(conditions))))),
    col = "indx", into = c("df", "thresh", "cond"), sep = "\\.")
  x <- x[as.numeric(combi$df)]
  name <- name[as.numeric(combi$df)]
  current_threshold <- current_threshold[as.numeric(combi$thresh)]
  conditions <- conditions[as.numeric(combi$cond)]
  foolist <- mapply(FUN = generate_foo.default,
                    x = x,
                    current_threshold = current_threshold,
                    conditions = conditions,
                    name = name)
  class(foolist) <- "foolist"
  return(foolist)
}

When called with this method:

foo <- generate_foo(x = list_count_df,
                    current_threshold = thresholds,
                    conditions = conditions,
                    name = c("Custname1", "Custname2"))

You end up with a list of objects of class "foo". Specifically, the resulting object in this case has length 8, and each element contains 5 components: subset, pcaObj, subsetparam, condition, name. pcaObj occasionally throws an error when the subset is too small; the tryCatch prevents the code from failing there. Take it a step further by writing custom print and summary functions!

# summary
summary.foolist <- function(x){
  subsetdim <- unlist(lapply(x, function(y){prod(dim(y[["subset"]]))}))
  pcasdev <- unlist(lapply(x, function(y){y[["pcaObj"]]$sdev[1]}))
  subsetparam <- unlist(lapply(x, function(y){y[["subsetparam"]]}))
  condition <- unlist(lapply(x, function(y){y[["condition"]]}))
  name <- unlist(lapply(x, function(y){y[["name"]]}))
  df <- data.frame(SubsetDim = subsetdim, PCAsdev = pcasdev, SubsetParam = subsetparam,
                   condition = condition, name = name)
  return(df)
}
summary(foo)
SubsetDim   PCAsdev SubsetParam condition      name
1        24  1.207833           2      TRUE Custname1
2         6  1.732051           2      TRUE Custname2
3        54  1.324284           4      TRUE Custname1
4        33  1.372508           4      TRUE Custname2
5        24 16.258848           2     FALSE Custname1
6         6 12.024556           2     FALSE Custname2
7        54 15.592938           4     FALSE Custname1
8        33 14.057929           4     FALSE Custname2
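A print method can be written in the same spirit; the sketch below is not from the original answer, it only assumes the "foolist" structure created by generate_foo() above.

# print - a minimal sketch built on the same structure as summary.foolist()
print.foolist <- function(x, ...){
  for (y in x) {
    cat(y[["name"]],
        ": threshold =", y[["subsetparam"]],
        ", scale =", y[["condition"]],
        ", subset rows =", nrow(y[["subset"]]), "\n")
  }
  invisible(x)
}
print(foo)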

Using a convention like this ensures that your data is stored in a canonical way. There are, of course, many different ways you could choose to generate your custom R classes and objects.

You could create one function that generates the list of subset data frames and assigns it a class, and then another function that performs the analysis and generates a new class object. As long as you stick to building a named list, accessing parts of the objects becomes easier because they are organized.
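As a very reduced sketch of that two-step idea (the function and class names make_subsets, analyse_subsets, subset_list, and analysis_list are made up for illustration and assume the toy list_count_df, thresholds, and conditions from the question):

# step 1: build a named list of subsets and tag it with a class
make_subsets <- function(dfs, thresholds) {
  out <- list()
  for (nm in names(dfs)) {
    for (thr in thresholds) {
      out[[paste(nm, thr, sep = "_")]] <-
        list(data = subset(dfs[[nm]], column2 < thr), source = nm, threshold = thr)
    }
  }
  structure(out, class = "subset_list")
}

# step 2: analyse a "subset_list" and return a new classed object
analyse_subsets <- function(x, conditions) {
  stopifnot(inherits(x, "subset_list"))
  res <- list()
  for (nm in names(x)) {
    for (cond in conditions) {
      # tryCatch as in the answer above, since tiny subsets can break prcomp()
      pca <- tryCatch(prcomp(x[[nm]]$data, scale. = cond), error = function(e) NULL)
      res[[paste(nm, cond, sep = "_")]] <-
        c(x[[nm]][c("source", "threshold")],
          list(scale = cond, sdev1 = if (is.null(pca)) NA else pca$sdev[1]))
    }
  }
  structure(res, class = "analysis_list")
}

results <- analyse_subsets(make_subsets(list_count_df, thresholds), conditions)

Because every element of results keeps its source name, threshold, and scaling setting alongside the value, nothing has to be parsed back out of the name strings.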

A functional solution

0. Generate the source data frame

# for reproducibility of random tasks
set.seed(1)
df <- data.frame(
  col1 = c(1:100),
  col2 = c(runif(50, 1, 10), runif(50, 11, 20)),
  col3 = c(runif(50, 1, 10), runif(50, 11, 20))
)
# so half of the rows have numbers 1 to 10 in col2 and col3
# and the other half have 11 to 20 in col2 and col3.
# let's randomize the order of rows
df <- df[sample(1:100), ]
# and take this data frame `df` as our source data frame
# from which we will do the analysis.

1. Problem description

We want to subdivide the original df into sub data frames by applying 2 different criteria.

Then we analyse each sub data frame with all possible combinations of 2 different parameters, and finally collect all analysis values and aggregate them into one data frame.

Criteria:

  • Criterium 1: if the col2 value is <= 10, we assign "df1" to the row, otherwise "df2".

    • categories: c("df1", "df2")
  • Criterium 2: if the col3 value is <= the first limit, the row is assigned "class5"; if the col3 value is > the first limit but <= the second limit, it is assigned "class15"; we are not interested in the remaining cases, so let's assign "other".

    • categories: c("class5", "class15", "other") - every row is assigned to exactly one of them

We want a separate sub data frame for every combination of both criteria, and each of them should be analysed.

Parameters for the analysis:

  • Parameter 1: 'scale.=' c(TRUE, FALSE)

    • parameter_categories: c("scpos", "scneg")
  • Parameter 2: 'center=' c(TRUE, FALSE)

    • parameter_categories: c("cpos", "cneg")

Analysis result values:

  • for every combination of both parameters we want its own report/value for the "standard deviation"
    • 3 standard deviation columns, for PC1, PC2, and PC3

Other information to collect:

  • we want a distinguishable (unique) name for every combination

2. How the entire analysis looks:

# 0. categorize and split data frame
categories1 <- c("df1", "df2")[cut(df[, "col2"], c(1, 11, 20, Inf))]
categories2 <- c("class5", "class15", "other")[cut(df[, "col3"], c(-Inf, 5, 15, Inf))]
dfs <- split(df, gsub("class", "", paste(categories1, categories2, sep = "_")))

# 1. Declare parameters and prepare all parameter combinations
parameters1 <- list("scale." = TRUE, "scale." = FALSE)
np1 <- c("scpos", "scneg")
parameters2 <- list("center" = TRUE, "center" = FALSE)
np2 <- c("cpos", "cneg")
params_list <- named_cross_combine(parameters1, parameters2, np1, np2, sep = "_")

# 2. Apply analysis over all sub dfs and parameter combinations
#    and extract and aggravate analysis results into a final data frame
df_final <- apply_extract_aggravate(
  dfs = dfs,
  params = params_list,
  analyzer_func = prcomp,
  extractor_func = function(x) x$sdev,   # extractor must return a vector
  col_names = c("df", "limits", "scale", "center", "std_PC1", "std_PC2", "std_PC3"),
  sep = "_"                              # separator for names
)

# 3. rename parameter column contents
df_final$scale <- unlist(lookup(df_final$scale, np1, parameters1))
df_final$center <- unlist(lookup(df_final$center, np2, parameters2))

df_final

df limits scale center   std_PC1   std_PC2   std_PC3
df1_15_scpos_cpos    df1     15  TRUE   TRUE  1.205986 0.9554013 0.7954906
df1_15_scpos_cneg    df1     15  TRUE  FALSE  1.638142 0.5159250 0.2243043
df1_15_scneg_cpos    df1     15 FALSE   TRUE 15.618145 2.4501942 1.3687843
df1_15_scneg_cneg    df1     15 FALSE  FALSE 31.425246 5.9055013 1.7178626
df1_5_scpos_cpos     df1      5  TRUE   TRUE  1.128371 1.0732246 0.7582659
df1_5_scpos_cneg     df1      5  TRUE  FALSE  1.613217 0.4782639 0.4108470
df1_5_scneg_cpos     df1      5 FALSE   TRUE 13.525868 2.5524661 0.9894493
df1_5_scneg_cneg     df1      5 FALSE  FALSE 30.007511 3.9094993 1.6020638
df2_15_scpos_cpos    df2     15  TRUE   TRUE  1.129298 1.0069030 0.8431092
df2_15_scpos_cneg    df2     15  TRUE  FALSE  1.720909 0.1523516 0.1235295
df2_15_scneg_cpos    df2     15 FALSE   TRUE 14.061532 2.4172787 1.2348606
df2_15_scneg_cneg    df2     15 FALSE  FALSE 80.543382 3.8409639 1.8480111
df2_other_scpos_cpos df2  other  TRUE   TRUE  1.090057 0.9588241 0.9446865
df2_other_scpos_cneg df2  other  TRUE  FALSE  1.718190 0.1881516 0.1114570
df2_other_scneg_cpos df2  other FALSE   TRUE 15.168160 2.5579403 1.3354016
df2_other_scneg_cneg df2  other FALSE  FALSE 82.297724 5.0580949 1.9356444

3. Step-by-step explanation

3.1 Declare helper functions

# for preparing parameter combinations as lists
named_cross_combine <- function(seq1, seq2, seq1_names, seq2_names, sep = "_") {
  res <- list()
  i <- 1
  namevec <- c()
  for (j1 in seq_along(seq1)) {
    for (j2 in seq_along(seq2)) {
      res[[i]] <- c(seq1[j1], seq2[j2])
      namevec[i] <- paste0(seq1_names[j1], sep, seq2_names[j2])
      i <- i + 1
    }
  }
  names(res) <- namevec
  res
}
# correctly named params list - `sep=` determines how names are joined
# you can apply `gsub()` on the namevec before assignment to adjust further the names.
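# illustrative call (toy values, not part of the original helpers):
#   named_cross_combine(list(a = 1, a = 2), list(b = TRUE, b = FALSE),
#                       c("a1", "a2"), c("bT", "bF"))
# returns a named list of 4 parameter lists, e.g.
#   $a1_bT -> list(a = 1, b = TRUE),  $a1_bF -> list(a = 1, b = FALSE), ...
# which is exactly the list shape that do.call() expects for its arguments later on.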
# useful for doing the analysis
do.call2 <- function(fun, x, rest) {
  do.call(fun, c(list(x), rest))
}

apply_parameters <- function(funcname,
                             dfs,
                             params) {
  lapply(dfs, function(df) lapply(params, function(pl) do.call2(funcname, df, pl)))
}
split_names_to_data_frame <- function(names_vec, sep) {
  res <- lapply(names_vec, function(s) strsplit(s, sep)[[1]])
  df  <- Reduce(rbind, res)
  # colnames(df) <- col_names
  rownames(df) <- names_vec
  df
}
apply_to_subdf_and_combine <- function(
    res_list,
    accessor_func = function(x) x,             # subdf result
    subdf_level_combiner_func = as.data.frame, # within subdf result
    combine_prepare_func = function(x) x,      # applied on each subdf result
    final_combiner_func = rbind,               # combine the results
    col_names = NULL,                          # column names for final
    sep = "_") {                               # joiner for names
  res_accessed_combined <- lapply(res_list,
                                  function(x) do.call(what = subdf_level_combiner_func,
                                                      list(lapply(x, accessor_func))))
  res_prepared <- lapply(res_accessed_combined, combine_prepare_func)
  res_df <- Reduce(final_combiner_func, res_prepared)
  rownames(res_df) <- paste(unlist(sapply(names(res_prepared), rep, nrow(res_prepared[[1]]))),
                            unlist(sapply(res_prepared, rownames)),
                            sep = sep)
  names_df <- split_names_to_data_frame(rownames(res_df), sep = sep)
  final_df <- as.data.frame(cbind(names_df, res_df))
  if (!is.null(col_names)) {
    colnames(final_df) <- col_names
  }
  final_df
}
# for simplifying the function call
extract_and_combine <- function(res_list,
                                result_extractor_func,
                                col_names,
                                sep = "_") {
  apply_to_subdf_and_combine(
    res_list = res_list,
    accessor_func = result_extractor_func,
    subdf_level_combiner_func = as.data.frame,
    combine_prepare_func = function(x) as.data.frame(t(x)),
    final_combiner_func = rbind,
    col_names = col_names,
    sep = sep
  )
}
# for simplifying the function call even more
apply_extract_aggravate <- function(dfs,
                                    params,
                                    analyzer_func,
                                    extractor_func,
                                    col_names,
                                    sep = "_") {
  extract_and_combine(
    res_list = apply_parameters(funcname = analyzer_func, dfs = dfs, params = params),
    result_extractor_func = extractor_func,
    col_names = col_names,
    sep = sep
  )
}

# useful for renaming the data frame column values
lookup <- function(x, seq1, seq2) {
  seq2[sapply(x, function(x) which(x == seq1))]
}

3.2 Categorize and split the data frame

categories1 <- c("df1", "df2")[cut(df[, "col2"], c(1, 11, 20, Inf))]
categories2 <- c("5", "15", "other")[cut(df[, "col3"], c(-Inf, 5, 15, Inf))]
dfs <- split(df, gsub("class", "", paste(categories1, categories2, sep="_")))

But to have full control over the categorization, you can declare your own categorizer functions and use them to categorize and split the data frame:

# write the rules for criterium 1 as a function of one element
categorizer1 <- function(x) {
  if (1 <= x && x <= 10) {
    "df1"
  } else if (11 <= x && x <= 20) {
    "df2"
  }
}
# vectorize it to be able to apply it on entire columns
categorizer1 <- Vectorize(categorizer1)

# do the same for criterium 2
categorizer2 <- function(x) {
  if (x <= 5) {
    "class5"
  } else if (5 < x && x <= 15) {
    "class15"
  } else {
    "other"
  }
}
categorizer2 <- Vectorize(categorizer2)

# apply the corresponding categorizers on col2 and col3
categories1 <- categorizer1(df[, "col2"])
categories2 <- categorizer2(df[, "col3"])

# get the list of sub data frames according to the categories
dfs <- split(df, gsub("class", "", paste(categories1, categories2, sep = "_")))

# Let the categorizer functions return strings and
# for the second argument use `paste()` with `sep=` to determine
# how the names should be combined - here with "_".
# Use `gsub(pattern, replacement, x, ignore.case = FALSE, perl = TRUE)`
# to process the names with regex patterns into the form you want at the end.
# Here, we remove the bulky "class".

3.3 Declare the parameters as lists, together with their corresponding names used in the result names

parameters1 <- list("scale." = TRUE, "scale."=FALSE)
np1 <- c("scpos", "scneg")
parameters2 <- list("center"=TRUE, "center"=FALSE)
np2 <- c("cpos", "cneg")
# prepare all combinations of them in a list of lists
params_list <- named_cross_combine(parameters1, parameters2, np1, np2, sep="_")
# this produces a list of all possible parameter combination lists.
# Each parameter combination has to be kept itself in a list, because
# `do.call()` later requires the parameters being in a list.
# `named_cross_combine()` takes care of correct naming, 
# joining the names using `sep` values.
# The first element of `parameters1` is taken and paired with each element of
# `parameters2`; then the second element of `parameters1` with all of `parameters2`, etc.

3.4 Apply all parameters over the dfs and collect the results into a data frame

df_final <- apply_extract_aggravate(
  dfs = dfs,
  params = params_list,
  analyzer_func = prcomp,
  extractor_func = function(x) x$sdev,   # extractor must return a vector
  col_names = c("df", "limits", "scale", "center", "std_PC1", "std_PC2", "std_PC3"),
  sep = "_"                              # separator for names
)
# This function takes the dfs and the parameters list and runs the
# analyzer_func, here `prcomp()`, over all combinations of both.
# The `extractor_func` must be chosen in a way that the returned result is a vector.
# If it is already a vector, set here `function(x) x`, the identity function.
# The column names give new names to the resulting columns.
# The number of names needed is determined by:
# - the number of categoriesN,
# - the number of parametersN,
# - the number of elements of the result after extractor_func() has been applied.
# `sep=` determines which joiner is used for joining the names.

3.5 Finally, rename the contents of the parameter columns using lookup() together with the previously declared parameter lists (parametersN) and their corresponding name vectors (npN)

df_final$scale <- unlist(lookup(df_final$scale, np1, parameters1))
df_final$center <- unlist(lookup(df_final$center, np2, parameters2))
# Two parameter columns, so two commands.

This transforms df_final from this:

#                       df limits scale center   std_PC1   std_PC2   std_PC3
# df1_15_scpos_cpos    df1     15 scpos   cpos  1.205986 0.9554013 0.7954906
# df1_15_scpos_cneg    df1     15 scpos   cneg  1.638142 0.5159250 0.2243043
# df1_15_scneg_cpos    df1     15 scneg   cpos 15.618145 2.4501942 1.3687843
# df1_15_scneg_cneg    df1     15 scneg   cneg 31.425246 5.9055013 1.7178626
# df1_5_scpos_cpos     df1      5 scpos   cpos  1.128371 1.0732246 0.7582659
# df1_5_scpos_cneg     df1      5 scpos   cneg  1.613217 0.4782639 0.4108470
# df1_5_scneg_cpos     df1      5 scneg   cpos 13.525868 2.5524661 0.9894493
# df1_5_scneg_cneg     df1      5 scneg   cneg 30.007511 3.9094993 1.6020638
# df2_15_scpos_cpos    df2     15 scpos   cpos  1.129298 1.0069030 0.8431092
# df2_15_scpos_cneg    df2     15 scpos   cneg  1.720909 0.1523516 0.1235295
# df2_15_scneg_cpos    df2     15 scneg   cpos 14.061532 2.4172787 1.2348606
# df2_15_scneg_cneg    df2     15 scneg   cneg 80.543382 3.8409639 1.8480111
# df2_other_scpos_cpos df2  other scpos   cpos  1.090057 0.9588241 0.9446865
# df2_other_scpos_cneg df2  other scpos   cneg  1.718190 0.1881516 0.1114570
# df2_other_scneg_cpos df2  other scneg   cpos 15.168160 2.5579403 1.3354016
# df2_other_scneg_cneg df2  other scneg   cneg 82.297724 5.0580949 1.9356444

into this:

df limits scale center   std_PC1   std_PC2   std_PC3
df1_15_scpos_cpos    df1     15  TRUE   TRUE  1.205986 0.9554013 0.7954906
df1_15_scpos_cneg    df1     15  TRUE  FALSE  1.638142 0.5159250 0.2243043
df1_15_scneg_cpos    df1     15 FALSE   TRUE 15.618145 2.4501942 1.3687843
df1_15_scneg_cneg    df1     15 FALSE  FALSE 31.425246 5.9055013 1.7178626
df1_5_scpos_cpos     df1      5  TRUE   TRUE  1.128371 1.0732246 0.7582659
df1_5_scpos_cneg     df1      5  TRUE  FALSE  1.613217 0.4782639 0.4108470
df1_5_scneg_cpos     df1      5 FALSE   TRUE 13.525868 2.5524661 0.9894493
df1_5_scneg_cneg     df1      5 FALSE  FALSE 30.007511 3.9094993 1.6020638
df2_15_scpos_cpos    df2     15  TRUE   TRUE  1.129298 1.0069030 0.8431092
df2_15_scpos_cneg    df2     15  TRUE  FALSE  1.720909 0.1523516 0.1235295
df2_15_scneg_cpos    df2     15 FALSE   TRUE 14.061532 2.4172787 1.2348606
df2_15_scneg_cneg    df2     15 FALSE  FALSE 80.543382 3.8409639 1.8480111
df2_other_scpos_cpos df2  other  TRUE   TRUE  1.090057 0.9588241 0.9446865
df2_other_scpos_cneg df2  other  TRUE  FALSE  1.718190 0.1881516 0.1114570
df2_other_scneg_cpos df2  other FALSE   TRUE 15.168160 2.5579403 1.3354016
df2_other_scneg_cneg df2  other FALSE  FALSE 82.297724 5.0580949 1.9356444

4. Final remarks

This is not very different from your approach: all of the information is collected in the names, and the names are then used to generate the part of the data frame that explains the context of the analyzed data. The lookup() function is very useful for renaming the parameter columns.

The categorization of the columns can be simplified a lot with the cut() function. However, with cut() you cannot control for each boundary individually whether the upper/lower limit is included (<=) or excluded (<). That is why declaring your own categorizer functions can sometimes be advantageous (especially for more complex categorizations).
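To illustrate the point (a small sketch, not part of the original workflow): cut() decides globally via its right= argument on which side the intervals are closed, so a boundary value such as 5 switches category depending on that single switch, whereas a custom categorizer can treat each limit individually.

x <- c(5, 5.1, 15)
# default right = TRUE: intervals are (-Inf, 5], (5, 15], (15, Inf] -> 5 falls into "class5"
cut(x, c(-Inf, 5, 15, Inf), labels = c("class5", "class15", "other"))
# right = FALSE: intervals are [-Inf, 5), [5, 15), [15, Inf) -> 5 now falls into "class15"
cut(x, c(-Inf, 5, 15, Inf), labels = c("class5", "class15", "other"), right = FALSE)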

Extensions

More categories:
  • simply define more categories: categories1, categories2, categories3, ...
# then do
dfs <- split(df, paste(categories1, categories2, categories3, ..., sep="_"))
# use `gsub()` around `paste()` or do
# names(dfs) <- gsub("search_term", "replace_term", names(dfs)) - over and over again
# until all names are as they should be.
More parameters:
  • simply define more parametersN - npN pairs.
# then do
params_list <- named_cross_combine(parameters1, parameters2, np1, np2, sep="_")
params_list <- named_cross_combine(params_list, parameters3, names(params_list), np3, sep="_")
params_list <- named_cross_combine(params_list, parameters4, names(params_list), np4, sep="_")
... (and so on ...)
# use then at the end more lines for renaming parameter column contents:
df_final[, prmcol_name1] <- unlist(lookup(df_final[, prmcol_name1], np1, parameters1))
df_final[, prmcol_name2] <- unlist(lookup(df_final[, prmcol_name2], np2, parameters2))
df_final[, prmcol_name3] <- unlist(lookup(df_final[, prmcol_name3], np3, parameters3))
... (and so on ...)

So the number of categories and parameters can easily be increased. The core helper functions stay the same and do not have to be modified.

(The use of higher-order functions (functions that take other functions as arguments) as helpers is the key to their flexibility - one of the strengths of functional programming.)
