Evaluating and improving the runtime efficiency of loading a large number of CSV datasets into R and assigning them all to a list object

All of the code included in this question can be found in the script named 'LASSO code(Version for Antony)' in the GitHub Repository for this project.

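This is part of a research project in which we are exploring the properties of a new statistical learning algorithm and measuring its performance at optimal variable selection. Its performance is measured and compared against three optimal variable selection benchmarks (LASSO, Backward Stepwise, & Forward Stepwise Regression) after all 4 have been run on the same set of 260,000 synthetic datasets. All of these datasets have the same number of synthetic observations on the same number of columns, and they were generated via Monte Carlo simulation in such a way that the distribution of each observation and the "true" underlying statistical properties characterizing each dataset are known by construction.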

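So all that has to be done is to run all 4 algorithms on this big folder full of 260k csv files, which is named "datasets folder" on my system. After loading all of the necessary libraries, here is all of the code that precedes the command to load/import the datasets: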

# these 2 lines together create a simple character list of 
# all the file names in the file folder of datasets you created
folderpath <- "C:/Users/Spencer/Documents/EER Project/datasets folder"
filepaths_list <- list.files(path = folderpath, full.names = TRUE, 
recursive = TRUE)
# reformat the names of each of the csv file formatted datasets
DS_names_list <- basename(filepaths_list)
DS_names_list <- tools::file_path_sans_ext(DS_names_list)
# sort both of the list of file names so that they are in the proper order
my_order = DS_names_list |> 
# split apart the numbers, convert them to numeric 
strsplit(split = "-", fixed = TRUE) |>  unlist() |> as.numeric() |>
# get them in a data frame
matrix(nrow = length(DS_names_list), byrow = TRUE) |> as.data.frame() |>
# get the appropriate ordering to sort the data frame
do.call(order, args = _)
DS_names_list = DS_names_list[my_order]
filepaths_list = filepaths_list[my_order]
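
To illustrate why the numeric ordering step matters, here is a minimal sketch on a toy vector. The file names below are hypothetical stand-ins of the form "a-b-c", not names taken from the project's folder, and the sketch assumes R 4.1 or later for the native pipe.

```r
# Hypothetical file names of the form "a-b-c" (illustrative only)
toy_names <- c("10-1-1", "2-1-1", "1-2-1", "1-10-1")

# Character sorting compares text, so "10-1-1" lands before "2-1-1".
# method = "radix" forces C-locale collation so the result is reproducible.
text_sorted <- sort(toy_names, method = "radix")

# Numeric ordering on each hyphen-separated component, as in the code above
parts <- strsplit(toy_names, split = "-", fixed = TRUE) |>
  unlist() |>
  as.numeric() |>
  matrix(nrow = length(toy_names), byrow = TRUE) |>
  as.data.frame()
toy_order <- do.call(order, args = parts)
numeric_sorted <- toy_names[toy_order]

print(text_sorted)     # "1-10-1" "1-2-1"  "10-1-1" "2-1-1"
print(numeric_sorted)  # "1-2-1"  "1-10-1" "2-1-1"  "10-1-1"
```

Applying the same index vector to both the names and the file paths keeps the two vectors aligned, which is what the last two lines of the code above do.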

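And this is the code I am using to load/import the 260k datasets into my environment, so that I can run a LASSO regression on each of them and calculate how well they perform: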

# this line reads all of the data in each of the csv files 
# using the name of each store in the list we just created
CL <- makeCluster(detectCores() - 1L)
clusterExport(CL, c('filepaths_list'))
system.time( datasets <- lapply(filepaths_list, read.csv) )
stopCluster(CL)
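
One thing worth noting about the code above: lapply runs only in the main R session, so the cluster CL is created and filepaths_list is exported, but no worker ever reads a file. A minimal sketch of actually dispatching the reads to the workers with parallel::parLapply follows. The toy files, folder name, and worker count are placeholders for illustration, not the project's data, and whether this helps in practice will depend on disk speed and available memory.

```r
library(parallel)

# Stand-in data: a few small CSV files in a temporary folder (illustrative only)
tmp_dir <- file.path(tempdir(), "toy_datasets")
dir.create(tmp_dir, showWarnings = FALSE)
for (i in 1:4) {
  write.csv(data.frame(Y = rnorm(5), X1 = rnorm(5)),
            file.path(tmp_dir, paste0("1-1-", i, ".csv")),
            row.names = FALSE)
}
toy_paths <- list.files(tmp_dir, pattern = "\\.csv$", full.names = TRUE)

# Two workers chosen arbitrarily for the sketch
CL <- makeCluster(2L)
# parLapply splits toy_paths into chunks and sends each chunk to a worker,
# so the paths travel with the call and no clusterExport is needed here
toy_datasets <- parLapply(CL, toy_paths, read.csv)
stopCluster(CL)

length(toy_datasets)
```

The result is an ordinary list of data frames in the same order as toy_paths, so downstream code written for the lapply version should not need to change.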

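…The problem here is that I have run all of the code included above except stopCluster(CL), because although I hit Run on system.time( datasets <- lapply(filepaths_list, read.csv) ) more than 54 hours ago, it still has not finished loading my datasets into RStudio's workspace!! I have a mid-range 2022 HP laptop whose RAM I upgraded from 12 GB to 32 GB, and when I did this same operation a few months ago with 58,500 datasets rather than 260,000, the datasets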

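N.B. I know there is nothing wrong with the syntax itself, because I ran the same script in other RStudio windows on file folders with only 10 and 40 datasets, just to make sure that was not the problem. One more thing: I also tried this a few days ago without the parallel part or the system.time() part, but I accidentally left my laptop unplugged for about 90 minutes, and it was working so hard that it died during that time.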

Is there a faster way than read() to read big data?

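Take a look at the question linked above regarding speed. A faster read function and selecting only the columns you need can improve your speed. The microbenchmark package can also help you run tests to determine the fastest method.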
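
As a rough sketch of that advice, the snippet below reads one toy file with base read.csv and with data.table::fread restricted to two columns, then times both with microbenchmark. It assumes the data.table and microbenchmark packages are installed. Using fread as the faster reader is an assumption here, and the column names Y, X1, and X2 are hypothetical. Timings on a small toy file will not necessarily carry over to the real datasets.

```r
# Assumes the data.table and microbenchmark packages are installed
library(data.table)
library(microbenchmark)

# Toy file with hypothetical column names (illustrative only)
toy_file <- tempfile(fileext = ".csv")
write.csv(data.frame(Y = rnorm(500), X1 = rnorm(500), X2 = rnorm(500)),
          toy_file, row.names = FALSE)

# Read every column with base R, then only two columns with fread
full_df <- read.csv(toy_file)
part_dt <- fread(toy_file, select = c("Y", "X1"))

# Time both approaches on the same file, 20 repetitions each
timings <- microbenchmark(
  base_read_csv = read.csv(toy_file),
  fread_select  = fread(toy_file, select = c("Y", "X1")),
  times = 20L
)
print(timings)
```

Running the same comparison on a handful of the real CSV files should give a more reliable picture before committing to one method for all 260k.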
