Importing multiple CSV files into a PostgreSQL database using R (memory error)

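I am trying to import a dataset (consisting of many CSV files) into R and then write the data to a table in a PostgreSQL database.

I connected to the database successfully, created a loop to import the CSV files, and tried to run the import. R then returns an error because my computer runs out of memory.

My question is: is there a way to create a loop that imports the files one after another, writes each one to the PostgreSQL table, and then deletes it from memory? That way I would not run out of memory.

The code that returns the memory error: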

`# packages the code below relies on
library(RPostgreSQL)  # PostgreSQL driver used by dbDriver("PostgreSQL")
library(readr)        # read_csv()
library(dplyr)        # %>% and bind_rows()

# connect to PostgreSQL database
db_tankdata <- 'tankdaten'
host_db <- 'localhost'
db_port <- '5432'
db_user <- 'postgres'
db_password <- 'xxx'
drv <- dbDriver("PostgreSQL")
con <- dbConnect(drv, dbname = db_tankdata, host = host_db,
                 port = db_port, user = db_user, password = db_password)
# check if connection was successful
dbExistsTable(con, "prices")
# create function to load multiple csv files
import_csvfiles <- function(path){
  files <- list.files(path, pattern = "*.csv", recursive = TRUE, full.names = TRUE)
  # reads ALL files and binds them into one data frame, so everything is in memory at once
  lapply(files, read_csv) %>% bind_rows() %>% as.data.frame()
}

# import files
prices <- import_csvfiles("path...")
dbWriteTable(con, "prices", prices, append = TRUE, row.names = FALSE)`

Thanks in advance for your feedback!

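If you change lapply() to use an anonymous function, you can read each file and write it to the database straight away, which reduces the amount of memory required. Because lapply() acts as an implicit for() loop, no additional looping mechanism is needed.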

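import_csvfiles <- function(path){
  # pattern is a regular expression, so "\\.csv$" matches names ending in ".csv"
  files <- list.files(path, pattern = "\\.csv$", recursive = TRUE, full.names = TRUE)
  # read one file, append it to the table, then move on to the next file
  invisible(lapply(files, function(x){
    prices <- read.csv(x)
    dbWriteTable(con, "prices", prices, append = TRUE, row.names = FALSE)
  }))
}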
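The same idea can also be written as an explicit loop that drops each data frame once it has been written, which is closer to what the question asks for. This is only a minimal sketch: `import_one_by_one` and `write_fun` are hypothetical names introduced here, and `write_fun` stands in for whatever writes a data frame to the database.

```r
# Sketch: process CSV files one at a time so only one file is held in memory.
# `write_fun` is a hypothetical callback that receives one data frame per file.
import_one_by_one <- function(path, write_fun) {
  files <- list.files(path, pattern = "\\.csv$", recursive = TRUE, full.names = TRUE)
  rows_written <- 0
  for (f in files) {
    prices <- read.csv(f)                        # load a single file
    write_fun(prices)                            # hand it to the writer
    rows_written <- rows_written + nrow(prices)
    rm(prices)                                   # drop the data frame ...
    invisible(gc())                              # ... and ask R to release the memory
  }
  rows_written                                   # total number of rows handed over
}
```

With the connection from the question, the callback could be `function(df) dbWriteTable(con, "prices", df, append = TRUE, row.names = FALSE)`. Passing the writer in as an argument keeps the function independent of a global `con`.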

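I assume the CSV files you want to import into the database are very large? As far as I know, with the code you have written R first wants to store all of the data in a data frame, which is held in memory. The alternative is to read the CSV files in chunks, as you would with Python's Pandas.

When calling ?read.csv, I see the following in the output:

nrows: the maximum number of rows to read in. Negative and other invalid values are ignored.

skip: the number of lines of the data file to skip before beginning to read data.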
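To make the two arguments concrete, here is a small sketch on a temporary file invented for illustration (the file, its columns, and the chunk size of 4 are all assumptions). Note that `skip` counts lines of the file, so the header line has to be counted as well, and later chunks need `header = FALSE` plus explicit column names.

```r
# Build a small example file: one header line plus 10 data rows
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(id = 1:10, price = (11:20) / 10), tmp, row.names = FALSE)

# First chunk: the header line is consumed, then 4 data rows are read
first_chunk <- read.csv(tmp, nrows = 4)
col_names <- names(first_chunk)

# Second chunk: skip the header line and the 4 rows already read (5 lines).
# header = FALSE stops read.csv() from treating the next data row as a header.
second_chunk <- read.csv(tmp, skip = 5, nrows = 4,
                         header = FALSE, col.names = col_names)
```

Here `first_chunk` holds the rows with `id` 1 to 4 and `second_chunk` the rows with `id` 5 to 8.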

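Why not try reading 5000 rows at a time into a data frame, writing them to the PostgreSQL database, and repeating that for each file?

For example, for each file do the following: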

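# x is the path of the current CSV file
number_of_lines <- 5000    # number of data rows to read at a time
row_skip <- 1              # lines to skip: the header line, later also the rows already read
keep_reading <- TRUE       # set to FALSE to stop the while loop
col_names <- names(read.csv(x, nrows = 1))   # read the column names once
while (keep_reading) {
  # header = FALSE so the first row of each chunk is not consumed as a header
  my_data <- tryCatch(
    read.csv(x, header = FALSE, col.names = col_names,
             nrows = number_of_lines, skip = row_skip),
    error = function(e) NULL)
  # stop when nothing is left (an error or an empty result at the end of the file)
  if (is.null(my_data) || nrow(my_data) == 0) break
  dbWriteTable(con, "prices", my_data, append = TRUE, row.names = FALSE)  # write to the DB
  row_skip <- row_skip + number_of_lines
  # Exit condition: a chunk shorter than number_of_lines means the file is exhausted
  if (nrow(my_data) < number_of_lines) {
    keep_reading <- FALSE
  } # end-if
} # end-while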

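By doing this you break the CSV into smaller pieces. You can tune the number_of_lines variable to reduce the number of loop iterations: larger chunks mean fewer iterations but more memory per read. It may look a bit hacky because it involves a loop, but I believe it will work.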
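A possible variation, not part of the answer above, is to keep one file connection open and let read.csv() continue from the current position, so the lines already read do not have to be skipped again on every pass. The following is a sketch under that assumption; `read_in_chunks` and `write_fun` are hypothetical names, and `write_fun` stands in for the dbWriteTable() call.

```r
# Sketch: read a CSV in chunks from a single open connection.
# `write_fun` is a hypothetical callback that receives each chunk.
read_in_chunks <- function(file_path, chunk_size, write_fun) {
  input <- file(file_path, open = "r")
  on.exit(close(input))                       # always close the connection
  # The first call consumes the header line and the first chunk of rows
  chunk <- read.csv(input, nrows = chunk_size)
  col_names <- names(chunk)
  total <- 0
  repeat {
    # stop when nothing is left (an error or an empty result at the end of the file)
    if (is.null(chunk) || nrow(chunk) == 0) break
    write_fun(chunk)
    total <- total + nrow(chunk)
    if (nrow(chunk) < chunk_size) break       # short chunk: file is exhausted
    # Later calls continue where the previous one stopped, so no header is present
    chunk <- tryCatch(
      read.csv(input, nrows = chunk_size, header = FALSE, col.names = col_names),
      error = function(e) NULL
    )
  }
  total                                       # number of rows handed to write_fun
}
```

With the connection from the question, a call could look like `read_in_chunks(x, 5000, function(df) dbWriteTable(con, "prices", df, append = TRUE, row.names = FALSE))`. Column types are inferred per chunk, so columns with mixed content may need an explicit `colClasses` argument.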
