How to import and sort poorly formatted stacked CSV files in R


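  1. How can I import and sort these data (see the code sections below) so that R can work with them easily?

  2. Are the organ name, the dose unit "Gy" and the volume unit "CC" all considered "factors" in R? What are the proper terms for the data set name and the data variables?

These histograms are stored one data set after another, as follows:

Example data file: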

Bladder,,
GY, (CC),
0.0910151,1.34265
0.203907,1.55719
[skipping to end of this data set]
57.6659,0.705927
57.7787,0.196091
,,
CTV-operator,,
GY, (CC),
39.2238,0.00230695
39.233,0
[repeating for remainder of data sets; skipping to end of file]
53.1489,0
53.2009,0.0161487
,,
[blank line]

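The data set labels (e.g. Bladder, CTV-operator, Rectum) are sometimes lowercase and usually appear in random order within a file. I have dozens of files sorted into two folders, which should be imported and analysed together as one large patient sample.

I have started this script, but I suspect there is a better way: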

[file = file.path()]
DVH = read.csv(file, header = FALSE, sep = ",", fill = TRUE)
DVH[3] <- NULL      # delete last column from data
loop = 1; notover = TRUE
factor(DVH[loop,1]) # Store the first element as a factor
while(notover)
{loop = loop + 1   # move to next line
DVH$1<-factor(DVH[loop,1]) # I must change ...
DVH$2<-factor(DVH[loop,2]) # ... these lines.
if([condition indicating end of file; code to be learned]) {notover = FALSE}
}
# store first element as data label
# store next element as data label
# store data for (observations given) this factor
# if line is blank, move to next line, store first element as new factor, and repeat until end of file

Walter Roberson helped me prepare this code to import and parse the data in MATLAB, and so far I have more or less been trying to do the same thing in R:

for fileloop = 1:length(indexnumber)
num = 0;
fid = fopen(['filepath to folder',num2str(indexnumber(fileloop)),'.csv'],'rt');
while true 
H1 = fgetl(fid) ;
if feof(fid); break; end 
H2 = fgetl(fid) ;
if feof(fid); break; end 
datacell = textscan(fid, '%f%f', 'delimiter', ',', 'collectoutput', true) ;
if isempty(datacell) || isempty(datacell{1}); break; end 
if any(isnan(datacell{1}(end,:))); datacell{1}(end,:) = []; end
num = num + 1;
headers(num,:) = {H1, H2} ;
data(num) = datacell;
end
fclose(fid);
clear datacell H1 H2
end

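Additional information:

I am new to R and have intermediate MATLAB experience. I am switching from MATLAB to R so that my work can be reproduced more easily by others around the world. (R is free; MATLAB is not.)

The data come from dose-volume histograms exported by the radiation oncology software Velocity, and are used for cancer treatment research.

(I previously asked this question for Python, but a computer scientist recommended that I use R instead.)

Thank you for your time.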

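This should read a file into a well-structured data frame for further processing. It lets you process multiple files and combine the data into a single data frame. There are more efficient and dynamic ways to collect the file paths, but this should give you a starting point.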

# Create function to process a file
process.file <- function(filepath){
  # Open connection to file
  con <- file(filepath, "r")
  # Create empty dataframe
  df <- data.frame(Organ = character(),
                   Dosage = numeric(),
                   Dosage.Unit = character(),
                   Volume = numeric(),
                   Volume.Unit = character(),
                   stringsAsFactors = FALSE)
  # Begin looping through file
  while ( TRUE )
  {
    # Read current line
    line <- readLines(con, n = 1)
    # If at end of file, break the loop
    if ( length(line) == 0 ) { break }
    # If the current line is not equal to ",," and is not a blank line, then process the line
    if(line != ",," && line != ""){
      # If the last two characters of the line are ",,"
      if(substr(line, nchar(line) - 1, nchar(line)) == ",,"){
        # Remove the trailing commas from the line and set the organ type
        organ <- sub(",,$", "", line)
      }
      # If the last character of the line is equal to ","
      else if(substr(line, nchar(line), nchar(line)) == ","){
        # Split the line at the comma
        units <- strsplit(line, ",")
        # Set the dosage unit and volume unit (trimming the space after the comma)
        dose.unit <- trimws(units[[1]][1])
        vol.unit <- trimws(units[[1]][2])
      }
      # If the line is not a special case
      else{
        # Split the line at the comma
        vals <- strsplit(line, ",")
        # Set the dosage value and the volume value as numbers
        dosage <- as.numeric(vals[[1]][1])
        volume <- as.numeric(vals[[1]][2])
        # Add the values into the dataframe as a new row with matching column names
        df <- rbind(df, data.frame(Organ = organ,
                                   Dosage = dosage,
                                   Dosage.Unit = dose.unit,
                                   Volume = volume,
                                   Volume.Unit = vol.unit,
                                   stringsAsFactors = FALSE))
      }
    }
  }
  # Close the connection to the file
  close(con)
  # Return the dataframe
  return(df)
}

# Create a vector of the files to process
filenames <- c("C:/path/to/file/file1.txt",
               "C:/path/to/file/file2.txt",
               "C:/path/to/file/file3.txt",
               "C:/path/to/file/file4.txt")
# Create a dataframe to hold processed data
df.patient.sample <- data.frame(Organ = character(),
                                Dosage = numeric(),
                                Dosage.Unit = character(),
                                Volume = numeric(),
                                Volume.Unit = character(),
                                stringsAsFactors = FALSE)
# Process each file in the vector of filenames
for(f in filenames){
  df.patient.sample <- rbind(df.patient.sample, process.file(f))
}
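
On the second question: the organ name and the two unit labels are categorical values, which R typically represents as factors, while the dose and volume are numeric variables. Each block in the file is then one group (one level of the organ factor) rather than a separately named data set. Below is a minimal sketch of normalising the label case, converting to factors and sorting. It assumes a combined data frame shaped like the one built above; the small `dvh` data frame is a hypothetical stand-in, and its rectum row is invented for illustration.

```r
# Hypothetical stand-in for the combined data frame (illustration values only)
dvh <- data.frame(
  Organ = c("bladder", "Bladder", "CTV-operator", "rectum"),
  Dosage = c(57.7787, 0.0910151, 39.2238, 12.5),
  Dosage.Unit = "GY",
  Volume = c(0.196091, 1.34265, 0.00230695, 3.2),
  Volume.Unit = "(CC)",
  stringsAsFactors = FALSE
)

# Labels are sometimes lowercase, so fold them to one case before grouping
dvh$Organ <- factor(tolower(trimws(dvh$Organ)))
dvh$Dosage.Unit <- factor(dvh$Dosage.Unit)
dvh$Volume.Unit <- factor(dvh$Volume.Unit)

# Sort by organ, then by dose, so the order within the file no longer matters
dvh <- dvh[order(dvh$Organ, dvh$Dosage), ]
rownames(dvh) <- NULL

# Inspect the resulting groups and column types
levels(dvh$Organ)
str(dvh)
```

Folding the labels to lowercase is only one possible choice; any consistent spelling works as long as "Bladder" and "bladder" end up in the same factor level.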

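Here is an alternative version that should be much faster than processing the file line by line in a loop. It first reads the whole data file into a single-column data frame and then cleans up the data with vectorised operations.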

# Load required library
library(tidyr)
# Create function to process file
process.file <- function(path){
  # Import data into a single column dataframe (scan skips blank lines by default)
  df <- data.frame(col1 = scan(path, character(), sep = "\n", quiet = TRUE),
                   stringsAsFactors = FALSE)
  # Copy organ names (lines ending in ",,") to new column
  is.organ <- grepl(",,$", df$col1)
  df$organ <- ifelse(is.organ, sub(",,$", "", df$col1), NA_character_)
  # Fill organ name for all rows
  df <- fill(df, organ, .direction = "down")
  # Remove the rows that contained the organ (and the bare ",," separator rows)
  df <- df[!is.organ, ]
  # Copy units (lines ending in ",") into a new column
  is.unit <- grepl(",$", df$col1)
  df$units <- ifelse(is.unit, sub(",$", "", df$col1), NA_character_)
  # Fill units field for all rows
  df <- fill(df, units, .direction = "down")
  # Separate units into dose.unit and vol.unit columns
  df <- separate(df, units, c("dose.unit","vol.unit"), ", ")
  # Remove the rows that contained the units
  df <- df[!is.unit, ]
  # Separate the remaining data into dosage and volume columns
  df <- separate(df, col1, c("dosage","volume"), ",")
  # Set data type of dosage and volume to numeric
  df[,c("dosage","volume")] <- lapply(df[,c("dosage","volume")], as.numeric)
  # Reorder columns
  df <- df[, c("organ","dosage","dose.unit","volume","vol.unit")]
  # Return the dataframe
  return(df)
}
# Set path to root folder directory
source.dir <- "path/to/root/folder" # Placeholder: put the path to the root folder here
# Retrieve all files from folder
# NOTE: To retrieve all files from the folder and all of its subfolders, set: recursive = TRUE
# NOTE: To only include files with certain words in the name, include: pattern = "your.pattern.here"
files <- list.files(source.dir, recursive = FALSE, full.names = TRUE)
# Process each file and store dataframes in list
ldf <- lapply(files, process.file)
# Combine all dataframes to a single dataframe
final.df <- do.call(rbind, ldf)
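
Since the files sit in two folders and together form one patient sample, it may help to record which file each row came from before combining. The following is only a sketch under assumptions: `read.one` is a hypothetical stand-in for `process.file` above (it just reads a plain two-column CSV so the example runs on its own), and the folder layout, file names and values are invented for the demonstration.

```r
# Build a throwaway folder tree with two subfolders (invented layout)
root <- file.path(tempdir(), "dvh_demo")
dir.create(file.path(root, "groupA"), recursive = TRUE, showWarnings = FALSE)
dir.create(file.path(root, "groupB"), recursive = TRUE, showWarnings = FALSE)
writeLines(c("0.09,1.34", "0.20,1.55"), file.path(root, "groupA", "1.csv"))
writeLines(c("39.22,0.002"), file.path(root, "groupB", "2.csv"))

# Hypothetical stand-in for process.file(): returns one data frame per file
read.one <- function(path) {
  read.csv(path, header = FALSE, col.names = c("dosage", "volume"))
}

# recursive = TRUE descends into both subfolders; pattern keeps only .csv files
files <- list.files(root, pattern = "\\.csv$", recursive = TRUE, full.names = TRUE)

# Tag every row with its folder and file name before combining
ldf <- lapply(files, function(f) {
  d <- read.one(f)
  d$folder <- basename(dirname(f))
  d$file <- basename(f)
  d
})
final.df <- do.call(rbind, ldf)
final.df
```

With the extra columns in place, rows can later be grouped or filtered by folder or by file without re-reading anything.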
