Manipulating data from .txt files in R



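Problem introduction

Hello,

I am working on a data plan for my lab, which will begin a blinded clinical trial in January. Part of this task is to set up some data processing pipelines so that, once all the data are collected, we can run the code quickly.

One outcome measure we are using is a behavioral test. Someone developed a JavaScript program that scores the test automatically; however, the output mirrors 5 tables stacked on top of each other. With the help of some Stack Overflow users, I was able to develop a pipeline that restructures a single txt file into a data frame that can be analyzed. The problem I have now is how to process all the files at once.

My idea is to load all the files into a list and then manipulate each element of the list with list.map or lapply. However, I have two problems, which I outline below.

First, here are the code and data that work well for a single data frame.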

input <- c("Cognitive Screen", "Subtest/Section\t\t\tScore\tT-Score", 
"1. Line Bisection\t\t9\t53", "2. Semantic Memory\t\t8\t51", 
"3. Word Fluency\t\t\t1\t56*", "4. Recognition Memory\t\t40\t59", 
"5. Gesture Object Use\t\t2\t68", "6. Arithmetic\t\t\t5\t49", 
"Cognitive TOTAL\t\t\t65", "", "Language Battery", "Part 1: Language Comprehension", 
"Spoken Language\t\t\tScore\tT-Score", "7. Spoken Words\t\t\t17\t45*", 
"9. Spoken Sentences\t\t25\t53*", "11. Spoken Paragraphs\t\t4\t60", 
"Spoken Language TOTAL\t\t46\t49*", "", "Written Language\t\tScore\tT-Score", 
"8. Written Words\t\t14\t45*", "10. Written Sentences\t\t21\t48*", 
"Written Language TOTAL\t\t35\t46*", "", "Part 2: Expressive Language", 
"Repetition\t\t\tScore\tT-Score", "12. Words\t\t\t24\t55*", "13. Complex Words\t\t8\t52*", 
"14. Nonwords\t\t\t10\t58", "15. Digit Strings\t\t8\t55", "16. Sentences\t\t\t12\t63", 
"Repetition TOTAL\t\t62\t57*", "", "Spoken Language\t\t\tScore\tT-Score", 
"17. Naming Objects\t\t30\t55*", "18. Naming Actions\t\t36\t63", 
"3. Word Fluency\t\t\t12\t56*", "Naming TOTAL\t\t\t56\t57*", 
"", "Spoken Picture Description\tScore\tT-Score", "19. Spoken Picture Description\t\t", 
"", "Reading Aloud\t\t\tScore\tT-Score", "20. Words\t\t\t25\t50*", 
"21. Complex Words\t\t8\t51*", "22. Function Words\t\t3\t62", 
"23. Nonwords\t\t\t6\t51*", "Reading TOTAL\t\t\t42\t50*", "", 
"Writing\t\t\t\tScore\tT-Score", "24. Writing: Copying\t\t26\t52", 
"25. Writing Picture Names\t14\t53*", "26. Writing to Dictation\t28\t68", 
"Writing TOTAL\t\t\t68\t58*", "", "Written Picture Description\tScore\tT-Score", 
"27. Written Picture Description\t\t")

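After creating the input file, here is the code I used to create the data frame (I know the data frame is stored as character; I will fix that later):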

library(tidyverse)

input <- read_lines('Example_data')
# do the match and keep only the second column
header <- as_tibble(str_match(input, "^(.*?)\\s+Score.*")[, 2, drop = FALSE])
colnames(header) <- 'title'
# add index to the list so we can match the scores that come after
header <- header %>%
  mutate(row = row_number()) %>%
  fill(title)  # copy title down
# pull off the scores on the numbered rows
scores <- str_match(input, "^([0-9]+[. ]+)(.*?)\\s+([0-9]+)\\s+([0-9*]+)$")
scores <- as_tibble(scores) %>%
  mutate(row = row_number())
scores3 <- mutate(scores, row = row_number())
# keep only rows that are numbered and delete first column
scores <- scores[!is.na(scores[,1]), -1]
# merge the header with the scores to give each section
data <- left_join(scores,
                  header,
                  by = 'row'
)
# create correct header in new dataframe
data2 <- data.frame(domain = as.vector(str_replace(data$title, "Subtest/Section", "cognition")),
                    subtest = data$V3,
                    score = data$V4,
                    t.score = data$V5)
head(data2)

OK, now for multiple data files. My plan is to put all the txt files in one folder and then list all the files, like this:

# library(rlist)
# setwd("C:/Users/Brahma/Desktop/CAT TEXT FILES/Data")
# temp = list.files(pattern = "*Example")
# myfiles = lapply(temp, readLines)
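A minimal sketch of that file-listing step, written so it can run on its own. The folder name, file names, and file contents below are hypothetical and exist only for the demonstration. Two points it illustrates: `pattern` in `list.files()` is a regular expression rather than a shell glob, and `full.names = TRUE` avoids needing `setwd()`.

```r
# Hypothetical setup: write two tiny example files into a temporary folder
data_dir <- file.path(tempdir(), "cat_text_files")
dir.create(data_dir, showWarnings = FALSE)
writeLines(c("Cognitive Screen", "1. Line Bisection\t\t9\t53"),
           file.path(data_dir, "Example_01.txt"))
writeLines(c("Cognitive Screen", "1. Line Bisection\t\t9\t53"),
           file.path(data_dir, "Example_02.txt"))

# `pattern` is a regular expression; full.names = TRUE returns usable paths
paths <- list.files(data_dir, pattern = "^Example.*\\.txt$", full.names = TRUE)

# Read each file into a character vector (one element per line)
myfiles <- lapply(paths, readLines)

# Name each list element after its file so results can be traced back later
names(myfiles) <- tools::file_path_sans_ext(basename(paths))
```

Naming the list elements is optional, but it makes it easier to tell which participant file produced which result once everything is processed in bulk.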

Reproducible example files:

myfiles <- list(c("Cognitive Screen", "Subtest/Section\t\t\tScore\tT-Score", 
"1. Line Bisection\t\t9\t53", "2. Semantic Memory\t\t8\t51", 
"3. Word Fluency\t\t\t1\t56*", "4. Recognition Memory\t\t40\t59", 
"5. Gesture Object Use\t\t2\t68", "6. Arithmetic\t\t\t5\t49", 
"Cognitive TOTAL\t\t\t65", "", "Language Battery", "Part 1: Language Comprehension", 
"Spoken Language\t\t\tScore\tT-Score", "7. Spoken Words\t\t\t17\t45*", 
"9. Spoken Sentences\t\t25\t53*", "11. Spoken Paragraphs\t\t4\t60", 
"Spoken Language TOTAL\t\t46\t49*", "", "Written Language\t\tScore\tT-Score", 
"8. Written Words\t\t14\t45*", "10. Written Sentences\t\t21\t48*", 
"Written Language TOTAL\t\t35\t46*", "", "Part 2: Expressive Language", 
"Repetition\t\t\tScore\tT-Score", "12. Words\t\t\t24\t55*", "13. Complex Words\t\t8\t52*", 
"14. Nonwords\t\t\t10\t58", "15. Digit Strings\t\t8\t55", "16. Sentences\t\t\t12\t63", 
"Repetition TOTAL\t\t62\t57*", "", "Spoken Language\t\t\tScore\tT-Score", 
"17. Naming Objects\t\t30\t55*", "18. Naming Actions\t\t36\t63", 
"3. Word Fluency\t\t\t12\t56*", "Naming TOTAL\t\t\t56\t57*", 
"", "Spoken Picture Description\tScore\tT-Score", "19. Spoken Picture Description\t\t", 
"", "Reading Aloud\t\t\tScore\tT-Score", "20. Words\t\t\t25\t50*", 
"21. Complex Words\t\t8\t51*", "22. Function Words\t\t3\t62", 
"23. Nonwords\t\t\t6\t51*", "Reading TOTAL\t\t\t42\t50*", "", 
"Writing\t\t\t\tScore\tT-Score", "24. Writing: Copying\t\t26\t52", 
"25. Writing Picture Names\t14\t53*", "26. Writing to Dictation\t28\t68", 
"Writing TOTAL\t\t\t68\t58*", "", "Written Picture Description\tScore\tT-Score", 
"27. Written Picture Description\t\t"), c("Cognitive Screen", 
"Subtest/Section\t\t\tScore\tT-Score", "1. Line Bisection\t\t9\t53", 
"2. Semantic Memory\t\t8\t51", "3. Word Fluency\t\t\t1\t56*", 
"4. Recognition Memory\t\t40\t59", "5. Gesture Object Use\t\t2\t68", 
"6. Arithmetic\t\t\t5\t49", "Cognitive TOTAL\t\t\t65", "", "Language Battery", 
"Part 1: Language Comprehension", "Spoken Language\t\t\tScore\tT-Score", 
"7. Spoken Words\t\t\t17\t45*", "9. Spoken Sentences\t\t25\t53*", 
"11. Spoken Paragraphs\t\t4\t60", "Spoken Language TOTAL\t\t46\t49*", 
"", "Written Language\t\tScore\tT-Score", "8. Written Words\t\t14\t45*", 
"10. Written Sentences\t\t21\t48*", "Written Language TOTAL\t\t35\t46*", 
"", "Part 2: Expressive Language", "Repetition\t\t\tScore\tT-Score", 
"12. Words\t\t\t24\t55*", "13. Complex Words\t\t8\t52*", "14. Nonwords\t\t\t10\t58", 
"15. Digit Strings\t\t8\t55", "16. Sentences\t\t\t12\t63", "Repetition TOTAL\t\t62\t57*", 
"", "Spoken Language\t\t\tScore\tT-Score", "17. Naming Objects\t\t30\t55*", 
"18. Naming Actions\t\t36\t63", "3. Word Fluency\t\t\t12\t56*", 
"Naming TOTAL\t\t\t56\t57*", "", "Spoken Picture Description\tScore\tT-Score", 
"19. Spoken Picture Description\t\t", "", "Reading Aloud\t\t\tScore\tT-Score", 
"20. Words\t\t\t25\t50*", "21. Complex Words\t\t8\t51*", "22. Function Words\t\t3\t62", 
"23. Nonwords\t\t\t6\t51*", "Reading TOTAL\t\t\t42\t50*", "", 
"Writing\t\t\t\tScore\tT-Score", "24. Writing: Copying\t\t26\t52", 
"25. Writing Picture Names\t14\t53*", "26. Writing to Dictation\t28\t68", 
"Writing TOTAL\t\t\t68\t58*", "", "Written Picture Description\tScore\tT-Score", 
"27. Written Picture Description\t\t"), c("Cognitive Screen", 
"Subtest/Section\t\t\tScore\tT-Score", "1. Line Bisection\t\t9\t53", 
"2. Semantic Memory\t\t8\t51", "3. Word Fluency\t\t\t1\t56*", 
"4. Recognition Memory\t\t40\t59", "5. Gesture Object Use\t\t2\t68", 
"6. Arithmetic\t\t\t5\t49", "Cognitive TOTAL\t\t\t65", "", "Language Battery", 
"Part 1: Language Comprehension", "Spoken Language\t\t\tScore\tT-Score", 
"7. Spoken Words\t\t\t17\t45*", "9. Spoken Sentences\t\t25\t53*", 
"11. Spoken Paragraphs\t\t4\t60", "Spoken Language TOTAL\t\t46\t49*", 
"", "Written Language\t\tScore\tT-Score", "8. Written Words\t\t14\t45*", 
"10. Written Sentences\t\t21\t48*", "Written Language TOTAL\t\t35\t46*", 
"", "Part 2: Expressive Language", "Repetition\t\t\tScore\tT-Score", 
"12. Words\t\t\t24\t55*", "13. Complex Words\t\t8\t52*", "14. Nonwords\t\t\t10\t58", 
"15. Digit Strings\t\t8\t55", "16. Sentences\t\t\t12\t63", "Repetition TOTAL\t\t62\t57*", 
"", "Spoken Language\t\t\tScore\tT-Score", "17. Naming Objects\t\t30\t55*", 
"18. Naming Actions\t\t36\t63", "3. Word Fluency\t\t\t12\t56*", 
"Naming TOTAL\t\t\t56\t57*", "", "Spoken Picture Description\tScore\tT-Score", 
"19. Spoken Picture Description\t\t", "", "Reading Aloud\t\t\tScore\tT-Score", 
"20. Words\t\t\t25\t50*", "21. Complex Words\t\t8\t51*", "22. Function Words\t\t3\t62", 
"23. Nonwords\t\t\t6\t51*", "Reading TOTAL\t\t\t42\t50*", "", 
"Writing\t\t\t\tScore\tT-Score", "24. Writing: Copying\t\t26\t52", 
"25. Writing Picture Names\t14\t53*", "26. Writing to Dictation\t28\t68", 
"Writing TOTAL\t\t\t68\t58*", "", "Written Picture Description\tScore\tT-Score", 
"27. Written Picture Description\t\t"))

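This is where the problems start

I have tried using lapply and list.map from the rlist package. First, lapply does not seem to like the pipe operator, so I tried working in steps. I also tried creating a function for this step.

Create the tibbles. This works fine: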

list_header <- lapply(myfiles, as.tibble)

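This is where the error occurs, when I try to start the data manipulation:

list_header2 <- lapply(list_header, str_match(list_header, "^(.*?)\\s+Score.*")[, 2, drop = FALSE])

This line of code gives the following error:

Error in match.fun(FUN) : 'str_match(list_header, "^(.*?)\\s+Score.*")[, 2, drop = FALSE]' is not a function, character or symbol. In addition: Warning message: In stri_match_first_regex(string, pattern, opts_regex = opts(pattern)) : argument is not an atomic vector; coercing

So I tried making a function to put in here:

drop_rows <- function(df) {
  new_df <- str_match_all(df[[1:3]]$value, "^(.*?)\\s+Score.*")
}
list_header2 <- lapply(list_header, drop_rows)

Now I get this error:

Error in match.fun(FUN) : 'str_match(list_header, "^(.*?)\\s+Score.*")[, 2, drop = FALSE]' is not a function, character or symbol. In addition: Warning message: In stri_match_first_regex(string, pattern, opts_regex = opts(pattern)) : argument is not an atomic vector; coercing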

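Summary:

The code provided works well when loading a single txt file. However, I run into trouble when I try to run it to batch-process multiple lists. If anyone can offer some insight into how to fix this error, I **think** I will be able to finish the rest. That said, if you are willing to help implement the rest of the code, I won't object.

Rather than trying to debug your code, I decided to look for a solution that works with your example data. The following seems to work for both a single vector and a list of vectors: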

library(tidyverse)

text_to_tibb <- function(char_vec){
  # split each line on tabs and drop the empty fields left by repeated tabs
  str_split(char_vec, "\t") %>% 
    map_dfr(~ .[nchar(.) > 0] %>% matrix(., nrow = 1) %>%
              as_tibble
    ) %>% 
    # drop lines without a score, and the TOTAL rows
    filter(!is.na(V2), !str_detect(V1, "TOTAL")) %>%
    # lines that do not start with a number are section titles
    mutate(title = str_detect(V1, "^\\d+\\.", negate = TRUE),
           group = cumsum(title)
    ) %>% 
    group_by(group) %>%
    mutate(domain = first(V1)) %>% 
    filter(!title) %>% 
    ungroup() %>% 
    select(domain, V1, V2, V3, -title, -group) %>% 
    mutate(V1 = str_remove(V1, "^\\d+\\. "),
           domain = str_replace(domain, "Subtest.*", "Cognition")) %>% 
    rename(subtest = V1, score = V2, t_score = V3)
}

If you run it on the input variable, you should get a clean tibble:

text_to_tibb(input)
#### OUTPUT ####
# A tibble: 26 x 4
   domain           subtest            score t_score
   <chr>            <chr>              <chr> <chr>  
 1 Cognition        Line Bisection     9     53     
 2 Cognition        Semantic Memory    8     51     
 3 Cognition        Word Fluency       1     56*    
 4 Cognition        Recognition Memory 40    59     
 5 Cognition        Gesture Object Use 2     68     
 6 Cognition        Arithmetic         5     49     
 7 Spoken Language  Spoken Words       17    45*    
 8 Spoken Language  Spoken Sentences   25    53*    
 9 Spoken Language  Spoken Paragraphs  4     60     
10 Written Language Written Words      14    45*    
# … with 16 more rows
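The score columns come back as character, and some t-scores carry a trailing asterisk. A minimal base R sketch of one way to convert them, assuming you want to keep the asterisk as a separate logical column. The column name `flagged` is a placeholder, since the meaning of the asterisk is not stated here, and the three rows are copied from the output above.

```r
# Three rows copied from the output above, stored as character
scores <- data.frame(
  domain  = c("Cognition", "Cognition", "Spoken Language"),
  subtest = c("Line Bisection", "Word Fluency", "Spoken Words"),
  score   = c("9", "1", "17"),
  t_score = c("53", "56*", "45*"),
  stringsAsFactors = FALSE
)

# Record whether the t-score carried an asterisk before stripping it
scores$flagged <- grepl("\\*$", scores$t_score)

# Remove the trailing asterisk, then convert both columns to integer
scores$t_score <- as.integer(sub("\\*$", "", scores$t_score))
scores$score   <- as.integer(scores$score)
```

Recording the flag before stripping the asterisk matters, because the information is lost once the column is numeric.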

It also works on the list of vectors included above. Just use lapply or purrr::map:

map(myfiles, text_to_tibb)
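If you would rather end up with one data frame than a list of them, a named list can be stacked with `dplyr::bind_rows()` and its `.id` argument. The sketch below is self-contained, so it uses a simplified stand-in parser called `parse_one` (hypothetical, not the function above) and made-up element names such as `participant_01`.

```r
library(dplyr)
library(purrr)

# Hypothetical stand-in for a per-file parser: keeps the numbered lines
# and splits them on runs of tabs
parse_one <- function(char_vec) {
  rows  <- char_vec[grepl("^[0-9]+\\.", char_vec)]
  parts <- strsplit(rows, "\t+")
  tibble(
    subtest = map_chr(parts, 1),
    score   = map_chr(parts, 2),
    t_score = map_chr(parts, 3)
  )
}

# Named list of character vectors; the names are assumptions for illustration
myfiles <- list(
  participant_01 = c("Cognitive Screen", "1. Line Bisection\t\t9\t53"),
  participant_02 = c("Cognitive Screen", "2. Semantic Memory\t\t8\t51")
)

# Parse every element, then stack the results; `.id` stores the list names
all_scores <- myfiles %>%
  map(parse_one) %>%
  bind_rows(.id = "file")
```

The `file` column then identifies which input each row came from, which is usually needed before any between-participant analysis.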

If you think some of the tables may contain inconsistencies, you might want to try safely:

safe_text_to_tibb <- safely(text_to_tibb)
map(myfiles, safe_text_to_tibb)
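A function wrapped in `safely()` returns a list with two elements, `result` and `error`, one of which is `NULL`. Here is a small sketch of how the successes and failures might be separated afterwards. The parser `parse_one` and the inputs are hypothetical and chosen only so that one call fails.

```r
library(purrr)

# Hypothetical parser that fails on empty input, used only for the demonstration
parse_one <- function(char_vec) {
  if (length(char_vec) == 0) stop("empty file")
  length(char_vec)
}
safe_parse <- safely(parse_one)

# One input that parses and one that triggers the error
files <- list(good = c("a", "b"), bad = character(0))
out <- map(files, safe_parse)

# Pull the two components apart by name
results <- map(out, "result")
errors  <- map(out, "error")

# Names of the inputs that failed, and the successful results only
failed <- names(keep(errors, ~ !is.null(.x)))
ok     <- compact(results)
```

Checking `failed` after a batch run shows which files need a closer look, without one bad file stopping the whole run.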
