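Problem introduction
Hello,
I am putting together a data plan for my lab, which will start a blinded clinical trial in January. Part of this task is to set up data-processing pipelines so that we can run the code quickly once all the data have been collected.
One of our outcome measures is a behavioral test. Someone developed a JavaScript program that scores the test automatically; however, the output mirrors five tables stacked on top of one another. With help from some Stack Overflow users, I was able to develop a pipeline that reshapes a single txt file into a data frame that can be analyzed. My problem now is how to process all of the files at once.
My idea is to load all of the files into a list and then use list.map or lapply to operate on each element of the list. However, I have run into two problems, which I outline below.
First, here are the code and data that work well for a single data frame.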
input <- c("Cognitive Screen", "Subtest/Section\t\t\tScore\tT-Score",
  "1. Line Bisection\t\t9\t53", "2. Semantic Memory\t\t8\t51",
  "3. Word Fluency\t\t\t1\t56*", "4. Recognition Memory\t\t40\t59",
  "5. Gesture Object Use\t\t2\t68", "6. Arithmetic\t\t\t5\t49",
  "Cognitive TOTAL\t\t\t65", "", "Language Battery", "Part 1: Language Comprehension",
  "Spoken Language\t\t\tScore\tT-Score", "7. Spoken Words\t\t\t17\t45*",
  "9. Spoken Sentences\t\t25\t53*", "11. Spoken Paragraphs\t\t4\t60",
  "Spoken Language TOTAL\t\t46\t49*", "", "Written Language\t\tScore\tT-Score",
  "8. Written Words\t\t14\t45*", "10. Written Sentences\t\t21\t48*",
  "Written Language TOTAL\t\t35\t46*", "", "Part 2: Expressive Language",
  "Repetition\t\t\tScore\tT-Score", "12. Words\t\t\t24\t55*", "13. Complex Words\t\t8\t52*",
  "14. Nonwords\t\t\t10\t58", "15. Digit Strings\t\t8\t55", "16. Sentences\t\t\t12\t63",
  "Repetition TOTAL\t\t62\t57*", "", "Spoken Language\t\t\tScore\tT-Score",
  "17. Naming Objects\t\t30\t55*", "18. Naming Actions\t\t36\t63",
  "3. Word Fluency\t\t\t12\t56*", "Naming TOTAL\t\t\t56\t57*",
  "", "Spoken Picture Description\tScore\tT-Score", "19. Spoken Picture Description\t\t",
  "", "Reading Aloud\t\t\tScore\tT-Score", "20. Words\t\t\t25\t50*",
  "21. Complex Words\t\t8\t51*", "22. Function Words\t\t3\t62",
  "23. Nonwords\t\t\t6\t51*", "Reading TOTAL\t\t\t42\t50*", "",
  "Writing\t\t\t\tScore\tT-Score", "24. Writing: Copying\t\t26\t52",
  "25. Writing Picture Names\t14\t53*", "26. Writing to Dictation\t28\t68",
  "Writing TOTAL\t\t\t68\t58*", "", "Written Picture Description\tScore\tT-Score",
  "27. Written Picture Description\t\t")
With the input file created, here is the code I use to build the data frame (I know the columns are stored as character; I will fix that later):
library(tidyverse)
input <- read_lines('Example_data')
# match the section headers and keep only the captured title (second column)
# note: newer tibble versions warn that the matrix has no column names and fall back to V1, V2, ...
header <- as_tibble(str_match(input, "^(.*?)\\s+Score.*")[, 2, drop = FALSE])
colnames(header) <- 'title'
# add a row index so the scores that follow can be matched to their header
header <- header %>%
  mutate(row = row_number()) %>%
  fill(title)  # copy each title down to the rows below it
# pull the scores off the numbered rows
scores <- str_match(input, "^([0-9]+[. ]+)(.*?)\\s+([0-9]+)\\s+([0-9*]+)$")
scores <- as_tibble(scores) %>%
  mutate(row = row_number())
# keep only the numbered rows and drop the first column (the full match)
scores <- scores[!is.na(scores[[1]]), -1]
# merge the headers with the scores so each score carries its section title
data <- left_join(scores,
                  header,
                  by = 'row'
)
# build a new data frame with the correct column names
data2 <- data.frame(domain = as.vector(str_replace(data$title, "Subtest/Section", "cognition")),
                    subtest = data$V3,
                    score = data$V4,
                    t.score = data$V5)
head(data2)
OK, now for multiple data files. My plan is to put all of the txt files in one folder and then list all of the files, like this:
# library(rlist)
# setwd("C:/Users/Brahma/Desktop/CAT TEXT FILES/Data")
# temp = list.files(pattern = "*Example")
# myfiles = lapply(temp, readLines)
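A minimal sketch of that plan, assuming a hypothetical folder and hypothetical file names (created in a temporary directory here only so the example runs on its own). It uses base R only. Note that list.files() treats its pattern argument as a regular expression rather than a shell glob, and that naming the list elements after the files makes it easier to trace results back to a participant later.

```r
# Hypothetical folder and file names, created only so this sketch is runnable.
data_dir <- file.path(tempdir(), "cat_text_files")
dir.create(data_dir, showWarnings = FALSE)
writeLines(c("Cognitive Screen", "1. Line Bisection\t\t9\t53"),
           file.path(data_dir, "P01_Example.txt"))
writeLines(c("Cognitive Screen", "2. Semantic Memory\t\t8\t51"),
           file.path(data_dir, "P02_Example.txt"))

# `pattern` is a regular expression, so escape the dot and anchor the end.
paths <- list.files(data_dir, pattern = "Example\\.txt$", full.names = TRUE)

# Name each path after its file; lapply() keeps these names on the result.
names(paths) <- tools::file_path_sans_ext(basename(paths))
myfiles <- lapply(paths, readLines)

names(myfiles)
```

Passing full.names = TRUE avoids the need for setwd(), which keeps the script independent of the current working directory.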
Reproducible example files:
# The list holds three character vectors, each identical to the
# `input` vector defined above, so it can be rebuilt from that vector.
myfiles <- list(input, input, input)
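This is where the problem starts
I have tried lapply, and list.map from the rlist package. First, lapply did not seem to like the pipe, so I tried working in steps. I also tried writing a function for this step.
Create the tibbles. This works:
list_header <- lapply(myfiles, as_tibble)
Here is where the error occurs, when I try to start manipulating the data:
list_header2 <- lapply(list_header, str_match(list_header, "^(.*?)\\s+Score.*")[, 2, drop = FALSE])
This line of code gives the following error:
Error in match.fun(FUN) : 'str_match(list_header, "^(.*?)\\s+Score.*")[, 2, drop = FALSE]' is not a function, character or symbol
In addition: Warning message:
In stri_match_first_regex(string, pattern, opts_regex = opts(pattern)) : argument is not an atomic vector; coercing
So I tried writing a function to put here:
drop_rows <- function(df) {
  new_df <- str_match_all(df[[1:3]]$value, "^(.*?)\\s+Score.*")
}
list_header2 <- lapply(list_header, drop_rows)
Now I get this error:
Error in match.fun(FUN) : 'str_match(list_header, "^(.*?)\\s+Score.*")[, 2, drop = FALSE]' is not a function, character or symbol
In addition: Warning message:
In stri_match_first_regex(string, pattern, opts_regex = opts(pattern)) : argument is not an atomic vector; coercing
Summary:
The code provided works well when a single txt file is loaded. However, I run into trouble when I try to run it in batch over a list of several files. If someone can offer some insight into how to fix this error, **I think** I will be able to finish the rest. That said, if you are willing to help implement the rest of the code, I will not object.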
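On the error itself, one likely reading is that lapply() expects a function as its second argument, whereas the failing call passes the result of an already evaluated str_match() expression. The sketch below is only an illustration of that point: it uses base R sub() and grepl() in place of str_match(), and get_titles and lines_list are hypothetical names that do not appear in the original code.

```r
# Two short character vectors standing in for two files (lines taken from the example data).
lines_list <- list(
  c("Spoken Language\t\t\tScore\tT-Score", "7. Spoken Words\t\t\t17\t45*"),
  c("Written Language\t\tScore\tT-Score", "8. Written Words\t\t14\t45*")
)

# Hypothetical helper: receives ONE element of the list at a time.
get_titles <- function(x) {
  is_header <- grepl("\\s+Score", x, perl = TRUE)   # header lines contain "Score"
  sub("\\s+Score.*$", "", x[is_header], perl = TRUE) # strip everything from the tabs onward
}

# Pass the function itself; lapply() calls it once per element.
titles <- lapply(lines_list, get_titles)

# An anonymous function works the same way.
n_lines <- lapply(lines_list, function(x) length(x))
```

Inside the function, x is a single element of the list, so there is no need to index into the list again as df[[1:3]] does.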
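Rather than trying to debug your code, I decided to look for a solution that works with your example data. The following seems to work for both a single vector and a list of vectors: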
library(tidyverse)
text_to_tibb <- function(char_vec){
  # split each line on tabs, drop the empty fields, and bind the rows together
  str_split(char_vec, "\t") %>%
    map_dfr(~ .[nchar(.) > 0] %>% matrix(., nrow = 1) %>%
              as_tibble
    ) %>%
    # keep only lines that have a second field, and drop the TOTAL rows
    filter(!is.na(V2), !str_detect(V1, "TOTAL")) %>%
    # a line that does not start with "<number>." is a section title
    mutate(title = str_detect(V1, "^\\d+\\.", negate = TRUE),
           group = cumsum(title)
    ) %>%
    group_by(group) %>%
    mutate(domain = first(V1)) %>%
    filter(!title) %>%
    ungroup() %>%
    select(domain, V1, V2, V3, -title, -group) %>%
    mutate(V1 = str_remove(V1, "^\\d+\\. "),
           domain = str_replace(domain, "Subtest.*", "Cognition")) %>%
    rename(subtest = V1, score = V2, t_score = V3)
}
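To see what the first step of the function does to a single line, here is a small sketch. It assumes base R strsplit() as a stand-in for stringr's str_split(); the two differ in how they treat trailing empty fields, but after the empty strings are dropped the result is the same for these lines.

```r
# One scored line and one unscored line from the example data.
scored   <- "1. Line Bisection\t\t9\t53"
unscored <- "19. Spoken Picture Description\t\t"

# Consecutive tabs leave empty strings between the real fields.
raw_fields <- strsplit(scored, "\t", fixed = TRUE)[[1]]

# Dropping the empty strings leaves name, score, and T-score.
fields <- raw_fields[nchar(raw_fields) > 0]

# A line with no scores keeps only its name, so it has no second field.
unscored_fields <- strsplit(unscored, "\t", fixed = TRUE)[[1]]
unscored_fields <- unscored_fields[nchar(unscored_fields) > 0]
```

Lines that end up with a single field get NA in the second column once the rows are bound together, which is why the filter on V2 removes both the unscored subtests and the plain section labels.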
If you run it on the input variable, you should get a clean tibble:
text_to_tibb(input)
#### OUTPUT ####
# A tibble: 26 x 4
   domain           subtest            score t_score
   <chr>            <chr>              <chr> <chr>
 1 Cognition        Line Bisection     9     53
 2 Cognition        Semantic Memory    8     51
 3 Cognition        Word Fluency       1     56*
 4 Cognition        Recognition Memory 40    59
 5 Cognition        Gesture Object Use 2     68
 6 Cognition        Arithmetic         5     49
 7 Spoken Language  Spoken Words       17    45*
 8 Spoken Language  Spoken Sentences   25    53*
 9 Spoken Language  Spoken Paragraphs  4     60
10 Written Language Written Words      14    45*
# … with 16 more rows
It also works on the list of vectors included above. Just use lapply or purrr::map:
map(myfiles, text_to_tibb)
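If the per-file tables eventually need to go into one data frame for analysis, a possible next step is to stack them with an identifier column. The sketch below is a base R illustration under assumptions: parsed is a hypothetical named list of already parsed tables, and the participant labels P01 and P02 are placeholders, not names from the original data.

```r
# Hypothetical parsed results for two participants (placeholder rows only).
parsed <- list(
  P01 = data.frame(subtest = c("Line Bisection", "Semantic Memory"),
                   score = c("9", "8"), stringsAsFactors = FALSE),
  P02 = data.frame(subtest = c("Line Bisection", "Semantic Memory"),
                   score = c("9", "8"), stringsAsFactors = FALSE)
)

# Prepend the list name to each table as an identifier column.
tagged <- Map(
  function(df, id) data.frame(participant = id, df, stringsAsFactors = FALSE),
  parsed, names(parsed)
)

# Stack the tagged tables into one data frame and reset the row names.
combined <- do.call(rbind, tagged)
rownames(combined) <- NULL
```

With the tidyverse loaded, dplyr::bind_rows(parsed, .id = "participant") should give an equivalent result in one call, provided the list is named.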
If you think some of the tables may contain inconsistencies, you may want to try safely:
safe_text_to_tibb <- safely(text_to_tibb)
map(myfiles, safe_text_to_tibb)
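A function wrapped in safely() returns a list with two elements, result and error, one of which is NULL. The sketch below shows one way to separate the successes from the failures afterwards. It uses log() on toy inputs purely as a stand-in for text_to_tibb, and the names a and b are placeholders.

```r
library(purrr)

# Stand-in for safe_text_to_tibb: log() fails on non-numeric input.
safe_log <- safely(log)

# One input that succeeds and one that fails.
out <- map(list(a = 10, b = "x"), safe_log)

# An element succeeded when its error slot is NULL.
ok <- map_lgl(out, ~ is.null(.x$error))

# Keep the successful results and note which inputs failed.
results <- map(out[ok], "result")
failed  <- names(out)[!ok]
```

Applied to the files, the failed vector would list the names of the files whose tables need a closer look, assuming the input list is named.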