I have R code that extracts information from one document. How do I loop it over all the documents in a folder?



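I have a folder of txt files from which I want to extract specific text and put it into a new data frame. I wrote code that works for one file, but I can't seem to turn it into a loop that runs over all the documents in my folder.

This is my code for one txt file: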

library(dplyr)    # %>%, mutate, filter, rename
library(tidyr)    # fill, unnest
library(stringr)  # str_replace_all

clean_text <- as.data.frame(strsplit(text$text, '\\*'), col.names = "text") %>%
  # newlines become spaces, "- " line-break hyphenation is joined, leading whitespace is dropped
  mutate(text = str_replace_all(text, "\n", " "),
         text = str_replace_all(text, "- ", ""),
         text = str_replace_all(text, "^\\s", "")) %>%
  filter(!text == " ") %>%
  # rows starting with a digit are numbered paragraphs; the rest are category headings
  mutate(paragraphs = ifelse(grepl("^[[:digit:]]", text), text, NA)) %>%
  rename(category = text) %>%
  mutate(category = ifelse(grepl("^[[:digit:]]", category), NA, category)) %>%
  fill(category) %>%
  filter(!is.na(paragraphs)) %>%
  # split on the paragraph numbers, e.g. "12."
  mutate(paragraphs = strsplit(paragraphs, '^[[:digit:]]{1,3}\\.|\\t\\s[[:digit:]]{1,3}\\.')) %>%
  unnest(paragraphs) %>%
  mutate(paragraphs = strsplit(paragraphs, 'Download as PDF')) %>%
  unnest(paragraphs) %>%
  mutate(paragraphs = str_replace_all(paragraphs, "\t", "")) %>%
  mutate(paragraphs = ifelse(grepl("javascript", paragraphs), "", paragraphs)) %>%
  mutate(paragraphs = str_replace_all(paragraphs, "^\\s+", "")) %>%
  filter(!paragraphs == "")

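How do I turn this into a loop? I realize there are similar questions, but none of the solutions have worked for me. Thanks in advance for your help!

Put your code in a function: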

extract_info = function(file) {
  ## Add the code you need to read the text from the file
  ## Something like
  ## text <- readLines(file)
  ## or whatever you are using to read in the file
  ## (assumes dplyr, tidyr and stringr are loaded)
  clean_text <- as.data.frame(strsplit(text$text, '\\*'), col.names = "text") %>%
    mutate(text = str_replace_all(text, "\n", " "),
           text = str_replace_all(text, "- ", ""),
           text = str_replace_all(text, "^\\s", "")) %>%
    filter(!text == " ") %>%
    mutate(paragraphs = ifelse(grepl("^[[:digit:]]", text), text, NA)) %>%
    rename(category = text) %>%
    mutate(category = ifelse(grepl("^[[:digit:]]", category), NA, category)) %>%
    fill(category) %>%
    filter(!is.na(paragraphs)) %>%
    mutate(paragraphs = strsplit(paragraphs, '^[[:digit:]]{1,3}\\.|\\t\\s[[:digit:]]{1,3}\\.')) %>%
    unnest(paragraphs) %>%
    mutate(paragraphs = strsplit(paragraphs, 'Download as PDF')) %>%
    unnest(paragraphs) %>%
    mutate(paragraphs = str_replace_all(paragraphs, "\t", "")) %>%
    mutate(paragraphs = ifelse(grepl("javascript", paragraphs), "", paragraphs)) %>%
    mutate(paragraphs = str_replace_all(paragraphs, "^\\s+", "")) %>%
    filter(!paragraphs == "")
  ## return the result visibly (a bare assignment as the last line returns it invisibly)
  clean_text
}
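
The function above leaves the file-reading step open. Here is a minimal sketch of one way to fill it, assuming each file should become a one-row data frame with a `text` column so that `text$text` in the pipeline resolves. `read_text_file` is a hypothetical helper name, not something from the question, and the real files may need a different reader or encoding:

```r
# Hypothetical helper: read a whole .txt file into a one-row data frame
# whose `text` column holds the file contents as a single string.
read_text_file <- function(file) {
  lines <- readLines(file, warn = FALSE, encoding = "UTF-8")
  data.frame(text = paste(lines, collapse = "\n"), stringsAsFactors = FALSE)
}

# Small demonstration on a temporary file (made-up contents)
tmp <- tempfile(fileext = ".txt")
writeLines(c("*Category A", "1. first paragraph", "2. second paragraph"), tmp)
text <- read_text_file(tmp)
str(text)
```

Inside `extract_info`, the first line would then be something like `text <- read_text_file(file)`.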

Test your function to make sure it works on one file:

extract_info("your_file_name.txt")
## does the result work and look right? 
## work on your function until it does

Get a list of all the files you want to run it on:

my_files = list.files()
## by default this will give you all the files in your working directory
## use the `pattern` argument if you only want files that follow
## a certain naming convention
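
As a rough illustration of the `pattern` argument mentioned above, the sketch below builds a throwaway folder and lists only its `.txt` files. The folder and file names are made up for the demo; `full.names = TRUE` is an optional extra that returns complete paths, which helps when the files are not in the working directory:

```r
# Build a small temporary folder so the example is self-contained
demo_dir <- tempfile("txt_demo")
dir.create(demo_dir)
file.create(file.path(demo_dir, c("a.txt", "b.txt", "notes.csv")))

# pattern is a regular expression: "\\.txt$" matches names ending in ".txt"
# full.names = TRUE returns paths that can be passed directly to a reader
txt_files <- list.files(path = demo_dir, pattern = "\\.txt$", full.names = TRUE)
basename(txt_files)
```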

Apply your function to those files:

results = lapply(my_files, extract_info)
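
`lapply` returns a list with one element per file. Since the question asks for a single data frame, here is a sketch of one way to stack the pieces while keeping track of which file each row came from. `extract_info_demo` is a hypothetical stand-in for the real `extract_info`, and the `file` column name is my own choice:

```r
# Stand-in for extract_info(): returns a small data frame per file.
# (Hypothetical; the real function would run the cleaning pipeline.)
extract_info_demo <- function(file) {
  lines <- readLines(file, warn = FALSE)
  data.frame(paragraphs = lines, stringsAsFactors = FALSE)
}

# Create two temporary files to iterate over
demo_dir <- tempfile("combine_demo")
dir.create(demo_dir)
writeLines(c("one", "two"), file.path(demo_dir, "a.txt"))
writeLines("three", file.path(demo_dir, "b.txt"))
my_files <- list.files(demo_dir, pattern = "\\.txt$", full.names = TRUE)

# One data frame per file, as with lapply(my_files, extract_info)
results <- lapply(my_files, extract_info_demo)

# Tag each result with its source file, then stack them row-wise
results <- Map(function(df, f) { df$file <- basename(f); df }, results, my_files)
all_results <- do.call(rbind, results)
all_results
```

If dplyr is already loaded, `bind_rows(results)` should do the same stacking step as `do.call(rbind, results)`.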

Rather than a loop, I use lapply, which behaves the same way as a loop over the files:

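library(readr)  # provides read_file()

my_path <- "C:/Users/SAID ABIDI/Desktop/test/"
my_a <- list.files(path = my_path)
## read the x-th file in my_a into a single string
my_function <- function(x) {
  read_file(paste(my_path, my_a[x], sep = ""))
}
## seq_along() is safe when the folder is empty, unlike 1:length(my_a)
my_var <- lapply(seq_along(my_a), my_function)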
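
To make the "same behavior as a loop" point concrete, here is a small sketch comparing an explicit `for` loop with `lapply` on two temporary files. The file names and contents are invented for the demo, and base `readLines` is used in place of `read_file` so it runs without extra packages:

```r
# Two small temporary files stand in for the folder contents
demo_dir <- tempfile("loop_demo")
dir.create(demo_dir)
writeLines("alpha", file.path(demo_dir, "a.txt"))
writeLines("beta", file.path(demo_dir, "b.txt"))
files <- list.files(demo_dir, full.names = TRUE)

# Explicit for loop: preallocate a list, fill one slot per file
loop_result <- vector("list", length(files))
for (i in seq_along(files)) {
  loop_result[[i]] <- readLines(files[i])
}

# lapply: same iteration, with the result list built automatically
lapply_result <- lapply(files, readLines)

identical(loop_result, lapply_result)
```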

Does this help?
