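I have a long Word document that lists items like this:
- Item 1
  - entry1
  - entry2
  - entry3
- Item 2
  - entry1
  - entry2
  - entry3
- (etc.)
The items are species names and the entries are the corresponding location and date information, but that is not important for now.
I am trying to turn this very long document into a sensible table/tibble object in R. My idea was to use:
library(stringr)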
data <- readLines("data.txt")
test_data <- str_sub(data, 1, 3)
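and then use the "Item" lines to identify each element of the data (i.e., which species each date + location belongs to). I tried to use a for loop for this, testing whether each line starts with blank space or not, but I got stuck.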
results <- vector (length = length(data))
for (i in 1:length(data)) {
if (test_data[i] != " ") {
results[i] = data[i]
} else {
while #here I am stuck
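Thanks.
I think I have something to get you started. The idea is to load the text file as a single long string, split it into pieces that each hold one Item plus its entries, and store those pieces in a list. Finally, use lapply on the list to separate each Item from its entries.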
filename <- "test.txt"
# read your file as a single long string
step1 <- readChar(filename, file.info(filename)$size)
# find the pattern that separates each Item (from a copy/paste of the example it is "\r\n\r\n",
# i.e. a blank line with Windows line endings) and make a list of items with associated entries
step2 <- as.list(unlist(strsplit(step1, split = "\r\n\r\n")))
# lastly split each element of step2 on a line break followed by the entry indentation
step3 <- lapply(step2, function(x) unlist(strsplit(x, split = "\r\n\\s+")))
Output:
> step3
[[1]]
[1] "Item 1" "entry1" "entry2" "entry3"
[[2]]
[1] "Item 2" "entry1" "entry2" "entry3"
From here you can start using the "usual" tools for cleaning and organizing data, for example:
df <- as.data.frame(do.call(rbind, step3))
df <- tidyr::pivot_longer(df, 2:ncol(df))
df <- df[, -2]
names(df) <- c("Items", "Entries")
df
# A tibble: 6 x 2
Items Entries
<chr> <chr>
1 Item 1 entry1
2 Item 1 entry2
3 Item 1 entry3
4 Item 2 entry1
5 Item 2 entry2
6 Item 2 entry3
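One caveat: `do.call(rbind, step3)` only gives a clean result when every item has the same number of entries; with unequal counts, `rbind` recycles the shorter vectors and issues a warning. A minimal base R sketch that avoids this, assuming each element of the list has the item first and its entries after it (the `step3` below is a made-up stand-in, and `items`, `entries` and `df_long` are illustrative names):

```r
# Hypothetical stand-in for step3, with a different number of entries per item
step3 <- list(
  c("Item 1", "entry1", "entry2", "entry3", "entry4"),
  c("Item 2", "entry1", "entry2")
)

# The first element of each vector is the item, the rest are its entries
items   <- vapply(step3, function(x) x[1], character(1))
entries <- lapply(step3, function(x) x[-1])

# Repeat each item once per entry, then flatten the entries into one column
df_long <- data.frame(
  Items   = rep(items, times = lengths(entries)),
  Entries = unlist(entries),
  stringsAsFactors = FALSE
)
df_long
```

This builds the long format directly, so no pivoting step is needed afterwards.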
Here is a tidyverse approach that relies on the fact that each entry starts with 12 spaces.
# fake data: item lines start at column 1, entry lines are indented by 12 spaces
obj <- c("Item 1",
         "            entry1",
         "            entry2",
         "            entry3",
         "            entry4",
         "Item 2",
         "            entry1",
         "            entry2",
         "            entry3"
)
writeLines(obj, con = "data2.txt")
# read in and convert
library(tidyverse)
dat <- readLines("data2.txt", skipNul = TRUE)  # the fake data file written above; use your own file name here
dat |>
enframe() |>
separate(
value,
into = c("item", "entry"),
sep = "\\s{12}",  # regex: exactly 12 whitespace characters (the backslash must be doubled in an R string)
convert = TRUE,
fill = "right"
) |>
mutate(item = na_if(item, "")) |>
fill(item, .direction = "down") |>
filter(!(is.na(entry)))
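
For comparison, the same "carry the last item down" idea can be sketched in base R without an explicit loop. This is only a sketch under stated assumptions: the first line is an item, item lines start at column 1, entry lines start with whitespace, and there are no blank lines. The `lines` vector is made-up sample data, and `is_item`, `group` and `result` are illustrative names.

```r
# Hypothetical input lines: items start at column 1, entries are indented
lines <- c("Item 1",
           "            entry1",
           "            entry2",
           "Item 2",
           "            entry1")

is_item <- !grepl("^\\s", lines)  # TRUE for item lines, FALSE for entry lines
group   <- cumsum(is_item)        # running item index for every line

# For each entry line, look up the item it falls under and trim the indentation
result <- data.frame(
  Items   = lines[is_item][group[!is_item]],
  Entries = trimws(lines[!is_item]),
  stringsAsFactors = FALSE
)
result
```

Here `cumsum(is_item)` increases by one at every item line, so every entry line inherits the index of the most recent item above it. That is the job the unfinished `while` branch in the question was meant to do.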