I have a list of character strings (it is a column in a data frame), for example:
list("4 pieces of tissue, the largest measuring 4 x 3 x 2 m",
NA_character_, NA_character_, "4 pieces of tissue, the largest measuring 4 x 2 x 2m",
"2 pieces of tissue, the larger measuring 4 x 2 x 2 m", c("4 pieces of tissue, the largest measuring 5 x 4 x 2 m",
"4 pieces of tissue, the largest measuring 6 x 2 x 1 m",
"4 pieces of tissue, the largest measuring 4 x 3 x 1 m"),
NA_character_, c("4 pieces of tissue, the largest measuring 4 x 3 x 2 m",
"4 pieces of tissue, the largest measuring 5 x 2 x 2 m",
"4 pieces of tissue, the largest measuring 4 x 2 x 1 m"),
NA_character_, "4 pieces of tissue, the largest measuring 8 x 2 x 2m")
This list was generated by the following line:
x$NumbOfBx <- str_extract_all(x[,y], "([A-Za-z]*|[0-9]) (specimens|pieces).*?(([0-9]).*?x.*?([0-9]).*?x.*?([0-9])).*?([a-z])")
which is part of the function below.
I want to extract the total number of tissue pieces for each element of the list. I have been trying:
function(x, y) {
  x <- data.frame(x)
  x$NumbOfBx <- str_extract_all(x[,y], "([A-Za-z]*|[0-9]) (specimens|pieces).*?(([0-9]).*?x.*?([0-9]).*?x.*?([0-9])).*?([a-z])")
  x$NumbOfBx <- sapply(x$NumbOfBx, function(x) sum(as.numeric(unlist(str_extract_all(x$NumbOfBx, "^\\d+")))))
  x$NumbOfBxs <- unlist(x$NumbOfBx)
  x$NumbOfBx <- as.numeric(str_extract(x$NumbOfBx, "^.*?\\d"))
  return(x)
}
but I get the error:
Error in x$NumbOfBx : $ operator is invalid for atomic vectors
data
L <- list("4 pieces of tissue, the largest measuring 4 x 3 x 2 m",
NA_character_, NA_character_, "4 pieces of tissue, the largest measuring 4 x 2 x 2m",
"2 pieces of tissue, the larger measuring 4 x 2 x 2 m", c("4 pieces of tissue, the largest measuring 5 x 4 x 2 m",
"4 pieces of tissue, the largest measuring 6 x 2 x 1 m",
"4 pieces of tissue, the largest measuring 4 x 3 x 1 m"),
NA_character_, c("4 pieces of tissue, the largest measuring 4 x 3 x 2 m",
"4 pieces of tissue, the largest measuring 5 x 2 x 2 m",
"4 pieces of tissue, the largest measuring 4 x 2 x 1 m"),
NA_character_, "4 pieces of tissue, the largest measuring 8 x 2 x 2m")
A one-liner base R solution:
# substr() runs from the match start to start + match.length - 1, so multi-digit counts are kept whole
sapply(L, function(x) { m <- regexpr("\\d+(?= pieces of tissue)", x, perl=TRUE)
sum(as.numeric(substr(x, m, m + attr(m, "match.length") - 1))) })
Output:
4 NA NA 4 2 12 NA 12 NA 4
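As for the error in the question, it most likely arises because the argument of the anonymous function is also named x, so inside sapply() the name x refers to a single list element (a character vector) rather than the data frame, and `$` cannot be applied to it. The following is a minimal sketch of that effect; the `matches` list is a hypothetical, shortened stand-in for the list-column, not the original data.

```r
# Hypothetical miniature of the list-column from the question
matches <- list("4 pieces of tissue", NA_character_,
                c("4 pieces of tissue", "2 pieces of tissue"))

# Inside the anonymous function, x is one list element (an atomic character
# vector), so x$NumbOfBx fails; tryCatch() captures the error message.
err <- tryCatch(
  sapply(matches, function(x) x$NumbOfBx),
  error = function(e) conditionMessage(e)
)
print(err)

# Working on the function's own argument avoids the problem.
# sub() keeps only the leading run of digits of each string.
counts <- sapply(matches, function(el) {
  sum(as.numeric(sub("^(\\d+).*", "\\1", el)))
})
print(counts)
```

Giving the inner argument a different name from the data frame (here `el`) makes this kind of shadowing easier to spot.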
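Something like this? In short, assuming your data is a list, you can extract the numeric value that precedes a word such as sample or specimen, convert it to a number, and then sum the counts found in each vector of the list. This is the strategy you proposed, with a few modifications...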
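# Assuming your list is defined as my.list
xtr.pieces <- function(ml) {
  my.sums <- lapply(ml, function(el) {
    # Sum the counts found in each string of this list element
    sum(sapply(el, function(tmp) {
      if (is.na(tmp)) return(0)  # NAs count as 0
      # Alternation group instead of a character class;
      # "piece" is included because the sample data uses "pieces"
      loc <- regexpr("[0-9]{1,2}.{0,3}(sample|specimen|piece)", tmp)
      if (loc < 0) return(0)  # strings without a match also count as 0
      tmp <- substr(tmp, loc, loc + attr(loc, "match.length") - 1)
      as.numeric(gsub("[^[:digit:]]", "", tmp))
    }))
  })
  return(my.sums)
}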
NAs are counted as 0 here. You can run it and get:
unlist(xtr.pieces(my.list))
[1] 4 0 0 4 2 12 0 12 0 4
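The two answers differ mainly in how NA elements are treated (NA versus 0). As a rough sketch of how both behaviours could be combined in base R, the helper below uses gregexpr() with regmatches(), which also handles multi-digit counts. The name `count_pieces`, its `na_as_zero` parameter and the `demo` list are hypothetical and introduced here only for illustration.

```r
# Hypothetical helper: sum the piece/specimen counts in each list element.
# na_as_zero = FALSE returns NA for all-NA elements, TRUE returns 0.
count_pieces <- function(lst, na_as_zero = FALSE) {
  # Digits followed by " pieces" or " specimens" (Perl-style lookahead)
  pattern <- "\\d+(?= (pieces|specimens))"
  vapply(lst, function(el) {
    if (all(is.na(el))) return(if (na_as_zero) 0 else NA_real_)
    el <- el[!is.na(el)]  # assumption: stray NAs inside a vector are ignored
    hits <- unlist(regmatches(el, gregexpr(pattern, el, perl = TRUE)))
    sum(as.numeric(hits))
  }, numeric(1))
}

# Shortened, made-up data in the same shape as the question's list
demo <- list("4 pieces of tissue, the largest measuring 4 x 3 x 2 m",
             NA_character_,
             c("12 pieces of tissue, the largest measuring 5 x 4 x 2 m",
               "3 specimens, the largest measuring 6 x 2 x 1 m"))

res_na   <- count_pieces(demo)
res_zero <- count_pieces(demo, na_as_zero = TRUE)
print(res_na)
print(res_zero)
```

vapply() is used instead of sapply() so that the result is always a numeric vector with one value per list element.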