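I have some data in a tibble with a numeric column and an associated sentence. I also have a vector of about 450 shorter strings, and I want to check each sentence for each of them. Ultimately, I want the sum of the numeric values associated with each of the ~450 strings, after prorating each sentence's value by its number of hits. For example, if one of the sentences has a value of 3 and two of the 450 strings are hits, I want to add 1.5 to each of their totals (see example strings 1 and 2, which both appear in the first "sentence" string).
The example below produces the "final_product" I want for 4 example strings, but it is not practical for ~450 strings. (I am not particularly keen on building a big table of 2 + ~450 columns to achieve this, so if it can be done by a search that returns a single list of matches, or in any other way, that is fine too.)
Can anyone suggest a more scalable and appropriate way to get the same basic output?
Many thanks.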
```r
## Tibble with some strings and associated numbers
pacman::p_load(stringi, tidyverse)
set.seed(1)
entries <- tibble("numbers" = rnorm(100),
                  "strings" = stri_rand_strings(100, 15, "[A-Za-z]"))

# Strings known to show up, for example
strings_to_find <- c("NJad", "GNl", "Qaw", "bQ")

# Answers in the form of a table: one logical column per string to find
answers_as_table <- entries %>%
  mutate(String1 = str_detect(strings, pattern = strings_to_find[[1]]),
         String2 = str_detect(strings, pattern = strings_to_find[[2]]),
         String3 = str_detect(strings, pattern = strings_to_find[[3]]),
         String4 = str_detect(strings, pattern = strings_to_find[[4]]))

# Find the number of strings found in each entry
answers_as_table$CountofHits <- rowSums(answers_as_table[, 3:6])

# Prorate accordingly
answers_as_table$proration <- answers_as_table$numbers / answers_as_table$CountofHits

# Find the sum of the prorated amount for each string
SumString1 <- sum(answers_as_table$proration[answers_as_table$String1])
SumString2 <- sum(answers_as_table$proration[answers_as_table$String2])
SumString3 <- sum(answers_as_table$proration[answers_as_table$String3])
SumString4 <- sum(answers_as_table$proration[answers_as_table$String4])

(final_product <- tibble("strings_to_find" = strings_to_find,
                         "Sums" = c(SumString1, SumString2, SumString3, SumString4)))
```
For giggles, a base R attempt:
```r
# One row per (string to find, matching entry) pair
g <- stack(sapply(strings_to_find, grep, x = entries$strings, simplify = FALSE))
g$numbers <- entries$numbers[g$values]
# Split each entry's number evenly across the strings that hit it
g$prorata <- ave(g$numbers, g$values, FUN = function(x) x / length(x))
out <- aggregate(prorata ~ ind, data = g, sum)
out
#   ind    prorata
#1 NJad -0.3132269
#2  GNl -0.3132269
#3  Qaw  0.1836433
#4   bQ  0.3575099
```
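To make the steps of the base R approach easier to check by hand, here is a minimal sketch on a made-up three-row data set. The `toy` data, the `patterns` vector, and the intermediate names are hypothetical and only for illustration. `fixed = TRUE` is also an assumption: it treats the strings as literal text rather than regular expressions, which may or may not suit the real ~450 strings.

```r
# Hypothetical toy data: entry 1 hits both patterns, entry 2 hits one, entry 3 hits none
toy <- data.frame(numbers = c(3, 4, 10),
                  strings = c("ab XY", "XY only", "no match"),
                  stringsAsFactors = FALSE)
patterns <- c("ab", "XY")

# Named list: for each pattern, the row indices of the entries that contain it
hits <- sapply(patterns, grep, x = toy$strings, fixed = TRUE, simplify = FALSE)

# Long format: 'values' is the entry row index, 'ind' is the pattern
long <- stack(hits)
long$numbers <- toy$numbers[long$values]

# Within each entry, divide its number by how many patterns hit it
long$prorata <- ave(long$numbers, long$values, FUN = function(x) x / length(x))

# Total prorated amount per pattern
toy_out <- aggregate(prorata ~ ind, data = long, FUN = sum)
toy_out
```

By hand, entry 1 contributes 3 / 2 = 1.5 to each of "ab" and "XY", and entry 2 contributes its full 4 to "XY", so the totals should be 1.5 for "ab" and 5.5 for "XY". Entries with no hits never appear in the long table, so no division by zero occurs.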
It compares well:
```r
out == final_product
#      ind prorata
#[1,] TRUE    TRUE
#[2,] TRUE    TRUE
#[3,] TRUE    TRUE
#[4,] TRUE    TRUE
```
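A side note on the comparison: `==` tests doubles for exact equality, which happens to work here but can fail when two methods do the arithmetic in a different order. A minimal sketch of a tolerance-based check with `all.equal()`, using made-up vectors `a` and `b` (hypothetical, not from the data above):

```r
# Two vectors that are equal in exact arithmetic but not in floating point
a <- c(0.1 + 0.2, 1.5)
b <- c(0.3, 1.5)

# Exact, element-wise comparison: the first pair differs in the last bits
exact <- a == b

# Comparison within a small numeric tolerance
tolerant <- isTRUE(all.equal(a, b))

exact
tolerant
```

With the objects from the thread, the analogous check would presumably be something like `all.equal(out$prorata, final_product$Sums)`.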
We can loop over the vector and create the columns:
```r
library(purrr)
library(stringr)
library(dplyr)

# Build one logical column per string to find, named String1, String2, ...
answers_as_table <- map2_dfc(strings_to_find,
                             str_c("String", seq_along(strings_to_find)),
                             ~ entries %>%
                               transmute(!! .y := str_detect(strings, .x))) %>%
  mutate(CountofHits = rowSums(.))

# Number of entries in which each string was found
sumstring <- answers_as_table %>%
  summarise(across(starts_with('String'), sum))
```
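The snippet above stops at hit counts per string and still builds one column per string. As a rough sketch of how the prorated sums could be reached without the wide table, one option is to keep the matches for each entry in a list column and unnest it. The `toy` tibble, `patterns`, and the column names `pattern`, `n_hits`, and `share` are hypothetical. `fixed()` is an assumption that the strings should be matched literally.

```r
library(dplyr)
library(tidyr)
library(stringr)

# Hypothetical toy data: entry 1 hits both patterns, entry 2 hits one, entry 3 hits none
toy <- tibble(numbers = c(3, 4, 10),
              strings = c("ab XY", "XY only", "no match"))
patterns <- c("ab", "XY")

prorated <- toy %>%
  mutate(
    # For each entry, the patterns that occur in it (possibly none)
    pattern = lapply(strings, function(s) patterns[str_detect(s, fixed(patterns))]),
    n_hits  = lengths(pattern)
  ) %>%
  # One row per (entry, matching pattern); entries with no hits are dropped
  unnest(pattern) %>%
  # Split each entry's number evenly across its hits
  mutate(share = numbers / n_hits) %>%
  group_by(pattern) %>%
  summarise(Sums = sum(share), .groups = "drop")

prorated
```

The result has one row per string that was found at least once, so it grows with the number of matches rather than with 2 + ~450 columns. Strings that never match are absent from the output. If they need to appear with a sum of 0, a join back onto the full vector of strings would be one way to add them.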