我有一个包含一组文本的数据框架,每个文本都有一个单词计数。它看起来像这样:
df1 <- data.frame(Items = sample(1:495, 495, replace = FALSE), Length.in.words = sample(380:820, 495, replace = TRUE))
我需要把这些文本分配给法官。这个任务有几个参数。第一,每篇文本应由三名法官审阅。因此,我将原始数据帧复制了两次,因此每个文本现在表示三次:
df2 <- df1
df3 <- df1
df <- rbind(df1, df2, df3)
接下来,我编写了一个小函数来将文本分配给评委。它使用了tidyverse和groupdata2:
library(tidyverse)
library(groupdata2)
sample_scripts <- function(x, judges){
all <- fold(x, num_col = "Length.in.words", k = judges)
judge_list <- split(all, all$.folds)
while (TRUE %in% lapply(X = judge_list, FUN = duplicated)){
all <- fold(all, num_col = "Length.in.words", k = judges)
judge_list <- split(all, all$.folds)
}
assign("judge_list", judge_list, envir = globalenv())
}
函数做两件事。它使用groupdata2::fold
在字数方面平衡每位法官的文本。并且,至少在理论上,它使用while
迭代,直到每个法官的文本列表不包含重复。评委不能看同一篇文章两次!我像这样运行这个函数:
sample_scripts(df, 33)
但是"while"循环没有按预期工作-每次运行该函数时,我都会在几个法官的文本列表中得到重复。我这样测试:
lapply(X = judge_list, FUN = duplicated)
谁能帮我解决这个问题,以便while循环迭代文本分配,直到在任何法官列表中没有重复?
感谢彼得
左边的解决方案。我认为这可能被认为是一个np难题,但当我读到它时,它给我的印象是一个约束离散优化问题——分配分配给每个法官的页数的差异最小化,每个项目分配三个法官,并确保没有法官被分配相同的项目两次。
运行需要一段时间,但这里有一个使用遗传算法的概率解决方案
library(GA)
library(purrr)
library(dplyr)
library(tidyr)
df1 <- data.frame(
Items = sample(1:495, 495, replace = FALSE),
Length.in.words = sample(380:820, 495, replace = TRUE)
)
sample_scripts <- function(x, judges){
decode <- function(encX) {
# enumerate the three judges assigned to each text
judgesAssign <- floor(encX) |>
split(seq_along(x$Items))
# check for duplicates
dups <- map_lgl(judgesAssign, (.x) any(duplicated(.x))) |> sum()
list(judgesAssign = judgesAssign, dups = dups)
}
fitness <- function(encX) {
decX <- decode(encX)
# enumerate the texts assigned to each judge
itemAssign <- decX$judgesAssign |>
as_tibble() |>
pivot_longer(
everything(),
names_to = "Items",
values_to = "judge"
) |>
# match types in df
mutate(Items = as.integer(Items))
# determine words and items assigned to each count
assignCounts <- itemAssign|>
left_join(x, by = "Items") |>
# ensure every judge is represented
bind_rows(
tibble(judge = seq(judges), Length.in.words = 0)
) |>
group_by(judge) |>
# subtract 1 for empty row
summarise(n_words = sum(Length.in.words), n_items = n() - 1)
# minimize variance
f <- -(var(assignCounts$n_words)/mean(assignCounts$n_words) +
var(assignCounts$n_items)/mean(assignCounts$n_items))
# penalty term for duplicates
pen <- sqrt(.Machine$double.xmax)
dupPen <- decX$dups*pen
# fitness function minimizes variance and heavy penalty for duplicates
f - dupPen
}
scriptCt <- NROW(x)
GA <- ga(
"real-valued",
fitness = fitness,
lower = rep(1, scriptCt*3), upper = rep(judges, scriptCt*3) + 1,
maxiter = 750, run = 200, seed = 123, pmutation = 0.25,
suggestions = sample(seq(judges), 3*scriptCt, TRUE),
maxFitness = 0
)
decSol <- decode(GA@solution[1, ])
finalAssignments <- decSol$judgesAssign |>
as_tibble() |>
pivot_longer(
everything(),
names_to = "Items",
values_to = "judge"
) |>
# match types in df
mutate(Items = as.integer(Items))
list(
gaRslt = GA,
fitness = fitness_tot(GA@solution),
decodedSol = decSol,
assignmentsByJudge = split(finalAssignments$Items, finalAssignments$judge),
assignmentsByItem = decSol$judgesAssign,
anyDuplicates = decSol$dups
)
}
sample_scripts(df1, 33)
不保证每个法官都有相同数量的Items
,但它优化使数量保持相对相同。当我运行它时,大多数评委有45到55个项目,但也有长尾。我没能成功地把它限制在45个。您可以使用ga
的一些参数,这可能会有所帮助,但我认为解决方案空间有很多局部最小值,因此可能很难。
如果你有手动调整的选项,你可以用这个作为起点,从评委那里拿一些有很多的项目,然后把它们分配给那些几乎没有的。不是全局最优,但可能很接近。
我试图仅使用base
R函数重新编码sample_scripts
函数:
sample_scripts <- function(data, n_judges = 33, n_judge_per_item = 3) {
# Default-Values:
# 33 Judges
# each Item has to be revised by 3 judges
# initialize output
judges_list <- lapply(seq_len(n_judges),
(n) data.frame(
Item = integer(0),
Length.in.words = integer(0)
)
)
# Check if a judge is still available
free_judge <- rep(TRUE, n_judges)
# target_length tries to distribute the text sizes
target_length <- n_judge_per_item * sum(data$Length.in.words) / n_judges
# loop over items data.frame
for (i in seq_len(nrow(data))) {
# randomly pick three free judges
samp_judge <- sample(seq_len(n_judges)[free_judge], size = n_judge_per_item)
# assign items to judge
for (j in samp_judge) {
judges_list[[j]] <-
rbind(
judges_list[[j]],
data.frame(Item = data[["Items"]][i],
Length.in.words = data[["Length.in.words"]][i])
)
# check if judge is still available
free_judge[j] <- (sum(judges_list[[j]][["Length.in.words"]]) < target_length)
}
}
# return list
judges_list
}
您只需输入df1
而不是df <- rbind(df1, df1, df1)
。在本例中,sample_scripts(df1)
返回
[[14]]
Item Length.in.words
1 12 817
2 245 389
3 372 649
4 413 731
5 329 698
6 193 492
7 405 731
8 376 593
9 297 539
10 356 774
11 125 492
12 359 716
13 472 611
14 128 397
[...]
用tibble
s替换data.frame
s返回
[[1]]
# A tibble: 42 × 2
Item Length.in.words
<int> <int>
1 99 738
2 15 483
3 341 763
4 333 702
5 407 797
6 376 593
7 356 774
8 4 535
9 53 739
10 354 570
# … with 32 more rows
# ℹ Use `print(n = ...)` to see more rows
[[2]]
# A tibble: 47 × 2
Item Length.in.words
<int> <int>
1 86 770
2 404 550
3 357 675
4 331 431
5 116 589
6 438 554
7 459 497
8 319 651
9 97 718
10 85 510
# … with 37 more rows
# ℹ Use `print(n = ...)` to see more rows
[...]
你对lapply(X = judge_list, duplicated)
副本的检查仍然给我们
[[1]]
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[18] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[35] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[[2]]
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[18] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[35] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
通过使用unlist(lapply(X = judge_list, (x) sum(x$Length.in.words)))
,你可以检查每个裁判是否有相似的单词量。
感谢大家的建议。关于每个裁判的相同条目数量的约束是严格的(我显然应该提到这一点)——比最小化每个裁判的每个文本的平均长度的约束更严格。我写了下面的代码,其中使用了"贪婪"这个词。分组项目的方法,以确保每个法官的列表中没有重复的项目,给每个法官45个项目,并确保每个项目正好被三个法官看到。它不管理项目的平均长度,但它在分配之前随机化项目列表,所以我可以重新运行这个过程几次,直到我满意的范围足够低。
sample_scripts <- function(x, texts_per_judge){
x1 <- x
x1 <- x1[sample(1:nrow(x1)), ]
x2 <- x
x2 <- x2[sample(1:nrow(x2)), ]
x3 <- x
x3 <- x3[sample(1:nrow(x3)), ]
all <- rbind(x1, x2, x3)
all <- group(all, method = "greedy", n = texts_per_judge)
judge_list <- split(all, all$.groups)
assign("judge_mean_length", value = aggregate(Length.in.words ~ .groups, data = judge_df, FUN = "mean"), envir = globalenv())
assign("judge_list", judge_list, envir = globalenv())
}
我对这个解决方案很满意,但我仍然很想知道是否有人能想出一个解决方案,在平均文本长度范围可控之前不需要重新运行这个过程。