R - while loop condition never being met



I have a data frame containing a set of texts, each with a word count. It looks like this:

df1 <- data.frame(Items = sample(1:495, 495, replace = FALSE), Length.in.words = sample(380:820, 495, replace = TRUE))

I need to assign these texts to judges. The task has a few parameters. First, each text should be reviewed by three judges, so I duplicated the original data frame twice, meaning every text is now represented three times:

df2 <- df1
df3 <- df1
df <- rbind(df1, df2, df3)

Next, I wrote a small function to assign the texts to judges. It uses tidyverse and groupdata2:

library(tidyverse)
library(groupdata2)
sample_scripts <- function(x, judges){
  all <- fold(x, num_col = "Length.in.words", k = judges)
  judge_list <- split(all, all$.folds)
  while (TRUE %in% lapply(X = judge_list, FUN = duplicated)){
    all <- fold(all, num_col = "Length.in.words", k = judges)
    judge_list <- split(all, all$.folds)
  }
  assign("judge_list", judge_list, envir = globalenv())
}

The function does two things. It uses groupdata2::fold to balance the texts across judges in terms of word count. And, at least in theory, it uses while to keep iterating until no judge's list of texts contains duplicates. A judge must not see the same text twice! I run the function like this:

sample_scripts(df, 33)

但是"while"循环没有按预期工作-每次运行该函数时,我都会在几个法官的文本列表中得到重复。我这样测试:

lapply(X = judge_list, FUN = duplicated)

Can anyone help me fix this so that the while loop keeps iterating the text assignment until there are no duplicates in any judge's list?

Thanks, Peter

A solution from left field. I think this could arguably be considered NP-hard, but when I read it, it struck me as a constrained discrete optimization problem: minimize the differences in the amount of text assigned to each judge, assign three judges to every item, and make sure no judge is assigned the same item twice.

It takes a while to run, but here is a probabilistic solution using a genetic algorithm:

library(GA)
library(purrr)
library(dplyr)
library(tidyr)
df1 <- data.frame(
  Items = sample(1:495, 495, replace = FALSE), 
  Length.in.words = sample(380:820, 495, replace = TRUE)
)

sample_scripts <- function(x, judges){
  
  decode <- function(encX) {
    # enumerate the three judges assigned to each text
    judgesAssign <- floor(encX) |> 
      split(seq_along(x$Items))
    
    # check for duplicates
    dups <- map_lgl(judgesAssign, \(.x) any(duplicated(.x))) |> sum()
    
    list(judgesAssign = judgesAssign, dups = dups)
  }
  
  fitness <- function(encX) {
    
    decX <- decode(encX)
    
    # enumerate the texts assigned to each judge
    itemAssign <- decX$judgesAssign |> 
      as_tibble() |> 
      pivot_longer(
        everything(), 
        names_to = "Items",
        values_to = "judge"
      ) |> 
      # match types in df
      mutate(Items = as.integer(Items))
    
    # determine words and items assigned to each judge
    assignCounts <- itemAssign |> 
      left_join(x, by = "Items") |> 
      # ensure every judge is represented
      bind_rows(
        tibble(judge = seq(judges), Length.in.words = 0)
      ) |> 
      group_by(judge) |> 
      # subtract 1 for the empty row
      summarise(n_words = sum(Length.in.words), n_items = n() - 1)
    
    # minimize variance
    f <- -(var(assignCounts$n_words)/mean(assignCounts$n_words) + 
             var(assignCounts$n_items)/mean(assignCounts$n_items))
    
    # penalty term for duplicates
    pen <- sqrt(.Machine$double.xmax)
    dupPen <- decX$dups*pen
    
    # fitness minimizes the variance, with a heavy penalty for duplicates
    f - dupPen
  }
  
  scriptCt <- NROW(x)
  
  GA <- ga(
    "real-valued",
    fitness = fitness,
    lower = rep(1, scriptCt*3), upper = rep(judges, scriptCt*3) + 1,
    maxiter = 750, run = 200, seed = 123, pmutation = 0.25,
    suggestions = sample(seq(judges), 3*scriptCt, TRUE),
    maxFitness = 0
  )
  
  decSol <- decode(GA@solution[1, ])
  
  finalAssignments <- decSol$judgesAssign |>
    as_tibble() |>
    pivot_longer(
      everything(),
      names_to = "Items",
      values_to = "judge"
    ) |>
    # match types in df
    mutate(Items = as.integer(Items))
  
  list(
    gaRslt = GA,
    fitness = fitness(GA@solution[1, ]),
    decodedSol = decSol,
    assignmentsByJudge = split(finalAssignments$Items, finalAssignments$judge),
    assignmentsByItem = decSol$judgesAssign,
    anyDuplicates = decSol$dups
  )
}
sample_scripts(df1, 33)

There is no guarantee that every judge gets the same number of Items, but the optimization keeps the counts roughly equal. When I ran it, most judges had 45 to 55 items, with a long tail. I couldn't get it pinned to exactly 45. You could play with some of ga's parameters, which might help, but I suspect the solution space has a lot of local minima, so it may be difficult.
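If you want to look at that spread yourself, a quick check (my own snippet, not part of the solution above; it assumes the final call is captured in a variable, here called res) would be:

res <- sample_scripts(df1, 33)
# distribution of the number of items per judge
summary(lengths(res$assignmentsByJudge))
# total words per judge, if you also want to check balance by length
sapply(res$assignmentsByJudge, \(items) sum(df1$Length.in.words[match(items, df1$Items)]))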

If you have the option of adjusting by hand, you could use this as a starting point: take some items away from judges who have many and reassign them to those who have few. Not globally optimal, but probably close.
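A rough sketch of that manual pass, in case it is useful (my addition, not part of the GA code; byJudge stands for the assignmentsByJudge element of the returned list, and the pass is greedy rather than exhaustive):

rebalance <- function(byJudge) {
  repeat {
    sizes <- lengths(byJudge)
    big   <- which.max(sizes)
    small <- which.min(sizes)
    # stop once the busiest and lightest judge differ by at most one item
    if (sizes[big] - sizes[small] <= 1) break
    # only move items the lighter judge has not already been assigned
    movable <- setdiff(byJudge[[big]], byJudge[[small]])
    if (length(movable) == 0) break
    item <- movable[1]
    byJudge[[big]]   <- setdiff(byJudge[[big]], item)
    byJudge[[small]] <- c(byJudge[[small]], item)
  }
  byJudge
}

# e.g. balanced <- rebalance(res$assignmentsByJudge)

It only looks at item counts, not word counts, so the length balance the GA found may drift a little.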

I tried to re-code the sample_scripts function using only base R functions:

sample_scripts <- function(data, n_judges = 33, n_judge_per_item = 3) {
  # Default values: 
  #   33 judges
  #   each item has to be reviewed by 3 judges
  
  # initialize output
  judges_list <- lapply(
    seq_len(n_judges), 
    \(n) data.frame(
      Item = integer(0),
      Length.in.words = integer(0)
    )
  )
  
  # track whether a judge is still available  
  free_judge <- rep(TRUE, n_judges)
  
  # target_length tries to distribute the text sizes evenly
  target_length <- n_judge_per_item * sum(data$Length.in.words) / n_judges 
  
  # loop over the items data.frame
  for (i in seq_len(nrow(data))) {
    # randomly pick three free judges
    samp_judge <- sample(seq_len(n_judges)[free_judge], size = n_judge_per_item)
    
    # assign the item to each of these judges
    for (j in samp_judge) {
      judges_list[[j]] <- 
        rbind(
          judges_list[[j]],
          data.frame(Item = data[["Items"]][i],
                     Length.in.words = data[["Length.in.words"]][i])
        )
      
      # check whether the judge is still available
      free_judge[j] <- (sum(judges_list[[j]][["Length.in.words"]]) < target_length)
    }
  }
  
  # return list
  judges_list
}

You just pass in df1 rather than df <- rbind(df1, df1, df1). In this example, sample_scripts(df1) returns

[[14]]
Item Length.in.words
1    12             817
2   245             389
3   372             649
4   413             731
5   329             698
6   193             492
7   405             731
8   376             593
9   297             539
10  356             774
11  125             492
12  359             716
13  472             611
14  128             397
[...]

Replacing the data.frames with tibbles returns

[[1]]
# A tibble: 42 × 2
Item Length.in.words
<int>           <int>
1    99             738
2    15             483
3   341             763
4   333             702
5   407             797
6   376             593
7   356             774
8     4             535
9    53             739
10   354             570
# … with 32 more rows
# ℹ Use `print(n = ...)` to see more rows
[[2]]
# A tibble: 47 × 2
Item Length.in.words
<int>           <int>
1    86             770
2   404             550
3   357             675
4   331             431
5   116             589
6   438             554
7   459             497
8   319             651
9    97             718
10    85             510
# … with 37 more rows
# ℹ Use `print(n = ...)` to see more rows
[...]

Your check for duplicates with lapply(X = judge_list, duplicated) still gives us

[[1]]
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[18] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[35] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[[2]]
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[18] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[35] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE

Using unlist(lapply(X = judge_list, \(x) sum(x$Length.in.words))), you can check whether every judge got a similar total number of words.
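As an aside on the original while loop: as far as I can tell, the condition TRUE %in% lapply(judge_list, duplicated) never fires because %in% coerces the list to character, so each logical vector is deparsed into a single string like "c(FALSE, FALSE, ...)" that never matches "TRUE", and the loop body essentially never runs. If you want to keep that approach, collapsing the check to a single logical behaves as intended. A sketch, assuming it is dropped back into the original function where all, judges and fold() are in scope (whether repeated fold() calls will eventually land on a duplicate-free split is a separate question):

while (any(unlist(lapply(judge_list, duplicated)))) {
  all <- fold(all, num_col = "Length.in.words", k = judges)
  judge_list <- split(all, all$.folds)
}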

Thanks to everyone for the suggestions. The constraint that every judge gets the same number of items is a hard one (I obviously should have mentioned that) - harder than the constraint of keeping the mean text length per judge similar. I wrote the code below, which uses the "greedy" grouping method to ensure that no judge's list contains duplicate items, that every judge gets 45 items, and that every item is seen by exactly three judges. It doesn't manage the mean length of the items, but it shuffles the item list before assigning, so I can re-run the process a few times until I'm happy that the range is low enough.

sample_scripts <- function(x, texts_per_judge){
  x1 <- x[sample(1:nrow(x)), ]
  x2 <- x[sample(1:nrow(x)), ]
  x3 <- x[sample(1:nrow(x)), ]
  all <- rbind(x1, x2, x3)
  all <- group(all, method = "greedy", n = texts_per_judge)
  judge_list <- split(all, all$.groups)
  assign("judge_mean_length", value = aggregate(Length.in.words ~ .groups, data = all, FUN = "mean"), envir = globalenv())
  assign("judge_list", judge_list, envir = globalenv())
}

I'm happy with this solution, but I'd still love to know whether anyone can come up with one that doesn't require re-running the process until the range of mean text lengths is acceptable.
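Not the deterministic answer you're hoping for, but the re-running can at least be automated. A minimal sketch (my addition; it leans on the judge_mean_length and judge_list objects your function writes to the global environment, and tol is a hypothetical tolerance in words):

sample_until_balanced <- function(x, texts_per_judge, tol = 25, max_tries = 100) {
  for (i in seq_len(max_tries)) {
    sample_scripts(x, texts_per_judge)   # writes judge_list / judge_mean_length
    # spread between the judges with the highest and lowest mean length
    spread <- diff(range(judge_mean_length$Length.in.words))
    if (spread <= tol) return(judge_list)
  }
  warning("tolerance not reached after ", max_tries, " tries")
  judge_list
}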
