分裂任意大小的数组元素



我有一个我从一篇文章中读到的句子列表,我希望将它们分成大小均匀(尽可能多)的桶,最多N个字符,额外的复杂性是保持桶的大小大致相同,并且句子不分裂。

我正在使用python,我想到的简单方法是:

  1. 遍历句子并填充每个bucket,直到它被填满。
  2. 遍历桶并尝试均匀它们。

第二部分恐怕不会那么直截了当,我想知道是否有一种快速/聪明的方法来实现它?

完整的示例数据在这里太大了,但以更简单的形式:[[30个单词的第一句],[3个单词的第一句][[15个单词的第一句][[10个单词的第一句],…]

bucket应该是[sent1, sent 2…]]的总字符长度为句子的总和。

我想组X桶,在每个桶将句子总长度尽可能均匀分布,而不能通过某种thredhold。

第一步将把句子压缩到较低的组(左侧),因此第二步可以通过将句子移到右侧来执行优化。

请记住,这些句子移位将有非常多的排列(它将随着句子的数量呈指数增长)。

实现这一点的一个简单方法是处理句子长度列表而不是句子本身。当你完成时,你可以根据长度分组对实际的句子进行分组。

测试数据:

sentences="""Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.""".split(". ")
sentences *=60              # 240 sentences
from random import shuffle
shuffle(sentences)

句子长度分组

SL = [len(s) for s in sentences]
bucketCount = 10
bucketSize  = max(max(SL),sum(SL)//(bucketCount-1)) # safe max bucket size
groups = []
for size in SL:
if not groups or sum(groups[-1])+size>bucketSize:
groups.append([size])
else:
groups[-1].append(size)
print(*groups,sep="n")
[101, 122, 110, 106, 122, 101, 110, 101, 110, 101, 106, 101, 101, 106, 101, 110, 122, 106, 106, 101, 122, 122, 122, 110, 122, 110]
[110, 106, 110, 110, 101, 106, 122, 110, 106, 101, 101, 122, 106, 122, 122, 110, 122, 106, 122, 122, 106, 106, 122, 101, 122, 122]
[106, 122, 122, 101, 101, 110, 106, 110, 110, 101, 110, 110, 106, 106, 106, 106, 101, 122, 101, 110, 122, 122, 110, 110, 101, 122]
[106, 122, 101, 110, 106, 110, 122, 122, 110, 101, 122, 101, 110, 110, 122, 122, 106, 110, 101, 122, 101, 106, 106, 101, 101, 101]
[106, 110, 122, 101, 122, 110, 122, 122, 122, 110, 106, 101, 106, 106, 110, 106, 122, 106, 110, 106, 110, 101, 110, 110, 122, 122]
[110, 101, 110, 106, 101, 101, 110, 101, 110, 122, 122, 101, 106, 110, 110, 122, 110, 110, 122, 106, 110, 122, 106, 101, 101, 110]
[101, 101, 101, 106, 101, 106, 106, 101, 101, 122, 101, 122, 106, 122, 106, 110, 101, 122, 101, 110, 106, 106, 122, 122, 106, 106, 106]
[110, 101, 106, 106, 101, 101, 122, 106, 101, 122, 106, 101, 122, 110, 106, 106, 110, 122, 110, 110, 101, 101, 106, 110, 110, 106, 101]
[122, 101, 110, 122, 106, 122, 110, 101, 110, 101, 110, 122, 106, 106, 122, 106, 122, 110, 101, 101, 106, 106, 106, 101, 110, 110]
[122, 101, 110, 106]

正如你所看到的,最后一组有很多空间可以移动到

集团优化

这是一个"蛮力"探索组成员移位排列(向右)并返回最优分组的优化函数。

注意这个函数的性能还有很大的提升空间,我并没有尝试优化这个优化函数。

# Recursive function that goes through shifting permutations    
def optimize(groups,bucketSize):
if len(groups)==1: return groups
optimal = groups
minSum  = min(map(sum,groups))
for i in reversed(range(1,len(groups))):
if groups[i-1][-1]+sum(groups[i])<=bucketSize:
oGroups = [g.copy() for g in groups]
oGroups[i].insert(0,oGroups[i-1].pop(-1))
oGroups[:i+1] = optimize(oGroups[:i+1],bucketSize)
ms = min(map(sum,oGroups))
if ms>minSum:
minSum,optimal = ms,oGroups
return optimal

optimal = optimize(groups,bucketSize)  # takes about a minute for 240 sentences  

bucketSize = max(map(sum,optimal)) # Effective bucket size                          
for g in optimal:
print(sum(g),sum(g)-bucketSize,g)
2620 -80 [101, 122, 110, 106, 122, 101, 110, 101, 110, 101, 106, 101, 101, 106, 101, 110, 122, 106, 106, 101, 122, 122, 122, 110]
2681 -19 [122, 110, 110, 106, 110, 110, 101, 106, 122, 110, 106, 101, 101, 122, 106, 122, 122, 110, 122, 106, 122, 122, 106, 106]
2634 -66 [122, 101, 122, 122, 106, 122, 122, 101, 101, 110, 106, 110, 110, 101, 110, 110, 106, 106, 106, 106, 101, 122, 101, 110]
2700 0 [122, 122, 110, 110, 101, 122, 106, 122, 101, 110, 106, 110, 122, 122, 110, 101, 122, 101, 110, 110, 122, 122, 106, 110]
2621 -79 [101, 122, 101, 106, 106, 101, 101, 101, 106, 110, 122, 101, 122, 110, 122, 122, 122, 110, 106, 101, 106, 106, 110, 106]
2630 -70 [122, 106, 110, 106, 110, 101, 110, 110, 122, 122, 110, 101, 110, 106, 101, 101, 110, 101, 110, 122, 122, 101, 106, 110]
2599 -101 [110, 122, 110, 110, 122, 106, 110, 122, 106, 101, 101, 110, 101, 101, 101, 106, 101, 106, 106, 101, 101, 122, 101, 122]
2606 -94 [106, 122, 106, 110, 101, 122, 101, 110, 106, 106, 122, 122, 106, 106, 106, 110, 101, 106, 106, 101, 101, 122, 106, 101]
2643 -57 [122, 106, 101, 122, 110, 106, 106, 110, 122, 110, 110, 101, 101, 106, 110, 110, 106, 101, 122, 101, 110, 122, 106, 122]
2606 -94 [110, 101, 110, 101, 110, 122, 106, 106, 122, 106, 122, 110, 101, 101, 106, 106, 106, 101, 110, 110, 122, 101, 110, 106]

有效桶大小为2700,句子总长度最大差异为101个字符(3.7%)

分组实际句子

iSentence = iter(sentences)
sentenceGroups = [[next(iSentence) for _ in g] for g in optimal]

[编辑]我发现了一种更快的方法来执行分组,即递归地找到最佳的两部分。它并不总是那么完美,但运行起来要快得多。

def makeGroups(SL,buckets):
if buckets == 1: return [SL]
left = buckets//2            # left side buckets
lSum = sum(SL)*left/buckets  # target sum for left side
mid  = min(range(len(SL)),key=lambda i:abs(sum(SL[:i])-lSum))
return makeGroups(SL[:mid],left)+makeGroups(SL[mid:],buckets-left)

def groupSentences(sentences,buckets):
iS = iter(sentences)
return [ [next(iS) for _ in g] for g in makeGroups([*map(len,sentences)]) ]

这几乎可以为10,000个句子提供即时结果(相比之下,我最初的解决方案可能需要几分钟才能完成240个句子)

假设你必须将大量的项目(假设327个项目)放入最大存储桶大小(假设每个存储桶12个项目)对吗?您希望每个桶的大小在填充后长度相同,但桶的长度不超过最大桶大小

得到最大桶大小和物品数量的最大公分母:然后这就是每个桶可以容纳多少物品:

from math import gcd
from random import shuffle
no_items = 327
max_bucket_size = 12
items_per_bucket = gcd(len(items), max_bucket_size)
list(range(327))
shuffle(items)
print(list((zip(*[iter(items)]*items_per_bucket))))

[(38, 273, 165),
(212, 259, 197),
(4, 181, 43),
(162, 137, 23),
(42, 146, 315),
(99, 240, 292),
(96, 182, 89),
(102, 164, 73),
(153, 248, 5),
(84, 48, 81),
(148, 214, 321),
(130, 44, 40),
(210, 312, 202),
(127, 126, 31),
(83, 9, 7),
(284, 57, 194),
(303, 109, 95),
(223, 238, 227),
(104, 10, 94),
(78, 155, 224),
(125, 271, 279),
(123, 189, 135),
(249, 290, 243),
(141, 139, 119),
(59, 311, 299),
(186, 289, 306),
(174, 241, 278),
(75, 11, 286),
(62, 228, 250),
(169, 16, 314),
(281, 63, 263),
(265, 132, 108),
(287, 129, 138),
(177, 103, 142),
(22, 230, 173),
(203, 319, 239),
(72, 180, 64),
(47, 205, 0),
(178, 255, 234),
(274, 71, 117),
(305, 316, 6),
(301, 80, 118),
(200, 86, 45),
(163, 258, 147),
(27, 232, 209),
(183, 17, 157),
(56, 302, 121),
(156, 225, 3),
(26, 282, 2),
(150, 300, 53),
(76, 207, 215),
(152, 206, 74),
(198, 28, 247),
(51, 87, 12),
(98, 19, 20),
(172, 116, 298),
(270, 195, 41),
(188, 18, 256),
(233, 226, 187),
(134, 262, 309),
(179, 190, 237),
(317, 113, 69),
(313, 34, 199),
(35, 253, 159),
(269, 308, 213),
(295, 168, 285),
(176, 54, 46),
(242, 140, 100),
(288, 191, 244),
(219, 280, 193),
(260, 307, 235),
(320, 196, 264),
(115, 277, 60),
(33, 110, 158),
(55, 24, 166),
(236, 204, 293),
(88, 120, 101),
(32, 322, 149),
(296, 97, 68),
(79, 52, 14),
(15, 29, 222),
(58, 8, 201),
(128, 272, 275),
(112, 208, 283),
(246, 220, 254),
(111, 325, 324),
(114, 304, 124),
(131, 92, 39),
(21, 1, 67),
(323, 184, 245),
(160, 171, 151),
(77, 192, 268),
(266, 136, 133),
(291, 65, 30),
(252, 276, 37),
(49, 70, 50),
(267, 13, 217),
(170, 82, 211),
(221, 216, 144),
(310, 93, 25),
(326, 106, 218),
(318, 66, 294),
(167, 154, 36),
(185, 161, 107),
(143, 145, 261),
(61, 231, 297),
(85, 257, 229),
(175, 105, 90),
(122, 251, 91)]
> 

最新更新