使列表尽可能不排序的功能



我正在寻找一个函数来使列表尽可能不排序。最好是Python。

背景故事:

我想检查 URL 状态,看看 URL 是否给出 404。我只使用asynciorequests模块。没什么好看的。

现在我不想使服务器过载,因此我想尽量减少同时检查同一域上的URL。我有一个想法,即对 URL 进行排序的方式是,彼此接近的项目(具有相同的排序键 = 域名)在列表中尽可能彼此分开。

一个数字示例:

a=[1,1,2,3,3]  # <== sorted list, sortness score = 2
0,1,2,3,4   # <== positions

可以取消排序为:

b=[1,3,2,1,3]  # <== unsorted list, sortness score = 6
0,1,2,3,4   # <== positions

我想说的是,我们可以通过对相等项目(具有相同的键=域名)之间的距离相加来计算排序分数。更高的分拣意味着更好的未分拣。也许有更好的方法来测试不分类。

列表a的排序得分为 2。1 的距离之和为 (1-0)=1,2 为 0,3 的距离之和为 (4-3)=1。

列表b的排序得分为 6。1 的距离之和为 (3-0)=3,2 为 0,3 的距离之和为 (4-1)=3。

URL 列表看起来像(域、URL)元组列表:

[
('example.com', 'http://example.com/404'),
('test.com', 'http://test.com/404'),
('test.com', 'http://test.com/405'),
('example.com', 'http://example.com/405'),
...
]

我正在研究一个可以正常工作的原型,但不是最佳的,因为我可以找到一些更好地手动分类的变体。

有人想试一试吗?

这是我的代码,但它不是很好:):

from collections import Counter
from collections import defaultdict
import math

def test_unsortness(lst:list) -> float:
pos = defaultdict(list)
score = 0
# Store positions for each key
# input = [1,3,2,3,1] => {1: [0, 4], 3: [1, 3], 2: [2]}
for c,l in enumerate(lst):
pos[l].append(c)
for k,poslst in pos.items():
for i in range(len(poslst)-1):
score += math.sqrt(poslst[i+1] - poslst[i])
return score

def unsort(lst:list) -> list:
free_positions = list(range(0,len(lst)))
output_list = [None] * len(free_positions)
for val, count in Counter(lst).most_common():
pos = 0
step = len(free_positions) / count
for i in range(count):
output_list[free_positions[int(pos)]] = val
free_positions[int(pos)] = None  # Remove position later
pos = pos + step
free_positions = [p for p in free_positions if p]
return output_list

lsts = list()
lsts.append( [1,1,2,3,3] )
lsts.append( [1,3,2,3,1] )       # This has the worst score after unsort()
lsts.append( [1,2,3,0,1,2,3] )   # This has the worst score after unsort()
lsts.append( [3,2,1,0,1,2,3] )   # This has the worst score after unsort()
lsts.append( [3,2,1,3,1,2,3] )   # This has the worst score after unsort()
lsts.append( [1,2,3,4,5] )
for lst in lsts:
ulst = unsort(lst)
print( ( lst, '%.2f'%test_unsortness(lst), '====>', ulst, '%.2f'%test_unsortness(ulst), ) )
#  Original               score             Unsorted               score
#  -------                -----             --------               -----
# ([1, 1, 2, 3, 3],       '2.00',  '====>', [1, 3, 1, 3, 2],       '2.83')
# ([1, 3, 2, 3, 1],       '3.41',  '====>', [1, 3, 1, 3, 2],       '2.83')
# ([1, 2, 3, 0, 1, 2, 3], '6.00',  '====>', [1, 2, 3, 1, 2, 3, 0], '5.20')
# ([3, 2, 1, 0, 1, 2, 3], '5.86',  '====>', [3, 2, 1, 3, 2, 1, 0], '5.20')
# ([3, 2, 1, 3, 1, 2, 3], '6.88',  '====>', [3, 2, 3, 1, 3, 2, 1], '6.56')
# ([1, 2, 3, 4, 5],       '0.00',  '====>', [1, 2, 3, 4, 5],       '0.00')

附言。我不只是在寻找随机化函数,我知道有爬虫可以管理域负载,但这是为了锻炼。

与其取消对 URL 列表的排序,不如按域对它们进行分组,每个都在队列中,然后异步处理它们,中间有延迟(随机?

在我看来,它没有您尝试实现相同事情的复杂程度,如果您有很多域,您可以随时限制在这一点上同时运行的数字。

我使用Google OR工具来解决这个问题。我把它框定为约束优化问题,并以这种方式建模。

from collections import defaultdict
from itertools import chain, combinations
from ortools.sat.python import cp_model
model = cp_model.CpModel()
data = [
('example.com', 'http://example.com/404'),
('test.com', 'http://test.com/404'),
('test.com', 'http://test.com/405'),
('example.com', 'http://example.com/405'),
('google.com', 'http://google.com/404'),
('example.com', 'http://example.com/406'),
('stackoverflow.com', 'http://stackoverflow.com/404'),
('test.com', 'http://test.com/406'),
('example.com', 'http://example.com/407')
]
tmp = defaultdict(list)
for (domain, url) in sorted(data):
var = model.NewIntVar(0, len(data) - 1, url)
tmp[domain].append(var)  # store URLs as model variables where the key is the domain
vals = list(chain.from_iterable(tmp.values()))  # create a single list of all variables
model.AddAllDifferent(vals)  # all variables must occupy a unique spot in the output
constraint = []
for urls in tmp.values():
if len(urls) == 1:  # a single domain does not need a specific constraint
constraint.append(urls[0])
continue
combos = combinations(urls, 2)
for (x, y) in combos:  # create combinations between each URL of a specific domain
constraint.append((x - y))
model.Maximize(sum(constraint))  # maximize the distance between similar URLs from our constraint list
solver = cp_model.CpSolver()
status = solver.Solve(model)
output = [None for _ in range(len(data))]
if status == cp_model.OPTIMAL or status == cp_model.FEASIBLE:
for val in vals:
idx = solver.Value(val)
output[idx] = val.Name()
print(output)
['http://example.com/407',
'http://test.com/406',
'http://example.com/406',
'http://test.com/405',
'http://example.com/405',
'http://stackoverflow.com/404',
'http://google.com/404',
'http://test.com/404',
'http://example.com/404']

没有明显的未排序定义最适合您,但这里有一些至少可以很好地工作的东西:

  1. 对列表进行排序
  2. 如果列表的长度不是 2 的幂,则将项目均匀地分布在具有下一个 2 次幂的列表中
  3. 通过反转旧索引中的位来查找每个项目的新索引。
  4. 删除间隙以使列表恢复到其原始大小。

按照排序顺序,靠近的项的索引通常仅在最小位上有所不同。 通过反转位顺序,可以使靠近的项的新索引在最大位上有所不同,因此它们最终将相距很远。

def bitreverse(x, bits):
# reverse the lower 32 bits
x = ((x & 0x55555555) << 1) | ((x & 0xAAAAAAAA) >> 1)
x = ((x & 0x33333333) << 2) | ((x & 0xCCCCCCCC) >> 2)
x = ((x & 0x0F0F0F0F) << 4) | ((x & 0xF0F0F0F0) >> 4)
x = ((x & 0x00FF00FF) << 8) | ((x & 0xFF00FF00) >> 8)
x = ((x & 0x0000FFFF) << 16) | ((x & 0xFFFF0000) >> 16)
# take only the appropriate length
return (x>>(32-bits)) & ((1<<bits)-1)
def antisort(inlist): 
if len(inlist) < 3:
return inlist
inlist = sorted(inlist)
#get the next power of 2 list length
p2len = 2
bits = 1
while p2len < len(inlist):
p2len *= 2
bits += 1
templist = [None] * p2len
for i in range(len(inlist)):
newi = i * p2len // len(inlist)
newi = bitreverse(newi, bits)
templist[newi] = inlist[i]
return [item for item in templist if item != None]
print(antisort(["a","b","c","d","e","f","g",
"h","i","j","k","l","m","n","o","p","q","r",
"s","t","u","v","w","x","y","z"]))

输出:

['a', 'n', 'h', 'u', 'e', 'r', 'k', 'x', 'c', 'p', 'f', 's',
'm', 'z', 'b', 'o', 'i', 'v', 'l', 'y', 'd', 'q', 'j', 'w', 'g', 't']

您可以实现倒置二叉搜索。

from typing import Union, List
sorted_int_list = [1, 1, 2, 3, 3]
unsorted_int_list = [1, 3, 2, 1, 3]
sorted_str_list = [
"example.com",
"example.com",
"test.com",
"stackoverflow.com",
"stackoverflow.com",
]
unsorted_str_list = [
"example.com",
"stackoverflow.com",
"test.com",
"example.com",
"stackoverflow.com",
]

def inverted_binary_search(
input_list: List[Union[str, int]],
search_elem: Union[int, str],
list_selector_start: int,
list_selector_end: int,
) -> int:
if list_selector_end - list_selector_start <= 1:
if search_elem < input_list[list_selector_start]:
return list_selector_start - 1
else:
return list_selector_start
list_selector_mid = (list_selector_start + list_selector_end) // 2
if input_list[list_selector_mid] > search_elem:
return inverted_binary_search(
input_list=input_list,
search_elem=search_elem,
list_selector_start=list_selector_mid,
list_selector_end=list_selector_end,
)
elif input_list[list_selector_mid] < search_elem:
return inverted_binary_search(
input_list=input_list,
search_elem=search_elem,
list_selector_start=list_selector_start,
list_selector_end=list_selector_mid,
)
else:
return list_selector_mid

def inverted_binary_insertion_sort(your_list: List[Union[str, int]]):
for idx in range(1, len(your_list)):
selected_elem = your_list[idx]
inverted_binary_search_position = (
inverted_binary_search(
input_list=your_list,
search_elem=selected_elem,
list_selector_start=0,
list_selector_end=idx,
)
+ 1
)
for idk in range(idx, inverted_binary_search_position, -1):
your_list[idk] = your_list[idk - 1]
your_list[inverted_binary_search_position] = selected_elem
return your_list

输出

inverted_sorted_int_list = inverted_binary_insertion_sort(sorted_int_list)
print(inverted_sorted_int_list)
>> [1, 3, 3, 2, 1]
inverted_sorted_str_list = inverted_binary_insertion_sort(sorted_str_list)
print(inverted_sorted_str_list)
>> ['example.com', 'stackoverflow.com', 'stackoverflow.com', 'test.com', 'example.com']

更新:

给定注释,您还可以运行该函数两次。这将解开重复项。

inverted_sorted_int_list = inverted_binary_insertion_sort(
inverted_binary_insertion_sort(sorted_int_list)
)
>> [1, 3, 2, 1, 3]

这里有一个刺痛,但我不确定它是否会在给定的特定输入集时退化一点。

我们选择最常找到的项目,并将其第一次出现附加到列表中。 然后与第二个最常见的相同,依此类推。

重复找到最多的项目的一半大小。 这是列表的左半部分。

然后从最不频繁到最频繁,选择第一项并添加其值。 当发现一个物品不到最大值的一半时,选择要将其放在哪一侧。

从本质上讲,我们逐个键分层,最终在未排序列表中最左边和最右边的位置得到更频繁的项目,而在中间留下不太频繁的项目。

def unsort(lst:list) -> list:
"""
build a dictionary by frequency first
then loop thru the keys and append 
key by key with the other keys in between
"""
result = []

#dictionary by keys (this would be domains to urls)
di = defaultdict(list)
for v in lst:
di[v].append(v)
#sort by decreasing dupes length
li_len = [(len(val),key) for key, val in di.items()]
li_len.sort(reverse=True)
#most found count
max_c = li_len[0][0]
#halfway point
odd = max_c % 2
num = max_c // 2
if odd:
num += 1
#frequency, high to low
by_freq = [tu[1] for tu in li_len]

di_side = {}
#for the first half, pick from frequent to infrequent
#alternating by keys
for c in range(num):
#frequent to less
for key in by_freq:
entries = di[key]
#first pass:  less than half the number of values 
#we don't want to put all the infrequents first
#and have a more packed second side
if not c:
#pick on way in or out?
if len(entries) <= num:
#might be better to alternate L,R,L
di_side[key] = random.choice(["L","R"])
else:
#pick on both
di_side[key] = "B"

#put in the left half
do_it = di_side[key] in ("L","B")

if do_it and entries:
result.append(entries.pop(0))
#once you have mid point pick from infrequent to frequent
for c in range(num):
#frequent to less
for key in reversed(by_freq):
entries = di[key]
#put in the right half 
do_it = di_side[key] in ("R","B")

if entries:
result.append(entries.pop(0)) 
return result

运行这个我得到:

([1, 1, 2, 3, 3], '2.00', '====>', [3, 1, 2, 1, 3], '3.41')
([1, 3, 2, 3, 1], '3.41', '====>', [3, 1, 2, 1, 3], '3.41')
([1, 2, 3, 0, 1, 2, 3], '6.00', '====>', [3, 2, 1, 0, 1, 2, 3], '5.86')
([3, 2, 1, 0, 1, 2, 3], '5.86', '====>', [3, 2, 1, 0, 1, 2, 3], '5.86')
([3, 2, 1, 3, 1, 2, 3], '6.88', '====>', [3, 2, 3, 2, 1, 3, 1], '5.97')
([1, 2, 3, 4, 5], '0.00', '====>', [5, 1, 2, 3, 4], '0.00')

哦,我还添加了一个assert来检查取消排序时没有删除或更改任何内容:

assert(sorted(lst) == sorted(ulst))

替代方法?

我现在将其作为脚注,但是不聚类的一般想法(不是OP不重载域的特定应用)似乎是强制排斥方法的候选者,其中相同的域将尝试保持尽可能远彼此。 即1, 1, 2=>1, 2, 1因为 1 会相互排斥。 然而,这是一种完全不同的算法方法。

当我遇到类似的问题时,我是如何解决的:

  1. 将两个字符串(在本例中为 URL)之间的"距离"定义为它们的 Levenshtein 距离(计算此值的代码随时可用)
  2. 采用你最喜欢的旅行推销员算法来找到通过你的字符串集的(近似)最短路径(找到确切的最短路径在计算上是不可行的,但近似算法相当有效)
  3. 现在修改你的"距离"指标要反转 - 即计算两个字符串(s1,s2)之间的距离MAX_INT - LevenshteinDistance(s1,s2)
  4. 通过
  5. 此修改,通过集合的"最短路径"现在实际上是最长路径,即最未排序的路径。

打乱列表的一种简单方法是使用具有排列染色体的遗传算法来最大化其"排序"分数。我能够使用 GA 包快速破解 R 中的版本。我不是 Python 用户,但我确信有 Python 的 GA 库包含排列染色体。如果没有,则可以调整具有实值矢量染色体的通用GA文库。您只需使用值在 [0, 1] 中的向量作为染色体,并将每个向量转换为其排序索引。

我希望这个算法正常工作:

unsorted_list = ['c', 'a', 'a', 'a', 'a', 'b', 'b']
d = {i: unsorted_list.count(i) for i in unsorted_list}
print(d)  # {'c': 1, 'a': 4, 'b': 2}
d = {k: v for k, v in sorted(d.items(), key=lambda item: item[1], reverse=True)}
print(d)  # {'a': 4, 'b': 2, 'c': 1}
result = [None] * len(unsorted_list)
border_index_left = 0
border_index_right = len(unsorted_list) - 1
it = iter(d)

def set_recursively(k, nk):
set_borders(k)
set_borders(nk)
if d[k]:
set_recursively(k, nk)

def set_borders(key):
global border_index_left, border_index_right
if key is not None and d[key]:
result[border_index_left] = key
d[key] = d[key] - 1
border_index_left = border_index_left + 1
if key is not None and d[key]:
result[border_index_right] = key
d[key] = d[key] - 1
border_index_right = border_index_right - 1

next_element = next(it, None)
for k, v in d.items():
next_element = next(it, None)
set_recursively(k, next_element)
print(result)  # ['a', 'b', 'a', 'c', 'a', 'b', 'a']

从视觉上看,它看起来像是从边缘走到中间:

[2, 3, 3, 3, 1, 1, 0]
[None, None, None, None, None, None, None]
[3, None, None, None, None, None, None]
[3, None, None, None, None, None, 3]
[3, 1, None, None, None, None, 3]
[3, 1, None, None, None, 1, 3]
[3, 1, 3, None, None, 1, 3]
[3, 1, 3, 2, None, 1, 3]
[3, 1, 3, 2, 0, 1, 3]

只是说,放置一小段时间延迟就可以了。我想有人已经提到过这一点。它非常简单且非常可靠。您可以执行以下操作:

from random import sample
from time import sleep
import requests
intervalList = list(range(0.1, 0.5))
error404 = []
connectionError = []
for i in your_URL_list:
ststusCode = req.get(str(i)).status_code
if ststusCode == 404:
error404.append(i)
sleep(sample(intervalList,1))

干杯

最新更新