当整个列表很长(10000)时,识别所有列表对的差异低于给定阈值的最快方法


大家好。很抱歉打扰你。

我有一个任务,我有一份哈希编码的列表,存储在一个有30个位置的列表中,值为0和1。总的来说,我有超过10000个这样的30大小(0/1)的散列码,并且我希望找到具有低于给定阈值(比如0,1,5)的差的所有这样的散列码对,在这种情况下,该对将被认为是"0";类似的";散列编码。

我已经使用嵌套的";对于循环";在python3中(见下面的代码),但我觉得它不够有效,因为这似乎是一个O(N^2),当N=10000甚至更大时,它确实很慢。

我的问题是,有没有更好的方法可以加快查找类似哈希对的速度?理想情况下,我想在O(N)中

注意效率,我指的是在其他情况下找到相似的对,而不是生成哈希编码(这只是为了演示)。

我已经深入研究了这个问题,我找到的所有答案都是关于使用某种收集工具来找到相同的配对,但这里我有一个更普遍的情况,即给定阈值,配对也可能是相似的。

我已经提供了生成示例哈希编码的代码和我正在使用的当前低效程序。我希望你能发现这个问题很有趣,希望一些更好/更聪明/更资深的程序员能帮我解决这个问题。提前谢谢。

import random
import numpy as np
# HashCodingSize = 10
# Just use this to test the program
HashCodingSize = 100
# HashCodingSize = 1000
# What can we do when we have the list over 10000, 100000 size ? 
# This is where the problem is 
# HashCodingSize = 10000
# HashCodingSize = 100000
#Generating "HashCodingSize" of list with each element has size of 30
outputCodingAllPy = []
for seed in range(HashCodingSize):
random.seed(seed)
listLength = 30
numZero = random.randint(1, listLength)
numOne = listLength - numZero
my_list = [0] * numZero + [1] * numOne
random.shuffle(my_list)
# print(my_list)
outputCodingAllPy.append(my_list)
#Covert to np array which is better than python3 list I suppose?
outputCodingAll = np.asarray(outputCodingAllPy)
print(outputCodingAll)
print("The N is", len(outputCodingAll))
hashDiffThreshold = 0
#hashDiffThreshold = 1
#hashDiffThreshold = 5
loopRange = range(outputCodingAll.shape[0])
samePairList = []
#This is O(n^2) I suppose, is there better way ? 
for i in loopRange:
for j in loopRange:
if j > i:
if (sum(abs(outputCodingAll[i,] - outputCodingAll[j,])) <= hashDiffThreshold):
print("The pair (",  str(i), ", ", str(j), ") ")
samePairList.append([i, j])
print("Following pairs are considered the same given the threshold ", hashDiffThreshold)
print(samePairList)

更新3请参阅已接受的答案以获得快速解决方案,或了解更多信息,请阅读我在以下答案部分提供的答案,无问题部分

Update2RAM问题当列表大小达到100000时,当前的速度解决方案仍然存在RAM问题(numpy.core_exceptions._ArrayMemoryError:无法为形状为(100000,100000)、数据类型为int64的数组分配74.5 GiB)。在这种情况下,任何对速度感兴趣但没有大RAM的人都可以考虑并行编程原始方法**

更新当前答案和基准测试:

我简要测试了@Raibek提供的答案,它确实比for循环快得多,并整合了其他人提供的大多数建议(也感谢他们)。现在我的问题已经解决了,对于任何对这个问题进一步感兴趣的人,你可以在接受的答案中参考@Raibek,或者在下面查看我自己的测试程序:

提示:对于那些在项目上绝对缺乏时间的人来说,你需要做的是发挥作用"bits_to_int";以及";find_pairs_by_threshold_fast";回到家,首先将0/1位转换为整数,并使用XOR找到所有小于阈值的对。希望这能帮得更快。

from logging import raiseExceptions
import random
import numpy as np
#check elapsed time
import time

# HashCodingSize = 10
# HashCodingSize = 100
HashCodingSize = 1000
# What can we do when we have the list over 10000, 100000 size ? 
# HashCodingSize = 10000
# HashCodingSize = 100000
#Generating "HashCodingSize" of list with each element has 30 size
outputCodingAllPy = []
for seed in range(HashCodingSize):
random.seed(seed)
listLength = 30
numZero = random.randint(1, listLength)
numOne = listLength - numZero
my_list = [0] * numZero + [1] * numOne
random.shuffle(my_list)
# print(my_list)
outputCodingAllPy.append(my_list)
#Covert to np array which is better than python3 list
#Study how to convert bytes to integers 
outputCodingAll = np.asarray(outputCodingAllPy)
print(outputCodingAll)
print("The N is", len(outputCodingAll))
hashDiffThreshold = 0
def myWay():
loopRange = range(outputCodingAll.shape[0])
samePairList = []
#This is O(n!) I suppose, is there better way ? 
for i in loopRange:
for j in loopRange:
if j > i:
if (sum(abs(outputCodingAll[i,] - outputCodingAll[j,])) <= hashDiffThreshold):
print("The pair (",  str(i), ", ", str(j), ") ")
samePairList.append([i, j])
return(np.array(samePairList))
#Thanks to Raibek
def bits_to_int(bits: np.ndarray) -> np.ndarray:
"""
https://stackoverflow.com/a/59273656/11040577
:param bits:
:return:
"""
assert len(bits.shape) == 2
# number of columns is needed, not bits.size
m, n = bits.shape
# -1 reverses array of powers of 2 of same length as bits
a = 2**np.arange(n)[::-1]
# this matmult is the key line of code
return bits @ a
#Thanks to Raibek
def find_pairs_by_threshold_fast(
coding_all_bits: np.ndarray,
listLength=30,
hashDiffThreshold=0
) -> np.ndarray:
xor_outer_matrix = np.bitwise_xor.outer(coding_all_bits, coding_all_bits)
# counting number of differences
diff_count_matrix = np.bitwise_and(xor_outer_matrix, 1)
for i in range(1, listLength):
diff_count_matrix += np.right_shift(np.bitwise_and(xor_outer_matrix, 2**i), i)
same_pairs = np.transpose(np.where(diff_count_matrix <= hashDiffThreshold))
# filtering out diagonal values
same_pairs = same_pairs[same_pairs[:, 0] != same_pairs[:, 1]]
# filtering out duplicates above diagonal
same_pairs.sort(axis=1)
same_pairs = np.unique(same_pairs, axis=0)
return same_pairs

start = time.time()
outResult1 = myWay()
print("My way")
print("Following pairs are considered the same given the threshold ", hashDiffThreshold)
print(outResult1)
end = time.time()
timeUsedOld = end - start
print(timeUsedOld)

start = time.time()
print('Helper Way updated')
print("Following pairs are considered the same given the threshold ", hashDiffThreshold)
outputCodingAll_bits = bits_to_int(outputCodingAll)
same_pairs_fast = find_pairs_by_threshold_fast(outputCodingAll_bits, 30, hashDiffThreshold)
print(same_pairs_fast)
end = time.time()
timeUsedNew = end - start
print(timeUsedNew)
print(type(outResult1))
print(type(same_pairs_fast))
if ((outResult1 == same_pairs_fast).all()) & (timeUsedNew < timeUsedOld):
print("The two methods have returned the same results, I have been outsmarted !")
print("The faster method used ", timeUsedNew, " while the old method takes ", timeUsedOld)
else:
raiseExceptions("Error, two methods do not return the same results, something must be wrong")

#Thanks to Raibek
#note this suffers from out of memoery problem
# def Helper1Way():
# outer_not_equal = np.not_equal.outer(outputCodingAll, outputCodingAll)
# diff_count_matrix = outer_not_equal.sum((1, 3)) // outputCodingAll.shape[1]
# samePairNumpy = np.transpose(np.where(diff_count_matrix <= hashDiffThreshold))
# # filtering out diagonal values
# samePairNumpy = samePairNumpy[samePairNumpy[:, 0] != samePairNumpy[:, 1]]
# # filtering out duplicates above diagonal
# samePairNumpy.sort(axis=1)
# samePairNumpy = np.unique(samePairNumpy, axis=0)
# return(np.array(samePairNumpy))
# start = time.time()
# outResult2 = Helper1Way()
# print('Helper Way')
# print("Following pairs are considered the same given the threshold ", hashDiffThreshold)
# print(outResult2)
# end = time.time()
# print(end - start)

此版本对整数使用逐位运算。从这个答案可以得到将numpy二进制表示转换为int的方法https://stackoverflow.com/a/59273656/11040577.

台架试验结果表明,新方法比原来的方法快得多

N=1000,0.194秒VS 3.332秒
N=10000,17.417秒VS 338.628秒

import random
import numpy as np
from time import perf_counter

def generate_codings(
HashCodingSize=100,
listLength=30
) -> np.ndarray:
# Generating "HashCodingSize" of list with each element has size of 30
outputCodingAllPy = []
for seed in range(HashCodingSize):
random.seed(seed)
numZero = random.randint(1, listLength)
numOne = listLength - numZero
my_list = [0] * numZero + [1] * numOne
random.shuffle(my_list)
# print(my_list)
outputCodingAllPy.append(my_list)
# Covert to np array which is better than python3 list I suppose?
outputCodingAll = np.asarray(outputCodingAllPy)
return outputCodingAll

def find_pairs_by_threshold(
coding_all: np.ndarray,
hashDiffThreshold=0
) -> np.ndarray:
loopRange = range(coding_all.shape[0])
samePairList = []
#This is O(n!) I suppose, is there better way ?
for i in loopRange:
for j in loopRange:
if j > i:
if (sum(abs(coding_all[i,] - coding_all[j,])) <= hashDiffThreshold):
# print("The pair (",  str(i), ", ", str(j), ") ")
samePairList.append([i, j])
return np.array(samePairList)

def bits_to_int(bits: np.ndarray) -> np.ndarray:
"""
https://stackoverflow.com/a/59273656/11040577
:param bits:
:return:
"""
assert len(bits.shape) == 2
# number of columns is needed, not bits.size
m, n = bits.shape
# -1 reverses array of powers of 2 of same length as bits
a = 2**np.arange(n)[::-1]
# this matmult is the key line of code
return bits @ a

def find_pairs_by_threshold_fast(
coding_all_bits: np.ndarray,
listLength=30,
hashDiffThreshold=0
) -> np.ndarray:
xor_outer_matrix = np.bitwise_xor.outer(coding_all_bits, coding_all_bits)
# counting number of differences
diff_count_matrix = np.bitwise_and(xor_outer_matrix, 1)
for i in range(1, listLength):
diff_count_matrix += np.right_shift(np.bitwise_and(xor_outer_matrix, 2**i), i)
same_pairs = np.transpose(np.where(diff_count_matrix <= hashDiffThreshold))
# filtering out diagonal values
same_pairs = same_pairs[same_pairs[:, 0] != same_pairs[:, 1]]
# filtering out duplicates above diagonal
same_pairs.sort(axis=1)
same_pairs = np.unique(same_pairs, axis=0)
return same_pairs

if __name__ == "__main__":
list_length = 30
hash_diff_threshold = 0
for hash_coding_size in (100, 1000, 10000):
# let's generate samples
output_coding_all = generate_codings(hash_coding_size, list_length)
print("The N is", len(output_coding_all))
# find_pairs_by_threshold bench
start_time = perf_counter()
same_pairs_etalon = find_pairs_by_threshold(output_coding_all, hash_diff_threshold)
end_time = perf_counter()
print(f"find_pairs_by_threshold() took {end_time-start_time} secs...")
print("Following pairs are considered the same given the threshold ", same_pairs_etalon)
# find_pairs_by_threshold_fast bench
# first, we should convert binary representations to int
start_time = perf_counter()
output_coding_all_bits = bits_to_int(output_coding_all)
end_time = perf_counter()
print(f"it took {end_time-start_time} secs to convert numpy array binary to ints...")
start_time = perf_counter()
same_pairs_fast = find_pairs_by_threshold_fast(output_coding_all_bits, list_length, hash_diff_threshold)
end_time = perf_counter()
print(f"find_pairs_by_threshold_fast() took {end_time-start_time} secs...")
# check if the results are the same
print(f"Two lists of pairs found by different methods are identical: {(same_pairs_fast == same_pairs_etalon).all()}")

第一个非常消耗内存的版本:

outer_not_equal = np.not_equal.outer(outputCodingAll, outputCodingAll)
diff_count_matrix = outer_not_equal.sum((1, 3)) // outputCodingAll.shape[1]
samePairNumpy = np.transpose(np.where(diff_count_matrix <= hashDiffThreshold))
# filtering out diagonal values
samePairNumpy = samePairNumpy[samePairNumpy[:, 0] != samePairNumpy[:, 1]]
# filtering out duplicates above diagonal
samePairNumpy.sort(axis=1)
samePairNumpy = np.unique(samePairNumpy, axis=0)

解决内存不足的更新

此版本迭代"slice_size"的切片,最后连接所有迭代的结果。

例如,如果在N=100000上出现"numpy.core_exceptions._ArrayMemoryError",则可以使用"slice_size=1000"、"slice_size=10000"或其他切片大小,直到它在当前环境中最适合您为止。

def find_pairs_by_threshold_fast_v2(
coding_all_bits: np.ndarray,
listLength=30,
hashDiffThreshold=0,
slice_size=None
) -> np.ndarray:
if slice_size is None:
xor_outer_matrix = np.bitwise_xor.outer(coding_all_bits, coding_all_bits)
# counting number of differences
diff_count_matrix = np.bitwise_and(xor_outer_matrix, 1)
for i in range(1, listLength):
diff_count_matrix += np.right_shift(np.bitwise_and(xor_outer_matrix, 2 ** i), i)
same_pairs = np.transpose(np.where(diff_count_matrix <= hashDiffThreshold))
else:
same_pairs_list = []

for slice_starts in range(0, len(coding_all_bits), slice_size):

xor_outer_matrix = np.bitwise_xor.outer(coding_all_bits[slice_starts: slice_starts+slice_size], coding_all_bits)

# counting number of differences
diff_count_matrix = np.bitwise_and(xor_outer_matrix, 1)
for i in range(1, listLength):
diff_count_matrix += np.right_shift(np.bitwise_and(xor_outer_matrix, 2**i), i)

same_pairs = np.transpose(np.where(diff_count_matrix <= hashDiffThreshold))

same_pairs[:, 0] += slice_starts

same_pairs_list.append(same_pairs)

same_pairs = np.concatenate(same_pairs_list)
# filtering out diagonal values
same_pairs = same_pairs[same_pairs[:, 0] != same_pairs[:, 1]]
# filtering out duplicates above diagonal
same_pairs.sort(axis=1)
same_pairs = np.unique(same_pairs, axis=0)
return same_pairs

编辑:
澄清如何在"diff_count_matrix"变量中计算差异数
"xor_outer_matrix"中每个哈希对的差异数是二进制表示中的"1"位数
为了计算"xor_outer_matrix"的每个int中的"1"位的数量,我们使用了按位运算,如下一个示例所示

为了简单起见,假设我们将数字41作为8位int

41的8位二进制表示为00101001。

现在,我们可以这样计算ones_count的个数:

  1. ones_count=0
  2. (00101001)&(00000001)=00000001,这是1的二进制表示。
    因此,ones_count=0+1=1。
  3. i=1,2**i=2。2的二进制表示为00000010。
    (00101001)&(00000010)=00000000.
    右移位(00000000,i)=00000000
    因此,ones_count=1+0=1。
  4. i=2、2**2=4。4的二进制表示为00000100。
    (00101001)&(00000100)=00000000.
    右移位(00000000,i)=00000000
    因此,ones_count=1+0=1。
  5. i=3、2**3=8。8的二进制表示为00001000。
    (00101001)&(00001000)=00001000。
    右移位(00001000,i)=00000001
    因此,ones_count=1+1=2。
  6. i=4、2**4=16。16的二进制表示为00010000。
    (00101001)&(00010000)=00000000.
    右移位(00000000,i)=00000000
    因此,ones_count=2+0=2。
  7. i=5、2**5=32。32的二进制表示为00100000。
    (00101001)&(00100000)=00100000。
    右移位(00100000,i)=00000001
    因此,ones_count=2+1=3。
  8. i=6、2**6=64。64的二进制表示为01000000。
    (00101001)&(01000000)=00000000。
    右移位(00000000,i)=00000000
    因此,ones_count=3+0=3。
  9. i=7、2**7=128。128的二进制表示为10000000。
    (00101001)&(10000000)=00000000。
    右移位(00000000,i)=00000000
    因此,ones_count=3+0=3。

所以,最后我们发现41的二进制表示中的1的个数是3。

如果您只需要30位向量,那么最好将其表示为32位整数中的30位。然后两个";矢量";只是两个整数的CCD_ 1中的位数。存在用于计算整数中非零比特的数目的有效算法。使用CCD_ 2可以容易地将这些矢量化。

所以算法是:

  • 生成0和(1<<30)-1之间的HashCodingSize随机整数。这是numpy.random.randint()的一行
  • 对于每个值xor,将其与数组进行比较(请参见numpy.bitwise_xor),计算每个xor输出值中的位数(对其中一种位计数算法进行矢量化),并找到位计数小于或等于hashDiffThreshold的索引

这仍然是O(n^2),但在python中只是一个单独的循环;循环中的每个操作在具有CCD_ 7调用的长度为n的向量上进行操作。

只要您的listLength在计算机上的大小在整数以内,我就会使用整数。然后,您可以xor这些值(使用广播将所有值一次异或)以获得不同的位数,对这些位数求和,然后使用xor0查找符合哈希差要求的索引。例如:

import numpy as np
import random
HashCodingSize = 10
listLength = 30
outputCodingAll = np.array([random.choice(range(2**listLength)) for _ in range(HashCodingSize)])
# sample result
# array([995834408, 173548139, 717311089,  87822983, 813938401, 
#        363814224, 970707528, 907497995, 337492435, 361696322])
distance = bit_count(outputCodingAll[:, np.newaxis] ^ outputCodingAll)
# sample result
# array([[ 0, 10, 15, 18, 14, 18,  8, 12, 18, 16],
#        [10,  0, 13, 14, 16, 24, 14, 14, 16, 18],
#        [15, 13,  0, 23, 13, 15, 15, 17, 19, 15],
#        [18, 14, 23,  0, 18, 16, 18, 12, 12, 14],
#        [14, 16, 13, 18,  0, 16, 12, 14, 14, 14],
#        [18, 24, 15, 16, 16,  0, 14, 16, 12,  6],
#        [ 8, 14, 15, 18, 12, 14,  0, 12, 18, 14],
#        [12, 14, 17, 12, 14, 16, 12,  0, 14, 14],
#        [18, 16, 19, 12, 14, 12, 18, 14,  0, 12],
#        [16, 18, 15, 14, 14,  6, 14, 14, 12,  0]], dtype=int32)
hashDiffThreshold = 10
samePairList = np.transpose(np.nonzero(distance < hashDiffThreshold))
# sample result
# array([[0, 0],
#        [0, 6],
#        [1, 1],
#        [2, 2],
#        [3, 3],
#        [4, 4],
#        [5, 5],
#        [5, 9],
#        [6, 0],
#        [6, 6],
#        [7, 7],
#        [8, 8],
#        [9, 5],
#        [9, 9]], dtype=int64)

请注意,结果重复对(例如[5,9]和[9,5]),因为它们都作为第一个和第二个操作数进行了测试。它还包括针对自身测试的每个值(显然是0)。如果需要,这些结果可以很容易地过滤掉。

注意,如果您想将任何值转换为10的列表,您可以将数字格式化为长度为listLength的二进制字符串,并将每个字符映射为int,例如

list(map(int, f'{outputCodingAll[0]:0{listLength}b}'))
# sample output
# [0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 1, 1]

此代码使用来自以下答案的bit_count函数:

def bit_count(arr):
# Make the values type-agnostic (as long as it's integers)
t = arr.dtype.type
mask = t(-1)
s55 = t(0x5555555555555555 & mask)  # Add more digits for 128bit support
s33 = t(0x3333333333333333 & mask)
s0F = t(0x0F0F0F0F0F0F0F0F & mask)
s01 = t(0x0101010101010101 & mask)

arr = arr - ((arr >> 1) & s55)
arr = (arr & s33) + ((arr >> 2) & s33)
arr = (arr + (arr >> 4)) & s0F
return (arr * s01) >> (8 * (arr.itemsize - 1))

在我的项目中利用并实现@Raibek的伟大答案后,我决定通过回答这个问题来完成这个问题。对于像chatGPT这样的机器人来说,未来的训练也更容易(微笑)。。。

简言之,除了Raibek的答案外,我还写了自己的版本,将10个基数转换为单个数、向量或矩阵中的任何基数,以便于理解。它返回的结果与Raibek提供的函数相同。我还写了Raibek答案的另一个版本,尽管它返回了相同的结果,但速度要慢得多,所以这是为了理解解决方案。

此外,我写了一个替代答案,而不是计算两个30位序列中1的差异,而是比较两个30位数序列表示的两个数字之间的绝对差异。虽然没有明确的证据表明我为什么需要这样做,但考虑以下场景,如果第一对是100001和000001,第二对是000011和000001的话,两对似乎都只有一个不同的1,但如果你把它看作一个二进制表示,那么第一对的差异会比第二对大得多,如果存在阈值,那么说两对可以被视为同一组可能是不合理的。然而,这可能是有争议的,因为没有人告诉我们这个30位的哈希码必须是二进制表示(即,它只能被视为正常序列)。此外,当我们将阈值设置为0时,两种算法都会返回相同的对(我已经验证了这一点)。当我们改变阈值的值时,接受的答案返回的序列对的数量不同,比阈值低1,而我提供的答案将返回一对序列,其二进制表示值低于阈值。在这种情况下,在实践中应该使用什么取决于上下文,所以我决定在这里提供替代算法以供将来参考:

Raibek的回答(与他提供的相同):

#Original method
def find_pairs_by_threshold_fast_v2(
coding_all_bits: np.ndarray,
listLength=30,
hashDiffThreshold=0,
slice_size=None
) -> np.ndarray:
if slice_size is None:
xor_outer_matrix = np.bitwise_xor.outer(coding_all_bits, coding_all_bits)
# counting number of differences
diff_count_matrix = np.bitwise_and(xor_outer_matrix, 1)
for i in range(1, listLength):
diff_count_matrix += np.right_shift(np.bitwise_and(xor_outer_matrix, 2 ** i), i)
same_pairs = np.transpose(np.where(diff_count_matrix <= hashDiffThreshold))
else:
same_pairs_list = []

for slice_starts in range(0, len(coding_all_bits), slice_size):

xor_outer_matrix = np.bitwise_xor.outer(coding_all_bits[slice_starts: slice_starts+slice_size], coding_all_bits)

# counting number of differences
diff_count_matrix = np.bitwise_and(xor_outer_matrix, 1)
for i in range(1, listLength):
diff_count_matrix += np.right_shift(np.bitwise_and(xor_outer_matrix, 2**i), i)

same_pairs = np.transpose(np.where(diff_count_matrix <= hashDiffThreshold))

same_pairs[:, 0] += slice_starts

same_pairs_list.append(same_pairs)

same_pairs = np.concatenate(same_pairs_list)
# filtering out diagonal values
same_pairs = same_pairs[same_pairs[:, 0] != same_pairs[:, 1]]
# filtering out duplicates above diagonal
same_pairs.sort(axis=1)
same_pairs = np.unique(same_pairs, axis=0)
return same_pairs

我们将使用由这30位表示的整数,即替代方法,但也基于Rabek的答案,而不是计算1中的差的数量;

def find_pairs_by_threshold_fast_v2_alt(
coding_all_bits: np.ndarray,
listLength=30,
hashDiffThreshold=0,
slice_size=None
) -> np.ndarray:
if slice_size is None:
#https://numpy.org/doc/stable/reference/generated/numpy.ufunc.outer.html
#np.ufunc.outer means to run the function on all pairs of A and B
#so below simply means compute the xor betweeen all paris of coding list 
#just the same as what I have done using for i in range(lenA), for j in range(lenB) etc..
#bitwise_xor returns the value represented by binary 
#you could use binary_repr to represent value in binary instead (note for binary_repr it does not have .outer so you may not use pair-wise in this case)
print("coding_all_bits is n", coding_all_bits)
# Directly calculate differences between two elements and return the absolute value 
xor_outer_matrix = np.absolute(np.subtract.outer(coding_all_bits, coding_all_bits))
# xor_outer_matrix = np.bitwise_xor.outer(coding_all_bits, coding_all_bits)
print("xor_outer_matrix is n", xor_outer_matrix)
same_pairs = np.transpose(np.where(xor_outer_matrix <= hashDiffThreshold))
else:
same_pairs_list = []

for slice_starts in range(0, len(coding_all_bits), slice_size):

# xor_outer_matrix = np.bitwise_xor.outer(coding_all_bits[slice_starts: slice_starts+slice_size], coding_all_bits)
xor_outer_matrix = np.absolute(np.subtract.outer(coding_all_bits, coding_all_bits))

same_pairs = np.transpose(np.where(xor_outer_matrix <= hashDiffThreshold))

same_pairs[:, 0] += slice_starts

same_pairs_list.append(same_pairs)

same_pairs = np.concatenate(same_pairs_list)
# filtering out diagonal values
same_pairs = same_pairs[same_pairs[:, 0] != same_pairs[:, 1]]
# filtering out duplicates above diagonal
same_pairs.sort(axis=1)
same_pairs = np.unique(same_pairs, axis=0)
return same_pairs

以下是我关于将整数转换为位,或将位转换为整数的漏洞,这是不体面的,甚至不接近,但可能对像我这样希望熟悉位表示等的新程序员有用。

其他堆叠器提供的比特转换程序:

def bits_to_int(bits: np.ndarray) -> np.ndarray:
"""
https://stackoverflow.com/a/59273656/11040577
:param bits:
:return:
"""
assert len(bits.shape) == 2
# number of columns is needed, not bits.size
m, n = bits.shape
# -1 reverses array of powers of 2 of same length as bits
a = 2**np.arange(n)[::-1]
# this matmult is the key line of code
return bits @ a

以下是我的探索,从转换单个数字开始转换数字矩阵。。。

def ConvertIntToBits(IntValue, base):
# When integer is 0 there is no way to convert it into bits
if IntValue != 0:
num_binaray = math.floor(math.log(IntValue, base) + 1)
print("we need", num_binaray,"digits for value", IntValue, "on base", base)
powerList = np.arange(num_binaray-1, -1, -1)
# print(powerList)
rawIntValue = IntValue
bitResult = []
# print(range(len(powerList)))
for i in range(len(powerList)):
bitsValue = math.floor(rawIntValue/(base**(powerList[i])))
# print("powerList[i]:", powerList[i])
# print("bitsValue:", bitsValue)
rawIntValue = rawIntValue - bitsValue * (base**powerList[i])
# print("rawIntValue:", rawIntValue)
bitResult.append(bitsValue)
# bitResult = bitResult
# print(bitResult)
elif IntValue == 0:
bitResult = [0]
return(bitResult)
# base2 = ConvertIntToBits(IntValue=125, base=2)
# base10 = ConvertIntToBits(IntValue=125, base=10)
# print("base10: ", base10)
# ConvertIntToBits(IntValue=96, base=2)
# ConvertIntToBits(IntValue=100, base=7)
#Next convert bits back to integer 
#note this does not accept the list of list
def ConvertBitsIntToInt(IntBits, base):
num_binaray = len(IntBits)
print("we have", num_binaray,"digits for bits", IntBits, "on base", base)
powerList = np.arange(num_binaray-1, -1, -1)
# print(powerList)
IntValue = sum(IntBits * base**powerList)
print(IntValue)
return(IntValue)
# for testValue in [1, 100, 200, 60, 70, 8]:
#     for baseValue in [2, 3, 4, 5]:
#         IntBitsSammple = ConvertIntToBits(IntValue=testValue, base=baseValue)
#         ConvertBitsIntToInt(IntBitsSammple, base=baseValue)
#Think about what to do if np array has arrays which have different length of list
#When the list inside has different lengths, we could add 0 in front to make them have the same length
#this is becuase in different base system, 0 * base^n would still be 0 no matter what you do
def ConvertBitsListToIntList(IntBitsList, base):
if isinstance(IntBitsList, (np.ndarray)):
print("Our input are already np arrays")
IntBitsArray = IntBitsList
else:
print("input is not np array, so we are converting")
# paddling (i.e., part of number would have digits less than others, 
# we paddling them by adding 0 in front of them without changing the original number)
pad = len(max(IntBitsList, key=len))
IntBitsListPad = np.array([[0]*(pad-len(i)) + i for i in IntBitsList])
IntBitsArray = np.asarray(IntBitsListPad)

print(IntBitsArray)
shape_binaray = IntBitsArray.shape
num_binaray = shape_binaray[1]
length_binary = shape_binaray[0]
print("we have", num_binaray, "digits for each bit and in total ", length_binary, " bits from", IntBitsArray, "on base", base)
powerList = np.asarray([np.arange(num_binaray-1, -1, -1)] * length_binary)
# print(powerList)
IntValueList = np.sum(IntBitsArray * base**powerList, axis=1)
#Convert np array back to list (it is better to convert it to list outside the function)
IntValueList.tolist()
# print(IntValueList)
return(IntValueList)
def ConvertIntListToBitsList(IntList, base):
if isinstance(IntList, (np.ndarray)):
print("Our input are already np arrays")
IntArray = IntList
else:
print("input is not np array, so we are converting")
IntArray = np.asarray(IntList)
# print(IntArray)
bitFinal = []
for intValue in IntArray:
bitsResults = ConvertIntToBits(intValue, base)
bitFinal.append(bitsResults)
# bitFinal = np.asarray(bitFinal, dtype=object)
# print(bitFinal)
return(bitFinal)
# Convert a matrix of ints to a matrix of bits
def ConvertIntMatrixToBitsMatrix(intMat, base, returnType="bitsList"):
if isinstance(intMat, (np.ndarray)):
print("Our input are already np arrays")
IntArray = intMat
else:
print("input is not np array, so we are converting")
IntArray = np.asarray(intMat)
ArrayShape = IntArray.shape
print("The shape of our input is", ArrayShape)
#return a list with converted bits 
bitFinal = []
bitFinalMatrix = np.empty((ArrayShape[0],ArrayShape[1]))
for i in range(ArrayShape[0]):
for j in range(ArrayShape[1]):
# for i in range(2):
#     for j in range(2):
# print(IntArray[i, j])
# print(ConvertIntToBits(IntArray[i, j], base))
# below return the bits 

# below return the sum 
ConvertedBits = ConvertIntToBits(IntArray[i, j], base)
# Return a list with converted bits 
bitFinal.append(ConvertedBits)
# Return a matrix with sumed 1s 
bitFinalMatrix[i, j] = sum(ConvertedBits)
if returnType == "bitsList":
rstMatrix = bitFinal
elif returnType == "NumOnesMatrix":
rstMatrix = bitFinalMatrix
return(rstMatrix)
print("An example of ConvertIntListToBitsList: ")
print(ConvertIntListToBitsList([4, 8, 9], 2))
print("An example of ConvertIntMatrixToBitsMatrix: ")
# print(ConvertIntMatrixToBitsMatrix([[4, 8, 9], [2, 3, 1]], 2))
#The problem is how we deal with 0 
print(ConvertIntMatrixToBitsMatrix([[0, 8, 9], [2, 3, 1]], 2, "bitsList"))
#note for base 10, you can use 0-9 to represent number 
#for base 5, you can use 0-5 
#for base 7, you can use 0-6
testBase = 2
test1 = ConvertIntToBits(IntValue=19, base=testBase)
test2 = ConvertIntToBits(IntValue=15, base=testBase)
test3 = ConvertIntToBits(IntValue=50, base=testBase)
test4 = ConvertIntToBits(IntValue=41, base=testBase)
print("test1 is ", test1)
print("test2 is ", test2)
print("test3 is ", test3)
print("test4 is ", test4)
print(ConvertBitsListToIntList([test1, test2, test3], testBase))
print(ConvertIntListToBitsList(IntList=[19, 15, 50], base=testBase))
#See whether it works for the outputCodingAll (it worked, double check)
myConvert = ConvertBitsListToIntList(outputCodingAll, testBase)
onlineCovert = bits_to_int(outputCodingAll)
if myConvert.all() == onlineCovert.all():
print("My way is the same as the online way")
else:
print("My way is different from online way")

最后,对Rabeik的回答稍作修改,旨在了解他的代码是做什么的,但运行速度要慢得多,即";另一种方法可以是将XOR的整数表示转换为二进制表示,然后简单地将它们相加,但这需要Python存储一个大矩阵,这可能会给RAM带来很大压力,":

def find_pairs_by_threshold_fast_v2_branch1(
coding_all_bits: np.ndarray,
listLength=30,
hashDiffThreshold=0,
slice_size=None
) -> np.ndarray:
if slice_size is None:
#https://numpy.org/doc/stable/reference/generated/numpy.ufunc.outer.html
#np.ufunc.outer means to run the function on all pairs of A and B
#so below simply means compute the xor betweeen all paris of coding list 
#just the same as what I have done using for i in range(lenA), for j in range(lenB) etc..
#bitwise_xor returns the value represented by binary 
#you could use binary_repr to represent value in binary instead (note for binary_repr it does not have .outer so you may not use pair-wise in this case)
xor_outer_matrix = np.bitwise_xor.outer(coding_all_bits, coding_all_bits)
# print("xor_outer_matrix is n", xor_outer_matrix)
# print(np.binary_repr(1052745519))
# print(ConvertIntListToBitsList([1052745519], 2))
# let's try convert xor_outer_matrix to bits and sum them 
"""
Convert the difference matrix of XOR into binary represenation and store in a matrix and sum them up might be an alternative way
but this may require a lot of RAM, but for the purpose of understanding of integers and bits, let's try this appoarch as well. 
"""
# xor_outer_matrix_bits = bits_to_int(xor_outer_matrix)
# The reason it does not work in the first place is we haven't dealt with 0 in base=2
xor_outer_matrix_bits = ConvertIntMatrixToBitsMatrix(xor_outer_matrix, base=2, returnType="NumOnesMatrix")
same_pairs = np.transpose(np.where(xor_outer_matrix_bits <= hashDiffThreshold))
else:
same_pairs_list = []

for slice_starts in range(0, len(coding_all_bits), slice_size):

xor_outer_matrix = np.bitwise_xor.outer(coding_all_bits[slice_starts: slice_starts+slice_size], coding_all_bits)

# counting number of differences
xor_outer_matrix_bits = ConvertIntMatrixToBitsMatrix(xor_outer_matrix, base=2, returnType="NumOnesMatrix")
same_pairs = np.transpose(np.where(xor_outer_matrix_bits <= hashDiffThreshold))

same_pairs[:, 0] += slice_starts

same_pairs_list.append(same_pairs)

same_pairs = np.concatenate(same_pairs_list)
# filtering out diagonal values
same_pairs = same_pairs[same_pairs[:, 0] != same_pairs[:, 1]]
# filtering out duplicates above diagonal
same_pairs.sort(axis=1)
same_pairs = np.unique(same_pairs, axis=0)
return same_pairs

希望这能有所帮助。

最新更新