从集合中查找多个最大不同的二进制向量

考虑长度为 n 的所有二进制向量的集合S，其中每个向量正好包含 m 个向量;因此每个向量中都有n-m个零。
我的目标是从S构造一个向量的数量k，这样这些向量彼此尽可能不同。

举个简单的例子，取n=4、m=2 和k=2，那么可能的解决方案是：[1,1,0,0] 和 [0,0,1,1]。

这似乎是编码理论文献中的一个开放问题(？

有没有办法(即算法)找到一个次优但好的解决方案？
在这种情况下，汉明距离是正确的性能衡量标准吗？

一些想法：
在本文中，作者提出了几种算法来查找向量的子集，使得成对汉明距离为>= 某个值d。
我已经实现了随机方法，如下所示：取一个集合SS，它由S中的任何向量初始化。然后，我考虑剩余的向量在S中。对于这些向量中的每一个，我检查该向量相对于SS中的每个向量是否至少具有距离d。如果是这样，则将其添加到SS中。
通过取最大可能的d，如果SS的大小为>= k，那么我认为SS是一个最佳解决方案，我从SS中选择k向量的任何子集。使用这种方法，我认为生成的 SS 将取决于SS中初始向量的身份;即有多种解决方案(？
但是，如果SS的大小为<<em>k，该怎么办？
从论文中提出的算法来看，我只了解了随机算法。我对二进制词典搜索(第 2.3 节)感兴趣，但我不知道如何实现它(？

也许你觉得这篇论文很有用(我写的)。它包含有效地创建位串排列的算法。

例如，inc()算法：

long  inc(long h_in , long m0 , long m1) {
long  h_out = h_in | (~m1); //pre -mask
h_out ++;
// increment
h_out = (h_out & m1) | m0; //post -mask
return  h_out;
}

它接受一个输入h_in并返回下一个比h_in至少大 1 的更高值，并"匹配"边界m0和m1。"匹配"的意思是：无论m0在哪里有1，结果就有1，无论m1有0，结果就有0。并不是说h_in必须是关于mo和m1的有效值！另外，请注意，m0必须按位小于m1，这意味着m0不能在m1具有0的位置具有1。

这可用于生成与给定输入字符串具有最小编辑距离的排列：

假设你有0110，你首先把它否定为1001(编辑距离=k)。设置"m0=1001"和"m1=1001"。使用此方法只会在"1001"本身上产生结果。

现在要获取编辑距离为 k-1的所有值，您可以执行以下操作，只需翻转m0或m1位之一，然后inc()将返回所有位串的有序序列，这些位串的差值为k或k-1。

我知道，还不是很有趣，但是您最多可以修改k位，inc()将始终返回所有排列，并具有关于m0和m1的最大允许编辑差异。

现在，要获得所有排列，您必须使用所有可能的m0和m1组合重新运行算法。

示例：要使用编辑距离2获得0110的所有可能排列，您必须使用以下m0=0110和m1=0110排列运行 inc()(要获得排列，必须扩展位位置，这意味着m0设置为0并且m1设置为1：

位 0 和 1扩展：m0=0010和m1=1110
位 0 和 2扩展：m0=0100和m1=1110
位 0 和 3扩展：m0=0110和m1=1111
扩展了位 1 和 2：m0=0000和m1=0110
扩展了位 1 和 3：m0=0010和m1=0111
扩展了位 2 和 3：m0=0100和m1=0111

作为h_0的起始值，我建议简单地使用m0。迭代可以在inc()返回m1后中止。

上述算法在O(x)中生成所有二进制向量x这些向量与给定向量v至少相差y位(可配置)。

使用您对n=number of bits in a vector v的定义，设置y=n会生成 1 个向量，这与输入向量v完全相反。对于y=n-1，它将生成n+1向量：除了一个位之外，n个向量都不同，1个向量在所有位上都不同。等等y的不同值.

**编辑：添加了摘要，并在上面的文本中将错误的"XOR"替换为"否定"。

我不知道最大化汉明距离的总和是否是获得一组"最大不同"二进制向量的最佳标准，但我强烈怀疑它是。此外，我强烈怀疑我将要介绍的算法正好产生一组 k 向量，该向量最大化 n 位和 m 个零的向量的汉明距离之和。不幸的是，我没有时间证明这一点(当然，我可能是错的——在这种情况下，根据您的要求，您将得到一个"次优但好"的解决方案。

警告：在下文中，我假设，作为进一步的条件，结果集可能不会包含两次相同的向量。

我提出的算法如下：

从只有一个向量的结果集开始，重复添加以下一个向量具有最大汉明距离和的剩余向量从结果集中已有的所有向量。停止时结果集包含 k 个向量或所有可用向量添加。

请注意，结果集的汉明距离之和不依赖于第一个或任何后续向量的选择。

鉴于您在评论中提到的约束，我发现"蛮力"方法是可行的：

N<25， 1<米>

<10，>100 (或 10<50)>

蛮力"包括预先计算数组中"词典"顺序的所有向量，并保持相同大小的数组的最新状态，该数组包含具有相同索引的每个向量，该向量到结果集中所有向量的总汉明距离。在每次迭代中，都会更新总汉明距离，并选择与当前结果集具有最大总汉明距离的所有向量中的第一个(按"字典顺序")。所选向量被添加到结果集中，数组被移动以填充其位置，从而有效地减小它们的大小。

这是我在 Java 中的解决方案。如果需要，它可以轻松翻译成任何程序语言。计算 n 个项组合的部分可以用库调用替换(如果可用)。以下 Java 方法具有相应的 C/C++ 宏，该宏在现代 CPU 上使用快速专用处理器指令：Long.numberOfTrailingZeros→__builtin_ctzl，Long.bitCount→__builtin_popcountl.

package waltertross.bits;
public class BitsMain {
private static final String USAGE =
"USAGE: java -jar <thisJar> n m k (1<n<64, 0<m<n, 0<k)";
public static void main (String[] args) {
if (args.length != 3) {
throw new IllegalArgumentException(USAGE);
}
int n = parseIntArg(args[0]); // number of bits
int m = parseIntArg(args[1]); // number of ones
int k = parseIntArg(args[2]); // max size of result set
if (n < 2 || n > 63 || m < 1 || m >= n || k < 1) {
throw new IllegalArgumentException(USAGE);
}
// calculate the total number of available bit vectors
int c = combinations(n, m);
// truncate k to the above number
if (k > c) {
k = c;
}
long[] result   = new long[k]; // the result set (actually an array)
long[] vectors  = new long[c - 1]; // all remaining candidate vectors
long[] hammingD = new long[c - 1]; // their total Hamming distance to the result set
long firstVector = (1L << m) - 1; // m ones in the least significant bits
long lastVector  = firstVector << (n - m); // m ones in the most significant bits
result[0] = firstVector; // initialize the result set
// generate the remaining candidate vectors in "lexicographical" order
int size = 0;
for (long v = firstVector; v != lastVector; ) {
// See http://graphics.stanford.edu/~seander/bithacks.html#NextBitPermutation
long t = v | (v - 1); // t gets v's least significant 0 bits set to 1
// Next set to 1 the most significant bit to change,
// set to 0 the least significant ones, and add the necessary 1 bits.
v = (t + 1) | (((~t & -~t) - 1) >>> (Long.numberOfTrailingZeros(v) + 1));
vectors[size++] = v;
}
assert(size == c - 1);
// chosenVector is always the last vector added to the result set
long chosenVector = firstVector;
// do until the result set is filled with k vectors
for (int r = 1; r < k; r++) {
// find the index of the new chosen vector starting from the first
int chosen = 0;
// add the distance to the old chosenVector to the total distance of the first
hammingD[0] += Long.bitCount(vectors[0] ^ chosenVector);
// initialize the maximum total Hamming distance to that of the first
long maxHammingD = hammingD[0];
// for all the remaining vectors
for (int i = 1; i < size; i++) {
// add the distance to the old chosenVector to their total distance
hammingD[i] += Long.bitCount(vectors[i] ^ chosenVector);
// whenever the calculated distance is greater than the max,
// update the max and the index of the new chosen vector
if (maxHammingD < hammingD[i]) {
maxHammingD = hammingD[i];
chosen = i;
}
}
// set the new chosenVector to the one with the maximum total distance
chosenVector = vectors[chosen];
// add the chosenVector to the result set
result[r] = chosenVector;
// fill in the hole left by the chosenVector by moving all vectors
// that follow it down by 1 (keeping vectors and total distances in sync)
System.arraycopy(vectors,  chosen + 1, vectors,  chosen, size - chosen - 1);
System.arraycopy(hammingD, chosen + 1, hammingD, chosen, size - chosen - 1);
size--;
}
// dump the result set
for (int r = 0; r < k; r++) {
dumpBits(result[r], n);
}
}
private static int parseIntArg(String arg) {
try {
return Integer.parseInt(arg);
} catch (NumberFormatException ex) {
throw new IllegalArgumentException(USAGE);
}
}
private static int combinations(int n, int m) {
// calculate n over m = n! / (m! (n - m)!)
// without using arbitrary precision numbers
if (n <= 0 || m <= 0 || m > n) {
throw new IllegalArgumentException();
}
// possibly avoid unnecessary calculations by swapping m and n - m
if (m * 2 < n) {
m = n - m;
}
if (n == m) {
return 1;
}
// primeFactors[p] contains the power of the prime number p
// in the prime factorization of the result
int[] primeFactors = new int[n + 1];
// collect prime factors of each term of n! / m! with a power of 1
for (int term = n; term > m; term--) {
collectPrimeFactors(term, primeFactors, 1);
}
// collect prime factors of each term of (n - m)! with a power of -1
for (int term = n - m; term > 1; term--) {
collectPrimeFactors(term, primeFactors, -1);
}
// multiply the collected prime factors, checking for overflow
int combinations = 1;
for (int f = 2; f <= n; f += (f == 2) ? 1 : 2) {
// multiply as many times as requested by the stored power
for (int i = primeFactors[f]; i > 0; i--) {
int before = combinations;
combinations *= f;
// check for overflow
if (combinations / f != before) {
String msg = "combinations("+n+", "+m+") > "+Integer.MAX_VALUE;
throw new IllegalArgumentException(msg);
}
}
}
return combinations;
}
private static void collectPrimeFactors(int n, int[] primeFactors, int power) {
// for each candidate prime that fits in the remaining n
// (note that non-primes will have been preceded by their component primes)
for (int i = 2; i <= n; i += (i == 2) ? 1 : 2) {
while (n % i == 0) {
primeFactors[i] += power;
n /= i;
}
}
}
private static void dumpBits(Long bits, int nBits) {
String binary = Long.toBinaryString(bits);
System.out.println(String.format("%"+nBits+"s", binary).replace(' ', '0'));
}
}

n=5、m=2、k=4 的算法数据：

result
00011   00101 00110 01001 01010 01100 10001 10010 10100 11000 vectors
0→2   0→2   0→2   0→2   0→4   0→2   0→2   0→4   0→4 hammingD
^                         chosen
00011   00101 00110 01001 01010 10001 10010 10100 11000
01100     2→4   2→4   2→4   2→4   2→6   2→6   4→6   4→6
^
00011   00101 00110 01001 01010 10010 10100 11000
01100     4→6   4→8   4→6   4→8   6→8   6→8   6→8
10001             ^
00011   00101 01001 01010 10010 10100 11000
01100       6     6     8     8     8     8
10001
00110

样本输出(n=24， m=9， k=20)：

[wtross ~/Dropbox/bits]$ time java -jar bits-1.0-SNAPSHOT.jar 24 9 20
000000000000000111111111
000000111111111000000000
111111000000000000000111
000000000000111111111000
000111111111000000000000
111000000000000000111111
000000000111111111000000
111111111000000000000000
000000000000001011111111
000000111111110100000000
111111000000000000001011
000000000000111111110100
001011111111000000000000
110100000000000000111111
000000001011111111000000
111111110100000000000000
000000000000001101111111
000000111111110010000000
111111000000000000001101
000000000000111111110010
real    0m0.269s
user    0m0.244s
sys     0m0.046s

在我的 Mac 上，在您的约束范围内(n=24、m=9、k=99)最艰难的情况需要 ~550 毫秒。

该算法可以通过一些优化来更快，例如，通过移动较短的数组块。值得注意的是，在Java中，我发现"向上"移动比"向下"移动要慢得多。

更新的答案

查看 Walter Tross 代码的示例输出，我认为生成随机解决方案可以简化为：

以任何向量开头，例如，对于 n=8， m=3， k=5：

A:   01001100

每一步之后，对向量求和以获得每个位置的使用次数：

SUM: 01001100

然后，对于下一个向量，将它们放置在使用最少的位置(在本例中为零次)，例如：

B:   00110001

获得：

A:   01001100  
B:   00110001
SUM: 01111101

然后，还剩下 2 个最少使用的位置，因此对于下一个向量中的 3 个位置，使用这 2 个位置，然后将第三个位置放在任何地方：

C:   10010010

获得：

A:   01001100  
B:   00110001
C:   10010010
SUM: 11121111  (or reset to 00010000 at this point)

然后对于下一个向量，你有 7 个最少使用的位置(总和中的那个)，所以选择任意 3 个，例如：

D:   10100010

获得：

A:   01001100  
B:   00110001
C:   10010010
D:   10100010
SUM: 21221121

对于最终向量，选择 4 个最少使用的位置中的任何一个，例如：

E:   01000101

要生成所有解决方案，只需在每个步骤中生成每个可能的向量：

A:   11100000, 11010000, 11001000, ... 00000111

然后，例如，当 A 和 SUM 为 11100000 时：

B:   00011100, 00011010, 00011001, ... 00000111

然后，例如，当 B 00011100且 SUM 11111100时：

C:   10000011, 01000011, 00100011, 00010011, 00001011, 00000111

然后，例如，当 C 10000011并且 SUM 21111111时：

D:   01110000, 01101000, 01100100, ... 00000111

最后，例如，当 D 为 01110000 且 SUM 为 22221111 时：

E:   00001110, 00001101, 00001011, 00000111

这将导致 C(8,3) × C(5,3) × C(8,1) × C(7,3) × C(4,3) = 56 × 10 × 8 × 35 × 4 = 627,200 个解，n=8， m=3， k=5。

实际上，你需要添加一个方法来避免重复相同的向量，避免把自己画在角落里;所以我认为这不会比沃尔特的答案更简单。

初步答案 - 存在重大问题

(我假设 m 不大于 n/2，即 1 的数量不大于 0 的数量。否则，请使用对称方法。

当 k×m 不大于 n 时，显然存在最优解，例如：

n=10, m=3, k=3:  
A: 1110000000  
B: 0001110000  
C: 0000001110

其中汉明距离均为 2×m：

|AB|=6, |AC|=6, |BC|=6, total=18

当 k×m 大于 n 时，连续向量之间汉明距离差最小的解提供最大的总距离：

n=8, m=3, k=4:
A: 11100000
B: 00111000
C: 00001110
D: 10000011
|AB|=4, |AC|=6, |AD|=4, |BC|=4, |BD|=6, |CD|=4, total=28

n=8, m=3, k=4:
A: 11100000
B: 00011100
C: 00001110
D: 00000111
|AB|=6, |AC|=6, |AD|=6, |BC|=2, |BD|=4, |CD|=2, total=26

所以，实际上，你取 m×k 看看它比 n 大多少，我们称之为 x = m×k−n，这个 x 是重叠的数量，即一个向量在与前一个向量处于相同位置的频率。然后，将重叠分布在不同的矢量上，以最大化总距离。

在上面的例子中，x = 3×4−8 = 4，我们有 4 个向量，所以我们可以均匀地分布重叠，每个向量在与前一个向量相同的位置有 1 个。

要生成所有独特的解决方案，您可以：

计算 x = m×k−n 并将 x 的所有分区生成为 k 个部分，具有尽可能低的最大值：

n=8, m=3, k=5  ->  x=7  
22111, 21211, 21121, 21112, 12211, 12121, 12112, 11221, 11212, 11122  
(discard partitions with value 3)

生成所有要用作向量 A 的向量，例如：

A: 11100000, 11010000, 11001000, 11000100, ... 00000111

对于其中的每一个，生成所有向量 B，它们在字典上小于向量 A，并且与向量 A 具有正确数量的重叠(在示例中为 1 和 2)，例如：

A: 10100100
overlap=1:  
B: 10011000, 10010010, 10010001, 10001010, 10001001, 10000011, 01110000, ... 00000111
overlap=2:  
B: 10100010, 10100001, 10010100, 10001100, 10000110, 10000101, 01100100, ... 00100101

对于其中的每一个，生成所有向量 C，依此类推，直到您有 k 个向量集。生成最后一个向量时，必须考虑与上一个和下一个(即第一个)向量的重叠。

我认为最好将 x 的分区视为 k 作为二叉树：

1                                      2
11                      12                    21         22
111        112           121       122        211       212    221
1112   1121   1122   1211   1212   1221   2111   2112   2121   2211
11122  11212  11221  12112  12121  12211  21112  21121  21211  22111

并在创建解决方案时遍历此树，以便每个向量只需生成一次。

我认为这种方法仅适用于 n、m 和 k 的某些值;我不确定它可以适用于一般情况。

相关内容

最新更新

热门标签：