Algorithm challenge: arbitrary in-place base conversion for lossless string compression



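It might help to start with a real-world example. Say I'm writing a web app backed by MongoDB, so my records have a long hex primary key, which makes the URL to view a record look like /widget/55c460d8e2d6e59da89d08d0. That seems excessively long. URLs can use many more characters than that. A 24-digit hex number has just under 8 x 10^28 (16^24) possible values. If you limit yourself to the 62 characters matched by the [a-zA-Z0-9] regex class (a YouTube video id uses more), you can get past 8 x 10^28 in only 17 characters.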
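The 17-character figure can be checked with logarithms. The following is a minimal sketch; `minDestLength` is a name made up for this illustration, and because it relies on floating-point logarithms, results that fall very close to an integer boundary should be double-checked.

```javascript
// Smallest number of destination digits that can represent every value
// expressible with `srcDigits` digits in base `srcBase`.
// Hypothetical helper, not part of the question.
function minDestLength(srcDigits, srcBase, dstBase) {
  return Math.ceil(srcDigits * Math.log(srcBase) / Math.log(dstBase));
}

console.log(minDestLength(24, 16, 62)); // 17
```

This is a worst-case length: a particular 24-digit hex id may need fewer than 17 base-62 characters.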

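I want an algorithm that converts any string limited to a specific alphabet of characters into another string over a different alphabet, where the value of each character c can be thought of as alphabet.indexOf(c).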

Something of the form:

convert(value, sourceAlphabet, destinationAlphabet)
assuming

  • all parameters are strings
  • every character in value exists in sourceAlphabet
  • every character in both sourceAlphabet and destinationAlphabet is unique
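To make the assumptions concrete, here is a small sketch that maps a string to its digit values and back while checking them. The names `toDigits` and `fromDigits` are hypothetical and do not appear in the question.

```javascript
// Hypothetical helpers: map between strings and arrays of digit values,
// enforcing the assumptions listed above.
function toDigits(value, alphabet) {
  const chars = Array.from(alphabet); // iterate by code point, not UTF-16 unit
  if (new Set(chars).size !== chars.length) {
    throw new Error("alphabet contains duplicate characters");
  }
  const index = new Map(chars.map((c, i) => [c, i]));
  return Array.from(value, (c) => {
    if (!index.has(c)) throw new Error("character not in alphabet: " + c);
    return index.get(c);
  });
}

function fromDigits(digits, alphabet) {
  const chars = Array.from(alphabet);
  return digits.map((d) => chars[d]).join("");
}

console.log(toDigits("bada55", "0123456789abcdef")); // [ 11, 10, 13, 10, 5, 5 ]
```

Using a Map instead of repeated alphabet.indexOf(c) calls only matters for large alphabets such as a big slice of Unicode.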

Simplest example
var hex = "0123456789abcdef";
var base10 = "0123456789";
var result = convert("12245589", base10, hex); // result is "bada55";

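But I'd also like it to work for converting War & Peace from the Russian alphabet plus some punctuation to the entire Unicode character set and back again, losslessly.

Is this possible?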

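The only way I was ever taught to do base conversions in Comp Sci 101 was to first convert to a base-ten integer by summing digit * base^position, and then do the reverse to convert to the target base. That method is insufficient for converting very long strings, because the integers get too big.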
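A sketch of that textbook method shows where it breaks. `naiveConvert` is a hypothetical name, and the code assumes alphabets made of single UTF-16 code units (plain ASCII here). It holds the whole value in one JavaScript Number, which is exact only up to 2^53.

```javascript
// Textbook conversion through a single JavaScript Number.
function naiveConvert(value, sourceAlphabet, destinationAlphabet) {
  const srcBase = sourceAlphabet.length;
  const dstBase = destinationAlphabet.length;
  // Sum digit * base^position, written in Horner form.
  let n = 0;
  for (const c of value) {
    n = n * srcBase + sourceAlphabet.indexOf(c);
  }
  // Peel off destination digits, least significant first.
  let out = "";
  do {
    out = destinationAlphabet[n % dstBase] + out;
    n = Math.floor(n / dstBase);
  } while (n > 0);
  return out;
}

const hex = "0123456789abcdef";
const base10 = "0123456789";
console.log(naiveConvert("12245589", base10, hex)); // "bada55"

// The 24-digit id from the example needs about 95 bits, far more than a
// Number can hold exactly, so a round trip does not restore it.
const id = "55c460d8e2d6e59da89d08d0";
const roundTrip = naiveConvert(naiveConvert(id, hex, base10), base10, hex);
console.log(roundTrip === id); // false
```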

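It certainly feels, intuitively, as though a base conversion could be done in place as you step through the string (probably backwards, to keep the standard significant-digit order), keeping track of a remainder somehow, but I'm not smart enough to work out how.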
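One well-known way to carry a remainder through the string is schoolbook long division: divide the whole digit sequence by the destination base, keep the remainder as the next output digit, and repeat on the quotient. The sketch below is an illustration under that assumption, not something proposed in this thread. It uses the `convert` signature from the question, never holds a value larger than srcBase * dstBase, and takes quadratic time in the input length. It also drops leading zero-valued characters, so a truly lossless round trip needs the original length stored separately.

```javascript
// Base conversion by repeated long division, using only small integers.
function convert(value, sourceAlphabet, destinationAlphabet) {
  const src = Array.from(sourceAlphabet);
  const dst = Array.from(destinationAlphabet);
  const srcBase = src.length;
  const dstBase = dst.length;

  // Digit values of the input, most significant first.
  let digits = Array.from(value, (c) => src.indexOf(c));
  const out = [];

  // Each pass divides the whole number by dstBase. The remainder is the next
  // destination digit (least significant first); the quotient feeds the next pass.
  while (digits.length > 0) {
    const quotient = [];
    let remainder = 0;
    for (const d of digits) {
      const acc = remainder * srcBase + d; // always below srcBase * dstBase
      const q = Math.floor(acc / dstBase);
      remainder = acc % dstBase;
      if (quotient.length > 0 || q > 0) quotient.push(q); // skip leading zeros
    }
    out.push(dst[remainder]);
    digits = quotient;
  }
  return out.reverse().join("");
}

const hex = "0123456789abcdef";
const base10 = "0123456789";
const base62 =
  "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz";

console.log(convert("12245589", base10, hex)); // "bada55"

const id = "55c460d8e2d6e59da89d08d0";
const short = convert(id, hex, base62);
console.log(convert(short, base62, hex) === id); // true
```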

That's where you come in, StackOverflow. Are you smart enough?

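Perhaps this is a solved problem, done on paper by some 18th-century mathematician, implemented in LISP on punch cards in 1970, and set as the first homework assignment in Cryptography 101, but my searches have borne no fruit.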

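I'd prefer a solution in JavaScript with a functional style, but any language or style will do, as long as you're not cheating with some big-integer library. Bonus points for efficiency, of course.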

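Please refrain from criticizing the original example. The general nerd cred of solving the problem is more important than any application of the solution.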

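Here is a solution in C that is very fast, using bit-shift operations. It assumes that you know what the length of the decoded string should be. The strings are vectors of integers in the range 0..maximum for each alphabet. It is up to the user to convert to and from strings with restricted ranges of characters. As for the "in-place" in the question title, the source and destination vectors can overlap, but only if the source alphabet is not larger than the destination alphabet.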

/*
  recode version 1.0, 22 August 2015
  Copyright (C) 2015 Mark Adler
  This software is provided 'as-is', without any express or implied
  warranty.  In no event will the authors be held liable for any damages
  arising from the use of this software.
  Permission is granted to anyone to use this software for any purpose,
  including commercial applications, and to alter it and redistribute it
  freely, subject to the following restrictions:
  1. The origin of this software must not be misrepresented; you must not
     claim that you wrote the original software. If you use this software
     in a product, an acknowledgment in the product documentation would be
     appreciated but is not required.
  2. Altered source versions must be plainly marked as such, and must not be
     misrepresented as being the original software.
  3. This notice may not be removed or altered from any source distribution.
  Mark Adler
  madler@alumni.caltech.edu
*/
/* Recode a vector from one alphabet to another using intermediate
   variable-length bit codes. */
/* The approach is to use a Huffman code over equiprobable alphabets in two
   directions.  First to encode the source alphabet to a string of bits, and
   second to encode the string of bits to the destination alphabet. This will
   be reasonably close to the efficiency of base-encoding with arbitrary
   precision arithmetic. */
#include <stddef.h>     // size_t
#include <limits.h>     // UINT_MAX, ULLONG_MAX
#if UINT_MAX == ULLONG_MAX
#  error recode() assumes that long long has more bits than int
#endif
/* Take a list of integers source[0..slen-1], all in the range 0..smax, and
   code them into dest[0..*dlen-1], where each value is in the range 0..dmax.
   *dlen returns the length of the result, which will not exceed the value of
   *dlen when called.  If the original *dlen is not large enough to hold the
   full result, then recode() will return non-zero to indicate failure.
   Otherwise recode() will return 0.  recode() will also return non-zero if
   either of the smax or dmax parameters are less than one.  The non-zero
   return codes are 1 if *dlen is not long enough, 2 for invalid parameters,
   and 3 if any of the elements of source are greater than smax.
   Using this same operation on the result with smax and dmax reversed reverses
   the operation, restoring the original vector.  However there may be more
   symbols returned than the original, so the number of symbols expected needs
   to be known for decoding.  (An end symbol could be appended to the source
   alphabet to include the length in the coding, but then encoding and decoding
   would no longer be symmetric, and the coding efficiency would be reduced.
   This is left as an exercise for the reader if that is desired.) */
int recode(unsigned *dest, size_t *dlen, unsigned dmax,
           const unsigned *source, size_t slen, unsigned smax)
{
    // compute sbits and scut, with which we will recode the source with
    // sbits-1 bits for symbols < scut, otherwise with sbits bits (adding scut)
    if (smax < 1)
        return 2;
    unsigned sbits = 0;
    unsigned scut = 1;          // 2**sbits
    while (scut && scut <= smax) {
        scut <<= 1;
        sbits++;
    }
    scut -= smax + 1;
    // same thing for dbits and dcut
    if (dmax < 1)
        return 2;
    unsigned dbits = 0;
    unsigned dcut = 1;          // 2**dbits
    while (dcut && dcut <= dmax) {
        dcut <<= 1;
        dbits++;
    }
    dcut -= dmax + 1;
    // recode a base smax+1 vector to a base dmax+1 vector using an
    // intermediate bit vector (a sliding window of that bit vector is kept in
    // a bit buffer)
    unsigned long long buf = 0;     // bit buffer
    unsigned have = 0;              // number of bits in bit buffer
    size_t i = 0, n = 0;            // source and dest indices
    unsigned sym;                   // symbol being encoded
    for (;;) {
        // encode enough of source into bits to encode that to dest
        while (have < dbits && i < slen) {
            sym = source[i++];
            if (sym > smax) {
                *dlen = n;
                return 3;
            }
            if (sym < scut) {
                buf = (buf << (sbits - 1)) + sym;
                have += sbits - 1;
            }
            else {
                buf = (buf << sbits) + sym + scut;
                have += sbits;
            }
        }
        // if not enough bits to assure one symbol, then break out to a special
        // case for coding the final symbol
        if (have < dbits)
            break;
        // encode one symbol to dest
        if (n == *dlen)
            return 1;
        sym = buf >> (have - dbits + 1);
        if (sym < dcut) {
            dest[n++] = sym;
            have -= dbits - 1;
        }
        else {
            sym = buf >> (have - dbits);
            dest[n++] = sym - dcut;
            have -= dbits;
        }
        buf &= ((unsigned long long)1 << have) - 1;
    }
    // if any bits are left in the bit buffer, encode one last symbol to dest
    if (have) {
        if (n == *dlen)
            return 1;
        sym = buf;
        sym <<= dbits - 1 - have;
        if (sym >= dcut)
            sym = (sym << 1) - dcut;
        dest[n++] = sym;
    }
    // return recoded vector
    *dlen = n;
    return 0;
}
/* Test recode(). */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <math.h>
#include <assert.h>
// Return a random vector of len unsigned values in the range 0..max.
static void ranvec(unsigned *vec, size_t len, unsigned max) {
    unsigned bits = 0;
    unsigned long long mask = 1;
    while (mask <= max) {
        mask <<= 1;
        bits++;
    }
    mask--;
    unsigned long long ran = 0;
    unsigned have = 0;
    size_t n = 0;
    while (n < len) {
        while (have < bits) {
            ran = (ran << 31) + random();
            have += 31;
        }
        if ((ran & mask) <= max)
            vec[n++] = ran & mask;
        ran >>= bits;
        have -= bits;
    }
}
// Get a valid number from str and assign it to var
#define NUM(var, str) \
    do { \
        char *end; \
        unsigned long val = strtoul(str, &end, 0); \
        var = val; \
        if (*end || var != val) { \
            fprintf(stderr, \
                    "invalid or out of range numeric argument: %s\n", str); \
            return 1; \
        } \
    } while (0)
/* "bet n m len count" generates count test vectors of length len, where each
   entry is in the range 0..n.  Each vector is recoded to another vector using
   only symbols in the range 0..m.  That vector is recoded back to a vector
   using only symbols in 0..n, and that result is compared with the original
   random vector.  Report on the average ratio of input and output symbols, as
   compared to the optimal ratio for arbitrary precision base encoding. */
int main(int argc, char **argv)
{
    // get sizes of alphabets and length of test vector, compute maximum sizes
    // of recoded vectors
    unsigned smax, dmax, runs;
    size_t slen, dsize, bsize;
    if (argc != 5) { fputs("need four arguments\n", stderr); return 1; }
    NUM(smax, argv[1]);
    NUM(dmax, argv[2]);
    NUM(slen, argv[3]);
    NUM(runs, argv[4]);
    dsize = ceil(slen * ceil(log2(smax + 1.)) / floor(log2(dmax + 1.)));
    bsize = ceil(dsize * ceil(log2(dmax + 1.)) / floor(log2(smax + 1.)));
    // generate random test vectors, encode, decode, and compare
    srandomdev();
    unsigned source[slen], dest[dsize], back[bsize];
    unsigned mis = 0, i;
    unsigned long long dtot = 0;
    int ret;
    for (i = 0; i < runs; i++) {
        ranvec(source, slen, smax);
        size_t dlen = dsize;
        ret = recode(dest, &dlen, dmax, source, slen, smax);
        if (ret) {
            fprintf(stderr, "encode error %d\n", ret);
            break;
        }
        dtot += dlen;
        size_t blen = bsize;
        ret = recode(back, &blen, smax, dest, dlen, dmax);
        if (ret) {
            fprintf(stderr, "decode error %d\n", ret);
            break;
        }
        // altered from the original: compare slen elements rather than slen bytes
        if (blen < slen || memcmp(source, back, slen * sizeof(unsigned)))  // blen > slen is ok
            mis++;
    }
    if (mis)
        fprintf(stderr, "%u/%u mismatches!\n", mis, i);
    if (ret == 0)
        printf("mean dest/source symbols = %.4f (optimal = %.4f)\n",
               dtot / (i * (double)slen), log(smax + 1.) / log(dmax + 1.));
    return 0;
}
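To see what the intermediate variable-length bit code looks like for one symbol, here is a small JavaScript illustration of how recode() derives sbits and scut. It is a sketch for reading the C code, not a port of it, and the names `codeParams` and `encodeSymbol` are made up for this example.

```javascript
// Code parameters for an alphabet with symbols 0..max, mirroring the way the
// C code derives sbits/scut: `bits` is the smallest width with 2^bits > max,
// and the first `cut` symbols get the shorter (bits - 1)-bit code.
function codeParams(max) {
  let bits = 0;
  let pow = 1;
  while (pow <= max) {
    pow *= 2;
    bits++;
  }
  return { bits, cut: pow - (max + 1) };
}

// Bit string for one symbol under that code.
function encodeSymbol(sym, max) {
  const { bits, cut } = codeParams(max);
  return sym < cut
    ? sym.toString(2).padStart(bits - 1, "0")
    : (sym + cut).toString(2).padStart(bits, "0");
}

console.log(codeParams(61));       // { bits: 6, cut: 2 }
console.log(encodeSymbol(1, 61));  // "00001"
console.log(encodeSymbol(2, 61));  // "000100"
console.log(encodeSymbol(61, 61)); // "111111"
```

For a 62-symbol alphabet, two symbols cost 5 bits and the other sixty cost 6 bits. No short code is a prefix of a long one, which is what lets the bit stream be cut back into symbols of the other alphabet.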

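As pointed out in other StackOverflow answers, try not to think of summing digit * base^position as converting to base ten. Rather, think of it as directing the computer to generate a representation of the quantity the number stands for, in the computer's own terms (for most computers, probably closer to our concept of base 2). Once the computer has its own representation of the quantity, we can direct it to output the number in any way we like.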

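By rejecting "big integer" implementations and asking for letter-by-letter conversion, you are at the same time arguing that the numerical/alphabetical representation of a quantity is not what it actually is, namely that each position represents a quantity of digit * base^position. If the nine-millionth character of War and Peace does represent what you are asking to convert it from, then the computer will at some point need to generate a representation for Д * 33^9000000.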

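I don't think any solution can work in general. Writing n for the source base and m for the destination base, if n^e != m for every integer e, there is no way to calculate the value of the target-base digit at some place p once n^p > MAX_INT.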

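You can avoid this in the case where n^e == m for some e, because then the problem is recursively feasible (the first e digits in base n can be summed and converted into the first digit in base m, then truncated, and the process repeated).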
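The n^e == m case can be sketched as simple digit grouping. The function below is a hypothetical illustration that works on arrays of digit values, most significant first, and assumes the digit count is a multiple of e (pad with leading zeros otherwise).

```javascript
// Special case m === n^e: every group of e source digits maps to exactly one
// destination digit, so no large intermediate value is ever needed.
function groupDigits(digits, n, e) {
  if (digits.length % e !== 0) {
    throw new Error("digit count must be a multiple of e");
  }
  const out = [];
  for (let i = 0; i < digits.length; i += e) {
    let d = 0;
    for (let j = 0; j < e; j++) {
      d = d * n + digits[i + j]; // stays below n^e
    }
    out.push(d);
  }
  return out;
}

// Hex (n = 16) to base 256 (e = 2): "bada55" -> 0xba, 0xda, 0x55
console.log(groupDigits([11, 10, 13, 10, 5, 5], 16, 2)); // [ 186, 218, 85 ]
```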

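If you don't have this useful property, then eventually you will have to take some part of the original base and perform a modulus in n^p, and n^p is going to be greater than MAX_INT, which means it's impossible.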
