使用php中的回溯，使用ASCII字符代码压缩字符串

我想使用字符ASCII码压缩字符串。我想用数字模式来压缩它们。因为ASCII码是数字，所以我想在ASCII字符码列表中找到子模式。

理论

这将是我发现的每个模式的格式：

[nnn][n][nn]，其中：
- [nnn]是第一个字符的ASCII代码，来自具有相同模式的组号
- [n] 是某个模式/规则的自定义编号(我将在下面详细解释)
- [nn]显示此规则发生的次数

数字模式尚未具体建立。但让我给你举几个例子：

相同字符
线性增长(每个数字/ascii都比以前大，有一个)
线性减少(每个数字/ascii都比以前小，有一个)

现在让我们看看一些情况：

"adeflk"；变成"；097.1.01-100.2.03-108.3.02"；
- 相同焦度，线性增长3倍，线性下降2倍
"rrrrrrrrr"；变成"；114.1.11"；
- 同一个字符11次
"tsrqpozh"；变成"；116.3.06-122.1.01-104.1.01"；
- 线性下降6倍，相同的字符，相同的字母

我添加了点('.')和短划线('-')，这样你就可以很容易地看到它们。

事实上，我们没有看到好的结果(压缩)。我想对大字符串使用此算法。添加更多的规则(数字模式)，我们会增加更改，使结果比原来更短。我知道现有的压缩解决方案。我想要这个解决方案，因为结果只有数字，它对我有帮助

我尝试过的

// recursive function
function run (string $data, array &$rules): string {
if (strlen($data) == 1) {
// str_pad for having always ASCII code with 3 digits
return (str_pad(ord($data), 3, '0', STR_PAD_LEFT) .'.'. '1' .'.'. '01');
}
$ord = ord($data); // first char
$strlen = strlen($data);
$nr = str_pad($ord, 3, '0', STR_PAD_LEFT); // str_pad for having always ASCII code with 3 digits
$result = '';
// compares every rule
foreach ($rules as $key => $rule) {
for ($i = 1; $i < $strlen; $i++) {
// check for how many times this rule matches
if (!$rule($ord, $data, $i)) {
// save the shortest result (so we can compress)
if (strlen($r = ($nr .'.'. $key .'.'. $i .' - '. run(substr($data, $i), $rules))) < strlen($result)
|| !$result) {
$result = $r;
}
continue 2; // we are going to next rule
}
}
// if comes here, it means entire $data follow this rule ($key)
if (strlen($r = (($nr .'.'. $key .'.'. $i))) < strlen($result)
|| !$result) {
$result = $r; // entire data follow this $rule
}
}
return $result; // it will return the shortest result it got
}
// ASCII compressor
function compress (string $data): string {
$rules = array( // ASCII rules
1 => function (int $ord, string $data, int $i): bool { // same char
return ($ord == ord($data[$i]));
},
2 => function (int $ord, string $data, int $i): bool { // linear growth
return (($ord+$i) == ord($data[$i]));
},
3 => function (int $ord, string $data, int $i): bool { // progressive growth
return ((ord($data[$i-1])+$i) == ord($data[$i]));
},
4 => function (int $ord, string $data, int $i): bool { // linear decrease
return (($ord-$i) == ord($data[$i]));
},
5 => function (int $ord, string $data, int $i): bool { // progressive decrease
return ((ord($data[$i-1])-$i) == ord($data[$i]));
}
);
// we use base64_encode because we want only ASCII chars
return run(base64_encode($data), $rules);
}

我添加了点('.')和短划线('-')只是为了方便测试。

结果

compress("ana ar") => "089.1.1 - 087.1.1 - 053.1.1 - 104.1.1 - 073.1.1 - 071.4.2 - 121.1.01"

没关系。而且跑得很快。没有问题。

compress("ana aros") => Fatal error: Maximum execution time of 15 seconds exceeded

若字符串稍微长一点，它就会变得太多。它在1-7个字符内快速正常工作。但当字符串中有更多的字符时，就会发生这种情况。

该算法运行得并不完美，也没有返回完美的6位数模式。在到达那里之前，我已经被它卡住了。

问题

我如何提高这种回溯的性能，以便现在运行正常，并使用更多规则？

搜索渐变/中缀重复对于压缩自然语言来说不是一个好的匹配。使用基于字典的方法(与压缩数据捆绑在一起的动态字典，以及在参考集上训练的预编译字典)，自然语言更容易压缩，因为即使是ASCII编码中的重复序列通常也不会遵循任何琐碎的几何模式，但是当仅观察单个字符的顺序表示时显得相当随机。

也就是说，你的算法之所以如此缓慢，是因为你正在探索所有可能的模式，这导致了输入长度的运行时间指数，精确地说是O(5^n)。对于您自己设定的目标，即在一组5个任意规则中找到理想压缩，这已经尽可能好了。如果有什么不同的话，你只能通过一个常数因子来降低运行时的复杂性，但你无法摆脱指数运行时。换言之，即使您应用了完美的优化，在不可避免地再次遇到超时之前，也只能将您可以处理的最大输入长度增加30-50%。

@noam的解决方案甚至没有试图找到理想的模式，而是贪婪地使用第一个匹配模式来消耗输入。因此，它将错误地忽略更好的匹配，但作为回报，它也只需查看每个输入字符一次，从而导致线性运行时复杂性O(n)。

当然，您当前的解决方案中有一些细节，仅基于对规则的简单观察，这些细节就更容易解决。但要注意，当你试图添加更多规则时，这些假设会被打破。

具体来说，如果你对尝试规则的顺序很明智，你可以避免大部分回溯：

尝试首先启动一个新的几何模式ord(n[i])=ord(n[0])+i，并且只有当它与前面至少3个字符匹配时才接受为匹配
尝试继续当前的几何图案
尝试继续当前的渐变模式
尝试启动新的渐变ord(n[i])=ord(n[0])+i，并且只有当它匹配至少前面2个字符时才接受为匹配
试着最后开始/继续简单的重复，并始终接受

一旦输入中的字符被这些规则中的任何一个接受(意味着它已被序列消耗)，您将不再需要从中回溯或检查任何其他规则，因为您已经找到了它的最佳表示形式。您仍然需要重新检查添加到序列中的每个后续字符的规则，因为可能需要梯度规则的后缀作为几何规则的前缀。

一般来说，规则集中允许这样做的模式是，对于每一个优先级较高的规则，该规则的匹配项在以下任何规则中都不可能有更好的匹配。如果你愿意，你可以很容易地正式证明你的每一对可能的规则。

如果您想测试您的实现，您应该专门测试ABDHIK之类的模式。即使H与当前运行的几何序列ABDH匹配，使用它作为新几何序列HIK的起点也是无条件的更好选择。

我为您的问题提出了初始解决方案。请注意：

你永远不会得到只有一个字母的序列，因为每两个连续的字母都是一个"线性增长"；有一定的区别
我的溶液不是很干净。例如，可以将$matches和$rules组合为一个阵列
我的解决方案是天真和贪婪的。例如，在示例adeflk中，序列def是3的序列，但由于我的解决方案是贪婪的，它将ad视为2的序列，而ef视为另一个2的序列。话虽如此，您仍然可以改进我的代码
代码很难测试。您可能应该使用OOP，并将代码划分为许多易于单独测试的小方法

<?php
function compress($string, $rules, $matches) {
if ($string === '') {
return getBestMatch($matches);
}
$currentCharacter = $string[0];
$matchFound = false;
foreach ($rules as $index => &$rule) {
if ($rule['active']) {
$soFarLength = strlen($matches[$index]);
if ($soFarLength === 0) {
$matchFound = true;
$matches[$index] = $currentCharacter;
} elseif ($rule['callback']($currentCharacter, $matches[$index])) {
$matches[$index] .= $currentCharacter;
$matchFound = true;
} else {
$rule['active'] = false;
}
}
}
if ($matchFound) {
return compress(substr($string, 1), $rules, $matches);
} else {
return getBestMatch($matches) . startNewSequence($string);
}
}
function getBestMatch($matches) {
$rule = -1;
$length = -1;
foreach ($matches as $index => $match) {
if (strlen($match) > $length) {
$length = strlen($match);
$rule = $index;
}
}
if ($length <= 0) {
return '';
}
return ord($matches[$rule][0]) . '.' . $rule . '.' . $length . "n";
}
function startNewSequence($string) {
$rules = [
// rule number 1 - all characters are the same
1 => [
'active' => true,
'callback' => function ($a, $b) {
return $a === substr($b, -1);
}
],
// rule number 2 - ASCII code of current letter is one more than the last letter ("linear growth")
2 => [
'active' => true,
'callback' => function ($a, $b) {
return ord($a) === (1 + ord(substr($b, -1)));
}
],
// rule number 3 - ASCII code is a geometric progression. The ord() of each character increases with each step.
3 => [
'active' => true,
'callback' => function ($a, $b) {
if (strlen($b) == 1) {
return ord($a) > ord($b);
}
$lastCharOrd = ord(substr($b, -1));
$oneBeforeLastCharOrd = ord(substr($b, -2, 1));
$lastDiff = $lastCharOrd - $oneBeforeLastCharOrd;
$currentOrd = ord($a);
return ($currentOrd - $lastCharOrd) === ($lastDiff + 1);
}
],
// rule number 4 - ASCII code of current letter is one less than the last letter ("linear decrease")
4 => [
'active' => true,
'callback' => function ($a, $b) {
return ord($a) === (ord(substr($b, -1)) - 1);
}
],
// rule number 5 - ASCII code is a negative geometric progression. The ord() of each character decreases by one
// with each step.
5 => [
'active' => true,
'callback' => function ($a, $b) {
if (strlen($b) == 1) {
return ord($a) < ord($b);
}
$lastCharOrd = ord(substr($b, -1));
$oneBeforeLastCharOrd = ord(substr($b, -2, 1));
$lastDiff = $lastCharOrd - $oneBeforeLastCharOrd;
$currentOrd = ord($a);
return ($currentOrd - $lastCharOrd) === ($lastDiff - 1);
}
],
];
$matches = [
1 => '',
2 => '',
3 => '',
4 => '',
5 => '',
];
return compress($string, $rules, $matches);
}
echo startNewSequence('tsrqpozh');

理论

我尝试过的

结果

问题

相关内容

最新更新

热门标签：