Split a text file in the terminal based on brackets or parentheses (top level only)



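I have several text files (UTF-8) that I want to process in a shell script. They are not all in exactly the same format, but if I could break them up into digestible chunks I could handle the rest. This could be programmed in C or Python, but I would prefer not to.

EDIT: I wrote a solution in C; see my own answer. I think this may be the simplest approach after all. If you think I am wrong, please test your solution against the more complicated example input in my answer below.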

--jcxz100

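For clarity (and to make debugging easier) I want the chunks saved as separate text files in a subfolder.

All types of input files consist of:

  1. Junk lines
  2. Lines with junk text followed by a start bracket or parenthesis, i.e. '[', '{', '<' or '(', possibly followed by payload
  3. Payload lines
  4. Lines with brackets or parentheses nested inside the top-level pairs; these are treated as payload too
  5. Payload lines with an end bracket or parenthesis, i.e. ']', '}', '>' or ')', possibly followed by something (junk text and/or the start of a new payload)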
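The four start characters in item 2 pair up with the four end characters in item 5. A minimal sketch of that pairing in C, assuming single-character delimiters only; the function name `closing_for` is made up for illustration:

```c
/* Hypothetical helper: returns the end character that pairs with a
   start bracket, or '\0' if c is not one of the four start characters. */
char closing_for(char c)
{
    switch (c) {
    case '[': return ']';
    case '{': return '}';
    case '<': return '>';
    case '(': return ')';
    default:  return '\0';
    }
}
```

A scanner could call this on every character outside a chunk and treat a non-zero result as the start of a new top-level chunk.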

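I want to break up the input only according to matching pairs of top-level brackets/parentheses. The payload inside those pairs must not be altered (including newlines and whitespace). Everything outside the top-level pairs should be discarded as junk.

Any junk or payload inside double quotes must be considered atomic (handled as raw text, so any brackets or parentheses inside it should be treated as text too).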
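To make the two rules above concrete, here is a rough sketch in C of how the end of one top-level chunk could be located in a buffer that is already in memory. It assumes only double quotes mark atomic text and it ignores backslash escapes; the name `chunk_length` and its parameters are illustrative, not taken from any existing tool:

```c
#include <stddef.h>

/* Hypothetical helper: s must point at the start character `open`,
   whose end character is `close`. Returns the number of bytes in the
   chunk, both delimiters included, or 0 if the input ends before the
   chunk is closed. Only the chunk's own bracket type is counted. */
size_t chunk_length(const char *s, char open, char close)
{
    size_t i;
    int depth = 0;
    int in_quotes = 0;

    for (i = 0; s[i] != '\0'; i++) {
        if (in_quotes) {
            if (s[i] == '"') in_quotes = 0;   /* atomic text ends */
        } else if (s[i] == '"') {
            in_quotes = 1;                    /* atomic text starts */
        } else if (s[i] == open) {
            depth++;                          /* one level deeper */
        } else if (s[i] == close) {
            if (--depth == 0) return i + 1;   /* top-level pair closed */
        }
    }
    return 0;
}
```

Because only `open` and `close` change the depth, a stray bracket of another type inside the chunk is passed through as ordinary payload.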

Here is an example (using only {} pairs):

junk text
"atomic junk"
some junk text followed by a start bracket { here is the actual payload
more payload
"atomic payload"
nested start bracket { - all of this line is untouchable payload too
here is more payload
"yet more atomic payload; this one's got a smiley ;-)"
end of nested bracket pair } - all of this line is untouchable payload too
this is payload too
} trailing junk
intermittent junk
{
payload that goes in second output file    }
end junk

Sorry: some of the input files really are that messy.

The first output file should be:

{ here is the actual payload
more payload
"atomic payload"
nested start bracket { - all of this line is untouchable payload too
here is more payload
"yet more atomic payload; this one's got a smiley ;-)"
end of nested bracket pair } - all of this line is untouchable payload too
this is payload too
}

The second output file:

{
payload that goes in second output file    }

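Notes:

  • I have not quite decided whether the pair of start/end characters needs to be kept in the output or whether they should themselves be discarded as junk. I think a solution that keeps them is the more general one.

  • The same input file can contain a mix of different types of top-level bracket/parenthesis pairs.

  • Beware: there are * and $ characters in the input files, so please avoid confusing bash ;-)

  • I prefer readability over brevity, but not if it makes the solution dramatically slower.

Nice-to-haves:

  • There are backslash-escaped double quotes in the text; these should preferably be handled (I have a hack, but it is not pretty).

  • The script should not break on mismatched pairs of brackets/parentheses in junk and/or payload (note: inside atomics they must be allowed!)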
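One way to cover both nice-to-haves for atomic text is to skip a whole quoted run in a single step, so that nothing inside it (brackets, escaped quotes) is ever inspected. A minimal sketch in C, assuming a backslash always makes the next character literal; `skip_atomic` is a hypothetical name:

```c
/* Hypothetical helper: p must point at an opening quote character.
   Returns a pointer just past the matching closing quote, or a pointer
   to the terminating '\0' if the quote is never closed. A backslash
   makes the character after it literal. */
const char *skip_atomic(const char *p)
{
    char quote = *p++;          /* remember which quote opened the run */

    while (*p != '\0') {
        if (*p == '\\' && p[1] != '\0') {
            p += 2;             /* skip the backslash and the escaped char */
        } else if (*p == quote) {
            return p + 1;       /* step past the closing quote */
        } else {
            p++;
        }
    }
    return p;
}
```

Since the opening character itself is remembered as the terminator, the same routine would also work for single-quoted atomics.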

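More-far-out nice-to-haves:

  • I have not seen it yet, but one could speculate that some input might use single quotes rather than double quotes to denote atomic content... or even a mix of both.

  • It would be nice if the script could easily be modified to parse input of similar structure but with different start/end characters or strings.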
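For the last point, one possible design is to keep the delimiters in a table instead of hard-coding them, so that adding a new pair is a one-line change. The sketch below is C and purely illustrative: the `delim_pair` struct, the `match_open` function and the "BEGIN"/"END" entry are all assumptions, not part of any real input format:

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical delimiter table: each entry pairs a start string with
   its end string. The "BEGIN"/"END" entry is an invented example. */
struct delim_pair {
    const char *open;
    const char *close;
};

static const struct delim_pair delim_table[] = {
    { "{", "}" }, { "[", "]" }, { "(", ")" }, { "<", ">" },
    { "BEGIN", "END" },
};

/* Returns the table entry whose start string occurs at p, or NULL
   if no start string matches there. */
const struct delim_pair *match_open(const char *p)
{
    size_t i;
    size_t n = sizeof delim_table / sizeof delim_table[0];

    for (i = 0; i < n; i++) {
        size_t len = strlen(delim_table[i].open);
        if (strncmp(p, delim_table[i].open, len) == 0)
            return &delim_table[i];
    }
    return NULL;
}
```

The scanner would then compare against `close` with `strncmp` as well, rather than testing a single character.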

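I can see that this is quite a mouthful, but I do not think breaking it down into simpler questions would lead to a robust solution.

The main problem is splitting up the input correctly. Everything else can be ignored or "solved" with hacks, so feel free to ignore the nice-to-haves and the more-far-out nice-to-haves.

Given: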

$ cat file
junk text
"atomic junk"
some junk text followed by a start bracket { here is the actual payload
more payload
"atomic payload"
nested start bracket { - all of this line is untouchable payload too
here is more payload
"yet more atomic payload; this one's got a smiley ;-)"
end of nested bracket pair } - all of this line is untouchable payload too
this is payload too
} trailing junk
intermittent junk
{
payload that goes in second output file    }
end junk

This Perl script will extract the blocks you describe into files block_1, block_2, etc.:

#!/usr/bin/perl
use v5.10;
use warnings;
use strict;
use Text::Balanced qw(extract_multiple extract_bracketed);
my $txt;
while (<>){$txt.=$_;}  # slurp the file
my @blocks = extract_multiple(
$txt,
[
# Extract {...}
sub { extract_bracketed($_[0], '{}') },
],
# Return all the fields
undef,
# Throw out anything which does not match
1
);
chdir "/tmp";
my $base="block_";
my $cnt=1;
for my $block (@blocks){
my $fn="$base$cnt";
say "writing $fn";
open (my $fh, '>', $fn) or die "Could not open file '$fn' $!";
print $fh "$block\n";
close $fh;
$cnt++;
}

Now the files:

$ cat block_1
{ here is the actual payload
more payload
"atomic payload"
nested start bracket { - all of this line is untouchable payload too
here is more payload
"yet more atomic payload; this one's got a smiley ;-)"
end of nested bracket pair } - all of this line is untouchable payload too
this is payload too
}
$ cat block_2
{
payload that goes in second output file    }

Using Text::Balanced is robust and probably the best solution.

You can also extract the blocks with a single Perl regex:

$ perl -0777 -nlE 'while (/(\{(?:(?1)|[^{}]*+)++\})|[^{}\s]++/g) {if ($1) {$cnt++; say "block $cnt:== start:\n$1\n== end";}}' file
block 1:== start:
{ here is the actual payload
more payload
"atomic payload"
nested start bracket { - all of this line is untouchable payload too
here is more payload
"yet more atomic payload; this one's got a smiley ;-)"
end of nested bracket pair } - all of this line is untouchable payload too
this is payload too
}
== end
block 2:== start:
{
payload that goes in second output file    }
== end

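But that is a bit more fragile than using a proper parser like Text::Balanced...

I have a solution in C. There seems to be too much complexity here for this to be done easily in a shell script. The program is not overly complicated, but it still has more than 200 lines of code, which include error checking, some speed optimization, and other niceties.

Source file split-brackets-to-chunks.c: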

#include <stdio.h>
/* Example code by jcxz100 - your problem if you use it! */
#define BUFF_IN_MAX 255
#define BUFF_IN_SIZE (BUFF_IN_MAX+1)
#define OUT_NAME_MAX 31
#define OUT_NAME_SIZE (OUT_NAME_MAX+1)
#define NO_CHAR '\0'
int main()
{
char pcBuff[BUFF_IN_SIZE];
size_t iReadActual;
FILE *pFileIn, *pFileOut;
int iNumberOfOutputFiles;
char pszOutName[OUT_NAME_SIZE];
char cLiteralChar, cAtomicChar, cChunkStartChar, cChunkEndChar;
int iChunkNesting;
char *pcOutputStart;
size_t iOutputLen;
pcBuff[BUFF_IN_MAX] = '\0';  /* ... just to be sure. */
iReadActual = 0;
pFileIn = pFileOut = NULL;
iNumberOfOutputFiles = 0;
pszOutName[OUT_NAME_MAX] = '\0';  /* ... just to be sure. */
cLiteralChar = cAtomicChar = cChunkStartChar = cChunkEndChar = NO_CHAR;
iChunkNesting = 0;
pcOutputStart = (char*)pcBuff;
iOutputLen = 0;
if ((pFileIn = fopen("input-utf-8.txt", "r")) == NULL)
{
printf("What? Where?\n");
return 1;
}
while ((iReadActual = fread(pcBuff, sizeof(char), BUFF_IN_MAX, pFileIn)) > 0)
{
char *pcPivot, *pcStop;
pcBuff[iReadActual] = '\0'; /* ... just to be sure. */
pcPivot = (char*)pcBuff;
pcStop = (char*)pcBuff + iReadActual;
while (pcPivot < pcStop)
{
if (cLiteralChar != NO_CHAR) /* Ignore this char? */
{
/* Yes, ignore this char. */
if (cChunkStartChar != NO_CHAR)
{
/* ... just write it out: */
fprintf(pFileOut, "%c", *pcPivot);
}
pcPivot++;
cLiteralChar = NO_CHAR;
/* End of "Yes, ignore this char." */
}
else if (cAtomicChar != NO_CHAR) /* Are we inside an atomic string? */
{
/* Yup; we are inside an atomic string. */
int bBreakInnerWhile;
bBreakInnerWhile = 0;
pcOutputStart = pcPivot;
while (bBreakInnerWhile == 0)
{
if (*pcPivot == '\\') /* Treat next char as literal? */
{
cLiteralChar = '\\'; /* Yes. */
bBreakInnerWhile = 1;
}
else if (*pcPivot == cAtomicChar) /* End of atomic? */
{
cAtomicChar = NO_CHAR; /* Yes. */
bBreakInnerWhile = 1;
}
if (++pcPivot == pcStop) bBreakInnerWhile = 1;
}
if (cChunkStartChar != NO_CHAR)
{
/* The atomic string is part of a chunk. */
iOutputLen = (size_t)(pcPivot-pcOutputStart);
fprintf(pFileOut, "%.*s", (int)iOutputLen, pcOutputStart); /* the precision argument must be an int */
}
/* End of "Yup; we are inside an atomic string." */
}
else if (cChunkStartChar == NO_CHAR) /* Are we inside a chunk? */
{
/* No, we are outside a chunk. */
int bBreakInnerWhile;
bBreakInnerWhile = 0;
while (bBreakInnerWhile == 0)
{
/* Detect start of anything interesting: */
switch (*pcPivot)
{
/* Start of atomic? */
case '"':
case '\'':
cAtomicChar = *pcPivot;
bBreakInnerWhile = 1;
break;
/* Start of chunk? */
case '{':
cChunkStartChar = *pcPivot;
cChunkEndChar = '}';
break;
case '[':
cChunkStartChar = *pcPivot;
cChunkEndChar = ']';
break;
case '(':
cChunkStartChar = *pcPivot;
cChunkEndChar = ')';
break;
case '<':
cChunkStartChar = *pcPivot;
cChunkEndChar = '>';
break;
}
if (cChunkStartChar != NO_CHAR)
{
iNumberOfOutputFiles++;
printf("Start '%c' '%c' chunk (file %04d.txt)\n", *pcPivot, cChunkEndChar, iNumberOfOutputFiles);
sprintf((char*)pszOutName, "output/%04d.txt", iNumberOfOutputFiles);
if ((pFileOut = fopen(pszOutName, "w")) == NULL)
{
printf("What? How?\n");
fclose(pFileIn);
return 2;
}
bBreakInnerWhile = 1;
}
else if (++pcPivot == pcStop)
{
bBreakInnerWhile = 1;
}
}
/* End of "No, we are outside a chunk." */
}
else
{
/* Yes, we are inside a chunk. */
int bBreakInnerWhile;
bBreakInnerWhile = 0;
pcOutputStart = pcPivot;
while (bBreakInnerWhile == 0)
{
if (*pcPivot == cChunkStartChar)
{
/* Increase level of brackets/parantheses: */
iChunkNesting++;
}
else if (*pcPivot == cChunkEndChar)
{
/* Decrease level of brackets/parantheses: */
iChunkNesting--;
if (iChunkNesting == 0)
{
/* We are now outside chunk. */
bBreakInnerWhile = 1;
}
}
else
{
/* Detect atomic start: */
switch (*pcPivot)
{
case '"':
case '\'':
cAtomicChar = *pcPivot;
bBreakInnerWhile = 1;
break;
}
}
if (++pcPivot == pcStop) bBreakInnerWhile = 1;
}
iOutputLen = (size_t)(pcPivot-pcOutputStart);
fprintf(pFileOut, "%.*s", (int)iOutputLen, pcOutputStart); /* the precision argument must be an int */
if (iChunkNesting == 0)
{
printf("File done.\n");
cChunkStartChar = cChunkEndChar = NO_CHAR;
fclose(pFileOut);
pFileOut = NULL;
}
/* End of "Yes, we are inside a chunk." */
}
}
}
if (cChunkStartChar != NO_CHAR)
{
printf("Chunk exceeds end-of-file. Exiting gracefully.\n");
fclose(pFileOut);
pFileOut = NULL;
}
if (iNumberOfOutputFiles == 0) printf("Nothing to do...\n");
else printf("All done.\n");
fclose(pFileIn);
return 0;
}

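I have solved the nice-to-haves and one of the more-far-out nice-to-haves. To show this, the input is a little more complex than the example in the question: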

junk text
"atomic junk"
some junk text followed by a start bracket { here is the actual payload
more payload
'atomic payload { with start bracket that should be ignored'
nested start bracket { - all of this line is untouchable payload too
here is more payload
"this atomic has a literal double-quote \" inside"
"yet more atomic payload; this one's got a smiley ;-) and a heart <3"
end of nested bracket pair } - all of this line is untouchable payload too
this is payload too
"here's a totally unprovoked $ sign and an * asterisk"
} trailing junk
intermittent junk
<
payload that goes in second output file } mismatched end bracket should be ignored     >
end junk

Resulting file output/0001.txt:

{ here is the actual payload
more payload
'atomic payload { with start bracket that should be ignored'
nested start bracket { - all of this line is untouchable payload too
here is more payload
"this atomic has a literal double-quote \" inside"
"yet more atomic payload; this one's got a smiley ;-) and a heart <3"
end of nested bracket pair } - all of this line is untouchable payload too
this is payload too
"here's a totally unprovoked $ sign and an * asterisk"
}

Resulting file output/0002.txt:

<
payload that goes in second output file } mismatched end bracket should be ignored     >

Thanks to @dawg for the help :)
