如何将文本数据与管道符号分隔的行中的单元格转换为通过python的特定模式的数据?



我想将下面的数据转换为特定行4单元格的模式。请查收以下数据样本。

text = """A | B | Lorem | Ipsum | is | simply | dummy
C | D | text | of | the | printing | and
E | F | typesetting | industry. | Lorem
G | H | more | recently | with | desktop | publishing | software | like | Aldus
I | J | Ipsum | has | been | the | industry's
K | L | standard | dummy | text | ever | since | the | 1500s
M | N | took | a
O | P | scrambled | it | to | make | a | type | specimen | book"""

我需要将每行转换为只包含不超过4个单元格。第四个单元格之后的任何单元格都应该插入到下一行,其中前两个单元格类似于第一行,当前行也不应该大于4个单元格。以上文本数据的转换应该如下所示:

A | B | Lorem | Ipsum
A | B | is | simply
A | B | dummy
C | D | text | of
C | D | the | printing
C | D | and
E | F | typesetting | industry.
E | F | Lorem
G | H | more | recently
G | H | with | desktop
G | H | publishing | software
G | H | like | Aldus
.
.
and so on...

我自己尝试了一些东西,但我甚至没有完成一半,因为下面的代码是不完整的。

new_text = ""
for i in text.split('n'):
row = i.split(' | ')
if len(row) == 4:
new_text = new_text + i + 'n'
elif len(row) > 4:
for j in range(len(row)):
if j < 3:
new_text = new_text + row[0] + ' | ' + row[1] + ...

如果每行的单元格数高于4,我无法找出使用前两个单元格的逻辑。

您可以拆分输入行,然后每次处理每行2个元素。可能的代码:

for line in io.StringIO(text):
row = line.strip().split(' | ')
for i in range(2, len(row), 2):
print(' | '.join(row[:2] + row[i: i+2]))

它给出了预期的结果:

A | B | Lorem | Ipsum
A | B | is | simply
A | B | dummy
C | D | text | of
C | D | the | printing
C | D | and
E | F | typesetting | industry.
E | F | Lorem
G | H | more | recently
G | H | with | desktop
G | H | publishing | software
G | H | like | Aldus
I | J | Ipsum | has
I | J | been | the
I | J | industry's
K | L | standard | dummy
K | L | text | ever
K | L | since | the
K | L | 1500s
M | N | took | a
O | P | scrambled | it
O | P | to | make
O | P | a | type
O | P | specimen | book

我将使用pandas来完成这个任务:

import pandas as pd
from io import StringIO
MAX     = 100   # expected maximum number of input columns
IDX_COL = 2     # number of index columns (A / B)
N_COLS  = 2     # number of desired non-index output columns
df = (pd
.read_csv(io.StringIO(text), sep=r's*|s*',
engine='python', names=range(MAX))
.set_index(list(range(IDX_COL)))
.pipe(lambda d: d.set_axis(
pd.MultiIndex.from_arrays(
[(d.columns-IDX_COL)%N_COLS,
(d.columns-IDX_COL)//N_COLS]), axis=1)
)
.stack().droplevel(IDX_COL)
.to_csv('output.csv', header=None, sep='|')
)

输出文件:

A|B|Lorem|Ipsum
A|B|is|simply
A|B|dummy|
C|D|text|of
C|D|the|printing
C|D|and|
E|F|typesetting|industry.
E|F|Lorem|
G|H|more|recently
G|H|with|desktop
G|H|publishing|software
G|H|like|Aldus
I|J|Ipsum|has
I|J|been|the
I|J|industry's|
K|L|standard|dummy
K|L|text|ever
K|L|since|the
K|L|1500s|
M|N|took|a
O|P|scrambled|it
O|P|to|make
O|P|a|type
O|P|specimen|book