Regex text parser



I have a DataFrame like this:

ID  Series
1102    [('taxi instructions', 13, 30, 'NP'), ('consistent basis', 31, 47, 'NP'), ('the atc taxi clearance', 89, 111, 'NP')]
1500    [('forgot data pages info', 0, 22, 'NP')]
649 [('hud', 0, 3, 'NP'), ('correctly fotr approach', 12, 35, 'NP')]

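I am trying to parse the text in the column named Series into separate columns named Series1, Series2, and so on, up to the maximum number of parsed phrases in any row.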

df_parsed = df['Series'].str[1:-1].str.split(', ', expand = True)

Like this:

ID  Series  Series1 Series2 Series3
1102    [('taxi instructions', 13, 30, 'NP'), ('consistent basis', 31, 47, 'NP'), ('the atc taxi clearance', 89, 111, 'NP')]    taxi instructions   consistent basis    the atc taxi clearance
1500    [('forgot data pages info', 0, 22, 'NP')]   forgot data pages info      
649 [('hud', 0, 3, 'NP'), ('correctly fotr approach', 12, 35, 'NP')]    hud correctly fotr approach

The format of the final result is not easy to read, but perhaps you can create a new column along these lines:

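def process(ls):
    # Join the first element (the phrase) of every tuple in the list
    return ' '.join(x[0] for x in ls)
# Assumes each cell already holds a list of tuples, not its string form
df['Series_new'] = df['Series'].apply(process)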

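If you want to create N new columns (N = max_len(Series_list)), I think you can compute N first. Then, following the idea above, fill the shorter rows with NaN to create the N new columns.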
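
The N-column idea could be sketched roughly as follows. This is only a minimal illustration, not the original answerer's code: it assumes the Series cells may be stored as strings (so they are parsed with ast.literal_eval first), and the helper name extract_phrases and the variables wide and result are made up for this example. Building a DataFrame from lists of unequal length pads the shorter rows with missing values, so N does not have to be computed by hand.

```python
import ast
import pandas as pd

# Sample data copied from the question; the cells are assumed to be strings here
df = pd.DataFrame({
    "ID": [1102, 1500, 649],
    "Series": [
        "[('taxi instructions', 13, 30, 'NP'), ('consistent basis', 31, 47, 'NP'), "
        "('the atc taxi clearance', 89, 111, 'NP')]",
        "[('forgot data pages info', 0, 22, 'NP')]",
        "[('hud', 0, 3, 'NP'), ('correctly fotr approach', 12, 35, 'NP')]",
    ],
})

def extract_phrases(cell):
    """Hypothetical helper: return the first element of every tuple in a cell."""
    # Parse the string form into a real list of tuples if needed
    items = ast.literal_eval(cell) if isinstance(cell, str) else cell
    return [item[0] for item in items]

phrases = df["Series"].apply(extract_phrases)

# Rows of unequal length are padded with missing values,
# so the frame gets as many columns as the longest list (N)
wide = pd.DataFrame(phrases.tolist(), index=df.index)
wide.columns = [f"Series{i + 1}" for i in range(wide.shape[1])]

result = df.join(wide)
print(result.drop(columns="Series"))
```

If the column already contains real lists of tuples, the literal_eval step is skipped automatically by the isinstance check.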
