Python: create a new column from a DataFrame using a regex that extracts up to \n

I have a DataFrame like this:

data = {'c1':['Level:     LOGGING_ONLY\n Thrown: lib: this is problem type 01\n \t\n \tError executing the statement: error statement 1\n', 
'Level: NOT_LOGGING_ONLY\n Thrown: lib: this is problem type 01\n \t\n \tError executing the statement: error statement 3\n', 
'Level:     LOGGING_ONLY\n Thrown: lib: this is problem type 02\n \t\n \tError executing the statement: error statement2\n', 
'Level: NOT_LOGGING_ONLY\n Thrown: lib: this is problem type 04\n \t\n \tError executing the statement: error statement1\n'],
'c2':["one", "two", "three", "four"]}

I want to create:

A regex that extracts everything after "Thrown: lib:" up to the first \n. I will call this match "group 01". It should give me the following:
```
data = {'c3':['this is problem type 01', 
'this is problem type 01', 
'this is problem type 02', 
'this is problem type 04']}
```

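Then I want a regex that extracts everything after "group 01" (the match of the previous regex), skipping the \t and \n between the sentences, up to the next \n. It should give me the following: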

data = {'c4':['Error executing the statement: error statement 1', 
'Error executing the statement: error statement 3', 
'Error executing the statement: error statement2', 
'Error executing the statement: error statement1']}

In the end, I want my DataFrame to look like this:

data = {'c1':['Level:     LOGGING_ONLY\n Thrown: lib: this is problem type 01\n \t\n \tError executing the statement: error statement 1', 
'Level: NOT_LOGGING_ONLY\n Thrown: lib: this is problem type 01\n \t\n \tError executing the statement: error statement 3', 
'Level:     LOGGING_ONLY\n Thrown: lib: this is problem type 02\n \t\n \tError executing the statement: error statement2', 
'Level: NOT_LOGGING_ONLY\n Thrown: lib: this is problem type 04\n \t\n \tError executing the statement: error statement1'],
'c3':['this is problem type 01', 
'this is problem type 01', 
'this is problem type 02', 
'this is problem type 04'],
'c4':['Error executing the statement: error statement 1', 
'Error executing the statement: error statement 3', 
'Error executing the statement: error statement2', 
'Error executing the statement: error statement1'],
'c2':["one", "two", "three", "four"]}

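This is what I have so far. I am trying to extract everything from "Thrown: lib:" up to the first \n, but it does not work: the pattern captures everything through the end of the string.

import pandas as pd

df = pd.DataFrame(data)
df['exception'] = df['c1'].str.extract(r'Thrown: lib: (.*(?:\r?\n.*)*)', expand=False)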

It can probably be done in a single line, but something like this works:

import re
import pandas as pd

data = {'c1':['Level:     LOGGING_ONLY\n Thrown: lib: this is problem type 01\n \t\n \tError executing the statement: error statement 1\n', 
'Level: NOT_LOGGING_ONLY\n Thrown: lib: this is problem type 01\n \t\n \tError executing the statement: error statement 3\n', 
'Level:     LOGGING_ONLY\n Thrown: lib: this is problem type 02\n \t\n \tError executing the statement: error statement2\n', 
'Level: NOT_LOGGING_ONLY\n Thrown: lib: this is problem type 04\n \t\n \tError executing the statement: error statement1\n'],
'c2':["one", "two", "three", "four"]}

df = pd.DataFrame(data)
# Letters, digits and whitespace after "Thrown: lib: "; the greedy match can
# include trailing whitespace lines, which .str.strip() removes.
pattern1 = r'Thrown: lib: ([a-zA-Z\d\s]*)\n'
df['c3'] = df['c1'].str.extract(pattern1, expand=False).str.strip()
# One or more "\n \t" separators, then the message up to the next newline.
pattern2 = r'(\n\s\t){1,}(.*)\n'
# Column 1 of the extracted frame is the second capture group (the message).
df['c4'] = df['c1'].str.extract(pattern2, expand=True)[1]

Output:

print(df.to_string())
                                                                                                                           c1     c2                       c3                                                c4
0  Level:     LOGGING_ONLY\n Thrown: lib: this is problem type 01\n \t\n \tError executing the statement: error statement 1\n    one  this is problem type 01  Error executing the statement: error statement 1
1  Level: NOT_LOGGING_ONLY\n Thrown: lib: this is problem type 01\n \t\n \tError executing the statement: error statement 3\n    two  this is problem type 01  Error executing the statement: error statement 3
2   Level:     LOGGING_ONLY\n Thrown: lib: this is problem type 02\n \t\n \tError executing the statement: error statement2\n  three  this is problem type 02   Error executing the statement: error statement2
3   Level: NOT_LOGGING_ONLY\n Thrown: lib: this is problem type 04\n \t\n \tError executing the statement: error statement1\n   four  this is problem type 04   Error executing the statement: error statement1
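
Both columns can also be pulled out with a single `str.extract` call by using named groups. The following is only a sketch, assuming the rows contain real newline and tab characters as in the sample data; the combined pattern is an illustration, not part of the original answer.

```python
import pandas as pd

# Two sample rows in the question's format (real \n and \t characters assumed).
data = {
    'c1': [
        'Level:     LOGGING_ONLY\n Thrown: lib: this is problem type 01\n \t\n \tError executing the statement: error statement 1\n',
        'Level: NOT_LOGGING_ONLY\n Thrown: lib: this is problem type 04\n \t\n \tError executing the statement: error statement1\n',
    ],
    'c2': ['one', 'four'],
}
df = pd.DataFrame(data)

# c3: the rest of the "Thrown: lib:" line.
# (?:[ \t]*\n)* skips lines that contain only spaces/tabs.
# c4: the next non-blank line, without its leading spaces/tabs.
pattern = r'Thrown: lib: (?P<c3>[^\n]*)\n(?:[ \t]*\n)*[ \t]*(?P<c4>[^\n]*)'

# With named groups, str.extract returns a DataFrame whose columns are the group names.
df = df.join(df['c1'].str.extract(pattern))
print(df[['c2', 'c3', 'c4']].to_string())
```

Rows that do not match the pattern get NaN in both new columns rather than raising an error.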

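I would use the re package:

data['c3'] = [re.findall(r"Thrown: lib: ([^\n]+)", x)[0] for x in data['c1']]
data['c4'] = [re.split("\n", x)[3].strip() for x in data['c1']]

The first pattern extracts everything between "Thrown: lib:" and the first newline; findall returns a list, so [0] takes the single match.
The second assumes that the relevant message is always the 4th token when the string is split on \n, which appears to be the case.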
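
To see why index 3 is the right token, it can help to print the pieces that the split produces for one row. This is a small illustration using the first sample row, again assuming real newline and tab characters:

```python
import re

row = ('Level:     LOGGING_ONLY\n Thrown: lib: this is problem type 01\n'
       ' \t\n \tError executing the statement: error statement 1\n')

# Split on newlines and show each token with its index.
tokens = re.split("\n", row)
for i, token in enumerate(tokens):
    print(i, repr(token))

# Token 2 is the whitespace-only line; token 3 holds the message.
message = tokens[3].strip()
print(message)
```

The trailing newline produces an empty final token, so the split yields five tokens in total.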

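Following up on the question below: the pattern for data['c4'] relies on the message always being the 4th token when the string is split on "\n".
Now, if the delimiter of interest is "\n \t\n", you can modify the pattern:

data['c4'] = [re.split("\n \t\n", x)[1].strip() for x in data['c1']]

or

data['c4'] = [re.findall(".*?\n \t\n(.*)", x)[0].strip() for x in data['c1']]

The last approach is preferable. Note that if the delimiter is missing, split returns a single-element list, so indexing with [1] raises an IndexError; findall(...)[0] fails the same way on an empty result, so rows without the delimiter need a guard in either case.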
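
A minimal sketch of a defensive variant that returns None instead of raising when a row lacks the delimiter. The names `MESSAGE_RE` and `extract_message` are hypothetical, and returning None for non-matching rows is an assumption about the desired behavior.

```python
import re
from typing import Optional

# Delimiter "\n \t\n" followed by the message up to the end of that line.
MESSAGE_RE = re.compile(r"\n \t\n(.*)")

def extract_message(text: str) -> Optional[str]:
    """Return the stripped message after the delimiter, or None if it is absent."""
    match = MESSAGE_RE.search(text)
    return match.group(1).strip() if match else None

rows = [
    'Level:     LOGGING_ONLY\n Thrown: lib: this is problem type 01\n \t\n \tError executing the statement: error statement 1\n',
    'a row without the delimiter',
]
messages = [extract_message(r) for r in rows]
print(messages)
```

The resulting list can be assigned to data['c4'] in the same way as the list comprehensions above.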
