How to extract only the printed table text using a regular expression



I am trying to use a regular expression to extract the formatted tables from text like this:

Table: Person
+-------------+---------+
| Column Name | Type    |
+-------------+---------+
| personId    | int     |
| lastName    | varchar |
| firstName   | varchar |
+-------------+---------+
personId is the primary key column for this table.
This table contains information about the ID of some persons and their first and last names.

Table: Address
+-------------+---------+
| Column Name | Type    |
+-------------+---------+
| addressId   | int     |
| personId    | int     |
| city        | varchar |
| state       | varchar |
+-------------+---------+
addressId is the primary key column for this table.
Each row of this table contains information about the city and state of one person with ID = PersonId.

I only want to extract the text that is formatted as a table, like this:

+-------------+---------+
| Column Name | Type    |
+-------------+---------+
| personId    | int     |
| lastName    | varchar |
| firstName   | varchar |
+-------------+---------+
+-------------+---------+
| Column Name | Type    |
+-------------+---------+
| addressId   | int     |
| personId    | int     |
| city        | varchar |
| state       | varchar |
+-------------+---------+

Is this possible with a regular expression?

I tried this without success:

(\+-(.*)-\+\n)

Thanks in advance!

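Based on your samples and attempt, try the following regex and code in Python 3. (The answer also links to an online demo of this regex.)

With your samples it produces 2 values, which you can access as items 0 and 1 of the list returned by findall.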

import re
value = """....."""  # Your input text; elided here because it is too long.
re.findall(r'^Table: \S+\n+(\+.*\+\n\| Column Name \| Type +\|\n\+-+\+-+\+\n[^+]*\n\+-+\+-+\+)', value, flags=re.MULTILINE)
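Since the snippet above elides the input, here is a minimal end-to-end sketch with the sample text from the question pasted into `value`. The names `TABLE_RE` and `tables` are only illustrative, and the pattern assumes every table has exactly the two columns `Column Name` and `Type`, as in the samples.

```python
import re

# Sample text copied from the question.
value = """Table: Person
+-------------+---------+
| Column Name | Type    |
+-------------+---------+
| personId    | int     |
| lastName    | varchar |
| firstName   | varchar |
+-------------+---------+
personId is the primary key column for this table.
This table contains information about the ID of some persons and their first and last names.

Table: Address
+-------------+---------+
| Column Name | Type    |
+-------------+---------+
| addressId   | int     |
| personId    | int     |
| city        | varchar |
| state       | varchar |
+-------------+---------+
addressId is the primary key column for this table.
Each row of this table contains information about the city and state of one person with ID = PersonId.
"""

# Literal +, | and the \S / \n escapes must keep their backslashes.
TABLE_RE = re.compile(
    r'^Table: \S+\n+(\+.*\+\n\| Column Name \| Type +\|\n\+-+\+-+\+\n[^+]*\n\+-+\+-+\+)',
    flags=re.MULTILINE,
)

# findall returns the content of the single capturing group for each match.
tables = TABLE_RE.findall(value)

for table in tables:
    print(table)
```

`tables[0]` holds the Person table and `tables[1]` the Address table, each without the `Table: ...` heading or the trailing description lines.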

Explanation: a detailed breakdown of the regex above (for explanation purposes only):

^Table:              ##Start of a line (because of re.MULTILINE) followed by the literal text Table:
 \S+\n+              ##A space, then 1 or more non-whitespace characters, then 1 or more newlines.
(                    ##Start of the single capturing group.
\+.*                 ##Literal + followed by any characters on the same line.
\+\n                 ##Literal + followed by a single newline.
\| Column Name       ##Literal | followed by a space and Column Name.
 \| Type +           ##Space, literal |, space, Type, then 1 or more spaces.
\|\n\+-+             ##Literal | followed by a newline, a literal + and 1 or more -.
\+-+\+\n             ##Literal +, 1 or more -, a literal + and a newline.
[^+]*\n\+            ##Everything up to the next + (the data rows), then a newline and a literal +.
-+\+-+\+             ##1 or more -, a literal +, 1 or more -, and a closing literal +.
)                    ##End of the capturing group.
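The same breakdown can be embedded in the pattern itself with the `re.VERBOSE` flag, which ignores whitespace and `#` comments inside the pattern. This is a sketch of an equivalent pattern, not the answer's original code: literal spaces are written as `[ ]` because verbose mode would otherwise drop them, and the names `TABLE_RE`, `sample` and `found` are illustrative.

```python
import re

TABLE_RE = re.compile(
    r"""
    ^Table:[ ]\S+\n+          # 'Table: <name>' at the start of a line, then newline(s)
    (                         # start of the capturing group
      \+.*\+\n                # top border line
      \|[ ]Column[ ]Name[ ]   # header cell 'Column Name'
      \|[ ]Type[ ]+\|\n       # header cell 'Type' padded with spaces
      \+-+\+-+\+\n            # border under the header
      [^+]*\n                 # data rows: everything up to the next '+'
      \+-+\+-+\+              # bottom border
    )                         # end of the capturing group
    """,
    flags=re.MULTILINE | re.VERBOSE,
)

# A shortened, single-table version of the question's sample.
sample = (
    "Table: Person\n"
    "+-------------+---------+\n"
    "| Column Name | Type    |\n"
    "+-------------+---------+\n"
    "| personId    | int     |\n"
    "+-------------+---------+\n"
    "personId is the primary key column for this table.\n"
)

found = TABLE_RE.findall(sample)
print(found[0])
```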

Note: see the Python documentation for the re module's findall function. The code also uses the re.MULTILINE flag, which is described in the same documentation.
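If the column headers vary between tables, a looser alternative (a different technique from the regex above) is to collect runs of consecutive lines that begin with `+` or `|`. The sketch below assumes no ordinary prose line starts with either character; the helper name `extract_tables` and the short `text` sample are hypothetical.

```python
import re

# One or more consecutive lines that each start with '+' or '|'.
BLOCK_RE = re.compile(r'(?:^[+|].*\n?)+', flags=re.MULTILINE)


def extract_tables(text):
    """Return each run of table-looking lines, without its trailing newline."""
    return [m.group(0).rstrip("\n") for m in BLOCK_RE.finditer(text)]


text = (
    "Table: T\n"
    "+---+\n"
    "| a |\n"
    "+---+\n"
    "a note about T\n"
    "\n"
    "Table: U\n"
    "+---+\n"
    "| b |\n"
    "+---+\n"
)

blocks = extract_tables(text)
for block in blocks:
    print(block)
```

The trade-off is precision: this version does not check for the `Table:` heading or the header row, so it accepts any block of lines that merely looks like a table.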
