SQL 查询:正则表达式在具有属性的 HTML 标记之间选择文本



person 表有一个描述列,其中包含 html 格式的文本。

查询应返回每个人的 id 和说明,删除所有 html 格式。

select id, REGEXP_EXTRACT(description, r"REGEX_GOES_HERE") from person;

其中一个描述的示例:

<p class="position-body__description" data-section="pastPositions">• Lorem Ipsum dorem posum <br>This is my test html regex.<br> • Provided general HR support of multiple team members’ development, technical and certification training.<br>• Demonstrated experience in an administrative support role managing multiple requests simultaneously.</p>

预期产出:

1, Lorem Ipsum dorem posum这是我的测试 html 正则表达式。 • 为多个团队成员的发展、技术和认证培训提供一般人力资源支持。• 具有同时管理多个请求的行政支持角色的经验。

下面是 BigQuery Standard SQL

#standardSQL
SELECT id, 
REGEXP_REPLACE(description, r'<.*?>', '') description_without_formatting 
FROM `project.dataset.person`   

您可以使用问题中的虚拟数据进行测试,玩上面,如下例所示

#standardSQL
WITH `project.dataset.person` AS (
SELECT 1 id, '<p class="position-body__description" data-section="pastPositions">• Lorem Ipsum dorem posum <br>This is my test html regex.<br> • Provided general HR support of multiple team members’ development, technical and certification training.<br>• Demonstrated experience in an administrative support role managing multiple requests simultaneously.</p>' description
)
SELECT id, 
REGEXP_REPLACE(description, r'<.*?>', '') description_without_formatting 
FROM `project.dataset.person`   

有结果

Row id  description_without_formatting   
1   1   • Lorem Ipsum dorem posum This is my test html regex. • Provided general HR support of multiple team members’ development, technical and certification training.• Demonstrated experience in an administrative support role managing multiple requests simultaneously.

最新更新