Python: select the words after a pattern



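Text:

3. MANAGEMENT, FOOD EMPLOYEE AND CONDITIONAL EMPLOYEE; KNOWLEDGE, RESPONSIBILITIES AND REPORTING - Comments: NO WRITTEN EMPLOYEE HEALTH POLICY ON SITE. MUST PROVIDE FOR ALL EMPLOYEES. PRIORITY FOUNDATION VIOLATION 7-38-010 CITATION ISSUED. | 5. PROCEDURES FOR RESPONDING TO VOMITING AND DIARRHEAL EVENTS - Comments: NO WRITTEN CLEANING PROCEDURE OR REQUIRED EQUIPMENT FOR A VOMITING/DIARRHEA EVENT. MUST PROVIDE. PRIORITY FOUNDATION VIOLATION 7-38-005 CITATION ISSUED. | 25. CONSUMER ADVISORY PROVIDED FOR RAW/UNDERCOOKED FOOD - Comments: MENU DOES NOT DISCLOSE AND INFORM CONSUMERS THE SPECIFIC MENU ITEMS THAT ARE RAW OR UNDER COOKED AND A POTENTIAL HAZARD OF CONSUMING SUCH FOOD. MUST PROVIDE A CONSUMER ADVISORY THAT DISCLOSES AND REMINDS CUSTOMERS OF SUCH ITEMS. PRIORITY FOUNDATION VIOLATION. NO CITATION ISSUED. | 38.

Question:

The text is made up of sections, each beginning with an index number (here 3, 5, 25 and 38). For each section I want to extract all the text after "- Comments:" and before the index that starts the next section.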

import re

def comments(x):
    result = []
    for elem in df['Violations']:
        matches = re.findall(r'\d+\. (.*?)(?: - |\r?\n|$)', elem)
        result.extend(matches)
    print(result)

The code above does exactly the opposite: it extracts only the words before "- Comments:". How should I change it?

Many thanks.

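If you want the text between Comments: and |, use those two values in the regex.

'Comments: ([^|]*) \|'

It uses () to capture only the characters between Comments: and |, and those characters must themselves be different from | (see [^|]).

Because | has a special meaning in a regex (alternation), I write \| to treat it as an ordinary character in the text.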
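
To illustrate why the escape matters, here is a minimal sketch on a shortened, made-up string (the variable names and sample text are assumptions for illustration, not part of the original data). Without the backslash, | splits the pattern into two alternatives, one of which is the empty string, so the result fills up with empty matches and truncated captures.

```python
import re

# Shortened stand-in for one entry of the Violations column (made-up text).
text = "1. A - Comments: first note. | 2. B - Comments: second note. | 3."

# Unescaped "|": alternation between "Comments: (.*?) " and the empty pattern.
unescaped = re.findall(r"Comments: (.*?) |", text)

# Escaped "\|": a literal pipe, so the capture runs up to the section separator.
escaped = re.findall(r"Comments: ([^|]*) \|", text)

print(escaped)  # ['first note.', 'second note.']
print("" in unescaped, "first" in unescaped)  # True True
```

Using a raw string (the r prefix) keeps Python from interpreting the backslash before the regex engine sees it.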


'Comments: (.*?) \|'

It uses the non-greedy .*? instead of [^|]*, so the capture stops at the first " |" that follows.
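
As a rough illustration of what the ? after .* changes, the sketch below compares the greedy and non-greedy forms on a shortened, made-up string (the sample text and variable names are assumptions). The greedy version runs to the last " |" in the string and swallows the sections in between.

```python
import re

# Shortened stand-in for the real text (made-up content).
text = "1. A - Comments: first note. | 2. B - Comments: second note. | 3."

# Greedy: .* takes as much as possible, then backtracks to the LAST " |".
greedy = re.findall(r"Comments: (.*) \|", text)

# Non-greedy: .*? takes as little as possible, stopping at the FIRST " |".
lazy = re.findall(r"Comments: (.*?) \|", text)

print(greedy)  # ['first note. | 2. B - Comments: second note.']
print(lazy)    # ['first note.', 'second note.']
```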


import re
elem = '''MANAGEMENT, FOOD EMPLOYEE AND CONDITIONAL EMPLOYEE; KNOWLEDGE, RESPONSIBILITIES AND REPORTING - Comments: NO WRITTEN EMPLOYEE HEALTH POLICY ON SITE. MUST PROVIDE FOR ALL EMPLOYEES. PRIORITY FOUNDATION VIOLATION 7-38-010 CITATION ISSUED. | 5. PROCEDURES FOR RESPONDING TO VOMITING AND DIARRHEAL EVENTS - Comments: NO WRITTEN CLEANING PROCEDURE OR REQUIRED EQUIPMENT FOR A VOMITING/DIARRHEA EVENT. MUST PROVIDE. PRIORITY FOUNDATION VIOLATION 7-38-005 CITATION ISSUED. | 25. CONSUMER ADVISORY PROVIDED FOR RAW/UNDERCOOKED FOOD - Comments: MENU DOES NOT DISCLOSE AND INFORM CONSUMERS THE SPECIFIC MENU ITEMS THAT ARE RAW OR UNDER COOKED AND A POTENTIAL HAZARD OF CONSUMING SUCH FOOD. MUST PROVIDE A CONSUMER ADVISORY THAT DISCLOSES AND REMINDS CUSTOMERS OF SUCH ITEMS. PRIORITY FOUNDATION VIOLATION. NO CITATION ISSUED. | 38.'''
#matches = re.findall(r'Comments: ([^|]*) \|', elem)
matches = re.findall(r'Comments: (.*?) \|', elem)
#print(matches)
for item in matches:
    print(item)
    print('---')

Result:

NO WRITTEN EMPLOYEE HEALTH POLICY ON SITE. MUST PROVIDE FOR ALL EMPLOYEES. PRIORITY FOUNDATION VIOLATION 7-38-010 CITATION ISSUED.
---
NO WRITTEN CLEANING PROCEDURE OR REQUIRED EQUIPMENT FOR A VOMITING/DIARRHEA EVENT. MUST PROVIDE. PRIORITY FOUNDATION VIOLATION 7-38-005 CITATION ISSUED.
---
MENU DOES NOT DISCLOSE AND INFORM CONSUMERS THE SPECIFIC MENU ITEMS THAT ARE RAW OR UNDER COOKED AND A POTENTIAL HAZARD OF CONSUMING SUCH FOOD. MUST PROVIDE A CONSUMER ADVISORY THAT DISCLOSES AND REMINDS CUSTOMERS OF SUCH ITEMS. PRIORITY FOUNDATION VIOLATION. NO CITATION ISSUED.
---

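Your pattern captures as few characters as possible in the group, stopping before " - ", a newline, or the end of the string, and no part of it matches Comments:.

You can change it by matching Comments: as well and adding a capture group for the text that follows it.

\d+\. .*?(?: - Comments:\s*)(.*?)(?: \||$)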
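
A quick way to try this pattern is sketched below on a shortened, made-up string (the text and variable names are assumptions for illustration). The last section here has no trailing " |", so it exercises the $ alternative at the end of the pattern.

```python
import re

# Pattern from above: section index, anything up to " - Comments:", then the capture.
pattern = r"\d+\. .*?(?: - Comments:\s*)(.*?)(?: \||$)"

# Shortened stand-in for the real text (made-up content).
text = "3. HEALTH POLICY - Comments: NO WRITTEN POLICY. | 5. CLEANUP - Comments: NO PROCEDURE."

found = re.findall(pattern, text)
print(found)  # ['NO WRITTEN POLICY.', 'NO PROCEDURE.']
```

Because the pattern contains exactly one capturing group, re.findall returns only the captured comment text rather than the whole match.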


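A more precise approach is to match the start of each section, which is one or more digits, a dot and a space, and then match up to the first occurrence of - Comments: without crossing the start of another section.

After Comments:, a capture group takes the text up to the next " |", or, if this is the last section, up to the end of the string.

With re.findall, the returned list contains the values of capture group 1.

\b\d+\. (?:(?!\d+\. |- Comments:).)*- Comments:\s*(.*?)(?: \||$)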

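The pattern matches:

  • \b a word boundary, to prevent a partial-word match
  • \d+\. one or more digits, a dot and a space
  • (?:(?!\d+\. |- Comments:).)* any character, repeated, as long as what is directly to the right is neither \d+\. followed by a space nor - Comments:
  • - Comments:\s* the literal - Comments: followed by optional whitespace characters
  • (.*?) capture group 1, any characters, as few as possible
  • (?: \||$) either a space followed by a literal |, or the end of the string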


Example

import re
regex = r"\b\d+\. (?:(?!\d+\. |- Comments:).)*- Comments:\s*(.*?)(?: \||$)"
s = "3. MANAGEMENT, FOOD EMPLOYEE AND CONDITIONAL EMPLOYEE; KNOWLEDGE, RESPONSIBILITIES AND REPORTING - Comments: NO WRITTEN EMPLOYEE HEALTH POLICY ON SITE. MUST PROVIDE FOR ALL EMPLOYEES. PRIORITY FOUNDATION VIOLATION 7-38-010 CITATION ISSUED. | 5. PROCEDURES FOR RESPONDING TO VOMITING AND DIARRHEAL EVENTS - Comments: NO WRITTEN CLEANING PROCEDURE OR REQUIRED EQUIPMENT FOR A VOMITING/DIARRHEA EVENT. MUST PROVIDE. PRIORITY FOUNDATION VIOLATION 7-38-005 CITATION ISSUED. | 25. CONSUMER ADVISORY PROVIDED FOR RAW/UNDERCOOKED FOOD - Comments: MENU DOES NOT DISCLOSE AND INFORM CONSUMERS THE SPECIFIC MENU ITEMS THAT ARE RAW OR UNDER COOKED AND A POTENTIAL HAZARD OF CONSUMING SUCH FOOD. MUST PROVIDE A CONSUMER ADVISORY THAT DISCLOSES AND REMINDS CUSTOMERS OF SUCH ITEMS. PRIORITY FOUNDATION VIOLATION. NO CITATION ISSUED.  | 38. "
print(re.findall(regex, s))

Output

[
'NO WRITTEN EMPLOYEE HEALTH POLICY ON SITE. MUST PROVIDE FOR ALL EMPLOYEES. PRIORITY FOUNDATION VIOLATION 7-38-010 CITATION ISSUED.', 
'NO WRITTEN CLEANING PROCEDURE OR REQUIRED EQUIPMENT FOR A VOMITING/DIARRHEA EVENT. MUST PROVIDE. PRIORITY FOUNDATION VIOLATION 7-38-005 CITATION ISSUED.', 
'MENU DOES NOT DISCLOSE AND INFORM CONSUMERS THE SPECIFIC MENU ITEMS THAT ARE RAW OR UNDER COOKED AND A POTENTIAL HAZARD OF CONSUMING SUCH FOOD. MUST PROVIDE A CONSUMER ADVISORY THAT DISCLOSES AND REMINDS CUSTOMERS OF SUCH ITEMS. PRIORITY FOUNDATION VIOLATION. NO CITATION ISSUED. '
]
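
To tie this back to the loop in the question, here is a minimal sketch of a helper that applies the same pattern to several records. The name extract_comments, the constant COMMENT_RE and the sample rows are hypothetical, and a plain list stands in for df['Violations'] so the snippet runs without pandas. The strip() call is an addition that removes the trailing space captured when two spaces precede the |, as in the last item of the output above.

```python
import re

# Same pattern as above, compiled once for reuse across rows.
COMMENT_RE = re.compile(
    r"\b\d+\. (?:(?!\d+\. |- Comments:).)*- Comments:\s*(.*?)(?: \||$)"
)

def extract_comments(violations):
    """Collect the text after '- Comments:' from every section of every row."""
    result = []
    for elem in violations:
        # findall returns capture group 1 for each section that has a comment.
        result.extend(m.strip() for m in COMMENT_RE.findall(elem))
    return result

# Made-up rows standing in for df['Violations'].
rows = [
    "3. HEALTH POLICY - Comments: NO WRITTEN POLICY. | 5. CLEANUP - Comments: NO PROCEDURE.",
    "25. CONSUMER ADVISORY - Comments: MENU DOES NOT DISCLOSE RAW ITEMS.  | 38. ",
]

comments_found = extract_comments(rows)
print(comments_found)
```

With a real DataFrame, the same function could presumably be called as extract_comments(df['Violations']), provided every entry in that column is a string.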
