Match multiple words until the end of the document

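I am trying to parse resumes with regex. I want to find the section labelled "Education" (or some variation of it), and then use rules to decide where that block ends.

I currently have a working regex that finds the word "Education" and gives me the rest of the document, which I then trim with my rules.

Here is the full code where I define the rules: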

import re

# Section headings that mark the end of the education block
headers = ['experience', 'projects', 'work experience', 'skills summary', 'skills/tools']
for item in resume_paths:
    resume = getText(item)
    resume = resume.replace('\n', ' \n ')
    education = re.findall(r'(?i)\w*Education\w*[^?]+', resume)[0].split('\n')
    paragraph = ''
    # Skip the heading line, then collect lines until the next heading
    for line in education[1:]:
        line = line.strip()
        if not line.isupper() and line.lower() not in headers:
            paragraph += line + '\n'
        else:
            break
    print(resume[:15], paragraph)

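This is the regex I am using:

(?i)\w*Education\w*[^?]+

I run into problems when someone uses the word "education" more than once. I would like the regex to return a list of all matches, each running to the end of the document, and I will then use rules to decide which one is correct. I tried removing symbols from the pattern to get multiple matches, but that gave me two matches of just the word, without the rest of the document.

Thanks!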

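Your regex matches "Education" with \w* on both sides, and then extends the match up to the next question mark. \w does not include spaces, punctuation, etc.

I doubt that is what you want. It will get: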

XYZEducationismallly

but not (as a whole phrase):

Relevant Education

[^?] means "anything that is not a '?'", but I don't see why you would want to scan to the next question mark (or the end of the string).
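To see this behaviour concretely, here is a small sketch that runs the pattern from the question on a made-up sample string (the sample text is an assumption, invented purely for illustration):

```python
import re

# The pattern from the question, with the backslashes written out
pattern = re.compile(r'(?i)\w*Education\w*[^?]+')

# Hypothetical resume text, invented for this demonstration
sample = (
    "Relevant Education\n"
    "BSc, Some University\n"
    "Any questions? Education matters."
)

matches = pattern.findall(sample)
for m in matches:
    print(repr(m))

# The first match starts at "Education" (the word "Relevant" is left out,
# because \w* cannot cross the space) and stops just before the first '?'.
```

Each match ends at a question mark or at the end of the string, which is unlikely to line up with where the education section really ends.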

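Also, if there is no "?" around (quite likely), the [^?]+ will take everything up to the very end of the source string, but you probably want to stop at the next heading (if there is one), such as "Employment History" or whatever.

Doing this really right will be hard, because resumes can be converted to text in many different ways. One obvious example: a line of text might represent one "visual" line of the original, or a whole "paragraph" block, or even a table cell if the author used tables for layout, which is fairly common.

However, if you are stuck working with the text, there may be a clearer, simpler approach, such as: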

import re

eduSection = []
inEducationSection = False
for line in resume.splitlines():  # iterate over lines, not characters
    if re.search(r'\bEducation', line):
        inEducationSection = True
    elif re.search(r'\b(History|Experience|other headingish things)', line):
        # 'other headingish things' is a placeholder for your other headings
        inEducationSection = False
    elif inEducationSection:
        eduSection.append(line)
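The question also asks for every occurrence of "Education" together with the rest of the document, so that rules can pick the right one afterwards. A minimal sketch of one way to get that, using re.finditer and string slicing instead of a single greedy pattern (the function name education_candidates and the sample text are assumptions for illustration):

```python
import re

def education_candidates(text):
    """For each occurrence of the word 'education', return the text from
    the start of that line to the end of the document."""
    candidates = []
    for m in re.finditer(r'(?i)\beducation\b', text):
        # rfind returns -1 when there is no earlier newline, so +1 gives 0
        line_start = text.rfind('\n', 0, m.start()) + 1
        candidates.append(text[line_start:])
    return candidates

# Hypothetical resume text, invented for this demonstration
sample = "Summary\nI value education.\nEDUCATION\nBSc\nSKILLS\nPython"
candidates = education_candidates(sample)
for c in candidates:
    print(repr(c))
```

Because each candidate is a plain slice, overlapping results are not a problem, which is what stops a single regex from returning them all in one findall call.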

If you can pin down more precisely what "headings" look like in your data, you will get better results. For example:

* headings might be all caps, or title caps;
* headings might be the only things that start in column 1;
* headings might have no punctuation except a final ':';
* headings might be really short compared to (most) other lines;
* maybe there are only a few dozen distinct headings that show up often.
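The hints above could be combined into a single test along these lines. This is only a sketch: the name looks_like_heading, the KNOWN_HEADINGS set, and the 30-character limit are assumptions to be tuned against real data, not something taken from the question.

```python
import re

# Assumed list of common headings; extend it from your own resumes
KNOWN_HEADINGS = {'education', 'experience', 'work experience',
                  'projects', 'skills summary', 'skills/tools'}

def looks_like_heading(line, max_len=30):
    """Guess whether a line of resume text is a section heading."""
    text = line.strip()
    if not text or len(text) > max_len:
        return False          # empty, or too long to be a heading
    if line[0].isspace():
        return False          # headings start in column 1
    body = text[:-1] if text.endswith(':') else text
    if re.search(r'[^\w\s/&]', body):
        return False          # punctuation other than a final ':'
    if body.lower() in KNOWN_HEADINGS:
        return True
    return body.isupper() or body.istitle()  # all caps or title caps

for candidate in ['EDUCATION', 'Education:', 'I studied education at university.']:
    print(candidate, looks_like_heading(candidate))
```

A check like this can replace the hard-coded regex alternatives in the loop, so that any heading-like line ends the education section.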

I'd say the first thing to work out is how to tell when to get out of the section. Once you have that, the rest is pretty easy.
