Regex: parsing a Buffy script using lookbehind



I'm having trouble parsing this page: http://www.buffyworld.com/buffy/buffy/transcripts/114_tran.html

I'm trying to extract each character's name together with the associated dialogue. The text looks like this:

<p>BUFFY: Wait!
<p>She stands there panting, watching the truck turn a corner.
<p>BUFFY: (whining) Don't you want your garbage?
<p>She sighs, pouts, turns and walks back toward the house.
<p>Cut to the kitchen. Buffy enters through the back door, holding a pile of
mail. She begins looking through it. We see Dawn standing by the island.
<p>DAWN: Hey Buffy. Oh, don't forget, today's trash day.<br>BUFFY: (sourly)
Thanks.
<p>Dawn piles her books into her school bag. Buffy opens a letter.
<p>Close shot of the letter.
<p>
<p>Dawn smiles, and she and Willow exit. Buffy picks up the still-wrapped
sandwich and stares at it.
<p>BUFFY: (to herself) Somebody should.
<p>She sighs, puts the sandwich back in the bag.
<p>Cut to the Bronze. Pan across various people drinking and dancing,
bartender serving. Reveal Xander and Anya sitting at the bar eating chips from
several bags. A notebook sits in front of them bearing the wedding seating
chart.
<p>ANYA: See ... this seating chart makes no sense. We have to do it again.
(Xander nodding) We can't do it again. You do it.<br>XANDER: The seating
chart's fine. Let's get back to the table arrangements. I'm starting to have
dreams of gardenia bouquets. (winces) I am so glad my manly coworkers didn't
just hear me say that. (eating chips)

Ideally, I would match from one <p> or <br> up to the next <p> or <br>. I tried to use a lookbehind and a lookahead for this:

reg = "((?<=<p>)|(?<=<br>))(?P<character>.+):(?P<dialogue>.+)((?=<p>)|(?=<br>))"
script = re.findall(reg, html_text)

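Unfortunately, this doesn't match anything. When I leave off the lookahead ((?=<p>)|(?=<br>)), I do match lines, but only as long as there is no newline inside the matched dialogue. The match seems to terminate at the newline instead of continuing to the next <p>.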

E.g., on this line, "Thanks." is not matched: <p>DAWN: Hey Buffy. Oh, don't forget, today's trash day.<br>BUFFY: (sourly) Thanks.

Thanks for any insight!

Working around the dot notation:

re.findall('((?<=<p>)|(?<=<br>))([A-Z]+):([^<]+)', text)
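
The reason this works is that the dot stops at a newline by default, while the negated class [^<] matches any character except "<", newlines included. A minimal sketch on a fragment of the transcript follows. The non-capturing group (?:...) and the whitespace cleanup are my own additions, not part of the pattern above, and the [A-Z]+ name assumes speaker names are single upper-case words:

```python
import re

# A fragment copied from the transcript, including the line break
# inside BUFFY's speech.
text = (
    "<p>DAWN: Hey Buffy. Oh, don't forget, today's trash day.<br>BUFFY: (sourly)\n"
    "Thanks.\n"
    "<p>Dawn piles her books into her school bag. Buffy opens a letter.\n"
)

# (?:...) keeps the lookbehind alternation out of the results, so findall
# returns (character, dialogue) pairs instead of 3-tuples.
pattern = r"(?:(?<=<p>)|(?<=<br>))([A-Z]+):([^<]+)"

# Collapse runs of whitespace (including the embedded newline) to one space.
lines = [(name, " ".join(speech.split())) for name, speech in re.findall(pattern, text)]
print(lines)
```

The stage direction in the last paragraph is skipped because "Dawn" is not followed by a colon after its upper-case run.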

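You could also try the special flag that folds line breaks into the semantics of the dot (re.DOTALL in Python). Personally, I prefer split or an HTML parser whenever I can use them. Regex escaping, plus all the parameters, limits and flags, can drive anyone crazy. There is also re.split.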

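dialogs = {}
# Treat <br> like <p> so that every speech starts its own chunk.
text = html_text.replace('<br>', '<p>')
paragraphs = text.split('<p>')
for p in paragraphs:
    if ":" in p:
        char, line = p.split(":", 1)
        # setdefault creates the list the first time a character appears,
        # so that character's first line is stored as well.
        dialogs.setdefault(char, []).append(line)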
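
For comparison, the flag that lets the dot match line breaks is re.DOTALL in Python. Below is a rough sketch of the pattern from the question adapted to it. The [A-Z]+ name, the lazy quantifier and the \Z alternative in the lookahead are my assumptions about what is needed, not something taken from the original pattern:

```python
import re

# Same fragment of the transcript as in the question, with the line break
# inside BUFFY's speech.
html_text = (
    "<p>DAWN: Hey Buffy. Oh, don't forget, today's trash day.<br>BUFFY: (sourly)\n"
    "Thanks.\n"
    "<p>Dawn piles her books into her school bag. Buffy opens a letter.\n"
)

# re.DOTALL lets "." match newlines; ".+?" is lazy, so each match stops at
# the nearest following <p> or <br> (or at the end of the text).
pattern = re.compile(
    r"(?:(?<=<p>)|(?<=<br>))(?P<character>[A-Z]+):(?P<dialogue>.+?)(?=<p>|<br>|\Z)",
    re.DOTALL,
)

script = [
    (m.group("character"), " ".join(m.group("dialogue").split()))
    for m in pattern.finditer(html_text)
]
print(script)
```

With a greedy .+ and re.DOTALL, the first match would run to the last <p> or <br> in the page, which is why the lazy form is used here.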
