Regex to parse a Buffy script using lookbehind

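I'm having trouble parsing this page: http://www.buffyworld.com/buffy/buffy/transcripts/114_tran.html

I'm trying to get each character name together with the associated dialogue. The text looks like this: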

<p>BUFFY: Wait!
<p>She stands there panting, watching the truck turn a corner.
<p>BUFFY: (whining) Don't you want your garbage?
<p>She sighs, pouts, turns and walks back toward the house.
<p>Cut to the kitchen. Buffy enters through the back door, holding a pile of
mail. She begins looking through it. We see Dawn standing by the island.
<p>DAWN: Hey Buffy. Oh, don't forget, today's trash day.<br>BUFFY: (sourly)
Thanks.
<p>Dawn piles her books into her school bag. Buffy opens a letter.
<p>Close shot of the letter.
<p>
<p>Dawn smiles, and she and Willow exit. Buffy picks up the still-wrapped
sandwich and stares at it.
<p>BUFFY: (to herself) Somebody should.
<p>She sighs, puts the sandwich back in the bag.
<p>Cut to the Bronze. Pan across various people drinking and dancing,
bartender serving. Reveal Xander and Anya sitting at the bar eating chips from
several bags. A notebook sits in front of them bearing the wedding seating
chart.
<p>ANYA: See ... this seating chart makes no sense. We have to do it again.
(Xander nodding) We can't do it again. You do it.<br>XANDER: The seating
chart's fine. Let's get back to the table arrangements. I'm starting to have
dreams of gardenia bouquets. (winces) I am so glad my manly coworkers didn't
just hear me say that. (eating chips)

Ideally, I'd match from a <p> or <br> to the next <p> or <br>. I tried to use a lookbehind and a lookahead for this:

reg = "((?<=<p>)|(?<=<br>))(?P<character>.+):(?P<dialogue>.+)((?=<p>)|(?=<br>))"
script = re.findall(reg, html_text)

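Unfortunately, this doesn't match anything. When I leave off the lookahead ((?=<p>)|(?=<br>)), I do match lines, but only as long as there is no newline inside the matched dialogue. The match seems to terminate at the newline instead of continuing.

For example, on this line "Thanks." is not matched:

DAWN: Hey Buffy. Oh, don't forget, today's trash day. BUFFY: (sourly) Thanks.

Thanks for any insight!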

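The dot does not match newlines by default. You can work around the dot by using a negated character class instead: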

re.findall('((?<=<p>)|(?<=<br>))([A-Z]+):([^<]+)', text)
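To see why the negated class helps, here is a minimal sketch run against a shortened copy of the excerpt from the question. `[^<]` matches any character except `<`, newlines included, so a speech that wraps onto the next line is captured whole. The variable names and the whitespace cleanup are my own additions for illustration, and the pattern would cut a speech short if the dialogue itself contained a tag.

```python
import re

# Shortened sample in the same shape as the transcript excerpt in the question.
text = (
    "<p>DAWN: Hey Buffy. Oh, don't forget, today's trash day.<br>BUFFY: (sourly)\n"
    "Thanks.\n"
    "<p>Dawn piles her books into her school bag. Buffy opens a letter.\n"
)

pattern = r"((?<=<p>)|(?<=<br>))([A-Z]+):([^<]+)"

# findall returns one tuple per match; the first item is the empty
# lookbehind group, so it is discarded here.
pairs = [
    (name, " ".join(speech.split()))  # collapse newlines and extra spaces
    for _, name, speech in re.findall(pattern, text)
]
print(pairs)
```

The stage direction in the last paragraph is skipped because it does not start with an all-caps name followed by a colon.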

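You could also try the special flag (re.DOTALL) that makes the dot match line breaks as well. Personally, I avoid regular expressions whenever I can use split or an HTML parser instead. With re, the escaping and all the arguments, limits and flags can drive anyone crazy. There is also re.split.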

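dialogs = {}
# Treat <br> like <p> so every speech starts its own chunk.
text = html_text.replace('<br>', '<p>')
paragraphs = text.split('<p>')
for p in paragraphs:
    if ":" in p:
        # Split on the first colon only; the dialogue may contain more.
        char, line = p.split(":", 1)
        if char in dialogs:
            dialogs[char].append(line)
        else:
            # Keep the first line too instead of starting with an empty list.
            dialogs[char] = [line]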
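For completeness, the flag route mentioned above might look like the following. This is only a sketch, not a tested solution for the whole page: it keeps the named groups from the question, makes the dialogue group lazy so it stops at the first boundary, and adds `\Z` so the final speech still matches when no tag follows it. The sample text and the `speeches` name are assumptions made for the example.

```python
import re

# Shortened sample in the same shape as the transcript excerpt in the question.
text = (
    "<p>BUFFY: Wait!\n"
    "<p>She stands there panting, watching the truck turn a corner.\n"
    "<p>DAWN: Hey Buffy. Oh, don't forget, today's trash day.<br>BUFFY: (sourly)\n"
    "Thanks.\n"
)

pattern = re.compile(
    r"(?:(?<=<p>)|(?<=<br>))"   # start right after <p> or <br>
    r"(?P<character>[A-Z]+):"   # speaker name in capitals
    r"(?P<dialogue>.+?)"        # lazy, so it stops at the first boundary
    r"(?=<p>|<br>|\Z)",         # next tag, or the end of the text
    re.DOTALL,                  # let '.' match newlines too
)

speeches = [
    (m["character"], " ".join(m["dialogue"].split()))  # tidy whitespace
    for m in pattern.finditer(text)
]
print(speeches)
```

If you stay with splitting, `re.split(r"<p>|<br>", html_text)` does the replace-then-split step in a single call.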
