import re
othello_full = open('C:/Users/.../Othello.txt', encoding="mbcs").read()
split_dialogue = othello_full.split("\n\n")
dict = {}
for i in split_dialogue:
    m = re.match(r'(BRABANTIO|GRATIANO|LODOVICO|OTHELLO|CASSIO|IAGO|MONTANO|RODERIGO|CLOWN|DESDEMONA|EMILIA|BIANCA).*(\.$|\?$|!$)', i)
    if bool(m) == True:
        dict[i.split(".", maxsplit = 1)[0]] = i.split(".", maxsplit = 1)[1]
    else:
        print('boo') #for purely diagnostic purpose
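I am trying to build a dictionary, using a loop to insert each character's name and their dialogue. I tested the regex and it works (at least on my limited sample). I tested each component separately and they all work. But they do not work inside the loop. Why? Also, is there a more elegant way than listing every character's name in the regular expression?
Download source: https://www.gutenberg.org/ebooks/1531
Sample of the input text: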
['\n*** START OF THE PROJECT GUTENBERG EBOOK OTHELLO, THE MOOR OF VENICE ***',
'cover ',
'',
'OTHELLO, THE MOOR OF VENICE',
'',
'by William Shakespeare',
'',
'Contents',
'ACT I\nScene I. Venice. A street.\nScene II. Venice. Another street.\nScene III. Venice. A council chamber.',
'\nACT II\nScene I. A seaport in Cyprus. A Platform.\nScene II. A street.\nScene III. A Hall in the Castle.',
'\nACT III\nScene I. Cyprus. Before the Castle.\nScene II. Cyprus. A Room in the Castle.\nScene III. Cyprus. The Garden of the Castle.\nScene IV. Cyprus. Before the Castle.',
'\nACT IV\nScene I. Cyprus. Before the Castle.\nScene II. Cyprus. A Room in the Castle.\nScene III. Cyprus. Another Room in the Castle.',
'\nACT V\nScene I. Cyprus. A Street.\nScene II. Cyprus. A Bedchamber in the castle.',
'',
'Dramatis Personæ',
'DUKE OF VENICE\nBRABANTIO, a Senator of Venice and Desdemona’s father\nOther Senators\nGRATIANO, Brother to Brabantio\nLODOVICO, Kinsman to Brabantio\nOTHELLO, a noble Moor in the service of Venice\nCASSIO, his Lieutenant\nIAGO, his Ancient\nMONTANO, Othello’s predecessor in the government of Cyprus\nRODERIGO, a Venetian Gentleman\nCLOWN, Servant to Othello',
'DESDEMONA, Daughter to Brabantio and Wife to Othello\nEMILIA, Wife to Iago\nBIANCA, Mistress to Cassio',
'Officers, Gentlemen, Messenger, Musicians, Herald, Sailor, Attendants,\n&c.',
'SCENE: The First Act in Venice; during the rest of the Play at a\nSeaport in Cyprus.',
'\nACT I',
'SCENE I. Venice. A street.',
' Enter Roderigo and Iago.',
'RODERIGO.\nTush, never tell me, I take it much unkindly\nThat thou, Iago, who hast had my purse,\nAs if the strings were thine, shouldst know of this.',
'IAGO.\n’Sblood, but you will not hear me.\nIf ever I did dream of such a matter,\nAbhor me.',
'RODERIGO.\nThou told’st me, thou didst hold him in thy hate.',
'IAGO.\nDespise me if I do not. Three great ones of the city,\nIn personal suit to make me his lieutenant,\nOff-capp’d to him; and by the faith of man,\nI know my price, I am worth no worse a place.\nBut he, as loving his own pride and purposes,\nEvades them, with a bombast circumstance,\nHorribly stuff’d with epithets of war:\nAnd in conclusion,\nNonsuits my mediators: for “Certes,” says he,\n“I have already chose my officer.”\nAnd what was he?\nForsooth, a great arithmetician,\nOne Michael Cassio, a Florentine,\nA fellow almost damn’d in a fair wife,\nThat never set a squadron in the field,\nNor the division of a battle knows\nMore than a spinster, unless the bookish theoric,\nWherein the toged consuls can propose\nAs masterly as he: mere prattle without practice\nIs all his soldiership. But he, sir, had the election,\nAnd I, of whom his eyes had seen the proof\nAt Rhodes, at Cyprus, and on other grounds,\nChristian and heathen, must be belee’d and calm’d\nBy debitor and creditor, this counter-caster,\nHe, in good time, must his lieutenant be,\nAnd I, God bless the mark, his Moorship’s ancient.',
'RODERIGO.\nBy heaven, I rather would have been his hangman.',
'IAGO.\nWhy, there’s no remedy. ’Tis the curse of service,\nPreferment goes by letter and affection,\nAnd not by old gradation, where each second\nStood heir to the first. Now sir, be judge yourself\nWhether I in any just term am affin’d\nTo love the Moor.',
'RODERIGO.\nI would not follow him, then.',
'IAGO.\nO, sir, content you.\nI follow him to serve my turn upon him:\nWe cannot all be masters, nor all masters\nCannot be truly follow’d. You shall mark\nMany a duteous and knee-crooking knave\nThat, doting on his own obsequious bondage,\nWears out his time, much like his master’s ass,\nFor nought but provender, and when he’s old, cashier’d.\nWhip me such honest knaves. Others there are\nWho, trimm’d in forms, and visages of duty,\nKeep yet their hearts attending on themselves,\nAnd throwing but shows of service on their lords,\nDo well thrive by them, and when they have lin’d their coats,\nDo themselves homage. These fellows have some soul,\nAnd such a one do I profess myself. For, sir,\nIt is as sure as you are Roderigo,\nWere I the Moor, I would not be Iago:\nIn following him, I follow but myself.\nHeaven is my judge, not I for love and duty,\nBut seeming so for my peculiar end.\nFor when my outward action doth demonstrate\nThe native act and figure of my heart\nIn complement extern, ’tis not long after\nBut I will wear my heart upon my sleeve\nFor daws to peck at: I am not what I am.'
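A likely reason the loop prints 'boo' for nearly every chunk (this is my reading of the question's code, not something the asker confirmed): by default `.` does not match a newline, and `$` only matches at the very end of the string. The pattern can therefore succeed on a one-line test string but fail on a chunk that spans several lines. A minimal sketch, using a shortened two-name version of the pattern and one chunk from the sample above:

```python
import re

# One chunk as produced by split("\n\n"): the name line plus one speech line.
chunk = "RODERIGO.\nI would not follow him, then."

# Shortened version of the question's pattern (two names only, for brevity).
pattern = r"(RODERIGO|IAGO).*(\.$|\?$|!$)"

# Without flags, '.' stops at the newline, so the final '.' is never reached.
no_flag = re.match(pattern, chunk)

# With re.DOTALL, '.' also matches '\n' and the match can reach the end.
with_flag = re.match(pattern, chunk, re.DOTALL)

print(no_flag)          # None
print(bool(with_flag))  # True
```

Passing `re.DOTALL` would make the original check pass, but the dictionary assignment in the question would still keep only one speech per character, because each new assignment replaces the previous value for that key.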
I have written a regular expression that parses the whole text correctly.
import re

text_re = re.compile(
    r"(?<=\n\n)"  # Always 2 newlines before the name.
    # The name consists of one or more capitalized words, followed by a dot.
    r"(?P<name>(?:[A-Z]+ ?)+)\.\n"
    # The dialogue consists of
    r"(?P<dialog>(?:"
    # one or more continuous lines.
    r"(?:[^ \n].+\n)+"
    # Sometimes, actions such as "Enter" or [Exit] are included.
    r"(?:\n(?: (?:\[|Enter ).+\n\n)+"
    # But they are always followed by lines that are not names (all caps).
    r"(?=\w[^A-Z]))?)"
    # This can repeat multiple times until the dialogue ends.
    r"+)")
The regex itself is complex, but the comments explain each part.
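On the question of avoiding a hard-coded list of names: the name part of the pattern relies only on the speaker line being all capitals followed by a dot and a newline. A small sketch of that sub-pattern in isolation (`name_re` and the test chunks are illustrative, not part of the code above):

```python
import re

# Name sub-pattern on its own: capitalized words, then a dot and a newline.
name_re = re.compile(r"(?P<name>(?:[A-Z]+ ?)+)\.\n")

chunks = [
    "IAGO.\nAbhor me.",                # speaker line
    "FIRST SENATOR.\nIllustrative.",   # multi-word speaker (illustrative text)
    "SCENE I. Venice. A street.",      # heading: the dot is not followed by a newline
    " Enter Roderigo and Iago.",       # stage direction: starts with a space
]

names = []
for chunk in chunks:
    m = name_re.match(chunk)
    names.append(m.group("name") if m else None)

print(names)  # ['IAGO', 'FIRST SENATOR', None, None]
```

The trade-off is that any all-caps line ending in a dot is treated as a speaker, so this depends on the formatting of this particular Gutenberg edition.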
You can use it like this:
with open("pg1531.txt", encoding="utf-8") as txtfile:
    text = txtfile.read()

for match in text_re.finditer(text):
    print("Name:", match.group("name"))
    print("Text:", match.group("dialog"))
    print()
    input()
Press <Enter> repeatedly and the next speech is printed each time.
You can then use it to map each speech to its character as you see fit:
import collections

dialogs = collections.defaultdict(list)
for match in text_re.finditer(text):
    dialogs[match.group("name")].append(match.group("dialog"))
And to extract Montano's first 10 speeches:
print(dialogs["MONTANO"][:10])
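The list-valued mapping matters because each character speaks many times. A small sketch of the difference between plain assignment and appending to a `defaultdict(list)`, using hand-written (name, speech) pairs as a stand-in for the regex matches:

```python
import collections

# Stand-in for the (name, dialog) pairs a parser would yield.
pairs = [
    ("RODERIGO", "By heaven, I rather would have been his hangman."),
    ("IAGO", "O, sir, content you."),
    ("RODERIGO", "I would not follow him, then."),
]

plain = {}
grouped = collections.defaultdict(list)
for name, speech in pairs:
    plain[name] = speech          # replaces any earlier speech for this name
    grouped[name].append(speech)  # keeps every speech, in order

print(plain["RODERIGO"])         # only the last speech survives
print(len(grouped["RODERIGO"]))  # 2
```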
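There are not many caveats. The regex is complex but, unlike simpler solutions, it keeps unwanted text (such as act numbers or stage directions between speeches) out of the dialogue. I did not strip the entrances and exits that occur in the middle of a speech, because they matter for understanding the dialogue, but you can easily remove them if you think it necessary.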
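If you do want to drop the mid-speech stage directions, one possible approach is sketched below. It assumes a stage direction is a line that starts with a single space followed by `[` or `Enter `, the same alternation the regex uses; `strip_stage_directions` and the sample string are hypothetical and were not tested against the full file.

```python
import re

# A stage direction is assumed to be a line starting with " [" or " Enter ".
stage_re = re.compile(r"^ (?:\[|Enter ).*\n?", re.MULTILINE)

def strip_stage_directions(dialog):
    """Remove stage-direction lines and the blank lines they leave behind."""
    without = stage_re.sub("", dialog)
    return re.sub(r"\n{2,}", "\n", without)

# Made-up speech with one embedded stage direction.
sample = (
    "First line of a speech.\n"
    "\n"
    " [_Exit Roderigo._]\n"
    "\n"
    "second part of the speech.\n"
)

cleaned = strip_stage_directions(sample)
print(repr(cleaned))  # 'First line of a speech.\nsecond part of the speech.\n'
```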