I'm a complete beginner! How do I convert a .txt file (movie script) into a table (characters and lines) in R or Python?



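I'm a complete beginner, and for a university project I need to analyze movie scripts. I want to create a table in which each character is matched with their lines. My files are all in .txt format, and I want to convert them to CSV files. I have a lot of scripts to process, so I'm looking for code that can be easily adapted to different files.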

This is what I have:

THREEPIO
Did you hear that?  They've shut 
down the main reactor.  We'll be 
destroyed for sure.  This is 
madness!

THREEPIO
We're doomed!

THREEPIO
There'll be no escape for the 
Princess this time.
THREEPIO
What's that?

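This is what I need:

"character" "dialogue"
"1" "THREEPIO" "Did you hear that? They've shut down the main reactor. We'll be destroyed for sure. This is madness!"
"2" "THREEPIO" "We're doomed!"
"3" "THREEPIO" "There'll be no escape for the Princess this time."
"4" "THREEPIO" "What's that?"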

This is what I have tried:

# Assumes sw (a character vector holding the script's lines), nlines (its
# length), and b15 / b30 (presumably strings of 15 and 30 spaces) are
# defined earlier; those definitions are not shown here.
# the first 70 lines don't contain dialogues
# so we can start reading at line 70 (for instance)
i = 70
# while loop to extract character and dialogues
# (probably there's a better way to parse the file instead of
# using my crazy nested if-then-elses, but this works for me)
while (i <= nlines)
{
  # if empty line
  if (sw[i] == "") i = i + 1  # next line
  # if text line
  if (sw[i] != "")
  {
    # if uninteresting stuff
    if (substr(sw[i], 1, 1) != " ") {
      i = i + 1   # next line
    } else {
      if (nchar(sw[i]) < 10) {
        i = i + 1  # next line
      } else {
        if (substr(sw[i], 1, 5) != " " && substr(sw[i], 6, 6) != " ") {
          i = i + 1  # next line
        } else {
          # if character name
          if (substr(sw[i], 1, 30) == b30)
          {
            if (substr(sw[i], 31, 31) != " ")
            {
              tmp_name = substr(sw[i], 31, nchar(sw[i], "bytes"))
              cat("\n", file="EpisodeVI_dialogues.txt", append=TRUE)
              cat(tmp_name, "", file="EpisodeVI_dialogues.txt", sep="\t", append=TRUE)
              i = i + 1
            } else {
              i = i + 1
            }
          } else {
            # if dialogue
            if (substr(sw[i], 1, 15) == b15)
            {
              if (substr(sw[i], 16, 16) != " ")
              {
                tmp_diag = substr(sw[i], 16, nchar(sw[i], "bytes"))
                cat("", tmp_diag, file="EpisodeVI_dialogues.txt", append=TRUE)
                i = i + 1
              } else {
                i = i + 1
              }
            }
          }
        }
      }
    }
  }
}
Any help would be much appreciated! Thank you!!

You can do it like this:

text = """
THREEPIO
Did you hear that?  They've shut 
down the main reactor.  We'll be 
destroyed for sure.  This is 
madness!

THREEPIO
We're doomed!

THREEPIO
There'll be no escape for the 
Princess this time.
THREEPIO
What's that?
"""
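# Split on any whitespace, so line breaks and double spaces are normalized.
clean = text.split()
n = 1
tmp = []
results = []
for element in clean:
    # An all-uppercase token is treated as a character name and starts a new row.
    if element.isupper():
        if tmp:
            results.append(tmp)
        tmp = [n, element]
        n += 1
        continue
    # Otherwise the token belongs to the current character's dialogue.
    try:
        tmp[2] = " ".join((tmp[2], element))
    except IndexError:
        tmp.append(element)
# Append the final row, which the loop never flushes on its own.
if tmp:
    results.append(tmp)
print(results)

Result:

[[1, 'THREEPIO', "Did you hear that? They've shut down the main reactor. We'll be destroyed for sure. This is madness!"], [2, 'THREEPIO', "We're doomed!"], [3, 'THREEPIO', "There'll be no escape for the Princess this time."], [4, 'THREEPIO', "What's that?"]]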
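The question asks for a CSV file rather than a printed list. As a minimal sketch, assuming rows shaped like the `results` list above, Python's standard `csv` module can write them out. The helper name `rows_to_csv` and the file name in the comment are hypothetical, chosen only for illustration.

```python
import csv
import io


def rows_to_csv(rows, out):
    """Write [index, character, dialogue] rows to a file-like object as CSV."""
    # QUOTE_ALL wraps every field in double quotes, similar to the layout
    # shown in the question.
    writer = csv.writer(out, quoting=csv.QUOTE_ALL)
    writer.writerow(["", "character", "dialogue"])
    writer.writerows(rows)


# Sample rows in the same shape as `results` above.
rows = [
    [1, "THREEPIO", "We're doomed!"],
    [2, "THREEPIO", "What's that?"],
]

# An in-memory buffer keeps the sketch self-contained; for a real file, use
# open("dialogues.csv", "w", newline="", encoding="utf-8") instead.
buffer = io.StringIO()
rows_to_csv(rows, buffer)
print(buffer.getvalue())
```

Passing `newline=""` when opening a real file is what the `csv` documentation recommends, so that the writer controls line endings itself.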

If you know the list of character names (and aren't worried about misspellings), something like this will work:

script = """
THREEPIO
Did you hear that?  They've shut 
down the main reactor.  We'll be 
destroyed for sure.  This is 
madness!

THREEPIO
We're doomed!

THREEPIO
There'll be no escape for the 
Princess this time.
THREEPIO
What's that?
"""
characters = ['THREEPIO', 'ANAKIN']
# Strip each line and drop the empty ones.
lines = [x for x in map(str.strip, script.split('\n')) if x]
results = []
for (i, item) in enumerate(lines):
    if item in characters:
        # Collect every following line until the next character name.
        dialogue = []
        for index in range(i + 1, len(lines)):
            if lines[index] in characters:
                break
            dialogue.append(lines[index])
        results.append([item, ' '.join(dialogue)])
print(list(enumerate(results, start=1)))

This prints:

[(1, ['THREEPIO', "Did you hear that?  They've shut down the main reactor.  We'll be destroyed for sure.  This is madness!"]), (2, ['THREEPIO', "We're doomed!"]), (3, ['THREEPIO', "There'll be no escape for the Princess this time."]), (4, ['THREEPIO', "What's that?"])]
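Since the question mentions many scripts, here is a rough sketch of how the idea could be wrapped in a reusable function and applied to every .txt file in a folder. It assumes a character name is any line written entirely in uppercase, which will also catch scene headings or shouted one-line dialogue, so a real script may need an explicit name list as above. The names `parse_script` and `convert_folder` and the folder layout are hypothetical.

```python
import csv
from pathlib import Path


def parse_script(text):
    """Return (character, dialogue) pairs; an all-uppercase line starts a new speaker."""
    rows = []
    name, speech = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.isupper():
            # Flush the previous speaker before starting a new one.
            if name is not None:
                rows.append((name, " ".join(speech)))
            name, speech = line, []
        elif name is not None:
            # Collapse repeated spaces inside the dialogue line.
            speech.append(" ".join(line.split()))
    if name is not None:
        rows.append((name, " ".join(speech)))
    return rows


def convert_folder(folder):
    """Write one CSV next to each .txt file in `folder` (assumed layout)."""
    for txt_path in Path(folder).glob("*.txt"):
        rows = parse_script(txt_path.read_text(encoding="utf-8"))
        csv_path = txt_path.with_suffix(".csv")
        with open(csv_path, "w", newline="", encoding="utf-8") as f:
            writer = csv.writer(f)
            writer.writerow(["character", "dialogue"])
            writer.writerows(rows)


sample = """THREEPIO
We're doomed!

THREEPIO
There'll be no escape for the
Princess this time.
THREEPIO
What's that?
"""
print(parse_script(sample))
```

Calling `convert_folder("scripts")` would then produce, for example, `scripts/episode6.csv` from `scripts/episode6.txt`, assuming such a folder exists. Scripts with leading front matter or indented layouts would likely need extra filtering before this heuristic is reliable.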
