Question: the append() method does not keep the original order of strings from a text file



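I have a function create_profanity_output() (see the full code below) in which every profanity found in a transcript file is appended to a list, followed by its timestamps and the censor characters. I want to keep the elements in the order in which they appear in the transcript.

My problem is that the elements are not appended in the same order as in the transcript. I thought append() adds an element to the end of a list (which would correspond to the original order). Although I do not use the sorted() function, the profanities appear to be ordered alphabetically.

More precisely, the current (unwanted) output looks like this: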

# Current output in wrong order.
[['fart', '00:00:03,950', '00:00:06,840', '****'],
['damn', '00:00:03,950', '00:00:06,840', '****'],
['damn', '00:00:03,950', '00:00:06,840', '****'],
['erotic', '00:00:03,950', '00:00:06,840', '****']]

But the words appear in the file in the order 1) fart, 2) erotic, 3) damn, 4) damn, so the desired output is:

# Target output in correct order.
[['fart', '00:00:03,950', '00:00:06,840', '****'],
['erotic', '00:00:03,950', '00:00:06,840', '****'],
['damn', '00:00:03,950', '00:00:06,840', '****'],
['damn', '00:00:03,950', '00:00:06,840', '****']]

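The problem also occurs when there are more profanities in the transcript. As soon as they share the same timestamp, they end up ordered alphabetically instead of keeping their original order. I tried sorting the list with: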

sorted_output = sorted(profanity_output, reverse=True)

sorted_output = sorted(profanity_output, reverse=False)

sorted_output = sorted(profanity_output, key=lambda x: x[0])

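and so on, but none of these gave me what I want.

I know this is a trivial problem, but the profanities must not be ordered alphabetically. Does anybody know why append() behaves like this, and how can I fix it?

Full code: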

def create_profanity_output():
    """Create a list 'profanity_output' which shall contain each profanity,
    its timestamp and the default censor characters ('****')."""
    profanity_output = []
    # Define censor characters that occur in the transcript.
    censor_chars = "****"
    # Create lists with transcript data.
    line_numbers = []
    timestamps = []
    text_lines = []
    # Get lines from the transcript that contain strings according to the
    # following pattern: 'line number', 'timestamp', 'text line', '' (empty
    # string).
    lines = transcript.splitlines()
    # Iterate over 'lines' to get each single element from it. Divide the
    # range object by 4 because of the 'lines' object's structure: (0: line
    # number, 1: timestamp, 2: text line, 3: empty string).
    for x in range(int(len(lines) / 4)):
        # Increment iterable by 4. The * sign allows to always move 4
        # elements further to the next "profanity cycle".
        x = x * 4
        # Add relevant elements to lists.
        line_numbers.append(lines[x])
        timestamps.append(lines[x + 1])
        text_lines.append(lines[x + 2])

    # Iterate over transcript data and create a zip object.
    for line_number, timestamp, text_line in zip(line_numbers, timestamps,
                                                 text_lines):
        # Create a list with timestamp strings: '00:00:03,950', '-->',
        # '00:00:06,840'.
        time_splits = timestamp.split()
        for swearword in wordlist.splitlines():
            # Iterate over tokenized text lines.
            for word in text_line.split():
                if word == swearword:
                    profanity_output.append([word, f"{time_splits[0]}",
                                             f"{time_splits[2]}",
                                             censor_chars])
    return profanity_output


# Call function.
profanity_output = create_profanity_output()
print(profanity_output)

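As Michael Butscher mentioned in the comments, your problem is that the for loops are nested in the wrong order. Currently, the order of your profanity word list determines the order in which words from the same text_line are added. Swapping the two loops gives you the correct order.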
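To make the effect of the loop order concrete, here is a minimal sketch. The sample text_line and wordlist are hypothetical and were chosen only to reproduce the ordering shown in the question; the two helper functions are illustrative names, not part of the original code.

```python
# Hypothetical sample data (assumption), picked to mirror the question's output.
wordlist = "fart\ndamn\nerotic"
text_line = "fart erotic damn damn"


def wordlist_outer(text_line, wordlist):
    """Original nesting: the word list drives the order of the matches."""
    found = []
    for swearword in wordlist.splitlines():
        for word in text_line.split():
            if word == swearword:
                found.append(word)
    return found


def text_outer(text_line, wordlist):
    """Swapped nesting: matches follow the order of the text line."""
    found = []
    for word in text_line.split():
        for swearword in wordlist.splitlines():
            if word == swearword:
                found.append(word)
    return found


print(wordlist_outer(text_line, wordlist))  # ['fart', 'damn', 'damn', 'erotic']
print(text_outer(text_line, wordlist))      # ['fart', 'erotic', 'damn', 'damn']
```

So append() itself always adds to the end of the list; the unexpected order comes from which loop runs on the outside.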

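However, a better solution is to parse your word list into a set up front. This still keeps the words in the order in which they appear in text_line, and it also speeds up the lookup (it is simply better practice, even if you do not need the speed).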

swearwords = set(wordlist.splitlines())
for word in text_line.split():
    if word in swearwords:
        ...
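Put together, the whole function could look roughly like the sketch below. It is only one possible arrangement: passing transcript and wordlist as parameters, stepping through the lines with range(0, len(lines) - 2, 4), and the sample data (including the second cue and its timestamps) are assumptions made for illustration, not part of the original code.

```python
def create_profanity_output(transcript, wordlist, censor_chars="****"):
    """Return [word, start, end, censor_chars] for each profanity, in the
    order in which the words occur in the transcript."""
    # Build the set once so each membership test is a fast hash lookup.
    swearwords = set(wordlist.splitlines())
    lines = transcript.splitlines()
    profanity_output = []
    # Assumed cue layout: number, timestamp, text line, empty line.
    for i in range(0, len(lines) - 2, 4):
        # e.g. ['00:00:03,950', '-->', '00:00:06,840']
        time_splits = lines[i + 1].split()
        # Iterate over the text line itself, so its word order is preserved.
        for word in lines[i + 2].split():
            if word in swearwords:
                profanity_output.append(
                    [word, time_splits[0], time_splits[2], censor_chars]
                )
    return profanity_output


# Hypothetical sample data (assumption).
transcript = (
    "1\n"
    "00:00:03,950 --> 00:00:06,840\n"
    "fart erotic damn damn\n"
    "\n"
    "2\n"
    "00:00:07,000 --> 00:00:09,500\n"
    "nothing to censor here\n"
)
wordlist = "fart\ndamn\nerotic"

for entry in create_profanity_output(transcript, wordlist):
    print(entry)
```

With this sample input the entries come out as fart, erotic, damn, damn, all carrying the timestamps of the first cue.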

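The output for a single line is actually ordered by the order of the profanities in wordlist: you pick one profanity first and then walk through the line to see whether it occurs. What you need to do instead is iterate over the line first. You can also use the list's __contains__ method (that is, the in operator) to check whether a word really is a profanity.

Like this: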

swearwords = wordlist.splitlines()
# Iterate over tokenized text lines.
for word in text_line.split():
    if word in swearwords:
        profanity_output.append([word, f"{time_splits[0]}",
                                 f"{time_splits[2]}",
                                 censor_chars])
