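Calling a function in a loop overwrites the DataFrame each time

I am converting multiple docx files into a single DataFrame. The code works for one document, and produces the following structure: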

import pandas as pd

data = {'Title': ['title first article', 'title second article'], 'Sources': ['source of first article', 'source of second article']}
df = pd.DataFrame(data=data)

This structure is the result of a function:

import os
import re

import mammoth

def func_convert_updates(filename):
    os.chdir('C:/Users/docxfiles')  # os.chdir returns None, so its result is not assigned
    regex = 'xc2xb7'  # not used in the lines shown here
    with open(filename, "rb") as docx_file:
        result = mammoth.convert_to_html(docx_file)
        text = result.value  # the generated HTML
    text2 = re.sub('[|•●]', " ", text)  # replace pipes and bullet characters with spaces
    with open('output.txt', 'w', encoding='utf-8') as text_file:
        text_file.write(text2)
    # followed by many lines of code, omitted here, to create a dataframe
    return df_titles

Next I want to process multiple docx files, so I wrote the following code:

list_news = ['docx_file_1', 'docx_file_2.docx', ... etc]
for element in list_news:
    df_titles = func_convert_updates(element)

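However, this only gives me the DataFrame for the last element of the list, because each iteration overwrites the previous output. How can I solve this? Thanks in advance.

If you want to collect the DataFrames from every loop iteration in the variable df_titles, you can do this: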

import pandas as pd
df_titles = pd.concat([func_convert_updates(element) for element in list_news], ignore_index=True)
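The same idea can be written as an explicit loop, which may be easier to extend. This is only a minimal sketch: `func_convert_updates` below is a hypothetical stand-in that builds a one-row DataFrame from the file name (the real conversion code is omitted in the question), and the file names are placeholders.

```python
import pandas as pd


def func_convert_updates(filename):
    # Hypothetical stand-in for the real converter from the question:
    # it returns a one-row DataFrame with the same two columns.
    return pd.DataFrame({
        'Title': [f'title from {filename}'],
        'Sources': [f'source from {filename}'],
    })


# Placeholder file names, assumed for illustration only.
list_news = ['docx_file_1.docx', 'docx_file_2.docx']

# Collect one DataFrame per file instead of rebinding a single variable.
frames = []
for element in list_news:
    frames.append(func_convert_updates(element))

# Stack all frames into one and renumber the rows from 0.
df_titles = pd.concat(frames, ignore_index=True)
print(df_titles)
```

Appending to a list and concatenating once at the end is usually cheaper than calling `pd.concat` inside the loop, since each concatenation copies the data.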

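The actual problem is that every time you call the function, you tell open to write to 'output.txt' with the 'w' mode, which overwrites the file if it already exists. You probably want to change that to 'a' so the text is appended to the file instead: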

with open('output.txt', 'a', ...
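To see the difference between the two modes, here is a small sketch that writes twice to the same file. The helper `write_output` and the sample strings are made up for illustration, and a temporary directory is used so nothing is left on disk.

```python
import os
import tempfile


def write_output(text, path, mode):
    # Hypothetical helper: write text to path using the given open() mode.
    with open(path, mode, encoding='utf-8') as text_file:
        text_file.write(text)


with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'output.txt')

    # 'w' truncates the file on every open, so only the last write survives.
    write_output('first document\n', path, 'w')
    write_output('second document\n', path, 'w')
    with open(path, encoding='utf-8') as f:
        after_w = f.read()

    # Start fresh with 'w', then use 'a' to add to the end of the file.
    write_output('first document\n', path, 'w')
    write_output('second document\n', path, 'a')
    with open(path, encoding='utf-8') as f:
        after_a = f.read()

print(repr(after_w))
print(repr(after_a))
```

Note that with 'a' the file keeps growing across runs of the script, so it may need to be deleted or truncated before a fresh run.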

See also https://cmdlinetips.com/2012/09/three-ways-to-write-text-to-a-file-in-python/
