Python, large files



I am trying to open a 100 GB file in Python: the Stack Exchange archive of Stack Overflow posts from 2008 to the present (2023) (https://archive.org/download/stackexchange, zip file stackoverflow.com-Posts.7z, containing Posts.xml). The problem is that the file is so large that my computer struggles with it; scanning all of it could take hours or days. I need to convert the XML file to CSV, but I am only interested in posts from 2022, which are probably less than 5 GB of the 100 GB. I have tried opening the file in many ways. I tried to "cut" the first 70% of the file, i.e. start reading after 70% of the content, since the newest data is at the bottom of the file, but I could not get it to work. How can I do this? I know for certain the first 70 GB are not needed, because that was the archive's size in 2019. ChatGPT could not help me.
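The "skip the first 70%" idea can be done with a byte-offset `seek()`. Below is a minimal sketch, assuming the dump is UTF-8 with one `<row .../>` element per line (which holds for the Stack Exchange dumps); the helper name `open_at_fraction` and the 0.70 figure are illustrative, and the path is a placeholder:

```python
import os

def open_at_fraction(path, fraction):
    """Open a file and skip to roughly `fraction` of its size,
    discarding the partial line at the seek point."""
    f = open(path, 'rb')  # binary mode so seek() works in byte offsets
    size = os.path.getsize(path)
    f.seek(int(size * fraction))
    f.readline()  # throw away the (probably partial) line we landed in
    return f

# usage sketch: start reading ~70% into the dump, then filter by date
# with open_at_fraction(r'C:\...\Posts.xml', 0.70) as f:
#     for raw in f:
#         line = raw.decode('utf-8', errors='replace')
#         if 'CreationDate="2022' in line:
#             ...  # parse and write the row
```

Seeking by bytes is why the file must be opened in binary mode; text-mode `seek()` only accepts offsets previously returned by `tell()`.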

This is how I analyzed everything at first:


'''
This script takes a raw xml file as input and outputs
an almost clean csv file
'''
import re, csv
from bs4 import BeautifulSoup

# file paths (raw strings so backslashes are not treated as escapes)
IN = r'C:\Users\.....\Desktop\StackOverflow archive\Posts.xml'  # raw xml file
OUT = r'C:\Users\.....\Desktop\StackOverflow archive\almostCleanPosts.csv'  # almost clean csv file

# regex patterns
TAG_PATTERN = re.compile('<|>')
NEW_LINE = re.compile(r'\n')
BODY_1 = re.compile(r'\n?<pre(.*?)><code>(.*?)</code></pre>\n?|<div class=".*<div class=".*>', flags=re.S)
BODY_2 = re.compile(r'\s?<code>\w*</code>.\s?|\s?<a href=.*">|</a>')
BODY_3 = re.compile(r'\s?<\w*>|</\w*>(.|\s)?')
BODY_4 = re.compile(r'&nbsp;|http(s?)://\S*|<img src.*">|<br/>*|br>|<br />*|<ol.*">|/p>|/strong>|/li>|&lt;key>|&lt;/key>|/code>|/a>|#?&\w\w\w?;?')

# method to clean instances using regex
def cleanup(dirty):
    clean = re.sub(NEW_LINE, ' ', dirty)
    clean = re.sub(BODY_1, ' ', clean)
    clean = re.sub(BODY_2, ' ', clean)
    clean = re.sub(BODY_3, ' ', clean)
    clean = re.sub(BODY_4, ' ', clean)
    return clean

open(OUT, 'w').close()  # erase file before writing
with open(IN, 'r', encoding='utf-8') as inFile, open(OUT, 'a', newline='', encoding='utf-8') as outFile:
    outFileWriter = csv.writer(outFile, delimiter=';')
    outFileWriter.writerow(['id', 'post', 'title', 'tags'])  # write header

    for post in inFile:
        soup = BeautifulSoup(post, "lxml")  # get xml tree
        line = soup.find("row")
        newRow = []  # new list to append as a row to the output file
        try:
            #date = line.attrs['creationdate']  # uncomment to get 2022
            #if date.startswith("2022"):
            typeId = line.attrs['posttypeid']
            newRow.append(line.attrs['id'])  # first field
            if typeId == '1':  # is a question
                newRow.append(cleanup(line.attrs['body']))  # second field
                newRow.append(line.attrs['title'])  # third field
                newRow.append(re.sub(TAG_PATTERN, ' ', line.attrs['tags']))  # fourth field
            elif typeId == '2':  # is an answer; title and tags attributes are missing
                newRow.append(cleanup(line.attrs['body']))  # second field
                # leave title and tag fields empty
                newRow.append('')  # third field
                newRow.append('')  # fourth field

            outFileWriter.writerow(newRow)  # append cleaned row to the file

        except (AttributeError, KeyError):
            pass

I want to get all of the 2022 posts in CSV format. Please help me.
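One way to avoid feeding each line through BeautifulSoup is to stream the XML with the standard library's `xml.etree.ElementTree.iterparse`, which visits one `<row>` element at a time and lets you free it immediately, so memory stays flat regardless of file size. A minimal sketch, assuming the dump's capitalized attribute names (`Id`, `CreationDate`, etc., as in the PowerShell answer below); the function name and paths are placeholders, and the question's `cleanup()` could be applied to the body before writing:

```python
import csv
import xml.etree.ElementTree as ET

def xml_rows_to_csv(xml_path, csv_path, year='2022'):
    """Stream Posts.xml and write id/body/title/tags for posts of `year`."""
    with open(csv_path, 'w', newline='', encoding='utf-8') as out:
        writer = csv.writer(out, delimiter=';')
        writer.writerow(['id', 'post', 'title', 'tags'])
        for event, elem in ET.iterparse(xml_path, events=('end',)):
            if elem.tag == 'row':
                if elem.get('CreationDate', '').startswith(year):
                    writer.writerow([
                        elem.get('Id'),
                        elem.get('Body', ''),
                        elem.get('Title', ''),
                        elem.get('Tags', ''),
                    ])
                elem.clear()  # free the element so memory stays flat
```

This still reads all 100 GB once, but it only parses, never loads, and writes out just the matching year.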

Here is a solution using PowerShell. The code below reads the first 100 rows:

using assembly System.Xml
using assembly System.Xml.Linq

$FILENAME = "c:\temp\posts.xml"
$reader = [System.Xml.XmlReader]::Create($FILENAME)
$lines = 0
Write-Host "START"
while ($reader.EOF -eq $False)
{
    $lines++
    if ($lines -eq 100) { break }
    if ($reader.Name -ne "row")
    {
        $reader.ReadToFollowing("row") | Out-Null
    }
    if ($reader.EOF -eq $False)
    {
        $element = [System.Xml.Linq.XElement]::ReadFrom($reader)
        Write-Host "ID = " $element.Attribute("Id").Value
        Write-Host "PostTypeId = " $element.Attribute("PostTypeId").Value
        Write-Host "ParentId = " $element.Attribute("ParentId").Value
        Write-Host "CreationDate = " $element.Attribute("CreationDate").Value
        Write-Host "Score = " $element.Attribute("Score").Value
        Write-Host "Body = " $element.Attribute("Body").Value
        Write-Host "OwnerUserId = " $element.Attribute("OwnerUserId").Value
        Write-Host "LastActivityDate = " $element.Attribute("LastActivityDate").Value
        Write-Host "CommentCount = " $element.Attribute("CommentCount").Value
        Write-Host "ContentLicense = " $element.Attribute("ContentLicense").Value
    }
}
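The same "peek at the first 100 rows" check can be done from Python with `iterparse`, stopping early so only the head of the file is ever read. A short sketch; the helper name `peek_rows` is illustrative:

```python
import itertools
import xml.etree.ElementTree as ET

def peek_rows(xml_path, n=100):
    """Return the attribute dicts of the first n <row> elements."""
    rows = (el.attrib for ev, el in ET.iterparse(xml_path) if el.tag == 'row')
    return list(itertools.islice(rows, n))
```

Because `iterparse` is lazy and `islice` stops after `n` items, this touches only the first few kilobytes of a 100 GB dump.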
