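The goal is to read all of the content from a Wikipedia dump (a 70 GB file). It cannot be loaded into memory, so I am trying to parse the file incrementally and extract some values from it. However, the script I just wrote prints nothing and immediately uses up all of my memory.
Here is the code: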
from lxml import etree

def fast_iter(context, func, *args, **kwargs):
    for event, elem in context:
        func(elem, *args, **kwargs)
        elem.clear()
        for ancestor in elem.xpath('ancestor-or-self::*'):
            while ancestor.getprevious() is not None:
                del ancestor.getparent()[0]
    del context

def process_element(elem):
    #print(elem)
    print(elem.xpath('./revision/text/text( )'))

context = etree.iterparse('enwiki-latest-pages-articles-multistream.xml', tag='page')
fast_iter(context, process_element)
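When this script is applied to a small XML file, it prints the values from the requested XPath.
But when it is applied to the full file, nothing happens.
Here are some sample lines from the Wikipedia dump: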
<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.mediawiki.org/xml/export-0.10/ http://www.mediawiki.org/xml/export-0.10.xsd" version="0.10" xml:lang="en">
<siteinfo>
<sitename>Wikipedia</sitename>
<dbname>enwiki</dbname>
<base>https://en.wikipedia.org/wiki/Main_Page</base>
<generator>MediaWiki 1.33.0-wmf.19</generator>
<case>first-letter</case>
<namespaces>
<namespace key="-2" case="first-letter">Media</namespace>
<namespace key="-1" case="first-letter">Special</namespace>
<namespace key="0" case="first-letter" />
<namespace key="1" case="first-letter">Talk</namespace>
<namespace key="2" case="first-letter">User</namespace>
<namespace key="3" case="first-letter">User talk</namespace>
<namespace key="4" case="first-letter">Wikipedia</namespace>
<namespace key="5" case="first-letter">Wikipedia talk</namespace>
<namespace key="6" case="first-letter">File</namespace>
<namespace key="7" case="first-letter">File talk</namespace>
<namespace key="8" case="first-letter">MediaWiki</namespace>
<namespace key="9" case="first-letter">MediaWiki talk</namespace>
<namespace key="10" case="first-letter">Template</namespace>
<namespace key="11" case="first-letter">Template talk</namespace>
<namespace key="12" case="first-letter">Help</namespace>
<namespace key="13" case="first-letter">Help talk</namespace>
<namespace key="14" case="first-letter">Category</namespace>
<namespace key="15" case="first-letter">Category talk</namespace>
<namespace key="100" case="first-letter">Portal</namespace>
<namespace key="101" case="first-letter">Portal talk</namespace>
<namespace key="108" case="first-letter">Book</namespace>
<namespace key="109" case="first-letter">Book talk</namespace>
<namespace key="118" case="first-letter">Draft</namespace>
<namespace key="119" case="first-letter">Draft talk</namespace>
<namespace key="446" case="first-letter">Education Program</namespace>
<namespace key="447" case="first-letter">Education Program talk</namespace>
<namespace key="710" case="first-letter">TimedText</namespace>
<namespace key="711" case="first-letter">TimedText talk</namespace>
<namespace key="828" case="first-letter">Module</namespace>
<namespace key="829" case="first-letter">Module talk</namespace>
<namespace key="2300" case="first-letter">Gadget</namespace>
<namespace key="2301" case="first-letter">Gadget talk</namespace>
<namespace key="2302" case="case-sensitive">Gadget definition</namespace>
<namespace key="2303" case="case-sensitive">Gadget definition talk</namespace>
</namespaces>
</siteinfo>
<page>
<title>AccessibleComputing</title>
<ns>0</ns>
<id>10</id>
<redirect title="Computer accessibility" />
<revision>
<id>854851586</id>
<parentid>834079434</parentid>
<timestamp>2018-08-14T06:47:24Z</timestamp>
<contributor>
<username>Godsy</username>
<id>23257138</id>
</contributor>
<comment>remove from category for seeking instructions on rcats</comment>
<model>wikitext</model>
<format>text/x-wiki</format>
<text xml:space="preserve">#REDIRECT [[Computer accessibility]]
{{R from move}}
{{R from CamelCase}}
{{R unprintworthy}}</text>
<sha1>42l0cvblwtb4nnupxm6wo000d27t6kf</sha1>
</revision>
</page>
<page>
<title>Anarchism</title>
<ns>0</ns>
<id>12</id>
<revision>
<id>885648527</id>
<parentid>885645378</parentid>
<timestamp>2019-03-01T11:16:23Z</timestamp>
<contributor>
<username>Jarnsax</username>
<id>33627956</id>
</contributor>
<comment>improve citation metadata</comment>
<model>wikitext</model>
<format>text/x-wiki</format>
<text xml:space="preserve">{{redirect2|Anarchist|Anarchists|the fictional character|Anarchist (comics)|other uses|Anarchists (disambiguation)}}
{{pp-move-indef}}
{{short description|Political philosophy that advocates self-governed societies}}
{{Use dmy dates|date=July 2018}}
{{use British English|date=January 2014}}
{{Anarchism sidebar}}
{{Basic forms of government}}
'''Anarchism''' is an [[anti-authoritarian]] [[political philosophy]]{{sfn|McLaughlin|2007|p=59}}{{sfn|Flint|2009|p=27}} that advocates [[Self-governance|self-governed]] societies based on voluntary, [[cooperative]] institutions and the rejection of coercive [[Hierarchy|hierarchies]] those societies view as unjust. These institutions are often described as [[Stateless society|stateless societies]],{{r|group=note|Note01}}{{sfn|Sheehan|2003|p=85}} although several authors have defined them more specifically as distinct institutions based on non-hierarchical or [[Free association (communism and anarchism)|free associations]].{{r|group=note|Note02}} Anarchism holds the [[State (polity)|state]] to be undesirable, unnecessary, and harmful.{{r|group=note|Note03}}<ref name=definition /> Any philosophy consistent with statelessness, that is, principled opposition to the State, is anarchist, thus anarchist schools of thought range from [[anarcho-communism]] to [[anarcho-capitalism]].{{sfn|Fiala|2018}}
While [[Anti-statism|opposition to the state]] is central,{{r|group=note|Note04}} many forms of anarchism specifically entail opposing authority or hierarchical organisation based on authority in the conduct of all human relations.{{r|group=note|Note05}} Anarchism is often considered a [[Far-left politics|far-left]] ideology,{{r|group=note|Note06}}{{sfn|Kahn|2000}}{{sfn|Moyihan|2007}} and much of [[anarchist economics]] and [[Anarchist law|anarchist legal philosophy]] reflect [[Libertarian socialism|anti-authoritarian interpretations]] of [[Anarcho-communism|communism]], [[Collectivist anarchism|collectivism]], [[Anarcho-syndicalism|syndicalism]], [[Mutualism (economic theory)|mutualism]], or [[participatory economics]].{{r|group=note|Note07}}
Anarchism does not offer a fixed body of doctrine from a single particular world view, instead fluxing and flowing as a philosophy.{{sfn|Marshall|2010|p=16}} Many types and traditions of anarchism exist, not all of which are mutually exclusive.{{sfn|Sylvan|2007|p=262}} [[Anarchist schools of thought]] can differ fundamentally, supporting anything from extreme [[individualism]] to complete [[collectivism]].{{sfn|McLean|McMillan|2003|loc= Anarchism}} Strains of anarchism have often been divided into the categories of [[Social anarchism|social]] and [[individualist anarchism]] or similar dual classifications.{{sfn|Ostergaard|p=14|loc=Anarchism}}{{sfn|Kropotkin|2002|p=5}}{{sfn|Fowler|1972}}
</text>
</revision>
</page>
</mediawiki>
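Has anyone done this before? Any idea how to parse this huge dump efficiently? Is there an existing package or library for it? I don't want to reinvent the wheel.
Question: Incrementally parse a large Wikipedia dump XML file
When this (the question's) script is applied to a small XML file, it prints the values from the requested XPath.
But when it is applied to the full file, nothing happens.
I am surprised that you get anything from the small file, because you don't use the namespace parameter. The Wikipedia XML file uses the following default namespace: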
<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/"
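To see why the namespace matters, here is a minimal sketch. It assumes the standard library's `xml.etree.ElementTree` in place of lxml (both report element tags in `{namespace-uri}name` form) and a made-up two-page `SAMPLE` document. Comparing against the bare name `page` finds nothing, while the namespace-qualified name finds both pages:

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical miniature dump that declares the same default namespace.
SAMPLE = (
    b'<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">'
    b"<page><title>A</title></page>"
    b"<page><title>B</title></page>"
    b"</mediawiki>"
)

# Collect the tag of every element as the parser finishes it.
tags = [elem.tag for _event, elem in ET.iterparse(io.BytesIO(SAMPLE), events=("end",))]

plain_hits = tags.count("page")
qualified_hits = tags.count("{http://www.mediawiki.org/xml/export-0.10/}page")
print(plain_hits, qualified_hits)
```

If the `tag='page'` filter in the question fails to match for the same reason, the loop body that clears elements would never run, which would explain both the missing output and the memory growth.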
This example uses lxml:
import io
from lxml import etree

class Wikipedia:
    def __init__(self, fh, tag):
        """
        Initialize 'iterparse' to only generate 'end' events on tag '<entity>'

        :param fh: File Handle from the XML File to parse
        :param tag: The tag to process
        """
        # Prepend the namespace wildcard {*} so the tag matches in any namespace.
        self.context = etree.iterparse(fh, events=("end",), tag=['{*}' + tag])

    def _parse(self):
        """
        Parse the XML File for all '<tag>...</tag>' Elements
        Clear/Delete the Element Tree after processing

        :return: Yield the current 'Event, Element Tree'
        """
        for event, elem in self.context:
            yield event, elem
            # Free the processed element and any already-handled siblings.
            elem.clear()
            while elem.getprevious() is not None:
                del elem.getparent()[0]

    def __iter__(self):
        """
        Iterate all '<tag>...</tag>' Element Trees yielded from self._parse()

        :return: Dict var 'entity' {tag1, value, tag2, value, ... ,tagn, value}
        """
        for event, elem in self._parse():
            entity = {}
            # Assign the 'elem.namespace' to the 'xpath'
            entity['revision'] = elem.xpath('./xmlns:revision/xmlns:text/text( )',
                                            namespaces={'xmlns': etree.QName(elem).namespace})
            yield entity

if __name__ == "__main__":
    XML = b"""<?xml version='1.0' encoding='UTF-8'?>
<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.mediawiki.org/xml/export-0.10/
http://www.mediawiki.org/xml/export-0.10.xsd"
version="0.10" xml:lang="en">
<siteinfo>
<sitename>Wikipedia</sitename>
<dbname>enwiki</dbname>
... (omitted for brevity)"""

    # with open('.\\FILE.XML', 'rb') as in_xml:
    with io.BytesIO(XML) as in_xml:
        for record in Wikipedia(in_xml, tag='page'):
            print("record:{}".format(record))
Output:
record:{'revision': ['#REDIRECT [[Computer accessi... (omitted for brevity)
record:{'revision': ["{{redirect2|Anarchist|Anarch... (omitted for brevity)
Tested with Python 3.5 and lxml.etree 3.7.1.
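For comparison, the same clear-as-you-go idea can be sketched with only the standard library. This is an illustration under stated assumptions, not the method used above: `xml.etree.ElementTree.iterparse` stands in for lxml, and `iter_pages`, `NS` and the shortened `SAMPLE` document are hypothetical names introduced here.

```python
import io
import xml.etree.ElementTree as ET

# Namespace of the dump in Clark notation (taken from the <mediawiki> root tag).
NS = "{http://www.mediawiki.org/xml/export-0.10/}"

def iter_pages(source):
    """Yield (title, text) for each <page>, discarding pages once consumed."""
    context = ET.iterparse(source, events=("start", "end"))
    _event, root = next(context)  # the first 'start' event is the root element
    for event, elem in context:
        if event == "end" and elem.tag == NS + "page":
            title = elem.findtext(NS + "title")
            text = elem.findtext(NS + "revision/" + NS + "text")
            yield title, text
            # Drop everything parsed so far so the tree does not keep growing.
            root.clear()

# Shortened stand-in for the real dump (assumption: same element layout).
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
<page><title>AccessibleComputing</title><revision><text>#REDIRECT [[Computer accessibility]]</text></revision></page>
<page><title>Anarchism</title><revision><text>{{Anarchism sidebar}}</text></revision></page>
</mediawiki>"""

for title, text in iter_pages(io.BytesIO(SAMPLE)):
    print(title, len(text))
```

For the real file, an open binary file handle would take the place of `io.BytesIO(SAMPLE)`.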
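Use SAX. See the example at https://www.tutorialspoint.com/python3/python_xml_processing.htm.
Simple API for XML (SAX): here you register callbacks for the events you are interested in and then let the parser proceed through the document. This is useful when your documents are large or you have memory limitations: the file is parsed as it is read from disk, and the entire file is never stored in memory.
SAX is a standard interface for event-driven XML parsing. Parsing XML with SAX generally requires you to create your own ContentHandler by subclassing xml.sax.ContentHandler.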
import xml.sax

class MovieHandler(xml.sax.ContentHandler):
    def __init__(self):
        self.CurrentData = ""
        self.type = ""
        self.format = ""
        self.year = ""
        self.rating = ""
        self.stars = ""
        self.description = ""

    # Call when an element starts
    def startElement(self, tag, attributes):
        self.CurrentData = tag
        if tag == "movie":
            print("*****Movie*****")
            title = attributes["title"]
            print("Title:", title)

    # Call when an element ends
    def endElement(self, tag):
        if self.CurrentData == "type":
            print("Type:", self.type)
        elif self.CurrentData == "format":
            print("Format:", self.format)
        elif self.CurrentData == "year":
            print("Year:", self.year)
        elif self.CurrentData == "rating":
            print("Rating:", self.rating)
        elif self.CurrentData == "stars":
            print("Stars:", self.stars)
        elif self.CurrentData == "description":
            print("Description:", self.description)
        self.CurrentData = ""

    # Call when a character is read
    def characters(self, content):
        if self.CurrentData == "type":
            self.type = content
        elif self.CurrentData == "format":
            self.format = content
        elif self.CurrentData == "year":
            self.year = content
        elif self.CurrentData == "rating":
            self.rating = content
        elif self.CurrentData == "stars":
            self.stars = content
        elif self.CurrentData == "description":
            self.description = content

if __name__ == "__main__":
    # create an XMLReader
    parser = xml.sax.make_parser()
    # turn off namespaces
    parser.setFeature(xml.sax.handler.feature_namespaces, 0)
    # override the default ContentHandler
    Handler = MovieHandler()
    parser.setContentHandler(Handler)
    # raw string, so that "\t" in the path is not read as a tab character
    parser.parse(r"c:\temp\movies.xml")
movies.xml
<collection shelf = "New Arrivals">
<movie title = "Enemy Behind">
<type>War, Thriller</type>
<format>DVD</format>
<year>2003</year>
<rating>PG</rating>
<stars>10</stars>
<description>Talk about a US-Japan war</description>
</movie>
<movie title = "Transformers">
<type>Anime, Science Fiction</type>
<format>DVD</format>
<year>1989</year>
<rating>R</rating>
<stars>8</stars>
<description>A scientific fiction</description>
</movie>
<movie title = "Trigun">
<type>Anime, Action</type>
<format>DVD</format>
<episodes>4</episodes>
<rating>PG</rating>
<stars>10</stars>
<description>Vash the Stampede!</description>
</movie>
<movie title = "Ishtar">
<type>Comedy</type>
<format>VHS</format>
<rating>PG</rating>
<stars>2</stars>
<description>Viewable boredom</description>
</movie>
</collection>
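Adapting the SAX approach to the dump from the question might look like the following minimal sketch. `PageHandler`, its `on_page` callback and the shortened `SAMPLE` document are hypothetical names introduced for illustration. Unlike the movie handler, it joins the pieces passed to `characters()`, because the parser may deliver one long text node in several chunks.

```python
import io
import xml.sax

class PageHandler(xml.sax.ContentHandler):
    """Hypothetical handler: reports (title, text) for every <page>."""

    def __init__(self, on_page):
        super().__init__()
        self.on_page = on_page  # callback invoked once per finished page
        self.chunks = None      # text pieces of the element being captured
        self.title = ""
        self.text = ""

    def startElement(self, tag, attributes):
        # Namespace processing is off by default, so tags arrive as plain names.
        if tag == "page":
            self.title = ""
            self.text = ""
        elif tag in ("title", "text"):
            self.chunks = []

    def characters(self, content):
        # May be called several times for one text node, so accumulate.
        if self.chunks is not None:
            self.chunks.append(content)

    def endElement(self, tag):
        if tag == "title":
            self.title = "".join(self.chunks)
            self.chunks = None
        elif tag == "text":
            self.text = "".join(self.chunks)
            self.chunks = None
        elif tag == "page":
            self.on_page(self.title, self.text)

# Shortened stand-in for the real dump (assumption: same element layout).
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
<page><title>AccessibleComputing</title><revision>
<text xml:space="preserve">#REDIRECT [[Computer accessibility]]</text>
</revision></page>
<page><title>Anarchism</title><revision>
<text xml:space="preserve">{{Anarchism sidebar}}</text>
</revision></page>
</mediawiki>"""

pages = []
handler = PageHandler(lambda title, text: pages.append((title, text)))
xml.sax.parse(io.BytesIO(SAMPLE), handler)
for title, text in pages:
    print(title, len(text))
```

Only the current title and text are held at any time, so memory use should stay roughly bounded by the largest single page. For the full dump, the callback would process each page directly instead of appending it to a list.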