Python XML parsing algorithm speed

I'm currently parsing large XML files of the following form in a python-flask web application on Heroku:

<book name="bookname">
  <volume n="1" name="volume1name">
    <chapter n="1">
      <li n="1">li 1 content</li>
      <li n="2">li 2 content</li>
    </chapter>
    <chapter n="2">
      <li n="1">li 1 content</li>
      <li n="2">li 2 content</li>
    </chapter>
  </volume>
  <volume n="2" name="volume2name">
    <chapter n="1">
      <li n="1">li 1 content</li>
      <li n="2">li 2 content</li>
    </chapter>
    <chapter n="2">
      <li n="1">li 1 content</li>
      <li n="2">li 2 content</li>
    </chapter>
  </volume>
</book>

The code I use to parse it, analyse it, and display it through Flask is as follows:

from flask import Flask, render_template, session
from lxml import etree

app = Flask(__name__)  # app setup is not shown in the question; assumed here

# recover=True lets lxml keep going past malformed markup
parser = etree.XMLParser(recover=True)
with open("books/filename.xml", "rb") as xml_file:
    tree = etree.parse(xml_file, parser)
root = tree.getroot()

def getChapter(volume, chapter):
    # Collect the text of every child of the chapter, whatever its tag is
    i = 0
    data = []
    while True:
        try:
            data.append(root[volumeList().index(volume)][chapter-1][i].text)
        except IndexError:
            break
        i += 1
    if data == []:
        data = None
    return data

def volumeList():
    # Names of all volumes, in document order
    data = tree.xpath('//volume/@name')
    return data

def chapterCount(volume):
    # Count chapters by fetching them one by one until one is missing
    currentChapter = 1
    count = 0
    while True:
        data = getChapter(volume, currentChapter)
        if data is None:
            break
        else:
            count += 1
            currentChapter += 1
    return count

def volumeNumerate():
    # Map 1-based positions to volume names
    names = volumeList()
    i = 1
    numbered = {}
    for element in names:
        numbered[i] = element
        i += 1
    return numbered

def render_default_values(template, **kwargs):
    chapter = getChapter(session['volume'], session['chapter'])
    count = chapterCount(session['volume'])
    return render_template(template, chapter=chapter, count=count, **kwargs)

@app.route('/<volume>/<int:chapter>')
def goto(volume, chapter):
    session['volume'] = volume
    session['chapter'] = chapter
    return render_default_values("index.html")

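The problem I'm running into is that whenever Flask tries to render a volume with many chapters (whenever chapterCount(session['volume']) is greater than about 50), the page takes a very long time to load and process. By contrast, if the app loads a volume with fewer than 10 to 15 chapters, loading is almost instant, even in the live web app. So, is there a good way to optimize this and improve speed and performance? Thanks a lot!

(PS: For reference, here is my old getChapter function, which I stopped using because I didn't want to reference a specific "li" in the code and wanted the code to work with any generic XML file. It was much faster than the current getChapter function, though!)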

def OLDgetChapter(volume, chapter):
    data = tree.xpath('//volume[@name="%s"]/chapter[@n=%d]/li/text()'%(volume,chapter))
    if data == []:
        data = None
    return data

Thanks a lot!

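Have you heard of BeautifulSoup?

BeautifulSoup takes care of the tedious work of parsing the XML for you. With the "xml" feature it hands the actual parsing to lxml, which is implemented in C.

This should be faster (and more readable):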

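from bs4 import BeautifulSoup

filename = "test.xml"
# The "xml" feature makes BeautifulSoup use lxml's XML parser
with open(filename) as xml_file:
    soup = BeautifulSoup(xml_file, "xml")

def chapterCount(volume_name):
    volume = soup.find("volume", attrs={"name": volume_name})
    # recursive=False: only look at the volume's direct children
    chapter_count = len(volume.find_all("chapter", recursive=False))
    return chapter_count

def getChapter(volume_name, chapter_number):
    volume = soup.find("volume", {"name": volume_name})
    # Attribute values are strings, so compare against str(chapter_number)
    chapter = volume.find("chapter", {"n": str(chapter_number)})
    # find_all(True, recursive=False) returns child elements only,
    # skipping the whitespace text nodes between them
    items = chapter.find_all(True, recursive=False)
    return "\n".join(item.get_text() for item in items)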

# from now on, it's the same as your original code
def render_default_values(template, **kwargs):
    chapter = getChapter(session['volume'],session['chapter'])
    count = chapterCount(session['volume'])
    return render_template(template, chapter=chapter, count=count, **kwargs)
@app.route('/<volume>/<int:chapter>')
def goto(volume, chapter):
    session['volume'] = volume
    session['chapter'] = chapter
    return render_default_values("index.html")

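Note that it's not just that getChapter gets faster. The main point is that chapterCount no longer has to fetch every chapter in turn to count the chapters of a volume. The two functions are now completely independent of each other.

Output of the two functions: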

>>> print(chapterCount("volume1name"))
2
>>> print(getChapter("volume1name", 2))
li 1 content
li 2 content

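Edit:

I just asked a question to see whether there is a faster way to count the chapters. Stay tuned :) Update: the answer is that you can pass recursive=False so that find_all only examines the volume's direct children instead of searching the whole subtree. Alternatively, use lxml directly.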
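As a rough illustration of the "use lxml directly" option, here is a minimal sketch that counts chapters without BeautifulSoup. The inline XML string and the name chapter_count are stand-ins chosen for this example (they are not part of the code above), and the sketch assumes the volumes are direct children of the root element, as in the question's sample.

```python
from lxml import etree

# Inline stand-in for the book file (assumption: same layout as the sample XML).
XML = b"""<book name="bookname">
  <volume n="1" name="volume1name">
    <chapter n="1"><li n="1">li 1 content</li></chapter>
    <chapter n="2"><li n="1">li 1 content</li></chapter>
  </volume>
  <volume n="2" name="volume2name">
    <chapter n="1"><li n="1">li 1 content</li></chapter>
  </volume>
</book>"""

root = etree.fromstring(XML)

def chapter_count(volume_name):
    """Count the direct <chapter> children of the named volume (0 if not found)."""
    for volume in root.iterchildren("volume"):
        if volume.get("name") == volume_name:
            # findall("chapter") matches direct children only.
            return len(volume.findall("chapter"))
    return 0

print(chapter_count("volume1name"))
```

Because the count comes from one pass over the volume's children, it no longer depends on fetching each chapter's content first.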

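Edit:

I just noticed that you call render_default_values in your view. You shouldn't do that, or at least you should give this function a different name. Because "render default values" means... well, rendering default values.

Letting this function render other things based on a global variable (session) is not considered very Pythonic, and it can lead to spaghetti code (hard-to-trace bugs and so on).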
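One way to read the advice above is to compute the template variables from explicit arguments instead of reading session inside the helper. The sketch below is only an illustration: BOOK is a hypothetical in-memory stand-in for the parsed XML, and get_chapter, chapter_count, and chapter_context are names made up for this example.

```python
# Hypothetical in-memory stand-in for the parsed book:
# volume name -> list of chapters, each chapter a list of item texts.
BOOK = {
    "volume1name": [["li 1 content", "li 2 content"], ["li 1 content"]],
}

def get_chapter(volume, chapter):
    """Return the item texts of a 1-based chapter, or None if it does not exist."""
    chapters = BOOK.get(volume, [])
    if 1 <= chapter <= len(chapters):
        return chapters[chapter - 1]
    return None

def chapter_count(volume):
    """Return how many chapters the volume has (0 for an unknown volume)."""
    return len(BOOK.get(volume, []))

def chapter_context(volume, chapter):
    """Build the template variables from the explicit arguments only."""
    return {"chapter": get_chapter(volume, chapter), "count": chapter_count(volume)}

# In the Flask view this could be used roughly as:
#     return render_template("index.html", **chapter_context(volume, chapter))
print(chapter_context("volume1name", 2))
```

With this shape the view passes the URL parameters straight through, and the helper can be tested without a request or a session.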

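If you are worried about speed, then instead of walking through all the volumes and chapters to find the right name and n attribute values, use a single XPath expression to get the result in one go (I just noticed this is exactly what your old approach did). But instead of asking for li, ask for any element with *: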

//volume[@name="%s"]/chapter[@n="%s"]/*/text()

where each %s is a placeholder for the volume and chapter values passed in.

def getChapter(volume, chapter):
    return root.xpath('//volume[@name="%s"]/chapter[@n="%s"]/*/text()' % (volume, chapter))

Demo:

>>> from lxml import etree
>>> 
>>> parser = etree.XMLParser(recover=True)
>>> tree = etree.parse(open("test.xml"), parser)
>>> root = tree.getroot()
>>> 
>>> volume = 'volume1name'
>>> chapter = 2
>>> 
>>> xpath = '//volume[@name="%s"]/chapter[@n="%s"]/*/text()' % (volume, chapter)
>>> root.xpath(xpath)
['li 1 content', 'li 2 content']
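One caveat with the %s interpolation above is that a volume name containing a double quote would produce an invalid XPath expression. A possible variant, sketched here under the assumption that lxml's XPath variables fit your case, binds the values as $name and $n through keyword arguments. The inline XML and the name get_chapter are stand-ins for this example only.

```python
from lxml import etree

# Inline stand-in document; the volume name deliberately contains double quotes.
XML = b"""<book name="bookname">
  <volume n="1" name='say "hi"'>
    <chapter n="1"><li n="1">first</li><p>second</p></chapter>
  </volume>
</book>"""

root = etree.fromstring(XML)

def get_chapter(volume, chapter):
    """Return the text of every child element of the chapter, or None if nothing matches."""
    texts = root.xpath(
        "//volume[@name=$name]/chapter[@n=$n]/*/text()",
        name=volume,       # bound to $name, no manual quoting needed
        n=str(chapter),    # attribute values are strings
    )
    return texts or None

print(get_chapter('say "hi"', 1))
```

The `*` step keeps the function independent of the child tag name, so both the li and the p element contribute their text here.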
