Reducing the RAM usage of a Python script



I wrote a quick little program to scrape book data from a UNESCO website that holds information about book translations. The code does what I want it to, but by the time it has processed about 20 countries it is using ~6 GB of RAM. Since I need to process around 200, that isn't going to work for me.

I'm not sure where all the RAM usage is coming from, so I'm not sure how to reduce it. I assume it's the dictionary holding all the book information, but I'm not positive. I'm not sure whether I should simply run the program once per country rather than processing a lot of them in one go, or whether there's a better way to do it.

This is the first time I've written anything like this, and I'm a fairly new, self-taught programmer, so please point out any significant flaws in the code, or any improvement tips you have, even if they aren't directly related to the question at hand.

Here's my code; thanks in advance for any assistance.

from __future__ import print_function
import urllib2, os
from bs4 import BeautifulSoup, SoupStrainer
''' Set list of countries and their code for niceness in explaining what
is actually going on as the program runs. '''
countries = {"AFG":"Afghanistan","ALA":"Aland Islands","DZA":"Algeria"}
'''List of country codes; since dictionaries aren't ordered, this
makes processing easier to resume if the run fails at some
point.'''
country_code_list = ["AFG","ALA","DZA"]
base_url = "http://www.unesco.org/xtrans/bsresult.aspx?lg=0&c="
destination_directory = "/Users/robbie/Test/"
only_restable = SoupStrainer(class_="restable")
class Book(object):
    def set_author(self,book): 
        '''Parse the webpage to find author names. Finds last name, then
        first name of original author(s) and sets the Book object's 
        Author attribute to the resulting string.'''
        authors = ""
        author_last_names = book.find_all('span',class_="sn_auth_name")
        author_first_names = book.find_all('span', attrs={
            'class':"sn_auth_first_name"})
        if not author_last_names:
            self.author = " "
            return
        for author in author_last_names:
            try: 
                # pop(0) keeps the first names in document order so each
                # one pairs with the matching last name.
                first_name = author_first_names.pop(0)
                authors = (authors + author.getText() + ', ' +
                           first_name.getText())
            except IndexError:
                authors = authors + (author.getText())
        self.author = authors
    def set_quality(self,book):
        ''' Check to see if book page is using Quality, then set it if 
        so.'''
        quality = book.find_all('span', class_="sn_auth_quality")
        if len(quality) == 0: self.quality = " "
        else: self.quality = quality[0].contents[0]
    def set_target_title(self,book): 
        target_title = book.find_all('span', class_="sn_target_title")
        if len(target_title) == 0: self.target_title = " "
        else: self.target_title = target_title[0].contents[0]
    def set_target_language(self,book): 
        target_language = book.find_all('span', class_="sn_target_lang")
        if len(target_language) == 0: self.target_language = " "
        else: self.target_language = target_language[0].contents[0]
    def set_translator_name(self,book) : 
        translators = ""
        translator_last_names = book.find_all('span', class_="sn_transl_name")
        translator_first_names = book.find_all('span', 
                                               class_="sn_transl_first_name")
        if translator_first_names == [] and translator_last_names == [] :
            self.translators = " "
            return None
        for translator in translator_last_names:
            try: 
                # pop(0) again, to keep the first names in document order.
                first_name = translator_first_names.pop(0)
                translators = (translators + translator.getText() + ',' +
                               first_name.getText())
            except IndexError:
                translators = translators + translator.getText()
        self.translators = translators  
    def set_published_city(self,book) : 
        published_city = book.find_all('span', class_="place")
        if len(published_city) == 0: 
            self.published_city = " "
        else: self.published_city = published_city[0].contents[0]
    def set_publisher(self,book) : 
        # NOTE: class "place" appears to be copy-pasted from
        # set_published_city; the publisher span presumably has its own
        # class on the site.
        publisher = book.find_all('span', class_="place")
        if len(publisher) == 0: 
            self.publisher = " "
        else: self.publisher = publisher[0].contents[0] 
    def set_published_country(self,book) : 
        published_country = book.find_all('span', 
                                        class_="sn_country")
        if len(published_country) == 0: 
            self.published_country = " "
        else: self.published_country = published_country[0].contents[0]
    def set_year(self,book) : 
        year = book.find_all('span', class_="sn_year")
        if len(year) == 0: 
            self.year = " "
        else: self.year = year[0].contents[0]   
    def set_pages(self,book) : 
        pages = book.find_all('span', class_="sn_pagination")
        if len(pages) == 0: 
            self.pages = " "
        else: self.pages = pages[0].contents[0] 
    def set_edition(self, book) :
        edition = book.find_all('span', class_="sn_editionstat")
        if len(edition) == 0: 
            self.edition = " "
        else: self.edition = edition[0].contents[0]
    def set_original_title(self,book) : 
        original_title = book.find_all('span', class_="sn_orig_title")
        if len(original_title) == 0: 
            self.original_title = " "
        else: self.original_title = original_title[0].contents[0]   
    def set_original_language(self,book) :
        languages = ''
        original_languages = book.find_all('span', 
                                         class_="sn_orig_lang")
        for language in original_languages:
            languages = languages + language.getText() + ', '
        self.original_languages = languages
    def export(self, country): 
        ''' Pull the text from the contents of the Book object's
        attributes and write them to the CSV file for the country in
        which the book was published.'''
        file_name = os.path.join(destination_directory, country + ".csv")
        with open(file_name, "a") as by_country_csv:        
            print(self.author.encode('UTF-8') + " & " + 
                  self.quality.encode('UTF-8') + " & " + 
                  self.target_title.encode('UTF-8') + " & " + 
                  self.target_language.encode('UTF-8') + " & " + 
                  self.translators.encode('UTF-8') + " & " + 
                  self.published_city.encode('UTF-8') + " & " + 
                  self.publisher.encode('UTF-8') + " & " + 
                  self.published_country.encode('UTF-8') + " & " + 
                  self.year.encode('UTF-8') + " & " + 
                  self.pages.encode('UTF-8') + " & " + 
                  self.edition.encode('UTF-8') + " & " + 
                  self.original_title.encode('UTF-8') + " & " + 
                  self.original_languages.encode('UTF-8'), file=by_country_csv)
    def __init__(self, book, country):
        ''' Initialize the Book object by feeding it the HTML for its 
        row'''
        self.set_author(book)
        self.set_quality(book)
        self.set_target_title(book)
        self.set_target_language(book)
        self.set_translator_name(book)
        self.set_published_city(book)
        self.set_publisher(book)
        self.set_published_country(book)
        self.set_year(book)
        self.set_pages(book)
        self.set_edition(book)
        self.set_original_title(book)
        self.set_original_language(book)

def get_all_pages(country,base_url):
    ''' Fetch the first page of results for the given ISO 3166-1 alpha-3
    country code and read the total result count from it, so the caller
    can step through the results ten at a time. Returns an int.'''
    base_page = urllib2.urlopen(base_url+country)
    page = BeautifulSoup(base_page, parse_only=only_restable)
    result_number = page.find_all('td',class_="res1",limit=1)
    if not result_number:
        return 0
    str_result_number = str(result_number[0].getText())
    results_total = int(str_result_number.split('/')[1])
    page.decompose()
    return results_total

def build_list(country_code_list, countries):
    '''  Build the list of all the books, and return a list of Book objects
    in case you want to do something with them in something else, ever.'''
    for country in country_code_list:
        print("Processing %s now..." % countries[country])
        results_total = get_all_pages(country, base_url)
        for url in range(0, results_total, 10):
            all_books = []
            target_page = urllib2.urlopen(base_url + country
                                          + "&fr=" + str(url))
            page = BeautifulSoup(target_page, parse_only=only_restable)
            books = page.find_all('td', class_="res2")
            for book in books:
                all_books.append(Book(book, country))
            page.decompose()
            for title in all_books:
                title.export(country)
    return
if __name__ == "__main__":
    build_list(country_code_list,countries)
    print("Completed.")

I'll just list a number of problems or possible improvements, in no particular order:

  1. Follow PEP 8.

    Right now you have a lot of variables and functions named in camelCase, like setAuthor. That's not the conventional style for Python; Python would typically name it set_author (and published_country rather than PublishedCountry, etc.). You can even rename some of the things you're calling: for one, BeautifulSoup supports findAll for compatibility, but find_all is the recommended spelling.

    Besides naming, PEP 8 specifies a few other things as well; for example, you would want to rewrite this:

    if len(resultNumber) == 0 : return 0
    

    as:

    if len(result_number) == 0:
        return 0
    

    or even, taking advantage of the fact that empty lists are falsy:

    if not result_number:
        return 0
    
  2. Pass a SoupStrainer to BeautifulSoup.

    The information you're looking for is probably in only one part of the document; you don't need to parse the whole thing into a tree. Pass a SoupStrainer as the parse_only argument to BeautifulSoup. This should reduce memory use by discarding the unnecessary parts early, as sketched below.
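
    The question's revised code already does this with its only_restable strainer; the pattern, roughly, is as follows (here html stands for whatever markup was fetched):

    only_restable = SoupStrainer(class_="restable")
    # Elements outside the strainer are discarded during parsing instead
    # of being built into the tree.
    page = BeautifulSoup(html, parse_only=only_restable)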

  3. decompose the soup when you're done with it.

    Python mostly uses reference counting, so removing all of the circular references (as decompose does) should let its primary garbage-collection mechanism, reference counting, free a lot of memory. Python also has a semi-traditional garbage collector to deal with cyclic references, but reference counting is much faster.
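
    For instance, mirroring the scraping loop in the question (target_page stands for an already-opened results page):

    page = BeautifulSoup(target_page, parse_only=only_restable)
    books = page.find_all('td', class_="res2")
    # ... copy the data you need out of `books` into plain strings ...
    page.decompose()  # break the tree's internal cycles so reference
                      # counting can reclaim the memory right away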

  4. Don't have Book.__init__ write things to disk.

    In most cases I wouldn't expect merely creating an instance of a class to write something to disk. Remove the call to export; let the user call export when they want it written to disk.
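
    For example, with the Book class from the question, construction and the write stay separate (book_html stands for one result row's HTML):

    book = Book(book_html, country)  # parses the row; no I/O happens here
    book.export(country)             # the caller decides when to write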

  5. Stop keeping so much data in memory.

    You're accumulating all of this data into a dictionary just so you can export it later. The obvious way to cut down on memory is to dump it to disk as soon as possible. Your comment indicates that you're putting it in a dictionary to stay flexible; but that doesn't mean you have to collect it all in a list: use a generator and yield items as you scrape them. The user can then iterate over them just as if they were a list:

    for book in scrape_books():
        book.export()
    

    ...but with the advantage that at most one book is kept in memory at a time.
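
    A sketch of what such a generator could look like, reusing the helpers from the question (and assuming Book is changed to remember its country, so that export() needs no argument as in the snippet above):

    def scrape_books():
        '''Yield one Book at a time instead of collecting them all in a
        list, so only the current page of results is held in memory.'''
        for country in country_code_list:
            results_total = get_all_pages(country, base_url)
            for offset in range(0, results_total, 10):
                target_page = urllib2.urlopen(base_url + country
                                              + "&fr=" + str(offset))
                page = BeautifulSoup(target_page, parse_only=only_restable)
                for row in page.find_all('td', class_="res2"):
                    yield Book(row, country)  # handed to the caller immediately
                page.decompose()  # free the parse tree before the next fetch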

  6. Use the os.path functions instead of munging paths yourself.

    Your code is rather fragile where path names are concerned. If I accidentally removed the trailing slash from destinationDirectory, something unintended would happen. Using os.path.join prevents that and deals with cross-platform differences:

    >>> os.path.join("/Users/robbie/Test/", "USA")
    '/Users/robbie/Test/USA'
    >>> os.path.join("/Users/robbie/Test", "USA")  # still works!
    '/Users/robbie/Test/USA'
    >>> # or say we were on Windows:
    >>> os.path.join(r"C:Documents and SettingsrobbieTest", "USA")
    'C:\Documents and Settings\robbie\Test\USA'
    
  7. Abbreviate attrs={"class": ...} to class_=...

    BeautifulSoup 4.1.2 introduced searching by class_, which removes the need for the verbose attrs={"class": ...}.
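
    For example, these two searches (the first taken from set_author in the question) are equivalent; the second is the shorthand:

    book.find_all('span', attrs={'class': 'sn_auth_first_name'})
    book.find_all('span', class_='sn_auth_first_name')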

I imagine there's more that could be changed, but that's quite a bit to start with.

Finally, what do you want the list of books for? You should export each book at the end of the "for url in range" block (inside it) and not use the allbooks dict. If you really do need a list, define exactly what information you need rather than keeping complete Book objects.
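
A sketch of how build_list could be restructured along those lines, exporting each book as soon as it is constructed so that no per-page list is kept at all:

    def build_list(country_code_list, countries):
        '''Scrape every country's results, writing each book straight out
        to that country's CSV file.'''
        for country in country_code_list:
            print("Processing %s now..." % countries[country])
            results_total = get_all_pages(country, base_url)
            for url in range(0, results_total, 10):
                target_page = urllib2.urlopen(base_url + country
                                              + "&fr=" + str(url))
                page = BeautifulSoup(target_page, parse_only=only_restable)
                for row in page.find_all('td', class_="res2"):
                    Book(row, country).export(country)  # written, then dropped
                page.decompose()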
