使用烧瓶使用的python对象减少记忆使用情况

我正在寻找一些有关如何减少python内存使用量的提示。我正在使用此代码作为主要结构来保存我的数据：

http://stevehanov.ca/blog/index.php?id=114

我需要使用烧瓶服务器来匹配接近单词。我需要放置超过200万不同的字符串(它将增加(。现在，我在尝试将约1400万人放入Trie时得到了MemoryError。

我只是添加了一个字典来保持一些快速访问的价值(我需要它，但可以将其视为一种外观ID，它与单词无直接相关(

  class TrieNode:
    values = {}
    def __init__(self):
        self.word = None
        self.children = {}
        global NodeCount
        NodeCount += 1
    def insert( self, word, value):
        node = self
        for letter in word:
            if letter not in node.children: 
                node.children[letter] = TrieNode()
            node = node.children[letter]
        TrieNode.values[word] = value
        node.word = word

我不熟悉Python优化，是否有任何方法可以使"字母"对象不那么大以保存一些内存？

请注意，我的困难来自以下事实：这封信不仅是[A-Z]，而且需要处理所有" Unicode范围"(如加重的字符，而不仅仅是(。顺便说一句，它是a 单个字符，因此从内存指纹中应该很轻。如何使用CODEPOINT代替字符串对象(会更有效地内存(？

编辑：添加其他一些信息后， @juanpa-arrivillaga的答复

所以，首先，我使用 slot 在我的计算机上，有或没有__slot__的构造没有区别。

使用__slot__：

>>> class TrieNode:
    NodeCount = 0
    __slots__ = "word", "children"
    def __init__(self):
    self.word = None
    self.children = {}
    #global NodeCount # my goal is to encapsulated the NodeCount in the class itself
    TrieNode.NodeCount += 1

>>> tn = TrieNode()
>>> sys.getsizeof(tn) + sys.getsizeof(tn.__dict__)
176

没有__slot__：

>>> class TrieNode:
    NodeCount = 0
    def __init__(self):
        self.word = None
        self.children = {}
        #global NodeCount
        TrieNode.NodeCount += 1

>>> tn = TrieNode()
>>> sys.getsizeof(tn) + sys.getsizeof(tn.__dict__)
176

所以我不明白，为什么。我在哪里错了？

这也是我使用"实习"关键字的其他方法，因为此值是处理" ID"的字符串(因此与Unicode无关，不喜欢字母(：

顺便说一句，我的目标是拥有价值和nodecount，这是类/静态变量的等效概念，以便每个人都会由小型创建的objets的所有实例共享，我认为它可以保留内存并避免重复，但是我对python中的"类似静态"概念的理解可能是错误的(

class TrieNode:
    values = {}    # shared amon all instances so only one structure?
    NodeCount = 0
    __slots__ = "word", "children"
    def __init__(self):
      self.word = None
      self.children = {}
      #global NodeCount
      TrieNode.NodeCount += 1
    def insert( self, word, value = None):
        # value is a string id like "XYZ999999999"
        node = self
        for letter in word:
            codepoint = ord(letter) 
            if codepoint not in node.children: 
                 node.children[codepoint] = TrieNode()
        node = node.children[codepoint]
        node.word = word
        if value is not None:
             lost = TrieNode.values.setdefault(word, [])
             TrieNode.values[word].append(intern(str(value)))

添加：最后，我应该表明我正在使用Python 2.7.x family。

我想知道是否有库中的Liby中有任何固定的LEN数据类型可以帮助我节省一些内存，再次是新的，我不知道在哪里看。顺便说一句，"单词"不是真实的"自然语言"，而是"字符的任意长度顺序"，它们也可能很长。

根据您的答复，我同意避免将单词存储在每个节点中是有效的，但是您需要查看链接的文章/代码。主要目标不是重建这个词，而是能够使用此词进行有效/非常快速的字符串匹配，然后获得与每个最接近匹配的"值"，我不确定我是否了解目标是什么沿着树的路径。(不到达完整的树吗？(，当匹配时，我们只需要匹配原始单词(但此时我的理解可能是错误的(。

所以我需要在某个地方有这个巨大的命令，我想把班级封装以使其方便。但是从内存的"重量"角度来看，它可能太昂贵了？

我还注意到，我已经比您的样本更少的内存使用量(我现在不知道为什么(，但是这是结构中包含的"字母"的示例值。

>>> s = u"u266f"
>>> ord(s)
9839
>>> sys.getsizeof(s)
28
>>> sys.getsizeof(ord(s))
12
>>> print s
♯
>>> repr(s)
"u'\u266f'"

低悬挂水果：在您的节点类中使用__slots__，否则，每个TrieNode对象都围绕dict。

class TrieNode:
    __slots__ = "word", "children"
    def __init__(self):
        self.word = None
        self.children = {}

现在，每个Trienode对象都不会携带属性dict。比较大小：

>>> class TrieNode:
...     def __init__(self):
...         self.word = None
...         self.children = {}
...
>>> tn = TrieNode()
>>> sys.getsizeof(tn) + sys.getsizeof(tn.__dict__)
168

vs：

>>> class TrieNode:
...     __slots__ = "word", "children"
...     def __init__(self):
...         self.is_word = False
...         self.children = {}
...
>>> sys.getsizeof(tn)
56
>>> tn.__dict__
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'TrieNode' object has no attribute '__dict__'

另一种优化，使用int对象。小int对象被缓存，可能大多数字符都在该范围内，但是即使不是，int虽然在Python中仍然耐磨，但甚至比单个字符字符串小：

>>> 'ñ'
'ñ'
>>> ord('ñ')
241
>>> sys.getsizeof('ñ')
74
>>> sys.getsizeof(ord('ñ'))
28

因此，您可以做类似：

的事情

def insert( self, word, value):
    node = self
    for letter in word:
        code_point = ord(letter)
        if code_point not in node.children: 
            node.children[code_point] = TrieNode()
        node = node.children[code_point]
    node.is_word = True #Don't save the word, simply a reference to a singleton

另外，您正在围绕一个巨大增长的类变量values dict，但该信息是多余的。你说：

我只是添加一个字典来保持一些价值，并可以快速访问(我需要它(

您可以从路径中重建单词。它应该相对较快，我会认真考虑不要使用此dict。查看仅容纳一百万单字符串：

的内存需要多少内存

>>> d = {str(i):i for i in range(1000000)}
>>> (sum(sizeof(k)+sizeof(v) for k,v in d.items()) + sizeof(d)) * 1e-9
0.12483203000000001

您可以做类似的事情：

class TrieNode:
    __slots__ = "value", "children"
    def __init__(self):
        self.value = None
        self.children = {}
    def insert( self, word, value):
        node = self
        for letter in word:
            code_point = ord(letter)
            if code_point not in node.children: 
                node.children[code_point] = TrieNode()
            node = node.children[code_point]
        node.value = value #this serves as a signal that it is a word

    def get(word, default=None):
        val = self._get_value(word)
        if val is None:
            return default
        else:
            return val
    def _get_value(self, word):
        node = self
        for letter in word:
            code_point = ord(letter)
            try:
                node = node.children[code_point]
            except KeyError:
                return None
        return node.value

相关内容

最新更新

热门标签：