Avoid creating duplicate class instances



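I am working with a very large dataset and looping over chunks of data to add elements to a class. My data contains many duplicate values, which means I create a class instance for the same data many times. From the tests I have run, creating the class instance appears to be the most expensive part of the operation, so I want to reduce it as much as possible.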

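My question is: what is the most time-efficient way to avoid creating duplicate class instances? Ideally, I would create each class instance only once and have all duplicates reference that same instance. I cannot remove the duplicates from my data at the outset, but I want to make sure I keep any expensive processing to a minimum.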

Here is a simple example that I hope illustrates my problem. The commented-out part shows my idea of how time could be saved.

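In this example, Person contains two methods that call sleep, to demonstrate the time cost of creating an instance. As written, the code runs in about 4.2 s ((SLEEP_1 * 6) + (SLEEP_2 * 6)). Since I have one person, James, who currently appears 3 times, I am looking for a way to create this person only once and then reference that instance for the 2 duplicates.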

I would expect the code to run in ~2.8 s ((SLEEP_1 * 4) + (SLEEP_2 * 4)), since there are only 4 distinct people.

import time
from collections import defaultdict
SLEEP_1 = 0.2
SLEEP_2 = 0.5
# A class `Person` has a load of methods, 
# meaning that creating an instance has a non-negligible time-cost over millions of calls. 
class Person:
    def __init__(self, info):
        self._id = info['_id']
        self.name = info['name']
        self.nationality = info['nationality']
        self.age = info['age']
        self.can_drink_in_USA = self.some_long_fun()
        self.can_fly_solo = self.another_costly_fun()
    def some_long_fun(self):
        time.sleep(SLEEP_1)
        if self.age >= 21:
            return True
        return False
    def another_costly_fun(self):
        time.sleep(SLEEP_2)
        if self.age >= 18:
            return True
        return False

# Some data to iterate over
# Note that "James" is present 3 times
teams = {
    "team1": [
        {"_id": "foo", "name": "James", "nationality": "French", "age": 32},
        {"_id": "bar", "name": "Frank", "nationality": "American", "age": 36},
        {"_id": "foo", "name": "James", "nationality": "French", "age": 32}
    ],
    "team2": [
        {"_id": "foo", "name": "James", "nationality": "French", "age": 32},
        {"_id": "baz", "name": "Oliver", "nationality": "British", "age": 26},
        {"_id": "qux", "name": "Josh", "nationality": "British", "age": 42}
    ]
}

seen = defaultdict(int)
team_directory = defaultdict(list)
start_time = time.time()
for team in teams:
    for i, person in enumerate(teams[team]):
        if person['_id'] in seen:
            print(f"{person['name']} [_id: {person['_id']}] already exists in Person class")
            # p = getattr(Person, '_id') == person['_id']
            # team_directory[team].append(p)
            # continue
        print(f"Person {i + 1} = {person['name']}")
        p = Person(info=person)
        team_directory[team].append(p)
        seen[person['_id']] += 1
finish_time = time.time() - start_time
expected_finish = round((SLEEP_1 * 6) + (SLEEP_2 * 6), 2)
print(f"Built a teams directory in {round(finish_time, 2)}s [expect: {expected_finish}s]")
# Loop over the results to check - I want each team to have 3 people
# (so I can't squash duplicates from the outset)
for t in team_directory:
    roster = " ".join([p.name for p in team_directory[t]])
    print(f"Team {t} contains these people: {roster}")

seen can be used as a cache that associates each person's _id with the Person object already created for it.

That would look like this (code up to and including the main for loop; the rest of the code does not need to change):

seen = {}
team_directory = defaultdict(list)
start_time = time.time()
for team in teams:
    for i, person in enumerate(teams[team]):
        if person['_id'] in seen:
            print(f"{person['name']} [_id: {person['_id']}] already exists in Person class")
            p = seen[person['_id']]
            team_directory[team].append(p)
            continue
        print(f"Person {i + 1} = {person['name']}")
        p = Person(info=person)
        team_directory[team].append(p)
        seen[person['_id']] = p

An assignment such as seen[person['_id']] = p copies only a reference to the object, not the object itself, so it does not need much memory.
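
As a quick check of that claim, here is a minimal sketch. The stripped-down `Person` and its `created` counter are hypothetical stand-ins, not the class from the question. It shows that every duplicate entry ends up pointing at the same object and that the constructor runs once per distinct `_id`:

```python
from collections import defaultdict


class Person:
    created = 0  # hypothetical counter, only to show how often __init__ runs

    def __init__(self, info):
        Person.created += 1
        self._id = info["_id"]
        self.name = info["name"]


teams = {
    "team1": [{"_id": "foo", "name": "James"}, {"_id": "bar", "name": "Frank"},
              {"_id": "foo", "name": "James"}],
    "team2": [{"_id": "foo", "name": "James"}, {"_id": "baz", "name": "Oliver"}],
}

seen = {}  # _id -> Person instance already built
team_directory = defaultdict(list)
for team, members in teams.items():
    for info in members:
        p = seen.get(info["_id"])
        if p is None:
            # First time this _id appears: build the instance and cache it.
            p = seen[info["_id"]] = Person(info)
        team_directory[team].append(p)

first, _, third = team_directory["team1"]
# `is` compares object identity, so this is True only if all three
# James entries are literally the same object.
same_object = first is third is team_directory["team2"][0]
print(Person.created, same_object)  # prints: 3 True
```

Each team list still has one entry per input record, so the rosters keep their duplicates while only three `Person` objects exist.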

"creating an instance has a non-negligible time-cost over millions of calls"

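Then don't call them. Your two examples are derived functions: they only use other attributes, so they can remain instance methods rather than being stored in instance fields. Also, you never use their results in code outside the constructor, so the calls can be removed from there and deferred to the code that actually needs them.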

Also, that example code only needs one function, and no sleeping:

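def age_check(age):
    # Build a predicate that compares the captured age with a threshold.
    def f(over):
        return age >= over
    return f

# With `p` being any Person instance:
age_check(p.age)(18)  # replaces can_fly_solo
age_check(p.age)(21)  # replaces can_drink_in_USA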

or, more simply, as a method on Person:

def age_check(self, over):
    return self.age >= over
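
To make the "defer it until needed" idea concrete, here is a minimal sketch under some assumptions. It uses `@property` for a cheap derived value and `functools.cached_property` (Python 3.8+) for a check that is assumed to be expensive. The choice of decorators and the `teen` record are mine, not part of the original answer:

```python
from functools import cached_property


class Person:
    def __init__(self, info):
        # The constructor only stores plain data, so it stays cheap.
        self._id = info["_id"]
        self.name = info["name"]
        self.age = info["age"]

    @property
    def can_fly_solo(self):
        # Cheap derived value: recomputed on every access, never stored.
        return self.age >= 18

    @cached_property
    def can_drink_in_USA(self):
        # Stand-in for a genuinely expensive check: runs on first access,
        # then the result is stored in the instance's __dict__.
        return self.age >= 21


james = Person({"_id": "foo", "name": "James", "age": 32})
teen = Person({"_id": "zed", "name": "Sam", "age": 19})  # hypothetical record
print(james.can_drink_in_USA, teen.can_fly_solo, teen.can_drink_in_USA)
# prints: True True False
```

One trade-off to keep in mind: a cached value is not refreshed if `age` changes later, so plain `@property` is the safer choice when the inputs can mutate.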

"I need to reference the instance where Person._id == person['_id'], and I'm not sure how to do that efficiently. In the end, I need to add this: team_directory[team].append(p)"

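Don't use a list and append. Use a dictionary that maps Person._id to the Person instance itself. That way, you don't need to iterate over a list repeatedly to see whether a person already exists.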
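
One possible reading of that advice, sketched with hypothetical names (`records`, `people`, `team_ids`) and a stripped-down `Person`: keep a single dictionary of instances keyed by `_id`, let each team store only the ids, and resolve ids to instances when a roster is needed.

```python
from collections import defaultdict


class Person:
    def __init__(self, info):
        self._id = info["_id"]
        self.name = info["name"]


# (team, raw record) pairs: a flattened stand-in for the question's data
records = [
    ("team1", {"_id": "foo", "name": "James"}),
    ("team1", {"_id": "bar", "name": "Frank"}),
    ("team2", {"_id": "foo", "name": "James"}),
]

people = {}                   # _id -> Person, one entry per distinct person
team_ids = defaultdict(list)  # team -> list of _ids, duplicates kept

for team, info in records:
    if info["_id"] not in people:  # hash lookup, no scan over existing people
        people[info["_id"]] = Person(info)
    team_ids[team].append(info["_id"])

# Resolve ids back to instances only when the roster is actually needed.
roster = {team: [people[i].name for i in ids] for team, ids in team_ids.items()}
print(roster)  # prints: {'team1': ['James', 'Frank'], 'team2': ['James']}
```

The membership test on `people` is a dictionary lookup, so its cost does not grow with the number of people already stored the way a scan through a list of instances would.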

Obviously, this all assumes that your dataset fits in memory.
