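I am working with a very large dataset and looping over chunks of data to add elements to a class. My data contains many duplicate values, which means I create a class instance for the same data multiple times. From some tests I have run, creating the class instance appears to be the most expensive part of the operation, so I want to minimize it as much as possible.

My question is: what is the most time-efficient way to avoid creating duplicate class instances? Ideally, I would create each class instance only once and have all duplicates refer to that same instance. I can't remove the duplicates from my data up front, but I want to make sure I minimize any expensive processing.

Here is a simple example that I hope illustrates my problem. The commented-out section shows my idea of how time could be saved. In this example, Person contains two methods that call sleep, to demonstrate the time cost of creating an instance. As written, the code runs in 4.22 seconds ((SLEEP_1 * 6) + (SLEEP_2 * 6)). Since I have a person called James who is present 3 times, I am looking for a way to add this person only once and then reference that instance for the 2 duplicates.

I would expect the code to run in ~2.8s ((SLEEP_1 * 4) + (SLEEP_2 * 4)).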

import time
from collections import defaultdict

SLEEP_1 = 0.2
SLEEP_2 = 0.5


# A class `Person` has a load of methods,
# meaning that creating an instance has a non-negligible time-cost over millions of calls.
class Person:
    def __init__(self, info):
        self._id = info['_id']
        self.name = info['name']
        self.nationality = info['nationality']
        self.age = info['age']
        self.can_drink_in_USA = self.some_long_fun()
        self.can_fly_solo = self.another_costly_fun()

    def some_long_fun(self):
        time.sleep(SLEEP_1)
        if self.age >= 21:
            return True
        return False

    def another_costly_fun(self):
        time.sleep(SLEEP_2)
        if self.age >= 18:
            return True
        return False

# Some data to iterate over
# Note that "James" is present 3 times
teams = {
    "team1": [
        {"_id": "foo", "name": "James", "nationality": "French", "age": 32},
        {"_id": "bar", "name": "Frank", "nationality": "American", "age": 36},
        {"_id": "foo", "name": "James", "nationality": "French", "age": 32}
    ],
    "team2": [
        {"_id": "foo", "name": "James", "nationality": "French", "age": 32},
        {"_id": "baz", "name": "Oliver", "nationality": "British", "age": 26},
        {"_id": "qux", "name": "Josh", "nationality": "British", "age": 42}
    ]
}

seen = defaultdict(int)
team_directory = defaultdict(list)

start_time = time.time()
for team in teams:
    for i, person in enumerate(teams[team]):
        if person['_id'] in seen:
            print(f"{person['name']} [_id: {person['_id']}] already exists in Person class")
            # p = getattr(Person, '_id') == person['_id']
            # team_directory[team].append(p)
            # continue
        print(f"Person {i + 1} = {person['name']}")
        p = Person(info=person)
        team_directory[team].append(p)
        seen[person['_id']] += 1

finish_time = time.time() - start_time
expected_finish = round((SLEEP_1 * 6) + (SLEEP_2 * 6), 2)
print(f"Built a teams directory in {round(finish_time, 2)}s [expect: {expected_finish}s]")

# Loop over the results to check - I want each team to have 3 people
# (so I can't squash duplicates from the outset)
for t in team_directory:
    roster = " ".join([p.name for p in team_directory[t]])
    # Use the loop variable `t` here; `team` would still hold the last key of the build loop.
    print(f"Team {t} contains these people: {roster}")

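seen can be used as a cache that maps each person's _id to the Person object already created for it.

This would look like the following (the code up to and including the main for loop; the rest of the code needs no change):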

seen = {}
team_directory = defaultdict(list)

start_time = time.time()
for team in teams:
    for i, person in enumerate(teams[team]):
        if person['_id'] in seen:
            print(f"{person['name']} [_id: {person['_id']}] already exists in Person class")
            p = seen[person['_id']]
            team_directory[team].append(p)
            continue
        print(f"Person {i + 1} = {person['name']}")
        p = Person(info=person)
        team_directory[team].append(p)
        seen[person['_id']] = p

An assignment such as seen[person['_id']] = p copies only a reference to the object, not the object itself, so it does not need much memory.
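
To make the point about references concrete, here is a minimal, self-contained sketch. The stripped-down Person with a created counter and the three-record list are assumptions made for illustration, not the question's full class or data.

```python
class Person:
    """Hypothetical stand-in for the question's Person; counts constructor calls."""
    created = 0

    def __init__(self, info):
        Person.created += 1
        self._id = info["_id"]
        self.name = info["name"]


# Illustrative records: "foo" appears twice.
records = [
    {"_id": "foo", "name": "James"},
    {"_id": "bar", "name": "Frank"},
    {"_id": "foo", "name": "James"},
]

seen = {}    # maps _id -> the Person instance already built for that _id
roster = []  # keeps one entry per record, duplicates included

for info in records:
    p = seen.get(info["_id"])
    if p is None:
        # First time this _id is encountered: pay the construction cost once.
        p = Person(info)
        seen[info["_id"]] = p
    roster.append(p)

print(Person.created)          # 2
print(roster[0] is roster[2])  # True
```

The roster still has three entries, but the first and third are the same object, so the constructor ran only twice.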
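
"creating an instance has a non-negligible time-cost over millions of calls"

Then don't call them. Your two examples are derived functions: they only use other attributes, so they can remain instance methods, and their results do not need to be stored in instance fields. Also, you never use them in code outside the constructor, so the calls can be removed from there and deferred to the code that actually needs them.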
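Likewise, the example code only needs one function, and no sleep:

def age_check(age):
    def f(over):
        return age >= over
    return f

# Called from inside a Person method:
age_check(self.age)(18)
age_check(self.age)(21)

or, more simply, as a method on Person:

def age_check(self, over):
    return self.age >= over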
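
One way to defer the derived values until they are needed is to expose them as properties. The sketch below is only an assumption about how that could look: the attribute names come from the question, while the choice of property for one value and functools.cached_property (Python 3.8+) for the other is mine, to contrast recomputing a cheap check with caching a result that is assumed to be expensive.

```python
from functools import cached_property


class Person:
    def __init__(self, info):
        # The constructor only stores plain fields; no derived work happens here.
        self._id = info["_id"]
        self.name = info["name"]
        self.nationality = info["nationality"]
        self.age = info["age"]

    def age_check(self, over):
        return self.age >= over

    @property
    def can_drink_in_USA(self):
        # Cheap check, recomputed on every access.
        return self.age_check(21)

    @cached_property
    def can_fly_solo(self):
        # Computed on first access, then stored in the instance __dict__.
        return self.age_check(18)


james = Person({"_id": "foo", "name": "James", "nationality": "French", "age": 32})
print(james.can_drink_in_USA)          # True
print(james.can_fly_solo)              # True
print("can_fly_solo" in vars(james))   # True
```

Note that a cached value is not refreshed if age changes later, so caching only suits values that stay fixed for the lifetime of the instance.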

"need to reference the instance where Person._id == person['_id'], and I'm not sure how to do this efficiently. Finally, I need to add this: team_directory[team].append(p)"

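Don't use a list and append. Use a dictionary that maps Person._id to the Person instance itself. That way, you don't need to iterate over a list repeatedly to check whether a person already exists.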

Obviously, this all assumes your dataset fits in memory.