如何有效地从单个Django模型中查询一对对象



我有一个Track模型,目前我通过模型id进行嵌套循环以获得配对,然后将其传递到一个函数中,以计算这两个不等价的Track对象之间的相似性

track_set = Track.objects.all()
track_ids = [track.id for track in track_set]
pointer_a = 0
pointer_b = 1
for pointer_a in range(len(track_ids) - 1):
for pointer_b in range(pointer_a + 1, len(track_ids)):
track_a = Track.objects.get(pk=track_ids[pointer_a])
track_b = Track.objects.get(pk=track_ids[pointer_b])
counter += 1
count_it_sim(track_a, track_b)

我认为我获取对象的方式不是很有效,有什么方法可以优化它吗?

编辑:这个count_it_sim将计算track_a和track_b之间的相似性值,我需要为track模型中的所有对计算它
models.py

class Tag(models.Model):
name = models.CharField(max_length=255, unique=True)
class Tagged(models.Model):
track = models.ForeignKey(Track, on_delete=models.CASCADE)
tag = models.ForeignKey(Tag, on_delete=models.CASCADE)
frequency = models.IntegerField(
default=0,
validators=[MinValueValidator(0)],
)
class Meta:
unique_together = ('track', 'tag')
class Track(models.Model):
track_id = models.CharField(max_length=24)
title = models.CharField(max_length=120)
link = models.URLField(max_length=120, blank=True)
tags = models.ManyToManyField(Tag, through='Tagged', blank=True)
similarity = models.ManyToManyField(
'self',
blank=True,
through='Similar',
related_name='similar_to',
symmetrical=False
)
users = models.ManyToManyField(User, through='PlayTrack', blank=True)
class Similar(models.Model):
track1 = models.ForeignKey(Track, on_delete=models.CASCADE, related_name='track1')
track2 = models.ForeignKey(Track, on_delete=models.CASCADE, related_name='track2')
similarity = models.FloatField(
validators=[MinValueValidator(0), MaxValueValidator(1)],
)

count_it_sim所要做的是,它将通过一个关联实体,即Tagged模型,获得tracka和trackb的所有标签频率,并对其进行计算,得到tracka和Trackb 之间的相似值

def count_it_sim(track_a: Track, track_b: Track):
tag_set = Tag.objects.all()
part1 = 0
part2 = 0
part3 = 0
for tag in tag_set:
try:
freq_tag_of_track_a = Tagged.objects.get(track=track_a, tag=tag).frequency
except Tagged.DoesNotExist:
continue
try:
freq_tag_of_track_b = Tagged.objects.get(track=track_b, tag=tag).frequency
except Tagged.DoesNotExist:
continue
part1 += freq_tag_of_track_a * freq_tag_of_track_b
part2 += freq_tag_of_track_a ** 2
part3 += freq_tag_of_track_b ** 2
try:
it_sim = part1 / (math.sqrt(part2) * math.sqrt(part3))
except ZeroDivisionError:
it_sim = None

编辑2:在count_it_sim上,我没有迭代Tag.objects.all()中的所有标签,而是只查询Tagged中存在的标签,结果比上一个快得多,这是我现在的代码

def count_it_sim(track_a: Track, track_b: Track):
filtered_tagged = Tagged.objects.filter(Q(track=track_a) | Q(track=track_b))
tag_ids = filtered_tagged.values_list('tag', flat=True).distinct()
part1 = 0
part2 = 0
part3 = 0
for tag_id in tag_ids:
try:
freq_tag_of_track_a = filtered_tagged.get(track=track_a, tag__id=tag_id).frequency
except Tagged.DoesNotExist:
freq_tag_of_track_a = 0
try:
freq_tag_of_track_b = filtered_tagged.get(track=track_b, tag__id=tag_id).frequency
except Tagged.DoesNotExist:
freq_tag_of_track_b = 0
part1 += freq_tag_of_track_a * freq_tag_of_track_b
part2 += freq_tag_of_track_a ** 2
part3 += freq_tag_of_track_b ** 2
try:
it_sim = part1 / (math.sqrt(part2) * math.sqrt(part3))
except ZeroDivisionError:
it_sim = None

编辑3:模型中有一些更改。取代存储track的每个tagfrequency,现在将通过计数有多少users用特定tag标记track来计算frequency。这是的更新

# models.py
...
class Tagged(models.Model):
user = models.ForeignKey(User, on_delete=models.CASCADE, default=1)
track = models.ForeignKey(Track, on_delete=models.CASCADE)
tag = models.ForeignKey(Tag, on_delete=models.CASCADE)
...

且count_it_sim函数变为

def count_it_sim(track_a: Track, track_b: Track):
filtered_tagged = Tagged.objects.filter(Q(track=track_a) | Q(track=track_b))
tag_ids = filtered_tagged.values_list('tag', flat=True).distinct()
part1 = 0
part2 = 0
part3 = 0
for tag_id in tag_ids:
try:
freq_tag_of_track_a = filtered_tagged.filter(track=track_a, tag__id=tag_id).count()  # UPDATED LINE
except Tagged.DoesNotExist:
freq_tag_of_track_a = 0
try:
freq_tag_of_track_b = filtered_tagged.filter(track=track_b, tag__id=tag_id).count()  #UPDATED LINE
except Tagged.DoesNotExist:
freq_tag_of_track_b = 0
part1 = accumulate(part1, freq_tag_of_track_a * freq_tag_of_track_b)
part2 = accumulate(part2, freq_tag_of_track_a ** 2)
part3 = accumulate(part3, freq_tag_of_track_b ** 2)
try:
it_sim = final_calc(part1, part2, part3)
except ZeroDivisionError:
it_sim = None
return it_sim

我认为获取对象的方式不是很有效,有什么方法可以优化它吗?

好吧,您已经有了对象(在track_set中(,所以您不需要再次获取它们;你只需要得到成对的物体。

我需要的是(1,2((1,3((1,4((2,3((2,4((3,4(

为此,您可以使用itertools.combinations:

import itertools
for a,b in itertools.combinations(track_set, 2):
count_it_sim(a, b)

您必须确保以正确的顺序从数据库中提取对象;因为无法保证物品将如何退回:

如果查询没有指定排序,则返回结果以未指定的顺序从数据库中删除。特定的订单是仅当通过一组唯一的字段排序时才保证识别结果中的每个对象。

在您的情况下,似乎需要按主键顺序排列它们;所以我会将初始查询修改为:

track_set = Track.objects.order_by('pk')

queryset文档中有关于order_by的详细信息,模型参考中有关于指定默认顺序的详细信息。

根据我们在评论中的讨论,我们的目标是使用最少数量的数据库查询来获得所需的数据,这种结构可以方便地进行进一步的数据操作。

我们将尝试在以下字典结构中获取与每个Track对应的所有Tagged.frequency数据:

{
track_id1: list(all_related_frequencies), 
track_id2: list(all_related_frequencies),
}

这个想法是获取所有的Track实例,并在一个名为prefetched_tagged的属性中预取它们相关的Tagged实例。

Track的每个实例都有一个属性tagged_set,允许反向访问与其相关的所有Tagged实例。这是预取它们的查询:

from django.db.models import Prefetch
the_data = (
Track.objects.prefetch_related(
Prefetch(
'tagged_set',
to_attr='prefetched_tagged',
)
).all()
)

现在,我们在the_data变量中保存了Track的所有实例,其中每个实例都有一个属性prefetched_tagged,该属性包含与每个Track实例相关的所有Tagged实例的list

通过以下字典理解,我们迭代the_data变量以创建一个以所有Track.track_id为键的字典。每个键都将有一个list作为其值,该值将包含其所有相关的Tagged.frequency

为了创建列表,我们将使用字典内的列表理解:

result = {
each_track.track_id: [
each_tagged.frequency for each_tagged in each_track.prefetched_tagged
] for each_track in the_data
}

现在,变量result包含我们需要的结构中的数据,以便进一步操作它们。为了将所有数据库数据加载到内存中,需要两次数据库命中。这是衡量数据库命中率的完整代码:

from django import db
from django.db.models import Prefetch
db.reset_queries()    
the_data = (
Track.objects.prefetch_related(
Prefetch(
'tagged_set',
to_attr='prefetched_tagged',
)
).all()
)
result = {
each_track.track_id: [
each_tagged.frequency for each_tagged in each_track.prefetched_tagged
] for each_track in the_data
}
print(result)
print ("Queries Used: {0}".format(len(db.connection.queries))

最新更新