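I have a list of movies and a list of tropes. To compute the similarity between two movies, I use cosine similarity. If all the weights are equal, it simplifies nicely: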
similarity =
(number of shared tropes between both movies)
/
(SQRT(number of tropes from movie 1) * SQRT(number of tropes from movie 2))
For example, if movie 1 has tropes 1, 3, and 4, and movie 2 has tropes 1, 4, 6, and 7, then they share two tropes, and the similarity is
2 / (SQRT(3) * SQRT(4)) = 2 / 3.46... = 0.58
My MySQL tables are pretty standard:
movies:
- id
- ...
tropes:
- id
- ...
movie_tropes:
- movie_id
- trope_id
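For reproducing the example, the join table could be sketched as follows. The column types and the composite primary key are assumptions (only the column names are given above), and the rows match the two example movies.

```sql
-- Assumed column types; the schema above only names the columns.
CREATE TABLE movie_tropes (
    movie_id INT NOT NULL,
    trope_id INT NOT NULL,
    PRIMARY KEY (movie_id, trope_id)  -- assumption: one row per (movie, trope) pair
);

-- Sample rows: movie 1 has tropes 1, 3, 4; movie 2 has tropes 1, 4, 6, 7.
INSERT INTO movie_tropes (movie_id, trope_id) VALUES
    (1, 1), (1, 3), (1, 4),
    (2, 1), (2, 4), (2, 6), (2, 7);
```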
I can easily count the number of tropes for a single movie:
SELECT count(distinct trope_id) from movie_tropes where movie_id = 1;
SELECT count(distinct trope_id) from movie_tropes where movie_id = 2;
I'm fairly new to SQL. Is there a simple join-y way to count the trope_ids in this join table that appear for both movie 1 and movie 2?
Is there a simple way to count the trope_ids that appear for both movie 1 and movie 2?
You can use a self-join:
select count(distinct t1.trope_id) -- qualify the column: trope_id exists in both t1 and t2
from movie_tropes t1
inner join movie_tropes t2 on t2.trope_id = t1.trope_id
where t1.movie_id = 1 and t2.movie_id = 2
More generally, you can compute all three cardinalities at once with two levels of aggregation. I would suggest:
select
sum(has_1) as cnt_1, -- count of distinct tropes for movie 1
sum(has_2) as cnt_2, -- count of distinct tropes for movie 2
sum(has_1 and has_2) as cnt_both -- count of distinct tropes shared by both movies
from (
select max(movie_id = 1) as has_1, max(movie_id = 2) as has_2 -- per trope: 1 if that movie has it, else 0
from movie_tropes t
where movie_id in (1, 2)
group by trope_id
) t
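The three counts can then feed the similarity directly in one statement. This is only a sketch: it uses the standard cosine definition for binary vectors, shared / (SQRT(n1) * SQRT(n2)), and the aliases per_trope and counts as well as the NULLIF guard are my own additions rather than anything from the question.

```sql
SELECT
    cnt_1,
    cnt_2,
    cnt_both,
    -- NULLIF turns a zero denominator into NULL instead of a division by zero
    cnt_both / NULLIF(SQRT(cnt_1) * SQRT(cnt_2), 0) AS similarity
FROM (
    -- Outer aggregation: add up the per-trope flags into three counts
    SELECT
        SUM(has_1) AS cnt_1,
        SUM(has_2) AS cnt_2,
        SUM(has_1 AND has_2) AS cnt_both
    FROM (
        -- Inner aggregation: one row per trope, flagging which movie(s) have it
        SELECT MAX(movie_id = 1) AS has_1, MAX(movie_id = 2) AS has_2
        FROM movie_tropes
        WHERE movie_id IN (1, 2)
        GROUP BY trope_id
    ) per_trope
) counts;
```

With movie 1 holding tropes 1, 3, 4 and movie 2 holding tropes 1, 4, 6, 7, this should return cnt_1 = 3, cnt_2 = 4, cnt_both = 2, and a similarity of about 0.577. If neither movie has any rows, every column comes back as NULL.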