I currently have a table with about 12 million rows of reviews, with the following columns:
id
productid
title
price
userid
profilename
helpfulness
score
review_time
summary
text
My query is as follows:
SELECT title, productid as p, count(text) as positive,
(SELECT count(*) FROM `reviews` WHERE productid = p) as total
FROM `reviews`
WHERE text like '%my favorite book%'
GROUP BY productid
ORDER BY positive DESC;
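It basically finds all products whose review text contains "my favorite book", gets the number of matching reviews for each product ID, and then gets the total number of reviews for each product.
I have this table in an RDS database on AWS, with the instance class set to the fastest one, r3.8xlarge, yet the query still takes days to run.
Now, what is even stranger, at least to me, is that if I change the search text to the following: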
SELECT title, productid as p, count(text) as positive,
(SELECT count(*) FROM `reviews` WHERE productid = p) as total
FROM `reviews`
WHERE text like '%tim ferriss%' or
text like '%timothy ferriss%' or
text like '%four hour workweek%' or
text like '%4-hour workweek%' or
text like '%four hour body%' or
text like '%4-hour body%' or
text like '%4 hour workweek%' or
text like '%4 hour body%' or
text like '%four hour chef%' or
text like '%4-hour chef%' or
text like '%4 hour chef%'
GROUP BY productid
ORDER BY positive DESC
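and even downgrade the database class to m3.2xlarge, the query takes less than 20 minutes.
Am I missing something? Any suggestions would be helpful, thanks.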
I think your query is easier to write using conditional aggregation:
SELECT title, productid as p,
sum(text like '%my favorite book%') as positive,
count(*) as total
FROM `reviews`
GROUP BY productid
ORDER BY positive DESC;
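The same pattern should carry over to the multi-phrase search from the question. This is only a sketch, assuming MySQL (where a boolean expression evaluates to 1 or 0 inside SUM) and reusing the exact phrases listed in the question:

```sql
-- Conditional aggregation for the multi-phrase search:
-- one pass over `reviews`, no correlated subquery per product.
SELECT title, productid AS p,
       SUM(text LIKE '%tim ferriss%'
           OR text LIKE '%timothy ferriss%'
           OR text LIKE '%four hour workweek%'
           OR text LIKE '%4-hour workweek%'
           OR text LIKE '%4 hour workweek%'
           OR text LIKE '%four hour body%'
           OR text LIKE '%4-hour body%'
           OR text LIKE '%4 hour body%'
           OR text LIKE '%four hour chef%'
           OR text LIKE '%4-hour chef%'
           OR text LIKE '%4 hour chef%') AS positive,  -- reviews matching any phrase
       COUNT(*) AS total                               -- all reviews for the product
FROM `reviews`
GROUP BY productid
ORDER BY positive DESC;
```

Because the match is counted inside the aggregate rather than in a WHERE clause, the total comes from the same scan and the per-product subquery is no longer needed.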
Your original query filters out products that have no positive reviews. If you really want that behavior, you can add:
HAVING positive > 0
after the GROUP BY clause.
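Put together, the full statement might look like the sketch below. It assumes MySQL, which allows a column alias such as positive to be referenced in HAVING and treats the result of LIKE as 1 or 0 inside SUM:

```sql
-- Conditional aggregation with the filter restored:
-- only products with at least one matching review are returned.
SELECT title, productid AS p,
       SUM(text LIKE '%my favorite book%') AS positive,  -- matching reviews
       COUNT(*) AS total                                 -- all reviews for the product
FROM `reviews`
GROUP BY productid
HAVING positive > 0      -- placed after GROUP BY, before ORDER BY
ORDER BY positive DESC;
```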