如果记录总数为x;像";查询为y,则计数为"0";不像";查询应该是x-y
我得到了";x〃;作为"0"的计数;不像";查询如下:
SELECT COUNT(DISTINCT(b.word))
FROM "hunspell"."oscar2_sorted" AS b
总计数:9597651
SELECT COUNT(distinct(b.word))
FROM "hunspell"."oscar2_sorted" as b
INNER JOIN invalidswar AS a
ON b.word LIKE (CONCAT('%', a.word,'%'))
点赞次数:73116
SELECT COUNT(distinct(b.word))
FROM "hunspell"."oscar2_sorted" AS b
INNER JOIN invalidswar AS a
ON b.word NOT LIKE (CONCAT('%', a.word,'%'))
不相似计数:9597651
预期:9524535
我不确定我遗漏了什么。
更新:
左联接计数接近预期值,但看起来仍然不正确。
SELECT COUNT(DISTINCT(b.word))
FROM "hunspell"."oscar2_sorted" AS b
LEFT JOIN (SELECT DISTINCT(b.word) AS dword
FROM "hunspell"."oscar2_sorted" AS b
INNER JOIN invalidswar AS a
ON b.word LIKE (CONCAT('%', a.word,'%'))) AS d
ON d.dword = b.word
WHERE d.dword IS NULL
左联接计数:9536539
更新2:
12004的差异可以追溯到类似和regexp_like的执行方式的差异。
SELECT count(distinct(b.word))
FROM "hunspell"."oscar2_sorted" as b
INNER JOIN invalidswar AS a
ON regexp_like(b.word, a.word)
类似regex的计数:61112
SELECT COUNT(word)
FROM (SELECT word
FROM "hunspell"."oscar2_sorted"
EXCEPT DISTINCT
(SELECT b.word
FROM "hunspell"."oscar2_sorted" as b
INNER JOIN invalidswar AS a
ON regexp_like(b.word, a.word)))
请参阅:EXCEPT子句和regexp_like
WITH
invalid_check AS
(
SELECT
o.word,
CASE WHEN
EXISTS (
SELECT *
FROM invalidswar AS i
WHERE o.word LIKE CONCAT('%', i.word,'%')
)
THEN
1
ELSE
0
END
AS is_invalid
FROM
"hunspell"."oscar2_sorted" AS o
GROUP BY
o.word
)
SELECT
COUNT(*) AS all_words,
SUM(is_invalid) AS invalid_words,
SUM(1-is_invalid) AS valid_words
FROM
valid_check
假设以上内容符合您的预期,那么只计算有效单词就可以…
SELECT
COUNT(DISTINCT o.word)
FROM
"hunspell"."oscar2_sorted" AS o
WHERE
NOT EXISTS (
SELECT *
FROM invalidswar AS i
WHERE o.word LIKE CONCAT('%', i.word,'%')
)