How to merge rows with similar column values by Levenshtein distance



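I'm using AWS Athena, and I'm trying to merge all rows whose values in a particular column are within a levenshtein_distance of less than 5 of each other, and sum their normalized percentages.

The table has the following structure: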

CREATE EXTERNAL TABLE `actions`(
`id` string COMMENT 'from deserializer', 
`text` string COMMENT 'from deserializer',
`normalizedpercentage` float COMMENT 'from deserializer', 
`timestamp` timestamp COMMENT 'from deserializer')
ROW FORMAT SERDE 
'org.openx.data.jsonserde.JsonSerDe' 
STORED AS INPUTFORMAT 
'org.apache.hadoop.mapred.TextInputFormat' 
OUTPUTFORMAT 
'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
LOCATION
's3://xxxxxx/db/actions'
TBLPROPERTIES (
'has_encrypted_data'='false', 
'transient_lastDdlTime'='1566991410')

This is what I'd like to do:

WITH t AS 
(SELECT id,
text,
normalizedPercentage
FROM actions
WHERE actions.timestamp
BETWEEN timestamp '2019-08-01 00:00:01'
AND timestamp '2019-08-31 23:59:59' )
SELECT *,
SUM(normalizedPercentage)
OVER (PARTITION BY levenshtein_distance(text, EVERY_OTHER_TEXT_COLUMN) < 5) AS cumulative
FROM t

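Unfortunately, the PARTITION BY clause only accepts column names.

I was thinking of defining a function and using it to iterate over all the rows, but that doesn't seem to be possible in Presto.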

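You can compute a new column from the function inside the temporary table (the CTE), then partition by that column in the main query: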

WITH t AS 
(SELECT id,
text,
normalizedPercentage,
CASE WHEN levenshtein_distance(text, EVERY_OTHER_TEXT_COLUMN) < 5 THEN 'groupA' ELSE 'groupB' END AS classification
FROM actions
WHERE actions.timestamp
BETWEEN timestamp '2019-08-01 00:00:01'
AND timestamp '2019-08-31 23:59:59' )
SELECT *,
SUM(normalizedPercentage)
OVER (PARTITION BY classification ) AS cumulative
FROM t

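First, CROSS JOIN the table with itself so that every record is paired with every other record, then use a WHERE clause to filter out all irrelevant pairs; in your case, keep only the pairs with distance < 5.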

WITH t AS 
(SELECT a.id,
a.text,
a.normalizedPercentage,
b.id b_id,
b.text b_text,
b.normalizedPercentage b_normalizedPercentage,
levenshtein_distance(a.text, b.text) distance
FROM actions a
CROSS JOIN actions b
WHERE
(a.timestamp BETWEEN timestamp '2019-08-01 00:00:01'
AND timestamp '2019-08-31 23:59:59')
AND
(b.timestamp BETWEEN timestamp '2019-08-01 00:00:01'
AND timestamp '2019-08-31 23:59:59')
AND 
(levenshtein_distance(a.text, b.text) < 5) 
)
SELECT *,
SUM(b_normalizedPercentage)
OVER (PARTITION BY id) AS cumulative
FROM t

P.S. I haven't tested this, but it should work.
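To make the self-join idea concrete, here is a minimal sketch that collapses the joined pairs to one total per row with GROUP BY. It runs against an in-memory SQLite database from Python rather than Athena, so several things are assumptions: the sample rows are made up, the timestamp filter is left out for brevity, and levenshtein_distance is registered as a Python function because SQLite has no such built-in (in Athena/Presto it is built in). Note that "distance < 5" is not transitive, so each row gets the sum over its own neighbours rather than a shared group total.

```python
import sqlite3


def levenshtein_distance(a: str, b: str) -> int:
    """Dynamic-programming edit distance; stand-in for the Presto built-in."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


# Hypothetical sample data: (id, text, normalizedpercentage)
ROWS = [
    ("1", "login failed", 10.0),
    ("2", "login failure", 20.0),
    ("3", "payment sent", 5.0),
    ("4", "payment send", 7.5),
    ("5", "logout", 1.0),
]

# Pair every row with every row (including itself), keep close pairs,
# then sum the matched rows' percentages per left-hand row.
QUERY = """
SELECT a.id, a.text, SUM(b.normalizedpercentage) AS cumulative
FROM actions a
CROSS JOIN actions b
WHERE levenshtein_distance(a.text, b.text) < 5
GROUP BY a.id, a.text
ORDER BY a.id
"""


def run_demo():
    conn = sqlite3.connect(":memory:")
    conn.create_function("levenshtein_distance", 2, levenshtein_distance)
    conn.execute(
        "CREATE TABLE actions (id TEXT, text TEXT, normalizedpercentage REAL)"
    )
    conn.executemany("INSERT INTO actions VALUES (?, ?, ?)", ROWS)
    result = conn.execute(QUERY).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    for row in run_demo():
        print(row)
```

The SELECT in QUERY should carry over to Athena largely unchanged once the timestamp filters on a and b are added back. Keep in mind that a cross join compares every pair of rows, so the cost grows quadratically with the number of rows in the date range.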
