我在postgres中有下表
col1 col2 col3 col4
NCT04132960 Drug ds-8201a ds 8201a
NCT04132961 Drug ds 8201a ds 8201a
NCT04132962 Drug hydrogen sulphate hydrogen sulphate
NCT04132962 Drug hydrogen sulphate hydrogen
NCT04132962 Drug hydrogen sulphate sulphate
NCT04132963 Drug bupropion (wellbutrin) wellbutrin
NCT04132964 Drug bupropion (wellbutrin) bupropion
NCT04132964 Drug bupropion (wellbutrin) bupropione
NCT04132964 Drug bupropion (wellbutrin) bup
NCT04132965 Drug Galantimine Galantimine
NCT04132965 Drug Galantimine Galantimin
NCT04132966 Drug Glucose null
我正在尝试根据以下过滤器清理上表:
a. col3 should be similar/ identical to col4
b. length of col4 > 3
c. include row if col4 is null
当我在查询下方运行时,我能够匹配除 row1(由于特殊字符而错过(之外的行,并且我收到误报。下面是我得到的查询和输出
SELECT *
FROM table
WHERE col3 ~* ('y' || REGEXP_REPLACE(col4, '+|-', '\+', 'g') || 'y') --replace regexp specific sign
and length(col4) > 3
or col4 is null
order by col1
;
我怎么能忽略 col3 != col4 的命中(例如。加兰汀,硫酸盐(,但如果它们相同,如果我忽略特殊字符(ds-8201a,ds 8201a((安非他酮(wellbutrin(,wellbutrin(,并且如果col4为空,则保留它们。
电流输出:
col1 col2 col3 col4
NCT04132961 Drug ds 8201a ds 8201a
NCT04132962 Drug hydrogen sulphate hydrogen sulphate
NCT04132962 Drug hydrogen sulphate hydrogen
NCT04132962 Drug hydrogen sulphate sulphate
NCT04132963 Drug bupropion (wellbutrin) wellbutrin
NCT04132964 Drug bupropion (wellbutrin) bupropion
NCT04132964 Drug bupropion (wellbutrin) bupropione
NCT04132965 Drug Galantimine Galantimine
NCT04132965 Drug Galantimine Galantimin
NCT04132966 Drug Glucose null
期望的输出:
col1 col2 col3 col4
NCT04132960 Drug ds-8201a ds 8201a
NCT04132961 Drug ds 8201a ds 8201a
NCT04132962 Drug hydrogen sulphate hydrogen sulphate
NCT04132963 Drug bupropion (wellbutrin) wellbutrin
NCT04132964 Drug bupropion (wellbutrin) bupropion
NCT04132965 Drug Galantimine Galantimine
NCT04132966 Drug Glucose null
是否可以在 postgres 中处理此数据清理,还是应该以编程方式进行?
谢谢
我建议您在进行比较之前"规范化"col3 和 col4 - 例如,将所有非字母数字字符替换为哈希。然后它按您的预期工作。
SELECT *
FROM tbl
WHERE
regexp_replace(col3, '[^w]', '#', 'g') ~* ('y'||regexp_replace(col4, '[^w]', '#', 'g')||'y')
and length(col4) > 3
or col4 is null
order by col1;
结果
col1 |col2|col3 |col4 |
-----------|----|----------------------|-----------------|
NCT04132960|Drug|ds-8201a |ds 8201a |
NCT04132961|Drug|ds 8201a |ds 8201a |
NCT04132962|Drug|hydrogen sulphate |hydrogen |
NCT04132962|Drug|hydrogen sulphate |sulphate |
NCT04132962|Drug|hydrogen sulphate |hydrogen sulphate|
NCT04132963|Drug|bupropion (wellbutrin)|wellbutrin |
NCT04132964|Drug|bupropion (wellbutrin)|bupropion |
NCT04132965|Drug|Galantimine |Galantimine |
NCT04132966|Drug|Glucose | |