如果两列相同/相似,则匹配两列,忽略特殊字符



我在postgres中有下表


col1        col2    col3                    col4
NCT04132960 Drug    ds-8201a                ds 8201a
NCT04132961 Drug    ds 8201a                ds 8201a
NCT04132962 Drug    hydrogen sulphate       hydrogen sulphate
NCT04132962 Drug    hydrogen sulphate       hydrogen
NCT04132962 Drug    hydrogen sulphate       sulphate
NCT04132963 Drug    bupropion (wellbutrin)  wellbutrin
NCT04132964 Drug    bupropion (wellbutrin)  bupropion 
NCT04132964 Drug    bupropion (wellbutrin)  bupropione 
NCT04132964 Drug    bupropion (wellbutrin)  bup
NCT04132965 Drug    Galantimine             Galantimine 
NCT04132965 Drug    Galantimine             Galantimin
NCT04132966 Drug    Glucose                     null

我正在尝试根据以下过滤器清理上表:

a. col3 should be similar/ identical to col4
b. length of col4 > 3
c. include row if col4 is null

当我在查询下方运行时,我能够匹配除 row1(由于特殊字符而错过(之外的行,并且我收到误报。下面是我得到的查询和输出

SELECT *
FROM table
WHERE col3 ~* ('y' || REGEXP_REPLACE(col4, '+|-', '\+', 'g') || 'y') --replace regexp specific sign
and length(col4) > 3
or col4 is null
order by col1
;

我怎么能忽略 col3 != col4 的命中(例如。加兰汀,硫酸盐(,但如果它们相同,如果我忽略特殊字符(ds-8201a,ds 8201a((安非他酮(wellbutrin(,wellbutrin(,并且如果col4为空,则保留它们。

电流输出:

col1        col2    col3                    col4
NCT04132961 Drug    ds 8201a                ds 8201a
NCT04132962 Drug    hydrogen sulphate       hydrogen sulphate
NCT04132962 Drug    hydrogen sulphate       hydrogen
NCT04132962 Drug    hydrogen sulphate       sulphate
NCT04132963 Drug    bupropion (wellbutrin)  wellbutrin
NCT04132964 Drug    bupropion (wellbutrin)  bupropion 
NCT04132964 Drug    bupropion (wellbutrin)  bupropione 
NCT04132965 Drug    Galantimine             Galantimine 
NCT04132965 Drug    Galantimine             Galantimin
NCT04132966 Drug    Glucose                     null

期望的输出:

col1        col2    col3                    col4
NCT04132960 Drug    ds-8201a                ds 8201a
NCT04132961 Drug    ds 8201a                ds 8201a
NCT04132962 Drug    hydrogen sulphate       hydrogen sulphate
NCT04132963 Drug    bupropion (wellbutrin)  wellbutrin
NCT04132964 Drug    bupropion (wellbutrin)  bupropion 
NCT04132965 Drug    Galantimine             Galantimine 
NCT04132966 Drug    Glucose                     null

是否可以在 postgres 中处理此数据清理,还是应该以编程方式进行?

谢谢

我建议您在进行比较之前"规范化"col3 和 col4 - 例如,将所有非字母数字字符替换为哈希。然后它按您的预期工作。

SELECT *
FROM tbl
WHERE 
regexp_replace(col3, '[^w]', '#', 'g') ~* ('y'||regexp_replace(col4, '[^w]', '#', 'g')||'y')
and length(col4) > 3
or col4 is null
order by col1;

结果

col1       |col2|col3                  |col4             |
-----------|----|----------------------|-----------------|
NCT04132960|Drug|ds-8201a              |ds 8201a         |
NCT04132961|Drug|ds 8201a              |ds 8201a         |
NCT04132962|Drug|hydrogen sulphate     |hydrogen         |
NCT04132962|Drug|hydrogen sulphate     |sulphate         |
NCT04132962|Drug|hydrogen sulphate     |hydrogen sulphate|
NCT04132963|Drug|bupropion (wellbutrin)|wellbutrin       |
NCT04132964|Drug|bupropion (wellbutrin)|bupropion        |
NCT04132965|Drug|Galantimine           |Galantimine      |
NCT04132966|Drug|Glucose               |                 |

最新更新