在Postgres中解析url

我在Postgres中解析url有问题。我有一个数据库充满了客户和与他们相关的url。我需要与每个客户相关联的唯一域的数组。我希望能够在我的查询中进行解析，而不是将我的结果转储到Python并在那里解析。

在postgres文档中，我发现了这个，但不知道如何将其合并到我的查询中:

SELECT alias, description, token FROM ts_debug('http://example.com/stuff/index.html');  
  alias   |  description  |            token               
----------+---------------+------------------------------  
 protocol | Protocol head | http://  
 url      | URL           | example.com/stuff/index.html  
 host     | Host          | example.com  
 url_path | URL path      | /stuff/index.html

(http://www.postgresql.org/docs/9.3/static/textsearch-parsers.html)

我从一个表开始，像这样:

customer_id | url 
-------------+--------------------   
000001      | www.example.com/fish  
000001      | www.example.com/potato  
000001      | www.potato.com/artichoke
000002      | www.otherexample.com

我的代码:

SELECT customer_id, array_agg(url)
FROM customer_url_table
GROUP BY customer_id

这给了我:

customer_id | unique_domains
-----------------------------
000001      | {www.example.com/fish, www.example.com/potato, www.potato.com/greenery}
000002      | {www.otherexample.com}

我想要一张这样的表:

customer_id | unique_domains
-----------------------------
000001      | {example.com, potato.com}
000002      | {otherexample.com}

在AWS上的PostgreSQL 9.3.3数据库上工作。

您上面链接的文档是与Postgres文本搜索解析器一起使用的。这需要一个单独的配置来设置，并且可能会有更多的开销和/或与您期望的不同的东西。

如果你想设置一个文本解析器，你可以在这里找到更多的信息:

http://www.postgresql.org/docs/9.3/static/sql-createtsconfig.html

然而，如果你想在Postgres中进行内联解析，我建议使用过程化的Postgres语言，你可以导入该语言的解析库。

您提到了Python，因此您可以使用PL/Python和url解析库，例如urlparse(称为urllib)。解析Python中的

关于urlparse的更多信息

包括以下示例代码:

>>> from urlparse import urlparse
>>> o = urlparse('http://www.cwi.nl:80/%7Eguido/Python.html')
>>> o   
ParseResult(scheme='http', netloc='www.cwi.nl:80', path='/%7Eguido/Python.html',
            params='', query='', fragment='')
>>> o.scheme
'http'
>>> o.port
80
>>> o.geturl()
'http://www.cwi.nl:80/%7Eguido/Python.html'

除了这个例子之外，您还可以使用hostname成员获得主机名:

>>> print o.hostname
www.cwi.nl

如果你想正确地解析出域名(有很多边缘情况和变体-即减去www和任何其他可能存在的组合部分-如在这个答案中的方法将是最好的。

有关设置PL/Python的更多信息，可以访问这里:

http://www.postgresql.org/docs/9.3/static/plpython.html

这就是在Postgres

中解析的方法

而不是将结果转储到Python并在那里解析

最终与PL/Python绕了一圈，但如果您真的想在SQL中进行解析(特别是出于性能原因，例如，跨大型数据集)，那么使用PL/Python可能值得额外的努力。

你可以创建一个函数

create or replace function fn_get_url_path(in_link text) returns text as
$$
begin
  return (select token
          from (select token, alias
                from ts_parse('default', in_link)
                       natural join ts_token_type('default')) tokens
          where tokens.alias = 'url_path');
end;
$$ language PLpgSQL;

然后使用函数:

select fn_get_url_path(column1)
from (values ('https://www.example.com'),
         ('https://www.example.com/test.html'),
         ('https://www.example.com/test?a=1'),
         ('this doesnt work and will return null')) a

获取结果:

/*
  null,
  /test.html
  /test?a=1
  null
*/

相关内容

最新更新

热门标签：