Missing characters when parsing a web page title with Text.Regex.PCRE



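I recently built a website that needs to retrieve talk titles from the TED website.

So far, the problem only shows up for this talk: Francis Collins: We need better drugs -- now

In the page source, I see: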

<title>Francis Collins: We need better drugs -- now | Video on TED.com</title>
<span id="altHeadline" >Francis Collins: We need better drugs -- now</span>

Now, in ghci, I tried this:

λ> :m +Network.HTTP Text.Regex.PCRE
λ> let uri = "http://www.ted.com/talks/francis_collins_we_need_better_drugs_now.html"
λ> body <- (simpleHTTP $ getRequest uri) >>= getResponseBody
λ> body =~ "<span id=\"altHeadline\" >(.+)</span>" :: [[String]]
[["id=\"altHeadline\" >Francis Collins: We need better drugs -- now</span>\n\t\t</h","s Collins: We need better drugs -- now</span"]]
λ> body =~ "<title>(.+)</title>" :: [[String]]
[["tle>Francis Collins: We need better drugs -- now | Video on TED.com</title>\n<l","ncis Collins: We need better drugs -- now | Video on TED.com</t"]]

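Either way, the extracted title is missing some characters on the left and picks up unexpected characters on the right. This seems to be related to the -- in the talk title. However, the same pattern works on a string literal: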

λ> let body' = "<title>Francis Collins: We need better drugs -- now | Video on TED.com</title>"
λ> body' =~ "<title>(.+)</title>" :: [[String]]
[["<title>Francis Collins: We need better drugs -- now | Video on TED.com</title>","Francis Collins: We need better drugs -- now | Video on TED.com"]]

Fortunately, Text.Regex.Posix does not have this problem:

λ> import qualified Text.Regex.Posix as P
λ> body P.=~ "<title>(.+)</title>" :: [[String]]
[["<title>Francis Collins: We need better drugs -- now | Video on TED.com</title>","Francis Collins: We need better drugs -- now | Video on TED.com"]]

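My advice: don't use regular expressions to parse HTML. Use a proper HTML parser instead. Below is an example that uses the html-conduit parser with the xml-conduit cursor library (and http-conduit for the download).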

{-# LANGUAGE OverloadedStrings #-}
import           Data.Monoid          (mconcat)
import           Network.HTTP.Conduit (simpleHttp)
import           Text.HTML.DOM        (parseLBS)
import           Text.XML.Cursor      (attributeIs, content, element,
                                       fromDocument, ($//), (&//), (>=>))
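main :: IO ()
main = do
    -- Download the page as a lazy ByteString.
    lbs <- simpleHttp "http://www.ted.com/talks/francis_collins_we_need_better_drugs_now.html"
    -- Parse the HTML and wrap the document in a cursor for traversal.
    let doc = parseLBS lbs
        cursor = fromDocument doc
    -- Text content of every <title> element anywhere in the document.
    print $ mconcat $ cursor $// element "title" &// content
    -- Text content of every <span> whose id attribute is "altHeadline".
    print $ mconcat $ cursor $// element "span" >=> attributeIs "id" "altHeadline" &// content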
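
To try the same cursor queries without a network connection, here is a minimal sketch that parses the two lines quoted in the question from an in-memory string. The surrounding html/head/body wrapper and the name sample are made up for illustration; only the title and span lines come from the page source shown above.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import qualified Data.ByteString.Lazy.Char8 as L8
import qualified Data.Text                  as T
import           Text.HTML.DOM              (parseLBS)
import           Text.XML.Cursor            (attributeIs, content, element,
                                             fromDocument, ($//), (&//), (>=>))

-- Hypothetical stand-in for the downloaded page: the two lines from the
-- question wrapped in an invented minimal HTML skeleton.
sample :: L8.ByteString
sample = L8.unlines
    [ "<html><head>"
    , "<title>Francis Collins: We need better drugs -- now | Video on TED.com</title>"
    , "</head><body>"
    , "<span id=\"altHeadline\" >Francis Collins: We need better drugs -- now</span>"
    , "</body></html>"
    ]

-- Text inside every <title> element of the parsed document.
pageTitle :: L8.ByteString -> T.Text
pageTitle lbs =
    T.concat $ fromDocument (parseLBS lbs) $// element "title" &// content

-- Text inside every <span id="altHeadline"> element.
altHeadline :: L8.ByteString -> T.Text
altHeadline lbs =
    T.concat $ fromDocument (parseLBS lbs)
        $// element "span" >=> attributeIs "id" "altHeadline" &// content

main :: IO ()
main = do
    print (pageTitle sample)
    print (altHeadline sample)
```

Keeping the queries in small pure functions makes them easy to test against saved copies of a page before pointing them at the live site.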

The example above that downloads the page is also available as active code on the School of Haskell.
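
As for why the PCRE result is shifted, one plausible explanation, which is an assumption and not something confirmed in this thread, is a mix-up between byte offsets and character offsets: if the String is encoded as UTF-8 before it reaches the C library, the match offsets come back in bytes, and applying them to the original String as character indices lands too far to the right whenever non-ASCII characters precede the match. The sketch below uses only base and does not call Text.Regex.PCRE at all; the page string and the helper names are invented to illustrate the effect.

```haskell
import Data.Char (ord)
import Data.List (isPrefixOf)

-- Number of bytes a character occupies when encoded as UTF-8.
utf8Width :: Char -> Int
utf8Width c
    | n < 0x80    = 1
    | n < 0x800   = 2
    | n < 0x10000 = 3
    | otherwise   = 4
  where
    n = ord c

-- Character index of the first occurrence of needle in the haystack.
charOffset :: String -> String -> Maybe Int
charOffset needle = go 0
  where
    go i s
        | needle `isPrefixOf` s = Just i
        | otherwise = case s of
            []         -> Nothing
            (_ : rest) -> go (i + 1) rest

-- Byte index of character position i once the prefix is UTF-8 encoded.
byteOffset :: String -> Int -> Int
byteOffset s i = sum (map utf8Width (take i s))

-- Cut out (offset, length) from a String, counting in characters.
slice :: Int -> Int -> String -> String
slice off len = take len . drop off

main :: IO ()
main = do
    -- Hypothetical page: three two-byte characters precede the title.
    let page   = "\233\233\233 <title>We need better drugs -- now</title>\n<link>"
        needle = "<title>We need better drugs -- now</title>"
    case charOffset needle page of
        Nothing -> putStrLn "no match"
        Just i  -> do
            let b = byteOffset page i
            print (slice i (length needle) page)  -- correct character offset
            print (slice b (length needle) page)  -- byte offset misused as a character offset
```

With this input the second line printed is "tle>We need better drugs -- now</title>\n<l": three characters lost on the left and three extra on the right, the same shape as the output in the question. If that is what is happening, it would also explain why a short ASCII-only literal matches correctly.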
