Windmill not retrieving all HTML content

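I'm trying to scrape data from a web page using the Python Windmill framework, but I'm having trouble getting the contents of an HTML table on the page. The table is generated by JavaScript, which is why I'm using Windmill to fetch the content. However, the content that comes back does not include the table, and if I try to parse it with BeautifulSoup I get an error.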

from copy import copy
import re

from windmill.authoring import WindmillTestClient
from BeautifulSoup import BeautifulSoup


def get_massage():
    # Start from BeautifulSoup's default (regex, replacement) pairs and add two more.
    my_massage = copy(BeautifulSoup.MARKUP_MASSAGE)
    # Intended to strip document.write(...) calls.
    my_massage.append((re.compile(u"document.write(.+);"), lambda match: ""))
    # Intended to strip alt="..." attributes.
    my_massage.append((re.compile(u'alt=".+">'), lambda match: ">"))
    return my_massage


def test_scrape():
    my_massage = get_massage()
    client = WindmillTestClient(__name__)
    client.open(url='http://marinetraffic.com/ais/datasheet.aspx?MMSI=636092060&TIMESTAMP=2&menuid=&datasource=POS&app=&mode=&B1=Search')
    client.waits.forPageLoad(timeout='60000')
    # The reply is a dict; 'result' holds the page markup.
    html = client.commands.getPageText()
    assert html['status']
    assert html['result']
    soup = BeautifulSoup(html['result'], markupMassage=my_massage)
    print soup.prettify()

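When you look at the soup output, the table is missing, but it does show up if you inspect the page with something like Firebug. Ultimately, I'm trying to get the table contents and parse them into some data structure for further processing. Any help is much appreciated!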

The problem is that the markup massage you're using doesn't work for the page you're processing: it removes more HTML than it should.
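One plausible way the massage ends up deleting too much is the greedy `.+` in both patterns: it runs to the last possible match on the line rather than the first. The sketch below illustrates this on made-up, single-line markup (the `sample` string is hypothetical; the real page's layout is not shown here), and compares it with a non-greedy `.+?` variant that is included only for contrast.

```python
import re

# Hypothetical single-line markup standing in for the real page.
sample = ('<img alt="logo"><table><tr><td class="v">12.3</td></tr></table>'
          '<p id="end">done</p>')

# Pattern from the question: ".+" is greedy, so the match extends to the
# LAST '">' on the line, not to the end of the alt attribute.
greedy = re.compile(u'alt=".+">')

# Non-greedy variant, shown only for comparison.
lazy = re.compile(u'alt=".+?">')

after_greedy = greedy.sub(">", sample)
after_lazy = lazy.sub(">", sample)

print(after_greedy)  # the table is gone
print(after_lazy)    # the table survives
```

If the real page happens to put the table on the same line as an `alt` attribute, the question's pattern would swallow it in just this way.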

To verify that BeautifulSoup can parse the page you need, I simply tried this:

soup = BeautifulSoup(html['result'])
soup.table

It worked fine, so it seems that no markup massage is needed at all in this case.
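As for turning the table into a data structure, here is a minimal sketch of one way to do it. To keep it dependency-free it uses the standard library's `HTMLParser` instead of BeautifulSoup; with BeautifulSoup, the rows could be walked with `soup.table.findAll('tr')` instead. The `TableRows` class, the sample markup, and the column names are all hypothetical, since the real table's columns are not shown in the question.

```python
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser   # Python 2, as used in the question


class TableRows(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr>."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.rows = []      # finished rows
        self._row = None    # cells of the row currently being read
        self._cell = None   # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Hypothetical markup; in practice this would be html['result'].
sample_html = (
    "<table>"
    "<tr><th>Name</th><th>Speed</th></tr>"
    "<tr><td>Vessel A</td><td> 12.3 </td></tr>"
    "</table>"
)

parser = TableRows()
parser.feed(sample_html)
parser.close()

rows = parser.rows
# Treat the first row as the header and map each later row onto it.
records = [dict(zip(rows[0], row)) for row in rows[1:]]
print(records)
```

This assumes the first row holds the column headers; if the real table has no header row, `rows` can be used directly as a list of lists.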
