Python - How to correctly use HTMLParser on alexa.com



I'm trying to get the top 20 websites from www.alexa.com/topsites/global in my application, but I'm not getting the result I expect.

Here is my code so far, using HTMLParser and urllib2:

import HTMLParser, urllib2
class MyHTMLParser(HTMLParser.HTMLParser):
    def reset(self):
        HTMLParser.HTMLParser.reset(self)
        self.in_a = False
        self.next_link_text_pair = None
    def handle_starttag(self, tag, attrs):
        if tag=='a':
            for name, value in attrs:
                if name=='href':
                    self.next_link_text_pair = [value, '']
                    self.in_a = True
                    break
    def handle_data(self, data):
        if self.in_a: self.next_link_text_pair[1] += data
    def handle_endtag(self, tag):
        if tag=='a':
            if self.next_link_text_pair is not None:
                print self.next_link_text_pair
            self.next_link_text_pair = None
            self.in_a = False
if __name__=='__main__':
    p = MyHTMLParser()
    p.feed(urllib2.urlopen('http://www.alexa.com/topsites/global').read())

The result I get:

['/', '']
['/topsites', 'Browse Top Sites']
['/', 'Home']
['/plans', 'Plans and Pricing']
['/tools', 'Tools']
['/pro/dashboard', 'My Dashboard']
['/toolbar', 'Toolbar']
['/about', 'About Us']
['/support', 'Support']
['http://blog.alexa.com', 'Blog']
['/secure/login?resource=%2Ftopsites%2Fglobal', 'Sign In']
['/register?resource=%2Ftopsites%2Fglobal', 'Create an Account']
['/topsites/countries', 'By Country']
['/topsites/category', 'By Category']
['/siteinfo/google.com', 'Google.com']
['/siteinfo/facebook.com', 'Facebook.com']
['/siteinfo/youtube.com', 'Youtube.com']
['/siteinfo/baidu.com', 'Baidu.com']
['/siteinfo/yahoo.com', 'Yahoo.com']
['/siteinfo/wikipedia.org', 'Wikipedia.org']
['/siteinfo/amazon.com', 'Amazon.com']
['/siteinfo/twitter.com', 'Twitter.com']
['/siteinfo/taobao.com', 'Taobao.com']
['/siteinfo/qq.com', 'Qq.com']
['/siteinfo/google.co.in', 'Google.co.in']
['/siteinfo/linkedin.com', 'Linkedin.com']

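I want to drop the unwanted entries at the top, such as Home, Plans and Pricing, and so on, and get only the names of the top 20 websites, without the ['/siteinfo/ prefix.

Can anyone help me? I don't want to use BeautifulSoup.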

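You can check in handle_endtag whether the URL starts with /siteinfo/ to filter out the unrelated links, and print only the link text: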

if self.next_link_text_pair is not None:
    if self.next_link_text_pair[0].startswith('/siteinfo/'):
        print self.next_link_text_pair[1]
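
For illustration, the same filtering idea can be sketched as a self-contained parser that collects the site names into a list instead of printing them. This is only a sketch under a few assumptions: it targets Python 3 (where the module is html.parser rather than HTMLParser), the class name SiteInfoParser and its prefix and limit parameters are hypothetical, and SAMPLE_HTML is made-up markup standing in for the downloaded page, so nothing here depends on the real page layout.

```python
from html.parser import HTMLParser


class SiteInfoParser(HTMLParser):
    """Collect the text of <a> tags whose href starts with a given prefix."""

    def __init__(self, prefix="/siteinfo/", limit=20):
        super().__init__()
        self.prefix = prefix   # only links starting with this are kept
        self.limit = limit     # stop collecting after this many names
        self.sites = []        # collected link texts, in page order
        self._text = None      # text chunks while inside a matching <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a" and len(self.sites) < self.limit:
            href = dict(attrs).get("href") or ""
            if href.startswith(self.prefix):
                self._text = []

    def handle_data(self, data):
        if self._text is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._text is not None:
            self.sites.append("".join(self._text).strip())
            self._text = None


# Made-up sample markup (an assumption, not the real Alexa page).
SAMPLE_HTML = """
<a href="/">Home</a>
<a href="/plans">Plans and Pricing</a>
<a href="/siteinfo/google.com">Google.com</a>
<a href="/siteinfo/facebook.com">Facebook.com</a>
"""

parser = SiteInfoParser()
parser.feed(SAMPLE_HTML)
print(parser.sites)
```

To run it against a live page, the downloaded HTML would be decoded to a string and passed to feed() in place of SAMPLE_HTML. Keeping the names in a list makes it straightforward to cap the result at 20 entries.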
