How to track specific content on a web page over a period of time using Python

I want to monitor changes to certain content on some web pages. I would like to do this every day, using either a script or a browser plugin.

For example, if specific content on a web page changes, I want to be notified according to my query, without subscribing to the site's feed.

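I want to be notified each day whenever my criteria are met.
Is there any script or browser plugin available for this?
Can I use a Python script to track the changes and achieve this?
How can I do this?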

You can simply write a Python script based on the urllib/requests/Beautiful Soup modules to do this.

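What you need to do is write a function that parses the required part of the site and, inside a loop, checks whether it meets your requirement. If it does not, wait for a while (the time module's time.sleep() function does this) and run the check again, over and over.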

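import time
import requests
from bs4 import BeautifulSoup

def parse(url, keyword, interval=86400):  # keyword and interval are placeholders
    while True:
        # Extract the content you want (here: all text on the page).
        text = BeautifulSoup(requests.get(url, timeout=30).text, "html.parser").get_text()
        if keyword in text:
            # Condition met: do this (e.g. notify yourself), then stop.
            print("Found", keyword, "on", url)
            break
        # Condition not met: wait this many seconds, then recheck.
        time.sleep(interval)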

That's it, you're done. Don't forget to import the modules! :)
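
As a slightly more general sketch of the check-wait-recheck loop described above, the fetching and the condition can be passed in as functions. The names poll_until, fetch, condition, interval_seconds and max_checks are hypothetical, chosen only for this illustration; nothing here comes from a particular library beyond the standard time module.

```python
import time
from typing import Callable, Optional


def poll_until(fetch: Callable[[], str],
               condition: Callable[[str], bool],
               interval_seconds: float = 86400,
               max_checks: Optional[int] = None,
               sleep: Callable[[float], None] = time.sleep) -> Optional[str]:
    """Call fetch() repeatedly until condition(content) is true.

    Returns the matching content, or None if max_checks ran out first.
    """
    checks = 0
    while max_checks is None or checks < max_checks:
        content = fetch()
        if condition(content):
            return content
        checks += 1
        # Only wait if another check is still going to happen.
        if max_checks is None or checks < max_checks:
            sleep(interval_seconds)
    return None


# Demo with canned "pages" instead of a real site, so no network is needed.
pages = iter(["no jobs yet", "still nothing", "Freshers wanted in Chennai"])
result = poll_until(lambda: next(pages),
                    lambda text: "Chennai" in text,
                    interval_seconds=0)
print(result)  # Freshers wanted in Chennai
```

For a real page, fetch could be something like lambda: requests.get(url, timeout=30).text, with interval_seconds left at one day.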

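Here is my code, showing how I scrape a table from a site. On that site the table has no id or class defined, so you don't need to specify one. If the table does have an id or class, just use html.xpath('//table[@id="id_val"]/tr') instead of html.xpath('//table/tr'), where id_val stands for the actual id.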

import time
from urllib.request import urlopen
from lxml import etree

while True:
    time.sleep(60)  # 1-minute interval
    # time.sleep(86400)  # 1-day interval
    web = urlopen("http://www.yoursite.com/")
    html = etree.HTML(web.read())
    main_list = []
    for tr in html.xpath('//table/tr'):
        tds = tr.xpath('td')
        if len(tds) < 7:
            continue  # skip header rows and rows without a link cell
        # Third cell: keep rows for Chennai or Across India.
        place = tds[2].text or ''
        if place != 'Across India' and 'Chennai' not in place.split('/'):
            continue
        # Sixth cell: keep rows marked Freshers or containing a 0.
        level = tds[5].text or ''
        if 'Freshers' in level.split('/') or '0' in level.split(' '):
            sub_list = [td.text for td in tds]
            # Insert the absolute link from the seventh cell at index 6.
            sub_list.insert(6, 'http://yoursite.com/%s' % tds[6].xpath('a')[0].get('href'))
            main_list.append(sub_list)
    print('main_list', main_list)
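
The script above prints the matching rows on every pass. Since the question is about being told when something changes, one possible addition is to store a fingerprint of the previous result and compare against it. This is only a sketch using the standard library; fingerprint, has_changed and the state file path are hypothetical names, not part of the code above.

```python
import hashlib
import json
from pathlib import Path


def fingerprint(rows):
    """Return a SHA-256 hex digest of a list of rows (lists of str or None)."""
    payload = json.dumps(rows, ensure_ascii=False)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def has_changed(rows, state_file):
    """Compare rows with the digest saved by the previous call, then save the new one.

    The very first call (no state file yet) counts as a change.
    """
    path = Path(state_file)
    new_digest = fingerprint(rows)
    old_digest = path.read_text(encoding="utf-8").strip() if path.exists() else None
    path.write_text(new_digest, encoding="utf-8")
    return new_digest != old_digest
```

Inside the loop, a call such as has_changed(main_list, "last_seen.txt") could then decide whether to print or send a notification, so that unchanged results stay quiet.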
