Python program scrapes different text even though the web page has not changed

This code tries to scrape an Amazon listing and check whether the item is available from the first-party Amazon vendor.

from lxml import html
from time import sleep
import requests
import time
Amazonurl = raw_input("Item URL: ")
page = requests.get(Amazonurl)
tree = html.fromstring(page.text)
Stock = tree.xpath('//*[@id="merchant-info"]/text()')
IfInstock = ''.join(Stock)

if 'Ships from and sold by Amazon.com.' in IfInstock:
    print 'Instock'
    print time.strftime("%a, %d %b %Y %H:%M:%S")
else:
    print 'Not in Stock'
    print time.strftime("%a, %d %b %Y %H:%M:%S")

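Strangely, when I plug in, say, http://www.amazon.com/New-Nintendo-3DS-XL-Black/dp/B00S1LRX3W/ref=sr_1_1?ie=UTF8&qid=1438413018&sr=8-1&keywords=new+3ds, which has not been out of stock for the past few days, the code sometimes returns "Instock" and other times returns "Not in Stock". I found that this is because the code sometimes scrapes

[]

while other times it scrapes the following, as it should: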

['\n    \n    \n\n    \n        \n        \n    \n    \n    \n    \n    \n    \n    \n    \n    \n    \n    \n    \n    \n        Ships from and sold by Amazon.com.\n    \n    \n        \n        \n        \n        \n        \n        \n        Gift-wrap available.\n        \n\n']

The web page does not seem to change. Does anyone know why my output changes so often, or how I can fix this?

Amazon is refusing to serve you the page.

I added a single line to your script to see what the response's status_code is when you get the odd result.

from lxml import html
import requests
import time

Amazonurl = "http://www.amazon.com/dp/B00S1LRX3W/?tag=stackoverfl08-20"
attempt = 0
while True:
    page = requests.get(Amazonurl)
    tree = html.fromstring(page.text)
    # The added line: show the HTTP status of every response.
    print(page.status_code)
    Stock = tree.xpath('//*[@id="merchant-info"]/text()')
    # An empty XPath result joins to '', so the check below fails.
    IfInstock = ''.join(Stock)
    if 'Ships from and sold by Amazon.com.' in IfInstock:
        print('Instock')
        print(time.strftime("%a, %d %b %Y %H:%M:%S"))
    else:
        print('Not in Stock')
        print(time.strftime("%a, %d %b %Y %H:%M:%S"))
    time.sleep(15)
    # Stop after 17 requests in total.
    if attempt > 15:
        break
    attempt += 1

I ran this script with a 15-second interval between requests, as you described. These are the results:

200
Instock
Sat, 01 Aug 2015 19:51:27
200
Instock
Sat, 01 Aug 2015 19:51:43
503
Not in Stock
Sat, 01 Aug 2015 19:51:59
200
Instock
Sat, 01 Aug 2015 19:52:15
200
Instock
Sat, 01 Aug 2015 19:52:32
200
Instock
Sat, 01 Aug 2015 19:52:48
200
Instock
Sat, 01 Aug 2015 19:53:05
200
Instock
Sat, 01 Aug 2015 19:53:22
200
Instock
Sat, 01 Aug 2015 19:53:38
200
Instock
Sat, 01 Aug 2015 19:53:55
200
Instock
Sat, 01 Aug 2015 19:54:12
200
Instock
Sat, 01 Aug 2015 19:54:29
200
Instock
Sat, 01 Aug 2015 19:54:45
200
Instock
Sat, 01 Aug 2015 19:55:02
200
Instock
Sat, 01 Aug 2015 19:55:18
200
Instock
Sat, 01 Aug 2015 19:55:35
200
Instock
Sat, 01 Aug 2015 19:55:52

As you can see, whenever the result is the odd one, that is "Not in Stock", the status_code is 503. According to http://www.w3.org/Protocols/rfc2616/rfc2616-sec10.html, it is defined as follows:

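10.5.4 503 Service Unavailable

The server is currently unable to handle the request due to a temporary overloading or maintenance of the server. The implication is that this is a temporary condition which will be alleviated after some delay. If known, the length of the delay MAY be indicated in a Retry-After header. If no Retry-After is given, the client SHOULD handle the response as it would for a 500 response.

  Note: The existence of the 503 status code does not imply that a
  server must use it when becoming overloaded. Some servers may wish
  to simply refuse the connection.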

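In other words, Amazon is not serving you the page because you made several requests in a short time. Amazon is not very strict about how "short" that time is, which is why you get a 200 status_code most of the time. The 503 response does not contain the merchant-info element, so the XPath query returns [] and the script prints "Not in Stock" even though the item is available.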
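One way to keep a 503 from being reported as "Not in Stock" is to check the status code before looking at the text, and to wait and retry when the server asks for patience. The following is only a sketch under assumptions: `check_stock`, `retry_delay`, and the `fetch` callable are hypothetical names, not part of requests or lxml, and `fetch` is assumed to return a `(status_code, headers, merchant_info)` tuple, where `merchant_info` is the joined XPath text from the script above.

```python
import time

IN_STOCK_MARKER = 'Ships from and sold by Amazon.com.'


def retry_delay(status_code, headers, attempt, base_delay=15.0):
    """Return the seconds to wait before retrying, or None if no retry is needed."""
    if status_code != 503:
        return None
    # Retry-After may also be an HTTP date; this sketch only handles seconds.
    retry_after = headers.get('Retry-After', '')
    if retry_after.isdigit():
        return float(retry_after)
    # No usable header: back off exponentially (15 s, 30 s, 60 s, ...).
    return base_delay * (2 ** attempt)


def check_stock(fetch, url, max_attempts=4, sleep=time.sleep):
    """Return 'Instock', 'Not in Stock', or 'Unknown (HTTP <code>)'.

    `fetch` is a hypothetical callable returning
    (status_code, headers, merchant_info) for the given URL.
    """
    for attempt in range(max_attempts):
        status_code, headers, merchant_info = fetch(url)
        delay = retry_delay(status_code, headers, attempt)
        if delay is None:
            break
        if attempt < max_attempts - 1:
            sleep(delay)
    if status_code != 200:
        # The page was not served, so nothing is known about the stock.
        return 'Unknown (HTTP %d)' % status_code
    return 'Instock' if IN_STOCK_MARKER in merchant_info else 'Not in Stock'
```

With requests and lxml, `fetch` could wrap the same calls the script already makes: `requests.get(url)`, then `page.status_code`, `page.headers`, and the joined result of `tree.xpath('//*[@id="merchant-info"]/text()')`. Passing `fetch` and `sleep` in as arguments is a design choice that lets the logic be tried without touching the network.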

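I hope this answers your question. Now, if you really want to scrape sites like Amazon, I recommend Scrapy, which is easy to use and easy to configure. You can get around the blocking on sites like Amazon by using random user-agents. Of course, this is just an extra on top of your original question.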
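As a rough illustration of what "easy to configure" can look like, here is a sketch of a Scrapy settings.py that slows the crawl down and retries 503 responses. The setting names are standard Scrapy settings, but every value is an assumption chosen to mirror the 15-second interval used above, and the USER_AGENT string is a placeholder. It sets one fixed user agent; rotating random user agents would need a custom downloader middleware, which is not shown here.

```python
# settings.py for a hypothetical Scrapy project (all values are illustrative)

# Placeholder identity string; replace it with your own details.
USER_AGENT = "stock-checker (+http://example.com/contact)"

# Wait about 15 seconds between requests to the same site,
# with random jitter between 0.5x and 1.5x of DOWNLOAD_DELAY.
DOWNLOAD_DELAY = 15
RANDOMIZE_DOWNLOAD_DELAY = True
CONCURRENT_REQUESTS_PER_DOMAIN = 1

# Let Scrapy adapt the delay to how the server responds.
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 15

# Retry a 503 a few times instead of treating it as a final answer.
RETRY_ENABLED = True
RETRY_TIMES = 3
RETRY_HTTP_CODES = [503]
```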
