Python screen scraping Forbes.com



I am writing a Python program to extract and store metadata from interesting online technology articles: "og:title", "og:description", "og:image", "og:url", and "og:site_name".

This is the code I am using...

import urllib3
from bs4 import BeautifulSoup

# Set up the request headers
headers = {}
headers['Accept'] = "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8"
headers['Accept-Charset'] = 'ISO-8859-1,utf-8;q=0.7,*;q=0.3'
headers['Accept-Encoding'] = 'none'
headers['Accept-Language'] = "en-US,en;q=0.8"
headers['Connection'] = 'keep-alive'
headers['User-Agent'] = "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2227.0 Safari/537.36"
# Create the connection pool
http = urllib3.PoolManager()
# Send the request (url holds the article address); headers must be passed by keyword
response = http.request('GET', url, headers=headers)
# Parse the response with BeautifulSoup
soup = BeautifulSoup(response.data, 'html.parser')
# Scrape <meta property="og:title" content=" x x x ">, keeping the longest value found
title = ""
for tag in soup.find_all("meta"):
    if tag.get("property") == "og:title":
        content = tag.get("content") or ""
        if len(content) > len(title):
            title = content

The program works well on every site except one. On "forbes.com", I cannot reach the article from Python:

url = https://www.forbes.com/consent/?toURL=https://www.forbes.com/sites/shermanlee/2018/07/31/privacy-revolution-how-blockchain-is-reshaping-our-economy/#72c3b4e21086

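I cannot get past this consent page; it appears to be TrustArc's "Cookie Consent Manager" solution. On a computer, you basically give your consent, and on each subsequent run you can access the article.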

If I request the "toURL" address directly: https://www.forbes.com/sites/shermanlee/2018/07/31/privacy-revolution-how-blockchain-is-reshaping-our-economy/#72c3b4e21086

and skip the "https://www.forbes.com/consent/" page, I am redirected back to it.

I tried to see whether there is a cookie I could set in the headers, but I could not find the magic key.

Can anyone help me?

You need to send one required cookie, notice_gdpr_prefs, to view the data:

import requests
from bs4 import BeautifulSoup

# Send the consent cookie so the article is served instead of the consent page
src = requests.get(
    "https://www.forbes.com/sites/shermanlee/2018/07/31/privacy-revolution-how-blockchain-is-reshaping-our-economy/",
    headers={
        "cookie": "notice_gdpr_prefs"
    })
soup = BeautifulSoup(src.content, 'html.parser')
# Look up the og:title meta tag and print its content
title = soup.find("meta", property="og:title")
print(title["content"])
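
To collect all five Open Graph fields from the question rather than only og:title, the same idea can be sketched with the standard library alone. This is a rough sketch, not a tested recipe for Forbes: the names OpenGraphParser, extract_og_metadata, and fetch_og_metadata are made up for illustration, the short User-Agent string is a placeholder, and it assumes the notice_gdpr_prefs cookie is still what the site checks.

```python
from html.parser import HTMLParser
from urllib.request import Request, urlopen

# The Open Graph properties listed in the question
OG_KEYS = ("og:title", "og:description", "og:image", "og:url", "og:site_name")


class OpenGraphParser(HTMLParser):
    """Hypothetical helper: collects <meta property="og:..."> values."""

    def __init__(self):
        super().__init__()
        self.metadata = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        prop = attributes.get("property")
        content = attributes.get("content")
        # Keep the first non-empty value seen for each wanted property
        if prop in OG_KEYS and content and prop not in self.metadata:
            self.metadata[prop] = content


def extract_og_metadata(html):
    """Return a dict of the Open Graph properties found in an HTML string."""
    parser = OpenGraphParser()
    parser.feed(html)
    return parser.metadata


def fetch_og_metadata(url, cookie="notice_gdpr_prefs"):
    """Fetch a page with the consent cookie set (assumption: the site
    still accepts this cookie) and extract its Open Graph metadata."""
    request = Request(url, headers={"Cookie": cookie, "User-Agent": "Mozilla/5.0"})
    with urlopen(request, timeout=10) as response:
        html = response.read().decode("utf-8", errors="replace")
    return extract_og_metadata(html)
```

Calling fetch_og_metadata with the article URL should return a dictionary keyed by property name; any property the page does not provide is simply absent from the result, so use .get() when reading it.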
