Web scraping a JavaScript table (with grid and list views) using Python and Beautiful Soup



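I am trying to parse data from a JSON table on this website.

URL: https://boxes.mysubscriptionaddiction.com/subscription_boxes_for/food

I mainly need the name, rating, and description of every food subscription box listed. I am facing a couple of challenges. First, the table has two views, a grid view and a list view. How do I specify which table view the code should reference? Second, I get this error: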

ValueError - Timeout value connect was Timeout(connect=<object object at 0x000002767CECD5C0>, 
read=<object object at 0x000002767CECD5C0>, total=None), but it must be an int, float or None.

I am not sure what this means.

My code:

from pandas.io.html import read_html
from selenium import webdriver
import json
import requests
import os
import sys
from bs4 import BeautifulSoup
import requests

driver = webdriver.Firefox(executable_path=r'C:\Drivers\geckodriver.exe')
driver.get('https://boxes.mysubscriptionaddiction.com/subscription_boxes_for/food')

table = driver.find_element_by_xpath('/html/body/div[3]/div/span/div[2]/div/div[1]/div[3]/div[3]/table')
table_html = table.get_attribute('innerHTML')
bs = BeautifulSoup(table_html, 'html.parser')
rows = bs.select('tbody tr')
print(bs)

Here is how to get the data you are looking for (data is a dict containing the information):

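import json

import requests
from bs4 import BeautifulSoup

scrape_url = 'https://boxes.mysubscriptionaddiction.com/subscription_boxes_for/food'
r1 = requests.get(scrape_url)
page = r1.content
soup = BeautifulSoup(page, 'html.parser')

# The listing data is embedded as JSON inside one of the page's <script> tags.
scripts = soup.find_all('script')
# Index 11 is specific to this page's markup and may shift if the site changes.
data_str = scripts[11].contents[0].strip()
# strict=False lets json.loads accept control characters inside strings.
data = json.loads(data_str, strict=False)
print(data['itemListElement'])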
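
Since the hard-coded index scripts[11] breaks as soon as the page adds or removes a script tag, a more defensive variant is sketched below. It rests on assumptions not confirmed by the page itself: that the data lives in a script tag of type application/ld+json, and that each entry follows the usual schema.org layout (an "item" dict with "name", "description", and "aggregateRating"). The helper names extract_item_list and summarize, and the SAMPLE_HTML stand-in, are made up for illustration; with the real site you would pass r1.content instead.

```python
import json

from bs4 import BeautifulSoup


def extract_item_list(html):
    """Return the first embedded JSON object that has an 'itemListElement' key."""
    soup = BeautifulSoup(html, "html.parser")
    # Assumption: the data sits in a <script type="application/ld+json"> tag.
    for script in soup.find_all("script", type="application/ld+json"):
        text = script.string
        if not text:
            continue
        try:
            data = json.loads(text.strip(), strict=False)
        except json.JSONDecodeError:
            continue  # skip script blocks that do not hold valid JSON
        if isinstance(data, dict) and "itemListElement" in data:
            return data
    return None


def summarize(data):
    """Flatten the entries into dicts with name, rating, and description."""
    rows = []
    for element in data.get("itemListElement", []):
        # Assumption: details may be nested under "item"; otherwise use the entry itself.
        item = element.get("item", element)
        if not isinstance(item, dict):
            item = element
        rating = item.get("aggregateRating")
        if isinstance(rating, dict):
            rating = rating.get("ratingValue")
        rows.append({
            "name": item.get("name"),
            "rating": rating,
            "description": item.get("description"),
        })
    return rows


# Made-up stand-in for the downloaded page, so the sketch runs offline.
SAMPLE_HTML = """
<html><head>
<script>var x = 1;</script>
<script type="application/ld+json">
{"@type": "ItemList", "itemListElement": [
  {"@type": "ListItem", "position": 1,
   "item": {"name": "Example Box", "description": "A made-up entry",
            "aggregateRating": {"ratingValue": "4.5"}}}
]}
</script>
</head><body></body></html>
"""

data = extract_item_list(SAMPLE_HTML)
if data is not None:
    for row in summarize(data):
        print(row)
```

Because the JSON is part of the page source, this route does not depend on whether the grid or the list view is displayed. Fields missing from an entry come back as None, so it is worth printing one raw element first to check the real key names.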
