Playwright is not loading all of the HTML (Python)



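I'm just trying to scrape the titles from the page, but the HTML returned by page.inner_html('body') does not include all of the page's HTML. I thought the content might be loaded by JavaScript, but when I look at the Network tab in the dev tools I can't find any JSON, or see where it is loaded from. I also tried this with Selenium, so there must be something I'm doing wrong.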

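So none of the items in the list show up, although the regular HTML shows up fine. The information loads without any need to wait for content to load.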

# import Playwright's synchronous API
from playwright.sync_api import sync_playwright

url = 'https://order.mandarake.co.jp/order/listPage/list?categoryCode=07&keyword=naruto&lang=en'

# open the URL
with sync_playwright() as p:
    browser = p.chromium.launch(headless=False)
    page = browser.new_page()
    page.goto(url)

    # wait until the page has finished loading
    page.wait_for_load_state("networkidle")

    # get the HTML content of <body>
    html = page.inner_html("body")
    print(html)

    # close the browser
    browser.close()

No, the page does not load its content dynamically via JavaScript; it is an entirely static HTML DOM.
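One way to check a "static HTML" claim like this without a browser is to parse the raw HTML string and see whether the titles are already in it. The following is only a minimal sketch under stated assumptions: it uses the standard library's `html.parser` rather than BeautifulSoup, the names `TitleLinkCollector` and `titles_in_static_html` are hypothetical, and the sample markup is made up to resemble a `div.title` containing a link, not copied from the real page.

```python
from html.parser import HTMLParser


class TitleLinkCollector(HTMLParser):
    """Collect the text of every <a> found inside a <div class="title">."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self._div_depth = 0    # > 0 while we are inside a div.title
        self._in_link = False  # True while we are inside an <a> in that div
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "div":
            # Enter a div.title, or track a nested div inside one.
            if self._div_depth or "title" in classes:
                self._div_depth += 1
        elif tag == "a" and self._div_depth and not self._in_link:
            self._in_link = True
            self._buffer = []

    def handle_endtag(self, tag):
        if tag == "div" and self._div_depth:
            self._div_depth -= 1
        elif tag == "a" and self._in_link:
            self._in_link = False
            self.titles.append("".join(self._buffer).strip())

    def handle_data(self, data):
        if self._in_link:
            self._buffer.append(data)


def titles_in_static_html(html):
    """Return the link texts found in div.title elements of an HTML string."""
    parser = TitleLinkCollector()
    parser.feed(html)
    parser.close()
    return parser.titles


# Made-up sample markup (an assumption, not the real page source).
sample = (
    '<body><div class="title"><a href="/x"> NARUTO </a></div>'
    '<div class="price">100</div>'
    '<div class="item title"><p><a>Sakura</a></p></div>'
    '<a>outside</a></body>'
)
print(titles_in_static_html(sample))
```

If a function like this returns a non-empty list for the text of a plain HTTP response, the titles are present before any JavaScript runs. The BeautifulSoup version that follows does the equivalent extraction against the live page.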

from bs4 import BeautifulSoup
import requests

# fetch the page over plain HTTP; no browser is involved
page = requests.get('https://order.mandarake.co.jp/order/listPage/list?categoryCode=07&keyword=naruto&lang=en')
soup = BeautifulSoup(page.content, 'lxml')

data = []
# each item title is a link inside a <div class="title">
for e in soup.select('div.title'):
    d = {
        'title': e.a.get_text(strip=True),
    }
    data.append(d)

print(data)

Output:

[{'title': 'NARUTO THE ANIMATION CHRONICLE\u3000genga made for sale'}, {'title': 'Plex DPCF Haruno Sakura Reboru ring of the eyes'}, {'title': 'Naruto: Shippuden\u3000(replica)  ナルト'}, {'title': 'Naruto: Shippuden\u3000(replica)  ナルト'}, {'title': 'Naruto: Shippuden\u3000(replica)  NARUTO -ナルト-'}, {'title': 'Naruto: Shippuden ナルト\u3000(replica)'}, {'title': 'Naruto Shippuuden\u3000(replica) NARUTO -ナルト-'}, {'title': 'NARUTO -ナルト- 疾風伝\u3000(複製セル)'}, {'title': 'MegaHouse    ちみ メガ Petit Chara Land NARUTO SHIPPUDEN ナルト blast-of-wind intermediary   Even [swirl ナルト special is a volume on ばよ.  
All 6 types set] inner bag not opened/box damaged'}, {'title': 'NARUTO -ナルト- 疾風伝\u3000(複製セル)'}, {'title': 'NARUTO -ナルト- 疾風伝\u3000(複製セル)'}, {'title': 'NARUTO -ナルト- 疾風伝'}, {'title': 'NARUTO -ナルト- 疾風伝'}, {'title': 'NARUTO -ナルト-'}]
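
The titles in this output contain ideographic spaces (U+3000), runs of ordinary spaces, and in one case a line break. If cleaner strings are wanted, something along these lines might work. It is only a sketch: `normalize_title` is a hypothetical helper, and the sample list is a shortened copy of the shape of the output, not a fresh scrape.

```python
def normalize_title(title):
    """Collapse every run of whitespace into a single ordinary space."""
    # str.split() with no arguments splits on any Unicode whitespace,
    # which covers the ideographic space U+3000 as well as \r and \n.
    return " ".join(title.split())


# Sample data shaped like the scraped result (assumed, for illustration).
raw = [
    {"title": "NARUTO THE ANIMATION CHRONICLE\u3000genga made for sale"},
    {"title": "Naruto: Shippuden\u3000(replica)  ナルト"},
]

cleaned = [{"title": normalize_title(d["title"])} for d in raw]
print(cleaned)
```

The same call could be applied inside the loop when building each dictionary, instead of as a second pass over the list.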
