Is there a way to extract dataLayer information from a web page with Python?



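I am building a dataset from the information in the dataLayer variable (object) of web pages. I want to use machine learning to automate the page classification process.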

Yes.

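If the variable is assigned statically, e.g. in a <script> block, you can parse the HTML with Beautiful Soup, find the script block, and extract the value from it.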
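A minimal sketch of the static case follows. To stay dependency-free it collects the script blocks with the standard library's `html.parser` instead of Beautiful Soup, and it assumes the page assigns `dataLayer` a literal that happens to be valid JSON; real pages often use JavaScript object syntax that `json` cannot read. The names `ScriptCollector`, `static_datalayer`, and `SAMPLE_HTML` are made up for this illustration.

```python
import json
import re
from html.parser import HTMLParser


class ScriptCollector(HTMLParser):
    """Collect the text content of every inline <script> block."""

    def __init__(self):
        super().__init__()
        self.scripts = []
        self._in_script = False

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self._in_script = True
            self.scripts.append("")

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = False

    def handle_data(self, data):
        # Script text may arrive in several chunks, so append to the last entry.
        if self._in_script:
            self.scripts[-1] += data


def static_datalayer(html):
    """Return the first JSON-parsable value assigned to dataLayer, or None."""
    parser = ScriptCollector()
    parser.feed(html)
    parser.close()
    assignment = re.compile(r"dataLayer\s*=\s*")
    decoder = json.JSONDecoder()
    for script in parser.scripts:
        for match in assignment.finditer(script):
            try:
                # Parse one JSON value starting right after the "=" sign.
                value, _ = decoder.raw_decode(script, match.end())
            except json.JSONDecodeError:
                continue  # e.g. "dataLayer = window.dataLayer || []"
            return value
    return None


# Hypothetical page content, used only to exercise the function.
SAMPLE_HTML = """
<html><head>
<script>var dataLayer = [{"Page": "home", "PageType": "HomePage"}];</script>
</head><body></body></html>
"""

print(static_datalayer(SAMPLE_HTML))
```

With Beautiful Soup, the script texts could be gathered with `soup.find_all("script")` and each tag's `.string` instead of the `ScriptCollector` class; the parsing step would stay the same.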

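More likely, the data is generated dynamically after the page loads (or across separate script blocks), so you will need something like Playwright to automate a headless browser and then read the variable from there.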

Playwright example
from playwright.sync_api import sync_playwright, BrowserContext


def get_datalayer(ctx: BrowserContext, url: str):
    page = ctx.new_page()
    page.goto(url)
    # Wait until network activity settles so late dataLayer pushes are included.
    page.wait_for_load_state("networkidle")
    # Evaluate in the page and return the serialized value of window.dataLayer.
    return page.evaluate("window.dataLayer")


with sync_playwright() as p:
    browser = p.chromium.launch()  # headless by default
    with browser.new_context() as bcon:
        data_layer = get_datalayer(bcon, "https://www.berceaumagique.com/")
    print(data_layer)

This prints out:

[
    {
        "UtmSource": "",
        "EmailHash": "...",
        "NewCustomer": "0",
        "AcceptFunctionalCookie": "",
        "AcceptTargetingCookie": "",
        "IdUser": "",
        "Page": "home",
        "RealPage": "home",
        "urlElitrack": "...",
    },
    {"google_tag_params": {"ecomm_pagetype": "home"}},
    {"PageType": "HomePage"},
    {"EffinityPage": "home", "Session": "0", "NewCustomer": "0"},
    {"gtm.start": 1658135295957, "event": "gtm.js", "gtm.uniqueEventId": 1},
    {
        "event": "axeptio_update",
        "axeptio_authorized_vendors": [],
        "gtm.uniqueEventId": 19,
    },
    {"event": "gtm.dom", "gtm.uniqueEventId": 22},
    {"event": "gtm.js", "gtm.uniqueEventId": 23},
    {
        "event": "promotionsView",
        "ecommerce": {
            "promoView": {
                "promotions": [
                    {
                        "id": "slider-1",
                        "name": "rentree-scolaire",
                        "creative": "home slider",
                        "position": "1",
                    }
                ]
            }
        },
        "gtm.uniqueEventId": 24,
    },
    ...
]
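Since the goal is a dataset for page classification, the list above probably needs to be turned into one row per page. The sketch below is one possible way to do that, not part of the Playwright API: the hypothetical helper `flatten_datalayer` merges all entries into a single flat dict, joining nested keys with `/` (chosen as an assumption because keys such as `gtm.start` already contain dots) and letting later entries overwrite earlier ones.

```python
def flatten_datalayer(entries, sep="/"):
    """Merge dataLayer entries into one flat dict; later entries win on clashes."""
    flat = {}

    def walk(prefix, value):
        if isinstance(value, dict):
            for key, child in value.items():
                walk(f"{prefix}{sep}{key}" if prefix else key, child)
        elif isinstance(value, list):
            # List items are keyed by their index; empty lists add nothing.
            for index, child in enumerate(value):
                walk(f"{prefix}{sep}{index}" if prefix else str(index), child)
        else:
            flat[prefix] = value

    for entry in entries:
        if isinstance(entry, dict):  # skip anything that is not a plain object
            walk("", entry)
    return flat


# A few entries copied from the output above.
sample = [
    {"google_tag_params": {"ecomm_pagetype": "home"}},
    {"PageType": "HomePage"},
    {"event": "gtm.dom", "gtm.uniqueEventId": 22},
    {"event": "gtm.js", "gtm.uniqueEventId": 23},
]
row = flatten_datalayer(sample)
print(row["PageType"], row["google_tag_params/ecomm_pagetype"])
```

Fields such as `PageType` or `ecomm_pagetype` could then serve as labels or features, depending on what each site exposes.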
