How to control the child nodes of an element returned by an lxml XPath query



I wrote the following sample code on my PC:

from bs4 import BeautifulSoup
from lxml import etree, html
import requests

URL = "https://en.wikipedia.org/wiki/Nike,_Inc."

HEADERS = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36',
    'Accept-Language': 'en-US, en;q=0.5',
}

webpage = requests.get(URL, headers=HEADERS)
soup = BeautifulSoup(webpage.content, "html.parser")
dom = etree.HTML(str(soup))
print(dom.xpath('//*[@id="firstHeading"]')[0].text)

But what gets printed is empty. I checked the element's .text and it is None. On the other hand, when I inspect the XPath result with html.tostring(), the content is there...
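(As far as I understand, .text on an lxml element only holds the text that appears before the element's first child node, so it comes back None whenever the text actually sits inside a child, e.g. a span inside the h1. A minimal sketch of collecting the descendant text instead, reusing dom from the code above:)

heading = dom.xpath('//*[@id="firstHeading"]')[0]
print(''.join(heading.itertext()))  # joins the text of all descendant nodes
print(heading.xpath('string(.)'))   # XPath string() gathers the same text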

In fact, that is just a sample; what I actually want to control is the following code:

import pytest
from bs4 import BeautifulSoup
from lxml import etree, html
import requests

def test_scraping():
    URL = "https://news.yahoo.co.jp/search?p=岸田文雄&ei=utf-8&categories=business"

    HEADERS = {
        'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 '
                      '(KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36',
        'Accept-Language': 'en-US, en;q=0.5',
    }

    webpage = requests.get(URL, headers=HEADERS)
    soup = BeautifulSoup(webpage.content, "html.parser")
    dom = etree.HTML(str(soup))
    elements = dom.xpath("//li[@class='viewableWrap newsFeed_item newsFeed_item-normal newsFeed_item-ranking']")
    for element in elements:
        print(element.tag)
        if element.text is not None:
            ....  <-- not working....

I can use the find function to get one tag's content or another tag, but I would like another way to control the child nodes, as sketched below.
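(A minimal sketch of walking a matched element's subtree directly with lxml's iterchildren() and iter(); what each newsFeed_item li actually contains is an assumption here:)

for element in elements:
    for child in element.iterchildren():  # direct children only
        print(child.tag, child.attrib)
    for node in element.iter():           # every node in the subtree
        if node.text and node.text.strip():
            print(node.tag, node.text.strip())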

So if you know a way, please let me know.

Update: Great, I found that find's path argument supports an XPath-like syntax (ElementPath). For example, I can get the thumbnail and the title from an element with the following code:

element.find('.//img').get('src')
element.find('.//div[@class="newsFeed_item_title"]').text

Note: find returns only the first matching element: https://lxml.de/api/lxml.etree._Element-class.html#find
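(If every match is needed rather than just the first, findall() accepts the same path syntax and returns a list; iterfind() is the lazy variant. A short sketch, assuming an item may contain several images:)

for img in element.findall('.//img'):
    print(img.get('src'))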
