Python: getting the text of an element and all its child elements with lxml XPath



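I'm using Python's lxml with XPath. If I give the full path to an HTML tag, I can extract its text. However, I can't extract all the text from a tag and from its child elements into a single list. For example, given this HTML, I want all the text under the "example" class: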

I'd like to get:
["Some text", "Some text 2", "Some text 3", "Some text 4", "Some text 5", "Some text 6"]

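mzjn's answer is correct. After some trial and error I finally got it working, and this is what the final code looks like. You need to put //text() at the end of the XPath. It hasn't been refactored yet, so there are bound to be some mistakes and bad practices, but it works.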

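import urllib.request
import requests
from bs4 import BeautifulSoup
from lxml import html
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

# Session that retries failed connections up to 3 times with backoff
session = requests.Session()
retry = Retry(connect=3, backoff_factor=0.5)
adapter = HTTPAdapter(max_retries=retry)
session.mount('http://', adapter)
session.mount('https://', adapter)
page = session.get("The url you are webscraping")
content = page.content
# Second fetch parsed with BeautifulSoup; soup is not used by the XPath query below
htmlsite = urllib.request.urlopen("The url you are webscraping")
soup = BeautifulSoup(htmlsite, 'lxml')
htmlsite.close()
tree = html.fromstring(content)
# The trailing //text() selects every text node under the matched div, at any depth
scraped = tree.xpath('//html[contains(@class, "no-js")]/body/div[contains(@class, "container")]/div[contains(@class, "content")]/div[contains(@class, "row")]/div[contains(@class, "col-md-6")]/div[contains(@class, "clearfix")]//text()')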
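
To make the role of the trailing //text() concrete, here is a minimal sketch comparing it with /text(). The markup and the "example" class below are made up for illustration; they are not the HTML from the question or from the scraped site.

```python
from lxml import html

# Hypothetical markup, only for illustration
SAMPLE = """
<html><body>
<div class="example">
  Some text
  <p>Some text 2 <b>Some text 3</b></p>
</div>
</body></html>
"""

tree = html.fromstring(SAMPLE)

# /text() selects only the text nodes that are direct children of the div
direct = tree.xpath('//div[@class="example"]/text()')

# //text() selects the text nodes at any depth below the div, in document order
nested = tree.xpath('//div[@class="example"]//text()')

# Both results include whitespace-only nodes, so strip and drop the empty ones
direct_clean = [t.strip() for t in direct if t.strip()]
nested_clean = [t.strip() for t in nested if t.strip()]

print(direct_clean)
print(nested_clean)
```

With /text() the text inside the p and b elements is missing, which matches the problem described in the question; //text() picks it up.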

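I tried it on the team page of keeleyteton.com. It returned the following list, which is correct (although it needs a lot of cleanup!), because the pieces of text sit in different tags and some of them are child tags. Thanks for the help!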

['\r\n        ', '\r\n        ', 'Nicholas F. Galluccio', '\r\n        ', '\r\n        ', 'Managing Director and Portfolio Manager', '\r\n        ', 'Teton Small Cap Select Value', '\r\n        ', 'Keeley Teton Small Mid Cap Value', '\r\n      ', '\r\n        ', '\r\n        ', 'Scott R. Butler', '\r\n        ', '\r\n        ', 'Senior Vice President and Portfolio Manager ', '\r\n        ', 'Teton Small Cap Select Value', '\r\n        ', 'Keeley Teton Small Mid Cap Value', '\r\n      ', '\r\n        ', '\r\n        ', 'Thomas E. Browne, Jr., CFA', '\r\n        ', '\r\n        ', 'Portfolio Manager', '\r\n        ', 'Keeley Teton Small and Mid Cap Dividend Value', '\r\n        ', 'Keeley Teton Small and Small Mid Cap Value', '\r\n      ', '\r\n        ', '\r\n        ', 'Brian P. Leonard, CFA', '\r\n        ', '\r\n', 'Portfolio Manager', '\r\n        ', 'Keeley Teton Small and Mid Cap Dividend Value', '\r\n        ', 'Keeley Teton Small and Small Mid Cap Value', '\r\n      ', '\r\n        ', '\r\n        ', 'Robert M. Goldsborough', '\r\n        ', '\r\n        ', 'Research Analyst', '\r\n        ', 'Keeley Teton Small and Mid Cap Dividend Value', '\r\n      ', '\r\n        ', '\r\n        ', 'Brian R. Keeley, CFA', '\r\n        ', '\r\n        ', 'Portfolio Manager', '\r\n        ', 'Keeley Teton Small and Small Mid Cap Value', '\r\n      ', '\r\n        ', '\r\n        ', 'Edward S. Borland', '\r\n        ', '\r\n', 'Research Analyst', '\r\n        ', 'Keeley Teton Small and Small Mid Cap Value', '\r\n      ', '\r\n        ', '\r\n        ', 'Kevin M. Keeley', '\r\n        ', '\r\n        ', 'President', '\r\n', '\r\n        ', '\r\n        ', 'Deanna B. Marotz', '\r\n        ', '\r\n        ', 'Chief Compliance Officer', '\r\n      ']
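
For the cleanup, one possible sketch is to strip every string and drop the ones that were only whitespace. It uses the first few entries of the list above and assumes the filler entries are '\r\n' followed by spaces; clean_text_nodes is a hypothetical helper name, not part of lxml.

```python
# First entries of the scraped list, assuming the filler strings are '\r\n' plus spaces
raw = [
    '\r\n        ', '\r\n        ', 'Nicholas F. Galluccio',
    '\r\n        ', '\r\n        ', 'Managing Director and Portfolio Manager',
    '\r\n        ', 'Teton Small Cap Select Value',
    '\r\n        ', 'Keeley Teton Small Mid Cap Value', '\r\n      ',
]


def clean_text_nodes(nodes):
    """Strip each string and drop the ones that contained only whitespace."""
    return [s.strip() for s in nodes if s.strip()]


cleaned = clean_text_nodes(raw)
print(cleaned)
```

The same filtering can probably be pushed into the query itself by ending the XPath with //text()[normalize-space()], which keeps only text nodes that contain something other than whitespace; the surviving strings would still need strip() to remove leading and trailing spaces.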
