Using Beautiful Soup - extracting the strings inside a <div> tag?



I'm fairly new to bs4.

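The project consists of two parts:

  • The looping part (seems to be pretty straightforward).
  • The parser part: this is where I have some issues - see below.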

I'm trying to loop through a list of URLs and scrape the data described below from a set of WordPress plugin pages. Here is my loop so far:

from bs4 import BeautifulSoup
import requests
#array of URLs to loop through, will be larger once I get the loop working correctly
plugins = ['https://wordpress.org/plugins/wp-job-manager', 'https://wordpress.org/plugins/ninja-forms']
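
One way the loop could be completed is sketched below. This is only a sketch under assumptions: `parse_meta`, `scrape_plugins` and `fetch_with_requests` are hypothetical helper names, and the code assumes every plugin page carries a div with the class plugin-meta that contains a <ul> of <li> entries. The fetch step is passed in as a callable so the loop can be tried without hitting the network.

```python
from bs4 import BeautifulSoup


def parse_meta(html):
    """Return the text of each <li> in the div.plugin-meta widget (assumed layout)."""
    page_soup = BeautifulSoup(html, "html.parser")
    widget = page_soup.find("div", {"class": "plugin-meta"})
    if widget is None or widget.ul is None:
        return []  # the page does not match the layout this sketch assumes
    return [li.get_text(" ", strip=True) for li in widget.ul.find_all("li")]


def scrape_plugins(urls, fetch):
    """Loop over the URLs; `fetch` is any callable mapping a URL to HTML text."""
    results = {}
    for url in urls:
        results[url] = parse_meta(fetch(url))
    return results


def fetch_with_requests(url):
    """Hypothetical fetcher built on requests: one HTTP GET per plugin page."""
    import requests  # imported here so the parsing code works without requests

    response = requests.get(url, timeout=10)
    response.raise_for_status()
    return response.text
```

With the list above, the call would be `scrape_plugins(plugins, fetch_with_requests)`, which returns a dict keyed by URL.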

The project: collect the metadata for a list of WordPress plugins - there are about 50 plugins!

https://wordpress.org/plugins/wp-job-manager
https://wordpress.org/plugins/ninja-forms
https://wordpress.org/plugins/participants-database ....and so on and so forth.

The parser part: this is my approach with Beautiful Soup - how do I extract the strings inside the tag?

import bs4
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
my_url = "https://wordpress.org/plugins/participants-database/"
uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()
page_soup = soup(page_html, "html.parser")
ttt = page_soup.find("div", {"class": "#post-15991 > div.entry-meta > div.widget.plugin-meta"})
item = ttt.a.text
print(item)

Background: I want to fetch the following data from this page:

https://wordpress.org/plugins/participants-database/

I need the data from the following three lines (taken from the example above):

Version: <strong>1.29.3</strong>
Active installations: <strong>100,000+</strong>
Tested up to: <strong>4.9.4</strong>

These are the XPaths I found for them:

//*[@id="post-15991"]/div[4]/div[1]
//*[@id="post-15991"]/div[4]/div[1]/ul/li[1]
//*[@id="post-15991"]/div[4]/div[1]/ul/li[2]
//*[@id="post-15991"]/div[4]/div[1]/ul/li[3]
//*[@id="post-15991"]/div[4]/div[1]/ul/li[4]
//*[@id="post-15991"]/div[4]/div[1]/ul/li[5]
//*[@id="post-15991"]/div[4]/div[1]/ul/li[6]
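
Beautiful Soup itself does not evaluate XPath, but `select()` and `select_one()` accept CSS selectors, so paths like the ones above can be approximated that way. The sketch below runs against a simplified stand-in for the page: the HTML string is an assumption (the real widget has six <li> entries and more markup), and only the three values quoted in the question are used.

```python
from bs4 import BeautifulSoup

# Simplified stand-in for the plugin page (assumed structure, not the real markup)
html = """
<article id="post-15991">
  <div class="entry-meta">
    <div class="widget plugin-meta">
      <ul>
        <li>Version: <strong>1.29.3</strong></li>
        <li>Active installations: <strong>100,000+</strong></li>
        <li>Tested up to: <strong>4.9.4</strong></li>
      </ul>
    </div>
  </div>
</article>
"""

page_soup = BeautifulSoup(html, "html.parser")

# A CSS selector goes to select_one()/select(), not into the "class" filter of find()
widget = page_soup.select_one("#post-15991 > div.entry-meta > div.widget.plugin-meta")
items = [li.get_text(" ", strip=True) for li in widget.select("ul > li")]

# Rough counterpart of .../ul/li[1]: the first <li> inside the widget
first_li = page_soup.select_one("#post-15991 div.plugin-meta li:nth-of-type(1)")

print(items)
print(first_li.strong.get_text())
```

Because post IDs such as post-15991 differ from plugin to plugin, a selector based only on the plugin-meta class is likely to carry over better when looping across many pages.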

You can get the required values simply like this:

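# Match on the single class name; a CSS selector string does not work as a class filter
ttt = page_soup.find("div", {"class":"plugin-meta"})
# [:-1:2] drops the last <li> and then keeps every second one of the rest
text_nodes = [node.text.strip() for node in ttt.ul.findChildren('li')[:-1:2]]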

Output of text_nodes:

['Version: 1.7.7.7', 'Active installations: 10,000+', 'Tested up to: 4.9.4']
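
If the labels and values are needed separately, the strings can be split at the first colon. This is only a sketch: `to_meta_dict` is a hypothetical helper, and it assumes every entry has the form "label: value" as in the output above.

```python
def to_meta_dict(text_nodes):
    """Split each 'label: value' string at the first colon into a dict entry."""
    meta = {}
    for entry in text_nodes:
        label, sep, value = entry.partition(":")
        if sep:  # skip entries that contain no colon at all
            meta[label.strip()] = value.strip()
    return meta


text_nodes = ['Version: 1.7.7.7', 'Active installations: 10,000+', 'Tested up to: 4.9.4']
meta = to_meta_dict(text_nodes)
print(meta["Version"])
```

Collecting one such dict per plugin URL gives a structure that is easy to write out as CSV or load into a table later.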
