Extracting data with Beautiful Soup in Python



I am very new to Python. I'm a long-time Stack Overflow user, but this is my first time posting a question. I am trying to extract data from a website using Beautiful Soup. The sample data I want to extract is the "Listed in" and "Tagged in" entries on the page.

I am able to extract the paragraphs into a list, but I cannot pull out the exact data. The goal here is to extract Listed in: Nail Polish Subscription Boxes, Subscription Boxes for Beauty Products, Subscription Boxes for Women and Tagged in: Makeup, Beauty, Nail polish.

Can you tell me how to achieve this?

import requests
from bs4 import BeautifulSoup

l1 = []
url = 'http://boxes.mysubscriptionaddiction.com/box/julep-maven'
source_code = requests.get(url)
plain_text = source_code.text
soup = BeautifulSoup(plain_text, "lxml")
for item in soup.find_all('p'):
    l1.append(item.contents)
search = '\nListed in:\n'
for a in l1:
    if a[0] in ('\nTagged in:\n', '\nListed in:\n'):
        print(a)

Since you are already using lxml, why not use it in a more direct way (lxml is considered faster than BeautifulSoup):

import requests
from lxml import html

url = 'http://boxes.mysubscriptionaddiction.com/box/julep-maven'
source_code = requests.get(url)
tree = html.fromstring(source_code.content)  # parses the html
paras = tree.xpath('//div[@class="box-information"]/p')  # gets the para elements
# This loop prints the desired para elements' text.
for ele in paras[1:]:
    print(ele.text_content())

Output:

Listed in:
Nail Polish Subscription Boxes, Subscription Boxes for Beauty Products, Subscription Boxes for Women

Tagged in:
Makeup, Beauty, Nail polish

Note: the site is protected by a captcha, so you may need to copy the source HTML from your browser's dev tools into a string and pass that to tree = html.fromstring(copied_string) for this code to work.
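A minimal sketch of that workaround, assuming the page source has been copied from the browser's dev tools into a Python string (copied_string below is only a placeholder):

from lxml import html

# Placeholder: paste the page source copied from the browser's dev tools here.
copied_string = """<html> ... copied page source ... </html>"""

tree = html.fromstring(copied_string)  # parse the copied markup instead of a live response
paras = tree.xpath('//div[@class="box-information"]/p')
for ele in paras[1:]:
    print(ele.text_content())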

import re

soup = BeautifulSoup(plain_text, 'html.parser')
context = soup(text=re.compile(r'Listed in:'))
for item in context:
    listed_in = item.parent
    tagged_in = listed_in.find_next_siblings()[0]
    print(listed_in.text.strip('\n').replace('\n', ''))
    print(tagged_in.text.strip('\n').replace('\n', ''))

This will display everything on one line:

Listed in:Nail Polish Subscription Boxes, Subscription Boxes for Beauty Products, Subscription Boxes for Women, Tagged in: Makeup, Beauty, Nail polish
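If only the category values themselves are needed, without the labels, a small post-processing step can strip the label text. A rough, self-contained sketch along the same lines, assuming each label sits in its own paragraph exactly as in the output above and is actually present on the page:

import re
import requests
from bs4 import BeautifulSoup

url = 'http://boxes.mysubscriptionaddiction.com/box/julep-maven'
soup = BeautifulSoup(requests.get(url).text, 'html.parser')

for label in ('Listed in:', 'Tagged in:'):
    node = soup(text=re.compile(label))[0]              # first text node containing the label
    para_text = node.parent.get_text(' ', strip=True)   # whole paragraph, whitespace collapsed
    print(para_text.replace(label, '').strip())         # keep only the comma-separated values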

Hope it helps.
