<li>
<strong>Company Name</strong>
":"
<span itemprop="name">PT ERA MURNI BUSANA</span>
</li>
在上面的HTML代码中,我试图提取公司名称,即PT ERA MURNI BUSANA。如果我使用单个测试链接,我可以使用我写的单行代码获得名称:
soup.find_all("span",attrs={"itemprop":"name"})[3].get_text()
但我想从单个网页中的所有此类页面中提取信息。所以我写for循环,但它是获取细节。我正在粘贴我一直在尝试的代码中需要修改的部分。代码:-
for link in supplierlinks: #links have been extracted and merged with the base url
r=requests.get(link,headers=headers)
soup=BeautifulSoup(r.content,'lxml')
companyname=soup.find_all("span",attrs={"itemprop":"name"})[2].get_text()
输出看起来像:
{"公司名称":"AIRINDO SAKTI GARMENT PT"}
{"公司名称":"服装"}
{"公司名称":"服装"}
我需要的不是产品中突然出现的服装,而是公司名称。如何修改for循环中的代码?
链接:https://idn.bizdirlib.com/node/5290
试试这个代码:
import requests
from bs4 import BeautifulSoup
headers = {'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:32.0) Gecko/20100101 Firefox/32.0'}
r = requests.get('https://idn.bizdirlib.com/node/5290',headers=headers).text
soup = BeautifulSoup(r,'html5lib')
print(soup.find_all("span",attrs={"itemprop":"name"})[-1].get_text())
div = soup.find('div',class_ = "content clearfix")
li_tags = div.div.find_all('fieldset')[1].find_all('div')[-1].ul.find_all('li')
supplierlinks = []
for li in li_tags:
try:
supplierlinks.append("https://idn.bizdirlib.com/"+li.a['href'])
except:
pass
for link in supplierlinks:
r = requests.get(link,headers=headers).text
soup = BeautifulSoup(r,'html5lib')
print(soup.find_all("span", attrs={"itemprop": "name"})[-1].get_text())
输出:
PT ERA MURNI BUSANA
PT ELKA SURYA ABADI
PT EMPANG BESAR MAKMUR
PT EMS
PT ENERON
PT ENPE JAYA
PT ERIDANI TOUR AND TRAVEL
PT EURO ASIA TRADE & INDUSTRY
PT EUROKARS CHRISDECO UTAMA
PT EVERAGE VALVES METAL
PT EVICO
此代码打印页面上所有链接的公司名称
您可以选择包含文本"Company Name"
的元素<strong>
的同级元素(另外,不要忘记设置用户代理http头(:
import requests
from bs4 import BeautifulSoup
url = 'https://idn.bizdirlib.com/node/5290'
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:81.0) Gecko/20100101 Firefox/81.0'}
soup = BeautifulSoup(requests.get(url, headers=headers).content, 'html.parser')
print( soup.select_one('strong:contains("Company Name") + *').text )
打印:
PT ERA MURNI BUSANA
编辑:获取联系人:
import requests
from bs4 import BeautifulSoup
url = 'https://idn.bizdirlib.com/node/5290'
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:81.0) Gecko/20100101 Firefox/81.0'}
soup = BeautifulSoup(requests.get(url, headers=headers).content, 'html.parser')
print( soup.select_one('strong:contains("Company Name") + *').text )
print( soup.select_one('strong:contains("Contact") + *').text )
打印:
PT ERA MURNI BUSANA
Mr. Yohan Kustanto