Extracting company names and other information from all URLs on a web page using BeautifulSoup


<li>
<strong>Company Name</strong> 
":" 
<span itemprop="name">PT ERA MURNI BUSANA</span>
</li>

In the HTML above, I am trying to extract the company name, i.e. PT ERA MURNI BUSANA. With a single test link, I can get the name with this one-liner:

soup.find_all("span",attrs={"itemprop":"name"})[3].get_text()

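But I want to extract this information from all such pages linked from a single web page. So I wrote a for loop, but it fetches the wrong details. Below is the part of the code I have been trying that needs to change. Code: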

for link in supplierlinks:     # links have been extracted and merged with the base URL
    r = requests.get(link, headers=headers)
    soup = BeautifulSoup(r.content, 'lxml')
    companyname = soup.find_all("span", attrs={"itemprop": "name"})[2].get_text()

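The output looks like:

{"Company Name":"AIRINDO SAKTI GARMENT PT"}

{"Company Name":"Garments"}

{"Company Name":"Garments"}

What I need is the company name, not the "Garments" value that unexpectedly shows up from the products. How should I modify the code inside the for loop?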

Link: https://idn.bizdirlib.com/node/5290

Try this code:

import requests
from bs4 import BeautifulSoup

headers = {'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:32.0) Gecko/20100101 Firefox/32.0'}

# Fetch the starting page and print its company name (the last itemprop="name" span)
r = requests.get('https://idn.bizdirlib.com/node/5290', headers=headers).text
soup = BeautifulSoup(r, 'html5lib')
print(soup.find_all("span", attrs={"itemprop": "name"})[-1].get_text())

# Collect the supplier links from the last <div> of the second <fieldset>
div = soup.find('div', class_="content clearfix")
li_tags = div.div.find_all('fieldset')[1].find_all('div')[-1].ul.find_all('li')
supplierlinks = []
for li in li_tags:
    try:
        supplierlinks.append("https://idn.bizdirlib.com/" + li.a['href'])
    except (TypeError, KeyError):  # <li> without a link, or a link without href
        pass

# Visit each supplier page and print its company name
for link in supplierlinks:
    r = requests.get(link, headers=headers).text
    soup = BeautifulSoup(r, 'html5lib')
    print(soup.find_all("span", attrs={"itemprop": "name"})[-1].get_text())

Output:

PT ERA MURNI BUSANA
PT ELKA SURYA ABADI
PT EMPANG BESAR MAKMUR
PT EMS
PT ENERON
PT ENPE JAYA
PT ERIDANI TOUR AND TRAVEL
PT EURO ASIA TRADE & INDUSTRY
PT EUROKARS CHRISDECO UTAMA
PT EVERAGE VALVES METAL
PT EVICO

This code prints the company name for every link on the page.
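
A fixed index such as `[2]` or `[-1]` picks up a different span whenever a page has more or fewer `itemprop="name"` elements, which is likely why the product value appeared in the question. Below is a minimal offline sketch of anchoring on the "Company Name" label instead. `PAGE_A`, `PAGE_B` and `company_name` are made-up names for illustration, and the markup of the real pages may differ.

```python
from bs4 import BeautifulSoup

# Two hypothetical supplier pages. The second one has extra itemprop="name"
# spans, so every positional index shifts.
PAGE_A = """
<ul>
  <li><strong>Company Name</strong> : <span itemprop="name">PT ERA MURNI BUSANA</span></li>
</ul>
"""
PAGE_B = """
<ul>
  <li><strong>Products</strong> : <span itemprop="name">Garments</span></li>
  <li><strong>Company Name</strong> : <span itemprop="name">AIRINDO SAKTI GARMENT PT</span></li>
  <li><strong>Category</strong> : <span itemprop="name">Garments</span></li>
</ul>
"""


def company_name(html):
    """Return the text of the <span> following the "Company Name" label, or None."""
    soup = BeautifulSoup(html, "html.parser")
    label = soup.find("strong", string="Company Name")
    if label is None:
        return None
    value = label.find_next_sibling("span")
    return value.get_text(strip=True) if value else None


for page in (PAGE_A, PAGE_B):
    spans = BeautifulSoup(page, "html.parser").find_all("span", attrs={"itemprop": "name"})
    # Compare the positional lookup with the label-based lookup
    print("by position [0]:", spans[0].get_text(), "| by label:", company_name(page))
```

The label-based lookup returns `None` when the label is missing, so a loop over many pages can skip or log such pages instead of raising `IndexError`.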

You can select the sibling of the <strong> element that contains the text "Company Name" (also, don't forget to set the User-Agent HTTP header):

import requests 
from bs4 import BeautifulSoup

url = 'https://idn.bizdirlib.com/node/5290'
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:81.0) Gecko/20100101 Firefox/81.0'}
soup = BeautifulSoup(requests.get(url, headers=headers).content, 'html.parser')
print( soup.select_one('strong:contains("Company Name") + *').text )

Prints:

PT ERA MURNI BUSANA

Edit: to get the contact person:

import requests 
from bs4 import BeautifulSoup

url = 'https://idn.bizdirlib.com/node/5290'
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:81.0) Gecko/20100101 Firefox/81.0'}
soup = BeautifulSoup(requests.get(url, headers=headers).content, 'html.parser')
print( soup.select_one('strong:contains("Company Name") + *').text )
print( soup.select_one('strong:contains("Contact") + *').text )

Prints:

PT ERA MURNI BUSANA
Mr.  Yohan  Kustanto
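
To apply the same selector to several fields and several pages, the lookup can be wrapped in a small helper. This is a sketch under assumptions: `extract_fields` and `SAMPLE` are hypothetical, the markup around "Contact" is guessed (the thread only shows the "Company Name" item), and the selector uses the `:-soup-contains()` spelling, which newer Soup Sieve releases prefer over `:contains()` (the older spelling triggers a deprecation warning there).

```python
from bs4 import BeautifulSoup


def extract_fields(html, labels):
    """Map each label to the text of the element right after its <strong>, or None."""
    soup = BeautifulSoup(html, "html.parser")
    result = {}
    for label in labels:
        # "+ *" selects the element immediately following the matching <strong>
        node = soup.select_one(f'strong:-soup-contains("{label}") + *')
        # Collapse runs of whitespace such as the double spaces in the contact name
        result[label] = " ".join(node.get_text().split()) if node else None
    return result


# Hypothetical markup modelled on the <li> shown in the question
SAMPLE = """
<ul>
  <li><strong>Company Name</strong> : <span itemprop="name">PT ERA MURNI BUSANA</span></li>
  <li><strong>Contact</strong> : <span>Mr.  Yohan  Kustanto</span></li>
</ul>
"""

print(extract_fields(SAMPLE, ["Company Name", "Contact", "Fax"]))
```

Inside the loop from the question, this could be called as `extract_fields(requests.get(link, headers=headers).text, ["Company Name", "Contact"])` for each `link` in `supplierlinks`. Note that `:-soup-contains()` is a substring match, so a short label like "Contact" would also match a longer label that contains it; `select_one` returns the first match in document order.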
