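Need help with web scraping in Python

I wrote a simple script to scrape the title, address, contact person, phone number, and website link, but it only retrieves the title. I don't know how to get the other fields because their elements have no class or id.

Here is my code: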

import requests
from bs4 import BeautifulSoup
import csv

def get_page(url):
    response = requests.get(url)
    if not response.ok:
        print('server responded:', response.status_code)
    else:
        soup = BeautifulSoup(response.text, 'html.parser')
    return soup

def get_detail_data(soup):
    try:
        title = soup.find('a',class_="ListingDetails_Level1_SITELINK",id=False).text
    except:
        title = 'empty'
    print(title)
    try:
        address = soup.find('div',class_="ListingDetails_Level1_CONTACTINFO",id=False).find_all('span').text
    except:
        address = "address"
    print(address)
    try:
        person_name = soup.find('a',class_="",id=False).find_all('img').text
    except:
        person_name = "empty person"
    print(person_name)
    try:
        phone_no = soup.find('img',class_="",id=False).text
    except:
        phone_no = "empty phone no"
    print(phone_no)
    try:
        website = soup.find('a',class_="",id=False).text
    except:
        website = "empty website"
    print(website)


def main():
    url = "https://secure.kelownachamber.org/Pools-Spas/Rocky%27s-Reel-System-Inc-4751"
    #get_page(url)
    get_detail_data(get_page(url))

if __name__ == '__main__':
    main()

The following code worked for me (it is only meant to show how to get the data from that site, so I kept it simple):

import requests
from bs4 import BeautifulSoup

result = requests.get("https://secure.kelownachamber.org/Pools-Spas/Rocky%27s-Reel-System-Inc-4751")
src = result.content
soup = BeautifulSoup(src, 'html.parser')

# Each listing header box holds the title, address, contact and site link.
divs = soup.find_all("div", attrs={"class": "ListingDetails_Level1_HEADERBOXBOX"})
for tag in divs:
    try:
        title = tag.find("a", attrs={"class": "ListingDetails_Level1_SITELINK"}).text
        # The address parts carry no class, but they do have itemprop attributes.
        address = tag.find("span", attrs={"itemprop": "street-address"}).text
        postal = tag.find("span", attrs={"itemprop": "postal-code"}).text
        maincontact = tag.find("span", attrs={"class": "ListingDetails_Level1_MAINCONTACT"}).text
        siteTag = tag.find("span", attrs={"class": "ListingDetails_Level1_VISITSITE"})
        site = siteTag.find("a").attrs['href']
        print(title)
        print(address)
        print(postal)
        print(maincontact)
        print(site)
    except (AttributeError, KeyError):
        # A field was missing (find() returned None, or the link had no href);
        # skip this block.
        continue
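
One drawback of wrapping everything in a single try block is that one missing field hides all the others. A minimal sketch of a more forgiving variant follows. The class and itemprop names are taken from the code above, but the HTML fragment, its values, and the helper names text_or_none and extract_listing are made up for illustration, so treat it as a starting point rather than a tested scraper for the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: class/itemprop names come from the answer above,
# the markup and values are invented. The main-contact span is left out
# on purpose to show how a missing field is handled.
html = """
<div class="ListingDetails_Level1_HEADERBOXBOX">
  <a class="ListingDetails_Level1_SITELINK" href="#">Example Co</a>
  <span itemprop="street-address">123 Example Rd</span>
  <span itemprop="postal-code">V1V 1V1</span>
  <span class="ListingDetails_Level1_VISITSITE"><a href="http://example.com">Visit</a></span>
</div>
"""

def text_or_none(parent, name, attrs):
    """Return the stripped text of the first matching tag, or None if absent."""
    tag = parent.find(name, attrs=attrs)
    return tag.get_text(strip=True) if tag else None

def extract_listing(box):
    """Collect each field independently so one missing field does not hide the rest."""
    site_tag = box.find("span", attrs={"class": "ListingDetails_Level1_VISITSITE"})
    link = site_tag.find("a") if site_tag else None
    return {
        "title": text_or_none(box, "a", {"class": "ListingDetails_Level1_SITELINK"}),
        "address": text_or_none(box, "span", {"itemprop": "street-address"}),
        "postal": text_or_none(box, "span", {"itemprop": "postal-code"}),
        "contact": text_or_none(box, "span", {"class": "ListingDetails_Level1_MAINCONTACT"}),
        "site": link.get("href") if link else None,
    }

soup = BeautifulSoup(html, "html.parser")
boxes = soup.find_all("div", attrs={"class": "ListingDetails_Level1_HEADERBOXBOX"})
listings = [extract_listing(box) for box in boxes]
print(listings)
```

Here the missing contact comes back as None while the other fields are still returned.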

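If the page elements you are trying to scrape with Beautiful Soup have no class or id, it can be hard to tell the find() method what you are looking for.

In such cases I prefer select() or select_one(), both of which are covered in the Beautiful Soup documentation. These methods take a CSS selector, which is exactly the same syntax you use to tell a web browser which elements to style in a particular way.

References listing the available selectors are easy to find online. I can't give the exact CSS expression for your case because you haven't provided a sample of the HTML you want to scrape, but this should get you started.

For example, if the page you are trying to scrape looks like this: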

<div id="contact">
  <div>
    <a href="ListingDetails_Level1_SITELINK">Some title</a>
  </div>
  <div>
    <p>1, Sesame St., Address...... </p>
  </div>
</div>

Then, to get the address, you could use a CSS selector like this:

address = soup.select_one("#contact > div:nth-child(2) > p")

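This says that the address is found by taking the second div inside the div whose id is "contact" and then the paragraph inside it. Note that select_one() returns the matching tag (or None if nothing matches), so use address.text to get the string itself.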
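
To make this concrete, here is a small self-contained sketch that runs the selector against the sample markup from this answer (not against the real page, whose structure may differ). The second selector, for the title, is an extra added for illustration.

```python
from bs4 import BeautifulSoup

# Sample markup from this answer, not the real page.
html = """
<div id="contact">
  <div>
    <a href="ListingDetails_Level1_SITELINK">Some title</a>
  </div>
  <div>
    <p>1, Sesame St., Address...... </p>
  </div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# select_one() returns the first matching Tag, or None if nothing matches,
# so check the result before reading its text.
address_tag = soup.select_one("#contact > div:nth-child(2) > p")
address = address_tag.get_text(strip=True) if address_tag else None

# Same idea for the link in the first child div.
title_tag = soup.select_one("#contact > div:nth-child(1) > a")
title = title_tag.get_text(strip=True) if title_tag else None

print(title)
print(address)
```

In a browser's developer tools you can usually right-click an element and copy a selector for it, which gives a reasonable first draft to simplify by hand.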
