Web scraping beginner: how to extract a title from a div, and how to put the scraped data into a DataFrame



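I recently started my data science journey. I am using a Jupyter notebook on Google Colab for this task.

First question

I am trying to scrape data from a real estate website. For each listing I want the property title, price, location, beds, and baths.

Inspecting the source of https://www.zameen.com/Homes/Lahore-1-1.html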

<span class="_4720d1a0 "><span class="_0c8a5353 c1b40987"></span><span aria-label="Beds" class="b6a29bc0">5</span></span>,
<span class="_0c8a5353 c1b40987"></span>,
<span aria-label="Beds" class="b6a29bc0">5</span>,
<span class="_4720d1a0 "><span class="_0c8a5353 fa6c05cc"></span><span aria-label="Baths" class="b6a29bc0">6</span></span>,
<span class="_0c8a5353 fa6c05cc"></span>,
<span aria-label="Baths" class="b6a29bc0">6</span>,
<span class="_4720d1a0 "><span class="_0c8a5353 d2db01cb"></span><span aria-label="Area" class="b6a29bc0"><div class="_7ac32433" title="1 Kanal Luxury Bungalow For Sale In Lahore Dha"><div class="_1e0ca152 _026d7bff"><div><span>1 Kanal</span></div></div></div></span></span>,
<span class="_0c8a5353 d2db01cb"></span>

I was able to get the price, location, beds, and baths as lists.

Finding the area of each property

property = soup.find_all("span", attrs={"aria-label":"Area"})

Finding the price of each property

property = soup.find_all("span", attrs={"class":"f343d9ce"})

But I cannot figure out how to extract the property title, which sits on a div nested inside the span.

<span aria-label="Area" class="b6a29bc0"><div class="_7ac32433" title="1 Kanal Luxury Bungalow For Sale In Lahore Dha"><div class="_1e0ca152 _026d7bff"><div><span>1 Kanal</span></div></div></div></span>

Finding the title of each property

property = soup.find_all("div", class_="_7ac32433")
for i in property:
    print(i.get_text())

It only shows

PKR5.5 Crore
1 Kanal
PKR6.5 Crore
1 Kanal
PKR69.9 Lakh
5 Marla
PKR4.45 Crore
1 Kanal
PKR6.29 Crore
1 Kanal
PKR2.25 Crore
10 Marla
PKR55 Lakh
5 Marla
PKR5.28 Crore
1 Kanal
PKR1.4 Crore
5.5 Marla
PKR1.05 Crore
4 Marla
PKR5.15 Crore
1.1 Kanal
PKR6.35 Crore
1 Kanal
PKR1.15 Crore
5 Marla
PKR68 Lakh
3 Marla
PKR3.6 Crore
1 Kanal
PKR2.25 Crore
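
The output above is what get_text() returns: the visible text inside each div. In the snippet from the question, the property title is stored in the div's title attribute rather than in its text, so it has to be read as an attribute. A minimal sketch, assuming the live page matches the snippet pasted above (the class names on the site may change, and sample_html is just that snippet copied into a string):

```python
from bs4 import BeautifulSoup

# Snippet copied from the question; the live page may differ.
sample_html = '''
<span aria-label="Area" class="b6a29bc0"><div class="_7ac32433" title="1 Kanal Luxury Bungalow For Sale In Lahore Dha"><div class="_1e0ca152 _026d7bff"><div><span>1 Kanal</span></div></div></div></span>
'''

soup = BeautifulSoup(sample_html, "html.parser")

# title=True keeps only the divs that actually carry a title attribute,
# so other elements that share the same class are skipped.
title_divs = soup.find_all("div", class_="_7ac32433", title=True)

# Read the attribute value instead of the element text.
titles = [div["title"] for div in title_divs]
visible_text = [div.get_text() for div in title_divs]

print(titles)
print(visible_text)
```

Here titles holds the attribute value, while visible_text holds what get_text() would have printed for the same div.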

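Second question

Once I can extract the data I need from the URL, how do I create a DataFrame and load this data into it for a data science project? I am really new to this, so I cannot even structure the code.

Here is an example that extracts the area, price, and title and adds them to a DataFrame: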

import pandas as pd
import requests
from bs4 import BeautifulSoup

url = 'https://www.zameen.com/Homes/Lahore-1-1.html'
page = requests.get(url)
html = BeautifulSoup(page.text, 'html.parser')

# area text of each listing
area_elements = html.find_all("span", attrs={"aria-label": "Area"})
areas = [el.text for el in area_elements]

# price text of each listing
price_elements = html.find_all("span", attrs={"class": "f343d9ce"})
prices = [el.text for el in price_elements]

# the title is read from the title attribute, not from the element text
title_elements = html.find_all("a", attrs={"class": "_7ac32433"})
titles = [el.get('title') for el in title_elements]

# create dataframe (the three lists must have the same length)
df = pd.DataFrame({
    'area': areas,
    'title': titles,
    'price': prices
})
df.head()
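
Building the columns from separate find_all calls only works if every list ends up the same length and in the same order. If a listing lacks a field (for example a plot with no beds), the lists fall out of step or pd.DataFrame raises an error. One possible alternative is to loop over each listing and build one row per listing. The sketch below is only an illustration: the li class="listing" wrapper, the second title, and the price text are made-up sample data, since the question does not show the element that wraps a listing on the real page. Only the inner class names and aria-labels come from the question.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical markup: the <li class="listing"> wrapper and the sample values
# are assumptions; the inner classes and aria-labels are from the question.
sample_html = '''
<ul>
  <li class="listing">
    <span class="f343d9ce">5.5 Crore</span>
    <span aria-label="Beds" class="b6a29bc0">5</span>
    <span aria-label="Baths" class="b6a29bc0">6</span>
    <span aria-label="Area" class="b6a29bc0"><div class="_7ac32433" title="1 Kanal Luxury Bungalow For Sale In Lahore Dha"><div><span>1 Kanal</span></div></div></span>
  </li>
  <li class="listing">
    <span class="f343d9ce">69.9 Lakh</span>
    <span aria-label="Area" class="b6a29bc0"><div class="_7ac32433" title="5 Marla House"><div><span>5 Marla</span></div></div></span>
  </li>
</ul>
'''


def text_or_none(tag):
    """Return the stripped text of a tag, or None if the tag was not found."""
    return tag.get_text(strip=True) if tag is not None else None


soup = BeautifulSoup(sample_html, "html.parser")

rows = []
for card in soup.find_all("li", class_="listing"):
    # Search inside one listing at a time so the fields stay aligned.
    title_div = card.find("div", class_="_7ac32433", title=True)
    rows.append({
        "title": title_div["title"] if title_div is not None else None,
        "price": text_or_none(card.find("span", class_="f343d9ce")),
        "area": text_or_none(card.find("span", attrs={"aria-label": "Area"})),
        "beds": text_or_none(card.find("span", attrs={"aria-label": "Beds"})),
        "baths": text_or_none(card.find("span", attrs={"aria-label": "Baths"})),
    })

# One dict per listing becomes one row; missing fields stay empty.
df = pd.DataFrame(rows)
print(df)
```

On the real page, the li/"listing" selector would need to be replaced with whatever element actually wraps a single listing, which can be found with the browser's inspector.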
