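I need some help with a project in which I have to parse information from a real estate website.
Somehow I can parse almost everything, but the page has a one-liner that I have never seen before.
The code itself is too large to post, but here is an example snippet: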
<div class="d-none" data-listing='{"strippedPhotos":[{"caption":"","description":"","urls":{"1920x1080":"https://ot.ingatlancdn.com/d6/07/32844921_216401477_hd.jpg","800x600":"https://ot.ingatlancdn.com/d6/07/32844921_216401477_l.jpg","228x171":"https://ot.ingatlancdn.com/d6/07/32844921_216401477_m.jpg","80x60":"https://ot.ingatlancdn.com/d6/07
Can you help me identify what this is, and maybe suggest how to parse all of the contained information into a pandas DataFrame?
Edit, code added:
other = []
from bs4 import BeautifulSoup
from urllib.request import Request, urlopen
hdr = {'User-Agent': 'Mozilla/5.0'}
site= "https://ingatlan.com/xiii-ker/elado+lakas/tegla-epitesu-lakas/32844921"
req = Request(site,headers=hdr)
page = urlopen(req)
soup = BeautifulSoup(page)
data = soup.find_all('div', id="listing", class_="d-none", attrs="data-listing")
data
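The data-listing attribute holds a JSON string, so you can access the value of the attribute and convert the string with json.loads():
data = json.loads(soup.find('div', id="listing", class_="d-none", attrs={"data-listing": True}).get('data-listing'))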
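To make that step easier to try without hitting the website, here is a minimal offline sketch. The `html` string is a hypothetical, shortened stand-in for the real page (the `id="listing"` attribute and the single photo entry are assumptions for illustration); only the shape of the `data-listing` attribute matters.

```python
import json
from bs4 import BeautifulSoup

# Hypothetical, shortened stand-in for the real page markup (assumption).
html = (
    '<div id="listing" class="d-none" data-listing=\'{"strippedPhotos":'
    '[{"caption":"","description":"","urls":{"800x600":'
    '"https://ot.ingatlancdn.com/d6/07/32844921_216401477_l.jpg"}}]}\'></div>'
)

soup = BeautifulSoup(html, "html.parser")

# attrs={"data-listing": True} matches any div that has the attribute at all.
tag = soup.find("div", id="listing", attrs={"data-listing": True})

raw = tag.get("data-listing")   # still a plain string at this point
listing = json.loads(raw)       # now a regular Python dict

print(type(raw).__name__, type(listing).__name__)
print(listing["strippedPhotos"][0]["urls"]["800x600"])
```

Once the string has been decoded, everything inside it can be reached with ordinary dict and list indexing.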
Then simply create your DataFrame with pandas.json_normalize():
pd.json_normalize(data['strippedPhotos'])
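As a rough illustration of what json_normalize() does with the nested `urls` object, here is a small sketch on hand-written sample data. The `data` dict below is an assumption modelled on the snippet in the question, reduced to one photo with two sizes; it is not the full payload of the page.

```python
import pandas as pd

# Hand-written sample in the shape of the question's snippet (assumption).
data = {
    "strippedPhotos": [
        {
            "caption": "",
            "description": "",
            "urls": {
                "1920x1080": "https://ot.ingatlancdn.com/d6/07/32844921_216401477_hd.jpg",
                "800x600": "https://ot.ingatlancdn.com/d6/07/32844921_216401477_l.jpg",
            },
        }
    ]
}

# One row per photo; nested dict keys become dotted column names.
photos = pd.json_normalize(data["strippedPhotos"])
print(photos.columns.tolist())
```

Each key of the nested `urls` dict ends up as its own column, named like `urls.800x600`, so every image size can be selected as a normal DataFrame column.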
Since the expected result is not clear, this should point you in the right direction:
from bs4 import BeautifulSoup
from urllib.request import Request, urlopen
import pandas as pd
import json

hdr = {'User-Agent': 'Mozilla/5.0'}
site = "https://ingatlan.com/xiii-ker/elado+lakas/tegla-epitesu-lakas/32844921"
req = Request(site, headers=hdr)
page = urlopen(req)
# name the parser explicitly so bs4 does not have to guess one
soup = BeautifulSoup(page, "html.parser")
# find the div that carries a data-listing attribute and decode its JSON value
data = json.loads(soup.find('div', id="listing", class_="d-none", attrs={"data-listing": True}).get('data-listing'))
### all data
pd.json_normalize(data)
### only strippedPhotos
pd.json_normalize(data['strippedPhotos'])