Scraping results differ inside a Python/BS4/Selenium loop



I have a csv file containing links that need to be scraped. I'm also set up to use the same Chrome browser session for login (the elements I need are only available when logged in). When I scrape a single page outside the loop, I get the desired results from the page. When I put the same code into a loop to scrape all the links, I get different results. I think it has something to do with "source =" and/or "soup =".

The CSV file contains 3 links:

https://www.redfin.com/UT/Murray/875-E-Arrowhead-Ln-84107/unit-44/home/77418264
https://www.redfin.com/UT/Murray/35-W-American-Ave-84107/home/86446505
https://www.redfin.com/UT/Murray/4551-S-200-E-84107/home/86457987

Single-page code:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup
#####################################
chrome_driver = "C:/chromedriver.exe"
Chrome_options = Options()
Chrome_options.add_experimental_option("debuggerAddress", "127.0.0.1:9015")
driver = webdriver.Chrome(chrome_driver, options=Chrome_options)
#####################################
driver.get("https://www.redfin.com/UT/Murray/4551-S-200-E-84107/home/86457987")
source = driver.page_source
soup = BeautifulSoup(source, "html.parser")
#####################################
address = soup.find('span', class_='street-address').text
print("      Address: " + address)
city = soup.find('span', class_='locality').text
print("         City: " + city)
state = soup.find('span', class_='region').text
print("        State: " + state)
zipcode = soup.find('span', class_='postal-code').text
print("      ZipCode: " + zipcode)
soldPrice = soup.find('div', class_='price-col number').text
print("   Sold Price: " + soldPrice)
ln = soup.find('div', class_='listing-agent-item')
Lname = ln.find_all('span')[1].text
print("Listing Agent: " + Lname)
bn = soup.find('div', class_='buyer-agent-item')
Bname = bn.find_all('span')[1].text
print(" Buying Agent: " + Bname)
date = soup.find('div',attrs={"class":"col-4"})
sDate = date.find_all('p')[0].text
print("         Date: " + sDate)
mls = soup.find('div', class_='sourceContent').text
print("   MLS Source: " + mls)
for span in soup.find_all('span'):
    if span.find(text='MLS#'):
        mlsNum = span.nextSibling.text
        print("         MLS#: " + mlsNum)

driver.quit()

The single-page results display perfectly:

Address: 4551 S 200 E 
City: Murray, 
State: UT
ZipCode: 84107
Sold Price: $262,000 
Listing Agent: Jerold Ivie
Buying Agent: Zac Eldridge
Date: Dec 20, 2019
MLS Source: WFRMLS
MLS#: 1635000
[Finished in 3.3s]

Loop code with "source =" and "soup =" before the loop:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup
import csv
#####################################
chrome_driver = "C:/chromedriver.exe"
Chrome_options = Options()
Chrome_options.add_experimental_option("debuggerAddress", "127.0.0.1:9015")
driver = webdriver.Chrome(chrome_driver, options=Chrome_options)
#####################################
#driver.get("https://www.redfin.com/UT/Murray/4551-S-200-E-84107/home/86457987")
source = driver.page_source
soup = BeautifulSoup(source, "html.parser")
#####################################
with open('UTlinks.csv') as file:
    readCSV = csv.reader(file)
    for row in readCSV:
        url = str(row).replace("['","").replace("']","")
        print("_________________________________")
        print("Scraping: " + url)
        driver.get(url)
        #source = driver.page_source
        #soup = BeautifulSoup(source, "html.parser")
        ####################################
        try:
            address = soup.find('span', class_='street-address').text
            print("      Address: " + address)
        except:
            print("      Address: " + "NA")
        try:
            city = soup.find('span', class_='locality').text
            print("         City: " + city)
        except:
            print("         City: " + "NA")
        try:
            state = soup.find('span', class_='region').text
            print("        State: " + state)
        except:
            print("        State: " + "NA")
        try:
            zipcode = soup.find('span', class_='postal-code').text
            print("      ZipCode: " + zipcode)
        except:
            print("      ZipCode: " + "NA")
        try:
            soldPrice = soup.find('div', class_='price-col number').text
            print("   Sold Price: " + soldPrice)
        except:
            print("   Sold Price: " + "NA")
        try:
            ln = soup.find('div', class_='listing-agent-item')
            Lname = ln.find_all('span')[1].text
            print("Listing Agent: " + Lname)
        except:
            print("Listing Agent: " + "NA")
        try:
            bn = soup.find('div', class_='buyer-agent-item')
            Bname = bn.find_all('span')[1].text
            print(" Buying Agent: " + Bname)
        except:
            print(" Buying Agent: " + "NA")
        try:
            date = soup.find('div', attrs={"class":"col-4"})
            sDate = date.find_all('p')[0].text
            print("         Date: " + sDate)
        except:
            print("         Date: " + "NA")
        try:
            mls = soup.find('div', class_='sourceContent').text
            print("   MLS Source: " + mls)
        except:
            print("   MLS Source: " + "NA")
        try:
            for span in soup.find_all('span'):
                if span.find(text='MLS#'):
                    mlsNum = span.nextSibling.text
                    print("         MLS#: " + mlsNum)
        except:
            print("         MLS#: " + "NA")
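As an aside, the str(row).replace("['","").replace("']","") trick works but is fragile: csv.reader already yields each row as a list of strings, so the first column can be taken directly with row[0]. A small self-contained illustration (using an in-memory stand-in for UTlinks.csv, not the real file):

```python
import csv
import io

# csv.reader yields each row as a list of strings, so row[0] is the URL
# directly -- no need to stringify the list and strip brackets.
sample = io.StringIO("https://example.com/a\nhttps://example.com/b\n")
urls = [row[0] for row in csv.reader(sample)]
print(urls)  # → ['https://example.com/a', 'https://example.com/b']
```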

Results from the loop: you can see it prints each url from the file, then scrapes the browser page that is currently open 3 times... pulling all the info from that already-open page rather than from each url.

_________________________________
Scraping: https://www.redfin.com/UT/Murray/875-E-Arrowhead-Ln-84107/unit-44/home/77418264
Address: 4551 S 200 E 
City: Murray, 
State: UT
ZipCode: 84107
Sold Price: $262,000 
Listing Agent: Jerold Ivie
Buying Agent: Zac Eldridge
Date: Dec 20, 2019
MLS Source: WFRMLS
MLS#: 1635000
_________________________________
Scraping: https://www.redfin.com/UT/Murray/35-W-American-Ave-84107/home/86446505
Address: 4551 S 200 E 
City: Murray, 
State: UT
ZipCode: 84107
Sold Price: $262,000 
Listing Agent: Jerold Ivie
Buying Agent: Zac Eldridge
Date: Dec 20, 2019
MLS Source: WFRMLS
MLS#: 1635000
_________________________________
Scraping: https://www.redfin.com/UT/Murray/4551-S-200-E-84107/home/86457987
Address: 4551 S 200 E 
City: Murray, 
State: UT
ZipCode: 84107
Sold Price: $262,000 
Listing Agent: Jerold Ivie
Buying Agent: Zac Eldridge
Date: Dec 20, 2019
MLS Source: WFRMLS
MLS#: 1635000
[Finished in 6.9s]

If I put "source =" and "soup =" inside the loop:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup
import csv
#####################################
chrome_driver = "C:/chromedriver.exe"
Chrome_options = Options()
Chrome_options.add_experimental_option("debuggerAddress", "127.0.0.1:9015")
driver = webdriver.Chrome(chrome_driver, options=Chrome_options)
#####################################
#driver.get("https://www.redfin.com/UT/Murray/4551-S-200-E-84107/home/86457987")
#source = driver.page_source
#soup = BeautifulSoup(source, "html.parser")
#####################################
with open('UTlinks.csv') as file:
    readCSV = csv.reader(file)
    for row in readCSV:
        url = str(row).replace("['","").replace("']","")
        print("_________________________________")
        print("Scraping: " + url)
        driver.get(url)
        source = driver.page_source
        soup = BeautifulSoup(source, "html.parser")
        ####################################
        try:
            address = soup.find('span', class_='street-address').text
            print("      Address: " + address)
        except:
            print("      Address: " + "NA")
        try:
            city = soup.find('span', class_='locality').text
            print("         City: " + city)
        except:
            print("         City: " + "NA")
        try:
            state = soup.find('span', class_='region').text
            print("        State: " + state)
        except:
            print("        State: " + "NA")
        try:
            zipcode = soup.find('span', class_='postal-code').text
            print("      ZipCode: " + zipcode)
        except:
            print("      ZipCode: " + "NA")
        try:
            soldPrice = soup.find('div', class_='price-col number').text
            print("   Sold Price: " + soldPrice)
        except:
            print("   Sold Price: " + "NA")
        try:
            ln = soup.find('div', class_='listing-agent-item')
            Lname = ln.find_all('span')[1].text
            print("Listing Agent: " + Lname)
        except:
            print("Listing Agent: " + "NA")
        try:
            bn = soup.find('div', class_='buyer-agent-item')
            Bname = bn.find_all('span')[1].text
            print(" Buying Agent: " + Bname)
        except:
            print(" Buying Agent: " + "NA")
        try:
            date = soup.find('div', attrs={"class":"col-4"})
            sDate = date.find_all('p')[0].text
            print("         Date: " + sDate)
        except:
            print("         Date: " + "NA")
        try:
            mls = soup.find('div', class_='sourceContent').text
            print("   MLS Source: " + mls)
        except:
            print("   MLS Source: " + "NA")
        try:
            for span in soup.find_all('span'):
                if span.find(text='MLS#'):
                    mlsNum = span.nextSibling.text
                    print("         MLS#: " + mlsNum)
        except:
            print("         MLS#: " + "NA")

Results with "source =" and "soup =" inside the loop:

_________________________________
Scraping: https://www.redfin.com/UT/Murray/875-E-Arrowhead-Ln-84107/unit-44/home/77418264
Address: 875 E Arrow Head Ln S #44 
City: Salt Lake City, 
State: UT
ZipCode: 84107
Sold Price: NA
Listing Agent: Joe Olschewski
Buying Agent: James Corey
Date: NA
MLS Source: WFRMLS
MLS#: 1654937
_________________________________
Scraping: https://www.redfin.com/UT/Murray/35-W-American-Ave-84107/home/86446505
Address: 35 American Ave 
City: Murray, 
State: UT
ZipCode: 84107
Sold Price: NA
Listing Agent: Dana Conway
Buying Agent: Rich Varga
Date: NA
MLS Source: WFRMLS
MLS#: 1660023
_________________________________
Scraping: https://www.redfin.com/UT/Murray/4551-S-200-E-84107/home/86457987
Address: 4551 S 200 E 
City: Murray, 
State: UT
ZipCode: 84107
Sold Price: NA
Listing Agent: Jerold Ivie
Buying Agent: Zac Eldridge
Date: NA
MLS Source: WFRMLS
MLS#: 1635000
[Finished in 8.6s]

Now it runs well, except it doesn't scrape "Sold Price:" or "Sold Date:". If I remove the error handling, it throws this error:

soldPrice = soup.find('div', class_='price-col number').text
AttributeError: 'NoneType' object has no attribute 'text'

What am I doing wrong here?
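For what it's worth, the bare except blocks above hide the fact that soup.find() simply returned None (most likely because the price and date are rendered by JavaScript after the initial load, so page_source captured too early doesn't contain them yet). A small helper makes the None case explicit instead of swallowing every exception; this is a sketch against made-up markup, not Redfin's actual page:

```python
from bs4 import BeautifulSoup

def text_or_na(soup, name, cls):
    """Return the matching tag's text, or 'NA' when find() yields None."""
    tag = soup.find(name, class_=cls)
    return tag.text if tag else "NA"

# Made-up snippet: the address span exists, the price div does not.
soup = BeautifulSoup("<span class='street-address'>4551 S 200 E</span>",
                     "html.parser")
print(text_or_na(soup, "span", "street-address"))   # → 4551 S 200 E
print(text_or_na(soup, "div", "price-col number"))  # → NA
```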

There is an API to get the data. It's a tad tricky though. I can pull the data without signing up/logging in (although for some reason the one thing I can't find in the html or json response is the buying agent). But if you sign in, it appears to provide that data. Looks like everything else (and more) is in there.
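The fiddly part is that Redfin prefixes its embedded JSON payloads with a "{}&&" guard (a common anti-JSON-hijacking measure), which is why the code below keeps calling .split('&&')[-1] before json.loads. A tiny illustration with a fabricated payload:

```python
import json

# Redfin-style guarded payload: "{}&&" precedes the real JSON body.
raw = '{}&&{"payload": {"mlsId": "1635000"}}'
data = json.loads(raw.split('&&')[-1])  # drop the guard, parse the rest
print(data['payload']['mlsId'])  # → 1635000
```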

import requests
from bs4 import BeautifulSoup
import json

links = ['https://www.redfin.com/UT/Murray/875-E-Arrowhead-Ln-84107/unit-44/home/77418264',
         'https://www.redfin.com/UT/Murray/35-W-American-Ave-84107/home/86446505',
         'https://www.redfin.com/UT/Murray/4551-S-200-E-84107/home/86457987']

headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36'}
payload = {
    'email': 'username@email.com',
    'pwd': 'thisIsThePassword'}

#cookiesStr = ''
with requests.Session() as s:
    login = s.post('https://www.redfin.com/stingray/do/api-login', headers=headers, params=payload)
    #cookies = s.cookies.get_dict()
    #for k, v in cookies.items():
    #    cookiesStr += '%s=%s;' %(k,v)
    #headers.update({'cookie':cookiesStr})
    for url in links:
        response = s.get(url, headers=headers)
        soup = BeautifulSoup(response.text, 'html.parser')
        scripts = soup.find_all('script')
        for script in scripts:
            if '_tLAB.wait(function()' in script.text and '/stingray/api/home/details/belowTheFold' in script.text:
                jsonStr = script.text
                jsonStr = '{' + jsonStr.split('{',2)[-1].rsplit(')',2)[0]
                jsonData2 = json.loads(jsonStr)
                jsonData2 = json.loads(jsonData2['res']['text'].split('&&')[-1])
                address = jsonData2['payload']['amenitiesInfo']['addressInfo']['street']
                city = jsonData2['payload']['amenitiesInfo']['addressInfo']['city']
                state = jsonData2['payload']['amenitiesInfo']['addressInfo']['state']
                zipcode = jsonData2['payload']['amenitiesInfo']['addressInfo']['zip']
                mlsSource = jsonData2['payload']['amenitiesInfo']['provider']
                listingAgent = jsonData2['payload']['amenitiesInfo']['mlsDisclaimerInfo']['listingAgentName']
            if 'InitialContext = ' in script.text:
                jsonStr = script.text.split('InitialContext = ')[-1].split('root.__reactServerState.Config')[0].rsplit(';',1)[0]
                jsonData = json.loads(jsonStr)
                dataAPIs = jsonData['ReactServerAgent.cache']['dataCache']
                jsonData2 = json.loads(dataAPIs['/stingray/api/home/details/aboveTheFold']['res']['text'].split('&&')[-1])
                soldPrice = jsonData2['payload']['addressSectionInfo']['priceInfo']['amount']
                soldDate = jsonData2['payload']['mediaBrowserInfo']['sashes'][0]['lastSaleDate']
                jsonData2 = json.loads(dataAPIs['/stingray/api/home/details/initialInfo']['res']['text'].split('&&')[-1])
                mls = jsonData2['payload']['mlsId']
                jsonData2 = json.loads(dataAPIs['/stingray/api/home/details/mainHouseInfoPanelInfo']['res']['text'].split('&&')[-1])
                buyingAgents = jsonData2['payload']['mainHouseInfo']['buyingAgents'][0]['agentInfo']['agentName']
        print("_________________________________")
        print("Scraping: " + url)
        print('%15s: %s' %('Address',address))
        print('%15s: %s' %('City',city))
        print('%15s: %s' %('State',state))
        print('%15s: %s' %('Zipcode',zipcode))
        print('%15s: $' %('Sold Price') + f'{soldPrice:,}')
        print('%15s: %s' %('Listing Agent',listingAgent))
        print('%15s: %s' %('Buying Agent',buyingAgents))
        print('%15s: %s' %('Date',soldDate))
        print('%15s: %s' %('MLS Source',mlsSource))
        print('%15s: %s' %('MLS#',mls))

Output:

_________________________________
Scraping: https://www.redfin.com/UT/Murray/875-E-Arrowhead-Ln-84107/unit-44/home/77418264
Address: 875 E Arrow Head Lane South Unit 44
City: Salt Lake City
State: UT
Zipcode: 84107
Sold Price: $179,900
Listing Agent: Joe Olschewski
Buying Agent: James Corey
Date: MAR 12, 2020
MLS Source: WFRMLS
MLS#: 1654937
_________________________________
Scraping: https://www.redfin.com/UT/Murray/35-W-American-Ave-84107/home/86446505
Address: 35 American Ave
City: Murray
State: UT
Zipcode: 84107
Sold Price: $317,500
Listing Agent: Dana Conway
Buying Agent: Rich Varga
Date: MAR 25, 2020
MLS Source: WFRMLS
MLS#: 1660023
_________________________________
Scraping: https://www.redfin.com/UT/Murray/4551-S-200-E-84107/home/86457987
Address: 4551 South 200 E
City: Murray
State: UT
Zipcode: 84107
Sold Price: $262,000
Listing Agent: Jerold Ivie
Buying Agent: Zac Eldridge
Date: DEC 20, 2019
MLS Source: WFRMLS
MLS#: 1635000
