I'm trying to scrape some data. The problem I'm facing is that the page refreshes every few seconds. I want to limit the scrape to only the data from the latest block, then refresh the scan and catch the next block as it comes in. Any input would be very helpful.

Goal #1 - keep the grabbed blocks contiguous
Goal #2 - eliminate duplicates
from bs4 import BeautifulSoup
from time import sleep
import re, requests

trim = re.compile(r'[^\d,.]+')
url = "https://bscscan.com/txs?a=0x10ed43c718714eb63d5aa57b78b54704e256024e&ps=100&p=1"
baseurl = 'https://bscscan.com/tx/'
header = {"User-Agent": "Mozilla/5.0"}
scans = 0

while True:
    scans += 1
    reqtxsInternal = requests.get(url, headers=header, timeout=2)
    souptxsInternal = BeautifulSoup(reqtxsInternal.content, 'html.parser')
    blocktxsInternal = souptxsInternal.findAll('table')[0].findAll('tr')
    for row in blocktxsInternal[1:]:
        txnhash = row.find_all('td')[1].text
        txnhashdetails = txnhash.strip()
        block = row.find_all('td')[3].text
        value = row.find_all('td')[9].text
        amount = trim.sub('', value).replace(",", "")
        transval = float(amount)
        if transval >= 1:
            print("Doing something with the data -> " + str(block) + " " + str(transval))
    print(" -> Whole Page Scanned: ", scans)
    sleep(1)
Current output: # -- will vary between runs of the script
Doing something with the data -> 10186993 1.233071907624764
Doing something with the data -> 10186993 4.689434542638692
Doing something with the data -> 10186993 27.97137792744322 #-- grab only until here and reload the scan
Doing something with the data -> 10186992 9.0
Doing something with the data -> 10186991 2.98
Doing something with the data -> 10186991 1.0
-> Whole Page Scanned: 1
Doing something with the data -> 10186994 1.026868093169767
Doing something with the data -> 10186994 4.0
Doing something with the data -> 10186994 4.55582682
Doing something with the data -> 10186994 8.184713205161088
Doing something with the data -> 10186993 1.233071907624764
Doing something with the data -> 10186993 4.689434542638692
Doing something with the data -> 10186993 27.97137792744322
Doing something with the data -> 10186992 9.0
-> Whole Page Scanned: 2
Wanted output:
Doing something with the data -> 10186993 1.233071907624764
Doing something with the data -> 10186993 4.689434542638692
Doing something with the data -> 10186993 27.97137792744322
-> Whole Page Scanned: 1
Doing something with the data -> 10186994 1.026868093169767
Doing something with the data -> 10186994 4.0
Doing something with the data -> 10186994 4.55582682
Doing something with the data -> 10186994 8.184713205161088
-> Whole Page Scanned: 2
I used Pandas here. It uses BeautifulSoup under the hood, but since this is a table I let pandas parse it, which makes the table much easier to manipulate. It sounds like you only want the latest/max "Block", and then to return any value greater than or equal to 1. Does this get you what you need?
import pandas as pd
from time import sleep
import requests

url = "https://bscscan.com/txs?a=0x10ed43c718714eb63d5aa57b78b54704e256024e&ps=100&p=1"
baseurl = 'https://bscscan.com/tx/'
header = {"User-Agent": "Mozilla/5.0"}
scans = 0

while True:
    scans += 1
    reqtxsInternal = requests.get(url, headers=header, timeout=2)
    df = pd.read_html(reqtxsInternal.text)[0]
    df = df[df['Block'] == df['Block'].max()]
    df['Value'] = df['Value'].str.extract(r'(^\d*\.?\d+)')
    df = df[df['Value'].astype(float) >= 1]
    print(df[['Block', 'Value']])
    print(" -> Whole Page Scanned: ", scans)
    sleep(1)
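To see the filtering step in isolation, here is a minimal sketch with a tiny hypothetical DataFrame standing in for the parsed transactions table (the block numbers and values are made up):

```python
import pandas as pd

# Hypothetical stand-in for the parsed table, newest block first,
# with Value strings roughly as the page renders them
df = pd.DataFrame({
    'Block': [10186994, 10186994, 10186993, 10186993],
    'Value': ['1.02 BNB', '0.5 BNB', '4.68 BNB', '27.97 BNB'],
})

# Keep only rows belonging to the newest (max) block
latest = df[df['Block'] == df['Block'].max()].copy()

# Pull the leading number out of the Value string, then filter >= 1
latest['Value'] = latest['Value'].str.extract(r'(^\d*\.?\d+)', expand=False).astype(float)
latest = latest[latest['Value'] >= 1]
print(latest)
```

Only the `1.02` row from block `10186994` survives: the older block is dropped by the max filter and `0.5` is dropped by the value filter.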
Your other option is to have it check whether the current 'block' is greater than the previous one, and then add that logic so it only processes rows from that block:
from bs4 import BeautifulSoup
from time import sleep
import re, requests

trim = re.compile(r'[^\d,.]+')
url = "https://bscscan.com/txs?a=0x10ed43c718714eb63d5aa57b78b54704e256024e&ps=100&p=1"
baseurl = 'https://bscscan.com/tx/'
header = {"User-Agent": "Mozilla/5.0"}
scans = 0
previous_block = 0

while True:
    scans += 1
    reqtxsInternal = requests.get(url, headers=header, timeout=2)
    souptxsInternal = BeautifulSoup(reqtxsInternal.content, 'html.parser')
    blocktxsInternal = souptxsInternal.findAll('table')[0].findAll('tr')
    for row in blocktxsInternal[1:]:
        txnhash = row.find_all('td')[1].text
        txnhashdetails = txnhash.strip()
        block = row.find_all('td')[3].text
        # a newer block becomes the new reference
        if float(block) > float(previous_block):
            previous_block = block
        value = row.find_all('td')[9].text
        amount = trim.sub('', value).replace(",", "")
        transval = float(amount)
        # only rows from the newest block pass this check
        if transval >= 1 and block == previous_block:
            print("Doing something with the data -> " + str(block) + " " + str(transval))
    print(" -> Whole Page Scanned: ", scans)
    sleep(1)
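The previous_block comparison can be traced on a small hypothetical row list (block numbers and values are made up):

```python
# Hypothetical (block, value) rows in page order, newest block first
rows = [(10186994, 1.02), (10186994, 4.0), (10186993, 1.23), (10186992, 9.0)]

previous_block = 0
kept = []
for block, transval in rows:
    # the first row of a newer block raises the reference block
    if block > previous_block:
        previous_block = block
    # only rows that belong to that newest block (and are >= 1) pass
    if transval >= 1 and block == previous_block:
        kept.append((block, transval))

print(kept)
```

Only the two `10186994` rows are kept; the older blocks fail the `block == previous_block` check.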
Continuity only works while the block number keeps moving in one direction (incrementing/decrementing). Since the data changes on every refresh, I'd suggest first collecting the data you need, then de-duplicating it, and then doing whatever you want with it.
from bs4 import BeautifulSoup
from time import sleep
import re, requests

trim = re.compile(r'[^\d,.]+')
url = "https://bscscan.com/txs?a=0x10ed43c718714eb63d5aa57b78b54704e256024e&ps=100&p=1"
baseurl = 'https://bscscan.com/tx/'
header = {"User-Agent": "Mozilla/5.0"}
scans = 0
all_data = set()
prev_block = 0

while True:
    scans += 1
    reqtxsInternal = requests.get(url, headers=header, timeout=2)
    souptxsInternal = BeautifulSoup(reqtxsInternal.content, 'html.parser')
    blocktxsInternal = souptxsInternal.findAll('table')[0].findAll('tr')
    for row in blocktxsInternal[1:]:
        txnhash = row.find_all('td')[1].text
        txnhashdetails = txnhash.strip()
        block = int(row.find_all('td')[3].text)
        value = row.find_all('td')[9].text
        amount = trim.sub('', value).replace(",", "")
        transval = float(amount)
        # skip rows from blocks older than the newest one seen
        if (prev_block != 0) and (block < prev_block):
            continue
        else:
            prev_block = block
        if (block >= prev_block) and (transval >= 1):
            print("Do something with the data -> " + str(block) + " " + str(transval))
            # collect the data
            all_data.add((block, transval))
    print(" -> Whole Page Scanned: ", scans)
    sleep(1)
    # do something with the collected data
    print('Do something with this collected data:', all_data)
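To see why the set makes overlapping re-scans safe, here is a minimal sketch with two simulated scans whose rows overlap (the values are hypothetical):

```python
all_data = set()

# Two simulated scans; the second repeats a row from the first
scan1 = [(10186993, 1.23), (10186993, 4.68)]
scan2 = [(10186994, 1.02), (10186993, 1.23)]  # (10186993, 1.23) again

for block, transval in scan1 + scan2:
    # only rows not already collected count as new
    if (block, transval) not in all_data:
        all_data.add((block, transval))
        print("new row ->", block, transval)
```

One caveat: keying on (block, value) will also collapse two genuinely different transactions that happen to share a block and value; keying on the transaction hash (the txnhashdetails already parsed above) would be a safer deduplication key.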