I'm trying to scrape some data. The problem I'm facing is that the page refreshes every few seconds. I want to limit the scrape to only the data from the latest block, then refresh the scan and catch the next block as it comes in. Any input would be very helpful.

Goal #1 - keep the grabbed blocks contiguous
Goal #2 - eliminate duplicates
from bs4 import BeautifulSoup
from time import sleep
import re, requests

trim = re.compile(r'[^\d,.]+')
url = "https://bscscan.com/txs?a=0x10ed43c718714eb63d5aa57b78b54704e256024e&ps=100&p=1"
baseurl = 'https://bscscan.com/tx/'
header = {"User-Agent": "Mozilla/5.0"}
scans = 0

while True:
    scans += 1
    reqtxsInternal = requests.get(url, headers=header, timeout=2)
    souptxsInternal = BeautifulSoup(reqtxsInternal.content, 'html.parser')
    blocktxsInternal = souptxsInternal.findAll('table')[0].findAll('tr')
    for row in blocktxsInternal[1:]:
        txnhash = row.find_all('td')[1].text
        txnhashdetails = txnhash.strip()
        block = row.find_all('td')[3].text
        value = row.find_all('td')[9].text
        amount = trim.sub('', value).replace(",", "")
        transval = float(amount)
        if transval >= 1:
            print("Doing something with the data -> " + str(block) + " " + str(transval))
    print(" -> Whole Page Scanned: ", scans)
    sleep(1)
Current output: # -- will vary between runs of the script
Doing something with the data -> 10186993 1.233071907624764
Doing something with the data -> 10186993 4.689434542638692
Doing something with the data -> 10186993 27.97137792744322 #-- grab only until here and reload the scan
Doing something with the data -> 10186992 9.0
Doing something with the data -> 10186991 2.98
Doing something with the data -> 10186991 1.0
-> Whole Page Scanned: 1
Doing something with the data -> 10186994 1.026868093169767
Doing something with the data -> 10186994 4.0
Doing something with the data -> 10186994 4.55582682
Doing something with the data -> 10186994 8.184713205161088
Doing something with the data -> 10186993 1.233071907624764
Doing something with the data -> 10186993 4.689434542638692
Doing something with the data -> 10186993 27.97137792744322
Doing something with the data -> 10186992 9.0
-> Whole Page Scanned: 2
Wanted output:
Doing something with the data -> 10186993 1.233071907624764
Doing something with the data -> 10186993 4.689434542638692
Doing something with the data -> 10186993 27.97137792744322
-> Whole Page Scanned: 1
Doing something with the data -> 10186994 1.026868093169767
Doing something with the data -> 10186994 4.0
Doing something with the data -> 10186994 4.55582682
Doing something with the data -> 10186994 8.184713205161088
-> Whole Page Scanned: 2
I used Pandas here. It uses BeautifulSoup under the hood, but since this is a table I let pandas parse it, which makes the table much easier to manipulate. It sounds like you only want the latest/max "Block", and then to return any value greater than or equal to 1. Does this get you what you need?
import pandas as pd
from time import sleep
import requests

url = "https://bscscan.com/txs?a=0x10ed43c718714eb63d5aa57b78b54704e256024e&ps=100&p=1"
baseurl = 'https://bscscan.com/tx/'
header = {"User-Agent": "Mozilla/5.0"}
scans = 0

while True:
    scans += 1
    reqtxsInternal = requests.get(url, headers=header, timeout=2)
    df = pd.read_html(reqtxsInternal.text)[0]
    df = df[df['Block'] == df['Block'].max()]
    df['Value'] = df['Value'].str.extract(r'(^\d*\.?\d+)')
    df = df[df['Value'].astype(float) >= 1]
    print(df[['Block', 'Value']])
    print(" -> Whole Page Scanned: ", scans)
    sleep(1)
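To see the filtering step in isolation, here is a minimal sketch with a tiny hypothetical DataFrame standing in for the parsed transactions table (the block numbers and values are made up):

```python
import pandas as pd

# Hypothetical stand-in for the parsed table, newest block first,
# with Value strings roughly as the page renders them
df = pd.DataFrame({
    'Block': [10186994, 10186994, 10186993, 10186993],
    'Value': ['1.02 BNB', '0.5 BNB', '4.68 BNB', '27.97 BNB'],
})

# Keep only rows belonging to the newest (max) block
latest = df[df['Block'] == df['Block'].max()].copy()

# Pull the leading number out of the Value string, then filter >= 1
latest['Value'] = latest['Value'].str.extract(r'(^\d*\.?\d+)', expand=False).astype(float)
latest = latest[latest['Value'] >= 1]
print(latest)
```

Only the `1.02` row from block `10186994` survives: the older block is dropped by the max filter and `0.5` is dropped by the value filter.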
Your other option is to have it check whether the current 'block' is greater than the previous one, and then add that logic so it only processes rows from that block:
from bs4 import BeautifulSoup
from time import sleep
import re, requests

trim = re.compile(r'[^\d,.]+')
url = "https://bscscan.com/txs?a=0x10ed43c718714eb63d5aa57b78b54704e256024e&ps=100&p=1"
baseurl = 'https://bscscan.com/tx/'
header = {"User-Agent": "Mozilla/5.0"}
scans = 0
previous_block = 0

while True:
    scans += 1
    reqtxsInternal = requests.get(url, headers=header, timeout=2)
    souptxsInternal = BeautifulSoup(reqtxsInternal.content, 'html.parser')
    blocktxsInternal = souptxsInternal.findAll('table')[0].findAll('tr')
    for row in blocktxsInternal[1:]:
        txnhash = row.find_all('td')[1].text
        txnhashdetails = txnhash.strip()
        block = row.find_all('td')[3].text
        # a newer block becomes the new reference
        if float(block) > float(previous_block):
            previous_block = block
        value = row.find_all('td')[9].text
        amount = trim.sub('', value).replace(",", "")
        transval = float(amount)
        # only rows from the newest block pass this check
        if transval >= 1 and block == previous_block:
            print("Doing something with the data -> " + str(block) + " " + str(transval))
    print(" -> Whole Page Scanned: ", scans)
    sleep(1)
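The previous_block comparison can be traced on a small hypothetical row list (block numbers and values are made up):

```python
# Hypothetical (block, value) rows in page order, newest block first
rows = [(10186994, 1.02), (10186994, 4.0), (10186993, 1.23), (10186992, 9.0)]

previous_block = 0
kept = []
for block, transval in rows:
    # the first row of a newer block raises the reference block
    if block > previous_block:
        previous_block = block
    # only rows that belong to that newest block (and are >= 1) pass
    if transval >= 1 and block == previous_block:
        kept.append((block, transval))

print(kept)
```

Only the two `10186994` rows are kept; the older blocks fail the `block == previous_block` check.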
Continuity only works while the block number keeps moving in one direction (incrementing/decrementing). Since the data changes on every refresh, I'd suggest first collecting the data you need, then de-duplicating it, and then doing whatever you want with it.
from bs4 import BeautifulSoup
from time import sleep
import re, requests

trim = re.compile(r'[^\d,.]+')
url = "https://bscscan.com/txs?a=0x10ed43c718714eb63d5aa57b78b54704e256024e&ps=100&p=1"
baseurl = 'https://bscscan.com/tx/'
header = {"User-Agent": "Mozilla/5.0"}
scans = 0
all_data = set()
prev_block = 0

while True:
    scans += 1
    reqtxsInternal = requests.get(url, headers=header, timeout=2)
    souptxsInternal = BeautifulSoup(reqtxsInternal.content, 'html.parser')
    blocktxsInternal = souptxsInternal.findAll('table')[0].findAll('tr')
    for row in blocktxsInternal[1:]:
        txnhash = row.find_all('td')[1].text
        txnhashdetails = txnhash.strip()
        block = int(row.find_all('td')[3].text)
        value = row.find_all('td')[9].text
        amount = trim.sub('', value).replace(",", "")
        transval = float(amount)
        # skip rows from blocks older than the newest one seen
        if (prev_block != 0) and (block < prev_block):
            continue
        else:
            prev_block = block
        if (block >= prev_block) and (transval >= 1):
            print("Do something with the data -> " + str(block) + " " + str(transval))
            # collect the data
            all_data.add((block, transval))
    print(" -> Whole Page Scanned: ", scans)
    sleep(1)
    # do something with the collected data
    print('Do something with this collected data:', all_data)
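To see why the set makes overlapping re-scans safe, here is a minimal sketch with two simulated scans whose rows overlap (the values are hypothetical):

```python
all_data = set()

# Two simulated scans; the second repeats a row from the first
scan1 = [(10186993, 1.23), (10186993, 4.68)]
scan2 = [(10186994, 1.02), (10186993, 1.23)]  # (10186993, 1.23) again

for block, transval in scan1 + scan2:
    # only rows not already collected count as new
    if (block, transval) not in all_data:
        all_data.add((block, transval))
        print("new row ->", block, transval)
```

One caveat: keying on (block, value) will also collapse two genuinely different transactions that happen to share a block and value; keying on the transaction hash (the txnhashdetails already parsed above) would be a safer deduplication key.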