Extract the integer associated with <REACH RANK="1"/> from Alexa.
What I mean is:
<!--
Need more Alexa data? Find our APIS here: https://aws.amazon.com/alexa/
-->
<ALEXA VER="0.9" URL="google.com/" HOME="0" AID="=" IDN="google.com/">
<SD TITLE="A" FLAGS="" HOST="google.com">
<OWNER NAME="aa"/>
</SD>
<SD>
<POPULARITY URL="google.com/" TEXT="1" SOURCE="panel"/>
<REACH RANK="1"/>
<RANK DELTA="+0"/>
<COUNTRY CODE="US" NAME="United States" RANK="1"/>
</SD>
</ALEXA>
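What I have tried so far are the suggestions from this GitHub post, along with tinkering with regex patterns I found on RegExr while trying different variations of the code.
What I currently have: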
try:
    xml = (BeautifulSoup(urllib.request.urlopen("http://data.alexa.com/data?cli=10&dat=snbamz&url=" + url).read(), "xml"))
    rank = re.search(r'"<REACH[^>]*RANK="(d+")', xml)
    print(rank)
    print(f'Your rank for {url} is {rank}')
except Exception as err:
    print(err)
    rank = -1
#print(f'Your rank for {url} is {rank}')
It either 1) hits the exception or 2) results in this error:
expected string or bytes-like object
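That error is raised because re.search is being handed a BeautifulSoup object instead of a str. If you want to stay with a regex, it has to run on the raw response text. The pattern as posted also looks damaged (a bare d where \d was probably intended, and quotes in odd places), so the following is only a minimal sketch under that assumption. The names REACH_RANK and reach_rank_from_text are made up for illustration.

```python
import re

# Assumed intent of the pattern in the question: capture the digits inside
# RANK="..." on the <REACH> element.
REACH_RANK = re.compile(r'<REACH[^>]*\bRANK="(\d+)"')

def reach_rank_from_text(xml_text):
    """Return the REACH rank as an int, or -1 when no match is found."""
    match = REACH_RANK.search(xml_text)  # xml_text must be a str, not a soup object
    return int(match.group(1)) if match else -1

print(reach_rank_from_text('<SD><REACH RANK="1"/></SD>'))  # prints 1
```

Note that the function takes the decoded response body (a str), not the parsed document.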
Since you are already using BeautifulSoup, you can use its xml parser to parse the response instead of a regex.
Something like this:
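import requests
from bs4 import BeautifulSoup

endpoint = 'http://data.alexa.com/data'
url = 'insert the value you are using here'
# Query-string parameters, matching the URL used in the question
params = dict(
    cli=10,
    dat='snbamz',
    url=url
)
# Request the endpoint (not the site URL) and send the values as query parameters
r = requests.get(endpoint, params=params)
# The 'xml' parser requires lxml to be installed
soup = BeautifulSoup(r.content, 'xml')
# Attribute values come back as strings, e.g. '1'
rank = soup.REACH.get('RANK')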
This is not a complete example, but I hope it can serve as a starting point for you to build on.
Here is a proof of concept:
Python 3.7.5 (default, Dec 15 2019, 17:54:26)
[GCC 9.2.1 20190827 (Red Hat 9.2.1-1)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from bs4 import BeautifulSoup
>>> alexa = '''
... <!--
... Need more Alexa data? Find our APIS here: https://aws.amazon.com/alexa/
... -->
... <ALEXA VER="0.9" URL="google.com/" HOME="0" AID="=" IDN="google.com/">
... <SD TITLE="A" FLAGS="" HOST="google.com">
... <OWNER NAME="aa"/>
... </SD>
... <SD>
... <POPULARITY URL="google.com/" TEXT="1" SOURCE="panel"/>
... <REACH RANK="1"/>
... <RANK DELTA="+0"/>
... <COUNTRY CODE="US" NAME="United States" RANK="1"/>
... </SD>
... </ALEXA>
... '''
>>> soup = BeautifulSoup(alexa, 'xml')
>>> rank = soup.REACH.get('RANK')
>>> rank
'1'
>>>
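As the session shows, the value comes back as the string '1', not an integer. If you need an int and want to keep the -1 fallback from the question, a small helper could look like the sketch below. It uses the standard library's xml.etree.ElementTree rather than BeautifulSoup, purely so that it runs without lxml. The name parse_reach_rank and the default of -1 are my own choices for illustration.

```python
import xml.etree.ElementTree as ET

SAMPLE = '''
<!--
Need more Alexa data? Find our APIS here: https://aws.amazon.com/alexa/
-->
<ALEXA VER="0.9" URL="google.com/" HOME="0" AID="=" IDN="google.com/">
<SD>
<POPULARITY URL="google.com/" TEXT="1" SOURCE="panel"/>
<REACH RANK="1"/>
<RANK DELTA="+0"/>
</SD>
</ALEXA>
'''

def parse_reach_rank(xml_text, default=-1):
    """Return the RANK attribute of the first REACH element as an int.

    Falls back to `default` when the XML is malformed, REACH is absent,
    or RANK is not a whole number.
    """
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError:
        return default
    reach = root.find(".//REACH")  # first REACH element below the root
    if reach is None:
        return default
    try:
        return int(reach.get("RANK", ""))
    except ValueError:
        return default

print(parse_reach_rank(SAMPLE))  # prints 1
```

With the requests code above, you would call it as parse_reach_rank(r.text).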