Parsing Alexa to get rank information



I want to extract the integer associated with <REACH RANK="1"/> from Alexa's response. That is, from this:

<!--
Need more Alexa data?  Find our APIS here: https://aws.amazon.com/alexa/
-->
<ALEXA VER="0.9" URL="google.com/" HOME="0" AID="=" IDN="google.com/">
<SD TITLE="A" FLAGS="" HOST="google.com">
<OWNER NAME="aa"/>
</SD>
<SD>
<POPULARITY URL="google.com/" TEXT="1" SOURCE="panel"/>
<REACH RANK="1"/>
<RANK DELTA="+0"/> 
<COUNTRY CODE="US" NAME="United States" RANK="1"/>
</SD>
</ALEXA>

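So far I have tried the suggestions from this GitHub post, and I have experimented with regex patterns found on RegExr, trying different variants of the code with those patterns.

What I currently have: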

try:
    xml = (BeautifulSoup(urllib.request.urlopen("http://data.alexa.com/data?cli=10&dat=snbamz&url=" + url).read(), "xml"))
    rank = re.search(r'"<REACH[^>]*RANK="(d+")', xml)
    print(rank)
    print(f'Your rank for {url} is {rank}')
except Exception as err:
    print(err)
    rank = -1
    #print(f'Your rank for {url} is {rank}')

It either 1) hits the exception or 2) results in this error:

expected string or bytes-like object

Since you are already using BeautifulSoup, you can let its xml parser do the parsing instead of a regex. Like this:

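import requests
from bs4 import BeautifulSoup

endpoint = 'http://data.alexa.com/data'
url = 'insert the value you are using here'
# Query-string parameters, matching the URL used in the question
params = dict(
    cli=10,
    dat='snbamz',
    url=url,
)
r = requests.get(endpoint, params=params)
soup = BeautifulSoup(r.content, 'xml')  # the 'xml' parser requires lxml
rank = soup.REACH.get('RANK')  # attribute value as a string, e.g. '1'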

This is not a complete example, but I hope it can serve as a starting point for you to build on.

Here is a proof of concept:
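
One possible way to build on it, sketched under assumptions: since the goal is an integer, the lookup can be wrapped in a small function that converts the attribute and returns None when the REACH element is missing. This sketch deliberately swaps BeautifulSoup for the standard library's xml.etree.ElementTree so that it runs without lxml. The function name reach_rank and the trimmed sample document are illustrative, not part of the original code.

```python
import xml.etree.ElementTree as ET
from typing import Optional


def reach_rank(xml_text: str) -> Optional[int]:
    """Return the REACH rank as an int, or None if it is missing or malformed."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError:
        return None
    reach = root.find(".//REACH")  # first REACH element anywhere below the root
    if reach is None:
        return None
    try:
        return int(reach.get("RANK", ""))
    except ValueError:
        return None


# Trimmed-down sample shaped like the response shown in the question
sample = """<ALEXA VER="0.9" URL="google.com/">
  <SD>
    <POPULARITY URL="google.com/" TEXT="1" SOURCE="panel"/>
    <REACH RANK="1"/>
  </SD>
</ALEXA>"""

print(reach_rank(sample))      # 1
print(reach_rank("<ALEXA/>"))  # None
```

Returning None rather than -1 for a missing rank is a design choice in this sketch; use whichever sentinel fits the calling code.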

Python 3.7.5 (default, Dec 15 2019, 17:54:26) 
[GCC 9.2.1 20190827 (Red Hat 9.2.1-1)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from bs4 import BeautifulSoup
>>> alexa = '''
... <!--
...     Need more Alexa data?  Find our APIS here: https://aws.amazon.com/alexa/
... -->
... <ALEXA VER="0.9" URL="google.com/" HOME="0" AID="=" IDN="google.com/">
...   <SD TITLE="A" FLAGS="" HOST="google.com">
...     <OWNER NAME="aa"/>
...   </SD>
...   <SD>
...     <POPULARITY URL="google.com/" TEXT="1" SOURCE="panel"/>
...     <REACH RANK="1"/>
...     <RANK DELTA="+0"/> 
...     <COUNTRY CODE="US" NAME="United States" RANK="1"/>
...   </SD>
... </ALEXA>
... '''
>>> soup = BeautifulSoup(alexa, 'xml')
>>> rank = soup.REACH.get('RANK')
>>> rank
'1'
>>>
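
As for the error in the question: re.search expects a str or bytes object, but xml there is a BeautifulSoup object, which is what "expected string or bytes-like object" is complaining about. The pattern also appears to have a stray leading quote, d+ where \d+ was presumably intended, and the closing quote inside the capture group. If a regex is still preferred, a minimal sketch might look like the following. The variable response_text and its sample value are illustrative, and a real parser is generally the more robust choice.

```python
import re

# Illustrative stand-in for the decoded response body (a plain str)
response_text = (
    '<SD><POPULARITY URL="google.com/" TEXT="1" SOURCE="panel"/>'
    '<REACH RANK="1"/></SD>'
)

# Capture only the digits inside RANK="..." on the REACH element
match = re.search(r'<REACH[^>]*\bRANK="(\d+)"', response_text)

# re.search returns a match object (or None), so take group(1) for the value
rank = int(match.group(1)) if match else -1
print(rank)  # 1
```

Printing the match object directly, as the question's code does, would show the whole match object rather than the captured number.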
