I am trying to scrape the page text from this website. The Arabic and French pages share the same URL. I tried the following code:
headers = {'Accept-Language': "lang="AR-DZ"}
r = requests.get("http://www.mae.gov.dz/news_article/6396.aspx",headers)
soup = BeautifulSoup(r.content,"lxml")
print(soup.getText)
I get the following error message:
<bound method Tag.get_text of <html><head><title>Request Rejected</title></head><body>The requested URL was rejected. Please consult with your administrator.<br/><br/>Your support ID is: 12750291427324767866<br/><br/><a href="javascript:history.back();">[Go Back]</a></body></html>>
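When I remove the headers, BeautifulSoup scrapes the page in French.
My goal is to collect statements and speeches in Arabic in order to build a corpus. Thanks for your help.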
First: in "lang="AR-DZ" you open a " before AR-DZ but never close it after AR-DZ; it should be "lang=AR-DZ" instead.
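A side note on the header attempt, as a minimal sketch: Accept-Language values are normally bare language tags such as ar-DZ, optionally with q weights, and requests.get() takes its second positional argument as params, so headers need to be passed as headers=headers. The helper accept_language below is a hypothetical name, and whether this particular server honors the header at all is an assumption that has not been verified here.

```python
def accept_language(tags):
    """Build an Accept-Language value from language tags, most preferred first.

    Hypothetical helper for illustration; it is not part of requests.
    """
    parts = []
    for i, tag in enumerate(tags):
        if i == 0:
            # The first tag carries the implicit default weight q=1.
            parts.append(tag)
        else:
            # Later tags get decreasing weights, never below 0.1.
            q = max(0.1, 1.0 - 0.1 * i)
            parts.append(f"{tag};q={q:.1f}")
    return ", ".join(parts)


headers = {"Accept-Language": accept_language(["ar-DZ", "ar", "fr"])}

# Assumed usage (needs network access and the requests package):
#   import requests
#   r = requests.get("http://www.mae.gov.dz/news_article/6396.aspx", headers=headers)
```

If the server ignores the header or still rejects the request, the session and cookie approaches in this thread remain the ones to rely on.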
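Normally, to change the language of this page in a browser, you have to click a link with the URL http://www.mae.gov.dz/select_language.aspx?language=ar&file=default_ar.aspx, which contains language=ar, so you can do the same thing in your code.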
Use Session() to remember cookies, and call get() on this URL first. That sets the correct language in the cookies.
import requests
from bs4 import BeautifulSoup
#headers = {'User-Agent': 'Mozilla/5.0'}
#headers = {'Accept-Language': "lang=AR-DZ"}
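s = requests.Session()
# Visit the language-selection URL first so the session stores the Arabic language cookie.
url = 'http://www.mae.gov.dz/select_language.aspx?language=ar&file=default_ar.aspx'
r = s.get(url)#, headers=headers)
# The same session now requests the article with that cookie attached.
url = 'http://www.mae.gov.dz/news_article/6396.aspx'
r = s.get(url)#, headers=headers)
soup = BeautifulSoup(r.content, "lxml")
# Call the method; printing soup.getText without () only shows the bound method.
print(soup.get_text())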
Set the language cookie to "ar":
import requests
from bs4 import BeautifulSoup
cookies = dict(language='ar')
r = requests.get("http://www.mae.gov.dz/news_article/6396.aspx",cookies=cookies)
soup = BeautifulSoup(r.content,"lxml")
print(soup.text)
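Since the goal is a corpus, it may help to check that each response really is Arabic rather than the French page or the "Request Rejected" page before saving it. The following is a rough sketch using only the standard library; arabic_ratio, looks_arabic and the 0.5 threshold are hypothetical choices for illustration, not something the site or BeautifulSoup provides.

```python
def arabic_ratio(text):
    """Return the share of alphabetic characters that fall in the Arabic Unicode block."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    arabic = sum(1 for ch in letters if "\u0600" <= ch <= "\u06FF")
    return arabic / len(letters)


def looks_arabic(text, threshold=0.5):
    """Heuristic check: not the rejection page, and mostly Arabic letters."""
    if "Request Rejected" in text:
        return False
    return arabic_ratio(text) >= threshold
```

With either answer above, one could call looks_arabic(soup.get_text()) and skip or retry the page when it returns False.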