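Hi Stack Overflow! I am trying to parse this website: https://www.ligloo.fr/annonce-immobiliere/studio.html and to navigate through its pages using this URL: https://www.ligloo.fr/annonce-immobiliere/STUDIO.html#!/?page=page_number&tri=pertinance
This is what I am currently running, but I get the same HTML tree on every iteration: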
import requests
from lxml import html

def main():
    user_agent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36'
    session = requests.session()
    # The HTTP header is named 'User-Agent'; 'user_agent' would be sent as an unrelated custom header
    session.headers.update({'User-Agent': user_agent})
    initial_url = 'https://www.ligloo.fr/annonce-immobiliere/{0}.html#!/?page={1}&tri=pertinance'
    categories = ('STUDIO', 'LOFTS', 'MAISON', 'APPART-2-PIECES-MOINS-DE-40-M2')
    for category in categories:
        dictionary_of_links = {}
        for page in range(1, 6):
            url = initial_url.format(category, page)
            result = session.get(url)
            tree = html.fromstring(result.text)  # why is tree the same every time?

if __name__ == '__main__':
    main()
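Edit: Thanks to everyone who helped! I found out that the .html file itself does not change while browsing the site, so using Selenium is the only option I can think of for now.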
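For anyone wondering why the tree is identical on every iteration: everything after the # in a URL is a fragment, which is handled by the browser (here presumably by the site's JavaScript) and is not part of the HTTP request. The sketch below only uses the standard library and the URL template from the question; the helper name url_sent_to_server is made up for illustration. It shows that all five page URLs collapse to the same request target.

```python
from urllib.parse import urldefrag

# URL template copied from the question
TEMPLATE = 'https://www.ligloo.fr/annonce-immobiliere/{0}.html#!/?page={1}&tri=pertinance'


def url_sent_to_server(url):
    """Hypothetical helper: return the URL without its fragment (the part after '#')."""
    return urldefrag(url).url


# Build the five page URLs for one category and strip the fragment from each
requested = {url_sent_to_server(TEMPLATE.format('STUDIO', page)) for page in range(1, 6)}
print(requested)
# {'https://www.ligloo.fr/annonce-immobiliere/STUDIO.html'}
```

Since the set contains a single URL, the server sees the same request five times, which matches the behaviour described in the question.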
I think your URL does not encode its special characters. Try changing
?page={1}
to
?page%3D{1}
For Python 3.6+ (f-strings), you can let requests build the query string from params:
import requests

initial_url = 'https://www.ligloo.fr/annonce-immobiliere'
categories = ('STUDIO', 'LOFTS', 'MAISON', 'APPART-2-PIECES-MOINS-DE-40-M2')
for category in categories:
    url = f"{initial_url}/{category}.html"
    for page in range(1, 6):
        # requests encodes params and appends them to the URL as ?page=...&tri=...
        params = {'page': page, 'tri': 'pertinance'}
        result = requests.get(url, params=params)
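To see what this changes, here is a rough standard-library sketch of the URLs that passing params produces. build_page_url is a made-up helper, and whether the site actually honours page as a real query parameter is an assumption; the edit in the question suggests the static HTML may not change, in which case a browser-driven tool would still be needed.

```python
from urllib.parse import urlencode, urlsplit

base_url = 'https://www.ligloo.fr/annonce-immobiliere/STUDIO.html'


def build_page_url(url, page, sort='pertinance'):
    """Hypothetical helper: append page and tri as a real query string."""
    return f"{url}?{urlencode({'page': page, 'tri': sort})}"


page_urls = [build_page_url(base_url, page) for page in range(1, 6)]
for page_url in page_urls:
    # The query component, unlike a fragment, is part of the HTTP request
    print(page_url, '->', urlsplit(page_url).query)
```

Unlike the #!/?page=... form, each of these URLs is distinct from the server's point of view, so it is at least possible for the responses to differ.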