I am trying to scrape a web page with Python, but to scrape it I first need to accept the cookies on the page.
The code I tried is:
import json
import requests

URL = "https://www.howoge.de/wohnungen-gewerbe/wohnungssuche.html"
# Load the cookies exported from the browser
with open('cookies') as f:
    j = json.load(f)
session = requests.Session()
for cookie in j:
    session.cookies.set(cookie['name'], cookie['value'])
r = session.get(URL)
Although this does not raise any error, the cookies are still not accepted.
Here are my cookies:
[
{
"domain": ".howoge.de",
"expirationDate": 1694266885,
"hostOnly": false,
"httpOnly": false,
"name": "__cmpcpcu10543",
"path": "/",
"sameSite": "no_restriction",
"secure": true,
"session": false,
"storeId": null,
"value": "__51_54__"
},
{
"domain": ".howoge.de",
"expirationDate": 1694266885,
"hostOnly": false,
"httpOnly": false,
"name": "__cmpconsent10543",
"path": "/",
"sameSite": "no_restriction",
"secure": true,
"session": false,
"storeId": null,
"value": "BPfEJk4PfEJk4AfHIBDEDXAAAAAAAA"
},
{
"domain": "www.howoge.de",
"expirationDate": 1696858880,
"hostOnly": true,
"httpOnly": false,
"name": "__cmpcc",
"path": "/",
"sameSite": "no_restriction",
"secure": true,
"session": false,
"storeId": null,
"value": "1"
},
{
"domain": ".howoge.de",
"expirationDate": 1694266885,
"hostOnly": false,
"httpOnly": false,
"name": "__cmpcvcu10543",
"path": "/",
"sameSite": "no_restriction",
"secure": true,
"session": false,
"storeId": null,
"value": "__s974_U__"
},
{
"domain": "www.howoge.de",
"hostOnly": true,
"httpOnly": false,
"name": "PHPSESSID",
"path": "/",
"sameSite": null,
"secure": false,
"session": true,
"storeId": null,
"value": "8pnd5h5up4v4rjh498if7hedac"
}
]
What would be the best way to solve this?
You don't need cookies or anything else.
Try this:
import requests
api_url = "https://www.howoge.de/?type=999&tx_howsite_json_list[action]=immoList"
request_payload = {
    "tx_howsite_json_list[page]": "1",
    "tx_howsite_json_list[limit]": "12",
    "tx_howsite_json_list[lang]": "",
    "tx_howsite_json_list[rent]": "",
    "tx_howsite_json_list[area]": "",
    "tx_howsite_json_list[rooms]": "egal",
    "tx_howsite_json_list[wbs]": "all-offers",
}
response = requests.post(api_url, data=request_payload).json()
for item in response["immoobjects"]:
    print(f'{item["title"]} - {item["rent"]}')
Output:
Rüdickenstraße 23, 13053 Berlin - 1174.31
Rüdickenstraße 23, 13053 Berlin - 1174.31
Rotkamp 4, 13053 Berlin - 1428.25
Rotkamp 6, 13053 Berlin - 617.41
Rotkamp 6, 13053 Berlin - 1147.71
Rotkamp 6, 13053 Berlin - 1147.71
Rotkamp 6, 13053 Berlin - 565.12
Frankfurter Allee 218, 10365 Berlin - 513.85
Frankfurter Allee 218, 10365 Berlin - 501.6
Frankfurter Allee 218, 10365 Berlin - 513.85
Frankfurter Allee 218, 10365 Berlin - 717
Frankfurter Allee 218, 10365 Berlin - 890.6
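If the exported cookies are still wanted for some other request, a minimal sketch of copying them into a session with their domain and path preserved might look like the following. `load_cookies` is a hypothetical helper, not part of requests, and whether the site honours these cookies is not verified here.

```python
import requests


def load_cookies(session, cookies):
    """Copy browser-exported cookie dicts into a requests session.

    Each dict is assumed to carry "name", "value", "domain" and "path",
    as in the export shown in the question.
    """
    for cookie in cookies:
        session.cookies.set(
            cookie["name"],
            cookie["value"],
            domain=cookie["domain"],
            path=cookie["path"],
        )
    return session
```

With the file from the question, this could be called as load_cookies(requests.Session(), json.load(f)) inside the with open('cookies') as f: block.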
You need neither cookies nor headers. Try this to get a clean DataFrame of the listings:
import json
import pandas as pd
import requests

data = {
    'tx_howsite_json_list[page]': '1',
    'tx_howsite_json_list[limit]': '12',
    'tx_howsite_json_list[lang]': '',
    'tx_howsite_json_list[rent]': '',
    'tx_howsite_json_list[area]': '',
    'tx_howsite_json_list[rooms]': 'egal',
    'tx_howsite_json_list[wbs]': 'all-offers',
}
response = requests.post('https://www.howoge.de/?type=999&tx_howsite_json_list[action]=immoList', data=data)
# The listings are under the "immoobjects" key of the JSON response
df = pd.DataFrame(json.loads(response.content)["immoobjects"])
df.head()
uid title image district rent area rooms wbs features coordinates icon link favorite notice
0 19335 Rüdickenstraße 23, 13053 Berlin /fileadmin/promos/downloadedImages/266355f2369... Alt-Hohenschönhausen 1174.31 77 3 nein [Balkon/Loggia, Fußbodenheizung, Zentralheizun... {'lat': '52.5600754', 'lng': '13.5089916'} icon-Figures-haus_full /wohnungen-gewerbe/wohnungssuche/detail/1771-1... False Schöne 3-Zimmer-Wohnung
1 19336 Rüdickenstraße 23, 13053 Berlin /fileadmin/promos/downloadedImages/266355f2369... Alt-Hohenschönhausen 1174.31 77 3 nein [Balkon/Loggia, Fußbodenheizung, Zentralheizun... {'lat': '52.5600754', 'lng': '13.5089916'} icon-Figures-haus_full /wohnungen-gewerbe/wohnungssuche/detail/1771-1... False
If you want to get the listings from the following pages, change the value of 'tx_howsite_json_list[page]': '1' in data.
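Changing the page value by hand can be wrapped in a loop. The sketch below is one way to do that, not a verified client for this site: `build_payload`, `fetch_all` and `max_pages` are hypothetical names, the timeout is an added precaution, and the stop condition assumes that a page past the end returns an empty "immoobjects" list.

```python
import requests

API_URL = "https://www.howoge.de/?type=999&tx_howsite_json_list[action]=immoList"


def build_payload(page, limit=12):
    """Form data for one page of results; same keys as the request above."""
    return {
        "tx_howsite_json_list[page]": str(page),
        "tx_howsite_json_list[limit]": str(limit),
        "tx_howsite_json_list[lang]": "",
        "tx_howsite_json_list[rent]": "",
        "tx_howsite_json_list[area]": "",
        "tx_howsite_json_list[rooms]": "egal",
        "tx_howsite_json_list[wbs]": "all-offers",
    }


def fetch_all(post=requests.post, max_pages=20):
    """Collect listings page by page until a page comes back empty.

    Assumption: an out-of-range page yields an empty "immoobjects" list.
    max_pages is a safety cap in case that assumption does not hold.
    """
    listings = []
    for page in range(1, max_pages + 1):
        response = post(API_URL, data=build_payload(page), timeout=30)
        response.raise_for_status()
        items = response.json().get("immoobjects", [])
        if not items:
            break
        listings.extend(items)
    return listings
```

The result of fetch_all() is a plain list of dicts, so it can be passed to pd.DataFrame in the same way as the single page above. The `post` parameter exists only so the loop can be exercised without network access.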