
Crawler4J seed URL gets encoded and an error page is crawled instead of the actual page

Updated: 2023-09-11

I am using crawler4j to crawl user profiles on GitHub. For example, I want to crawl https://github.com/search?q=java+location:india&p=1. For now I add the URL as a seed like this:

String url = "https://github.com/search?q=java+location:india&p=1";
controller.addSeed(url);

When crawler4j starts, the URL it actually crawls is: https://github.com/search?q=java%2Blocation%3Aindia&p=1

This gives me an error page. What should I do? I tried passing an already encoded URL, but that does not work either.

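I eventually had to make a slight change to the crawler4j source code. File: URLCanonicalizer.java, method: percentEncodeRfc3986.

I just commented out the first line of this method, and after that I was able to crawl and fetch my results: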

//string = string.replace("+", "%2B");

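My URL contained a + character, which was being replaced by %2B, so I got an error page. I wonder why they specifically replace the + character before encoding the entire URL.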
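One plausible reason for that first line can be shown with the JDK alone. The sketch below is not crawler4j code: `PlusEncodingDemo` and `reencode` are hypothetical names, and it assumes the rest of the method decodes the value and then re-encodes it with `java.net.URLDecoder` and `java.net.URLEncoder`, which the text above does not confirm. It uses the `Charset` overloads, so it needs Java 10 or later.

```java
import java.net.URLDecoder;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class PlusEncodingDemo {

    // Hypothetical helper (not part of crawler4j): decode a query value and
    // re-encode it, optionally escaping a literal '+' first.
    public static String reencode(String value, boolean escapePlusFirst) {
        String s = value;
        if (escapePlusFirst) {
            // Same replacement as the line that was commented out above.
            s = s.replace("+", "%2B");
        }
        // URLDecoder follows form encoding: '+' becomes a space, "%2B" becomes '+'.
        s = URLDecoder.decode(s, StandardCharsets.UTF_8);
        // URLEncoder writes a space as '+', a literal '+' as "%2B", ':' as "%3A".
        return URLEncoder.encode(s, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String query = "java+location:india";
        // With the replacement: the plus survives as a literal plus sign.
        System.out.println(reencode(query, true));   // java%2Blocation%3Aindia
        // Without it: the plus is read as a space and written back as '+'.
        System.out.println(reencode(query, false));  // java+location%3Aindia
    }
}
```

Under these assumptions, the replacement keeps a literal plus from being decoded as a space. In a search query such as `q=java+location:india`, however, the plus is meant as a space, so sending `%2B` probably makes the server look for the text "java+location:india" instead. That would explain why removing the line fixed this particular seed URL.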
