I have written a colly script to collect port authority information from a site.
func main() {
	// Temp variables
	var tcountry, tport string
	// Colly collector
	c := colly.NewCollector()
	// Ignore robots.txt
	c.IgnoreRobotsTxt = true
	// Time out after 20 seconds
	c.SetRequestTimeout(20 * time.Second)
	// Use random user agents during requests
	extensions.RandomUserAgent(c)
	// Set limits on colly's operation
	c.Limit(&colly.LimitRule{
		// Filter domains affected by this rule
		DomainGlob: "searates.com/*",
		// Set a delay between requests to these domains
		Delay: 1 * time.Second,
		// Add an additional random delay
		RandomDelay: 3 * time.Second,
	})
	// Find and visit all country links
	c.OnHTML("#clist", func(e *colly.HTMLElement) {
		// fmt.Println("Country List: ", e.ChildAttrs("a", "href"))
		e.ForEach("li.col-xs-6.col-md-3", func(_ int, el *colly.HTMLElement) {
			tcountry = el.ChildText("a")
			link := el.ChildAttr("a", "href")
			fmt.Println("Country: ", tcountry, link)
			e.Request.Visit(link)
		})
	})
	// Find and visit all port links
	c.OnHTML("#plist", func(h *colly.HTMLElement) {
		// fmt.Println("Port List: ", h.ChildAttrs("a", "href"))
		h.ForEach("li.col-xs-6.col-md-3", func(_ int, el *colly.HTMLElement) {
			tport = el.ChildText("a")
			link := el.ChildAttr("a", "href")
			fmt.Println("Port: ", tport, link)
			h.Request.Visit(link)
		})
	})
	// Extract the port authority from each port info page
	c.OnHTML("div.row", func(e *colly.HTMLElement) {
		portAuth := e.ChildText("table#port_det tbody:nth-child(1) tr:nth-child(2) td:nth-child(2)")
		fmt.Println("Port Authority: ", portAuth)
	})
	c.Visit("https://www.searates.com/maritime/")
}
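I have two questions:
1. I am more or less forced to use e.Request.Visit, because d.Visit (when I clone c as d) never gets executed. When I clone c as d and use d for the "port info" part, that entire block is skipped. What am I doing wrong, and why does it behave this way?
2. In the current code, as is, fmt.Println("Port Authority: ", portAuth) runs twice for each port. The output looks like this: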
❯ go run .
Country: Albania /maritime/albania
Port: Durres /port/durres_al
Port Authority: Durres Port Authority
Port Authority:
Port: Sarande /port/sarande_al
Port Authority: Sarande Port Authority
Port Authority:
Port: Shengjin /port/shengjin_al
Port Authority: Shengjin Port Authority
Port Authority:
Again, I don't understand why it is printed twice. Please help :(
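From the Go docs:
collector.Visit: Visit starts Collector's collecting job by creating a request to the URL specified in parameter. Visit also calls the previously provided callbacks.
Request.Visit: Visit continues Collector's collecting job by creating a request and preserves the Context of the previous request. Visit also calls the previously provided callbacks.
The difference lies in the depth argument and the context. If you call collector.Visit inside an event handler, the depth is always 1.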
Here is how the two calls differ:
collector.Visit:
	if c.CheckHead {
		if check := c.scrape(URL, "HEAD", 1, nil, nil, nil, true); check != nil {
			return check
		}
	}
	return c.scrape(URL, "GET", 1, nil, nil, nil, true)
Request.Visit:
	return r.collector.scrape(r.AbsoluteURL(URL), "GET", r.Depth+1, nil, r.Ctx, nil, true)
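Specifically, to invoke the cloned collector d, you need to call d.Visit from inside the c.OnHTML event handler. See the coursera example. You also need to use AbsoluteURL, because the cloned collector has no context for the link (for example, when the link is relative). Here is the whole thing: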
package main

// Import paths assume colly v2; drop "/v2" if you are on v1.
import (
	"fmt"
	"time"

	"github.com/gocolly/colly/v2"
	"github.com/gocolly/colly/v2/extensions"
)

func main() {
	// Temp variables
	var tcountry, tport string
	// Colly collector
	c := colly.NewCollector()
	// Ignore robots.txt
	c.IgnoreRobotsTxt = true
	// Time out after 20 seconds
	c.SetRequestTimeout(20 * time.Second)
	// Use random user agents during requests
	extensions.RandomUserAgent(c)
	// Set limits on colly's operation
	c.Limit(&colly.LimitRule{
		// The glob is matched against the request host only, so the
		// original "searates.com/*" would never match "www.searates.com"
		DomainGlob: "*searates.com",
		// Set a delay between requests to these domains
		Delay: 1 * time.Second,
		// Add an additional random delay
		RandomDelay: 3 * time.Second,
	})
	// Clone c: d keeps the configuration but has no callbacks of its own
	d := c.Clone()
	// Find and visit all country links
	c.OnHTML("#clist", func(e *colly.HTMLElement) {
		e.ForEach("li.col-xs-6.col-md-3", func(_ int, el *colly.HTMLElement) {
			tcountry = el.ChildText("a")
			link := el.ChildAttr("a", "href")
			fmt.Println("Country: ", tcountry, link)
			e.Request.Visit(link)
		})
	})
	// Find all port links and hand them to the cloned collector
	c.OnHTML("#plist", func(h *colly.HTMLElement) {
		h.ForEach("li.col-xs-6.col-md-3", func(_ int, el *colly.HTMLElement) {
			tport = el.ChildText("a")
			link := el.ChildAttr("a", "href")
			fmt.Println("Port: ", tport, link)
			// d.Visit does not resolve relative links, so resolve first
			absoluteURL := h.Request.AbsoluteURL(link)
			d.Visit(absoluteURL)
		})
	})
	// Extract the port authority from each port info page
	d.OnHTML("div.row", func(e *colly.HTMLElement) {
		portAuth := e.ChildText("table#port_det tbody:nth-child(1) tr:nth-child(2) td:nth-child(2)")
		// Skip div.row elements that do not contain the table
		if len(portAuth) > 0 {
			fmt.Println("Port Authority: ", portAuth)
		}
	})
	c.Visit("https://www.searates.com/maritime/")
}
Note how the absolute URL is used: the context differs between collectors, so the cloned collector cannot follow relative URL links.
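For illustration, the kind of resolution involved can be sketched with the standard library alone. This is not colly's implementation; resolve is a hypothetical helper, and the page URL is assumed from the links shown in the question's output.

```go
package main

import (
	"fmt"
	"net/url"
)

// resolve is a hypothetical helper: it resolves href against the URL of
// the page on which the link was found, using only net/url.
func resolve(pageURL, href string) (string, error) {
	base, err := url.Parse(pageURL)
	if err != nil {
		return "", err
	}
	ref, err := url.Parse(href)
	if err != nil {
		return "", err
	}
	return base.ResolveReference(ref).String(), nil
}

func main() {
	// The country page URL is an assumption based on the printed links.
	abs, err := resolve("https://www.searates.com/maritime/albania", "/port/durres_al")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(abs) // https://www.searates.com/port/durres_al
}
```

A bare "/port/durres_al" carries no scheme or host, which is why it has to be resolved against the current page before being handed to a different collector.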
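As for your second question, why it prints twice: there are two div.row elements on the given page, so the handler runs once for each. I tried several different CSS selectors to apply the event handler only to the first div.row, but simply adding a check that the string length is greater than 0 turned out to be easier.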