Nuances between Request.Visit and Collector.Visit



I have written a colly script to collect port authority information from a site.

func main() {
	// Temp variables
	var tcountry, tport string
	// Colly collector
	c := colly.NewCollector()
	// Ignore robots.txt
	c.IgnoreRobotsTxt = true
	// Time out after 20 seconds
	c.SetRequestTimeout(20 * time.Second)
	// Use random user agents during requests
	extensions.RandomUserAgent(c)
	// Set limits on colly operation
	c.Limit(&colly.LimitRule{
		// Filter domains affected by this rule
		DomainGlob: "searates.com/*",
		// Set a delay between requests to these domains
		Delay: 1 * time.Second,
		// Add an additional random delay
		RandomDelay: 3 * time.Second,
	})
	// Find and visit all country links
	c.OnHTML("#clist", func(e *colly.HTMLElement) {
		// fmt.Println("Country List: ", e.ChildAttrs("a", "href"))
		e.ForEach("li.col-xs-6.col-md-3", func(_ int, el *colly.HTMLElement) {
			tcountry = el.ChildText("a")
			link := el.ChildAttr("a", "href")
			fmt.Println("Country: ", tcountry, link)
			e.Request.Visit(link)
		})
	})
	// Find and visit all port links
	c.OnHTML("#plist", func(h *colly.HTMLElement) {
		// fmt.Println("Port List: ", h.ChildAttrs("a", "href"))
		h.ForEach("li.col-xs-6.col-md-3", func(_ int, el *colly.HTMLElement) {
			tport = el.ChildText("a")
			link := el.ChildAttr("a", "href")
			fmt.Println("Port: ", tport, link)
			h.Request.Visit(link)
		})
	})
	// Extract the port authority from each port info page
	c.OnHTML("div.row", func(e *colly.HTMLElement) {
		portAuth := e.ChildText("table#port_det tbody:nth-child(1) tr:nth-child(2) td:nth-child(2)")
		fmt.Println("Port Authority: ", portAuth)
	})
	c.Visit("https://www.searates.com/maritime/")
}

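I have two questions:

  1. I am more or less forced to use e.Request.Visit, because d.Visit (where d is a clone of c) never gets executed. When I clone c as d and use it for the "port info" part, that whole block is skipped. What am I doing wrong, and why does it behave this way?

  2. In the current code, as is, fmt.Println("Port Authority: ", portAuth) runs twice per port. The output looks like this: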

❯ go run .
Country:  Albania /maritime/albania
Port:  Durres /port/durres_al
Port Authority:  Durres Port Authority
Port Authority:  
Port:  Sarande /port/sarande_al
Port Authority:  Sarande Port Authority
Port Authority:  
Port:  Shengjin /port/shengjin_al
Port Authority:  Shengjin Port Authority
Port Authority:  

Again, I don't understand why it is printed twice. Please help :(

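From the Go docs:

Collector.Visit - Visit starts Collector's collecting job by creating a request to the URL specified in parameter. Visit also calls the previously provided callbacks.

Request.Visit - Visit continues Collector's collecting job by creating a request and preserves the Context of the previous request. Visit also calls the previously provided callbacks.

The difference lies in the depth parameter and the context. If you call Collector.Visit inside an event handler, the depth is always 1, and the context of the previous request is not carried over.

Here is how the two calls differ: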

Collector.Visit:

	if c.CheckHead {
		if check := c.scrape(URL, "HEAD", 1, nil, nil, nil, true); check != nil {
			return check
		}
	}
	return c.scrape(URL, "GET", 1, nil, nil, nil, true)

Request.Visit:

return r.collector.scrape(r.AbsoluteURL(URL), "GET", r.Depth+1, nil, r.Ctx, nil, true)
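
To make the depth difference concrete, here is a toy model of the two calls. It does not use colly at all: the request type and both functions are hypothetical stand-ins that only mirror the depth arguments visible in the snippets above (a literal 1 versus r.Depth+1).

```go
package main

import "fmt"

// request is a hypothetical stand-in for colly's Request; only the depth is modeled.
type request struct {
	depth int
}

// collectorVisit mimics Collector.Visit: every call starts a chain at depth 1.
func collectorVisit() request {
	return request{depth: 1}
}

// visit mimics Request.Visit: the follow-up request is one level deeper than its parent.
func (r request) visit() request {
	return request{depth: r.depth + 1}
}

func main() {
	root := collectorVisit()      // e.g. the /maritime/ index page
	country := root.visit()       // followed via Request.Visit
	port := country.visit()       // followed via Request.Visit again
	restarted := collectorVisit() // followed via Collector.Visit instead

	fmt.Println(root.depth, country.depth, port.depth, restarted.depth) // prints "1 2 3 1"
}
```

In practice this should mostly matter when a depth limit such as colly's MaxDepth option is set: requests started with Collector.Visit from inside a handler all report depth 1, so such a limit would not stop them.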

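Specifically, to make the cloned collector d run, you need to trigger d.Visit from inside a c.OnHTML event handler. See the coursera example. You also need to use AbsoluteURL, because the cloned collector has no context for the link (for example, when the link is relative). Here is the whole thing: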

package main

import (
	"fmt"
	"time"

	// Assuming colly v2; drop the /v2 suffix if you are on v1.
	"github.com/gocolly/colly/v2"
	"github.com/gocolly/colly/v2/extensions"
)

func main() {
	// Temp variables
	var tcountry, tport string
	// Colly collector
	c := colly.NewCollector()
	// Ignore robots.txt
	c.IgnoreRobotsTxt = true
	// Time out after 20 seconds
	c.SetRequestTimeout(20 * time.Second)
	// Use random user agents during requests
	extensions.RandomUserAgent(c)
	// Set limits on colly operation
	c.Limit(&colly.LimitRule{
		// DomainGlob is matched against the request host, so it must not
		// contain a path; "searates.com/*" would never match www.searates.com.
		DomainGlob: "*searates.com*",
		// Set a delay between requests to these domains
		Delay: 1 * time.Second,
		// Add an additional random delay
		RandomDelay: 3 * time.Second,
	})
	// Clone copies the configuration of c but none of its callbacks
	d := c.Clone()
	// Find and visit all country links
	c.OnHTML("#clist", func(e *colly.HTMLElement) {
		e.ForEach("li.col-xs-6.col-md-3", func(_ int, el *colly.HTMLElement) {
			tcountry = el.ChildText("a")
			link := el.ChildAttr("a", "href")
			fmt.Println("Country: ", tcountry, link)
			e.Request.Visit(link)
		})
	})
	// Find all port links and hand them to the cloned collector
	c.OnHTML("#plist", func(h *colly.HTMLElement) {
		h.ForEach("li.col-xs-6.col-md-3", func(_ int, el *colly.HTMLElement) {
			tport = el.ChildText("a")
			link := el.ChildAttr("a", "href")
			fmt.Println("Port: ", tport, link)
			// d has no request context, so resolve the relative link first
			absoluteURL := h.Request.AbsoluteURL(link)
			d.Visit(absoluteURL)
		})
	})
	// Extract the port authority from each port info page
	d.OnHTML("div.row", func(e *colly.HTMLElement) {
		portAuth := e.ChildText("table#port_det tbody:nth-child(1) tr:nth-child(2) td:nth-child(2)")
		// Skip div.row elements that do not contain the table
		if len(portAuth) > 0 {
			fmt.Println("Port Authority: ", portAuth)
		}
	})
	c.Visit("https://www.searates.com/maritime/")
}

Note how the absolute URL is used: the context differs between collectors, so the cloned collector cannot navigate relative links on its own.
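
As a rough illustration of what resolving a relative link involves, here is a minimal standard-library sketch. It is not colly's implementation: resolve is a hypothetical helper, and the page URL is an assumption pieced together from the start URL and the links in the question's output.

```go
package main

import (
	"fmt"
	"net/url"
)

// resolve turns a possibly relative link into an absolute URL, using the
// URL of the page it was found on as the base. It returns "" on parse errors.
func resolve(pageURL, link string) string {
	base, err := url.Parse(pageURL)
	if err != nil {
		return ""
	}
	abs, err := base.Parse(link)
	if err != nil {
		return ""
	}
	return abs.String()
}

func main() {
	// Assumed page URL for the Albania country page.
	page := "https://www.searates.com/maritime/albania"
	fmt.Println(resolve(page, "/port/durres_al")) // prints "https://www.searates.com/port/durres_al"
}
```

A bare "/port/durres_al" carries no scheme or host, which is why it only makes sense relative to the request that produced it. Request.Visit resolves it against that request (see the r.AbsoluteURL(URL) call above), whereas a call on a different collector has no such request to fall back on.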

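As for the second question, it prints twice because there are 2 div.row elements on the given page. The handler fires once for each of them, and judging by the output, one of them yields an empty string. I tried various CSS selectors to apply the event handler only to the first div.row, but simply checking that the string length is greater than 0 turned out to be easier.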
