Goroutines inside a for loop cause unexpected behavior



I am working on the web crawler exercise from the Go Tour.

I tried to solve it with a concurrent Mutex-based approach, based on the solution found here. I modified it to match the signature predefined in the original exercise. However, the crawler stops at the second level of the URL tree. While debugging, the differing behavior of a print statement completely confused me:

var done sync.WaitGroup
for _, u := range urls {
	done.Add(1)
	fmt.Printf("enter: %s\n", u) // here
	go func(url string) {
		defer done.Done()
		Crawl(u, depth-1, fetcher, f)
	}(u)
}
done.Wait()

If I put the print statement outside the goroutine, the output is what I expect, but I don't understand why the crawl stops there:

enter: https://golang.org/pkg/
enter: https://golang.org/cmd/

But if I put the print statement inside the goroutine,

var done sync.WaitGroup
for _, u := range urls {
	done.Add(1)
	go func(url string) {
		defer done.Done()
		fmt.Printf("enter: %s\n", u) // here
		Crawl(u, depth-1, fetcher, f)
	}(u)
}
done.Wait()

the output becomes

enter: https://golang.org/cmd/
enter: https://golang.org/cmd/

I have two questions:

  1. In the second case, why is `enter: https://golang.org/cmd/` printed twice?
  2. Why does the Crawl function stop when it hits an error, instead of continuing to traverse the URL tree?

PS: The second question may be related to the first one. I intentionally used `u` instead of `url` inside the goroutine to reproduce the bug that confused me.

Below is my modified solution:

package main

import (
	"fmt"
	"sync"
)

type Fetcher interface {
	// Fetch returns the body of URL and
	// a slice of URLs found on that page.
	Fetch(url string) (body string, urls []string, err error)
}

type fetchState struct {
	mu      sync.Mutex
	fetched map[string]bool
}

// Crawl uses fetcher to recursively crawl
// pages starting with url, to a maximum of depth.
func Crawl(url string, depth int, fetcher Fetcher, f *fetchState) {
	f.mu.Lock()
	already := f.fetched[url]
	f.fetched[url] = true
	f.mu.Unlock()

	if already {
		return
	}
	if depth <= 0 {
		return
	}

	body, urls, err := fetcher.Fetch(url)
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Printf("found: %s %q\n", url, body)

	var done sync.WaitGroup
	for _, u := range urls {
		done.Add(1)
		go func(url string) {
			defer done.Done()
			fmt.Printf("enter: %s\n", u)
			Crawl(u, depth-1, fetcher, f)
		}(u)
	}
	done.Wait()
}

func makeState() *fetchState {
	f := &fetchState{}
	f.fetched = make(map[string]bool)
	return f
}

func main() {
	Crawl("https://golang.org/", 4, fetcher, makeState())
}
// fakeFetcher is a Fetcher that returns canned results.
type fakeFetcher map[string]*fakeResult

type fakeResult struct {
	body string
	urls []string
}

func (f fakeFetcher) Fetch(url string) (string, []string, error) {
	if res, ok := f[url]; ok {
		return res.body, res.urls, nil
	}
	return "", nil, fmt.Errorf("not found: %s", url)
}

// fetcher is a populated fakeFetcher.
var fetcher = fakeFetcher{
	"https://golang.org/": &fakeResult{
		"The Go Programming Language",
		[]string{
			"https://golang.org/pkg/",
			"https://golang.org/cmd/",
		},
	},
	"https://golang.org/pkg/": &fakeResult{
		"Packages",
		[]string{
			"https://golang.org/",
			"https://golang.org/cmd/",
			"https://golang.org/pkg/fmt/",
			"https://golang.org/pkg/os/",
		},
	},
	"https://golang.org/pkg/fmt/": &fakeResult{
		"Package fmt",
		[]string{
			"https://golang.org/",
			"https://golang.org/pkg/",
		},
	},
	"https://golang.org/pkg/os/": &fakeResult{
		"Package os",
		[]string{
			"https://golang.org/",
			"https://golang.org/pkg/",
		},
	},
}

Welcome to Stack Overflow!

In your function literal you declared `url` as a parameter, but kept using `u` inside the body. The closure therefore captures the loop variable `u`.

Try this instead:

var done sync.WaitGroup
for _, u := range urls {
	done.Add(1)
	go func(url string) {
		defer done.Done()
		fmt.Printf("enter: %s\n", url)  // <- check the difference
		Crawl(url, depth-1, fetcher, f) // <- check the difference
	}(u)
}
done.Wait()

As for why the same value gets printed when you use the `u` variable: it is a very common mistake, see https://github.com/golang/go/wiki/CommonMistakes#using-goroutines-on-loop-iterator-variables

In short, the closures all capture the single loop variable by reference. By the time the goroutines actually run, they may all find the last value of the iteration in it.

I found this neat article that explains it in detail: https://eli.thegreenplace.net/2019/go-internals-capturing-loop-variables-in-closures/
