How to write output to a CSV from a concurrent web scraper in Go



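I'm new to Go and am trying to use its concurrency to build a basic scraper that extracts the title, meta description, and meta keywords from a list of URLs.

I can print the results to the terminal concurrently, but I can't figure out how to write the output to a CSV. I've tried every variation I could think of with my limited Go knowledge, and many of them ended up breaking the concurrency, so I'm a bit lost.

My code and the URL input file are below. Thanks in advance for any tips!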

// file name: metascraper.go
package main
import (
    // import standard libraries
    "encoding/csv"
    "fmt"
    "io"
    "log"
    "os"
    "time"
    // import third party libraries
    "github.com/PuerkitoBio/goquery"
)
func csvParsing() {
    file, err := os.Open("data/sample.csv")
    checkError("Cannot open file ", err)
    if err != nil {
        // err is printable
        // elements passed are separated by space automatically
        fmt.Println("Error:", err)
        return
    }
    // automatically call Close() at the end of current method
    defer file.Close()
    //
    reader := csv.NewReader(file)
    // options are available at:
    // http://golang.org/src/pkg/encoding/csv/reader.go?s=3213:3671#L94
    reader.Comma = ';'
    lineCount := 0
    fileWrite, err := os.Create("data/result.csv")
    checkError("Cannot create file", err)
    defer fileWrite.Close()
    writer := csv.NewWriter(fileWrite)
    defer writer.Flush()
    for {
        // read just one record
        record, err := reader.Read()
        // end-of-file is fitted into err
        if err == io.EOF {
            break
        } else if err != nil {
            fmt.Println("Error:", err)
            return
        }
        go func(url string) {
            // fmt.Println(msg)
            doc, err := goquery.NewDocument(url)
            if err != nil {
                checkError("No URL", err)
            }
            metaDescription := make(chan string, 1)
            pageTitle := make(chan string, 1)
            go func() {
                // time.Sleep(time.Second * 2)
                // use CSS selector found with the browser inspector
                // for each, use index and item
                pageTitle <- doc.Find("title").Contents().Text()
                doc.Find("meta").Each(func(index int, item *goquery.Selection) {
                    if item.AttrOr("name", "") == "description" {
                        metaDescription <- item.AttrOr("content", "")
                    }
                })
            }()
            select {
            case res := <-metaDescription:
                resTitle := <-pageTitle
                fmt.Println(res)
                fmt.Println(resTitle)
                // Have been trying to output to CSV here but it's not working
                // writer.Write([]string{url, resTitle, res})
                // err := writer.WriteString(`res`)
                // checkError("Cannot write to file", err)
            case <-time.After(time.Second * 2):
                fmt.Println("timeout 2")
            }
        }(record[0])
        fmt.Println()
        lineCount++
    }
}
func main() {
    csvParsing()
    //Code is to make sure there is a pause before program finishes so we can see output
    var input string
    fmt.Scanln(&input)
}
func checkError(message string, err error) {
    if err != nil {
        log.Fatal(message, err)
    }
}

The data/sample.csv input file with the URLs:

    http://jonathanmh.com
    http://keshavmalani.com
    http://google.com
    http://bing.com
    http://facebook.com

In the code you provided, you have commented out the following code:

// Have been trying to output to CSV here but it's not working
err = writer.Write([]string{url, resTitle, res})
checkError("Cannot write to file", err)

This code is correct, except for one problem. Earlier in the function, you have the following code:

fileWrite, err := os.Create("data/result.csv")
checkError("Cannot create file", err)
defer fileWrite.Close()

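This code closes the file as soon as your csvParsing() func returns, because that is when deferred calls run. The goroutines you started are usually still fetching pages at that point, so by the time they try to write, the deferred Close() has already closed the file and the write cannot succeed.

Solution: close the file only after the writes are done, for example by moving defer fileWrite.Close() into the concurrent func or something similar, so that you don't close the file before writing to it.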
