The Last Exercise in A Tour of Go: Web Crawler

The original exercise

  • Exercise: Web Crawler
  • In this exercise you'll use Go's concurrency features to parallelize a web crawler.
  • Modify the Crawl function to fetch URLs in parallel without fetching the same URL twice.
  • Hint: you can keep a cache of the URLs that have been fetched on a map, but maps alone are not safe for concurrent use!

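The hint's warning is easy to verify: unsynchronized concurrent map writes are a data race, and the Go runtime will usually abort with "fatal error: concurrent map writes". Here is a minimal standalone sketch (separate from the exercise code, names invented for illustration) that "go run -race" flags immediately:

package main

import "fmt"

func main() {
	visited := make(map[string]bool)
	done := make(chan bool)
	// Two goroutines write to the same map with no synchronization:
	// this is a data race.
	for i := 0; i < 2; i++ {
		go func(id int) {
			for j := 0; j < 1000; j++ {
				visited[fmt.Sprintf("url-%d-%d", id, j)] = true
			}
			done <- true
		}(i)
	}
	<-done
	<-done
	fmt.Println(len(visited))
}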

The starter code

package main

import (
	"fmt"
)

type Fetcher interface {
	// Fetch returns the body of URL and
	// a slice of URLs found on that page.
	Fetch(url string) (body string, urls []string, err error)
}

// Crawl uses fetcher to recursively crawl
// pages starting with url, to a maximum of depth.
func Crawl(url string, depth int, fetcher Fetcher) {
	// TODO: Fetch URLs in parallel.
	// TODO: Don't fetch the same URL twice.
	// This implementation doesn't do either:
	if depth <= 0 {
		return
	}
	body, urls, err := fetcher.Fetch(url)
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Printf("found: %s %q\n", url, body)
	for _, u := range urls {
		Crawl(u, depth-1, fetcher)
	}
	return
}

func main() {
	Crawl("https://golang.org/", 4, fetcher)
}

// fakeFetcher is Fetcher that returns canned results.
type fakeFetcher map[string]*fakeResult

type fakeResult struct {
	body string
	urls []string
}

func (f fakeFetcher) Fetch(url string) (string, []string, error) {
	if res, ok := f[url]; ok {
		return res.body, res.urls, nil
	}
	return "", nil, fmt.Errorf("not found: %s", url)
}

// fetcher is a populated fakeFetcher.
var fetcher = fakeFetcher{
	"https://golang.org/": &fakeResult{
		"The Go Programming Language",
		[]string{
			"https://golang.org/pkg/",
			"https://golang.org/cmd/",
		},
	},
	"https://golang.org/pkg/": &fakeResult{
		"Packages",
		[]string{
			"https://golang.org/",
			"https://golang.org/cmd/",
			"https://golang.org/pkg/fmt/",
			"https://golang.org/pkg/os/",
		},
	},
	"https://golang.org/pkg/fmt/": &fakeResult{
		"Package fmt",
		[]string{
			"https://golang.org/",
			"https://golang.org/pkg/",
		},
	},
	"https://golang.org/pkg/os/": &fakeResult{
		"Package os",
		[]string{
			"https://golang.org/",
			"https://golang.org/pkg/",
		},
	},
}

Understanding the starter code first

  • The starter code runs as-is,
  • and its output is not wrong.
  • But it has two flaws:
    • first, there is no concurrency;
    • second, the output contains duplicates, because the crawler really does fetch the same URLs repeatedly.
  • Both make it inefficient, as the sample output below shows.
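Because the crawl is strictly sequential, the starter program's output is deterministic. Note how https://golang.org/ and https://golang.org/pkg/ are fetched again and again:

found: https://golang.org/ "The Go Programming Language"
found: https://golang.org/pkg/ "Packages"
found: https://golang.org/ "The Go Programming Language"
found: https://golang.org/pkg/ "Packages"
not found: https://golang.org/cmd/
not found: https://golang.org/cmd/
found: https://golang.org/pkg/fmt/ "Package fmt"
found: https://golang.org/ "The Go Programming Language"
found: https://golang.org/pkg/ "Packages"
found: https://golang.org/pkg/os/ "Package os"
found: https://golang.org/ "The Go Programming Language"
found: https://golang.org/pkg/ "Packages"
not found: https://golang.org/cmd/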

My approach

  • Use a quit channel so that the current goroutine only finishes after the goroutines it spawned have finished.
  • Use a visited map to record which URLs have already been fetched.
  • visited is contended by all the crawling goroutines, so a sync.Mutex restricts access to it.
  • The code:
package main

import (
	"fmt"
	"sync"
)

type Fetcher interface {
	// Fetch returns the body of URL and
	// a slice of URLs found on that page.
	Fetch(url string) (body string, urls []string, err error)
}

// Visitor records which URLs have already been fetched;
// mux guards visited, which every crawling goroutine shares.
type Visitor struct {
	visited map[string]bool
	mux     sync.Mutex
}

// Crawl uses fetcher to recursively crawl
// pages starting with url, to a maximum of depth.
func Crawl(url string, depth int, fetcher Fetcher) {
	// Unlike the starter code, this version fetches URLs in parallel
	// and never fetches the same URL twice. Crawl blocks on quit until
	// the goroutine tree rooted at the first MyCrawl call has finished.
	quit := make(chan bool)
	var v Visitor
	v.visited = make(map[string]bool)
	go MyCrawl(url, depth, fetcher, quit, &v)
	<-quit
}

// MyCrawl fetches url, prints the result, and crawls every link found
// on the page in its own goroutine; it sends on quit exactly once when
// its whole subtree has finished.
func MyCrawl(url string, depth int, fetcher Fetcher, quit chan bool, v *Visitor) {
	if depth <= 0 {
		quit <- true
		return
	}
	// Check and mark under the lock, then release it before fetching:
	// holding the mutex across Fetch would serialize every fetch.
	v.mux.Lock()
	if v.visited[url] {
		v.mux.Unlock()
		quit <- true
		return
	}
	v.visited[url] = true
	v.mux.Unlock()
	body, urls, err := fetcher.Fetch(url)
	if err != nil {
		fmt.Println(err)
		quit <- true
		return
	}
	fmt.Printf("found: %s %q\n", url, body)
	// Launch all the children first and only then wait for them;
	// receiving right after each go statement would run them one at a time.
	childquit := make(chan bool)
	for _, u := range urls {
		go MyCrawl(u, depth-1, fetcher, childquit, v)
	}
	for range urls {
		<-childquit
	}
	quit <- true
}

func main() {
	Crawl("https://golang.org/", 4, fetcher)
}

// fakeFetcher is Fetcher that returns canned results.
type fakeFetcher map[string]*fakeResult

type fakeResult struct {
	body string
	urls []string
}

func (f fakeFetcher) Fetch(url string) (string, []string, error) {
	if res, ok := f[url]; ok {
		return res.body, res.urls, nil
	}
	return "", nil, fmt.Errorf("not found: %s", url)
}

// fetcher is a populated fakeFetcher.
var fetcher = fakeFetcher{
	"https://golang.org/": &fakeResult{
		"The Go Programming Language",
		[]string{
			"https://golang.org/pkg/",
			"https://golang.org/cmd/",
		},
	},
	"https://golang.org/pkg/": &fakeResult{
		"Packages",
		[]string{
			"https://golang.org/",
			"https://golang.org/cmd/",
			"https://golang.org/pkg/fmt/",
			"https://golang.org/pkg/os/",
		},
	},
	"https://golang.org/pkg/fmt/": &fakeResult{
		"Package fmt",
		[]string{
			"https://golang.org/",
			"https://golang.org/pkg/",
		},
	},
	"https://golang.org/pkg/os/": &fakeResult{
		"Package os",
		[]string{
			"https://golang.org/",
			"https://golang.org/pkg/",
		},
	},
}
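With the visited map in place, each reachable URL is now printed exactly once. The interleaving varies from run to run, but a typical run looks something like:

found: https://golang.org/ "The Go Programming Language"
not found: https://golang.org/cmd/
found: https://golang.org/pkg/ "Packages"
found: https://golang.org/pkg/fmt/ "Package fmt"
found: https://golang.org/pkg/os/ "Package os"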

  • I looked through many solutions online, but most failed the parallelism or deduplication requirement; many seemed written just to use a channel and a Mutex somewhere, so I ended up writing my own.
  • This exercise touches a lot of subtle details, so questions and corrections are welcome; I plan to put the code on GitHub at some point. For comparison, a sync.WaitGroup variant is sketched below.
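Here is that comparison: a sketch of the same crawl structured around sync.WaitGroup instead of per-call quit channels. SafeVisited, TryVisit, and the nested crawl closure are names invented here for illustration; Fetcher, fetcher, main, and the fmt and sync imports are the same as in the listing above, so this Crawl can drop in for the one in the solution.

// SafeVisited wraps the visited map together with its lock.
type SafeVisited struct {
	mu      sync.Mutex
	visited map[string]bool
}

// TryVisit marks url as visited and reports whether it was new.
func (s *SafeVisited) TryVisit(url string) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.visited[url] {
		return false
	}
	s.visited[url] = true
	return true
}

// Crawl starts one goroutine per discovered URL and waits for all of
// them with a WaitGroup instead of counting quit messages by hand.
func Crawl(url string, depth int, fetcher Fetcher) {
	v := &SafeVisited{visited: make(map[string]bool)}
	var wg sync.WaitGroup
	var crawl func(url string, depth int)
	crawl = func(url string, depth int) {
		defer wg.Done()
		if depth <= 0 || !v.TryVisit(url) {
			return
		}
		body, urls, err := fetcher.Fetch(url)
		if err != nil {
			fmt.Println(err)
			return
		}
		fmt.Printf("found: %s %q\n", url, body)
		for _, u := range urls {
			wg.Add(1)
			go crawl(u, depth-1)
		}
	}
	wg.Add(1)
	go crawl(url, depth)
	wg.Wait()
}

Marking a URL as visited before fetching it, rather than after, is what guarantees each URL is fetched at most once even when two goroutines discover the same link at the same moment.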
posted @ 2019-05-21 23:28 DickLai