<Web Crawler><Java><thread-safe queue>

Basic Solution

The simplest approach is a web crawler that runs on a single machine with a single thread.
A basic web crawler works like this:
- Start with a URL pool that contains all the websites we want to crawl.
- For each URL, issue an HTTP GET request to fetch the web page content.
- Parse the content (usually HTML) and extract the URLs that we want to crawl next.
- Add the new URLs to the pool and keep crawling.
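
The four steps above can be sketched as a single loop. This is only a minimal sketch: the class name BasicCrawler, the injected fetch and extractLinks functions, and the maxPages cap are all assumptions made for illustration (the cap is there because this basic version does no deduplication yet).

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.Queue;
import java.util.function.Function;

/** Minimal single-threaded crawl loop; fetching and link extraction are injected. */
class BasicCrawler {
    /**
     * @param seeds        initial URL pool
     * @param fetch        hypothetical function: URL -> page content (e.g. an HTTP GET)
     * @param extractLinks hypothetical function: page content -> URLs found in it
     * @param maxPages     safety cap (an assumption; this loop has no dedup yet)
     * @return the URLs in the order they were crawled
     */
    static List<String> crawl(Collection<String> seeds,
                              Function<String, String> fetch,
                              Function<String, List<String>> extractLinks,
                              int maxPages) {
        Queue<String> pool = new ArrayDeque<>(seeds);       // step 1: the URL pool
        List<String> crawled = new ArrayList<>();
        while (!pool.isEmpty() && crawled.size() < maxPages) {
            String url = pool.poll();                       // take the next URL
            String content = fetch.apply(url);              // step 2: fetch the page
            crawled.add(url);
            pool.addAll(extractLinks.apply(content));       // steps 3-4: parse, enqueue
        }
        return crawled;
    }
}
```

Passing the fetcher and the parser in as functions keeps the loop itself independent of HTTP and HTML details, which makes it easy to try out against an in-memory map of pages.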

Scale issues

Any system runs into new problems once it scales.
So what are the likely bottlenecks of a distributed web crawler, and how can we solve them?

Crawling frequency

How often should you crawl a website?
Small websites may have servers that cannot handle overly frequent requests.
One solution is to follow the site's robots.txt file.
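
One way to avoid overloading a host is to enforce a minimum delay between requests to it. The sketch below assumes the delay is already known (for example a default value, or one derived from the site's robots.txt); it does not parse robots.txt. The class name PolitenessScheduler and the method reserve are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

/** Tracks, per host, the earliest time the next request may be sent. */
class PolitenessScheduler {
    private final long minDelayMillis;                  // assumed fixed delay per host
    private final Map<String, Long> nextAllowed = new HashMap<>();

    PolitenessScheduler(long minDelayMillis) {
        this.minDelayMillis = minDelayMillis;
    }

    /**
     * Reserves a request slot for the host and returns how many milliseconds
     * the caller should wait before sending the request.
     */
    synchronized long reserve(String host, long nowMillis) {
        long earliest = nextAllowed.getOrDefault(host, nowMillis);
        long sendAt = Math.max(earliest, nowMillis);
        nextAllowed.put(host, sendAt + minDelayMillis); // the next slot opens after the delay
        return sendAt - nowMillis;
    }
}
```

A caller would typically pass System.currentTimeMillis() as nowMillis and sleep for the returned duration before issuing the request.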

Dedup

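On a single machine, you can keep the URL pool in memory and remove duplicate entries.
However, things become more complicated in a distributed system.
So how can we dedup these URLs?
A common approach is to use a Bloom filter. A Bloom filter is a space-efficient data structure for testing whether an element is in a set. Its "in the pool" answers may be wrong (false positives), but its "not in the pool" answers are always accurate.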
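
To make the false-positive behaviour concrete, here is a minimal Bloom filter sketch. The class name SimpleBloomFilter, the bit-array size, the number of hash functions, and the choice of hashes (String.hashCode combined with an FNV-1a hash) are all assumptions for illustration, not a tuned implementation.

```java
import java.nio.charset.StandardCharsets;
import java.util.BitSet;

/** A small Bloom filter over strings: no false negatives, possible false positives. */
class SimpleBloomFilter {
    private final BitSet bits;
    private final int size;        // number of bits
    private final int numHashes;   // number of bit positions set per element

    SimpleBloomFilter(int size, int numHashes) {
        this.bits = new BitSet(size);
        this.size = size;
        this.numHashes = numHashes;
    }

    void add(String url) {
        for (int i = 0; i < numHashes; i++) {
            bits.set(index(url, i));
        }
    }

    /** false means "definitely not added"; true means "probably added". */
    boolean mightContain(String url) {
        for (int i = 0; i < numHashes; i++) {
            if (!bits.get(index(url, i))) {
                return false;
            }
        }
        return true;
    }

    // Derives the i-th position from two base hashes (double hashing).
    private int index(String s, int i) {
        int h1 = s.hashCode();
        int h2 = fnv1a(s);
        return Math.floorMod(h1 + i * h2, size);
    }

    // 32-bit FNV-1a hash over the UTF-8 bytes of the string.
    private static int fnv1a(String s) {
        int hash = 0x811c9dc5;
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            hash ^= (b & 0xff);
            hash *= 16777619;
        }
        return hash;
    }
}
```

In a crawler, a URL for which mightContain returns true would be skipped as "probably seen", so a false positive means a page is occasionally missed, while no URL is ever crawled twice because of the filter.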

Parsing

After fetching the response data, the next step is to parse it (usually HTML) and extract the information we care about.
This sounds like a simple thing, but it can be quite hard to make it robust.
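
As a rough illustration of the extraction step, the sketch below pulls href values out of HTML with a regular expression and resolves them against the page URL. It is deliberately naive and shows why robustness is hard: a regex misses unquoted attributes, links built by scripts, and malformed markup, so a real crawler would likely use a proper HTML parser. The class name LinkExtractor is hypothetical.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Naive link extraction: quoted href attributes only, http/https results only. */
class LinkExtractor {
    private static final Pattern HREF =
        Pattern.compile("href\\s*=\\s*[\"']([^\"']+)[\"']", Pattern.CASE_INSENSITIVE);

    static List<String> extract(String baseUrl, String html) {
        URI base = URI.create(baseUrl);
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            String link = m.group(1).trim();
            int hash = link.indexOf('#');
            if (hash >= 0) {
                link = link.substring(0, hash);      // drop the fragment part
            }
            if (link.isEmpty()) {
                continue;                            // pure in-page anchor
            }
            try {
                URI resolved = base.resolve(link);   // handles relative links
                String scheme = resolved.getScheme();
                if ("http".equals(scheme) || "https".equals(scheme)) {
                    links.add(resolved.toString());
                }
            } catch (IllegalArgumentException e) {
                // malformed link: skip it rather than abort the whole page
            }
        }
        return links;
    }
}
```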

Other problems

Detecting loops: many websites contain link cycles like A -> B -> C -> A, so your crawler may end up running forever. How to fix this? [Note: once URLs are deduplicated there should be no loops in practice, as with the visited set in BFS.]
DNS lookup: once the system scales to a certain level, DNS lookup can become a bottleneck, and you may need to build your own DNS server.
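
The note about loops can be checked with a small BFS over an in-memory link graph. The sketch below is an illustration only: the class name LoopSafeTraversal and the map-based graph stand in for real fetching and parsing.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.Set;

/** BFS with a visited set; terminates even when the link graph has cycles. */
class LoopSafeTraversal {
    static List<String> bfs(String start, Map<String, List<String>> links) {
        Set<String> visited = new HashSet<>();
        Queue<String> queue = new ArrayDeque<>();
        List<String> order = new ArrayList<>();
        visited.add(start);
        queue.add(start);
        while (!queue.isEmpty()) {
            String url = queue.poll();
            order.add(url);
            for (String next : links.getOrDefault(url, Collections.emptyList())) {
                if (visited.add(next)) {   // add() returns false for an already seen URL
                    queue.add(next);
                }
            }
        }
        return order;
    }
}
```

With the cycle A -> B -> C -> A, each URL is enqueued once, so the traversal visits A, B, C and stops.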

A Java Web Crawler

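To set aside details I don't want to focus on for now (HTML parsing, cookie handling, and so on), the crawler directly crawls an API endpoint that returns JSON.
Multithreading:
- The queue is a BlockingQueue. offer is called with a timeout and returns false if the timeout expires; likewise, poll is called with a timeout and returns null if it expires.
- visited uses Collections.synchronizedSet.
Processing a URL:
1. HTTP GET request: use new URL(requestUrl).openStream(). No extra headers need to be set, so this is simple.
2. Handle by type:
  - If the URL returns JSON, parse the JSON and put the extracted links into the queue.
  - If the URL is an mp3, download it and save it locally.
For the code, see github-wttttt.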
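
The design described above (a BlockingQueue with timeouts plus a synchronized visited set) might look roughly like the sketch below. This is not the code from the repository: the class name ConcurrentCrawler, the injected process function (which stands in for "fetch the URL, parse JSON or save the mp3, return any new links"), the queue capacity, and the rule that a worker exits once poll times out are all assumptions.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.function.Function;

/** Worker threads share a bounded BlockingQueue and a synchronized visited set. */
class ConcurrentCrawler {
    private static final int QUEUE_CAPACITY = 10_000;   // assumed bound

    static Set<String> crawl(List<String> seeds,
                             Function<String, List<String>> process,
                             int threads,
                             long timeoutMillis) {
        BlockingQueue<String> queue = new LinkedBlockingQueue<>(QUEUE_CAPACITY);
        Set<String> visited = Collections.synchronizedSet(new HashSet<String>());
        for (String seed : seeds) {
            if (visited.add(seed)) {
                queue.offer(seed);
            }
        }
        Runnable worker = () -> {
            try {
                String url;
                // poll returns null if no URL arrives within the timeout: the worker stops
                while ((url = queue.poll(timeoutMillis, TimeUnit.MILLISECONDS)) != null) {
                    for (String next : process.apply(url)) {
                        // add() on a synchronized set is atomic, so each URL is enqueued once
                        if (visited.add(next)
                                && !queue.offer(next, timeoutMillis, TimeUnit.MILLISECONDS)) {
                            visited.remove(next);   // queue stayed full: drop, allow a retry
                        }
                    }
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        };
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int i = 0; i < threads; i++) {
            pool.execute(worker);
        }
        pool.shutdown();
        try {
            pool.awaitTermination(1, TimeUnit.MINUTES);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return visited;
    }
}
```

Stopping on a poll timeout is a simple heuristic: it assumes that if nothing shows up in the queue for timeoutMillis, the crawl is finished. A slow fetch in another thread could violate that assumption, so the timeout should be chosen generously.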

Thread-safe Queue

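Java provides two kinds of thread-safe queues:
BlockingQueue: a blocking queue
- Insertion:
  - add(e): throws an exception (IllegalStateException) when the queue is full;
  - offer(e): neither throws nor blocks; returns a boolean. When the queue is full, it does not insert the element and returns false immediately;
  - offer(e, timeout, unit): waits up to the given time for space to become available;
  - put(e): blocks while the queue is full;
- Removal:
  - remove(): throws an exception (NoSuchElementException) on an empty queue;
  - poll(): neither throws nor blocks; returns null when the queue is empty. It is the counterpart of offer(e);
  - poll(timeout, unit): waits up to the given time for an element;
  - take(): blocks while the queue is empty;
- Examining the head:
  - element(): throws an exception (NoSuchElementException) when the queue is empty;
  - peek(): neither throws nor blocks; returns the head element, or null if the queue is empty;
- Implementations of the BlockingQueue interface:
  - ArrayBlockingQueue: the constructor requires an int argument that specifies the capacity;
  - LinkedBlockingQueue: if the constructor is given a capacity argument, the queue is bounded; without one, the capacity is Integer.MAX_VALUE;
  - PriorityBlockingQueue: elements are not ordered FIFO, but by their natural ordering or by the Comparator passed to the constructor.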
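
The differences between these methods are easiest to see on a full queue and on an empty queue. The following sketch records what each non-blocking call does; the class and method names are made up for the demonstration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

class BlockingQueueDemo {
    /** Tries each insertion method on a full queue and records what happened. */
    static List<String> insertIntoFullQueue() {
        BlockingQueue<String> q = new ArrayBlockingQueue<>(1);
        q.add("first");                               // capacity 1: the queue is now full
        List<String> log = new ArrayList<>();
        try {
            q.add("x");
        } catch (IllegalStateException e) {
            log.add("add: IllegalStateException");
        }
        log.add("offer: " + q.offer("x"));            // returns false immediately
        try {
            log.add("timed offer: " + q.offer("x", 50, TimeUnit.MILLISECONDS));
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        // q.put("x") would block here until another thread removes an element
        return log;
    }

    /** Tries each removal and examination method on an empty queue. */
    static List<String> readFromEmptyQueue() {
        BlockingQueue<String> q = new ArrayBlockingQueue<>(1);
        List<String> log = new ArrayList<>();
        try {
            q.remove();
        } catch (NoSuchElementException e) {
            log.add("remove: NoSuchElementException");
        }
        log.add("poll: " + q.poll());                 // null on an empty queue
        try {
            log.add("timed poll: " + q.poll(50, TimeUnit.MILLISECONDS));
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        try {
            q.element();
        } catch (NoSuchElementException e) {
            log.add("element: NoSuchElementException");
        }
        log.add("peek: " + q.peek());                 // null on an empty queue
        // q.take() would block here until another thread inserts an element
        return log;
    }
}
```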
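ConcurrentLinkedQueue: a non-blocking queue
- ConcurrentLinkedQueue is a lock-free, thread-safe concurrent queue;
- Compared with a lock-based implementation, the hard part of a lock-free design is coordinating the threads. Put simply, when several threads access the internal data structure and one of them stalls partway through an operation, the other threads must be able to detect this and help finish the remaining work. This requires dividing each operation on the data structure into fine-grained states or phases, and considering what can happen under concurrent access in each of them.
- ConcurrentLinkedQueue has two volatile variables shared between threads: head and tail. Making the queue thread-safe means guaranteeing atomicity and visibility for accesses (updates and reads) to these two node references.
- Since volatile already guarantees visibility, what remains is to guarantee the atomicity of updates, which is done with compare-and-set (CAS) operations.
Anyway, blocking algorithms are essentially lock-based (the java.util.concurrent blocking queues use ReentrantLock internally rather than the synchronized keyword). Non-blocking algorithms are comparatively hard to design and implement, because concurrency has to be built on low-level atomic operations.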
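
To show what "low-level atomic operations" means in practice, here is a small lock-free stack built on AtomicReference.compareAndSet. It is not the algorithm ConcurrentLinkedQueue uses (a queue has to manage both head and tail); it is a simpler structure, chosen as an assumption for illustration, that shows the same read, compute, CAS, retry pattern. The class name LockFreeStack is hypothetical.

```java
import java.util.concurrent.atomic.AtomicReference;

/** A lock-free stack: every update is a compare-and-set retry loop on the head. */
class LockFreeStack<T> {
    private static final class Node<T> {
        final T value;
        final Node<T> next;

        Node(T value, Node<T> next) {
            this.value = value;
            this.next = next;
        }
    }

    // AtomicReference provides both visibility and atomic updates for the head.
    private final AtomicReference<Node<T>> head = new AtomicReference<>();

    void push(T value) {
        Node<T> oldHead;
        Node<T> newHead;
        do {
            oldHead = head.get();                       // read the current state
            newHead = new Node<>(value, oldHead);       // build the new state
        } while (!head.compareAndSet(oldHead, newHead)); // retry if another thread won
    }

    /** Returns the top element, or null if the stack is empty. */
    T pop() {
        Node<T> oldHead;
        do {
            oldHead = head.get();
            if (oldHead == null) {
                return null;
            }
        } while (!head.compareAndSet(oldHead, oldHead.next));
        return oldHead.value;
    }
}
```

No thread ever holds a lock here: a thread that loses the race simply re-reads the head and tries again, so a stalled thread cannot block the others.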

Posted on 2017-06-09 09:57 by 橘子不是唯一的水果
