Mining of Massive Datasets – Link Analysis

5.1 PageRank

5.1.1 Early Search Engines and Term Spam

As people began to use search engines to find their way around the Web, unethical people saw the opportunity to fool search engines into leading people to their page.

Techniques for fooling search engines into believing your page is about something it is not are called term spam.

The ability of term spammers to operate so easily rendered early search engines almost useless. To combat term spam, Google introduced two innovations:

1. PageRank was used to simulate where random Web surfers, following links at random, would tend to congregate. Pages that would have a large number of surfers were considered more “important” than pages that would rarely be visited. Google prefers important pages to unimportant pages when deciding which pages to show first in response to a search query.

2. The content of a page was judged not only by the terms appearing on that page, but by the terms used in or near the links to that page.

It is reasonable to ask why simulation of random surfers should allow us to approximate the intuitive notion of the “importance” of pages. There are two related motivations that inspired this approach.

• Users of the Web “vote with their feet.” They tend to place links to pages they think are good or useful pages to look at, rather than bad or useless pages.
• The behavior of a random surfer indicates which pages users of the Web are likely to visit. Users are more likely to visit useful pages than useless pages.

 

5.1.2 Definition of PageRank

Think of the Web as a directed graph, where pages are the nodes, and there is an arc from page p1 to page p2 if there are one or more links from p1 to p2.

Here is an example of a tiny version of the Web, where there are only four pages:

Page A has links to each of the other three pages;

Page B has links to A and D only;

Page C has a link only to A;

Page D has links to B and C only.

In general, we can define the transition matrix of the Web to describe what happens to random surfers after one step. If there are n pages, this matrix M has n rows and n columns. The element m_ij in row i and column j has value 1/k if page j has k arcs out and one of them is to page i; otherwise m_ij = 0.

The transition matrix M of the example above is shown below. It is easy to understand: A has links to 3 pages, so the probability of moving from A to each of them is 1/3, and similarly for the other pages.

      A     B     C     D
A     0    1/2    1     0
B    1/3    0     0    1/2
C    1/3    0     0    1/2
D    1/3   1/2    0     0
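As a sanity check, here is a minimal sketch in Python (not from the book; NumPy and the variable names are my own choices) that builds this transition matrix from the out-link lists of the four-page example:

import numpy as np

# Out-links of the four-page example: A -> B,C,D; B -> A,D; C -> A; D -> B,C
links = {"A": ["B", "C", "D"], "B": ["A", "D"], "C": ["A"], "D": ["B", "C"]}
pages = ["A", "B", "C", "D"]
idx = {p: i for i, p in enumerate(pages)}

n = len(pages)
M = np.zeros((n, n))
for j, page in enumerate(pages):          # column j describes the out-links of page j
    out = links[page]
    for dest in out:
        M[idx[dest], j] = 1.0 / len(out)  # m_ij = 1/k if page j has k arcs out, one of them to i

print(M)  # rows and columns in the order A, B, C, D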

So, given this transition matrix, how do we produce PageRank?

PageRank simulates this random surfing of the Web: the higher the probability that a page is visited, the more important the page should be, i.e., the higher its PageRank should be.

The matrix M above describes a single step of the random surfing. Starting from an initial distribution in which every page is equally likely, how do we compute the probability of each page being visited after n random steps on the Web?

If M is the transition matrix of the Web, then after one step the distribution of the surfer will be Mv0, which is v1 below; after n steps we get vn, the probability of each page being visited after n random steps, i.e., each page's PageRank. This is a Markov process.

v0 = (1/4, 1/4, 1/4, 1/4)
v1 = M·v0 = (9/24, 5/24, 5/24, 5/24)
v2 = M·v1 = (15/48, 11/48, 11/48, 11/48)
v3 = M·v2 = (11/32, 7/32, 7/32, 7/32)
...
vn → (3/9, 2/9, 2/9, 2/9) in the limit

To see why multiplying a distribution vector v by M gives the distribution x = Mv at the next step, we reason as follows.

The probability x_i that a random surfer will be at node i at the next step is Σ_j m_ij·v_j. Here, m_ij is the probability that a surfer at node j will move to node i at the next step (often 0 because there is no link from j to i), and v_j is the probability that the surfer was at node j at the previous step.
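A minimal sketch of this iteration for the four-page example (my own code, assuming NumPy; the book itself gives no code here):

import numpy as np

# Transition matrix of the four-page example, rows and columns in the order A, B, C, D
M = np.array([[0,   1/2, 1, 0  ],
              [1/3, 0,   0, 1/2],
              [1/3, 0,   0, 1/2],
              [1/3, 1/2, 0, 0  ]])

v = np.full(4, 1/4)      # v0: the surfer is equally likely to start at any page
for _ in range(50):      # iterate v <- M v until it is (nearly) unchanged
    v = M @ v

print(v)                 # approaches (3/9, 2/9, 2/9, 2/9)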

 

As we shall see, computing PageRank by simulating random surfers is a time-consuming process. One might think that simply counting the number of in-links for each page would be a good approximation to where random surfers would wind up.

Simplified PageRank Doesn’t Work

If that is all we did, then the hypothetical shirt-seller could simply create a “spam farm” of a million pages, each of which linked to his shirt page. Then the shirt page would look very important indeed, and a search engine would be fooled.

 

5.1.3 Structure of the Web

The previous section gave the basic PageRank algorithm, but the structure of the real Web is not nearly that simple, and using the algorithm directly causes problems. Let us first look at what the real Web structure is like.

[Figure 5.2: The “bowtie” picture of the Web]

The bowtie diagram above shows the structure of the real Web, which is composed of the following parts:

1. A large strongly connected component (SCC).
2. The in-component, consisting of pages that could reach the SCC by following links, but were not reachable from the SCC.
3. The out-component, consisting of pages reachable from the SCC but unable to reach the SCC.
4. Tendrils, which are of two types. Some tendrils consist of pages reachable from the in-component but not able to reach the in-component. The other tendrils can reach the out-component, but are not reachable from the out-component.
5. Tubes, which are pages reachable from the in-component and able to reach the out-component, but unable to reach the SCC or be reached from the SCC.
6. Isolated components that are unreachable from the large components (the SCC, in- and out-components) and unable to reach those components.

Several of these structures violate the assumptions needed for the Markov process iteration to converge to a limit.

As a result, PageRank is usually modified to prevent the following two problems.

First is the dead end, a page that has no links out.

The second problem is groups of pages, all of which have out-links, but which never link to any page outside the group. These structures are called spider traps.

 

5.1.4 Avoiding Dead Ends

Recall that a page with no link out is called a dead end.

      A     B     C     D
A     0    1/2    0     0
B    1/3    0     0    1/2
C    1/3    0     0    1/2
D    1/3   1/2    0     0

In the graph above, C is a dead end. Computing PageRank on such a graph with the Markov-process iteration gives all zeros, because once the surfer reaches C there is no next step.

A simple solution is to first delete the dead-end nodes and their incident arcs, and then compute the PageRank of the remaining nodes.

Finally, the PageRank of the dead-end node is computed from the PageRank of the nodes that link to it, using their original out-degrees. For C in this example:

PRc = 1/3*PRa + 1/2*PRd

Of course, the taxation method introduced in the next section can also be used to solve this problem.
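A rough sketch of the removal approach for this example (my own code, assuming NumPy; in general dead ends must be removed recursively, which this sketch skips):

import numpy as np

# Original graph: A -> B,C,D; B -> A,D; C is a dead end; D -> B,C.
# Step 1: remove C and its incident arcs; the remaining graph is
# A -> B,D; B -> A,D; D -> B, with transition matrix (order A, B, D):
M = np.array([[0,   1/2, 0],
              [1/2, 0,   1],
              [1/2, 1/2, 0]])

v = np.full(3, 1/3)
for _ in range(50):
    v = M @ v
pr_a, pr_b, pr_d = v          # approximately 2/9, 4/9, 3/9

# Step 2: the dead end's PageRank from its predecessors, using their
# ORIGINAL out-degrees (A had 3 out-links, D had 2):
pr_c = pr_a / 3 + pr_d / 2    # approximately 13/54
print(pr_a, pr_b, pr_d, pr_c)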

 

5.1.5 Spider Traps and Taxation

As we mentioned, a spider trap is a set of nodes with no dead ends but no arcs out of the set. These structures can appear intentionally or unintentionally on the Web, and they cause the PageRank calculation to place all the PageRank within the spider traps.

This situation is different from a dead end: after entering a spider trap, the surfer always has a next step, but can never leave.

      A     B     C     D
A     0    1/2    0     0
B    1/3    0     0    1/2
C    1/3    0     1    1/2
D    1/3   1/2    0     0

This gives a simplified spider trap: node C has a self-loop, so once the surfer reaches C it can never leave. It is an abstraction and simplification of the real spider traps found on the Web.

Note that in general spider traps can have many nodes; there are spider traps with millions of nodes that spammers construct intentionally.

Computing PageRank on such a graph with the Markov-process iteration gives all the PageRank to the spider trap, while every other node gets PageRank 0.

The spammer has then achieved his goal: the spider trap he deliberately built receives the highest PageRank.

The solution is to introduce randomness, the same idea used to escape local optima: some random event lets the surfer jump out of the trap.

The Markov process above then becomes the following, with a new random surfer introduced with probability 1 − β:

v′ = βMv + (1 − β)e/n

where β is a chosen constant, usually in the range 0.8 to 0.9, e is a vector of all 1’s with the appropriate number of components, and n is the number of nodes in the Web graph.

With probability β, the random surfer decides to follow an out-link from their present page.

With probability 1 − β, a new random surfer is introduced at a random page.

With this random surfer added, the spider-trap node C still ends up with a large share of the PageRank, but the effect is limited, and the other nodes also receive some PageRank.
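A minimal sketch of the taxed iteration for the spider-trap example (my own code, assuming NumPy; β = 0.8 is just one value in the suggested range):

import numpy as np

# Spider-trap example: C links only to itself (order A, B, C, D)
M = np.array([[0,   1/2, 0, 0  ],
              [1/3, 0,   0, 1/2],
              [1/3, 0,   1, 1/2],
              [1/3, 1/2, 0, 0  ]])

beta = 0.8
n = 4
v = np.full(n, 1 / n)
for _ in range(50):
    v = beta * (M @ v) + (1 - beta) / n   # v' = beta*M*v + (1 - beta)*e/n

print(v)   # C still gets the largest share, but the other pages stay well above 0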

 

5.2 Efficient Computation of PageRank

To compute the PageRank for a large graph representing the Web, we have to perform a matrix–vector multiplication on the order of 50 times, until the vector is close to unchanged at one iteration.

This involves sparse-matrix representations and MapReduce techniques.
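The key idea is to never materialize the dense n × n matrix: only the out-link lists (the non-zero entries of M) are stored and touched. A sketch of one way to do this in plain Python (my own code; it shows only the sparse iteration, not the MapReduce formulation):

# Sparse representation: for each page, just its list of out-links.
# On the real Web the dense matrix would be hopelessly large, but the
# number of non-zero entries is only the number of links.
links = {"A": ["B", "C", "D"], "B": ["A", "D"], "C": ["A"], "D": ["B", "C"]}

beta = 0.85
pages = list(links)
n = len(pages)
v = {p: 1 / n for p in pages}

for _ in range(50):
    nxt = {p: (1 - beta) / n for p in pages}   # taxation term for every page
    for src, dests in links.items():
        share = beta * v[src] / len(dests)     # src spreads its rank over its out-links
        for dest in dests:
            nxt[dest] += share
    v = nxt

print(v)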

 

5.3 Topic-Sensitive PageRank

5.3.1 Motivation for Topic-Sensitive Page Rank

Different people have different interests, and sometimes distinct interests are expressed using the same term in a query. The canonical example is the search query jaguar, which might refer to the animal, the automobile, a version of the MAC operating system, or even an ancient game console.

But there are far too many users to create a separate favorite vector for each one. So,

The topic-sensitive PageRank approach creates one vector for each of some small number of topics, biasing the PageRank to favor pages of that topic.

Then we only need to classify each user under one of these topics to get an approximation of that user's interests.

 

5.3.2 Biased Random Walks

Suppose we have identified some pages that represent a topic such as “sports.” To create a topic-sensitive PageRank for sports, we can arrange that the random surfers are introduced only to a random sports page, rather than to a random page of any kind.

In fact, the only difference is that the surfer is now biased. In the original model, the surfer teleported to a uniformly random page, every page being equally likely; with the bias, we assume the surfer teleports only (or with much higher probability) to pages of a particular topic, such as sports.

The mathematical formulation for the iteration is as follows.

Suppose S is a set of the pages we have identified as belonging to a certain topic.

Let eS be a vector that has 1 in the components in S and 0 in other components.

v′ = βMv + (1 − β)eS/|S|
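A minimal sketch of the biased iteration (my own code, assuming NumPy; taking S = {B, D} as the “sports” pages is purely illustrative):

import numpy as np

# Transition matrix of the original four-page example, order A, B, C, D
M = np.array([[0,   1/2, 1, 0  ],
              [1/3, 0,   0, 1/2],
              [1/3, 0,   0, 1/2],
              [1/3, 1/2, 0, 0  ]])

beta = 0.8
S = [1, 3]            # indices of B and D, the pages identified with the topic
e_S = np.zeros(4)
e_S[S] = 1            # 1 in the components of S, 0 elsewhere

v = np.full(4, 1 / 4)
for _ in range(50):
    v = beta * (M @ v) + (1 - beta) * e_S / len(S)   # v' = beta*M*v + (1 - beta)*e_S/|S|

print(v)   # the PageRank is now biased toward B, D and the pages they link to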

posted on 2011-09-06 15:49 by fxjwind