Larbin Source Code Analysis [3]: The url Class

1. The url class in the utils package

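         This class represents a single real-world URL. Its main member variables are char *file, char *host, uint16_t port, int8_t depth, and char *cookie (the last one only when COOKIES is defined).

         It also has a public member of type in_addr, which holds an IPv4 address.

         The member functions mainly cover construction, returning the URL in various forms, hashing, and adding a cookie.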
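To make the role of the in_addr member more concrete, here is a minimal sketch of how an IPv4 address can be stored in and read back from a struct in_addr using the standard POSIX calls inet_pton and inet_ntop. The helper names parseIPv4 and formatIPv4 are hypothetical and are not part of Larbin, which resolves addresses through its own DNS code.

```cpp
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string>

// Hypothetical helper: fill an in_addr from dotted-decimal text.
// Returns true only when the text is a valid IPv4 address.
bool parseIPv4(const char* text, in_addr& out) {
    return inet_pton(AF_INET, text, &out) == 1;
}

// Hypothetical helper: turn an in_addr back into dotted-decimal text.
// Returns an empty string if the conversion fails.
std::string formatIPv4(const in_addr& a) {
    char buf[INET_ADDRSTRLEN];
    if (inet_ntop(AF_INET, &a, buf, sizeof buf) == nullptr) return "";
    return buf;
}
```

An in_addr is just the 32-bit address in network byte order, which is why the class can keep it inline as a plain member once the host has been resolved.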

2. The source code

// Larbin
// Sebastien Ailleret
// 15-11-99 -> 14-03-02

/* This class describes an URL */

#ifndef URL_H
#define URL_H

#include <netinet/in.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <stdlib.h>

#include "types.h"

bool fileNormalize (char *file);

class url {
 private:
    char *host;
    char *file;
    uint16_t port; // the order of variables is important for physical size
    int8_t depth;
    /* parse the url */
    void parse (char *s);
    /** parse a file with base */
    void parseWithBase (char *u, url *base);
    /* normalize file name */
    bool normalize (char *file);
    /* Does this url starts with a protocol name */
    bool isProtocol (char *s);
    /* constructor used by giveBase */
    url (char *host, uint port, char *file);

 public:
    /* Constructor : Parses an url (u is deleted) */
    url (char *u, int8_t depth, url *base);

    /* constructor used by input */
    url (char *line, int8_t depth);

    /* Constructor : read the url from a file (cf serialize) */
    url (char *line);

    /* Destructor */
    ~url ();

    /* inet addr (once calculated) */
    struct in_addr addr;

    /* Is it a valid url ? */
    bool isValid ();

    /* print an URL */
    void print ();

    /* return the host */
    inline char *getHost () { return host; }

    /* return the port */
    inline uint getPort () { return port; }

    /* return the file */
    inline char *getFile () { return file; }

    /** Depth in the Site */
    inline int8_t getDepth () { return depth; }

    /* Set depth to max if we are at an entry point in the site
     * try to find the ip addr
     * answer false if forbidden by robots.txt, true otherwise */
    bool initOK (url *from);

    /** return the base of the url
     * give means that you have to delete the string yourself
     */
    url *giveBase ();

    /** return a char * representation of the url
     * give means that you have to delete the string yourself
     */
    char *giveUrl ();

    /** write the url in a buffer
     * buf must be at least of size maxUrlSize
     * returns the size of what has been written (not including '\0')
     */
    int writeUrl (char *buf);

    /* serialize the url for the Persistent Fifo */
    char *serialize ();

    /* very thread unsafe serialisation in a static buffer */
    char *getUrl();

    /* return a hashcode for the host of this url */
    uint hostHashCode ();

    /* return a hashcode for this url */
    uint hashCode ();

#ifdef URL_TAGS
    /* tag associated to this url */
    uint tag;
#endif // URL_TAGS

#ifdef COOKIES
    /* cookies associated with this page */
    char *cookie;
    void addCookie(char *header);
#else // COOKIES
    inline void addCookie(char *header) {}
#endif // COOKIES
};

#endif // URL_H

 

3. Code analysis

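   (1) The implementation of the url class is mainly concerned with constructing a url. The rule is as follows. Given

    http://www.hach.com/r/0343/ttt.html

    host is www.hach.com and file is /r/0343/ttt.html.

         The url constructors build the object according to this rule. If a base url is supplied, the file of the new url is base->file followed by the file part of the new url.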
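The splitting rule and the base rule can be illustrated with a small, simplified sketch. It is not Larbin's parse or parseWithBase: the names ParsedUrl, parseUrl, baseOf and withBase are hypothetical, the default port of 80 is an assumption, and so are the handling of links that start with '/' (treated as absolute paths) and the idea that a base keeps the file up to its last '/'. Normalization of "." and ".." segments is left out.

```cpp
#include <cstdint>
#include <cstdlib>
#include <string>

// Hypothetical, simplified stand-in for the url class (not Larbin's code).
struct ParsedUrl {
    std::string host;
    std::uint16_t port = 80;  // assumption: default port when none is given
    std::string file;
};

// Split "http://host[:port]/file" into host, port and file.
// Returns false if the scheme is not http:// or the host is empty.
bool parseUrl(const std::string& s, ParsedUrl& out) {
    const std::string scheme = "http://";
    if (s.compare(0, scheme.size(), scheme) != 0) return false;
    const std::size_t hostStart = scheme.size();
    const std::size_t slash = s.find('/', hostStart);
    const std::string authority =
        (slash == std::string::npos) ? s.substr(hostStart)
                                     : s.substr(hostStart, slash - hostStart);
    out.file = (slash == std::string::npos) ? "/" : s.substr(slash);
    const std::size_t colon = authority.find(':');
    if (colon == std::string::npos) {
        out.host = authority;
        out.port = 80;
    } else {
        out.host = authority.substr(0, colon);
        out.port = static_cast<std::uint16_t>(
            std::atoi(authority.c_str() + colon + 1));
    }
    return !out.host.empty();
}

// Assumption: the base of a url keeps the file up to and including its last '/'.
ParsedUrl baseOf(const ParsedUrl& u) {
    ParsedUrl b = u;
    b.file = u.file.substr(0, u.file.rfind('/') + 1);
    return b;
}

// Build a new url from a link found on a page:
// same host and port as the base, file = base.file + link.
// Assumption: a link starting with '/' replaces the file entirely.
ParsedUrl withBase(const ParsedUrl& base, const std::string& link) {
    ParsedUrl r = base;
    r.file = (!link.empty() && link[0] == '/') ? link : base.file + link;
    return r;
}
```

With the example URL above, parseUrl yields host www.hach.com and file /r/0343/ttt.html; under the stated assumptions, a relative link next.html found on that page would resolve to the file /r/0343/next.html.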

   (2) The url hash function starts from the port and then mixes in the host string and the file string.

    /* return a hashcode for this url */
    uint url::hashCode () {
        unsigned int h=port;
        unsigned int i=0;
        while (host[i] != 0) {
            h = 31*h + host[i];
            i++;
        }
        i=0;
        while (file[i] != 0) {
            h = 31*h + file[i];
            i++;
        }
        return h % hashSize;
    }
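The same multiply-by-31 scheme can be tried in isolation with the standalone sketch below. The function name urlHash and the table size kHashSize are hypothetical; Larbin defines its own hashSize elsewhere, and the value used here is only for illustration.

```cpp
#include <cstdint>
#include <string>

// Assumption: an arbitrary table size for this sketch; Larbin uses its own hashSize.
const unsigned int kHashSize = 1000003;

// Polynomial hash seeded with the port, then fed the host and file characters.
// Unsigned arithmetic wraps around on overflow, which is well defined in C++.
unsigned int urlHash(const std::string& host, std::uint16_t port,
                     const std::string& file) {
    unsigned int h = port;
    for (char c : host) h = 31 * h + c;
    for (char c : file) h = 31 * h + c;
    return h % kHashSize;
}
```

Because the port is the seed, two urls with the same host and file but different ports normally land in different buckets. For host "a", port 80 and file "/", the running value is 80, then 31*80 + 97 = 2577, then 31*2577 + 47 = 79934.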

   (3) Cookie handling works as follows.

         If the header string passed to addCookie(char *header) begins with "set-cookie: ", the text that follows this 12-character prefix is appended to the cookie member.
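The prefix check and the 12-character offset can be sketched as follows. This is not Larbin's addCookie: it uses std::string instead of a raw char *cookie, the helper startsWithIgnoreCase is hypothetical, and it assumes that the cookie value ends at the first ';' (or at the end of the line) and that several cookies are joined with "; ".

```cpp
#include <cctype>
#include <cstring>
#include <string>

// Hypothetical helper: does s start with prefix, ignoring ASCII case?
bool startsWithIgnoreCase(const char* prefix, const char* s) {
    for (; *prefix != '\0'; ++prefix, ++s) {
        if (std::tolower(static_cast<unsigned char>(*prefix)) !=
            std::tolower(static_cast<unsigned char>(*s))) {
            return false;  // also covers s ending before prefix does
        }
    }
    return true;
}

// Append the value of a "Set-Cookie: " header line to cookie.
// Other header lines are ignored.
void addCookie(std::string& cookie, const char* header) {
    static const char kPrefix[] = "set-cookie: ";
    const std::size_t prefixLen = sizeof(kPrefix) - 1;  // 12 characters
    if (!startsWithIgnoreCase(kPrefix, header)) return;
    const char* value = header + prefixLen;
    // Assumption: keep only name=value, dropping attributes after ';'.
    const std::size_t len = std::strcspn(value, ";\r\n");
    if (len == 0) return;
    if (!cookie.empty()) cookie += "; ";
    cookie.append(value, len);
}
```

Calling addCookie with "Set-Cookie: id=42; Path=/" on an empty string leaves "id=42" in it, while a line such as "Content-Type: text/html" leaves the string unchanged.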

 

4. Summary

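  The member variables char *file, char *host, port, and cookie of the url class are together enough to represent a URL.

  The url class also provides parsing functions, so that url objects can be constructed from the links crawled out of web pages.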

posted on 2011-10-24 15:33  zhoulinhu