A Simple Crawler, Part 1

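A couple of days ago I saw that someone on cnblogs had written a crawler in Python that scrapes Lagou and compiles statistics on salaries and other data, so I wondered whether I could write a crawler in C# as well.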

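First, analyze the Lagou site.

Start by picking a .NET position, with the location fixed to Beijing for now.

That leads to the following page: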

http://www.lagou.com/zhaopin/.NET/?labelWords=label

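When I refreshed this address repeatedly, I noticed that the page header appeared first and the list in the middle lagged slightly behind, so I guessed that once the page finishes loading, it fetches the first page of results via AJAX.

I then captured the traffic with Fiddler to verify my guess.

Refreshing the page once captured about 80 requests.

The HTML returned by the first request is basically useless; at least for now it contains none of the information I want.

In one of the captured requests I found the information I needed: the response is data in JSON format.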

{
    "resubmitToken": null,
    "code": 0,
    "success": true,
    "requestId": null,
    "msg": null,
    "content": {
        "totalCount": 1007,
        "pageNo": 1,
        "pageSize": 15,
        "hasNextPage": true,
        "totalPageCount": 68,
        "currentPageNo": 1,
        "hasPreviousPage": false,
        "result": [
            {
                "relScore": 965,
                "createTime": "2016-04-28 10:05:39",
                "companyId": 28818,
                "calcScore": false,
                "showOrder": 0,
                "haveDeliver": false,
                "positionName": ".NET/C#",
                "positionType": "后端开发",
                "workYear": "3-5年",
                "education": "本科",
                "jobNature": "全职",
                "companyShortName": "畅捷通信息技术股份有限公司",
                "city": "北京",
                "salary": "15k-25k",
                "financeStage": "上市公司",
                "positionId": 1765871,
                "companyLogo": "image1/M00/00/3F/CgYXBlTUXMOADN_rAADQYzTeBQE385.jpg",
                "positionFirstType": "技术",
                "companyName": "畅捷通",
                "positionAdvantage": "上市公司 免费班车 20-35W 春节14天假",
                "industryField": "移动互联网 · 企业服务",
                "companyLabelList": [
                    "技能培训",
                    "节日礼物",
                    "绩效奖金",
                    "岗位晋升"
                ],
                "score": 1323,
                "deliverCount": 7,
                "leaderName": "曾志勇",
                "companySize": "500-2000人",
                "countAdjusted": false,
                "adjustScore": 48,
                "randomScore": 0,
                "orderBy": 99,
                "adWord": 1,
                "formatCreateTime": "2016-04-28",
                "imstate": "disabled",
                "createTimeSort": 1461809139000,
                "positonTypesMap": null,
                "hrScore": 53,
                "flowScore": 158,
                "showCount": 722,
                "pvScore": 12.26185956183834,
                "plus": "",
                "searchScore": 0,
                "totalCount": 0
            }
        ],
        "start": 0
    }
}

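After deserializing the response and stripping out some of the redundant data, I got the string above.

Comparing the JSON with what the page displays confirmed my guess.

What remains is to work out how to get the URL of each job posting,

and how to fetch the JSON data from the first page through the last.

First, create an entity class that matches the returned JSON to hold the data.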
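The post does not show the entity itself, so here is a minimal sketch of what it might look like, covering only a subset of the fields in the JSON above. The name DTOModel comes from the request code later in this post; ContentModel and PositionModel are hypothetical names of my own. The property names mirror the JSON keys exactly so the serializer can match them by name.

```csharp
using System.Collections.Generic;

namespace LG
{
    // Top-level envelope of the positionAjax.json response.
    public class DTOModel
    {
        public int code { get; set; }
        public bool success { get; set; }
        public string msg { get; set; }
        public ContentModel content { get; set; }
    }

    // Paging metadata plus the postings on the current page (hypothetical class name).
    public class ContentModel
    {
        public int totalCount { get; set; }
        public int pageNo { get; set; }
        public int pageSize { get; set; }
        public bool hasNextPage { get; set; }
        public int totalPageCount { get; set; }
        public List<PositionModel> result { get; set; }
    }

    // One job posting; only a few of the JSON fields are mapped (hypothetical class name).
    public class PositionModel
    {
        public int positionId { get; set; }
        public string positionName { get; set; }
        public string companyName { get; set; }
        public string city { get; set; }
        public string salary { get; set; }
        public string workYear { get; set; }
        public string education { get; set; }
        public string createTime { get; set; }
    }
}
```

Fields that are not declared here are simply left unmapped, so the classes can grow as more of the data becomes interesting.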

Next, analyze the job posting URLs:

http://www.lagou.com/jobs/1765871.html
http://www.lagou.com/jobs/1613765.html
http://www.lagou.com/jobs/797212.html
http://www.lagou.com/jobs/224215.html
http://www.lagou.com/jobs/1638545.html

From these five URLs we can see that they all share one structure:

http://www.lagou.com/jobs/+?+.html

The value that fills in the ? has to be found in the JSON.

It was easy to find there: the ? is  "positionId": 1765871,
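As a small sketch of that rule, the helper below turns a positionId into the posting URL. The class and method names (JobUrl, FromPositionId) are made up for illustration; only the URL pattern comes from the five examples above.

```csharp
namespace LG
{
    public static class JobUrl
    {
        // Builds the detail-page URL for one posting from its positionId,
        // following the pattern http://www.lagou.com/jobs/<positionId>.html.
        public static string FromPositionId(int positionId)
        {
            return "http://www.lagou.com/jobs/" + positionId + ".html";
        }
    }
}
```

For example, JobUrl.FromPositionId(1765871) yields http://www.lagou.com/jobs/1765871.html, the first URL in the list above.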

What remains is how to crawl from the first page through the last.

Take a look at the request message.

The request contains:

Query String
  city:北京
Form Data
  first:false
  pn:1
  kd:.NET

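Where:

  city is the city to search in

  first indicates whether this is the first load of the first page

  pn is the page number

  kd is the technology keyword to search for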
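To make the wire format concrete, here is a rough sketch of how the three form fields could be serialized into an application/x-www-form-urlencoded body. FormBody and Build are hypothetical names. The sketch assumes the server expects lowercase true/false, as seen in the captured request, and percent-encodes kd so that keywords such as C# survive the trip.

```csharp
using System;

namespace LG
{
    public static class FormBody
    {
        // Serializes the form fields as name=value pairs joined by '&'.
        public static string Build(bool first, int pn, string kd)
        {
            // bool.ToString() would give "False", so the lowercase form is written out explicitly.
            string firstText = first ? "true" : "false";
            return "first=" + firstText
                 + "&pn=" + pn
                 + "&kd=" + Uri.EscapeDataString(kd);
        }
    }
}
```

FormBody.Build(false, 1, ".NET") produces first=false&pn=1&kd=.NET, which matches the captured form data; with kd set to C#, the last field becomes kd=C%23.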

 

using System;
using System.IO;
using System.Net;
using System.Text;
using System.Web.Script.Serialization; // needs a reference to System.Web.Extensions

namespace LG
{
    public static class RequestDemo
    {
        // Requests one page of search results and deserializes the JSON into DTOModel
        // (the entity class created from the response JSON).
        public static DTOModel RequestDTO(
            string kd = ".NET",
            string city = "北京",
            bool first = false,
            int pn = 1
            )
        {
            // city travels in the query string, so it has to be URL-encoded.
            string url = "http://www.lagou.com/jobs/positionAjax.json?city=" + Uri.EscapeDataString(city);

            HttpWebRequest req = (HttpWebRequest)WebRequest.Create(url);
            req.Method = "POST";
            req.UserAgent = "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/40.0.2214.115 Safari/537.36";
            req.ContentType = "application/x-www-form-urlencoded; charset=UTF-8";
            req.Headers.Add("Accept-Language", "zh-CN,zh;q=0.8");
            req.Headers.Add("Origin", "http://www.lagou.com");
            req.Headers.Add("X-Requested-With", "XMLHttpRequest");
            req.Host = "www.lagou.com";
            req.Referer = "http://www.lagou.com/zhaopin/" + Uri.EscapeDataString(kd) + "/?labelWords=label";

            // Form body: name=value pairs joined by '&'.
            // bool.ToString() yields "False", so it is lowercased to match the captured request.
            string form = "first=" + first.ToString().ToLowerInvariant()
                        + "&pn=" + pn
                        + "&kd=" + Uri.EscapeDataString(kd);
            byte[] buf = Encoding.UTF8.GetBytes(form);
            req.ContentLength = buf.Length;

            // The body must actually be written to the request stream before reading the response.
            using (Stream requestStream = req.GetRequestStream())
            {
                requestStream.Write(buf, 0, buf.Length);
            }

            string json;
            using (HttpWebResponse res = (HttpWebResponse)req.GetResponse())
            using (Stream receiveStream = res.GetResponseStream())
            using (StreamReader sr = new StreamReader(receiveStream, Encoding.UTF8))
            {
                json = sr.ReadToEnd();
            }

            JavaScriptSerializer js = new JavaScriptSerializer();
            return js.Deserialize<DTOModel>(json);
        }
    }
}

Then build the corresponding URLs from the returned data.

    "content": {
        "totalCount": 5000,
        "hasNextPage": true,
        "pageNo": 1,
        "pageSize": 15,
        "totalPageCount": 334,
        "currentPageNo": 1,
        "hasPreviousPage": false,

With these parameters we can loop over the pages, from pn = 1 up to totalPageCount, to fetch all the data.
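The loop could be sketched roughly as follows. To keep the example self-contained, the page fetch is passed in as a delegate instead of calling the network directly; PageResult, Pager, CollectPositionIds and the maxPages safety cap are all hypothetical names and choices of mine, not part of the original crawler.

```csharp
using System;
using System.Collections.Generic;

namespace LG
{
    // What the loop needs from one page: the reported page count and the ids found on it.
    public class PageResult
    {
        public int TotalPageCount { get; set; }
        public List<int> PositionIds { get; set; }
    }

    public static class Pager
    {
        // Calls fetchPage for pn = 1, 2, ... until the reported totalPageCount is reached.
        public static List<int> CollectPositionIds(Func<int, PageResult> fetchPage, int maxPages = 1000)
        {
            var ids = new List<int>();
            int totalPages = 1; // unknown until the first response arrives
            for (int pn = 1; pn <= totalPages && pn <= maxPages; pn++)
            {
                PageResult page = fetchPage(pn);
                if (page == null || page.PositionIds == null)
                {
                    break; // stop on an empty or failed page
                }
                ids.AddRange(page.PositionIds);
                totalPages = page.TotalPageCount;
            }
            return ids;
        }
    }
}
```

In the actual crawler, fetchPage would presumably wrap RequestDemo.RequestDTO(pn: pn) and copy totalPageCount and the positionId values out of content; pausing briefly between requests would also be a reasonable precaution.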

With the URLs built, we can go and fetch the pages we want.

posted @ 2016-05-02 15:23  乔安生