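The zlib Module and a Crawler Example with the http Module

The zlib module

zlib is Node.js's built-in compression module. It compresses data, for example turning a file into a gzip archive.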

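Streams
- In Node.js, data is transferred as a stream of chunks
Steps
1. Create a readable stream with fs
2. Create a gzip stream (an empty "archive" waiting to be filled)
3. Create a writable stream
4. Pass the data along through pipes
pipe
- A pipe connects the output of one I/O stream to the input of the next; here it is called a pipe stream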

Code
  const zlib = require('zlib') // zlib is the built-in compression module
  const fs = require('fs') // fs is the file system module
  const inp = fs.createReadStream('./dist/1.txt') // create the readable stream
  const out = fs.createWriteStream('1.txt.gz') // create the writable stream
  const gzip = zlib.createGzip() // create the gzip stream (the empty "archive")
  inp
    .pipe(gzip)
    .pipe(out)

    $ node <filename>
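
One thing the chained pipe calls do not do is forward errors from one stream to the next. The sketch below is one possible variant that uses Node's stream.pipeline, which reports a failure in any stage through a single callback. The helper name gzipFile and the temporary demo file are assumptions made for illustration; they are not part of the original example.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');
const zlib = require('zlib');
const { pipeline } = require('stream');

// Hypothetical helper: gzip `src` into `dest`, then call `done(err)`.
function gzipFile(src, dest, done) {
  pipeline(
    fs.createReadStream(src),   // readable stream
    zlib.createGzip(),          // gzip stream that compresses each chunk
    fs.createWriteStream(dest), // writable stream
    done                        // called once, with an error if any stage failed
  );
}

// Demo on a throwaway file (assumed path) so the sketch runs anywhere.
const dir = fs.mkdtempSync(path.join(os.tmpdir(), 'gzip-demo-'));
const src = path.join(dir, '1.txt');
const dest = src + '.gz';
fs.writeFileSync(src, 'hello zlib\n'.repeat(100));

gzipFile(src, dest, (err) => {
  if (err) {
    console.error('Compression failed:', err.message);
    return;
  }
  // Decompress synchronously to check that the round trip preserves the data.
  const restored = zlib.gunzipSync(fs.readFileSync(dest)).toString('utf8');
  console.log('round trip ok:', restored === fs.readFileSync(src, 'utf8'));
});
```

To compress a real file, call gzipFile('./dist/1.txt', '1.txt.gz', callback) instead of the demo lines.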

The console module

Under the hood, console.log writes to process.stdout (console.error writes to process.stderr).
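
To make that relationship concrete, here is a rough approximation of what console.log does: format the arguments, append a newline, and write the result to process.stdout. The function name logLine is hypothetical, and the real console implementation handles more cases (error handling, grouping indentation), so treat this as a sketch only.

```javascript
const util = require('util');

// Hypothetical stand-in for console.log: format the arguments,
// add a newline, and write the string to process.stdout.
function logLine(...args) {
  return process.stdout.write(util.format(...args) + '\n');
}

console.log('status: %d', 200);
logLine('status: %d', 200);
```

Both calls should put the same line on standard output.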

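Crawler example

Crawlers

A crawler uses a back-end language to fetch data from a website, cleans the data with a dedicated module, and finally outputs the result to the front end.
Not every website can be crawled.
Basic components
1. Program entry point
2. Request module
3. Data parsing
Program entry point
- The entry point can be a web page, which can also display the fetched data and the analysis results.
Request module
- The http and https modules send the request, in one of two ways: get or request.

This example uses get. The code is as follows: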
    const http = require('http');
    const cheerio = require('cheerio');
    http.get('http://nodejs.org/dist/index.json', (res) => {
      const { statusCode } = res;  // status code, 1xx - 5xx
      const contentType = res.headers['content-type']; // content type: text/json/html/xml

      let error;
      // Report an error if the status code is not 200 or the content type is not JSON
      if (statusCode !== 200) {
        error = new Error('Request Failed.\n' +
                          `Status Code: ${statusCode}`);
      } else if (!/^application\/json/.test(contentType)) {
        error = new Error('Invalid content-type.\n' +
                          `Expected application/json but received ${contentType}`);
      }
      if (error) {
        console.error(error.message);
        // consume response data to free up memory
        res.resume();  // discard the rest of the response body
        return;
      }

      res.setEncoding('utf8'); // character encoding

When a request is made with an options object, that object holds the details of the target URL and the request headers.
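
As an illustration of such an options object, the sketch below uses http.request instead of http.get. The helper name fetchWithOptions, the User-Agent value, and the throwaway local server are all assumptions made so the example can run without network access; for a real crawl the hostname, port, and path would point at the target site.

```javascript
const http = require('http');

// Hypothetical helper: send a request described by `options`
// and resolve with the status code and the full response body.
function fetchWithOptions(options) {
  return new Promise((resolve, reject) => {
    const req = http.request(options, (res) => {
      res.setEncoding('utf8');
      let body = '';
      res.on('data', (chunk) => { body += chunk; });
      res.on('end', () => resolve({ statusCode: res.statusCode, body }));
    });
    req.on('error', reject);
    req.end(); // http.request needs an explicit end(); http.get calls it for you
  });
}

// Throwaway local server (assumption) that echoes the User-Agent header.
const server = http.createServer((req, res) => {
  res.setHeader('Content-Type', 'text/html; charset=utf-8');
  res.end(`<p>user-agent: ${req.headers['user-agent']}</p>`);
});

const demo = new Promise((resolve, reject) => {
  server.listen(0, '127.0.0.1', () => {
    const options = {
      hostname: '127.0.0.1',           // target host
      port: server.address().port,     // target port
      path: '/',                       // target path
      method: 'GET',
      headers: { 'User-Agent': 'demo-crawler/0.1' }, // assumed header value
    };
    fetchWithOptions(options)
      .then(resolve, reject)
      .finally(() => server.close());
  });
});

demo.then(({ statusCode, body }) => console.log(statusCode, body));
```

Keeping the URL parts and the headers in one object makes it easy to change request headers per site without touching the request logic.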

If the page being crawled returns HTML, the following check can be left out (as written, its content-type test rejects every non-JSON response):

  let error;
  // Report an error if the status code is not 200 or the content type is not JSON
  if (statusCode !== 200) {
    error = new Error('Request Failed.\n' +
                      `Status Code: ${statusCode}`);
  } else if (!/^application\/json/.test(contentType)) {
    error = new Error('Invalid content-type.\n' +
                      `Expected application/json but received ${contentType}`);
  }
  if (error) {
    console.error(error.message);
    // consume response data to free up memory
    res.resume();  // discard the rest of the response body
    return;
  }

Data parsing

Load the fetched data into cheerio, then display or save it.

   
 res.setEncoding('utf8'); // character encoding

 // core -- start
 let rawData = '';
 res.on('data', (chunk) => { rawData += chunk; }); // concatenate the chunks
 res.on('end', () => { // all data has been received
   try {

     const $ = cheerio.load(rawData)

     // each() passes the index first; `this` is the current element
     $('td.student a').each(function (index) {
       console.log($(this).text())
     })

   } catch (e) {
     console.error(e.message);
   }
 });

 // core -- end
   }).on('error', (e) => {
     console.error(`Got error: ${e.message}`);
   });
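
The res.setEncoding('utf8') call matters for the chunk concatenation above, because a multi-byte character can be split across two chunks. The sketch below imitates that situation with Node's StringDecoder, which is what setEncoding relies on internally. The sample string and the split position are chosen only for illustration.

```javascript
const { StringDecoder } = require('string_decoder');

const bytes = Buffer.from('爬虫', 'utf8'); // 6 bytes, 3 per character
// Split in the middle of the second character, as a network chunk boundary might.
const chunks = [bytes.subarray(0, 4), bytes.subarray(4)];

// Naive: convert each chunk on its own, which is what `rawData += chunk`
// does with Buffer chunks when no encoding has been set.
let naive = '';
for (const chunk of chunks) naive += chunk.toString('utf8');

// With a decoder: incomplete bytes are held back until the next chunk arrives.
const decoder = new StringDecoder('utf8');
let decoded = '';
for (const chunk of chunks) decoded += decoder.write(chunk);
decoded += decoder.end();

console.log('naive matches:', naive === '爬虫');
console.log('decoded matches:', decoded === '爬虫');
```

The naive version ends up with replacement characters where the split character was, while the decoder reassembles the original text.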


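When http.request is used instead of http.get, finish sending the request with req.end(); http.get calls it automatically.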

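Anti-crawling

One countermeasure is to put an image inside the tag instead of text, so a crawler that extracts the tag's text gets nothing useful.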

posted on 2019-08-13 20:23 by 吃鱼的虾
