How to elegantly fetch a gzip-encoded page and save it locally (Java implementation)

1. Introduction

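While scraping car sales data, I needed to fetch HTML pages and save them locally for later analysis. Some of these pages are served gzip-compressed, so the response has to be decompressed first; otherwise the saved file is just garbled bytes. After a careful search online, I eventually found an elegant solution, shown below.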

2. Open-source libraries used

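        <!-- The code below imports org.apache.commons.io.FileUtils, which ships in commons-io, not commons-lang3. The version is an assumption; any recent release should work. -->
        <dependency>
            <groupId>commons-io</groupId>
            <artifactId>commons-io</artifactId>
            <version>2.4</version>
        </dependency>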
        <dependency>
            <groupId>com.google.guava</groupId>
            <artifactId>guava</artifactId>
            <version>18.0</version>
        </dependency>

3. Implementation

package com.reycg;

import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.util.List;
import java.util.zip.GZIPInputStream;

import org.apache.commons.io.FileUtils;

import com.google.common.base.Charsets;
import com.google.common.io.ByteSource;
import com.google.common.io.Resources;

public class GzippedByteSource extends ByteSource {

    private final ByteSource source;

    public GzippedByteSource(ByteSource gzippedSource) {
        source = gzippedSource;
    }

    @Override
    public InputStream openStream() throws IOException {
        return new GZIPInputStream(source.openStream());
    }

    public static void main(String[] args) throws IOException {
        URL url = new URL("..."); // TODO: put the URL of the HTML page here
        String filePath = "1.html";
        
        List<String> lines = new GzippedByteSource(Resources.asByteSource(url)).asCharSource(Charsets.UTF_8).readLines();
        // List<String> lines = Resources.asCharSource(url, Charsets.UTF_8).readLines();   // for pages that are not gzip-encoded (1)
        
        FileUtils.writeLines(new File(filePath), lines);
    }

}
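
The class above decompresses unconditionally. When a server labels its responses, the Content-Encoding header can make that decision instead. The following is a minimal sketch using only the JDK, assuming the server sets the header correctly and the page is UTF-8; HeaderAwareFetcher, decodeBody, fetch, and readAll are hypothetical names and not part of the original code. The demo in main uses an in-memory gzip body so it runs without network access.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.net.URLConnection;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper, not part of the original post.
class HeaderAwareFetcher {

    /** Decompresses the body only when the Content-Encoding value is "gzip". */
    static InputStream decodeBody(InputStream body, String contentEncoding) throws IOException {
        boolean gzipped = contentEncoding != null
                && contentEncoding.trim().equalsIgnoreCase("gzip");
        return gzipped ? new GZIPInputStream(body) : body;
    }

    /** Fetches a page as text; assumes the page is UTF-8 and the header is trustworthy. */
    static String fetch(URL url) throws IOException {
        URLConnection conn = url.openConnection();
        conn.setRequestProperty("Accept-Encoding", "gzip");
        try (InputStream in = decodeBody(conn.getInputStream(), conn.getContentEncoding())) {
            return readAll(in);
        }
    }

    /** Reads the whole stream and decodes it as UTF-8. */
    static String readAll(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int n;
        while ((n = in.read(chunk)) != -1) {
            out.write(chunk, 0, n);
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // In-memory stand-in for a gzip response body, so the demo needs no network.
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(buf)) {
            gz.write("<html>ok</html>".getBytes(StandardCharsets.UTF_8));
        }
        InputStream decoded = decodeBody(new ByteArrayInputStream(buf.toByteArray()), "gzip");
        System.out.println(readAll(decoded));
    }
}
```

A server that compresses without setting the header would still slip through this check, which is the case the original GzippedByteSource handles.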

4. Notes

1. If the following error appears at run time, the returned HTML page is not gzip-encoded:

Exception in thread "main" java.util.zip.ZipException: Not in GZIP format

In that case, use the line marked (1) in the code above to fetch the page instead.
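
Instead of switching lines by hand after a ZipException, the stream can be inspected before deciding. Every gzip stream starts with the two magic bytes 0x1f 0x8b, so peeking at them is enough to tell the two cases apart. This is a rough sketch using only the JDK; GzipSniffer, maybeGunzip, and readUtf8 are hypothetical names, and UTF-8 is assumed as in the code above.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper, not part of the original post.
class GzipSniffer {

    /** Wraps the stream in a GZIPInputStream only if it starts with the gzip magic bytes 0x1f 0x8b. */
    static InputStream maybeGunzip(InputStream raw) throws IOException {
        BufferedInputStream in = new BufferedInputStream(raw);
        in.mark(2);                 // remember the start so the peeked bytes can be replayed
        int first = in.read();
        int second = in.read();
        in.reset();                 // rewind to the marked position
        boolean gzipped = first == 0x1f && second == 0x8b;
        return gzipped ? new GZIPInputStream(in) : in;
    }

    /** Reads the whole stream and decodes it as UTF-8. */
    static String readUtf8(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int n;
        while ((n = in.read(chunk)) != -1) {
            out.write(chunk, 0, n);
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        byte[] plain = "<html>hello</html>".getBytes(StandardCharsets.UTF_8);

        // Build a gzip-compressed copy of the same page in memory.
        ByteArrayOutputStream compressed = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(compressed)) {
            gz.write(plain);
        }

        // Both inputs go through the same call; only the first is decompressed.
        System.out.println(readUtf8(maybeGunzip(new ByteArrayInputStream(compressed.toByteArray()))));
        System.out.println(readUtf8(maybeGunzip(new ByteArrayInputStream(plain))));
    }
}
```

With the Guava code above, the same helper could be fed from Resources.asByteSource(url).openStream(), so one code path covers both kinds of page.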

5. Postscript

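The car sales data is mainly displayed in a WeChat mini program I developed myself, 汽车销量查询小助手 (Car Sales Query Assistant). If you are interested, search for 汽车销量查询小助手 in WeChat mini programs to see it in action. Suggestions and comments are welcome.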

posted @ 2018-10-30 11:29 ReyCG
