Spring Boot + HttpClient + Jsoup: implementing a web crawler

I. Approach

Fetch the page ---> extract the data ---> save to the database

 

II. Code

1. pom.xml

<dependency>
    <groupId>org.apache.httpcomponents</groupId>
    <artifactId>httpclient</artifactId>
    <version>4.5.9</version>
</dependency>

<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.11.3</version>
</dependency>


2. Fetching the page

import org.apache.http.HttpEntity;
import org.apache.http.HttpStatus;
import org.apache.http.client.config.RequestConfig;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.client.utils.HttpClientUtils;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

// Set a User-Agent header to mimic a real browser
private static final String USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.100 Safari/537.36";

public static String sendGet(String url) {
    // Create an HttpClient -- the equivalent of opening a browser
    CloseableHttpClient httpClient = HttpClients.createDefault();
    // Set the connection and socket timeouts
    RequestConfig requestConfig = RequestConfig.custom()
            .setSocketTimeout(2000)
            .setConnectTimeout(2000)
            .build();
    CloseableHttpResponse response = null;
    String html = null;
    // Build the GET request -- like typing the URL into the address bar
    HttpGet request = new HttpGet(url);
    try {
        request.setHeader("User-Agent", USER_AGENT);
        request.setConfig(requestConfig);
        // Execute the request -- like pressing Enter
        response = httpClient.execute(request);
        // Only process the body when the response status is 200
        if (response.getStatusLine().getStatusCode() == HttpStatus.SC_OK) {
            HttpEntity httpEntity = response.getEntity();
            html = EntityUtils.toString(httpEntity, "utf-8");
        } else {
            // Handle the failure
            System.out.println("Request failed");
            System.out.println(EntityUtils.toString(response.getEntity(), "utf-8"));
        }
    } catch (Exception e) {
        e.printStackTrace();
    } finally {
        // Release resources
        HttpClientUtils.closeQuietly(response);
        HttpClientUtils.closeQuietly(httpClient);
    }
    return html;
}
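Since CloseableHttpClient and CloseableHttpResponse both implement Closeable, the cleanup in the finally block can also be expressed with try-with-resources. A minimal sketch of the same request in that style (behavior unchanged; the User-Agent string is shortened here for brevity):

```java
import org.apache.http.HttpStatus;
import org.apache.http.client.config.RequestConfig;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

public class SendGetDemo {
    public static String sendGet(String url) {
        RequestConfig requestConfig = RequestConfig.custom()
                .setSocketTimeout(2000)
                .setConnectTimeout(2000)
                .build();
        HttpGet request = new HttpGet(url);
        request.setHeader("User-Agent", "Mozilla/5.0");
        request.setConfig(requestConfig);
        // Both resources are closed automatically, even on exceptions
        try (CloseableHttpClient httpClient = HttpClients.createDefault();
             CloseableHttpResponse response = httpClient.execute(request)) {
            if (response.getStatusLine().getStatusCode() == HttpStatus.SC_OK) {
                return EntityUtils.toString(response.getEntity(), "utf-8");
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
        // null signals a failed request, matching the original method
        return null;
    }
}
```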

3. Extracting the data

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

public static void getData(String html) {
    Document document = Jsoup.parse(html);
    // Pick the selectors that match the data you want to extract
    Elements elements = document.select(".goods-list").select(".goods-item");
    for (int i = 0; i < elements.size(); i++) {
        String name = elements.get(i).text();
        System.out.println(name); // do something with the extracted value here
    }
}
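To see the selector logic in isolation, the same extraction can be exercised against a hard-coded HTML fragment; the .goods-list / .goods-item structure below is an assumption for illustration, not taken from any real page:

```java
import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class ExtractDemo {
    // Collects the text of every .goods-item inside .goods-list
    public static List<String> extractNames(String html) {
        Document document = Jsoup.parse(html);
        List<String> names = new ArrayList<>();
        for (Element e : document.select(".goods-list").select(".goods-item")) {
            names.add(e.text());
        }
        return names;
    }

    public static void main(String[] args) {
        // Hypothetical markup mirroring the selectors used above
        String html = "<div class=\"goods-list\">"
                + "<div class=\"goods-item\">Apple</div>"
                + "<div class=\"goods-item\">Banana</div>"
                + "</div>";
        System.out.println(extractNames(html));
    }
}
```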


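The third step of the approach, saving to the database, has no code above. A minimal sketch using plain JDBC batch inserts; the connection URL, credentials, and the goods table with its name column are all hypothetical and need to be adapted to your schema:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.util.List;

public class SaveDemo {
    public static void save(List<String> names) throws Exception {
        // Hypothetical connection settings -- adjust to your environment
        String url = "jdbc:mysql://localhost:3306/crawler?useSSL=false";
        try (Connection conn = DriverManager.getConnection(url, "root", "password");
             PreparedStatement ps = conn.prepareStatement(
                     "INSERT INTO goods (name) VALUES (?)")) {
            for (String name : names) {
                ps.setString(1, name);
                ps.addBatch();
            }
            // One round trip for all rows instead of one per insert
            ps.executeBatch();
        }
    }
}
```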
Posted @ 2020-08-11 17:14 by 伏沙金