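To be honest, anti-scraping measures aside, a plain crawler doesn't involve much sophisticated technique. This post is just a short memo of how I scraped some data this time. During the COVID-19 period I couldn't go into the office, so I worked remotely from home on research and preparation work. Part of that was product market research, and for that you let the data speak!
I focused on JD.com (京东商城). Earlier I also scraped Tmall and Taobao, but the Alibaba sites have fairly strong anti-scraping measures: after a while they kept popping up a slider CAPTCHA that I couldn't get around. Even when my program detected the slider and I completed the verification by hand, it still wouldn't let me through. That really is impressive. I haven't had time to dig into it, so I still don't know how to get past it. If anyone has been through this, please share a tip or two!
My crawler is built with selenium + httpclient + mysql.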
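- selenium is an automated testing tool, and here it works well for automating clicks on page buttons. Frankly, you could do everything with jsoup and no selenium at all. Going the other way and using only selenium makes some scenarios harder: when the page is rendered completely asynchronously, simulated clicks can fail with element-not-found errors, whether you locate elements by cssSelector or by xpath.
- Using only jsoup would solve the problem, but the crawler becomes relatively complex and you have to write a lot more code yourself.
- So in the end I used selenium together with httpclient. selenium handles paging, because JD's product list pages follow a clear pattern. Both paging by URL parameters (loading each page with driver.get(url)) and clicking "next page" on the list page are easy, and since a real browser window is open for each page being crawled, you can watch roughly what is going on. httpclient is mainly used to fetch product prices and comment data: prices only as a fallback, comment data entirely through httpclient.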
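The "clear pattern" of the list pages can be sketched as a small pure function. This is only a sketch: the mapping from the visible page number i to page=2*i-1 and s=60*(i-1)+1 is copied from the crawler code later in this post, and JD may change it at any time. The class and method names (JdPageUrls, pageUrl, pageUrls) are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch: builds JD search-result URLs for consecutive visible pages. */
final class JdPageUrls {

    /**
     * Assumption taken from the crawler in this post: visible page i is
     * requested with page=2*i-1 and s=60*(i-1)+1 appended to the search URL.
     */
    static String pageUrl(String baseSearchUrl, int visiblePage) {
        if (visiblePage < 1) {
            throw new IllegalArgumentException("visiblePage must be >= 1");
        }
        int page = 2 * visiblePage - 1;
        int start = 60 * (visiblePage - 1) + 1;
        return baseSearchUrl + "&page=" + page + "&s=" + start;
    }

    /** Returns the URLs for visible pages 1..howManyPages, in order. */
    static List<String> pageUrls(String baseSearchUrl, int howManyPages) {
        List<String> urls = new ArrayList<>();
        for (int i = 1; i <= howManyPages; i++) {
            urls.add(pageUrl(baseSearchUrl, i));
        }
        return urls;
    }

    public static void main(String[] args) {
        // Hypothetical keyword; the real crawler uses a URL-encoded Chinese keyword.
        String base = "https://search.jd.com/Search?keyword=phone&enc=utf-8";
        for (String url : pageUrls(base, 3)) {
            System.out.println(url);
        }
    }
}
```

Each URL produced this way can be handed straight to driver.get(url), which is what the crawler below does instead of clicking "next page".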
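First create a Maven project for the crawler, mainly so that dependencies are easy to pull in. The main ones are listed below; the code later in this post also imports spring-jdbc, fastjson, Gson and the MySQL JDBC driver, so add those to the pom as well.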
<dependency>
<groupId>org.apache.httpcomponents</groupId>
<artifactId>httpclient</artifactId>
<version>4.5.6</version>
</dependency>
<dependency>
<groupId>c3p0</groupId>
<artifactId>c3p0</artifactId>
<version>0.9.1.2</version>
</dependency>
<dependency>
<groupId>org.seleniumhq.selenium</groupId>
<artifactId>selenium-chrome-driver</artifactId>
<version>3.141.59</version>
</dependency>
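Because selenium runs against a real browser here, that is, it drives the browser the way a user would, I chose the Chrome driver. So you also need to download chromedriver, a separate native executable that has to match your installed Chrome version. It is not the browser engine itself: selenium launches it as a local process and sends it WebDriver commands over HTTP, and chromedriver in turn controls the browser as it opens pages. In the Java program, its location is configured through the webdriver.chrome.driver system property.
The whole Java project is just a very basic main program in a plain Maven project; readers can turn it into a web application if they need to. Let's look at the selenium configuration first.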
JDSeleniumFullProxy
package com.shihuc.up.spider.jd.comment;

import com.google.common.collect.Lists;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.chrome.ChromeOptions;

import java.util.List;
import java.util.concurrent.TimeUnit;

public class JDSeleniumFullProxy {

    public static ChromeDriver driver;

    static {
        try {
            // Start the browser
            getDriver();
        } catch (InterruptedException e) {
            e.printStackTrace();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        getProductsWithFullScenario();
        Thread.sleep(10000);
        System.out.println("!!!!!!!==========Well Done===========!!!!!!");
        // Close the browser
        driver.quit();
    }

    private static void getProductsWithFullScenario() {
        String urls[] = new String[] {
            /* car phone mount */
            "https://search.jd.com/Search?keyword=%E8%BD%A6%E8%BD%BD%E6%89%8B%E6%9C%BA%E6%94%AF%E6%9E%B6&enc=utf-8&qrst=1&rt=1&stop=1&vt=2&click=0"
        };
        // Table-name suffixes: {product table, comment table}
        String products[][] = new String[][] {
            {"jd_info_czsjzj", "jd_comment_czsjzj"}
        };
        // How many list pages to crawl
        int hmp = 40;
        JDProductDao productDao = new JDProductDao();
        // Crawl the required data
        for (int i = 0; i < urls.length; i++) {
            JDSeleniumFullCrawler.getAllProducts(driver, hmp, urls[i], productDao, products[i]);
        }
        // Post-process prices and sales (price ranges, and sales figures containing
        // '万' or '+', are converted to numeric values)
        for (int i = 0; i < products.length; i++) {
            productDao.updateProductForPriceSells(products[i][0]);
        }
    }

    /**
     * Creates the ChromeDriver.
     * @throws InterruptedException
     */
    private static void getDriver() throws InterruptedException {
        String os = System.getProperty("os.name");
        if (os.toLowerCase().startsWith("win")) {
            System.setProperty("webdriver.chrome.driver",
                    System.getProperty("user.dir") + "\\chromedriver_win32\\chromedriver.exe");
        } else {
            System.setProperty("webdriver.chrome.driver", "/usr/bin/chromedriver");
        }
        ChromeOptions options = new ChromeOptions();
        // Hide the "Chrome is being controlled by automated test software" bar
        options.addArguments("--disable-infobars");
        // Disable the same-origin policy (used here so redirects are not blocked)
        options.addArguments("--disable-web-security");
        // Start maximized
        options.addArguments("--start-maximized");
        options.addArguments("--no-sandbox");
        List<String> excludeSwitches = Lists.newArrayList("enable-automation");
        options.setExperimentalOption("excludeSwitches", excludeSwitches);
        driver = new ChromeDriver(options);
        driver.manage().timeouts().implicitlyWait(5, TimeUnit.SECONDS);
        //driver.get("https://passport.jd.com/new/login.aspx");
        /**
         * None of the slider-simulation approaches below work well. The only way through
         * is to open the Taobao login page, switch manually to the Alipay login page and
         * scan the QR code with the Alipay app on a phone; that gets around the slider
         * that Taobao's anti-scraping puts in the way.
         */
        // while (true) {
        //     if (currentIsLoginPage()) {
        //         System.out.println("============>>>>");
        //     } else {
        //         System.out.println(">>>>>>OOOOOOOOOOO");
        //         break;
        //     }
        //     Thread.sleep(2000);
        // }
    }

    private static boolean currentIsLoginPage() {
        String url = driver.getCurrentUrl();
        return url.contains("https://passport.jd.com/new/login.aspx");
    }
}
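The webdriver.chrome.driver setting in getDriver() is the path to my chromedriver executable: chromedriver.exe sits in the chromedriver_win32 folder inside my project. Adjust it to wherever you put the file after downloading it.
The program above could also simulate logging in, but that isn't needed: JD lets you browse products any way you like without asking you to log in. That is unlike the Alibaba sites, which guard against scraping even when you just browse a little and pop up a login prompt every so often... not a fan.
Next comes the actual work of scraping the data with selenium and httpclient.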
JDSeleniumFullCrawler
package com.shihuc.up.spider.jd.comment;

import com.alibaba.fastjson.JSONArray;
import com.alibaba.fastjson.JSONObject;
import com.shihuc.up.spider.jd.opt.JDPhoneHolder;
import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.chrome.ChromeDriver;

import java.io.IOException;
import java.util.HashMap;
import java.util.List;
import java.util.Set;

public class JDSeleniumFullCrawler {

    // Labels: total comments, good, neutral, poor, video reviews, follow-up reviews
    private static String COMMENT_TOTAL = "评论总数";
    private static String COMMENT_GOOD = "好评数量";
    private static String COMMENT_GENERAL = "中评数量";
    private static String COMMENT_POOL = "差评数量";
    private static String COMMENT_VIDEO = "视频晒单";
    private static String COMMENT_AFTER = "追评数量";

    public static void getAllProducts(ChromeDriver driver, int howManyPages, String url,
                                      JDProductDao pdao, String[] pname) {
        for (int i = 1; i <= howManyPages; i++) {
            getFullPageProducts(driver, i, url, pdao, pname);
            try {
                Thread.sleep(100);
            } catch (InterruptedException e) {
                e.printStackTrace();
            }
        }
    }

    public static void getFullPageProducts(ChromeDriver driver, int i, String rawUrl,
                                           JDProductDao pdao, String[] pname) {
        // Alternative: type the page number into the pager and click "go"
        // WebElement pageNumInput = driver.findElement(By.xpath("//*[@id=\"J_bottomPage\"]/span[2]/input"));
        // pageNumInput.clear();
        // pageNumInput.sendKeys(i + "");
        // WebElement searchSubmit = driver.findElement(By.xpath("//*[@id=\"J_bottomPage\"]/span[2]/a"));
        // searchSubmit.click();
        String url = rawUrl + "&page=" + (2 * i - 1) + "&s=" + (60 * (i - 1) + 1);
        driver.get(url);
        getProductsProcess(driver, pdao, pname);
    }

    private static void getProductsProcess(ChromeDriver driver, JDProductDao pdao, String[] pname) {
        List<WebElement> itemElements = driver.findElements(By.cssSelector("#J_goodsList .gl-item"));
        System.out.println(itemElements.size());
        String mainHandle = driver.getWindowHandle();
        String href = null;
        for (WebElement we : itemElements) {
            try {
                String weId = we.getAttribute("data-pid");
                //WebElement weHref = we.findElement(By.cssSelector(".p-name a"));
                WebElement weHref = we.findElement(By.cssSelector(".p-img a"));
                //href = weHref.getAttribute("href");
                href = "https://item.jd.com/" + weId + ".html";

                // Price and comments often cannot be read this way because the site
                // renders them completely asynchronously
                String price = null;
                try {
                    WebElement wePrice = we.findElement(By.cssSelector(".p-price strong i"));
                    price = wePrice.getText();
                } catch (Exception ep) {
                    System.err.println("can not get the price information for pid " + weId + " ......");
                }
                // String sells = null;
                // try {
                //     WebElement weSells = we.findElement(By.cssSelector(".p-commit strong a"));
                //     sells = weSells.getText();
                // } catch (Exception ec) {
                //     System.err.println("can not get the comment information for pid " + weId + " ......");
                // }

                driver.executeScript("window.open(\"https://item.jd.com/" + weId + ".html\");");
                Set<String> handles = driver.getWindowHandles();
                String newHandle = "";
                for (String s : handles) {
                    if (s.equalsIgnoreCase(mainHandle)) {
                        continue;
                    }
                    newHandle = s;
                    break;
                }
                // Switch to the product detail window that was just opened
                driver.switchTo().window(newHandle);
                // Read the product details we care about from the detail page
                JDProduct product = getJDProductInfos(driver);
                try {
                    if (price == null) {
                        // Fall back to the price API via httpclient
                        price = JDPhoneHolder.getPrice(weId);
                    }
                } catch (IOException e) {
                    e.printStackTrace();
                }
                product.setUrl(href);
                product.setPid(weId);
                product.setPrice(price);

                //JDComment comment = getCommentByCD(driver);
                JDComment comment = getCommentByPID(weId);
                product.setComment(comment);

                int rid = pdao.addProductInfoGenId(product, pname[0]);
                pdao.addProductComments2(product, rid, pname[1]);
                // Close the product detail window that was just processed
                closeAllOtherWindows(mainHandle, driver);
            } catch (Exception eal) {
                closeAllOtherWindows(mainHandle, driver);
                eal.printStackTrace();
                System.out.println(href);
            }
        }
    }

    public static JDProduct getJDProductInfoByUrl(WebDriver driver, String url, JDProduct product) {
        System.out.println("URL: " + url);
        driver.get(url);
        WebElement weComment = driver.findElement(By.cssSelector(".comment-count .count"));
        WebElement wePrice = driver.findElement(By.cssSelector(".summary-price .price"));
        String strComment = weComment.getText();
        if (strComment.equalsIgnoreCase("0")) {
            try {
                strComment = JDPhoneHolder.getCommitCountNum(product.getPid()) + "";
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
        String strPrice = wePrice.getText();
        product.setPrice(strPrice);
        return product;
    }

    public static JDProduct getJDProductInfos(WebDriver driver) {
        WebElement weTitle = driver.findElement(By.cssSelector(".w div.sku-name"));
        String title = weTitle.getText();
        /**
         * Product model information. In my runs, xpath lookups here were much
         * faster than cssSelector.
         */
        WebElement weBrand = driver.findElement(By.xpath(".//*[@id=\"parameter-brand\"]/li/a"));
        String brand = weBrand.getText();
        WebElement weName = driver.findElement(By.xpath(".//*[@id=\"detail\"]/div[2]/div[1]/div[1]/ul[2]/li[1]"));
        String name = weName.getText();
        // Strip the "product name:" label
        name = name.replace("商品名称:", "").trim();

        JDProduct product = new JDProduct();
        product.setBrand(brand);
        product.setPname(name);
        product.setTitle(title);
        return product;
    }

    public static JDComment getCommentByPID(String pid) {
        JDComment comments = new JDComment();
        HashMap<String, Integer> groups = new HashMap<>();
        try {
            JSONObject commentJson = JDPhoneHolder.getComments(pid);
            JSONObject productCommentSummary = commentJson.getJSONObject("productCommentSummary");
            // Positive-review rate
            int goodRateShow = productCommentSummary.getInteger("goodRateShow");
            comments.setGoodRate(goodRateShow);
            // Total number of comments
            int commentCount = productCommentSummary.getInteger("commentCount");
            comments.setTotalc(commentCount);
            // Good reviews
            int goodCount = productCommentSummary.getInteger("goodCount");
            comments.setGoodc(goodCount);
            // Neutral reviews
            int generalCount = productCommentSummary.getInteger("generalCount");
            comments.setGeneralc(generalCount);
            // Poor reviews
            int poorCount = productCommentSummary.getInteger("poorCount");
            comments.setPoorc(poorCount);
            // Video reviews
            int videoCount = productCommentSummary.getInteger("videoCount");
            comments.setVideoc(videoCount);
            // Follow-up reviews
            int afterCount = productCommentSummary.getInteger("afterCount");
            comments.setAfterc(afterCount);

            JSONArray hotCommentTagStatistics = commentJson.getJSONArray("hotCommentTagStatistics");
            for (int i = 0; i < hotCommentTagStatistics.size(); i++) {
                JSONObject hotComment = hotCommentTagStatistics.getJSONObject(i);
                String name = hotComment.getString("name");
                int count = hotComment.getInteger("count");
                groups.put(name, count);
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
        comments.setCommentGroups(groups);
        return comments;
    }

    public static JDComment getCommentByCD(ChromeDriver driver) {
        JDComment comment = new JDComment();
        WebElement weCommentTab = driver.findElement(By.xpath("//*[@id=\"detail\"]/div[1]/ul/li[5]"));
        weCommentTab.click();
        try {
            Thread.sleep(2000);
        } catch (InterruptedException e) {
            e.printStackTrace();
        }
        WebElement weGoodRate = driver.findElement(By.cssSelector(".comment-percent .percent-con"));
        String goodRate = weGoodRate.getText();
        int len = goodRate.length();
        if (len > 1) {
            // Drop the trailing '%'
            goodRate = goodRate.substring(0, len - 1);
        }
        int rate = Integer.valueOf(goodRate);

        List<WebElement> weGroupList = driver.findElements(By.cssSelector(".J-comment-info .percent-info .tag-list .tag-1"));
        HashMap<String, Integer> groups = new HashMap<>();
        for (WebElement we : weGroupList) {
            String rawGroup = we.getText();
            splitDescInfo(rawGroup, groups);
        }

        List<WebElement> weLevelList = driver.findElements(By.cssSelector(".J-comments-list .filter-list li"));
        HashMap<String, Integer> levels = new HashMap<>();
        for (WebElement we : weLevelList) {
            WebElement weLevel = we.findElement(By.cssSelector("a"));
            if (containsDatatab(weLevel)) {
                //TODO
                // String rawLevel = weLevel.getText();
                // splitDescInfo(rawLevel, levels);
            }
        }
        comment.setGoodRate(rate);
        comment.setCommentGroups(groups);
        return comment;
    }

    private static boolean containsDatatab(WebElement we) {
        // getAttribute returns null (it does not throw) when the attribute is absent
        try {
            return we.getAttribute("data-tab") != null;
        } catch (Exception e) {
            return false;
        }
    }

    // Splits a label such as "name(123)" into its text and its count
    private static void splitDescInfo(String desc, HashMap<String, Integer> map) {
        String info = desc;
        int commaIdx = info.indexOf("(");
        if (commaIdx < 0) {
            // No "(count)" part, nothing to record
            return;
        }
        String context = info.substring(0, commaIdx);
        String strCount = info.substring(commaIdx + 1, info.length() - 1);
        float count = getRealCount(strCount);
        map.put(context, (int) count);
    }

    // Converts counts such as "1.2万" or "500+" to a number
    private static float getRealCount(String rawCount) {
        float realCount;
        if (rawCount.contains("万")) {
            int wanIdx = rawCount.indexOf("万");
            String strRealCount = rawCount.substring(0, wanIdx);
            realCount = Float.valueOf(strRealCount) * 10000;
        } else if (rawCount.contains("+")) {
            int plusIdx = rawCount.indexOf("+");
            String strRealCount = rawCount.substring(0, plusIdx);
            realCount = Integer.valueOf(strRealCount);
        } else {
            realCount = Integer.valueOf(rawCount);
        }
        return realCount;
    }

    private static void closeAllOtherWindows(String main, ChromeDriver driver) {
        Set<String> handles = driver.getWindowHandles();
        System.out.println("------->main: " + main);
        Object[] hs = handles.toArray();
        // Index 0 is the main window; close everything opened after it
        for (int i = hs.length - 1; i > 0; i--) {
            System.out.println("-------->child: " + hs[i]);
            driver.switchTo().window(hs[i].toString());
            driver.close();
        }
        driver.switchTo().window(main);
    }
}
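In this class the key point is the window-switching logic. If you get it wrong, the page you want to read and the page the driver's current handle points to may not be the same, which produces element-not-found errors. This is a common mistake, so manage window handles carefully and close each page once it has been crawled. (In selenium's Java binding the handles are recorded in the order the windows were opened, in a LinkedHashSet, so later windows come later in the set; converting the Set to an array gives a simple way to close everything except the main window. The WebDriver specification itself does not promise this ordering, so treat it as an implementation detail.) This simple approach is enough because the crawling scenario is simple: switching between the list page and detail pages.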
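If you would rather not depend on the order of the handle set, one alternative is to compare the handles before and after opening the detail page and take whatever is new. The sketch below shows only that set logic with plain Java collections, with no selenium dependency; WindowHandles and newlyOpened are hypothetical names, not part of selenium.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Optional;
import java.util.Set;

/** Sketch: find the window handle that appeared after opening a new window. */
final class WindowHandles {

    /**
     * Returns a handle that is in 'after' but not in 'before', if there is one.
     * Works regardless of the iteration order of either set.
     */
    static Optional<String> newlyOpened(Set<String> before, Set<String> after) {
        Set<String> diff = new LinkedHashSet<>(after);
        diff.removeAll(before);
        return diff.stream().findFirst();
    }

    public static void main(String[] args) {
        // Made-up handle strings standing in for what a driver would return.
        Set<String> before = new LinkedHashSet<>(Arrays.asList("CDwindow-main"));
        Set<String> after = new LinkedHashSet<>(Arrays.asList("CDwindow-detail", "CDwindow-main"));
        System.out.println(newlyOpened(before, after).orElse("<none>"));
    }
}
```

In the crawler, before would be driver.getWindowHandles() captured just ahead of the window.open script, and after the same call made right after it; the returned handle is then passed to driver.switchTo().window(...).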
Next is writing the scraped data to the database. I used Spring's JdbcTemplate, which is very simple. It isn't as powerful as MyBatis, but it is enough for storing some scraped data.
JDProductDao
package com.shihuc.up.spider.jd.comment;

import com.mchange.v2.c3p0.ComboPooledDataSource;
import org.openqa.selenium.chrome.ChromeDriver;
import org.springframework.jdbc.core.BatchPreparedStatementSetter;
import org.springframework.jdbc.core.JdbcTemplate;
import org.springframework.jdbc.core.PreparedStatementCreator;
import org.springframework.jdbc.core.RowCallbackHandler;
import org.springframework.jdbc.support.GeneratedKeyHolder;
import org.springframework.jdbc.support.KeyHolder;

import java.beans.PropertyVetoException;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;

public class JDProductDao extends JdbcTemplate {

    public JDProductDao() {
        // Set up the c3p0 connection pool
        ComboPooledDataSource ds = new ComboPooledDataSource();
        try {
            ds.setDriverClass("com.mysql.jdbc.Driver");
            ds.setUser("root");
            ds.setPassword("shihuc");
            ds.setJdbcUrl("jdbc:mysql://localhost:3306/nav?characterEncoding=utf-8");
        } catch (PropertyVetoException e) {
            e.printStackTrace();
        }
        super.setDataSource(ds);
    }

    // Inserts one product row and returns its generated primary key
    public int addProductInfoGenId(JDProduct product, String shop) {
        KeyHolder keyHolder = new GeneratedKeyHolder();
        JDComment comment = product.getComment();
        super.update(new PreparedStatementCreator() {
            final String sql = "insert into good_holder_" + shop
                    + " (pid,title,brand,pname,price,url, goodrate,totalc,goodc,generalc,poorc,videoc,afterc)"
                    + " values (?,?,?,?,?,?,?,?,?,?,?,?,?)";

            public PreparedStatement createPreparedStatement(java.sql.Connection conn) throws SQLException {
                PreparedStatement ps = conn.prepareStatement(sql, Statement.RETURN_GENERATED_KEYS);
                ps.setString(1, product.getPid());
                ps.setString(2, product.getTitle());
                ps.setString(3, product.getBrand());
                ps.setString(4, product.getPname());
                ps.setString(5, product.getPrice());
                ps.setString(6, product.getUrl());
                ps.setInt(7, comment.getGoodRate());
                ps.setInt(8, comment.getTotalc());
                ps.setInt(9, comment.getGoodc());
                ps.setInt(10, comment.getGeneralc());
                ps.setInt(11, comment.getPoorc());
                ps.setInt(12, comment.getVideoc());
                ps.setInt(13, comment.getAfterc());
                return ps;
            }
        }, keyHolder);
        return keyHolder.getKey().intValue();
    }

    public void addProductComments(JDProduct product, int rid, String shop) {
        final String sql = "insert into good_holder_" + shop + " (rid,info,count) values (?,?,?)";
        List<Object[]> comments = transformCommentsToObjects(rid, product.getComment());
        super.batchUpdate(sql, new BatchPreparedStatementSetter() {
            @Override
            public void setValues(PreparedStatement ps, int i) throws SQLException {
                ps.setInt(1, (Integer) comments.get(i)[0]);
                ps.setString(2, (String) comments.get(i)[1]);
                ps.setInt(3, (Integer) comments.get(i)[2]);
            }

            @Override
            public int getBatchSize() {
                return comments.size();
            }
        });
    }

    // Same insert as addProductComments, using the simpler batchUpdate overload
    public void addProductComments2(JDProduct product, int rid, String shop) {
        final String sql = "insert into good_holder_" + shop + " (rid,info,count) values (?,?,?)";
        List<Object[]> comments = transformCommentsToObjects(rid, product.getComment());
        super.batchUpdate(sql, comments);
    }

    private List<Object[]> transformCommentsToObjects(int rid, JDComment comments) {
        List<Object[]> list = new ArrayList<>();
        Object[] object = null;
        HashMap<String, Integer> groups = comments.getCommentGroups();
        for (String group : groups.keySet()) {
            object = new Object[]{
                    rid,
                    group,
                    groups.get(group),
            };
            list.add(object);
        }
        return list;
    }

    public List<JDProduct> updateProductForPriceSells(String tableIdx) {
        // Query the rows and handle the result set with a RowCallbackHandler
        String sql = "select id, pid, price from good_holder_" + tableIdx;
        final JDProduct product = new JDProduct();
        List<JDProduct> nokProducts = new ArrayList<>();
        // Copy each row into the product object, then process it
        super.query(sql, new Object[]{}, new RowCallbackHandler() {
            public void processRow(ResultSet rs) throws SQLException {
                product.setId(rs.getInt("id"));
                product.setPid(rs.getString("pid"));
                product.setPrice(rs.getString("price"));
                dataProcess(product, tableIdx);
            }
        });
        return nokProducts;
    }

    public void updateNokProductForPriceSells(String tableIdx, ChromeDriver driver) {
        // Query the rows and handle the result set with a RowCallbackHandler
        String sql = "select id, url, price from good_holder_" + tableIdx;
        final JDProduct product = new JDProduct();
        // Copy each row into the product object; re-crawl rows whose price is missing
        super.query(sql, new Object[]{}, new RowCallbackHandler() {
            public void processRow(ResultSet rs) throws SQLException {
                product.setId(rs.getInt("id"));
                product.setUrl(rs.getString("url"));
                product.setPrice(rs.getString("price"));
                if (isNokProduct(product, tableIdx)) {
                    JDProduct pd = JDSeleniumFullCrawler.getJDProductInfoByUrl(driver, product.getUrl(), product);
                    reSetPriceOrSells(product.getId(), tableIdx, pd.getPrice());
                }
            }
        });
    }

    public boolean isNokProduct(JDProduct product, String tableIdx) {
        String price = product.getPrice();
        String url = product.getUrl();
        if (price.equalsIgnoreCase("")) {
            System.out.println("good_holder_" + tableIdx + ", id=" + product.getId() + " data is not ok");
            if (url != null && !url.equalsIgnoreCase("")) {
                return true;
            }
        }
        return false;
    }

    // Splits a price such as "10.00-20.00" into low/high and stores both
    public void dataProcess(JDProduct product, String tableIdx) {
        String price = product.getPrice();
        double dlow = 0, dhigh = 0;
        if (price.equalsIgnoreCase("")) {
            System.out.println("good_holder_" + tableIdx + ", id=" + product.getId() + " data is not ok");
            return;
        }
        String low = "0", high = "0";
        if (price.contains("-")) {
            int idx = price.indexOf("-");
            low = price.substring(0, idx);
            high = price.substring(idx + 1);
        } else {
            low = price;
            high = price;
        }
        dlow = Double.valueOf(low);
        dhigh = Double.valueOf(high);
        // String countReg = "^[1-9][0-9]*";
        // Pattern p = Pattern.compile(countReg);
        // Matcher m = p.matcher(sells);
        // if (m.find()) {
        //     String sc = m.group();
        //     sellCount = Integer.valueOf(sc);
        // }
        updateProductPriceSell(product.getId(), tableIdx, dlow, dhigh);
    }

    public void updateProductPriceSell(int id, String tableIdx, double priceLow, double priceHigh) {
        String sql = "update good_holder_" + tableIdx + " set priceLow=?,priceHigh=? where id=?";
        int rows = super.update(sql, priceLow, priceHigh, id);
        System.out.println(rows);
    }

    public void reSetPriceOrSells(int id, String tableIdx, String price) {
        String sql = "update good_holder_" + tableIdx + " set price=? where id=?";
        int rows = super.update(sql, price, id);
        System.out.println(rows);
    }
}
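The clean-up rules mentioned in the code comments (prices that are ranges, counts containing 万 or +) are easy to get wrong, so it can help to pull them out into small pure functions that can be tested without a database or a browser. The following is a sketch only, not the code used for the crawl: JdValueCleaner and its method names are made up, and the accepted input shapes ("37", "500+", "2万+", "59.00", "10.00-20.50") are assumptions based on the parsing code above.

```java
import java.math.BigDecimal;
import java.math.RoundingMode;
import java.util.Optional;

/** Sketch: normalizes scraped count and price strings into numbers. */
final class JdValueCleaner {

    // "\u4e07" is the character 万 (ten thousand), escaped to avoid source-encoding issues.
    private static final String WAN = "\u4e07";

    /** Parses counts such as "37", "500+", "2万+" or "1.2万" into an int. */
    static int parseCount(String raw) {
        String s = raw.trim().replace("+", "");
        if (s.endsWith(WAN)) {
            BigDecimal wan = new BigDecimal(s.substring(0, s.length() - 1));
            // BigDecimal avoids float rounding surprises for values like 1.2 * 10000.
            return wan.multiply(BigDecimal.valueOf(10_000))
                    .setScale(0, RoundingMode.HALF_UP)
                    .intValueExact();
        }
        return Integer.parseInt(s);
    }

    /**
     * Parses "59.00" or "10.00-20.50" into {low, high}.
     * Null or blank input yields Optional.empty() instead of an exception.
     */
    static Optional<double[]> parsePriceRange(String raw) {
        if (raw == null || raw.trim().isEmpty()) {
            return Optional.empty();
        }
        String s = raw.trim();
        int dash = s.indexOf('-');
        if (dash < 0) {
            double p = Double.parseDouble(s);
            return Optional.of(new double[] {p, p});
        }
        double low = Double.parseDouble(s.substring(0, dash).trim());
        double high = Double.parseDouble(s.substring(dash + 1).trim());
        return Optional.of(new double[] {low, high});
    }

    public static void main(String[] args) {
        System.out.println(parseCount("2" + WAN + "+"));
        double[] range = parsePriceRange("10.00-20.50").get();
        System.out.println(range[0] + " .. " + range[1]);
    }
}
```

Returning Optional.empty() for a blank price mirrors the "data is not ok" branch in dataProcess, where such rows are skipped rather than written as zero.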
Below are the model classes for product information and comment information.
JDProduct
package com.shihuc.up.spider.jd.comment;

public class JDProduct {

    private int id;
    private String pid;
    private String title;
    private String brand;
    private String pname;
    private String price;
    private String url;
    private double priceHigh;
    private double priceLow;
    private JDComment comment;

    public int getId() { return id; }
    public void setId(int id) { this.id = id; }

    public String getPid() { return pid; }
    public void setPid(String pid) { this.pid = pid; }

    public String getTitle() { return title; }
    public void setTitle(String title) { this.title = title; }

    public String getBrand() { return brand; }
    public void setBrand(String brand) { this.brand = brand; }

    public String getPname() { return pname; }
    public void setPname(String pname) { this.pname = pname; }

    public String getPrice() { return price; }
    public void setPrice(String price) { this.price = price; }

    public String getUrl() { return url; }
    public void setUrl(String url) { this.url = url; }

    public double getPriceHigh() { return priceHigh; }
    public void setPriceHigh(double priceHigh) { this.priceHigh = priceHigh; }

    public double getPriceLow() { return priceLow; }
    public void setPriceLow(double priceLow) { this.priceLow = priceLow; }

    public JDComment getComment() { return comment; }
    public void setComment(JDComment comment) { this.comment = comment; }

    @Override
    public String toString() {
        return "Product{" +
                "pid=" + pid +
                ", title='" + title + '\'' +
                ", brand='" + brand + '\'' +
                ", pname='" + pname + '\'' +
                ", price='" + price + '\'' +
                '}';
    }
}
JDComment
package com.shihuc.up.spider.jd.comment;

import java.util.HashMap;

public class JDComment {

    private Integer goodRate;

    /**
     * Comment categories (tags) and the number of comments in each
     */
    private HashMap<String, Integer> commentGroups;

    // Tmall shows sales figures; Taobao, like JD, shows cumulative comment counts
    private int totalc;
    private int goodc;
    private int generalc;
    private int poorc;
    private int videoc;
    private int afterc;

    public Integer getGoodRate() { return goodRate; }
    public void setGoodRate(Integer goodRate) { this.goodRate = goodRate; }

    public HashMap<String, Integer> getCommentGroups() { return commentGroups; }
    public void setCommentGroups(HashMap<String, Integer> commentGroups) { this.commentGroups = commentGroups; }

    public int getTotalc() { return totalc; }
    public void setTotalc(int totalc) { this.totalc = totalc; }

    public int getGoodc() { return goodc; }
    public void setGoodc(int goodc) { this.goodc = goodc; }

    public int getGeneralc() { return generalc; }
    public void setGeneralc(int generalc) { this.generalc = generalc; }

    public int getPoorc() { return poorc; }
    public void setPoorc(int poorc) { this.poorc = poorc; }

    public int getVideoc() { return videoc; }
    public void setVideoc(int videoc) { this.videoc = videoc; }

    public int getAfterc() { return afterc; }
    public void setAfterc(int afterc) { this.afterc = afterc; }
}
One more thing to add here: the httpclient utility class used to fetch pages for the price and comment data.
HttpClientUtils
package com.shihuc.up.spider;

import com.shihuc.up.spider.jd.opt.JDPhoneHolder;
import org.apache.http.client.config.RequestConfig;
import org.apache.http.client.entity.UrlEncodedFormEntity;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.client.methods.HttpPost;
import org.apache.http.client.methods.HttpRequestBase;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.impl.conn.PoolingHttpClientConnectionManager;
import org.apache.http.message.BasicNameValuePair;
import org.apache.http.util.EntityUtils;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class HttpClientUtils {

    // httpclient connection pool
    private static PoolingHttpClientConnectionManager connectionManager;

    static {
        connectionManager = new PoolingHttpClientConnectionManager();
        // Maximum number of connections in the pool
        connectionManager.setMaxTotal(200);
        // At most 20 connections per route (target host)
        connectionManager.setDefaultMaxPerRoute(20);
    }

    private static CloseableHttpClient getCloseableHttpClient() {
        CloseableHttpClient httpClient = HttpClients.custom().setConnectionManager(connectionManager).build();
        return httpClient;
    }

    private static String execute(HttpRequestBase httpRequestBase) throws IOException {
        httpRequestBase.setHeader("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:62.0) Gecko/20100101 Firefox/62.0");
        // Timeouts
        RequestConfig config = RequestConfig.custom()
                .setConnectionRequestTimeout(10000)
                .setConnectTimeout(10000)
                .setSocketTimeout(15 * 1000)
                .build();
        httpRequestBase.setConfig(config);
        CloseableHttpClient httpClient = getCloseableHttpClient();
        // Close the response so the connection goes back to the pool
        try (CloseableHttpResponse response = httpClient.execute(httpRequestBase)) {
            return EntityUtils.toString(response.getEntity(), "utf-8");
        }
    }

    private static String executeReferer(HttpRequestBase httpRequestBase, String referer) throws IOException {
        httpRequestBase.setHeader("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:62.0) Gecko/20100101 Firefox/62.0");
        httpRequestBase.setHeader("Referer", referer);
        httpRequestBase.setHeader("Sec-Fetch-Mode", "no-cors");
        // Timeouts
        RequestConfig config = RequestConfig.custom()
                .setConnectionRequestTimeout(60000)
                .setConnectTimeout(60000)
                .setSocketTimeout(10 * 10000)
                .build();
        httpRequestBase.setConfig(config);
        CloseableHttpClient httpClient = getCloseableHttpClient();
        // Close the response so the connection goes back to the pool
        try (CloseableHttpResponse response = httpClient.execute(httpRequestBase)) {
            return EntityUtils.toString(response.getEntity(), "utf-8");
        }
    }

    public static String doGet(String url) throws IOException {
        HttpGet httpGet = new HttpGet(url);
        String html = execute(httpGet);
        return html;
    }

    public static String doGetReferer(String url, String referer) throws IOException {
        HttpGet httpGet = new HttpGet(url);
        String html = executeReferer(httpGet, referer);
        return html;
    }

    public static String doPost(String url, Map<String, String> params) throws IOException {
        HttpPost httpPost = new HttpPost(url);
        List<BasicNameValuePair> list = new ArrayList<>();
        for (String key : params.keySet()) {
            list.add(new BasicNameValuePair(key, params.get(key)));
        }
        UrlEncodedFormEntity entity = new UrlEncodedFormEntity(list);
        httpPost.setEntity(entity);
        return execute(httpPost);
    }

    public static void main(String args[]) {
        String pid = "4310407";
        // try {
        //     JDPhoneHolder.getCommitCount(pid);
        // } catch (IOException e) {
        //     e.printStackTrace();
        // }
        try {
            int commitCountNum = JDPhoneHolder.getCommitCountNum(pid);
            // Prints the product id and its comment count
            System.out.println("产品: " + pid + ", 评论数:" + commitCountNum);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
The table structures used are attached here as well:
Product table:
CREATE TABLE `good_holder_jd_info_czsjzj` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `pid` varchar(32) NOT NULL COMMENT 'product ID',
  `title` varchar(1024) NOT NULL COMMENT 'product title / description',
  `brand` varchar(1024) NOT NULL COMMENT 'product brand',
  `pname` varchar(1024) NOT NULL COMMENT 'product name',
  `price` varchar(32) NOT NULL COMMENT 'product price',
  `url` varchar(2048) NOT NULL COMMENT 'product link',
  `priceLow` double(16,2) DEFAULT NULL COMMENT 'low end of the price',
  `priceHigh` double(16,2) DEFAULT NULL COMMENT 'high end of the price',
  `goodrate` int(11) DEFAULT NULL COMMENT 'product review score',
  `totalc` int(64) DEFAULT NULL COMMENT 'total number of comments',
  `goodc` int(11) DEFAULT NULL COMMENT 'number of good reviews',
  `generalc` int(11) DEFAULT NULL COMMENT 'number of neutral reviews',
  `poorc` int(11) DEFAULT NULL COMMENT 'number of poor reviews',
  `videoc` int(11) DEFAULT NULL COMMENT 'number of video reviews',
  `afterc` int(11) DEFAULT NULL COMMENT 'number of follow-up reviews',
  PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=1134 DEFAULT CHARSET=utf8mb4
Comment-category table (I did not scrape the full text of the comments, only the comment categories and their counts):
CREATE TABLE `good_holder_jd_comment_czsjzj` (
  `rid` int(11) NOT NULL COMMENT 'primary key ID of the product row this comment data belongs to',
  `info` varchar(256) DEFAULT NULL COMMENT 'category description text',
  `count` int(11) DEFAULT NULL COMMENT 'number of comments in this category'
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4
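The data in this comment-category table looks like the content circled in red in the figure below:
To close the post, here are the methods for fetching JD product prices and comment data: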
// Fetch the price; only the product ID needs to be passed in
public static String getPrice(String pid) throws IOException {
    String priceUrl = "https://p.3.cn/prices/mgets?pduid=" + Math.random() + "&skuIds=J_" + pid;
    String priceJson = HttpClientUtils.doGet(priceUrl);
    System.out.println(priceJson);
    Gson gson = new GsonBuilder().create();
    // The response is a JSON array of objects; "p" holds the price
    List<Map<String, String>> list = gson.fromJson(priceJson, List.class);
    return list.get(0).get("p");
}
// Fetch the product's comment information; only the product ID needs to be passed in
public static JSONObject getComments(String pid) throws IOException {
    String baseUrl = "https://sclub.jd.com/comment/productPageComments.action?score=0&sortType=5&page=1&pageSize=1&isShadowSku=0&productId=" + pid;
    String commentJson = HttpClientUtils.doGet(baseUrl);
    System.out.println(commentJson);
    JSONObject jsonObject = JSON.parseObject(commentJson);
    return jsonObject;
}
The URLs in these two functions are the key part. Judging from them, JD's site exposes this information in a relatively simple way.
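To show what the price lookup relies on, here is a dependency-free sketch that pulls the "p" field out of a response string with a regular expression instead of Gson. It is a stand-in for illustration, not a replacement for a real JSON parser. The sample payload in main is hypothetical: only the "p" key is confirmed by the getPrice code above, and the rest of the shape is assumed. JdPriceJson and firstPrice are made-up names.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Sketch: extracts the first "p" (price) value from a price-API response body. */
final class JdPriceJson {

    // Matches "p":"<value>" and captures <value>; assumes the price is a JSON string.
    private static final Pattern PRICE = Pattern.compile("\"p\"\\s*:\\s*\"([^\"]*)\"");

    /** Returns the first price found, or Optional.empty() if the body has none. */
    static Optional<String> firstPrice(String json) {
        if (json == null) {
            return Optional.empty();
        }
        Matcher m = PRICE.matcher(json);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        // Hypothetical response body; the real API may return more or different fields.
        String sample = "[{\"p\":\"59.00\",\"id\":\"J_4310407\"}]";
        System.out.println(firstPrice(sample).orElse("<no price>"));
    }
}
```

Returning an Optional makes the empty case explicit, whereas list.get(0) in getPrice throws if the API answers with an empty array.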
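That's it for this post. The crawler above (mainly for scraping car phone mount listings) can scrape similar information for other products with small changes. Comments are welcome, and so are solutions for getting around Alibaba's anti-scraping measures!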