Java Data Science Cookbook (Complete)

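Preface

Data science is a popular field of specialization today. It spans a broad range of areas in artificial intelligence, such as data processing, information retrieval, machine learning, natural language processing, big data, deep neural networks, and data visualization. In this book, we look at techniques that are both modern and smart, presented as more than 70 problems explained in an easy-to-follow way.

Given the high demand for skilled data scientists, we have written the recipes using core Java as well as well-known, classic, and state-of-the-art data science libraries written in Java. We start with the process of collecting and cleaning data. Then we see how the acquired data can be indexed and searched. After that, we cover descriptive and inferential statistics and their application to data. Two consecutive chapters then discuss the application of machine learning to data, which can form the foundation of any intelligent system. Modern information retrieval and natural language processing techniques are also covered. Big data is an emerging field, and several aspects of it are covered as well. We also cover the basics of deep learning with deep neural networks. Finally, we learn how to represent data, and the information obtained from it, using meaningful visuals or graphs.

This book is aimed at anyone who is interested in data science and plans to apply it using Java to better understand the underlying data.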

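What this book covers

Chapter 1, Obtaining and Cleaning Data, covers different ways to read and write data and to clean it of noise. It also familiarizes readers with different data file types, such as PDF, ASCII, CSV, TSV, XML, and JSON. The chapter also presents recipes for extracting web data.

Chapter 2, Indexing and Searching Data, covers how to index data with Apache Lucene for fast searching. The techniques described in this chapter can be seen as the basis of modern search technology.

Chapter 3, Analyzing Data Statistically, covers the use of the Apache Commons Math API to collect and analyze statistics from data. The chapter also covers higher-level concepts such as statistical significance testing, the standard tool researchers use when comparing their results against benchmarks.

Chapter 4, Learning from Data - Part 1, covers basic classification, clustering, and feature selection exercises using the Weka machine learning workbench.

Chapter 5, Learning from Data - Part 2, is a follow-up chapter that covers data import and export, classification, and feature selection using another Java library, the Java Machine Learning (Java-ML) library. The chapter also covers basic classification with the Stanford Classifier and Massive Online Analysis (MOA).

Chapter 6, Retrieving Information from Text Data, covers the application of data science to information retrieval from text data. It covers core Java as well as popular libraries such as OpenNLP, Stanford CoreNLP, Mallet, and Weka for applying machine learning to information extraction and retrieval tasks.

Chapter 7, Handling Big Data, covers the application of big data platforms to machine learning, such as Apache Mahout and Spark MLlib.

Chapter 8, Learn Deeply from Data, covers the basics of deep learning using the Deeplearning4j (DL4j) library. We cover the word2vec algorithm, belief networks, and autoencoders.

Chapter 9, Visualizing Data, covers the GRAL package for generating attractive, informative displays based on data. Of the package's many features, a selection of basic, fundamental plots is covered.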

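What you need for this book

We have used Java to solve real-world data science problems. Our focus is on providing effective content for those who want to know how to solve problems with Java. A basic knowledge of Java is required, such as classes, objects, methods, arguments and parameters, exceptions, and exporting Java Archive (JAR) files. The code is well supported with narration, information, and tips to help readers understand its context and purpose. The theory behind the problems solved in this book is, in many cases, not discussed thoroughly, but references are provided for interested readers where necessary.

Who this book is for

This book is for those who want to know how to use Java to solve real-world problems related to data science. Because its coverage is broad, it can also be useful for practitioners who already work in data science and want to use Java to solve problems in their projects.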

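Sections

In this book, you will find several headings that appear frequently (Getting ready, How to do it..., How it works..., There's more..., and See also).

To give clear instructions on how to complete a recipe, we use these sections as follows:

Getting ready

This section tells you what to expect in the recipe and describes how to set up any software or preliminary settings the recipe requires.

How to do it...

This section contains the steps required to follow the recipe.

How it works...

This section usually consists of a detailed explanation of what happened in the previous section.

There's more...

This section contains additional information about the recipe to give the reader a deeper understanding of it.

See also

This section provides links to other useful information related to the recipe.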

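Conventions

In this book, you will find a number of text styles that distinguish between different kinds of information. Here are some examples of these styles and an explanation of their meaning.

Code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles are shown as follows: "Among them, you will find a folder named lib, which is the folder you are interested in."

A block of code is set as follows: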

    classVals = new ArrayList<String>(); 
      for (int i = 0; i < 5; i++){ 
        classVals.add("class" + (i + 1)); 
    }

Any command-line input or output is written as follows:

@relation MyRelation 

@attribute age numeric 
@attribute name string 
@attribute dob date yyyy-MM-dd 
@attribute class {class1,class2,class3,class4,class5} 

@data 
35,'John Doe',1981-01-20,class3 
30,'Harry Potter',1986-07-05,class1

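New terms and important words are shown in bold. Words that you see on the screen, for example in menus or dialog boxes, appear in the text like this: "Select System info from the Administration panel."

Note

Warnings or important notes appear in a box like this.

Tip

Tips and tricks appear like this.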

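Reader feedback

Feedback from our readers is always welcome. Let us know what you think about this book: what you liked or disliked. Reader feedback is important to us because it helps us develop titles that you will really get the most out of.

To send us general feedback, simply e-mail feedback@packtpub.com and mention the book's title in the subject of your message.

If there is a topic in which you have expertise and you are interested in either writing or contributing to a book, see our author guide at www.packtpub.com/authors.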

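Customer support

Now that you are the proud owner of a Packt book, we have a number of things to help you get the most from your purchase.

Downloading the example code

You can download the example code files for this book from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.

You can download the code files by following these steps:

Log in or register on our website using your e-mail address and password.
Hover the mouse pointer over the SUPPORT tab at the top.
Click on Code Downloads & Errata.
Enter the name of the book in the Search box.
Select the book for which you want to download the code files.
Choose from the drop-down menu where you purchased this book.
Click on Code Download.

You can also download the code files by clicking on the Code Files button on the book's web page on the Packt Publishing website. This page can be reached by entering the book's name in the Search box. Note that you need to be logged in to your Packt account.

Once the files are downloaded, make sure that you unzip or extract the folder using the latest version of:

WinRAR / 7-Zip for Windows
Zipeg / iZip / UnRarX for Mac
7-Zip / PeaZip for Linux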

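The code bundle for this book is also hosted on GitHub at https://github.com/PacktPublishing/Java-Data-Science-Cookbook. We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!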

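Downloading the color images of this book

We also provide a PDF file that contains color images of the screenshots and diagrams used in this book. The color images will help you better understand the changes in the output. You can download this file from https://www.packtpub.com/sites/default/files/downloads/javadatasciencecoookbook_colorimages.pdf.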

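Errata

Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books, whether in the text or in the code, we would be grateful if you could report it to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your book, clicking on the Errata Submission Form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded to our website or added to any list of existing errata under the Errata section of that title.

To view previously submitted errata, go to https://www.packtpub.com/books/content/support and enter the name of the book in the search field. The required information will appear under the Errata section.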

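Piracy

Piracy of copyrighted material on the Internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works in any form on the Internet, please provide us with the location address or website name immediately so that we can pursue a remedy.

Please contact us at copyright@packtpub.com with a link to the suspected pirated material.

We appreciate your help in protecting our authors and our ability to bring you valuable content.

Questions

If you have a problem with any aspect of this book, you can contact us at questions@packtpub.com, and we will do our best to address it.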

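Chapter 1. Obtaining and Cleaning Data

In this chapter, we will cover the following recipes:

Retrieving all filenames from hierarchical directories using Java
Retrieving all filenames from hierarchical directories using Apache Commons IO
Reading contents from text files all at once using Java 8
Reading contents from text files all at once using Apache Commons IO
Extracting PDF text using Apache Tika
Cleaning ASCII text files using regular expressions
Parsing Comma Separated Value (CSV) files using Univocity
Parsing Tab Separated Value (TSV) files using Univocity
Parsing XML files using JDOM
Writing JSON files using JSON.simple
Reading JSON files using JSON.simple
Extracting web data from a URL using JSoup
Extracting web data from a website using Selenium Webdriver
Reading table data from a MySQL database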

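Introduction

Every data scientist needs to deal with data stored on disk in several formats, such as ASCII text, PDF, XML, JSON, and so on. Data can also be stored in database tables. Before doing any analysis, the first task of a data scientist is to obtain data from these sources and formats and to apply data-cleaning techniques that remove the noise present in it. In this chapter, we will see recipes that accomplish this important task.

We will use external Java libraries (Java archive files, or simply JAR files) not only in this chapter but throughout the book. These libraries are created by developers or organizations to make everyone's life easier. We will use the Eclipse IDE for code development throughout the book, preferably on the Windows platform. Many recipes instruct you to include an external JAR file in your project; this is how you do it.

To add a JAR file to a project in Eclipse, right-click the project and choose Build Path | Configure Build Path. Under the Libraries tab, click Add External JARs... and select the external JAR file(s) that you are going to use for that project.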

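Retrieving all filenames from hierarchical directories using Java

This recipe (and the one that follows) is for data scientists who want to retrieve file paths and names (for later analysis) from a complex directory structure that contains many directories and files under a root directory.

Getting ready

To perform this recipe, we need the following:

Create directories within directories (as many levels deep as you like).
Create text files in some of those directories, and leave some directories empty for more excitement.

How to do it...

We will create a static method that takes a File argument, which is the root directory or the directory to start from. The method returns a set of the files found in this root directory (and in all subsequent directories):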
```
        public static Set<File> listFiles(File rootDir) {  
```

First, create a HashSet that will hold the file information:

        Set<File> fileSet = new HashSet<File>();

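Once the HashSet is created, check whether the root directory is null or cannot be listed (listFiles() returns null when the path is not a directory or an I/O error occurs). In either case, there is nothing further to process: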
```
        if (rootDir == null || rootDir.listFiles() == null){ 
                     return fileSet; 
           } 
```

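We consider one entry of the root directory at a time and check whether it is a file or a directory. A file is added to our HashSet. For a directory, we call this method again recursively, passing the directory as the argument: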

        for (File fileOrDir : rootDir.listFiles()) { 
                     if (fileOrDir.isFile()){ 
                       fileSet.add(fileOrDir); 
                     } 
                     else{ 
                       fileSet.addAll(listFiles(fileOrDir)); 
                     } 
             }

Finally, we return the HashSet to the caller of this method:

       return fileSet; 
          }

The complete method, together with its class and the driver method to run it, is as follows:

import java.io.File; 
import java.util.HashSet; 
import java.util.Set; 

public class TestRecursiveDirectoryTraversal { 
   public static void main(String[] args){ 
      System.out.println(listFiles(new File(
          "Path for root directory")).size()); 
   } 

   public static Set<File> listFiles(File rootDir) { 
       Set<File> fileSet = new HashSet<File>(); 
       if(rootDir == null || rootDir.listFiles()==null){ 
           return fileSet; 
       } 
       for (File fileOrDir : rootDir.listFiles()) { 
             if (fileOrDir.isFile()){ 
               fileSet.add(fileOrDir); 
             } 
             else{ 
               fileSet.addAll(listFiles(fileOrDir)); 
             } 
     } 

       return fileSet; 
   } 
}

Note

Note the use of a HashSet to store the files. Because a Set in Java holds no duplicate entries, the result cannot contain the same file twice.
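As a point of comparison, the same traversal can be sketched without explicit recursion by using Files.walk from java.nio.file, which has been part of the JDK since Java 8. This is a minimal sketch rather than the recipe's own method: the class name WalkListing and the temporary directory layout built in main are assumptions made purely for illustration.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Set;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class WalkListing {

    // Collects every regular file found below rootDir, at any depth.
    // (Hypothetical helper; the recipe itself uses java.io.File recursion.)
    public static Set<Path> listFiles(Path rootDir) throws IOException {
        // Files.walk holds open directory handles, so close the stream when done
        try (Stream<Path> paths = Files.walk(rootDir)) {
            return paths.filter(Files::isRegularFile)
                        .collect(Collectors.toSet());
        }
    }

    public static void main(String[] args) throws IOException {
        // Build a small throwaway hierarchy: two files, one empty directory
        Path root = Files.createTempDirectory("walk-demo");
        Files.createDirectories(root.resolve("a").resolve("b"));
        Files.createDirectories(root.resolve("empty"));
        Files.write(root.resolve("a").resolve("one.txt"), "1".getBytes());
        Files.write(root.resolve("a").resolve("b").resolve("two.txt"), "2".getBytes());

        System.out.println(listFiles(root).size());
    }
}
```

Unlike the File-based version, this variant throws an exception when the root path does not exist instead of returning an empty set, so the caller decides how to handle a bad path.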

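Retrieving all filenames from hierarchical directories using Apache Commons IO

Listing the filenames in hierarchical directories can be done recursively, as the previous recipe showed. With the Apache Commons IO library, however, it can be done more simply and conveniently, and with less code.

Getting ready

To perform this recipe, we need the following:

In this recipe, we will use a Java library from Apache named Commons IO. Throughout the book, we will use version 2.5. Download the JAR file of your choice from here: https://commons.apache.org/proper/commons-io/download_io.cgi
Include the JAR file in your project as an external JAR in Eclipse.

How to do it...

Create a method that takes the root directory of the hierarchy as input: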
```
        public void listFiles(String rootDir){ 
```

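Create a File object for the root directory:

        File dir = new File(rootDir);

The FileUtils class of the Apache Commons IO library contains a method named listFiles(). Use this method to retrieve all the files and store them in a list variable with the <File> generic type. Passing TrueFileFilter.INSTANCE as both the file filter and the directory filter matches every file and descends into every subdirectory: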
```
        List<File> files = (List<File>) FileUtils.listFiles(dir,   
          TrueFileFilter.INSTANCE, TrueFileFilter.INSTANCE); 
```
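The filenames can be displayed on the standard output as follows. Now that we have the names in a list, we have a way to process the data in these files further: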
```
        for (File file : files) { 
           System.out.println("file: " + file.getAbsolutePath()); 
        } 
```

Close the method:

        }

The method in this recipe, its class, and the driver method to run it are as follows:

import java.io.File; 
import java.util.List; 
import org.apache.commons.io.FileUtils; 
import org.apache.commons.io.filefilter.TrueFileFilter; 

public class FileListing{ 
   public static void main (String[] args){ 
      FileListing fileListing = new FileListing(); 
      fileListing.listFiles("Path for the root directory here"); 
   } 
   public void listFiles(String rootDir){ 
      File dir = new File(rootDir); 

      List<File> files = (List<File>) FileUtils.listFiles(dir,  
        TrueFileFilter.INSTANCE, TrueFileFilter.INSTANCE); 
      for (File file : files) { 
         System.out.println("file: " + file.getAbsolutePath()); 
      } 
   }
}

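Tip

If you want to list only files with certain extensions, the Apache Commons IO library has another method named listFiles for that. Its parameters are different, however: it takes three, namely File directory, String[] extensions, and boolean recursive. Another interesting method in this library is listFilesAndDirs(File directory, IOFileFilter fileFilter, IOFileFilter dirFilter), for anyone interested in listing directories as well as files. Details can be found at https://commons.apache.org/proper/commons-io/javadocs/.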

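Reading contents from text files all at once using Java 8

On many occasions, data scientists have their data in text format. There are many ways to read the contents of a text file, each with its own pros and cons: some consume time and memory, while others are fast and need little memory; some read all of the text at once, while others read the file line by line. The choice depends on the task at hand and on the data scientist's approach to it.

This recipe demonstrates how to read the contents of a text file all at once using Java 8.

How to do it...

First, create a String object to hold the path and name of the text file you are going to read: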
```
        String file = "C:/dummy.txt";  
```
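Using the get() method of the Paths class, we obtain the path of the file we are trying to read. The argument of this method is the String object that holds the filename. The result is passed to another method named lines(), which belongs to the Files class. That method reads all the lines of the file as a Stream, so its result is assigned to a Stream variable. Because our dummy.txt file contains string data, the generic type of the Stream variable is set to String.

The whole reading process needs a try...catch block to handle cases such as attempting to read a file that does not exist or is damaged.

The following code segment displays the contents of our dummy.txt file. The stream variable contains the lines of the text file, so each line is displayed using the variable's forEach() method: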

        try (Stream<String> stream = Files.lines(Paths.get(file))) { 
            stream.forEach(System.out::println); 
        } catch (IOException e) { 
            // file is a String here, so it has no getAbsolutePath() method 
            System.out.println("Error reading " + file); 
        }
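Files.lines() hands the contents over line by line. If the goal is to hold the whole file in a single String, one possible alternative (a sketch, not part of the recipe) is Files.readAllBytes(), also available in Java 8. The class name ReadWholeFile, the helper readAll, and the assumption that the file is UTF-8 encoded are all illustrative choices.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class ReadWholeFile {

    // Reads the entire file into memory and decodes it as UTF-8 (assumed encoding).
    public static String readAll(Path path) throws IOException {
        byte[] bytes = Files.readAllBytes(path);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Falls back to the recipe's sample path when no argument is given
        Path path = Paths.get(args.length > 0 ? args[0] : "C:/dummy.txt");
        try {
            System.out.println(readAll(path));
        } catch (IOException e) {
            System.out.println("Error reading " + path.toAbsolutePath());
        }
    }
}
```

Reading everything at once is convenient for small files; for very large files, the streaming approach shown in the recipe keeps memory use lower.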

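Reading contents from text files all at once using Apache Commons IO

The functionality described in the previous recipe can also be achieved with the Apache Commons IO API.

Getting ready

To perform this recipe, we need the following:

In this recipe, we will use a Java library from Apache named Commons IO. Download the version of your choice from here: https://commons.apache.org/proper/commons-io/download_io.cgi
Include the JAR file in your project as an external JAR in Eclipse.

How to do it...

Say you are trying to read the contents of a file named dummy.txt located on your C:/ drive. First, create a File object to access this file, as follows: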
```
        File file = new File("C:/dummy.txt");  
```
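Next, create a String object to hold the text contents of the file. The method we will use from the Apache Commons IO library is readFileToString, a member of the class named FileUtils. The method is overloaded in several ways, but for now it is enough to know that we pass it two arguments: first the file object, which is the file we are going to read, and then the encoding of the file, which in this example is UTF-8: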
```
        String text = FileUtils.readFileToString(file, "UTF-8"); 
```
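The preceding two lines are enough to read the contents of a text file into a variable. However, you are not just a data scientist; you are a smart data scientist. You therefore need to wrap the call in a few more lines to handle the exception thrown when the method tries to read a file that does not exist or is damaged. The preceding code can be completed by introducing a try...catch block as follows: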
```
       File file = new File("C:/dummy.txt"); 
       try { 
       String text = FileUtils.readFileToString(file, "UTF-8"); 
       } catch (IOException e) { 
       System.out.println("Error reading " + file.getAbsolutePath()); 
       } 
```

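Extracting PDF text using Apache Tika

One of the most difficult file types for parsing and extracting data is PDF. Some PDFs cannot be parsed at all because they are password protected, while others contain scanned text and images. This dynamic file type therefore sometimes becomes a nightmare for data scientists. This recipe demonstrates how to extract text from a PDF file using Apache Tika, provided that the file is not encrypted or password protected and contains text that is not scanned.

Getting ready

To perform this recipe, we need the following:

Download the Apache Tika 1.10 JAR file from http://archive.apache.org/dist/tika/tika-app-1.10.jar and include it in your Eclipse project as an external Java library.
Save any unlocked PDF file as testPDF.pdf on your C: drive.

How to do it...

Create a method named convertPdf(String) that takes the name of the PDF file to be converted as a parameter: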
```
        public void convertPdf(String fileName){ 
```
Create an input stream that will contain the PDF data as a stream of bytes:
```
        InputStream stream = null; 
```
Create a try block as follows:
```
        try{ 
```

Assign the file to the stream you have just created:

        stream = new FileInputStream(fileName);

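Many different parsers are available in the Apache Tika package. If you do not know which parser to use, or if you have other document types to convert besides PDFs, use an AutoDetectParser as follows:

        AutoDetectParser parser = new AutoDetectParser();

Create a handler for the body content of the file. Note the -1 passed to the constructor. By default, the body content handler stops after 100,000 characters; a value of -1 disables this limit: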
```
        BodyContentHandler handler = new BodyContentHandler(-1); 
```

Create a metadata object:

        Metadata metadata = new Metadata();

Call the parse() method of the parser object with all the objects you have just created:

        parser.parse(stream, handler, metadata, new ParseContext());

Use the toString() method of the handler object to get the body text extracted from the file:
```
        System.out.println(handler.toString()); 
```
Close the try block and complement it with a catch block and finally block, and close the method as follows:

```java
        }catch (Exception e) { 
                      e.printStackTrace(); 
                 }finally { 
                     if (stream != null) 
                          try { 
                               stream.close(); 
                          } catch (IOException e) { 
                            System.out.println("Error closing stream"); 
                          } 
                   } 
        } 

```

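The complete method, with a driver method in a class, is as follows. The method you have just created can be called by passing it the path and name of the PDF file you need to convert, which is saved as `testPDF.pdf` on your `C:` drive: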

```java
import java.io.FileInputStream; 
import java.io.IOException; 
import java.io.InputStream; 
import org.apache.tika.metadata.Metadata; 
import org.apache.tika.parser.AutoDetectParser; 
import org.apache.tika.parser.ParseContext; 
import org.apache.tika.sax.BodyContentHandler; 

public class TestTika { 
     public static void main(String args[]) throws Exception { 
          TestTika tika = new TestTika(); 
          tika.convertPdf("C:/testPDF.pdf"); 
     } 
     public void convertPdf(String fileName){ 
          InputStream stream = null; 
          try { 
              stream = new FileInputStream(fileName); 
              AutoDetectParser parser = new AutoDetectParser(); 
              BodyContentHandler handler = new BodyContentHandler(-1); 
              Metadata metadata = new Metadata(); 
              parser.parse(stream, handler, metadata, new 
                  ParseContext()); 
              System.out.println(handler.toString()); 
          }catch (Exception e) { 
              e.printStackTrace(); 
          }finally { 
              if (stream != null) 
                   try { 
                        stream.close(); 
                   } catch (IOException e) { 
                        System.out.println("Error closing stream"); 
                   } 
          } 
     } 
} 

```

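Cleaning ASCII text files using regular expressions

ASCII text files can contain unnecessary characters that are introduced during conversion, such as PDF-to-text or HTML-to-text conversion. These characters are usually regarded as noise because they are one of the major obstacles to data processing. This recipe uses regular expressions to clean some of that noise out of ASCII text data.

How to do it...

Create a method named cleanText(String) that takes the text to be cleaned as a String: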
```
        public String cleanText(String text){ 
```

Add the following lines in your method, return the cleaned text, and close the method. The first line strips off non-ASCII characters. The line next to it replaces continuous white spaces with a single white space. The third line erases all the ASCII control characters. The fourth line strips off the ASCII non-printable characters. The last line removes non-printable characters from Unicode:

        text = text.replaceAll("[^\\p{ASCII}]", ""); 
        text = text.replaceAll("\\s+", " "); 
        text = text.replaceAll("\\p{Cntrl}", ""); 
        text = text.replaceAll("[^\\p{Print}]", ""); 
        text = text.replaceAll("\\p{C}", ""); 

        return text; 
        }

The complete method, with the driver method in a class, is as follows:

public class CleaningData { 
   public static void main(String[] args) throws Exception { 
      CleaningData clean = new CleaningData(); 
      String text = "Your text here you have got from some file"; 
      String cleanedText = clean.cleanText(text); 
      //Process cleanedText 
   } 

   public String cleanText(String text){ 
      text = text.replaceAll("[^\\p{ASCII}]", ""); 
      text = text.replaceAll("\\s+", " "); 
      text = text.replaceAll("\\p{Cntrl}", ""); 
      text = text.replaceAll("[^\\p{Print}]", ""); 
      text = text.replaceAll("\\p{C}", ""); 
        return text; 
   } 
}
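To see what each replacement contributes, the following sketch applies the same five expressions, with the backslashes written out as Java string escapes, to a short sample. The class name CleanDemo and the sample string are assumptions for illustration only. The sample contains an accented letter, two tabs, a BEL control character, and a double space.

```java
public class CleanDemo {

    // Same five steps as the recipe's cleanText method.
    public static String cleanText(String text) {
        text = text.replaceAll("[^\\p{ASCII}]", ""); // drop non-ASCII characters
        text = text.replaceAll("\\s+", " ");         // collapse whitespace runs to one space
        text = text.replaceAll("\\p{Cntrl}", "");    // drop ASCII control characters
        text = text.replaceAll("[^\\p{Print}]", ""); // drop anything not printable ASCII
        text = text.replaceAll("\\p{C}", "");        // drop Unicode "other" (control/format) characters
        return text;
    }

    public static void main(String[] args) {
        // \u00e9 is an accented e, \u0007 is the BEL control character
        String noisy = "Caf\u00e9\t\tdata\u0007  end";
        System.out.println("[" + cleanText(noisy) + "]");
    }
}
```

The order matters: tabs and newlines are themselves control characters, so collapsing whitespace before removing control characters turns them into single spaces instead of deleting them and gluing neighboring words together.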

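Parsing Comma Separated Value (CSV) files using Univocity

Another very common file type that data scientists handle is the Comma Separated Value (CSV) file, in which data is separated by commas. CSV files are very popular because most spreadsheet applications, such as MS Excel, can read them.

In this recipe, we will see how to parse CSV files and handle the data points retrieved from them.

Getting ready

To perform this recipe, we need the following:

Download the Univocity JAR file from http://oss.sonatype.org/content/repositories/releases/com/univocity/univocity-parsers/2.2.1/univocity-parsers-2.2.1.jar. Include the JAR file in your Eclipse project as an external library.

Create a CSV file from the following data using Notepad. The file extension should be .csv. Save the file as C:/testCSV.csv: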

        Year,Make,Model,Description,Price 
        1997,Ford,E350,"ac, abs, moon",3000.00 
        1999,Chevy,"Venture ""Extended Edition""","",4900.00 
        1996,Jeep,Grand Cherokee,"MUST SELL! 
        air, moon roof, loaded",4799.00 
        1999,Chevy,"Venture ""Extended Edition, Very Large""",,5000.00 
        ,,"Venture ""Extended Edition""","",4900.00

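How to do it...

Create a method named parseCSV(String) that takes the filename as a String argument:
```
        public void parseCSV(String fileName){ 
```

Then create a settings object. This object provides many configuration options: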

        CsvParserSettings parserSettings = new CsvParserSettings();

You can configure the parser to automatically detect the line separator sequence in the input:

        parserSettings.setLineSeparatorDetectionEnabled(true);

Create a RowListProcessor that stores each parsed row in a list:
```
        RowListProcessor rowProcessor = new RowListProcessor(); 
```
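You can configure the parser to use a RowProcessor to process the values of each parsed row. You will find more RowProcessors in the com.univocity.parsers.common.processor package, but you can also create your own: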
```
        parserSettings.setRowProcessor(rowProcessor); 
```
If the CSV file you are going to parse contains headers, you can treat the first parsed row as the header of each column in the file:
```
        parserSettings.setHeaderExtractionEnabled(true); 
```

Now, create a parser instance with the given settings:

        CsvParser parser = new CsvParser(parserSettings);

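The parse() method parses the file and delegates each parsed row to the RowProcessor you defined:

        parser.parse(new File(fileName));

If you have parsed the headers, they can be retrieved as follows:

        String[] headers = rowProcessor.getHeaders();

You can then easily process this String array to get the header values.
The row values, on the other hand, are available in a list. The list can be printed with a for loop as follows: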

```java
        List<String[]> rows = rowProcessor.getRows(); 
        for (int i = 0; i < rows.size(); i++){ 
           System.out.println(Arrays.asList(rows.get(i))); 
        } 

```

Finally, close the method:

```java
       } 

```

The complete method, with its class and driver method, can be written as follows:

```java
import java.io.File; 
import java.util.Arrays; 
import java.util.List; 

import com.univocity.parsers.common.processor.RowListProcessor; 
import com.univocity.parsers.csv.CsvParser; 
import com.univocity.parsers.csv.CsvParserSettings; 

public class TestUnivocity { 
      public void parseCSV(String fileName){ 
          CsvParserSettings parserSettings = new CsvParserSettings(); 
          parserSettings.setLineSeparatorDetectionEnabled(true); 
          RowListProcessor rowProcessor = new RowListProcessor(); 
          parserSettings.setRowProcessor(rowProcessor); 
          parserSettings.setHeaderExtractionEnabled(true); 
          CsvParser parser = new CsvParser(parserSettings); 
          parser.parse(new File(fileName)); 

          String[] headers = rowProcessor.getHeaders(); 
          List<String[]> rows = rowProcessor.getRows(); 
          for (int i = 0; i < rows.size(); i++){ 
            System.out.println(Arrays.asList(rows.get(i))); 
          } 
      } 

      public static void main(String[] args){ 
         TestUnivocity test = new TestUnivocity(); 
         test.parseCSV("C:/testCSV.csv"); 
      } 
} 

```

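Note

There are many CSV parsers written in Java. In the comparison linked here, however, Univocity is reported to be the fastest. See the detailed comparison results: https://github.com/uniVocity/csv-parsers-comparison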

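Parsing Tab Separated Value (TSV) files using Univocity

Unlike CSV files, Tab Separated Value (TSV) files contain data that is separated by tab characters. This recipe shows you how to retrieve data points from TSV files.

Getting ready

To perform this recipe, we need the following:

Download the Univocity JAR file from http://oss.sonatype.org/content/repositories/releases/com/univocity/univocity-parsers/2.2.1/univocity-parsers-2.2.1.jar. Include the JAR file in your Eclipse project as an external library.
Create a TSV file from the following data using Notepad. The file extension should be .tsv. Save the file as C:/testTSV.tsv: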

Year    Make    Model   Description Price 
1997    Ford    E350    ac, abs, moon   3000.00 
1999    Chevy   Venture "Extended Edition"      4900.00 
1996    Jeep    Grand Cherokee  MUST SELL!\nair, moon roof, loaded  4799.00 
1999    Chevy   Venture "Extended Edition, Very Large"      5000.00 
        Venture "Extended Edition"      4900.00

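How to do it...

Create a method named parseTsv(String) that takes the filename as a String argument:
```
        public void parseTsv(String fileName){ 
```
Create a settings object. In this recipe, the line separator of the TSV file is the newline character, \n. To set this character as the line separator, modify the settings:
```
        TsvParserSettings settings = new TsvParserSettings(); 
        settings.getFormat().setLineSeparator("\n"); 
```

Using these settings, create a TSV parser: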

        TsvParser parser = new TsvParser(settings);

Parse all the rows of the TSV file in one go, as follows:

        List<String[]> allRows = parser.parseAll(new File(fileName));

Iterate over the list object to print or process the rows, as follows:

        for (int i = 0; i < allRows.size(); i++){ 
                 System.out.println(Arrays.asList(allRows.get(i))); 
               }

Finally, close the method:

        }

The complete method, with the driver method in a class, is as follows:

import java.io.File; 
import java.util.Arrays; 
import java.util.List; 

import com.univocity.parsers.tsv.TsvParser; 
import com.univocity.parsers.tsv.TsvParserSettings; 

public class TestTsv { 
   public void parseTsv(String fileName){ 
       TsvParserSettings settings = new TsvParserSettings(); 
       // Treat the newline character as the line separator 
       settings.getFormat().setLineSeparator("\n"); 
       TsvParser parser = new TsvParser(settings); 
       List<String[]> allRows = parser.parseAll(new File(fileName)); 
       for (int i = 0; i < allRows.size(); i++){ 
         System.out.println(Arrays.asList(allRows.get(i))); 
       } 
   } 

   public static void main(String[] args){ 
       TestTsv test = new TestTsv(); 
       test.parseTsv("C:/testTSV.tsv"); 
   } 
}
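For intuition only, the core of TSV parsing can be sketched with the plain JDK: split the content into lines, then split each line on the tab character. This is not how Univocity is implemented and it is not a replacement for it, since it ignores escape sequences such as \n inside a field. The class name NaiveTsv and the inline sample data are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class NaiveTsv {

    // Splits TSV content into rows of fields; no escape handling (illustrative only).
    public static List<String[]> parse(String content) {
        List<String[]> rows = new ArrayList<>();
        for (String line : content.split("\r?\n")) {
            if (line.isEmpty()) {
                continue; // skip blank lines
            }
            // The -1 limit keeps empty fields, including trailing ones
            rows.add(line.split("\t", -1));
        }
        return rows;
    }

    public static void main(String[] args) {
        String tsv = "Year\tMake\tModel\n"
                   + "1997\tFord\tE350\n"
                   + "\t\tVenture\n";
        for (String[] row : parse(tsv)) {
            System.out.println(Arrays.asList(row));
        }
    }
}
```

The last sample row starts with two empty fields, which mirrors the final row of the recipe's data file, where Year and Make are missing. A dedicated parser remains the better choice for real files because it handles escaping, line-ending detection, and large inputs.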

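Parsing XML files using JDOM

Unlike text data, which is often unstructured, organizing data in XML files is a popular way to prepare, convey, and use data in a structured manner. There are several ways to parse the contents of XML files. In this book, we limit ourselves to an external Java library for XML parsing named JDOM.

Getting ready

To perform this recipe, we need the following:

Download version 2.0.6 of the JDOM JAR file from http://www.jdom.org/downloads/index.html.
In Eclipse, create a project and include the JAR file as an external JAR.
Open Notepad. Create a new file named dummyxml with the .xml extension. The content of the file is as simple as follows: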

       <?xml version="1.0"?> 
       <book> 
          <author> 
             <firstname>Alice</firstname> 
             <lastname>Peterson</lastname> 
          </author> 
          <author> 
             <firstname>John</firstname> 
             <lastname>Doe</lastname> 
          </author> 
       </book>

How to do it...

Create a SAXBuilder object named builder:

       SAXBuilder builder = new SAXBuilder();

Now create a File object that points to the XML file you will parse. If you saved the XML file on the C:/ drive, type the following code segment:
```
        File file = new File("c:/dummyxml.xml"); 
```

In a try block, create a Document object, which represents your XML file:

        try { 
          Document document = (Document) builder.build(file);

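Because XML is tree structured, you need to know the root element of the file to start traversing the tree (in other words, to start parsing systematically). You therefore create a rootNode object of type Element to hold the root element, which in our example is the <book> node: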
```
        Element rootNode = document.getRootElement(); 
```
Then retrieve all the child nodes of the root node that are named author. They come back as a list, so use a list variable to hold them:
```
        List list = rootNode.getChildren("author"); 
```
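Next, use a for loop to iterate over this list and get each entry. Each entry is stored in a variable of type Element named node. This variable has a method named getChildText(), which takes the name of a child element as its argument; it returns the text content of the named child element, or null if there is no such child. This method is convenient because calling getChild().getText() can throw a NullPointerException when the child does not exist: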
```
        for (int i = 0; i < list.size(); i++) { 
           Element node = (Element) list.get(i); 
        System.out.println("First Name : " + 
          node.getChildText("firstname")); 
        System.out.println("Last Name : " +             
          node.getChildText("lastname")); 
        } 
```

Finally, you will be closing the try block; put the following catch blocks to handle exceptions:

        } catch (IOException io) { 
              System.out.println(io.getMessage()); 
        } catch (JDOMException jdomex) { 
              System.out.println(jdomex.getMessage()); 
        }

The complete code for the recipe is as follows:

import java.io.File; 
import java.io.IOException; 
import java.util.List; 

import org.jdom2.Document; 
import org.jdom2.Element; 
import org.jdom2.JDOMException; 
import org.jdom2.input.SAXBuilder; 

public class TestJdom { 

   public static void main(String[] args){ 
      TestJdom test = new TestJdom(); 
      test.parseXml("C:/dummyxml.xml"); 

   } 
   public void parseXml(String fileName){ 
      SAXBuilder builder = new SAXBuilder(); 
      File file = new File(fileName); 
      try { 
         Document document = (Document) builder.build(file); 
         Element rootNode = document.getRootElement(); 
         List list = rootNode.getChildren("author"); 
         for (int i = 0; i < list.size(); i++) { 
            Element node = (Element) list.get(i); 
            System.out.println("First Name : " + 
                node.getChildText("firstname")); 
            System.out.println("Last Name : " + 
                node.getChildText("lastname")); 
         } 
      } catch (IOException io) { 
         System.out.println(io.getMessage()); 
      } catch (JDOMException jdomex) { 
         System.out.println(jdomex.getMessage()); 
      } 
   } 
}

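Note

There are many different types of XML parsers, each with its own benefits:

DOM parsers: These load the complete content of the document into memory and build its full hierarchical tree there.
SAX parsers: These do not load the complete document into memory; instead, they parse the document based on event triggers.
JDOM parser: This parses the document in a way similar to DOM parsers, but more conveniently.
StAX parsers: These handle the document in a way similar to SAX parsers, but more efficiently.
XPath parsers: These parse the document based on expressions and are used extensively with XSLT.
DOM4J parser: This is a Java library for working with XML, XPath, and XSLT that uses the Java Collections Framework and provides support for DOM, SAX, and JAXP.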
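To make the comparison with a DOM parser concrete, here is a minimal sketch that reads the same author data with the DOM parser that ships with the JDK (javax.xml.parsers), rather than with JDOM. The class name DomAuthors and the choice to pass the XML as a String instead of a file are assumptions for illustration.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class DomAuthors {

    // Returns "firstname lastname" for every <author> element in the document.
    public static List<String> authorNames(String xml) throws Exception {
        DocumentBuilder builder =
            DocumentBuilderFactory.newInstance().newDocumentBuilder();
        // The whole document is loaded into an in-memory tree
        Document doc = builder.parse(
            new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));

        NodeList authors = doc.getDocumentElement().getElementsByTagName("author");
        List<String> names = new ArrayList<>();
        for (int i = 0; i < authors.getLength(); i++) {
            Element author = (Element) authors.item(i);
            // Assumes each author has exactly one firstname and one lastname
            String first = author.getElementsByTagName("firstname").item(0).getTextContent();
            String last = author.getElementsByTagName("lastname").item(0).getTextContent();
            names.add(first + " " + last);
        }
        return names;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<?xml version=\"1.0\"?>"
                   + "<book>"
                   + "<author><firstname>Alice</firstname><lastname>Peterson</lastname></author>"
                   + "<author><firstname>John</firstname><lastname>Doe</lastname></author>"
                   + "</book>";
        System.out.println(authorNames(xml));
    }
}
```

Compared with the JDOM version, the standard DOM API needs index-based NodeList access and an explicit cast to Element, which is the kind of ceremony that JDOM's getChildren() and getChildText() remove.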

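Writing JSON files using JSON.simple

Just like XML, JSON is a lightweight, human-readable data interchange format. It stands for JavaScript Object Notation. It is becoming a popular format that modern web applications generate and parse. In this recipe, you will see how to write JSON files.

Getting ready

To perform this recipe, we need the following:

Download json-simple-1.1.1.jar from https://code.google.com/archive/p/json-simple/downloads and include the JAR file in your Eclipse project as an external library.

How to do it...

Create a method named writeJson(String outFileName) that takes the name of the JSON file we will generate as output, containing the JSON information from this recipe.
Create a JSON object and use the object's put() method to populate a few fields. For example, say your fields are books and their authors. The following code creates a JSON object and populates it with the name of a book from the Harry Potter series and the name of its author: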
```
        JSONObject obj = new JSONObject(); 
          obj.put("book", "Harry Potter and the Philosopher's Stone"); 
          obj.put("author", "J. K. Rowling");
```

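Next, say we have three reviewers' comments on this book. They can be put together in a JSON array, which is populated as follows. First, we use the add() method of the array object to add the reviews. When all the reviews have been added to the array, we put the array into the JSON object we created in the previous step: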

JSONArray list = new JSONArray(); 

list.add("There are characters in this book that will remind us of all the people we have met. Everybody knows or knew a spoilt, overweight boy like Dudley or a bossy and interfering (yet kind-hearted) girl like Hermione"); 

list.add("Hogwarts is a truly magical place, not only in the most obvious way but also in all the detail that the author has gone to describe it so vibrantly."); 

list.add("Parents need to know that this thrill-a-minute story, the first in the Harry Potter series, respects kids' intelligence and motivates them to tackle its greater length and complexity, play imaginative games, and try to solve its logic puzzles. "); 

obj.put("messages", list);

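We will write the information in the JSON object to an output file, because this file will be used to demonstrate how to read and parse a JSON file. The following try...catch block writes the information to the JSON file named by the method's outFileName parameter:

        try { 

                 FileWriter file = new FileWriter(outFileName); 
                 file.write(obj.toJSONString()); 
                 file.flush(); 
                 file.close(); 

        } catch (IOException e) { 
                 //your message for exception goes here. 
        }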

The contents of the JSON object can also be displayed on the standard output as follows:
```
        System.out.print(obj); 
```
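Finally, close the method:

        }

The entire class, the method described in this recipe, and the driver method that calls it with the output JSON filename are as follows: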

import java.io.FileWriter; 
import java.io.IOException; 
import org.json.simple.JSONArray; 
import org.json.simple.JSONObject; 

public class JsonWriting { 

   public static void main(String[] args) { 
      JsonWriting jsonWriting = new JsonWriting(); 
      jsonWriting.writeJson("C:/testJSON.json"); 
   } 

   public void writeJson(String outFileName){ 
      JSONObject obj = new JSONObject(); 
      obj.put("book", "Harry Potter and the Philosopher's Stone"); 
      obj.put("author", "J. K. Rowling"); 

      JSONArray list = new JSONArray(); 
      list.add("There are characters in this book that will remind us " 
         + "of all the people we have met. Everybody knows or knew a " 
         + "spoilt, overweight boy like Dudley or a bossy and interfering " 
         + "(yet kind-hearted) girl like Hermione"); 
      list.add("Hogwarts is a truly magical place, not only in the most " 
         + "obvious way but also in all the detail that the author has gone " 
         + "to describe it so vibrantly."); 
      list.add("Parents need to know that this thrill-a-minute story, " 
         + "the first in the Harry Potter series, respects kids' " 
         + "intelligence and motivates them to tackle its greater length " 
         + "and complexity, play imaginative games, and try to solve " 
         + "its logic puzzles. "); 

      obj.put("messages", list); 

      try { 

         FileWriter file = new FileWriter(outFileName); 
         file.write(obj.toJSONString()); 
         file.flush(); 
         file.close(); 

      } catch (IOException e) { 
         e.printStackTrace(); 
      } 

      System.out.print(obj); 
   } 
}

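The output file will contain data like the following. Note that the output shown here has been reformatted for readability; the actual output is one long, flat line of text: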

{ 
"author":"J. K. Rowling", 
"book":"Harry Potter and the Philosopher's Stone", 
"messages":[ 
         "There are characters in this book that will remind us of all the people we have met. Everybody knows or knew a spoilt, overweight boy like Dudley or a bossy and interfering (yet kind-hearted) girl like Hermione", 
         "Hogwarts is a truly magical place, not only in the most obvious way but also in all the detail that the author has gone to describe it so vibrantly.", 
         "Parents need to know that this thrill-a-minute story, the first in the Harry Potter series, respects kids' intelligence and motivates them to tackle its greater length and complexity, play imaginative games, and try to solve its logic puzzles." 
         ] 
}

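Reading JSON files using JSON.simple

In this recipe, we will see how to read or parse a JSON file. As our sample input, we will use the JSON file we created in the previous recipe.

Getting ready

To perform this recipe, we need the following:

Use the previous recipe to create a JSON file containing the book, the author, and the reviewers' comments. This file will be used as the input to parse and read in this recipe.

How to do it...

Because we are going to read or parse a JSON file, we first create a JSON parser: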
```
        JSONParser parser = new JSONParser(); 
```
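Then, inside a try block, we retrieve the values of the book and author fields. To do that, we first use the parser's parse() method to read the input JSON file, whose name is passed in as the method's inFileName parameter. The parse() method returns the contents of the file as an Object, so we need an Object variable to hold them. That Object is then cast to a JSONObject for further processing. Note the cast in the assignment:
```
       try { 

         Object obj = parser.parse(new FileReader(inFileName)); 
         JSONObject jsonObject = (JSONObject) obj; 

         String name = (String) jsonObject.get("book"); 
         System.out.println(name); 

         String author = (String) jsonObject.get("author"); 
         System.out.println(author); 
```

The next field to retrieve from the input JSON file is the reviews field (stored under the messages key), which is an array. Still inside the try block, we iterate over it as follows: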


       JSONArray reviews = (JSONArray) jsonObject.get("messages"); 
         Iterator<String> iterator = reviews.iterator(); 
         while (iterator.hasNext()) { 
            System.out.println(iterator.next()); 
       }

Finally, we create catch blocks to handle the three types of exception that can occur during parsing, and then close the method:


        } catch (FileNotFoundException e) { 
                 //Your exception handling here 
              } catch (IOException e) { 
                 //Your exception handling here 
              } catch (ParseException e) { 
                 //Your exception handling here 
              } 
        }

The entire class, the method described in this recipe, and the driver method to run it are as follows:

import java.io.FileNotFoundException; 
import java.io.FileReader; 
import java.io.IOException; 
import java.util.Iterator; 
import org.json.simple.JSONArray; 
import org.json.simple.JSONObject; 
import org.json.simple.parser.JSONParser; 
import org.json.simple.parser.ParseException; 

public class JsonReading { 
   public static void main(String[] args){ 
      JsonReading jsonReading = new JsonReading(); 
      jsonReading.readJson("C:/testJSON.json"); 
   } 
   public void readJson(String inFileName) { 
      JSONParser parser = new JSONParser(); 
      try { 
         Object obj = parser.parse(new FileReader(inFileName)); 
         JSONObject jsonObject = (JSONObject) obj; 

         String name = (String) jsonObject.get("book"); 
         System.out.println(name); 

         String author = (String) jsonObject.get("author"); 
         System.out.println(author); 

         JSONArray reviews = (JSONArray) jsonObject.get("messages"); 
         Iterator<String> iterator = reviews.iterator(); 
         while (iterator.hasNext()) { 
            System.out.println(iterator.next()); 
         } 
      } catch (FileNotFoundException e) { 
         //Your exception handling here 
      } catch (IOException e) { 
         //Your exception handling here 
      } catch (ParseException e) { 
         //Your exception handling here 
      } 
   } 
}

After the code runs successfully, you will see the contents of the input file on the standard output.

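Extracting web data from a URL using JSoup

A large amount of data can be found on the Web nowadays. This data can be structured, semi-structured, or even unstructured, so very different techniques are needed to extract it. There are many ways to extract web data. One of the easiest and most convenient is an external Java library named JSoup. This recipe uses some of the methods offered by JSoup to extract web data.

Getting ready

To perform this recipe, we need the following:

Go to https://jsoup.org/download and download the jsoup-1.9.2.jar file. Add the JAR file to your Eclipse project as an external library.
If you are a Maven fan, follow the instructions on the download page to include the JAR file in your Eclipse project.

How to do it...

Create a method named extractDataWithJsoup(String href). The parameter is the URL of any web page that you pass when calling the method. We will extract web data from this URL: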
```
        public void extractDataWithJsoup(String href){  
```
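Use the connect() method, passing it the URL we want to connect to (and extract data from). Then chain a few more methods onto it. First comes the timeout() method, which takes its argument in milliseconds. The methods after that set the user agent name for the connection and whether to ignore HTTP errors. The last method in the chain is get(), which returns a Document object. We therefore store the returned object in doc, a variable of the Document class: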
```
        doc = 
          Jsoup.connect(href).timeout(10*1000).userAgent
            ("Mozilla").ignoreHttpErrors(true).get();
```
As this code throws IOException, we will be using a try...catch block as follows:
```
        Document doc = null; 
        try { 
         doc = Jsoup.connect(href).timeout(10*1000).userAgent
           ("Mozilla").ignoreHttpErrors(true).get(); 
           } catch (IOException e) { 
              //Your exception handling here 
        } 
```
Tip

We are not used to expressing time in milliseconds. Therefore, when milliseconds are the time unit in coding, it is a good practice to write 10*1000 to represent 10 seconds. This enhances the readability of the code.
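As a small illustration of the tip above, the sketch below shows three equivalent ways to write "10 seconds in milliseconds" using only the standard library. The class and constant names are hypothetical; the recipe itself only uses the 10*1000 form.

```java
import java.time.Duration;
import java.util.concurrent.TimeUnit;

class TimeoutUnits {
    // Plain multiplication, as used in the recipe.
    static final int BY_MULTIPLICATION = 10 * 1000;
    // The same value with the unit spelled out by TimeUnit.
    static final long BY_TIME_UNIT = TimeUnit.SECONDS.toMillis(10);
    // The same value derived from a Duration.
    static final long BY_DURATION = Duration.ofSeconds(10).toMillis();

    public static void main(String[] args) {
        System.out.println(BY_MULTIPLICATION + " " + BY_TIME_UNIT + " " + BY_DURATION);
    }
}
```

The last two forms return a long, so they need a cast before being passed to an API that expects an int.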
A Document object offers a large number of methods. To extract the title of the page, use the title() method as follows:
```
         if(doc != null){ 
          String title = doc.title(); 
```
To extract only the text of the web page, chain the body() method of the Document object with the text() method, as follows:
```
        String text = doc.body().text();
```
To extract all the hyperlinks in a page, use the select() method of the Document object with the a[href] argument. This returns all the links at once:
```
        Elements links = doc.select("a[href]"); 
```

Perhaps you want to process the links in the web page individually? That is easy too: iterate over all the links to get each one:

        for (Element link : links) { 
            String linkHref = link.attr("href"); 
            String linkText = link.text(); 
            String linkOuterHtml = link.outerHtml(); 
            String linkInnerHtml = link.html();  
            System.out.println(linkHref + "\t" + linkText + "\t"  +  
              linkOuterHtml + "\t" + linkInnerHtml);       
        }

Finally, close the if condition with a brace, and close the method with another brace:

        } 
        }

The complete method, its class, and the driver method are as follows:

import java.io.IOException; 
import org.jsoup.Jsoup; 
import org.jsoup.nodes.Document; 
import org.jsoup.nodes.Element; 
import org.jsoup.select.Elements; 

public class JsoupTesting { 
   public static void main(String[] args){ 
      JsoupTesting test = new JsoupTesting(); 
      test.extractDataWithJsoup("Website address preceded by http://"); 
   } 

   public void extractDataWithJsoup(String href){ 
      Document doc = null; 
      try { 
         doc = Jsoup.connect(href).timeout(10*1000).userAgent
             ("Mozilla").ignoreHttpErrors(true).get(); 
      } catch (IOException e) { 
         //Your exception handling here 
      } 
      if(doc != null){ 
         String title = doc.title(); 
         String text = doc.body().text(); 
         Elements links = doc.select("a[href]"); 
         for (Element link : links) { 
            String linkHref = link.attr("href"); 
            String linkText = link.text(); 
            String linkOuterHtml = link.outerHtml(); 
            String linkInnerHtml = link.html(); 
            System.out.println(linkHref + "\t" + linkText + "\t"  + 
                linkOuterHtml + "\t" + linkInnerHtml); 
         } 
      } 
   } 
}
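The href values printed by this recipe are whatever the page author wrote, so they may be relative paths rather than full URLs. A minimal sketch of resolving them against the page address with java.net.URI follows; the class name, method name, and example.com addresses are assumptions made for illustration. (jsoup also documents an abs: attribute prefix, as in link.attr("abs:href"), for the same purpose.)

```java
import java.net.URI;

class LinkResolver {
    // Resolves an href found on a page against that page's own URL.
    static String resolve(String pageUrl, String href) {
        return URI.create(pageUrl).resolve(href).toString();
    }

    public static void main(String[] args) {
        String page = "http://example.com/a/b.html";
        System.out.println(resolve(page, "d.html"));                 // sibling document
        System.out.println(resolve(page, "../c.html"));              // parent directory
        System.out.println(resolve(page, "http://other.example/x")); // already absolute
    }
}
```

URI.create throws IllegalArgumentException for strings that are not valid URIs (for example, hrefs containing spaces), so real pages may need extra handling.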

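Extracting web data from a website using Selenium Webdriver

Selenium is a Java-based tool for automating software testing and quality assurance. Interestingly, Selenium can also be used to retrieve and use web data automatically. This recipe shows you how.

Getting ready

To perform this recipe, we need the following:

Download selenium-server-standalone-2.53.1.jar and selenium-java-2.53.1.zip from http://selenium-release.storage.googleapis.com/index.html?path=2.53/. From the latter, extract the selenium-java-2.53.1.jar file. Include both JAR files in your Eclipse project as external Java libraries.
Download and install Firefox 47.0.1 from https://ftp.mozilla.org/pub/firefox/releases/47.0.1/, choosing the version appropriate for your operating system.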

Tip

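Because of version conflicts between Selenium and Firefox, once your code runs with a particular pair of versions, turn off the option to automatically download and install updates in Firefox.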

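How to do it...

Create a method named extractDataWithSelenium(String) that takes a String argument, which is the URL we want to extract data from. Many kinds of data can be extracted from a URL, such as titles, headers, and the values in a drop-down box. This recipe concentrates only on extracting the text of the web page: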
```
        public String extractDataWithSelenium(String url){ 
```
Next, create a Firefox web driver using the following code:
```
        WebDriver driver = new FirefoxDriver(); 
```

Use the get() method of the WebDriver object, passing it the URL:

        driver.get(url);

The text of the webpage can be found using xpath, where the value of id is content:
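Find this particular element with the findElement() method. The method returns a WebElement object. Create a WebElement object named webElement to hold the returned value:
```
        WebElement webElement = driver.findElement(By.xpath("//*[@id='content']")); 
```
The WebElement object has a method named getText(). Call this method to retrieve the text of the web page, and put the text in a String variable, as follows: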
```
        String text = (webElement.getText()); 
```

Finally, return the String variable and close the method:

The complete code for the recipe, including the driver main() method, is as follows:

import org.openqa.selenium.By; 
import org.openqa.selenium.WebDriver; 
import org.openqa.selenium.WebElement; 
import org.openqa.selenium.firefox.FirefoxDriver; 

public class TestSelenium { 
   public String extractDataWithSelenium(String url) { 
      WebDriver driver = new FirefoxDriver(); 
      driver.get(url); 
      WebElement webElement = driver.findElement(By.xpath("//*[@id='content']")); 
      String text = webElement.getText(); 
      System.out.println(text); 
      return text;    
   } 

   public static void main(String[] args){ 
      TestSelenium test = new TestSelenium(); 
      String webData = test.extractDataWithSelenium
        ("http://cogenglab.csd.uwo.ca/rushdi.htm"); 
      //process webData 
   } 
}

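Note

Selenium and Firefox have compatibility issues: some Selenium versions do not work with some Firefox versions. The recipe given here works with the versions mentioned in it, but it is not guaranteed to work with other versions of Selenium or Firefox.

Because of these version conflicts, once your code runs with a particular pair of versions, turn off the option to automatically download and install updates in Firefox.

Reading table data from a MySQL database

Data can also be stored in database tables. This recipe demonstrates how to read data from a table in MySQL.

Getting ready

To perform this recipe, we need the following:

Download and install MySQL Community Server from http://dev.mysql.com/downloads/mysql/. The version used in this recipe is 5.7.15.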
Create a database named data_science. In this database, create a table named books that contains data as follows:

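The choice of field types does not matter for this recipe, but the field names must match the ones shown here exactly (the code in this recipe reads the columns id, book_name, author_name, and date_created).
Download the platform-independent MySQL Connector/J JAR file from http://dev.mysql.com/downloads/connector/j/ and add it to your Java project as an external library. The version used in this recipe is 5.1.39.

How to do it...

Create a public void readTable(String user, String password, String server) method that takes the user name, password, and server name of your MySQL database as arguments: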
```
        public void readTable(String user, String password, String   
          server){ 
```

Create a MySQL data source, and set the user name, password, and server name on it:

        MysqlDataSource dataSource = new MysqlDataSource(); 
          dataSource.setUser(user); 
          dataSource.setPassword(password); 
          dataSource.setServerName(server);

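In a try block, create a connection to the database. Using the connection, create a statement that will execute a SELECT query to fetch the information from the table. The results of the query are stored in a result set:

        try{ 
          Connection conn = dataSource.getConnection(); 
          Statement stmt = conn.createStatement(); 
          ResultSet rs = stmt.executeQuery(
            "SELECT * FROM data_science.books");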

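Now iterate over the result set and retrieve each column by its name. Note that the getter methods are typed, so you need to know the field types before using them. For example, because we know that the ID field is an integer, we can use the getInt() method:

        while (rs.next()){ 
          int id = rs.getInt("id"); 
          String book = rs.getString("book_name"); 
          String author = rs.getString("author_name"); 
          Date dateCreated = rs.getDate("date_created"); 
          System.out.format("%s, %s, %s, %s%n", id, book, author, 
            dateCreated); 
        }

After the iteration, close the result set, the statement, and the connection: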

        rs.close(); 
          stmt.close(); 
          conn.close();

Catch any exceptions that may be thrown while reading data from the table, and close the method:

        }catch (Exception e){ 
           //Your exception handling mechanism goes here. 
          } 
        }

The complete method, its class, and the driver method that executes it are as follows:

import java.sql.*; 
import com.mysql.jdbc.jdbc2.optional.MysqlDataSource; 
public class TestDB{ 
     public static void main(String[] args){ 
          TestDB test = new TestDB(); 
          test.readTable("your user name", "your password", 
              "your MySQL server name"); 
     } 
     public void readTable(String user, String password, String server) 
         { 
          MysqlDataSource dataSource = new MysqlDataSource(); 
          dataSource.setUser(user); 
          dataSource.setPassword(password); 
          dataSource.setServerName(server); 
          try{ 
               Connection conn = dataSource.getConnection(); 
               Statement stmt = conn.createStatement(); 
               ResultSet rs = stmt.executeQuery(
                   "SELECT * FROM data_science.books"); 
               while (rs.next()){ 
                    int id = rs.getInt("id"); 
                    String book = rs.getString("book_name"); 
                    String author = rs.getString("author_name"); 
                    Date dateCreated = rs.getDate("date_created"); 
                    System.out.format("%s, %s, %s, %s%n", id, book, 
                        author, dateCreated); 
               } 
               rs.close(); 
               stmt.close(); 
               conn.close(); 
          }catch (Exception e){ 
               //Your exception handling mechanism goes here. 
          } 
     } 
}

This code displays the data in the table you created.
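The recipe closes the result set, statement, and connection by hand inside the try block, so an exception thrown before those calls leaves them open. Connection, Statement, and ResultSet all implement AutoCloseable, which means a try-with-resources statement can close them automatically. The sketch below demonstrates only the closing behaviour, using a hypothetical Resource stand-in instead of a live database.

```java
import java.util.ArrayList;
import java.util.List;

class CloseOrderDemo {
    // Stand-in for Connection/Statement/ResultSet: records when it is closed.
    static class Resource implements AutoCloseable {
        private final String name;
        private final List<String> log;
        Resource(String name, List<String> log) { this.name = name; this.log = log; }
        @Override public void close() { log.add(name); }
    }

    // Opens three resources, optionally fails, and returns the order of events.
    static List<String> run(boolean fail) {
        List<String> log = new ArrayList<>();
        try (Resource conn = new Resource("conn", log);
             Resource stmt = new Resource("stmt", log);
             Resource rs = new Resource("rs", log)) {
            if (fail) {
                throw new IllegalStateException("simulated query failure");
            }
        } catch (IllegalStateException e) {
            log.add("caught"); // runs after all resources are already closed
        }
        return log;
    }

    public static void main(String[] args) {
        System.out.println(run(false)); // [rs, stmt, conn]
        System.out.println(run(true));  // [rs, stmt, conn, caught]
    }
}
```

Resources are closed in the reverse of the order in which they were opened, whether or not the body throws. In the recipe, the same shape would wrap dataSource.getConnection(), conn.createStatement(), and stmt.executeQuery(...).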

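2. Indexing and Searching Data

In this chapter, we will cover the following recipes:

Indexing data with Apache Lucene
Searching indexed data with Apache Lucene

Introduction

In this chapter, you will learn two very important recipes. The first demonstrates how to index your data; the second, which is closely tied to the first, demonstrates how to search the indexed data.

For both indexing and searching, we will use Apache Lucene, a free, open source Java software library used mainly for information retrieval. It is supported by the Apache Software Foundation and released under the Apache Software License.

Many modern search platforms, such as Apache Solr and ElasticSearch, and crawling platforms, such as Apache Nutch, use Apache Lucene in the backend for indexing and searching data. Any data scientist learning those platforms will therefore benefit from the two basic recipes in this chapter.

Indexing data with Apache Lucene

In this recipe, we demonstrate how to index a large amount of data with Apache Lucene. Indexing is the first step toward searching data quickly. Lucene uses an inverted full-text index: it takes all the documents, splits them into words or tokens, and then builds an index entry for each token, so that it knows in advance which documents to look in when a term is searched.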
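To make the idea of an inverted index concrete, here is a toy version in plain Java. It is only a sketch of the concept and not how Lucene is implemented; the class name, method names, and sample sentences are all hypothetical.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class TinyInvertedIndex {
    // Maps each token to the IDs of the documents that contain it.
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    // Splits the text into lowercase tokens and records the document ID for each.
    void add(int docId, String text) {
        for (String token : text.toLowerCase().split("\\W+")) {
            if (!token.isEmpty()) {
                postings.computeIfAbsent(token, k -> new TreeSet<>()).add(docId);
            }
        }
    }

    // Returns the IDs of the documents containing the term (empty if none).
    Set<Integer> lookup(String term) {
        return postings.getOrDefault(term.toLowerCase(), Collections.emptySet());
    }

    // Returns the IDs of the documents containing both terms.
    Set<Integer> lookupAll(String first, String second) {
        Set<Integer> result = new TreeSet<>(lookup(first));
        result.retainAll(lookup(second));
        return result;
    }

    public static void main(String[] args) {
        TinyInvertedIndex index = new TinyInvertedIndex();
        index.add(0, "To be, or not to be");
        index.add(1, "Brevity is the soul of wit");
        index.add(2, "The lady doth protest too much");
        System.out.println(index.lookup("the"));            // [1, 2]
        System.out.println(index.lookupAll("the", "soul")); // [1]
    }
}
```

A lookup touches only the entry for the searched token, which is why a search does not need to scan every document.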

Getting ready

The following are the steps to be implemented:

To download Apache Lucene, go to http://lucene.apache.org/core/downloads.html, and click on the Download button. At the time of writing, the latest version of Lucene was 6.4.1. Once you click on the Download button, it will take you to the mirror websites that host the distribution:
Choose any appropriate mirror for downloading. Once you click a mirror website, it will take you to a directory of distribution. Download the lucene-6.4.1.zip file onto your system:
Once you download it, unzip the distribution. You will see a nicely organized folder distribution, as follows:
Open Eclipse, and create a project named LuceneTutorial. To do that, open Eclipse and go to File. Then go to New... and Java Project. Take the name of the project and click on Finish:
Now you will be inserting JAR files necessary for this recipe as external libraries into your project. Right-click on your project name in the Package Explorer. Select Build Path and then Configure Build Path... This will open properties for your project:
Click on the Add External Jars button, and then add the following JAR files from Lucene 6.4.1 distributions:
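- lucene-core-6.4.1.jar, which can be found in lucene-6.4.1\core of your unzipped Lucene distribution
- lucene-queryparser-6.4.1.jar, which can be found in lucene-6.4.1\queryparser of your unzipped Lucene distribution
- lucene-analyzers-common-6.4.1.jar, which can be found in lucene-6.4.1\analysis\common of your unzipped Lucene distribution
After adding the JAR files, click on OK:
For indexing, you will use the works of William Shakespeare in text format. Open a browser and go to http://norvig.com/ngrams/. This opens a page named Natural Language Corpus Data: Beautiful Data. Among the files in the Download section, you will find a .txt file named shakespeare. Download this file anywhere on your system.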
Unzip the files and you will see that the distribution contains three folders, comedies, historical, and tragedies:
Create a folder in your project directory. Right-click on your project in Eclipse and go to New, and then click Folder. As the folder name, type in input and click on Finish:
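Copy shakespeare.txt from step 8 into the folder you created in step 9.
Following the instructions in step 9, create another folder named index. At this stage, your project folder will look as follows:

Now you are ready for coding.

How to do it...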

Create a package in your project named org.apache.lucene.demo, and create a Java file in the package named IndexFiles.java:
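In this Java file, you will create a class named IndexFiles:
```
         public class IndexFiles { 
```
The first method you will write is called indexDocs. It indexes any given file using the given index writer. If a directory is supplied as the argument, the method recurses over the files and directories found under it. The method indexes one document per input file:

Tip

This approach is relatively slow; for better performance, consider putting multiple documents into each input file.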
```
        static void indexDocs(final IndexWriter writer, Path path) 
          throws IOException { 
```
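- writer is the index writer that writes the index in which the given file or directory information will be stored
- path is the file to index, or the directory containing the files for which the index will be created

If a directory is provided, it is traversed recursively: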

        if (Files.isDirectory(path)) { 
          Files.walkFileTree(path, new SimpleFileVisitor<Path>() {

Next, override a method named visitFile, which is invoked for each file visited and receives the file's path and basic file attributes:

        @Override 
          public FileVisitResult visitFile(Path file, 
            BasicFileAttributes attrs) throws IOException {

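Next, call a static method named indexDoc, which you will create later. We have deliberately left the catch block empty, leaving it to you to decide what to do if a file cannot be indexed: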
```
        try { 
            indexDoc(writer, file, 
               attrs.lastModifiedTime().toMillis()); 
          } catch (IOException ignore) { 

       } 
```

Return from the visitFile method:

        return FileVisitResult.CONTINUE; 
       }

Close the blocks:
```
    } 
         ); 
    } 
```

In the else block, call the indexDoc method. Remember that in the else block you are dealing with a file, not a directory:

        else { 
         indexDoc(writer, path,  
           Files.getLastModifiedTime(path).toMillis()); 
       }

Close the indexDocs() method:

```java
       } 

```
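The traversal pattern used by indexDocs can be tried on its own, without Lucene. The following sketch collects the regular files under a path with Files.walkFileTree and a SimpleFileVisitor; the class name, the listFiles helper, and the temporary directory layout are assumptions made for this illustration.

```java
import java.io.IOException;
import java.nio.file.FileVisitResult;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.SimpleFileVisitor;
import java.nio.file.attribute.BasicFileAttributes;
import java.util.ArrayList;
import java.util.List;

class FileWalkSketch {
    // Returns every file under root, or root itself if it is not a directory.
    static List<Path> listFiles(Path root) throws IOException {
        List<Path> found = new ArrayList<>();
        if (Files.isDirectory(root)) {
            Files.walkFileTree(root, new SimpleFileVisitor<Path>() {
                @Override
                public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) {
                    found.add(file); // called once per file, never for directories
                    return FileVisitResult.CONTINUE;
                }
            });
        } else {
            found.add(root);
        }
        return found;
    }

    public static void main(String[] args) throws IOException {
        // Build a small throwaway tree: a.txt and sub/b.txt.
        Path root = Files.createTempDirectory("walk-demo");
        Files.createDirectories(root.resolve("sub"));
        Files.write(root.resolve("a.txt"), "one".getBytes());
        Files.write(root.resolve("sub").resolve("b.txt"), "two".getBytes());
        for (Path p : listFiles(root)) {
            System.out.println(p);
        }
    }
}
```

In the recipe, the body of visitFile calls indexDoc instead of adding the path to a list.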

Now create a method to handle the indexing of a single document:

```java
        static void indexDoc(IndexWriter writer, Path file, long 
          lastModified) throws IOException { 

```

First, open a try block and create a new, empty document:

```java
        try (InputStream stream = Files.newInputStream(file)) { 
          Document doc = new Document(); 

```

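Next, add the path of the file as a field named "path". This field is searchable (indexed), but note that it is not tokenized and that no term frequency or positional information is indexed for it: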

```java
        Field pathField = new StringField("path", file.toString(),  
          Field.Store.YES); 
        doc.add(pathField); 

```

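Add the last modified date of the file as a field named "modified":

```java
        doc.add(new LongPoint("modified", lastModified)); 

```

Add the contents of the file to a field named "contents". Because you supply a reader, the text of the file is tokenized and indexed, but not stored: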

```java
        doc.add(new TextField("contents", new BufferedReader(new 
          InputStreamReader(stream, StandardCharsets.UTF_8)))); 

```

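### Note

If the file is not encoded in `UTF-8`, searching for special characters will fail.

Index the file. If a new index is being created, simply add the document: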

```java
        if (writer.getConfig().getOpenMode() == OpenMode.CREATE) { 
            System.out.println("adding " + file); 
            writer.addDocument(doc); 
        } 

```

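The document may already have been indexed; the else block handles that case. Here you use updateDocument instead, which replaces any existing document that matches the exact path: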

```java
        else { 
            System.out.println("updating " + file); 
            writer.updateDocument(new Term("path", file.toString()),  
              doc); 
       } 

```

Close the try block and the method:

```java
        }
        }
```

Now create the main method for this class.

```java
        public static void main(String[] args) {
```

You will be providing three options from the console when you run your program:

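*   The first option is index, and its argument is the folder that will contain the index
*   The second option is docs, and its argument is the folder that contains your text files
*   The last option is update, which indicates whether you want to create a new index or update an old one

To hold the values of these three options, create and initialize three variables: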

```java
        String indexPath = "index"; 
        String docsPath = null; 
        boolean create = true; 

```

Set the values of the three options:

```java
        for(int i=0;i<args.length;i++) { 
         if ("-index".equals(args[i])) { 
            indexPath = args[i+1]; 
            i++; 
         } else if ("-docs".equals(args[i])) { 
            docsPath = args[i+1]; 
            i++; 
         } else if ("-update".equals(args[i])) { 
            create = false; 
         } 
       } 

```

```java
        final Path docDir = Paths.get(docsPath); 

```
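The parsing loop above reads args[i+1] without checking that it exists, and docsPath stays null when -docs is omitted, in which case Paths.get(docsPath) throws a NullPointerException. A minimal sketch of a more defensive parser follows; ArgsSketch, its parse method, and the map keys are hypothetical and not part of the recipe.

```java
import java.util.HashMap;
import java.util.Map;

class ArgsSketch {
    // Parses "-docs <dir> [-index <dir>] [-update]" into a small option map.
    static Map<String, String> parse(String[] args) {
        Map<String, String> opts = new HashMap<>();
        opts.put("index", "index"); // same default as the recipe
        opts.put("create", "true");
        for (int i = 0; i < args.length; i++) {
            switch (args[i]) {
                case "-index":
                case "-docs":
                    if (i + 1 >= args.length) {
                        throw new IllegalArgumentException("Missing value after " + args[i]);
                    }
                    String key = args[i].substring(1); // drop the leading '-'
                    opts.put(key, args[++i]);
                    break;
                case "-update":
                    opts.put("create", "false");
                    break;
                default:
                    throw new IllegalArgumentException("Unknown option " + args[i]);
            }
        }
        if (!opts.containsKey("docs")) {
            throw new IllegalArgumentException("Usage: -docs <dir> [-index <dir>] [-update]");
        }
        return opts;
    }

    public static void main(String[] args) {
        System.out.println(parse(new String[] {"-docs", "input", "-index", "index"}));
    }
}
```

Unlike the recipe, which silently ignores unrecognized arguments, this sketch rejects them; either behaviour is a reasonable design choice.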

Now you will start indexing the files in the directory. First, set a timer, because you will be timing how long the indexing takes:

```java
        Date start = new Date(); 

```

For indexing, open a directory, create an analyzer (in this case a basic, standard analyzer), and create an index writer configuration:

```java
       try { 

         Directory dir = FSDirectory.open(Paths.get(indexPath)); 
         Analyzer analyzer = new StandardAnalyzer(); 
         IndexWriterConfig iwc = new IndexWriterConfig(analyzer); 

```

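After configuring the index writer, set the open mode of the index according to whether an index is being created or updated. If you choose to create a new index, the open mode is set to CREATE; otherwise, it is CREATE_OR_APPEND: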

```java
         if (create) { 
            iwc.setOpenMode(OpenMode.CREATE); 
         } else { 
            iwc.setOpenMode(OpenMode.CREATE_OR_APPEND); 
         } 

```

Create the index writer and index the documents:

```java
        IndexWriter writer = new IndexWriter(dir, iwc); 
        indexDocs(writer, docDir);  

```

Close the writer:

```java
       writer.close(); 

```

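At this point, you are almost done with the coding. Just finish tracking the indexing time:

```java
        Date end = new Date(); 
        System.out.println(end.getTime() - start.getTime() 
          + " total milliseconds"); 

```

Close the try block. We have deliberately left the catch block empty so that you can decide what to do if an exception occurs during indexing: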

```java
        } catch (IOException e) { 
        } 

```

Close the main method and the class:

```java
       } 
       } 

```

Right-click on your project in Eclipse, select Run As, and click on **Run Configurations...**:
Go to the Arguments tab in the Run Configurations window. In the Program Arguments option, put -docs input\ -index index\. Click on Run:
The output of the code will look as follows:

How it works...

The complete code for the recipe is as follows:

package org.apache.lucene.demo; 

import org.apache.lucene.analysis.Analyzer; 
import org.apache.lucene.analysis.standard.StandardAnalyzer; 
import org.apache.lucene.document.Document; 
import org.apache.lucene.document.Field; 
import org.apache.lucene.document.LongPoint; 
import org.apache.lucene.document.StringField; 
import org.apache.lucene.document.TextField; 
import org.apache.lucene.index.IndexWriter; 
import org.apache.lucene.index.IndexWriterConfig.OpenMode; 
import org.apache.lucene.index.IndexWriterConfig; 
import org.apache.lucene.index.Term; 
import org.apache.lucene.store.Directory; 
import org.apache.lucene.store.FSDirectory; 
import java.io.BufferedReader; 
import java.io.IOException; 
import java.io.InputStream; 
import java.io.InputStreamReader; 
import java.nio.charset.StandardCharsets; 
import java.nio.file.FileVisitResult; 
import java.nio.file.Files; 
import java.nio.file.Path; 
import java.nio.file.Paths; 
import java.nio.file.SimpleFileVisitor; 
import java.nio.file.attribute.BasicFileAttributes; 
import java.util.Date; 

public class IndexFiles { 
   static void indexDocs(final IndexWriter writer, Path path) throws 
     IOException { 
      if (Files.isDirectory(path)) { 
         Files.walkFileTree(path, new SimpleFileVisitor<Path>() { 
            @Override 
            public FileVisitResult visitFile(Path file, 
              BasicFileAttributes attrs) throws IOException { 
               try { 
                  indexDoc(writer, file, 
                    attrs.lastModifiedTime().toMillis()); 
               } catch (IOException ignore) { 
               } 
               return FileVisitResult.CONTINUE; 
            } 
         } 
               ); 
      } else { 
         indexDoc(writer, path, 
            Files.getLastModifiedTime(path).toMillis()); 
      } 
   } 

   static void indexDoc(IndexWriter writer, Path file, long 
      lastModified) throws IOException { 
      try (InputStream stream = Files.newInputStream(file)) { 
         Document doc = new Document(); 
         Field pathField = new StringField("path", file.toString(), 
           Field.Store.YES); 
         doc.add(pathField); 
         doc.add(new LongPoint("modified", lastModified)); 
         doc.add(new TextField("contents", new BufferedReader(new 
            InputStreamReader(stream, StandardCharsets.UTF_8)))); 

         if (writer.getConfig().getOpenMode() == OpenMode.CREATE) { 
            System.out.println("adding " + file); 
            writer.addDocument(doc); 
         } else { 
            System.out.println("updating " + file); 
            writer.updateDocument(new Term("path", file.toString()), 
              doc); 
         } 
      } 
   } 
   public static void main(String[] args) { 
      String indexPath = "index"; 
      String docsPath = null; 
      boolean create = true; 
      for(int i=0;i<args.length;i++) { 
         if ("-index".equals(args[i])) { 
            indexPath = args[i+1]; 
            i++; 
         } else if ("-docs".equals(args[i])) { 
            docsPath = args[i+1]; 
            i++; 
         } else if ("-update".equals(args[i])) { 
            create = false; 
         } 
      } 

      final Path docDir = Paths.get(docsPath); 

      Date start = new Date(); 
      try { 
         System.out.println("Indexing to directory '" + indexPath + 
           "'..."); 

         Directory dir = FSDirectory.open(Paths.get(indexPath)); 
         Analyzer analyzer = new StandardAnalyzer(); 
         IndexWriterConfig iwc = new IndexWriterConfig(analyzer); 

         if (create) { 
            iwc.setOpenMode(OpenMode.CREATE); 
         } else { 
            iwc.setOpenMode(OpenMode.CREATE_OR_APPEND); 
         } 
         IndexWriter writer = new IndexWriter(dir, iwc); 
         indexDocs(writer, docDir); 

         writer.close(); 

         Date end = new Date(); 
         System.out.println(end.getTime() - start.getTime() 
           + " total milliseconds"); 

      } catch (IOException e) { 
      } 
   } 
}

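Searching indexed data with Apache Lucene

Now that you have indexed your data, this recipe uses Apache Lucene to search it. The search code depends on the index you created in the previous recipe, so it will run successfully only if you followed the instructions there.

Getting ready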

Complete the previous recipe. After completing the previous recipe, go to the index directory in your project that you created in step 11 of that recipe. Make sure that you see some indexing files there:
Create a Java file named SearchFiles in the org.apache.lucene.demo package you created in the previous recipe:
You are now ready to type some code in the SearchFiles.java file.

How to do it...

Open SearchFiles.java in the Eclipse editor and create the following class:
```
        public class SearchFiles { 
```
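You need two constant String variables. The first holds the path of the index you created in the previous recipe. The second holds the name of the field you want to search; in our case, we will search the contents field of the index:
```
        public static final String INDEX_DIRECTORY = "index"; 
        public static final String FIELD_CONTENTS = "contents"; 
```

Start creating your main method: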

        public static void main(String[] args) throws Exception {

Create an IndexReader by opening the index in the index directory:

        IndexReader reader = 
          DirectoryReader.open(FSDirectory.open
            (Paths.get(INDEX_DIRECTORY)));

The next step is to create a searcher to search the index:

         IndexSearcher indexSearcher = new IndexSearcher(reader);

As your analyzer, create a standard analyzer:

         Analyzer analyzer = new StandardAnalyzer();

Create a query parser by giving the QueryParser constructor two arguments: the field you will search and the analyzer you have just created:
```
        QueryParser queryParser = new QueryParser(FIELD_CONTENTS,  
          analyzer); 
```
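In this recipe, you will use a predefined search term. With this search, you are trying to find documents that contain both "over-full" and "persuasion":
```
        String searchString = "over-full AND persuasion"; 
```

Create a query from the search string: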

        Query query = queryParser.parse(searchString);

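The searcher looks in the index to see whether it can find the search terms. You also specify the maximum number of results to return, which in our case is 5: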

```java
        TopDocs results = indexSearcher.search(query, 5); 

```

Create an array to hold the hits:

```java
        ScoreDoc[] hits = results.scoreDocs; 

```

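Note that only one document, shakespeare.txt, was used during indexing, so in our case the length of this array is at most 1.
You may also want to know the number of matching documents found: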

```java
        int numTotalHits = results.totalHits; 
        System.out.println(numTotalHits + " total matching documents"); 

```

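Finally, iterate over the hits. For each hit, you get the ID of the document in which the match was found. With the document ID, you can retrieve the document and print its path together with the score Lucene computed for your search terms: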

```java
        for(int i=0;i<hits.length;++i) { 
         int docId = hits[i].doc; 
         Document d = indexSearcher.doc(docId); 
         System.out.println((i + 1) + ". " + d.get("path") + " score=" 
           + hits[i].score); 
        } 

```

Close the method and the class:

```java
        } 
        } 

```

If you run the code, you will see the following output:
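Open the shakespeare.txt file in the input folder of your project and search it manually; you will find that both "over-full" and "persuasion" occur in the document.
Change the searchString from step 8 as follows: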

```java
        String searchString = "shakespeare"; 

```

Keeping the rest of the code as it is, if you run the code, you will see the following output:
Open the shakespeare.txt file again and check whether the word shakespeare occurs in it. You will find nothing.

The complete code for this recipe is as follows:

package org.apache.lucene.demo; 
import java.nio.file.Paths; 
import org.apache.lucene.analysis.Analyzer; 
import org.apache.lucene.analysis.standard.StandardAnalyzer; 
import org.apache.lucene.document.Document; 
import org.apache.lucene.index.DirectoryReader; 
import org.apache.lucene.index.IndexReader; 
import org.apache.lucene.queryparser.classic.QueryParser; 
import org.apache.lucene.search.IndexSearcher; 
import org.apache.lucene.search.Query; 
import org.apache.lucene.search.ScoreDoc; 
import org.apache.lucene.search.TopDocs; 
import org.apache.lucene.store.FSDirectory; 

public class SearchFiles { 
   public static final String INDEX_DIRECTORY = "index"; 
   public static final String FIELD_CONTENTS = "contents"; 

   public static void main(String[] args) throws Exception { 
      IndexReader reader = DirectoryReader.open(FSDirectory.open
        (Paths.get(INDEX_DIRECTORY))); 
      IndexSearcher indexSearcher = new IndexSearcher(reader); 

      Analyzer analyzer = new StandardAnalyzer(); 
      QueryParser queryParser = new QueryParser(FIELD_CONTENTS, 
         analyzer); 
      String searchString = "shakespeare"; 
      Query query = queryParser.parse(searchString); 

      TopDocs results = indexSearcher.search(query, 5); 
      ScoreDoc[] hits = results.scoreDocs; 

      int numTotalHits = results.totalHits; 
      System.out.println(numTotalHits + " total matching documents"); 

      for(int i=0;i<hits.length;++i) { 
         int docId = hits[i].doc; 
         Document d = indexSearcher.doc(docId); 
         System.out.println((i + 1) + ". " + d.get("path") + " score=" 
           + hits[i].score); 
      } 
   } 
}

Note

Visit https://lucene.apache.org/core/2_9_4/queryparsersyntax.html for the query syntax supported by Apache Lucene.

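3. Analyzing Data Statistically

In this chapter, we will cover the following recipes:

Generating descriptive statistics
Generating summary statistics
Generating summary statistics from multiple distributions
Computing frequency distribution
Counting word frequency in a string
Counting word frequency in a string using Java 8
Computing simple regression
Computing ordinary least squares regression
Computing generalized least squares regression
Calculating covariance of two sets of data points
Calculating Pearson's correlation of two sets of data points
Conducting a paired t-test
Conducting a Chi-square test
Conducting a one-way ANOVA test
Conducting a Kolmogorov-Smirnov test

Introduction

Statistical analysis is one of the regular activities of a data scientist. Such analysis includes, but is not limited to, descriptive statistics, frequency distributions, simple and multiple regression, correlation and covariance, and statistical significance in data distributions. Fortunately, Java has many libraries that can perform powerful statistical analysis of data in just a few lines of code. This chapter covers 15 recipes showing how a data scientist can carry out such analysis using Java.

Note that this chapter focuses only on basic statistical analysis of data in Java, although Java can also be used for linear algebra, numerical analysis, special functions, complex numbers, geometry, curve fitting, and differential equations.

To perform the recipes in this chapter, we need the following:

Apache Commons Math 3.6.1. Download the JAR file from http://commons.apache.org/proper/commons-math/download_math.cgi.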
If you want to use older versions, the older versions are in the archive at http://archive.apache.org/dist/commons/math/binaries/, as shown in the following screenshot:
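Include it in your Eclipse project as an external JAR file:

The stat package of Apache Commons Math 3.6.1 is rich and well optimized. It can generate the following descriptive statistics:

Arithmetic and geometric means
Variance and standard deviation
Sum, product, log sum, and sum of squared values
Minimum, maximum, median, and percentiles
Skewness and kurtosis
First, second, third, and fourth moments

Furthermore, according to the user guide at http://commons.apache.org/proper/commons-math/userguide/stat.html, these methods are optimized and use as little memory as possible.

With the exception of percentiles and the median, all of these statistics can be computed without keeping the full list of input data values in memory.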
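To see how a statistic can be computed without storing the input values, consider the following sketch of a running mean and variance based on Welford's online algorithm. It illustrates the general idea only and is not claimed to be the implementation used by Commons Math; the RunningStats class and its methods are hypothetical.

```java
class RunningStats {
    private long n;      // number of values seen so far
    private double mean; // running mean
    private double m2;   // running sum of squared deviations from the mean

    // Folds one value into the statistics; the value itself is not kept.
    void add(double x) {
        n++;
        double delta = x - mean;
        mean += delta / n;
        m2 += delta * (x - mean);
    }

    long count() { return n; }

    double mean() { return mean; }

    // Sample variance (divides by n - 1); undefined for fewer than two values.
    double sampleVariance() { return n > 1 ? m2 / (n - 1) : Double.NaN; }

    double sampleStdDev() { return Math.sqrt(sampleVariance()); }

    public static void main(String[] args) {
        RunningStats stats = new RunningStats();
        for (double v : new double[] {2, 4, 4, 4, 5, 5, 7, 9}) {
            stats.add(v);
        }
        System.out.println(stats.mean() + "\t" + stats.sampleStdDev());
    }
}
```

Only three numbers are kept regardless of how many values are added. A median, by contrast, depends on the order of all the values, which is why it needs the data in memory.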

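Generating descriptive statistics

Descriptive statistics summarize a sample and are generally not developed on the basis of probability theory. Inferential statistics, in contrast, are mainly used to draw conclusions about a population from a representative sample of it. In this recipe, we will see how to use Java to generate descriptive statistics from a small sample.

To keep the scope of this recipe small, we will focus on only a subset of the descriptive statistics listed above.

How to do it...

Create a method that takes a double array as its argument. The array contains the values whose descriptive statistics you want to compute: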
```
        public void getDescStats(double[] values){ 
```

Create an object of the DescriptiveStatistics type:

        DescriptiveStatistics stats = new DescriptiveStatistics();

Iterate over all the values of the double array and add them to the DescriptiveStatistics object:

        for( int i = 0; i < values.length; i++) { 
            stats.addValue(values[i]); 
        }

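The DescriptiveStatistics class of the Apache Commons Math library has methods for computing the mean, standard deviation, and median of a set of values. Call these methods to obtain the descriptive statistics of the values. Finally, close the method: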

        double mean = stats.getMean(); 
        double std = stats.getStandardDeviation(); 
        double median = stats.getPercentile(50); 
        System.out.println(mean + "\t" + std + "\t" + median); 
        }

The complete code, including the driver method, is as follows:

import org.apache.commons.math3.stat.descriptive.DescriptiveStatistics; 

public class DescriptiveStats { 
   public static void main(String[] args){ 
      double[] values = {32, 39, 14, 98, 45, 44, 45, 34, 89, 67, 0, 
          15, 0, 56, 88}; 
      DescriptiveStats descStatTest = new DescriptiveStats(); 
      descStatTest.getDescStats(values); 

   } 
   public void getDescStats(double[] values){ 
      DescriptiveStatistics stats = new DescriptiveStatistics(); 
      for( int i = 0; i < values.length; i++) { 
              stats.addValue(values[i]); 
      } 
      double mean = stats.getMean(); 
      double std = stats.getStandardDeviation(); 
      double median = stats.getPercentile(50); 
      System.out.println(mean + "\t" + std + "\t" + median); 
   } 
}

Tip

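To compute statistics in a thread-safe manner, you can create a SynchronizedDescriptiveStatistics instance as follows: DescriptiveStatistics stats = new SynchronizedDescriptiveStatistics();

Generating summary statistics

We can generate summary statistics for data using the SummaryStatistics class. It is similar to the DescriptiveStatistics class used in the previous recipe; the main difference is that, unlike DescriptiveStatistics, the SummaryStatistics class does not store the data in memory.

How to do it...

As in the previous recipe, create a method that takes a double array as its argument: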
```
        public void getSummaryStats(double[] values){ 
```

Create an object of the SummaryStatistics class:

        SummaryStatistics stats = new SummaryStatistics();

Add all the values to this SummaryStatistics object:

        for( int i = 0; i < values.length; i++) { 
            stats.addValue(values[i]); 
        }

Finally, use the methods of the SummaryStatistics class to generate summary statistics for the values. After using the statistics, close the method:

        double mean = stats.getMean(); 
        double std = stats.getStandardDeviation(); 
        System.out.println(mean + "\t" + std); 
        }

The complete code, including the driver method, is as follows:

import org.apache.commons.math3.stat.descriptive.SummaryStatistics; 

public class SummaryStats { 
   public static void main(String[] args){ 
      double[] values = {32, 39, 14, 98, 45, 44, 45, 34, 89, 67, 0, 15, 
        0, 56, 88}; 
      SummaryStats summaryStatTest = new SummaryStats(); 
      summaryStatTest.getSummaryStats(values); 
   } 
   public void getSummaryStats(double[] values){ 
      SummaryStatistics stats = new SummaryStatistics(); 
      for( int i = 0; i < values.length; i++) { 
              stats.addValue(values[i]); 
      } 
      double mean = stats.getMean(); 
      double std = stats.getStandardDeviation(); 
      System.out.println(mean + "\t" + std); 
   } 
}

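Generating summary statistics from multiple distributions

In this recipe, we create an AggregateSummaryStatistics instance to accumulate the overall statistics, and SummaryStatistics instances for the sample data.

How to do it...

Create a method that takes two double arrays as arguments. The two arrays contain two different sets of data: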

        public void getAggregateStats(double[] values1, double[] 
          values2){

Create an object of the AggregateSummaryStatistics class:

        AggregateSummaryStatistics aggregate = new  
        AggregateSummaryStatistics();

To generate summary statistics from the two distributions, create two objects of the SummaryStatistics class:

        SummaryStatistics firstSet = 
          aggregate.createContributingStatistics(); 
        SummaryStatistics secondSet = 
          aggregate.createContributingStatistics();

Add the values of the two distributions to the two objects created in the previous step:

        for(int i = 0; i < values1.length; i++) { 
           firstSet.addValue(values1[i]); 
        } 
        for(int i = 0; i < values2.length; i++) { 
           secondSet.addValue(values2[i]); 
        }

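Use the methods of the AggregateSummaryStatistics class to generate aggregate statistics from the two distributions. Finally, after using the generated statistics, close the method: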

        double sampleSum = aggregate.getSum(); 
        double sampleMean = aggregate.getMean(); 
        double sampleStd= aggregate.getStandardDeviation(); 
        System.out.println(sampleSum + "\t" + sampleMean + "\t" +  
          sampleStd); 
        }

The complete code for the recipe is as follows:

import org.apache.commons.math3.stat.descriptive.AggregateSummaryStatistics; 
import org.apache.commons.math3.stat.descriptive.SummaryStatistics; 

public class AggregateStats { 
   public static void main(String[] args){ 
      double[] values1 = {32, 39, 14, 98, 45, 44, 45}; 
      double[] values2 = {34, 89, 67, 0, 15, 0, 56, 88}; 
      AggregateStats aggStatTest = new AggregateStats(); 
      aggStatTest.getAggregateStats(values1, values2); 
   } 
   public void getAggregateStats(double[] values1, double[] values2){ 
      AggregateSummaryStatistics aggregate = new 
      AggregateSummaryStatistics(); 
      SummaryStatistics firstSet = 
        aggregate.createContributingStatistics(); 
      SummaryStatistics secondSet = 
        aggregate.createContributingStatistics(); 

      for(int i = 0; i < values1.length; i++) { 
         firstSet.addValue(values1[i]); 
      } 
      for(int i = 0; i < values2.length; i++) { 
         secondSet.addValue(values2[i]); 
      } 
      double sampleSum = aggregate.getSum(); 
      double sampleMean = aggregate.getMean(); 
      double sampleStd= aggregate.getStandardDeviation(); 
      System.out.println(sampleSum + "\t" + sampleMean + "\t" + 
        sampleStd); 
   } 
}

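There's more...

This approach has a few drawbacks, discussed here:

Every time we call the addValue() method, the call must be synchronized on the SummaryStatistics instance maintained by the aggregate
Every time we add a value, both the aggregate and the sample are updated

To overcome these drawbacks, the class provides a static aggregate method.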
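The alternative to updating an aggregate on every addValue() call is to summarize each data set independently and combine the summaries once at the end. The sketch below shows that idea in plain Java with a standard pairwise merge of count, mean, and sum of squared deviations. It is an illustration under stated assumptions, not the Commons Math implementation; MergeableStats and its members are hypothetical names.

```java
class MergeableStats {
    long n;      // number of values
    double mean; // mean of the values
    double m2;   // sum of squared deviations from the mean

    // Summarizes one data set on its own, with no shared state.
    static MergeableStats of(double... values) {
        MergeableStats s = new MergeableStats();
        for (double v : values) {
            s.n++;
            double delta = v - s.mean;
            s.mean += delta / s.n;
            s.m2 += delta * (v - s.mean);
        }
        return s;
    }

    // Combines two independent summaries into the summary of the union.
    static MergeableStats merge(MergeableStats a, MergeableStats b) {
        MergeableStats out = new MergeableStats();
        out.n = a.n + b.n;
        if (out.n == 0) {
            return out;
        }
        double delta = b.mean - a.mean;
        out.mean = a.mean + delta * b.n / out.n;
        out.m2 = a.m2 + b.m2 + delta * delta * a.n * b.n / out.n;
        return out;
    }

    double sum() { return mean * n; }

    double sampleStdDev() { return Math.sqrt(m2 / (n - 1)); }

    public static void main(String[] args) {
        MergeableStats first = of(32, 39, 14, 98, 45, 44, 45);
        MergeableStats second = of(34, 89, 67, 0, 15, 0, 56, 88);
        MergeableStats all = merge(first, second);
        System.out.println(all.sum() + "\t" + all.mean + "\t" + all.sampleStdDev());
    }
}
```

Because each summary is built without touching the others, no synchronization is needed while values are being added; the only shared step is the final merge.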

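Computing frequency distribution

The Frequency class has methods to count the number of data instances in a bucket, to count the unique number of data instances, and so on. The interface of Frequency is very simple, and in most cases the desired computation takes only a few lines of code.

Strings, integers, longs, and chars are all supported as value types.

Note

Natural ordering is the default ordering for cumulative frequencies, but this can be overridden by supplying a Comparator to the constructor.

How to do it...

Create a method that takes a double array as its argument. We will compute the frequency distribution of the values in this array: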
```
        public void getFreqStats(double[] values){ 
```

Create an object of the Frequency class:

        Frequency freq = new Frequency();

Add the values of the double array to this object:

        for( int i = 0; i < values.length; i++) { 
            freq.addValue(values[i]); 
        }

Generate the frequency of each value in the array:

        for( int i = 0; i < values.length; i++) { 
          System.out.println(freq.getCount(values[i])); 
        }

Finally, close the method:

The complete code for the recipe is as follows:

import org.apache.commons.math3.stat.Frequency; 

public class FrequencyStats { 
   public static void main(String[] args){ 
      double[] values = {32, 39, 14, 98, 45, 44, 45, 34, 89, 67, 0, 15, 
        0, 56, 88}; 
      FrequencyStats freqTest = new FrequencyStats(); 
      freqTest.getFreqStats(values); 

   } 
   public void getFreqStats(double[] values){ 
      Frequency freq = new Frequency(); 
      for( int i = 0; i < values.length; i++) { 
         freq.addValue(values[i]); 
      } 

      for( int i = 0; i < values.length; i++) { 
         System.out.println(freq.getCount(values[i])); 
      } 
   } 
}
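
The listing above prints one count per array element, so values that occur twice are reported twice. As a rough illustration of the counting idea, and not of how Frequency is implemented internally, the same tallies can be sketched with a plain JDK TreeMap. The class and method names below (FrequencySketch, countValues, cumulativeCount) are made up for this example:

```java
import java.util.Map;
import java.util.TreeMap;

// Plain-JDK sketch of per-value counts and cumulative counts (illustrative names).
class FrequencySketch {

    // Count how often each value occurs; TreeMap keeps keys in natural (ascending) order.
    static TreeMap<Double, Long> countValues(double[] values) {
        TreeMap<Double, Long> counts = new TreeMap<>();
        for (double v : values) {
            counts.merge(v, 1L, Long::sum);
        }
        return counts;
    }

    // Number of observations less than or equal to the given value.
    static long cumulativeCount(TreeMap<Double, Long> counts, double value) {
        long total = 0;
        for (long c : counts.headMap(value, true).values()) {
            total += c;
        }
        return total;
    }

    public static void main(String[] args) {
        double[] values = {32, 39, 14, 98, 45, 44, 45, 34, 89, 67, 0, 15, 0, 56, 88};
        TreeMap<Double, Long> counts = countValues(values);
        // Each distinct value is printed once, in ascending order.
        for (Map.Entry<Double, Long> e : counts.entrySet()) {
            System.out.println(e.getKey() + "=" + e.getValue());
        }
        System.out.println("<= 15: " + cumulativeCount(counts, 15));
    }
}
```

Because the map is keyed by value, each distinct value appears once, and the sorted keys make cumulative counts a simple prefix sum.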

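Counting word frequency in a string

This recipe differs considerably from the others in this chapter, because it deals with strings and counts the frequency of the words in a string. We will use both Apache Commons Math and Java 8 for this task: this recipe uses the external library, and the next recipe achieves the same result with Java 8 alone.

How to do it...

Create a method that takes a String array. The array contains all the words in a string:
```
        public void getFreqStats(String[] words){ 
```

Create a Frequency class object:

        Frequency freq = new Frequency();

Add all the words to the Frequency object:

        for( int i = 0; i < words.length; i++) { 
          freq.addValue(words[i].trim()); 
        }

For each word, compute the frequency with the getCount() method of the Frequency class. Finally, once the frequencies are processed, close the method:

        for( int i = 0; i < words.length; i++) { 
           System.out.println(words[i] + "=" + 
             freq.getCount(words[i])); 
        } 
        }

The working code for the recipe is as follows: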

import org.apache.commons.math3.stat.Frequency; 

public class WordFrequencyStatsApache { 
   public static void main(String[] args){ 
      String str = "Horatio says 'tis but our fantasy, " 
            + "And will not let belief take hold of him " 
            + "Touching this dreaded sight, twice seen of us. " 
            + "Therefore I have entreated him along, " 
            + "With us to watch the minutes of this night, " 
            + "That, if again this apparition come, " 
            + "He may approve our eyes and speak to it."; 
      String[] words = str.toLowerCase().split("\\W+"); 
      WordFrequencyStatsApache freqTest = new 
        WordFrequencyStatsApache(); 
      freqTest.getFreqStats(words); 

   } 
   public void getFreqStats(String[] words){ 
      Frequency freq = new Frequency(); 
      for( int i = 0; i < words.length; i++) { 
         freq.addValue(words[i].trim()); 
      } 

      for( int i = 0; i < words.length; i++) { 
         System.out.println(words[i] + "=" + freq.getCount(words[i])); 
      } 
   } 
}

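How it works...

This recipe prints every word in the string together with its frequency, so words that occur more than once appear repeatedly in the output. When you process the frequencies in the final for loop, you need some mechanism in your code to avoid the duplicate output.

For example, the next recipe uses a Map data structure to avoid repeating words. If the order of the words does not matter, a HashMap is enough; if you need the words in sorted order, use a TreeMap; and if you need them in order of first occurrence, use a LinkedHashMap.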
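
One possible way to avoid the duplicate output without streams is to collect the counts in a map keyed by word. The sketch below is only an illustration under that assumption; WordCountSketch and countInOrder are names invented for this example, and the input sentence is a toy one:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

// Illustrative sketch: one entry per distinct word instead of one line per occurrence.
class WordCountSketch {

    // LinkedHashMap remembers the order in which each word was first seen.
    static Map<String, Integer> countInOrder(String[] words) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String w : words) {
            Integer seen = counts.get(w);
            counts.put(w, seen == null ? 1 : seen + 1);
        }
        return counts;
    }

    public static void main(String[] args) {
        String[] words = "to be or not to be".split("\\W+");
        Map<String, Integer> inOrder = countInOrder(words);
        System.out.println(inOrder);                // {to=2, be=2, or=1, not=1}
        // Copying into a TreeMap re-sorts the same counts alphabetically by word.
        System.out.println(new TreeMap<>(inOrder)); // {be=2, not=1, or=1, to=2}
    }
}
```

The choice of map only affects iteration order; the counts themselves are the same in all three map types.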

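Counting word frequency in a string using Java 8

This recipe does not use the Apache Commons Math library to count the frequency of words in a given string; instead, it uses the core libraries and mechanisms introduced in Java 8.

Note

There are many ways to implement a working example of word frequency counting, so readers are encouraged to look at the many implementations of this recipe for Java versions earlier than 8.

How to do it...

Create a method that takes a String argument. We will count the frequency of the words in this string:
```
        public void getFreqStats(String str){ 
```
Create a Stream from the given string. In our case, we convert the string to lowercase and identify words with the regular expression \W+. The stream is marked as parallel, so the operations that follow may be executed in parallel:
```
        Stream<String> stream =   
          Stream.of(str.toLowerCase().split("\\W+")).parallel(); 
```
We use the collect() method of the Stream class to gather the words and their frequencies. Note that the result goes into a Map object with String and Long as its generic types; the String holds the word, and the Long holds its frequency:
```
        Map<String, Long> wordFreq = 
          stream.collect(Collectors.groupingBy
            (String::toString,Collectors.counting())); 
```
Finally, we print the contents of the map in one go with forEach and close the method:

        wordFreq.forEach((k,v)->System.out.println(k + "=" + v));
        }

The working example for this recipe is as follows: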

import java.util.Map;
import java.util.stream.Collectors;
import java.util.stream.Stream;
public class WordFrequencyStatsJava { 

   public static void main(String[] args){ 
      String str = "Horatio says 'tis but our fantasy, " 
            + "And will not let belief take hold of him " 
            + "Touching this dreaded sight, twice seen of us. " 
            + "Therefore I have entreated him along, " 
            + "With us to watch the minutes of this night, " 
            + "That, if again this apparition come, " 
            + "He may approve our eyes and speak to it."; 

      WordFrequencyStatsJava freqTest = new WordFrequencyStatsJava(); 
      freqTest.getFreqStats(str); 
   } 
   public void getFreqStats(String str){ 
      Stream<String> stream = Stream.of(str.toLowerCase().split("\\W+")).parallel(); 
      Map<String, Long> wordFreq = stream 
            .collect(Collectors.groupingBy(String::toString,Collectors.counting())); 
      wordFreq.forEach((k,v)->System.out.println(k + "=" + v)); 
   } 
}
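
The recipe collects into the default map of groupingBy, whose iteration order is not specified. If a predictable, alphabetical listing is wanted, one option is to pass a TreeMap factory to groupingBy. The following is a small sketch of that variant; SortedWordCountSketch and countSorted are illustrative names, not part of the recipe:

```java
import java.util.Arrays;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;
import java.util.stream.Collectors;

// Illustrative variant: same counting, but keys come back sorted.
class SortedWordCountSketch {

    // Group identical words and count them; the TreeMap factory keeps keys sorted.
    static Map<String, Long> countSorted(String text) {
        return Arrays.stream(text.toLowerCase().split("\\W+"))
                .filter(w -> !w.isEmpty()) // split can yield a leading empty token
                .collect(Collectors.groupingBy(
                        Function.identity(), TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        countSorted("That, if again this apparition come, He may approve our eyes")
                .forEach((word, count) -> System.out.println(word + "=" + count));
    }
}
```

The filter step is a precaution: when the text starts with punctuation, split("\\W+") returns an empty first element that would otherwise be counted as a word.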

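Computing simple regression

The SimpleRegression class supports ordinary least squares regression with one independent variable: y = intercept + slope * x, where the intercept is optional. The class can also provide the standard error of the intercept. Observations, as (x, y) pairs, can be added to the model one at a time or supplied as a two-dimensional array. In this recipe, the data points are supplied as a two-dimensional array.

Note

Observations are not stored in memory, so there is no limit on the number of observations that can be added to the model.

How to do it...

To compute a simple regression, create a method that takes a two-dimensional double array. The array represents a series of (x, y) values:
```
        public void calculateRegression(double[][] data){ 
```
Create a SimpleRegression object, and add the data:
```
        SimpleRegression regression = new SimpleRegression(); 
        regression.addData(data); 
```
Note

If your model has no intercept, or if you want to exclude the intercept from the computation, use a different constructor to create the SimpleRegression object: SimpleRegression regression = new SimpleRegression(false);.

Find the intercept, the slope, and the standard error of the slope (the standard error of the intercept is available from getInterceptStdErr()). Finally, close the method:

        System.out.println(regression.getIntercept()); 
        System.out.println(regression.getSlope()); 
        System.out.println(regression.getSlopeStdErr()); 
        }

The complete code for this recipe is as follows: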

import org.apache.commons.math3.stat.regression.SimpleRegression; 

public class RegressionTest { 

   public static void main(String[] args){ 
      double[][] data = { { 1, 3 }, {2, 5 }, {3, 7 }, {4, 14 }, {5, 11 }}; 
      RegressionTest test = new RegressionTest(); 
      test.calculateRegression(data); 
   } 
   public void calculateRegression(double[][] data){ 
      SimpleRegression regression = new SimpleRegression(); 
      regression.addData(data); 
      System.out.println(regression.getIntercept()); 
      System.out.println(regression.getSlope()); 
      System.out.println(regression.getSlopeStdErr()); 
   } 
}

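Note

If there are fewer than two observations in the model, or if all the x values are identical, all the statistics return NaN.

The getter methods always reflect the current data: if you add more data after obtaining the statistics, calling the getters again gives you updated statistics without creating a new instance.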
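
To make the model y = intercept + slope * x concrete, here is a plain-Java sketch of the textbook closed-form least squares fit. It is meant as background only and makes no claim about how SimpleRegression is implemented; SimpleRegressionSketch and fit are illustrative names:

```java
// Textbook closed-form least squares for one independent variable (illustrative sketch).
class SimpleRegressionSketch {

    // Returns {intercept, slope} of the least squares line through rows of {x, y}.
    static double[] fit(double[][] data) {
        int n = data.length;
        double meanX = 0, meanY = 0;
        for (double[] p : data) {
            meanX += p[0] / n;
            meanY += p[1] / n;
        }
        double sxx = 0, sxy = 0;
        for (double[] p : data) {
            sxx += (p[0] - meanX) * (p[0] - meanX);
            sxy += (p[0] - meanX) * (p[1] - meanY);
        }
        double slope = sxy / sxx; // 0/0 = NaN when all x values are identical
        double intercept = meanY - slope * meanX;
        return new double[] {intercept, slope};
    }

    public static void main(String[] args) {
        // Same data points as in the recipe.
        double[][] data = {{1, 3}, {2, 5}, {3, 7}, {4, 14}, {5, 11}};
        double[] line = fit(data);
        System.out.println("intercept=" + line[0] + ", slope=" + line[1]);
    }
}
```

For the recipe's five points, working the sums by hand gives a slope of 2.5 and an intercept of 0.5, which is a handy check against the library output.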

Computing ordinary least squares regression

OLSMultipleLinearRegression provides ordinary least squares regression to fit the linear model Y = Xb + u. Here, Y is the n-vector regressand, X is an [n,k] matrix whose k columns are called regressors, b is the k-vector of regression parameters, and u is the n-vector of error terms or residuals.

How to do it...

Create a method that takes a two-dimensional double array and a one-dimensional double array:

        public void calculateOlsRegression(double[][] x, double[] y){

Create an OLS regression object, and add the data points x and y:

        OLSMultipleLinearRegression regression = new 
          OLSMultipleLinearRegression(); 
        regression.newSampleData(y, x);

Compute the various regression parameters and diagnostics with the following methods of the OLSMultipleLinearRegression class. How you use this information depends on the task at hand. Finally, close the method:

        double[] beta = regression.estimateRegressionParameters();        
        double[] residuals = regression.estimateResiduals(); 
        double[][] parametersVariance = 
          regression.estimateRegressionParametersVariance(); 
        double regressandVariance = 
          regression.estimateRegressandVariance(); 
        double rSquared = regression.calculateRSquared(); 
        double sigma = regression.estimateRegressionStandardError(); 
        }

The x and y data points can be created as follows. We use fixed data for this example, so the arrays are initialized by hand; with real data, you would fill the x array in a loop:

        double[] y = new double[]{11.0, 12.0, 13.0, 14.0, 15.0, 16.0}; 
        double[][] x = new double[6][]; 
        x[0] = new double[]{0, 0, 0, 0, 0}; 
        x[1] = new double[]{2.0, 0, 0, 0, 0}; 
        x[2] = new double[]{0, 3.0, 0, 0, 0}; 
        x[3] = new double[]{0, 0, 4.0, 0, 0}; 
        x[4] = new double[]{0, 0, 0, 5.0, 0}; 
        x[5] = new double[]{0, 0, 0, 0, 6.0};

The working example for this recipe is as follows:

import org.apache.commons.math3.stat.regression.
  OLSMultipleLinearRegression; 
public class OLSRegressionTest { 
   public static void main(String[] args){ 
      double[] y = new double[]{11.0, 12.0, 13.0, 14.0, 15.0, 16.0}; 
      double[][] x = new double[6][]; 
      x[0] = new double[]{0, 0, 0, 0, 0}; 
      x[1] = new double[]{2.0, 0, 0, 0, 0}; 
      x[2] = new double[]{0, 3.0, 0, 0, 0}; 
      x[3] = new double[]{0, 0, 4.0, 0, 0}; 
      x[4] = new double[]{0, 0, 0, 5.0, 0}; 
      x[5] = new double[]{0, 0, 0, 0, 6.0};    
      OLSRegressionTest test = new OLSRegressionTest(); 
      test.calculateOlsRegression(x, y); 
   } 
   public void calculateOlsRegression(double[][] x, double[] y){ 
      OLSMultipleLinearRegression regression = new 
        OLSMultipleLinearRegression(); 
      regression.newSampleData(y, x); 

      double[] beta = regression.estimateRegressionParameters();        
      double[] residuals = regression.estimateResiduals(); 
      double[][] parametersVariance = 
        regression.estimateRegressionParametersVariance(); 
      double regressandVariance = 
        regression.estimateRegressandVariance(); 
      double rSquared = regression.calculateRSquared(); 
      double sigma = regression.estimateRegressionStandardError(); 
//print out the values here 
   } 
}

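Note

An IllegalArgumentException is thrown in two situations: when the dimensions of the input data arrays do not match, and when the data arrays do not contain enough data to estimate the model.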
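
As background for the model Y = Xb + u, the estimate that an ordinary least squares fit targets can be written in closed form. This is the standard textbook expression, given here for orientation rather than as a description of the numerical method the library uses internally:

```latex
\hat{b} = (X^{\top} X)^{-1} X^{\top} Y, \qquad \hat{u} = Y - X \hat{b}
```

In terms of the recipe, \hat{b} corresponds to the array returned by estimateRegressionParameters() and \hat{u} to the array returned by estimateResiduals().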

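Computing generalized least squares regression

In this recipe, we will look at another variant of least squares regression, called generalized least squares regression. GLSMultipleLinearRegression implements generalized least squares to fit the linear model Y = Xb + u.

How to do it...

Create a method that takes a two-dimensional double array, a one-dimensional double array, and a second two-dimensional double array for the omega parameter of the regression:

        public void calculateGlsRegression(double[][] x, double[] y, 
          double[][] omega){

Create a GLS regression object, and add the data points and the omega parameter:

        GLSMultipleLinearRegression regression = new  
          GLSMultipleLinearRegression(); 
        regression.newSampleData(y, x, omega);

Compute the various statistics of the regression with the methods of the GLSMultipleLinearRegression class, and finally, close the method:

        double[] beta = regression.estimateRegressionParameters();        
        double[] residuals = regression.estimateResiduals(); 
        double[][] parametersVariance = 
          regression.estimateRegressionParametersVariance(); 
        double regressandVariance = 
          regression.estimateRegressandVariance(); 
        double sigma = regression.estimateRegressionStandardError(); 
        }

To see how the two arrays x and y are populated, refer to the previous recipe. For this recipe, besides the x and y data points, we also need the omega values, which form the [n,n] covariance matrix of the error terms. The omega values can be placed in a two-dimensional double array as follows:

        double[][] omega = new double[6][]; 
        omega[0] = new double[]{1.1, 0, 0, 0, 0, 0}; 
        omega[1] = new double[]{0, 2.2, 0, 0, 0, 0}; 
        omega[2] = new double[]{0, 0, 3.3, 0, 0, 0}; 
        omega[3] = new double[]{0, 0, 0, 4.4, 0, 0}; 
        omega[4] = new double[]{0, 0, 0, 0, 5.5, 0}; 
        omega[5] = new double[]{0, 0, 0, 0, 0, 6.6};

The complete class, method, and driver method for this recipe are as follows: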

import org.apache.commons.math3.stat.regression.
   GLSMultipleLinearRegression; 
public class GLSRegressionTest { 
   public static void main(String[] args){ 
      double[] y = new double[]{11.0, 12.0, 13.0, 14.0, 15.0, 16.0}; 
      double[][] x = new double[6][]; 
      x[0] = new double[]{0, 0, 0, 0, 0}; 
      x[1] = new double[]{2.0, 0, 0, 0, 0}; 
      x[2] = new double[]{0, 3.0, 0, 0, 0}; 
      x[3] = new double[]{0, 0, 4.0, 0, 0}; 
      x[4] = new double[]{0, 0, 0, 5.0, 0}; 
      x[5] = new double[]{0, 0, 0, 0, 6.0};           
      double[][] omega = new double[6][]; 
      omega[0] = new double[]{1.1, 0, 0, 0, 0, 0}; 
      omega[1] = new double[]{0, 2.2, 0, 0, 0, 0}; 
      omega[2] = new double[]{0, 0, 3.3, 0, 0, 0}; 
      omega[3] = new double[]{0, 0, 0, 4.4, 0, 0}; 
      omega[4] = new double[]{0, 0, 0, 0, 5.5, 0}; 
      omega[5] = new double[]{0, 0, 0, 0, 0, 6.6};   
      GLSRegressionTest test = new GLSRegressionTest(); 
      test.calculateGlsRegression(x, y, omega); 
   } 
   public void calculateGlsRegression(double[][] x, double[] y, 
     double[][] omega){ 
      GLSMultipleLinearRegression regression = new 
        GLSMultipleLinearRegression(); 
      regression.newSampleData(y, x, omega);        

      double[] beta = regression.estimateRegressionParameters();        
      double[] residuals = regression.estimateResiduals(); 
      double[][] parametersVariance = 
        regression.estimateRegressionParametersVariance(); 
      double regressandVariance = 
        regression.estimateRegressandVariance(); 
      double sigma = regression.estimateRegressionStandardError(); 
//print out the values here 
   } 
}
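
For orientation, the generalized least squares estimate can also be written in closed form, with \Omega denoting the covariance matrix of the error terms (the omega array of the recipe). This is the standard textbook expression, not a statement about the library's internal algorithm:

```latex
\hat{b} = (X^{\top} \Omega^{-1} X)^{-1} X^{\top} \Omega^{-1} Y
```

When \Omega is the identity matrix, the expression reduces to the ordinary least squares estimate (X^{\top} X)^{-1} X^{\top} Y.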

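Computing covariance of two sets of data points

The unbiased covariance is given by the formula cov(X, Y) = sum[(xi - E(X))(yi - E(Y))] / (n - 1), where E(X) is the mean of the X values and E(Y) is the mean of the Y values. The non-bias-corrected estimate uses n in place of n - 1. To choose whether the covariance is bias-corrected, we set an additional optional parameter, biasCorrected, which is true by default.

How to do it...

Create a method that takes two one-dimensional double arrays. Each array represents one set of data points:
```
        public void calculateCov(double[] x, double[] y){ 
```
Calculate the covariance of the two sets of data points as follows:
```
        double covariance = new Covariance().covariance(x, y, false); 
```
Note

In this recipe, we compute the non-bias-corrected covariance, which is why we pass false as the third parameter of the covariance() method. To get the unbiased covariance of the two double arrays instead, drop the third parameter: double covariance = new Covariance().covariance(x, y);.

Use the covariance as your task requires, and close the method:

        System.out.println(covariance); 
        }

The working code for this recipe is as follows: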

import org.apache.commons.math3.stat.correlation.Covariance; 

public class CovarianceTest { 
   public static void main(String[] args){ 
      double[] x = {43, 21, 25, 42, 57, 59}; 
      double[] y = {99, 65, 79, 75, 87, 81}; 
      CovarianceTest test = new CovarianceTest(); 
      test.calculateCov(x, y); 
   } 
   public void calculateCov(double[] x, double[] y){ 
      // Passing false gives the non-bias-corrected covariance; 
      // omit the third argument to get the unbiased covariance. 
      double covariance = new Covariance().covariance(x, y, false); 
      System.out.println(covariance); 
   } 
}
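
The difference between the two estimates is easy to see when the formula is written out by hand. The following plain-Java sketch follows the formula from the start of this recipe; it is an illustration, not the library's code, and CovarianceSketch with its methods are names invented here:

```java
// Hand-written covariance following cov(X, Y) = sum[(xi - E(X))(yi - E(Y))] / (n - 1) or / n.
class CovarianceSketch {

    static double mean(double[] v) {
        double sum = 0;
        for (double d : v) {
            sum += d;
        }
        return sum / v.length;
    }

    // biasCorrected = true divides by n - 1, false divides by n.
    static double covariance(double[] x, double[] y, boolean biasCorrected) {
        double mx = mean(x), my = mean(y), sum = 0;
        for (int i = 0; i < x.length; i++) {
            sum += (x[i] - mx) * (y[i] - my);
        }
        return sum / (biasCorrected ? x.length - 1 : x.length);
    }

    public static void main(String[] args) {
        // Same data as in the recipe.
        double[] x = {43, 21, 25, 42, 57, 59};
        double[] y = {99, 65, 79, 75, 87, 81};
        System.out.println("unbiased: " + covariance(x, y, true));
        System.out.println("biased:   " + covariance(x, y, false));
    }
}
```

For the recipe's data, the sum of cross products is 478, so the unbiased estimate is 478 / 5 and the non-bias-corrected estimate is 478 / 6.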

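Computing Pearson's correlation of two sets of data points

PearsonsCorrelation computes the correlation defined by the formula cor(X, Y) = sum[(xi - E(X))(yi - E(Y))] / [(n - 1)s(X)s(Y)], where E(X) and E(Y) are the means of X and Y, and s(X) and s(Y) are their respective standard deviations.

How to do it...

Create a method that takes two double arrays representing two sets of data points:
```
        public void calculatePearson(double[] x, double[] y){ 
```

Create a PearsonsCorrelation object:

        PearsonsCorrelation pCorrelation = new PearsonsCorrelation();

Compute the correlation of the two sets of data points:

        double cor = pCorrelation.correlation(x, y);

Use the correlation as your task requires, and close the method:

        System.out.println(cor); 
        }

The complete code for the recipe is as follows: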

import org.apache.commons.math3.stat.correlation.PearsonsCorrelation; 

public class PearsonTest { 
   public static void main(String[] args){ 
      double[] x = {43, 21, 25, 42, 57, 59}; 
      double[] y = {99, 65, 79, 75, 87, 81}; 
      PearsonTest test = new PearsonTest(); 
      test.calculatePearson(x, y); 
   } 
   public void calculatePearson(double[] x, double[] y){ 
      PearsonsCorrelation pCorrelation = new PearsonsCorrelation(); 
      double cor = pCorrelation.correlation(x, y); 
      System.out.println(cor); 
   } 
}
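
As a cross-check on the formula, the correlation can be computed by hand in a few lines. The (n - 1) factors in the numerator and in the two standard deviations cancel, which the sketch below exploits. This is an illustration of the formula only; PearsonSketch and correlation are names chosen for this example:

```java
// Pearson correlation written out from its definition (illustrative sketch).
class PearsonSketch {

    static double correlation(double[] x, double[] y) {
        int n = x.length;
        double mx = 0, my = 0;
        for (int i = 0; i < n; i++) {
            mx += x[i] / n;
            my += y[i] / n;
        }
        double sxy = 0, sxx = 0, syy = 0;
        for (int i = 0; i < n; i++) {
            double dx = x[i] - mx;
            double dy = y[i] - my;
            sxy += dx * dy;
            sxx += dx * dx;
            syy += dy * dy;
        }
        // The (n - 1) terms of the covariance and standard deviations cancel out.
        return sxy / Math.sqrt(sxx * syy);
    }

    public static void main(String[] args) {
        double[] x = {43, 21, 25, 42, 57, 59};
        double[] y = {99, 65, 79, 75, 87, 81};
        System.out.println(correlation(x, y));
    }
}
```

For the recipe's data this works out to roughly 0.53, a moderate positive correlation.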

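Conducting a paired t-test

Of the many standard statistical significance tests provided by Apache Commons Math, we will use only a few for demonstration: the paired t-test, the chi-square test, the one-way ANOVA test, and the Kolmogorov-Smirnov test. Readers can run the other significance tests in the same way, because the code uses static methods of the TestUtils class to perform the tests.

Apache Commons Math supports both one-sample and two-sample t-tests. Two-sample tests can in turn be paired or unpaired, and unpaired two-sample tests can be run with or without the assumption that the subpopulation variances are equal.

How to do it...

Create a method that takes two sets of double values as its arguments. We will run a paired t-test to find out whether the difference between the two sets of values is statistically significant:
```
        public void getTtest(double[] sample1, double[] sample2){ 
```

The t statistic of the two samples can be found with the pairedT() method:

        System.out.println(TestUtils.pairedT(sample1, sample2));

The p-value of the paired t-test can be found with the pairedTTest() method:

        System.out.println(TestUtils.pairedTTest(sample1, sample2));

Finally, whether the difference between the two samples is significant at a given alpha level can be found as follows:
```
        System.out.println(TestUtils.pairedTTest(sample1, sample2, 
          0.05)); 
```
In this example, the third parameter is set to 0.05, which means that we want to know whether the difference is significant at an alpha level of 0.05, that is, at the 95% confidence level. The method returns true when the null hypothesis can be rejected.
Finally, close the method:

        }

The working example for this recipe is as follows: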

import org.apache.commons.math3.stat.inference.TestUtils; 
public class TTest { 
   public static void main(String[] args){ 
      double[] sample1 = {43, 21, 25, 42, 57, 59}; 
      double[] sample2 = {99, 65, 79, 75, 87, 81}; 
      TTest test = new TTest(); 
      test.getTtest(sample1, sample2); 
   } 
   public void getTtest(double[] sample1, double[] sample2){ 
      // t statistic 
      System.out.println(TestUtils.pairedT(sample1, sample2)); 
      // p-value 
      System.out.println(TestUtils.pairedTTest(sample1, sample2)); 
      // true if the difference is significant at alpha = 0.05 
      System.out.println(TestUtils.pairedTTest(sample1, sample2, 
        0.05)); 
   } 
}
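
To show what the t statistic of a paired test measures, here is a plain-Java sketch based on the usual textbook definition: the mean of the pairwise differences divided by its standard error. It is offered as an assumption-laden illustration, not as the library's implementation; PairedTSketch and pairedT are illustrative names:

```java
// Paired t statistic from the textbook definition (illustrative sketch).
class PairedTSketch {

    static double pairedT(double[] a, double[] b) {
        int n = a.length;
        // Mean of the pairwise differences.
        double meanDiff = 0;
        for (int i = 0; i < n; i++) {
            meanDiff += (a[i] - b[i]) / n;
        }
        // Sample variance of the differences (divides by n - 1).
        double ss = 0;
        for (int i = 0; i < n; i++) {
            double dev = (a[i] - b[i]) - meanDiff;
            ss += dev * dev;
        }
        double variance = ss / (n - 1);
        // t = mean difference / standard error of the mean difference.
        return meanDiff / Math.sqrt(variance / n);
    }

    public static void main(String[] args) {
        double[] sample1 = {43, 21, 25, 42, 57, 59};
        double[] sample2 = {99, 65, 79, 75, 87, 81};
        System.out.println(pairedT(sample1, sample2));
    }
}
```

With the recipe's samples, every value in sample1 is smaller than its partner in sample2, so the statistic comes out strongly negative (about -7.1).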

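Conducting a chi-square test

To run a chi-square test on two data distributions, one distribution is treated as the observed distribution and the other as the expected distribution.

How to do it...

Create a method that takes these two distributions as arguments. Note that the observed distribution is a long array, while the expected distribution is a double array:
```
        public void getChiSquare(long[] observed, double[] expected){ 
```

Get the chi-square statistic of the test as follows:

        System.out.println(TestUtils.chiSquare(expected, observed));

The p-value of the test can be found in a similar way, with a different method:

        System.out.println(TestUtils.chiSquareTest(expected, 
          observed));

We can also check whether the difference between the expected and observed data distributions is significant at a given alpha level, as follows:
```
        System.out.println(TestUtils.chiSquareTest(expected, observed, 
          0.05)); 
```
In this example, we work at the 95% confidence level, so the third parameter of the chiSquareTest() method is set to an alpha value of 0.05.
Finally, close the method:

        }

The complete code for this recipe is here: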

import org.apache.commons.math3.stat.inference.TestUtils; 
public class ChiSquareTest { 
   public static void main(String[] args){ 
      long[] observed = {43, 21, 25, 42, 57, 59}; 
      double[] expected = {99, 65, 79, 75, 87, 81}; 
      ChiSquareTest test = new ChiSquareTest(); 
      test.getChiSquare(observed, expected); 
   } 
   public void getChiSquare(long[] observed, double[] expected){ 
      // chi-square statistic 
      System.out.println(TestUtils.chiSquare(expected, observed)); 
      // p-value 
      System.out.println(TestUtils.chiSquareTest(expected, 
        observed)); 
      // true if the difference is significant at alpha = 0.05 
      System.out.println(TestUtils.chiSquareTest(expected, observed, 
        0.05)); 
   } 
}
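
The chi-square statistic itself is the sum of (observed - expected)^2 / expected over all bins. The sketch below computes exactly that and nothing more; it assumes the observed and expected totals already agree, whereas the library, as far as we understand, rescales the expected counts when they do not (as is the case for the recipe's arrays). ChiSquareSketch and the toy data are illustrative:

```java
// Chi-square goodness-of-fit statistic from its definition (illustrative sketch).
class ChiSquareSketch {

    // Assumes both arrays have the same length and the same total count.
    static double chiSquare(double[] expected, long[] observed) {
        double stat = 0;
        for (int i = 0; i < expected.length; i++) {
            double diff = observed[i] - expected[i];
            stat += diff * diff / expected[i];
        }
        return stat;
    }

    public static void main(String[] args) {
        // Toy data (not from the recipe): 60 observations over three bins.
        long[] observed = {10, 20, 30};
        double[] expected = {20, 20, 20};
        System.out.println(chiSquare(expected, observed)); // 100/20 + 0 + 100/20 = 10.0
    }
}
```

A larger statistic means the observed counts deviate more from the expected ones; the p-value then tells how surprising that deviation is.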

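Conducting the one-way ANOVA test

ANOVA stands for Analysis of Variance. In this recipe, we will see how to use Java to run a one-way ANOVA test to determine whether the means of three or more independent, unrelated sets of data points differ significantly.

How to do it...

Create a method that takes the various data distributions. In our example, we will run ANOVA on the relationship between calorie, fat, carb, and control:
```
        public void calculateAnova(double[] calorie, double[] fat, 
          double[] carb, double[] control){
```
Create an ArrayList that will hold all the data. The data distributions that the method takes as parameters can be seen as classes, so in our example we name the list classes:
```
        List<double[]> classes = new ArrayList<double[]>(); 
```

Add the data of the four classes to the ArrayList one by one:

        classes.add(calorie); 
        classes.add(fat); 
        classes.add(carb); 
        classes.add(control);

The F-value of the one-way ANOVA test can be found as follows:

        System.out.println(TestUtils.oneWayAnovaFValue(classes));

The p-value of the one-way ANOVA test can be found with the following line:

        System.out.println(TestUtils.oneWayAnovaPValue(classes));

Finally, to find out whether the difference between the means of the four given classes is significant, use the following piece of code:
```
        System.out.println(TestUtils.oneWayAnovaTest(classes, 0.05)); 
```
Close the method with a brace:

        }

The complete code for the one-way ANOVA test recipe is as follows: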

import java.util.ArrayList; 
import java.util.List; 
import org.apache.commons.math3.stat.inference.TestUtils; 
public class AnovaTest { 
   public static void main(String[] args){ 
      double[] calorie = {8, 9, 6, 7, 3}; 
      double[] fat = {2, 4, 3, 5, 1}; 
      double[] carb = {3, 5, 4, 2, 3}; 
      double[] control = {2, 2, -1, 0, 3}; 
      AnovaTest test = new AnovaTest(); 
      test.calculateAnova(calorie, fat, carb, control); 
   } 
   public void calculateAnova(double[] calorie, double[] fat, 
     double[] carb, double[] control){ 
      List<double[]> classes = new ArrayList<double[]>(); 
      classes.add(calorie); 
      classes.add(fat); 
      classes.add(carb); 
      classes.add(control); 

      System.out.println(TestUtils.oneWayAnovaFValue(classes)); 
      System.out.println(TestUtils.oneWayAnovaPValue(classes)); 
      System.out.println(TestUtils.oneWayAnovaTest(classes, 0.05)); 
   } 
}
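
The F-value compares the variation between the group means with the variation inside the groups. The plain-Java sketch below follows the usual textbook definition (mean square between divided by mean square within); it is not taken from the library, and AnovaSketch and fValue are illustrative names:

```java
import java.util.Arrays;
import java.util.List;

// One-way ANOVA F-value from the textbook definition (illustrative sketch).
class AnovaSketch {

    static double fValue(List<double[]> groups) {
        int total = 0;
        double grandSum = 0;
        for (double[] g : groups) {
            total += g.length;
            for (double v : g) {
                grandSum += v;
            }
        }
        double grandMean = grandSum / total;

        double ssBetween = 0, ssWithin = 0;
        for (double[] g : groups) {
            double mean = 0;
            for (double v : g) {
                mean += v / g.length;
            }
            // Variation of this group's mean around the grand mean.
            ssBetween += g.length * (mean - grandMean) * (mean - grandMean);
            // Variation of the group's values around their own mean.
            for (double v : g) {
                ssWithin += (v - mean) * (v - mean);
            }
        }
        int dfBetween = groups.size() - 1;
        int dfWithin = total - groups.size();
        return (ssBetween / dfBetween) / (ssWithin / dfWithin);
    }

    public static void main(String[] args) {
        // Same four groups as in the recipe.
        List<double[]> classes = Arrays.asList(
                new double[] {8, 9, 6, 7, 3},
                new double[] {2, 4, 3, 5, 1},
                new double[] {3, 5, 4, 2, 3},
                new double[] {2, 2, -1, 0, 3});
        System.out.println(fValue(classes));
    }
}
```

For the recipe's data, the between-group sum of squares is 75.75 with 3 degrees of freedom and the within-group sum of squares is 47.2 with 16, giving an F-value of about 8.56.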

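Note

The p-values returned by the t-test, the chi-square test, and the ANOVA test are exact in the sense that they are based on numerical approximations of the t, chi-square, and F distributions in the distribution package.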

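Conducting a Kolmogorov-Smirnov test

The Kolmogorov-Smirnov test (or simply the KS test) is a test of equality for one-dimensional probability distributions that are continuous in nature. It is one of the common ways to determine whether two sets of data points differ significantly.

How to do it...

Create a method that takes two different data distributions. We will use the Kolmogorov-Smirnov test to see whether the difference between the two data distributions is significant:
```
        public void calculateKs(double[] x, double[] y){ 
```
A key statistic in the test is the D statistic. We keep it in a double variable, because we need it later to compute the p-value of the test:
```
        double d = TestUtils.kolmogorovSmirnovStatistic(x, y); 
```

To evaluate the null hypothesis that the two samples are drawn from the same distribution, use the following code, which prints the p-value of the test:

        System.out.println(TestUtils.kolmogorovSmirnovTest(x, y, 
          false));

Finally, the exact p-value for the D statistic and the two sample sizes can be obtained as follows, after which we close the method:

        System.out.println(TestUtils.exactP(d, x.length, y.length, 
          false));
        }

The complete code for the recipe is as follows: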

import org.apache.commons.math3.stat.inference.TestUtils; 

public class KSTest { 
   public static void main(String[] args){ 
      double[] x = {43, 21, 25, 42, 57, 59}; 
      double[] y = {99, 65, 79, 75, 87, 81}; 
      KSTest test = new KSTest(); 
      test.calculateKs(x, y); 
   } 
   public void calculateKs(double[] x, double[] y){ 
     double d = TestUtils.kolmogorovSmirnovStatistic(x, y); 
   System.out.println(TestUtils.kolmogorovSmirnovTest(x, y, false)); 
   System.out.println(TestUtils.exactP(d, x.length, y.length, 
     false)); 
   } 
}
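
The D statistic of the two-sample test is the largest vertical distance between the empirical distribution functions of the two samples. The sketch below computes it directly from that definition in plain Java; it illustrates the idea and is not the library's implementation. KsSketch and dStatistic are names invented for this example:

```java
import java.util.Arrays;

// Two-sample Kolmogorov-Smirnov D statistic from its definition (illustrative sketch).
class KsSketch {

    static double dStatistic(double[] x, double[] y) {
        double[] sx = x.clone();
        double[] sy = y.clone();
        Arrays.sort(sx);
        Arrays.sort(sy);
        int i = 0, j = 0;
        double d = 0;
        while (i < sx.length && j < sy.length) {
            // Step past every observation equal to the next smallest value, in both samples.
            double v = Math.min(sx[i], sy[j]);
            while (i < sx.length && sx[i] <= v) {
                i++;
            }
            while (j < sy.length && sy[j] <= v) {
                j++;
            }
            // Empirical distribution functions evaluated at v.
            double fx = (double) i / sx.length;
            double fy = (double) j / sy.length;
            d = Math.max(d, Math.abs(fx - fy));
        }
        return d;
    }

    public static void main(String[] args) {
        double[] x = {43, 21, 25, 42, 57, 59};
        double[] y = {99, 65, 79, 75, 87, 81};
        System.out.println(dStatistic(x, y));
    }
}
```

In the recipe's data, every value of x is smaller than every value of y, so the two empirical distribution functions are as far apart as they can be and D equals 1.0.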

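This concludes the recipes of this chapter. The Apache Commons Math library can carry out many other statistical analyses. For further uses of the library, refer to the Javadoc of the version used in this chapter, which can be found at http://commons.apache.org/proper/commons-math/javadocs/api-3.6.1/index.html.

Chapter 4. Learning from Data - Part 1

In this chapter, we will cover the following recipes:

Creating and saving an Attribute-Relation File Format file
Cross-validating a machine learning model
Classifying unseen test data
Classifying unseen test data with a filtered classifier
Generating linear regression models
Generating logistic regression models
Clustering data using the KMeans algorithm
Clustering data from classes
Learning association rules from data
Selecting features/attributes using the low-level method, the filtering method, and the meta-classifier method

Introduction

In this chapter and the next, we will cover recipes that use machine learning techniques to learn patterns from data. These patterns are at the center of at least three key machine learning tasks: classification, regression, and clustering. Classification is the task of predicting a value from a nominal class. In contrast to classification, regression models attempt to predict a value from a numeric class. Finally, clustering is the technique of grouping data points according to their proximity.

There are many Java-based tools, workbenches, libraries, and APIs that can be used for research and development in the areas of machine learning mentioned above. One of the most popular tools is the Waikato Environment for Knowledge Analysis (Weka), which is free software licensed under the GNU General Public License. It is written in Java and offers very good data preparation and filtering options, classical machine learning algorithms with customizable parameter settings, and powerful data visualization options. Besides its easy-to-use Java library, it also has a very handy Graphical User Interface (GUI) for non-Java users.

In this chapter, our focus will be on demonstrating how to carry out routine data science activities with Weka, such as preparing a dataset for the tool, generating models for different types of machine learning tasks, and evaluating model performance.

Note

Note that the code in the recipes of this chapter does not implement any exception handling, so the catch blocks are intentionally left empty. Exception handling is entirely up to the user and his or her needs.

Creating and saving an Attribute-Relation File Format (ARFF) file

Weka's native file format is called Attribute-Relation File Format (ARFF). An ARFF file has two logical parts. The first part is called the header, and the second part is called the data. The header has three physical parts that must be present in an ARFF file: the name of the relation, the attributes or features, and their data types and ranges. The data part has one physical part, which must also be present in order to generate a machine learning model. The header of an ARFF file looks as follows: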

   % 1. Title: Iris Plants Database 
   % 
   % 2. Sources: 
   %      (a) Creator: R.A. Fisher 
   %      (b) Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov) 
   %      (c) Date: July, 1988 
   % 
   @RELATION iris 

   @ATTRIBUTE sepallength     NUMERIC 
   @ATTRIBUTE sepalwidth      NUMERIC 
   @ATTRIBUTE petallength     NUMERIC 
   @ATTRIBUTE petalwidth      NUMERIC 
   @ATTRIBUTE class  {Iris-setosa,Iris-versicolor,Iris-virginica}

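Here, the lines starting with the % symbol are comments. The name of the relation is given by the keyword @RELATION. The next few lines, which start with the @ATTRIBUTE keyword, denote the features or attributes. In the example, the name of the relation is iris, and the dataset has five attributes: the first four are of numeric type, and the last one is the class of the data point, which is a nominal attribute with three class values.

The data section of an ARFF file looks as follows: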

   @DATA 
   5.1,3.5,1.4,0.2,Iris-setosa 
   4.9,3.0,1.4,0.2,Iris-setosa 
   4.7,3.2,1.3,0.2,Iris-setosa 
   4.6,3.1,1.5,0.2,Iris-setosa 
   5.0,3.6,1.4,0.2,Iris-setosa 
   5.4,3.9,1.7,0.4,Iris-setosa 
   4.6,3.4,1.4,0.3,Iris-setosa

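The example shows that the data section starts with the keyword @DATA and then contains comma-separated attribute values. The order of the comma-separated values must match the order of the attributes in the attribute section.

Note

The @RELATION, @ATTRIBUTE, and @DATA declarations are case-insensitive.

To learn more about the ARFF file format, the attribute types supported in Weka, and sparse ARFF files, refer to http://www.cs.waikato.ac.nz/ml/weka/arff.html.

To run the recipes in this chapter, we need the following:

To develop our code, we will use the Eclipse IDE, and to run all the code in this chapter successfully, we will add the Weka JAR file to a project. To do so, follow these steps:

To download Weka, go to http://www.cs.waikato.ac.nz/ml/weka/downloading.html, where you will find download options for Windows, Mac, and other operating systems such as Linux. Read through the options carefully and download the appropriate version.

Note

At the time of writing this book, 3.9.0 was the latest developer version. Since the author already had version 1.8 of the JVM installed on his 64-bit Windows machine, he chose to download the self-extracting executable for 64-bit Windows without a Java VM.
Once the download is complete, double-click the executable and follow the on-screen instructions. You need to install the full version of Weka.
Once the installation is done, do not run the software. Instead, go to the directory where it was installed and locate Weka's Java archive file (weka.jar). Add this file to your Eclipse project as an external library.

Tip

If you need to download an older version of Weka for some reason, all of them can be found at https://sourceforge.net/projects/weka/files/. Note that many methods in older versions may have been deprecated and are therefore no longer supported.

How to do it...

We will keep all our code in a main() method instead of creating a separate method. Therefore, create a class and a main method: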
```
        public class WekaArffTest { 
          public static void main(String[] args) throws Exception { 
```
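Note that the main method will contain code related to the Weka library and therefore throws exceptions.
Create two ArrayLists. The first ArrayList will hold the attributes, and the second will hold the class values. The generic type of the first ArrayList is therefore Attribute (a Weka class that models attributes), while the generic type of the second can be String, to represent the class labels:
```
        ArrayList<Attribute>      attributes; 
        ArrayList<String>      classVals; 
```
Next, create an Instances object. This object models the instances in the @DATA section of the ARFF file; each line in the @DATA section is one instance:
```
        Instances       data; 
```
Create a double array. This array will hold the values of the attributes:
```
        double[]        values; 
```
Now it is time to set up the attributes. We will create the @ATTRIBUTE section of the ARFF file. First, instantiate the attributes list:
```
        attributes = new ArrayList<Attribute>(); 
```
Next, we create a numeric attribute named age and add it to our attributes ArrayList:
```
        attributes.add(new Attribute("age")); 
```
We now create a string attribute named name and add it to our attributes ArrayList. Before that, however, we declare an ArrayList of String type and assign null to it. Passing this null list to the constructor of the Attribute class indicates that name is a string attribute rather than a nominal attribute such as the class attribute:
```
        ArrayList<String> empty = null; 
        attributes.add(new Attribute("name", empty)); 
```
Weka also supports date-type attributes. Next, we create a dob attribute to represent a date of birth:
```
        attributes.add(new Attribute("dob", "yyyy-MM-dd")); 
```

We then instantiate the class values ArrayList and create five class values: class1, class2, class3, class4, and class5: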

        classVals = new ArrayList<String>(); 
        for (int i = 0; i < 5; i++){ 
           classVals.add("class" + (i + 1)); 
        }

With these class values, we next create an attribute and add it to our attributes ArrayList:

```java
        Attribute classVal = new Attribute("class", classVals); 
        attributes.add(classVal); 
```

With this line of code, we have finished creating the @ATTRIBUTE section of the ARFF file. Next, we will populate the @DATA section.
First, we create an Instances object named MyRelation (this is the argument of the @RELATION section of our ARFF file) with all the attributes:

```java
        data = new Instances("MyRelation", attributes, 0); 
```

We use the double array created earlier to generate four values for our four attributes: the age, the name, the date of birth, and the class value (chosen arbitrarily, with no meaning in this example):

```java
        values = new double[data.numAttributes()];
        values[0] = 35; 
        values[1] = data.attribute(1).addStringValue("John Doe"); 
        values[2] = data.attribute(2).parseDate("1981-01-20"); 
        values[3] = classVals.indexOf("class3"); 
```

We then add these values to our data section:

```java
        data.add(new DenseInstance(1.0, values)); 
```

Similarly, we create a second instance for our data section as follows:

```java
        values = new double[data.numAttributes()];   
        values[0] = 30; 
        values[1] = data.attribute(1).addStringValue("Harry Potter"); 
        values[2] = data.attribute(2).parseDate("1986-07-05"); 
        values[3] = classVals.indexOf("class1"); 
        data.add(new DenseInstance(1.0, values)); 
```

If we want to save the ARFF file somewhere, we add the following code segment:

```java
        BufferedWriter writer = new BufferedWriter(new    
          FileWriter("c:/training.arff")); 
        writer.write(data.toString()); 
        writer.close(); 
```

The entire contents of the ARFF file we just created can also be displayed on the console:

```java
        System.out.println(data); 
```

At this point, close the method and the class:

           } 
        }

The complete code for the recipe is as follows:

import java.io.BufferedWriter; 
import java.io.FileWriter; 
import java.util.ArrayList; 

import weka.core.Attribute; 
import weka.core.DenseInstance; 
import weka.core.Instances; 

public class WekaArffTest { 
   public static void main(String[] args) throws Exception { 
      ArrayList<Attribute>      attributes; 
      ArrayList<String>      classVals; 
      Instances       data; 
      double[]        values; 

      // Set up attributes 
      attributes = new ArrayList<Attribute>(); 
      // Numeric attribute 
      attributes.add(new Attribute("age")); 
      // String attribute 
      ArrayList<String> empty = null; 
      attributes.add(new Attribute("name", empty)); 
      // Date attribute 
      attributes.add(new Attribute("dob", "yyyy-MM-dd")); 
      classVals = new ArrayList<String>(); 
      for (int i = 0; i < 5; i++){ 
         classVals.add("class" + (i + 1)); 
      } 
      Attribute classVal = new Attribute("class", classVals); 
      attributes.add(classVal); 

      // Create Instances object 
      data = new Instances("MyRelation", attributes, 0); 

      // Data fill up 
      // First instance 
      values = new double[data.numAttributes()]; 
      values[0] = 35; 
      values[1] = data.attribute(1).addStringValue("John Doe"); 
      values[2] = data.attribute(2).parseDate("1981-01-20"); 
      values[3] = classVals.indexOf("class3"); 

      // add 
      data.add(new DenseInstance(1.0, values)); 

      // Second instance 
      values = new double[data.numAttributes()];   
      values[0] = 30; 
      values[1] = data.attribute(1).addStringValue("Harry Potter"); 
      values[2] = data.attribute(2).parseDate("1986-07-05"); 
      values[3] = classVals.indexOf("class1"); 

      // add 
      data.add(new DenseInstance(1.0, values)); 

      //writing arff file to disk 
      BufferedWriter writer = new BufferedWriter(new 
        FileWriter("c:/training.arff")); 
      writer.write(data.toString()); 
      writer.close(); 
      // Output data 
      System.out.println(data); 
   } 
}

The output of the code is as follows:

@relation MyRelation 

@attribute age numeric 
@attribute name string 
@attribute dob date yyyy-MM-dd 
@attribute class {class1,class2,class3,class4,class5} 

@data 
35,'John Doe',1981-01-20,class3 
30,'Harry Potter',1986-07-05,class1

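Cross-validating a machine learning model

In this recipe, we will create four methods that do four different things: the first loads an ARFF file (assuming that the ARFF file has already been created and saved somewhere); the second reads the data from the ARFF file and generates a machine learning model (we have arbitrarily chosen a Naive Bayes model); the third saves the model using serialization; and the last evaluates the model on the ARFF file using 10-fold cross-validation.

How to do it...

Create two instance variables. The first will hold all the instances of the iris dataset. The iris ARFF dataset can be found in the data folder of your Weka installation directory. The second variable will be a NaiveBayes classifier:
```
        Instances iris = null; 
        NaiveBayes nb;  
```
Our first method loads the iris ARFF file using the DataSource class, reads its contents with the getDataSet() method of the class, and sets the position of the class attribute. If you open iris.arff in Notepad, you will see that the class attribute is the last attribute; this is a convention, not a mandatory rule. Hence, iris.setClassIndex(iris.numAttributes() - 1); is used to assign the last attribute as the class attribute. For any classification task in Weka, this is very important:
```
        public void loadArff(String arffInput){ 
          DataSource source = null; 
          try { 
            source = new DataSource(arffInput); 
            iris = source.getDataSet(); 
            if (iris.classIndex() == -1) 
              iris.setClassIndex(iris.numAttributes() - 1); 
          } catch (Exception e1) { 
          } 
        } 
```

Our next method generates a NaiveBayes classifier from the iris dataset, using the buildClassifier(dataset) method of the NaiveBayes class:

        public void generateModel(){ 
          nb = new NaiveBayes(); 
          try { 
            nb.buildClassifier(iris); 
          } catch (Exception e) { 
          } 
        }

Weka has a facility for saving any model generated with it. These saved models can be used later to classify unseen, unlabeled test data. We need Weka's SerializationHelper class, whose write method takes the path at which to save the model and the model itself as its arguments:
```
        public void saveModel(String modelPath){ 
          try { 
            weka.core.SerializationHelper.write(modelPath, nb); 
          } catch (Exception e) { 
          } 
        } 
```
Our final method evaluates the performance of the model by cross-validation on the iris dataset. We will use 10-fold cross-validation for this. This popular technique for evaluating model performance is very useful when we have only a small amount of data. However, it also has some limitations. A discussion of the pros and cons of the method is beyond the scope of this book. Interested readers can refer to https://en.wikipedia.org/wiki/Cross-validation_(statistics) for more details:

        public void crossValidate(){ 
          Evaluation eval = null; 
          try { 
            eval = new Evaluation(iris); 
            eval.crossValidateModel(nb, iris, 10, new Random(1)); 
            System.out.println(eval.toSummaryString()); 
          } catch (Exception e1) { 
          }   
        }

The line eval.crossValidateModel(nb, iris, 10, new Random(1)); takes the model and the dataset as its first two arguments. The third argument says that this is a 10-fold cross-validation. The last argument introduces randomization into the process, which is very important because in most cases the data instances in our dataset are not in random order.

The complete executable code for this recipe is as follows: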

import java.util.Random; 

import weka.classifiers.Evaluation; 
import weka.classifiers.bayes.NaiveBayes; 
import weka.core.Instances; 
import weka.core.converters.ConverterUtils.DataSource; 

public class WekaCVTest { 
   Instances iris = null; 
   NaiveBayes nb; 

   public void loadArff(String arffInput){ 
      DataSource source = null; 
      try { 
         source = new DataSource(arffInput); 
         iris = source.getDataSet(); 
         if (iris.classIndex() == -1) 
            iris.setClassIndex(iris.numAttributes() - 1); 
      } catch (Exception e1) { 
      } 
   } 

   public void generateModel(){ 
      nb = new NaiveBayes(); 
      try { 
         nb.buildClassifier(iris); 
      } catch (Exception e) { 

      } 
   } 

   public void saveModel(String modelPath){ 
      try { 
         weka.core.SerializationHelper.write(modelPath, nb); 
      } catch (Exception e) { 
      } 
   } 

   public void crossValidate(){ 
      Evaluation eval = null; 
      try { 
         eval = new Evaluation(iris); 
         eval.crossValidateModel(nb, iris, 10, new Random(1)); 
         System.out.println(eval.toSummaryString()); 
      } catch (Exception e1) { 
      }   
   } 
   public static void main(String[] args){ 
      WekaCVTest test = new WekaCVTest(); 
      test.loadArff("C:/Program Files/Weka-3-6/data/iris.arff"); 
      test.generateModel(); 
      test.saveModel("c:/nb.model"); 
      test.crossValidate(); 
   } 
}

The output of the code is as follows:

Correctly Classified Instances         144               96      % 
Incorrectly Classified Instances         6                4      % 
Kappa statistic                          0.94   
Mean absolute error                      0.0342 
Root mean squared error                  0.155 
Relative absolute error                  7.6997 % 
Root relative squared error             32.8794 % 
Total Number of Instances              150

Tip

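In this recipe, we saved a machine learning model. If you need to load a model, you need to know which learning algorithm (for example, Naive Bayes in this recipe) was used to generate it, so that you can load the model into the appropriate learning algorithm object. A model can be loaded using the following method: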

public void loadModel(String modelPath){
try {
nb = (NaiveBayes)
weka.core.SerializationHelper.read(modelPath);
} catch (Exception e) {
}
}

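Classifying unseen test data

The classic supervised machine learning classification task is to train a classifier on labeled training instances and to apply the classifier to unseen test instances. The key point to remember here is that the number of attributes in the training set, their types, their names, and their ranges of values (if they are regular nominal attributes or nominal class attributes) must be exactly the same as those in the testing dataset.

Getting ready

In Weka, there can be a key difference between the training and testing datasets. The @DATA section of a test ARFF file can look just like the @DATA section of a training ARFF file, with attribute values and class labels as follows: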

@DATA 
   5.1,3.5,1.4,0.2,Iris-setosa 
   4.9,3.0,1.4,0.2,Iris-setosa 
   4.7,3.2,1.3,0.2,Iris-setosa

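When a classifier is applied to labeled test data like this, it ignores the class labels while predicting the class of each instance. Note also that if your test data is labeled, you can compare the labels predicted by your classifier with the actual labels, which lets you generate evaluation metrics for the classifier. Most commonly, however, your test data will carry no class information, and the classifier will predict and assign class labels based on what it learned from the training data. The @DATA section of such a test dataset looks like the following, where the class labels are unknown and denoted by a question mark (?):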

@DATA 
   5.1,3.5,1.4,0.2,? 
   4.9,3.0,1.4,0.2,? 
   4.7,3.2,1.3,0.2,?

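Weka's data directory does not contain any such test file. You can therefore create your own test file modeled on the iris.arff file. Copy the following content into a text file using a text editor such as Notepad, and save it in your file system as iris-test.arff (say, in the C:/ drive):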

@RELATION iris-test 

@ATTRIBUTE sepallength  REAL 
@ATTRIBUTE sepalwidth   REAL 
@ATTRIBUTE petallength  REAL 
@ATTRIBUTE petalwidth   REAL 
@ATTRIBUTE class  {Iris-setosa,Iris-versicolor,Iris-virginica} 
@DATA 
3.1,1.2,1.2,0.5,? 
2.3,2.3,2.3,0.3,? 
4.2,4.4,2.1,0.2,? 
3.1,2.5,1.0,0.2,? 
2.8,1.6,2.0,0.2,? 
3.0,2.6,3.3,0.3,? 
4.5,2.0,3.4,0.1,? 
5.3,2.0,3.1,0.2,? 
3.2,1.3,2.1,0.3,? 
2.1,6.4,1.2,0.1,?

How to do it...

We will have the following instance variables:
```
        NaiveBayes nb; 
        Instances train, test, labeled; 
```
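To make this recipe a little more challenging, and to give ourselves a good learning experience, we will load a previously built and saved model and assign it to our NaiveBayes classifier. This classifier will then be applied to the unlabeled test instances. The test instances will be copied into a separate set of labeled instances. The class labels predicted by the classifier will be assigned to the corresponding labeled instances, leaving the test instances unchanged.

First, we will create a method to load a prebuilt and saved model. In fact, we can load the classifier that we built and saved in the previous recipe, Cross-validating a machine learning model: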

        public void loadModel(String modelPath){ 
          try { 
            nb = (NaiveBayes)  
              weka.core.SerializationHelper.read(modelPath); 
          } catch (Exception e) { 
          }  
        }

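Then, we need to read the training and testing datasets. As the training dataset, we will use the iris.arff file found in the data directory of the Weka installation in your file system. As the testing dataset, we will use the iris-test.arff file that we created earlier in this recipe: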
```
        public void loadDatasets(String training, String testing){ 
```
To read the training dataset, we have used Weka's DataSource class. The key advantage in using this class is that it can deal with all file types supported by Weka. However, Weka users can use Java's BufferedReader class also to read the contents of a dataset. In this recipe, just to introduce a new way to read datasets, we will be using the BufferedReader class instead of the DataSource class.

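We will treat the ARFF file as a regular file and point a BufferedReader at the training dataset. The reader is then passed to the constructor of Weka's Instances class to create the training instances. Finally, we set the last attribute as the class attribute of the dataset: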
```
        BufferedReader reader = null; 
          try { 
            reader = new BufferedReader(new FileReader(training)); 
            train = new Instances (reader); 
            train.setClassIndex(train.numAttributes() -1); 
          } catch (IOException e) { 
       } 
```

In a similar way, we will be reading the test dataset:

        try { 
          reader = new BufferedReader(new FileReader(testing)); 
          test = new Instances (reader); 
          test.setClassIndex(test.numAttributes() -1); 
        } catch (IOException e) { 
        }

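Note that you do not need to declare a new BufferedReader variable; the reader variable declared earlier is reused.

Finally, close the open BufferedReader object and end the method: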

        try { 
          reader.close(); 
        } catch (IOException e) { 
        } 
     }

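Our next method creates a NaiveBayes classifier from the training data and applies it to the unseen, unlabeled instances of our test dataset. The method also displays the probabilities of class values predicted by the NaiveBayes classifier: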
```
        public void classify(){ 
          try { 
            nb.buildClassifier(train); 
          } catch (Exception e) { 
          } 
          // classify() continues in the snippets that follow
```
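We will create labeled instances, which are a copy of the test instances. The labels predicted by the classifier built in the preceding step will be assigned to these instances, while the test instances remain unchanged: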
```
        labeled = new Instances(test); 
```

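Now, for each instance of the test dataset, we declare a class label, which is a double variable. The NaiveBayes classifier then applies its classifyInstance() method, which takes an instance as its argument. The returned value is stored in the class label variable, and that value is assigned as the class label of the corresponding instance in the labeled instances. In other words, the ? value of each test instance is replaced, in the labeled copy, by the value predicted by Naive Bayes: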

        for (int i = 0; i < test.numInstances(); i++) { 
          double clsLabel; 
          try { 
            clsLabel = nb.classifyInstance(test.instance(i)); 
            labeled.instance(i).setClassValue(clsLabel); 
            double[] predictionOutput = 
              nb.distributionForInstance(test.instance(i)); 
            double predictionProbability = predictionOutput[1]; 
            System.out.println(predictionProbability); 
          } catch (Exception e) { 
          } 
        }
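
The relationship between a class distribution and a single predicted label can be sketched in plain Java. The sketch assumes that the predicted class is the one with the largest probability in the distribution; the class ArgMaxSketch, its method, and the sample distribution are hypothetical and not part of Weka.

```java
// Hypothetical helper: shows how a class distribution maps to one predicted label.
final class ArgMaxSketch {

    // Returns the index of the largest probability in the distribution.
    static int mostProbableClass(double[] distribution) {
        int best = 0;
        for (int i = 1; i < distribution.length; i++) {
            if (distribution[i] > distribution[best]) {
                best = i;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        String[] classNames = {"Iris-setosa", "Iris-versicolor", "Iris-virginica"};
        // Made-up distribution for illustration; not taken from the recipe's output.
        double[] distribution = {0.02, 0.95, 0.03};
        int predicted = mostProbableClass(distribution);
        System.out.println(classNames[predicted] + " with probability " + distribution[predicted]);
    }
}
```

Printing the probability at the predicted index, rather than at a fixed index, shows how confident the classifier is in the label it actually assigned.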

Finally, we write the labeled test dataset (that is, labeled) to the file system:

public void writeArff(String outArff){ 
      BufferedWriter writer; 
      try { 
         writer = new BufferedWriter(new FileWriter(outArff)); 
         writer.write(labeled.toString()); 
         writer.close(); 
      } catch (IOException e) { 
      } 
}

The complete executable code for this recipe is as follows:

import java.io.BufferedReader; 
import java.io.BufferedWriter; 
import java.io.FileReader; 
import java.io.FileWriter; 
import java.io.IOException; 

import weka.classifiers.bayes.NaiveBayes; 
import weka.core.Instances; 

public class WekaTrainTest { 
   NaiveBayes nb; 
   Instances train, test, labeled; 

   public void loadModel(String modelPath){ 
      try { 
         nb = (NaiveBayes) 
           weka.core.SerializationHelper.read(modelPath); 
      } catch (Exception e) { 
      } 
   } 

   public void loadDatasets(String training, String testing){ 
      BufferedReader reader = null; 
      try { 
         reader = new BufferedReader(new FileReader(training)); 
         train = new Instances (reader); 
         train.setClassIndex(train.numAttributes() -1); 
      } catch (IOException e) { 
      } 

      try { 
         reader = new BufferedReader(new FileReader(testing)); 
         test = new Instances (reader); 
         test.setClassIndex(test.numAttributes() -1); 
      } catch (IOException e) { 
      } 

      try { 
         reader.close(); 
      } catch (IOException e) { 
      } 
   } 

   public void classify(){ 
      try { 
         nb.buildClassifier(train); 
      } catch (Exception e) { 
      } 

      labeled = new Instances(test); 

      for (int i = 0; i < test.numInstances(); i++) { 
         double clsLabel; 
         try { 
            clsLabel = nb.classifyInstance(test.instance(i)); 
            labeled.instance(i).setClassValue(clsLabel); 
            double[] predictionOutput = 
              nb.distributionForInstance(test.instance(i)); 
            double predictionProbability = predictionOutput[1]; 
            System.out.println(predictionProbability); 
         } catch (Exception e) { 
         } 
      } 
   } 

   public void writeArff(String outArff){ 
      BufferedWriter writer; 
      try { 
         writer = new BufferedWriter(new FileWriter(outArff)); 
         writer.write(labeled.toString()); 
         writer.close(); 
      } catch (IOException e) { 
      } 
   } 

   public static void main(String[] args) throws Exception{ 
      WekaTrainTest test = new WekaTrainTest(); 
      test.loadModel("path to your Naive Bayes Model"); 
      test.loadDatasets("path to iris.arff dataset", 
        "path to iris-test.arff dataset"); 
      test.classify(); 
      test.writeArff("path to your output ARFF file"); 
   } 
}

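On the console, you will see the probability values predicted by our model. Because the code prints predictionOutput[1], each value is the probability of the class at index 1, Iris-versicolor: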

5.032582653870928E-13 
2.1050052853672135E-4 
5.177104804026096E-16 
1.2459904922893976E-16 
3.1771015903129274E-10 
0.9999993509430146 
0.999999944638627 
0.9999999844862647 
3.449759371835354E-8 
4.0178483420981394E-77

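If you open the ARFF file generated by the code, which now contains class values for the previously unknown instances, you should see something like the following: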

@relation iris-test 

@attribute sepallength numeric 
@attribute sepalwidth numeric 
@attribute petallength numeric 
@attribute petalwidth numeric 
@attribute class {Iris-setosa,Iris-versicolor,Iris-virginica} 

@data 
3.1,1.2,1.2,0.5,Iris-setosa 
2.3,2.3,2.3,0.3,Iris-setosa 
4.2,4.4,2.1,0.2,Iris-setosa 
3.1,2.5,1,0.2,Iris-setosa 
2.8,1.6,2,0.2,Iris-setosa 
3,2.6,3.3,0.3,Iris-versicolor 
4.5,2,3.4,0.1,Iris-versicolor 
5.3,2,3.1,0.2,Iris-versicolor 
3.2,1.3,2.1,0.3,Iris-setosa 
2.1,6.4,1.2,0.1,Iris-setosa

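Classifying unseen test data with a filtered classifier

Many times, you will need to use a filter before you develop a classifier. A filter can be used to remove, transform, discretize, and add attributes, remove misclassified instances, randomize or normalize instances, and so on. The usual approach is to use Weka's Filter class and then perform a series of filtering steps with its methods. In addition, Weka has a class named FilteredClassifier, which runs an arbitrary classifier on data that has been passed through an arbitrary filter.

In this recipe, we will see how to use a filter and a classifier together to classify unseen test examples.

How to do it...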

This time, we will be using a Random Forest classifier. As our dataset, we will be using weather.nominal.arff that can be found in the Data directory of the installed Weka folder in your file system.

The following two will be our instance variables:
```
        Instances weather = null; 
        RandomForest rf; 
```

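Next, we will have a method that loads our dataset. We will pass the path of the weather.nominal.arff file from our driver method to this method. Using Weka's DataSource class, we will read the data of the weather.nominal.arff file and set the last attribute of the dataset as the class attribute: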

        public void loadArff(String arffInput){ 
          DataSource source = null; 
          try { 
            source = new DataSource(arffInput); 
            weather = source.getDataSet(); 
            weather.setClassIndex(weather.numAttributes() - 1); 
          } catch (Exception e1) { 
          } 
        }

Next, we will create the core method of this recipe:

        public void buildFilteredClassifier(){

To build this method, we will first create a Random Forest classifier:
```
        rf = new RandomForest(); 
```
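We will create a filter that removes a particular attribute from the weather.nominal.arff file. To do so, we will use Weka's Remove class. The following code creates a filter that removes the first attribute of the dataset: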
```
        Remove rm = new Remove(); 
        rm.setAttributeIndices("1"); 
```
In the next few lines of code, create a FilteredClassifier, add the filter created in the previous step, and add the RandomForest classifier:
```
        FilteredClassifier fc = new FilteredClassifier(); 
        fc.setFilter(rm); 
        fc.setClassifier(rf); 
```
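Using the filtered classifier, we can build a Random Forest classifier from the nominal weather dataset. Then, for each instance of the nominal weather dataset, the classifier predicts the class value. In the try block, we print the actual and predicted values of each instance: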

        try{ 
          fc.buildClassifier(weather); 
          for (int i = 0; i < weather.numInstances(); i++){ 
             double pred = fc.classifyInstance(weather.instance(i)); 
             System.out.print("given value: " + 
                weather.classAttribute().value((int) 
                weather.instance(i).classValue())); 
             System.out.println("---predicted value: " +
                weather.classAttribute().value((int) pred)); 
          } 
        } catch (Exception e) { 
        } 
        }

The complete code for the recipe is as follows:

import weka.classifiers.meta.FilteredClassifier; 
import weka.classifiers.trees.RandomForest; 
import weka.core.Instances; 
import weka.core.converters.ConverterUtils.DataSource; 
import weka.filters.unsupervised.attribute.Remove; 

public class WekaFilteredClassifierTest { 
   Instances weather = null; 
   RandomForest rf; 

   public void loadArff(String arffInput){ 
      DataSource source = null; 
      try { 
         source = new DataSource(arffInput); 
         weather = source.getDataSet(); 
         weather.setClassIndex(weather.numAttributes() - 1); 
      } catch (Exception e1) { 
      } 
   } 

   public void buildFilteredClassifier(){ 
      rf = new RandomForest(); 
      Remove rm = new Remove(); 
      rm.setAttributeIndices("1"); 
      FilteredClassifier fc = new FilteredClassifier(); 
      fc.setFilter(rm); 
      fc.setClassifier(rf); 
      try{ 
         fc.buildClassifier(weather); 
         for (int i = 0; i < weather.numInstances(); i++){ 
            double pred = fc.classifyInstance(weather.instance(i)); 
            System.out.print("given value: " + 
              weather.classAttribute().value((int) 
                weather.instance(i).classValue())); 
            System.out.println("---predicted value: " +
              weather.classAttribute().value((int) pred)); 
         } 
      } catch (Exception e) { 
      } 
   } 

   public static void main(String[] args){ 
      WekaFilteredClassifierTest test = new 
        WekaFilteredClassifierTest(); 
      test.loadArff(
        "C:/Program Files/Weka-3-6/data/weather.nominal.arff"); 
      test.buildFilteredClassifier(); 
   } 
}

The output of the code is as follows:

given value: no---predicted value: yes 
given value: no---predicted value: no 
given value: yes---predicted value: yes 
given value: yes---predicted value: yes 
given value: yes---predicted value: yes 
given value: no---predicted value: yes 
given value: yes---predicted value: yes 
given value: no---predicted value: yes 
given value: yes---predicted value: yes 
given value: yes---predicted value: yes 
given value: yes---predicted value: yes 
given value: yes---predicted value: yes 
given value: yes---predicted value: yes 
given value: no---predicted value: yes

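Generating a linear regression model

Most linear regression modeling follows a general pattern: there are many independent variables that collectively produce a result, which is the dependent variable. For instance, we can build a regression model to predict the price of a house from its attributes/features (mostly numeric, real values), such as its size in square feet, number of bedrooms, number of bathrooms, importance of its location, and so on.

In this recipe, we will use Weka's LinearRegression classifier to generate a regression model.

How to do it...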

In this recipe, the linear regression model we will be creating is based on the cpu.arff dataset, which can be found in the data directory of the Weka installation directory.

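Our code will have two instance variables: the first will hold the data instances of the cpu.arff file, and the second will be our linear regression classifier: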
```
        Instances cpu = null; 
        LinearRegression lReg ; 
```

Next, we will create a method to load the ARFF file and assign its last attribute as the class attribute:

        public void loadArff(String arffInput){ 
          DataSource source = null; 
          try { 
            source = new DataSource(arffInput); 
            cpu = source.getDataSet(); 
            cpu.setClassIndex(cpu.numAttributes() - 1); 
          } catch (Exception e1) { 
          } 
        }

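We will create a method to build the linear regression model. To do so, we simply call the buildClassifier() method of our linear regression variable. The model can be passed directly to System.out.println():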

         public void buildRegression(){
           lReg = new LinearRegression(); 
           try { 
             lReg.buildClassifier(cpu);
           } catch (Exception e) {
           } 
           System.out.println(lReg);
         }

The complete code for the recipe is as follows:

import weka.classifiers.functions.LinearRegression; 
import weka.core.Instances; 
import weka.core.converters.ConverterUtils.DataSource; 

public class WekaLinearRegressionTest { 
   Instances cpu = null; 
   LinearRegression lReg ; 

   public void loadArff(String arffInput){ 
      DataSource source = null; 
      try { 
         source = new DataSource(arffInput); 
         cpu = source.getDataSet(); 
         cpu.setClassIndex(cpu.numAttributes() - 1); 
      } catch (Exception e1) { 
      } 
   } 

   public void buildRegression(){    
      lReg = new LinearRegression(); 
      try { 
         lReg.buildClassifier(cpu); 
      } catch (Exception e) { 
      } 
      System.out.println(lReg); 
   } 

   public static void main(String[] args) throws Exception{ 
      WekaLinearRegressionTest test = new WekaLinearRegressionTest(); 
      test.loadArff("path to the cpu.arff file"); 
      test.buildRegression(); 
   } 
}

The output of the code is as follows:

Linear Regression Model 

class = 

      0.0491 * MYCT + 
      0.0152 * MMIN + 
      0.0056 * MMAX + 
      0.6298 * CACH + 
      1.4599 * CHMAX + 
    -56.075
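
A printed model of this form is a weighted sum of the attributes plus an intercept. As a rough sketch of how it could be applied by hand, the plain-Java class below copies the coefficients from the preceding output; the class LinearModelSketch and the sample input values are assumptions made for illustration only.

```java
// Hypothetical helper: evaluates the printed linear model by hand.
final class LinearModelSketch {
    // Coefficients copied from the printed model, in the order MYCT, MMIN, MMAX, CACH, CHMAX.
    static final double[] COEFFICIENTS = {0.0491, 0.0152, 0.0056, 0.6298, 1.4599};
    static final double INTERCEPT = -56.075;

    // Computes intercept + sum(coefficient * feature value).
    static double predict(double[] features) {
        double result = INTERCEPT;
        for (int i = 0; i < COEFFICIENTS.length; i++) {
            result += COEFFICIENTS[i] * features[i];
        }
        return result;
    }

    public static void main(String[] args) {
        // Illustrative input values (an assumption, not taken from the recipe).
        double[] machine = {125, 256, 6000, 256, 128};
        System.out.printf("Predicted class value: %.4f%n", predict(machine));
    }
}
```

Attributes that do not appear in the printed equation simply do not contribute to the prediction.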

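Generating a logistic regression model

Weka has a class named Logistic that can be used to build and use a multinomial logistic regression model with a ridge estimator. Although the original logistic regression does not deal with instance weights, the algorithm in Weka has been modified to handle them.

In this recipe, we will use Weka to generate a logistic regression model on the iris dataset.

How to do it...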

We will be generating a logistic regression model from the iris dataset, which can be found in the data directory in the installed folder of Weka.

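Our code will have two instance variables: one to hold the data instances of the iris dataset and the other to be our logistic regression classifier: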
```
        Instances iris = null; 
        Logistic logReg ; 
```

We will use a method to load and read the dataset and to assign its class attribute (the last attribute of the iris.arff file):

        public void loadArff(String arffInput){ 
          DataSource source = null; 
          try { 
            source = new DataSource(arffInput); 
            iris = source.getDataSet(); 
            iris.setClassIndex(iris.numAttributes() - 1); 
          } catch (Exception e1) { 
          } 
        }

Next, we will create the most important method of the recipe, which builds a logistic regression classifier from the iris dataset:

        public void buildRegression(){    
          logReg = new Logistic(); 

          try { 
            logReg.buildClassifier(iris); 
          } catch (Exception e) { 
          } 
          System.out.println(logReg); 
        }

The complete executable code for this recipe is as follows:

import weka.classifiers.functions.Logistic; 
import weka.core.Instances; 
import weka.core.converters.ConverterUtils.DataSource; 

public class WekaLogisticRegressionTest { 
   Instances iris = null; 
   Logistic logReg ; 

   public void loadArff(String arffInput){ 
      DataSource source = null; 
      try { 
         source = new DataSource(arffInput); 
         iris = source.getDataSet(); 
         iris.setClassIndex(iris.numAttributes() - 1); 
      } catch (Exception e1) { 
      } 
   } 

   public void buildRegression(){    
      logReg = new Logistic(); 

      try { 
         logReg.buildClassifier(iris); 
      } catch (Exception e) { 
      } 
      System.out.println(logReg); 
   } 

   public static void main(String[] args) throws Exception{ 
      WekaLogisticRegressionTest test = new 
        WekaLogisticRegressionTest(); 
      test.loadArff("path to the iris.arff file "); 
      test.buildRegression(); 
   } 
}

The output of the code is as follows:

Logistic Regression with ridge parameter of 1.0E-8 
Coefficients... 
                         Class 
Variable           Iris-setosa  Iris-versicolor 
=============================================== 
sepallength            21.8065           2.4652 
sepalwidth              4.5648           6.6809 
petallength           -26.3083          -9.4293 
petalwidth             -43.887         -18.2859 
Intercept               8.1743           42.637 

Odds Ratios... 
                         Class 
Variable           Iris-setosa  Iris-versicolor 
=============================================== 
sepallength    2954196659.8892          11.7653 
sepalwidth             96.0426         797.0304 
petallength                  0           0.0001 
petalwidth                   0                0
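
As a rough sketch only, the coefficient table can be read as a standard multinomial logit in which the class without a column (here Iris-virginica) acts as the reference class with a score of zero. That reading is an assumption rather than something the recipe states, and the class MultinomialLogitSketch and the sample flower are hypothetical.

```java
// Hypothetical helper: turns the printed coefficients into class probabilities,
// assuming a multinomial logit with the last class as the reference class.
final class MultinomialLogitSketch {
    // Rows: Iris-setosa, Iris-versicolor.
    // Columns: sepallength, sepalwidth, petallength, petalwidth.
    static final double[][] COEFFICIENTS = {
        {21.8065, 4.5648, -26.3083, -43.887},
        {2.4652, 6.6809, -9.4293, -18.2859}
    };
    static final double[] INTERCEPTS = {8.1743, 42.637};

    // Returns probabilities for the listed classes followed by the reference class.
    static double[] probabilities(double[] x) {
        int k = COEFFICIENTS.length;
        double[] scores = new double[k + 1]; // the reference class keeps score 0
        double max = 0.0;
        for (int c = 0; c < k; c++) {
            double z = INTERCEPTS[c];
            for (int j = 0; j < x.length; j++) {
                z += COEFFICIENTS[c][j] * x[j];
            }
            scores[c] = z;
            max = Math.max(max, z);
        }
        double sum = 0.0;
        double[] probs = new double[k + 1];
        for (int c = 0; c <= k; c++) {
            probs[c] = Math.exp(scores[c] - max); // subtract max for numerical stability
            sum += probs[c];
        }
        for (int c = 0; c <= k; c++) {
            probs[c] /= sum;
        }
        return probs;
    }

    public static void main(String[] args) {
        // Sample measurements chosen for illustration.
        double[] flower = {5.1, 3.5, 1.4, 0.2};
        double[] probs = probabilities(flower);
        System.out.println("P(Iris-setosa)     = " + probs[0]);
        System.out.println("P(Iris-versicolor) = " + probs[1]);
        System.out.println("P(Iris-virginica)  = " + probs[2]);
    }
}
```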

Tip

Interpreting the results of this recipe is beyond the scope of this book. Interested readers can see a Stack Overflow discussion here: http://stackoverflow.com/questions/19136213/how-to-interpret-weka-logistic-regression-output

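Clustering data points using the KMeans algorithm

In this recipe, we will use the KMeans algorithm to cluster, or group, the data points of a dataset.

How to do it...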

We will be using the cpu dataset to cluster its data points based on a simple KMeans algorithm. The cpu dataset can be found in the data directory of the installed folder in the Weka directory.

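We will have two instance variables, as in the previous recipes. The first will hold the data points of the cpu dataset, and the second will be our simple KMeans clusterer: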
```
        Instances cpu = null; 
        SimpleKMeans kmeans; 
```

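Then, we will create a method to load the cpu dataset and read its contents. Note that because clustering is an unsupervised method, we do not need to specify the class attribute of the dataset: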

        public void loadArff(String arffInput){ 
          DataSource source = null; 
          try { 
            source = new DataSource(arffInput); 
            cpu = source.getDataSet(); 
          } catch (Exception e1) { 
         } 
        }

Next, we will create our method to develop the clusterer:
```
        public void clusterData(){ 
```
We instantiate the clusterer and set its seed to 10. The seed, an integer value, is used for random number generation:
```
        kmeans = new SimpleKMeans(); 
        kmeans.setSeed(10); 
```
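Then, we tell the clusterer to preserve the order of the data instances. If you do not need to maintain the order of instances in the dataset, you can pass false to the setPreserveInstancesOrder() method; note, however, that the getAssignments() call used in the next step relies on the order being preserved. We also set the number of clusters to 10. Finally, we build the clusterer from the cpu dataset: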
```
        try { 
          kmeans.setPreserveInstancesOrder(true); 
          kmeans.setNumClusters(10); 
          kmeans.buildClusterer(cpu); 
```
Next, we use a for loop to obtain each instance and the cluster number assigned to it by the simple KMeans algorithm:

        int[] assignments = kmeans.getAssignments(); 
          int i = 0; 
          for(int clusterNum : assignments) { 
             System.out.printf("Instance %d -> Cluster %d\n", i,  
               clusterNum); 
             i++; 
          } 
        } catch (Exception e1) { 
        }
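
The cluster number printed for each instance reflects the assignment step of k-means, in which a point joins the cluster whose centroid is closest. The plain-Java sketch below illustrates that step with Euclidean distance; the class NearestCentroidSketch, the centroids, and the points are made up for illustration and are not part of Weka.

```java
// Hypothetical helper: the assignment step of k-means in plain Java.
final class NearestCentroidSketch {

    // Squared Euclidean distance between two points of equal dimension.
    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sum += diff * diff;
        }
        return sum;
    }

    // Returns the index of the centroid closest to the given point.
    static int nearestCentroid(double[] point, double[][] centroids) {
        int best = 0;
        double bestDistance = squaredDistance(point, centroids[0]);
        for (int c = 1; c < centroids.length; c++) {
            double distance = squaredDistance(point, centroids[c]);
            if (distance < bestDistance) {
                best = c;
                bestDistance = distance;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Made-up centroids and points for illustration.
        double[][] centroids = {{0.0, 0.0}, {10.0, 10.0}};
        double[][] points = {{1.0, 1.0}, {9.0, 8.0}, {4.0, 3.0}};
        for (int i = 0; i < points.length; i++) {
            System.out.printf("Instance %d -> Cluster %d%n", i, nearestCentroid(points[i], centroids));
        }
    }
}
```

A full k-means run alternates this assignment step with recomputing each centroid as the mean of its assigned points until the assignments stop changing.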

The complete code for the recipe is as follows:

import weka.clusterers.SimpleKMeans; 
import weka.core.Instances; 
import weka.core.converters.ConverterUtils.DataSource; 

public class WekaClusterTest { 
   Instances cpu = null; 
   SimpleKMeans kmeans; 

   public void loadArff(String arffInput){ 
      DataSource source = null; 
      try { 
         source = new DataSource(arffInput); 
         cpu = source.getDataSet(); 
      } catch (Exception e1) { 
      } 
   } 

   public void clusterData(){  
      kmeans = new SimpleKMeans(); 
      kmeans.setSeed(10); 
      try { 
         kmeans.setPreserveInstancesOrder(true); 
         kmeans.setNumClusters(10); 
         kmeans.buildClusterer(cpu); 
         int[] assignments = kmeans.getAssignments(); 
         int i = 0; 
         for(int clusterNum : assignments) { 
            System.out.printf("Instance %d -> Cluster %d\n", i, 
              clusterNum); 
            i++; 
         } 
      } catch (Exception e1) { 
      } 
   } 

   public static void main(String[] args) throws Exception{ 
      WekaClusterTest test = new WekaClusterTest(); 
      test.loadArff("path to cpu.arff file"); 
      test.clusterData(); 
   } 
}

The cpu.arff file has 209 data instances. The output for the first 10 is as follows:

Instance 0 -> Cluster 7 
Instance 1 -> Cluster 5 
Instance 2 -> Cluster 5 
Instance 3 -> Cluster 5 
Instance 4 -> Cluster 1 
Instance 5 -> Cluster 5 
Instance 6 -> Cluster 5 
Instance 7 -> Cluster 5 
Instance 8 -> Cluster 4 
Instance 9 -> Cluster 4

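Clustering data from classes

If you have a dataset with classes, which is an unusual situation for unsupervised learning, Weka has a method called classes to clusters. In this method, Weka first ignores the class attribute and generates the clustering. Then, during the test phase, it assigns classes to the clusters based on the majority value of the class attribute within each cluster. We will cover this method in this recipe.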
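
The majority-value idea described above can be sketched in plain Java as follows. This is a simplified illustration under the assumption that each cluster simply takes its most frequent class; the class MajorityClassSketch and the sample data are hypothetical, and Weka's own evaluation may resolve the mapping differently.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper: assigns each cluster the class label seen most often in it.
final class MajorityClassSketch {

    static Map<Integer, String> majorityClassPerCluster(int[] clusters, String[] labels) {
        // Count how often each class label occurs within each cluster.
        Map<Integer, Map<String, Integer>> counts = new TreeMap<>();
        for (int i = 0; i < clusters.length; i++) {
            counts.computeIfAbsent(clusters[i], c -> new HashMap<>())
                  .merge(labels[i], 1, Integer::sum);
        }
        // Pick the most frequent label per cluster (ties are broken arbitrarily).
        Map<Integer, String> result = new TreeMap<>();
        for (Map.Entry<Integer, Map<String, Integer>> cluster : counts.entrySet()) {
            String best = null;
            int bestCount = -1;
            for (Map.Entry<String, Integer> label : cluster.getValue().entrySet()) {
                if (label.getValue() > bestCount) {
                    best = label.getKey();
                    bestCount = label.getValue();
                }
            }
            result.put(cluster.getKey(), best);
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up cluster assignments and class labels for five instances.
        int[] clusters = {0, 0, 0, 1, 1};
        String[] labels = {"yes", "yes", "no", "no", "no"};
        System.out.println(majorityClassPerCluster(clusters, labels));
    }
}
```

Instances whose class differs from the class assigned to their cluster count as errors when the clustering is evaluated against the classes.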

How to do it...

In this recipe, we will use a dataset with class values for instances. We will use a weather.nominal.arff file, which can be found in the data directory of the installed Weka directory.

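In our code, we will have two instance variables. The first will hold the instances of our dataset, and the second will be an Expectation-Maximization (EM) clusterer: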
```
        Instances weather = null; 
        EM clusterer; 
```

Next, we will load our dataset, read it, and set the last index as its class index:

       public void loadArff(String arffInput){ 
         DataSource source = null; 
         try { 
           source = new DataSource(arffInput); 
           weather = source.getDataSet(); 
           weather.setClassIndex(weather.numAttributes() - 1); 
         } catch (Exception e1) { 
         } 
       }

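Then, we will create the key method of this recipe, which generates clusters from classes:
```
        public void generateClassToCluster(){ 
```
To do so, we will first create a Remove filter. This filter will be used to remove the class attribute from the dataset, as Weka ignores this attribute during clustering: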
```
        Remove filter = new Remove(); 
        filter.setAttributeIndices("" + (weather.classIndex() + 1)); 
```

Then, we will apply the filter to our dataset:

        try { 
          filter.setInputFormat(weather);

We will obtain the dataset without the class variable and can create an Expectation-Maximization clusterer from the data:

         Instances dataClusterer = Filter.useFilter(weather, filter); 
         clusterer = new EM(); 
         clusterer.buildClusterer(dataClusterer);

Then, we will evaluate the clusters using the classes of the original dataset:

         ClusterEvaluation eval = new ClusterEvaluation(); 
         eval.setClusterer(clusterer); 
         eval.evaluateClusterer(weather);

Finally, we will print the clustering results on the console:

         System.out.println(eval.clusterResultsToString()); 
         } catch (Exception e) { 
      } 
   }

The complete code for the recipe is as follows:

import weka.clusterers.ClusterEvaluation; 
import weka.clusterers.EM; 
import weka.core.Instances; 
import weka.core.converters.ConverterUtils.DataSource; 
import weka.filters.Filter; 
import weka.filters.unsupervised.attribute.Remove; 

public class WekaClassesToClusterTest { 
   Instances weather = null; 
   EM clusterer; 

   public void loadArff(String arffInput){ 
      DataSource source = null; 
      try { 
         source = new DataSource(arffInput); 
         weather = source.getDataSet(); 
         weather.setClassIndex(weather.numAttributes() - 1); 
      } catch (Exception e1) { 
      } 
   } 

   public void generateClassToCluster(){ 
      Remove filter = new Remove(); 
      filter.setAttributeIndices("" + (weather.classIndex() + 1)); 
      try { 
         filter.setInputFormat(weather); 
         Instances dataClusterer = Filter.useFilter(weather, filter); 
         clusterer = new EM(); 
         clusterer.buildClusterer(dataClusterer); 
         ClusterEvaluation eval = new ClusterEvaluation(); 
         eval.setClusterer(clusterer); 
         eval.evaluateClusterer(weather); 

         System.out.println(eval.clusterResultsToString()); 
      } catch (Exception e) { 
      } 
   } 

   public static void main(String[] args){ 
      WekaClassesToClusterTest test = new WekaClassesToClusterTest(); 
      test.loadArff("path to weather.nominal.arff file"); 
      test.generateClassToCluster(); 
   } 
}

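Learning association rules from data

Association rule learning is a machine learning technique for discovering associations and rules between the various features or variables in a dataset. A similar technique in statistics is known as correlation, which is covered in Chapter 3, Analyzing Data Statistically, but association rule learning is more useful in decision making. For instance, by analyzing big supermarket data, a machine learning learner can discover that if a person buys onions, tomatoes, chicken patties, and mayonnaise, she will most likely buy buns (to make burgers).

In this recipe, we will see how to use Weka to learn association rules from datasets.

Getting ready

We will use the supermarket dataset, which can be found in the data directory of our installed Weka directory. The dataset has 4,627 instances, each with 217 binary attributes. The attributes have a value of true or missing. There is a nominal class attribute named total, whose value is low if the transaction was less than $100 and high if the transaction was more than $100.

How to do it...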

Declare two instance variables to hold the data of the supermarket dataset and to represent the Apriori learner:
```
        Instances superMarket = null; 
        Apriori apriori; 
```

Create a method to load the dataset and read it. For this recipe, you do not need to set the class attribute of the dataset:

        public void loadArff(String arffInput){ 
          DataSource source = null; 
          try { 
            source = new DataSource(arffInput); 
            superMarket = source.getDataSet(); 
          } catch (Exception e1) { 
          } 
        }

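Create a method that instantiates the Apriori learner. The method then builds associations from the given dataset. Finally, display the learner on the console: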

        public void generateRule(){ 
          apriori = new Apriori(); 
            try { 
              apriori.buildAssociations(superMarket); 
              System.out.println(apriori); 
            } catch (Exception e) { 
            } 
        }

Tip

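By default, the Apriori learner produces 10 rules. If you need more rules, add the following line of code before building the associations, where n is an integer denoting the number of rules to learn: apriori.setNumRules(n);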

The complete code for the recipe is as follows:

import weka.associations.Apriori; 
import weka.core.Instances; 
import weka.core.converters.ConverterUtils.DataSource; 

public class WekaAssociationRuleTest { 
   Instances superMarket = null; 
   Apriori apriori; 
   public void loadArff(String arffInput){ 
      DataSource source = null; 
      try { 
         source = new DataSource(arffInput); 
         superMarket = source.getDataSet(); 
      } catch (Exception e1) { 
      } 
   } 
   public void generateRule(){ 
      apriori = new Apriori(); 
      try { 
         apriori.buildAssociations(superMarket); 
         System.out.println(apriori); 
      } catch (Exception e) { 
      } 
   } 
   public static void main(String args[]){ 
      WekaAssociationRuleTest test = new WekaAssociationRuleTest(); 
      test.loadArff("path to supermarket.arff file"); 
      test.generateRule(); 
   } 
}

The rules found by the Apriori learner are as follows:

 1. biscuits=t frozen foods=t fruit=t total=high 788 ==> bread and cake=t 723    <conf:(0.92)> lift:(1.27) lev:(0.03) [155] conv:(3.35) 
 2. baking needs=t biscuits=t fruit=t total=high 760 ==> bread and cake=t 696    <conf:(0.92)> lift:(1.27) lev:(0.03) [149] conv:(3.28) 
 3. baking needs=t frozen foods=t fruit=t total=high 770 ==> bread and cake=t 705    <conf:(0.92)> lift:(1.27) lev:(0.03) [150] conv:(3.27) 
 4. biscuits=t fruit=t vegetables=t total=high 815 ==> bread and cake=t 746    <conf:(0.92)> lift:(1.27) lev:(0.03) [159] conv:(3.26) 
 5. party snack foods=t fruit=t total=high 854 ==> bread and cake=t 779    <conf:(0.91)> lift:(1.27) lev:(0.04) [164] conv:(3.15) 
 6. biscuits=t frozen foods=t vegetables=t total=high 797 ==> bread and cake=t 725    <conf:(0.91)> lift:(1.26) lev:(0.03) [151] conv:(3.06) 
 7. baking needs=t biscuits=t vegetables=t total=high 772 ==> bread and cake=t 701    <conf:(0.91)> lift:(1.26) lev:(0.03) [145] conv:(3.01) 
 8. biscuits=t fruit=t total=high 954 ==> bread and cake=t 866    <conf:(0.91)> lift:(1.26) lev:(0.04) [179] conv:(3) 
 9. frozen foods=t fruit=t vegetables=t total=high 834 ==> bread and cake=t 757    <conf:(0.91)> lift:(1.26) lev:(0.03) [156] conv:(3) 
10. frozen foods=t fruit=t total=high 969 ==> bread and cake=t 877    <conf:(0.91)> lift:(1.26) lev:(0.04) [179] conv:(2.92)
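
In each rule, the number before ==> counts the instances that match the premise, and the number after the consequence counts the instances that match both sides. The plain-Java sketch below shows how the confidence and lift figures relate to such counts. The class RuleMetricsSketch is hypothetical, and the count of instances containing bread and cake is an assumed value that the recipe does not state.

```java
// Hypothetical helper: recomputes rule metrics from raw counts.
final class RuleMetricsSketch {

    // Confidence: fraction of premise matches that also match the consequence.
    static double confidence(int premiseCount, int bothCount) {
        return (double) bothCount / premiseCount;
    }

    // Lift: confidence divided by the overall frequency of the consequence.
    static double lift(double confidence, int consequentCount, int totalInstances) {
        return confidence / ((double) consequentCount / totalInstances);
    }

    public static void main(String[] args) {
        // Counts for rule 1, copied from the output above.
        int premiseCount = 788;
        int bothCount = 723;
        int totalInstances = 4627;
        // Assumed count of instances with bread and cake=t; not stated in the recipe.
        int assumedConsequentCount = 3330;

        double conf = confidence(premiseCount, bothCount);
        System.out.printf("confidence = %.2f%n", conf);
        System.out.printf("lift       = %.2f%n", lift(conf, assumedConsequentCount, totalInstances));
    }
}
```

A lift above 1 indicates that the consequence occurs more often together with the premise than it does in the dataset as a whole.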

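Selecting features/attributes using the low-level method, the filtering method, and the meta-classifier method

Feature selection is an important machine learning process that identifies the most important attributes in a dataset from the full set of attributes, so that a classifier built on the selected attributes is expected to produce better results than one built on all of them.

In Weka, there are three ways of selecting attributes. This recipe uses all three attribute selection techniques available in Weka: the low-level attribute selection method, attribute selection using a filter, and attribute selection using a meta-classifier.

Getting ready

The recipe will select the important attributes of the iris dataset, which can be found in the data directory of the Weka installation directory.

To perform attribute selection, two elements are required: a search method and an evaluation method. In our recipe, we will use Best First Search as the search method and a subset evaluation method named Correlation-based Feature Subset Selection.

How to do it...

Declare an instance variable to hold the data of the iris dataset. Declare another variable for a NaiveBayes classifier: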
```
        Instances iris = null; 
        NaiveBayes nb; 
```

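Create a method to load our dataset. The method will also read the data instances and set the last attribute of the dataset as the class attribute: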

        public void loadArff(String arffInput){ 
          DataSource source = null; 
          try { 
            source = new DataSource(arffInput); 
            iris = source.getDataSet(); 
            iris.setClassIndex(iris.numAttributes() - 1); 
          } catch (Exception e1) { 
          } 
       }

We will start simple: we will create a method that uses Weka's low-level attribute selection method:
```
       public void selectFeatures(){ 
```

Create an AttributeSelection object:

        AttributeSelection attSelection = new AttributeSelection();

Next, create objects for the search and the evaluator, and set the evaluator and search objects on the attribute selection object:

        CfsSubsetEval eval = new CfsSubsetEval(); 
        BestFirst search = new BestFirst(); 
        attSelection.setEvaluator(eval); 
        attSelection.setSearch(search);

Then, use the attribute selection object to select attributes from the iris dataset using the search and evaluator. We will get the index of the attributes that are selected by this technique and will display the selected attribute numbers (the attribute numbers start from 0):
```
        try { 
          attSelection.SelectAttributes(iris); 
          int[] attIndex = attSelection.selectedAttributes(); 
          System.out.println(Utils.arrayToString(attIndex)); 
        } catch (Exception e) { 
      } 
      } 
```
The output of the method is as follows:
```
 2, 3, 4
```
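The output means that the attribute selection technique selected attributes 2, 3, and 4 from the attributes of the iris dataset. With numbering starting from 0, these indices correspond to petallength, petalwidth, and the class attribute.

Now, we will create a method that implements the second technique, which selects attributes based on a filter: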
```
        public void selectFeaturesWithFilter(){ 
```

Create an attribute selection filter. Note that this filter's package is not the one we used in the first method of this recipe:

        weka.filters.supervised.attribute.AttributeSelection filter = 
          new weka.filters.supervised.attribute.AttributeSelection();

Next, create objects for the search and the evaluator, and set the evaluator and search objects on the filter:

        CfsSubsetEval eval = new CfsSubsetEval(); 
        BestFirst search = new BestFirst(); 
        filter.setEvaluator(eval); 
        filter.setSearch(search);

Then, apply the filter to the iris dataset, and retrieve new data using the useFilter() of the Filter class, which takes the dataset and filter as its two arguments. This is something different than what we saw in the previous method. This is very useful if we want to create a new ARFF file by selecting the attributes selected by the filtering technique on the fly:

```java
       try { 
         filter.setInputFormat(iris); 
         Instances newData = Filter.useFilter(iris, filter); 
         System.out.println(newData); 
       } catch (Exception e) { 
       } 
       } 

```

From the console output, we can see that the attribute section of the ARFF data now has the following entries:

```java
        @attribute petallength numeric 
        @attribute petalwidth numeric 
        @attribute class {Iris-setosa,Iris-versicolor,Iris-virginica} 

```

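This means that the two attributes listed along with the class attribute were selected by the attribute selection method we just used.

Finally, we will create a method that selects attributes before handing the dataset over to a classifier (in our case, a NaiveBayes classifier):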

```java
        public void selectFeaturesWithClassifiers(){ 

```

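Create a meta-classifier that reduces the dimensionality of the data (that is, selects attributes) before the data is passed to the NaiveBayes classifier:

```java
        AttributeSelectedClassifier classifier = new  
          AttributeSelectedClassifier(); 

```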

Create an evaluator, a search object, and a NaiveBayes classifier:

```java
        CfsSubsetEval eval = new CfsSubsetEval(); 
        BestFirst search = new BestFirst(); 
        nb = new NaiveBayes(); 

```

Set the evaluator, the search object, and the NaiveBayes classifier on the meta-classifier:

```java
        classifier.setClassifier(nb); 
        classifier.setEvaluator(eval); 
        classifier.setSearch(search); 

```

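Now, we will evaluate the performance of the NaiveBayes classifier with the attributes selected by the meta-classifier technique. Note that in this example, the attributes selected by the meta-classifier work like a black box. For the evaluation, 10-fold cross-validation is used: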

        Evaluation evaluation; 
        try { 
           evaluation = new Evaluation(iris); 
           evaluation.crossValidateModel(classifier, iris, 10, new 
             Random(1)); 
           System.out.println(evaluation.toSummaryString()); 
         } catch (Exception e) { 
         } 
         }

The complete code for the recipe is as follows:

import java.util.Random; 
import weka.attributeSelection.AttributeSelection; 
import weka.attributeSelection.BestFirst; 
import weka.attributeSelection.CfsSubsetEval; 
import weka.classifiers.Evaluation; 
import weka.classifiers.bayes.NaiveBayes; 
import weka.classifiers.meta.AttributeSelectedClassifier; 
import weka.core.Instances; 
import weka.core.Utils; 
import weka.core.converters.ConverterUtils.DataSource; 
import weka.filters.Filter; 

public class WekaFeatureSelectionTest { 
   Instances iris = null; 
   NaiveBayes nb; 
   public void loadArff(String arffInput){ 
      DataSource source = null; 
      try { 
         source = new DataSource(arffInput); 
         iris = source.getDataSet(); 
         iris.setClassIndex(iris.numAttributes() - 1); 
      } catch (Exception e1) { 
      } 
   } 

   public void selectFeatures(){ 
      AttributeSelection attSelection = new AttributeSelection(); 
       CfsSubsetEval eval = new CfsSubsetEval(); 
       BestFirst search = new BestFirst(); 
       attSelection.setEvaluator(eval); 
       attSelection.setSearch(search); 
       try { 
         attSelection.SelectAttributes(iris); 
         int[] attIndex = attSelection.selectedAttributes(); 
         System.out.println(Utils.arrayToString(attIndex)); 
      } catch (Exception e) { 
      } 
   } 

   public void selectFeaturesWithFilter(){ 
      weka.filters.supervised.attribute.AttributeSelection filter = new 
       weka.filters.supervised.attribute.AttributeSelection(); 
       CfsSubsetEval eval = new CfsSubsetEval(); 
       BestFirst search = new BestFirst(); 
       filter.setEvaluator(eval); 
       filter.setSearch(search); 
       try { 
         filter.setInputFormat(iris); 
         Instances newData = Filter.useFilter(iris, filter); 
         System.out.println(newData); 
      } catch (Exception e) { 
      } 
   } 

   public void selectFeaturesWithClassifiers(){ 
      AttributeSelectedClassifier classifier = new 
        AttributeSelectedClassifier(); 
      CfsSubsetEval eval = new CfsSubsetEval(); 
      BestFirst search = new BestFirst(); 
      nb = new NaiveBayes(); 
      classifier.setClassifier(nb); 
      classifier.setEvaluator(eval); 
      classifier.setSearch(search); 
      Evaluation evaluation; 
      try { 
         evaluation = new Evaluation(iris); 
         evaluation.crossValidateModel(classifier, iris, 10, new 
           Random(1)); 
         System.out.println(evaluation.toSummaryString()); 
      } catch (Exception e) { 
      } 
   } 

   public static void main(String[] args){ 
      WekaFeatureSelectionTest test = new WekaFeatureSelectionTest(); 
      test.loadArff("C:/Program Files/Weka-3-6/data/iris.arff"); 
      test.selectFeatures(); 
      test.selectFeaturesWithFilter(); 
      test.selectFeaturesWithClassifiers(); 
   }
}

该方法的输出如下:

Correctly Classified Instances         144               96      % 
Incorrectly Classified Instances         6                4      % 
Kappa statistic                          0.94   
Mean absolute error                      0.0286 
Root mean squared error                  0.1386 
Relative absolute error                  6.4429 % 
Root relative squared error             29.4066 % 
Total Number of Instances              150

注意

以下频道有关于使用 Weka 完成许多不同机器学习任务的教程，包括其 API 和 GUI:https://www.youtube.com/c/rushdishams。

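5. Learning from Data - Part 2

In this chapter, we will cover the following recipes:

Applying machine learning on data using the Java Machine Learning (Java-ML) library
Dataset import and export
Clustering and evaluation
Classification
Cross-validation and held-out testing
Feature scoring
Feature selection
Classifying data points using the Stanford Classifier
Classifying data points using Massive Online Analysis (MOA)

Introduction

In Chapter 4, Learning from Data - Part 1, we used the Weka machine learning workbench for classification, clustering, association rule mining, and feature selection. We also mentioned in that chapter that Weka is not the only Java tool for learning patterns from data. Other tools perform similar tasks; examples include, but are not limited to, the Java Machine Learning (Java-ML) library, Massive Online Analysis (MOA), and the Stanford machine learning libraries.

In this chapter, we will look at bits and pieces of these other tools to run machine learning analyses on data.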

使用 Java 机器学习(Java-ML)库对数据应用机器学习

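The Java Machine Learning (Java-ML) library is a collection of standard machine learning algorithms. Unlike Weka, the library has no GUI because it is aimed primarily at software developers. A particularly useful feature of Java-ML is that each type of algorithm shares a common interface, which makes the algorithms easy and straightforward to use. The library is also well supported: the source code is well documented and therefore extensible, and there are plenty of code samples and tutorials for a variety of machine learning tasks. The website http://java-ml.sourceforge.net/ has all the details about the library.

In this recipe, we will use the library for the following tasks:

Dataset import and export
Clustering and evaluation
Classification
Cross-validation and held-out testing
Feature scoring
Feature selection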

准备就绪

To perform this recipe, we need the following:

In this recipe, we will use version 0.1.7 of the library. Download it from https://sourceforge.net/projects/java-ml/files/java-ml/.
The downloaded file is a compressed zip archive. Extract it to a directory. Include the javaml-0.1.7.jar file from that directory as an external JAR file in the Eclipse project that you will use to implement the recipe.
The directory also has a folder named lib, which contains several other JAR files. These JAR files are dependencies of Java-ML and must also be included in the project as external JAR files.
In our recipe, we will also use a sample Iris dataset that is compatible with Java-ML's native file format. Iris and the other data files do not come with the library's distribution; they need to be downloaded from a different repository. To download the datasets, go to http://java-ml.sourceforge.net/content/databases. Java-ML offers two collections: 111 small UCI datasets and 7 large UCI datasets. For practice, downloading both is highly recommended. For this recipe, click 111 small UCI datasets and you will be prompted to download it.
Once the download completes, decompress the archive. You will see 111 folders in the distribution, each representing a dataset. Find the iris dataset folder and open it. It contains two data files and one names file. In our recipe, we will use the iris.data file. Note the path to this file, because we will use it in the recipe.

注意

如果您使用任何 UCI 数据集，您需要相应地注明并提供对原始作品的引用。详情可在http://archive.ics.uci.edu/ml/citation_policy.html找到。

怎么做...

创建一个名为JavaMachineLearning的类。我们将使用一个主方法来实现所有的机器学习任务。该方法将抛出一个IOException :
```
        public class JavaMachineLearning { 
          public static void main(String[] args) throws IOException{ 
```
First, we will be reading the iris dataset using Java-ML's FileHandler class's loadDataset() method:
```
        Dataset data = FileHandler.loadDataset(new File("path to your 
        iris.data"), 4, ","); 
```
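The parameters of the method are the path to the dataset, the position of the class attribute, and the separator between values. The dataset can be read with any standard text editor. Attribute indexes start at 0, and the fifth attribute of the iris dataset is the class attribute, so the second parameter is set to 4. The data values are separated by commas, so the third parameter is a comma. The contents of the file are loaded into a Dataset object.
Print the contents of the dataset by simply passing the object to the System.out.println() method:
```
        System.out.println(data); 
```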

代码的部分输出如下:

        [{[5.1, 3.5, 1.4, 0.2];Iris-setosa}, {[4.9, 3.0, 1.4, 
          0.2];Iris-setosa}, {[4.7, 3.2, 1.3, 0.2];Iris-setosa}, {[4.6,  
             3.1, 1.5, 0.2];Iris-setosa}, {[5.0, 3.6, 1.4, 0.2];Iris-
               setosa}, {[5.4, 3.9, 1.7, 0.4];Iris-setosa}, {[4.6, 3.4, 
                 1.4, 0.3];Iris-setosa}, {[5.0, 3.4, 1.5, 0.2];Iris-
                    setosa}, {[4.4, 2.9, 1.4, 0.2];Iris-setosa}, ...]

If at any point, you need to export your dataset from the .data format to a .txt format, Java-ML has a very simple way to accomplish it using its exportDataset() method of the FileHandler class. The method takes the data and the output file as its parameter. The following line of code creates a text file in the C:/ drive with the contents of the iris dataset:
```
        FileHandler.exportDataset(data, new File("c:/javaml-
          output.txt")); 
```
上述代码生成的文本文件的部分输出如下:
```
 Iris-setosa 5.1   3.5   1.4   0.2 
        Iris-setosa 4.9   3.0   1.4   0.2 
        Iris-setosa 4.7   3.2   1.3   0.2 
        Iris-setosa 4.6   3.1   1.5   0.2 
        ...................................
```
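Note

There are two things to note about the data file generated by Java-ML. First, the class value is now the first attribute. Second, the values are no longer separated by commas as in the .data file; they are separated by tabs instead.
To read the data file created in the previous step, we can use the loadDataset() method again, but this time with different parameter values: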
```
        data = FileHandler.loadDataset(new File("c:/javaml-
          output.txt"), 0,"\t"); 
```

If we print the data:

        System.out.println(data);

然后，它将与我们在步骤 3 中看到的输出相同:

        [{[5.1, 3.5, 1.4, 0.2];Iris-setosa}, {[4.9, 3.0, 1.4, 
          0.2];Iris-setosa}, {[4.7, 3.2, 1.3, 0.2];Iris-setosa}, {[4.6, 
            3.1, 1.5, 0.2];Iris-setosa}, {[5.0, 3.6, 1.4, 0.2];Iris-
              setosa}, {[5.4, 3.9, 1.7, 0.4];Iris-setosa}, {[4.6, 3.4, 
                1.4, 0.3];Iris-setosa}, {[5.0, 3.4, 1.5, 0.2];Iris-
                  setosa}, {[4.4, 2.9, 1.4, 0.2];Iris-setosa}, ...]
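
To make the difference between the two file layouts concrete, the following plain-Java sketch rewrites one comma-separated line (class value last) into the exported layout described above (class value first, tab-separated). It does not use Java-ML; the class name ExportFormatSketch and the helper toTabSeparatedClassFirst() are hypothetical names chosen for illustration, and the sketch only mirrors the layout, not the library's internal implementation.

```java
import java.util.ArrayList;
import java.util.List;

public class ExportFormatSketch {

    // Moves the value at classIndex to the front and joins all values with tabs.
    static String toTabSeparatedClassFirst(String line, int classIndex, String separator) {
        String[] values = line.split(separator);
        List<String> reordered = new ArrayList<>();
        reordered.add(values[classIndex]);
        for (int i = 0; i < values.length; i++) {
            if (i != classIndex) {
                reordered.add(values[i]);
            }
        }
        return String.join("\t", reordered);
    }

    public static void main(String[] args) {
        // One line in the layout of iris.data: four numeric values, then the class.
        String original = "5.1,3.5,1.4,0.2,Iris-setosa";
        System.out.println(toTabSeparatedClassFirst(original, 4, ","));
    }
}
```

This also shows why the reload call uses 0 and a tab as its second and third arguments: after the export, the class value sits at index 0 and the separator is a tab.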

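Java-ML provides very simple interfaces to apply clustering, display clusters, and evaluate them. We will use KMeans clustering in this recipe. Create a KMeans clusterer: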
```
        Clusterer km = new KMeans(); 
```
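Pass the data to the clusterer with its cluster() method. The result is multiple clusters of data points (that is, multiple datasets). Store the result in an array of Dataset: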
```
        Dataset[] clusters = km.cluster(data); 
```
If you want to see the data points in each cluster, use a for loop to iterate over the array of datasets:

```java
        for(Dataset cluster:clusters){ 
          System.out.println("Cluster: " + cluster); 
        } 

```

该步骤中代码的部分输出如下:

```java
 Cluster: [{[6.3, 3.3, 6.0, 2.5];Iris-virginica}, {[7.1, 3.0, 
          5.9, 2.1];Iris-virginica}, ...] 
        Cluster: [{[5.5, 2.3, 4.0, 1.3];Iris-versicolor}, {[5.7, 2.8, 
           4.5, 1.3];Iris-versicolor}, ...] 
        Cluster: [{[5.1, 3.5, 1.4, 0.2];Iris-setosa}, {[4.9, 3.0, 1.4, 
           0.2];Iris-setosa}, ...] 
        Cluster: [{[7.0, 3.2, 4.7, 1.4];Iris-versicolor}, {[6.4, 3.2, 
           4.5, 1.5];Iris-versicolor}, ...]

```

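From the output, we can see that the KMeans algorithm created four clusters from the iris dataset.

The sum of squared errors is one measure of a clusterer's performance. We will use a ClusterEvaluation object to measure the error of the clustering: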

```java
        ClusterEvaluation sse = new SumOfSquaredErrors(); 

```

接下来，我们简单地将聚类发送给对象的 score 方法，以获得聚类的误差平方和:

```java
        double score = sse.score(clusters); 

```

Print the error score:

```java
        System.out.println(score);  

```

输出中将显示以下内容:

```java
 114.9465465309897 

```

This is the sum of squared errors of the KMeans clustering of the iris dataset.
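
To show what a sum of squared errors measures, here is a minimal plain-Java sketch that does not depend on Java-ML. It assumes squared Euclidean distance to each cluster's centroid; Java-ML's SumOfSquaredErrors may be configured with a different distance measure, so its numbers need not match this sketch. The class name and the toy clusters are made up for illustration.

```java
public class SumOfSquaredErrorsSketch {

    // Sum over all clusters of the squared Euclidean distance
    // between each point and the centroid of its cluster.
    static double sse(double[][][] clusters) {
        double total = 0.0;
        for (double[][] cluster : clusters) {
            if (cluster.length == 0) {
                continue; // an empty cluster contributes no error
            }
            int dims = cluster[0].length;
            double[] centroid = new double[dims];
            for (double[] point : cluster) {
                for (int d = 0; d < dims; d++) {
                    centroid[d] += point[d] / cluster.length;
                }
            }
            for (double[] point : cluster) {
                for (int d = 0; d < dims; d++) {
                    double diff = point[d] - centroid[d];
                    total += diff * diff;
                }
            }
        }
        return total;
    }

    public static void main(String[] args) {
        // Two toy clusters of two-dimensional points (illustrative data only).
        double[][][] clusters = {
            { {1, 1}, {3, 1} },
            { {0, 0}, {0, 4} }
        };
        System.out.println(sse(clusters));
    }
}
```

A lower value means that the points lie closer to their cluster centers, which is why the score is useful for comparing clusterings of the same data.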

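Classification in Java-ML is also very easy and needs only a few lines of code. The following code creates a K-nearest neighbor (KNN) classifier. The classifier predicts the label of an unseen data point from the majority vote of its five nearest neighbors. The buildClassifier() method trains the classifier and takes a dataset (in our case, iris) as its argument: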

```java
        Classifier knn = new KNearestNeighbors(5); 
        knn.buildClassifier(data); 

```
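
The majority vote described above can be sketched in plain Java without Java-ML. This is only an illustration of the idea, assuming Euclidean distance and a handful of iris-like points; the class KnnSketch and its methods are hypothetical and are not part of the library.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashMap;
import java.util.Map;

public class KnnSketch {

    // Euclidean distance between two points of equal length.
    static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sum += diff * diff;
        }
        return Math.sqrt(sum);
    }

    // Returns the label that occurs most often among the k nearest training points.
    static String predict(double[][] points, String[] labels, double[] query, int k) {
        Integer[] order = new Integer[points.length];
        for (int i = 0; i < order.length; i++) {
            order[i] = i;
        }
        Arrays.sort(order, Comparator.comparingDouble((Integer i) -> distance(points[i], query)));

        Map<String, Integer> votes = new HashMap<>();
        String best = null;
        int bestVotes = 0;
        for (int n = 0; n < Math.min(k, order.length); n++) {
            String label = labels[order[n]];
            int count = votes.merge(label, 1, Integer::sum);
            if (count > bestVotes) {
                bestVotes = count;
                best = label;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        double[][] points = {
            {5.1, 3.5, 1.4, 0.2}, {4.9, 3.0, 1.4, 0.2}, {4.7, 3.2, 1.3, 0.2},
            {7.0, 3.2, 4.7, 1.4}, {6.4, 3.2, 4.5, 1.5}, {5.5, 2.3, 4.0, 1.3}
        };
        String[] labels = {
            "Iris-setosa", "Iris-setosa", "Iris-setosa",
            "Iris-versicolor", "Iris-versicolor", "Iris-versicolor"
        };
        System.out.println(predict(points, labels, new double[] {5.0, 3.4, 1.5, 0.2}, 5));
    }
}
```

Ties are resolved here in favor of the label that reaches the winning count first; a library implementation may break ties differently.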

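After a model is built, the recipe continues by evaluating it. We will see two evaluation methods that Java-ML supports:

*   k-fold cross-validation
*   held-out testing

For k-fold cross-validation of the KNN classifier, we create a CrossValidation instance from the classifier. The CrossValidation class has a method named crossValidation() that takes the dataset as its argument. The method returns a map whose keys are the class labels (typed as Object) and whose values are the evaluation metrics (PerformanceMeasure):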

```java
        CrossValidation cv = new CrossValidation(knn); 
        Map<Object, PerformanceMeasure> cvEvaluation = 
          cv.crossValidation(data); 

```

Now that we have the cross-validation results, we can simply print them by using the following:

```java
        System.out.println(cvEvaluation); 

```

This displays the true positives, false positives, true negatives, and false negatives for each class:

```java
        {Iris-versicolor=[TP=47.0, FP=1.0, TN=99.0, FN=3.0], Iris-
          virginica=[TP=49.0, FP=3.0, TN=97.0, FN=1.0], Iris-setosa=
            [TP=50.0, FP=0.0, TN=100.0, FN=0.0]} 

```
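
The four counts are enough to derive the usual per-class metrics. The following plain-Java sketch applies the standard formulas to the Iris-versicolor counts printed above (TP=47, FP=1, TN=99, FN=3); the class ConfusionCountsSketch is a hypothetical helper written for this illustration and is independent of Java-ML's PerformanceMeasure class.

```java
public class ConfusionCountsSketch {

    // Fraction of all instances that were labeled correctly for this class.
    static double accuracy(double tp, double fp, double tn, double fn) {
        return (tp + tn) / (tp + fp + tn + fn);
    }

    // Fraction of the instances predicted as this class that really belong to it.
    static double precision(double tp, double fp) {
        return tp / (tp + fp);
    }

    // Fraction of the instances of this class that the classifier found.
    static double recall(double tp, double fn) {
        return tp / (tp + fn);
    }

    public static void main(String[] args) {
        // Counts for Iris-versicolor taken from the cross-validation output.
        double tp = 47, fp = 1, tn = 99, fn = 3;
        System.out.println("accuracy  = " + accuracy(tp, fp, tn, fn));
        System.out.println("precision = " + precision(tp, fp));
        System.out.println("recall    = " + recall(tp, fn));
    }
}
```

For Iris-versicolor this gives an accuracy of 146/150, a precision of 47/48, and a recall of 47/50.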

In order to do held-out testing, we need a test dataset. Unfortunately, we do not have a separate test dataset for iris, so we will reuse the same iris.data file (which was used to train our KNN classifier) as the test dataset. Because the classifier has already seen these data points, the resulting accuracy is optimistic. In real life, you will have a test dataset with exactly the same attributes as your training dataset, while the labels of the data points will be unknown.

首先，我们加载测试数据集:

```java
        Dataset testData = FileHandler.loadDataset(new File("path to 
          your iris.data "), 4, ","); 

```

然后，我们使用以下代码获得分类器对测试数据的性能:

```java
        Map<Object, PerformanceMeasure> testEvaluation = 
          EvaluateDataset.testDataset(knn, testData); 

```

Then, we can simply print the results for each class by iterating over the map object:

```java
        for(Object classVariable:testEvaluation.keySet()){ 
          System.out.println(classVariable + " class has " + 
            testEvaluation.get(classVariable).getAccuracy()); 
        } 

```

The preceding code prints the accuracy of the KNN classifier for each class:

```java
        Iris-versicolor class has 0.9666666666666667 
        Iris-virginica class has 0.9666666666666667 
        Iris-setosa class has 1.0 

```

Feature scoring is a key step in reducing dimensionality in machine learning. In Java-ML, feature scoring algorithms provide the following method, which returns a score for a given attribute:

```java
        public double score(int attIndex); 

```

首先，创建一个特征评分算法实例。在我们的配方中，我们将使用增益比算法:

```java
        GainRatio gainRatio = new GainRatio(); 

```

接下来，将算法应用于数据:

```java
        gainRatio.build(data); 

```

Finally, print the scores of each feature using a for loop and by iterating through sending the attribute index to the score() method, one by one:

```java
        for (int i = 0; i < gainRatio.noAttributes(); i++){ 
          System.out.println(gainRatio.score(i)); 
        } 

```

The feature scores for the iris dataset are as follows:

```java
 0.2560110727706682 
        0.1497001925156687 
        0.508659832906763 
        0.4861382158327255

```
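
The scores are printed in attribute order, so it can help to sort the attribute indexes by score. The sketch below does this in plain Java for the four gain ratio values shown above; the class ScoreOrderSketch and its method are hypothetical and are not part of Java-ML.

```java
import java.util.Arrays;
import java.util.Comparator;

public class ScoreOrderSketch {

    // Returns the attribute indexes ordered from the highest score to the lowest.
    static Integer[] indexesByScoreDescending(double[] scores) {
        Integer[] indexes = new Integer[scores.length];
        for (int i = 0; i < indexes.length; i++) {
            indexes[i] = i;
        }
        Arrays.sort(indexes, Comparator.comparingDouble((Integer i) -> scores[i]).reversed());
        return indexes;
    }

    public static void main(String[] args) {
        // Gain ratio scores of the four iris attributes, in attribute order.
        double[] scores = {
            0.2560110727706682, 0.1497001925156687,
            0.508659832906763, 0.4861382158327255
        };
        System.out.println(Arrays.toString(indexesByScoreDescending(scores)));
    }
}
```

For these scores the order is attribute 2, then 3, 0, and 1, which suggests that the two petal measurements carry the most information about the class.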

We can also rank features with a feature-ranking algorithm. To do this, we will use the rank() method of Java-ML, which works in a similar way to the score() method: both take the index of an attribute:

```java
        public int rank(int attIndex); 

```

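Create an instance of a feature ranking algorithm. In our case, we will rely on feature ranking based on SVM recursive feature elimination. The parameter of the constructor is the percentage of the worst-ranked features that will be eliminated: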

```java
        RecursiveFeatureEliminationSVM featureRank = new 
          RecursiveFeatureEliminationSVM(0.2);
```

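Next, apply the algorithm to the dataset:

```java
        featureRank.build(data); 

```

Finally, print the rank of each feature using a for loop, sending the attribute indexes to the rank() method one by one:

```java
        for (int i = 0; i < featureRank.noAttributes(); i++){ 
           System.out.println(featureRank.rank(i)); 
        } 

```

The feature ranking for the iris dataset is as follows:

```java
        3 
        2 
        0 
        1

```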

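While feature scoring and ranking give us information about individual features, Java-ML's feature subset selection gives us only the subset of features selected from the dataset.

First, create a feature selection algorithm. In our recipe, we will use a forward selection method that follows a greedy approach. The selection process needs a distance measure, which in our case is the Pearson correlation measure. The first parameter of the constructor is the number of attributes to select for the subset: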

        GreedyForwardSelection featureSelection = new 
        GreedyForwardSelection(5, new PearsonCorrelationCoefficient());

然后，将算法应用于数据集:

        featureSelection.build(data);

最后，您可以轻松打印算法选择的特征:

        System.out.println(featureSelection.selectedAttributes());

要素的输出子集如下:

[0]

食谱的完整代码如下:

import java.io.File; 
import java.io.IOException; 
import java.util.Map; 
import net.sf.javaml.classification.Classifier; 
import net.sf.javaml.classification.KNearestNeighbors; 
import net.sf.javaml.classification.evaluation.CrossValidation; 
import net.sf.javaml.classification.evaluation.EvaluateDataset; 
import net.sf.javaml.classification.evaluation.PerformanceMeasure; 
import net.sf.javaml.clustering.Clusterer; 
import net.sf.javaml.clustering.KMeans; 
import net.sf.javaml.clustering.evaluation.ClusterEvaluation; 
import net.sf.javaml.clustering.evaluation.SumOfSquaredErrors; 
import net.sf.javaml.core.Dataset; 
import net.sf.javaml.distance.PearsonCorrelationCoefficient; 
import net.sf.javaml.featureselection.ranking.
  RecursiveFeatureEliminationSVM; 
import net.sf.javaml.featureselection.scoring.GainRatio; 
import net.sf.javaml.featureselection.subset.GreedyForwardSelection; 
import net.sf.javaml.tools.data.FileHandler; 

public class JavaMachineLearning { 
   public static void main(String[] args) throws IOException{ 
      Dataset data = FileHandler.loadDataset(
        new File("path to iris.data"), 4, ","); 
      System.out.println(data); 
      FileHandler.exportDataset(data, 
        new File("c:/javaml-output.txt")); 
      data = FileHandler.loadDataset(new File("c:/javaml-output.txt"), 
        0,"\t"); 
      System.out.println(data); 

      //Clustering 
      Clusterer km = new KMeans(); 
      Dataset[] clusters = km.cluster(data); 
      for(Dataset cluster:clusters){ 
         System.out.println("Cluster: " + cluster); 
      } 
      ClusterEvaluation sse= new SumOfSquaredErrors(); 
      double score = sse.score(clusters); 
      System.out.println(score); 

      //Classification 
      Classifier knn = new KNearestNeighbors(5); 
      knn.buildClassifier(data); 
      //Cross validation 
      CrossValidation cv = new CrossValidation(knn); 
      Map<Object, PerformanceMeasure> cvEvaluation = 
        cv.crossValidation(data); 
      System.out.println(cvEvaluation); 
      //Held-out testing 
      Dataset testData = FileHandler.loadDataset(
        new File("path to iris.data"), 4, ","); 
      Map<Object, PerformanceMeasure> testEvaluation = 
            EvaluateDataset.testDataset(knn, testData); 
      for(Object classVariable:testEvaluation.keySet()){ 
         System.out.println(classVariable + " class has " 
           + testEvaluation.get(classVariable).getAccuracy()); 
      } 

      //Feature scoring 
      GainRatio gainRatio = new GainRatio(); 
      gainRatio.build(data); 
      for (int i = 0; i < gainRatio.noAttributes(); i++){ 
         System.out.println(gainRatio.score(i)); 
      } 

      //Feature ranking 
      RecursiveFeatureEliminationSVM featureRank = new 
        RecursiveFeatureEliminationSVM(0.2); 
      featureRank.build(data); 
      for (int i = 0; i < featureRank.noAttributes(); i++){ 
         System.out.println(featureRank.rank(i)); 
      } 

      //Feature subset selection 
      GreedyForwardSelection featureSelection = new 
        GreedyForwardSelection(5, new 
            PearsonCorrelationCoefficient()); 
      featureSelection.build(data); 
      System.out.println(featureSelection.selectedAttributes()); 
   } 
}

使用斯坦福分类器分类数据点

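The Stanford Classifier is a machine learning classifier developed by the Stanford Natural Language Processing Group. The software is implemented in Java and uses maximum entropy as its classifier. Maximum entropy is equivalent to a multiclass logistic regression model, with only slightly different parameter settings. An advantage of the Stanford Classifier is that the technique behind it is reportedly the same basic one used at companies such as Google or Amazon.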

准备就绪

在这个食谱中，我们将使用斯坦福分类器，根据它使用最大熵的学习来分类数据点。我们将使用 3.6.0 版本的软件。详情请参考http://nlp.stanford.edu/software/classifier.html。要运行这个食谱的代码，您需要 Java 8。为了执行此配方，我们需要执行以下操作:

Go to http://nlp.stanford.edu/software/classifier.html and download version 3.6.0, which is the latest version at the time of writing. The software distribution is a compressed zip file.
Once it is downloaded, decompress the archive. The stanford-classifier-3.6.0.jar file that it contains needs to be included in your Eclipse project.

The distribution also has a folder named examples, whose contents we will use in the recipe. The examples folder contains two datasets: the cheese-disease dataset and the Iris dataset. Each dataset comes with three associated files: a training file (with the .train extension), a test file (with the .test extension), and a properties file (with the .prop extension). In our recipe, we will use the cheese-disease dataset.

如果打开cheeseDisease.train文件，内容如下:
```
2 Back Pain
2 Dissociative Disorders
2 Lipoma
1 Blue Rathgore
2 Gallstones
1 Chevrotin des Aravis
2 Pulmonary Embolism
2 Gastroenteritis
2 Ornithine Carbamoyltransferase Deficiency Disease ............
```
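The first column, which is either 1 or 2, is the class of the data instance, and the second column is a string value that holds a name. Class 1 means that the name that follows is the name of a cheese, and class 2 means that it is the name of a disease. The goal of applying supervised classification to this dataset is to build a classifier that separates cheese names from disease names.

Note

The columns in the dataset are separated by tabs. Here, there is only one class column and one predictor column, which is the minimum needed to train a classifier. However, you can have any number of predictor columns and specify their roles.
The cheese2007.prop file is the properties file for the dataset. You need to understand its contents, because the code in our recipe consults this file for the necessary information about the class feature, the types of features the classifier should use, the display format on the console, the parameters of the classifier, and so on. We will therefore examine the contents of this file now.

The first few lines of the properties file list the feature options. Lines that start with # are comments. These lines tell the classifier to use the class feature during learning; the options prefixed with 1. apply to column 1 of the training file (the name). They state that the classifier will use N-gram features with a minimum N-gram length of 1 and a maximum length of 4, that it will also use prefix and suffix N-grams, and that it will use binned lengths of 10, 20, and 30: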

        #
        # Features
        #
        useClassFeature=true
        1.useNGrams=true 
        1.usePrefixSuffixNGrams=true 
        1.maxNGramLeng=4 
        1.minNGramLeng=1 
        1.binnedLengths=10,20,30
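
To give a feel for what N-gram features with lengths 1 to 4 look like, the following plain-Java sketch lists all character N-grams of a short name. It is only an illustration of the idea, assuming character-level N-grams; the Stanford Classifier builds its own feature strings internally (including the prefix and suffix variants), and the class CharNGramSketch is a hypothetical helper that is not part of the library.

```java
import java.util.ArrayList;
import java.util.List;

public class CharNGramSketch {

    // Collects every substring whose length is between minLen and maxLen (inclusive).
    static List<String> nGrams(String text, int minLen, int maxLen) {
        List<String> grams = new ArrayList<>();
        for (int len = minLen; len <= maxLen; len++) {
            for (int start = 0; start + len <= text.length(); start++) {
                grams.add(text.substring(start, start + len));
            }
        }
        return grams;
    }

    public static void main(String[] args) {
        // A short cheese name, chosen only as an example input.
        System.out.println(nGrams("Brie", 1, 4));
    }
}
```

For a four-letter name this produces ten N-grams (four of length 1, three of length 2, two of length 3, and one of length 4), which is why even short names yield enough features to separate cheeses from diseases.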

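The next important block of lines is the mapping, where the properties file tells the classifier that the ground truth for evaluation is in column 0 and that predictions are to be made for the items in column 1: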

        #
        # Mapping
        #
        goldAnswerColumn=0 
        displayedColumn=1

接下来，属性文件保存最大熵分类器的优化参数:

        #
        # Optimization
        #
        intern=true 
        sigma=3 
        useQN=true 
        QNsize=15 
        tolerance=1e-4

最后，属性文件包含训练和测试文件路径的条目:

        #
        # Training input
        #
        trainFile=./examples/cheeseDisease.train
        testFile=./examples/cheeseDisease.test

怎么做...

对于食谱，我们将在项目中创建一个类。我们将只使用一种main()方法来演示分类。该方法将抛出一个异常:
```
        public class StanfordClassifier { public static void 
          main(String[] args) throws Exception {
```
斯坦福分类器是在ColumnDataClassifier类中实现的。通过为奶酪疾病数据集提供属性文件的路径来创建分类器:
```
        ColumnDataClassifier columnDataClassifier = new 
          ColumnDataClassifier("examples/cheese2007.prop");
```
接下来，使用训练数据构建分类器。Classifier类的泛型是<String, String>，因为第一列是类，第二列是奶酪/疾病的名称。注意，即使 class 列的标称值是 1 和 2，它们也被视为字符串:
```
        Classifier<String,String> classifier =   
          columnDataClassifier.makeClassifier
            (columnDataClassifier.readTrainingExamples
              ("examples/cheeseDisease.train"));
```
Finally, iterate through each line of the test dataset. The test dataset is similar to the training dataset: the first column is the actual class and the second column is the name. The first few lines of the test dataset are as follows:

2 Psittacosis

2 Cushing Syndrome

2 Esotropia

2 Jaundice, Neonatal

2 Thymoma ...............

它是如何工作的...

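The column data classifier is applied to each line of the test set: makeDatumFromLine() turns the line into a Datum object, the classifier predicts the class of the Datum, and the prediction is printed on the console:

        for (String line : ObjectBank.getLineIterator("examples/cheeseDisease.test", "utf-8")) { 
          Datum<String,String> d = columnDataClassifier.makeDatumFromLine(line); 
          System.out.println(line + " ==> " + classifier.classOf(d)); 
        }

The output on the console is as follows (the output is truncated):

        2 Psittacosis ==> 2 
        2 Cushing Syndrome ==> 2 
        2 Esotropia ==> 2 
        2 Jaundice, Neonatal ==> 2 
        2 Thymoma ==> 2 
        1 Caerphilly ==> 1 
        2 Teratoma ==> 2 
        2 Phantom Limb ==> 1 
        2 Iron Overload ==> 1 
        ...............

The first column is the actual class, the second column is the name, and the value to the right of the ==> symbol is the class predicted by the classifier. The complete code for the recipe is as follows: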

import edu.stanford.nlp.classify.Classifier;
import edu.stanford.nlp.classify.ColumnDataClassifier; 
import edu.stanford.nlp.ling.Datum; 
import edu.stanford.nlp.objectbank.ObjectBank; 
public class StanfordClassifier { 
   public static void main(String[] args) throws Exception { 
      ColumnDataClassifier columnDataClassifier = 
        new ColumnDataClassifier("examples/cheese2007.prop"); 
      Classifier<String,String> classifier = 
        columnDataClassifier.makeClassifier(
          columnDataClassifier.readTrainingExamples("examples/cheeseDisease.train")); 
      for (String line : ObjectBank.getLineIterator("examples/cheeseDisease.test", "utf-8")) { 
         Datum<String,String> d = columnDataClassifier.makeDatumFromLine(line); 
         System.out.println(line + " ==> " + classifier.classOf(d)); 
      } 
   } 
}

该方法不演示斯坦福分类器模型的加载和保存。如果有兴趣，可以看看发行版中的ClassifierDemo.java文件。

使用海量在线分析(MOA)对数据点进行分类

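Massive Online Analysis, or MOA, is related to Weka but is designed to scale further. It is a well-known Java workbench for data stream mining. Backed by a strong community, MOA implements classification, clustering, regression, concept drift detection, and recommender systems. Its other main advantages are that developers can extend it and that it can interact with Weka in both directions.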

准备就绪

为了执行此配方，我们需要以下内容:

MOA can be downloaded from https://sourceforge.net/projects/moa-datastream/, which is also linked from the MOA getting started page at http://moa.cms.waikato.ac.nz/getting-started/. This downloads a zip file named moa-release-2016.04.zip to your system. Save it anywhere you like.
Once it is downloaded, extract the files.
You need to add the moa.jar file to your project as an external library.

怎么做...

首先，创建一个类和一个带两个参数的方法:第一个参数表示我们将要处理的实例数量，第二个参数表示我们是否要测试分类器:
```
        public class MOA { public void run(int numInstances, boolean 
          isTesting){
```
创建一个HoeffdingTree分类器:
```
Classifier learner = new HoeffdingTree();
```
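Note

MOA implements the following classifiers: Naive Bayes, Hoeffding Tree, Hoeffding Option Tree, Hoeffding Adaptive Tree, Bagging, Boosting, Bagging using ADWIN, Leveraging Bagging, SGD, Perceptron, and SPegasos.
Next, create a random radial basis function stream. In the complete listing for this recipe, this is the line RandomRBFGenerator stream = new RandomRBFGenerator();.
Prepare the stream for use: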
```
        stream.prepareForUse();
```
设置对数据流头的引用。使用getHeader()方法可以找到数据流的头:
```
        learner.setModelContext(stream.getHeader());
```
然后，准备要使用的分类器:
```
        learner.prepareForUse();
```

声明两个变量，用于跟踪样本数和正确分类的样本数:

        int numberSamplesCorrect = 0; int numberSamples = 0;

声明另一个变量来跟踪分类所花费的时间:

        long evaluateStartTime =  
          TimingUtils.getNanoCPUTimeOfCurrentThread();

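Next, loop as long as the stream has more instances and the number of processed samples has not reached the requested number of instances. Inside the loop, get the data of the next instance from the stream. If testing is turned on (through the second parameter of this method), check whether the classifier classifies the instance correctly and, if so, increase numberSamplesCorrect by 1.
Then, move on to the next sample by incrementing the sample count, and train the learner on the instance: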

```java
      while (stream.hasMoreInstances() && numberSamples < numInstances)
       {
        Instance trainInst = stream.nextInstance().getData(); 
        if (isTesting)
         { 
          if (learner.correctlyClassifies(trainInst)){ 
            numberSamplesCorrect++; 
           }
         }
       numberSamples++; learner.trainOnInstance(trainInst); 
      }

```
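
The loop above follows a test-then-train pattern: each instance is first used to test the current model and only afterwards to train it. The following plain-Java sketch shows the same pattern without MOA. It assumes a toy learner that always predicts the most frequent label seen so far; the classes PrequentialSketch and MajorityClassLearner are hypothetical stand-ins for the stream and the HoeffdingTree, not part of MOA.

```java
import java.util.HashMap;
import java.util.Map;

public class PrequentialSketch {

    // Toy learner: predicts the label that it has seen most often so far.
    static class MajorityClassLearner {
        private final Map<String, Integer> counts = new HashMap<>();
        private String majority = null;
        private int majorityCount = 0;

        String predict() {
            return majority; // null until the first training example arrives
        }

        void train(String label) {
            int count = counts.merge(label, 1, Integer::sum);
            if (count > majorityCount) {
                majorityCount = count;
                majority = label;
            }
        }
    }

    // Tests on each label first, then trains on it; returns accuracy in percent.
    static double prequentialAccuracy(String[] labels) {
        MajorityClassLearner learner = new MajorityClassLearner();
        int correct = 0;
        int seen = 0;
        for (String label : labels) {
            if (label.equals(learner.predict())) {
                correct++;
            }
            seen++;
            learner.train(label);
        }
        return seen == 0 ? 0.0 : 100.0 * correct / seen;
    }

    public static void main(String[] args) {
        // A short, made-up stream of class labels.
        String[] stream = {"a", "a", "b", "a", "a", "b", "a", "a"};
        System.out.println(prequentialAccuracy(stream) + "% accuracy");
    }
}
```

Because every instance is tested before the learner has seen it, the accuracy is measured on unseen data even though no separate test set exists, which suits data streams that never end.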

Compute the accuracy:

```java
       double accuracy = 100.0 * (double) numberSamplesCorrect/ 
         (double) numberSamples;
```

此外，计算分类所需的时间:

```java
        double time = TimingUtils.nanoTimeToSeconds(TimingUtils.
          getNanoCPUTimeOfCurrentThread()- evaluateStartTime);

```

最后，显示这些评估指标并关闭方法:

```java
        System.out.println(numberSamples + " instances processed with " 
          + accuracy + "% accuracy in "+time+" seconds."); }
```

To run the method, you can have a main() method as follows. It also closes the class:

        public static void main(String[] args) throws IOException { 
          MOA exp = new MOA(); 
          exp.run(1000000, true); 
        } 
        }

该配方的完整代码如下:

import moa.classifiers.trees.HoeffdingTree;
import moa.classifiers.Classifier;
import moa.core.TimingUtils;
import moa.streams.generators.RandomRBFGenerator;
import com.yahoo.labs.samoa.instances.Instance;
import java.io.IOException;

public class MOA {
	public void run(int numInstances, boolean isTesting){
		Classifier learner = new HoeffdingTree();
		RandomRBFGenerator stream = new RandomRBFGenerator();
		stream.prepareForUse();

		learner.setModelContext(stream.getHeader());
		learner.prepareForUse();

		int numberSamplesCorrect = 0;
		int numberSamples = 0;
		long evaluateStartTime = 
		    TimingUtils.getNanoCPUTimeOfCurrentThread();
		while (stream.hasMoreInstances() 
		    && numberSamples < numInstances) {
			Instance trainInst = 
			    stream.nextInstance().getData();
			if (isTesting) {
				if (learner.correctlyClassifies(trainInst)){
					numberSamplesCorrect++;
				}
			}
			numberSamples++;
			learner.trainOnInstance(trainInst);
		}
		double accuracy = 100.0 * (double) 
		    numberSamplesCorrect/ (double) numberSamples;
		double time = TimingUtils.nanoTimeToSeconds(
		    TimingUtils.getNanoCPUTimeOfCurrentThread() 
		    - evaluateStartTime);
		System.out.println(numberSamples 
		    + " instances processed with " + accuracy 
		    + "% accuracy in " + time + " seconds.");
	}

	public static void main(String[] args) throws IOException {
		MOA exp = new MOA();
		exp.run(1000000, true);
	}
}

菜谱中代码的输出如下所示(输出可能因机器而异):

 1000000 instances processed with 91.0458% accuracy in 6.769871032 seconds.

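Classifying multilabeled data points using Mulan

So far, we have seen multiclass classification, which aims to classify a data instance into one of several classes. Multilabeled data instances are data instances that can have more than one class or label at the same time. The machine learning tools we have used so far cannot handle data points with multiple target classes.

To classify multilabeled data points, we will use an open source Java library named Mulan. Mulan implements a variety of classification, ranking, feature selection, and model evaluation algorithms. Because Mulan has no GUI, the only ways to use it are through the command line or through its API. In this recipe, we will limit our focus to classifying a multilabeled dataset with two different classifiers and evaluating the classification.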

准备就绪

To perform this recipe, we need the following:

First, download Mulan. In our recipe, we will use version 1.5. The compressed file of the library can be found at https://sourceforge.net/projects/mulan/files/mulan-1-5/mulan-1.5.0.zip/download, which can be reached through http://mulan.sourceforge.net/download.html.
Unzip the compressed file and look inside the dist folder.
It contains three files, one of which is another compressed file named Mulan-1.5.0.zip. Unzip that file as well.
This time, you will see three or four JAR files. We will use three of them; the original text highlights them in a figure that is not reproduced here.
Add these three JAR files to your project as external libraries.
The data folder of the Mulan distribution contains a sample multilabel dataset. In our recipe, we will use the emotions.arff and emotions.xml files.

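Before starting our recipe, let's take a look at Mulan's dataset format. Mulan needs two files to specify a multilabel dataset. The first is an ARFF file (see Chapter 4, Learning from Data - Part 1). Labels should be specified as nominal attributes with the two values "0" and "1", where the former means that the label is absent and the latter that it is present. The following example shows a dataset with three numeric features in which each instance can have up to five classes or labels. In the @data section, the first three values are the actual feature values, followed by five 0s or 1s that indicate the absence or presence of each class: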

@relation MultiLabelExample

@attribute feature1 numeric
@attribute feature2 numeric
@attribute feature3 numeric
@attribute label1 {0, 1}
@attribute label2 {0, 1}
@attribute label3 {0, 1}
@attribute label4 {0, 1}
@attribute label5 {0, 1}

@data
2.3,5.6,1.4,0,1,1,0,0

另一方面，XML 文件指定标签以及它们之间的任何层次关系。以下示例是与上一示例相对应的 XML 文件:

<labels>
  <label name="label1"></label>
  <label name="label2"></label>
  <label name="label3"></label>
  <label name="label4"></label>
  <label name="label5"></label>
</labels>

更多详情，请参见http://mulan.sourceforge.net/format.html。

怎么做...

创建一个类和一个方法。我们将在main()方法中编写所有的代码:
```
        public class Mulan { public static void main(String[] args){
```

创建一个数据集并将emotions.arff和emotions.xml文件读取到数据集:

        MultiLabelInstances dataset = null; 
          try { 
            dataset = new  MultiLabelInstances("path to 
               emotions.arff", 
              "path to emotions.xml"); 
          } catch (InvalidDataFormatException e) { 
          }

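Next, we will create a RAkEL classifier and an MLkNN classifier. Note that RAkEL is a meta-classifier, meaning that it wraps another multilabel learner, and it is usually used with the LabelPowerset algorithm. LabelPowerset is a transformation-based algorithm that takes a single-label classifier (in our case, J48) as a parameter. MLkNN is an adaptation-based classifier built on k-nearest neighbors:
```
        RAkEL learner1 = new RAkEL(new LabelPowerset(new J48())); 
        MLkNN learner2 = new MLkNN();
```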
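
The idea behind a label powerset transformation can be sketched in a few lines of plain Java: every distinct combination of labels becomes one class of an ordinary single-label problem, which a classifier such as J48 can then learn. The sketch below is only an illustration under that assumption; the class LabelPowersetSketch and its method are hypothetical and do not reproduce Mulan's implementation.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

public class LabelPowersetSketch {

    // Maps each distinct label combination to one single-label class id,
    // numbering the combinations in the order in which they first appear.
    static int[] transform(int[][] labelRows) {
        Map<String, Integer> classIds = new LinkedHashMap<>();
        int[] classes = new int[labelRows.length];
        for (int r = 0; r < labelRows.length; r++) {
            StringBuilder key = new StringBuilder();
            for (int bit : labelRows[r]) {
                key.append(bit);
            }
            Integer id = classIds.get(key.toString());
            if (id == null) {
                id = classIds.size();
                classIds.put(key.toString(), id);
            }
            classes[r] = id;
        }
        return classes;
    }

    public static void main(String[] args) {
        // Label columns of four made-up instances with five labels each.
        int[][] labels = {
            {0, 1, 1, 0, 0},
            {1, 0, 0, 0, 0},
            {0, 1, 1, 0, 0},
            {0, 0, 0, 1, 1}
        };
        System.out.println(Arrays.toString(transform(labels)));
    }
}
```

With many labels the number of distinct combinations can grow quickly, which is one reason why RAkEL trains several label powerset models on small random subsets of the labels instead of one model on all of them.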

创建一个评估器来评估分类性能:

        Evaluator eval = new Evaluator();

因为我们将对两个分类器进行评估，所以我们需要声明一个可以有多个评估结果的变量:
```
        MultipleEvaluation results;
```
我们将进行 10 倍交叉验证评估。因此，声明一个你要创建的折叠数的变量:
```
         int numFolds = 10;
```

Next, evaluate your first learner and display the results:

         results = eval.crossValidate(learner1, dataset, numFolds); 
         System.out.println(results);

Finally, evaluate your second learner and display the results:

         results = eval.crossValidate(learner2, dataset, numFolds);
         System.out.println(results);

关闭方法和类:

} }

食谱的完整代码如下:

import mulan.classifier.lazy.MLkNN;
import mulan.classifier.meta.RAkEL;
import mulan.classifier.transformation.LabelPowerset;
import mulan.data.InvalidDataFormatException;
import mulan.data.MultiLabelInstances;
import mulan.evaluation.Evaluator;
import mulan.evaluation.MultipleEvaluation;
import weka.classifiers.trees.J48;

public class Mulan {
	public static void main(String[] args){
		MultiLabelInstances dataset = null;
		try {
			dataset = new MultiLabelInstances(
			    "path to emotions.arff", "path to emotions.xml");
		} catch (InvalidDataFormatException e) {
			// Report the problem instead of silently ignoring it
			e.printStackTrace();
		}
		RAkEL learner1 = new RAkEL(new LabelPowerset(new J48()));
		MLkNN learner2 = new MLkNN();
		Evaluator eval = new Evaluator();
		MultipleEvaluation results;
		int numFolds = 10;
		results = eval.crossValidate(learner1, dataset, numFolds);
		System.out.println(results);
		results = eval.crossValidate(learner2, dataset, numFolds);
		System.out.println(results);
	}
}

The output of the code is the performance of the two learners you chose:

Fold 1/10
Fold 2/10
Fold 3/10
Fold 4/10
Fold 5/10
Fold 6/10
Fold 7/10
Fold 8/10
Fold 9/10
Fold 10/10
Hamming Loss: 0.2153±0.0251
Subset Accuracy: 0.2562±0.0481
Example-Based Precision: 0.6325±0.0547
Example-Based Recall: 0.6307±0.0560
Example-Based F Measure: 0.5990±0.0510
Example-Based Accuracy: 0.5153±0.0484
Example-Based Specificity: 0.8607±0.0213
........................................
Fold 1/10
Fold 2/10
Fold 3/10
Fold 4/10
Fold 5/10
Fold 6/10
Fold 7/10
Fold 8/10
Fold 9/10
Fold 10/10
Hamming Loss: 0.1951±0.0243
Subset Accuracy: 0.2831±0.0538
Example-Based Precision: 0.6883±0.0655
Example-Based Recall: 0.6050±0.0578
Example-Based F Measure: 0.6138±0.0527
Example-Based Accuracy: 0.5326±0.0515
Example-Based Specificity: 0.8994±0.0271
........................................
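
Two of the reported measures, Hamming loss and subset accuracy, are easy to compute by hand. The following plain-Java sketch uses their standard definitions on a tiny made-up example: Hamming loss is the fraction of individual label decisions that are wrong, and subset accuracy is the fraction of instances whose complete label set is predicted exactly. The class MultiLabelMetricsSketch is a hypothetical helper and is independent of Mulan.

```java
import java.util.Arrays;

public class MultiLabelMetricsSketch {

    // Fraction of individual label decisions that differ from the truth.
    static double hammingLoss(int[][] actual, int[][] predicted) {
        int wrong = 0;
        int total = 0;
        for (int i = 0; i < actual.length; i++) {
            for (int j = 0; j < actual[i].length; j++) {
                if (actual[i][j] != predicted[i][j]) {
                    wrong++;
                }
                total++;
            }
        }
        return (double) wrong / total;
    }

    // Fraction of instances whose whole label vector is predicted exactly.
    static double subsetAccuracy(int[][] actual, int[][] predicted) {
        int exact = 0;
        for (int i = 0; i < actual.length; i++) {
            if (Arrays.equals(actual[i], predicted[i])) {
                exact++;
            }
        }
        return (double) exact / actual.length;
    }

    public static void main(String[] args) {
        // Two made-up instances with three labels each.
        int[][] actual    = { {1, 0, 1}, {0, 1, 0} };
        int[][] predicted = { {1, 0, 0}, {0, 1, 0} };
        System.out.println("Hamming loss    = " + hammingLoss(actual, predicted));
        System.out.println("Subset accuracy = " + subsetAccuracy(actual, predicted));
    }
}
```

Here one of six label decisions is wrong, and one of the two instances is predicted exactly. A lower Hamming loss is better, while a higher subset accuracy is better, so in the output above the second learner (MLkNN) does slightly better on both measures.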

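6. Retrieving Information from Text Data

In this chapter, we will cover the following recipes:

Detecting tokens (words) using Java
Detecting sentences using Java
Detecting tokens (words) and sentences using OpenNLP
Retrieving lemmas and parts of speech and recognizing named entities from tokens using Stanford CoreNLP
Measuring text similarity with cosine similarity using Java 8
Extracting topics from text documents using Mallet
Classifying text documents using Mallet
Classifying text documents using Weka

Introduction

Because web data is so widely available and most of it is in text format, text is the type of data that data scientists handle the most nowadays. Many aspects of text data can be retrieved from documents, articles, blog posts, social media updates, newswires, and anywhere else you can think of.

Many Java-based tools are available to data scientists for retrieving information from text data, and there are also tools that accomplish a variety of data science tasks. In this chapter, we limit our scope to a few data science tasks: extracting simple text features such as sentences and words, classifying documents with machine learning, extracting and modeling topics, extracting keywords from documents, and recognizing named entities.

Detecting tokens (words) using Java

One of the most common tasks a data scientist needs to do with text data is to detect the tokens in it. This task is called tokenization. Although a "token" can refer to a word, a symbol, a phrase, or any other meaningful unit of text, in this chapter we will treat words as tokens, since a word is a reasonable unit of text to work with. However, the notion of a word token varies from person to person: some need words only, some prefer symbols to be omitted during detection, and some want to keep punctuation in the words for more context. With these different needs in mind, in this recipe we will use three different techniques that, when applied to the same string, produce three different results. The techniques involve string tokenization, a break iterator, and regular expressions. You need to decide which technique to use.

Keep in mind that we have picked only three methods, although many other options exist; they are for you to explore.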

准备就绪

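Go to https://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html and read through the documentation on the regular expression patterns supported by the Pattern class.
Go to https://docs.oracle.com/javase/7/docs/api/java/text/BreakIterator.html and look at the examples. They will give you an idea of how a break iterator is used.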

怎么做...

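First, we will create a method that uses Java's StringTokenizer class to detect tokens. The method takes the input sentence, tokenizes it with this class, and prints the tokens:
```
        public void useTokenizer(String input){ 
```

Call the StringTokenizer constructor with the input sentence as its argument:

        StringTokenizer tokenizer = new StringTokenizer(input);

Create a String object to hold the tokens:
```
        String word =""; 
```

Iterate through the tokenizer, get each word, and print it on the console:

        while(tokenizer.hasMoreTokens()){ 
          word = tokenizer.nextToken(); 
          System.out.println(word); 
        }

Close the method:

        }

For a sentence such as "Let's get this vis-a-vis", he said, "these boys' marks are really that well?", the output of this method is as follows:

        "Let's 
        get 
        this 
        vis-a-vis", 
        he 
        said, 
        "these 
        boys' 
        marks 
        are 
        really 
        that 
        well?"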

Second, we will create a method that uses Java's BreakIterator class to iterate through each word in a text. You will see that the code is slightly more complex than the first method in this recipe.

The method takes the input sentence as its argument:
```
        public void useBreakIterator(String input){ 
```

Then, use the BreakIterator class to create a tokenizer:

        BreakIterator tokenizer = BreakIterator.getWordInstance();

Apply the tokenizer to the input sentence:
```
        tokenizer.setText(input); 
```

Get the starting index of the tokenizer:

        int start = tokenizer.first();

Use a for loop to get each token as a string and print it on the console, as follows:

```java
        for (int end = tokenizer.next(); 
             end != BreakIterator.DONE; 
             start = end, end = tokenizer.next()) { 
             System.out.println(input.substring(start,end)); 
        } 

```

Close the method:

```java
        } 

```

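For a sentence such as "Let's get this vis-a-vis", he said, "these boys' marks are really that well?", the output of this method is as follows (the blank lines are the whitespace segments that the break iterator returns between words):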

```java
 " 
        Let's 

        get 

        this 

        vis-a-vis 
        " 
        , 

        he 

        said 
        , 

        " 
        these 

        boys 
        ' 

        marks 

        are 

        really 

        that 

        well 
        ? 
        "

```

Finally, we will create a method that tokenizes the input text using a regular expression:

```java
        public void useRegEx(String input){ 

```

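Use a pattern with a regular expression that matches words, including words with one or more hyphens and words with an apostrophe inside or at the end, while leaving out quotation marks and other surrounding punctuation. If you need some specific pattern, just use your own regular expression in the following line: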

```java
        Pattern pattern = Pattern.compile("\\w[\\w-]+('\\w*)?"); 

```

Create a matcher for the input text from the pattern:

```java
        Matcher matcher = pattern.matcher(input); 

```

Use the matcher to retrieve all the words from the input text:

```java
        while ( matcher.find() ) { 
          System.out.println(input.substring(matcher.start(), 
            matcher.end())); 
        } 

```

Close the method:

        }

For a sentence such as "Let's get this vis-a-vis", he said, "these boys' marks are really that well?", the output of this method is as follows:

        Let's 
        get 
        this 
        vis-a-vis 
        he 
        said 
        these 
        boys' 
        marks 
        are 
        really 
        that 
        well
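One property of the recipe's pattern is worth noting: `\w[\w-]+` requires at least two word characters, so one-letter words such as "I" or "a" are silently dropped. The sketch below contrasts the recipe's pattern with a relaxed variant; the class name `RegexTokens`, the helper `tokens`, and the `RELAXED` pattern are illustrative assumptions and not part of the original recipe.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexTokens {
    // Pattern from the recipe: needs at least two word characters per token.
    public static final Pattern RECIPE = Pattern.compile("\\w[\\w-]+('\\w*)?");
    // Hypothetical variant: '*' instead of '+' also admits one-letter words.
    public static final Pattern RELAXED = Pattern.compile("\\w[\\w-]*('\\w*)?");

    // Collects every match of the pattern into a list instead of printing it.
    public static List<String> tokens(Pattern pattern, String input) {
        List<String> result = new ArrayList<>();
        Matcher matcher = pattern.matcher(input);
        while (matcher.find()) {
            result.add(matcher.group());
        }
        return result;
    }

    public static void main(String[] args) {
        String input = "I have a pen";
        System.out.println(tokens(RECIPE, input));  // [have, pen]
        System.out.println(tokens(RELAXED, input)); // [I, have, a, pen]
    }
}
```

Whether one-letter words matter depends on your task; for the recipe's sample sentence both patterns give the same tokens, since it contains no one-letter words.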

The complete code for this recipe is as follows:

import java.text.BreakIterator; 
import java.util.StringTokenizer; 
import java.util.regex.Matcher; 
import java.util.regex.Pattern; 

public class WordDetection { 
   public static void main(String[] args){ 
      String input = "\"Let's get this vis-a-vis\", he said, " 
        + "\"these boys' marks are really that well?\""; 
      WordDetection wordDetection = new WordDetection(); 
      wordDetection.useTokenizer(input); 
      wordDetection.useBreakIterator(input); 
      wordDetection.useRegEx(input); 

   } 

   public void useTokenizer(String input){ 
      System.out.println("Tokenizer"); 
      StringTokenizer tokenizer = new StringTokenizer(input); 
      String word =""; 
      while(tokenizer.hasMoreTokens()){ 
          word = tokenizer.nextToken(); 
          System.out.println(word); 
      } 
   } 

   public void useBreakIterator(String input){ 
      System.out.println("Break Iterator"); 
      BreakIterator tokenizer = BreakIterator.getWordInstance(); 
        tokenizer.setText(input); 
        int start = tokenizer.first(); 
        for (int end = tokenizer.next(); 
             end != BreakIterator.DONE; 
             start = end, end = tokenizer.next()) { 
             System.out.println(input.substring(start,end)); 
        } 
   } 

   public void useRegEx(String input){ 
      System.out.println("Regular Expression"); 
      Pattern pattern = Pattern.compile("\\w[\\w-]+('\\w*)?"); 
      Matcher matcher = pattern.matcher(input); 

      while ( matcher.find() ) { 
          System.out.println(input.substring(matcher.start(), 
            matcher.end())); 
      } 
   } 
}
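The three methods above print their tokens, and the break iterator also returns whitespace and punctuation segments. When the tokens feed further analysis, it is usually handier to collect them in a list. The following is a minimal sketch under that assumption; the class `WordCollector` and the rule "keep a segment only if it contains a letter or digit" are illustrative choices, not something the recipe prescribes.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class WordCollector {
    // Hypothetical helper: returns only the BreakIterator segments
    // that contain at least one letter or digit.
    public static List<String> words(String input) {
        List<String> result = new ArrayList<>();
        BreakIterator iterator = BreakIterator.getWordInstance(Locale.US);
        iterator.setText(input);
        int start = iterator.first();
        for (int end = iterator.next();
             end != BreakIterator.DONE;
             start = end, end = iterator.next()) {
            String segment = input.substring(start, end);
            // Skip pure whitespace and pure punctuation segments
            if (segment.codePoints().anyMatch(Character::isLetterOrDigit)) {
                result.add(segment);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(words("Hello, world!")); // [Hello, world]
    }
}
```

The exact segmentation of hyphenated words or apostrophes may vary with the Java version and locale, so check the result on your own text before relying on it.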

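Detecting sentences using Java

In this recipe, we will see how to detect sentences so that we can use them for further analysis. Sentences are a very important text unit for data scientists, who can use them in different routine exercises such as classification. To detect sentences in a text, we will use Java's BreakIterator class.

Getting ready

Go to https://docs.oracle.com/javase/7/docs/api/java/text/BreakIterator.html and look at the examples. This will give you an idea of how a break iterator is used.

How to do it...

To test the code in this recipe, we will use two sentences that can confuse many regular-expression-based solutions. The two sentences are: My name is Rushdi Shams. You can use Dr. before my name as I have a Ph.D. but I am a bit shy to use it. Interestingly, we will see that Java's BreakIterator class handles them correctly.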

Create a method that takes the test string as its argument:

        public void useSentenceIterator(String source){

Create a sentence iterator object of the BreakIterator class:

        BreakIterator iterator =  
          BreakIterator.getSentenceInstance(Locale.US);

Use the iterator on the test string:
```
        iterator.setText(source); 
```
Get the start index of the test string into an integer variable:
```
        int start = iterator.first(); 
```
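Finally, iterate through all the sentences in the iterator and print them. To loop through the sentences, you need another variable named end that points to the end index of a sentence: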

        for (int end = iterator.next(); end != BreakIterator.DONE; 
          start = end, end = iterator.next()) { 
          System.out.println(source.substring(start,end)); 
        }

The output of the code is as follows:

My name is Rushdi Shams. 
You can use Dr. before my name as I have a Ph.D. but I am a bit shy to use it.

The complete code for the recipe is as follows:

import java.text.BreakIterator; 
import java.util.Locale; 
public class SentenceDetection { 
   public void useSentenceIterator(String source){ 
      BreakIterator iterator = 
        BreakIterator.getSentenceInstance(Locale.US); 
      iterator.setText(source); 
      int start = iterator.first(); 
      for (int end = iterator.next(); 
          end != BreakIterator.DONE; 
          start = end, end = iterator.next()) { 
        System.out.println(source.substring(start,end)); 
      } 
   } 
   public static void main(String[] args){ 
      SentenceDetection detection = new SentenceDetection(); 
      String test = "My name is Rushdi Shams. You can use Dr. before my " 
        + "name as I have a Ph.D. but I am a bit shy to use it."; 
      detection.useSentenceIterator(test); 
   } 
}
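To see why the recipe describes the test text as confusing for regular-expression-based solutions, consider a naive splitter that breaks after every sentence-ending mark followed by whitespace. This is a sketch for comparison only; the class `NaiveSentenceSplitter` and its pattern are assumptions introduced here, not a technique from the recipe.

```java
import java.util.Arrays;
import java.util.List;

public class NaiveSentenceSplitter {
    // Splits after '.', '!' or '?' whenever whitespace follows.
    // There is no handling of abbreviations such as "Dr." or "Ph.D."
    public static List<String> split(String text) {
        return Arrays.asList(text.trim().split("(?<=[.!?])\\s+"));
    }

    public static void main(String[] args) {
        String test = "My name is Rushdi Shams. You can use Dr. before my "
            + "name as I have a Ph.D. but I am a bit shy to use it.";
        for (String piece : split(test)) {
            System.out.println(piece);
        }
    }
}
```

On the recipe's test text this naive approach yields four fragments, cutting after "Dr." and "Ph.D.", whereas the BreakIterator output shown above contains the expected two sentences.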

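Detecting tokens (words) and sentences using OpenNLP

The previous two recipes in this chapter detected tokens (words) and sentences using legacy Java classes and their methods. In this recipe, we combine the two tasks, detecting tokens and sentences, using Apache's open source library named OpenNLP. The reason for introducing OpenNLP for two tasks that can be done well with the legacy methods is to acquaint data scientists with a very handy tool that achieves high accuracy on several information retrieval tasks over standard and classic corpora. The OpenNLP home page is at https://opennlp.apache.org/. A strong argument for using this library for tokenization, sentence segmentation, part-of-speech tagging, named entity recognition, chunking, parsing, and co-reference resolution is that you can train your own models on your own corpus of articles or documents.

Getting ready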

At the time of writing this book, version 1.6.0 was the latest for OpenNLP, and you are therefore encouraged to use this version. Go to https://opennlp.apache.org/download.html and download the binary zip file of version 1.6.0.
After downloading the file, unzip it. In the distribution, you will find a directory named lib.
The lib directory contains two JAR files. From this directory, add the opennlp-tools-1.6.0.jar file as an external library to the Eclipse project that you need to create for this recipe.

For this recipe, you will use the pre-built token and sentence detection models provided by OpenNLP. Therefore, you need to download the models and save them on your hard drive. Remember the location of these models so that you can reference them in your code.

Go to http://opennlp.sourceforge.net/models-1.5/ and download the English tokenizer and sentence detector models. Save them in your C:/ drive. Now you are ready to write some code.

How to do it...

In this recipe, you will create a method that uses the tokenizer and sentence detector models of OpenNLP to tokenize and fragment a source text into sentences. As parameters, you will send the following:
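- A string that contains the source text.
- The path of the model.
- A string that indicates whether you want to tokenize the source text or split it into sentence units. For the former, the choice you send is word; for the latter, the choice is sentence.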
```
        public void useOpenNlp(String sourceText, String modelPath, 
          String choice) throws IOException{
```

First, read the model as an input stream:

        InputStream modelIn = null; 
        modelIn = new FileInputStream(modelPath);

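Then, create an if block for the choice sentence, which will contain the code to detect the sentence fragments of the source text:
```
        if(choice.equalsIgnoreCase("sentence")){ 
```
Create a sentence model from the pre-built model, and then close the stream that was used to read the pre-built model: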
```
        SentenceModel model = new SentenceModel(modelIn); 
        modelIn.close(); 
```

Using the model, create a sentence detector:

        SentenceDetectorME sentenceDetector = new 
          SentenceDetectorME(model);

Use the sentence detector to detect the sentences of the source text. The resulting sentences are returned as an array of strings:
```
        String sentences[] = sentenceDetector.sentDetect(sourceText); 
```

Now, print the sentences on the console and close the if block:

       System.out.println("Sentences: "); 
         for(String sentence:sentences){ 
            System.out.println(sentence); 
         } 
       }

Next, create an else if block that will hold the code for tokenizing the source text:
```
        else if(choice.equalsIgnoreCase("word")){ 
```

Create a tokenizer model from the pre-built model, and close the stream of the pre-built model:

        TokenizerModel model = new TokenizerModel(modelIn); 
        modelIn.close();

Using the model, create a tokenizer:

```java
        Tokenizer tokenizer = new TokenizerME(model); 

```

Use the tokenizer to extract the words from the source text. The extracted tokens are put into an array of strings:

```java
        String tokens[] = tokenizer.tokenize(sourceText); 

```

Finally, print the tokens on the console and close the else if block:

```java
       System.out.println("Words: "); 
         for(String token:tokens){ 
            System.out.println(token); 
         } 
       } 

```

You will need an else block in case the user sends an invalid choice:

```java
       else{ 
         System.out.println("Error in choice"); 
         modelIn.close(); 
         return; 
       } 

```

Close the method:

        }

The complete source code for this recipe is as follows:

import java.io.FileInputStream; 
import java.io.IOException; 
import java.io.InputStream; 

import opennlp.tools.sentdetect.SentenceDetectorME; 
import opennlp.tools.sentdetect.SentenceModel; 
import opennlp.tools.tokenize.Tokenizer; 
import opennlp.tools.tokenize.TokenizerME; 
import opennlp.tools.tokenize.TokenizerModel; 

public class OpenNlpSenToken { 
   public static void main(String[] args){ 
      OpenNlpSenToken openNlp = new OpenNlpSenToken(); 
      try { 
         openNlp.useOpenNlp("My name is Rushdi Shams. " 
               + "You can use Dr. before my name as I have a Ph.D. " 
               + "but I am a bit shy to use it.", "C:/en-sent.bin", 
                 "sentence"); 
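         openNlp.useOpenNlp("\"Let's get this vis-a-vis\", he said, " 
               + "\"these boys' marks are really that well?\"", 
               "C:/en-token.bin", "word"); 
      } catch (IOException e) { 
         // Report the failure (for example, a missing model file) instead of ignoring it 
         e.printStackTrace(); 
      } 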
   } 

   public void useOpenNlp(String sourceText, String modelPath, String 
       choice) throws IOException{ 
      InputStream modelIn = null; 
      modelIn = new FileInputStream(modelPath); 

      if(choice.equalsIgnoreCase("sentence")){ 
         SentenceModel model = new SentenceModel(modelIn); 
         modelIn.close(); 
         SentenceDetectorME sentenceDetector = new 
           SentenceDetectorME(model); 
         String sentences[] = sentenceDetector.sentDetect(sourceText); 
         System.out.println("Sentences: "); 
         for(String sentence:sentences){ 
            System.out.println(sentence); 
         } 
      } 
      else if(choice.equalsIgnoreCase("word")){ 
         TokenizerModel model = new TokenizerModel(modelIn); 
         modelIn.close(); 
         Tokenizer tokenizer = new TokenizerME(model); 
         String tokens[] = tokenizer.tokenize(sourceText); 
         System.out.println("Words: "); 
         for(String token:tokens){ 
            System.out.println(token); 
         } 
      } 
      else{ 
         System.out.println("Error in choice"); 
         modelIn.close(); 
         return; 
      } 
   } 
}

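You can now compare the output of this source code with that of the previous two recipes, since the same source texts were used.

Note

For other uses of the OpenNLP library, readers of this book are strongly encouraged to look at https://opennlp.apache.org/documentation/1.6.0/manual/opennlp.html.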

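Retrieving lemmas and parts of speech, and recognizing named entities, from tokens using Stanford CoreNLP

Now that we know how to extract tokens or words from a given text, we will see how to obtain different types of information from the tokens, such as their lemmas, their parts of speech, and whether a token is a named entity.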

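Lemmatization is the process of grouping together the inflected forms of a word so that they can be analyzed as a single text unit. It is similar to stemming, with the difference that stemming does not consider context when grouping. Lemmatization is therefore more useful for text data analysis than stemming, but it requires more computational power.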
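To make the contrast with stemming concrete, here is a deliberately crude suffix-stripping stemmer. It is a hypothetical illustration only (the class `NaiveStemmer` and its suffix list are assumptions, and it is not Porter's or any other published algorithm): it looks at the characters of a single word and nothing else.

```java
public class NaiveStemmer {
    // Hypothetical, deliberately crude suffix stripper.
    // It inspects only the word itself, never its context.
    public static String stem(String word) {
        String w = word.toLowerCase();
        for (String suffix : new String[] {"ing", "ed", "es", "s"}) {
            // Strip the first matching suffix, keeping at least three characters
            if (w.endsWith(suffix) && w.length() - suffix.length() >= 3) {
                return w.substring(0, w.length() - suffix.length());
            }
        }
        return w;
    }

    public static void main(String[] args) {
        System.out.println(stem("watching")); // watch
        System.out.println(stem("is"));       // is
        System.out.println(stem("famous"));   // famou
    }
}
```

A rule like this turns "watching" into "watch", but it leaves "is" untouched and mangles "famous" into "famou". A lemmatizer, as the output later in this recipe shows, maps "is" to "be" and keeps "famous" intact.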

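The part-of-speech tags of the tokens in an article or document are widely used as features in many machine learning models that can be useful to data scientists.

Named entities, on the other hand, are very important for news article data analysis and have a very high impact on research related to business corporations.

In this recipe, we will retrieve this information from text using Stanford CoreNLP 3.7.0, which was the latest version at the time of writing this chapter.

Getting ready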

Go to http://stanfordnlp.github.io/CoreNLP/download.html and download Stanford CoreNLP 3.7.0.
The files that you downloaded in step 1 are compressed. Decompress them to find the directory structure of the distribution, which includes its JAR files.
Add all the JAR files of the distribution as external JAR files to your existing project, and you are ready to write some code.

How to do it...

Create a class and a main() method that will hold all the code of this recipe:

        public class Lemmatizer { 
          public static void main(String[] args){

Next, create a Stanford CoreNLP pipeline. Through this pipeline, you will provide a number of property values to the CoreNLP engine:
```
        StanfordCoreNLP pipeline; 
```

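Create a Properties object and add some properties to it. In our case, we will use the annotators for tokenization, sentence splitting, part-of-speech tagging, lemmatization, and named entity recognition: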

        Properties props = new Properties(); 
        props.put("annotators", "tokenize, ssplit, pos, lemma, ner");

Next, create a CoreNLP object with these properties:

        pipeline = new StanfordCoreNLP(props, false);

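Create the string for which you need to generate lemmas:

        String text = "Hamlet's mother, Queen Gertrude, says this " 
            + "famous line while watching The Mousetrap. " 
            + "Gertrude is talking about the queen in the play. " 
            + "She feels that the play-queen seems insincere because " 
            + "she repeats so dramatically that she'll never remarry " 
            + "due to her undying love of her husband.";

Next, create an Annotation from the given text:

        Annotation document = pipeline.process(text);

Finally, for each token, get the original token and its lemma, together with its part-of-speech tag and named entity tag. You do not need the original token, but it can be handy for seeing the difference between the word form and the lemma form. Use the Annotation variable named document that you created in the previous step: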

        for(CoreMap sentence: document.get(SentencesAnnotation.class))
         {     
            for(CoreLabel token: sentence.get(TokensAnnotation.class))
             {        
                String word = token.get(TextAnnotation.class);       
                String lemma = token.get(LemmaAnnotation.class); 
                String pos = token.get(PartOfSpeechAnnotation.class); 
                String ne = token.get(NamedEntityTagAnnotation.class); 
                System.out.println(word + "-->" + lemma + "-->" + pos 
                + "-->" + ne); 
            } 
         }

The outer loop repeats these steps for every sentence.

Close the method and the class:

        } 
        }

Part of the output of the code is as follows:

... 
Queen-->Queen-->NNP-->PERSON 
Gertrude-->Gertrude-->NNP-->PERSON 
,-->,-->,-->O 
says-->say-->VBZ-->O 
this-->this-->DT-->O 
famous-->famous-->JJ-->O 
line-->line-->NN-->O 
while-->while-->IN-->O 
watching-->watch-->VBG-->O 
The-->the-->DT-->O 
Mousetrap-->mousetrap-->NN-->O 
.-->.-->.-->O 
Gertrude-->Gertrude-->NNP-->PERSON 
is-->be-->VBZ-->O 
talking-->talk-->VBG-->O 
...

The complete code for this recipe is as follows:

import edu.stanford.nlp.ling.CoreAnnotations.LemmaAnnotation; 
import edu.stanford.nlp.ling.CoreAnnotations.NamedEntityTagAnnotation; 
import edu.stanford.nlp.ling.CoreAnnotations.PartOfSpeechAnnotation; 
import edu.stanford.nlp.ling.CoreAnnotations.SentencesAnnotation; 
import edu.stanford.nlp.ling.CoreAnnotations.TextAnnotation; 
import edu.stanford.nlp.ling.CoreAnnotations.TokensAnnotation; 
import edu.stanford.nlp.ling.CoreLabel; 
import edu.stanford.nlp.pipeline.Annotation; 
import edu.stanford.nlp.pipeline.StanfordCoreNLP; 
import edu.stanford.nlp.util.CoreMap; 
import java.util.Properties; 

public class Lemmatizer { 
    public static void main(String[] args){ 
      StanfordCoreNLP pipeline; 
        Properties props = new Properties(); 
        props.put("annotators", "tokenize, ssplit, pos, lemma, ner"); 
        pipeline = new StanfordCoreNLP(props, false); 
        String text = "Hamlet's mother, Queen Gertrude, says this " 
            + "famous line while watching The Mousetrap. " 
            + "Gertrude is talking about the queen in the play. " 
            + "She feels that the play-queen seems insincere because " 
            + "she repeats so dramatically that she'll never remarry " 
            + "due to her undying love of her husband."; 
        Annotation document = pipeline.process(text);   

        for(CoreMap sentence: document.get(SentencesAnnotation.class)) 
          {     
            for(CoreLabel token: sentence.get(TokensAnnotation.class))
             {        
                String word = token.get(TextAnnotation.class);       
                String lemma = token.get(LemmaAnnotation.class); 
                String pos = token.get(PartOfSpeechAnnotation.class); 
                String ne = token.get(NamedEntityTagAnnotation.class); 
                System.out.println(word + "-->" + lemma + "-->" + pos 
                + "-->" + ne); 
            } 
        } 
    } 
}

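Measuring text similarity with cosine similarity using Java 8

Data scientists often measure the distance or similarity between two data points, sometimes for classification or clustering, sometimes for detecting outliers, and in many other cases. When the data points are texts, the traditional distance or similarity measures cannot be used directly. Many standard and classic, as well as emerging and novel, similarity measures are available for comparing two or more text data points. In this recipe, we will use a measure called cosine similarity, which is considered the de facto standard in the information retrieval community and is therefore widely used, to compute the similarity between two sentences given as strings.

Getting ready

Although readers can get a comprehensive overview of the measure at https://en.wikipedia.org/wiki/Cosine_similarity, let us look at the algorithm for applying the formula to two sentences:

First, extract the words from the two strings.
For each word in the respective string, calculate its frequency. Frequency here means the number of times the word occurs in each sentence. Let A be the vector of words and their frequencies from the first string, and B be the vector of words and their frequencies from the second string.
Find the unique words in each string by removing duplicates.
Find the list of words that the two strings have in common.
The numerator of the cosine similarity formula is the dot product of the vectors A and B.
The denominator of the formula is the arithmetic product of the magnitudes of the vectors A and B.

Note

In general, a cosine similarity score lies between -1 (meaning exactly opposite) and 1 (meaning exactly the same), with a score of 0 indicating decorrelation. Because word frequencies are never negative, the score of two sentences computed in this recipe lies between 0 and 1.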
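Written out, with $a_w$ and $b_w$ denoting the frequency of word $w$ in the first and second string (symbols introduced here for illustration), the steps above amount to:

$$
\cos(A, B) = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert}
           = \frac{\sum_{w} a_w \, b_w}{\sqrt{\sum_{w} a_w^{2}} \; \sqrt{\sum_{w} b_w^{2}}}
$$

As a worked example, take the sentences used at the end of this recipe, "To be, or not to be: that is the question." and "Frailty, thy name is woman!". The only shared word is "is", so the numerator is $1 \cdot 1 = 1$. In the first sentence "to" and "be" occur twice and six other words occur once, so $\lVert A \rVert = \sqrt{2^2 + 2^2 + 6 \cdot 1^2} = \sqrt{14}$; the second sentence has five words occurring once each, so $\lVert B \rVert = \sqrt{5}$. The score is therefore

$$
\cos(A, B) = \frac{1}{\sqrt{14}\,\sqrt{5}} = \frac{1}{\sqrt{70}} \approx 0.1195
$$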

怎么做...

Create a method named calculateCosine that takes two String arguments. These are the strings whose cosine similarity you will calculate:
```
        public double calculateCosine(String s1, String s2){ 
```
Use the power of regular expressions and Java 8's parallelization facility to tokenize the given strings. This gives you two streams of words, one for each of the two strings:
```
        Stream<String> stream1 = 
          Stream.of(s1.toLowerCase().split("\\W+")).parallel(); 
        Stream<String> stream2 = 
          Stream.of(s2.toLowerCase().split("\\W+")).parallel(); 
```
Tip

For tokenization, you can use any of the methods from the first recipe in this chapter, but the approach shown in this step is short and convenient, and makes use of the power of regular expressions and Java 8.

Get the frequency of each word in each string. Again, you will use Java 8 to achieve this. The result will be two maps:

        Map<String, Long> wordFreq1 = stream1 
          .collect(Collectors.groupingBy
          (String::toString,Collectors.counting())); 
        Map<String, Long> wordFreq2 = stream2 
          .collect(Collectors.groupingBy
          (String::toString,Collectors.counting()));

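From the list of words of each sentence, keep only the unique words by removing duplicates. To do so, create two sets from the maps that you created in the previous step: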
```
        Set<String> wordSet1 = wordFreq1.keySet(); 
        Set<String> wordSet2 = wordFreq2.keySet(); 
```
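Because you will next compute the dot product of the two maps for the numerator of the cosine similarity measure, you need to create the list of words that are common to both strings: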
```
        Set<String> intersection = new HashSet<String>(wordSet1); 
        intersection.retainAll(wordSet2); 
```

Next, compute the numerator of the formula, which is the dot product of the two maps:

        double numerator = 0; 
          for (String common: intersection){ 
          numerator += wordFreq1.get(common) * wordFreq2.get(common); 
        }

From this point on, you will be preparing to compute the denominator of the formula, which is the arithmetic product of the magnitudes of the two maps.

First, create variables to hold the magnitudes of the vectors (which are held in map data structures):
```
        double param1 = 0, param2 = 0; 
```

Now, compute the magnitude of your first vector:

        for(String w1: wordSet1){ 
          param1 += Math.pow(wordFreq1.get(w1), 2); 
        } 
        param1 = Math.sqrt(param1);

Next, compute the magnitude of the second vector:

        for(String w2: wordSet2){ 
          param2 += Math.pow(wordFreq2.get(w2), 2); 
        } 
        param2 = Math.sqrt(param2);

Now that you have all the parameters of the denominator, multiply the magnitudes to get it:

```java
        double denominator = param1 * param2; 

```

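Finally, put the numerator and denominator in place to compute the cosine similarity of the two strings. Return the score to the caller, and close the method: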

```java
        double cosineSimilarity = numerator/denominator; 
        return cosineSimilarity; 
        } 

```

The complete code for this recipe is as follows:

import java.util.HashSet; 
import java.util.Map; 
import java.util.Set; 
import java.util.stream.Collectors; 
import java.util.stream.Stream; 

public class CosineSimilarity { 
   public double calculateCosine(String s1, String s2){ 
      //tokenization in parallel with Java 8 
      Stream<String> stream1 = 
        Stream.of(s1.toLowerCase().split("\\W+")).parallel(); 
      Stream<String> stream2 = 
        Stream.of(s2.toLowerCase().split("\\W+")).parallel(); 

      //word frequency maps for two strings 
      Map<String, Long> wordFreq1 = stream1 
         .collect(Collectors.groupingBy
           (String::toString,Collectors.counting())); 
      Map<String, Long> wordFreq2 = stream2 
         .collect(Collectors.groupingBy
            (String::toString,Collectors.counting())); 

      //unique words for each string 
      Set<String> wordSet1 = wordFreq1.keySet(); 
      Set<String> wordSet2 = wordFreq2.keySet(); 

      //common words of two strings 
      Set<String> intersection = new HashSet<String>(wordSet1); 
      intersection.retainAll(wordSet2); 

      //numerator of cosine formula. s1.s2 
      double numerator = 0; 
      for (String common: intersection){ 
         numerator += wordFreq1.get(common) * wordFreq2.get(common); 
      } 

      //denominator of cosine formula has two parameters 
      double param1 = 0, param2 = 0; 

      //sqrt (sum of squared of s1 word frequencies) 
      for(String w1: wordSet1){ 
         param1 += Math.pow(wordFreq1.get(w1), 2); 
      } 
      param1 = Math.sqrt(param1); 

      //sqrt (sum of squared of s2 word frequencies) 
      for(String w2: wordSet2){ 
         param2 += Math.pow(wordFreq2.get(w2), 2); 
      } 
      param2 = Math.sqrt(param2); 

      //denominator of cosine formula: sqrt(sum(s1^2)) * sqrt(sum(s2^2)) 
      double denominator = param1 * param2; 

      //cosine measure 
      double cosineSimilarity = numerator/denominator; 
      return cosineSimilarity; 
   }//end method to calculate cosine similarity of two strings 

   public static void main(String[] args){ 
      CosineSimilarity cos = new CosineSimilarity(); 
      System.out.println(cos.calculateCosine(
         "To be, or not to be: that is the question.", 
         "Frailty, thy name is woman!")); 
      System.out.println(cos.calculateCosine(
         "The lady doth protest too much, methinks.", 
         "Frailty, thy name is woman!")); 
   } 
}

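If you run the code, you will find the following output:

0.11952286093343936
0.0

The output means that the cosine similarity between the sentences To be, or not to be: that is the question. and Frailty, thy name is woman! is about 0.12, whereas the similarity between The lady doth protest too much, methinks. and Frailty, thy name is woman! is 0.0.

Tip

In this recipe, you did not remove stop words from the strings. To get unbiased results, it is better to remove stop words from both text units.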
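One way to act on this tip is to filter stop words out while building the frequency map. The sketch below assumes a tiny hand-picked stop-word set purely for illustration; the class `StopWordFilter` and the list are hypothetical, and a real project would load a fuller list suited to its language and domain.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class StopWordFilter {
    // Illustrative, hand-picked stop words; an assumption, not a standard list.
    private static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList(
        "a", "an", "the", "is", "to", "or", "not", "be", "that", "too"));

    // Builds the word-frequency map as in the recipe, minus stop words.
    public static Map<String, Long> wordFrequencies(String text) {
        return Stream.of(text.toLowerCase().split("\\W+"))
            // split can yield an empty first token when text starts with punctuation
            .filter(word -> !word.isEmpty())
            .filter(word -> !STOP_WORDS.contains(word))
            .collect(Collectors.groupingBy(word -> word, Collectors.counting()));
    }

    public static void main(String[] args) {
        System.out.println(
            wordFrequencies("To be, or not to be: that is the question."));
        // {question=1}
    }
}
```

With this filter in place, the two Shakespeare lines compared above would no longer share the word "is", so their similarity would drop to 0; this is the kind of bias the tip refers to.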

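Extracting topics from text documents using Mallet

Nowadays, as the number of documents in text format keeps growing, an important task for any data scientist is to get an overview of a large number of articles through abstracts, summaries, or lists of abstract topics, not because this saves the time of reading through the articles, but in order to do clustering, classification, semantic relatedness measurement, sentiment analysis, and so on.

In machine learning and natural language processing, topic modeling refers to retrieving abstract topics or keywords from text articles using statistical models. In this recipe, we will use a sophisticated Java-based machine learning and natural language processing library named Mallet, which stands for MAchine Learning for LanguagE Toolkit (see http://mallet.cs.umass.edu/). Mallet is widely used in academia and industry for the following:

Document classification,
Clustering,
Topic modeling, and
Information extraction.

However, the scope of this book is limited to topic modeling and document classification. In this recipe, we will cover how to extract topics using Mallet, while the next recipe will focus on classifying text documents using Mallet and supervised machine learning.

Note

Note that you will use the tool only from the command prompt, and you will not be involved in any coding in this recipe and the next. This is because Mallet is easier to use from the command prompt. Interested readers who want to use the Java API are encouraged to read Mallet's rich API documentation at http://mallet.cs.umass.edu/api/.

Getting ready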

First, you will install Mallet. This recipe provides installation instructions for Windows operating systems only. Go to http://mallet.cs.umass.edu/download.php and download Mallet. At the time of writing this book, version 2.0.8 was the latest version, and you are therefore encouraged to download it (preferably the zip file).
Unzip Mallet into your C:/ directory. Your C:/ drive will then have a directory named C:\mallet-2.0.8RC2 or similar.
Inside that directory, the actual runnable file is in the bin folder, and there are some sample datasets in the sample-data folder.
Go to Control Panel\System and Security\System on your Windows PC. Click on Advanced system settings.
A system properties window opens. Click on the Environment Variables... button.
This opens a window for setting up environment variables. Click on New under system variables.
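In the Variable name textbox, type MALLET_HOME. In the Variable value textbox, give the path C:\mallet-2.0.8RC2. Click on OK to close the windows.
To check whether Mallet has been installed properly, open a command prompt window, go to the bin folder of the Mallet directory, and type mallet. You should see all the Mallet 2.0 commands that you can use listed on the screen.

Now you are ready to use Mallet. At any point, if you are unsure about the commands or parameters, you can append --help to a Mallet 2.0 command. This lists all the possible commands and parameter options for that particular Mallet 2.0 command.

How to do it...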

The Mallet distribution folder in your C:/ drive has a directory named sample-data. This directory contains another directory named web. Inside web, you will find two more directories-the directory named en contains a few text files that are text versions of a few English web articles, and the directory named de contains a few text files that are text versions of a few German web articles. The en directory can be seen as our dataset or corpus for this recipe and you will be extracting topics from these web articles. If you have your own set of documents for which you need to extract topics, just imagine the following tasks that you are going to do by simply replacing the en directory with the directory where your documents reside.

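To do so, first transform the text files into a single file in Mallet's own format, which is binary and not human-readable. From the command line, go to C:\mallet-2.0.8RC2\bin and type in the following command:
```
mallet import-dir --input C:\mallet-2.0.8RC2\sample-data\web\en --output c:/web.en.mallet --keep-sequence --remove-stopwords
```
This command creates a Mallet file named web.en.mallet in your C:/ drive. It keeps the original sequence of the data files listed in the en directory (--keep-sequence) and removes stop words based on a standard English stop-word list (--remove-stopwords).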

Note

If you want the model to consider the bigrams of the text during modeling, replace the command with the following:

        mallet import-dir --input C:\mallet-2.0.8RC2\sample-data\web\en --output c:/web.en.mallet --keep-sequence-bigrams --remove-stopwords

Type in the following command to run Mallet's topic modeling routine with default settings on the web.en.mallet file:
```
mallet train-topics --input c:/web.en.mallet
```
该命令将在命令提示符下生成如下信息:

让我们检查输出。Mallet 主题建模输出的第二行包含一行:
```
1       5       zinta role hindi actress film indian acting 
         top-grossing naa award female filmfare earned films drama written areas evening 
         protection april
```
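If you are a fan of Hindi movies, you will immediately understand that this topic is about the Hindi movies of the actress Preity Zinta. To confirm, you can look at the file named zinta.txt in the C:\mallet-2.0.8RC2\sample-data\web\en directory.

The 1 in the output denotes the topic number (the numbering starts from 0), and 5 is the Dirichlet parameter for the topic (which can be seen as the weight of the topic). As we have not set it, this number will be the default for all the topics in the output.

Note

MALLET includes an element of randomness in topic modeling and extraction, so the keyword lists will look different every time the program is run, even on the same dataset. Therefore, do not think that something went wrong if your output differs from what is outlined in this recipe.

The command in this step is very generic: it does not use any of Mallet's more useful parameters, and it only displays the results on the console.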
Next, we will be applying topic modelling on the same data but with more options and we will output the topics to an external file so that we can further use them. On your command prompt, type in the following:
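```
mallet train-topics --input c:/web.en.mallet --num-topics 20 --num-top-words 20 --optimize-interval 10 --xml-topic-phrase-report C:/web.en.xml
```
This command uses the c:/web.en.mallet file as input, generates a maximum of 20 topics for the data, prints the top 20 words of each topic, and writes the results to the c:/web.en.xml file. The --optimize-interval option is used to generate better topic models by turning on hyperparameter optimization, which ultimately allows the model to fit the data better by prioritizing some topics over others.

After running the command, you will see that an XML file named web.en.xml has been generated in your C:/ drive. You can open the file to inspect the topics and their phrases.
There are some other options in Mallet that you can use for topic modeling. One of the important options is the alpha parameter, which is known as the smoothing parameter for the topic distribution. Try the following command: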
```
mallet train-topics --input c:/web.en.mallet --num-topics 20 --num-top-words 20 --optimize-interval 10 --alpha 2.5 --xml-topic-phrase-report C:/web.en.xml
```
Tip

The rule of thumb for the alpha value is 50/T, where T is the number of topics you select with the --num-topics [NUMBER] option. So, if you generate 20 topics, you should set the value of alpha to 50/20 = 2.5.

If --random-seed is not set when generating a topic model for the documents, randomness is applied, and a slightly or completely different XML file of topics is generated on every run.
Mallet can also produce output in different formats, which helps with analyzing the topics in many different ways. Type in the following command on the command line:

        mallet train-topics --input c:/web.en.mallet --num-topics 20 --num-top-words 20 --optimize-interval 10 --output-state C:\web.en.gz --output-topic-keys C:\web.en.keys.txt --output-doc-topics c:/web.en.composition.txt

This command generates three new files in your C:/ drive.

C:\web.en.gz is a compressed file that lists every word in your corpus together with the topic it belongs to.
C:\web.en.keys.txt contains the data that we have already seen on the console in step 2, that is, the topic number, the weight, and the top keywords of each topic.
C:/web.en.composition.txt contains the percentage breakdown of each topic within each original text file that you imported. You can open this file with any spreadsheet application, such as Microsoft Excel.
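If you want to use the topic keys in your own Java code, each line can be split into its three parts. The sketch below assumes that a line holds the topic number, the weight, and the keywords separated by whitespace, as in the console output shown earlier; the class `TopicKeysParser` and its `Topic` type are hypothetical helpers and not part of Mallet.

```java
import java.util.Arrays;
import java.util.List;

public class TopicKeysParser {
    // Simple value holder for one line of the topic keys output.
    public static final class Topic {
        public final int id;
        public final double weight;
        public final List<String> keywords;

        Topic(int id, double weight, List<String> keywords) {
            this.id = id;
            this.weight = weight;
            this.keywords = keywords;
        }
    }

    // Assumed layout: <topic number> <weight> <keyword> <keyword> ...
    public static Topic parseLine(String line) {
        String[] fields = line.trim().split("\\s+", 3);
        if (fields.length < 3) {
            throw new IllegalArgumentException(
                "Expected topic number, weight and keywords: " + line);
        }
        return new Topic(Integer.parseInt(fields[0]),
                         Double.parseDouble(fields[1]),
                         Arrays.asList(fields[2].split("\\s+")));
    }

    public static void main(String[] args) {
        Topic topic = parseLine("1\t5\tzinta role hindi actress film");
        System.out.println(topic.id + " " + topic.weight + " " + topic.keywords);
    }
}
```

Check the layout of your own C:\web.en.keys.txt before relying on this; if it differs from the assumption above, adjust the splitting accordingly.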

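In most cases, these are the key commands that you will use to extract topics from a collection of articles. The steps followed in this recipe extract topics from a collection of texts. If you have a single article from which you need to extract topics, put the article in a directory and treat it as a corpus of a single document.

Before we finish the recipe, let us look at the topic modeling algorithms available in Mallet:

LDA
Parallel LDA
DMR LDA
Hierarchical LDA
Labeled LDA
Polylingual topic model
Hierarchical Pachinko Allocation Model
Weighted topic model
LDA with integrated phrase discovery
Word embeddings (word2vec) using skip-gram with negative sampling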

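Classifying text documents using Mallet

The final two recipes of this chapter deal with a classic machine learning classification problem: classifying documents using language modeling. In this recipe, we will use Mallet and its command-line interface to train a model and apply the model to unseen test data.

Classification with Mallet depends on three steps:

Convert your training documents into Mallet's native format.
Train your model on the training documents.
Apply the model to classify unseen test documents.

When we say that the training documents need to be converted into Mallet's native format, the technical meaning is that the documents are converted into feature vectors. You do not need to extract any features from your training or test documents, as Mallet takes care of this. You can physically separate the training and test data, or you can have a flat list of documents and split off the training and test portions with command-line options.

Let us consider a simple setting: you have text data in plain text files, one file per document. There is no need to mark the beginning or end of a document. The files are organized in directories, where all documents with the same class label are contained in one directory. For instance, if your text files have two classes, spam and ham, you need to create two directories: one containing all the spam documents and the other containing all the ham documents.

Getting ready

The installation of Mallet has already been covered in detail in the previous recipe, Extracting topics from text documents using Mallet, so we will not repeat it here.
Open a web browser and paste in the following URL: http://www.cs.cmu.edu/afs/cs/project/theo-11/www/naive-bayes/20_newsgroups.tar.gz. This downloads an archive containing news articles classified into 20 different directories. Save it in the Mallet installation directory, extracted as the 20_newsgroups folder that the commands below refer to.

How to do it...

Open a command prompt window and go to the bin folder of your Mallet installation folder.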
Write the following command while you are inside the bin folder:
```
mallet import-dir --input C:\mallet-2.0.8RC2\20_newsgroups\* --preserve-case --remove-stopwords --binary-features --gram-sizes 1 --output C:\20_newsgroup.classification.mallet
```
This command takes all the documents in the C:\mallet-2.0.8RC2\20_newsgroups folder, removes stop words from them, preserves the actual case of the words in the documents, and creates binary features with a gram size of 1. The output, in Mallet's native file format, is saved as C:\20_newsgroup.classification.mallet.
Next, create a Naïve Bayes classifier from the data using the following command. The command takes the output of the previous step as input, creates a Naïve Bayes classifier from the binary features with 1-grams, and outputs the classifier as C:\20_newsgroup.classification.classifier:
```
 mallet train-classifier --trainer NaiveBayes --input 
        C:\20_newsgroup.classification.mallet --output-classifier 
        C:\20_newsgroup.classification.classifier
```
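Besides Naïve Bayes, Mallet supports many other algorithms. The following is the complete list:
- AdaBoost
- Bagging
- Winnow
- C45 decision tree
- Ensemble trainer
- Maximum Entropy classifier (multinomial logistic regression)
- Naïve Bayes
- Rank Maximum Entropy classifier
- Posterior Regularization Auxiliary Model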
Besides training on the full dataset, you can also provide a portion of data to be used as training data and the rest as test data; and based on the test data's actual class labels, you can see the classifier's prediction performance.

Write the following command while you are inside the bin folder:
```
 mallet train-classifier --trainer NaiveBayes --input 
          C:\20_newsgroup.classification.mallet --training-portion 0.9
```
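This command randomly selects 90% of your data and trains the Naïve Bayes classifier on it. The classifier is then applied to the remaining 10% of the data without seeing the actual labels; the actual classes are considered only during the evaluation of the classifier.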
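The idea behind such a random split can be sketched in a few lines of Java. This is only an illustration of the concept, not Mallet's implementation; the class `TrainTestSplit`, its method signature, and the use of a fixed seed are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

public class TrainTestSplit {
    // Shuffles a copy of the items and cuts it into a training part
    // (index 0 of the result) and a test part (index 1 of the result).
    public static <T> List<List<T>> split(List<T> items,
                                          double trainingPortion,
                                          long seed) {
        List<T> shuffled = new ArrayList<>(items);
        Collections.shuffle(shuffled, new Random(seed));
        int cut = (int) Math.round(shuffled.size() * trainingPortion);
        List<List<T>> parts = new ArrayList<>();
        parts.add(new ArrayList<>(shuffled.subList(0, cut)));
        parts.add(new ArrayList<>(shuffled.subList(cut, shuffled.size())));
        return parts;
    }

    public static void main(String[] args) {
        List<String> documents = new ArrayList<>();
        for (int i = 0; i < 20; i++) {
            documents.add("doc" + i);
        }
        List<List<String>> parts = split(documents, 0.9, 42L);
        System.out.println("training: " + parts.get(0).size()
            + ", test: " + parts.get(1).size()); // training: 18, test: 2
    }
}
```

Fixing the seed makes a split repeatable, while a different seed per run gives a different partition each time, which is the behavior the multiple-trials option described below relies on.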

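The command gives you the overall accuracy of the classifier over the 20 classes, as well as the precision, recall, and accuracy for each class, along with the standard error.
You can also run the training and testing multiple times; each time, the training and test sets are selected randomly. For instance, if you want to train your classifier on 90% of your data and test it on the remaining 10% of the data 10 times with random splits, use the following command: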
```
mallet train-classifier --trainer NaiveBayes --input C:\20_newsgroup.classification.mallet --training-portion 0.9 --num-trials 10
```
You can also do cross-validation using Mallet where you can specify number of folds to be created during cross validation. For instance, if you want to do a 10-fold cross validation, use the following command:
```
 mallet train-classifier --trainer NaiveBayes --input 
          C:\20_newsgroup.classification.mallet --cross-validation 10
```
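This command gives you the individual results of each of the 10 trials, each with a new test portion of the original data, as well as the average result over the 10 trials. Mallet also gives you a confusion matrix, which is really important for data scientists to better understand their models.
Mallet allows you to compare the performance of multiple classifiers developed with different algorithms. For instance, the following command compares two classifiers, one using Naïve Bayes and one using Maximum Entropy, with 10-fold cross-validation:
```
mallet train-classifier --trainer MaxEnt --trainer NaiveBayes --input C:\20_newsgroup.classification.mallet --cross-validation 10
```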
If you want to use your saved classifier on a set of unseen test documents (which is not our case as we have used the entire directory for training in step 2), you can use the following command:
```
 mallet classify-dir --input <directory containing unseen test  
         data> --output - --classifier 
           C:\20_newsgroup.classification.classifier
```
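This command shows the predicted classes of your unseen test documents on the console. You can also save the predictions in a tab-separated values file using the following command: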
```
 mallet classify-dir --input <directory containing unseen test 
         data> --output <Your output file> --classifier 
           C:\20_newsgroup.classification.classifier
```
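Finally, you can also use a saved classifier on a single unseen test document. To do so, use the following command:

        mallet classify-file --input <unseen test data file path> --output - --classifier C:\20_newsgroup.classification.classifier

This command shows the predicted class of your unseen test document on the console. You can also save the prediction in a tab-separated values file using the following command:

        mallet classify-file --input <unseen test data file path> --output <Your output file> --classifier C:\20_newsgroup.classification.classifier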

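Classifying text documents using Weka

We used Weka in Chapter 4, Learning from Data - Part 1, to classify data points that were not in text format. Weka is also a very useful tool for classifying text documents using machine learning models. In this recipe, we will demonstrate how to use Weka 3 to develop a document classification model.

Getting ready

To download Weka, go to http://www.cs.waikato.ac.nz/ml/weka/downloading.html, where you will find download options for Windows, Mac, and other operating systems such as Linux. Read through the options carefully and download the appropriate version. During the writing of this book, 3.9.0 was the latest version for developers, and as the author already had version 1.8 of the JVM installed on his 64-bit Windows machine, he chose to download the self-extracting executable for 64-bit Windows without a Java VM.
After the download is complete, double-click on the executable file and follow the on-screen instructions. You need to install the full version of Weka.
Once the installation is done, do not run the software. Instead, go to the directory where you installed it and find Weka's Java archive file (weka.jar). Add this file to your Eclipse project as an external library.
The example document files used in this recipe are kept in directories. Each directory contains documents of a similar class. To download the example documents, open a web browser and copy and paste the following URL: https://weka.wikispaces.com/file/view/text_example.zip/82917283/text_example.zip. This will prompt you to save the file (if your browser is configured to ask where to save files). Save the file on your C:/ drive. Unzip the file to get a parent directory with one subdirectory per class.

Each directory contains a few html files that belong to a particular class. The classes have the labels class1, class2, and class3.

You are now all set to classify these documents with Weka.

How to do it...

Create a class and a main() method to hold all your code. The main method throws exceptions: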

        public class WekaClassification { 
          public static void main(String[] args) throws Exception {

Create a loader, and set the path of the parent directory of all the class directories on it:

        TextDirectoryLoader loader = new TextDirectoryLoader(); 
        loader.setDirectory(new File("C:/text_example"));

Create instances from the loaded html files:

        Instances data = loader.getDataSet();

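Create word vectors from the strings of the data. To do so, first create a filter that converts strings to word vectors, and then set the raw data from the previous step as the filter's input format: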
```
        StringToWordVector filter = new StringToWordVector(); 
        filter.setInputFormat(data); 
```
To complete the string-to-word-vector conversion, use this filter to create instances from the data:
```
        Instances dataFiltered = Filter.useFilter(data, filter); 
```
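
To make the conversion more concrete, here is a minimal, library-free sketch of what a string-to-word-vector step does conceptually: it builds a vocabulary from all documents and then represents each document by the vocabulary words it contains. The class BagOfWordsSketch and its method names are hypothetical and are not part of Weka; the real StringToWordVector filter has many more options (tokenizers, stop words, weighting) that are not modelled here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

// Hypothetical illustration; this class is not part of Weka.
class BagOfWordsSketch {

    // Lower-cases the text and splits it on anything that is not a letter or digit.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String token : text.toLowerCase().split("[^a-z0-9]+")) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    // Collects the distinct words of all documents in alphabetical order.
    static List<String> vocabulary(List<String> documents) {
        TreeSet<String> words = new TreeSet<>();
        for (String document : documents) {
            words.addAll(tokenize(document));
        }
        return new ArrayList<>(words);
    }

    // Returns one 0/1 vector per document: 1 if the vocabulary word occurs in it.
    static int[][] toWordVectors(List<String> documents, List<String> vocabulary) {
        int[][] vectors = new int[documents.size()][vocabulary.size()];
        for (int d = 0; d < documents.size(); d++) {
            for (String token : tokenize(documents.get(d))) {
                vectors[d][vocabulary.indexOf(token)] = 1;
            }
        }
        return vectors;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList("The cat sat", "the dog");
        List<String> vocab = vocabulary(docs);
        System.out.println(vocab);
        for (int[] vector : toWordVectors(docs, vocab)) {
            System.out.println(Arrays.toString(vector));
        }
    }
}
```

Each row of the result corresponds to one document and each column to one word, which is the shape of data that a classifier such as Naive Bayes can work with.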

Build a Naive Bayes classifier from these word vectors:

        NaiveBayes nb = new NaiveBayes(); 
        nb.buildClassifier(dataFiltered);

At this point, you may also want to see what your model looks like. To do so, print the model on the console:
```
        System.out.println("\n\nClassifier model:\n\n" + nb); 
```
Part of the model will be printed on your console.
To evaluate the model with k-fold cross-validation, write the following code:

        Evaluation eval = null; 
        eval = new Evaluation(dataFiltered); 
        eval.crossValidateModel(nb, dataFiltered, 5, new Random(1)); 
        System.out.println(eval.toSummaryString());

This prints the classifier evaluation on the console:

Correctly Classified Instances           1               14.2857 % 
Incorrectly Classified Instances         6               85.7143 % 
Kappa statistic                         -0.5    
Mean absolute error                      0.5714 
Root mean squared error                  0.7559 
Relative absolute error                126.3158 % 
Root relative squared error            153.7844 % 
Total Number of Instances                7

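Note that we used 5-fold cross-validation rather than the standard 10-fold cross-validation because the number of documents is less than 10 (7, to be exact).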
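
The reason the fold count cannot exceed the number of documents is easier to see in code. The following is a simplified sketch, assuming a plain shuffled, round-robin split; KFoldSketch is a hypothetical class, and Weka's own cross-validation additionally stratifies the folds by class, which is not modelled here.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical illustration; this is not Weka's implementation.
class KFoldSketch {

    // Shuffles the indices 0..n-1 and deals them into k test folds, round-robin.
    static List<List<Integer>> folds(int n, int k, long seed) {
        if (k > n) {
            throw new IllegalArgumentException("More folds than instances");
        }
        List<Integer> indices = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            indices.add(i);
        }
        Collections.shuffle(indices, new Random(seed));

        List<List<Integer>> folds = new ArrayList<>();
        for (int f = 0; f < k; f++) {
            folds.add(new ArrayList<>());
        }
        for (int position = 0; position < n; position++) {
            folds.get(position % k).add(indices.get(position));
        }
        return folds;
    }

    public static void main(String[] args) {
        // 7 documents, 5 folds: each fold is used once as the test set,
        // while the remaining documents form the training set.
        for (List<Integer> testFold : folds(7, 5, 1L)) {
            System.out.println("test indices: " + testFold);
        }
    }
}
```

With 7 instances and 5 folds, two folds hold two documents and three folds hold one, so every document is tested exactly once.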

The complete code for the recipe is as follows:

import weka.core.*; 
import weka.core.converters.*; 
import weka.classifiers.Evaluation; 
import weka.classifiers.bayes.NaiveBayes; 
import weka.filters.*; 
import weka.filters.unsupervised.attribute.*; 

import java.io.*; 
import java.util.Random; 

public class WekaClassification { 
   public static void main(String[] args) throws Exception { 
      TextDirectoryLoader loader = new TextDirectoryLoader(); 
      loader.setDirectory(new File("C:/text_example")); 
      Instances data = loader.getDataSet(); 

      StringToWordVector filter = new StringToWordVector(); 
      filter.setInputFormat(data); 
      Instances dataFiltered = Filter.useFilter(data, filter); 

      NaiveBayes nb = new NaiveBayes(); 
      nb.buildClassifier(dataFiltered); 
      System.out.println("\n\nClassifier model:\n\n" + nb); 

      Evaluation eval = null; 
      eval = new Evaluation(dataFiltered); 
      eval.crossValidateModel(nb, dataFiltered, 5, new Random(1)); 
      System.out.println(eval.toSummaryString()); 
   } 
}

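7. Working with Big Data

In this chapter, we will cover the following recipes:

Training an online logistic regression model using Apache Mahout
Applying an online logistic regression model using Apache Mahout
Solving simple text mining problems with Apache Spark
Clustering using KMeans algorithm with MLlib
Creating a linear regression model with MLlib
Classifying data points with Random Forest model using MLlib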

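Introduction

In this chapter, you will see three key technologies used in big data frameworks that are extremely useful for data scientists: Apache Mahout, Apache Spark, and Spark's machine-learning library, MLlib.

We will start the chapter with Apache Mahout, a scalable, distributed machine-learning platform for classification, regression, clustering, and collaborative filtering tasks. Mahout started as a machine-learning workbench that worked only on Hadoop MapReduce, but it eventually adopted Apache Spark as its platform.

Apache Spark is a framework that brings parallelization to big data processing. It is similar to MapReduce in that it also distributes data across a cluster. A key difference between the two is that Spark tries to keep things in memory as much as possible, whereas MapReduce continuously reads from and writes to disk. As a result, Spark is often much faster than MapReduce. We will see how you, as a data scientist, can use Spark for simple text-mining tasks such as counting empty lines or getting the frequencies of words in a large file. Another reason to use Spark is that it can be used not only with Java but also with other popular languages such as Python and Scala; for MapReduce, the usual choice is Java.

MLlib is the scalable machine-learning library from Apache Spark, and it implements a variety of classification, regression, clustering, collaborative filtering, and feature selection algorithms. It sits on top of Spark and uses its speed to solve machine-learning problems. In this chapter, you will see how to use this library to solve classification, regression, and clustering problems.

Note

In this book, we have used version 0.9 of Mahout, but interested readers can look at the differences between Mahout 0.10.x and MLlib here: http://www.weatheringthroughtechdays.com/2015/04/mahout-010x-first-mahout-release-as.html.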

Training an online logistic regression model using Apache Mahout

In this recipe, we will use the Apache Mahout Java library to train an online logistic regression model.

Getting ready

In Eclipse, create a new Maven project. The author had Eclipse Mars set up. To do so, go to File. Then select New and Other...:
Then, expand Maven from the wizard and select Maven Project. Click on Next until you reach the window where Eclipse prompts you to provide an Artifact Id. Type in mahout as Artifact Id, and the grayed out Finish button will become visible. Click on Finish. This will create a Maven project for you named mahout:
Double-click on pom.xml from your Eclipse Package Explorer to edit:

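Click on the pom.xml tab. You will now see the pom.xml file on screen. Put the following lines inside the <dependencies>...</dependencies> tags of your pom.xml and save it. This will automatically download the dependency JAR files to your project: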

      <dependency> 
         <groupId>org.apache.mahout</groupId> 
         <artifactId>mahout-core</artifactId> 
         <version>0.9</version> 
      </dependency> 
      <dependency> 
         <groupId>org.apache.mahout</groupId> 
         <artifactId>mahout-examples</artifactId>  
         <version>0.9</version> 
      </dependency> 
      <dependency> 
         <groupId>org.apache.mahout</groupId> 
         <artifactId>mahout-math</artifactId> 
         <version>0.9</version> 
      </dependency>

Create a package named chap7.science.data in your project under src/main/java directory:
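Right-click on the project name in Eclipse, select New, and then select Folder. You are going to create two folders. The first, named data, will contain the input dataset for which you will build a model. The second, named model, is where you will save your model. Enter data as the folder name and click on Finish. Repeat this step to create a folder named model.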

Create a CSV file named weather.numeric.csv in the data folder with the following data:

        outlook,temperature,humidity,windy,play 
        sunny,85,85,FALSE,no 
        sunny,80,90,TRUE,no 
        overcast,83,86,FALSE,yes 
        rainy,70,96,FALSE,yes 
        rainy,68,80,FALSE,yes 
        rainy,65,70,TRUE,no 
        overcast,64,65,TRUE,yes 
        sunny,72,95,FALSE,no 
        sunny,69,70,FALSE,yes 
        rainy,75,80,FALSE,yes 
        sunny,75,70,TRUE,yes 
        overcast,72,90,TRUE,yes 
        overcast,81,75,FALSE,yes 
        rainy,71,91,TRUE,no

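You are now ready to code.

How to do it...

In the package you just created, create a Java class named OnlineLogisticRegressionTrain.java. Double-click on the class file to write your code. Create a class named OnlineLogisticRegressionTrain:
```
        public class OnlineLogisticRegressionTrain { 
```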

Start writing your main method:

        public static void main(String[] args) throws IOException {

Create two String variables to hold the path of the input data file and the path of the model file that you are going to build and save:

        String inputFile = "data/weather.numeric.csv"; 
        String outputFile = "model/model";

Now create a list containing the features of the data file:

        List<String> features =Arrays.asList("outlook", "temperature", 
          "humidity", "windy", "play");

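This step lists all the feature names of the data file, in the order in which they appear in the file.
Next, define the type of each feature. The feature type w denotes a nominal feature, and the feature type n denotes a numeric feature: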
```
        List<String> featureType = Arrays.asList("w", "n", "n", "w", 
        "w"); 
```
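Now it is time to set the classifier's parameters. In this step, you will create a parameter variable and assign it several values. You will set the target (or class) variable, which is "play" in our case. If you look at the data, you will see that the class variable "play" takes at most two values, yes or no; therefore, you will set the maximum number of target categories to 2. Next, you will set the number of non-class features (4 in our case). The next three parameters depend on the algorithm. In this recipe, you will not use a bias to build the classifier, and you will use a learning rate of 0.5. Finally, you need to set the features and their types using the type map method: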
```
        LogisticModelParameters params = new 
          LogisticModelParameters(); 
        params.setTargetVariable("play"); 
        params.setMaxTargetCategories(2); 
        params.setNumFeatures(4); 
        params.setUseBias(false); 
        params.setTypeMap(features,featureType); 
        params.setLearningRate(0.5); 
```
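You will build the classifier using 10 passes over the data. This number is arbitrary, and you can choose any number that works well empirically: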
```
        int passes = 10; 
```

Create the online logistic regression classifier:

        OnlineLogisticRegression olr;

Create a variable to read data from the CSV file, and create the regression model:

```java
        CsvRecordFactory csv = params.getCsvRecordFactory(); 
        olr = params.createRegression(); 

```

Next, create a for loop to iterate through each of the 10 passes:

```java
        for (int pass = 0; pass < passes; pass++) { 

```

Start reading the data file:

```java
        BufferedReader in = new BufferedReader(new 
          FileReader(inputFile)); 

```

Get the header of your data file, which consists of the feature names:

```java
        csv.firstLine(in.readLine()); 

```

Read the first data row:

```java
        String row = in.readLine(); 

```

Now loop over each row that is not null:

```java
        while (row != null) { 

```

Now, for each row (data point), display the data point and create an input vector:

```java
        System.out.println(row); 
        Vector input = new 
          RandomAccessSparseVector(params.getNumFeatures()); 

```

Get the targetValue of the row:

```java
        int targetValue = csv.processLine(row, input); 

```

Train the model with this data point:

```java
        olr.train(targetValue, input); 

```

Read the next row:

```java
        row = in.readLine(); 

```

Close the while loop:

```java
        } 

```

Close the reader that reads the input data file:

```java
        in.close(); 

```

Close the for loop that iterates over the passes:

```java
        } 

```

Finally, save the output model to a file named model in the model directory of your Eclipse project:

```java
        OutputStream modelOutput = new FileOutputStream(outputFile); 
        try { 
            params.saveTo(modelOutput); 
        } finally { 
            modelOutput.close(); 
        } 

```

Close the main method and the class:

```java
        } 
        } 

```

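If you run the code, you will see the data rows of the input data file on the console as output, and the learned model will be saved in the model directory of your Eclipse project.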
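
To give a feel for what "online" training means in the steps above, here is a minimal sketch of logistic regression trained one example at a time with stochastic gradient steps. It is an illustration under simplifying assumptions (plain double arrays, a constant learning rate, no regularization); the class OnlineLogisticSketch is hypothetical and is not Mahout's OnlineLogisticRegression.

```java
// Hypothetical illustration; this class is not Mahout's OnlineLogisticRegression.
class OnlineLogisticSketch {

    private final double[] weights;
    private final double learningRate;

    OnlineLogisticSketch(int numFeatures, double learningRate) {
        this.weights = new double[numFeatures];
        this.learningRate = learningRate;
    }

    // Probability that the input belongs to class 1.
    double predict(double[] input) {
        double z = 0.0;
        for (int i = 0; i < weights.length; i++) {
            z += weights[i] * input[i];
        }
        return 1.0 / (1.0 + Math.exp(-z));
    }

    // One stochastic gradient step on a single labelled example (target is 0 or 1).
    void train(int target, double[] input) {
        double error = target - predict(input);
        for (int i = 0; i < weights.length; i++) {
            weights[i] += learningRate * error * input[i];
        }
    }

    public static void main(String[] args) {
        double[][] inputs = {{1.0, 0.0}, {0.0, 1.0}};
        int[] targets = {1, 0};
        OnlineLogisticSketch model = new OnlineLogisticSketch(2, 0.5);
        // Several passes over the data, updating the model after every example.
        for (int pass = 0; pass < 10; pass++) {
            for (int i = 0; i < inputs.length; i++) {
                model.train(targets[i], inputs[i]);
            }
        }
        System.out.println(model.predict(inputs[0]));
        System.out.println(model.predict(inputs[1]));
    }
}
```

The structure mirrors the recipe: an outer loop over passes, an inner loop over rows, and one train call per row, so the full dataset never has to be held in memory at once.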

The complete code for the recipe is as follows:

package chap7.science.data; 

import java.io.BufferedReader; 
import java.io.FileOutputStream; 
import java.io.FileReader; 
import java.io.IOException; 
import java.io.OutputStream; 
import java.util.Arrays; 
import java.util.List; 
import org.apache.mahout.classifier.sgd.CsvRecordFactory; 
import org.apache.mahout.classifier.sgd.LogisticModelParameters; 
import org.apache.mahout.classifier.sgd.OnlineLogisticRegression; 
import org.apache.mahout.math.RandomAccessSparseVector; 
import org.apache.mahout.math.Vector; 

public class OnlineLogisticRegressionTrain { 

   public static void main(String[] args) throws IOException { 
      String inputFile = "data/weather.numeric.csv"; 
      String outputFile = "model/model"; 

      List<String> features =Arrays.asList("outlook", "temperature", 
        "humidity", "windy", "play"); 
      List<String> featureType = Arrays.asList("w", "n", "n", "w", 
        "w"); 
      LogisticModelParameters params = new LogisticModelParameters(); 
      params.setTargetVariable("play"); 
      params.setMaxTargetCategories(2); 
      params.setNumFeatures(4); 
      params.setUseBias(false); 
      params.setTypeMap(features,featureType); 
      params.setLearningRate(0.5); 

      int passes = 10; 
      OnlineLogisticRegression olr;     

      CsvRecordFactory csv = params.getCsvRecordFactory(); 
      olr = params.createRegression(); 

      for (int pass = 0; pass < passes; pass++) { 
         BufferedReader in = new BufferedReader(new 
           FileReader(inputFile)); 
         csv.firstLine(in.readLine()); 
         String row = in.readLine(); 
         while (row != null) { 
            System.out.println(row); 
            Vector input = new 
              RandomAccessSparseVector(params.getNumFeatures()); 
            int targetValue = csv.processLine(row, input); 
            olr.train(targetValue, input); 
            row = in.readLine(); 
         } 
         in.close(); 
      } 

      OutputStream modelOutput = new FileOutputStream(outputFile); 
      try { 
         params.saveTo(modelOutput); 
      } finally { 
         modelOutput.close(); 
      } 
   } 
}

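Applying an online logistic regression model using Apache Mahout

In this recipe, we will demonstrate how to apply an online logistic regression model to unseen, unlabelled test data using Apache Mahout. Note that this recipe is closely related to the previous one and requires that you have already built a model from training data, as demonstrated in the previous recipe.

Getting ready

After completing the previous recipe, go to the project folder you created and open the directory named model that you created in the last recipe. You should see a model file there.

Next, create a test file. Go to the data folder that you created in the project folder in the last recipe. Create a test file named weather.numeric.test.csv with the following data: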

        outlook,temperature,humidity,windy,play 
        overcast,90,80,TRUE,yes 
        overcast,95,88,FALSE,yes 
        rainy,67,78,TRUE,no 
        rainy,90,97,FALSE,no 
        sunny,50,67,FALSE,yes 
        sunny,67,75,TRUE,no

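In the Eclipse project named mahout, you should see the package named chap7.science.data in the src/main/java folder. This package was created in the previous recipe. Create a Java class named OnlineLogisticRegressionTest.java in this package. Double-click on the Java class file to edit it.

How to do it...

Create the class: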

        public class OnlineLogisticRegressionTest {

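Declare a couple of class variables. First, create two variables to hold the paths of your test data file and your model file (the one you created in the last recipe):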

        private static String inputFile = 
          "data/weather.numeric.test.csv"; 
        private static String modelFile = "model/model";

Start creating your main method:

        public static void main(String[] args) throws Exception {

Create a variable of class type Auc, because you will compute the Area Under the Curve (AUC) of your classifier as a performance metric:
```
        Auc auc = new Auc(); 
```

Next, read and load the parameters of the online logistic regression algorithm from the model file:

        LogisticModelParameters params = 
          LogisticModelParameters.loadFrom(new File(modelFile));

Create a variable to read the test data file:

        CsvRecordFactory csv = params.getCsvRecordFactory();

Create an OnlineLogisticRegression classifier:

        OnlineLogisticRegression olr = params.createRegression();

Now read the test data file:

        InputStream in = new FileInputStream(new File(inputFile)); 
        BufferedReader reader = new BufferedReader(new 
          InputStreamReader(in, Charsets.UTF_8));

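The first line of the test data file is the header, that is, the list of features. You will therefore exclude this line from classification, pass it to the record factory, and then read the next line (the first data point):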
```
        String line = reader.readLine(); 
        csv.firstLine(line); 
        line = reader.readLine(); 
```
You will probably want to display the classification results on the console. Create a PrintWriter variable for this purpose:

```java
        PrintWriter output=new PrintWriter(new 
          OutputStreamWriter(System.out, Charsets.UTF_8), true); 

```

You will print the class of each data point, the model's output, and the log likelihood. Create the header and print it on the console:

```java
        output.println("\"class\",\"model-output\",\"log-likelihood\""); 

```

Now iterate over each line that is not null:

```java
        while (line != null) { 

```

Create a feature vector for your test data:

```java
        Vector vector = new 
          SequentialAccessSparseVector(params.getNumFeatures()); 

```

Create a variable to hold the actual class value of each line (data point):

```java
        int classValue = csv.processLine(line, vector); 

```

Classify the test data point and get the score from the classifier:

```java
        double score = olr.classifyScalarNoLink(vector); 

```

Print the following on the console: classValue, score, and the log likelihood:

```java
        output.printf(Locale.ENGLISH, "%d,%.3f,%.6f%n", classValue, 
        score, olr.logLikelihood(classValue, vector)); 

```
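
The relationship between the raw score and the log likelihood can be sketched in a few lines. The following is an illustration rather than Mahout's code: it assumes that the score is the value before the logistic link is applied and that the log likelihood is floored at -100, which is what the sample output of this recipe suggests. The class LogLikelihoodSketch and its methods are hypothetical names.

```java
// Hypothetical illustration; the floor of -100 is an assumption
// inferred from the sample output of this recipe.
class LogLikelihoodSketch {

    // Logistic link: maps a raw score to a probability for class 1.
    static double sigmoid(double score) {
        return 1.0 / (1.0 + Math.exp(-score));
    }

    // Log of the probability that the model assigns to the actual class.
    static double logLikelihood(int actual, double score) {
        double p = sigmoid(score);
        double value = actual > 0 ? Math.log(p) : Math.log1p(-p);
        return Math.max(-100.0, value);
    }

    public static void main(String[] args) {
        System.out.println(logLikelihood(1, 119.133)); // confident and right
        System.out.println(logLikelihood(0, 15.888));  // confident and wrong
        System.out.println(logLikelihood(0, 63.213));  // hits the floor
    }
}
```

A value near zero means the model gave the actual class a probability close to 1, while a large negative value means the model was confidently wrong.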

Add the score and classValue to the Auc variable:

```java
        auc.add(classValue, score); 

```

Read the next line and close the loop:

```java
        line = reader.readLine(); 
        } 

```

Close the reader:

```java
        reader.close(); 

```

Now print the results of your classification. First, print the AUC:

```java
        output.printf(Locale.ENGLISH, "AUC = %.2f%n", auc.auc()); 

```

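Next, you will print the confusion matrix of your classification. Create a confusion matrix for this purpose. Because the training/test data has two classes, you will get a 2x2 confusion matrix: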

```java
        Matrix matrix = auc.confusion(); 
        output.printf(Locale.ENGLISH, "confusion: [[%.1f, %.1f], [%.1f, 
          %.1f]]%n", matrix.get(0, 0), matrix.get(1, 0), matrix.get(0,  
            1), matrix.get(1, 1)); 

```

Store the entropy values in the matrix. You do not need to create a new matrix variable for this, although you can if you prefer:

```java
        matrix = auc.entropy(); 
        output.printf(Locale.ENGLISH, "entropy: [[%.1f, %.1f], [%.1f, 
          %.1f]]%n", matrix.get(0, 0), matrix.get(1, 0), matrix.get(0, 
            1), matrix.get(1, 1)); 

```

Close the main method and the class:

        } 
        }

The complete code for the recipe is as follows:

package chap7.science.data; 

import com.google.common.base.Charsets; 
import org.apache.mahout.math.Matrix; 
import org.apache.mahout.math.SequentialAccessSparseVector; 
import org.apache.mahout.math.Vector; 
import org.apache.mahout.classifier.evaluation.Auc; 
import org.apache.mahout.classifier.sgd.CsvRecordFactory; 
import org.apache.mahout.classifier.sgd.LogisticModelParameters; 
import org.apache.mahout.classifier.sgd.OnlineLogisticRegression; 
import java.io.BufferedReader; 
import java.io.File; 
import java.io.FileInputStream; 
import java.io.InputStream; 
import java.io.InputStreamReader; 
import java.io.OutputStreamWriter; 
import java.io.PrintWriter; 
import java.util.Locale; 

public class OnlineLogisticRegressionTest { 

   private static String inputFile = "data/weather.numeric.test.csv"; 
   private static String modelFile = "model/model"; 

   public static void main(String[] args) throws Exception { 
      Auc auc = new Auc(); 
      LogisticModelParameters params = 
        LogisticModelParameters.loadFrom(new File(modelFile)); 
      CsvRecordFactory csv = params.getCsvRecordFactory(); 
      OnlineLogisticRegression olr = params.createRegression(); 
      InputStream in = new FileInputStream(new File(inputFile)); 
      BufferedReader reader = new BufferedReader(new 
        InputStreamReader(in, Charsets.UTF_8)); 
      String line = reader.readLine(); 
      csv.firstLine(line); 
      line = reader.readLine(); 
      PrintWriter output=new PrintWriter(new 
        OutputStreamWriter(System.out, Charsets.UTF_8), true); 
      output.println("\"class\",\"model-output\",\"log-likelihood\""); 
      while (line != null) { 
         Vector vector = new 
            SequentialAccessSparseVector(params.getNumFeatures()); 
         int classValue = csv.processLine(line, vector); 
         double score = olr.classifyScalarNoLink(vector); 
         output.printf(Locale.ENGLISH, "%d,%.3f,%.6f%n", classValue, 
           score, olr.logLikelihood(classValue, vector)); 
         auc.add(classValue, score); 
         line = reader.readLine(); 
      } 
      reader.close(); 
      output.printf(Locale.ENGLISH, "AUC = %.2f%n", auc.auc()); 
      Matrix matrix = auc.confusion(); 
      output.printf(Locale.ENGLISH, "confusion: [[%.1f, %.1f], [%.1f, 
        %.1f]]%n", matrix.get(0, 0), matrix.get(1, 0), matrix.get(0, 
          1), matrix.get(1, 1)); 
      matrix = auc.entropy(); 
      output.printf(Locale.ENGLISH, "entropy: [[%.1f, %.1f], [%.1f,  
        %.1f]]%n", matrix.get(0, 0), matrix.get(1, 0), matrix.get(0, 
          1), matrix.get(1, 1)); 
   } 
}

If you run the code, the output will look like the following:

"class","model-output","log-likelihood" 
1,119.133,0.000000 
1,123.028,0.000000 
0,15.888,-15.887942 
0,63.213,-100.000000 
1,-6.692,-6.693089 
0,24.286,-24.286465 
AUC = 0.67 
confusion: [[0.0, 1.0], [3.0, 2.0]] 
entropy: [[NaN, NaN], [0.0, -9.2]]
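
The AUC figure can be checked by hand. One common way to compute it, sketched below, is to count over all (positive, negative) pairs how often the positive example receives the higher score. AucSketch is a hypothetical class written for this illustration; Mahout's Auc class may compute the value differently internally.

```java
// Hypothetical illustration of the pairwise definition of AUC.
class AucSketch {

    // Fraction of (positive, negative) pairs in which the positive scores higher;
    // ties count as half a win.
    static double auc(int[] labels, double[] scores) {
        double wins = 0.0;
        long pairs = 0;
        for (int i = 0; i < labels.length; i++) {
            if (labels[i] != 1) {
                continue;
            }
            for (int j = 0; j < labels.length; j++) {
                if (labels[j] != 0) {
                    continue;
                }
                pairs++;
                if (scores[i] > scores[j]) {
                    wins += 1.0;
                } else if (scores[i] == scores[j]) {
                    wins += 0.5;
                }
            }
        }
        if (pairs == 0) {
            throw new IllegalArgumentException("Need at least one example of each class");
        }
        return wins / pairs;
    }

    public static void main(String[] args) {
        // Class values and scores taken from the six rows of the output above.
        int[] labels = {1, 1, 0, 0, 1, 0};
        double[] scores = {119.133, 123.028, 15.888, 63.213, -6.692, 24.286};
        System.out.printf(java.util.Locale.ENGLISH, "AUC = %.2f%n", auc(labels, scores));
    }
}
```

For the six test rows, two of the three positive examples outscore all three negatives and the third outscores none, which gives 6 of 9 pairs and agrees with the AUC of 0.67 reported above.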

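Solving simple text mining problems with Apache Spark

According to the Apache Spark website, Spark runs programs up to 100 times faster than Hadoop MapReduce in memory, or up to 10 times faster on disk. In general, Apache Spark is an open-source cluster computing framework. Its processing engine offers good speed and ease of use, and it provides sophisticated analytics to data scientists.

In this recipe, we will demonstrate how to use Apache Spark to solve very simple data problems. These are, of course, merely toy problems rather than real-world ones, but they can serve as a starting point for understanding intuitively how Apache Spark is used at scale.

Getting ready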

In Eclipse, create a new Maven project. The author had Eclipse Mars set up. To do so, go to File. Then select New and Other...:
Expand Maven from the wizard and select Maven Project. Click on Next until you reach the window where Eclipse prompts you to provide an Artifact Id. Type in mlib as the Artifact Id, and the grayed-out Finish button will become visible. Click on Finish. This will create a Maven project for you named mlib:
Double-click on pom.xml from your Eclipse Package Explorer to edit:
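Click on the pom.xml tab. You will now see the pom.xml file on screen. Put the following lines inside the <dependencies>...</dependencies> tags of your pom.xml and save it. This will automatically download the dependency JAR files to your project: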
```
        <dependency> 
          <groupId>org.apache.spark</groupId> 
          <artifactId>spark-mllib_2.10</artifactId> 
          <version>1.3.1</version> 
        </dependency>  
```
Create a package named com.data.big.mlib in your project under src/main/java directory:
Right-click on the project name in Eclipse, select New, and then select Folder. Create a folder named data to hold the input data file for this recipe.
You will be using the works of William Shakespeare in text format. Open a browser and go to http://norvig.com/ngrams/. This opens a page named Natural Language Corpus Data: Beautiful Data. In the Files for Download section, you will find a .txt file named shakespeare. Download this file and place it in the data folder that you just created.
In the package you created, create a Java class file named SparkTest. Double-click on it to start writing code.

How to do it...

Create your class:
```
        public class SparkTest { 
```

Start writing your main method:

        public static void main( String[] args ){

First, get the path of the input data file. This is the Shakespeare file that you downloaded and saved in the data folder of your project:
```
        String inputFile = "data/shakespeare.txt"; 
```

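Spark properties control application settings and are configured separately for each application. One way to set these properties is to use a SparkConf passed to your SparkContext. SparkConf lets you configure some of the common properties: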

        SparkConf configuration = new 
          SparkConf().setMaster("local[4]").setAppName("My App"); 
        JavaSparkContext sparkContext = new 
          JavaSparkContext(configuration);

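Note that local[2] would give the minimal level of parallelism; the preceding syntax lets the application run with four threads.
A JavaRDD is a distributed collection of objects. Create an RDD object. Its main purpose in this recipe is to collect the empty lines in the shakespeare.txt file: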
```
        JavaRDD<String> rdd = 
           sparkContext.textFile(inputFile).cache();  
```
Tip

If we use local[*], Spark will use all the cores of the system.

Count the number of empty lines in the input data file:

        long emptyLines = rdd.filter(new Function<String,Boolean>(){ 
          private static final long serialVersionUID = 1L; 
          public Boolean call(String s){ 
          return s.length() == 0; 
          } 
        }).count();

Print the number of empty lines in the file on the console:

        System.out.println("Empty Lines: " + emptyLines);

Next, create the following code snippet to retrieve the word frequencies from the input data file:
```
        JavaPairRDD<String, Integer> wordCounts = rdd 
          .flatMap(s -> Arrays.asList(s.toLowerCase().split(" "))) 
          .mapToPair(word -> new Tuple2<>(word, 1)) 
          .reduceByKey((a, b) -> a + b); 
```
Note

One of the reasons for choosing Apache Spark instead of MapReduce is that it requires less code to achieve the same thing. For example, the lines of code in this step retrieve words and their frequencies from a text document. Achieving the same thing with MapReduce takes more than 100 lines of code, as shown here: https://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html#Example%3A+WordCount+v2.0.
Using the wordCounts RDD, you can collect the words and their frequencies as a map, and then iterate over the map and print the word-frequency pairs:

```java
       Map<String, Integer> wordMap = wordCounts.collectAsMap(); 
       for (Entry<String, Integer> entry : wordMap.entrySet()) { 
          System.out.println("Word = " + entry.getKey() 
             + ", Frequency = " + entry.getValue()); 
       } 

```
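
If you want to check the logic of this recipe on a small input without starting Spark, the same two computations (counting empty lines and counting words) can be sketched with plain Java streams. This is only a single-machine analogue for illustration; WordCountSketch and its method names are hypothetical, and nothing here is distributed.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Hypothetical single-machine analogue of the Spark pipeline in this recipe.
class WordCountSketch {

    // Counts the lines that contain no characters at all.
    static long emptyLines(List<String> lines) {
        return lines.stream().filter(String::isEmpty).count();
    }

    // Splits every line on single spaces, lower-cases the words,
    // and sums one count per occurrence (the "reduce by key" step).
    static Map<String, Integer> wordCounts(List<String> lines) {
        return lines.stream()
                .flatMap(line -> Arrays.stream(line.toLowerCase().split(" ")))
                .collect(Collectors.toMap(word -> word, word -> 1, Integer::sum, TreeMap::new));
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("To be or", "", "not to be");
        System.out.println("Empty Lines: " + emptyLines(lines));
        for (Map.Entry<String, Integer> entry : wordCounts(lines).entrySet()) {
            System.out.println("Word = " + entry.getKey() + ", Frequency = " + entry.getValue());
        }
    }
}
```

Because splitting an empty line on a space yields a single empty string, empty lines show up in the word counts under an empty-string key; filtering out empty tokens before counting would avoid that.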

Close the sparkContext that you created:

```java
        sparkContext.close(); 

```

Close the main method and the class:

       } 
       }

The complete code for the recipe is as follows:

package com.data.big.mlib; 

import java.util.Arrays; 
import java.util.Map; 
import java.util.Map.Entry; 
import org.apache.spark.SparkConf; 
import org.apache.spark.api.java.JavaPairRDD; 
import org.apache.spark.api.java.JavaRDD; 
import org.apache.spark.api.java.JavaSparkContext; 
import org.apache.spark.api.java.function.Function; 
import scala.Tuple2; 
public class SparkTest { 
   public static void main( String[] args ){ 
      String inputFile = "data/shakespeare.txt"; 
      SparkConf configuration = new 
        SparkConf().setMaster("local[4]").setAppName("My App"); 
      JavaSparkContext sparkContext = new 
        JavaSparkContext(configuration); 
      JavaRDD<String> rdd = sparkContext.textFile(inputFile).cache(); 

      long emptyLines = rdd.filter(new Function<String,Boolean>(){ 
         private static final long serialVersionUID = 1L; 
         public Boolean call(String s){ 
            return s.length() == 0; 
         } 
      }).count(); 

      System.out.println("Empty Lines: " + emptyLines); 

      JavaPairRDD<String, Integer> wordCounts = rdd 
          .flatMap(s -> Arrays.asList(s.toLowerCase().split(" "))) 
          .mapToPair(word -> new Tuple2<>(word, 1)) 
          .reduceByKey((a, b) -> a + b); 

      Map<String, Integer> wordMap = wordCounts.collectAsMap(); 
      for (Entry<String, Integer> entry : wordMap.entrySet()) { 
          System.out.println("Word = " + entry.getKey() 
              + ", Frequency = " + entry.getValue()); 
      } 

      sparkContext.close(); 
   } 
}

If you run the code, part of the output will look like the following:

Empty Lines: 35941 
...................................................................................................... 

Word = augustus, Frequency = 4 
Word = bucklers, Frequency = 3 
Word = guilty, Frequency = 66 
Word = thunder'st, Frequency = 1 
Word = hermia's, Frequency = 7 
Word = sink, Frequency = 37 
Word = burn, Frequency = 76 
Word = relapse, Frequency = 2 
Word = boar, Frequency = 16 
Word = cop'd, Frequency = 2 

......................................................................................................

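Note

A good article that encourages users to choose Apache Spark over MapReduce can be found here: https://www.mapr.com/blog/5-minute-guide-understanding-significance-apache-spark.

Clustering using KMeans algorithm with MLlib

In this recipe, we will demonstrate how to cluster unlabelled data points using the KMeans algorithm with MLlib. As discussed in the introduction to this chapter, MLlib is the machine-learning component of Apache Spark and is a competitive (if not better) alternative to Apache Mahout.

Getting ready

You will use the Maven project that you created in the previous recipe (Solving simple text mining problems with Apache Spark). If you have not done so yet, follow steps 1-6 in the Getting ready section of that recipe.
Go to https://github.com/apache/spark/blob/master/data/mllib/kmeans_data.txt, download the data, and save it as km-data.txt in the data folder of the project that you created in step 1. Alternatively, you can create a text file named km-data.txt in the data folder of your project and copy and paste the data from the aforementioned URL.
In the package you created, create a Java class file named KMeansClusteringMlib.java. Double-click on it to start writing code.

You are now ready to do some coding.

How to do it...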

Create a class named KMeansClusteringMlib:

        public class KMeansClusteringMlib {

Start writing your main method:

        public static void main( String[] args ){

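Create a Spark configuration, and use it to create a Spark context. Note that local[2] would give the minimal level of parallelism; the following syntax lets the application run with four threads: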

        SparkConf configuration = new  
         SparkConf().setMaster("local[4]").setAppName("K-means Clustering"); 
        JavaSparkContext sparkContext = new 
          JavaSparkContext(configuration);

Now you will load and parse your input data:

      String path = "data/km-data.txt";

A JavaRDD is a distributed collection of objects. Create an RDD object to read the data file:
```
      JavaRDD<String> data = sparkContext.textFile(path); 
```

Now you need to read the data values from the preceding RDD; the values are separated by spaces. Parse these values and read them into another RDD:

         JavaRDD<Vector> parsedData = data.map( 
            new Function<String, Vector>() { 
               private static final long serialVersionUID = 1L; 

               public Vector call(String s) { 
                  String[] sarray = s.split(" "); 
                  double[] values = new double[sarray.length]; 
                  for (int i = 0; i < sarray.length; i++) 
                     values[i] = Double.parseDouble(sarray[i]); 
                  return Vectors.dense(values); 
               } 
            } 
            ); 
         parsedData.cache();

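Now define a few parameters for the KMeans clustering algorithm. We will use only two clusters to separate the data points, with a maximum of 10 iterations. Create a clusterer from the parsed data and these parameter values: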
```
      int numClusters = 2; 
      int iterations = 10; 
      KMeansModel clusters = KMeans.train(parsedData.rdd(), 
        numClusters, iterations); 
```
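
What the clusterer does with these two parameters can be sketched without Spark. The following is a simplified, single-machine version of the classic k-means loop (assign each point to its nearest center, then move each center to the mean of its points). KMeansSketch is a hypothetical class; the initial centers are passed in explicitly here, whereas MLlib chooses them with its own initialization strategy, and the six sample points are assumed to resemble the downloaded data file.

```java
import java.util.Arrays;

// Hypothetical illustration; this is not MLlib's implementation.
class KMeansSketch {

    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int d = 0; d < a.length; d++) {
            double diff = a[d] - b[d];
            sum += diff * diff;
        }
        return sum;
    }

    // Index of the center closest to the point.
    static int nearest(double[][] centers, double[] point) {
        int best = 0;
        for (int c = 1; c < centers.length; c++) {
            if (squaredDistance(centers[c], point) < squaredDistance(centers[best], point)) {
                best = c;
            }
        }
        return best;
    }

    // Runs a fixed number of assign-then-update iterations.
    static double[][] fit(double[][] points, double[][] initialCenters, int iterations) {
        int k = initialCenters.length;
        int dim = points[0].length;
        double[][] centers = new double[k][];
        for (int c = 0; c < k; c++) {
            centers[c] = initialCenters[c].clone();
        }
        for (int it = 0; it < iterations; it++) {
            double[][] sums = new double[k][dim];
            int[] counts = new int[k];
            for (double[] point : points) {
                int c = nearest(centers, point);
                counts[c]++;
                for (int d = 0; d < dim; d++) {
                    sums[c][d] += point[d];
                }
            }
            for (int c = 0; c < k; c++) {
                if (counts[c] > 0) { // keep the old center if no point was assigned
                    for (int d = 0; d < dim; d++) {
                        centers[c][d] = sums[c][d] / counts[c];
                    }
                }
            }
        }
        return centers;
    }

    public static void main(String[] args) {
        double[][] points = {
            {0.0, 0.0, 0.0}, {0.1, 0.1, 0.1}, {0.2, 0.2, 0.2},
            {9.0, 9.0, 9.0}, {9.1, 9.1, 9.1}, {9.2, 9.2, 9.2}
        };
        double[][] start = {points[0], points[3]};
        for (double[] center : fit(points, start, 10)) {
            System.out.println(Arrays.toString(center));
        }
    }
}
```

On well-separated data such as this, the centers settle after the first iteration; the iteration limit matters mainly for data in which the clusters overlap.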

Compute the sum of squared errors within the clusters:

      double sse = clusters.computeCost(parsedData.rdd()); 
         System.out.println("Sum of Squared Errors within set = " + 
           sse);

Finally, close the sparkContext, the main method, and the class:

         sparkContext.close(); 
       } 
     }

The complete code for the recipe is as follows:

package com.data.big.mlib; 

import org.apache.spark.api.java.*; 
import org.apache.spark.api.java.function.Function; 
import org.apache.spark.mllib.clustering.KMeans; 
import org.apache.spark.mllib.clustering.KMeansModel; 
import org.apache.spark.mllib.linalg.Vector; 
import org.apache.spark.mllib.linalg.Vectors; 
import org.apache.spark.SparkConf; 

public class KMeansClusteringMlib { 
   public static void main( String[] args ){ 
      SparkConf configuration = new 
        SparkConf().setMaster("local[4]").setAppName("K-means Clustering"); 
      JavaSparkContext sparkContext = new 
         JavaSparkContext(configuration); 

      // Load and parse data 
      String path = "data/km-data.txt"; 
      JavaRDD<String> data = sparkContext.textFile(path); 
      JavaRDD<Vector> parsedData = data.map( 
            new Function<String, Vector>() { 
               private static final long serialVersionUID = 1L; 

               public Vector call(String s) { 
                  String[] sarray = s.split(" "); 
                  double[] values = new double[sarray.length]; 
                  for (int i = 0; i < sarray.length; i++) 
                     values[i] = Double.parseDouble(sarray[i]); 
                  return Vectors.dense(values); 
               } 
            } 
            ); 
      parsedData.cache(); 

      // Cluster the data into two classes using KMeans 
      int numClusters = 2; 
      int iterations = 10; 
      KMeansModel clusters = KMeans.train(parsedData.rdd(), 
        numClusters, iterations); 

      // Evaluate clustering by computing Within Set Sum of Squared Errors 
      double sse = clusters.computeCost(parsedData.rdd()); 
      System.out.println("Sum of Squared Errors within set = " + sse); 
      sparkContext.close(); 
   } 
}

If you run the code, the output will look like the following:

Sum of Squared Errors within set = 0.11999999999994547
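
The reported figure can be reproduced by hand. The sketch below sums, for every point, the squared distance to its nearest cluster center. It assumes that the data file holds the six three-dimensional points shown in the code and that the two learned centers are (0.1, 0.1, 0.1) and (9.1, 9.1, 9.1); both are assumptions made for illustration, and WssseSketch is a hypothetical class.

```java
// Hypothetical illustration of the within-set sum of squared errors.
class WssseSketch {

    // Adds up each point's squared distance to its closest center.
    static double cost(double[][] points, double[][] centers) {
        double total = 0.0;
        for (double[] point : points) {
            double best = Double.POSITIVE_INFINITY;
            for (double[] center : centers) {
                double distance = 0.0;
                for (int d = 0; d < point.length; d++) {
                    double diff = point[d] - center[d];
                    distance += diff * diff;
                }
                best = Math.min(best, distance);
            }
            total += best;
        }
        return total;
    }

    public static void main(String[] args) {
        // Assumed contents of the sample data file and assumed cluster centers.
        double[][] points = {
            {0.0, 0.0, 0.0}, {0.1, 0.1, 0.1}, {0.2, 0.2, 0.2},
            {9.0, 9.0, 9.0}, {9.1, 9.1, 9.1}, {9.2, 9.2, 9.2}
        };
        double[][] centers = {{0.1, 0.1, 0.1}, {9.1, 9.1, 9.1}};
        System.out.println("Sum of Squared Errors within set = " + cost(points, centers));
    }
}
```

Under these assumptions, four of the six points lie at a squared distance of 0.03 from their center and two lie exactly on it, so the total is about 0.12, in line with the output above. A smaller value means tighter clusters.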

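Creating a linear regression model with MLlib

In this recipe, you will see how to build a linear regression model with MLlib.

Getting ready

You will use the Maven project that you created in the recipe named Solving simple text mining problems with Apache Spark. If you have not done so yet, follow steps 1-6 in the Getting ready section of that recipe.
Go to https://github.com/apache/spark/blob/master/data/mllib/ridge-data/lpsa.data, download the data, and save it as lr-data.txt in the data folder of the project that you created in step 1. Alternatively, you can create a text file named lr-data.txt in the data folder of your project and copy and paste the data from the aforementioned URL.
In the package you created, create a Java class file named LinearRegressionMlib.java. Double-click on it to start writing code.

You are now ready to do some coding.

How to do it...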

Create a class named LinearRegressionMlib:

        public class LinearRegressionMlib {

Start writing your main method, and then create a Spark configuration and a Spark context:

        public static void main(String[] args) {

        SparkConf configuration = new 
          SparkConf().setMaster("local[4]").setAppName("Linear Regression"); 
        JavaSparkContext sparkContext = new 
          JavaSparkContext(configuration);

Now you will load and parse your input data:

        String inputData = "data/lr-data.txt";

A JavaRDD is a distributed collection of objects. Create an RDD object to read the data file:

          JavaRDD<String> data = sparkContext.textFile(inputData);

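Now you need to read the data values from the preceding RDD. Each line of the input data has two parts separated by a comma. The first part is the label of the data point; in the second part, the features are separated by spaces. Parse these values and read them into another RDD: create a feature vector from the features, and put the feature vector together with the label in a labeled point: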

         JavaRDD<LabeledPoint> parsedData = data.map( 
            new Function<String, LabeledPoint>() { 
               private static final long serialVersionUID = 1L; 

               public LabeledPoint call(String line) { 
                  String[] parts = line.split(","); 
                  String[] features = parts[1].split(" "); 
                  double[] featureVector = new 
                     double[features.length]; 
                  for (int i = 0; i < features.length - 1; i++){ 
                     featureVector[i] = 
                       Double.parseDouble(features[i]); 
                  } 
                 return new LabeledPoint(Double.parseDouble(parts[0]), 
                    Vectors.dense(featureVector)); 
               } 
             } 
            ); 
         parsedData.cache();
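
The parsing logic can be tried on a single line without Spark. The sketch below splits one line of the form label,feature feature feature into a label and a feature array. LabeledLineSketch is a hypothetical helper, and the sample line is made up for illustration. Note that the recipe's loop stops at features.length - 1, so the last slot of its feature vector keeps the default value 0.0; this sketch parses every token instead.

```java
import java.util.Arrays;

// Hypothetical helper for illustration; it is not part of MLlib.
class LabeledLineSketch {

    // The label is everything before the comma.
    static double label(String line) {
        return Double.parseDouble(line.split(",")[0]);
    }

    // The features follow the comma and are separated by single spaces.
    static double[] features(String line) {
        String[] tokens = line.split(",")[1].trim().split(" ");
        double[] values = new double[tokens.length];
        for (int i = 0; i < tokens.length; i++) {
            values[i] = Double.parseDouble(tokens[i]);
        }
        return values;
    }

    public static void main(String[] args) {
        String line = "-0.43,-1.63 -2.00 0.15"; // made-up sample line
        System.out.println(label(line));
        System.out.println(Arrays.toString(features(line)));
    }
}
```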

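Next, you will build the linear regression model using 10 iterations. Create the model from the feature vectors, the labels, and the number of iterations: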

        int iterations = 10; 
        final LinearRegressionModel model = 
          LinearRegressionWithSGD.train(JavaRDD.toRDD(parsedData), 
           iterations);
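
To show what an iteration means here, the following is a minimal sketch of fitting a linear model by gradient descent on plain arrays. It makes simplifying assumptions that differ from MLlib: it uses the full dataset in every iteration, a constant step size, and no intercept. LinearGradientDescentSketch and its parameters are hypothetical names.

```java
// Hypothetical illustration; this is not MLlib's LinearRegressionWithSGD.
class LinearGradientDescentSketch {

    // Dot product of the weights and one feature vector.
    static double predict(double[] weights, double[] features) {
        double sum = 0.0;
        for (int d = 0; d < weights.length; d++) {
            sum += weights[d] * features[d];
        }
        return sum;
    }

    // Repeatedly moves the weights against the gradient of the squared error.
    static double[] fit(double[][] features, double[] labels, int iterations, double stepSize) {
        int n = features.length;
        int dim = features[0].length;
        double[] weights = new double[dim];
        for (int it = 0; it < iterations; it++) {
            double[] gradient = new double[dim];
            for (int i = 0; i < n; i++) {
                double error = predict(weights, features[i]) - labels[i];
                for (int d = 0; d < dim; d++) {
                    gradient[d] += error * features[i][d];
                }
            }
            for (int d = 0; d < dim; d++) {
                weights[d] -= stepSize * gradient[d] / n;
            }
        }
        return weights;
    }

    public static void main(String[] args) {
        double[][] features = {{1.0}, {2.0}, {3.0}};
        double[] labels = {2.0, 4.0, 6.0}; // generated by y = 2x
        double[] weights = fit(features, labels, 100, 0.1);
        System.out.println(weights[0]);
    }
}
```

Too few iterations or a poorly chosen step size leaves the weights far from their best values, which is one reason a training error such as the one printed at the end of this recipe can remain large.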

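You will then use the model to obtain predictions and put them into another RDD variable named predictions. Given a set of features, the model predicts a value, and the prediction is returned together with the actual label. Note that the predictions you get at this point are for the data points in your training set (lr-data.txt). Each Tuple2 contains the value predicted by the regression and the actual value: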

         JavaRDD<Tuple2<Double, Double>> predictions = parsedData.map( 
            new Function<LabeledPoint, Tuple2<Double, Double>>() { 
               private static final long serialVersionUID = 1L; 

               public Tuple2<Double, Double> call(LabeledPoint point) 
            { 
                  double prediction = model.predict(point.features()); 
                  return new Tuple2<Double, Double>(prediction, 
                    point.label()); 
               } 
            } 
         );

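Finally, compute the mean squared error of your linear regression model on the training data. For each data point, the error is the square of the difference between the value predicted by the model and the actual value given in the dataset. The errors of all data points are then averaged: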

       double mse = new JavaDoubleRDD(predictions.map( 
            new Function<Tuple2<Double, Double>, Object>() { 
               private static final long serialVersionUID = 1L; 

               public Object call(Tuple2<Double, Double> pair) { 
                  return Math.pow(pair._1() - pair._2(), 2.0); 
               } 
            } 
            ).rdd()).mean(); 
        System.out.println("training Mean Squared Error = " + mse);
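
Stripped of the RDD machinery, the calculation is short. The following sketch computes the same quantity on two plain arrays; MseSketch is a hypothetical class and the numbers are made up for illustration.

```java
// Hypothetical illustration of the mean squared error.
class MseSketch {

    // Average of the squared differences between predictions and actual values.
    static double meanSquaredError(double[] predicted, double[] actual) {
        if (predicted.length != actual.length || predicted.length == 0) {
            throw new IllegalArgumentException("Arrays must be non-empty and of equal length");
        }
        double sum = 0.0;
        for (int i = 0; i < predicted.length; i++) {
            double diff = predicted[i] - actual[i];
            sum += diff * diff;
        }
        return sum / predicted.length;
    }

    public static void main(String[] args) {
        double[] predicted = {1.0, 2.0, 3.0};
        double[] actual = {1.0, 2.0, 5.0};
        // Only the last pair differs (by 2), so the result is 4 / 3.
        System.out.println("training Mean Squared Error = " + meanSquaredError(predicted, actual));
    }
}
```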

Finally, close the sparkContext, the main method, and the class:

      sparkContext.close(); 
      } 
      }

The complete code for the recipe is as follows:

package com.data.big.mlib; 

import scala.Tuple2; 
import org.apache.spark.api.java.*; 
import org.apache.spark.api.java.function.Function; 
import org.apache.spark.mllib.linalg.Vectors; 
import org.apache.spark.mllib.regression.LabeledPoint; 
import org.apache.spark.mllib.regression.LinearRegressionModel; 
import org.apache.spark.mllib.regression.LinearRegressionWithSGD; 
import org.apache.spark.SparkConf; 

public class LinearRegressionMlib { 

   public static void main(String[] args) { 
      SparkConf configuration = new 
        SparkConf().setMaster("local[4]").setAppName("Linear Regression"); 
      JavaSparkContext sparkContext = new 
         JavaSparkContext(configuration); 

      // Load and parse the data 
      String inputData = "data/lr-data.txt"; 
      JavaRDD<String> data = sparkContext.textFile(inputData); 
      JavaRDD<LabeledPoint> parsedData = data.map( 
            new Function<String, LabeledPoint>() { 
               private static final long serialVersionUID = 1L; 

               public LabeledPoint call(String line) { 
                  String[] parts = line.split(","); 
                  String[] features = parts[1].split(" "); 
                  double[] featureVector = new 
                    double[features.length]; 
                  for (int i = 0; i < features.length - 1; i++){ 
                     featureVector[i] = 
                        Double.parseDouble(features[i]); 
                  } 
                 return new LabeledPoint(Double.parseDouble(parts[0]), 
                     Vectors.dense(featureVector)); 
               } 
            } 
            ); 
      parsedData.cache(); 

      // Building the model 
      int iterations = 10; 
      final LinearRegressionModel model = 
            LinearRegressionWithSGD.train(JavaRDD.toRDD(parsedData), 
                iterations); 

      // Evaluate model on training examples and compute training error 
      JavaRDD<Tuple2<Double, Double>> predictions = parsedData.map( 
            new Function<LabeledPoint, Tuple2<Double, Double>>() { 
               private static final long serialVersionUID = 1L; 

               public Tuple2<Double, Double> call(LabeledPoint point) { 
                  double prediction = model.predict(point.features()); 
                  return new Tuple2<Double, Double>(prediction, 
                    point.label()); 
               } 
            } 
            ); 
      double mse = new JavaDoubleRDD(predictions.map( 
            new Function<Tuple2<Double, Double>, Object>() { 
               private static final long serialVersionUID = 1L; 

               public Object call(Tuple2<Double, Double> pair) { 
                  return Math.pow(pair._1() - pair._2(), 2.0); 
               } 
            } 
            ).rdd()).mean(); 
      System.out.println("training Mean Squared Error = " + mse); 
      sparkContext.close(); 
   } 
}

When you run the code, the output is as follows:

training Mean Squared Error = 6.487093790021849
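
The mean squared error that the recipe prints is the average of the squared differences between predictions and labels. The following is a minimal plain-Java sketch of that arithmetic, without Spark. The class name MseSketch and the sample values are made up for illustration and are not part of the recipe.

```java
import java.util.Arrays;
import java.util.List;

// Plain-Java illustration of the mean squared error; no Spark involved.
class MseSketch {

    // Each pair is {prediction, label}, mirroring Tuple2<Double, Double> in the recipe.
    static double meanSquaredError(List<double[]> pairs) {
        if (pairs.isEmpty()) {
            throw new IllegalArgumentException("at least one pair is required");
        }
        double sum = 0.0;
        for (double[] pair : pairs) {
            double diff = pair[0] - pair[1];
            sum += diff * diff; // same as Math.pow(diff, 2.0)
        }
        return sum / pairs.size();
    }

    public static void main(String[] args) {
        // Hypothetical predictions and labels, made up for this sketch.
        List<double[]> pairs = Arrays.asList(
            new double[]{2.5, 3.0},
            new double[]{0.0, -0.5},
            new double[]{2.0, 2.0});
        System.out.println("MSE = " + meanSquaredError(pairs));
    }
}
```

In the recipe, the same computation is distributed: the map step squares each difference and the mean() call on the JavaDoubleRDD averages them.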

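Classifying data points with a random forest model using MLlib

In this recipe, we will demonstrate how you can classify data points using the random forest algorithm of MLlib.

Getting ready

You will use the Maven project that you created in the recipe named Solving simple text mining problems with Apache Spark. If you have not done so yet, follow steps 1-6 in the Getting ready section of that recipe.
Go to https://github.com/apache/spark/blob/master/data/mllib/sample_binary_classification_data.txt, download the data, and save it as rf-data.txt in the data folder of the project that you created by following the instructions in step 1. Alternatively, you can create a text file named rf-data.txt in the project's data folder and copy-paste the data from the aforementioned URL.
In the package that you created, create a Java class file named RandomForestMlib.java. Double-click on it to start writing your code in it.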

How to do it...

Create a class named RandomForestMlib:

        public class RandomForestMlib {

Start writing your main method.

        public static void main(String args[]){

       SparkConf configuration = new 
         SparkConf().setMaster("local[4]").setAppName("Random Forest");   
       JavaSparkContext sparkContext = new 
          JavaSparkContext(configuration);

Now you will load and parse your input data:

        String input = "data/rf-data.txt";

Read the data by loading the input file as a LibSVM file and putting it into an RDD.

        JavaRDD<LabeledPoint> data =  
          MLUtils.loadLibSVMFile(sparkContext.sc(), 
            input).toJavaRDD();
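
For readers unfamiliar with the LibSVM text format, each line holds a label followed by sparse index:value pairs, with one-based indices that Spark converts to zero-based when loading. The sketch below parses one such line in plain Java. It is an illustration only, not Spark's implementation; the class name LibSvmLineSketch, its methods, and the sample line are hypothetical.

```java
import java.util.Map;
import java.util.TreeMap;

// Plain-Java illustration of the LibSVM line format: "label index:value index:value ...".
class LibSvmLineSketch {

    // The first whitespace-separated token is the class label.
    static double parseLabel(String line) {
        return Double.parseDouble(line.trim().split("\\s+")[0]);
    }

    // The remaining tokens are sparse features; features that are absent are implicitly 0.
    static Map<Integer, Double> parseFeatures(String line) {
        String[] tokens = line.trim().split("\\s+");
        Map<Integer, Double> features = new TreeMap<>();
        for (int i = 1; i < tokens.length; i++) {
            String[] indexAndValue = tokens[i].split(":");
            // LibSVM indices are one-based; shift them to zero-based.
            int index = Integer.parseInt(indexAndValue[0]) - 1;
            features.put(index, Double.parseDouble(indexAndValue[1]));
        }
        return features;
    }

    public static void main(String[] args) {
        String line = "1 3:0.5 10:2.0"; // hypothetical line, made up for this sketch
        System.out.println(parseLabel(line) + " -> " + parseFeatures(line));
    }
}
```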

You will use 70% of the data to train the model and 30% as test data for the model. The data is selected at random.

       JavaRDD<LabeledPoint>[] dataSplits = data.randomSplit(new 
          double[]{0.7, 0.3});
       JavaRDD<LabeledPoint> trainingData = dataSplits[0];
       JavaRDD<LabeledPoint> testData = dataSplits[1];

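Now, you will configure a few parameters to set up the random forest that generates the model from the training data. You need to define the number of classes that a data point can have. You also need to create a map for the nominal features. You can define the number of trees in the forest. If you do not know what to choose as the feature subset selection strategy for the classifier, you can choose "auto". The remaining four parameters (impurity, maxDepth, maxBins, and seed) are required for building the forest.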
```
        Integer classes = 2;
        HashMap<Integer, Integer> nominalFeatures = new   
         HashMap<Integer, Integer>();
        Integer trees = 3;
        String featureSubsetProcess = "auto";
        String impurity = "gini";
        Integer maxDepth = 3;
        Integer maxBins = 20;
        Integer seed = 12345;
```
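
The "gini" setting selects Gini impurity as the measure a tree uses to judge how mixed the class labels in a node are: 1 minus the sum of the squared class proportions. Here is a small plain-Java sketch of that formula. The class name GiniSketch and the counts are illustrative assumptions, not part of MLlib.

```java
// Plain-Java illustration of Gini impurity: 1 - sum(p_i^2) over the class proportions p_i.
class GiniSketch {

    // counts[i] is the number of examples of class i that fall into a tree node.
    static double gini(int[] counts) {
        long total = 0;
        for (int c : counts) {
            total += c;
        }
        if (total == 0) {
            return 0.0; // an empty node is treated as pure in this sketch
        }
        double sumOfSquares = 0.0;
        for (int c : counts) {
            double p = (double) c / total;
            sumOfSquares += p * p;
        }
        return 1.0 - sumOfSquares;
    }

    public static void main(String[] args) {
        System.out.println(gini(new int[]{10, 0})); // pure node, prints 0.0
        System.out.println(gini(new int[]{5, 5}));  // evenly mixed two-class node, prints 0.5
    }
}
```

A split is attractive when the child nodes have a lower weighted impurity than the parent, which is why a pure node scores 0 and an evenly mixed two-class node scores the maximum of 0.5.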

Using these parameters, create a RandomForest classifier.

        final RandomForestModel rf =  
          RandomForest.trainClassifier(trainingData, classes,  
            nominalFeatures, trees, featureSubsetProcess, impurity, 
               maxDepth, maxBins, seed);

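Next, use the model to predict the class label of each data point from its feature vector. Each Tuple2<Double, Double> holds the predicted value and the actual class value of a data point: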

        JavaPairRDD<Double, Double> label = 
           testData.mapToPair(new PairFunction<LabeledPoint, Double, 
             Double>() { 
               private static final long serialVersionUID = 1L; 

               public Tuple2<Double, Double> call(LabeledPoint p) { 
                  return new Tuple2<Double, Double>
                    (rf.predict(p.features()), p.label()); 
               } 
        });

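Finally, compute the error of the predictions. You simply count the number of times the predicted value does not match the actual value, and then divide by the total number of test instances to get the average: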

```java
      Double error = 
         1.0 * label.filter(new Function<Tuple2<Double, Double>, 
           Boolean>() { 
               private static final long serialVersionUID = 1L; 

               public Boolean call(Tuple2<Double, Double> pl) { 
                  return !pl._1().equals(pl._2()); 
               } 
      }).count() / testData.count(); 

```
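
The same error computation can be written without Spark to make the arithmetic explicit. This is a minimal sketch under the assumption that predictions and labels are held in two plain arrays; the class name TestErrorSketch and the sample values are hypothetical.

```java
// Plain-Java illustration of the misclassification rate used as the test error.
class TestErrorSketch {

    static double errorRate(double[] predicted, double[] actual) {
        if (predicted.length != actual.length || predicted.length == 0) {
            throw new IllegalArgumentException("arrays must be non-empty and of equal length");
        }
        int mismatches = 0;
        for (int i = 0; i < predicted.length; i++) {
            if (predicted[i] != actual[i]) {
                mismatches++;
            }
        }
        // Cast before dividing; integer division would truncate the result to 0.
        return (double) mismatches / predicted.length;
    }

    public static void main(String[] args) {
        double[] predicted = {0.0, 1.0, 1.0, 0.0}; // made-up predictions
        double[] actual = {0.0, 1.0, 0.0, 0.0};    // made-up labels
        System.out.println("Test Error: " + errorRate(predicted, actual));
    }
}
```

The multiplication by 1.0 in the recipe serves the same purpose as the cast here: both count() calls return long values, so the division would otherwise be an integer division.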

Print the test error on the console. You might also want to see the actual RandomForest model learned from the training data:

```java
      System.out.println("Test Error: " + error); 
      System.out.println("Learned classification forest model:\n" + 
        rf.toDebugString()); 

```

Close the sparkContext, the main method, and the class:

        sparkContext.close(); 
       } 
       }

The complete code for the recipe is as follows:

package com.data.big.mlib; 

import scala.Tuple2; 
import java.util.HashMap; 
import org.apache.spark.SparkConf; 
import org.apache.spark.api.java.JavaPairRDD; 
import org.apache.spark.api.java.JavaRDD; 
import org.apache.spark.api.java.JavaSparkContext; 
import org.apache.spark.api.java.function.Function; 
import org.apache.spark.api.java.function.PairFunction; 
import org.apache.spark.mllib.regression.LabeledPoint; 
import org.apache.spark.mllib.tree.RandomForest; 
import org.apache.spark.mllib.tree.model.RandomForestModel; 
import org.apache.spark.mllib.util.MLUtils; 

public class RandomForestMlib { 
   public static void main(String args[]){ 

      SparkConf configuration = new 
         SparkConf().setMaster("local[4]").setAppName("Random Forest"); 
      JavaSparkContext sparkContext = new 
          JavaSparkContext(configuration); 

      // Load and parse the data file. 
      String input = "data/rf-data.txt"; 
      JavaRDD<LabeledPoint> data = 
          MLUtils.loadLibSVMFile(sparkContext.sc(), input).toJavaRDD(); 
      // Split the data into training and test sets (30% held out for testing) 
      JavaRDD<LabeledPoint>[] dataSplits = data.randomSplit(new 
          double[]{0.7, 0.3}); 
      JavaRDD<LabeledPoint> trainingData = dataSplits[0]; 
      JavaRDD<LabeledPoint> testData = dataSplits[1]; 

      // Train a RandomForest model. 
      Integer classes = 2; 
      // An empty categoricalFeaturesInfo map indicates that all features are continuous. 
      HashMap<Integer, Integer> nominalFeatures = new HashMap<Integer, Integer>(); 
      Integer trees = 3; // Use more in practice. 
      String featureSubsetProcess = "auto"; // Let the algorithm choose. 
      String impurity = "gini"; 
      Integer maxDepth = 3; 
      Integer maxBins = 20; 
      Integer seed = 12345; 

      final RandomForestModel rf = 
          RandomForest.trainClassifier(trainingData, classes, 
            nominalFeatures, trees, featureSubsetProcess, impurity, 
             maxDepth, maxBins, seed); 

      // Evaluate model on test instances and compute test error 
      JavaPairRDD<Double, Double> label = 
          testData.mapToPair(new PairFunction<LabeledPoint, Double, 
               Double>() { 
               private static final long serialVersionUID = 1L; 

               public Tuple2<Double, Double> call(LabeledPoint p) { 
                  return new Tuple2<Double, Double>
                     (rf.predict(p.features()), p.label()); 
               } 
            }); 

      Double error = 
          1.0 * label.filter(new Function<Tuple2<Double, Double>, 
              Boolean>() { 
               private static final long serialVersionUID = 1L; 

               public Boolean call(Tuple2<Double, Double> pl) { 
                  return !pl._1().equals(pl._2()); 
               } 
            }).count() / testData.count(); 

      System.out.println("Test Error: " + error); 
      System.out.println("Learned classification forest model:\n" + 
          rf.toDebugString()); 

      sparkContext.close(); 
   } 
}

If you run the code, the output will be as follows:

Test Error: 0.034482758620689655 
Learned classification forest model: 
TreeEnsembleModel classifier with 3 trees 

  Tree 0: 
    If (feature 427 <= 0.0) 
     If (feature 407 <= 0.0) 
      Predict: 0.0 
     Else (feature 407 > 0.0) 
      Predict: 1.0 
    Else (feature 427 > 0.0) 
     Predict: 0.0 
  Tree 1: 
    If (feature 405 <= 0.0) 
     If (feature 624 <= 253.0) 
      Predict: 0.0 
     Else (feature 624 > 253.0) 
      If (feature 650 <= 0.0) 
       Predict: 0.0 
      Else (feature 650 > 0.0) 
       Predict: 1.0 
    Else (feature 405 > 0.0) 
     If (feature 435 <= 0.0) 
      If (feature 541 <= 0.0) 
       Predict: 1.0 
      Else (feature 541 > 0.0) 
       Predict: 0.0 
     Else (feature 435 > 0.0) 
      Predict: 1.0 
  Tree 2: 
    If (feature 271 <= 72.0) 
     If (feature 323 <= 0.0) 
      Predict: 0.0 
     Else (feature 323 > 0.0) 
      Predict: 1.0 
    Else (feature 271 > 72.0) 
     If (feature 414 <= 0.0) 
      If (feature 159 <= 124.0) 
       Predict: 0.0 
      Else (feature 159 > 124.0) 
       Predict: 1.0 
     Else (feature 414 > 0.0) 
      Predict: 0.0
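
To see how such a forest turns three trees into one prediction, the printed trees can be transcribed by hand into plain Java and combined by majority vote, which is how a classification forest combines its trees. The sketch below is only a hand transcription of the debug output above; the class name ForestVoteSketch, the helper f, and the sample inputs are hypothetical, and features missing from the map are assumed to be 0.0 as in a sparse vector.

```java
import java.util.HashMap;
import java.util.Map;

// Hand transcription of the three printed trees, combined by majority vote.
class ForestVoteSketch {

    // Returns the value of a feature, treating absent (sparse) features as 0.0.
    static double f(Map<Integer, Double> x, int index) {
        return x.getOrDefault(index, 0.0);
    }

    static double tree0(Map<Integer, Double> x) {
        if (f(x, 427) <= 0.0) {
            return f(x, 407) <= 0.0 ? 0.0 : 1.0;
        }
        return 0.0;
    }

    static double tree1(Map<Integer, Double> x) {
        if (f(x, 405) <= 0.0) {
            if (f(x, 624) <= 253.0) {
                return 0.0;
            }
            return f(x, 650) <= 0.0 ? 0.0 : 1.0;
        }
        if (f(x, 435) <= 0.0) {
            return f(x, 541) <= 0.0 ? 1.0 : 0.0;
        }
        return 1.0;
    }

    static double tree2(Map<Integer, Double> x) {
        if (f(x, 271) <= 72.0) {
            return f(x, 323) <= 0.0 ? 0.0 : 1.0;
        }
        if (f(x, 414) <= 0.0) {
            return f(x, 159) <= 124.0 ? 0.0 : 1.0;
        }
        return 0.0;
    }

    // With three trees and two classes, class 1 wins when at least two trees vote for it.
    static double predict(Map<Integer, Double> x) {
        double votesForOne = tree0(x) + tree1(x) + tree2(x);
        return votesForOne >= 2.0 ? 1.0 : 0.0;
    }

    public static void main(String[] args) {
        Map<Integer, Double> point = new HashMap<>(); // made-up data point
        point.put(407, 1.0);
        point.put(323, 1.0);
        System.out.println("Predicted class: " + predict(point));
    }
}
```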

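Chapter 8. Deep Learning from Data

In this chapter, we will cover the following recipes:

Creating a Word2vec neural net using Deep Learning for Java (DL4j)
Creating a deep belief neural net using Deep Learning for Java (DL4j)
Creating a deep autoencoder using Deep Learning for Java (DL4j)

Introduction

Deep learning is, in its simplest form, multilayer neural networks. It is also known as deep neural network learning and is sometimes referred to as unsupervised feature learning. The author believes that, because of its ability to solve real-world data problems, deep learning will become the next go-to tool for machine learning practitioners and data scientists.

Deep Learning for Java (DL4j) is an open source, distributed Java library for deep learning on the JVM. It comes with other libraries, as follows:

Deeplearning4J: Neural net platform
ND4J: NumPy for the JVM
DataVec: Tool for machine learning ETL operations
JavaCPP: The bridge between Java and native C++
Arbiter: Evaluation tool for machine learning algorithms
RL4J: Deep reinforcement learning for the JVM

However, given the scope of this book, we will focus on only a few key recipes for DL4j. Specifically, we will discuss recipes for using the Word2vec algorithm and its use in real-world NLP and information retrieval problems, as well as deep belief neural nets and deep autoencoders and their usage. Curious readers are strongly encouraged to visit https://github.com/deeplearning4j/dl4j-examples for more examples. Note that the code in this chapter's recipes is based on these examples on GitHub.

Also note that a large portion of this chapter is dedicated to showing how the DL4j library is set up, because the procedure is quite involved and readers need to follow it attentively to run the code in this book, as well as their own code, successfully.

All of the recipes in this chapter have two prerequisites: a Java Development Kit, version 7 or higher (the author used 1.8), and Apache Maven. The recipes in this chapter were implemented using the Eclipse Java IDE (the author used Eclipse Mars). Although https://deeplearning4j.org/quickstart contains a good amount of material on setting up DL4j with Java, most of it focuses on another IDE, named IntelliJ.

To execute the recipes in this chapter, we will need the following:

To use DL4j, you will need to have Apache Maven installed, which is a software project management and comprehension tool. At the time of writing this book, version 3.3.9 of Apache Maven was the latest, and readers are encouraged to use this version.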
Go to https://maven.apache.org/download.cgi and download a binary zip archive into your system:
Once you download, unzip the file archive, and you will find a folder structure as follows:
Now you need to add the path of the bin folder in this distribution to your Path environment variable. To do so, right-click on your My Computer icon and click on Properties. Click on Advanced system settings and then on Environment Variables...:
When the Environment Variables window appears, go to System variables and select the variable named Path. Click on the Edit... button:
When the Edit environment variable window appears, click on the New button and add the path to the bin folder of the Maven distribution. Click on OK to complete this action:
Now that you are back to Environment Variables window, set up the JAVA_HOME system variable. In the System variables section, click on the New button:
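Name the variable JAVA_HOME and set its value to the path of your Java Development Kit (JDK) folder (remember, not the bin folder).

Note

Note that you need at least version 7 of Java installed on your system to run the recipes in this chapter.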
Click on OK to complete the command. Close windows opened along the way:
Now check if Maven has been installed properly using the mvn -v command:
Also, check the version of Java installed on your system with the java -version command:
Now open the Eclipse IDE. The author has the Mars version installed. Go to File, then click on New, then click on Other...:
In the Wizard, expand the Maven option and select the Maven Project. Click on Next:
Keep clicking on Next until you reach the following window. In this window, fill out the Group Id and Artifact Id as follows or with anything you like. Click on Finish:
This will create a project as follows. If you expand your project by double-clicking on its name, you will see an XML file named pom.xml:
Double-click on the pom.xml file and, when it opens, delete all of its contents and copy-paste the following content into it:

```java
        <project xmlns="http://maven.apache.org/POM/4.0.0" 
           xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
          xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 
        http://maven.apache.org/xsd/maven-4.0.0.xsd"> 
       <modelVersion>4.0.0</modelVersion> 

        <groupId>org.deeplearning4j</groupId> 
        <artifactId>deeplearning4j-examples</artifactId> 
        <version>0.4-rc0-SNAPSHOT</version> 

        <name>DeepLearning4j Examples</name> 
       <description>Examples of training different data   
          sets</description> 
       <properties> 
        <nd4j.version>0.4-rc3.7</nd4j.version> 
        <dl4j.version>0.4-rc3.7</dl4j.version> 
        <canova.version>0.0.0.13</canova.version> 
        <jackson.version>2.5.1</jackson.version> 
       </properties> 

      <distributionManagement> 
        <snapshotRepository> 
            <id>sonatype-nexus-snapshots</id> 
            <name>Sonatype Nexus snapshot repository</name> 

     <url>https://oss.sonatype.org/content/repositories/snapshots</url> 
        </snapshotRepository> 
        <repository> 
            <id>nexus-releases</id> 
            <name>Nexus Release Repository</name> 

        <url>http://oss.sonatype.org/service/local/staging/deploy/maven2/</url> 
        </repository> 
         </distributionManagement> 
        <dependencyManagement> 
         <dependencies> 
            <dependency> 
                <groupId>org.nd4j</groupId> 
                <artifactId>nd4j-x86</artifactId> 
                <version>${nd4j.version}</version> 
            </dependency> 
          </dependencies> 
         </dependencyManagement> 
        <dependencies> 
        <dependency> 
            <groupId>org.deeplearning4j</groupId> 
            <artifactId>deeplearning4j-nlp</artifactId> 
            <version>${dl4j.version}</version> 
        </dependency> 

        <dependency> 
            <groupId>org.deeplearning4j</groupId> 
            <artifactId>deeplearning4j-core</artifactId> 
            <version>${dl4j.version}</version> 
        </dependency> 
        <dependency> 
            <groupId>org.deeplearning4j</groupId> 
            <artifactId>deeplearning4j-ui</artifactId> 
            <version>${dl4j.version}</version> 
        </dependency> 
        <dependency> 
            <groupId>org.nd4j</groupId> 
            <artifactId>nd4j-x86</artifactId> 
            <version>${nd4j.version}</version> 
        </dependency> 
        <dependency> 
            <artifactId>canova-nd4j-image</artifactId> 
            <groupId>org.nd4j</groupId> 
            <version>${canova.version}</version> 
        </dependency> 
        <dependency> 
            <artifactId>canova-nd4j-codec</artifactId> 
            <groupId>org.nd4j</groupId> 
            <version>${canova.version}</version> 
        </dependency> 
        <dependency> 
            <groupId>com.fasterxml.jackson.dataformat</groupId> 
            <artifactId>jackson-dataformat-yaml</artifactId> 
            <version>${jackson.version}</version> 
        </dependency> 

       </dependencies> 

        <build> 
          <plugins> 
            <plugin> 
                <groupId>org.codehaus.mojo</groupId> 
                <artifactId>exec-maven-plugin</artifactId> 
                <version>1.4.0</version> 
                <executions> 
                    <execution> 
                        <goals> 
                            <goal>exec</goal> 
                        </goals> 
                    </execution> 
                </executions> 
                <configuration> 
                    <executable>java</executable> 
                </configuration> 
            </plugin> 
            <plugin> 
                <groupId>org.apache.maven.plugins</groupId> 
                <artifactId>maven-shade-plugin</artifactId> 
                <version>1.6</version> 
                <configuration> 

        <createDependencyReducedPom>true</createDependencyReducedPom> 
              <filters> 
              <filter> 
              <artifact>*:*</artifact> 
              <excludes> 
                 <exclude>org/datanucleus/**</exclude> 
                  <exclude>META-INF/*.SF</exclude> 
                  <exclude>META-INF/*.DSA</exclude> 
                  <exclude>META-INF/*.RSA</exclude> 
              </excludes> 
                        </filter> 
                    </filters> 
                </configuration> 
                <executions> 
                    <execution> 
                        <phase>package</phase> 
                        <goals> 
                            <goal>shade</goal> 
                        </goals> 
                        <configuration> 
                            <transformers> 
             <transformer implementation="org.apache.maven.plugins.shade.resource.AppendingTransformer"> 
             <resource>reference.conf</resource> 
                                </transformer> 
             <transformer implementation="org.apache.maven.plugins.shade.resource.ServicesResourceTransformer"/> 
             <transformer implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer"> 
                                </transformer> 
                            </transformers> 
                        </configuration> 
                    </execution> 
                </executions> 
            </plugin> 

            <plugin> 
                <groupId>org.apache.maven.plugins</groupId> 
                <artifactId>maven-compiler-plugin</artifactId> 
                <configuration> 
                    <source>1.7</source> 
                    <target>1.7</target> 
                </configuration> 
              </plugin> 
            </plugins> 
          </build> 
         </project> 

```

This will download all the necessary dependencies, and you are ready to create some code.
Go to https://github.com/deeplearning4j/dl4j-examples/tree/master/dl4j-examples/src/main/resources and download the raw_sentences.txt file to your C:/ drive.

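Creating a Word2vec neural net using Deep Learning for Java (DL4j)

Word2vec can be seen as a two-layer neural net that processes natural text. In its typical usage, the input to the algorithm is a text corpus, and its output is a set of feature vectors for the words in that corpus. Note that, strictly speaking, Word2vec is not a deep neural network; rather, it translates text into a numerical form that deep neural networks can read and understand. In this recipe, we will see how to use the popular deep learning Java library named Deep Learning for Java (from now on, DL4j) to apply Word2vec to raw text.

How to do it...

Create a class named Word2VecRawTextExample:

        public class Word2VecRawTextExample {

Create a logger for this class. The logging facility is already included in your project, since you used Maven to build it: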
```
        private static Logger log = 
          LoggerFactory.getLogger(Word2VecRawTextExample.class); 
```
Start creating your main method:

```
        public static void main(String[] args) throws Exception { 
```
The first thing you will do is get the file path of the sample raw sentences text file that you have downloaded:
```
        String filePath = "c:/raw_sentences.txt"; 
```
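Now take the raw sentences in the .txt file, traverse them with your iterator, and preprocess them (for example, convert everything to lowercase and strip the whitespace before and after each line):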
```
        log.info("Load & Vectorize Sentences...."); 
        SentenceIterator iter = 
          UimaSentenceIterator.createWithPath(filePath); 
```

Word2vec uses words or tokens rather than sentences. Therefore, your next task is to tokenize the raw text:

       TokenizerFactory t = new DefaultTokenizerFactory(); 
        t.setTokenPreProcessor(new CommonPreprocessor());

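The vocabulary cache, or vocab cache, is a mechanism in DL4j for handling general-purpose natural language tasks such as TF-IDF. InMemoryLookupCache is the reference implementation: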

        InMemoryLookupCache cache = new InMemoryLookupCache(); 
          WeightLookupTable table = new InMemoryLookupTable.Builder() 
                .vectorLength(100) 
                .useAdaGrad(false) 
                .cache(cache) 
               .lr(0.025f).build();

Now that the data is ready, you are also ready to configure the Word2vec neural network:
```
        log.info("Building model...."); 
         Word2Vec vec = new Word2Vec.Builder() 
              .minWordFrequency(5).iterations(1) 
              .layerSize(100).lookupTable(table) 
              .stopWords(new ArrayList<String>()) 
              .vocabCache(cache).seed(42) 
              .windowSize(5).iterate(iter).tokenizerFactory(t).build(); 
```
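minWordFrequency is the minimum number of times a word must appear in the corpus. In this recipe, a word that appears fewer than five times is not learned. Words must appear in multiple contexts for useful features about them to be learned. If you have a very large corpus, it is reasonable to raise the minimum. layerSize denotes the number of features in the word vector, or the number of dimensions in the feature space.
Next, fit the model by starting the neural net training: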
```
        log.info("Fitting Word2Vec model...."); 
        vec.fit(); 
```

Write the word vectors created by the neural net to an output file. In our case, the output is written to a file named c:/word2vec.txt:

        log.info("Writing word vectors to text file...."); 
        WordVectorSerializer.writeWordVectors(vec, "c:/word2vec.txt");

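You can also evaluate the quality of the feature vectors. vec.wordsNearest("word1", numWordsNearest) gives us the words that the neural net has clustered as semantically similar. You set the number of nearest words you want with the second parameter of wordsNearest. vec.similarity("word1", "word2") returns the cosine similarity of the two words you enter. The closer it is to 1, the more similar the net perceives those words to be: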

```java
        log.info("Closest Words:"); 
        Collection<String> lst = vec.wordsNearest("man", 5);   
        System.out.println(lst); 
        double cosSim = vec.similarity("cruise", "voyage"); 
        System.out.println(cosSim); 

```

The output of the preceding lines is as follows:

```java
        [family, part, house, program, business] 
        1.0000001192092896 

```
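
Cosine similarity is the dot product of two vectors divided by the product of their lengths, so mathematically it lies between -1 and 1; a printed value marginally above 1, as in the output above, is a floating-point rounding effect. The following plain-Java sketch shows the formula on small made-up vectors. The class name CosineSketch is hypothetical, and this is not how DL4j implements similarity().

```java
// Plain-Java illustration of cosine similarity between two feature vectors.
class CosineSketch {

    static double cosine(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("vectors must have the same length");
        }
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        // Made-up three-dimensional "word vectors"; the recipe uses 100 dimensions.
        double[] first = {1.0, 2.0, 3.0};
        double[] second = {2.0, 4.0, 6.0};  // same direction as the first vector
        double[] third = {-3.0, 0.0, 1.0};  // orthogonal to the first vector
        System.out.println(cosine(first, second));
        System.out.println(cosine(first, third));
    }
}
```

Vectors that point in the same direction score close to 1 regardless of their lengths, which is why the measure suits word vectors whose magnitudes vary with word frequency.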

Close the main method and the class:

        } 
        }

How it works...

Right-click on your project name in Eclipse, select New, and then select Package. Give the following as your package name: word2vec.chap8.science.data. Click on Finish:
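Now that you have a package, right-click on the package name, select New, and then select Class. The name of the class should be Word2VecRawTextExample. Click on Finish:

In the editor, copy and paste the following code: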

package word2vec.chap8.science.data; 

import org.deeplearning4j.models.embeddings.WeightLookupTable; 
import org.deeplearning4j.models.embeddings.inmemory.InMemoryLookupTable; 
import org.deeplearning4j.models.embeddings.loader.WordVectorSerializer; 
import org.deeplearning4j.models.word2vec.Word2Vec; 
import org.deeplearning4j.models.word2vec.wordstore.inmemory.InMemoryLookupCache; 
import org.deeplearning4j.text.sentenceiterator.SentenceIterator; 
import org.deeplearning4j.text.sentenceiterator.UimaSentenceIterator; 
import org.deeplearning4j.text.tokenization.tokenizer.preprocessor.CommonPreprocessor; 
import org.deeplearning4j.text.tokenization.tokenizerfactory.DefaultTokenizerFactory; 
import org.deeplearning4j.text.tokenization.tokenizerfactory.TokenizerFactory; 

import org.slf4j.Logger; 
import org.slf4j.LoggerFactory; 

import java.util.ArrayList; 
import java.util.Collection; 

public class Word2VecRawTextExample { 

    private static Logger log = LoggerFactory.getLogger(Word2VecRawTextExample.class); 

    public static void main(String[] args) throws Exception { 

        // Gets Path to Text file 
        String filePath = "c:/raw_sentences.txt"; 

        log.info("Load & Vectorize Sentences...."); 
        // Strip white space before and after for each line 
        SentenceIterator iter = 
           UimaSentenceIterator.createWithPath(filePath); 
        // Split on white spaces in the line to get words 
        TokenizerFactory t = new DefaultTokenizerFactory(); 
        t.setTokenPreProcessor(new CommonPreprocessor()); 

        InMemoryLookupCache cache = new InMemoryLookupCache(); 
        WeightLookupTable table = new InMemoryLookupTable.Builder() 
                .vectorLength(100) 
                .useAdaGrad(false) 
                .cache(cache) 
                .lr(0.025f).build(); 

        log.info("Building model...."); 
        Word2Vec vec = new Word2Vec.Builder() 
            .minWordFrequency(5).iterations(1) 
            .layerSize(100).lookupTable(table) 
            .stopWords(new ArrayList<String>()) 
            .vocabCache(cache).seed(42) 

            .windowSize(5).iterate(iter).tokenizerFactory(t).build(); 

        log.info("Fitting Word2Vec model...."); 
        vec.fit(); 

        log.info("Writing word vectors to text file...."); 
        // Write word 
        WordVectorSerializer.writeWordVectors(vec, "word2vec.txt"); 

        log.info("Closest Words:"); 
        Collection<String> lst = vec.wordsNearest("man", 5); 
        System.out.println(lst); 
        double cosSim = vec.similarity("cruise", "voyage"); 
        System.out.println(cosSim); 
    } 
}

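There's more...

minWordFrequency: This is the minimum number of times a word must appear in the corpus. In this recipe, a word that appears fewer than five times is not learned. Words must appear in multiple contexts for useful features about them to be learned. If you have a very large corpus, it is reasonable to raise the minimum.
iterations: This is the number of times you allow the neural net to update its coefficients for one batch of the data. Too few iterations can cause insufficient learning, and too many will make the net's training longer.
layerSize: This denotes the number of features in the word vector, or the number of dimensions in the feature space.
iterate: This tells the net which batch of the dataset it is training on.
tokenizerFactory: This feeds the net the words from the current batch.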
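
The effect of a minimum word frequency can be illustrated without DL4j: count every token and keep only those that reach the threshold. The sketch below is a plain-Java approximation of that idea, not DL4j's vocabulary code; the class name MinFrequencySketch, the method vocabulary, and the sample sentence are made up.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Plain-Java illustration of pruning a vocabulary by minimum word frequency.
class MinFrequencySketch {

    static Set<String> vocabulary(List<String> tokens, int minWordFrequency) {
        Map<String, Integer> counts = new HashMap<>();
        for (String token : tokens) {
            counts.merge(token, 1, Integer::sum); // increment the count for this token
        }
        Set<String> vocab = new TreeSet<>(); // sorted, so the output order is stable
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            if (entry.getValue() >= minWordFrequency) {
                vocab.add(entry.getKey());
            }
        }
        return vocab;
    }

    public static void main(String[] args) {
        // Made-up corpus: "the" occurs three times, "was" twice, every other word once.
        List<String> tokens = Arrays.asList(
            "the day was long and the night was short the end".split(" "));
        System.out.println(vocabulary(tokens, 2)); // prints [the, was]
    }
}
```

With the recipe's threshold of five, any word seen fewer than five times in raw_sentences.txt would be dropped in the same way before training.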

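Creating a deep belief neural net using Deep Learning for Java (DL4j)

A deep belief network can be defined as a stack of restricted Boltzmann machines (RBMs), where each RBM layer communicates with both the previous and the subsequent layers. In this recipe, we will see how to create such a network. For simplicity, we limit ourselves in this recipe to a single hidden layer. Therefore, the network developed in this recipe is not, strictly speaking, a deep belief neural net, and readers are encouraged to add more hidden layers.

How to do it...

Create a class named DBNIrisExample:

        public class DBNIrisExample {

Create a logger for the class to log messages: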

        private static Logger log = 
          LoggerFactory.getLogger(DBNIrisExample.class);

Start writing your main method:

        public static void main(String[] args) throws Exception {

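First, customize two parameters of the Nd4j class: the maximum number of slices to print and the maximum number of elements per slice. Set them to -1: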
```
        Nd4j.MAX_SLICES_TO_PRINT = -1; 
        Nd4j.MAX_ELEMENTS_PER_SLICE = -1; 
```
Next, customize the other parameters:
```
      final int numRows = 4; 
       final int numColumns = 1; 
        int outputNum = 3; 
        int numSamples = 150; 
        int batchSize = 150; 
        int iterations = 5; 
        int splitTrainNum = (int) (batchSize * .8); 
        int seed = 123; 
        int listenerFreq = 1; 
```
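- In DL4j, the input data can be two-dimensional, so you need to specify the number of rows and columns of the data. Because the Iris data is one-dimensional (four feature values per example), the number of columns is set to 1.
- In the code, numSamples is the total number of examples and batchSize is the number of examples in each batch.
- splitTrainNum is the variable that allocates the data between training and testing. Here, 80% of the dataset is training data, and the rest is treated as test data. listenerFreq determines how often we see the value of the loss function during logging. Here it is set to 1, which means that the value is logged after every iteration.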

Use the following code to load the Iris dataset automatically, given the batch size and the number of samples:

        log.info("Load data...."); 
        DataSetIterator iter = new IrisDataSetIterator(batchSize,  
          numSamples);

Format the data by normalizing it to zero mean and unit variance:

        DataSet next = iter.next(); 
          next.normalizeZeroMeanZeroUnitVariance();
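
As the method name suggests, this normalization gives each feature zero mean and unit variance. A plain-Java sketch of that transformation for a single feature column follows. It assumes the population standard deviation (dividing by n), which may differ from what ND4J uses internally; the class name StandardizeSketch and the sample values are hypothetical.

```java
import java.util.Arrays;

// Plain-Java illustration of standardizing one feature column to zero mean and unit variance.
class StandardizeSketch {

    // Standardizes the column in place: subtract the mean, then divide by the standard deviation.
    static void standardize(double[] column) {
        int n = column.length;
        double sum = 0.0;
        for (double value : column) {
            sum += value;
        }
        double mean = sum / n;

        double squaredDeviations = 0.0;
        for (double value : column) {
            squaredDeviations += (value - mean) * (value - mean);
        }
        // Assumption of this sketch: population standard deviation (divide by n).
        double std = Math.sqrt(squaredDeviations / n);

        for (int i = 0; i < n; i++) {
            // A constant column has zero spread; map it to 0.0 instead of dividing by zero.
            column[i] = std == 0.0 ? 0.0 : (column[i] - mean) / std;
        }
    }

    public static void main(String[] args) {
        double[] sepalLengths = {5.1, 4.9, 6.3, 5.8}; // made-up sample values
        standardize(sepalLengths);
        System.out.println(Arrays.toString(sepalLengths));
    }
}
```

Standardizing matters here because the first layer uses Gaussian visible units, which work best when the inputs are centered and on a comparable scale.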

Next, split the data into training and test sets. Use a random seed for the split, and enforce numerical stability for the training:

       log.info("Split data...."); 
        SplitTestAndTrain testAndTrain = 
           next.splitTestAndTrain(splitTrainNum, new Random(seed)); 
        DataSet train = testAndTrain.getTrain(); 
        DataSet test = testAndTrain.getTest(); 
        Nd4j.ENFORCE_NUMERICAL_STABILITY = true;
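
The seeded split can be pictured as shuffling the example indices with a fixed random seed and cutting the shuffled list at splitTrainNum. The sketch below shows this in plain Java; it illustrates the idea rather than DL4j's splitTestAndTrain implementation, and the class name SplitSketch and its method are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Plain-Java illustration of a reproducible train/test split over example indices.
class SplitSketch {

    // Returns two lists: the training indices first, then the test indices.
    static List<List<Integer>> split(int numExamples, int numTrain, long seed) {
        List<Integer> indices = new ArrayList<>();
        for (int i = 0; i < numExamples; i++) {
            indices.add(i);
        }
        // The same seed always yields the same shuffle, so the split is repeatable.
        Collections.shuffle(indices, new Random(seed));
        List<Integer> train = new ArrayList<>(indices.subList(0, numTrain));
        List<Integer> test = new ArrayList<>(indices.subList(numTrain, numExamples));
        return Arrays.asList(train, test);
    }

    public static void main(String[] args) {
        int batchSize = 150;
        int splitTrainNum = (int) (batchSize * .8); // 120, as in the recipe
        List<List<Integer>> parts = split(batchSize, splitTrainNum, 123L);
        System.out.println(parts.get(0).size() + " training / " + parts.get(1).size() + " test");
    }
}
```

With 150 Iris examples, this leaves 120 for training and 30 for testing.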

Now, write the following block of code to build your model:

         MultiLayerConfiguration conf = new 
            NeuralNetConfiguration.Builder() 
            .seed(seed) 
            .iterations(iterations) 
            .learningRate(1e-6f) 
           .optimizationAlgo(OptimizationAlgorithm.CONJUGATE_GRADIENT) 
            .l1(1e-1).regularization(true).l2(2e-4) 
            .useDropConnect(true) 
            .list(2)

Let's examine this piece of code:

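*   With the seed method, you lock in the weight initialization so that runs are reproducible while tuning
*   Then, you set the number of training iterations for prediction or classification
*   Then, you define the optimization step size (the learning rate) and choose the optimization algorithm, here conjugate gradient, that uses the gradients to update the weights
*   Finally, in the list() method, you supply 2 as the number of neural net layers (excluding the input layer)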

Then, add the following method calls to the code from the previous step. This code sets up the first layer of your neural net:

```java
         .layer(0, new RBM.Builder(RBM.HiddenUnit.RECTIFIED, 
             RBM.VisibleUnit.GAUSSIAN) 

            .nIn(numRows * numColumns) 

            .nOut(3) 
           .weightInit(WeightInit.XAVIER) 
            .k(1) 
           .activation("relu") 
           .lossFunction(LossFunctions.LossFunction.RMSE_XENT) 
           .updater(Updater.ADAGRAD) 
            .dropOut(0.5) 
           .build() 
          )
```

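*   The value 0 in the first line is the index of the layer
*   k() sets the number of contrastive divergence steps
*   Instead of a binary RBM, which we cannot use here because the Iris data consists of floating-point values, we use RBM.VisibleUnit.GAUSSIAN, which enables the model to handle continuous values
*   Updater.ADAGRAD is used to adapt the learning rate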

Then, add the following method calls to the code from the previous step. This code sets up the second (output) layer of your neural net:

```java
        .layer(1, new 
           OutputLayer.Builder(LossFunctions.LossFunction.MCXENT) 

            .nIn(3) 

            .nOut(outputNum) 
            .activation("softmax") 
            .build() 
        )  .build(); 

```
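
The "softmax" activation of the output layer turns the three raw scores, one per Iris class, into probabilities that sum to 1. A minimal plain-Java sketch of the function follows; the class name SoftmaxSketch and the input scores are made up, and subtracting the maximum is a common numerical-stability choice of this sketch rather than something taken from the recipe.

```java
import java.util.Arrays;

// Plain-Java illustration of the softmax function used by the output layer.
class SoftmaxSketch {

    static double[] softmax(double[] scores) {
        // Subtract the largest score first so that Math.exp cannot overflow.
        double max = Double.NEGATIVE_INFINITY;
        for (double s : scores) {
            max = Math.max(max, s);
        }
        double[] result = new double[scores.length];
        double sum = 0.0;
        for (int i = 0; i < scores.length; i++) {
            result[i] = Math.exp(scores[i] - max);
            sum += result[i];
        }
        for (int i = 0; i < result.length; i++) {
            result[i] /= sum; // normalize so the outputs sum to 1
        }
        return result;
    }

    public static void main(String[] args) {
        double[] scores = {1.0, 2.0, 3.0}; // made-up raw outputs for the three classes
        System.out.println(Arrays.toString(softmax(scores)));
    }
}
```

The class with the largest probability is taken as the prediction, which is what the evaluation step compares against the actual labels.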

Finalize the model building:

```java
        MultiLayerNetwork model = new MultiLayerNetwork(conf); 
        model.init(); 

```

Once the model is configured, train it:

```java
        model.setListeners(Arrays.asList((IterationListener) new 
          ScoreIterationListener(listenerFreq))); 
        log.info("Train model...."); 
        model.fit(train); 

```

You can inspect the weights with the following code:

```java
        log.info("Evaluate weights...."); 
        for(org.deeplearning4j.nn.api.Layer layer : model.getLayers()) 
        { 
           INDArray w = 
            layer.getParam(DefaultParamInitializer.WEIGHT_KEY); 
          log.info("Weights: " + w); 
        } 

```

Finally, evaluate your model:

```java
        log.info("Evaluate model...."); 
         Evaluation eval = new Evaluation(outputNum); 
         INDArray output = model.output(test.getFeatureMatrix()); 

         for (int i = 0; i < output.rows(); i++) { 
             String actual = 
                test.getLabels().getRow(i).toString().trim(); 
             String predicted = output.getRow(i).toString().trim(); 
             log.info("actual " + actual + " vs predicted " + 
                predicted); 
         } 

         eval.eval(test.getLabels(), output); 
         log.info(eval.stats()); 

```

The output of this code is as follows:

```java
 ========================Scores================================== 
        Accuracy:  0.8333 
        Precision: 1 
        Recall:    0.8333 
        F1 Score:  0.9090909090909091 

```
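
The F1 score is the harmonic mean of precision and recall, and the reported figures are consistent with that definition. The sketch below recomputes it in plain Java, under the assumption that the rounded recall of 0.8333 stands for 5/6; the class name F1Sketch is hypothetical.

```java
// Plain-Java illustration of the F1 score as the harmonic mean of precision and recall.
class F1Sketch {

    static double f1(double precision, double recall) {
        if (precision + recall == 0.0) {
            return 0.0; // convention chosen for this sketch when both are zero
        }
        return 2.0 * precision * recall / (precision + recall);
    }

    public static void main(String[] args) {
        double precision = 1.0;
        double recall = 5.0 / 6.0; // assumption: the printed 0.8333 is 5/6 rounded
        // 2 * 1 * (5/6) / (1 + 5/6) = 10/11, roughly 0.909
        System.out.println("F1 Score: " + f1(precision, recall));
    }
}
```

Because the harmonic mean is pulled toward the smaller value, a perfect precision cannot fully compensate for a recall below 1.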

Finally, close the main method and the class:

        } 
        }

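How it works...

Right-click on your project name in Eclipse, select New, and then select Package. Give the following as your package name: deepbelief.chap8.science.data. Click on Finish.
Now that you have a package, right-click on the package name, select New, and then select Class. The name of the class should be DBNIrisExample. Click on Finish.

In the editor, copy and paste the following code: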

package deepbelief.chap8.science.data; 

import org.deeplearning4j.datasets.iterator.DataSetIterator; 
import org.deeplearning4j.datasets.iterator.impl.IrisDataSetIterator; 
import org.deeplearning4j.eval.Evaluation; 
import org.deeplearning4j.nn.api.OptimizationAlgorithm; 
import org.deeplearning4j.nn.conf.MultiLayerConfiguration; 
import org.deeplearning4j.nn.conf.NeuralNetConfiguration; 
import org.deeplearning4j.nn.conf.Updater; 
import org.deeplearning4j.nn.conf.layers.OutputLayer; 
import org.deeplearning4j.nn.conf.layers.RBM; 
import org.deeplearning4j.nn.multilayer.MultiLayerNetwork; 
import org.deeplearning4j.nn.params.DefaultParamInitializer; 
import org.deeplearning4j.nn.weights.WeightInit; 
import org.deeplearning4j.optimize.api.IterationListener; 
import org.deeplearning4j.optimize.listeners.ScoreIterationListener; 
import org.nd4j.linalg.api.ndarray.INDArray; 
import org.nd4j.linalg.dataset.DataSet; 
import org.nd4j.linalg.dataset.SplitTestAndTrain; 
import org.nd4j.linalg.factory.Nd4j; 
import org.nd4j.linalg.lossfunctions.LossFunctions; 
import org.slf4j.Logger; 
import org.slf4j.LoggerFactory; 
import java.util.Arrays; 
import java.util.Random; 

public class DBNIrisExample { 

    private static Logger log = 
      LoggerFactory.getLogger(DBNIrisExample.class); 

    public static void main(String[] args) throws Exception { 
        Nd4j.MAX_SLICES_TO_PRINT = -1; 
        Nd4j.MAX_ELEMENTS_PER_SLICE = -1; 

        final int numRows = 4; 
        final int numColumns = 1; 
        int outputNum = 3; 
        int numSamples = 150; 
        int batchSize = 150; 
        int iterations = 5; 
        int splitTrainNum = (int) (batchSize * .8); 
        int seed = 123; 
        int listenerFreq = 1; 

        log.info("Load data...."); 
        DataSetIterator iter = new IrisDataSetIterator(batchSize, 
          numSamples); 
        DataSet next = iter.next(); 
        next.normalizeZeroMeanZeroUnitVariance(); 

        log.info("Split data...."); 
        SplitTestAndTrain testAndTrain = 
           next.splitTestAndTrain(splitTrainNum, new Random(seed)); 
        DataSet train = testAndTrain.getTrain(); 
        DataSet test = testAndTrain.getTest(); 
        Nd4j.ENFORCE_NUMERICAL_STABILITY = true; 

        log.info("Build model...."); 
        MultiLayerConfiguration conf = new 
          NeuralNetConfiguration.Builder() 
           .seed(seed) 
           .iterations(iterations) 
           .learningRate(1e-6f) 
           .optimizationAlgo(OptimizationAlgorithm.CONJUGATE_GRADIENT) 
           .l1(1e-1).regularization(true).l2(2e-4) 
           .useDropConnect(true) 
           .list(2) 
           .layer(0, new RBM.Builder(RBM.HiddenUnit.RECTIFIED, 
             RBM.VisibleUnit.GAUSSIAN) 
            .nIn(numRows * numColumns) 
            .nOut(3) 
            .weightInit(WeightInit.XAVIER) 
            .k(1) 
            .activation("relu") 
            .lossFunction(LossFunctions.LossFunction.RMSE_XENT) 
            .updater(Updater.ADAGRAD) 
            .dropOut(0.5) 
            .build() 
          ) 
          .layer(1, new 
             OutputLayer.Builder(LossFunctions.LossFunction.MCXENT) 
            .nIn(3) 
            .nOut(outputNum) 
            .activation("softmax") 
            .build() 
        ) 
        .build(); 
        MultiLayerNetwork model = new MultiLayerNetwork(conf); 
        model.init(); 

        model.setListeners(Arrays.asList((IterationListener) new 
           ScoreIterationListener(listenerFreq))); 
        log.info("Train model...."); 
        model.fit(train); 

        log.info("Evaluate weights...."); 
        for(org.deeplearning4j.nn.api.Layer layer : model.getLayers()) 
        { 
            INDArray w = 
            layer.getParam(DefaultParamInitializer.WEIGHT_KEY); 
            log.info("Weights: " + w); 
        } 

        log.info("Evaluate model...."); 
        Evaluation eval = new Evaluation(outputNum); 
        INDArray output = model.output(test.getFeatureMatrix()); 

        for (int i = 0; i < output.rows(); i++) { 
            String actual = 
              test.getLabels().getRow(i).toString().trim(); 
            String predicted = output.getRow(i).toString().trim(); 
            log.info("actual " + actual + " vs predicted " + 
              predicted); 
        } 

        eval.eval(test.getLabels(), output); 
        log.info(eval.stats()); 

    } 
}

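Creating a deep autoencoder using Deep Learning for Java (DL4j)

A deep autoencoder is a deep neural network composed of two symmetrical deep belief networks. Each half usually consists of four or five shallow layers (restricted Boltzmann machines), and the two halves form the encoding and decoding parts of the network. In this recipe, you will develop a deep autoencoder consisting of one input layer, four encoding layers, four decoding layers, and one output layer. To do so, we will use a very popular dataset named MNIST.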

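Note

To learn more about MNIST, visit http://yann.lecun.com/exdb/mnist/. To learn more about deep autoencoders, visit https://deeplearning4j.org/deepautoencoder.

The project setup is the same as in the earlier recipes of this chapter: create the project in Eclipse, fill in the Group Id and Artifact Id with values of your choice, and click Finish.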

How to do it...

Start by creating a class named DeepAutoEncoderExample:

        public class DeepAutoEncoderExample {

You will log messages throughout the code, so create a logger for your class:

        private static Logger log = 
           LoggerFactory.getLogger(DeepAutoEncoderExample.class);

Start writing your main method:

        public static void main(String[] args) throws Exception {

At the very beginning of your main method, define the parameters that you may need to change or configure:
```
        final int numRows = 28; 
        final int numColumns = 28; 
        int seed = 123; 
        int numSamples = MnistDataFetcher.NUM_EXAMPLES; 
        int batchSize = 1000; 
        int iterations = 1; 
        int listenerFreq = iterations/5;
```
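- numRows and numColumns are set to 28 because the images in the MNIST database are 28×28 pixels
- The seed, 123, is an arbitrary choice
- numSamples is the total number of samples in the example dataset
- batchSize is set to 1000, so 1000 data samples are used at a time
- listenerFreq determines how often the score listener reports; with iterations set to 1, the integer division iterations/5 evaluates to 0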
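With iterations set to 1, the expression iterations/5 uses Java integer division and therefore evaluates to 0. The following minimal sketch shows that behavior and one possible guard. The class name ListenerFrequencyDemo and the helper safeListenerFrequency are hypothetical names introduced for illustration; they are not part of DL4j or of the recipe.

```java
// Illustration only: ListenerFrequencyDemo and safeListenerFrequency
// are hypothetical names, not part of DL4j.
final class ListenerFrequencyDemo {

    // Returns iterations / 5 (integer division), but never less than 1.
    static int safeListenerFrequency(int iterations) {
        return Math.max(1, iterations / 5);
    }

    public static void main(String[] args) {
        int iterations = 1;
        // Integer division discards the fractional part.
        System.out.println("iterations / 5 = " + (iterations / 5));
        System.out.println("guarded value  = " + safeListenerFrequency(iterations));
    }
}
```

Whether a guard like this is needed depends on how the listener you use treats a frequency of 0, so treat it as an optional precaution.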

Next, load the MNIST data points, passing in the batch size and the number of samples:

      log.info("Load data...."); 
      DataSetIterator iter = new 
         MnistDataSetIterator(batchSize,numSamples,true);

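Next, configure the neural network. First, build a multilayer neural network using the seed and the number of iterations, and set line gradient descent as the optimization algorithm. You also specify 10 layers in total: one input layer, four encoding layers, four decoding layers, and one output layer: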

        log.info("Build model...."); 
        MultiLayerConfiguration conf = new 
         NeuralNetConfiguration.Builder() 
        .seed(seed) 
        .iterations(iterations) 
        .optimizationAlgo(OptimizationAlgorithm.LINE_GRADIENT_DESCENT) 
        .list(10)

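Then append the following code to the code from the previous step. This is where you create all 10 layers and enable pretraining and backpropagation: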

        .layer(0, new RBM.Builder().nIn(numRows * 
           numColumns).nOut(1000).lossFunction
             (LossFunctions.LossFunction.RMSE_XENT).build()) 
       .layer(1, new RBM.Builder().nIn(1000).nOut(500).lossFunction
             (LossFunctions.LossFunction.RMSE_XENT).build()) 
       .layer(2, new RBM.Builder().nIn(500).nOut(250).lossFunction
             (LossFunctions.LossFunction.RMSE_XENT).build()) 
       .layer(3, new RBM.Builder().nIn(250).nOut(100).lossFunction
             (LossFunctions.LossFunction.RMSE_XENT).build()) 
       .layer(4, new RBM.Builder().nIn(100).nOut(30).lossFunction
             (LossFunctions.LossFunction.RMSE_XENT).build())
             // encoding stops
       .layer(5, new RBM.Builder().nIn(30).nOut(100).lossFunction
             (LossFunctions.LossFunction.RMSE_XENT).build())              
             // decoding starts
       .layer(6, new RBM.Builder().nIn(100).nOut(250).lossFunction
             (LossFunctions.LossFunction.RMSE_XENT).build()) 
       .layer(7, new RBM.Builder().nIn(250).nOut(500).lossFunction
             (LossFunctions.LossFunction.RMSE_XENT).build()) 
       .layer(8, new RBM.Builder().nIn(500).nOut(1000).lossFunction
             (LossFunctions.LossFunction.RMSE_XENT).build()) 
       .layer(9, new OutputLayer.Builder(LossFunctions.
           LossFunction.RMSE_XENT).nIn(1000).nOut(numRows*numColumns).
               build()) 
       .pretrain(true).backprop(true) .build();
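The layer sizes in the configuration above mirror each other around the 30-unit middle layer. As a rough check, the following plain-Java sketch rebuilds the list of widths from the encoder half and prints the nIn and nOut of each of the 10 layers. The class AutoencoderLayerSizes and the method mirror are hypothetical helpers written for this illustration; they do not use or represent the DL4j API.

```java
import java.util.Arrays;

// Illustration only: a hypothetical helper, independent of DL4j.
final class AutoencoderLayerSizes {

    // Builds the full list of layer widths: the encoder widths
    // followed by their mirror image (the middle width appears once).
    static int[] mirror(int[] encoder) {
        int n = encoder.length;
        int[] full = new int[2 * n - 1];
        for (int i = 0; i < n; i++) {
            full[i] = encoder[i];
            full[2 * n - 2 - i] = encoder[i];
        }
        return full;
    }

    public static void main(String[] args) {
        // Widths taken from the recipe: 28*28 inputs down to a 30-unit code.
        int[] encoder = {28 * 28, 1000, 500, 250, 100, 30};
        int[] widths = mirror(encoder);
        System.out.println(Arrays.toString(widths));
        // Each consecutive pair of widths is one layer's nIn and nOut.
        for (int layer = 0; layer + 1 < widths.length; layer++) {
            System.out.println("layer " + layer + ": nIn=" + widths[layer]
                    + ", nOut=" + widths[layer + 1]);
        }
    }
}
```

Eleven widths give ten layers, and each layer's nIn equals the previous layer's nOut, which is the property the hand-written configuration has to satisfy.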

Now that you have configured the model, initialize it:

        MultiLayerNetwork model = new MultiLayerNetwork(conf); 
        model.init();

Next, train the model. Because an autoencoder learns to reproduce its input, the features are passed as both the input and the target:

        model.setListeners(Arrays.asList((IterationListener) new 
           ScoreIterationListener(listenerFreq))); 
        log.info("Train model...."); 
        while (iter.hasNext()) { 
            DataSet next = iter.next(); 
            model.fit(new 
              DataSet(next.getFeatureMatrix(), next.getFeatureMatrix())); 
        }
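The training loop passes the feature matrix as both input and target, so the network is trained to reconstruct its own input. One simple way to think about how good a reconstruction is, sketched below in plain Java, is the mean squared error between an input vector and its reconstruction. This is only an illustration: the class ReconstructionError is hypothetical, and the measure shown here is not the RMSE_XENT loss that the recipe configures in DL4j.

```java
// Illustration only: a hypothetical helper, not DL4j's loss function.
final class ReconstructionError {

    // Mean squared error between an input vector and its reconstruction.
    static double meanSquaredError(double[] input, double[] reconstruction) {
        if (input.length == 0 || input.length != reconstruction.length) {
            throw new IllegalArgumentException("vectors must be non-empty and of equal length");
        }
        double sum = 0.0;
        for (int i = 0; i < input.length; i++) {
            double diff = input[i] - reconstruction[i];
            sum += diff * diff;
        }
        return sum / input.length;
    }

    public static void main(String[] args) {
        double[] pixels = {0.0, 1.0, 1.0, 0.0};
        double[] perfect = {0.0, 1.0, 1.0, 0.0};
        double[] blurry = {0.5, 0.5, 0.5, 0.5};
        System.out.println("perfect reconstruction: " + meanSquaredError(pixels, perfect));
        System.out.println("blurry reconstruction:  " + meanSquaredError(pixels, blurry));
    }
}
```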

Finally, close the main method and the class:

       } 
       }

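How it works...

In Eclipse, right-click your project name, select New, and then select Package. Enter the following package name: deepbelief.chap8.science.data. Click Finish.
Now that you have a package, right-click the package name, select New, and then select Class. The class name should be DeepAutoEncoderExample. Click Finish.

In the editor, copy and paste the following code: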

package deepbelief.chap8.science.data; 
import org.deeplearning4j.datasets.fetchers.MnistDataFetcher; 
import org.deeplearning4j.datasets.iterator.impl.MnistDataSetIterator; 
import org.deeplearning4j.nn.api.OptimizationAlgorithm; 
import org.deeplearning4j.nn.conf.MultiLayerConfiguration; 
import org.deeplearning4j.nn.conf.NeuralNetConfiguration; 
import org.deeplearning4j.nn.conf.layers.OutputLayer; 
import org.deeplearning4j.nn.conf.layers.RBM; 
import org.deeplearning4j.nn.multilayer.MultiLayerNetwork; 
import org.deeplearning4j.optimize.api.IterationListener; 
import org.deeplearning4j.optimize.listeners.ScoreIterationListener; 
import org.nd4j.linalg.dataset.DataSet; 
import org.nd4j.linalg.dataset.api.iterator.DataSetIterator; 
import org.nd4j.linalg.lossfunctions.LossFunctions; 
import org.slf4j.Logger; 
import org.slf4j.LoggerFactory; 
import java.util.Arrays; 

public class DeepAutoEncoderExample { 
     private static Logger log = 
       LoggerFactory.getLogger(DeepAutoEncoderExample.class); 

    public static void main(String[] args) throws Exception { 
        final int numRows = 28; 
        final int numColumns = 28; 
        int seed = 123; 
        int numSamples = MnistDataFetcher.NUM_EXAMPLES; 
        int batchSize = 1000; 
        int iterations = 1; 
        int listenerFreq = iterations/5; 

        log.info("Load data...."); 
        DataSetIterator iter = new 
          MnistDataSetIterator(batchSize,numSamples,true); 

        log.info("Build model...."); 
        MultiLayerConfiguration conf = new 
          NeuralNetConfiguration.Builder() 
                .seed(seed) 
                .iterations(iterations) 

       .optimizationAlgo(OptimizationAlgorithm.LINE_GRADIENT_DESCENT) 
                .list(10)

       .layer(0, new RBM.Builder().nIn(numRows * 
           numColumns).nOut(1000).lossFunction
             (LossFunctions.LossFunction.RMSE_XENT).build()) 
       .layer(1, new RBM.Builder().nIn(1000).nOut(500).lossFunction
             (LossFunctions.LossFunction.RMSE_XENT).build()) 
       .layer(2, new RBM.Builder().nIn(500).nOut(250).lossFunction
             (LossFunctions.LossFunction.RMSE_XENT).build()) 
       .layer(3, new RBM.Builder().nIn(250).nOut(100).lossFunction
             (LossFunctions.LossFunction.RMSE_XENT).build()) 
       .layer(4, new RBM.Builder().nIn(100).nOut(30).lossFunction
             (LossFunctions.LossFunction.RMSE_XENT).build()) 
             // encoding stops
       .layer(5, new RBM.Builder().nIn(30).nOut(100).lossFunction
             (LossFunctions.LossFunction.RMSE_XENT).build())
             // decoding starts
       .layer(6, new RBM.Builder().nIn(100).nOut(250).lossFunction
             (LossFunctions.LossFunction.RMSE_XENT).build()) 
       .layer(7, new RBM.Builder().nIn(250).nOut(500).lossFunction
             (LossFunctions.LossFunction.RMSE_XENT).build()) 
       .layer(8, new RBM.Builder().nIn(500).nOut(1000).lossFunction
             (LossFunctions.LossFunction.RMSE_XENT).build()) 
       .layer(9, new OutputLayer.Builder(LossFunctions.
           LossFunction.RMSE_XENT).nIn(1000).nOut(numRows*numColumns).
               build()) 
       .pretrain(true).backprop(true) .build(); 

        MultiLayerNetwork model = new MultiLayerNetwork(conf); 
        model.init(); 

        model.setListeners(Arrays.asList((IterationListener) new 
           ScoreIterationListener(listenerFreq))); 

        log.info("Train model...."); 
        while(iter.hasNext()) { 
            DataSet next = iter.next(); 
            model.fit(new 
          DataSet(next.getFeatureMatrix(),next.getFeatureMatrix())); 
        } 
    } 
}

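9. Visualizing Data

In this chapter, we will cover the following recipes:

Drawing a 2D sine graph
Drawing a histogram
Drawing a bar plot
Drawing a box plot or whisker diagram
Drawing a scatter plot
Drawing a donut plot
Drawing an area graph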

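Introduction

Data visualization is becoming increasingly popular in the data science community because it communicates information about the underlying data visually, with the help of points, lines, or bars. A visualization conveys information not only to data scientists but also to audiences with little or no knowledge of the underlying data distribution or the nature of the data. In many cases, management, stakeholders, and business executives use data visualization to make decisions or to understand trends.

In this chapter, we present seven recipes for visualizing data with sine graphs, histograms, bar plots, box plots, scatter plots, donut or pie plots, and area graphs. Because this is a cookbook, we give only a very brief introduction to each plot and do not cover its background, advantages, or areas of use in depth. Instead, we focus on the technical details of the Java library that produces the visualizations.

In this chapter, we will use a Java library named GRAL (short for GRAphing Library) to present data graphically. There are several reasons for choosing GRAL for the data visualization recipes in this chapter: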

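A comprehensive collection of classes
Availability of data processing features such as smoothing, rescaling, statistics, and histograms
Availability of plots popular among data scientists. The plots include the following:
- xy/scatter plot
- bubble plot
- line plot
- area plot
- bar plot
- pie plot
- donut plot
- box-and-whisker plot
- raster plot
Features for displaying legends
Support for several file formats as data sources or data sinks (CSV, bitmap image data, audio file data)
Export of plots in bitmap and vector file formats (PNG, GIF, JPEG, EPS, PDF, SVG)
Small memory footprint (about 300 kilobytes)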

Interested readers are encouraged to look at a comparison of Java data visualization libraries: https://github.com/eseifert/gral/wiki/comparison.

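Drawing a 2D sine graph

In this recipe, we will use a free Java graphing library named GRAL to draw a 2D sine graph. Sine graphs are often useful to data scientists because they are trigonometric graphs that can model fluctuations in data (for example, using temperature data to build a model that predicts the times of year when a location is pleasant to visit).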

Getting ready

To use GRAL in your project, you need to download the GRAL JAR file and include it in your project as an external Java library. To download the JAR file, go to http://trac.erichseifert.de/gral/wiki/Download and download GRAL JAR file version 0.10 from the legacy version section. The file that you are going to download is a ZIP file named gral-core-0.10.zip.

Once you have downloaded the file, extract it, and you will see the files and folders in the distribution. Among them, you will find a folder named lib, which is the folder of interest.
Go to the lib folder. There will be two JAR files there: gral-core-0.10 and VectorGraphics2D-0.9.1. For this tutorial, you will only need gral-core-0.10.jar.
In your Eclipse project, add this JAR file as an external library.
Now you are ready to write some code to draw the sine graph.

How to do it...

First, create a Java class named SineGraph that extends JFrame, because the plot of your data will be drawn onto a JFrame:
```
        public class SineGraph extends JFrame { 
```

Next, declare the serial version UID as a class variable:

        private static final long serialVersionUID = 1L;

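The serialVersionUID can be seen as a version control number for a Serializable class. If you do not declare a serialVersionUID explicitly, one is computed for you automatically. More details are beyond the scope of this book and can be found at http://docs.oracle.com/javase/1.5.0/docs/api/java/io/Serializable.html.
Next, create a constructor for the class. The constructor defines what happens when the frame is closed, sets the size of the frame in which the sine graph is drawn, and builds a data table from values generated in a for loop. In this example the result is therefore a true sine curve; your real data may not form a perfect sine graph: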
```
        public SineGraph() throws FileNotFoundException, IOException { 
```

Set the default action for when the frame is closed:

        setDefaultCloseOperation(EXIT_ON_CLOSE);

Set the size of the frame:
```
        setSize(1600, 1400); 
```

Create artificial x and y values in a loop and add them to the data table:

        DataTable data = new DataTable(Double.class, Double.class); 
         for (double x = -5.0; x <= 5.0; x+=0.25) { 
         double y = 5.0*Math.sin(x); 
         data.add(x, y); 
       }
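To see what the loop above puts into the data table, the following plain-Java sketch generates the same (x, y) pairs into a list instead of a GRAL DataTable. The class SineSamples and its sample method are hypothetical names used only for this illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Illustration only: a hypothetical helper that mirrors the recipe's loop
// without depending on GRAL.
final class SineSamples {

    // Returns {x, amplitude * sin(x)} pairs for x = from, from + step, ... while x <= to.
    static List<double[]> sample(double from, double to, double step, double amplitude) {
        List<double[]> points = new ArrayList<>();
        for (double x = from; x <= to; x += step) {
            points.add(new double[] {x, amplitude * Math.sin(x)});
        }
        return points;
    }

    public static void main(String[] args) {
        // Same range, step, and amplitude as the recipe.
        List<double[]> points = sample(-5.0, 5.0, 0.25, 5.0);
        System.out.println("number of points: " + points.size());
        for (double[] p : points) {
            System.out.println(p[0] + "\t" + p[1]);
        }
    }
}
```

The step 0.25 is exactly representable as a double, so both endpoints are included. With a step such as 0.1, accumulated rounding can change whether the last point is reached, in which case looping over an integer index is the safer choice.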

To draw the sine graph, use GRAL's XYPlot class. Create an XYPlot object by passing it the data you created:
```
        XYPlot plot = new XYPlot(data); 
```

Wrap the plot in an InteractivePanel and add it to the frame's content pane:

        getContentPane().add(new InteractivePanel(plot));

To render the plot, create a 2D line renderer and assign it, together with the data, to the XYPlot object:

```java
        LineRenderer lines = new DefaultLineRenderer2D(); 
        plot.setLineRenderer(data, lines); 

```

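You can use Java's Color class (java.awt.Color) to set the color of the plot:

```java
        Color color = new Color(0.0f, 0.0f, 0.0f); 
```

The Color constructor takes the red, green, and blue values as arguments. In the preceding example, the plot is drawn in black because all three values are 0. The complete listing at the end of this recipe uses (0.0f, 0.3f, 1.0f) instead, which gives a blue plot.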
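The float constructor of java.awt.Color expects each component in the range 0.0 to 1.0 and stores it as an integer from 0 to 255. The short sketch below, which is only an illustration with a hypothetical class name RgbColorDemo, prints the stored components for the two colors that appear in this recipe.

```java
import java.awt.Color;

// Illustration only: RgbColorDemo is a hypothetical class name.
final class RgbColorDemo {

    // Formats the stored 0-255 components of a color as "r,g,b".
    static String describe(Color c) {
        return c.getRed() + "," + c.getGreen() + "," + c.getBlue();
    }

    public static void main(String[] args) {
        Color black = new Color(0.0f, 0.0f, 0.0f); // all components zero
        Color blue = new Color(0.0f, 0.3f, 1.0f);  // no red, some green, full blue
        System.out.println("black: " + describe(black));
        System.out.println("blue:  " + describe(blue));
    }
}
```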
Set the color of the points and the line:

```java
        plot.getPointRenderer(data).setColor(color); 
        plot.getLineRenderer(data).setColor(color); 

```

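Close the constructor:

        }

To run the program, write the following main() method:

        public static void main(String[] args) { 
            SineGraph frame = null; 
            try { 
                frame = new SineGraph(); 
            } catch (IOException e) { 
                // Report the failure and stop; frame would still be null here 
                e.printStackTrace(); 
                return; 
            } 
            frame.setVisible(true); 
        }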

The complete code for this recipe is as follows:

import java.awt.Color; 
import java.io.FileNotFoundException; 
import java.io.IOException; 
import javax.swing.JFrame; 
import de.erichseifert.gral.data.DataTable; 
import de.erichseifert.gral.plots.XYPlot; 
import de.erichseifert.gral.plots.lines.DefaultLineRenderer2D; 
import de.erichseifert.gral.plots.lines.LineRenderer; 
import de.erichseifert.gral.ui.InteractivePanel; 

public class SineGraph extends JFrame { 
   private static final long serialVersionUID = 1L; 

   public SineGraph() throws FileNotFoundException, IOException { 
      setDefaultCloseOperation(EXIT_ON_CLOSE); 
      setSize(1600, 1400); 

      DataTable data = new DataTable(Double.class, Double.class); 
      for (double x = -5.0; x <= 5.0; x+=0.25) { 
            double y = 5.0*Math.sin(x); 
            data.add(x, y); 
        } 

      XYPlot plot = new XYPlot(data); 
      getContentPane().add(new InteractivePanel(plot)); 
      LineRenderer lines = new DefaultLineRenderer2D(); 
      plot.setLineRenderer(data, lines); 
      Color color = new Color(0.0f, 0.3f, 1.0f); 
      plot.getPointRenderer(data).setColor(color); 
      plot.getLineRenderer(data).setColor(color); 
   } 

   public static void main(String[] args) { 
      SineGraph frame = null; 
      try { 
         frame = new SineGraph(); 
      } catch (IOException e) { 
         // Report the failure and stop; frame would still be null here 
         e.printStackTrace(); 
         return; 
      } 
      frame.setVisible(true); 
   } 
}

The output of the program is a neatly laid out sine graph.

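Drawing a histogram

A histogram is a very popular way of finding the frequency distribution of a set of continuous data. In a histogram, data scientists usually show a quantitative variable along the x axis and the frequency of that variable along the y axis. Some key properties that make histograms useful are as follows:

Only numeric data can be plotted
Large datasets can be plotted easily
The x axis usually holds the bins, or intervals, of the quantitative variable

In this recipe, we will see how to draw a histogram using GRAL.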

Getting ready

To use GRAL to plot histograms, we need the example applications provided with the library in the form of JAR files. These example applications can be downloaded from http://trac.erichseifert.de/gral/wiki/Download. Download the gral-examples-0.10.zip file from the download location to your local disk and extract the files.
Once you have extracted the ZIP file, you will see a directory structure in which the folder of interest is the lib folder.
Inside lib, you will find three JAR files: gral-core-0.10, gral-examples-0.10, and VectorGraphics2D-0.9.1. The first one was used in the first recipe of this chapter. In this recipe, you will use the second one as well.
Include these two JAR files in your project as external libraries.
Now you are ready to draw a histogram using a program included in the GRAL examples package. The recipe in the next section can be found in gral-examples-0.10\gral-examples-0.10\src\main\java\de\erichseifert\gral\examples\barplot in the examples package you downloaded.

How to do it...

Create a class named HistogramPlot that extends the ExamplePanel class, and give it a serial version UID:

        public class HistogramPlot extends ExamplePanel {
        private static final long serialVersionUID = 
          4458280577519421950L;

In this example, you will create a histogram of 1000 sample data points:

        private static final int SAMPLE_COUNT = 1000;

Create the constructor for the class:
```
        public HistogramPlot() {
```

Create the 1000 sample data points randomly. The data points come from a Gaussian distribution because you use the nextGaussian() method of Java's Random class:

        Random random = new Random();
        DataTable data = new DataTable(Double.class);
         for (int i = 0; i < SAMPLE_COUNT; i++) {
            data.add(random.nextGaussian());
         }

Create a histogram from the data, and then add a second dimension (the x axis) for plotting:

        Histogram1D histogram = new Histogram1D(data,             
         Orientation.VERTICAL,new Number[] {-4.0, -3.2, -2.4, -1.6, 
           -0.8, 0.0, 0.8, 1.6, 2.4, 3.2, 4.0});
        DataSource histogram2d = new EnumeratedData(histogram, (-4.0 + 
          -3.2)/2.0, 0.8);

The values in the array are the boundaries of the intervals, or bins, on the x axis of your histogram.
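To make the role of the bin boundaries concrete, here is a minimal plain-Java sketch that counts how many values fall into each interval. It is not GRAL's implementation: the class HistogramBins is hypothetical, and treating each bin as half-open (lower bound included, upper bound excluded) is an assumption made for the illustration.

```java
import java.util.Random;

// Illustration only: a hypothetical helper, not GRAL's Histogram1D.
final class HistogramBins {

    // counts[i] is the number of values v with edges[i] <= v < edges[i + 1].
    // Values outside the outermost edges are ignored.
    static int[] count(double[] values, double[] edges) {
        int[] counts = new int[edges.length - 1];
        for (double v : values) {
            for (int i = 0; i < counts.length; i++) {
                if (v >= edges[i] && v < edges[i + 1]) {
                    counts[i]++;
                    break;
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        double[] edges = {-4.0, -3.2, -2.4, -1.6, -0.8, 0.0, 0.8, 1.6, 2.4, 3.2, 4.0};
        Random random = new Random(123); // fixed seed, chosen arbitrarily
        double[] samples = new double[1000];
        for (int i = 0; i < samples.length; i++) {
            samples[i] = random.nextGaussian();
        }
        int[] counts = count(samples, edges);
        for (int i = 0; i < counts.length; i++) {
            // The centre of the first bin is (-4.0 + -3.2) / 2, and centres are 0.8 apart.
            double centre = (edges[i] + edges[i + 1]) / 2.0;
            System.out.println(centre + "\t" + counts[i]);
        }
    }
}
```

The bin centres printed here correspond to the start value (-4.0 + -3.2)/2.0 and the step 0.8 that the recipe passes to EnumeratedData.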
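Because a histogram is drawn as a bar plot, create a BarPlot and pass it your histogram data:
```
         BarPlot plot = new BarPlot(histogram2d);
```
Now format the plot area.

Set the insets of the histogram within the frame: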

         plot.setInsets(new Insets2D.Double(20.0, 65.0, 50.0, 40.0));

Set the title of the histogram:

```java
         plot.getTitle().setText(
         String.format("Distribution of %d random samples", 
           data.getRowCount()));
```

Set the width of the histogram bars:

```java
         plot.setBarWidth(0.78);
```

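Format the x axis. If you are familiar with Microsoft Excel, you will know that an axis has options for tick alignment and spacing, and that you can choose whether minor ticks are shown. GRAL offers the same controls, which help make your plots presentable to a scientific audience.
Configure the tick alignment of the x axis. Note the argument of the getAxisRenderer() method, which selects the x axis: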

```java
        plot.getAxisRenderer(BarPlot.AXIS_X).setTickAlignment(0.0);
```

Configure the tick spacing:

```java
        plot.getAxisRenderer(BarPlot.AXIS_X).setTickSpacing(0.8);
```

Finally, configure the minor ticks so that they are not visible:

```java
      plot.getAxisRenderer(BarPlot.AXIS_X).setMinorTicksVisible(false);
```

Format the y axis. Here, you define the range of heights over which the bars can extend:

```java
         plot.getAxis(BarPlot.AXIS_Y).setRange(0.0,
      MathUtils.ceil(histogram.getStatistics().get(Statistics.MAX)*1.1,    
         25.0));
```
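The upper limit of the y axis is the tallest bar plus 10 percent, rounded up with MathUtils.ceil and a second argument of 25.0. Assuming that this call rounds its first argument up to the next multiple of the second (an assumption about GRAL that you should check against its documentation), the arithmetic can be sketched in plain Java as follows. The class AxisRangeDemo and the method ceilToMultiple are hypothetical names.

```java
// Illustration only: a hypothetical helper that assumes "round up to the
// next multiple of precision" semantics; it is not GRAL's MathUtils.
final class AxisRangeDemo {

    // Rounds value up to the nearest multiple of precision.
    static double ceilToMultiple(double value, double precision) {
        return Math.ceil(value / precision) * precision;
    }

    public static void main(String[] args) {
        double tallestBar = 300.0;         // example frequency of the fullest bin
        double padded = tallestBar * 1.1;  // leave 10 percent of headroom
        double upper = ceilToMultiple(padded, 25.0);
        System.out.println("y axis range: 0.0 to " + upper);
    }
}
```

Rounding to a multiple of 25 keeps the top of the axis on a tidy value instead of wherever the data happens to end.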

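As you did for the x axis, set the tick alignment and the visibility of minor ticks for the y axis, and also set its intersection point:

```java
      plot.getAxisRenderer(BarPlot.AXIS_Y).setTickAlignment(0.0);
      plot.getAxisRenderer(BarPlot.AXIS_Y).setMinorTicksVisible(false);
      plot.getAxisRenderer(BarPlot.AXIS_Y).setIntersection(-4.4);
```

Next, format the bars. Set the color of the bars, and configure the histogram to show the frequency value on top of each bar: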

```java
         plot.getPointRenderer(histogram2d).setColor(
         GraphicsUtils.deriveWithAlpha(COLOR1, 128));
         plot.getPointRenderer(histogram2d).setValueVisible(true);
```

Finally, add the plot to a Swing component:

```java
         InteractivePanel panel = new InteractivePanel(plot);
         panel.setPannable(false);
         panel.setZoomable(false);
         add(panel);
```

Close the constructor:

```java
         }
```

You also need to implement the methods that the ExamplePanel class requires. To keep things simple, override the getTitle() and getDescription() methods as follows:

```java
         @Override
          public String getTitle() {
             return "Histogram plot";
          }
         @Override
            public String getDescription() {
             return String.format("Histogram of %d samples",    
               SAMPLE_COUNT);
           }
```

The main method of the class is as follows:

```java
         public static void main(String[] args) {
         new HistogramPlot().showInFrame();
         }
```

Finally, close the class:

```java
         }
```

The complete code for the recipe is as follows:

import java.util.Random; 
import de.erichseifert.gral.data.DataSource; 
import de.erichseifert.gral.data.DataTable; 
import de.erichseifert.gral.data.EnumeratedData; 
import de.erichseifert.gral.data.statistics.Histogram1D; 
import de.erichseifert.gral.data.statistics.Statistics; 
import de.erichseifert.gral.examples.ExamplePanel; 
import de.erichseifert.gral.plots.BarPlot; 
import de.erichseifert.gral.ui.InteractivePanel; 
import de.erichseifert.gral.util.GraphicsUtils; 
import de.erichseifert.gral.util.Insets2D; 
import de.erichseifert.gral.util.MathUtils; 
import de.erichseifert.gral.util.Orientation; 

public class HistogramPlot extends ExamplePanel { 
   /** Version id for serialization. */ 
   private static final long serialVersionUID = 4458280577519421950L; 

   private static final int SAMPLE_COUNT = 1000; 

   //@SuppressWarnings("unchecked") 
   public HistogramPlot() { 
      // Create example data 
      Random random = new Random(); 
      DataTable data = new DataTable(Double.class); 
      for (int i = 0; i < SAMPLE_COUNT; i++) { 
         data.add(random.nextGaussian()); 
      } 

      // Create histogram from data 
      Histogram1D histogram = new Histogram1D(data, 
       Orientation.VERTICAL, new Number[] {-4.0, -3.2, -2.4, -1.6, 
          -0.8, 0.0, 0.8, 1.6, 2.4, 3.2, 4.0}); 
      // Create a second dimension (x axis) for plotting 
      DataSource histogram2d = new EnumeratedData(histogram, (-4.0 + 
          -3.2)/2.0, 0.8); 

      // Create new bar plot 
      BarPlot plot = new BarPlot(histogram2d); 

      // Format plot 
      plot.setInsets(new Insets2D.Double(20.0, 65.0, 50.0, 40.0)); 
      plot.getTitle().setText( 
            String.format("Distribution of %d random samples", 
               data.getRowCount())); 
      plot.setBarWidth(0.78); 

      // Format x axis 
      plot.getAxisRenderer(BarPlot.AXIS_X).setTickAlignment(0.0); 
      plot.getAxisRenderer(BarPlot.AXIS_X).setTickSpacing(0.8); 
      plot.getAxisRenderer(BarPlot.AXIS_X).setMinorTicksVisible(false); 
      // Format y axis 
      plot.getAxis(BarPlot.AXIS_Y).setRange(0.0, 
         MathUtils.ceil(histogram.getStatistics().
             get(Statistics.MAX)*1.1, 25.0)); 
      plot.getAxisRenderer(BarPlot.AXIS_Y).setTickAlignment(0.0); 
      plot.getAxisRenderer(BarPlot.AXIS_Y).setMinorTicksVisible(false); 
      plot.getAxisRenderer(BarPlot.AXIS_Y).setIntersection(-4.4); 

      // Format bars 
      plot.getPointRenderer(histogram2d).setColor( 
         GraphicsUtils.deriveWithAlpha(COLOR1, 128)); 
      plot.getPointRenderer(histogram2d).setValueVisible(true); 

      // Add plot to Swing component 
      InteractivePanel panel = new InteractivePanel(plot); 
      panel.setPannable(false); 
      panel.setZoomable(false); 
      add(panel); 
   } 

   @Override 
   public String getTitle() { 
      return "Histogram plot"; 
   } 

   @Override 
   public String getDescription() { 
      return String.format("Histogram of %d samples", SAMPLE_COUNT); 
   } 

   public static void main(String[] args) { 
      new HistogramPlot().showInFrame(); 
   } 
}

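Drawing a bar plot

Bar plots are among the chart types that data scientists use most often. Drawing a bar plot with GRAL is simple, and this recipe shows how to do it.

Getting ready

To use GRAL to draw bar plots, we need the example applications provided with the library in the form of JAR files. These example applications can be downloaded from http://trac.erichseifert.de/gral/wiki/Download. Download the gral-examples-0.10.zip file from the download location to your local disk and extract the files.
Once you have extracted the ZIP file, you will see a directory structure like the one described in the Getting ready section of the Drawing a 2D sine graph recipe. The folder of interest is the lib folder.
In the lib folder, you will find three JAR files: gral-core-0.10, gral-examples-0.10, and VectorGraphics2D-0.9.1. In this recipe, you will use the first two.
Include these two JAR files in your project as external libraries.

Now you are ready to draw a bar plot using a program included in the GRAL examples package. The recipe in the next section can be found in gral-examples-0.10\gral-examples-0.10\src\main\java\de\erichseifert\gral\examples\barplot in the examples package you downloaded.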

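How to do it...

Create a class named SimpleBarPlot. As in the previous recipe, this class extends the ExamplePanel class from the GRAL examples:
```
        public class SimpleBarPlot extends ExamplePanel {
```

Create a serial version UID:

        private static final long serialVersionUID = -2793954497895054530L;

Start writing the constructor:
```
        public SimpleBarPlot() {
```

First, create the example data. In the bar plot that this recipe produces, each bar has three values: a value on the x axis, a value on the y axis, and the name of the bar. For example, the first bar has an x-axis value of 0.1, a y-axis value of 1, and the name January. Create the data points for all the bars as follows: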

        DataTable data = new DataTable(Double.class, Integer.class, 
          String.class);
        data.add(0.1, 1, "January");
        data.add(0.2, 3, "February");
        data.add(0.3, -2, "March");
        data.add(0.4, 6, "April");
        data.add(0.5, -4, "May");
        data.add(0.6, 8, "June");
        data.add(0.7, 9, "July");
        data.add(0.8, 11, "August");

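The DataTable constructor here takes three column types: Double for the x-axis value, Integer for the y-axis value, and String for the name of the bar.
The rest of the code formats your bar plot.

Create a new bar plot:

        BarPlot plot = new BarPlot(data);

Set the insets of the plot and the width of the bars: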

        plot.setInsets(new Insets2D.Double(40.0, 40.0, 40.0, 40.0));
        plot.setBarWidth(0.075);

Now format the bars. To do so, first obtain a BarRenderer for your data:
```
        BarRenderer pointRenderer = (BarRenderer) 
          plot.getPointRenderer(data);
```
Next, set the color of the bars:

```java
         pointRenderer.setColor(
          new LinearGradientPaint(0f,0f, 0f,1f,
          new float[] { 0.0f, 1.0f },
          new Color[] { COLOR1, GraphicsUtils.deriveBrighter(COLOR1) }
          )
         );
```

Next, set the properties of the values displayed on the bars.

To display a value on each bar, use the following code:

```java
          pointRenderer.setValueVisible(true);
```

Set the third column of the data (the month name, at index 2) as the value column:

```java
        pointRenderer.setValueColumn(2);
```

Set the location of the value to the center:

```java
        pointRenderer.setValueLocation(Location.CENTER);
```

Set the color of the value:

```java
       pointRenderer.setValueColor(GraphicsUtils.deriveDarker(COLOR1));
```

Make the font of the value bold:

```java
       pointRenderer.setValueFont(Font.decode
          (null).deriveFont(Font.BOLD));
```

Add the bar plot to a Swing component:

```java
       add(new InteractivePanel(plot));
```

Close the constructor:

```java
       }
```

You need to implement two more methods from the ExamplePanel class of the GRAL examples:

```java
        @Override
       public String getTitle() {
        return "Bar plot";
       }
        @Override
        public String getDescription() {
          return "Bar plot with example data and color gradients";
        }
```

The main method for running the code written so far is as follows:

```java
        public static void main(String[] args) {
        new SimpleBarPlot().showInFrame();
        }
```

Close the class:

        }

The complete source code for this recipe is as follows:

import java.awt.Color; 
import java.awt.Font; 
import java.awt.LinearGradientPaint; 
import de.erichseifert.gral.data.DataTable; 
import de.erichseifert.gral.examples.ExamplePanel; 
import de.erichseifert.gral.plots.BarPlot; 
import de.erichseifert.gral.plots.BarPlot.BarRenderer; 
import de.erichseifert.gral.ui.InteractivePanel; 
import de.erichseifert.gral.util.GraphicsUtils; 
import de.erichseifert.gral.util.Insets2D; 
import de.erichseifert.gral.util.Location; 

public class SimpleBarPlot extends ExamplePanel { 
   /** Version id for serialization. */ 
   private static final long serialVersionUID = -2793954497895054530L; 

   @SuppressWarnings("unchecked") 
   public SimpleBarPlot() { 
      // Create example data 
      DataTable data = new DataTable(Double.class, Integer.class, 
        String.class); 
      data.add(0.1,  1, "January"); 
      data.add(0.2,  3, "February"); 
      data.add(0.3, -2, "March"); 
      data.add(0.4,  6, "April"); 
      data.add(0.5, -4, "May"); 
      data.add(0.6,  8, "June"); 
      data.add(0.7,  9, "July"); 
      data.add(0.8, 11, "August"); 

      // Create new bar plot 
      BarPlot plot = new BarPlot(data); 

      // Format plot 
      plot.setInsets(new Insets2D.Double(40.0, 40.0, 40.0, 40.0)); 
      plot.setBarWidth(0.075); 

      // Format bars 
      BarRenderer pointRenderer = (BarRenderer)    
        plot.getPointRenderer(data); 
      pointRenderer.setColor( 
         new LinearGradientPaint(0f,0f, 0f,1f, 
               new float[] { 0.0f, 1.0f }, 
               new Color[] { COLOR1,  
                 GraphicsUtils.deriveBrighter(COLOR1) } 
         ) 
      ); 
      /*pointRenderer.setBorderStroke(new BasicStroke(3f)); 
      pointRenderer.setBorderColor( 
         new LinearGradientPaint(0f,0f, 0f,1f, 
               new float[] { 0.0f, 1.0f }, 
               new Color[] { GraphicsUtils.deriveBrighter(COLOR1), 
                COLOR1 } 
         ) 
      );*/ 
      pointRenderer.setValueVisible(true); 
      pointRenderer.setValueColumn(2); 
      pointRenderer.setValueLocation(Location.CENTER); 
      pointRenderer.setValueColor(GraphicsUtils.deriveDarker(COLOR1)); 
      pointRenderer.setValueFont(
         Font.decode(null).deriveFont(Font.BOLD)); 

      // Add plot to Swing component 
      add(new InteractivePanel(plot)); 
   } 

   @Override 
   public String getTitle() { 
      return "Bar plot"; 
   } 

   @Override 
   public String getDescription() { 
      return "Bar plot with example data and color gradients"; 
   } 

   public static void main(String[] args) { 
      new SimpleBarPlot().showInFrame(); 
   } 
}

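Drawing a box plot or whisker diagram

Box plots are another effective visualization tool for data scientists. They show important descriptive statistics of a data distribution. A typical box plot contains the following information about the distribution:

Minimum value
First quartile
Median
Third quartile
Maximum value

Other values can be derived from these statistics, such as the interquartile range, which is the difference between the third quartile and the first quartile.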
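The five statistics listed above, together with the interquartile range, can be computed with a few lines of plain Java. The sketch below is only an illustration: the class FiveNumberSummary is hypothetical, it assumes a non-empty input array, and it uses linear interpolation between ranks for the quartiles, which is one of several common conventions and may differ from the one GRAL uses.

```java
import java.util.Arrays;

// Illustration only: a hypothetical helper, not GRAL's statistics code.
final class FiveNumberSummary {

    // Quantile by linear interpolation between the closest ranks.
    // Expects a sorted, non-empty array and 0 <= p <= 1.
    static double quantile(double[] sorted, double p) {
        double position = p * (sorted.length - 1);
        int lower = (int) Math.floor(position);
        int upper = (int) Math.ceil(position);
        double fraction = position - lower;
        return sorted[lower] + fraction * (sorted[upper] - sorted[lower]);
    }

    // Returns {minimum, first quartile, median, third quartile, maximum}.
    static double[] summarize(double[] values) {
        double[] sorted = values.clone();
        Arrays.sort(sorted);
        return new double[] {
            sorted[0],
            quantile(sorted, 0.25),
            quantile(sorted, 0.5),
            quantile(sorted, 0.75),
            sorted[sorted.length - 1]
        };
    }

    public static void main(String[] args) {
        double[] data = {9, 1, 8, 2, 7, 3, 6, 4, 5};
        double[] s = summarize(data);
        System.out.println("min, Q1, median, Q3, max: " + Arrays.toString(s));
        System.out.println("interquartile range: " + (s[3] - s[1]));
    }
}
```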

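In this recipe, you will use GRAL to draw a box plot of a data distribution.

Getting ready

To use GRAL to draw box plots, we need the example applications provided with the library in the form of JAR files. These example applications can be downloaded from http://trac.erichseifert.de/gral/wiki/Download. Download the gral-examples-0.10.zip file from the download location to your local disk and extract the files.
Once you have extracted the ZIP file, you will see a directory structure like the one described in the Getting ready section of the Drawing a 2D sine graph recipe. The folder of interest is the lib folder.
Inside lib, you will find three JAR files: gral-core-0.10, gral-examples-0.10, and VectorGraphics2D-0.9.1. In this recipe, you will use the first two.
Include these two JAR files in your project as external libraries.

Now you are ready to draw a box plot using a program included in the GRAL examples package. The recipe in the next section can be found in gral-examples-0.10\gral-examples-0.10\src\main\java\de\erichseifert\gral\examples\boxplot in the examples package you downloaded. When you run the code in this recipe successfully, you will see a box plot.

How to do it...

First, create a class named SimpleBoxPlot that extends the ExamplePanel class from the GRAL examples, and give it a serial version UID: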

        public class SimpleBoxPlot extends ExamplePanel {
        private static final long serialVersionUID = 
           5228891435595348789L;

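You will generate 50 random samples for the box plots that you are going to create and render. Create the following class variables: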

        private static final int SAMPLE_COUNT = 50;
        private static final Random random = new Random();

Create the constructor for the class:
```
        public SimpleBoxPlot() {
```

Set the size of the window for your box plot:

        setPreferredSize(new Dimension(400, 600));

Create a data table in which each row contains three column values, all of them integers:

        DataTable data = new DataTable(Integer.class, Integer.class, 
          Integer.class);

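Generate 50 data samples, each with three integer values (the column values of the data table). The values here are drawn from a Gaussian distribution, although they do not have to be: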

        for (int i = 0; i < SAMPLE_COUNT; i++) {
        int x = (int) Math.round(5.0*random.nextGaussian());
        int y = (int) Math.round(5.0*random.nextGaussian());
        int z = (int) Math.round(5.0*random.nextGaussian());
        data.add(x, y, z);
        }

Create a new box plot from the data:

        DataSource boxData = BoxPlot.createBoxData(data);
        BoxPlot plot = new BoxPlot(boxData);

Set the insets of the window in which the box plot will be drawn:

        plot.setInsets(new Insets2D.Double(20.0, 50.0, 40.0, 20.0));

Format the values on the x axis:

        plot.getAxisRenderer(BoxPlot.AXIS_X).setCustomTicks(
        DataUtils.map(
        new Double[] {1.0, 2.0, 3.0},
        new String[] {"Column 1", "Column 2", "Column 3"}
        )
        );

The rest of the code renders the box plot. First, obtain the point renderer for the data:

```java
        BoxWhiskerRenderer pointRenderer =
           (BoxWhiskerRenderer) plot.getPointRenderer(boxData);
```

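Next, set the color of the box border, of the whiskers (which run from the third quartile to the maximum and from the first quartile to the minimum), and of the center bar (the median):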

```java
        pointRenderer.setBoxBorderColor(COLOR1);
        pointRenderer.setWhiskerColor(COLOR1);
        pointRenderer.setCenterBarColor(COLOR1);
```

Use vertical navigation for the box plot:

```java
      plot.getNavigator().setDirection(XYNavigationDirection.VERTICAL);
```

Pass the box plot to a Swing component for rendering:

```java
        InteractivePanel panel = new InteractivePanel(plot);
        add(panel);
```

Close the constructor:

```java
        }
```

You need to implement the following two methods of the ExamplePanel class by overriding them:

```java
         @Override
         public String getTitle() {
         return "Box-and-whisker plot";
         }
         @Override
         public String getDescription() {
         return String.format(
            "Three box-and-whisker plots created from %d random samples",
            SAMPLE_COUNT);
        }
```

Then add the main method and close the class:

```java
        public static void main(String[] args) {
        new SimpleBoxPlot().showInFrame();
        }
        }
```

The complete source code is as follows:

import java.awt.Dimension; 
import java.util.Random; 
import de.erichseifert.gral.data.DataSource; 
import de.erichseifert.gral.data.DataTable; 
import de.erichseifert.gral.examples.ExamplePanel; 
import de.erichseifert.gral.plots.BoxPlot; 
import de.erichseifert.gral.plots.BoxPlot.BoxWhiskerRenderer; 
import de.erichseifert.gral.plots.XYPlot.XYNavigationDirection; 
import de.erichseifert.gral.ui.InteractivePanel; 
import de.erichseifert.gral.util.DataUtils; 
import de.erichseifert.gral.util.Insets2D; 

public class SimpleBoxPlot extends ExamplePanel { 
   /** Version id for serialization. */ 
   private static final long serialVersionUID = 5228891435595348789L; 
   private static final int SAMPLE_COUNT = 50; 
   private static final Random random = new Random(); 

   @SuppressWarnings("unchecked") 
   public SimpleBoxPlot() { 
      setPreferredSize(new Dimension(400, 600)); 

      // Create example data 
      DataTable data = new DataTable(Integer.class, Integer.class, 
        Integer.class); 
      for (int i = 0; i < SAMPLE_COUNT; i++) { 
         int x = (int) Math.round(5.0*random.nextGaussian()); 
         int y = (int) Math.round(5.0*random.nextGaussian()); 
         int z = (int) Math.round(5.0*random.nextGaussian()); 
         data.add(x, y, z); 
      } 

      // Create new box-and-whisker plot 
      DataSource boxData = BoxPlot.createBoxData(data); 
      BoxPlot plot = new BoxPlot(boxData); 

      // Format plot 
      plot.setInsets(new Insets2D.Double(20.0, 50.0, 40.0, 20.0)); 

      // Format axes 
      plot.getAxisRenderer(BoxPlot.AXIS_X).setCustomTicks( 
         DataUtils.map( 
               new Double[] {1.0, 2.0, 3.0}, 
               new String[] {"Column 1", "Column 2", "Column 3"} 
         ) 
      ); 

      // Format boxes 
      /*Stroke stroke = new BasicStroke(2f); 
      ScaledContinuousColorMapper colors = 
         new LinearGradient(GraphicsUtils.deriveBrighter(COLOR1), 
           Color.WHITE); 
      colors.setRange(1.0, 3.0);*/ 

      BoxWhiskerRenderer pointRenderer = 
            (BoxWhiskerRenderer) plot.getPointRenderer(boxData); 
      /*pointRenderer.setWhiskerStroke(stroke); 
      pointRenderer.setBoxBorderStroke(stroke); 
      pointRenderer.setBoxBackground(colors);*/ 
      pointRenderer.setBoxBorderColor(COLOR1); 
      pointRenderer.setWhiskerColor(COLOR1); 
      pointRenderer.setCenterBarColor(COLOR1); 

      plot.getNavigator().setDirection(XYNavigationDirection.VERTICAL); 

      // Add plot to Swing component 
      InteractivePanel panel = new InteractivePanel(plot); 
      add(panel); 
   } 

   @Override 
   public String getTitle() { 
      return "Box-and-whisker plot"; 
   } 

   @Override 
   public String getDescription() { 
      return String.format(
        "Three box-and-whisker plots created from %d random samples",
        SAMPLE_COUNT); 
   } 

   public static void main(String[] args) { 
      new SimpleBoxPlot().showInFrame(); 
   } 
}

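Drawing a scatter plot

This recipe demonstrates how to use GRAL to draw a scatter plot of 100,000 random data points. A scatter plot places each data point by its x and y coordinates, which makes it a good way to show the correlation between two variables.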
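
The correlation that a scatter plot shows visually can also be expressed as a number. The following is a minimal sketch in plain Java, independent of GRAL, that computes the Pearson correlation coefficient; the class and method names (CorrelationSketch, pearson) are illustrative and are not part of the recipe.

```java
import java.util.Random;

/** Sketch: quantify the correlation that a scatter plot shows visually. */
final class CorrelationSketch {

    /** Pearson correlation coefficient of two equally long samples. */
    static double pearson(double[] x, double[] y) {
        if (x.length != y.length || x.length < 2) {
            throw new IllegalArgumentException("need two samples of equal length >= 2");
        }
        double meanX = 0.0, meanY = 0.0;
        for (int i = 0; i < x.length; i++) {
            meanX += x[i];
            meanY += y[i];
        }
        meanX /= x.length;
        meanY /= y.length;

        double cov = 0.0, varX = 0.0, varY = 0.0;
        for (int i = 0; i < x.length; i++) {
            double dx = x[i] - meanX;
            double dy = y[i] - meanY;
            cov += dx * dy;
            varX += dx * dx;
            varY += dy * dy;
        }
        // NaN if either sample has zero variance
        return cov / Math.sqrt(varX * varY);
    }

    public static void main(String[] args) {
        Random random = new Random(42L); // fixed seed so runs are repeatable
        int n = 10_000;
        double[] x = new double[n];
        double[] independent = new double[n];
        double[] dependent = new double[n];
        for (int i = 0; i < n; i++) {
            x[i] = random.nextGaussian() * 2.0;
            independent[i] = random.nextGaussian() * 2.0; // unrelated to x
            dependent[i] = 3.0 * x[i] + 1.0;              // exact linear function of x
        }
        System.out.printf("independent: r = %.3f%n", pearson(x, independent));
        System.out.printf("dependent:   r = %.3f%n", pearson(x, dependent));
    }
}
```

Two independent Gaussian samples, like the ones this recipe plots, should give a coefficient close to 0 and a round cloud of points, while a linear relationship gives a coefficient close to 1 or -1.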

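Getting ready

To draw a scatter plot with GRAL, we need the example applications that ship with the library as Jar files. They can be downloaded from http://trac.erichseifert.de/gral/wiki/Download. Download the gral-examples-0.10.zip file from that location to your local disk and extract it.
Once the ZIP file is extracted, you will see the directory structure shown in the Getting ready section of the Plotting a 2D sine graph recipe. The folder we are interested in is the lib folder.
Inside lib you will find three Jar files: gral-core-0.10, gral-examples-0.10, and VectorGraphics2D-0.9.1. This recipe uses the first two.
Add these two Jar files to your project as external libraries.

You are now ready to draw a scatter plot with a program from the GRAL examples package. The recipe in the next section can be found in the downloaded package at gral-examples-0.10\gral-examples-0.10\src\main\java\de\erichseifert\gral\examples\xyplot. When the code in this recipe runs successfully, you will see a scatter plot of 100,000 random data points, as follows: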

How to do it...

First, create a class named ScatterPlot that extends the ExamplePanel class of the GRAL library. Add a serial version UID to the class:

        public class ScatterPlot extends ExamplePanel {
        private static final long serialVersionUID = 
          -412699430625953887L;

The recipe uses 100,000 random data points. Create class variables for the number of data points and for the source of randomness:

        private static final int SAMPLE_COUNT = 100000;
        private static final Random random = new Random();

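Start writing the constructor:
```
        public ScatterPlot() {
```

Create a data table to hold the random x and y values that you will draw in the scatter plot. Both values are of type Double and are drawn from a Gaussian distribution: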

         DataTable data = new DataTable(Double.class, Double.class);
         // i < SAMPLE_COUNT adds exactly 100,000 rows; i <= would add one extra
         for (int i = 0; i < SAMPLE_COUNT; i++) {
            data.add(random.nextGaussian()*2.0, 
              random.nextGaussian()*2.0);
         }
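
Multiplying each call to nextGaussian() by 2.0 stretches the standard normal distribution to a standard deviation of 2. The sketch below checks this numerically in plain Java, without GRAL; GaussianScaleSketch and its methods are illustrative names, not part of the recipe.

```java
import java.util.Random;

/** Sketch: scaling nextGaussian() by a factor scales the standard deviation. */
final class GaussianScaleSketch {

    /** Draws n values from a standard normal distribution and multiplies each by scale. */
    static double[] sample(Random random, int n, double scale) {
        double[] values = new double[n];
        for (int i = 0; i < n; i++) {
            values[i] = random.nextGaussian() * scale;
        }
        return values;
    }

    /** Arithmetic mean of the values. */
    static double mean(double[] values) {
        double sum = 0.0;
        for (double v : values) {
            sum += v;
        }
        return sum / values.length;
    }

    /** Population standard deviation of the values. */
    static double standardDeviation(double[] values) {
        double m = mean(values);
        double squares = 0.0;
        for (double v : values) {
            squares += (v - m) * (v - m);
        }
        return Math.sqrt(squares / values.length);
    }

    public static void main(String[] args) {
        double[] xs = sample(new Random(7L), 100_000, 2.0);
        System.out.printf("count = %d%n", xs.length);
        System.out.printf("mean  = %.3f%n", mean(xs));
        System.out.printf("sd    = %.3f%n", standardDeviation(xs));
    }
}
```

With 100,000 samples, the mean should come out close to 0 and the standard deviation close to 2, which is why most points in the plot fall within a few units of the origin.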

A scatter plot can be treated as an XYPlot, so create one:
```
        XYPlot plot = new XYPlot(data);
```

Set the insets of the plot and use the description as its title:

         plot.setInsets(new Insets2D.Double(20.0, 40.0, 40.0, 40.0));
         plot.getTitle().setText(getDescription());

Format the data points by giving them a color:

        plot.getPointRenderer(data).setColor(COLOR1);

Finally, add the plot to a Java Swing component and close the constructor:

        add(new InteractivePanel(plot), BorderLayout.CENTER);
        }

Because the class extends ExamplePanel, you also need to implement the following two methods:

        @Override
        public String getTitle() {
        return "Scatter plot";
        }
        @Override
        public String getDescription() {
        return String.format("Scatter plot with %d data points", 
        SAMPLE_COUNT);
        }
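
The getDescription method relies on the %d placeholder of String.format to insert the sample count. As a small illustration in plain Java, the sketch below builds the same string and a variant that uses the , flag for digit grouping; the class and method names are illustrative, and the grouped variant is an addition that the recipe does not use.

```java
import java.util.Locale;

/** Sketch: how String.format fills the %d placeholder used in getDescription(). */
final class DescriptionFormatSketch {

    /** Same format string as the recipe; uses the default locale. */
    static String describe(int sampleCount) {
        return String.format("Scatter plot with %d data points", sampleCount);
    }

    /** Variant with digit grouping; the locale is fixed so the separator is predictable. */
    static String describeGrouped(int sampleCount) {
        return String.format(Locale.US, "Scatter plot with %,d data points", sampleCount);
    }

    public static void main(String[] args) {
        System.out.println(describe(100_000));
        System.out.println(describeGrouped(100_000)); // Scatter plot with 100,000 data points
    }
}
```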

Finally, add the main method to run the code and close the class:

         public static void main(String[] args) {
         new ScatterPlot().showInFrame();
         }
         }

The source code of this recipe is as follows:

import java.awt.BorderLayout; 
import java.util.Random; 
import de.erichseifert.gral.data.DataTable; 
import de.erichseifert.gral.examples.ExamplePanel; 
import de.erichseifert.gral.plots.XYPlot; 
import de.erichseifert.gral.ui.InteractivePanel; 
import de.erichseifert.gral.util.Insets2D; 

public class ScatterPlot extends ExamplePanel { 
   /** Version id for serialization. */ 
   private static final long serialVersionUID = -412699430625953887L; 

   private static final int SAMPLE_COUNT = 100000; 
   /** Instance to generate random data values. */ 
   private static final Random random = new Random(); 

   @SuppressWarnings("unchecked") 
   public ScatterPlot() { 
      // Generate 100,000 data points 
      DataTable data = new DataTable(Double.class, Double.class); 
      for (int i = 0; i < SAMPLE_COUNT; i++) { 
         data.add(random.nextGaussian()*2.0,  
           random.nextGaussian()*2.0); 
      } 

      // Create a new xy-plot 
      XYPlot plot = new XYPlot(data); 

      // Format plot 
      plot.setInsets(new Insets2D.Double(20.0, 40.0, 40.0, 40.0)); 
      plot.getTitle().setText(getDescription()); 

      // Format points 
      plot.getPointRenderer(data).setColor(COLOR1); 

      // Add plot to Swing component 
      add(new InteractivePanel(plot), BorderLayout.CENTER); 
   } 

   @Override 
   public String getTitle() { 
      return "Scatter plot"; 
   } 

   @Override 
   public String getDescription() { 
      return String.format("Scatter plot with %d data points", 
        SAMPLE_COUNT); 
   } 

   public static void main(String[] args) { 
      new ScatterPlot().showInFrame(); 
   } 
}

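Drawing a donut plot

A donut plot is a variant of the pie chart, a popular data visualization technique that shows proportions in your data at a glance. In this recipe, we will see how to use the GRAL Java library to draw a donut plot of 10 random values.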
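
To make the idea of proportions concrete, the sketch below computes the share of the whole, and the corresponding angle, that each value would occupy in a pie or donut. It is plain Java and only illustrates the arithmetic; it does not describe how GRAL lays out slices internally, and SliceShareSketch and shares are hypothetical names.

```java
/** Sketch: the share of the whole that each value takes in a pie or donut. */
final class SliceShareSketch {

    /** Returns value / total for every value; all values must be positive. */
    static double[] shares(int[] values) {
        long total = 0;
        for (int v : values) {
            if (v <= 0) {
                throw new IllegalArgumentException("values must be positive: " + v);
            }
            total += v;
        }
        double[] result = new double[values.length];
        for (int i = 0; i < values.length; i++) {
            result[i] = (double) values[i] / total;
        }
        return result;
    }

    public static void main(String[] args) {
        int[] values = {2, 3, 5};
        double[] shares = shares(values);
        for (int i = 0; i < values.length; i++) {
            // a full circle has 360 degrees, so each slice spans share * 360
            System.out.printf("value %d -> %.0f%% of the circle, %.0f degrees%n",
                    values[i], shares[i] * 100.0, shares[i] * 360.0);
        }
    }
}
```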

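Getting ready

To draw a donut plot with GRAL, we need the example applications that ship with the library as Jar files. They can be downloaded from http://trac.erichseifert.de/gral/wiki/Download. Download the gral-examples-0.10.zip file from that location to your local disk and extract it.
You will see the directory structure shown in the Getting ready section of the Plotting a 2D sine graph recipe. The folder we are interested in is the lib folder.
Inside lib you will find three Jar files: gral-core-0.10, gral-examples-0.10, and VectorGraphics2D-0.9.1. This recipe uses the first two.
Add these two Jar files to your project as external libraries.

You are now ready to draw a donut plot with a program from the GRAL examples package. The recipe in the next section can be found in the downloaded package at gral-examples-0.10\gral-examples-0.10\src\main\java\de\erichseifert\gral\examples\pieplot. When the code in this recipe runs successfully, you will see a donut plot of 10 random data values, as follows: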

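How to do it...

Create a class named SimplePiePlot that extends the ExamplePanel class of the GRAL library. Provide a serial version UID:
```
        public class SimplePiePlot extends ExamplePanel {
        private static final long serialVersionUID = 
          -3039317265508932299L;
```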

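Next, declare two class variables that are used to generate the 10 random data points:

        private static final int SAMPLE_COUNT = 10;
        private static Random random = new Random();

Start writing the constructor:
```
        public SimplePiePlot() {
```
Create a data table and put 10 random numbers into it. For each value, call random.nextInt(8), which returns an integer from 0 to 7 (8 is the exclusive upper bound, not a seed), and add 2, so that the value lies between 2 and 9. Then draw a second random number with random.nextDouble(): if it is less than or equal to 0.15, add the negated value to the data table; otherwise, add the value as it is: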
```
        DataTable data = new DataTable(Integer.class);
        for (int i = 0; i < SAMPLE_COUNT; i++) {
        int val = random.nextInt(8) + 2;
        data.add((random.nextDouble() <= 0.15) ? -val : val);
        }
```
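
The same generation logic can be tried outside GRAL to see which values it produces. The following is a sketch in plain Java; DonutDataSketch and generate are illustrative names, and the fixed seed is an assumption made only so that repeated runs print the same values.

```java
import java.util.Arrays;
import java.util.Random;

/** Sketch: the value generation of the donut recipe, without GRAL. */
final class DonutDataSketch {

    /** Generates count values of magnitude 2..9, each negated with probability of about 0.15. */
    static int[] generate(Random random, int count) {
        int[] values = new int[count];
        for (int i = 0; i < count; i++) {
            int val = random.nextInt(8) + 2; // nextInt(8) is 0..7, so val is 2..9
            values[i] = (random.nextDouble() <= 0.15) ? -val : val;
        }
        return values;
    }

    public static void main(String[] args) {
        int[] values = generate(new Random(1L), 10);
        System.out.println(Arrays.toString(values));
    }
}
```

Every generated value has a magnitude between 2 and 9, and roughly one value in seven is negative.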

Create a PiePlot from your data:

        PiePlot plot = new PiePlot(data);

Use the description as the title of your donut plot:

        plot.getTitle().setText(getDescription());

Now set the relative size of the donut:
```
        plot.setRadius(0.9);
```
If you want the legend to be visible on the plot, set the legend visibility to true; otherwise, set it to false:
```
        plot.setLegendVisible(true);
```

Set the insets of the plot:

        plot.setInsets(new Insets2D.Double(20.0, 40.0, 40.0, 40.0));

Get the point renderer of your donut plot and cast it to PieSliceRenderer:

```java
        PieSliceRenderer pointRenderer =
          (PieSliceRenderer) plot.getPointRenderer(data);
```

Set the relative size of the inner region:

```java
        pointRenderer.setInnerRadius(0.4);
```

Set a reasonable gap between the slices:

```java
        pointRenderer.setGap(0.2);
```

Change the colors of the slices:

```java
        LinearGradient colors = new LinearGradient(COLOR1, COLOR2);
        pointRenderer.setColor(colors);
```

Format the labels and choose how they are displayed. In this example, the values are shown in bold white type:

```java
        pointRenderer.setValueVisible(true);
        pointRenderer.setValueColor(Color.WHITE);
        pointRenderer.setValueFont(Font.decode(null)
        .deriveFont(Font.BOLD));
```

Finally, add the plot to the Swing component:

```java
        add(new InteractivePanel(plot), BorderLayout.CENTER);
```

Close the constructor:

```java
        }
```

Because your class extends ExamplePanel, you need to implement two more methods in your code:

```java
        @Override
        public String getTitle() {
        return "Donut plot";
        }
        @Override
        public String getDescription() {
        return String.format("Donut plot of %d random data values", 
          SAMPLE_COUNT);
        }
```

Add a main method to run the code:

```java
        public static void main(String[] args) {
        new SimplePiePlot().showInFrame();
        }
```

Close your class:

```java
        }
```

The code for this recipe is as follows:

import java.awt.BorderLayout; 
import java.awt.Color; 
import java.awt.Font; 
import java.util.Random; 
import de.erichseifert.gral.data.DataTable; 
import de.erichseifert.gral.examples.ExamplePanel; 
import de.erichseifert.gral.plots.PiePlot; 
import de.erichseifert.gral.plots.PiePlot.PieSliceRenderer; 
import de.erichseifert.gral.plots.colors.LinearGradient; 
import de.erichseifert.gral.ui.InteractivePanel; 
import de.erichseifert.gral.util.Insets2D; 

public class SimplePiePlot extends ExamplePanel { 
   /** Version id for serialization. */ 
   private static final long serialVersionUID = -3039317265508932299L; 

   private static final int SAMPLE_COUNT = 10; 
   /** Instance to generate random data values. */ 
   private static Random random = new Random(); 

   @SuppressWarnings("unchecked") 
   public SimplePiePlot() { 
      // Create data 
      DataTable data = new DataTable(Integer.class); 
      for (int i = 0; i < SAMPLE_COUNT; i++) { 
         int val = random.nextInt(8) + 2; 
         data.add((random.nextDouble() <= 0.15) ? -val : val); 
      } 

      // Create new pie plot 
      PiePlot plot = new PiePlot(data); 

      // Format plot 
      plot.getTitle().setText(getDescription()); 
      // Change relative size of pie 
      plot.setRadius(0.9); 
      // Display a legend 
      plot.setLegendVisible(true); 
      // Add some margin to the plot area 
      plot.setInsets(new Insets2D.Double(20.0, 40.0, 40.0, 40.0)); 

      PieSliceRenderer pointRenderer = 
            (PieSliceRenderer) plot.getPointRenderer(data); 
      // Change relative size of inner region 
      pointRenderer.setInnerRadius(0.4); 
      // Change the width of gaps between segments 
      pointRenderer.setGap(0.2); 
      // Change the colors 
      LinearGradient colors = new LinearGradient(COLOR1, COLOR2); 
      pointRenderer.setColor(colors); 
      // Show labels 
      pointRenderer.setValueVisible(true); 
      pointRenderer.setValueColor(Color.WHITE); 
      pointRenderer.setValueFont(Font.decode(null).deriveFont(Font.BOLD)); 

      // Add plot to Swing component 
      add(new InteractivePanel(plot), BorderLayout.CENTER); 
   } 

   @Override 
   public String getTitle() { 
      return "Donut plot"; 
   } 

   @Override 
   public String getDescription() { 
      return String.format("Donut plot of %d random data values", 
        SAMPLE_COUNT); 
   } 

   public static void main(String[] args) { 
      new SimplePiePlot().showInFrame(); 
   } 
}

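Drawing an area plot

Area plots are a useful tool for showing how quantitative values develop over a given interval. For data scientists, they are an effective way to understand trends. They are based on line plots, but the area below the line drawn through the values is filled with a color or texture. In this recipe, you will use the GRAL Java library to draw an area plot.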

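Getting ready

To draw an area plot with GRAL, we need the example applications that ship with the library as Jar files. They can be downloaded from http://trac.erichseifert.de/gral/wiki/Download. Download the gral-examples-0.10.zip file from that location to your local disk and extract it.
You will then see the directory structure shown in the Getting ready section of the Plotting a 2D sine graph recipe. The folder we are interested in is the lib folder.
Inside lib you will find three Jar files: gral-core-0.10, gral-examples-0.10, and VectorGraphics2D-0.9.1. This recipe uses the first two.
Add these two Jar files to your project as external libraries.

You are now ready to draw an area plot with a program from the GRAL examples package. When the code in this recipe runs successfully, you will see an area plot similar to the following: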

How to do it...

First, create a class named AreaPlot that extends GRAL's ExamplePanel. Provide a serial version UID for your class:

        public class AreaPlot extends ExamplePanel {
        private static final long serialVersionUID = 
          3287044991898775949L;

You will draw the area plot from random values, so create a class variable as the source of randomness:
```
        private static final Random random = new Random();
```
Next, start writing the constructor of the class:
```
       public AreaPlot() {
```
Create a data table with four columns: one x value and three y values. All values in this example are of type Double:
```
       DataTable data = new DataTable(Double.class, Double.class, 
          Double.class, Double.class);
```
Create a for loop that runs 50 times, starting with x at 0.0 and incrementing it by 1 on each pass:
```
        for (double x = 0.0; x < 50; x ++) {
```

Declare three y variables to hold the y values, and draw each of them at random from a Gaussian distribution:

        double y1 = random.nextGaussian();
        double y2 = random.nextGaussian();
        double y3 = random.nextGaussian();

Finally, add the (x, y1, y2, y3) values to your data table and close the for loop:

        data.add(x, y1, y2, y3);
        }

To get a nicer-looking graph, you can replace the for loop built in the previous three steps with the following code. The upper bound of the loop is 2.5*Math.PI, which matches the range checks inside the loop body:

         for (double x=0.0; x<2.5*Math.PI; x+=Math.PI/15.0) {
            // NaN marks a missing value outside the range of a series
            double y1 = Double.NaN, y2 = Double.NaN, y3 = Double.NaN;
            if (x>=0.00*Math.PI && x<2.25*Math.PI) {
               y1 = 4.0*Math.sin(x + 0.5*Math.PI) + 
                    0.1*random.nextGaussian();
            }
            if (x>=0.25*Math.PI && x<2.50*Math.PI) {
               y2 = 4.0*Math.cos(x + 0.5*Math.PI) + 
                    0.1*random.nextGaussian();
            }
            if (x>=0.00*Math.PI && x<2.50*Math.PI) {
               y3 = 2.0*Math.sin(2.0*x/2.5)       + 
                    0.1*random.nextGaussian();
            }
            data.add(x, y1, y2, y3);
         }
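
The range checks in this loop leave a y value at Double.NaN wherever its series is not defined. The sketch below reproduces the grid in plain Java, without the random noise and without GRAL, and counts the rows and the NaN entries per column. It assumes an upper bound of 2.5π for x, chosen to match the range checks; SineSeriesSketch and its methods are illustrative names.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch: the x grid of the sine-based loop and where NaN values end up. */
final class SineSeriesSketch {

    /** Builds rows {x, y1, y2}; a y value stays NaN outside its range check. */
    static double[][] rows() {
        List<double[]> rows = new ArrayList<>();
        for (double x = 0.0; x < 2.5 * Math.PI; x += Math.PI / 15.0) {
            double y1 = Double.NaN, y2 = Double.NaN;
            if (x >= 0.00 * Math.PI && x < 2.25 * Math.PI) {
                y1 = 4.0 * Math.sin(x + 0.5 * Math.PI);
            }
            if (x >= 0.25 * Math.PI && x < 2.50 * Math.PI) {
                y2 = 4.0 * Math.cos(x + 0.5 * Math.PI);
            }
            rows.add(new double[] {x, y1, y2});
        }
        return rows.toArray(new double[0][]);
    }

    /** Counts the NaN entries in one column. */
    static int countNaN(double[][] rows, int column) {
        int count = 0;
        for (double[] row : rows) {
            if (Double.isNaN(row[column])) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        double[][] rows = rows();
        System.out.println("rows:      " + rows.length);
        System.out.println("NaN in y1: " + countNaN(rows, 1));
        System.out.println("NaN in y2: " + countNaN(rows, 2));
    }
}
```

Under these assumptions the grid has 38 rows; y1 is missing in the last four rows and y2 in the first four, so the first two series start and end at different x positions.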

Then, use GRAL's DataSeries class to define three data series on top of the data table. The class has a constructor of the following form:
```
        public DataSeries(DataSource data, int... cols)
```
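Here, the first column is column 0, the second is column 1, and so on; the values passed as cols are column numbers in the data source, according to GRAL's Java API documentation at http://www.erichseifert.de/dev/gral/0.9/apidocs/de/erichseifert/gral/data/DataSeries.html. The following code additionally passes a series name as the first argument: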
```
        DataSeries data1 = new DataSeries("series 1", data, 0, 1);
        DataSeries data2 = new DataSeries("series 2", data, 0, 2);
        DataSeries data3 = new DataSeries("series 3", data, 0, 3);
```
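
The column arguments can be read as a view that picks columns out of each row of the table. The following plain-Java sketch mimics that idea with arrays; it is not GRAL's implementation, and ColumnViewSketch and select are hypothetical names.

```java
import java.util.Arrays;

/** Sketch: selecting columns by index, the idea behind the cols arguments. */
final class ColumnViewSketch {

    /** Returns, for every row, only the requested columns in the requested order. */
    static double[][] select(double[][] table, int... cols) {
        double[][] view = new double[table.length][cols.length];
        for (int r = 0; r < table.length; r++) {
            for (int c = 0; c < cols.length; c++) {
                view[r][c] = table[r][cols[c]];
            }
        }
        return view;
    }

    public static void main(String[] args) {
        // each row holds x, y1, y2, y3
        double[][] table = {
            {0.0, 1.0, 2.0, 3.0},
            {1.0, 1.5, 2.5, 3.5}
        };
        // columns 0 and 2 pair x with y2, like the second series in the recipe
        System.out.println(Arrays.deepToString(select(table, 0, 2)));
        // prints [[0.0, 2.0], [1.0, 2.5]]
    }
}
```

Each of the three series in the recipe therefore shares column 0 as its x value and takes a different column as its y value.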
Create an XYPlot from the three data series. You will probably want the legend to be shown on the plot. Also set the insets of the plot:

```java
        XYPlot plot = new XYPlot(data1, data2, data3);
        plot.setLegendVisible(true);
        plot.setInsets(new Insets2D.Double(20.0, 40.0, 20.0, 20.0));
```

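An additional task for an area plot is to fill the area under each series with color. To do this, you call two static methods named formatFilledArea and formatLineArea. The first two series use the former and the third series uses the latter, so you can see the difference between them in the area plot: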

```java
       formatFilledArea(plot, data1, COLOR2);
       formatFilledArea(plot, data2, COLOR1);
       formatLineArea(plot, data3, GraphicsUtils.deriveDarker(COLOR1));
```

Add the plot to the Swing component and close the constructor:

```java
       add(new InteractivePanel(plot));
       }
```

Create a static method that fills the area with a color. The method takes the XY plot you created, a data series, and a color as its parameters:

```java
        private static void formatFilledArea(XYPlot plot, DataSource 
          data, Color color) {
```

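Create a point renderer. Remember that you are rendering a 2D image, so use the appropriate class. Set the color of the point renderer and assign the renderer to the data series: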

```java
        PointRenderer point = new DefaultPointRenderer2D();
        point.setColor(color);
        plot.setPointRenderer(data, point);
```

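Similarly, create a 2D line renderer with the appropriate GRAL class, set its color, and set the gap between the line and the data points to 3.0 points. Next, make the gaps rounded. Finally, assign the line renderer to the data series: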

```java
        LineRenderer line = new DefaultLineRenderer2D();
        line.setColor(color);
        line.setGap(3.0);
        line.setGapRounded(true);
        plot.setLineRenderer(data, line);
```

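After the point and line renderers, what remains is the area renderer. Create a 2D area renderer, set its color, and assign it to the data series. Then close the method: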

```java
       AreaRenderer area = new DefaultAreaRenderer2D();
       area.setColor(GraphicsUtils.deriveWithAlpha(color, 64));
       plot.setAreaRenderer(data, area);
       }
```
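
The second argument of deriveWithAlpha, 64, is an alpha value on the 0 to 255 scale, so the fill is roughly 25% opaque and the series underneath stay visible. The sketch below shows the same idea with java.awt.Color alone; it presumably mirrors what the GRAL helper does but is not taken from its source, and TranslucentFillSketch and withAlpha are illustrative names.

```java
import java.awt.Color;

/** Sketch: deriving a translucent fill color from an opaque base color. */
final class TranslucentFillSketch {

    /** Copies the RGB components of base and applies the given alpha (0..255). */
    static Color withAlpha(Color base, int alpha) {
        return new Color(base.getRed(), base.getGreen(), base.getBlue(), alpha);
    }

    public static void main(String[] args) {
        Color fill = withAlpha(Color.BLUE, 64);
        System.out.println("alpha   = " + fill.getAlpha());
        System.out.printf("opacity = %.0f%%%n", fill.getAlpha() / 255.0 * 100.0);
    }
}
```

A low alpha for the area combined with the full-strength color for the line and points keeps overlapping series readable.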

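Similarly, create a static method to format the line area. The method takes three parameters: the XYPlot you created in the constructor, a data series, and a color: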

```java
       private static void formatLineArea(XYPlot plot, DataSource 
        data, Color color) {
```

Create a 2D point renderer, set its color, and assign the renderer to the data series:

```java
        PointRenderer point = new DefaultPointRenderer2D();
        point.setColor(color);
        plot.setPointRenderer(data, point);
```

In this method, you will not use a line renderer. This makes the third data series look different from the first two:

```java
        plot.setLineRenderer(data, null);
```

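As in formatFilledArea, create an area renderer, this time a LineAreaRenderer2D. Set the gap of the area, set its color, and assign the renderer to the data series: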

```java
        AreaRenderer area = new LineAreaRenderer2D();
        area.setGap(3.0);
        area.setColor(color);
        plot.setAreaRenderer(data, area);
        }
```

You need to override two methods of the ExamplePanel class, as follows:

```java
        @Override
        public String getTitle() {
        return "Area plot";
        }
        @Override
        public String getDescription() {
        return "Area plot of three series with different styling";
        }
```

To run the preceding code, you need a main method like the following one. Close the class after it:

        public static void main(String[] args) {
        new AreaPlot().showInFrame();
        }
       }

The complete code for this recipe is as follows:

import java.awt.Color; 
import java.util.Random; 
import de.erichseifert.gral.data.DataSeries; 
import de.erichseifert.gral.data.DataSource; 
import de.erichseifert.gral.data.DataTable; 
import de.erichseifert.gral.examples.ExamplePanel; 
import de.erichseifert.gral.plots.XYPlot; 
import de.erichseifert.gral.plots.areas.AreaRenderer; 
import de.erichseifert.gral.plots.areas.DefaultAreaRenderer2D; 
import de.erichseifert.gral.plots.areas.LineAreaRenderer2D; 
import de.erichseifert.gral.plots.lines.DefaultLineRenderer2D; 
import de.erichseifert.gral.plots.lines.LineRenderer; 
import de.erichseifert.gral.plots.points.DefaultPointRenderer2D; 
import de.erichseifert.gral.plots.points.PointRenderer; 
import de.erichseifert.gral.ui.InteractivePanel; 
import de.erichseifert.gral.util.GraphicsUtils; 
import de.erichseifert.gral.util.Insets2D; 

public class AreaPlot extends ExamplePanel { 
   /** Version id for serialization. */ 
   private static final long serialVersionUID = 3287044991898775949L; 

   /** Instance to generate random data values. */ 
   private static final Random random = new Random(); 

   public AreaPlot() { 
      // Generate data 
      DataTable data = new DataTable(Double.class, Double.class, 
        Double.class, Double.class); 
      for (double x = 0.0; x < 50; x ++) { 
         double y1 = Double.NaN, y2 = Double.NaN, y3 = Double.NaN; 
         y1 = random.nextGaussian(); 
         y2 = random.nextGaussian(); 
         y3 = random.nextGaussian(); 
         data.add(x, y1, y2, y3); 
      } 

      // Create data series 
      DataSeries data1 = new DataSeries("series 1", data, 0, 1); 
      DataSeries data2 = new DataSeries("series 2", data, 0, 2); 
      DataSeries data3 = new DataSeries("series 3", data, 0, 3); 

      // Create new xy-plot 
      XYPlot plot = new XYPlot(data1, data2, data3); 
      plot.setLegendVisible(true); 
      plot.setInsets(new Insets2D.Double(20.0, 40.0, 20.0, 20.0)); 

      // Format data series 
      formatFilledArea(plot, data1, COLOR2); 
      formatFilledArea(plot, data2, COLOR1); 
      formatLineArea(plot, data3, GraphicsUtils.deriveDarker(COLOR1)); 

      // Add plot to Swing component 
      add(new InteractivePanel(plot)); 
   } 

   private static void formatFilledArea(XYPlot plot, DataSource data, 
      Color color) { 
      PointRenderer point = new DefaultPointRenderer2D(); 
      point.setColor(color); 
      plot.setPointRenderer(data, point); 
      LineRenderer line = new DefaultLineRenderer2D(); 
      line.setColor(color); 
      line.setGap(3.0); 
      line.setGapRounded(true); 
      plot.setLineRenderer(data, line); 
      AreaRenderer area = new DefaultAreaRenderer2D(); 
      area.setColor(GraphicsUtils.deriveWithAlpha(color, 64)); 
      plot.setAreaRenderer(data, area); 
   } 

   private static void formatLineArea(XYPlot plot, DataSource data, 
      Color color) { 
      PointRenderer point = new DefaultPointRenderer2D(); 
      point.setColor(color); 
      plot.setPointRenderer(data, point); 
      plot.setLineRenderer(data, null); 
      AreaRenderer area = new LineAreaRenderer2D(); 
      area.setGap(3.0); 
      area.setColor(color); 
      plot.setAreaRenderer(data, area); 
   } 

   @Override 
   public String getTitle() { 
      return "Area plot"; 
   } 

   @Override 
   public String getDescription() { 
      return "Area plot of three series with different styling"; 
   } 

   public static void main(String[] args) { 
      new AreaPlot().showInFrame(); 
   } 
}

Posted 2025-10-08 11:34 by 绝不原创的飞龙
