Time Series Analysis with Spark (Complete)

原文：annas-archive.org/md5/92b806e0bfe2149821d7ca72450c0f00

译者：飞龙

协议：CC BY-NC-SA 4.0

前言

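Time series are everywhere, and they keep growing. With the right tools that scale, you can readily unlock insights from their time dimension and gain an edge in the race against time.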

时间序列分析——从时间序列中提取洞见——对于企业和组织做出明智决策至关重要。这通过分析在时间间隔中收集的数据中的模式、趋势和异常来实现。Apache Spark 是一个强大的大数据处理框架，能够高效处理大规模时间序列数据，是处理此类数据的理想工具。

对于任何涉及 Apache Spark 的时间序列分析任务，有三个主要支柱：

数据准备与处理：包括收集、清洗和转换时间序列数据，使其适合进行分析。
建模与预测：包括应用统计模型或机器学习算法来揭示模式并预测未来趋势。
部署与维护：包括将模型集成到运营系统中，并持续监控和更新，以确保准确性和相关性。

本书《使用 Spark 进行时间序列分析》旨在涵盖所有这些支柱。它将提供处理、建模和预测时间序列数据的实际技术，帮助读者使用 Apache Spark 进行时间序列分析。本书基于两个主要信息来源：

实践经验：借鉴实际项目和经验，处理大规模时间序列数据时的 Apache Spark 运用。
行业洞察：融入时间序列分析和大数据处理领域专家和从业者的见解。

随着 Apache Spark 在时间序列分析中的应用不断增长，对掌握此领域技能的专业人才需求也在快速增加。本书将引导您了解使用 Apache Spark 有效进行时间序列分析所需的最佳实践和技巧，帮助您在这个快速发展的领域中保持领先。

本书适用对象

数据和 AI 专业人员，尤其是处理时间相关数据集的人员，将会发现《使用 Spark 进行时间序列分析》有助于提升他们在时间序列分析中运用 Apache Spark 和 Databricks 的技能。本书面向广泛的受众，从初学者到有经验的从业者，都可以从中学习如何利用 Spark 进行时序数据分析。

更具体地说，数据工程师将提升他们在使用 Spark 和 Databricks 进行大规模时间序列数据准备方面的能力。机器学习（ML）工程师将更容易扩展他们的机器学习项目范围。数据科学家和分析师将掌握新的时间序列分析技能，拓宽工具使用的范围。

本书涵盖的内容

第一章，什么是时间序列？，介绍了时间序列数据的概念及其分析中的独特挑战。这个基础对于有效分析和预测时间相关数据至关重要。

第二章，为什么进行时间序列分析？，详细阐述了分析时间相关数据在预测建模、趋势识别和异常检测中的重要性。通过各行业的实际应用来说明这一点。

第三章，Apache Spark 简介，深入探讨了 Apache Spark 及其在处理大规模时间序列数据时的分布式计算能力。

第四章，时间序列分析项目的端到端视角，引导我们了解时间序列分析项目的整个过程。从用例开始，涵盖了数据处理、特征工程、模型选择和评估等关键阶段。

第五章，数据准备，深入探讨了组织、清理和转换时间序列数据的关键步骤。内容包括处理缺失值、处理异常值和数据结构化等技术，从而提升后续分析过程的可靠性。

第六章，探索性数据分析，介绍了如何通过探索性数据分析揭示时间序列数据中的模式和洞察。此过程对于识别趋势和季节性等特征至关重要，为后续建模决策提供指导。

第七章，构建与测试模型，专注于为时间序列数据构建预测模型，涵盖了各种类型的模型、选择模型的方法以及如何训练、调优和评估模型。

第八章，扩展分析，讨论了在大规模和分布式计算环境中扩展时间序列分析时需要考虑的因素。内容涵盖了如何利用 Apache Spark 扩展特征工程、超参数调优以及单模型和多模型训练的不同方式。

第九章，投入生产，探讨了将时间序列模型部署到生产环境中的实际考虑因素和步骤，同时确保时间序列模型在操作环境中的可靠性和有效性。

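Chapter 10, Going Further with Apache Spark, addresses the challenges of setting up and managing an Apache Spark platform by using Databricks, a cloud-hosted managed platform as a service (PaaS).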

第十一章，时间序列分析的最新发展，探讨了时间序列分析领域的最新发展，包括将生成性 AI 应用于时间序列预测的激动人心的研究方向，以及使时间序列分析结果对非技术人员易于理解的新方法。

为了最大限度地发挥本书的作用

本书要求你具备基本的 Python 编程语言知识，并且对数据科学和机器学习概念有基本了解。

第一章、第二章、第五章、第六章 和 第七章 使用 Databricks Community Edition。
第三章、第四章 和 第九章 使用本地容器化环境。本书中的示例在 macOS 上使用 Docker 进行测试。如果你在 Windows 或 Linux 上使用 Docker 或 Podman，适当调整后也应能够运行。如果你不打算在本地构建环境，而是更倾向于使用 Databricks 等托管平台，可以跳过这些章节的实践部分。
第八章、第十章 和 第十一章 使用 Databricks 平台。

获取安装和设置的附加说明及信息已在各个章节中记录。

本书中涉及的软件/硬件	操作系统要求
Databricks Community Edition
在 Amazon Web Services (AWS) 或 Microsoft Azure 上的 Databricks
Docker v4.48 或 Podman v1.16	Windows、macOS 或 Linux

示例代码所需的附加软件包会在代码执行时自动安装。由于软件包和用户界面可能会发生变化，请参考相应的软件包或产品文档，以了解更改信息。

如果你使用的是本书的数字版本，我们建议你自己输入代码或从本书的 GitHub 仓库访问代码（链接在下一节提供）。这样可以帮助你避免与复制粘贴代码相关的潜在错误。

如果有说明更新，它将尽可能地添加到 GitHub 仓库中各个章节的 README.md 文件中。

下载示例代码文件

你可以从 GitHub 下载本书的示例代码文件，链接为 github.com/PacktPublishing/Time-Series-Analysis-with-Spark。如果代码有更新，GitHub 仓库中的内容将会进行相应更新。

我们还在 https://github.com/PacktPublishing/ 提供了其他丰富书籍和视频的代码包。快去看看吧！

使用的约定

本书中使用了许多文本约定。

文本中的代码：表示文本中的代码字、数据库表名、文件夹名、文件名、文件扩展名、路径名、虚拟 URL、用户输入和 Twitter 账号。例如：“将下载的 WebStorm-10*.dmg 磁盘映像文件挂载为系统中的另一个磁盘。”

代码块设置如下：

#### Summary Statistics
# Code in cell 10
df.summary().display()

当我们希望引起你对代码块中特定部分的注意时，相关的行或项目会以粗体显示：

sns.boxplot(x='dayOfWeek', y='Global_active_power', data=pdf)

任何命令行输入或输出都按如下方式显示：

Test SMAPE: 41.193985580947896
Test WAPE: 0.35355667972102317

粗体：表示一个新术语、一个重要的词，或者你在屏幕上看到的词。例如，菜单或对话框中的词会以粗体显示。以下是一个示例：“报告的其他部分涉及警报，如图 6.8所示，包括在数据集上运行的测试结果，包括特定于时间序列的测试，以及一个重现部分，详细描述了分析运行的情况。”

提示或重要说明

显示效果如下。

与我们联系

我们始终欢迎读者的反馈。

一般反馈：如果你对本书的任何内容有疑问，请通过电子邮件 customercare@packtpub.com 联系我们，并在邮件主题中注明书名。

勘误表：尽管我们已经尽力确保内容的准确性，但错误难免。如果你在本书中发现任何错误，恳请报告给我们。请访问www.packtpub.com/support/errata并填写表格。

盗版：如果你在互联网上发现任何非法复制的作品，恳请提供相关的网址或网站名称。请通过 copyright@packtpub.com 与我们联系，并附上链接。

如果你有兴趣成为作者：如果你在某个领域拥有专业知识，并有意撰写或贡献书籍，请访问authors.packtpub.com。


第一部分：时间序列和 Apache Spark 简介

在本部分中，您将了解时间序列分析和 Apache Spark。从时间序列数据的基础概念开始，我们将深入探讨时间序列分析的实际意义及其在各行业的应用，并结合一些实践示例。接着，您将了解 Apache Spark，理解它的使用方式、架构以及工作原理，最后在您的环境中安装它。

本部分包括以下章节：

第一章，什么是时间序列？
第二章，为什么进行时间序列分析？
第三章，Apache Spark 简介

第一章：什么是时间序列？

“时间是最智慧的顾问。” – 伯里克勒斯

历史是迷人的。它提供了我们起源的深刻叙述，展现了我们所走的路和我们奋斗的目标。历史赋予我们从过去汲取的教训，使我们更好地面对未来。

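Take the influence of weather on history as an example. Shifts in weather patterns, which began in the Middle Ages and intensified after the eruption of the Laki volcano in 1783, brought widespread hardship to France. This climatic upheaval aggravated social unrest, ultimately leading to the French Revolution of 1789. (See the Further reading section for details.)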

时间序列体现了这一叙事，数字回响着我们的过去。它们是历史的量化，是我们集体过去的数值化叙事，为未来提供了宝贵的经验。

本书将带你进行一段全面的时间序列之旅，从基础概念开始，指导你进行实践中的数据准备和模型构建技巧，最终涵盖诸如扩展性和部署到生产等高级主题，同时跟进跨行业的尖端应用最新进展。通过本书的学习，你将能够结合 Apache Spark 构建强大的时间序列模型，以满足你所在行业应用场景的需求。

本章作为本书旅程的起点，介绍了时间序列数据的基本概念，探讨了其顺序性质和所面临的独特挑战。内容涵盖了趋势和季节性等关键组成部分，为使用 Spark 框架进行大规模时间序列分析奠定了基础。对于数据科学家和分析师来说，这些知识至关重要，它为有效利用 Spark 的分布式计算能力来分析和预测时间相关数据，并在金融、医疗保健、营销等多个领域做出明智决策提供了基础。

本章将涵盖以下主题：

时间序列简介
将时间序列分解成其组成部分
时间序列分析中的额外考虑

技术要求

在本书的第一部分，奠定了基础，你可以不参与实际操作示例而跟随阅读（尽管推荐参与）。本书后半部分将更加侧重于实践。如果你希望从一开始就进行实际操作，本章的代码可以在本书的 GitHub 仓库中找到，地址为：

github.com/PacktPublishing/Time-Series-Analysis-with-Spark/tree/main/ch1

注意

请参考此 GitHub 仓库，以获取最新版本的代码，若在出版后有更新，将会进行注释说明。更新后的代码（如果有）可能与书中展示的代码部分有所不同。

以下动手实践部分将提供更多细节，帮助你开始进行时间序列分析。

时间序列简介

在这一部分，我们将了解什么是时间序列以及一些相关术语。通过动手实践的示例来可视化时间序列。我们将查看不同类型的时间序列及其特点。了解时间序列的性质对于我们在接下来的章节中选择合适的时间序列分析方法是必要的。

让我们从一个例子开始，这个时间序列表示的是自 1950 年以来毛里求斯每年的平均气温。数据的一个简短示例如表 1.1所示。

年份	平均气温
1950	22.66
1951	22.35
1952	22.50
1953	22.71
1954	22.61
1955	22.40
1956	22.22
1957	22.53
1958	22.71
1959	22.49

表 1.1：样本时间序列数据——平均气温

在可视化并解释这个例子时，我们将接触到一些与时间序列相关的术语。用于可视化这个数据集的代码将在本章的动手部分讲解。

在下图中，我们可以看到自 1950 年以来气温的变化。如果我们将注意力集中在 1980 年后的这段时间，我们可以更仔细地观察到温度的变化，呈现出类似的逐年升高的趋势（趋势——在两图中用虚线表示），直到当前的温度。

图 1.1：自 1950 年以来毛里求斯的平均气温

如果气温继续以相同的方式上升，我们正走向一个更温暖的未来，这也正是当前广泛接受的全球变暖的表现。与气温逐年上升的同时，每年夏季气温也会上升，冬季则会下降（季节性）。我们将在本章的动手部分可视化这一现象及温度时间序列的其他组成部分。

随着多年来气温逐渐升高（趋势），全球变暖对我们星球及其居民产生了影响（因果关系）。这一影响也可以通过时间序列来表示——例如海平面或降水量的测量。全球变暖的后果可能是剧烈的和不可逆的，这进一步突显了理解这一趋势的重要性。

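These temperature readings, ordered in time, form what we call a time series. Analyzing and understanding such series matters greatly for our future.

So, more generally, what is a time series? It is simply a sequence of measurements in chronological order, each paired with the specific time at which the source system generated it. In the temperature example, the source system is a thermometer at a specific geographic location.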

时间序列也可以以聚合形式表示，例如每年的平均气温，如表 1.1所示。

通过这个定义，并结合一个示例，我们将进一步探讨时间序列的性质。我们还将在本书的其余部分详细介绍这里提到的术语，如趋势、季节性和因果关系。

时间顺序

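We mentioned chronological order when defining time series earlier in this chapter because it is one of the main differences between time series and other datasets. A key reason why order matters is the potential autocorrelation in a series, where the measurement at time t is correlated with the measurement n time steps earlier (the lag). Ignoring this order makes the analysis incomplete or even incorrect. We will look at ways to identify autocorrelation in Chapter 6, Exploratory Data Analysis.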

值得注意的是，在许多时间序列的情况下，自相关往往使得时间上较近的测量值之间的值更加接近，而与时间上较远的测量值相比，值的差异较大。
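As a rough illustration of autocorrelation and lag, the following sketch computes the sample autocorrelation of a series in plain Python. The function name `autocorrelation` and the toy series are assumptions made for this example and are not taken from the book's notebooks.

```python
def autocorrelation(values, lag):
    """Sample autocorrelation of `values` at `lag` (0 <= lag < len(values))."""
    n = len(values)
    mean = sum(values) / n
    # Total variation around the mean (the denominator).
    denom = sum((v - mean) ** 2 for v in values)
    # Co-variation between each point and the point `lag` steps earlier.
    num = sum(
        (values[t] - mean) * (values[t - lag] - mean)
        for t in range(lag, n)
    )
    return num / denom


# Hypothetical toy series: a steady upward drift plus a small wiggle.
series = [20.0 + 0.1 * t + (0.05 if t % 2 else -0.05) for t in range(40)]

for lag in (1, 5, 20):
    print(f"lag {lag:>2}: {autocorrelation(series, lag):.3f}")
```

On a drifting series like this one, nearby points (small lags) are more strongly correlated than distant ones, which is the behaviour described above.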

尊重时间顺序的另一个原因是避免在模型训练过程中发生数据泄露。在一些分析和预测方法中，我们会使用过去的数据训练模型，以预测未来目标日期的值。我们需要确保所有使用的数据点都在目标日期之前。时间序列数据中的数据泄露往往难以发现，这会破坏方法的完整性，并在开发阶段让模型表现得过于理想，而在面对新的未见数据时表现不佳。
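One simple guard against this kind of leakage is to split the data by time rather than at random. The sketch below assumes records are `(timestamp, value)` tuples; the helper name `chronological_split` and the sample data are hypothetical.

```python
from datetime import date


def chronological_split(records, cutoff):
    """Split (timestamp, value) records into train (< cutoff) and test (>= cutoff)."""
    # Sort first so that late or out-of-order rows cannot slip into the wrong set.
    ordered = sorted(records, key=lambda r: r[0])
    train = [r for r in ordered if r[0] < cutoff]
    test = [r for r in ordered if r[0] >= cutoff]
    return train, test


# Hypothetical daily readings, deliberately supplied out of order.
records = [(date(2024, 1, d), 20.0 + d) for d in (5, 1, 4, 2, 3, 7, 6)]
train, test = chronological_split(records, cutoff=date(2024, 1, 6))

print("train:", [t.isoformat() for t, _ in train])
print("test: ", [t.isoformat() for t, _ in test])
```

Every training point precedes every test point, so the model never sees data from after the date it is asked to predict.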

本书的其余部分将进一步解释在这里提到的术语，如自相关、滞后和数据泄露。

本节讨论的时间顺序是时间序列的一个定义特征。在下一节中，我们将重点讨论规律性或其缺乏，这是另一个特征。

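Regular and irregular

A time series can be regular or irregular, depending on the interval between its measurements.

A regular time series is expected to have values at regular intervals, for example every minute, hour, or month. This is typically the case when the source system produces continuous values that are then measured at regular intervals. The regularity is expected but not guaranteed: the series can have gaps or zero values, caused by missing data points or by measurements that are genuinely zero. It is still considered regular in that case.

An irregular time series is one whose measurements are taken at the source at irregular intervals. This usually happens when events occur at irregular points in time and some value is measured for each event. Such values can be converted to regular intervals by resampling to a lower frequency (downsampling), which yields a regular time series. For example, an event that does not occur every minute may occur every hour, so it is regular when viewed at an hourly level.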
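The downsampling idea can be sketched without any library by grouping irregular events into hourly buckets. The function `downsample_hourly` and the event values below are illustrative assumptions, and the mean is only one possible aggregation.

```python
from collections import defaultdict
from datetime import datetime


def downsample_hourly(events):
    """Aggregate irregular (timestamp, value) events into hourly means."""
    buckets = defaultdict(list)
    for ts, value in events:
        # Truncate each timestamp to the start of its hour.
        hour = ts.replace(minute=0, second=0, microsecond=0)
        buckets[hour].append(value)
    return {hour: sum(vals) / len(vals) for hour, vals in sorted(buckets.items())}


# Hypothetical events arriving at irregular times.
events = [
    (datetime(2024, 3, 1, 9, 3), 1.0),
    (datetime(2024, 3, 1, 9, 41), 3.0),
    (datetime(2024, 3, 1, 10, 17), 5.0),
    (datetime(2024, 3, 1, 11, 59), 7.0),
]
hourly = downsample_hourly(events)
for hour, value in hourly.items():
    print(hour.isoformat(), value)
```

Hours without any event simply do not appear in the result, so a real pipeline would still need to decide how to treat such gaps.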

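This book focuses mainly on regular time series. After regularity, the next characteristic we consider, in the following section, is stationarity.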

平稳与非平稳

考虑到时间序列的统计性质随时间的变化，它们可以进一步分为平稳和非平稳。

平稳时间序列是指那些统计性质（如均值和方差）随时间变化不大的时间序列。

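Non-stationary time series have statistical properties that change over time. They can be converted to stationary series in several ways, for example by differencing one or more times to stabilize the mean, or by taking logarithms to stabilize the variance. The distinction is important because it determines which analysis methods can be used. For example, if a method assumes that the series is stationary, non-stationary data can first be transformed as described. You will learn how to identify stationarity in Chapter 6.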

注意

将非平稳时间序列转换为平稳时间序列可以去除趋势和季节性成分，但如果我们想分析这些成分，可能就不符合我们的需求。
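The two transformations mentioned in this section, differencing and taking logarithms, can be sketched in a few lines. The helper `difference` and the toy series are assumptions for illustration only.

```python
import math


def difference(values, order=1):
    """Apply differencing `order` times: each pass replaces x[t] with x[t] - x[t-1]."""
    for _ in range(order):
        values = [b - a for a, b in zip(values, values[1:])]
    return values


# A linear trend becomes a constant series after one round of differencing.
linear = [10 + 2 * t for t in range(8)]
print(difference(linear))

# Exponential growth: take logs first, then difference, to obtain a constant series.
exponential = [100 * 1.1 ** t for t in range(8)]
log_diff = difference([math.log(v) for v in exponential])
print([round(d, 4) for d in log_diff])
```

Note that each round of differencing shortens the series by one point.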

本节内容对于理解时间序列的基本特性非常重要，这是在本书后半部分选择合适的分析方法的前提。图 1.2总结了可以使用的时间序列类型和转换操作。

图 1.2：时间序列类型

这部分结束了本章的理论内容。在接下来的部分，我们将进行第一次动手实践，同时设置编码环境。本章将从可视化和分解时间序列开始。我们将在下一章深入探讨不同类型的时间序列分析及其使用场景。

动手实践：加载和可视化时间序列

让我们通过动手练习加载一个时间序列数据集并进行可视化。我们将尝试创建之前在图 1.1中看到的可视化表示。

开发环境

为了运行代码，你需要一个 Python 开发环境，在其中安装 Apache Spark 和其他所需的库。具体的库和安装说明将在相关章节中详细介绍。

PaaS

一个简单的方法是使用免费的 Databricks Community Edition，它包含一个基于笔记本的开发界面，并且预安装了 Spark 和其他一些库。

docs.databricks.com/en/getting-started/community-edition.html

Community Edition 的计算能力是有限的，因为它是一个免费的基于云的 PaaS。你还可以注册 Databricks 的 14 天免费试用版，具体取决于你选择的注册选项，可能需要你首先拥有云服务提供商的账户。一些云服务提供商可能会有一些免费积分的促销活动，供你在开始时使用。这将为你提供比 Community Edition 更多的资源，时间有限。

你可以通过以下网址注册 Databricks 的免费试用版：www.databricks.com/try-databricks

Databricks 的开发团队是 Apache Spark 的原创作者，因此在这里工作会是一个不错的选择。

早期章节中的示例将使用 Community Edition 和 Apache Spark 的开源版本。我们将在 第八章 和 第十章 中使用完整的 Databricks 平台。

自定义

或者，你可以搭建自己的环境，设置完整的技术栈，例如在 Docker 容器中。这将在 第三章 中介绍，Apache Spark 简介。

代码

本节的代码位于本书 GitHub 仓库的 ch1 文件夹中的名为 ts-spark_ch1_1.dbc 的笔记本文件中，具体参考 技术要求 部分。

The notebook can be downloaded from: github.com/PacktPublishing/Time-Series-Analysis-with-Spark/raw/main/ch1/ts-spark_ch1_1.dbc

数据集

一旦选择了开发和运行时环境，另一个需要考虑的因素是数据集。我们将使用的是毛里求斯年均地表空气温度数据，可以在气候变化知识门户网站上找到，网址为 climateknowledgeportal.worldbank.org/country/mauritius。

A copy of the dataset (file name ts-spark_ch1_ds1.csv) is available in the ch1 folder on GitHub. The notebook code mentioned earlier downloads it.

接下来，你将在 Databricks Community Edition 工作区中工作，这将是你自己的独立环境。

步骤：加载和可视化时间序列

现在我们已经完成了所有设置，让我们开始第一个编程练习。首先，登录 Databricks Community Edition 导入代码，创建一个集群，并最终运行代码：

使用你在注册过程中指定的凭证登录到 Databricks Community Edition，如图 1.3所示。访问登录页面的 URL 为：community.cloud.databricks.com/

如果你还没有注册，请参考开发环境部分，了解如何进行注册。

图 1.3：登录 Databricks Community Edition

进入工作区后，点击创建笔记本。见图 1.4。

图 1.4：创建笔记本

从这里开始，我们将进入代码部分，首先导入提供的ts-spark_ch1_1.dbc笔记本，该笔记本可以在 GitHub 上找到，链接为第一章，如图 1.5所示。

图 1.5：导入笔记本

请注意，你可以从技术要求部分提供的 GitHub URL 下载第一章的文件到本地计算机，然后从那里导入，或者可以按图 1.6所示，指定以下原始文件 URL 进行导入：github.com/PacktPublishing/Time-Series-Analysis-with-Spark/raw/main/ch1/ts-spark_ch1_1.dbc

图 1.6：从文件或 URL 导入笔记本

到现在为止，我们已经进入了实际的代码部分。你应该现在已经有了一个带有代码的笔记本，如图 1.7所示。

图 1.7：带代码的笔记本

最后，让我们运行代码。点击全部运行，如图 1.8所示。

图 1.8：运行笔记本中的所有代码

如果你还没有启动集群，你需要创建并启动一个新的集群。请注意，在 Databricks Community Edition 中，当集群未使用时会自动终止，在这种情况下，你将看到附加集群已终止的消息，如图 1.9所示，你需要选择另一个资源。

图 1.9：附加集群已终止

从此时起，你可以选择连接到另一个活动集群（非终止状态的集群），或者选择创建一个新的资源，如图 1.10所示。

图 1.10：计算 – 创建新资源

接下来，你需要为集群指定一个名称，并选择你想要使用的 Spark 版本，如图 1.11所示。这里的推荐做法是使用最新版本，除非由于需要在其他环境中运行的兼容性原因，你需要让代码在旧版本上工作。

图 1.11：计算 – 创建、附加并运行

一旦集群创建并启动（在这个免费环境中可能需要几分钟时间），代码就会运行，你将看到章节开头的图 1.1所示的图表作为输出。用于创建和显示图表的图形库提供了交互式界面，使你可以进行例如放大特定时间段的操作。
由于这是第一次动手实践，我们已经详细介绍了逐步操作。在未来的实践部分，我们将专注于特定的数据集和代码，因为其他部分将非常相似。只要有差异，会提供额外的说明。

现在我们已经执行了代码，接下来我们将回顾主要部分。在本介绍性部分，我们将保持高层次的讨论，待介绍完 Apache Spark 概念后，后续章节将进一步深入细节：

The import statements add the libraries for date format conversion and for plotting:

```
import pyspark.pandas as ps
import plotly.express as px
```

Plotly is a graphing library that turns data points into charts and supports interactive visualization.

然后我们使用 spark.read将 CSV 数据文件读取到表中：

df1 = spark.read.format("csv") \
    .option("header", "true") \
    .load("file:///" + SparkFiles.get(DATASET_FILE))
df1.createOrReplaceTempView("temperatures")

spark.sql语句基于源数据集中的年份列（命名为Category）选择数据集的一个子集：

df2 = spark.sql("select to_date(Category) as year, float(`Annual Mean`) as annual_mean from temperatures where Category > '1950'")

最后，我们根据普通最小二乘法（OLS）回归绘制时间序列以及趋势线，如图 1.1所示：

fig = px.scatter(
    df2_pd, x="year", y="annual_mean",
    trendline="ols", 
    title='Average Temperature - Mauritius (from 1950)'
)

使用的绘图库plotly允许在用户界面上实现互动，例如鼠标悬停时显示数据点信息以及缩放。

从这一点开始，随时可以在代码和 Databricks 社区版环境中进行实验，我们将在本书的大部分初始章节中使用该环境。

在这一部分，你首次接触了时间序列和编码环境，从一个简单的练习开始。在下一节中，我们将详细讲解到目前为止介绍的一些概念，并将时间序列分解为其组成部分。

将时间序列分解为其组成部分

本节旨在通过分析时间序列的组成部分，进一步加深对时间序列的理解，并详细说明迄今为止介绍的几个术语。这将为接下来的章节奠定基础，使你能够根据分析的时间序列特性使用正确的方法。

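A time series model can be decomposed into three main components, namely trend, seasonality, and residuals:

Time series = Trend + Seasonality + Residuals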

注意

本书中的数学表示将采用简化的英文符号，以便于广泛的受众。关于时间序列的数学公式，请参考以下优秀资源：《预测：原理与实践》：otexts.com/fpp3/。

正如您将在接下来的实践部分看到的那样，这种成分的划分是基于拟合到时间序列数据的模型得出的。对于大多数实际数据集来说，这种分解仅仅是模型对现实的近似。因此，每个模型都会有自己对这些成分的识别和近似。整个目标是找到最适合时间序列的模型。这就是我们在第七章中将要构建和测试的内容。

让我们逐一分析这些成分，定义它们的含义，并根据一个示例数据集进行可视化，如图 1.12所示。

图 1.12：时间序列分解

系统性和非系统性成分

水平、趋势、季节性和周期性被称为系统性成分。它们代表了时间序列的基础结构，可以进行建模，因此可以预测。

除了系统成分外，还有一个非系统性部分无法建模，这部分被称为残差、噪声或误差。时间序列建模的目标是找到最适合系统成分的模型，同时最小化残差。

现在我们将详细介绍每个系统性和非系统性部分。

水平

水平，也称为基准水平，是序列的均值，作为基线，其他成分的效应会在其上叠加。有时，它会作为额外成分明确加入到前面的公式中。然而，水平并不总是出现在公式中，因为它可能不是分析的主要焦点，或者分解方法可能已经将其隐含在其他成分中。

趋势

趋势是指时间序列中值在一段时间内的总体变化方向：上升、下降或平稳。这种变化可以是线性的，如图 1.1和图 1.12所示，也可以是非线性的。趋势本身可以在不同的时间点发生变化，我们可以将其称为趋势变化点。更广泛地说，变化点是指时间序列的统计特性发生变化的时间点。这可能对模型参数，甚至我们用来分析时间序列的模型产生显著影响。

季节性和周期性

季节性表示时间序列在固定时间间隔内的变化。这通常是由季节性日历事件引起的。以我们的温度例子为例，每年夏季的温度都会相对于其他季节升高，冬季则下降，如图 1.12所示。类似地，礼品销售的时间序列可能会在每个圣诞节期间显示出销售的增加，形成其季节性模式。

多重季节性（间隔和振幅）可以在同一时间序列中产生组合效应，如图 1.13所示。例如，在温度的例子中，除了夏冬季的起伏变化外，白天温度升高，夜间温度下降。

图 1.13：多重重叠的季节性（合成数据）

周期性是指类似季节性在不规则间隔发生的变化。时间序列中的周期性反映了外部周期对序列的影响。例如，经济衰退每隔若干年发生一次，并对经济指标产生影响。我们无法提前预测其发生时间，这与圣诞节的季节性不同，后者每年 12 月 25 日都能预测发生。

残差或剩余项

残差或剩余项是指在模型已考虑了趋势、季节性和周期性之后所剩下的部分。残差可以使用自回归（AR）或移动平均（MA）方法进行建模。此时剩余的部分，也被称为噪声或误差，具有随机性，无法被建模。在图 1.12的最上方图表中，你可以将残差可视化为数据点与拟合线之间的距离。我们将在第六章中介绍如何测试残差，内容涉及探索性数据分析。

注意

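While residuals are only the random part of a time series, an entire series can itself be purely random or be a random walk. A purely random series does not depend on earlier values, whereas in a random walk the value at time t depends on the value at t-1 (plus some drift and a random component).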
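The difference between a purely random series and a random walk can be made concrete with a short simulation. The functions `white_noise` and `random_walk`, the Gaussian noise, and the drift value are assumptions chosen for this sketch.

```python
import random


def white_noise(n, rng):
    """Purely random series: each value is drawn independently of the others."""
    return [rng.gauss(0.0, 1.0) for _ in range(n)]


def random_walk(n, rng, drift=0.0):
    """Random walk: value at t = value at t-1 + drift + random shock."""
    values, current = [], 0.0
    for _ in range(n):
        current = current + drift + rng.gauss(0.0, 1.0)
        values.append(current)
    return values


# The same seed is used twice so that both series share identical random shocks.
noise = white_noise(50, random.Random(1))
walk = random_walk(50, random.Random(1), drift=0.5)

# Differencing the walk recovers drift + shock at every step.
steps = [walk[0]] + [b - a for a, b in zip(walk, walk[1:])]
print(all(abs(s - 0.5 - e) < 1e-9 for s, e in zip(steps, noise)))
```

Because the walk accumulates its shocks, it wanders away from its starting point, while the white noise keeps fluctuating around zero.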

加法型或乘法型

时间序列可以是加法型（前述公式）或乘法型。在加法型的情况下，季节性和残差成分不依赖于趋势。而在乘法型的情况下，它们随趋势变化，可以视为季节性成分的振幅变化——例如，较高的峰值和较低的谷值。
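To see the difference between the two forms, the sketch below builds one additive and one multiplicative synthetic series from the same trend and seasonal pattern. All values (48 monthly points, the amplitudes, the helper `yearly_swing`) are assumptions for illustration.

```python
import math

n = 48  # hypothetical: four years of monthly points
trend = [100 + 2 * t for t in range(n)]
season = [math.sin(2 * math.pi * t / 12) for t in range(n)]

# Additive: the seasonal swing is added and stays the same size as the trend grows.
additive = [tr + 10 * s for tr, s in zip(trend, season)]
# Multiplicative: the seasonal swing scales with the level of the trend.
multiplicative = [tr * (1 + 0.1 * s) for tr, s in zip(trend, season)]


def yearly_swing(series, year):
    """Peak-to-trough size of the detrended series within one 12-point year."""
    seasonal = [series[t] - trend[t] for t in range(12 * year, 12 * (year + 1))]
    return max(seasonal) - min(seasonal)


add_swings = [yearly_swing(additive, y) for y in range(4)]
mul_swings = [yearly_swing(multiplicative, y) for y in range(4)]
print("additive swings:      ", [round(s, 2) for s in add_swings])
print("multiplicative swings:", [round(s, 2) for s in mul_swings])
```

The additive swings stay constant from year to year, while the multiplicative swings widen as the trend rises, giving the higher peaks and lower troughs described above.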

现在我们已经了解了时间序列的各个组成部分，接下来我们通过代码来实践一下。

实践操作：分解时间序列

The code for this section is in the notebook file named ts-spark_ch1_2fp.dbc.

位置 URL 如下：github.com/PacktPublishing/Time-Series-Analysis-with-Spark/raw/main/ch1/ts-spark_ch1_2fp.dbc

我们将使用的数据集是 1981 到 1990 年间澳大利亚墨尔本的每日最低温度，原始数据来自澳大利亚气象局，并可在 Kaggle 上通过以下链接获取：www.kaggle.com/datasets/samfaraday/daily-minimum-temperatures-in-me

数据集的副本已提供在 GitHub 文件夹中，文件名为ts-spark_ch1_ds2.csv。

在本章中，我们将保持高层次的讨论，选取笔记本中的一些内容进行讲解，之后会在接下来的章节中详细介绍预测模型的其他概念：

import 语句添加了用于预测模型和绘制图表的库：
```
from prophet import Prophet
from prophet.plot import plot_plotly, plot_components_plotly
```
使用的预测库是 Prophet，它是 Facebook 开源的库。无论是专家还是非专家，都可以使用它进行时间序列数据的自动预测。

然后，我们使用 spark.read 将 CSV 数据文件读入表格中：

df1 = spark.read.format("csv") \
    .option("header", "true") \
    .load("file:///" + SparkFiles.get(DATASET_FILE))
df1.createOrReplaceTempView("temperatures")

spark.sql 语句将 date 和 daily_min_temperature 列转换为正确的格式和列名，这是 Prophet 所要求的：

df2 = spark.sql("select to_date(date) as ds, float(daily_min_temperature) as y from temperatures sort by ds asc")

接下来，我们使用 Prophet 库根据 12 个月的季节性创建一个预测模型，并将其拟合到数据上：

model = Prophet(
    n_changepoints=20, 
    yearly_seasonality=True,
    changepoint_prior_scale=0.001)
model.fit(df2_pd)

该模型随后用于预测未来日期的温度：

future_dates = model.make_future_dataframe(
    periods=365, freq='D')
forecast = model.predict(future_dates)

最后，我们绘制了模型识别出的时间序列成分，如图 1.12所示：
```
plot_components_plotly(model, forecast)
```

现在我们已经对成分和预测做了基本讨论，让我们来探讨一下重叠季节性案例。

多重重叠季节性

我们将通过代码来创建图 1.13中的数据可视化。此部分代码位于名为 ts-spark_ch1_3.dbc 的笔记本文件中。

位置 URL 如下：github.com/PacktPublishing/Time-Series-Analysis-with-Spark/raw/main/ch1/ts-spark_ch1_3.dbc

该数据集是合成的，生成了三条不同的正弦曲线，代表三种重叠的季节性。

以下代码摘自笔记本。让我们从高层次进行查看：

import 语句添加了用于数值计算和绘图的库：
```
import numpy as np
from plotly.subplots import make_subplots
```
NumPy 是一个开源的 Python 科学计算库，相比标准 Python，它在计算和内存使用上显著更高效。我们将在此使用它的数学函数。

接下来，我们生成多个正弦曲线，使用np.sin来表示不同的季节性，并将它们叠加在一起：

(amp, freq) = (3, 0.33)
seasonality1 = amp * np.sin(2 * np.pi * freq * time_period)
(amp, freq) = (2, 1)
seasonality2 = amp * np.sin(2 * np.pi * freq * time_period)
(amp, freq) = (1, 4)
seasonality3 = amp * np.sin(2 * np.pi * freq * time_period)
combined = seasonality1 + seasonality2 + seasonality3

最后，我们绘制了各个季节性以及它们的合成季节性，如图 1.13所示：

fig = make_subplots(rows=4, cols=1, shared_xaxes=True)
fig.add_scatter(
    x=time_period, y=seasonality1, 
    row=1, col=1, name=f"seasonality 1")
fig.add_scatter(
    x=time_period, y=seasonality2, 
    row=2, col=1, name=f"seasonality 2")
fig.add_scatter(
    x=time_period, y=seasonality3, 
    row=3, col=1, name=f"seasonality 3")
fig.add_scatter(
    x=time_period, y=combined, 
    row=4, col=1, name=f"combined")
fig.show()

From here on, feel free to experiment with the full code in the notebook.

在本节中，我们开始了分析时间序列的旅程，探讨了其潜在结构，并根据数据的性质铺平了进一步分析的道路。在下一节中，我们将涵盖一些关键的考虑因素和挑战，帮助你在整个过程中做好准备。

时间序列分析的额外考虑因素

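This is probably the most important section in the early part of the book. In the introductory section, we mentioned some key considerations for time series, such as preserving chronological order, regularity, and stationarity. Here, we list the key challenges and additional considerations encountered when analyzing time series in real-world projects. This lets you plan your own learning and practice, guided by the relevant parts of this book and by further reading.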

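According to the well-known 2015 paper Hidden Technical Debt in Machine Learning Systems, only a small fraction of the work in an advanced analytics project relates to code. Most of the remaining effort goes into other considerations, such as data preparation and infrastructure.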

这些挑战的解决方案是非常具体的，依赖于你的具体背景。本章的目的是让你意识到这些考虑因素，如图 1.14所总结的。

图 1.14：时间序列分析中的考虑因素和挑战

尽管这些考虑因素大多数与非时间序列分析（如机器学习）共享，但时间序列分析通常是高级分析方法中最具挑战性的。我们将在本书的其余部分详细讨论一些应对这些挑战的解决方案。

面对数据挑战

与所有数据科学和机器学习项目一样，数据是关键。你运行的分析和构建的模型的效果将取决于数据的质量。数据挑战各式各样，且非常依赖于你的具体环境和数据集。

我们将在这里列出一些常见的问题：

数据访问 可能是所有问题的起点。对于本书而言，我们将使用几个免费访问的数据集，因此这不是问题。在实际项目中，所需数据集的所有权可能属于你所在组织的其他部门，甚至可能完全属于另一家组织。在这种情况下，你将不得不经历获取数据集的过程，可能会涉及财务成本，并确保数据能够以可靠的方式进行传输，同时保证传输速度和数据的新鲜度。传输管道的构建将有其自身的成本，以及传输本身的成本。传输机制必须具备生产级别的能力，以支持操作需求：稳健、可恢复、可监控等。

最初，你的数据访问需求将用于探索性数据分析和模型训练。批量导出可能足够。进入生产阶段后，你可能需要实时或近实时的数据访问。那时，考虑因素将完全不同。

一旦数据被摄取，接下来的要求是以安全且可用的方式存储它。使用专门的时间序列数据库是一个优化性能的选择，尽管对于大多数情况，通用存储已足够：

敏感性是另一个关键方面。在这里，开发和生产中可能会有不同的要求。然而，在许多情况下，开发和测试中使用的是生产数据的子集。某些包含个人身份信息（PII）的列需要进行遮蔽或加密，以遵守如欧洲 GDPR 等法规。在高度敏感的情况下，整个数据集可能需要加密。这对大规模处理来说是一个挑战，因为每次访问数据都可能需要解密和重新加密。这会带来处理开销。

总结来说，端到端的安全性和数据治理将成为你的高优先级需求，这从第一天就开始了。你希望在每个阶段都避免安全性和合规性风险，包括开发阶段，尤其是当你处理敏感数据时。
数据量和频率在实时或近实时的大流量数据源中，将需要合适的平台来实现快速处理而不丢失数据。在预生产环境中，这一点可能不太明显，因为规模较小。性能和可靠性问题通常会在生产环境扩展时才显现出来。我们将在介绍 Apache Spark 时讨论扩展和流处理，这将帮助你避免此类问题。
数据质量是我们早期将面临的挑战，一旦数据访问问题解决，我们开始在探索阶段和开发中处理数据。挑战包括数据缺失、数据损坏、数据噪声，甚至对于时间序列数据来说，更为相关的是数据延迟和乱序。如前所述，对于时间序列数据，保持时间顺序非常重要。在我们讨论数据准备时，我们将进一步探讨解决数据质量问题的方法。

在数据挑战之后，下一步的重点是为需要解决的问题选择正确的方法和模型。

使用正确的模型

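This can be an even bigger challenge for those who are new to time series. As we have seen so far, time series have different statistical properties. Some analysis and modeling methods were created on the basis of assumptions about these properties, with stationarity being a common one. Applied to the wrong type of time series, these methods will not work as expected and may even give misleading results. If you have identified multiple overlapping seasonalities, handling them can also be a challenge for some methods. Figure 1.14 recaps the types of time series and analysis models. Model selection is discussed further in Chapter 7, Building and Testing Models.

Choosing the right model also depends heavily on the outcome we want to achieve, whether that is forecasting one or several time steps ahead, or analyzing one series (univariate) or several series (multivariate) at the same time. In some domains, such as regulated industries, explainability is often required as well, and some models (black-box models, for example) may struggle to meet this requirement. In the next chapter, Why Time Series Analysis?, we discuss the outcomes of time series analysis and how to choose suitable models, including models for anomaly and pattern detection and for predictive modeling.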

维持空间和时间层次结构

请注意，另一个关键考虑因素是数据收集和分析的层次结构。这需要在不同层级之间保持一致性。为了说明这一点，让我们以一个多店零售商销售不同产品的时间序列预测为例。这里的空间层次结构可能位于产品和产品类别层级，以及特定商店和区域层级。时间层次结构将对应于每小时、每日、每季度等的销售情况。在这种情况下的挑战是确保单个产品和产品类别的预测一致性，以及例如，日度预测与季度预测的一致性。
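A coherence check across levels of the hierarchy can be sketched as follows. This uses a simple bottom-up roll-up, which is only one possible approach; the product names, forecast figures, and function names are all hypothetical.

```python
# Hypothetical hierarchy: category -> products.
hierarchy = {"beverages": ["tea", "coffee"], "snacks": ["chips", "nuts"]}
product_forecasts = {"tea": 120.0, "coffee": 300.0, "chips": 80.0, "nuts": 45.0}
# Category forecasts produced independently by a separate (assumed) model.
category_forecasts = {"beverages": 450.0, "snacks": 125.0}


def bottom_up(product_forecasts, hierarchy):
    """Derive category forecasts by summing the forecasts of their products."""
    return {
        category: sum(product_forecasts[p] for p in products)
        for category, products in hierarchy.items()
    }


def incoherent_levels(category_forecasts, product_forecasts, hierarchy, tolerance=1e-6):
    """Return the gap per category where the two levels disagree."""
    rolled_up = bottom_up(product_forecasts, hierarchy)
    return {
        category: category_forecasts[category] - rolled_up[category]
        for category in hierarchy
        if abs(category_forecasts[category] - rolled_up[category]) > tolerance
    }


gaps = incoherent_levels(category_forecasts, product_forecasts, hierarchy)
print(gaps)
```

The same check applies to the temporal hierarchy, for instance by comparing the sum of daily forecasts with the quarterly forecast.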

最终，选择正确的模型取决于数据量，正如我们将在后续章节中讨论的构建模型的内容。

解决可扩展性问题

主要有两个因素影响可扩展性：数据量和处理复杂性。之前，我们讨论了数据量作为数据挑战。这里我们来考虑处理复杂性。复杂性可能来自于准备数据所需的数据转换的程度，以及需要管理的模型的数量、层次结构和大小：

大量和复杂的模型层次结构：在实际项目中工作时，您可能需要在相对较短的时间内并行运行数十甚至数千个模型 - 比如，如果您在商店工作并需要为商店中销售的成千上万种商品预测第二天的销售和库存水平。这种并行性的需求是使用 Apache Spark 的主要原因之一，我们将在本书中进一步了解。
模型的大小：可扩展性的另一个要求来自模型本身的大小，如果我们使用具有许多层和节点的深度学习技术，模型可能会非常庞大，并且具有高计算要求。

我们将在本书后面专门讨论扩展。

接近实时

早些时候，我们确定高频数据是一个重要的数据挑战。接近实时不仅需要数据级别的调整，还需要一个设计用于处理这种需求的处理管道。通常，模型是在一段时间内收集的数据批次上进行训练，然后部署到诸如预测或异常检测等任务中，其中实时处理变得至关重要。例如，在检测欺诈交易时，尽可能接近事件发生时识别异常是至关重要的。近乎即时数据处理的可行解决方案是 Apache Spark 结构化流，这是我们在本书后面讨论 Apache Spark 时将探讨的一个主题。

生产管理

前述考虑也适用于生产环境。此外，将开发的解决方案移入生产环境还有一些特定要求。如果管理不当，这些要求可能会带来挑战。

一旦正确的模型已经训练好并准备好使用，下一步是将其与任何必需的 API 包装器一起打包，以及数据管道和消耗模型的应用程序代码。这意味着一个涉及 DataOps、ModelOps 和 DevOps 的端到端过程。在我们讨论生产时，我们将在第九章更详细地讨论这些内容。

监控和解决漂移

一旦模型投入使用，随着时间的推移会发生变化，导致模型不再适合使用。这些变化大致分为以下几类：

数据集性质的变化（数据漂移）
输入和输出之间关系的变化（概念漂移）
意外事件，如 COVID，或在建模过程中遗漏的重要事件（突发漂移，一种概念漂移）

这些漂移将影响模型的性能，因此需要进行监控。在这种情况下的解决方案通常是根据新数据重新训练模型或找到在更新的数据集上性能更好的新模型。
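A very basic form of data drift monitoring compares recent values with a reference window. The sketch below flags a shift in the mean; the function `mean_shift_detected`, the threshold of three standard errors, and the sample readings are assumptions, and production systems typically rely on more robust statistical tests.

```python
from statistics import mean, stdev


def mean_shift_detected(reference, recent, threshold=3.0):
    """Flag drift when the recent mean is more than `threshold` standard errors
    away from the reference mean."""
    standard_error = stdev(reference) / len(recent) ** 0.5
    return abs(mean(recent) - mean(reference)) > threshold * standard_error


# Hypothetical reference window and two recent windows.
reference = [10.0, 10.2, 9.8, 10.1, 9.9, 10.0, 10.3, 9.7]
recent_stable = [10.1, 9.9, 10.0, 10.2]
recent_shifted = [11.0, 11.2, 10.9, 11.1]

print(mean_shift_detected(reference, recent_stable))
print(mean_shift_detected(reference, recent_shifted))
```

A positive result would typically trigger a closer look and, if confirmed, retraining on newer data as described above.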

本节概述了处理时间序列时的考虑因素和挑战。与处理其他数据集的通用性有很多共同之处，因此这里的指导在更广泛的背景下也将非常有用。然而，正如我们在介绍部分看到的那样，时间序列也有其特定的考虑因素。

总结

时间序列随处可见，本章介绍了它们的基本概念、组成部分以及处理中的挑战。我们从一些简单的代码开始探索时间序列，为后续章节的进一步实践奠定基础。本书的第一章讨论的概念将逐步加深，最终使我们能够扩展到大规模分析时间序列的程度。

现在您已经理解了时间序列的“是什么”，在下一章中，我们将探讨“为什么”，这将为在各个领域中应用打下基础。

进一步阅读

本节作为资源库，可帮助您进一步了解该主题：

气候混乱助长了法国 大革命：time.com/6107671/french-revolution-history-climate/
Databricks 社区版: docs.databricks.com/en/getting-started/community-edition.html
气候变化知识门户：climateknowledgeportal.worldbank.org/country/mauritius
预测: 原理与实践 由 Rob J Hyndman 和 George Athanasopoulos: otexts.com/fpp3/
机器学习系统中的隐藏技术债务 (Sculley et al., 2015): papers.neurips.cc/paper/5656-hidden-technical-debt-in-machine-learning-systems.pdf

第二章：为什么需要时间序列分析？

本章深入探讨了分析时间序列数据的实际意义。它阐明了时间序列分析如何支持预测建模、趋势识别和异常检测。通过展示各行业的实际应用，本章强调了时间洞察在决策中的关键作用。掌握时间序列分析的重要性对专业人士至关重要，因为它凸显了对预测准确性、资源优化和战略规划的影响，促进了对面向时间的数据分析的全面理解。

本章将涵盖以下内容：

时间序列分析的需求
行业特定的应用案例
选择应用案例的动手实践

技术要求

在第一章之后，我们将在这里进一步提高代码的难度，目标是展示时间序列在选定应用案例中的使用。本章的代码可以在本书 GitHub 仓库的 ch2 文件夹中找到： https://github.com/PacktPublishing/Time-Series-Analysis-with-Spark/tree/main/ch2。

请参考这个 GitHub 仓库，获取代码的最新修订版本，若更新内容与本书中代码部分不同，更新将会在仓库中注释说明。

本章的动手实践部分将进一步详细介绍。

理解时间序列分析的需求

正如我们在上一章讨论的，时间序列在生活的各个方面以及所有行业中都存在。因此，分析时间序列的需求无处不在。本章将探讨不同行业的不同应用案例。在此之前，我们将在本节中研究其基本方法。这些方法大致可以分为预测、模式检测与分类，以及异常检测。图 2.1展示了本章将讨论的几个关键时间序列分析概念。

图 2.1：时间序列分析中的概念

现在，让我们更详细地讨论每一个部分。

预测

时间序列预测是基于先前观察到的数值来预测未来的值。这是通过建模时间序列数据中的基本模式——例如趋势、季节性和周期——来预测未来的数据点。例如，在我们在第一章中可视化的温度时间序列的例子中，我们可以使用预测模型根据从前几个月学习到的模式来预测下个月的温度。预测是时间序列分析中最常见的方法，本书将重点讨论这一方法。这可以是单步预测、多步预测、单变量预测或多变量预测。

单步预测

在单步预测中，我们根据对历史数据点的分析以及由此构建的模型，预测时间序列中的下一次发生事件。预测步骤的粒度通常与我们学习历史模式的数据集中的粒度相同。例如，如果我们的历史时间序列中包含的是每日温度，那么下一步将是第二天。如果我们将数据点汇总为例如月度平均值，并且对月度变化模式进行了建模，那么下一步将是下个月的平均温度。

虽然单步预测通常是我们能获得的最可靠的预测，但不幸的是，它对于许多需求来说是不够的，因为在现实生活中，我们往往需要比仅仅预测一个（时间）步骤更长远的计划。如果我们在做每日预测，我们不仅仅想预测明天的天气。我们希望能够预测未来几天、几周甚至几个月的情况。单步预测对于规划来说显然不足够。

多步预测

在多步预测中，我们使用从历史数据点构建的模型，预测时间序列中的多个下一步。我们还将预测的前一步作为输入。以我们的每日温度为例，这可能意味着对接下来一周的每天进行逐日预测。

挑战

多步预测的挑战在于，后续的预测是基于先前的预测，这与单步预测不同，单步预测是基于实际数据点的。实际上，这意味着递归地逐步应用预测算法，每一步都将预测结果添加到数据集中，并使用历史和预测数据点来预测下一步。因此，预测中的不准确性随着每一步向未来推进而逐渐累积。
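The recursive procedure and its error accumulation can be sketched with a deliberately flawed one-step model. The function `recursive_forecast`, the 1% bias, and the flat ground truth are assumptions chosen to make the effect visible.

```python
def recursive_forecast(history, steps, predict_next):
    """Forecast `steps` values by feeding each prediction back in as input."""
    window = list(history)
    forecasts = []
    for _ in range(steps):
        next_value = predict_next(window)
        forecasts.append(next_value)
        # Later steps now depend on predicted, not observed, data.
        window.append(next_value)
    return forecasts


def slightly_biased_model(window):
    """Hypothetical one-step model that overestimates the next value by 1%."""
    return window[-1] * 1.01


history = [100.0, 100.0, 100.0]
true_future = [100.0] * 5  # assumed flat ground truth, for illustration only
forecasts = recursive_forecast(history, steps=5, predict_next=slightly_biased_model)
errors = [abs(f - a) for f, a in zip(forecasts, true_future)]
print([round(e, 2) for e in errors])
```

Even a small one-step bias compounds here, so the error grows with every additional step into the future.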

注意事项

这种多步预测中的预测误差累积是你希望与业务或任何为你构建预测的人提前沟通的限制。你需要确保在长期预测方面设定清晰的预期。

解决方案

除了将预测限制在非常短的时间范围内外，还有几种方法可以应对多步预测的挑战：

首先，构建一个尽可能准确的模型，使得初步预测的结果接近现实。
另一种方法是使用模型组合，旨在平均化预测误差。
最后，限制预测间隔或步数，并在有新的测量数据时重新计算预测。

单变量预测

到目前为止，在第一章中，我们只考虑了单一变量（单变量）的时间序列，也就是特定位置的温度。另一个单变量时间序列的例子，在经济领域，是特定地区或国家的失业率。根据定义，单个时间序列是单变量的，无论是温度还是失业率。在现实世界中，需求往往是同时预测多个时间序列，这样我们能比单一时间序列提供的视角获得更全面的未来预测。以温度为例，这可能意味着需要考虑多个地点，或者额外预测空气中的污染物水平。对于经济预测，这可能意味着需要预测国内生产总值（GDP）以及失业率。这引出了多变量预测。

多变量预测

单变量是指单一的时间序列，而多变量预测有几种方式可以进行描述：

多重输入维度是指我们将多个变量（包括时间序列和非时间序列）作为输入提供给预测模型——例如，使用过去的温度和污染物水平来预测未来的温度。
多重输出预测使用预测模型来预测多个变量。以之前的温度示例为例，这意味着要同时预测温度和污染物水平。

正如我们从前面的例子中看到的，可能会有多个时间序列的不同场景。这些序列可以是相关的，如果它们有相同的潜在原因（这个原因本身也可以通过另一个时间序列来表示），它们就可能有共动关系。它们也可以有因果依赖性，即一个时间序列与另一个时间序列之间存在因果关系。它们还可能是独立的，我们只是同时进行预测。在第六章的探索性数据分析中，我们将讨论一些这些考虑因素。

总结来说，预测很少是单独进行的，我们需要同时分析的时间序列数量可能达到数百甚至数千。这个对多个时间序列的需求是使用并行执行工具（如 Apache Spark）的一个很好的理由，正如我们将在后续章节中看到的那样。

既然我们已经讨论了预测，现在让我们看看另一种分析方法，这将使我们能够对时间序列进行分类。

模式检测与分类

模式检测与分类是基于某些模式识别和分类时间序列。通常，时间序列遵循某种模式，我们可以识别并标记这些模式。这些标签允许我们通过将标记的模式与新的时间序列发生情况进行匹配，从而对时间序列进行分类。我们可以采取不同的方法来实现这一点，广义上可分为基于距离、基于区间、基于频率、基于字典、形状模式、集成方法和深度学习。接下来将详细介绍这些方法。

基于距离

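Distance-based time series classification, explained here using k-nearest neighbors (kNN) with dynamic time warping (DTW), is an established way of analyzing time series data. Because time series can be shifted and distorted in time, the standard Euclidean distance is not a good measure of similarity. DTW offers an alternative by aligning the sequences in time. It computes the minimum distance between two time series over all possible alignments, which makes it computationally expensive. kNN is then used to classify time series by the similarity of their shapes.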
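For intuition, here is a minimal sketch of the classic dynamic-programming formulation of DTW with an absolute-difference cost. It is written from the textbook definition and is not the implementation used by the fastdtw or dtw-python libraries; the function name and sample series are assumptions.

```python
def dtw_distance(a, b):
    """Classic DTW in O(len(a) * len(b)) time with absolute-difference cost."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = minimal accumulated cost of aligning a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            cost[i][j] = step + min(
                cost[i - 1][j],      # a advances, b stays
                cost[i][j - 1],      # b advances, a stays
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]


peak = [0, 1, 2, 1, 0]
shifted_peak = [0, 0, 1, 2, 1, 0]  # same shape, delayed by one step

# Point-by-point comparison penalizes the time shift...
pointwise = sum(abs(x - y) for x, y in zip(peak, shifted_peak))
print("pointwise:", pointwise)
# ...while DTW aligns the two shapes before measuring the distance.
print("dtw:", dtw_distance(peak, shifted_peak))
```

The nested loops also show why DTW is expensive: the work grows with the product of the two series lengths.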

以下图表展示了使用动态时间规整（DTW）计算谷歌股票（下方，黑线）与亚马逊股票，以及谷歌股票与 Meta 股票之间的距离。我们将在本章的动手操作部分执行这个示例。

图 2.2：DTW 距离（GOOG – 黑线）

基于区间

使用这种方法，时间序列首先被划分为多个区间，并计算每个区间的描述性统计量。这些区间随后与分类器如随机森林或支持向量机一起作为特征向量使用。该方法的优势在于它能够捕捉时间序列在不同阶段的特性，这对于那些在时间上具有不均匀模式的时间序列非常有效。将时间序列汇总成带有统计特征的区间也是降低复杂性并提高可解释性的好方法。
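The interval-based feature extraction step can be sketched as follows. The choice of mean, standard deviation, and slope as the per-interval statistics, as well as the function name, are assumptions; the resulting vector would then be passed to a classifier of your choice.

```python
from statistics import mean, pstdev


def interval_features(values, n_intervals):
    """Split a series into equal intervals and summarize each one.

    Assumes every interval holds at least two points; trailing points that do
    not fill a complete interval are ignored.
    """
    size = len(values) // n_intervals
    features = []
    for k in range(n_intervals):
        chunk = values[k * size:(k + 1) * size]
        slope = (chunk[-1] - chunk[0]) / (len(chunk) - 1)
        features.extend([mean(chunk), pstdev(chunk), slope])
    return features


# Hypothetical series: a rising phase followed by a flat phase.
series = [1, 2, 3, 4, 10, 10, 10, 10]
features = interval_features(series, n_intervals=2)
print(features)
```

Each interval contributes three numbers, so a long series is reduced to a short feature vector that remains easy to interpret.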

基于频率

这种分类方法，例如随机区间谱集成（RISE），首先通过例如傅里叶变换将时间序列转换到频域。RISE 是一种基于分类器的集成方法，分类器如决策树，是通过随机区间和为这些区间提取的谱特征构建的。该方法的优势在于能够识别与频率相关或周期性特征，并且作为一种集成方法，它在提供准确性的同时具有鲁棒性。
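The frequency-domain idea behind such methods can be illustrated with a naive discrete Fourier transform that finds the dominant period of a series. This is not RISE itself; the function `dominant_period` and the synthetic signal are assumptions, and in practice a fast Fourier transform from a numerical library would replace the quadratic loop.

```python
import cmath
import math


def dominant_period(values):
    """Return the period (in samples) of the strongest frequency component.

    Uses a naive O(n^2) DFT and assumes the series is not constant.
    """
    n = len(values)
    mean = sum(values) / n
    centered = [v - mean for v in values]
    best_k, best_power = 0, 0.0
    for k in range(1, n // 2 + 1):
        coeff = sum(
            c * cmath.exp(-2j * math.pi * k * t / n)
            for t, c in enumerate(centered)
        )
        if abs(coeff) > best_power:
            best_k, best_power = k, abs(coeff)
    return n / best_k


# Hypothetical signal: a strong 12-sample cycle plus a weaker 4-sample cycle.
signal = [
    3 * math.sin(2 * math.pi * t / 12) + 0.5 * math.sin(2 * math.pi * t / 4)
    for t in range(240)
]
print(dominant_period(signal))
```

Spectral features like this one capture periodic behaviour that is hard to see from raw values in the time domain.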

基于字典

基于字典的时间序列分类方法将时间序列转换为符号表示，从而可以使用基于文本的技术进行分类。遵循这一方法的突出方法有：

模式包（BoP）：BoP 通过应用滑动窗口捕获局部模式，从而创建一个“模式包”。然后将其哈希为频率直方图，用作分类的特征向量。
SFA 符号包（BOSS）：BOSS 是一种更能抵御噪声并有效捕获关键模式的变体。它使用符号傅里叶近似（SFA）在不同分辨率下捕获时间序列。
随机卷积核变换（ROCKET）：ROCKET 为大数据集提供了更高的速度和效率。它生成并使用随机卷积核将时间序列转换为特征向量。

Shapelets

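Shapelets are subsequences of a time series that are representative of a particular class. By finding the shapes that match within the time series, or within other related time series, the series can be classified accordingly. This is useful when the features that define a class are localized in time, for example a sudden increase in transaction amounts that may indicate credit card theft. Shapelets also help with interpretability: once a shape is well understood, it can be used to explain the points in the time series where it matches.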

集成

迄今为止提到的集成分类方法将相似类型的分类器分组。另一种方法，如基于层次化投票的转换集成（HIVE-COTE），则采用不同类型的分类器。其思想是基于不同分类器捕获的时间序列的不同方面构建集成。这些分类器是独立训练的，其预测结果根据层次化投票方法进行组合。与其他集成方法一样，这可以提高鲁棒性和准确性。此外，由于使用了各种技术，HIVE-COTE 是复杂多模式时间序列的一个良好选择。然而，这也带来了较高的计算成本。

深度学习

诸如TimeNet的方法利用深度神经网络自动提取时间序列中的特征、模式和关系。TimeNet 是预训练的，这使得它能够快速应用。它结合了卷积神经网络（CNNs）用于局部特征和循环神经网络（RNNs）用于顺序模式。这使得 TimeNet 能够有效地捕获低级和高级模式，从而学习层次化表示。其优点在于适应各种时间序列分类问题，同时减少了手动特征工程的需求。与其他深度学习方法类似，缺点在于需要大量数据进行预训练、需要的计算资源较多，以及缺乏可解释性。尽管如此，在多种复杂情况下，它们仍然是性能最先进的方法之一。

虽然本书不会像预测那样详细讨论时间序列的分类，但这仍是一个值得进一步研究的有前景领域，并且面临一系列操作性挑战。

这带我们进入最后一种分析类型，即从时间序列数据中检测异常。

异常检测

时间序列分析的第三类使用场景是异常检测，旨在标记意外的模式或事件。虽然这与模式识别和预测有关，但其目的是不同的：识别源系统行为的意外偏差。异常检测在多个领域至关重要，如金融、医疗保健和工业系统。这些异常可能表明严重事件的发生，例如系统故障、金融欺诈或网络入侵。

除了单变量或多变量外，异常还可以表现为以下几种形式：

点：这是指一个单一的数据点被识别为异常的情况。
集合：当多个数据点作为一组近似的测量值被标记为异常时，就属于此类情况。
上下文：某个数据点或集合在周围测量值的上下文中可能是异常的，而在另一个上下文中，同一个数据点或集合可能不是问题。

异常也可以分为离群值和新颖性，离群值可能表示错误或故障，而新颖性则是之前未见过的模式，可能不是问题。

注意

为了使异常检测有效，数据准备阶段必须保留数据集中的异常值，这与通常在数据整理过程中所做的相反。

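Beyond traditional statistical and rule-based approaches, there are newer machine learning techniques. Anomaly detection methods for time series data can be grouped into unsupervised, supervised, and semi-supervised methods, each with its own techniques and algorithms. Typically, an anomaly score is computed and compared against a threshold to flag anomalies.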

无监督异常检测

无监督异常检测不需要标记数据。这假设异常与正常数据的差异足够大，可以在没有先验知识的情况下检测到。常见的方法包括：

统计方法，如z-score和箱线图分析，用于根据统计特性识别异常值。
基于聚类的方法，如DBSCAN或k-means，将相似的数据点聚集在一起，而异常点则是那些不属于任何聚类的点。
基于密度的方法，如 kNN 和局部异常因子（LOF），利用局部邻域的密度来识别异常。
隔离森林是一种基于树的模型，适用于高维数据，能够有效地将异常值隔离出来。
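The simplest of the methods listed above, the z-score, can be sketched in a few lines. The threshold of 3 and the sample readings are assumptions, and the function name is chosen for this example.

```python
from statistics import mean, pstdev


def zscore_anomalies(values, threshold=3.0):
    """Return the indices of points more than `threshold` standard deviations
    away from the mean."""
    mu, sigma = mean(values), pstdev(values)
    return [i for i, v in enumerate(values) if abs(v - mu) > threshold * sigma]


# Hypothetical readings with one obvious spike at index 10.
readings = [
    1.0, 1.1, 0.9, 1.2, 1.0, 0.8, 1.1, 1.0, 0.9, 1.0,
    9.0,
    1.1, 1.0, 0.9, 1.2, 1.0, 1.1, 0.9, 1.0, 1.0,
]
print(zscore_anomalies(readings))
```

Note that the spike itself inflates the mean and standard deviation, which is one reason why more robust methods such as Isolation Forest are often preferred.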

图 2.3中的图表显示了隔离森林模型的结果，用于检测家庭能源消费中的异常。

图 2.3：能耗异常检测

该模型是根据到 11 月 9 日为止的消费数据拟合的，然后用于之前未见过的数据。异常数据以红色/较浅的颜色显示。我们将在本章的实践部分运行此示例。

有监督异常检测

有监督异常检测适用于包含正常和异常情况的标注数据集。虽然能更好地检测异常，但它需要标注数据，这些数据可能比较难获得。相关技术包括以下几种：

分类模型，如传统的分类器（如逻辑回归）、支持向量机（SVM）或更复杂的模型，如 CNN 和 RNN，这些模型经过训练用来区分正常与异常实例。
集成方法，如随机森林或梯度提升，可以通过组合多个模型来提高检测精度。

半监督异常检测

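Semi-supervised anomaly detection requires a small amount of labeled data combined with a large amount of unlabeled data. For example, in industrial monitoring, where only a limited number of sensor readings are labeled, most measurements correspond to normal operation of the equipment and can be labeled as normal. New readings that fall outside what the normal label covers can then be flagged as anomalies.

Semi-supervised techniques are useful when labeling a large dataset is expensive. They include the following: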

修改无监督技术以包含有限的可用标签——例如，修改基于密度的或聚类方法，以增强对标注异常的敏感性。
新颖性检测通过在正常数据上训练模型以找到其分布，类似于无监督统计方法，然后偏离该分布的值会被标记出来。一类 SVM 和自编码器就是这种技术的示例。

高级深度学习方法

使用深度学习技术的方法包括以下几种：

自编码器是神经网络，通过压缩然后重建输入数据。其原理是，这些模型能很好地重建正常数据，而对异常数据的重建误差较高。
序列类型的模型，如长短期记忆（LSTM）、RNN和Transformers，能够识别时间序列数据中的时间依赖性，因此在序列异常检测中非常有用。

注

从操作角度来看，异常检测系统是全面监控和告警架构的一部分。需要注意的是，在低延迟或实时检测和告警要求的情况下，卡尔曼滤波器常常被使用。

这总结了各种异常检测方法。选择异常检测方法取决于时间序列和异常的特征、可用的标记数据、计算资源以及实时检测的要求。混合方法和高级方法，特别是基于深度学习的方法，由于能够建模时间序列数据中的复杂模式和依赖关系，已在各种应用中取得了令人鼓舞的结果。

从时间序列分析的整体概述出发，我们现在来看看它们在各个行业中的使用及其影响。

行业特定的应用案例

我们在前一部分讨论了不同类型的时间序列分析。接下来我们将探讨它们在不同行业中的适用性。在此之前，图 2.4中的图表能帮助你了解跨行业应用的多样性。

图 2.4：时间序列分析在各行业中的应用

让我们详细了解时间序列分析在各个行业中的应用。

金融服务

金融服务中的时间序列分析对于理解趋势、模式和未来行为至关重要。其应用广泛，能为决策、战略规划、风险管理和合规提供宝贵的见解。以下是时间序列分析在金融服务各个职能中的应用：

市场分析：通过分析资产的历史价格，包括趋势和季节性，预测未来资产价格。这有助于交易员和投资者决定交易哪些资产以及何时交易。
风险管理：除了前述内容，金融工具价格的另一个重要方面是它们的波动性，这需要进行分析，以更好地管理风险并制定减缓策略。这包括风险价值（VaR）建模，通过基于历史波动性和相关性来估算某个投资在一段时间内的潜在损失。风险管理的另一个领域是信用风险，包括偿还历史、违约和经济状况的时间序列数据。这有助于评估未来违约和损失的可能性。为此，预备金估算确保足够的资金被预留以应对潜在的贷款损失，而流动性管理确保保持足够的流动性。最后，在宏观经济层面，压力测试涉及分析历史上的最坏情况。
投资组合管理：这里的两个主要方面是优化资产配置和相关的投资组合表现评估。通过分析历史回报和资产之间的相关性，投资组合经理可以确定资产配置，以满足预期的风险回报特征。回顾其随时间的表现后，投资组合可以根据需要进行调整。
算法交易：从本质上讲，这涉及利用微秒或毫秒级的时间序列数据来做出高频交易决策。完整的周期包括开发策略、进行回测，然后一旦策略投入实际使用，就生成正确的交易信号。
欺诈检测：这里的核心思想是分析交易，识别并标记可能表明欺诈活动的模式，包括市场操纵或内幕交易。
经济预测：这用于预测例如利率和其他对中央银行、政府和金融机构政策制定有影响的经济指标。

从本质上讲，金融服务中的时间序列分析是支撑广泛活动的基础，从交易决策到投资组合管理和合规监管。它利用历史数据预测未来事件、管理风险并揭示有价值的洞察力。因此，时间序列分析推动了金融生态系统中的明智决策。

零售

零售行业中的时间序列分析使用按时间顺序排列的数据来进行决策、优化运营和提升客户体验。零售商可以洞察影响其业务的趋势、季节性变化和周期性行为。以下是一些关键的应用场景：

销售预测和收入预测：根据历史数据预测未来的销售情况，考虑季节性变化、趋势和外部因素（如假期和经济条件），有助于规划库存、员工和营销活动。这对于财务规划和投资决策至关重要。
库存管理和供应链优化：零售商可以通过分析购买模式和交货时间来优化库存水平。更好的需求规划和补货调度可以最大限度地减少缺货现象并减少过剩库存。在每个产品层级进行预测，有助于高效的库存补充。这也对供应链管理产生积极影响。零售行业中一个正在增加应用的相关案例是食品浪费预测和减少。
价格优化与营销规划：通过分析价格变动对销量的影响，可以确定最佳定价策略，从而最大化销售和利润。这包括对季节性价格敏感性、促销和营销活动的影响以及竞争定价的洞察。同时，这也优化了营销支出，更好地与季节性模式对接。
客户行为分析与产品生命周期管理：零售商可以通过了解客户购买习惯随时间的变化，来指导营销策略和产品开发。此分析有助于识别趋势，如购买渠道的变化或对某些产品类别的兴趣增加。这反过来有助于在产品推出、停产或重新上市方面做出更好的决策。通过更深入的客户行为分析，还可以制定有效的忠诚度项目，从而提高客户留存率。
门店表现分析与员工规划：通过比较不同地点的销售趋势，分析可以识别高表现的门店和需要帮助的门店。这有助于做出有关门店扩张或关闭的决策，并将人员配置与合适的门店及繁忙时段对接。其影响不仅体现在运营效率上，还能改善客户服务。

时间序列分析在零售中的应用影响着业务的方方面面，从库存和定价到营销和员工管理。通过利用历史数据，零售商可以做出明智的决策，从而提升运营效率、客户满意度和盈利能力。

医疗保健

时间序列分析在医疗保健中是跟踪健康相关数据的重要工具。此方法可以观察健康指标的模式、趋势和变化，这对改善患者护理、运营效率和临床结果至关重要。以下是医疗保健中的应用概述：

患者监测：在持续监测生命体征（如心率、血压等）时，可以使用时间序列分析对患者健康进行实时评估，并及早发现急性医疗事件。同样，通过分析可穿戴设备的数据，可以监测身体活动、睡眠模式及其他健康指标随时间的变化。这可以在临床环境中进行，也可以用于个人健康意识。
流行病学与疾病监测：这一需求的重要性在 COVID-19 期间得到了突出体现。必须随着时间推移追踪传染病的传播，以了解传播模式、识别疫情爆发，并相应规划公共干预措施。在个人层面，患有慢性病的患者可能需要根据病情的发展调整治疗计划。虽然相似，但这与监测生命体征在干预时间尺度上有所不同。
医院资源管理：由于大多数公立医院的资源紧张，能够预测医院的入院率以优化床位分配和人员安排是一个巨大的优势。

时间序列分析在医疗行业有众多其他应用，如医疗质量监测、药物研发和医学研究，尤其是在新冠疫情期间，公共卫生监测、分析和政策制定。

通过使用时间序列分析，医疗行业可以提升患者护理质量、提高运营效率，并推动医学研究，最终改善健康结果并推动更为科学的医疗政策。

制造业和公用事业

时间序列分析在制造业和公用事业领域至关重要，能够确保安全、优化运营并提高效率。以下是这些行业中时间序列分析的应用概述：

制造业：首先，在规划方面，通过需求预测、生产调度和库存管理，时间序列帮助满足市场需求而不造成过度生产。然后，为了保持生产运行，机器传感器将数据发送到预测性维护模型，以生成潜在故障的警告，从而进行及时维护。这可以显著减少停机时间，同时节省不必要的预防性维护。异常检测进一步帮助提前发现并限制质量问题。
石油和天然气：与制造业类似，预测性维护和异常检测确保减少停机时间，同时最大化产出。此外，由于该领域基础设施需要大量前期投资，因此准确的需求和价格预测对规划至关重要。
公用事业：公用事业领域的主要应用包括需求和负荷预测，进而帮助规划、配电管理和发展。这进一步促使了电网的最优利用，提高了客户服务，同时防止了停电。最后，时间序列分析和对新能源来源的预测确保它们能够与整体能源结构进行最优整合。

在所有这些领域，时间序列分析有助于资源优化、成本降低和战略规划，最终促进更具韧性和高效的运营。

注

一个主要且快速增长的时间序列数据来源是物联网（IoT）设备和传感器。这是因为连接设备数量的激增。此数据被收集、存储并分析——通常是实时的——应用场景遍及各行各业，其中一些已经讨论过，例如用于预测性维护的机器传感器数据、用于预测消费的能源表数据、健康追踪器等，涵盖范围广泛。

本节总结了行业特定用例，展示了时间序列分析在不同领域中的广泛应用。随着新分析方法和新业务需求的推动，应用案例不断创新并扩展。接下来，我们将通过行业特定数据集，展示一些时间序列分析方法的实际应用。

使用选定的用例进行动手实践

在这一动手实践部分，我们将使用行业特定的数据集，逐步演示一些选定的用例。

预测

对于预测用例，我们从 第一章 中关于温度的示例开始，加载数据集，分析其组成并可视化结果。重点放在过去——即历史数据。在接下来的步骤中，我们将重点展示与未来相关的代码部分——即预测。这基于 ts-spark_ch1_2fp 中的代码，我们已将其导入到 Databricks 社区版，并在 第一章 中详细介绍。

预测步骤如下：

加载数据集，内容详见 第一章。

使用 Prophet 库，模型在数据上创建并训练（拟合）：

model = Prophet(
    n_changepoints=20, 
    yearly_seasonality=True,
    changepoint_prior_scale=0.001)
model.fit(df2_pd)

Next, we use make_future_dataframe, a handy Prophet function, to generate future dates. These dates are then passed as the argument to the predict function to produce the forecast:
```
future_dates = model.make_future_dataframe(
    periods=365, freq='D')
forecast = model.predict(future_dates)
```
plot_plotly generates Figure 2.5. The rightmost part of the graph has no collected data points, as it covers the forecasted dates:
```
plot_plotly(model, forecast, changepoints=True)
```
图 2.5：温度预测

这是对预测的简要动手实践介绍。接下来我们将在本书的其他部分进行更多的预测，包括使用除 Prophet 之外的其他库。

模式分类

对于模式分类，我们将使用金融时间序列——更具体地说，科技公司股票的股价。我们将探讨使用两个不同的开源库进行 DTW，分别是 fastdtw 和 dtw-python。这基于 ts-spark_ch2_1.dbc 中的代码，我们可以从 GitHub 上的链接导入，并按照 第一章 中的说明，将其导入到 Databricks 社区版中。

代码 URL 如下所示：

github.com/PacktPublishing/Time-Series-Analysis-with-Spark/raw/main/ch2/ts-spark_ch2_1.dbc

让我们从 fastdtw 开始，以下是步骤代码：

首先，我们导入必要的库：

```
import yfinance as yf
import numpy as np
from fastdtw import fastdtw
```

We then use the yfinance library to download the share prices of several technology companies from Yahoo Finance for a date range:

```
from_date = "2019-01-01"
to_date = "2024-01-01"
yftickers = [
    "AAPL", "AMZN", "GOOG", "META",
    "MSFT", "NVDA", "PYPL", "TSLA"]
yfdata = {
    yftick: yf.download(
        yftick, start=from_date, end=to_date, multi_level_index=False)[
```

Next, we use the fastdtw library to compute the DTW distance between each pair of stocks:

for i in range(num_tickers):
    for j in range(num_tickers):
        dtwdistance, _ = fastdtw(X[i], X[j])
        dtwmatrix[i, j] = float(dtwdistance)

然后我们使用以下代码绘制距离矩阵的热图：
```
fig = px.imshow(
    dtwmatrix,
    labels=dict(x="Tickers", y="", color="DTW distance"),
    x=yftickers,
    y=yftickers
)
fig.update_xaxes(side="top")
fig.show()
```
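This produces the visualization in Figure 2.6, which highlights the DTW distance between the AMZN and GOOG share prices. Among the stocks analyzed, these two are the closest to each other by DTW distance.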

图 2.6：DTW 距离热图

The lines showing the DTW distance measurement between the AMZN and GOOG time series are visualized in Figure 2.2.

所有股票代码的时间序列图可以通过以下代码简单地生成：
```
fig = px.line(yfdata, y=yftickers)
fig.show()
```
This produces Figure 2.7:

图 2.7：选定的技术股价格

We use the following code with the dtw-python library to generate the alignment plots in Figure 2.2:

from dtw import *
alignment = dtw(
    yfdata['GOOG'], yfdata['AMZN'], 
    keep_internals=True,
    step_pattern=rabinerJuangStepPattern(6, "c"))
alignment.plot(
    type="twoway", offset=-2, 
    xlab="time_index", ylab="GOOG / AMZN")
alignment = dtw(
    yfdata['GOOG'], yfdata['META'], 
    keep_internals=True,
    step_pattern=rabinerJuangStepPattern(6, "c"))
alignment.plot(
    type="twoway", offset=-2, 
    xlab="time_index", ylab="GOOG / META")

这结束了模式分类的简要实践介绍——更具体地说，是基于 DTW 的距离计算方法的初步步骤，应用于金融时间序列。在这一步之后，你可以继续应用 kNN 分类算法。
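As a hint of what that next step could look like, the sketch below classifies a new series by majority vote among its k nearest labeled neighbors under DTW distance. It uses a plain-Python DTW rather than fastdtw or dtw-python, and the labels, training series, and function names are hypothetical.

```python
from collections import Counter


def dtw_distance(a, b):
    """Classic dynamic-programming DTW with absolute-difference cost."""
    inf = float("inf")
    cost = [[inf] * (len(b) + 1) for _ in range(len(a) + 1)]
    cost[0][0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            step = abs(a[i - 1] - b[j - 1])
            cost[i][j] = step + min(
                cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[len(a)][len(b)]


def knn_classify(query, labelled_series, k=3):
    """Label `query` by majority vote among its k nearest series under DTW."""
    nearest = sorted(
        labelled_series, key=lambda item: dtw_distance(query, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical labeled examples of two shapes.
training = [
    ([0, 1, 3, 1, 0], "peak"),
    ([0, 0, 1, 3, 1, 0], "peak"),
    ([0, 1, 2, 3, 1, 0], "peak"),
    ([0, 1, 2, 3, 4], "ramp"),
    ([0, 0, 1, 2, 3, 4], "ramp"),
    ([0, 1, 1, 2, 3, 4], "ramp"),
]
query = [0, 0, 0, 1, 3, 1, 0]  # a delayed peak
print(knn_classify(query, training, k=3))
```

With share prices, the series would usually be normalized first so that the distance reflects shape rather than absolute price level.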

异常检测

在本章的最后一个实践示例中，我们将探讨应用于家庭能耗的异常检测用例。这基于ts-spark_ch2_2.dbc中的代码，以及ts-spark_ch2_ds2.csv中的数据集。我们将按照第一章中解释的方法，将代码导入到 Databricks 社区版中。

代码网址如下：github.com/PacktPublishing/Time-Series-Analysis-with-Spark/raw/main/ch2/ts-spark_ch2_2.dbc

代码步骤如下：

首先，我们导入必要的库：
```
from pyspark import SparkFiles
from sklearn.ensemble import IsolationForest
import plotly.express as px
```
正如本章早些时候所讨论的，Isolation Forest 是一种基于树的模型，可以用来隔离异常值。

使用 Spark 读取数据集：

df = spark.read.csv(
    "file:///" + SparkFiles.get(DATASET_FILE),
    header=True, sep=";", inferSchema=True
)

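Note that spark.read.csv is a different but equivalent syntax to the spark.read.format("csv") followed by load() that we used in Chapter 1.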

为了进行列值的计算，我们需要将列的数据类型从字符串更改为双精度：

df = df.dropna() \
    .withColumn(
        'Global_active_power',
        df.Global_active_power.cast('double')) \
    .withColumn(
        'Global_reactive_power', 
        df.Global_reactive_power.cast('double')) \
    .withColumn(
        'Voltage', df.Voltage.cast('double')) \
    .withColumn(
        'Global_intensity', 
        df.Global_intensity.cast('double')) \

然后，我们选择数据集的第一部分用于训练模型：
```
df_train = df_pd.iloc[:35000,:]
```

We can then create the Isolation Forest model and fit it to the training dataset:

```
isoforest_model = IsolationForest(
    n_estimators=100,
    max_samples='auto',
    contamination=float(0.0025),
    random_state=123)
```

The notebook then fits the model to the training data. Note the contamination level, which specifies the expected proportion of outliers in the dataset.

The model can then be used to flag anomalies in the full dataset. The predict method returns -1 for anomalies and 1 for normal observations:
```
df_pd['anomaly_'] = isoforest_model.predict(feature_col)
```

Finally, to display the results shown in Figure 2.3:

fig = px.scatter(
    df_pd, x='Date', y=feature_name,
    color='anomaly_',
    color_continuous_scale=px.colors.sequential.Bluered_r)
fig.update_traces(marker=dict(size=3))
fig.add_vrect(x0=df_train_lastdate, x1=df_lastdate)
fig.show()

这完成了使用能耗时间序列进行异常检测的实践介绍。正如本章所讨论的，Isolation Forest 方法在此处使用，但这只是众多可用方法之一。

总结

在本章中，我们重点讨论了分析时间序列数据在预测建模、趋势识别和异常检测中的实际意义。我们回顾了各行各业的实际应用，突出了时间序列分析的重要性，并且实践了两个不同领域的数据集。

在我们能够扩展这些及其他用例之前，我们需要一个额外的关键组件——Apache Spark，您将在下一章中了解它。

进一步阅读

本节作为一个资料库，汇集了可以帮助您加深对该主题理解的资源：

《时间序列分析 - 数据、方法与应用》，由 Chun-Kit Ngan 编著：www.intechopen.com/books/8362
金融服务：
- 《金融应用中的时间序列基础》（Massimo Guidolin 和 Manuela Pedio 著）：www.sciencedirect.com/book/9780128134092/essentials-of-time-series-for-financial-applications
- 《银行变量的时间序列预测技术》（Arindam Bandyopadhyay 著）：academic.oup.com/book/43110/chapter-abstract/361614151?redirectedFrom=fulltext&login=false
零售业：
- 《基于时间序列分析的零售店盈利预测模型》（Sridevi U. K. 和 Shanthi P 著）：www.researchgate.net/publication/325882164_A_profit_prediction_model_with_time_series_analysis_for_retail_store
- 《零售销售预测的比较研究》（Hasan 等，2022）：arxiv.org/pdf/2203.06848.pdf
医疗保健：
- 《医疗领域的人工智能：使用统计、神经网络和集成架构进行时间序列预测》（Kaushik 等，2020）：www.frontiersin.org/articles/10.3389/fdata.2020.00004/full
- 《聚焦心血管疾病的医疗诊断和预后时间序列预测》（Bui 等，2018）：www.researchgate.net/publication/320002542_Time_Series_Forecasting_for_Healthcare_Diagnosis_and_Prognostics_with_the_Focus_on_Cardiovascular_Diseases
制造业 与公用事业：
- 《工业 4.0 中的时间序列预测：全面综述及未来发展展望》（Kashpruk 等，2023）：www.mdpi.com/2076-3417/13/22/12374
- 《智能制造系统中的时间序列模式识别：文献综述与本体分析》（Farahani 等，2023）：www.sciencedirect.com/science/article/pii/S0278612523000997
- 从智能电表数据测量家庭活动的能量强度（Stankovic 等人，2016 年）：www.sciencedirect.com/science/article/pii/S0306261916313897
库：
- FastDTW: cs.fit.edu/~pkc/papers/tdm04.pdf
- dtw-python: dynamictimewarping.github.io/python/


第三章：Apache Spark 简介

本章概述了 Apache Spark，解释了它的分布式计算能力以及处理大规模时间序列数据的适用性。它还解释了 Spark 如何解决并行处理、可扩展性和容错性的问题。这些基础知识非常重要，因为它为利用 Spark 在处理庞大的时间数据集时的优势奠定了基础，从而促进了高效的时间序列分析。了解 Spark 的作用，可以增强从业人员利用其强大计算能力的能力，使其成为可扩展、高性能时间序列应用的宝贵资源。

我们将覆盖以下主要内容：

Apache Spark 及其架构
Apache Spark 是如何工作的
Apache Spark 的安装

技术要求

本章的动手部分将着重于部署一个多节点的 Apache Spark 集群，以帮助熟悉部署过程中的重要组件。本章的代码可以在本书 GitHub 仓库的ch3文件夹中找到，网址为：https://github.com/PacktPublishing/Time-Series-Analysis-with-Spark/tree/main/ch3。

本章的动手实践部分将进一步详细介绍这一过程。此过程需要一些搭建开源环境的技能。如果你不打算搭建自己的 Apache Spark 环境，而是专注于时间序列并使用 Spark（而不是部署它），你可以跳过本章的动手部分。你可以使用像 Databricks 这样的托管平台，它预先构建了 Spark，我们将在未来的章节中使用该平台。

什么是 Apache Spark？

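Apache Spark is an open source distributed computing system with programming interfaces and clusters for large-scale parallel data processing, with built-in fault tolerance. Spark started in 2009 as a project at UC Berkeley's AMPLab, was open sourced in 2010, and was donated to the Apache Software Foundation in 2013. The original creators of Spark later founded Databricks, which offers a managed version of Spark on its multi-cloud platform.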

Spark 可以处理批处理和流处理，使其成为大数据处理中的一个广泛适用的工具。相较于现有的大数据系统，Spark 通过内存计算和优化的查询执行实现了显著的性能提升，能够对任何规模的数据进行非常快速的分析查询。它基于弹性分布式数据集（RDDs）和数据框架（DataFrames）的概念构建。这些是分布在计算机集群中的数据元素集合，能够并行操作并具备容错能力。在本章的其余部分，我们将进一步扩展这些概念。

为什么使用 Apache Spark？

使用 Spark 有许多优势，这也是其作为大规模数据处理解决方案受欢迎的原因，正如图 3.1所示，这一趋势基于 Google 趋势数据。我们可以看到，Apache Spark 软件在大数据话题上的兴趣不断增加，而 Hadoop 软件的趋势在 2017 年 3 月被 Apache Spark 软件超越后开始下降。

图 3.1：与 Hadoop 和大数据相比，Apache Spark 的兴趣不断增加

这一兴趣激增可以通过一些关键优势来解释，具体如下：

速度：与非 Spark Hadoop 集群相比，Spark 在内存中运行速度可快达 100 倍，甚至在磁盘上运行时也能快达 10 倍。
容错性：通过使用分布式计算，Spark 提供了一个容错机制，能够在故障发生时进行恢复。
模块化：Spark 支持 SQL 和结构化数据处理、机器学习、图处理和流数据处理。凭借各种任务的库，Spark 能够处理广泛的数据处理任务。
可用性：Spark 提供了 Python、Java、Scala 和 R 等 API，以及 Spark Connect，能够让广泛的开发者和数据科学家使用。
兼容性：Spark 可以在不同平台上运行——包括 Databricks、Hadoop、Apache Mesos 和 Kubernetes，独立运行或在云端。它还可以访问各种数据源，相关内容将在接口和 集成部分中讨论。

Spark 日益流行以及其背后的诸多优势，是经过多年的演变而来的，接下来我们将进行回顾。

演变历程

Apache Spark 多年来经历了几次演进，以下是主要的版本发布：

1.x：这些是 Spark 的早期版本，起初基于 RDD 和一些分布式数据处理能力。
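2.x: Spark 2.0 (2016) introduced Structured Streaming and unified the DataFrame and Dataset APIs. These build on Spark SQL, which was already available in the 1.x releases, and are more efficient than working with raw RDDs.
3.x: Since 2020, Spark 3.0 has brought further improvements with Adaptive Query Execution (AQE), which dynamically adjusts query plans based on runtime statistics, as well as dynamic partition pruning. Support for newer Python versions and extensions to the machine learning library (MLlib) were also added.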

截至目前，最新版本为 3.5.3。为了了解项目的未来发展方向，接下来我们将聚焦于一些最新版本的亮点，具体如下：

PySpark为 Python 类型提示提供了用户友好的支持，支持在 Spark 上的 pandas API，并通过优化提升了性能。
自适应查询执行（Adaptive Query Execution）的改进促进了更高效的查询执行和资源利用。
结构化流处理的改进提升了稳定性和性能。
Kubernetes 支持更好的集成和资源管理能力，用于在 Kubernetes 上运行 Spark。这带来了更高的效率和易用性。
API 和 SQL 的增强带来了更高效的数据处理和分析，新功能和现有功能的改进提升了效率。这里的关键主题是更好的可用性和性能。

从前述内容可以看出，最近的关注点主要集中在对现代基础设施的支持、性能和可用性上。作为一个大规模数据处理和分析的工具，这使得 Spark 成为一个更加广泛采用的工具。

Spark 的分发版

随着其流行度和广泛的应用，Spark 出现了多个分发版。这些分发版由不同的组织开发，Apache Spark 作为核心，提供了不同的集成能力、可用性特性和功能增强。与其他大数据工具捆绑在一起的这些分发版，通常提供改进的管理界面、增强的安全性以及不同的存储集成。

以下是最常见的分发版：

Apache Spark是由 Apache 软件基金会维护的原始开源版本，是其他分发版的基础。
Databricks Runtime是由 Databricks 公司开发的，这家公司由 Spark 的创始人创建。它针对云环境进行了优化，提供了一个统一的分析平台，促进了数据工程师、数据科学家和业务分析师之间的协作。Databricks 提供了经过优化的 Spark 性能，采用了 C++重写的版本Photon，互动笔记本，集成的数据工程工作流（包括Delta Live Tables（DLT）），以及与 MLflow 的机器学习支持，且作为其基于 Unity Catalog 的治理功能的一部分，提供企业级的合规性和安全性。
Cloudera Data Platform（CDP）将 Spark 作为其数据平台的一部分，平台中还包含了 Hadoop 和其他大数据工具。
Hortonworks Data Platform（HDP）在与 Cloudera 合并之前，提供了其自有的分发版，其中包括 Spark。
Microsoft Azure将 Spark 作为Azure Databricks的一部分，后者是 Azure 上的第一方服务，此外还包括 HDInsight、Synapse，以及未来的 Fabric。
Amazon Web Services（AWS）在其市场中提供 Databricks，以及作为云服务运行的Elastic MapReduce（EMR），可在 AWS 上运行如 Apache Spark 等大数据框架。
Google Cloud Platform（GCP）托管了 Databricks，以及Dataproc，这是 Google 为 Apache Spark 和 Hadoop 集群提供的云端托管服务。

从本地解决方案到云原生解决方案，再到与其他数据平台集成的解决方案，每种 Apache Spark 的分发版都能满足不同的需求。当组织选择分发版时，通常考虑的因素包括性能要求、管理的简便性、现有的技术栈以及每个分发版所提供的特定功能。

在了解了 Apache Spark 的基本概念、优势和演变之后，让我们深入探讨它的架构和组件。

Apache Spark 架构

使用 Apache Spark 架构的主要目标是跨分布式集群处理大规模数据集。架构可以根据应用的具体需求而有所不同，无论是批处理、流处理、机器学习、报告查询，还是这些需求的组合。一个典型的 Spark 架构包括多个关键组件，这些组件共同满足数据处理需求。此类架构的示例可见于 图 3.2。

图 3.2：基于 Apache Spark 的架构示例（独立模式）

现在让我们深入探讨一下这些部分的具体功能。

集群管理器

集群管理器负责将资源分配给集群，集群是 Spark 工作负载执行的操作系统环境。包括以下几种：

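Standalone: Spark ships with a basic cluster manager that makes it easy to set up a cluster and get started. The node running this cluster manager is also known as the master node.
Kubernetes: Spark can be deployed on Kubernetes, an open source container-based system that automates the deployment, management, and scaling of containerized applications.
Apache Mesos: As a cluster manager, Mesos supports Spark and can also run Hadoop MapReduce. Note that Mesos support has been deprecated since Spark 3.2.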
Hadoop YARN：在与 YARN 一起运行时，Spark 可以与其他 Hadoop 组件共享集群和数据集。
专有和商业：将 Spark 融入的解决方案通常有自己的集群管理器——通常是对先前开源版本的变种和改进。

接下来，我们将查看这些 Spark 集群中的内容。

Spark Core、库和 API

一旦集群管理器提供了一个或多个集群，Spark Core 就会管理内存和故障恢复，以及与 Spark 作业相关的所有事务，如调度、分配和监控。Spark Core 抽象了存储的读写，使用 RDD 和最近的 DataFrame 作为数据结构。

在（并与之紧密合作）Core 的基础上，多个库和 API 提供了针对数据处理需求的附加功能。这些功能包括：

Spark SQL 允许通过 SQL 查询结构化数据。
Spark Structured Streaming 处理来自各种来源的数据流，例如 Kafka 和 Kinesis。
MLlib 提供多种机器学习算法，支持分类、回归、聚类等任务。
GraphX 允许使用图算法来创建、转换和查询图。

Spark 涉及数据处理，因此，解决方案中的一个重要部分是数据结构，接下来我们将讨论这个部分。

RDD、DataFrame 和数据集

自本章开始以来，我们提到了几次 RDD 和 DataFrame，但没有详细说明，现在我们将对此进行详细讲解，并引入 Datasets。

简而言之，这些是内存中的数据结构，表示数据并为我们提供了一种程序化的方式，正式来说，这是一种抽象，来操作数据。每种数据结构都有其适用的场景，如下所示：

RDD是 Spark 的基本数据结构。它是不可变的和分布式的，可以在集群内存中存储数据。具有容错性，RDD 可以自动从故障中恢复。需要注意的是，在集群内存不足的情况下，Spark 确实会将部分 RDD 存储到磁盘上，但由于这一过程是由后台管理的，因此我们仍然将 RDD 视为存在内存中。

随着越来越多操作变得可以通过更易用的 DataFrame 实现，你将越来越不可能使用 RDD，我们接下来将看到这一点。RDD 更适合进行低级转换，直接操作数据，当你需要对计算进行低级控制时，它们非常有用。
DataFrame是建立在 RDD 之上的分布式数据集合，具有命名的列。这类似于关系数据库中的表。除了更易用的高级 API，使代码更加简洁易懂外，DataFrame 还因为 Spark 的 Catalyst 优化器的支持，相较于 RDD 在性能上有了提升，我们将在本章后面讨论这一点。

在之前的动手练习中，我们已经开始使用 DataFrame。你可能已经注意到在做练习时，除了 Spark DataFrame，还有 pandas DataFrame。虽然在概念上它们类似，但它们属于不同的库，底层实现有所不同。从根本上讲，pandas DataFrame 运行在单台机器上，而 Spark DataFrame 是分布式的。pandas DataFrame 可以转换为 pandas-on-Spark DataFrame，除了并行化的优势外，还能支持 pandas DataFrame API。
Dataset结合了 RDD 的类型安全性和 DataFrame 的优化。类型安全性意味着你可以在编译时捕捉数据类型错误，从而提高运行时的可靠性。然而，这取决于编程语言是否支持在编码时定义数据类型，并在编译时进行验证和强制执行。因此，Dataset 仅在 Scala 和 Java 中得到支持，而 Python 和 R 由于是动态类型语言，只能使用 DataFrame。

总结来说，RDD 提供低级控制，DataFrame 提供优化后的高级抽象，而 Dataset 则提供类型安全。选择使用哪种数据结构取决于你应用的具体需求。

到目前为止，我们讨论了内部组件。接下来，我们将探讨外部接口部分，介绍 Spark 如何在后端与存储系统集成，并在前端与应用和用户交互。

接口与集成

在考虑与环境的接口和集成时，有几种方法可以通过 Apache Spark 实现。这些方法如下：

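Storage and data formats: Spark reads and writes a range of file formats, including csv, json, xml, orc, avro, parquet, and protobuf. Of these, Parquet is the most common, as it offers good performance together with snappy compression. In addition, Spark can be extended with packages to support a range of storage protocols and external data sources. Delta is one of these, and we will discuss it further in Chapters 4 and 5. Other formats include Iceberg and Hudi. Note that we are talking here about the on-disk representation of data, which is loaded into the memory-based RDD and DataFrame structures discussed earlier.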

我们已经通过目前为止的实践演练，积累了一些关于 Spark 和存储的经验，在这些演练中，我们已经从 Databricks Community Edition 的 Spark 集群读取了本地存储中的 CSV 文件。
应用程序：这是包含数据处理逻辑的代码，调用 Spark API 和库来执行数据变换、流处理、SQL 查询或机器学习等任务。开发人员可以使用 Python、R、Scala 或 Java 编写代码。然后，这些代码会在 Spark 集群上执行。

我们在应用程序方面的经验也已经开始，通过到目前为止的实践代码。
平台用户界面：除了我们在实践演练中看到的 Databricks Community Edition 的 Web 界面，开源 Apache Spark 还提供一个 Web 用户界面（UI），用于监控集群和 Spark 应用程序。它提供作业执行的阶段、资源使用情况和执行环境的洞察。其他集成了 Apache Spark 的数据平台也有自己的 UI。
应用程序终端用户界面：另一种 UI 是用于终端用户消费 Apache Spark 处理结果的界面。这可以是报告工具，或者例如在后端使用 Apache Spark 进行数据处理的应用程序。

在本节关于 Apache Spark 架构的内容中，我们看到架构如何支持从各种来源将数据引入 Spark 系统，通过 Spark 的库进行处理，然后将结果存储或提供给用户或下游应用程序。所选架构依赖于需求，如延迟、吞吐量、数据大小以及数据处理任务的复杂性和类型。在下一节中，我们将重点讨论 Spark 如何在大规模上执行分布式处理。

Apache Spark 的工作原理

迄今为止，我们已经查看了各个组件及其角色，但对它们的交互了解还不多。接下来我们将讨论这一部分，以了解 Spark 如何在集群中管理分布式数据处理，从变换和操作开始。

变换和操作

Apache Spark 在高层次上执行两种类型的数据操作：

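Transformations: These are operations, such as filter and groupBy, that define a new dataset from an existing one. They are evaluated lazily: Spark only records them in a directed acyclic graph (DAG) of operations, without computing anything yet.
Actions: These are operations of the count and save type, such as writing to Parquet files or using the saveAsTable operation. An action triggers the execution of all the transformations defined in the DAG, which causes Spark to compute the result of the chain of transformations.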

变换和动作之间的区别是编写高效 Spark 代码时需要考虑的重要问题。这使得 Spark 能够利用其执行引擎高效地处理作业，接下来将进一步解释。
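The difference between lazily recorded transformations and actions that trigger execution can be illustrated with a plain-Python analogy. This is only a sketch of the idea, not Spark's API: the LazyPipeline class and its methods are hypothetical and run on a single machine.

```python
class LazyPipeline:
    """Records transformations; runs them only when an action is called."""

    def __init__(self, data, steps=()):
        self._data = data
        self._steps = tuple(steps)

    # Transformations: return a new pipeline with one more recorded step
    def filter(self, predicate):
        return LazyPipeline(self._data, self._steps + (("filter", predicate),))

    def map(self, func):
        return LazyPipeline(self._data, self._steps + (("map", func),))

    def _run(self):
        # Chain the recorded steps over the data as lazy iterators
        rows = iter(self._data)
        for kind, fn in self._steps:
            rows = filter(fn, rows) if kind == "filter" else map(fn, rows)
        return rows

    # Actions: trigger the execution of every recorded step
    def count(self):
        return sum(1 for _ in self._run())

    def collect(self):
        return list(self._run())


calls = []


def is_high(reading):
    calls.append(reading)  # track when the predicate actually runs
    return reading > 10


readings = [3, 12, 7, 25, 18]
pipeline = LazyPipeline(readings).filter(is_high).map(lambda r: r * 2)
print(len(calls))          # still 0: transformations alone did no work
print(pipeline.collect())  # the action runs the whole chain
```

Deferring execution until an action is what gives an engine the chance to look at the whole chain of transformations and plan it efficiently before touching the data.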

作业、阶段和任务

Spark 应用程序作为作业执行，作业被拆分为多个阶段，再进一步拆分为多个任务，具体如下：

作业：当在 RDD、DataFrame 或 Dataset 上调用 Action 时，Spark 会提交一个作业。作业会转化为一个包含多个阶段的物理执行计划，接下来我们会解释这些阶段。Spark 作业的目的是作为逻辑工作单元执行一系列计算步骤，以实现特定目标，比如聚合数据或排序，并最终生成输出。
阶段：一个作业可以有多个阶段，这些阶段在物理执行计划中定义。阶段是一组连续的任务，可以在不跨集群移动数据的情况下完成。阶段之间的数据移动称为洗牌（shuffle）。将作业拆分为多个阶段是有益的，因为洗牌在性能上开销较大。一个阶段进一步被拆分为任务，接下来我们将讨论任务。
任务：作为最小的处理单元，任务是在 Spark 内存中的数据分区上执行的单个操作。每个任务处理不同的数据集，并且可以与其他任务并行运行。这些任务在工作节点上运行，接下来我们将讨论工作节点。

总结来说，作业、阶段和任务是层级相关的。Spark 应用程序可以有多个作业，这些作业基于数据洗牌边界被划分为多个阶段。阶段进一步细分为任务，这些任务在集群的不同分区上并行运行。这样的执行层级使得 Spark 能够高效地将工作负载分配到集群的多个节点，从而在大规模数据处理时提高效率。
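The hierarchy of jobs, stages, and tasks can be sketched as follows. This is a deliberately simplified model, not Spark's scheduler: it assumes that every wide operation (listed in the illustrative WIDE_OPERATIONS set) starts a new stage, and that each stage runs one task per partition. All names here are hypothetical.

```python
# Illustrative set of operations that require a shuffle
WIDE_OPERATIONS = {"groupBy", "join", "orderBy", "repartition"}


def split_into_stages(operations):
    """Group a job's operations into stages, cutting at shuffle boundaries."""
    stages = [[]]
    for op in operations:
        if op in WIDE_OPERATIONS and stages[-1]:
            stages.append([])  # shuffle boundary: start a new stage
        stages[-1].append(op)
    return stages


def count_tasks(stages, partitions_per_stage):
    """One task per partition in each stage."""
    if len(stages) != len(partitions_per_stage):
        raise ValueError("need one partition count per stage")
    return sum(partitions_per_stage)


# A job triggered by the final write action
job = ["read", "filter", "groupBy", "agg", "orderBy", "write"]
stages = split_into_stages(job)
for number, stage in enumerate(stages, start=1):
    print(f"stage {number}: {stage}")
print("tasks:", count_tasks(stages, [8, 4, 4]))
```

In this model, the narrow operations read and filter share the first stage because they need no data movement, while groupBy and orderBy each force a shuffle and therefore a new stage.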

现在我们已经了解了处理单元，接下来的问题是如何在计算资源上运行这些单元，包括驱动程序和工作节点。

驱动程序和工作节点

驱动程序和工作节点是集群管理器创建的计算资源，用于组成一个 Spark 集群。它们协同工作，利用多台机器的资源并行处理大数据集。

让我们详细讨论这些资源：

驱动节点：驱动节点是 Spark 应用程序的主进程运行的地方，主要负责以下任务：
- 资源：驱动程序请求集群管理器分配资源，以便在工作节点上运行进程。
- SparkSession：这是一个由驱动程序创建的对象，用于以编程方式访问 Spark 并在集群上进行数据处理操作。
- 任务：驱动节点将代码转化为任务，调度任务到工作节点上的执行器，并管理任务的执行。
工作节点：工作节点是数据处理的核心，数据通过所谓的执行器进程在工作节点上处理。执行器与存储交互，并将数据保存在自己的内存空间中，同时拥有自己的一组 CPU 核心。任务由驱动节点调度到执行器上执行，驱动节点与执行器之间直接通信，传递任务状态和结果。

驱动节点和工作节点的交互：图 3.3 总结了驱动节点和工作节点之间的交互顺序。

图 3.3：驱动节点和工作节点的工作示意图

步骤如下：

初始化：当 Spark 应用程序启动时，驱动程序将作业转换为阶段，并进一步拆分为任务。
调度：驱动节点在工作节点的执行器上调度任务，跟踪任务状态，并在发生故障时重新调度。
执行：驱动节点分配的任务由工作节点上的执行器运行。此外，当数据需要在执行器之间传递时，驱动节点协调执行器之间的操作。这对于某些操作（如联接）是必需的。
结果：最终，执行器处理任务的结果被发送回驱动节点，驱动节点汇总结果并将其发送回用户。

驱动节点和工作节点之间的这种协作过程是 Spark 的核心，它使得数据处理能够在集群中并行进行，并能够处理容错问题。
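The four steps above can be mimicked on a single machine with a thread pool standing in for the executors. This is only an analogy under stated assumptions: real Spark executors are separate processes on worker nodes, and the functions make_partitions, task, and driver_mean are hypothetical names. The sketch assumes non-empty input.

```python
from concurrent.futures import ThreadPoolExecutor


def make_partitions(data, num_partitions):
    """Split data into roughly equal contiguous partitions."""
    size = -(-len(data) // num_partitions)  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]


def task(partition):
    """Work done on one partition: a partial sum and a count."""
    return sum(partition), len(partition)


def driver_mean(data, num_executors=3):
    # Initialization: the driver turns the job into one task per partition
    partitions = make_partitions(data, num_executors)
    # Scheduling and execution: tasks run in parallel on the "executors"
    with ThreadPoolExecutor(max_workers=num_executors) as executors:
        partial_results = list(executors.map(task, partitions))
    # Result: the driver combines the partial results
    total = sum(s for s, _ in partial_results)
    count = sum(c for _, c in partial_results)
    return total / count


readings = [20.5, 21.0, 19.5, 22.0, 23.5, 18.5, 20.0]
print(driver_mean(readings))
```

Each task returns a partial sum and count rather than a mean, because partial sums can be combined correctly by the driver whereas per-partition means of unequal partitions cannot.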

现在我们已经了解了 Spark 集群的工作原理，让我们深入探讨是什么使它更加高效和优化。

Catalyst 优化器和 Tungsten 执行引擎

到目前为止，我们已经讨论了在不同版本中对 Apache Spark 的持续改进，其中两个显著的改进是 Catalyst 优化器和 Tungsten 执行引擎。它们在确保 Spark 过程优化、快速执行时间和高效资源利用方面发挥着关键作用。

Catalyst 优化器

Catalyst 优化器是在 Spark SQL 中引入的一个查询优化框架，通过对查询的 抽象语法树（AST）进行树形转换，显著提高了查询性能。它通过多个阶段实现优化，具体如下：

分析：查询被转化为一个名为逻辑计划的操作符树。
逻辑优化：优化器使用基于规则的转换来优化逻辑计划。
物理规划：逻辑计划被转换为物理计划，物理计划是基于选择的算法来进行查询操作的。
成本模型：然后基于成本模型比较物理计划，以找到在时间和资源上最有效的计划。
代码生成：作为最终阶段，物理计划被转换为可执行代码。

通过这些阶段，Catalyst 优化器确保运行最具性能和效率的代码。
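To give a feel for what a rule-based tree transformation looks like, here is a toy sketch of one common kind of rule, constant folding, applied to a small expression tree. It is not Catalyst's code: the tuple-based tree encoding and the fold_constants function are assumptions made purely for illustration.

```python
def fold_constants(expr):
    """Rule: replace an arithmetic node whose children are both literals
    with a single literal, working bottom-up through the tree."""
    kind = expr[0]
    if kind in ("lit", "col"):
        return expr  # leaves are left unchanged
    left = fold_constants(expr[1])
    right = fold_constants(expr[2])
    if left[0] == "lit" and right[0] == "lit":
        value = left[1] + right[1] if kind == "add" else left[1] * right[1]
        return ("lit", value)
    return (kind, left, right)


# Expression tree for: price * (60 * 60)
expr = ("mul", ("col", "price"), ("mul", ("lit", 60), ("lit", 60)))
print(fold_constants(expr))
# ('mul', ('col', 'price'), ('lit', 3600))
```

The optimized tree computes the same result, but the constant product is evaluated once at planning time instead of once per row.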

Tungsten 执行引擎

另一个关注点是 Spark 进程对 CPU 和内存的高效利用。Tungsten 执行引擎通过以下方式实现这一目标：

代码生成：Tungsten 与 Catalyst 优化器协作，生成优化且紧凑的代码，从而减少运行时开销，同时最大化速度。
缓存意识：减少缓存未命中可以提高计算速度。Tungsten 通过使算法和数据结构具备缓存意识来实现这一点。
内存管理：Tungsten 高效管理内存，提高了缓存的影响力，同时减少了垃圾回收的开销。

Catalyst 优化器与 Tungsten 执行引擎共同合作，通过优化查询计划、生成高效代码以及减少计算开销，显著提高了 Spark 的性能。这提升了 Spark 在大数据处理中的效率，且具备可扩展性和高速性。

现在我们已经了解了 Apache Spark 的工作原理，接下来将介绍如何设置我们自己的 Spark 环境。

安装 Apache Spark

到目前为止，在前面的章节中，我们已成功在 Databricks Community Edition 上执行了 Spark 代码。然而，这仅限于单节点集群。如果我们希望充分利用 Spark 的并行处理能力，就需要多节点集群。我们可以选择使用 Databricks 管理的平台即服务（PaaS）云解决方案，或其他等效的云 PaaS，或者我们可以构建自己的 Apache Spark 平台。这正是我们现在要做的，按照图 3.2中展示的Apache Spark 架构来部署环境。

注意

如果您不打算构建自己的 Apache Spark 环境，可以跳过本节的实践部分，改为使用受管 Spark 平台，如 Databricks，我们将在未来的章节中使用。

使用容器进行部署

我们可以直接在本地机器上安装 Apache Spark，但这将只给我们一个节点。通过将其部署在容器中，如 Docker，我们可以在同一台机器上运行多个容器。这有效地为我们提供了一种方法来构建一个多节点集群。这种方法的其他优势包括与本地执行环境的隔离，以及提供一种可移植且可重复的方式，将其部署到其他机器上，包括如 Amazon 弹性 Kubernetes 服务（EKS）、Azure Kubernetes 服务（AKS）或Google Kubernetes 引擎（GKE）等基于云的容器服务。

在接下来的部分中，我们将使用 Docker 容器，首先安装 Docker，然后构建并启动包含 Apache Spark 的容器，最后验证我们的部署。

Docker 替代方案

您可以使用 Podman 作为 Docker 的开源替代方案。请在此查看更多信息：podman.io/。

Docker

以下说明指导你如何安装 Docker：

请参考以下链接，根据你的操作系统下载并安装 Docker：

docs.docker.com/get-docker/

对于 macOS 用户，请按照此处的说明操作：

docs.docker.com/desktop/install/mac-install/
一旦 Docker 安装完成，按图 3.4所示启动它。

图 3.4：Docker Desktop

在 macOS 上，你可能会看到 Docker Desktop 的警告：“另一个应用程序更改了你的桌面配置”。根据你的设置，以下命令可能解决此警告：

ln -sf /Applications/Docker.app/Contents/Resources/bin/docker-credential-ecr-login /usr/local/bin/docker-credential-ecr-login

一旦 Docker Desktop 启动并运行，我们可以使用 Apache Spark 构建容器。

网络端口

以下网络端口需要在本地机器或开发环境中可用：

Apache Spark：7077，8080，8081
Jupyter Notebook：4040，4041，4042，8888

你可以使用以下命令检查当前端口是否被现有应用程序占用，在命令行或终端中运行：

% netstat -an | grep LISTEN

如果你在已使用端口的列表中看到所需端口，你必须停止使用该端口的应用程序，或者修改docker-compose文件以使用其他端口。
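As an alternative to reading through the netstat output, the same check can be scripted. The following is a minimal sketch that tries to connect to each required port on the local machine; the helper names port_in_use and find_conflicts are hypothetical, and the port list simply mirrors the one given above.

```python
import socket


def port_in_use(port, host="127.0.0.1"):
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(0.5)
        return sock.connect_ex((host, port)) == 0


# Ports required by the environment in this chapter
REQUIRED_PORTS = {
    "Apache Spark": [7077, 8080, 8081],
    "Jupyter Notebook": [4040, 4041, 4042, 8888],
}


def find_conflicts(required=REQUIRED_PORTS):
    """Map each component to the list of its ports that are already taken."""
    return {
        name: [p for p in ports if port_in_use(p)]
        for name, ports in required.items()
    }


if __name__ == "__main__":
    for name, busy in find_conflicts().items():
        print(name, "ports already in use:", busy or "none")
```

Run this before starting the containers. Any port it reports as busy needs either the conflicting application stopped or the port mapping changed, as described next.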

作为示例，假设上述netstat命令的输出显示本地机器或开发环境中的端口8080已经在使用，并且你无法停止正在使用该端口的现有应用程序。

在这种情况下，你需要将docker-compose.yaml文件中用于 Apache Spark 的端口8080更改为另一个未使用的端口。只需在:左侧查找并替换8080为例如8070，前提是该端口未被占用，如以下示例所示：

来自：

     ports:
      - '7077:7077'
      - '8080:8080'

到：

     ports:
      - '7077:7077'
      - '8070:8080'

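Take note of the new port and use it in place of the original one wherever you need to enter the corresponding URL. In this example, port 8080 is changed to 8070, and the matching URL for the Spark master node UI changes as follows: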

来自：localhost:8080/
到：localhost:8070/

注意

你需要更改以下各节中所有需要修改的 URL 中的网络端口，以配合本节内容。

构建并部署 Apache Spark

以下说明指导你如何构建和部署 Docker 镜像：

我们首先从本章的 Git 仓库下载部署脚本，网址如下：

github.com/PacktPublishing/Time-Series-Analysis-with-Spark/tree/main/ch3

我们将使用适合 git 克隆的 URL，具体如下：

github.com/PacktPublishing/Time-Series-Analysis-with-Spark.git

要做到这一点，启动终端或命令行并运行以下命令：
```
git clone https://github.com/PacktPublishing/Time-Series-Analysis-with-Spark.git
cd Time-Series-Analysis-with-Spark/ch3
```
请注意，上述命令适用于 macOS 或基于 Linux/Unix 的系统，您需要运行适用于 Windows 的等效命令。

在 macOS 上，当您运行此命令时，可能会看到以下错误：

xcrun: error: invalid active developer path (/Library/Developer/CommandLineTools), missing xcrun at: /Library/Developer/CommandLineTools/usr/bin/xcrun

在这种情况下，您需要使用以下命令重新安装命令行工具：

xcode-select --install

现在我们可以开始容器的构建和启动。提供了一个 Makefile 来简化启动和停止容器的过程。以下命令构建容器的 Docker 镜像并启动它们：
```
make up
```

Windows 环境

如果您使用的是 Windows 环境，可以根据以下文档安装 Windows 版本的 Make：gnuwin32.sourceforge.net/packages/make.htm

这将产生以下或等效的输出：

docker-compose up -d
[+] Running 4/4
...
 ✔ Container ts-spark-env-spark-master-1    Started
 ✔ Container ts-spark-env-jupyter-1         Started
 ✔ Container ts-spark-env-spark-worker-1-1  Started
 ✔ Container ts-spark-env-spark-worker-2-1  Started

As the output shows, we have a master node (ts-spark-env-spark-master-1), which is where the cluster manager runs, and two worker nodes (ts-spark-env-spark-worker-1-1 and ts-spark-env-spark-worker-2-1). In addition, there is a separate node (ts-spark-env-jupyter-1) for a notebook environment, called Jupyter Notebook, similar to what you have used in the previous chapters on Databricks Community Edition. In this deployment, this Jupyter node is also the driver node.

Let's now validate the environment that we have just deployed.

Accessing the UIs

We will now access the UIs of the different components as a quick way to validate the deployment:
We start with Jupyter Notebook at the following local URL:

http://localhost:8888/lab

Note

You will need to change the network port in the preceding URL if you modified it as discussed in the Network ports section.

This will open the web page shown in Figure 3.5.

Figure 3.5: Jupyter Notebook

The next (and important) UI is for the Apache Spark master node, accessible via the following local URL:

http://localhost:8080/

Figure 3.6 shows this master node UI, as well as the worker nodes connected to it.

Figure 3.6: Spark master node UI

We now have our own Apache Spark cluster running.

As a final step to conclude this chapter, you can stop the containers with the following command:

make down

If you do not intend to use it further, you can additionally delete the Docker containers created with the Delete action as explained here: docs.docker.com/desktop/use-desktop/container/#container-actions

Summary

In this chapter, we dove deep into the Apache Spark architecture, its key components, and its features. The key concepts, how it works, and what makes it such a great tool were explained. We then deployed a multi-node cluster representing an example architecture. The concepts presented in this chapter, while essential, cover only a part of an Apache Spark project. We will view such a project end to end in the next chapter.

Further reading

This section serves as a repository of sources that can help you build on your understanding of the topic:

*   Apache Spark official web page: [`spark.apache.org/`](https://spark.apache.org/)
*   *Mastering Apache Spark* (Packt Publishing) by Timothy Chen, Mike Frampton, and Tim Seear
*   *Azure Databricks Cookbook* (Packt Publishing) by Phani Raj and Vinod Jaiswal
*   Google Trends comparison: [`trends.google.com/trends/explore?date=2009-01-01%202024-08-28&q=%2Fm%2F0bs2j8q,%2Fm%2F0ndhxqz,%2Fm%2F0fdjtq&hl=en`](https://trends.google.com/trends/explore?date=2009-01-01%202024-08-28&q=%2Fm%2F0bs2j8q,%2Fm%2F0ndhxqz,%2Fm%2F0fdjtq&hl=en)
*   Cluster Overview: [`spark.apache.org/docs/latest/cluster-overview.html`](https://spark.apache.org/docs/latest/cluster-overview.html)
*   Spark Connect: [`spark.apache.org/docs/latest/spark-connect-overview.html`](https://spark.apache.org/docs/latest/spark-connect-overview.html)
*   Docker Compose: [`docs.docker.com/compose/`](https://docs.docker.com/compose/)
*   Make and Makefile: [`www.gnu.org/software/make/manual/make.html`](https://www.gnu.org/software/make/manual/make.html)
*   Jupyter: [`jupyter.org/`](https://jupyter.org/)


第二部分：从数据到模型

在此基础上，本部分将提供时间序列分析项目中涉及的所有阶段的整体视图，重点关注数据和模型。从时间序列数据的导入和准备开始，我们将进行探索性分析，以了解时间序列的性质。数据准备和分析将引导我们选择用于分析、开发和测试的模型。

本部分包含以下章节：

第四章，时间序列分析项目的端到端视图
第五章，数据准备
第六章，探索性数据分析
第七章，构建与测试模型

第四章：时间序列分析项目的端到端视角

在前几章中，我们介绍了时间序列分析及其多个应用场景，以及 Apache Spark——作为这种分析的关键工具——的基础，本章将引导我们完成整个时间序列分析项目的过程。从用例出发，我们将过渡到涵盖 DataOps、ModelOps 和 DevOps 的端到端方法。我们将涵盖关键阶段，如数据处理、特征工程、模型选择和评估，并提供关于如何使用 Spark 和其他工具构建时间序列分析管道的实用见解。

这种全面的时间序列分析项目视角将为我们提供一种结构化的方法，帮助我们处理现实世界的项目，增强我们实施端到端解决方案的能力。这里的信息将为我们提供一个框架，帮助我们以一致的方式使用 Spark，并确保时间序列分析项目的成功执行。我们将以两种实施方法作为结尾。

本章将涵盖以下主题：

由用例驱动
从 DataOps 到 ModelOps 再到 DevOps
实施示例与工具

让我们开始吧！

技术要求

本章的实操部分将是实现时间序列分析项目的端到端示例。该章节的代码可以在 GitHub 仓库中的 ch4 文件夹找到，链接如下：

github.com/PacktPublishing/Time-Series-Analysis-with-Spark/tree/main/ch4

本章的实操部分（实施示例与工具）将进一步详细说明。这需要一些构建开源环境的技能。如果你不打算构建自己的 Apache Spark 环境，且关注点仅在于时间序列分析并使用 Spark 和其他工具，而不是部署它们，那么你可以跳过本章的实操部分。你可以使用像 Databricks 这样的托管平台，它预装了 Spark、MLflow 以及用于工作流和笔记本的工具，正如我们在未来章节中所做的那样。

由用例驱动

在我们深入讨论如何进行端到端时间序列分析项目之前，像往常一样，先从为什么开始总是一个好的选择。可能有许多原因，通常是多种原因的组合，来证明启动时间序列分析项目的必要性。以下是一些原因：

技术更新：强调技术的原因可能是由于老化的平台需要替换，且无法再满足需求，或当有新技术出现，提供更好的性能、更低的成本或更多的功能，如高级机器学习模型或可扩展的云资源。
方法研究：对于专注于研究的组织或部门，主要驱动力是寻找新的、更好的方法，比如开发和测试用于分析时间序列的新算法。
数据探索：与研究类似，但需要更接近数据，这通常嵌入在企业的数据团队中。这里的需求是理解时间序列数据，而不一定需要预先定义的应用目标。其目的是发现数据中的模式、趋势和异常。
用例：在这种方法中，我们从结果出发，首先识别最终用户或相关方的具体需求和期望。然后，我们基于时间序列数据分析来设置项目，以回答这些需求。

尽管之前提到的所有理由都有其合理性，并且无疑是有效的，但多年来，我发现以业务驱动的用例方法在投资回报方面是最高的。我们在第二章中已经开始讨论基于时间序列的用例，涵盖了各个行业的应用场景，如库存预测、能源使用预测、金融市场趋势分析或传感器数据中的异常检测；在这里，我们将重点关注这种用例驱动的方法，并进一步探讨。

用例方法首先识别并定义现实世界中具体的业务应用或挑战。接着，选择最适合解决这些需求的技术方案。乍一看，这与任何商业环境中的项目似乎并无太大不同。这里的关键区别在于“具体”一词，用例方法强调的是具体、可衡量的业务成果。这遵循精益方法，因为我们希望避免那些无法为业务成果做出贡献的功能。

用例可以与敏捷软件开发方法中的用户故事进行对比。事实上，敏捷方法通常是实现用例的方式，通过简化的迭代开发过程，始终涉及用户的参与。

以下图 4.1概述了基于迄今为止讨论内容的用例驱动方法，包括它们的关键特征。

图 4.1：用例驱动方法

现在我们已经定义了用例驱动的方法，接下来我们将介绍这一方法的关键特征，如下所示：

业务成果：项目的成功通过业务指标来衡量，具体包括提高收入、降低成本、提升效率以及更好、更快速的决策制定。
以用户为中心：从一开始就与最终用户和相关方合作，明确他们的具体需求，项目目标除了回答这些需求外，还包括前述的业务成果。
具体：我们已经讨论过这个词几次。项目的具体性为其范围提供了明确的方向，使得执行更加灵活。我们希望解决一个具体的需求，例如销售预测，这个需求甚至可以更为细化，比如为特定产品线或区域进行预测。
迭代性：涉及最终用户和利益相关者的反馈和改进循环确保项目保持正轨，满足预期的商业成果。这再次强调了与敏捷方法的相似性，其短开发周期、增量交付、持续反馈和适应性。

遵循这些特征，确保用例范围足够小，能够在几个月内（如果不是几周的话）实现并带来价值。这些较小的用例通常意味着它们在开发资源上并行竞争。这就需要优先排序，以确保资源得到合理投资。以下标准通常用于优先排序用例：

影响力：这是用例预期商业影响的衡量标准，最好用货币价值来计算。如果结果是时间的减少，则需估算时间节省的等值货币价值。
成本：我们需要计算与用例相关的所有成本，从用例构想到用例上线并为业务带来价值的整个过程。成本可能与开发、基础设施、迁移、培训、支持和生产运营相关。
投资回报率（ROI）：这可以通过将影响力除以成本来简单估算。例如，对于一家希望更好地预测店铺库存的零售商，如果将库存预测用例投入生产的总成本为 50 万美元，而库存预测改进预计将带来三年 200 万美元的节省，则在此期间 ROI 为 4 倍。
技术可行性：用例的技术解决方案存在，并且能够在时间和预算内实现。
数据可用性和可访问性：数据可用且可访问，用以构建用例并将其投入运营。

根据前述标准，在资源竞争的情况下，具有高影响力、投资回报率（ROI）为 10 倍、可行且已有数据的用例，应优先于另一个影响较小、ROI 为 3 倍或没有数据访问的用例。
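The prioritization logic described above can be expressed as a short sketch. It rests on simplifying assumptions: ROI is impact divided by cost, and use cases that are not feasible or lack data are excluded outright rather than merely ranked lower. The UseCase class and the two "hypothetical" entries are illustrative and not from the book; the inventory forecasting figures are the ones from the example above.

```python
from dataclasses import dataclass


@dataclass
class UseCase:
    name: str
    impact: float  # expected monetary impact over the evaluation period
    cost: float    # total cost to bring the use case into production
    feasible: bool = True
    data_available: bool = True

    @property
    def roi(self):
        return self.impact / self.cost


def prioritize(use_cases):
    """Keep feasible use cases with data, ranked by ROI (highest first)."""
    eligible = [u for u in use_cases if u.feasible and u.data_available]
    return sorted(eligible, key=lambda u: u.roi, reverse=True)


candidates = [
    UseCase("inventory forecasting", impact=2_000_000, cost=500_000),
    UseCase("hypothetical use case A", impact=3_000_000, cost=300_000),
    UseCase("hypothetical use case B", impact=900_000, cost=300_000,
            data_available=False),
]

for use_case in prioritize(candidates):
    print(f"{use_case.name}: ROI {use_case.roi:.1f}x")
```

In practice, impact and cost are estimates with uncertainty, so a ranking like this is a starting point for discussion with stakeholders rather than a final decision.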

总结来说，从清晰理解用户需求开始，基于用例的项目确保了与业务的适用性和相关性，紧密对接利益相关者的目标，并能量化影响。然而，拥有一个好的用例仅仅是开始。接下来，我们将深入探讨从用例到成功完成时间序列分析项目、并实现商业成果的下一步。

从 DataOps 到 ModelOps 再到 DevOps

一旦确定了一个重要的用例，一些阶段发挥着至关重要的作用，从数据操作（DataOps）到模型操作（ModelOps），最后到部署（DevOps），将业务环境中的价值传递到实际应用中。一个覆盖这些阶段的完整端到端过程确保了我们可以持续地从一个用例交付到下一个，同时确保结果的可重复性。图 4.2概述了这些阶段，接下来将在本节中详细介绍。

图 4.2：DataOps、ModelOps 和 DevOps

DataOps

DataOps 在时间序列分析项目中的应用包括最佳实践和过程，确保时间序列数据在生命周期中的流动、质量和访问。其目标是及时、高效且准确地进行时间序列分析和建模，从而得出可操作的业务洞察。

DataOps 实践涵盖完整的数据和元数据生命周期，通常可以分为数据源集成、数据处理、数据治理和数据共享。

来源集成

数据源集成首先涉及识别数据源并获取访问权限，然后从源中摄取数据。

数据来源可以是内部的或外部的。内部数据来源主要是数据库，例如交易记录、系统日志或用于遥测的传感器数据。外部数据来源包括市场数据、天气数据或社交媒体数据，而后者现在正成为主流。不同领域的数据来源在数据量、更新频率和数据格式上差异巨大。一旦确定并访问了数据来源，数据摄取就是将数据引入平台进行处理的过程。通常通过自动化的数据摄取管道来实现，按照特定的频率（每小时、每天等）批量运行或持续流式传输。数据摄取机制包括数据库连接、API 调用或基于 Web 的抓取等。

处理与存储

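Data processing includes cleaning the data, transforming it into the right format, and storing it for analysis. The recommended approach is the medallion approach, shown in Figure 4.3, with multiple processing stages that take the data from raw, to refined, to report-ready.

Figure 4.3: Medallion stages of data processing

The medallion approach

The medallion approach to data processing organizes data into three stages: bronze, silver, and gold. It is commonly used in data lake and Delta Lake architectures. In the bronze stage, raw data is ingested from the different sources without transformation. In the silver stage, the data is cleaned, enriched, and transformed to create a refined dataset. Finally, the gold stage holds the highest-quality data, which is cleaned, aggregated, and optimized for advanced analytics, reporting, and business intelligence. This multi-layered structure improves data quality and makes the data easier to manage.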

一旦数据从源端摄取，数据质量检查与清洗是构建数据可信度的第一步。这包括处理缺失值、检测并修正错误、去除重复项以及筛选异常值。这些操作能提升数据质量，并为分析提供坚实的基础。对于时间序列数据，特定要求是在此阶段验证并维护时间一致性，以确保数据的顺序性。

来源的原始数据通常不适合直接用于分析，需要经过多次转换，使其适合时间序列分析。此过程包括将半结构化数据转换为结构化格式以便于更快速的访问等多项转换操作。粒度较小或间隔不规则的数据需要聚合到更高层级的时间间隔，例如将每分钟的数据聚合为每小时数据、每小时数据聚合为每日数据，依此类推。日期和时间字段可能需要特别处理，以确保其为可排序格式，并用于设置时间索引，以便更快速地检索。不同的时区需要相应处理。
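The cleaning and aggregation steps described above can be sketched on a tiny in-memory example, loosely following the bronze, silver, and gold stages. This is plain Python for illustration only, not a Spark or Delta pipeline; the function names to_silver and to_hourly_mean and the sample rows are hypothetical.

```python
from collections import defaultdict
from datetime import datetime


def to_silver(raw_rows):
    """Bronze -> silver: drop missing values and duplicate timestamps,
    parse timestamps and values, and restore the temporal order."""
    seen, clean = set(), []
    for ts, value in raw_rows:
        if value is None or ts in seen:
            continue
        seen.add(ts)
        clean.append((datetime.fromisoformat(ts), float(value)))
    return sorted(clean)


def to_hourly_mean(rows):
    """Silver -> gold: aggregate readings into hourly means."""
    buckets = defaultdict(list)
    for ts, value in rows:
        hour = ts.replace(minute=0, second=0, microsecond=0)
        buckets[hour].append(value)
    return {hour: sum(v) / len(v) for hour, v in sorted(buckets.items())}


# Bronze: raw readings as ingested (out of order, duplicated, incomplete)
bronze = [
    ("2024-01-01T00:05:00", "1.0"),
    ("2024-01-01T00:35:00", "3.0"),
    ("2024-01-01T00:35:00", "3.0"),  # duplicate
    ("2024-01-01T01:10:00", None),   # missing value
    ("2024-01-01T01:20:00", "4.0"),
    ("2024-01-01T00:20:00", "2.0"),  # late-arriving record
]

silver = to_silver(bronze)
gold = to_hourly_mean(silver)
for hour, mean_value in gold.items():
    print(hour.isoformat(), mean_value)
```

Dropping rows with missing values is only one option; depending on the use case, interpolation or forward filling may be more appropriate, as covered in the data preparation chapter.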

在较小的项目中常被忽视，元数据在企业环境中是确保数据可追溯性和数据血缘治理的重要要求。例如，这些关于数据的信息包括源标识符、数据摄取和更新的时间、所做的更改以及历史版本。元数据作为数据摄取和转化管道的一部分进行捕获，并且原生支持诸如 Delta 这样的存储协议。

尽管迄今为止描述的所有数据处理可以在内存中完成，但对数据的长期存储和检索仍有需求，以便进行时间跨度较长的分析。此存储需要具有成本效益、可扩展性、安全性，并且能够提供进行及时分析所需的高性能。根据数据的体积和流动速度，可选择的存储方式包括专门的时间序列数据库如 InfluxDB，或结合使用 Delta 等存储协议的云端存储。

我们将在第五章中更深入探讨数据处理，特别是数据准备的内容。现在，我们将焦点转向治理和安全，这些是从风险角度来看，DataOps 中最关键的考虑因素之一。

监控、安全性与治理

数据监控、安全性和治理涵盖了多个交叉的数据实践领域，包括数据质量、隐私、访问控制、合规性以及政策。为了理解这些实践的重要性，让我们来看看在撰写本文时新闻中的以下内容：

一起网络安全事件影响了多个主要组织，包括 Ticketmaster、Banco Santander 和 Ticketek。一个名为 ShinyHunters 的黑客组织获取了 Ticketmaster 的数据库，并因此泄露了 5.6 亿用户的个人信息。泄露的内容包括姓名、地址、电话号码、电子邮件地址和支付详情。据报道，这些数据正在黑客论坛上以高价出售。Banco Santander 也发生了类似的泄露，影响了客户和员工。

[来源：www.wired.com/story/snowflake-breach-ticketmaster-santander-ticketek-hacked/]

这些与第三方云数据仓库服务相关的数据泄露突显了网络安全的挑战，以及对监控、安全和治理强有力措施的需求。

监控

这里的目标是及时识别问题，并能够采取纠正措施，理想情况下在其产生负面后果之前。监控内容包括数据本身、转换管道的执行状态，以及安全和治理漏洞。对于数据监控，这意味着通过衡量数据的准确性和完整性来跟踪数据质量，同时捕捉数据缺口和异常。实现这一目标的一种方法是与一系列特定的时间序列模式进行比较，正如我们在第二章的异常检测示例中所看到的。至于数据管道监控，主要跟踪其性能，以确保数据的新鲜度、服务级别协议（SLA）得到遵守，以及通过数据溯源和完整性跟踪其血统。从安全角度来看，我们希望及时发现任何数据泄露的尝试，并采取相应的行动。监控应为自动化过程，并具备警报功能。
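One of the simplest data monitoring checks mentioned above, capturing gaps in the data, can be sketched as follows. The find_gaps helper is a hypothetical name, and the check assumes a series that is expected to arrive at a fixed interval.

```python
from datetime import datetime, timedelta


def find_gaps(timestamps, expected_interval):
    """Return (previous, next) pairs whose spacing exceeds the interval."""
    ordered = sorted(timestamps)
    return [
        (earlier, later)
        for earlier, later in zip(ordered, ordered[1:])
        if later - earlier > expected_interval
    ]


# Hourly readings with the 03:00 observation missing
start = datetime(2024, 1, 1)
observed = [start + timedelta(hours=h) for h in (0, 1, 2, 4, 5)]

for earlier, later in find_gaps(observed, timedelta(hours=1)):
    print(f"gap between {earlier.isoformat()} and {later.isoformat()}")
```

In an automated pipeline, a non-empty result from a check like this would typically raise an alert rather than just print a message.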

安全

无论是静态数据还是传输中的数据，我们都需要定义角色和相关权限，以确保访问控制。某些时间序列数据具有敏感性，只有授权人员才能查看或操作这些数据。

在受监管行业中，处理个人数据时，我们需要确保数据处理和存储实践符合相关法规（如 HIPAA、GDPR 等）。这还涉及确保隐私并管理个人数据的同意。

治理

除了前述内容，数据治理实践负责分配角色和责任以管理数据。作为其中的一部分，数据管理员负责监督数据质量、合规性和政策。

通过建立正确的流程、人员和工具，我们可以确保预防数据泄露，并在发生时有效减轻其影响。

我们现在已经涵盖了以可信和有用的方式将数据摄取并转化为受治理和安全的数据的过程。作为 DataOps 的一部分，剩下的步骤就是将数据共享给用户或其他系统进行分析和消费。

共享与消费

在数据摄取和处理之后，我们希望经过整理的数据和分析结果能够对用户可见并可访问。一个集中的数据目录，包括描述和使用指南，可以让用户轻松发现并访问可用的数据集。

最后，作为 DataOps 阶段的一部分，我们希望数据科学家、分析师和其他用户能够使用数据进行探索、分析和报告。理想情况下，我们希望将其与治理结合，确保只有授权的用户访问和使用允许的数据集。访问方式和使用方法包括基于文件的访问、数据库连接和 API 等。

正如本节所讨论的，DataOps 是一系列确保数据可用、可访问和可用的过程。它是迭代的，通过来自消费者的反馈以及对数据、管道和实践的持续改进来实现。通过建立一个可扩展且灵活的基础设施，并以 Apache Spark 的处理能力和多功能性为核心，DataOps 确保数据科学家和分析师能够在需要时获得高质量的数据，以获取洞察并推动决策。

我们将在第五章《数据准备》中讨论 DataOps 的实际考虑。目前，让我们专注于 ModelOps，它是继 DataOps 之后的下一个阶段。

ModelOps

虽然 DataOps 关注的是数据生命周期，ModelOps 关注的是模型生命周期——更具体地说，是统计和机器学习模型。其目标是从开发到部署管理模型，确保模型可靠、准确且可扩展，同时根据用例要求提供可操作的洞察。

ModelOps、MLOps 和 LLMOps

这些术语具有重叠的定义，有时可以互换使用。在本书中，我们将“ModelOps”作为不同类型模型（包括仿真模型、统计模型和机器学习模型）的更广泛生命周期管理实践来使用。我们将更具体地使用机器学习运维（MLOps）来指代机器学习模型，并将大语言模型运维（LLMOps）用于特指大语言模型生命周期中的相关考虑。因此，ModelOps 将指代这一系列实践的总和。

ModelOps 实践大致可以分为模型开发与测试，以及模型部署。

模型开发与测试

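Model development and testing involve creating and fine-tuning time series analysis models based on historical data. The process starts with feature engineering, followed by choosing a suitable algorithm, such as Autoregressive Integrated Moving Average (ARIMA) or Long Short-Term Memory (LSTM), and splitting the data into training and test sets. The model is then trained iteratively and evaluated against performance metrics to ensure accuracy. Finally, by testing the model on unseen data, we can check that it generalizes well to new, real-world scenarios.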

我们现在将进一步详细介绍每个步骤：

特征工程：与 DataOps 阶段重叠，特征工程是模型开发的初始阶段，关注的是从时间序列数据中识别现有特征并创建新特征。这包括创建滞后和滚动平均特征，其中使用前一时间步的数据计算新特征，以及创建捕捉时间相关特征的时间特征，如一天中的特定时间、星期几、月份或假期。此外，特征工程阶段还包括对时间序列进行平稳化的转换，如差分或对数变换，或通过重采样使时间序列变得规律化，如在第一章中所讨论的那样。我们将在第八章中看到如何使用 Apache Spark 进行特征工程，涉及模型开发。
模型选择：选择模型是从一个不断增长的时间序列候选模型列表中挑选：ARIMA、Prophet、机器学习、深度学习模型（如 LSTM）等。正确的时间序列模型取决于可用的数据和我们要实现的用例，正如我们在第二章中的用例示例中看到的那样。探索性数据分析（EDA），在第六章中详细介绍，帮助我们理解数据的趋势、季节性和潜在模式，从而指导我们完成这一过程。然而，找到最佳模型是一个迭代过程，通过模型验证来不断完善，我们将在下一步中介绍这个过程。
数据集划分：一旦我们有了候选模型，训练模型之前的第一步是将历史数据划分为训练集、验证集和测试集。在时间序列数据中进行此操作的具体考虑因素有两个：一是保持数据集内的时间顺序，二是确保在划分之间没有数据泄漏。
训练：在这一阶段，模型通过调整其参数来拟合训练数据集。这可以是有监督的，使用预定义的标签或实际结果，或者是无监督的，如在第二章中解释的那样。在有监督训练的情况下，模型参数通过诸如梯度下降的过程进行调整，以最小化模型预测与实际结果之间的差异，使用损失函数进行优化。对于无监督训练，模型会进行调整，直到满足停止标准，如运行次数或分类类别数。
验证：作为训练迭代的一部分，模型验证使用未见过的验证数据集，并采用如基于时间的交叉验证等技术。这是为了检查是否存在过拟合，并确保训练后的模型能够以可接受的准确性对未见过的数据进行泛化。模型的准确性通过如平均绝对百分比误差（MAPE）或平均绝对误差（MAE）等指标进行评估。作为一个迭代过程，这一阶段包括超参数调优，在此过程中，不同设置的模型被训练和验证，以找到最佳的模型配置。技术如网格搜索或贝叶斯优化被用来寻找最优的超参数。

参数与超参数

请注意参数和超参数之间的区别。这些术语常常被混淆。模型参数是通过训练过程从数据中学习得来的，例如神经网络的权重和偏差。超参数则是在模型训练之前定义的模型配置，举例来说，在神经网络中，超参数可以是定义其架构的节点数和层数。

测试 – 作为模型开发的最后一步，模型会在未见过的测试数据集上进行评估，并与不同的算法或模型类型进行比较。测试还可以包括超出模型准确度的其他标准，如响应时间，以及与模型配合使用的应用代码的集成测试。
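Several of the steps above can be illustrated together in a short plain-Python sketch: building lag and rolling-mean features from past values only, splitting the data chronologically to avoid leakage, and scoring a forecast with MAE and MAPE. The function names, the sample series, and the naive "last value" forecast are all assumptions made for illustration; they are not the models used later in the book.

```python
def add_lag_and_rolling(values, lag=1, window=3):
    """Build rows with a lag feature and a rolling mean of past values."""
    rows = []
    for t in range(max(lag, window), len(values)):
        previous = values[t - window:t]  # past values only: no leakage
        rows.append({
            "y": values[t],
            "lag": values[t - lag],
            "rolling_mean": sum(previous) / window,
        })
    return rows


def chronological_split(rows, train_frac=0.6, val_frac=0.2):
    """Split without shuffling so that the temporal order is preserved."""
    train_end = round(len(rows) * train_frac)
    val_end = train_end + round(len(rows) * val_frac)
    return rows[:train_end], rows[train_end:val_end], rows[val_end:]


def mae(actual, predicted):
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def mape(actual, predicted):
    """MAPE in percent; assumes that no actual value is zero."""
    return 100 * sum(
        abs((a - p) / a) for a, p in zip(actual, predicted)
    ) / len(actual)


series = [10, 12, 11, 13, 15, 14, 16, 18, 17, 19, 21, 20, 22]
rows = add_lag_and_rolling(series)
train, val, test = chronological_split(rows)

# Naive baseline: forecast each value with the previous observation
actual = [r["y"] for r in test]
predicted = [r["lag"] for r in test]
print("MAE:", mae(actual, predicted))
print("MAPE (%):", mape(actual, predicted))
```

A naive baseline like this is a useful reference point: a candidate model that cannot beat it on the test set is unlikely to be worth deploying.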

模型训练、验证和测试将在第七章中详细讨论。

模型部署与监控

模型部署与监控涉及将时间序列分析模型从开发环境过渡到生产环境，并持续监控其性能。这种持续的监控使得模型能够在数据模式或被分析的底层系统行为发生变化时，进行重新训练和更新。

现在我们将进一步详细说明这些步骤：

部署：模型被部署到生产环境中的模型服务框架中。这可以通过像 Kubernetes 和 Docker 这样的工具进行容器化，或者部署到基于云的解决方案中，例如 Databricks 模型服务、Amazon SageMaker、Azure 机器学习或 Google Vertex AI。一旦部署，模型可以用于批量推理，按定期间隔安排，或基于持续流数据源进行实时推理，或响应 API 请求。
监控：一旦模型在生产环境中部署，就需要进行监控，以确保模型持续适配目的并有价值。随着数据漂移（数据特征随时间的变化）和概念漂移（模型对现实的表征随时间恶化），模型的准确性会下降。这可以通过模型监控来检测，并根据情况发送警报。
再训练：当监控警报提示出现漂移时，如果漂移足够显著，下一步是对模型进行再训练。这可以手动启动，也可以自动化。如果再训练未能产生足够准确的模型，我们将不得不回到模型开发周期，寻找适合目的的其他模型。
治理：这包括几个关键的考虑因素。我们需要在模型的整个生命周期和相关过程中跟踪模型版本和生命周期阶段。此外，为了审计目的，会保存训练、部署和准确性指标的日志，在某些情况下，还会保存模型推理的请求和响应。其他考虑因素包括模型的访问控制，并确保它符合所有法律和合规要求，尤其是在处理个人或敏感数据时。
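The monitoring and retraining steps above can be sketched with two simple checks: one for a shift in the input data and one for a degradation in model error. These are illustrative heuristics under stated assumptions, not a standard drift test: the function names, thresholds, and sample values are hypothetical, and the mean-shift check assumes roughly independent observations.

```python
from statistics import mean, stdev


def mean_shift_detected(reference, recent, threshold=3.0):
    """Flag data drift when the recent mean departs from the reference
    mean by more than `threshold` standard errors."""
    standard_error = stdev(reference) / len(recent) ** 0.5
    return abs(mean(recent) - mean(reference)) > threshold * standard_error


def needs_retraining(recent_errors, baseline_mae, tolerance=1.5):
    """Flag retraining when the recent MAE exceeds the baseline MAE
    by more than the tolerance factor."""
    recent_mae = mean(abs(e) for e in recent_errors)
    return recent_mae > tolerance * baseline_mae


reference = [10, 11, 9, 10, 12, 10, 11, 9]  # values seen at training time
stable = [10, 11, 10, 9]                    # recent window, no drift
shifted = [15, 16, 14, 15]                  # recent window after a shift

print(mean_shift_detected(reference, stable))
print(mean_shift_detected(reference, shifted))
print(needs_retraining([2.0, -1.5, 1.8], baseline_mae=0.5))
```

For seasonal or trending series, the reference window should cover comparable periods; otherwise, normal seasonality would be reported as drift.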

总结来说，时间序列分析项目的 ModelOps 涵盖了从开发、部署到维护模型的端到端过程，同时与 DataOps 的相关数据需求有所交集。ModelOps 确保持续改进、可复现性、协作和与业务目标的适配性。它还维持模型的有效性，并确保模型随时间持续提供价值。

我们将在第七章中详细介绍 ModelOps 的实际考虑因素，构建和测试模型。接下来的阶段是 DevOps，我们现在将详细介绍。

DevOps

紧接着 ModelOps 之后，DevOps 是一组实践和工具，旨在平滑开发（Dev）与运维（Ops）之间的交接。这适用于模型及其相关的应用程序代码。通过自动化时间序列应用的构建、测试、部署和监控，DevOps 确保它们是可靠的、可扩展的，并能持续为业务提供价值。

DevOps 实践大致可以分为持续集成/持续部署（CI/CD）、基础设施管理、监控和治理。

CI/CD

CI/CD 涉及自动化时间序列分析模型的集成和部署，以便对生产环境进行无缝更新。

这包括以下步骤：

代码和模型版本管理与仓库：代码和模型的变化需要进行跟踪，并且如果需要，可以回滚到以前的版本。这意味着代码和模型需要进行版本控制，并存储在一个仓库中，便于访问不同的版本。
测试：每当时间序列模型和相关代码发生变化时，确保没有回归是至关重要的。确保这一点的一种方法是通过自动化测试，进行单元测试和集成测试，这些测试可以在生产监控检测到性能下降时启动，或者在开发环境中模型或相关代码发生变化时启动。
部署：一旦时间序列模型和代码在开发环境中准备好，接下来的步骤就是部署到预生产和生产环境。推荐使用 CI/CD 管道自动化此部署，以最小化由于手动步骤引起的错误风险，并使该过程成为无缝、可重复和可扩展的。

总结来说，CI/CD 管道确保新功能、改进和漏洞修复能够持续集成、测试和部署，同时最大程度地减少停机时间，提高新代码发布的效率。

基础设施管理

基础设施即代码（IaC）是一种推荐的配置方法，因为它使基础设施配置可以进行版本控制、自我文档化、可重现并且可扩展。这是设置计算、存储和网络配置一致性的一种方式。在云环境等虚拟环境中，基础设施本身在某种意义上是版本控制的，因为它是软件定义的。

除了前述核心资源外，安全特定的配置还需要为访问控制、加密和网络安全防火墙提供配置。

随着应用需求的变化，相应的工作负载也会变化，可能需要更多或更少的基础设施资源。一个可扩展的基础设施管理流程确保基础设施根据需求自动进行扩展。

监控、安全性和治理

DevOps 在监控、安全性和治理方面的要求与 DataOps 和 ModelOps 类似。DevOps 的范围涵盖了部署到生产环境中的一切，包括模型、代码和配置。这通常通过诸如应用程序、安全性和合规性监控、日志记录和警报、以及事件管理等流程来实现。

总结来说，DevOps 通过自动化部署、管理和扩展，确保应用程序（包括时间序列分析）具有高可用性和可扩展性。关键在于通过促进协作和使用自动化，使得从 Dev 到 Ops 的过渡无缝化，确保时间序列分析项目能够从用例概念演变为技术实现，再到驱动显著业务影响和价值的全面运营系统。

现在我们已经理解了时间序列分析项目的端到端阶段，接下来的部分将提供实际示例和工具，以实施我们在本章中所学的内容。

实施示例和工具

定义了端到端阶段后，本节将探讨两种实施示例：基于笔记本的方法和基于协调器的方法。

注意

如果你不打算构建自己的端到端环境，可以跳过本节的实践部分，使用像 Databricks 这样的托管平台，正如我们将在后续章节中所做的那样。

让我们从设置运行示例所需的环境开始。

环境设置

我们将使用 Docker 容器，正如在 第三章 中所示，用于平台基础设施。有关安装 Docker 的说明，请参考 第三章 中的 使用容器进行部署 部分。

Docker 的替代方案

你可以使用 Podman 作为 Docker 的开源替代方案。你可以在这里找到更多信息：podman.io/。

在我们可以部署 Docker 容器之前，我们将在下一部分验证容器将使用的网络端口是否存在冲突。

网络端口

以下网络端口需要在你的本地机器或开发环境中可用：

Apache Spark: 7077、8070 和 8081
Jupyter Notebook: 4040、4041、4042 和 8888
MLflow: 5001
Airflow: 8080

你可以通过以下命令检查现有应用程序是否正在使用这些端口，从命令行或终端运行：

% netstat -an | grep LISTEN

如果你看到所需的端口已在使用的端口列表中，你必须停止使用该端口的应用程序，或者修改 docker-compose 文件以使用其他端口。

作为示例，假设前面的 netstat 命令输出显示本地机器或开发环境中端口 8080 已被占用，且你无法停止使用该端口的现有应用程序。

在这种情况下，你需要在 docker-compose.yaml 文件中将端口 8080（用于 Airflow Web 服务器）更改为另一个未使用的端口。只需搜索并将冒号（:）左边的 8080 替换为 8090，如果该端口未被占用，如下所示：

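From this:

     ports:
      - '7077:7077'
      - '8080:8080'

To this:

     ports:
      - '7077:7077'
      - '8090:8080'

Take note of the new port and use it wherever you need to enter the corresponding URL. In this example, port 8080 is changed to 8090, and the matching URL for the Airflow web server changes as follows:

From this:

http://localhost:8080/

To this:

http://localhost:8090/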

注意

你需要在以下部分中的所有 URL 中更改网络端口，按照本节的说明进行修改。

环境启动

一旦 Docker 安装并运行，且网络端口配置已验证，以下指令将指导你设置和启动环境：

我们首先从本章的 Git 仓库下载部署脚本，仓库地址如下：

github.com/PacktPublishing/Time-Series-Analysis-with-Spark/tree/main/ch4

我们将使用git clone友好的 URL，具体如下：

github.com/PacktPublishing/Time-Series-Analysis-with-Spark.git

为此，启动终端或命令行并运行以下命令：
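```
git clone https://github.com/PacktPublishing/Time-Series-Analysis-with-Spark.git
cd Time-Series-Analysis-with-Spark/ch4
```
On macOS, you may see the following error when you run the preceding git command:
```
xcrun: error: invalid active developer path (/Library/Developer/CommandLineTools), missing xcrun at: /Library/Developer/CommandLineTools/usr/bin/xcrun
```
In this case, you will need to reinstall the command-line tools with the following command:
```
xcode-select --install
```
We can now start building and launching the containers. A makefile is provided to simplify starting and stopping the containers. The following command builds the Docker images for the containers and starts them:
```
make up
```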

Windows 环境

如果你使用的是 Windows 环境，你可以按照以下文档安装 Windows 版本的make：gnuwin32.sourceforge.net/packages/make.htm

make up命令将输出以下或等效的内容：

make prep && docker-compose up -d
sh prep-airflow.sh
[+] Running 9/9
✔ Container ts-spark-env-spark-master-1      Started
✔ Container ts-spark-env-postgres-1          Healthy
✔ Container ts-spark-env-mlflow-server-1     Started
✔ Container ts-spark-env-jupyter-1           Started
✔ Container ts-spark-env-airflow-init-1      Exited
✔ Container ts-spark-env-spark-worker-1-1    Started
✔ Container ts-spark-env-airflow-scheduler-1 Running
✔ Container ts-spark-env-airflow-triggerer-1 Running
✔ Container ts-spark-env-airflow-webserver-1 Running

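When you run the preceding make up command, you may see the following error:

open /Users/<USER_LOGIN>/.docker/buildx/current: permission denied

In this case, fix the permissions on that path and then rerun the make up command.

You may also encounter an error if your environment uses bash instead of sh and the script cannot find sh. In this case, change the last line of the makefile from "sh prep-airflow.sh" to "bash prep-airflow.sh" and run the make up command again.

At the end of this process, you will have a running Spark cluster and a separate Jupyter Notebook node, as in Chapter 3. In addition, we have deployed the following components here: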

MLflow – 一个开源平台，最初由 Databricks 开发，用于管理端到端的机器学习生命周期。MLflow 具有实验和部署功能，旨在与任何机器学习库和编程语言兼容。这使得它在各种环境和用例中具有灵活性，也解释了它被广泛采用的原因。

你可以在这里找到更多信息：mlflow.org/。
Apache Airflow – 由 Airbnb 创建，Airflow 是一个开源平台，用于协调数据处理管道和计算工作流。通过能够以编程方式定义、调度和监控大规模的工作流，Airflow 被广泛采用，包括数据工程师和数据科学家在内，用于各种类型的工作流。

你可以在这里找到更多信息：airflow.apache.org/。
Postgres – 这是 Airflow 后台使用的关系型数据库。

现在让我们验证刚刚部署的环境。

访问用户界面

现在我们将访问不同组件的用户界面（UIs），作为快速验证部署的一种方式：

请按照第三章中的指示验证 Jupyter Notebook 和 Apache Spark 集群的部署。请注意，由于 Airflow Web 服务器使用端口8080，这是我们在第三章中为 Apache Spark 使用的相同端口，因此我们已将 Spark 主节点更改为以下本地网址：

http://localhost:8070/
MLflow 可以通过以下本地网址访问：

http://localhost:5001/

这将打开如图 4.4所示的网页。

图 4.4：MLflow

下一个 UI，图 4.5，是 Airflow 界面，可以通过以下本地网址访问：

http://localhost:8080/

默认的用户名和密码是airflow，强烈建议更改。

图 4.5：Airflow

我们现在已经设置好环境，接下来将使用这个环境。

Notebook 方法

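We have been using notebooks since Chapter 1, starting with Databricks Community Edition. In Chapter 3, we deployed our own Jupyter Notebook environment, which is an open source implementation. As we have seen so far, notebooks provide a feature-rich, document-style interface in which we can combine executable code, visualizations, and text. This makes notebooks popular for interactive and collaborative work in data science and machine learning. Notebooks can also be built for non-interactive execution. Since they are already in use during the early data science and experimentation stages, this makes it convenient to extend them into end-to-end notebooks.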

在这个第一个示例中，我们将使用基于第一章中介绍的基于 Prophet 的代码的全功能笔记本。如果你按照之前的环境设置部分中的指示操作，示例笔记本应该可以直接在 Jupyter Notebook UI 中访问，在左侧文件夹导航面板的work / notebooks位置，如图 4.6所示，网址为：http://localhost:8888/lab。

图 4.6：Notebook

笔记本也可以从以下 GitHub 位置下载：

github.com/PacktPublishing/Time-Series-Analysis-with-Spark/raw/main/ch4/notebooks/ts-spark_ch4_data-ml-ops.ipynb

本文重点更多地放在结构上，而非代码本身，代码内容与第一章变化不大，我们将笔记本分成了以下几个部分：

配置
DataOps
从源获取数据
转换数据
ModelOps
训练并记录模型
使用模型进行预测

除了前面解释的结构外，从第一章引入的代码中还有 MLOps 部分，接下来我们将详细说明。

使用 MLflow 进行 MLOps

在此笔记本示例中，我们使用 MLflow 作为工具来实现多个 MLOps 需求。以下代码片段专注于这一部分：

mlflow.set_tracking_uri("http://mlflow-server:5000")
mlflow.set_experiment(
    'ts-spark_ch4_data-ml-ops_time_series_prophet_notebook')
with mlflow.start_run():
    model = Prophet().fit(pdf)
…
    mlflow.prophet.log_model(
        model, artifact_path=ARTIFACT_DIR,
        signature=signature)
    mlflow.log_params(param)
    mlflow.log_metrics(cv_metrics)

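The MLflow functions used in the preceding code are as follows:

set_tracking_uri – This function sets the URI of the tracking server, where MLflow stores model-related information. Centralizing model data in this way facilitates collaboration between team members. The tracking server can be a remote server or a local file path.
set_experiment – This function creates a new experiment or uses an existing one. An experiment is a logical grouping of runs (individual model trainings or trials), which helps to organize and compare different trials.
start_run – This starts a new MLflow run within the given experiment. A run represents a single training or trial and groups the related artifacts, such as parameters, metrics, and models.
prophet.log_model – This function logs a Prophet model as an artifact in the current MLflow run.
log_params – This function logs key-value pairs for the parameters used during the run. Parameters are the model configuration.
log_metrics – This function logs key-value pairs for the metrics evaluated during the run. Metrics are numerical values about model performance (for example, mean squared error or accuracy).

The outcome can be accessed via the MLflow UI at the following URL: http://localhost:5001/.

This opens a UI page similar to Figure 4.4, where you can navigate in the left panel to the experiment named ts-spark_ch4_data-ml-ops_time_series_prophet_notebook. The experiment name seen in the UI is the one set with set_experiment in the preceding code.

The Overview tab of the experiment, shown in Figure 4.7, contains information about the experiment, such as the creator, creation date, status, the source code that created the experiment, and the model logged from the experiment. It also shows the model parameters and metrics logged in the code.

Figure 4.7: MLflow experiment overview

The Model metrics tab, shown in Figure 4.8, can be used to search for and view metric charts.

Figure 4.8: MLflow model metrics

The initial view of the Artifacts tab, shown in Figure 4.9, displays the model schema that we logged as the signature in the code. It also provides code examples on how to use the model.

Figure 4.9: MLflow model schema

The MLmodel section of the Artifacts tab shows the model artifacts and their paths, as per Figure 4.10.

Figure 4.10: MLflow model artifacts

This is as far as we will go with MLflow in this example. We will use MLflow in a similar way in the next example with the orchestrator, and go deeper into further uses of MLflow in Chapter 9, Going to Production. For now, we turn to other considerations for the notebook approach.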

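Multiple notebooks

The notebook example here is only a starting point, which can be adapted and extended based on the requirements of your own use case and on the techniques discussed in the following chapters. For more complex requirements, it is advisable to use separate notebooks for the following:

Exploratory data analysis and data science
Feature engineering
Model development, selection, and deployment of the best model
Production data pipelines, possibly including feature engineering
Production model inference
Monitoring
Model retraining

Although notebooks are favored for their interactivity, ease of collaboration, relative simplicity, and versatility, they also have limitations, which we cover next.

Limitations

However good notebooks are, an end-to-end time series analysis built on notebooks faces several challenges:

Lack of scheduling and orchestration features. This makes it difficult to go beyond simple sequential workflows and develop complex ones.
Scalability issues. Notebook code runs in the notebook kernel and is limited by the resources of the machine it runs on. Note that this can be addressed by submitting tasks from the notebook to run on an Apache Spark cluster, as we have done in the example.
Lack of error handling. If the code in a notebook cell fails, the whole workflow execution stops. Error handling can of course be coded, but this adds to the coding effort.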
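To give a sense of the extra coding effort that the last point refers to, the following is a minimal sketch of hand-written retry logic around a notebook step. The `run_with_retries` helper and the `flaky_ingest` step are hypothetical names made up for this illustration; they are not part of the book's notebook.

```python
import time


def run_with_retries(step, retries=1, delay_seconds=0.0):
    """Run step(); on failure, retry up to `retries` more times, then re-raise."""
    attempt = 0
    while True:
        try:
            return step()
        except Exception as err:
            if attempt >= retries:
                raise  # retries exhausted: let the notebook cell fail
            attempt += 1
            print(f"step failed ({err!r}), retry {attempt}/{retries}")
            time.sleep(delay_seconds)


# Hypothetical flaky step: fails on the first call, succeeds on the second.
calls = {"count": 0}


def flaky_ingest():
    calls["count"] += 1
    if calls["count"] == 1:
        raise RuntimeError("source temporarily unavailable")
    return "ingested"


result = run_with_retries(flaky_ingest, retries=1)
print(result)
```

Every step that needs this behavior has to be wrapped by hand, and alerting or skipping logic would have to be added on top. An orchestrator provides these as configuration, as the retries and retry_delay settings later in this chapter show.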

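To address these challenges, we next consider another approach, using an orchestrator.

Orchestrator approach

Before getting into this approach, let's first understand what an orchestrator is. We will use Airflow here, which we mentioned earlier as an example.

Orchestrators play a central role in managing workflows, including data engineering and processing. A workflow, or pipeline, is a set of computational tasks executed in a given order, in parallel or sequentially, usually depending on the outcome of one or more previous tasks. In addition to scheduling workflows, an orchestrator usually also has features to author workflows before scheduling and to monitor their execution afterward.

Benefits of orchestrators

Using an orchestrator provides the following advantages over a notebook-only approach:

Scheduling of tasks within a workflow, taking into account task dependencies and the requirements for parallel or sequential execution. This also includes conditional logic for task execution.
Scalable and distributed task execution.
Monitoring and logging of workflow execution, including performance and errors. This is crucial for production environments.
Error handling and alerting, with the possibility to retry, skip to the next task, or fail the whole pipeline. This is also a key requirement in production environments.
Integration with other systems and tools. This is necessary to build end-to-end workflows covering DataOps, ModelOps, and DevOps, which usually means working with different specialized tools.

Now that we have seen the benefits and have the environment set up with Airflow as the orchestrator, let's get into practice.

Creating the workflow

The first step is to create the workflow, also called a directed acyclic graph (DAG).

If you followed the instructions in the earlier environment setup section, the example DAG is already loaded and directly accessible from the Airflow UI, as shown in Figure 4.5, at http://localhost:8080/. At this point, you can skip to the next section to run the DAG, or continue here to look at the details of the DAG code.

The DAG definition is in a Python code file in the dags folder, which can also be downloaded from the following GitHub location: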

github.com/PacktPublishing/Time-Series-Analysis-with-Spark/raw/main/ch4/dags/ts-spark_ch4_airflow-dag.py

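The core of the code is very similar to what we saw in the earlier notebook example. This section focuses on the integration with Airflow and on defining the tasks, which are the individual steps of the DAG.

Task definitions – Python code

When the orchestrator runs a task, it calls the corresponding Python function shown here as the underlying code to execute. Note the function parameters passed in and the return values. These align with the definitions of the tasks, which we will see in the next section:

ingest_data – Corresponds to task t1. Note that spark.read runs on the Spark cluster: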

def ingest_data():
    sdf = spark.read.csv(
        DATASOURCE, header=True, inferSchema=True)
    pdf = sdf.select("date", "daily_min_temperature").toPandas()
    return pdf

transform_data – Corresponds to task t2:

def transform_data(pdf, **kwargs):
    pdf.columns = ["ds", "y"]
    pdf["y"] = pd.to_numeric(pdf["y"], errors="coerce")
    pdf.drop(index=pdf.index[-2:], inplace=True)
    pdf.dropna(inplace=True)
    return pdf

train_and_log_model – Corresponds to task t3. Note that the MLflow functions, such as mlflow.set_experiment and mlflow.prophet.log_model, make calls to the MLflow server. Only an extract of the code is shown here:

def train_and_log_model(pdf, **kwargs):
    mlflow.set_experiment(
        'ts-spark_ch4_data-ml-ops_time_series_prophet')
    …
        mlflow.prophet.log_model(
            model, artifact_path=ARTIFACT_DIR,
            signature=signature)
    …
        return model_uri

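forecast – Corresponds to task t4. Note that mlflow.prophet.load_model loads the model from the MLflow server. This is only to show how a model can be retrieved from the MLflow server. It is not strictly required, as we could have kept a reference to the model locally: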

def forecast(model_uri, **kwargs):
    _model = mlflow.prophet.load_model(model_uri)
    forecast = _model.predict(
        _model.make_future_dataframe(30))
    forecast[
        ['ds', 'yhat', 'yhat_lower','yhat_upper']
    ].to_csv('/data/ts-spark_ch4_prophet-forecast.csv')
    return '/data/ts-spark_ch4_prophet-forecast.csv'

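These functions are referenced by the DAG, which we define next.

DAG definition

Sitting above the preceding task definitions is the high-level Airflow DAG, which is defined as follows: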

dag = DAG(
    'ts-spark_ch4_data-ml-ops_time_series_prophet',
    default_args=default_args,
    description='ts-spark_ch4 - Data/MLOps pipeline example - Time series forecasting with Prophet',
    schedule_interval=None
)

This points to default_args, which contains the following DAG parameters:

default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime(2024, 1, 1),
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=1),
}

More information on these parameters can be found in the following Airflow documentation:

airflow.apache.org/docs/apache-airflow/stable/_api/airflow/models/baseoperator/index.html#airflow.models.baseoperator.BaseOperator

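We set schedule_interval to None, so that no schedule is defined, because we want to trigger the DAG manually from the Airflow UI.

DAG tasks

The DAG tasks are defined as follows. Note the references to dag and to the underlying Python functions defined earlier. Using PythonOperator means that the task will call a Python function:

t1: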

t1 = PythonOperator(
    task_id='ingest_data',
    python_callable=ingest_data,
    dag=dag,
)

t2:

t2 = PythonOperator(
    task_id='transform_data',
    python_callable=transform_data,
    op_kwargs={'pdf': t1.output},
    provide_context=True,
    dag=dag,
)

A notable point for task t2 is that the output of task t1, t1.output, is passed as the input pdf to task t2.

t3:

t3 = PythonOperator(
    task_id='train_and_log_model',
    python_callable=train_and_log_model,
    op_kwargs={'pdf': t2.output},
    provide_context=True,
    dag=dag,
)

The output of task t2, t2.output, is passed as the input pdf to task t3.

t4:

t4 = PythonOperator(
    task_id='forecast',
    python_callable=forecast,
    op_kwargs={'model_uri': t3.output},
    provide_context=True,
    dag=dag,
)

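The output of task t3, t3.output, is passed as the input model_uri to task t4.

These tasks are then configured with the following code so that Airflow orchestrates them sequentially: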

# Task dependencies
t1 >> t2 >> t3 >> t4

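This concludes the definition of the workflow as a DAG in Airflow. The example here is only a starting point, showing a simple sequential workflow, which can be adapted and extended based on your specific requirements and on the time series analysis tasks discussed in the following chapters.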
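The dependency chain t1 >> t2 >> t3 >> t4 tells the orchestrator which tasks must finish before another can start. The following is a minimal sketch of how such dependencies resolve into an execution order, using Python's standard graphlib module (Python 3.9 or later). It only illustrates the idea of ordering a DAG and is not how Airflow's scheduler is implemented.

```python
from graphlib import TopologicalSorter

# Map each task to the set of tasks it depends on (its upstream tasks).
# The task names match the task_id values used in the DAG above.
dependencies = {
    "ingest_data": set(),
    "transform_data": {"ingest_data"},
    "train_and_log_model": {"transform_data"},
    "forecast": {"train_and_log_model"},
}

# A topological sort yields an order in which every task comes after
# all of the tasks it depends on.
execution_order = list(TopologicalSorter(dependencies).static_order())
print(execution_order)
```

For a purely sequential chain like this one there is only one valid order. With branching dependencies, several orders can be valid, and tasks with no dependency on each other become candidates for parallel execution.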

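Orchestrating notebooks

Note that it is also possible to combine the orchestrator and notebook approaches by calling a notebook from an Airflow task with the PapermillOperator operator. More information on this operator is available here: airflow.apache.org/docs/apache-airflow-providers-papermill/stable/operators.html.

Once the DAG has been written and placed in Airflow's dags folder, Airflow picks it up automatically, checks the Python definition file for syntax errors, and then lists it among the DAGs available to run, which we cover next.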

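Running the workflow

The workflow can be started by clicking the run button (>) to the right of the DAG, as visible in Figure 4.5 in the Accessing the UIs section. By clicking on the DAG name in the left panel, you can view the details and graph of the DAG in the Airflow UI, as shown in Figure 4.11.

Figure 4.11: Airflow DAG

To see information on a specific run and task of the DAG, select the run on the left and then the task from the graph. This gives the option to view the task execution log.

Another interesting piece of information is the execution time of the different tasks, which can be viewed in the Gantt tab of the same interface.

We are only scratching the surface of Airflow here. It is a feature-rich tool, and covering it fully is beyond the scope of this book. Refer to the Airflow documentation for more information.

As mentioned earlier, part of the code runs on the Apache Spark cluster. This can be visualized from the Spark master node, as per Figure 3.6 in Chapter 3. The URL is http://localhost:8070/. If the application is still running, the Spark UI will show a running application. This application is the Spark code launched from the Airflow task.

For MLflow, you can view the outcome in the MLflow UI at the following URL: http://localhost:5001/.

On the MLflow UI page, similar to Figure 4.4, you can navigate in the left panel to the experiment named ts-spark_ch4_data-ml-ops_time_series_prophet. The experiment name seen in the UI comes from the code, where it is set in the train_and_log_model function shown earlier.

This concludes the second approach discussed in this chapter. We will build on this orchestrator example in the following chapters, using the concepts we learn there.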

Environment shutdown

We can now stop the container environment. The provided makefile simplifies this, using the following command:

make down

This produces the following or equivalent output:

docker-compose down
[+] Running 10/10
 ✔ Container ts-spark-env-spark-worker-1-1    Removed
 ✔ Container ts-spark-env-mlflow-server-1     Removed
 ✔ Container ts-spark-env-airflow-scheduler-1 Removed
 ✔ Container ts-spark-env-airflow-webserver-1 Removed
 ✔ Container ts-spark-env-jupyter-1           Removed
 ✔ Container ts-spark-env-airflow-triggerer-1 Removed
 ✔ Container ts-spark-env-airflow-init-1      Removed
 ✔ Container ts-spark-env-postgres-1          Removed
 ✔ Container ts-spark-env-spark-master-1      Removed
 ✔ Network ts-spark-env_default               Removed

If you do not intend to use the environment further, you can delete the Docker containers that were created by using the Delete action, as explained here: docs.docker.com/desktop/use-desktop/container/#container-actions.

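Summary

This chapter walked through the key stages of a time series analysis project, starting with the selection of a use case that corresponds to a business requirement. The use case was then mapped to a technical solution with DataOps, ModelOps, and DevOps components. Finally, we explored two implementation approaches, with baseline implementation examples based on an all-in-one notebook and on an orchestrator, which will be extended further in the rest of this book.

The next chapter builds on this with a focus on DataOps, through data preparation.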

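Chapter 5: Data Preparation

So far, we have covered the foundations of time series and Apache Spark, as well as the complete life cycle of a time series analysis project. In this chapter, we dive into the crucial steps of organizing, cleaning, and transforming time series data for effective analysis. This includes handling missing values, dealing with outliers, and structuring the data to fit Spark's distributed computing model. This information will help you ensure data quality and compatibility with Spark, laying a solid foundation for accurate and efficient time series analysis. Proper data preparation enhances the reliability of subsequent analytical processes, making this chapter an essential prerequisite for using Spark to extract meaningful insights from time-dependent datasets.

We will cover the following main topics in this chapter:

Data ingestion and persistence
Data quality checks and cleaning
Transformations

Technical requirements

This chapter is mostly hands-on coding, covering the data preparation steps commonly found in time series analysis projects. The code for this chapter can be found in the ch5 folder of the book's GitHub repository at this URL:

github.com/PacktPublishing/Time-Series-Analysis-with-Spark/tree/main/ch5

Note

The code for this chapter is used with the Databricks Community Edition, following the approach explained in Chapter 1 and in this chapter.

Data ingestion and persistence

In this section, we cover ways of getting time series data from data sources and persisting the dataset to storage.

Ingestion

Ingestion is the process of retrieving data from source systems for further processing and analysis. It can be executed in batch, either to ingest a large amount of data as a one-off or to run automatically on a regular schedule, such as nightly. Alternatively, if the data is provided continuously by the source system and is required in real time, structured streaming can be used as another ingestion method.

Note

Technically, we can code the ingestion process as structured streaming and configure it to run at triggered intervals. This gives the flexibility to adapt to changing business requirements on data freshness without redeveloping the ingestion process.

This chapter focuses on batch ingestion, which is currently the most common approach. We will also briefly discuss structured streaming, which is quickly gaining adoption and, in some organizations, has even overtaken batch ingestion.

Batch ingestion

Batch ingestion is typically done from file storage or from a database.

From file storage

As we have seen in the hands-on sections of the previous chapters, reading from a file is a commonly used batch ingestion method. With Apache Spark, this can be done with spark.read: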

df = spark.read.csv("file_path", header=True, sep=";", inferSchema=True)

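In this example, we read a CSV-format file from the file_path storage location. The first line of the file contains the header. The columns are separated by the ; character. With inferSchema, we ask Spark to automatically infer the data columns present in the file and their types.

This example is based on the code in ts-spark_ch5_1.dbc, which can be imported from the Chapter 5 GitHub location mentioned in the Technical requirements section into the Databricks Community Edition, following the approach explained in Chapter 1.

The URL for the code is github.com/PacktPublishing/Time-Series-Analysis-with-Spark/raw/main/ch5/ts-spark_ch5_1.dbc.

The ingested data can be further processed and analyzed based on the code examples provided in this chapter, as shown in Figure 5.1.

Figure 5.1: Viewing the ingested data

When reading files, it is also possible to read multiple files from a storage folder by providing the folder location instead of a specific file location. This is a common pattern for file ingestion. Another frequently used feature is to provide a filter (pathGlobFilter) to include only the files whose names match a pattern.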
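As a small illustration of what a file name filter does, the following sketch applies a glob pattern to a list of file names with Python's standard fnmatch module. The file names are made up for the example, and the matching rules of Spark's pathGlobFilter are only assumed to agree with fnmatch for a simple pattern such as *.csv.

```python
from fnmatch import fnmatchcase

# Hypothetical file names found in an ingestion folder.
files = [
    "power_2008-01.csv",
    "power_2008-02.csv",
    "readme.txt",
    "power_2008-01.csv.bak",
]
pattern = "*.csv"

# Keep only the names that match the glob pattern.
selected = [name for name in files if fnmatchcase(name, pattern)]
print(selected)

# With Spark, the same intent would be expressed along these lines,
# where folder_path is a placeholder for the storage folder:
#   spark.read.option("pathGlobFilter", pattern).csv(folder_path, header=True)
```

Note that the backup file ending in .csv.bak is excluded, because the pattern has to match the whole file name.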

There are many other options for spark.read, depending on the data source being read. The following Apache Spark documentation on data sources details these options:

spark.apache.org/docs/latest/sql-data-sources.html

From a database

Another commonly used type of data source is a relational database. The following is an example of reading from PostgreSQL:

df = spark.read \
    .format("jdbc") \
    .option("url", "jdbc:postgresql:dbserver") \
    .option("dbtable", "schema.tablename") \
    .option("user", "username") \
    .option("password", "password") \
    .load()

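This is further detailed in the following documentation: spark.apache.org/docs/latest/sql-data-sources-jdbc.html

Data from specialized time series databases, such as QuestDB, can be ingested in a similar way, as follows: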

df = spark.read.format("jdbc") \
    .option("url", "jdbc:postgresql://localhost:8812/questdb") \
    .option("driver", "org.postgresql.Driver") \
    .option("user", "admin") \
    .option("password", "quest") \
    .option("dbtable", "timeseries_table") \
    .load()

This is further detailed in the following documentation:

questdb.io/blog/integrate-apache-spark-questdb-time-series-analytics/

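Note

You will need to include the JDBC driver for the particular database on the Spark classpath. The previously referenced documentation explains this.

Structured streaming

For event-driven or near-real-time processing with Apache Spark, time series data can be ingested from streaming data sources such as Apache Kafka, Amazon Kinesis, Google Cloud Pub/Sub, and Azure Event Hubs. This typically involves setting up Spark structured streaming with the corresponding connector for the data source.

The following example shows how to ingest data from Apache Kafka with Spark: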

df = spark \
    .readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "host1:port1,host2:port2") \
    .option("subscribe", "topic1") \
    .load()

The Apache Spark documentation provides more details on reading from streaming sources:

spark.apache.org/docs/latest/structured-streaming-programming-guide.html#input-sources

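Once the data has been ingested, the next step is to persist it to storage for further processing, as we will see next.

Persistence

Data is typically persisted to files on disk or to a database. For files, Apache Spark has a mature solution in Delta Lake, an open source storage protocol.

Note

Apache Iceberg is another common open source storage protocol.

Delta brings ACID transactions to Apache Spark and big data workloads, effectively combining the advantages of file storage and database storage. This combination is known as a lakehouse (a merging of data lake and data warehouse). Delta is based on the Parquet file format and provides features such as schema enforcement, data versioning, and time travel.

The following example shows how to persist time series data in the Delta storage format using Python:

df.write.format("delta").mode("overwrite") \
    .option("path", delta_table_path) \
    .saveAsTable(table_name)

In this example, the data is written to the `delta_table_path` storage location. The `overwrite` mode means that existing data at this location will be overwritten. With Delta format, the data is written as a table that is given the name specified in `table_name`.
This example is based on the code in `ts-spark_ch5_1.dbc`, which we imported in the earlier section on batch ingestion.
There are many other options for `df.write`, depending on the destination being written to. The following Apache Spark documentation on saving details these options:
[`spark.apache.org/docs/latest/sql-data-sources-load-save-functions.html#saving-to-persistent-tables`](https://spark.apache.org/docs/latest/sql-data-sources-load-save-functions.html#saving-to-persistent-tables)
When the data is persisted in Delta format, metadata is stored to disk together with the data. This can be retrieved with the following code: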

from delta.tables import DeltaTable
# Load the Delta table as a DeltaTable object
delta_table = DeltaTable.forPath(spark, delta_table_path)
# Details of the Delta table
print("Delta table details:")
delta_table.detail().display()

Note
In the code example, we did not have to install Delta as it is already installed when using the Databricks Community Edition. You will need to install the Delta packages if you are using another Apache Spark environment where Delta is not pre-installed. You can find the instructions here: [`docs.delta.io/latest/quick-start.html`](https://docs.delta.io/latest/quick-start.html).
Figure 5.2 shows some of the metadata such as location and creation date.
![](https://github.com/OpenDocCN/freelearn-ds-pt3-zh/raw/master/docs/ts-anal-spk/img/B18568_05_02.jpg)

Figure 5.2: Metadata for the Delta table
Once data has been persisted, it can be read from storage as needed at a later stage for querying and analysis. The `spark.read` command can be used here as well, as per the following example:

spark.read.load(delta_table_path).display()


The Delta table storage location, `delta_table_path`, is passed to the `load` command, which retrieves the stored table from the disk storage.
As mentioned earlier, Spark can also write to a database, among other destinations. The following example shows how to write to a PostgreSQL database.

jdbcDF.write \
    .format("jdbc") \
    .option("url", "jdbc:postgresql:dbserver") \
    .option("dbtable", "schema.tablename") \
    .option("user", "username") \
    .option("password", "password") \
    .save()

This is further detailed in the following documentation: [`spark.apache.org/docs/latest/sql-data-sources-jdbc.html`](https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html)
Note
You will need to include the JDBC driver for the particular database on the Spark classpath. The previously referenced documentation explains this.
As seen in this section, persistence allows longer-term storage and retrieval. Delta also stores different versions of the data whenever it changes, which we will investigate next.
Versioning
Data versioning is one of the key features provided by Delta Lake, allowing you to keep track of changes made to your data over time. This storage of different versions is done in an optimal way to minimize storage footprint.
With a record of different versions, Delta enables a functionality called **time travel**. With this, you can query data at specific versions or timestamps, revert to previous versions, and perform time travel queries. This is also useful from a reproducibility point of view, whereby we can go back to the specific version of data used previously, even if it has since changed, to audit, review, and redo an analysis.
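The code provided in this chapter has an example of using versioning and time travel. The following extract shows how to read a specific version of the Delta table. `version_as_of` is an integer representing the version number:

df_ = spark.read.format("delta") \
    .option("versionAsOf", version_as_of) \
    .load(delta_table_path)

A version can also be read by timestamp, where `timestamp_as_of` represents the timestamp of the version of interest:

df_ = spark.read.format("delta") \
    .option("timestampAsOf", timestamp_as_of) \
    .load(delta_table_path)

The versions of the Delta table can be listed with the `history` command, as follows:

print("Delta table history - after modification:")
delta_table.history().display()

An example of output from the history is shown in Figure 5.3.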
![](https://github.com/OpenDocCN/freelearn-ds-pt3-zh/raw/master/docs/ts-anal-spk/img/B18568_05_03.jpg)

Figure 5.3: Delta table versions
Finally, it is possible to restore the Delta table back to a previous version with the `restoreToVersion` command, overwriting the latest version, as per the following:

delta_table.restoreToVersion(latest_version)


You can also find more information on time travel here:
[`delta.io/blog/2023-02-01-delta-lake-time-travel/`](https://delta.io/blog/2023-02-01-delta-lake-time-travel/)
This concludes the section on ingestion and persistence. We will now move on to verify and clean the data.
Data quality checks, cleaning, and transformation
Once the data has been ingested from source systems to a storage location from which we can access it, we will need to ensure that it is of usable quality and, if not, do the necessary cleaning and transformation.
Data quality checks
The outcome of any analysis done with the data can be only as good as the data, making data quality checks an important next step.
Consistency, accuracy, and completeness
Data quality checks for consistency, accuracy, and completeness are essential to ensure the reliability of your data. With its powerful tools for data processing and analysis, Apache Spark is suitable for implementing these checks. The following are examples of how you can perform data quality checks for consistency, accuracy, and completeness using Apache Spark in Python.
Consistency check
In the following consistency test example, we are counting the number of records for each date:

# Example consistency check: check whether the values of a column are consistent
consistency_check_result = df.groupBy("Date").count().orderBy("count")
print("Data consistency result:")
consistency_check_result.display()

As per Figure 5.4, this simple check shows that some dates do not consistently have the same number of records, which can indicate missing values for some dates.
![](https://github.com/OpenDocCN/freelearn-ds-pt3-zh/raw/master/docs/ts-anal-spk/img/B18568_05_04.jpg)

Figure 5.4: Consistency check results
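The logic of this check can be sketched in plain Python on a handful of rows. The records below are made up for the illustration, and treating the highest per-date count as the expected count is an assumption of this sketch, not something the check above does for you.

```python
from collections import Counter

# Hypothetical records: (date, Global_active_power)
records = [
    ("2008-11-09", 1.2), ("2008-11-09", 1.3), ("2008-11-09", 1.1),
    ("2008-11-10", 0.9), ("2008-11-10", 1.0),
    ("2008-11-11", 1.4), ("2008-11-11", 1.5), ("2008-11-11", 1.6),
]

# Equivalent of groupBy("Date").count(): number of records per date.
counts = Counter(date for date, _ in records)

# Assumption: a complete day has as many records as the fullest day seen.
expected = max(counts.values())
short_dates = sorted(d for d, n in counts.items() if n < expected)
print(short_dates)
```

In a real dataset with regular sampling, the expected count is usually known up front (for example, 1,440 records per day for minute-level data), which makes a better reference than the observed maximum.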
Accuracy check
In the accuracy test example, we want to verify the accuracy of `Global_active_power`, as follows:

# Example accuracy check:
# check whether the values in a column meet a specific condition
accuracy_check_expression = "Global_active_power < 0 OR Global_active_power > 10"
# Check
accuracy_check_result = df.filter(accuracy_check_expression)
accuracy_check_result_count = accuracy_check_result.count()
if accuracy_check_result_count == 0:
    print(f"Data passed accuracy check - !({accuracy_check_expression}).")
else:
    print(f"Data failed accuracy check - {accuracy_check_expression} - count {accuracy_check_result_count}:")
    accuracy_check_result.display()

As per Figure 5.5, this check shows that in two cases, `Global_active_power` is outside of the accuracy criteria that we have defined for this check. This indicates that either these values are wrong or that they are correct but are now going beyond the previously known ranges that we have used to define the criteria. We must update the criteria in this latter case.
![](https://github.com/OpenDocCN/freelearn-ds-pt3-zh/raw/master/docs/ts-anal-spk/img/B18568_05_05.jpg)

Figure 5.5: Accuracy check results
Completeness check
In the completeness test example, we want to verify whether `Global_active_power` has null values:

# Example completeness check: check whether a column contains null values
completeness_check_expression = "Global_active_power is NULL"
# Check
completeness_check_result = df.filter(
    completeness_check_expression)
completeness_check_result_count = completeness_check_result.count()
if completeness_check_result_count == 0:
    print(f"Data passed completeness check - !({completeness_check_expression})")
else:
    print(f"Data failed completeness check - {completeness_check_expression} - count {completeness_check_result_count}:")
    completeness_check_result.display()

Note
The consistency check example presented earlier can also be used for completeness.
These examples show basic data quality checks for consistency, accuracy, and completeness using Apache Spark. These checks can be extended and integrated into your data pipelines for more comprehensive data quality assurance.
Data quality framework
To better manage the suite of tests, it is recommended that a framework such as *Great Expectations* be used for data quality checks. You can find more information here: [`github.com/great-expectations/great_expectations`](https://github.com/great-expectations/great_expectations)
We will cover another framework approach with the integration of data quality in the Delta Live Tables pipeline, and monitoring and alerting in *Chapter 10*.
Once the data quality has been tested, the next step is to clean the data.
Data cleaning
The previous step of data quality checks indicates the issues with the data that need to be corrected, which we will now address.
Missing values
Apache Spark offers various methods to handle missing values in time series data. The following examples show how you can clean time series data for missing values using Apache Spark in Python.
Forward filling
The forward filling method to handle missing values replaces the missing values with the previous known value, with the values sorted in chronological order based on their timestamp. In the following code example, missing values for `Global_active_power` are replaced in this way. The `Window.rowsBetween` function in the following case goes from the first record to the current one. The `last` function then finds the last non-null value within that window. As the window slides over all the records, all the missing values are replaced with the last known value:

from pyspark.sql import functions as F
from pyspark.sql import Window
# Example: handle missing values by forward filling
# The "timestamp" column is in chronological order
df = spark.sql(
    f"SELECT timestamp, Global_active_power FROM {table_name} ORDER BY timestamp"
)
window = Window.rowsBetween(float('-inf'), 0)
filled_df = df.withColumn(
    "filled_Global_active_power",
    F.last(df['Global_active_power'],
           ignorenulls=True).over(window))
# Display the updated values
filled_df.filter(
    "timestamp BETWEEN '2008-11-10 17:58:00' AND '2008-11-10 18:17:00'"
).display()

The result of forward filling can be seen in Figure 5.6, where the filled values are shown in the `filled_Global_active_power` column.
![](https://github.com/OpenDocCN/freelearn-ds-pt3-zh/raw/master/docs/ts-anal-spk/img/B18568_05_06.jpg)

Figure 5.6: Forward filling
Forward filling works well when the last known value is a good indication of the next value, such as for a slow-changing value. It is not a good method when the value can change suddenly or when there is seasonality.
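To make the window logic concrete, the following plain-Python sketch applies the same rule (carry the last non-null value forward) to a short, made-up list that is assumed to be already in chronological order. It mirrors what `last` with `ignorenulls=True` computes over a window running from the first record to the current one.

```python
def forward_fill(values):
    """Replace each None with the most recent non-null value, if any."""
    filled, last_seen = [], None
    for v in values:
        if v is not None:
            last_seen = v  # remember the latest known value
        filled.append(last_seen)
    return filled


series = [None, 1.2, None, None, 1.5, None]
filled = forward_fill(series)
print(filled)
```

A gap at the very start of the series stays empty, since there is no earlier value to carry forward. The same holds for the Spark version, so leading nulls need separate handling.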
Backward filling
The backward filling method to handle missing values replaces the missing values with the next known value, with the values sorted in chronological order based on their timestamp. In the following code example, missing values for `Global_active_power` are replaced in this way. The `Window.rowsBetween` function in the following case goes from the current one to the last record. The `first` function then finds the next non-null value within that window. As the window slides over all the records, all the missing values are replaced with the next known value:

from pyspark.sql import functions as F
from pyspark.sql import Window
# Example: handle missing values by backward filling
# The "timestamp" column is in chronological order
df = spark.sql(
    f"SELECT timestamp, Global_active_power FROM {table_name} ORDER BY timestamp"
)
window = Window.rowsBetween(0, float('inf'))
filled_df = df.withColumn(
    "filled_Global_active_power",
    F.first(df['Global_active_power'],
            ignorenulls=True).over(window))
# Display the updated values
filled_df.filter(
    "timestamp BETWEEN '2008-11-10 17:58:00' AND '2008-11-10 18:17:00'"
).display()

The result of backward filling can be seen in Figure 5.7, where the filled values are shown in the `filled_Global_active_power` column.
![](https://github.com/OpenDocCN/freelearn-ds-pt3-zh/raw/master/docs/ts-anal-spk/img/B18568_05_07.jpg)

Figure 5.7: Backward filling
Backward filling works well when the next known value can reasonably indicate the previous value, such as with slow-changing data or when collecting data retrospectively with gaps in the past. However, it is not suitable for analyzing causality or leading indicators.
Interpolation
The interpolation method to handle missing values replaces the missing values with a combination, such as the average, of the previous and next non-missing values, with the values sorted in chronological order based on their timestamp.
Note
There are several different interpolation calculation methods, including linear, polynomial, and spline interpolation. The average method used here is a simple form of linear interpolation.
In the following code example, missing values for `Global_active_power` are replaced in this way. The `Window.rowsBetween` function, used twice, in the following case, goes from the first record to the current one for `windowF`, and from the current one to the last record for `windowB`. The `last` function then finds the previous non-null value within `windowF`, while the `first` function finds the next non-null value within `windowB`. These two non-null values are averaged. As the window slides over all the records, all the missing values are replaced by the averaged value:

from pyspark.sql import functions as F
from pyspark.sql import Window
# Example: handle missing values by interpolation
# The "timestamp" column is in chronological order
df = spark.sql(
    f"SELECT timestamp, Global_active_power FROM {table_name} ORDER BY timestamp"
)
windowF = Window.rowsBetween(float('-inf'), 0)
windowB = Window.rowsBetween(0, float('inf'))
filled_df = df.withColumn(
    "filled_Global_active_power", (F.last(
        df['Global_active_power'], ignorenulls=True
    ).over(windowF) + F.first(
        df['Global_active_power'], ignorenulls=True
    ).over(windowB))/2)
# Display the updated values
filled_df.filter(
    "timestamp BETWEEN '2008-11-10 17:58:00' AND '2008-11-10 18:17:00'"
).display()

The result of interpolation can be seen in Figure 5.8, where the filled values are shown in the `filled_Global_active_power` column.
![](https://github.com/OpenDocCN/freelearn-ds-pt3-zh/raw/master/docs/ts-anal-spk/img/B18568_05_08.jpg)

Figure 5.8: Interpolation
Interpolation works well for slow-changing values, when there is a predictable cyclical pattern, or when there is a small gap in data. It is not a good method when the value can change suddenly, is discrete, or when there is a large gap in data.
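The averaging rule used in the interpolation example can be traced on a small, made-up series with the following plain-Python sketch. The helper names are illustrative only. The sketch assumes the values are already in chronological order and mimics the Spark behavior in which adding a null to a value yields null.

```python
def fill_forward(values):
    """Previous (or current) non-null value for each position."""
    out, last_seen = [], None
    for v in values:
        if v is not None:
            last_seen = v
        out.append(last_seen)
    return out


def fill_backward(values):
    """Next (or current) non-null value for each position."""
    return fill_forward(values[::-1])[::-1]


def interpolate_average(values):
    """Average of the previous and next non-null values; None if either is missing."""
    return [
        None if prev is None or nxt is None else (prev + nxt) / 2
        for prev, nxt in zip(fill_forward(values), fill_backward(values))
    ]


series = [1.0, None, None, 2.0, None]
interpolated = interpolate_average(series)
print(interpolated)
```

Three things are worth noticing. Observed values are unchanged, because their previous and next non-null values are both the value itself. Every missing point inside a gap receives the same midpoint rather than a value on a sloped line, which is why this is only a simple form of linear interpolation. A gap at the end (or start) of the series stays empty, because one of the two neighbors does not exist.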
Of the three methods shown for handling missing values, the appropriate method to use is based on the characteristics of your time series data and the requirements of your analysis.
Data leakage
Note that the backward filling and interpolation methods can leak future data across the boundaries of training, validation, and test data splits. Use these methods within the splits, and not across, or use forward filling if this is going to be an issue.
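The following plain-Python sketch shows the leak on a made-up series with a hypothetical split point. Backward filling the whole series before splitting pulls a test value into the training data, while filling within the training split does not.

```python
def backward_fill(values):
    """Replace each None with the next non-null value, if any."""
    filled, next_seen = [], None
    for v in reversed(values):
        if v is not None:
            next_seen = v
        filled.append(next_seen)
    return filled[::-1]


series = [1.0, None, 3.0, None, 5.0, 6.0]
split_at = 4  # hypothetical boundary: first 4 points for training, the rest for testing
train, test = series[:split_at], series[split_at:]

# Filling before splitting: the last training gap is filled with 5.0,
# a value that belongs to the test split.
leaky_train = backward_fill(series)[:split_at]

# Filling within the split: the last training gap stays empty.
safe_train = backward_fill(train)

print(leaky_train)
print(safe_train)
```

The remaining gap at the end of the training split then has to be handled in another way, for example with forward filling, which only uses past values.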
Duplicates
The presence of duplicate values in time series data can skew analysis and lead to incorrect results. Apache Spark has functions to efficiently remove duplicate values. In the following example, we clean time series data for duplicate values using Apache Spark in Python.
The `dropDuplicates` function removes duplicates by comparing all columns by default and only considers a row to be a duplicate if all the columns match those of one or more other rows. This will not work if we have multiple rows with, say, the same `timestamp` column value but different values in one or more other columns. In this case, we can pass a subset of one or more columns as a parameter to be used to identify the duplicates, as opposed to using all the columns.
In the most common cases, we want to have one and only one row of values for each timestamp and consider the other rows with the same timestamp to be duplicates. Passing the timestamp as the subset parameter to `dropDuplicates` will remove all the other rows having the same timestamp value, as we will see in the following code example:

# Example: remove duplicate rows based on all columns
print(f"With duplicates - count: {df.count()}")
cleaned_df = df.dropDuplicates()
print(f"Without duplicates - count: {cleaned_df.count()}")
# Example: remove duplicate rows based on selected columns
# Assuming "timestamp" is the column that identifies duplicates
cleaned_df = df.dropDuplicates(["timestamp"])
print(f"Without duplicate timestamps - count: {cleaned_df.count()}")

Depending on your dataset and use case, you can choose the appropriate method based on the columns that uniquely identify duplicates in your time series data.
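The difference between the two calls can be seen on a few made-up rows with the following plain-Python sketch. The `drop_duplicates` helper is written for this illustration and keeps the first row it encounters for each key, which is a choice of the sketch.

```python
rows = [
    {"timestamp": "2008-11-10 17:58:00", "Global_active_power": 1.2},
    {"timestamp": "2008-11-10 17:58:00", "Global_active_power": 1.2},  # exact duplicate
    {"timestamp": "2008-11-10 17:59:00", "Global_active_power": 1.3},
    {"timestamp": "2008-11-10 17:59:00", "Global_active_power": 9.9},  # same timestamp, other value
]


def drop_duplicates(rows, subset=None):
    """Keep the first row seen for each combination of the subset columns."""
    seen, kept = set(), []
    for row in rows:
        columns = subset or sorted(row)  # default: compare all columns
        key = tuple(row[c] for c in columns)
        if key not in seen:
            seen.add(key)
            kept.append(row)
    return kept


by_all_columns = drop_duplicates(rows)
by_timestamp = drop_duplicates(rows, ["timestamp"])
print(len(by_all_columns), len(by_timestamp))
```

Comparing all columns removes only the exact duplicate, leaving two rows for 17:59:00. Comparing on the timestamp leaves one row per timestamp. As far as we are aware, Spark gives no guarantee about which of the rows sharing a timestamp is retained, so when the choice matters (for example, keeping the most recently received reading), it is safer to select the row explicitly rather than rely on `dropDuplicates`.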
Outliers
The detection and handling of outliers in time series data is crucial to ensure the accuracy of analysis and modeling. Apache Spark provides various functions to detect and handle outliers efficiently. The following example shows how to clean time series data for outliers using Apache Spark in Python.
The z-score method used is based on how far the data point is from the `mean` relative to the standard deviation, `stddev`. The parametrizable threshold value, `z_score_threshold`, then specifies beyond which z-score value the data point is considered an outlier. A high threshold will allow more data points in, while a low threshold will flag more outliers:

from pyspark.sql import functions as F
# Example: detect outliers using the z-score
# Calculate the z-score for each value in the column
mean_value = df.select(F.mean(
    "Global_active_power")).collect()[0][0]
stddev_value = df.select(F.stddev(
    "Global_active_power")).collect()[0][0]
z_score_threshold = 5  # adjust the threshold as needed
df_with_z_score = df.withColumn("z_score", (F.col(
    "Global_active_power") - mean_value) / stddev_value)
# Filter out the rows whose z-score is beyond the threshold
outliers = df_with_z_score.filter(~F.col("z_score").between(
    -z_score_threshold, z_score_threshold))
cleaned_df = df_with_z_score.filter(F.col("z_score").between(
    -z_score_threshold, z_score_threshold))
# Flag as outlier
df_with_outlier = df_with_z_score.withColumn(
    "_outlier",
    F.when(
        (F.col("z_score") < -z_score_threshold) |
        (F.col("z_score") > z_score_threshold), 1
    ).otherwise(0))
print(f"With outliers - count: {df.count()}")
print(f"Global_active_power - mean: {mean_value}, stddev: {stddev_value}, z-score threshold: {z_score_threshold}")
print(f"Without outliers - count: {cleaned_df.count()}")
print(f"Outliers - count: {outliers.count()}")
print("Outliers:")
outliers.display()

Figure 5.9 shows the outcome of the outlier detection based on the z-score chosen.
![](https://github.com/OpenDocCN/freelearn-ds-pt3-zh/raw/master/docs/ts-anal-spk/img/B18568_05_09.jpg)

Figure 5.9: Outlier detection
Beyond this example, the choice of z-score threshold and outlier detection techniques is based on the data characteristics and requirements of the use case.
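As a worked illustration of how the threshold changes the outcome, the following sketch computes z-scores with Python's standard statistics module on a small, made-up list of readings. It uses the sample standard deviation, which is also what Spark's `stddev` computes.

```python
from statistics import mean, stdev

# Hypothetical readings with one obviously unusual value.
values = [1.0, 1.1, 0.9, 1.2, 1.0, 0.8, 1.1, 1.0, 0.9, 1.0, 1.1, 9.0]

mean_value = mean(values)
stddev_value = stdev(values)  # sample standard deviation


def find_outliers(values, threshold):
    """Return the values whose absolute z-score exceeds the threshold."""
    return [
        v for v in values
        if abs((v - mean_value) / stddev_value) > threshold
    ]


outliers_at_3 = find_outliers(values, threshold=3)
outliers_at_5 = find_outliers(values, threshold=5)
print(outliers_at_3, outliers_at_5)
```

The reading of 9.0 has a z-score of about 3.2, so it is flagged at a threshold of 3 but not at 5. This also shows a limitation of the method on small samples: a single extreme value inflates the mean and standard deviation it is measured against, which caps how large its z-score can become.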
Note
Outliers can be indicative of one or more anomalies in the source system that generated the measurement, or data processing or transmission issues post source system. The identification of outliers flags the requirement to further investigate the source system and the data transmission chain to find the root cause.
After cleaning the data based on the issues identified with the data quality checks, other transformations, which we will look at next, are required to get the data into the right shape for the analytics algorithm to work.
Transformations
In this section, we will look at examples of normalizing and standardizing, and touch briefly on stationary transformation.
Normalizing
Normalizing time series data ensures that features are on a similar scale, which can improve the performance of machine learning algorithms while facilitating analysis. Apache Spark provides various functions for normalization. The following example shows how to normalize time series data using Apache Spark in Python.
The min-max normalization technique is used to scale the data points relative to the min-max range. The `min` and `max` values are calculated first. This brings the value to the range of `0` for the minimum value and `1` for the maximum value:

from pyspark.sql import functions as F
# Define the columns to normalize (for example, the "value" column)
columns_to_normalize = ["Global_active_power"]
# Calculate the min and max of each column for normalization
min_max_values = df.select(
    [F.min(F.col(column)).alias(f"min_{column}")
        for column in columns_to_normalize] +
    [F.max(F.col(column)).alias(f"max_{column}")
        for column in columns_to_normalize]
).collect()[0]
# Normalize the data using min-max normalization
for column in columns_to_normalize:
    min_value = min_max_values[f"min_{column}"]
    max_value = min_max_values[f"max_{column}"]
    df = df.withColumn(
        f"normalized_{column}",
        (F.col(column) - min_value) / (max_value - min_value))
print(f"Normalized - {columns_to_normalize}:")
df.display()

Figure 5.10 shows the outcome of normalizing the example time series data.
![](https://github.com/OpenDocCN/freelearn-ds-pt3-zh/raw/master/docs/ts-anal-spk/img/B18568_05_10.jpg)

Figure 5.10: Normalizing
Depending on the specific requirements and data characteristics, the normalization method can be adjusted with the use of other techniques such as z-score normalization and decimal scaling, in addition to the min-max technique used in the example.
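The min-max formula can be checked by hand on a few made-up values with the following plain-Python sketch. Returning 0.0 for a constant column is a choice made for this sketch to avoid dividing by zero; the Spark example above has no such guard.

```python
def min_max_normalize(values):
    """Scale non-null values to [0, 1]; None entries are left as None."""
    present = [v for v in values if v is not None]
    lo, hi = min(present), max(present)
    if hi == lo:
        # Constant column: the range is zero, so map every value to 0.0.
        return [None if v is None else 0.0 for v in values]
    return [None if v is None else (v - lo) / (hi - lo) for v in values]


values = [2.0, 4.0, None, 6.0, 10.0]
normalized = min_max_normalize(values)
print(normalized)
```

Because the minimum and maximum define the scale, a single extreme value compresses all other values into a narrow band. It is therefore usually worth handling outliers before normalizing.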
Standardizing
Standardizing time series data ensures that features are on a similar scale, which can improve the performance of machine learning algorithms while facilitating analysis. This method transforms the data such that it has a mean of `0` and a standard deviation of `1`. Apache Spark provides various functions for standardization. The following example shows how to standardize time series data using Apache Spark in Python.
This example standardizes the `log` values to account for the skewness of the data. First, the `mean` and `stddev` of the log values are calculated. These values are then used in the z-score formula to standardize:

from pyspark.sql import functions as F

# Define the columns to standardize (for example, the "value" column)
columns_to_standardize = ["Global_active_power"]

# Calculate the mean and standard deviation of each column for
# standardization
mean_stddev_values = df.select(
    [F.mean(F.log(F.col(column))).alias(f"mean_{column}")
        for column in columns_to_standardize] +
    [F.stddev(F.log(F.col(column))).alias(f"stddev_{column}")
        for column in columns_to_standardize]
).collect()[0]

# Standardize the data using z-score standardization
for column in columns_to_standardize:
    mean_value = mean_stddev_values[f"mean_{column}"]
    stddev_value = mean_stddev_values[f"stddev_{column}"]
    df = df.withColumn(
        f"standardized_{column}",
        (F.log(F.col(column)) - mean_value) / stddev_value
    )

print(f"Standardized - {columns_to_standardize}:")
df.display()

*Figure 5.11* shows the outcome of standardizing the example time series data.
![](https://github.com/OpenDocCN/freelearn-ds-pt3-zh/raw/master/docs/ts-anal-spk/img/B18568_05_11.jpg)

Figure 5.11: Standardizing
The standardization method can be adjusted depending on the specific requirements and data characteristics.
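As a rough illustration of the z-score formula applied to log values, the following plain-Python sketch mirrors the steps of the Spark example on a small list. The function name `standardize_log` and the readings are hypothetical. It uses the sample standard deviation, which is also what Spark's `stddev` computes, and it assumes strictly positive values so that the logarithm is defined.

```python
import math
import statistics


def standardize_log(values):
    """Return z-scores of log(values) using the sample standard deviation."""
    logs = [math.log(v) for v in values]  # requires strictly positive values
    mean = statistics.mean(logs)
    stddev = statistics.stdev(logs)  # sample (n - 1) standard deviation
    return [(x - mean) / stddev for x in logs]


# Hypothetical, right-skewed power readings
readings = [0.2, 0.3, 0.4, 0.5, 4.0]
z_scores = standardize_log(readings)
print([round(z, 3) for z in z_scores])
```

After the transformation, the values have a mean of 0 and a standard deviation of 1, up to floating-point rounding.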
Stationary
In *Chapter 1*, we discussed the requirement of stationary time series for some analysis methods. Making time series data stationary involves removing trends and seasonality, which we will cover in the following chapter.
This concludes the section on testing for data quality and then cleaning and transforming time series data. We will cover the scalability considerations in data preparation when we discuss feature engineering in *Chapter 8*.
Summary
In conclusion, this chapter focused on the critical steps of organizing, cleaning, and transforming time series data for effective analysis. We have covered data preparation techniques using Apache Spark for ingestion, persistence, data quality checks, cleaning, and transformations. We looked at code examples for, among others, handling missing values and duplicates, addressing outliers, and normalizing data. This has set the stage for an accurate and efficient analytical process using Apache Spark. Proper data preparation significantly enhances the reliability of subsequent analytical processes, which is what we will progress toward in the next chapter.
Join our community on Discord
Join our community’s Discord space for discussions with the authors and other readers:
[`packt.link/ds`](https://packt.link/ds)
![](https://github.com/OpenDocCN/freelearn-ds-pt3-zh/raw/master/docs/ts-anal-spk/img/ds_(1).jpg)

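Chapter 6: Exploratory Data Analysis

After loading and preparing the data, as covered in the previous chapter, we will perform exploratory data analysis to uncover patterns and insights in the time series data. We will use statistical analysis techniques, including those specific to temporal patterns. The outcomes of these steps are critical for identifying trends and seasonality, and they inform subsequent modeling decisions. Robust exploratory data analysis with Apache Spark ensures a thorough grasp of the dataset's characteristics, enhancing the accuracy and relevance of subsequent time series models and analyses.

In this chapter, we will cover the following main topics:

*   Statistical analysis
*   Resampling, decomposition, and stationarity
*   Correlation analysis

Technical requirements

This chapter is mostly hands-on and covers the data exploration techniques commonly used in time series analysis projects. The code for this chapter can be found in the ch6 folder of the book's GitHub repository at: github.com/PacktPublishing/Time-Series-Analysis-with-Spark/tree/main/ch6.

Note

We will use Spark DataFrames in the code examples and convert them to DataFrames for libraries that support pandas. This shows how the two can be used interchangeably. We will mention when pandas is used.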

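Statistical analysis

This section starts with the statistical analysis of time series data, covering data profiling to gather these statistics, distribution analysis, and visualization.

The examples in this chapter are based on the code in ts-spark_ch6_1.dbc, which we can import from GitHub into Databricks Community Edition using the method described in the Technical requirements section of Chapter 1.

The code is at the following URL: github.com/PacktPublishing/Time-Series-Analysis-with-Spark/raw/main/ch6/ts-spark_ch6_1.dbc

We will start the hands-on examples with the household energy consumption dataset, which was also used in Chapter 2 and Chapter 5. After loading the dataset with `spark.read`, as shown in the following code snippet, we cache the DataFrame in memory with `df.cache()` to speed up subsequent processing. Because of lazy evaluation, the caching happens at the next action rather than immediately. To force the caching, we add a `df.count()` action. We then create a `timestamp` column that combines the `Date` and `Time` columns. As the numerical columns have been loaded as strings, we must convert them to the numerical `double` data type to be able to do calculations. Note that, for readability, we have split the operations on the `df` DataFrame across multiple lines of code. They could also be chained into a single statement: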

…
# Code in cell 5
df = spark.read.csv(
    "file:///" + SparkFiles.get(DATASET_FILE),
    header=True, sep=";", inferSchema=True)
df.cache()
df.count()
…
# Code in cell 7
df = df.withColumn('Time', F.date_format('Time', 'HH:mm:ss'))
# Create timestamp column
df = df.withColumn('timestamp', F.concat(df.Date, F.lit(" "), df.Time))
df = df.withColumn(
    'timestamp',
    F.to_timestamp(df.timestamp, 'yyyy-MM-dd HH:mm:ss'))
# Fix data types
df = df \
    .withColumn('Global_active_power',
    df.Global_active_power.cast('double')) \
…
print("Schema:")
df.printSchema()

The data types are inferred when the file is read, through the `spark.read` option `inferSchema`. The data types before conversion, displayed with `printSchema()`, are shown in *Figure 6.1*.
![](https://github.com/OpenDocCN/freelearn-ds-pt3-zh/raw/master/docs/ts-anal-spk/img/B18568_06_01.jpg)

Figure 6.1: Inferred schema with data types
The updated schema is as per *Figure 6.2*, showing the converted data types.
![](https://github.com/OpenDocCN/freelearn-ds-pt3-zh/raw/master/docs/ts-anal-spk/img/B18568_06_02.jpg)

Figure 6.2: Updated schema with converted data types
We are now ready to profile the data.
Data profiling
Data profiling involves analyzing the dataset’s structure, quality, and statistical properties. This helps to identify anomalies, missing values, and outliers, ensuring data integrity. This process can also be comprehensive, including the analysis of trends, seasonal patterns, and correlations, guiding more accurate forecasting and modeling.
Note
Data profiling can also guide preprocessing steps such as normalization and transformation, covered in *Chapter 5*.
Apache Spark provides the convenient `summary()` function, as per the following code, for summary statistics:

# Summary statistics
# Code in cell 10
df.summary().display()

This generates the following outcome:
![](https://github.com/OpenDocCN/freelearn-ds-pt3-zh/raw/master/docs/ts-anal-spk/img/B18568_06_03.jpg)

Figure 6.3: Summary statistics
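By default, `summary()` reports the count, mean, standard deviation, minimum, approximate 25%, 50%, and 75% percentiles, and maximum of each column. The following plain-Python sketch computes the same set of statistics for a single list of numbers. The function name `summarize` and the readings are hypothetical, and the quartiles here are exact, so they can differ slightly from Spark's approximate percentiles on large datasets.

```python
import statistics


def summarize(values):
    """Compute count, mean, stddev, min, quartiles, and max for a numeric list."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "stddev": statistics.stdev(values),  # sample standard deviation
        "min": min(values),
        "25%": q1,
        "50%": median,
        "75%": q3,
        "max": max(values),
    }


readings = [1.0, 2.0, 3.0, 4.0, 5.0]
stats = summarize(readings)
print(stats)
```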
While these summary statistics are useful, they are usually not sufficient. A data profiling tool such as YData Profiling, which we will look at next, provides more extensive analysis and reporting.
The following code extract shows how to launch a Profile Report with YData Profiling. Notable here is the use of a pandas DataFrame, `pdf`, and of the time series mode (`tsmode` parameter), with the `sortby` parameter to sort by timestamp. We also want correlations to be included in the report. After the report is generated, it is converted to HTML for display with the `to_html()` function.

# Code in cell 12
…
profile = ProfileReport(
    pdf,
    title='Time Series Data Profiling',
    tsmode=True,
    sortby='timestamp',
    infer_dtypes=False,
    interactions=None,
    missing_diagrams=None,
    correlations={
        "auto": {"calculate": False},
        "pearson": {"calculate": True},
        "spearman": {"calculate": True}})
# Save the profiling report to an HTML file
profile.to_file("time_series_data_profiling_report.html")
# Display the profiling report in the notebook
report_html = profile.to_html()
displayHTML(report_html)

The generated report contains an **Overview** section, as per *Figure 6.4*, with an indication, among other things, of the number of variables (columns), observations (rows), and missing values and duplicate counts.
![](https://github.com/OpenDocCN/freelearn-ds-pt3-zh/raw/master/docs/ts-anal-spk/img/B18568_06_04.jpg)

Figure 6.4: Data profile report – Overview
Scrolling down from **Overview**, we can see column-specific statistics, as shown in *Figure 6.5*, such as the minimum, maximum, mean, number of zeros, and number of distinct values.
![](https://github.com/OpenDocCN/freelearn-ds-pt3-zh/raw/master/docs/ts-anal-spk/img/B18568_06_05.jpg)

Figure 6.5: Data profile report – Details
This section has further sub-sections, such as **Histogram**, showing the distribution of values, and **Gap analysis**, as per *Figure 6.6*, with indications of data gaps for the column.
![](https://github.com/OpenDocCN/freelearn-ds-pt3-zh/raw/master/docs/ts-anal-spk/img/B18568_06_06.jpg)

Figure 6.6: Data profile report – Gap analysis
With the time series mode specified earlier, we also get a basic **Time Series** part of the report, shown in *Figure 6.7*.
![](https://github.com/OpenDocCN/freelearn-ds-pt3-zh/raw/master/docs/ts-anal-spk/img/B18568_06_07.jpg)

Figure 6.7: Data profile report – Time Series
Other sections of the report cover **Alerts**, shown in *Figure 6.8*, with outcomes of tests run on the dataset, including time-series-specific ones, and a **Reproduction** section with details on the profiling run.
![](https://github.com/OpenDocCN/freelearn-ds-pt3-zh/raw/master/docs/ts-anal-spk/img/B18568_06_08.jpg)

Figure 6.8: Data profile report – Alerts
This section provided an example of how to perform data profiling on time series data using YData Profiling and Apache Spark. Further information on YData Profiling can be found here: [`github.com/ydataai/ydata-profiling`](https://github.com/ydataai/ydata-profiling).
We will now drill down further in our understanding of the data, by analyzing the gaps in the dataset.
Gap analysis
In the previous section, we mentioned gap analysis for gaps in value for a specific column. Another consideration for time series data is gaps in the timeline itself, as in the following example with the household energy consumption dataset, where we are expecting values every minute.
In this case, we first calculate the time difference between consecutive timestamps using `diff()`, as in the following code, with a pandas DataFrame, `pdf`. If this is greater than `1 minute`, we can flag the timestamp as having a prior gap:

# Test for gaps
# Code in cell 15
pdf['gap_val'] = pdf['timestamp'].sort_values().diff()
pdf['gap'] = pdf['gap_val'] > ps.to_timedelta('1 minute')
pdf[pdf.gap]

As *Figure 6.9* shows, we found 3 gaps of 2 minutes each in this example.
![](https://github.com/OpenDocCN/freelearn-ds-pt3-zh/raw/master/docs/ts-anal-spk/img/B18568_06_09.jpg)

Figure 6.9: Gap analysis
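The gap test boils down to sorting the timestamps, taking the difference between consecutive values, and flagging any difference larger than the expected interval. A minimal plain-Python sketch of that logic follows. The function name `find_gaps` and the sample timestamps are assumptions made for illustration.

```python
from datetime import datetime, timedelta


def find_gaps(timestamps, expected=timedelta(minutes=1)):
    """Return (previous, current, delta) for each step longer than expected."""
    ordered = sorted(timestamps)
    gaps = []
    for previous, current in zip(ordered, ordered[1:]):
        delta = current - previous
        if delta > expected:
            gaps.append((previous, current, delta))
    return gaps


# Hypothetical minute-level timestamps with 10:03 missing
start = datetime(2007, 1, 1, 10, 0)
timestamps = [start + timedelta(minutes=m) for m in (0, 1, 2, 4, 5)]
gaps = find_gaps(timestamps)
for previous, current, delta in gaps:
    print(f"gap of {delta} before {current}")
```

As in the pandas version, each gap is attached to the timestamp that follows it.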
Depending on the size of the gap and the nature of the dataset, we can adopt one of the following approaches:

*   Ignore the gap
*   Aggregate, for example, use the mean value at a higher interval
*   Use one of the missing-value handling techniques we saw in *Chapter 5*, such as forward filling

Regular or irregular time series
The gap analysis presented here assumes a regular time series. The approach is slightly different in detecting gaps in the timeline of irregular time series. The previous example of checking for the absence of values at every minute interval is not applicable for an irregular time series. We will have to look at the distribution of the count of values over the timeline of the irregular time series and make reasonable assumptions about how regularly we expect values in the irregular time series. For instance, if we are considering the energy consumption of a household, the time series may be irregular at minute intervals, but based on historical data, we expect energy use every hour or daily. In this case, not having a data point on a given hour or day can be indicative of a gap. Once we have identified a gap, we can use the same approaches as discussed for regular time series, that is, forward filling or similar imputation, aggregation at higher intervals, or just ignoring the gap.
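For an irregular series, one possible way to apply this idea is to count the readings per hour and report the hours that contain none. The sketch below does this in plain Python. The function name `empty_hours`, the hourly expectation, and the readings are all assumptions made for illustration.

```python
from collections import Counter
from datetime import datetime, timedelta


def empty_hours(timestamps):
    """Return the hourly buckets between first and last reading that hold no data."""
    buckets = Counter(
        ts.replace(minute=0, second=0, microsecond=0) for ts in timestamps)
    hour = min(buckets)
    last = max(buckets)
    missing = []
    while hour <= last:
        if buckets[hour] == 0:  # a Counter returns 0 for absent keys
            missing.append(hour)
        hour += timedelta(hours=1)
    return missing


# Hypothetical irregular readings: nothing was recorded between 11:00 and 12:00
readings = [
    datetime(2007, 1, 1, 9, 5), datetime(2007, 1, 1, 9, 47),
    datetime(2007, 1, 1, 10, 30), datetime(2007, 1, 1, 12, 2),
]
print(empty_hours(readings))
```

The bucket size (hourly here) encodes the assumption about how regularly values are expected and should be chosen from the historical distribution of the data.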
We discussed here the specific problem of gaps in the time series. We mentioned that, to identify gaps, we can look at the distribution of the data, which will be covered next.
Distribution analysis
Distribution analysis of time series provides an understanding of the underlying patterns and characteristics of the data, such as skewness, kurtosis, and outliers. This helps detect deviations from normal distribution, trends, and seasonal patterns, and visualize the variability of the time series. This understanding then feeds into choosing the appropriate statistical models and forecasting methods. This is required as models are built on assumptions of the distribution of the time series. Done correctly, distribution analysis ensures that model assumptions are met. This also improves the accuracy and reliability of the predictions.
In this section, we will examine a few examples of distribution analysis, starting with the profiling output of *Figure 6.5*, which shows a kurtosis of 2.98 and a skewness of 1.46. Let’s explain what this means by first defining these terms.
**Kurtosis** indicates how peaked or flat a distribution is compared to a normal distribution. A value greater than 2, as in our example in *Figure 6.5*, indicates the distribution is too peaked. Less than -2 means too flat.
**Skewness** indicates how centered and symmetric the distribution is compared to a normal distribution. A value between -1 and 1 is considered near normal, between -2 and 2, as in the example in *Figure 6.5*, is acceptable, and below -2 or above 2 is not normal.
When both skewness and kurtosis (measured here as excess kurtosis, where a normal distribution scores 0) are zero, the distribution matches a normal distribution on both measures, which is quite unlikely to be seen with real data.
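To show where these two numbers come from, the following sketch computes skewness and excess kurtosis from the central moments of a small list. The function name and the data are hypothetical. The sketch uses the simple population formulas, whereas profiling tools may apply bias corrections, so their values can differ slightly.

```python
def skewness_and_excess_kurtosis(values):
    """Population skewness and excess kurtosis from central moments."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    skewness = m3 / m2 ** 1.5
    excess_kurtosis = m4 / m2 ** 2 - 3.0  # 0 for a normal distribution
    return skewness, excess_kurtosis


symmetric = [1, 2, 3, 4, 5]
right_tailed = [1, 1, 1, 2, 10]
print(skewness_and_excess_kurtosis(symmetric))
print(skewness_and_excess_kurtosis(right_tailed))
```

The symmetric list has zero skewness, while the list with one large value has positive skewness, reflecting its long right tail.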
Let’s now do some further distribution analysis with the following code extract. We want to understand the frequency distribution of `Global_active_power`, the distribution by day of the week, `dayOfWeek`, and the hour of the day. We will use the Seaborn (`sns`) visualization library for the plots, with the pandas DataFrame, `pdf`, passed as a parameter:

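# Distribution analysis
# Code in cell 17
…
# Extract the day of the week and the hour
df = df.withColumn("dayOfWeek", F.dayofweek(F.col("timestamp")))
df = df.withColumn("hour", F.hour(F.col("timestamp")))
…
# Distribution analysis using Seaborn and Matplotlib
…
sns.histplot(pdf['Global_active_power'], kde=True, bins=30)
plt.title(
    'Distribution of Global_active_power in the time series data'
)
…
# Visualize the distribution by day of the week with a box plot
…
sns.boxplot(x='dayOfWeek', y='Global_active_power', data=pdf)
plt.title(
    'Day-of-week distribution of Global_active_power in the time series data'
)
…
# Visualize the distribution by hour with a box plot
…
sns.boxplot(x='hour', y='Global_active_power', data=pdf)
plt.title(
    'Hourly distribution of Global_active_power in the time series data'
)
…

We can see the frequency of occurrence of the different values of `Global_active_power` in *Figure 6.10*. Most values are concentrated on the left, with a long tail to the right, which is consistent with the positive skewness reported earlier.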
![](https://github.com/OpenDocCN/freelearn-ds-pt3-zh/raw/master/docs/ts-anal-spk/img/B18568_06_10.jpg)

Figure 6.10: Distribution by frequency
If we look at the distribution by day of the week, as in *Figure 6.11*, power consumption during the weekends is higher, as can be expected for a household, with 1 on the *x* axis representing Sunday and 7 representing Saturday. The distribution also spans a broader range of values on these days.
![](https://github.com/OpenDocCN/freelearn-ds-pt3-zh/raw/master/docs/ts-anal-spk/img/B18568_06_11.jpg)

Figure 6.11: Distribution by day of the week
The distribution by hour of the day, as in *Figure 6.12*, shows higher power consumption during the morning and evening, again as can be expected for a household.
![](https://github.com/OpenDocCN/freelearn-ds-pt3-zh/raw/master/docs/ts-anal-spk/img/B18568_06_12.jpg)

Figure 6.12: Distribution by hour of the day
You will also notice in the box plots the values that are flagged as outliers, lying beyond the whiskers. These lie more than 1.5 times the **inter-quartile range** (**IQR**) above the third quartile. You can use other thresholds for outliers, as in *Chapter 5*, where we used a cutoff on the z-score value.
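The whisker rule can be reproduced with a few lines of plain Python. This is a minimal sketch with a hypothetical function name and data. It computes the quartiles with linear interpolation and flags values beyond 1.5 times the IQR on either side.

```python
import statistics


def iqr_fences(values, whisker=1.5):
    """Return the (lower, upper) bounds beyond which points count as outliers."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - whisker * iqr, q3 + whisker * iqr


readings = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
lower, upper = iqr_fences(readings)
outliers = [v for v in readings if v < lower or v > upper]
print(lower, upper, outliers)
```

Changing the `whisker` argument is one way to experiment with stricter or looser outlier thresholds.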
Visualizations
As we have seen so far in this book and, more specifically, this chapter, visualizations play an important role in time series analysis. By providing us with an intuitive and immediate understanding of the data’s underlying patterns, they help to identify seasonal variations, trends, and anomalies that might not otherwise be seen from raw data alone. Furthermore, visualizations facilitate the detection of correlations, cycles, and structural changes over time, contributing to better forecasting and decision-making.
Fundamentally, (and this is not only true for time series analysis) visualizations aid in communicating complex insights to stakeholders and, in doing so, improve their ability to understand and act accordingly.
Building on the techniques for statistical analysis seen in this chapter, we will now move on to other important techniques to consider while analyzing time series: resampling, decomposition, and stationarity.
Resampling, decomposition, and stationarity
This section details additional techniques used in time series analysis, introduced in *Chapter 1*. We will see code examples of how to implement these techniques.
Resampling and aggregation
Resampling and aggregation are used in time series analysis to transform and analyze data at different time scales. **Resampling** is changing the frequency of the time series, such as converting hourly data to daily data, which can reveal trends and patterns at different time frequencies. **Aggregation**, on the other hand, is the summarizing of data over specified intervals and is used in conjunction with resampling to calculate the resampled value. This can reduce noise, handle missing values, and convert an irregular time series to a regular series.
The following code extract shows the resampling at different intervals, together with the aggregation. The original dataset has data every minute. With `resample('h').mean()` applied to the pandas DataFrame, `pdf`, we resample this value to the mean over the hour:

# Resampling and aggregation
# Code in cell 22
…
# Resample the data to hourly, daily, and weekly frequency, and aggregate by
# mean
hourly_resampled = pdf.resample('h').mean()
hourly_resampled_s = pdf.resample('h').std()
daily_resampled = pdf.resample('d').mean()
daily_resampled_s = pdf.resample('d').std()
weekly_resampled = pdf.resample('w').mean()
weekly_resampled_s = pdf.resample('w').std()
…

*Figure 6.13* shows the outcome of the hourly resampling.
![](https://github.com/OpenDocCN/freelearn-ds-pt3-zh/raw/master/docs/ts-anal-spk/img/B18568_06_13.jpg)

Figure 6.13: Resampled hourly
*Figure 6.14* shows the outcome of the daily resampling.
![](https://github.com/OpenDocCN/freelearn-ds-pt3-zh/raw/master/docs/ts-anal-spk/img/B18568_06_14.jpg)

Figure 6.14: Resampled daily
*Figure 6.15* shows the outcome of the weekly resampling.
![](https://github.com/OpenDocCN/freelearn-ds-pt3-zh/raw/master/docs/ts-anal-spk/img/B18568_06_15.jpg)

Figure 6.15: Resampled weekly
With these examples, we have resampled and aggregated time series data, prepared with Apache Spark, using pandas. We will next expand on the time series decomposition of the resampled time series.
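Conceptually, resampling with a mean aggregation groups the observations into time buckets and averages each bucket. The following plain-Python sketch illustrates this for hourly buckets. The function name `resample_hourly_mean` and the readings are hypothetical. Unlike `resample()` in pandas, this simplified version does not emit empty buckets for hours without data.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from statistics import mean


def resample_hourly_mean(points):
    """Group (timestamp, value) pairs by hour and average each group."""
    buckets = defaultdict(list)
    for ts, value in points:
        buckets[ts.replace(minute=0, second=0, microsecond=0)].append(value)
    return {hour: mean(vals) for hour, vals in sorted(buckets.items())}


# Hypothetical minute-level readings: 1.0 throughout hour 0, 3.0 throughout hour 1
start = datetime(2007, 1, 1, 0, 0)
points = [(start + timedelta(minutes=m), 1.0 if m < 60 else 3.0)
          for m in range(120)]
hourly = resample_hourly_mean(points)
print(hourly)
```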
Decomposition
As introduced in *Chapter 1*, decomposition breaks down the time series into its fundamental components: trend, seasonality, and residuals. This separation helps uncover underlying patterns within the data more clearly. The trend shows long-term movement, while seasonal components show repeating patterns. Residuals highlight any deviation from the trend and seasonal components. This decomposition allows for each component to be analyzed and addressed individually.
The following code extract shows the decomposition of time series using `seasonal_decompose` from the `statsmodels` library. In *Chapter 1*, we used a different library, `Prophet`.

# Code in cell 30
…
from statsmodels.tsa.seasonal import seasonal_decompose
# Perform seasonal decomposition
hourly_result = seasonal_decompose(
    hourly_resampled['Global_active_power'])
daily_result = seasonal_decompose(
    daily_resampled['Global_active_power'])
…

*Figure 6.16* shows the components of the hourly resampled time series. The seasonal component shows a pattern, with each repeating pattern corresponding to a day, and the rises in power consumption every morning and evening are visible.
![](https://github.com/OpenDocCN/freelearn-ds-pt3-zh/raw/master/docs/ts-anal-spk/img/B18568_06_16.jpg)

Figure 6.16: Decomposition of hourly data
*Figure 6.17* shows the components of the daily resampled time series. The seasonal component shows a pattern, with each repeating pattern corresponding to a week, and the rises in power consumption every weekend are visible.
![](https://github.com/OpenDocCN/freelearn-ds-pt3-zh/raw/master/docs/ts-anal-spk/img/B18568_06_17.jpg)

Figure 6.17: Decomposition of daily data
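To give an intuition for what an additive decomposition computes, here is a simplified plain-Python sketch of the classical approach: a centered moving average estimates the trend, the average detrended value at each position in the period estimates the seasonal component, and what remains is the residual. The function name `decompose_additive` and the synthetic series are assumptions made for illustration. This is not the `statsmodels` implementation.

```python
def decompose_additive(values, period):
    """Classical additive decomposition: returns (trend, seasonal, residual)."""
    n, half = len(values), period // 2
    trend = [None] * n
    for i in range(half, n - half):
        window = values[i - half:i + half + 1]
        total = sum(window)
        if period % 2 == 0:
            # Even period: give the two end points half weight (2 x period MA)
            total -= 0.5 * (window[0] + window[-1])
        trend[i] = total / period

    # Average the detrended values at each position within the period
    by_phase = [[] for _ in range(period)]
    for i, t in enumerate(trend):
        if t is not None:
            by_phase[i % period].append(values[i] - t)
    phase_means = [sum(p) / len(p) for p in by_phase]
    offset = sum(phase_means) / period  # center the seasonal component on zero
    seasonal = [phase_means[i % period] - offset for i in range(n)]

    residual = [None if t is None else values[i] - t - seasonal[i]
                for i, t in enumerate(trend)]
    return trend, seasonal, residual


# Hypothetical series: linear trend plus a repeating pattern of period 4
pattern = [2.0, -1.0, -3.0, 2.0]
series = [0.5 * t + pattern[t % 4] for t in range(24)]
trend, seasonal, residual = decompose_additive(series, period=4)
print(seasonal[:4])
```

On this synthetic series, the sketch recovers the repeating pattern as the seasonal component and the straight line as the trend, leaving residuals close to zero.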
Now that we have performed time series decomposition using Apache Spark and `statsmodels` for time series at different resampling intervals, let's discuss the next technique. 
Stationarity
Another key concept related to time series data, introduced in *Chapter 1*, is stationarity. A series is stationary when its statistical properties, such as mean, variance, and autocorrelation, remain constant over time. This is an assumption on which time series models such as **AutoRegressive Integrated Moving Average** (**ARIMA**) are built. A non-stationary series must be identified and converted to a stationary one before using such models. In general, stationary time series facilitate analysis and improve model accuracy.
The first step in handling non-stationarity is to check the time series, which we will look at next.
Check
The **Augmented Dickey-Fuller** (**ADF**) test and the **Kwiatkowski-Phillips-Schmidt-Shin** (**KPSS**) test are commonly used statistical tests to check for stationarity. Without going into the details of these tests, we can say they calculate a value called the p-value. For ADF, the null hypothesis is that the series has a unit root, so a value of p < 0.05 means that the series is stationary. For KPSS, the null hypothesis is the opposite, namely that the series is stationary, so a small p-value points to non-stationarity. Additionally, we can check for stationarity by visual inspection of the time series plot and **autocorrelation function** (**ACF**) plots, and by comparing summary statistics over different time periods. Mean, variance, and autocorrelation remaining constant across time suggest stationarity. Significant changes indicate non-stationarity.
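The informal check of comparing summary statistics over different time periods can be sketched as follows. This plain-Python example, with a hypothetical function name and data, splits a series in two halves and compares their mean and variance. It is only a quick sanity check and not a substitute for the statistical tests.

```python
import statistics


def compare_halves(values):
    """Return (mean, variance) for the first and second half of a series."""
    mid = len(values) // 2
    first, second = values[:mid], values[mid:]
    return ((statistics.mean(first), statistics.variance(first)),
            (statistics.mean(second), statistics.variance(second)))


# Hypothetical series: one fluctuating around a constant, one with a trend
flat = [10, 12, 10, 12, 10, 12, 10, 12]
trending = [1, 2, 3, 4, 5, 6, 7, 8]
print(compare_halves(flat))
print(compare_halves(trending))
```

The first series has the same mean and variance in both halves, while the mean of the trending series shifts between halves, hinting at non-stationarity.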
The following example code checks for stationarity using the ADF test, `adfuller`, from the `statsmodels` library. We will use the hourly resampled data in this example.

# Stationarity
# Code in cell 33
…
from statsmodels.tsa.stattools import adfuller
# Perform the Augmented Dickey-Fuller test
result = adfuller(hourly_resampled)
# If test statistic < critical value and p-value < 0.05:
# reject the null hypothesis, the time series does not have a unit root,
# and the series is stationary
…

In this case, the p-value, as shown in *Figure 6.18*, is less than 0.05, and we can conclude from the ADF test that the time series is stationary.
![](https://github.com/OpenDocCN/freelearn-ds-pt3-zh/raw/master/docs/ts-anal-spk/img/B18568_06_18.jpg)

Figure 6.18: ADF test results – Power consumption dataset
Running the ADF test on the dataset for the annual mean temperature of Mauritius, used in *Chapter 1*, gives a p-value greater than 0.05, as shown in *Figure 6.19*. In this case, we can conclude that the time series is non-stationary.
![](https://github.com/OpenDocCN/freelearn-ds-pt3-zh/raw/master/docs/ts-anal-spk/img/B18568_06_19.jpg)

Figure 6.19: ADF test results – Annual mean temperature dataset
As we now have a non-stationary series, we will next consider converting it to a stationary series using differencing.
Differencing
The following code extract shows the conversion of a non-stationary time series to a stationary one. We’ll use differencing, a common method to remove trends and seasonality, which can make the time series stationary. By using a combination of the `Window` function and `lag` of 1, we can find the difference between an annual mean and the previous year’s value.

# Differencing
# Code in cell 41
…
from pyspark.sql.window import Window
# Calculate the difference (differencing)
window = Window.orderBy("year")
df2_ = df2.withColumn(
    "annual_mean_diff",
    F.col("annual_mean") - F.lag(
        F.col("annual_mean"), 1
    ).over(window))
…

We can see the original time series compared to the differenced time series in *Figure 6.20*. The removal of the trend is visible.
![](https://github.com/OpenDocCN/freelearn-ds-pt3-zh/raw/master/docs/ts-anal-spk/img/B18568_06_20.jpg)

Figure 6.20: Differencing – Annual mean temperature dataset
Running the ADF test after differencing gives a p-value less than 0.05, as shown in *Figure 6.21*. We can conclude that the differenced time series is stationary.
![](https://github.com/OpenDocCN/freelearn-ds-pt3-zh/raw/master/docs/ts-anal-spk/img/B18568_06_21.jpg)

Figure 6.21: ADF test results – Differenced annual mean temperature dataset
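The effect of first-order differencing is easy to verify on a small list. In this plain-Python sketch, the function name `difference` and the values are hypothetical. The first entry has no predecessor and is therefore `None`, in the same way that `F.lag` returns null for the first row in Spark, so that row should be dropped before running a test on the differenced series.

```python
def difference(values, lag=1):
    """Return values[t] - values[t - lag]; the first `lag` entries are None."""
    return [None if t < lag else values[t] - values[t - lag]
            for t in range(len(values))]


# Hypothetical annual means with a steady upward trend of 0.5 per year
annual_mean = [20.0, 20.5, 21.0, 21.5, 22.0]
print(difference(annual_mean))
```

A series with a linear trend becomes constant after one round of differencing, which is why the trend disappears from the differenced plot.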
Building on our understanding of techniques for exploratory analysis learned in this section, we will now move on to the last section of this chapter, which is about the correlation of time series data.
Correlation analysis
Correlation measures the relationship between two variables. This relationship may or may not be causal, that is, with one variable being the result of the other. This section will explore the different types of correlation applicable to time series.
Autocorrelation
The **AutoCorrelation Function** (**ACF**) measures the relationship between a time series and its past values. High autocorrelation indicates that past values have a strong influence on future values. This information can then be used to build predictive models, for instance, in selecting the right parameters for models such as ARIMA, thereby enhancing the robustness of the analysis. Understanding autocorrelation also helps in identifying seasonal effects and cycles.
The **Partial AutoCorrelation Function** (**PACF**) similarly measures the relationship between a variable and its past values, but contrary to the ACF, with the PACF we discount the effect of values of the time series at all shorter lags.
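As an illustration of the autocorrelation at a given lag, the following plain-Python sketch computes the usual sample estimate: the covariance of the series with its shifted copy, divided by the variance of the series. The function name `autocorrelation` and the synthetic series are assumptions made for illustration. The sketch covers the ACF only, as the PACF requires an additional regression step.

```python
import math


def autocorrelation(values, lag):
    """Sample autocorrelation of a series with itself shifted by `lag` steps."""
    n = len(values)
    mean = sum(values) / n
    denominator = sum((v - mean) ** 2 for v in values)
    numerator = sum((values[t] - mean) * (values[t + lag] - mean)
                    for t in range(n - lag))
    return numerator / denominator


# Hypothetical hourly series with a 24-hour cycle over 10 days
series = [math.sin(2 * math.pi * t / 24) for t in range(240)]
for lag in (1, 12, 24):
    print(lag, round(autocorrelation(series, lag), 3))

# Approximate 95% band around zero, commonly used to judge significance
band = 1.96 / math.sqrt(len(series))
```

For this synthetic daily cycle, the autocorrelation is strongly positive at lag 24 and strongly negative at lag 12, which is the kind of regular pattern that points to seasonality.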
Check
The following code shows how you can check for autocorrelation and partial autocorrelation using Apache Spark and `plot_acf` and `plot_pacf` from the `statsmodels` library.

# Autocorrelation
# Code in cell 45
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
# Plot the autocorrelation function (ACF)
plt.figure(figsize=(12, 6))
plot_acf(hourly_resampled['Global_active_power'], lags=3*24)
plt.title('Autocorrelation Function (ACF)')
plt.show()
# Plot the partial autocorrelation function (PACF)
plt.figure(figsize=(12, 6))
plot_pacf(hourly_resampled['Global_active_power'], lags=3*24)
plt.title('Partial Autocorrelation Function (PACF)')
plt.show()
…

The resulting ACF and PACF plots are shown in *Figure 6.22*.
![](https://github.com/OpenDocCN/freelearn-ds-pt3-zh/raw/master/docs/ts-anal-spk/img/B18568_06_22.jpg)

Figure 6.22: ACF and PACF plots
The outcomes of ACF and PACF indicate the nature of the time series and guide the selection of the appropriate models and parameters for forecasting. Let’s now make sense of these plots and how we can use their outcome.
Interpretation of ACF
We will consider the peaks and the decay from the ACF plot to interpret the outcome, using the upper graph in *Figure 6.22* as an example.
Peaks in the autocorrelation plot outside the confidence interval indicate notable autocorrelations. Regular intervals point to seasonality. From the example, we can see autocorrelation at lags 1, 2, and 3 and seasonality at lags 12 and 24, which correspond to a 12- and 24-hour interval.
A slow decay in the autocorrelation plot suggests that the series is non-stationary with a trend. In this case, we can convert the series to stationary by differencing it, as discussed in the previous section on *Differencing*. This, however, is not the case in our example in *Figure 6.22*, as there is no slow decay.
The outcome of the ACF can be used to define the **moving average** (**MA**) parameter `q` of an ARIMA model. Major peaks at lags 1, 2, and 3 in our example suggest q=1, q=2, and q=3 as candidate values.
Interpretation of PACF
We will consider the peaks and the cut-off from the PACF plot to interpret the outcome, using the lower graph in *Figure 6.22* as an example.
Peaks in the partial autocorrelation plot outside the confidence interval indicate notable partial autocorrelations. In the example, this is seen at lags 1, 12, and 24.
An immediate cut-off after some lags indicates an **autoregressive** (**AR**) component. In the example, this is after lag 1.
The outcome of the PACF can be used to define the AR parameter `p` of an ARIMA model. A major peak at lag 1 in our example suggests p=1.
Model parameters
Based on the interpretation of the ACF and PACF plots in *Figure 6.22*, we can consider the following candidate ARIMA(p, d, q) models, where p is the PACF cut-off point, d is the order of differencing, and q is the ACF autocorrelation lag:

*   ARIMA(1, 0, 1)
*   ARIMA(1, 0, 2)
*   ARIMA(1, 0, 3)
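The list of candidates above is simply the combination of the values read off the plots. A small sketch of how such a grid could be enumerated in Python follows. The variable names are hypothetical, and the values are the ones discussed in this section.

```python
from itertools import product

# Candidate values read off the plots (taken from the discussion above)
p_values = [1]        # PACF cuts off after lag 1
d_values = [0]        # no differencing, as the ADF test indicated stationarity
q_values = [1, 2, 3]  # ACF peaks at lags 1, 2, and 3

candidate_orders = list(product(p_values, d_values, q_values))
print(candidate_orders)
```

Each tuple in `candidate_orders` can then be passed as the order of an ARIMA model when comparing candidates.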

We will discuss model selection and parameters in detail in the next chapter. The depth of our discussion here is just enough to conclude the discussion on ACF and PACF. Let’s move on to other lag analysis methods.
Lag analysis
In addition to ACF and PACF plots seen previously, we will explore another lag analysis method in this section.
We’ll start by calculating the different lag values of interest, as per the following code extract, using the `Window` and `lag` functions we have seen previously.

# Lag analysis
# Code in cell 49
…
window = Window.orderBy("timestamp")
# Create lag features
hourly_df = hourly_df.withColumn(
    "lag1", F.lag(F.col("Global_active_power"), 1).over(window))
hourly_df = hourly_df.withColumn(
    "lag2", F.lag(F.col("Global_active_power"), 2).over(window))
hourly_df = hourly_df.withColumn(
    "lag12", F.lag(F.col("Global_active_power"), 12).over(window))
hourly_df = hourly_df.withColumn(
    "lag24", F.lag(F.col("Global_active_power"), 24).over(window))
…

This creates the lag columns, as shown in *Figure 6.23*.
![](https://github.com/OpenDocCN/freelearn-ds-pt3-zh/raw/master/docs/ts-anal-spk/img/B18568_06_23.jpg)

Figure 6.23: Lag values
We then calculate the correlation of the current values with their lag values, as in the following code, using the `stat.corr()` function.

# Code in cell 50
…
# Calculate the autocorrelation at lag 1
df_lag1 = hourly_df.dropna(subset=["lag1"])
autocorr_lag1 = df_lag1.stat.corr("Global_active_power", "lag1")
…
# Calculate the autocorrelation at lag 24
df_lag24 = hourly_df.dropna(subset=["lag24"])
autocorr_lag24 = df_lag24.stat.corr("Global_active_power", "lag24")
…

*Figure 6.24* shows the autocorrelation values, significant at lags 1, 2, and 24, as we saw on the ACF plot previously.
![](https://github.com/OpenDocCN/freelearn-ds-pt3-zh/raw/master/docs/ts-anal-spk/img/B18568_06_24.jpg)

Figure 6.24: Autocorrelation at different lag values
Finally, by plotting the current and lag values together, we can see in *Figure 6.25* how they compare to each other. We can visually confirm here the greater correlation at lags 1, 2, and 24.
![](https://github.com/OpenDocCN/freelearn-ds-pt3-zh/raw/master/docs/ts-anal-spk/img/B18568_06_25.jpg)

Figure 6.25: Comparison of current and lag values
This concludes the section on autocorrelation, where we looked at ACF and PACF, and how to calculate lagged features and their correlation using Apache Spark. While the lag analysis methods in this section have been used for autocorrelation, they can also be used for cross-correlation, which we will cover next, as another type of correlation, this time between different time series.
Cross-correlation
Cross-correlation measures the relationship between two different time series. One series may influence or predict the other over different time lags, in what is called a **lead-lag relationship**. Cross-correlation is used for multivariate time series modeling and causality analysis.
Going back to the profiling report we saw earlier, we can see a graph of the correlation of the different columns of the example dataset included in the report, as in *Figure 6.26*.
![](https://github.com/OpenDocCN/freelearn-ds-pt3-zh/raw/master/docs/ts-anal-spk/img/B18568_06_26.jpg)

Figure 6.26: Cross-correlation heatmap
We can calculate the cross-correlation directly with the following code.

# Cross-correlation
# Code in cell 53
…
# Calculate the cross-correlation between value1 and value2
cross_corr = hourly_df.stat.corr("Global_active_power", "Voltage")
…

The cross-correlation calculation yields the value in *Figure 6.27*. As this correlation is at the same lag, it does not have predictive value, in the sense that we are not using the past to predict the future. However, this pair of attributes is still worth further analysis at different lags, due to the significant cross-correlation.
![](https://github.com/OpenDocCN/freelearn-ds-pt3-zh/raw/master/docs/ts-anal-spk/img/B18568_06_27.jpg)

Figure 6.27: Cross-correlation value
Note
We know that P=IV, where P is electrical power, I is current, and V is voltage. This equation indicates how power and voltage are related. Hence, these two time series are not independent of each other. Even if there is no further insight into the P and V relationship, we will continue this analysis as an example of cross-correlation analysis.
As cross-correlation at the same lag does not help much for prediction, we will now look at using different lag values with the following code. This uses the cross-correlation `ccf()` function, which calculates the cross-correlation at different lag values.

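# Code at line 54
…
import matplotlib.pyplot as plt
from statsmodels.tsa.stattools import ccf

# Use the first 36 hourly observations
hourly_ = hourly_resampled.iloc[:36]
# Calculate the cross-correlation function
ccf_values = ccf(hourly_['Global_active_power'], hourly_['Voltage'])
# Plot the cross-correlation function
# (use_line_collection was deprecated in Matplotlib 3.6 and removed
#  in later releases; drop the argument on newer versions)
plt.figure(figsize=(12, 6))
plt.stem(range(len(ccf_values)),
         ccf_values, use_line_collection=True, markerfmt="-")
plt.title('Cross-Correlation Function (CCF)')
…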

This generates the plot in *Figure 6.28*, which shows the correlation of the two attributes at different lags.
![](https://github.com/OpenDocCN/freelearn-ds-pt3-zh/raw/master/docs/ts-anal-spk/img/B18568_06_28.jpg)

Figure 6.28: Cross-correlation function
To conclude, this section showed how to perform cross-correlation analysis by creating lagged features, and calculating and visualizing cross-correlation.
Summary
In this chapter, we used exploratory data analysis to uncover patterns and insights in time series data. Starting with statistical analysis techniques, where we profiled the data and analyzed its distribution, we then resampled and decomposed the series into its components. To understand the nature of the time series, we also checked for stationarity, autocorrelation, and cross-correlation. By this point, we have gathered enough information on time series to guide us into the next step of building predictive models for time series.
In the next chapter, we will dive into the core topic of this book, which is developing and testing models for time series analysis.

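Chapter 7: Building and Testing Models

Having covered the data preparation and exploratory data analysis stages of time series analysis, we now turn to building predictive models for time series data. We will cover different types of models and how to decide which one to use. We will also learn how to train, tune, and evaluate models.

The concepts covered in this chapter serve as a practical guide to model development, providing the basic building blocks for effective time series models and enabling accurate predictions and insightful analysis. We will work within the execution constraints commonly found in real-life projects and conclude by comparing the results of different models on a forecasting problem.

We will cover the following main topics:

Model selection
Development and testing
Model comparison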

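Technical requirements

The code for this chapter is covered in the Development and testing section and can be found in the ch7 folder of the book's GitHub repository at the following URL:

github.com/PacktPublishing/Time-Series-Analysis-with-Spark/tree/main/ch7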

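Model selection

The first step before developing a time series analysis model is to choose which model to use. As discussed in Chapter 1, selecting the right model is one of the key challenges in time series analysis. This choice affects the accuracy, reliability, efficiency, and scalability of the analysis, among other aspects. A well-chosen model, in turn, leads to better-informed decisions and more effective outcomes, combining scientific rigor with practical usefulness.

There are different types of models, each with its own characteristics.

Types of models

Time series analysis models can be categorized into statistical models, classical machine learning (ML) models, and deep learning (DL) models:

Statistical models for time series analysis are based on statistical theory and make assumptions about the characteristics of the time series, such as linearity and stationarity. Examples include the autoregressive integrated moving average (ARIMA) model, the seasonal autoregressive integrated moving average with exogenous regressors (SARIMAX) model, exponential smoothing (ETS), generalized autoregressive conditional heteroskedasticity (GARCH), and state-space models.
Classical ML models for time series analysis use algorithms that learn from the data without being explicitly programmed. These models can handle non-linear relationships. However, they usually need more training data than statistical models. Examples include linear regression, support vector machines (SVMs), k-nearest neighbors (kNN), random forests, and gradient boosting machines.
DL models use neural networks with multiple layers to learn complex patterns in time series data. These models can handle non-linear relationships and long-term dependencies. However, they need large amounts of training data and significant computational resources. Examples include long short-term memory (LSTM) networks, convolutional neural networks (CNNs), temporal convolutional networks (TCNs), transformers, and autoencoders.

ML versus DL

Deep learning is a subset of machine learning that uses deep neural networks. Following common practice, we use the term classical machine learning here for methods and models that are not based on neural networks, and the term deep learning for those that are.

Each of the categories and models mentioned previously has unique characteristics and approaches that determine its suitability, which we will explore next.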

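Selection criteria

The choice of model is based on several criteria. We touched on this briefly in the section on selecting the right model in Chapter 1 and in the initial discussion of model selection in Chapter 4. A model's suitability for a time series analysis problem depends on factors such as the goal of the analysis, the characteristics of the data, the available computing power, and the time available.

We will now look in more depth at these and other important factors in model selection.

Types of use cases

Time series analysis broadly falls into forecasting, classification, and anomaly detection use cases, as discussed in Chapter 2. We will briefly recap these use cases here and highlight the commonly used models. The following sections discuss them in more depth.

Forecasting aims to predict future values based on patterns that the model has learned from past values. As covered in Chapter 2, forecasting can be single-step or multi-step and based on a single (univariate) or multiple (multivariate) time series. Commonly used models such as ARIMA, SARIMA, and exponential smoothing (ETS) are chosen for their simplicity and perform strongly on forecasting tasks. LSTM and Prophet, which was introduced in earlier chapters of this book, suit more complex forecasting requirements and can work more effectively in those scenarios.
Pattern recognition and classification are used to identify and understand patterns and to classify time series accordingly. Commonly used models are based on decomposition methods, such as seasonal-trend decomposition using LOESS (STL) and multiple STL (MSTL), and on Fourier analysis. We spent some time on decomposition methods in Chapter 1 and Chapter 6. In Chapter 2, we briefly discussed Fourier analysis, along with distance-based methods, shape analysis, ensemble methods, and deep learning.
Anomaly detection aims to identify outliers or anomalies in a time series. As shown in Chapter 2, detection can be based on univariate or multivariate series and on point, collective, or contextual analysis. A point initially flagged as an anomaly may ultimately turn out to be a novelty, that is, a new pattern that is not a problem. Commonly used models are those that support residual analysis, such as ARIMA. ML models such as Isolation Forest are also frequently used, as are specialized methods such as Seasonal Hybrid Extreme Studentized Deviate (SH-ESD) when the proportion of anomalies is high. We saw a code example of anomaly detection with Isolation Forest in Chapter 2, where we also discussed supervised, unsupervised, semi-supervised, and hybrid approaches.

Another model selection criterion is the statistical nature of the time series, which we will discuss next.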

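Nature of the time series

The nature of a time series, that is, its statistical properties, influences the choice of model. Models are usually researched and developed on specific assumptions about the properties of the time series, which determines where they apply. This section focuses on applicability and skips definitions, assuming that you are by now familiar with the terms used here from the introduction in Chapter 1 and the code examples in Chapter 6:

Stationary time series can be modeled with ARIMA, which assumes that the series is stationary. An example of a stationary time series is the daily percentage return of a stock over a three-year period. Assuming no significant structural change in the market, stock returns typically fluctuate around a stable mean with a consistent variance.

Non-stationary time series can be transformed into stationary series by differencing, as seen in Chapter 6. The differenced series can then be used with models that assume stationarity. Alternatively, use Prophet or ML models, which can handle non-stationary series. An example of a non-stationary time series is the monthly unemployment rate, which can have a trend as well as cyclical patterns related to economic conditions.
Seasonal time series require models that can handle seasonality, such as SARIMA, ETS, Prophet, or ML models. We saw an application of this in the coding example in Chapter 2, which used Prophet to forecast temperature.
Trends in a time series can affect the performance of some models, such as ARIMA. In this case, as with stationarity, we can remove the trend component by differencing, as in the code example in Chapter 6, and then use an ARIMA model. Alternatively, use models that can handle trends, such as trend models, ETS, Prophet, or ML.
Volatile time series can be handled by models such as generalized autoregressive conditional heteroskedasticity (GARCH), stochastic volatility GARCH (SV-GARCH), or ML. These models are commonly used for forecasting and risk management in highly volatile financial markets, as well as in other domains.
Linearity in the data relationships means that linear models such as ARIMA are suitable. An example of a linear time series is the daily temperature, where today's temperature can be predicted as a linear combination of the temperatures of the previous two days plus some random error.

In the case of non-linear patterns, ML models with neural networks are more suitable. An example of a non-linear time series is a stock price that follows one relationship when it is below a certain threshold, say 100, and a different relationship when it is above that threshold.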
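
The differencing mentioned above for removing trend and seasonality can be sketched in a few lines of plain Python. This is an illustrative example on made-up series, with a hypothetical `difference` helper. It is not the pandas code from Chapter 6.

```python
def difference(series, lag=1):
    """Return series[t] - series[t - lag] for every t where both values exist."""
    return [series[i] - series[i - lag] for i in range(lag, len(series))]


# Hypothetical series with a linear trend: 10, 12, 14, ...
trend_series = [10 + 2 * t for t in range(8)]
detrended = difference(trend_series)          # first-order differencing
print(detrended)

# Hypothetical series with a repeating pattern of period 3
seasonal_series = [5, 7, 9, 5, 7, 9, 5, 7, 9]
deseasonalized = difference(seasonal_series, lag=3)  # seasonal differencing
print(deseasonalized)
```

First-order differencing turns the linear trend into a constant, and differencing at the seasonal period removes the repeating pattern. The same idea underlies the `d` and `D` orders of ARIMA and SARIMA models.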

The volume and frequency of the data, discussed next, are further characteristics of a time series that influence model selection.

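Volume and frequency of data

The volume and frequency of data affect the computing power and the time required for the analysis. Their combination determines which models we can choose. We will discuss volume and frequency here, and the other two factors in the next section:

Small datasets can be analyzed with statistical models such as ARIMA and ETS. These are simple models that fit smaller datasets well. An example of a small dataset is the daily sales of a store over the past few years.
Large datasets are well suited to ML models such as gradient boosting and LSTM. The fit works both ways: ML models bring the computational power and scalability needed for large datasets, and they in turn need large amounts of training data to avoid overfitting. ML models can learn complex patterns in large datasets but require more computational resources. Examples of large datasets include minute-level stock prices or sensor data over the past five years.

As we will see in Chapter 8, we can scale models to large datasets by using the distributed computing capabilities of Apache Spark:

Low-frequency time series, such as daily, weekly, monthly, quarterly, or yearly series, are usually small. As mentioned for small datasets, ARIMA and ETS are usually good choices here.
High-frequency time series can exhibit rapid changes, noise, volatility, and heteroskedasticity, which can be handled with models such as GARCH, commonly used for financial time series.

If the analysis is needed at a lower frequency than the rate at which the data arrives, a high-frequency series can be converted into a low-frequency series by resampling and aggregating, as discussed in Chapter 6. Resampling reduces the size of the dataset while smoothing out noise and volatility. This opens up the possibility of using models suited to low-frequency time series, as discussed previously.

Diminishing value of high-frequency data

We use frequency here to mean the time interval between consecutive data points in a time series, also called granularity. Another consideration with high-frequency data is that the analysis also needs to run at high frequency, because the value of high-frequency data diminishes quickly over time. Consider how critical real-time stock tick changes are as they happen, and how much less relevant they become a few hours later. In such cases, the model must be capable of extremely fast computation, potentially in real time.

Higher volume and frequency of data require more computational resources, which we will discuss next.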

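Computational constraints

Like any other project, time series analysis is carried out within a budget. This means that the resources available, including the computing power to run the analysis, are constrained. At the same time, we know that higher data volume and frequency require more computational resources. We must also consider how quickly the analysis needs to complete for the results to be useful. With these constraints in mind, let's explore the choice of models:

Limited computational resources mean that we may need to consider a combination of reducing the dataset size by resampling and using simple models such as ARIMA or ETS. ML models, while better at detecting complex patterns and at handling larger datasets, usually require more computational resources.
Fast analysis requires models that are faster to train and to predict with. Models such as ARIMA or ETS are, again, good choices for smaller datasets.

If fast analysis is required on a large dataset, the following options are available:
- Distributed processing on an Apache Spark cluster to scale to large datasets, which we will cover in the next chapter.
- Resampling to reduce the dataset size, then using simple models such as ARIMA or ETS.
- ML models, with a caveat: for larger datasets, the training and tuning stages become slower. Prediction speed can be improved by using more computational resources, which of course comes at a higher cost. Note that training, tuning, and prediction can also be sped up with Apache Spark's distributed processing, as we will see in the next chapter.
Computational resource cost is another important factor that can limit the use of compute-intensive models. While simpler statistical models can run on cheaper standard resources, DL models may require more expensive GPUs on high-performance hardware.

Having considered how computational requirements influence the choice of model, we will next consider how model accuracy, complexity, and interpretability determine which model to use.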

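Model accuracy, complexity, and interpretability

Other factors to consider in model selection include model accuracy, complexity, and interpretability:

Model accuracy is wrongly treated as the deciding factor in model selection in many cases. Accuracy is deliberately listed toward the end of the selection criteria to emphasize the importance of considering the other factors as well. The best model is not necessarily the most accurate one, but the one that delivers the greatest return on investment for the specific use case.

When high accuracy is required, especially in forecasting, more complex models such as SARIMAX or DL may be needed. Hyperparameter tuning is used as part of the development process to improve accuracy further, but it adds computational overhead.
Complexity and interpretability are often in conflict. The need for higher accuracy leads to more complex models, which are usually harder to interpret and are often referred to as black-box models.

If interpretability is crucial, choose simpler models such as ARIMA or ETS, which have the additional advantage of lower computational requirements. Tree-based models such as GBM or time series tree-based pipelines (TSPi) offer a good balance between accuracy and computational requirements, while simpler tree models provide interpretability.

If the data exhibits complex patterns and high accuracy is crucial, there may not be much choice, and we may need to use complex models, trading off computational resources and interpretability.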

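Model selection overview

A few points are worth noting about model selection:

Statistical models such as ARIMA are based on assumptions about the time series, which require statistical tests and may require additional preprocessing to transform the series before the model can be used.
Prophet and ML models are more broadly applicable but come with additional complexity and computational requirements.
The models mentioned in this section are examples that fit the criteria discussed. Other models, from the growing list of publicly available models and methods, can and should be tested. Finding the best model is an experimental and iterative process that depends on the specific use case.

As we saw in the section on selection criteria, several factors influence the choice of model and determine where to invest more effort. Which factors matter most depends on the context of the project and the use case. The best model is the one that yields the highest return on investment, which requires trade-offs between the different factors discussed here.

At this point, with the model selected, we are ready to move to the next development step, which is to train the model on our time series data.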

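Development and testing

In this section, we will compare the forecasting performance of models from the different categories: statistical, classical ML, and DL. We will use six different models: SARIMA, LightGBM, LSTM, NBEATS, NHITS, and NeuralProphet. These models were chosen for their broad and proven adoption and because they are easy to access and use.

We will work within the following constraints:

As far as possible, the comparison uses default model hyperparameters, with tuning limited to a few cases that are explained later.
The complete execution, from loading the data to model training, testing, and forecasting, is limited to under 15 minutes.
The compute used is limited to the Databricks Community Edition compute shown in Figure 7.1, with 15.3 GB of memory and 2 CPU cores.

Figure 7.1: Databricks Community Edition compute

In real-life projects, we often face time and resource constraints. This section also aims to give you the tools to work within such constraints.

Single-threaded, multi-threaded, and clusters

In the code examples in this chapter, we will use Pandas and NumPy. Pandas is single-threaded in its use of CPU cores, whereas NumPy is multi-threaded by default and so uses several CPU cores in parallel. Both are bound to a single machine and cannot take advantage of a multi-machine Spark cluster. We will discuss how to work around this limitation in Chapter 8, which covers scaling. You will find Pandas and NumPy in many existing code examples, so it is important to start with these libraries as a foundation. In Chapter 8, we will then discuss how to convert single-machine code into code that uses the capabilities of a Spark cluster.

The time series data used in this section is an extended version of the household energy consumption data used in Chapter 2. We will use the same time series for all the models in the rest of this chapter. The dataset is in the ch7 folder as ts-spark_ch7_ds1_25mb.csv. As this is a new dataset, we will go through the steps of exploring the data in the next section.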

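Data exploration

In this section, we check the dataset for stationarity, seasonality, and autocorrelation. This is an important step in understanding the nature of the time series.

The code for this section is in ts-spark_ch7_1e_sarima_comm.dbc. We import the code into Databricks Community Edition, following the instructions in the Hands-on: Loading and visualizing time series section of Chapter 1.

The URL of the code is as follows: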

github.com/PacktPublishing/Time-Series-Analysis-with-Spark/raw/main/ch7/ts-spark_ch7_1e_sarima_comm.dbc

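The first part of the code loads and prepares the data. We will not go into the details here, as we covered data preparation in Chapter 5, and you can refer to the code in the notebook. The data exploration part, however, is relevant to this chapter, so let's explore it further, starting with the stationarity check.

Stationarity

We can check whether the energy consumption time series is stationary by running the Augmented Dickey-Fuller (ADF) test with the following code: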

from statsmodels.tsa.stattools import adfuller
# Perform Augmented Dickey-Fuller test
result = adfuller(data_hr[-300:]['Global_active_power'])
# if Test statistic < Critical Value and p-value < 0.05
#   reject the Null hypothesis, time series does not have a unit root
#   series is stationary
# Extract and print the ADF test results
print('ADF Statistic:', result[0])
print('p-value:', result[1])
print('Critical Values:')
for key, value in result[4].items():
    print(f'   {key}: {value}')

This gives the following ADF statistics:

ADF Statistic: -6.615237252003429
p-value: 6.231223531550648e-09
Critical Values:
 1%: -3.4524113009049935
 5%: -2.8712554127251764
 10%: -2.571946570731871

As the ADF statistic is less than the critical values and the p-value is less than 0.05, we can conclude that the time series is stationary.
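
The decision rule described in the code comments can be written as a small helper. This is only a sketch: `is_stationary` is a hypothetical function, and the 5% critical value and 0.05 threshold are the conventional choices assumed here. The inputs are the values printed by the ADF test above.

```python
def is_stationary(adf_stat, p_value, critical_values, level="5%", alpha=0.05):
    """Reject the unit-root null hypothesis when both conditions hold."""
    return adf_stat < critical_values[level] and p_value < alpha


# Values reported by the ADF test on the energy consumption series
adf_stat = -6.615237252003429
p_value = 6.231223531550648e-09
critical_values = {
    "1%": -3.4524113009049935,
    "5%": -2.8712554127251764,
    "10%": -2.571946570731871,
}

print(is_stationary(adf_stat, p_value, critical_values))  # True
```

Here the statistic is below even the 1% critical value, so the conclusion does not depend on the significance level chosen.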

Seasonality

We can check for seasonality with the following code:

from statsmodels.tsa.seasonal import seasonal_decompose
# Decompose the time series data into seasonal, trend, and residual 
# components
results = seasonal_decompose(data_hr)
# Plot the last 300 data points of the seasonal component
results.seasonal[-300:].plot(figsize = (12,8));

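This gives the seasonal decomposition in Figure 7.2.

Figure 7.2: Seasonal decomposition

As the pattern repeats every 24 hours, we can conclude that the time series has daily seasonality.

Autocorrelation

We can check for autocorrelation and partial autocorrelation with the following code: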

import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
# Plot ACF to identify autocorrelation in 'data_hr' DataFrame
plot_acf(data_hr['Global_active_power'])
# Plot PACF to identify partial autocorrelation in 'data_hr' DataFrame
plot_pacf(data_hr['Global_active_power'])
# Display the ACF and PACF plots
plt.show()

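This gives the autocorrelation plot in Figure 7.3.

Figure 7.3: Autocorrelation (y axis) at different lags (x axis)

We can see high autocorrelation at the lower lag values, including lag 1, and at lag 12, as well as the effect of seasonality at lag 24. This makes sense given the typical pattern of energy consumption in a household:

Periods of active energy use, such as cooking, doing laundry, or watching television, are likely to last more than one hour (lag 1)
Mornings and evenings (lag 12) are usually the peak periods of activity
Daily routines mean that we have similar cycles of activity every 24 hours (lag 24)

Figure 7.4: Partial autocorrelation

The PACF plot shows high partial autocorrelation at lag 1 and noticeable partial autocorrelation around lags 10 and 23. This is again consistent with the typical pattern of household energy consumption that we mentioned.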

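Statistical model – SARIMA

The first model we will discuss is SARIMA, which extends the ARIMA model with a seasonal component. While ARIMA accounts for autocorrelation, differencing for stationarity, and moving averages, SARIMA additionally accounts for seasonal patterns in the data.

The code for this section is in ts-spark_ch7_1e_sarima_comm.dbc. We import the code into Databricks Community Edition, following the instructions in the Hands-on: Loading and visualizing time series section of Chapter 1.

The URL of the code is as follows: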

github.com/PacktPublishing/Time-Series-Analysis-with-Spark/raw/main/ch7/ts-spark_ch7_1e_sarima_comm.dbc

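Development and tuning

For model development, we use the following code to separate the last 48 hours of the dataset from the training data. These 48 hours will be used later for testing, and the rest for training: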

# Split the data into training and testing sets
# The last 48 observations are used for testing,
# the rest for training
train = data_hr[:-48]
test = data_hr[-48:]

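We will discuss two approaches that combine training and tuning, to train the model and find the best parameters: auto_arima and ParameterGrid.

Auto ARIMA

With the auto ARIMA approach, we want to automatically find the model parameters that minimize the AIC. We will use the pmdarima library to demonstrate this approach. As this is a compute-intensive operation and we want to stay within the time (15 minutes) and resource (Databricks Community Edition) constraints explained earlier, we limit the dataset to the last 300 data points.

The code using pmdarima is as follows: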

import pmdarima as pm
# Create auto_arima model to automatically select the best ARIMA parameters
model = pm.auto_arima(
    # Use the last 300 observations of the series for modeling:
    train[-300:]["Global_active_power"],
    # Enable seasonal differencing:
    seasonal=True,
    # Set the seasonal period to 24
    # (e.g., 24 hours for daily data):
    m=24,
    # Set the degree of non-seasonal differencing to 0
    # (assumes data is already stationary):
    d=0,
    # Set the degree of seasonal differencing to 1:
    D=1,
    # Set the maximum value of AR (p) terms to consider:
    max_p=3,
    # Set the maximum value of MA (q) terms to consider:
    max_q=3,
    # Set the maximum value of seasonal AR (P) terms to consider:
    max_P=3,
    # Set the maximum value of seasonal MA (Q) terms to consider:
    max_Q=3,
    # Use AIC (Akaike Information Criterion) to select the best model:
    information_criterion='aic',
    # Print fit information to see the progression of
    # the model fitting:
    trace=True,
    # Ignore models that fail to converge:
    error_action='ignore',
    # Use stepwise algorithm for efficient search of the model space:
    stepwise=True,
    # Suppress convergence warnings:
    suppress_warnings=True
)
# Print the summary of the fitted model
print(model.summary())

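The following output shows the stepwise search for the set of parameters that minimizes the AIC. This will be the best set of parameters to use with the ARIMA model to forecast the energy consumption of the household: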

Performing stepwise search to minimize aic
…
ARIMA(1,0,1)(2,1,0)[24] intercept : AIC=688.757, Time=9.37 sec
…
ARIMA(2,0,2)(2,1,0)[24] : AIC=681.750, Time=6.83 sec
…
ARIMA(1,0,1)(2,1,0)[24] : AIC=686.763, Time=6.02 sec
Best model: ARIMA(2,0,2)(2,1,0)[24]

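Note that while this is the best set of model parameters found within our time and resource constraints, we might find a better model by letting the algorithm run for longer.

ParameterGrid

With the ParameterGrid approach, we go through a list of parameter combinations one by one to find the model parameters that minimize the AIC.

The code using ParameterGrid is as follows: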

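from sklearn.model_selection import ParameterGrid
from statsmodels.tsa.statespace.sarimax import SARIMAX
# Define parameter grid for SARIMAX model configuration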
param_grid = {
    'order': [(0, 0, 0), (1, 0, 1), (2, 0, 0)],
    # Non-seasonal ARIMA orders
    'seasonal_order': [
        (0, 0, 0, 24),
        (2, 0, 1, 24),
        (2, 1, 1, 24)
    ],  # Seasonal ARIMA orders with period of 24
}
# Initialize variables to store the best AIC and
# corresponding parameters
best_aic = float("inf")
best_params = ["",""]
# Iterate over all combinations of parameters in the grid
for params in ParameterGrid(param_grid):
    print(
        f"order: {params['order']}, seasonal_order: {params['seasonal_order']}"
    )
    try:
        # Initialize and fit SARIMAX model with current parameters
        model = SARIMAX(
            train['Global_active_power'],
            order=params['order'],
            seasonal_order=params['seasonal_order'])
        model_fit = model.fit(disp=False)
        print(f"aic: {model_fit.aic}")
        # Update best parameters if current model has lower AIC
        if model_fit.aic < best_aic:
            best_aic = model_fit.aic
            best_params = params
    except Exception as error:
        print("An error occurred:", error)
        continue

Although auto ARIMA and ParameterGrid are similar in that both minimize the AIC, auto ARIMA is much simpler to use, requiring a single function call.
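
To see how many SARIMAX fits the grid above implies, the following sketch enumerates the same parameter combinations with the standard library. It is an illustration of what a parameter grid expands to and does not use or reproduce scikit-learn's `ParameterGrid` implementation.

```python
from itertools import product

# Same grid as in the ParameterGrid example
param_grid = {
    'order': [(0, 0, 0), (1, 0, 1), (2, 0, 0)],
    'seasonal_order': [(0, 0, 0, 24), (2, 0, 1, 24), (2, 1, 1, 24)],
}

# Cartesian product of all candidate values, one dict per combination
keys = sorted(param_grid)
combinations = [
    dict(zip(keys, values))
    for values in product(*(param_grid[key] for key in keys))
]

print(f"Number of models to fit: {len(combinations)}")
for params in combinations[:2]:
    print(params)
```

Each additional hyperparameter multiplies the number of fits, which is why the grid is kept small to stay within the 15-minute limit.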

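With the SARIMA model trained, we will next test the model's forecasts.

Testing and forecasting

We use the model to forecast the test dataset with the predict function, one period at a time, updating the model with the actual value after each prediction. This iterative approach, built on the single-step prediction in forecast_step, produces a multi-step forecast: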

def forecast_step():
    # Predicts the next period with confidence intervals
    forecast, conf_int = model.predict(
        n_periods=1, return_conf_int=True)
…
# Iterate over each observation in the test dataset
for obs in test['Global_active_power']:
    forecast, conf_int = forecast_step()  # Forecast next step
    forecasts.append(forecast)  # Append forecast to list
…
    # Update the model with the new observation
    model.update(obs)
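
The predict-then-update loop above can be illustrated without pmdarima. In this sketch, `LastValueModel` is a hypothetical stand-in for the fitted SARIMA model that simply forecasts the last value it has seen. Only the shape of the loop mirrors the notebook code.

```python
class LastValueModel:
    """Hypothetical stand-in for a fitted model: predicts the last observed value."""

    def __init__(self, history):
        self.history = list(history)

    def predict(self):
        # One-step-ahead forecast
        return self.history[-1]

    def update(self, observation):
        # Reveal the actual value to the model before the next forecast
        self.history.append(observation)


train = [1.0, 2.0, 3.0]
test = [4.0, 6.0, 5.0]

model = LastValueModel(train)
forecasts = []
for obs in test:
    forecasts.append(model.predict())  # forecast the next period
    model.update(obs)                  # then update with the actual value

print(forecasts)  # [3.0, 4.0, 6.0]
```

Each forecast uses only information available before the period it predicts, so the test values never leak into their own predictions.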

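We can then plot the forecast against the actual values, as in Figure 7.5 and Figure 7.6.

Figure 7.5: SARIMA forecast versus actuals (training and testing)

In Figure 7.6, we zoom in on the testing period to compare the forecast with the actual values visually.

Figure 7.6: SARIMA forecast versus actuals (zoomed in on test data)

While visualizations help us get a sense of the model's forecasting ability, we still need quantitative metrics to evaluate how good the model is. These metrics will also let us compare forecasting accuracy with other models.

Several metrics are available for time series forecasting. In this chapter, we will show the use of the following three, highlighting how they serve different objectives:

Mean squared error (MSE) measures the average of the squares of the differences between the forecast (F) and actual (A) values. It works well when we want to penalize large errors. However, as squaring gives greater weight to large differences, it is sensitive to outliers.

Symmetric mean absolute percentage error (SMAPE) is the mean of the absolute difference between the forecast (F) and actual (A) values, divided by half the sum of the absolute values of the actual and forecast values, and expressed as a percentage. SMAPE adjusts for the scale of the data, which makes it suitable for comparisons across datasets. Because of its symmetric scaling, it is less sensitive to extreme values.

Weighted absolute percentage error (WAPE) is a normalized error measure that weights the absolute error by the actual values. It works well with data of varying magnitudes but is sensitive to errors on large values.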

$$\mathrm{WAPE} = \frac{\sum_{t=1}^{n} \left| F_t - A_t \right|}{\sum_{t=1}^{n} \left| A_t \right|} \times 100\%$$
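
The three metrics can be computed directly from their definitions. The following sketch uses plain Python on made-up values. The function names are illustrative, SMAPE is returned as a percentage, and WAPE is returned as a fraction (without the multiplication by 100), which is the scale of the WAPE values reported in this chapter.

```python
def mse(actual, forecast):
    """Mean of the squared differences between forecast and actual values."""
    return sum((f - a) ** 2 for a, f in zip(actual, forecast)) / len(actual)


def smape(actual, forecast):
    """Mean of |F - A| / ((|A| + |F|) / 2), in percent.

    Assumes that A and F are never both zero at the same time step.
    """
    ratios = [
        abs(f - a) / ((abs(a) + abs(f)) / 2)
        for a, f in zip(actual, forecast)
    ]
    return 100 * sum(ratios) / len(ratios)


def wape(actual, forecast):
    """Sum of absolute errors divided by the sum of absolute actual values."""
    total_error = sum(abs(f - a) for a, f in zip(actual, forecast))
    return total_error / sum(abs(a) for a in actual)


# Hypothetical actual and forecast values
actual = [2.0, 4.0, 4.0]
forecast = [1.0, 4.0, 6.0]

print(f"MSE:   {mse(actual, forecast):.4f}")
print(f"SMAPE: {smape(actual, forecast):.4f}")
print(f"WAPE:  {wape(actual, forecast):.4f}")
```

On this toy data, the squared error of 4 on the last point dominates the MSE, while SMAPE penalizes the error on the small first value most heavily. This illustrates why the metrics can rank models differently.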

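We will see two different ways of calculating the metrics: metric functions included in the modeling libraries, and a separate, dedicated metrics library.

Metric functions in the modeling libraries

In this approach, we use the metric functions already included in the modeling libraries. We will use the sklearn and pmdarima libraries for the metric calculations, as shown in the following code: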

from sklearn.metrics import mean_squared_error
from pmdarima.metrics import smape
# Calculate and print the mean squared error of the forecasts
print(f"Mean squared error: {mean_squared_error(test['Global_active_power'], forecasts)}")
# Calculate and print the Symmetric Mean Absolute Percentage Error 
# (SMAPE)
print(f"SMAPE: {smape(test['Global_active_power'], forecasts)}")

This gives the following results:

Mean squared error: 0.6131968222566936
SMAPE: 43.775868579535334

Separate metrics library

In this second approach to calculating metrics, we use the SeqMetrics library, as shown in the following code:

from SeqMetrics import RegressionMetrics, plot_metrics
# Initialize the RegressionMetrics object with actual and
# predicted values
er = RegressionMetrics(
    test['Global_active_power'], forecasts)
# Calculate all available regression metrics
metrics = er.calculate_all()
# Plot the calculated metrics using a color scheme
plot_metrics(metrics, color="Blues")
# Display the Symmetric Mean Absolute Percentage Error (SMAPE)
print(f"Test SMAPE: {metrics['smape']}")
# Display the Weighted Absolute Percentage Error (WAPE)
print(f"Test WAPE: {metrics['wape']}")

This gives the following results:

Test SMAPE: 43.775868579535334
Test WAPE: 0.4202224470299464

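The library also provides a visualization of all the calculated metrics, as shown in Figures 7.7 and 7.8.

Figure 7.7: SeqMetrics display of WAPE

Figure 7.8: SeqMetrics display of SMAPE

Having trained and tested our first model, we can move on to the next one, a classical ML model.

Classical ML model – LightGBM

The second model we will cover is Light Gradient Boosting Machine (LightGBM), a free and open source gradient boosting model. It is based on tree learning algorithms and is designed to be efficient and distributed.

The code for this section is in ts-spark_ch7_1e_lgbm_comm.dbc. We import the code into Databricks Community Edition, following the approach explained in Chapter 1.

The URL of the code is as follows: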

github.com/PacktPublishing/Time-Series-Analysis-with-Spark/raw/main/ch7/ts-spark_ch7_1e_lgbm_comm.dbc

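Development and tuning

For model development, we use the following code to separate the last 48 hours of the dataset from the training data, for later testing. The rest is used for training: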

# Split the data into training and testing sets
# The last 48 observations are used for testing, the rest for training
train = data_hr[:-48]
test = data_hr[-48:]

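We will use the GridSearchCV method to find the best parameters for the LGBMRegressor model. TimeSeriesSplit is used to split the training dataset for cross-validation in a way that respects the temporal order of the time series: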

import lightgbm as lgb
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit
# Define the parameter grid for LightGBM
param_grid = {
    'num_leaves': [30, 50, 100],
    'learning_rate': [0.1, 0.01, 0.001],
    'n_estimators': [50, 100, 200]
}
# Initialize LightGBM regressor
lgbm = lgb.LGBMRegressor()
# Setup TimeSeriesSplit for cross-validation
tscv = TimeSeriesSplit(n_splits=10)
# Configure and run GridSearchCV
gsearch = GridSearchCV(
    estimator=lgbm,
    param_grid=param_grid,
    cv=tscv
)
gsearch.fit(X_train, y_train)
# Output the best parameters from Grid Search
print(f"Best Parameters: {gsearch.best_params_}")

We find the following best parameters:

Best Parameters: {'learning_rate': 0.1, 'n_estimators': 50, 'num_leaves': 30}

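Based on the training dataset, this is the best set of parameters to use with the LightGBM model to forecast the energy consumption of this household. We can then train the final model with these parameters:

# Best parameters found by the grid search
best_params = gsearch.best_params_
final_model = lgb.LGBMRegressor(**best_params)
final_model.fit(X_train, y_train)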
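
To make the time-ordered cross-validation used during tuning more concrete, the following sketch generates expanding-window splits in which the training indices always precede the test indices. The `expanding_window_splits` helper is a simplified illustration written for this example. It follows the general idea behind TimeSeriesSplit and should not be read as scikit-learn's implementation.

```python
def expanding_window_splits(n_samples, n_splits):
    """Yield (train_indices, test_indices) pairs in temporal order.

    Assumes n_samples is larger than n_splits so that each fold is non-empty.
    """
    test_size = n_samples // (n_splits + 1)
    first_test_start = n_samples - n_splits * test_size
    for test_start in range(first_test_start, n_samples, test_size):
        train_idx = list(range(test_start))
        test_idx = list(range(test_start, test_start + test_size))
        yield train_idx, test_idx


splits = list(expanding_window_splits(n_samples=12, n_splits=3))
for train_idx, test_idx in splits:
    print(f"train: {train_idx}  test: {test_idx}")
```

Unlike shuffled k-fold cross-validation, no fold is trained on observations that come after the ones it is validated on, which avoids leaking future information into the tuning step.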

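With the LightGBM model trained, we will next test the model's forecasts.

Testing and forecasting

We use the model to forecast the test dataset with the predict function. Note that in this case we are not using the iterative multi-step forecasting code. Instead, lagged values are used as input features to the model: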

# Predict on the test set
y_pred = final_model.predict(X_test)

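We can then plot the forecast against the actual values, as in Figure 7.9 and Figure 7.10.

Figure 7.9: LightGBM forecast versus actuals (training and testing)

In Figure 7.10, we zoom in on the testing period to compare the forecast with the actual values visually.

Figure 7.10: LightGBM forecast versus actuals (zoomed in on test data)

Based on the forecast and actual values, we can calculate the SMAPE and WAPE, obtaining the following values: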

Test SMAPE: 41.457989848314384
Test WAPE: 0.38978585281926825

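Now that we have trained and tested a statistical model and a classical ML model, we can move on to the third type of model, a DL model.

Deep learning model – NeuralProphet

The third model we will cover is NeuralProphet, a free and open source DL model inspired by Prophet, which we used in previous chapters, and AR-Net. NeuralProphet is built on PyTorch.

The code for this section is in ts-spark_ch7_1e_nprophet_comm.dbc. We import the code into Databricks Community Edition, following the approach explained in Chapter 1.

The URL of the code is as follows: github.com/PacktPublishing/Time-Series-Analysis-with-Spark/raw/main/ch7/ts-spark_ch7_1e_nprophet_comm.dbc

Note

The notebook for this example requires Databricks compute with DBR 13.3 LTS ML.

Development

We instantiate a NeuralProphet model, specifying with the n_lags parameter that we want to use the past 24 hours for forecasting. We then train the model (fit method) on the training dataset: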

from neuralprophet import NeuralProphet
# Initialize the NeuralProphet model with 24 lags and
# 5%/95% quantiles, then fit it on the training data
model = NeuralProphet(n_lags=24, quantiles=[0.05, 0.95])
metrics = model.fit(train_df)

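These two lines of code are all it takes to train the model. Next, we will test the model's forecasts.

Testing and forecasting

Before using the model to forecast the test dataset, we need to prepare the data for NeuralProphet, similar to what we did previously for Prophet. The required format has a ds column for the date/time and a y column for the forecast target. We can then use the predict method. Note that in this case we are not using the iterative multi-step forecasting code. With a lag of 24 specified as a parameter in the previous code, NeuralProphet uses a sliding window of the past 24 values to predict the next value: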

# Convert the DataFrame index to datetime,
# removing timezone information
test_df['ds'] = test_df.index.to_pydatetime()
test_df['ds'] = test_df['ds'].apply(
    lambda x: x.replace(tzinfo=None))
# Rename the target variable for Prophet compatibility
test_df = test_df.rename(columns={'Global_active_power': 'y'})
# Use the trained model to make predictions on the test set
predictions_48h = model.predict(test_df)

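We plot the forecast against the actual values in Figure 7.11 and Figure 7.12.

Figure 7.11: NeuralProphet forecast versus actuals (training and testing)

In Figure 7.12, we zoom in on the testing period to compare the forecast with the actual values visually.

Figure 7.12: NeuralProphet forecast versus actuals (zoomed in on test data)

Based on the forecast and actual values, we can calculate the SMAPE and WAPE and obtain the following values to measure the accuracy of the model: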

Test SMAPE: 41.193985580947896
Test WAPE: 0.35355667972102317

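We will use these metrics to compare the different models used in this chapter in the Model comparison section later.

At this point, we have trained and tested one model of each type: statistical, classical ML, and DL. Examples of some other commonly used time series models are provided in the book's GitHub repository:

Prophet: ts-spark_ch7_1e_prophet_comm.dbc
LSTM: ts-spark_ch7_1e_lstm_comm1-cpu.dbc
NBEATS and NHITS: ts-spark_ch7_1e_nbeats-nhits_comm.dbc

We encourage you to explore these further.

Having a model that works is important, but it is not enough. We also need to be able to explain the model we use. We will cover this next.

Explainability

Explainability is a key requirement in many cases, for example in finance and regulated industries. We will use a widely used method, SHapley Additive exPlanations (SHAP), to explain how the different features of the dataset influence the forecast.

We will use the TreeExplainer function from the shap library, applied to the final model from the Classical ML model – LightGBM section, to calculate the SHAP values, which tell us the impact of each feature on the model output.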

import shap
# Initialize a SHAP TreeExplainer with the trained model
explainer = shap.TreeExplainer(final_model)
# Select features for SHAP analysis
X = data_hr[[
    'Global_active_power_lag1', 'Global_active_power_lag2',
    'Global_active_power_lag3', 'Global_active_power_lag4',
    'Global_active_power_lag5', 'Global_active_power_lag12',
    'Global_active_power_lag24', 'Global_active_power_lag24x7'
]]
# Compute SHAP values for the selected features
shap_values = explainer(X)
# Generate and display a summary plot of the SHAP values
shap.summary_plot(shap_values, X)

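We can then plot the feature importance, as in Figure 7.13. As we expected from the data exploration earlier, lag 1 and lag 24 are the features that contribute most to the forecast.

Figure 7.13: SHAP – feature importance

We can analyze further by focusing on a specific forecast value with the following code, where we want to explain the first forecast value: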

# Plot a SHAP waterfall plot for the first observation's SHAP values
# to visualize the contribution of each feature
shap.plots.waterfall(shap_values[0])

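We can see the relative contributions of the features in Figure 7.14, which again shows the dominance of lag 1 and lag 24, with a smaller contribution from lag 12. This is consistent with our analysis in the Data exploration section, where we identified the importance of these lags in forecasting household energy consumption.

Figure 7.14: SHAP – feature importance (first observation)

Model comparison

Before concluding this chapter, we will compare all the models we tested, based on the metrics we measured and the code execution time. The results are shown in Table 7.1.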

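| Model | Type | SMAPE | WAPE | Training | Tuning | Testing | Total (incl. data preprocessing) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| NeuralProphet | DL/hybrid | 41.19 | 0.35 | 60 s | - | 1 s | 90 s |
| LightGBM | Classical ML | 41.46 | 0.39 | 60 s | included | included | 137 s |
| SARIMA | Statistical | 43.78 | 0.42 | included | 420 s | 180 s | 662 s |
| Prophet | Statistical/hybrid | 47.60 | 0.41 | 2 s | - | 1 s | 70 s |
| NHITS | DL | 54.43 | 0.47 | 35 s | - | included | 433 s |
| NBEATS | DL | 54.91 | 0.48 | 35 s | - | included | 433 s |
| LSTM | DL | 55.08 | 0.48 | 722 s | - | 4 s | 794 s |

Table 7.1: Comparison of model results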

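Here are some observations on model accuracy:

NeuralProphet and LightGBM give the best forecasting accuracy on both the SMAPE and WAPE metrics. SARIMA does not do too badly either.
The DL models NBEATS, NHITS, and LSTM, used as single-input models, have poorer forecasting accuracy. We recommend exploring further how multiple inputs could improve their performance.

The following relates to execution time:

In all cases, we stayed within the limit of 900 seconds (15 minutes) of total execution time, with 2 CPU cores on a single-node Databricks Community Edition cluster. This is workable for a 25 MB dataset. We will see in Chapter 8 how to scale for larger datasets.
Prophet, NBEATS, and NHITS have the best execution times, followed closely by NeuralProphet and LightGBM, with training, tuning, and testing still taking about 1 minute.
Even though we limited the dataset to the last 300 observations, the execution time of SARIMA is relatively high. This is due to the Auto ARIMA algorithm searching for the best hyperparameters and to the execution of the iterative multi-step forecasting code.
LSTM has the longest execution time, which can be explained by the use of CPUs instead of GPUs, which are much faster for DL.

The overall conclusion from this model comparison is that NeuralProphet and LightGBM are the best choices for the dataset we used, requiring little or no tuning and meeting the compute and execution time constraints we set.

Summary

In this chapter, we focused on the core topic of this book, the development of time series analysis models, and forecasting models in particular. We started by reviewing the different types of models and then covered the key criteria for choosing the right model. In the second part of the chapter, we practiced developing and testing several models and compared them on accuracy and execution time.

In the next chapter, we will build on an area of strength of Apache Spark: scaling time series analysis to big data.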


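Part 3: Scaling to Production and Beyond

In this final part, we will discuss the considerations and practical examples involved in scaling the solutions covered in Part 2 and taking them to production. We will then conclude the book with the use of Databricks and generative AI as part of the solution, and with how to go further with Apache Spark and time series analysis.

This part has the following chapters:

Chapter 8, Going at Scale
Chapter 9, Going to Production
Chapter 10, Going Further with Apache Spark
Chapter 11, Recent Developments in Time Series Analysis

Chapter 8: Going at Scale

Having built and tested models in the previous chapter, we will now discuss the requirements and considerations for scaling time series analysis in large, distributed computing environments. We will discuss how Apache Spark can scale the examples from Chapter 7, starting with feature engineering, followed by hyperparameter tuning and single- and multi-model training. This is critical when we need to analyze large volumes of time series data under time pressure.

In this chapter, we will cover the following main topics:

Why do we need to scale time series analysis?
Scaling feature engineering
Scaling model training

Technical requirements

Before getting into the main topics, we will cover the technical requirements for this chapter, which are as follows:

The ch8 folder of the book's GitHub repository, at the following URL: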

github.com/PacktPublishing/Time-Series-Analysis-with-Spark/tree/main/ch8
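Synthetic data: We will use Synthetic Data Vault (SDV), a Python library for generating synthetic tabular data. You can find more information about SDV here: docs.sdv.dev/sdv.
Databricks platform: While Databricks Community Edition is free, its resources are limited. Similarly, resources may be limited on a personal computer or laptop. As this chapter needs to demonstrate scaling of compute, we will use a non-Community edition of Databricks. As discussed in Chapter 1, you can sign up for a 14-day free trial of Databricks, which requires you to first have an account with a cloud provider. Some cloud providers offer free credits at the start. This gives you access to more resources than the Community Edition, for a limited time. Note that once the trial ends, charges will go to the credit card you provided at sign-up.

The Databricks compute configuration used is shown in Figure 8.1. The worker and driver node types shown here are based on AWS and differ from those available on Azure and GCP. Note that the UI may change, in which case refer to the latest Databricks documentation: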

docs.databricks.com/en/compute/configure.html

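Figure 8.1: Databricks compute configuration

Why do we need to scale time series analysis?

The need to scale time series analysis usually comes from having to run the analysis faster or on larger datasets. In this chapter, we will look at how to reduce the processing time achieved in Chapter 7 while making the dataset five times larger. This is made possible by the processing power provided by Apache Spark.

Scaled-up dataset

To test Spark's scalability, we need a larger dataset than the one we have used so far. While you may already have such a dataset, for the purposes of this chapter we will scale up the household energy consumption dataset used in Chapter 7 and earlier chapters. The scaled-up dataset will be generated with the Synthetic Data Vault tool, as mentioned in the Technical requirements section.

The code for this section is in ts-spark_ch8_1.dbc. We import the code into Databricks in a similar way to that explained for the Community Edition in the Hands-on: Loading and visualizing time series section of Chapter 1.

In this code, we want to use the data of one household to generate energy consumption data for four other households, making the data five times larger.

We first capture the metadata of pdf_main, the smaller reference dataset. The metadata is used as input to create a GaussianCopulaSynthesizer object named synthesizer, which represents a statistical model of the data. The synthesizer is then trained on the reference dataset (pdf_main) with the fit method. Finally, this model is used to generate synthetic data with the sample method.

The smaller reference dataset (pdf_main) is associated with customer identifier (cust_id) 1, and the synthetic datasets with identifiers 2, 3, 4, and 5: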

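from sdv.metadata import SingleTableMetadata
from sdv.single_table import GaussianCopulaSynthesizer
from pyspark.sql import functions as F
# Initialize metadata object for the dataset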
metadata = SingleTableMetadata()
# Automatically detect and set the metadata from the Pandas DataFrame
metadata.detect_from_dataframe(pdf_main)
# Initialize the Gaussian Copula Synthesizer with the dataset metadata
synthesizer = GaussianCopulaSynthesizer(metadata)
# Fit the synthesizer model to the Pandas DataFrame
synthesizer.fit(pdf_main)
…
# Define the number of customer datasets to generate:
num_customers = 5
# Count the number of rows in the original dataset:
sample_size = df_main.count()
i = 1
df_all = df_main.withColumn(
    'cust_id', F.lit(i)
) # Add a 'cust_id' column to the original dataset with a constant 
# value of 1
…
    synthetic_data = spark.createDataFrame(
        synthesizer.sample(num_rows=sample_size)
) # Generate synthetic data matching the original dataset's size
…

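Running the code from Chapter 7 on this new, larger dataset would not perform efficiently. We can either scale up or scale out the time series analysis for the larger dataset. We will explain these two types of scaling next.

Scaling up

Scaling up is the simpler way to scale and does not require us to change the code written in Chapter 7. We improve performance by adding more memory (RAM) and using a more powerful CPU, or even a GPU. This works up to a point, until we hit the scaling limit, prohibitive cost, or diminishing returns. In practice, because of system bottlenecks and overheads, scaling up does not lead to a linear improvement in performance.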
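
One common way to reason about these diminishing returns is Amdahl's law, which is brought in here as an illustration and is not part of the book's code. In this sketch, the 90% parallel fraction is an assumed figure, and `amdahl_speedup` is a hypothetical helper.

```python
def amdahl_speedup(parallel_fraction, n_units):
    """Theoretical speedup when only part of the work benefits from more resources."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_units)


# Assumption for illustration: 90% of the workload can use the extra CPU cores
for cores in (1, 2, 4, 8, 16):
    print(f"{cores:>2} cores -> speedup x{amdahl_speedup(0.9, cores):.2f}")
```

Under this assumption, doubling the cores from 8 to 16 raises the speedup by well under a factor of two, and the speedup can never exceed 10 however many cores are added, because the serial 10% of the work remains.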

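To scale further, we need to scale out.

Scaling out

Rather than making our single machine more powerful, scaling out involves adding more machines that work in parallel. This requires a mechanism for the code to be distributed and executed in parallel, which is what Apache Spark provides.

In the following sections, we will cover the different ways in which Apache Spark can be used to scale time series analysis:

Feature engineering
Model training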

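Feature engineering

Apache Spark can scale feature engineering with its distributed computing framework. This allows feature engineering tasks to be processed in parallel, as we will show in this section.

We will continue the discussion of data preparation from Chapter 5 and improve the feature engineering from Chapter 7. We will base the discussion in this section on the pandas-based code examples from the Development and testing section of Chapter 7. The following examples show how to rewrite non-Spark code as Spark-aware code, so that it gains the benefits of scalability.

While Spark can be used for feature engineering in many ways, we will focus on the following three, which are relevant to improving the code from Chapter 7:

Column transformation
Resampling
Lag values calculation

We will start with column transformation next.

Column transformation

In this first code example, we will rewrite the column transformation code that exists in ts-spark_ch7_1e_lgbm_comm.dbc, used in the Development and testing section of Chapter 7. We will change the code into a Spark-aware version by using the pyspark.sql.functions library. To do this, we need to do the following:

Replace the Date column with the combination of the existing Date and Time columns, using the concat_ws function.
Convert the Date column to timestamp format (to_timestamp function).
Selectively (with the when and otherwise conditions) replace the erroneous value ? in Global_active_power with None.
Replace , with . in Global_active_power, using the regexp_replace function, to get the right format for a float value.

The following code example demonstrates the preceding steps: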
以下代码示例演示了前述步骤：

from pyspark.sql import functions as F
# Combine 'Date' and 'Time' into a single 'Date' column of timestamp 
# type
df_all = df_all.withColumn(
    'Date',
    F.to_timestamp(
        F.concat_ws(' ', F.col('Date'), F.col('Time')),
        'd/M/yyyy HH:mm:ss')
)
…
# Select only the 'cust_id', 'Date' and 'Global_active_power' columns
df_all = df_all.select(
    'cust_id', 'Date', 'Global_active_power'
)
# Replace '?' with None and convert 'Global_active_power' to float
df_all = df_all.withColumn(
    'Global_active_power',
    F.when(F.col('Global_active_power') == '?', None)
    .otherwise(F.regexp_replace(
        'Global_active_power', ',', '.').cast('float')
    )
)
# Sort the DataFrame based on 'cust_id' and 'Date'
df_all = df_all.orderBy('cust_id', 'Date')

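Having parallelized the column transformation with Spark, the next code improvement we will cover is the resampling of the time series data.

Resampling

In this second code conversion example, we will rewrite the hourly resampling code in ts-spark_ch7_1e_lgbm_comm.dbc, used in the Development and testing section of Chapter 7. We want to calculate the hourly mean of Global_active_power for each customer. To do this, we need to do the following:

Convert the Date column to its date and hour components, using the date_format function.
For each customer (groupBy function), resample Global_active_power to its hourly mean (using the agg and mean functions).

The following code demonstrates the preceding steps: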

from pyspark.sql import functions as F
# Convert the 'Date' column to a string representing the
# start of the hour for each timestamp
data_hr = df_all.withColumn(
    'Date',
    F.date_format('Date', 'yyyy-MM-dd HH:00:00'))
# Group the data by 'cust_id' and the hourly 'Date',
# then calculate the mean 'Global_active_power' for each group
data_hr = data_hr.groupBy(
    'cust_id', 'Date').agg(
    F.mean('Global_active_power').alias('Global_active_power')
)

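Now that we have parallelized the resampling with Spark, the next code improvement we will cover is the calculation of lag values of the time series data.

Calculating lag values

In this third example of scaling feature engineering with Apache Spark, we will rewrite the lag calculation code that exists in ts-spark_ch7_1e_lgbm_comm.dbc, used in the Development and testing section of Chapter 7. We want to calculate different lag values for each customer. To do this, we need to do the following:

Define a sliding date window over which to calculate the lag values for each customer (partitionBy function), ordered by date within each customer (orderBy function).
Calculate the different lag values over the sliding window (lag and over functions).
Note that as the lag calculation is based on previous values, some lag values at the beginning of the dataset do not have enough previous values to be calculated and are therefore null. We remove the rows with null lag values using the dropna function.

The following code demonstrates the preceding steps: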

from pyspark.sql.window import Window
from pyspark.sql import functions as F
# Define a window specification partitioned by -
# 'cust_id' and ordered by the 'Date' column
windowSpec = Window.partitionBy("cust_id").orderBy("Date")
# Add lagged features to the DataFrame to incorporate
#  past values as features for forecasting
# Apply the lag function to create the lagged column,
#  separately for each 'cust_id'
# Lag by 1, 2, 3, 4, 5, 12, 24, 168 hours (24 hours * 7 days)
lags = [1, 2, 3, 4, 5, 12, 24, 24*7]
for l in lags:
    data_hr = data_hr.withColumn(
        'Global_active_power_lag' + str(l),
        F.lag(F.col('Global_active_power'), l).over(windowSpec))
# Remove rows with NaN values that were introduced by
#  shifting (lagging) operations
data_hr = data_hr.dropna()
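
To make the window semantics concrete without a Spark cluster, the following pure-Python sketch mimics what partitioning by customer and ordering by date does for a lag of 1. The rows, the `value` column, and the `add_lag` helper are made up for illustration. In Spark, the same logic runs in parallel across partitions.

```python
from collections import defaultdict


def add_lag(rows, lag):
    """Add a lagged value per customer, ordered by date; None where no history exists."""
    by_customer = defaultdict(list)
    for row in rows:
        by_customer[row["cust_id"]].append(row)

    result = []
    for cust_id in sorted(by_customer):
        ordered = sorted(by_customer[cust_id], key=lambda r: r["Date"])
        for i, row in enumerate(ordered):
            # Look back only within the same customer's history
            lagged = ordered[i - lag]["value"] if i >= lag else None
            result.append({**row, f"value_lag{lag}": lagged})
    return result


# Hypothetical hourly rows for two customers, deliberately out of order
rows = [
    {"cust_id": 2, "Date": "2024-01-01 01:00:00", "value": 20.0},
    {"cust_id": 1, "Date": "2024-01-01 00:00:00", "value": 1.0},
    {"cust_id": 1, "Date": "2024-01-01 02:00:00", "value": 3.0},
    {"cust_id": 2, "Date": "2024-01-01 00:00:00", "value": 10.0},
    {"cust_id": 1, "Date": "2024-01-01 01:00:00", "value": 2.0},
    {"cust_id": 2, "Date": "2024-01-01 02:00:00", "value": 30.0},
]

with_lag = add_lag(rows, lag=1)
# Equivalent of dropna(): keep only rows that have a lagged value
clean = [row for row in with_lag if row["value_lag1"] is not None]
print(len(with_lag), len(clean))
```

The first row of each customer has no previous value, so its lag is empty and the row is dropped. Partitioning ensures that customer 2 never inherits a lagged value from customer 1.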

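By using Spark functions instead of pandas, we enable Spark to parallelize the lag calculation for large datasets.

Now that we have covered different ways to use Apache Spark to improve the feature engineering code from Chapter 7, we will dive into scaling model training.

Model training

In this section, we will cover the following ways in which Apache Spark can be used to scale model training:

Hyperparameter tuning
Single model in parallel
Multiple models in parallel

These approaches make model training efficient when we have large datasets or need to train many models.

Hyperparameter tuning can be computationally expensive, as the same model is trained repeatedly with different hyperparameters. We want to be able to use Spark to find the best hyperparameters efficiently.

Similarly, training a single model can take a long time with a large dataset. In other cases, we may need to train many models, each fitted to a different time series dataset. We want to speed these up by parallelizing the training on a Spark cluster.

We will cover these approaches in detail in the following sections, starting with hyperparameter tuning.

Hyperparameter tuning

As explained in Chapter 4, hyperparameter tuning in ML is the process of finding the best set of configurations for an ML algorithm. This search for the best hyperparameters can be parallelized with libraries such as GridSearchCV, Hyperopt, and Optuna, which, combined with Apache Spark, provide a framework with a backend for processing parallelism.

We discussed Spark's processing parallelism in Chapter 3. Here, we will focus more specifically on using Optuna with Apache Spark for hyperparameter tuning.

If you recall, in Chapter 7 we tuned the hyperparameters of the LightGBM model with GridSearchCV on a single node. In the code example in this section, we will improve on this by parallelizing the process. We will use Optuna with Spark to find the best hyperparameters for the LightGBM model that we explored in Chapter 7.

Optuna is an open source hyperparameter optimization framework that automates the hyperparameter search. You can find more information about Optuna here: optuna.org/.

We start the tuning process by defining an objective function, which we will later optimize with Optuna. This objective function does the following:

Defines the search space for the hyperparameter values in params.
Initializes the LightGBM LGBMRegressor model with the trial-specific parameters.
Trains (fit) the model on the training dataset.
Uses the model to predict on the validation dataset.
Calculates the evaluation metric for the model (mean_absolute_percentage_error).
Returns the evaluation metric.

The following code shows the preceding steps: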

import lightgbm as lgb
from sklearn.metrics import mean_absolute_percentage_error
import optuna
def objective(trial):
    # Define the hyperparameter configuration space
    params = {
        # Specify the learning task and
        #  the corresponding learning objective:
        "objective": "regression",
        # Evaluation metric for the model performance:
        "metric": "rmse",
        # Number of boosted trees to fit:
        "n_estimators": trial.suggest_int("n_estimators", 50, 200),
        # Learning rate for gradient descent:
        "learning_rate": trial.suggest_float(
            "learning_rate", 0.001, 0.1, log=True),
        # Maximum tree leaves for base learners:
        "num_leaves": trial.suggest_int("num_leaves", 30, 100),
    }
    # Initialize the LightGBM model with the trial's parameters:
    model = lgb.LGBMRegressor(**params)
    # Train the model with the training dataset:
    model.fit(X_train, y_train)
    # Generate predictions for the validation dataset:
    y_pred = model.predict(X_test)
    # Calculate the Mean Absolute Percentage Error (MAPE)
    #  for model evaluation:
    mape = mean_absolute_percentage_error(y_test, y_pred)
    # Return the MAPE as the objective to minimize
    return mape

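Once the objective function is defined, the next steps are as follows:

1.  Register Spark (the `register_spark` function) as the backend.
2.  Create a study (the `create_study` function), which is a collection of trials, to minimize the evaluation metric.
3.  Run the study on the Spark `parallel_backend` to optimize the `objective` function over `n_trials`.

The following code shows the preceding steps: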

import joblib
from joblibspark import register_spark
# This line registers Apache Spark as the backend for
# parallel computing with Joblib, enabling distributed
# computing capabilities for Joblib-based parallel tasks.
register_spark()
…
# Create a new study object with the goal of minimizing the
# objective function
study2 = optuna.create_study(direction='minimize')
# Set Apache Spark as the backend for parallel execution of
# trials with unlimited jobs
with joblib.parallel_backend("spark", n_jobs=-1):
    # Optimize the study by evaluating the
    #  objective function over 10 trials:
    study2.optimize(objective, n_trials=10)

We can then get the objective value (`trial.value`) and parameters (`trial.params`) for `best_trial`:

# Retrieve the best trial from the optimization study
trial = study2.best_trial
# Print the best trial's objective function value,
#  typically accuracy or loss (here, the MAPE; lower is better)
print(f"Best trial accuracy: {trial.value}")
print("Best trial params: ")
# Iterate through the best trial's hyperparameters and print them
for key, value in trial.params.items():
    print(f"    {key}: {value}")

The outcome of the hyperparameter tuning, shown in *Figure 8.2*, is the best hyperparameters found within the search space specified, as well as the related model accuracy.
![](https://github.com/OpenDocCN/freelearn-ds-pt3-zh/raw/master/docs/ts-anal-spk/img/B18568_08_2.jpg)

Figure 8.2: Hyperparameter tuning – best trials
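
To make the idea of parallel trials more concrete, here is a minimal, library-free sketch. It is not how Optuna or joblib-spark are implemented: `toy_objective` is a hypothetical stand-in for training the model and returning the MAPE, the candidates are drawn by plain random search (Optuna's samplers are typically smarter than this), and a thread pool stands in for the Spark executors.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def sample_params(rng):
    # Hypothetical search space mirroring the one in the objective function
    return {
        "n_estimators": rng.randint(50, 200),
        "learning_rate": 10 ** rng.uniform(-3, -1),  # log-uniform, 0.001 to 0.1
        "num_leaves": rng.randint(30, 100),
    }


def toy_objective(params):
    # Stand-in for "train the model and return the MAPE" (assumption):
    # smallest when learning_rate is 0.01 and num_leaves is 60
    return (abs(params["learning_rate"] - 0.01) * 10
            + abs(params["num_leaves"] - 60) / 100)


def run_trials(n_trials=10, max_workers=4, seed=12):
    rng = random.Random(seed)
    # Trials are independent of each other, so they can run concurrently
    candidates = [sample_params(rng) for _ in range(n_trials)]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        values = list(pool.map(toy_objective, candidates))
    # Keep the trial with the lowest objective value
    best_value, best_params = min(
        zip(values, candidates), key=lambda pair: pair[0])
    return best_params, best_value, values


best_params, best_value, values = run_trials()
print(f"Best value: {best_value:.4f}")
print(f"Best params: {best_params}")
```

The point of the sketch is only that each trial needs nothing from the other trials, which is what makes it possible to hand them out to separate workers.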
In addition to the scaling of the hyperparameter tuning stage, which we have seen in this section, Spark clusters can also be used to parallelize the next step, that is, fitting the model to the training data. We will cover this next.

Single model in parallel

Ensemble methods such as Random Forest and gradient boosting machines can benefit from task parallelism during the model training stage. Each tree in a Random Forest can be trained independently, making it possible to parallelize across multiple processors. Similarly, in the case of gradient boosting models such as LightGBM and XGBoost, the construction of each tree can be parallelized, even though the boosting itself is sequential.

In the example in the *Classical machine learning model* section of *Chapter 7*, we used LightGBM. This model was not Spark enabled. Here, as we want to demonstrate training parallelism with a Spark-enabled gradient boosting model, we will use `SparkXGBRegressor` instead.

As a first step, we will build a vector of the features using `VectorAssembler`, as shown in the following code:

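from pyspark.ml.feature import VectorAssembler
# Define a list to hold the names of the lag feature columns
inputCols = []
# Loop through the list of lag intervals to create the feature
# column names
for l in lags:
    inputCols.append('Global_active_power_lag' + str(l))
# Initialize VectorAssembler with the created feature column
# names and specify the output column name
assembler = VectorAssembler(
    inputCols=inputCols, outputCol="features")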


 We then create the `SparkXGBRegressor` model object, setting `num_workers` to all available workers, and specifying the target column with `label_col`:

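from xgboost.spark import SparkXGBRegressor
# Initialize the SparkXGBRegressor for the regression task.
# `num_workers` is set to the default parallelism level of the
#  Spark context to utilize all available cores.
# `label_col` specifies the name of the target variable column
#  for prediction.
# `missing` is set to 0.0 to handle missing values in the dataset.
xgb_model = SparkXGBRegressor(
    num_workers=sc.defaultParallelism,
    label_col="Global_active_power", missing=0.0
)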


 As we have seen so far, hyperparameter tuning is an important step in finding the best model. In the following code example, we will use `ParamGridBuilder` to specify the range of parameters that are specific to the model and that we want to evaluate.
We then pass the parameters to `CrossValidator` together with `RegressionEvaluator`. We will use the root mean square error (`rmse`) as the evaluation metric. This is the default metric for `RegressionEvaluator`, making it suitable for our example here:

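from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.ml.evaluation import RegressionEvaluator
# Initialize the parameter grid for hyperparameter tuning
# - max_depth: specifies the maximum depth of the trees in the model
# - n_estimators: defines the number of trees in the model
paramGrid = ParamGridBuilder()\
    .addGrid(xgb_model.max_depth, [5, 10])\
    .addGrid(xgb_model.n_estimators, [30, 100])\
    .build()
# Initialize the regression evaluator for model evaluation
# - metricName: specifies the metric to use for evaluation,
#   here RMSE (root mean square error)
# - labelCol: the name of the label column
# - predictionCol: the name of the prediction column
evaluator = RegressionEvaluator(
    metricName="rmse",
    labelCol=xgb_model.getLabelCol(),
    predictionCol=xgb_model.getPredictionCol()
)
# Initialize the CrossValidator for hyperparameter tuning
# - estimator: the model to be tuned
# - evaluator: the evaluator to be used for model evaluation
# - estimatorParamMaps: the parameter grid to be used for tuning
cv = CrossValidator(
    estimator=xgb_model, evaluator=evaluator,
    estimatorParamMaps=paramGrid)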
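
To get a feel for how much work this grid implies, the following plain-Python sketch expands the same candidate values into their combinations. It only mimics what a parameter grid produces and does not call Spark. The number of folds is an assumption: it uses 3, which is the default `numFolds` of `CrossValidator` when none is specified, as in the code above.

```python
from itertools import product

# Same candidate values as in the ParamGridBuilder example
grid = {
    "max_depth": [5, 10],
    "n_estimators": [30, 100],
}

# Cartesian product of all candidate values, one dict per combination
param_maps = [
    dict(zip(grid.keys(), values)) for values in product(*grid.values())
]

num_folds = 3  # assumption: CrossValidator's default numFolds
# One fit per (combination, fold), plus one final refit of the best
# combination on the full training data
total_fits = len(param_maps) * num_folds + 1

for param_map in param_maps:
    print(param_map)
print(f"Total model fits: {total_fits}")
```

Each extra hyperparameter multiplies the number of combinations, so the number of fits grows quickly, which is one reason to distribute the individual fits across a cluster.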


 At this point, we are ready to build a pipeline (`Pipeline`) to train (`fit`) the model.  We will do this by combining in sequence the `VectorAssembler` (`assembler`) and `CrossValidator` (`cv`) stages:

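from pyspark.ml import Pipeline
# Initialize a pipeline object with two stages:
# a feature assembler and a cross-validator for model tuning
pipeline = Pipeline(stages=[assembler, cv])

We filter (the `filter` function) the training data for `cust_id` 1. We then use all the records (the `head` function) for training, except for the last 48 hours, which will be used for testing. The resulting `train_hr` DataFrame has the hourly training data: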

# Filter the dataset for customer with cust_id equal to 1
train_hr = data_hr.filter('cust_id == 1')
# Create a Spark DataFrame excluding the last 48 records for training
train_hr = spark.createDataFrame(
    train_hr.head(train_hr.count() - 48)
)
# Fit the pipeline model to the training data
pipelineModel = pipeline.fit(train_hr)

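Similarly, for testing, we filter for `cust_id` 1 and, in this case, use the last 48 hours of data. We can then apply the model (`pipelineModel`) to the test data (`test_hr`) to get the energy consumption forecast for these 48 hours: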

# Filter the dataset for customer with cust_id equal to 1 for testing
test_hr = data_hr.filter('cust_id == 1')
# Create a Spark DataFrame including the last 48 records for testing
test_hr = spark.createDataFrame(test_hr.tail(48))
…
# Apply the trained pipeline model to the test data to generate
# predictions
predictions = pipelineModel.transform(test_hr)

We then use `RegressionEvaluator` (the `evaluator` object) to calculate (the `evaluate` function) the RMSE:

# Evaluate the model's performance using the evaluator with the
# root mean square error (RMSE) metric
rmse = evaluator.evaluate(predictions)


For comparison, we also calculate the **Symmetric Mean Absolute Percentage Error** (**SMAPE**) and **Weighted Average Percentage Error** (**WAPE**) similarly to how we have done in the *Classical machine learning model* section of *Chapter 7*. The results are shown in *Figure 8.3*.
![](https://github.com/OpenDocCN/freelearn-ds-pt3-zh/raw/master/docs/ts-anal-spk/img/B18568_08_3.jpg)

Figure 8.3: XGBoost evaluation metrics
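
As a reminder of what these two metrics measure, here is a small sketch using commonly used definitions. Several variants of SMAPE exist, so the exact formula used in *Chapter 7* may differ; the function names and the sample values are hypothetical and for illustration only.

```python
def smape(actuals, forecasts):
    # Symmetric MAPE: mean of |F - A| / ((|A| + |F|) / 2)
    terms = [
        abs(f - a) / ((abs(a) + abs(f)) / 2)
        for a, f in zip(actuals, forecasts)
        if abs(a) + abs(f) > 0  # skip points where both values are zero
    ]
    return sum(terms) / len(terms)


def wape(actuals, forecasts):
    # WAPE: total absolute error divided by total absolute actuals
    abs_error = sum(abs(f - a) for a, f in zip(actuals, forecasts))
    return abs_error / sum(abs(a) for a in actuals)


# Hypothetical hourly values, for illustration only
actuals = [1.0, 2.0, 4.0]
forecasts = [1.0, 3.0, 3.0]
print(f"SMAPE: {smape(actuals, forecasts):.4f}")
print(f"WAPE:  {wape(actuals, forecasts):.4f}")
```

Unlike MAPE, neither metric divides by an individual actual value alone, which makes them less sensitive to actuals that are close to zero.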
We plot the forecast against the actual values in *Figures 8.4* and *8.5*.
![](https://github.com/OpenDocCN/freelearn-ds-pt3-zh/raw/master/docs/ts-anal-spk/img/B18568_08_4.jpg)

Figure 8.4: XGBoost forecast versus actuals (training and testing)
We zoom in on the testing period in *Figure 8.5* for a visual comparison of the forecast and actuals.
![](https://github.com/OpenDocCN/freelearn-ds-pt3-zh/raw/master/docs/ts-anal-spk/img/B18568_08_5.jpg)

Figure 8.5: XGBoost forecast versus actuals (zoom on test data)
In this section, we have seen parallelism in single-model training. This requires the use of a library, such as XGBoost used here, which supports a multi-node processing backend such as Apache Spark. In addition to ensemble methods, other models, such as deep learning, can benefit from training parallelism.

Multiple models can also be trained in parallel, which we will explore next.

Multiple models in parallel

Earlier in this chapter, we scaled the dataset to represent the household energy consumption of multiple customers. In this section, we will train a different machine learning model for each customer in parallel. This is required if we want to predict the energy consumption of individual customers based on their own historical consumption. There are several other use cases where such multi-model training is required, for example, in the retail industry when doing sales forecasting for individual products or stores.

Coming back to our energy consumption example, the `train_model` function does the following for each customer:

1.  Get the customer ID (`cust_id`) from the pandas DataFrame passed as input.
2.  Choose the features (`X`) and target (`y`) variables.
3.  Split (`train_test_split`) the dataset into training and testing, specifying `shuffle` as `False` to preserve the time order. As discussed in *Chapter 1*, this is an important consideration for time-series datasets.
4.  Perform hyperparameter tuning with `GridSearchCV` using `LGBMRegressor` as the model and `TimeSeriesSplit` for the dataset splits.
5.  Train (`fit`) the final model with the best hyperparameters (`best_params`) on the full training dataset.
6.  Test the final model on the test dataset and calculate the evaluation metrics (`rmse` and `mape`).
7.  Return the result of `train_model` in a DataFrame with `cust_id`, `best_params`, `rmse`, and `mape`.

The following code shows the function definition with the preceding steps:

import numpy as np
import pandas as pd
import lightgbm as lgb
from sklearn.metrics import (
    mean_absolute_percentage_error, mean_squared_error)
from sklearn.model_selection import (
    GridSearchCV, TimeSeriesSplit, train_test_split)

def train_model(df_pandas: pd.DataFrame) -> pd.DataFrame:
    # Extract the ID of the customer for which the model is trained
    cust_id = df_pandas["cust_id"].iloc[0]
    # Select the features and target variable from the DataFrame
    X = df_pandas[[
        'Global_active_power_lag1', 'Global_active_power_lag2',
        'Global_active_power_lag3', 'Global_active_power_lag4',
        'Global_active_power_lag5', 'Global_active_power_lag12',
        'Global_active_power_lag24', 'Global_active_power_lag168'
    ]]
    y = df_pandas['Global_active_power']
    # Split the dataset into training and testing sets, preserving
    # the time order
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, shuffle=False, random_state=12
    )
    # Define the hyperparameter space for LightGBM model tuning
    param_grid = {
        'num_leaves': [30, 50, 100],
        'learning_rate': [0.1, 0.01, 0.001],
        'n_estimators': [50, 100, 200]
    }
    # Initialize the LightGBM regression model
    lgbm = lgb.LGBMRegressor()
    # Initialize TimeSeriesSplit for cross-validation that
    # respects the time series data structure
    tscv = TimeSeriesSplit(n_splits=10)
    # Perform the grid search with cross-validation
    gsearch = GridSearchCV(
        estimator=lgbm, param_grid=param_grid, cv=tscv)
    gsearch.fit(X_train, y_train)
    # Extract the best hyperparameters
    best_params = gsearch.best_params_
    # Train the final model using the best parameters
    final_model = lgb.LGBMRegressor(**best_params)
    final_model.fit(X_train, y_train)
    # Make predictions on the test set
    y_pred = final_model.predict(X_test)
    # Calculate the RMSE and MAPE metrics
    rmse = np.sqrt(mean_squared_error(y_test, y_pred))
    mape = mean_absolute_percentage_error(y_test, y_pred)
    # Prepare the result DataFrame to return
    return_df = pd.DataFrame(
        [[cust_id, str(best_params), rmse, mape]],
        columns=["cust_id", "best_params", "rmse", "mape"]
    )
    return return_df
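
The time-ordered cross-validation used in the function can be illustrated with a simplified, library-free sketch of an expanding-window split. The helper name `expanding_window_splits` is hypothetical, and the sketch only approximates the default behavior of scikit-learn's `TimeSeriesSplit`; it is meant to show why no fold ever trains on data that comes after its test block.

```python
def expanding_window_splits(n_samples, n_splits):
    # Each fold trains on everything before its test block, and the
    # test blocks are consecutive and never overlap
    test_size = n_samples // (n_splits + 1)
    first_test_start = n_samples - n_splits * test_size
    splits = []
    for i in range(n_splits):
        test_start = first_test_start + i * test_size
        train_idx = list(range(0, test_start))
        test_idx = list(range(test_start, test_start + test_size))
        splits.append((train_idx, test_idx))
    return splits


splits = expanding_window_splits(n_samples=12, n_splits=3)
for train_idx, test_idx in splits:
    print(f"train: {train_idx[0]}..{train_idx[-1]}  test: {test_idx}")
```

With 12 samples and 3 splits, the training window grows from 3 to 6 to 9 samples while the test block slides forward, so the time order is always respected.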


Now that the model training function is defined, we can launch it in parallel for each customer (the `groupBy` function), with `applyInPandas` passing a pandas DataFrame of all the rows for this specific customer to the `train_model` function.

pandas UDFs, mapInPandas, and applyInPandas

Using Spark-enabled libraries, as we did in the previous section with single-model parallel training, is usually faster for large datasets than single-machine libraries. There are, however, cases when we have to use a library that isn't implemented natively for Spark's parallel processing. In these situations, we can use pandas UDFs, `mapInPandas`, or `applyInPandas`. These methods allow you to call pandas operations in a distributed way from Spark. The common use cases are as follows:

- **pandas UDF**: One input row for one output row
- **mapInPandas**: One input row for multiple output rows
- **applyInPandas**: Multiple input rows for one output row

Note that these are general guidance and that there is great flexibility in how these methods can be used.

In the example in this section, we use `applyInPandas` as we want to execute a pandas-enabled function for all the rows in the dataset corresponding to a specific customer for model training. We want the function to output one row with the result of model training for the specific customer.
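
The "multiple input rows for one output row" pattern can be sketched without Spark as follows. This is only an analogy under stated assumptions: the rows are made up, `train_for_customer` is a hypothetical stand-in for `train_model`, and a thread pool plays the role of the Spark executors.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

# Hypothetical rows: (cust_id, hourly consumption)
rows = [
    (1, 2.0), (1, 4.0), (1, 6.0),
    (2, 1.0), (2, 3.0),
    (3, 5.0),
]


def train_for_customer(item):
    # Stand-in for train_model: all the rows of one customer go in,
    # a single summary row comes out
    cust_id, values = item
    return {"cust_id": cust_id, "n_rows": len(values),
            "mean": sum(values) / len(values)}


# Step 1: group the rows by customer (the role of groupBy)
groups = defaultdict(list)
for cust_id, value in rows:
    groups[cust_id].append(value)

# Step 2: apply the function to every group independently; a thread
# pool stands in for the Spark executors here
with ThreadPoolExecutor(max_workers=3) as pool:
    results = list(pool.map(train_for_customer, sorted(groups.items())))

for result in results:
    print(result)
```

Because each group is processed in isolation, the function can use any single-machine library, as long as the data for one group fits in the memory of one worker.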
Note how, in the following code extract, we specify the schema (`train_model_result_schema`) of the function's return value. This is a requirement for serializing the result that is added to the `train_model_result_df` Spark DataFrame:

from pyspark.sql.functions import lit
# Group the data by customer ID and apply the train_model
# function to each group using a pandas UDF.
# The schema of the resulting DataFrame is defined by
# train_model_result_schema.
# Cache the resulting DataFrame to optimize the performance of
# subsequent operations.
train_model_result_df = (
    data_hr
    .groupby("cust_id")
    .applyInPandas(train_model, schema=train_model_result_schema)
    .cache()
)


*Figure 8.6* shows the outcome of the multi-model training. It shows the best hyperparameters (the `best_params` column) and evaluation metrics (the `rmse` and `mape` columns) for each customer.
![](https://github.com/OpenDocCN/freelearn-ds-pt3-zh/raw/master/docs/ts-anal-spk/img/B18568_08_6.jpg)

Figure 8.6: Multi-model training – best hyperparameters and evaluation metrics
With this example, we have trained five different models representing different customers. We have found the best hyperparameters to use for each model, which we are then able to use to do individual energy consumption forecasting.

With this, we conclude the different ways in which we can leverage Apache Spark to scale time-series analysis. Next, we will discuss some of the ways that the training process can be optimized.

Training optimization

When training machine learning models at a large scale, several inefficiencies and overheads can impact resource utilization and performance. These include the following:

*   Idle time waiting for resources such as GPU, network, and storage accesses, which can delay the training process.
*   Frequent checkpointing, which saves the model during training to avoid restarting in case of failure. This results in additional storage and time during model training.
*   Hardware or software failures during the training result in restarts, which waste resources and delay the training.

The following mitigation techniques can be used, depending on the model being trained and the library in use:

*   Eliminate the cause of idle wait times by provisioning sufficient compute, network, and storage resources
*   Avoid too frequent checkpointing
*   Rearrange features based on correlation with the target variable or their importance to facilitate convergence during model training
*   Reduce the dimensionality of the dataset, choosing the most informative features

While the implementation details of these techniques are beyond our scope here, we recommend researching and addressing these points when operating at a large scale due to the potentially high impact on cost, efficiency, and scalability.

Summary

In this chapter, we saw the need to scale the processing capacity for bigger datasets. We examined different ways of using Apache Spark to this end. Building on and extending the code examples from *Chapter 7*, we focused on scaling the feature engineering and model training stages. We looked at leveraging Spark to scale transformations, aggregations, lag values calculation, hyperparameter tuning, and single- and multi-model training in parallel.

In the next chapter, we will cover the considerations for going to production with time-series analysis, using and extending what we have learned so far.

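Chapter 9: Going to Production

Having built and tested our time series analysis model and shown its scalability in the previous chapter, we will now explore the practical considerations and steps for deploying time series models into production using the Spark framework. This information is crucial to guide you through the transition from development to real-world implementation, ensuring the reliability and effectiveness of time series models in operational environments. Many machine learning projects stall in the development and proof-of-concept stages; mastering the details of deploying to production will enhance your ability to integrate time series analysis seamlessly into decision-making processes.

While in *Chapter 4* we introduced the overall end-to-end view of a time series analysis project, in this chapter, we will focus on the following main topics to help take a project into production:

- Workflows
- Monitoring and reporting
- Additional considerations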

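Technical requirements

In this chapter, we will go through code examples to explore how to deploy a scalable end-to-end workflow for time series analysis in a container-based environment. Building a production-ready environment takes a lot of work, far more than we can reasonably cover in this chapter. We will focus on providing an example as a starting point. We will see how what we have learned so far about time series analysis comes together into a complete end-to-end workflow.

The code for this chapter can be found at the following URL: github.com/PacktPublishing/Time-Series-Analysis-with-Spark/tree/main/ch9

Let's start by setting up the environment for the example.

Environment setup

We will use Docker containers for the platform infrastructure, as in *Chapter 3* and *Chapter 4*. Follow the instructions in the *Using a container for deployment* section of *Chapter 3* and the *Environment setup* section of *Chapter 4* to set up the container environment.

Once the environment is set up, download the deployment script from the Git repository for this chapter, at the following URL:

github.com/PacktPublishing/Time-Series-Analysis-with-Spark/tree/main/ch9

You can then start the container environment as per the *Environment startup* section of *Chapter 4*. Do a quick visual validation of the components as per the *Accessing the UIs* section of the same chapter.

Before getting into the code details, let's go through an overview of the workflows to get the big picture of what we will be building in this section.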

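Workflows

The code example in this chapter has two workflows. They are implemented as **directed acyclic graphs** (**DAGs**) in Airflow, similarly to *Chapter 4*. The best way to visualize the workflows is with the DAG view in Airflow, as shown in *Figures 9.1* and *9.2*.

The two workflows are as follows:

`ts-spark_ch9_data-ml-ops`: This is the example of the end-to-end process, shown in *Figure 9.1*, with the following tasks:
- `get_config`
- `ingest_train_data`
- `transform_train_data`
- `train_and_log_model`
- `forecast`
- `ingest_eval_data`
- `transform_eval_data`
- `eval_forecast`

Figure 9.1: Airflow DAG for the end-to-end workflow

`ts-spark_ch9_data-ml-ops_runall`: This is the second workflow, shown in *Figure 9.2*, which calls the previous workflow multiple times with different date ranges. It simulates what happens in real life, where the previous end-to-end workflow is launched at regular intervals, such as daily or weekly, with new data.

Figure 9.2: Airflow DAG with multiple calls to the end-to-end workflow

The code for these Airflow DAGs is in the `dags` folder. They are written in Python (`.py` files) and can be viewed with a text or code editor.

Modularization

Note that, in the example here, we could have combined all the code for these separate tasks into one big task. Instead, we split the workflow into multiple tasks to illustrate the best practice of modularization. In real-life situations, this facilitates independent code changes, scaling, and task reruns. Different teams may own different tasks.

Workflow separation

The workflow we demonstrate in this example can be split further in your own implementation. For instance, it is common to split the model training-related tasks, forecasting, and model evaluation into their own separate workflows, launched at different intervals.

We will explain each of the DAGs and related tasks in detail in the following sections, starting with `ts-spark_ch9_data-ml-ops_runall`.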

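Simulation and runs

As we can see in *Figure 9.2*, `ts-spark_ch9_data-ml-ops_runall` has five tasks, which we will explain further here.

The purpose of the `_runall` workflow is to simulate the real-life execution of the training, forecasting, and evaluation cycle at regular intervals. In our example, each task of the `_runall` workflow corresponds to one cycle of training, forecasting, and evaluation. We will refer to each task as a run, with a total of five runs corresponding to the five tasks of `_runall`. In a real-life scenario, these tasks would be scheduled at regular intervals, such as daily, weekly, or monthly. In the example here, we simply execute them sequentially, one after the other.

Each task calls the `ts-spark_ch9_data-ml-ops` workflow with different parameters. They are as follows:

- `runid`: An integer to identify the run
- `START_DATE`: The start date of the time series dataset used for training
- `TRAIN_END_DATE`: The end date of the time series in the training dataset
- `EVAL_END_DATE`: The end date of the time series in the evaluation dataset

The different runs are configured using a sliding window, with 5 years of training data and 1 year of evaluation data. In a real-life scenario, the evaluation date range is likely to be shorter, corresponding to a shorter forecasting horizon.

The runs are configured as follows: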

conf_run1 = {
    'runid':          1,
    'START_DATE':     '1981-01-01',
    'TRAIN_END_DATE': '1985-12-31',
    'EVAL_END_DATE':  '1986-12-31',
}
conf_run2 = {
    'runid':          2,
    'START_DATE':     '1982-01-01',
    'TRAIN_END_DATE': '1986-12-31',
    'EVAL_END_DATE':  '1987-12-31',
}
…
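
The sliding window can also be expressed programmatically. The sketch below uses a hypothetical helper, `make_run_conf`, that reproduces the two configurations shown above; the configurations it generates for runs 3 to 5 rest on the assumption that the elided ones follow the same one-year sliding pattern.

```python
from datetime import date


def make_run_conf(runid, first_start_year=1981, train_years=5, eval_years=1):
    # Slide the window forward by one year per run
    start_year = first_start_year + (runid - 1)
    train_end_year = start_year + train_years - 1
    eval_end_year = train_end_year + eval_years
    return {
        'runid': runid,
        'START_DATE': date(start_year, 1, 1).isoformat(),
        'TRAIN_END_DATE': date(train_end_year, 12, 31).isoformat(),
        'EVAL_END_DATE': date(eval_end_year, 12, 31).isoformat(),
    }


# Five runs, matching the five tasks of the _runall workflow
run_confs = [make_run_conf(runid) for runid in range(1, 6)]
for conf in run_confs:
    print(conf)
```

Generating the configurations this way avoids copy-and-paste errors when the number of runs or the window lengths change.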

The tasks are defined as follows to trigger the `ts-spark_ch9_data-ml-ops` workflow and pass the run configuration as a parameter:

# Define tasks
t1 = TriggerDagRunOperator(
    task_id="ts-spark_ch9_data-ml-ops_1",
    trigger_dag_id="ts-spark_ch9_data-ml-ops",
    conf=conf_run1,
    wait_for_completion=True,
    dag=dag,
)
t2 = TriggerDagRunOperator(
    task_id="ts-spark_ch9_data-ml-ops_2",
    trigger_dag_id="ts-spark_ch9_data-ml-ops",
    conf=conf_run2,
    wait_for_completion=True,
    dag=dag,
)
…

The tasks are then launched in sequence as follows:

t1 >> t2 >> t3 >> t4 >> t5

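You can launch this `ts-spark_ch9_data-ml-ops_runall` Airflow DAG from the DAG view in Airflow, as shown in *Figure 9.3*, by clicking on the run button (>) highlighted in green.

Figure 9.3: Running the Airflow DAG

The outcome of this DAG is visible in *Figure 9.2*, showing the status of the individual tasks.

We will now discuss the details of these tasks, which, as we have seen, call the `ts-spark_ch9_data-ml-ops` workflow with different parameters. We start with the first step, `get_config`, which handles these parameters.

Configuration

The first task in the `ts-spark_ch9_data-ml-ops` workflow is `t0`, which calls the `get_config` function to get the configuration needed to run the workflow. These are passed as parameters to the workflow. As mentioned earlier, they are the run identifier and the date range of the time series data on which the workflow will run. We will see how they are used in the subsequent tasks.

The code to define task `t0` is as follows: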

t0 = PythonOperator(
    task_id='get_config',
    python_callable=get_config,
    op_kwargs={'_vars': {
        'runid': "{{ dag_run.conf['runid'] }}",
        'START_DATE': "{{ dag_run.conf['START_DATE'] }}",
        'TRAIN_END_DATE': "{{ dag_run.conf['TRAIN_END_DATE'] }}",
        'EVAL_END_DATE': "{{ dag_run.conf['EVAL_END_DATE'] }}",
        },
    },
    provide_context=True,
    dag=dag,
)

The `get_config` function called by task `t0` is as follows:

def get_config(_vars, **kwargs):
    print(f"dag_config: {_vars}")
    return _vars

The function prints the configuration and returns the `_vars` variable for use by subsequent tasks.

We can see the status of the task in the DAG view in Airflow as per *Figure 9.1*.

Data ingestion and storage

At this step, after completion of the `t0` task, the `t1` task is launched by Airflow. It calls the `ingest_train_data` function to ingest the training data from the input CSV file as specified by the `DATASOURCE` variable. In this example, as it is a relatively small file, we ingest the full file every time. In a production setting, you will likely ingest only new data points incrementally at this stage.

The code for this step is as follows:

def ingest_train_data(_vars, **kwargs):
    sdf = spark.read.csv(
        DATASOURCE, header=True, inferSchema=True
    )
    sdf = sdf.filter(
        (F.col('date') >= F.lit(_vars['START_DATE'])) &
        (F.col('date') <= F.lit(_vars['TRAIN_END_DATE']))
    )
    data_ingest_count = sdf.count()
    sdf.write.format("delta").mode("overwrite").save(
        f"/data/delta/ts-spark_ch9_bronze_train_{_vars['runid']}"
    )
    _vars['train_ingest_count'] = data_ingest_count
    return _vars


 The data is ingested using Spark, with the `spark.read.csv` function, into a Spark DataFrame. We then filter the data for the range of dates that fall within the training dataset as per the `START_DATE` and `TRAIN_END_DATE` parameters.
We want to be able to later report on how much data we ingest every time. To enable this, we count the number of rows in the DataFrame.
Finally, in this task, we persist the ingested data with the `write` function to disk storage in `delta` format for use by the next steps of the workflow. As we will parallelize the workflow tasks in the future, and to avoid multiple parallel writes to the same disk location, we store the data for this specific run in its own table appended with `runid`. Note as well how we used the term `bronze` in the name. This corresponds to the **medallion** approach, which we discussed in the *Data processing and storage* section of *Chapter 4*. Persisting the data to storage at this stage can come in handy when we are ingesting a lot of data. This makes it possible in the future to change and rerun the rest of the pipeline without having to re-ingest the data.
The status of the task is visible in the DAG view in Airflow as per *Figure 9.1*.

With the data ingested from the source and persisted, we can move on to the data transformation stage.

Data transformations

This stage corresponds to Airflow task `t2`, which calls the `transform_train_data` function. As its name suggests, this function transforms the training data into the right format for the upcoming training stage.

The code for this step is as follows:

def transform_train_data(_vars, **kwargs):
    sdf = spark.read.format("delta").load(
        f"/data/delta/ts-spark_ch9_bronze_train_{_vars['runid']}"
    )
    sdf = sdf.selectExpr(
        "date as ds",
        "cast(daily_min_temperature as double) as y"
    )
    sdf = sdf.dropna()
    data_transform_count = sdf.count()
    sdf.write.format("delta").mode("overwrite").save(
        f"/data/delta/ts-spark_ch9_silver_train_{_vars['runid']}"
    )
    _vars['train_transform_count'] = data_transform_count
    return _vars


 We first read the data from `bronze`, where it was stored by the previous task, `t1`. This stored data can then be used as input to run the current task.
In this example, we do the following simple transformations:

*   Column level: Rename the `date` column as `ds`
*   Column level: Change `daily_min_temperature` to the double data type (the `cast` function) and rename it as `y`
*   DataFrame level: Remove all rows with missing values using the `dropna` function

As in the previous stage, we want to collect metrics specific to this stage so that we can later report on the transformations. To do this, we count the number of rows in the DataFrame after the transformations.
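
The three transformations and the row count described above can be mirrored on plain Python data, which may help when reasoning about what the Spark code does. The rows below are hypothetical, and `to_float_or_none` is an illustrative helper that imitates a lenient cast where unparseable values become missing; the actual behavior of the Spark cast depends on the Spark configuration.

```python
# Hypothetical bronze rows; None and '' stand in for missing values
bronze_rows = [
    {"date": "1981-01-01", "daily_min_temperature": "20.7"},
    {"date": "1981-01-02", "daily_min_temperature": None},
    {"date": "1981-01-03", "daily_min_temperature": "18.8"},
    {"date": "1981-01-04", "daily_min_temperature": ""},
]


def to_float_or_none(value):
    # Imitates a lenient cast: values that cannot be parsed become None
    try:
        return float(value)
    except (TypeError, ValueError):
        return None


# Column level: rename date -> ds, cast daily_min_temperature -> y
renamed = [
    {"ds": row["date"], "y": to_float_or_none(row["daily_min_temperature"])}
    for row in bronze_rows
]
# DataFrame level: drop the rows with any missing value
silver_rows = [
    row for row in renamed if all(v is not None for v in row.values())
]

# Row counts before and after, as collected for later reporting
print(f"ingested: {len(bronze_rows)}, transformed: {len(silver_rows)}")
```

Comparing the two counts is a cheap data quality signal: a sudden drop between ingestion and transformation suggests a change in the source data.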
Note

This stage is likely to include several data checks and transformations, as discussed in the *Data quality checks, cleaning, and transformations* section of *Chapter 5*.

Finally, in this task, we persist the transformed data with the `write` function to disk storage in `delta` format for use by the next steps of the workflow. We call this data stage `silver`, as per the medallion approach explained previously.

Similarly to the previous tasks, we can see the task's status in the DAG view in Airflow, as per *Figure 9.1*.

With the data curated and persisted, we can move on to the model training stage.

Model training and validation

This stage is the longest in our example and corresponds to Airflow task `t3`, which calls the `train_and_log_model` function. This function trains and validates a Prophet forecasting model using the training data from the previous stage. As we saw in *Chapter 7*, choosing the right model involves a whole process, which we have simplified here to a minimum.

The code extract for this step is as follows:

def train_and_log_model(_vars, **kwargs):
    sdf = spark.read.format("delta").load(
        f"/data/delta/ts-spark_ch9_silver_train_{_vars['runid']}"
    )
    pdf = sdf.toPandas()
    mlflow.set_experiment(
        'ts-spark_ch9_data-ml-ops_time_series_prophet_train'
    )
    mlflow.start_run()
    mlflow.log_param("DAG_NAME", DAG_NAME)
    mlflow.log_param("TRAIN_START_DATE", _vars['START_DATE'])
    …
    mlflow.log_metric(
        'train_ingest_count', _vars['train_ingest_count'])
    …
    model = Prophet().fit(pdf)
    …
    cv_metrics_name = [
        "mse", "rmse", "mae", "mdape", "smape", "coverage"]
    cv_params = cross_validation(
        …
    )
    _cv_metrics = performance_metrics(cv_params)
    cv_metrics = {
        n: _cv_metrics[n].mean() for n in cv_metrics_name}
    …
    signature = infer_signature(train, predictions)
    mlflow.prophet.log_model(
        model, artifact_path=ARTIFACT_DIR,
        signature=signature, registered_model_name=model_name,)
    mlflow.log_params(param)
    mlflow.log_metrics(cv_metrics)
    …
    mlflow.end_run()
    return _vars


 In this code example, we do the following:

1.  We first read the data from `silver`, where it was stored by the previous task, `t2`. Then, we can run the current task using the stored data as input.
2.  MLflow Tracking Server is used to save all the parameters and metrics for each run. We group them under an experiment called `ts-spark_ch9_data-ml-ops_time_series_prophet_train` and use `log_param` and `log_metric` functions to capture the parameters and metrics gathered so far in the run.
3.  We then train the Prophet model with the training data using the `fit` function.
4.  As a model validation step, we use the `cross_validation` function and retrieve the corresponding metrics with the `performance_metrics` function.
5.  The final step is to log the model to the MLflow Model Registry, using the `log_model` function, and all the related training and validation metrics with MLflow. Note that we log the model signature as a best practice to document the model in the MLflow Model Registry.
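
The validation in step 4 relies on rolling-origin cross-validation: the model is refitted on the history up to a series of cutoff dates and evaluated on the period that follows each cutoff. The sketch below is a simplified illustration of how such cutoffs can be derived, not Prophet's own code; the helper name and the `initial`, `period`, and `horizon` values are hypothetical, since the arguments used in the example are elided.

```python
from datetime import date, timedelta


def rolling_cutoffs(first_day, last_day, initial, period, horizon):
    # Work backward from the last cutoff that still leaves a full
    # horizon, stepping by `period`, while at least `initial` of
    # history remains before the cutoff
    cutoffs = []
    cutoff = last_day - horizon
    while cutoff >= first_day + initial:
        cutoffs.append(cutoff)
        cutoff -= period
    return sorted(cutoffs)


# Training range of run 1; the window lengths below are hypothetical
cutoffs = rolling_cutoffs(
    first_day=date(1981, 1, 1),
    last_day=date(1985, 12, 31),
    initial=timedelta(days=730),
    period=timedelta(days=180),
    horizon=timedelta(days=365),
)
for cutoff in cutoffs:
    print(f"train up to {cutoff}, validate on the following 365 days")
```

Each cutoff produces one set of metrics, which is why the code extract averages the metrics across cutoffs before logging them.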

We can see the task's status in the DAG view in Airflow, as per *Figure 9.1*. The logged parameters and metrics are visible in MLflow Tracking Server, as shown in *Figure 9.4*.
![](https://github.com/OpenDocCN/freelearn-ds-pt3-zh/raw/master/docs/ts-anal-spk/img/B18568_09_04.jpg)

Figure 9.4: MLflow experiment tracking (training)
The model saved in the MLflow Model Registry is shown in *Figure 9.5*.
![](https://github.com/OpenDocCN/freelearn-ds-pt3-zh/raw/master/docs/ts-anal-spk/img/B18568_09_05.jpg)

Figure 9.5: MLflow Model Registry
After the conclusion of the model training stage, we can progress to the next stage, where we will use the trained model to do forecasting.

Forecasting

This stage corresponds to Airflow task `t4`, which calls the `forecast` function. As its name suggests, this function infers future values of the time series. While we have this task in the same workflow as the prior training tasks, it is common for the forecasting task to be in a separate inferencing pipeline. This separation allows for scheduling the training and inferencing at different times.

The code for this step is as follows:

def forecast(_vars, **kwargs):
    # Load the model from the Model Registry
    model_uri = f"models:/{model_name}/{model_version}"
    _model = mlflow.prophet.load_model(model_uri)
    forecast = _model.predict(
        _model.make_future_dataframe(
            periods=365, include_history=False))
    sdf = spark.createDataFrame(forecast[
        ['ds', 'yhat', 'yhat_lower', 'yhat_upper']])
    sdf.write.format("delta").mode("overwrite").save(
        f"/data/delta/ts-spark_ch9_gold_forecast_{_vars['runid']}")
    print(f"forecast:\n{forecast.tail(30)}")
    mlflow.end_run()
    return _vars


 We first load the model from the model registry, where it was stored by the previous task, `t3`. This model can then be used for forecasting in the current task.
In this example, we want to generate a forecast for 365 days in advance by calling the following:

*   The `make_future_dataframe` function to generate the future period
*   The `predict` function to forecast for these future times

Another approach to generating the future period is to get this as input from the user or another application calling the model. As for the 365-day forecasting horizon, this is a relatively long time to forecast. We discussed in the *Forecasting* section of *Chapter 2* how a shorter forecasting horizon is likely to yield better forecasting accuracy. We have used a long period in this example for practical reasons to showcase forecasting over a 5-year period, with 5 runs of 365 days each. These runs were explained in the earlier *Simulation and runs* section. Moving beyond the requirement of the example here, keep the forecasting horizon shorter relative to the span of the training dataset and the level of granularity.
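
To see which dates a 365-day horizon covers, the following sketch builds the future daily dates by hand. It assumes that the history ends on the `TRAIN_END_DATE` of run 1 and that the data is daily; the helper `future_dates` is hypothetical and only imitates the effect of generating a future period that excludes the history.

```python
from datetime import date, timedelta


def future_dates(last_history_day, periods):
    # Daily dates strictly after the last day of history, which is
    # what excluding the history from the future period amounts to
    return [
        last_history_day + timedelta(days=i) for i in range(1, periods + 1)
    ]


# TRAIN_END_DATE of run 1, with the 365-day horizon from the example
horizon = future_dates(date(1985, 12, 31), periods=365)
print(f"first forecast day: {horizon[0]}")
print(f"last forecast day:  {horizon[-1]}")
```

For run 1, the horizon ends exactly on the `EVAL_END_DATE`. Note that when the evaluation year is a leap year, 365 periods stop on December 30, one day short of the end of the year.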
Finally, in this task, we persist the forecasted data with the `write` function to disk storage in `delta` format for use by the next steps of the workflow. We call this data stage `gold` as per the medallion approach explained previously. This delta table, where the forecasting outcome is stored, is also known as the inference table.
With the forecasts persisted, we can move on to the model evaluation stage.

Model evaluation

This stage corresponds to Airflow tasks `t5`, `t6`, and `t7`, which call the `ingest_eval_data`, `transform_eval_data`, and `eval_forecast` functions respectively.

Note

In a production environment, we want to monitor the accuracy of our model's forecast against real data so that we can detect when the model is not accurate enough and needs retraining. In the example here, we have these tasks in the same workflow as the prior forecasting task to keep the example simple enough to fit within this chapter. In production, these tasks would form a separately scheduled workflow, executed after the forecasted period has elapsed and the actual values are available. In the example, we are simulating the post-event evaluation by using the data points following the training data.

The `ingest_eval_data` and `transform_eval_data` functions are very similar to the `ingest_train_data` and `transform_train_data` functions, which we have seen in the previous sections. The main difference, as the names suggest, is that they operate on the evaluation data rather than the training data.

We will focus on the `eval_forecast` function in this section, with the code extract as follows:
def eval_forecast(_vars, **kwargs):
    sdf = spark.read.format("delta").load(
        f"/data/delta/ts-spark_ch9_silver_eval_{_vars['runid']}")
    sdf_forecast = spark.read.format("delta").load(
        f"/data/delta/ts-spark_ch9_gold_forecast_{_vars['runid']}")
    sdf_eval = sdf.join(sdf_forecast, 'ds', "inner")
    …
    evaluator = RegressionEvaluator(
        labelCol='y', predictionCol='yhat', metricName='rmse')
    eval_rmse = evaluator.evaluate(sdf_eval)
    …
    mlflow.set_experiment('ts-spark_ch9_data-ml-ops_time_series_prophet_eval')
    mlflow.start_run()
    mlflow.log_param("DAG_NAME", DAG_NAME)
    mlflow.log_param("EVAL_START_DATE", _vars['START_DATE'])
    …
    mlflow.log_metric('eval_rmse', _vars['eval_rmse'])
    mlflow.end_run()
    return _vars


 In this code example, we do the following:

1.  We first read the evaluation data from `silver`, where it was stored by the previous task, `t6`. We also read the forecasted data from `gold`, where it was stored earlier by the forecasting task, `t4`. We join both datasets with the `join` function so that we can compare the forecasts to the actuals.
2.  We then calculate the evaluation metric (RMSE) on the joined dataset. In this example, `RegressionEvaluator` from the `pyspark.ml.evaluation` library is used to do the calculation.
3.  As a final step, MLflow Tracking Server is used to save all the parameters and metrics for each run. We group them under an experiment called `ts-spark_ch9_data-ml-ops_time_series_prophet_eval` and use the `log_param` and `log_metric` functions to capture the parameters and metrics gathered so far in the run.
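
The join and the RMSE calculation in the first two steps can be sketched on plain Python data. The values below are hypothetical and the code does not use Spark; it only illustrates that the inner join keeps the dates present in both datasets and that the RMSE is computed on those matched rows.

```python
import math

# Hypothetical actuals (silver) and forecasts (gold), keyed by date
actuals = {"1986-01-01": 12.0, "1986-01-02": 14.0, "1986-01-03": 13.0}
forecasts = {"1986-01-02": 13.0, "1986-01-03": 15.0, "1986-01-04": 11.0}

# Inner join on ds: keep only the dates present in both datasets
joined = [
    (ds, actuals[ds], forecasts[ds])
    for ds in sorted(actuals.keys() & forecasts.keys())
]

# RMSE: square root of the mean squared difference between y and yhat
rmse = math.sqrt(
    sum((y - yhat) ** 2 for _, y, yhat in joined) / len(joined)
)
print(f"matched rows: {len(joined)}, RMSE: {rmse:.4f}")
```

Because unmatched dates are silently dropped by an inner join, it is worth tracking the number of matched rows alongside the metric.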

We can see the task's status in the DAG view in Airflow, as per *Figure 9.1*. The logged parameters and metrics are visible in MLflow Tracking Server, as shown in *Figure 9.6*.
![](https://github.com/OpenDocCN/freelearn-ds-pt3-zh/raw/master/docs/ts-anal-spk/img/B18568_09_06.jpg)

Figure 9.6: MLflow experiment tracking (evaluation)
As with the training experiment tracking shown in *Figure 9.4*, we can see in the evaluation experiment in *Figure 9.6* the ingest, transform, and forecast data counts, as well as the evaluation RMSE.
This concludes the model evaluation stage and, with it, the end-to-end workflow example. In the next section, we will cover the monitoring and reporting part of the example.
Monitoring and reporting
The workflows we covered in the previous section are the backend processes in our end-to-end time series analysis example. In this section, we will cover the operational monitoring of the runs and the end user reporting of the forecasting outcome.
Monitoring
The work to collect the metrics has been done as part of the code executed by the workflows we have seen in this chapter. Our focus in this section is on the visualizations to monitor the workflows and the metrics.
Workflow
Starting with the workflow, as we have seen in *Figures 9.1* and *9.2*, the Airflow DAG shows the status of the runs. If a task fails, as shown in *Figure 9.7*, we can select the failed task in Airflow and inspect the event log and logs to troubleshoot.
![](https://github.com/OpenDocCN/freelearn-ds-pt3-zh/raw/master/docs/ts-anal-spk/img/B18568_09_07.jpg)

Figure 9.7: Airflow DAG with failed task
Training
We can visualize the training metrics for a specific run in MLflow Tracking Server, as shown in *Figure 9.4*. We can also monitor the metrics across multiple runs and compare them in a table, as per *Figure 9.8*.
![](https://github.com/OpenDocCN/freelearn-ds-pt3-zh/raw/master/docs/ts-anal-spk/img/B18568_09_08.jpg)

Figure 9.8: MLflow experiments (training) – select and compare
By selecting the five runs of our `_runall` workflow and clicking on the **Compare** button as in *Figure 9.8*, we can create a scatter plot as per *Figure 9.9*. Hovering the mouse pointer over a data point in the graph also shows the details for that specific run.
![](https://github.com/OpenDocCN/freelearn-ds-pt3-zh/raw/master/docs/ts-anal-spk/img/B18568_09_09.jpg)

Figure 9.9: Training – plot by runs with details
An interesting metric to monitor is the count of training data transformed and ready for training, as per *Figure 9.10*. We can see here that the first four runs had fewer data points for training than the last run.
![](https://github.com/OpenDocCN/freelearn-ds-pt3-zh/raw/master/docs/ts-anal-spk/img/B18568_09_10.jpg)

Figure 9.10: Training – transform count by runs
We can similarly monitor the RMSE for each training run, as per *Figure 9.11*. We can see here that the model accuracy has improved (lower RMSE) in the last two runs. If the accuracy had dropped instead, then the question from an operational point of view would have been whether this drop is acceptable or there is a need to develop another model. This decision depends on your specific requirements and on what was agreed as an acceptable drop in accuracy.
![](https://github.com/OpenDocCN/freelearn-ds-pt3-zh/raw/master/docs/ts-anal-spk/img/B18568_09_11.jpg)

Figure 9.11: Training – RMSE by runs
After the model has been trained and is used for forecasting, we can evaluate the model’s forecasted values against actuals. We will cover the monitoring of the evaluation metrics next.
Evaluation
We can visualize the evaluation metrics for a specific run in MLflow Tracking Server, as shown in *Figure 9.6*. We can also monitor the metrics across multiple runs and compare them in a table, as per *Figure 9.12*.
![](https://github.com/OpenDocCN/freelearn-ds-pt3-zh/raw/master/docs/ts-anal-spk/img/B18568_09_12.jpg)

Figure 9.12: MLflow experiments (evaluation) – select and compare
By selecting the five runs of our `_runall` workflow and clicking on the **Compare** button as in *Figure 9.12*, we can create a scatter plot as per *Figure 9.13*. Hovering the mouse pointer over a data point in the graph also shows the details for that specific run.
![](https://github.com/OpenDocCN/freelearn-ds-pt3-zh/raw/master/docs/ts-anal-spk/img/B18568_09_13.jpg)

Figure 9.13: Evaluation – RMSE by runs with details
An interesting metric to monitor is the count of forecasted data points, as per *Figure 9.14*. We can see here that all the runs had the expected number of data points, except the fourth run, which had one fewer. This is because the evaluation dataset was missing one data point for that time period.
![](https://github.com/OpenDocCN/freelearn-ds-pt3-zh/raw/master/docs/ts-anal-spk/img/B18568_09_14.jpg)

Figure 9.14: Evaluation – forecast count by runs
We can similarly monitor the RMSE for each evaluation run, as per *Figure 9.13*. We can see here that the model accuracy dropped gradually (higher RMSE) until the fourth run and then improved in the last run. If the drop had persisted instead, the question from an operational point of view would have been whether this drop is acceptable or there is a need to develop another model. This decision depends on your specific requirements and on what has been agreed as an acceptable drop in accuracy.
This concludes the section on monitoring. While using MLflow was sufficient for the example here, most organizations have dedicated monitoring solutions into which MLflow metrics can be integrated. These solutions also include alerting capabilities, which we have not covered here.
We have explored the process to reach an outcome so far, but have not seen the outcome yet. In the next section, we will report on the forecasting outcome.
Reporting
We will use a Jupyter notebook in this example to create a set of graphs to represent the forecasting outcome. The `ts-spark_ch9_data-ml-ops_results.ipynb` notebook can be accessed from the local web location, as shown in *Figure 9.15*. This Jupyter environment was deployed as part of the *Environment setup* section.
![](https://github.com/OpenDocCN/freelearn-ds-pt3-zh/raw/master/docs/ts-anal-spk/img/B18568_09_15.jpg)

Figure 9.15: Reporting – notebook to create a graph
After running the notebook, we can see at its end the graph, as per *Figure 9.16*, of the forecasts (gray lines) and actuals (scatter plot) for the different runs. The forecast captures the seasonality well, and most of the actuals fall within the uncertainty intervals, which are set at 80% by default in Prophet.
![](https://github.com/OpenDocCN/freelearn-ds-pt3-zh/raw/master/docs/ts-anal-spk/img/B18568_09_16.jpg)

Figure 9.16: Reporting – actuals (scatter plot) compared to forecasts (gray lines)
We can zoom into specific runs as per *Figures 9.17* and *9.18*. These match with the RMSE values we saw in the earlier *Monitoring* section, as we will detail next.
As we can see in *Figure 9.13*, the first run had the lowest RMSE. This is reflected in *Figure 9.17*, with most actuals falling within the forecasting interval.
![](https://github.com/OpenDocCN/freelearn-ds-pt3-zh/raw/master/docs/ts-anal-spk/img/B18568_09_17.jpg)

Figure 9.17: Reporting – actuals compared to forecasts (run with lowest RMSE)
In *Figure 9.13*, the fourth run had the highest RMSE. This is reflected in *Figure 9.18*, with many more actuals than in the first run falling outside the forecasting interval.
![](https://github.com/OpenDocCN/freelearn-ds-pt3-zh/raw/master/docs/ts-anal-spk/img/B18568_09_18.jpg)

Figure 9.18: Reporting – actuals compared to forecasts (run with highest RMSE)
At this point, the output of the Jupyter notebook can be exported as a report in several formats, such as HTML or PDF. While using Jupyter was sufficient for the example here, most organizations have reporting solutions into which the forecasting outcome can be integrated.
Additional considerations
Here we discuss further considerations that apply when going to production, beyond what the example in this chapter already covered.
Scaling
We covered scaling extensively in *Chapter 8*. The environment and workflows in this chapter can be scaled as well. At a high level, this can be achieved in the following ways:

*   Airflow server: scale up by adding more CPU and memory resources
*   Airflow DAG: run the tasks in parallel
*   Spark cluster: scale up by adding more CPU and memory resources
*   Spark cluster: scale out by adding more workers
*   Model: use Spark-enabled models or parallelize the use of pandas, as discussed in the previous chapter

You can find more information about Airflow DAGs, including parallel tasks, here: [`airflow.apache.org/docs/apache-airflow/stable/core-concepts/dags.html`](https://airflow.apache.org/docs/apache-airflow/stable/core-concepts/dags.html)
To relate this to our Airflow DAG example in this chapter, we defined the tasks as sequential in the following way in the code for `ts-spark_ch9_data-ml-ops_runall`:

```
t1 >> t2 >> t3 >> t4 >> t5
```


The code to run tasks `t3` and `t4` in parallel is as follows:

```
t1 >> t2 >> [t3, t4] >> t5
```
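
To see what the two dependency expressions mean, the following toy sketch mimics the `>>` chaining with a small stand-in class. It is an illustration only and not Airflow's implementation: the `Task` class and its `upstream` attribute are hypothetical names introduced here, and real DAGs should use Airflow's own operators.

```python
class Task:
    """Toy stand-in for a workflow task (hypothetical, not an Airflow class)."""

    def __init__(self, name):
        self.name = name
        self.upstream = set()  # names of the tasks that must finish first

    def __rshift__(self, other):
        # Handles `task >> other` and `task >> [other1, other2]`
        targets = other if isinstance(other, list) else [other]
        for target in targets:
            target.upstream.add(self.name)
        return other

    def __rrshift__(self, other):
        # Handles `[task1, task2] >> task`, since lists do not define `>>`
        for source in other:
            self.upstream.add(source.name)
        return self


t1, t2, t3, t4, t5 = (Task(f"t{i}") for i in range(1, 6))
t1 >> t2 >> [t3, t4] >> t5

for task in (t1, t2, t3, t4, t5):
    print(task.name, "waits for", sorted(task.upstream))
```

In this sketch, `t3` and `t4` both wait only for `t2`, so a scheduler is free to run them at the same time, while `t5` waits for both of them.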


With regard to the considerations for scaling the Spark cluster, refer to *Chapter 3* and, more specifically, the *Driver and worker nodes* section, for a detailed discussion.
Model retraining
We already included retraining in our example workflow at every run using a sliding window of the most recent data. In practice, and to optimize resource utilization, the retraining can be scheduled at a less frequent interval in its own separate workflow. We discussed tracking the model’s accuracy metrics across runs of the workflow in the *Monitoring* section. The trigger of this retraining workflow can be based on the accuracy dropping below a predefined threshold. The appropriate value for the threshold depends on your specific requirements.
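
A threshold-based trigger of this kind can be sketched in a few lines. This is a minimal illustration under assumptions: the function name `should_retrain`, the `patience` parameter (how many consecutive runs must breach the threshold), and the sample RMSE values are all hypothetical and not part of the chapter's workflow.

```python
def should_retrain(rmse_history, threshold, patience=2):
    """Return True when the last `patience` runs all exceed the RMSE threshold.

    A higher RMSE means lower accuracy, so exceeding the threshold is the
    "accuracy dropped too far" condition.
    """
    if len(rmse_history) < patience:
        return False  # not enough runs yet to decide
    return all(value > threshold for value in rmse_history[-patience:])


# Made-up evaluation RMSE values from successive runs
print(should_retrain([0.8, 0.9, 1.3, 1.4], threshold=1.2))  # True
print(should_retrain([0.8, 1.3, 0.9], threshold=1.2))       # False
```

Requiring several consecutive breaches is one possible design choice to avoid retraining on a single noisy run; a stricter setup could use `patience=1`.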
Governance and security
In *Chapter 4*, in the *From DataOps to ModelOps to DevOps* section, we discussed the considerations for governance and security at various points. Securing your environment and production rollout is beyond the scope of this book. As these are key requirements, and we will not be going into further details here, we highly recommend referring to the following resources to secure the components used in our example:

| Component | Resources |
| --- | --- |
| **Apache Spark** | [`spark.apache.org/docs/latest/security.html`](https://spark.apache.org/docs/latest/security.html) |
| **MLflow** | [`mlflow.org/docs/latest/auth/index.html`](https://mlflow.org/docs/latest/auth/index.html), [`github.com/mlflow/mlflow/security`](https://github.com/mlflow/mlflow/security) |
| **Airflow** | [`airflow.apache.org/docs/apache-airflow/stable/security/index.html`](https://airflow.apache.org/docs/apache-airflow/stable/security/index.html) |
| **Jupyter** | [`jupyter.org/security`](https://jupyter.org/security) |
| **Docker** | [`www.docker.com/blog/container-security-and-why-it-matters/`](https://www.docker.com/blog/container-security-and-why-it-matters/) |
| **Unity Catalog** | [`www.unitycatalog.io/`](https://www.unitycatalog.io/) |

Table 9.1: Resources on security and governance for components in use
This concludes the section on the additional considerations before going to production.
Summary
In this chapter, we focused on the crucial phase of moving projects into production, especially given the challenges many projects face in achieving this transition and delivering measurable business results. We saw an example of an end-to-end workflow, covering the stages of data ingestion, storage, data transformations, model training and validation, forecasting, model evaluation, and monitoring. With this example, we brought together what we have learned in this book in view of planning for a production rollout.
In the next chapter, we will explore how to go further with Apache Spark for time series analysis by leveraging the advanced capabilities of a managed cloud platform for data and AI.
Join our community on Discord
Join our community’s Discord space for discussions with the authors and other readers:
[`packt.link/ds`](https://packt.link/ds)
![](https://github.com/OpenDocCN/freelearn-ds-pt3-zh/raw/master/docs/ts-anal-spk/img/ds_(1).jpg)

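Chapter 10: Going Further with Apache Spark

In the previous chapter, we used open source components to take time series analysis into production. This required a lot of effort to set up and manage the platform. In this chapter, we will go further with Apache Spark by using Databricks as a cloud-based, managed Platform-as-a-Service (PaaS) solution. We will work through an end-to-end time series analysis example built on Databricks, using advanced features such as Delta Live Tables with streaming pipelines, AutoML, Unity Catalog, and AI/BI dashboards.

This chapter covers the following main topics:

*   Databricks components and setup
*   Workflows
*   Monitoring, security, and governance
*   User interface

Technical requirements

In this chapter, we will explore through code examples how to deploy a scalable, end-to-end time series analysis solution on Databricks, starting with the environment setup in the next section.

The code for this chapter is at the following URL:

github.com/PacktPublishing/Time-Series-Analysis-with-Spark/tree/main/ch10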

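Databricks components and setup

We will use a Databricks environment, as described in Chapter 8, as the platform infrastructure. Follow the instructions in the Technical requirements section of Chapter 8 to set up the Databricks environment.

Workspace, folders, and notebooks

Once the environment is set up, follow the instructions in the links provided here to import the notebooks:

1.  Navigate the Databricks workspace: docs.databricks.com/en/workspace/index.html
2.  Create a folder named ts_spark and a subfolder named ch10: docs.databricks.com/en/workspace/workspace-objects.html#folders
3.  Import the notebooks for this example into the ch10 folder: docs.databricks.com/en/notebooks/notebook-export-import.html#import-a-notebook

There are eight notebooks in total, which can be imported from the following URL:

With the notebooks imported, we can set up the cluster next.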

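Cluster

We can use either Databricks Runtime for Machine Learning (MLR) or serverless compute for the cluster.

An MLR cluster comes preloaded with common libraries for machine learning (ML). It instantiates virtual machines in your cloud provider account, and the cloud provider will charge you for these virtual machines. When creating the cluster, choose a small instance type with minimal CPU and memory to minimize this cost. This is sufficient for the example in this chapter. Refer to the Technical requirements section of Chapter 8 on setting up a Databricks cluster.

The MLR cluster has the libraries required for the AutoML example, which we will cover in a later section. If you do not want to incur the cloud provider costs for the virtual machines associated with MLR, you can skip the code execution for that example. We will provide an alternative workflow that does not use AutoML.

Note

At the time of writing, the cloud provider cost of these virtual machines exceeds the free credits offered with free cloud provider trial accounts. This means that you will need to upgrade to a paid cloud provider account, with the costs charged to the credit card you specified when creating the account. The costs of the virtual machines and cloud infrastructure are not included in the free Databricks trial account.

Serverless compute is included in your Databricks costs, as the underlying virtual machines are fully managed by Databricks. This means that it does not incur separate cloud provider costs. However, at the time of writing, serverless compute requires the ML libraries to be installed, as you will see in the code examples. In the future, Databricks may offer serverless compute with preloaded ML libraries.

Note

At the time of writing, Databricks has started to include serverless compute in the free trial account. This means that running this chapter's code on serverless compute is free, as long as you stay within the time and cost limits of the free Databricks trial account. This policy may change in the future.

You can find more information on MLR and serverless compute in the following resources:

Having covered the cluster, we will next look at configuring the data pipeline with Delta Live Tables.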

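Streaming with Delta Live Tables

Databricks Delta Live Tables (DLT) is a low-code, declarative solution for building data pipelines. In our example, we will use DLT to build the feature engineering pipeline, which ingests data from source files, checks data quality, and transforms the data into features that can be used to train time series models. You can find more information on DLT at the following link:

www.databricks.com/discover/pages/getting-started-with-delta-live-tables

We will go into the details of the DLT configuration in the Implementing workflows section.

Workflows

Databricks workflows are the equivalent of the Airflow DAGs that we used in Chapter 4 and Chapter 9. You can find more information on workflows, also known as jobs, at the following link:

docs.databricks.com/en/jobs/index.html

We will go into the details of the job configuration next.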

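Implementing workflows

The code example for this chapter includes four workflows. These are implemented as jobs in Databricks. The best way to view a job is through the Workflows > Jobs > Tasks view in Databricks, as shown in Figures 10.1, 10.2, 10.3, and 10.4.

The jobs are as follows:

ts-spark_ch10_1a_ingest_and_train – This job ingests the data, engineers the features, and trains the model, as shown in Figure 10.1. It includes the following tasks:
- reset
- dlt_features
- model_training

Figure 10.1: Job for data ingestion, feature engineering, and model training

ts-spark_ch10_1b_ingest_and_train_automl – The second job, shown in Figure 10.2, is another version of the first job, with the difference that it uses AutoML, as explained in the Training with AutoML section.

Figure 10.2: Job for data ingestion, feature engineering, and model training (AutoML)

ts-spark_ch10_2b_ingest_and_forecast – This job ingests new data, retrains the model, and generates and evaluates the forecast, as shown in Figure 10.3. It includes the following tasks:
- dlt_features
- update_model
- generate_forecast
- update_data
- evaluate_forecast

Figure 10.3: Job to ingest new data, retrain the model, and generate forecasts

ts-spark_ch10_2a_update_iteration – As shown in Figure 10.4, this job calls the previous job multiple times to ingest new data. It simulates the real-world situation in which the previous end-to-end workflow is launched at regular intervals, such as daily or weekly, to process new data.

Figure 10.4: Job calling the ingest and process new data job multiple times

Modularity and task separation

As in Chapter 9, we split the jobs into multiple tasks to demonstrate modularity as a best practice. This makes it possible to change the code, scale, and rerun tasks independently. Different teams can own different tasks. Depending on your requirements, you can split these jobs further so that tasks can be launched separately.

We will explain each job and its tasks in detail in the following sections, starting with the ingest and train job.

To set up the jobs required for this chapter, follow the instructions in the following link to create the jobs and configure the tasks:

When creating the jobs and their tasks, refer to the tables in the following sections for the configuration, replacing <USER_LOGIN> with your own Databricks user login.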

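Ingest and train

The ts-spark_ch10_1a_ingest_and_train job, shown in Figure 10.1, is detailed in this section.

Table 10.1 shows the configuration of the ts_spark_ch10_1a_ingest_and_train job, which you can use with the instructions at the URL provided earlier. Note that, for simplicity, we have given each task the same name as the code notebook or pipeline it runs.

Job	`ts_spark_ch10_1a_ingest_and_train`
Task 1	Task name
	Type
	Source
	Path (notebook)
	Compute
Task 2	Task name
	Type
	Pipeline
	Trigger a full refresh on the pipeline
	Depends on
Task 3	Task name
	Type
	Source
	Path (notebook)
	Compute
	Depends on

Table 10.1: Job configuration – ts_spark_ch10_1a_ingest_and_train

reset

The reset task does the following:

*   Resets the ts_spark Databricks catalog, which is used for this example
*   Downloads the data files for this chapter from GitHub to the volumes created under the ts_spark catalog

The code for this task is in the ts_spark_ch10_reset notebook.

Catalog and volumes

Databricks Unity Catalog provides data governance and management capabilities. It organizes data in a three-level hierarchy: catalogs, schemas (the equivalent of databases), and tables, views, or volumes. Tabular data is stored in tables and views, while files are stored in volumes. In our code example, we use a single catalog, ts_spark, and volumes to store the data files.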

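dlt_features

This task ingests the data and engineers the features. It is implemented as the ts_spark_ch10_dlt_features DLT pipeline, shown in Figure 10.5.

Figure 10.5: Feature engineering pipeline

You can view and zoom into a digital version of Figure 10.5 here: packt.link/D9OXb

To set up the DLT pipeline required for this chapter, follow the instructions in the following link to create the pipeline:

docs.databricks.com/en/delta-live-tables/configure-pipeline.html#configure-a-new-delta-live-tables-pipeline

Note that you need to create the ts_spark catalog before you can set up the DLT pipeline. Refer to the following instructions to create the ts_spark catalog with Catalog Explorer: docs.databricks.com/aws/en/catalogs/create-catalog?language=Catalog%C2%A0Explorer

When creating the pipeline, refer to the configuration in Table 10.2, replacing <USER_LOGIN> with your own Databricks user login.

Pipeline	`ts_spark_ch10_dlt_features`
General	Pipeline name
	Serverless
	Pipeline mode
Source code	Path (notebook)
Destination	Storage options
	Default catalog / Default schema

Table 10.2: DLT configuration – ts_spark_ch10_dlt_features

The code for this pipeline task is in the ts_spark_ch10_dlt_features notebook and includes the following steps:

1.  Read the historical data from the files in the vol01_hist volume with Auto Loader, check the data, and store it in the raw_hist_power_consumption streaming table.

Auto Loader

Databricks Auto Loader, referred to as cloudfiles in the code, efficiently and incrementally ingests new data files as they arrive in a cloud storage location. You can find out more about Auto Loader at the following link: docs.databricks.com/en/ingestion/cloud-object-storage/auto-loader/index.html.

Data quality checks

Databricks DLT can include data quality checks, which validate the data in the pipeline against quality constraints to ensure its integrity. You can find out more about data quality checks in DLT at the following link: docs.databricks.com/en/delta-live-tables/expectations.html.

2.  Read the update data from the files in the vol01_upd volume with Auto Loader, check the data, and store it in the raw_upd_power_consumption streaming table.
3.  Read the raw historical data from the raw_hist_power_consumption streaming table, transform it, and store the result in the curated_hist_power_consumption streaming table.
4.  Read the raw update data from the raw_upd_power_consumption streaming table, transform it, and store the result in the curated_upd_power_consumption streaming table.
5.  Append the data from the curated_hist_power_consumption and curated_upd_power_consumption streaming tables, and store the combined result in the curated_all_power_consumption streaming table.
6.  Read the curated data from the curated_all_power_consumption streaming table and use Tempo to calculate the features_aggr_power_consumption materialized view.

Tempo

Databricks Tempo is an open source project that simplifies working with time series data in Apache Spark. You can find out more about Tempo at the following link: databrickslabs.github.io/tempo/.

7.  Read the aggregated data from the features_aggr_power_consumption materialized view and use Tempo to do an AsOf join with the curated_all_power_consumption streaming table. Then, store the result in the features_gnlr_power_consumption materialized view.

These steps correspond to the data transformation stages of the medallion (bronze, silver, and gold) approach, which was discussed in the Data processing and storage section of Chapter 4.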
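
The AsOf join in the last pipeline step may be unfamiliar, so here is a minimal plain-Python sketch of the general idea: each row on the left is matched with the most recent row on the right whose timestamp is not later than its own. This is a conceptual illustration only and not Tempo's implementation; the function name `asof_join` and the sample values are made up.

```python
from bisect import bisect_right


def asof_join(left, right):
    """For each left row, attach the latest right row with ts <= left ts.

    Both inputs are lists of (ts, value) tuples sorted by ts.
    Left rows with no earlier right row get None.
    """
    right_ts = [ts for ts, _ in right]
    joined = []
    for ts, value in left:
        idx = bisect_right(right_ts, ts) - 1  # last right row at or before ts
        right_value = right[idx][1] if idx >= 0 else None
        joined.append((ts, value, right_value))
    return joined


# Made-up readings (minute offset, value) and hourly aggregates
readings = [(5, 1.2), (65, 0.8), (130, 1.5)]
hourly = [(60, 1.0), (120, 1.1)]

for row in asof_join(readings, hourly):
    print(row)
```

Unlike an equality join, the timestamps on the two sides do not need to line up exactly, which is why this kind of join is common when combining raw readings with features computed at a coarser granularity.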

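model_training

This task trains a Prophet model using the features calculated in the previous dlt_features task. The code for model_training is in the ts_spark_ch10_model_training notebook. The steps are as follows:

1.  Read the features from features_aggr_power_consumption.
2.  Rename the Date column to ds and the hourly_Global_active_power column to y. These column names are required by Prophet.
3.  Start an MLflow run to track the training.
4.  Fit the Prophet model to the dataset.
5.  Register the model to Unity Catalog and set its alias to Champion.

Note that this notebook shows a simplified model training, which is sufficient to illustrate the training step in this chapter's example. It does not include the full model experimentation process and hyperparameter tuning, which were discussed in Chapter 7.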

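Training with AutoML

Another approach to model training is to use Databricks AutoML to find the best model for a given dataset.

AutoML is a Databricks feature that automates the process of developing machine learning models. It covers tasks such as data profiling, feature engineering, model selection, and hyperparameter tuning. This lets users quickly generate baseline models for regression, classification, and forecasting problems. With its "glass box" approach, AutoML provides the underlying code for each model, unlike "black box" approaches that do not show the code details. AutoML can be used from the UI, as shown in Figure 10.6, or programmatically, as in the example provided in this chapter.

Figure 10.6: Databricks AutoML

You can find more information on AutoML here:

www.databricks.com/product/automl.

The ts-spark_ch10_1b_ingest_and_train_automl job is an example of how to include AutoML programmatically in a training task. The code for this task is in the ts_spark_ch10_model_training_automl notebook. The steps are as follows:

1.  Read the features from features_aggr_power_consumption.
2.  Call the databricks.automl.forecast function, which takes care of renaming the columns, starting an MLflow run to track the training, and finding the best forecasting model according to the specified primary_metric (mdape in this example).
3.  Register the model to Unity Catalog and set its alias to Champion.

The configuration of the ts_spark_ch10_1b_ingest_and_train_automl job is shown in Table 10.3.

Job (optional)	`ts_spark_ch10_1b_ingest_and_train_automl`
Task 1	Task name
	Type
	Source
	Path (notebook)
	Compute
Task 2	Task name
	Type
	Pipeline
	Trigger a full refresh on the pipeline
	Depends on
Task 3	Task name
	Type
	Source
	Path (notebook)
	Compute
	Depends on

Table 10.3: Job configuration – ts_spark_ch10_1b_ingest_and_train_automl

Note that, in addition to requiring fewer steps than the previous training approach without AutoML, this approach also finds the best model for us.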

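Ingest and forecast

The ts-spark_ch10_2b_ingest_and_forecast job, shown in Figure 10.3, is detailed in this section.

The configuration of the ts_spark_ch10_2b_ingest_and_forecast job is shown in Table 10.4.

Job	`ts_spark_ch10_2b_ingest_and_forecast`
	Job parameters
Task 1	Task name
	Type
	Pipeline
	Trigger a full refresh on the pipeline
Task 2	Task name
	Type
	Source
	Path (notebook)
	Compute
	Depends on
Task 3	Task name
	Type
	Source
	Path (notebook)
	Compute
	Depends on
Task 4	Task name
	Type
	Source
	Path (notebook)
	Compute
	Depends on
Task 5	Task name
	Type
	Source
	Path (notebook)
	Compute
	Depends on

Table 10.4: Job configuration – ts_spark_ch10_2b_ingest_and_forecast

dlt_features

This task is the same ts_spark_ch10_dlt_features DLT pipeline used in the earlier Ingest and train section and shown in Figure 10.5, except that this time we call the pipeline to process the new data files from the vol01_upd volume.

update_model

This task trains a Prophet model using the features calculated in the previous dlt_features task. The code for update_model is in the ts_spark_ch10_update_model notebook. The task is similar to the one discussed in the model_training section, the only difference being that we now have new data to include in the training. The steps are as follows:

1.  Read the features from features_aggr_power_consumption.
2.  Rename the Date column to ds and the hourly_Global_active_power column to y. These column names are required by Prophet.
3.  Fit the Prophet model to the dataset.
4.  Register the model to Unity Catalog and set its alias to Champion.

With the latest model updated, we can use it for forecasting next.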

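generate_forecast

This task uses the previously trained model to generate and store the forecast. The code for generate_forecast is in the ts_spark_ch10_generate_forecast notebook. The steps are as follows:

1.  Load the Champion model from Unity Catalog.
2.  Generate the forecast for the next 24 hours.
3.  Store the forecast in the forecast table, together with the name and version of the model.

With the forecast generated, we can compare the forecasted time period with the actual data, which we will get next.

update_data

This task simply copies the data file for the new time period from the vol01_upd_src volume to vol01_upd. The code for update_data is in the ts_spark_ch10_update_data notebook.

evaluate_forecast

This task calculates and stores the forecast accuracy metrics. The code for evaluate_forecast is in the ts_spark_ch10_evaluate_forecast notebook. The steps are as follows:

1.  Join the features_aggr_power_consumption table of actuals with the forecast table created earlier.
2.  Calculate the mdape metric.
3.  Store the calculated metric in the forecast_metrics table, together with the name and version of the model.
4.  Store the data quality check results in the dq_results table.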
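
For reference, the mdape metric (median absolute percentage error) used in the second step can be sketched in plain Python as follows. This is a minimal illustration under assumptions: the function below and its handling of zero actuals (they are skipped) are choices made for this sketch, the sample values are made up, and the notebook may rely on a library implementation instead.

```python
from statistics import median


def mdape(actuals, forecasts):
    """Median absolute percentage error, as a fraction (0.1 means 10%)."""
    errors = [
        abs((actual - forecast) / actual)
        for actual, forecast in zip(actuals, forecasts)
        if actual != 0  # assumption: skip zero actuals to avoid dividing by zero
    ]
    return median(errors)


# Made-up actuals and forecasts, for illustration only
actuals = [100.0, 200.0, 50.0, 80.0]
forecasts = [110.0, 180.0, 50.0, 100.0]
print(mdape(actuals, forecasts))
```

Because it takes the median rather than the mean of the percentage errors, the metric is less sensitive to a few badly forecasted points than a mean-based percentage error would be.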

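With the forecast evaluated, we can report on the results and metrics. We will cover this in the user interface section. Before we get there, let's detail how to orchestrate multiple iterations of new data arriving and being processed.

Update iteration

The ts-spark_ch10_2a_update_iteration job, shown in Figure 10.4, simulates the real-world situation in which we process new data at regular intervals, such as daily or weekly. It calls the ts-spark_ch10_2b_ingest_and_forecast job seven times, corresponding to one week of daily new data. Each call triggers the end-to-end processing of a new data file, as described in the earlier Ingest and forecast section.

The configuration of the ts_spark_ch10_2a_update_iterations job is shown in Table 10.5.

Job	`ts_spark_ch10_2a_update_iterations`
Task 1	Task name
	Type
	Inputs
Task 2 (add a task to loop over)	Task name
	Type
	Job
	Job parameters

Table 10.5: Job configuration – ts_spark_ch10_2a_update_iterations

Launching the jobs

With the jobs configured and explained, we will now launch them to execute the code for this chapter. For more information on running jobs, see the following:

docs.databricks.com/en/jobs/run-now.html

Proceed in the following order:

1.  Click on ts-spark_ch10_1a_ingest_and_train to run it. Wait for the job to complete.
2.  Click on ts-spark_ch10_2a_update_iteration to run it.

Once the jobs are launched and running, we can view their status, as explained in detail in the next section.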

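Monitoring, security, and governance

As we discussed in the From DataOps to ModelOps to DevOps section of Chapter 4 and the Governance and security section of Chapter 9, a key requirement for production environments and workloads involving sensitive data is to have proper monitoring, security, and governance in place. This is greatly facilitated by the built-in features of a managed platform such as Databricks with Unity Catalog. The alternative, developing and testing our own custom platform, would require considerable time and effort to meet these requirements robustly.

Monitoring

Jobs can be monitored from their runs view, as illustrated here for the ts-spark_ch10_2b_ingest_and_forecast job. We can see the different runs with their parameters, duration, status, and other information that is useful for monitoring.

Figure 10.7: Databricks Workflows – Jobs – Runs

The ts_spark_ch10_dlt_features DLT pipeline can be monitored from the Workflows > Pipelines page, as shown in Figure 10.8. We can see the different stages, data checks, duration, status, and other information that is useful for monitoring.

Figure 10.8: Databricks DLT pipeline

You can find more information on observability, monitoring, and alerting here:

Security

As shown in Figure 10.9, setting access permissions on tables and other objects with Unity Catalog takes only a few clicks.

Figure 10.9: Databricks Unity Catalog – setting permissions

It is also possible to define finer-grained access control within a table, down to the row or column level, as per the following resource:

www.databricks.com/resources/demos/videos/governance/access-controls-with-unity-catalog

You can find more information on security here:

docs.databricks.com/en/security/index.html

Governance

An important consideration for governance is the ability to trace the lineage of data assets, as shown in Figure 10.10. We can see where the data comes from, the multiple intermediate stages, and the final table in which the data is stored. Unity Catalog tracks this automatically in Databricks, which lets us monitor the data flows in real time.

Figure 10.10: Databricks Unity Catalog – lineage view

You can find and zoom into a digital version of Figure 10.10 here:

https://packt.link/D6DyC

We have only briefly touched on governance and security with Databricks Unity Catalog. You can find more information here:

www.databricks.com/product/unity-catalog

Having seen how to use a platform such as Databricks for monitoring, security, and governance, we will move on to how to present the outcome of the time series analysis.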

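Databricks user interface: AI/BI dashboards

To present the outcome of the time series analysis we have done so far, Databricks offers several user interface options, including AI/BI dashboards, Genie spaces, AI-based chatbots, and Lakehouse Apps. We will cover AI/BI dashboards in this section and the other options in the next chapter.

Throughout this book, we have made extensive use of graphs to represent data and analysis outcomes. This required us to execute code in notebooks to create the graphs. This works well when we can write code and have an execution environment. When writing code is not an option, however, a common way to present data and analysis outcomes is with reporting dashboards. Databricks AI/BI dashboards provide this capability, as shown in Figure 10.11.

Databricks AI/BI dashboards are a solution integrated into the Databricks platform for creating reports and dashboards. They come with AI-driven features that help generate queries and data visualizations. Dashboards can be published and shared for others to use.

Figure 10.11: Databricks AI/BI dashboard

To install this dashboard in your own environment, first download it from the following location:

github.com/PacktPublishing/Time-Series-Analysis-with-Spark/blob/main/ch10/ts_spark_ch10.lvdash.json

You can then follow the instructions here to import the dashboard file into your own environment:

docs.databricks.com/en/dashboards/index.html#import-a-dashboard-file

Note

You will need a SQL warehouse to run the dashboard. Refer to the following instructions to create a SQL warehouse:

docs.databricks.com/aws/en/compute/sql-warehouse/create

In this dashboard, we bring together the following in a single view:

*   A graph of actuals versus forecasts
*   The number of records passing and failing the data quality checks
*   The metrics for the different model versions

You can find more information on AI/BI dashboards at the following link:

Summary

With an end-to-end example of time series analysis on a managed Spark platform, this chapter showed how to go further with Apache Spark by using the out-of-the-box features of Databricks. We started with data ingestion through a streaming pipeline, moved on to feature engineering and model training, and finished with inference and reporting, while ensuring that monitoring, security, and governance were in place. By combining the pre-built features of Databricks with our own custom code, we implemented a solution that can be extended to more use cases.

This leads us to the final chapter, in which we will expand on some recent developments in time series analysis.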


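Chapter 11: Recent Developments in Time Series Analysis

As we reach the final chapter of this book, let's briefly review the journey so far. Starting with an introduction to time series and their components in Chapter 1, we looked at the different use cases for time series analysis in Chapter 2. We then introduced Apache Spark, its architecture, and how it works in Chapter 3. Before diving into how Apache Spark is used for time series analysis, we reviewed the overall framework of an end-to-end time series project in Chapter 4. We then turned our focus to the main stages of a project, from Chapter 5 to Chapter 9, covering data preparation, exploratory data analysis, model development, testing, scaling, and going to production. In Chapter 10, we discussed how to go further with Apache Spark by using a managed data and AI platform such as Databricks.

In this concluding chapter, we will explore recent developments in the field of time series analysis, covering emerging methodologies, tools, and trends. We will introduce an approach to time series forecasting that comes from the field of generative AI. Having a forecasting mechanism is great, but it is not enough. Another interesting development is how forecasts can be served on demand through APIs to data analysts and applications. End users can also benefit from new approaches that make the outcome of time series analysis accessible to them in a non-technical way.

In this chapter, we will cover the following main topics:

*   Generative AI for time series
*   Serving forecasts via API
*   Democratizing access to time series analysis

Technical requirements

We will use a Databricks environment as the platform infrastructure. To set up the environment, follow the environment setup instructions in Chapter 10.

The code for this chapter can be found at this URL:

github.com/PacktPublishing/Time-Series-Analysis-with-Spark/tree/main/ch11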

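Generative AI for time series analysis

While traditional time series models are effective, they have limitations in performance and accuracy when dealing with large-scale data or complex patterns.

Generative AI, and more specifically Time Series Transformers (TSTs), offers a solution to these challenges. Similar to the transformer models used in Natural Language Processing (NLP), TSTs are good at capturing complex, non-linear dependencies over long sequences. This makes them suitable for real-world data with missing values, seasonality, and irregular patterns. TSTs use self-attention mechanisms to analyze time series data and identify seasonal patterns. These models are pre-trained on vast datasets to create foundation models, which can then be fine-tuned for specific time series applications.

Several pre-built TSTs have been released recently, which lets us benefit from their capabilities without the effort of engineering these solutions ourselves. Examples include Chronos, Moirai, TimesFM, and TimeGPT, among others.

In the next section, we will look at how to use one of them, TimesFM.

Introduction to TimesFM

TimesFM, short for Time Series Foundation Model, is an open source forecasting model developed by Google Research and designed specifically for time series data. Built on a transformer-based architecture, TimesFM is versatile and can handle a variety of tasks, from short-term to long-term forecasting. Unlike models such as Chronos, which treat time series in a way similar to natural language, TimesFM includes mechanisms specialized for time series data, such as seasonality handling, support for missing values, and capturing multivariate dependencies.

Pre-trained on more than 100 billion real-world time series points, TimesFM generalizes effectively to new datasets and often provides accurate zero-shot forecasts without additional training. This extensive pre-training enables TimesFM to recognize both short-term and long-term dependencies in time series data, which makes it well suited to applications that require an understanding of seasonal patterns and trends.

For an overview and detailed explanation of the TimesFM architecture, we recommend the original research paper, A decoder-only foundation model for time-series forecasting, available here:

research.google/blog/a-decoder-only-foundation-model-for-time-series-forecasting/

We will see TimesFM in action with a forecasting example in the next section.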

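Forecasting

For the time series forecasting example in this section, we will use the Databricks environment set up in the Technical requirements section. The code for this section can be uploaded to the Databricks workspace from the following URL:

github.com/PacktPublishing/Time-Series-Analysis-with-Spark/raw/main/ch11/ts_spark_ch11_timesFM.dbc

You can use Databricks serverless compute to execute the code, as we did in Chapter 10. Alternatively, you can use Databricks Runtime for ML. In this case, version 14.3 must be used, due to the Python version requirements of TimesFM at the time of writing.

We will go through the steps to install and use TimesFM with code extracts here. Refer to the notebook for the full code:

1.  Install the required libraries: timesfm[torch], torch, and sktime.
2.  Specify the hyperparameters (hparams) and load the TimesFM model from the Hugging Face checkpoint (huggingface_repo_id). Note that 500m refers to the 500 million parameters of the model, and that we use the pytorch version for compatibility with Databricks: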
```
# Initialize the TimesFm model with specified hyperparameters 
# and load from checkpoint
model = timesfm.TimesFm(
    hparams = timesfm.TimesFmHparams(
        backend="gpu",
        per_core_batch_size=32,
        horizon_len=128,
        num_layers=50,
        use_positional_embedding=False,
        context_len=2048,
    ),
    checkpoint = timesfm.TimesFmCheckpoint(
        huggingface_repo_id="google/timesfm-2.0-500m-pytorch"
    )
)
```
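While we have used the default values for the hyperparameters, you will need to experiment to find the best hyperparameters for your forecasting requirements.

With the TimesFM model loaded, we can bring in the dataset to forecast. We will reuse the energy consumption dataset from Chapter 10. Before proceeding to the next step, you must have executed the code example from the previous chapter, including the feature engineering pipeline.

We read the data from the features_aggr_power_consumption table and convert the Spark DataFrame to a pandas DataFrame, as required by TimesFM. The Date column is renamed to date and converted to the datetime format expected by the model: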

```
# Define catalog, schema, and table names
CATALOG_NAME = "ts_spark"
SCHEMA_NAME = "ch10"
ACTUALS_TABLE_NAME = f"{CATALOG_NAME}.{SCHEMA_NAME}.features_aggr_power_consumption"
FORECAST_TABLE_NAME = f"{CATALOG_NAME}.{SCHEMA_NAME}.forecast"
# Load data from the actuals table into a Spark DataFrame
sdf = spark.sql(f"""
    SELECT * FROM {ACTUALS_TABLE_NAME}
""")
# Convert Spark DataFrame to Pandas DataFrame
df = sdf.toPandas()
# Convert 'Date' column to datetime format
df['date'] = pd.to_datetime(df['Date'])
```

The batches of inputs and outputs are then created by the get_batched_data_fn function. The important part, shown in the following code extract, is the mapping of inputs and outputs. The code is based on an adaptation of an example from TimesFM ([`github.com/google-research/timesfm/blob/master/notebooks/covariates.ipynb`](https://github.com/google-research/timesfm/blob/master/notebooks/covariates.ipynb)):

```
# Function to create a batched data pipeline for time series data
def get_batched_data_fn(
    …
        examples["inputs"].append(
            sub_df["hourly_Global_active_power"][
                start:(context_end := start + context_len)
            ].tolist())
        …
        examples["outputs"].append(
            sub_df["hourly_Global_active_power"][
                context_end:(context_end + horizon_len)
            ].tolist())
    …
```
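
To make the input and output mapping concrete, here is a small self-contained sketch of the windowing idea: each example pairs a context window (the model input) with the horizon window that immediately follows it (the values to compare the forecast against). It is a simplified illustration and not the notebook's function: the name `make_windows`, the non-overlapping stepping, and the toy series are assumptions made for this sketch.

```python
def make_windows(series, context_len, horizon_len):
    """Split a series into (inputs, outputs) windows.

    Assumption for this sketch: windows do not overlap, so each step
    advances by context_len + horizon_len points.
    """
    examples = {"inputs": [], "outputs": []}
    step = context_len + horizon_len
    for start in range(0, len(series) - step + 1, step):
        context_end = start + context_len
        # Context window: what the model sees
        examples["inputs"].append(series[start:context_end])
        # Horizon window: the actuals for the forecasted period
        examples["outputs"].append(
            series[context_end:context_end + horizon_len])
    return examples


# Toy series standing in for hourly_Global_active_power values
windows = make_windows(list(range(10)), context_len=3, horizon_len=2)
print(windows["inputs"])
print(windows["outputs"])
```

Keeping the horizon window strictly after the context window is what makes the later evaluation fair, since the model never sees the values it is asked to forecast.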

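We can then iterate over the batches of input data and generate the forecasts with the forecast function, as in the following code extract:

```
# Iterate over the batches of data
for i, example in enumerate(input_data()):
    # Generate raw forecast using the model
    raw_forecast, _ = model.forecast(
        inputs=example["inputs"],
        freq=[0] * len(example["inputs"])
    )
```

We use the mdape metric to evaluate the forecast, as we did in the previous chapters. This is similar to Chapter 10:

```
# Calculate and store the evaluation metric for the forecast
metrics["eval_mdape_timesfm"].extend([
    mdape(
        pd.DataFrame(raw_forecast[:, :horizon_len]),
        pd.DataFrame(example["outputs"])
    )
])
```

This gives the following result:

```
eval_mdape_timesfm: 0.36983413500008916
```

As we have seen in this section, a pre-trained transformer-based model such as TimesFM, used with its default hyperparameters, gives an accuracy comparable to the different approaches we used in previous chapters. With hyperparameter tuning and the use of covariates, discussed next, we can improve the accuracy further.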

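Covariate support

An important feature of TimesFM is its support for external covariates, as time series rarely occur in isolation. Many factors, such as economic indicators or weather conditions, can be correlated with a time series, and including them in the analysis can improve the accuracy of the forecast. Simply put, a covariate is a separate variable that helps us forecast the time series.

TimesFM supports both univariate and multivariate forecasting with covariates, which lets it capture the correlations between the target series and these external variables. By taking covariates as parallel series in the input, the model can learn their relationship with future values, which improves its suitability for real-world scenarios where external factors significantly influence the outcome. For example, we can forecast pollution levels with the help of estimates of road traffic. This covariate support gives TimesFM a forecasting advantage over traditional time series models and over other foundation models that do not include such variables.

You can find more information and examples of covariate support here:

community.databricks.com/t5/technical-blog/genai-for-time-series-analysis-with-timesfm/ba-p/95507

Other generative AI models and many-model forecasting

You can test other generative models to find the one that works best for your use case. One way to do this is with the Many Model Forecasting (MMF) solution accelerator from Databricks. This accelerator provides a solution for organizations that need to create forecasts over many time series, such as for sales, demand, or inventory forecasting. The repository offers a scalable approach to deploying and managing multiple forecasting models at the same time on Databricks. It includes resources such as notebooks, model templates, and data pipelines, which simplify training, evaluating, and deploying time series models at scale.

With generative AI and MMF now part of our time series analysis toolkit, let's explore how to make the forecasting outcome more readily available to applications and data analysts.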

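Serving forecasts via API

Most of this book has focused on preparing and analyzing time series datasets. We have also covered how to present the outcome of the analysis as tables and graphs in notebooks and reporting dashboards. In many cases, however, forecasts must be provided on demand to data analysts and applications. We will now explore how to achieve this.

Simplifying forecasting with ai_forecast

In this scenario, a data analyst has access to time series data and wants to use it as input to get a forecast, without first having to develop a model. Abstracting the forecasting capability behind the ai_forecast function on the Databricks platform greatly simplifies forecasting for users who have no knowledge of forecasting models and algorithms.

You can see a simple example at the following URL:

github.com/PacktPublishing/Time-Series-Analysis-with-Spark/raw/main/ch11/ts_spark_ch11_aiforecast.dbc

The following code is based on the example in the documentation, with the link provided at the end of this section:

```
SELECT *
FROM AI_FORECAST(
    TABLE(aggregated),
    horizon => '2016-03-31',
    time_col => 'ds',
    value_col => 'revenue'
)
```

The output of running this example is shown in Figure 11.1.

Figure 11.1: ai_forecast example output

You can find and zoom into a digital version of Figure 11.1 here:

https://packt.link/vg87q

Note that, at the time of writing, this feature is in public preview, so you may need to request access from Databricks to try it out.

While a simplified, predefined function such as ai_forecast is a good way to get a forecast quickly, we may also want to make our own custom-developed forecasting models easily accessible to other applications, which we will cover next.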

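Model serving

In some cases, we need to get the model's forecast programmatically from another application. For this kind of application-to-application integration, using a REST API is common practice. One way to provide a REST API interface is with Databricks Model Serving.

Databricks Model Serving provides the capability to deploy, manage, and query ML and AI models, for both real-time and batch inference. Deployed models can be accessed through a REST API, which allows them to be integrated into web or client applications. Several model types are supported, including custom Python models packaged in the MLflow format and the open foundation models provided. The service is designed for high availability and low latency, and it scales automatically in response to changes in demand.

Here is an overview of the steps to serve a model. Note that this is not a hands-on example. The screenshots shown here are only meant to illustrate the steps:

1.  Access the model in Unity Catalog, as per Figure 11.2, and click on the Serve this model button at the top right.

Figure 11.2: Model in Unity Catalog

2.  Create a serving endpoint, as per Figure 11.3. The URL to access the REST API to invoke the model is displayed at this point.

Figure 11.3: Creating a serving endpoint

3.  When creating the serving endpoint, we can enable inference tables, as shown in Figure 11.4, to store all the inputs and outputs of the interactions with the model's REST API.

Figure 11.4: Inference tables

4.  Once created, the serving endpoint is shown in the Ready state, as per Figure 11.5, and can be used.

Figure 11.5: Serving endpoint ready

5.  While the serving endpoint is in use, its metrics can be monitored, as per Figure 11.6.

Figure 11.6: Serving endpoint metrics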
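
Once an endpoint is ready, a client application calls it with an authenticated HTTP POST. The following sketch builds such a request with the Python standard library without sending it. Treat it as an illustration under assumptions: the workspace URL, the endpoint name `ts-forecast`, the token placeholder, and the `ds` input column are hypothetical, and the input schema expected by your model may differ. The `/serving-endpoints/<name>/invocations` path and the `dataframe_records` payload key follow the Databricks Model Serving request format.

```python
import json
import urllib.request


def build_invocation_request(workspace_url, endpoint_name, token, records):
    """Build (without sending) a POST request to a model serving endpoint."""
    url = f"{workspace_url}/serving-endpoints/{endpoint_name}/invocations"
    body = json.dumps({"dataframe_records": records}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_invocation_request(
    workspace_url="https://example.cloud.databricks.com",  # placeholder URL
    endpoint_name="ts-forecast",                            # hypothetical name
    token="<YOUR_TOKEN>",                                   # placeholder token
    records=[{"ds": "2024-01-01 00:00:00"}],                # hypothetical input
)
print(request.get_method(), request.full_url)
# To send it for real: urllib.request.urlopen(request)
```

Building the request separately from sending it keeps the sketch runnable without a live endpoint and makes the payload easy to inspect before it goes out.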

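You can find more information on model serving here:

docs.databricks.com/en/machine-learning/serve-models.html

As we have seen in this section, exposing our time series analysis models through a REST API makes it easier to integrate the analysis with other applications. Continuing on the topic of making time series analysis accessible, we will look next at how to simplify access for end users.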

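Democratizing access to time series analysis

In this section, we will explore how innovative ways of accessing the outcome of time series analysis can benefit non-technical users. This allows us to democratize time series analysis and bring it to a wider audience.

Genie spaces

In the first approach, we will use a natural language, chatbot-like interface on Databricks called Genie spaces.

A Databricks Genie space is a conversational UI that lets business users ask questions in natural language and get analytical insights, without needing technical expertise. This works by configuring the Genie space with the relevant datasets, sample queries, and instructions. Users can then interact with the system in natural language, asking questions about the data and requesting visualizations. Genie uses annotated table and column metadata to translate the user queries into SQL statements. These statements are used to query the data so that Genie can provide a response to the user.

To put this into practice, we will use the dashboard created in Chapter 10, as shown in Figure 11.7. This is one way to access Genie: on the dashboard, we can click on the Ask Genie button at the top left. This opens the chatbot interface at the bottom right, where we can start typing questions in natural language. Alternatively, we can choose to open the Genie space in full-screen mode.

Figure 11.7: Accessing a Genie space from the dashboard

In Figure 11.8, we can see the full Genie space, including the query, the result, and the generated SQL used to get the result.

Figure 11.8: Genie space query and result

The query in this example is to show the forecast versus the actuals, which can also be requested as a visualization, as shown in Figure 11.9.

Figure 11.9: Genie space visualization

You can find more information on Databricks Genie spaces here: docs.databricks.com/en/genie/index.html

Apps

When more application-like interactivity is required, a dashboard or chatbot interface is not enough to meet user needs. Databricks Apps provides a platform to build and deploy applications directly within the Databricks environment. Currently in public preview, Databricks Apps supports development frameworks such as Dash, Shiny, Gradio, Streamlit, and Flask to create data visualizations, AI applications, self-service analytics, and other data applications.

You can find more information on Databricks Apps here: www.databricks.com/blog/introducing-databricks-apps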

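Summary

In this final chapter, we delved into recent developments in time series analysis, focusing on emerging methodologies, tools, and trends. We tried out generative AI as an innovative approach at the cutting edge of time series forecasting. In response to the growing demand for forecasts delivered through APIs, we explored how to serve forecasts on demand to data analysts and applications. Finally, we used AI chatbots and Databricks Apps to make time series analysis accessible to non-technical users.

As we reach the end of this book and look back at our journey and the skills gained, we have built a solid foundation across the multiple stages of a time series analysis project with Apache Spark and other components. With the many use cases discussed in Chapter 2, the practical skills gained throughout this book, and the recent developments in this chapter, we have what we need to successfully implement production-ready, scalable, and future-proof time series analysis projects.

We started this book with Pericles' wise advice on the importance of time. Now, at its end, we are equipped to uncover the valuable insights hidden in time series and put them to use. May this knowledge enable you to tackle challenges with new ideas and confidence. We wish you success in your learning journey with time series analysis and Apache Spark!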


Posted 2025-07-16 12:31 by 绝不原创的飞龙
