
DevOps 2.2 工具集(全)

原文:annas-archive.org/md5/d241b9e4933a3f913b0efcc97ebc54da

译者:飞龙

协议:CC BY-NC-SA 4.0

前言

看起来每一本新书的范围都变得越来越模糊,不再那么精确。当我开始写作《测试驱动的 Java 开发》时,整本书的范围是提前就确定好的。我有一个团队与我合作。我们定义了目录和每一章的简短描述。从那时起,我们按照计划进行工作,就像大多数技术作者一样。然后我开始写第二本书。书的范围变得更加模糊。我想写的是 DevOps 实践和流程,但我当时对最终结果只有一个非常宽泛的想法。我知道 Docker 必须包含在其中。我知道配置管理是必须的。微服务、集中日志记录,以及我在项目中使用的其他一些实践和工具,都是最初的范围之一。对于那本书,我没有任何人支持。没有团队,只有我、许多披萨、未知数量的红牛罐子和无数个不眠之夜。最终的成果是《DevOps 2.0 工具包:使用容器化微服务自动化持续部署管道》。到了第三本书,最初的范围变得更加模糊。我开始写作时没有任何计划。最初的主题是集群管理。几个月后,我参加了在西雅图举行的 DockerCon 大会,在会上展示了新的 Docker Swarm 模式。我当时的反应是:把我写的东西全部扔掉,重新开始。我不知道这本书到底会讲什么,只知道它必须和 Docker Swarm 有关。我对新的设计印象深刻。关于 Swarm 的内容最终成为了《DevOps 2.1 工具包:Docker Swarm:在 Docker Swarm 集群中构建、测试、部署和监控服务》。在写作过程中,我决定做一个 DevOps 工具包系列。我认为记录我从不同实验中获得的经验,或者与各种公司和开源项目合作的经历,应该会很有意思。所以,自然而然地,我开始思考并规划这个系列的第三本书:DevOps 工具包 2.2。唯一的问题是,这次,我完全不知道这本书会讲什么。有一个想法是深入比较不同的调度器(例如 Docker Swarm、Kubernetes 和 Mesos/Marathon)。另一个想法是探索无服务器架构。尽管这个名字很糟糕(其实有服务器,只是我们不去管理它们),但它是一个很好的主题。各种想法不断涌现,但没有一个明显的最终方向。所以,我决定不定义具体的范围,而是设定一些总体目标。

我设定的目标是构建一个基于 Docker 的自适应自愈系统。唯一的问题是,我还不知道怎么做。我使用过不同的实践和工具,但隧道尽头并没有明确可见的光亮。与其定义这本书会是什么,我更定义了我想要实现的目标。你可以把这本书看作是我记录旅程的日志。我将需要探索很多,可能还需要采用一些新的工具,自己写些代码。我现在还不知道,也许它最终会变成完全不同的东西,根本不会有自适应和自愈系统。我们拭目以待。把这本书当作是“Viktor 在尝试做事时的日记”。

所以,目前的目标是超越简单的集群设置、服务、持续部署以及你可能已经知道的其他内容。如果你不知道,可以读读我的旧书。我目前还不知道具体的范围,也不知道结果会是什么。通常,当你写一本书时,你会从大纲和索引开始,一章一章地写,最后才写前言。这让我们(作者)看起来聪明而且掌控全局。但情况并非如此。我并没有在过程结束时写前言(正如编辑所建议的那样)。我想对你们保持诚实。我没有计划。

你们已经被警告了!我不知道这本书会走向何方,也不知道我是否能完成我自定的目标。我会尽力以我探索自适应和自愈系统的方式,概述前进的步骤。

第一章:概述

本书不会教你 DevOps 实践。它不会展示 Docker 是如何工作的,也不会探讨如何构建镜像、部署服务、操作 Swarm 集群,或者如何进行持续部署。我们不会开发微服务,也不会讨论允许我们创建和管理基础设施的实践和工具。本书假设你已经了解这些内容。如果你不了解,请阅读《The DevOps 2.0 Toolkit: Automating the Continuous Deployment Pipeline with Containerized Microservices》以获取 DevOps 工具和实践的概述,并阅读《The DevOps 2.1 Toolkit: Docker Swarm: Building, testing, deploying, and monitoring services inside Docker Swarm clusters》以深入了解 Docker Swarm 集群的工作原理。

既然你知道本书不涉及哪些内容,你可能会想知道它到底讲了什么。嗯……我现在还不知道。我决定跳过规划,直接开始编写代码和写下超越简单集群管理和服务部署的解决方案。目标是创建一个自适应和自愈的系统。就这些,我现在知道的就是这些。我不确定我将如何做到,也不确定是否能够成功。我所知道的是,我会记录下这段旅程的每一步。

尽管本书中有大量理论内容,但它是一本实践书籍。你无法通过在地铁上阅读它来完成。你必须在电脑前动手操作,才能阅读这本书。最终,你可能会卡住,需要帮助。或者,你可能想写一篇评论或对书中的内容发表评论。请加入DevOps20 Slack 频道,发布你的想法,提问,或者简单地参与讨论。如果你更喜欢一对一的交流,可以使用 Slack 给我发私信,或发送邮件至 viktor@farcic.com。所有我写的书对我来说都非常重要,我希望你在阅读它们时有一个愉快的体验。体验的一部分是你可以联系我。不要害羞。

请注意,这本书和之前的书一样,是自出版的。我相信,作者和读者之间没有中介是最好的方式。这让我能够更快地写作、更频繁地更新书籍,并与您进行更直接的沟通。你的反馈是过程的一部分。无论你是在书籍只有几章或是全部章节完成时购买的,理念是这本书永远不会真正完成。随着时间的推移,它将需要更新,以便与技术或流程的变化保持一致。当可能时,我会尽力保持其最新,并在合适的时候发布更新。最终,事情可能会发生变化,更新不再是一个好的选择,那时就意味着需要一本全新的书了。只要你继续支持我,我将继续写作。

第二章:目标读者

不确定。可能是你。

第三章:关于作者

Viktor Farcic 是CloudBees的高级顾问,Docker Captains小组成员,同时也是书籍作者。

他使用过多种编程语言,从 Pascal(是的,他很老)开始,接着是 Basic(在它有了 Visual 前缀之前),ASP(在它有了 .Net 后缀之前),C,C++,Perl,Python,ASP.Net,Visual Basic,C#,JavaScript,Java,Scala 等等。他从未接触过 Fortran。他现在最喜欢的是 Go。

他的主要兴趣是微服务、持续部署和测试驱动开发(TDD)。

他经常在社区聚会和会议上发言。

他编写了《The DevOps 2.0 Toolkit: Automating the Continuous Deployment Pipeline with Containerized Microservices》、《The DevOps 2.1 Toolkit: Docker Swarm: Building, testing, deploying, and monitoring services inside Docker Swarm clusters》和《Test-Driven Java Development》。

他的随机思考和教程可以在他的博客TechnologyConversations.com找到。

第四章:献词

我注意到,作者有时会感谢那些他们从未得到过帮助的名人。有时,他们还会发布那些由声名显赫的作者撰写的介绍,而这些作者可能根本没有读过他们的书。背后一定有某种原因,也许有名的名字有助于推广。也许列出这些名人的名字能召唤一些未知的力量。可能存在一种“巫毒”魔法,使得读者与书籍产生某种联系。尽管我怀疑感谢他人会有帮助,但我还是决定这么做。最好不要冒这个风险。

我要感谢以下几位:Larry David,感谢他比我还怪;Uncle Bob,他的书总是在亚马逊上排在我的书前;Donald Trump,感谢他写出了娱乐性的推文;Josip Broz Tito,感谢他没活过 100 岁;Netflix,感谢它在我工作时娱乐我的女儿;David Heinemeier Hansson,感谢他解释了 Ruby On Rails 是无法测试的;Douglas Adams,感谢他允许我在本书的结尾使用他的引用。还有许多人我也想感谢,但出于谦虚,我就不一一列出那些(尚未)成名的名字了。

这本书献给我的女儿 Sara,是她给予我每天早晨醒来并坚持无数工作小时的力量。她放学回家的笑容是我所需要的所有鼓励。献给我的妻子 Eva,没有她永不懈怠的支持,这本书绝对无法完成。

我爱你们,女孩们,胜过世界上任何事。这本书献给你们。

献给咖啡因和糖分,是熬夜写作的必需成分。献给送披萨的小哥,没有他我早就饿死了。

第五章:自适应和自愈系统简介

微服务,微服务,微服务。我们都在将单体架构重写成微服务的过程中。有些人已经做到了。我们将它们放入容器中,并通过调度程序进行部署。我们正朝着一个辉煌的未来迈进。现在没有什么可以阻止我们了。除了……我们作为一个行业,还没有为微服务做好准备。一方面,我们可以设计服务,使其无状态、容错、可扩展,等等。另一方面,我们还需要将这些服务作为一个整体整合进系统中。

除非你刚开始了一个新项目,否则你很可能还没有达到“微服务的涅槃”,并且仍然有许多遗留服务存在。但为了简洁起见,并且为了直接切入正题,我假设你所控制的所有服务确实是微服务。这是否意味着整个系统已经达到了那个涅槃状态?服务的部署(无论是谁编写的)是否完全独立于系统的其余部分?很可能不是。

你们在实践持续部署(CD)吗?我假设你们是。现在,假设你刚刚完成了新服务的第一次发布。这个第一次发布就是你代码仓库中的第一次提交。你选择的 CD 工具检测到代码仓库中的变更,并启动了 CD 管道。管道的最终目标是将服务部署到生产环境中。我能看到你脸上的笑容。那是只有在孩子出生或服务首次部署到生产环境时,才能看到的幸福表情。但这个笑容不应该持续太久,因为部署服务只是个开始。它还需要与整个系统进行集成。代理需要重新配置。日志解析器需要更新以适应新服务生成的格式。监控系统需要识别新服务。还需要创建警报,以便在服务状态达到某些阈值时发送警告和错误通知。整个系统必须适应新服务,并融入我们刚刚提交的更新所引入的新变量。

我们如何调整系统,使其能够考虑到新的服务?我们如何将该服务融入系统的整体架构中?

除非你自己编写所有内容(在这种情况下你一定是 Google),否则你的系统由你自己开发的服务和其他人编写和维护的服务组成。你可能使用了第三方代理(希望是Docker Flow Proxy)。你可能选择了 ELK 堆栈或 Splunk 来进行集中式日志记录。监控呢?也许是 Nagios,或者可能是 Prometheus。不管你做了什么选择,你都无法掌控整个系统的架构。实际上,你甚至可能无法掌控你自己编写的所有服务。

大多数第三方服务并不是为高度动态的集群而设计的。当你部署该服务的第一个版本时,你可能需要手动配置代理。你可能需要为 LogStash 配置添加一些解析规则。你的 Prometheus 目标也必须更新。新的报警规则需要添加,等等。即使所有这些任务已经自动化,持续部署管道也会变得过于庞大,且流程会变得非常脆弱。

我将尝试保持乐观,并假设你已经成功克服了配置所有第三方工具以使它们与新服务无缝协作的难题。接下来将没有时间休息,因为同样的服务(或其他服务)很快就会被更新。有人可能会做出修改,导致需要更高的内存阈值。这意味着例如监控工具需要重新配置。你可能会说这没问题,因为这种情况只是偶尔发生,但那也不是真的。如果我们采用了微服务和持续部署,“偶尔”很可能意味着“频繁提交中的任何一次”。记住,团队很小,而且它们是独立的。影响整个系统的更改可能随时发生,我们需要为此做好准备。

图 1-1: 传统部署,其中服务的真实来源分散在许多不同的地方


第三方服务的一个主要限制是它们依赖于静态配置。以 Prometheus 为例,它可能负责监控你所有的服务、硬件、网络等。它观察的每个目标可能有不同的指标集和不同的条件来触发告警。每次我们想要添加一个新目标时,都需要修改 Prometheus 的配置并重新加载。这意味着为了容错,我们必须将配置文件存储在网络驱动上,并使用某种模板机制,每次新增服务或更新现有服务时更新该配置。因此,我们将部署我们那新奇的服务,更新生成 Prometheus 配置的模板,创建新配置,覆盖存储在网络驱动上的配置,并重新加载 Prometheus。即使如此,这还不够,因为驱动这些模板的数据需要存储在某个地方,这意味着我们需要在服务注册中心注册每个服务(或者使用 Docker 中内置的服务注册),并确保模板解决方案从中读取。
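为了更直观地说明这种负担,下面给出一个极简的示意(配置文件路径、服务名和端口均为假设):每新增一个抓取目标,都要重新生成配置文件并让 Prometheus 重新加载。

```
# 存放在网络驱动上的 Prometheus 配置(路径仅为示意)
cat > /mnt/nfs/prometheus/prometheus.yml <<'EOF'
global:
  scrape_interval: 10s

scrape_configs:
  # 每部署或更新一个服务,都需要追加或修改这样一段配置
  - job_name: "my-fancy-service"
    static_configs:
      - targets: ["my-fancy-service:8080"]
EOF

# 通知 Prometheus 重新加载配置;
# 依版本与启动参数不同,可以向 /-/reload 发送 POST 请求,或向进程发送 SIGHUP
curl -X POST "http://prometheus:9090/-/reload"
```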

如果 Prometheus 可以通过其 API 进行配置,部分混乱是可以避免的。然而,配置 API 虽然可以移除模板的需求,但不会消除网络驱动的需求。它的配置就是它的状态,这个状态必须被保存。

这种思维方式具有历史性。我们习惯于基于单体系统的信息被分散在各处。我们正在慢慢朝着一种不同的模型发展。系统被拆分成多个小服务,每个服务都是它解决的领域问题的完整真相来源。如果你需要有关某个服务的信息,向它请求,或者有一个机制会将这些信息推送给你。一个服务既不需要知道,也不应该关心谁使用它以及如何使用它。

服务本身应该包含描述它的所有数据。如果它需要重新配置代理,这些信息应该是服务的一部分。它应该包含用于输出日志的模式。它应该有监控工具应从中抓取的目标地址。它应该包含用于触发告警的信息。换句话说,服务所需的所有内容都应该在该服务中定义,而不是其他地方。系统为适配新服务所需的数据,其来源不应分散在多个位置,而应当位于我们正在部署的服务内部。由于我们都在使用容器(不是吗?),定义这些信息的最佳位置是服务标签。

如果你的服务应该通过路径/v1/my-fancy-service进行访问,可以使用参数--label servicePath=/v1/my-fancy-service来定义标签。如果 Prometheus 应该在端口8080抓取指标,则定义标签--label scrapePort=8080。依此类推。
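落实到命令上,大致就是在创建服务时把这些描述性信息作为标签附加上去。下面是一个示意(镜像名为假设,标签名沿用上文的例子):

```
docker service create \
  --name my-fancy-service \
  --label servicePath=/v1/my-fancy-service \
  --label scrapePort=8080 \
  my-org/my-fancy-service   # 镜像名仅为示意
```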

为什么这一切很重要?其中一个原因是,当我们在服务内部定义所有所需数据时,我们有一个包含服务完整信息的单一位置。这使得配置变得更加简单,使负责服务的团队更加自给自足,使得部署更加可管理并且减少错误,等等。

图 1-2:一个服务是唯一的事实来源,通常通过中介,向系统的其他部分宣布它的存在


在我们开发的服务中定义所有信息并不成问题。问题在于,大多数我们使用的第三方服务并没有设计成能够利用这些信息。请记住,关于服务的数据需要在集群中分布,它需要到达所有与我们开发和部署的服务协同工作的其他服务。我们不希望在多个位置定义这些信息,因为这会增加维护成本,并且可能引入由人为错误造成的问题。相反,我们希望将所有内容定义在我们部署的服务中,并将这些信息传播到整个集群中。

我们不希望在多个位置定义和维护相同的信息,我们确实希望将这些信息保留在源头,但第三方服务无法从源头获取这些数据。如果我们排除修改第三方服务的选项,那么唯一的选择就是扩展这些服务,使其能够拉取或接收所需的数据。

我们真正需要的是能够从我们部署的服务中发现信息的第三方服务。该发现可以是拉取(一个服务从另一个服务中拉取信息)或推送(一个服务充当中介,将数据从一个服务推送到另一个服务)。无论发现是依赖推送还是拉取,接收数据的服务都需要能够重新配置自己。所有这些都需要与一个能够检测服务已部署或已更新并通知所有相关方的系统结合起来。

最终目标是设计一个能够适应我们投入的任何服务以及集群变化的系统。最终的目标是拥有一个自适应自愈的系统,即使在我们度假时,它也能继续高效运行。

什么是自适应系统?

自适应系统是能够适应变化条件的系统。这一点显而易见,不是吗?从实际操作的角度来看,当操作一个集群并部署服务时,这意味着当部署一个新服务或更新一个现有服务时,系统应该能够适应。当集群中的条件发生变化时,整个系统应该通过适应这些条件而发生变化。如果部署了一个新服务,监控解决方案应该获取关于该服务的信息并更改其配置。日志系统应该开始处理该服务的日志并正确解析。集群中的节点数量应进行调整,等等。系统自适应的最重要要求是将其构建成不需要人工干预的方式。否则,我们就不妨把“自适应”改为“John 适应它”系统。

什么是自愈系统?

一个自愈系统需要具有适应性。如果没有适应环境变化的能力,我们就无法实现自愈。二者的区别在于,适应更偏向持久或长期,而自愈则是一种临时性的应对。举个例子,假设请求数量的增加是永久性的,可能是因为我们现在有更多的用户,或者是因为新设计的 UI 非常好,用户花更多时间使用我们的前端。由于这种增加,我们的系统需要适应并永久性(或至少更持久地)增加服务的副本数量。这个增加应该匹配最低预期负载。也许我们运行了五个购物车副本,这在大多数情况下足够了,但由于用户数量增加,购物车的实例数量需要增加,比如说增加到十个副本。这个数量不必是固定的。例如,它可以在七个(最低预期负载)到十二个(最高预期负载)之间变化。
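在 Docker Swarm 中,这种“适应”落到操作层面往往就是调整服务的副本数。下面是一个示意(服务名沿用上文的购物车例子,仅作说明):

```
# 把购物车服务从五个副本扩展到十个副本
docker service scale shopping-cart=10

# 等价写法
docker service update --replicas 10 shopping-cart
```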

自愈是一种对突发事件的反应,具有临时性。以我们(人类)为例。当病毒攻击我们时,我们的身体会作出反应并进行抵抗。一旦病毒被消灭,紧急状态就会结束,我们恢复正常状态。这个过程始于病毒的入侵,结束于病毒的清除。一个副作用是,在这个过程中我们可能会适应,并永久性地创造出更强的免疫系统。我们可以将同样的逻辑应用到我们的集群中。我们可以创建一些流程,来应对外部威胁并采取相应的措施。有些措施会在威胁消失后立即被移除,而其他措施可能会对我们的系统产生永久性的变化。

自愈并不总是有效。我们(人类)和软件系统有时也需要外部帮助。如果一切都失败了,我们无法自愈并在内部解决问题时,我们可能会去看医生。类似地,如果集群无法自行修复,它应该向操作员发送通知,操作员希望能够修复问题、写出事后分析,并改善系统,以便下次同样的问题出现时,它能够自愈。

对外部帮助的需求概述了一种有效构建自愈系统的方法。我们无法预测系统中可能发生的所有组合。然而,我们能做的是确保当意外发生时,它不会持续太久。一位优秀的工程师会尽力使自己变得不再必要。他会尝试只执行一次相同的操作,而做到这一点的唯一方法就是通过不断增加的自动化过程。所有预期的事项都应该被脚本化,并且纳入由系统执行的自适应和自愈过程。我们应该仅在意外发生时作出反应。

现在怎么办?

让我们开始构建一个自适应和自愈的系统。我们首先需要的是度量标准。没有度量标准,无论是系统还是我们都无法做出决策。我们将从选择合适的工具开始。

第六章:选择度量存储和查询解决方案

每个集群都需要收集度量指标。这些指标是我们可能想要使用的任何告警系统的基础。如果没有集群当前和过去状态的信息,我们将无法在问题发生时做出反应,也无法从一开始就防止问题的发生。事实上,这并不完全准确。我们本可以做所有这些事,但无法以高效且可扩展的方式进行。

一个好的类比是失明。失明并不意味着我们不能通过触觉在环境中移动。同样,如果没有收集和查询度量指标的方法,我们也并非无能为力。我们可以通过 SSH 进入每个节点,手动检查系统。我们可以从使用 top、mem、df 等命令开始。我们可以通过 docker stats 命令检查容器的状态。我们可以在一个容器与另一个容器之间切换,查看它们的日志。我们可以做所有这些事,但这种方式无法扩展。我们无法让操作员的数量与服务器的数量同步增长。我们无法将自己转变为人类机器。即使我们能够做到,也会非常糟糕。这就是为什么我们需要工具来帮助我们。如果这些工具不能满足我们的需求,我们可以在它们之上构建自己的解决方案。
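这种手工排查的方式大致如下(节点名沿用本书使用的 swarm-1,容器名为示意),它确实可行,但完全无法随集群规模扩展:

```
# 逐台 SSH 进入节点……
docker-machine ssh swarm-1

# ……再手工查看系统与容器的状态
top                          # CPU 与内存占用
df -h                        # 磁盘使用情况
docker stats --no-stream     # 各容器的资源使用
docker logs my-container     # 逐个翻看容器日志(容器名仅为示意)
```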

我们可以选择的工具有很多。要将它们一一比较几乎是不可能的,所以我们将限制范围,只讨论少数几个。

我们将只关注开源项目。我们将讨论的一些工具有付费的企业版,提供附加功能。我们将从比较中排除这些工具。排除的原因是我认为我们应该始终从开源软件开始,先对其进行熟悉,只有当它证明其价值后,才评估是否值得切换到企业版。

此外,我们还将引入一个额外的限制条件。我们将仅探讨那些可以自行托管的解决方案。这排除了像 Scout 和 DataDog 这样的托管服务。做出这一决定的原因有两个方面。一方面,许多组织不愿意将数据“交给”第三方托管服务。即使没有这样的限制,托管服务也需要能够将警报发送回我们的系统,而这将是一个巨大的安全漏洞。退一步说,即使这些问题对你来说无关紧要,这类服务的灵活性也不足以满足需求。我所知道的没有任何服务能提供足够的灵活性来构建一个自适应自愈的系统。而且,本书的目的是为你提供免费的解决方案,因此我们将坚持使用可以自行托管的开源解决方案。

这并不意味着付费软件不值这个价格,或者我们不应该使用并支付托管服务。恰恰相反。然而,我觉得从我们可以自己构建的工具开始,并探索其极限会更好。从那里开始,你将更好地理解自己需要什么,以及支付费用是否值得。

无量纲与有量纲的度量

在我们探讨选择的工具之前,我们应该讨论不同的度量存储和收集方法。

我们可以根据维度来划分工具。一些工具可以存储带有维度的数据,而其他工具则不能。无量纲度量工具的代表是 Graphite 和 Nagios。说实话,Graphite 中确实有某种维度的形式,但由于它们本质上非常有限,我们会将其视为无量纲工具。支持维度的解决方案有,例如 InfluxDB 和 Prometheus。前者以键/值对的形式支持维度,后者则使用标签。

无量纲(或无维度)度量存储属于旧世界,那时服务器相对静态,被监控的目标数量较少。这可以从这些工具创建的时间看出来。Nagios 和 Graphite 都比 InfluxDB 和 Prometheus 更早。

为什么维度相关?查询语言需要维度才能有效。如果没有维度,语言的能力必然受到限制。这并不意味着我们总是需要维度。对于简单的监控,维度可能是多余的。然而,运行一个可扩展的集群,其中服务不断部署、扩展、更新和移动,可远非简单。我们需要能够表示集群所有维度和其上运行服务的度量。一个动态系统需要动态分析,这需要包含维度的度量来实现。

一个无量纲度量的例子是container_memory_usage。与此相比,container_memory_usage{service_name="my-service", task_name="my-service.2.###", memory_limit="20000000", ...}则提供了更多的自由度。我们可以像无量纲度量那样计算平均内存使用量,但我们还可以推断出内存限制、服务名称、该任务是哪个副本(任务),等等。
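有了维度,查询语言就可以按标签进行过滤和聚合。下面通过 Prometheus 的 HTTP API 给出一个查询示意(地址为假设,度量与标签名沿用上文的例子):

```
# 对 my-service 的所有副本求平均内存使用量,并按服务名分组
curl -G "http://prometheus.example.com:9090/api/v1/query" \
  --data-urlencode \
  'query=avg(container_memory_usage{service_name="my-service"}) by (service_name)'
```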

维度(或缺乏维度)是区分存储和分析度量工具的唯一因素吗?除此之外,这些度量最终如何进入数据库也是一个可能产生显著差异的因素。一些工具期望数据被推送,而其他工具则会拉取(或抓取)数据。

如果我们坚持之前提到的工具,推送方法的代表是 Graphite 和 InfluxDB,而 Nagios 和 Prometheus 则属于拉取组。

那些属于推送类别的系统期望数据主动送到它们那里。它们是被动的(至少在指标收集方面是这样的)。每个收集数据的服务应该将数据推送到一个中心位置。collectD 和 statsD 就是流行的例子。而拉取系统则是主动的。它会从所有指定的目标中抓取数据。数据收集器并不知道数据库的存在。它们的唯一目的是收集数据并通过一种系统可以接受的协议将数据暴露出来。
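作为推送模式的一个具体示意,statsD 的行协议就是通过 UDP 把指标主动发往中心收集端(主机名与指标名均为假设):

```
# 以 statsD 行协议向 UDP 8125 端口推送一个计数器指标
echo "shopping_cart.requests:1|c" | nc -u -w1 statsd.example.com 8125
```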

关于每个系统的优缺点的讨论已经激烈进行了一段时间。关于为何一个系统优于另一个系统的论据有很多,我们可以花费大量时间来讨论它们。相反,我们将讨论发现机制,这个论点在我看来是最相关的。

在推送系统中,发现是容易的。数据收集器只需要知道指标存储的地址并推送数据。只要该地址保持可用,配置就非常简单。对于拉取系统,系统需要知道所有数据收集器(或导出器)的地址。当只有少数几个时,这很容易配置。如果数量增加到数十、数百甚至数千个目标时,配置可能会变得非常繁琐。这种情况显然更有利于推送模式。但技术发生了变化。我们现在有了可靠的系统来提供服务发现。例如,Docker Swarm 就将其作为 Docker 引擎的一部分内置。找到目标非常容易,并且假设我们信任服务发现,我们总是能获取到所有数据收集器的最新信息。
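以 Docker Swarm 为例,在接入同一 overlay 网络的任意容器内,解析 tasks.<服务名> 这个 DNS 名称就能得到该服务所有副本(任务)的地址,拉取系统据此即可拿到完整的抓取目标列表。示意如下(服务名为假设):

```
# 解析 tasks.<服务名> 返回该服务每个任务(副本)的 IP
nslookup tasks.my-service

# 解析服务名本身则只返回负载均衡用的虚拟 IP(VIP)
nslookup my-service
```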

如果有一个适当的服务发现机制,拉取与推送的争论就变得或多或少无关紧要。这引出了一个让拉取更加有吸引力的论点。拉取数据时,发现一个失败的实例或缺失的服务要容易得多。当一个系统期望数据收集器推送数据时,它并不会意识到有什么东西缺失。我们可以用“我不知道我不知道什么”来总结这个问题。另一方面,拉取系统知道应该期待什么。它们知道它们的目标是什么,当一个抓取目标没有响应时,很容易推测其原因是它停止了工作。

图 2-1:根据维度和数据收集方法放置的监控工具


无论是推送还是拉取的论点都不是绝对的,我们不应该仅仅基于这些标准做出选择。相反,我们将稍微深入探讨我们之前讨论过的工具。

名单上的第一个是 Graphite。

Graphite

Graphite 是一个被动的指标存储工具。我们之所以称其为被动,是因为它不能收集指标。指标需要通过单独的进程进行收集并推送。

它是一个时序数据库,拥有自己的查询语言和生成图表的能力。查询 API 非常强大。或者,更准确地说,在它出现时被认为非常强大。今天,与其他一些工具相比,它的查询语言显得有些局限,主要是由于它用于存储度量标准的无维度格式。

Graphite 将数值数据存储为时序格式。它的度量名称由点分隔的元素组成。

数据存储在本地磁盘上。
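举例来说,Graphite 的度量名只能靠以点分隔的层级来承载信息,数据通常由收集进程通过明文协议推送(主机名与度量名均为假设):

```
# Graphite 明文协议:"<以点分隔的度量名> <数值> <时间戳>",默认端口 2003
echo "prod.swarm-1.shopping-cart.memory.usage 12345678 $(date +%s)" | \
  nc graphite.example.com 2003
```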

InfluxDB

就像 Graphite 一样,InfluxDB 也是一个时序数据库。与 Graphite 不同,InfluxDB 的数据模型是基于标签形式的键值对。

InfluxDB(更准确地说是开源版本)依赖本地存储来存储数据,并进行抓取、规则处理和告警。

Nagios 和 Sensu

Nagios 是一个起源于 90 年代的监控系统,最初名为 NetSaint。它主要通过脚本的退出码进行告警。

与其他解决方案不同,它存储的数据量和数据类型受到限制,仅用于检查状态,因此只适合非常基础的监控。

Sensu 可以视为 Nagios 的更现代版本。主要的区别在于,Sensu 客户端会自行注册,并可以从中央或本地配置中决定要运行的检查项。它还有一个客户端套接字,允许将任意的检查结果推送到 Sensu 中。

Sensu 使用与 Nagios 几乎相同的数据模型,并且共享其在存储度量标准时使用的格式的限制。

Prometheus

Prometheus 是一个完整的监控和趋势分析系统,包含内置的主动抓取、存储、查询、绘图和基于时序数据的告警功能。它了解世界应该是什么样子(哪些端点应该存在,什么时序模式表示问题,等等),并主动寻找故障。

Prometheus 拥有丰富的数据模型,并且可能是时序数据库中最强大的查询语言。它将维度显式地编码为附加在度量名称上的键值对(标签)。这使得通过查询语言可以轻松地按这些标签进行过滤、分组和匹配。

我们应该选择哪种工具?

我们列出的所有工具都各有优点。它们在许多方面有所不同,但在某些方面又是相似的。

Nagios 和 Sensu 过去为我们提供了很好的服务。它们设计于不同的时代,基于今天被认为已经过时的原则。它们在静态集群和运行在预定义位置的单体应用程序与服务中表现良好。它们存储的度量标准(或缺乏度量标准)并不适合进行更复杂的决策。在我们想要运营像 Docker Swarm 这样的调度器,运行在一个自动扩展的集群中时,它们会让我们遇到很大的困难。在我们探索的解决方案中,它们是最先应该被舍弃的。一个被淘汰,剩下三个可以选择。

Graphite 使用的点分隔度量格式存在局限性。使用星号(*)排除度量的元素往往不足以进行适当的过滤、分组和其他操作。与 InfluxDB 和 Prometheus 相比,它的查询语言是我们放弃它的主要原因。

我们剩下的选择是 InfluxDB 和 Prometheus,二者的差异仅在于一些细微之处。

InfluxDB 和 Prometheus 在许多方面相似,因此选择并不容易。说实话,我们不可能做出错误的决策。不论我们选择哪个,最终的决定都会基于一些细小的差异。

如果我们不局限于开源解决方案作为唯一候选项,那么 InfluxDB 企业版可能会因为其可扩展性而成为赢家。然而,我们将放弃它,选择 Prometheus。它提供了更为完整的解决方案。更重要的是,Prometheus 正在慢慢成为事实上的标准,至少在与调度器一起使用时是如此。它在 Kubernetes 中是首选解决方案。Docker(因此 Swarm)很快会以 Prometheus 格式公开其度量数据。这一点本身就是一个转折点,应该让我们更倾向于选择 Prometheus。

决定已定。我们将使用 Prometheus 来存储度量数据,查询它们,并触发警报。

接下来怎么办?

现在我们已经决定了用于存储度量数据的工具基础,接下来应当进行设置。由于我们将使用 Docker Swarm 服务,以最基本的形式部署 Prometheus 将会非常轻松。

第七章:部署和配置 Prometheus

初看之下,部署 Prometheus 很简单。创建一个 Compose 文件并执行docker stack deploy命令。复杂性出现在我们开始将服务与 Prometheus 集成时。很快,你将亲身体验到集成问题。

像任何一个好故事一样,本章将从一个愉快的开始。对工程师来说,愉快意味着简单且有效。让我们看看实践中简单是怎样的。

部署 Prometheus 堆栈

我们将从克隆vfarcic/docker-flow-monitor仓库开始。它包含了本章中我们将使用的所有脚本和 Docker 堆栈。

`1` git clone `\`
`2 `    https://github.com/vfarcic/docker-flow-monitor.git
`3` 
`4` `cd` docker-flow-monitor 

Before we create a Prometheus service, we need to have a cluster. It will consist of three nodes created with Docker Machine. ``` `1` chmod +x scripts/dm-swarm.sh `2` `3` ./scripts/dm-swarm.sh `4` `5` `eval` `$(`docker-machine env swarm-1`)` ``` ````````````````` The `dm-swarm.sh` script created the nodes and joined them into a Swarm cluster. Now we can create the first Prometheus service. We’ll start small and slowly move toward a more robust solution. We’ll deploy the stack defined in `stacks/prometheus.yml`. It is as follows. ``` `1` version: "3" `2` `3` services: `4` `5 ` prometheus: `6 ` image: prom/prometheus `7 ` ports: `8 ` - 9090:9090 ``` ```````````````` As you can see, it is as simple as it can get. It specifies the image and the port that should be opened. Let’s deploy the stack. ``` `1` docker stack deploy `\` `2 ` -c stacks/prometheus.yml `\` `3 ` monitor ``` ``````````````` Please wait a few moments until the image is pulled and deployed. You can monitor the status by executing the `docker stack ps monitor` command. Let’s confirm that Prometheus service is indeed up-and-running. ``` `1` open `"http://``$(`docker-machine ip swarm-1`)``:9090"` ``` `````````````` You should see the Prometheus Graph screen. Let’s take a look at the configuration. ``` `1` open `"http://``$(`docker-machine ip swarm-1`)``:9090/config"` ``` ````````````` You should see the default config that does not define much more than intervals and internal scraping. In its current state, Prometheus is not very useful, so we’ll have to spice it up a bit. ![Figure 3-1: Prometheus with the default configuration](https://github.com/OpenDocCN/freelearn-devops-pt5-zh/raw/master/docs/dop-22-tlkt/img/00008.jpeg) Figure 3-1: Prometheus with the default configuration We should start fine tuning Prometheus. There are quite a few ways we can do that. We can create a new Docker image that would extend the one we used and add our own configuration file. That solution has a distinct advantage of being immutable and, hence, very reliable. Since Docker image cannot be changed, we can guarantee that the configuration is exactly as we want it to be no matter where we deploy it. If the service fails, Swarm will reschedule it and, since the configuration is baked into the image, it’ll be preserved. The problem with that approach is that it is not suitable for microservices architecture. If Prometheus has to be reconfigured with every new service (or at least those that expose metrics), we would need to build it quite often and tie that build to CD processes executed for the services we’re developing. This approach is suitable only for a relatively static cluster and monolithic applications. Discarded! ![Figure 3-2: Creating a new image every time Prometheus config change](https://github.com/OpenDocCN/freelearn-devops-pt5-zh/raw/master/docs/dop-22-tlkt/img/00009.jpeg) Figure 3-2: Creating a new image every time Prometheus config change What would be the alternative approach? We can enter a running Prometheus container, modify its configuration, and reload it. While this allows a higher level of dynamism, it is not fault-tolerant. If Prometheus fails, Swarm will reschedule it, and all the changes we made will be lost. Besides fault tolerance, modifying a config in a running container poses additional problems when running it as a service inside a cluster. We need to find out the node it is running in, SSH into it, figure out the ID of the container, and, only then, we can `exec` into it, modify the config, and send a reload request. 
While those steps are not overly complicated and can be scripted, they will pose an unnecessary operational complexity. Discarded! ![Figure 3-3: Updating Prometheus configuration inside a container](https://github.com/OpenDocCN/freelearn-devops-pt5-zh/raw/master/docs/dop-22-tlkt/img/00010.jpeg) Figure 3-3: Updating Prometheus configuration inside a container Among other reasons, we discarded the previous solution because it is not fault-tolerant. We could mount a network volume to the service. That would solve persistence, but would still leave the problem created by a dynamic nature of a cluster. We still, potentially, need to change the configuration and reload Prometheus every time a new service is deployed or updated. From the operational perspective, this solution is simpler than the previous solution we discussed. We do not need to find out the node it is running in, SSH into it, figure out the ID of the container, `exec` into it, and modify the config. Instead, we can alter the file on the network drive and send a reload request to Prometheus. While network drive simplifies the process, it does not make it as dynamic and independent from the services as it should be. We would need to make sure that the deployment pipeline of each of the services has the required steps that will reconfigure Prometheus. By doing that we would break one of our objectives. That is, our services would not contain all the information about themselves. Instead, we’d need to create a different pipeline for each and specify the targets, alerts, and other information we might need before reconfiguring Prometheus. We’ll discard this solution as well. ![Figure 3-4: Updating Prometheus configuration stored on a network drive](https://github.com/OpenDocCN/freelearn-devops-pt5-zh/raw/master/docs/dop-22-tlkt/img/00011.jpeg) Figure 3-4: Updating Prometheus configuration stored on a network drive What other options do we have? If we’re looking for an out-of-the-box solution that uses the official Prometheus image, all our options are exhausted. But we are engineers. We are used to extending other people solutions and adapting them to suit our needs. Let’s not limit our options and try to design a solution that would suit us well. ### Designing A More Dynamic Monitoring Solution How can we improve Prometheus design to suit our purposes better? How can we make it more dynamic and more scheduler friendly? One improvement we can make is the usage of environment variables. That would save us from having to create a new image every time we need to change its configuration. At the same time, environment variables would remove the need to use a network drive (at least for configuration). We can make a generic solution that will transform any environment variable into a Prometheus configuration entry or an initialization argument. To enable Prometheus configuration through environment variables, we need to distinguish those that should be used as command line arguments from those that will serve to create the configuration file. We’ll define a naming convention stating that every environment argument with a name that starts with `ARG_` is a startup argument. The code can be as follows. 
``` `1` `func` `Run``()` `error` `{` `2` `cmdString` `:=` `"prometheus"` `3` `for` `_``,` `e` `:=` `range` `os``.``Environ``()` `{` `4` `if` `key``,` `value` `:=` `getArgFromEnv``(``e``,` `"ARG"``);` `len``(``key``)` `>` `0` `{` `5` `cmdString` `=` `fmt``.``Sprintf``(``"%s -%s=%s"``,` `cmdString``,` `key``,` `value``)` `6` `}` `7` `}` `8` `cmd` `:=` `exec``.``Command``(``"/bin/sh"``,` `"-c"``,` `cmdString``)` `9` `return` `cmdRun``(``cmd``)` `10` `}` ``` ```````````` It is a very simple function. It iterates through all the environment variables. If their names start with `ARG`, they will be added as arguments of the executable `prometheus`. Once the iteration is done, binary is launched with arguments. We made Prometheus more *Docker-friendly* with only a few lines of code that sits on top of it. The full source code can be found in the [run.go](https://github.com/vfarcic/docker-flow-monitor/blob/master/prometheus/run.go) file. We should do something similar with the configuration file. Specifically, we can make the global section of the configuration use environment variables prefixed with `GLOBAL_`. The logic of the code is similar to the `Run` function we explored. Please go through [config.go](https://github.com/vfarcic/docker-flow-monitor/blob/master/prometheus/config.go) for more details. The `GetGlobalConfig` function returns `global` section of the config while the `WriteConfig` function writes the configuration to the file. Please consult [Prometheus Configuration](https://prometheus.io/docs/operating/configuration/) for more information about the available options. By using environment variables, we managed to get rid of the network drive. As far as configuration is concerned, it will be fault tolerant. If the service fails and gets rescheduled with Swarm, it will not lose its configuration since it is part of the service definition. There is a downside though. Every time we want to change the configuration, we’ll need to execute `docker service update` command or modify the stack file, and re-execute `docker stack deploy`. As a result, Docker will stop the currently running replica and start a new one thus producing a short downtime. However, since we are, at the moment, only dealing with global configuration and startup arguments, changes will be very uncommon. We’ll deal with more dynamic parts of the configuration later. ![Figure 3-5: Prometheus configuration defined through environment variables](https://github.com/OpenDocCN/freelearn-devops-pt5-zh/raw/master/docs/dop-22-tlkt/img/00012.jpeg) Figure 3-5: Prometheus configuration defined through environment variables I have the code compiled and available as [vfarcic/docker-flow-monitor/](https://hub.docker.com/r/vfarcic/docker-flow-monitor/). Let’s give it a spin. ### Deploying Docker Flow Monitor Deploying *Docker Flow Monitor* is easy (as almost all Docker services are). We’ll start by creating a network called `monitor`. We could let Docker stack create it for us, but it is useful to have it defined externally so that we can easily attach it to services from other stacks. ``` `1` docker network create -d overlay monitor ``` ``````````` The stack is as follows. 
``` `1` `version``:` `"3"` `2` `services``:` `3` `monitor``:` `4` `image``:` `vfarcic``/``docker``-``flow``-``monitor``:``$``{``TAG``:-``latest``}` `5` `environment``:` `6` `-` `GLOBAL_SCRAPE_INTERVAL``=``10``s` `7` `networks``:` `8` `-` `monitor` `9` `ports``:` `10 ` `-` `9090``:``9090` `11` `networks``:` `12 ` `monitor``:` `13 ` `external``:` `true` ``` `````````` The environment variable `GLOBAL_SCRAPE_INTERVAL` shows the first improvement over the “original” Prometheus service. It allows us to define entries of its configuration as environment variables. That, in itself, is not a significant improvement but is a good start. More powerful additions will be explored later on. Now we’re ready to deploy the stack. ``` `1` docker stack rm monitor `2` `3` docker stack deploy `\` `4 ` -c stacks/docker-flow-monitor.yml `\` `5 ` monitor ``` ````````` Please wait a few moments until Swarm pulls the image and starts the service. You can monitor the status by executing `docker stack ps monitor` command. Once the service is running, we can confirm that the environment variable indeed generated the configuration. ``` `1` open `"http://``$(`docker-machine ip swarm-1`)``:9090/config"` ``` ```````` ![Figure 3-6: Prometheus configuration defined through environment variables](https://github.com/OpenDocCN/freelearn-devops-pt5-zh/raw/master/docs/dop-22-tlkt/img/00013.jpeg) Figure 3-6: Prometheus configuration defined through environment variables we are going to expose services url in pretty format, therefore we must get rid of port number (9090) in above url. ### Integrating Docker Flow Monitor With Docker Flow Proxy Having a port opened (other than `80` and `443`) is, often, not a good idea. If for no other reason, at least it’s not user-friendly to remember a different port for each service. In general service might need to be accessible on its own subdomain, it might need SSL certificate, it might require some URL rewriting, it might need a basic authentication, and so on and so forth. I won’t go into details since you probably already know all that and you are probably already using some proxy in your organization. We’ll integrate [Docker Flow Monitor](http://monitor.dockerflow.com/) with [Docker Flow Proxy (DFP)](http://proxy.dockerflow.com/). If you haven’t used DFP before, please visit the [official documentation](http://proxy.dockerflow.com/) for tutorials, setup, and configuration. Before we apply the knowledge about new ways to configure Prometheus, we need to run the proxy. ``` `1` docker network create -d overlay proxy `2` `3` docker stack deploy `\` `4 ` -c stacks/docker-flow-proxy.yml `\` `5 ` proxy ``` ``````` We created the `proxy` network and deployed the `docker-flow-proxy.yml` stack. We won’t go into details how *Docker Flow Proxy* works. The essence is that it will configure itself with each service that has specific labels. For any deeper explanation, please visit [Docker Flow Proxy Stack Tutorial](http://proxy.dockerflow.com/swarm-mode-stack/) or any other tutorial available. With the proxy up and running, we should redeploy our monitor. We’ll replace the current monitor stack with a new one with in order to achieve this. The major difference is that this time we’ll define startup arguments as well as the labels that will allow the proxy to reconfigure itself to enable access to monitor. You’ll also notice that we will not expose port *9090*. It’ll be accessible through the proxy on port *80*, so there’s no reason to open any other port. The stack is as follows. 
``` `1` `monitor``:` `2` `image``:` `vfarcic``/``docker``-``flow``-``monitor``:``$``{``TAG``:-``latest``}` `3` `environment``:` `4` `-` `GLOBAL_SCRAPE_INTERVAL``=``10``s` `5` `-` `ARG_WEB_ROUTE``-``PREFIX``=/monitor` `6` `-` `ARG_WEB_EXTERNAL``-``URL``=``http``:``//``$``{``DOMAIN``:-``localhost``}``/``monitor` `7` `networks``:` `8` `-` `proxy` `9` `-` `monitor` `10 ` `deploy``:` `11 ` `labels``:` `12 ` `-` `com``.``df``.``notify``=``true` `13 ` `-` `com``.``df``.``distribute``=``true` `14 ` `-` `com``.``df``.``servicePath=/monitor` `15 ` `-` `com``.``df``.``serviceDomain=``$``{``DOMAIN``:-``localhost``}` `16 ` `-` `com``.``df``.``port``=``9090` `17` `18` `19 ` `swarm``-``listener``:` `20 ` `image``:` `vfarcic``/``docker``-``flow``-``swarm``-``listener` `21 ` `networks``:` `22 ` `-` `monitor` `23 ` `volumes``:` `24 ` `-` `/``var``/``run``/``docker``.``sock``:``/``var``/``run``/``docker``.``sock` `25 ` `environment``:` `26 ` `-` `DF_NOTIFY_CREATE_SERVICE_URL``=``http``:``//``monitor``:``8080``/``v1``/``docker``-``flow``-``monitor/\` `27` `reconfigure` `28 ` `-` `DF_NOTIFY_REMOVE_SERVICE_URL``=``http``:``//``monitor``:``8080``/``v1``/``docker``-``flow``-``monitor/\` `29` `remove` `30 ` `deploy``:` `31 ` `placement``:` `32 ` `constraints``:` `[``node``.``role` `==` `manager``]` `33` `34` `networks``:` `35 ` `monitor``:` `36 ` `external``:` `true` `37 ` `proxy``:` `38 ` `external``:` `true` ``` `````` This time we added a few additional environment variables. They will be used instead Prometheus’ default startup arguments. We are specifying the route prefix (`ARG_WEB_ROUTE-PREFIX`) as well as the full external URL (`ARG_WEB_EXTERNAL-URL`). > Please visit [ARG Variables](http://monitor.dockerflow.com/config/#arg-variables) section of the documentation for more information about environment variables that can be used as startup arguments. We also used the `com.df.*` service labels that will tell the proxy how to reconfigure itself so that Prometheus is available through the path `/monitor`. The second service is [Docker Flow Swarm Listener](http://swarmlistener.dockerflow.com/) that will listen to Swarm events and send reconfigure and remove requests to the monitor. You’ll see its usage later on. For now, just remember that we deployed it alongside the `monitor` service. Let us deploy the new version of the monitor stack. ``` `1` docker stack rm monitor `2` `3` `DOMAIN``=``$(`docker-machine ip swarm-1`)` `\` `4 ` docker stack deploy `\` `5 ` -c stacks/docker-flow-monitor-proxy.yml `\` `6 ` monitor ``` ````` Please execute, `docker stack ps monitor` to check the status of the stack. Once it’s up-and-running, we can confirm that the monitor is indeed integrated with the proxy. ``` `1` open `"http://``$(`docker-machine ip swarm-1`)``/monitor/flags"` ``` ```` By opening the *flags* screen, not only that we confirmed that the integration with *Docker Flow Proxy* worked but also that the arguments we specified as environment variables are properly propagated. You can observe that through the values of the `web.external-url` and `web.route-prefix` flags. ![Figure 3-7: Prometheus flags screen with values passed through environment variables](https://github.com/OpenDocCN/freelearn-devops-pt5-zh/raw/master/docs/dop-22-tlkt/img/00014.jpeg) Figure 3-7: Prometheus flags screen with values passed through environment variables Please note that we did not specify the port of the `monitor` service. 
As soon as the service was created, `swarm-listener` detected it and sent a request to the proxy to reconfigure itself. The information the proxy needs was obtained through the labels (e.g. `com.df.servicePath`). There was a hidden reason behind the integration of the two. Apart from the need to have a proxy, I wanted to show you an existing implementation of the logic we are exploring. There was no need for a manual configuration of the proxy, nor we had to define the data proxy needs anywhere but inside the service definition itself. The `monitor` service contains all the information, and any other part of the system can fetch it. Everything related to the service is in a single location. By everything, I mean everything that we need for now. Later on, we’ll extend the definition of this and many other services. ### What Now? Soon we’ll start exploring *exporters* and their integration with *Prometheus* and *Docker Flow Monitor*. We’ll take a break and remove the machines we created. Every chapter will start from scratch. Don’t be scared. It’ll take only a couple of minutes to get back to the previous state. ``` `1` docker-machine rm -f `\` `2 ` swarm-1 swarm-2 swarm-3 ``` ```` ````` `````` ``````` ```````` ````````` `````````` ``````````` ```````````` ````````````` `````````````` ``````````````` ```````````````` `````````````````

第八章:拉取指标

Prometheus 是一个基于拉取的系统。它需要从目标获取指标。这些指标可以从你的服务内部暴露,或者通过充当 Prometheus 和其他服务或系统之间中介的通用出口程序来暴露。

服务可以通过使用某个客户端库来提供指标,许多编程语言都有相应支持。如果你的编程语言没有可用的库,或者你不想再增加一个依赖项,也始终可以选择实现其中一种暴露格式。
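Prometheus 的暴露格式本身只是带有 HELP 和 TYPE 注释的纯文本。作为示意,下面手工构造一个该格式的指标文件,并放入 node exporter 的 textfile 收集目录(目录沿用本章稍后出口程序堆栈中配置的 /etc/node-exporter/,指标名为假设,并假定该目录已正确挂载):

```
# Prometheus 暴露格式:HELP、TYPE 注释加上 "指标名{标签} 数值"
cat > /etc/node-exporter/deploy_info.prom <<'EOF'
# HELP deployment_timestamp_seconds Unix timestamp of the last deployment.
# TYPE deployment_timestamp_seconds gauge
deployment_timestamp_seconds{service="shopping-cart"} 1495056183
EOF
```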

我们服务的替代性监控方法是使用出口程序。 出口程序与集成 页面列出了许多官方和社区维护的解决方案。

在我们面前有这两种选择时,我们应该做出决定,选择哪种类型。我们是应该对服务进行监控,还是使用出口程序?这个决定不一定是二选一的。我们可以同时使用两者。

在某些情况下,我们没有选择。例如,对于第三方软件,如 HAProxy,出口程序可能是唯一的选择,因为它没有原生提供 Prometheus 格式的指标。另一方面,如果有一组非常特定的指标需要从我们的服务中抓取,监控是最佳选择,除非该服务已经以不同格式暴露了指标。

大多数时候,我们确实有选择,既可以使用出口程序,也可以对服务进行监控。在这种情况下,我更倾向于使用出口程序。即使有时监控不可避免,它也会导致不必要的耦合。我们的服务应该只做它们被设计来做的事。例如,如果我们有一个充当购物车的服务,添加监控并且可能依赖于 Prometheus 库,就会引入紧耦合。如果我们这样做,购物车就不再专注于解决单一的业务领域,而是还承担了其他附带功能。你可能会认为,添加监控不是一项很大的工作。然而,保持服务专注于其业务领域有诸多好处,我们应该避免通过增加额外的职责来扩展它们的范围。也就是说,前提是我们能够避免这样做。

我的建议是,始终从出口程序开始,只有当你需要的指标没有通过现有的出口程序提供时,才对服务进行监控。这样,你的服务将有明确的职责,并专注于某个业务领域,而所有与基础设施相关的任务将委托给像出口程序这样的垂直服务。

在本章中,我们将仅使用导出器作为提供目标的手段,这些目标将被 Prometheus 用于抓取指标。如果你意识到确实需要为你的服务添加监控功能,请查阅Prometheus 文档以获取更多信息。

在本书的后续章节中,如果我们确实意识到它提供了实质性的优势,我们可能会为我们的演示服务添加监控功能。

现在我们已经简要概述了不同的指标暴露方式,我们可以继续进行这一主题的实践探索。

创建集群并部署服务

我们将从重新创建集群并部署上一章中使用的堆栈开始。

`1` chmod +x scripts/dm-swarm-04.sh
`2` 
`3` ./scripts/dm-swarm-04.sh
`4` 
`5` `eval` `$(`docker-machine env swarm-1`)` 

We executed the `dm-swarm-04.sh` script which, in turn, created a Swarm cluster composed of Docker Machines, created the networks and deployed the stacks. Now we should wait a few moments until all the services in the `monitor` stack are up and running. Please use `docker stack ps monitor` command to confirm that the status of all the services in the stack is *Running*. Finally, we’ll confirm that everything is deployed correctly by opening Prometheus in a browser. ``` `1` open `"http://``$(`docker-machine ip swarm-1`)``/monitor"` ``` ````````````````````````` Now the state of our cluster is the same as it was at the end of the previous chapter and we can proceed towards deploying exporters. ### Deploying Exporters Exporters provide data Prometheus can scrape and put into its database. The stack we’ll deploy is as follows. ``` `1` `version``:` `"3"` `2` `3` `services``:` `4` `5` `ha``-``proxy``:` `6` `image``:` `quay``.``io``/``prometheus``/``haproxy``-``exporter``:``$``{``HA_PROXY_TAG``:-``latest``}` `7` `networks``:` `8` `-` `proxy` `9` `-` `monitor` `10 ` `deploy``:` `11 ` `labels``:` `12 ` `-` `com``.``df``.``notify``=``true` `13 ` `-` `com``.``df``.``scrapePort``=``9101` `14 ` `command``:` `-``haproxy``.``scrape``-``uri=``"http://admin:admin@proxy/admin?stats;csv"` `15` `16 ` `cadvisor``:` `17 ` `image``:` `google``/``cadvisor``:``$``{``CADVISOR_TAG``:-``latest``}` `18 ` `networks``:` `19 ` `-` `monitor` `20 ` `volumes``:` `21 ` `-` `/:/rootfs` `22 ` `-` `/``var``/``run``:``/``var``/``run` `23 ` `-` `/``sys``:``/``sys` `24 ` `-` `/``var``/``lib``/``docker``:``/``var``/``lib``/``docker` `25 ` `deploy``:` `26 ` `mode``:` `global` `27 ` `labels``:` `28 ` `-` `com``.``df``.``notify``=``true` `29 ` `-` `com``.``df``.``scrapePort``=``8080` `30` `31 ` `node``-``exporter``:` `32 ` `image``:` `basi``/``node``-``exporter``:``$``{``NODE_EXPORTER_TAG``:-``v1``.13.0``}` `33 ` `networks``:` `34 ` `-` `monitor` `35 ` `environment``:` `36 ` `-` `HOST_HOSTNAME``=/etc``/``host_hostname` `37 ` `volumes``:` `38 ` `-` `/``proc``:``/``host``/``proc` `39 ` `-` `/``sys``:``/``host``/``sys` `40 ` `-` `/:/rootfs` `41 ` `-` `/``etc``/``hostname``:``/``etc``/``host_hostname` `42 ` `deploy``:` `43 ` `mode``:` `global` `44 ` `labels``:` `45 ` `-` `com``.``df``.``notify``=``true` `46 ` `-` `com``.``df``.``scrapePort``=``9100` `47 ` `command``:` `'-collector.procfs /host/proc -collector.sysfs /host/sys -collector\` `48` `.filesystem.ignored-mount-points "^/(sys|proc|dev|host|etc)($$|/)" -collector.te\` `49` `xtfile.directory /etc/node-exporter/ -collectors.enabled="conntrack,diskstats,en\` `50` `tropy,filefd,filesystem,loadavg,mdadm,meminfo,netdev,netstat,stat,textfile,time,\` `51` `vmstat,ipvs"'` `52` `53` `networks``:` `54 ` `monitor``:` `55 ` `external``:` `true` `56 ` `proxy``:` `57 ` `external``:` `true` ``` ```````````````````````` As you can see, the stack definition contains the `node` and `haproxy` exporters as well as `cadvisor` service. `haproxy-exporter` provides proxy metrics, `node-exporter` collects server data, while `cadvisor` outputs information about containers inside our cluster. You’ll notice that `cadvisor` and `node-exporter` are running in the `global mode`. A replica will run on each server so that we can obtain an accurate picture of all the nodes that form the cluster. The important parts of the stack definition are `com.df.notify` and `com.df.scrapePort` labels. The first one tells `swarm-listener` that it should notify the monitor when those services are created (or destroyed). 
The `scrapePort` labels are defining ports of the exporters. Please visit [Scrape Parameter](http://monitor.dockerflow.com/usage/#scrape-parameters) section of the documentation for more information how to define scrape parameters. Let’s deploy the stack and see it in action. ``` `1` docker stack deploy `\` `2 ` -c stacks/exporters.yml `\` `3 ` exporter ``` ``````````````````````` Please wait until all the services in the stack and running. You can monitor their status with `docker stack ps exporter` command. Once the `exporter` stack is up-and-running, we can confirm that all the services were added to the `monitor` config. ``` `1` open `"http://``$(`docker-machine ip swarm-1`)``/monitor/config"` ``` `````````````````````` ![Figure 4-1: Configuration with exporters](https://github.com/OpenDocCN/freelearn-devops-pt5-zh/raw/master/docs/dop-22-tlkt/img/00015.jpeg) Figure 4-1: Configuration with exporters We can also confirm that all the targets are indeed working. ``` `1` open `"http://``$(`docker-machine ip swarm-1`)``/monitor/targets"` ``` ````````````````````` There should be three targets. If they are still not registered, please wait a few moments and refresh your screen. Two of the targets (`exporter_cadvisor` and `exporter_node-exporter`) are running as global services. As a result, each has three endpoints, one on each node. The last target is `exporter_ha-proxy`. Since we did not deploy it globally nor specified multiple replicas, in has only one endpoint. ![Figure 4-2: Targets and endpoints](https://github.com/OpenDocCN/freelearn-devops-pt5-zh/raw/master/docs/dop-22-tlkt/img/00016.jpeg) Figure 4-2: Targets and endpoints If we used the “official” Prometheus image, setting up those targets would require an update of the config file and reload of the service. On top of that, we’d need to persist the configuration. Instead, we let *Swarm Listener* notify *Docker Flow Monitor* that there are new services that should, in this case, generate new scraping targets. Instead of splitting the initial information into multiple locations, we specified scraping info as service labels and let the system take care of the distribution of that data. ![Figure 4-3: Prometheus scrapes metrics from exporters](https://github.com/OpenDocCN/freelearn-devops-pt5-zh/raw/master/docs/dop-22-tlkt/img/00017.jpeg) Figure 4-3: Prometheus scrapes metrics from exporters Let’s take a closer look into the exporters running in our cluster. ### Exploring Exporter Metrics All the exporters we deployed expose metrics in Prometheus format. We can observe them by sending a simple HTTP request. Since the services do not publish any ports, the only way we can communicate with them is through the `monitor` network attached to those exporters. We’ll create a new utility service and attach it to the `monitor` network. ``` `1` docker service create `\` `2 ` --name util `\` `3 ` --network monitor `\` `4 ` --mode global `\` `5 ` alpine sleep `100000000` ``` ```````````````````` We created a service based on the `alpine` image, named it `util`, and attached it to the `monitor` network so that it can communicate with exporters we deployed. We made the service `global` so that it runs on every node. That guaranteed that a replica runs on the node we’re in. Since `alpine` does not have a long running process, without `sleep`, it would stop as soon as it started, Swarm would reschedule it, only to detect that it stopped again, and so on. Without `sleep` it would enter a never ending loop of failures and rescheduling. 
``` `1` `ID``=``$(`docker container ls -q `\` `2 ` -f `"label=com.docker.swarm.service.name=util"``)` `3` `4` docker container `exec` -it `$ID` `\` `5 ` apk add --update curl ``` ``````````````````` Next, we found the `ID` of the container, entered it, and installed `curl`. Now we’re ready to send requests to the exporters. ``` `1` docker container `exec` -it `$ID` `\` `2 ` curl node-exporter:9100/metrics ``` `````````````````` Partial output of the request to the `node-exporter` is as follows. ``` `1` ... `2` # HELP process_cpu_seconds_total Total user and system CPU time spent in seconds. `3` # TYPE process_cpu_seconds_total counter `4` process_cpu_seconds_total 3.05 `5` # HELP process_max_fds Maximum number of open file descriptors. `6` # TYPE process_max_fds gauge `7` process_max_fds 1.048576e+06 `8` # HELP process_open_fds Number of open file descriptors. `9` # TYPE process_open_fds gauge `10` process_open_fds 7 `11` # HELP process_resident_memory_bytes Resident memory size in bytes. `12` # TYPE process_resident_memory_bytes gauge `13` process_resident_memory_bytes 1.6228352e+07 `14` # HELP process_start_time_seconds Start time of the process since unix epoch in \ `15` seconds. `16` # TYPE process_start_time_seconds gauge `17` process_start_time_seconds 1.49505618366e+09 `18` # HELP process_virtual_memory_bytes Virtual memory size in bytes. `19` # TYPE process_virtual_memory_bytes gauge `20` process_virtual_memory_bytes 2.07872e+07 `21` ... ``` ````````````````` As you can see, each metric contains a help entry that describes it, states the type, and displays metric name followed with a value. We won’t go into details of all the metrics provided by `node-exporter`. The list is quite big, and it would require a whole chapter (maybe even a book) to go through all of them. The important thing, at this moment, is to know that almost anything hardware and OS related is exposed as a metric. Please note that Overlay network load-balanced our request and forwarded it to one of the replicas of the exporter. We don’t know what the origin of those metrics is. It could be a replica running on any of the nodes of the cluster. That should not be a problem since, at this moment, we’re interested only in observing how metrics look like. If you go back to the configuration screen, you’ll notice that targets are configured to use `tasks.[SERVICE_NAME]` format for addresses. When a service name is prefixed with `tasks.`, Swarm returns the list of all replicas (or tasks) of a service. Let’s move to `cadvisor` metrics. ``` `1` docker container `exec` -it `$ID` `\` `2 ` curl cadvisor:8080/metrics ``` ```````````````` Partial output of the request to `cadvisor` metrics is as follows. ``` `1` ... `2` # HELP container_network_receive_bytes_total Cumulative count of bytes received `3` # TYPE container_network_receive_bytes_total counter `4` container_network_receive_bytes_total{id="/",interface="dummy0"} 0 `5` container_network_receive_bytes_total{id="/",interface="eth0"} 6.6461026e+07 `6` container_network_receive_bytes_total{id="/",interface="eth1"} 1.3054141e+07 `7` ... 
`8` container_network_receive_bytes_total{container_label_com_docker_stack_namespace\ `9` ="proxy",container_label_com_docker_swarm_node_id="zvn1kazstoa12pu3rfre9j4sw",co\ `10` ntainer_label_com_docker_swarm_service_id="gfoias8w9bf1cve5dujzzlpfh",container_\ `11` label_com_docker_swarm_service_name="proxy_swarm-listener",container_label_com_d\ `12` ocker_swarm_task="",container_label_com_docker_swarm_task_id="39hgd75s8vt051smew\ `13` 3ke4imw",container_label_com_docker_swarm_task_name="proxy_swarm-listener.1.39hg\ `14` d75s8vt051smew3ke4imw",id="/docker/f2232d2ddf801b1ff41120bb1b95213be15767fe0e6d4\ `15` 5266b3b8bba149b3634",image="vfarcic/docker-flow-swarm-listener:latest@sha256:d67\ `16` 494f08aa3efba86d5231adba8ee7281c29fd401a5f67377ee026cc436552b",interface="eth0",\ `17` name="proxy_swarm-listener.1.39hgd75s8vt051smew3ke4imw"} 112764 `18` ... ``` ``````````````` The major difference, when compared to `node-exporter`, is that `cadvisor` provides a lot of labels. They help a lot when querying metrics, and we’ll use them soon. Just like with `node-exporter`, we won’t go into details of each metric exposed through `cadvisor`. Instead, as we’re progressing towards creating a *self-healing* system, we’ll gradually increase the number of metrics we’re using and comment on them as they come. Now that we have the metrics and that Prometheus is scraping and storing them in its database, we can turn our attention to queries we can execute. ### Querying Metrics Targets are up and running, and Prometheus is scraping their data. We should generate some traffic that would let us see Prometheus query language in action. We’ll deploy *go-demo* stack. It contains a service with an API and a corresponding database. We’ll use it as a demo service that will allow us to explore better some of the metrics we can use. ``` `1` docker stack deploy `\` `2 ` -c stacks/go-demo.yml `\` `3 ` go-demo ``` `````````````` We should wait a few moments for the services from the `go-demo` stack to become operational. Please execute `docker stack ps go-demo` to confirm that all the replicas are running. Now that the demo service is running, we can explore some of the metrics we have at our disposal. ``` `1` open `"http://``$(`docker-machine ip swarm-1`)``/monitor/graph"` ``` ````````````` Please type `haproxy_backend_connections_total` in the *Expression* field, and press the *Execute* button. The result should be zero connections on the backend `go-demo_main-be8080`. Let’s spice it up by creating a bit of traffic. ``` `1` `for` `((``n``=``0``;`n<`200``;`n++`))``;` `do` `2 ` curl `"http://``$(`docker-machine ip swarm-1`)``/demo/hello"` `3` `done` ``` ```````````` We sent 200 requests to the `go-demo` service. If we go back to the Prometheus UI and repeat the execution of the `haproxy_backend_connections_total` expression, the result should be different. In my case, there are *200* backend connections from *go-demo_main-be8080*. ![Figure 4-4: HA Proxy metrics](https://github.com/OpenDocCN/freelearn-devops-pt5-zh/raw/master/docs/dop-22-tlkt/img/00018.jpeg) Figure 4-4: HA Proxy metrics We could display the data as a graph by clicking the *Graph* tab. How about memory usage? We have the data through `cadvisor` so we might just as well use it. Please type `container_memory_usage_bytes{container_label_com_docker_swarm_service_name="go-demo_main"}` in the expression field and click the *Execute* button. The result is memory usage limited to the Docker service `go-demo_main`. 
Depending on the view, you should see three values in *Console* or three lines in the *Graph* tab. They represent memory usage of the three replicas of the `go-demo_main` service. ![Figure 4-5: cAdvisor metrics](https://github.com/OpenDocCN/freelearn-devops-pt5-zh/raw/master/docs/dop-22-tlkt/img/00019.jpeg) Figure 4-5: cAdvisor metrics Finally, let’s explore one of the `node-exporter` metrics. We can, for example, display the amount of available memory from each of the nodes. Please type `sum by (instance) (node_memory_MemFree)` in the expression field and click the *Execute* button. The result is a representation of free memory for each of the nodes of the cluster. ![Figure 4-6: Graph with available memory](https://github.com/OpenDocCN/freelearn-devops-pt5-zh/raw/master/docs/dop-22-tlkt/img/00020.jpeg) Figure 4-6: Graph with available memory Now that we had a very brief overview of the ways we can query metrics, we should start using them. ### Updating Service Constraints The services we created so far are scheduled without any constraints. The exceptions are those that tie some of the services to one of the Swarm managers. Without constraints, Swarm will distribute service replicas evenly. It will place them on a node that has fewest containers. Such a strategy can be disastrous. For example, we might end up with Prometheus, ElasticSearch, and MongoDB on the same node. Since all three of them require a fair amount of memory, their performance can deteriorate quickly. At the same time, the rest of the nodes might be running very undemanding services like `go-demo`. As a result, we can end up with a very uneven distribution of replicas from the resource perspective. We cannot blame Swarm for a poor distribution of service replicas. We did not give it any information to work with. As a minimum, we should have defined how much memory it should reserve for each service as well as memory limits. Memory reservation gives Swarm a hint how much it should reserve for a service. If, for example, we specify that a replica of a service should reserve 1GB of memory, Swarm will make sure to run it on a node that has that amount available. Bear in mind that it does not compare reservation with the actual memory usage but, instead, it compares it with the reservations made for other services and the total amount of memory allocated to each node. Memory limit, on the other hand, should be set to the maximum amount we expect a service to use. If the actual usage surpasses it, the container will be shut down and, consequently, Swarm will reschedule it. Memory limit is, among other things, a useful protection against memory leaks and a way of preventing a single service abducting all the resources. Let us revisit the services we are currently running and try to set their memory reservations and limits. What should be the constraint values? How do we know how much memory should be reserved and what should be the limit? As it happens, there are quite a few different approaches we can take. We could visit a fortune teller and consult a crystal ball, or we can make a lot of very inaccurate assumptions. Either of those is a bad way of defining constraints. You might be inclined to say that databases need more memory than backend services. We can assume that those written in Java require more resources than those written in Go. There is no limit to the number of guesses we could make. However, more often than not, they will be false and inaccurate. 
If those two would be the only options, I would strongly recommend visiting a fortune teller instead guessing. Since the result will be, more or less, the same, a fortune teller can, at least, provide a fun diversion from day to day monotony and lead to very popular photos uploaded to Instagram. The correct approach is to let the services run for a while and consult metrics. Then let them run a while longer and revisit the metrics. Then wait some more and consult again. The point is that the constraints should be reviewed and, if needed, updated periodically. They should be redefined and adapted as a result of new data. It’s a task that should be repeated every once in a while. Fortunately, we can create alerts that will tell us when to revisit constraints. However, you’ll have to wait a while longer until we get there. For now, we are only concerned with the initial set of constraints. While we should let the services run for at least a couple of hours before consulting metrics, my patience is reaching the limit. Instead, we’ll imagine that enough metrics were collected and consult Prometheus. The first step is to get a list of the stacks we are currently running. ``` `1` docker stack ls ``` ``````````` The output is as follows. ``` `1` NAME SERVICES `2` exporter 3 `3` go-demo 2 `4` monitor 2 `5` proxy 2 ``` `````````` Let us consult the current memory usage of those services. Please open Prometheus’ graph screen. ``` `1` open `"http://``$(`docker-machine ip swarm-1`)``/monitor/graph"` ``` ````````` Type `container_memory_usage_bytes{container_label_com_docker_stack_namespace="exporter"}` in the *Expression* field, click the *Execute* button, and switch to the *Graph* view. If you hover over the lines in the graph, you’ll see that one of the labels is `container_label_com_docker_swarm_service_name`. It contains the name of a service allowing you to identify how much memory it is consuming. While the exact numbers will differ from one case to another, `exporter_cadvisor` should be somewhere between 20MB and 30MB, while `exporter_node-exporter` and `exporter_ha-proxy` should have lower usage that is around 10MB. With those numbers in mind, our `exporter` stack can be as follows (limited to relevant parts). ``` `1` ... `2` `3` ha-proxy: `4` ... `5` deploy: `6` ... `7` resources: `8` reservations: `9` memory: 20M `10 ` limits: `11 ` memory: 50M `12 ` ... `13` `14 ` cadvisor: `15 ` ... `16 ` deploy: `17 ` ... `18 ` resources: `19 ` reservations: `20 ` memory: 30M `21 ` limits: `22 ` memory: 50M `23` `24 ` node-exporter: `25 ` ... `26 ` deploy: `27 ` ... `28 ` resources: `29 ` reservations: `30 ` memory: 20M `31 ` limits: `32 ` memory: 50M `33 ` ... ``` ```````` We set memory reservations similar to the upper bounds of the current usage. That will help Swarm schedule the containers better, unless they are global and have to run everywhere. More importantly, it allows Swarm to calculate future schedules by excluding these reservations from the total available memory. Memory limits, on the other hand, will provide limitations on how much memory containers created from those services can use. Without memory limits, a container might “go wild” and abduct all the memory on a node for itself. Good example are in-memory databases like Prometheus. If we would deploy it without any limitation, it could easily take over all the resources leaving the rest of the services running on the same node struggling. Let’s deploy the updated version of the `exporter` stack. 
``` `1` docker stack deploy `\` `2 ` -c stacks/exporters-mem.yml `\` `3 ` exporter ``` ``````` Since most of the stack are global services, we will not see much difference in the way Swarm schedules them. No matter the reservations, a replica will run on each node when the mode is global. Later on, we’ll see more benefits behind memory reservations. For now, the important thing to note is that Swarm has a better picture about the reserved memory on each node and will be able to do future scheduling with more precision. We’ll continue with the rest of the stacks. The next in line is `go-demo`. Please go back to Prometheus’ Graph screen, type `container_memory_usage_bytes{container_label_com_docker_stack_namespace="go-demo"}` in the *Expression* field, and click the *Execute* button. The current usage of `go-demo_db` should be between 30MB and 40MB while `go-demo_main` is probably below 5MB. We’ll update the stack accordingly. The new `go-demo` stack is as follows (limited to relevant parts). ``` `1` ... `2` main: `3` ... `4` deploy: `5` ... `6` resources: `7` reservations: `8` memory: 5M `9` limits: `10 ` memory: 10M `11` `12 ` db: `13 ` ... `14 ` deploy: `15 ` resources: `16 ` reservations: `17 ` memory: 40M `18 ` limits: `19 ` memory: 80M `20` ... ``` `````` Now we can deploy the updated version of the `go-demo` stack. ``` `1` docker stack deploy `\` `2 ` -c stacks/go-demo-mem.yml `\` `3 ` go-demo ``` ````` Two stacks are done, and two are still left to be updated. The `monitor` and `proxy` stacks should follow the same process. I’m sure that by now you can query Prometheus by yourself. You’ll notice that `monitor_monitor` service (Prometheus) is the one that uses the most memory (over 100MB). Since we can expect Prometheus memory usage to rise with time, we should be generous with its reservations and set it to 500MB. Similarly, a reasonable limit could be 800MB. The rest of the services are very moderate with their memory consumption. Once you’re done exploring the rest of the stacks through Prometheus, the only thing left is to deploy of the updated versions. ``` `1` `DOMAIN``=``$(`docker-machine ip swarm-1`)` `\` `2 ` docker stack deploy `\` `3 ` -c stacks/docker-flow-monitor-mem.yml `\` `4 ` monitor `5` `6` docker stack deploy `\` `7 ` -c stacks/docker-flow-proxy-mem.yml `\` `8 ` proxy ``` ```` Now that our stacks are better-defined thanks to metrics, we can proceed and try to improve our queries through memory reservations and limits. ### Using Memory Reservations and Limits in Prometheus Metrics obtained through *cAdvisor* are not restricted to actual usage. We have, among others, metrics based on container specs. We can, for example, retrieve memory limits with the metric `container_spec_memory_limit_bytes`. Please type `container_spec_memory_limit_bytes{container_label_com_docker_stack_namespace!=""}` in the *Expression* field and click the *Execute* button. The result should be straight lines that represent memory limits we defined in our stacks. The usage of the `container_label_com_docker_stack_namespace` label is important. We used it to filter the metrics so that only those that come from the stacks are included. That way, we excluded root metrics from `cAdvisor` that provide summarized totals. In Prometheus, memory limits are not very useful in themselves. However, if we combine them with the actual memory usage, we can get percentages that can provide indications of the health of our system. 
Please type `container_memory_usage_bytes{container_label_com_docker_stack_namespace!=""} / container_spec_memory_limit_bytes{container_label_com_docker_stack_namespace!=""} * 100` in the *Expression* field and click the *Execute* button. ![Figure 4-7: Graph percentages based on memory limits and the actual usage](https://github.com/OpenDocCN/freelearn-devops-pt5-zh/raw/master/docs/dop-22-tlkt/img/00021.jpeg) Figure 4-7: Graph percentages based on memory limits and the actual usage The result consists of percentages based on memory limits and the actual usage. The should all be below 60%. We will leverage this information later when we start working on alerts. ### What Now? We did not go deep into metrics and queries. There are too many of them. Listing each metric would be the repetition of the `HELP` entries that already explain them (even though often not in much detail). More importantly, I believe that the best way to learn something is through a practical usage. We’ll use those metrics soon when we start creating alerts, and you will have plenty of opportunities to get a better understanding how they work. The same holds true for queries. They will be indispensable for creating alerts and will be explained in more details in the next chapter. Still, even though we’ll go through quite a few metrics and queries, the book will not provide a detailed documentation of every combination you can apply. [Querying Prometheus](https://prometheus.io/docs/querying/basics/) is a much better place to learn how queries work. Instead, we’ll focus on practical hands-on experience. Now it’s time for another break. Remove the VMs, grab a coffee, do something fun, and come back fresh. Alerts are coming next. ``` `1` docker-machine rm -f `\` `2 ` swarm-1 swarm-2 swarm-3 ``` ```` ````` `````` ``````` ```````` ````````` `````````` ``````````` ```````````` ````````````` `````````````` ``````````````` ```````````````` ````````````````` `````````````````` ``````````````````` ```````````````````` ````````````````````` `````````````````````` ``````````````````````` ```````````````````````` `````````````````````````

Chapter 9: Defining Cluster-Wide Alerts

A common mistake is to rely on dashboards as the primary means of discovering problems. Dashboards have their place in the overall picture and are an indispensable part of any monitoring solution. However, they are not as important as we tend to think.

The purpose of monitoring systems is not to replace Netflix. They are not meant to be watched continuously. Instead, they should collect data and create alerts when specific conditions are met. Those alerts should try to interact with the system and trigger a set of actions that fix the problem automatically. Only when the system cannot repair the problem should a notification be sent to humans. In other words, we should strive to create a self-healing system that calls for a "doctor" (us, humans) only when it cannot heal itself.

Dashboards are very useful when we know there is a problem with the system. If the system is working correctly, looking at dashboards is a waste of time that would be better spent improving the system.

Imagine a Slack notification saying "there is no available memory in the cluster, and the system failed to create additional virtual machines." Note the second part of that sentence. The system detected a problem and failed to fix it. Something went wrong, the system could not scale, and it failed to create new VMs. That is a good example of the type of notification that should be sent to a human operator. If the system could have healed itself, there would have been no need to send the Slack notification at all.

We should look at dashboards only after we receive a message that the system could not heal itself. Until then, everything is fine, and we can continue working on the next big improvement to the system. Once such a message arrives, we should visit one or two dashboards and try to get a high-level view of the system. Sometimes the information dashboards provide is enough. More often, though, we need more. We need to go to Prometheus and start querying for additional information. Finally, once we find the culprit, we can create a fix, test it, apply it to production, improve the self-healing system so that the same problem is fixed automatically the next time it occurs, and write a "post-mortem" report.
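Such an investigation does not have to go through the browser. As a hedged illustration (assuming the `/monitor` prefix used throughout these chapters, and `jq` installed locally), an ad-hoc PromQL query can be sent straight to Prometheus' HTTP API from a terminal:

```
# Query the current memory usage of the go-demo stack through Prometheus' HTTP API.
# The /monitor prefix matches the Docker Flow Proxy setup used in this book.
curl -sG "http://$(docker-machine ip swarm-1)/monitor/api/v1/query" \
    --data-urlencode 'query=container_memory_usage_bytes{container_label_com_docker_stack_namespace="go-demo"}' \
    | jq '.data.result[] | {service: .metric.container_label_com_docker_swarm_service_name, bytes: .value[1]}'
```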

As you can see, everything starts with a single alert, and that is what this chapter focuses on. For now, we will not distinguish between alerts the system can act upon to correct itself and those that should send notifications to human operators. That discussion comes later. For now, we'll concentrate on creating alerts without defining the events they should trigger.

### Creating The Cluster And Deploying Services

We'll start by re-creating the cluster and deploying the stacks we used in the previous chapter.

```
chmod +x scripts/dm-swarm-05.sh

./scripts/dm-swarm-05.sh

eval $(docker-machine env swarm-1)
```
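If you are curious what a `dm-swarm-XX.sh` script of this kind does, the sketch below is a rough, hedged approximation based purely on the description that follows (create the Docker Machines, form the Swarm, create the networks); it is not the actual script from the repository.

```
# Approximate outline of a cluster-creation script (an assumption, not the real file).
for i in 1 2 3; do
    docker-machine create -d virtualbox swarm-$i
done

eval $(docker-machine env swarm-1)
docker swarm init --advertise-addr $(docker-machine ip swarm-1)

TOKEN=$(docker swarm join-token -q manager)
for i in 2 3; do
    eval $(docker-machine env swarm-$i)
    docker swarm join --token $TOKEN $(docker-machine ip swarm-1):2377
done

eval $(docker-machine env swarm-1)
docker network create -d overlay proxy
docker network create -d overlay monitor
# ...followed by `docker stack deploy` commands for the stacks used in the chapter.
```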

We executed the `dm-swarm-05.sh` script which, in turn, created a Swarm cluster composed of Docker Machines, created the networks, and deployed the stacks. Now we should wait a few moments until all the services in the `monitor` stack are up and running. Please use the `docker stack ps monitor` command to confirm that the status of all the services in the stack is *Running*. Finally, we'll confirm that everything is deployed correctly by opening Prometheus in a browser.

```
open "http://$(docker-machine ip swarm-1)/monitor"
```

Now the state of our cluster is the same as it was at the end of the previous chapter, and we can proceed towards creating alerts.

### Creating Alerts Based On Metrics

Let us create the first alert. We'll update our `go-demo_main` service by adding a few labels.

```
docker service update \
    --label-add com.df.alertName=mem \
    --label-add com.df.alertIf='container_memory_usage_bytes{container_label_com_docker_swarm_service_name="go-demo_main"} > 20000000' \
    go-demo_main
```

The label `com.df.alertName` is the name of the alert. It will be prefixed with the name of the service stripped of underscores and dashes (`godemomainmem`). That way, a unique alert name is guaranteed. The second label (`com.df.alertIf`) is more important. It defines the expression. Translated into plain words, it takes the memory usage limited to the `go-demo_main` service and checks whether it is bigger than 20MB (20000000 bytes). An alert will be launched if the expression is true.

Let's take a look at the Prometheus configuration.

```
open "http://$(docker-machine ip swarm-1)/monitor/config"
```

As you can see, the `alert.rules` file was added to the `rule_files` section.

![Figure 5-1: Prometheus configuration with alert rules](https://github.com/OpenDocCN/freelearn-devops-pt5-zh/raw/master/docs/dop-22-tlkt/img/00022.jpeg)

Figure 5-1: Prometheus configuration with alert rules

Let us explore the rules we created so far.

```
open "http://$(docker-machine ip swarm-1)/monitor/rules"
```

As you can see, the expression we specified with the `com.df.alertIf` label reached *Docker Flow Monitor*.

![Figure 5-2: Prometheus rule with go-demo memory usage](https://github.com/OpenDocCN/freelearn-devops-pt5-zh/raw/master/docs/dop-22-tlkt/img/00023.jpeg)

Figure 5-2: Prometheus rule with go-demo memory usage

Finally, let's take a look at the alerts.

```
open "http://$(docker-machine ip swarm-1)/monitor/alerts"
```

The *godemomainmem* alert is green, meaning that none of the `go-demo_main` containers are using over 20MB of memory. Please click the *godemomainmem* link to expand the alert definition.

![Figure 5-3: Prometheus alerts with go-demo memory usage](https://github.com/OpenDocCN/freelearn-devops-pt5-zh/raw/master/docs/dop-22-tlkt/img/00024.jpeg)

Figure 5-3: Prometheus alerts with go-demo memory usage

The alert is green, meaning that the service uses less than 20MB of memory. If we'd like to see how much memory it uses, we need to go back to the graph screen.

```
open "http://$(docker-machine ip swarm-1)/monitor/graph"
```

Once inside the graph screen, please type the expression that follows, and press the *Execute* button.
```
container_memory_usage_bytes{container_label_com_docker_swarm_service_name="go-demo_main"}
```

The exact value will vary from one case to another. No matter which one you got, it should be below 20MB.

Let's change the alert so that it is triggered when the `go-demo_main` service uses more than 1MB.

```
docker service update \
    --label-add com.df.alertName=mem \
    --label-add com.df.alertIf='container_memory_usage_bytes{container_label_com_docker_swarm_service_name="go-demo_main"} > 1000000' \
    go-demo_main
```

Since we are updating the same service and using the same `alertName`, the previous alert definition was overwritten with the new one.

Let's go back to the alerts screen.

```
open "http://$(docker-machine ip swarm-1)/monitor/alerts"
```

This time, the alert is red, meaning that the condition is fulfilled. If it is still green, please wait for a few moments and refresh your screen. Our service is using more than 1MB of memory and, therefore, the `ALERT IF` statement is fulfilled, and the alert is firing. Please click the *godemomainmem* link to expand the alert and see more details.

![Figure 5-4: Prometheus alerts screen with go-demo memory usage in firing state](https://github.com/OpenDocCN/freelearn-devops-pt5-zh/raw/master/docs/dop-22-tlkt/img/00025.jpeg)

Figure 5-4: Prometheus alerts screen with go-demo memory usage in firing state

The flow of the events can be described through figure 5-5.

![Figure 5-5: The flow of the events that result in a service alert being fired](https://github.com/OpenDocCN/freelearn-devops-pt5-zh/raw/master/docs/dop-22-tlkt/img/00026.jpeg)

Figure 5-5: The flow of the events that result in a service alert being fired

Let's take a look at the graph screen.

```
open "http://$(docker-machine ip swarm-1)/monitor/graph"
```

Let us quickly review the `go-demo_main` service memory reservations and limits. They might be useful in defining alerts. Please type the expression that follows, and press the *Execute* button.

```
container_spec_memory_limit_bytes{container_label_com_docker_swarm_service_name="go-demo_main"}
```

As you can see, the memory limit is set to 10MB. Soon, we'll use those metrics to our benefit.

Next, we'll check the metrics of the "real" memory usage of the service. Please type the expression that follows, and press the *Execute* button.

```
container_memory_usage_bytes{container_label_com_docker_swarm_service_name="go-demo_main"}
```

Memory consumption will vary from one case to another. In my case, it ranges from 1MB to 3.5MB.

![Figure 5-6: go-demo memory usage](https://github.com/OpenDocCN/freelearn-devops-pt5-zh/raw/master/docs/dop-22-tlkt/img/00027.jpeg)

Figure 5-6: go-demo memory usage

If we go back to the `alertIf` label we specified, there is an apparent duplication of data. Both the `alertIf` label and the service reservations are defining the thresholds of the service. As you probably already know, duplication is not a good idea because it increases the chances of an error and complicates future updates that would need to be performed in multiple places. A better definition of the `alertIf` statement is as follows.
```
docker service update \
    --label-add com.df.alertName=mem_limit \
    --label-add com.df.alertIf='container_memory_usage_bytes{container_label_com_docker_swarm_service_name="go-demo"}/container_spec_memory_limit_bytes{container_label_com_docker_swarm_service_name="go-demo"} > 0.8' \
    go-demo_main
```

This time we defined that the `mem_limit` alert should be triggered if memory usage is higher than 80% of the memory limit. We avoided duplicating the value that is already defined as the service's memory limit. That way, if, at some later stage, we change the value of the `--limit-memory` argument, the alert will continue working properly.

Let's confirm that *Docker Flow Swarm Listener* sent the notification and that *Docker Flow Monitor* was reconfigured accordingly.

```
open "http://$(docker-machine ip swarm-1)/monitor/alerts"
```

Please click the *godemo_main_mem_limit* link to see the new definition of the alert.

![Figure 5-7: go-demo alert based on memory limit and usage](https://github.com/OpenDocCN/freelearn-devops-pt5-zh/raw/master/docs/dop-22-tlkt/img/00028.jpeg)

Figure 5-7: go-demo alert based on memory limit and usage

### Defining Multiple Alerts For A Service

In many cases, one alert per service is not enough. We need to be able to define multiple specifications. *Docker Flow Monitor* allows us to do that by adding an index to the labels. We can, for example, define the labels `com.df.alertName.1`, `com.df.alertName.2`, and `com.df.alertName.3`. As a result, *Docker Flow Monitor* would create three alerts.

Let's see it in action. We'll update the `node-exporter` service in the `exporter` stack so that it registers two alerts.

```
docker service update \
    --label-add com.df.alertName.1=mem_load \
    --label-add com.df.alertIf.1='(sum by (instance) (node_memory_MemTotal) - sum by (instance) (node_memory_MemFree + node_memory_Buffers + node_memory_Cached)) / sum by (instance) (node_memory_MemTotal) > 0.8' \
    --label-add com.df.alertName.2=diskload \
    --label-add com.df.alertIf.2='(node_filesystem_size{fstype="aufs"} - node_filesystem_free{fstype="aufs"}) / node_filesystem_size{fstype="aufs"} > 0.8' \
    exporter_node-exporter
```

This time, the `alertName` and `alertIf` labels got an index suffix (e.g. `.1` and `.2`). The first one (`mem_load`) will create an alert if memory usage is over 80% of the total available memory. The second alert will fire if disk usage is over 80%.

Let's explore the *alerts* screen.

```
open "http://$(docker-machine ip swarm-1)/monitor/alerts"
```

As you can see, two new alerts were registered.

![Figure 5-8: Node exporter alerts](https://github.com/OpenDocCN/freelearn-devops-pt5-zh/raw/master/docs/dop-22-tlkt/img/00029.jpeg)

Figure 5-8: Node exporter alerts

The flow of the events can be described through figure 5-9.

![Figure 5-9: The flow of the events that result in an exporter alert being fired](https://github.com/OpenDocCN/freelearn-devops-pt5-zh/raw/master/docs/dop-22-tlkt/img/00030.jpeg)

Figure 5-9: The flow of the events that result in an exporter alert being fired

### Postponing Alerts Firing

Firing an alert as soon as the condition is met is often not the best idea. The conditions of the system might change temporarily and go back to "normal" shortly afterward. A spike in memory is not bad in itself.
We should not worry if memory utilization jumps to 95% only to go back to 70% a few moments later. On the other hand, if it stays over 80% for, let's say, five minutes, some action should be taken.

We'll modify the `go-demo_main` service so that it fires an alert only if the memory threshold is reached and the condition persists for at least thirty seconds. The relevant parts of the `go-demo` stack file are as follows.

```
services:

  main:
    ...
    deploy:
      ...
      labels:
        ...
        - com.df.alertName=mem_limit
        - com.df.alertIf=container_memory_usage_bytes{container_label_com_docker_swarm_service_name="go-demo"}/container_spec_memory_limit_bytes{container_label_com_docker_swarm_service_name="go-demo"} > 0.8
        - com.df.alertFor=30s
...
```

We set the `com.df.alertName` and `com.df.alertIf` labels to the same values as those we used to update the service. The new addition is the `com.df.alertFor` label that specifies the period Prometheus should wait before firing an alert. In this case, the condition would need to persist for thirty seconds before the alert is fired. Until then, the alert will be in the pending state.

Let's deploy the new stack.

```
docker stack deploy \
    -c stacks/go-demo-alert-long.yml \
    go-demo
```

After a few moments, the `go-demo_main` service will be rescheduled, and the `alert` labels will be propagated to the Prometheus instance. Let's take a look at the alerts screen.

```
open "http://$(docker-machine ip swarm-1)/monitor/alerts"
```

The `go-demo` memory limit alert with the `FOR` statement set to thirty seconds is registered.

![Figure 5-10: go-demo memory limit alert with the `FOR` statement](https://github.com/OpenDocCN/freelearn-devops-pt5-zh/raw/master/docs/dop-22-tlkt/img/00031.jpeg)

Figure 5-10: go-demo memory limit alert with the `FOR` statement

We should test whether the alert indeed works. We'll temporarily decrease the threshold to five percent. That should certainly trigger the alert.

```
docker service update \
    --label-add com.df.alertIf='container_memory_usage_bytes{container_label_com_docker_swarm_service_name="go-demo_main"}/container_spec_memory_limit_bytes{container_label_com_docker_swarm_service_name="go-demo_main"} > 0.05' \
    go-demo_main
```

Let us take another look at the alerts screen.

```
open "http://$(docker-machine ip swarm-1)/monitor/alerts"
```

If you opened the screen within thirty seconds of the update, you should see that there are three alerts in the *PENDING* state. Once thirty seconds expire, the status will change to *FIRING*. Unfortunately, there is no destination to which Prometheus can fire those alerts. We'll fix that in the next chapter. For now, we'll have to be content with simply observing the alerts from Prometheus.

![Figure 5-11: go-demo alerts in the PENDING state](https://github.com/OpenDocCN/freelearn-devops-pt5-zh/raw/master/docs/dop-22-tlkt/img/00032.jpeg)

Figure 5-11: go-demo alerts in the PENDING state

### Defining Additional Alert Information Through Labels And Annotations

We might want to specify supplementary information for our alerts. We can accomplish that through the usage of alert labels and annotations. The alert labels clause allows specifying a set of additional labels to be attached to the alert.
The annotations clause specifies another set of labels that are not identifying for an alert instance. They are used to store longer additional information such as alert descriptions or runbook links.

We can, for example, update our `go-demo` stack by adding the service labels that follow.

```
...
services:

  main:
    ...
    deploy:
      ...
      labels:
        ...
        - com.df.alertLabels=severity=high,receiver=system
        - com.df.alertAnnotations=summary=Service memory is high,description=Do something or start panicking
      ...
```

Let's deploy the updated stack.

```
docker stack deploy \
    -c stacks/go-demo-alert-info.yml \
    go-demo
```

A few moments later, the alert definition reached Prometheus, and we can explore it from a browser.

```
open "http://$(docker-machine ip swarm-1)/monitor/alerts"
```

Please expand the `godemo_main_mem_limit` alert, and you'll see that it contains the labels and annotations we specified through service labels. Besides serving as additional information, alert labels and annotations can be used with *Alertmanager*, which we'll explore in the next chapter. For now, just remember that they are available.

![Figure 5-12: go-demo alerts with labels and annotations](https://github.com/OpenDocCN/freelearn-devops-pt5-zh/raw/master/docs/dop-22-tlkt/img/00033.jpeg)

Figure 5-12: go-demo alerts with labels and annotations

### Using Shortcuts To Define Alerts

Setting alerts as service labels is great but, as you probably noticed, a bit cumbersome. Alert conditions can get pretty long and repetitive. I, for one, got tired of writing the same statement over and over again. So, I created shortcuts that accomplish the same functionality. Let's see them in action.

The modified version of the `go-demo` stack definition is as follows (restricted to relevant parts).

```
version: '3'

services:

  main:
    ...
    deploy:
      ...
      labels:
        - com.df.alertIf=@service_mem_limit:0.8
      ...
```

We simplified the definition by replacing the expression that follows with `com.df.alertIf=@service_mem_limit:0.8`.

```
com.df.alertIf=container_memory_usage_bytes{container_label_com_docker_swarm_service_name="go-demo_main"}/container_spec_memory_limit_bytes{container_label_com_docker_swarm_service_name="go-demo_main"} > 0.8
```

Similarly, the modified version of the `exporter` stack definition is as follows (limited to relevant parts).

```
version: "3"

services:

  ...
  node-exporter:
    ...
    deploy:
      ...
      labels:
        ...
        - com.df.alertIf.1=@node_mem_limit:0.8
        ...
        - com.df.alertIf.2=@node_fs_limit:0.8
        ...
```

Just as with `go-demo`, we simplified the stack definition by replacing the `alertIf` labels with shortcut values.

Now we can deploy the modified stacks.

```
docker stack deploy \
    -c stacks/exporters-alert.yml \
    exporter

docker stack deploy \
    -c stacks/go-demo-alert.yml \
    go-demo
```

Let's check the outcome in the Prometheus *Alerts* screen.

```
open "http://$(docker-machine ip swarm-1)/monitor/alerts"
```

If you check the details of the alerts, you'll notice that they are the same as they were before. The shortcuts were sent to Prometheus and expanded into their full syntax.

### What Now?

We are moving forward. Alerts are an important step towards a self-healing system. However, at this moment, we can only see them.
They are not firing any events. Prometheus is aware of the conditions that should create an alert but is unaware of what to do with them. We'll fix that in the next chapter.

Take another break. Remove the machines we created, do something fun, and come back fresh. A brain needs a rest every once in a while.

```
docker-machine rm -f \
    swarm-1 swarm-2 swarm-3
```

Chapter 10: Alerting Humans

Prometheus alerts are useful by themselves, but not that useful unless you plan to spend all your time in front of the *Alerts* screen. There are better things to stare at. You could, for example, choose to watch Netflix. It is much more entertaining than a Prometheus screen. However, before you start watching Netflix during working hours, we need to find a way to make sure you receive a notification when an alert fires.

Before we proceed, I must stress that sending alerts to humans (operators and sysadmins) is the last resort. We should receive alerts only when the system cannot fix a problem by itself. However, at the beginning we do not yet have a self-healing system. The approach we'll take is to send every alert to a human. It is a quick fix. From there on, we'll work on building a system that receives those alerts instead of us. That will happen on a case-by-case basis. We'll create a system that sends all the alerts, and then start exploring each case. If we can make the system accept an alert and heal itself, we'll stop sending it to humans. On the other hand, if we cannot add that scenario to the system, it will continue alerting us. In other words, all alerts go to humans unless they are incorporated into the self-healing system we are about to build.

Where should we send alert messages? Slack is probably a good starting point. Even if you don't use Slack, the principles we'll explore are the same no matter whether your endpoint is email, HangOuts, Messenger, HipChat, SMS, or carrier pigeons. As long as the endpoint has an API, we should be able to leverage it. That might be easier for some than for others. Carrier pigeons probably don't have an API yet.
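To make the "as long as it has an API" point concrete, here is a minimal, hedged example of the kind of call we are talking about: posting a message to a Slack Incoming Webhook. The URL is a placeholder; obtaining a real Webhook URL for your own team is covered later in this chapter.

```
# Post a plain message to a Slack Incoming Webhook (placeholder URL).
curl -X POST \
    -H 'Content-Type: application/json' \
    -d '{"text": "Memory usage of go-demo_main is above the defined limit"}' \
    "https://hooks.slack.com/services/YOUR/WEBHOOK/URL"
```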

### Creating The Cluster And Deploying Services

We'll start by re-creating the cluster and deploying the stacks we used in the previous chapter.

```
chmod +x scripts/dm-swarm-06.sh

./scripts/dm-swarm-06.sh

eval $(docker-machine env swarm-1)
```

We executed the `dm-swarm-06.sh` script which, in turn, created a Swarm cluster composed of Docker Machines, created the networks, and deployed the stacks. Now we should wait a few moments until all the services in the `monitor` stack are up and running. Please use the `docker stack ps monitor` command to confirm that the status of all the services in the stack is *Running*. Finally, we'll confirm that everything is deployed correctly by opening Prometheus in a browser.

```
open "http://$(docker-machine ip swarm-1)/monitor"
```

Now the state of our cluster is the same as it was at the end of the previous chapter, and we can proceed towards setting up Alertmanager.

### Setting Up Alertmanager

Since we are already using Prometheus, it makes sense to deploy Prometheus' companion *Alertmanager*. It will receive alerts, filter them, and forward them to the endpoints we define. Slack will be the first.

The Alertmanager Docker image expects us to define a configuration file with routes, receivers, and a few other things. One possible configuration is as follows.

```
route:
  receiver: "slack"
  repeat_interval: 1h

receivers:
  - name: "slack"
    slack_configs:
      - send_resolved: true
        text: "Something horrible happened! Run for your lives!"
        api_url: "https://hooks.slack.com/services/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"
```

The configuration defines the `route` with `slack` as the receiver of the alerts. In the `receivers` section, we specified that we want resolved notifications (besides alerts), creative text, and the Slack API URL. As a result, alerts will be posted to the *df-monitor-tests* channel in the DevOps20 team Slack. Please sign up through the [DevOps20 Registration page](http://slack.devops20toolkit.com/) and make sure you join the *df-monitor-tests* channel. This configuration should be more than enough for demo purposes. Please consult the [alerting documentation](https://prometheus.io/docs/alerting/configuration/) for more information about *Alertmanager* configuration options.

Next, we'll take a quick look at the [alert-manager-slack.yml](https://github.com/vfarcic/docker-flow-monitor/blob/master/stacks/alert-manager-slack.yml) stack.

```
version: "3"

services:

  alert-manager:
    image: prom/alertmanager
    ports:
      - 9093:9093
    networks:
      - monitor
    secrets:
      - alert_manager_config
    command: -config.file=/run/secrets/alert_manager_config -storage.path=/alertmanager

networks:
  monitor:
    external: true

secrets:
  alert_manager_config:
    external: true
```

The stack is very straightforward. The only thing worth noting is that we are exposing port `9093` only for demo purposes. Later on, when we integrate it with *Docker Flow Monitor*, they will communicate through the `monitor` network without the need to expose any ports. We need port `9093` to demonstrate manual triggering of alerts through *Alertmanager*. We'll get rid of it later on.

If you take a look at the `command`, you'll notice that it specifies the configuration file that resides in the `/run/secrets/` directory. It is an in-memory file system where Docker stores secrets. We defined `alert_manager_config` as an external secret. Please visit [Alerting Rules](https://prometheus.io/docs/alerting/rules/) for more information.

Let's create the secret.
```
echo 'route:
  receiver: "slack"
  repeat_interval: 1h

receivers:
  - name: "slack"
    slack_configs:
      - send_resolved: true
        text: "Something horrible happened! Run for your lives!"
        api_url: "https://hooks.slack.com/services/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"
' | docker secret create alert_manager_config -
```

Now that the secret with the *Alertmanager* configuration is created, we can deploy the `alert-manager-slack.yml` stack.

```
docker stack deploy \
    -c stacks/alert-manager-slack.yml \
    alert-manager
```

Please wait a few moments until the service is deployed. You can monitor the status through the `docker stack ps alert-manager` command.

Now we can send a manual request to the *Alertmanager*.

```
curl -H "Content-Type: application/json" \
    -d '[{"labels":{"alertname":"My Fancy Alert"}}]' \
    $(docker-machine ip swarm-1):9093/api/v1/alerts
```

Before you execute the request, please change the *My Fancy Alert* name to something else. That way you'll be able to recognize your alert among those submitted by other readers. The output should be as follows.

```
{
  "status": "success"
}
```

Please open the *df-monitor-tests* channel in the *DevOps20* Slack team and observe that a new notification was posted.

Now that we confirmed that `alert-manager` works when triggered manually, we'll remove the stack and deploy the version integrated with *Docker Flow Monitor*.

```
docker stack rm alert-manager
```

We'll deploy the `docker-flow-monitor-slack.yml` stack. It contains the `monitor` and `swarm-listener` services we're already familiar with and adds `alert-manager`. The only change to the `monitor` service is the addition of the environment variable `ARG_ALERTMANAGER_URL=http://alert-manager:9093`. It defines the address and the port of `alert-manager`. The definition of the `alert-manager` service is as follows.

```
  monitor:
    image: vfarcic/docker-flow-monitor
    environment:
      ...
      - ARG_ALERTMANAGER_URL=http://alert-manager:9093
    ...

  alert-manager:
    image: prom/alertmanager
    networks:
      - monitor
    secrets:
      - alert_manager_config
    command: -config.file=/run/secrets/alert_manager_config -storage.path=/alertmanager
  ...
```

We added the environment variable `ARG_ALERTMANAGER_URL` to the `monitor` service. Prometheus will use it as the address to which to send alerts. Since both services are connected through the same `monitor` network, all we had to specify is the name of the service and the internal port. The `alert-manager` service is the same as the one we deployed earlier except that the ports are removed. There's no need to publish them when services communicate through an Overlay network.

Let's deploy the new stack.

```
DOMAIN=$(docker-machine ip swarm-1) \
    docker stack deploy \
    -c stacks/docker-flow-monitor-slack.yml \
    monitor
```

We should confirm that `alert-manager` is correctly configured through the environment variable `ARG_ALERTMANAGER_URL`.

```
open "http://$(docker-machine ip swarm-1)/monitor/flags"
```

As you can see from the *flags* screen, *alertmanager.url* is now part of the Prometheus configuration.
Since both are connected through the same network (`monitor`), the address is the name of the service.

![Figure 6-1: Prometheus flags screen with values passed through environment variables](https://github.com/OpenDocCN/freelearn-devops-pt5-zh/raw/master/docs/dop-22-tlkt/img/00034.jpeg)

Figure 6-1: Prometheus flags screen with values passed through environment variables

Let us generate an alert.

```
docker service update \
    --label-add com.df.alertIf=@service_mem_limit:0.1 \
    go-demo_main
```

We updated the `main` service from the `go-demo` stack by adding the `alertIf` label. It defines the `mem_limit` alert that will be triggered if the service exceeds 10% of the memory limit. In other words, it will almost certainly fire the alert.

Let's open the alerts screen.

```
open "http://$(docker-machine ip swarm-1)/monitor/alerts"
```

As you can see, the alert is red (if it isn't, wait a few moments and refresh your screen). Since we configured *Alertmanager*, the alert was already sent to it and, from there, forwarded to Slack. Please open the *df-monitor-tests* channel in the *DevOps20* Slack team and observe that a new notification was posted.

![Figure 6-2: Slack message generated by Alertmanager](https://github.com/OpenDocCN/freelearn-devops-pt5-zh/raw/master/docs/dop-22-tlkt/img/00035.jpeg)

Figure 6-2: Slack message generated by Alertmanager

As you can see, the message is not very well defined. The title is anything but understandable, the text of the message is the same no matter which alert was fired, and the link does not lead back to Prometheus but to an internal address. We'll fix all those problems soon. For now, the important thing is that we managed to send an alert to Slack. The flow of the events is described through figure 6-3.

![Figure 6-3: The flow of the events that results in a Slack message being created](https://github.com/OpenDocCN/freelearn-devops-pt5-zh/raw/master/docs/dop-22-tlkt/img/00036.jpeg)

Figure 6-3: The flow of the events that results in a Slack message being created

We'll restore the `go-demo` alert to its original state (used memory over 80%).

```
docker service update \
    --label-add com.df.alertIf=@service_mem_limit:0.8 \
    go-demo_main
```

A few moments later, we can observe that the alert is green again.

```
open "http://$(docker-machine ip swarm-1)/monitor/alerts"
```

If the alert is still not green, please wait for a while and refresh the screen.

Since we specified `send_resolved: true` in the `alert-manager` config, we got another notification. This time, the message states that the issue is resolved.

The only thing left is to create your own *Alertmanager* configuration. You'll need a Webhook URL if you choose to send alerts to your team's Slack. The instructions for obtaining it are as follows.

Please log in to your team's Slack, open the settings menu by clicking the team name, and select *Apps & integrations*.

![Figure 6-4: Team setting Slack menu](https://github.com/OpenDocCN/freelearn-devops-pt5-zh/raw/master/docs/dop-22-tlkt/img/00037.jpeg)

Figure 6-4: Team setting Slack menu

You will be presented with the *App Directory* screen. Click the *Manage* link located in the top-right corner of the screen followed by the *Custom Integrations* item in the left-hand menu. Select *Incoming WebHooks* and click the *Add Configuration* button. Choose the channel where alerts will be posted and click the *Add Incoming WebHooks integration* button.
Copy the *Webhook URL*. You'll need it when you customize the solution to your needs. Now that you know how to get the *Webhook URL*, feel free to replace the one used in the examples that follow. That does not mean that you cannot run them as they are. You're free to use the *DevOps20* team Slack if that suits you better.

### Using Templates In Alertmanager Configuration

Defining the *Alertmanager* configuration using static text is not very useful if we're running more than one service. Instead, we should employ templates that will help us customize messages. While we're at it, we can also fix the broken link from the message and customize the title.

Before we proceed, let us remove the `monitor_alert-manager` service and the `alert_manager_config` secret. That will allow us to deploy it again with better-defined messages.

```
docker service rm monitor_alert-manager

docker secret rm alert_manager_config
```

We'll create a new secret with the complete *Alertmanager* configuration.

```
echo "route:
  group_by: [service]
  receiver: 'slack'
  repeat_interval: 1h

receivers:
  - name: 'slack'
    slack_configs:
      - send_resolved: true
        title: '[{{ .Status | toUpper }}] {{ .GroupLabels.service }} service is in danger!'
        title_link: 'http://$(docker-machine ip swarm-1)/monitor/alerts'
        text: '{{ .CommonAnnotations.summary}}'
        api_url: 'https://hooks.slack.com/services/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'
" | docker secret create alert_manager_config -
```

Previously, we specified only `text` and `api_url` and let *Alertmanager* fill in the blanks. This time, we added `title` and `title_link` to the mix.

We used `{{ .GroupLabels.service }}` to specify the name of the service inside the `title`. Group labels are defined in the `route` section. Even though we could use "normal" labels, group labels are easier since they are unique for all the alerts coming from, in this case, the same service. The title is prefixed with the alert `status` in upper case. That should give us a clear indication whether the alert is fired or resolved.

The previous configuration produced a link that did not work. That was to be expected since the communication goes through internal networking. This time, we made sure that the `title_link` is correct and points to one of the servers in the cluster.

Finally, the `text` is the same as the alert summary defined as one of the alert `ANNOTATIONS`.

Please visit the [Notification Template Reference](https://prometheus.io/docs/alerting/notifications/) for more info about the templates that can be used when configuring Alertmanager.

If everything works as expected, the new Alertmanager config will result in clearer messages customized for each service. Let us deploy the stack and, with it, `alert-manager` with the new configuration.

```
DOMAIN=$(docker-machine ip swarm-1) \
    docker stack deploy \
    -c stacks/docker-flow-monitor-slack.yml \
    monitor
```

We'll test the alert in the same way as before by decreasing the threshold.

```
docker service update \
    --label-add com.df.alertIf=@service_mem_limit:0.1 \
    go-demo_main
```

A few moments later, Prometheus will change the state of the alert to pending and, a while later, to firing. We can observe those changes by opening the *Alerts* screen.
```
open "http://$(docker-machine ip swarm-1)/monitor/alerts"
```

If you open the Slack channel *#df-monitor-tests*, you'll notice that the message is much better this time.

The only thing left is to confirm that we're receiving the correct message when an issue is resolved. We'll change the alert threshold back to 80%.

```
docker service update \
    --label-add com.df.alertIf=@service_mem_limit:0.8 \
    go-demo_main
```

After a while, Prometheus will change the alert status to resolved and send a notification to Alertmanager which, in turn, will communicate the news to Slack. The result will be the *RESOLVED* message.

![Figure 6-5: Customized Slack alert messages](https://github.com/OpenDocCN/freelearn-devops-pt5-zh/raw/master/docs/dop-22-tlkt/img/00038.jpeg)

Figure 6-5: Customized Slack alert messages

### What Now?

We're done. We have a system that will alert us whenever there's something wrong inside the cluster. The next step is to start alerting the system itself so that it can self-heal and leave our Slack channel only for emergencies that cannot be auto-fixed. The next chapter will explore the options for alerting the system.

For now, the time has come for both us and our laptops to take a rest.

```
docker-machine rm -f \
    swarm-1 swarm-2 swarm-3
```

Chapter 11: Alerting The System

Our alerting system is set up. Alertmanager is configured to send notifications to Slack. While that is decent progress, it is still far from the kind of alerting that can serve as the foundation of a self-adaptive and self-healing system. What we did so far can be considered a fallback strategy. If the system cannot detect changed conditions and adapt or heal itself when needed, notifying humans through Slack is a reasonable solution. In some cases, the Slack notifications will be temporary and will be replaced with requests for the system to correct itself automatically. In other cases, the system will not be able to heal itself, so notifications must be sent to the doctors (us, humans, engineers).

We already built an initial solution for alerting the system. Alertmanager can fulfill some of our needs. It is not the only one, though. There is another system we have been using throughout the book, even though we never mentioned it in this context. I'm sure you can guess which one it is. If you can't, I'll keep you in suspense a while longer.

Before we move on and start building the system that will receive alerts, we should discuss the types of operations the system might need to perform.

### The Four Quadrants Of A Dynamic And Self-Sufficient System

Any system that intends to be fully automated and self-sufficient must be capable of self-healing and self-adaptation. As a minimum, it needs to be able to monitor itself and perform certain operations on both the service and the infrastructure level.

The set of operations a system might perform can be represented with two axes. One axis is the distinction between infrastructure and services. The other can be explained through the type of activity, with self-healing on one end and self-adaptation on the other.

The most common type of self-healing applied to infrastructure is the re-creation of a failed or faulty node. When the infrastructure needs to adapt to changed conditions, nodes are scaled. On the service level, self-healing is mostly about rescheduling failed services. When the conditions inside the system change, it should adapt by scaling some of the services. A hedged sketch of the manual equivalents of those four operations follows the figure below.

Figure 7-1: The types of operations a system might perform
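As promised above, the commands below are only an illustration of what each quadrant looks like when performed by a human; the service and node names are examples, and the goal of the coming chapters is for the system to run the equivalents of these on its own.

```
# Self-healing applied to services: Swarm already does this by rescheduling
# replicas until the actual state matches the desired one; no command needed.

# Self-adaptation applied to services: change the design by scaling a service.
docker service scale go-demo_main=5

# Self-healing applied to infrastructure: re-create a failed or faulty node.
docker-machine create -d virtualbox swarm-4

# Self-adaptation applied to infrastructure: add capacity by joining a new node.
TOKEN=$(docker swarm join-token -q worker)   # run while pointing at a manager
eval $(docker-machine env swarm-4)
docker swarm join --token $TOKEN $(docker-machine ip swarm-1):2377
```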

How does a system distinguish self-adaptation from self-healing? When should it choose to perform one type of operation over the other?

Every system has a design. That design might stay unchanged for a long time, or it might be redesigned every few minutes. The frequency of changes to the design is what separates static systems from dynamic ones. For most of the short history of the software industry, we favored designs that last. We would spend a lot of time planning, and even longer designing a system, before implementing it. No wonder we were reluctant, after all that work, to start changing the design a week after going live. We worked in the waterfall style, with everything planned in advance and executed in phases. More often than not, the end result was failure, but that is not the subject of this discussion. If you have been in this industry for a while, you probably know what waterfall is. Hopefully, your company changed, or you changed your company. Waterfall is dead, and long-lasting static designs died with it.

Dynamic systems are characterized by very frequent, even continuous, changes in their design. We might design a service to run five replicas, only to change it to seven a week later. We might design an infrastructure composed of twenty-seven nodes and, some time later, that number might become thirty. Every time we make a conscious decision to change something in the system, we are changing its design. Each of those changes might be the result of a miscalculation made at the beginning, or of a change in external conditions that affects the system. A steady increase in traffic demands a change in the design. It requires us to increase the number of replicas of one or more services. Everything else being equal, an increase in replicas means an increase in infrastructure resources. We need to add more nodes to host those additional replicas. If that is not the case, we over-provisioned the system; that is, we have idle resources we can use when scaling services.

Self-adaptation is an automated way of changing the design of the system. When we (humans) change the design, we make decisions by evaluating metrics. At least, we should. Otherwise, we are consulting a crystal ball, hiring a fortune teller, or simply guessing. If we can make decisions based on metrics, so can the system. No matter who changes the system, every change is a modification of the design. If we automate that process, we get self-adaptation.

Self-healing, on the other hand, does not affect the design. Instead, it follows it. If the design states that there should be five replicas and only four are running, the system should do its best to add one. And it is not always about adding. If there are more replicas than the design requires, some should be removed. The same logic applies to nodes or to any other quantifiable part of the system. In short, self-healing is about making sure that the design is always followed.
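One hedged way to see the "design vs. actual state" distinction for yourself, using the `go-demo_main` service from the previous chapters, is to compare the declared replica count with the tasks Swarm is currently keeping running:

```
# The design: the number of replicas declared in the service spec.
docker service inspect go-demo_main \
    --format '{{.Spec.Mode.Replicated.Replicas}}'

# The actual state: tasks Swarm currently wants running (skip the header row).
docker service ps go-demo_main \
    -f desired-state=Running | tail -n +2 | wc -l
```

Whenever the two numbers diverge, self-healing is the activity that brings the second back in line with the first.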

A system can adapt or heal either services or infrastructure. The most common operation behind healing is the re-creation of failed pieces, while adaptation is usually about scaling. Those are the four quadrants that represent a dynamic and self-sufficient system.

We are about to start building a (mostly) automated system that exploits those four quadrants. Everything we did so far was a prerequisite for building such a system. We have metrics in Prometheus, and we can manage alerts through Alertmanager. Now it is time to put those tools to good use and extend their reach.

The first quadrant we'll explore focuses on self-healing applied to services.

Chapter 12: Self-Healing Applied To Services

The job of a system that self-heals services is to make sure that the services are (almost) always running as designed. Such a system needs to monitor the state of the cluster and continuously make sure that all services are running the specified number of replicas. If one of them stops, the system should start a new one. If a whole node goes down, all the replicas that were running on it should be scheduled onto healthy nodes. As long as the capacity of the cluster can host all the replicas, the system should be able to maintain the defined specification.

Having a system that self-heals services does not mean that it provides high availability. If a replica stops working, the system will bring it back to the running state. However, there is a (very) short period between the failure and the moment the system restores the desired state. If we run a service with only one replica, that period translates into downtime. The best way to solve that problem is to run at least two replicas of every service. That way, when one replica goes down, the others handle the requests until the failed one is restored to the desired state.
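As a hedged one-liner, the same rule can be applied to an existing service on the fly (or, preferably, declared in its stack file); `go-demo_main` is used here only as an example:

```
# Make sure a service runs at least two replicas so a single failed replica
# does not translate into downtime.
docker service update --replicas 2 go-demo_main
```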

Assuming that the conditions in the cluster stay the same (nodes do not go down) and that the load on the cluster is constant, a system with self-healing applied to services should be able to provide close to 100% uptime. Unfortunately, nodes do go down, and the load on a cluster is (almost) never constant. We'll explore how to solve those problems later. For now, we'll focus on building the part of the system that makes sure services heal themselves.

### Creating The Cluster And Deploying Services

We'll start by setting up a Swarm cluster and deploying the stacks we'll use in this chapter.

```
chmod +x scripts/dm-swarm-08.sh

./scripts/dm-swarm-08.sh

eval $(docker-machine env swarm-1)

docker stack ls
```

We executed the `dm-swarm-08.sh` script which, in turn, created a Swarm cluster composed of Docker Machines, created the networks, and deployed the stacks. The last command listed all the stacks in the cluster. We are running only the `go-demo` and `proxy` stacks.

Where are the `prometheus` and `exporter` stacks we deployed in the previous chapters? Why are we missing them? The reason is quite simple. We don't need them to demonstrate self-healing applied to services. We have everything we need.

Before we proceed, please confirm that all the replicas that compose the `go-demo` stack are running. You can check their statuses by executing the `docker stack ps go-demo` command. You might see a few failed replicas of the `go-demo_main` service. The reason is in its design. It fails if it cannot connect to the database running inside the `go-demo_db` service. Since the database is a bigger image, it takes more time to pull it. Ignore the failed replicas and confirm that there are three instances of `go-demo_main` running.

![Figure 8-1: Replicas spread across the cluster](https://github.com/OpenDocCN/freelearn-devops-pt5-zh/raw/master/docs/dop-22-tlkt/img/00040.jpeg)

Figure 8-1: Replicas spread across the cluster

### Using Docker Swarm For Self-Healing Services

Docker Swarm already provides almost everything we need from a system that self-heals services. What follows is a short demonstration of some of the scenarios the system might encounter when facing failed service replicas. I already warned you that at least basic knowledge of operating Swarm is a prerequisite for this book, so I chose to skip a lengthy discussion about the features behind the scheduler. I won't go into details but only prove that Swarm guarantees that the services will (almost) always be healthy.

Let's see what happens when one of the three replicas of the `go-demo_main` service fails. We'll simulate it by stopping the primary process inside one of the replicas.

The first thing we need to do is find out the node where one of the replicas is running.

```
NODE=$(docker service ps \
    -f desired-state=Running \
    go-demo_main | tail -n 1 \
    | awk '{print $4}')

eval $(docker-machine env $NODE)
```

We listed all the processes of the `go-demo_main` service and used a filter to limit the output only to those that are running. The output was sent to `tail` so that only one result is returned. Further on, we used `awk` to print only the fourth column, which contains the name of the node. The result was assigned to the environment variable `NODE`. The second command changed our local Docker client to point to the node with one of the replicas.

Next, we need to find the ID of one of the replicas running on the node we selected.

```
CONTAINER_ID=$(docker container ls -q \
    -f "label=com.docker.swarm.service.name=go-demo_main" \
    | tail -n 1)
```

We listed all the containers in quiet mode so that only IDs are returned. We used filtering so that only containers labeled as the service `go-demo_main` are retrieved. Since we need only one container (there might be more on that node), we sent the output to `tail`, which returned only the last row.

Now we can stop the main process inside the container and observe what happens.

```
docker container exec -it \
    $CONTAINER_ID pkill go-demo
```

We killed the `go-demo` process inside the container. That was the main and only process inside that container.
As soon as it stopped, the container stopped as well. Let's list the processes of the `go-demo` stack.

```
docker stack ps go-demo
```

The output, limited to the replica we killed, is as follows (IDs are removed for brevity).

```
NAME               IMAGE                   NODE    DESIRED STATE CURRENT STATE          ERROR                       PORTS
go-demo_main.3     vfarcic/go-demo:latest  swarm-2 Running       Running 1 second ago
 \_ go-demo_main.3 vfarcic/go-demo:latest  swarm-2 Shutdown      Failed 11 seconds ago  "task: non-zero exit (2)"
```

As you can see, Swarm detected that one of the replicas failed and scheduled a new one. It made sure that the specification (the design) is followed. When we deployed the `go-demo` stack, we told Swarm that we want to have three replicas of the `go-demo_main` service, and Swarm is continuously monitoring the cluster, making sure that our desire is always fulfilled.

There were a few seconds between the failure and the moment the new replica was running. If we ran only one replica, that would mean a short downtime. However, since we are running three, the other two took over the requests, and there was no downtime. High availability is preserved.

![Figure 8-2: The failed replica was re-scheduled](https://github.com/OpenDocCN/freelearn-devops-pt5-zh/raw/master/docs/dop-22-tlkt/img/00041.jpeg)

Figure 8-2: The failed replica was re-scheduled

What happens when a whole node is destroyed? I'm sure you already know the answer, but I'll go through a small demonstration nevertheless. We'll repeat the command that we executed earlier and find a node with at least one of the `go-demo_main` replicas.

```
NODE=$(docker service ps \
    -f desired-state=Running \
    go-demo_main | tail -n 1 \
    | awk '{print $4}')
```

Let's be destructive and delete the node.

```
docker-machine rm -f $NODE
```

To be on the safe side, we'll list all the machines and confirm that one was indeed removed.

```
docker-machine ls
```

The output is as follows.

```
NAME    ACTIVE DRIVER     STATE   URL                       SWARM DOCKER       ERRORS
swarm-1 -      virtualbox Running tcp://192.168.99.100:2376       v17.03.1-ce
swarm-3 *      virtualbox Running tcp://192.168.99.102:2376       v17.03.1-ce
```

Next, we'll have to change our environment variables to ensure that our local Docker client is not pointing to the node we just removed.

```
NODE=$(docker-machine ls -q | tail -n 1)

eval $(docker-machine env $NODE)
```

Now we can, finally, list the processes of the `go-demo` stack and see the result.

```
docker stack ps go-demo
```

The output, limited to the relevant parts, is as follows (IDs are removed for brevity).

```
NAME               IMAGE                   NODE    DESIRED STATE CURRENT STATE          ERROR PORTS
...
go-demo_main.3     vfarcic/go-demo:latest  swarm-1 Running       Running 2 minutes ago
 \_ go-demo_main.3 vfarcic/go-demo:latest  swarm-2 Shutdown      Running 7 minutes ago
...
```

Docker Swarm detected that the `swarm-2` node is not available and changed the desired state of the replicas that were running there to `Shutdown`. Unlike the case when a container fails, the current state stayed unchanged. Swarm still assumes that the replicas are running inside `swarm-2`. We know that the node is destroyed and that no replicas are running inside it. Swarm, on the other hand, is not aware of that. It just knows what the last known state of that replica is.
The node, from Swarm's point of view, might still be operational and might only have lost the connection with the cluster. Theoretically, the connection could be reestablished later. There could be many other explanations besides the destruction of the node, so Swarm keeps the last known state. Nevertheless, if the node rejoins the cluster, that replica is scheduled for shutdown and will be destroyed immediately as a way to preserve the desired state.

![Figure 8-3: Replicas from a failed node are spread across the cluster](https://github.com/OpenDocCN/freelearn-devops-pt5-zh/raw/master/docs/dop-22-tlkt/img/00042.jpeg)

Figure 8-3: Replicas from a failed node are spread across the cluster

### Is It Enough To Have Self-Healing Applied To Services?

Self-healing applied to services is only the beginning. It is by no means enough. The system, as it is now, is far from being autonomous. At best, it can recuperate from a few types of failures. If one replica of a service goes down, Swarm will do the right thing. Even a simultaneous failure of a few replicas should not be a cause for alarm. However, self-healing applied to services by itself does not contemplate many of the common circumstances.

Let us imagine that the sizing of a cluster is done in a way that around 80 percent of CPU and memory is utilized. Such a number, more or less, provides a good balance between having too many unused resources and under-provisioning our cluster. With greater resource utilization, we run the risk that even the failure of a single node would mean that there are no available resources to reschedule the replicas that were running inside it. On the other hand, if we have more than twenty percent of available resources, we are paying for more hardware than we need.

Assuming that we do aim for eighty percent of resource utilization, without self-healing applied to infrastructure, a failure of more than one node could have a devastating effect. Swarm would not have enough available resources to reschedule the replicas from the failed servers.

While it is not common, an availability zone (to use AWS terms) can go down. Assuming that our infrastructure is spread over three availability zones, such a failure would mean that our capacity is reduced by thirty-three percent. When we do the math, that would mean that we would be missing sixteen percent of resources. It is even worse than that since Swarm cannot schedule services so that a hundred percent is used. Somewhere around ninety to ninety-five percent is more likely. So, the failure of an AZ would mean that we would be missing quite a lot of resources and some replicas could not be rescheduled. At best, we would have reduced performance. Self-healing applied to infrastructure is a must, and we will explore it soon.

Even if nothing failed, our system would not function autonomously for long. We should expect that the load will increase with time. After all, we want our business to expand and that, in most cases, results in increased load. We need to build a system that will adapt to those changes. We need it to expand when the load increases, thus providing high availability and low response times. At the same time, we need it to contract when the load decreases and save us from paying for unused resources. We need the system not only to self-heal but also to self-adapt to changed conditions. We need it to redesign itself.

There are many other things that we are missing, and we won't discuss them just yet. Patience is a virtue, and you'll have to wait a while longer.
### What Now?

We're done with a brief exploration of the self-healing capabilities provided by Docker Swarm. We have a system that will reschedule failed services as long as there is enough capacity inside our cluster. The next step is to apply self-adaptation to our services.

Please remove the machines we created. We'll recreate the cluster in the next chapter.

```
docker-machine rm -f \
    swarm-1 swarm-2 swarm-3
```

Chapter 13: Self-Adaptation Applied To Services

We have seen how services can heal themselves. Setting up a system that makes sure the desired number of replicas of each service is (almost) always running is relatively simple. Docker Swarm does all the work. As long as enough hardware resources are available, our services will (almost) always run the specified number of replicas. All we have to do is specify `replicas: [NUMBER_OF_REPLICAS]` in the YAML file that defines our stack.
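For reference, a minimal sketch (not one of the book's stack files) of where that setting lives and how it gets deployed:

```
# Write a throw-away stack file with an explicit replica count and deploy it.
cat > my-stack.yml <<'EOF'
version: "3"
services:
  main:
    image: vfarcic/go-demo
    deploy:
      replicas: 3
EOF

docker stack deploy -c my-stack.yml my-stack
```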

The problem with self-healing is that it does not take into account the changes that affect our system. We will keep running the same number of replicas even if their memory utilization skyrockets. The same holds true if, for example, network traffic increases. Docker Swarm will not make sure that our system adapts to changed conditions. It blindly follows the blueprint. While that is a huge improvement compared to how we operated systems in the past, it is far from enough. We need the system to both self-heal and self-adapt.

In this chapter, we'll extend the knowledge we've obtained so far and start exploring ways to make the system self-adaptive. For now, we'll limit ourselves to services and ignore the fact that hardware needs to be healed and adapted as well. That part comes later.

### Choosing The Tool For Scaling

We already adopted quite a few tools. We store metrics in Prometheus, and we deployed Swarm Listener to pass information to it. We also have Alertmanager, which receives notifications when certain thresholds are reached. While those tools took us closer to our goal, they are not enough. Now we need to figure out what to do with the alerts. Receiving them in Slack is only the last resort. We need a tool that can receive an alert, process the data it carries, apply some logic, and decide what to do with it.

In most cases, self-adaptation is mostly about scaling. Since we limited ourselves to services, when the system receives an alert it needs to be able to decide whether to scale up, scale down, or do nothing. We need a tool that can accept remote requests, run code that determines what to do, and interact with Docker.

If you read *The DevOps 2.1 Toolkit: Docker Swarm: Building, testing, deploying, and monitoring services inside Docker Swarm clusters*, you know that I suggested Jenkins for handling our continuous deployment processes. We can use it as the tool for performing self-adaptation operations as well. After all, the real power of Jenkins lies not in running (only) continuous integration/delivery/deployment pipelines, but in running any type of task. Its jobs can be triggered remotely by Alertmanager. It has a powerful yet simple scripting language in the form of the Pipeline DSL. And if we expose the Docker socket to Jenkins agents, they can easily interact with Docker and execute any of the available commands.

Even if you prefer a different tool, the examples we'll implement in Jenkins can easily be adapted to any other tool, as long as it fulfills the requirements mentioned above.

Let's get going.

### Creating The Cluster And Deploying Services

Just as in (almost) every other chapter, we'll start the hands-on part by setting up a Swarm cluster and deploying the stacks we used before.

```
chmod +x scripts/dm-swarm-09.sh

./scripts/dm-swarm-09.sh

eval $(docker-machine env swarm-1)

docker stack ls
```

We executed the `dm-swarm-09.sh` script which, in turn, created a Swarm cluster composed of Docker Machines, created the networks, and deployed the stacks. The last command listed all the stacks in the cluster. We are running the `proxy`, `monitor`, `exporter`, and `go-demo` stacks. Those four comprise the whole toolkit we have used so far.

### Preparing The System For Alerts

We'll deploy the stack defined in [stacks/jenkins.yml](https://github.com/vfarcic/docker-flow-monitor/blob/master/stacks/jenkins.yml). The definition is as follows.

```
version: '3.1'

services:

  master:
    image: vfarcic/jenkins
    ports:
      - 50000:50000
    environment:
      - JENKINS_OPTS="--prefix=/jenkins"
    networks:
      - proxy
      - default
    deploy:
      labels:
        - com.df.notify=true
        - com.df.distribute=true
        - com.df.servicePath=/jenkins
        - com.df.port=8080
    extra_hosts:
      - "${SLACK_HOST:-devops20.slack.com}:${SLACK_IP:-54.192.78.227}"
    secrets:
      - jenkins-user
      - jenkins-pass

  agent:
    image: vfarcic/jenkins-swarm-agent
    environment:
      - USER_NAME_SECRET=/run/secrets/${JENKINS_USER_SECRET:-jenkins-user}
      - PASSWORD_SECRET=/run/secrets/${JENKINS_PASS_SECRET:-jenkins-pass}
      - COMMAND_OPTIONS=-master http://master:8080/jenkins -labels 'prod' -executors 4
    networks:
      - default
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
    secrets:
      - jenkins-user
      - jenkins-pass
    deploy:
      placement:
        constraints: [node.role == manager]

networks:
  proxy:
    external: true
  default:
    external: false

secrets:
  jenkins-user:
    external: true
  jenkins-pass:
    external: true
```

The stack contains two services. The first one is the Jenkins master. We are running `vfarcic/jenkins` instead of the [official Jenkins image](https://hub.docker.com/_/jenkins/). The `vfarcic/jenkins` image is already built with an administrative user and has all the plugins we'll need. With it, we'll be able to skip Jenkins' setup process. I won't go into more detail about the image. If you're curious, please read the [Automating Jenkins Docker Setup](https://technologyconversations.com/2017/06/16/automating-jenkins-docker-setup/) article.

The `master` service from the stack publishes port `50000` so that agents from this, or other clusters, can connect to it. If all the agents ran inside the same cluster, there would be no need for this port. Instead, they would be attached to the same Overlay network. Since, in most cases, Jenkins agents tend to be spread across multiple clusters, having the port open is a must. The environment variable `JENKINS_OPTS` defines `/jenkins` as the prefix so that [Docker Flow Proxy](http://proxy.dockerflow.com/) can distinguish requests meant for Jenkins from those that should be forwarded to the other services inside the cluster. The service will be attached to the `proxy` and `default` networks.
The first one will be used for communication with *Docker Flow Proxy* while the second is meant to connect it to the agent. The labels are there to provide sufficient information to the proxy so that it can reconfigure itself. We had to add the Slack address as an extra host. Otherwise, Jenkins would not know the address of the `devops20.slack.com` domain. Finally, we specified two secrets (`jenkins-user` and `jenkins-pass`) that will define the credentials of the administrative user.

The `agent` follows a similar logic. We're using the `vfarcic/jenkins-swarm-agent` image that contains Docker, Docker Compose, and the [Jenkins Swarm Plugin](https://wiki.jenkins.io/display/JENKINS/Swarm+Plugin). The latter allows us to connect to the master automatically. The alternative would be to use the "traditional" approach of adding agents manually through Jenkins' UI. Please note that the environment variable `COMMAND_OPTIONS` has the `-labels` argument set to `prod`. Since this agent will run on the production cluster, we need to identify it as such. Even though in this chapter we won't use Jenkins for continuous deployment processes, it is important to label agents from the start so that, later on, we can add others that will serve a different purpose. Just like the `master` service, the agent uses the `jenkins-user` and `jenkins-pass` secrets to provide the credentials that will be used to connect to the Jenkins master.

Finally, we need the agent to communicate with one of the Docker managers, so we set the `node.role == manager` constraint. Without this constraint, agents would not be able to spin up new services since only managers are allowed to perform such actions. Containers that form Jenkins agents have the Docker socket mounted so that Docker commands executed inside them spin up containers on one of the nodes, not inside the container. The latter would produce Docker-in-Docker (DinD) which is, in most cases, not a good idea. If you do not want to take my word for granted, please read Jerome's post [Using Docker-in-Docker for your CI or testing environment? Think twice.](http://jpetazzo.github.io/2015/09/03/do-not-use-docker-in-docker-for-ci/)

![Figure 9-1: Jenkins agents connected to a master and Docker managers](https://github.com/OpenDocCN/freelearn-devops-pt5-zh/raw/master/docs/dop-22-tlkt/img/00043.jpeg)

Figure 9-1: Jenkins agents connected to a master and Docker managers

Now that we have a general idea about the services inside the `jenkins` stack, we can deploy it.

```
echo "admin" | \
    docker secret create jenkins-user -

echo "admin" | \
    docker secret create jenkins-pass -

export SLACK_IP=$(ping \
    -c 1 devops20.slack.com \
    | awk -F'[()]' '/PING/{print $2}')

docker stack deploy \
    -c stacks/jenkins.yml jenkins
```

We created the secrets and deployed the stack. The value of the environment variable `SLACK_IP` was obtained by pinging the `devops20.slack.com` domain. All that is left, before we start using Jenkins, is a bit of patience. We need to wait until Docker pulls the images. Please execute `docker stack ps jenkins` to confirm that the services are running.

Let's open the Jenkins UI in a browser.

```
open "http://$(docker-machine ip swarm-1)/jenkins"
```

If Jenkins does not open, please wait a few moments and refresh the screen.
The fact that Docker service is running does not mean that the process inside it is initialized. Jenkins needs ten to fifteen seconds (depending on hardware) to start. Once you see the Jenkins home screen, please click the *Log in* link located in the top-right corner of the screen, and use *admin* as both username and password. Click the *log in* button to authenticate. We should confirm that the agent was added to the master by observing the *computer* screen. ``` `1` open `"http://``$(`docker-machine ip swarm-1`)``/jenkins/computer"` ``` `````````````````````````````````````````````````````````````` You should see two agents. The *master* agent is set up by default with each Jenkins instance. The second agent identified with a hash name was added through the `agent` service in the stack. ![Figure 9-2: Jenkins agent automatically added to the master](https://github.com/OpenDocCN/freelearn-devops-pt5-zh/raw/master/docs/dop-22-tlkt/img/00044.jpeg) Figure 9-2: Jenkins agent automatically added to the master ### Creating A Scaling Pipeline Now comes the exciting part. We’re about to start writing a Pipeline job that will serve as the base for the first self-adaptation script. ``` `1` open `"http://``$(`docker-machine ip swarm-1`)``/jenkins/newJob"` ``` ````````````````````````````````````````````````````````````` Once inside the *New Job* screen, please type *service-scale* as the item name. Select *Pipeline* as job type and click the *OK* button. Since Jenkins service we created comes with enabled authorization, we need an authentication mechanism for triggering builds. We could use the administrative *username* and *password*. A better option is to make a trigger that will be independent of any particular user. That can be accomplished with tokens. Please select the *Trigger builds remotely* checkbox from the *Build Trigger* section of the job configuration screen. Type *DevOps22* as the *Authentication Token*. We’ll use it to authenticate remote requests which will trigger a build of this job. Now we can start writing a Pipeline script. There are quite a few things that it should do so we’ll go step by step. The first thing we need is parameters. AS a minimum, we need to know which service should be scaled and how many replicas to add or remove. We’ll assume that if the number of replicas is positive, we should scale up. Similarly, if the value is negative, we should scale down. Please type the script that follows inside the *Pipeline Script* field. ``` `1` `pipeline` `{` `2` `agent` `{` `3` `label` `"prod"` `4` `}` `5` `parameters` `{` `6` `string``(` `7` `name:` `"service"``,` `8` `defaultValue:` `""``,` `9` `description:` `"The name of the service that should be scaled"` `10 ` `)` `11 ` `string``(` `12 ` `name:` `"scale"``,` `13 ` `defaultValue:` `""``,` `14 ` `description:` `"Number of replicas to add or remove."` `15 ` `)` `16 ` `}` `17 ` `stages` `{` `18 ` `stage``(``"Scale"``)` `{` `19 ` `steps` `{` `20 ` `echo` `"Scaling $service by $scale"` `21 ` `}` `22 ` `}` `23 ` `}` `24` `}` ``` ```````````````````````````````````````````````````````````` If you do not like typing, feel free to copy and paste the contents of the [service-scale-1.groovy Gist](https://gist.github.com/vfarcic/98778e9f414f1af1ab30cd07e39b015a). Don’t forget to click the *Save* button. Since we’re trying to scale services running in production, we defined the agent as such. Next, we set the parameters `service` and `scale`. Finally, we have only one stage (`Scale`) with a single step that prints a message. 
Each pipeline has one or more stages, and each stage is a collection of steps. A step (in this case `echo`) is a task or logic that should be executed. Please note that we are using [Declarative](https://jenkins.io/doc/book/pipeline/syntax/#declarative-pipeline) instead [Scripted](https://jenkins.io/doc/book/pipeline/syntax/#scripted-pipeline) Pipeline syntax. Both have pros and cons. Declarative is a more opinionated and structured syntax while Scripted provides more freedom. The main reason we’re using Declarative flavor is that it has better support for the new [Blue Ocean](https://jenkins.io/projects/blueocean/) UI. Moreover, I happen to know the Jenkins roadmap and Declarative Pipeline is at its center. The default Jenkins UI is not among the prettiest in town. It, kind of, hurts the eyes if you look at it for more than a couple of seconds. Since I do not want your health to deteriorate as a result of reading this book, we’ll switch to *Blue Ocean*. It is available as the alternative UI (soon to become the default) and we already have it installed as one of the plugins. Please click the *Open Blue Ocean* link located in the left-hand menu. And… Lo and behold… We just jumped through time from the 80s to the present tense (at least from the aesthetic perspective). Now we can see our simple pipeline script in action. Since we did not yet run this Pipeline, you will be presented with the *This job has not been run* message and the *Run* button. Please click it. The job will fail the first time we run it. You can consider it a bug that will, hopefully, be fixed shortly. It failed because it got confused with the parameters we specified. I’ll skip the debate about the reasons behind this bug since the workaround is straightforward. Just rerun it by pressing the *Run* button located in the top-left corner. You’ll be presented with a screen that contains the input parameters we specified in the script. Please type *go-demo_main* as *the name of the service that should be scaled* and *2* as the *number of replicas to add or remove*. Click the *Run* button. This time the Pipeline worked, and we can observe the result by clicking on the row of the last build which, in this case, should be *2*. We specified only one stage that contains a single step that prints the message. Please click the *Print Message* row to see the result. The output should be as follows. ``` `1` Scaling go-demo_main by 2 ``` ``````````````````````````````````````````````````````````` ![Figure 9-3: Jenkins Pipeline with a simple Print Message step](https://github.com/OpenDocCN/freelearn-devops-pt5-zh/raw/master/docs/dop-22-tlkt/img/00045.jpeg) Figure 9-3: Jenkins Pipeline with a simple Print Message step Even though Blue Ocean UI is very pleasing, our goal is not to use it to execute builds. Instead, we should invoke it through an HTTP request. That way, we can be confident that Alertmanager will be capable of invoking it as well. Please execute the command that follows. ``` `1` curl -X POST `"http://``$(`docker-machine ip swarm-1`)``/jenkins/job/service-scale/buil\` `2` `dWithParameters?token=DevOps22&service=go-demo_main&scale=2"` ``` `````````````````````````````````````````````````````````` The request we sent is very straightforward. We invoked `buildWithParameters` endpoint of the job and passed the token and required inputs as query parameters. We received no response and can consider that no news is good news. The job was run, and we can confirm that through the UI. 
``` `1` open `"http://``$(`docker-machine ip swarm-1`)``/jenkins/blue/organizations/jenkins/ser\` `2` `vice-scale/activity"` ``` ````````````````````````````````````````````````````````` You’ll see the list of builds (there should be three). While the *admin* user executed the first two through the UI, the last one was triggered remotely. We can see that by observing the *started by the remote host* message. Please click the row of the last build and observe that the *Print Message* is the same as when we executed the job through UI. Similarly, we can change the `scale` parameter to a negative value if we’d like to scale down. ``` `1` curl -X POST `"http://``$(`docker-machine ip swarm-1`)``/jenkins/job/service-scale/buil\` `2` `dWithParameters?token=DevOps22&service=go-demo_main&scale=-1"` ``` ```````````````````````````````````````````````````````` If you repeat the steps from before, the output of the *Print Message* should be *Scaling go-demo_main by -1*. The Pipeline we have does not do anything but accept parameters and print a message that confirms that parameters are passed correctly. As you probably guessed, we are missing the main ingredient. We need to tell Docker to scale the service. The problem is that Swarm does not accept relative scale values. We cannot instruct it to increase the number of replicas by two nor to decrease it by, let’s say, one. We can overcome this limitation by finding out the current number of replicas and adding or subtracting the value of the `scale` parameter. First things first. How can we find out the current number of replicas? The answer lies in the `docker service inspect` command. Let’s see the output Docker provides if we inspect the `go-demo_main` service. ``` `1` docker service inspect go-demo_main ``` ``````````````````````````````````````````````````````` The output is too long to be presented here. Instead, we’ll focus on the part that interests us. In particular, we need the `Replicas` value. The relevant part of the output is as follows. ``` `1` [ `2` { `3` ... `4` "Spec": { `5` ... `6` "Mode": { `7` "Replicated": { `8` "Replicas": 3 `9` } `10 ` }, `11 ` ... ``` `````````````````````````````````````````````````````` As you can see, we got the information that the service runs three replicas. We can execute the same command from Jenkins pipeline, capture the output, and filter it in a way that only the value of the `Replicas` key is retrieved. In the spirit of brevity, we’ll go only through the `stages` section of the Pipeline. The whole scripts are available as Gist in case you want to copy and paste them in their entirety. ``` `1` `...` `2` `stages` `{` `3` `stage``(``"Scale"``)` `{` `4` `steps` `{` `5` `script` `{` `6` `def` `inspectOut` `=` `sh` `script:` `"docker service inspect $service"``,` `7` `returnStdout:` `true` `8` `def` `inspectJson` `=` `readJSON` `text:` `inspectOut``.``trim``()` `9` `def` `currentReplicas` `=` `inspectJson``[``0``].``Spec``.``Mode``.``Replicated``.``Replicas` `10 ` `def` `newReplicas` `=` `currentReplicas` `+` `scale``.``toInteger``()` `11 ` `echo` `"We should scale from $currentReplicas to $newReplicas replicas"` `12 ` `}` `13 ` `}` `14 ` `}` `15` `}` `16` `...` ``` ````````````````````````````````````````````````````` Due to Declarative Pipeline’s decision not to allow an easy way to declare variables, we coded everything as one `script` step. The script is executing `docker service inspect` as an `sh` step. 
The `returnStdout` argument is mandatory if we want to be able to capture the output of a command. Later on, we’re using the `readJSON` step that converts plain text to JSON map. The current number of replicas is retrieved by filtering JSON array. We limited the output to the first element and navigated through `Spec`, `Mode`, `Replicated`, and `Replicas` items. The result is stored in the variable `currentReplicas`. From there on, it is a simple math of subtracting the current number of replicas with the `scale` parameter. Since it is a string, we had to convert it to an integer. Finally, we are outputting the result using the `echo` step. The complete code can be found in the [service-scale-2.groovy Gist](https://gist.github.com/vfarcic/77bc5baae1b19d13a7d048f27d03eaff). Let’s open the *service-scale* configure screen and modify the script. ``` `1` open `"http://``$(`docker-machine ip swarm-1`)``/jenkins/job/service-scale/configure"` ``` ```````````````````````````````````````````````````` Feel free to replace the current script with the one from the [service-scale-2.groovy Gist](https://gist.github.com/vfarcic/77bc5baae1b19d13a7d048f27d03eaff). Personally, I learn better when I write code instead of copying and pasting snippets. No matter the choice, please click the *Apply* button once the Pipeline is updated. Let us repeat the build request and see the outcome. ``` `1` curl -X POST `"http://``$(`docker-machine ip swarm-1`)``/jenkins/job/service-scale/buil\` `2` `dWithParameters?token=DevOps22&service=go-demo_main&scale=2"` ``` ``````````````````````````````````````````````````` We’ll go the the job activity screen and observe the result. ``` `1` open `"http://``$(`docker-machine ip swarm-1`)``/jenkins/blue/organizations/jenkins/ser\` `2` `vice-scale/activity"` ``` `````````````````````````````````````````````````` Please click the row of the top-most (most recent) build followed with the click on the last (bottom) step with the *Print Message* label. The output should be as follows. ``` `1` We should scale from 3 to 5 replicas ``` ````````````````````````````````````````````````` ![Figure 9-4: Jenkins Pipeline with a Print Message stating that we should scale to five replicas](https://github.com/OpenDocCN/freelearn-devops-pt5-zh/raw/master/docs/dop-22-tlkt/img/00046.jpeg) Figure 9-4: Jenkins Pipeline with a Print Message stating that we should scale to five replicas Let us confirm that de-scaling calculation works as well. ``` `1` curl -X POST `"http://``$(`docker-machine ip swarm-1`)``/jenkins/job/service-scale/buil\` `2` `dWithParameters?token=DevOps22&service=go-demo_main&scale=-1"` ``` ```````````````````````````````````````````````` If we open the details of the last build and expand the last step, the message should be as follows. ``` `1` We should scale from 3 to 2 replicas ``` ``````````````````````````````````````````````` We are still not performing scaling but, at this moment, we are capable of discovering the current number of replicas and performing a simple calculation that provides us with the number of replicas our system should have. Now we are ready to expand the script and truly scale the service. Equipped with the desired number of replicas stored in the variable `newReplicas`, all we have to do is execute `docker service scale` command. The updated Pipeline script, limited to the relevant parts, is as follows. 
``` `1` `...` `2` `script` `{` `3` `def` `inspectOut` `=` `sh` `script:` `"docker service inspect $service"``,` `4` `returnStdout:` `true` `5` `def` `inspectJson` `=` `readJSON` `text:` `inspectOut``.``trim``()` `6` `def` `currentReplicas` `=` `inspectJson``[``0``].``Spec``.``Mode``.``Replicated``.``Replicas` `7` `def` `newReplicas` `=` `currentReplicas` `+` `scale``.``toInteger``()` `8` `sh` `"docker service scale $service=$newReplicas"` `9` `echo` `"$service was scaled from $currentReplicas to $newReplicas replicas"` `10` `}` `11` `...` ``` `````````````````````````````````````````````` The only addition is the `sh "docker service scale $service=$newReplicas"` line. It should be pretty obvious what it does so we’ll just go ahead and modify it in Jenkins. ``` `1` open `"http://``$(`docker-machine ip swarm-1`)``/jenkins/job/service-scale/configure"` ``` ````````````````````````````````````````````` Please update the current script or replace it with the [service-scale-3.groovy Gist](https://gist.github.com/vfarcic/2b160b93c6cc08320be80d284eb03017). When finished, please press the *Apply* button. Let us run the build one more time and observe the result. ``` `1` curl -X POST `"http://``$(`docker-machine ip swarm-1`)``/jenkins/job/service-scale/buil\` `2` `dWithParameters?token=DevOps22&service=go-demo_main&scale=2"` ``` ```````````````````````````````````````````` This time, we do not need to open Jenkins UI to see the outcome. If everything went as planned, we should see that the `go-demo_main` service is scaled from three to five replicas. ``` `1` docker stack ps `\` `2 ` -f desired-state`=`Running go-demo ``` ``````````````````````````````````````````` We listed all the processes that belong to the `go-demo` stack. As a way to reduce noise from those that previously failed or were shut down, we used the filter that limited the output only to those with `Running` as the desired state. The output is as follows (IDs are removed for brevity). ``` `1` NAME IMAGE NODE DESIRED STATE CURRENT STATE \ `2 ` ERROR PORTS `3` go-demo_main.1 vfarcic/go-demo:latest swarm-1 Running Running 2 hours ago `4` go-demo_db.1 mongo:latest swarm-1 Running Running 2 hours ago `5` go-demo_main.2 vfarcic/go-demo:latest swarm-3 Running Running 2 hours ago `6` go-demo_main.3 vfarcic/go-demo:latest swarm-2 Running Running 2 hours ago `7` go-demo_main.4 vfarcic/go-demo:latest swarm-2 Running Running 2 minutes ago `8` go-demo_main.5 vfarcic/go-demo:latest swarm-1 Running Running 2 minutes ago ``` `````````````````````````````````````````` As you can see, the number of `go-demo_main` replicas is now five. Two of them are running for only a few minutes. Since I am a paranoid person, I like testing at least a few combinations of any code or script I write. Let’s see whether it works if we choose to scale by a negative number. ``` `1` curl -X POST `"http://``$(`docker-machine ip swarm-1`)``/jenkins/job/service-scale/buil\` `2` `dWithParameters?token=DevOps22&service=go-demo_main&scale=-1"` ``` ````````````````````````````````````````` After a few moments, the number of replicas should scale down from five to four. Let’s double-check it. ``` `1` docker stack ps `\` `2 ` -f desired-state`=`Running go-demo ``` ```````````````````````````````````````` The output is as follows (IDs are removed for brevity). 
``` `1` NAME IMAGE NODE DESIRED STATE CURRENT STATE \ `2 ` ERROR PORTS `3` go-demo_main.1 vfarcic/go-demo:latest swarm-1 Running Running 2 hours ago `4` go-demo_db.1 mongo:latest swarm-1 Running Running 2 hours ago `5` go-demo_main.2 vfarcic/go-demo:latest swarm-3 Running Running 2 hours ago `6` go-demo_main.3 vfarcic/go-demo:latest swarm-2 Running Running 2 hours ago `7` go-demo_main.4 vfarcic/go-demo:latest swarm-2 Running Running 25 minutes a\ `8` go ``` ``````````````````````````````````````` As you can see, the replica number five disappeared, proving that the script works in both directions. We can use it to scale or de-scale services. As a side note, don’t get alarmed if some other replica disappeared. There is no guarantee that, when we scale down by one replica, it will be the last one that is removed from the system. For example, replica number two could have been removed instead of the replica five. Indexes are not of importance. What matters is that only four replicas are running inside the cluster. ![Figure 9-5: Manual scaling through Jenkins](https://github.com/OpenDocCN/freelearn-devops-pt5-zh/raw/master/docs/dop-22-tlkt/img/00047.jpeg) Figure 9-5: Manual scaling through Jenkins ### Preventing The Scaling Disaster On the first look, the script we created works correctly. Doesn’t it?. I’ve seen similar scripts in other places, and there is only one thing I have to say. **Do not run this pipeline in production!!!** It is too dangerous. It can easily crash your entire cluster or make your service disappear. Can you guess why? Let us imagine the following situation. Prometheus detects that certain threshold is reached (e.g. memory utilization, response time, and so on) and send a notification to Alertmanager. It sends a build request to Jenkins which, in turn, scales the service by increasing the number of replicas by one. So far, so good. What happens if scaling does not resolve the problem? What if the threshold reached in Prometheus persists? After a while, the process will be repeated, and the service will be scaled up one more time. That might be correct. Maybe there was a significant increase in requests. Maybe that new feature convinced a huge number of new users to start using our service. In such a situation, scaling twice is a legitimate operation. But, what if the second round of scaling did not produce results. What if the system continues scaling up until all the resources are used, and the nodes start failing one by one? The whole cluster could be destroyed. If you think that scenario is bad, let me tell you that it can get much worse. Let’s assume that there is a system in place that would create new nodes when resources are over certain threshold. In that scenario, scaling up indefinitely would result in infinite addition of new nodes. As a result, the bill from AWS could ruin your company. Fortunately, there is a limit to how many nodes an account can create. Still, the unlimited increase in the number of replicas together with the growth of nodes up to a limit would only produce a massive bill, and the cluster would still fail at the end. As you can imagine, neither of those scenarios is pretty. What happens if the system decides to de-scale? Maybe you set up a lower threshold for a memory limit or for response time. When that boundary is reached, the system should scale-down. Following the similar logic from the previous examples, scaling-down could continue until the number of replicas reaches zero. 
At that moment, the service is as good as if it would be removed from the system. As a result, we’d have downtime. The major difference is that we would not get a huge bill from our hosting vendor and only a part of the system would experience downtime. The rest of the services should work correctly unless they also start experiencing the same fate. What we need to do is set some limits. We should define what the minimum and the maximum number of replicas of a service is. However, the trick is not only to know what information should be defined but also where to put that information. Jenkins needs to know what are those limits and I can think of a few ways to provide that information. We could add two new input parameters to the `service-scale` Pipeline job. They could be `scaleMin` and `scaleMax`. The problem, in that case, is that Alertmanager would need to pass those parameters when sending requests to Jenkins. But, Alertmanager does not have that information. It would need to rely on Prometheus which could get it from the labels scraped from cAdvisor. However, that would assume that all alerts are generated with data that come from cAdvisor. That might not be the case. So, if neither Alertmanager nor Prometheus are the right places to define (or discover) the scaling limits of a service, the only option left is for Jenkins job to discover it directly from the service. Since Pipeline code has, through its agents, access to Docker Manager, it could, simply, request that information. That should be the optimum solution since it would follow the pattern we used before. We would continue specifying all the information related to a service inside the service itself. To be more precise, we could add a few additional labels and let Jenkins “discover them”. The [stacks/go-demo-scale.yml](https://github.com/vfarcic/docker-flow-monitor/blob/master/stacks/go-demo-scale.yml) is a slightly modified version of the one we used by now. It defines two new labels. The relevant parts of the stack are as follows. ``` `1` version: '3' `2` `3` services: `4` `5` main: `6` image: vfarcic/go-demo `7` ... `8` deploy: `9` ... `10 ` labels: `11 ` ... `12 ` - com.df.scaleMin=2 `13 ` - com.df.scaleMax=4 `14 ` ... ``` `````````````````````````````````````` We used the `com.df.scaleMin` and `com.df.scaleMax` labels to define that the minimum number of replicas is two and the maximum four. Let’s update the stack with the new definition. ``` `1` docker stack deploy `\` `2 ` -c stacks/go-demo-scale.yml `\` `3 ` go-demo ``` ````````````````````````````````````` Please note that the `go-demo-scale.yml` stack has the number of replicas set to three, so the deployment of the stack will remove any extra replicas we created previously. Let us update the Pipeline script. The new version is as follows. 
``` `1` `...` `2` `script` `{` `3` `def` `inspectOut` `=` `sh` `script:` `"docker service inspect $service"``,` `4` `returnStdout:` `true` `5` `def` `inspectJson` `=` `readJSON` `text:` `inspectOut``.``trim``()` `6` `def` `currentReplicas` `=` `inspectJson``[``0``].``Spec``.``Mode``.``Replicated``.``Replicas` `7` `def` `newReplicas` `=` `currentReplicas` `+` `scale``.``toInteger``()` `8` `def` `minReplicas` `=` `inspectJson``[``0``].``Spec``.``Labels``[``"com.df.scaleMin"``].``toInteger``()` `9` `def` `maxReplicas` `=` `inspectJson``[``0``].``Spec``.``Labels``[``"com.df.scaleMax"``].``toInteger``()` `10 ` `if` `(``newReplicas` `>` `maxReplicas``)` `{` `11 ` `error` `"$service is already scaled to the maximum number of $maxReplicas repl\` `12` `icas"` `13 ` `}` `else` `if` `(``newReplicas` `<` `minReplicas``)` `{` `14 ` `error` `"$service is already descaled to the minimum number of $minReplicas re\` `15` `plicas"` `16 ` `}` `else` `{` `17 ` `sh` `"docker service scale $service=$newReplicas"` `18 ` `echo` `"$service was scaled from $currentReplicas to $newReplicas replicas"` `19 ` `}` `20` `}` `21` `...` ``` ```````````````````````````````````` Let us go through the new additions to the script. We are extending the usage of JSON obtained through `docker service inspect` command. In addition to the number of replicas, we are retrieving the values of the labels `com.df.scaleMin` and `com.df.scaleMax`. Further on, we have a simple conditional. If the new number of replicas is more than the maximum allowed, throw an error. Similarly, if the number of replicas is less than the minimum allowed, throw an error as well. We are scaling the service only if neither of those conditions is met. The script is still relatively simple and straight forward. Let’s go back to the job configuration screen. ``` `1` open `"http://``$(`docker-machine ip swarm-1`)``/jenkins/job/service-scale/configure"` ``` ``````````````````````````````````` Please replace the current pipeline with the contents of the [service-scale-4.groovy Gist](https://gist.github.com/vfarcic/fd15bcae2278d3a5ca223d67fe2f2e64) or edit it manually and test your ability to type while reading a book. Either way, press the *Apply* button when finished. Now we can test whether our scaling process can destroy the cluster. ``` `1` curl -X POST `"http://``$(`docker-machine ip swarm-1`)``/jenkins/job/service-scale/buil\` `2` `dWithParameters?token=DevOps22&service=go-demo_main&scale=1"` ``` `````````````````````````````````` Let us open the job activity screen and check the result of the last build. ``` `1` open `"http://``$(`docker-machine ip swarm-1`)``/jenkins/blue/organizations/jenkins/ser\` `2` `vice-scale/activity"` ``` ````````````````````````````````` As before, please navigate to the details of the last build and expand the last step. The output should be as follows. ``` `1` go-demo_main was scaled from 3 to 4 replicas ``` ```````````````````````````````` ![Figure 9-6: Jenkins job scaled the service](https://github.com/OpenDocCN/freelearn-devops-pt5-zh/raw/master/docs/dop-22-tlkt/img/00048.jpeg) Figure 9-6: Jenkins job scaled the service We’ll confirm the same result by listing the running processes of the `go-demo` stack. ``` `1` docker stack ps `\` `2 ` -f desired-state`=`Running go-demo ``` ``````````````````````````````` The output is as follows (IDs are removed for brevity). 
``` `1` NAME IMAGE NODE DESIRED STATE CURRENT STATE \ `2 ` ERROR PORTS `3` go-demo_db.1 mongo:latest swarm-1 Running Running about an hou\ `4` r ago `5` go-demo_main.1 vfarcic/go-demo:latest swarm-1 Running Running 6 hours ago `6` go-demo_main.2 vfarcic/go-demo:latest swarm-3 Running Running 6 hours ago `7` go-demo_main.3 vfarcic/go-demo:latest swarm-2 Running Running 6 hours ago `8` go-demo_main.4 vfarcic/go-demo:latest swarm-2 Running Running 16 seconds a\ `9` go ``` `````````````````````````````` As expected, the service scaled from three to four replicas. And now comes the moment of truth. Will our service continue scaling indefinitely or the limits will be respected? I know you know the answer, but I like being melodramatic every once in a while. ``` `1` curl -X POST `"http://``$(`docker-machine ip swarm-1`)``/jenkins/job/service-scale/buil\` `2` `dWithParameters?token=DevOps22&service=go-demo_main&scale=1"` ``` ````````````````````````````` If everything worked as planned, the last build threw an error. Feel free to check it yourself. If there is a purpose in UIs, that’s to announce in red color that something failed. More importantly than the error message in Jenkins, we should confirm that the number of replicas is still four. ``` `1` docker stack ps `\` `2 ` -f desired-state`=`Running go-demo ``` ```````````````````````````` The output is as follows (IDs are removed for brevity). ``` `1` NAME IMAGE NODE DESIRED STATE CURRENT STATE \ `2 ` ERROR PORTS `3` go-demo_db.1 mongo:latest swarm-1 Running Running about an hou\ `4` r ago `5` go-demo_main.1 vfarcic/go-demo:latest swarm-1 Running Running 6 hours ago `6` go-demo_main.2 vfarcic/go-demo:latest swarm-3 Running Running 6 hours ago `7` go-demo_main.3 vfarcic/go-demo:latest swarm-2 Running Running 6 hours ago `8` go-demo_main.4 vfarcic/go-demo:latest swarm-2 Running Running 17 minutes a\ `9` go ``` ``````````````````````````` I’ll skip the instructions for scaling down and observing that the lower limit is maintained. Feel free to play with it yourself, or just take my word for granted and trust me blindly. Either way, there is one more important thing missing. ### Notifying Humans That Scaling Failed We made significant progress by creating upper and lower limit for scaling. From now on, the script will not exceed them. However, the fact that we will stay within those limit does not mean that the problem that initiated the procedure is gone. Whichever process decided that a service should be scaled probably did that based on some metrics. If, for example, the average response time was slow and the system failed to scale up, the problem will persist unless there is some dark magic involved. We can categorize this situation as “the body tried to self-adapt, it failed, it’s time to consult a doctor.” Since we live in the 21st century, we won’t call him but send him a Slack message. Before we proceed and modify the script one more time, we need to configure Slack in Jenkins. ``` `1` open `"http://``$(`docker-machine ip swarm-1`)``/jenkins/configure"` ``` `````````````````````````` Once inside the *Configure System* screen, please scroll down to the *Global Slack Notifier Settings* section. Please enter *devops20* in the *Team Subdomain* field and *2Tg33eiyB0PfzxII2srTeMbd* in the *Integration Token* field. Now there is another bug or an undocumented feature. I guess it all depends on who you ask. We cannot test the connection before clicking the *Apply* button. There is an explanation for that, but we won’t go through it now. 
Once you applied the configuration, please click the *Test Connection* button. If everything worked as expected, you should see the *Success* message. At the same time, the *#df-monitor-tests* channel inside [DevOps20 team](https://devops20.slack.com) should have received a message similar to *Slack/Jenkins plugin: you’re all set on http://192.168.99.100/jenkins/*. Feel free to change the subdomain and the token to match your own Slack channel. You’ll find the token in *Slack* > *App & Integrations* > *Manage* > *Jenkins CI* screen. All that’s left is to *Save* the changes to the config and update the Pipeline script. We’ll add `post` section. ``` `1` `...` `2` `post` `{` `3` `failure` `{` `4` `slackSend``(` `5` `color:` `"danger"``,` `6` `message:` `"""$service could not be scaled.` `7` `Please check Jenkins logs for the job ${env.JOB_NAME} #${env.BUILD_NUMBER}` `8` `${env.RUN_DISPLAY_URL}"""` `9` `)` `10 ` `}` `11` `}` ``` ````````````````````````` Post sections in Declarative Pipeline are always executed no matter the outcome of the build steps. We can fine tune it by adding conditions. In our case, we specified that it should be executed only on `failure`. Inside it, we used the `slackSend` step from the [Slack Notification Plugin](https://jenkins.io/doc/pipeline/steps/slack/). There are quite a few arguments we could have specified but, in this case, we constrained ourselves to only two. We set the `color` to `danger` and the mandatory `message`. Please consult the plugin for more information if you’d like to fine-tune the behavior to your needs. Now we can open the job configuration page and apply the changes. ``` `1` open `"http://``$(`docker-machine ip swarm-1`)``/jenkins/job/service-scale/configure"` ``` ```````````````````````` Please modify the script yourself or replace it with the [service-scale-5.groovy Gist](https://gist.github.com/vfarcic/aeb332b2ab889a81377833f904148d10). When finished, please press the *Apply* button. We can quickly confirm whether notifications to Slack work by sending a request that would scale way below the limit. ``` `1` curl -X POST `"http://``$(`docker-machine ip swarm-1`)``/jenkins/job/service-scale/buil\` `2` `dWithParameters?token=DevOps22&service=go-demo_main&scale=-123"` ``` ``````````````````````` Please open the *#df-monitor-tests* channel in [`devops20.slack.com/`](https://devops20.slack.com/) and confirm that the message was sent. ![Figure 9-7: Jenkins notification in Slack](https://github.com/OpenDocCN/freelearn-devops-pt5-zh/raw/master/docs/dop-22-tlkt/img/00049.jpeg) Figure 9-7: Jenkins notification in Slack Now that we have a Jenkins job that is in charge of scaling our services, we should make sure that the system can execute it when certain thresholds are met. ### Integrating Alertmanager With Jenkins At the moment, we are running Alertmanager configured in the previous chapter. It creates a Slack notification on all alerts. Let’s try to change it so that alerts trigger a remote invocation of the Jenkins job `service-scale`. Since Alertmanager configuration is stored as a Docker secret and they are immutable (we cannot update them), we need to remove the service and the secret and create them again. ``` `1` docker service rm monitor_alert-manager `2` `3` docker secret rm alert_manager_config ``` `````````````````````` Let us define a Slack config that will send build requests to the *service-scale* job. The command that creates the service with the configuration is as follows. 
``` `1` `echo` `"route:` `2`` group_by: [service]` `3`` repeat_interval: 1h` `4`` receiver: 'jenkins-go-demo_main'` `5` `6` `receivers:` `7`` - name: 'jenkins-go-demo_main'` `8`` webhook_configs:` `9`` - send_resolved: false` `10 `` url: 'http://``$(`docker-machine ip swarm-1`)``/jenkins/job/service-scale/buil\` `11` `dWithParameters?token=DevOps22&service=go-demo_main&scale=1'` `12` `"` `|` docker secret create alert_manager_config - ``` ````````````````````` Unlike the previous configuration, this time we’re using [webhook_config](https://prometheus.io/docs/alerting/configuration/#<webhook_config>). The URL is the same as the one we used before. If the alert is executed, it will send a `buildWithParameters` request that will build `service-scale` job with `go-demo_main` as the `service`. You’ll notice that the parameters of the request are hard-coded. This time we are not using templating to customize the config. The problem is that `url` cannot use templated fields. For good or bad, that is part of the design. Instead, it sends all the fields of the alert as payload and expects the endpoint to translate it for its own needs. That would be great except for the fact that Jenkins does not accept job input fields in any other but its own format. All in all, both Alertmanager and Prometheus expect the other to adapt. So, we’re in a bit of a trouble and have to specify an entry for each service. That is far from optimum. Later on, we might discuss alternatives to this approach. We might come to the conclusion that Alertmanager should be extended with `jenkins_config`. Maybe we’ll extend Alertmanager with our own custom code that reconfigures it using labels. It could be `Docker Flow Alertmanager`. We might choose a different tool altogether. We are engineers, and we should not accept limitations of other tools but extend them to suit our needs or build our own. Everything in this book is based on open source, and we should contribute back to the community. However, we will not do any of those. For now, we’ll just accept the limitation and move on. The important thing to note is that you’d need a receiver for every service that should be scaled. It’s not the best solution, but it should do until a better solution emerges. If you’re interested in a discussion about the decision not to allow templates in `url` fields, please explore the [Alertmanager issue 684](https://github.com/prometheus/alertmanager/issues/684). Since we removed the `monitor_alert-manager` service, we should redeploy the `monitor` stack. This time, we’ll use a slightly modified version of the stack. The only difference is that we’ll (temporarily) publish Alertmanager’s port 9093\. That will allow us to test the configuration by sending HTTP requests to it. ``` `1` `DOMAIN``=``$(`docker-machine ip swarm-1`)` `\` `2 ` docker stack deploy `\` `3 ` -c stacks/docker-flow-monitor-slack-9093.yml `\` `4 ` monitor ``` ```````````````````` Please wait a few moments until `monitor_alert-manager` service is up and running. You can check the status by listing processes of the `monitor` stack (e.g. `docker stack ps monitor`). Before we test the integration with Alertmanager, we should reset the number of replicas of `go-demo_main` service back to three. ``` `1` docker service scale go-demo_main`=``3` ``` ``````````````````` Now that Alertmanager with the new configuration is running, we’ll send it a request that will help us validate that everything works as expected. 
``` `1` curl -H `"Content-Type: application/json"` `\` `2 ` -d `'[{"labels":{"service":"it-does-not-matter"}}]'` `\` `3 ` `$(`docker-machine ip swarm-1`)`:9093/api/v1/alerts ``` `````````````````` Please note that this time we did not specify `go-demo_main` as the service. Since all alerts are forwarded to the same Jenkins job and with the same parameters, it does not matter what we put in the request. We’ll fix that soon. For now, we should open Jenkins and see the activity of the `service-scale` job. ``` `1` open `"http://``$(`docker-machine ip swarm-1`)``/jenkins/blue/organizations/jenkins/ser\` `2` `vice-scale/activity"` ``` ````````````````` Alertmanager sent a request to Jenkins which, in turn, run a new build of the `service-scale` job. As a result, `go-demo_main` service should be scaled from three to four replicas. Let us confirm that. ``` `1` docker service ps `\` `2 ` -f desired-state`=`Running go-demo_main ``` ```````````````` The output is as follows (IDs are removed for brevity). ``` `1` NAME IMAGE NODE DESIRED STATE CURRENT STATE \ `2 ` ERROR PORTS `3` go-demo_main.1 vfarcic/go-demo:latest swarm-1 Running Running 3 hours ago `4` go-demo_main.2 vfarcic/go-demo:latest swarm-2 Running Running 3 hours ago `5` go-demo_main.3 vfarcic/go-demo:latest swarm-3 Running Running 3 hours ago `6` go-demo_main.4 vfarcic/go-demo:latest swarm-1 Running Running 3 minutes ago ``` ``````````````` As you can see from the output, the service is scaled to four replicas. ![Figure 9-8: Alertmanager triggering of a Jenkins job that results in scaling of a service](https://github.com/OpenDocCN/freelearn-devops-pt5-zh/raw/master/docs/dop-22-tlkt/img/00050.jpeg) Figure 9-8: Alertmanager triggering of a Jenkins job that results in scaling of a service Being able to send requests from Alertmanager to Jenkins works fine if all the alerts are the same. However, that is almost never the case. We should start distinguishing alerts. One easy improvement we can do is to create a default receiver. We can, for example, say that by default all alerts are sent to Slack and specify explicitly those that should be forwarded somewhere else. Let us remove the secret and the service and discuss the new configuration. ``` `1` docker service rm monitor_alert-manager `2` `3` docker secret rm alert_manager_config ``` `````````````` The configuration that envelops both Slack and Jenkins as receivers is as follows. ``` `1` `echo` `"route:` `2`` group_by: [service]` `3`` repeat_interval: 1h` `4`` receiver: 'slack'` `5`` routes:` `6`` - match:` `7`` service: 'go-demo_main'` `8`` receiver: 'jenkins-go-demo_main'` `9` `10` `receivers:` `11 `` - name: 'slack'` `12 `` slack_configs:` `13 `` - send_resolved: true` `14 `` title: '[{{ .Status | toUpper }}] {{ .GroupLabels.service }} service is \` `15` `in danger!'` `16 `` title_link: 'http://``$(`docker-machine ip swarm-1`)``/monitor/alerts'` `17 `` text: '{{ .CommonAnnotations.summary}}'` `18 `` api_url: 'https://hooks.slack.com/services/T308SC7HD/B59ER97SS/S0KvvyStV\` `19` `nIt3ZWpIaLnqLCu'` `20 `` - name: 'jenkins-go-demo_main'` `21 `` webhook_configs:` `22 `` - send_resolved: false` `23 `` url: 'http://``$(`docker-machine ip swarm-1`)``/jenkins/job/service-scale/buil\` `24` `dWithParameters?token=DevOps22&service=go-demo_main&scale=1'` `25` `"` `|` docker secret create alert_manager_config - ``` ````````````` The `route` section defines `slack` as the receiver. Further down, the `routes` section uses `match` to filter alerts. 
We specified that any alert with the `service` label set to `go-demo_main` should be sent to the `jenkins-go-demo_main` receiver. In other words, every alert will be sent to Slack unless it matches one of the `routes`. The `receivers` section defines `slack` and `jenkins-go-demo_main` entries. They are the same as those we used in previous configurations. We should be able to test the whole system now. We should generate a situation that will create an alert in Prometheus, fire it to Alertmanager, and, depending on the alert type, see the result in Slack or Jenkins. But, first things should come first. We should create the new `monitor_alert-manager` service by redeploying the stack. ``` `1` `DOMAIN``=``$(`docker-machine ip swarm-1`)` `\` `2 ` docker stack deploy `\` `3 ` -c stacks/docker-flow-monitor-slack.yml `\` `4 ` monitor ``` ```````````` As before, please execute `docker stack ps monitor` to confirm that all the services in the stack are running. We’ll also revert the number of replicas of the `go-demo_main` service to three. Since we set the maximum to four, an intent to scale up would fail if we do not put it back to three. ``` `1` docker service scale go-demo_main`=``3` ``` ``````````` Finally, we’ll simulate the “disaster” scenario by changing the `alertIf` conditions of our services. The first we’ll play with is `node-exporter` service from the `exporter` stack. We’ll set its node memory limit to one percent. That is certain to be lower than the actual usage. ``` `1` docker service update `\` `2 ` --label-add com.df.alertIf.1`=`@node_mem_limit:0.01 `\` `3 ` exporter_node-exporter ``` `````````` If everything went as planned, the chain of the events is about to unfold. The first stop is Prometheus. Let’s open the alerts screen. ``` `1` open `"http://``$(`docker-machine ip swarm-1`)``/monitor/alerts"` ``` ````````` The `exporter_nodeexporter_mem_load` alert should change its status to pending (orange color) and then to firing (red). If it’s still green, please wait a few moments and refresh the screen. Prometheus fired the alert to Alertmanager. Since it does not match any of the `routes` (`service` is not `go-demo_main`), it falls into “default” category and will be forwarded to Slack. That is the logical flow of actions. Since we do not (yet) have a mechanism that scales nodes, the only reasonable action is to notify humans through Slack and let them solve this problem. Feel free to visit the *#df-monitor-tests* channel inside [devops20.slack.com](https://devops20.slack.com/). The message generated with the alert from your system should be waiting for you. Before we proceed, we’ll revert the `exporter_node-exporter` service to its original state. ``` `1` docker service update `\` `2 ` --label-add com.df.alertIf.1`=`@node_mem_limit:0.8 `\` `3 ` exporter_node-exporter ``` ```````` Soon, another message will appear in Slack stating that the problem with the `exporter_node-exporter` is resolved. Let’s see what happens when an alert is generated and matches one of the routes. We’ll simulate another “disaster”. ``` `1` docker service update `\` `2 ` --label-add com.df.alertIf`=`@service_mem_limit:0.01 `\` `3 ` go-demo_main ``` ``````` You should know the drill by now. Wait until Prometheus fires the alert, wait a bit more, and, this time, confirm it by opening the `service-scale` activity screen in Jenkins. 
``` `1` open `"http://``$(`docker-machine ip swarm-1`)``/jenkins/blue/organizations/jenkins/ser\` `2` `vice-scale/activity"` ``` `````` Alertmanager filtered the alert, deduced that it matches a specific `route` and sent it to the matching receiver. This time, that receiver was `webhook_config` that sends requests to build `service-scale` Jenkins job using `go-demo_main` as the input parameter. All in all, our service was scaled one more time, and we’ll confirm that by listing all the running processes of the service. ``` `1` docker service ps `\` `2 ` -f desired-state`=`Running go-demo_main ``` ````` The output is as follows (IDs are removed for brevity). ``` `1` NAME IMAGE NODE DESIRED STATE CURRENT STATE \ `2 ` ERROR PORTS `3` go-demo_main.1 vfarcic/go-demo:latest swarm-1 Running Running 3 seconds ago `4` go-demo_main.2 vfarcic/go-demo:latest swarm-2 Running Running 3 hours ago `5` go-demo_main.3 vfarcic/go-demo:latest swarm-3 Running Running 3 hours ago `6` go-demo_main.4 vfarcic/go-demo:latest swarm-1 Running Running 16 minutes a\ `7` go ``` ```` A new replica (with index `1`) was created three seconds ago. We averted the “disaster” that could be caused by an imaginary increase in traffic that resulted in the increase in memory usage of the service. ![Figure 9-9: The full self-adaptive system applied to services](https://github.com/OpenDocCN/freelearn-devops-pt5-zh/raw/master/docs/dop-22-tlkt/img/00051.jpeg) Figure 9-9: The full self-adaptive system applied to services Unfortunately, we do not have a mechanism in place to scale down. The good news is that we will have it soon. ### What Now We explored how we can add Jenkins to the mix and make it scale any service. We used relative scaling and made sure that there are some limits so that the service will always be within some boundaries. Jenkins, by itself, proved to be very flexible and allowed us to set up a reasonably bullet-proof scaling mechanism with only a few lines of Declarative Pipeline code. Unfortunately, we hit some limits when integrating Alertmanager with Jenkins. As a result, Alertmanager config is not as generic as we’d like it to be. We might revisit that subject later and apply some alternative strategy. We might want to extend it. The solution might be called `Docker Flow Alertmanager`. Or, we might choose to replace Jenkins with our own solution. Since I’m fond of names that start with *Docker Flow*, we might add *Scaler* to the mix. We might opt for something completely unexpected, or we might say that the current solution is good enough. Time will tell. For now, the important thing to note is that we made a very important step towards having a *Self-Adapting* system that works on Swarm’s out-of-the-box *Self-Healing* capabilities. There are still a few critical problems we need to work on. Our *Self-Adapting* system applied to services does not scale down. The reason is simple. We need more data. Using memory as a metric is very important but not very reliable. Having memory below some threshold hardly gives us enough reason to scale up, and it definitely does not provide a valid metric that would let us decide to scale down. We need something else, and I’ll leave you guessing what that is. Another major missing piece of the puzzle is hardware. We are yet to build a system that *Self-Heals* and *Self-Adapts* servers. For now, we were concentrated only on services. That was the longest chapter by now. You must be wasted. If you’re not, I am, and this is where we’ll make a break. 
As always, hardware needs to rest as much as we do so we'll destroy the machines we created in this chapter and start the next one fresh.

```
docker-machine rm -f swarm-1 swarm-2 swarm-3
```

第十四章:描绘全貌:到目前为止的自给自足系统

自给自足系统是一个能够自我修复和自我适应的系统。修复意味着集群始终保持在设计的状态。例如,如果某个服务的副本宕机,系统需要重新启动它。适应性则是指修改期望状态,以便系统能够应对变化的条件。一个简单的例子是流量增加。当流量增加时,服务需要扩展。当修复和适应自动化时,我们就得到了自我修复和自我适应。两者共同构成了一个无需人工干预即可运行的自给自足系统。

一个自给自足的系统是什么样的?它的主要组成部分有哪些?谁是其中的参与者?

我们将讨论范围限制为服务,并忽略硬件同样重要的事实。考虑到这一限制,我们将描绘一个高层次的图景,描述从服务角度来看(大多数)自主系统。我们将从细节中抽身,俯瞰整个系统。

如果你是那种什么都想一次性看全的人,系统已在图 10-1 中做了总结。

图 10-1:具有自我修复和自我适应服务的系统

这样的图表可能一下子让人难以消化。直接展示给你可能会让你觉得同理心不是我的强项。如果你有这种感觉,你并不孤单。我的妻子也有同样的印象,即使没有任何图表。这次我会尽力改变你的看法,从头开始,重新整理一下。

我们可以将系统分为两个主要领域:人类和机器。可以把它们想象成《黑客帝国》(The Matrix)里的世界。如果你还没看过这部电影,请立刻放下这本书,准备些爆米花,去看一看吧。

在《黑客帝国》中,世界被机器控制。人类几乎不做什么,只有那些意识到发生了什么的人例外。大多数人生活在一个反映人类历史过去事件的梦境中。他们身体上处于现在,但思想却停留在过去。现代集群也呈现出类似的情况。大多数人仍然像 1999 年那样操作它们。几乎所有的操作都是手动的,过程繁琐,系统依靠蛮力和浪费的能量勉强存活下来。少数人意识到现在已经是 2017 年(至少在本文写作时是这样),而一个设计良好的系统是一个大部分工作由机器自动完成的系统。几乎所有的操作都是由机器而非人工操控的。

这并不意味着我们(人类)没有角色。我们有,但它更与创造性和非重复性任务相关。因此,如果我们仅关注集群操作,人类领域正在缩小,并被机器领域取代。

系统可以分为不同的角色。正如你将看到的那样,一个工具或一个人可以非常专业化,只执行单一的角色,也可以负责操作的多个方面。

开发者在系统中的角色

人类领域包括那些由人工操作的流程和工具。我们试图将所有可以重复的操作从这个领域移除。这并不意味着这个领域的目标是消失,恰恰相反。通过将重复的任务移出人类领域,我们解放了自己,使我们能花更多时间在那些带来真正价值的事情上。我们越少做那些可以委托给机器的任务,我们就能有更多时间去做那些需要创造力的工作。这种哲学符合每个角色在这场戏中的优势和劣势。机器擅长处理数据,它们知道如何快速执行预定义的操作,且比我们更快、更可靠。与机器不同,我们能够进行批判性思维,我们可以富有创造力。我们可以编程这些机器,告诉它们做什么以及何时做。

我将开发者指定为人类领域的主要角色。我有意避免使用“编码员”一词。开发者是指参与软件开发项目的所有人。无论你是编码员、测试员、运维专家,还是敏捷教练,都归为开发者这个群体。你们的工作成果是将某些内容提交到代码库中。在它到达之前,它就像不存在一样。无论它是在你的笔记本电脑上、笔记本中、桌面上,还是在一张小纸条上附着在信鸽上,都不重要。从系统的角度看,直到它进入代码库,它才算存在。那个代码库希望是 Git,但为了便于讨论,它可以是任何一个可以存储和版本管理代码的地方。

这个代码库也是人类领域的一部分。尽管它是一款软件,它仍然属于我们。我们来操作它。我们提交代码、拉取代码、合并代码,有时还会因太多的合并冲突而无奈地盯着它看。这并不意味着它没有自动化操作,也不意味着机器领域的某些部分在没有任何人为干预的情况下操作它。尽管如此,只要某件事大多是人工操作的,我们就会认为它属于人类领域。代码库绝对算是需要大量人工干预的系统的一部分。

图 10-2: 开发者将代码提交到代码库

让我们看看当代码提交到代码库时发生了什么。

系统中持续部署的角色

持续部署过程是完全自动化的。没有例外。如果你的流水线没有自动化,那就不是持续部署。你可能需要手动操作才能部署到生产环境。如果这个操作仅仅是按下一个写着deploy的按钮,那么你的过程是持续交付。我可以接受这种情况。可能出于业务原因需要这样一个按钮。尽管如此,自动化的程度和持续部署是一样的。你只是一个决策者。如果有任何其他手动操作,那你要么是在做持续集成,要么,更可能是在做一些不该带有“持续”字眼的工作。

无论是持续部署还是持续交付,过程都是完全自动化的。只有在你的系统是一个遗留系统,并且你的组织选择不去触碰它时,你才可以免于手动操作(通常是一个 Cobol 应用)。它仅仅是坐在服务器上做一些事情。我非常喜欢“没人知道它在做什么,不要碰它”类型的规则。这是一种在保持安全距离的同时,表现出极高尊重的方式。然而,我假设这不是你的情况。你想要去触碰它。你内心的渴望在燃烧。如果不是这样,而你不幸正在一个那种“远离它”的系统上工作,那么你读错了书,我很惊讶你自己没有意识到这一点。

一旦代码仓库接收到提交或拉取请求,它会触发一个 Web hook,发送请求给 CD 工具,启动持续部署过程。在我们的案例中,这个工具是Jenkins。该请求会启动流水线构建,执行各种持续部署任务。它会检出代码并运行单元测试。它构建一个镜像并将其推送到注册表。它运行功能测试、集成测试、性能测试以及其他需要实时服务的测试。流程的最后阶段(不包括生产环境测试)是向调度器发送请求,在生产集群中部署或更新服务。我们选择的调度器是 Docker Swarm。

图 10-3:通过 Jenkins 部署服务
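
作为示意,下面的片段大致对应这类流水线的最后一个阶段:构建镜像、推送到镜像仓库,然后通过 Swarm 管理节点部署(或更新)堆栈。其中的镜像名 `vfarcic/go-demo`、标签 `1.0` 以及堆栈文件 `stacks/go-demo.yml` 只是沿用本书前面示例的假设性取值,并非唯一的写法。

```
# 构建镜像并推送到镜像仓库(版本号通常由流水线传入)
docker image build -t vfarcic/go-demo:1.0 .
docker image push vfarcic/go-demo:1.0

# 通过 Swarm 部署(或更新)生产堆栈
docker stack deploy \
    -c stacks/go-demo.yml \
    go-demo
```

由于 `docker stack deploy` 对已存在的堆栈会触发滚动更新,同一条命令既可用于首次部署,也可用于后续的每一次更新。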

与持续部署并行,另一组进程正在运行并试图保持系统配置的最新状态。

系统中服务配置的角色

系统的某些部分需要在集群的任何方面发生变化时重新配置。代理可能需要更新其配置,指标收集器可能需要新的目标,日志解析器可能需要更新它的规则。

无论系统的哪些部分需要更改,这些更改都需要自动应用。几乎没有人对此有异议。更大的问题是,如何找到那些应当被纳入系统的信息。最理想的地方是服务本身。由于几乎所有的调度程序都使用 Docker,因此关于服务的信息最合理的位置就是它内部,以标签的形式存在。将信息放在其他地方会阻碍我们拥有单一的信息源,并使自动发现变得难以实现。

将服务信息放在服务内部,并不意味着同样的信息不应该存在于集群中的其他地方。它应该存在。然而,服务是主信息必须存在的地方,从那里开始,信息应当传播到其他服务。Docker 使这一过程变得非常简单。它已经有一个 API,任何人都可以接入,并发现任何服务的所有信息。

用来发现服务信息并将其传播到系统其他部分的工具是 Docker Flow Swarm Listener (DFSL)。你可以选择其他工具或构建自己的解决方案。这类工具的目标,特别是 Docker Flow Swarm Listener,是监听 Docker Swarm 事件。如果服务包含特定的标签集,监听器将在服务部署或更新时立即获取信息,并将其传递给所有相关方。在这种情况下,相关方是 Docker Flow Proxy (DFP)(内部包含 HAProxy)和 Docker Flow Monitor (DFM)(内部包含 Prometheus)。最终,二者都拥有始终最新的配置。代理拥有所有公开服务的路由,而 Prometheus 则拥有有关导出器、警报、Alertmanager 的地址以及其他许多信息。

图 10-4:通过 Docker Flow Swarm Listener 重新配置系统
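
下面是一个示意性的例子,展示这些标签在服务上大致的样子(标签取值仅作演示,沿用了前面章节中 `com.df.*` 标签的约定)。只要服务带有 `com.df.notify=true`,*Docker Flow Swarm Listener* 就会在服务部署或更新时把这些标签转发给代理和监控系统。

```
# 创建一个带有 com.df.* 标签的服务,供 Swarm Listener 自动发现
docker service create \
    --name go-demo \
    --network proxy \
    --label com.df.notify=true \
    --label com.df.distribute=true \
    --label com.df.servicePath=/demo \
    --label com.df.port=8080 \
    vfarcic/go-demo
```

代理据此知道应把 `/demo` 路径的请求转发到该服务的 `8080` 端口,而无需任何人手动修改 HAProxy 的配置文件。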

在部署和重新配置进行时,用户必须能够在不中断服务的情况下访问我们的服务。

代理在系统中的角色

每个集群都需要一个代理,代理将接收发送到单个端口的请求并将其转发到目标服务。唯一的例外是当我们只有一个面向公众的服务时。在这种情况下,值得质疑的不仅是我们是否需要代理,甚至是否根本需要集群。

当请求到达代理时,它会被评估,并根据其路径、域名或其他几个头信息,转发到其中一个服务。

Docker 使得代理的许多方面变得过时。负载均衡已经没有必要,Docker 的 Overlay 网络会为我们完成这项工作。我们也不需要维护托管服务的节点的 IP 地址,服务发现系统会为我们处理这一切。对头信息的评估和转发基本上就是代理应当执行的所有工作。
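
举个例子,只要把相关服务挂接到同一个 Overlay 网络,Docker 内置的 DNS 和虚拟 IP 就会替我们完成服务发现与副本之间的负载均衡。下面的命令仅作示意;`proxy` 网络在前面的章节中已经作为外部网络创建并使用过。

```
# 创建一个可被多个堆栈共用的 Overlay 网络
docker network create -d overlay proxy

# 挂接到该网络的服务可以直接通过服务名互相访问,例如 http://go-demo:8080
```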

由于 Docker Swarm 在服务的每个方面发生变化时都会利用滚动更新,因此持续部署(CD)过程不应产生任何停机时间。为了确保这一点成立,需要满足一些要求。除了其他要求外,服务至少需要运行两个副本,最好更多。否则,任何单副本的服务更新都会不可避免地造成停机。无论是几分钟、几秒钟还是毫秒,都没有区别。

停机并不总是灾难性的,这取决于服务的类型。如果 Prometheus 被更新为新版本,由于它无法扩展,肯定会有停机时间。但它不是一个面向公众的服务,除非你算上几个操作员。几秒钟的停机对它来说并不算大问题。

一个面向公众的服务,例如一个有成千上万甚至数百万用户在其中购物的在线零售商店,一旦出现故障,很快就会失去良好的声誉。作为消费者,我们已经被惯坏了,哪怕是一个小小的故障,也能让我们改变主意,转而投向竞争对手。如果这种"故障"一而再、再而三地发生,业务损失几乎是注定的。持续部署有很多优点,但由于它执行得比较频繁,它也放大了潜在的部署问题,停机就是其中之一。每天出现多次的"一秒钟停机"是不可接受的。

好消息是,结合滚动更新和多个副本,我们可以避免停机,只要代理始终保持最新。

滚动更新与能够动态重新配置自己的代理相结合,结果是用户可以在任何时候向服务发送请求,而不受持续部署、故障和集群状态变化的影响。

当用户向某个域发送请求时,请求通过任何一个健康节点进入集群,并被 Docker 的Ingress网络接管。该网络会检测到请求使用的是代理发布的端口并进行转发。代理则会评估请求的路径、域名或其他某个方面,并将请求转发到目标服务。

我们使用的是Docker Flow Proxy (DFP),它在 HAProxy 上增加了所需的动态性。

图 10-5:请求流向目标服务的过程
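
我们可以从集群外部直接验证这一流程。下面的命令假设 `go-demo` 服务通过 `/demo` 路径对外发布(与本书前面的示例一致),并且沿用了 `docker-machine` 创建的 `swarm-1` 节点。

```
# 请求经由任意节点的 Ingress 网络进入集群,再由代理根据路径转发到 go-demo
curl -i "http://$(docker-machine ip swarm-1)/demo/hello"
```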

我们接下来要讨论的角色是关于收集度量指标的。

系统中的度量指标角色

数据是任何集群中至关重要的组成部分,对那些朝着自适应方向发展的集群更是如此。几乎没有人会质疑历史指标和当前指标的必要性。如果没有它们,一旦出现问题,我们就会像无头苍蝇一样乱撞。关键问题不在于是否需要它们,而在于我们如何使用它们。传统上,运维人员会花费无数小时盯着仪表板看。这种做法效率极低,还不如去看 Netflix,至少后者更具娱乐性。应该由系统来使用这些指标。系统生成它们、收集它们,并且应该决定在指标达到某些阈值时执行哪些操作。只有这样,系统才能自适应;只有在没有人工干预的情况下执行这些操作,系统才能自给自足。

一个实现自适应的系统需要收集数据、存储数据并对数据采取行动。我将跳过推送和抓取数据的利弊讨论。由于我们选择使用Prometheus作为数据存储和评估的地方,以及生成和触发告警的服务,因此选择了抓取数据。这些数据通过导出器的形式提供。它们可以是通用的(例如 Node Exporter、cAdvisor 等),也可以是特定于某个服务的。在后者情况下,服务必须以 Prometheus 期望的简单格式暴露指标。

独立于我们之前描述的流程,导出器暴露了不同类型的指标。Prometheus 会定期抓取这些指标并将其存储在数据库中。与抓取数据并行,Prometheus 会持续评估由告警设置的阈值,如果达到任何一个阈值,它会被传播到Alertmanager。在大多数情况下,这些阈值的触发是由于条件发生变化(例如,系统负载增加)。

图 10-6:数据收集与告警
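
提醒一下,这些告警条件同样是以服务标签的形式定义的,因此 Prometheus 的配置可以随服务一起"自动发现"。下面的示意命令把 `go-demo_main` 的内存告警阈值设置为 80%(`@service_mem_limit` 这种简写在前面的章节中出现过;具体数值仅作示例)。

```
# 通过服务标签声明告警条件,Docker Flow Monitor 会据此更新 Prometheus
docker service update \
    --label-add com.df.alertIf=@service_mem_limit:0.8 \
    go-demo_main
```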

告警接收器是决定差异的关键。

系统中的告警角色

告警根据接收器的类型分为两大类。它可以转发到系统或人类。当某个告警被认定为应该发送到系统时,通常会转发一个请求到一个能够评估情况并执行适应系统任务的服务。在我们的案例中,这个服务是 Jenkins,它会执行预定义的某个作业。

Jenkins 执行的最常见任务是对服务进行扩容(或缩容)。然而,在尝试扩容之前,它需要先获取当前的副本数量,并将其与我们通过服务标签设置的上限和下限进行比较。如果扩容会导致副本数量超出这些边界,它会向 Slack 发送通知,以便由人来决定应采取什么操作来解决问题。另一方面,当扩容后的副本数量仍在限制范围内时,Jenkins 会向其中一个 Swarm 管理器发送请求,由后者增加(或减少)服务的副本数量。我们称这个过程为自适应,因为系统在没有人工干预的情况下适应了变化的条件。

图 10-7:系统自适应的通知
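
Jenkins 之所以能"发现"这些上下限,是因为它们以 `com.df.scaleMin` 和 `com.df.scaleMax` 标签的形式保存在服务本身之中。下面的示意命令展示了如何直接读取这两个标签(`--format` 模板只是演示,与流水线中解析 JSON 的做法等价)。

```
# 从服务定义中读取扩容的上下限
docker service inspect go-demo_main \
    --format 'min={{index .Spec.Labels "com.df.scaleMin"}} max={{index .Spec.Labels "com.df.scaleMax"}}'
```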

即使目标是使系统完全自动化,几乎可以肯定在某些情况下需要人工干预。这些情况本质上是无法预测的。当发生预期之外的情况时,让系统修复它。另一方面,当出现意外情况时,呼叫人类。在这些情况下,Alertmanager 向人类领域发送消息。在我们的情况下,这是一条Slack消息,但也可以是任何其他通讯服务。

当您开始设计自愈系统时,大多数警报将属于“意外”类别。您无法预测系统可能发生的所有情况。您可以做的是确保每个这类情况只被视为意外一次。当您收到警报时,您的第一组任务应该是手动调整系统。第二组同样重要的行动是改进 Alertmanager 和 Jenkins 的规则,以便在下次发生相同情况时,系统可以自动处理。

图 10-8:当发生意外情况时向人类发送通知

设置自适应系统是困难的,它是永远不会真正结束的事情。它将需要持续改进。那么自愈系统呢?它同样难以实现吗?

系统中的调度程序角色

与自适应不同,自愈相对容易实现。只要有可用资源,调度程序将确保指定数量的副本始终在运行。在我们的情况下,该调度程序是Docker Swarm

副本可能会失败,可能会被杀死,也可能会驻留在不健康的节点上。但这并不重要,因为 Swarm 会确保在需要时重新调度它们,并且(几乎)总是保持运行。如果我们的所有服务都是可扩展的,并且每个服务至少运行几个副本,就永远不会出现停机时间。Docker 内部的自我修复过程将确保这一点,而我们的自适应过程旨在提供高可用性。两者的结合使得系统几乎完全自治,且自给自足。

当服务不可扩展时,问题就开始堆积。如果我们无法拥有多个副本,Swarm 就无法保证没有停机时间。如果一个副本失败,它会被重新调度。但是,如果那个副本是唯一的副本,那么从失败到恢复运行之间的这段时间就会导致停机。这就像我们自己一样:我们生病了,躺在床上,过一段时间后才回到工作岗位。问题是,如果我们是公司里唯一的员工,而在我们离开时没有人来接手工作,那么就会造成问题。服务也是如此。两个副本是任何希望避免停机的服务的最小要求。

图 10-9:Docker Swarm 确保无停机时间

不幸的是,你的服务可能没有考虑到可扩展性。即使考虑到了,可你所使用的某些第三方服务可能并未做到这一点。可扩展性是一个重要的设计决策,也是我们在选择下一个工具时必须评估的必要条件。我们需要清楚地区分那些绝不能有停机时间的服务和那些在几秒钟不可用时不会对系统造成风险的服务。一旦做出这种区分,你就能知道哪些服务必须具备可扩展性。可扩展性是零停机服务的必要要求。
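
对于那些被判定为"绝不能停机"的服务,最直接的保障就是让它们始终至少运行两个副本。下面的命令仅作示意,副本数请按实际需要调整。

```
# 确保关键服务至少有两个副本
docker service update \
    --replicas 2 \
    go-demo_main

# 确认正在运行的副本
docker service ps \
    -f desired-state=Running go-demo_main
```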

集群角色在系统中的作用

最后,我们所做的一切都在一个或多个集群内。现在不再有独立的服务器。我们不决定什么东西放在哪里。调度器来做决定。从我们的(人类的)角度来看,最小的实体是集群,它是由资源(如内存和 CPU)组成的集合。

图 10-10:一切都是集群

现在怎么办?

我们稍作了休息。希望这次短暂的停顿能帮助你跳出细节,从更高、更远的角度审视我们所做的事情;也希望这段插曲让整体思路更加清晰,同时你已经重新充满电。接下来还有很多工作要做,希望你已经准备好迎接新的挑战。

第十五章:服务的监控

在前几章中,我们使用了来自 cAdvisor 的数据来自动扩展服务。具体来说,当达到内存限制时,Prometheus 会触发警报;当内存使用超过限制时,我们会扩展与该警报关联的服务。虽然这种方法是一个好的开始,但对于我们正在构建的系统类型来说,还远远不够。至少,我们需要测量服务的响应时间。我们是否应该去找一个能提供这类信息的导出器呢?

你的第一个想法可能是使用 haproxy_exporter。如果所有公共请求都经过它,抓取响应时间并基于收集到的数据设置一些警报是有意义的。这个模型与大多数其他监控系统的运作方式一致。唯一的问题是,这种方法几乎没什么用处。

并非所有请求都通过代理。那些不需要公开访问的服务并未与代理连接。例如,Docker Flow Swarm Listener 无法被访问。它没有公开任何端口,也没有 API。它监听 Docker Socket 并将信息发送给其他服务(例如 Docker Flow ProxyDocker Flow Monitor 等)。它对代理完全不可见。如果这就是监控代理的唯一问题,我们可能会忽视这种信息的缺乏。

当请求进入代理时,它会根据请求路径、域名和其他一些标准转发到一个服务。通过从代理中抓取指标,我们只会知道这些请求的响应时间。在许多情况下,接收来自代理的请求的服务还会向其他服务发起请求。例如,go-demoMongoDB进行通信。接收来自代理的请求的服务可能会向其他服务发起许多请求。代理对此一无所知。它接收请求,转发请求,等待响应,并将其重新发送到发起通信的客户端。它对中间发生的任何其他进程或请求毫不知情。因此,我们只能知道进入代理的请求的持续时间,但无法得知每个参与处理这些请求的服务的响应时间。

如果没有关于每个服务响应时间的知识,我们无法推断出哪个服务需要扩展。如果一个后端的响应时间很高,应该扩展这个后端还是它使用的数据库?

事情变得更加复杂的是,响应时间并不是我们需要的唯一指标。我们可能还会关注失败率、路径、方法以及其他一些附加数据。而所有这些数据需要与特定的服务或甚至某个具体的服务副本相关联。

如果你记性不错,可能还记得我说过:我的建议是始终从导出器开始,只有当现有导出器无法提供所需指标时,才对服务进行仪表化。嗯……我们已经到了导出器不够用的阶段。我们需要对服务进行仪表化,以收集更详细的指标。

我们将把焦点限制在少数几个指标上。具体来说,我们将探索如何收集错误计数、响应时间、状态码以及其他几个指标。不要把这个当作其他类型指标不重要的标志。它们是必要的。然而,我们需要将范围保持在一个合理的限度内,并在一定页数内产生可行的结果。否则,我们完全可以开始与大英百科全书竞争。我假设你会把这些示例当作它们的本意,并以此为基础构建你自己的系统。错误率、响应时间和状态码可能是最常见的指标类型,但它们几乎肯定不是你需要的唯一类型。

在范围仅限于少数几个指标的情况下,我们应该花一点时间讨论我们需要的数据。

定义服务特定指标背后的需求

我们可能需要不同类型的指标。其中一些可能是简单的计数器。一个很好的例子是错误。我们可能希望计数错误,并在错误数量达到某个阈值时做出反应。仅此可能还不够,我们应该能够根据产生错误的函数或服务的某个部分来区分错误。

那么更复杂的指标呢?响应时间就是另一个很好的例子。

我们可能需要一个提供请求响应时间的指标。这可能会引导我们产生类似resp_time 0.043这样的指标。它有一个名称(resp_time)和以秒为单位的值(0.043)。如果我们实现了这样的指标,我们很快会发现我们还需要更多。仅仅知道系统响应慢并不能告诉我们是哪一部分出现了问题。我们需要知道服务的名称。

我们可能无法对集群中的所有服务都进行仪表化。以go-demo栈为例,它由两个服务组成:一个后端和一个 MongoDB。后端在我们的控制之下,我们可以很容易地通过仪表化来扩展它的指标。数据库则是另一回事。虽然我们可以(也应该)使用MongoDB Exporter,但它提供的只是与服务器状态相关的数据。我们需要的是能够与后端服务关联起来的指标。我们需要知道,发送到go-demo栈的请求变慢,究竟是后端的问题还是数据库的问题。假设我们不会为了自己的需求去"改造"MongoDB,我们就应该尝试通过扩展我们所控制的服务内部的指标,来回答这个问题以及其他一些问题。

我们可以使用请求的路径和方法。把它们加入指标后,应该能提供相当好的信息粒度。根据路径和方法,我们可以判断该指标是与数据库相关,还是仅限于服务的内部流程。我们也可以把查询参数加进来,但那就有些过头了:它几乎会把每个请求都单独记录下来,可能会在 Prometheus 中造成过高的内存和 CPU 开销。更新后的指标可能如下所示。

```
resp_time{method="GET",path="/demo/hello",service="go-demo"} 0.611446292
```

Through those labels, we would know which service the metric belongs to, the path of the request, and the method. In the sea of possible additional labels we could add, there is one more that could be considered critical. We should know the status code. If we adopt standard HTTP response codes, the same ones our backend provides with the rest of the response, we can easily filter metrics and, for example, retrieve only those that are related to server errors. Our updated metric could be as follows. ``` `1` resp_time{code="200",method="GET",path="/demo/hello",service="go-demo"} 0.611446\ `2` 292 ``` ``````````````````````````````````````````````` That is surely it. Right? Well, that’s not quite what we truly need. A few other critical things are missing but we will not comment on them just yet. Since the libraries used to instrument code add a few additional features, we’ll comment on them once we reach the hands-on part. For now, it is enough to know that we can instrument services to generate metrics to count (e.g. errors) or observe (e.g. response times). Any additional information can be provided through labels. ### Differentiating Services Based On Their Types Before we start instrumenting our services, we should discuss services we’re deploying. They can be divided into three categories: online services, offline services, and batch processes. While there is overlap between each of those types and it is often not that easy to place a service into only one of them, such a division will provide us with a good understanding of the types of metrics we should implement. We can define online services as those that accept requests from another service, a human, or a client (e.g. browser). Those who send requests to online services often expect an immediate response. Front-end, APIs, and databases are only a few of the examples of such services. Due to the expectations we have from them, the key metrics we are interested in are the number of requests they served, the number of errors they produced, and latency. Offline services are those that do not have a client that is waiting for a response. Something, or someone, instructs those services to do some tasks without waiting until they are finished. A good example of such a service would be Jenkins. Even though it does have it’s UI and API and can fall in the category of online services, most of the work it does is offline. An example would be builds triggered by a webhook from a code repository. Those webhooks do not wait until Jenkins finishes building the jobs initiated by them. Instead, they announce that there is a new commit and trust that Jenkins will know what to do and will do it well. With offline services, we usually want to track the number of tasks being executed, the last time something was processed, and the length of queues. Finally, the last group of services is batch processes. The significant difference when compared with offline services is that batch jobs do not run continuously. They start execution of a task or a group of tasks, terminate, and disappear. That makes them very difficult to scrape. Prometheus would not know when a batch job started nor when it should end. We cannot expect a system (Prometheus or any other) to pull metrics from a batch job. Our best bet is to push them instead. With such services, we usually track how long it takes to complete them, how long each stage of a job lasted, the time a job finished executing, and whether it produced an error or it was successful. 
I prefer avoiding batch jobs since they are very hard to track and measure. Instead, when possible, we should consider converting them into offline services. A good example is, again, Jenkins. It allows us to schedule execution of a job thus providing a similar functionality as a batch process while still providing easy to scrape metrics and health checks. Now that we divided services into groups, we can discuss different types of metrics we can define when instrumenting our services. ### Choosing Instrumentation Type Prometheus supports four major metric types. We can make a choice between counters, gauges, summaries, and histograms. We will see them in action soon. For now, we’ll limit the discussion to a brief overview of each. A *counter* can only go up. We cannot decrease its value. It is useful for accumulating values. An example would be errors. Each error in the system should increase the counter by one. Unlike counters, *gauge* values can go both up and down. A good example of a gauge is memory usage. It can increase, only to decrease a few moments later. *Histograms* and *summaries* are more complex types. They are often used to measure durations of requests and sizes of responses. They track both summaries and counts. When those two are combined, we can measure averages over time. Their data is usually placed in buckets that form quantiles. We’ll go deeper into each of the metric types through practical examples starting with a *counter* as the simplest of all. But, before we do that, we need to create a cluster that will serve as our playground. ### Creating The Cluster And Deploying Services All hands-on parts of the past chapters started with the execution of a script that creates a cluster and deploys the services we’ll need. This chapter is no exception. You know the drill so let’s get to it. ``` `1` chmod +x scripts/dm-swarm-11.sh `2` `3` ./scripts/dm-swarm-11.sh `4` `5` `eval` `$(`docker-machine env swarm-1`)` `6` `7` docker stack ls ``` `````````````````````````````````````````````` We executed the `dm-swarm-11.sh` script which, in turn, created a Swarm cluster composed of Docker Machines, created the networks, and deployed only one stack. The last command listed all the stacks in the cluster and showed that we are running only the `proxy` stack. Let’s move into the *counter* metric. ### Instrumenting Services Using Counters There are many usages of the *counter* metric. We can measure the number of requests entering the system, the number of bytes sent through the network, the number of errors, and so on. Whenever we want to record an incremental value, a counter is a good choice. We’ll use counter to track errors produced by a service. With such a goal, a counter is usually put around code that handles errors. The examples that follow are taken from [Docker Flow Swarm Listener](http://swarmlistener.dockerflow.com/). The code is written in Go. Do not be afraid if that is not your language of choice. As you will see, examples are straightforward and can be easily extrapolated to any programming language. Prometheus provides [client libraries](https://prometheus.io/docs/instrumenting/clientlibs/) for a myriad of languages. Even if your favorite language is not one of them, it should be relatively easy to roll-out your own solution based on [exposition formats](https://prometheus.io/docs/instrumenting/exposition_formats/). The reason for choosing *Docker Flow Swarm Listener* lies in its type. It is (mostly) an offline service. 
Most of its objectives are to listen to Docker events through its socket and propagate Swarm events to other services like *Docker Flow Proxy* and *Docker Flow Monitor*. It does have an API but, since it is not its primary function, we’ll ignore it. As such, it is not a good candidate for more complex types of metrics thus making it suitable for a counter. That does not mean that the counter is the only metric it implements. However, we need to start from something simple, so we’ll ignore the others. There are a few essential things code needs to do to start producing metrics. We must define a variable that determines the type of the metric. Since we want to count errors, the code can be as follows. ``` `1` `var` `errorCounter` `=` `prometheus``.``NewCounterVec``(` `2 ` `prometheus``.``CounterOpts``{` `3 ` `Subsystem``:` `"docker_flow"``,` `4 ` `Name``:` `"error"``,` `5 ` `Help``:` `"Error counter"``,` `6 ` `},` `7 ` `[]``string``{``"service"``,` `"operation"``},` `8` `)` ``` ``` errorCounter based on CounterVec structure provided through the NewCounterVec function. The function requires two arguments. The first one (CounterOpts) defines options of the counter. In our case, we set the subsytem to docker_flow and the name to error. The fully qualified metric name consists of the namespace (we’re not using it today), subsystem, and name. When combined, the metric we are creating will be called docker_flow_error. Help is only for informative purposes and should help users of our metrics understand better its purpose. As you can see, I was not very descriptive with the help. Hopefully, it is clear what it does without a more detailed explanation. ``` The second argument is the list of labels. They are critical since they allow us to filter metrics. In our case, we want to know which service generated metrics. That way, we can have the same instrumentation across many services and choose whether to explore them all at once or filter by the service name. Knowing which service produced errors is often not enough. We should be able to pinpoint a particular operation that caused a problem. The second label called `operation` provides that additional info. It is important to specify all the labels we might need when filtering metrics, but not more. Each label requires extra resources. While that is in most cases negligible overhead, it could still have a negative impact when dealing with big systems. Just follow the rule of “everything you need, but not more” and you should be on the right track. Please read the [Use labels](https://prometheus.io/docs/practices/instrumentation/#use-labels) section of the instrumentation page for a discussion about dos and don’ts. The `errorCounter` variable is, in Prometheus terms, called collector. Each collector needs to be registered. We’ll do that inside `init` function that is executed automatically, thus saving us from worrying about it. ``` `1` `func` `init``()` `{` `2 ` `prometheus``.``MustRegister``(``errorCounter``)` `3` `}` ``` ````````````````````````````````````````````` Now we are ready to start incrementing the `errorCounter`. Since I do not like repeated code, the code that increments the metric is wrapped into another function. It is as follows. 
``` `1` `func` `recordError``(``operation` `string``,` `err` `error``)` `{` `2 ` `metrics``.``errorCounter``.``With``(``prometheus``.``Labels``{` `3 ` `"service"``:` `metrics``.``serviceName``,` `4 ` `"operation"``:` `operation``,` `5 ` `}).``Inc``()` `6` `}` ``` ```````````````````````````````````````````` Whenever this function is called, `errorCounter` will be incremented by one (`Inc()`). Each time that happens, the name of the service and the operation that produced the error will be recorded as labels. An example invocation of the `recordError` function is as follows. ``` `1` `...` `2` `err` `=` `n``.``ServicesCreate``(` `3` `newServices``,` `4` `args``.``Retry``,` `5` `args``.``RetryInterval``,` `6` `)` `7` `if` `err` `!=` `nil` `{` `8` `metrics``.``RecordError``(``"ServicesCreate"``)` `9` `}` `10` `...` ``` ``````````````````````````````````````````` The function `ServicesCreate` returns an `err` (short for `error`). If the `err` is not `nil`, the `recordError` is called passing `GetServices` as operation and thus incrementing the counter. The last piece missing is to enable `/metrics` as the endpoint Prometheus can use to scrape metrics from out service. ``` `1` `func` `(``m` `*``Serve``)` `Run``()` `error` `{` `2 ` `mux` `:=` `http``.``NewServeMux``()` `3 ` `...` `4 ` `mux``.``Handle``(``"/metrics"``,` `prometheus``.``Handler``())` `5 ` `...` `6` `}` ``` ``` /metrics as the address that is handled by Prometheus handler provided with the GoLang client library we’re using. ``` I hope that those few snippets of Go code were not scary. Even if you never worked with Go, you probably managed to understand the gist of it and will be able to create something similar in your favorite language. Remember to visit [Client Libraries](https://prometheus.io/docs/instrumenting/clientlibs/) page, choose the preferred language, and follow the instructions. If you’re interested in the full source code behind the snippets, please visit [vfarcic/docker-flow-swarm-listener](https://github.com/vfarcic/docker-flow-swarm-listener) GitHub repository. Let’s see those metrics in action. Since `swarm-listener` deployed through the `proxy` stack does not publish port `8080`, we’ll create a new service attached to the `proxy` network. It will be global so that it is guaranteed to run on each node. That way it’ll be easier to find the container, enter into it, and send requests to `swarm-listener`. ``` `1` docker service create --name util `\` `2 ` --mode global `\` `3 ` --network proxy `\` `4 ` alpine sleep `1000000` ``` `````````````````````````````````````````` We created the `util` service based on the `alpine` image and made it sleep for a very long time. Please confirm that it is up-and-running by executing `docker service ps util`. Let’s find the ID of the container running on the node our Docker client points to and enter inside it. ``` `1` `ID``=``$(`docker container ls -q `\` `2 ` -f `"label=com.docker.swarm.service.name=util"``)` `3` `4` docker container `exec` -it `$ID` sh ``` ````````````````````````````````````````` The only thing missing is to install `curl`. ``` `1` apk add --update curl ``` ```````````````````````````````````````` Now we can send a request to `swarm-listener` and retrieve metrics. ``` `1` curl `"http://swarm-listener:8080/metrics"` ``` ``````````````````````````````````````` You’ll see a lot of metrics that come out of the box when using Prometheus clients. In this case, most of the metrics are very particular to Go, so we’ll skip them. 
What you won’t be able to see is `docker_flow_error`. Since the service did not produce any errors, that metric does not show. Let’s get out of the container we’re in. ``` `1` `exit` ``` `````````````````````````````````````` My guess is that you would not be delighted reaching this far without seeing the metric we discussed so let us generate a situation in which `swarm-listener` will produce errors. *Docker Flow Swarm Listener* discovers services by communicating with Docker Engine through its socket. Typically, the service mounts the socket to the host and, in that way, Docker client inside the container communicates with Docker Engine running on the node. If we remove that mount, the communication will be broken, and *Docker Flow Swarm Listener* will start reporting errors. Let’s test it out. ``` `1` docker service update `\` `2 ` --mount-rm /var/run/docker.sock `\` `3 ` proxy_swarm-listener ``` ````````````````````````````````````` We removed the `/var/run/docker.sock` mount and the communication between Docker client inside the container and Docker engine on the host was cut. We should wait a few moments until Docker reschedules a new replica. If you want to confirm that the update was finished, please execute `docker stack ps proxy`. Let’s check the logs and confirm that the service is indeed generating errors. ``` `1` docker service logs proxy_swarm-listener ``` ```````````````````````````````````` One of the output entries should be similar to the one that follows. ``` `1` ... `2` Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docke\ `3` r daemon running? ``` ``````````````````````````````````` Now that the service is generating errors, we can take another look at the metrics and confirm that `docker_flow_error` is indeed added to the list of metrics. ``` `1` docker container `exec` -it `$ID` sh `2` `3` curl `"http://swarm-listener:8080/metrics"` ``` `````````````````````````````````` We entered the `util` replica and sent a request to `swarm-listener` endpoint `/metrics`. The output, limited to the relevant parts, should be as follows. ``` `1` ... `2` # HELP docker_flow_error Error counter `3` # TYPE docker_flow_error counter `4` docker_flow_error{operation="GetServices",service="swarm_listener"} 10 `5` ... ``` ````````````````````````````````` Please note that metrics are ordered alphabetically, so `docker_flow_error` should be somewhere around the top. As you can see, `docker_flow_error` generated `10` errors. By inspecting labels, we can see that the operation that causes errors is `GetServices` and that the service is `swarm_listener`. If this would be a production system, we’d know not only that there is a problem with the service but also which part of it caused the issue. That is very important since the actions the system should take are rarely the same for the whole service. Knowing that the problem is related to a particular operation or a function, lets us fine tune the actions the system should take when certain thresholds are reached. Before we continue, let us exit the container we’re in and restore the `swarm-listener` to its original state. ``` `1` `exit` `2` `3` docker stack deploy `\` `4 ` -c stacks/docker-flow-proxy-mem.yml `\` `5 ` proxy ``` ```````````````````````````````` We redeployed the stack and thus restored the mount we removed. Let’s try to generate the same metric with a different value. We can, for example, remove `proxy` service and deploy `go-demo` stack. 
*Docker Flow Swarm Listener* will detect a new service and try to send the information to the `proxy`. If it fails to do so, Prometheus client will increase `docker_flow_error` by one. ``` `1` docker service rm proxy_proxy `2` `3` docker stack deploy `\` `4 ` -c stacks/go-demo-scale.yml `\` `5 ` go-demo ``` ``````````````````````````````` We removed the proxy and deployed `go-demo` stack. *Docker Flow Swarm Listener* will try to send service information to the proxy and, since we removed it, fail to do so. By default, if `swarm-listener` fails to deliver information, it retries for fifty times with five seconds delay in between. That means that we need to wait a bit over 4 minutes for `swarm-listener` to give up and throw an error. After a while, we can check the logs. ``` `1` docker service logs proxy_swarm-listener ``` `````````````````````````````` After fifty retries, you should see log entries similar to the ones that follow. ``` `1` ... `2` Retrying service created notification to ... `3` ERROR: Get ...: dial tcp: lookup proxy on 127.0.0.11:53: no such host ``` ````````````````````````````` Now we can go back to the `util` container and take another look at the metrics. ``` `1` docker container `exec` -it `$ID` sh `2` `3` curl `"http://swarm-listener:8080/metrics"` ``` ```````````````````````````` This time, `docker_flow_error` metric is slightly different. ``` `1` # HELP docker_flow_error Error counter `2` # TYPE docker_flow_error counter `3` docker_flow_error{operation="notificationSendCreateServiceRequest",service="swar\ `4` m_listener"} 1 `5` ... ``` ``````````````````````````` The `operation` label has the value `notificationSendCreateServiceRequest` clearly indicating that it comes from a different place than the previous error. The two errors we explored are of quite a different nature and should be treated differently. The one associated with the label `GetServices` means that there is no communication with the Docker socket. That could be caused by a faulty manager and the action that should remedy that could be to reschedule the service to a different node or maybe even to remove that node altogether. The code of the service will retry establishing socket connection so we should probably not react on the first occupancy of the metric but wait until, for example, it reaches twenty failed attempts over the timespan of five minutes or less. The error related to the `notificationSendCreateServiceRequest` means that there is no communication with the services that should receive notifications. In this case, that destination is the proxy. The problem might be related to networking, or the proxy is not running. Our action might be to check whether the proxy is running and, if it isn’t, deploy it again. Or maybe there should be no action at all. The proxy itself should have its own alerts that will remedy the situation. Moreover, the service does not throw an error when the connection with the proxy fails. Instead, it retries it for a while and errors only if all attempts failed. That means that we should react on the first occurrence of the error. As you can see, even though those two errors come from the same service, the causes and the actions associated with them are entirely different. For that reason, we are using the `operation` label to distinguish them. Later on, it should be relatively easy to filter them in Prometheus and define different alerts. Instrumenting our service with counters was easy. Let’s see whether gauge is any different. 
Since we removed the proxy service, we should exit the `util` container and restore the stack to its original state before we proceed further. ``` `1` `exit` `2` `3` docker stack deploy `\` `4 ` -c stacks/docker-flow-proxy-mem.yml `\` `5 ` proxy ``` `````````````````````````` ### Instrumenting Services Using Gauges Gauges are very similar to counters. The only significant difference is that we can not only increment, but also decrease their values. We’ll continue exploring [vfarcic/docker-flow-swarm-listener](https://github.com/vfarcic/docker-flow-swarm-listener) GitHub repository for an example of a gauge. Since `gauge` is almost identical to `counter`, we won’t go into many details but only briefly explore a few snippets. Just as with a counter, we need to declare a variable that defines the type of the metric. A simple example is as follows. ``` `1` `var` `serviceGauge` `=` `prometheus``.``NewGaugeVec``(` `2` `prometheus``.``GaugeOpts``{` `3` `Subsystem``:` `"docker_flow"``,` `4` `Name``:` `"service_count"``,` `5` `Help``:` `"Service gauge"``,` `6` `},` `7` `[]``string``{``"service"``},` `8` `)` ``` ````````````````````````` Next, we need to register it with Prometheus. We’ll reuse the code from the `init` function where we defined the `errorCounter` and add `serviceGauge`. ``` `1` `func` `init``()` `{` `2` `prometheus``.``MustRegister``(``errorCounter``,` `serviceGauge``)` `3` `}` ``` ```````````````````````` There’s also a function that simplifies the usage of the metric. ``` `1` func RecordService(count int) { `2` serviceGauge.With(prometheus.Labels{ `3` "service": serviceName, `4` }).Set(float64(count)) `5` } ``` ``````````````````````` We’re setting the value of the gauge using the `Set` function. Alternatively, we could have used `Add` or `Sub` functions to add or subtract the value. `Inc` or `Dec` can be used to increase of decrease the value by one. Finally, on every iteration of the `swarm-listener`, we are setting the gauge to the number of services retrieved by `swarm-listener`. ``` `1` `metrics``.``RecordService``(``len``(``service``.``Services``))` ``` `````````````````````` Let’s take another look at the `/metrics` endpoint. ``` `1` docker container `exec` -it `$ID` sh `2` `3` curl `"http://swarm-listener:8080/metrics"` ``` ````````````````````` One of the metric entries is as follows. ``` `1` # HELP docker_flow_service_count Service gauge `2` # TYPE docker_flow_service_count gauge `3` docker_flow_service_count{service="swarm_listener"} 1 ``` ```````````````````` It might look confusing that the value of the metric is one since we are running a few other services. *Docker Flow Swarm Listener* fetches only services with the `com.df.notify` label. Among the services we’re currently running, only `go-demo_main` has that label, hence being the only one included in the metric. Let’s see what happens if we remove `go-demo_main` service. ``` `1` `exit` `2` `3` docker service rm go-demo_main `4` `5` docker container `exec` -it `$ID` sh `6` `7` curl `"http://swarm-listener:8080/metrics"` ``` ``````````````````` The output of the `/metrics` API is as follows (limited to the relevant parts). ``` `1` # HELP docker_flow_service_count Service gauge `2` # TYPE docker_flow_service_count gauge `3` docker_flow_service_count{service="swarm_listener"} 0 `4` ... ``` `````````````````` As you can see, the `docker_flow_service_count` metric is now set to zero thus accurately representing the number of services discovered by `swarm-listener`. 
If, in your case, the number is still one, please wait a few moments and try again. *Docker Swarm Listener* has five seconds iterations, and you might have requested metrics too soon. Let us exit the `util` container and restore the `go-demo` stack before we proceed into histograms. ``` `1` `exit` `2` `3` docker stack deploy `\` `4 ` -c stacks/go-demo-scale.yml `\` `5 ` go-demo ``` ````````````````` ### Instrumenting Services Using Histograms And Summaries When compared with counters and gauges, histograms are much more complex. That does not mean that they are harder to implement but that the data they provide is less simple when compared with the other metric types we explored. We’ll comment on them by studying a sample code and the output it provides. We’ll switch from the [vfarcic/docker-flow-swarm-listener](https://github.com/vfarcic/docker-flow-swarm-listener) repository to [vfarcic/go-demo](https://github.com/vfarcic/go-demo) since it provides a simple example of a histogram. Just as with the other types of metrics, histogram also needs to be declared as a variable of the particular type. ``` `1` `var` `(` `2` `histogram` `=` `prometheus``.``NewHistogramVec``(``prometheus``.``HistogramOpts``{` `3` `Subsystem``:` `"http_server"``,` `4` `Name``:` `"resp_time"``,` `5` `Help``:` `"Request response time"``,` `6` `},` `[]``string``{` `7` `"service"``,` `8` `"code"``,` `9` `"method"``,` `10 ` `"path"``,` `11 ` `})` `12` `)` ``` ```````````````` The objective of the metric is to record information about response times. Its labels provide additional information like the name of the service (`service`), the response code (`code`), the method of the request (`method`), and the path (`path`). All those labels together should give us a fairly accurate picture of the response times of the service, and we’ll be able to filter the results using any combination of the labels. Next is a helper function that will allow us to record metrics easily. ``` `1` `func` `recordMetrics``(``start` `time``.``Time``,` `req` `*``http``.``Request``,` `code` `int``)` `{` `2` `duration` `:=` `time``.``Since``(``start``)` `3` `histogram``.``With``(` `4` `prometheus``.``Labels``{` `5` `"service"``:` `serviceName``,` `6` `"code"``:` `fmt``.``Sprintf``(``"%d"``,` `code``),` `7` `"method"``:` `req``.``Method``,` `8` `"path"``:` `req``.``URL``.``Path``,` `9` `},` `10 ` `).``Observe``(``duration``.``Seconds``())` `11` `}` ``` ``````````````` The `recordMetrics` function accepts argument that defines the time when a request started (`start`), the request itself (`req`), and the response code (`code`). We’re calling histogram’s’ `Observe` function with the duration of the request expressed in seconds. The duration is obtained by calculating the time passed since the value of the `start` variable. Let’s take a look at one of the functions that invokes `recordMetrics`. ``` `1` `func` `HelloServer``(``w` `http``.``ResponseWriter``,` `req` `*``http``.``Request``)` `{` `2 ` `start` `:=` `time``.``Now``()` `3 ` `defer` `func``()` `{` `recordMetrics``(``start``,` `req``,` `http``.``StatusOK``)` `}()` `4` `5 ` `// The rest of the code that processes the request.` `6` `}` ``` `````````````` Whenever a request is made to a particular path, the web server invokes the `HelloServer` function. That function starts by recording the current time and storing it in the `start` variable. Go has a special statement that defers execution of a function. In this case, we defined that the invocation of the `recordMetrics` should be deferred. 
As a result, it will be executed before the `HelloServer` function exists, thus giving us an (almost) exact duration of the requests. A similar logic is applied to all endpoints of the service thus providing us with the response times of the whole service. If you’re interested in the full source code behind the snippets, please visit [vfarcic/go-demo](https://github.com/vfarcic/go-demo) GitHub repository. Let us send some traffic to the `go-demo` service before we explore the histogram metrics. ``` `1` `for` i in `{``1`..100`}``;` `do` `2 ` curl `"http://``$(`docker-machine ip swarm-1`)``/demo/hello"` `3` `done` ``` ````````````` We’ll repeat the already familiar process of entering the `util` container and retrieving the metrics. The only difference is that this time we’ll explore `go-demo_main` metrics instead of those from the `swarm-listener`. ``` `1` docker container `exec` -it `$ID` sh `2` `3` curl `"http://go-demo_main:8080/metrics"` ``` ```````````` The output, limited to relevant parts, is as follows. ``` `1` ... `2` # HELP resp_time Request response time `3` # TYPE resp_time histogram `4` resp_time_bucket{code="200",method="GET",path="/demo/hello",service="go-demo",le\ `5` ="0.005"} 69 `6` resp_time_bucket{code="200",method="GET",path="/demo/hello",service="go-demo",le\ `7` ="0.01"} 69 `8` resp_time_bucket{code="200",method="GET",path="/demo/hello",service="go-demo",le\ `9` ="0.025"} 69 `10` resp_time_bucket{code="200",method="GET",path="/demo/hello",service="go-demo",le\ `11` ="0.05"} 69 `12` resp_time_bucket{code="200",method="GET",path="/demo/hello",service="go-demo",le\ `13` ="0.1"} 69 `14` resp_time_bucket{code="200",method="GET",path="/demo/hello",service="go-demo",le\ `15` ="0.25"} 69 `16` resp_time_bucket{code="200",method="GET",path="/demo/hello",service="go-demo",le\ `17` ="0.5"} 69 `18` resp_time_bucket{code="200",method="GET",path="/demo/hello",service="go-demo",le\ `19` ="1"} 69 `20` resp_time_bucket{code="200",method="GET",path="/demo/hello",service="go-demo",le\ `21` ="2.5"} 69 `22` resp_time_bucket{code="200",method="GET",path="/demo/hello",service="go-demo",le\ `23` ="5"} 69 `24` resp_time_bucket{code="200",method="GET",path="/demo/hello",service="go-demo",le\ `25` ="10"} 69 `26` resp_time_bucket{code="200",method="GET",path="/demo/hello",service="go-demo",le\ `27` ="+Inf"} 69 `28` resp_time_sum{code="200",method="GET",path="/demo/hello",service="go-demo"} 0.00\ `29` 3403602 `30` resp_time_count{code="200",method="GET",path="/demo/hello",service="go-demo"} 69 `31` ... ``` ``````````` Unlike counters and gauges, each histogram produces quite a few metrics. The major one is `resp_time_sum` that provides a summary of all the recorded responses. Below it is `resp_time_counter` with the number of responses. Based on those two, we can see that `69` responses took `0.0034` seconds. If we’d like to get the average time of the responses, we’d need to divide `sum` with `count`. In addition to `sum` and `count`, we can observe the number of responses grouped into different buckets called quantiles. At the moment, all sixty-nine requests fall into all of the quantiles, so we’ll postpone discussion about them until we reach the examples with more differencing response times. One thing worth noting is that the metrics come from only one of the three replicas, so our current examples do not paint the full picture. Later on, when we start scraping the metrics with Prometheus, we’ll see that they are aggregated from all the replicas. 
Finally, you might have expected around thirty-three responses since we sent a hundred requests that were distributed across three replicas. However, the service continuously pings itself, so the final number was quite higher. Let’s get out of the `util` container and try to generate some requests that will end with errored responses. ``` `1` `exit` `2` `3` `for` i in `{``1`..100`}``;` `do` `4 ` curl `"http://``$(`docker-machine ip swarm-1`)``/demo/random-error"` `5` `done` ``` `````````` The `/demo/random-error` endpoint produces response code `500` in approximately ten percent of cases. The rest should be “normal” responses with status code `200`. The output should be similar to the one that follows. ``` `1` ... `2` Everything is still OK `3` Everything is still OK `4` Everything is still OK `5` Everything is still OK `6` ERROR: Something, somewhere, went wrong! `7` ... ``` ````````` Let’s see how do metrics look like now. ``` `1` docker container `exec` -it `$ID` sh `2` `3` curl `"http://go-demo_main:8080/metrics"` ``` ```````` The output limited to the relevant parts is as follows. ``` `1` ... `2` # HELP http_server_resp_time Request response time `3` # TYPE http_server_resp_time histogram `4` ... `5` http_server_resp_time_sum{code="200",method="GET",path="/demo/random-error",serv\ `6` ice="go-demo"} 0.001033751 `7` http_server_resp_time_count{code="200",method="GET",path="/demo/random-error",se\ `8` rvice="go-demo"} 32 `9` ... `10` http_server_resp_time_sum{code="500",method="GET",path="/demo/random-error",serv\ `11` ice="go-demo"} 7.033700000000001e-05 `12` http_server_resp_time_count{code="500",method="GET",path="/demo/random-error",se\ `13` rvice="go-demo"} 2 `14` ... ``` ``````` Since the response code is one of the labels, we got two metrics; one for the code `200`, and the other for `500`. Since those hundred requests were load balanced across three replicas, the one that produced this output got approximately one-third of them (32+2). We can see that the requests that produce errors take considerably longer time with the total of seven seconds for only two requests. You might have been “unlucky” and did not get a single response with the code `500`. If that was the case, feel free to send another hundred requests. Now that we confirmed that our response metrics are separated by different labels, we should explore quantiles. For that, we need to simulate queries with varying response times. Fortunately, `go-demo` has such an endpoint. ``` `1` `exit` `2` `3` `for` i in `{``1`..30`}``;` `do` `4 ` `DELAY``=`$`[` `$RANDOM` % `6000` `]` `5 ` curl `"http://``$(`docker-machine ip swarm-1`)``/demo/hello?delay=``$DELAY``"` `6` `done` ``` `````` When `delay` query parameter is set, `go-demo` goes to sleep for the specified number of milliseconds. We made thirty iterations. Each generated a random number between 0 and 6000 and sent that number as the `delay` parameter. As a result, the service should have received requests with a wide range of response times. Let’s take another look at the metrics. ``` `1` docker container `exec` -it `$ID` sh `2` `3` curl `"http://go-demo_main:8080/metrics"` ``` ````` The output, limited to relevant parts, is as follows. ``` `1` ... 
`2` # HELP http_server_resp_time Request response time `3` # TYPE http_server_resp_time histogram `4` http_server_resp_time_bucket{code="200",method="GET",path="/demo/hello",service=\ `5` "go-demo",le="0.005"} 78 `6` http_server_resp_time_bucket{code="200",method="GET",path="/demo/hello",service=\ `7` "go-demo",le="0.01"} 78 `8` http_server_resp_time_bucket{code="200",method="GET",path="/demo/hello",service=\ `9` "go-demo",le="0.025"} 78 `10` http_server_resp_time_bucket{code="200",method="GET",path="/demo/hello",service=\ `11` "go-demo",le="0.05"} 78 `12` http_server_resp_time_bucket{code="200",method="GET",path="/demo/hello",service=\ `13` "go-demo",le="0.1"} 78 `14` http_server_resp_time_bucket{code="200",method="GET",path="/demo/hello",service=\ `15` "go-demo",le="0.25"} 78 `16` http_server_resp_time_bucket{code="200",method="GET",path="/demo/hello",service=\ `17` "go-demo",le="0.5"} 79 `18` http_server_resp_time_bucket{code="200",method="GET",path="/demo/hello",service=\ `19` "go-demo",le="1"} 80 `20` http_server_resp_time_bucket{code="200",method="GET",path="/demo/hello",service=\ `21` "go-demo",le="2.5"} 83 `22` http_server_resp_time_bucket{code="200",method="GET",path="/demo/hello",service=\ `23` "go-demo",le="5"} 87 `24` http_server_resp_time_bucket{code="200",method="GET",path="/demo/hello",service=\ `25` "go-demo",le="10"} 88 `26` http_server_resp_time_bucket{code="200",method="GET",path="/demo/hello",service=\ `27` "go-demo",le="+Inf"} 88 `28` http_server_resp_time_sum{code="200",method="GET",path="/demo/hello",service="go\ `29` -demo"} 29.430902277 `30` http_server_resp_time_count{code="200",method="GET",path="/demo/hello",service="\ `31` go-demo"} 88 `32` ... ``` ```` Now we have the combination of the fast responses from before combined with those with a delay of up to six seconds. If we focus only on the last two lines, we can see that there are `88` responses in total with the summed time of `29.43` seconds. The average time of responses is around `0.33` seconds. That, in itself, does not give us enough information. Maybe two requests lasted for `10` seconds each, and all of the rest were lightning fast. Or, perhaps, all of the requests were below `0.5` seconds. We cannot know that by just looking at the sum of all response times and dividing them with the count. We need quantiles. The histogram used in `go-demo` did not specify buckets, so the quantiles are those defined by default. They range from as low as `0.005` to as high as `10` seconds. If you pay closer attention to the numbers beside each of those buckets, you’ll see that `78` requests were below `0.25` seconds, `79` below `0.5`, and so on all the way until all of the `88` requests being below `10` seconds. All the requests from a smaller bucket belong to the larger one. That might be confusing the first time we look at the metrics, but it makes perfect sense. A request that lasted less than, for example, `0.5` seconds, definitely lasted less than, `1` seconds, and so on. Using quantiles (or buckets) will be essential when we start defining Prometheus alerts based on those metrics, so we’ll postpone further discussion until we reach that part. As you can see, unlike counters and gauges, histograms go beyond simple additions and subtractions. They provide observations over a period. They track the number of observations and their summaries thus allowing us to calculate average values. The number of observations behaves like a counter. It can only be increased. The sum, on the other hand, is similar to a gauge. 
It can be both increased and decreased depending on the values we observe. If it is negative, the sum will decrease. We did not explore such an example since response times are always positive. The most common usage of histrograms is to record request durations and response times. We explored one of those two through our examples. How about summaries? They are the only metric type we did not explore. *Summary* is similar to *histogram* metric type. Both sample observations. The major difference is that summary calculates quantiles based on a sliding time frame. We won’t go deeper into summaries. Instead, please read the [Histograms And Summaries](https://prometheus.io/docs/practices/histograms/) page that explains both in more detail and provides a comparison of the two. ### What Now? We explored, through a few examples, how to instrument our services and provide more detailed metrics than what we would be able to do through exporters. Early in the book, I said that we should use exporters instead instrumentation unless they do not provide enough information. It turned out that they do not. If, for example, we used an exporter, we would get metrics based on requests coming through the proxy. We would not be aware of internal communication between services nor would we be able to obtain response times of certain parts of the services we’re deploying. Actually, [HAProxy Exporter](https://github.com/prometheus/haproxy_exporter) does not even provide response times since the internal metrics it exposes is not entirely compatible with Prometheus and cannot be exported without sacrificing accuracy. That does not mean that HAProxy metrics are not accurate but that they use a different logic. Instead of having a counter, HAProxy exposes response as exponentially decaying value. It cannot be transformed into a histogram. If you’re interested in the discussion about *HAProxy Exporter* response time, please visit [issue 37](https://github.com/prometheus/haproxy_exporter/issues/37). Without accurate response times, we cannot instruct our system to scale and de-scale them effectively. We need to obtain more information if we want to get closer to building a truly *self-adapting* system. While instrumentation we explored through examples is by no means all the instrumentation we should add, it does provide a step forward. Even though response times are not the only metric we’re missing, it is probably the most important one. Counting errors is useful as well but does not provide clear guidance. Some errors will need a different set of actions, and many cannot even be hooked into the system that auto-corrects itself. Generally speaking, errors often (but not always) require human intervention. Response times, on the other hand, are easy to grasp. They do provide clear guidance for the system. If it goes over a certain threshold within a predefined period, scale up. If it goes down, scale down. The next chapter will continue exploring response times. We’ll see what we can do with them in Prometheus and how we can improve our current alerts by incorporating this new data. And now we need a break. Take a rest, go to sleep, recharge your batteries. Before you do any of that, remember that your computer needs a rest too. Get out of the `util` container and remove the machines we created. 
```
exit

docker-machine rm -f swarm-1 swarm-2 swarm-3
```

第十六章:自适应应用于仪表化服务

仪表化的服务所提供的指标,比我们从 exporter 抓取到的更加详细。能够按需添加任何指标,为我们打开了 exporter 通常关闭的大门。这并不意味着 exporter 不再有用,而是说我们需要考虑所观察资源的性质。

硬件指标应从 exporter 中抓取。毕竟,我们无法对 CPU 进行仪表化。第三方服务是另一个很好的例子,在这些场景下,exporter 通常是更好的选择。如果我们使用数据库,我们应该寻找一个从中获取指标并将其转换为 Prometheus 友好格式的 exporter。代理、网关以及几乎所有其他非我们开发的服务也应如此。
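例如,以全局模式部署一个 Node Exporter 大致如下(网络名、挂载路径与参数按常见约定给出,仅作示意,具体以所用版本的文档为准):

```
docker service create --name node-exporter \
    --mode global \
    --network monitor \
    --mount type=bind,source=/proc,target=/host/proc,readonly \
    --mount type=bind,source=/sys,target=/host/sys,readonly \
    prom/node-exporter \
    --path.procfs /host/proc \
    --path.sysfs /host/sys
```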

如果我们已经投入了大量时间实现不符合 Prometheus 格式的指标,我们甚至可以选择为我们控制的服务编写一个 exporter

Exporter 只能帮助我们走一半的路。我们可以指示系统根据内存使用情况进行扩展。cAdvisor提供有关集群内运行的容器的信息,但它提供的指标过于通用,无法获取服务特定的数据。无法针对每个服务进行指标微调,导致我们只能使用基本的警报,信息不足。仪表化填补了这一缺失的拼图。

在我们愿意投入时间对服务进行仪表化的情况下,结果是令人印象深刻的。我们可以在不做妥协的情况下获得所需的一切。我们可以完成几乎任何细节层级的任务,并以一种能够编写可靠警报的方式对服务进行仪表化,这些警报将以系统所需的所有信息通知系统。结果是向自愈更近了一步,更重要的是,向自适应更近了一步。我将自适应归为“更重要”的原因在于,自愈问题已经大部分通过其他工具解决了。调度程序(例如 Docker Swarm)已经在自愈服务方面做得相当好。如果我们排除硬件范围,我们剩下的最大障碍是服务的自适应性。

设定目标

我们需要定义通过仪表化想要实现的目标范围。我们将通过限制自己只关注一个目标来保持目标的简洁。如果服务的响应时间超过上限,我们将扩大其规模;如果低于下限,我们将缩小其规模。任何其他警报都将导致通知发送到 Slack。并不意味着 Slack 通知应该永远存在。相反,它们应该被视为一种临时解决方案,直到我们找到将手动修正操作转化为由系统执行的自动响应的方法。

一个经常手动处理的警报的好例子是错误响应(状态码 500 及以上)。当它们在指定的时间段内达到某个阈值时,我们会发送警报。它们会导致 Slack 通知,并成为待处理任务交给人工。一个内部规则应该是先修复问题,评估问题发生的原因,并编写一个脚本来重复相同的步骤。有了这样的脚本,如果相同的警报再次触发,我们就可以指示系统执行相同的操作。通过这种方法,我们(人类)可以将时间花费在解决意外问题上,把机器交给那些反复出现的问题。

我们将尝试完成的目标总结如下。

  • 定义服务的最大响应时间并创建将其扩展的流程。

  • 定义服务的最小响应时间并创建将其缩减的流程。

  • 基于状态码 500 及以上的响应定义阈值,并发送 Slack 通知。

请注意,响应时间阈值不能仅依赖于毫秒。我们必须定义分位数、速率以及其他一些因素。此外,我们需要设置服务的最小和最大副本数量。否则,我们可能会面临无限扩展或缩减到零副本的风险。一旦我们开始实施系统,就可以查看这些额外的需求是否足够,或者是否需要进一步扩展范围。
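为了让这些要求更具体,这里先预览一个基于直方图分位桶与速率的告警条件草稿(指标沿用上一章仪表化时定义的 `http_server_resp_time`,阈值仅作示意),后面的动手部分会用到与之非常接近的表达式:

```
sum(rate(http_server_resp_time_bucket{job="go-demo_main", le="0.1"}[5m]))
  /
sum(rate(http_server_resp_time_count{job="go-demo_main"}[5m]))
  < 0.99
```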

说的够多了,接下来让我们进入实际的主题探索。

一如既往,第一步是创建一个集群并部署一些服务。

创建集群并部署服务

你知道该怎么做。我们将创建一个 Swarm 集群,并部署一些我们已经熟悉的栈。完成后,我们将拥有进行任务探索所需的基础。

```
chmod +x scripts/dm-swarm-12.sh

./scripts/dm-swarm-12.sh

eval $(docker-machine env swarm-1)

docker stack ls
```

We created the cluster and deployed three stacks. The output of the last command gives us the list of those stacks. ``` `1` NAME SERVICES `2` monitor 3 `3` proxy 2 `4` jenkins 2 ``` ``````````````````````````````````````````````````````````` Now we’re ready to explore how to scrape metrics from instrumented services. ### Scraping Metrics From Instrumented Services The [go-demo](https://github.com/vfarcic/go-demo) service is already instrumented with a few metrics. However, they do not mean much by themselves. Their usage starts only once Prometheus scrapes them. Even then, they provide only a visual representation and the ability to query them after we find a problem. The major role of graphs and the capacity to query metrics comes after we detect an issue and we want to drill deeper into it. But, before we get there, we need to set up alerts that will notify us that there is a problem. We cannot think about metrics before we have some data those metrics will evaluate. So, we’ll start from the beginning and explore how to let Prometheus know that metrics are coming from services we instrumented. That should be a relatively easy thing to accomplish since we already have all the tools and processes we need. Let’s start by deploying the `go-demo` stack. The `main` service inside it is already instrumented and provides the `/metrics` endpoint Prometheus can query. ``` `1` docker stack deploy `\` `2 ` -c stacks/go-demo-instrument.yml `\` `3 ` go-demo ``` `````````````````````````````````````````````````````````` We can use the time Swarm needs to initialize the replicas of the stack and explore the YAML file. The definition of the [go-demo-instrument.yml](https://github.com/vfarcic/docker-flow-monitor/blob/master/stacks/go-demo-instrument.yml) stack is as follows (limited to relevant parts). ``` `1` ... `2` main: `3` ... `4` networks: `5` ... `6` - monitor `7` deploy: `8` ... `9` labels: `10 ` - com.df.notify=true `11 ` ... `12 ` - com.df.scrapePort=8080 `13 ` ... ``` ````````````````````````````````````````````````````````` We used `com.df.notify=true` label to let *Docker Flow Swarm Listener* know that it should notify *Docker Flow Monitor*. The `com.df.scrapePort` is set to `8080` thus letting Prometheus know the port it should use to scrape metrics. The `monitor` network is added to the stack. Since *Docker Flow Monitor* is attached to the same network, they will be able to communicate internally by using service names. Let’s confirm that *Docker Flow Monitor* configured Prometheus correctly. ``` `1` open `"http://``$(`docker-machine ip swarm-1`)``/monitor/config"` ``` ```````````````````````````````````````````````````````` As you can see, `job_name` is set to the name of the service (`go-demo_main`). The `names` argument is set to `tasks.go-demo_main`. When service DNS is prefixed with `tasks.`, Overlay network returns IPs of all the replicas. That way, Prometheus will be able to scrape metrics from all those that form the `go-demo_main` service. ![Figure 12-1: Prometheus configuration with the go-demo_main job](https://github.com/OpenDocCN/freelearn-devops-pt5-zh/raw/master/docs/dop-22-tlkt/img/00062.jpeg) Figure 12-1: Prometheus configuration with the go-demo_main job Before we proceed, please confirm that all the replicas of the services that form the `go-demo` stack are up-and-running. 
``` `1` docker stack ps `\` `2 ` -f desired-state`=`Running go-demo ``` ``````````````````````````````````````````````````````` The output should show three replicas of `go-demo_main` and one replica of `go-demo_db` services with the current state `running`. If that’s not the case, please wait a while longer. The output should be similar to the one that follows (IDs are removed for brevity). ``` `1` NAME IMAGE NODE DESIRED STATE CURRENT STATE \ `2` ERROR PORTS `3` go-demo_main.1 vfarcic/go-demo:latest swarm-2 Running Running about a minu\ `4` te ago `5` go-demo_db.1 mongo:latest swarm-1 Running Running about a minu\ `6` te ago `7` go-demo_main.2 vfarcic/go-demo:latest swarm-2 Running Running about a minu\ `8` te ago `9` go-demo_main.3 vfarcic/go-demo:latest swarm-3 Running Running about a minu\ `10` te ago ``` `````````````````````````````````````````````````````` Now we can verify that all the targets (replicas) are indeed registered. ``` `1` open `"http://``$(`docker-machine ip swarm-1`)``/monitor/targets"` ``` ````````````````````````````````````````````````````` As you can see, Prometheus registered three targets that correspond to three replicas of the service. Now we know that it scrapes metrics from all of them and can explore different ways to query data. ![Figure 12-2: Prometheus targets that correspond with the replicas of the go-demo_main service](https://github.com/OpenDocCN/freelearn-devops-pt5-zh/raw/master/docs/dop-22-tlkt/img/00063.jpeg) Figure 12-2: Prometheus targets that correspond with the replicas of the go-demo_main service If we focus only on scraping instrumented services, the process can be described with a simple diagram from the figure 12-3. ![Figure 12-3: Prometheus scrapes metrics from services](https://github.com/OpenDocCN/freelearn-devops-pt5-zh/raw/master/docs/dop-22-tlkt/img/00064.jpeg) Figure 12-3: Prometheus scrapes metrics from services ### Querying Metrics From Instrumented Services Let’s open the Prometheus *graph* screen and explore different ways to query metrics scraped from the `go-demo_main` service. ``` `1` open `"http://``$(`docker-machine ip swarm-1`)``/monitor/graph"` ``` ```````````````````````````````````````````````````` Please click the *Graph* tab, enter the query that follows, and click the *Execute* button ``` `1` http_server_resp_time_sum / http_server_resp_time_count ``` ``````````````````````````````````````````````````` We divided the summary of response times with the count of the requests. You’ll notice that the output graph shows that the average value is close to zero. Feel free to hover over one of the lines and observe that the values are only a few milliseconds. The `go-demo_main` service pings itself periodically and the responses are very fast. We should generate some slower responses since the current result does not show metrics in their true glory. The `/demo/hello` endpoint of the service can be supplemented with the `delay` parameter. When set to a value in milliseconds, the service will wait for the given period before responding to the request. Since we want to demonstrate a variety of response times, we should send a few requests with different `delay` values. The commands that will send thirty requests with random delays are as follows. 
``` `1` `for` i in `{``1`..30`}``;` `do` `2 ` `DELAY``=`$`[` `$RANDOM` % `6000` `]` `3 ` curl `"http://``$(`docker-machine ip swarm-1`)``/demo/hello?delay=``$DELAY``"` `4` `done` ``` `````````````````````````````````````````````````` The delay of each of the thirty requests was set to a random value between zero and six thousand milliseconds. Now we should have more variable metrics we can explore. Please write the query that follows in the *Expression* field and click the *Execute* button. ``` `1` http_server_resp_time_sum / http_server_resp_time_count ``` ````````````````````````````````````````````````` This time you should see a graph with more differentiating response times. You should note that the response times you’re seeing are a combination of those we sent (with up to six seconds delay) and fast pings that the service executes periodically. The previous query is not enough, and we should add a few functions into the mix. We’ll add `rate`. It calculates per-second average rate of increase of the time series in the range vector. We’ll also limit the metrics to the last five minutes. While that does not make much of a difference when presenting data in a graph, it is crucial for defining alerts. Since they are our ultimate goal while we’re working with Prometheus, we should get used to such limits from the start. Please write the query that follows in the *Expression* field and click the *Execute* button. ``` `1` rate(http_server_resp_time_sum[5m]) / rate(http_server_resp_time_count[5m]) ``` ```````````````````````````````````````````````` Since we are trying to define queries we’ll use in alerts, we might want to limit the results only to a single service. The expression limited to `go-demo_main` service is as follows. ``` `1` rate(http_server_resp_time_sum{service="go-demo"}[5m]) / rate(http_server_resp_t\ `2` ime_count{service="go-demo"}[5m]) ``` ``````````````````````````````````````````````` The graph output of the last query should show a spike in response times of each of the three replicas. That spike corresponds to the thirty requests we created with the `delay` parameter. ![Figure 12-4: Prometheus graph with response duration spike](https://github.com/OpenDocCN/freelearn-devops-pt5-zh/raw/master/docs/dop-22-tlkt/img/00065.jpeg) Figure 12-4: Prometheus graph with response duration spike The `service` label comes from instrumentation. It is hard-coded to `go-demo` and, therefore, not very reliable. If we would be confident that there will be only one instance of that service, we could continue using that label as one of the filters. However, that might not be the case. Even though it is likely we’ll use only one instance of the `go-demo` service in production, we might still run the same service as part of testing or some other processes. That would produce incorrect query results since we would be combining different instances of the service and potentially get unreliable data. Instead, it might be a better idea to use the `job` label. The `job` label comes out-of-the-box with all metrics scraped by Prometheus. It corresponds to the `job_name` specified in the `scrape` section of the configuration. Since *Docker Flow Monitor* uses the “real” name of the service to register scraping target, it is always unique. That fixes one of our problems since we cannot have two services with the same name inside a single cluster. In our case, the full name of the service is the combination of the name of the stack and the name of the service defined in that YAML file. 
Since we deployed `go-demo` stack with the service `main`, the full name of the service is `go-demo_main`. If we’d like to see metrics of all the services that provide instrumentation with the metric name `http_server_resp_time`, the query would be as follows. ``` `1` sum(rate(http_server_resp_time_sum[5m])) by (job) / sum(rate(http_server_resp_ti\ `2` me_count[5m])) by (job) ``` `````````````````````````````````````````````` Since we used `sum` to summarize data `by` `job`, each line represents a different service. That is not so obvious from the current graph since we are scraping metrics from only one service, so you’ll need to trust me on this one. If we’d have metrics from multiple services, each would get its line in the graph. That must be it. Doesn’t it? We have an average response time for each job (service) as measured over last five minutes. Unfortunately, even though such expressions might be useful when watching graphs, they have little value when used for alerts. It might be even dangerous to instruct the system to do some corrective actions based on such an alert. Let’s say that we create an alert defined as follows. ``` `1` sum(rate(http_server_resp_time_sum[5m])) by (job) / sum(rate(http_server_resp_ti\ `2` me_count[5m])) by (job) > 0.1 ``` ````````````````````````````````````````````` It will fire if average response time is over one hundred milliseconds (0.1 seconds). Now let us imagine that nine out of ten responses are around ten milliseconds while one out of ten lasts for five hundred milliseconds (half a second). The above alert would not fire in such a scenario since the average is 59 milliseconds, which is still way below the 100 milliseconds alert threshold. As a result, we would never know that there is a problem experienced by ten percent of those who invoke this service. Rejection of the above mentioned alert definition might lead you to write something simpler. The new alert could be as follows. ``` `1` http_server_resp_time_sum > 0.25 ``` ```````````````````````````````````````````` If there is a request that lasted longer than the threshold, fire an alert. We even increased the threshold from `0.1` to `0.25` seconds. While I do like the simplicity of that alert, it is even worse than the one with average response time. It would be enough to have one request that passed the threshold to fire an alert and, potentially, initiate the process that would scale the number of replicas of that service. What if there were a million other responses that were way below that threshold. The alert would still fire and probably produce undesirable consequences. Do we really care that one out of million responses is slow? The problem is that we were focused on averages. While there is value in them, they derailed us from creating a query that we could use to create a useful alert. Instead, we should focus on percentages. A better goal would be to construct an expression that would give us the percentage of requests that are above the certain threshold. The new query is as follows. ``` `1` sum(rate(http_server_resp_time_bucket{le="0.1"}[5m])) by (job) / sum(rate(http_s\ `2` erver_resp_time_count[5m])) by (job) ``` ``````````````````````````````````````````` The first part of the expression returns summary of the number of requests that are in the `0.1` bucket. In other words, it retrieves all the requests that are equal to or faster than `0.1` second. Further on, we are dividing that result with the summary of all the requests. 
The result is the percentage of requests that are below the `0.1` seconds threshold.

![Figure 12-5: Prometheus graph with percentage of response times below 0.1 second threshold](https://github.com/OpenDocCN/freelearn-devops-pt5-zh/raw/master/docs/dop-22-tlkt/img/00066.jpeg)

Figure 12-5: Prometheus graph with percentage of response times below 0.1 second threshold

If you do not see a drop in the percentage of requests, the likely cause is that more than one hour passed since you executed the thirty requests with the delay. If that's the case, please rerun the commands that follow and, after that, re-execute the expression.

```
for i in {1..30}; do
    DELAY=$[ $RANDOM % 6000 ]
    curl "http://$(docker-machine ip swarm-1)/demo/hello?delay=$DELAY"
done
```

That was an expression worthy of an alert. We could use it to create something similar to a *service level agreement*. The alert would fire if, for example, less than 0.999 (99.9%) of the responses are below the defined time-based threshold. The only thing missing is to limit the output of the expression to the `go-demo_main` service.

```
sum(rate(http_server_resp_time_bucket{job="go-demo_main", le="0.1"}[5m])) / sum(rate(http_server_resp_time_count{job="go-demo_main"}[5m]))
```

Let's try to explore at least one more example. Among others, the `http_server_resp_time` metric has the `code` label that contains the status codes of the responses. We can use that information to define an expression that will retrieve the number of requests that produced an error. Since we are returning standard [HTTP response codes](https://en.wikipedia.org/wiki/List_of_HTTP_status_codes), we can filter the metrics so that only those with a `code` label that starts with `5` are retrieved.

Before we start filtering metrics in search of errors, we should generate some requests that do result in error responses.

```
for i in {1..100}; do
    curl "http://$(docker-machine ip swarm-1)/demo/random-error"
done
```

We sent a hundred requests to the `/demo/random-error` endpoint. Approximately one out of ten requests resulted in an error response. The expression that follows will retrieve the rate of error responses over the period of five minutes.

```
sum(rate(http_server_resp_time_count{code=~"^5..$"}[5m])) by (job)
```

The total number does not mean much unless you plan on sending an alert every time an error occurs. Such an action would likely result in too many alerts, you'd run the risk of developing a high tolerance, and you'd start ignoring them. That's not the way to go. Instead, we should use a similar approach as with response times. We'll calculate the error rate by dividing the number of errors by the total number of responses.

```
sum(rate(http_server_resp_time_count{code=~"^5..$"}[5m])) by (job) / sum(rate(http_server_resp_time_count[5m])) by (job)
```

That would be a useful alert that could be fired if the number is higher than some threshold.

![Figure 12-6: Prometheus graph with error rate percentage](https://github.com/OpenDocCN/freelearn-devops-pt5-zh/raw/master/docs/dop-22-tlkt/img/00067.jpeg)

Figure 12-6: Prometheus graph with error rate percentage

Now that we defined two sets of expressions, we can take a step further and convert them into alerts.
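Before we do that, it is worth knowing that expressions like these can be evaluated from a terminal as well, not only through the UI. The sketch that follows uses Prometheus' standard query API. The `/monitor` prefix is an assumption based on the path through which the proxy exposes Prometheus in our setup; adjust it if your configuration differs.

```
# Evaluate the error rate expression through the Prometheus query API.
# The /monitor prefix assumes the proxy path used throughout these chapters.
curl -G "http://$(docker-machine ip swarm-1)/monitor/api/v1/query" \
    --data-urlencode 'query=sum(rate(http_server_resp_time_count{code=~"^5..$"}[5m])) by (job) / sum(rate(http_server_resp_time_count[5m])) by (job)'
```

The response is JSON, which makes expressions like this one easy to reuse in scripts, not only in graphs and alerts.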
### Firing Alerts Based On Instrumented Metrics Now that we have a solid understanding of some of the expressions based on instrumented metrics, we can proceed and apply that knowledge to create a few alerts. Let us deploy updated version of the `go-demo` stack. ``` `1` docker stack deploy `\` `2 ` -c stacks/go-demo-instrument-alert.yml `\` `3 ` go-demo ``` ````````````````````````````````````` We’ll take a couple of moments to discuss the changes to the updated stack while waiting for the services to become operational. The stack definition, limited to relevant parts, is as follows. ``` `1` ... `2` main: `3` ... `4` deploy: `5` ... `6` labels: `7` ... `8` - com.df.alertName.1=mem_limit `9` - com.df.alertIf.1=@service_mem_limit:0.8 `10 ` - com.df.alertFor.1=5m `11 ` - com.df.alertName.2=resp_time `12 ` - com.df.alertIf.2=sum(rate(http_server_resp_time_bucket{job="go-demo_ma\ `13` in", le="0.1"}[5m])) / sum(rate(http_server_resp_time_count{job="go-demo_main"}[\ `14` 5m])) < 0.99 `15 ` - com.df.alertLabels.2=scale=up,service=go-demo_main `16 ` - com.df.scrapePort=8080 `17 ` - com.df.scaleMin=2 `18 ` - com.df.scaleMax=4 `19 ` ... ``` ```````````````````````````````````` The `com.df.alertName` label was present in the previous stack. However, since specifying memory limit is not enough anymore, we added an index suffix (`.1`) that allows us to specify multiple alerts. The same `.1` suffix was added to the rest of labels that form that alert. The second alert will fire if the number of the responses in the `0.1` bucket (equal to or below 100 milliseconds) is smaller than 99% of all the requests. The rate is measured over the period of five minutes and the results are restricted to the job `go-demo_main`. The `if` statement we used is as follows. ``` `1` sum(rate(http_server_resp_time_bucket{job="go-demo_main", le="0.1"}[5m])) / sum(\ `2` rate(http_server_resp_time_count{job="go-demo_main"}[5m])) < 0.99 ``` ``````````````````````````````````` Since we are measuring the percentage of requests, there’s no real need to set `for` statement. As soon as more than one percent of requests result in response times over 100 milliseconds, the alert will be fired. Later on we’ll discuss what should be done with such an alert. For the moment, we’ll limit the scope and let Alertmanager forward all alerts to Slack. We also added `scale=up` and `service=go-demo_main` alert labels. Later on, the `scale` label will help the system know whether it should scale up or down. Finally, we used `com.df.scaleMin` and `com.df.scaleMax` labels to specify the minimum and the maximum number of replicas allowed for this service. We won’t use those labels just yet. Just remember that they are defined. Next, we’ll repeat the commands that will create slow responses and verify that the alerts are indeed fired. But, before we do that, we’ll open the Prometheus’ alert screen and confirm that the new alert is indeed registered. ``` `1` open `"http://``$(`docker-machine ip swarm-1`)``/monitor/alerts"` ``` `````````````````````````````````` The *godemo_main_resp_time* row should be green meaning that the alert is registered but that the condition is not met. In other words, at least 99% of responses were generated in 100 milliseconds or less. Now we can truly test the alert. Let’s generate some slow responses. 
``` `1` `for` i in `{``1`..30`}``;` `do` `2 ` `DELAY``=`$`[` `$RANDOM` % `6000` `]` `3 ` curl `"http://``$(`docker-machine ip swarm-1`)``/demo/hello?delay=``$DELAY``"` `4` `done` ``` ````````````````````````````````` You already executed those commands at least once so there should be no reason to explain what happened. Instead, we’ll go back to the *alerts* screen and confirm that the alert is indeed firing. ``` `1` open `"http://``$(`docker-machine ip swarm-1`)``/monitor/alerts"` ``` ```````````````````````````````` The *godemo_main_resp_time* should be red. If it isn’t, please wait a few moments and refresh the screen. Feel free to click it if you’d like to see the definition of the alert. ![Figure 12-7: Prometheus alert in firing state](https://github.com/OpenDocCN/freelearn-devops-pt5-zh/raw/master/docs/dop-22-tlkt/img/00068.jpeg) Figure 12-7: Prometheus alert in firing state Please visit the *#df-monitor-tests* channel inside [devops20.slack.com](https://devops20.slack.com/). You should see the *[FIRING] go-demo_main service is in danger!* message. The process, so far, can be described through the diagram in figure 12-8. ![Figure 12-8: Alerts that result in Slack notifications](https://github.com/OpenDocCN/freelearn-devops-pt5-zh/raw/master/docs/dop-22-tlkt/img/00069.jpeg) Figure 12-8: Alerts that result in Slack notifications That worked out quite well. When slow responses start piling up, we’ll get a Slack notification letting us know that we should enter the cluster and scale the service. The only problem is that we should not waste our time with such operations. We should let the system scale up automatically. ### Scaling Services Automatically With the alerts firing from Prometheus into Alertmanager, the only thing left to do is to send requests to Jenkins to scale the service. We already created a similar Alertmanager config in one of the previous chapters, so we’ll comment only on a few minor differences. The configuration is injected into the `alert-manager` service as a Docker secret. Since secrets are immutable, we cannot update the one that is currently used. Instead, we’ll have to remove the stack and the secret and create them again. ``` `1` docker stack rm monitor `2` `3` docker secret rm alert_manager_config ``` ``````````````````````````````` Now we can create a new secret with the updated Alertmanager configuration. 
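If you'd like to confirm that the old secret is indeed gone before creating the new one, a quick listing does the job. Keep in mind that `docker secret inspect` shows only metadata; Docker never reveals the contents of a secret, which is part of the reason we recreate it instead of editing it.

```
# The alert_manager_config secret should no longer appear in the list.
docker secret ls

# Once the new secret is created, inspect shows its metadata (never the payload).
# docker secret inspect alert_manager_config
```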
``` `1` `echo` `"route:` `2`` group_by: [service,scale]` `3`` repeat_interval: 5m` `4`` group_interval: 5m` `5`` receiver: 'slack'` `6`` routes:` `7`` - match:` `8`` service: 'go-demo_main'` `9`` scale: 'up'` `10 `` receiver: 'jenkins-go-demo_main-up'` `11` `12` `receivers:` `13 `` - name: 'slack'` `14 `` slack_configs:` `15 `` - send_resolved: true` `16 `` title: '[{{ .Status | toUpper }}] {{ .GroupLabels.service }} service is \` `17` `in danger!'` `18 `` title_link: 'http://``$(`docker-machine ip swarm-1`)``/monitor/alerts'` `19 `` text: '{{ .CommonAnnotations.summary}}'` `20 `` api_url: 'https://hooks.slack.com/services/T308SC7HD/B59ER97SS/S0KvvyStV\` `21` `nIt3ZWpIaLnqLCu'` `22 `` - name: 'jenkins-go-demo_main-up'` `23 `` webhook_configs:` `24 `` - send_resolved: false` `25 `` url: 'http://``$(`docker-machine ip swarm-1`)``/jenkins/job/service-scale/buil\` `26` `dWithParameters?token=DevOps22&service=go-demo_main&scale=1'` `27` `"` `|` docker secret create alert_manager_config - ``` `````````````````````````````` Remember that the gist with all the commands from this chapter is available from [12-alert-instrumentation.sh](https://gist.github.com/vfarcic/8bafbe912f277491eb2ce6f9d29039f9). Use it to copy and paste the command if you got tired of typing. The difference, when compared with the similar configuration we used before, is the `scale` label and a subtle change in the Jenkins receiver name. This time we are not grouping routes based only on `service` but with the combination of the labels `service` and `scale`. Even though we are, at the moment, focused only on scaling up, soon we’ll try to add another alert that will de-scale the number of replicas. While we would accomplish the current objective without the `scale` label, it might be a good idea to be prepared for what’s coming next. This time, the `match` section uses a combination of both `service` and `scale` labels. If they are set to `go-demo_main` and `up`, the alert will be forwarded to the `jenkins-go-demo_main-up` receiver. Any other combination will be sent to Slack. The `jenkins-go-demo_main-up` receiver is triggering a build of the Jenkins job `service-scale` with a few parameters. It contains the authentication token, the name of the service that should be scaled, and the increment in the number replicas. The `repeat_interval` is set to five minutes. Alertmanager will send a new notification every five minutes (plus the `group_interval`) unless the problem is fixed and Prometheus stops firing alerts. That is almost certainly not the value you should use in production. One hour (`1h`) is a much more reasonable period. However, I’d like to avoid making you wait for too long so, in this case, it’s set to five minutes (`5m`). Let us deploy the stack with the new secret. ``` `1` `DOMAIN``=``$(`docker-machine ip swarm-1`)` `\` `2 ` docker stack deploy `\` `3 ` -c stacks/docker-flow-monitor-slack.yml `\` `4 ` monitor ``` ````````````````````````````` There’s only one thing missing before we see the alert in its full glory. We need to run the Jenkins job manually. The first build will fail due to a bug we already experienced in one of the previous chapters. Please open the `service-scale` activity screen. ``` `1` open `"http://``$(`docker-machine ip swarm-1`)``/jenkins/blue/organizations/jenkins/ser\` `2` `vice-scale/activity"` ``` ```````````````````````````` You’ll have to login with `admin` as both username and password. Afterward, click the *Run* button and observe the failure. 
The issue is that Jenkins was not aware that the job uses a few parameters. After the first run, it’ll get that information, and the job should not fail again. If it does, it’ll be for a different reason. The `go-demo_main` service should have three replicas. Let’s double-check that. ``` `1` docker stack ps `\` `2 ` -f desired-state`=`Running go-demo ``` ``````````````````````````` The output should be similar to the one that follows (ID are removed for brevity). ``` `1` NAME IMAGE NODE DESIRED STATE CURRENT STATE \ `2` ERROR PORTS `3` go-demo_main.1 vfarcic/go-demo:latest swarm-1 Running Running 42 minutes a\ `4` go `5` go-demo_db.1 mongo:latest swarm-3 Running Running 42 minutes a\ `6` go `7` go-demo_main.2 vfarcic/go-demo:latest swarm-3 Running Running 42 minutes a\ `8` go `9` go-demo_main.3 vfarcic/go-demo:latest swarm-1 Running Running 42 minutes a\ `10` go ``` `````````````````````````` Before we proceed, please make sure that all replicas of the `monitor` stack are up and running. You can use `docker stack ps monitor` command to check the status. Now we can send requests that will produce delayed responses and open the Prometheus *alerts* screen. ``` `1` `for` i in `{``1`..30`}``;` `do` `2 ` `DELAY``=`$`[` `$RANDOM` % `6000` `]` `3 ` curl `"http://``$(`docker-machine ip swarm-1`)``/demo/hello?delay=``$DELAY``"` `4` `done` `5` `6` open `"http://``$(`docker-machine ip swarm-1`)``/monitor/alerts"` ``` ````````````````````````` The *godemo_main_resp_time* alert should be red. If it is not, please wait a few moments and refresh the screen. Prometheus fired the alert to Alertmanager which, in turn, notified Jenkins. As a result, we should see a new build of the `service-scale` job. ``` `1` open `"http://``$(`docker-machine ip swarm-1`)``/jenkins/blue/organizations/jenkins/ser\` `2` `vice-scale/activity"` ``` ```````````````````````` Please click on the latest build. It should be green with the output of the last task set to `go-demo_main` was scaled from 3 to 4 replicas. ![Figure 12-9: A build of a Jenkins job that scales Docker services](https://github.com/OpenDocCN/freelearn-devops-pt5-zh/raw/master/docs/dop-22-tlkt/img/00070.jpeg) Figure 12-9: A build of a Jenkins job that scales Docker services We should confirm that Jenkins indeed did the work it was supposed to do. The number of replicas of the `go-demo_main` service should be four. ``` `1` docker stack ps `\` `2 ` -f desired-state`=`Running go-demo ``` ``````````````````````` The output of the `stack ps` command is as follows (IDs are removed for brevity). ``` `1` NAME IMAGE NODE DESIRED STATE CURRENT STATE \ `2` ERROR PORTS `3` go-demo_main.1 vfarcic/go-demo:latest swarm-1 Running Running about an hou\ `4` r ago `5` go-demo_db.1 mongo:latest swarm-2 Running Running about an hou\ `6` r ago `7` go-demo_main.2 vfarcic/go-demo:latest swarm-2 Running Running about an hou\ `8` r ago `9` go-demo_main.3 vfarcic/go-demo:latest swarm-3 Running Running about an hou\ `10` r ago `11` go-demo_main.4 vfarcic/go-demo:latest swarm-3 Running Running 2 minutes ago ``` `````````````````````` Since we stopped simulating slow responses, the alert in Prometheus should turn into green. Otherwise, if Prometheus would continue firing the alert, Alertmanager would send another notification to Jenkins ten minutes later. Since the service has the `com.df.scaleMax` label set to four, Jenkins job would not scale the service. Instead, it would send a notification to Slack so that we (humans) can deal with the problem. 
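If you ever want to test the Jenkins part of the pipeline in isolation, you can invoke the same webhook Alertmanager uses. The sketch below simply reuses the `buildWithParameters` URL from the receiver configuration; consider it a debugging aid rather than a step of the process.

```
# Trigger the service-scale job directly, just as the Alertmanager receiver does.
# Jenkins should queue a new build that scales go-demo_main by one replica.
curl -X POST "http://$(docker-machine ip swarm-1)/jenkins/job/service-scale/buildWithParameters?token=DevOps22&service=go-demo_main&scale=1"
```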
Let’s remove the stack and the secret and work on Alertmanager configuration that will also de-scale services. ``` `1` docker stack rm monitor `2` `3` docker secret rm alert_manager_config ``` ````````````````````` The command that creates a new secret is as follows. ``` `1` `echo` `"route:` `2`` group_by: [service,scale]` `3`` repeat_interval: 5m` `4`` group_interval: 5m` `5`` receiver: 'slack'` `6`` routes:` `7`` - match:` `8`` service: 'go-demo_main'` `9`` scale: 'up'` `10 `` receiver: 'jenkins-go-demo_main-up'` `11 `` - match:` `12 `` service: 'go-demo_main'` `13 `` scale: 'down'` `14 `` receiver: 'jenkins-go-demo_main-down'` `15` `16` `receivers:` `17 `` - name: 'slack'` `18 `` slack_configs:` `19 `` - send_resolved: true` `20 `` title: '[{{ .Status | toUpper }}] {{ .GroupLabels.service }} service is \` `21` `in danger!'` `22 `` title_link: 'http://``$(`docker-machine ip swarm-1`)``/monitor/alerts'` `23 `` text: '{{ .CommonAnnotations.summary}}'` `24 `` api_url: 'https://hooks.slack.com/services/T308SC7HD/B59ER97SS/S0KvvyStV\` `25` `nIt3ZWpIaLnqLCu'` `26 `` - name: 'jenkins-go-demo_main-up'` `27 `` webhook_configs:` `28 `` - send_resolved: false` `29 `` url: 'http://``$(`docker-machine ip swarm-1`)``/jenkins/job/service-scale/buil\` `30` `dWithParameters?token=DevOps22&service=go-demo_main&scale=1'` `31 `` - name: 'jenkins-go-demo_main-down'` `32 `` webhook_configs:` `33 `` - send_resolved: false` `34 `` url: 'http://``$(`docker-machine ip swarm-1`)``/jenkins/job/service-scale/buil\` `35` `dWithParameters?token=DevOps22&service=go-demo_main&scale=-1'` `36` `"` `|` docker secret create alert_manager_config - ``` ```````````````````` We added an additional route and a receiver. Both are very similar to their counterparts in charge of scaling up. The only substantial difference is that the route match now looks for `scale` label with the value `down` and that a Jenkins build is invoked with `scale` parameter set to `-1`. As I mentioned earlier in one of the previous chapters, it is unfortunate that we need to produce so much duplication. But, since webhook `url` cannot be parametrized, we need to hard-code each combination. I would encourage you, dear reader, to contribute to Alertmanager project by adding Jenkins receiver. Until then, repetition of similar configuration entries is unavoidable. Let us deploy the `monitor` stack with the new configuration injected as a Docker secret. ``` `1` `DOMAIN``=``$(`docker-machine ip swarm-1`)` `\` `2 ` docker stack deploy `\` `3 ` -c stacks/docker-flow-monitor-slack.yml `\` `4 ` monitor ``` ``````````````````` Please wait until the `monitor` stack is up-and-running. You can check the status of its services with `docker stack ps monitor` command. While we’re into creating services, we’ll deploy a new definition of the `go-demo` stack as well. ``` `1` docker stack deploy `\` `2 ` -c stacks/go-demo-instrument-alert-2.yml `\` `3 ` go-demo ``` `````````````````` The new definition of the stack, limited to relevant parts, is as follows. ``` `1` ... `2` main: `3` ... `4` deploy: `5` ... `6` labels: `7` ... `8` - com.df.alertName.3=resp_time_below `9` - com.df.alertIf.3=sum(rate(http_server_resp_time_bucket{job="my-service\ `10` ", le="0.025"}[5m])) / sum(rate(http_server_resp_time_count{job="my-service"}[5m\ `11` ])) > 0.75 `12 ` - com.df.alertLabels.3=scale=down,service=go-demo_main `13 ` ... ``` ````````````````` We added a new set of labels that define the alert that will send a notification that the service should be scaled down. 
The expression of the alert uses similar logic as the one we're using to scale up. It calculates the percentage of responses that were created in twenty-five milliseconds or less. If the result is over 75 percent, the system has more replicas than it needs, so it should be scaled down. Since `go-demo` produces internal pings that are very fast, there's no need to create fake responses. The alert will fire soon.

If you doubt the new alert, we can visit the Prometheus *alerts* screen.

```
open "http://$(docker-machine ip swarm-1)/monitor/alerts"
```

The *godemo_main_resp_time_below* alert should be red. Similarly, we can visit the Jenkins *service-scale* job and confirm that a new build was executed.

```
open "http://$(docker-machine ip swarm-1)/jenkins/blue/organizations/jenkins/service-scale/activity"
```

The output of the last step says that *go-demo_main was scaled from 3 to 2 replicas*. That might sound confusing since the previous build scaled it to four replicas. However, we re-deployed the `go-demo` stack which, among other things, specifies that the number of replicas should be three. That leads us to an important note.

![Figure 12-10: A build of a Jenkins job that scales Docker services](https://github.com/OpenDocCN/freelearn-devops-pt5-zh/raw/master/docs/dop-22-tlkt/img/00071.jpeg)

Figure 12-10: A build of a Jenkins job that scales Docker services

Prometheus will continue firing alerts because the service is still responding faster than the defined lower limit. Since Alertmanager has both the `repeat_interval` and the `group_interval` set to five minutes, it will ignore the alerts until ten minutes expire. For more information about the `repeat_interval` and `group_interval` options, please visit the [route](https://prometheus.io/docs/alerting/configuration/#route) section of the Alertmanager configuration. Once more than ten minutes pass, it will send a build request to Jenkins. This time, since the service is already using the minimum number of replicas, Jenkins will decide not to continue de-scaling and will send a notification message to Slack instead.

Please visit the *#df-monitor-tests* channel inside [devops20.slack.com](https://devops20.slack.com/). Wait for a few minutes, and you should see a Slack notification stating that *go-demo_main could not be scaled*.

Specifying long `alertIf` labels can be daunting and error prone. Fortunately, *Docker Flow Monitor* provides shortcuts for the expressions we used. Let's deploy the `go-demo` stack one last time.

```
docker stack deploy \
    -c stacks/go-demo-instrument-alert-short.yml \
    go-demo
```

The definition of the stack, limited to relevant parts, is as follows.

```
...
  main:
    ...
    deploy:
      ...
      labels:
        ...
        - com.df.alertIf.1=@service_mem_limit:0.8
        ...
        - com.df.alertIf.2=@resp_time_above:0.1,5m,0.99
        ...
        - com.df.alertIf.3=@resp_time_below:0.025,5m,0.75
        ...
```

This time we used shortcuts for all three alerts. `@resp_time_above:0.1,5m,0.99` was expanded into the expression that follows.

```
sum(rate(http_server_resp_time_bucket{job="my-service", le="0.1"}[5m])) / sum(rate(http_server_resp_time_count{job="my-service"}[5m])) < 0.99
```

Similarly, `@resp_time_below:0.025,5m,0.75` became the following expression.

```
sum(rate(http_server_resp_time_bucket{job="my-service", le="0.025"}[5m])) / sum(rate(http_server_resp_time_count{job="my-service"}[5m])) > 0.75
```

Feel free to confirm that the alerts were correctly configured in Prometheus. They should be the same as they were before since the shortcuts expand to the same full expressions we deployed previously.

We managed to create a system that scales services depending on thresholds based on response times. It is entirely automated except when the service is already running the minimum or the maximum number of replicas. In those cases scaling probably does not help, and humans need to find out what unexpected circumstance generated the alerts. We started with the expected and created a fallback for when the unexpected happens. Next, we'll explore the situation when we start from the unexpected.

### Sending Error Notifications To Slack

Errors inside our code usually fall into two groups. There are those we throw to the caller function because we do not yet know how to handle them properly, or because we are too lazy to implement proper recuperation from such a failure. For example, we might implement a function that reads files from a directory and returns an error if that fails. In such a case we might want to get a notification when the error occurs and do something to fix it. After evaluating the problem, we might find out that the directory we're reading does not exist. Apart from the obvious fix of creating the missing directory (the immediate response), we should probably modify our code so that the directory is created as a result of receiving such an error. Even better, we should probably extend our code to check whether the directory exists before reading the files from it. Errors such as those fall into the "I did not expect it the first time it happened, but it will not happen again" type of situation. There's nothing we would need to do outside the service. The solution depends entirely on code modifications.

Another common type of error is related to problems communicating with other services. For example, we might get a notification that there was an error establishing communication with a database. The first action should be to fix the problem (restore the DB connection). After that, we should create another set of alerts that will monitor the database itself and execute some corrective steps to fix the issue. Those metrics should not come from the service that communicates with the database but from an exporter specialized in that particular database. If such an exporter does not exist, the database probably has some metrics that could be transformed into the Prometheus format. The alternative approach would be to ping the database periodically. We might be able to use the [Blackbox exporter](https://github.com/prometheus/blackbox_exporter) for that. If none of those options is applicable to your database, you might need to evaluate whether it is worth using something that has no exporter, no metrics, and cannot even be pinged.

The examples we explored do not warrant the effort of making anything more complicated than a simple notification to Slack. Self-adaptation does not apply, and self-healing depends on your code more than anything else. Let us take a look at a few examples. We'll continue using metrics coming from the instrumentation added to the `go-demo` service.
In the previous chapter you already saw the metrics with errors based on response codes, so let's deploy the updated stack with the labels that will add a new alert to *Docker Flow Monitor*.

```
docker stack deploy \
    -c stacks/go-demo-instrument-alert-error.yml \
    go-demo
```

The definition of the stack, limited to relevant parts, is as follows.

```
...
  main:
    ...
    deploy:
      ...
      labels:
        ...
        # - com.df.alertName.3=resp_time_below
        # - com.df.alertIf.3=@resp_time_below:0.025,5m,0.75
        - com.df.alertName.3=error_rate
        - com.df.alertIf.3=sum(rate(http_server_resp_time_count{job="go-demo_main", code=~"^5..$$"}[5m])) / sum(rate(http_server_resp_time_count{job="go-demo_main"}[5m])) > 0.001
        - com.df.alertLabels.3=service=go-demo_main,type=errors
        - com.df.alertAnnotations.3=summary=Error rate is too high,description=Do something or start panicking
        ...
```

We commented out the `resp_time_below` alert because it would only create noise. Without "real" traffic, responses are too fast, and Prometheus would continuously fire alerts that would only derail us from the task at hand.

The new alert is as follows.

```
sum(rate(http_server_resp_time_count{job="go-demo_main", code=~"^5..$$"}[5m])) / sum(rate(http_server_resp_time_count{job="go-demo_main"}[5m])) > 0.001
```

It takes the sum of all response time counts filtered by response codes that start with `5`. That covers all server-side errors. That number is divided by the count of all responses, thus giving us the percentage of requests that failed due to server errors. The alert will fire if the result is greater than `0.001` (0.1%). In other words, Prometheus will fire the alert if more than 0.1% of responses result in server-side errors.

If you tried to write a similar alert, your first thought might have been to send an alert whenever there is an error. Don't do that! It would probably result in spam since errors are an unavoidable part of what we do. The real goal is not to be notified every time the system produces an error, but when the rate of errors surpasses some threshold. Those thresholds will differ from one service to another. In our case, it is set to 0.1%. That does not mean that alerts should never fire after a single error. In some cases, they should. But this is not one of those. The service will produce errors, and we want to know when there are too many of them, so that we can discard a temporary problem or something that happened only once and will not repeat.

One thing you should note is that, inside a stack YML definition, dollar signs (`$`) need to be escaped with another dollar. For that reason, the part of the regular expression that should be `^5..$` is defined as `^5..$$`.

Let us open the Prometheus *alerts* screen and confirm that the new alert is indeed registered.

```
open "http://$(docker-machine ip swarm-1)/monitor/alerts"
```

Among others, there should be the *godemo_main_error_rate* alert marked as green, thus indicating that the server error rate is below 0.1%.

We do not need to change the Alertmanager configuration. It is already configured to send all alerts to Slack unless they match one of the routes. Let us generate a few responses with fake errors and see whether the system works.
```
for i in {1..100}; do
    curl "http://$(docker-machine ip swarm-1)/demo/random-error"
done
```

Around ten out of those hundred requests should result in errored responses. That's just enough to confirm that the alerts work as expected. Let's go back to the alerts screen.

```
open "http://$(docker-machine ip swarm-1)/monitor/alerts"
```

This time *godemo_main_error_rate* should be red, indicating that Prometheus fired it to Alertmanager which, in turn, sent a Slack notification. Please visit the *#df-monitor-tests* channel inside [devops20.slack.com](https://devops20.slack.com/) and confirm that the notification was indeed sent. The message should say *[FIRING] go-demo_main service is in danger!*.

If this weren't a simulation, your immediate action should be to go back to Prometheus and query the metrics in search of the cause of the problem. The expression could be as follows.

```
sum(rate(http_server_resp_time_count{job="go-demo_main", code=~"^5..$"}[5m])) by (path)
```

By grouping data by path, we can discover which path generated those errors and relate it to the code that is in charge of it. That would get us a step closer to discovering the cause of the problem. The rest would greatly differ from one case to another. Maybe we'd need to consult logs, make more queries in Prometheus, find out which node is causing the problem, correlate the data with network information, and so on and so forth. No one can give you an exact set of steps that should be followed in all cases. The most important part is that you know that there is an error, which service generated it, and what the path of the requests behind it is. You're on your own for the rest.

All the metrics we used so far have shortcuts that let us write a more concise stack definition. The error rate is no exception. An example can be found in the stack `stacks/go-demo-instrument-alert-short-2.yml`. The definition, limited to relevant parts, is as follows.

```
...
  main:
    ...
    deploy:
      ...
      labels:
        ...
        - com.df.alertName.3=errorate
        - com.df.alertIf.3=@resp_time_server_error:5m,0.001
        ...
```

The system we built in this chapter can be described through the diagram in figure 12-11.

![Figure 12-11: Alerts fired to Jenkins and Slack](https://github.com/OpenDocCN/freelearn-devops-pt5-zh/raw/master/docs/dop-22-tlkt/img/00072.jpeg)

Figure 12-11: Alerts fired to Jenkins and Slack

### What Now?

Did we finish with self-adaptation applied to services? We're not even close. The exercises we went through should give you only the base you'll need to extend and adapt to your specific use-cases. I hope that now you know what to do. What you need now, more than anything else, is time. Instrument your services, start scraping metrics, create alerts based both on exporters and instrumentation, start receiving notifications, and observe the patterns. The system should start small and grow organically. Do not create a project out of the lessons you learned but adopt continuous improvement. Every alert is not only a potential problem but also an opportunity to make the system better and more robust.

Figure 12-12 provides a high-level overview of the system we built so far. Use it as a way to refresh your memory of everything we have learned by now.
![Figure 12-12: Self-healing and self-adapting system (so far)](https://github.com/OpenDocCN/freelearn-devops-pt5-zh/raw/master/docs/dop-22-tlkt/img/00073.jpeg)

Figure 12-12: Self-healing and self-adapting system (so far)

The time has come to take another rest. Destroy the cluster and recharge your batteries. There's still a lot left to do.

```
docker-machine rm \
    -f swarm-1 swarm-2 swarm-3
```

Chapter 17: Setting Up A Production Cluster

We explored quite a few techniques, processes, and tools that can help us build a self-sufficient system applied to services. Docker Swarm provides self-healing, and we created our own self-adaptation. By now, we should be reasonably confident in our services, so it is time to explore how to apply similar objectives to infrastructure.

The system should be able to recreate failed nodes, upgrade them without downtime, and scale servers to meet fluctuating demand. We cannot explore those subjects with a cluster running locally on Docker Machine nodes. Our laptops have limited capacity, so we cannot scale nodes to any significant number. Even if we could, the infrastructure behind a production cluster is entirely different. We need an API that allows our system to communicate with the infrastructure. On top of that, the services we used so far gave us no opportunity to explore persistent storage. Those few examples are only the tip of the iceberg of what we need, and we won't go into the details just yet. For now, we'll try to create a production-ready cluster so that we can continue moving towards a self-sufficient system.

The immediate goal is to transition from a Swarm cluster running locally on Docker Machine nodes to a more reliable environment. We have to move to the cloud.

There are far too many hosting vendors to choose from, and explaining the process for each of them would be close to impossible. Even if we focused only on the very popular ones, there would still be at least ten vendors to cover. That would push the book way beyond a manageable size, so we'll pick a single hosting vendor and use it to demonstrate the setup of a production cluster. One had to be chosen, and AWS is the most commonly used hosting vendor.

Depending on the vendor you're using at the moment, you might be delighted or deeply unhappy with that choice. If you prefer Microsoft Azure, you'll find that you can follow the same steps we'll explore for AWS. You might prefer Google Compute Engine (GCE), Digital Ocean, OpenStack on-premise, or any one of the thousands of other solutions and vendors. I'll do my best to explain the logic behind the setup we'll do in AWS. Hopefully, you'll be able to apply the same logic to your own infrastructure. I'll try to make it clear what you should do, and I expect you to roll up your sleeves and do it yourself. I'll provide a blueprint, and you'll do the work.

You might be tempted to translate the exercises that follow directly to your hosting solution of choice. Don't! If you do not have an account already, please create one with Amazon Web Services (AWS) and follow the instructions. That way you should get a clear picture of what can be done and of the path you should take. Only after you finish reading the book should you try to apply the experience to your own infrastructure. Along the way, I'll do my best to explain everything we do in AWS in a way that lets the same principles be translated to any other choice. And I'll do my best to keep the AWS costs to a bare minimum.

With all that said, let's get to the hands-on part of the chapter and create a Docker Swarm cluster. Once the cluster is up, we'll deploy all the services we used so far. Finally, we'll discuss which services might still be missing and what changes we should make to our stacks to make them production-ready. Let's get going!

### Creating A Docker For AWS Cluster

In *The DevOps 2.1 Toolkit: Docker Swarm*, I argued that the best way to create a Swarm cluster in AWS is with a combination of *Packer* and *Terraform*. Another option was to use *Docker CE for AWS*. Back then, Docker for AWS was not mature enough. Today, the situation is different. Docker for AWS provides a robust Docker Swarm cluster with almost all the services we'd expect from it.

We'll create a *Docker for AWS* cluster and discuss some of its aspects along the way.

Before we start creating the cluster, we need to pick a region. What truly matters is whether the region you choose supports at least three availability zones. If it has only one, we risk downtime as soon as that zone becomes unavailable. With two, we would lose the quorum of Docker managers as soon as one of the zones fails. Just as we should always run an odd number of Docker managers, we should spread the cluster across an odd number of availability zones. Three is a good number that fits most scenarios.

If you are new to AWS, a short note about availability zones (AZs). An availability zone is an isolated location within a region. Each region consists of one or more availability zones. Each AZ is isolated, but the AZs within a region are connected through low-latency links. The isolation between AZs provides high availability. A cluster spread across multiple AZs keeps operating even if a whole AZ fails. Since AZs within the same region are connected through low-latency links, there is no performance penalty. All in all, we should always run a cluster across multiple AZs within the same region.

Let's check whether your favorite AWS region has at least three availability zones. Please open the *EC2* screen from the *AWS console*. You'll notice that one of the regions is selected in the top-right corner of the screen. If it is not the location you'd like to use for your cluster, click it and change it.

Scroll down to the *Service Health* section, and you'll find the *Availability Zone Status*. If at least three availability zones are listed, the region you chose is fine. Otherwise, please change the region and check again whether it has at least three availability zones.

Figure 13-1: The list of availability zones supported in the US East region
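If you have the AWS CLI installed and configured, you can perform the same check from a terminal. This is optional; the console route described above works just as well.

```
# List the availability zones of the chosen region (replace us-east-1 with yours).
# Assumes the AWS CLI is installed and credentials are configured.
aws ec2 describe-availability-zones \
    --region us-east-1 \
    --query 'AvailabilityZones[].ZoneName' \
    --output table
```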

There is one more prerequisite we need to fulfill before creating the cluster. We need to create an SSH key. Without it, we would not be able to access any of the nodes that will form the cluster.

Please go back to the *AWS console* and click the *Key Pairs* link from the left-hand menu. Click the *Create Key Pair* button, type *devops22* as the key pair name, and click the *Create* button. The newly created SSH key will be downloaded to your laptop. Please copy it into the *docker-flow-monitor* directory. The project already has a `/*.pem` entry in its `.gitignore` file, so the key cannot be committed to GitHub by accident. Still, as an additional precaution, we should make sure that you are the only one who can read the contents of the file.

```
chmod 400 devops22.pem
```
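As a side note, the whole key pair setup can also be done with the AWS CLI. The commands that follow are an equivalent of the console steps above, assuming the AWS CLI is installed and configured; skip them if you already created the key through the console.

```
# Create the key pair and store the private key in the current directory.
aws ec2 create-key-pair \
    --key-name devops22 \
    --query 'KeyMaterial' \
    --output text \
    >devops22.pem

# Restrict the permissions, just as we did with the downloaded key.
chmod 400 devops22.pem
```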

Now we are ready to create the cluster. Please open *https://store.docker.com/editions/community/docker-ce-aws* in your favorite browser and click the *Get Docker* button. You might be asked to log in to *AWS console*. The region should be set to the one you chose previously. If it isn’t, please change it by clicking the name of the current region (e.g., *N. Virginia*) button located in the top-right section of the screen. We can proceed once you’re logged in, and the desired region is selected. Please click the *Next* button located at the bottom of the screen. You will be presented with the *Specify Details* screen. Please type *devops22* as the *Stack name*. We’ll leave the number of managers set to three, but we’ll change the number of workers to *0*. For now, we will not need more nodes. We can always increase the number of workers later on if such a need arises. For now, we’ll go with a minimal setup. Please select *devops22* as the answer to *Which SSH key to use?*. We’ll do the opposite from the default values for the rest of the fields in the *Swarm Properties* section. We do want to *enable daily resource cleanup* so we’ll change it to *yes*. That way, our cluster will be nice and clean most of the time since Docker will prune it periodically. We will select *no* as the value of the *use CloudWatch for container logging* drop-box. CloudWatch is very limiting. There are much better and cheaper solutions for storing logs, and we’ll explore them soon. Finally, please select *yes* as the value of the *create EFS prerequisites for CloudStor* drop-box. The setup process will make sure that all the requirements for the usage of EFS are created and thus speed up the process of mounting network drives. We should select the type of instances. One option could be to use *t2.micro* which is one of the free tiers. However, in my experience, *t2.micro* is just too small. *1GB* memory and *1* virtual CPU (vCPU) is not enough for some of the services we’ll run. We’ll use *t2.small* instead. With *2GB* of memory and *1* vCPU, it is still very small and would not be suitable for “real” production usage. However, it should be enough for the exercises we’ll run throughout the rest of this chapter. Please select *t2.small* as both the *Swarm manager instance type* and *Agent worker instance type* values. Even though we’re not creating any workers, we might choose to add some later on so having the proper size set in advance might be a good idea. We might discover that we need bigger nodes later on. Still, any aspect of the cluster is easy to modify, so there’s no reason to aim for perfection from the start. ![Figure 13-2: Docker For AWS Parameters screen](https://github.com/OpenDocCN/freelearn-devops-pt5-zh/raw/master/docs/dop-22-tlkt/img/00075.jpeg) Figure 13-2: Docker For AWS Parameters screen Please click the *Next* button. You’ll be presented with the *Options* screen. We won’t modify any of the available options so please click the *Next* button on this screen as well. We reached the last screen of the setup. It shows the summary of all the options we chose. Please go through the information and confirm that everything is set to the correct values. Once you’re done, click the *I acknowledge that AWS CloudFormation might create IAM resources* checkbox followed by the *Create* button. It’ll take around ten to fifteen minutes for the CloudFormation to finish creating all the resources. We can use that time to comment on a few of them. 
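While waiting, you can also follow the progress from a terminal instead of refreshing the CloudFormation UI. The command below assumes the AWS CLI is configured for the same region.

```
# Check the status of the devops22 CloudFormation stack.
# CREATE_IN_PROGRESS means it is still working; CREATE_COMPLETE means it is done.
aws cloudformation describe-stacks \
    --stack-name devops22 \
    --query 'Stacks[0].StackStatus' \
    --output text
```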
If you plan to transfer this knowledge to a different hosting solution, you'll probably need to replicate the same types of resources and the processes behind them. The list of all the resources created by the template can be found by selecting the *devops22* stack and clicking the *Resources* tab. Please click the *Restore* icon in the bottom-right part of the page if you don't see the tabs located at the bottom of the screen. We won't comment on all the resources the *Docker for AWS* template creates but only on the few that are crucial if you'd like to replicate a similar setup with a different vendor.

*VPC* (short for *Virtual Private Cloud*) makes the system secure by closing all but a few externally accessible ports. The only port open by default is *22*, required for SSH access. All others are locked down. Even port *22* is not open directly but through a load balancer.

*ELB* (short for *Elastic Load Balancer*) is sitting on top of the cluster. In the beginning, it forwards only SSH traffic. However, it is configured in a way that forwarding is added to the ELB every time we create a service that publishes a port. As a result, any service with a published port will be accessible through the *ELB* only. The load balancer itself cannot (in its current setting) forward requests based on their paths, domains, or other information from their headers. It does (a kind of) layer 4 load balancing that uses only the port as the forwarding criterion. It does a similar job as the ingress network. That, in itself, is not very useful if all your services are routed through a layer 7 proxy like [Docker Flow Proxy](http://proxy.dockerflow.com/) and, since it lacks proper routing, it cannot replace it. However, the more important feature ELB provides is load balancing across healthy nodes. It provides a DNS name that we can use to set up our domain's *CNAME* entries. No matter whether a node fails or is replaced during upgrades, ELB will always forward requests to one of the healthy nodes.

*EFS* (short for *Elastic File System*) will provide the network drives we'll use to persist stateful services that do not have replication capabilities. It can be replaced with *EBS* (short for *Elastic Block Storage*). Each has advantages and disadvantages. EFS volumes can be used across multiple availability zones, thus allowing us to move services from one zone to another without any additional steps. However, EFS is slower than EBS so, if IO speed is of the essence, it might not be the best choice. EBS, on the other hand, is the opposite. It is faster than EFS, but it cannot be used across multiple AZs. If a replica needs to be moved from one AZ to another, a data snapshot needs to be created first and then restored on an EBS volume created in the other AZ.

*ASGs* (short for *Auto-Scaling Groups*) provide an effortless way to scale (or de-scale) nodes. They will be essential in our quest for a self-healing system applied to infrastructure.

The *Overlay Network*, even though it is not unique to AWS, envelops all the nodes of the cluster and provides communication between services.

*DynamoDB* is used to store information about the primary manager. That information is changed if the node hosting the primary manager goes down and a different one is promoted. When a new node is added to the cluster, it uses the information from DynamoDB to find out the location of the primary manager and join itself to the cluster.

The cluster, limited to the most significant resources, can be described through *figure 13-3*.
![Figure 13-3: Simplified diagram with the key services created through the Docker For AWS template](https://github.com/OpenDocCN/freelearn-devops-pt5-zh/raw/master/docs/dop-22-tlkt/img/00076.jpeg) Figure 13-3: Simplified diagram with the key services created through the Docker For AWS template By this time, the cluster should be up and running and waiting for us to deploy the first stack. We can confirm that it is finished by checking the *Status* column of the *devops22* CloudFormation stack. We’re all set if the value is *CREATE_COMPLETE*. If it isn’t, please wait a few more minutes until the last round of resources is created. We’ll need to retrieve a few pieces of information before we proceed. We’ll need to know the DNS of the newly created cluster as well as the IP of one of the manager nodes. All the information we need is in the *Outputs* tab. Please go there and copy the value of the *DefaultDNSTarget* key. We’ll paste it into an environment variable. That will allow us to avoid coming back to this screen every time we need to use the DNS. ``` `1` `CLUSTER_DNS``=[`...`]` ``` ````````````````````````````````````````````````````````` Please change `[...]` with the actual DNS of your cluster. You should map your domain to that DNS in a “real” world situation. But, for the sake of simplicity, we’ll skip that part and use the DNS provided by AWS. The only thing left before we enter the cluster is to get the IP of one of the managers. Please click the link next to the *Managers* key. You will be presented with the *EC2 Instances* screen that lists all the manager nodes of the cluster. Select any of them and copy the value of the *IPv4 Public IP* key. Just as with DNS, we’ll set that value as an environment variable. ``` `1` `CLUSTER_IP``=[`...`]` ``` ```````````````````````````````````````````````````````` Please change `[...]` with the actual public IP of one of the manager nodes. The moment of truth has come. Does our cluster indeed work? Let’s check it out. ``` `1` ssh -i devops22.pem docker@`$CLUSTER_IP` `2` `3` docker node ls ``` ``````````````````````````````````````````````````````` We entered into one of the manager nodes and executed `docker node ls`. The output of the latter command is as follows (IDs are removed for brevity). ``` `1` HOSTNAME STATUS AVAILABILITY MANAGER STATUS `2` ip-172-31-2-46.ec2.internal Ready Active Reachable `3` ip-172-31-35-26.ec2.internal Ready Active Leader `4` ip-172-31-19-176.ec2.internal Ready Active Reachable ``` `````````````````````````````````````````````````````` As you can see, all three nodes are up and running and joined into a single Docker Swarm cluster. Even though this looks like a simple cluster, many things are going on in the background, and we’ll explore many of the cluster features later on. For now, we’ll concentrate on only a few observations. The nodes we’re running has an OS created by Docker and designed with only one goal. It runs containers, and nothing else. We cannot install packages directly. The benefits such an OS brings are related mainly to stability and performance. An OS designed with a specific goal is often more effective than general distributions capable of fulfilling all needs. Those often end up being fine at many things but not excellent with any. Docker’s OS is optimized for containers, and that makes it more stable. When there are no things we don’t use, there are fewer things that can cause trouble. In this case, the only thing we need is Docker Server (or Engine). 
Whatever else we might need must be deployed as a container. The truth is that we do not need much with Docker. A few things that we do need are already available. Let's take a quick look at the containers running on this node.

```
docker container ls -a
```

The output is as follows (IDs are removed for brevity).

```
IMAGE                                      COMMAND                 CREATED         STATUS                     PORTS                         NAMES
docker4x/l4controller-aws:17.06.0-ce-aws2  "loadbalancer run ..."  10 minutes ago  Up 10 minutes                                            l4controller-aws
docker4x/meta-aws:17.06.0-ce-aws2          "metaserver -iaas_..."  10 minutes ago  Up 10 minutes              172.31.19.205:9024->8080/tcp  meta-aws
docker4x/guide-aws:17.06.0-ce-aws2         "/entry.sh"             10 minutes ago  Up 10 minutes                                            guide-aws
docker4x/shell-aws:17.06.0-ce-aws2         "/entry.sh /usr/sb..."  10 minutes ago  Up 10 minutes              0.0.0.0:22->22/tcp            shell-aws
docker4x/init-aws:17.06.1-ce-aws1          "/entry.sh"             10 minutes ago  Exited (0) 10 minutes ago                                lucid_leakey
```

We'll explore those containers only briefly so that we understand their high-level purposes.

The *l4controller-aws* container is in charge of the ELB. It monitors services and updates the load balancer whenever a service that publishes a port is created, updated, or removed. You'll see the ELB integration in action soon. For now, the important part to note is that we do not need to worry about what happens when a node goes down, nor do we need to update security groups when a new port needs to be opened. The ELB and the *l4controller-aws* containers make sure those things are always up-to-date.

The *meta-aws* container provides general server metadata to the rest of the Swarm cluster. Its main purpose is to provide tokens for members to join the Swarm cluster.

The *guide-aws* container is in charge of housekeeping. It removes unused images, stopped containers, volumes, and so on. On top of those responsibilities, it updates DynamoDB with information about the managers and a few other things.

The *shell-aws* container provides a shell, FTP, SSH, and a few other essential tools. When we entered the node we're in right now, we actually entered this container. We're not running commands (e.g., `docker container ls`) from the OS but from inside this container. The OS is so specialized that it does not even have SSH.

The *lucid_leakey* container (yours will probably have a different name), created from the *docker4x/init-aws* image, might be the most interesting system container. It ran, did its job, and exited. It has only one purpose. It discovered the IP and the token of the primary manager and joined the node to the cluster. With that process in place, we can add more nodes whenever we need them, knowing that they will join the cluster automatically. If a node fails, the auto-scaling group will create a new one which will, through this container, join the cluster.

We did not explore all of the features of the cluster. We'll postpone that discussion for the next chapter when we explore the self-healing capabilities of the cluster and, later on, self-adaptation. Instead, we'll proceed by deploying the services we used in the previous chapters. The immediate goal is to reach the same state as the one we left in the previous chapter. The only real difference, for now, will be that the services will run on a production-ready cluster.

### Deploying Services

We'll start by deploying the stacks we used so far. We will not modify them in any way but deploy them as they are.
Further on, we'll explore what modifications we should make to those stacks to make them more production-ready.

We'll execute the [scripts/aws-services.sh](https://github.com/vfarcic/docker-flow-monitor/blob/master/scripts/aws-services.sh) script that contains all the commands we used thus far. Please replace `[...]` with the DNS of your cluster.

```
export CLUSTER_DNS=[...]

curl -o aws-services.sh \
    https://raw.githubusercontent.com/vfarcic/docker-flow-monitor/master/scripts/aws-services.sh

chmod +x aws-services.sh
```

The commands we executed created the environment variable `CLUSTER_DNS`, downloaded the script, and assigned it execute permissions. We won't go into the details of the script since it deploys the same stacks we used before. Feel free to explore it yourself.

Now we can execute the script, which will deploy the familiar stacks and services.

```
./aws-services.sh

docker stack ls
```

We executed the script and listed all the stacks we deployed with it. The output of the `docker stack ls` command is as follows.

```
NAME      SERVICES
exporter  3
go-demo   2
jenkins   2
monitor   3
proxy     2
```

You should be familiar with all those stacks. At this moment, the services in our AWS cluster behave in the same way as when we deployed them to Docker Machine clusters. As you might have guessed, there are a few things we're still missing before those services can be considered production-ready. The first problem we'll tackle is security.

### Securing Services

There's not much reason to secure internal services that do not publish any ports. Such services are usually intended to be accessed by other services attached to the same internal network. For example, the `go-demo` stack deploys two services. One of them is the `db` service that can be accessed only by the other service from the stack (`main`). We accomplished that by having both services attached to the same network and by not publishing any ports.

The main objective should be to secure communication between clients outside your cluster and the services residing inside it. We usually accomplish that by adding SSL certificates to a proxy and, potentially, disabling HTTP communication. *Docker Flow Proxy* makes that an easy task. If you haven't set up SSL before, you might want to explore the [Configuring SSL Certificates](http://proxy.dockerflow.com/certs/) tutorial.

There are quite a few ways to get certificates, but the one that stands out from the crowd is [Let's Encrypt](https://letsencrypt.org/). It's free and commonly used by a massive community. Two projects integrate *Let's Encrypt* with *Docker Flow Proxy*. You can find them in the GitHub repositories [n1b0r/docker-flow-proxy-letsencrypt](https://github.com/n1b0r/docker-flow-proxy-letsencrypt) and [hamburml/docker-flow-letsencrypt](https://github.com/hamburml/docker-flow-letsencrypt). They use different approaches to obtain certificates and pass them to the proxy. I urge you to explore both before making a decision on which one to use (if any).

Unfortunately, we won't be able to set up certificates since we do not have a valid domain. Let's Encrypt would not allow us to use the DNS name AWS gave us, and I cannot know in advance whether you have a domain name you could use for this exercise.
So, we'll skip the examples of how to set up SSL, assuming that you'll explore it on your own. Feel free to reach me on the [DevOps20](http://slack.devops20toolkit.com/) Slack channel if you have a question or you run into a problem.

Encryption is only a part of what we need to do to secure our services. One of the obvious things we're missing is authentication. Let us review the publicly available services we're currently running inside our cluster and discuss the authentication strategies we might apply. We'll start with Jenkins.

```
exit

open "http://$CLUSTER_DNS/jenkins"
```

The Jenkins service was created from a custom-built image that already has an admin user set through Docker secrets. That was a great first step that allowed us to skip the manual setup and, at the same time, have a relatively secure initial experience. However, we need more. Potentially, every member of our organization should be able to access Jenkins. We could add them all as users in Jenkins' internal registry, but that would prove to be too much work for anything but small teams. Fortunately, Jenkins allows authentication through almost any provider. All you have to do is install and configure one of the authentication plugins. [GitHub Authentication](https://plugins.jenkins.io/github-oauth), [Google Login](https://plugins.jenkins.io/google-login), [LDAP](https://plugins.jenkins.io/ldap), and [Gitlab Authentication](https://plugins.jenkins.io/gitlab-oauth) are only a few among many other available solutions.

We won't go into the details of how to set up "proper" authentication since there are too many options and I cannot predict which one would suit your needs. In most cases, following the instructions on the plugin page should be more than enough to get you up and running in no time. For now, it is important that the image we're running is secured by default with the user we defined through Docker secrets and that you can easily replace that with authentication through one of the plugins. The current setup allows any user to see the jobs, but only the administrator can create new ones, build them, or update them.

Let's move to Prometheus and explore how to secure it with authentication. While Jenkins has both its internal credentials storage and the ability to connect to many third-party credential providers, Prometheus has neither. There is no internal authentication, nor a built-in ability to integrate with an external authentication service. All that does not mean that everything is lost.

Prometheus holds metrics of your cluster and the services inside it. Metrics have labels, and they might contain confidential information. That information needs to be protected, and the only options are to deny external access to the service or to authenticate requests before they reach it. The first option would entail a VPN and blacklisting the Prometheus domain, or some other method that would deny access to anyone but those inside the VPN. The alternative is to use an authentication gateway or to instruct the proxy to request authentication. We won't go into a discussion of the pros and cons of each method since it often depends on personal preferences, the company culture, and the existing infrastructure. Instead, we'll roll with the simplest solution. We'll instruct the proxy to authenticate requests to Prometheus.
```
ssh -i devops22.pem docker@$CLUSTER_IP

curl -o proxy.yml \
    https://raw.githubusercontent.com/vfarcic/docker-flow-monitor/master/stacks/docker-flow-proxy-aws.yml

cat proxy.yml
```

We entered the cluster, downloaded a new proxy stack, and displayed its content. The output of the `cat` command, limited to relevant parts, is as follows.

```
...
  proxy:
    ...
    secrets:
      - dfp_users_admin
    ...

secrets:
  dfp_users_admin:
    external: true
```

We added the Docker secret `dfp_users_admin`. We’ll use it to store the username and password that we’ll use later on with services that require authentication through the proxy. Now that we know that the stack requires a secret, we can create it and redeploy the services.

```
echo "admin:admin" | docker secret \
    create dfp_users_admin -

docker stack deploy -c proxy.yml \
    proxy
```

We piped the value `admin:admin` to the command that created the `dfp_users_admin` secret and deployed the new definition of the stack. All that’s left now is to update the *monitor* service by adding a few labels that will tell the proxy that the service requires authentication using the credentials from the secret we created.

```
curl -o monitor.yml \
    https://raw.githubusercontent.com/vfarcic/docker-flow-monitor/master/stacks/docker-flow-monitor-user.yml

cat monitor.yml
```

We downloaded a new monitor stack and displayed its content. The output, limited to relevant parts, is as follows.

```
...
  monitor:
    ...
    deploy:
      labels:
        - com.df.usersPassEncrypted=false
        - com.df.usersSecret=admin
    ...
```

*Docker Flow Proxy* uses a naming convention to resolve the names of Docker secrets that contain users and passwords. The value of the `usersSecret` parameter is prefixed with `dfp_users_`, thus matching the name of the secret we created a moment ago. We used the `usersPassEncrypted` parameter to tell the proxy that the credentials are not encrypted. Please check the [HTTP Mode Query Parameters](http://proxy.dockerflow.com/usage/#http-mode-query-parameters) section of the documentation for more details and additional options. The monitor stack requires the DNS of our cluster, so we’ll define it as the environment variable `CLUSTER_DNS`. Please replace `[...]` with the `CLUSTER_DNS` value obtained from the variable defined locally.

```
exit

echo $CLUSTER_DNS

ssh -i devops22.pem docker@$CLUSTER_IP

CLUSTER_DNS=[...]
```

We exited the cluster so that we could output the value of the `CLUSTER_DNS` variable we created locally, entered it again, and defined the same variable inside one of the nodes of the cluster. Those commands might seem like overkill but, in my case, they are easier than opening the CloudFormation UI and looking for the outputs. You probably guessed by now that I prefer doing as much as possible from the command line. Now we can deploy the updated stack.

```
DOMAIN=$CLUSTER_DNS docker stack \
    deploy -c monitor.yml monitor
```

Before we check whether authentication is indeed applied, we should wait for a moment or two until all the services of the stack are up-and-running.
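While we wait, a quick side note. The `admin:admin` value we stored is plain text, which is fine for a demo but not for much else. What follows is a minimal sketch of a slightly better approach, assuming `mkpasswd` (from the `whois` package) is available and that the HAProxy build inside the proxy image supports SHA-512 crypt hashes; if you go down this route, the `com.df.usersPassEncrypted` label would need to be set to `true` as well.

```
# Generate a SHA-512 crypt hash of the password (mkpasswd prompts for it).
HASHED_PASS=$(mkpasswd -m sha-512)

# A secret cannot be modified or removed while a service uses it, so in
# practice we'd create a new one (e.g., the hypothetical dfp_users_admin2)
# and point the proxy stack at it instead of dfp_users_admin.
echo "admin:$HASHED_PASS" | \
    docker secret create dfp_users_admin2 -
```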
You can check the status by executing `docker stack ps monitor`. All that’s left now is to open Prometheus and authenticate.

```
exit

open "http://$CLUSTER_DNS/monitor"
```

We exited the cluster and opened Prometheus in our default browser. This time, we were asked to enter a username and password before being redirected to the UI. Authentication works! While it might not be a perfect solution (nothing is), it is more secure than it was a moment ago when anyone could enter Prometheus. Now we can try to solve one more problem. Our cluster runs a few stateful services that might need to be persisted somewhere.

### Persisting State

What shall we do with the stateful services inside our cluster? If any of them fails and Swarm reschedules it, the state will be lost. Even if the impossible happens and none of the replicas of the service ever fail, sooner or later we’ll have to upgrade the cluster. That means that existing nodes will be replaced with new ones and Swarm will have to reschedule your services to the new nodes. In other words, services will fail or be rescheduled, and we might need to persist state when they are stateful. Let us go through each of the stateful services we’re currently running inside our cluster. The obvious case of stateful services is databases. We are running MongoDB. Should we persist its state? Many would answer positively to that question. I’ll argue against persisting data on disk. Instead, we should create a replica set with at least three MongoDBs. That way, data would be replicated across multiple instances, and a failure of one or even two of them would not mean a loss of data. Unfortunately, MongoDB is not a container-friendly database (almost none of the DBs are), so scaling the Mongo service to a few replicas will not do the trick. We’d need to create three services and do some custom plumbing. It’s nothing too complicated, and yet it’s not what we’re used to with Docker Swarm services. We won’t go into the details of how to set up a Mongo replica set inside Docker services. I’ll leave that up to your Google-ing skills. The important point I tried to convey is that we do not always need to persist state. If a stateful service can replicate its state across different instances, there might not be a need to store that state on a network drive as well. Moving on… *Docker Flow Proxy* is also a stateful service. It uses HAProxy, which uses the file system for its configuration. Since we are changing that configuration whenever a service is created, updated, or removed, we can consider that configuration as its state. If it gets lost, *Docker Flow Proxy* would not be able to forward requests to all our public-facing services. Fortunately, there’s no need to persist the proxy state either. *Docker Flow Swarm Listener* sends service notifications to all proxy replicas. On the other hand, when a new replica of the proxy is created, the first thing it does is request all the info it needs from the listener. All in all, if we ignore possible bugs, all the replicas of the proxy should always be up-to-date and with identical configuration. In other words, there’s one less stateful service to worry about. Moving on… Prometheus is also a stateful service. However, it cannot be scaled, so its state cannot be replicated among multiple instances. It is a good example of a service that needs to persist its data on disk. Let’s open the Prometheus flags screen and see the checkpoint interval.
```
open "http://$CLUSTER_DNS/monitor/flags"
```

You’ll see a property called `storage.local.checkpoint-interval` set to `5m0s`. Prometheus will flush its state to a file every five minutes. By now, you should have a decent amount of data stored in Prometheus. We can confirm that by opening the graph screen.

```
open "http://$CLUSTER_DNS/monitor/graph"
```

Please type the query that follows into the *Expression* field.

```
container_memory_usage_bytes{container_label_com_docker_swarm_service_name!=""}
```

Click the *Execute* button, followed by a switch to the *Graph* tab. You should see the memory usage of each container in the cluster. However, the reason we got here is not to admire metrics but to demonstrate state persistence. Let’s see what happens when we simulate a failure of the service.

```
ssh -i devops22.pem docker@$CLUSTER_IP

docker service scale monitor_monitor=0

docker service scale monitor_monitor=1

exit
```

We entered the cluster, scaled the service to zero replicas, scaled it back to one, and exited. That was probably the fastest way to simulate a failure. Let’s go back to the graph screen.

```
open "http://$CLUSTER_DNS/monitor/graph"
```

If Prometheus does not load, you might need to wait for a few moments and refresh the screen. Repeat the execution of the same query we used a short while ago.

```
container_memory_usage_bytes{container_label_com_docker_swarm_service_name!=""}
```

You should notice that the metrics are gone. You might have a minute or two of data, but those from before the failure simulation are gone. Let’s download a new version of the `monitor` stack and see how it solves our persistence problem.

```
ssh -i devops22.pem docker@$CLUSTER_IP

curl -o monitor.yml \
    https://raw.githubusercontent.com/vfarcic/docker-flow-monitor/master/stacks/docker-flow-monitor-aws.yml

cat monitor.yml
```

We entered the cluster and downloaded an updated version of the `monitor` stack. The output of the `cat` command, limited to relevant parts, is as follows.

```
...
  monitor:
    image: vfarcic/docker-flow-monitor
    environment:
      - ARG_STORAGE_LOCAL_PATH=/data
    ...
    volumes:
      - prom:/data
    ...

volumes:
  prom:
    driver: cloudstor:aws
    external: false
```

We specified the storage path using the environment variable `ARG_STORAGE_LOCAL_PATH`, mapped the `prom` volume to the directory `/data`, and defined the volume with the driver `cloudstor:aws`. The *cloudstor* driver was developed by Docker specifically for usage in AWS and Azure. It will create a network drive (in this case EFS) and attach it to the service. Since the `prom` volume has `external` set to `false`, the volume will be created automatically when we deploy the stack. Otherwise, we’d need to execute the `docker volume create` command first. Let’s deploy the new stack. Please make sure to replace `[...]` with the value of the `CLUSTER_DNS` variable defined locally (the second command).
```
exit

echo $CLUSTER_DNS

ssh -i devops22.pem docker@$CLUSTER_IP

CLUSTER_DNS=[...]

DOMAIN=$CLUSTER_DNS docker stack \
    deploy -c monitor.yml monitor
```

We exited the cluster only to output the DNS, went back in, created the `CLUSTER_DNS` variable, and deployed the new stack. Now we should wait for a while so that Prometheus can accumulate some metrics. If we repeat the failure simulation right away, we would not be able to confirm whether data is persisted or not. Instead, you should grab a coffee and come back in ten minutes or more. That should be enough for a checkpoint or two to flush data to disk. Now, after a while, we can repeat the failure simulation steps and verify whether data is indeed persisted across failures.

```
docker service scale monitor_monitor=0

docker service scale monitor_monitor=1
```

We changed the number of replicas to zero only to increase it to one a few moments later. As a result, Swarm created a new instance of the service. Let’s go back to the graph screen and confirm that the data survived the failure.

```
exit

open "http://$CLUSTER_DNS/monitor/graph"
```

Please type the query that follows in the *Expression* field, click the *Execute* button, and switch to the *Graph* tab.

```
container_memory_usage_bytes{container_label_com_docker_swarm_service_name!=""}
```

You should see that the metrics go back in time longer than the lifetime of the newly scheduled replica. Data persistence works! Partly… Prometheus is an in-memory database. It keeps all the metrics in memory and periodically flushes them to disk. It is not designed to be transactional but fast. We might have lost some data that was scraped between the last checkpoint and the (simulated) failure of the service. However, in most cases, that is not a real problem since metrics are supposed to show us tendencies, not every single transaction. If you compare the graphs from before and after the (simulated) crash, you’ll notice that they are, more or less, the same even though some data might be lost. We have one more service left to fix. Jenkins is also a stateful service. It stores its state as files. They are, in a way, its database, and we need to persist them. Let’s download a new Jenkins stack.

```
ssh -i devops22.pem docker@$CLUSTER_IP

curl -o jenkins.yml \
    https://raw.githubusercontent.com/vfarcic/docker-flow-monitor/master/stacks/jenkins-aws.yml

cat jenkins.yml
```

The output of the `cat` command, limited to relevant parts, is as follows.

```
...
  master:
    ...
    volumes:
      - master:/var/jenkins_home
    ...

volumes:
  master:
    driver: cloudstor:aws
    external: false
```

By now, all the additions should be familiar. We defined a volume called `master` and mapped it to Jenkins’ home directory. Further down, we defined the `master` volume to use the `cloudstor:aws` driver and set `external` to `false` so that the `docker stack deploy` command can take care of creating the volume. We’ll deploy the new stack before checking whether persistence works.
```
docker stack deploy -c jenkins.yml \
    jenkins

docker stack ps jenkins
```

It will take a couple of minutes until the volume is created, the image is pulled, and the Jenkins process inside the container is initialized. You’ll know that Jenkins is initialized when you see the message `INFO: Jenkins is fully up and running` in its logs. Use the `docker service logs jenkins_master` command to see the output. Now that Jenkins is initialized and uses EFS to store its state, we should confirm that persistence indeed works. We’ll do that by creating a new job, shutting down Jenkins, letting Swarm reschedule a new replica, and, finally, checking that the newly created job is present.

```
exit

open "http://$CLUSTER_DNS/jenkins/newJob"
```

We exited the cluster and opened the *New Job* screen. Please use *admin* as both *username* and *password* if you’re asked to authenticate. Next, type *test* as the *item name*, select *Pipeline* as the job type, and click the *OK* button. Once inside the job configuration screen, please click the *Pipeline* tab. We are about to write a very complicated pipeline script. Are you ready? Please type the script that follows inside the *Pipeline Script* field and press the *Save* button.

```
echo "This is a test"
```

Now that we created a mighty pipeline job, we can simulate a Jenkins failure by sending it an *exit* command.

```
open "http://$CLUSTER_DNS/jenkins/exit"
```

We opened Jenkins’ *exit* screen. You’ll see a button saying *Try POSTing*. Click it. Jenkins will shut down, and Swarm will detect that as a failure and schedule a new replica of the service. Wait a few moments until Jenkins inside the new replica is initialized and open the home screen.

```
open "http://$CLUSTER_DNS/jenkins"
```

As you can see, the newly created job is there. Persistence works! If you visit the Jenkins nodes screen, you’ll notice that we are running only one agent, labeled `prod`. That’s the agent we should use only to deploy a new release to production and, potentially, run production tests. We still need to set up the agents we’ll use to run unit tests, build images, run integration tests, and so on. We’ll postpone that part for one of the next chapters since efficient usage of agents is related to self-adaptation applied to infrastructure. We are yet to reach that section.

### Alternatives to CloudStor Volume Driver

If you’re not using *Docker For AWS* or *Azure*, using CloudStor might not be the best idea. Even though it can be made to work with AWS or Azure without the template we used to create the cluster, it is not well documented. For now, its goal is only to be used with AWS or Azure clusters made with Docker templates. For anything else, I’d recommend you choose one of the alternatives. My personal preference is [REX-Ray](http://rexray.readthedocs.io/). All in all, stick with *CloudStor* if you choose to create your Swarm cluster using the *Docker For AWS* or *Azure* templates. It is well integrated and provides a great out-of-the-box experience. For anything else, use *REX-Ray* if it supports your hosting vendor. Otherwise, look for some other alternative. There are plenty of others, and more are yet to come. The most important part of the story is to know when to persist the state and when to let replication do the work.
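To give you an idea of what that looks like in practice, here is a minimal REX-Ray sketch. Treat it as an assumption-laden example rather than a recipe: the `rexray/ebs` plugin name, the `EBS_ACCESSKEY`/`EBS_SECRETKEY` settings, and the `size` option come from my reading of the REX-Ray documentation, so verify them against the docs for your storage backend before relying on them.

```
# Install the REX-Ray EBS plugin (this needs to be done on every node).
docker plugin install rexray/ebs \
    --grant-all-permissions \
    EBS_ACCESSKEY=[...] \
    EBS_SECRETKEY=[...]

# Create a volume backed by an EBS drive...
docker volume create \
    --driver rexray/ebs \
    --opt size=10 \
    prom

# ...or reference the driver from a stack file instead of cloudstor:aws:
#
# volumes:
#   prom:
#     driver: rexray/ebs
```

The pattern is the same as with CloudStor; only the driver (and its options) change.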
When persistence is paramount, use any of the volume drivers that support your hosting vendor and fit your requirements. The only thing left before we can call this cluster production-ready is to set up centralized logging.

### Setting Up Centralized Logging

We chose not to integrate our cluster with CloudWatch. Actually, I chose not to use it, and you blindly followed my example. Therefore, I guess that an explanation is in order. It’s going to be a short one. I don’t like CloudWatch. I think it is a bad solution that is way behind the competition and, at the same time, it can become quite expensive when dealing with large quantities of data. More importantly, I believe that we should use services coming from hosting vendors only when they are essential or provide an actual benefit. Otherwise, we’d run the risk of falling into the trap called *vendor lock-in*. Docker Swarm allows us to deploy services in the same way, no matter whether they are running in AWS or anywhere else. The only difference would be the volume driver we choose to plug in. Similarly, all the services we decided to deploy thus far can run anywhere. The only “lock-in” is with Docker Swarm but, unlike AWS, it is open source. If needed, we can even fork it into our own repository and build our own Docker server. That is not to say that I would recommend forking Docker, but rather that I am trying to make a clear distinction between being locked into an open source project and being locked into a commercial product. Moreover, with relatively moderate changes, we could migrate our Swarm services to Kubernetes or even Mesos and Marathon. Again, that is not something I recommend, but more of a statement that a choice to change the solution is not as time-demanding as it might seem at first look. I think I strayed from the main subject, so let me summarize. CloudWatch is bad, and it costs money. Many of the free alternatives are much better. If you read my previous books, you probably know that my preference for a logging solution is the ELK stack (ElasticSearch, LogStash, and Kibana). I used it for both logging and metrics but, since then, the metrics solution was replaced with Prometheus. How about centralized logging? I think that ELK is still one of the best self-hosted solutions, even though I’m not entirely convinced I like the new path Elastic is taking as a company. I’ll leave the discussion about their direction for later and, instead, we’ll dive right into setting up the ELK stack in our cluster.

```
ssh -i devops22.pem docker@$CLUSTER_IP

curl -o logging.yml \
    https://raw.githubusercontent.com/vfarcic/docker-flow-monitor/master/stacks/logging-aws.yml

cat logging.yml
```

We went back to the cluster, downloaded the logging stack, and displayed its contents. The YML file defines the ELK (ElasticSearch, LogStash, and Kibana) stack as well as LogSpout. ElasticSearch is an in-memory database that will store our logs. LogSpout will be sending logs from all the containers running inside the cluster to LogStash, which, in turn, will process them and send the output to ElasticSearch. Kibana will be used as a UI to explore the logs. That is all the detail about the stack you’ll get. I’ll assume that you are already familiar with the services we’ll use. They were described in [The DevOps 2.1 Toolkit: Docker Swarm](https://www.amazon.com/dp/1542468914). If you did not read it, the information can easily be found on the Internet. Google is your friend. The first service in the stack is `elasticsearch`.
It is an in-memory database we’ll use to store logs. Its definition is as follows. ``` `1` elasticsearch: `2` image: docker.elastic.co/elasticsearch/elasticsearch:5.5.2 `3` environment: `4` - xpack.security.enabled=false `5` volumes: `6` - es:/usr/share/elasticsearch/data `7` networks: `8` - default `9` deploy: `10 ` labels: `11 ` - com.df.distribute=true `12 ` - com.df.notify=true `13 ` - com.df.port=80 `14 ` - com.df.alertName=mem_limit `15 ` - com.df.alertIf=@service_mem_limit:0.8 `16 ` - com.df.alertFor=30s `17 ` resources: `18 ` reservations: `19 ` memory: 3000M `20 ` limits: `21 ` memory: 3500M `22 ` placement: `23 ` constraints: [node.role == worker] `24` ... `25` volumes: `26 ` es: `27 ` driver: cloudstor:aws `28 ` external: false `29` ... ``` ``````````````````` There’s nothing special about the service. We used the environment variable `xpack.security.enabled` to disable X-Pack. It is a commercial product baked into ElasticSearch image. Since this book uses only open source services, we had to disable it. That does not mean that X-Pack is not useful. It is. Among other things, it provides authentication capabilities to ElasticSearch. I encourage you to explore it and make your own decision whether it’s worth the money. I could argue that there’s not much reason to secure ElasticSearch since we are not exposing any ports. Only services that are attached to the same network will be able to access it. That means that only people you trust to deploy services would have direct access to it. Usually, we’d run multiple ElasticSearch services and join them into a cluster (ElasticSearch calls replica set a cluster). Data would be replicated between multiple instances and would be thus preserved in case of a failure. However, we do not need multiple ElasticSearch services, nor do we have enough hardware to host them. Therefore, we’ll run only one ElasticSearch service and, since there will be no replication, we’ll store its state on a volume called `es`. The only other noteworthy part of the service definition is the placement defined as `constraints: [node.role == worker]`. Since ElasticSearch is very resource demanding, it might not be a wise idea to place it on a manager. Therefore, we defined that it should always run on one of the workers and reserved 3GB of memory. That should be enough to get us started. Later on, depending on a number of log entries you’re storing and the cleanup strategy, you might need to increase the memory allocated to it and scale it to multiple services. Let’s move to the next service. ``` `1` ... `2` logstash: `3` image: docker.elastic.co/logstash/logstash:5.5.2 `4` networks: `5` - default `6` deploy: `7` labels: `8` - com.df.distribute=true `9` - com.df.notify=true `10 ` - com.df.port=80 `11 ` - com.df.alertName=mem_limit `12 ` - com.df.alertIf=@service_mem_limit:0.8 `13 ` - com.df.alertFor=30s `14 ` resources: `15 ` reservations: `16 ` memory: 600M `17 ` limits: `18 ` memory: 1000M `19 ` configs: `20 ` - logstash.conf `21 ` environment: `22 ` - LOGSPOUT=ignore `23 ` command: logstash -f /logstash.conf `24` ... `25` configs: `26 ` logstash.conf: `27 ` external: true ``` `````````````````` LogStash will accept logs using syslog format and protocol and forward them to ElasticSearch. You’ll see the configuration soon. The only interesting part about the service is that we’re injecting a Docker config. It works in almost the same way as secrets except that it is not encrypted at rest. 
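As a small aside (not part of the stack definition), Docker configs can be listed and read back with standard commands, which is handy for checking what a service will actually see. Once we create the `logstash.conf` config a bit later, something like the following should display it; I believe `--pretty` is supported for `config inspect`, but plain `docker config inspect` works as well.

```
# List the configs known to the cluster and print the content of one of them.
docker config ls

docker config inspect --pretty logstash.conf
```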
Since the LogStash configuration will not contain anything compromising, there’s no need to set it up as a secret. We did not specify a config destination, so it will be available as the file `/logstash.conf`. The command is set to reflect that. We’re halfway through. The next service in line is `kibana`.

```
  kibana:
    image: docker.elastic.co/kibana/kibana:5.5.2
    networks:
      - default
      - proxy
    environment:
      - xpack.security.enabled=false
      - ELASTICSEARCH_URL=http://elasticsearch:9200
    deploy:
      labels:
        - com.df.notify=true
        - com.df.distribute=true
        - com.df.usersPassEncrypted=false
        - com.df.usersSecret=admin
        - com.df.servicePath=/app,/elasticsearch,/api,/ui,/bundles,/plugins,/status,/es_admin
        - com.df.port=5601
        - com.df.alertName=mem_limit
        - com.df.alertIf=@service_mem_limit:0.8
        - com.df.alertFor=30s
      resources:
        reservations:
          memory: 600M
        limits:
          memory: 1000M
```

Kibana will provide a UI that will allow us to filter and display logs. It can do many other things, but logs are all we need for now. Unfortunately, Kibana is not proxy-friendly. Even though there are a few environment variables that can configure the base path, they do not truly work as expected. We had to specify multiple paths through the `com.df.servicePath` label. They reflect all the combinations of requests Kibana makes. I’d recommend that you replace `com.df.servicePath` with `com.df.serviceDomain`. The value could be a subdomain (e.g., `kibana.acme.com`). The rest of the definition is pretty uneventful, so we’ll move on. We, finally, reached the last service of the stack.

```
  logspout:
    image: gliderlabs/logspout:v3.2.2
    networks:
      - default
    environment:
      - SYSLOG_FORMAT=rfc3164
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
    command: syslog://logstash:51415
    deploy:
      mode: global
      labels:
        - com.df.notify=true
        - com.df.distribute=true
        - com.df.alertName=mem_limit
        - com.df.alertIf=@service_mem_limit:0.8
        - com.df.alertFor=30s
      resources:
        reservations:
          memory: 20M
        limits:
          memory: 30M
```

LogSpout will monitor Docker events and send the logs of all containers to LogStash which, in turn, forwards them to ElasticSearch. We’re exposing the Docker socket as a volume so that the service can communicate with the Docker server. The command specifies `syslog` as the protocol and `logstash` running on port `51415` as the destination address. Since all the services of the stack are connected through the same `default` network, the name of the service (`logstash`) is all we need as the address. The service will run in the `global` mode so that a replica is present on each node of the cluster. We need to create the `logstash.conf` config before we deploy the stack. The command is as follows.

```
echo '
input {
  syslog { port => 51415 }
}

output {
  elasticsearch {
    hosts => ["elasticsearch:9200"]
  }
}
' | docker config create logstash.conf -
```

We echoed a configuration and piped the output to the `docker config create` command. The configuration specifies `syslog` running on port `51415` as the `input`. The output is ElasticSearch running on port `9200`. The address of the output is the name of the destination service (`elasticsearch`). Now we can deploy the stack.
```
docker stack deploy -c logging.yml \
    logging
```

A few of the images are big, and it will take a moment or two until all the services are up-and-running. We’ll confirm the state of the stack by executing the command that follows.

```
docker stack ps \
    -f desired-state=running logging
```

The output is as follows (IDs are removed for brevity).

```
NAME                    IMAGE                                                NODE                                         DESIRED STATE CURRENT STATE          ERROR PORTS
logging_logspout...     gliderlabs/logspout:v3.2.2                           ip-172-31-46-204.us-east-2.compute.internal  Running       Running 3 minutes ago
logging_logspout...     gliderlabs/logspout:v3.2.2                           ip-172-31-12-85.us-east-2.compute.internal   Running       Running 2 minutes ago
logging_logspout...     gliderlabs/logspout:v3.2.2                           ip-172-31-31-76.us-east-2.compute.internal   Running       Running 3 minutes ago
logging_kibana.1        docker.elastic.co/kibana/kibana:5.5.2                ip-172-31-46-204.us-east-2.compute.internal  Running       Running 15 seconds ago
logging_logstash.1      docker.elastic.co/logstash/logstash:5.5.2            ip-172-31-31-76.us-east-2.compute.internal   Running       Running 3 minutes ago
logging_elasticsearch.1 docker.elastic.co/elasticsearch/elasticsearch:5.5.2                                               Running       Pending 3 minutes ago
```

You’ll notice that `elasticsearch` is in the *pending* state. Swarm cannot deploy it because none of the servers meets the requirements we set. We need at least 3GB of memory and a worker node. We should either change the constraint and the reservations to fit our current cluster setup or add a worker as a new node. We’ll go with the latter. As a side note, Kibana might fail after a while. It will try to connect to ElasticSearch a few times and then stop the process. Soon after, it will be rescheduled by Swarm, only to stop again. That will continue until we manage to run ElasticSearch. Please exit the cluster before we proceed.

```
exit
```

### Extending The Capacity Of The Cluster

Among other resources, the *Docker For AWS* template created two auto-scaling groups. One is used for managers and the other for workers. Those auto-scaling groups have multiple purposes. If we choose to update the stack to, for example, change the size of the nodes or upgrade the Docker server to a newer version, the template will temporarily increase the number of nodes by one and shut down one of the old ones. The replicas that were running on the old server will be moved to the new one. Once the new server is created, the process will move to the next node, and the next after that, all the way until all the nodes are replaced. The process is very similar to the rolling updates performed by Swarm when updating services. The same process is applied whenever we decide to update any aspect of the *Docker For AWS* stack. Similarly, if one of the nodes fails health checks, the template will increase the auto-scaling group by one so that a new node is created in its place and, once everything goes back to “normal”, set the ASG back to its initial value. In all those cases, not only will new nodes be created through auto-scaling groups, but they will also join the cluster as a manager or a worker, depending on the type of the server that is being replaced. We will explore failure recovery in the chapter dedicated to self-healing applied to infrastructure. For now, we’ll limit the scope to an example of how to update the CloudFormation stack that created our cluster. We even have a perfect use-case.
Our ElasticSearch service needs a worker node, and it needs it to be bigger than those we use as managers. Let’s create it. We’ll start by opening the CloudFormation home screen.

```
open "https://us-east-2.console.aws.amazon.com/cloudformation/home"
```

Please select the *devops22* stack. Click the *Actions* drop-down list and select the *Update Stack* item. Click the *Next* button. You will be presented with the same initial screen you saw while we were creating the *Docker For AWS* stack. The only difference is that the values are now populated with the choices we made previously. We can change anything we want. Not only will the changes be applied accordingly, but the process will also use rolling updates to avoid downtime. Whether you will have downtime or not depends on the capabilities of your services. If needed, the process will change one node at a time. If you’re running multiple replicas of a service, the worst-case scenario is that you will experience degraded performance for a short period. However, services that are not scalable, like, for example, Prometheus, will experience downtime. When a node is destroyed, Swarm will reschedule the services it hosted to a newly created server. If the state of such a service is on a network drive like EFS, it will continue working as if nothing happened. However, we must count the time from the moment the service fails due to the destruction of the node until it is up-and-running again. In most cases, that should be only a couple of seconds. No matter how short the downtime is, it is still a period during which our non-scalable services are not operational. Be that as it may, not all services are scalable, and the process is the best we can do. If there is downtime, let it be as short as possible. In this case, we won’t make an update that will force the system to recreate nodes. Instead, we’ll only add a new worker node. Please scroll to the *Number of Swarm worker nodes?* field and change the value from *0* to *1*. Since we defined that ElasticSearch should reserve 3GB of memory, we should also change the worker instance type. Our managers are using *t2.small*, which comes with 2GB. The smallest instance that fulfills our requirements is *t2.medium*, which comes with 4GB of memory. Please change the value of the *Agent worker instance type?* drop-down list to *t2.medium*. We will not change any other aspect of the cluster, so all that’s left is to click the *Next* button twice and select the *I acknowledge that AWS CloudFormation might create IAM resources.* checkbox. After a few moments, the *Preview your changes* section of the screen will be populated with the list of changes that will be applied to the cluster. Since this is a simple and non-destructive update, only a few resources related to auto-scaling groups will be updated.

![Figure 13-4: Preview your changes screen from the Docker For AWS template](https://github.com/OpenDocCN/freelearn-devops-pt5-zh/raw/master/docs/dop-22-tlkt/img/00077.jpeg)

Figure 13-4: Preview your changes screen from the Docker For AWS template

Click the *Update* button and relax. It’ll take a minute or two until the new server is created and joins the cluster. While waiting, we should explore a different method to accomplish the same result. Please open the *Auto-Scaling Groups Details* screen.

```
open "https://console.aws.amazon.com/ec2/autoscaling/home?#AutoScalingGroups:view=details"
```

You’ll be presented with the *Welcome to Auto Scaling* screen. Click the *Auto Scaling Groups: 2* link.
Select the item with the name starting with *devops22-NodeAsg*, click the *Actions* drop-down list, and select the *Edit* item. We’re looking for the *Desired* field located in the *details* tab. It can be changed to any value, and the number of workers would increase (or decrease) accordingly. We could do the same with the auto-scaling group associated with the manager nodes. Do not make any change. We’re almost finished with this chapter, and we already have more than enough nodes for the services we’re running. The knowledge that we can change the number of manager or worker nodes by changing the values in the auto-scaling groups is essential. Later on, we’ll combine that with the AWS API and Prometheus alerts to automate the process when certain conditions are met. The new worker node should be up-and-running by now unless you are a very fast reader. If that’s the case, go and grab a coffee. Let’s go back to the cluster and list the available nodes.

```
ssh -i devops22.pem docker@$CLUSTER_IP

docker node ls
```

The output is as follows (IDs are removed for brevity).

```
HOSTNAME                                     STATUS AVAILABILITY MANAGER STATUS
ip-172-31-2-119.us-east-2.compute.internal   Ready  Active
ip-172-31-32-225.us-east-2.compute.internal  Ready  Active       Leader
ip-172-31-10-207.us-east-2.compute.internal  Ready  Active       Reachable
ip-172-31-30-18.us-east-2.compute.internal   Ready  Active       Reachable
```

As you can see, a new node was added to the mix. Since it’s a worker, its manager status is empty. Your first thought might be that it is a simple process. After all, all that AWS did was create a new VM. That is right from AWS’s point of view, but there are a few other things that happened in the background. During VM initialization, it contacted DynamoDB to find out the address of the primary manager and the access token. Equipped with that info, it sent a request to that manager to join the cluster. From there on, the new node (in this case a worker) is available as part of the Swarm cluster.

![Figure 13-5: The process of increasing the number of worker nodes](https://github.com/OpenDocCN/freelearn-devops-pt5-zh/raw/master/docs/dop-22-tlkt/img/00078.jpeg)

Figure 13-5: The process of increasing the number of worker nodes

Let’s take a look at the `logging` stack and confirm that adding a new worker node accomplished the mission.

```
docker stack ps \
    -f desired-state=running logging
```

The output is as follows (IDs are removed for brevity).

```
NAME                    IMAGE                                                NODE                                         DESIRED STATE CURRENT STATE           ERROR PORTS
logging_logspout...     gliderlabs/logspout:v3.2.2                           ip-172-31-30-18.us-east-2.compute.internal   Running       Running 4 minutes ago
logging_logspout...     gliderlabs/logspout:v3.2.2                           ip-172-31-10-207.us-east-2.compute.internal  Running       Running 4 minutes ago
logging_logspout...     gliderlabs/logspout:v3.2.2                           ip-172-31-32-225.us-east-2.compute.internal  Running       Running 4 minutes ago
logging_logspout...     gliderlabs/logspout:v3.2.2                           ip-172-31-2-119.us-east-2.compute.internal   Running       Running 4 minutes ago
logging_elasticsearch.1 docker.elastic.co/elasticsearch/elasticsearch:5.5.2  ip-172-31-2-119.us-east-2.compute.internal   Running       Running 3 seconds ago
logging_kibana.1        docker.elastic.co/kibana/kibana:5.5.2                ip-172-31-30-18.us-east-2.compute.internal   Running       Running 28 seconds ago
logging_logstash.1      docker.elastic.co/logstash/logstash:5.5.2            ip-172-31-10-207.us-east-2.compute.internal  Running       Running 52 seconds ago
```

Since `logspout` is a global service, a new replica was created on the new node. More importantly, `elasticsearch` changed its current state from pending to running. Swarm detected that a worker was added to the cluster and deployed a replica of the `elasticsearch` service. Our whole production setup is up and running. The only thing left to do is to confirm that Kibana is indeed working as expected and that logs are shipped to ElasticSearch.

```
exit

open "http://$CLUSTER_DNS/app/kibana"
```

We exited the cluster and opened Kibana in a browser. Since we defined the `com.df.usersSecret` label, *Docker Flow Proxy* will not allow access to it without authentication. Please use *admin* as both the username and the password. The first time you open Kibana, you’ll be presented with the *Configure an index pattern* screen. Select *@timestamp* as the *Time Filter field name* and click the *Create* button. Kibana and the rest of the ELK stack are ready for use.

### What Now?

Our production cluster is up and running, and it already has most of the vertical services we’ll need. The next steps will build on top of that. We’ll explore the *Docker For AWS* features that make it self-heal and, later on, discuss how we can make it self-adapt as well. We explored how to update our cluster through the UI. That is useful as a way to learn what’s going on, but not that much if we’re planning to automate the process. Fortunately, everything that can be done through the UI can be accomplished through the AWS API. We’ll use it soon. The Docker folks did a great job with *Docker For AWS* and *Azure*. The result is fantastic. It is a very simple, yet very powerful tool in our belt. I hope you’re hosting your Swarm cluster in AWS or Azure (both behave almost the same). If you’re not, it will be very useful to use *Docker For AWS* for a while. That, together with this chapter and those that follow, should give you inspiration on how to create a cluster with your vendor of choice. Even though the resources will be different, the logic should be the same. The major difference is that you will have to roll up your sleeves and replicate many of the features you already saw. More is yet to come, so be prepared. I still recommend using Terraform for anything but a cluster running in AWS or Azure. It is the best tool for creating infrastructure resources. I wish Docker had chosen it as well instead of relying on the tools that are native to AWS and Azure. That would simplify the process of extending them to other vendors and foster contributions that would enable the same features elsewhere. On the other hand, providing a “native” experience like the one you saw in this chapter has its benefits. There’s nothing else left to say (until the next chapter). There’s no reason to pay for things you don’t use. We’ll destroy the cluster and take a break.
```
open "https://console.aws.amazon.com/cloudformation"
```

Select the *devops22* stack, click the *Actions* drop-down list, select the *Delete Stack* item, and click the *Yes, Delete* button. The cluster will be gone in a few minutes. Don’t worry. We’ll create a new one soon. Until then, you won’t be able to complain that I’m forcing you to incur unnecessary expenses.
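If you prefer doing this from the command line (and, by now, you know I do), the same result can most likely be achieved with the AWS CLI, assuming it is installed and configured with your credentials; we will set it up properly in the next chapter anyway. A minimal sketch:

```
# Delete the CloudFormation stack from the command line and wait until
# the deletion is finished.
aws cloudformation delete-stack --stack-name devops22

aws cloudformation wait stack-delete-complete --stack-name devops22
```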

Chapter 18: Self-Healing Applied To Infrastructure

We have already seen how Docker Swarm provides self-healing for services. If a replica of a service fails, it is rescheduled to one of the healthy nodes. Soon afterward, the desired number of replicas is running inside the cluster. We combined that capability with volumes attached to network drives so that the state of our stateful services is persisted.

Now it is time to explore how to accomplish the same self-healing capabilities for the infrastructure. We already know how to create a cluster based on the *Docker For AWS* template or its equivalent in *Azure*. If you are not using one of those hosting providers, we have already explored the essential features you should implement yourself.

Before we dive into self-healing applied to infrastructure, we need to create the cluster we destroyed at the end of the previous chapter. We will use this opportunity to explore how to accomplish the same result without the AWS UI.

### Automating Cluster Setup

The first thing we need to do is to get the AWS credentials.

Please open the *Amazon EC2 Console*, click your name in the top-right menu, and select *My Security Credentials*. You will see a screen with different types of credentials. Expand the *Access Keys (Access Key ID and Secret Access Key)* section and click the *Create New Access Key* button. Expand the *Show Access Key* section to see the keys.

You will not be able to see those keys again afterward, so this is your only chance to download the key file.

We will put the keys into environment variables that will be used by the *AWS Command Line Interface* (AWS CLI).

Please replace `[...]` with your keys before executing the commands that follow.

```
export AWS_ACCESS_KEY_ID=[...]

export AWS_SECRET_ACCESS_KEY=[...]

export AWS_DEFAULT_REGION=us-east-1
```
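As an optional sanity check (an aside on my part, assuming the AWS CLI and jq mentioned below are already installed), we can verify that the credentials and the region were picked up correctly.

```
# Confirm that the exported credentials are valid.
aws sts get-caller-identity

# List the availability zones of the chosen region; we want at least three.
aws ec2 describe-availability-zones \
    | jq -r ".AvailabilityZones[].ZoneName"
```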

You’re free to change the region to any that suits you as long as it has at least three availability zones. Before we proceed, please make sure that [AWS CLI](https://aws.amazon.com/cli/) and [jq](https://stedolan.github.io/jq/download/) are installed. We’ll use the *CLI* to communicate with AWS and *jq* to format and filter JSON output returned by the *CLI*. ![Figure 14-1: AWS Command Line Interface screen](https://github.com/OpenDocCN/freelearn-devops-pt5-zh/raw/master/docs/dop-22-tlkt/img/00079.jpeg) Figure 14-1: AWS Command Line Interface screen Let’s take a look at the *Docker For AWS* template. ``` `1` curl https://editions-us-east-1.s3.amazonaws.com/aws/stable/Docker.tmpl `\` `2 ` `|` jq `"."` ``` `````````````````````````````````````````````````` The output is vast, and we won’t have time to go through the details of all the services it defines. Instead, we’ll focus on the parameters since they have to be specified during the execution of the template. Let’s take another look at the template but, this time, limited to the `.Metadata` section. ``` `1` curl https://editions-us-east-1.s3.amazonaws.com/aws/stable/Docker.tmpl `\` `2 ` `|` jq `".Metadata"` ``` ````````````````````````````````````````````````` We output the template and used `jq` to filter the result. The output is as follows. ``` `1` `{` `2` `"AWS::CloudFormation::Interface"``:` `{` `3` `"ParameterGroups"``:` `[` `4` `{` `5` `"Label"``:` `{` `6` `"default"``:` `"Swarm Size"` `7` `},` `8` `"Parameters"``:` `[` `9` `"ManagerSize"``,` `10 ` `"ClusterSize"` `11 ` `]` `12 ` `},` `13 ` `{` `14 ` `"Label"``:` `{` `15 ` `"default"``:` `"Swarm Properties"` `16 ` `},` `17 ` `"Parameters"``:` `[` `18 ` `"KeyName"``,` `19 ` `"EnableSystemPrune"``,` `20 ` `"EnableCloudWatchLogs"``,` `21 ` `"EnableCloudStorEfs"` `22 ` `]` `23 ` `},` `24 ` `{` `25 ` `"Label"``:` `{` `26 ` `"default"``:` `"Swarm Manager Properties"` `27 ` `},` `28 ` `"Parameters"``:` `[` `29 ` `"ManagerInstanceType"``,` `30 ` `"ManagerDiskSize"``,` `31 ` `"ManagerDiskType"` `32 ` `]` `33 ` `},` `34 ` `{` `35 ` `"Label"``:` `{` `36 ` `"default"``:` `"Swarm Worker Properties"` `37 ` `},` `38 ` `"Parameters"``:` `[` `39 ` `"InstanceType"``,` `40 ` `"WorkerDiskSize"``,` `41 ` `"WorkerDiskType"` `42 ` `]` `43 ` `}` `44 ` `],` `45 ` `"ParameterLabels"``:` `{` `46 ` `"ClusterSize"``:` `{` `47 ` `"default"``:` `"Number of Swarm worker nodes?"` `48 ` `},` `49 ` `"EnableCloudStorEfs"``:` `{` `50 ` `"default"``:` `"Create EFS prerequsities for CloudStor?"` `51 ` `},` `52 ` `"EnableCloudWatchLogs"``:` `{` `53 ` `"default"``:` `"Use Cloudwatch for container logging?"` `54 ` `},` `55 ` `"EnableSystemPrune"``:` `{` `56 ` `"default"``:` `"Enable daily resource cleanup?"` `57 ` `},` `58 ` `"InstanceType"``:` `{` `59 ` `"default"``:` `"Agent worker instance type?"` `60 ` `},` `61 ` `"KeyName"``:` `{` `62 ` `"default"``:` `"Which SSH key to use?"` `63 ` `},` `64 ` `"ManagerDiskSize"``:` `{` `65 ` `"default"``:` `"Manager ephemeral storage volume size?"` `66 ` `},` `67 ` `"ManagerDiskType"``:` `{` `68 ` `"default"``:` `"Manager ephemeral storage volume type"` `69 ` `},` `70 ` `"ManagerInstanceType"``:` `{` `71 ` `"default"``:` `"Swarm manager instance type?"` `72 ` `},` `73 ` `"ManagerSize"``:` `{` `74 ` `"default"``:` `"Number of Swarm managers?"` `75 ` `},` `76 ` `"WorkerDiskSize"``:` `{` `77 ` `"default"``:` `"Worker ephemeral storage volume size?"` `78 ` `},` `79 ` `"WorkerDiskType"``:` `{` `80 ` `"default"``:` `"Worker ephemeral storage volume type"` `81 ` `}` `82 ` `}` `83 ` `}` `84` 
`}` ``` ```````````````````````````````````````````````` You should be familiar with those parameters. They are the same as those you saw when you created the cluster through AWS UI. Now we are ready to create a cluster. ``` `1` aws cloudformation create-stack `\` `2` --template-url https://editions-us-east-1.s3.amazonaws.com/aws/stable/Docker`\` `3` .tmpl `\` `4` --capabilities CAPABILITY_IAM `\` `5` --stack-name devops22 `\` `6` --parameters `\` `7` `ParameterKey``=`ManagerSize,ParameterValue`=``3` `\` `8` `ParameterKey``=`ClusterSize,ParameterValue`=``0` `\` `9` `ParameterKey``=`KeyName,ParameterValue`=`devops22 `\` `10 ` `ParameterKey``=`EnableSystemPrune,ParameterValue`=`yes `\` `11 ` `ParameterKey``=`EnableCloudWatchLogs,ParameterValue`=`no `\` `12 ` `ParameterKey``=`EnableCloudStorEfs,ParameterValue`=`yes `\` `13 ` `ParameterKey``=`ManagerInstanceType,ParameterValue`=`t2.micro `\` `14 ` `ParameterKey``=`InstanceType,ParameterValue`=`t2.micro ``` ``````````````````````````````````````````````` We named the stack `devops22` and used the parameters to set the number of managers (`3`) and workers (`0`) and SSH key (`devops22`). We enabled prune and EFS, and disabled *CloudWatch*. This time we used `t2.micro` instances. We won’t deploy many services so 1 vCPU and 1GB of memory should be more than enough. At the same time, `t2.micro` is *free tier eligible* making it a perfect instance type for the exercises in this chapter. We can use `aws cloudformation` command to list the resources defined in the stack and see their current status. ``` `1` aws cloudformation describe-stack-resources `\` `2 ` --stack-name devops22 `|` jq `"."` ``` `````````````````````````````````````````````` The output is too big to be listed here so we’ll move on. Our immediate goal is to find out the status of the stack and confirm that it was created successfully before we SSH into it. We can describe a stack through the `aws cloudformation describe-stacks` command. ``` `1` aws cloudformation describe-stacks `\` `2 ` --stack-name devops22 `|` jq `"."` ``` ````````````````````````````````````````````` We retrieved the description of the `devops22` stack. The output is as follows. 
``` `1` `{` `2` `"Stacks"``:` `[` `3` `{` `4` `"StackId"``:` `"arn:aws:cloudformation:us-east-2:036548781187:stack/devops22/b\` `5` `f859420-99f1-11e7-92af-50a68a26e835"``,` `6` `"Description"``:` `"Docker CE for AWS 17.06.1-ce (17.06.1-ce-aws1)"``,` `7` `"Parameters"``:` `[` `8` `{` `9` `"ParameterValue"``:` `"yes"``,` `10 ` `"ParameterKey"``:` `"EnableCloudStorEfs"` `11 ` `},` `12 ` `{` `13 ` `"ParameterValue"``:` `"devops22"``,` `14 ` `"ParameterKey"``:` `"KeyName"` `15 ` `},` `16 ` `{` `17 ` `"ParameterValue"``:` `"t2.micro"``,` `18 ` `"ParameterKey"``:` `"ManagerInstanceType"` `19 ` `},` `20 ` `{` `21 ` `"ParameterValue"``:` `"0"``,` `22 ` `"ParameterKey"``:` `"ClusterSize"` `23 ` `},` `24 ` `{` `25 ` `"ParameterValue"``:` `"standard"``,` `26 ` `"ParameterKey"``:` `"ManagerDiskType"` `27 ` `},` `28 ` `{` `29 ` `"ParameterValue"``:` `"20"``,` `30 ` `"ParameterKey"``:` `"WorkerDiskSize"` `31 ` `},` `32 ` `{` `33 ` `"ParameterValue"``:` `"20"``,` `34 ` `"ParameterKey"``:` `"ManagerDiskSize"` `35 ` `},` `36 ` `{` `37 ` `"ParameterValue"``:` `"standard"``,` `38 ` `"ParameterKey"``:` `"WorkerDiskType"` `39 ` `},` `40 ` `{` `41 ` `"ParameterValue"``:` `"yes"``,` `42 ` `"ParameterKey"``:` `"EnableSystemPrune"` `43 ` `},` `44 ` `{` `45 ` `"ParameterValue"``:` `"no"``,` `46 ` `"ParameterKey"``:` `"EnableCloudWatchLogs"` `47 ` `},` `48 ` `{` `49 ` `"ParameterValue"``:` `"t2.small"``,` `50 ` `"ParameterKey"``:` `"InstanceType"` `51 ` `},` `52 ` `{` `53 ` `"ParameterValue"``:` `"3"``,` `54 ` `"ParameterKey"``:` `"ManagerSize"` `55 ` `}` `56 ` `],` `57 ` `"Tags"``:` `[],` `58 ` `"CreationTime"``:` `"2017-09-15T08:47:10.306Z"``,` `59 ` `"Capabilities"``:` `[` `60 ` `"CAPABILITY_IAM"` `61 ` `],` `62 ` `"StackName"``:` `"devops22"``,` `63 ` `"NotificationARNs"``:` `[],` `64 ` `"StackStatus"``:` `"CREATE_IN_PROGRESS"``,` `65 ` `"DisableRollback"``:` `false` `66 ` `}` `67 ` `]` `68` `}` ``` ```````````````````````````````````````````` Most of the description reflects the parameters we used to create the stack. The value we’re interested in is `StackStatus`. In my case, it is set to `CREATE_IN_PROGRESS` meaning that the cluster is still not ready. We should wait for a while and query the status again. This time, we’ll use `jq` to limit the output only to the `StackStatus` field. ``` `1` aws cloudformation describe-stacks `\` `2 ` --stack-name devops22 `|` `\` `3 ` jq -r `".Stacks[0].StackStatus"` ``` ``````````````````````````````````````````` If the output of the command is `CREATE_COMPLETE`, the cluster is created, and we can move on. Otherwise, please wait for a bit more and recheck the status. It should take around ten minutes to create the whole stack. Now that the cluster is created, we need to get the DNS and the IP of one of the masters. Cluster DNS is available through the `Outputs` section of the stack description. ``` `1` aws cloudformation describe-stacks `\` `2 ` --stack-name devops22 `|` `\` `3 ` jq -r `".Stacks[0].Outputs"` ``` `````````````````````````````````````````` The output is as follows. ``` `1` [ `2` { `3` "Description": "Use this name to update your DNS records", `4` "OutputKey": "DefaultDNSTarget", `5` "OutputValue": "devops22-ExternalL-EEU3J540N4S0-1231273358.us-east-2.elb.ama\ `6` zonaws.com" `7` }, `8` { `9` "Description": "Availabilty Zones Comment", `10 ` "OutputKey": "ZoneAvailabilityComment", `11 ` "OutputValue": "This region has at least 3 Availability Zones (AZ). This is \ `12` ideal to ensure a fully functional Swarm in case you lose an AZ." 
`13 ` }, `14 ` { `15 ` "Description": "You can see the manager nodes associated with this cluster h\ `16` ere. Follow the instructions here: https://docs.docker.com/docker-for-aws/deploy\ `17` /", `18 ` "OutputKey": "Managers", `19 ` "OutputValue": "https://us-east-2.console.aws.amazon.com/ec2/v2/home?region=\ `20` us-east-2#Instances:tag:aws:autoscaling:groupName=devops22-ManagerAsg-RA4ECZRYJ3\ `21` 7C;sort=desc:dnsName" `22 ` }, `23 ` { `24 ` "Description": "Use this as the VPC for configuring Private Hosted Zones", `25 ` "OutputKey": "VPCID", `26 ` "OutputValue": "vpc-99311ff0" `27 ` }, `28 ` { `29 ` "Description": "SecurityGroup ID of NodeVpcSG", `30 ` "OutputKey": "NodeSecurityGroupID", `31 ` "OutputValue": "sg-0d852c65" `32 ` }, `33 ` { `34 ` "Description": "Use this zone ID to update your DNS records", `35 ` "OutputKey": "ELBDNSZoneID", `36 ` "OutputValue": "Z3AADJGX6KTTL2" `37 ` }, `38 ` { `39 ` "Description": "SecurityGroup ID of ManagerVpcSG", `40 ` "OutputKey": "ManagerSecurityGroupID", `41 ` "OutputValue": "sg-4c832a24" `42 ` }, `43 ` { `44 ` "Description": "SecurityGroup ID of SwarmWideSG", `45 ` "OutputKey": "SwarmWideSecurityGroupID", `46 ` "OutputValue": "sg-aa852cc2" `47 ` } `48` ] ``` ````````````````````````````````````````` What we need is the `DefaultDNSTarget` value. We’ll have to refine our `jq` filters a bit more. ``` `1` aws cloudformation describe-stacks `\` `2 ` --stack-name devops22 `|` `\` `3 ` jq -r `".Stacks[0].Outputs[] | \` `4 `` select(.OutputKey==\"DefaultDNSTarget\")\` `5 `` .OutputValue"` ``` ```````````````````````````````````````` We used jq’s `select` statement to retrieve only the section with `OutputKey` set to `DefaultDNSTarget` and retrieved the `OutputValue`. The output should be similar to the one that follows. ``` `1` devops22-ExternalL-EEU3J540N4S0-1231273358.us-east-2.elb.amazonaws.com ``` ``````````````````````````````````````` We should store the output of the previous command as an environment variable so that we can have it at hand if we need to open one of the services in a browser or, even better, to set it as the address of our domain. ``` `1` `CLUSTER_DNS``=``$(`aws cloudformation `\` `2 ` describe-stacks `\` `3 ` --stack-name devops22 `|` `\` `4 ` jq -r `".Stacks[0].Outputs[] | \` `5 `` select(.OutputKey==\"DefaultDNSTarget\")\` `6 `` .OutputValue"``)` ``` `````````````````````````````````````` Even though we will not need the DNS in this chapter, it’s good to know how to retrieve it. We’ll need it later on in the chapters that follow. The only thing left is to get the public IP of one of the managers. We can use `aws ec2 describe-instances` command to list all the EC2 instances running in the region. ``` `1` aws ec2 describe-instances `|` jq -r `"."` ``` ````````````````````````````````````` The output is too big to be presented in a book. You should see three instances if the cluster we just created is the only one running in your region. Otherwise, there might be others. Since we do not want to risk retrieving anything but managers that belong to the `devops22` stack, we’ll refine the command to the one that follows. ``` `1` aws ec2 describe-instances `|` `\` `2 ` jq -r `".Reservations[].Instances[] \` `3 `` | select(.SecurityGroups[].GroupName \` `4 `` | contains(\"devops22-ManagerVpcSG\"))\` `5 `` .PublicIpAddress"` ``` ```````````````````````````````````` We used `jq` to filter the output and limit the results only to the instances attached to the security group with the name that starts with `devops22-ManagerVpcSG`. 
Further on, we retrieved the `PublicIpAddress` values. The output is as follows.

```
52.14.246.52
13.59.130.67
13.59.132.147
```

Those three IPs belong to the three managers that form the cluster. We’ll use the previous command to set the environment variable `CLUSTER_IP`.

```
CLUSTER_IP=$(aws ec2 describe-instances \
    | jq -r ".Reservations[] \
    .Instances[] \
    | select(.SecurityGroups[].GroupName \
    | contains(\"devops22-ManagerVpcSG\"))\
    .PublicIpAddress" \
    | tail -n 1)
```

Since we needed only one of the IPs, we piped the result of the `describe-instances` command to `tail`, which limited the output to a single line. Now that we have both the DNS and the IP of one of the managers, we can enter the cluster and confirm that all the nodes joined it.

```
ssh -i devops22.pem docker@$CLUSTER_IP

docker node ls
```

The output of the `node ls` command is as follows (IDs are removed for brevity).

```
HOSTNAME                                     STATUS AVAILABILITY MANAGER STATUS
ip-172-31-21-57.us-east-2.compute.internal   Ready  Active       Reachable
ip-172-31-44-182.us-east-2.compute.internal  Ready  Active       Reachable
ip-172-31-15-30.us-east-2.compute.internal   Ready  Active       Leader
```

As expected, all three nodes joined the cluster, and we can explore self-healing applied to infrastructure through the AWS services created with the *Docker For AWS* template.

### Exploring Fault Tolerance

Since we are exploring self-healing (not self-adaptation), there’s no need to deploy all the stacks we used thus far. A single service will be enough to explore what happens when a node goes down. Our cluster, formed out of `t2.micro` instances, would not support much more anyway.

```
docker service create --name test \
    --replicas 10 alpine sleep 1000000
```

We created a service with ten replicas. Let’s confirm that they are spread across the three nodes of the cluster.

```
docker service ps test
```

The output is as follows (IDs are removed for brevity).

```
NAME    IMAGE         NODE                                         DESIRED STATE CURRENT STATE          ERROR PORTS
test.1  alpine:latest ip-172-31-44-182.us-east-2.compute.internal  Running       Running 12 seconds ago
test.2  alpine:latest ip-172-31-15-30.us-east-2.compute.internal   Running       Running 12 seconds ago
test.3  alpine:latest ip-172-31-21-57.us-east-2.compute.internal   Running       Running 12 seconds ago
test.4  alpine:latest ip-172-31-44-182.us-east-2.compute.internal  Running       Running 12 seconds ago
test.5  alpine:latest ip-172-31-15-30.us-east-2.compute.internal   Running       Running 12 seconds ago
test.6  alpine:latest ip-172-31-21-57.us-east-2.compute.internal   Running       Running 12 seconds ago
test.7  alpine:latest ip-172-31-15-30.us-east-2.compute.internal   Running       Running 12 seconds ago
test.8  alpine:latest ip-172-31-21-57.us-east-2.compute.internal   Running       Running 12 seconds ago
test.9  alpine:latest ip-172-31-15-30.us-east-2.compute.internal   Running       Running 12 seconds ago
test.10 alpine:latest ip-172-31-44-182.us-east-2.compute.internal  Running       Running 12 seconds ago
```

Let’s exit the cluster before we move on to a discussion of how to simulate the failure of a node.
``` `1` `exit` ``` ```````````````````````````` We’ll simulate failure of an instance by terminating it. We’ll do that by executing `aws ec2 terminate-instances` command that requires `--instance-ids` argument. So, the first line of business is to figure out how to find ID of one of the nodes. We already saw that we could use `aws ec2 describe-instances` command to get information about the instances of the cluster. This time we’ll output `InstanceId` of all the nodes that belong to the security group used by managers. ``` `1` aws ec2 describe-instances `\` `2 ` `|` jq -r `".Reservations[] \` `3 `` .Instances[] \` `4 `` | select(.SecurityGroups[].GroupName \` `5 `` | contains(\"devops22-ManagerVpcSG\"))\` `6 `` .InstanceId"` ``` ``````````````````````````` The output is as follows. ``` `1` i-091ad925d0243f7ab `2` i-0e850f3073ec25acd `3` i-05b25bc6fb6730ce1 ``` `````````````````````````` We’ll repeat the same command, limit the output to only one row, and store the result as an environment variable. ``` `1` `INSTANCE_ID``=``$(`aws ec2 describe-instances `\` `2 ` `|` jq -r `".Reservations[] \` `3 `` .Instances[] \` `4 `` | select(.SecurityGroups[].GroupName \` `5 `` | contains(\"devops22-ManagerVpcSG\"))\` `6 `` .InstanceId"` `\` `7 ` `|` tail -n `1``)` ``` ````````````````````````` Now that we have the ID, we can terminate the instance associated with it. ``` `1` aws ec2 terminate-instances `\` `2 ` --instance-ids `$INSTANCE_ID` ``` ```````````````````````` The output is as follows. ``` `1` { `2` "TerminatingInstances": [ `3` { `4` "InstanceId": "i-0fa78489dca8125e8", `5` "CurrentState": { `6` "Code": 32, `7` "Name": "shutting-down" `8` }, `9` "PreviousState": { `10 ` "Code": 16, `11 ` "Name": "running" `12 ` } `13 ` } `14 ` ] `15` } ``` ``````````````````````` We can see that the previous state is `running` and that it changed to `shutting-down`. Let’s see the state of the instances that form the cluster. ``` `1` aws ec2 describe-instances `\` `2 ` `|` jq -r `".Reservations[] \` `3 `` .Instances[] \` `4 `` | select(.SecurityGroups[].GroupName \` `5 `` | contains(\"devops22-ManagerVpcSG\"))\` `6 `` .State.Name"` ``` `````````````````````` We retrieved statuses of all the manager instances attached to the security group with a name starting with `devops22-ManagerVpcSG`. The output is as follows. ``` `1` running `2` running ``` ````````````````````` There are two manager instances in the cluster, and both are `running`. The node was indeed removed, and we are one server short from the desired setup. Let’s wait for a moment or two and take another look at the manager instances. ``` `1` aws ec2 describe-instances `\` `2 ` `|` jq -r `".Reservations[] \` `3 `` .Instances[] \` `4 `` | select(.SecurityGroups[].GroupName \` `5 `` | contains(\"devops22-ManagerVpcSG\"))\` `6 `` .State.Name"` ``` ```````````````````` This time the output is different. ``` `1` pending `2` running `3` running ``` ``````````````````` Besides the two running managers, the third was added and is currently pending. The auto-scaling group associated with the managers detected that one node is missing and started creating a new VM that will restore the cluster to the desired state. The new node is still not ready, so we’ll need to wait for a while longer. ``` `1` aws ec2 describe-instances `\` `2 ` `|` jq -r `".Reservations[] \` `3 `` .Instances[] \` `4 `` | select(.SecurityGroups[].GroupName \` `5 `` | contains(\"devops22-ManagerVpcSG\"))\` `6 `` .State.Name"` ``` `````````````````` The output is as follows. 
``` `1` running `2` running `3` running ``` ````````````````` Auto-scaling group’s desired state was restored, and the cluster is operating at its full capacity. We cannot be certain whether the node we destroyed is different than the one we were entering before. Therefore, we should retrieve IP of one of the nodes one more time, and place it in the environment variable `CLUSTER_IP`. ``` `1` `CLUSTER_IP``=``$(`aws ec2 describe-instances `\` `2 ` `|` jq -r `".Reservations[] \` `3 `` .Instances[] \` `4 `` | select(.SecurityGroups[].GroupName \` `5 `` | contains(\"devops22-ManagerVpcSG\"))\` `6 `` .PublicIpAddress"` `\` `7 ` `|` tail -n `1``)` ``` ```````````````` Even though we know that the new node was created automatically, we should still confirm that it also joined the cluster as a Swarm manager. ``` `1` ssh -i devops22.pem docker@`$CLUSTER_IP` `2` `3` docker node ls ``` ``````````````` We entered into one of the managers and listed all the nodes. The output is as follows (IDs are removed for brevity). ``` `1` HOSTNAME STATUS AVAILABILITY MANAGER STATUS `2` ip-172-31-21-57.us-east-2.compute.internal Down Active Unreachable `3` ip-172-31-44-182.us-east-2.compute.internal Ready Active Reachable `4` ip-172-31-15-30.us-east-2.compute.internal Ready Active Leader ``` `````````````` If the output of the `node ls` command is `Error response from daemon: This node is not a swarm manager...`, it means that you entered the node that was just created and it did not yet join the cluster. If that’s the case, all you have to do is wait for a while longer and try it again. I’ll assume that you entered to one of the “old” nodes. The new node is not there. We can see only the three nodes that were initially created. One of them is `unreachable`. Does that mean that the system does not work? Is self-healing working only partially and we need to join the new node manually? Should we create a script that will join new nodes to the cluster? The answer to all those questions is *no*. We were too impatient. Even though AWS reported that the new node is running, it still requires a bit more time until it is fully initialized. Once that is finished, and the VM is fully operational, Docker’s system containers will run and automatically join the node to the cluster. Let’s wait for a few moments and list the nodes one more time. ``` `1` docker node ls ``` ````````````` The output is as follows. ``` `1` HOSTNAME STATUS AVAILABILITY MANAGER STATUS `2` ip-172-31-26-141.us-east-2.compute.internal Ready Active Reachable `3` ip-172-31-21-57.us-east-2.compute.internal Down Active Unreachable `4` ip-172-31-44-182.us-east-2.compute.internal Ready Active Reachable `5` ip-172-31-15-30.us-east-2.compute.internal Ready Active Leader ``` ```````````` The new node joined the cluster. Now we have four nodes, with one of them `unreachable`. Swarm cannot know that we destroyed the node. All it does know is that one manager is not reachable. That might be due to many reasons besides destruction. The unreachable node will be removed from the list after a while. Let’s see what happened to the replicas of the `test` service. ``` `1` docker service ps `test` ``` ``````````` The output is as follows (IDs are removed for brevity). 
``` `1` NAME IMAGE NODE DESIRED STA\ `2` TE CURRENT STATE ERROR PORTS `3` test.1 alpine:latest ip-172-31-44-182.us-east-2.compute.internal Running \ `4` Running 10 minutes ago `5` test.2 alpine:latest ip-172-31-15-30.us-east-2.compute.internal Running \ `6` Running 10 minutes ago `7` test.3 alpine:latest ip-172-31-15-30.us-east-2.compute.internal Running \ `8` Running 4 minutes ago `9` \_ test.3 alpine:latest ip-172-31-21-57.us-east-2.compute.internal Shutdown \ `10 ` Running 4 minutes ago `11` test.4 alpine:latest ip-172-31-44-182.us-east-2.compute.internal Running \ `12 ` Running 10 minutes ago `13` test.5 alpine:latest ip-172-31-15-30.us-east-2.compute.internal Running \ `14 ` Running 10 minutes ago `15` test.6 alpine:latest ip-172-31-44-182.us-east-2.compute.internal Running \ `16 ` Running 4 minutes ago `17 ` \_ test.6 alpine:latest ip-172-31-21-57.us-east-2.compute.internal Shutdown \ `18 ` Running 4 minutes ago `19` test.7 alpine:latest ip-172-31-15-30.us-east-2.compute.internal Running \ `20 ` Running 10 minutes ago `21` test.8 alpine:latest ip-172-31-44-182.us-east-2.compute.internal Running \ `22 ` Running 4 minutes ago `23 ` \_ test.8 alpine:latest ip-172-31-21-57.us-east-2.compute.internal Shutdown \ `24 ` Running 4 minutes ago `25` test.9 alpine:latest ip-172-31-15-30.us-east-2.compute.internal Running \ `26 ` Running 10 minutes ago `27` test.10 alpine:latest ip-172-31-44-182.us-east-2.compute.internal Running \ `28 ` Running 10 minutes ago ``` `````````` When Swarm detected that one of the nodes is unreachable, it rescheduled replicas that were running there to the nodes that were healthy at the time. It did not wait for the new node to join the cluster. Swarm cannot know whether we (or auto-scaling groups, or any other process) will restore the infrastructure to the desired state. What if we removed the node purposefully and had no intention to add a new one in its place? Even if Swarm would be confident that a new node will be added to the cluster, it would still not make sense to wait for it. Creating a new node is a costly operation. It takes too much time. Therefore, as soon as Swarm detected that some of the replicas are not running (those from the failed node), it rescheduled them to the other two nodes. As a result, the third node is currently empty. It will start getting replicas the next time we deploy something or update one of the existing services. Let’s try it out. We’ll update our test service. ``` `1` docker service update `\` `2 ` --env-add `"FOO=BAR"` `test` ``` ````````` Before we take a look at the service processes (or tasks), we should give Swarm a bit of time to perform rolling update to all the replicas. After a moment or two, we can execute `docker service ps` and discuss the result ``` `1` docker service ps `\` `2 ` -f desired-state`=`running `test` ``` ```````` The output is as follows (IDs are removed for brevity). 
``` `1` NAME IMAGE NODE DESIRED STATE \ `2` CURRENT STATE ERROR PORTS `3` test.1 alpine:latest ip-172-31-44-182.us-east-2.compute.internal Running \ `4` Running about a minute ago `5` test.2 alpine:latest ip-172-31-15-30.us-east-2.compute.internal Running \ `6` Running 32 seconds ago `7` test.3 alpine:latest ip-172-31-15-30.us-east-2.compute.internal Running \ `8` Running about a minute ago `9` test.4 alpine:latest ip-172-31-26-141.us-east-2.compute.internal Running \ `10` Running about a minute ago `11` test.5 alpine:latest ip-172-31-26-141.us-east-2.compute.internal Running \ `12` Running about a minute ago `13` test.6 alpine:latest ip-172-31-44-182.us-east-2.compute.internal Running \ `14` Running 55 seconds ago `15` test.7 alpine:latest ip-172-31-26-141.us-east-2.compute.internal Running \ `16` Running 20 seconds ago `17` test.8 alpine:latest ip-172-31-26-141.us-east-2.compute.internal Running \ `18` Running about a minute ago `19` test.9 alpine:latest ip-172-31-15-30.us-east-2.compute.internal Running \ `20` Running 43 seconds ago `21` test.10 alpine:latest ip-172-31-44-182.us-east-2.compute.internal Running \ `22` Running 8 seconds ago ``` ``````` Since containers are immutable, any update of a service always results in a rolling update process that replaces all the replicas. You’ll notice that, this time, they are spread across all the nodes of the cluster, including the new one. ``` `1` `exit` ``` `````` Self-healing applied to infrastructure works! We closed the circle. Swarm makes sure that our services are (almost) always in the desired state. With *Docker For AWS*, we accomplished a similar behavior with nodes. The reason why over 50% of managers must be operational at any given moment lies in the Raft protocol that synchronizes data. Every piece of information is propagated to all the managers. An action is performed only if the majority agrees. That way we can guarantee data integrity. There is no majority if half or more members are absent. You might be compelled to create clusters with five managers as a way to decrease chances of a complete cluster meltdown if two managers fail at the same time. In some cases that is a good strategy. However, the chances that two managers running in separate availability zones will go down at the same time are very slim. Don’t take this advice as a commandment. You should experiment with both approaches and make your own decision. I tend to run all my clusters smaller than ten nodes with three managers. When they are bigger, five is a good number. You might go even further and opt for seven managers. The more, the better. Right? Wrong! Data synchronization between managers is a costly operation. The more managers, the more time is required until a consensus is reached. Seven managers often produce more overhead than benefit. ### What Now? We proved that self-healing works not only with services but also with infrastructure. We are getting close to having a self-sufficient system. The only thing missing is to find out a way to add self-adaptation applied to infrastructure. If we accomplish that, we’ll be able to leave our system alone. We can go on vacation knowing that it will be operational without us. We could even go to one of those exotic places that still do not have the Internet. Wouldn’t that be great? Even though we are one step closer to our goal, we are still not there yet. We’ll take another break before moving on. We’ll continue the practice from previous chapters. We’ll destroy the cluster and save us from unnecessary cost. 
``` `1` aws cloudformation delete-stack `\` `2 ` --stack-name devops22 `3` `4` aws cloudformation describe-stacks `\` `5 ` --stack-name devops22 `|` `\` `6 ` jq -r `".Stacks[0].StackStatus"` ``` ````` The output of the `describe-stacks` command is as follows. ``` `1` DELETE_IN_PROGRESS ``` ```` The cluster will be removed soon. Feel free to repeat the command if you don't trust the system and want to see it through. You'll know that the cluster is fully removed when you see the error output that follows. ``` `1` An error occurred (ValidationError) when calling the DescribeStacks operation: S\ `2` tack with id devops22 does not exist ```
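If you would rather not re-run that check by hand, a small polling loop can do the waiting for you. The snippet below is just a convenience sketch (it is not part of the original instructions); it relies on `describe-stacks` returning a non-zero exit code once the stack no longer exists.

```
# Poll CloudFormation until the devops22 stack is gone (optional convenience sketch)
while aws cloudformation describe-stacks \
    --stack-name devops22 >/dev/null 2>&1; do
    echo "Stack devops22 still exists. Waiting..."
    sleep 30
done

echo "Stack devops22 was deleted."
```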

Chapter 19: Self-Adaptation Applied To Infrastructure

Our goal is within reach. We adopted a scheduler (Docker Swarm, in this case) that provides self-healing for our services. We saw that *Docker For AWS* accomplishes a similar objective at the infrastructure level. We used Prometheus, Alertmanager, and Jenkins to build a system that adapts automatically to ever-changing conditions. The metrics we store in Prometheus are collected through exporters as well as through instrumentation added to our services. The only thing missing is self-adaptation applied to infrastructure. If we manage to accomplish that, we will close the loop and witness a system that is self-sufficient, with hardly any human intervention.

The logic behind self-adaptation applied to infrastructure is mostly the same as the one we used with our services. We need metrics, alerts, and scripts that will adjust the capacity of the cluster automatically when conditions change.

We already have all the tools we need. Prometheus will keep collecting metrics and firing alerts. Alertmanager is still an excellent choice for receiving those alerts and forwarding them to different parts of the system. We will continue using Jenkins as the tool that allows us to quickly script interactions with the system. Since we are using AWS to host the cluster, Jenkins will need to interact with the AWS API.

We are so close to the final goal that I feel we should skip the theory and jump straight into the hands-on part of the chapter. So, without further ado, we will create our cluster once again.

### Creating The Cluster

We already explored how to create the cluster without the UI in the previous chapter. The commands that follow should be familiar and, hopefully, will not require much explanation.

Please replace `[...]` with your keys before executing the commands that follow.

 `1` `export` `AWS_ACCESS_KEY_ID``=[`...`]`
 `2` 
 `3` `export` `AWS_SECRET_ACCESS_KEY``=[`...`]`
 `4` 
 `5` `export` `AWS_DEFAULT_REGION``=`us-east-1
 `6` 
 `7` `export` `STACK_NAME``=`devops22
 `8` 
 `9` `export` `KEY_NAME``=`devops22
`10` 
`11` aws cloudformation create-stack `\`
`12 `    --template-url https://editions-us-east-1.s3.amazonaws.com/aws/stable/Docker`\`
`13` .tmpl `\`
`14 `    --capabilities CAPABILITY_IAM `\`
`15 `    --stack-name `$STACK_NAME` `\`
`16 `    --parameters `\`
`17 `    `ParameterKey``=`ManagerSize,ParameterValue`=``3` `\`
`18 `    `ParameterKey``=`ClusterSize,ParameterValue`=``0` `\`
`19 `    `ParameterKey``=`KeyName,ParameterValue`=``$KEY_NAME` `\`
`20 `    `ParameterKey``=`EnableSystemPrune,ParameterValue`=`yes `\`
`21 `    `ParameterKey``=`EnableCloudWatchLogs,ParameterValue`=`no `\`
`22 `    `ParameterKey``=`EnableCloudStorEfs,ParameterValue`=`yes `\`
`23 `    `ParameterKey``=`ManagerInstanceType,ParameterValue`=`t2.small `\`
`24 `    `ParameterKey``=`InstanceType,ParameterValue`=`t2.small 

We defined a few environment variables and executed the `aws cloudformation create-stack` command that initiated creation of a cluster. It should take around five to ten minutes until it is finished. ``` `1` aws cloudformation describe-stacks `\` `2 ` --stack-name `$STACK_NAME` `|` `\` `3 ` jq -r `".Stacks[0].StackStatus"` ``` ``````````````````````````````````````````````````````````````````````````````````````````````````````````````````` If the output of the `describe-stacks` command is `CREATE_COMPLETE`, our cluster is fully operational, and we can continue. Otherwise, please wait for a while longer and recheck the stack status. Next, we’ll retrieve cluster DNS and public IP of one of the manager nodes and store those values as environment variables `CLUSTER_DNS` and `CLUSTER_IP`. ``` `1` `CLUSTER_DNS``=``$(`aws cloudformation `\` `2` describe-stacks `\` `3` --stack-name `$STACK_NAME` `|` `\` `4` jq -r `".Stacks[0].Outputs[] | \` `5`` select(.OutputKey==\"DefaultDNSTarget\")\` `6`` .OutputValue"``)` `7` `8` `CLUSTER_IP``=``$(`aws ec2 describe-instances `\` `9` `|` jq -r `".Reservations[] \` `10 `` .Instances[] \` `11 `` | select(.SecurityGroups[].GroupName \` `12 `` | contains(\"``$STACK_NAME``-ManagerVpcSG\"))\` `13 `` .PublicIpAddress"` `\` `14 ` `|` tail -n `1``)` ``` `````````````````````````````````````````````````````````````````````````````````````````````````````````````````` Once we enter the cluster, we’ll create a file that will hold the environment variables we’ll need inside the cluster. Those are the same variables we already defined on our host. We’ll output them so that we can easily copy and paste them when we enter one of the nodes. ``` `1` `echo` `"` `2` `export CLUSTER_DNS=``$CLUSTER_DNS` ````` `3` `export AWS_ACCESS_KEY_ID=``$AWS_ACCESS_KEY_ID` ```` `4` `export AWS_SECRET_ACCESS_KEY=``$AWS_SECRET_ACCESS_KEY` ``` `5` `export AWS_DEFAULT_REGION=``$AWS_DEFAULT_REGION` `` `6` `"` `` ``` ```` ````` ``` ````````````````````````````````````````````````````````````````````````````````````````````````````````````````` ```````````````````````````````````````````````````````````````````````````````````````````````````````````````` ``````````````````````````````````````````````````````````````````````````````````````````````````````````````` Please copy the output of the `echo` command. We’ll use it soon. Now that we got all the cluster information we’ll need, we can `ssh` into one of the manager nodes. ``` `1` ssh -i `$KEY_NAME`.pem docker@`$CLUSTER_IP` ``` `````````````````````````````````````````````````````````````````````````````````````````````````````````````` Next, we’ll create a file that will hold all the information we’ll need. That way we’ll be able to get in and out of the cluster without losing the ability to retrieve that data quickly. ``` `1` `echo` `"` `2` `export CLUSTER_DNS=[...]` `3` `export AWS_ACCESS_KEY_ID=[...]` `4` `export AWS_SECRET_ACCESS_KEY=[...]` `5` `export AWS_DEFAULT_REGION=[...]` `6` `"`>creds ``` ````````````````````````````````````````````````````````````````````````````````````````````````````````````` Instead of typing the command from above, please type `echo "`, paste the output you copied a moment ago, and close it with `">creds`. The result should be four `export` commands inside the `creds` file. Let’s download a script that will deploy (almost) all the services we used in the previous chapter. 
``` `1` curl -o aws-services-15.sh `\` `2 ` https://raw.githubusercontent.com/vfarcic/docker-flow-monitor/master/scripts`\` `3` /aws-services-15.sh `4` `5` chmod +x aws-services-15.sh ``` ```````````````````````````````````````````````````````````````````````````````````````````````````````````` We download the script and gave it execute permissions. Now we are ready to deploy the services. ``` `1` `source` creds `2` `3` ./aws-services-15.sh `4` `5` docker stack ls ``` ``````````````````````````````````````````````````````````````````````````````````````````````````````````` Since `aws-services-15.sh` needs environment variable `CLUSTER_DNS`, we exported it by executing `source`. Further on, we executed the script and listed all the stacks deployed to the cluster. The output is as follows. ``` `1` NAME SERVICES `2` exporter 3 `3` go-demo 2 `4` jenkins 2 `5` monitor 3 `6` proxy 2 ``` `````````````````````````````````````````````````````````````````````````````````````````````````````````` You’ll notice that the `logging` stack is missing. We did not deploy it since it is not relevant to the goals we’re trying to accomplish in this chapter and, at the same time, it requires extra nodes. Since I am committed towards not making you spend more money than needed, it seemed like a sensible thing not to deploy that stack. Finally, let’s get out of the cluster and explore how we could scale it manually. That will give us an insight into the processes we’ll want to automate. ``` `1` `exit` ``` ````````````````````````````````````````````````````````````````````````````````````````````````````````` ### Scaling Nodes Manually Let’s explore how we can scale nodes manually and, later on, try to apply the same logic to our automated processes. We’re running the cluster in AWS which already has auto-scaling groups defined for both managers and workers. In such a setting, the most sensible way to scale the nodes is to change the desired capacity of those groups. When new nodes are created by auto-scaling groups in *Docker For AWS* or *Azure*, they will join the cluster as managers or workers. If you choose not to use *Docker For AWS* or *Azure*, you’ll have to do some additional work to replicate the same functionality as the one we’re about to explore. You’ll have to create init scripts that will find IP of one of the managers, retrieve join token, and, finally, execute `docker swarm join` command. No matter which hosting vendor you’re using, the logic should, more or less, be always the same. We need to change the number of running managers or workers and, in case that number increased, join new nodes to the cluster. I am confident that you’ll be able to modify the logic that follows to your cluster setup. The first thing we need to do is find out the name of the auto-scaling group created for our cluster. A good start is to list all the groups by executing `aws autoscaling describe-auto-scaling-groups` command. ``` `1` aws autoscaling `\` `2 ` describe-auto-scaling-groups `\` `3 ` `|` jq `"."` ``` ```````````````````````````````````````````````````````````````````````````````````````````````````````` The output is too big to be presented in a book format, and we do not need it in its entirety. Therefore, we’ll limit the output. Luckily, we know that the name of the auto-scaling group starts with `[STACK_NAME]-Node`. We can use that to filter the output. A command that will retrieve only the auto-scaling group assigned to worker nodes and retrieve just the name of the group is as follows. 
``` `1` aws autoscaling `\` `2 ` describe-auto-scaling-groups `\` `3 ` `|` jq -r `".AutoScalingGroups[] \` `4 `` | select(.AutoScalingGroupName \` `5 `` | startswith(\"``$STACK_NAME``-NodeAsg-\"))\` `6 `` .AutoScalingGroupName"` ``` ``````````````````````````````````````````````````````````````````````````````````````````````````````` We used `jq` to retrieve all data within the root node `AutoScalingGroups`. Further on, we used `select` command to retrieve only records with `AutoScalingGroupName` that starts with `[STACK_NAME]-Node`. Finally, we limited the output further so that only the name of the name of the group is retrieved. The output will vary from one case to another. It should be similar to the one that follows. ``` `1` devops22-NodeAsg-1J93DRR7VYUHU ``` `````````````````````````````````````````````````````````````````````````````````````````````````````` We cannot change the auto-scaling group desired capacity without knowing what the current number of nodes is. Therefore, we need to construct another query that will provide that information. Fortunately, the command is very similar since all we need is to retrieve a different value based on the same filter. ``` `1` aws autoscaling `\` `2 ` describe-auto-scaling-groups `\` `3 ` `|` jq -r `".AutoScalingGroups[] \` `4 `` | select(.AutoScalingGroupName \` `5 `` | startswith(\"``$STACK_NAME``-NodeAsg-\"))\` `6 `` .DesiredCapacity"` ``` ````````````````````````````````````````````````````````````````````````````````````````````````````` When compared with the previous command, the only change is that, this time, we retrieved `DesiredCapacity` instead `AutoScalingGroupName`. The output is `0`. That should come as no surprise since we specified that we did not want any workers when we created the cluster. We’ll repeat the command we used to retrieve the name of the auto-scaling group and, this time, we’ll put the result as a value of an environment variable. That way we’ll be able to reuse it across the commands we’ll execute later on. ``` `1` `ASG_NAME``=``$(`aws autoscaling `\` `2 ` describe-auto-scaling-groups `\` `3 ` `|` jq -r `".AutoScalingGroups[] \` `4 `` | select(.AutoScalingGroupName \` `5 `` | startswith(\"``$STACK_NAME``-NodeAsg-\"))\` `6 `` .AutoScalingGroupName"``)` ``` ```````````````````````````````````````````````````````````````````````````````````````````````````` Now that we have the name of the auto-scaling group, we can increase the desired capacity from `0` to `1`. ``` `1` aws autoscaling `\` `2 ` update-auto-scaling-group `\` `3 ` --auto-scaling-group-name `$ASG_NAME` `\` `4 ` --desired-capacity `1` ``` ``````````````````````````````````````````````````````````````````````````````````````````````````` Let’s confirm that the capacity is indeed increased. ``` `1` aws autoscaling `\` `2 ` describe-auto-scaling-groups `\` `3 ` --auto-scaling-group-names `$ASG_NAME` `\` `4 ` `|` jq `".AutoScalingGroups[0]\` `5 `` .DesiredCapacity"` ``` `````````````````````````````````````````````````````````````````````````````````````````````````` We executed `describe-auto-scaling-groups` one more time. However, since now we know the name of the group, there was no need for `jq` filters. As expected, the output is `1` confirming that the update indeed worked. The fact that the desired capacity of the group was updated does not necessarily mean that a new node was created. We can check that easily by executing `ec2 describe-instances` combined with a bit of `jq` magic. 
``` `1` aws ec2 describe-instances `|` jq -r `\` `2 ` `".Reservations[].Instances[] \` `3 `` | select(.SecurityGroups[].GroupName \` `4 `` | startswith(\"``$STACK_NAME``-NodeVpcSG\"))\` `5 `` .InstanceId"` ``` ````````````````````````````````````````````````````````````````````````````````````````````````` We executed `ec2 describe-instances` and used `jq` to retrieve all instances, filter them by the security group which has a name that starts with a predictable string, and retrieved the ID of the only worker instance. The output should be similar to the one that follows. ``` `1` i-06f7e78c063fedeb3 ``` ```````````````````````````````````````````````````````````````````````````````````````````````` Creation of an EC2 instance is fast. What takes a bit of time is its initialization. We should check its status and confirm that it finished initializing. ``` `1` `INSTANCE_ID``=``$(`aws ec2 `\` `2` describe-instances `|` jq -r `\` `3` `".Reservations[].Instances[] \` `4`` | select(.SecurityGroups[].GroupName \` `5`` | startswith(\"``$STACK_NAME``-NodeVpcSG\"))\` `6`` .InstanceId"``)` `7` `8` aws ec2 describe-instance-status `\` `9` --instance-ids `$INSTANCE_ID` `\` `10 ` `|` jq -r `".InstanceStatuses[0]\` `11 `` .InstanceStatus.Status"` ``` ``````````````````````````````````````````````````````````````````````````````````````````````` We repeated the previous command but, this time, stored the instance ID as the environment variable `INSTANCE_ID`. Later on, we used it with the `ec2 describe-instance-status` command to retrieve the status. If the output is `ok`, the new node is created, is initialized, and (probably) joined the cluster. Otherwise, please wait for a minute or two and recheck the status. Finally, let’s confirm that the new node indeed joined the Swarm cluster. ``` `1` ssh -i `$KEY_NAME`.pem docker@`$CLUSTER_IP` `2` `3` docker node ls `4` `5` `exit` ``` `````````````````````````````````````````````````````````````````````````````````````````````` We entered one of the manager servers, listed all the nodes of the cluster, and returned to the host. The output of the `node ls` command is as follows (IDs are removed for brevity). ``` `1` HOSTNAME STATUS AVAILABILITY MANAGER STATUS `2` ip-172-31-40-169.us-east-2.compute.internal Ready Active `3` ip-172-31-24-32.us-east-2.compute.internal Ready Active Reachable `4` ip-172-31-2-29.us-east-2.compute.internal Ready Active Leader `5` ip-172-31-42-64.us-east-2.compute.internal Ready Active Reachable ``` ````````````````````````````````````````````````````````````````````````````````````````````` That’s brilliant! The new worker joined the cluster, and our capacity increased. If, in your case, the new node did not yet join the cluster, please wait for a few moments and list the nodes again. ![Figure 15-1: Manual updates of Auto-Scaling Groups](https://github.com/OpenDocCN/freelearn-devops-pt5-zh/raw/master/docs/dop-22-tlkt/img/00080.jpeg) Figure 15-1: Manual updates of Auto-Scaling Groups There are a few other manual actions we should explore before we move towards automation. But, before we proceed, we’ll change the auto-scaling group one more time. We’ll set the desired capacity back to `0`. That way we’ll not only confirm that the process works in both directions, but also save a bit of money by not running more nodes than we need. 
``` `1` aws autoscaling `\` `2 ` update-auto-scaling-group `\` `3 ` --auto-scaling-group-name `$ASG_NAME` `\` `4 ` --desired-capacity `0` ``` ```````````````````````````````````````````````````````````````````````````````````````````` We updated the auto-scaling group back to the desired capacity of `0`. After a while, we can return to the cluster and confirm that the worker is removed from the cluster. ``` `1` ssh -i `$KEY_NAME`.pem docker@`$CLUSTER_IP` `2` `3` docker node ls ``` ``````````````````````````````````````````````````````````````````````````````````````````` The output of the `node ls` command is as follows (IDs are removed for brevity). ``` `1` HOSTNAME STATUS AVAILABILITY MANAGER STATUS `2` ip-172-31-40-169.us-east-2.compute.internal Down Active `3` ip-172-31-24-32.us-east-2.compute.internal Ready Active Reachable `4` ip-172-31-2-29.us-east-2.compute.internal Ready Active Leader `5` ip-172-31-42-64.us-east-2.compute.internal Ready Active Reachable ``` `````````````````````````````````````````````````````````````````````````````````````````` As you can see, the status of the worker node is set to `Down`. Swarm lost communication with the node once the auto-scaling group shut it down and changed its status. Soon it will remove it completely from its records. We still have one more problem to solve. We cannot run `aws` commands from the cluster. *Docker For AWS* does not let us install any additional software. Even if we would find the way to install the CLI, we should not pollute our production servers. Instead, we should run any tool we need thorough a container. Since there is no official AWS CLI Docker image, I created one for the exercises in this chapter. Let’s take a look at the Dockerfile. ``` `1` curl `"https://raw.githubusercontent.com/vfarcic/docker-aws-cli/master/Dockerfile"` ``` ````````````````````````````````````````````````````````````````````````````````````````` ``` `1` FROM alpine `2` `3` MAINTAINER Viktor Farcic <viktor@farcic.com> `4` `5` RUN apk --update add python py-pip jq && \ `6` pip install awscli && \ `7` apk del py-pip && \ `8` rm -rf /var/cache/apk/* `9` `10` ENV AWS_ACCESS_KEY_ID "" `11` ENV AWS_SECRET_ACCESS_KEY "" `12` ENV AWS_DEFAULT_REGION "us-east-1" ``` ```````````````````````````````````````````````````````````````````````````````````````` As you can see, it’s pretty straightforward. The image is based on `alpine` and installs `python`, `py-pip`, and `jq`. We’re installing Python since `pip` is the easiest way to install `awscli`. The rest of the image specification defines a few environment variables required by the AWS CLI. The image was built and pushed as `vfarcic/aws-cli`. Let’s do a test run of a container based on the image. ``` `1` `source` creds `2` `3` docker container run --rm `\` `4 ` -e `AWS_ACCESS_KEY_ID``=``$AWS_ACCESS_KEY_ID` `\` `5 ` -e `AWS_SECRET_ACCESS_KEY``=``$AWS_SECRET_ACCESS_KEY` `\` `6 ` -e `AWS_DEFAULT_REGION``=``$AWS_DEFAULT_REGION` `\` `7 ` vfarcic/aws-cli `\` `8 ` aws ec2 describe-instances ``` ``````````````````````````````````````````````````````````````````````````````````````` We sourced the `creds` file that contains the environment variables we need. Further on we run a container based on the `vfarcic/aws-cli` image. We used `aws ec2 describe-instances` as the command only to demonstrate that any `aws` command could be executed through a container. The result should be information about all the EC2 nodes we have in that region. 
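Before we move on to the Compose file, it might help to see the pieces combined. The snippet that follows is only a sketch, not part of the original walkthrough: it reuses the `jq` filter from the manual scaling section to print the desired capacity of the workers' auto-scaling group, with both `aws` and `jq` running inside the container. It assumes the stack is still named `devops22` and that the `creds` file was sourced.

```
# Sketch: run one of the earlier auto-scaling queries through the vfarcic/aws-cli image
docker container run --rm \
    -e AWS_ACCESS_KEY_ID=$AWS_ACCESS_KEY_ID \
    -e AWS_SECRET_ACCESS_KEY=$AWS_SECRET_ACCESS_KEY \
    -e AWS_DEFAULT_REGION=$AWS_DEFAULT_REGION \
    vfarcic/aws-cli sh -c \
    'aws autoscaling describe-auto-scaling-groups \
        | jq -r ".AutoScalingGroups[] \
        | select(.AutoScalingGroupName \
        | startswith(\"devops22-NodeAsg-\")) \
        .DesiredCapacity"'
```

The output should be a single number matching the group's current desired capacity.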
We’re using the `creds` file only as a convenience and for demo purposes since we cannot inject a secret into a container. It must be a Swarm service. The `docker container run` command we executed is too long to remember. We can mitigate that by creating a Docker Compose YAML file with all the `aws` commands we need. An example of such a file can be found in the [vfarcic/docker-aws-cli](https://github.com/vfarcic/docker-aws-cli) repository. It contains all the commands we’ll use in this chapter. Let’s take a brief look at it. ``` `1` curl `"https://raw.githubusercontent.com/vfarcic/docker-aws-cli/master/docker-com\` `2` `pose.yml"` ``` `````````````````````````````````````````````````````````````````````````````````````` The output is as follows. ``` `1` version: '3.2' `2` `3` services: `4` `5` asg-name: `6` image: vfarcic/aws-cli `7` environment: `8` - AWS_ACCESS_KEY_ID=`${``AWS_ACCESS_KEY_ID``}` `9` - AWS_SECRET_ACCESS_KEY=`${``AWS_SECRET_ACCESS_KEY``}` `10 ` - AWS_DEFAULT_REGION=`${``AWS_DEFAULT_REGION``}` `11 ` command: sh -c "aws autoscaling describe-auto-scaling-groups | jq -r '.AutoS\ `12` calingGroups[] | select(.AutoScalingGroupName | startswith(\"`${``STACK_NAME``}`--Node\ `13` Asg\")).AutoScalingGroupName'" `14` `15 ` asg-desired-capacity: `16 ` image: vfarcic/aws-cli `17 ` environment: `18 ` - AWS_ACCESS_KEY_ID=`${``AWS_ACCESS_KEY_ID``}` `19 ` - AWS_SECRET_ACCESS_KEY=`${``AWS_SECRET_ACCESS_KEY``}` `20 ` - AWS_DEFAULT_REGION=`${``AWS_DEFAULT_REGION``}` `21 ` command: sh -c "aws autoscaling describe-auto-scaling-groups --auto-scaling-\ `22` group-names `$ASG_NAME` | jq '.AutoScalingGroups[0].DesiredCapacity'" `23` `24 ` asg-update-desired-capacity: `25 ` image: vfarcic/aws-cli `26 ` environment: `27 ` - AWS_ACCESS_KEY_ID=`${``AWS_ACCESS_KEY_ID``}` `28 ` - AWS_SECRET_ACCESS_KEY=`${``AWS_SECRET_ACCESS_KEY``}` `29 ` - AWS_DEFAULT_REGION=`${``AWS_DEFAULT_REGION``}` `30 ` command: sh -c "aws autoscaling update-auto-scaling-group --auto-scaling-gro\ `31` up-name `$ASG_NAME` --desired-capacity `$ASG_DESIRED_CAPACITY`" ``` ````````````````````````````````````````````````````````````````````````````````````` We won’t go into details of the services defined in that YAML file. It should be self explanatory what each of them does. Please note that `docker-compose` is not installed on the nodes of the cluster. We will not need it since the plan is to use those Compose services through Jenkins agents which will have Docker Compose. Let’s move on and explore how to transform the commands we used so far into automated scaling solution. ### Creating Scaling Job Let’s try to translate the commands we executed manually into a Jenkins job. If we manage to do that, we can go further and let Alertmanager trigger that job whenever certain thresholds are reached in Prometheus. We’ll start by downloading Jenkins stack from the [vfarcic/docker-flow-monitor](http://github.com/vfarcic/docker-flow-monitor) repository. ``` `1` curl -o jenkins.yml `\` `2 ` https://raw.githubusercontent.com/vfarcic/docker-flow-monitor/master/stacks/`\` `3` jenkins-aws-secret.yml `4` `5` cat jenkins.yml ``` ```````````````````````````````````````````````````````````````````````````````````` The stack definition we just downloaded is almost identical to the one we used before so we’ll comment only the differences. ``` `1` version: "3.2" `2` `3` services: `4` `5` ... `6` `7` agent: `8` image: vfarcic/jenkins-swarm-agent `9` ... `10 ` secrets: `11 ` - aws `12 ` ... `13` `14` secrets: `15 ` aws: `16 ` external: true `17 ` ... 
``` ``````````````````````````````````````````````````````````````````````````````````` The only new addition to the Jenkins stack is the `aws` secret. It should contain AWS keys and the region we’ll need for AWS CLI. So, let’s start by creating the secret. ``` `1` `source` creds `2` `3` `echo` `"` `4` `export AWS_ACCESS_KEY_ID=``$AWS_ACCESS_KEY_ID` ```` `5` `export AWS_SECRET_ACCESS_KEY=``$AWS_SECRET_ACCESS_KEY` ``` `6` `export AWS_DEFAULT_REGION=``$AWS_DEFAULT_REGION` `` `7` `export STACK_NAME=devops22` `8` `"` `|` docker secret create aws - `` ``` ```` ``` `````````````````````````````````````````````````````````````````````````````````` ````````````````````````````````````````````````````````````````````````````````` ```````````````````````````````````````````````````````````````````````````````` We sourced the `creds` file and used the environment variables to construct the `aws` secret. Now we can deploy the `jenkins` stack. ``` `1` docker stack deploy `\` `2 ` -c jenkins.yml jenkins `3` `4` `exit` ``` ``````````````````````````````````````````````````````````````````````````````` We deployed the stack and exited the cluster. Jenkins has a small nuance with its URL. If we do not change anything, it will not know what its address is and, when we construct notification messages, it’ll resolve itself to `null`. Fortunately, the fix is reasonably easy. All we have to do is open the configuration page and click the *Save* button. ``` `1` open `"http://``$CLUSTER_DNS``/jenkins/configure"` ``` `````````````````````````````````````````````````````````````````````````````` Please login using *admin* as both the *User* and the *Password*. Once you’re authenticated, you’ll see the configuration screen which, among other fields, contains *Jenkins URL*. Please confirm that it is correct and click the *Save* button. ![Figure 15-2: Jenkins URL configuration](https://github.com/OpenDocCN/freelearn-devops-pt5-zh/raw/master/docs/dop-22-tlkt/img/00081.jpeg) Figure 15-2: Jenkins URL configuration Now that we resolved Jenkins’ identity crisis, we can create a new job capable of scaling nodes of the cluster. ``` `1` open `"http://``$CLUSTER_DNS``/jenkins/view/all/newJob"` ``` ````````````````````````````````````````````````````````````````````````````` Please type *aws-scale* as the job name, select *Pipeline* as the job type, and click the *OK* button. You’ll see the job configuration screen. Since we’re planning to trigger builds remotely, we should create an authentication token. Please click the *Build Triggers* tab, select the *Trigger builds remotely* checkbox, and type *DevOps22* as the *Authentication Token*. ![Figure 15-3: Jenkins job build triggers](https://github.com/OpenDocCN/freelearn-devops-pt5-zh/raw/master/docs/dop-22-tlkt/img/00082.jpeg) Figure 15-3: Jenkins job build triggers Now we’re ready to define a pipeline script. Please click the *Pipeline* tab and type the script that follows in the *Pipeline Script* field. 
``` `1` `pipeline` `{` `2` `agent` `{` `3` `label` `"prod"` `4` `}` `5` `options` `{` `6` `buildDiscarder``(``logRotator``(``numToKeepStr:` `'2'``))` `7` `disableConcurrentBuilds``()` `8` `}` `9` `parameters` `{` `10 ` `string``(` `11 ` `name:` `"scale"``,` `12 ` `defaultValue:` `"1"``,` `13 ` `description:` `"The number of worker nodes to add or remove"` `14 ` `)` `15 ` `}` `16 ` `stages` `{` `17 ` `stage``(``"scale"``)` `{` `18 ` `steps` `{` `19 ` `git` `"https://github.com/vfarcic/docker-aws-cli.git"` `20 ` `script` `{` `21 ` `def` `asgName` `=` `sh``(` `22 ` `script:` `"source /run/secrets/aws && docker-compose run --rm asg-name\` `23` `"``,` `24 ` `returnStdout:` `true` `25 ` `).``trim``()` `26 ` `if` `(``asgName` `==` `""``)` `{` `27 ` `error` `"Could not find auto-scaling group"` `28 ` `}` `29 ` `def` `asgDesiredCapacity` `=` `sh``(` `30 ` `script:` `"source /run/secrets/aws && ASG_NAME=${asgName} docker-compo\` `31` `se run --rm asg-desired-capacity"``,` `32 ` `returnStdout:` `true` `33 ` `).``trim``().``toInteger``()` `34 ` `def` `asgNewCapacity` `=` `asgDesiredCapacity` `+` `scale``.``toInteger``()` `35 ` `if` `(``asgNewCapacity` `<` `1``)` `{` `36 ` `error` `"The number of worker nodes is already at the minimum capacity\` `37 `` of 1"` `38 ` `}` `else` `if` `(``asgNewCapacity` `>` `3``)` `{` `39 ` `error` `"The number of worker nodes is already at the maximum capacity\` `40 `` of 3"` `41 ` `}` `else` `{` `42 ` `sh` `"source /run/secrets/aws && ASG_NAME=${asgName} ASG_DESIRED_CAPAC\` `43` `ITY=${asgNewCapacity} docker-compose run --rm asg-update-desired-capacity"` `44 ` `echo` `"Changed the number of worker nodes from ${asgDesiredCapacity} \` `45` `to ${asgNewCapacity}"` `46 ` `}` `47 ` `}` `48 ` `}` `49 ` `}` `50 ` `}` `51 ` `post` `{` `52 ` `success` `{` `53 ` `slackSend``(` `54 ` `color:` `"good"``,` `55 ` `message:` `"""Worker nodes were scaled.` `56` `Please check Jenkins logs for the job ${env.JOB_NAME} #${env.BUILD_NUMBER}` `57` `${env.BUILD_URL}console"""` `58 ` `)` `59 ` `}` `60 ` `failure` `{` `61 ` `slackSend``(` `62 ` `color:` `"danger"``,` `63 ` `message:` `"""Worker nodes could not be scaled.` `64` `Please check Jenkins logs for the job ${env.JOB_NAME} #${env.BUILD_NUMBER}` `65` `${env.BUILD_URL}console"""` `66 ` `)` `67 ` `}` `68 ` `}` `69` `}` ``` ```````````````````````````````````````````````````````````````````````````` You should be able to understand most of the Pipeline without any help so I’ll limit the discussion on the steps of the `scale` stage. We start by cloning the `vfarcic/docker-aws-cli` repository that contains `docker-compose.yml` file with AWS CLI services we’ll need. Next, we’re executing Docker Compose service `asg-name` that retrieves the name of the auto-scaling group associated with worker nodes. The result is stored in the variable `asgName`. Since all the services defined in that Compose file require environment variables with AWS keys and the region where the cluster is running, we’re executing `source /run/secrets/aws` before `docker-compose` commands. The file was injected as the Docker secret `aws`. Further on, we’re retrieving the current desired capacity. The new capacity is calculated by adding the value of the `scale` parameter to the current capacity. Finally, we have a simple `if/else` statement that throws an error if the future capacity would be lower than `1` or higher than `3` nodes. That way we are setting boundaries so that the system cannot expand or contract too much. 
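Stripped of the Jenkins-specific parts, the `scale` stage boils down to a handful of shell commands. The sketch below is only an illustration of the same logic, not part of the job itself: it uses the raw AWS CLI commands from the manual scaling section (so it would run from a machine with the AWS CLI and the `ASG_NAME` variable set) and keeps the same `1` and `3` boundaries.

```
# Illustration of the scale stage's logic using the raw AWS CLI (assumes ASG_NAME is set)
scale=1

capacity=$(aws autoscaling describe-auto-scaling-groups \
    --auto-scaling-group-names $ASG_NAME \
    | jq '.AutoScalingGroups[0].DesiredCapacity')

new_capacity=$((capacity + scale))

if [ "$new_capacity" -lt 1 ] || [ "$new_capacity" -gt 3 ]; then
    echo "Refusing to scale: $new_capacity is outside the 1-3 worker boundaries"
else
    aws autoscaling update-auto-scaling-group \
        --auto-scaling-group-name $ASG_NAME \
        --desired-capacity $new_capacity
fi
```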
You should change those limits to better match your current size of the cluster. Finally, if the new capacity is within the boundaries, we are updating the auto-scaling group. As you can see, the script is relatively simple and straightforward. Even though this might not be the final version that fits everyone’s purposes, the general gist is there, and I’m confident that you’ll have no problem adapting it to suit your needs. Do not forget to click the *Save* button before moving forward. Let’s give the job a spin. ``` `1` open `"http://``$CLUSTER_DNS``/jenkins/blue/organizations/jenkins/aws-scale/activity"` ``` ``````````````````````````````````````````````````````````````````````````` You should see the `aws-scale` screen with the *Run* button in the middle. Please click it. We already discussed the bug that makes the first build of a Pipeline job with properties fail. All subsequent builds should work properly, so we’ll give it another try. Please reload the page and click the *Run* button. You’ll be presented with a screen with a single parameter that allows us to specify how many nodes we’d like to add or remove. Leave the default value of `1` and click the *Run* button. A new build will start. Please click on the row that represents the new build and explore it. The second to last step should state that the number of workers changed from `0` to `1`. ![Figure 15-4: Jenkins build results](https://github.com/OpenDocCN/freelearn-devops-pt5-zh/raw/master/docs/dop-22-tlkt/img/00083.jpeg) Figure 15-4: Jenkins build results As you saw before, it takes a minute or two until a new node is created and initialized. Fetch a coffee. By the time you come back, the new node will be fully operational within the cluster. Let’s enter one of the manager nodes and confirm that the new node joined the Swarm cluster. ``` `1` ssh -i `$KEY_NAME`.pem docker@`$CLUSTER_IP` `2` `3` docker node ls ``` `````````````````````````````````````````````````````````````````````````` The output is as follows (IDs are removed for brevity). ``` `1` HOSTNAME STATUS AVAILABILITY MANAGER STATUS `2` ip-172-31-31-102.us-east-2.compute.internal Ready Active Reachable `3` ip-172-31-33-251.us-east-2.compute.internal Ready Active `4` ip-172-31-34-254.us-east-2.compute.internal Ready Active Reachable `5` ip-172-31-7-121.us-east-2.compute.internal Ready Active Leader ``` ````````````````````````````````````````````````````````````````````````` As you can see, the new worker indeed joined the cluster. In your cluster, the new node might not yet be initialized. If that’s the case, please wait for a minute or two and re-execute `docker node ls`. ![Figure 15-5: Infrastructure scaling orchestrated by Jenkins](https://github.com/OpenDocCN/freelearn-devops-pt5-zh/raw/master/docs/dop-22-tlkt/img/00084.jpeg) Figure 15-5: Infrastructure scaling orchestrated by Jenkins UI is useful as a learning experience, but our goal is to trigger the job remotely. Let’s check whether we can send a `POST` request. ``` `1` `exit` `2` `3` curl -XPOST -i `\` `4 ` `"http://``$CLUSTER_DNS``/jenkins/job/aws-scale/buildWithParameters?token=DevOps2\` `5` `2&scale=2"` ``` ```````````````````````````````````````````````````````````````````````` We exited the cluster and sent a post request with the token and the `scale` parameter set to `2`. The response is as follows. 
``` `1` HTTP/1.1 201 Created `2` Connection: close `3` Date: Sat, 23 Sep 2017 20:07:21 GMT `4` X-Content-Type-Options: nosniff `5` Location: http://devops22-ExternalL-1OG8BA7IMZCT0-900324820.us-east-2.elb.amazon\ `6` aws.com/jenkins/queue/item/5/ `7` Server: Jetty(9.4.z-SNAPSHOT) ``` ``````````````````````````````````````````````````````````````````````` Let’s confirm that Jenkins build was executed successfully. ``` `1` open `"http://``$CLUSTER_DNS``/jenkins/blue/organizations/jenkins/aws-scale/activity"` ``` `````````````````````````````````````````````````````````````````````` Please click the last build and observe that the number of nodes scaled. Similarly, you should see a new notification in *#df-monitor-tests* in the *DevOps20* Slack channel. ![Figure 15-6: Slack notification indicating that nodes scaled](https://github.com/OpenDocCN/freelearn-devops-pt5-zh/raw/master/docs/dop-22-tlkt/img/00085.jpeg) Figure 15-6: Slack notification indicating that nodes scaled Finally, we’ll go back to the cluster and confirm not only that a new node was created but also that it joined the cluster. ``` `1` ssh -i `$KEY_NAME`.pem docker@`$CLUSTER_IP` `2` `3` docker node ls ``` ````````````````````````````````````````````````````````````````````` The output is as follows (IDs are removed for brevity). ``` `1` HOSTNAME STATUS AVAILABILITY MANAGER STATUS `2` ip-172-31-24-32.us-east-2.compute.internal Ready Active Reachable `3` ip-172-31-2-29.us-east-2.compute.internal Ready Active Leader `4` ip-172-31-42-64.us-east-2.compute.internal Ready Active Reachable `5` ip-172-31-24-95.us-east-2.compute.internal Ready Active `6` ip-172-31-34-28.us-east-2.compute.internal Ready Active `7` ip-172-31-4-136.us-east-2.compute.internal Ready Active ``` ```````````````````````````````````````````````````````````````````` The number of worker nodes increased from one to three. If you do not yet see three worker nodes, please wait for a minute or two and re-run the `docker node ls` command. Let’s test whether the limits we set are respected. Remember, our Pipeline script should not allow less than one nor more than three worker nodes. Since we are already running three workers, we should be able to test it by attempting to add one more. ``` `1` `exit` `2` `3` curl -XPOST -i `\` `4 ` `"http://``$CLUSTER_DNS``/jenkins/job/aws-scale/buildWithParameters?token=DevOps2\` `5` `2&scale=1"` ``` ``````````````````````````````````````````````````````````````````` We exited the cluster and sent a `POST` request to build the `aws-scale` job with the `scale` parameter set to `1`. Let’s see the result in Jenkins UI. ``` `1` open `"http://``$CLUSTER_DNS``/jenkins/blue/organizations/jenkins/aws-scale/activity"` ``` `````````````````````````````````````````````````````````````````` You should see that the last build failed. We tried to add more workers than allowed and the build responded with an error. Similarly, we should see an error notification in *#df-monitor-tests* in the *DevOps20* Slack channel. ![Figure 15-7: Slack notification indicating that node scaling failed](https://github.com/OpenDocCN/freelearn-devops-pt5-zh/raw/master/docs/dop-22-tlkt/img/00086.jpeg) Figure 15-7: Slack notification indicating that node scaling failed It won’t hurt to check whether de-scaling nodes works as well. 
``` `1` curl -XPOST -i `\` `2 ` `"http://``$CLUSTER_DNS``/jenkins/job/aws-scale/buildWithParameters?token=DevOps2\` `3` `2&scale=-2"` ``` ````````````````````````````````````````````````````````````````` We sent a similar `POST` request like a few times before. The only notable difference is that the `scale` param is now set to `-2`. As a result, two worker nodes should be removed, leaving us with one. At this point, there should be no need to check build results in Jenkins or notifications in Slack. The system proved to be working well. So, we’ll skip through those and jump straight into the cluster and output the list of joined nodes. ``` `1` ssh -i `$KEY_NAME`.pem docker@`$CLUSTER_IP` `2` `3` docker node ls ``` ```````````````````````````````````````````````````````````````` We entered the cluster and listed all the nodes. The output is as follows (IDs are removed for brevity). ``` `1` HOSTNAME STATUS AVAILABILITY MANAGER STATUS `2` ip-172-31-24-32.us-east-2.compute.internal Ready Active Reachable `3` ip-172-31-2-29.us-east-2.compute.internal Ready Active Leader `4` ip-172-31-42-64.us-east-2.compute.internal Ready Active Reachable `5` ip-172-31-24-95.us-east-2.compute.internal Down Active `6` ip-172-31-34-28.us-east-2.compute.internal Ready Active `7` ip-172-31-4-136.us-east-2.compute.internal Down Active ``` ``````````````````````````````````````````````````````````````` You’ll notice that status of two of the nodes is set to `Down`. If, in your case, all the nodes are still `Ready`, you might need to wait for a minute or two and re-execute the `docker node ls` command. On the other hand, if you were not fast enough, Swarm might have cleaned its registry, and the two nodes that were `down` might have been removed altogether. Finally, the last verification we should do is to check whether the lower limit is respected as well. ``` `1` `exit` `2` `3` curl -XPOST -i `\` `4 ` `"http://``$CLUSTER_DNS``/jenkins/job/aws-scale/buildWithParameters?token=DevOps2\` `5` `2&scale=-1"` ``` `````````````````````````````````````````````````````````````` You know what to do. Visit the last build in Jenkins, check Slack, list Swarm nodes, or trust me blindly. The number of worker nodes should be left intact since we are running only one and reducing them to zero would violate the lower limit we set in the pipeline job. Now that we confirmed that triggering the `aws-scale` job (de)scales our worker nodes, we can turn our attention to Prometheus and Alertmanager and try to tie them all together into a system that will, for example, scale the number of workers depending on memory usage. ### Scaling Cluster Nodes Automatically We created the last piece of the chain. Jenkins job will scale nodes of a cluster only if something triggers it. We did that manually by sending `POST` requests but, as you might have guessed, that is not our ultimate goal. We need to run those builds through alerts based on metrics. Therefore, we’ll move back to the beginning of the chain and explore some of the metrics we can use and try to convert them into meaningful alerts. Let’s open Prometheus and try to define an alert worthy of our scaling needs. ``` `1` open `"http://``$CLUSTER_DNS``/monitor"` ``` ````````````````````````````````````````````````````````````` Please use *admin* as both the *User Name* and the *Password* if you’re asked to authenticate. We’ll start with an expression we already used in the previous chapters. 
Please type the query that follows in the *Expression* field, click the *Execute* button, and switch to the *Graph* tab. ``` `1` (sum(node_memory_MemTotal) BY (instance) - sum(node_memory_MemFree + node_memory\ `2` _Buffers + node_memory_Cached) BY (instance)) / sum(node_memory_MemTotal) BY (in\ `3` stance) ``` ```````````````````````````````````````````````````````````` As a reminder, the expression calculates the percentage of used memory for each instance (node). It does that by taking the total amount of memory and reducing it with free, buffered, and cached memory. Further on, the result is divided with total memory to get a percentage. Each segment of the expression is using `BY (instance)` to separate the results. The output should be four graphs representing four nodes currently running in the cluster. Used memory should be somewhere between ten and forty percent for each node. Don’t get confused if you see more than four lines. We had more than four nodes and, during their lifespan, their metrics were also recorded in Prometheus. ![Figure 15-8: Prometheus graph with memory utilization](https://github.com/OpenDocCN/freelearn-devops-pt5-zh/raw/master/docs/dop-22-tlkt/img/00087.jpeg) Figure 15-8: Prometheus graph with memory utilization At this point you might be tempted to write an alert that would be fired whenever memory is, let’s say, over 80%. That alert could result in a `POST` request to Jenkins which, in turn, would scale worker nodes. What would such a spike in memory usage on one of the nodes tell us? If all nodes are using too much memory, it would not make sense to monitor them individually. On the other hand, if only one node has a spike, that would probably not indicate a problem that should be solved by scaling the number of nodes. The issue would, more likely, lie in incorrect memory reservations and limitations defined for one of our services. Or, maybe one of the services went wild with memory consumption. However, in that case, we probably did not even define its resources. If we did, Swarm would reschedule that service and, before that happens, we’d get a service-level alert. Such an alert might result in scaling of the service, or it might require some other type of actions. There might be other reasons for memory spike in one of the nodes but, in most of the cases, the resolution would have to rely on manual intervention. We’d need to (re)define service resources, fix a bug, or do one of many other actions that should be performed manually. All in all, auto-scaling based on memory usage of a single node is, in most cases, not a good strategy. Instead, I believe that it would be better to base our auto-scaling strategy on memory usage of the entire cluster. After all, we already adopted the concept of treating the whole cluster as a single entity. Let’s try to write an expression that will give us the percentage of used memory across the whole cluster. We’ll break it into two parts. First, we’ll write a query that retrieves the number of used bytes and, later on, we’ll get the total available memory of the whole cluster. If we divide those two, the result should be a percentage of the used memory of a cluster. Please type the query that follows in the *Expression* field, and click the *Execute* button. ``` `1` sum(node_memory_MemTotal) - sum(node_memory_MemFree + node_memory_Buffers + node\ `2` _memory_Cached) ``` ``````````````````````````````````````````````````````````` You should see that around 2GB of memory is currently used in the cluster. 
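If you prefer the command line over the UI, the same expression can be sent to Prometheus' HTTP API with `curl`. The example below is a sketch that makes a couple of assumptions not spelled out in the text: that the API is reachable through the same `/monitor` path as the UI, and that it is protected with the same `admin`/`admin` credentials.

```
# Sketch: ask Prometheus' HTTP API for the memory used across the whole cluster
curl -sG -u admin:admin \
    "http://$CLUSTER_DNS/monitor/api/v1/query" \
    --data-urlencode 'query=sum(node_memory_MemTotal) - sum(node_memory_MemFree + node_memory_Buffers + node_memory_Cached)' \
    | jq -r '.data.result[0].value[1]'
```

The result is a single value in bytes that should roughly match the 2GB figure from the graph.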
The expression we used is very similar to the one that did the calculation for each instance. The only significant difference is that we removed the `BY (instance)` parts.

Next, we need to find out the total amount of memory of the cluster. That part should be easy since the previous expression already starts with the total.

Please type the query that follows in the *Expression* field, and click the *Execute* button.

```
sum(node_memory_MemTotal)
```

The output should show that we have 8GB of total memory. You might see that the number was bigger in the past since we had a brief period with five or six nodes while we were experimenting with the Jenkins job `aws-scale`. You might still see more than 8GB. In that case, please wait for a few moments until the metrics from the removed nodes expire.

If we combine those two expressions, the result is as follows.

```
(sum(node_memory_MemTotal) - sum(node_memory_MemFree + node_memory_Buffers + node_memory_Cached)) / sum(node_memory_MemTotal)
```

Please type the previous expression and click the *Execute* button. The result should be a current memory usage of approximately 25%.

![Figure 15-9: Prometheus graph with total memory utilization of the cluster](https://github.com/OpenDocCN/freelearn-devops-pt5-zh/raw/master/docs/dop-22-tlkt/img/00088.jpeg)

Figure 15-9: Prometheus graph with total memory utilization of the cluster

There's still one more problem we might need to solve. We should probably not treat manager and worker nodes equally, so we might want to split the metrics between the two. If we distinguish the `node_exporter` services running on manager nodes from those deployed to workers, we can create different types of alerts for each server type.

Let's go back to the cluster and download an updated version of the `exporter` stack definition.

```
ssh -i $KEY_NAME.pem docker@$CLUSTER_IP

curl -o exporters.yml \
    https://raw.githubusercontent.com/vfarcic/docker-flow-monitor/master/stacks/exporters-aws.yml

cat exporters.yml
```

We entered the cluster, downloaded the `exporters-aws.yml` stack definition, and displayed its content. The output of the relevant parts of the `cat` command is as follows.

```
  node-exporter-manager:
    ...
    deploy:
      labels:
        ...
        - com.df.alertName.2=node_mem_limit_total_above
        - com.df.alertIf.2=@node_mem_limit_total_above:0.8
        - com.df.alertLabels.2=receiver=system,scale=no,service=exporter_node-exporter-manager,type=node
        - com.df.alertFor.2=30s
        ...
      placement:
        constraints:
          - node.role == manager
    ...

  node-exporter-worker:
    ...
    deploy:
      labels:
        ...
        - com.df.alertName.2=node_mem_limit_total_above
        - com.df.alertIf.2=@node_mem_limit_total_above:0.8
        - com.df.alertFor.2=30s
        - com.df.alertName.3=node_mem_limit_total_below
        - com.df.alertIf.3=@node_mem_limit_total_below:0.05
        - com.df.alertFor.3=30s
        ...
      placement:
        constraints:
          - node.role == worker
    ...
```

We split the `node-exporter` service into two. We added two new label sets besides those we used before. They are the `@node_mem_limit_total_above` and `@node_mem_limit_total_below` shortcuts that expand into alerts with the expression we wrote earlier.
The first one will trigger an alert if the total memory of all the nodes where the exporter is running is above a certain threshold. Similarly, the other will be triggered if total memory is below the threshold. Those shortcuts are accompanied by the labels `scale` and `type`. The default values of the `scale` label are `up` and `down`, depending on the shortcut. The `type` label is always set to `node`. That way, we'll know whether to scale or de-scale and, through the `type` label, we'll know that the action should be performed on nodes. For more info, please consult the [AlertIf Parameter Shortcuts](http://monitor.dockerflow.com/usage/#alertif-parameter-shortcuts) section of the [Docker Flow Monitor](http://monitor.dockerflow.com/) documentation.

You'll notice that the `node-exporter-manager` service does not have the `node_mem_limit_total_below` alert. The reason is simple. If memory usage is very low on manager nodes, there's still nothing we should do. We're not going to remove one of the managers since that would put the cluster at risk. Furthermore, we changed the `node_mem_limit_total_above` default labels so that `scale` is set to `no`. That way, we can instruct Alertmanager not to send a request to Jenkins to scale the nodes, but to send a Slack notification instead. All in all, when memory usage of manager nodes is too low, we will take no action. When it's too high, we'll investigate the reason behind it instead of taking any automated actions.

The `node-exporter-worker` service will trigger automation in both cases. We'll configure Alertmanager to send requests to Jenkins to add or remove worker nodes if memory usage goes beyond the defined thresholds. We set the `node_mem_limit_total_below` limit to five percent. If this were a production cluster, that value would be too low. A more reasonable lower threshold would be thirty or forty percent. If total memory usage is below it, we have too many nodes in the cluster, and one (or more) of them should be removed. However, since our current cluster already has more capacity than we need, that would trigger an alert right away. Therefore, we decreased the limit to avoid spoiling the surprise.

Finally, both services have `placement constraints` that make sure they are running only on the correct node types.

We are about to deploy the exporters. Before we do that, please note that they will not trigger the correct processes in Alertmanager. We are yet to configure it correctly. For now, we'll limit our scope to the alerts in Prometheus.

```
docker stack rm exporter

docker stack deploy -c exporters.yml \
    exporter

exit

open "http://$CLUSTER_DNS/monitor/alerts"
```

Since the new stack definition does not have one of the services contained in the old one and `docker stack deploy` does not delete services (it only creates and updates them), we had to remove the whole stack. The alternative would be to remove only that service (`exporter_node-exporter`) but, since it's not critical whether we miss a second or two of metrics, removing the whole stack was the easier solution. Further on, we deployed the stack, exited the cluster, and opened Prometheus' alerts screen.

You'll notice that, this time, we have two sets of *nodeexporter* alerts. Let's start with those dedicated to managers. Please expand the `exporter_nodeexportermanager_node_mem_limit_total_above` alert. You'll see that it contains a similar expression to the one we wrote previously.
It is as follows.

```
(sum(node_memory_MemTotal{job="exporter_node-exporter-manager"}) - sum(node_memory_MemFree{job="exporter_node-exporter-manager"} + node_memory_Buffers{job="exporter_node-exporter-manager"} + node_memory_Cached{job="exporter_node-exporter-manager"})) / sum(node_memory_MemTotal{job="exporter_node-exporter-manager"}) > 0.8
```

The difference is that we are limiting the alert only to metrics coming from the `exporter_node-exporter-manager` job. That way, we have a clear distinction between node types. The alert will be triggered only if the total memory of the manager nodes is above eighty percent.

Please click the link next to the `IF` statement. You'll be presented with the graph screen with the alert query pre-populated. Please remove `> 0.8` and click the *Execute* button. You'll see the graph with the memory usage of the manager nodes. Whatever the values are, they should be way below eighty percent.

Please explore the `exporter_nodeexporterworker_node_mem_limit_total_above` and `exporter_nodeexporterworker_node_mem_limit_total_below` alerts as well. They use similar logic to the `exporter_nodeexportermanager_node_mem_limit_total_above` alert.

![Figure 15-10: Prometheus alert based on total memory utilization of cluster managers](https://github.com/OpenDocCN/freelearn-devops-pt5-zh/raw/master/docs/dop-22-tlkt/img/00089.jpeg)

Figure 15-10: Prometheus alert based on total memory utilization of cluster managers

Now that we created the alerts, we should switch our focus to Alertmanager. Since Docker secrets are immutable, we'll have to remove the `monitor` stack and the `alert_manager_config` secret before we start working on a new configuration.

```
ssh -i $KEY_NAME.pem docker@$CLUSTER_IP

docker stack rm monitor

docker secret rm alert_manager_config
```

Now we can create a new Alertmanager configuration and store it as a Docker secret.
```
source creds

echo "route:
  group_by: [service,scale,type]
  repeat_interval: 30m
  group_interval: 30m
  receiver: 'slack'
  routes:
  - match:
      type: 'node'
      scale: 'up'
    receiver: 'jenkins-node-up'
  - match:
      type: 'node'
      scale: 'down'
    receiver: 'jenkins-node-down'
  - match:
      service: 'go-demo_main'
      scale: 'up'
    receiver: 'jenkins-go-demo_main-up'
  - match:
      service: 'go-demo_main'
      scale: 'down'
    receiver: 'jenkins-go-demo_main-down'

receivers:
  - name: 'slack'
    slack_configs:
      - send_resolved: true
        title: '[{{ .Status | toUpper }}] {{ .GroupLabels.service }} service is in danger!'
        title_link: 'http://$CLUSTER_DNS/monitor/alerts'
        text: '{{ .CommonAnnotations.summary}}'
        api_url: 'https://hooks.slack.com/services/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'
  - name: 'jenkins-go-demo_main-up'
    webhook_configs:
      - send_resolved: false
        url: 'http://$CLUSTER_DNS/jenkins/job/service-scale/buildWithParameters?token=DevOps22&service=go-demo_main&scale=1'
  - name: 'jenkins-go-demo_main-down'
    webhook_configs:
      - send_resolved: false
        url: 'http://$CLUSTER_DNS/jenkins/job/service-scale/buildWithParameters?token=DevOps22&service=go-demo_main&scale=-1'
  - name: 'jenkins-node-up'
    webhook_configs:
      - send_resolved: false
        url: 'http://$CLUSTER_DNS/jenkins/job/aws-scale/buildWithParameters?token=DevOps22&scale=1'
  - name: 'jenkins-node-down'
    webhook_configs:
      - send_resolved: false
        url: 'http://$CLUSTER_DNS/jenkins/job/aws-scale/buildWithParameters?token=DevOps22&scale=-1'
" | docker secret create alert_manager_config -
```

Feel free to use the [15-self-adaptation-infra-alertmanager-config.sh](https://gist.github.com/vfarcic/efebfba9d42ba48eedabc118fcac7ed7) gist if you do not feel like typing the whole config.

Since the config needs the environment variable `CLUSTER_DNS`, we sourced the `creds` file that already contains it.

We added two new routes. Alerts will be routed to the `jenkins-node-up` receiver if the `type` label is set to `node` and `scale` is `up`. Similarly, if `scale` is set to `down`, alerts will be routed to `jenkins-node-down`. Both receivers send `POST` requests that build the `aws-scale` job. The only difference is the `scale` parameter, which is either `1` or `-1` depending on the outcome we want to accomplish.

Jenkins builds will not be executed for the alert associated with managers. It has the `scale` label set to `no`, so none of the routes match it. Instead, we'll get a notification in Slack (the default receiver). On the other hand, worker alerts will trigger Jenkins which, in turn, will scale or de-scale the nodes of the cluster. Since `repeat_interval` and `group_interval` are both set to thirty minutes, new nodes would keep spawning every hour if memory usage does not drop.

Now we can deploy the `monitor` stack again. Alertmanager will, this time, use the new configuration and, if everything goes as planned, act as a bridge between Prometheus alerts and Jenkins.
```
DOMAIN=$CLUSTER_DNS docker stack \
    deploy -c monitor.yml monitor
```

Now that Alertmanager is using the new configuration, we can test the system. We'll start with a simple scenario and verify that increased memory usage of the manager nodes results in a notification to Slack. Remember, we're not trying to scale managers automatically. That is reserved for worker nodes. Instead, we want to notify a human that there is an anomaly.

Since our current memory usage is way below 80%, we need to either increase the number of services we're running or change the alert threshold. We'll choose the latter since it is easier to accomplish. All we need to do is change the label of the `node-exporter-manager` service.

Before we proceed, let's confirm that all the services in the `monitor` stack are up and running.

```
docker stack ps \
    -f desired-state=running monitor
```

The output is as follows (IDs are removed for brevity).

```
NAME                      IMAGE                                      NODE                                          DESIRED STATE  CURRENT STATE
monitor_alert-manager.1   prom/alertmanager:latest                   ip-172-31-7-56.us-east-2.compute.internal     Running        Running 2 minutes ago
monitor_monitor.1         vfarcic/docker-flow-monitor:latest         ip-172-31-7-56.us-east-2.compute.internal     Running        Running 2 minutes ago
monitor_swarm-listener.1  vfarcic/docker-flow-swarm-listener:latest  ip-172-31-33-127.us-east-2.compute.internal   Running        Running 2 minutes ago
```

Now we can lower the upper memory threshold of the alert related to the `node-exporter-manager` service and confirm that the alert associated with it works.

```
docker service update \
    --label-add "com.df.alertIf.2=@node_mem_limit_total_above:0.1" \
    exporter_node-exporter-manager
```

Since our memory usage is currently between 20% and 30%, setting the alert threshold to 10% will certainly result in a fired event. Let's go to the Prometheus UI and confirm that the alert is firing.

```
exit

open "http://$CLUSTER_DNS/monitor/alerts"
```

The `exporter_nodeexportermanager_node_mem_limit_total_above` alert should be red. If it isn't, please wait a few moments and refresh the screen. Once the alert is fired, we can confirm that a Slack notification was sent by Alertmanager. Please open the *DevOps20* Slack channel *#df-monitor-tests* and observe the note stating that *Total memory of the nodes is over 0.1*.

![Figure 15-11: Prometheus initiated Slack notifications](https://github.com/OpenDocCN/freelearn-devops-pt5-zh/raw/master/docs/dop-22-tlkt/img/00090.jpeg)

Figure 15-11: Prometheus initiated Slack notifications

Before we proceed, we should restore the `node-exporter-manager` alert definition to its previous threshold. Otherwise, another alert would fire an hour from now. We'll imagine that someone saw the alert and fixed the imaginary problem.

```
ssh -i $KEY_NAME.pem docker@$CLUSTER_IP

docker service update \
    --label-add "com.df.alertIf.2=@node_mem_limit_total_above:0.8" \
    exporter_node-exporter-manager
```

We entered the cluster and updated the service by adding (overwriting) the alert with the 80% threshold. Now we can test the real deal. We'll verify that automated scaling of worker nodes works as expected.
We'll repeat a similar simulation by lowering the threshold. The only difference is that, this time, we'll update the `node-exporter-worker` service.

```
docker service update \
    --label-add "com.df.alertIf.2=@node_mem_limit_total_above:0.1" \
    exporter_node-exporter-worker
```

The alert is now set to fire when 10% of the total memory of the worker nodes is reached. We can confirm that by visiting Prometheus one more time.

```
exit

open "http://$CLUSTER_DNS/monitor/alerts"
```

You'll notice that the `exporter_nodeexporterworker_node_mem_limit_total_above` alert is red. If it isn't, please wait a few moments and refresh the screen.

Since we configured Alertmanager to send build requests to Jenkins whenever an alert with the label `type` set to `node` and `scale` set to `up` fires, and those happen to be the labels associated with this alert, the result should be a new build of the `aws-scale` job. Let's confirm that.

```
open "http://$CLUSTER_DNS/jenkins/blue/organizations/jenkins/aws-scale/activity"
```

You'll notice that a new Jenkins build was triggered. As a result, we should see a notification in Slack stating that *worker nodes were scaled*. More importantly, the number of worker nodes should increase by one.

```
ssh -i $KEY_NAME.pem docker@$CLUSTER_IP

docker node ls
```

We entered the cluster and listed all the nodes. The output is as follows (IDs are removed for brevity).

```
HOSTNAME                                      STATUS  AVAILABILITY  MANAGER STATUS
ip-172-31-9-40.us-east-2.compute.internal     Ready   Active        Leader
ip-172-31-18-10.us-east-2.compute.internal    Ready   Active
ip-172-31-25-34.us-east-2.compute.internal    Ready   Active        Reachable
ip-172-31-35-24.us-east-2.compute.internal    Ready   Active        Reachable
ip-172-31-35-253.us-east-2.compute.internal   Ready   Active
```

You'll see that we now have two worker nodes in the cluster (there was one before). If, in your case, there are still not two worker nodes with the `Ready` status, please wait for a minute or two. We need to give AWS enough time to detect the change in the auto-scaling group, create a new VM, and execute the init script that joins it to the cluster.

![Figure 15-12: Prometheus initiated worker nodes scaling](https://github.com/OpenDocCN/freelearn-devops-pt5-zh/raw/master/docs/dop-22-tlkt/img/00091.jpeg)

Figure 15-12: Prometheus initiated worker nodes scaling

Now that we confirmed that scaling up works, we should verify that the system is capable of scaling down as well. But, before we do that, we'll restore the label to its initial threshold, and thus avoid getting another node an hour later.

```
docker service update \
    --label-add "com.df.alertIf.2=@node_mem_limit_total_above:0.8" \
    exporter_node-exporter-worker
```

We'll follow the same testing pattern. But, since we are now testing the processes triggered when there's too much unused memory, we'll have to increase the threshold of the next alert.

```
docker service update \
    --label-add "com.df.alertIf.3=@node_mem_limit_total_below:0.9" \
    exporter_node-exporter-worker
```

The rest of the validations should be the same as before.
```
exit

open "http://$CLUSTER_DNS/monitor/alerts"
```

We opened the Prometheus *alerts* screen. The *exporter_nodeexporterworker_node_mem_limit_total_below* alert should be red. You know what to do if it's not. Have patience and refresh the screen.

A Jenkins build was executed, and we got a new notification in Slack. If you don't believe me, check it yourself. There's no need for instructions. Finally, after a few minutes, one of the worker nodes should be removed from the cluster.

```
ssh -i $KEY_NAME.pem docker@$CLUSTER_IP

docker node ls
```

If you were patient enough, the output of the `node ls` command should be as follows.

```
HOSTNAME                                      STATUS  AVAILABILITY  MANAGER STATUS
ip-172-31-25-180.us-east-2.compute.internal   Ready   Active
ip-172-31-44-104.us-east-2.compute.internal   Down    Active
ip-172-31-24-63.us-east-2.compute.internal    Ready   Active        Leader
ip-172-31-15-200.us-east-2.compute.internal   Ready   Active        Reachable
ip-172-31-32-99.us-east-2.compute.internal    Ready   Active        Reachable
```

One of the worker nodes was removed (or will be removed soon). Before we move on, we'll restore the alert to its original formula.

```
docker service update \
    --label-add "com.df.alertIf.3=@node_mem_limit_total_below:0.05" \
    exporter_node-exporter-worker

exit
```

Among the many other combinations and actions we could perform, there is one area that might be very important. We might need to reschedule our services after scaling the cluster.

### Rescheduling Services After Scaling Nodes

We managed to build a system that scales (and de-scales) our worker nodes automatically. Even though we might need to extend the alerts to other types of metrics, the system we created is already good as it is. Kind of…

The problem is that new nodes are empty. They do not host any services until we deploy new ones or update some of those that are already running inside our cluster. That, in itself, is not a problem if we deploy new releases to production often. Let's say that, on average, we deploy a new release every hour. That would mean that our newly added nodes will be empty only for a short while. Our deployment pipelines will re-balance the cluster. But what happens if we do not deploy any new release until the next day?

Having empty nodes for a while is not a big problem since our services have memory reservations based on actual memory usage. We observed metrics and decided how much each replica of a service should use. Having nodes with services that are using 80% or even 90% of memory is not a problem. Still, we can do better. We can forcefully update some of our services and, thus, let Swarm reschedule them. As a result, new nodes will be filled with replicas.

We could, for example, iterate over all the services in the cluster and update them by adding an environment variable. That would initiate rolling updates and result in a better distribution of our services across the cluster. However, that might produce downtime. Some of our services (e.g., `go-demo`) are scalable and stateless. They can be updated at any time without any downtime. Unfortunately, not all services are created using distributed-systems principles. Good examples are Jenkins and Prometheus. They cannot be scaled, so we cannot run multiple replicas.
An update of a service with a single replica inevitably produces downtime, no matter whether we employ rolling updates or, for example, blue-green deployment. It does not matter whether that downtime is a millisecond or a full minute. Downtime is downtime. We might never be able to avoid downtime with services like those. Still, we should probably not produce it ourselves without a valid reason. Filling newly added nodes with services is not a good enough reason. Therefore, we need to figure out a way to distinguish the services that are safe to update from those we should stay away from.

The solution is probably obvious. We can add one more label to our services. For example, we can use a service label `com.df.reschedule`. If it's set to `true`, it means that the service can be rescheduled (updated) without any danger. Services with any other value (including those without the label) should be ignored.

We could use the command that follows to retrieve the IDs of all the services with the `com.df.reschedule` label (do not execute it).

```
docker service ls -q \
    -f label=com.df.reschedule=true
```

The output would be the list of IDs (`-q`) of all the services with the label `com.df.reschedule` set to `true`.

Further on, we could iterate through that list of IDs and update the services. Such an action would result in a redistribution of services across the cluster. We do not have to update anything significant. Anything will do. For example, we can add an environment variable called `RESCHEDULE_DATE`. Since its value needs to be different every time we update it (otherwise the update would not trigger rescheduling), we can use the current date and time as the value. The command that would update a service could be as follows (do not execute it).

```
docker service update --env-add 'RESCHEDULE_DATE=${date}' ${service}
```

Finally, we should execute the process only if we are scaling up and skip it when scaling down. All that, translated into a Jenkins Pipeline script, produces the snippet that follows (do not paste it into Jenkins).

```
if (scale.toInteger() > 0) {
  sleep 300
  script {
    def servicesOut = sh(
      script: "docker service ls -q -f label=com.df.reschedule=true",
      returnStdout: true
    )
    def services = servicesOut.split('\n')
    def date = new Date()
    for(int i = 0; i < services.size(); i++) {
      def service = services[i]
      sh "docker service update --env-add 'RESCHEDULE_DATE=${date}' ${service}"
    }
  }
}
```

We start with a simple `if` statement that validates whether we want to scale up. Since it takes a bit of time until a new node is created, we're waiting for 5 minutes (`300` seconds). We could probably do a more intelligent type of verification with some kind of a loop that would verify whether the node joined the cluster. However, that might be overkill (for now), so a simple `sleep` should do. Further on, we are retrieving the list of all IDs of the services that should be rescheduled. The result is split into an array and assigned to the variable `services`. Finally, we are iterating over all the IDs (`services`) and executing `docker service update`, which will reschedule the services.

We should add that snippet to the `aws-scale` job we created earlier.

Please open the `aws-scale` configuration screen.
```
open "http://$CLUSTER_DNS/jenkins/job/aws-scale/configure"
```

Click the *Pipeline* tab and type the script that follows into the *Pipeline Script* field.

```
pipeline {
  agent {
    label "prod"
  }
  options {
    buildDiscarder(logRotator(numToKeepStr: '2'))
    disableConcurrentBuilds()
  }
  parameters {
    string(
      name: "scale",
      defaultValue: "1",
      description: "The number of worker nodes to add or remove"
    )
  }
  stages {
    stage("scale") {
      steps {
        git "https://github.com/vfarcic/docker-aws-cli.git"
        script {
          def asgName = sh(
            script: "source /run/secrets/aws && docker-compose run --rm asg-name",
            returnStdout: true
          ).trim()
          if (asgName == "") {
            error "Could not find auto-scaling group"
          }
          def asgDesiredCapacity = sh(
            script: "source /run/secrets/aws && ASG_NAME=${asgName} docker-compose run --rm asg-desired-capacity",
            returnStdout: true
          ).trim().toInteger()
          def asgNewCapacity = asgDesiredCapacity + scale.toInteger()
          if (asgNewCapacity < 1) {
            error "The number of worker nodes is already at the minimum capacity of 1"
          } else if (asgNewCapacity > 3) {
            error "The number of worker nodes is already at the maximum capacity of 3"
          } else {
            sh "source /run/secrets/aws && ASG_NAME=${asgName} ASG_DESIRED_CAPACITY=${asgNewCapacity} docker-compose run --rm asg-update-desired-capacity"
            if (scale.toInteger() > 0) {
              sleep 300
              script {
                def servicesOut = sh(
                  script: "docker service ls -q -f label=com.df.reschedule=true",
                  returnStdout: true
                )
                def services = servicesOut.split('\n')
                def date = new Date()
                for(int i = 0; i < services.size(); i++) {
                  def service = services[i]
                  sh "docker service update --env-add 'RESCHEDULE_DATE=${date}' ${service}"
                }
              }
            }
            echo "Changed the number of worker nodes from ${asgDesiredCapacity} to ${asgNewCapacity}"
          }
        }
      }
    }
  }
  post {
    success {
      slackSend(
        color: "good",
        message: """Worker nodes were scaled.
Please check Jenkins logs for the job ${env.JOB_NAME} #${env.BUILD_NUMBER}
${env.BUILD_URL}console"""
      )
    }
    failure {
      slackSend(
        color: "danger",
        message: """Worker nodes could not be scaled.
Please check Jenkins logs for the job ${env.JOB_NAME} #${env.BUILD_NUMBER}
${env.BUILD_URL}console"""
      )
    }
  }
}
```

Do not forget to click the *Save* button.

We should add the `com.df.reschedule` label to at least one service before we give the `aws-scale` job a spin.
```
ssh -i $KEY_NAME.pem docker@$CLUSTER_IP

curl -o go-demo.yml \
    https://raw.githubusercontent.com/vfarcic/docker-flow-monitor/master/stacks/go-demo-aws.yml

cat go-demo.yml
```

We entered the cluster, downloaded the updated version of the `go-demo` stack, and displayed its content on the screen. The output of the `cat` command, limited to the relevant parts, is as follows.

```
version: '3'

services:

  main:
    ...
    deploy:
      ...
      labels:
        ...
        - com.df.reschedule=true
        ...
```

The only notable change, when compared with the previous version of the stack, is the addition of the `com.df.reschedule` label.

Now we can re-deploy the stack and confirm that the updated Jenkins job works as expected.

```
docker stack deploy -c go-demo.yml \
    go-demo

exit

curl -XPOST -i \
    "http://$CLUSTER_DNS/jenkins/job/aws-scale/buildWithParameters?token=DevOps22&scale=1"
```

We deployed the stack, exited the cluster, and sent a `POST` request to build the `aws-scale` job with the `scale` parameter set to `1`. If we go to the `aws-scale` activity screen, there should be a new build.

```
open "http://$CLUSTER_DNS/jenkins/blue/organizations/jenkins/aws-scale/activity"
```

Let's go back to the cluster and confirm that the `go-demo` service was rescheduled and that, since the new node is empty (except for global services), at least one replica ended up there.

```
ssh -i $KEY_NAME.pem docker@$CLUSTER_IP

docker node ls
```

If you were quick, the output of `docker node ls` will reveal that the new node has not yet joined the cluster. If that's the case, wait for a while until AWS creates and initializes the new node and repeat the command.

Once the new node is created, please copy its ID. We'll put it into an environment variable.

```
NODE_ID=[...]
```

Please make sure that you replace `[...]` with the actual ID of the new node.

If we continued at the fast pace and less than five minutes passed (`sleep 300`) since the new build started, the new node should be empty except for global services.

```
docker node ps \
    -f desired-state=running $NODE_ID
```

The output is as follows (IDs are removed for brevity).

```
NAME                              IMAGE                       NODE                                      DESIRED STATE  CURRENT STATE
exporter_cadvisor...              google/cadvisor:latest      ip-172-31-4-4.us-east-2.compute.internal  Running        Running about a minute ago
exporter_node-exporter-worker...  basi/node-exporter:v1.14.0  ip-172-31-4-4.us-east-2.compute.internal  Running        Running about a minute ago
```

Once five minutes have passed, the update is executed, and the `go-demo` service (the only one with the `com.df.reschedule` label) is rescheduled. Let's take another look at the processes running on the new node.

```
docker node ps \
    -f desired-state=running $NODE_ID
```

The output is as follows (IDs are removed for brevity).

```
NAME                              IMAGE                       NODE                                      DESIRED STATE  CURRENT STATE
exporter_cadvisor...              google/cadvisor:latest      ip-172-31-4-4.us-east-2.compute.internal  Running        Running about a minute ago
exporter_node-exporter-worker...  basi/node-exporter:v1.14.0  ip-172-31-4-4.us-east-2.compute.internal  Running        Running about a minute ago
go-demo_main.2                    vfarcic/go-demo:latest      ip-172-31-4-4.us-east-2.compute.internal  Running        Running 3 minutes ago
```

As you can see, rescheduling worked, and one of the replicas of the `go-demo_main` service was deployed to the new node.

![Figure 15-13: Prometheus initiated worker nodes scaling and service rescheduling](https://github.com/OpenDocCN/freelearn-devops-pt5-zh/raw/master/docs/dop-22-tlkt/img/00092.jpeg)

Figure 15-13: Prometheus initiated worker nodes scaling and service rescheduling

If you'd like to test that rescheduling is not executed when de-scaling nodes, please exit the cluster and send a `POST` request to Jenkins with the `scale` parameter set to `-1`.

```
exit

curl -XPOST -i \
    "http://$CLUSTER_DNS/jenkins/job/aws-scale/buildWithParameters?token=DevOps22&scale=-1"
```

I'll leave the tedious steps of checking the Jenkins logs and confirming that rescheduling was not executed to you.

Even though our goal is within our grasp, we're not yet finished. There's still one more critical case left to explore.

### Scaling Nodes When Replica State Is Pending

A replica of a service might be in the pending state. There might be quite a few reasons for that, and we won't go through all of them. Instead, we'll explore one of the most common causes behind having a replica pending deployment.

A service might have a memory reservation that cannot be fulfilled with the current cluster. Let's say that a service has its memory reservation set to 3GB. All the replicas of that service are running but, at one moment, the system scales that service by increasing the number of replicas by one. What happens if none of the nodes has 3GB of unreserved memory? Docker Swarm will set the status of the new replica to pending, hoping that 3GB will become available in the future.

Such a situation might not be discovered with any of the existing alerts. The used memory of each of the nodes might be below the threshold (e.g., 80%). The total used memory of the cluster might be below the threshold as well. All in all, the system scaled the service, but the new replica cannot be deployed because there are not enough unreserved resources, and none of the existing alerts noticed the anomaly. To make things even more complicated, if scaling was initiated as, for example, the result of slow response times, the same alert will fire again since the problem was not solved. Without the new replica, response times will continue being slow.

We can fix that problem by evaluating whether the number of containers that belong to a service matches the number of replicas. If, for example, we intend to have five replicas, we should have an alert that confirms that all five replicas are indeed running.

Before we try to create such an alert, we should explore a query that returns the number of replicas of a given service. Let's go back to the Prometheus UI.

```
open "http://$CLUSTER_DNS/monitor"
```

Since the number of replicas is the same as the number of running containers, the query can be as follows.

```
count(container_memory_usage_bytes{container_label_com_docker_swarm_service_name="go-demo_main"})
```

The query counts the number of metrics with, in this case, the label set to `go-demo_main`.
Please type the query into the *Expression* field, press the *Execute* button, and select the *Graph* tab. You should see that we are currently running three replicas of the service.

If you are a fast reader, the result of the query might have revealed six replicas. When the system updated the `go-demo_main` service, it created three new containers and removed the old ones. Metrics from the old containers might still be included in the `count`. If that's the case, wait for a few moments and repeat the query.

The expression we explored, translated into an alert, is as follows (do not try to execute it).

```
count(container_memory_usage_bytes{container_label_com_docker_swarm_service_name="go-demo_main"}) != 3
```

The alert would fire if the number of running containers (replicas) is different from the expected number (in this case `3`). Since Swarm needs a bit of time to pull images and, in case of a failure, reschedule replicas, we'd have to combine such an `IF` logic with a `FOR` statement so that the alert does not produce false positives.

There's one more thing left to discuss. How do we get the desired number of replicas? We cannot hard-code a value into the alert since that would produce undesirable results when the service is scaled. It needs to be dynamic. The alert needs to change every time the desired number of replicas changes. Fortunately, [Docker Flow Swarm Listener](http://swarmlistener.dockerflow.com/) is, among other parameters, sending the number of replicas of a service to all its notification addresses. [Docker Flow Monitor](http://monitor.dockerflow.com/), on the other hand, already has the shortcut `@replicas_running` that expands into the alert we discussed and uses the number of replicas from the listener. In other words, all we have to do is define `@replicas_running` as one more label of the service.

I forgot to mention one more thing. Prometheus is already running that alert. It was defined in the last `go-demo` stack definition. So, let's take another look at the YAML file we used previously.

```
ssh -i $KEY_NAME.pem docker@$CLUSTER_IP

cat go-demo.yml
```

We went back into the cluster and listed the contents of the `go-demo.yml` file. The relevant parts are as follows.

```
services:

  main:
    ...
    deploy:
      ...
      labels:
        ...
        - com.df.alertName.4=replicas_running
        - com.df.alertIf.4=@replicas_running
        - com.df.alertFor.4=10m
        ...
```

There's no big mystery in those labels. They follow the same pattern as all the other Prometheus-related labels we used throughout the book. The shortcut will expand into `count(container_memory_usage_bytes{container_label_com_docker_swarm_service_name="go-demo_main"}) != 3`. If we change the desired number of replicas, the listener will send a new request to the monitor, and the alert will change accordingly.

You'll notice that the `alertFor` label is set to `10m`. If a Docker image is big, it might take more than ten minutes to deploy a replica, and you might want to increase that time. On top of that, you should keep in mind that the more replicas we have, the longer it might take Swarm to deploy them all. However, since `go-demo` is very light, and we're running only a few replicas, ten minutes should be more than enough. If all the replicas are not running within ten minutes, the alert should fire.
Let's confirm that the alert is indeed registered in Prometheus.

```
exit

open "http://$CLUSTER_DNS/monitor/alerts"
```

Please observe the *godemo_main_replicas_running* alert. It should contain the definition we discussed.

We should test whether the system works, so now we need to figure out how to force Docker Swarm to create a replica in the `pending` state. But, before we do that, we need to deal with the intervals we set in the *Alertmanager* configuration. *Alertmanager* is grouping alerts by the labels `service`, `scale`, and `type`, and has the parameters `repeat_interval` and `group_interval` both set to `30m`. That means that an alert will be propagated to one of the receivers only if more than an hour has passed since the last one with the same labels. In other words, even though Prometheus is firing the alert, Alertmanager might be discarding it if less than an hour has passed since the last time we scaled the nodes.

If you are impatient and do not want to wait for an hour, we can remove *Alertmanager* and put it back up again.

```
ssh -i $KEY_NAME.pem docker@$CLUSTER_IP

docker service scale \
    monitor_alert-manager=0

docker service scale \
    monitor_alert-manager=1
```

Scaling to zero and back up to one means that *Alertmanager* starts over and, as a result, we do not need to wait for an hour to test the new alert.

Now we can go back to the task at hand. We can, for example, change the memory reservation of the service to 1.5GB. Since our nodes have 2GB each, that should result in one of the replicas being in the pending state. To be on the safe side, we'll also increase the number of replicas to four.

```
docker service update \
    --reserve-memory 1500M \
    --replicas 4 \
    go-demo_main
```

We entered the cluster and updated the service. Since Swarm is doing rolling updates, and it takes approximately twenty seconds for each replica, we should wait for a minute or two until all the replicas are updated.

Let's take a look at the stack processes.

```
docker stack ps \
    -f desired-state=running go-demo
```

The output is as follows (IDs are removed for brevity).

```
NAME            IMAGE                   NODE                                         DESIRED STATE  CURRENT STATE
go-demo_main.1  vfarcic/go-demo:latest  ip-172-31-47-61.us-east-2.compute.internal   Running        Running 2 minutes ago
go-demo_db.1    mongo:latest            ip-172-31-47-61.us-east-2.compute.internal   Running        Running 8 minutes ago
go-demo_main.2  vfarcic/go-demo:latest  ip-172-31-36-187.us-east-2.compute.internal  Running        Running 53 seconds ago
go-demo_main.3  vfarcic/go-demo:latest  ip-172-31-36-187.us-east-2.compute.internal  Running        Running 43 seconds ago
go-demo_main.4  vfarcic/go-demo:latest                                               Running        Pending 20 seconds ago
```

As you can see, Swarm could not deploy one of the replicas. None of the nodes has enough unreserved memory, so the state of one of the replicas is `pending`.

Let's see what happens with the alert.

```
exit

open "http://$CLUSTER_DNS/monitor/alerts"
```

Since the `@replicas_running` shortcut creates the labels `scale=up` and `type=node`, there's no need to modify the *Alertmanager* config nor the `aws-scale` Jenkins job. Given that we set `alertFor` to `10m`, the *godemo_main_replicas_running* alert should be red ten minutes after we executed the `docker service update` command. For a short time, you might see "strange" numbers generated by the alert.
For example, it might be in the pending state, saying that there are five containers instead of four, while we're expecting to see three. Those "strange" results might be due to caching. The alert might be taking into account the old containers, those that were replaced with the recent update. Fear not. A short while later, the alert will stop counting the old containers and will report that there are three running while it expects four. Ten minutes later, it'll fire.

Finally, let's confirm that Jenkins executed a new build.

```
open "http://$CLUSTER_DNS/jenkins/blue/organizations/jenkins/aws-scale/activity"
```

As you can see, a new build was executed or, if less than five minutes passed, is about to finish. The auto-scaling group has been modified, and a new worker node joined the cluster.

```
ssh -i $KEY_NAME.pem docker@$CLUSTER_IP

docker node ls
```

The output of the `node ls` command is as follows (IDs are removed for brevity).

```
HOSTNAME                                      STATUS  AVAILABILITY  MANAGER STATUS
ip-172-31-15-5.us-east-2.compute.internal     Ready   Active        Reachable
ip-172-31-26-189.us-east-2.compute.internal   Ready   Active        Reachable
ip-172-31-29-239.us-east-2.compute.internal   Ready   Active
ip-172-31-36-140.us-east-2.compute.internal   Ready   Active
ip-172-31-36-153.us-east-2.compute.internal   Ready   Active        Leader
```

A new worker node was created. There should be three manager and two worker nodes. If you don't see the new node, please wait for a while and re-run the `docker node ls` command.

Let's see what's going on with the replicas of the `go-demo_main` service.

```
docker stack ps \
    -f desired-state=running go-demo
```

Swarm found out that, with the additional node, there is enough unreserved memory, and deployed the pending replica.

### What Now?

We have a self-sufficient system! It can self-heal and self-adapt. It can work without any humans around. We built the [Matrix](http://www.imdb.com/title/tt0133093/)!

Well… We're not quite there yet. You will have to observe metrics, look for patterns, create new alerts, and so on. You will have to stay behind the system we built so far and continue perfecting it. What we have, for now, is a solid base that you will need to expand. You'll have to use the knowledge you gained so far and adapt the examples to suit your own needs.

There are many other combinations and formulas you might want to define as alerts. You might want to perform some actions when CPU usage is too high or when a disk is almost full. I'll leave that to you with a word of caution. Don't go crazy. Don't create too many alerts. Don't saturate humans with notifications, and try to avoid having the system collapse on itself because of unreliable alerts. Observe metrics for a while. Try to find patterns. Ask yourself what the action should be when you notice a spike. Define and validate a hypothesis. Wait some more. Repeat the cycle a few more times. You should extend your alerts only after you're confident in your observations and the actions that should be performed.

Before we move on, please delete the stack we created.
```
exit

aws cloudformation delete-stack \
    --stack-name devops22
```

Chapter 20: Blueprint of a Self-Sufficient System

We went a long way, and we are now at the end of the first stage of the journey. What happens next is up to you. You need to extend the knowledge I tried to convey and improve the system we built. It is a base that has to be expanded to fit your needs. Every system is different, and no blueprint can be followed blindly.

Every good story needs an ending, and this one is no exception. I'll try to summarize the knowledge conveyed in the previous chapters, although I feel I should keep it brief. If you need a lengthy summary of everything we explored, it would mean I did not do my job well. Either I did not explain things clearly enough, or the content was so dull that you skipped parts of it expecting them to be summarized at the end. Please let me know if I fell short, and I'll do my best to improve. For now, I'll assume you understood the gist of the subjects we discussed and will treat this chapter as a concise summary of it all.

We split the tasks a self-sufficient system should perform into those related to services and those oriented towards infrastructure. Even though some of the tools are used in both groups, the division between the two allows us to keep a clear separation between the infrastructure and the services running on top of it.

### Service Tasks

Service tasks are related to the flows in charge of making sure that services work correctly: that the right version is deployed, that information is propagated to all dependencies, that the services are reachable, that they behave as expected, and so on. In other words, everything related to services falls into this group.

We'll divide the service-related tasks into self-healing, deployment, reconfiguration, request, and self-adaptation flows.

### Self-Healing Flow

Docker Swarm (or any other scheduler) takes care of self-healing. As long as there are enough hardware resources, it makes sure that the desired number of replicas of each service is (almost) always running. If a replica goes down, it is rescheduled. If a whole node is destroyed or loses its connection to the other managers, all the replicas that were running on it are rescheduled. Self-healing comes out of the box. However, if we want our solution to be self-sufficient and (almost) fully autonomous, there are still many other tasks we need to define.

### Deployment Flow

A commit to the code repository is, hopefully, the last human action. That is not always the case, though. No matter how smart and autonomous our system is, there will always be problems it cannot solve on its own. Still, we should aim for a system that requires no human intervention at all. Even if we cannot fully accomplish that, it remains a worthy goal that keeps us focused and prevents us from taking shortcuts.

What happens when we commit code? The code repository (e.g., GitHub) executes a webhook that sends a request to the continuous deployment tool of our choice. Throughout the book we used Jenkins but, just like any other tool we used, it can be replaced with a different solution.

The webhook trigger initiates a new Jenkins job that runs our continuous deployment (CD) pipeline. It runs unit tests, builds a new image, executes functional tests, publishes the image to Docker Hub (or any other registry), and so on. At the end of the process, the Jenkins pipeline instructs Swarm to update the service associated with the commit. At a minimum, the update should change the image associated with the service to the one we just built.
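
As a hedged illustration of that last step (the image tag below is an assumed value a real pipeline would inject, and `go-demo_main` stands in for whichever service the commit belongs to), the update usually boils down to a single command executed against the cluster:

```
# Point the service at the image the pipeline just built and published.
# The 1.7 tag is an assumption for this sketch; a real pipeline would provide it.
docker service update \
    --image vfarcic/go-demo:1.7 \
    go-demo_main
```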

一旦 Docker Swarm 收到更新服务的指令,它会执行滚动更新过程,一次替换一个副本(除非另有指定)。通过这样的过程,并假设我们的服务是以云友好的方式设计的,新版本不会造成任何停机,我们可以根据需要随时运行它们。
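
How aggressively Swarm replaces replicas is configurable. A minimal sketch, assuming the defaults are not what you want (the parallelism, delay, and failure action below are illustrative values, not settings prescribed here):

```
# Replace one replica at a time, wait 10s between batches,
# and roll back automatically if the update fails.
docker service update \
    --update-parallelism 1 \
    --update-delay 10s \
    --update-failure-action rollback \
    go-demo_main
```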

Figure 16-1: The continuous deployment flow

### Reconfiguration Flow

Deploying a new release is only part of the process. In most cases, other services need to be reconfigured to include information about the deployed service. Monitoring (e.g., Prometheus) and proxies (e.g., HAProxy or nginx) are only two examples of services that need to know about the other services in the cluster. We call them infrastructure services because, from a functional point of view, their scope is not related to the business. They are usually in charge of making the cluster operational, or at least observable.

If we run a highly dynamic cluster, the infrastructure services need to be dynamic as well. A high level of dynamism cannot be achieved by manually modifying configurations every time a business service is deployed. We must have a process that monitors changes to the services inside the cluster and updates all the services that need information about the deployed or updated ones.

There are many ways to solve the problem of automatically updating infrastructure services. In this book, we used one of many possible processes. We assumed that the information about a service is stored as its labels, which allows us to focus on the service at hand and let the rest of the system discover that information.

We used Docker Flow Swarm Listener (DFSL) to detect changes to services (new deployments, updates, and removals). Whenever a change is detected, the relevant information is sent to the specified addresses. In our case, those addresses pointed to the proxy (Docker Flow Proxy) and to Prometheus (Docker Flow Monitor). Once those services receive a request about a new (or updated, or removed) service, they change their configuration and reload their main process. With such a flow of events, we can guarantee that all infrastructure services are always up to date, without worrying about their configuration. Otherwise, we would need to create a much more complex pipeline that not only deploys a new release but also makes sure that all the other services are kept up to date.
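
A minimal sketch of what that looks like from the service owner's side, assuming the `com.df.*` label names supported by Docker Flow Proxy, and hypothetical `proxy` and `monitor` networks:

```
# DFSL picks up the com.df.* service labels and forwards them to the proxy (and the monitor),
# which reconfigure and reload themselves. Path, port, and network names are assumptions.
docker service create --name go-demo \
    --network proxy \
    --network monitor \
    --label com.df.notify=true \
    --label com.df.distribute=true \
    --label com.df.servicePath=/demo \
    --label com.df.port=8080 \
    vfarcic/go-demo
```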

Figure 16-2: The reconfiguration flow

### Request Flow

When a user (or an external client) sends a request to one of our services, that request is first captured by the Ingress network. Every port published by a service results in that port being open in Ingress. Since the scope of the network is global (swarm-wide), a request can be sent to any of the nodes. When it is captured, Ingress evaluates the request and forwards it to one of the replicas of the service that published the same port. While doing that, the Ingress network performs round-robin load balancing, making sure that all replicas receive (more or less) the same number of requests.
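
A small sketch of that behavior, assuming a service that listens on port 8080:

```
# Publishing a port opens it on the ingress network on every node.
# A request sent to any node on port 8080 is round-robin load balanced
# across the three replicas, wherever they happen to run.
docker service create --name go-demo \
    --replicas 3 \
    -p 8080:8080 \
    vfarcic/go-demo
```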

Overlay networks (Ingress being one type of them) are in charge not only of forwarding requests to the service that published the port, but also of making sure that only healthy replicas are included in the round-robin load balancing. The HEALTHCHECK defined in a Docker image is crucial for accomplishing zero-downtime deployments. When a new replica is deployed, it is not included in the load balancing algorithm until it reports that it is healthy.
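
A health check can also be defined (or overridden) at the service level. A hedged sketch, assuming the image contains `wget` and exposes a `/health` endpoint (both are assumptions for this example):

```
# A replica joins the load balancing rotation only after the health command succeeds.
docker service create --name go-demo \
    --health-cmd "wget -q -O- http://localhost:8080/health || exit 1" \
    --health-interval 10s \
    --health-timeout 5s \
    --health-retries 3 \
    vfarcic/go-demo
```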

Throughout this book, Docker Flow Proxy (DFP) was the only service that published any ports. That allowed us to channel all traffic through ports 80 and 443. Thanks to its dynamic nature, and because it works well with DFSL, we do not need to worry about the HAProxy configuration underneath it. That means that all requests entering our cluster are captured by the Ingress network and forwarded to DFP, which evaluates the request path, the domain, and other information from the request headers, and decides which service should receive the request. Once the decision is made, DFP forwards the request further. Assuming that the proxy and the destination service are attached to the same network, those forwarded requests are captured by the Overlay network again, load balanced in a round-robin fashion, and forwarded to their final destination.

Even though the flow of a request might seem complex, it is very straightforward from the perspective of a service owner. All he (or she) has to do is define a few service labels that tell the proxy the path, or the domain, that distinguishes the service from the others. Users, on the other hand, experience no downtime, no matter how often we deploy new releases.

Figure 16-3: The flow of a request to a service

### Self-Adaptation Flow

Once we managed to create flows that deploy new releases without downtime while reconfiguring all dependent services along the way, we could move forward and tackle self-adaptation applied to services. The goal is to create a system that scales (or de-scales) services based on metrics. That way, our services run efficiently no matter the external changes. For example, if the response time of a predefined percentile is too high, we can increase the number of replicas.

Prometheus periodically scrapes metrics, both from generic exporters and from our services. We accomplished the latter by instrumenting them. Exporters are useful for global metrics generated by, for example, containers (cAdvisor) or nodes (Node exporter). Instrumentation is useful when we want more detailed, service-specific metrics (e.g., the response time of a particular function).
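
As a hedged sketch, assuming your version of Docker Flow Monitor supports the `com.df.scrapePort` label (the port is an illustrative value), telling Prometheus to scrape an instrumented service is usually just a matter of labels:

```
# DFSL notifies DFM, which adds the service's tasks as a scrape target on the given port.
docker service update \
    --label-add com.df.notify=true \
    --label-add com.df.scrapePort=8080 \
    go-demo_main
```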

We configured Prometheus through Docker Flow Monitor (DFM) not only to scrape metrics from exporters and instrumented services, but also to evaluate alerts and send those that fire to Alertmanager. Alertmanager filters the fired alerts and sends notifications to other parts of the system (internal or external).

Whenever possible, alert notifications should be sent to one or more services that will automatically "correct" the state of the cluster. For example, an alert fired because a service's response times are too long should result in that service being scaled up. Such an action is relatively easy to script. It is a repeatable operation that a machine can easily perform, so there is no reason to waste human time on it. We used Jenkins as the tool that allows us to execute tasks like scaling (up or down).
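
That is exactly what the Alertmanager receivers defined earlier do; the webhook they send is equivalent to the request below (the parameters mirror the `service-scale` receiver configuration we used).

```
# Ask Jenkins to scale the go-demo_main service up by one replica.
curl -XPOST \
    "http://$CLUSTER_DNS/jenkins/job/service-scale/buildWithParameters?token=DevOps22&service=go-demo_main&scale=1"
```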

Notifications should be sent to humans only when an alert is the result of an unpredictable situation. Alerts based on conditions that have never happened before are good candidates for human intervention. We are good at solving unexpected problems; machines are good at executing repeatable tasks. Still, even in those never-seen-before cases, we (humans) should not only solve the problem but also create a script that repeats the same steps the next time the same issue occurs. The first time an alert fires and notifies a human, it should be converted into a notification to a machine that will follow the steps we performed. In other words, solve the problem yourself the first time it happens, and let the machine repeat the solution if it happens again. Throughout the book, we used Slack as the notification engine for messages aimed at humans, and Jenkins as the machine-side receiver of those notifications.

Figure 16-4: The self-adaptation flow applied to services

### Infrastructure Tasks

Infrastructure tasks are related to the flows that make sure that the hardware is operational and that the nodes form a cluster. Just like service replicas, those nodes are dynamic. Their numbers fluctuate as a result of the ever-changing demand behind the services. Everything related to hardware or, more commonly, to VMs and their ability to be members of a cluster, falls into this group.

We'll divide the infrastructure-related tasks into self-healing, request, and self-adaptation flows.

### Self-Healing Flow

A system that manages infrastructure automatically is not that different from the one we built around services. Just as Docker Swarm (or any other scheduler) makes sure that services are (almost) always running with the desired capacity, auto-scaling groups in AWS make sure that the desired number of nodes is (almost) always available. Most other hosting vendors and on-premise solutions have a similar feature under a different name.
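
Under the hood, "making sure" usually means changing the desired capacity of the group, which is what our scaling job did through the AWS CLI. A hedged sketch (the group name and the capacity are illustrative assumptions):

```
# Change the desired number of worker nodes; AWS creates or destroys VMs to match it.
aws autoscaling update-auto-scaling-group \
    --auto-scaling-group-name devops22-worker-asg \
    --desired-capacity 3
```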

Auto-scaling groups are only part of the self-healing solution applied to infrastructure. Recreating a failed node is not enough. We need a process that joins that node to the existing cluster. Throughout the book, we used Docker For AWS, which already provides a solution to that problem. Every node runs a few system containers. One of them periodically checks whether the node it runs on is the lead manager. If it is, it stores information such as the join tokens and the node's IP in a central location (at the time of this writing, DynamoDB). When a new node is created, the system containers retrieve that data and use it to join the cluster.

If you are not using Docker For AWS or Azure, you might need to roll up your sleeves and write a solution yourself or, if you are lazy, find an existing one. There are plenty of open source snippets that can help you.

No matter which solution you choose (or build yourself), the steps are almost always the same. Create an auto-scaling group (or whatever your hosting provider offers) that maintains the desired number of nodes. Store the join tokens and the IP of a lead manager in a fault-tolerant location (an external database, a service registry, a network drive, and so on), and use them to join new nodes to the cluster.
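
A hedged sketch of those generic steps, leaving the fault-tolerant storage of the token up to you (`$MANAGER_IP` and the storage mechanism are assumptions, not the Docker For AWS implementation):

```
# On a manager: retrieve the worker join token and persist it somewhere fault tolerant.
TOKEN=$(docker swarm join-token -q worker)

# On a newly created node: read the token back and join the cluster.
docker swarm join --token $TOKEN $MANAGER_IP:2377
```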

Finally, stateful services are unavoidable. Even if all the services you develop are stateless, the state has to be stored somewhere. In some cases, we need to persist state on disk. Using local storage is not an option. Sooner or later, a replica will be rescheduled, and it might end up on a different node. That can be caused by a process failure, an upgrade, or a node becoming non-operational. No matter the reason for the rescheduling, the fact is that we must assume that a replica will not run on the same node forever. The only reasonable way to prevent data loss when state is stored on disk is to use a network drive or a distributed file system. Throughout the book, we used AWS Elastic File System (EFS) since it works across multiple availability zones. In other cases, you might opt for EBS if IO speed is critical. If you choose a different vendor, the solution will be different, but the logic will be the same: create a network drive and attach it to a service as a volume. Docker for AWS and Azure comes with the CloudStor volume driver. If you chose a different solution to create your cluster, you might need to look for a different driver. REX-Ray is one of the options since it supports most of the commonly used hosting vendors and operating systems.
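
A hedged sketch of attaching such a volume on Docker for AWS (the volume name, target path, and image are illustrative assumptions):

```
# The cloudstor:aws driver backs the volume with shared storage, so the data survives
# rescheduling of the replica to another node.
docker service create --name jenkins \
    --mount "type=volume,source=jenkins-home,target=/var/jenkins_home,volume-driver=cloudstor:aws" \
    jenkins/jenkins:lts
```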

Before you start using volumes attached to network drives, make sure you really need them. A common mistake is to assume that the state generated by a database needs to be persisted. While that is true in some cases, in many others it is not. Modern databases can replicate data across different instances. In such a setup, persisting the data might not be needed (it might not even be welcome). If multiple instances have the same data, the failure of one of them does not mean that the data is lost. That instance will be rescheduled and, when configured correctly, it will retrieve the data from one of the replicas that did not fail.

Figure 16-5: The self-healing flow applied to infrastructure

### Request Flow

We already explored how to make sure that requests initiated by users or clients outside the cluster reach the destination service. However, one crucial piece is still missing. We guaranteed that a request finds its path once it enters the cluster, but we did not provide enough assurance that it will reach the cluster in the first place. We cannot configure DNS with the IP of one of the nodes, since that server might fail at any moment. We have to add something between the DNS and the cluster. That something should have a single goal: it should make sure that a request reaches any of the healthy nodes. It does not matter which one, since the Ingress network will take over and initiate the request flow we discussed. The element in between can be an external load balancer, an Elastic IP, or any other solution. As long as it is fault tolerant and capable of performing health checks to determine which nodes are operational, any of them should do. The only challenge is to make sure that the list of nodes is always up to date, meaning that nodes added to the cluster should be added to that list as well. That might be overkill, and you might want to reduce the scope to current and future manager nodes. Fortunately, Docker For AWS (or Azure) already has that feature baked into its template and system-level containers. Still, if you use a different solution to create your cluster, it should be relatively easy to find a similar alternative or to write your own.

Figure 16-6: The request flow applied to infrastructure

### Self-Adaptation Flow

Self-adaptation applied to infrastructure is conceptually the same as the one applied to services. We need to collect metrics and store them somewhere (Prometheus), and we need to define alerts and have a system that evaluates them against those metrics (Prometheus). When alerts reach their thresholds and the specified time passes, they need to be filtered and, depending on the problem, converted into notifications sent to other services (Alertmanager). We used Jenkins as the receiver of those notifications. If a problem can be solved by the system, a predefined set of actions is executed. Since our examples used AWS, Jenkins would run those tasks through the AWS CLI. On the other hand, if an alert results in a new problem that requires a creative solution, the final receiver of the notification is a human (in our case, through Slack).

Figure 16-7: The self-adaptation flow applied to infrastructure

### Logic Matters, Tools Might Vary

Do not take the tools we used so far for granted. Technology changes far too quickly. By the time you read this, at least one of them will be obsolete. There might already be better alternatives. Technology changes so fast that we could not keep up even if we dedicated all our time to evaluating "new toys".

Processes and logic are not static or eternal either. They should not be taken for granted, nor followed forever. There is no such thing as a "best practice forever". Still, logic changes much more slowly than tools. It is more important because it lasts longer.

I believe that the logic and the flows described in this book will outlive the tools we used. Judge their value for yourself. Explore other tools and look for those that fit your goals better. As for me, I have not even finished this book, and I can already see places where the tools we used could be improved. Some of them could be replaced with better ones; others might not have been the best choice to begin with. But that does not matter as much as it seems. What truly matters are the processes and the logic, and I hope that those we explored will last a while longer.

Do not let this pessimistic attitude stop you from implementing what you learned. Blame it on me and my never-ending quest to find better and more effective ways of doing things.

### What Now?

This is the end. Go apply what you learned, improve it, and give back to the community.

So long, and thanks for all the fish.
