烂翻译系列之学习领域驱动设计——第十六章:数据网格
So far in this book, we have discussed models used to build operational systems. Operational systems implement real-time transactions that manipulate the system’s data and orchestrate its day-to-day interactions with its environment. These models represent the system’s online transactional processing (OLTP) data. Another type of data that deserves attention and proper modeling is online analytical processing (OLAP) data.
到目前为止,在本书中,我们已经讨论了用于构建运营系统的模型。运营系统执行实时事务,这些事务会操作系统的数据并协调其与环境的日常交互。这些模型代表的是系统的联机事务处理(OLTP)数据。另一种值得关注和正确建模的数据类型是联机分析处理(OLAP)数据。
In this chapter, you will learn about the analytical data management architecture called data mesh. You will see how the data mesh–based architecture works and how it differs from the more traditional OLAP data management approaches. Ultimately, you will see how domain-driven design and data mesh accommodate each other. But first, let’s see what these analytical data models are and why we can’t just reuse the operational models for analytical use cases.
在本章中,您将了解一种名为数据网格(Data Mesh)的分析数据管理架构。您将看到基于数据网格的架构是如何工作的,以及它与更传统的OLAP数据管理方法有何不同。最终,您将看到领域驱动设计(Domain-Driven Design)和数据网格是如何相互适应的。但是首先,让我们来看看这些分析性数据模型是什么,以及为什么我们不能直接将运营性模型重用于分析性用例。
Analytical Data Model Versus Transactional Data Model
分析性数据模型与事务性数据模型的比较
They say knowledge is power. Analytical data is the knowledge that gives companies the power to leverage accumulated data to gain insights into how to optimize the business, better understand customers’ needs, and even make automated decisions by training machine learning (ML) models.
人们常说知识就是力量。分析性数据就是这样一种知识,它赋予公司力量:利用积累的数据洞察如何优化业务、更好地理解客户的需求,甚至通过训练机器学习(ML)模型做出自动化决策。
The analytical models (OLAP) and operational models (OLTP) serve different types of consumers, enable the implementation of different kinds of use cases, and are therefore designed following different design principles.
分析性模型(OLAP)和运营性模型(OLTP)服务于不同类型的用户,能够实现不同类型的用例,因此它们遵循的是不同的设计原则。
Operational models are built around the various entities from the system’s business domain, implementing their lifecycles and orchestrating their interactions with one another. These models, depicted in Figure 16-1, are serving operational systems and hence have to be optimized to support real-time business transactions.
运营性模型是根据系统业务领域的各种实体构建的,实现了这些实体的生命周期并协调了它们之间的交互。如图16-1所示,这些模型服务于运营系统,因此必须进行优化以支持实时业务交易。
Figure 16-1. A relational database schema describing the relationships between entities in an operational model
图16-1. 一个描述运营性模型中实体之间关系的关系数据库模式
Analytical models are designed to provide different insights into the operational systems. Instead of implementing real-time transactions, an analytical model aims to provide insights into the performance of business activities and, more importantly, how the business can optimize its operations to achieve greater value.
分析性模型旨在提供对运营系统不同的见解。分析性模型不是实现实时事务,而是旨在提供对业务活动性能的见解,更重要的是,如何优化业务运营以实现更大的价值。
From a data structure perspective, OLAP models ignore the individual business entities and instead focus on business activities by modeling fact tables and dimension tables. We’ll take a closer look at each of these tables next.
从数据结构的角度来看,OLAP模型忽略了单个业务实体,而是通过建模事实表和维度表来关注业务活动。接下来,我们将更详细地了解这些表。
Fact Table
事实表
Facts represent business activities that have already happened. Facts are similar to the notion of domain events in the sense that both describe things that happened in the past. However, contrary to domain events, there is no stylistic requirement to name facts as verbs in the past tense. Still, facts represent activities of business processes. For example, a fact table Fact_CustomerOnboardings would contain a record for each new onboarded customer and Fact_Sales a record for each committed sale. Figure 16-2 shows an example of a fact table.
事实代表已经发生的业务活动。从描述过去发生的事情的角度来看,事实与领域事件的概念相似。然而,与领域事件不同,对事实的命名没有必须使用过去时态动词的风格要求。尽管如此,事实仍然代表业务流程中的活动。例如,事实表Fact_CustomerOnboardings会为每位新完成注册(onboarding)的客户包含一条记录,而Fact_Sales则为每笔已完成的销售包含一条记录。图16-2展示了一个事实表的示例。
Figure 16-2. A fact table containing records for cases solved by a company’s support desk
图16-2. 一个事实表,包含了公司支持部门解决案例的记录
Also, similar to domain events, fact records are never deleted or modified: analytical data is append-only. The only way to express that existing data is outdated is to append a new record reflecting the current state. Consider the fact table Fact_CaseStatus in Figure 16-3. It contains the measurements of the statuses of support requests through time. There is no explicit verb in the fact name, but the business process captured by the fact is the process of taking care of support cases.
同样,与领域事件类似,事实记录也永远不会被删除或修改:分析性数据是只追加的数据:表达当前数据已过时的唯一方法是追加一条包含当前状态的新记录。请考虑图16-3中的事实表Fact_CaseStatus。它包含了随时间变化的支持请求状态的度量。在事实名称中没有明确的动词,但事实所捕获的业务过程是处理支持案例的过程。
Figure 16-3. A fact table describing state changes during the lifecycle of a support case
图16-3. 描述支持案例生命周期中状态变化的事实表
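As an illustration, the append-only rule can be sketched with an in-memory SQLite table. The table and column names below are assumptions for the sketch, not the book’s schema:

作为示例,可以用一个内存SQLite表来勾画这条只追加规则。下面的表名和列名是为示例假设的,并非书中的模式:

```python
import sqlite3

# Append-only fact table: a case's status is never updated in place;
# a new snapshot row is appended, and the latest row per case is the
# current state.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE Fact_CaseStatus (
        case_id  INTEGER NOT NULL,
        status   TEXT    NOT NULL,
        taken_at TEXT    NOT NULL  -- snapshot timestamp
    )
""")

# The case moves from OPENED to CLOSED: two appended rows, zero UPDATEs.
conn.execute("INSERT INTO Fact_CaseStatus VALUES (17, 'OPENED', '2023-06-01T10:00')")
conn.execute("INSERT INTO Fact_CaseStatus VALUES (17, 'CLOSED', '2023-06-01T10:30')")

# The current state is the most recent snapshot for the case.
current = conn.execute("""
    SELECT status FROM Fact_CaseStatus
    WHERE case_id = 17
    ORDER BY taken_at DESC LIMIT 1
""").fetchone()[0]
print(current)  # CLOSED
```

Note that outdated rows remain queryable: the full status history of case 17 is preserved, which is exactly what the analytical model needs. 请注意,过时的行仍然可以被查询:案例17的完整状态历史被保留了下来,而这正是分析性模型所需要的。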
Another significant difference between the OLAP and OLTP models is the granularity of the data. Operational systems require the most precise data to handle business transactions. For analytical models, aggregated data is more efficient in many use cases. For example, in the Fact_CaseStatus table shown in Figure 16-3, you can see that the snapshots are taken every 30 minutes. The data analysts working with the model decide what level of granularity will best suit their needs. Creating a fact record for each change of the measurement—for example, each change of a case’s data—would be wasteful in some cases and even technically impossible in others.
OLAP 和 OLTP 模型之间的另一个显著差异是数据的粒度。业务系统需要最精确的数据来处理业务事务。对于分析性模型,聚合数据在许多用例中更有效。例如,在图16-3所示的Fact_CaseStatus表中,您可以看到每隔30分钟做一次快照。使用模型的数据分析人员决定什么级别的粒度最适合他们的需要。为度量的每一个变化(例如,案例数据的每一个变化)创建一个事实记录,在某些情况下是浪费的,在另一些情况下甚至在技术上是不可行的。
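The snapshot-based granularity described above can be sketched as follows. The events and statuses here are hypothetical, and the 30-minute interval mirrors the Fact_CaseStatus example:

上文描述的基于快照的粒度可以用如下代码来勾画。这里的事件和状态是假设的,30分钟的间隔对应于Fact_CaseStatus的示例:

```python
from datetime import datetime, timedelta

# Hypothetical fine-grained events: every status change of support case 17.
events = [
    (datetime(2023, 6, 1, 10, 2), "OPENED"),
    (datetime(2023, 6, 1, 10, 7), "IN_PROGRESS"),
    (datetime(2023, 6, 1, 10, 44), "PENDING_CUSTOMER"),
    (datetime(2023, 6, 1, 11, 31), "CLOSED"),
]

def snapshots(events, start, end, interval=timedelta(minutes=30)):
    """Reduce fine-grained change events to fixed-interval fact records:
    at each tick, record the latest status observed so far."""
    rows, t = [], start
    while t <= end:
        latest = None
        for ts, status in events:
            if ts <= t:
                latest = status
        rows.append((t, latest))
        t += interval
    return rows

rows = snapshots(events, datetime(2023, 6, 1, 10, 30), datetime(2023, 6, 1, 11, 30))
# Three 30-minute snapshots instead of four raw change events.
for t, status in rows:
    print(t.strftime("%H:%M"), status)
```

The data analysts choose the interval: a wider interval trades precision for a smaller, cheaper fact table. 数据分析师来选择间隔:更宽的间隔以精度换取更小、更廉价的事实表。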
Dimension Table
维度表
Another essential building block of an analytical model is a dimension. If a fact represents a business process or action (a verb), a dimension describes the fact (an adjective).
分析性模型的另一个基本构建块是维度。如果事实表示业务流程或动作(动词),则维度描述事实(形容词)。
The dimensions are designed to describe the facts’ attributes and are referenced as a foreign key from a fact table to a dimension table. The attributes modeled as dimensions are any measurements or data that is repeated across different fact records and cannot fit in a single column. For example, the schema in Figure 16-4 augments the SolvedCases fact with its dimensions.
维度旨在描述事实的属性,并且作为外键从事实表引用到维度表。作为维度建模的属性是任何在不同事实记录中重复且无法放入单个列中的度量值或数据。例如,图16-4中的模式通过其维度扩展了SolvedCases事实。
The reason for the high normalization of the dimensions is the analytical system’s need to support flexible querying. That’s another difference between operational and analytical models. It’s possible to predict how an operational model will be queried to support the business requirements. The querying patterns of the analytical models are not predictable. The data analysts need flexible ways of looking at the data, and it’s hard to predict what queries will be executed in the future. As a result, the normalization supports dynamic querying and filtering, and grouping the facts data across the different dimensions.
维度高度规范化的原因是分析性系统需要支持灵活的查询。这是运营性模型和分析性模型之间的另一个区别。可以预测运营性模型如何被查询以支持业务需求。但是,分析性模型的查询模式是不可预测的。数据分析师需要以灵活的方式查看数据,并且很难预测未来会执行哪些查询。因此,规范化支持跨不同维度的动态查询、过滤和事实数据分组。
Analytical Models
分析性模型
The table structure depicted in Figure 16-5 is called the star schema. It is based on the many-to-one relationships between the facts and their dimensions: each dimension record is used by many facts; a fact’s foreign key points to a single dimension record.
图16-5中所示的表结构被称为星型模式(Star Schema)。它基于事实与其维度之间的多对一关系:每个维度记录被多个事实使用;事实的外键指向单个维度记录。
Figure 16-5. The many-to-one relationship between facts and their dimensions
图16-5. 事实与维度之间的多对一关系
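A minimal sketch of the star schema’s many-to-one relationship, again in SQLite, with hypothetical fact and dimension tables:

下面用SQLite最小化地勾画星型模式中的多对一关系,事实表和维度表均为假设:

```python
import sqlite3

# Toy star schema: many fact rows point to one dimension row via a
# foreign key; queries slice the facts by dimension attributes.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Dim_Agents (
        agent_id INTEGER PRIMARY KEY,
        team     TEXT NOT NULL
    );
    CREATE TABLE Fact_SolvedCases (
        case_id          INTEGER PRIMARY KEY,
        agent_id         INTEGER NOT NULL REFERENCES Dim_Agents(agent_id),
        minutes_to_solve INTEGER NOT NULL
    );
    INSERT INTO Dim_Agents VALUES (1, 'Tier 1'), (2, 'Tier 2');
    INSERT INTO Fact_SolvedCases VALUES
        (100, 1, 30), (101, 1, 45), (102, 2, 120);
""")

# Many-to-one: three facts, two dimension records. Group the facts
# by a dimension attribute (the agent's team).
result = conn.execute("""
    SELECT d.team, COUNT(*), AVG(f.minutes_to_solve)
    FROM Fact_SolvedCases f
    JOIN Dim_Agents d ON d.agent_id = f.agent_id
    GROUP BY d.team
    ORDER BY d.team
""").fetchall()
print(result)  # [('Tier 1', 2, 37.5), ('Tier 2', 1, 120.0)]
```

In a snowflake schema, Dim_Agents itself would be further normalized (for example, into a separate teams dimension), adding one more join to the same query. 在雪花模式中,Dim_Agents本身还会被进一步规范化(例如拆分出单独的团队维度),从而给同一查询再增加一次联接。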
Another predominant analytical model is the snowflake schema. The snowflake schema is based on the same building blocks: facts and dimensions. However, in the snowflake schema, the dimensions are multilevel: each dimension is further normalized into more fine-grained dimensions, as shown in Figure 16-6.
另一种主要的分析性模型是雪花模式(Snowflake Schema)。雪花模式基于相同的构建块:事实和维度。但是,在雪花模式中,维度是多层次的:每个维度都进一步规范化为更细粒度的维度,如图16-6所示。
As a result of the additional normalization, the snowflake schema will use less space to store the dimension data and is easier to maintain. However, querying the facts’ data will require joining more tables, and therefore, more computational resources are needed.
由于额外的规范化,雪花模式将使用更少的空间来存储维度数据,并且更容易维护。但是,查询事实数据将需要联接更多的表,因此,需要更多的计算资源。
Both the star and snowflake schemas allow data analysts to analyze business performance, gaining insights into what can be optimized and built into business intelligence (BI) reports.
星型和雪花模式都允许数据分析师分析业务性能,深入了解可以优化的内容,并将其构建到商业智能(BI)报告中。
Figure 16-6. Multilevel dimensions in the snowflake schema
图16-6. 雪花模式中的多层次维度
Let’s shift the discussion from analytical modeling to data management architectures that support generating and serving analytical data. In this section, we will discuss two common analytical data architectures: data warehouse and data lake. You will learn the basic working principles of each architecture, how they differ from each other, and the challenges of each approach. Knowledge of how the two architectures work will build the foundation for discussing the main topic of this chapter: the data mesh paradigm and its interplay with domain-driven design.
让我们将讨论从分析性建模转移到支持生成和服务分析性数据的数据管理架构。在本节中,我们将讨论两种常见的分析性数据架构: 数据仓库和数据湖。您将了解每种架构的基本工作原理、它们之间的差异以及每种方法面临的挑战。了解这两种架构的工作原理将为本章主要主题“数据网格范式及其与领域驱动设计的相互作用”的讨论奠定基础。
Data Warehouse
数据仓库
The data warehouse (DWH) architecture is relatively straightforward. Extract data from all of the enterprise’s operational systems, transform the source data into an analytical model, and load the resultant data into a data analysis–oriented database. This database is the data warehouse.
数据仓库(DWH)架构相对简单。它从企业的所有运营系统中提取数据,将源数据转换为分析性模型,并将结果数据加载到面向数据分析的数据库中。这个数据库就是数据仓库。
This data management architecture is based primarily on the extract-transform-load (ETL) scripts. The data can come from various sources: operational databases, streaming events, logs, and so on. In addition to translating the source data into a facts/dimensions-based model, the transformation step may include additional operations such as removing sensitive data, deduplicating records, reordering events, aggregating fine-grained events, and more. In some cases, the transformation may require temporary storage for the incoming data. This is known as the staging area.
这种数据管理架构主要基于提取-转换-加载(ETL)脚本。数据可以来自各种来源:运营数据库、流事件、日志等。除了将源数据转换为基于事实/维度的模型外,转换步骤还可能包括其他操作,如删除敏感数据、删除重复记录、重新排序事件、聚合细粒度事件等。在某些情况下,转换可能需要为传入数据提供临时存储。这被称为暂存区。
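The transformation step can be sketched as a simplified, hypothetical transform that deduplicates records and drops sensitive data before loading:

转换步骤可以勾画为一个简化的、假设性的转换:在加载之前对记录去重并删除敏感数据:

```python
# Minimal extract-transform-load sketch (all names are hypothetical).
# The transform deduplicates records, drops a sensitive field, and
# reshapes operational rows into analytical fact records.

def transform(source_rows):
    seen, facts = set(), []
    for row in source_rows:
        if row["order_id"] in seen:  # deduplicate
            continue
        seen.add(row["order_id"])
        facts.append({               # reshape into a fact record
            "order_id": row["order_id"],
            "amount": row["amount"],
            # sensitive data (customer email) is deliberately not copied
        })
    return facts

extracted = [
    {"order_id": 1, "amount": 99, "email": "a@example.com"},
    {"order_id": 1, "amount": 99, "email": "a@example.com"},  # duplicate
    {"order_id": 2, "amount": 10, "email": "b@example.com"},
]
loaded = transform(extracted)
print(loaded)  # two fact records, no email field
```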
The resultant data warehouse, shown in Figure 16-7, contains analytical data covering all of the enterprise’s business processes. The data is exposed using the SQL language (or one of its dialects) and is used by data analysts and BI engineers.
如图16-7所示,生成的数据仓库包含覆盖企业所有业务流程的分析性数据。这些数据通过SQL语言(或其方言之一)暴露出来,供数据分析师和商业智能工程师使用。
Figure 16-7. A typical enterprise data warehouse architecture
图16-7. 典型的企业数据仓库架构
The careful reader will notice that the data warehouse architecture shares some of the challenges discussed in Chapters 2 and 3.
细心的读者会注意到,数据仓库架构与第2章和第3章中讨论的一些挑战存在共性。
First, at the heart of the data warehouse architecture is the goal of building an enterprise-wide model. The model should describe the data produced by all of the enterprise’s systems and address all of the different use cases for analytical data. The analytical model enables, for example, optimizing the business, reducing operational costs, making intelligent business decisions, reporting, and even training ML models. As you learned in Chapter 3, such an approach is impractical for anything but the smallest organizations. Designing a model for the task at hand, such as building reports or training ML models, is a much more effective and scalable approach.
首先,数据仓库架构的核心目标是构建一个企业级的模型。该模型应描述企业所有系统生成的数据,并处理分析性数据的所有不同用例。例如,分析性模型能够优化业务、降低运营成本、做出明智的商业决策、报告,甚至训练机器学习模型。正如您在第3章中所学的那样,除了最小型的组织之外,这种方法对任何组织来说都是不切实际的。为当前任务(如构建报告或训练机器学习模型)设计模型是一种更有效、更具可扩展性的方法。
The challenge of building an all-encompassing model can be partly addressed by the use of data marts. A data mart is a database that holds data relevant for well-defined analytical needs, such as analysis of a single business department. In the data mart model shown in Figure 16-8, one mart is populated directly by an ETL process from an operational system, while another mart extracts its data from the data warehouse.
构建包罗万象的模型所面临的挑战可以通过使用数据集市(Data Mart)来部分解决。数据集市是一个数据库,用于存储与明确定义的分析需求(如单个业务部门的分析)相关的数据。在图16-8所示的数据集市模型中,一个集市直接通过ETL过程从运营系统中填充数据,而另一个集市则从数据仓库中提取其数据。
Figure 16-8. The enterprise data warehouse architecture augmented with data marts
图16-8. 结合数据集市的企业数据仓库架构
When the data is ingested into a data mart from the enterprise data warehouse, the enterprise-wide model still needs to be defined in the data warehouse. Alternatively, data marts can implement dedicated ETL processes to ingest data directly from the operational systems. In this case, the resultant model makes it challenging to query data across different marts—for example, across different departments—as it requires a cross-database query and significantly impacts performance.
当数据从企业数据仓库被摄入到数据集市时,仍然需要在数据仓库中定义企业级的模型。或者,数据集市可以实施专用的ETL过程,直接从运营系统中摄取数据。在这种情况下,生成的模型使得跨不同集市(例如跨不同部门)查询数据变得困难,因为这需要跨数据库查询,并会显著影响性能。
Another challenging aspect of the data warehouse architecture is that the ETL processes create a strong coupling between the analytical (OLAP) and the operational (OLTP) systems. The data consumed by the ETL scripts is not necessarily exposed through the system’s public interfaces. Often, DWH systems simply fetch all the data residing in the operational systems’ databases. The schema used in the operational database is not a public interface, but rather an internal implementation detail. As a result, a slight change in the schema is destined to break the data warehouse’s ETL scripts. Since the operational and analytical systems are implemented and maintained by somewhat distant organizational units, the communication between the two is challenging and leads to lots of friction between the teams. This communication pattern is shown in Figure 16-9.
数据仓库架构的另一个具有挑战性的方面是,ETL过程在分析(OLAP)系统和运营(OLTP)系统之间建立了紧密的耦合。ETL脚本所使用的数据并不一定会通过系统的公共接口暴露出来。通常,DWH系统只是简单地获取运营系统数据库中存储的所有数据。运营数据库中使用的模式不是公共接口,而是内部实现细节。因此,模式的微小变化都注定会破坏数据仓库的ETL脚本。由于运营性系统和分析性系统是由相距较远的组织单位实施和维护的,因此两者之间的通信具有挑战性,并导致团队之间产生大量摩擦。这种通信模式如图16-9所示。
Figure 16-9. Data warehouse populated by fetching data directly from operational databases, ignoring the integration-oriented public interfaces
图16-9. 数据仓库通过直接从运营数据库获取数据来填充,忽略了面向集成的公共接口
The data lake architecture addresses some of the shortcomings of the data warehouse architecture.
数据湖架构解决了数据仓库架构的一些不足。
Data Lake
数据湖
Like the data warehouse architecture, the data lake architecture is based on the same notion of ingesting the operational systems’ data and transforming it into an analytical model. However, there is a conceptual difference between the two approaches.
与数据仓库架构一样,数据湖架构也基于相同的概念,即摄取运营系统的数据并将其转换为分析性模型。然而,这两种方法之间存在概念上的差异。
A data lake–based system ingests the operational systems’ data. However, instead of being transformed right away into an analytical model, the data is persisted in its raw form, that is, in the original operational model.
基于数据湖的系统会摄取运营性系统的数据。但是,这些数据并不会立即被转换成分析性模型,而是以其原始形式(即原始的运营性模型)持久保存。
Eventually, the raw data cannot fit the needs of data analysts. As a result, it is the job of the data engineers and the BI engineers to make sense of the data in the lake and implement the ETL scripts that will generate analytical models and feed them into a data warehouse. Figure 16-10 depicts a data lake architecture.
最终,原始数据无法满足数据分析师的需求。因此,数据工程师和商务智能(BI)工程师的工作就是对数据湖中的数据进行分析,并实施ETL脚本,以生成分析性模型并将其输入到数据仓库中。图16-10描绘了数据湖架构。
Figure 16-10. Data lake architecture
图16-10. 数据湖架构
Since the operational systems’ data is persisted in its original, raw form and is transformed only afterward, the data lake allows working with multiple, task-oriented analytical models. One model can be used for reporting, another for training ML models, and so on. Furthermore, new models can be added in the future and initialized with the existing raw data.
由于运营性系统的数据以其原始形式持久化,并且只在之后进行转换,因此数据湖允许使用多个面向任务的分析性模型。一个模型可用于报告,另一个模型可用于训练机器学习模型,等等。此外,未来还可以添加新的模型,并使用现有的原始数据进行初始化。
That said, the delayed generation of analytical models increases the complexity of the overall system. It’s not uncommon for data engineers to implement and support multiple versions of the same ETL script to accommodate different versions of the operational model, as shown in Figure 16-11.
也就是说,分析性模型的延迟生成增加了整个系统的复杂性。对于数据工程师来说,实现并支持同一 ETL 脚本的多个版本以适应运营性模型的不同版本并不罕见,如图16-11所示。
Figure 16-11. Multiple versions of the same ETL script accommodating different versions of the operational model
图16-11. 同一ETL脚本的多个版本,以适应运营性模型的不同版本
Furthermore, since data lakes are schema-less—there is no schema imposed on the incoming data—and there is no control over the quality of the incoming data, the data lake’s data becomes chaotic at certain levels of scale. Data lakes make it easy to ingest data but much more challenging to make use of it. Or, as is often said, a data lake becomes a data swamp. The data scientist’s job becomes orders of magnitude more complex to make sense of the chaos and to extract useful analytical data.
此外,由于数据湖是无模式的(即对输入数据没有施加模式),并且无法控制输入数据的质量,因此在一定规模下,数据湖中的数据会变得混乱。数据湖虽然使得数据摄取变得容易,但数据的使用却变得更加困难。或者,如人们常说的那样,数据湖变成了数据沼泽。数据科学家的工作复杂程度成倍增加,他们需要理清这种混乱并提取有用的分析数据。
Challenges of Data Warehouse and Data Lake Architectures
数据仓库和数据湖架构面临的挑战
Both data warehouse and data lake architectures are based on the assumption that the more data that is ingested for analytics, the more insight the organization will gain. Both approaches, however, tend to break under the weight of “big” data. The transformation of operational to analytical models converges to thousands of unmaintainable, ad hoc ETL scripts at scale.
数据仓库和数据湖架构都基于这样一种假设,即摄入的分析数据越多,组织获得的洞察力就越多。然而,这两种方法都倾向于在“大数据”的重压下崩溃。在规模上,从运营性模型到分析性模型的转换会汇聚为成千上万的、无法维护的、特设的ETL脚本。
From a modeling perspective, both architectures trespass the boundaries of the operational systems and create dependencies on their implementation details. The resultant coupling to the implementation models creates friction between the operational and analytical systems teams, often to the point of preventing changes to the operational models for the sake of not breaking the analysis system’s ETL jobs.
从建模的角度来看,这两种架构都跨越了运营性系统的边界,并对其实现细节产生了依赖。由此产生的对实现模型的耦合在运营性系统和分析系统团队之间产生了摩擦,经常导致这样的后果:为了不破坏分析性系统的ETL作业而阻止对运营性模型进行更改。
To make matters worse, since the data analysts and data engineers belong to a separate organizational unit, they often lack the deep knowledge of the business domain possessed by the operational systems’ development teams. Instead of the knowledge of the business domain, they are specialized mainly in big data tooling.
更糟糕的是,由于数据分析师和数据工程师属于不同的组织单位,他们往往缺乏业务系统开发团队所拥有的深入的业务领域知识。他们主要的专业知识在于大数据工具,而不是业务领域知识。
Last but not least, the coupling to the implementation models is especially acute in domain-driven design–based projects, in which the emphasis is on continuously evolving and improving the business domain’s models. As a result, a change in the operational model can have unforeseen consequences in the analytical model. Such changes are frequent in DDD projects and often result in friction between R&D and data teams.
最后但同样重要的是,在实现模型上的耦合在基于领域驱动设计(DDD)的项目中尤为严重,这类项目强调不断演进和改进业务领域的模型。因此,运营性模型的更改可能会对分析性模型产生不可预见的后果。在DDD项目中,此类更改频繁发生,并经常导致研发团队和数据团队之间的摩擦。
These limitations of data warehouses and data lakes inspired a new analytical data management architecture: data mesh.
数据仓库和数据湖的这些局限性催生了一种新的分析数据管理架构:数据网格(Data Mesh)。
Data Mesh
数据网格
The data mesh architecture is, in a sense, domain-driven design for analytical data. As the different patterns of DDD draw boundaries and protect their contents, the data mesh architecture defines and protects model and ownership boundaries for analytical data.
从某种意义上说,数据网格架构是分析性数据的领域驱动设计。就像DDD的不同模式划定界限并保护其内容一样,数据网格架构也定义并保护了分析性数据的模型和所有权界限。
The data mesh architecture is based on four core principles: decompose data around domains, data as a product, enable autonomy, and build an ecosystem. Let’s discuss each principle in detail.
数据网格架构基于四个核心原则:围绕领域分解数据、数据即产品、实现自治以及构建生态系统。让我们详细讨论每个原则。
Decompose Data Around Domains
围绕领域分解数据
Both the data warehouse and data lake approaches aim to unify all of the enterprise’s data into one big model. The resultant analytical model is ineffective for all the same reasons as an enterprise-wide operational model is. Furthermore, gathering data from all systems into one location blurs the ownership boundaries of the various data elements.
数据仓库和数据湖的方法都旨在将企业的所有数据统一到一个大型模型中。由此产生的分析性模型之所以无效,与全企业范围的运营性模型无效的原因相同。此外,将所有系统的数据收集到一个位置会模糊各种数据元素的所有权边界。
Instead of building a monolithic analytical model, the data mesh architecture calls for leveraging the same solution we discussed in Chapter 3 for operational data: use multiple analytical models and align them with the origin of the data. This naturally aligns the ownership boundaries of the analytical models with the bounded contexts’ boundaries, as shown in Figure 16-12. When the analysis model is decomposed according to the system’s bounded contexts, the generation of the analysis data becomes the responsibility of the corresponding product teams.
数据网格架构不构建单一的分析性模型,而是要求我们利用第3章中讨论的用于运营数据的相同解决方案:使用多个分析性模型并将其与数据的来源对齐。这自然地将分析性模型的所有权边界与有界上下文的界限对齐,如图16-12所示。当分析性模型根据系统的有界上下文进行分解时,生成分析性数据就成为相应产品团队的责任。
Figure 16-12. Aligning the ownership boundaries of the analytical models with the bounded contexts’ boundaries
图16-12. 将分析性模型的所有权边界与有界上下文的边界保持一致
Each bounded context now owns its operational (OLTP) and analytical (OLAP) models. Consequently, the same team owns the operational model, now in charge of transforming it into the analytical model.
现在,每个有界上下文都拥有自己的运营性(OLTP)和分析性(OLAP)模型。因此,同一个团队既拥有运营性模型,又负责将其转换为分析性模型。
Data as a Product
数据即产品
The classic data management architectures make it difficult to discover, understand, and fetch quality analytical data. This is especially acute in the case of data lakes.
传统的数据管理架构使得发现、理解和获取高质量的分析性数据变得困难。在数据湖的情况下,这一问题尤为突出。
The data as a product principle calls for treating the analytical data as a first-class citizen. Instead of the analytical systems having to get the operational data from dubious sources (internal database, logfiles, etc.), in a data mesh–based system the bounded contexts serve the analytical data through well-defined output ports, as shown in Figure 16-13.
“数据即产品”的原则要求将分析性数据视为一等公民。在基于数据网格的系统中,有界上下文通过明确定义的输出端口提供分析性数据,而不是让分析性系统从可疑的来源(内部数据库、日志文件等)获取运营性数据,如图16-13所示。
Figure 16-13. Polyglot data endpoints exposing the analytical data to the consumers
图16-13. 多语言数据端点向消费者公开分析性数据
Analytical data should be treated the same as any public API:
分析性数据应与任何公共API一样对待:
- It should be easy to discover the necessary endpoints: the data output ports. 应该很容易发现必要的端点:数据输出端口。
- The analytical endpoints should have a well-defined schema describing the served data and its format. 分析端点应该具有定义良好的架构,用于描述提供的数据及其格式。
- The analytical data should be trustworthy, and as with any API, it should have defined and monitored service-level agreements (SLAs). 分析性数据应该是值得信赖的,并且与任何API一样,它应该具有已定义和受监控的服务级别协议(SLA)。
- The analytical model should be versioned as a regular API and correspondingly manage integration-breaking changes in the model. 分析性模型应该像常规API一样进行版本控制,并相应地管理模型中的破坏性更改。
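The requirements above can be sketched as a data product port descriptor. All class and field names here are illustrative assumptions, not an API from the book:

上述要求可以勾画为一个数据产品输出端口的描述符。这里的类名和字段名都是为说明而做的假设,并非书中的API:

```python
from dataclasses import dataclass

# "Data as a product": each bounded context publishes its analytical data
# through an output port that is discoverable, schema'd, versioned, and
# bound to a monitored SLA.

@dataclass(frozen=True)
class DataProductPort:
    bounded_context: str
    name: str                   # discoverable endpoint name
    schema: dict                # well-defined schema of the served data
    version: str                # versioned like a regular API
    freshness_sla_minutes: int  # monitored service-level agreement

support_cases_port = DataProductPort(
    bounded_context="CustomerSupport",
    name="solved-cases",
    schema={"case_id": "int", "solved_on": "date", "minutes_to_solve": "int"},
    version="2.0.0",            # a major bump signals a breaking schema change
    freshness_sla_minutes=60,
)
print(support_cases_port.name, support_cases_port.version)
```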
Furthermore, since the analytical data is treated as a product, it has to address the needs of its consumers. The bounded context’s team is in charge of ensuring that the resultant model addresses the needs of its consumers. Contrary to the data warehouse and data lake architectures, with data mesh, accountability for data quality is a top-level concern.
此外,由于分析性数据被视为一种产品,因此它必须满足消费者的需求。有界上下文的团队负责确保所产生的模型满足消费者的需求。与数据仓库和数据湖架构不同,在使用数据网格时,对数据质量的责任是一个最高级别的关注。
The goal of the distributed data management architecture is to allow the fine-grained analytical models to be combined to address the organization’s data analysis needs. For example, if a BI report should reflect data from multiple bounded contexts, it should be able to easily fetch their analytical data if needed, apply local transformations, and produce the report.
分布式数据管理架构的目标是允许将细粒度的分析性模型组合起来,以满足组织的数据分析需求。例如,如果商业智能(BI)报告应反映来自多个有界上下文的数据,那么它应该能够在需要时轻松地获取这些分析性数据,应用本地转换,并生成报告。
To implement the data as a product principle, product teams need to include data-oriented specialists. That’s the missing piece in the cross-functional teams puzzle, which traditionally includes only specialists related to the operational systems.
为了实现“数据即产品”的原则,产品团队需要增加面向数据的专业人员。这是跨职能团队拼图中缺失的一块,传统上该团队只包括与运营性系统相关的专业人员。
Enable Autonomy
实现自治
The product teams should be able to both create their own data products and consume data products served by other bounded contexts. Just as in the case of bounded contexts, the data products should be interoperable.
产品团队应该能够创建自己的数据产品,并消费其他有界上下文提供的数据产品。就像有界上下文一样,数据产品也应该是可互操作的。
It would be wasteful, inefficient, and hard to integrate if each team builds their own solution for serving analytical data. To prevent this from happening, a platform is needed to abstract the complexity of building, executing, and maintaining interoperable data products. Designing and building such a platform is a considerable undertaking and requires a dedicated data infrastructure platform team.
如果每个团队都建立自己的解决方案来提供分析性数据,那么这将是一种浪费、低效且难以集成的做法。为了防止这种情况发生,需要一个平台来抽象构建、执行和维护可互操作数据产品的复杂性。设计和构建这样的平台是一项艰巨的任务,需要专门的数据基础设施平台团队。
The data infrastructure platform team should be in charge of defining the data product blueprints, unified access patterns, access control, and polyglot storage that can be leveraged by product teams, as well as monitoring the platform and ensuring that the SLAs and objectives are met.
数据基础设施平台团队应负责定义数据产品蓝图、统一访问模式、访问控制和多语言存储(多种数据库),以便产品团队可以利用它们,同时监控平台并确保满足SLA和目标。
Build an Ecosystem
建立生态系统
The final step to creating a data mesh system is to appoint a federated governance body to enable interoperability and ecosystem thinking in the domain of the analytical data. Typically, that would be a group consisting of the bounded contexts’ data and product owners and representatives of the data infrastructure platform team, as shown in Figure 16-14.
创建数据网格系统的最后一步是任命一个联合治理机构,以便在分析数据领域实现互操作性和生态系统思维。通常,这个机构将包括有界上下文的数据和产品负责人以及数据基础设施平台团队的代表,如图16-14所示。
The governance group is in charge of defining the rules to ensure a healthy and interoperable ecosystem. The rules have to be applied to all data products and their interfaces, and it’s the group’s responsibility to ensure adherence to the rules throughout the enterprise.
治理小组负责制定规则,以确保建立一个健康且可互操作的生态系统。这些规则必须应用于所有数据产品及其接口,治理小组有责任确保整个企业都遵守这些规则。
Figure 16-14. The governance group, which ensures that the distributed data analytics ecosystem is interoperable, healthy, and serves the organization’s needs
图16-14. 治理组,它确保分布式数据分析生态系统是可互操作的、健康的,并服务于组织的需求
Combining Data Mesh and Domain-Driven Design
结合数据网格和领域驱动设计
These are the four principles that the data mesh architecture is based on. The emphasis on defining boundaries, and encapsulating the implementation details behind well-defined output ports, makes it evident that the data mesh architecture is based on the same reasoning as domain-driven design. Furthermore, some of the domain-driven design patterns can greatly support implementing the data mesh architecture.
这些是数据网格架构所基于的四个原则。它强调定义边界,并将实现细节封装在明确定义的输出端口之后,这表明数据网格架构与领域驱动设计基于相同的推理。此外,一些领域驱动设计模式可以极大地支持实现数据网格架构。
First and foremost, the ubiquitous language and the resultant domain knowledge are essential for designing analytical models. As we discussed in the data warehouse and data lake sections, domain knowledge is lacking in traditional architectures.
首先,通用语言和由此产生的领域知识对于设计分析性模型是必不可少的。正如我们在数据仓库和数据湖部分所讨论的那样,传统架构中缺乏领域知识。
Second, exposing a bounded context’s data in a model that is different from its operational model is the open-host pattern. In this case, the analytical model is an additional published language.
其次,将有界上下文的数据以与其运营性模型不同的模型暴露出来,这是开放主机模式。在这种情况下,分析性模型是另一种发布语言。
The CQRS pattern makes it easy to generate multiple models of the same data. It can be leveraged to transform the operational model into an analytical model. The CQRS pattern’s ability to generate models from scratch makes it easy to generate and serve multiple versions of the analytical model simultaneously, as shown in Figure 16-15.
CQRS模式使得可以轻松生成同一数据的多个模型。它可以被用来将运营性模型转换为分析性模型。CQRS模式从头开始生成模型的能力使得同时生成和提供分析性模型的多个版本变得容易,如图16-15所示。
Figure 16-15. Leveraging the CQRS pattern to simultaneously serve the analytical data in two different schema versions
图16-15. 利用CQRS模式同时以两种不同的模式版本提供分析数据
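The idea of serving two schema versions from the same event stream can be sketched as follows. The events and projection logic are hypothetical:

从同一事件流同时提供两个模式版本的思路可以用如下代码勾画。这里的事件和投影逻辑都是假设的:

```python
# CQRS-style projections: each projector folds the same stream of domain
# events into its own read model, so multiple analytical schema versions
# can be generated from scratch and served simultaneously.

events = [
    {"type": "CaseOpened", "case_id": 1, "agent": "alice"},
    {"type": "CaseClosed", "case_id": 1, "agent": "alice"},
    {"type": "CaseOpened", "case_id": 2, "agent": "bob"},
]

def project_v1(events):
    """v1 analytical model: one row per lifecycle event (fine-grained)."""
    return [(e["case_id"], e["type"]) for e in events]

def project_v2(events):
    """v2 analytical model: aggregated count of closed cases per agent."""
    closed = {}
    for e in events:
        if e["type"] == "CaseClosed":
            closed[e["agent"]] = closed.get(e["agent"], 0) + 1
    return closed

# Both versions are rebuilt from the same event stream, so both can be
# served at once, e.g., during a consumer's migration to the new schema.
v1, v2 = project_v1(events), project_v2(events)
print(len(v1), v2)  # 3 {'alice': 1}
```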
Finally, since the data mesh architecture combines the different bounded contexts’ models to implement analytical use cases, the bounded context integration patterns for operational models apply for analytical models as well. Two product teams can evolve their analytical models in partnership. Another can implement an anticorruption layer to protect itself from an ineffective analytical model. Or, on the other hand, the teams can go their separate ways and produce duplicate implementations of analytical models.
最后,由于数据网格架构结合了不同有界上下文的模型来实现分析用例,因此操作性模型的有界上下文集成模式也适用于分析模型。两个产品团队可以合作演变他们的分析性模型。另一个团队可以实现一个防腐层,以防止自己受到无效的分析性模型的影响。或者,另一方面,团队可以采取不同的方式,生成分析模型的重复实现。
Conclusion
总结
In this chapter, you learned the different aspects of designing software systems, in particular, defining and managing analytical data. We discussed the predominant models for analytical data, including the star and snowflake schemas, and how the data is traditionally managed in data warehouses and data lakes.
在本章中,您学习了设计软件系统的不同方面,特别是定义和管理分析性数据。我们讨论了分析性数据的主要模型,包括星型和雪花型模式,以及传统的在数据仓库和数据湖中管理数据的方式。
The data mesh architecture aims to address the challenges of the traditional data management architectures. At its core, it applies the same principles as domain-driven design but to analytical data: decomposing the analytical model into manageable units and ensuring that the analytical data can be reliably accessed and used through its public interfaces. Ultimately, the CQRS and bounded context integration patterns can support implementing the data mesh architecture.
数据网格架构旨在解决传统数据管理架构面临的挑战。其核心在于,它将与领域驱动设计相同的原理应用于分析数据:将分析性模型分解为可管理的单元,并确保通过其公共接口可以可靠地访问和使用分析数据。最终,CQRS和有界上下文集成模式可以支持实现数据网格架构。
Exercises
练习
1. Which of the following statements is/are correct regarding the differences between transactional (OLTP) and analytical (OLAP) models? 关于事务处理(OLTP)和分析性(OLAP)模型之间的差异,以下哪些陈述是正确的?
a. OLAP models should expose more flexible querying options than OLTP models. OLAP模型应该比OLTP模型提供更多灵活的查询选项。
b. OLAP models are expected to undergo more updates than OLTP models, and thus have to be optimized for writes. 与OLTP模型相比,OLAP模型预计会经历更多的更新,因此必须针对写入进行优化。
c. OLTP data is optimized for real-time operations, whereas it’s acceptable to wait seconds or even minutes for an OLAP query’s response. OLTP数据针对实时操作进行了优化,而等待OLAP查询的响应几秒甚至几分钟是可以接受的。
d. A and C are correct. A 和 C 是正确的。
答案:d。
2. Which bounded context integration pattern is essential for implementation of the data mesh architecture? 对于数据网格架构的实现,哪种有界上下文集成模式至关重要?
a. Shared kernel 共享内核
b. Open-host service 开放主机服务
c. Anticorruption layer 防腐层
d. Partnership 合作伙伴关系
答案:b。One of the published languages exposed by the open-host service can be OLAP data optimized for analytical processing. 开放主机服务公开的发布语言之一可以是为分析处理而优化的 OLAP 数据。
3. Which architectural pattern is essential for implementation of the data mesh architecture? 哪种架构模式对于实现数据网格架构至关重要?
a. Layered architecture. 分层架构
b. Ports & adapters. 端口适配器
c. CQRS.
d. Architectural patterns cannot support implementation of an OLAP model. 架构模式不能支持 OLAP 模型的实现。
答案:c。The CQRS pattern can be leveraged to generate projections of the OLAP model out of the transactional model. 可以利用CQRS(命令查询职责分离)模式从事务模型中生成OLAP模型的投影。
4. The definition of data mesh architecture calls for decomposing data around “domains.” What is DDD’s term for denoting the data mesh’s domains? 数据网格架构的定义要求围绕“领域”分解数据。在领域驱动设计(DDD)中,用于表示数据网格领域的术语是什么?
a. Bounded contexts. 有界上下文
b. Business domains. 业务领域
c. Subdomains. 子域
d. There is no synonym for a data mesh’s domains in DDD. DDD 中没有数据网格领域的同义词。
答案:a。数据网格所说的"领域"对应于DDD中的有界上下文(Bounded Context):分析性模型的所有权边界与有界上下文的边界对齐,由拥有该有界上下文的产品团队负责生成相应的分析性数据。
