Data-centric Artificial Intelligence: A Survey
DAOCHEN ZHA, Computer Science, Rice University, Houston, United States
ZAID PERVAIZ BHAT, Computer Science, Texas A&M University, College Station, United States
KWEI-HERNG LAI, Computer Science, Rice University, Houston, United States
FAN YANG, Computer Science, Rice University, Houston, United States
ZHIMENG JIANG, Computer Science, Texas A&M University, College Station, United States
SHAOCHEN ZHONG, Computer Science, Rice University, Houston, United States
XIA HU, Computer Science, Rice University, Houston, United States
CCS Concepts: • Computing methodologies → Artificial intelligence;
Additional Key Words and Phrases: Artificial intelligence, machine learning, data-centric AI
1 Introduction
The past decade has witnessed dramatic progress in Artificial Intelligence (AI) across various directions, such as natural language processing [45] and computer vision [224], which has also made a profound impact on nearly every other domain, spanning recommender systems [267], healthcare [150], biology [237], finance [164], and so forth. A vital enabler of these great successes is the availability of abundant and high-quality data. Many major AI breakthroughs occur only after we have access to the right training data. For example, AlexNet [120], one of the first successful convolutional neural networks, was designed based on the ImageNet dataset [54]. AlphaFold [110], a breakthrough of AI in scientific discovery, would not have been possible without annotated protein sequences [152]. The recent advances in large language models rely on large text data for training [32, 114, 177, 178] (left of Figure 1). Besides training data, well-designed inference data, the input used in the inference phase of AI systems for making predictions, has facilitated the initial recognition of numerous critical issues in AI and unlocked new model capabilities. A famous example is adversarial samples [122] that confuse neural networks through specialized modifications of input data, which caused a surge of interest in studying AI security. Another example is prompt engineering [136], which accomplishes various tasks by solely tuning the input data to probe knowledge from the model while keeping the model fixed (right of Figure 1). In parallel, the value of data has been well-recognized in industries. Many AI-driven companies like Meta [90] have established data infrastructure to reliably and efficiently supply data for building AI systems. All these efforts in constructing training data, inference data, and the infrastructure to maintain data have paved the path for the achievements in AI today.
Recently, the role of data in AI has been significantly magnified, giving rise to the emerging concept of data-centric AI [104, 105, 172, 256]. In the conventional model-centric AI lifecycle, researchers and developers primarily focus on identifying more effective models to improve AI performance while keeping the data largely unchanged. However, this model-centric paradigm overlooks the potential quality issues and undesirable flaws of data, such as missing values, incorrect labels, and anomalies. Complementing the existing efforts in model advancement, data-centric AI emphasizes the systematic engineering of data to build AI systems, shifting our focus from model to data. It is important to note that "data-centric" differs fundamentally from "data-driven", as the latter only emphasizes the use of data to guide AI development, which typically still centers on developing models rather than engineering data.
Several initiatives have already been dedicated to the data-centric AI movement. A notable one is a competition launched by Ng et al. [159], which asks the participants to improve the performance by iterating only on the dataset. Snorkel [180] builds a system that enables automatic data annotation with heuristic functions without hand labeling. A few rising AI companies have placed data in the central role because of many benefits, such as improved accuracy, faster deployment, and standardized workflows [158, 179, 229]. These collective initiatives across academia and industry demonstrate the necessity of building AI systems using data-centric approaches.
With the growing need for data-centric AI, various methods have been proposed. Some relevant research subjects are not new. For instance, data augmentation [69] has been extensively investigated to improve data diversity. Feature selection [129] has been studied since decades ago for preparing more concise data. Meanwhile, some new research directions have emerged recently, such as data programming for labeling data quickly [181], algorithmic recourse for understanding model decisions [113], and prompt engineering that modifies the input of large language models to obtain the desirable predictions [136]. From another dimension, some works are dedicated to making data processing more automated, such as automated data augmentation [51], and automated pipeline discovery [63, 124]. Some other methods emphasize human-machine collaboration in creating data so that the model can align with human intentions. For example, the remarkable success of ChatGPT and GPT-4 [161] is largely attributed to the reinforcement learning from human feedback procedure [46], which asks humans to provide appropriate responses to prompts and rank the outputs to serve as the rewards [163]. Although the above methods are independently developed for different purposes, their common objective is to ensure data quality, quantity, and reliability so that the models behave as intended.
Fig. 1. Motivating examples that highlight the central role of data in AI. On the left, large and high-quality training data are the driving force of recent successes of GPT models, while model architectures remain similar, except for more model weights. The detailed data collection strategies of GPT models are provided in References [32, 161, 163, 177, 178, 275]. On the right, when the model becomes sufficiently powerful, we only need to engineer prompts (inference data) to accomplish our objectives, with the model being fixed.
Motivated by the need for data-centric AI and the numerous proposed methods, this survey provides a holistic view of the technological advances in data-centric AI and summarizes the existing research directions. We organize the existing literature according to a goal-driven taxonomy, examining each paper through the lenses of automation and human collaboration. This approach aims to provide a global understanding of the objectives behind each data-centric AI task, the degree to which they have been automated, and the involvement of humans in the process, thereby fostering inspiration for future advancements in this domain. In particular, this survey centers on the following research questions:
- RQ1: What are the necessary tasks to make AI data-centric?
- RQ2: Why is automation significant for developing and maintaining data?
- RQ3: In which cases and why is human participation essential in data-centric AI?
- RQ4: What is the current progress of data-centric AI?
By answering these questions, we make three contributions. First, we provide a comprehensive overview to help readers efficiently grasp a broad picture of data-centric AI from different perspectives, including definitions, tasks, algorithms, challenges, and benchmarks. Second, we organize the existing literature under a goal-driven taxonomy. We further identify whether human involvement is needed in each method and label the method with a level of automation or a degree of human participation. Last, we analyze the existing research and discuss potential future opportunities.
This survey is structured as follows. Section 2 presents an overview of the concepts related to data-centric AI. Section 3 introduces our survey methodology, including a goal-driven taxonomy, categorization of automation and human participation, related work, and strategy for paper collection. Then, we elaborate on the needs, representative methods, and challenges of three general data-centric AI goals, including training data development (Section 4), inference data development (Section 5), and data maintenance (Section 6). Section 7 summarizes benchmarks for various tasks. Section 8 discusses data-centric AI from a global view and highlights the potential future directions. Finally, we conclude this survey in Section 9.
2 Background of Data-centric AI
This section provides a background of data-centric AI. Section 2.1 defines the relevant concepts.
Section 2.2 discusses why data-centric AI is needed.
2.1 Definitions
Researchers have described data-centric AI in different ways. Ng et al. defined it as "the discipline of systematically engineering the data used to build an AI system" [157]. Jarrahi et al. mentioned that data-centric AI "advocates for a systematic and iterative approach to dealing with data issues" [105]. Miranda noted that data-centric AI focuses on the problems that "do not only involve the type of model to use, but also the quality of data at hand" [151]. While all these descriptions have emphasized the importance of data, the scope of data-centric AI remains ambiguous, i.e., what tasks and techniques belong to data-centric AI. Such ambiguity could prevent us from grasping a concrete picture of this field. Before starting the survey, it is essential to define some relevant concepts:
- Artificial Intelligence (AI): AI is a broad and interdisciplinary field that tries to enable computers to have human intelligence to solve complex tasks [240]. A dominant technique for AI is machine learning, which leverages data to train predictive models to accomplish some tasks.
- Data: Data is a very general concept to describe a collection of values that convey information. In the context of AI, data is used to train machine learning models or serve as the model input to make predictions. Data can appear in various formats, such as tabular data, images, texts, audio, and video.
- Training Data: Training data is the data used in the training phase of AI models. For machine learning, the model often leverages training data to adjust its parameters and make predictions.
- Inference Data: Inference data is the data used in the inference phase of AI models. On the one hand, it can evaluate the performance of the model after it has been trained. On the other hand, tuning the inference data can help obtain the desirable outputs, such as tuning prompts for language models [136]. Note that another term, "testing data", is commonly used, often referring to a subset of the dataset for evaluation. We instead use the term "inference data", because the data for inference may not necessarily originate from a dataset subset.
- Data Maintenance: Data maintenance refers to the process of maintaining the quality and reliability of data, which often involves efficient algorithms, tools, and infrastructures to understand and debug data. Data maintenance plays a crucial role in AI, since it ensures training and inference data are accurate and consistent [103].
- Data-centric AI: Data-centric AI refers to a framework to develop, iterate, and maintain data for AI systems [256]. Data-centric AI involves the tasks and methods for building effective training data, designing proper inference data, and maintaining the data.
Note that data-centric AI is related to data management, a very broad field entailing complex data practices for business intelligence and informed decision-making [206, 232]. Many popular topics within the data-centric AI community, such as data quality, have long been discussed in data management. However, the objective of data-centric AI differs. It does not target decision-making, which is a broader and typically business-oriented goal, but rather focuses exclusively on enhancing AI systems. As a result, data-centric AI is a discipline positioned at the intersection of AI and data management, which encompasses efforts in data management that bring benefits to AI systems. It also involves more AI-focused tasks, such as labeling training data and designing inference data to evaluate and harness AI capabilities.
2.2 Need for Data-centric AI
In the past, AI was often viewed as a model-centric field, where the focus was on advancing model designs given fixed datasets. However, the overwhelming reliance on fixed datasets does not necessarily lead to a better model behavior in real-world applications, as it overlooks the breadth, difficulty, and fidelity of data to the underlying problem [144]. Moreover, the models are often difficult to transfer from one problem to another, since they are highly specialized and tailored to specific problems. Furthermore, undervaluing data quality could trigger data cascades, "compounding events causing negative, downstream effects from data issues" [190], which can lead to negative effects for AI systems such as decreased accuracy and persistent biases [34]. This can severely hinder the applicability of AI systems, particularly in high-stakes domains.
Consequently, the attention of researchers and practitioners has gradually shifted toward datacentric AI to pursue data excellence [8]. Data-centric AI places a greater emphasis on enhancing the quality and quantity of the data with the model relatively more fixed. While this transition is still ongoing, we have already witnessed several accomplishments that shed light on its benefits. For example, the advancement of large language models is greatly dependent on the use of huge datasets [32, 114, 177, 178]. Compared to GPT-2 [178], GPT-3 [32] only made minor modifications in the neural architecture while spending efforts collecting a significantly larger high-quality dataset for training. ChatGPT [163], a remarkably successful application of GPT-3, adopts a similar neural architecture as GPT-3 and uses a reinforcement learning from human feedback procedure [46] to generate high-quality labeled data for fine-tuning. A new approach, known as prompt engineering [136], has seen significant success by focusing solely on tuning the prompts that are inputted to the model. The benefits of data-centric approaches can also be validated by practitioners [158, 179, 229]. For instance, Landing AI, a computer vision company, observes improved accuracy, reduced development time, and more consistent and scalable methods from the adoption of data-centric approaches [158]. All these achievements demonstrate the promise of data-centric AI.
It is noteworthy that data-centric AI does not diminish the value of model-centric AI. Instead, these two paradigms are complementarily interwoven in building AI systems. On the one hand, model-centric methods can be used to achieve data-centric AI goals. For example, we can utilize a generative model, such as a GAN [81, 265] or a diffusion model [97, 117, 184], to perform data augmentation and generate more high-quality data. On the other hand, data-centric AI could facilitate the improvement of model-centric AI objectives. For instance, the increased availability of augmented data could inspire further advancements in model design. Therefore, in production scenarios, data and models tend to evolve alternately in a constantly changing environment [172].
3 Survey Methodology
This section introduces the methods used to conduct this survey. Section 3.1 draws a big picture of the related tasks and presents a goal-driven taxonomy to organize the existing literature. Section 3.2 focuses on automation and human participation. Section 3.3 discusses the related surveys. Appendix A presents the paper collection strategies.
3.1 Goal-driven Taxonomy of Data-centric AI
The ambitious movement to data-centric AI cannot be realized without making progress on concrete and specific tasks. Unfortunately, most of the existing literature has been focused on discussing the foundations and perspectives of data-centric AI without clearly specifying the associated tasks [104, 105, 172, 199]. As an effort to resolve this ambiguity, the recently proposed DataPerf benchmark [144] has defined six data-centric AI tasks: training set creation, test set creation, selection algorithm, debugging algorithm, slicing algorithm, and valuation algorithm.
Fig. 2. Data-centric AI framework.
However, this flat taxonomy can only partially cover the existing data-centric AI literature. For example, some crucial tasks such as data labeling [266] are not included. The selection algorithm only addresses instance selection but not feature selection [129]. The test set creation is restricted to selecting items from a supplemental set rather than generating a new set [193]. Thus, a more nuanced taxonomy is necessary to fully encompass data-centric AI literature.
To gain a more comprehensive understanding of data-centric AI, we draw a big picture of the related tasks and present a goal-driven taxonomy to organize the existing literature in Figure 2. We divide data-centric AI into three goals: training data development, inference data development, and data maintenance, where each goal is associated with several sub-goals, and each task belongs to a sub-goal. Here, we use "goal" or "sub-goal" to describe higher-level objectives, while using "task" to denote specific and actionable steps. We give an overview of the goals below.
- Training data development: The goal of training data development is to collect and produce rich and high-quality training data to support the training of machine learning models. It consists of five sub-goals, including (1) data collection for gathering raw training data, (2) data labeling for adding informative labels, (3) data preparation for cleaning and transforming data, (4) data reduction for decreasing data size with potentially improved performance, and (5) data augmentation for enhancing data diversity without collecting more data.
- Inference data development: The objective is to create novel evaluation sets that can provide more granular insights into the model or trigger a specific capability of the model with engineered data inputs. There are three sub-goals in this effort: (1) in-distribution evaluation and (2) out-of-distribution evaluation aim to generate samples that adhere to or differ from the training data distribution, respectively, while (3) prompt engineering tunes the prompt in language models to get the desired predictions. The tasks in inference data development are relatively open-ended, since they are often designed to assess or unlock various capabilities of the model.
- Data maintenance: In real-world applications, data is not created once but rather necessitates continuous maintenance. The purpose of data maintenance is to ensure the quality and reliability of data in a dynamic environment. It involves three essential sub-goals: (1) data understanding, which targets providing visualization and valuation of the complex data, enabling humans to gain valuable insights, (2) data quality assurance, which develops quantitative measurements and quality improvement strategies to monitor and repair data, and (3) data acceleration, which aims to devise efficient algorithms to supply the data in need via properly allocating resources and efficiently processing queries. Data maintenance plays a fundamental and supportive role in the data-centric AI framework, ensuring that the data in training and inference is accurate and reliable.
Table 1. Representative Tasks Discussed in This Survey
| Goal | Sub-goal | Data-centric AI Tasks |
|---|---|---|
| Training data development | Collection (Section 4.1) | Dataset discovery, data integration, raw data synthesis |
| | Labeling (Section 4.2) | Crowdsourced labeling, semi-supervised labeling, active learning, data programming, distant supervision |
| | Preparation (Section 4.3) | Data cleaning, feature extraction, feature transformation |
| | Reduction (Section 4.4) | Feature selection, dimensionality reduction, instance selection |
| | Augmentation (Section 4.5) | Basic manipulation, augmentation data synthesis, upsampling |
| Inference data development | In-distribution (Section 5.1) | Data slicing, algorithmic recourse |
| | Out-of-distribution (Section 5.2) | Generating adversarial samples, generating samples with distribution shift |
| | Prompt engineering (Section 5.3) | Manual prompt engineering, automated prompt engineering |
| Data maintenance | Understanding (Section 6.1) | Visual summarization, clustering for visualization, visualization recommendation, valuation |
| | Quality assurance (Section 6.2) | Quality assessment, quality improvement |
| | Storage and retrieval (Section 6.3) | Resource allocation, query acceleration with index selection, query acceleration with rewriting |
Our discussion centers on three goals, each comprising several sub-goals.
It is worth noting that these three goals are not necessarily pursued in a specific order but are often intertwined. Specifically, data maintenance frequently plays a supportive role in both training and inference data development processes. For example, AI researchers may often utilize "data understanding" tools to gain insights from visualizations during the training data development. In practice, activities under these three goals may even be executed in parallel in real-world scenarios. For instance, in a typical AI-driven company, specialized teams or roles are often designated to handle these goals concurrently. This includes machine learning engineers or data science teams focusing on training data development, AI testers or prompt engineers managing inference data development, and AI data infrastructure teams focusing on data maintenance. While their activities may overlap, they have distinct focuses toward building data-centric AI systems. Just as different roles in an AI-driven company serve distinct purposes, our survey follows this goal-driven taxonomy to discuss the representative data-centric AI tasks under each sub-goal, summarized in Table 1.
3.2 Automation and Human Participation in Data-centric AI
Data-centric AI consists of a spectrum of tasks related to different data lifecycle stages. To keep pace with the ever-growing size of the available data, in some data-centric AI tasks, it is imperative to develop automated algorithms to streamline the process. For example, there is an increasing interest in automation in data augmentation [51, 257], and feature transformation [115]. Automation in these tasks will improve not only efficiency but also accuracy [144]. Moreover, automation can facilitate the consistency of the results, reducing the chance of human errors. Whereas for some other tasks, human involvement is essential to ensure the data is consistent with our intentions. For example, humans often play an indispensable role in labeling data [266], which helps machine learning algorithms learn to make the desired predictions. Whether human participation is needed depends on whether our objective is to align data with human expectations. In this survey, we categorize each paper into automation and collaboration, where the former focuses on automating the process, and the latter concerns human participation. Automation-oriented methods usually have different automation objectives. We can identify several levels of automation from the existing methods:
Fig. 3. Data-centric AI papers are categorized into automation and collaboration depending on whether human participation is needed. Each method has a different level of automation or requires a different degree of human participation.
- Programmatic automation: Using programs to deal with the data automatically. The programs are often designed based on some heuristics and statistical information.
- Learning-based automation: Learning automation strategies with optimization, e.g., minimizing an objective function. The methods at this level are often more flexible and adaptive but require additional costs for learning.
- Pipeline automation: Integrating and tuning a series of strategies across multiple tasks, which could help identify globally optimal strategies. However, tuning may incur significantly more costs.
Note that this categorization does not intend to differentiate good and bad methods. For example, a pipeline automation method may not necessarily be better than programmatic automation solutions, since it could be over-complicated in many scenarios. Instead, we aim to show insight into how automation has been applied to different data-centric goals and understand the literature from a global view. From another perspective, collaboration-oriented methods often require human participation in different forms. We can identify several degrees of human participation:
- Full participation: Humans fully control the process. The method assists humans in making decisions. The methods that require full participation can often align well with human intentions but can be costly.
- Partial participation: The method is in control of the process. However, humans need to intensively or continuously supply information, e.g., by providing a large amount of feedback or frequent interactions.
- Minimum participation: The method is in full control of the whole process and only consults humans when needed. Humans only participate when prompted or asked to do so. The methods that belong to this degree are often more desirable when encountering a massive amount of data and a limited budget for human efforts.
Similarly, the degree of human participation, to a certain extent, only reflects the tradeoff between efficiency (less human labor) and effectiveness (better aligned with humans). The selection of methods depends on the application domain and stakeholders' needs. To summarize, we design Figure 3 to organize the existing data-centric AI papers. We assign each paper to either a level of automation or a degree of human participation.
3.3 Related Surveys in Data-centric AI
The early surveys only focus on specific aspects of data-centric AI rather than providing a comprehensive overview, such as data augmentation [69, 204, 238], data labeling [266], and feature selection [129]. Recent efforts have aimed to summarize and provide perspectives on data-centric AI. [105] summarizes six principles of data-centric AI and discusses best data practices. [104] introduces data-centric AI from the field of Business and Information Systems Engineering. Our previous work [256] articulates our perspectives on data-centric AI, outlining its challenges and opportunities. However, these studies primarily offer high-level discussions and lack comprehensiveness. A contemporary work [207] also reviews the field of data-centric AI in an end-to-end way. The novelty of our survey is that we provide a more comprehensive review by discussing at least 156 research papers and organizing them with a goal-driven taxonomy followed by an automation- and collaboration-oriented design to categorize methods. Moreover, we discuss the needs, challenges, and future directions from the broad data-centric AI view. Based on this survey, we further presented a tutorial at the KDD conference [259], aiming to motivate collective initiatives to push forward this field.
Fig. 4. Overview of training data development, encompassing five sub-goals: data collection, labeling, preparation, reduction, and augmentation. Additionally, we present several representative tasks associated with each sub-goal. Note that the figure illustrates only a general pipeline, and not all steps are mandatory. For instance, unsupervised learning does not require data labeling. These steps can be executed in a different order as well. For example, data augmentation can occur before data reduction.
4 Training Data Development
Training data provides the foundation for machine learning models, as the model performance is heavily influenced by its quality and quantity. In this section, we summarize the essential steps to create and process training data, visualized in Figure 4. Data creation focuses on effectively and efficiently encoding human intentions into datasets, including data collection (Section 4.1) and data labeling (Section 4.2). Data processing aims to make data suitable for learning, including data preparation (Section 4.3), data reduction (Section 4.4), and data augmentation (Section 4.5). After introducing these steps, we discuss pipeline search (Section 4.6), an emerging trend that aims to connect them and search for the most effective end-to-end solution. Table 3 in Appendix B summarizes the representative tasks and methods for training data development.
4.1 Data Collection
Data collection is the process of gathering and acquiring data from various sources, which fundamentally determines data quality and quantity. This process heavily relies on domain knowledge. With the increasing availability of data, there has been a surge in the development of efficient strategies to leverage existing datasets. In the following, we discuss the need for data collection, an overview of more efficient data collection strategies, and challenges. Note that the data collection process often requires significant infrastructure and tooling support. Our discussion primarily focuses on the strategic aspect of data collection-specifically, how to identify the right data-rather than on the infrastructure or tools required for the process.
4.1.1 Need for Data Collection. Collecting the right data is often the initial stride in constructing an AI system. The collected data need to be both relevant, aligning with stakeholders' intentions, and representative, covering different possible scenarios. Domain knowledge often plays an essential role in data collection. For example, when building a recommendation system, it is crucial to decide what user/item features to collect based on the application domain [267]. The domain-specific knowledge can also help in synthesizing data. For instance, knowledge about financial markets and trading strategies can facilitate the generation of more realistic synthetic anomalies [125].
4.1.2 Efficient Data Collection Strategies. Traditionally, datasets are constructed from scratch by manually collecting the relevant information. However, this process is time-consuming. More efficient methods have been developed by leveraging the existing data. Here, we describe the methods for dataset discovery, data integration, and data synthesis.
Dataset discovery. As the number of available datasets continuously grows, it becomes possible to amass the existing datasets of interest to construct a new dataset that meets our needs. Given a human-specified query (e.g., the expected attribute names), dataset discovery aims to identify the most related and useful datasets from a data lake, a repository of datasets stored in its raw formats, such as public data-sharing platforms [20] and data marketplaces. The existing research for dataset discovery mainly differs in calculating relatedness. A representative strategy is to abstract the datasets as a graph, where the nodes are columns of the data sources, and edges represent relationships between two nodes [70]. Then a tailored query language is designed to allow users to express complex query logic to retrieve the relevant datasets. Another approach is table union search [156], which measures the unionability of datasets based on the overlapping of the attribute values. Recent work measures the relatedness in a more comprehensive way by considering attribute names, value overlapping, word embedding, formats, and domain distributions [26]. All these methods can significantly reduce human labor in dataset discovery, as humans only need to provide queries.
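To make the idea of overlap-based relatedness concrete, the following is a minimal sketch of measuring column unionability with Jaccard similarity over cell values, in the spirit of table union search [156]; the function names, threshold, and toy tables are illustrative rather than the implementation of any particular system.

```python
# A minimal sketch of value-overlap-based column matching in the spirit of
# table union search: two columns are considered unionable when their sets of
# cell values overlap strongly (Jaccard similarity). Threshold and toy tables
# are illustrative.
import pandas as pd

def jaccard(a: set, b: set) -> float:
    """Jaccard similarity between two sets of stringified cell values."""
    return len(a & b) / len(a | b) if (a or b) else 0.0

def unionable_columns(query: pd.DataFrame, candidate: pd.DataFrame, threshold: float = 0.5):
    """Return (query column, candidate column, score) triples above the threshold."""
    matches = []
    for qc in query.columns:
        q_vals = set(query[qc].astype(str))
        for cc in candidate.columns:
            score = jaccard(q_vals, set(candidate[cc].astype(str)))
            if score >= threshold:
                matches.append((qc, cc, score))
    return sorted(matches, key=lambda m: -m[2])

# Toy example: a query table and a data-lake table that share country codes.
t1 = pd.DataFrame({"country": ["US", "DE", "FR"], "gdp": [1, 2, 3]})
t2 = pd.DataFrame({"nation": ["US", "DE", "JP"], "pop": [4, 5, 6]})
print(unionable_columns(t1, t2))  # [('country', 'nation', 0.5)]
```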
Data integration. Given a few datasets from different sources, data integration aims to combine them into a unified dataset. The difficulty lies in matching the columns across datasets and transforming the values of data records from the source dataset to the target dataset. Traditional solutions rely on rule-based systems [121, 128], which cannot scale. Recently, machine learning has been utilized to automate the data integration process in a more scalable way [212, 213]. For example, the transformation of data values can be formulated as a classification problem, where the input is the data value from the source dataset, and the output is the transformed value from the target dataset [213]. Then, we can train a classifier with the training data generated by rules and generalize it to unseen data records. The automated data integration techniques make it possible to merge a larger number of existing datasets efficiently.
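Below is a hedged sketch of casting value transformation as classification, as described above: training pairs are produced by a simple rule, and a character n-gram classifier generalizes the mapping to unseen records. The rule, features, and model choice are illustrative assumptions, not the exact setup of the cited work [213].

```python
# A sketch of formulating data-value transformation as classification: training
# pairs are generated by a simple rule (state name -> abbreviation), and a
# character n-gram classifier generalizes the mapping to differently formatted,
# unseen records. The rule, features, and model are illustrative.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Rule-generated training data: source value -> target value (treated as a class).
source = ["texas", "Texas ", "california", "CALIFORNIA", "new york", "New York"]
target = ["TX", "TX", "CA", "CA", "NY", "NY"]

model = make_pipeline(
    CountVectorizer(analyzer="char_wb", ngram_range=(2, 3), lowercase=True),
    LogisticRegression(max_iter=1000),
)
model.fit(source, target)

# Apply the learned transformation to unseen records from the source dataset.
print(model.predict(["Texas!", " california "]))  # likely ['TX' 'CA']
```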
Raw data synthesis. In some scenarios, it is more efficient to synthesize a dataset that contains the desirable patterns than to collect these patterns from the real world. A typical scenario is anomaly detection, where it is often hard to collect sufficient real anomalies, since they can be extremely rare. Thus, researchers often insert anomaly patterns into anomaly-free datasets. For example, a general anomaly synthesis criterion has been proposed for time-series data [125], where a time series is modeled as a parameterized combination of trend, seasonality, and shapelets. Then different point- and pattern-wise anomalies can be generated by altering these parameters. However, such synthesis strategies may not be suitable for all domains. For example, the anomaly patterns in financial time series can be quite different from those from electricity time series. Thus, properly designing data synthesis strategies still requires domain knowledge.
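The following sketch illustrates the general idea of synthesizing time-series anomalies: a clean series is generated from parameterized trend and seasonality components, and point anomalies are injected by perturbing a few positions. The component forms, anomaly rate, and magnitude are illustrative choices, not the exact criterion of [125].

```python
# A sketch of anomaly synthesis for time series: a clean series is built from
# parameterized trend and seasonality components, then point anomalies are
# injected by spiking a small fraction of positions. All parameters are illustrative.
import numpy as np

rng = np.random.default_rng(0)

def synthesize_series(n=500, trend_slope=0.01, season_period=50, noise_std=0.1):
    """Clean series = linear trend + sinusoidal seasonality + Gaussian noise."""
    t = np.arange(n)
    return trend_slope * t + np.sin(2 * np.pi * t / season_period) + rng.normal(0, noise_std, n)

def inject_point_anomalies(series, rate=0.01, magnitude=5.0):
    """Spike a random subset of points; return the series and anomaly labels."""
    series, labels = series.copy(), np.zeros(len(series), dtype=int)
    idx = rng.choice(len(series), size=max(1, int(rate * len(series))), replace=False)
    series[idx] += magnitude * rng.choice([-1.0, 1.0], size=len(idx))
    labels[idx] = 1
    return series, labels

anomalous, labels = inject_point_anomalies(synthesize_series())
print(f"{labels.sum()} injected anomalies out of {len(labels)} points")
```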
4.1.3 Challenges. Data collection is a very challenging process that requires careful planning. From the technical perspective, datasets are often diverse and not well-aligned with each other, so it is non-trivial to measure their relatedness or integrate them appropriately. Effectively synthesizing data from the existing dataset is also tricky, as it heavily relies on domain knowledge. Moreover, some critical issues during data collection cannot be resolved solely from a technical perspective. For example, in many real-world situations, we may be unable to locate a readily available dataset that aligns with our requirements, so we still have to collect data from the ground up. However, some data sources can be difficult to obtain due to legal, ethical, or logistical reasons. Collecting new data also involves ethical considerations, particularly with regard to informed consent, data privacy, and data security. Researchers and practitioners must be aware of these challenges in studying and executing data collection.
4.2 Data Labeling
Data labeling is the process of assigning one or more descriptive tags or labels to a dataset, enabling algorithms to learn from and make predictions on the labeled data. Traditionally, this is a time-consuming and resource-intensive manual process, particularly for large datasets. Recently, more efficient labeling methods have been proposed to reduce human efforts. In what follows, we discuss the need for data labeling, efficient labeling strategies, and challenges.
4.2.1 Need for Data Labeling. Labeling plays a crucial role in ensuring that the model trained on the data accurately reflects human intentions. Without proper labeling, a model may not be able to make the desired predictions, since the model can, at most, be as good as the data fed into it. Although unsupervised learning techniques are successful in domains such as large language models [32, 114, 177, 178] and anomaly detection [165], the trained models may not well align with human expectations. Thus, to achieve a better performance, we often still need to fine-tune the large language models with human labels, such as ChatGPT [163], and tune anomaly detectors with a small amount of labeled data [107, 132-134]. Thus, labeling data is essential for teaching models to align with and behave like humans.
4.2.2 Efficient Labeling Strategies. Researchers have long recognized the importance of data labeling. Various strategies have been proposed to enhance labeling efficiency. We will discuss crowdsourced labeling, semi-supervised labeling, active learning, data programming, and distant supervision. Note that it is possible to combine them as hybrid strategies.
Crowdsourced labeling. Crowdsourcing is a classic approach that breaks down a labeling task into smaller and more manageable parts so that they can be outsourced and distributed to a large number of non-expert annotators. Traditional methods often only provide initial guidelines to annotators [252]. However, the guidelines can be unclear and ambiguous, so each annotator could judge the same situation subjectively and differently. One way to mitigate this inconsistency is to start with small pilot studies and iteratively refine the design of the labeling task [123]. Another is to ask multiple workers to annotate the same sample and infer a consensus label [217]. Other studies focus on algorithmically improving label quality, e.g., pruning low-quality teachers [53]. All these crowdsourcing methods require full human participation but assist humans or enhance label quality in different ways.
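As a concrete illustration of inferring a consensus label from multiple annotators, the sketch below aggregates crowd labels by a simple majority vote; real aggregation methods also model per-annotator reliability, which is omitted here.

```python
# A minimal sketch of consensus labeling: each sample is annotated by several
# crowd workers, and the most frequent label is taken as the consensus. Ties
# are broken arbitrarily, and annotator reliability is not modeled here.
from collections import Counter

def majority_vote(annotations):
    """Map each sample id to the most frequent label among its annotators."""
    return {sample: Counter(labels).most_common(1)[0][0]
            for sample, labels in annotations.items()}

annotations = {
    "img_001": ["cat", "cat", "dog"],   # three workers labeled the same image
    "img_002": ["dog", "dog", "dog"],
}
print(majority_vote(annotations))  # {'img_001': 'cat', 'img_002': 'dog'}
```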
Semi-supervised labeling. The key idea is to leverage a small amount of labeled data to infer the labels of the unlabeled data. A popular approach is self-training [277], which trains a classifier based on labeled data and uses it to generate pseudo labels. To improve the quality of pseudo labels, a common strategy is to train multiple classifiers and find a consensus label, such as using different machine learning algorithms to train models on the same data [274]. In parallel, researchers have studied graph-based semi-supervised labeling techniques [44]. The idea is to construct a graph, where each node is a sample, and each edge represents the distance between the two nodes it connects. Then they infer labels through label propagation in the graph. Recently, a reinforcement learning from human feedback procedure is proposed [46] and used in ChatGPT [163]. They train a reward model based on human-labeled data and infer the reward for unlabeled data to fine-tune the language model. These semi-supervised labeling methods only require partial human participation to provide the initial labels.
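The sketch below illustrates self-training on synthetic data: a classifier fit on a small labeled set pseudo-labels the unlabeled pool, and only high-confidence pseudo labels are added back for retraining. The dataset, model, and confidence threshold are illustrative.

```python
# A sketch of self-training: a classifier trained on a small labeled set
# pseudo-labels the unlabeled pool, and only high-confidence pseudo labels are
# added back before retraining. The threshold and data are illustrative.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, random_state=0)
labeled, unlabeled = np.arange(50), np.arange(50, 500)   # only 50 labels are known

model = LogisticRegression(max_iter=1000).fit(X[labeled], y[labeled])
proba = model.predict_proba(X[unlabeled])
confident = proba.max(axis=1) >= 0.95                    # keep confident pseudo labels
pseudo_y = model.predict(X[unlabeled])[confident]

X_aug = np.vstack([X[labeled], X[unlabeled][confident]])
y_aug = np.concatenate([y[labeled], pseudo_y])
model = LogisticRegression(max_iter=1000).fit(X_aug, y_aug)  # retrain on augmented set
print(f"Added {int(confident.sum())} pseudo-labeled samples")
```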
Active learning. Active learning is an iterative labeling procedure that involves humans in the loop. In each iteration, the algorithm selects an unlabeled sample or batch of samples as a query for human annotation. The newly labeled samples help the algorithm choose the next query. The existing work mainly differs in query selection strategies. Early methods use statistical methods to estimate sample uncertainty and select the unlabeled sample the model is most uncertain about [50]. Recent studies have investigated deep active learning, which leverages model output or designs specialized architectures to measure uncertainty [182]. More recent research aligns the querying process with a Markov decision process and learns to select the long-term best query with contextual bandit [61] or reinforcement learning [258]. Unlike semi-supervised labeling, which requires one-time human participation in the initial stage, active learning needs a continuous supply of information from humans to adaptively select queries.
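For illustration, the following sketch implements a simple uncertainty-based query strategy: in each round, the unlabeled sample with the highest predictive entropy is queried, with the human oracle simulated by the ground-truth labels. The number of rounds and the model are illustrative choices.

```python
# A sketch of uncertainty-based query selection: in each round, the unlabeled
# sample with the highest predictive entropy is "sent" to an oracle, which is
# simulated here by the ground-truth labels. Rounds and model are illustrative.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, random_state=0)
labeled = list(range(10))                       # small initial labeled pool
pool = [i for i in range(300) if i not in labeled]

for _ in range(5):                              # five labeling rounds
    model = LogisticRegression(max_iter=1000).fit(X[labeled], y[labeled])
    proba = model.predict_proba(X[pool])
    entropy = -(proba * np.log(proba + 1e-12)).sum(axis=1)
    query = pool[int(entropy.argmax())]         # most uncertain sample
    labeled.append(query)                       # the oracle reveals y[query]
    pool.remove(query)

print(f"Labeled {len(labeled)} samples after active learning")
```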
Data programming. Data programming [180, 181] is a weakly supervised approach that infers labels based on human-designed labeling functions. The labeling functions are often some heuristic rules and vary for different data types, e.g., seed words for text classification [261], masks for image segmentation [99], and so on. However, sometimes the labeling functions may not align with human intentions. To address this limitation, researchers have proposed interactive data programming [25, 75], where humans participate more by interactively providing feedback to refine labeling functions. Data programming methods often require minimum human participation or, at most, partial participation. Thus, the methods in this research line are often more desirable when we need to quickly generate a large number of labels.
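The sketch below shows the flavor of data programming with heuristic labeling functions for sentiment: each function votes a label or abstains, and votes are combined by a simple majority. The seed words and the naive aggregation are illustrative; systems such as Snorkel learn to weight and denoise labeling functions rather than taking a plain vote.

```python
# A sketch of data programming for sentiment labeling: each labeling function
# votes a label or abstains, and votes are combined by majority. Seed words and
# the naive aggregation are illustrative only.
ABSTAIN, NEG, POS = -1, 0, 1

def lf_positive_words(text):
    return POS if any(w in text.lower() for w in ["great", "excellent", "love"]) else ABSTAIN

def lf_negative_words(text):
    return NEG if any(w in text.lower() for w in ["terrible", "awful", "hate"]) else ABSTAIN

def weak_label(text, lfs):
    votes = [v for v in (lf(text) for lf in lfs) if v != ABSTAIN]
    if not votes:
        return ABSTAIN                           # no labeling function fired
    return max(set(votes), key=votes.count)      # simple majority vote

lfs = [lf_positive_words, lf_negative_words]
print(weak_label("I love this movie, it is excellent", lfs))  # 1 (POS)
print(weak_label("An awful, terrible experience", lfs))       # 0 (NEG)
```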
Distant supervision. Another weakly supervised approach is distant supervision, which assigns labels by leveraging external sources. A famous application of distant supervision is on relation extraction [149], where the semantic relationships between entities in the text are labeled based on external data, such as Freebase [28]. Distant supervision is often an automated approach that does not require human participation. However, the automatically generated labels can be noisy if there is a discrepancy between the dataset and the external source.
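A minimal sketch of distant supervision for relation extraction is given below: a sentence mentioning an entity pair found in an external knowledge base is automatically labeled with that relation. The tiny in-memory "knowledge base" and the substring matching are illustrative and, as noted above, such labels can be noisy.

```python
# A minimal sketch of distant supervision for relation extraction: a sentence
# that mentions an entity pair present in an external knowledge base is labeled
# with the corresponding relation. The toy knowledge base and the naive
# substring matching are illustrative and can produce noisy labels.
knowledge_base = {
    ("Barack Obama", "Hawaii"): "born_in",
    ("Google", "Larry Page"): "founded_by",
}

def distant_label(sentence):
    """Return (entity pair, relation) labels whose entities both occur in the sentence."""
    return [(pair, rel) for pair, rel in knowledge_base.items()
            if all(entity in sentence for entity in pair)]

print(distant_label("Barack Obama was born in Hawaii in 1961."))
# [(('Barack Obama', 'Hawaii'), 'born_in')]
```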
4.2.3 Challenges. The main challenge for data labeling stems from striking a balance between label quality, label quantity, and financial cost. If given adequate financial support, then it is possible to hire a sufficient number of expert annotators to obtain a satisfactory quantity of high-quality labels. However, when we have a relatively tight budget, we often have to resort to more efficient labeling strategies. Identifying the proper labeling strategy often requires domain knowledge to balance different tradeoffs, particularly human labor and label quality/quantity. Another difficulty lies in the subjectivity of labeling. While the instructions may be clear to the designer, they may be misinterpreted by annotators, which leads to labeling noise. Last but not least, ethical considerations, such as data privacy and bias, remain a pressing issue, especially when the labeling task is distributed to a large and undefined group of people.
4.3 Data Preparation
Data preparation involves cleaning and transforming raw data into a format that is appropriate for model training. Conventionally, this process often necessitates a considerable amount of engineering work with laborious trial and error. To automate this process, state-of-the-art approaches often adopt search algorithms to discover the most effective strategies. In this subsection, we introduce the need, representative methods, and challenges for data preparation.
4.3.1 Need for Data Preparation. Raw data is often not ready for model training due to potential issues such as noise, inconsistencies, and unnecessary information, leading to inaccurate and biased results. For instance, the model could overfit to noise, outliers, and irrelevant extracted features, resulting in reduced generalizability [247]. If sensitive information (e.g., race and gender) is not removed, then the model may unintentionally learn to make biased predictions [228]. In addition, the raw feature values may negatively affect model performance if they are on different scales or follow skewed distributions [4]. Thus, it is imperative to clean and transform data. The need can also be verified by a Forbes survey [174], which suggests that data preparation accounts for roughly 80% of the work of data scientists.
4.3.2 Methods. We will review and discuss the techniques for achieving three key data preparation objectives, namely, data cleaning, feature extraction, and feature transformation.
Data cleaning. Data cleaning is the process of identifying and correcting errors, inconsistencies, and inaccuracies in datasets. Traditional methods repair data with programmatic automation, e.g., imputing missing values with the mean or median [271] and scanning all data to find duplicates. However, such heuristics can be inaccurate or inefficient. Thus, learning-based methods have been developed, such as training a regression model to predict missing values [126], efficiently estimating the duplicates with sampling [94], and correcting labeling errors [109]. Contemporary data cleaning methods often do not solely focus on the cleaning itself, but rather on learning to improve final model performance. For instance, a recent study has adopted search algorithms to automatically identify the best cleaning strategy to optimize validation performance [119]. Beyond automation, researchers have studied collaboration-oriented cleaning methods. For example, a hybrid human-machine workflow is proposed to identify duplicates by presenting similar pairs to humans for annotation [231].
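To make the contrast between programmatic and learning-based cleaning concrete, the sketch below imputes missing values with a median rule and, alternatively, with a model-based imputer that predicts each missing entry from the other features; both are standard scikit-learn components, and choosing between such strategies is the kind of decision a search-based cleaner would automate.

```python
# A sketch contrasting programmatic and learning-based cleaning for missing
# values: median imputation versus a model-based imputer that predicts each
# missing entry from the other features. Both are standard scikit-learn
# components; the toy matrix is illustrative.
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables IterativeImputer)
from sklearn.impute import IterativeImputer, SimpleImputer

X = np.array([[1.0, 2.0], [3.0, np.nan], [5.0, 6.0], [np.nan, 8.0]])

median_filled = SimpleImputer(strategy="median").fit_transform(X)   # programmatic rule
model_filled = IterativeImputer(random_state=0).fit_transform(X)    # regression-based

print(median_filled)
print(model_filled)
```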
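To make the contrast concrete, the sketch below imputes missing values with a simple median heuristic and, alternatively, with a model-based imputer; the toy column names and the choice of scikit-learn's IterativeImputer are illustrative assumptions rather than the specific methods cited above.

```python
# A minimal sketch contrasting heuristic and learning-based cleaning on a toy
# table. The column names are hypothetical; scikit-learn's IterativeImputer is
# just one example of a model-based imputer (it regresses each incomplete
# column on the others), not the specific method cited above.
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer, SimpleImputer

df = pd.DataFrame({
    "age":    [25, 32, np.nan, 41, 29, 29],
    "income": [48_000, np.nan, 61_000, 75_000, 52_000, 52_000],
})

# Heuristic cleaning: drop exact duplicates, then impute with the median.
heuristic = df.drop_duplicates()
heuristic = pd.DataFrame(SimpleImputer(strategy="median").fit_transform(heuristic),
                         columns=df.columns)

# Learning-based cleaning: impute missing values with a regression model
# fitted on the remaining columns.
learned = pd.DataFrame(IterativeImputer(random_state=0).fit_transform(df),
                       columns=df.columns)
```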
Feature extraction. Feature extraction is an important step in extracting relevant features from raw data. For training traditional machine learning models, we often need to extract features based on domain knowledge of the data type being targeted. Common features used for images include color features, texture features, intensity features, and so on [189]. For time-series data, temporal, statistical, and spectral features are often considered [13]. Deep learning, in contrast, automatically extracts features by learning the weights of neural networks, which requires less domain knowledge. For instance, convolutional neural networks can be used in both images [120] and time series [236]. The boundary between data and model becomes blurred with deep learning feature extractors, which operate on the data while also being an integral part of the model. Although deep extractors could learn high-quality feature representations, the extraction process is uninterpretable and may amplify the bias in the learned representation [228]. Therefore, traditional feature extraction methods are often preferred in high-stakes domains for interpretability and removing sensitive information.
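As a rough illustration of hand-crafted extraction (as opposed to learned deep features), the sketch below computes a color histogram for an image and a few temporal and statistical descriptors for a time series; the specific descriptors are illustrative, not a prescribed feature set.

```python
# A rough illustration of hand-crafted feature extraction: a color histogram
# for an image and a few temporal/statistical descriptors for a time series.
# The specific descriptors are illustrative, not a prescribed feature set.
import numpy as np

def color_histogram(image, bins=8):
    # image: H x W x 3 array with integer values in [0, 255]
    return np.concatenate([
        np.histogram(image[..., c], bins=bins, range=(0, 255), density=True)[0]
        for c in range(3)
    ])

def ts_features(x):
    # simple temporal and statistical descriptors of a 1-D series
    return np.array([x.mean(), x.std(), x.min(), x.max(),
                     np.abs(np.diff(x)).mean()])

img = np.random.randint(0, 256, size=(32, 32, 3))
series = np.sin(np.linspace(0, 10, 200)) + 0.1 * np.random.randn(200)
features = np.concatenate([color_histogram(img), ts_features(series)])
```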
Feature transformation. Feature transformation refers to the process of converting the original features into a new set of features, which can often lead to improved model performance. Some typical transformations include normalization, which scales the feature into a bounded range, and standardization, which transforms features so that they have a mean of zero and a standard deviation of one [5]. Other strategies include log transformation and polynomial transformation to smooth the long-tail distribution and create new features through multiplication [22]. These transformation methods can be combined in different ways to improve model performance. For example, a representative work builds a transformation graph for a given dataset, where each node is a type of transformation, and adopts reinforcement learning to search for the best transformation strategy [115]. Learning-based methods often yield superior performance by optimizing transformation strategies based on the feedback obtained from the model.
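The sketch below applies the transformations mentioned above (normalization, standardization, log, and polynomial transforms) to a toy feature matrix with scikit-learn; it is a minimal illustration under arbitrary values rather than a recommended recipe.

```python
# A minimal sketch of the transformations mentioned above on a toy feature
# matrix; the values are arbitrary and this is not a recommended recipe.
import numpy as np
from sklearn.preprocessing import MinMaxScaler, PolynomialFeatures, StandardScaler

X = np.array([[1.0, 200.0], [2.0, 900.0], [3.0, 50_000.0]])

X_norm = MinMaxScaler().fit_transform(X)   # normalization: scale into [0, 1]
X_std = StandardScaler().fit_transform(X)  # standardization: zero mean, unit variance
X_log = np.log1p(X)                        # log transform to smooth long-tailed values
X_poly = PolynomialFeatures(degree=2, include_bias=False).fit_transform(X)  # products of features
```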
4.3.3 Challenges. Properly cleaning and transforming data is challenging due to the unique characteristics of different datasets. For example, the errors and inconsistencies in text data are quite different from those in time-series data. Even if two datasets have the same data type, their feature values and potential issues can be very diverse. Thus, researchers and data scientists often need to devote a significant amount of time and effort to clean the data. Although learning-based methods can search for the optimal preparation strategy automatically [115, 119], it remains a challenge to design the appropriate search space, and the search often requires a non-trivial amount of time.
4.4 Data Reduction
The goal of data reduction is to reduce the complexity of a given dataset while retaining its essential information. This is often achieved by either reducing the feature size or the sample size. Our discussion will focus on the need for data reduction, representative methods for feature and sample size reduction, and challenges.
4.4.1 Need for Data Reduction. With more data being collected at an unprecedented pace, data reduction plays a critical role in boosting training efficiency. From the sample size perspective, reducing the number of samples leads to a simpler yet representative dataset, which can alleviate memory and computation constraints. It also helps to alleviate data imbalance issues by downsampling the samples from the majority class [175]. Similarly, reducing feature size brings many benefits. For example, eliminating irrelevant or redundant features mitigates the risk of overfitting [129]. Smaller feature sizes will also enable faster training and inference in model deployment [230]. In addition, only keeping a subset of features will make the model more interpretable. Data reduction techniques can enable the model to focus only on the essential information, thereby enhancing accuracy, efficiency, and interpretability.
4.4.2 Methods for Reducing Feature Size. From the feature perspective, we discuss two common reduction strategies.
Feature selection. Feature selection is the process of selecting a subset of features most relevant to the intended tasks [129]. It can be broadly classified into filter, wrapper, and embedded methods. Filter methods [219] evaluate and select features independently using a scoring function based on statistical properties such as information gain [9]. Although filter methods are very efficient, they ignore feature dependencies and interactions with the model. Wrapper methods alleviate these issues by leveraging the model performance to assess the quality of selected features and refining the selection iteratively [245]. While these methods often achieve better performances, they are computationally more expensive. Embedded methods, from another angle, integrate feature selection into the model training process [233] so that the selection process is optimized in an end-to-end manner. Beyond automation, active feature selection takes into account human knowledge and incrementally selects the most appropriate features [198, 269]. Feature selection reduces the complexity, producing cleaner and more understandable data while retaining feature semantics.
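The following sketch contrasts the three families on a synthetic classification task, using mutual information as a filter score, recursive feature elimination as a wrapper, and L1 regularization as an embedded selector; these specific choices are illustrative assumptions rather than the cited methods.

```python
# A minimal sketch contrasting filter, wrapper, and embedded selection on a
# synthetic task; mutual information, recursive feature elimination, and L1
# regularization are illustrative choices for each family.
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE, SelectFromModel, SelectKBest, mutual_info_classif
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=20, n_informative=5, random_state=0)

# Filter: score each feature independently, keep the top k.
X_filter = SelectKBest(mutual_info_classif, k=5).fit_transform(X, y)

# Wrapper: repeatedly refit a model and drop the least useful features.
X_wrapper = RFE(LogisticRegression(max_iter=1000), n_features_to_select=5).fit_transform(X, y)

# Embedded: L1-regularized training drives irrelevant feature weights to zero.
l1_model = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)
X_embedded = SelectFromModel(l1_model, prefit=True).transform(X)
```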
Dimensionality reduction. Dimensionality reduction aims to transform high-dimensional features into a lower-dimensional space while preserving the most representative information. The existing methods can be mainly categorized into linear and non-linear techniques. The former generates new features via linear combinations of features from the original data. One of the most popular algorithms is Principal Component Analysis [2], which constructs orthogonal linear combinations of the original features that capture the largest variance in an unsupervised manner. Another representative method targeted for supervised scenarios is Linear Discriminant Analysis [242], which statistically learns linear feature combinations that can separate classes well. Linear techniques, however, may not always perform well, especially when features have complex and non-linear relationships. Non-linear techniques address this issue by utilizing nonlinear mapping functions. A popular technique is autoencoders [12], which use neural networks to encode the original features into a low-dimensional space and reconstruct the features using a neural decoder.
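A minimal sketch of the two linear techniques named above, assuming the Iris dataset purely for illustration:

```python
# A minimal sketch of the two linear techniques named above; the Iris dataset
# is used purely for illustration.
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)
X_pca = PCA(n_components=2).fit_transform(X)     # unsupervised, variance-preserving projection
X_lda = LinearDiscriminantAnalysis(n_components=2).fit_transform(X, y)  # supervised, class-separating projection
```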
4.4.3 Methods for Reducing Sample Size. The reduction of samples is typically achieved with instance selection, which selects a representative subset of data samples that retains the original properties of the dataset. The existing studies can be divided into filter and wrapper methods. The former select instances based on scoring functions. For example, a common strategy is to select border instances, since they can often shape the decision boundary [183]. Wrapper methods, in contrast, select instances based on model performance [216], which considers the interaction effect with the model. Instance selection techniques can also alleviate data imbalance issues by undersampling the majority class, e.g., with random undersampling [175]. More recent work adopts reinforcement learning to learn the best undersampling strategies [137]. Overall, instance selection is a simple yet effective way to reduce data sizes or balance data distributions.
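As a concrete example of the simplest strategy mentioned above, the sketch below randomly undersamples the majority class so that every class keeps as many instances as the smallest class; the toy data and seed are arbitrary.

```python
# A minimal sketch of random undersampling: every class keeps as many
# instances as the smallest class. The toy data and seed are arbitrary.
import numpy as np

def random_undersample(X, y, seed=0):
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    n_min = counts.min()
    keep = np.concatenate([
        rng.choice(np.where(y == c)[0], size=n_min, replace=False)
        for c in classes
    ])
    return X[keep], y[keep]

X = np.random.randn(1000, 5)
y = np.array([0] * 900 + [1] * 100)        # 9:1 class imbalance
X_bal, y_bal = random_undersample(X, y)    # 100 samples per class remain
```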
4.4.4 Challenges. The challenges of data reduction are twofold. On the one hand, selecting the most representative data or projecting data into a low-dimensional space with minimal information loss is non-trivial. While learning-based methods can partially address these challenges, they may necessitate substantial computational resources, especially when dealing with extremely large datasets, e.g., the wrapper and reinforcement learning methods [137, 216, 245]. Therefore, achieving both high accuracy and efficiency is challenging. On the other hand, data reduction can potentially amplify data bias, raising fairness concerns. For example, the selected features could be overly associated with protected attributes [243]. Fairness-aware data reduction is a critical yet under-explored research direction.
4.5 Data Augmentation
Data augmentation is a technique to increase the size and diversity of data by artificially creating variations of the existing data, which can often improve the model performance. It is worth noting that even though data augmentation and data reduction seem to have contradictory objectives, they can be used in conjunction with each other. While data reduction focuses on eliminating redundant information, data augmentation aims to enhance data diversity. We will delve into the need for data augmentation, various representative methods, and the associated challenges.
4.5.1 Need for Data Augmentation. Modern machine learning algorithms, particularly deep learning, often require large amounts of data to learn effectively. However, collecting large datasets, especially annotated data, is labor-intensive. By generating similar data points with variations, data augmentation helps to expose the model to more training examples, thereby improving accuracy, generalization capabilities, and robustness. Data augmentation is particularly important in applications where there is limited data available. For example, it is often expensive and time-consuming to acquire well-annotated medical data [43]. Data augmentation can also alleviate class imbalance issues, where there is a disproportionate ratio of training samples in each class, by augmenting the data from the under-represented class.
4.5.2 Common Augmentation Methods. In general, data augmentation methods often manipulate the existing data to generate variants or synthesize new data. We discuss some representative methods in each category below.
Basic manipulation. This research line involves making minor modifications to the original data samples to produce augmented samples directly. Various strategies have been proposed in the computer vision domain, such as scaling, rotation, flipping, and blurring [270]. One notable approach is Mixup [264], which interpolates the existing data samples to create new samples. It is shown that Mixup serves as a regularizer, encouraging the model to prioritize simpler linear patterns, which in turn enhances the generalization performance [264]. More recent studies use learning-based algorithms to automatically search for augmentation strategies. A representative work is AutoAugment, which uses reinforcement learning to iteratively improve the augmentation policies [51]. Beyond image data, basic manipulation often needs to be tailored for other data types, such as permutation and jittering in time-series data [238], mixing data in the hidden space for text data to retain semantic meanings [40], and mixing graphons for graph data [88].
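To illustrate the basic-manipulation idea, the sketch below gives a framework-agnostic NumPy version of the core Mixup operation: new samples are convex combinations of random sample pairs and their one-hot labels. In practice this is usually applied per mini-batch during training; the batch shapes below are illustrative.

```python
# A framework-agnostic NumPy sketch of the core Mixup operation: new samples
# are convex combinations of random sample pairs and their one-hot labels.
import numpy as np

def mixup(X, Y, alpha=0.2, seed=0):
    rng = np.random.default_rng(seed)
    lam = rng.beta(alpha, alpha)              # mixing coefficient
    perm = rng.permutation(len(X))            # random pairing of samples
    X_mix = lam * X + (1 - lam) * X[perm]
    Y_mix = lam * Y + (1 - lam) * Y[perm]     # Y holds one-hot labels
    return X_mix, Y_mix

X = np.random.rand(8, 32, 32, 3)              # a small batch of images
Y = np.eye(10)[np.random.randint(0, 10, 8)]   # one-hot labels over 10 classes
X_aug, Y_aug = mixup(X, Y)
```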
Augmentation data synthesis. Another category focuses on synthesizing new training samples by learning the distribution of the existing data, which is typically achieved by generative modeling. GAN [81, 265] has been widely used for data augmentation [74]. The key idea is to train a discriminator in conjunction with a generator, making the latter generate synthetic data that closely resembles the existing data. GAN-based data augmentation has also been used to augment other data types, such as time-series data [131] and text data [205]. Other studies have used Variational Autoencoder [100] and diffusion models [98] to achieve augmentation. Compared to basic manipulation that augments data locally, data synthesis learns data patterns from the global view and generates new samples with a learned model.
4.5.3 Methods Tailored for Class Imbalance. Class imbalance is a fundamental challenge in machine learning, where the number of majority samples is much larger than that of minority samples. Data augmentation can be used to perform upsampling on the minority class to balance the data distribution. One popular approach is SMOTE [39], which involves generating synthetic samples by linearly interpolating between minority instances and their neighbors. ADASYN [91] is an extension of SMOTE that generates additional synthetic samples for data points that are more difficult to learn, as determined by the ratio of majority class samples in their nearest neighbors. A recent study proposes AutoSMOTE, a learning-based algorithm that searches for best oversampling strategies with reinforcement learning [257].
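The sketch below implements the core interpolation step of SMOTE on a toy minority set; production code would typically use an existing library implementation, and the neighborhood size here is an arbitrary choice.

```python
# A minimal sketch of SMOTE's core step: synthesize a minority sample by
# interpolating between a minority point and one of its minority-class
# neighbors.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote(X_min, n_new, k=5, seed=0):
    rng = np.random.default_rng(seed)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    _, idx = nn.kneighbors(X_min)              # idx[:, 0] is the point itself
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))           # pick a minority sample
        j = idx[i, rng.integers(1, k + 1)]     # pick one of its k neighbors
        lam = rng.random()
        synthetic.append(X_min[i] + lam * (X_min[j] - X_min[i]))
    return np.array(synthetic)

X_minority = np.random.randn(20, 4)
X_synthetic = smote(X_minority, n_new=80)      # upsample the minority class
```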
4.5.4 Challenges. One critical challenge in data augmentation is that there is no single augmentation strategy that is suitable for all scenarios. Different data types may require diverse strategies. For example, compared to image data, graph data is irregular and not well-aligned, and thus the vanilla Mixup strategy cannot be directly applied [88]. Even though two datasets have the same data type, the optimal strategy differs. For instance, we often need to upsample the minority samples differently to achieve the best results [257]. Although search-based algorithms can identify the best strategies with trial and error, they also increase the computation and storage costs, which can be a limiting factor in some applications. More effective and efficient data augmentation techniques are required to overcome these challenges.
4.6 Pipeline Search
Pipeline search is a recent trend that tries to automatically search for the best data pipelines. This subsection introduces the need for pipeline search, some representative methods, and challenges.
4.6.1 Need for Pipeline Search. In real-world applications, we often encounter complex data pipelines, where each pipeline step corresponds to a task associated with one of the aforementioned sub-goals. Despite the progress made in each individual task, a pipeline typically functions as a whole, and the various pipeline steps may have an interactive effect. For instance, the best data augmentation strategy may depend on the selected features. Thus, pipeline search is often needed to build an end-to-end solution for constructing training data.
4.6.2 Methods. One of the first pipeline search frameworks is AutoSklearn [71]. It performs a combined search of preprocessing modules, models, and the associated hyperparameters to optimize the validation performance. However, it uses a very small search space for preprocessing modules. DARPA's Data-Driven Discovery of Models (D3M) program pushes the progress further by building an infrastructure for pipeline search [148]. Although D3M originally focused on automated model discovery, it has developed numerous data-centric modules for processing data. Building upon D3M, AlphaD3M uses Monte-Carlo Tree Search to identify the best pipeline [63]. D3M is then tailored for time-series anomaly detection [124] and video analysis [262]. Deepline enables the search within a large number of data-centric modules using multi-step reinforcement learning [93]. ClusterP3S allows for personalized pipelines to be created for various features, utilizing clustering techniques to enhance search efficiency [143].
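As a toy illustration of the idea (not of any specific framework above), the sketch below randomly samples combinations of preparation modules, scores each resulting pipeline by cross-validation, and keeps the best one; the search space and budget are deliberately tiny.

```python
# A toy illustration of pipeline search: randomly sample module combinations,
# score each candidate pipeline by cross-validation, and keep the best one.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X, y = load_breast_cancer(return_X_y=True)
rng = np.random.default_rng(0)
space = {
    "imputer": [SimpleImputer(strategy="mean"), SimpleImputer(strategy="median")],
    "scaler": [StandardScaler(), MinMaxScaler()],
    "reducer": ["passthrough", PCA(n_components=10)],
}

best_score, best_pipe = -np.inf, None
for _ in range(10):  # random search over module combinations
    steps = [(name, options[rng.integers(len(options))]) for name, options in space.items()]
    pipe = Pipeline(steps + [("model", LogisticRegression(max_iter=5000))])
    score = cross_val_score(pipe, X, y, cv=3).mean()  # validation performance as feedback
    if score > best_score:
        best_score, best_pipe = score, pipe
```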
Fig. 5. Overview of inference data development, which entails in-distribution/out-of-distribution evaluation and prompt engineering. These processes aid in understanding and evaluating the model, or in unlocking new model capabilities through input data tuning.
4.6.3 Challenges. Despite these progresses, pipeline search still faces a significant challenge due to the high computational overhead, since the search algorithm often needs to try different module combinations repeatedly. This overhead becomes more pronounced as the number of modules increases, leading to an exponential growth of the search space. Thus, more efficient search strategies [93, 143] are required to enable a broader application of pipeline search in real-world scenarios.
5 Inference Data Development
Another crucial component in building AI systems is to design inference data to evaluate a trained model or unlock a specific capability of the model. In the conventional model-centric paradigm, we often adopt a hold-out evaluation set that is not included in the training data to measure model performance using specific metrics such as accuracy. However, relying solely on performance metrics may not fully capture many important properties of a model, such as its robustness, generalizability, and rationale in decision-making. Moreover, as models become increasingly large, it becomes possible to obtain the desired predictions by solely engineering the data input. This section introduces some representative methods that evaluate models from a more granular view or engineer data inputs for inference, shown in Figure 5. Our discussion involves in-distribution evaluation (Section 5.1), out-of-distribution evaluation (Section 5.2), and prompt engineering (Section 5.3). We summarize the tasks and methods in Table 4 of Appendix B.
5.1 In-distribution Evaluation
In-distribution evaluation data construction aims to generate samples that conform to the training data distribution. We will begin by addressing the need for constructing in-distribution evaluation sets. Next, we will review representative methods for two scenarios: evaluating important sub-populations on which the model underperforms through data slicing, and assessing decision boundaries through algorithmic recourse. Last, we will discuss the challenges.
5.1.1 Need for In-distribution Evaluation. In-distribution evaluation is the most direct way to assess the quality of trained models, as it reflects their capabilities within the training distribution. The need for a more fine-grained in-distribution evaluation is twofold. First, models that perform well on average may fail to perform adequately on specific sub-populations, requiring identification and calibration of underrepresented groups to avoid biases and errors, particularly in high-stakes applications [147, 162]. Second, it is crucial to understand the decision boundary and inspect the model ethics before deployment, especially in risky applications like policy making [209].
5.1.2 Methods. We discuss two representative in-distribution evaluation methods. Data slicing involves partitioning a dataset into relevant sub-populations and evaluating a model's performance on each sub-population separately. A common approach to data slicing is to use pre-defined criteria, such as age, gender, or race [14]. However, data in many real-world applications can be complex, and properly designing the partitioning criteria heavily relies on domain knowledge, such as slicing 3D seismic data in geophysics [254] and program slicing [192].
To reduce human effort, automated slicing methods have been developed to discover important data slices by sifting through all potential slices in the data space. One representative work is SliceFinder [48], which identifies slices that are both interpretable (i.e., slicing based on a small set of features) and problematic (the model performs poorly on the slice). To solve this search problem, SliceFinder offers two distinct methods, namely, the tree-based search and the lattice-based search. The former is more efficient, while the latter has better efficacy. SliceLine [188] is another notable work that addresses the scalability limitations of slice finding by focusing on both algorithmic and system perspectives. This approach is motivated by frequent itemset mining and leverages relevant monotonicity properties and upper bounds for effective pruning. Moreover, to address hidden stratification, which occurs when each labeled class contains multiple semantically distinct subclasses, GEORGE [208] employs clustering algorithms to slice data across different subclasses. Another tool for automated slicing is Multiaccuracy [116], where a simple "auditor" is trained to predict the residual of the full model using input features. Multiaccuracy, in general, is an efficient approach, since it only requires a small amount of audit data. Data slicing allows researchers and practitioners to identify biases and errors in a model's predictions and calibrate the model to improve its overall capabilities.
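A minimal sketch of slice-based evaluation with a pre-defined criterion: the test set is partitioned by a categorical column and per-slice accuracy is reported to surface underperforming sub-populations. The column names and toy values are hypothetical.

```python
# A minimal sketch of slice-based evaluation: partition the test set by a
# categorical column and report per-slice accuracy.
import pandas as pd
from sklearn.metrics import accuracy_score

def slice_report(df, slice_col, label_col, pred_col):
    rows = []
    for value, group in df.groupby(slice_col):
        rows.append({slice_col: value,
                     "n": len(group),
                     "accuracy": accuracy_score(group[label_col], group[pred_col])})
    return pd.DataFrame(rows).sort_values("accuracy")

# df holds test examples, true labels, and the model's predictions.
df = pd.DataFrame({
    "age_group": ["18-30", "18-30", "31-50", "31-50", "51+", "51+"],
    "label":     [1, 0, 1, 1, 0, 1],
    "pred":      [1, 0, 1, 0, 1, 1],
})
print(slice_report(df, "age_group", "label", "pred"))
```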
Algorithmic recourse (also known as "counterfactuals" [225] in the explainable AI domain) aims to generate a hypothetical set of samples that can flip model decisions toward preferred outcomes. For example, if an individual is denied a loan, algorithmic recourse seeks the closest sample (e.g., with a higher account balance) that would have been approved. Hypothetical samples derived through algorithmic recourse are valuable in understanding decision boundaries. For the previously mentioned example, the hypothetical sample addresses the question of how the individual could have been approved and also aids in the detection of potential biases across individuals.
The existing methods primarily vary in their strategies for identifying hypothetical samples, and can generally be classified into white-box and black-box methods. White-box methods necessitate access to the evaluated models, which can be achieved through complete internals [36, 111, 138], gradients [225], or solely the prediction function [52, 57, 127, 202]. Conversely, black-box methods do not require access to the model at all. For example, Dijkstra's algorithm is employed to obtain the shortest path between existing training data points to find recourse under certain distributions [173]. An alternative approach involves dividing the feature space into pure regions, where all data points belong to a single class, and utilizing graph traversing techniques [18, 24] to identify the nearest recourse. Given that the target label for reasoning is usually inputted by humans, these recourse methods all require minimal human participation.
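As a heavily simplified, model-agnostic illustration of the recourse idea (not one of the cited algorithms), the sketch below returns the observed sample closest to a rejected individual among those the model maps to the preferred outcome; real methods additionally enforce feasibility, sparsity, and plausibility constraints.

```python
# A heavily simplified, model-agnostic recourse sketch: among observed samples
# that the model maps to the preferred outcome, return the one closest to the
# rejected individual.
import numpy as np

def nearest_recourse(x, X_candidates, predict_fn, desired_label):
    # x: feature vector of the rejected individual (1-D NumPy array)
    # X_candidates: pool of existing data points (2-D NumPy array)
    # predict_fn: the trained model's prediction function
    preds = predict_fn(X_candidates)
    accepted = X_candidates[preds == desired_label]
    if len(accepted) == 0:
        return None                                   # no recourse found in the pool
    distances = np.linalg.norm(accepted - x, axis=1)
    return accepted[np.argmin(distances)]             # the closest accepted sample
```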
5.1.3 Challenges. The main challenge of constructing in-distribution evaluation sets lies in identifying the targeted samples effectively and efficiently. In the case of data slicing, determining the optimal subset of data is particularly challenging due to the exponential increase in the number of possible subsets with additional data points. Similarly, identifying the closest recourse when limited information is available also requires significant effort.
5.2 Out-of-distribution Evaluation
Out-of-distribution evaluation data refers to a set of samples that follow a distribution that differs from the one observed in the training data. We begin by discussing the need for out-of-distribution evaluation, followed by a review of two representative tasks: generating adversarial samples and generating samples with distribution shifts. Then we delve into the challenges associated with out-of-distribution data generation.
5.2.1 Need for Out-of-distribution Evaluation. Although modern machine learning techniques generally perform well on in-distribution datasets, the distribution of data in the deployment environment may not align with the training data [203]. Out-of-distribution evaluation primarily assesses a model's ability to generalize to unexpected scenarios by utilizing data samples that differ significantly from the ones used during training. This evaluation can uncover the transferability of a model and instill confidence in its performance in unexpected scenarios. Out-of-distribution evaluation can also provide essential insights into a model's robustness, exposing potential flaws that must be addressed before deployment. This is crucial in determining whether the model is secure in real-world deployments.
5.2.2 Methods. Generating adversarial samples and samples with distribution shifts are two representative strategies for out-of-distribution evaluation. Adversarial samples are inputs that have been intentionally manipulated or modified in a way that causes a model to make incorrect predictions. Adversarial samples can aid in comprehending a model's robustness and are typically generated by applying perturbations to the input data. Manual perturbation involves adding synthetic and controllable perturbations, such as noise and blur, to the original data [95].
Automated methods design learning-based strategies to generate perturbations automatically and are commonly classified into four categories: white-box attacks, physical world attacks, black-box attacks, and poisoning attacks. White-box attacks involve the attacker being provided with the model and victim sample. Examples of white-box attacks include Biggio's attack [21], DeepFool [154], and projected gradient descent attack [140]. Physical world attacks involve introducing real perturbations to real-world objects. For instance, in the work by the authors of Reference [66], stickers were attached to road signs to significantly impact the sign identifiers of autonomous cars. Black-box attacks are often applied when an attacker lacks access to a classifier's parameters or training set but possesses information regarding the data domain and model architecture. In Reference [166], the authors exploit the transferability property to generate adversarial examples. A zero-th order optimization-based black-box attack is proposed in Reference [41] that leverages the prediction confidence for the victim sample. Poisoning attacks involve the creation of adversarial examples prior to training, utilizing knowledge about model architectures. For instance, the poison frogs technique [200] inserts an adversarial image into the training set with a true label. By evaluating a trained model on various adversarial samples, we can gain a better understanding of the potential weaknesses of the model in deployment. This can help us take steps to prevent undesirable outcomes.
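For concreteness, the sketch below gives a minimal PyTorch version of the projected gradient descent attack in the L-infinity setting (omitting the random start often used in practice); inputs are assumed to be images scaled to [0, 1], and eps, alpha, and the step count are typical but arbitrary values.

```python
# A minimal PyTorch sketch of the projected gradient descent (PGD) attack:
# repeatedly step along the sign of the input gradient and project back into
# an L-infinity ball of radius eps around the original input.
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, eps=8 / 255, alpha=2 / 255, steps=10):
    x_adv = x.clone().detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        x_adv = x_adv.detach() + alpha * grad.sign()           # step up the loss surface
        x_adv = torch.max(torch.min(x_adv, x + eps), x - eps)  # project into the eps-ball
        x_adv = torch.clamp(x_adv, 0.0, 1.0)                   # keep a valid pixel range
    return x_adv.detach()
```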
Generating samples with distribution shifts enables the evaluation of a model on a different distribution. One straightforward way is to collect data with varying patterns, such as shifts across different times or locations [58], camera traps for wildlife monitoring [118], and diverse domains [187]. A more efficient approach would involve constructing the evaluation set from pre-collected data. To illustrate, some studies [85, 201] generate various sets of contiguous video frames that appear visually similar to humans but lead to inconsistent predictions due to the small perturbations.
Apart from natural distribution shifts in real-world data, synthetic distribution shifts are widely adopted, including three types: (1) covariate shift, which assumes that the input distribution is shifted [82, 214], (2) label shift, which assumes that the label distribution is shifted [10, 135], and (3) general distribution shift, which assumes that both the input and label distributions are shifted [68, 86]. Biased data sampling can be used to synthesize covariate shifts or label shifts, whereas learning-based methods are typically required to synthesize general distribution shifts [68, 86]. Generating samples with distribution shifts is essential in evaluating a model's transferability, especially when there is a distribution gap between the training and deployment environments.
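The sketch below synthesizes a covariate shift by biased sampling, as mentioned above: the input distribution of the evaluation set is skewed while the labeling rule is unchanged. The toy data and the sampling function are illustrative.

```python
# A minimal sketch of synthesizing a covariate shift by biased sampling:
# p(x) shifts while p(y|x) stays the same.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)           # the same labeling rule throughout

keep_prob = 1 / (1 + np.exp(-3 * X[:, 0]))        # keep samples with large x_0 more often
mask = rng.random(len(X)) < keep_prob
X_shifted, y_shifted = X[mask], y[mask]           # covariate-shifted evaluation set
```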
5.2.3 Challenges. The challenges for out-of-distribution evaluation set construction are twofold. First, generating high-quality out-of-distribution data is challenging. If the training data is not representative, then it may be difficult to generate appropriate data. Furthermore, the generative models may encounter mode collapse issues, meaning that they only generate a limited number of similar samples and disregard the diversity of the target distribution. Second, evaluating the quality of out-of-distribution generation is difficult, since no single metric can capture the diversity and quality of the generated samples. Commonly used metrics, such as likelihood or accuracy, may not be suitable, as they may exhibit bias toward generating samples similar to the training data. Therefore, various evaluation metrics have been proposed to assess the distance between in-distribution and out-of-distribution samples [19, 30, 160, 191]. Overall, creating high-quality out-of-distribution data is a complex and demanding task that requires meticulous design.
5.3 Prompt Engineering
Prompt engineering is an emerging task that aims to design and construct high-quality prompts to achieve the most effective performance on downstream tasks [136]. For example, when performing text summarization, we can provide the texts we want to summarize followed by specific instructions such as "summarize it" or "TL;DR" to guide the inference. In the following, we briefly discuss the needs, methods, and challenges of prompt engineering.
5.3.1 Need for Prompt Engineering. With the advent of large language models, it becomes feasible to accomplish a task by solely fine-tuning the input to probe knowledge from the model, while keeping the model fixed. Prompt engineering revolutionizes the traditional workflow by fine-tuning the input data rather than the model itself to achieve a given task, which has already become a common practice in the era of Large Language Models.
5.3.2 Methods. A natural way is to perform manual prompt engineering by creating templates. For example, in References [195-197], the authors have pre-defined templates for few-shot learning in text classification and conditional text generation tasks. However, manually crafting templates may not be sufficient to discover the optimal prompts for complex tasks. Thus, automated prompt engineering has been studied. Common programmatic approaches include mining the templates from an external corpus [108] and paraphrasing with a seed prompt [89, 251]. Learning-based methods automatically generate the prompt tokens by gradient-based search [227] or generative models [77].
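A minimal sketch of manual template-based prompting; the template strings are illustrative and not drawn from any of the cited papers. Only the input text is engineered, while the language model itself stays fixed.

```python
# A minimal sketch of manual prompt templates; the strings are illustrative.
SUMMARIZE = "{document}\n\nTL;DR:"
CLASSIFY = (
    "Review: the food was wonderful. Sentiment: positive\n"
    "Review: I will never come back. Sentiment: negative\n"
    "Review: {review} Sentiment:"
)

def build_prompt(template, **fields):
    return template.format(**fields)

prompt = build_prompt(SUMMARIZE, document="Data-centric AI shifts attention from models to data ...")
# `prompt` would then be sent to a fixed large language model for inference.
```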
5.3.3 Challenges. The primary obstacle in prompt engineering arises from the absence of a universal prompt template that consistently performs well. Various templates may result in different model behaviors, and obtaining the desired answers is not guaranteed. Therefore, further research is necessary to gain insight into the response of the model to prompts and guide the prompt engineering process.
Fig. 6. Overview of data maintenance, which encompasses several sub-goals aimed at supporting training and inference data. These include providing insights of data, supplying high-quality data, and enabling faster data acquisition.
6 Data Maintenance
In production scenarios, data is not created once but is rather continuously updated, making data maintenance a significant challenge that must be considered to ensure reliable and instant data supply in building AI systems. This section provides an overview of the need, representative methods (as depicted in Figure 6), and challenges of data maintenance. Our discussion spans across three aspects: data understanding (Section 6.1), data quality assurance (Section 6.2), and data storage and retrieval (Section 6.3). Table 5 of Appendix B summarizes the tasks and methods.
6.1 Data Understanding
To ensure proper maintenance, it is essential to first understand the data. The following discussion covers the need for data understanding techniques, ways to gain insights through visualization and valuation, and the challenges involved.
6.1.1 Need for Data Understanding Techniques. Real-world data often comes in large volumes and complexity, which can make it difficult to understand and analyze. There are three main reasons why data understanding techniques are crucial. First, comprehending a large number of raw data samples can be challenging for humans. To make it more manageable, we need to summarize the data and present it in a more concise and accessible way. Second, real-world data is often high-dimensional, while human perception is limited to two-or-three-dimensional space. Therefore, visualizing data in a lower-dimensional space is essential for understanding the data. Finally, it is crucial for organizations and stakeholders to understand the value of their data assets and the contribution of each data sample to the performance.
6.1.2 Data Visualization. Human beings are visual animals, and as such, we have a natural tendency to process and retain information presented in a pictorial and graphical format. Data visualization aims to leverage this innate human trait to help us better understand complex data. In what follows, we will discuss three relevant research topics: visual summarization, clustering for visualization, and visualization recommendation.
Visual summarization. Summarizing the raw data as a set of graphical diagrams can assist humans in gaining insights through a condensed interface. Despite its wide application, generating a faithful yet user-friendly summarization diagram is a non-trivial task. For example, it is hard to select the right visualization format. Radial charts (e.g., star glyphs and rose charts) and linear charts (e.g., line charts and bar charts) are two common formats for visualization. However, it is controversial which format is better. Although empirical evidence suggests that linear charts are superior to radial charts for many analytical tasks [35], radial charts are often more natural and memorable [31]. In some cases, it is acceptable to compromise on the faithfulness of data representation in favor of enhanced memorability or space efficiency [35, 226]. For readers who are interested, [56] and [73] provide a comprehensive taxonomy of visualization formats. Although automated scripts can generate plots, the process of visual summarization often demands minimal human participation to select the most appropriate visualization formats.
Clustering for visualization. Real-world data can be high-dimensional and with complex manifold structures. As such, dimensionality reduction techniques are often applied to visualize data in a two-or-three-dimensional space. Furthermore, automated clustering methods [67] are frequently combined with dimensionality reduction techniques to organize data points in a grouped, categorized, and often color-coded fashion, facilitating human comprehension and insightful analysis of the data.
Visualization recommendation. Building upon various visualization formats, there has been a surge of interest in visualization recommendation, which involves suggesting the most suitable visualization formats for a particular user. Programmatic automation approaches rank visualization candidates based on predefined rules composed of human perceptual metrics such as data type, statistical information, and human visual preference [176, 241]. Learning-based approaches exploit various machine learning techniques to rank the visualization candidates. An example of such a method is DeepEye [139], which utilizes the statistical information of the data as input and optimizes the normalized discounted cumulative gain based on the quality of the match between the data and the chart. Collaborative visualization techniques allow for a more adaptable user experience by enabling users to continuously provide feedback and requirements for the visualization [203]. A recent study, Snowy [210], accepts human language as input and generates recommendations for utterances during conversational visual analysis. As visualizations are intended for human users, allowing for human-in-the-loop feedback is crucial in developing visualization recommender systems.
6.1.3 Data Valuation. The objective of data valuation is to understand how each data point contributes to the final performance. Such information not only provides valuable insights to stakeholders but is also useful in buying or selling data points in the data market and credit attribution [78]. To accomplish this, researchers estimate the Shapley value of the data points, which assigns weights to each data point based on its contribution [3, 79]. A subsequent study has enhanced the robustness of this estimation across multiple datasets and models [78]. Since calculating the exact Shapley value can be computationally expensive, especially when dealing with a large number of data points, the above methods all adopt learning-based algorithms for efficient estimation.
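A minimal Monte Carlo sketch of the idea behind Shapley-style data valuation: each training point is credited with its average marginal contribution to validation accuracy over random orderings of the training set. This is a simplified illustration of the cited estimators and, as noted above, is expensive because it refits the model many times.

```python
# A minimal Monte Carlo sketch of Shapley-style data valuation: a point's value
# is its average marginal contribution to validation accuracy over random
# orderings of the training set.
import numpy as np
from sklearn.base import clone

def mc_shapley(X_tr, y_tr, X_val, y_val, base_model, n_perm=50, seed=0):
    rng = np.random.default_rng(seed)
    n = len(X_tr)
    values = np.zeros(n)
    for _ in range(n_perm):
        perm = rng.permutation(n)
        prev_score = 0.0                           # score of the empty coalition
        for k in range(1, n + 1):
            idx = perm[:k]
            if len(np.unique(y_tr[idx])) < 2:      # cannot fit a classifier yet
                score = prev_score
            else:
                score = clone(base_model).fit(X_tr[idx], y_tr[idx]).score(X_val, y_val)
            values[perm[k - 1]] += score - prev_score
            prev_score = score
    return values / n_perm                         # estimated per-point value
```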
6.1.4 Challenges. There are two major challenges. First, the most effective data visualization formats and algorithms (e.g., clustering algorithms) are often specific to the domain and influenced by human behavior, making it difficult to select the best option. This selection process often requires human input. Determining how to best interact with humans adds an additional complexity. Second, developing efficient data valuation algorithms is challenging, since estimating the Shapley value can be computationally expensive, especially as data sizes continue to grow. Additionally, the Shapley value may only offer a limited perspective on data value, as there are many other important factors beyond model performance, such as the problems that can be addressed through training a model on the data.
6.2 Data Quality Assurance
To ensure a reliable data supply, it is essential to maintain data quality. We will discuss why quality assurance is necessary, the key tasks involved in maintaining data quality (quality assessment and improvement), and the challenges.
6.2.1 Need for Data Quality Assurance. In real-world scenarios, data and the corresponding infrastructure for data processing are subject to frequent and continuous updates. As a result, it is important not only to create high-quality training or inference data once but also to maintain their excellence in a dynamic environment. Ensuring data quality in such a dynamic environment involves two aspects. First, continuous monitoring of data quality is necessary. Real-world data in practical applications can be complex, and it may contain various anomalous data points that do not align with our intended outcomes. As a result, it is crucial to establish quantitative measurements that can evaluate data quality. Second, if a model is affected by low-quality data, then it is important to implement quality improvement strategies to enhance data quality, which will also lead to improved model performance.
6.2.2 Quality Assessment. Quality assessment develops evaluation metrics to measure the quality of data and detect potential flaws and risks. These metrics can be broadly categorized as either objective or subjective assessments [16, 170, 185, 244]. Although objective and subjective assessments may require different degrees of human participation, both of them are used in each paper we surveyed. Thus, we tag each paper with more than one degree of human participation in Table 5 of Appendix B. We will discuss these two types of assessments and provide some representative examples.
Objective assessments directly measure data quality using inherent data attributes that are independent of specific applications. Examples of such metrics include accuracy, timeliness, consistency, and completeness. Accuracy refers to the correctness of obtained data, i.e., whether the obtained data values align with those stored in the database. Timeliness assesses whether the data is up-to-date. Consistency refers to the violation of semantic rules defined over a set of data items. Completeness measures the percentage of values that are not null. All of these metrics can be collected directly from the data, requiring only minimal human participation to specify the calculation formula.
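The sketch below computes three objective metrics (completeness, timeliness, and consistency) directly from a table; the column names, freshness window, and semantic rules are hypothetical placeholders.

```python
# A minimal sketch of objective quality metrics computed directly from a table.
import pandas as pd

def quality_report(df, timestamp_col, max_age_days, rules):
    violations = pd.concat([rule(df) for rule in rules], axis=1)
    return {
        "completeness": 1.0 - df.isna().mean().mean(),    # fraction of non-null cells
        "timeliness": ((pd.Timestamp.now() - pd.to_datetime(df[timestamp_col])).dt.days
                       <= max_age_days).mean(),            # fraction of sufficiently recent records
        "consistency": (~violations.any(axis=1)).mean(),   # fraction of rows violating no rule
    }

# Example semantic rules: an age must be non-negative, an interval must be ordered.
rules = [lambda d: d["age"] < 0,
         lambda d: d["end_date"] < d["start_date"]]
```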
Subjective assessments evaluate data quality from a human perspective, often specific to the application and requiring external analysis from experts. Metrics like trustworthiness, understandability, and accessibility are often assessed through user studies and questionnaires. Trustworthiness measures the accuracy of information provided by the data source. Understandability measures the ease with which users can comprehend collected data, while accessibility measures users' ability to access the data. Although subjective assessments may not directly benefit model training, they can facilitate easier collaboration within an organization and provide long-term benefits. Collecting these metrics typically requires full human participation, since they are often based on questionnaires.
6.2.3 Quality Improvement. Quality improvement involves developing strategies to enhance the quality of data at various stages of a data pipeline. Initially, programmatic automation methods are used to enforce quality constraints, including integrity constraints [15], denial constraints [47], and conditional functional dependencies [27] between columns. More recently, machine learning-based automation approaches have been developed to improve data quality. For instance, in Reference [17], a data validation module trains a machine learning model on a training set with expected data schema and generalizes it to identify potential problems in unseen scenarios. Furthermore, pipeline automation approaches have been developed to systematically curate data in multiple stages of the data pipeline, such as data integration and data cleaning [194, 220].
Apart from automation, collaborative approaches have been developed to encourage expert participation in data improvement. For example, in autonomous driving [76] and video content reviewing [55], human annotations are continuously used to enhance the quality of training data with the assistance of machine learning models. Moreover, UniProt [235], a public database for protein sequence and function literature, has created a systematic submission system to harness collective intelligence [42] for data improvement. This system automatically verifies meta-information, updated versions, and research interests of the submitted literature. All of these methods necessitate partial human participation, as humans must continuously provide information through annotations or submissions.
6.2.4 Challenges. Ensuring data quality poses two main challenges. First, selecting the most suitable assessment metric is not a straightforward task and heavily relies on domain knowledge. A single metric may not always be adequate in a constantly evolving environment. Second, quality improvement is a vital yet laborious process that necessitates careful consideration. Although automation is crucial in ensuring sustainable data quality, human involvement may also be necessary to ensure that the data quality meets human expectations. Therefore, data assessment metrics and data improvement strategies must be thoughtfully designed.
6.3 Data Storage and Retrieval
Data storage and retrieval systems play an indispensable role in providing the necessary data to build AI systems. To expedite the process of data acquisition, various efficient strategies have been proposed. In the following discussion, we elaborate on the importance of efficient data storage and retrieval, review some representative acceleration methods for resource allocation and query acceleration, and discuss the challenges associated with them.
6.3.1 Need for Efficient Data Storage and Retrieval. As the amount of data being generated continues to grow exponentially, having a robust and scalable data administration system that can efficiently handle the large data volume and velocity is becoming increasingly critical to support the training of AI models. This need encompasses two aspects. First, data administration systems, such as Hadoop [72] and Spark [253], often need to store and merge data from various sources, requiring careful management of memory and computational resources. Second, it is crucial to design querying strategies that enable fast data acquisition to ensure timely and accurate processing of the data.
6.3.2 Resource Allocation. Resource allocation aims to estimate and balance the cost of operations within a data administration system. Two key efficiency metrics in data administration systems are throughput, which refers to how quickly new data can be collected, and latency, which measures how quickly the system can respond to a request. To optimize these metrics, various parameter-tuning techniques have been proposed, including controlling database configuration settings (e.g., buffer pool size) and runtime operations (e.g., percentage of CPU usage and multiprogramming level) [64]. Early tuning methods rely on rules that are based on intuition, experience, data domain knowledge, and industry best practices from sources such as Apache [6] and Cloudera [141]. For instance, Hadoop guidelines [239] suggest that the number of reduce tasks should be set to approximately 0.95 or 1.75 times the number of reduce slots available in the cluster to ensure system tolerance for re-executing failed or slow tasks.
6.3.2 资源分配。资源分配旨在估计和平衡数据管理系统内的操作成本。数据管理系统中的两个关键效率指标是吞吐量(即收集新数据的速度)和延迟(即系统响应请求的速度)。为了优化这些指标,人们提出了各种参数调整技术,包括控制数据库配置设置(例如,缓冲池大小)和运行时操作(例如,CPU使用率百分比和多道程序设计级别)[64]。早期的调优方法依赖于基于直觉、经验、数据领域知识以及来自Apache[6]和Cloudera[141]等来源的行业最佳实践的规则。例如,Hadoop指南[239]建议将 reduce 任务数量设置为集群中可用 reduce 插槽数量的大约0.95或1.75倍,以确保系统对重新执行失败或缓慢任务的容忍度。
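To make such rule-based tuning concrete, the following is a minimal Python sketch of how the Hadoop rule of thumb above could be encoded; the helper name `suggest_num_reduce_tasks` and the example slot count are hypothetical illustrations, not part of Hadoop's API.

```python
def suggest_num_reduce_tasks(reduce_slots: int, conservative: bool = True) -> int:
    """Rule of thumb from the Hadoop guidelines cited above.

    0.95 * slots lets every reduce task launch immediately while leaving headroom
    to re-execute failed or slow tasks; 1.75 * slots launches a second, smaller
    wave so faster nodes can pick up extra work.
    """
    factor = 0.95 if conservative else 1.75
    return max(1, round(factor * reduce_slots))

# Example: a hypothetical cluster with 40 reduce slots.
print(suggest_num_reduce_tasks(40))                       # 38
print(suggest_num_reduce_tasks(40, conservative=False))   # 70
```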
Various learning-based strategies have been developed for resource allocation in data processing systems. For instance, Starfish [96] proposes a profile-predict-optimize approach that generates job profiles with dataflow and cost statistics, which are then used to predict virtual job profiles for task scheduling. More recently, machine learning approaches such as OtterTune [222] have been developed to automatically select the most important parameters, map workloads, and recommend parameters to improve latency and throughput. These learning-based automation strategies can adaptively balance system resources without assuming any internal system information.
已经开发了各种基于学习的策略来进行数据处理系统中的资源分配。例如,Starfish [96]提出了一种配置文件-预测-优化方法,该方法生成具有数据流和成本统计数据的作业配置文件,然后用于预测任务调度的虚拟作业配置文件。最近,已经开发了机器学习方法,如OtterTune [222],可以自动选择最重要的参数,映射工作负载,并推荐参数,以提高延迟和吞吐量。这些基于学习的自动化策略可以自适应地平衡系统资源,而无需假设任何内部系统信息。
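As an illustration of the learning-based idea (a simplified sketch, not a reimplementation of OtterTune), the code below ranks configuration knobs with an L1-regularized model and then scores candidate configurations with a Gaussian process to recommend one; the knob names, the synthetic workload data, and the use of scikit-learn are assumptions made purely for demonstration, and the workload-mapping step is omitted.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.gaussian_process import GaussianProcessRegressor

rng = np.random.default_rng(0)
knob_names = ["buffer_pool_mb", "cpu_share", "multiprogramming_level", "log_flush_interval"]

# Historical observations (synthetic): normalized knob settings -> measured throughput.
X = rng.uniform(0.0, 1.0, size=(200, len(knob_names)))
y = 3.0 * X[:, 0] + 1.5 * X[:, 2] + 0.1 * rng.normal(size=200)

# Step 1: identify the most important knobs with an L1-regularized linear model.
lasso = Lasso(alpha=0.01).fit(X, y)
important = [name for name, coef in zip(knob_names, lasso.coef_) if abs(coef) > 1e-2]
print("important knobs:", important)

# Step 2: model throughput over the important knobs and score random candidate configs.
idx = [knob_names.index(name) for name in important]
gp = GaussianProcessRegressor().fit(X[:, idx], y)
candidates = rng.uniform(0.0, 1.0, size=(1000, len(idx)))
best = candidates[np.argmax(gp.predict(candidates))]
print("recommended settings for", important, ":", best.round(2))
```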
6.3.3 Query Acceleration. Another research direction is efficient data retrieval, which can be achieved through efficient index selection and query rewriting strategies.
6.3.3 查询加速。另一个研究方向是高效的数据检索,可以通过高效的索引选择和查询重写策略来实现。
Query acceleration with index selection. The objective of index selection is to minimize the number of disk accesses needed during query processing. To achieve this, programmatic automation strategies create an indexing scheme with indexable columns and record query execution costs [215]. Then, they apply either a greedy algorithm [37] or dynamic programming [221] to select the indexing strategy. To enable a more adaptive and flexible querying strategy, learning-based automation strategies collect indexing data from human experts and train machine learning models to predict the proper indexing strategies [168], or search for the optimal strategies using reinforcement learning [186].
通过索引选择加速查询。索引选择的目标是最大限度地减少查询处理过程中所需的磁盘访问次数。为了实现这一目标,编程自动化策略创建了具有可索引列的索引方案,并记录了查询执行成本[215]。然后,他们应用贪婪算法[37]或动态规划[221]来选择索引策略。为了实现更具适应性和灵活性的查询策略,基于学习的自动化策略从人类专家那里收集索引数据并训练机器学习模型来预测正确的索引策略[168],或使用强化学习[186]寻找最佳策略。
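A minimal greedy sketch of this idea is shown below; it assumes each candidate index comes with an independent, pre-estimated cost saving and storage cost (real selectors re-estimate savings as indexes are added), and all index names and numbers are hypothetical.

```python
from typing import Dict, List, Tuple


def greedy_index_selection(
    candidates: Dict[str, Tuple[float, float]],  # index -> (estimated cost saving, storage cost)
    budget: float,
) -> List[str]:
    """Greedily pick indexes under a storage budget, maximizing saving per unit of storage."""
    chosen, used = [], 0.0
    remaining = dict(candidates)
    while remaining:
        feasible = {k: v for k, v in remaining.items() if used + v[1] <= budget}
        if not feasible:
            break
        best = max(feasible, key=lambda k: feasible[k][0] / feasible[k][1])
        chosen.append(best)
        used += remaining.pop(best)[1]
    return chosen


# Hypothetical candidate indexes with (saving, storage) estimates from a workload.
candidates = {
    "idx_orders_date": (120.0, 30.0),
    "idx_orders_customer": (80.0, 10.0),
    "idx_lineitem_part": (200.0, 90.0),
}
print(greedy_index_selection(candidates, budget=100.0))
```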
Query acceleration with rewriting. In parallel, query rewriting aims to reduce the workload by identifying repeated sub-queries from input queries. Rule-based strategies [11, 38] rewrite queries with pre-defined rules, such as DBridge [38], which constructs a dependency graph to model the data flow and iteratively applies transformation rules. Learning-based approaches use supervised learning [92] or reinforcement learning [273] to predict rewriting rules given an input query.
通过重写加速查询。同时,查询重写旨在通过从输入查询中识别重复的子查询来减少工作量。基于规则的策略[11,38]使用预定义的规则重写查询,例如DBridge [38],它构建依赖关系图来对数据流进行建模并迭代应用转换规则。基于学习的方法使用监督学习[92]或强化学习[273]来预测给定输入查询的重写规则。
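The toy sketch below illustrates the repeated-sub-query idea on a raw SQL string by factoring a duplicated sub-query into a common table expression; actual systems such as DBridge operate on query plans and dependency graphs rather than string matching, so this is only a didactic approximation, and the query text is made up.

```python
import re


def factor_repeated_subquery(sql: str) -> str:
    """If a non-nested sub-query appears more than once, hoist it into a single WITH clause."""
    subqueries = re.findall(r"\(SELECT [^()]+\)", sql)
    for sub in set(subqueries):
        if subqueries.count(sub) > 1:
            cte = "repeated_sub"
            rewritten = sql.replace(sub, f"(SELECT * FROM {cte})")
            return f"WITH {cte} AS {sub} {rewritten}"
    return sql


query = (
    "SELECT * FROM orders WHERE region IN (SELECT region FROM top_regions) "
    "AND ship_region IN (SELECT region FROM top_regions)"
)
print(factor_repeated_subquery(query))
```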
6.3.4 Challenges. Existing data storage and retrieval methods typically focus on optimizing specific parts of the system, such as resource allocation and query acceleration we mentioned. However, the real data administration system as a whole can be complex, since it needs to process a vast amount of data in various formats and structures, making end-to-end optimization a challenging task. Additionally, apart from efficiency, data storage and retrieval require consideration of several other crucial and challenging aspects, such as data access control and system maintenance.
6.3.4 挑战。现有的数据存储和检索方法通常侧重于优化系统的特定部分,例如我们提到的资源分配和查询加速。然而,真正的数据管理系统作为一个整体可能很复杂,因为它需要处理各种格式和结构的大量数据,这使得端到端优化成为一项具有挑战性的任务。此外,除了效率之外,数据存储和检索还需要考虑其他几个关键且具有挑战性的方面,例如数据访问控制和系统维护。
7 Data Benchmark
7 数据基准
In the previous sections, we explored a diverse range of data-centric AI tasks throughout various stages of the data lifecycle. Examining benchmarks is a promising approach for gaining insight into the progress of research and development in these tasks, as benchmarks comprehensively evaluate various methods based on standard and agreed-upon metrics. It is important to note that, within the context of data-centric AI, we are specifically interested in data benchmarks rather than model benchmarks, which should assess various techniques aimed at achieving data excellence. To be more specific, a data benchmark refers to a work that (i) provides a common ground for fair comparison of data-centric AI methods, (ii) comprehensively evaluates and compares the existing data-centric AI methods to understand the current progress, or involves both (i) and (ii). In this section, we survey the existing benchmarks for different goals of data-centric AI. First, we will introduce the benchmark collection strategy, and subsequently, we will summarize and analyze the collected benchmarks.
在前面的部分中,我们探讨了数据生命周期各个阶段的各种以数据为中心的人工智能任务。检查基准是深入了解这些任务研发进展的一种很有前途的方法,因为基准根据标准和商定的指标全面评估各种方法。值得注意的是,在以数据为中心的人工智能的背景下,我们特别感兴趣的是数据基准而不是模型基准,数据基准应该评估旨在实现数据卓越的各种技术。更具体地说,数据基准是指 (i) 为公平比较以数据为中心的人工智能方法提供共同基础,(ii) 全面评估和比较现有以数据为中心的人工智能方法以了解当前进展,或同时涉及 (i) 和 (ii) 的工作。在本节中,我们调查了面向以数据为中心的人工智能不同目标的现有基准。首先,我们将介绍基准收集策略,随后,我们将对收集到的基准进行总结和分析。
Collection strategy. We primarily utilize Google Scholar to search for benchmark papers. Specifically, we generate a series of queries for each task using relevant keywords for the sub-goal and task, and supplement them with terms such as "benchmark", "quantitative analysis", and "quantitative survey". For example, the queries for the task "data cleaning" include "benchmark data cleaning", "benchmark data cleansing", "quantitative analysis for data cleaning", "quantitative survey for data cleaning", and so on. It is worth noting that many of the queried benchmarks evaluate models rather than data. Thus, we have carefully read each paper and manually filtered the papers to ensure that they focus on the evaluation of data. We have also screened them based on the number of citations and the reputation of the publication venues.
收集策略。我们主要利用 Google Scholar 来搜索基准论文。具体来说,我们使用子目标和任务的相关关键字为每个任务生成一系列查询,并辅以“基准”、“定量分析”和“定量调查”等术语,例如,“数据清洗”任务的查询包括“基准数据清洗”、“基准数据清洗”、“数据清洗的定量分析”、“数据清洗的定量调查”等。值得注意的是,许多被查询的基准测试是评估模型而不是数据。因此,我们仔细阅读了每篇论文,并对论文进行了人工筛选,以确保它们专注于数据的评估。我们还根据引用次数和发表场所的声誉对其进行了筛选。
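For concreteness, a small sketch of the query-generation step described above is given below; the keyword lists are illustrative placeholders rather than the exact terms used in this survey.

```python
from itertools import product

# Illustrative task keywords and benchmark-related terms (not the survey's exact lists).
tasks = ["data cleaning", "data cleansing", "feature selection"]
benchmark_terms = ["benchmark", "quantitative analysis for", "quantitative survey for"]

queries = [f"{term} {task}" for term, task in product(benchmark_terms, tasks)]
for q in queries:
    print(q)  # e.g., "benchmark data cleaning", "quantitative analysis for data cleaning", ...
```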
Summary of the collected benchmarks. Table 2 comprises the 36 benchmarks that we collected using the above process, out of which 23 incorporate open-source code. Notably, we did not encounter a benchmark for the task of generating distribution-shifted samples, although there are benchmarks available for detecting distribution-shifted samples [118]. We omitted the latter from the table, since it mainly assesses model performance under distribution shift rather than discussing how to create distribution-shifted data that can expose model weaknesses.
收集的基准的摘要。表2包括我们使用上述过程收集的36个基准,其中23个包含开源代码。值得注意的是,尽管有可用于检测分布偏移样本的基准[118],但我们没有遇到针对生成分布偏移样本这一任务的基准。我们没有将该检测基准纳入表格,因为它主要评估模型在分布偏移下的性能,而不是讨论如何创建可能暴露模型弱点的分布偏移数据。
Table 2. Data Benchmarks
表 2.数据基准
Reference | Sub-goal | Task | Domain | Data modality | Open-source | Type |
Training data development | | | | | | |
Cohen et al. [49] | Collection | Dataset discovery | Biomedical | Tabular, text | x | (i) |
Poess et al. [171] | Collection | Data integration | Database | Tabular, time-series | x | (i) |
Pinkel et al. [169] | Collection | Data integration | Database | Tabular, graph | x | (i) and (ii) |
Wang et al. [234] | Labeling | Semi-supervised learning | AI | Image, text, audio | ✓ | (i) and (ii) |
Yang et al. [246] | Labeling | Active learning | AI | Tabular, image, text | x | (ii) |
Meduri et al. [145] | Labeling | Active learning | Database | Tabular, text | x | (i) and (ii) |
Abdelaal et al. [1] | Preparation | Data cleaning | Database | Tabular, text, time-series | ✓ | (i) and (ii) |
Li et al. [130] | Preparation | Data cleaning | Database | Tabular, time-series | ✓ | (i) |
Jager et al. [102] | Preparation | Data cleaning | AI | Tabular, image | x | (ii) |
Buckley et al. [33] | Preparation | Feature extraction | Healthcare | Tabular, image, time-series | ✓ | (ii) |
Vijayan et al. [223] | Preparation | Feature extraction | Biomedical | Tabular, sequential | ✓ | (i) and (ii) |
Bommert et al. [29] | Reduction | Feature selection | Biomedical | Tabular, sequential | ✓ | (ii) |
Espadoto et al. [65] | Reduction | Dimensionality reduction | Computer graphics | Tabular, image, audio | ✓ | (i) and (ii) |
Grochowski et al. [84] | Reduction | Instance selection | Computer graphics | Tabular, image, audio | ✓ | (ii) |
Blachnik et al. [23] | Reduction | Instance selection | Computer graphics | Tabular, image, audio | ✓ | (ii) |
Iwana et al. [101] | Augmentation | All tasks in the sub-goal | AI | Time-series | ✓ | (ii) |
Nanni et al. [155] | Augmentation | Basic manipulation | AI | Image | ✓ | (ii) |
Yoo et al. [248] | Augmentation | Basic manipulation | AI | Image | ✓ | (ii) |
Ding et al. [59] | Augmentation | Augmentation data synthesis | AI | Graph | x | (ii) |
Tao et al. [218] | Augmentation | Augmentation data synthesis | Computer security | Tabular | x | (ii) |
Zoller et al. [276] | - | Pipeline search | AI | Tabular, image, audio, time-series | ✓ | (ii) |
Gijsbers et al. [80] | - | Pipeline search | AI | Tabular, image, audio, time-series | ✓ | (i) and (ii) |
Inference data development | | | | | | |
Srivastava et al. [211] | In-distribution | Evaluation data synthesis | AI | Text | ✓ | (i) and (ii) |
Pawelczyk et al. [167] | In-distribution | Algorithmic recourse | AI | Tabular | ✓ | (ii) |
Dong et al. [62] | Out-of-distribution | Adversarial samples | AI | Image | ✓ | (ii) |
Hendrycks et al. [95] | Out-of-distribution | Adversarial samples | AI | Image | ✓ | (i) and (ii) |
Yoo et al. [249] | Out-of-distribution | Adversarial samples | AI | Text | ✓ | (ii) |
Data maintenance | | | | | | |
Kanthara et al. [112] | Understanding | Visual summarization | AI | Tabular, text | ✓ | (i) and (ii) |
Grinstein et al. [83] | Understanding | Visual summarization | Human-computer interaction | Tabular, image | ✓ | (i) |
Zeng et al. [255] | Understanding | Visualization recommendation | Human-computer interaction | Tabular | x | (i) |
Jia et al. [106] | Understanding | Data valuation | AI | Image | ✓ | (ii) |
Batini et al. [16] | Quality assurance | Quality assessment | Database | Tabular | x | (ii) |
Arocena et al. [7] | Quality assurance | Quality improvement | Database | Tabular | x | (i) |
Zhang et al. [268] | Storage and retrieval | Resource allocation | Database | Tabular | ✓ | (ii) |
Marcus et al. [142] | Storage and retrieval | Query index selection | Database | Tabular | x | (ii) |
Unified benchmark | | | | | | |
Mazumder et al. [144] | Multiple | 6 distinct tasks | AI | Multiple | x | (i) |
Note that these benchmarks evaluate data rather than models. In the column "Type", type (i) and/or (ii) refers to work that provides a common ground for fair comparison of data-centric AI methods and/or comprehensively evaluates and compares the existing data-centric AI methods to understand the current progress, respectively.
请注意,这些基准评估的是数据而不是模型。在"类型"一列中,类型 (i) 和/或 (ii) 分别指为公平比较以数据为中心的人工智能方法提供共同基础的工作,和/或全面评估和比较现有以数据为中心的人工智能方法以了解当前进展的工作。
Meta-analysis. We give a bird's-eye view of existing data-centric AI research across various dimensions by analyzing these collected benchmarks. (1) Although the AI community has made the most significant contributions to these benchmarks (17), numerous other domains have also made substantial contributions, including databases (9), computer graphics (3), human-computer interaction (2), biomedical (3), computer security (1), and healthcare (1). Notably, healthcare and biomedical are outside the realm of computer science. An established benchmark in a domain often implies that there is a collection of published works. Therefore, data-centric AI is an interdisciplinary effort that spans various domains within and outside of computer science. (2) The most frequently benchmarked data modality is tabular data (25), followed by image (15), time-series (7), text (6), audio (6), and graph (2). We conjecture that this is because tabular and image data have been extensively studied, while research on graph data is still emerging. (3) Training data development has received more attention, if we measure it based on the number of benchmarks (22), compared to inference data development (5) and data maintenance (8). We hypothesize that this is due to the fact that many of the tasks involved in training data development were considered as preprocessing steps in the model-centric paradigm.
荟萃分析。我们通过分析这些收集到的基准,从多个维度鸟瞰现有的以数据为中心的人工智能研究。(1) 尽管人工智能社区对这些基准做出了最重大的贡献 (17),但许多其他领域也做出了重大贡献,包括数据库 (9)、计算机图形学 (3)、人机交互 (2)、生物医学 (3)、计算机安全 (1) 和医疗保健 (1)。值得注意的是,医疗保健和生物医学不属于计算机科学领域。一个领域中既定的基准通常意味着有已发表作品的集合。因此,以数据为中心的人工智能是一项跨学科的努力,跨越计算机科学内外的各个领域。(2) 最常被基准测试的数据模态是表格数据 (25),其次是图像 (15)、时间序列 (7)、文本 (6)、音频 (6) 和图 (2)。我们推测,这是因为表格和图像数据已经被广泛研究,而对图数据的研究仍在兴起。(3) 如果我们根据基准的数量(22)来衡量,与推理数据开发(5)和数据维护(8)相比,训练数据开发受到了更多的关注。我们假设这是因为训练数据开发中涉及的许多任务在以模型为中心的范式中被视为预处理步骤。
8 Discussion and Future Direction
8 讨论与未来方向
What is the current stage of data-centric AI research, and what are the potential future directions? This section provides a top-level discussion of data-centric AI and presents some of the open problems that we have identified, aiming to motivate future exploration in this field. We start by trying to answer the research questions posed at the beginning:
以数据为中心的人工智能研究目前处于什么阶段,未来可能的方向是什么?本节对以数据为中心的人工智能进行了顶层讨论,并提出了我们已经确定的一些悬而未决的问题,旨在激发该领域的未来探索。我们首先尝试回答开头提出的研究问题:
- RQ1: What are the necessary tasks to make AI data-centric? Data-centric AI encompasses a range of tasks that involve developing training data, inference data, and maintaining data. These tasks include but are not limited to (1) cleaning, labeling, preparing, reducing, and augmenting the training data, (2) generating in-distribution and out-of-distribution data for evaluation, or tuning prompts to achieve desired outcomes, and (3) constructing efficient infrastructures for understanding, organizing, and debugging data.
- RQ1:使人工智能以数据为中心需要哪些任务?以数据为中心的人工智能包含一系列任务,涉及开发训练数据、推理数据和维护数据。这些任务包括但不限于 (1) 清理、标记、准备、减少和增强训练数据,(2) 生成分布内和分布外数据以进行评估,或调整提示以实现预期结果,以及 (3) 构建用于理解、组织和调试数据的高效基础设施。
- RQ2: Why is automation significant for developing and maintaining data? Given the availability of an increasing amount of data at an unprecedented rate, it is imperative to develop automated algorithms to streamline the process of data development and maintenance. Based on the papers surveyed, automated algorithms have been developed for all sub-goals. These automation algorithms span different automation levels, from programmatic automation to learning-based automation, to pipeline automation.
- RQ2:为什么自动化对于开发和维护数据很重要?鉴于数据量以前所未有的速度不断增加,开发自动化算法以简化数据开发和维护过程势在必行。根据调查的论文,已经为所有子目标开发了自动化算法。这些自动化算法跨越不同的自动化级别,从编程自动化到基于学习的自动化,再到管道自动化。
- RQ3: In which cases and why is human participation essential in data-centric AI? Human participation is necessary for many data-centric AI tasks, such as the majority of data labeling tasks and several tasks in inference data development. Notably, different methods may require varying degrees of human participation, ranging from full involvement to providing minimal inputs. Human participation is crucial in many scenarios, because it is often the only way to ensure that the behavior of AI systems aligns with human intentions.
- RQ3:在哪些情况下以及为什么人类参与以数据为中心的人工智能是必不可少的?人类参与对于许多以数据为中心的人工智能任务是必要的,例如大多数数据标记任务和推理数据开发中的多项任务。值得注意的是,不同的方法可能需要不同程度的人类参与,从全面参与到提供最少的输入。人类参与在许多场景中至关重要,因为它通常是确保人工智能系统行为符合人类意图的唯一途径。
- RQ4: What is the current progress of data-centric AI? Although data-centric AI is a relatively new concept, considerable progress has already been made in many relevant tasks, the majority of which were viewed as preprocessing steps in the model-centric paradigm. Meanwhile, many new tasks have recently emerged, and research on them is still ongoing. In Section 7, our meta-analysis on benchmark papers reveals that progress has been made across different domains, with the majority of the benchmarks coming from the AI domain. Among the three general data-centric AI goals, training data development has received relatively more research attention. For data modality, tabular and image data have been the primary focus. As research papers on data-centric AI are growing exponentially [256], we could witness even more progress in this field in the future.
- RQ4:以数据为中心的人工智能目前进展如何?尽管以数据为中心的人工智能是一个相对较新的概念,但在许多相关任务上已经取得了相当大的进展,其中大部分被视为以模型为中心的范式中的预处理步骤。同时,最近出现了许多新任务,并且对它们的研究仍在进行中。在第 7 节中,我们对基准论文的荟萃分析表明,不同领域都取得了进展,其中大多数基准来自人工智能领域。在三个以数据为中心的通用人工智能目标中,训练数据开发受到的研究关注相对较多。对于数据模态,表格和图像数据一直是主要关注点。随着以数据为中心的人工智能的研究论文呈指数级增长[256],未来我们可能会见证该领域的更多进展。
By attempting to address these questions, our survey delves into a variety of tasks and their needs and challenges, yielding a more concrete picture of the scope and progress of data-centric AI. However, although we have endeavored to broadly and comprehensively cover various tasks and techniques, it is impossible to include every aspect of data-centric AI. In the following, we connect data-centric AI with two other popular research topics in AI:
通过尝试解决这些问题,我们的调查深入研究了各种任务及其需求和挑战,从而更具体地了解了以数据为中心的人工智能的范围和进展。然而,尽管我们努力广泛、全面地涵盖各种任务和技术,但不可能包括以数据为中心的人工智能的方方面面。在下文中,我们将以数据为中心的人工智能与人工智能领域的另外两个热门研究主题联系起来:
- Foundation models. A foundation model is a large model that is trained on massive amounts of unlabeled data and can be adapted to various tasks, such as large language models [32, 161] and Stable Diffusion [184]. As models become sufficiently powerful, it becomes feasible to perform many data-centric AI tasks with models, such as data labeling [161] and data augmentation [250]. Consequently, the recent trend of foundation models has the potential to fundamentally alter our understanding of data. Unlike the conventional approach of storing raw data values in datasets, the model itself can be a form of data (or a 'container' of raw data), since the model can convey information (see the definition of data in Section 2.1). Foundation models blur the boundary between data and model, but their training still heavily relies on large and high-quality datasets.
- 基础模型。基础模型是一种在大量未标记数据上训练、可以适应各种任务的大型模型,例如大型语言模型[32,161]和Stable Diffusion[184]。随着模型变得足够强大,使用模型执行许多以数据为中心的人工智能任务变得可行,例如数据标注[161]和数据增强[250]。因此,基础模型的最新趋势有可能从根本上改变我们对数据的理解。与在数据集中存储原始数据值的传统方法不同,模型本身可以是一种数据形式(或原始数据的"容器"),因为模型可以传达信息(参见第 2.1 节中的数据定义)。基础模型模糊了数据和模型之间的界限,但它们的训练仍然严重依赖于大型和高质量的数据集。
- Reinforcement learning. Reinforcement learning is a research field that trains intelligent agents to optimize rewards without any initial data [153, 263]. It is a unique learning paradigm that alternates between generating data with the model and training the model with self-generated data. Like foundation models, the advancement of reinforcement learning could also possibly blur the boundary between data and model. Furthermore, reinforcement learning has already been widely adopted in several data-centric AI sub-goals, such as data labeling [46, 61, 258], data preparation [115], data reduction [137], and data augmentation [51, 257]. The reason could be attributed to its goal-oriented nature, which is well-suited for automation.
- 强化学习。强化学习是一个研究领域,它训练智能体在没有任何初始数据的情况下优化奖励[153,263]。它是一种独特的学习范式,在使用模型生成数据和使用自生成数据训练模型之间交替进行。与基础模型一样,强化学习的进步也可能模糊数据和模型之间的界限。此外,强化学习已经被广泛应用于几个以数据为中心的人工智能子目标中,例如数据标记[46,61,258]、数据准备[115]、数据缩减[137]和数据增强[51,257]。其原因可归因于其以目标为导向的性质,非常适合自动化。
Upon examining the connections to these two rapidly evolving research fields, we hypothesize that data-centric AI and model-centric AI could become even more intertwined in the development of AI systems. Looking forward, despite the concrete progress already achieved in data-centric AI, numerous challenges still persist, often necessitating more end-to-end automation or improved design of human participation, as summarized in Table 6 of Appendix B. We present some potential future directions we have identified in data-centric AI:
在研究了这两个快速发展的研究领域的联系后,我们假设以数据为中心的人工智能和以模型为中心的人工智能在人工智能系统的发展中可能会变得更加交织在一起。展望未来,尽管以数据为中心的人工智能已经取得了具体进展,但许多挑战仍然存在,通常需要更多的端到端自动化或改进人类参与设计,如附录 B 的表 6 所示。我们提出了我们在以数据为中心的人工智能中确定的一些潜在的未来方向:
- Cross-task automation. While there has been significant progress in automating various individual data-centric AI tasks, joint automation across multiple tasks remains largely unexplored. Although pipeline search methods [63, 71, 93, 124, 143, 148, 262] have emerged, they are limited only to training data development. From a broad data-centric AI perspective, it would be desirable to have a unified framework for jointly automating tasks aimed at different goals, ranging from training data development to inference data development and data maintenance.
- 跨任务自动化。虽然在自动化各种以数据为中心的单个人工智能任务方面取得了重大进展,但跨多个任务的联合自动化在很大程度上仍未得到探索。尽管已经出现了管道搜索方法 [63, 71, 93, 124, 143, 148, 262],但它们仅限于训练数据开发。从广泛的以数据为中心的人工智能角度来看,需要有一个统一的框架来联合自动化针对不同目标的任务,从训练数据开发到推理数据开发和数据维护。
- Data-model co-design. Although data-centric AI advocates for shifting the focus to data, it does not necessarily imply that the model has to remain unchanged. The optimal data strategies may differ when using different models, and vice versa. Furthermore, as discussed above, the boundary between data and model could potentially become increasingly blurred with the advancement of foundation models and reinforcement learning. Consequently, future progress in AI could arise from co-designing data and models, and the co-evolution of data and models could pave the way toward more powerful AI systems.
- 数据模型协同设计。尽管以数据为中心的人工智能主张将重点转移到数据上,但这并不一定意味着模型必须保持不变。使用不同的模型时,最佳数据策略可能会有所不同,反之亦然。此外,如上所述,随着基础模型和强化学习的进步,数据和模型之间的界限可能会变得越来越模糊。因此,人工智能的未来进步可能来自共同设计数据和模型,而数据和模型的共同进化可能为更强大的人工智能系统铺平道路。
- Debiasing data. In many high-stakes applications, AI systems have recently been found to exhibit discriminatory behavior toward certain groups of people, sparking significant concerns about fairness [60, 146, 228]. These biases often originate from imbalanced distributions of sensitive variables in the data. From a data-centric perspective, more research efforts are needed to debias data, including but not limited to mitigating biases in training data, systematic methodologies for constructing evaluation data that expose unfairness issues, and continuously maintaining fair data in a dynamic environment.
- 数据去偏。在许多高风险应用中,最近发现人工智能系统对某些人群表现出歧视行为,引发了对公平性的严重担忧[60,146,228]。这些偏差通常源于数据中敏感变量的不平衡分布。从以数据为中心的角度来看,需要更多的研究工作来消除数据偏差,包括但不限于减轻训练数据中的偏差、构建能够揭露不公平问题的评估数据的系统方法,以及在动态环境中持续维护公平的数据。
- Tackling data in various modalities. Based on the benchmark analysis presented in Section 7, most research efforts have been directed toward tabular and image data. However, other data modalities that are comparably important but less studied in data-centric AI pose significant challenges. For instance, time-series data [87, 260] exhibit complex temporal correlations, while graph data [272] has intricate data dependencies. Therefore, more research on how to engineer data for these modalities is required. Furthermore, developing data-centric AI solutions that can simultaneously address multiple data modalities is an intriguing avenue for future exploration.
- 处理各种模式的数据。根据第 7 节中提出的基准分析,大多数研究工作都针对表格和图像数据。然而,在以数据为中心的人工智能中同样重要但研究较少的其他数据模式带来了重大挑战。例如,时间序列数据 [87, 260] 表现出复杂的时间相关性,而图形数据 [272] 具有复杂的数据依赖关系。因此,需要对如何为这些模式设计数据进行更多研究。此外,开发能够同时处理多种数据模式的以数据为中心的人工智能解决方案是未来探索的一个有趣的途径。
- Data benchmark development. The advancement of model-centric AI has been facilitated by benchmarks that drive progress in model design, whereas data-centric AI still requires more attention to benchmarking. As discussed in Section 7, existing benchmarks for data-centric AI typically only focus on specific tasks. Constructing a unified benchmark to comprehensively evaluate overall data quality and various data-centric AI techniques presents a significant challenge. Although DataPerf [144] has made notable progress toward this objective, it currently supports only a limited number of tasks. The development of more unified data benchmarks would greatly accelerate research progress in this area.
- 数据基准开发。以模型为中心的人工智能的进步得益于推进模型设计的基准测试。而以数据为中心的人工智能需要更多地关注基准测试。正如第7节所讨论的,以数据为中心的人工智能的现有基准通常只关注特定任务。构建一个统一的基准来全面评估整体数据质量和各种以数据为中心的人工智能技术是一个重大挑战。尽管DataPerf [144]在实现这一目标方面取得了显着进展,但目前它仅支持有限数量的任务。开发更统一的数据基准将大大加速该领域的研究进展。
9 Conclusion
9 结论
This survey focuses on data-centric AI, an emerging and important research field in AI. We motivated the need for data-centric AI by showing how carefully designing and maintaining data can make AI solutions more desirable across academia and industry. Next, we provided a background of data-centric AI, which includes its definition and a goal-driven taxonomy. Then, guided by the research questions posed, we reviewed various data-centric AI techniques for different purposes from the perspectives of automation and collaboration. Furthermore, we collected data benchmarks from different domains and analyzed them at a meta-level. Last, we discussed data-centric AI from a global view and shared our perspectives on the blurred boundaries between data and model. We also presented potential future directions for this field. To conclude in one line, we believe that data will play an increasingly important role in building AI systems. At the same time, there are still numerous challenges that need to be addressed. We hope our survey could inspire collaborative initiatives in our community to push forward this field.
本次调查重点关注以数据为中心的人工智能,这是人工智能中一个新兴的重要研究领域。我们通过展示精心设计和维护数据如何使人工智能解决方案在学术界和工业界更受欢迎,从而激发了对以数据为中心的人工智能的需求。接下来,我们提供了以数据为中心的人工智能的背景,包括其定义和目标驱动的分类法。然后,在提出的研究问题的指导下,我们从自动化和协作的角度回顾了各种以数据为中心的人工智能技术,用于不同的目的。此外,我们还收集了来自不同领域的数据基准,并在元层面进行了分析。最后,我们从全球角度讨论了以数据为中心的人工智能,并分享了我们对数据和模型之间模糊界限的看法。我们还提出了该领域未来潜在的方向。总之,我们认为数据将在构建人工智能系统中发挥越来越重要的作用。与此同时,仍有许多挑战需要解决。我们希望我们的调查能够激发我们社区的合作举措,以推动这一领域的发展。
References
参考资料
Appendices
附录
A Strategy for Paper Collection and Statistics
A 论文收集策略与统计
To begin with, we formulated initial search queries by combining the terms listed in the "Sub-goal" and "Data-centric AI Tasks" columns of Table 1, such as "data labeling data programming". Subsequently, we conducted a manual examination of the papers returned by Google Scholar. For some tasks, such as data programming, we iteratively refined or expanded the initial search terms, since we found that some papers may not explicitly include the task keyword. Additionally, we supplemented our search by tracing the citation graph to identify relevant papers. During the above process, we decided whether a paper should be included based on a combination of factors, including its relevance to the tasks, the quality of the venues in which it has been published, and its citation count in Google Scholar. We focused our analysis on papers written in English and excluded non-English papers. The majority of the included papers were identified through search queries, with some identified through the citation graph.
首先,我们通过组合表 1 中"子目标"和"以数据为中心的 AI 任务"两列所列的术语来构造初始搜索查询,例如"数据标注 数据编程"。随后,我们对 Google Scholar 返回的论文进行了人工检查。对于数据编程等一些任务,我们迭代地细化或扩展了初始搜索词,因为我们发现某些论文可能没有明确包含任务关键词。此外,我们还通过追踪引文图来补充搜索,以识别相关论文。在上述过程中,我们根据多种因素决定是否收录一篇论文,包括论文与任务的相关性、发表场所的质量以及其在 Google Scholar 中的引用次数。我们的分析仅限于以英文撰写的论文,并排除了非英文论文。大多数收录的论文是通过搜索查询确定的,其余一些是通过引文图确定的。
Figure 7 plots the statistics of the analyzed papers. Of the 192 papers we have analyzed, 175 have been peer-reviewed, and 17 have not been formally published yet. Among the peer-reviewed papers, 52 have been published in journals, while 123 have been presented at conferences.
图7绘制了分析论文的统计数据。在我们分析的192篇论文中,有175篇已通过同行评审,17篇尚未正式发表。在同行评审的论文中,有52篇发表在期刊上,123篇在会议上发表。
Fig. 7. Statistics of the analyzed papers, categorized by venue type, goal, and publication year.
图7.分析论文的统计数据,按地点类型、目标和出版年份分类。
Around 70% of the peer-reviewed papers are from top computer science venues, with NeurIPS (15), VLDB (12), and ICLR (8) being the three most frequent venues. The remaining 30% are mainly from top venues in other domains with high impact factors. The unpublished papers are typically included due to their high relevance to the discussed topics, despite lacking official publication status. Many of these papers receive high citations, even without formal publication.
大约 70% 的同行评审论文来自顶级计算机科学场所,其中 NeurIPS (15)、VLDB (12) 和 ICLR (8) 是三个最常见的场所。其余 30% 主要来自其他领域具有高影响因子的顶级场所。尽管缺乏正式发表状态,这些未发表的论文通常因其与所讨论主题的高度相关性而被收录。其中许多论文即使没有正式发表,也获得了很高的引用。
From the collected papers, we observe an imbalance in community efforts across the three data-centric AI goals. The majority of research papers focus on training data development, with 71 papers, followed by inference data development, with 45 papers. Upon sorting the papers by year, we observe that the majority of them were published in or after 2018, which is consistent with our earlier observation in Reference [256], where we conducted a Google Scholar search with the exactly matched phrase "data-centric AI" and found a rapid increase in the number of publications. Interestingly, we observe that some data-centric AI tasks are emerging, such as data programming in training data development, and prompt engineering and algorithmic recourse in inference data development.
从收集到的论文中,我们观察到社区在三个以数据为中心的人工智能目标上的努力不平衡。大多数研究论文都集中在训练数据开发上,有71篇论文,其次是推理数据开发,有45篇论文。在按年份对论文进行排序后,我们观察到其中大多数是在2018年或之后发表的,这与我们之前在参考文献[256]中的观察一致,在参考文献[256]中,我们用完全匹配的短语“以数据为中心的AI”进行了谷歌学术搜索,发现出版物数量迅速增加。有趣的是,我们观察到一些以数据为中心的人工智能任务正在出现,例如训练数据开发中的数据编程,以及推理数据开发中的提示工程和算法追索权。
B Summary Tables
B 汇总表
Table 3. Papers for Achieving Different Sub-goals of Training Data Development, where Papers Are Categorized into Automation or Human Collaboration, Each Labeled with an Automation Degree or Participation Level, as Detailed in Section 3.2
表 3.实现训练数据开发的不同子目标的论文,其中论文分为自动化或人类协作,每个论文都标有自动化程度或参与级别,详见第 3.2 节
Sub-goal | Task | Method type | Automation level / participation degree | Reference |
Collection | Dataset discovery | Collaboration | Minimum | [26, 70, 156] |
Collection | Data integration | Automation | Programmatic | [121, 128] |
Collection | Data integration | Automation | Learning-based | [212, 213] |
Collection | Raw data synthesis | Automation | Programmatic | [125] |
Labeling | Crowdsourced labeling | Collaboration | Full | [53, 123, 217] |
Labeling | Semi-supervised labeling | Collaboration | Partial | [44, 163, 274, 277] |
Labeling | Active learning | Collaboration | Partial | [5, 61, 12, 258] |
Labeling | Data programming | Collaboration | Minimum | [99, 180, 181, 261] |
Labeling | Distant supervision | Automation | Learning-based | [149] |
Preparation | Data cleaning | Automation | Programmatic | [271] |
Preparation | Data cleaning | Automation | Learning-based | [94, 109, 119, 126] |
Preparation | Data cleaning | Collaboration | Partial | [231] |
Preparation | Feature extraction | Automation | Programmatic | [13, 189] |
Preparation | Feature extraction | Automation | Learning-based | [120, 236] |
Preparation | Feature transformation | Automation | Programmatic | [5, 22] |
Preparation | Feature transformation | Automation | Learning-based | [115] |
Reduction | Feature selection | Automation | Programmatic | [9, 219] |
Reduction | Feature selection | Automation | Learning-based | [233, 245] |
Reduction | Feature selection | Collaboration | Partial | [198, 269] |
Reduction | Dimensionality reduction | Automation | Learning-based | [2, 12, 242] |
Reduction | Instance selection | Automation | Programmatic | [175, 183] |
Reduction | Instance selection | Automation | Learning-based | [137, 216] |
Augmentation | Basic manipulation | Automation | Programmatic | [40, 88, 238, 264, 270] |
Augmentation | Basic manipulation | Automation | Learning-based | [51] |
Augmentation | Augmentation data synthesis | Automation | Learning-based | [74, 98, 100, 205] |
Augmentation | Upsampling | Automation | Programmatic | [39, 91] |
Augmentation | Upsampling | Automation | Learning-based | [257] |
- | Pipeline search | Automation | Pipeline | [63, 71, 93, 124, 148, 262] |
Table 4. Papers for Achieving Different Sub-goals of Inference Data Development, where Papers Are Categorized into Automation or Human Collaboration, Each Labeled with an Automation Degree or Participation Level, as Detailed in Section 3.2
表 4.实现推理数据开发不同子目标的论文,其中论文分为自动化或人工协作,每个论文都标有自动化程度或参与级别,详见第 3.2 节
Sub-goal | Task | Method type | Automation level / participation degree | References |
In-distribution | Data slicing | Collaboration | Minimum | [14] |
In-distribution | Data slicing | Collaboration | Partial | [192, 254] |
In-distribution | Data slicing | Automation | Learning-based | [48, 116, 188, 208] |
In-distribution | Algorithmic recourse | Collaboration | Minimum | [18, 24, 36, 52, 57, 111, 127, 138, 173, 202, 225] |
Out-of-distribution | Adversarial samples | Collaboration | Minimum | [95] |
Out-of-distribution | Adversarial samples | Automation | Learning-based | [21, 41, 66, 140, 154, 166, 200] |
Out-of-distribution | Distribution shift | Collaboration | Full | [58, 118, 187] |
Out-of-distribution | Distribution shift | Collaboration | Partial | [85, 201] |
Out-of-distribution | Distribution shift | Automation | Programmatic | [10, 82, 135, 214] |
Out-of-distribution | Distribution shift | Automation | Learning-based | [68, 86] |
Prompt engineering | Manual engineering | Collaboration | Partial | [195-197] |
Prompt engineering | Automated engineering | Automation | Programmatic | [89, 108, 251] |
Prompt engineering | Automated engineering | Automation | Learning-based | [77, 227] |
Table 5. Papers for Achieving Different Sub-goals of Data Maintenance, where Papers Are Categorized into Automation or Human Collaboration, Each Labeled with an Automation Degree or Participation Level, as Detailed in Section 3.2
表 5.实现数据维护不同子目标的论文,其中论文分为自动化或人工协作,每个论文都标有自动化程度或参与级别,详见第 3.2 节
Sub-goal | Task | Method type | Automation level / participation degree | Reference |
Understanding | Visual summarization | Collaboration | Minimum | [31, 35, 56, 73, 226] |
Understanding | Clustering for visualization | Automation | Learning-based | [67] |
Understanding | Visualization recommendation | Automation | Programmatic | [241] |
Understanding | Visualization recommendation | Automation | Learning-based | [139] |
Understanding | Visualization recommendation | Collaboration | Partial | [210] |
Understanding | Data valuation | Automation | Learning-based | [3, 78, 79] |
Quality assurance | Quality assessment | Collaboration | Minimum/partial | [16, 170, 185, 244] |
Quality assurance | Quality improvement | Automation | Programmatic | [15, 27, 47] |
Quality assurance | Quality improvement | Automation | Learning-based | [17] |
Quality assurance | Quality improvement | Automation | Pipeline | [194, 220] |
Quality assurance | Quality improvement | Collaboration | Partial | [42, 55, 76, 235] |
Storage and retrieval | Resource allocation | Automation | Programmatic | [6, 141, 239] |
Storage and retrieval | Resource allocation | Automation | Learning-based | [96, 222] |
Storage and retrieval | Query acceleration with index selection | Automation | Programmatic | [37, 215, 221] |
Storage and retrieval | Query acceleration with index selection | Automation | Learning-based | [168, 186] |
Storage and retrieval | Query acceleration with rewriting | Automation | Programmatic | [11, 38] |
Storage and retrieval | Query acceleration with rewriting | Automation | Learning-based | [92, 273] |
Table 6. Summary of the Challenges in Data-centric AI
表 6.以数据为中心的人工智能面临的挑战总结
Goal | Sub-goal | Challenges |
Training data development | Collection | (i) Difficulty in measuring relatedness in data integration; (ii) availability of data sources; (iii) ethical considerations, such as informed consent and data privacy. |
Training data development | Labeling | |
Training data development | Preparation | Customized preparation strategies are often required due to unique data characteristics. |
Training data development | Reduction | (i) Reduction with minimal information loss; (ii) potential bias after reduction. |
Training data development | Augmentation | There is no single augmentation strategy that is suitable for all datasets. |
Training data development | Pipeline search | High computational overhead with more search modules. |
Inference data development | In-distribution / Out-of-distribution | |
Inference data development | Prompt engineering | The absence of a universal prompt template that consistently performs well. |
Data maintenance | Understanding | (i) Visualization is domain-specific; (ii) difficulty in designing efficient data valuation algorithms. |
Data maintenance | Quality assurance | (i) Assessment metric is domain-specific; (ii) quality improvement is a laborious process. |
Data maintenance | Storage and retrieval | (i) Difficulty in end-to-end optimization; (ii) other considerations such as data access control. |