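Data Wrangling: Mixtera, a Data Plane for Foundation Model Training

Data wrangling

 Mixtera's approach to wrangling data that lives on a filesystem is worth borrowing: it makes the data easier to manage and use.
     Metadata: generating, parsing, and storing metadata
     Data and datasets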

Local and remote

    Step 1: create a client, either local or remote
    client = MixteraClient.from_directory(directory)
        directory (Path or str): The directory where Mixtera stores its metadata file
        Metadata directory
            checkpoint_directory   mixture_log_directory
        Data directory
    client = MixteraClient.from_remote(server_host, server_port)
        tempfile.TemporaryDirectory()


    Step 2: register a metadata parser under an identifier (metadata_parser_identifier)
        client.register_metadata_parser("TEST_PARSER", TestMetadataParser)

        Metadata: self._mdc = MixteraDataCollection(self.directory)
            class TestMetadataParser(MetadataParser):
                def get_properties(cls) -> list[MetadataProperty]:
                def parse(

            self._metadata_factory = MetadataParserFactory()
            return self._mdc._metadata_factory.add_parser(identifier, parser)

    Step 3: register the dataset
        client.register_dataset: registers the dataset under an identifier (its name)
            Dataset type (dtype)
            Dataset parsing function (parsing_func)
        return self._mdc.register_dataset(identifier, loc, dtype, parsing_func, metadata_parser_identifier)
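
The registration flow above can be mimicked with a small self-contained sketch. It does not import Mixtera: `ToyClient`, `ToyMetadataParser`, `ToyLanguageParser`, and the in-memory `samples` list are hypothetical stand-ins that only mirror the shape of the calls in these notes (register a parser under an identifier, then register a dataset that refers to that identifier).

```python
from abc import ABC, abstractmethod
from typing import Any, Callable


class ToyMetadataParser(ABC):
    """Hypothetical base class: turns one raw sample into a metadata record."""

    @abstractmethod
    def parse(self, sample_id: int, raw: dict[str, Any]) -> dict[str, Any]:
        ...


class ToyLanguageParser(ToyMetadataParser):
    """Keeps only the 'language' field of each raw sample."""

    def parse(self, sample_id: int, raw: dict[str, Any]) -> dict[str, Any]:
        return {"sample_id": sample_id, "language": raw.get("language")}


class ToyClient:
    """In-memory stand-in for a client; nothing is written to disk."""

    def __init__(self) -> None:
        self._parsers: dict[str, type[ToyMetadataParser]] = {}
        self.datasets: dict[str, dict[str, Any]] = {}

    def register_metadata_parser(self, identifier: str, parser: type[ToyMetadataParser]) -> bool:
        if identifier in self._parsers:
            return False  # identifiers are unique
        self._parsers[identifier] = parser
        return True

    def register_dataset(
        self,
        identifier: str,
        samples: list[dict[str, Any]],
        parsing_func: Callable[[dict[str, Any]], str],
        metadata_parser_identifier: str,
    ) -> bool:
        # The dataset must be new and must name a parser that is already registered.
        if identifier in self.datasets or metadata_parser_identifier not in self._parsers:
            return False
        parser = self._parsers[metadata_parser_identifier]()
        metadata = [parser.parse(i, raw) for i, raw in enumerate(samples)]
        self.datasets[identifier] = {"metadata": metadata, "parsing_func": parsing_func}
        return True


client = ToyClient()
parser_ok = client.register_metadata_parser("TEST_PARSER", ToyLanguageParser)
dataset_ok = client.register_dataset(
    "toy_dataset",
    samples=[{"text": "hello", "language": "en"}, {"text": "hallo", "language": "de"}],
    parsing_func=lambda raw: raw["text"],
    metadata_parser_identifier="TEST_PARSER",
)
languages = [record["language"] for record in client.datasets["toy_dataset"]["metadata"]]
```

The point of the ordering is that the dataset only stores an identifier for its parser, so the parser has to exist before the dataset is registered.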

Usage

 Two steps: (1) registering your data and (2) running queries/trainings on top of it.

 Adaptive optimization: the Adaptive Data Optimization (ADO) algorithm
    "State of the art": the most advanced, at the highest level currently achieved.
    Training data is typically stored on distributed filesystems in GPU clusters or data lakes in the cloud.

    Ad-hoc commands: commands run on the spot, usually to get a simple task done quickly.
    Filesystems: training data is typically stored and managed as files without a proper data management system.
    a full-fledged DBMS (database management system)

    a lightweight data plane
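 Architecture
     It is a centralized, read-only layer and can be declaratively queried from training clients.
     Client-server model.
         The server: generates chunks and streams them to the clients
             Metadata store: a centralized database of metadata
             Each piece of data is assigned a unique ID and descriptive properties, instead of the data being treated as a homogeneous, contiguous blob
             Read-only: the data is never modified. It exposes a standard iterator interface that torch.utils.data.DataLoader can consume
             Checkpoints make it possible to interrupt and resume; shuffling is done in a way that keeps the order deterministic
         The clients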
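
How checkpointing and deterministic shuffling fit together can be illustrated with a minimal sketch. This is a generic technique (a seeded shuffle plus a saved position), not Mixtera's actual mechanism; the class name `ResumableShuffledIterator` and the checkpoint layout are assumptions made for the example.

```python
import random


class ResumableShuffledIterator:
    """Iterates over item ids in a seeded, reproducible order."""

    def __init__(self, num_items: int, seed: int, position: int = 0) -> None:
        self._order = list(range(num_items))
        # A private Random instance keeps the order independent of global RNG state.
        random.Random(seed).shuffle(self._order)
        self._seed = seed
        self._position = position

    def __iter__(self) -> "ResumableShuffledIterator":
        return self

    def __next__(self) -> int:
        if self._position >= len(self._order):
            raise StopIteration
        item = self._order[self._position]
        self._position += 1
        return item

    def checkpoint(self) -> dict:
        # Seed and position are enough to rebuild the exact remaining order.
        return {"num_items": len(self._order), "seed": self._seed, "position": self._position}

    @classmethod
    def from_checkpoint(cls, state: dict) -> "ResumableShuffledIterator":
        return cls(state["num_items"], state["seed"], state["position"])


# An uninterrupted run over 10 items.
full_run = list(ResumableShuffledIterator(10, seed=42))

# The same run, interrupted after 4 items and resumed from a checkpoint.
interrupted = ResumableShuffledIterator(10, seed=42)
first_part = [next(interrupted) for _ in range(4)]
resumed = ResumableShuffledIterator.from_checkpoint(interrupted.checkpoint())
second_part = list(resumed)
```

Because the order is a pure function of the seed, the checkpoint stays tiny: it never has to store the shuffled order itself.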
 Terminology:
     Chunks are fixed-size collections of pointers to samples in files, and adhere to the current mixture.
         A chunk is independent of the structure of the underlying files.
     offline or online
     A blob is a self-contained block of binary data, whereas a stream is a sequence of data that arrives over time.
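
The definition of a chunk can be made concrete with a small sketch: pointers are grouped into fixed-size chunks so that every chunk follows the requested mixture. The pointer layout `(dataset_id, file_id, sample_index)`, the function name `make_chunks`, and the language keys are assumptions for illustration, not Mixtera's data structures.

```python
from typing import Iterator

# Hypothetical pointer layout: (dataset_id, file_id, sample_index).
Pointer = tuple[int, int, int]


def make_chunks(
    pools: dict[str, list[Pointer]],
    mixture: dict[str, float],
    chunk_size: int,
) -> Iterator[list[Pointer]]:
    """Yield chunks of exactly chunk_size pointers that follow the mixture."""
    # Number of pointers each mixture key contributes to every chunk.
    per_key = {key: round(share * chunk_size) for key, share in mixture.items()}
    if sum(per_key.values()) != chunk_size:
        raise ValueError("mixture does not divide the chunk size evenly")
    cursors = {key: 0 for key in mixture}
    # Stop as soon as any key can no longer fill its share of a chunk.
    while all(cursors[key] + per_key[key] <= len(pools[key]) for key in mixture):
        chunk: list[Pointer] = []
        for key, count in per_key.items():
            start = cursors[key]
            chunk.extend(pools[key][start:start + count])
            cursors[key] = start + count
        yield chunk


pools = {
    "en": [(0, 0, i) for i in range(9)],  # dataset 0 holds English samples
    "de": [(1, 0, i) for i in range(3)],  # dataset 1 holds German samples
}
chunks = list(make_chunks(pools, {"en": 0.75, "de": 0.25}, chunk_size=4))
```

A chunk carries only pointers, so the samples themselves stay in their files until a client actually reads them.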

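System design goals:
    Independent of the filesystem; data mixtures are specified declaratively
    Lightweight: few components, easy to integrate into existing frameworks
    User-friendly and flexible, with high throughput and determinism
    Supports adjusting the data mixture dynamically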

Implementation and abstractions

1. Chunk generation
    MixteraDataCollection (MDC) uses DuckDB as its robust and flexible foundation
        MetadataParser
            Three tables: source data / files / samples
                dataset_id  file_id
                sample_id   metadata

    MixtureKey abstraction
2. DuckDB does not accept inserts from multiple worker processes
    DuckDB does not support insertion from multiple Python process workers.
    We found that converting the worker results to columnar pyarrow in-memory tables and then inserting into the samples table has the highest throughput.
  
3. ChunkerIndex generation
    SQL generation and interval detection.
        Mixtera establishes a Common Table Expression (CTE) named base_data that contains the filtered samples.
    Chunk generation
        ChunkerIndex data structure
        MixtureKey abstraction
        Building the ChunkerIndex: C++ worker
4. Networking
    Python's asyncio framework
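
The interval detection mentioned in item 3 can be illustrated in plain Python. According to the notes, Mixtera does this as part of SQL generation; the sketch below only shows the idea of collapsing consecutive sample ids into ranges. The function name `detect_intervals` and the half-open `[start, end)` convention are assumptions.

```python
def detect_intervals(sample_ids: list[int]) -> list[tuple[int, int]]:
    """Collapse sample ids into sorted half-open [start, end) intervals."""
    intervals: list[tuple[int, int]] = []
    for sample_id in sorted(set(sample_ids)):
        if intervals and intervals[-1][1] == sample_id:
            # The id directly follows the previous run, so extend that run.
            intervals[-1] = (intervals[-1][0], sample_id + 1)
        else:
            # A gap was found, so start a new run.
            intervals.append((sample_id, sample_id + 1))
    return intervals


intervals = detect_intervals([0, 1, 2, 5, 6, 9])
# intervals == [(0, 3), (5, 7), (9, 10)]
```

Storing ranges instead of individual ids keeps the index small when the filtered samples are mostly contiguous within a file.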

Source code

    1. class MetadataParser(ABC)
       Properties: prop.dtype
       MetadataProperty(name="language", dtype="STRING", multiple=True, nullable=True),
    2. Concrete parsers
       class RedPajamaMetadataParser(MetadataParser):
       class ImagenetWebDatasetMetadataParser(MetadataParser):
    3. class MetadataParserFactory:
       """Handles the creation of metadata parsers."""

    Mixtures can follow:
      static, pre-defined schedules
      dynamic mixing algorithms
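
The difference between a static mixture and a pre-defined schedule can be sketched as follows. `ScheduledMixture` and its `(first_step, mixture)` layout are hypothetical and not taken from the Mixtera code base; a dynamic algorithm would compute the mixture from training feedback instead of looking it up.

```python
import bisect


class ScheduledMixture:
    """Returns the mixture that is active at a given training step."""

    def __init__(self, schedule: list[tuple[int, dict[str, float]]]) -> None:
        # schedule holds (first_step, mixture) pairs; a single pair is a static mixture.
        self._schedule = sorted(schedule, key=lambda entry: entry[0])
        if not self._schedule or self._schedule[0][0] != 0:
            raise ValueError("schedule must start at step 0")
        for _, mixture in self._schedule:
            if abs(sum(mixture.values()) - 1.0) > 1e-9:
                raise ValueError("mixture weights must sum to 1")
        self._starts = [start for start, _ in self._schedule]

    def mixture_at(self, step: int) -> dict[str, float]:
        if step < 0:
            raise ValueError("step must be non-negative")
        # Index of the last schedule entry whose first_step is <= step.
        index = bisect.bisect_right(self._starts, step) - 1
        return self._schedule[index][1]


static = ScheduledMixture([(0, {"en": 0.5, "de": 0.5})])
scheduled = ScheduledMixture([
    (0, {"en": 0.9, "de": 0.1}),
    (1000, {"en": 0.5, "de": 0.5}),
])
early = scheduled.mixture_at(999)
late = scheduled.mixture_at(1000)
```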

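Diagrams and code

From requirements to development: building from nothing
    Requirements analysis phase
        Architecture diagram
        Use case diagram: a view made up of actors, use cases, the system boundary, and the relationships among them, used to describe what the system does. It is the blueprint of the system.
    High-level design phase
        Class diagram: shows the static structure of the model, in particular the classes in it, their internal structure, and their relationships to other classes.
    Detailed design phase
        Sequence diagram: describes the time order in which objects send messages to each other, showing the dynamic collaboration among multiple objects.
    Deployment and operations
        Deployment diagram: shows the physical architecture of the system's software and hardware.

Down to the code: from having it to understanding it
   Looking at the code
     Read the examples to follow the execution flow, i.e. the sequence diagram or flow chart of the objects involved
     Read the structural code to recover the class diagram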

Installing pybind11

Three ways to install pybind11:
   1) Install with pip
   pip install pybind11 -i https://pypi.tuna.tsinghua.edu.cn/simple/

   2) Install with apt-get
     sudo apt-get install python3-pybind11
   3) Build from source (recommended)
    # run from the root of the pybind11 source tree
    mkdir build && cd build
    cmake ..
    make -j4
    make install

    After installation, later CMakeLists.txt files can use find_package(pybind11 REQUIRED).

Looking at the code structure

##### Core Files #####
add_subdirectory("core")
add_subdirectory("query")
add_subdirectory("chunker")

\mixtera-main\mixtera\core\query\chunker\src
    set(MIXTERA_CHUNKER_SOURCES
        ./src/chunker.cpp)
    set(MIXTERA_CHUNKER_HEADERS
        ./include/chunker.hpp)
    pybind11_add_module(chunker_extension ${MIXTERA_CHUNKER_SOURCES} ${MIXTERA_CHUNKER_HEADERS})
    target_include_directories(chunker_extension PUBLIC include ${PYARROW_INCLUDE_DIR} ${Arrow_INCLUDE_DIR} ${ArrowPython_INCLUDE_DIRS})
    target_compile_options(chunker_extension PRIVATE ${MIXTERA_COMPILE_OPTIONS} -Wno-unused-function)
    target_link_libraries(chunker_extension PRIVATE absl::flat_hash_map indicators::indicators Arrow::arrow_shared ${ARROW_PYTHON_LIB} spdlog fmt)
    set_target_properties(chunker_extension PROPERTIES LIBRARY_OUTPUT_DIRECTORY "${CMAKE_LIBRARY_OUTPUT_DIRECTORY}")
    set_target_properties(chunker_extension PROPERTIES INSTALL_RPATH_USE_LINK_PATH TRUE)

Using pybind11

1. Compile C++ into a library that Python can call
 Source file and CMakeLists.txt
   01. Source file:
    // 001. Include the headers
      #include <pybind11/pybind11.h>
      #include <pybind11/stl.h>
      namespace py = pybind11;
    // 002. Create and manipulate Python objects
      py::object
          py::isinstance<py::str>(obj)  py::isinstance<py::int_>(obj)
          py::dict  py::list
          py::isinstance<py::function>(func)
    // 003. Bind functions to a Python module
      PYBIND11_MODULE(example, m) {
          m.def("example_function", &example_function); // bind an example function to the module
          m.def("example_dict", &example_dict); // bind a function that handles a dict
          m.def("example_list", &example_list); // bind a function that handles a list
          m.def("call_python_function", &call_python_function); // bind a function that calls a Python function
      }
	
	
   02. CMakeLists.txt
    set(MIXTERA_CHUNKER_SOURCES
        ./src/chunker.cpp)
    set(MIXTERA_CHUNKER_HEADERS
        ./include/chunker.hpp)
    pybind11_add_module(chunker_extension ${MIXTERA_CHUNKER_SOURCES} ${MIXTERA_CHUNKER_HEADERS})
    target_include_directories(chunker_extension PUBLIC include ${PYARROW_INCLUDE_DIR} ${Arrow_INCLUDE_DIR} ${ArrowPython_INCLUDE_DIRS})
    target_compile_options(chunker_extension PRIVATE ${MIXTERA_COMPILE_OPTIONS} -Wno-unused-function)
    target_link_libraries(chunker_extension PRIVATE absl::flat_hash_map indicators::indicators Arrow::arrow_shared ${ARROW_PYTHON_LIB} spdlog fmt)
    set_target_properties(chunker_extension PROPERTIES LIBRARY_OUTPUT_DIRECTORY "${CMAKE_LIBRARY_OUTPUT_DIRECTORY}")
    set_target_properties(chunker_extension PROPERTIES INSTALL_RPATH_USE_LINK_PATH TRUE)

    ### Notes
        In CMake:
        CMAKE_LIBRARY_OUTPUT_DIRECTORY: the default location where shared libraries are placed when they are built
        `rpath` (runtime path) specifies where libraries are searched for at run time. It lets the dynamic linker find shared libraries at run time without relying on environment variables such as `LD_LIBRARY_PATH`

   03. Re-export the compiled function from the Python package
    from .chunker_extension import create_chunker_index_cpp as create_chunker_index
    __all__ = ["create_chunker_index"]

2. Using it from Python
query_result.py
    from mixtera.core.query.chunker import create_chunker_index as cpp_create

    class QueryResult:
        def _create_chunker_index(table: pa.Table) -> ChunkerIndex:
            return cpp_create(table, num_workers if not in_test else 1)

        self._chunker_index: ChunkerIndex = QueryResult._create_chunker_index(results)

    from mixtera.core.datacollection.index import ChunkerIndex, ChunkerIndexDatasetEntries
        IndexRowRangeType = list[tuple[int, int]]
        ChunkerIndexDatasetEntries = dict[int, dict[int | str, IndexRowRangeType]]
        ChunkerIndex = Dict["MixtureKey", ChunkerIndexDatasetEntries]

    ChunkerIndex: A nested dictionary mapping mixture keys to dataset IDs, file IDs, and intervals.
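
The nested shape described above (mixture key, then dataset ID, then file ID, then a list of intervals) can be reproduced with plain dictionaries. In this sketch a string such as "language:en" stands in for a MixtureKey, and `build_index` and `count_samples` are hypothetical helpers written for the example; the real index is built by the C++ extension.

```python
from collections import defaultdict

# (mixture_key, dataset_id, file_id, start, end) with half-open [start, end) intervals.
Row = tuple[str, int, int, int, int]


def build_index(rows: list[Row]) -> dict:
    """Group interval rows into key -> dataset_id -> file_id -> [(start, end), ...]."""
    index = defaultdict(lambda: defaultdict(lambda: defaultdict(list)))
    for mixture_key, dataset_id, file_id, start, end in rows:
        index[mixture_key][dataset_id][file_id].append((start, end))
    # Convert to plain dicts so that looking up a missing key raises KeyError.
    return {
        key: {dataset_id: dict(files) for dataset_id, files in datasets.items()}
        for key, datasets in index.items()
    }


def count_samples(index: dict, mixture_key: str) -> int:
    """Total number of samples reachable under one mixture key."""
    return sum(
        end - start
        for files in index[mixture_key].values()
        for intervals in files.values()
        for start, end in intervals
    )


rows = [
    ("language:en", 0, 1, 0, 3),
    ("language:en", 0, 1, 5, 7),
    ("language:en", 0, 2, 0, 10),
    ("language:de", 1, 7, 4, 6),
]
index = build_index(rows)
english_total = count_samples(index, "language:en")
```

Keeping intervals at the leaves means a chunk can be cut by slicing ranges, without touching the sample data.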

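Other C++ libraries

   PYARROW_INCLUDE_DIR
   Arrow_INCLUDE_DIR
   ArrowPython_INCLUDE_DIRS

spdlog is a fast C++ logging library. It can be used header-only, is compatible with C++11, and its format strings feel somewhat Python-like.
    Single log file
    Rotating log files
    Asynchronous logging

PyArrow in Python and the Arrow library in C++ can exchange data through shared memory or through serialization.
    Arrow C++ libraries
    Arrow C++ and PyArrow C++ header files are bundled with a pyarrow installation
    Arrow API:    #include <arrow/api.h>
    pyarrow API:  #include <arrow/python/pyarrow.h>


 Apache Arrow is a collection of libraries rather than a single library.
    C data interface: ArrowSchema, ArrowArray, ArrowDeviceArray, ArrowArrayStream, and ArrowDeviceArrayStream.

 The arrow::Field class in the C++ library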

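Designing pretraining data

  Collection time, content filtering strategy (toxicity/quality), different levels of quality and toxicity, different domains
Dataset composition: which data sources to include, whether to filter on attributes such as quality and toxicity, and when new documents are collected
The English Colossal Clean Crawled Corpus (C4) is a cleaned version of a 2019 Common Crawl snapshot.
The Pile: an 825 GiB English text corpus for training large-scale language models, made up of 22 diverse, high-quality subsets.

 micromamba: a small version of the mamba package manager, implemented in C++ and providing mamba's core functionality. It is a lightweight alternative to conda designed to run efficiently.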

ETH Zurich (Swiss Federal Institute of Technology in Zurich)
   The Systems Group is an open research and teaching collaboration
      by professors from the Department of Computer Science at ETH Zürich in the general area of systems.
    Institute for Computing Platforms - Systems Group
        Data Processing on Modern Hardware
        Operating Systems: We are the Network and Operating Systems research group.
        Efficient Cloud Systems: Efficient Architectures and Systems Lab (EASL)
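multiprocessing.current_process().name
    Determines the current process (MP.current_process().name), depending on the type and number of arguments passed
        Every Process instance has a name, with a default value that can be changed when the process is created. Naming processes is useful for keeping track of them.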
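
Process naming can be tried out with the standard library alone. The helper `worker_label` and the name "metadata-worker-0" are made up for this sketch, and the processes are only created, not started, to keep the example side-effect free.

```python
import multiprocessing as mp


def worker_label() -> str:
    """Return a log prefix based on the name of the process that calls it."""
    return f"[{mp.current_process().name}]"


# Without a name argument, the process gets a default name such as "Process-1".
default_named = mp.Process(target=worker_label)

# A custom name is easier to recognise in logs and debuggers.
custom_named = mp.Process(target=worker_label, name="metadata-worker-0")

# Called here, worker_label() reports the name of the calling (parent) process.
parent_label = worker_label()
```

If `custom_named` were started, `worker_label()` inside the child would return "[metadata-worker-0]", which makes it straightforward to tell worker output apart.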
		
 Tunnel
 __all__ = ["MixteraDataCollection", "Property", "PropertyType"]

References

 https://gitlab.inf.ethz.ch/project-opensockeye/kirsch/-/tree/main
 https://gitlab.inf.ethz.ch/project-opensockeye/efeu
 https://github.com/eth-easl/mixtera/tree/main/examples
 Mixtera: A Data Plane for Foundation Model Training  https://arxiv.org/pdf/2502.19790

Posted 2025-04-02 19:24 by 辰令
