OpenSeek: Study and Reproduction Notes

OpenSeek aims to unite the global open-source community to drive collaborative innovation in algorithms, data, and systems, with the goal of developing next-generation models that surpass DeepSeek.

📌 Project Overview

OpenSeek is an open-source project initiated by the Beijing Academy of Artificial Intelligence (BAAI). It aims to unite the global open-source community to drive collaborative innovation in algorithms, data, and systems, and to develop next-generation models that surpass DeepSeek. The project draws inspiration from large-model initiatives such as BigScience and OPT, and is committed to building an independent open-source ecosystem for algorithmic innovation. Since the DeepSeek models were open-sourced, academia has produced many algorithmic improvements and breakthroughs, but these innovations often lack complete code implementations, the necessary compute, and high-quality data. OpenSeek seeks to explore mechanisms for building high-quality datasets together with the open-source community, to open-source the entire large-model training pipeline, to build innovative training and inference code that supports AI chips beyond Nvidia, and to promote independent technological innovation and application development.

Objectives of OpenSeek:

  • Advanced data technology: address the challenge of acquiring high-quality data.

  • Support for multiple AI devices: reduce dependence on specific chips and improve model portability and adaptability.

  • A standardized LLM training baseline: promote independent algorithmic innovation and technology sharing through open-source collaboration.

Project: https://github.com/orgs/FlagAI-Open/projects/1

Environment Setup

Install the FlagScale environment (Docker is recommended):

# Pull images
docker pull openseek2025/openseek:flagscale-20250527

# Clone the repository
git clone https://github.com/FlagOpen/FlagScale.git

Enter the Docker container to start training:

docker_image=openseek2025/openseek:flagscale-20250527
docker run -it --gpus all --ipc=host --shm-size=8g -v /data2:/data2 $docker_image /bin/bash
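
Once inside the container, it is worth a quick check that the GPUs are visible to PyTorch. A minimal sketch, assuming PyTorch is preinstalled in the image (the training logs later in this note show torch in use, so this should hold):

# Sanity check inside the container: confirm PyTorch can see the GPUs
# that were passed through with --gpus all.
import torch

print("CUDA available:", torch.cuda.is_available())
print("Visible GPUs:", torch.cuda.device_count())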

Running the Baseline

openseek-baseline serves as the baseline for the PAZHOU algorithm competition and is also used to evaluate PRs to OpenSeek. It is a standardized LLM training and evaluation pipeline, consisting of a 100B-token dataset, training code, wandb logs, checkpoints, and evaluation results.

Data Preparation

git lfs install
git clone https://huggingface.co/datasets/BAAI/OpenSeek-Pretrain-100B

If the network is unstable, consider using ModelScope instead:

pip install modelscope
modelscope download --dataset BAAI/OpenSeek-Pretrain-100B --local_dir ./dir

Once the download completes, the total size is about 413.57 GB.
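
To confirm the download is complete, you can total the on-disk size and compare it against the figure above. A small sketch; adjust root to whatever --local_dir you used:

# Sum the size of every file under the dataset directory and report GiB.
from pathlib import Path

root = Path("./OpenSeek-Pretrain-100B")
total_bytes = sum(f.stat().st_size for f in root.rglob("*") if f.is_file())
print(f"Downloaded size: {total_bytes / 1024**3:.2f} GiB")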

OpenSeek-Pretrain-100B data distribution:

Name Tokens Tokens(B)
Nemotron-CC-high-actual-actual-high 1140543860 1.14
Nemotron-CC-high-actual-actual-low 368646238 0.37
Nemotron-CC-high-actual-actual-mid 801213010 0.80
Nemotron-CC-high-synthetic-distill-high 294569308 0.29
Nemotron-CC-high-synthetic-distill-low 172342068 0.17
Nemotron-CC-high-synthetic-distill-mid 240998642 0.24
Nemotron-CC-high-synthetic-diverse_qa_pairs-high 556137649 0.56
Nemotron-CC-high-synthetic-diverse_qa_pairs-low 418742390 0.42
Nemotron-CC-high-synthetic-diverse_qa_pairs-mid 515733187 0.52
Nemotron-CC-high-synthetic-extract_knowledge-high 475714119 0.48
Nemotron-CC-high-synthetic-extract_knowledge-low 68996838 0.07
Nemotron-CC-high-synthetic-extract_knowledge-mid 353316407 0.35
Nemotron-CC-high-synthetic-knowledge_list-high 268953064 0.27
Nemotron-CC-high-synthetic-knowledge_list-low 187973360 0.19
Nemotron-CC-high-synthetic-knowledge_list-mid 238373108 0.24
Nemotron-CC-high-synthetic-wrap_medium-high 848837296 0.85
Nemotron-CC-high-synthetic-wrap_medium-low 295324405 0.30
Nemotron-CC-high-synthetic-wrap_medium-mid 687328353 0.69
Nemotron-CC-low-synthetic-wrap_medium-high 479896420 0.48
Nemotron-CC-low-synthetic-wrap_medium-low 206574167 0.21
Nemotron-CC-low-synthetic-wrap_medium-mid 444865784 0.44
Nemotron-CC-medium-actual-actual-high 1174405205 1.17
Nemotron-CC-medium-actual-actual-low 698884310 0.70
Nemotron-CC-medium-actual-actual-mid 945401567 0.95
arxiv 660912931 0.66
books 483917796 0.48
code-high 1040945650 1.04
code-low 1175000655 1.18
code-mid 996826302 1.00
cot_synthesis2_CC-high 386941302 0.39
cot_synthesis2_CC-low 51390680 0.05
cot_synthesis2_CC-mid 1885475230 1.89
cot_synthesis2_OpenSource-high 265167656 0.27
cot_synthesis2_OpenSource-low 168830028 0.17
cot_synthesis2_OpenSource-mid 334976884 0.33
cot_synthesis2_arxiv-high 12894983685 12.89
cot_synthesis2_arxiv-low 9177670132 9.18
cot_synthesis2_arxiv-mid 10446468216 10.45
cot_synthesis2_code-high 473767419 0.47
cot_synthesis2_code-low 706636812 0.71
cot_synthesis2_code-mid 926436168 0.93
cot_synthesis2_math-high 1353517224 1.35
cot_synthesis2_math-low 1703361358 1.70
cot_synthesis2_math-mid 364330324 0.36
cot_synthesis2_wiki-high 650684154 0.65
cot_synthesis2_wiki-low 615978070 0.62
cot_synthesis2_wiki-mid 814947142 0.81
cot_synthesis_CC-high 229324269 0.23
cot_synthesis_CC-low 185148748 0.19
cot_synthesis_CC-mid 210471356 0.21
cot_synthesis_OpenSource-high 420505110 0.42
cot_synthesis_OpenSource-low 170987708 0.17
cot_synthesis_OpenSource-mid 1321855051 1.32
cot_synthesis_arxiv-high 5853027309 5.85
cot_synthesis_arxiv-low 7718911399 7.72
cot_synthesis_arxiv-mid 9208148090 9.21
cot_synthesis_code-high 789672525 0.79
cot_synthesis_code-low 417526994 0.42
cot_synthesis_code-mid 197436971 0.20
cot_synthesis_math-high 522900778 0.52
cot_synthesis_math-low 663320643 0.66
cot_synthesis_math-mid 660137084 0.66
cot_synthesis_wiki-high 412152225 0.41
cot_synthesis_wiki-low 367306600 0.37
cot_synthesis_wiki-mid 594421619 0.59
math-high 1871864190 1.87
math-low 1745580082 1.75
math-mid 1680811027 1.68
pes2o 6386997158 6.39
pes2o-full-train 1469110938 1.47
pes2o-full-val 14693152 0.01
stack 435813429 0.44
wiki 433002447 0.43
zh_cc-high-loss0 1872431176 1.87
zh_cc-high-loss1 1007405788 1.01
zh_cc-high-loss2 383830893 0.38
zh_cc-medidum-loss0 978118384 0.98
zh_cc-medidum-loss1 951741139 0.95
zh_cc-medidum-loss2 1096769115 1.10

Visualization based on the ratios in the yaml:

[Screenshots: visualizations of the per-source sampling ratios]
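
A chart like the ones above can be produced with a short script. The sketch below is illustrative only: the yaml path and the data_path key are assumptions, and the config is expected to hold an alternating [ratio, path, ratio, path, ...] list, matching the data_path argument printed in the training log later in this note.

# Illustrative sketch: plot per-source sampling ratios as a bar chart.
# "openseek/baseline/data_config.yaml" and the "data_path" key are
# hypothetical; adapt them to the actual yaml used by the baseline.
import yaml
import matplotlib.pyplot as plt

with open("openseek/baseline/data_config.yaml") as f:
    cfg = yaml.safe_load(f)

entries = cfg["data_path"]                         # [ratio, path, ratio, path, ...]
ratios = [float(r) for r in entries[0::2]]
names = [p.split("/")[-2] for p in entries[1::2]]  # parent dir = source name

plt.figure(figsize=(14, 4))
plt.bar(range(len(ratios)), ratios)
plt.xticks(range(len(names)), names, rotation=90, fontsize=5)
plt.ylabel("Sampling ratio")
plt.tight_layout()
plt.savefig("data_ratios.png", dpi=200)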

Training

Make sure you have completed the environment setup and configuration described in the previous section; your OpenSeek folder should look like this:

OpenSeek
├── OpenSeek-Pretrain-100B (Dataset directory for downloaded datasets.)
├── FlagScale (FlagScale directory cloned from GitHub.)
├── OpenSeek-Small-v1-Baseline (Experiment directory will be created automatically and contains logs and model checkpoints etc.)
├── ...

Next, you can run the baseline with a single command:

bash openseek/baseline/run_exp.sh start

This command mainly generates the corresponding training command scripts from the model and data configurations.

The model configuration is as follows:

  • Note: by default, envs is set up for single-GPU training. To switch to multi-GPU, modify nnodes, nproc_per_node, VISIBLE_DEVICES, and DEVICE_MAX_CONNECTIONS as shown in the config below.
  • For the training cmds, if you are running inside the default Docker image, change before_start to
    source /root/miniconda3/bin/activate flagscale-train. This setting controls which conda installation is used and which conda environment is activated.
# DeepSeek 1_4B, 0_4A Model
defaults:
  - _self_
  - train: train_deepseek_v3_1_4b.yaml

experiment:
  exp_name: OpenSeek-Small-v1-Baseline
  dataset_base_dir: ../OpenSeek-Pretrain-100B
  seed: 42
  save_steps: 600
  load: null
  exp_dir: ${experiment.exp_name}
  ckpt_format: torch
  task:
    type: train
    backend: megatron
    entrypoint: flagscale/train/train_gpt.py
  runner:
    no_shared_fs: false
    backend: torchrun
    rdzv_backend: static
    ssh_port: 22
    # nnodes: 1
    nnodes: 8
    # nproc_per_node: 1
    nproc_per_node: 8
    hostfile: null
  cmds:
    # before_start: "ulimit -n 1048576 && source /root/miniconda3/bin/activate flagscale"
    before_start: "ulimit -n 1048576 && source /root/miniconda3/bin/activate flagscale-train"

    # before_start: "ulimit -n 1048576 && source /data/anaconda3/bin/activate flagscale-train"
    
  envs:
    # VISIBLE_DEVICES: 0
    # DEVICE_MAX_CONNECTIONS: 1
    VISIBLE_DEVICES: 0,1,2,3,4,5,6,7
    DEVICE_MAX_CONNECTIONS: 8

action: run

hydra:
  run:
    dir: ${experiment.exp_dir}/hydra
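
For reference, torchrun launches nnodes × nproc_per_node ranks in total, so the multi-node values above start 64 training processes. (The example log in the next section reports "world size: 8", i.e. it was produced on a single node with 8 GPUs.)

# World size implied by the runner settings above.
nnodes = 8
nproc_per_node = 8
print("total ranks:", nnodes * nproc_per_node)  # -> 64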

How to Verify the Program Is Running Correctly

After executing bash openseek/baseline/run_exp.sh start, you can follow these steps to confirm that the program is running as expected.

  1. Navigate to the OpenSeek root directory. You will notice that a new folder named OpenSeek-Small-v1-Baseline has been created there. This is the log directory.

  2. You can view the program's logs and error messages by opening OpenSeek-Small-v1-Baseline/logs/host_0_localhost.output with a text editor such as vim:

    vi OpenSeek-Small-v1-Baseline/logs/host_0_localhost.output
    
    W0627 15:58:45.576000 4413 site-packages/torch/distributed/run.py:792] 
    W0627 15:58:45.576000 4413 site-packages/torch/distributed/run.py:792] *****************************************
    W0627 15:58:45.576000 4413 site-packages/torch/distributed/run.py:792] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
    W0627 15:58:45.576000 4413 site-packages/torch/distributed/run.py:792] *****************************************
    [default7]:/data2/workspaces/FlagAI_OpenSeek/FlagScale/third_party/Megatron-LM/megatron/training/arguments.py:819: UserWarning: Disabling sequence parallelism because tensor model parallelism is disabled
    [default7]:  warnings.warn("Disabling sequence parallelism because tensor model parallelism is disabled")
    [default1]:/data2/workspaces/FlagAI_OpenSeek/FlagScale/third_party/Megatron-LM/megatron/training/arguments.py:819: UserWarning: Disabling sequence parallelism because tensor model parallelism is disabled
    [default1]:  warnings.warn("Disabling sequence parallelism because tensor model parallelism is disabled")
    [default3]:/data2/workspaces/FlagAI_OpenSeek/FlagScale/third_party/Megatron-LM/megatron/training/arguments.py:819: UserWarning: Disabling sequence parallelism because tensor model parallelism is disabled
    [default3]:  warnings.warn("Disabling sequence parallelism because tensor model parallelism is disabled")
    [default2]:/data2/workspaces/FlagAI_OpenSeek/FlagScale/third_party/Megatron-LM/megatron/training/arguments.py:819: UserWarning: Disabling sequence parallelism because tensor model parallelism is disabled
    [default2]:  warnings.warn("Disabling sequence parallelism because tensor model parallelism is disabled")
    [default0]:/data2/workspaces/FlagAI_OpenSeek/FlagScale/third_party/Megatron-LM/megatron/training/arguments.py:819: UserWarning: Disabling sequence parallelism because tensor model parallelism is disabled
    [default0]:  warnings.warn("Disabling sequence parallelism because tensor model parallelism is disabled")
    [default6]:/data2/workspaces/FlagAI_OpenSeek/FlagScale/third_party/Megatron-LM/megatron/training/arguments.py:819: UserWarning: Disabling sequence parallelism because tensor model parallelism is disabled
    [default6]:  warnings.warn("Disabling sequence parallelism because tensor model parallelism is disabled")
    [default7]:[2025-06-27 15:58:51,700] [INFO] [real_accelerator.py:254:get_accelerator] Setting ds_accelerator to cuda (auto detect)
    [default3]:[2025-06-27 15:58:52,186] [INFO] [real_accelerator.py:254:get_accelerator] Setting ds_accelerator to cuda (auto detect)
    [default1]:[2025-06-27 15:58:52,148] [INFO] [real_accelerator.py:254:get_accelerator] Setting ds_accelerator to cuda (auto detect)
    [default6]:[2025-06-27 15:58:52,308] [INFO] [real_accelerator.py:254:get_accelerator] Setting ds_accelerator to cuda (auto detect)
    [default2]:[2025-06-27 15:58:52,318] [INFO] [real_accelerator.py:254:get_accelerator] Setting ds_accelerator to cuda (auto detect)
    [default5]:[2025-06-27 15:58:52,392] [INFO] [real_accelerator.py:254:get_accelerator] Setting ds_accelerator to cuda (auto detect)
    [default0]:[2025-06-27 15:58:52,401] [INFO] [real_accelerator.py:254:get_accelerator] Setting ds_accelerator to cuda (auto detect)
    [default4]:[2025-06-27 15:58:52,602] [INFO] [real_accelerator.py:254:get_accelerator] Setting ds_accelerator to cuda (auto detect)
    [default0]:using world size: 8, data-parallel size: 8, context-parallel size: 1, hierarchical context-parallel sizes: Nonetensor-model-parallel size: 1, encoder-tensor-model-parallel size: 0, pipeline-model-parallel size: 1, encoder-pipeline-model-parallel size: 0
    [default0]:WARNING: overriding default arguments for tokenizer_type:GPT2BPETokenizer                        with tokenizer_type:QwenTokenizerFS
    [default0]:Number of virtual stages per pipeline stage: None
    [default0]:using torch.bfloat16 for parameters ...
    [default0]:------------------------ arguments ------------------------
    [default0]:  account_for_embedding_in_pipeline_split ......... False
    [default0]:  account_for_loss_in_pipeline_split .............. False
    [default0]:  accumulate_allreduce_grads_in_fp32 .............. True
    [default0]:  adam_beta1 ...................................... 0.9
    [default0]:  adam_beta2 ...................................... 0.95
    [default0]:  adam_eps ........................................ 1e-08
    [default0]:  add_bias_linear ................................. False
    [default0]:  add_position_embedding .......................... False
    [default0]:  add_qkv_bias .................................... False
    [default0]:  adlr_autoresume ................................. False
    [default0]:  adlr_autoresume_interval ........................ 1000
    [default0]:  align_grad_reduce ............................... True
    [default0]:  align_param_gather .............................. False
    [default0]:  app_tag_run_name ................................ None
    [default0]:  app_tag_run_version ............................. 0.0.0
    [default0]:  apply_layernorm_1p .............................. False
    [default0]:  apply_query_key_layer_scaling ................... False
    [default0]:  apply_residual_connection_post_layernorm ........ False
    [default0]:  apply_rope_fusion ............................... False
    [default0]:  apply_sft_dataset_separated_loss_mask_if_existed  False
    [default0]:  async_save ...................................... None
    [default0]:  async_tensor_model_parallel_allreduce ........... True
    [default0]:  attention_backend ............................... AttnBackend.auto
    [default0]:  attention_dropout ............................... 0.0
    [default0]:  attention_softmax_in_fp32 ....................... True
    [default0]:  auto_detect_ckpt_format ......................... False
    [default0]:  auto_skip_spiky_loss ............................ False
    [default0]:  auto_tune ....................................... False
    [default0]:  barrier_with_L1_time ............................ True
    [default0]:  bert_binary_head ................................ True
    [default0]:  bert_embedder_type .............................. megatron
    [default0]:  bert_load ....................................... None
    [default0]:  bf16 ............................................ True
    [default0]:  bias_dropout_fusion ............................. True
    [default0]:  bias_gelu_fusion ................................ False
    [default0]:  bias_swiglu_fusion .............................. True
    [default0]:  biencoder_projection_dim ........................ 0
    [default0]:  biencoder_shared_query_context_model ............ False
    [default0]:  block_data_path ................................. None
    [default0]:  calc_ft_timeouts ................................ False
    [default0]:  calculate_per_token_loss ........................ False
    [default0]:  check_for_large_grads ........................... False
    [default0]:  check_for_nan_in_loss_and_grad .................. True
    [default0]:  check_for_spiky_loss ............................ False
    [default0]:  check_weight_hash_across_dp_replicas_interval ... None
    [default0]:  ckpt_assume_constant_structure .................. False
    [default0]:  ckpt_convert_format ............................. None
    [default0]:  ckpt_convert_save ............................... None
    [default0]:  ckpt_convert_update_legacy_dist_opt_format ...... False
    [default0]:  ckpt_format ..................................... torch
    [default0]:  ckpt_fully_parallel_load ........................ False
    [default0]:  ckpt_fully_parallel_save ........................ True
    [default0]:  ckpt_fully_parallel_save_deprecated ............. False
    [default0]:  ckpt_step ....................................... None
    [default0]:  classes_fraction ................................ 1.0
    [default0]:  clip_grad ....................................... 1.0
    [default0]:  clone_scatter_output_in_embedding ............... True
    [default0]:  config_logger_dir ............................... 
    [default0]:  consumed_train_samples .......................... 0
    [default0]:  consumed_valid_samples .......................... 0
    [default0]:  context_parallel_size ........................... 1
    [default0]:  cp_comm_type .................................... ['p2p']
    [default0]:  create_attention_mask_in_dataloader ............. True
    [default0]:  cross_entropy_fusion_impl ....................... native
    [default0]:  cross_entropy_loss_fusion ....................... False
    [default0]:  cuda_graph_scope ................................ full
    [default0]:  cuda_graph_warmup_steps ......................... 3
    [default0]:  data_args_path .................................. None
    [default0]:  data_cache_path ................................. None
    [default0]:  data_parallel_random_init ....................... False
    [default0]:  data_parallel_sharding_strategy ................. no_shard
    [default0]:  data_parallel_size .............................. 8
    [default0]:  data_path ....................................... ['1.1068', '../OpenSeek-Pretrain-100B/Nemotron-CC-high-actual-actual-high/part_142_text_document', '0.3577', '../OpenSeek-Pretrain-100B/Nemotron-CC-high-actual-actual-low/part_62_text_document', '0.7775', '../OpenSeek-Pretrain-100B/Nemotron-CC-high-actual-actual-mid/part_189_text_document', '0.2859', '../OpenSeek-Pretrain-100B/Nemotron-CC-high-synthetic-distill-high/part_76_text_document', '0.1672', '../OpenSeek-Pretrain-100B/Nemotron-CC-high-synthetic-distill-low/part_124_text_document', '0.2339', '../OpenSeek-Pretrain-100B/Nemotron-CC-high-synthetic-distill-mid/part_29_text_document', '0.5397', '../OpenSeek-Pretrain-100B/Nemotron-CC-high-synthetic-diverse_qa_pairs-high/part_244_text_document', '0.4064', '../OpenSeek-Pretrain-100B/Nemotron-CC-high-synthetic-diverse_qa_pairs-low/part_150_text_document', '0.5005', '../OpenSeek-Pretrain-100B/Nemotron-CC-high-synthetic-diverse_qa_pairs-mid/part_444_text_document', '0.4616', '../OpenSeek-Pretrain-100B/Nemotron-CC-high-synthetic-extract_knowledge-high/part_498_text_document', '0.067', '../OpenSeek-Pretrain-100B/Nemotron-CC-high-synthetic-extract_knowledge-low/part_10_text_document', '0.3429', '../OpenSeek-Pretrain-100B/Nemotron-CC-high-synthetic-extract_knowledge-mid/part_144_text_document', '0.261', '../OpenSeek-Pretrain-100B/Nemotron-CC-high-synthetic-knowledge_list-high/part_86_text_document', '0.1824', '../OpenSeek-Pretrain-100B/Nemotron-CC-high-synthetic-knowledge_list-low/part_133_text_document', '0.2313', '../OpenSeek-Pretrain-100B/Nemotron-CC-high-synthetic-knowledge_list-mid/part_139_text_document', '0.8237', '../OpenSeek-Pretrain-100B/Nemotron-CC-high-synthetic-wrap_medium-high/part_47_text_document', '0.2866', '../OpenSeek-Pretrain-100B/Nemotron-CC-high-synthetic-wrap_medium-low/part_11_text_document', '0.667', '../OpenSeek-Pretrain-100B/Nemotron-CC-high-synthetic-wrap_medium-mid/part_97_text_document', '0.4657', '../OpenSeek-Pretrain-100B/Nemotron-CC-low-synthetic-wrap_medium-high/part_43_text_document', '0.2005', '../OpenSeek-Pretrain-100B/Nemotron-CC-low-synthetic-wrap_medium-low/part_10_text_document', '0.4317', '../OpenSeek-Pretrain-100B/Nemotron-CC-low-synthetic-wrap_medium-mid/part_164_text_document', '1.1397', '../OpenSeek-Pretrain-100B/Nemotron-CC-medium-actual-actual-high/part_92_text_document', '0.6782', '../OpenSeek-Pretrain-100B/Nemotron-CC-medium-actual-actual-low/part_113_text_document', '0.9175', '../OpenSeek-Pretrain-100B/Nemotron-CC-medium-actual-actual-mid/part_563_text_document', '0.6414', '../OpenSeek-Pretrain-100B/arxiv/007_00000_text_document', '0.4696', '../OpenSeek-Pretrain-100B/books/016_00007_text_document', '1.0102', '../OpenSeek-Pretrain-100B/code-high/part_13_text_document', '1.1403', '../OpenSeek-Pretrain-100B/code-low/part_36_text_document', '0.9674', '../OpenSeek-Pretrain-100B/code-mid/part_37_text_document', '0.3755', '../OpenSeek-Pretrain-100B/cot_synthesis2_CC-high/23_text_document', '0.0499', '../OpenSeek-Pretrain-100B/cot_synthesis2_CC-low/51_text_document', '0.3608', '../OpenSeek-Pretrain-100B/cot_synthesis2_CC-mid/118_text_document', '0.3623', '../OpenSeek-Pretrain-100B/cot_synthesis2_CC-mid/176_text_document', '0.3704', '../OpenSeek-Pretrain-100B/cot_synthesis2_CC-mid/256_text_document', '0.3733', '../OpenSeek-Pretrain-100B/cot_synthesis2_CC-mid/320_text_document', '0.3631', '../OpenSeek-Pretrain-100B/cot_synthesis2_CC-mid/32_text_document', '0.2573', '../OpenSeek-Pretrain-100B/cot_synthesis2_OpenSource-high/1_text_document', 
'0.1638', '../OpenSeek-Pretrain-100B/cot_synthesis2_OpenSource-low/2_text_document', '0.3251', '../OpenSeek-Pretrain-100B/cot_synthesis2_OpenSource-mid/3_text_document', '6.0237', '../OpenSeek-Pretrain-100B/cot_synthesis2_arxiv-high/2_text_document', '8.9063', '../OpenSeek-Pretrain-100B/cot_synthesis2_arxiv-low/1_text_document', '10.1376', '../OpenSeek-Pretrain-100B/cot_synthesis2_arxiv-mid/2_text_document', '0.4598', '../OpenSeek-Pretrain-100B/cot_synthesis2_code-high/4_text_document', '0.6857', '../OpenSeek-Pretrain-100B/cot_synthesis2_code-low/6_text_document', '0.899', '../OpenSeek-Pretrain-100B/cot_synthesis2_code-mid/23_text_document', '1.3135', '../OpenSeek-Pretrain-100B/cot_synthesis2_math-high/12_text_document', '1.653', '../OpenSeek-Pretrain-100B/cot_synthesis2_math-low/3_text_document', '0.3536', '../OpenSeek-Pretrain-100B/cot_synthesis2_math-mid/5_text_document', '0.6314', '../OpenSeek-Pretrain-100B/cot_synthesis2_wiki-high/5_text_document', '0.5978', '../OpenSeek-Pretrain-100B/cot_synthesis2_wiki-low/5_text_document', '0.7909', '../OpenSeek-Pretrain-100B/cot_synthesis2_wiki-mid/4_text_document', '0.2225', '../OpenSeek-Pretrain-100B/cot_synthesis_CC-high/74_text_document', '0.1797', '../OpenSeek-Pretrain-100B/cot_synthesis_CC-low/54_text_document', '0.2042', '../OpenSeek-Pretrain-100B/cot_synthesis_CC-mid/275_text_document', '0.4081', '../OpenSeek-Pretrain-100B/cot_synthesis_OpenSource-high/4_text_document', '0.1659', '../OpenSeek-Pretrain-100B/cot_synthesis_OpenSource-low/2_text_document', '1.2828', '../OpenSeek-Pretrain-100B/cot_synthesis_OpenSource-mid/6_text_document', '5.68', '../OpenSeek-Pretrain-100B/cot_synthesis_arxiv-high/2_text_document', '7.4907', '../OpenSeek-Pretrain-100B/cot_synthesis_arxiv-low/1_text_document', '8.9359', '../OpenSeek-Pretrain-100B/cot_synthesis_arxiv-mid/1_text_document', '0.7663', '../OpenSeek-Pretrain-100B/cot_synthesis_code-high/13_text_document', '0.4052', '../OpenSeek-Pretrain-100B/cot_synthesis_code-low/9_text_document', '0.1916', '../OpenSeek-Pretrain-100B/cot_synthesis_code-mid/6_text_document', '0.5074', '../OpenSeek-Pretrain-100B/cot_synthesis_math-high/11_text_document', '0.6437', '../OpenSeek-Pretrain-100B/cot_synthesis_math-low/11_text_document', '0.6406', '../OpenSeek-Pretrain-100B/cot_synthesis_math-mid/29_text_document', '0.4', '../OpenSeek-Pretrain-100B/cot_synthesis_wiki-high/4_text_document', '0.3564', '../OpenSeek-Pretrain-100B/cot_synthesis_wiki-low/6_text_document', '0.5768', '../OpenSeek-Pretrain-100B/cot_synthesis_wiki-mid/3_text_document', '1.8165', '../OpenSeek-Pretrain-100B/math-high/part_04_text_document', '1.694', '../OpenSeek-Pretrain-100B/math-low/part_10_text_document', '1.6311', '../OpenSeek-Pretrain-100B/math-mid/part_07_text_document', '0.687', '../OpenSeek-Pretrain-100B/pes2o-full-train/train-0041-of-0136_text_document', '0.7387', '../OpenSeek-Pretrain-100B/pes2o-full-train/train-0125-of-0136_text_document', '0.0143', '../OpenSeek-Pretrain-100B/pes2o-full-val/valid-0034-of-0060_text_document', '6.1982', '../OpenSeek-Pretrain-100B/pes2o/pubmedcentral_3_text_document', '0.4229', '../OpenSeek-Pretrain-100B/stack/018_00000_text_document', '0.4202', '../OpenSeek-Pretrain-100B/wiki/012_00000_text_document', '1.8171', '../OpenSeek-Pretrain-100B/zh_cc-high-loss0/part_28_text_document', '0.9776', '../OpenSeek-Pretrain-100B/zh_cc-high-loss1/part_59_text_document', '0.3725', '../OpenSeek-Pretrain-100B/zh_cc-high-loss2/part_16_text_document', '0.9492', 
'../OpenSeek-Pretrain-100B/zh_cc-medidum-loss0/part_192_text_document', '0.9236', '../OpenSeek-Pretrain-100B/zh_cc-medidum-loss1/part_550_text_document', '1.0643', '../OpenSeek-Pretrain-100B/zh_cc-medidum-loss2/part_71_text_document']
    [default0]:  data_per_class_fraction ......................... 1.0
    [default0]:  data_sharding ................................... True
    [default0]:  dataloader_type ................................. single
    [default0]:  ddp_average_in_collective ....................... False
    [default0]:  ddp_bucket_size ................................. None
    [default0]:  ddp_num_buckets ................................. None
    [default0]:  ddp_pad_buckets_for_high_nccl_busbw ............. False
    [default0]:  decoder_first_pipeline_num_layers ............... None
    [default0]:  decoder_last_pipeline_num_layers ................ None
    [default0]:  decoder_num_layers .............................. None
    [default0]:  decoder_seq_length .............................. None
    [default0]:  decoupled_lr .................................... None
    [default0]:  decoupled_min_lr ................................ None
    [default0]:  decrease_batch_size_if_needed ................... False
    [default0]:  defer_embedding_wgrad_compute ................... False
    [default0]:  deprecated_use_mcore_models ..................... False
    [default0]:  deterministic_mode .............................. False
    [default0]:  dino_bottleneck_size ............................ 256
    [default0]:  dino_freeze_last_layer .......................... 1
    [default0]:  dino_head_hidden_size ........................... 2048
    [default0]:  dino_local_crops_number ......................... 10
    [default0]:  dino_local_img_size ............................. 96
    [default0]:  dino_norm_last_layer ............................ False
    [default0]:  dino_teacher_temp ............................... 0.07
    [default0]:  dino_warmup_teacher_temp ........................ 0.04
    [default0]:  dino_warmup_teacher_temp_epochs ................. 30
    [default0]:  disable_bf16_reduced_precision_matmul ........... False
    [default0]:  disable_mamba_mem_eff_path ...................... False
    [default0]:  disable_straggler_on_startup .................... False
    [default0]:  dist_ckpt_format_deprecated ..................... None
    [default0]:  dist_ckpt_strictness ............................ assume_ok_unexpected
    [default0]:  distribute_saved_activations .................... False
    [default0]:  distributed_backend ............................. nccl
    [default0]:  distributed_timeout_minutes ..................... 10
    [default0]:  embedding_path .................................. None
    [default0]:  empty_unused_memory_level ....................... 0
    [default0]:  enable_cuda_graph ............................... False
    [default0]:  enable_ft_package ............................... False
    [default0]:  enable_gloo_process_groups ...................... True
    [default0]:  enable_hetero ................................... False
    [default0]:  enable_msc ...................................... True
    [default0]:  enable_one_logger ............................... True
    [default0]:  encoder_num_layers .............................. 6
    [default0]:  encoder_pipeline_model_parallel_size ............ 0
    [default0]:  encoder_seq_length .............................. 4096
    [default0]:  encoder_tensor_model_parallel_size .............. 0
    [default0]:  end_weight_decay ................................ 0.1
    [default0]:  eod_mask_loss ................................... False
    [default0]:  error_injection_rate ............................ 0
    [default0]:  error_injection_type ............................ transient_error
    [default0]:  eval_interval ................................... 1000
    [default0]:  eval_iters ...................................... 0
    [default0]:  evidence_data_path .............................. None
    [default0]:  exit_duration_in_mins ........................... None
    [default0]:  exit_interval ................................... None
    [default0]:  exit_on_missing_checkpoint ...................... False
    [default0]:  exit_signal_handler ............................. False
    [default0]:  exp_avg_dtype ................................... torch.float32
    [default0]:  exp_avg_sq_dtype ................................ torch.float32
    [default0]:  expert_model_parallel_size ...................... 1
    [default0]:  expert_tensor_parallel_size ..................... 1
    [default0]:  expert_tensor_parallel_size_per_process_mesh .... None
    [default0]:  external_cuda_graph ............................. False
    [default0]:  extra_eval_interval ............................. None
    [default0]:  extra_valid_data_path ........................... None
    [default0]:  ffn_hidden_size ................................. 7168
    [default0]:  finetune ........................................ False
    [default0]:  finetune_dataset_type ........................... None
    [default0]:  first_last_layers_bf16 .......................... False
    [default0]:  flash_decode .................................... False
    [default0]:  fp16 ............................................ False
    [default0]:  fp16_lm_cross_entropy ........................... False
    [default0]:  fp32_residual_connection ........................ False
    [default0]:  fp8 ............................................. None
    [default0]:  fp8_amax_compute_algo ........................... most_recent
    [default0]:  fp8_amax_history_len ............................ 1
    [default0]:  fp8_interval .................................... 1
    [default0]:  fp8_margin ...................................... 0
    [default0]:  fp8_param_gather ................................ False
    [default0]:  fp8_recipe ...................................... delayed
    [default0]:  fp8_wgrad ....................................... True
    [default0]:  global_batch_size ............................... 1024
    [default0]:  grad_reduce_in_bf16 ............................. False
    [default0]:  gradient_accumulation_fusion .................... True
    [default0]:  gradient_reduce_div_fusion ...................... True
    [default0]:  group_query_attention ........................... False
    [default0]:  head_lr_mult .................................... 1.0
    [default0]:  hetero_current_device_type ...................... None
    [default0]:  hetero_device_types ............................. None
    [default0]:  hetero_pipeline_layer_split ..................... None
    [default0]:  hetero_process_meshes ........................... None
    [default0]:  hetero_use_cpu_communication .................... False
    [default0]:  heterogeneous_layers_config_encoded_json ........ None
    [default0]:  heterogeneous_layers_config_path ................ None
    [default0]:  hidden_dim_multiplier ........................... None
    [default0]:  hidden_dropout .................................. 0.0
    [default0]:  hidden_size ..................................... 1280
    [default0]:  hierarchical_context_parallel_sizes ............. None
    [default0]:  hybrid_attention_ratio .......................... 0.0
    [default0]:  hybrid_mlp_ratio ................................ 0.0
    [default0]:  hybrid_override_pattern ......................... None
    [default0]:  hysteresis ...................................... 2
    [default0]:  ict_head_size ................................... None
    [default0]:  ict_load ........................................ None
    [default0]:  img_h ........................................... 224
    [default0]:  img_w ........................................... 224
    [default0]:  indexer_batch_size .............................. 128
    [default0]:  indexer_log_interval ............................ 1000
    [default0]:  inference_batch_times_seqlen_threshold .......... -1
    [default0]:  inference_dynamic_batching ...................... False
    [default0]:  inference_dynamic_batching_buffer_guaranteed_fraction  0.2
    [default0]:  inference_dynamic_batching_buffer_overflow_factor  None
    [default0]:  inference_dynamic_batching_buffer_size_gb ....... 40.0
    [default0]:  inference_dynamic_batching_chunk_size ........... 256
    [default0]:  inference_dynamic_batching_max_requests_override  None
    [default0]:  inference_dynamic_batching_max_tokens_override .. None
    [default0]:  inference_max_batch_size ........................ 8
    [default0]:  inference_max_seq_length ........................ 2560
    [default0]:  inference_rng_tracker ........................... False
    [default0]:  init_method_std ................................. 0.006
    [default0]:  init_method_xavier_uniform ...................... False
    [default0]:  init_model_with_meta_device ..................... False
    [default0]:  initial_loss_scale .............................. 4294967296
    [default0]:  is_hybrid_model ................................. False
    [default0]:  iter_per_epoch .................................. 1250
    [default0]:  iterations_to_skip .............................. []
    [default0]:  keep_fp8_transpose_cache_when_using_custom_fsdp . False
    [default0]:  kv_channels ..................................... 128
    [default0]:  kv_lora_rank .................................... 512
    [default0]:  lazy_mpu_init ................................... None
    [default0]:  load ............................................ /data2/workspaces/FlagAI_OpenSeek/OpenSeek-Small-v1-Baseline/checkpoints
    [default0]:  local_rank ...................................... 0
    [default0]:  log_interval .................................... 1
    [default0]:  log_loss_scale_to_tensorboard ................... True
    [default0]:  log_memory_to_tensorboard ....................... True
    [default0]:  log_num_zeros_in_grad ........................... True
    [default0]:  log_params_norm ................................. True
    [default0]:  log_progress .................................... False
    [default0]:  log_straggler ................................... False
    [default0]:  log_throughput .................................. True
    [default0]:  log_timers_to_tensorboard ....................... True
    [default0]:  log_validation_ppl_to_tensorboard ............... True
    [default0]:  log_world_size_to_tensorboard ................... False
    [default0]:  logging_level ................................... None
    [default0]:  loss_scale ...................................... None
    [default0]:  loss_scale_window ............................... 1000
    [default0]:  lr .............................................. 0.003
    [default0]:  lr_decay_iters .................................. None
    [default0]:  lr_decay_samples ................................ None
    [default0]:  lr_decay_stablelm2_alpha ........................ 1.0
    [default0]:  lr_decay_stablelm2_beta ......................... 0.0
    [default0]:  lr_decay_stablelm2_cosine_max_lr ................ None
    [default0]:  lr_decay_stablelm2_cosine_period_samples ........ 0
    [default0]:  lr_decay_stablelm2_cosine_samples ............... 0
    [default0]:  lr_decay_stablelm2_decay_samples ................ 0
    [default0]:  lr_decay_stablelm2_rsqrt_samples ................ 0
    [default0]:  lr_decay_style .................................. cosine
    [default0]:  lr_warmup_fraction .............................. None
    [default0]:  lr_warmup_init .................................. 0.0
    [default0]:  lr_warmup_iters ................................. 0
    [default0]:  lr_warmup_samples ............................... 2048000
    [default0]:  lr_wsd_decay_iters .............................. None
    [default0]:  lr_wsd_decay_samples ............................ None
    [default0]:  lr_wsd_decay_style .............................. exponential
    [default0]:  main_grads_dtype ................................ torch.float32
    [default0]:  main_params_dtype ............................... torch.float32
    [default0]:  make_vocab_size_divisible_by .................... 64
    [default0]:  mamba_head_dim .................................. 64
    [default0]:  mamba_num_groups ................................ 8
    [default0]:  mamba_num_heads ................................. None
    [default0]:  mamba_state_dim ................................. 128
    [default0]:  manual_gc ....................................... False
    [default0]:  manual_gc_eval .................................. True
    [default0]:  manual_gc_interval .............................. 0
    [default0]:  mask_factor ..................................... 1.0
    [default0]:  mask_prob ....................................... 0.15
    [default0]:  mask_type ....................................... random
    [default0]:  masked_softmax_fusion ........................... True
    [default0]:  max_position_embeddings ......................... 4096
    [default0]:  max_tokens_to_oom ............................... 12000
    [default0]:  memory_snapshot_path ............................ snapshot.pickle
    [default0]:  merge_file ...................................... None
    [default0]:  micro_batch_size ................................ 1
    [default0]:  microbatch_group_size_per_vp_stage .............. None
    [default0]:  mid_level_dataset_surplus ....................... 0.005
    [default0]:  min_loss_scale .................................. 1.0
    [default0]:  min_lr .......................................... 0.0003
    [default0]:  mlp_chunks_for_prefill .......................... 1
    [default0]:  mmap_bin_files .................................. False
    [default0]:  mock_data ....................................... False
    [default0]:  moe_aux_loss_coeff .............................. 0.0001
    [default0]:  moe_enable_deepep ............................... False
    [default0]:  moe_expert_capacity_factor ...................... None
    [default0]:  moe_extended_tp ................................. False
    [default0]:  moe_ffn_hidden_size ............................. 896
    [default0]:  moe_grouped_gemm ................................ True
    [default0]:  moe_input_jitter_eps ............................ None
    [default0]:  moe_layer_freq .................................. [0, 1, 1, 1, 1, 1]
    [default0]:  moe_layer_recompute ............................. False
    [default0]:  moe_pad_expert_input_to_capacity ................ False
    [default0]:  moe_per_layer_logging ........................... False
    [default0]:  moe_permute_fusion .............................. False
    [default0]:  moe_router_bias_update_rate ..................... 0.001
    [default0]:  moe_router_dtype ................................ fp32
    [default0]:  moe_router_enable_expert_bias ................... True
    [default0]:  moe_router_group_topk ........................... 1
    [default0]:  moe_router_load_balancing_type .................. seq_aux_loss
    [default0]:  moe_router_num_groups ........................... 1
    [default0]:  moe_router_pre_softmax .......................... False
    [default0]:  moe_router_score_function ....................... sigmoid
    [default0]:  moe_router_topk ................................. 6
    [default0]:  moe_router_topk_scaling_factor .................. 2.446
    [default0]:  moe_shared_expert_intermediate_size ............. 1792
    [default0]:  moe_shared_expert_overlap ....................... False
    [default0]:  moe_token_dispatcher_type ....................... alltoall
    [default0]:  moe_token_drop_policy ........................... probs
    [default0]:  moe_use_legacy_grouped_gemm ..................... False
    [default0]:  moe_use_upcycling ............................... False
    [default0]:  moe_z_loss_coeff ................................ None
    [default0]:  mrope_section ................................... None
    [default0]:  mscale .......................................... 1.0
    [default0]:  mscale_all_dim .................................. 1.0
    [default0]:  mtp_loss_coeff .................................. 0.3
    [default0]:  mtp_loss_scaling_factor ......................... 0.1
    [default0]:  mtp_num_layers .................................. None
    [default0]:  multi_latent_attention .......................... True
    [default0]:  multiple_of ..................................... None
    [default0]:  nccl_all_reduce_for_prefill ..................... False
    [default0]:  nccl_communicator_config_path ................... None
    [default0]:  no_load_optim ................................... None
    [default0]:  no_load_rng ..................................... None
    [default0]:  no_persist_layer_norm ........................... False
    [default0]:  no_rope_freq .................................... None
    [default0]:  no_save_optim ................................... None
    [default0]:  no_save_rng ..................................... None
    [default0]:  no_shared_fs .................................... False
    [default0]:  non_persistent_ckpt_type ........................ None
    [default0]:  non_persistent_global_ckpt_dir .................. None
    [default0]:  non_persistent_local_ckpt_algo .................. fully_parallel
    [default0]:  non_persistent_local_ckpt_dir ................... None
    [default0]:  non_persistent_save_interval .................... None
    [default0]:  norm_epsilon .................................... 1e-06
    [default0]:  norm_init_weight ................................ None
    [default0]:  normalization ................................... RMSNorm
    [default0]:  num_attention_heads ............................. 10
    [default0]:  num_channels .................................... 3
    [default0]:  num_classes ..................................... 1000
    [default0]:  num_dataset_builder_threads ..................... 1
    [default0]:  num_distributed_optimizer_instances ............. 1
    [default0]:  num_experts ..................................... 64
    [default0]:  num_layers ...................................... 6
    [default0]:  num_layers_at_end_in_bf16 ....................... 1
    [default0]:  num_layers_at_start_in_bf16 ..................... 1
    [default0]:  num_layers_per_virtual_pipeline_stage ........... None
    [default0]:  num_mtp_predictor ............................... 0
    [default0]:  num_query_groups ................................ 10
    [default0]:  num_virtual_stages_per_pipeline_rank ............ None
    [default0]:  num_workers ..................................... 4
    [default0]:  object_storage_cache_path ....................... None
    [default0]:  one_logger_async ................................ False
    [default0]:  one_logger_project .............................. megatron-lm
    [default0]:  one_logger_run_name ............................. None
    [default0]:  onnx_safe ....................................... None
    [default0]:  openai_gelu ..................................... False
    [default0]:  optimizer ....................................... adam
    [default0]:  optimizer_cpu_offload ........................... False
    [default0]:  optimizer_offload_fraction ...................... 1.0
    [default0]:  output_bert_embeddings .......................... False
    [default0]:  overlap_cpu_optimizer_d2h_h2d ................... False
    [default0]:  overlap_grad_reduce ............................. True
    [default0]:  overlap_p2p_comm ................................ False
    [default0]:  overlap_p2p_comm_warmup_flush ................... False
    [default0]:  overlap_param_gather ............................ True
    [default0]:  overlap_param_gather_with_optimizer_step ........ False
    [default0]:  override_opt_param_scheduler .................... False
    [default0]:  params_dtype .................................... torch.bfloat16
    [default0]:  patch_dim ....................................... 16
    [default0]:  per_split_data_args_path ........................ None
    [default0]:  perform_initialization .......................... True
    [default0]:  pin_cpu_grads ................................... True
    [default0]:  pin_cpu_params .................................. True
    [default0]:  pipeline_model_parallel_comm_backend ............ None
    [default0]:  pipeline_model_parallel_size .................... 1
    [default0]:  pipeline_model_parallel_split_rank .............. None
    [default0]:  position_embedding_type ......................... rope
    [default0]:  pretrained_checkpoint ........................... None
    [default0]:  profile ......................................... False
    [default0]:  profile_ranks ................................... [0]
    [default0]:  profile_step_end ................................ 12
    [default0]:  profile_step_start .............................. 10
    [default0]:  q_lora_rank ..................................... None
    [default0]:  qk_head_dim ..................................... 128
    [default0]:  qk_layernorm .................................... True
    [default0]:  qk_layernorm_hidden_dim ......................... False
    [default0]:  qk_pos_emb_head_dim ............................. 64
    [default0]:  query_in_block_prob ............................. 0.1
    [default0]:  rampup_batch_size ............................... None
    [default0]:  rampup_save_interval ............................ None
    [default0]:  rank ............................................ 0
    [default0]:  recompute_granularity ........................... full
    [default0]:  recompute_granularity_per_stage_micro_batch ..... None
    [default0]:  recompute_method ................................ uniform
    [default0]:  recompute_method_per_stage_micro_batch .......... None
    [default0]:  recompute_modules ............................... None
    [default0]:  recompute_num_layers ............................ 6
    [default0]:  recompute_num_layers_per_stage_micro_batch ...... None
    [default0]:  record_memory_history ........................... False
    [default0]:  relative_attention_max_distance ................. 128
    [default0]:  relative_attention_num_buckets .................. 32
    [default0]:  replication ..................................... False
    [default0]:  replication_factor .............................. 2
    [default0]:  replication_jump ................................ None
    [default0]:  rerun_mode ...................................... disabled
    [default0]:  reset_attention_mask ............................ True
    [default0]:  reset_position_ids .............................. True
    [default0]:  result_rejected_tracker_filename ................ None
    [default0]:  retriever_report_topk_accuracies ................ []
    [default0]:  retriever_score_scaling ......................... False
    [default0]:  retriever_seq_length ............................ 256
    [default0]:  retro_add_retriever ............................. False
    [default0]:  retro_attention_gate ............................ 1
    [default0]:  retro_cyclic_train_iters ........................ None
    [default0]:  retro_encoder_attention_dropout ................. 0.1
    [default0]:  retro_encoder_hidden_dropout .................... 0.1
    [default0]:  retro_encoder_layers ............................ 2
    [default0]:  retro_num_neighbors ............................. 2
    [default0]:  retro_num_retrieved_chunks ...................... 2
    [default0]:  retro_project_dir ............................... None
    [default0]:  retro_verify_neighbor_count ..................... True
    [default0]:  rope_scaling_factor ............................. 8.0
    [default0]:  rotary_base ..................................... 1000000
    [default0]:  rotary_interleaved .............................. False
    [default0]:  rotary_percent .................................. 1.0
    [default0]:  rotary_scaling_factor ........................... 1.0
    [default0]:  rotary_seq_len_interpolation_factor ............. None
    [default0]:  run_workload_inspector_server ................... False
    [default0]:  sample_rate ..................................... 1.0
    [default0]:  save ............................................ /data2/workspaces/FlagAI_OpenSeek/OpenSeek-Small-v1-Baseline/checkpoints
    [default0]:  save_interval ................................... 600
    [default0]:  save_when_num_microbatches_change ............... False
    [default0]:  scatter_gather_tensors_in_pipeline .............. True
    [default0]:  seed ............................................ 42
    [default0]:  seq_length ...................................... 4096
    [default0]:  sequence_parallel ............................... False
    [default0]:  sgd_momentum .................................... 0.9
    [default0]:  short_seq_prob .................................. 0.1
    [default0]:  skip_iters_range ................................ None
    [default0]:  skip_samples_range .............................. None
    [default0]:  skip_train ...................................... False
    [default0]:  skipped_train_samples ........................... 0
    [default0]:  spec ............................................ None
    [default0]:  special_tokens_file ............................. None
    [default0]:  spiky_loss_threshold ............................ 0.2
    [default0]:  split ........................................... 1
    [default0]:  squared_relu .................................... False
    [default0]:  standalone_embedding_stage ...................... False
    [default0]:  start_weight_decay .............................. 0.1
    [default0]:  straggler_ctrlr_port ............................ 65535
    [default0]:  straggler_minmax_count .......................... 1
    [default0]:  suggested_communication_unit_size ............... None
    [default0]:  swiglu .......................................... True
    [default5]:/data2/workspaces/FlagAI_OpenSeek/FlagScale/third_party/Megatron-LM/megatron/training/arguments.py:819: UserWarning: Disabling sequence parallelism because tensor model parallelism is disabled
    [default5]:  warnings.warn("Disabling sequence parallelism because tensor model parallelism is disabled")
    [default4]:/data2/workspaces/FlagAI_OpenSeek/FlagScale/third_party/Megatron-LM/megatron/training/arguments.py:819: UserWarning: Disabling sequence parallelism because tensor model parallelism is disabled
    [default4]:  warnings.warn("Disabling sequence parallelism because tensor model parallelism is disabled")
    [default1]:[rank1]:[W627 15:58:54.941583063 ProcessGroupNCCL.cpp:4561] [PG ID 0 PG GUID 0 Rank 1]  using GPU 1 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. Specify device_ids in barrier() to force use of a particular device, or call init_process_group() with a device_id.
    [default2]:[rank2]:[W627 15:58:54.982814058 ProcessGroupNCCL.cpp:4561] [PG ID 0 PG GUID 0 Rank 2]  using GPU 2 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. Specify device_ids in barrier() to force use of a particular device, or call init_process_group() with a device_id.
    [default4]:[rank4]:[W627 15:58:54.017065317 ProcessGroupNCCL.cpp:4561] [PG ID 0 PG GUID 0 Rank 4]  using GPU 4 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. Specify device_ids in barrier() to force use of a particular device, or call init_process_group() with a device_id.
    [default5]:[rank5]:[W627 15:58:54.981277710 ProcessGroupNCCL.cpp:4561] [PG ID 0 PG GUID 0 Rank 5]  using GPU 5 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. Specify device_ids in barrier() to force use of a particular device, or call init_process_group() with a device_id.
    [default3]:[rank3]:[W627 15:58:54.049300540 ProcessGroupNCCL.cpp:4561] [PG ID 0 PG GUID 0 Rank 3]  using GPU 3 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. Specify device_ids in barrier() to force use of a particular device, or call init_process_group() with a device_id.
    [default6]:[rank6]:[W627 15:58:54.050272654 ProcessGroupNCCL.cpp:4561] [PG ID 0 PG GUID 0 Rank 6]  using GPU 6 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. Specify device_ids in barrier() to force use of a particular device, or call init_process_group() with a device_id.
    [default7]:wandb: WARNING `resume` will be ignored since W&B syncing is set to `offline`. Starting a new run with run id OpenSeek-Small-v1-Baseline-rank-7.
    [default0]:wandb: WARNING `resume` will be ignored since W&B syncing is set to `offline`. Starting a new run with run id OpenSeek-Small-v1-Baseline-rank-0.
    [default7]:wandb: Tracking run with wandb version 0.19.11
    [default7]:wandb: W&B syncing is set to `offline` in this directory. Run `wandb online` or set WANDB_MODE=online to enable cloud syncing.
    [default7]:[rank7]:[W627 15:58:55.174225256 ProcessGroupNCCL.cpp:4561] [PG ID 0 PG GUID 0 Rank 7]  using GPU 7 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. Specify device_ids in barrier() to force use of a particular device, or call init_process_group() with a device_id.
    [default0]:wandb: Tracking run with wandb version 0.19.11
    [default0]:wandb: W&B syncing is set to `offline` in this directory. Run `wandb online` or set WANDB_MODE=online to enable cloud syncing.
    [default0]:[rank0]:[W627 15:58:55.391530592 ProcessGroupNCCL.cpp:4561] [PG ID 0 PG GUID 0 Rank 0]  using GPU 0 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. Specify device_ids in barrier() to force use of a particular device, or call init_process_group() with a device_id.
    [default0]:  swin_backbone_type .............................. tiny
    [default0]:  symmetric_ar_type ............................... None
    [default0]:  te_rng_tracker .................................. False
    [default0]:  tensor_model_parallel_size ...................... 1
    [default0]:  tensorboard_dir ................................. /data2/workspaces/FlagAI_OpenSeek/OpenSeek-Small-v1-Baseline/tensorboard
    [default0]:  tensorboard_log_interval ........................ 1
    [default0]:  tensorboard_queue_size .......................... 1000
    [default0]:  test_data_path .................................. None
    [default0]:  test_mode ....................................... False
    [default0]:  tiktoken_num_special_tokens ..................... 1000
    [default0]:  tiktoken_pattern ................................ None
    [default0]:  tiktoken_special_tokens ......................... None
    [default0]:  timing_log_level ................................ 0
    [default0]:  timing_log_option ............................... minmax
    [default0]:  titles_data_path ................................ None
    [default0]:  tokenizer_model ................................. None
    [default0]:  tokenizer_path .................................. ../hf_openseek/tokenizer
    [default0]:  tokenizer_type .................................. QwenTokenizerFS
    [default0]:  torch_fsdp2_reshard_after_forward ............... True
    [default0]:  tp_comm_bootstrap_backend ....................... nccl
    [default0]:  tp_comm_bulk_dgrad .............................. True
    [default0]:  tp_comm_bulk_wgrad .............................. True
    [default0]:  tp_comm_overlap ................................. False
    [default0]:  tp_comm_overlap_ag .............................. True
    [default0]:  tp_comm_overlap_cfg ............................. None
    [default0]:  tp_comm_overlap_rs .............................. True
    [default0]:  tp_comm_overlap_rs_dgrad ........................ False
    [default0]:  tp_comm_split_ag ................................ True
    [default0]:  tp_comm_split_rs ................................ True
    [default0]:  train_data_path ................................. None
    [default0]:  train_iters ..................................... None
    [default0]:  train_samples ................................... 24576000
    [default0]:  train_sync_interval ............................. None
    [default0]:  transformer_impl ................................ transformer_engine
    [default0]:  transformer_pipeline_model_parallel_size ........ 1
    [default0]:  untie_embeddings_and_output_weights ............. False
    [default0]:  use_checkpoint_args ............................. False
    [default0]:  use_checkpoint_opt_param_scheduler .............. False
    [default0]:  use_cpu_initialization .......................... None
    [default0]:  use_custom_fsdp ................................. False
    [default0]:  use_dist_ckpt ................................... False
    [default0]:  use_dist_ckpt_deprecated ........................ False
    [default0]:  use_distributed_optimizer ....................... True
    [default0]:  use_flash_attn .................................. False
    [default0]:  use_legacy_models ............................... False
    [default0]:  use_mp_args_from_checkpoint_args ................ False
    [default0]:  use_one_sent_docs ............................... False
    [default0]:  use_partial_reduce_for_shared_embedding ......... False
    [default0]:  use_persistent_ckpt_worker ...................... False
    [default0]:  use_precision_aware_optimizer ................... False
    [default0]:  use_pytorch_profiler ............................ False
    [default0]:  use_ring_exchange_p2p ........................... False
    [default0]:  use_rope_scaling ................................ False
    [default0]:  use_rotary_position_embeddings .................. True
    [default0]:  use_tokenizer_model_from_checkpoint_args ........ True
    [default0]:  use_torch_fsdp2 ................................. False
    [default0]:  use_torch_optimizer_for_cpu_offload ............. False
    [default0]:  use_tp_pp_dp_mapping ............................ False
    [default0]:  v_head_dim ...................................... 128
    [default0]:  valid_data_path ................................. None
    [default0]:  variable_seq_lengths ............................ False
    [default0]:  virtual_pipeline_model_parallel_size ............ None
    [default0]:  vision_backbone_type ............................ vit
    [default0]:  vision_pretraining .............................. False
    [default0]:  vision_pretraining_type ......................... classify
    [default0]:  vocab_extra_ids ................................. 0
    [default0]:  vocab_file ...................................... None
    [default0]:  vocab_size ...................................... 151851
    [default0]:  wandb_api_key ................................... 
    [default0]:  wandb_exp_name .................................. OpenSeek-Small-v1-Baseline
    [default0]:  wandb_log_model ................................. False
    [default0]:  wandb_log_model_interval ........................ 1000
    [default0]:  wandb_mode ...................................... offline
    [default0]:  wandb_project ................................... OpenSeek-Small-v1-Baseline
    [default0]:  wandb_save_dir .................................. /data2/workspaces/FlagAI_OpenSeek/OpenSeek-Small-v1-Baseline/wandb
    [default0]:  weight_decay .................................... 0.1
    [default0]:  weight_decay_incr_style ......................... constant
    [default0]:  wgrad_deferral_limit ............................ 0
    [default0]:  world_size ...................................... 8
    [default0]:  yaml_cfg ........................................ None
    [default0]:-------------------- end of arguments ---------------------
    [default0]:INFO:megatron.core.num_microbatches_calculator:setting number of microbatches to constant 128
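
The `setting number of microbatches to constant 128` line follows from the batch-size arguments earlier in the dump (they appear above this excerpt). Below is a minimal sketch of the relation Megatron uses; the batch sizes are illustrative assumptions chosen to reproduce the printed value, not the run's actual config:

```python
# Sketch of Megatron's constant num-microbatches calculation.
# world_size, TP and PP sizes are read from the log above; the batch sizes
# are ASSUMPTIONS chosen to reproduce the printed value of 128.
world_size = 8
tensor_parallel = pipeline_parallel = 1
data_parallel = world_size // (tensor_parallel * pipeline_parallel)  # = 8

global_batch_size = 1024   # assumed, not shown in this excerpt
micro_batch_size = 1       # assumed, not shown in this excerpt

num_microbatches = global_batch_size // (micro_batch_size * data_parallel)
print(num_microbatches)    # -> 128, matching the log
```

With TP = PP = 1, all 8 GPUs are pure data-parallel replicas, so each rank accumulates gradients over 128 microbatches per optimizer step.
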
    [default0]:> building QwenTokenizerFS tokenizer ...
    [default0]: > padded vocab (size: 151851) with 21 dummy tokens (new size: 151872)
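
The 21 dummy tokens come from Megatron-style vocab padding: the padded size must be a multiple of `make-vocab-size-divisible-by × tensor-model-parallel-size`. The numbers here (151851 → 151872 with TP = 1) are consistent with a divisor of 64; Megatron's stock default is 128, so treat the divisor below as inferred from the log rather than read from the config:

```python
# Sketch of Megatron-style vocab padding (mirrors the logic of
# _vocab_size_with_padding). The divisor of 64 is INFERRED from the log:
# 151872 is the next multiple of 64 above 151851, not a multiple of 128.
def padded_vocab_size(orig_size: int, divisible_by: int = 64,
                      tensor_parallel_size: int = 1) -> int:
    multiple = divisible_by * tensor_parallel_size
    padded = orig_size
    while padded % multiple != 0:
        padded += 1
    return padded

print(padded_vocab_size(151851))  # -> 151872, i.e. 21 dummy tokens
```
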
    [default0]:WARNING:megatron.core.rerun_state_machine:RerunStateMachine initialized in mode RerunMode.DISABLED
    [default0]:> initializing torch distributed ...
    [default0]:> initialized tensor model parallel with size 1
    [default0]:> initialized pipeline model parallel with size 1
    [default0]:> setting random seeds to 42 ...
    [default7]:> setting tensorboard ...
    [default7]:WARNING: one_logger package is required to enable e2e metrics tracking. please go to https://confluence.nvidia.com/display/MLWFO/Package+Repositories for details to install it
    [default0]:> compiling dataset index builder ...
    [default0]:make: Entering directory '/data2/workspaces/FlagAI_OpenSeek/FlagScale/third_party/Megatron-LM/megatron/core/datasets'
    [default0]:make: Nothing to be done for 'default'.
    [default0]:make: Leaving directory '/data2/workspaces/FlagAI_OpenSeek/FlagScale/third_party/Megatron-LM/megatron/core/datasets'
    [default0]:>>> done with dataset index builder. Compilation time: 0.029 seconds
    [default0]:WARNING: constraints for invoking optimized fused softmax kernel are not met. We default back to unfused kernel invocations.
    [default0]:> compiling and loading fused kernels ...
    [default0]:>>> done with compiling and loading fused kernels. Compilation time: 1.923 seconds
    [default3]:> Number of attn parameters in a transformer block in billions:  0.01
    [default3]:> Number of dense mlp parameters in a transformer block in billions:  0.03
    [default3]:> Number of sparse mlp parameters in a transformer block in billions:  0.23
    [default3]:> Number of parameters in transformer block in billions:  1.20
    [default3]:> Number of parameters in mtp transformer block in billions:  0.00
    [default3]:> Number of parameters in embedding layers in billions: 0.19
    [default3]:> Total number of parameters in billions: 1.39
    [default3]:> Number of parameters in most loaded shard in billions: 1.3946
    [default3]:> Number of activated attn parameters in a transformer block in billions:  0.01
    [default3]:> Number of activated dense mlp parameters in a transformer block in billions:  0.03
    [default3]:> Number of activated sparse mlp parameters in a transformer block in billions:  0.03
    [default3]:> Number of activated parameters in transformer block in billions:  0.20
    [default1]:/root/miniconda3/envs/flagscale-train/lib/python3.12/site-packages/transformer_engine/pytorch/cpu_offload.py:593: DeprecationWarning: Offloading weights is deprecated. Using offload_weights=True does not have any effect.
    [default1]:  warnings.warn(
    (every rank prints the same DeprecationWarning; the other seven copies are omitted)
    [default3]:> Number of activated parameters in mtp transformer block in billions:  0.00
    [default3]:> Number of activated parameters in embedding layers in billions: 0.19
    [default3]:> Total number of activated parameters in billions: 0.40
    [default3]:> Activation memory footprint per dense transformer layer: 276.0 MB
    [default3]:> Activation memory footprint per moe transformer layer: 346.0 MB
    [default3]:>>> [FS] Theoretical memory footprints: weight and optimizer=9975.26 MB, activation=2416.03 MB, total=12391.29 MB
    [default3]:
    (ranks 0-2 and 4-7 print the same parameter summary; duplicates omitted)
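
The per-rank summary is easy to sanity-check: the totals are just sums of the printed components, and the theoretical weight-and-optimizer footprint follows from the shard size once the distributed optimizer's sharding of Adam state across the 8 data-parallel ranks is accounted for. The byte accounting below (2 B bf16 weight + 4 B fp32 grad held per rank, plus 12 B/param of fp32 master copy and Adam moments sharded over DP) is inferred from the numbers matching, not traced through FlagScale's code:

```python
# Sanity checks on the parameter summary (all inputs read off the log).
block, mtp, embed = 1.20, 0.00, 0.19        # total params, billions
act_block = 0.20                            # activated params in blocks
print(block + mtp + embed)                  # -> 1.39 B total
print(act_block + mtp + embed)              # -> ~0.40 B activated (MoE)

# Theoretical weight + optimizer footprint of the most loaded shard.
# ASSUMED accounting: 2 B bf16 param + 4 B fp32 grad per rank, plus
# 12 B/param of fp32 master copy + Adam moments sharded over dp=8.
params = 1_394_625_792                      # printed later in the log
bytes_per_param = 2 + 4 + 12 / 8            # = 7.5
print(params * bytes_per_param / 1024**2)   # -> ~9975 MB, as logged
```

The 0.03 B activated out of 0.23 B sparse-MLP parameters (about 1/8) is consistent with a small top-k over the 64 routed experts enumerated in the bucket lists below, though the exact router setting is not visible in this excerpt.
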
    [default0]:time to initialize megatron (seconds): 9.606
    [default0]:[after megatron is initialized] datetime: 2025-06-27 15:59:01 
    [default0]:building GPT model ...
    [default1]:/data2/workspaces/FlagAI_OpenSeek/FlagScale/third_party/Megatron-LM/megatron/core/extensions/transformer_engine.py:153: UserWarning: The user buffer name q_proj is not supported inTransformer Engine. Disabling TP communication overlap for this layer.
    [default1]:  warnings.warn(
    [default1]:/data2/workspaces/FlagAI_OpenSeek/FlagScale/third_party/Megatron-LM/megatron/core/extensions/transformer_engine.py:153: UserWarning: The user buffer name kv_down_proj is not supported inTransformer Engine. Disabling TP communication overlap for this layer.
    [default1]:  warnings.warn(
    (every rank prints the same q_proj / kv_down_proj warning pair; duplicates omitted)
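
These UserWarnings are benign here: the argument dump above shows `tp_comm_overlap False` and a tensor-model-parallel size of 1, so disabling TP communication overlap for the MLA-style q_proj/kv_down_proj projections changes nothing in this run.
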
    [default0]: > number of parameters on (tensor, pipeline) model parallel rank (0, 0): 1394625792
    [default0]:INFO:megatron.core.distributed.distributed_data_parallel:Setting up DistributedDataParallel with config DistributedDataParallelConfig(grad_reduce_in_fp32=True, overlap_grad_reduce=True, overlap_param_gather=True, align_param_gather=False, use_distributed_optimizer=True, num_distributed_optimizer_instances=1, check_for_nan_in_grad=True, check_for_large_grads=False, bucket_size=40000000, pad_buckets_for_high_nccl_busbw=False, average_in_collective=False, fp8_param_gather=False, use_custom_fsdp=False, data_parallel_sharding_strategy='no_shard', gradient_reduce_div_fusion=True, suggested_communication_unit_size=None, preserve_fp32_weights=True, keep_fp8_transpose_cache_when_using_custom_fsdp=False)
    [default0]:INFO:megatron.core.distributed.param_and_grad_buffer:Number of buckets for gradient all-reduce / reduce-scatter: 30
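
The DDP config above requests `bucket_size=40000000`, and the buffer then packs parameters into gradient buckets of at least that many elements, in roughly the order their gradients become ready (reverse of construction, which is why layer 5 appears first). A minimal sketch of that greedy packing, assuming plain `(name, numel)` pairs rather than Megatron's actual buffer classes:

```python
# Hedged sketch of greedy gradient-bucket packing; this mirrors the idea
# behind megatron.core's param_and_grad_buffer, not its implementation.
from typing import List, Tuple

def pack_buckets(params: List[Tuple[str, int]],
                 bucket_size: int = 40_000_000) -> List[List[str]]:
    buckets, current, numel = [], [], 0
    for name, n in params:
        current.append(name)
        numel += n
        if numel >= bucket_size:      # close the bucket once it is full
            buckets.append(current)
            current, numel = [], 0
    if current:                       # remainder forms the last bucket
        buckets.append(current)
    return buckets

# Bucket 2 below holds 35 expert fc2 weights totalling 40,140,800 elements,
# so each expert fc2 weight has 40_140_800 / 35 = 1_146_880 elements.
experts = [(f"decoder.layers.5.mlp.experts.linear_fc2.weight{i}", 1_146_880)
           for i in range(35)]
print([len(b) for b in pack_buckets(experts)])   # -> [35], one full bucket
```
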
    [default0]:Params for bucket 1 (40142080 elements, 40142080 padded size):
    [default0]:	module.decoder.layers.5.mlp.shared_experts.linear_fc1.weight
    [default0]:	module.decoder.layers.5.mlp.experts.linear_fc2.weight61
    [default0]:	module.decoder.layers.5.mlp.experts.linear_fc2.weight53
    [default0]:	module.decoder.layers.5.mlp.experts.linear_fc2.weight45
    [default0]:	module.decoder.layers.5.mlp.experts.linear_fc2.weight37
    [default0]:	module.decoder.layers.5.mlp.experts.linear_fc2.weight58
    [default0]:	module.decoder.layers.5.mlp.experts.linear_fc2.weight50
    [default0]:	module.decoder.layers.5.mlp.experts.linear_fc2.weight42
    [default0]:	module.decoder.layers.5.mlp.experts.linear_fc2.weight63
    [default0]:	module.decoder.layers.5.mlp.experts.linear_fc2.weight55
    [default0]:	module.decoder.layers.5.mlp.experts.linear_fc2.weight47
    [default0]:	module.decoder.layers.5.mlp.experts.linear_fc2.weight39
    [default0]:	module.decoder.layers.5.mlp.experts.linear_fc2.weight60
    [default0]:	module.decoder.layers.5.mlp.experts.linear_fc2.weight52
    [default0]:	module.decoder.layers.5.mlp.experts.linear_fc2.weight44
    [default0]:	module.decoder.layers.5.mlp.experts.linear_fc2.weight36
    [default0]:	module.decoder.layers.5.mlp.shared_experts.linear_fc2.weight
    [default0]:	module.decoder.layers.5.mlp.experts.linear_fc2.weight57
    [default0]:	module.decoder.layers.5.mlp.experts.linear_fc2.weight49
    [default0]:	module.decoder.layers.5.mlp.experts.linear_fc2.weight41
    [default0]:	module.decoder.layers.5.mlp.experts.linear_fc2.weight62
    [default0]:	module.decoder.layers.5.mlp.experts.linear_fc2.weight54
    [default0]:	module.decoder.layers.5.mlp.experts.linear_fc2.weight46
    [default0]:	module.decoder.layers.5.mlp.experts.linear_fc2.weight38
    [default0]:	module.decoder.final_layernorm.weight
    [default0]:	module.decoder.layers.5.mlp.experts.linear_fc2.weight59
    [default0]:	module.decoder.layers.5.mlp.experts.linear_fc2.weight51
    [default0]:	module.decoder.layers.5.mlp.experts.linear_fc2.weight43
    [default0]:	module.decoder.layers.5.mlp.experts.linear_fc2.weight35
    [default0]:	module.decoder.layers.5.mlp.experts.linear_fc2.weight56
    [default0]:	module.decoder.layers.5.mlp.experts.linear_fc2.weight48
    [default0]:	module.decoder.layers.5.mlp.experts.linear_fc2.weight40
    (Params for buckets 2-18 omitted: each packs ~40-41M elements, sweeping the remaining expert linear_fc1/linear_fc2 weights plus the shared-expert, router, attention-projection, and layernorm weights of layers 5 down to 2 in reverse order)
    [default0]:Params for bucket 19 (40140800 elements, 40140800 padded size):
    [default0]:	module.decoder.layers.2.mlp.experts.linear_fc2.weight33
    [default0]:	module.decoder.layers.2.mlp.experts.linear_fc2.weight25
    [default0]:	module.decoder.layers.2.mlp.experts.linear_fc2.weight17
    [default0]:	module.decoder.layers.2.mlp.experts.linear_fc2.weight9
    [default0]:	module.decoder.layers.2.mlp.experts.linear_fc2.weight38
    [default0]:	module.decoder.layers.2.mlp.experts.linear_fc2.weight30
    [default0]:	module.decoder.layers.2.mlp.experts.linear_fc2.weight22
    [default0]:	module.decoder.layers.2.mlp.experts.linear_fc2.weight14
    [default0]:	module.decoder.layers.2.mlp.experts.linear_fc2.weight6
    [default0]:	module.decoder.layers.2.mlp.experts.linear_fc2.weight35
    [default0]:	module.decoder.layers.2.mlp.experts.linear_fc2.weight27
    [default0]:	module.decoder.layers.2.mlp.experts.linear_fc2.weight19
    [default0]:	module.decoder.layers.2.mlp.experts.linear_fc2.weight11
    [default0]:	module.decoder.layers.2.mlp.experts.linear_fc2.weight32
    [default0]:	module.decoder.layers.2.mlp.experts.linear_fc2.weight24
    [default0]:	module.decoder.layers.2.mlp.experts.linear_fc2.weight16
    [default0]:	module.decoder.layers.2.mlp.experts.linear_fc2.weight8
    [default0]:	module.decoder.layers.2.mlp.experts.linear_fc2.weight37
    [default0]:	module.decoder.layers.2.mlp.experts.linear_fc2.weight29
    [default0]:	module.decoder.layers.2.mlp.experts.linear_fc2.weight21
    [default0]:	module.decoder.layers.2.mlp.experts.linear_fc2.weight13
    [default0]:	module.decoder.layers.2.mlp.experts.linear_fc2.weight5
    [default0]:	module.decoder.layers.2.mlp.experts.linear_fc2.weight34
    [default0]:	module.decoder.layers.2.mlp.experts.linear_fc2.weight26
    [default0]:	module.decoder.layers.2.mlp.experts.linear_fc2.weight18
    [default0]:	module.decoder.layers.2.mlp.experts.linear_fc2.weight10
    [default0]:	module.decoder.layers.2.mlp.experts.linear_fc2.weight39
    [default0]:	module.decoder.layers.2.mlp.experts.linear_fc2.weight31
    [default0]:	module.decoder.layers.2.mlp.experts.linear_fc2.weight23
    [default0]:	module.decoder.layers.2.mlp.experts.linear_fc2.weight15
    [default0]:	module.decoder.layers.2.mlp.experts.linear_fc2.weight7
    [default0]:	module.decoder.layers.2.mlp.experts.linear_fc2.weight36
    [default0]:	module.decoder.layers.2.mlp.experts.linear_fc2.weight28
    [default0]:	module.decoder.layers.2.mlp.experts.linear_fc2.weight20
    [default0]:	module.decoder.layers.2.mlp.experts.linear_fc2.weight12
    [default0]:Params for bucket 20 (40140800 elements, 40140800 padded size):
    [default0]:	module.decoder.layers.2.mlp.experts.linear_fc2.weight1
    [default0]:	module.decoder.layers.2.mlp.experts.linear_fc2.weight0
    [default0]:	module.decoder.layers.2.mlp.experts.linear_fc1.weight62
    [default0]:	module.decoder.layers.2.mlp.experts.linear_fc2.weight3
    [default0]:	module.decoder.layers.2.mlp.experts.linear_fc1.weight63
    [default0]:	module.decoder.layers.2.mlp.experts.linear_fc1.weight49
    [default0]:	module.decoder.layers.2.mlp.experts.linear_fc2.weight2
    [default0]:	module.decoder.layers.2.mlp.experts.linear_fc1.weight55
    [default0]:	module.decoder.layers.2.mlp.experts.linear_fc1.weight61
    [default0]:	module.decoder.layers.2.mlp.experts.linear_fc1.weight60
    [default0]:	module.decoder.layers.2.mlp.experts.linear_fc1.weight59
    [default0]:	module.decoder.layers.2.mlp.experts.linear_fc1.weight58
    [default0]:	module.decoder.layers.2.mlp.experts.linear_fc1.weight57
    [default0]:	module.decoder.layers.2.mlp.experts.linear_fc1.weight56
    [default0]:	module.decoder.layers.2.mlp.experts.linear_fc1.weight53
    [default0]:	module.decoder.layers.2.mlp.experts.linear_fc1.weight50
    [default0]:	module.decoder.layers.2.mlp.experts.linear_fc1.weight54
    [default0]:	module.decoder.layers.2.mlp.experts.linear_fc1.weight52
    [default0]:	module.decoder.layers.2.mlp.experts.linear_fc1.weight51
    [default0]:	module.decoder.layers.2.mlp.experts.linear_fc2.weight4
    [default0]:Params for bucket 21 (41287680 elements, 41287680 padded size):
    [default0]:	module.decoder.layers.2.mlp.experts.linear_fc1.weight48
    [default0]:	module.decoder.layers.2.mlp.experts.linear_fc1.weight44
    [default0]:	module.decoder.layers.2.mlp.experts.linear_fc1.weight46
    [default0]:	module.decoder.layers.2.mlp.experts.linear_fc1.weight47
    [default0]:	module.decoder.layers.2.mlp.experts.linear_fc1.weight45
    [default0]:	module.decoder.layers.2.mlp.experts.linear_fc1.weight43
    [default0]:	module.decoder.layers.2.mlp.experts.linear_fc1.weight42
    [default0]:	module.decoder.layers.2.mlp.experts.linear_fc1.weight41
    [default0]:	module.decoder.layers.2.mlp.experts.linear_fc1.weight40
    [default0]:	module.decoder.layers.2.mlp.experts.linear_fc1.weight39
    [default0]:	module.decoder.layers.2.mlp.experts.linear_fc1.weight37
    [default0]:	module.decoder.layers.2.mlp.experts.linear_fc1.weight35
    [default0]:	module.decoder.layers.2.mlp.experts.linear_fc1.weight34
    [default0]:	module.decoder.layers.2.mlp.experts.linear_fc1.weight33
    [default0]:	module.decoder.layers.2.mlp.experts.linear_fc1.weight38
    [default0]:	module.decoder.layers.2.mlp.experts.linear_fc1.weight32
    [default0]:	module.decoder.layers.2.mlp.experts.linear_fc1.weight36
    [default0]:	module.decoder.layers.2.mlp.experts.linear_fc1.weight31
    [default0]:Params for bucket 22 (41287680 elements, 41287680 padded size):
    [default0]:	module.decoder.layers.2.mlp.experts.linear_fc1.weight27
    [default0]:	module.decoder.layers.2.mlp.experts.linear_fc1.weight26
    [default0]:	module.decoder.layers.2.mlp.experts.linear_fc1.weight25
    [default0]:	module.decoder.layers.2.mlp.experts.linear_fc1.weight24
    [default0]:	module.decoder.layers.2.mlp.experts.linear_fc1.weight22
    [default0]:	module.decoder.layers.2.mlp.experts.linear_fc1.weight16
    [default0]:	module.decoder.layers.2.mlp.experts.linear_fc1.weight14
    [default0]:	module.decoder.layers.2.mlp.experts.linear_fc1.weight13
    [default0]:	module.decoder.layers.2.mlp.experts.linear_fc1.weight30
    [default0]:	module.decoder.layers.2.mlp.experts.linear_fc1.weight29
    [default0]:	module.decoder.layers.2.mlp.experts.linear_fc1.weight28
    [default0]:	module.decoder.layers.2.mlp.experts.linear_fc1.weight23
    [default0]:	module.decoder.layers.2.mlp.experts.linear_fc1.weight21
    [default0]:	module.decoder.layers.2.mlp.experts.linear_fc1.weight20
    [default0]:	module.decoder.layers.2.mlp.experts.linear_fc1.weight19
    [default0]:	module.decoder.layers.2.mlp.experts.linear_fc1.weight18
    [default0]:	module.decoder.layers.2.mlp.experts.linear_fc1.weight17
    [default0]:	module.decoder.layers.2.mlp.experts.linear_fc1.weight15
    [default0]:Params for bucket 23 (42929152 elements, 42929152 padded size):
    [default0]:	module.decoder.layers.2.mlp.experts.linear_fc1.weight4
    [default0]:	module.decoder.layers.2.mlp.experts.linear_fc1.weight3
    [default0]:	module.decoder.layers.1.mlp.shared_experts.linear_fc2.weight
    [default0]:	module.decoder.layers.2.mlp.experts.linear_fc1.weight12
    [default0]:	module.decoder.layers.2.mlp.experts.linear_fc1.weight11
    [default0]:	module.decoder.layers.2.mlp.router.weight
    [default0]:	module.decoder.layers.2.self_attention.linear_kv_up_proj.weight
    [default0]:	module.decoder.layers.2.mlp.experts.linear_fc1.weight5
    [default0]:	module.decoder.layers.2.pre_mlp_layernorm.weight
    [default0]:	module.decoder.layers.2.self_attention.linear_kv_up_proj.layer_norm_weight
    [default0]:	module.decoder.layers.2.self_attention.linear_kv_down_proj.weight
    [default0]:	module.decoder.layers.2.mlp.experts.linear_fc1.weight9
    [default0]:	module.decoder.layers.2.mlp.experts.linear_fc1.weight6
    [default0]:	module.decoder.layers.2.mlp.experts.linear_fc1.weight0
    [default0]:	module.decoder.layers.2.self_attention.linear_q_proj.weight
    [default0]:	module.decoder.layers.2.self_attention.linear_proj.weight
    [default0]:	module.decoder.layers.2.input_layernorm.weight
    [default0]:	module.decoder.layers.2.mlp.experts.linear_fc1.weight7
    [default0]:	module.decoder.layers.2.mlp.experts.linear_fc1.weight2
    [default0]:	module.decoder.layers.1.mlp.shared_experts.linear_fc1.weight
    [default0]:	module.decoder.layers.2.mlp.experts.linear_fc1.weight8
    [default0]:	module.decoder.layers.2.mlp.experts.linear_fc1.weight1
    [default0]:	module.decoder.layers.2.mlp.experts.linear_fc1.weight10
    [default0]:Params for bucket 24 (40140800 elements, 40140800 padded size):
    [default0]:	module.decoder.layers.1.mlp.experts.linear_fc2.weight62
    [default0]:	module.decoder.layers.1.mlp.experts.linear_fc2.weight60
    [default0]:	module.decoder.layers.1.mlp.experts.linear_fc2.weight55
    [default0]:	module.decoder.layers.1.mlp.experts.linear_fc2.weight50
    [default0]:	module.decoder.layers.1.mlp.experts.linear_fc2.weight38
    [default0]:	module.decoder.layers.1.mlp.experts.linear_fc2.weight32
    [default0]:	module.decoder.layers.1.mlp.experts.linear_fc2.weight63
    [default0]:	module.decoder.layers.1.mlp.experts.linear_fc2.weight54
    [default0]:	module.decoder.layers.1.mlp.experts.linear_fc2.weight37
    [default0]:	module.decoder.layers.1.mlp.experts.linear_fc2.weight34
    [default0]:	module.decoder.layers.1.mlp.experts.linear_fc2.weight41
    [default0]:	module.decoder.layers.1.mlp.experts.linear_fc2.weight33
    [default0]:	module.decoder.layers.1.mlp.experts.linear_fc2.weight31
    [default0]:	module.decoder.layers.1.mlp.experts.linear_fc2.weight57
    [default0]:	module.decoder.layers.1.mlp.experts.linear_fc2.weight51
    [default0]:	module.decoder.layers.1.mlp.experts.linear_fc2.weight49
    [default0]:	module.decoder.layers.1.mlp.experts.linear_fc2.weight46
    [default0]:	module.decoder.layers.1.mlp.experts.linear_fc2.weight58
    [default0]:	module.decoder.layers.1.mlp.experts.linear_fc2.weight40
    [default0]:	module.decoder.layers.1.mlp.experts.linear_fc2.weight36
    [default0]:	module.decoder.layers.1.mlp.experts.linear_fc2.weight30
    [default0]:	module.decoder.layers.1.mlp.experts.linear_fc2.weight59
    [default0]:	module.decoder.layers.1.mlp.experts.linear_fc2.weight53
    [default0]:	module.decoder.layers.1.mlp.experts.linear_fc2.weight43
    [default0]:	module.decoder.layers.1.mlp.experts.linear_fc2.weight29
    [default0]:	module.decoder.layers.1.mlp.experts.linear_fc2.weight56
    [default0]:	module.decoder.layers.1.mlp.experts.linear_fc2.weight52
    [default0]:	module.decoder.layers.1.mlp.experts.linear_fc2.weight44
    [default0]:	module.decoder.layers.1.mlp.experts.linear_fc2.weight39
    [default0]:	module.decoder.layers.1.mlp.experts.linear_fc2.weight61
    [default0]:	module.decoder.layers.1.mlp.experts.linear_fc2.weight48
    [default0]:	module.decoder.layers.1.mlp.experts.linear_fc2.weight47
    [default0]:	module.decoder.layers.1.mlp.experts.linear_fc2.weight45
    [default0]:	module.decoder.layers.1.mlp.experts.linear_fc2.weight42
    [default0]:	module.decoder.layers.1.mlp.experts.linear_fc2.weight35
    [default0]:Params for bucket 25 (40140800 elements, 40140800 padded size):
    [default0]:	module.decoder.layers.1.mlp.experts.linear_fc2.weight26
    [default0]:	module.decoder.layers.1.mlp.experts.linear_fc2.weight9
    [default0]:	module.decoder.layers.1.mlp.experts.linear_fc2.weight8
    [default0]:	module.decoder.layers.1.mlp.experts.linear_fc1.weight61
    [default0]:	module.decoder.layers.1.mlp.experts.linear_fc2.weight23
    [default0]:	module.decoder.layers.1.mlp.experts.linear_fc2.weight16
    [default0]:	module.decoder.layers.1.mlp.experts.linear_fc2.weight14
    [default0]:	module.decoder.layers.1.mlp.experts.linear_fc2.weight2
    [default0]:	module.decoder.layers.1.mlp.experts.linear_fc1.weight63
    [default0]:	module.decoder.layers.1.mlp.experts.linear_fc2.weight25
    [default0]:	module.decoder.layers.1.mlp.experts.linear_fc2.weight13
    [default0]:	module.decoder.layers.1.mlp.experts.linear_fc2.weight4
    [default0]:	module.decoder.layers.1.mlp.experts.linear_fc2.weight28
    [default0]:	module.decoder.layers.1.mlp.experts.linear_fc2.weight24
    [default0]:	module.decoder.layers.1.mlp.experts.linear_fc2.weight20
    [default0]:	module.decoder.layers.1.mlp.experts.linear_fc2.weight11
    [default0]:	module.decoder.layers.1.mlp.experts.linear_fc2.weight19
    [default0]:	module.decoder.layers.1.mlp.experts.linear_fc2.weight10
    [default0]:	module.decoder.layers.1.mlp.experts.linear_fc2.weight1
    [default0]:	module.decoder.layers.1.mlp.experts.linear_fc2.weight21
    [default0]:	module.decoder.layers.1.mlp.experts.linear_fc2.weight18
    [default0]:	module.decoder.layers.1.mlp.experts.linear_fc2.weight7
    [default0]:	module.decoder.layers.1.mlp.experts.linear_fc2.weight3
    [default0]:	module.decoder.layers.1.mlp.experts.linear_fc2.weight0
    [default0]:	module.decoder.layers.1.mlp.experts.linear_fc2.weight27
    [default0]:	module.decoder.layers.1.mlp.experts.linear_fc2.weight22
    [default0]:	module.decoder.layers.1.mlp.experts.linear_fc2.weight17
    [default0]:	module.decoder.layers.1.mlp.experts.linear_fc2.weight5
    [default0]:	module.decoder.layers.1.mlp.experts.linear_fc2.weight15
    [default0]:	module.decoder.layers.1.mlp.experts.linear_fc2.weight12
    [default0]:	module.decoder.layers.1.mlp.experts.linear_fc2.weight6
    [default0]:	module.decoder.layers.1.mlp.experts.linear_fc1.weight62
    [default0]:Params for bucket 26 (41287680 elements, 41287680 padded size):
    [default0]:	module.decoder.layers.1.mlp.experts.linear_fc1.weight60
    [default0]:	module.decoder.layers.1.mlp.experts.linear_fc1.weight57
    [default0]:	module.decoder.layers.1.mlp.experts.linear_fc1.weight59
    [default0]:	module.decoder.layers.1.mlp.experts.linear_fc1.weight58
    [default0]:	module.decoder.layers.1.mlp.experts.linear_fc1.weight56
    [default0]:	module.decoder.layers.1.mlp.experts.linear_fc1.weight55
    [default0]:	module.decoder.layers.1.mlp.experts.linear_fc1.weight52
    [default0]:	module.decoder.layers.1.mlp.experts.linear_fc1.weight51
    [default0]:	module.decoder.layers.1.mlp.experts.linear_fc1.weight47
    [default0]:	module.decoder.layers.1.mlp.experts.linear_fc1.weight46
    [default0]:	module.decoder.layers.1.mlp.experts.linear_fc1.weight54
    [default0]:	module.decoder.layers.1.mlp.experts.linear_fc1.weight53
    [default0]:	module.decoder.layers.1.mlp.experts.linear_fc1.weight50
    [default0]:	module.decoder.layers.1.mlp.experts.linear_fc1.weight49
    [default0]:	module.decoder.layers.1.mlp.experts.linear_fc1.weight48
    [default0]:	module.decoder.layers.1.mlp.experts.linear_fc1.weight45
    [default0]:	module.decoder.layers.1.mlp.experts.linear_fc1.weight44
    [default0]:	module.decoder.layers.1.mlp.experts.linear_fc1.weight43
    [default0]:Params for bucket 27 (41287680 elements, 41287680 padded size):
    [default0]:	module.decoder.layers.1.mlp.experts.linear_fc1.weight42
    [default0]:	module.decoder.layers.1.mlp.experts.linear_fc1.weight41
    [default0]:	module.decoder.layers.1.mlp.experts.linear_fc1.weight38
    [default0]:	module.decoder.layers.1.mlp.experts.linear_fc1.weight39
    [default0]:	module.decoder.layers.1.mlp.experts.linear_fc1.weight37
    [default0]:	module.decoder.layers.1.mlp.experts.linear_fc1.weight33
    [default0]:	module.decoder.layers.1.mlp.experts.linear_fc1.weight31
    [default0]:	module.decoder.layers.1.mlp.experts.linear_fc1.weight29
    [default0]:	module.decoder.layers.1.mlp.experts.linear_fc1.weight28
    [default0]:	module.decoder.layers.1.mlp.experts.linear_fc1.weight26
    [default0]:	module.decoder.layers.1.mlp.experts.linear_fc1.weight25
    [default0]:	module.decoder.layers.1.mlp.experts.linear_fc1.weight40
    [default0]:	module.decoder.layers.1.mlp.experts.linear_fc1.weight36
    [default0]:	module.decoder.layers.1.mlp.experts.linear_fc1.weight35
    [default0]:	module.decoder.layers.1.mlp.experts.linear_fc1.weight34
    [default0]:	module.decoder.layers.1.mlp.experts.linear_fc1.weight32
    [default0]:	module.decoder.layers.1.mlp.experts.linear_fc1.weight30
    [default0]:	module.decoder.layers.1.mlp.experts.linear_fc1.weight27
    [default0]:Params for bucket 28 (41287680 elements, 41287680 padded size):
    [default0]:	module.decoder.layers.1.mlp.experts.linear_fc1.weight21
    [default0]:	module.decoder.layers.1.mlp.experts.linear_fc1.weight20
    [default0]:	module.decoder.layers.1.mlp.experts.linear_fc1.weight19
    [default0]:	module.decoder.layers.1.mlp.experts.linear_fc1.weight14
    [default0]:	module.decoder.layers.1.mlp.experts.linear_fc1.weight13
    [default0]:	module.decoder.layers.1.mlp.experts.linear_fc1.weight12
    [default0]:	module.decoder.layers.1.mlp.experts.linear_fc1.weight9
    [default0]:	module.decoder.layers.1.mlp.experts.linear_fc1.weight8
    [default0]:	module.decoder.layers.1.mlp.experts.linear_fc1.weight24
    [default0]:	module.decoder.layers.1.mlp.experts.linear_fc1.weight22
    [default0]:	module.decoder.layers.1.mlp.experts.linear_fc1.weight23
    [default0]:	module.decoder.layers.1.mlp.experts.linear_fc1.weight18
    [default0]:	module.decoder.layers.1.mlp.experts.linear_fc1.weight17
    [default0]:	module.decoder.layers.1.mlp.experts.linear_fc1.weight16
    [default0]:	module.decoder.layers.1.mlp.experts.linear_fc1.weight15
    [default0]:	module.decoder.layers.1.mlp.experts.linear_fc1.weight11
    [default0]:	module.decoder.layers.1.mlp.experts.linear_fc1.weight10
    [default0]:	module.decoder.layers.1.mlp.experts.linear_fc1.weight7
    [default0]:Params for bucket 29 (49810432 elements, 49810432 padded size):
    [default0]:	module.decoder.layers.1.mlp.experts.linear_fc1.weight6
    [default0]:	module.decoder.layers.1.mlp.experts.linear_fc1.weight4
    [default0]:	module.decoder.layers.1.mlp.experts.linear_fc1.weight3
    [default0]:	module.decoder.layers.1.mlp.experts.linear_fc1.weight5
    [default0]:	module.decoder.layers.1.mlp.experts.linear_fc1.weight1
    [default0]:	module.decoder.layers.1.mlp.experts.linear_fc1.weight0
    [default0]:	module.decoder.layers.1.mlp.router.weight
    [default0]:	module.decoder.layers.1.self_attention.linear_kv_up_proj.weight
    [default0]:	module.decoder.layers.1.self_attention.linear_proj.weight
    [default0]:	module.decoder.layers.1.mlp.experts.linear_fc1.weight2
    [default0]:	module.decoder.layers.1.pre_mlp_layernorm.weight
    [default0]:	module.decoder.layers.1.self_attention.linear_kv_up_proj.layer_norm_weight
    [default0]:	module.decoder.layers.1.self_attention.linear_kv_down_proj.weight
    [default0]:	module.decoder.layers.1.self_attention.linear_q_proj.weight
    [default0]:	module.decoder.layers.1.input_layernorm.weight
    [default0]:	module.decoder.layers.0.mlp.linear_fc2.weight
    [default0]:	module.decoder.layers.0.mlp.linear_fc1.weight
    [default0]:Params for bucket 30 (200543232 elements, 200543232 padded size):
    [default0]:	module.decoder.layers.0.mlp.linear_fc1.layer_norm_weight
    [default0]:	module.decoder.layers.0.self_attention.linear_q_proj.weight
    [default0]:	module.decoder.layers.0.self_attention.linear_kv_up_proj.weight
    [default0]:	module.embedding.word_embeddings.weight
    [default0]:	module.decoder.layers.0.self_attention.linear_kv_up_proj.layer_norm_weight
    [default0]:	module.decoder.layers.0.self_attention.linear_kv_down_proj.weight
    [default0]:	module.decoder.layers.0.self_attention.linear_proj.weight
    [default0]:	module.decoder.layers.0.input_layernorm.weight
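
The long listing above is printed by Megatron's distributed optimizer: parameters are packed, roughly in reverse registration order, into fixed-size gradient buckets (about 40M elements each here) so that the reduce-scatter of one bucket's gradients can overlap with the backward pass of the next. A minimal sketch of that greedy packing, assuming a simple size threshold (the function and names are illustrative, not Megatron's actual code):

```python
# Illustrative greedy bucketing: pack (name, numel) pairs into buckets of
# roughly `bucket_size` elements, mirroring the "Params for bucket N
# (... elements)" lines in the log above.
def build_buckets(params, bucket_size=40_000_000):
    buckets, names, numel = [], [], 0
    for name, n in params:
        names.append(name)
        numel += n
        if numel >= bucket_size:
            buckets.append((names, numel))
            names, numel = [], 0
    if names:
        buckets.append((names, numel))
    return buckets

print(build_buckets([("w0", 25_000_000), ("w1", 20_000_000), ("w2", 5_000_000)]))
```
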
    [default0]:INFO:megatron.core.optimizer:Setting up optimizer with config OptimizerConfig(optimizer='adam', lr=0.003, min_lr=0.0003, decoupled_lr=None, decoupled_min_lr=None, weight_decay=0.1, fp16=False, bf16=True, params_dtype=torch.bfloat16, use_precision_aware_optimizer=False, store_param_remainders=True, main_grads_dtype=torch.float32, main_params_dtype=torch.float32, exp_avg_dtype=torch.float32, exp_avg_sq_dtype=torch.float32, loss_scale=None, initial_loss_scale=4294967296, min_loss_scale=1.0, loss_scale_window=1000, hysteresis=2, adam_beta1=0.9, adam_beta2=0.95, adam_eps=1e-08, sgd_momentum=0.9, use_distributed_optimizer=True, overlap_param_gather_with_optimizer_step=False, optimizer_cpu_offload=False, optimizer_offload_fraction=1.0, use_torch_optimizer_for_cpu_offload=False, overlap_cpu_optimizer_d2h_h2d=False, pin_cpu_grads=True, pin_cpu_params=True, clip_grad=1.0, log_num_zeros_in_grad=True, barrier_with_L1_time=True, timers=<megatron.core.timers.Timers object at 0x7fbf60962780>, config_logger_dir='')
    [default0]:setting training iterations to 24000
    [default0]:INFO:megatron.core.optimizer_param_scheduler:> learning rate decay style: cosine
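
Per the optimizer config above, the learning rate decays from lr=3e-3 to min_lr=3e-4 with cosine annealing. A sketch of that curve (warmup and Megatron's exact bookkeeping are omitted; this is just the standard cosine formula, not the project's code):

```python
import math

def cosine_lr(step, total_steps=24_000, lr_max=3e-3, lr_min=3e-4):
    # Standard cosine annealing from lr_max down to lr_min over total_steps.
    t = min(step, total_steps) / total_steps
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * t))

print(cosine_lr(0), cosine_lr(12_000), cosine_lr(24_000))  # 3e-3, ~1.65e-3, 3e-4
```
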
    [default0]:WARNING: could not find the metadata file /data2/workspaces/FlagAI_OpenSeek/OpenSeek-Small-v1-Baseline/checkpoints/latest_checkpointed_iteration.txt
    [default0]:    will not load any checkpoints and will start from random
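
This warning is expected on a fresh run: the resume logic looks for `latest_checkpointed_iteration.txt` under the checkpoint directory, and since nothing has been saved yet, training starts from random initialization. A hypothetical check of the same marker file (the file name comes from the log; the helper itself is illustrative):

```python
from pathlib import Path

def latest_iteration(ckpt_dir):
    # Returns the last saved step, or None on a fresh run. (Assumption: the
    # marker may also contain the string "release" for converted checkpoints.)
    marker = Path(ckpt_dir) / "latest_checkpointed_iteration.txt"
    if not marker.exists():
        return None
    text = marker.read_text().strip()
    return None if text == "release" else int(text)
```
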
    [default0]:[after model, optimizer, and learning rate scheduler are built] datetime: 2025-06-27 15:59:02 
    [default0]:> building train, validation, and test datasets ...
    [default0]: > datasets target sizes (minimum size):
    [default0]:    train:      24576000
    [default0]:    validation: 0
    [default0]:    test:       0
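
These target sizes tie the run together: 24,576,000 training samples over the 24,000 iterations set above implies a global batch size of 1024, and at sequence length 4096 that is roughly the full 100B-token dataset in a single pass. A quick sanity check using only numbers from the log:

```python
# Values taken directly from the log; global batch size is derived, not printed.
train_samples = 24_576_000   # "train: 24576000"
train_iters   = 24_000       # "setting training iterations to 24000"
seq_len       = 4_096        # sequence_length in the GPTDatasetConfig below

print(train_samples // train_iters)   # 1024 (global batch size)
print(train_samples * seq_len / 1e9)  # ~100.66 (billion tokens)
```
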
    [default0]:INFO:megatron.core.datasets.blended_megatron_dataset_config:Let split_matrix = [(0, 1.0), None, None]
    [default0]:> building train, validation, and test datasets for GPT ...
    [default0]:INFO:megatron.core.datasets.blended_megatron_dataset_builder:Building GPTDataset splits with sizes=(24576000, 0, 0) and config=GPTDatasetConfig(random_seed=42, sequence_length=4096, blend=(['../OpenSeek-Pretrain-100B/Nemotron-CC-high-actual-actual-high/part_142_text_document', '../OpenSeek-Pretrain-100B/Nemotron-CC-high-actual-actual-low/part_62_text_document', '../OpenSeek-Pretrain-100B/Nemotron-CC-high-actual-actual-mid/part_189_text_document', '../OpenSeek-Pretrain-100B/Nemotron-CC-high-synthetic-distill-high/part_76_text_document', '../OpenSeek-Pretrain-100B/Nemotron-CC-high-synthetic-distill-low/part_124_text_document', '../OpenSeek-Pretrain-100B/Nemotron-CC-high-synthetic-distill-mid/part_29_text_document', '../OpenSeek-Pretrain-100B/Nemotron-CC-high-synthetic-diverse_qa_pairs-high/part_244_text_document', '../OpenSeek-Pretrain-100B/Nemotron-CC-high-synthetic-diverse_qa_pairs-low/part_150_text_document', '../OpenSeek-Pretrain-100B/Nemotron-CC-high-synthetic-diverse_qa_pairs-mid/part_444_text_document', '../OpenSeek-Pretrain-100B/Nemotron-CC-high-synthetic-extract_knowledge-high/part_498_text_document', '../OpenSeek-Pretrain-100B/Nemotron-CC-high-synthetic-extract_knowledge-low/part_10_text_document', '../OpenSeek-Pretrain-100B/Nemotron-CC-high-synthetic-extract_knowledge-mid/part_144_text_document', '../OpenSeek-Pretrain-100B/Nemotron-CC-high-synthetic-knowledge_list-high/part_86_text_document', '../OpenSeek-Pretrain-100B/Nemotron-CC-high-synthetic-knowledge_list-low/part_133_text_document', '../OpenSeek-Pretrain-100B/Nemotron-CC-high-synthetic-knowledge_list-mid/part_139_text_document', '../OpenSeek-Pretrain-100B/Nemotron-CC-high-synthetic-wrap_medium-high/part_47_text_document', '../OpenSeek-Pretrain-100B/Nemotron-CC-high-synthetic-wrap_medium-low/part_11_text_document', '../OpenSeek-Pretrain-100B/Nemotron-CC-high-synthetic-wrap_medium-mid/part_97_text_document', '../OpenSeek-Pretrain-100B/Nemotron-CC-low-synthetic-wrap_medium-high/part_43_text_document', '../OpenSeek-Pretrain-100B/Nemotron-CC-low-synthetic-wrap_medium-low/part_10_text_document', '../OpenSeek-Pretrain-100B/Nemotron-CC-low-synthetic-wrap_medium-mid/part_164_text_document', '../OpenSeek-Pretrain-100B/Nemotron-CC-medium-actual-actual-high/part_92_text_document', '../OpenSeek-Pretrain-100B/Nemotron-CC-medium-actual-actual-low/part_113_text_document', '../OpenSeek-Pretrain-100B/Nemotron-CC-medium-actual-actual-mid/part_563_text_document', '../OpenSeek-Pretrain-100B/arxiv/007_00000_text_document', '../OpenSeek-Pretrain-100B/books/016_00007_text_document', '../OpenSeek-Pretrain-100B/code-high/part_13_text_document', '../OpenSeek-Pretrain-100B/code-low/part_36_text_document', '../OpenSeek-Pretrain-100B/code-mid/part_37_text_document', '../OpenSeek-Pretrain-100B/cot_synthesis2_CC-high/23_text_document', '../OpenSeek-Pretrain-100B/cot_synthesis2_CC-low/51_text_document', '../OpenSeek-Pretrain-100B/cot_synthesis2_CC-mid/118_text_document', '../OpenSeek-Pretrain-100B/cot_synthesis2_CC-mid/176_text_document', '../OpenSeek-Pretrain-100B/cot_synthesis2_CC-mid/256_text_document', '../OpenSeek-Pretrain-100B/cot_synthesis2_CC-mid/320_text_document', '../OpenSeek-Pretrain-100B/cot_synthesis2_CC-mid/32_text_document', '../OpenSeek-Pretrain-100B/cot_synthesis2_OpenSource-high/1_text_document', '../OpenSeek-Pretrain-100B/cot_synthesis2_OpenSource-low/2_text_document', '../OpenSeek-Pretrain-100B/cot_synthesis2_OpenSource-mid/3_text_document', '../OpenSeek-Pretrain-100B/cot_synthesis2_arxiv-high/2_text_document', 
    '../OpenSeek-Pretrain-100B/cot_synthesis2_arxiv-low/1_text_document', '../OpenSeek-Pretrain-100B/cot_synthesis2_arxiv-mid/2_text_document', '../OpenSeek-Pretrain-100B/cot_synthesis2_code-high/4_text_document', '../OpenSeek-Pretrain-100B/cot_synthesis2_code-low/6_text_document', '../OpenSeek-Pretrain-100B/cot_synthesis2_code-mid/23_text_document', '../OpenSeek-Pretrain-100B/cot_synthesis2_math-high/12_text_document', '../OpenSeek-Pretrain-100B/cot_synthesis2_math-low/3_text_document', '../OpenSeek-Pretrain-100B/cot_synthesis2_math-mid/5_text_document', '../OpenSeek-Pretrain-100B/cot_synthesis2_wiki-high/5_text_document', '../OpenSeek-Pretrain-100B/cot_synthesis2_wiki-low/5_text_document', '../OpenSeek-Pretrain-100B/cot_synthesis2_wiki-mid/4_text_document', '../OpenSeek-Pretrain-100B/cot_synthesis_CC-high/74_text_document', '../OpenSeek-Pretrain-100B/cot_synthesis_CC-low/54_text_document', '../OpenSeek-Pretrain-100B/cot_synthesis_CC-mid/275_text_document', '../OpenSeek-Pretrain-100B/cot_synthesis_OpenSource-high/4_text_document', '../OpenSeek-Pretrain-100B/cot_synthesis_OpenSource-low/2_text_document', '../OpenSeek-Pretrain-100B/cot_synthesis_OpenSource-mid/6_text_document', '../OpenSeek-Pretrain-100B/cot_synthesis_arxiv-high/2_text_document', '../OpenSeek-Pretrain-100B/cot_synthesis_arxiv-low/1_text_document', '../OpenSeek-Pretrain-100B/cot_synthesis_arxiv-mid/1_text_document', '../OpenSeek-Pretrain-100B/cot_synthesis_code-high/13_text_document', '../OpenSeek-Pretrain-100B/cot_synthesis_code-low/9_text_document', '../OpenSeek-Pretrain-100B/cot_synthesis_code-mid/6_text_document', '../OpenSeek-Pretrain-100B/cot_synthesis_math-high/11_text_document', '../OpenSeek-Pretrain-100B/cot_synthesis_math-low/11_text_document', '../OpenSeek-Pretrain-100B/cot_synthesis_math-mid/29_text_document', '../OpenSeek-Pretrain-100B/cot_synthesis_wiki-high/4_text_document', '../OpenSeek-Pretrain-100B/cot_synthesis_wiki-low/6_text_document', '../OpenSeek-Pretrain-100B/cot_synthesis_wiki-mid/3_text_document', '../OpenSeek-Pretrain-100B/math-high/part_04_text_document', '../OpenSeek-Pretrain-100B/math-low/part_10_text_document', '../OpenSeek-Pretrain-100B/math-mid/part_07_text_document', '../OpenSeek-Pretrain-100B/pes2o-full-train/train-0041-of-0136_text_document', '../OpenSeek-Pretrain-100B/pes2o-full-train/train-0125-of-0136_text_document', '../OpenSeek-Pretrain-100B/pes2o-full-val/valid-0034-of-0060_text_document', '../OpenSeek-Pretrain-100B/pes2o/pubmedcentral_3_text_document', '../OpenSeek-Pretrain-100B/stack/018_00000_text_document', '../OpenSeek-Pretrain-100B/wiki/012_00000_text_document', '../OpenSeek-Pretrain-100B/zh_cc-high-loss0/part_28_text_document', '../OpenSeek-Pretrain-100B/zh_cc-high-loss1/part_59_text_document', '../OpenSeek-Pretrain-100B/zh_cc-high-loss2/part_16_text_document', '../OpenSeek-Pretrain-100B/zh_cc-medidum-loss0/part_192_text_document', '../OpenSeek-Pretrain-100B/zh_cc-medidum-loss1/part_550_text_document', '../OpenSeek-Pretrain-100B/zh_cc-medidum-loss2/part_71_text_document'], [1.1068, 0.3577, 0.7775, 0.2859, 0.1672, 0.2339, 0.5397, 0.4064, 0.5005, 0.4616, 0.067, 0.3429, 0.261, 0.1824, 0.2313, 0.8237, 0.2866, 0.667, 0.4657, 0.2005, 0.4317, 1.1397, 0.6782, 0.9175, 0.6414, 0.4696, 1.0102, 1.1403, 0.9674, 0.3755, 0.0499, 0.3608, 0.3623, 0.3704, 0.3733, 0.3631, 0.2573, 0.1638, 0.3251, 6.0237, 8.9063, 10.1376, 0.4598, 0.6857, 0.899, 1.3135, 1.653, 0.3536, 0.6314, 0.5978, 0.7909, 0.2225, 0.1797, 0.2042, 0.4081, 0.1659, 1.2828, 5.68, 7.4907, 8.9359, 0.7663, 0.4052, 0.1916, 0.5074, 0.6437,
    0.6406, 0.4, 0.3564, 0.5768, 1.8165, 1.694, 1.6311, 0.687, 0.7387, 0.0143, 6.1982, 0.4229, 0.4202, 1.8171, 0.9776, 0.3725, 0.9492, 0.9236, 1.0643]), blend_per_split=None, split='1', split_matrix=[(0, 1.0), None, None], num_dataset_builder_threads=1, path_to_cache=None, mmap_bin_files=False, mock=False, tokenizer=<megatron.training.tokenizer.tokenizer._QwenTokenizerFS object at 0x7fbf5fd3f380>, mid_level_dataset_surplus=0.005, reset_position_ids=True, reset_attention_mask=True, eod_mask_loss=False, create_attention_mask=True, drop_last_partial_validation_sequence=True, add_extra_token_to_sequence=True, object_storage_cache_path=None)
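
In the `blend` field of the config above, every dataset prefix is paired with a sampling weight, and the weights visibly track each source's token count in billions (compare the data-distribution table earlier). Presumably Megatron normalizes these weights into sampling probabilities when mixing the sources; a minimal sketch of that normalization, assuming simple proportional sampling rather than the exact code path:

```python
def normalize_blend(weights):
    # Turn raw blend weights into sampling probabilities.
    total = sum(weights)
    return [w / total for w in weights]

# e.g. the first three weights from the config above:
print(normalize_blend([1.1068, 0.3577, 0.7775]))
```
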
    [default0]:INFO:megatron.core.datasets.indexed_dataset:Load the _IndexReader from ../OpenSeek-Pretrain-100B/Nemotron-CC-high-actual-actual-high/part_142_text_document.idx
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the sequence lengths
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the sequence pointers
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the document indices
    [default0]:INFO:megatron.core.datasets.indexed_dataset:> total number of sequences: 1000000
    [default0]:INFO:megatron.core.datasets.indexed_dataset:> total number of documents: 1000000
    [default0]:INFO:megatron.core.datasets.gpt_dataset:Load the GPTDataset train indices
    [default0]:INFO:megatron.core.datasets.gpt_dataset:	Load the document index from 3bcf97bf902bc7c1b1bdfc6f43697d5c-GPTDataset-train-document_index.npy
    [default0]:INFO:megatron.core.datasets.gpt_dataset:	Load the sample index from 3bcf97bf902bc7c1b1bdfc6f43697d5c-GPTDataset-train-sample_index.npy
    [default0]:INFO:megatron.core.datasets.gpt_dataset:	Load the shuffle index from 3bcf97bf902bc7c1b1bdfc6f43697d5c-GPTDataset-train-shuffle_index.npy
    [default0]:INFO:megatron.core.datasets.gpt_dataset:> total number of samples: 278453
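
Each dataset prefix is a `.bin`/`.idx` pair: the `.idx` is read once to extract sequence lengths, pointers, and document indices, and the shuffled train indices are then cached as `.npy` files under what appears to be a hash of the dataset config, which is why this run loads them instead of rebuilding. The sample count also lets you recover the token volume: 278,453 packed sequences of 4,096 tokens is about 1.14B tokens, matching this source's share in the 100B mix.

```python
# Approximate token count recovered from the log above (packing overhead
# and any extra EOD token per sequence are ignored).
samples, seq_len = 278_453, 4_096
print(samples * seq_len / 1e9)  # ≈ 1.14 billion tokens
```
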
    [default0]:INFO:megatron.core.datasets.indexed_dataset:Load the _IndexReader from ../OpenSeek-Pretrain-100B/Nemotron-CC-high-actual-actual-low/part_62_text_document.idx
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the sequence lengths
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the sequence pointers
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the document indices
    [default0]:INFO:megatron.core.datasets.indexed_dataset:> total number of sequences: 1000000
    [default0]:INFO:megatron.core.datasets.indexed_dataset:> total number of documents: 1000000
    [default0]:INFO:megatron.core.datasets.gpt_dataset:Load the GPTDataset train indices
    [default0]:INFO:megatron.core.datasets.gpt_dataset:	Load the document index from 42edf795324f414dfcabdcfd7d860a40-GPTDataset-train-document_index.npy
    [default0]:INFO:megatron.core.datasets.gpt_dataset:	Load the sample index from 42edf795324f414dfcabdcfd7d860a40-GPTDataset-train-sample_index.npy
    [default0]:INFO:megatron.core.datasets.gpt_dataset:	Load the shuffle index from 42edf795324f414dfcabdcfd7d860a40-GPTDataset-train-shuffle_index.npy
    [default0]:INFO:megatron.core.datasets.gpt_dataset:> total number of samples: 90001
    [default0]:INFO:megatron.core.datasets.indexed_dataset:Load the _IndexReader from ../OpenSeek-Pretrain-100B/Nemotron-CC-high-actual-actual-mid/part_189_text_document.idx
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the sequence lengths
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the sequence pointers
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the document indices
    [default0]:INFO:megatron.core.datasets.indexed_dataset:> total number of sequences: 1000000
    [default0]:INFO:megatron.core.datasets.indexed_dataset:> total number of documents: 1000000
    [default0]:INFO:megatron.core.datasets.gpt_dataset:Load the GPTDataset train indices
    [default0]:INFO:megatron.core.datasets.gpt_dataset:	Load the document index from fbd05c4b24607c4e84548924fc59cca2-GPTDataset-train-document_index.npy
    [default0]:INFO:megatron.core.datasets.gpt_dataset:	Load the sample index from fbd05c4b24607c4e84548924fc59cca2-GPTDataset-train-sample_index.npy
    [default0]:INFO:megatron.core.datasets.gpt_dataset:	Load the shuffle index from fbd05c4b24607c4e84548924fc59cca2-GPTDataset-train-shuffle_index.npy
    [default0]:INFO:megatron.core.datasets.gpt_dataset:> total number of samples: 195608
    [default0]:INFO:megatron.core.datasets.indexed_dataset:Load the _IndexReader from ../OpenSeek-Pretrain-100B/Nemotron-CC-high-synthetic-distill-high/part_76_text_document.idx
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the sequence lengths
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the sequence pointers
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the document indices
    [default0]:INFO:megatron.core.datasets.indexed_dataset:> total number of sequences: 1000000
    [default0]:INFO:megatron.core.datasets.indexed_dataset:> total number of documents: 1000000
    [default0]:INFO:megatron.core.datasets.gpt_dataset:Load the GPTDataset train indices
    [default0]:INFO:megatron.core.datasets.gpt_dataset:	Load the document index from 93b91bf5fd05a6901f85ce49182a8582-GPTDataset-train-document_index.npy
    [default0]:INFO:megatron.core.datasets.gpt_dataset:	Load the sample index from 93b91bf5fd05a6901f85ce49182a8582-GPTDataset-train-sample_index.npy
    [default0]:INFO:megatron.core.datasets.gpt_dataset:	Load the shuffle index from 93b91bf5fd05a6901f85ce49182a8582-GPTDataset-train-shuffle_index.npy
    [default0]:INFO:megatron.core.datasets.gpt_dataset:> total number of samples: 71916
    [default0]:INFO:megatron.core.datasets.indexed_dataset:Load the _IndexReader from ../OpenSeek-Pretrain-100B/Nemotron-CC-high-synthetic-distill-low/part_124_text_document.idx
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the sequence lengths
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the sequence pointers
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the document indices
    [default0]:INFO:megatron.core.datasets.indexed_dataset:> total number of sequences: 1000000
    [default0]:INFO:megatron.core.datasets.indexed_dataset:> total number of documents: 1000000
    [default0]:INFO:megatron.core.datasets.gpt_dataset:Load the GPTDataset train indices
    [default0]:INFO:megatron.core.datasets.gpt_dataset:	Load the document index from c156036a9186aa56bd0f678fb3c80d3f-GPTDataset-train-document_index.npy
    [default0]:INFO:megatron.core.datasets.gpt_dataset:	Load the sample index from c156036a9186aa56bd0f678fb3c80d3f-GPTDataset-train-sample_index.npy
    [default0]:INFO:megatron.core.datasets.gpt_dataset:	Load the shuffle index from c156036a9186aa56bd0f678fb3c80d3f-GPTDataset-train-shuffle_index.npy
    [default0]:INFO:megatron.core.datasets.gpt_dataset:> total number of samples: 42075
    [default0]:INFO:megatron.core.datasets.indexed_dataset:Load the _IndexReader from ../OpenSeek-Pretrain-100B/Nemotron-CC-high-synthetic-distill-mid/part_29_text_document.idx
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the sequence lengths
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the sequence pointers
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the document indices
    [default0]:INFO:megatron.core.datasets.indexed_dataset:> total number of sequences: 1000000
    [default0]:INFO:megatron.core.datasets.indexed_dataset:> total number of documents: 1000000
    [default0]:INFO:megatron.core.datasets.gpt_dataset:Load the GPTDataset train indices
    [default0]:INFO:megatron.core.datasets.gpt_dataset:	Load the document index from 38419ede70ce7c575159593f3a3fc94d-GPTDataset-train-document_index.npy
    [default0]:INFO:megatron.core.datasets.gpt_dataset:	Load the sample index from 38419ede70ce7c575159593f3a3fc94d-GPTDataset-train-sample_index.npy
    [default0]:INFO:megatron.core.datasets.gpt_dataset:	Load the shuffle index from 38419ede70ce7c575159593f3a3fc94d-GPTDataset-train-shuffle_index.npy
    [default0]:INFO:megatron.core.datasets.gpt_dataset:> total number of samples: 58837
    [default0]:INFO:megatron.core.datasets.indexed_dataset:Load the _IndexReader from ../OpenSeek-Pretrain-100B/Nemotron-CC-high-synthetic-diverse_qa_pairs-high/part_244_text_document.idx
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the sequence lengths
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the sequence pointers
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the document indices
    [default0]:INFO:megatron.core.datasets.indexed_dataset:> total number of sequences: 1000000
    [default0]:INFO:megatron.core.datasets.indexed_dataset:> total number of documents: 1000000
    [default0]:INFO:megatron.core.datasets.gpt_dataset:Load the GPTDataset train indices
    [default0]:INFO:megatron.core.datasets.gpt_dataset:	Load the document index from 4762bb70c45a00a24c261d684e5ce4ff-GPTDataset-train-document_index.npy
    [default0]:INFO:megatron.core.datasets.gpt_dataset:	Load the sample index from 4762bb70c45a00a24c261d684e5ce4ff-GPTDataset-train-sample_index.npy
    [default0]:INFO:megatron.core.datasets.gpt_dataset:	Load the shuffle index from 4762bb70c45a00a24c261d684e5ce4ff-GPTDataset-train-shuffle_index.npy
    [default0]:INFO:megatron.core.datasets.gpt_dataset:> total number of samples: 135775
    [default0]:INFO:megatron.core.datasets.indexed_dataset:Load the _IndexReader from ../OpenSeek-Pretrain-100B/Nemotron-CC-high-synthetic-diverse_qa_pairs-low/part_150_text_document.idx
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the sequence lengths
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the sequence pointers
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the document indices
    [default0]:INFO:megatron.core.datasets.indexed_dataset:> total number of sequences: 1000000
    [default0]:INFO:megatron.core.datasets.indexed_dataset:> total number of documents: 1000000
    [default0]:INFO:megatron.core.datasets.gpt_dataset:Load the GPTDataset train indices
    [default0]:INFO:megatron.core.datasets.gpt_dataset:	Load the document index from 421b3d6f3adf7fab7d6fa68067570d8f-GPTDataset-train-document_index.npy
    [default0]:INFO:megatron.core.datasets.gpt_dataset:	Load the sample index from 421b3d6f3adf7fab7d6fa68067570d8f-GPTDataset-train-sample_index.npy
    [default0]:INFO:megatron.core.datasets.gpt_dataset:	Load the shuffle index from 421b3d6f3adf7fab7d6fa68067570d8f-GPTDataset-train-shuffle_index.npy
    [default0]:INFO:megatron.core.datasets.gpt_dataset:> total number of samples: 102232
    [default0]:INFO:megatron.core.datasets.indexed_dataset:Load the _IndexReader from ../OpenSeek-Pretrain-100B/Nemotron-CC-high-synthetic-diverse_qa_pairs-mid/part_444_text_document.idx
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the sequence lengths
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the sequence pointers
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the document indices
    [default0]:INFO:megatron.core.datasets.indexed_dataset:> total number of sequences: 1000000
    [default0]:INFO:megatron.core.datasets.indexed_dataset:> total number of documents: 1000000
    [default0]:INFO:megatron.core.datasets.gpt_dataset:Load the GPTDataset train indices
    [default0]:INFO:megatron.core.datasets.gpt_dataset:	Load the document index from ba00e53917af4de1938c3e89a305da94-GPTDataset-train-document_index.npy
    [default0]:INFO:megatron.core.datasets.gpt_dataset:	Load the sample index from ba00e53917af4de1938c3e89a305da94-GPTDataset-train-sample_index.npy
    [default0]:INFO:megatron.core.datasets.gpt_dataset:	Load the shuffle index from ba00e53917af4de1938c3e89a305da94-GPTDataset-train-shuffle_index.npy
    [default0]:INFO:megatron.core.datasets.gpt_dataset:> total number of samples: 125911
    [default0]:INFO:megatron.core.datasets.indexed_dataset:Load the _IndexReader from ../OpenSeek-Pretrain-100B/Nemotron-CC-high-synthetic-extract_knowledge-high/part_498_text_document.idx
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the sequence lengths
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the sequence pointers
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the document indices
    [default0]:INFO:megatron.core.datasets.indexed_dataset:> total number of sequences: 1000000
    [default0]:INFO:megatron.core.datasets.indexed_dataset:> total number of documents: 1000000
    [default0]:INFO:megatron.core.datasets.gpt_dataset:Load the GPTDataset train indices
    [default0]:INFO:megatron.core.datasets.gpt_dataset:	Load the document index from 91379f2ddcb832c61f886a23d6f26843-GPTDataset-train-document_index.npy
    [default0]:INFO:megatron.core.datasets.gpt_dataset:	Load the sample index from 91379f2ddcb832c61f886a23d6f26843-GPTDataset-train-sample_index.npy
    [default0]:INFO:megatron.core.datasets.gpt_dataset:	Load the shuffle index from 91379f2ddcb832c61f886a23d6f26843-GPTDataset-train-shuffle_index.npy
    [default0]:INFO:megatron.core.datasets.gpt_dataset:> total number of samples: 116141
    [default0]:INFO:megatron.core.datasets.indexed_dataset:Load the _IndexReader from ../OpenSeek-Pretrain-100B/Nemotron-CC-high-synthetic-extract_knowledge-low/part_10_text_document.idx
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the sequence lengths
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the sequence pointers
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the document indices
    [default0]:INFO:megatron.core.datasets.indexed_dataset:> total number of sequences: 325354
    [default0]:INFO:megatron.core.datasets.indexed_dataset:> total number of documents: 325354
    [default0]:INFO:megatron.core.datasets.gpt_dataset:Load the GPTDataset train indices
    [default0]:INFO:megatron.core.datasets.gpt_dataset:	Load the document index from 49da887f9ea0fe93926b5194b3e73e64-GPTDataset-train-document_index.npy
    [default0]:INFO:megatron.core.datasets.gpt_dataset:	Load the sample index from 49da887f9ea0fe93926b5194b3e73e64-GPTDataset-train-sample_index.npy
    [default0]:INFO:megatron.core.datasets.gpt_dataset:	Load the shuffle index from 49da887f9ea0fe93926b5194b3e73e64-GPTDataset-train-shuffle_index.npy
    [default0]:INFO:megatron.core.datasets.gpt_dataset:> total number of samples: 16844
    [default0]:INFO:megatron.core.datasets.indexed_dataset:Load the _IndexReader from ../OpenSeek-Pretrain-100B/Nemotron-CC-high-synthetic-extract_knowledge-mid/part_144_text_document.idx
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the sequence lengths
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the sequence pointers
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the document indices
    [default0]:INFO:megatron.core.datasets.indexed_dataset:> total number of sequences: 1000000
    [default0]:INFO:megatron.core.datasets.indexed_dataset:> total number of documents: 1000000
    [default0]:INFO:megatron.core.datasets.gpt_dataset:Load the GPTDataset train indices
    [default0]:INFO:megatron.core.datasets.gpt_dataset:	Load the document index from f1a511b8258c5b2ef7cb0880677d5320-GPTDataset-train-document_index.npy
    [default0]:INFO:megatron.core.datasets.gpt_dataset:	Load the sample index from f1a511b8258c5b2ef7cb0880677d5320-GPTDataset-train-sample_index.npy
    [default0]:INFO:megatron.core.datasets.gpt_dataset:	Load the shuffle index from f1a511b8258c5b2ef7cb0880677d5320-GPTDataset-train-shuffle_index.npy
    [default0]:INFO:megatron.core.datasets.gpt_dataset:> total number of samples: 86258
    [default0]:INFO:megatron.core.datasets.indexed_dataset:Load the _IndexReader from ../OpenSeek-Pretrain-100B/Nemotron-CC-high-synthetic-knowledge_list-high/part_86_text_document.idx
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the sequence lengths
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the sequence pointers
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the document indices
    [default0]:INFO:megatron.core.datasets.indexed_dataset:> total number of sequences: 1000000
    [default0]:INFO:megatron.core.datasets.indexed_dataset:> total number of documents: 1000000
    [default0]:INFO:megatron.core.datasets.gpt_dataset:Load the GPTDataset train indices
    [default0]:INFO:megatron.core.datasets.gpt_dataset:	Load the document index from a52b401753534e6b0add72400982b06a-GPTDataset-train-document_index.npy
    [default0]:INFO:megatron.core.datasets.gpt_dataset:	Load the sample index from a52b401753534e6b0add72400982b06a-GPTDataset-train-sample_index.npy
    [default0]:INFO:megatron.core.datasets.gpt_dataset:	Load the shuffle index from a52b401753534e6b0add72400982b06a-GPTDataset-train-shuffle_index.npy
    [default0]:INFO:megatron.core.datasets.gpt_dataset:> total number of samples: 65662
    [default0]:INFO:megatron.core.datasets.indexed_dataset:Load the _IndexReader from ../OpenSeek-Pretrain-100B/Nemotron-CC-high-synthetic-knowledge_list-low/part_133_text_document.idx
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the sequence lengths
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the sequence pointers
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the document indices
    [default0]:INFO:megatron.core.datasets.indexed_dataset:> total number of sequences: 1000000
    [default0]:INFO:megatron.core.datasets.indexed_dataset:> total number of documents: 1000000
    [default0]:INFO:megatron.core.datasets.gpt_dataset:Load the GPTDataset train indices
    [default0]:INFO:megatron.core.datasets.gpt_dataset:	Load the document index from 5519d9210b1853b80c89283c5b8219bd-GPTDataset-train-document_index.npy
    [default0]:INFO:megatron.core.datasets.gpt_dataset:	Load the sample index from 5519d9210b1853b80c89283c5b8219bd-GPTDataset-train-sample_index.npy
    [default0]:INFO:megatron.core.datasets.gpt_dataset:	Load the shuffle index from 5519d9210b1853b80c89283c5b8219bd-GPTDataset-train-shuffle_index.npy
    [default0]:INFO:megatron.core.datasets.gpt_dataset:> total number of samples: 45891
    [default0]:INFO:megatron.core.datasets.indexed_dataset:Load the _IndexReader from ../OpenSeek-Pretrain-100B/Nemotron-CC-high-synthetic-knowledge_list-mid/part_139_text_document.idx
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the sequence lengths
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the sequence pointers
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the document indices
    [default0]:INFO:megatron.core.datasets.indexed_dataset:> total number of sequences: 1000000
    [default0]:INFO:megatron.core.datasets.indexed_dataset:> total number of documents: 1000000
    [default0]:INFO:megatron.core.datasets.gpt_dataset:Load the GPTDataset train indices
    [default0]:INFO:megatron.core.datasets.gpt_dataset:	Load the document index from ecbf550b30bfc696b0d4912fd82d990f-GPTDataset-train-document_index.npy
    [default0]:INFO:megatron.core.datasets.gpt_dataset:	Load the sample index from ecbf550b30bfc696b0d4912fd82d990f-GPTDataset-train-sample_index.npy
    [default0]:INFO:megatron.core.datasets.gpt_dataset:	Load the shuffle index from ecbf550b30bfc696b0d4912fd82d990f-GPTDataset-train-shuffle_index.npy
    [default0]:INFO:megatron.core.datasets.gpt_dataset:> total number of samples: 58196
    [default0]:INFO:megatron.core.datasets.indexed_dataset:Load the _IndexReader from ../OpenSeek-Pretrain-100B/Nemotron-CC-high-synthetic-wrap_medium-high/part_47_text_document.idx
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the sequence lengths
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the sequence pointers
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the document indices
    [default0]:INFO:megatron.core.datasets.indexed_dataset:> total number of sequences: 1000000
    [default0]:INFO:megatron.core.datasets.indexed_dataset:> total number of documents: 1000000
    [default0]:INFO:megatron.core.datasets.gpt_dataset:Load the GPTDataset train indices
    [default0]:INFO:megatron.core.datasets.gpt_dataset:	Load the document index from e4aeb3128a4705d3c4b914a618670c60-GPTDataset-train-document_index.npy
    [default0]:INFO:megatron.core.datasets.gpt_dataset:	Load the sample index from e4aeb3128a4705d3c4b914a618670c60-GPTDataset-train-sample_index.npy
    [default0]:INFO:megatron.core.datasets.gpt_dataset:	Load the shuffle index from e4aeb3128a4705d3c4b914a618670c60-GPTDataset-train-shuffle_index.npy
    [default0]:INFO:megatron.core.datasets.gpt_dataset:> total number of samples: 207235
    [default0]:INFO:megatron.core.datasets.indexed_dataset:Load the _IndexReader from ../OpenSeek-Pretrain-100B/Nemotron-CC-high-synthetic-wrap_medium-low/part_11_text_document.idx
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the sequence lengths
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the sequence pointers
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the document indices
    [default0]:INFO:megatron.core.datasets.indexed_dataset:> total number of sequences: 1000000
    [default0]:INFO:megatron.core.datasets.indexed_dataset:> total number of documents: 1000000
    [default0]:INFO:megatron.core.datasets.gpt_dataset:Load the GPTDataset train indices
    [default0]:INFO:megatron.core.datasets.gpt_dataset:	Load the document index from a9f99f617847aa8f6179da4f53b838af-GPTDataset-train-document_index.npy
    [default0]:INFO:megatron.core.datasets.gpt_dataset:	Load the sample index from a9f99f617847aa8f6179da4f53b838af-GPTDataset-train-sample_index.npy
    [default0]:INFO:megatron.core.datasets.gpt_dataset:	Load the shuffle index from a9f99f617847aa8f6179da4f53b838af-GPTDataset-train-shuffle_index.npy
    [default0]:INFO:megatron.core.datasets.gpt_dataset:> total number of samples: 72100
    [default0]:INFO:megatron.core.datasets.indexed_dataset:Load the _IndexReader from ../OpenSeek-Pretrain-100B/Nemotron-CC-high-synthetic-wrap_medium-mid/part_97_text_document.idx
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the sequence lengths
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the sequence pointers
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the document indices
    [default0]:INFO:megatron.core.datasets.indexed_dataset:> total number of sequences: 1000000
    [default0]:INFO:megatron.core.datasets.indexed_dataset:> total number of documents: 1000000
    [default7]:(min, max) time across ranks (ms):
    [default7]:    load-checkpoint ................................: (3.48, 10.62)
    [default0]:INFO:megatron.core.datasets.gpt_dataset:Load the GPTDataset train indices
    [default0]:INFO:megatron.core.datasets.gpt_dataset:	Load the document index from 063fab3df418ed000462f842ac8d49ea-GPTDataset-train-document_index.npy
    [default0]:INFO:megatron.core.datasets.gpt_dataset:	Load the sample index from 063fab3df418ed000462f842ac8d49ea-GPTDataset-train-sample_index.npy
    [default0]:INFO:megatron.core.datasets.gpt_dataset:	Load the shuffle index from 063fab3df418ed000462f842ac8d49ea-GPTDataset-train-shuffle_index.npy
    [default0]:INFO:megatron.core.datasets.gpt_dataset:> total number of samples: 167804
    [default0]:INFO:megatron.core.datasets.indexed_dataset:Load the _IndexReader from ../OpenSeek-Pretrain-100B/Nemotron-CC-low-synthetic-wrap_medium-high/part_43_text_document.idx
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the sequence lengths
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the sequence pointers
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the document indices
    [default0]:INFO:megatron.core.datasets.indexed_dataset:> total number of sequences: 1000000
    [default0]:INFO:megatron.core.datasets.indexed_dataset:> total number of documents: 1000000
    [default0]:INFO:megatron.core.datasets.gpt_dataset:Load the GPTDataset train indices
    [default0]:INFO:megatron.core.datasets.gpt_dataset:	Load the document index from 92c67007e7c71328e23b10f14bbdf0d9-GPTDataset-train-document_index.npy
    [default0]:INFO:megatron.core.datasets.gpt_dataset:	Load the sample index from 92c67007e7c71328e23b10f14bbdf0d9-GPTDataset-train-sample_index.npy
    [default0]:INFO:megatron.core.datasets.gpt_dataset:	Load the shuffle index from 92c67007e7c71328e23b10f14bbdf0d9-GPTDataset-train-shuffle_index.npy
    [default0]:INFO:megatron.core.datasets.gpt_dataset:> total number of samples: 117162
    [default0]:INFO:megatron.core.datasets.indexed_dataset:Load the _IndexReader from ../OpenSeek-Pretrain-100B/Nemotron-CC-low-synthetic-wrap_medium-low/part_10_text_document.idx
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the sequence lengths
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the sequence pointers
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the document indices
    [default0]:INFO:megatron.core.datasets.indexed_dataset:> total number of sequences: 1000000
    [default0]:INFO:megatron.core.datasets.indexed_dataset:> total number of documents: 1000000
    [default0]:INFO:megatron.core.datasets.gpt_dataset:Load the GPTDataset train indices
    [default0]:INFO:megatron.core.datasets.gpt_dataset:	Load the document index from 97123868ddd599ebe18baa058e80c79d-GPTDataset-train-document_index.npy
    [default0]:INFO:megatron.core.datasets.gpt_dataset:	Load the sample index from 97123868ddd599ebe18baa058e80c79d-GPTDataset-train-sample_index.npy
    [default0]:INFO:megatron.core.datasets.gpt_dataset:	Load the shuffle index from 97123868ddd599ebe18baa058e80c79d-GPTDataset-train-shuffle_index.npy
    [default0]:INFO:megatron.core.datasets.gpt_dataset:> total number of samples: 50433
    [default0]:INFO:megatron.core.datasets.indexed_dataset:Load the _IndexReader from ../OpenSeek-Pretrain-100B/Nemotron-CC-low-synthetic-wrap_medium-mid/part_164_text_document.idx
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the sequence lengths
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the sequence pointers
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the document indices
    [default0]:INFO:megatron.core.datasets.indexed_dataset:> total number of sequences: 1000000
    [default0]:INFO:megatron.core.datasets.indexed_dataset:> total number of documents: 1000000
    [default0]:INFO:megatron.core.datasets.gpt_dataset:Load the GPTDataset train indices
    [default0]:INFO:megatron.core.datasets.gpt_dataset:	Load the document index from 31933f6459e2d65890af1f316c530035-GPTDataset-train-document_index.npy
    [default0]:INFO:megatron.core.datasets.gpt_dataset:	Load the sample index from 31933f6459e2d65890af1f316c530035-GPTDataset-train-sample_index.npy
    [default0]:INFO:megatron.core.datasets.gpt_dataset:	Load the shuffle index from 31933f6459e2d65890af1f316c530035-GPTDataset-train-shuffle_index.npy
    [default0]:INFO:megatron.core.datasets.gpt_dataset:> total number of samples: 108609
    [default0]:INFO:megatron.core.datasets.indexed_dataset:Load the _IndexReader from ../OpenSeek-Pretrain-100B/Nemotron-CC-medium-actual-actual-high/part_92_text_document.idx
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the sequence lengths
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the sequence pointers
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the document indices
    [default0]:INFO:megatron.core.datasets.indexed_dataset:> total number of sequences: 1000000
    [default0]:INFO:megatron.core.datasets.indexed_dataset:> total number of documents: 1000000
    [default0]:INFO:megatron.core.datasets.gpt_dataset:Load the GPTDataset train indices
    [default0]:INFO:megatron.core.datasets.gpt_dataset:	Load the document index from d1c63d01256da2dbfb9fc6012b0cf457-GPTDataset-train-document_index.npy
    [default0]:INFO:megatron.core.datasets.gpt_dataset:	Load the sample index from d1c63d01256da2dbfb9fc6012b0cf457-GPTDataset-train-sample_index.npy
    [default0]:INFO:megatron.core.datasets.gpt_dataset:	Load the shuffle index from d1c63d01256da2dbfb9fc6012b0cf457-GPTDataset-train-shuffle_index.npy
    [default0]:INFO:megatron.core.datasets.gpt_dataset:> total number of samples: 286720
    [default0]:INFO:megatron.core.datasets.indexed_dataset:Load the _IndexReader from ../OpenSeek-Pretrain-100B/Nemotron-CC-medium-actual-actual-low/part_113_text_document.idx
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the sequence lengths
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the sequence pointers
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the document indices
    [default0]:INFO:megatron.core.datasets.indexed_dataset:> total number of sequences: 1000000
    [default0]:INFO:megatron.core.datasets.indexed_dataset:> total number of documents: 1000000
    [default0]:INFO:megatron.core.datasets.gpt_dataset:Load the GPTDataset train indices
    [default0]:INFO:megatron.core.datasets.gpt_dataset:	Load the document index from 06a4931530206c4caa9999e2a116c216-GPTDataset-train-document_index.npy
    [default0]:INFO:megatron.core.datasets.gpt_dataset:	Load the sample index from 06a4931530206c4caa9999e2a116c216-GPTDataset-train-sample_index.npy
    [default0]:INFO:megatron.core.datasets.gpt_dataset:	Load the shuffle index from 06a4931530206c4caa9999e2a116c216-GPTDataset-train-shuffle_index.npy
    [default0]:INFO:megatron.core.datasets.gpt_dataset:> total number of samples: 170626
    [default0]:INFO:megatron.core.datasets.indexed_dataset:Load the _IndexReader from ../OpenSeek-Pretrain-100B/Nemotron-CC-medium-actual-actual-mid/part_563_text_document.idx
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the sequence lengths
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the sequence pointers
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the document indices
    [default0]:INFO:megatron.core.datasets.indexed_dataset:> total number of sequences: 1000000
    [default0]:INFO:megatron.core.datasets.indexed_dataset:> total number of documents: 1000000
    [default0]:INFO:megatron.core.datasets.gpt_dataset:Load the GPTDataset train indices
    [default0]:INFO:megatron.core.datasets.gpt_dataset:	Load the document index from cea463b205f3d64bc57cd0bf402f517a-GPTDataset-train-document_index.npy
    [default0]:INFO:megatron.core.datasets.gpt_dataset:	Load the sample index from cea463b205f3d64bc57cd0bf402f517a-GPTDataset-train-sample_index.npy
    [default0]:INFO:megatron.core.datasets.gpt_dataset:	Load the shuffle index from cea463b205f3d64bc57cd0bf402f517a-GPTDataset-train-shuffle_index.npy
    [default0]:INFO:megatron.core.datasets.gpt_dataset:> total number of samples: 230810
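From this point the log leaves the Nemotron-CC shards and walks through the remaining domains of the OpenSeek-Pretrain-100B blend (arxiv, books, code-*, cot_synthesis*, wiki). The per-shard `total number of samples` values roughly track the blend weights: heavily weighted sources such as the cot_synthesis2_arxiv shards later report millions of samples, while books contributes only about 118k. Below is a minimal sketch of how a blended dataset turns normalized weights into per-source sample requests; the weights and the sample budget are illustrative stand-ins derived from the Tokens(B) column, not the baseline's actual configuration:

```python
# Hypothetical illustration of blend-weight -> per-source sample accounting.
# Folder names reuse datasets from the log; weights and the budget are made up.
weights = {
    "arxiv": 0.66,
    "books": 0.48,
    "code-high": 1.04,
    "cot_synthesis2_arxiv-high": 12.89,
}
total_train_samples = 1_000_000  # assumed overall sample budget

norm = sum(weights.values())
requests = {name: round(w / norm * total_train_samples) for name, w in weights.items()}
for name, n in requests.items():
    # Larger weights translate into proportionally more requested samples,
    # which is why the per-shard sample counts in the log vary so widely.
    print(f"{name}: {n} samples")
```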
    [default0]:INFO:megatron.core.datasets.indexed_dataset:Load the _IndexReader from ../OpenSeek-Pretrain-100B/arxiv/007_00000_text_document.idx
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the sequence lengths
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the sequence pointers
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the document indices
    [default0]:INFO:megatron.core.datasets.indexed_dataset:> total number of sequences: 36740
    [default0]:INFO:megatron.core.datasets.indexed_dataset:> total number of documents: 36740
    [default0]:INFO:megatron.core.datasets.gpt_dataset:Load the GPTDataset train indices
    [default0]:INFO:megatron.core.datasets.gpt_dataset:	Load the document index from 6443055721fcb34913a77945da3c3a0e-GPTDataset-train-document_index.npy
    [default0]:INFO:megatron.core.datasets.gpt_dataset:	Load the sample index from 6443055721fcb34913a77945da3c3a0e-GPTDataset-train-sample_index.npy
    [default0]:INFO:megatron.core.datasets.gpt_dataset:	Load the shuffle index from 6443055721fcb34913a77945da3c3a0e-GPTDataset-train-shuffle_index.npy
    [default0]:INFO:megatron.core.datasets.gpt_dataset:> total number of samples: 161355
    [default0]:INFO:megatron.core.datasets.indexed_dataset:Load the _IndexReader from ../OpenSeek-Pretrain-100B/books/016_00007_text_document.idx
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the sequence lengths
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the sequence pointers
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the document indices
    [default0]:INFO:megatron.core.datasets.indexed_dataset:> total number of sequences: 4969
    [default0]:INFO:megatron.core.datasets.indexed_dataset:> total number of documents: 4969
    [default0]:INFO:megatron.core.datasets.gpt_dataset:Load the GPTDataset train indices
    [default0]:INFO:megatron.core.datasets.gpt_dataset:	Load the document index from 10bf636698c127d1a24bbf6d2eda2aee-GPTDataset-train-document_index.npy
    [default0]:INFO:megatron.core.datasets.gpt_dataset:	Load the sample index from 10bf636698c127d1a24bbf6d2eda2aee-GPTDataset-train-sample_index.npy
    [default0]:INFO:megatron.core.datasets.gpt_dataset:	Load the shuffle index from 10bf636698c127d1a24bbf6d2eda2aee-GPTDataset-train-shuffle_index.npy
    [default0]:INFO:megatron.core.datasets.gpt_dataset:> total number of samples: 118143
    [default0]:INFO:megatron.core.datasets.indexed_dataset:Load the _IndexReader from ../OpenSeek-Pretrain-100B/code-high/part_13_text_document.idx
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the sequence lengths
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the sequence pointers
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the document indices
    [default0]:INFO:megatron.core.datasets.indexed_dataset:> total number of sequences: 1000000
    [default0]:INFO:megatron.core.datasets.indexed_dataset:> total number of documents: 1000000
    [default0]:INFO:megatron.core.datasets.gpt_dataset:Load the GPTDataset train indices
    [default0]:INFO:megatron.core.datasets.gpt_dataset:	Load the document index from 21cd90b13a6517160b33ca8685f946a8-GPTDataset-train-document_index.npy
    [default0]:INFO:megatron.core.datasets.gpt_dataset:	Load the sample index from 21cd90b13a6517160b33ca8685f946a8-GPTDataset-train-sample_index.npy
    [default0]:INFO:megatron.core.datasets.gpt_dataset:	Load the shuffle index from 21cd90b13a6517160b33ca8685f946a8-GPTDataset-train-shuffle_index.npy
    [default0]:INFO:megatron.core.datasets.gpt_dataset:> total number of samples: 254137
    [default0]:INFO:megatron.core.datasets.indexed_dataset:Load the _IndexReader from ../OpenSeek-Pretrain-100B/code-low/part_36_text_document.idx
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the sequence lengths
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the sequence pointers
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the document indices
    [default0]:INFO:megatron.core.datasets.indexed_dataset:> total number of sequences: 1000000
    [default0]:INFO:megatron.core.datasets.indexed_dataset:> total number of documents: 1000000
    [default0]:INFO:megatron.core.datasets.gpt_dataset:Load the GPTDataset train indices
    [default0]:INFO:megatron.core.datasets.gpt_dataset:	Load the document index from 499c6fff9032a79c988ab8063ce949e1-GPTDataset-train-document_index.npy
    [default0]:INFO:megatron.core.datasets.gpt_dataset:	Load the sample index from 499c6fff9032a79c988ab8063ce949e1-GPTDataset-train-sample_index.npy
    [default0]:INFO:megatron.core.datasets.gpt_dataset:	Load the shuffle index from 499c6fff9032a79c988ab8063ce949e1-GPTDataset-train-shuffle_index.npy
    [default0]:INFO:megatron.core.datasets.gpt_dataset:> total number of samples: 286865
    [default0]:INFO:megatron.core.datasets.indexed_dataset:Load the _IndexReader from ../OpenSeek-Pretrain-100B/code-mid/part_37_text_document.idx
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the sequence lengths
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the sequence pointers
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the document indices
    [default0]:INFO:megatron.core.datasets.indexed_dataset:> total number of sequences: 1000000
    [default0]:INFO:megatron.core.datasets.indexed_dataset:> total number of documents: 1000000
    [default0]:INFO:megatron.core.datasets.gpt_dataset:Load the GPTDataset train indices
    [default0]:INFO:megatron.core.datasets.gpt_dataset:	Load the document index from 52e1e8829398a339b53880f97a35f5fd-GPTDataset-train-document_index.npy
    [default0]:INFO:megatron.core.datasets.gpt_dataset:	Load the sample index from 52e1e8829398a339b53880f97a35f5fd-GPTDataset-train-sample_index.npy
    [default0]:INFO:megatron.core.datasets.gpt_dataset:	Load the shuffle index from 52e1e8829398a339b53880f97a35f5fd-GPTDataset-train-shuffle_index.npy
    [default0]:INFO:megatron.core.datasets.gpt_dataset:> total number of samples: 243365
    [default0]:INFO:megatron.core.datasets.indexed_dataset:Load the _IndexReader from ../OpenSeek-Pretrain-100B/cot_synthesis2_CC-high/23_text_document.idx
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the sequence lengths
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the sequence pointers
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the document indices
    [default0]:INFO:megatron.core.datasets.indexed_dataset:> total number of sequences: 500000
    [default0]:INFO:megatron.core.datasets.indexed_dataset:> total number of documents: 500000
    [default0]:INFO:megatron.core.datasets.gpt_dataset:Load the GPTDataset train indices
    [default0]:INFO:megatron.core.datasets.gpt_dataset:	Load the document index from 78a0fcb91abd174d65bb6cc0d46e617d-GPTDataset-train-document_index.npy
    [default0]:INFO:megatron.core.datasets.gpt_dataset:	Load the sample index from 78a0fcb91abd174d65bb6cc0d46e617d-GPTDataset-train-sample_index.npy
    [default0]:INFO:megatron.core.datasets.gpt_dataset:	Load the shuffle index from 78a0fcb91abd174d65bb6cc0d46e617d-GPTDataset-train-shuffle_index.npy
    [default0]:INFO:megatron.core.datasets.gpt_dataset:> total number of samples: 94468
    [default0]:INFO:megatron.core.datasets.indexed_dataset:Load the _IndexReader from ../OpenSeek-Pretrain-100B/cot_synthesis2_CC-low/51_text_document.idx
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the sequence lengths
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the sequence pointers
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the document indices
    [default0]:INFO:megatron.core.datasets.indexed_dataset:> total number of sequences: 73459
    [default0]:INFO:megatron.core.datasets.indexed_dataset:> total number of documents: 73459
    [default0]:INFO:megatron.core.datasets.gpt_dataset:Load the GPTDataset train indices
    [default0]:INFO:megatron.core.datasets.gpt_dataset:	Load the document index from 82ad0782ac038979c9eaf5146ff201a9-GPTDataset-train-document_index.npy
    [default0]:INFO:megatron.core.datasets.gpt_dataset:	Load the sample index from 82ad0782ac038979c9eaf5146ff201a9-GPTDataset-train-sample_index.npy
    [default0]:INFO:megatron.core.datasets.gpt_dataset:	Load the shuffle index from 82ad0782ac038979c9eaf5146ff201a9-GPTDataset-train-shuffle_index.npy
    [default0]:INFO:megatron.core.datasets.gpt_dataset:> total number of samples: 12546
    [default0]:INFO:megatron.core.datasets.indexed_dataset:Load the _IndexReader from ../OpenSeek-Pretrain-100B/cot_synthesis2_CC-mid/118_text_document.idx
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the sequence lengths
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the sequence pointers
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the document indices
    [default0]:INFO:megatron.core.datasets.indexed_dataset:> total number of sequences: 500000
    [default0]:INFO:megatron.core.datasets.indexed_dataset:> total number of documents: 500000
    [default0]:INFO:megatron.core.datasets.gpt_dataset:Load the GPTDataset train indices
    [default0]:INFO:megatron.core.datasets.gpt_dataset:	Load the document index from 61ededc3c5fec8599ebac2c9268839d2-GPTDataset-train-document_index.npy
    [default0]:INFO:megatron.core.datasets.gpt_dataset:	Load the sample index from 61ededc3c5fec8599ebac2c9268839d2-GPTDataset-train-sample_index.npy
    [default0]:INFO:megatron.core.datasets.gpt_dataset:	Load the shuffle index from 61ededc3c5fec8599ebac2c9268839d2-GPTDataset-train-shuffle_index.npy
    [default0]:INFO:megatron.core.datasets.gpt_dataset:> total number of samples: 90757
    [default0]:INFO:megatron.core.datasets.indexed_dataset:Load the _IndexReader from ../OpenSeek-Pretrain-100B/cot_synthesis2_CC-mid/176_text_document.idx
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the sequence lengths
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the sequence pointers
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the document indices
    [default0]:INFO:megatron.core.datasets.indexed_dataset:> total number of sequences: 500000
    [default0]:INFO:megatron.core.datasets.indexed_dataset:> total number of documents: 500000
    [default0]:INFO:megatron.core.datasets.gpt_dataset:Load the GPTDataset train indices
    [default0]:INFO:megatron.core.datasets.gpt_dataset:	Load the document index from f6f063fadb56f7471be6fd7c52d961a9-GPTDataset-train-document_index.npy
    [default0]:INFO:megatron.core.datasets.gpt_dataset:	Load the sample index from f6f063fadb56f7471be6fd7c52d961a9-GPTDataset-train-sample_index.npy
    [default0]:INFO:megatron.core.datasets.gpt_dataset:	Load the shuffle index from f6f063fadb56f7471be6fd7c52d961a9-GPTDataset-train-shuffle_index.npy
    [default0]:INFO:megatron.core.datasets.gpt_dataset:> total number of samples: 91135
    [default0]:INFO:megatron.core.datasets.indexed_dataset:Load the _IndexReader from ../OpenSeek-Pretrain-100B/cot_synthesis2_CC-mid/256_text_document.idx
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the sequence lengths
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the sequence pointers
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the document indices
    [default0]:INFO:megatron.core.datasets.indexed_dataset:> total number of sequences: 500000
    [default0]:INFO:megatron.core.datasets.indexed_dataset:> total number of documents: 500000
    [default0]:INFO:megatron.core.datasets.gpt_dataset:Load the GPTDataset train indices
    [default0]:INFO:megatron.core.datasets.gpt_dataset:	Load the document index from 0fac24c40f9207ba4e1c876f67ac890f-GPTDataset-train-document_index.npy
    [default0]:INFO:megatron.core.datasets.gpt_dataset:	Load the sample index from 0fac24c40f9207ba4e1c876f67ac890f-GPTDataset-train-sample_index.npy
    [default0]:INFO:megatron.core.datasets.gpt_dataset:	Load the shuffle index from 0fac24c40f9207ba4e1c876f67ac890f-GPTDataset-train-shuffle_index.npy
    [default0]:INFO:megatron.core.datasets.gpt_dataset:> total number of samples: 93182
    [default0]:INFO:megatron.core.datasets.indexed_dataset:Load the _IndexReader from ../OpenSeek-Pretrain-100B/cot_synthesis2_CC-mid/320_text_document.idx
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the sequence lengths
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the sequence pointers
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the document indices
    [default0]:INFO:megatron.core.datasets.indexed_dataset:> total number of sequences: 500000
    [default0]:INFO:megatron.core.datasets.indexed_dataset:> total number of documents: 500000
    [default0]:INFO:megatron.core.datasets.gpt_dataset:Load the GPTDataset train indices
    [default0]:INFO:megatron.core.datasets.gpt_dataset:	Load the document index from 6e257a563aa1f477457f9d5063f31c3f-GPTDataset-train-document_index.npy
    [default0]:INFO:megatron.core.datasets.gpt_dataset:	Load the sample index from 6e257a563aa1f477457f9d5063f31c3f-GPTDataset-train-sample_index.npy
    [default0]:INFO:megatron.core.datasets.gpt_dataset:	Load the shuffle index from 6e257a563aa1f477457f9d5063f31c3f-GPTDataset-train-shuffle_index.npy
    [default0]:INFO:megatron.core.datasets.gpt_dataset:> total number of samples: 93904
    [default0]:INFO:megatron.core.datasets.indexed_dataset:Load the _IndexReader from ../OpenSeek-Pretrain-100B/cot_synthesis2_CC-mid/32_text_document.idx
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the sequence lengths
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the sequence pointers
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the document indices
    [default0]:INFO:megatron.core.datasets.indexed_dataset:> total number of sequences: 500000
    [default0]:INFO:megatron.core.datasets.indexed_dataset:> total number of documents: 500000
    [default0]:INFO:megatron.core.datasets.gpt_dataset:Load the GPTDataset train indices
    [default0]:INFO:megatron.core.datasets.gpt_dataset:	Load the document index from c8fceab04f7a896b8a87caea00adf17e-GPTDataset-train-document_index.npy
    [default0]:INFO:megatron.core.datasets.gpt_dataset:	Load the sample index from c8fceab04f7a896b8a87caea00adf17e-GPTDataset-train-sample_index.npy
    [default0]:INFO:megatron.core.datasets.gpt_dataset:	Load the shuffle index from c8fceab04f7a896b8a87caea00adf17e-GPTDataset-train-shuffle_index.npy
    [default0]:INFO:megatron.core.datasets.gpt_dataset:> total number of samples: 91340
    [default0]:INFO:megatron.core.datasets.indexed_dataset:Load the _IndexReader from ../OpenSeek-Pretrain-100B/cot_synthesis2_OpenSource-high/1_text_document.idx
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the sequence lengths
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the sequence pointers
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the document indices
    [default0]:INFO:megatron.core.datasets.indexed_dataset:> total number of sequences: 500000
    [default0]:INFO:megatron.core.datasets.indexed_dataset:> total number of documents: 500000
    [default0]:INFO:megatron.core.datasets.gpt_dataset:Load the GPTDataset train indices
    [default0]:INFO:megatron.core.datasets.gpt_dataset:	Load the document index from 6b309498174da0deb341abe9aa36a0e8-GPTDataset-train-document_index.npy
    [default0]:INFO:megatron.core.datasets.gpt_dataset:	Load the sample index from 6b309498174da0deb341abe9aa36a0e8-GPTDataset-train-sample_index.npy
    [default0]:INFO:megatron.core.datasets.gpt_dataset:	Load the shuffle index from 6b309498174da0deb341abe9aa36a0e8-GPTDataset-train-shuffle_index.npy
    [default0]:INFO:megatron.core.datasets.gpt_dataset:> total number of samples: 64738
    [default0]:INFO:megatron.core.datasets.indexed_dataset:Load the _IndexReader from ../OpenSeek-Pretrain-100B/cot_synthesis2_OpenSource-low/2_text_document.idx
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the sequence lengths
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the sequence pointers
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the document indices
    [default0]:INFO:megatron.core.datasets.indexed_dataset:> total number of sequences: 500000
    [default0]:INFO:megatron.core.datasets.indexed_dataset:> total number of documents: 500000
    [default0]:INFO:megatron.core.datasets.gpt_dataset:Load the GPTDataset train indices
    [default0]:INFO:megatron.core.datasets.gpt_dataset:	Load the document index from 398e9530d4c8410e322ac338179666ca-GPTDataset-train-document_index.npy
    [default0]:INFO:megatron.core.datasets.gpt_dataset:	Load the sample index from 398e9530d4c8410e322ac338179666ca-GPTDataset-train-sample_index.npy
    [default0]:INFO:megatron.core.datasets.gpt_dataset:	Load the shuffle index from 398e9530d4c8410e322ac338179666ca-GPTDataset-train-shuffle_index.npy
    [default0]:INFO:megatron.core.datasets.gpt_dataset:> total number of samples: 41218
    [default0]:INFO:megatron.core.datasets.indexed_dataset:Load the _IndexReader from ../OpenSeek-Pretrain-100B/cot_synthesis2_OpenSource-mid/3_text_document.idx
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the sequence lengths
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the sequence pointers
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the document indices
    [default0]:INFO:megatron.core.datasets.indexed_dataset:> total number of sequences: 500000
    [default0]:INFO:megatron.core.datasets.indexed_dataset:> total number of documents: 500000
    [default0]:INFO:megatron.core.datasets.gpt_dataset:Load the GPTDataset train indices
    [default0]:INFO:megatron.core.datasets.gpt_dataset:	Load the document index from da0fb5525d01182e1daf66b4dbd842b4-GPTDataset-train-document_index.npy
    [default0]:INFO:megatron.core.datasets.gpt_dataset:	Load the sample index from da0fb5525d01182e1daf66b4dbd842b4-GPTDataset-train-sample_index.npy
    [default0]:INFO:megatron.core.datasets.gpt_dataset:	Load the shuffle index from da0fb5525d01182e1daf66b4dbd842b4-GPTDataset-train-shuffle_index.npy
    [default0]:INFO:megatron.core.datasets.gpt_dataset:> total number of samples: 81781
    [default0]:INFO:megatron.core.datasets.indexed_dataset:Load the _IndexReader from ../OpenSeek-Pretrain-100B/cot_synthesis2_arxiv-high/2_text_document.idx
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the sequence lengths
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the sequence pointers
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the document indices
    [default0]:INFO:megatron.core.datasets.indexed_dataset:> total number of sequences: 256516
    [default0]:INFO:megatron.core.datasets.indexed_dataset:> total number of documents: 256516
    [default0]:INFO:megatron.core.datasets.gpt_dataset:Load the GPTDataset train indices
    [default0]:INFO:megatron.core.datasets.gpt_dataset:	Load the document index from 5257e77c1f3591b9cf58683b58f3dcc3-GPTDataset-train-document_index.npy
    [default0]:INFO:megatron.core.datasets.gpt_dataset:	Load the sample index from 5257e77c1f3591b9cf58683b58f3dcc3-GPTDataset-train-sample_index.npy
    [default0]:INFO:megatron.core.datasets.gpt_dataset:	Load the shuffle index from 5257e77c1f3591b9cf58683b58f3dcc3-GPTDataset-train-shuffle_index.npy
    [default0]:INFO:megatron.core.datasets.gpt_dataset:> total number of samples: 1515440
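Note the ratio here: this cot_synthesis2_arxiv-high shard holds only 256,516 documents yet serves 1,515,440 training samples. Samples are fixed-length token windows rather than documents, so a shard of long arxiv-style documents can yield far more samples than documents; combined with the heavy blend weight on cot_synthesis2_arxiv, this shard ends up serving about 1.5M samples while the 1M-document Nemotron-CC shards above are drawn down to only 100k to 300k samples each. That asymmetry is exactly what the blend weights encode.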
    [default0]:INFO:megatron.core.datasets.indexed_dataset:Load the _IndexReader from ../OpenSeek-Pretrain-100B/cot_synthesis2_arxiv-low/1_text_document.idx
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the sequence lengths
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the sequence pointers
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the document indices
    [default0]:INFO:megatron.core.datasets.indexed_dataset:> total number of sequences: 500000
    [default0]:INFO:megatron.core.datasets.indexed_dataset:> total number of documents: 500000
    [default0]:INFO:megatron.core.datasets.gpt_dataset:Load the GPTDataset train indices
    [default0]:INFO:megatron.core.datasets.gpt_dataset:	Load the document index from 097db078c0fd21058b81a805201e1740-GPTDataset-train-document_index.npy
    [default0]:INFO:megatron.core.datasets.gpt_dataset:	Load the sample index from 097db078c0fd21058b81a805201e1740-GPTDataset-train-sample_index.npy
    [default0]:INFO:megatron.core.datasets.gpt_dataset:	Load the shuffle index from 097db078c0fd21058b81a805201e1740-GPTDataset-train-shuffle_index.npy
    [default0]:INFO:megatron.core.datasets.gpt_dataset:> total number of samples: 2240642
    [default0]:INFO:megatron.core.datasets.indexed_dataset:Load the _IndexReader from ../OpenSeek-Pretrain-100B/cot_synthesis2_arxiv-mid/2_text_document.idx
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the sequence lengths
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the sequence pointers
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the document indices
    [default0]:INFO:megatron.core.datasets.indexed_dataset:> total number of sequences: 500000
    [default0]:INFO:megatron.core.datasets.indexed_dataset:> total number of documents: 500000
    [default0]:INFO:megatron.core.datasets.gpt_dataset:Load the GPTDataset train indices
    [default0]:INFO:megatron.core.datasets.gpt_dataset:	Load the document index from de2315b46c9e398c3e52816dce050a9a-GPTDataset-train-document_index.npy
    [default0]:INFO:megatron.core.datasets.gpt_dataset:	Load the sample index from de2315b46c9e398c3e52816dce050a9a-GPTDataset-train-sample_index.npy
    [default0]:INFO:megatron.core.datasets.gpt_dataset:	Load the shuffle index from de2315b46c9e398c3e52816dce050a9a-GPTDataset-train-shuffle_index.npy
    [default0]:INFO:megatron.core.datasets.gpt_dataset:> total number of samples: 2550407
    [default0]:INFO:megatron.core.datasets.indexed_dataset:Load the _IndexReader from ../OpenSeek-Pretrain-100B/cot_synthesis2_code-high/4_text_document.idx
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the sequence lengths
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the sequence pointers
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the document indices
    [default0]:INFO:megatron.core.datasets.indexed_dataset:> total number of sequences: 500000
    [default0]:INFO:megatron.core.datasets.indexed_dataset:> total number of documents: 500000
    [default0]:INFO:megatron.core.datasets.gpt_dataset:Load the GPTDataset train indices
    [default0]:INFO:megatron.core.datasets.gpt_dataset:	Load the document index from c33b0b590a676cb05db1bdb9a99c5b2a-GPTDataset-train-document_index.npy
    [default0]:INFO:megatron.core.datasets.gpt_dataset:	Load the sample index from c33b0b590a676cb05db1bdb9a99c5b2a-GPTDataset-train-sample_index.npy
    [default0]:INFO:megatron.core.datasets.gpt_dataset:	Load the shuffle index from c33b0b590a676cb05db1bdb9a99c5b2a-GPTDataset-train-shuffle_index.npy
    [default0]:INFO:megatron.core.datasets.gpt_dataset:> total number of samples: 115665
    [default0]:INFO:megatron.core.datasets.indexed_dataset:Load the _IndexReader from ../OpenSeek-Pretrain-100B/cot_synthesis2_code-low/6_text_document.idx
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the sequence lengths
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the sequence pointers
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the document indices
    [default0]:INFO:megatron.core.datasets.indexed_dataset:> total number of sequences: 500000
    [default0]:INFO:megatron.core.datasets.indexed_dataset:> total number of documents: 500000
    [default0]:INFO:megatron.core.datasets.gpt_dataset:Load the GPTDataset train indices
    [default0]:INFO:megatron.core.datasets.gpt_dataset:	Load the document index from 0a144fd0007b87526df5be9acef24430-GPTDataset-train-document_index.npy
    [default0]:INFO:megatron.core.datasets.gpt_dataset:	Load the sample index from 0a144fd0007b87526df5be9acef24430-GPTDataset-train-sample_index.npy
    [default0]:INFO:megatron.core.datasets.gpt_dataset:	Load the shuffle index from 0a144fd0007b87526df5be9acef24430-GPTDataset-train-shuffle_index.npy
    [default0]:INFO:megatron.core.datasets.gpt_dataset:> total number of samples: 172518
    [default0]:INFO:megatron.core.datasets.indexed_dataset:Load the _IndexReader from ../OpenSeek-Pretrain-100B/cot_synthesis2_code-mid/23_text_document.idx
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the sequence lengths
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the sequence pointers
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the document indices
    [default0]:INFO:megatron.core.datasets.indexed_dataset:> total number of sequences: 500000
    [default0]:INFO:megatron.core.datasets.indexed_dataset:> total number of documents: 500000
    [default0]:INFO:megatron.core.datasets.gpt_dataset:Load the GPTDataset train indices
    [default0]:INFO:megatron.core.datasets.gpt_dataset:	Load the document index from f79878c937ae3ea237f9aecc49b2185f-GPTDataset-train-document_index.npy
    [default0]:INFO:megatron.core.datasets.gpt_dataset:	Load the sample index from f79878c937ae3ea237f9aecc49b2185f-GPTDataset-train-sample_index.npy
    [default0]:INFO:megatron.core.datasets.gpt_dataset:	Load the shuffle index from f79878c937ae3ea237f9aecc49b2185f-GPTDataset-train-shuffle_index.npy
    [default0]:INFO:megatron.core.datasets.gpt_dataset:> total number of samples: 226180
    [default0]:INFO:megatron.core.datasets.indexed_dataset:Load the _IndexReader from ../OpenSeek-Pretrain-100B/cot_synthesis2_math-high/12_text_document.idx
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the sequence lengths
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the sequence pointers
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the document indices
    [default0]:INFO:megatron.core.datasets.indexed_dataset:> total number of sequences: 500000
    [default0]:INFO:megatron.core.datasets.indexed_dataset:> total number of documents: 500000
    [default0]:INFO:megatron.core.datasets.gpt_dataset:Load the GPTDataset train indices
    [default0]:INFO:megatron.core.datasets.gpt_dataset:	Load the document index from 5ece26b572b5dca7ffe7945c6c9226c7-GPTDataset-train-document_index.npy
    [default0]:INFO:megatron.core.datasets.gpt_dataset:	Load the sample index from 5ece26b572b5dca7ffe7945c6c9226c7-GPTDataset-train-sample_index.npy
    [default0]:INFO:megatron.core.datasets.gpt_dataset:	Load the shuffle index from 5ece26b572b5dca7ffe7945c6c9226c7-GPTDataset-train-shuffle_index.npy
    [default0]:INFO:megatron.core.datasets.gpt_dataset:> total number of samples: 330448
    [default0]:INFO:megatron.core.datasets.indexed_dataset:Load the _IndexReader from ../OpenSeek-Pretrain-100B/cot_synthesis2_math-low/3_text_document.idx
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the sequence lengths
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the sequence pointers
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the document indices
    [default0]:INFO:megatron.core.datasets.indexed_dataset:> total number of sequences: 500000
    [default0]:INFO:megatron.core.datasets.indexed_dataset:> total number of documents: 500000
    [default0]:INFO:megatron.core.datasets.gpt_dataset:Load the GPTDataset train indices
    [default0]:INFO:megatron.core.datasets.gpt_dataset:	Load the document index from 23783151274dfea0577aea5f996dc236-GPTDataset-train-document_index.npy
    [default0]:INFO:megatron.core.datasets.gpt_dataset:	Load the sample index from 23783151274dfea0577aea5f996dc236-GPTDataset-train-sample_index.npy
    [default0]:INFO:megatron.core.datasets.gpt_dataset:	Load the shuffle index from 23783151274dfea0577aea5f996dc236-GPTDataset-train-shuffle_index.npy
    [default0]:INFO:megatron.core.datasets.gpt_dataset:> total number of samples: 415859
    [default0]:INFO:megatron.core.datasets.indexed_dataset:Load the _IndexReader from ../OpenSeek-Pretrain-100B/cot_synthesis2_math-mid/5_text_document.idx
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the sequence lengths
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the sequence pointers
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the document indices
    [default0]:INFO:megatron.core.datasets.indexed_dataset:> total number of sequences: 500000
    [default0]:INFO:megatron.core.datasets.indexed_dataset:> total number of documents: 500000
    [default0]:INFO:megatron.core.datasets.gpt_dataset:Load the GPTDataset train indices
    [default0]:INFO:megatron.core.datasets.gpt_dataset:	Load the document index from 647d18ed6efe43a6a42bddfeabba04f1-GPTDataset-train-document_index.npy
    [default0]:INFO:megatron.core.datasets.gpt_dataset:	Load the sample index from 647d18ed6efe43a6a42bddfeabba04f1-GPTDataset-train-sample_index.npy
    [default0]:INFO:megatron.core.datasets.gpt_dataset:	Load the shuffle index from 647d18ed6efe43a6a42bddfeabba04f1-GPTDataset-train-shuffle_index.npy
    [default0]:INFO:megatron.core.datasets.gpt_dataset:> total number of samples: 88947
    [default0]:INFO:megatron.core.datasets.indexed_dataset:Load the _IndexReader from ../OpenSeek-Pretrain-100B/cot_synthesis2_wiki-high/5_text_document.idx
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the sequence lengths
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the sequence pointers
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the document indices
    [default0]:INFO:megatron.core.datasets.indexed_dataset:> total number of sequences: 500000
    [default0]:INFO:megatron.core.datasets.indexed_dataset:> total number of documents: 500000
    [default0]:INFO:megatron.core.datasets.gpt_dataset:Load the GPTDataset train indices
    [default0]:INFO:megatron.core.datasets.gpt_dataset:	Load the document index from 167169b90da1e60c19514db0924b1e20-GPTDataset-train-document_index.npy
    [default0]:INFO:megatron.core.datasets.gpt_dataset:	Load the sample index from 167169b90da1e60c19514db0924b1e20-GPTDataset-train-sample_index.npy
    [default0]:INFO:megatron.core.datasets.gpt_dataset:	Load the shuffle index from 167169b90da1e60c19514db0924b1e20-GPTDataset-train-shuffle_index.npy
    [default0]:INFO:megatron.core.datasets.gpt_dataset:> total number of samples: 158858
    [default0]:INFO:megatron.core.datasets.indexed_dataset:Load the _IndexReader from ../OpenSeek-Pretrain-100B/cot_synthesis2_wiki-low/5_text_document.idx
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the sequence lengths
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the sequence pointers
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the document indices
    [default0]:INFO:megatron.core.datasets.indexed_dataset:> total number of sequences: 500000
    [default0]:INFO:megatron.core.datasets.indexed_dataset:> total number of documents: 500000
    [default0]:INFO:megatron.core.datasets.gpt_dataset:Load the GPTDataset train indices
    [default0]:INFO:megatron.core.datasets.gpt_dataset:	Load the document index from ce096f4bf7b8552c29fea089860705c4-GPTDataset-train-document_index.npy
    [default0]:INFO:megatron.core.datasets.gpt_dataset:	Load the sample index from ce096f4bf7b8552c29fea089860705c4-GPTDataset-train-sample_index.npy
    [default0]:INFO:megatron.core.datasets.gpt_dataset:	Load the shuffle index from ce096f4bf7b8552c29fea089860705c4-GPTDataset-train-shuffle_index.npy
    [default0]:INFO:megatron.core.datasets.gpt_dataset:> total number of samples: 150385
    [default0]:INFO:megatron.core.datasets.indexed_dataset:Load the _IndexReader from ../OpenSeek-Pretrain-100B/cot_synthesis2_wiki-mid/4_text_document.idx
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the sequence lengths
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the sequence pointers
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the document indices
    [default0]:INFO:megatron.core.datasets.indexed_dataset:> total number of sequences: 500000
    [default0]:INFO:megatron.core.datasets.indexed_dataset:> total number of documents: 500000
    [default0]:INFO:megatron.core.datasets.gpt_dataset:Load the GPTDataset train indices
    [default0]:INFO:megatron.core.datasets.gpt_dataset:	Load the document index from b4c94e45cf9ca3107266571ad6b407e2-GPTDataset-train-document_index.npy
    [default0]:INFO:megatron.core.datasets.gpt_dataset:	Load the sample index from b4c94e45cf9ca3107266571ad6b407e2-GPTDataset-train-sample_index.npy
    [default0]:INFO:megatron.core.datasets.gpt_dataset:	Load the shuffle index from b4c94e45cf9ca3107266571ad6b407e2-GPTDataset-train-shuffle_index.npy
    [default0]:INFO:megatron.core.datasets.gpt_dataset:> total number of samples: 198961
    [default0]:INFO:megatron.core.datasets.indexed_dataset:Load the _IndexReader from ../OpenSeek-Pretrain-100B/cot_synthesis_CC-high/74_text_document.idx
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the sequence lengths
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the sequence pointers
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the document indices
    [default0]:INFO:megatron.core.datasets.indexed_dataset:> total number of sequences: 500000
    [default0]:INFO:megatron.core.datasets.indexed_dataset:> total number of documents: 500000
    [default0]:INFO:megatron.core.datasets.gpt_dataset:Load the GPTDataset train indices
    [default0]:INFO:megatron.core.datasets.gpt_dataset:	Load the document index from a8fa8dd8d69d1cd8935b3457875c39c4-GPTDataset-train-document_index.npy
    [default0]:INFO:megatron.core.datasets.gpt_dataset:	Load the sample index from a8fa8dd8d69d1cd8935b3457875c39c4-GPTDataset-train-sample_index.npy
    [default0]:INFO:megatron.core.datasets.gpt_dataset:	Load the shuffle index from a8fa8dd8d69d1cd8935b3457875c39c4-GPTDataset-train-shuffle_index.npy
    [default0]:INFO:megatron.core.datasets.gpt_dataset:> total number of samples: 55987
    [default0]:INFO:megatron.core.datasets.indexed_dataset:Load the _IndexReader from ../OpenSeek-Pretrain-100B/cot_synthesis_CC-low/54_text_document.idx
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the sequence lengths
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the sequence pointers
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the document indices
    [default0]:INFO:megatron.core.datasets.indexed_dataset:> total number of sequences: 500000
    [default0]:INFO:megatron.core.datasets.indexed_dataset:> total number of documents: 500000
    [default0]:INFO:megatron.core.datasets.gpt_dataset:Load the GPTDataset train indices
    [default0]:INFO:megatron.core.datasets.gpt_dataset:	Load the document index from f461ed8995dac3b75450c24fc74ccf40-GPTDataset-train-document_index.npy
    [default0]:INFO:megatron.core.datasets.gpt_dataset:	Load the sample index from f461ed8995dac3b75450c24fc74ccf40-GPTDataset-train-sample_index.npy
    [default0]:INFO:megatron.core.datasets.gpt_dataset:	Load the shuffle index from f461ed8995dac3b75450c24fc74ccf40-GPTDataset-train-shuffle_index.npy
    [default0]:INFO:megatron.core.datasets.gpt_dataset:> total number of samples: 45202
    [default0]:INFO:megatron.core.datasets.indexed_dataset:Load the _IndexReader from ../OpenSeek-Pretrain-100B/cot_synthesis_CC-mid/275_text_document.idx
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the sequence lengths
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the sequence pointers
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the document indices
    [default0]:INFO:megatron.core.datasets.indexed_dataset:> total number of sequences: 500000
    [default0]:INFO:megatron.core.datasets.indexed_dataset:> total number of documents: 500000
    [default0]:INFO:megatron.core.datasets.gpt_dataset:Load the GPTDataset train indices
    [default0]:INFO:megatron.core.datasets.gpt_dataset:	Load the document index from 9a44f36019d65846dd82c700062bc27c-GPTDataset-train-document_index.npy
    [default0]:INFO:megatron.core.datasets.gpt_dataset:	Load the sample index from 9a44f36019d65846dd82c700062bc27c-GPTDataset-train-sample_index.npy
    [default0]:INFO:megatron.core.datasets.gpt_dataset:	Load the shuffle index from 9a44f36019d65846dd82c700062bc27c-GPTDataset-train-shuffle_index.npy
    [default0]:INFO:megatron.core.datasets.gpt_dataset:> total number of samples: 51384
    [default0]:INFO:megatron.core.datasets.indexed_dataset:Load the _IndexReader from ../OpenSeek-Pretrain-100B/cot_synthesis_OpenSource-high/4_text_document.idx
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the sequence lengths
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the sequence pointers
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the document indices
    [default0]:INFO:megatron.core.datasets.indexed_dataset:> total number of sequences: 500000
    [default0]:INFO:megatron.core.datasets.indexed_dataset:> total number of documents: 500000
    [default0]:INFO:megatron.core.datasets.gpt_dataset:Load the GPTDataset train indices
    [default0]:INFO:megatron.core.datasets.gpt_dataset:	Load the document index from 05388d8ae2767f997f3a327b6162fa0c-GPTDataset-train-document_index.npy
    [default0]:INFO:megatron.core.datasets.gpt_dataset:	Load the sample index from 05388d8ae2767f997f3a327b6162fa0c-GPTDataset-train-sample_index.npy
    [default0]:INFO:megatron.core.datasets.gpt_dataset:	Load the shuffle index from 05388d8ae2767f997f3a327b6162fa0c-GPTDataset-train-shuffle_index.npy
    [default0]:INFO:megatron.core.datasets.gpt_dataset:> total number of samples: 102662
    [default0]:INFO:megatron.core.datasets.indexed_dataset:Load the _IndexReader from ../OpenSeek-Pretrain-100B/cot_synthesis_OpenSource-low/2_text_document.idx
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the sequence lengths
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the sequence pointers
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the document indices
    [default0]:INFO:megatron.core.datasets.indexed_dataset:> total number of sequences: 500000
    [default0]:INFO:megatron.core.datasets.indexed_dataset:> total number of documents: 500000
    [default0]:INFO:megatron.core.datasets.gpt_dataset:Load the GPTDataset train indices
    [default0]:INFO:megatron.core.datasets.gpt_dataset:	Load the document index from 99d51016a10dfaaaf8aa11f19ca47eec-GPTDataset-train-document_index.npy
    [default0]:INFO:megatron.core.datasets.gpt_dataset:	Load the sample index from 99d51016a10dfaaaf8aa11f19ca47eec-GPTDataset-train-sample_index.npy
    [default0]:INFO:megatron.core.datasets.gpt_dataset:	Load the shuffle index from 99d51016a10dfaaaf8aa11f19ca47eec-GPTDataset-train-shuffle_index.npy
    [default0]:INFO:megatron.core.datasets.gpt_dataset:> total number of samples: 41745
    [default0]:INFO:megatron.core.datasets.indexed_dataset:Load the _IndexReader from ../OpenSeek-Pretrain-100B/cot_synthesis_OpenSource-mid/6_text_document.idx
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the sequence lengths
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the sequence pointers
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the document indices
    [default0]:INFO:megatron.core.datasets.indexed_dataset:> total number of sequences: 500000
    [default0]:INFO:megatron.core.datasets.indexed_dataset:> total number of documents: 500000
    [default0]:INFO:megatron.core.datasets.gpt_dataset:Load the GPTDataset train indices
    [default0]:INFO:megatron.core.datasets.gpt_dataset:	Load the document index from 8376d733799a069256ae09b1bcd6dcab-GPTDataset-train-document_index.npy
    [default0]:INFO:megatron.core.datasets.gpt_dataset:	Load the sample index from 8376d733799a069256ae09b1bcd6dcab-GPTDataset-train-sample_index.npy
    [default0]:INFO:megatron.core.datasets.gpt_dataset:	Load the shuffle index from 8376d733799a069256ae09b1bcd6dcab-GPTDataset-train-shuffle_index.npy
    [default0]:INFO:megatron.core.datasets.gpt_dataset:> total number of samples: 322718
    [default0]:INFO:megatron.core.datasets.indexed_dataset:Load the _IndexReader from ../OpenSeek-Pretrain-100B/cot_synthesis_arxiv-high/2_text_document.idx
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the sequence lengths
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the sequence pointers
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the document indices
    [default0]:INFO:megatron.core.datasets.indexed_dataset:> total number of sequences: 256516
    [default0]:INFO:megatron.core.datasets.indexed_dataset:> total number of documents: 256516
    [default0]:INFO:megatron.core.datasets.gpt_dataset:Load the GPTDataset train indices
    [default0]:INFO:megatron.core.datasets.gpt_dataset:	Load the document index from 3a33c9308a021d4054fa13c21cf4952f-GPTDataset-train-document_index.npy
    [default0]:INFO:megatron.core.datasets.gpt_dataset:	Load the sample index from 3a33c9308a021d4054fa13c21cf4952f-GPTDataset-train-sample_index.npy
    [default0]:INFO:megatron.core.datasets.gpt_dataset:	Load the shuffle index from 3a33c9308a021d4054fa13c21cf4952f-GPTDataset-train-shuffle_index.npy
    [default0]:INFO:megatron.core.datasets.gpt_dataset:> total number of samples: 1428961
    [default0]:INFO:megatron.core.datasets.indexed_dataset:Load the _IndexReader from ../OpenSeek-Pretrain-100B/cot_synthesis_arxiv-low/1_text_document.idx
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the sequence lengths
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the sequence pointers
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the document indices
    [default0]:INFO:megatron.core.datasets.indexed_dataset:> total number of sequences: 500000
    [default0]:INFO:megatron.core.datasets.indexed_dataset:> total number of documents: 500000
    [default0]:INFO:megatron.core.datasets.gpt_dataset:Load the GPTDataset train indices
    [default0]:INFO:megatron.core.datasets.gpt_dataset:	Load the document index from 9efb6b5b415537b5bc066d9cc2c620b3-GPTDataset-train-document_index.npy
    [default0]:INFO:megatron.core.datasets.gpt_dataset:	Load the sample index from 9efb6b5b415537b5bc066d9cc2c620b3-GPTDataset-train-sample_index.npy
    [default0]:INFO:megatron.core.datasets.gpt_dataset:	Load the shuffle index from 9efb6b5b415537b5bc066d9cc2c620b3-GPTDataset-train-shuffle_index.npy
    [default0]:INFO:megatron.core.datasets.gpt_dataset:> total number of samples: 1884499
    [default0]:INFO:megatron.core.datasets.indexed_dataset:Load the _IndexReader from ../OpenSeek-Pretrain-100B/cot_synthesis_arxiv-mid/1_text_document.idx
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the sequence lengths
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the sequence pointers
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the document indices
    [default0]:INFO:megatron.core.datasets.indexed_dataset:> total number of sequences: 500000
    [default0]:INFO:megatron.core.datasets.indexed_dataset:> total number of documents: 500000
    [default0]:INFO:megatron.core.datasets.gpt_dataset:Load the GPTDataset train indices
    [default0]:INFO:megatron.core.datasets.gpt_dataset:	Load the document index from 13f6793fe99fe92439722cddaf40727e-GPTDataset-train-document_index.npy
    [default0]:INFO:megatron.core.datasets.gpt_dataset:	Load the sample index from 13f6793fe99fe92439722cddaf40727e-GPTDataset-train-sample_index.npy
    [default0]:INFO:megatron.core.datasets.gpt_dataset:	Load the shuffle index from 13f6793fe99fe92439722cddaf40727e-GPTDataset-train-shuffle_index.npy
    [default0]:INFO:megatron.core.datasets.gpt_dataset:> total number of samples: 2248083
    [default0]:INFO:megatron.core.datasets.indexed_dataset:Load the _IndexReader from ../OpenSeek-Pretrain-100B/cot_synthesis_code-high/13_text_document.idx
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the sequence lengths
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the sequence pointers
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the document indices
    [default0]:INFO:megatron.core.datasets.indexed_dataset:> total number of sequences: 500000
    [default0]:INFO:megatron.core.datasets.indexed_dataset:> total number of documents: 500000
    [default0]:INFO:megatron.core.datasets.gpt_dataset:Load the GPTDataset train indices
    [default0]:INFO:megatron.core.datasets.gpt_dataset:	Load the document index from 55f9b728813d421a6eda80c4e4386d81-GPTDataset-train-document_index.npy
    [default0]:INFO:megatron.core.datasets.gpt_dataset:	Load the sample index from 55f9b728813d421a6eda80c4e4386d81-GPTDataset-train-sample_index.npy
    [default0]:INFO:megatron.core.datasets.gpt_dataset:	Load the shuffle index from 55f9b728813d421a6eda80c4e4386d81-GPTDataset-train-shuffle_index.npy
    [default0]:INFO:megatron.core.datasets.gpt_dataset:> total number of samples: 192791
    [default0]:INFO:megatron.core.datasets.indexed_dataset:Load the _IndexReader from ../OpenSeek-Pretrain-100B/cot_synthesis_code-low/9_text_document.idx
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the sequence lengths
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the sequence pointers
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the document indices
    [default0]:INFO:megatron.core.datasets.indexed_dataset:> total number of sequences: 500000
    [default0]:INFO:megatron.core.datasets.indexed_dataset:> total number of documents: 500000
    [default0]:INFO:megatron.core.datasets.gpt_dataset:Load the GPTDataset train indices
    [default0]:INFO:megatron.core.datasets.gpt_dataset:	Load the document index from af2b2f630340e8986a7fa73fde612311-GPTDataset-train-document_index.npy
    [default0]:INFO:megatron.core.datasets.gpt_dataset:	Load the sample index from af2b2f630340e8986a7fa73fde612311-GPTDataset-train-sample_index.npy
    [default0]:INFO:megatron.core.datasets.gpt_dataset:	Load the shuffle index from af2b2f630340e8986a7fa73fde612311-GPTDataset-train-shuffle_index.npy
    [default0]:INFO:megatron.core.datasets.gpt_dataset:> total number of samples: 101935
    [default0]:INFO:megatron.core.datasets.indexed_dataset:Load the _IndexReader from ../OpenSeek-Pretrain-100B/cot_synthesis_code-mid/6_text_document.idx
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the sequence lengths
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the sequence pointers
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the document indices
    [default0]:INFO:megatron.core.datasets.indexed_dataset:> total number of sequences: 500000
    [default0]:INFO:megatron.core.datasets.indexed_dataset:> total number of documents: 500000
    [default0]:INFO:megatron.core.datasets.gpt_dataset:Load the GPTDataset train indices
    [default0]:INFO:megatron.core.datasets.gpt_dataset:	Load the document index from da5b6dc21e950c12d3ac425bfd1f091f-GPTDataset-train-document_index.npy
    [default0]:INFO:megatron.core.datasets.gpt_dataset:	Load the sample index from da5b6dc21e950c12d3ac425bfd1f091f-GPTDataset-train-sample_index.npy
    [default0]:INFO:megatron.core.datasets.gpt_dataset:	Load the shuffle index from da5b6dc21e950c12d3ac425bfd1f091f-GPTDataset-train-shuffle_index.npy
    [default0]:INFO:megatron.core.datasets.gpt_dataset:> total number of samples: 48202
    [default0]:INFO:megatron.core.datasets.indexed_dataset:Load the _IndexReader from ../OpenSeek-Pretrain-100B/cot_synthesis_math-high/11_text_document.idx
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the sequence lengths
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the sequence pointers
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the document indices
    [default0]:INFO:megatron.core.datasets.indexed_dataset:> total number of sequences: 500000
    [default0]:INFO:megatron.core.datasets.indexed_dataset:> total number of documents: 500000
    [default0]:INFO:megatron.core.datasets.gpt_dataset:Load the GPTDataset train indices
    [default0]:INFO:megatron.core.datasets.gpt_dataset:	Load the document index from ad21fc502df49664e479a402ec05d322-GPTDataset-train-document_index.npy
    [default0]:INFO:megatron.core.datasets.gpt_dataset:	Load the sample index from ad21fc502df49664e479a402ec05d322-GPTDataset-train-sample_index.npy
    [default0]:INFO:megatron.core.datasets.gpt_dataset:	Load the shuffle index from ad21fc502df49664e479a402ec05d322-GPTDataset-train-shuffle_index.npy
    [default0]:INFO:megatron.core.datasets.gpt_dataset:> total number of samples: 127661
    [default0]:INFO:megatron.core.datasets.indexed_dataset:Load the _IndexReader from ../OpenSeek-Pretrain-100B/cot_synthesis_math-low/11_text_document.idx
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the sequence lengths
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the sequence pointers
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the document indices
    [default0]:INFO:megatron.core.datasets.indexed_dataset:> total number of sequences: 500000
    [default0]:INFO:megatron.core.datasets.indexed_dataset:> total number of documents: 500000
    [default0]:INFO:megatron.core.datasets.gpt_dataset:Load the GPTDataset train indices
    [default0]:INFO:megatron.core.datasets.gpt_dataset:	Load the document index from 26ec788a6472d7a4fc2abb1fd801bd83-GPTDataset-train-document_index.npy
    [default0]:INFO:megatron.core.datasets.gpt_dataset:	Load the sample index from 26ec788a6472d7a4fc2abb1fd801bd83-GPTDataset-train-sample_index.npy
    [default0]:INFO:megatron.core.datasets.gpt_dataset:	Load the shuffle index from 26ec788a6472d7a4fc2abb1fd801bd83-GPTDataset-train-shuffle_index.npy
    [default0]:INFO:megatron.core.datasets.gpt_dataset:> total number of samples: 161943
    [default0]:INFO:megatron.core.datasets.indexed_dataset:Load the _IndexReader from ../OpenSeek-Pretrain-100B/cot_synthesis_math-mid/29_text_document.idx
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the sequence lengths
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the sequence pointers
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the document indices
    [default0]:INFO:megatron.core.datasets.indexed_dataset:> total number of sequences: 500000
    [default0]:INFO:megatron.core.datasets.indexed_dataset:> total number of documents: 500000
    [default0]:INFO:megatron.core.datasets.gpt_dataset:Load the GPTDataset train indices
    [default0]:INFO:megatron.core.datasets.gpt_dataset:	Load the document index from a4ff47cc44a1bcd08f765fc5846d53fb-GPTDataset-train-document_index.npy
    [default0]:INFO:megatron.core.datasets.gpt_dataset:	Load the sample index from a4ff47cc44a1bcd08f765fc5846d53fb-GPTDataset-train-sample_index.npy
    [default0]:INFO:megatron.core.datasets.gpt_dataset:	Load the shuffle index from a4ff47cc44a1bcd08f765fc5846d53fb-GPTDataset-train-shuffle_index.npy
    [default0]:INFO:megatron.core.datasets.gpt_dataset:> total number of samples: 161166
    [default0]:INFO:megatron.core.datasets.indexed_dataset:Load the _IndexReader from ../OpenSeek-Pretrain-100B/cot_synthesis_wiki-high/4_text_document.idx
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the sequence lengths
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the sequence pointers
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the document indices
    [default0]:INFO:megatron.core.datasets.indexed_dataset:> total number of sequences: 500000
    [default0]:INFO:megatron.core.datasets.indexed_dataset:> total number of documents: 500000
    [default0]:INFO:megatron.core.datasets.gpt_dataset:Load the GPTDataset train indices
    [default0]:INFO:megatron.core.datasets.gpt_dataset:	Load the document index from 9c88f9a84c38df50a45e7390a09215f0-GPTDataset-train-document_index.npy
    [default0]:INFO:megatron.core.datasets.gpt_dataset:	Load the sample index from 9c88f9a84c38df50a45e7390a09215f0-GPTDataset-train-sample_index.npy
    [default0]:INFO:megatron.core.datasets.gpt_dataset:	Load the shuffle index from 9c88f9a84c38df50a45e7390a09215f0-GPTDataset-train-shuffle_index.npy
    [default0]:INFO:megatron.core.datasets.gpt_dataset:> total number of samples: 100623
    [default0]:INFO:megatron.core.datasets.indexed_dataset:Load the _IndexReader from ../OpenSeek-Pretrain-100B/cot_synthesis_wiki-low/6_text_document.idx
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the sequence lengths
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the sequence pointers
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the document indices
    [default0]:INFO:megatron.core.datasets.indexed_dataset:> total number of sequences: 500000
    [default0]:INFO:megatron.core.datasets.indexed_dataset:> total number of documents: 500000
    [default0]:INFO:megatron.core.datasets.gpt_dataset:Load the GPTDataset train indices
    [default0]:INFO:megatron.core.datasets.gpt_dataset:	Load the document index from f7239aa9de74201060aed61b1cdf0c6b-GPTDataset-train-document_index.npy
    [default0]:INFO:megatron.core.datasets.gpt_dataset:	Load the sample index from f7239aa9de74201060aed61b1cdf0c6b-GPTDataset-train-sample_index.npy
    [default0]:INFO:megatron.core.datasets.gpt_dataset:	Load the shuffle index from f7239aa9de74201060aed61b1cdf0c6b-GPTDataset-train-shuffle_index.npy
    [default0]:INFO:megatron.core.datasets.gpt_dataset:> total number of samples: 89674
    [default0]:INFO:megatron.core.datasets.indexed_dataset:Load the _IndexReader from ../OpenSeek-Pretrain-100B/cot_synthesis_wiki-mid/3_text_document.idx
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the sequence lengths
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the sequence pointers
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the document indices
    [default0]:INFO:megatron.core.datasets.indexed_dataset:> total number of sequences: 500000
    [default0]:INFO:megatron.core.datasets.indexed_dataset:> total number of documents: 500000
    [default0]:INFO:megatron.core.datasets.gpt_dataset:Load the GPTDataset train indices
    [default0]:INFO:megatron.core.datasets.gpt_dataset:	Load the document index from 75221ad5995f5e1b284e1451bc21e7ea-GPTDataset-train-document_index.npy
    [default0]:INFO:megatron.core.datasets.gpt_dataset:	Load the sample index from 75221ad5995f5e1b284e1451bc21e7ea-GPTDataset-train-sample_index.npy
    [default0]:INFO:megatron.core.datasets.gpt_dataset:	Load the shuffle index from 75221ad5995f5e1b284e1451bc21e7ea-GPTDataset-train-shuffle_index.npy
    [default0]:INFO:megatron.core.datasets.gpt_dataset:> total number of samples: 145122
    [default0]:INFO:megatron.core.datasets.indexed_dataset:Load the _IndexReader from ../OpenSeek-Pretrain-100B/math-high/part_04_text_document.idx
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the sequence lengths
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the sequence pointers
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the document indices
    [default0]:INFO:megatron.core.datasets.indexed_dataset:> total number of sequences: 1000000
    [default0]:INFO:megatron.core.datasets.indexed_dataset:> total number of documents: 1000000
    [default0]:INFO:megatron.core.datasets.gpt_dataset:Load the GPTDataset train indices
    [default0]:INFO:megatron.core.datasets.gpt_dataset:	Load the document index from ef51ee55d8f6603934f08ce635314fad-GPTDataset-train-document_index.npy
    [default0]:INFO:megatron.core.datasets.gpt_dataset:	Load the sample index from ef51ee55d8f6603934f08ce635314fad-GPTDataset-train-sample_index.npy
    [default0]:INFO:megatron.core.datasets.gpt_dataset:	Load the shuffle index from ef51ee55d8f6603934f08ce635314fad-GPTDataset-train-shuffle_index.npy
    [default0]:INFO:megatron.core.datasets.gpt_dataset:> total number of samples: 456998
    [default0]:INFO:megatron.core.datasets.indexed_dataset:Load the _IndexReader from ../OpenSeek-Pretrain-100B/math-low/part_10_text_document.idx
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the sequence lengths
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the sequence pointers
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the document indices
    [default0]:INFO:megatron.core.datasets.indexed_dataset:> total number of sequences: 1000000
    [default0]:INFO:megatron.core.datasets.indexed_dataset:> total number of documents: 1000000
    [default0]:INFO:megatron.core.datasets.gpt_dataset:Load the GPTDataset train indices
    [default0]:INFO:megatron.core.datasets.gpt_dataset:	Load the document index from 183681f26924ef7e06ab027733f0490e-GPTDataset-train-document_index.npy
    [default0]:INFO:megatron.core.datasets.gpt_dataset:	Load the sample index from 183681f26924ef7e06ab027733f0490e-GPTDataset-train-sample_index.npy
    [default0]:INFO:megatron.core.datasets.gpt_dataset:	Load the shuffle index from 183681f26924ef7e06ab027733f0490e-GPTDataset-train-shuffle_index.npy
    [default0]:INFO:megatron.core.datasets.gpt_dataset:> total number of samples: 426167
    [default0]:INFO:megatron.core.datasets.indexed_dataset:Load the _IndexReader from ../OpenSeek-Pretrain-100B/math-mid/part_07_text_document.idx
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the sequence lengths
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the sequence pointers
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the document indices
    [default0]:INFO:megatron.core.datasets.indexed_dataset:> total number of sequences: 1000000
    [default0]:INFO:megatron.core.datasets.indexed_dataset:> total number of documents: 1000000
    [default0]:INFO:megatron.core.datasets.gpt_dataset:Load the GPTDataset train indices
    [default0]:INFO:megatron.core.datasets.gpt_dataset:	Load the document index from 2c02a9a76de03db0d4cbd2b78a34e2e8-GPTDataset-train-document_index.npy
    [default0]:INFO:megatron.core.datasets.gpt_dataset:	Load the sample index from 2c02a9a76de03db0d4cbd2b78a34e2e8-GPTDataset-train-sample_index.npy
    [default0]:INFO:megatron.core.datasets.gpt_dataset:	Load the shuffle index from 2c02a9a76de03db0d4cbd2b78a34e2e8-GPTDataset-train-shuffle_index.npy
    [default0]:INFO:megatron.core.datasets.gpt_dataset:> total number of samples: 410354
    [default0]:INFO:megatron.core.datasets.indexed_dataset:Load the _IndexReader from ../OpenSeek-Pretrain-100B/pes2o-full-train/train-0041-of-0136_text_document.idx
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the sequence lengths
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the sequence pointers
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the document indices
    [default0]:INFO:megatron.core.datasets.indexed_dataset:> total number of sequences: 111715
    [default0]:INFO:megatron.core.datasets.indexed_dataset:> total number of documents: 111715
    [default0]:INFO:megatron.core.datasets.gpt_dataset:Load the GPTDataset train indices
    [default0]:INFO:megatron.core.datasets.gpt_dataset:	Load the document index from 4bedbf63df1ea1c142930c7b58689315-GPTDataset-train-document_index.npy
    [default0]:INFO:megatron.core.datasets.gpt_dataset:	Load the sample index from 4bedbf63df1ea1c142930c7b58689315-GPTDataset-train-sample_index.npy
    [default0]:INFO:megatron.core.datasets.gpt_dataset:	Load the shuffle index from 4bedbf63df1ea1c142930c7b58689315-GPTDataset-train-shuffle_index.npy
    [default0]:INFO:megatron.core.datasets.gpt_dataset:> total number of samples: 172833
    [default0]:INFO:megatron.core.datasets.indexed_dataset:Load the _IndexReader from ../OpenSeek-Pretrain-100B/pes2o-full-train/train-0125-of-0136_text_document.idx
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the sequence lengths
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the sequence pointers
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the document indices
    [default0]:INFO:megatron.core.datasets.indexed_dataset:> total number of sequences: 120151
    [default0]:INFO:megatron.core.datasets.indexed_dataset:> total number of documents: 120151
    [default0]:INFO:megatron.core.datasets.gpt_dataset:Load the GPTDataset train indices
    [default0]:INFO:megatron.core.datasets.gpt_dataset:	Load the document index from 8c143921153f51208327e9e6c80f3996-GPTDataset-train-document_index.npy
    [default0]:INFO:megatron.core.datasets.gpt_dataset:	Load the sample index from 8c143921153f51208327e9e6c80f3996-GPTDataset-train-sample_index.npy
    [default0]:INFO:megatron.core.datasets.gpt_dataset:	Load the shuffle index from 8c143921153f51208327e9e6c80f3996-GPTDataset-train-shuffle_index.npy
    [default0]:INFO:megatron.core.datasets.gpt_dataset:> total number of samples: 185836
    [default0]:INFO:megatron.core.datasets.indexed_dataset:Load the _IndexReader from ../OpenSeek-Pretrain-100B/pes2o-full-val/valid-0034-of-0060_text_document.idx
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the sequence lengths
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the sequence pointers
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the document indices
    [default0]:INFO:megatron.core.datasets.indexed_dataset:> total number of sequences: 2066
    [default0]:INFO:megatron.core.datasets.indexed_dataset:> total number of documents: 2066
    [default0]:INFO:megatron.core.datasets.gpt_dataset:Load the GPTDataset train indices
    [default0]:INFO:megatron.core.datasets.gpt_dataset:	Load the document index from 17b24baded5fb547c2bdd1eca38d9e8d-GPTDataset-train-document_index.npy
    [default0]:INFO:megatron.core.datasets.gpt_dataset:	Load the sample index from 17b24baded5fb547c2bdd1eca38d9e8d-GPTDataset-train-sample_index.npy
    [default0]:INFO:megatron.core.datasets.gpt_dataset:	Load the shuffle index from 17b24baded5fb547c2bdd1eca38d9e8d-GPTDataset-train-shuffle_index.npy
    [default0]:INFO:megatron.core.datasets.gpt_dataset:> total number of samples: 3587
    [default0]:INFO:megatron.core.datasets.indexed_dataset:Load the _IndexReader from ../OpenSeek-Pretrain-100B/pes2o/pubmedcentral_3_text_document.idx
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the sequence lengths
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the sequence pointers
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the document indices
    [default0]:INFO:megatron.core.datasets.indexed_dataset:> total number of sequences: 1000000
    [default0]:INFO:megatron.core.datasets.indexed_dataset:> total number of documents: 1000000
    [default0]:INFO:megatron.core.datasets.gpt_dataset:Load the GPTDataset train indices
    [default0]:INFO:megatron.core.datasets.gpt_dataset:	Load the document index from 817bacd90c33f000963995212597faeb-GPTDataset-train-document_index.npy
    [default0]:INFO:megatron.core.datasets.gpt_dataset:	Load the sample index from 817bacd90c33f000963995212597faeb-GPTDataset-train-sample_index.npy
    [default0]:INFO:megatron.core.datasets.gpt_dataset:	Load the shuffle index from 817bacd90c33f000963995212597faeb-GPTDataset-train-shuffle_index.npy
    [default0]:INFO:megatron.core.datasets.gpt_dataset:> total number of samples: 1559325
    [default0]:INFO:megatron.core.datasets.indexed_dataset:Load the _IndexReader from ../OpenSeek-Pretrain-100B/stack/018_00000_text_document.idx
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the sequence lengths
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the sequence pointers
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the document indices
    [default0]:INFO:megatron.core.datasets.indexed_dataset:> total number of sequences: 725493
    [default0]:INFO:megatron.core.datasets.indexed_dataset:> total number of documents: 725493
    [default0]:INFO:megatron.core.datasets.gpt_dataset:Load the GPTDataset train indices
    [default0]:INFO:megatron.core.datasets.gpt_dataset:	Load the document index from 70fbfb40decd8fd580ba438c4fe3a326-GPTDataset-train-document_index.npy
    [default0]:INFO:megatron.core.datasets.gpt_dataset:	Load the sample index from 70fbfb40decd8fd580ba438c4fe3a326-GPTDataset-train-sample_index.npy
    [default0]:INFO:megatron.core.datasets.gpt_dataset:	Load the shuffle index from 70fbfb40decd8fd580ba438c4fe3a326-GPTDataset-train-shuffle_index.npy
    [default0]:INFO:megatron.core.datasets.gpt_dataset:> total number of samples: 106399
    [default0]:INFO:megatron.core.datasets.indexed_dataset:Load the _IndexReader from ../OpenSeek-Pretrain-100B/wiki/012_00000_text_document.idx
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the sequence lengths
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the sequence pointers
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the document indices
    [default0]:INFO:megatron.core.datasets.indexed_dataset:> total number of sequences: 952137
    [default0]:INFO:megatron.core.datasets.indexed_dataset:> total number of documents: 952137
    [default0]:INFO:megatron.core.datasets.gpt_dataset:Load the GPTDataset train indices
    [default0]:INFO:megatron.core.datasets.gpt_dataset:	Load the document index from 0f1298d7bcb5257da1d1fed0af30cf27-GPTDataset-train-document_index.npy
    [default0]:INFO:megatron.core.datasets.gpt_dataset:	Load the sample index from 0f1298d7bcb5257da1d1fed0af30cf27-GPTDataset-train-sample_index.npy
    [default0]:INFO:megatron.core.datasets.gpt_dataset:	Load the shuffle index from 0f1298d7bcb5257da1d1fed0af30cf27-GPTDataset-train-shuffle_index.npy
    [default0]:INFO:megatron.core.datasets.gpt_dataset:> total number of samples: 105713
    [default0]:INFO:megatron.core.datasets.indexed_dataset:Load the _IndexReader from ../OpenSeek-Pretrain-100B/zh_cc-high-loss0/part_28_text_document.idx
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the sequence lengths
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the sequence pointers
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the document indices
    [default0]:INFO:megatron.core.datasets.indexed_dataset:> total number of sequences: 1014022
    [default0]:INFO:megatron.core.datasets.indexed_dataset:> total number of documents: 1014022
    [default0]:INFO:megatron.core.datasets.gpt_dataset:Load the GPTDataset train indices
    [default0]:INFO:megatron.core.datasets.gpt_dataset:	Load the document index from 8a41ddce0a4e9446a2a38ecaf18b8319-GPTDataset-train-document_index.npy
    [default0]:INFO:megatron.core.datasets.gpt_dataset:	Load the sample index from 8a41ddce0a4e9446a2a38ecaf18b8319-GPTDataset-train-sample_index.npy
    [default0]:INFO:megatron.core.datasets.gpt_dataset:	Load the shuffle index from 8a41ddce0a4e9446a2a38ecaf18b8319-GPTDataset-train-shuffle_index.npy
    [default0]:INFO:megatron.core.datasets.gpt_dataset:> total number of samples: 457136
    [default0]:INFO:megatron.core.datasets.indexed_dataset:Load the _IndexReader from ../OpenSeek-Pretrain-100B/zh_cc-high-loss1/part_59_text_document.idx
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the sequence lengths
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the sequence pointers
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the document indices
    [default0]:INFO:megatron.core.datasets.indexed_dataset:> total number of sequences: 1024772
    [default0]:INFO:megatron.core.datasets.indexed_dataset:> total number of documents: 1024772
    [default0]:INFO:megatron.core.datasets.gpt_dataset:Load the GPTDataset train indices
    [default0]:INFO:megatron.core.datasets.gpt_dataset:	Load the document index from 90b19f2d251220dc983bbbbee1bbaa16-GPTDataset-train-document_index.npy
    [default0]:INFO:megatron.core.datasets.gpt_dataset:	Load the sample index from 90b19f2d251220dc983bbbbee1bbaa16-GPTDataset-train-sample_index.npy
    [default0]:INFO:megatron.core.datasets.gpt_dataset:	Load the shuffle index from 90b19f2d251220dc983bbbbee1bbaa16-GPTDataset-train-shuffle_index.npy
    [default0]:INFO:megatron.core.datasets.gpt_dataset:> total number of samples: 245948
    [default0]:INFO:megatron.core.datasets.indexed_dataset:Load the _IndexReader from ../OpenSeek-Pretrain-100B/zh_cc-high-loss2/part_16_text_document.idx
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the sequence lengths
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the sequence pointers
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the document indices
    [default0]:INFO:megatron.core.datasets.indexed_dataset:> total number of sequences: 1006257
    [default0]:INFO:megatron.core.datasets.indexed_dataset:> total number of documents: 1006257
    [default0]:INFO:megatron.core.datasets.gpt_dataset:Load the GPTDataset train indices
    [default0]:INFO:megatron.core.datasets.gpt_dataset:	Load the document index from 89ab9dff4a2d78068a6f282a80ae15d6-GPTDataset-train-document_index.npy
    [default0]:INFO:megatron.core.datasets.gpt_dataset:	Load the sample index from 89ab9dff4a2d78068a6f282a80ae15d6-GPTDataset-train-sample_index.npy
    [default0]:INFO:megatron.core.datasets.gpt_dataset:	Load the shuffle index from 89ab9dff4a2d78068a6f282a80ae15d6-GPTDataset-train-shuffle_index.npy
    [default0]:INFO:megatron.core.datasets.gpt_dataset:> total number of samples: 93708
    [default0]:INFO:megatron.core.datasets.indexed_dataset:Load the _IndexReader from ../OpenSeek-Pretrain-100B/zh_cc-medidum-loss0/part_192_text_document.idx
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the sequence lengths
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the sequence pointers
    [default0]:INFO:megatron.core.datasets.indexed_dataset:	Extract the document indices
    [default0]:INFO:megatron.core.datasets.indexed_dataset:> total number of sequences: 1003089
    [default0]:INFO:megatron.core.datasets.indexed_dataset:> total number of documents: 1003089
    [default0]:INFO:megatron.core.datasets.gpt_dataset:Load the GPTDataset train indices
    
  3. If the program is running normally, after about 1-2 minutes you can execute the following command from the OpenSeek root directory:

    grep "iteration.*consumed samples" OpenSeek-Small-v1-Baseline/logs/host_0_localhost.output
    
    
    
  4. If the output resembles the following example, your program has started successfully (a log-parsing sketch follows the sample output below):

    [default0]: [2025-05-27 15:23:07] iteration        1/    24000 | consumed samples:          1024 | elapsed time per iteration (ms): 271607.0 | throughput per GPU (TFLOP/s/GPU): 40.4 | learning rate: 1.500000E-06 | global batch size:  1024 | lm loss: 1.192634E+01 | load_balancing_loss: 1.041994E+00 | loss scale: 1.0 | grad norm: 5.960 | num zeros: 0.0 | params norm: 238.330 | number of skipped iterations:    0 | number of nan iterations:    0 |
    
  • finished creating GPT datasets ...
    [after dataloaders are built] datetime: 2025-06-27 12:23:34
    done with setup ...
    (min, max) time across ranks (ms):
    model-and-optimizer-setup ......................: (320.15, 320.15)
    train/valid/test-data-iterators-setup ..........: (3543.10, 3543.10)
    training ...
    Setting rerun_state_machine.current_iteration to 0...
    [before the start of training step] datetime: 2025-06-27 12:23:34
    [2025-06-27 12:27:53] iteration 1/ 24000 | consumed samples: 1024 | elapsed time per iteration (ms): 258947.6 | throughput per GPU (TFLOP/s/GPU): 42.3 | learning rate: 1.500000E-06 | global batch size: 1024 | lm loss: 1.192634E+01 | load_balancing_loss: 1.041995E+00 | loss scale: 1.0 | grad norm: 5.960 | num zeros: 0.0 | params norm: 238.330 | number of skipped iterations: 0 | number of nan iterations: 0 |
    Number of parameters in transformer block in billions: 1.20
    Number of parameters in embedding layers in billions: 0.19
    Total number of parameters in billions: 1.39
    Number of parameters in most loaded shard in billions: 1.3942
    Theoretical memory footprints: weight and optimizer=23933.59 MB
    [Rank 0] (after 1 iterations) memory (MB) | allocated: 24119.04443359375 | max allocated: 24119.55126953125 | reserved: 26416.0 | max reserved: 26416.0
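
The totals above can be sanity-checked by hand: the padded vocabulary of 151,872 tokens (see the tokenizer log further down) times hidden_size 1280 reproduces the 0.19 B embedding parameters, and roughly 18 bytes per parameter for bf16 weights with an fp32 distributed Adam optimizer (2-byte bf16 weight + 4-byte fp32 gradient + 4+4+4 bytes for the fp32 main parameter and the two Adam moments) reproduces the ~23.9 GB theoretical footprint. A minimal sketch; the byte accounting is an assumption about Megatron's bookkeeping, not taken from its source:

```python
# Back-of-the-envelope check of the parameter/memory numbers above.
padded_vocab = 151_872   # padded vocab size from the tokenizer log below
hidden_size = 1_280      # hidden_size from the arguments dump below
print(f"embedding params: {padded_vocab * hidden_size / 1e9:.2f} B")  # ~0.19 B

total_params = 1.3942e9  # "most loaded shard" reported above
bytes_per_param = 2 + 4 + 4 + 4 + 4   # assumed: weight + grad + main + Adam m/v
print(f"weight+optimizer: {total_params * bytes_per_param / 2**20:.1f} MB")
# ~23933.9 MB, close to the reported 23933.59 MB
```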

    
    
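If you want more than the raw grep from step 3, the iteration lines can be parsed into (iteration, loss, throughput) tuples for quick tabulation or plotting. A minimal sketch; the regex is an assumption based on the field labels in the sample lines above, and the log path matches the grep command:

```python
import re

# Extract (iteration, lm loss, TFLOP/s/GPU) from Megatron-style log lines.
PATTERN = re.compile(
    r"iteration\s+(\d+)/\s*\d+"
    r".*?throughput per GPU \(TFLOP/s/GPU\):\s*([\d.]+)"
    r".*?lm loss:\s*([\dEe.+-]+)"
)

with open("OpenSeek-Small-v1-Baseline/logs/host_0_localhost.output") as f:
    for line in f:
        m = PATTERN.search(line)
        if m:
            it, tflops, loss = int(m[1]), float(m[2]), float(m[3])
            print(f"iter {it:6d}  lm loss {loss:.4f}  {tflops:.1f} TFLOP/s/GPU")
```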

Training log

[2025-06-27 12:23:26,305] [INFO] [real_accelerator.py:254:get_accelerator] Setting ds_accelerator to cuda (auto detect)
using world size: 1, data-parallel size: 1, context-parallel size: 1, hierarchical context-parallel sizes: None, tensor-model-parallel size: 1, encoder-tensor-model-parallel size: 0, pipeline-model-parallel size: 1, encoder-pipeline-model-parallel size: 0
WARNING: overriding default arguments for tokenizer_type:GPT2BPETokenizer with tokenizer_type:QwenTokenizerFS
Number of virtual stages per pipeline stage: None
using torch.bfloat16 for parameters ...
------------------------ arguments ------------------------
  account_for_embedding_in_pipeline_split ......... False
  account_for_loss_in_pipeline_split .............. False
  accumulate_allreduce_grads_in_fp32 .............. True
  adam_beta1 ...................................... 0.9
  adam_beta2 ...................................... 0.95
  adam_eps ........................................ 1e-08
  add_bias_linear ................................. False
  add_position_embedding .......................... False
  add_qkv_bias .................................... False
  adlr_autoresume ................................. False
  adlr_autoresume_interval ........................ 1000
  align_grad_reduce ............................... True
  align_param_gather .............................. False
  app_tag_run_name ................................ None
  app_tag_run_version ............................. 0.0.0
  apply_layernorm_1p .............................. False
  apply_query_key_layer_scaling ................... False
  apply_residual_connection_post_layernorm ........ False
  apply_rope_fusion ............................... False
  apply_sft_dataset_separated_loss_mask_if_existed  False
  async_save ...................................... None
  async_tensor_model_parallel_allreduce ........... True
  attention_backend ............................... AttnBackend.auto
  attention_dropout ............................... 0.0
  attention_softmax_in_fp32 ....................... True
  auto_detect_ckpt_format ......................... False
  auto_skip_spiky_loss ............................ False
  auto_tune ....................................... False
  barrier_with_L1_time ............................ True
  bert_binary_head ................................ True
  bert_embedder_type .............................. megatron
  bert_load ....................................... None
  bf16 ............................................ True
  bias_dropout_fusion ............................. True
  bias_gelu_fusion ................................ False
  bias_swiglu_fusion .............................. True
  biencoder_projection_dim ........................ 0
  biencoder_shared_query_context_model ............ False
  block_data_path ................................. None
  calc_ft_timeouts ................................ False
  calculate_per_token_loss ........................ False
  check_for_large_grads ........................... False
  check_for_nan_in_loss_and_grad .................. True
  check_for_spiky_loss ............................ False
  check_weight_hash_across_dp_replicas_interval ... None
  ckpt_assume_constant_structure .................. False
  ckpt_convert_format ............................. None
  ckpt_convert_save ............................... None
  ckpt_convert_update_legacy_dist_opt_format ...... False
  ckpt_format ..................................... torch
  ckpt_fully_parallel_load ........................ False
  ckpt_fully_parallel_save ........................ True
  ckpt_fully_parallel_save_deprecated ............. False
  ckpt_step ....................................... None
  classes_fraction ................................ 1.0
  clip_grad ....................................... 1.0
  clone_scatter_output_in_embedding ............... True
  config_logger_dir ............................... 
  consumed_train_samples .......................... 0
  consumed_valid_samples .......................... 0
  context_parallel_size ........................... 1
  cp_comm_type .................................... ['p2p']
  create_attention_mask_in_dataloader ............. True
  cross_entropy_fusion_impl ....................... native
  cross_entropy_loss_fusion ....................... False
  cuda_graph_scope ................................ full
  cuda_graph_warmup_steps ......................... 3
  data_args_path .................................. None
  data_cache_path ................................. None
  data_parallel_random_init ....................... False
  data_parallel_sharding_strategy ................. no_shard
  data_parallel_size .............................. 1
  data_path ....................................... ['1.1068', '../OpenSeek-Pretrain-100B/Nemotron-CC-high-actual-actual-high/part_142_text_document', '0.3577', '../OpenSeek-Pretrain-100B/Nemotron-CC-high-actual-actual-low/part_62_text_document', '0.7775', '../OpenSeek-Pretrain-100B/Nemotron-CC-high-actual-actual-mid/part_189_text_document', '0.2859', '../OpenSeek-Pretrain-100B/Nemotron-CC-high-synthetic-distill-high/part_76_text_document', '0.1672', '../OpenSeek-Pretrain-100B/Nemotron-CC-high-synthetic-distill-low/part_124_text_document', '0.2339', '../OpenSeek-Pretrain-100B/Nemotron-CC-high-synthetic-distill-mid/part_29_text_document', '0.5397', '../OpenSeek-Pretrain-100B/Nemotron-CC-high-synthetic-diverse_qa_pairs-high/part_244_text_document', '0.4064', '../OpenSeek-Pretrain-100B/Nemotron-CC-high-synthetic-diverse_qa_pairs-low/part_150_text_document', '0.5005', '../OpenSeek-Pretrain-100B/Nemotron-CC-high-synthetic-diverse_qa_pairs-mid/part_444_text_document', '0.4616', '../OpenSeek-Pretrain-100B/Nemotron-CC-high-synthetic-extract_knowledge-high/part_498_text_document', '0.067', '../OpenSeek-Pretrain-100B/Nemotron-CC-high-synthetic-extract_knowledge-low/part_10_text_document', '0.3429', '../OpenSeek-Pretrain-100B/Nemotron-CC-high-synthetic-extract_knowledge-mid/part_144_text_document', '0.261', '../OpenSeek-Pretrain-100B/Nemotron-CC-high-synthetic-knowledge_list-high/part_86_text_document', '0.1824', '../OpenSeek-Pretrain-100B/Nemotron-CC-high-synthetic-knowledge_list-low/part_133_text_document', '0.2313', '../OpenSeek-Pretrain-100B/Nemotron-CC-high-synthetic-knowledge_list-mid/part_139_text_document', '0.8237', '../OpenSeek-Pretrain-100B/Nemotron-CC-high-synthetic-wrap_medium-high/part_47_text_document', '0.2866', '../OpenSeek-Pretrain-100B/Nemotron-CC-high-synthetic-wrap_medium-low/part_11_text_document', '0.667', '../OpenSeek-Pretrain-100B/Nemotron-CC-high-synthetic-wrap_medium-mid/part_97_text_document', '0.4657', '../OpenSeek-Pretrain-100B/Nemotron-CC-low-synthetic-wrap_medium-high/part_43_text_document', '0.2005', '../OpenSeek-Pretrain-100B/Nemotron-CC-low-synthetic-wrap_medium-low/part_10_text_document', '0.4317', '../OpenSeek-Pretrain-100B/Nemotron-CC-low-synthetic-wrap_medium-mid/part_164_text_document', '1.1397', '../OpenSeek-Pretrain-100B/Nemotron-CC-medium-actual-actual-high/part_92_text_document', '0.6782', '../OpenSeek-Pretrain-100B/Nemotron-CC-medium-actual-actual-low/part_113_text_document', '0.9175', '../OpenSeek-Pretrain-100B/Nemotron-CC-medium-actual-actual-mid/part_563_text_document', '0.6414', '../OpenSeek-Pretrain-100B/arxiv/007_00000_text_document', '0.4696', '../OpenSeek-Pretrain-100B/books/016_00007_text_document', '1.0102', '../OpenSeek-Pretrain-100B/code-high/part_13_text_document', '1.1403', '../OpenSeek-Pretrain-100B/code-low/part_36_text_document', '0.9674', '../OpenSeek-Pretrain-100B/code-mid/part_37_text_document', '0.3755', '../OpenSeek-Pretrain-100B/cot_synthesis2_CC-high/23_text_document', '0.0499', '../OpenSeek-Pretrain-100B/cot_synthesis2_CC-low/51_text_document', '0.3608', '../OpenSeek-Pretrain-100B/cot_synthesis2_CC-mid/118_text_document', '0.3623', '../OpenSeek-Pretrain-100B/cot_synthesis2_CC-mid/176_text_document', '0.3704', '../OpenSeek-Pretrain-100B/cot_synthesis2_CC-mid/256_text_document', '0.3733', '../OpenSeek-Pretrain-100B/cot_synthesis2_CC-mid/320_text_document', '0.3631', '../OpenSeek-Pretrain-100B/cot_synthesis2_CC-mid/32_text_document', '0.2573', '../OpenSeek-Pretrain-100B/cot_synthesis2_OpenSource-high/1_text_document', '0.1638', 
'../OpenSeek-Pretrain-100B/cot_synthesis2_OpenSource-low/2_text_document', '0.3251', '../OpenSeek-Pretrain-100B/cot_synthesis2_OpenSource-mid/3_text_document', '6.0237', '../OpenSeek-Pretrain-100B/cot_synthesis2_arxiv-high/2_text_document', '8.9063', '../OpenSeek-Pretrain-100B/cot_synthesis2_arxiv-low/1_text_document', '10.1376', '../OpenSeek-Pretrain-100B/cot_synthesis2_arxiv-mid/2_text_document', '0.4598', '../OpenSeek-Pretrain-100B/cot_synthesis2_code-high/4_text_document', '0.6857', '../OpenSeek-Pretrain-100B/cot_synthesis2_code-low/6_text_document', '0.899', '../OpenSeek-Pretrain-100B/cot_synthesis2_code-mid/23_text_document', '1.3135', '../OpenSeek-Pretrain-100B/cot_synthesis2_math-high/12_text_document', '1.653', '../OpenSeek-Pretrain-100B/cot_synthesis2_math-low/3_text_document', '0.3536', '../OpenSeek-Pretrain-100B/cot_synthesis2_math-mid/5_text_document', '0.6314', '../OpenSeek-Pretrain-100B/cot_synthesis2_wiki-high/5_text_document', '0.5978', '../OpenSeek-Pretrain-100B/cot_synthesis2_wiki-low/5_text_document', '0.7909', '../OpenSeek-Pretrain-100B/cot_synthesis2_wiki-mid/4_text_document', '0.2225', '../OpenSeek-Pretrain-100B/cot_synthesis_CC-high/74_text_document', '0.1797', '../OpenSeek-Pretrain-100B/cot_synthesis_CC-low/54_text_document', '0.2042', '../OpenSeek-Pretrain-100B/cot_synthesis_CC-mid/275_text_document', '0.4081', '../OpenSeek-Pretrain-100B/cot_synthesis_OpenSource-high/4_text_document', '0.1659', '../OpenSeek-Pretrain-100B/cot_synthesis_OpenSource-low/2_text_document', '1.2828', '../OpenSeek-Pretrain-100B/cot_synthesis_OpenSource-mid/6_text_document', '5.68', '../OpenSeek-Pretrain-100B/cot_synthesis_arxiv-high/2_text_document', '7.4907', '../OpenSeek-Pretrain-100B/cot_synthesis_arxiv-low/1_text_document', '8.9359', '../OpenSeek-Pretrain-100B/cot_synthesis_arxiv-mid/1_text_document', '0.7663', '../OpenSeek-Pretrain-100B/cot_synthesis_code-high/13_text_document', '0.4052', '../OpenSeek-Pretrain-100B/cot_synthesis_code-low/9_text_document', '0.1916', '../OpenSeek-Pretrain-100B/cot_synthesis_code-mid/6_text_document', '0.5074', '../OpenSeek-Pretrain-100B/cot_synthesis_math-high/11_text_document', '0.6437', '../OpenSeek-Pretrain-100B/cot_synthesis_math-low/11_text_document', '0.6406', '../OpenSeek-Pretrain-100B/cot_synthesis_math-mid/29_text_document', '0.4', '../OpenSeek-Pretrain-100B/cot_synthesis_wiki-high/4_text_document', '0.3564', '../OpenSeek-Pretrain-100B/cot_synthesis_wiki-low/6_text_document', '0.5768', '../OpenSeek-Pretrain-100B/cot_synthesis_wiki-mid/3_text_document', '1.8165', '../OpenSeek-Pretrain-100B/math-high/part_04_text_document', '1.694', '../OpenSeek-Pretrain-100B/math-low/part_10_text_document', '1.6311', '../OpenSeek-Pretrain-100B/math-mid/part_07_text_document', '0.687', '../OpenSeek-Pretrain-100B/pes2o-full-train/train-0041-of-0136_text_document', '0.7387', '../OpenSeek-Pretrain-100B/pes2o-full-train/train-0125-of-0136_text_document', '0.0143', '../OpenSeek-Pretrain-100B/pes2o-full-val/valid-0034-of-0060_text_document', '6.1982', '../OpenSeek-Pretrain-100B/pes2o/pubmedcentral_3_text_document', '0.4229', '../OpenSeek-Pretrain-100B/stack/018_00000_text_document', '0.4202', '../OpenSeek-Pretrain-100B/wiki/012_00000_text_document', '1.8171', '../OpenSeek-Pretrain-100B/zh_cc-high-loss0/part_28_text_document', '0.9776', '../OpenSeek-Pretrain-100B/zh_cc-high-loss1/part_59_text_document', '0.3725', '../OpenSeek-Pretrain-100B/zh_cc-high-loss2/part_16_text_document', '0.9492', '../OpenSeek-Pretrain-100B/zh_cc-medidum-loss0/part_192_text_document', 
'0.9236', '../OpenSeek-Pretrain-100B/zh_cc-medidum-loss1/part_550_text_document', '1.0643', '../OpenSeek-Pretrain-100B/zh_cc-medidum-loss2/part_71_text_document']
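
The data_path value above is a flat list of alternating weight / path-prefix pairs: the weights are normalized into sampling probabilities and the per-prefix datasets are blended accordingly, which is why the index-loading logs earlier draw a different number of samples from each file. A small sketch of that interpretation (the pairing and normalization reflect the general blended-dataset convention, not code lifted from Megatron):

```python
# Interpret an alternating weight/path-prefix list as blend probabilities.
data_path = [
    "1.1068", "../OpenSeek-Pretrain-100B/Nemotron-CC-high-actual-actual-high/part_142_text_document",
    "0.3577", "../OpenSeek-Pretrain-100B/Nemotron-CC-high-actual-actual-low/part_62_text_document",
    # ... remaining pairs exactly as in the arguments dump above ...
]
weights = [float(w) for w in data_path[0::2]]
prefixes = data_path[1::2]
total = sum(weights)
for w, p in zip(weights, prefixes):
    print(f"{w / total:7.2%}  {p}")  # per-prefix sampling probability
```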
  data_per_class_fraction ......................... 1.0
  data_sharding ................................... True
  dataloader_type ................................. single
  ddp_average_in_collective ....................... False
  ddp_bucket_size ................................. None
  ddp_num_buckets ................................. None
  ddp_pad_buckets_for_high_nccl_busbw ............. False
  decoder_first_pipeline_num_layers ............... None
  decoder_last_pipeline_num_layers ................ None
  decoder_num_layers .............................. None
  decoder_seq_length .............................. None
  decoupled_lr .................................... None
  decoupled_min_lr ................................ None
  decrease_batch_size_if_needed ................... False
  defer_embedding_wgrad_compute ................... False
  deprecated_use_mcore_models ..................... False
  deterministic_mode .............................. False
  dino_bottleneck_size ............................ 256
  dino_freeze_last_layer .......................... 1
  dino_head_hidden_size ........................... 2048
  dino_local_crops_number ......................... 10
  dino_local_img_size ............................. 96
  dino_norm_last_layer ............................ False
  dino_teacher_temp ............................... 0.07
  dino_warmup_teacher_temp ........................ 0.04
  dino_warmup_teacher_temp_epochs ................. 30
  disable_bf16_reduced_precision_matmul ........... False
  disable_mamba_mem_eff_path ...................... False
  disable_straggler_on_startup .................... False
  dist_ckpt_format_deprecated ..................... None
  dist_ckpt_strictness ............................ assume_ok_unexpected
  distribute_saved_activations .................... False
  distributed_backend ............................. nccl
  distributed_timeout_minutes ..................... 10
  embedding_path .................................. None
  empty_unused_memory_level ....................... 0
  enable_cuda_graph ............................... False
  enable_ft_package ............................... False
  enable_gloo_process_groups ...................... True
  enable_hetero ................................... False
  enable_msc ...................................... True
  enable_one_logger ............................... True
  encoder_num_layers .............................. 6
  encoder_pipeline_model_parallel_size ............ 0
  encoder_seq_length .............................. 4096
  encoder_tensor_model_parallel_size .............. 0
  end_weight_decay ................................ 0.1
  eod_mask_loss ................................... False
  error_injection_rate ............................ 0
  error_injection_type ............................ transient_error
  eval_interval ................................... 1000
  eval_iters ...................................... 0
  evidence_data_path .............................. None
  exit_duration_in_mins ........................... None
  exit_interval ................................... None
  exit_on_missing_checkpoint ...................... False
  exit_signal_handler ............................. False
  exp_avg_dtype ................................... torch.float32
  exp_avg_sq_dtype ................................ torch.float32
  expert_model_parallel_size ...................... 1
  expert_tensor_parallel_size ..................... 1
  expert_tensor_parallel_size_per_process_mesh .... None
  external_cuda_graph ............................. False
  extra_eval_interval ............................. None
  extra_valid_data_path ........................... None
  ffn_hidden_size ................................. 7168
  finetune ........................................ False
  finetune_dataset_type ........................... None
  first_last_layers_bf16 .......................... False
  flash_decode .................................... False
  fp16 ............................................ False
  fp16_lm_cross_entropy ........................... False
  fp32_residual_connection ........................ False
  fp8 ............................................. None
  fp8_amax_compute_algo ........................... most_recent
  fp8_amax_history_len ............................ 1
  fp8_interval .................................... 1
  fp8_margin ...................................... 0
  fp8_param_gather ................................ False
  fp8_recipe ...................................... delayed
  fp8_wgrad ....................................... True
  global_batch_size ............................... 1024
  grad_reduce_in_bf16 ............................. False
  gradient_accumulation_fusion .................... True
  gradient_reduce_div_fusion ...................... True
  group_query_attention ........................... False
  head_lr_mult .................................... 1.0
  hetero_current_device_type ...................... None
  hetero_device_types ............................. None
  hetero_pipeline_layer_split ..................... None
  hetero_process_meshes ........................... None
  hetero_use_cpu_communication .................... False
  heterogeneous_layers_config_encoded_json ........ None
  heterogeneous_layers_config_path ................ None
  hidden_dim_multiplier ........................... None
  hidden_dropout .................................. 0.0
  hidden_size ..................................... 1280
  hierarchical_context_parallel_sizes ............. None
  hybrid_attention_ratio .......................... 0.0
  hybrid_mlp_ratio ................................ 0.0
  hybrid_override_pattern ......................... None
  hysteresis ...................................... 2
  ict_head_size ................................... None
  ict_load ........................................ None
  img_h ........................................... 224
  img_w ........................................... 224
  indexer_batch_size .............................. 128
  indexer_log_interval ............................ 1000
  inference_batch_times_seqlen_threshold .......... -1
  inference_dynamic_batching ...................... False
  inference_dynamic_batching_buffer_guaranteed_fraction  0.2
  inference_dynamic_batching_buffer_overflow_factor  None
  inference_dynamic_batching_buffer_size_gb ....... 40.0
  inference_dynamic_batching_chunk_size ........... 256
  inference_dynamic_batching_max_requests_override  None
  inference_dynamic_batching_max_tokens_override .. None
  inference_max_batch_size ........................ 8
  inference_max_seq_length ........................ 2560
  inference_rng_tracker ........................... False
  init_method_std ................................. 0.006
  init_method_xavier_uniform ...................... False
  init_model_with_meta_device ..................... False
  initial_loss_scale .............................. 4294967296
  is_hybrid_model ................................. False
  iter_per_epoch .................................. 1250
  iterations_to_skip .............................. []
  keep_fp8_transpose_cache_when_using_custom_fsdp . False
  kv_channels ..................................... 128
  kv_lora_rank .................................... 512
  lazy_mpu_init ................................... None
  load ............................................ /data2/workspaces/FlagAI_OpenSeek/OpenSeek-Small-v1-Baseline/checkpoints
  local_rank ...................................... 0
  log_interval .................................... 1
  log_loss_scale_to_tensorboard ................... True
  log_memory_to_tensorboard ....................... True
  log_num_zeros_in_grad ........................... True
  log_params_norm ................................. True
  log_progress .................................... False
  log_straggler ................................... False
  log_throughput .................................. True
  log_timers_to_tensorboard ....................... True
  log_validation_ppl_to_tensorboard ............... True
  log_world_size_to_tensorboard ................... False
  logging_level ................................... None
  loss_scale ...................................... None
  loss_scale_window ............................... 1000
  lr .............................................. 0.003
  lr_decay_iters .................................. None
  lr_decay_samples ................................ None
  lr_decay_stablelm2_alpha ........................ 1.0
  lr_decay_stablelm2_beta ......................... 0.0
  lr_decay_stablelm2_cosine_max_lr ................ None
  lr_decay_stablelm2_cosine_period_samples ........ 0
  lr_decay_stablelm2_cosine_samples ............... 0
  lr_decay_stablelm2_decay_samples ................ 0
  lr_decay_stablelm2_rsqrt_samples ................ 0
  lr_decay_style .................................. cosine
  lr_warmup_fraction .............................. None
  lr_warmup_init .................................. 0.0
  lr_warmup_iters ................................. 0
  lr_warmup_samples ............................... 2048000
  lr_wsd_decay_iters .............................. None
  lr_wsd_decay_samples ............................ None
  lr_wsd_decay_style .............................. exponential
  main_grads_dtype ................................ torch.float32
  main_params_dtype ............................... torch.float32
  make_vocab_size_divisible_by .................... 64
  mamba_head_dim .................................. 64
  mamba_num_groups ................................ 8
  mamba_num_heads ................................. None
  mamba_state_dim ................................. 128
  manual_gc ....................................... False
  manual_gc_eval .................................. True
  manual_gc_interval .............................. 0
  mask_factor ..................................... 1.0
  mask_prob ....................................... 0.15
  mask_type ....................................... random
  masked_softmax_fusion ........................... True
  max_position_embeddings ......................... 4096
  max_tokens_to_oom ............................... 12000
  memory_snapshot_path ............................ snapshot.pickle
  merge_file ...................................... None
  micro_batch_size ................................ 1
  microbatch_group_size_per_vp_stage .............. None
  mid_level_dataset_surplus ....................... 0.005
  min_loss_scale .................................. 1.0
  min_lr .......................................... 0.0003
  mlp_chunks_for_prefill .......................... 1
  mmap_bin_files .................................. False
  mock_data ....................................... False
  moe_aux_loss_coeff .............................. 0.0001
  moe_enable_deepep ............................... False
  moe_expert_capacity_factor ...................... None
  moe_extended_tp ................................. False
  moe_ffn_hidden_size ............................. 896
  moe_grouped_gemm ................................ True
  moe_input_jitter_eps ............................ None
  moe_layer_freq .................................. [0, 1, 1, 1, 1, 1]
  moe_layer_recompute ............................. False
  moe_pad_expert_input_to_capacity ................ False
  moe_per_layer_logging ........................... False
  moe_permute_fusion .............................. False
  moe_router_bias_update_rate ..................... 0.001
  moe_router_dtype ................................ fp32
  moe_router_enable_expert_bias ................... True
  moe_router_group_topk ........................... 1
  moe_router_load_balancing_type .................. seq_aux_loss
  moe_router_num_groups ........................... 1
  moe_router_pre_softmax .......................... False
  moe_router_score_function ....................... sigmoid
  moe_router_topk ................................. 6
  moe_router_topk_scaling_factor .................. 2.446
  moe_shared_expert_intermediate_size ............. 1792
  moe_shared_expert_overlap ....................... False
  moe_token_dispatcher_type ....................... alltoall
  moe_token_drop_policy ........................... probs
  moe_use_legacy_grouped_gemm ..................... False
  moe_use_upcycling ............................... False
  moe_z_loss_coeff ................................ None
  mrope_section ................................... None
  mscale .......................................... 1.0
  mscale_all_dim .................................. 1.0
  mtp_loss_coeff .................................. 0.3
  mtp_loss_scaling_factor ......................... 0.1
  mtp_num_layers .................................. None
  multi_latent_attention .......................... True
  multiple_of ..................................... None
  nccl_all_reduce_for_prefill ..................... False
  nccl_communicator_config_path ................... None
  no_load_optim ................................... None
  no_load_rng ..................................... None
  no_persist_layer_norm ........................... False
  no_rope_freq .................................... None
  no_save_optim ................................... None
  no_save_rng ..................................... None
  no_shared_fs .................................... False
  non_persistent_ckpt_type ........................ None
  non_persistent_global_ckpt_dir .................. None
  non_persistent_local_ckpt_algo .................. fully_parallel
  non_persistent_local_ckpt_dir ................... None
  non_persistent_save_interval .................... None
  norm_epsilon .................................... 1e-06
  norm_init_weight ................................ None
  normalization ................................... RMSNorm
  num_attention_heads ............................. 10
  num_channels .................................... 3
  num_classes ..................................... 1000
  num_dataset_builder_threads ..................... 1
  num_distributed_optimizer_instances ............. 1
  num_experts ..................................... 64
  num_layers ...................................... 6
  num_layers_at_end_in_bf16 ....................... 1
  num_layers_at_start_in_bf16 ..................... 1
  num_layers_per_virtual_pipeline_stage ........... None
  num_mtp_predictor ............................... 0
  num_query_groups ................................ 10
  num_virtual_stages_per_pipeline_rank ............ None
  num_workers ..................................... 4
  object_storage_cache_path ....................... None
  one_logger_async ................................ False
  one_logger_project .............................. megatron-lm
  one_logger_run_name ............................. None
  onnx_safe ....................................... None
  openai_gelu ..................................... False
  optimizer ....................................... adam
  optimizer_cpu_offload ........................... False
  optimizer_offload_fraction ...................... 1.0
  output_bert_embeddings .......................... False
  overlap_cpu_optimizer_d2h_h2d ................... False
  overlap_grad_reduce ............................. True
  overlap_p2p_comm ................................ False
  overlap_p2p_comm_warmup_flush ................... False
  overlap_param_gather ............................ True
  overlap_param_gather_with_optimizer_step ........ False
  override_opt_param_scheduler .................... False
  params_dtype .................................... torch.bfloat16
  patch_dim ....................................... 16
  per_split_data_args_path ........................ None
  perform_initialization .......................... True
  pin_cpu_grads ................................... True
  pin_cpu_params .................................. True
  pipeline_model_parallel_comm_backend ............ None
  pipeline_model_parallel_size .................... 1
  pipeline_model_parallel_split_rank .............. None
  position_embedding_type ......................... rope
  pretrained_checkpoint ........................... None
  profile ......................................... False
  profile_ranks ................................... [0]
  profile_step_end ................................ 12
  profile_step_start .............................. 10
  q_lora_rank ..................................... None
  qk_head_dim ..................................... 128
  qk_layernorm .................................... True
  qk_layernorm_hidden_dim ......................... False
  qk_pos_emb_head_dim ............................. 64
  query_in_block_prob ............................. 0.1
  rampup_batch_size ............................... None
  rampup_save_interval ............................ None
  rank ............................................ 0
  recompute_granularity ........................... full
  recompute_granularity_per_stage_micro_batch ..... None
  recompute_method ................................ uniform
  recompute_method_per_stage_micro_batch .......... None
  recompute_modules ............................... None
  recompute_num_layers ............................ 6
  recompute_num_layers_per_stage_micro_batch ...... None
  record_memory_history ........................... False
  relative_attention_max_distance ................. 128
  relative_attention_num_buckets .................. 32
  replication ..................................... False
  replication_factor .............................. 2
  replication_jump ................................ None
  rerun_mode ...................................... disabled
  reset_attention_mask ............................ True
  reset_position_ids .............................. True
  result_rejected_tracker_filename ................ None
  retriever_report_topk_accuracies ................ []
  retriever_score_scaling ......................... False
  retriever_seq_length ............................ 256
  retro_add_retriever ............................. False
  retro_attention_gate ............................ 1
  retro_cyclic_train_iters ........................ None
  retro_encoder_attention_dropout ................. 0.1
  retro_encoder_hidden_dropout .................... 0.1
  retro_encoder_layers ............................ 2
  retro_num_neighbors ............................. 2
  retro_num_retrieved_chunks ...................... 2
  retro_project_dir ............................... None
  retro_verify_neighbor_count ..................... True
  rope_scaling_factor ............................. 8.0
  rotary_base ..................................... 1000000
  rotary_interleaved .............................. False
  rotary_percent .................................. 1.0
  rotary_scaling_factor ........................... 1.0
  rotary_seq_len_interpolation_factor ............. None
  run_workload_inspector_server ................... False
  sample_rate ..................................... 1.0
  save ............................................ /data2/workspaces/FlagAI_OpenSeek/OpenSeek-Small-v1-Baseline/checkpoints
  save_interval ................................... 600
  save_when_num_microbatches_change ............... False
  scatter_gather_tensors_in_pipeline .............. True
  seed ............................................ 42
  seq_length ...................................... 4096
  sequence_parallel ............................... False
  sgd_momentum .................................... 0.9
  short_seq_prob .................................. 0.1
  skip_iters_range ................................ None
  skip_samples_range .............................. None
  skip_train ...................................... False
  skipped_train_samples ........................... 0
  spec ............................................ None
  special_tokens_file ............................. None
  spiky_loss_threshold ............................ 0.2
  split ........................................... 1
  squared_relu .................................... False
  standalone_embedding_stage ...................... False
  start_weight_decay .............................. 0.1
  straggler_ctrlr_port ............................ 65535
  straggler_minmax_count .......................... 1
  suggested_communication_unit_size ............... None
  swiglu .......................................... True
  swin_backbone_type .............................. tiny
  symmetric_ar_type ............................... None
  te_rng_tracker .................................. False
  tensor_model_parallel_size ...................... 1
  tensorboard_dir ................................. /data2/workspaces/FlagAI_OpenSeek/OpenSeek-Small-v1-Baseline/tensorboard
  tensorboard_log_interval ........................ 1
  tensorboard_queue_size .......................... 1000
  test_data_path .................................. None
  test_mode ....................................... False
  tiktoken_num_special_tokens ..................... 1000
  tiktoken_pattern ................................ None
  tiktoken_special_tokens ......................... None
  timing_log_level ................................ 0
  timing_log_option ............................... minmax
  titles_data_path ................................ None
  tokenizer_model ................................. None
  tokenizer_path .................................. ../hf_openseek/tokenizer
  tokenizer_type .................................. QwenTokenizerFS
  torch_fsdp2_reshard_after_forward ............... True
  tp_comm_bootstrap_backend ....................... nccl
  tp_comm_bulk_dgrad .............................. True
  tp_comm_bulk_wgrad .............................. True
  tp_comm_overlap ................................. False
  tp_comm_overlap_ag .............................. True
  tp_comm_overlap_cfg ............................. None
  tp_comm_overlap_rs .............................. True
  tp_comm_overlap_rs_dgrad ........................ False
  tp_comm_split_ag ................................ True
  tp_comm_split_rs ................................ True
  train_data_path ................................. None
  train_iters ..................................... None
  train_samples ................................... 24576000
  train_sync_interval ............................. None
  transformer_impl ................................ transformer_engine
  transformer_pipeline_model_parallel_size ........ 1
  untie_embeddings_and_output_weights ............. False
  use_checkpoint_args ............................. False
  use_checkpoint_opt_param_scheduler .............. False
  use_cpu_initialization .......................... None
  use_custom_fsdp ................................. False
  use_dist_ckpt ................................... False
  use_dist_ckpt_deprecated ........................ False
  use_distributed_optimizer ....................... True
  use_flash_attn .................................. False
  use_legacy_models ............................... False
  use_mp_args_from_checkpoint_args ................ False
  use_one_sent_docs ............................... False
  use_partial_reduce_for_shared_embedding ......... False
  use_persistent_ckpt_worker ...................... False
  use_precision_aware_optimizer ................... False
  use_pytorch_profiler ............................ False
  use_ring_exchange_p2p ........................... False
  use_rope_scaling ................................ False
  use_rotary_position_embeddings .................. True
  use_tokenizer_model_from_checkpoint_args ........ True
  use_torch_fsdp2 ................................. False
  use_torch_optimizer_for_cpu_offload ............. False
  use_tp_pp_dp_mapping ............................ False
  v_head_dim ...................................... 128
  valid_data_path ................................. None
  variable_seq_lengths ............................ False
  virtual_pipeline_model_parallel_size ............ None
  vision_backbone_type ............................ vit
  vision_pretraining .............................. False
  vision_pretraining_type ......................... classify
  vocab_extra_ids ................................. 0
  vocab_file ...................................... None
  vocab_size ...................................... 151851
  wandb_api_key ................................... 
  wandb_exp_name .................................. OpenSeek-Small-v1-Baseline
  wandb_log_model ................................. False
  wandb_log_model_interval ........................ 1000
  wandb_mode ...................................... offline
  wandb_project ................................... OpenSeek-Small-v1-Baseline
  wandb_save_dir .................................. /data2/workspaces/FlagAI_OpenSeek/OpenSeek-Small-v1-Baseline/wandb
  weight_decay .................................... 0.1
  weight_decay_incr_style ......................... constant
  wgrad_deferral_limit ............................ 0
  world_size ...................................... 1
  yaml_cfg ........................................ None
-------------------- end of arguments ---------------------
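参数表到此结束。几个值得注意的配置:`seq_length=4096`、`train_samples=24576000`、`use_distributed_optimizer=True`、`swiglu=True`,tokenizer 为 `QwenTokenizerFS`。可以用一段小脚本核对训练步数(其中 `micro_batch_size=1` 为推测值,实际以日志前文的参数为准):

```python
# 核对训练步数:global_batch = micro_batch * num_microbatches * dp_size
train_samples = 24_576_000
num_microbatches = 1024      # 日志:setting number of microbatches to constant 1024
micro_batch_size = 1         # 推测值,实际见前文参数
data_parallel_size = 1       # world_size=1, tp=pp=1
global_batch_size = micro_batch_size * num_microbatches * data_parallel_size
print(global_batch_size)                   # 1024
print(train_samples // global_batch_size)  # 24000,与后文 "setting training iterations to 24000" 一致
```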
INFO:megatron.core.num_microbatches_calculator:setting number of microbatches to constant 1024
> building QwenTokenizerFS tokenizer ...
 > padded vocab (size: 151851) with 21 dummy tokens (new size: 151872)
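这里的词表 padding 可以手动验证:Megatron 会把词表补齐到 `make_vocab_size_divisible_by * tensor_model_parallel_size` 的整数倍。151872 能被 64 整除但不能被 128 整除,由此推测本配置的 divisible_by 是 64(该参数未出现在本段日志中,属推断):

```python
def pad_vocab(orig_size: int, divisible_by: int, tp_size: int = 1) -> int:
    # 补齐到 divisible_by * tp_size 的整数倍
    multiple = divisible_by * tp_size
    return ((orig_size + multiple - 1) // multiple) * multiple

print(pad_vocab(151851, 64))  # 151872,即补了 21 个 dummy token,与日志一致
```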
WARNING:megatron.core.rerun_state_machine:RerunStateMachine initialized in mode RerunMode.DISABLED
> initializing torch distributed ...
> initialized tensor model parallel with size 1
> initialized pipeline model parallel with size 1
> setting random seeds to 42 ...
> setting tensorboard ...
WARNING: one_logger package is required to enable e2e metrics tracking. please go to https://confluence.nvidia.com/display/MLWFO/Package+Repositories for details to install it
> compiling dataset index builder ...
make: Entering directory '/data2/workspaces/FlagAI_OpenSeek/FlagScale/third_party/Megatron-LM/megatron/core/datasets'
make: Nothing to be done for 'default'.
make: Leaving directory '/data2/workspaces/FlagAI_OpenSeek/FlagScale/third_party/Megatron-LM/megatron/core/datasets'
>>> done with dataset index builder. Compilation time: 0.028 seconds
WARNING: constraints for invoking optimized fused softmax kernel are not met. We default back to unfused kernel invocations.
> compiling and loading fused kernels ...
>>> done with compiling and loading fused kernels. Compilation time: 0.055 seconds
time to initialize megatron (seconds): 3.587
[after megatron is initialized] datetime: 2025-06-27 12:23:30 
> Number of attn parameters in a transformer block in billions:  0.01
> Number of dense mlp parameters in a transformer block in billions:  0.03
> Number of sparse mlp parameters in a transformer block in billions:  0.23
> Number of parameters in transformer block in billions:  1.20
> Number of parameters in mtp transformer block in billions:  0.00
> Number of parameters in embedding layers in billions: 0.19
> Total number of parameters in billions: 1.39
> Number of parameters in most loaded shard in billions: 1.3946
> Number of activated attn parameters in a transformer block in billions:  0.01
> Number of activated dense mlp parameters in a transformer block in billions:  0.03
> Number of activated sparse mlp parameters in a transformer block in billions:  0.03
> Number of activated parameters in transformer block in billions:  0.20
> Number of activated parameters in mtp transformer block in billions:  0.00
> Number of activated parameters in embedding layers in billions: 0.19
> Total number of activated parameters in billions: 0.40
> Activation memory footprint per dense transformer layer: 276.0 MB
> Activation memory footprint per moe transformer layer: 346.0 MB
>>> [FS] Theoretical memory footprints: weight and optimizer=23940.62 MB, activation=2416.03 MB, total=26356.66 MB
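这一段是模型规模报告:总参数 1.39B、激活参数 0.40B,典型的细粒度 MoE 配置。稀疏 MLP 总参数 0.23B、每次仅激活 0.03B,约为 1/8,与下方 bucket 列表里每层 64 个专家(`weight0`~`weight63`)加 top-k 路由(推测 k≈8,另有共享专家)相符。分项加总核对一下(日志只保留两位小数,存在舍入误差):

```python
# 用日志给出的分项核对总量(单位:B,十亿参数)
embedding   = 0.19
transformer = 1.20   # 1 个 dense 层 + 5 个 MoE 层(由下方 bucket 列表推断 layers.0 为 dense)
print(round(embedding + transformer, 2))  # 1.39,与 "Total number of parameters" 一致

activated_blocks = 0.20   # 稀疏 MLP 仅激活 0.03 / 0.23 ≈ 1/8
print(round(embedding + activated_blocks, 2))  # 0.39 ≈ 0.40(舍入)
```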

building GPT model ...
 > number of parameters on (tensor, pipeline) model parallel rank (0, 0): 1394625792
INFO:megatron.core.distributed.distributed_data_parallel:Setting up DistributedDataParallel with config DistributedDataParallelConfig(grad_reduce_in_fp32=True, overlap_grad_reduce=True, overlap_param_gather=True, align_param_gather=False, use_distributed_optimizer=True, num_distributed_optimizer_instances=1, check_for_nan_in_grad=True, check_for_large_grads=False, bucket_size=40000000, pad_buckets_for_high_nccl_busbw=False, average_in_collective=False, fp8_param_gather=False, use_custom_fsdp=False, data_parallel_sharding_strategy='no_shard', gradient_reduce_div_fusion=True, suggested_communication_unit_size=None, preserve_fp32_weights=True, keep_fp8_transpose_cache_when_using_custom_fsdp=False)
INFO:megatron.core.distributed.param_and_grad_buffer:Number of buckets for gradient all-reduce / reduce-scatter: 30
Params for bucket 1 (40142080 elements, 40142080 padded size):
	module.decoder.layers.5.mlp.shared_experts.linear_fc1.weight
	module.decoder.layers.5.mlp.experts.linear_fc2.weight60
	module.decoder.layers.5.mlp.experts.linear_fc2.weight52
	module.decoder.layers.5.mlp.experts.linear_fc2.weight44
	module.decoder.layers.5.mlp.experts.linear_fc2.weight36
	module.decoder.layers.5.mlp.experts.linear_fc2.weight57
	module.decoder.layers.5.mlp.experts.linear_fc2.weight49
	module.decoder.layers.5.mlp.experts.linear_fc2.weight41
	module.decoder.layers.5.mlp.experts.linear_fc2.weight62
	module.decoder.layers.5.mlp.experts.linear_fc2.weight54
	module.decoder.layers.5.mlp.experts.linear_fc2.weight46
	module.decoder.layers.5.mlp.experts.linear_fc2.weight38
	module.decoder.layers.5.mlp.experts.linear_fc2.weight59
	module.decoder.layers.5.mlp.experts.linear_fc2.weight51
	module.decoder.layers.5.mlp.experts.linear_fc2.weight43
	module.decoder.layers.5.mlp.experts.linear_fc2.weight35
	module.decoder.layers.5.mlp.experts.linear_fc2.weight56
	module.decoder.layers.5.mlp.experts.linear_fc2.weight48
	module.decoder.layers.5.mlp.experts.linear_fc2.weight40
	module.decoder.layers.5.mlp.shared_experts.linear_fc2.weight
	module.decoder.layers.5.mlp.experts.linear_fc2.weight61
	module.decoder.layers.5.mlp.experts.linear_fc2.weight53
	module.decoder.layers.5.mlp.experts.linear_fc2.weight45
	module.decoder.layers.5.mlp.experts.linear_fc2.weight37
	module.decoder.layers.5.mlp.experts.linear_fc2.weight58
	module.decoder.layers.5.mlp.experts.linear_fc2.weight50
	module.decoder.layers.5.mlp.experts.linear_fc2.weight42
	module.decoder.final_layernorm.weight
	module.decoder.layers.5.mlp.experts.linear_fc2.weight63
	module.decoder.layers.5.mlp.experts.linear_fc2.weight55
	module.decoder.layers.5.mlp.experts.linear_fc2.weight47
	module.decoder.layers.5.mlp.experts.linear_fc2.weight39
Params for bucket 2 (40140800 elements, 40140800 padded size):
	module.decoder.layers.5.mlp.experts.linear_fc2.weight28
	module.decoder.layers.5.mlp.experts.linear_fc2.weight20
	module.decoder.layers.5.mlp.experts.linear_fc2.weight12
	module.decoder.layers.5.mlp.experts.linear_fc2.weight4
	module.decoder.layers.5.mlp.experts.linear_fc2.weight33
	module.decoder.layers.5.mlp.experts.linear_fc2.weight25
	module.decoder.layers.5.mlp.experts.linear_fc2.weight17
	module.decoder.layers.5.mlp.experts.linear_fc2.weight9
	module.decoder.layers.5.mlp.experts.linear_fc2.weight1
	module.decoder.layers.5.mlp.experts.linear_fc2.weight30
	module.decoder.layers.5.mlp.experts.linear_fc2.weight22
	module.decoder.layers.5.mlp.experts.linear_fc2.weight14
	module.decoder.layers.5.mlp.experts.linear_fc2.weight6
	module.decoder.layers.5.mlp.experts.linear_fc2.weight27
	module.decoder.layers.5.mlp.experts.linear_fc2.weight19
	module.decoder.layers.5.mlp.experts.linear_fc2.weight11
	module.decoder.layers.5.mlp.experts.linear_fc2.weight3
	module.decoder.layers.5.mlp.experts.linear_fc2.weight32
	module.decoder.layers.5.mlp.experts.linear_fc2.weight24
	module.decoder.layers.5.mlp.experts.linear_fc2.weight16
	module.decoder.layers.5.mlp.experts.linear_fc2.weight8
	module.decoder.layers.5.mlp.experts.linear_fc2.weight2
	module.decoder.layers.5.mlp.experts.linear_fc2.weight29
	module.decoder.layers.5.mlp.experts.linear_fc2.weight21
	module.decoder.layers.5.mlp.experts.linear_fc2.weight13
	module.decoder.layers.5.mlp.experts.linear_fc2.weight5
	module.decoder.layers.5.mlp.experts.linear_fc2.weight34
	module.decoder.layers.5.mlp.experts.linear_fc2.weight26
	module.decoder.layers.5.mlp.experts.linear_fc2.weight18
	module.decoder.layers.5.mlp.experts.linear_fc2.weight10
	module.decoder.layers.5.mlp.experts.linear_fc2.weight0
	module.decoder.layers.5.mlp.experts.linear_fc2.weight31
	module.decoder.layers.5.mlp.experts.linear_fc2.weight23
	module.decoder.layers.5.mlp.experts.linear_fc2.weight15
	module.decoder.layers.5.mlp.experts.linear_fc2.weight7
Params for bucket 3 (41287680 elements, 41287680 padded size):
	module.decoder.layers.5.mlp.experts.linear_fc1.weight62
	module.decoder.layers.5.mlp.experts.linear_fc1.weight60
	module.decoder.layers.5.mlp.experts.linear_fc1.weight58
	module.decoder.layers.5.mlp.experts.linear_fc1.weight56
	module.decoder.layers.5.mlp.experts.linear_fc1.weight54
	module.decoder.layers.5.mlp.experts.linear_fc1.weight52
	module.decoder.layers.5.mlp.experts.linear_fc1.weight50
	module.decoder.layers.5.mlp.experts.linear_fc1.weight48
	module.decoder.layers.5.mlp.experts.linear_fc1.weight46
	module.decoder.layers.5.mlp.experts.linear_fc1.weight63
	module.decoder.layers.5.mlp.experts.linear_fc1.weight61
	module.decoder.layers.5.mlp.experts.linear_fc1.weight59
	module.decoder.layers.5.mlp.experts.linear_fc1.weight57
	module.decoder.layers.5.mlp.experts.linear_fc1.weight55
	module.decoder.layers.5.mlp.experts.linear_fc1.weight53
	module.decoder.layers.5.mlp.experts.linear_fc1.weight51
	module.decoder.layers.5.mlp.experts.linear_fc1.weight49
	module.decoder.layers.5.mlp.experts.linear_fc1.weight47
Params for bucket 4 (41287680 elements, 41287680 padded size):
	module.decoder.layers.5.mlp.experts.linear_fc1.weight44
	module.decoder.layers.5.mlp.experts.linear_fc1.weight42
	module.decoder.layers.5.mlp.experts.linear_fc1.weight40
	module.decoder.layers.5.mlp.experts.linear_fc1.weight38
	module.decoder.layers.5.mlp.experts.linear_fc1.weight36
	module.decoder.layers.5.mlp.experts.linear_fc1.weight34
	module.decoder.layers.5.mlp.experts.linear_fc1.weight32
	module.decoder.layers.5.mlp.experts.linear_fc1.weight31
	module.decoder.layers.5.mlp.experts.linear_fc1.weight28
	module.decoder.layers.5.mlp.experts.linear_fc1.weight45
	module.decoder.layers.5.mlp.experts.linear_fc1.weight43
	module.decoder.layers.5.mlp.experts.linear_fc1.weight41
	module.decoder.layers.5.mlp.experts.linear_fc1.weight39
	module.decoder.layers.5.mlp.experts.linear_fc1.weight37
	module.decoder.layers.5.mlp.experts.linear_fc1.weight35
	module.decoder.layers.5.mlp.experts.linear_fc1.weight33
	module.decoder.layers.5.mlp.experts.linear_fc1.weight30
	module.decoder.layers.5.mlp.experts.linear_fc1.weight29
Params for bucket 5 (41287680 elements, 41287680 padded size):
	module.decoder.layers.5.mlp.experts.linear_fc1.weight24
	module.decoder.layers.5.mlp.experts.linear_fc1.weight26
	module.decoder.layers.5.mlp.experts.linear_fc1.weight22
	module.decoder.layers.5.mlp.experts.linear_fc1.weight20
	module.decoder.layers.5.mlp.experts.linear_fc1.weight18
	module.decoder.layers.5.mlp.experts.linear_fc1.weight16
	module.decoder.layers.5.mlp.experts.linear_fc1.weight14
	module.decoder.layers.5.mlp.experts.linear_fc1.weight12
	module.decoder.layers.5.mlp.experts.linear_fc1.weight10
	module.decoder.layers.5.mlp.experts.linear_fc1.weight27
	module.decoder.layers.5.mlp.experts.linear_fc1.weight23
	module.decoder.layers.5.mlp.experts.linear_fc1.weight25
	module.decoder.layers.5.mlp.experts.linear_fc1.weight21
	module.decoder.layers.5.mlp.experts.linear_fc1.weight19
	module.decoder.layers.5.mlp.experts.linear_fc1.weight17
	module.decoder.layers.5.mlp.experts.linear_fc1.weight15
	module.decoder.layers.5.mlp.experts.linear_fc1.weight13
	module.decoder.layers.5.mlp.experts.linear_fc1.weight11
Params for bucket 6 (40635392 elements, 40635392 padded size):
	module.decoder.layers.5.mlp.experts.linear_fc1.weight3
	module.decoder.layers.5.mlp.router.weight
	module.decoder.layers.5.pre_mlp_layernorm.weight
	module.decoder.layers.5.self_attention.linear_kv_up_proj.weight
	module.decoder.layers.4.mlp.experts.linear_fc2.weight62
	module.decoder.layers.5.mlp.experts.linear_fc1.weight7
	module.decoder.layers.5.self_attention.linear_kv_up_proj.layer_norm_weight
	module.decoder.layers.5.self_attention.linear_kv_down_proj.weight
	module.decoder.layers.5.mlp.experts.linear_fc1.weight4
	module.decoder.layers.5.self_attention.linear_q_proj.weight
	module.decoder.layers.5.self_attention.linear_proj.weight
	module.decoder.layers.5.mlp.experts.linear_fc1.weight9
	module.decoder.layers.5.mlp.experts.linear_fc1.weight0
	module.decoder.layers.4.mlp.experts.linear_fc2.weight61
	module.decoder.layers.5.mlp.experts.linear_fc1.weight6
	module.decoder.layers.5.mlp.experts.linear_fc1.weight1
	module.decoder.layers.5.input_layernorm.weight
	module.decoder.layers.4.mlp.shared_experts.linear_fc2.weight
	module.decoder.layers.4.mlp.experts.linear_fc2.weight63
	module.decoder.layers.5.mlp.experts.linear_fc1.weight8
	module.decoder.layers.5.mlp.experts.linear_fc1.weight2
	module.decoder.layers.4.mlp.shared_experts.linear_fc1.weight
	module.decoder.layers.4.mlp.experts.linear_fc2.weight60
	module.decoder.layers.5.mlp.experts.linear_fc1.weight5
Params for bucket 7 (40140800 elements, 40140800 padded size):
	module.decoder.layers.4.mlp.experts.linear_fc2.weight54
	module.decoder.layers.4.mlp.experts.linear_fc2.weight46
	module.decoder.layers.4.mlp.experts.linear_fc2.weight38
	module.decoder.layers.4.mlp.experts.linear_fc2.weight30
	module.decoder.layers.4.mlp.experts.linear_fc2.weight59
	module.decoder.layers.4.mlp.experts.linear_fc2.weight51
	module.decoder.layers.4.mlp.experts.linear_fc2.weight43
	module.decoder.layers.4.mlp.experts.linear_fc2.weight35
	module.decoder.layers.4.mlp.experts.linear_fc2.weight27
	module.decoder.layers.4.mlp.experts.linear_fc2.weight56
	module.decoder.layers.4.mlp.experts.linear_fc2.weight48
	module.decoder.layers.4.mlp.experts.linear_fc2.weight40
	module.decoder.layers.4.mlp.experts.linear_fc2.weight32
	module.decoder.layers.4.mlp.experts.linear_fc2.weight53
	module.decoder.layers.4.mlp.experts.linear_fc2.weight45
	module.decoder.layers.4.mlp.experts.linear_fc2.weight37
	module.decoder.layers.4.mlp.experts.linear_fc2.weight29
	module.decoder.layers.4.mlp.experts.linear_fc2.weight58
	module.decoder.layers.4.mlp.experts.linear_fc2.weight50
	module.decoder.layers.4.mlp.experts.linear_fc2.weight42
	module.decoder.layers.4.mlp.experts.linear_fc2.weight34
	module.decoder.layers.4.mlp.experts.linear_fc2.weight26
	module.decoder.layers.4.mlp.experts.linear_fc2.weight55
	module.decoder.layers.4.mlp.experts.linear_fc2.weight47
	module.decoder.layers.4.mlp.experts.linear_fc2.weight39
	module.decoder.layers.4.mlp.experts.linear_fc2.weight31
	module.decoder.layers.4.mlp.experts.linear_fc2.weight52
	module.decoder.layers.4.mlp.experts.linear_fc2.weight44
	module.decoder.layers.4.mlp.experts.linear_fc2.weight36
	module.decoder.layers.4.mlp.experts.linear_fc2.weight28
	module.decoder.layers.4.mlp.experts.linear_fc2.weight57
	module.decoder.layers.4.mlp.experts.linear_fc2.weight49
	module.decoder.layers.4.mlp.experts.linear_fc2.weight41
	module.decoder.layers.4.mlp.experts.linear_fc2.weight33
	module.decoder.layers.4.mlp.experts.linear_fc2.weight25
Params for bucket 8 (40140800 elements, 40140800 padded size):
	module.decoder.layers.4.mlp.experts.linear_fc2.weight22
	module.decoder.layers.4.mlp.experts.linear_fc2.weight14
	module.decoder.layers.4.mlp.experts.linear_fc2.weight0
	module.decoder.layers.4.mlp.experts.linear_fc1.weight60
	module.decoder.layers.4.mlp.experts.linear_fc2.weight19
	module.decoder.layers.4.mlp.experts.linear_fc2.weight11
	module.decoder.layers.4.mlp.experts.linear_fc2.weight24
	module.decoder.layers.4.mlp.experts.linear_fc2.weight16
	module.decoder.layers.4.mlp.experts.linear_fc2.weight8
	module.decoder.layers.4.mlp.experts.linear_fc2.weight4
	module.decoder.layers.4.mlp.experts.linear_fc1.weight62
	module.decoder.layers.4.mlp.experts.linear_fc2.weight21
	module.decoder.layers.4.mlp.experts.linear_fc2.weight13
	module.decoder.layers.4.mlp.experts.linear_fc2.weight1
	module.decoder.layers.4.mlp.experts.linear_fc1.weight59
	module.decoder.layers.4.mlp.experts.linear_fc2.weight18
	module.decoder.layers.4.mlp.experts.linear_fc2.weight10
	module.decoder.layers.4.mlp.experts.linear_fc2.weight7
	module.decoder.layers.4.mlp.experts.linear_fc2.weight23
	module.decoder.layers.4.mlp.experts.linear_fc2.weight15
	module.decoder.layers.4.mlp.experts.linear_fc2.weight3
	module.decoder.layers.4.mlp.experts.linear_fc1.weight61
	module.decoder.layers.4.mlp.experts.linear_fc2.weight20
	module.decoder.layers.4.mlp.experts.linear_fc2.weight12
	module.decoder.layers.4.mlp.experts.linear_fc2.weight5
	module.decoder.layers.4.mlp.experts.linear_fc2.weight2
	module.decoder.layers.4.mlp.experts.linear_fc2.weight17
	module.decoder.layers.4.mlp.experts.linear_fc2.weight9
	module.decoder.layers.4.mlp.experts.linear_fc2.weight6
	module.decoder.layers.4.mlp.experts.linear_fc1.weight63
Params for bucket 9 (41287680 elements, 41287680 padded size):
	module.decoder.layers.4.mlp.experts.linear_fc1.weight58
	module.decoder.layers.4.mlp.experts.linear_fc1.weight54
	module.decoder.layers.4.mlp.experts.linear_fc1.weight56
	module.decoder.layers.4.mlp.experts.linear_fc1.weight52
	module.decoder.layers.4.mlp.experts.linear_fc1.weight50
	module.decoder.layers.4.mlp.experts.linear_fc1.weight48
	module.decoder.layers.4.mlp.experts.linear_fc1.weight46
	module.decoder.layers.4.mlp.experts.linear_fc1.weight44
	module.decoder.layers.4.mlp.experts.linear_fc1.weight42
	module.decoder.layers.4.mlp.experts.linear_fc1.weight57
	module.decoder.layers.4.mlp.experts.linear_fc1.weight55
	module.decoder.layers.4.mlp.experts.linear_fc1.weight53
	module.decoder.layers.4.mlp.experts.linear_fc1.weight51
	module.decoder.layers.4.mlp.experts.linear_fc1.weight49
	module.decoder.layers.4.mlp.experts.linear_fc1.weight47
	module.decoder.layers.4.mlp.experts.linear_fc1.weight45
	module.decoder.layers.4.mlp.experts.linear_fc1.weight43
	module.decoder.layers.4.mlp.experts.linear_fc1.weight41
Params for bucket 10 (41287680 elements, 41287680 padded size):
	module.decoder.layers.4.mlp.experts.linear_fc1.weight40
	module.decoder.layers.4.mlp.experts.linear_fc1.weight38
	module.decoder.layers.4.mlp.experts.linear_fc1.weight36
	module.decoder.layers.4.mlp.experts.linear_fc1.weight34
	module.decoder.layers.4.mlp.experts.linear_fc1.weight32
	module.decoder.layers.4.mlp.experts.linear_fc1.weight30
	module.decoder.layers.4.mlp.experts.linear_fc1.weight28
	module.decoder.layers.4.mlp.experts.linear_fc1.weight26
	module.decoder.layers.4.mlp.experts.linear_fc1.weight24
	module.decoder.layers.4.mlp.experts.linear_fc1.weight39
	module.decoder.layers.4.mlp.experts.linear_fc1.weight37
	module.decoder.layers.4.mlp.experts.linear_fc1.weight35
	module.decoder.layers.4.mlp.experts.linear_fc1.weight33
	module.decoder.layers.4.mlp.experts.linear_fc1.weight31
	module.decoder.layers.4.mlp.experts.linear_fc1.weight29
	module.decoder.layers.4.mlp.experts.linear_fc1.weight27
	module.decoder.layers.4.mlp.experts.linear_fc1.weight25
	module.decoder.layers.4.mlp.experts.linear_fc1.weight23
Params for bucket 11 (41287680 elements, 41287680 padded size):
	module.decoder.layers.4.mlp.experts.linear_fc1.weight22
	module.decoder.layers.4.mlp.experts.linear_fc1.weight18
	module.decoder.layers.4.mlp.experts.linear_fc1.weight20
	module.decoder.layers.4.mlp.experts.linear_fc1.weight16
	module.decoder.layers.4.mlp.experts.linear_fc1.weight14
	module.decoder.layers.4.mlp.experts.linear_fc1.weight12
	module.decoder.layers.4.mlp.experts.linear_fc1.weight10
	module.decoder.layers.4.mlp.experts.linear_fc1.weight8
	module.decoder.layers.4.mlp.experts.linear_fc1.weight6
	module.decoder.layers.4.mlp.experts.linear_fc1.weight19
	module.decoder.layers.4.mlp.experts.linear_fc1.weight21
	module.decoder.layers.4.mlp.experts.linear_fc1.weight17
	module.decoder.layers.4.mlp.experts.linear_fc1.weight15
	module.decoder.layers.4.mlp.experts.linear_fc1.weight13
	module.decoder.layers.4.mlp.experts.linear_fc1.weight11
	module.decoder.layers.4.mlp.experts.linear_fc1.weight9
	module.decoder.layers.4.mlp.experts.linear_fc1.weight7
	module.decoder.layers.4.mlp.experts.linear_fc1.weight5
Params for bucket 12 (40635392 elements, 40635392 padded size):
	module.decoder.layers.4.mlp.experts.linear_fc1.weight4
	module.decoder.layers.4.self_attention.linear_q_proj.weight
	module.decoder.layers.4.self_attention.linear_proj.weight
	module.decoder.layers.3.mlp.experts.linear_fc2.weight60
	module.decoder.layers.3.mlp.experts.linear_fc2.weight52
	module.decoder.layers.4.mlp.experts.linear_fc1.weight0
	module.decoder.layers.3.mlp.experts.linear_fc2.weight57
	module.decoder.layers.3.mlp.experts.linear_fc2.weight62
	module.decoder.layers.3.mlp.experts.linear_fc2.weight54
	module.decoder.layers.4.mlp.experts.linear_fc1.weight1
	module.decoder.layers.3.mlp.experts.linear_fc2.weight59
	module.decoder.layers.3.mlp.experts.linear_fc2.weight51
	module.decoder.layers.4.mlp.experts.linear_fc1.weight2
	module.decoder.layers.4.self_attention.linear_kv_up_proj.layer_norm_weight
	module.decoder.layers.4.self_attention.linear_kv_down_proj.weight
	module.decoder.layers.3.mlp.shared_experts.linear_fc2.weight
	module.decoder.layers.3.mlp.experts.linear_fc2.weight56
	module.decoder.layers.3.mlp.shared_experts.linear_fc1.weight
	module.decoder.layers.3.mlp.experts.linear_fc2.weight61
	module.decoder.layers.3.mlp.experts.linear_fc2.weight53
	module.decoder.layers.4.mlp.experts.linear_fc1.weight3
	module.decoder.layers.4.mlp.router.weight
	module.decoder.layers.4.pre_mlp_layernorm.weight
	module.decoder.layers.4.self_attention.linear_kv_up_proj.weight
	module.decoder.layers.4.input_layernorm.weight
	module.decoder.layers.3.mlp.experts.linear_fc2.weight58
	module.decoder.layers.3.mlp.experts.linear_fc2.weight50
	module.decoder.layers.3.mlp.experts.linear_fc2.weight63
	module.decoder.layers.3.mlp.experts.linear_fc2.weight55
Params for bucket 13 (40140800 elements, 40140800 padded size):
	module.decoder.layers.3.mlp.experts.linear_fc2.weight44
	module.decoder.layers.3.mlp.experts.linear_fc2.weight36
	module.decoder.layers.3.mlp.experts.linear_fc2.weight28
	module.decoder.layers.3.mlp.experts.linear_fc2.weight20
	module.decoder.layers.3.mlp.experts.linear_fc2.weight49
	module.decoder.layers.3.mlp.experts.linear_fc2.weight41
	module.decoder.layers.3.mlp.experts.linear_fc2.weight33
	module.decoder.layers.3.mlp.experts.linear_fc2.weight25
	module.decoder.layers.3.mlp.experts.linear_fc2.weight17
	module.decoder.layers.3.mlp.experts.linear_fc2.weight46
	module.decoder.layers.3.mlp.experts.linear_fc2.weight38
	module.decoder.layers.3.mlp.experts.linear_fc2.weight30
	module.decoder.layers.3.mlp.experts.linear_fc2.weight22
	module.decoder.layers.3.mlp.experts.linear_fc2.weight43
	module.decoder.layers.3.mlp.experts.linear_fc2.weight35
	module.decoder.layers.3.mlp.experts.linear_fc2.weight27
	module.decoder.layers.3.mlp.experts.linear_fc2.weight19
	module.decoder.layers.3.mlp.experts.linear_fc2.weight48
	module.decoder.layers.3.mlp.experts.linear_fc2.weight40
	module.decoder.layers.3.mlp.experts.linear_fc2.weight32
	module.decoder.layers.3.mlp.experts.linear_fc2.weight24
	module.decoder.layers.3.mlp.experts.linear_fc2.weight16
	module.decoder.layers.3.mlp.experts.linear_fc2.weight45
	module.decoder.layers.3.mlp.experts.linear_fc2.weight37
	module.decoder.layers.3.mlp.experts.linear_fc2.weight29
	module.decoder.layers.3.mlp.experts.linear_fc2.weight21
	module.decoder.layers.3.mlp.experts.linear_fc2.weight42
	module.decoder.layers.3.mlp.experts.linear_fc2.weight34
	module.decoder.layers.3.mlp.experts.linear_fc2.weight26
	module.decoder.layers.3.mlp.experts.linear_fc2.weight18
	module.decoder.layers.3.mlp.experts.linear_fc2.weight47
	module.decoder.layers.3.mlp.experts.linear_fc2.weight39
	module.decoder.layers.3.mlp.experts.linear_fc2.weight31
	module.decoder.layers.3.mlp.experts.linear_fc2.weight23
	module.decoder.layers.3.mlp.experts.linear_fc2.weight15
Params for bucket 14 (40140800 elements, 40140800 padded size):
	module.decoder.layers.3.mlp.experts.linear_fc2.weight12
	module.decoder.layers.3.mlp.experts.linear_fc2.weight4
	module.decoder.layers.3.mlp.experts.linear_fc1.weight61
	module.decoder.layers.3.mlp.experts.linear_fc2.weight9
	module.decoder.layers.3.mlp.experts.linear_fc2.weight2
	module.decoder.layers.3.mlp.experts.linear_fc2.weight1
	module.decoder.layers.3.mlp.experts.linear_fc1.weight58
	module.decoder.layers.3.mlp.experts.linear_fc2.weight14
	module.decoder.layers.3.mlp.experts.linear_fc2.weight6
	module.decoder.layers.3.mlp.experts.linear_fc1.weight63
	module.decoder.layers.3.mlp.experts.linear_fc1.weight55
	module.decoder.layers.3.mlp.experts.linear_fc2.weight11
	module.decoder.layers.3.mlp.experts.linear_fc2.weight3
	module.decoder.layers.3.mlp.experts.linear_fc1.weight60
	module.decoder.layers.3.mlp.experts.linear_fc2.weight8
	module.decoder.layers.3.mlp.experts.linear_fc1.weight57
	module.decoder.layers.3.mlp.experts.linear_fc2.weight13
	module.decoder.layers.3.mlp.experts.linear_fc2.weight5
	module.decoder.layers.3.mlp.experts.linear_fc1.weight62
	module.decoder.layers.3.mlp.experts.linear_fc1.weight54
	module.decoder.layers.3.mlp.experts.linear_fc2.weight10
	module.decoder.layers.3.mlp.experts.linear_fc2.weight0
	module.decoder.layers.3.mlp.experts.linear_fc1.weight59
	module.decoder.layers.3.mlp.experts.linear_fc2.weight7
	module.decoder.layers.3.mlp.experts.linear_fc1.weight56
Params for bucket 15 (41287680 elements, 41287680 padded size):
	module.decoder.layers.3.mlp.experts.linear_fc1.weight53
	module.decoder.layers.3.mlp.experts.linear_fc1.weight51
	module.decoder.layers.3.mlp.experts.linear_fc1.weight49
	module.decoder.layers.3.mlp.experts.linear_fc1.weight47
	module.decoder.layers.3.mlp.experts.linear_fc1.weight45
	module.decoder.layers.3.mlp.experts.linear_fc1.weight43
	module.decoder.layers.3.mlp.experts.linear_fc1.weight41
	module.decoder.layers.3.mlp.experts.linear_fc1.weight39
	module.decoder.layers.3.mlp.experts.linear_fc1.weight38
	module.decoder.layers.3.mlp.experts.linear_fc1.weight36
	module.decoder.layers.3.mlp.experts.linear_fc1.weight50
	module.decoder.layers.3.mlp.experts.linear_fc1.weight52
	module.decoder.layers.3.mlp.experts.linear_fc1.weight48
	module.decoder.layers.3.mlp.experts.linear_fc1.weight46
	module.decoder.layers.3.mlp.experts.linear_fc1.weight44
	module.decoder.layers.3.mlp.experts.linear_fc1.weight42
	module.decoder.layers.3.mlp.experts.linear_fc1.weight40
	module.decoder.layers.3.mlp.experts.linear_fc1.weight37
Params for bucket 16 (41287680 elements, 41287680 padded size):
	module.decoder.layers.3.mlp.experts.linear_fc1.weight35
	module.decoder.layers.3.mlp.experts.linear_fc1.weight31
	module.decoder.layers.3.mlp.experts.linear_fc1.weight33
	module.decoder.layers.3.mlp.experts.linear_fc1.weight29
	module.decoder.layers.3.mlp.experts.linear_fc1.weight27
	module.decoder.layers.3.mlp.experts.linear_fc1.weight25
	module.decoder.layers.3.mlp.experts.linear_fc1.weight23
	module.decoder.layers.3.mlp.experts.linear_fc1.weight21
	module.decoder.layers.3.mlp.experts.linear_fc1.weight19
	module.decoder.layers.3.mlp.experts.linear_fc1.weight32
	module.decoder.layers.3.mlp.experts.linear_fc1.weight34
	module.decoder.layers.3.mlp.experts.linear_fc1.weight30
	module.decoder.layers.3.mlp.experts.linear_fc1.weight28
	module.decoder.layers.3.mlp.experts.linear_fc1.weight26
	module.decoder.layers.3.mlp.experts.linear_fc1.weight24
	module.decoder.layers.3.mlp.experts.linear_fc1.weight22
	module.decoder.layers.3.mlp.experts.linear_fc1.weight20
	module.decoder.layers.3.mlp.experts.linear_fc1.weight18
Params for bucket 17 (41287680 elements, 41287680 padded size):
	module.decoder.layers.3.mlp.experts.linear_fc1.weight17
	module.decoder.layers.3.mlp.experts.linear_fc1.weight13
	module.decoder.layers.3.mlp.experts.linear_fc1.weight15
	module.decoder.layers.3.mlp.experts.linear_fc1.weight11
	module.decoder.layers.3.mlp.experts.linear_fc1.weight9
	module.decoder.layers.3.mlp.experts.linear_fc1.weight7
	module.decoder.layers.3.mlp.experts.linear_fc1.weight5
	module.decoder.layers.3.mlp.experts.linear_fc1.weight1
	module.decoder.layers.3.mlp.experts.linear_fc1.weight0
	module.decoder.layers.3.mlp.experts.linear_fc1.weight16
	module.decoder.layers.3.mlp.experts.linear_fc1.weight14
	module.decoder.layers.3.mlp.experts.linear_fc1.weight12
	module.decoder.layers.3.mlp.experts.linear_fc1.weight10
	module.decoder.layers.3.mlp.experts.linear_fc1.weight8
	module.decoder.layers.3.mlp.experts.linear_fc1.weight6
	module.decoder.layers.3.mlp.experts.linear_fc1.weight4
	module.decoder.layers.3.mlp.experts.linear_fc1.weight3
	module.decoder.layers.3.mlp.experts.linear_fc1.weight2
Params for bucket 18 (40635392 elements, 40635392 padded size):
	module.decoder.layers.3.self_attention.linear_kv_up_proj.weight
	module.decoder.layers.3.self_attention.linear_kv_up_proj.layer_norm_weight
	module.decoder.layers.2.mlp.experts.linear_fc2.weight62
	module.decoder.layers.2.mlp.experts.linear_fc2.weight54
	module.decoder.layers.2.mlp.experts.linear_fc2.weight46
	module.decoder.layers.3.self_attention.linear_q_proj.weight
	module.decoder.layers.3.self_attention.linear_proj.weight
	module.decoder.layers.2.mlp.shared_experts.linear_fc1.weight
	module.decoder.layers.2.mlp.experts.linear_fc2.weight59
	module.decoder.layers.2.mlp.experts.linear_fc2.weight51
	module.decoder.layers.2.mlp.experts.linear_fc2.weight43
	module.decoder.layers.2.mlp.experts.linear_fc2.weight56
	module.decoder.layers.2.mlp.experts.linear_fc2.weight48
	module.decoder.layers.2.mlp.experts.linear_fc2.weight40
	module.decoder.layers.3.mlp.router.weight
	module.decoder.layers.3.self_attention.linear_kv_down_proj.weight
	module.decoder.layers.2.mlp.experts.linear_fc2.weight61
	module.decoder.layers.2.mlp.experts.linear_fc2.weight53
	module.decoder.layers.2.mlp.experts.linear_fc2.weight45
	module.decoder.layers.2.mlp.experts.linear_fc2.weight58
	module.decoder.layers.2.mlp.experts.linear_fc2.weight50
	module.decoder.layers.2.mlp.experts.linear_fc2.weight42
	module.decoder.layers.2.mlp.experts.linear_fc2.weight55
	module.decoder.layers.2.mlp.shared_experts.linear_fc2.weight
	module.decoder.layers.2.mlp.experts.linear_fc2.weight63
	module.decoder.layers.2.mlp.experts.linear_fc2.weight47
	module.decoder.layers.3.pre_mlp_layernorm.weight
	module.decoder.layers.3.input_layernorm.weight
	module.decoder.layers.2.mlp.experts.linear_fc2.weight60
	module.decoder.layers.2.mlp.experts.linear_fc2.weight52
	module.decoder.layers.2.mlp.experts.linear_fc2.weight44
	module.decoder.layers.2.mlp.experts.linear_fc2.weight57
	module.decoder.layers.2.mlp.experts.linear_fc2.weight49
	module.decoder.layers.2.mlp.experts.linear_fc2.weight41
Params for bucket 19 (40140800 elements, 40140800 padded size):
	module.decoder.layers.2.mlp.experts.linear_fc2.weight38
	module.decoder.layers.2.mlp.experts.linear_fc2.weight30
	module.decoder.layers.2.mlp.experts.linear_fc2.weight22
	module.decoder.layers.2.mlp.experts.linear_fc2.weight10
	module.decoder.layers.2.mlp.experts.linear_fc2.weight35
	module.decoder.layers.2.mlp.experts.linear_fc2.weight27
	module.decoder.layers.2.mlp.experts.linear_fc2.weight19
	module.decoder.layers.2.mlp.experts.linear_fc2.weight7
	module.decoder.layers.2.mlp.experts.linear_fc2.weight32
	module.decoder.layers.2.mlp.experts.linear_fc2.weight24
	module.decoder.layers.2.mlp.experts.linear_fc2.weight16
	module.decoder.layers.2.mlp.experts.linear_fc2.weight12
	module.decoder.layers.2.mlp.experts.linear_fc2.weight37
	module.decoder.layers.2.mlp.experts.linear_fc2.weight29
	module.decoder.layers.2.mlp.experts.linear_fc2.weight21
	module.decoder.layers.2.mlp.experts.linear_fc2.weight9
	module.decoder.layers.2.mlp.experts.linear_fc2.weight34
	module.decoder.layers.2.mlp.experts.linear_fc2.weight26
	module.decoder.layers.2.mlp.experts.linear_fc2.weight18
	module.decoder.layers.2.mlp.experts.linear_fc2.weight15
	module.decoder.layers.2.mlp.experts.linear_fc2.weight6
	module.decoder.layers.2.mlp.experts.linear_fc2.weight39
	module.decoder.layers.2.mlp.experts.linear_fc2.weight31
	module.decoder.layers.2.mlp.experts.linear_fc2.weight23
	module.decoder.layers.2.mlp.experts.linear_fc2.weight11
	module.decoder.layers.2.mlp.experts.linear_fc2.weight36
	module.decoder.layers.2.mlp.experts.linear_fc2.weight28
	module.decoder.layers.2.mlp.experts.linear_fc2.weight20
	module.decoder.layers.2.mlp.experts.linear_fc2.weight13
	module.decoder.layers.2.mlp.experts.linear_fc2.weight8
	module.decoder.layers.2.mlp.experts.linear_fc2.weight33
	module.decoder.layers.2.mlp.experts.linear_fc2.weight25
	module.decoder.layers.2.mlp.experts.linear_fc2.weight17
	module.decoder.layers.2.mlp.experts.linear_fc2.weight14
	module.decoder.layers.2.mlp.experts.linear_fc2.weight5
Params for bucket 20 (40140800 elements, 40140800 padded size):
	module.decoder.layers.2.mlp.experts.linear_fc2.weight0
	module.decoder.layers.2.mlp.experts.linear_fc1.weight63
	module.decoder.layers.2.mlp.experts.linear_fc1.weight60
	module.decoder.layers.2.mlp.experts.linear_fc1.weight49
	module.decoder.layers.2.mlp.experts.linear_fc1.weight54
	module.decoder.layers.2.mlp.experts.linear_fc2.weight4
	module.decoder.layers.2.mlp.experts.linear_fc2.weight2
	module.decoder.layers.2.mlp.experts.linear_fc2.weight1
	module.decoder.layers.2.mlp.experts.linear_fc1.weight61
	module.decoder.layers.2.mlp.experts.linear_fc1.weight50
	module.decoder.layers.2.mlp.experts.linear_fc1.weight56
	module.decoder.layers.2.mlp.experts.linear_fc1.weight53
	module.decoder.layers.2.mlp.experts.linear_fc1.weight51
	module.decoder.layers.2.mlp.experts.linear_fc2.weight3
	module.decoder.layers.2.mlp.experts.linear_fc1.weight62
	module.decoder.layers.2.mlp.experts.linear_fc1.weight58
	module.decoder.layers.2.mlp.experts.linear_fc1.weight57
	module.decoder.layers.2.mlp.experts.linear_fc1.weight55
	module.decoder.layers.2.mlp.experts.linear_fc1.weight52
	module.decoder.layers.2.mlp.experts.linear_fc1.weight59
Params for bucket 21 (41287680 elements, 41287680 padded size):
	module.decoder.layers.2.mlp.experts.linear_fc1.weight36
	module.decoder.layers.2.mlp.experts.linear_fc1.weight31
	module.decoder.layers.2.mlp.experts.linear_fc1.weight34
	module.decoder.layers.2.mlp.experts.linear_fc1.weight37
	module.decoder.layers.2.mlp.experts.linear_fc1.weight38
	module.decoder.layers.2.mlp.experts.linear_fc1.weight48
	module.decoder.layers.2.mlp.experts.linear_fc1.weight46
	module.decoder.layers.2.mlp.experts.linear_fc1.weight45
	module.decoder.layers.2.mlp.experts.linear_fc1.weight47
	module.decoder.layers.2.mlp.experts.linear_fc1.weight44
	module.decoder.layers.2.mlp.experts.linear_fc1.weight43
	module.decoder.layers.2.mlp.experts.linear_fc1.weight42
	module.decoder.layers.2.mlp.experts.linear_fc1.weight41
	module.decoder.layers.2.mlp.experts.linear_fc1.weight40
	module.decoder.layers.2.mlp.experts.linear_fc1.weight39
	module.decoder.layers.2.mlp.experts.linear_fc1.weight35
	module.decoder.layers.2.mlp.experts.linear_fc1.weight32
	module.decoder.layers.2.mlp.experts.linear_fc1.weight33
Params for bucket 22 (41287680 elements, 41287680 padded size):
	module.decoder.layers.2.mlp.experts.linear_fc1.weight30
	module.decoder.layers.2.mlp.experts.linear_fc1.weight27
	module.decoder.layers.2.mlp.experts.linear_fc1.weight28
	module.decoder.layers.2.mlp.experts.linear_fc1.weight29
	module.decoder.layers.2.mlp.experts.linear_fc1.weight25
	module.decoder.layers.2.mlp.experts.linear_fc1.weight24
	module.decoder.layers.2.mlp.experts.linear_fc1.weight20
	module.decoder.layers.2.mlp.experts.linear_fc1.weight19
	module.decoder.layers.2.mlp.experts.linear_fc1.weight16
	module.decoder.layers.2.mlp.experts.linear_fc1.weight14
	module.decoder.layers.2.mlp.experts.linear_fc1.weight26
	module.decoder.layers.2.mlp.experts.linear_fc1.weight23
	module.decoder.layers.2.mlp.experts.linear_fc1.weight22
	module.decoder.layers.2.mlp.experts.linear_fc1.weight21
	module.decoder.layers.2.mlp.experts.linear_fc1.weight18
	module.decoder.layers.2.mlp.experts.linear_fc1.weight17
	module.decoder.layers.2.mlp.experts.linear_fc1.weight15
	module.decoder.layers.2.mlp.experts.linear_fc1.weight13
Params for bucket 23 (42929152 elements, 42929152 padded size):
	module.decoder.layers.2.self_attention.linear_kv_up_proj.weight
	module.decoder.layers.2.mlp.router.weight
	module.decoder.layers.2.mlp.experts.linear_fc1.weight8
	module.decoder.layers.2.mlp.experts.linear_fc1.weight6
	module.decoder.layers.2.pre_mlp_layernorm.weight
	module.decoder.layers.2.self_attention.linear_kv_up_proj.layer_norm_weight
	module.decoder.layers.2.self_attention.linear_kv_down_proj.weight
	module.decoder.layers.2.self_attention.linear_proj.weight
	module.decoder.layers.2.self_attention.linear_q_proj.weight
	module.decoder.layers.2.mlp.experts.linear_fc1.weight9
	module.decoder.layers.2.mlp.experts.linear_fc1.weight5
	module.decoder.layers.2.mlp.experts.linear_fc1.weight3
	module.decoder.layers.2.mlp.experts.linear_fc1.weight11
	module.decoder.layers.2.mlp.experts.linear_fc1.weight1
	module.decoder.layers.2.mlp.experts.linear_fc1.weight10
	module.decoder.layers.2.mlp.experts.linear_fc1.weight0
	module.decoder.layers.1.mlp.shared_experts.linear_fc2.weight
	module.decoder.layers.2.mlp.experts.linear_fc1.weight2
	module.decoder.layers.2.input_layernorm.weight
	module.decoder.layers.1.mlp.shared_experts.linear_fc1.weight
	module.decoder.layers.2.mlp.experts.linear_fc1.weight12
	module.decoder.layers.2.mlp.experts.linear_fc1.weight7
	module.decoder.layers.2.mlp.experts.linear_fc1.weight4
Params for bucket 24 (40140800 elements, 40140800 padded size):
	module.decoder.layers.1.mlp.experts.linear_fc2.weight49
	module.decoder.layers.1.mlp.experts.linear_fc2.weight48
	module.decoder.layers.1.mlp.experts.linear_fc2.weight59
	module.decoder.layers.1.mlp.experts.linear_fc2.weight41
	module.decoder.layers.1.mlp.experts.linear_fc2.weight34
	module.decoder.layers.1.mlp.experts.linear_fc2.weight32
	module.decoder.layers.1.mlp.experts.linear_fc2.weight29
	module.decoder.layers.1.mlp.experts.linear_fc2.weight58
	module.decoder.layers.1.mlp.experts.linear_fc2.weight55
	module.decoder.layers.1.mlp.experts.linear_fc2.weight52
	module.decoder.layers.1.mlp.experts.linear_fc2.weight51
	module.decoder.layers.1.mlp.experts.linear_fc2.weight42
	module.decoder.layers.1.mlp.experts.linear_fc2.weight45
	module.decoder.layers.1.mlp.experts.linear_fc2.weight61
	module.decoder.layers.1.mlp.experts.linear_fc2.weight44
	module.decoder.layers.1.mlp.experts.linear_fc2.weight38
	module.decoder.layers.1.mlp.experts.linear_fc2.weight36
	module.decoder.layers.1.mlp.experts.linear_fc2.weight43
	module.decoder.layers.1.mlp.experts.linear_fc2.weight35
	module.decoder.layers.1.mlp.experts.linear_fc2.weight54
	module.decoder.layers.1.mlp.experts.linear_fc2.weight53
	module.decoder.layers.1.mlp.experts.linear_fc2.weight47
	module.decoder.layers.1.mlp.experts.linear_fc2.weight46
	module.decoder.layers.1.mlp.experts.linear_fc2.weight33
	module.decoder.layers.1.mlp.experts.linear_fc2.weight31
	module.decoder.layers.1.mlp.experts.linear_fc2.weight30
	module.decoder.layers.1.mlp.experts.linear_fc2.weight63
	module.decoder.layers.1.mlp.experts.linear_fc2.weight57
	module.decoder.layers.1.mlp.experts.linear_fc2.weight40
	module.decoder.layers.1.mlp.experts.linear_fc2.weight62
	module.decoder.layers.1.mlp.experts.linear_fc2.weight60
	module.decoder.layers.1.mlp.experts.linear_fc2.weight56
	module.decoder.layers.1.mlp.experts.linear_fc2.weight50
	module.decoder.layers.1.mlp.experts.linear_fc2.weight39
	module.decoder.layers.1.mlp.experts.linear_fc2.weight37
Params for bucket 25 (40140800 elements, 40140800 padded size):
	module.decoder.layers.1.mlp.experts.linear_fc2.weight20
	module.decoder.layers.1.mlp.experts.linear_fc2.weight18
	module.decoder.layers.1.mlp.experts.linear_fc1.weight63
	module.decoder.layers.1.mlp.experts.linear_fc2.weight28
	module.decoder.layers.1.mlp.experts.linear_fc2.weight17
	module.decoder.layers.1.mlp.experts.linear_fc2.weight13
	module.decoder.layers.1.mlp.experts.linear_fc1.weight61
	module.decoder.layers.1.mlp.experts.linear_fc2.weight25
	module.decoder.layers.1.mlp.experts.linear_fc2.weight14
	module.decoder.layers.1.mlp.experts.linear_fc2.weight9
	module.decoder.layers.1.mlp.experts.linear_fc2.weight8
	module.decoder.layers.1.mlp.experts.linear_fc2.weight6
	module.decoder.layers.1.mlp.experts.linear_fc2.weight24
	module.decoder.layers.1.mlp.experts.linear_fc2.weight16
	module.decoder.layers.1.mlp.experts.linear_fc2.weight26
	module.decoder.layers.1.mlp.experts.linear_fc2.weight27
	module.decoder.layers.1.mlp.experts.linear_fc2.weight22
	module.decoder.layers.1.mlp.experts.linear_fc2.weight21
	module.decoder.layers.1.mlp.experts.linear_fc2.weight15
	module.decoder.layers.1.mlp.experts.linear_fc2.weight2
	module.decoder.layers.1.mlp.experts.linear_fc2.weight19
	module.decoder.layers.1.mlp.experts.linear_fc2.weight10
	module.decoder.layers.1.mlp.experts.linear_fc2.weight7
	module.decoder.layers.1.mlp.experts.linear_fc2.weight4
	module.decoder.layers.1.mlp.experts.linear_fc2.weight1
	module.decoder.layers.1.mlp.experts.linear_fc2.weight12
	module.decoder.layers.1.mlp.experts.linear_fc2.weight3
	module.decoder.layers.1.mlp.experts.linear_fc1.weight62
	module.decoder.layers.1.mlp.experts.linear_fc2.weight23
	module.decoder.layers.1.mlp.experts.linear_fc2.weight11
	module.decoder.layers.1.mlp.experts.linear_fc2.weight5
	module.decoder.layers.1.mlp.experts.linear_fc2.weight0
Params for bucket 26 (41287680 elements, 41287680 padded size):
	module.decoder.layers.1.mlp.experts.linear_fc1.weight56
	module.decoder.layers.1.mlp.experts.linear_fc1.weight57
	module.decoder.layers.1.mlp.experts.linear_fc1.weight58
	module.decoder.layers.1.mlp.experts.linear_fc1.weight53
	module.decoder.layers.1.mlp.experts.linear_fc1.weight52
	module.decoder.layers.1.mlp.experts.linear_fc1.weight50
	module.decoder.layers.1.mlp.experts.linear_fc1.weight49
	module.decoder.layers.1.mlp.experts.linear_fc1.weight45
	module.decoder.layers.1.mlp.experts.linear_fc1.weight60
	module.decoder.layers.1.mlp.experts.linear_fc1.weight59
	module.decoder.layers.1.mlp.experts.linear_fc1.weight55
	module.decoder.layers.1.mlp.experts.linear_fc1.weight54
	module.decoder.layers.1.mlp.experts.linear_fc1.weight51
	module.decoder.layers.1.mlp.experts.linear_fc1.weight48
	module.decoder.layers.1.mlp.experts.linear_fc1.weight47
	module.decoder.layers.1.mlp.experts.linear_fc1.weight46
	module.decoder.layers.1.mlp.experts.linear_fc1.weight44
	module.decoder.layers.1.mlp.experts.linear_fc1.weight43
Params for bucket 27 (41287680 elements, 41287680 padded size):
	module.decoder.layers.1.mlp.experts.linear_fc1.weight41
	module.decoder.layers.1.mlp.experts.linear_fc1.weight39
	module.decoder.layers.1.mlp.experts.linear_fc1.weight40
	module.decoder.layers.1.mlp.experts.linear_fc1.weight37
	module.decoder.layers.1.mlp.experts.linear_fc1.weight31
	module.decoder.layers.1.mlp.experts.linear_fc1.weight27
	module.decoder.layers.1.mlp.experts.linear_fc1.weight26
	module.decoder.layers.1.mlp.experts.linear_fc1.weight42
	module.decoder.layers.1.mlp.experts.linear_fc1.weight38
	module.decoder.layers.1.mlp.experts.linear_fc1.weight36
	module.decoder.layers.1.mlp.experts.linear_fc1.weight35
	module.decoder.layers.1.mlp.experts.linear_fc1.weight34
	module.decoder.layers.1.mlp.experts.linear_fc1.weight33
	module.decoder.layers.1.mlp.experts.linear_fc1.weight32
	module.decoder.layers.1.mlp.experts.linear_fc1.weight30
	module.decoder.layers.1.mlp.experts.linear_fc1.weight29
	module.decoder.layers.1.mlp.experts.linear_fc1.weight28
	module.decoder.layers.1.mlp.experts.linear_fc1.weight25
Params for bucket 28 (41287680 elements, 41287680 padded size):
	module.decoder.layers.1.mlp.experts.linear_fc1.weight24
	module.decoder.layers.1.mlp.experts.linear_fc1.weight20
	module.decoder.layers.1.mlp.experts.linear_fc1.weight23
	module.decoder.layers.1.mlp.experts.linear_fc1.weight21
	module.decoder.layers.1.mlp.experts.linear_fc1.weight19
	module.decoder.layers.1.mlp.experts.linear_fc1.weight17
	module.decoder.layers.1.mlp.experts.linear_fc1.weight16
	module.decoder.layers.1.mlp.experts.linear_fc1.weight15
	module.decoder.layers.1.mlp.experts.linear_fc1.weight11
	module.decoder.layers.1.mlp.experts.linear_fc1.weight10
	module.decoder.layers.1.mlp.experts.linear_fc1.weight8
	module.decoder.layers.1.mlp.experts.linear_fc1.weight7
	module.decoder.layers.1.mlp.experts.linear_fc1.weight22
	module.decoder.layers.1.mlp.experts.linear_fc1.weight18
	module.decoder.layers.1.mlp.experts.linear_fc1.weight14
	module.decoder.layers.1.mlp.experts.linear_fc1.weight13
	module.decoder.layers.1.mlp.experts.linear_fc1.weight12
	module.decoder.layers.1.mlp.experts.linear_fc1.weight9
Params for bucket 29 (49810432 elements, 49810432 padded size):
	module.decoder.layers.1.mlp.experts.linear_fc1.weight6
	module.decoder.layers.1.mlp.experts.linear_fc1.weight5
	module.decoder.layers.1.mlp.experts.linear_fc1.weight3
	module.decoder.layers.1.mlp.experts.linear_fc1.weight2
	module.decoder.layers.1.mlp.experts.linear_fc1.weight1
	module.decoder.layers.1.mlp.experts.linear_fc1.weight0
	module.decoder.layers.1.mlp.router.weight
	module.decoder.layers.1.pre_mlp_layernorm.weight
	module.decoder.layers.1.self_attention.linear_kv_up_proj.layer_norm_weight
	module.decoder.layers.1.self_attention.linear_kv_down_proj.weight
	module.decoder.layers.0.mlp.linear_fc1.weight
	module.decoder.layers.1.mlp.experts.linear_fc1.weight4
	module.decoder.layers.1.self_attention.linear_kv_up_proj.weight
	module.decoder.layers.1.input_layernorm.weight
	module.decoder.layers.0.mlp.linear_fc2.weight
	module.decoder.layers.1.self_attention.linear_proj.weight
	module.decoder.layers.1.self_attention.linear_q_proj.weight
Params for bucket 30 (200543232 elements, 200543232 padded size):
	module.decoder.layers.0.self_attention.linear_kv_up_proj.weight
	module.decoder.layers.0.self_attention.linear_q_proj.weight
	module.decoder.layers.0.self_attention.linear_proj.weight
	module.decoder.layers.0.input_layernorm.weight
	module.embedding.word_embeddings.weight
	module.decoder.layers.0.mlp.linear_fc1.layer_norm_weight
	module.decoder.layers.0.self_attention.linear_kv_up_proj.layer_norm_weight
	module.decoder.layers.0.self_attention.linear_kv_down_proj.weight
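以上 30 个桶就是 `bucket_size=40000000`(4000 万元素)的产物:参数按反向传播完成的顺序装桶(bucket 1 是最后一层 `layers.5`,bucket 30 是 `layers.0` 和词嵌入),攒满即封桶,这样每个桶的梯度一就绪就能异步发起 reduce-scatter,与后续反向计算重叠(`overlap_grad_reduce=True`)。顺带一提,从桶大小可以反推专家权重形状:40,140,800 = 35 × 1,146,880,而 1,146,880 = 1280 × 896,推测 hidden_size=1280、moe_ffn_hidden_size=896;fc1 桶的 41,287,680 = 18 × 2,293,760,恰是 fc2 的两倍,与 `swiglu=True`(gate + up 两个投影)一致。分桶逻辑的示意如下(非 Megatron 源码):

```python
# 贪心分桶示意:按顺序累加参数元素数,达到 bucket_size 即封桶
def build_buckets(param_numels, bucket_size=40_000_000):
    buckets, cur, cur_numel = [], [], 0
    for i, n in enumerate(param_numels):
        cur.append(i)
        cur_numel += n
        if cur_numel >= bucket_size:
            buckets.append((cur, cur_numel))
            cur, cur_numel = [], 0
    if cur:
        buckets.append((cur, cur_numel))
    return buckets

# 64 个专家的 fc2 权重,每个 1280*896 个元素(推测的形状,仅作示意)
sizes = [1280 * 896] * 64
for idx, (params, numel) in enumerate(build_buckets(sizes), 1):
    print(f"bucket {idx}: {len(params)} params, {numel} elements")
# bucket 1: 35 params, 40140800 elements  <- 与日志中 fc2 桶的大小吻合
```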
INFO:megatron.core.optimizer:Setting up optimizer with config OptimizerConfig(optimizer='adam', lr=0.003, min_lr=0.0003, decoupled_lr=None, decoupled_min_lr=None, weight_decay=0.1, fp16=False, bf16=True, params_dtype=torch.bfloat16, use_precision_aware_optimizer=False, store_param_remainders=True, main_grads_dtype=torch.float32, main_params_dtype=torch.float32, exp_avg_dtype=torch.float32, exp_avg_sq_dtype=torch.float32, loss_scale=None, initial_loss_scale=4294967296, min_loss_scale=1.0, loss_scale_window=1000, hysteresis=2, adam_beta1=0.9, adam_beta2=0.95, adam_eps=1e-08, sgd_momentum=0.9, use_distributed_optimizer=True, overlap_param_gather_with_optimizer_step=False, optimizer_cpu_offload=False, optimizer_offload_fraction=1.0, use_torch_optimizer_for_cpu_offload=False, overlap_cpu_optimizer_d2h_h2d=False, pin_cpu_grads=True, pin_cpu_params=True, clip_grad=1.0, log_num_zeros_in_grad=True, barrier_with_L1_time=True, timers=<megatron.core.timers.Timers object at 0x7f6c8e993320>, config_logger_dir='')
setting training iterations to 24000
INFO:megatron.core.optimizer_param_scheduler:> learning rate decay style: cosine
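优化器与调度器的要点:AdamW(β=(0.9, 0.95)、eps=1e-8、weight_decay=0.1),bf16 训练 + fp32 梯度规约与主权重(`grad_reduce_in_fp32=True`),`clip_grad=1.0`;学习率从 3e-3 余弦退火到 3e-4,共 24000 步。余弦曲线的形状可以这样算(未包含 warmup,warmup 步数见前文参数,此处从略):

```python
import math

# Megatron 风格的余弦退火:lr = min_lr + (max_lr - min_lr) * 0.5 * (1 + cos(pi * t))
def cosine_lr(step, total_steps=24_000, lr_max=3e-3, lr_min=3e-4):
    t = min(step, total_steps) / total_steps
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * t))

print(cosine_lr(0))       # ≈ 0.003
print(cosine_lr(12_000))  # ≈ 0.00165,中点
print(cosine_lr(24_000))  # 0.0003
```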
WARNING: could not find the metadata file /data2/workspaces/FlagAI_OpenSeek/OpenSeek-Small-v1-Baseline/checkpoints/latest_checkpointed_iteration.txt
    will not load any checkpoints and will start from random
(min, max) time across ranks (ms):
    load-checkpoint ................................: (0.41, 0.41)
[after model, optimizer, and learning rate scheduler are built] datetime: 2025-06-27 12:23:30 
> building train, validation, and test datasets ...
 > datasets target sizes (minimum size):
    train:      24576000
    validation: 0
    test:       0
INFO:megatron.core.datasets.blended_megatron_dataset_config:Let split_matrix = [(0, 1.0), None, None]
> building train, validation, and test datasets for GPT ...
INFO:megatron.core.datasets.blended_megatron_dataset_builder:Building GPTDataset splits with sizes=(24576000, 0, 0) and config=GPTDatasetConfig(random_seed=42, sequence_length=4096, blend=(['../OpenSeek-Pretrain-100B/Nemotron-CC-high-actual-actual-high/part_142_text_document', '../OpenSeek-Pretrain-100B/Nemotron-CC-high-actual-actual-low/part_62_text_document', '../OpenSeek-Pretrain-100B/Nemotron-CC-high-actual-actual-mid/part_189_text_document', '../OpenSeek-Pretrain-100B/Nemotron-CC-high-synthetic-distill-high/part_76_text_document', '../OpenSeek-Pretrain-100B/Nemotron-CC-high-synthetic-distill-low/part_124_text_document', '../OpenSeek-Pretrain-100B/Nemotron-CC-high-synthetic-distill-mid/part_29_text_document', '../OpenSeek-Pretrain-100B/Nemotron-CC-high-synthetic-diverse_qa_pairs-high/part_244_text_document', '../OpenSeek-Pretrain-100B/Nemotron-CC-high-synthetic-diverse_qa_pairs-low/part_150_text_document', '../OpenSeek-Pretrain-100B/Nemotron-CC-high-synthetic-diverse_qa_pairs-mid/part_444_text_document', '../OpenSeek-Pretrain-100B/Nemotron-CC-high-synthetic-extract_knowledge-high/part_498_text_document', '../OpenSeek-Pretrain-100B/Nemotron-CC-high-synthetic-extract_knowledge-low/part_10_text_document', '../OpenSeek-Pretrain-100B/Nemotron-CC-high-synthetic-extract_knowledge-mid/part_144_text_document', '../OpenSeek-Pretrain-100B/Nemotron-CC-high-synthetic-knowledge_list-high/part_86_text_document', '../OpenSeek-Pretrain-100B/Nemotron-CC-high-synthetic-knowledge_list-low/part_133_text_document', '../OpenSeek-Pretrain-100B/Nemotron-CC-high-synthetic-knowledge_list-mid/part_139_text_document', '../OpenSeek-Pretrain-100B/Nemotron-CC-high-synthetic-wrap_medium-high/part_47_text_document', '../OpenSeek-Pretrain-100B/Nemotron-CC-high-synthetic-wrap_medium-low/part_11_text_document', '../OpenSeek-Pretrain-100B/Nemotron-CC-high-synthetic-wrap_medium-mid/part_97_text_document', '../OpenSeek-Pretrain-100B/Nemotron-CC-low-synthetic-wrap_medium-high/part_43_text_document', '../OpenSeek-Pretrain-100B/Nemotron-CC-low-synthetic-wrap_medium-low/part_10_text_document', '../OpenSeek-Pretrain-100B/Nemotron-CC-low-synthetic-wrap_medium-mid/part_164_text_document', '../OpenSeek-Pretrain-100B/Nemotron-CC-medium-actual-actual-high/part_92_text_document', '../OpenSeek-Pretrain-100B/Nemotron-CC-medium-actual-actual-low/part_113_text_document', '../OpenSeek-Pretrain-100B/Nemotron-CC-medium-actual-actual-mid/part_563_text_document', '../OpenSeek-Pretrain-100B/arxiv/007_00000_text_document', '../OpenSeek-Pretrain-100B/books/016_00007_text_document', '../OpenSeek-Pretrain-100B/code-high/part_13_text_document', '../OpenSeek-Pretrain-100B/code-low/part_36_text_document', '../OpenSeek-Pretrain-100B/code-mid/part_37_text_document', '../OpenSeek-Pretrain-100B/cot_synthesis2_CC-high/23_text_document', '../OpenSeek-Pretrain-100B/cot_synthesis2_CC-low/51_text_document', '../OpenSeek-Pretrain-100B/cot_synthesis2_CC-mid/118_text_document', '../OpenSeek-Pretrain-100B/cot_synthesis2_CC-mid/176_text_document', '../OpenSeek-Pretrain-100B/cot_synthesis2_CC-mid/256_text_document', '../OpenSeek-Pretrain-100B/cot_synthesis2_CC-mid/320_text_document', '../OpenSeek-Pretrain-100B/cot_synthesis2_CC-mid/32_text_document', '../OpenSeek-Pretrain-100B/cot_synthesis2_OpenSource-high/1_text_document', '../OpenSeek-Pretrain-100B/cot_synthesis2_OpenSource-low/2_text_document', '../OpenSeek-Pretrain-100B/cot_synthesis2_OpenSource-mid/3_text_document', '../OpenSeek-Pretrain-100B/cot_synthesis2_arxiv-high/2_text_document', 
'../OpenSeek-Pretrain-100B/cot_synthesis2_arxiv-low/1_text_document', '../OpenSeek-Pretrain-100B/cot_synthesis2_arxiv-mid/2_text_document', '../OpenSeek-Pretrain-100B/cot_synthesis2_code-high/4_text_document', '../OpenSeek-Pretrain-100B/cot_synthesis2_code-low/6_text_document', '../OpenSeek-Pretrain-100B/cot_synthesis2_code-mid/23_text_document', '../OpenSeek-Pretrain-100B/cot_synthesis2_math-high/12_text_document', '../OpenSeek-Pretrain-100B/cot_synthesis2_math-low/3_text_document', '../OpenSeek-Pretrain-100B/cot_synthesis2_math-mid/5_text_document', '../OpenSeek-Pretrain-100B/cot_synthesis2_wiki-high/5_text_document', '../OpenSeek-Pretrain-100B/cot_synthesis2_wiki-low/5_text_document', '../OpenSeek-Pretrain-100B/cot_synthesis2_wiki-mid/4_text_document', '../OpenSeek-Pretrain-100B/cot_synthesis_CC-high/74_text_document', '../OpenSeek-Pretrain-100B/cot_synthesis_CC-low/54_text_document', '../OpenSeek-Pretrain-100B/cot_synthesis_CC-mid/275_text_document', '../OpenSeek-Pretrain-100B/cot_synthesis_OpenSource-high/4_text_document', '../OpenSeek-Pretrain-100B/cot_synthesis_OpenSource-low/2_text_document', '../OpenSeek-Pretrain-100B/cot_synthesis_OpenSource-mid/6_text_document', '../OpenSeek-Pretrain-100B/cot_synthesis_arxiv-high/2_text_document', '../OpenSeek-Pretrain-100B/cot_synthesis_arxiv-low/1_text_document', '../OpenSeek-Pretrain-100B/cot_synthesis_arxiv-mid/1_text_document', '../OpenSeek-Pretrain-100B/cot_synthesis_code-high/13_text_document', '../OpenSeek-Pretrain-100B/cot_synthesis_code-low/9_text_document', '../OpenSeek-Pretrain-100B/cot_synthesis_code-mid/6_text_document', '../OpenSeek-Pretrain-100B/cot_synthesis_math-high/11_text_document', '../OpenSeek-Pretrain-100B/cot_synthesis_math-low/11_text_document', '../OpenSeek-Pretrain-100B/cot_synthesis_math-mid/29_text_document', '../OpenSeek-Pretrain-100B/cot_synthesis_wiki-high/4_text_document', '../OpenSeek-Pretrain-100B/cot_synthesis_wiki-low/6_text_document', '../OpenSeek-Pretrain-100B/cot_synthesis_wiki-mid/3_text_document', '../OpenSeek-Pretrain-100B/math-high/part_04_text_document', '../OpenSeek-Pretrain-100B/math-low/part_10_text_document', '../OpenSeek-Pretrain-100B/math-mid/part_07_text_document', '../OpenSeek-Pretrain-100B/pes2o-full-train/train-0041-of-0136_text_document', '../OpenSeek-Pretrain-100B/pes2o-full-train/train-0125-of-0136_text_document', '../OpenSeek-Pretrain-100B/pes2o-full-val/valid-0034-of-0060_text_document', '../OpenSeek-Pretrain-100B/pes2o/pubmedcentral_3_text_document', '../OpenSeek-Pretrain-100B/stack/018_00000_text_document', '../OpenSeek-Pretrain-100B/wiki/012_00000_text_document', '../OpenSeek-Pretrain-100B/zh_cc-high-loss0/part_28_text_document', '../OpenSeek-Pretrain-100B/zh_cc-high-loss1/part_59_text_document', '../OpenSeek-Pretrain-100B/zh_cc-high-loss2/part_16_text_document', '../OpenSeek-Pretrain-100B/zh_cc-medidum-loss0/part_192_text_document', '../OpenSeek-Pretrain-100B/zh_cc-medidum-loss1/part_550_text_document', '../OpenSeek-Pretrain-100B/zh_cc-medidum-loss2/part_71_text_document'], [1.1068, 0.3577, 0.7775, 0.2859, 0.1672, 0.2339, 0.5397, 0.4064, 0.5005, 0.4616, 0.067, 0.3429, 0.261, 0.1824, 0.2313, 0.8237, 0.2866, 0.667, 0.4657, 0.2005, 0.4317, 1.1397, 0.6782, 0.9175, 0.6414, 0.4696, 1.0102, 1.1403, 0.9674, 0.3755, 0.0499, 0.3608, 0.3623, 0.3704, 0.3733, 0.3631, 0.2573, 0.1638, 0.3251, 6.0237, 8.9063, 10.1376, 0.4598, 0.6857, 0.899, 1.3135, 1.653, 0.3536, 0.6314, 0.5978, 0.7909, 0.2225, 0.1797, 0.2042, 0.4081, 0.1659, 1.2828, 5.68, 7.4907, 8.9359, 0.7663, 0.4052, 0.1916, 0.5074, 0.6437, 
0.6406, 0.4, 0.3564, 0.5768, 1.8165, 1.694, 1.6311, 0.687, 0.7387, 0.0143, 6.1982, 0.4229, 0.4202, 1.8171, 0.9776, 0.3725, 0.9492, 0.9236, 1.0643]), blend_per_split=None, split='1', split_matrix=[(0, 1.0), None, None], num_dataset_builder_threads=1, path_to_cache=None, mmap_bin_files=False, mock=False, tokenizer=<megatron.training.tokenizer.tokenizer._QwenTokenizerFS object at 0x7f6c8f08bcb0>, mid_level_dataset_surplus=0.005, reset_position_ids=True, reset_attention_mask=True, eod_mask_loss=False, create_attention_mask=True, drop_last_partial_validation_sequence=True, add_extra_token_to_sequence=True, object_storage_cache_path=None)
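A few things in the `GPTDatasetConfig` line above are worth unpacking. `sizes=(24576000, 0, 0)` together with `split='1'` and `split_matrix=[(0, 1.0), None, None]` means every document goes to the training split: 24,576,000 samples of `sequence_length=4096`, i.e. roughly 100.7B tokens, which is one pass over the OpenSeek-Pretrain-100B mix. The `blend` field pairs each shard path with a weight; as far as I can tell, Megatron-Core normalizes these weights into per-source sampling probabilities, so only their ratios matter. A minimal sketch of the arithmetic (variable names are mine, not FlagScale's):

```python
# Sanity-check the headline numbers in the GPTDatasetConfig log line above.
seq_len = 4096
train_samples = 24_576_000
print(f"{train_samples * seq_len:,}")   # 100,663,296,000 -> ~100B tokens, one pass over the mix

# blend = (paths, weights); the weights get normalized into per-source
# sampling probabilities, so only their ratios matter.
weights = [1.1068, 0.3577, 0.7775]      # first three weights from the log
probs = [w / sum(weights) for w in weights]
print([round(p, 4) for p in probs])
```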
INFO:megatron.core.datasets.indexed_dataset:Load the _IndexReader from ../OpenSeek-Pretrain-100B/Nemotron-CC-high-actual-actual-high/part_142_text_document.idx
INFO:megatron.core.datasets.indexed_dataset:	Extract the sequence lengths
INFO:megatron.core.datasets.indexed_dataset:	Extract the sequence pointers
INFO:megatron.core.datasets.indexed_dataset:	Extract the document indices
INFO:megatron.core.datasets.indexed_dataset:> total number of sequences: 1000000
INFO:megatron.core.datasets.indexed_dataset:> total number of documents: 1000000
INFO:megatron.core.datasets.gpt_dataset:Load the GPTDataset train indices
INFO:megatron.core.datasets.gpt_dataset:	Load the document index from 3bcf97bf902bc7c1b1bdfc6f43697d5c-GPTDataset-train-document_index.npy
INFO:megatron.core.datasets.gpt_dataset:	Load the sample index from 3bcf97bf902bc7c1b1bdfc6f43697d5c-GPTDataset-train-sample_index.npy
INFO:megatron.core.datasets.gpt_dataset:	Load the shuffle index from 3bcf97bf902bc7c1b1bdfc6f43697d5c-GPTDataset-train-shuffle_index.npy
INFO:megatron.core.datasets.gpt_dataset:> total number of samples: 278453
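Every source in the blend is loaded with the same three steps: the `_IndexReader` parses the `.idx` file that sits next to each `.bin` shard (sequence lengths, byte pointers, document boundaries); the `GPTDataset` document/sample/shuffle indices are loaded from `.npy` caches whose filenames are prefixed with a hash of the dataset config; and the resulting train sample count is reported. Because `reset_position_ids=True` and `reset_attention_mask=True`, documents are packed back to back and cut into fixed 4096-token samples, so a source's sample count is essentially its token count divided by 4096. A quick check on the shard above (a sketch, not FlagScale code):

```python
# "total number of samples: 278453": documents are packed contiguously and
# sliced into fixed-length samples, so samples ~= total_tokens // seq_len.
samples, seq_len = 278_453, 4096
print(f"{samples * seq_len:,}")  # 1,140,543,488 -> ~1.14B tokens,
                                 # consistent with this source's share of the mix
```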
The same three-step block then repeats verbatim for every remaining source in the blend; only the shard path, the cache hash, and the two counts change. Condensed from the log (paths are relative to `../OpenSeek-Pretrain-100B/` with the `_text_document` suffix dropped; sequences equal documents for every shard):

| Shard | Documents | Train samples |
| --- | --- | --- |
| Nemotron-CC-high-actual-actual-low/part_62 | 1000000 | 90001 |
| Nemotron-CC-high-actual-actual-mid/part_189 | 1000000 | 195608 |
| Nemotron-CC-high-synthetic-distill-high/part_76 | 1000000 | 71916 |
| Nemotron-CC-high-synthetic-distill-low/part_124 | 1000000 | 42075 |
| Nemotron-CC-high-synthetic-distill-mid/part_29 | 1000000 | 58837 |
| Nemotron-CC-high-synthetic-diverse_qa_pairs-high/part_244 | 1000000 | 135775 |
| Nemotron-CC-high-synthetic-diverse_qa_pairs-low/part_150 | 1000000 | 102232 |
| Nemotron-CC-high-synthetic-diverse_qa_pairs-mid/part_444 | 1000000 | 125911 |
| Nemotron-CC-high-synthetic-extract_knowledge-high/part_498 | 1000000 | 116141 |
| Nemotron-CC-high-synthetic-extract_knowledge-low/part_10 | 325354 | 16844 |
| Nemotron-CC-high-synthetic-extract_knowledge-mid/part_144 | 1000000 | 86258 |
| Nemotron-CC-high-synthetic-knowledge_list-high/part_86 | 1000000 | 65662 |
| Nemotron-CC-high-synthetic-knowledge_list-low/part_133 | 1000000 | 45891 |
| Nemotron-CC-high-synthetic-knowledge_list-mid/part_139 | 1000000 | 58196 |
| Nemotron-CC-high-synthetic-wrap_medium-high/part_47 | 1000000 | 207235 |
| Nemotron-CC-high-synthetic-wrap_medium-low/part_11 | 1000000 | 72100 |
| Nemotron-CC-high-synthetic-wrap_medium-mid/part_97 | 1000000 | 167804 |
| Nemotron-CC-low-synthetic-wrap_medium-high/part_43 | 1000000 | 117162 |
| Nemotron-CC-low-synthetic-wrap_medium-low/part_10 | 1000000 | 50433 |
| Nemotron-CC-low-synthetic-wrap_medium-mid/part_164 | 1000000 | 108609 |
| Nemotron-CC-medium-actual-actual-high/part_92 | 1000000 | 286720 |
| Nemotron-CC-medium-actual-actual-low/part_113 | 1000000 | 170626 |
| Nemotron-CC-medium-actual-actual-mid/part_563 | 1000000 | 230810 |
| arxiv/007_00000 | 36740 | 161355 |
| books/016_00007 | 4969 | 118143 |
| code-high/part_13 | 1000000 | 254137 |
| code-low/part_36 | 1000000 | 286865 |
| code-mid/part_37 | 1000000 | 243365 |
| cot_synthesis2_CC-high/23 | 500000 | 94468 |
| cot_synthesis2_CC-low/51 | 73459 | 12546 |
| cot_synthesis2_CC-mid/118 | 500000 | 90757 |
| cot_synthesis2_CC-mid/176 | 500000 | 91135 |
| cot_synthesis2_CC-mid/256 | 500000 | 93182 |
| cot_synthesis2_CC-mid/320 | 500000 | 93904 |
| cot_synthesis2_CC-mid/32 | 500000 | 91340 |
| cot_synthesis2_OpenSource-high/1 | 500000 | 64738 |
| cot_synthesis2_OpenSource-low/2 | 500000 | 41218 |
| cot_synthesis2_OpenSource-mid/3 | 500000 | 81781 |
| cot_synthesis2_arxiv-high/2 | 256516 | 1515440 |
| cot_synthesis2_arxiv-low/1 | 500000 | 2240642 |
| cot_synthesis2_arxiv-mid/2 | 500000 | 2550407 |
| cot_synthesis2_code-high/4 | 500000 | 115665 |
| cot_synthesis2_code-low/6 | 500000 | 172518 |
| cot_synthesis2_code-mid/23 | 500000 | 226180 |
| cot_synthesis2_math-high/12 | 500000 | 330448 |
| cot_synthesis2_math-low/3 | 500000 | 415859 |
| cot_synthesis2_math-mid/5 | 500000 | 88947 |
| cot_synthesis2_wiki-high/5 | 500000 | 158858 |
| cot_synthesis2_wiki-low/5 | 500000 | 150385 |
| cot_synthesis2_wiki-mid/4 | 500000 | 198961 |
| cot_synthesis_CC-high/74 | 500000 | 55987 |
| cot_synthesis_CC-low/54 | 500000 | 45202 |
| cot_synthesis_CC-mid/275 | 500000 | 51384 |

Note how sources made of long documents (arxiv, books, and especially the cot_synthesis2_arxiv-* sets) yield far more 4096-token samples than they have documents, since each long document is sliced into many samples.
INFO:megatron.core.datasets.indexed_dataset:Load the _IndexReader from ../OpenSeek-Pretrain-100B/cot_synthesis_OpenSource-high/4_text_document.idx
INFO:megatron.core.datasets.indexed_dataset:	Extract the sequence lengths
INFO:megatron.core.datasets.indexed_dataset:	Extract the sequence pointers
INFO:megatron.core.datasets.indexed_dataset:	Extract the document indices
INFO:megatron.core.datasets.indexed_dataset:> total number of sequences: 500000
INFO:megatron.core.datasets.indexed_dataset:> total number of documents: 500000
INFO:megatron.core.datasets.gpt_dataset:Load the GPTDataset train indices
INFO:megatron.core.datasets.gpt_dataset:	Load the document index from 05388d8ae2767f997f3a327b6162fa0c-GPTDataset-train-document_index.npy
INFO:megatron.core.datasets.gpt_dataset:	Load the sample index from 05388d8ae2767f997f3a327b6162fa0c-GPTDataset-train-sample_index.npy
INFO:megatron.core.datasets.gpt_dataset:	Load the shuffle index from 05388d8ae2767f997f3a327b6162fa0c-GPTDataset-train-shuffle_index.npy
INFO:megatron.core.datasets.gpt_dataset:> total number of samples: 102662
INFO:megatron.core.datasets.indexed_dataset:Load the _IndexReader from ../OpenSeek-Pretrain-100B/cot_synthesis_OpenSource-low/2_text_document.idx
INFO:megatron.core.datasets.indexed_dataset:	Extract the sequence lengths
INFO:megatron.core.datasets.indexed_dataset:	Extract the sequence pointers
INFO:megatron.core.datasets.indexed_dataset:	Extract the document indices
INFO:megatron.core.datasets.indexed_dataset:> total number of sequences: 500000
INFO:megatron.core.datasets.indexed_dataset:> total number of documents: 500000
INFO:megatron.core.datasets.gpt_dataset:Load the GPTDataset train indices
INFO:megatron.core.datasets.gpt_dataset:	Load the document index from 99d51016a10dfaaaf8aa11f19ca47eec-GPTDataset-train-document_index.npy
INFO:megatron.core.datasets.gpt_dataset:	Load the sample index from 99d51016a10dfaaaf8aa11f19ca47eec-GPTDataset-train-sample_index.npy
INFO:megatron.core.datasets.gpt_dataset:	Load the shuffle index from 99d51016a10dfaaaf8aa11f19ca47eec-GPTDataset-train-shuffle_index.npy
INFO:megatron.core.datasets.gpt_dataset:> total number of samples: 41745
INFO:megatron.core.datasets.indexed_dataset:Load the _IndexReader from ../OpenSeek-Pretrain-100B/cot_synthesis_OpenSource-mid/6_text_document.idx
INFO:megatron.core.datasets.indexed_dataset:	Extract the sequence lengths
INFO:megatron.core.datasets.indexed_dataset:	Extract the sequence pointers
INFO:megatron.core.datasets.indexed_dataset:	Extract the document indices
INFO:megatron.core.datasets.indexed_dataset:> total number of sequences: 500000
INFO:megatron.core.datasets.indexed_dataset:> total number of documents: 500000
INFO:megatron.core.datasets.gpt_dataset:Load the GPTDataset train indices
INFO:megatron.core.datasets.gpt_dataset:	Load the document index from 8376d733799a069256ae09b1bcd6dcab-GPTDataset-train-document_index.npy
INFO:megatron.core.datasets.gpt_dataset:	Load the sample index from 8376d733799a069256ae09b1bcd6dcab-GPTDataset-train-sample_index.npy
INFO:megatron.core.datasets.gpt_dataset:	Load the shuffle index from 8376d733799a069256ae09b1bcd6dcab-GPTDataset-train-shuffle_index.npy
INFO:megatron.core.datasets.gpt_dataset:> total number of samples: 322718
INFO:megatron.core.datasets.indexed_dataset:Load the _IndexReader from ../OpenSeek-Pretrain-100B/cot_synthesis_arxiv-high/2_text_document.idx
INFO:megatron.core.datasets.indexed_dataset:	Extract the sequence lengths
INFO:megatron.core.datasets.indexed_dataset:	Extract the sequence pointers
INFO:megatron.core.datasets.indexed_dataset:	Extract the document indices
INFO:megatron.core.datasets.indexed_dataset:> total number of sequences: 256516
INFO:megatron.core.datasets.indexed_dataset:> total number of documents: 256516
INFO:megatron.core.datasets.gpt_dataset:Load the GPTDataset train indices
INFO:megatron.core.datasets.gpt_dataset:	Load the document index from 3a33c9308a021d4054fa13c21cf4952f-GPTDataset-train-document_index.npy
INFO:megatron.core.datasets.gpt_dataset:	Load the sample index from 3a33c9308a021d4054fa13c21cf4952f-GPTDataset-train-sample_index.npy
INFO:megatron.core.datasets.gpt_dataset:	Load the shuffle index from 3a33c9308a021d4054fa13c21cf4952f-GPTDataset-train-shuffle_index.npy
INFO:megatron.core.datasets.gpt_dataset:> total number of samples: 1428961
INFO:megatron.core.datasets.indexed_dataset:Load the _IndexReader from ../OpenSeek-Pretrain-100B/cot_synthesis_arxiv-low/1_text_document.idx
INFO:megatron.core.datasets.indexed_dataset:	Extract the sequence lengths
INFO:megatron.core.datasets.indexed_dataset:	Extract the sequence pointers
INFO:megatron.core.datasets.indexed_dataset:	Extract the document indices
INFO:megatron.core.datasets.indexed_dataset:> total number of sequences: 500000
INFO:megatron.core.datasets.indexed_dataset:> total number of documents: 500000
INFO:megatron.core.datasets.gpt_dataset:Load the GPTDataset train indices
INFO:megatron.core.datasets.gpt_dataset:	Load the document index from 9efb6b5b415537b5bc066d9cc2c620b3-GPTDataset-train-document_index.npy
INFO:megatron.core.datasets.gpt_dataset:	Load the sample index from 9efb6b5b415537b5bc066d9cc2c620b3-GPTDataset-train-sample_index.npy
INFO:megatron.core.datasets.gpt_dataset:	Load the shuffle index from 9efb6b5b415537b5bc066d9cc2c620b3-GPTDataset-train-shuffle_index.npy
INFO:megatron.core.datasets.gpt_dataset:> total number of samples: 1884499
INFO:megatron.core.datasets.indexed_dataset:Load the _IndexReader from ../OpenSeek-Pretrain-100B/cot_synthesis_arxiv-mid/1_text_document.idx
INFO:megatron.core.datasets.indexed_dataset:	Extract the sequence lengths
INFO:megatron.core.datasets.indexed_dataset:	Extract the sequence pointers
INFO:megatron.core.datasets.indexed_dataset:	Extract the document indices
INFO:megatron.core.datasets.indexed_dataset:> total number of sequences: 500000
INFO:megatron.core.datasets.indexed_dataset:> total number of documents: 500000
INFO:megatron.core.datasets.gpt_dataset:Load the GPTDataset train indices
INFO:megatron.core.datasets.gpt_dataset:	Load the document index from 13f6793fe99fe92439722cddaf40727e-GPTDataset-train-document_index.npy
INFO:megatron.core.datasets.gpt_dataset:	Load the sample index from 13f6793fe99fe92439722cddaf40727e-GPTDataset-train-sample_index.npy
INFO:megatron.core.datasets.gpt_dataset:	Load the shuffle index from 13f6793fe99fe92439722cddaf40727e-GPTDataset-train-shuffle_index.npy
INFO:megatron.core.datasets.gpt_dataset:> total number of samples: 2248083
INFO:megatron.core.datasets.indexed_dataset:Load the _IndexReader from ../OpenSeek-Pretrain-100B/cot_synthesis_code-high/13_text_document.idx
INFO:megatron.core.datasets.indexed_dataset:	Extract the sequence lengths
INFO:megatron.core.datasets.indexed_dataset:	Extract the sequence pointers
INFO:megatron.core.datasets.indexed_dataset:	Extract the document indices
INFO:megatron.core.datasets.indexed_dataset:> total number of sequences: 500000
INFO:megatron.core.datasets.indexed_dataset:> total number of documents: 500000
INFO:megatron.core.datasets.gpt_dataset:Load the GPTDataset train indices
INFO:megatron.core.datasets.gpt_dataset:	Load the document index from 55f9b728813d421a6eda80c4e4386d81-GPTDataset-train-document_index.npy
INFO:megatron.core.datasets.gpt_dataset:	Load the sample index from 55f9b728813d421a6eda80c4e4386d81-GPTDataset-train-sample_index.npy
INFO:megatron.core.datasets.gpt_dataset:	Load the shuffle index from 55f9b728813d421a6eda80c4e4386d81-GPTDataset-train-shuffle_index.npy
INFO:megatron.core.datasets.gpt_dataset:> total number of samples: 192791
INFO:megatron.core.datasets.indexed_dataset:Load the _IndexReader from ../OpenSeek-Pretrain-100B/cot_synthesis_code-low/9_text_document.idx
INFO:megatron.core.datasets.indexed_dataset:	Extract the sequence lengths
INFO:megatron.core.datasets.indexed_dataset:	Extract the sequence pointers
INFO:megatron.core.datasets.indexed_dataset:	Extract the document indices
INFO:megatron.core.datasets.indexed_dataset:> total number of sequences: 500000
INFO:megatron.core.datasets.indexed_dataset:> total number of documents: 500000
INFO:megatron.core.datasets.gpt_dataset:Load the GPTDataset train indices
INFO:megatron.core.datasets.gpt_dataset:	Load the document index from af2b2f630340e8986a7fa73fde612311-GPTDataset-train-document_index.npy
INFO:megatron.core.datasets.gpt_dataset:	Load the sample index from af2b2f630340e8986a7fa73fde612311-GPTDataset-train-sample_index.npy
INFO:megatron.core.datasets.gpt_dataset:	Load the shuffle index from af2b2f630340e8986a7fa73fde612311-GPTDataset-train-shuffle_index.npy
INFO:megatron.core.datasets.gpt_dataset:> total number of samples: 101935
INFO:megatron.core.datasets.indexed_dataset:Load the _IndexReader from ../OpenSeek-Pretrain-100B/cot_synthesis_code-mid/6_text_document.idx
INFO:megatron.core.datasets.indexed_dataset:	Extract the sequence lengths
INFO:megatron.core.datasets.indexed_dataset:	Extract the sequence pointers
INFO:megatron.core.datasets.indexed_dataset:	Extract the document indices
INFO:megatron.core.datasets.indexed_dataset:> total number of sequences: 500000
INFO:megatron.core.datasets.indexed_dataset:> total number of documents: 500000
INFO:megatron.core.datasets.gpt_dataset:Load the GPTDataset train indices
INFO:megatron.core.datasets.gpt_dataset:	Load the document index from da5b6dc21e950c12d3ac425bfd1f091f-GPTDataset-train-document_index.npy
INFO:megatron.core.datasets.gpt_dataset:	Load the sample index from da5b6dc21e950c12d3ac425bfd1f091f-GPTDataset-train-sample_index.npy
INFO:megatron.core.datasets.gpt_dataset:	Load the shuffle index from da5b6dc21e950c12d3ac425bfd1f091f-GPTDataset-train-shuffle_index.npy
INFO:megatron.core.datasets.gpt_dataset:> total number of samples: 48202
INFO:megatron.core.datasets.indexed_dataset:Load the _IndexReader from ../OpenSeek-Pretrain-100B/cot_synthesis_math-high/11_text_document.idx
INFO:megatron.core.datasets.indexed_dataset:	Extract the sequence lengths
INFO:megatron.core.datasets.indexed_dataset:	Extract the sequence pointers
INFO:megatron.core.datasets.indexed_dataset:	Extract the document indices
INFO:megatron.core.datasets.indexed_dataset:> total number of sequences: 500000
INFO:megatron.core.datasets.indexed_dataset:> total number of documents: 500000
INFO:megatron.core.datasets.gpt_dataset:Load the GPTDataset train indices
INFO:megatron.core.datasets.gpt_dataset:	Load the document index from ad21fc502df49664e479a402ec05d322-GPTDataset-train-document_index.npy
INFO:megatron.core.datasets.gpt_dataset:	Load the sample index from ad21fc502df49664e479a402ec05d322-GPTDataset-train-sample_index.npy
INFO:megatron.core.datasets.gpt_dataset:	Load the shuffle index from ad21fc502df49664e479a402ec05d322-GPTDataset-train-shuffle_index.npy
INFO:megatron.core.datasets.gpt_dataset:> total number of samples: 127661
INFO:megatron.core.datasets.indexed_dataset:Load the _IndexReader from ../OpenSeek-Pretrain-100B/cot_synthesis_math-low/11_text_document.idx
INFO:megatron.core.datasets.indexed_dataset:	Extract the sequence lengths
INFO:megatron.core.datasets.indexed_dataset:	Extract the sequence pointers
INFO:megatron.core.datasets.indexed_dataset:	Extract the document indices
INFO:megatron.core.datasets.indexed_dataset:> total number of sequences: 500000
INFO:megatron.core.datasets.indexed_dataset:> total number of documents: 500000
INFO:megatron.core.datasets.gpt_dataset:Load the GPTDataset train indices
INFO:megatron.core.datasets.gpt_dataset:	Load the document index from 26ec788a6472d7a4fc2abb1fd801bd83-GPTDataset-train-document_index.npy
INFO:megatron.core.datasets.gpt_dataset:	Load the sample index from 26ec788a6472d7a4fc2abb1fd801bd83-GPTDataset-train-sample_index.npy
INFO:megatron.core.datasets.gpt_dataset:	Load the shuffle index from 26ec788a6472d7a4fc2abb1fd801bd83-GPTDataset-train-shuffle_index.npy
INFO:megatron.core.datasets.gpt_dataset:> total number of samples: 161943
INFO:megatron.core.datasets.indexed_dataset:Load the _IndexReader from ../OpenSeek-Pretrain-100B/cot_synthesis_math-mid/29_text_document.idx
INFO:megatron.core.datasets.indexed_dataset:	Extract the sequence lengths
INFO:megatron.core.datasets.indexed_dataset:	Extract the sequence pointers
INFO:megatron.core.datasets.indexed_dataset:	Extract the document indices
INFO:megatron.core.datasets.indexed_dataset:> total number of sequences: 500000
INFO:megatron.core.datasets.indexed_dataset:> total number of documents: 500000
INFO:megatron.core.datasets.gpt_dataset:Load the GPTDataset train indices
INFO:megatron.core.datasets.gpt_dataset:	Load the document index from a4ff47cc44a1bcd08f765fc5846d53fb-GPTDataset-train-document_index.npy
INFO:megatron.core.datasets.gpt_dataset:	Load the sample index from a4ff47cc44a1bcd08f765fc5846d53fb-GPTDataset-train-sample_index.npy
INFO:megatron.core.datasets.gpt_dataset:	Load the shuffle index from a4ff47cc44a1bcd08f765fc5846d53fb-GPTDataset-train-shuffle_index.npy
INFO:megatron.core.datasets.gpt_dataset:> total number of samples: 161166
INFO:megatron.core.datasets.indexed_dataset:Load the _IndexReader from ../OpenSeek-Pretrain-100B/cot_synthesis_wiki-high/4_text_document.idx
INFO:megatron.core.datasets.indexed_dataset:	Extract the sequence lengths
INFO:megatron.core.datasets.indexed_dataset:	Extract the sequence pointers
INFO:megatron.core.datasets.indexed_dataset:	Extract the document indices
INFO:megatron.core.datasets.indexed_dataset:> total number of sequences: 500000
INFO:megatron.core.datasets.indexed_dataset:> total number of documents: 500000
INFO:megatron.core.datasets.gpt_dataset:Load the GPTDataset train indices
INFO:megatron.core.datasets.gpt_dataset:	Load the document index from 9c88f9a84c38df50a45e7390a09215f0-GPTDataset-train-document_index.npy
INFO:megatron.core.datasets.gpt_dataset:	Load the sample index from 9c88f9a84c38df50a45e7390a09215f0-GPTDataset-train-sample_index.npy
INFO:megatron.core.datasets.gpt_dataset:	Load the shuffle index from 9c88f9a84c38df50a45e7390a09215f0-GPTDataset-train-shuffle_index.npy
INFO:megatron.core.datasets.gpt_dataset:> total number of samples: 100623
INFO:megatron.core.datasets.indexed_dataset:Load the _IndexReader from ../OpenSeek-Pretrain-100B/cot_synthesis_wiki-low/6_text_document.idx
INFO:megatron.core.datasets.indexed_dataset:	Extract the sequence lengths
INFO:megatron.core.datasets.indexed_dataset:	Extract the sequence pointers
INFO:megatron.core.datasets.indexed_dataset:	Extract the document indices
INFO:megatron.core.datasets.indexed_dataset:> total number of sequences: 500000
INFO:megatron.core.datasets.indexed_dataset:> total number of documents: 500000
INFO:megatron.core.datasets.gpt_dataset:Load the GPTDataset train indices
INFO:megatron.core.datasets.gpt_dataset:	Load the document index from f7239aa9de74201060aed61b1cdf0c6b-GPTDataset-train-document_index.npy
INFO:megatron.core.datasets.gpt_dataset:	Load the sample index from f7239aa9de74201060aed61b1cdf0c6b-GPTDataset-train-sample_index.npy
INFO:megatron.core.datasets.gpt_dataset:	Load the shuffle index from f7239aa9de74201060aed61b1cdf0c6b-GPTDataset-train-shuffle_index.npy
INFO:megatron.core.datasets.gpt_dataset:> total number of samples: 89674
INFO:megatron.core.datasets.indexed_dataset:Load the _IndexReader from ../OpenSeek-Pretrain-100B/cot_synthesis_wiki-mid/3_text_document.idx
INFO:megatron.core.datasets.indexed_dataset:	Extract the sequence lengths
INFO:megatron.core.datasets.indexed_dataset:	Extract the sequence pointers
INFO:megatron.core.datasets.indexed_dataset:	Extract the document indices
INFO:megatron.core.datasets.indexed_dataset:> total number of sequences: 500000
INFO:megatron.core.datasets.indexed_dataset:> total number of documents: 500000
INFO:megatron.core.datasets.gpt_dataset:Load the GPTDataset train indices
INFO:megatron.core.datasets.gpt_dataset:	Load the document index from 75221ad5995f5e1b284e1451bc21e7ea-GPTDataset-train-document_index.npy
INFO:megatron.core.datasets.gpt_dataset:	Load the sample index from 75221ad5995f5e1b284e1451bc21e7ea-GPTDataset-train-sample_index.npy
INFO:megatron.core.datasets.gpt_dataset:	Load the shuffle index from 75221ad5995f5e1b284e1451bc21e7ea-GPTDataset-train-shuffle_index.npy
INFO:megatron.core.datasets.gpt_dataset:> total number of samples: 145122
INFO:megatron.core.datasets.indexed_dataset:Load the _IndexReader from ../OpenSeek-Pretrain-100B/math-high/part_04_text_document.idx
INFO:megatron.core.datasets.indexed_dataset:	Extract the sequence lengths
INFO:megatron.core.datasets.indexed_dataset:	Extract the sequence pointers
INFO:megatron.core.datasets.indexed_dataset:	Extract the document indices
INFO:megatron.core.datasets.indexed_dataset:> total number of sequences: 1000000
INFO:megatron.core.datasets.indexed_dataset:> total number of documents: 1000000
INFO:megatron.core.datasets.gpt_dataset:Load the GPTDataset train indices
INFO:megatron.core.datasets.gpt_dataset:	Load the document index from ef51ee55d8f6603934f08ce635314fad-GPTDataset-train-document_index.npy
INFO:megatron.core.datasets.gpt_dataset:	Load the sample index from ef51ee55d8f6603934f08ce635314fad-GPTDataset-train-sample_index.npy
INFO:megatron.core.datasets.gpt_dataset:	Load the shuffle index from ef51ee55d8f6603934f08ce635314fad-GPTDataset-train-shuffle_index.npy
INFO:megatron.core.datasets.gpt_dataset:> total number of samples: 456998
INFO:megatron.core.datasets.indexed_dataset:Load the _IndexReader from ../OpenSeek-Pretrain-100B/math-low/part_10_text_document.idx
INFO:megatron.core.datasets.indexed_dataset:	Extract the sequence lengths
INFO:megatron.core.datasets.indexed_dataset:	Extract the sequence pointers
INFO:megatron.core.datasets.indexed_dataset:	Extract the document indices
INFO:megatron.core.datasets.indexed_dataset:> total number of sequences: 1000000
INFO:megatron.core.datasets.indexed_dataset:> total number of documents: 1000000
INFO:megatron.core.datasets.gpt_dataset:Load the GPTDataset train indices
INFO:megatron.core.datasets.gpt_dataset:	Load the document index from 183681f26924ef7e06ab027733f0490e-GPTDataset-train-document_index.npy
INFO:megatron.core.datasets.gpt_dataset:	Load the sample index from 183681f26924ef7e06ab027733f0490e-GPTDataset-train-sample_index.npy
INFO:megatron.core.datasets.gpt_dataset:	Load the shuffle index from 183681f26924ef7e06ab027733f0490e-GPTDataset-train-shuffle_index.npy
INFO:megatron.core.datasets.gpt_dataset:> total number of samples: 426167
INFO:megatron.core.datasets.indexed_dataset:Load the _IndexReader from ../OpenSeek-Pretrain-100B/math-mid/part_07_text_document.idx
INFO:megatron.core.datasets.indexed_dataset:	Extract the sequence lengths
INFO:megatron.core.datasets.indexed_dataset:	Extract the sequence pointers
INFO:megatron.core.datasets.indexed_dataset:	Extract the document indices
INFO:megatron.core.datasets.indexed_dataset:> total number of sequences: 1000000
INFO:megatron.core.datasets.indexed_dataset:> total number of documents: 1000000
INFO:megatron.core.datasets.gpt_dataset:Load the GPTDataset train indices
INFO:megatron.core.datasets.gpt_dataset:	Load the document index from 2c02a9a76de03db0d4cbd2b78a34e2e8-GPTDataset-train-document_index.npy
INFO:megatron.core.datasets.gpt_dataset:	Load the sample index from 2c02a9a76de03db0d4cbd2b78a34e2e8-GPTDataset-train-sample_index.npy
INFO:megatron.core.datasets.gpt_dataset:	Load the shuffle index from 2c02a9a76de03db0d4cbd2b78a34e2e8-GPTDataset-train-shuffle_index.npy
INFO:megatron.core.datasets.gpt_dataset:> total number of samples: 410354
INFO:megatron.core.datasets.indexed_dataset:Load the _IndexReader from ../OpenSeek-Pretrain-100B/pes2o-full-train/train-0041-of-0136_text_document.idx
INFO:megatron.core.datasets.indexed_dataset:	Extract the sequence lengths
INFO:megatron.core.datasets.indexed_dataset:	Extract the sequence pointers
INFO:megatron.core.datasets.indexed_dataset:	Extract the document indices
INFO:megatron.core.datasets.indexed_dataset:> total number of sequences: 111715
INFO:megatron.core.datasets.indexed_dataset:> total number of documents: 111715
INFO:megatron.core.datasets.gpt_dataset:Load the GPTDataset train indices
INFO:megatron.core.datasets.gpt_dataset:	Load the document index from 4bedbf63df1ea1c142930c7b58689315-GPTDataset-train-document_index.npy
INFO:megatron.core.datasets.gpt_dataset:	Load the sample index from 4bedbf63df1ea1c142930c7b58689315-GPTDataset-train-sample_index.npy
INFO:megatron.core.datasets.gpt_dataset:	Load the shuffle index from 4bedbf63df1ea1c142930c7b58689315-GPTDataset-train-shuffle_index.npy
INFO:megatron.core.datasets.gpt_dataset:> total number of samples: 172833
INFO:megatron.core.datasets.indexed_dataset:Load the _IndexReader from ../OpenSeek-Pretrain-100B/pes2o-full-train/train-0125-of-0136_text_document.idx
INFO:megatron.core.datasets.indexed_dataset:	Extract the sequence lengths
INFO:megatron.core.datasets.indexed_dataset:	Extract the sequence pointers
INFO:megatron.core.datasets.indexed_dataset:	Extract the document indices
INFO:megatron.core.datasets.indexed_dataset:> total number of sequences: 120151
INFO:megatron.core.datasets.indexed_dataset:> total number of documents: 120151
INFO:megatron.core.datasets.gpt_dataset:Load the GPTDataset train indices
INFO:megatron.core.datasets.gpt_dataset:	Load the document index from 8c143921153f51208327e9e6c80f3996-GPTDataset-train-document_index.npy
INFO:megatron.core.datasets.gpt_dataset:	Load the sample index from 8c143921153f51208327e9e6c80f3996-GPTDataset-train-sample_index.npy
INFO:megatron.core.datasets.gpt_dataset:	Load the shuffle index from 8c143921153f51208327e9e6c80f3996-GPTDataset-train-shuffle_index.npy
INFO:megatron.core.datasets.gpt_dataset:> total number of samples: 185836
INFO:megatron.core.datasets.indexed_dataset:Load the _IndexReader from ../OpenSeek-Pretrain-100B/pes2o-full-val/valid-0034-of-0060_text_document.idx
INFO:megatron.core.datasets.indexed_dataset:	Extract the sequence lengths
INFO:megatron.core.datasets.indexed_dataset:	Extract the sequence pointers
INFO:megatron.core.datasets.indexed_dataset:	Extract the document indices
INFO:megatron.core.datasets.indexed_dataset:> total number of sequences: 2066
INFO:megatron.core.datasets.indexed_dataset:> total number of documents: 2066
INFO:megatron.core.datasets.gpt_dataset:Load the GPTDataset train indices
INFO:megatron.core.datasets.gpt_dataset:	Load the document index from 17b24baded5fb547c2bdd1eca38d9e8d-GPTDataset-train-document_index.npy
INFO:megatron.core.datasets.gpt_dataset:	Load the sample index from 17b24baded5fb547c2bdd1eca38d9e8d-GPTDataset-train-sample_index.npy
INFO:megatron.core.datasets.gpt_dataset:	Load the shuffle index from 17b24baded5fb547c2bdd1eca38d9e8d-GPTDataset-train-shuffle_index.npy
INFO:megatron.core.datasets.gpt_dataset:> total number of samples: 3587
INFO:megatron.core.datasets.indexed_dataset:Load the _IndexReader from ../OpenSeek-Pretrain-100B/pes2o/pubmedcentral_3_text_document.idx
INFO:megatron.core.datasets.indexed_dataset:	Extract the sequence lengths
INFO:megatron.core.datasets.indexed_dataset:	Extract the sequence pointers
INFO:megatron.core.datasets.indexed_dataset:	Extract the document indices
INFO:megatron.core.datasets.indexed_dataset:> total number of sequences: 1000000
INFO:megatron.core.datasets.indexed_dataset:> total number of documents: 1000000
INFO:megatron.core.datasets.gpt_dataset:Load the GPTDataset train indices
INFO:megatron.core.datasets.gpt_dataset:	Load the document index from 817bacd90c33f000963995212597faeb-GPTDataset-train-document_index.npy
INFO:megatron.core.datasets.gpt_dataset:	Load the sample index from 817bacd90c33f000963995212597faeb-GPTDataset-train-sample_index.npy
INFO:megatron.core.datasets.gpt_dataset:	Load the shuffle index from 817bacd90c33f000963995212597faeb-GPTDataset-train-shuffle_index.npy
INFO:megatron.core.datasets.gpt_dataset:> total number of samples: 1559325
INFO:megatron.core.datasets.indexed_dataset:Load the _IndexReader from ../OpenSeek-Pretrain-100B/stack/018_00000_text_document.idx
INFO:megatron.core.datasets.indexed_dataset:	Extract the sequence lengths
INFO:megatron.core.datasets.indexed_dataset:	Extract the sequence pointers
INFO:megatron.core.datasets.indexed_dataset:	Extract the document indices
INFO:megatron.core.datasets.indexed_dataset:> total number of sequences: 725493
INFO:megatron.core.datasets.indexed_dataset:> total number of documents: 725493
INFO:megatron.core.datasets.gpt_dataset:Load the GPTDataset train indices
INFO:megatron.core.datasets.gpt_dataset:	Load the document index from 70fbfb40decd8fd580ba438c4fe3a326-GPTDataset-train-document_index.npy
INFO:megatron.core.datasets.gpt_dataset:	Load the sample index from 70fbfb40decd8fd580ba438c4fe3a326-GPTDataset-train-sample_index.npy
INFO:megatron.core.datasets.gpt_dataset:	Load the shuffle index from 70fbfb40decd8fd580ba438c4fe3a326-GPTDataset-train-shuffle_index.npy
INFO:megatron.core.datasets.gpt_dataset:> total number of samples: 106399
INFO:megatron.core.datasets.indexed_dataset:Load the _IndexReader from ../OpenSeek-Pretrain-100B/wiki/012_00000_text_document.idx
INFO:megatron.core.datasets.indexed_dataset:	Extract the sequence lengths
INFO:megatron.core.datasets.indexed_dataset:	Extract the sequence pointers
INFO:megatron.core.datasets.indexed_dataset:	Extract the document indices
INFO:megatron.core.datasets.indexed_dataset:> total number of sequences: 952137
INFO:megatron.core.datasets.indexed_dataset:> total number of documents: 952137
INFO:megatron.core.datasets.gpt_dataset:Load the GPTDataset train indices
INFO:megatron.core.datasets.gpt_dataset:	Load the document index from 0f1298d7bcb5257da1d1fed0af30cf27-GPTDataset-train-document_index.npy
INFO:megatron.core.datasets.gpt_dataset:	Load the sample index from 0f1298d7bcb5257da1d1fed0af30cf27-GPTDataset-train-sample_index.npy
INFO:megatron.core.datasets.gpt_dataset:	Load the shuffle index from 0f1298d7bcb5257da1d1fed0af30cf27-GPTDataset-train-shuffle_index.npy
INFO:megatron.core.datasets.gpt_dataset:> total number of samples: 105713
INFO:megatron.core.datasets.indexed_dataset:Load the _IndexReader from ../OpenSeek-Pretrain-100B/zh_cc-high-loss0/part_28_text_document.idx
INFO:megatron.core.datasets.indexed_dataset:	Extract the sequence lengths
INFO:megatron.core.datasets.indexed_dataset:	Extract the sequence pointers
INFO:megatron.core.datasets.indexed_dataset:	Extract the document indices
INFO:megatron.core.datasets.indexed_dataset:> total number of sequences: 1014022
INFO:megatron.core.datasets.indexed_dataset:> total number of documents: 1014022
INFO:megatron.core.datasets.gpt_dataset:Load the GPTDataset train indices
INFO:megatron.core.datasets.gpt_dataset:	Load the document index from 8a41ddce0a4e9446a2a38ecaf18b8319-GPTDataset-train-document_index.npy
INFO:megatron.core.datasets.gpt_dataset:	Load the sample index from 8a41ddce0a4e9446a2a38ecaf18b8319-GPTDataset-train-sample_index.npy
INFO:megatron.core.datasets.gpt_dataset:	Load the shuffle index from 8a41ddce0a4e9446a2a38ecaf18b8319-GPTDataset-train-shuffle_index.npy
INFO:megatron.core.datasets.gpt_dataset:> total number of samples: 457136
INFO:megatron.core.datasets.indexed_dataset:Load the _IndexReader from ../OpenSeek-Pretrain-100B/zh_cc-high-loss1/part_59_text_document.idx
INFO:megatron.core.datasets.indexed_dataset:	Extract the sequence lengths
INFO:megatron.core.datasets.indexed_dataset:	Extract the sequence pointers
INFO:megatron.core.datasets.indexed_dataset:	Extract the document indices
INFO:megatron.core.datasets.indexed_dataset:> total number of sequences: 1024772
INFO:megatron.core.datasets.indexed_dataset:> total number of documents: 1024772
INFO:megatron.core.datasets.gpt_dataset:Load the GPTDataset train indices
INFO:megatron.core.datasets.gpt_dataset:	Load the document index from 90b19f2d251220dc983bbbbee1bbaa16-GPTDataset-train-document_index.npy
INFO:megatron.core.datasets.gpt_dataset:	Load the sample index from 90b19f2d251220dc983bbbbee1bbaa16-GPTDataset-train-sample_index.npy
INFO:megatron.core.datasets.gpt_dataset:	Load the shuffle index from 90b19f2d251220dc983bbbbee1bbaa16-GPTDataset-train-shuffle_index.npy
INFO:megatron.core.datasets.gpt_dataset:> total number of samples: 245948
INFO:megatron.core.datasets.indexed_dataset:Load the _IndexReader from ../OpenSeek-Pretrain-100B/zh_cc-high-loss2/part_16_text_document.idx
INFO:megatron.core.datasets.indexed_dataset:	Extract the sequence lengths
INFO:megatron.core.datasets.indexed_dataset:	Extract the sequence pointers
INFO:megatron.core.datasets.indexed_dataset:	Extract the document indices
INFO:megatron.core.datasets.indexed_dataset:> total number of sequences: 1006257
INFO:megatron.core.datasets.indexed_dataset:> total number of documents: 1006257
INFO:megatron.core.datasets.gpt_dataset:Load the GPTDataset train indices
INFO:megatron.core.datasets.gpt_dataset:	Load the document index from 89ab9dff4a2d78068a6f282a80ae15d6-GPTDataset-train-document_index.npy
INFO:megatron.core.datasets.gpt_dataset:	Load the sample index from 89ab9dff4a2d78068a6f282a80ae15d6-GPTDataset-train-sample_index.npy
INFO:megatron.core.datasets.gpt_dataset:	Load the shuffle index from 89ab9dff4a2d78068a6f282a80ae15d6-GPTDataset-train-shuffle_index.npy
INFO:megatron.core.datasets.gpt_dataset:> total number of samples: 93708
INFO:megatron.core.datasets.indexed_dataset:Load the _IndexReader from ../OpenSeek-Pretrain-100B/zh_cc-medidum-loss0/part_192_text_document.idx
INFO:megatron.core.datasets.indexed_dataset:	Extract the sequence lengths
INFO:megatron.core.datasets.indexed_dataset:	Extract the sequence pointers
INFO:megatron.core.datasets.indexed_dataset:	Extract the document indices
INFO:megatron.core.datasets.indexed_dataset:> total number of sequences: 1003089
INFO:megatron.core.datasets.indexed_dataset:> total number of documents: 1003089
INFO:megatron.core.datasets.gpt_dataset:Load the GPTDataset train indices
INFO:megatron.core.datasets.gpt_dataset:	Load the document index from 3850bffbf498895bd2a9179e4d7dcea3-GPTDataset-train-document_index.npy
INFO:megatron.core.datasets.gpt_dataset:	Load the sample index from 3850bffbf498895bd2a9179e4d7dcea3-GPTDataset-train-sample_index.npy
INFO:megatron.core.datasets.gpt_dataset:	Load the shuffle index from 3850bffbf498895bd2a9179e4d7dcea3-GPTDataset-train-shuffle_index.npy
INFO:megatron.core.datasets.gpt_dataset:> total number of samples: 238798
INFO:megatron.core.datasets.indexed_dataset:Load the _IndexReader from ../OpenSeek-Pretrain-100B/zh_cc-medidum-loss1/part_550_text_document.idx
INFO:megatron.core.datasets.indexed_dataset:	Extract the sequence lengths
INFO:megatron.core.datasets.indexed_dataset:	Extract the sequence pointers
INFO:megatron.core.datasets.indexed_dataset:	Extract the document indices
INFO:megatron.core.datasets.indexed_dataset:> total number of sequences: 1012308
INFO:megatron.core.datasets.indexed_dataset:> total number of documents: 1012308
INFO:megatron.core.datasets.gpt_dataset:Load the GPTDataset train indices
INFO:megatron.core.datasets.gpt_dataset:	Load the document index from acdca4be0c37ea582f652270b04903dd-GPTDataset-train-document_index.npy
INFO:megatron.core.datasets.gpt_dataset:	Load the sample index from acdca4be0c37ea582f652270b04903dd-GPTDataset-train-sample_index.npy
INFO:megatron.core.datasets.gpt_dataset:	Load the shuffle index from acdca4be0c37ea582f652270b04903dd-GPTDataset-train-shuffle_index.npy
INFO:megatron.core.datasets.gpt_dataset:> total number of samples: 232358
INFO:megatron.core.datasets.indexed_dataset:Load the _IndexReader from ../OpenSeek-Pretrain-100B/zh_cc-medidum-loss2/part_71_text_document.idx
INFO:megatron.core.datasets.indexed_dataset:	Extract the sequence lengths
INFO:megatron.core.datasets.indexed_dataset:	Extract the sequence pointers
INFO:megatron.core.datasets.indexed_dataset:	Extract the document indices
INFO:megatron.core.datasets.indexed_dataset:> total number of sequences: 1013689
INFO:megatron.core.datasets.indexed_dataset:> total number of documents: 1013689
INFO:megatron.core.datasets.gpt_dataset:Load the GPTDataset train indices
INFO:megatron.core.datasets.gpt_dataset:	Load the document index from 7d71d34ab9676e73d600759f5bf29f96-GPTDataset-train-document_index.npy
INFO:megatron.core.datasets.gpt_dataset:	Load the sample index from 7d71d34ab9676e73d600759f5bf29f96-GPTDataset-train-sample_index.npy
INFO:megatron.core.datasets.gpt_dataset:	Load the shuffle index from 7d71d34ab9676e73d600759f5bf29f96-GPTDataset-train-shuffle_index.npy
INFO:megatron.core.datasets.gpt_dataset:> total number of samples: 267765
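With dozens of shards, tabulating these log blocks by hand is tedious. A minimal Python sketch that extracts one `(idx file, sequences, train samples)` row per shard; the regexes simply mirror the INFO lines shown above, and the log filename is a placeholder:

```python
import re

# Patterns mirroring the megatron.core.datasets INFO lines pasted above.
LOAD_RE   = re.compile(r"Load the _IndexReader from (\S+\.idx)")
SEQ_RE    = re.compile(r"total number of sequences: (\d+)")
SAMPLE_RE = re.compile(r"total number of samples: (\d+)")

def parse_dataset_log(lines):
    rows, current, seqs = [], None, None
    for line in lines:
        if (m := LOAD_RE.search(line)):
            current, seqs = m.group(1), None       # new shard starts
        elif (m := SEQ_RE.search(line)):
            seqs = int(m.group(1))                 # "total number of sequences"
        elif current and (m := SAMPLE_RE.search(line)):
            rows.append((current, seqs, int(m.group(1))))
            current = None                         # "total number of samples" ends the block
    return rows

with open("train.log") as f:                       # placeholder path
    for idx_file, n_seq, n_samples in parse_dataset_log(f):
        print(f"{idx_file}\t{n_seq}\t{n_samples}")
```

After the per-shard loads, the setup log finishes as follows: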
INFO:megatron.core.datasets.blended_dataset:Build and save the BlendedDataset indices
INFO:megatron.core.datasets.blended_dataset:	Build and save the dataset and dataset sample indexes
WARNING:megatron.core.datasets.blended_dataset:Cannot save the BlendedDataset indexes because path_to_cache is None
> finished creating GPT datasets ...
[after dataloaders are built] datetime: 2025-06-27 12:23:34 
done with setup ...
(min, max) time across ranks (ms):
    model-and-optimizer-setup ......................: (320.15, 320.15)
    train/valid/test-data-iterators-setup ..........: (3543.10, 3543.10)
training ...
Setting rerun_state_machine.current_iteration to 0...
[before the start of training step] datetime: 2025-06-27 12:23:34 
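One thing to note in the setup log above: the WARNING about `path_to_cache` means the BlendedDataset blend indexes are rebuilt from scratch on every launch rather than persisted. Megatron-LM exposes a `--data-cache-path` argument for caching these index files across runs; whether (and under what key) the FlagScale baseline config surfaces it is something to verify rather than a given.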
Training then runs for the first 48 of 24000 iterations. Model size and memory, reported after iteration 1:

- Number of parameters: 1.20 B in the transformer blocks + 0.19 B in the embedding layers = 1.39 B total; most-loaded shard 1.3942 B.
- Theoretical memory footprint (weight and optimizer): 23933.59 MB.
- [Rank 0] memory after 1 iteration (MB): allocated 24119.04, max allocated 24119.55, reserved 26416.0, max reserved 26416.0.

Per-iteration metrics (all on 2025-06-27; constants across rows: global batch size 1024, so consumed samples = 1024 × iteration; loss scale 1.0; 0 skipped and 0 NaN iterations):

| Iter | Time | ms/iter | TFLOP/s/GPU | LR | LM loss | Load-balancing loss | Grad norm | Num zeros | Params norm |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 12:27:53 | 258947.6 | 42.3 | 1.5e-06 | 11.92634 | 1.041995 | 5.960 | 0 | 238.330 |
| 2 | 12:32:13 | 260283.7 | 42.1 | 3.0e-06 | 11.92975 | 1.041399 | 5.938 | 0 | 238.330 |
| 3 | 12:36:33 | 260541.1 | 42.1 | 4.5e-06 | 11.91912 | 1.041041 | 5.986 | 0 | 238.330 |
| 4 | 12:40:53 | 259756.9 | 42.2 | 6.0e-06 | 11.84126 | 1.043573 | 5.782 | 0 | 238.330 |
| 5 | 12:45:14 | 260413.5 | 42.1 | 7.5e-06 | 11.74128 | 1.052462 | 5.367 | 1 | 238.330 |
| 6 | 12:49:34 | 260808.5 | 42.0 | 9.0e-06 | 11.67125 | 1.060444 | 5.116 | 0 | 238.330 |
| 7 | 12:53:54 | 259943.4 | 42.2 | 1.05e-05 | 11.56939 | 1.075724 | 4.446 | 1 | 238.330 |
| 8 | 12:58:14 | 259467.8 | 42.3 | 1.2e-05 | 11.49021 | 1.089425 | 3.690 | 1 | 238.331 |
| 9 | 13:02:32 | 258421.4 | 42.4 | 1.35e-05 | 11.42278 | 1.098188 | 3.258 | 0 | 238.333 |
| 10 | 13:06:50 | 257502.7 | 42.6 | 1.5e-05 | 11.35647 | 1.102417 | 2.955 | 0 | 238.335 |
| 11 | 13:11:08 | 257793.8 | 42.5 | 1.65e-05 | 11.31526 | 1.101932 | 2.657 | 1 | 238.338 |
| 12 | 13:15:26 | 258544.5 | 42.4 | 1.8e-05 | 11.28093 | 1.100040 | 2.531 | 0 | 238.341 |
| 13 | 13:19:47 | 261221.8 | 42.0 | 1.95e-05 | 11.22959 | 1.097064 | 2.656 | 1 | 238.346 |
| 14 | 13:24:08 | 260648.7 | 42.1 | 2.1e-05 | 11.17949 | 1.093009 | 2.415 | 0 | 238.351 |
| 15 | 13:28:29 | 260931.1 | 42.0 | 2.25e-05 | 11.12640 | 1.087326 | 2.490 | 0 | 238.358 |
| 16 | 13:32:50 | 261004.3 | 42.0 | 2.4e-05 | 11.07808 | 1.079241 | 2.431 | 0 | 238.366 |
| 17 | 13:37:12 | 261991.3 | 41.9 | 2.55e-05 | 11.01676 | 1.070701 | 2.961 | 0 | 238.376 |
| 18 | 13:41:35 | 262600.2 | 41.8 | 2.7e-05 | 10.93517 | 1.067473 | 2.937 | 2 | 238.388 |
| 19 | 13:45:57 | 262636.4 | 41.8 | 2.85e-05 | 10.86546 | 1.060247 | 3.895 | 0 | 238.401 |
| 20 | 13:50:20 | 262766.2 | 41.7 | 3.0e-05 | 10.81787 | 1.056373 | 3.499 | 1 | 238.416 |
| 21 | 13:54:44 | 263680.6 | 41.6 | 3.15e-05 | 10.73946 | 1.050537 | 3.805 | 1 | 238.432 |
| 22 | 13:59:07 | 263207.2 | 41.7 | 3.3e-05 | 10.66498 | 1.049672 | 2.622 | 0 | 238.451 |
| 23 | 14:03:30 | 263641.7 | 41.6 | 3.45e-05 | 10.60945 | 1.047704 | 2.150 | 0 | 238.471 |
| 24 | 14:07:53 | 262820.3 | 41.7 | 3.6e-05 | 10.54304 | 1.045879 | 2.249 | 0 | 238.495 |
| 25 | 14:12:17 | 263325.3 | 41.6 | 3.75e-05 | 10.48294 | 1.044638 | 2.400 | 0 | 238.520 |
| 26 | 14:16:40 | 262943.7 | 41.7 | 3.9e-05 | 10.42026 | 1.041865 | 2.440 | 0 | 238.548 |
| 27 | 14:21:03 | 263056.6 | 41.7 | 4.05e-05 | 10.33104 | 1.043118 | 2.389 | 0 | 238.579 |
| 28 | 14:25:26 | 263034.5 | 41.7 | 4.2e-05 | 10.26868 | 1.040587 | 2.169 | 0 | 238.612 |
| 29 | 14:29:48 | 262752.9 | 41.7 | 4.35e-05 | 10.18083 | 1.041168 | 2.247 | 0 | 238.648 |
| 30 | 14:34:12 | 263202.5 | 41.7 | 4.5e-05 | 10.11347 | 1.042428 | 2.255 | 1 | 238.686 |
| 31 | 14:38:34 | 262553.6 | 41.8 | 4.65e-05 | 10.02936 | 1.040154 | 2.140 | 1 | 238.728 |
| 32 | 14:42:57 | 263258.3 | 41.7 | 4.8e-05 | 9.967064 | 1.039624 | 2.145 | 1 | 238.773 |
| 33 | 14:47:20 | 263033.7 | 41.7 | 4.95e-05 | 9.870680 | 1.039607 | 2.177 | 0 | 238.821 |
| 34 | 14:51:44 | 263226.2 | 41.7 | 5.1e-05 | 9.804967 | 1.039294 | 2.142 | 0 | 238.873 |
| 35 | 14:56:07 | 263583.5 | 41.6 | 5.25e-05 | 9.716209 | 1.039907 | 2.205 | 1 | 238.928 |
| 36 | 15:00:31 | 263716.8 | 41.6 | 5.4e-05 | 9.629738 | 1.041594 | 2.289 | 0 | 238.986 |
| 37 | 15:04:54 | 263006.7 | 41.7 | 5.55e-05 | 9.566217 | 1.042442 | 2.398 | 0 | 239.047 |
| 38 | 15:09:17 | 262926.5 | 41.7 | 5.7e-05 | 9.475837 | 1.043912 | 2.430 | 0 | 239.111 |
| 39 | 15:13:40 | 263264.4 | 41.7 | 5.85e-05 | 9.383533 | 1.048713 | 2.290 | 0 | 239.179 |
| 40 | 15:18:03 | 263157.9 | 41.7 | 6.0e-05 | 9.300698 | 1.049337 | 2.338 | 1 | 239.249 |
| 41 | 15:22:26 | 263029.2 | 41.7 | 6.15e-05 | 9.214035 | 1.049348 | 2.173 | 1 | 239.323 |
| 42 | 15:26:49 | 262970.9 | 41.7 | 6.3e-05 | 9.111977 | 1.052316 | 2.166 | 0 | 239.401 |
| 43 | 15:31:13 | 263657.2 | 41.6 | 6.45e-05 | 9.048386 | 1.053344 | 2.191 | 0 | 239.483 |
| 44 | 15:35:35 | 262466.2 | 41.8 | 6.6e-05 | 8.968973 | 1.054048 | 2.190 | 1 | 239.568 |
| 45 | 15:39:59 | 263089.7 | 41.7 | 6.75e-05 | 8.847938 | 1.059039 | 2.177 | 39 | 239.657 |
| 46 | 15:44:22 | 263084.5 | 41.7 | 6.9e-05 | 8.766063 | 1.055782 | 2.152 | 1 | 239.750 |
| 47 | 15:48:46 | 263952.4 | 41.5 | 7.05e-05 | 8.684135 | 1.057991 | 2.165 | 0 | 239.846 |
| 48 | 15:53:09 | 263581.7 | 41.6 | 7.2e-05 | 8.595925 | 1.059975 | 2.102 | 0 | 239.948 |
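Over these 48 steps the LM loss falls from 11.93 to 8.60 while the learning rate ramps linearly at exactly 1.5e-6 per step. Two quick sanity checks on the logged numbers, as a minimal sketch: the "theoretical memory" line matches the common 18-bytes-per-parameter accounting for Megatron-style BF16 training with fp32 Adam and no distributed optimizer (an assumption about this run's config), and the LR column is consistent with linear warmup, where the peak LR and warmup length below are illustrative — only their ratio is pinned down by the log.

```python
# Sanity checks against the logged run (assumption-laden values marked inline).

BYTES_PER_PARAM = 18          # bf16 weight (2) + fp32 main grad (4) + fp32 master
                              # weight (4) + fp32 Adam moments (8): common Megatron
                              # accounting without a distributed optimizer (assumption)
params = 1.3942e9             # "most loaded shard" parameter count from the log
mb = params * BYTES_PER_PARAM / 2**20
print(f"theoretical weight+optimizer memory: {mb:.2f} MB")
# ~23933 MB, consistent with the logged 23933.59 MB up to rounding of 1.3942 B

def warmup_lr(step: int, lr_max: float = 3.0e-4, warmup_steps: int = 200) -> float:
    """Linear warmup. lr_max and warmup_steps are illustrative assumptions;
    the log only fixes their ratio, lr_max / warmup_steps = 1.5e-6."""
    return lr_max * min(step, warmup_steps) / warmup_steps

for step in (1, 10, 48):
    print(step, warmup_lr(step))   # 1.5e-06, 1.5e-05, 7.2e-05 -- matching the log
```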