[ERROR: tf.distribute ParameterServerStrategy: TFX Trainer component: model.save(): failed to connect to all addresses]

Log of the pod tfx-component-trainer:

2024-02-14 13:56:45.656154: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'examples' with dtype string and shape [?]
	 [[{{node examples}}]]
WARNING:absl:Found untraced functions such as _update_step_xla while saving (showing 1 of 1). These functions will not be directly callable after loading.
2024-02-14 13:56:58.654607: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'serving_default_examples' with dtype string and shape [?]
	 [[{{node serving_default_examples}}]]
ERROR:absl:Execution 81 failed.
Traceback (most recent call last):
  File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.8/dist-packages/tfx/orchestration/kubeflow/container_entrypoint.py", line 510, in <module>
    main(sys.argv[1:])
  File "/usr/local/lib/python3.8/dist-packages/tfx/orchestration/kubeflow/container_entrypoint.py", line 502, in main
    execution_info = component_launcher.launch()
  File "/usr/local/lib/python3.8/dist-packages/tfx/orchestration/portable/launcher.py", line 574, in launch
    executor_output = self._run_executor(execution_info)
  File "/usr/local/lib/python3.8/dist-packages/tfx/orchestration/portable/launcher.py", line 449, in _run_executor
    executor_output = self._executor_operator.run_executor(execution_info)
  File "/usr/local/lib/python3.8/dist-packages/tfx/orchestration/portable/python_executor_operator.py", line 135, in run_executor
    return run_with_executor(execution_info, executor)
  File "/usr/local/lib/python3.8/dist-packages/tfx/orchestration/portable/python_executor_operator.py", line 58, in run_with_executor
    result = executor.Do(execution_info.input_dict, output_dict,
  File "/usr/local/lib/python3.8/dist-packages/tfx/components/trainer/executor.py", line 178, in Do
    run_fn(fn_args)
  File "/tmp/tmp6whwtr0z/detect_anomalies_in_wafer_trainer.py", line 339, in run_fn
    model.save(fn_args.serving_model_dir, save_format='tf', signatures=signatures)
  File "/usr/local/lib/python3.8/dist-packages/keras/utils/traceback_utils.py", line 70, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/eager/execute.py", line 52, in quick_execute
    tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
tensorflow.python.framework.errors_impl.UnavailableError: Graph execution error:

failed to connect to all addresses
Additional GRPC error information from remote target /job:chief/replica:0/task:0 while calling /tensorflow.WorkerService/RecvTensor:
:{"created":"@1707919046.196861281","description":"Failed to pick subchannel","file":"external/com_github_grpc_grpc/src/core/ext/filters/client_channel/client_channel.cc","file_line":3941,"referenced_errors":[{"created":"@1707919046.183670123","description":"failed to connect to all addresses","file":"external/com_github_grpc_grpc/src/core/ext/filters/client_channel/lb_policy/pick_first/pick_first.cc","file_line":393,"grpc_status":14}]}
	 [[{{node num_shards/_4}}]]
Additional GRPC error information from remote target /job:ps/replica:0/task:0/device:CPU:0:
:{"created":"@1707919046.428929072","description":"Error received from peer ipv4:10.105.27.97:5000","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":" failed to connect to all addresses\nAdditional GRPC error information from remote target /job:chief/replica:0/task:0 while calling /tensorflow.WorkerService/RecvTensor:\n:{"created":"@1707919046.196861281","description":"Failed to pick subchannel","file":"external/com_github_grpc_grpc/src/core/ext/filters/client_channel/client_channel.cc","file_line":3941,"referenced_errors":[{"created":"@1707919046.183670123","description":"failed to connect to all addresses","file":"external/com_github_grpc_grpc/src/core/ext/filters/client_channel/lb_policy/pick_first/pick_first.cc","file_line":393,"grpc_status":14}]}\n\t [[{{node num_shards/_4}}]]","grpc_status":14} [Op:__inference_tf_function_save_108956]
2024-02-14 13:57:51.914254: I tensorflow/core/common_runtime/eager/kernel_and_device.cc:94] Ignoring error status when releasing multi-device function handle UNIMPLEMENTED: Releasing a multi-device component handle on a remote device is not yet implemented.
INFO:tensorflow:ClusterCoordinator destructor: stopping cluster
INFO:tensorflow:ClusterCoordinator destructor: stopping cluster
INFO:tensorflow:Stopping cluster, starting with failure handler
INFO:tensorflow:Stopping cluster, starting with failure handler
INFO:tensorflow:Stopping workers
INFO:tensorflow:Stopping workers
INFO:tensorflow:Stopping queue
INFO:tensorflow:Stopping queue
INFO:tensorflow:Start cancelling remote resource-building functions
INFO:tensorflow:Start cancelling remote resource-building functions
time="2024-02-14T13:58:57.578Z" level=info msg="sub-process exited" argo=true error="<nil>"
time="2024-02-14T13:58:58.482Z" level=error msg="cannot save artifact /mlpipeline-ui-metadata.json" argo=true error="stat /mlpipeline-ui-metadata.json: no such file or directory"
Error: exit status 1

[ANALYSIS]

# in run_fn() of the module file of the TFX Trainer component
model.save(fn_args.serving_model_dir, ...)

# Trainer component definition in the pipeline
trainer = tfx.components.Trainer(
    module_file=module_file,
    examples=example_gen.outputs['examples'],
    transform_graph=transform.outputs['transform_graph'],
    train_args=tfx.proto.TrainArgs(num_steps=_STEPS_PER_EPOCH),
    custom_config={"epochs": 5},
)
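
For context, a minimal sketch of a run_fn that trains with ParameterServerStrategy and then saves, roughly matching the traceback above; _build_keras_model, _make_dataset and _make_signatures are illustrative placeholders, not the actual code of detect_anomalies_in_wafer_trainer.py:

import tensorflow as tf

def run_fn(fn_args):
  # Chief/worker/ps addresses are resolved from the TF_CONFIG env var on each pod.
  cluster_resolver = tf.distribute.cluster_resolver.TFConfigClusterResolver()
  strategy = tf.distribute.ParameterServerStrategy(cluster_resolver)
  with strategy.scope():
    model = _build_keras_model()  # variables are placed on the ps tasks

  # Model.fit under ParameterServerStrategy expects a DatasetCreator.
  model.fit(
      tf.keras.utils.experimental.DatasetCreator(
          lambda input_context: _make_dataset(fn_args)),
      epochs=fn_args.custom_config['epochs'],
      steps_per_epoch=fn_args.train_steps)

  # The failing call: saving pulls variable values from the ps and chief tasks
  # over gRPC, so every task in the cluster must still be reachable here.
  signatures = _make_signatures(model)
  model.save(fn_args.serving_model_dir, save_format='tf', signatures=signatures)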
fn_args.serving_model_dir is built by the Trainer executor; it can be traced through the TFX source:

# tfx/components/trainer/fn_args_utils.py

@attr.s
class FnArgs:
  """Args to pass to user defined training/tuning function(s).

  Attributes:
    working_dir: Working dir.
    train_files: A list of patterns for train files.
    eval_files: A list of patterns for eval files.
    train_steps: Number of train steps.
    eval_steps: Number of eval steps.
    schema_path: A single uri for schema file. Will be None if not specified.
    schema_file: Deprecated, use `schema_path` instead.
    transform_graph_path: An optional single uri for transform graph produced by
      TFT. Will be None if not specified.
    transform_output: Deprecated, use `transform_graph_path` instead.
    data_accessor: Contains factories that can create tf.data.Datasets or other
      means to access the train/eval data. They provide a uniform way of
      accessing data, regardless of how the data is stored on disk.
    serving_model_dir: A single uri for the output directory of the serving
      model.
    eval_model_dir: A single uri for the output directory of the eval model.
      Note that this is estimator only, Keras doesn't require it for TFMA.
    model_run_dir: A single uri for the output directory of model training
      related files.
    base_model: An optional base model path that will be used for this training.
    hyperparameters: An optional keras_tuner.HyperParameters config.
    custom_config: An optional dictionary passed to the component.
  """
  working_dir = attr.ib(type=str, default=None)
  train_files = attr.ib(type=List[str], default=None)
  eval_files = attr.ib(type=List[str], default=None)
  train_steps = attr.ib(type=int, default=None)
  eval_steps = attr.ib(type=int, default=None)
  schema_path = attr.ib(type=str, default=None)
  schema_file = attr.ib(type=str, default=None)
  transform_graph_path = attr.ib(type=str, default=None)
  transform_output = attr.ib(type=str, default=None)
  data_accessor = attr.ib(type=DataAccessor, default=None)
  serving_model_dir = attr.ib(type=str, default=None)
  eval_model_dir = attr.ib(type=str, default=None)
  model_run_dir = attr.ib(type=str, default=None)
  base_model = attr.ib(type=str, default=None)
  hyperparameters = attr.ib(type=Dict[str, Any], default=None)
  custom_config = attr.ib(type=Dict[str, Any], default=None)


def get_common_fn_args(input_dict: Dict[str, List[types.Artifact]],
                       exec_properties: Dict[str, Any],
                       working_dir: Optional[str] = None) -> FnArgs:
  """Get common args of training and tuning."""
  if input_dict.get(standard_component_specs.TRANSFORM_GRAPH_KEY):
    transform_graph_path = artifact_utils.get_single_uri(
        input_dict[standard_component_specs.TRANSFORM_GRAPH_KEY])
  else:
    transform_graph_path = None

  if input_dict.get(standard_component_specs.SCHEMA_KEY):
    schema_path = io_utils.get_only_uri_in_dir(
        artifact_utils.get_single_uri(
            input_dict[standard_component_specs.SCHEMA_KEY]))
  else:
    schema_path = None

  train_args = trainer_pb2.TrainArgs()
  eval_args = trainer_pb2.EvalArgs()
  proto_utils.json_to_proto(
      exec_properties[standard_component_specs.TRAIN_ARGS_KEY], train_args)
  proto_utils.json_to_proto(
      exec_properties[standard_component_specs.EVAL_ARGS_KEY], eval_args)

  # Default behavior is train on `train` split (when splits is empty in train
  # args) and evaluate on `eval` split (when splits is empty in eval args).
  if not train_args.splits:
    train_args.splits.append('train')
    absl.logging.info("Train on the 'train' split when train_args.splits is "
                      'not set.')
  if not eval_args.splits:
    eval_args.splits.append('eval')
    absl.logging.info("Evaluate on the 'eval' split when eval_args.splits is "
                      'not set.')

  train_files = []
  for train_split in train_args.splits:
    train_files.extend([
        io_utils.all_files_pattern(uri)
        for uri in artifact_utils.get_split_uris(
            input_dict[standard_component_specs.EXAMPLES_KEY], train_split)
    ])

  eval_files = []
  for eval_split in eval_args.splits:
    eval_files.extend([
        io_utils.all_files_pattern(uri)
        for uri in artifact_utils.get_split_uris(
            input_dict[standard_component_specs.EXAMPLES_KEY], eval_split)
    ])

  data_accessor = DataAccessor(
      tf_dataset_factory=tfxio_utils.get_tf_dataset_factory_from_artifact(
          input_dict[standard_component_specs.EXAMPLES_KEY],
          _TELEMETRY_DESCRIPTORS),
      record_batch_factory=tfxio_utils.get_record_batch_factory_from_artifact(
          input_dict[standard_component_specs.EXAMPLES_KEY],
          _TELEMETRY_DESCRIPTORS),
      data_view_decode_fn=tfxio_utils.get_data_view_decode_fn_from_artifact(
          input_dict[standard_component_specs.EXAMPLES_KEY],
          _TELEMETRY_DESCRIPTORS)
      )

  # https://github.com/tensorflow/tfx/issues/45: Replace num_steps=0 with
  # num_steps=None.  Conversion of the proto to python will set the default
  # value of an int as 0 so modify the value here.  Tensorflow will raise an
  # error if num_steps <= 0.
  train_steps = train_args.num_steps or None
  eval_steps = eval_args.num_steps or None

  # Load and deserialize custom config from execution properties.
  # Note that in the component interface the default serialization of custom
  # config is 'null' instead of '{}'. Therefore we need to default the
  # json_utils.loads to 'null' then populate it with an empty dict when
  # needed.
  custom_config = json_utils.loads(
      exec_properties.get(standard_component_specs.CUSTOM_CONFIG_KEY, 'null'))

  # TODO(ruoyu): Make this a dict of tag -> uri instead of list.
  if input_dict.get(standard_component_specs.BASE_MODEL_KEY):
    base_model_artifact = artifact_utils.get_single_instance(
        input_dict[standard_component_specs.BASE_MODEL_KEY])
    base_model = path_utils.serving_model_path(
        base_model_artifact.uri,
        path_utils.is_old_model_artifact(base_model_artifact))
  else:
    base_model = None

  return FnArgs(
      working_dir=working_dir,
      train_files=train_files,
      eval_files=eval_files,
      train_steps=train_steps,
      eval_steps=eval_steps,
      schema_path=schema_path,
      transform_graph_path=transform_graph_path,
      data_accessor=data_accessor,
      base_model=base_model,
      custom_config=custom_config,
  )
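
The data_accessor and custom_config built above are what user code in the module file typically consumes. Below is a short sketch of the conventional TFX _input_fn pattern; the label key 'label' and the batch size are assumptions for illustration:

from tfx_bsl.public import tfxio

def _input_fn(file_pattern, data_accessor, schema, batch_size=64):
  # tf_dataset_factory turns the file patterns in train_files/eval_files into
  # a batched tf.data.Dataset of (features, label) tuples.
  return data_accessor.tf_dataset_factory(
      file_pattern,
      tfxio.TensorFlowDatasetOptions(batch_size=batch_size, label_key='label'),
      schema).repeat()

After the JSON round trip, custom_config arrives as a plain dict, so the {"epochs": 5} passed to the Trainer above is read back as fn_args.custom_config['epochs'].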


# tfx/components/trainer/executor.py
  def _GetFnArgs(self, input_dict: Dict[str, List[types.Artifact]],
                 output_dict: Dict[str, List[types.Artifact]],
                 exec_properties: Dict[str, Any]) -> fn_args_utils.FnArgs:
    if input_dict.get(standard_component_specs.HYPERPARAMETERS_KEY):
      hyperparameters_file = io_utils.get_only_uri_in_dir(
          artifact_utils.get_single_uri(
              input_dict[standard_component_specs.HYPERPARAMETERS_KEY]))
      hyperparameters_config = json.loads(
          file_io.read_file_to_string(hyperparameters_file))
    else:
      hyperparameters_config = None

    output_path = artifact_utils.get_single_uri(
        output_dict[standard_component_specs.MODEL_KEY])
    serving_model_dir = path_utils.serving_model_dir(output_path)
...

# tfx/utils/path_utils.py
def serving_model_dir(output_uri: str, is_old_artifact: bool = False) -> str:
  """Returns directory for exported model for serving purpose."""
  if is_old_artifact:
    return os.path.join(output_uri, _OLD_SERVING_MODEL_DIR)
  return os.path.join(output_uri, path_constants.SERVING_MODEL_DIR)


# tfx/utils/path_constants.py
SERVING_MODEL_DIR = 'Format-Serving'

So fn_args.serving_model_dir is the URI of the Trainer output artifact "model" joined with 'Format-Serving'.
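
For example, with a hypothetical artifact URI:

>>> import os
>>> output_uri = '/pipeline_root/Trainer/model/81'  # hypothetical uri of the "model" artifact
>>> os.path.join(output_uri, 'Format-Serving')
'/pipeline_root/Trainer/model/81/Format-Serving'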

  File "/tmp/tmp6whwtr0z/detect_anomalies_in_wafer_trainer.py", line 339, in run_fn
    model.save(fn_args.serving_model_dir, save_format='tf', signatures=signatures)
  File "/usr/local/lib/python3.8/dist-packages/keras/utils/traceback_utils.py", line 70, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/eager/execute.py", line 52, in quick_execute
    tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
tensorflow.python.framework.errors_impl.UnavailableError: Graph execution error:

failed to connect to all addresses

[SOLUTION]

The message means the Trainer failed to connect to all addresses of a gRPC peer. After checking the cluster, it turned out that the pod metadata-grpc-deployment had gone down and been recreated, so the connection was lost mid-run. Just retry the pipeline run.
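
A quick way to probe whether a given gRPC endpoint is reachable from inside the cluster; the address below is the ps peer from the log above, so substitute the address of whichever service you want to check:

import grpc

# Probe a gRPC endpoint. 10.105.27.97:5000 is the ps peer seen in the log
# above; replace it with the metadata-grpc service address to check that one.
channel = grpc.insecure_channel('10.105.27.97:5000')
try:
    grpc.channel_ready_future(channel).result(timeout=5)
    print('reachable')
except grpc.FutureTimeoutError:
    print('failed to connect')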