tf.feature_column.input_layer feature ordering issue

Conclusion first

  • The tf.feature_column.input_layer() API sorts the feature_columns it receives instead of keeping the order in which they were passed in. The sort key is each feature_column's name (generated by TF, e.g. 'u_wu211_indicator', 'u_wu215_indicator', 'r_rsp113_indicator', 'u_wu211_X_u_wu215_indicator').
  • Key code:
    for column in sorted(feature_columns, key=lambda x: x.name):
      ordered_columns.append(column)
  • Verification:
In [31]: [x.name for x in sorted( fcs, key=lambda x: x.name)]
Out[31]:
['r_rsp113_indicator',
 'u_wu211_X_u_wu215_indicator',
 'u_wu211_indicator',
 'u_wu215_indicator']
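
Why does the crossed column land before 'u_wu211_indicator'? Python compares strings character by character, and an uppercase 'X' has a smaller code point than a lowercase 'i'. The snippet below is only a plain-Python illustration of that ordering (no TensorFlow involved); the names are copied from the output above.

```python
# Names TF generated for the four indicator columns, in the order they were passed in.
names = [
    "u_wu211_indicator",
    "u_wu215_indicator",
    "r_rsp113_indicator",
    "u_wu211_X_u_wu215_indicator",
]

# Same ordering rule as sorted(feature_columns, key=lambda x: x.name).
ordered = sorted(names)
print(ordered)

# 'u_wu211_X...' and 'u_wu211_indicator' share the prefix 'u_wu211_';
# the first differing characters are 'X' and 'i'.
print(ord("X"), ord("i"))
```

So any name containing the '_X_' cross marker sorts ahead of the plain column that shares its prefix.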

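Observed behavior

In the session below, tfeatures is the feature dict passed to input_layer; its definition is not shown in this post.
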
In [24]: u_wu211 = tf.feature_column.categorical_column_with_vocabulary_list(key='u_wu211', vocabulary_list=['0','1','2'])
    ...: u_wu215 = tf.feature_column.categorical_column_with_vocabulary_list(key='u_wu215', vocabulary_list=['00s','10s','90s'])
    ...: r_rsp113 = tf.feature_column.categorical_column_with_vocabulary_list(key='r_rsp113', vocabulary_list=['0','-1','1'])
    ...: u_wu211_u_wu215_cross = tf.feature_column.crossed_column(keys = [u_wu211, u_wu215], hash_bucket_size=3)
    ...: print(tf.feature_column.input_layer(tfeatures, [tf.feature_column.indicator_column(u_wu211)]))
    ...: print(tf.feature_column.input_layer(tfeatures, [tf.feature_column.indicator_column(u_wu215)]))
    ...: print(tf.feature_column.input_layer(tfeatures, [tf.feature_column.indicator_column(r_rsp113)]))
    ...: print(tf.feature_column.input_layer(tfeatures, [tf.feature_column.indicator_column(u_wu211_u_wu215_cross)]))
    ...: print(tf.feature_column.input_layer(tfeatures, [tf.feature_column.indicator_column(u_wu211),
    ...:   tf.feature_column.indicator_column(u_wu215),
    ...: tf.feature_column.indicator_column(r_rsp113),
    ...: tf.feature_column.indicator_column(u_wu211_u_wu215_cross)
    ...: ]))
    ...:
tf.Tensor(
[[1. 0. 0.]
 [0. 0. 1.]], shape=(2, 3), dtype=float32)
tf.Tensor(
[[0. 0. 0.]
 [1. 0. 0.]], shape=(2, 3), dtype=float32)
tf.Tensor(
[[0. 1. 0.]
 [0. 1. 0.]], shape=(2, 3), dtype=float32)
tf.Tensor(
[[0. 0. 1.]
 [0. 0. 1.]], shape=(2, 3), dtype=float32)
tf.Tensor(
[[0. 1. 0. 0. 0. 1. 1. 0. 0. 0. 0. 0.]
 [0. 1. 0. 0. 0. 1. 0. 0. 1. 1. 0. 0.]], shape=(2, 12), dtype=float32)
  • Taking the first sample as an example: the expected result is u_wu211 + u_wu215 + r_rsp113 + u_wu211_u_wu215_cross
    • i.e. [1. 0. 0.] + [0. 0. 0.] + [0. 1. 0.] + [0. 0. 1.]
    • but the actual result is [0. 1. 0.] + [0. 0. 1.] + [1. 0. 0.] + [0. 0. 0.], which corresponds to the order ['r_rsp113', 'u_wu211_u_wu215_cross', 'u_wu211', 'u_wu215']
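
The mismatch for the first sample can be reproduced without TensorFlow. The sketch below is only an illustration: the `sample` dict and the `concat` helper are made up for this example, and the vectors are copied from the printed tensors above.

```python
# One-hot outputs of the first sample, keyed by the TF-generated column names.
sample = {
    "u_wu211_indicator": [1.0, 0.0, 0.0],
    "u_wu215_indicator": [0.0, 0.0, 0.0],
    "r_rsp113_indicator": [0.0, 1.0, 0.0],
    "u_wu211_X_u_wu215_indicator": [0.0, 0.0, 1.0],
}


def concat(names):
    """Concatenate the per-column vectors in the given order (hypothetical helper)."""
    return [value for name in names for value in sample[name]]


input_order = list(sample)            # dicts keep insertion order (Python 3.7+)
expected = concat(input_order)        # what one would naively expect
actual = concat(sorted(input_order))  # what input_layer actually returned

print(expected)
print(actual)
```

The `actual` list is the first row of the (2, 12) tensor printed above.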

What the documentation says

    feature_columns: An iterable containing the FeatureColumns to use as inputs
      to your model. All items should be instances of classes derived from
      `_DenseColumn` such as `numeric_column`, `embedding_column`,
      `bucketized_column`, `indicator_column`. If you have categorical features,
      you can wrap them with an `embedding_column` or `indicator_column`.
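  • The feature_columns parameter takes an iterable of the FeatureColumns used by the model. Every item should be an instance of a class derived from _DenseColumn, such as numeric_column, embedding_column, bucketized_column or indicator_column. Categorical features need to be wrapped with embedding_column or indicator_column first.
  • Nothing here says anything about the order of the features.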

Digging into the source

  • tf.feature_column.input_layer
@tf_export(v1=['feature_column.input_layer'])
def input_layer(features,
                feature_columns,
                weight_collections=None,
                trainable=True,
                cols_to_vars=None,
                cols_to_output_tensors=None):
  """Returns a dense `Tensor` as input layer based on given `feature_columns`.

  Generally a single example in training data is described with FeatureColumns.
  At the first layer of the model, this column oriented data should be converted
  to a single `Tensor`.

  Example:

  ```python
  price = numeric_column('price')
  keywords_embedded = embedding_column(
      categorical_column_with_hash_bucket("keywords", 10K), dimensions=16)
  columns = [price, keywords_embedded, ...]
  features = tf.io.parse_example(..., features=make_parse_example_spec(columns))
  dense_tensor = input_layer(features, columns)
  for units in [128, 64, 32]:
    dense_tensor = tf.compat.v1.layers.dense(dense_tensor, units, tf.nn.relu)
  prediction = tf.compat.v1.layers.dense(dense_tensor, 1)
  ```

  Args:
    features: A mapping from key to tensors. `_FeatureColumn`s look up via these
      keys. For example `numeric_column('price')` will look at 'price' key in
      this dict. Values can be a `SparseTensor` or a `Tensor` depends on
      corresponding `_FeatureColumn`.
    feature_columns: An iterable containing the FeatureColumns to use as inputs
      to your model. All items should be instances of classes derived from
      `_DenseColumn` such as `numeric_column`, `embedding_column`,
      `bucketized_column`, `indicator_column`. If you have categorical features,
      you can wrap them with an `embedding_column` or `indicator_column`.
    weight_collections: A list of collection names to which the Variable will be
      added. Note that variables will also be added to collections
      `tf.GraphKeys.GLOBAL_VARIABLES` and `ops.GraphKeys.MODEL_VARIABLES`.
    trainable: If `True` also add the variable to the graph collection
      `GraphKeys.TRAINABLE_VARIABLES` (see `tf.Variable`).
    cols_to_vars: If not `None`, must be a dictionary that will be filled with a
      mapping from `_FeatureColumn` to list of `Variable`s.  For example, after
      the call, we might have cols_to_vars =
      {_EmbeddingColumn(
        categorical_column=_HashedCategoricalColumn(
          key='sparse_feature', hash_bucket_size=5, dtype=tf.string),
        dimension=10): [<tf.Variable 'some_variable:0' shape=(5, 10),
                        <tf.Variable 'some_variable:1' shape=(5, 10)]}
      If a column creates no variables, its value will be an empty list.
    cols_to_output_tensors: If not `None`, must be a dictionary that will be
      filled with a mapping from '_FeatureColumn' to the associated
      output `Tensor`s.

  Returns:
    A `Tensor` which represents input layer of a model. Its shape
    is (batch_size, first_layer_dimension) and its dtype is `float32`.
    first_layer_dimension is determined based on given `feature_columns`.

  Raises:
    ValueError: if an item in `feature_columns` is not a `_DenseColumn`.
  """
  return _internal_input_layer(
      features,
      feature_columns,
      weight_collections=weight_collections,
      trainable=trainable,
      cols_to_vars=cols_to_vars,
      cols_to_output_tensors=cols_to_output_tensors)

  • _internal_input_layer

def _internal_input_layer(features,
                          feature_columns,
                          weight_collections=None,
                          trainable=True,
                          cols_to_vars=None,
                          scope=None,
                          cols_to_output_tensors=None,
                          from_template=False):
  """See input_layer. `scope` is a name or variable scope to use."""

  feature_columns = _normalize_feature_columns(feature_columns)
  for column in feature_columns:
    if not isinstance(column, _DenseColumn):
      raise ValueError(
          'Items of feature_columns must be a _DenseColumn. '
          'You can wrap a categorical column with an '
          'embedding_column or indicator_column. Given: {}'.format(column))
  weight_collections = list(weight_collections or [])
  if ops.GraphKeys.GLOBAL_VARIABLES not in weight_collections:
    weight_collections.append(ops.GraphKeys.GLOBAL_VARIABLES)
  if ops.GraphKeys.MODEL_VARIABLES not in weight_collections:
    weight_collections.append(ops.GraphKeys.MODEL_VARIABLES)

  def _get_logits():  # pylint: disable=missing-docstring
    builder = _LazyBuilder(features)
    output_tensors = []
    ordered_columns = []
    for column in sorted(feature_columns, key=lambda x: x.name):
      ordered_columns.append(column)
      with variable_scope.variable_scope(
          None, default_name=column._var_scope_name):  # pylint: disable=protected-access
        tensor = column._get_dense_tensor(  # pylint: disable=protected-access
            builder,
            weight_collections=weight_collections,
            trainable=trainable)
        num_elements = column._variable_shape.num_elements()  # pylint: disable=protected-access
        batch_size = array_ops.shape(tensor)[0]
        output_tensor = array_ops.reshape(
            tensor, shape=(batch_size, num_elements))
        output_tensors.append(output_tensor)
        if cols_to_vars is not None:
          # Retrieve any variables created (some _DenseColumn's don't create
          # variables, in which case an empty list is returned).
          cols_to_vars[column] = ops.get_collection(
              ops.GraphKeys.GLOBAL_VARIABLES,
              scope=variable_scope.get_variable_scope().name)
        if cols_to_output_tensors is not None:
          cols_to_output_tensors[column] = output_tensor
    _verify_static_batch_size_equality(output_tensors, ordered_columns)
    return array_ops.concat(output_tensors, 1)

  # If we're constructing from the `make_template`, that by default adds a
  # variable scope with the name of the layer. In that case, we dont want to
  # add another `variable_scope` as that would break checkpoints.
  if from_template:
    return _get_logits()
  else:
    with variable_scope.variable_scope(
        scope, default_name='input_layer', values=features.values()):
      return _get_logits()

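  • Two things to note:
    • In _get_logits, _LazyBuilder deduplicates features that are referenced more than once and initializes them lazily.
    • When the columns are added, they are first sorted by the feature_column's name (generated by TF, e.g. 'u_wu211_indicator', 'u_wu215_indicator', 'r_rsp113_indicator', 'u_wu211_X_u_wu215_indicator').
    • The relevant code: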
  def _get_logits():  # pylint: disable=missing-docstring
    builder = _LazyBuilder(features)
    output_tensors = []
    ordered_columns = []
    for column in sorted(feature_columns, key=lambda x: x.name):
      ordered_columns.append(column)
      with variable_scope.variable_scope(
          None, default_name=column._var_scope_name):  # pylint: disable=protected-access
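
The "deduplicate and initialize lazily" part is easier to picture with a toy version. The class below is not TF's _LazyBuilder; it is a hypothetical stand-in (`LazyCache`, `vocab_lookup` and the `calls` counter are invented here) that only shows the idea: nothing is transformed until a key is first requested, and a repeated request reuses the cached result.

```python
class LazyCache:
    """Hypothetical stand-in for the caching idea behind _LazyBuilder."""

    def __init__(self, raw_features, transform):
        self._raw = raw_features
        self._transform = transform
        self._cache = {}
        self.calls = 0  # how many times the transform actually ran

    def get(self, key):
        # Transform on first access only; later accesses hit the cache.
        if key not in self._cache:
            self.calls += 1
            self._cache[key] = self._transform(self._raw[key])
        return self._cache[key]


def vocab_lookup(values, vocabulary=("0", "1", "2")):
    """Map each raw string to its index in the vocabulary (toy transform)."""
    return [vocabulary.index(v) for v in values]


cache = LazyCache({"u_wu211": ["0", "2"]}, transform=vocab_lookup)
calls_before = cache.calls     # nothing has been computed yet
first = cache.get("u_wu211")   # used by the plain indicator column
second = cache.get("u_wu211")  # used again by the crossed column
print(first, cache.calls)
```

In this toy, a feature shared by u_wu211 and the u_wu211 x u_wu215 cross is transformed once, however many columns ask for it.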

Verifying the conclusion

In [29]: fcs = [tf.feature_column.indicator_column(u_wu211),
    ...:   tf.feature_column.indicator_column(u_wu215),
    ...: tf.feature_column.indicator_column(r_rsp113),
    ...: tf.feature_column.indicator_column(u_wu211_u_wu215_cross)
    ...: ]

In [30]: sorted( fcs, key=lambda x: x.name)
Out[30]:
[IndicatorColumn(categorical_column=VocabularyListCategoricalColumn(key='r_rsp113', vocabulary_list=('0', '-1', '1'), dtype=tf.string, default_value=-1, num_oov_buckets=0)),
 IndicatorColumn(categorical_column=CrossedColumn(keys=(VocabularyListCategoricalColumn(key='u_wu211', vocabulary_list=('0', '1', '2'), dtype=tf.string, default_value=-1, num_oov_buckets=0), VocabularyListCategoricalColumn(key='u_wu215', vocabulary_list=('00s', '10s', '90s'), dtype=tf.string, default_value=-1, num_oov_buckets=0)), hash_bucket_size=3, hash_key=None)),
 IndicatorColumn(categorical_column=VocabularyListCategoricalColumn(key='u_wu211', vocabulary_list=('0', '1', '2'), dtype=tf.string, default_value=-1, num_oov_buckets=0)),
 IndicatorColumn(categorical_column=VocabularyListCategoricalColumn(key='u_wu215', vocabulary_list=('00s', '10s', '90s'), dtype=tf.string, default_value=-1, num_oov_buckets=0))]
In [31]: [x.name for x in sorted( fcs, key=lambda x: x.name)]
Out[31]:
['r_rsp113_indicator',
 'u_wu211_X_u_wu215_indicator',
 'u_wu211_indicator',
 'u_wu215_indicator']
  • Expected result:
  • ['r_rsp113_indicator', 'u_wu211_X_u_wu215_indicator', 'u_wu211_indicator', 'u_wu215_indicator']
    • i.e. [0. 1. 0.] + [0. 0. 1.] + [1. 0. 0.] + [0. 0. 0.]
    • [0. 1. 0. 0. 0. 1. 1. 0. 0. 0. 0. 0.]
    • This matches the output input_layer actually produced.
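
Once the ordering rule is known, it can be used to locate each feature inside the concatenated tensor. The following is a minimal sketch, assuming the columns are sorted by name exactly as in the source above and that each column's output width is known; `column_slices` and `widths` are names invented for this example, not part of TF.

```python
def column_slices(name_to_width):
    """Return {column name: (start, end)} offsets, assuming name-sorted concatenation."""
    slices = {}
    start = 0
    for name in sorted(name_to_width):
        end = start + name_to_width[name]
        slices[name] = (start, end)
        start = end
    return slices


# Each indicator column in this post produces 3 values per sample.
widths = {
    "u_wu211_indicator": 3,
    "u_wu215_indicator": 3,
    "r_rsp113_indicator": 3,
    "u_wu211_X_u_wu215_indicator": 3,
}

# First row of the (2, 12) tensor printed earlier.
row = [0.0, 1.0, 0.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0]

slices = column_slices(widths)
recovered = {name: row[start:end] for name, (start, end) in slices.items()}
print(slices)
print(recovered)
```

Slicing the row this way gives back the same per-column vectors that the single-column input_layer calls printed.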
posted @ 2022-08-15 11:57  澄轶