Classify structured data using Keras Preprocessing Layers

对于既有数值特征，又有类别特征的输入情况，使用 keras的预处理层进行转换。

https://colab.research.google.com/github/tensorflow/docs/blob/master/site/en/tutorials/structured_data/preprocessing_layers.ipynb?hl=ar-bh#scrollTo=sMYQvJuBi7MS

This tutorial demonstrates how to classify structured data (e.g. tabular data in a CSV). You will use Keras to define the model, and preprocessing layers as a bridge to map from columns in a CSV to features used to train the model. This tutorial contains complete code to:

Load a CSV file using Pandas.

Build an input pipeline to batch and shuffle the rows using tf.data.

Map from columns in the CSV to features used to train the model using Keras Preprocessing layers.

Build, train, and evaluate a model using Keras.

Note: This tutorial is similar to Classify structured data with feature columns. This version uses new experimental Keras Preprocessing Layers instead of tf.feature_column. Keras Preprocessing Layers are more intuitive, and can be easily included inside your model to simplify deployment.

处理效果

数值型，定义输入，进行正规化变换。

类别性，转换为整数，对整数进行类别编码， one-hot-code

`CategoryEncoding`

https://www.tensorflow.org/api_docs/python/tf/keras/layers/experimental/preprocessing/CategoryEncoding

This layer provides options for condensing data into a categorical encoding. It accepts integer values as inputs and outputs a dense representation (one sample = 1-index tensor of float values representing data about the sample's tokens) of those inputs.

类别编码对应的CODE

index -- 字符查询表，根据字符，查到index

encoder-- 根据index查询到的index，生成one-hot码

def get_category_encoding_layer(name, dataset, dtype, max_tokens=None):
  # Create a StringLookup layer which will turn strings into integer indices
  if dtype == 'string':
    index = preprocessing.StringLookup(max_tokens=max_tokens)
  else:
    index = preprocessing.IntegerLookup(max_values=max_tokens)

  # Prepare a Dataset that only yields our feature
  feature_ds = dataset.map(lambda x, y: x[name])

  # Learn the set of possible values and assign them a fixed integer index.
  index.adapt(feature_ds)

  # Create a Discretization for our integer indices.
  encoder = preprocessing.CategoryEncoding(max_tokens=index.vocab_size())

  # Apply one-hot encoding to our indices. The lambda function captures the
  # layer so we can use them, or include them in the functional model later.
  return lambda feature: encoder(index(feature))

测试输出

type_col = train_features['Type']
layer = get_category_encoding_layer('Type', train_ds, 'string')
layer(type_col)

<tf.Tensor: shape=(5, 4), dtype=float32, numpy=
array([[0., 0., 1., 0.],
       [0., 0., 1., 0.],
       [0., 0., 1., 0.],
       [0., 0., 1., 0.],
       [0., 0., 0., 1.]], dtype=float32)>

数值型处理

生成一个正则转换器。

def get_normalization_layer(name, dataset):
  # Create a Normalization layer for our feature.
  normalizer = preprocessing.Normalization()

  # Prepare a Dataset that only yields our feature.
  feature_ds = dataset.map(lambda x, y: x[name])

  # Learn the statistics of the data.
  normalizer.adapt(feature_ds)

  return normalizer

例子，将对应特征，做正规化处理。

photo_count_col = train_features['PhotoAmt']
layer = get_normalization_layer('PhotoAmt', train_ds)
layer(photo_count_col)

<tf.Tensor: shape=(5, 1), dtype=float32, numpy=
array([[ 1.3705449 ],
       [ 0.74395925],
       [-0.19591942],
       [-0.8225052 ],
       [-0.8225052 ]], dtype=float32)>

预处理管线定义

all_input 存储所有的输入层单元

encoded_features 存储所有的输入层单元，经过预处理管线后的逻辑单元，例如正规化对象，和 ONE-HOT 编码对象。

all_inputs = []
encoded_features = []

# Numeric features.
for header in ['PhotoAmt', 'Fee']:
  numeric_col = tf.keras.Input(shape=(1,), name=header)
  normalization_layer = get_normalization_layer(header, train_ds)
  encoded_numeric_col = normalization_layer(numeric_col)
  all_inputs.append(numeric_col)
  encoded_features.append(encoded_numeric_col)




# Categorical features encoded as string.
categorical_cols = ['Type', 'Color1', 'Color2', 'Gender', 'MaturitySize',
                    'FurLength', 'Vaccinated', 'Sterilized', 'Health', 'Breed1']
for header in categorical_cols:
  categorical_col = tf.keras.Input(shape=(1,), name=header, dtype='string')
  encoding_layer = get_category_encoding_layer(header, train_ds, dtype='string',
                                               max_tokens=5)
  encoded_categorical_col = encoding_layer(categorical_col)
  all_inputs.append(categorical_col)
  encoded_features.append(encoded_categorical_col)

使用 concatenate 接口，将所有的编码特征，连接起来。

构成统一的特征向量输出。

然后，将统一的特征向量输出，连接到第一层稠密层单元，然后连接到第二层稠密层单元。

第二层稠密层单元，就是作为模型输出。

tf.keras.Model 模型的一个参数，是对所有输入输出的规约；

第二个参数，是经过一些列预处理和模型本身的管线输出，模型输出的规约。

all_features = tf.keras.layers.concatenate(encoded_features)


x = tf.keras.layers.Dense(32, activation="relu")(all_features)
x = tf.keras.layers.Dropout(0.5)(x)
output = tf.keras.layers.Dense(1)(x)

model = tf.keras.Model(all_inputs, output)

model.compile(optimizer='adam',
              loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
              metrics=["accuracy"])

tf.keras.layers.concatenate

https://www.tensorflow.org/api_docs/python/tf/keras/layers/concatenate

tensor

https://www.tensorflow.org/guide/tensor

print("Type of every element:", rank_4_tensor.dtype)
print("Number of axes:", rank_4_tensor.ndim)
print("Shape of tensor:", rank_4_tensor.shape)
print("Elements along axis 0 of tensor:", rank_4_tensor.shape[0])
print("Elements along the last axis of tensor:", rank_4_tensor.shape[-1])
print("Total number of elements (3*2*4*5): ", tf.size(rank_4_tensor).numpy())

Type of every element: <dtype: 'float32'>
Number of axes: 4
Shape of tensor: (3, 2, 4, 5)
Elements along axis 0 of tensor: 3
Elements along the last axis of tensor: 5
Total number of elements (3*2*4*5):  120

A rank-4 tensor, shape: `[3, 2, 4, 5]`

posted @ 2021-05-10 11:11 lightsong 阅读(164) 评论(0) 收藏举报

刷新页面返回顶部

Stay Hungry,Stay Foolish!

lightsong

{Web: [React, Vue, NodeJS, HTTP]，DevOps:[Jenkins,Docker,K8S], Languages:[Python, JS, C, Lua, Shell, Groovy]}, AI:[LLM, langchain，langraph]

Classify structured data using Keras Preprocessing Layers

Classify structured data using Keras Preprocessing Layers

处理效果

`CategoryEncoding`

类别编码对应的CODE

数值型处理

预处理管线定义

tf.keras.layers.concatenate

tensor

公告