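Feature Platform: Feast
Feast is an open-source feature platform, originally developed by Gojek in collaboration with Google Cloud. It provides feature registration and management, plus an SDK for interacting with the feature store, including its offline store and online store. Official documentation: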
https://docs.feast.dev/
As of v0.24, the latest release at the time of writing, the supported offline stores include File, Snowflake, BigQuery, Redshift, Spark, PostgreSQL, Trino, and Azure Synapse. See:
https://docs.feast.dev/reference/offline-stores
The supported online stores include SQLite, Snowflake, Redis, Datastore, DynamoDB, PostgreSQL, and Cassandra. See:
https://docs.feast.dev/reference/online-stores
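A provider defines the environment Feast runs in and supplies the feature store implementation for that platform's components. There are currently four: local, gcp, aws, and azure.
| provider | supported offline stores | supported online stores |
| --- | --- | --- |
| local | BigQuery, File | Redis, Datastore, SQLite |
| gcp | BigQuery, File | Datastore, SQLite |
| aws | Redshift, File | DynamoDB, SQLite |
| azure | MsSQL (Azure Synapse + Azure SQL), File | Redis, SQLite |
See: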
https://docs.feast.dev/getting-started/architecture-and-components/provider
A data source defines where feature data comes from. Every batch data source is tied to one offline store; for example, SnowflakeSource can only be used with the Snowflake offline store.
The data source types are: File, Snowflake, BigQuery, Redshift, Push, Kafka, Kinesis, Spark, PostgreSQL, Trino, and Azure Synapse + Azure SQL.
| data source | offline store |
| --- | --- |
| FileSource | File |
| SnowflakeSource | Snowflake |
| BigQuerySource | BigQuery |
| RedshiftSource | Redshift |
| PushSource (can write features to both the online and the offline store) | |
| KafkaSource (still experimental) | |
| KinesisSource (still experimental) | |
| SparkSource (supports Hive tables and Parquet files) | Spark |
| PostgreSQLSource | PostgreSQL |
| TrinoSource | Trino |
| MsSqlServerSource | Azure Synapse + Azure SQL |
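The pairing rule in the table above can be restated as a small lookup. This is a minimal sketch in plain Python and not part of the Feast API: the `BATCH_SOURCE_TO_OFFLINE_STORE` dictionary and the `check_compatible` helper are hypothetical names used only for illustration.

```python
# Hypothetical lookup that restates the table above; this is not a Feast API.
BATCH_SOURCE_TO_OFFLINE_STORE = {
    "FileSource": "File",
    "SnowflakeSource": "Snowflake",
    "BigQuerySource": "BigQuery",
    "RedshiftSource": "Redshift",
    "SparkSource": "Spark",
    "PostgreSQLSource": "PostgreSQL",
    "TrinoSource": "Trino",
    "MsSqlServerSource": "Azure Synapse + Azure SQL",
}


def check_compatible(source_type: str, offline_store: str) -> bool:
    """Return True if the batch source type is paired with the given offline store."""
    expected = BATCH_SOURCE_TO_OFFLINE_STORE.get(source_type)
    # Sources without an entry (e.g. stream or push sources) are never a match here.
    return expected is not None and expected.lower() == offline_store.lower()


if __name__ == "__main__":
    print(check_compatible("SnowflakeSource", "Snowflake"))
    print(check_compatible("SnowflakeSource", "BigQuery"))
```

The comparison is case-insensitive only so that "file" and "File" are treated alike; stream and push sources are left out because the table does not pair them with an offline store.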
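Batch materialization engines load data from the offline store into the online store. They are configured under the batch_engine key in feature_store.yaml.
The default implementation is LocalMaterializationEngine; there is also LambdaMaterializationEngine, which is based on AWS Lambda.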
https://docs.feast.dev/getting-started/architecture-and-components/batch-materialization-engine
Bytewax (used together with Kubernetes) and Snowflake (when SnowflakeSource is in use) can also serve as the batch materialization engine.
You can also implement your own engine. See:
https://docs.feast.dev/how-to-guides/customizing-feast/creating-a-custom-materialization-engine
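To make the idea of materialization concrete, here is a conceptual sketch in plain Python. It is not Feast's engine interface: the `materialize` function, the `OfflineRow` tuple layout, and the dictionary standing in for the online store are all assumptions made for illustration. It copies the newest feature values per entity within a time window from a list of offline rows into a key-value mapping.

```python
from datetime import datetime
from typing import Dict, List, Tuple

# Hypothetical offline row layout: (entity_key, event_timestamp, feature_values)
OfflineRow = Tuple[int, datetime, Dict[str, float]]


def materialize(
    offline_rows: List[OfflineRow],
    online_store: Dict[int, Dict[str, float]],
    start: datetime,
    end: datetime,
) -> None:
    """Copy the newest feature values per entity within [start, end] into online_store."""
    latest: Dict[int, Tuple[datetime, Dict[str, float]]] = {}
    for entity_key, event_ts, values in offline_rows:
        if not (start <= event_ts <= end):
            continue  # row falls outside the materialization window
        current = latest.get(entity_key)
        if current is None or event_ts > current[0]:
            latest[entity_key] = (event_ts, values)
    # Overwrite the online entry for each entity with its newest values.
    for entity_key, (_, values) in latest.items():
        online_store[entity_key] = dict(values)


if __name__ == "__main__":
    rows = [
        (1001, datetime(2022, 9, 1, 10), {"conv_rate": 0.1}),
        (1001, datetime(2022, 9, 1, 12), {"conv_rate": 0.3}),
        (1002, datetime(2022, 8, 1, 9), {"conv_rate": 0.9}),
    ]
    online: Dict[int, Dict[str, float]] = {}
    materialize(rows, online, datetime(2022, 9, 1), datetime(2022, 9, 2))
    print(online)
```

A real engine additionally deals with serialization, batching, and the specific online store client; the sketch only shows the "latest value per entity key" idea.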
1. Installing Feast
https://docs.feast.dev/getting-started/quickstart
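The installation below uses v0.23 as an example. Python 3.8 is recommended for v0.23, and Python 3.7 for v0.22.
pip install feast==0.23.0
Because the offline store chosen here is Hive and the online store is Cassandra, the corresponding offline and online store plugins must be installed as well:
pip install feast-cassandra==0.1.3
pip install feast-hive==0.17.0
If installing feast-hive fails because thriftpy cannot be installed, install cython first:
pip install cython
pip install thriftpy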
2. Creating a Feast project
$ feast init my_project
Creating a new Feast repository in /Users/lintong/coding/python/my_project.

$ tree -L 3 my_project
my_project
├── __init__.py
├── data
│   └── driver_stats.parquet
├── example.py
└── feature_store.yaml

1 directory, 4 files
feature_store.yaml is where the offline store and online store are configured. The file must sit in the project's root directory. See:
https://docs.feast.dev/reference/feature-repository
The generated file looks like this:
project: my_project
registry: data/registry.db
provider: local
online_store:
    path: data/online_store.db
entity_key_serialization_version: 2
example.py defines the Feast pipeline: the data source for the features, the entity, the feature view registration, and the feature service. It looks like this:
# This is an example feature definition file

from datetime import timedelta

from feast import Entity, FeatureService, FeatureView, Field, FileSource
from feast.types import Float32, Int64

# Read data from parquet files. Parquet is convenient for local development mode. For
# production, you can use your favorite DWH, such as BigQuery. See Feast documentation
# for more info.
driver_hourly_stats = FileSource(
    name="driver_hourly_stats_source",
    path="/Users/lintong/coding/python/my_project/data/driver_stats.parquet",
    timestamp_field="event_timestamp",
    created_timestamp_column="created",
)

# Define an entity for the driver. You can think of entity as a primary key used to
# fetch features.
driver = Entity(name="driver", join_keys=["driver_id"])

# Our parquet files contain sample data that includes a driver_id column, timestamps and
# three feature columns. Here we define a Feature View that will allow us to serve this
# data to our model online.
driver_hourly_stats_view = FeatureView(
    name="driver_hourly_stats",
    entities=[driver],
    ttl=timedelta(days=1),
    schema=[
        Field(name="conv_rate", dtype=Float32),
        Field(name="acc_rate", dtype=Float32),
        Field(name="avg_daily_trips", dtype=Int64),
    ],
    online=True,
    source=driver_hourly_stats,
    tags={},
)

driver_stats_fs = FeatureService(
    name="driver_activity", features=[driver_hourly_stats_view]
)
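The `ttl=timedelta(days=1)` setting in the feature view above limits how far back in time a feature value may be taken from. The following is a conceptual sketch in plain Python, not Feast's implementation: `lookup_with_ttl` and the `FeatureRow` layout are hypothetical, and treating both window boundaries as inclusive is an assumption made for simplicity.

```python
from datetime import datetime, timedelta
from typing import Dict, List, Optional, Tuple

# Hypothetical row layout: (event_timestamp, feature_values)
FeatureRow = Tuple[datetime, Dict[str, float]]


def lookup_with_ttl(
    rows: List[FeatureRow], as_of: datetime, ttl: timedelta
) -> Optional[Dict[str, float]]:
    """Return the newest values whose event_timestamp lies in [as_of - ttl, as_of]."""
    candidates = [row for row in rows if as_of - ttl <= row[0] <= as_of]
    if not candidates:
        return None  # every known value is either too old or from the future
    return max(candidates, key=lambda row: row[0])[1]


if __name__ == "__main__":
    rows = [
        (datetime(2022, 9, 1, 8), {"conv_rate": 0.2}),
        (datetime(2022, 9, 1, 20), {"conv_rate": 0.5}),
    ]
    one_day = timedelta(days=1)
    print(lookup_with_ttl(rows, datetime(2022, 9, 1, 12), one_day))
    print(lookup_with_ttl(rows, datetime(2022, 9, 2, 7), one_day))
    print(lookup_with_ttl(rows, datetime(2022, 9, 3, 21), one_day))
```

With a one-day ttl, a request made two days after the last event finds nothing, which is why a ttl that is too short for the data's update frequency tends to produce missing feature values.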
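3. Configuring and registering the store and features
The feature store configuration file is feature_store.yaml by default; you can also supply your own.
The feature definition file is example.py by default; you can add your own definition files as well.
Once the configuration files are written, run feast apply to register the store and features. A .feastignore file can be used to exclude files, and the stores and features they define, from registration.
If feast apply fails with the following error:
ImportError: cannot import name 'soft_unicode' from 'markupsafe'
then the fix is: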
pip install markupsafe==2.0.1
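To check whether an environment is affected before running feast apply, a small diagnostic like the one below may help. It is a sketch using only the standard library; the helper name `markupsafe_has_soft_unicode` is hypothetical. It relies on the fact that `soft_unicode` was removed from markupsafe in 2.1.0, which is why pinning 2.0.1 resolves the error.

```python
import importlib


def markupsafe_has_soft_unicode() -> bool:
    """Return True if the installed markupsafe still exposes soft_unicode."""
    try:
        markupsafe = importlib.import_module("markupsafe")
    except ImportError:
        return False  # markupsafe is not installed at all
    return hasattr(markupsafe, "soft_unicode")


if __name__ == "__main__":
    print("soft_unicode available:", markupsafe_has_soft_unicode())
```

If this prints False in an environment where markupsafe is installed, the installed version is likely too new for the dependencies that still import `soft_unicode`.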
This article is published only on cnblogs (博客园) and tonglin0325's blog. Author: tonglin0325. When reposting, please cite the original link: https://www.cnblogs.com/tonglin0325/p/5428513.html
