[Airflow] 02 - Building a Machine Learning Pipeline with Apache Airflow
Blog post: 10 Minutes to Building a Machine Learning Pipeline with Apache Airflow, Jun 21, 2020
A computer-vision-related article.

一、Defining the DAG (Directed Acyclic Graph)
- Task start time
Airflow triggers the first DAG run one schedule_interval after start_date. This means that in our case, we start at (31 days ago) + (30-day interval) = yesterday.
# Imports used by this DAG file
from datetime import timedelta

from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.utils.dates import days_ago

# These args will get passed on to each operator
# You can override them on a per-task basis during operator initialization
default_args = {
    'owner': 'Binh Phan',
    'depends_on_past': False,
    'start_date': days_ago(31),
    'email': ['example@example.com'],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=2),
    # 'queue': 'bash_queue',
    # 'pool': 'backfill',
    # 'priority_weight': 10,
    # 'end_date': datetime(2016, 1, 1),
    # 'wait_for_downstream': False,
    # 'dag': dag,
    # 'sla': timedelta(hours=2),
    # 'execution_timeout': timedelta(seconds=300),
    # 'on_failure_callback': some_function,
    # 'on_success_callback': some_other_function,
    # 'on_retry_callback': another_function,
    # 'sla_miss_callback': yet_another_function,
    # 'trigger_rule': 'all_success'
}

# Instantiate a directed acyclic graph
dag = DAG(
    'ml_pipeline',
    default_args=default_args,
    description='A simple Machine Learning pipeline',
    schedule_interval=timedelta(days=30),
)
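To make the (31 days ago) + (30-day interval) arithmetic concrete, here is a minimal sanity-check sketch in plain Python, using datetime directly rather than Airflow's scheduler:

from datetime import datetime, timedelta

# Back-of-the-envelope check of the scheduling described above:
# start_date = days_ago(31), schedule_interval = 30 days, and the first run
# fires at the end of the first interval, i.e. start_date + 30 days.
today = datetime.utcnow().date()
start_date = today - timedelta(days=31)
first_run_date = start_date + timedelta(days=30)
print(first_run_date == today - timedelta(days=1))  # True -> the first run lands on "yesterday"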
- Tasks & Operators
Tasks are defined using Operators, which execute code.
Here, we use BashOperator to run the Python files for our tasks, located in scripts/.
For example, in our download_images task, where we download images from Google Images, the BashOperator calls python3 /usr/local/airflow/scripts/download_images.py.
# Instantiate tasks using Operators.
# BashOperator defines tasks that execute bash commands; here each task runs a Python script.
download_images = BashOperator(
    task_id='download_images',
    bash_command='python3 /usr/local/airflow/scripts/download_images.py',
    dag=dag,
)

train = BashOperator(
    task_id='train',
    depends_on_past=False,
    bash_command='python3 /usr/local/airflow/scripts/train.py',
    retries=3,
    dag=dag,
)

# Kill anything already listening on port 8008, then start the model server.
serve_commands = """
lsof -i tcp:8008 | awk 'NR!=1 {print $2}' | xargs kill;
python3 /usr/local/airflow/scripts/serve.py serve
"""
serve = BashOperator(
    task_id='serve',
    depends_on_past=False,
    bash_command=serve_commands,
    retries=3,
    dag=dag,
)

# Set the ordering of the DAG. The >> operator directs the second task to run after the first,
# so download_images runs first, then train, then serve.
download_images >> train >> serve
二、Deploying the Service with Docker
- Some Dockerfile instructions
The LABEL instruction adds metadata to an image.
https://www.centos.bz/2016/12/dockerfile-label-instrunction/
# VERSION 1.10.10
# AUTHOR: Binh Phan
# DESCRIPTION: greenr-airflow container
# BUILD: docker build --rm -t btphan95/greenr-airflow .
# SOURCE: https://github.com/btphan95/greenr-airflow

FROM python:3.6-slim-stretch
LABEL maintainer="Binh_"

ENV DEBIAN_FRONTEND noninteractive
ENV TERM linux

# Airflow variables
ARG AIRFLOW_VERSION=1.10.10
ARG AIRFLOW_USER_HOME=/usr/local/airflow
ARG AIRFLOW_DEPS=""
ARG PYTHON_DEPS=""
ENV AIRFLOW_HOME=${AIRFLOW_USER_HOME}

# Define en_US
ENV LANGUAGE en_US.UTF-8
ENV LANG en_US.UTF-8
ENV LC_ALL en_US.UTF-8
ENV LC_CTYPE en_US.UTF-8
ENV LC_MESSAGES en_US.UTF-8

RUN set -ex \
    && buildDeps=' \
        freetds-dev \
        libkrb5-dev \
        libsasl2-dev \
        libssl-dev \
        libffi-dev \
        libpq-dev \
        git \
    ' \
    && apt-get update -yqq \
    && apt-get upgrade -yqq \
    && apt-get install -yqq --no-install-recommends \
        $buildDeps \
        freetds-bin \
        build-essential \
        default-libmysqlclient-dev \
        apt-utils \
        curl \
        rsync \
        netcat \
        locales \
    && sed -i 's/^# en_US.UTF-8 UTF-8$/en_US.UTF-8 UTF-8/g' /etc/locale.gen \
    && locale-gen \
    && update-locale LANG=en_US.UTF-8 LC_ALL=en_US.UTF-8 \
    && useradd -ms /bin/bash -d ${AIRFLOW_USER_HOME} airflow \
    && pip install -U pip setuptools wheel \
    && pip install pytz \
    && pip install pyOpenSSL \
    && pip install ndg-httpsclient \
    && pip install pyasn1 \
    && pip install apache-airflow[crypto,celery,postgres,hive,jdbc,mysql,ssh${AIRFLOW_DEPS:+,}${AIRFLOW_DEPS}]==${AIRFLOW_VERSION} \
    && pip install 'redis==3.2' \
    && if [ -n "${PYTHON_DEPS}" ]; then pip install ${PYTHON_DEPS}; fi \
    && apt-get purge --auto-remove -yqq $buildDeps \
    && apt-get autoremove -yqq --purge \
    && apt-get clean \
    && rm -rf \
        /var/lib/apt/lists/* \
        /tmp/* \
        /var/tmp/* \
        /usr/share/man \
        /usr/share/doc \
        /usr/share/doc-base \
    && apt-get update \
    && apt-get upgrade -y \
    && apt-get install -y git

COPY config/airflow.cfg ${AIRFLOW_USER_HOME}/airflow.cfg
COPY requirements.txt requirements.txt
RUN pip install git+https://github.com/fastai/fastai.git
RUN pip install -r requirements.txt

COPY dags ${AIRFLOW_USER_HOME}/dags
COPY data ${AIRFLOW_USER_HOME}/data
COPY scripts ${AIRFLOW_USER_HOME}/scripts

# Install required libraries
EXPOSE 8008 8080

# USER airflow
# RUN chown -R airflow:airflow ${AIRFLOW_USER_HOME}

WORKDIR ${AIRFLOW_USER_HOME}
COPY scripts/entrypoint.sh /entrypoint.sh
ENTRYPOINT ["/entrypoint.sh"]
CMD ["webserver"]
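As the header comment notes, the image is built with docker build --rm -t btphan95/greenr-airflow .. When running the container, the two exposed ports would presumably be published as well, e.g. something like docker run -p 8080:8080 -p 8008:8008 btphan95/greenr-airflow, with 8080 serving the Airflow webserver and 8008 the model-serving endpoint started by serve.py.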
Not much more to add beyond that~
