[Airflow] 02 - Building a Machine Learning Pipeline with Apache Airflow

Blog post: 10 Minutes to Building a Machine Learning Pipeline with Apache Airflow

Jun 21, 2020

A computer-vision example: the pipeline downloads images, trains a classifier, then serves it.


一、Defining the DAG (Directed Acyclic Graph)

  • Task start time

Airflow triggers the first DAG run only after one full schedule_interval has elapsed past start_date. In our case, that means the first run fires at (31 days ago) + (30-day interval) = yesterday.
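A quick sanity check of that arithmetic in plain Python (illustrative only; Airflow's days_ago helper computes the start date in roughly the same way):

from datetime import datetime, timedelta

# start_date = days_ago(31) and schedule_interval = timedelta(days=30):
# Airflow triggers a run once a full interval has elapsed after start_date,
# so the first run fires at (31 days ago) + 30 days = yesterday.
start = datetime.utcnow() - timedelta(days=31)   # roughly days_ago(31)
first_run = start + timedelta(days=30)           # one schedule_interval later
print(first_run.date())                          # ~ yesterday's date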

# Imports this DAG file needs (Airflow 1.x APIs, matching the 1.10.10 image built below)
from datetime import timedelta

from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.utils.dates import days_ago

# These args will get passed on to each operator
# You can override them on a per-task basis during operator initialization
default_args = {
    'owner': 'Binh Phan',
    'depends_on_past': False,
    'start_date': days_ago(31),
    'email': ['example@example.com'],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=2),
    # 'queue': 'bash_queue',
    # 'pool': 'backfill',
    # 'priority_weight': 10,
    # 'end_date': datetime(2016, 1, 1),
    # 'wait_for_downstream': False,
    # 'dag': dag,
    # 'sla': timedelta(hours=2),
    # 'execution_timeout': timedelta(seconds=300),
    # 'on_failure_callback': some_function,
    # 'on_success_callback': some_other_function,
    # 'on_retry_callback': another_function,
    # 'sla_miss_callback': yet_another_function,
    # 'trigger_rule': 'all_success'
}

# Instantiate the directed acyclic graph (DAG)
dag = DAG(
    'ml_pipeline',
    default_args=default_args,
    description='A simple Machine Learning pipeline',
    schedule_interval=timedelta(days=30),
)


  • Tasks & Operators

Tasks are defined using Operators, which execute code. 

Here, we use BashOperator to run the Python files for our tasks, located in scripts/. 

For example, in our download_images task, where we download images from Google Images, the BashOperator calls python3 /usr/local/airflow/scripts/download_images.py.
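The post doesn't reproduce download_images.py itself; a minimal, hypothetical stand-in (the URL list and output directory below are made-up placeholders, not the real script's logic) could look like:

# scripts/download_images.py -- hypothetical sketch; the real script
# downloads its training images from Google Images instead.
import os
import requests

IMAGE_URLS = [  # placeholder URLs for illustration only
    "https://example.com/images/001.jpg",
    "https://example.com/images/002.jpg",
]
OUT_DIR = "/usr/local/airflow/data"  # assumed to match the data dir in the Dockerfile

if __name__ == "__main__":
    os.makedirs(OUT_DIR, exist_ok=True)
    for i, url in enumerate(IMAGE_URLS):
        resp = requests.get(url, timeout=30)
        resp.raise_for_status()
        with open(os.path.join(OUT_DIR, f"{i:04d}.jpg"), "wb") as f:
            f.write(resp.content)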

# Instantiate tasks using Operators.
# BashOperator defines tasks that execute bash commands; here each task runs a Python script.
download_images = BashOperator(
    task_id         = 'download_images',
    bash_command    = 'python3 /usr/local/airflow/scripts/download_images.py',
    dag             = dag,
)
train = BashOperator(
    task_id         = 'train',
    depends_on_past = False,
    bash_command    = 'python3 /usr/local/airflow/scripts/train.py',
    retries         = 3,
    dag             = dag,
)

serve_commands = """
lsof -i tcp:8008 | awk 'NR!=1 {print $2}' | xargs kill;
python3 /usr/local/airflow/scripts/serve.py serve
"""
serve = BashOperator(
    task_id         = 'serve',
    depends_on_past = False,
    bash_command    = serve_commands,
    retries         = 3,
    dag             = dag,
)

# Set the ordering of the DAG. The >> operator makes the right-hand task run
# after the left-hand one: download_images runs first, then train, then serve.
download_images >> train >> serve


二、Deploying the Service with Docker

  • Some Dockerfile instructions

The LABEL instruction adds key/value metadata to an image (the Dockerfile below uses it to record the maintainer):

https://www.centos.bz/2016/12/dockerfile-label-instrunction/


# VERSION 1.10.10
# AUTHOR: Binh Phan
# DESCRIPTION: greenr-airflow container
# BUILD: docker build --rm -t btphan95/greenr-airflow .
# SOURCE: https://github.com/btphan95/greenr-airflow

FROM python:3.6-slim-stretch

LABEL maintainer="Binh_"

ENV DEBIAN_FRONTEND noninteractive
ENV TERM linux

# Airflow variables
ARG AIRFLOW_VERSION=1.10.10
ARG AIRFLOW_USER_HOME=/usr/local/airflow
ARG AIRFLOW_DEPS=""
ARG PYTHON_DEPS=""
ENV AIRFLOW_HOME=${AIRFLOW_USER_HOME}

# Define en_US
ENV LANGUAGE en_US.UTF-8
ENV LANG en_US.UTF-8
ENV LC_ALL en_US.UTF-8
ENV LC_CTYPE en_US.UTF-8
ENV LC_MESSAGES en_US.UTF-8

RUN set -ex \
    && buildDeps=' \
        freetds-dev \
        libkrb5-dev \
        libsasl2-dev \
        libssl-dev \
        libffi-dev \
        libpq-dev \
        git \
    ' \
    && apt-get update -yqq \
    && apt-get upgrade -yqq \
    && apt-get install -yqq --no-install-recommends \
        $buildDeps \
        freetds-bin \
        build-essential \
        default-libmysqlclient-dev \
        apt-utils \
        curl \
        rsync \
        netcat \
        locales \
    && sed -i 's/^# en_US.UTF-8 UTF-8$/en_US.UTF-8 UTF-8/g' /etc/locale.gen \
    && locale-gen \
    && update-locale LANG=en_US.UTF-8 LC_ALL=en_US.UTF-8 \
    && useradd -ms /bin/bash -d ${AIRFLOW_USER_HOME} airflow \
    && pip install -U pip setuptools wheel \
    && pip install pytz \
    && pip install pyOpenSSL \
    && pip install ndg-httpsclient \
    && pip install pyasn1 \
    && pip install apache-airflow[crypto,celery,postgres,hive,jdbc,mysql,ssh${AIRFLOW_DEPS:+,}${AIRFLOW_DEPS}]==${AIRFLOW_VERSION} \
    && pip install 'redis==3.2' \
    && if [ -n "${PYTHON_DEPS}" ]; then pip install ${PYTHON_DEPS}; fi \
    && apt-get purge --auto-remove -yqq $buildDeps \
    && apt-get autoremove -yqq --purge \
    && apt-get clean \
    && rm -rf \
        /var/lib/apt/lists/* \
        /tmp/* \
        /var/tmp/* \
        /usr/share/man \
        /usr/share/doc \
        /usr/share/doc-base \
    && apt-get update \
    && apt-get upgrade -y \
    && apt-get install -y git

COPY config/airflow.cfg ${AIRFLOW_USER_HOME}/airflow.cfg
# Install required libraries
COPY requirements.txt requirements.txt
RUN pip install git+https://github.com/fastai/fastai.git
RUN pip install -r requirements.txt
COPY dags ${AIRFLOW_USER_HOME}/dags
COPY data ${AIRFLOW_USER_HOME}/data
COPY scripts ${AIRFLOW_USER_HOME}/scripts
EXPOSE 8008 8080
#USER airflow
#RUN chown -R airflow:airflow ${AIRFLOW_USER_HOME}
WORKDIR ${AIRFLOW_USER_HOME}
COPY scripts/entrypoint.sh /entrypoint.sh
ENTRYPOINT ["/entrypoint.sh"]
CMD ["webserver"]


Not much more to add here~


/* implement */

