NOTE: We recently gave an Airflow at WePay talk to the Bay Area Airflow meetup group. The video and slides are both available.

Our last post provided an overview of WePay’s data warehouse. In this post, we’ll be diving into how we run Airflow as part of the ETL pipeline.

Introduction

Airflow is a workflow scheduler. It lets you define a series of tasks (chunks of code, queries, etc) that can be strung together into a DAG (directed acyclic graph) by having the tasks depend on one another. It also allows you to define how frequently the DAG should be run: once a minute, once an hour, every 20 minutes, etc. It’s kind of like Cron. The difference is that Airflow has a lot of extra features that are really useful when doing data processing (or running workflows in general). Some of the out-of-the-box goodies include:

  • A really great UI to see what’s running and what’s failed
  • Email alerts when something breaks
  • The ability to automatically retry when a task fails
  • Integrations with a lot of infrastructure (Hive, Presto, Druid, AWS, Google cloud, etc)
  • Distributed execution, so tasks can run on different machines as you scale
  • SLAs to define how long tasks and DAGs should take to run
  • Data sensors to trigger a DAG when data arrives
  • Backfilling to do one-time loads of old data
  • XCom to communicate between tasks

The project was originally open sourced last year, and has gathered quite a lot of momentum. WePay has been contributing for the past six months, and we currently have a committer/PPMC member on the project (me!). Our main contribution has been to integrate Airflow with the Google Cloud Platform.

Google cloud integration

We’ve written Google cloud hooks for Airflow, including hooks for BigQuery and Google Cloud Storage.

These hooks, combined with a myriad of operators, allow you to interact with Google cloud products in different ways.

BigQuery integration supports querying BigQuery, copying data between tables, loading data in and out of BigQuery from Google Cloud Storage, and doing data quality checks. Google Cloud Storage integration also includes uploading and downloading files, as well as loading data from MySQL into Google Cloud Storage.
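
For example, a typical load wires a few of these operators together: dump a table from MySQL into Google Cloud Storage, then load those files into BigQuery. The sketch below shows the general shape; the table, bucket, and connection names are made up, and the exact module paths and parameters depend on your Airflow version.

from datetime import datetime

from airflow import DAG
from airflow.contrib.operators.mysql_to_gcs import MySqlToGoogleCloudStorageOperator
from airflow.contrib.operators.gcs_to_bq import GoogleCloudStorageToBigQueryOperator

dag = DAG('example_mysql_to_bq', start_date=datetime(2016, 1, 1), schedule_interval='@daily')

# Dump a MySQL table into newline-delimited JSON files in Google Cloud Storage.
extract = MySqlToGoogleCloudStorageOperator(
    task_id='extract_some_table',
    sql='SELECT * FROM some_table',
    bucket='example-bucket',
    filename='some_table/{{ ds }}/part-{}.json',
    schema_filename='some_table/{{ ds }}/schema.json',
    mysql_conn_id='mysql_default',
    google_cloud_storage_conn_id='google_cloud_storage_default',
    dag=dag)

# Load the exported files from Google Cloud Storage into a BigQuery table.
load = GoogleCloudStorageToBigQueryOperator(
    task_id='load_some_table',
    bucket='example-bucket',
    source_objects=['some_table/{{ ds }}/part-*.json'],
    destination_project_dataset_table='staging.some_table',
    schema_object='some_table/{{ ds }}/schema.json',
    source_format='NEWLINE_DELIMITED_JSON',
    write_disposition='WRITE_TRUNCATE',
    google_cloud_storage_conn_id='google_cloud_storage_default',
    bigquery_conn_id='bigquery_default',
    dag=dag)

load.set_upstream(extract)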

Running Airflow

WePay runs more than 7,000 DAGs (workflows) and 17,000 tasks per day through Airflow. We have split Airflow into three environments: ETL, prod, and dev. Each of these environments runs its own Airflow web server, scheduler, and database. The web server and scheduler run on a single 32 core (n1-highcpu-32) machine in each environment, and the database is hosted in Google Cloud SQL.

We keep a clone of the Apache incubator GitHub repository in WePay’s own GitHub enterprise repository. This allows us to run off of the latest official Airflow release but cherry-pick patches off of master that haven’t yet been released in an official build. (Note: we never cherry-pick anything except commits that have already been merged to Airflow’s master).

Installation

Airflow is installed straight off of our GitHub repository using pip:

pip install git+https://git@our-wepay-repo.com/DataInfra/airflow.git@1.7.1.2-wepay#egg=airflow[gcp_api,mysql,crypto]==1.7.1.2+wepay5

There are several things worth noting here. The @1.7.1.2-wepay part of the URI defines the branch in our repository to install. We cut a branch for every Airflow release; this is where we cherry-pick patches off of master. Every time we do a cherry-pick, we increment our custom version in setup.py (1.7.1.2+wepay5, in this example). We try to stick to PEP 440, and use local version identifiers for our internal releases. The strange brackets at the end are a rarely used feature of setup.py: extras_require. Airflow uses this feature to define dependencies for various features: gcp_api for Google cloud, mysql for MySQL, and crypto for cryptography. Pip installs these extras when given the bracket notation.
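
If you haven’t seen extras_require before, it’s declared in setup.py roughly like this (a trimmed-down sketch with illustrative package names, not Airflow’s actual dependency lists):

# setup.py (simplified sketch; not Airflow's real dependency lists)
from setuptools import setup

setup(
    name='airflow',
    version='1.7.1.2+wepay5',  # bumped on every cherry-pick
    install_requires=['flask', 'sqlalchemy'],
    extras_require={
        'gcp_api': ['google-api-python-client'],
        'mysql': ['mysqlclient'],
        'crypto': ['cryptography'],
    },
)

# Installing with "pip install airflow[gcp_api,mysql,crypto]" then pulls in
# the extra dependency groups named inside the brackets.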

Supervisor

We use supervisor to run both the web server and the scheduler. Our supervisor config is located here.
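
For reference, the config amounts to two supervisord program blocks along these lines (paths and options are illustrative):

[program:airflow-webserver]
command=/usr/local/bin/airflow webserver
user=airflow
environment=AIRFLOW_HOME="/etc/airflow"
autostart=true
autorestart=true

[program:airflow-scheduler]
; --num-runs makes the scheduler restart itself periodically (see Lessons below)
command=/usr/local/bin/airflow scheduler --num-runs 10
user=airflow
environment=AIRFLOW_HOME="/etc/airflow"
autostart=true
autorestart=true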

AIRFLOW_HOME

Airflow requires you to set an AIRFLOW_HOME environment variable. This variable defines where Airflow looks for its airflow.cfg file. A snippet of our airflow.cfg is located here. We keep the airflow.cfg file in /etc/airflow, and our DAGs folder lives there as well (/etc/airflow/dags). Logs go into /var/log/airflow.
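
Concretely, that layout boils down to one environment variable and two settings in airflow.cfg (a sketch):

export AIRFLOW_HOME=/etc/airflow

# /etc/airflow/airflow.cfg (excerpt)
[core]
dags_folder = /etc/airflow/dags
base_log_folder = /var/log/airflow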

Testing

We have a single monolithic GitHub repository called airflow-dags. The structure for this repository is as follows:

./dags/
./dags/dev/
./dags/etl/
./dags/prod/
./tests/
./udfs/
./weflow/
./run-tests.sh

DAGs in the ./dags/dev/ folder will be deployed to the dev environment, DAGs in ./dags/prod/ to the prod environment, and so on. The UDFs folder contains all of our BigQuery javascript UDFs (more on this in a future post). Weflow is where we keep our WePay-specific hooks, operators, and utilities. These are generally things that aren’t useful to the community (or we’d contribute them back to the Apache repository).

When a pull request is sent to the repository, our TeamCity CI system kicks off a test run using the run-tests.sh script. Our DAG tests mostly fall into two categories: tests for weflow utility methods, and smoke tests for DAGs. The DAG smoke tests (available here) check the following:

  • Test that the scheduler can import the DAG without a failure
  • Check that the owner of every task is a known engineering team at WePay
  • Check that the email of every task is set to a known engineering team’s email address
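
A stripped-down sketch of what those smoke tests look like (the team names here are hypothetical; the real checks live in the repository linked above):

from airflow.models import DagBag

KNOWN_TEAMS = {'data-infra', 'payments'}  # hypothetical team names
KNOWN_EMAILS = {'data-infra@example.com', 'payments@example.com'}

def test_dags():
    # The scheduler must be able to import every DAG file without errors.
    dagbag = DagBag(dag_folder='dags/', include_examples=False)
    assert not dagbag.import_errors

    for dag in dagbag.dags.values():
        for task in dag.tasks:
            # Every task is owned by a known engineering team...
            assert task.owner in KNOWN_TEAMS
            # ...and alerts a known engineering team's email address.
            emails = task.email if isinstance(task.email, list) else [task.email]
            assert set(emails) <= KNOWN_EMAILS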

We also run flake8 on the entire repository. These steps give us much higher confidence that the DAGs will work when deployed.

DAGs are treated more as configuration than code, and we limit the types of operators that we use. We do very little testing on the actual logic of the DAGs. This is usually done in the dev environment by the DAG owners.

Deployment

This is currently our weakest link and the area that we (and Airflow) need to improve on the most. Airflow relies on all DAGs appearing in the same DAG folder (/etc/airflow/dags in our installation). We simply have a Cron job (ironically) that refreshes the DAGs folder every two minutes. We originally ran this as a DAG, but having a DAG that runs every two minutes seemed a bit wasteful (it added a lot of rows to the database).

We have multiple folders in the repo (dev, prod, etl), so the code is checked out in one place, and symlinks are set in the appropriate folders inside /etc/airflow/dags:

/etc/airflow/dags/udfs -> /home/airflow/airflow-dags/udfs
/etc/airflow/dags/weflow -> /home/airflow/airflow-dags/weflow
/etc/airflow/dags/dags/etl -> /home/airflow/airflow-dags/dags/etl

The doubled ‘dags’ in the last line is there so the layout mirrors our airflow-dags repository and weflow imports work correctly.
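
The refresh job itself is nothing fancy; assuming a plain git checkout, it’s roughly a crontab entry like:

# crontab for the airflow user: refresh the DAGs checkout every two minutes
*/2 * * * * cd /home/airflow/airflow-dags && git pull --quiet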

Our goal is to move this to a .zip or Docker-based deployment model, where DAGs can be deployed and versioned independently of one another. We plan to cover this in a future post when it’s been implemented.

Authentication

We run Airflow with the LDAP contrib module enabled. This allows us to restrict access to the Airflow UI to only those who need it. We also use the LDAP module to do some basic authorization checks, which prevent our users from getting access to the “Admin” and “Data Profiler” tabs.
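
The LDAP backend and the filters that gate those tabs are configured in airflow.cfg; a sketch with illustrative values:

[webserver]
authenticate = True
auth_backend = airflow.contrib.auth.backends.ldap_auth

[ldap]
uri = ldaps://ldap.example.com
user_filter = objectClass=*
user_name_attr = uid
bind_user = cn=airflow,dc=example,dc=com
bind_password = insecure-example-password
basedn = dc=example,dc=com
# Only members of these groups see the "Admin" and "Data Profiler" tabs.
superuser_filter = memberOf=CN=airflow-admins,OU=groups,DC=example,DC=com
data_profiler_filter = memberOf=CN=airflow-profilers,OU=groups,DC=example,DC=com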

Connections

We use the Google Cloud Storage and Google BigQuery hooks described above. The connections are configured as follows:

{
    "extra__google_cloud_platform__scope": "https://www.googleapis.com/auth/bigquery,https://www.googleapis.com/auth/datastore,https://www.googleapis.com/auth/userinfo.email,https://www.googleapis.com/auth/devstorage.read_write",
    "extra__google_cloud_platform__project": "our-project-1234",
    "extra__google_cloud_platform__key_path": "/etc/airflow/certs/some-service-account-cert.json",
}

There is one connection defined for each team (gcp_team1, gcp_team2, etc). Each connection uses a different key_path (cert), and thus a different service account. This allows us to have per-team permissioning. This can also be expanded in the future to have per-DAG service accounts, and per-DAG permissions, though we haven’t had the need for this yet.
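
In DAG code, a task just references its team’s connection ID, so no credentials ever appear in the DAG itself. A minimal sketch (the query and names are hypothetical):

from datetime import datetime

from airflow import DAG
from airflow.contrib.operators.bigquery_operator import BigQueryOperator

dag = DAG('team1_example', start_date=datetime(2016, 1, 1), schedule_interval='@daily')

# Runs under team1's service account because it uses team1's connection.
team1_rollup = BigQueryOperator(
    task_id='team1_daily_rollup',
    bql='SELECT COUNT(*) FROM [our-project-1234:some_dataset.some_table]',
    destination_dataset_table='some_dataset.some_rollup',
    write_disposition='WRITE_TRUNCATE',
    bigquery_conn_id='gcp_team1',  # per-team connection -> per-team permissions
    dag=dag)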

Dynamic DAG generation

Now, we get to the meat of the ETL pipeline: our ETL DAGs! We have only a single config-driven ETL DAG file. It dynamically generates over 200 DAGs. The configuration to load a table looks like so:

'some_table': {
    'select': {
        'columns': [
            'id',
            'some_col',
            'secret_data',
            'create_time',
            'modify_time',
            'version',
        ],
        'unique_id': ['id'],
        'partition': {
            'range': {
                '1d': {
                    'start': datetime(2012, 7, 25),
                },
                '15m': {
                    'lookback': 900,
                },
            },
            'order': {
                'col': 'modify_time',
                'type': 'utc_seconds',
            },
        },
    },
    'views': {
        'full': {
            'dedup': [{'col': 'modify_time', 'sorted': 'DESC'},
                      {'col': 'version', 'sorted': 'DESC'}]
        },
        'clean': {
            'blacklist': ['secret_data'],
        },
    },
    'verify': {
        'count': {
            'col': 'create_time',
            'lookback': 3600,
        },
        'full': {
            'col': 'id',
            'chunk': 2500000,
        },
    },
},

There’s a lot to unpack here. The most important part is the select block. This defines which columns we pull from MySQL and load into BigQuery:

        'columns': [
            'id',
            'some_col',
            'secret_data',
            'create_time',
            'modify_time',
            'version',
        ],

It also defines how we load the data: incrementally, or fully. Incremental loads only transfer a chunk of data each execution. Full loads transfer (and overwrite) the entire table on each execution.

        'partition': {
            'range': {
                '1d': {
                    'start': datetime(2012, 7, 25),
                },
                '15m': {
                    'lookback': 900,
                },
            },
            'order': {
                'col': 'modify_time',
                'type': 'utc_seconds',
            },
        },

The example above is an incremental load that runs every 15 minutes, and also daily. The daily load overwrites the entire day’s worth of data. The start date (datetime(2012, 7, 25) in this example) defines how far back Airflow should go when it starts loading data (i.e. when the table was first created). The 15 minute load runs every 15 minutes and loads the last 30 minutes of data (its 15 minute window plus the 900 second lookback).

But wait, if we’re loading the last 30 minutes of data every 15 minutes, there will be duplicates! This is correct. Due to database commit ordering (serializability), timestamp issues, etc, it’s not a good idea to simply use a timestamp as a monotonically increasing ID. If we only pulled 15 minutes of data every 15 minutes, we might drop some rows. But this means that we need to de-duplicate data. This is why we have the ‘views’ block.

    'views': {
        'full': {
            'dedup': [{'col': 'modify_time', 'sorted': 'DESC'},
                      {'col': 'version', 'sorted': 'DESC'}]
        },
        'clean': {
            'blacklist': ['secret_data'],
        },
    },

This block creates a view that de-duplicates the data when it’s queried. It also creates a ‘clean’ view that has any sensitive columns removed (we don’t do row-level masking currently). Another approach would be to de-duplicate at write time rather than at query time, but we’ll go into detail on these design choices in our subsequent post on Google Cloud Storage and BigQuery.
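
As for the generation itself: the ETL DAG file simply walks the config dictionary and registers one DAG per table and load interval at module scope, which is how the scheduler discovers them. A heavily simplified sketch (the real builder creates the extract, load, verify, and view tasks described above):

from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator

# Trimmed-down config; the real dict has one entry per table with the full
# select/views/verify blocks shown above.
TABLES = {
    'some_table': {
        'select': {
            'partition': {
                'range': {
                    '1d': {'start': datetime(2012, 7, 25)},
                    '15m': {'lookback': 900},
                },
            },
        },
    },
}

SCHEDULES = {'1d': timedelta(days=1), '15m': timedelta(minutes=15)}

def build_dag(table_name, interval, table_config):
    """Build one DAG for one table at one load interval."""
    partition = table_config['select']['partition']['range'][interval]
    dag = DAG(
        dag_id='etl_{0}_{1}'.format(table_name, interval),
        # Fall back to an arbitrary recent date when no explicit start is set.
        start_date=partition.get('start', datetime(2016, 1, 1)),
        schedule_interval=SCHEDULES[interval])
    # Placeholder for the real extract/load/verify tasks built from the config.
    DummyOperator(task_id='extract_load_verify', dag=dag)
    return dag

# Register every generated DAG at module scope so the scheduler picks it up.
for table_name, table_config in TABLES.items():
    for interval in table_config['select']['partition']['range']:
        dag = build_dag(table_name, interval, table_config)
        globals()[dag.dag_id] = dag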

Metrics, monitoring, and logs

We use Icinga to monitor our Airflow machines (although we’re looking at moving to Sensu). This alerts us to any system-level issues. We don’t currently collect any metrics from Airflow directly (though we’ll be using StatsD and Grafana for that); we’ve been able to get away with just DAG-level alerting when a DAG fails. We run all of our DAGs with retries=2, and all DAGs are idempotent, so retrying won’t cause any harm.

DAG logs are stored in Google Cloud Storage using the remote_base_log_folder and remote_log_conn_id settings in airflow.cfg. We’ve also been experimenting with ELK, but haven’t rolled it out widely yet.
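
Both the remote log settings and the planned StatsD settings are just airflow.cfg entries; roughly (bucket and connection names are illustrative):

[core]
remote_base_log_folder = gs://our-airflow-logs
remote_log_conn_id = gcp_logging

[scheduler]
statsd_on = True
statsd_host = localhost
statsd_port = 8125
statsd_prefix = airflow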

Local development

One of the cool things about our Google cloud integration with Airflow is that you can use Airflow with Google cloud’s OAuth 2 authentication to develop and run DAGs locally without having to deploy your DAGs to a remote Airflow machine. This is really useful when iterating on a new DAG, or debugging something that’s broken.

Setting this up is easy:

  1. Install gcloud
  2. Run gcloud auth login
  3. Start Airflow
  4. Add a Google Cloud Platform connection with just a project_id set to your Google cloud project.

And that’s it. You can now write DAGs locally that, when executed, will directly run against real Google cloud projects with real data. We typically use our ‘developer analytics’ cluster as the backing Google cloud project. This dramatically accelerates how quickly we can write Airflow DAGs, and increases the likelihood that they’ll work properly when deployed on a real Airflow machine.

Lessons

During the last few months of operations, we’ve made some discoveries that are worth noting:

  • Prefer many smaller DAGs over fewer larger ones. Our task-to-DAG ratio is pretty small (~2.5 tasks per DAG). Splitting DAGs as granularly as possible makes it easier to turn different task groups on and off, and also means we don’t need to use backfill (we just set start_date to when the table was created in the source DB, and the scheduler fills in the rest). Having smaller, more granular tasks also makes it easier for us to clear or reset state without affecting other tasks when things fail.
  • Exotic features are for the adventurous. We have shied away from using some of Airflow’s more exotic features, such as backfill, subdags, and external task sensors (we do use pools and XCom). When we started using Airflow, it was a pretty young project, and it had a lot of bugs in these areas. There has been a lot of great work to fix things recently, so this guidance might be outdated.
  • Restart everything when deploying DAG changes. This is also something that’s improving, but the Airflow web server is stateful right now. The scheduler also appears to need restarting when running with LocalExecutor (or run it with --num-runs so that it restarts itself periodically). This is not ideal, but it’s a known issue with Airflow at the moment.
  • Don’t use CeleryExecutor unless you have to. Airflow is actually pretty good at scaling up, rather than out. You can get away with running Airflow on a single machine for quite a while. We still run everything with the LocalExecutor, which simplifies operations. It’s also possible to run multiple LocalExecutors, rather than the CeleryExecutor, if you want to run on multiple machines, though we don’t run this way. One advantage that Celery does provide, however, is the ability to isolate tasks/DAGs on separate machines based on queues. This means that you can run a specific machine for a specific team with very custom/exotic packages or hardware, and have all of their DAGs get routed to it. This option is something you don’t have with LocalExecutor.
  • Don’t run commands locally in the DAG. Airflow ships with PythonOperator and BashOperator. They add a lot of flexibility to the types of DAGs you can write, since they allow you to run arbitrary code. Unfortunately, this arbitrary code runs locally on the Airflow worker machine. It’s run as a shared user account, without any process isolation. The engineers running the worker machine must also install all packages that all BashOperators and PythonOperators use (scipy, numpy, nltk, OS-level packages, etc). This can be an operational headache. DAGs might demand a ton of bizarre packages, a lot of resources on the machine, or overwrite each other’s files. At WePay, we limit which operators and packages can be used. This is something that Airflow plans to address (there have been discussions of DAG execution within Docker containers), but it hasn’t been fixed yet.