The Secrets of Orchestration in the Cloud

Confessions of an Airflow User

Mistakes made and lessons learnt in a year of using Airflow

Matthew Grey
Published in Cognizant Servian
9 min read · Jan 9, 2020

[Image: a sign reading “Next in line for Confession”. The church of Airflow forgives all transgressions (picture from Unsplash).]

Airflow, Airflow, Airflow… how I love and hate thee. The siren calls of scale and flexibility tempt me, even as I have been burned by my trust in you. As Airflow projects of the future loom, I am reminded constantly of the past. I hear the bellows of my projects booming in the wind. They call to me. Bound in irons. Doomed for eternity. They clasp at me and beg, asking:

“Why did you think a dynamic mega-DAG was a good idea?”

I dismiss them as the nags of the past. I cannot give them the attention that they demand. I must focus on the future.

Endeavours with Airflow require patience, research, wisdom. I read blog posts warning of anti-patterns and thought I knew better. So I dismissed them.

I was a man of technology! I understood the risks of wielding my tools! The lessons of the blogs mattered naught. They fell on deaf ears.

And so I began my work.

From the beginning it was clear that Airflow was a strange beast. I created DAGs, operators, and plugins, and began to consider myself an intermediate user. However, I was just digging myself deeper and deeper into my Airflow hole.

My only way forward was to learn from my mistakes. Rethink Airflow. Share my findings.

In writing this blog post, my mind returns to the Airflow projects that I have contributed to, and sometimes even led. I stare into a maelstrom of mistakes that I have made — all so obvious in retrospect. I am one of the unfortunate many who have battled Airflow and lost.

I have committed many sins against data engineering. It is through introspection on these transgressions that I may share my knowledge with you. I do this in the hope of preventing future crimes against Airflow. So! Cast off your false assumptions of Airflow, for it is here and now that we shall discover how to avoid falling victim to its many pitfalls.

Sins of the past

Complexity is where most of my Airflow transgressions began. The more complex the Airflow instance, the more could go wrong. And when things did go wrong, they were harder to fix: I wasn’t just fixing my code, I was fixing my code inside Airflow. There is no greater relief than moving from a complex Airflow environment to a lean one.

[Image: a circuit board. Complexity is bad; debugging complex systems is worse.]

ETL logic embedded in custom operators

Embedding ETL logic in Airflow is something I’ve seen too many times. Airflow supports custom Python operators, allowing you to embed any kind of logic you choose into your DAGs. So why not empower Airflow with ETL logic, taking it from a simple orchestrator of tasks to a fully fledged data ingestion pipeline? Well, most of the time, this is asking for trouble.
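For concreteness, here is a minimal sketch of the kind of embedded logic I mean. Everything in it (the DAG id, file paths, and column names) is hypothetical, and the imports use the Airflow 1.10-era module paths:

```python
# Anti-pattern sketch: business logic living inside the DAG file,
# running directly on the Airflow worker. All names and paths are made up.
from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator


def transform_orders():
    # Heavy ETL work executing on the worker itself. Testing this
    # means dragging the whole Airflow context along with you.
    import pandas as pd

    df = pd.read_csv("/data/raw/orders.csv")
    df["total"] = df["quantity"] * df["unit_price"]
    df.to_csv("/data/clean/orders.csv", index=False)


dag = DAG("orders_etl", start_date=datetime(2020, 1, 1), schedule_interval="@daily")

transform = PythonOperator(
    task_id="transform_orders",
    python_callable=transform_orders,
    dag=dag,
)
```

It looks innocent enough, but: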

What happens when the logic outgrows the resources allocated for it?

How do you test that logic? Must you stand up Airflow simply to test some Python code?

The additional complexity of ETL logic inside of Airflow will lead you down the path of frustration when it comes to ensuring that the logic behaves as expected, is performant in terms of execution time, is testable, and so on.

Dynamic DAGs are cool, dynamic DAGs are a huge pain

It is almost a natural behaviour of those standing up Airflow and managing DAGs to build complex DAGs. Chief among these over-complexities is the dynamic DAG. This is often one DAG to rule all DAGs: one DAG that is influenced and directed by a metadata store outside of Airflow.

The dynamic mega-DAG.

This may seem like a great idea on the surface: you only have to write one DAG, and it will do different things based on what the external metadata store dictates.

Only one DAG! That’ll be a dream to maintain!

Unfortunately, the battle has already been lost. When something goes wrong with one of the jobs run through this pipeline, debugging it is a living nightmare. A great amount of digging through logs is required, on top of endless navigation of the Airflow UI. The Airflow UI is not designed for this kind of behaviour.

Airflow’s UI provides pagination of DAGs. It provides the ability to search for DAGs by name. The UI is guiding you towards more DAGs, not fewer — definitely not one dynamic DAG to run all jobs.
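To make the anti-pattern concrete, here is a minimal sketch of a metadata-driven mega-DAG. In the real thing the job definitions would come from an external database; they are hard-coded here, with made-up names, so the sketch is self-contained:

```python
# Anti-pattern sketch: one DAG whose tasks are generated from external
# metadata. All job names and commands here are hypothetical.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator


def fetch_job_definitions():
    # In practice this would query the external metadata store.
    return [
        {"name": "ingest_orders", "command": "echo ingest orders"},
        {"name": "ingest_users", "command": "echo ingest users"},
        # ...hundreds more rows...
    ]


dag = DAG("mega_dag", start_date=datetime(2020, 1, 1), schedule_interval="@daily")

# Every job in the business becomes a task inside this single DAG.
# When one fails, you dig through one enormous graph to find it.
for job in fetch_job_definitions():
    BashOperator(task_id=job["name"], bash_command=job["command"], dag=dag)
```

Worse still, the tasks change whenever the metadata changes, so the DAG you are staring at in the UI may not be the DAG that actually ran.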

[Image: a container ship. Airflow should be a fleet of small ships delivering packages, not one massive cargo ship.]

Kubernetes is a great idea!

Kubernetes has had a profound effect on the global technology scene, and Airflow has very much been swept up in it. Indeed, Google Cloud’s Airflow-as-a-Service offering (Cloud Composer) is in fact Airflow running on a GKE cluster (together with a smattering of other services). The main problem with running Airflow on Kubernetes comes from using Airflow in a non-Kubernetes way.

Airflow has excellent support for Kubernetes via its KubernetesExecutor (something I recommend you look at if you are running Airflow on Kubernetes). However, running Airflow on Kubernetes with a non-Kubernetes executor is unfortunately a bit of a waste. You gain all of the overhead of a Kubernetes cluster (together with the pain of managing, administering, governing, and securing it) with none of the benefits.
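Switching is a configuration change rather than a code change. A minimal sketch of the relevant airflow.cfg entries, using 1.10-era setting names and placeholder namespace and image values:

```ini
# airflow.cfg (Airflow 1.10-era settings; values are illustrative)
[core]
# Run each task in its own short-lived pod instead of on a static worker.
executor = KubernetesExecutor

[kubernetes]
namespace = airflow
worker_container_repository = my-registry/my-airflow
worker_container_tag = 1.10.7
```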

Why won’t my plugins refresh?!

The process of refreshing plugin code is also a pain. Through a lot of trial and error, I found that it takes restarting the scheduler and the webserver(s) in a certain order to properly replace the plugin code for a given Airflow instance. And when plugin code is missing or erroneous, the errors Airflow raises often paint a very unclear picture of what is actually happening.

These reasons provide yet more ammunition for the idea of removing as much custom logic from Airflow as possible. If you can’t reliably update the plugin code, the error reporting isn’t great, and it’s annoying to test…why bother?

Keep Airflow lean, its complexity low, and its DAGs and operators deterministic.

[Image: a woman holding her head in her hands. Bugs popping up in weird places because you changed plugin code? Welcome to Airflow!]

The pathway to healing

Having reformed my usage of Airflow in recent days, let me offer some sage advice. The suggestions here will, again, lean heavily into the idea of lowering the complexity of the Airflow instance wherever possible. The more understandable and deterministic your systems are, the better you will sleep.

Anti-patterns are real — constantly ask yourself if you are falling into one

There are plenty of warnings about Airflow’s anti-patterns floating around in blog posts and documentation. Heed these warnings; do not ignore them; do not think of yourself as above such worries. In the past I have slowly slipped into overly dynamic DAGs, because I did not stop to consider ‘is this an anti-pattern?’ when making changes to my Airflow code.

Talk to someone who has used Airflow heavily, and if possible share your instance’s details. More often than not they will have some pearls of wisdom for you — and maybe a few war stories too.

[Image: a stop sign. Stop, take a breath, and ask yourself: “Is this an anti-pattern?” You’d be surprised how easily you can fall into anti-pattern traps.]

Study Airflow well, its strengths, and its weaknesses. Only then can you defeat it

Airflow can be a peculiar beast. As I see it, there may in fact be more anti-patterns in Airflow than patterns. Once you understand how it was intended to be used, you can use Airflow’s strengths to build a great orchestration system.

Abuse Airflow and you will be in a world of hurt.

ETL logic goes in scripts; call the scripts from Airflow

I’ve recounted above how ETL logic inside Airflow can bog it down: hard to test, unstable, a plain old bad place to be. The solution is to place your ETL logic somewhere outside of Airflow. Believe it or not, Airflow itself offers solutions to this. For Airflow running on a VM, there are the BashOperator and the DockerOperator. Both allow you to write and execute code in such a way that you can properly test the ETL code in isolation, away from Airflow. This is a huge benefit to you as an Airflow developer. Most of the time I would prefer Docker over Bash, as you gain control over the execution environment rather than running Bash directly on the host.
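Here is a minimal sketch of both options, assuming the ETL logic lives in a standalone script; the script path, image name, and DAG id are all made up:

```python
# The DAG only *triggers* the ETL; the logic itself lives in a script
# or image that can be run and tested without Airflow at all.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.operators.docker_operator import DockerOperator

dag = DAG("orders_etl", start_date=datetime(2020, 1, 1), schedule_interval="@daily")

# Option 1: shell out to a script on the Airflow host.
transform_bash = BashOperator(
    task_id="transform_orders_bash",
    bash_command="python /opt/etl/transform_orders.py",
    dag=dag,
)

# Option 2 (usually preferable): run the logic in a container,
# pinning its dependencies and execution environment.
transform_docker = DockerOperator(
    task_id="transform_orders_docker",
    image="my-registry/orders-etl:1.0",
    command="python transform_orders.py",
    dag=dag,
)
```

The payoff is that transform_orders.py can be run, unit-tested, and profiled entirely outside Airflow.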

If you are running Airflow on Kubernetes, then you also have an Airflow-native solution to removing custom code from Airflow — the KubernetesPodOperator.

The KubernetesPodOperator is an absolute shining gem. When it comes to removing custom logic from Airflow, Dockerising that logic and triggering it from Airflow is a great idea. Applied to a Kubernetes world, you spin that containerised step out into its own pod, completely separate from the Airflow environment, maybe even running on a different machine!
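A minimal sketch, assuming Airflow 1.10-era import paths and a hypothetical image and namespace:

```python
from datetime import datetime

from airflow import DAG
from airflow.contrib.operators.kubernetes_pod_operator import KubernetesPodOperator

dag = DAG("orders_etl_k8s", start_date=datetime(2020, 1, 1), schedule_interval="@daily")

transform = KubernetesPodOperator(
    task_id="transform_orders",
    name="transform-orders",
    namespace="etl-jobs",                # hypothetical namespace
    image="my-registry/orders-etl:1.0",  # hypothetical image
    cmds=["python", "transform_orders.py"],
    is_delete_operator_pod=True,  # tidy up the pod once the task completes
    dag=dag,
)
```

Resource requests, node selectors, and secrets can then be managed per step in Kubernetes, rather than per Airflow instance.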

Make friends with your Airflow configuration

Good lord, you can achieve a lot if you know what the Airflow configuration can do. It’s an absolute mammoth of a file, and can be quite daunting to those just learning Airflow. I urge you to experiment with the configuration, at least somewhere safe where you don’t mind blowing away the Airflow instance.

I have a startling example of how much the configuration shapes the behaviour of Airflow’s jobs. An Airflow-on-Kubernetes cluster that I was managing took 30 minutes to run a given DAG to completion. Each task completed in a tiny amount of time, but the DAG spent long stretches waiting for the next task to be scheduled. After enough fiddling with the configuration, the DAG was performing much better: it went from a 30-minute execution to just under 5.

Such a performance gain was achieved purely through tweaks to the Airflow configuration file. Consider that you may be able to speed up your jobs, improve stability, and even increase your resource utilisation just through experimenting with your Airflow configuration for a day or two. It’s always worth a try.
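For a sense of the kinds of knobs involved, here is a sketch of some 1.10-era settings that govern scheduling throughput. The values are illustrative only, not recommendations; measure against your own workload:

```ini
# airflow.cfg (illustrative values, not recommendations)
[core]
parallelism = 64        # max task instances running across the whole instance
dag_concurrency = 32    # max task instances running per DAG at once

[scheduler]
min_file_process_interval = 30  # seconds between re-parses of each DAG file
max_threads = 4                 # scheduler threads processing DAG files
```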

[Image: green symbols running down a black background. Decoding what the Airflow configuration variables do can be frustrating, but it is 100% worth the effort.]

Keep learning

I’d like to leave you, having confessed my sins and made reparations, with some further reading material that has helped me in my time with Airflow:

An excellent article broaching the topic of using container-based logic instead of embedded logic:

https://medium.com/bluecore-engineering/were-all-using-airflow-wrong-and-how-to-fix-it-a56f14cb0753

A post from Kubernetes itself about using Airflow on Kubernetes:

https://kubernetes.io/blog/2018/06/28/airflow-on-kubernetes-part-1-a-different-kind-of-operator/

A guide to scaling out Airflow (though it is an Astronomer-specific guide, it still has some great tips for all Airflow developers):

https://www.astronomer.io/guides/airflow-scaling-workers/

An open (at the time of writing) issue with Airflow regarding its poorly named configuration variables — this is why I urge experimentation to properly understand the configuration:

https://issues.apache.org/jira/browse/AIRFLOW-57

Now it’s up to you. Give Airflow hell!

About the author

Matthew Grey is a principal technology engineering consultant at Servian specialising in Google Cloud. Servian is a technology consulting company specialising in big data, analytics, AI, cybersecurity, cloud infrastructure, and application development.

You can reach me on LinkedIn or check out my other posts here on Medium.
