Build More Scalable Pipelines Using Python Fire

by Avi Chad-Friedman, April 2022

Photograph by ian dooley on Unsplash

Airflow is a ubiquitous, open-source platform for declaratively and programmatically defining sophisticated workflows. Importantly, units of work can be dispatched and executed virtually anywhere, using a rich ecosystem of operators; and that ecosystem includes Airflow’s worker nodes themselves, if you leverage the humblest of all operators, the PythonOperator.

This is an anti-pattern that should be avoided for a few reasons: tasks compete with one another for resources and can starve the cluster; dependencies must be installed on the worker nodes and can conflict; and all code must be deployed to Airflow, making it difficult to set up a single-purpose deployment pipeline per model.

Packaging models into Docker containers sounds great, but interfacing with those containers from your DAG, especially where you have dynamic inputs, is a headache and requires some degree of CLI programming. And this is one of the reasons data scientists and engineers in a rush to ship opt for the PythonOperator: calling a Python function from Python is easy; calling a Python function from the command line is hard.

What if I were to tell you that you could have your cake and eat it too…

Image by Author. Generated via https://www.kapwing.com/

Before diving into a more fleshed-out example, I want to lay out the design pattern at a high level.

  1. Build your model in a single-purpose repository (or at least submodule).
  2. Instantiate Fire in your model’s main file (example below).
  3. From Airflow, use the Pythonic CLI and override the Docker entry point with whatever classes, methods, and parameters you want to invoke (as sketched right after this list).
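To give a flavor of step 3 before the full example, here is what invoking such a CLI looks like from a shell (the method and flag names are hypothetical, matching the skeleton shown later):

```bash
# Fire maps the first argument to a method name and --flags to its parameters
python main.py train --training_date=2022-04-01 --learning_rate=0.1
python main.py score --scoring_date=2022-04-02
```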

Let’s take the example of a simple machine learning DAG with two steps. Both the training and the inference tasks take several parameters that can be configured by environment variables, run-time triggers, or other dynamic inputs. Since the model is in Python, we simply invoke the two steps using the PythonOperator.

Not to belabor the point, but to make the general problems identified above concrete: the training or scoring tasks may eat up substantial resources and starve out the cluster, and the logic to connect to the feature and model storage layers must all be tightly coupled with the DAG, encouraging spaghetti code.
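Concretely, such a DAG might look something like the following sketch (the module, function, and parameter names are illustrative assumptions, not the original gist):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

# Hypothetical import: the model code has to be deployed alongside Airflow
from my_model.model import CustomerModel


def train(**context):
    # Runs directly on an Airflow worker, competing with every other task
    CustomerModel().train(training_date=context["ds"])


def score(**context):
    CustomerModel().score(scoring_date=context["ds"])


with DAG(
    dag_id="customer_model",
    start_date=datetime(2022, 4, 1),
    schedule_interval="@daily",
) as dag:
    train_task = PythonOperator(task_id="train", python_callable=train)
    score_task = PythonOperator(task_id="score", python_callable=score)
    train_task >> score_task
```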

Unscalable DAG leveraging the PythonOperator for training and scoring

If we were to try to do things the right way, by containerizing the model and using an operator that launches the workload on some compute service (Kubernetes, AWS Batch, etc.), we would have to pass those run-time parameters into the container, via an environment variable or the CLI, and program that interface. Building a robust CLI with decent error handling isn’t something any engineer wants to invest time in.

Enter Google’s Python Fire. This library turns any Python object (read: function, class, etc.) into an intuitively parameterized CLI with a single line of code, and gives you less of an excuse to beat up on your Airflow workers. The example code below contains no explicit reference to how to interact with it from the command line, and yet (as you can see in the subsequent DAG) you can invoke any method within the CustomerModel class with any parameters you want.
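Here is a minimal skeleton of that pattern. The class internals are illustrative assumptions; the one essential line is the fire.Fire call at the bottom:

```python
import fire


class CustomerModel:
    """Hypothetical model wrapper; method and parameter names are illustrative."""

    def __init__(self, model_dir: str = "s3://my-bucket/models/customer"):
        self.model_dir = model_dir  # where trained artifacts are persisted

    def train(self, training_date: str, learning_rate: float = 0.1) -> None:
        """Fit the model on features up to training_date and persist it."""
        ...

    def score(self, scoring_date: str) -> None:
        """Load the persisted model and write predictions for scoring_date."""
        ...


if __name__ == "__main__":
    fire.Fire(CustomerModel)  # the single line that generates the whole CLI
```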

Skeleton of object-oriented machine learning model code

To grok the power of wrapping the modeling code in Fire, let’s look at a DAG that uses the auto-generated CLI to launch the model in a Docker container.

The KubernetesPodOperator in the DAG below is just a stand-in for any compute-service operator that can run the model code packaged in my_model_image.

The CLI allows the two tasks of the DAG to invoke different methods and pass parameters Pythonically, so you don’t have to think about the interface any more deeply than you would if it were just calling a Python function!
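A sketch of what that DAG could look like, assuming the container is tagged my_model_image:latest and reusing the hypothetical parameters from the skeleton above; each task simply overrides the container command with a Fire CLI call:

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import (
    KubernetesPodOperator,
)

with DAG(
    dag_id="customer_model",
    start_date=datetime(2022, 4, 1),
    schedule_interval="@daily",
) as dag:
    train = KubernetesPodOperator(
        task_id="train",
        name="customer-model-train",
        image="my_model_image:latest",
        cmds=["python", "main.py"],
        # Fire routes "train" to CustomerModel.train and the flags to its parameters
        arguments=["train", "--training_date={{ ds }}", "--learning_rate=0.1"],
    )
    score = KubernetesPodOperator(
        task_id="score",
        name="customer-model-score",
        image="my_model_image:latest",
        cmds=["python", "main.py"],
        arguments=["score", "--scoring_date={{ ds }}"],
    )
    train >> score
```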

An Airflow DAG launching tasks in a Kubernetes pod
