A Data Engineer’s Perspective on AWS Managed Airflow vs. AWS Step Function (State Machine) | by Suvendu Mohanty | Jun, 2022

Pricing, infrastructure, scale, and more

Photo by Rubaitul Azad on Unsplash

Before I share my perspective, here’s some context. I work for a major streaming app’s Data Team. Our team’s work is Airflow heavy, and we have 1000+ complex Airflow jobs to process our ETL and ML pipelines.

Very recently I started with a new team, and the team was moving from their legacy, self-managed, monolithic Spark 2.1 processes to AWS EMR and a serverless architecture (Lambda, etc.) to improve latency and get rid of the monolithic Spark ETL process (if you are wondering what the monolithic structure looks like, see my repo here).

So, the team was considering AWS Managed Airflow as the ETL orchestration tool. But as the team was evaluating the right options, we brought AWS State Machine into the mix to see how it would help compared to AWS Managed Airflow.

Here is my perspective on both of these great tools.

I won’t talk about the technical differences between these two great products. Instead, I will focus on my experience using them and where each one fits.

Also, it’s worth mentioning that this article focuses on the ETL and ML pipeline capabilities of Step Functions. Step Functions is much more powerful for many web application and business workflow use cases, but we’re not covering those capabilities here because Airflow doesn’t compete in those areas.

Both services are well suited to ETL and ML pipelines.

Airflow has a bit of a learning curve (Python, Airflow operator syntax) when it comes to building your pipeline. But at the same time, it gives you more control for writing complex pipelines.

AWS State Machine, on the other hand, is easier to start with and faster to integrate with your AWS services: you can build your pipeline with zero code by leveraging the State Machine Workflow Studio.

You can build your entire ETL flow by dragging and dropping workflow controls, and the integration with your services is seamless. In addition, it generates the State Machine definition JSON, which you can use in your CloudFormation (CF) or SAM template.
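To give a feel for what that definition JSON looks like, here is a minimal sketch in Python that builds a two-step Amazon States Language definition and prints it. The Glue job name `my-etl-job` and the Lambda function name `notify-on-complete` are hypothetical placeholders, not names from my project; Workflow Studio would generate the equivalent JSON for you.

```python
import json


def build_etl_definition(glue_job_name: str, notify_function: str) -> dict:
    """Return a two-step Amazon States Language definition as a dict."""
    return {
        "Comment": "Minimal ETL flow: run a Glue job, then call a Lambda",
        "StartAt": "RunEtlJob",
        "States": {
            "RunEtlJob": {
                "Type": "Task",
                # ".sync" makes the state wait until the Glue job finishes.
                "Resource": "arn:aws:states:::glue:startJobRun.sync",
                "Parameters": {"JobName": glue_job_name},
                "Next": "Notify",
            },
            "Notify": {
                "Type": "Task",
                "Resource": "arn:aws:states:::lambda:invoke",
                "Parameters": {
                    "FunctionName": notify_function,
                    # Pass the whole state input to the function.
                    "Payload.$": "$",
                },
                "End": True,
            },
        },
    }


# Hypothetical names, used only for illustration.
definition = build_etl_definition("my-etl-job", "notify-on-complete")
definition_json = json.dumps(definition, indent=2)
print(definition_json)
```

The printed JSON is the kind of string you would place in the state machine definition of a CF or SAM template.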

AWS State Machine Workflow Studio

After building my first State Machine, I realized you can iterate much faster on your ETL and ML pipeline development with Step Functions than with Airflow.

One other criterion to take into account when choosing one over the other is which systems you are going to integrate as part of your ETL process.

State Machine is the best fit when your resources are mostly AWS services. Connecting to services that are not AWS-managed, or to on-premise services, is not that straightforward with State Machine. However, this isn’t a complete roadblock, as you can see in one implementation here connecting on-premise resources from State Machine.

AWS Managed Airflow also provides plugin integration for many AWS services, and its controls are richer when interacting with external services. However, as mentioned, integrating with AWS services requires you to cook up some code.

Just as Airflow provides us with a rich set of Operators to control our workflow, AWS State Machine also provides many advanced and rich workflow controls to design your workflow (Map iterator, Wait, and Parallel). You can’t add a custom operator as you can in Airflow, but personally, I haven’t found a use case that the State Machine flow controls could not satisfy.
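As a rough sketch of those three controls, the Python snippet below assembles a definition that chains a Map, a Wait, and a Parallel state. The Lambda function names and the `$.partitions` input field are assumptions made up for the example. I use the `Iterator` field for the Map state; newer definitions may name this field `ItemProcessor`.

```python
import json


def lambda_task(function_name: str) -> dict:
    """A terminal Task state that invokes a Lambda function (name is hypothetical)."""
    return {
        "Type": "Task",
        "Resource": "arn:aws:states:::lambda:invoke",
        "Parameters": {"FunctionName": function_name, "Payload.$": "$"},
        "End": True,
    }


def single_state_branch(state_name: str, state: dict) -> dict:
    """Wrap one state as a self-contained sub-workflow (used by Map and Parallel)."""
    return {"StartAt": state_name, "States": {state_name: state}}


flow_controls = {
    "StartAt": "ProcessEachPartition",
    "States": {
        # Map: run the same sub-workflow once per element of $.partitions.
        "ProcessEachPartition": {
            "Type": "Map",
            "ItemsPath": "$.partitions",
            "MaxConcurrency": 5,
            "Iterator": single_state_branch(
                "LoadPartition", lambda_task("load-partition")
            ),
            "Next": "WaitForDownstream",
        },
        # Wait: pause for a fixed number of seconds before continuing.
        "WaitForDownstream": {
            "Type": "Wait",
            "Seconds": 60,
            "Next": "PublishInParallel",
        },
        # Parallel: run independent branches at the same time.
        "PublishInParallel": {
            "Type": "Parallel",
            "Branches": [
                single_state_branch(
                    "RefreshDashboard", lambda_task("refresh-dashboard")
                ),
                single_state_branch("TrainModel", lambda_task("train-model")),
            ],
            "End": True,
        },
    },
}

print(json.dumps(flow_controls, indent=2))
```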

Additionally, I discover State Machine is wealthy after I require my ETL course of to have some human intervention — like approval wanted in your pipeline to proceed to subsequent step. In such circumstances, State Machine offers you management when in your pipeline you’ll be able to ship management to the human, and after the approval/rejection subsequent step will comply with.

Human Intervention-driven Workflow using State Machine

One thing State Machine doesn’t ship with is built-in scheduling like Airflow has. However, there are a couple of ways to achieve something similar, such as a schedule triggered by AWS EventBridge.
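A minimal sketch of the EventBridge approach, assuming a client that behaves like `boto3.client("events")`: create a scheduled rule, then point it at the state machine. The rule name, ARNs, and account ID below are made-up placeholders, and the `RecordingClient` stub exists only so the example runs without AWS access.

```python
import json


def schedule_state_machine(events_client, rule_name, schedule_expression,
                           state_machine_arn, role_arn, execution_input=None):
    """Create (or update) an EventBridge rule that starts a state machine on a schedule."""
    events_client.put_rule(
        Name=rule_name,
        # A rate(...) or cron(...) expression.
        ScheduleExpression=schedule_expression,
        State="ENABLED",
    )
    events_client.put_targets(
        Rule=rule_name,
        Targets=[
            {
                "Id": f"{rule_name}-target",
                "Arn": state_machine_arn,
                # Role that EventBridge assumes to start the execution.
                "RoleArn": role_arn,
                "Input": json.dumps(execution_input or {}),
            }
        ],
    )


class RecordingClient:
    """Stand-in for boto3.client("events") that only records the calls made."""

    def __init__(self):
        self.calls = []

    def put_rule(self, **kwargs):
        self.calls.append(("put_rule", kwargs))

    def put_targets(self, **kwargs):
        self.calls.append(("put_targets", kwargs))


# Dry run with placeholder values; swap in a real boto3 client to apply it.
client = RecordingClient()
schedule_state_machine(
    client,
    rule_name="nightly-etl",
    schedule_expression="cron(0 2 * * ? *)",  # every day at 02:00 UTC
    state_machine_arn="arn:aws:states:us-east-1:123456789012:stateMachine:etl-pipeline",
    role_arn="arn:aws:iam::123456789012:role/eventbridge-start-etl",
    execution_input={"run_type": "scheduled"},
)
print(client.calls)
```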

ML Pipeline using State Machine

Airflow is an open-source product. With AWS Managed Airflow, the installation and infrastructure management overhead are taken care of by AWS. However, a serverless option isn’t available yet, so you need to be mindful of your ETL load, even though AWS offers auto-scaling.

Step Functions (State Machine), by contrast, is fully serverless, so you don’t have the burden of managing or configuring your resources. And you pay as you use.

Airflow pricing is based on the infrastructure you choose, while State Machine is pay-as-you-use.
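To reason about that difference, here is a small back-of-the-envelope sketch in Python. It assumes the fixed model is billed per environment hour and the usage model is billed per state transition; the rates in the example are round placeholder numbers, not actual AWS prices, so plug in the current figures from the pricing pages before drawing conclusions.

```python
def fixed_monthly_cost(hourly_rate: float, hours_per_month: float = 730.0) -> float:
    """Cost of an always-on environment billed per hour, regardless of load."""
    return hourly_rate * hours_per_month


def usage_monthly_cost(runs_per_month: int, transitions_per_run: int,
                       price_per_transition: float) -> float:
    """Cost of a pay-per-use service billed per state transition."""
    return runs_per_month * transitions_per_run * price_per_transition


def break_even_runs(hourly_rate: float, transitions_per_run: int,
                    price_per_transition: float,
                    hours_per_month: float = 730.0) -> float:
    """Runs per month at which both billing models cost the same."""
    cost_per_run = transitions_per_run * price_per_transition
    return fixed_monthly_cost(hourly_rate, hours_per_month) / cost_per_run


# Placeholder rates for illustration only; these are NOT real AWS prices.
runs = break_even_runs(hourly_rate=1.0, transitions_per_run=10,
                       price_per_transition=0.0001)
print(f"Break-even at roughly {runs:,.0f} runs per month")
```

Below the break-even volume the pay-as-you-use model is cheaper under these assumptions; above it, the fixed environment wins.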

You can find some pricing comparisons here:

I find State Machine to be a richer product that lets you iterate faster on your ETL development.

At the same time, AWS Managed Airflow gives you better control of your ETL pipeline, but it comes at the cost of extra effort and a learning curve.

Also, some modern and automated workflows could be a little difficult to achieve through Airflow.
