3 Reasons You Should Stay Away From AWS Step Functions | by Allen Helton | May, 2022

Step Functions is an incredible workflow service from AWS. But if you're not careful, you can get in over your head and find yourself struggling to do routine maintenance.

Photo by Breana Panaguiton on Unsplash

Last week I got a call asking for help with a nasty bug in production.

The bug dealt with processing data at a scale the dev team hadn't anticipated. They were using Step Functions to orchestrate a workflow that took an array of objects, processed them, and shoved the transformed objects into DynamoDB.

On the surface, it sounds like a fairly standard workflow: exactly what Step Functions was designed to do. But upon closer inspection, we realized we were running into the max request size limit of 256KB because the array was so large.

Debugging the issue took considerably longer than expected because we constantly had to trigger a workflow and wait for it to finish. It was taking 15+ minutes every run due to the number of items being processed.

I worked with the dev team a bit to identify alternatives, and we eventually landed on a workaround: break the array into smaller batches and run multiple executions of the state machine, started via a Lambda function.

But I don't really like that solution. Limits are there for a reason. It felt dirty to me how we worked around the problem. So naturally, I took to Twitter to see what you all are doing.

I got a number of solid answers, and it got me thinking: what are Step Functions not good at?

Honestly, it's a short list. But the shortcomings do apply to a wide range of use cases. So let's dive in and talk about when it's better to use something like Lambda functions over Step Functions.

Before we start, I want to add a disclaimer: if you look hard enough, you'll almost always find a workaround. With Step Functions, the workarounds to the problems below are actually viable solutions.

However, their viability will vary based on your comfort level with the service. Some advanced patterns can work around shortcomings, but they might be too difficult to maintain in some instances.

Each scenario below is marked with beginner, intermediate, and advanced skill levels. For the purpose of this article, those skill levels are defined as:

  • Beginner: You are just getting started with Step Functions. You have serverless experience but are wondering what all the hubbub is about.
  • Intermediate: You understand how state machines are structured and are familiar with production-ready best practices. You know how to use things like Map and Parallel states effectively and are able to keep control of the size and shape of the state machine execution state.
  • Advanced: You are well-versed in event-driven architectures and know how and when to use express vs. standard workflows. You know of and use advanced features like waiting for task tokens and executing sub-workflows. You always use direct SDK integrations whenever possible.

With this in mind, let's talk about some non-trivial scenarios with Step Functions and what you should do based on your comfort level.


Large Payloads

Step Functions has a max request size limit of 256KB. This means all the data you load in your state machine and pass across transitions must be smaller than 256KB at all times. If you load too much data along the way, you will get an exception and the execution will abort.


Beginner

This is a problem that often sneaks up on you and is a beast to track down. Everything works great until it doesn't. The easiest thing you can do to manage the execution size limit is trim the state to include only what is absolutely necessary.

Use the data flow simulator to help reshape your data to include as little as possible. This involves making heavy use of the ResultSelector, ResultPath, and OutputPath properties on states.
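
To illustrate, here is a trimmed Task state sketched as a Python dict (the state, function, and field names are hypothetical). ResultSelector keeps only the needed fields from the Lambda response, and OutputPath drops everything else before the next transition:

```python
# Hypothetical Task state showing how to shrink what flows forward.
process_items = {
    "ProcessItems": {
        "Type": "Task",
        "Resource": "arn:aws:states:::lambda:invoke",
        "Parameters": {"FunctionName": "process-items"},
        # From the full Lambda invoke result, keep just two fields
        "ResultSelector": {
            "itemIds.$": "$.Payload.itemIds",
            "status.$": "$.Payload.status",
        },
        # Write the selection into the state...
        "ResultPath": "$.processResult",
        # ...and pass only that portion to the next state
        "OutputPath": "$.processResult",
        "Next": "SaveResults",
    }
}
```

Everything not named in ResultSelector, including the bulky raw Lambda response, is discarded before the transition counts against the 256KB limit.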

The trouble with this approach is that it doesn't solve the problem if you can't shrink your data set down. If your comfort level with Step Functions is low, then Lambda functions might be a more appropriate solution.


Intermediate

The official recommendation from AWS is to save the data in S3 and pass the object arn between states. This means when you have a payload that could potentially go over the 256KB limit, you must first save it to S3. When executing your state machine, you pass in the object key and bucket so all Lambda functions can load the data.
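
A minimal sketch of that decision, assuming a helper in the triggering code (the function and bucket/key names are mine, not an official API):

```python
import json

# 256KB Step Functions payload limit
MAX_STATE_SIZE = 256 * 1024

def build_execution_input(payload: dict, bucket: str, key: str) -> dict:
    """Return the input for StartExecution: the payload inline if it fits,
    otherwise a pointer to where it was offloaded in S3."""
    raw = json.dumps(payload).encode("utf-8")
    if len(raw) < MAX_STATE_SIZE:
        return payload  # small enough to flow through the state machine
    # In real code you would upload before starting the execution:
    # boto3.client("s3").put_object(Bucket=bucket, Key=key, Body=raw)
    return {"bucket": bucket, "key": key}
```

Each Lambda function in the workflow then checks whether it received data inline or a bucket/key pair it must fetch.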

A major downside to this approach is that it makes it harder to use the direct SDK integrations. Those integrations use data directly out of the execution state, so you won't be able to pass the necessary information to the API calls because it's stored in S3.

It's a simple solution to an interesting problem, but you effectively eliminate a major benefit of Step Functions. Not to mention you take a performance hit, since you'll be loading the object from S3 whenever you need to access the payload.


Advanced

With payloads that exceed the execution state limit, you must trigger your workflows via a Lambda function. With this in mind, you might be able to split your data and workflow into multiple pieces. If you have a set of actions that must be performed on a subset of your data, you could create a state machine that does only those tasks.

You could then create another state machine that does tasks on a different subset of your data, and so on. This creates small, "domain-driven" state machines with narrow focus.

Your execution Lambda function would be responsible for parsing the data into the appropriate pieces and executing each state machine with the correct data. After running all the state machines, it could piece the data back together if necessary and return the result.
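
That trigger function could look something like this sketch (the batch size, execution names, and state machine ARN are all assumptions; the actual `start_execution` call is commented out):

```python
import json

BATCH_SIZE = 100  # assumption: tuned to stay under the 256KB input limit

def chunk(items: list, size: int) -> list:
    """Split a list into consecutive batches of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]

def start_batched_executions(items: list) -> list:
    """Hypothetical trigger Lambda: one state machine execution per batch."""
    executions = []
    for i, batch in enumerate(chunk(items, BATCH_SIZE)):
        execution_input = json.dumps({"items": batch})
        # boto3.client("stepfunctions").start_execution(
        #     stateMachineArn=STATE_MACHINE_ARN,
        #     name=f"batch-{i}",
        #     input=execution_input)
        executions.append({"name": f"batch-{i}", "size": len(batch)})
    return executions
```

Note the function starts the executions and returns; it does not wait on them, for the billing reason described below.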

This approach brings back the ability to use direct SDK integrations, but it does add complexity to your solution. With more state machines to manage, you might have difficulty maintaining the solution down the road.

Be careful with this approach: you don't want the Lambda function to wait for all the state machine executions to finish. That would rack up a hefty bill. Instead, you could try using the scatter/gather pattern to trigger a response on completion.


High Volume Arrays

Step Functions has a maximum of 25,000 history events. This means if you have a data set with thousands of entries in it, you could exceed the limit of state transitions. You might also run into the data size limit for sets that large.

Large data sets that must be processed concurrently sound like a great use for Step Functions. However, if you are doing parallel processing via a Map state, the max concurrency limit is 40. That means you'll be processing the data in "batches" of 40, so your parallel processing might not be as fast as you think.


Beginner

If your workflow is running asynchronously, it might be best to accept the 40 concurrent Map executions and wait for it to finish. There's nothing wrong with this approach until you get close to the 25,000 event history limit.

When that starts to happen with your state machines, you might need to do some math and figure out what your max item count is. Once you determine your max item count, you can run your workflow in parallel batches, similar to what I did to solve the production bug mentioned earlier.
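
The math is a simple budget calculation. The 25,000 limit is real; the events-per-item and overhead numbers below are assumptions you would measure from your own execution history:

```python
EVENT_HISTORY_LIMIT = 25_000
EVENTS_PER_ITEM = 5     # assumed events each Map iteration generates
WORKFLOW_OVERHEAD = 50  # assumed events outside the Map state

def max_item_count() -> int:
    """Largest array one execution can process without exhausting history."""
    return (EVENT_HISTORY_LIMIT - WORKFLOW_OVERHEAD) // EVENTS_PER_ITEM

def batches_needed(total_items: int) -> int:
    """How many parallel executions to fan the work out across."""
    return -(-total_items // max_item_count())  # ceiling division
```

With these assumed numbers, one execution could handle roughly 4,990 items before the history budget runs out.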

To handle the large data size that comes along with big arrays, you would need to adopt the same approach as above, where the payload is saved to an S3 object and then loaded, parsed, and split via a Lambda function at the beginning of the state machine.


Intermediate

The intermediate approach is similar to the beginner one, but it involves more automation. If the array you are processing lives in a database like DynamoDB, you can load a subset of the data to process from within the state machine.

Diagram of a state machine that loads from the database and keeps track of the event count

The state machine loads a subset of the data using the Limit property. It then iterates over the returned items in a Map state.

Once the items are done processing, it loads the execution history and looks at the Id property of the last event to get the number of events that have occurred. If there is still enough room without getting too close to the 25,000 limit, it starts from the beginning. If it is getting close to the limit, the state machine starts another instance of itself to reset the count and continue processing where it left off.
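
The loop-or-hand-off decision reduces to a comparison like this (the safety margin is an assumption; the event Id would come from the newest event returned by GetExecutionHistory):

```python
EVENT_HISTORY_LIMIT = 25_000
SAFETY_MARGIN = 2_000  # assumption: leave room for one more loop iteration

def can_loop_again(last_event_id: int) -> bool:
    """True if the execution has history budget for another pass over the
    data; False if it should start a fresh execution of itself instead."""
    return last_event_id < EVENT_HISTORY_LIMIT - SAFETY_MARGIN
```

The margin should be at least the number of events one full iteration generates, otherwise the execution can blow past the limit mid-loop.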

This process gets you pretty far. But in terms of execution speed, it could be faster. It works in sequential batches of 40, so large datasets could take a significant amount of time to process.


Advanced

Justin Callison, senior manager of Step Functions, walks us through an advanced approach to blistering fast parallel processing by structuring state machines as orchestrators and runners.

The orchestrator parses your dataset into batches and passes a single batch to a runner. The runner takes the batch and works the items. If a batch has more than 40 items in it, it splits the data into 40 more batches and recursively calls itself to fan out and process more items in parallel. The state machine continues to split and fan out until there are fewer than 40 items in each batch.
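
One recursion step of that splitting logic can be sketched in plain Python (a simplification of the pattern, not the linked implementation):

```python
MAX_CONCURRENCY = 40  # Map state concurrency limit

def split_batch(items: list) -> list:
    """If the batch fits under the Map concurrency limit, process it as-is;
    otherwise split it into at most 40 sub-batches, each of which would be
    handed to another runner execution that applies the same rule."""
    if len(items) <= MAX_CONCURRENCY:
        return [items]
    size = -(-len(items) // MAX_CONCURRENCY)  # ceiling division
    return [items[i:i + size] for i in range(0, len(items), size)]
```

Each level of recursion multiplies the available parallelism by up to 40, which is where the speed comes from.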

The article goes into great detail and even provides a working example on GitHub.

This method completely addresses the parallelism problem, but it is the most advanced approach by far. Make sure you are comfortable with Step Functions before going down this route. As with anything recursive, a small bug could send you into an infinite loop and cause a large bill.


Crossing Service Boundaries

When building workflows, sometimes you need to manipulate data in multiple microservices. Microservices are a logical separation of AWS resources that may or may not live in the same account. Each microservice should be self-contained and only use its own resources, not resources from other services.

Directly using resources from other microservices would create tight coupling, which is an anti-pattern in serverless and microservice design. Step Functions makes it easy to cross those service boundaries if you have multiple microservices deployed in the same AWS account. It's up to you to be vigilant when you're building your state machines.


Beginner

Warning: what I'm about to suggest is an anti-pattern, and I don't recommend it for production use!

When starting out with Step Functions, it's entirely possible to use Lambda functions, SQS queues, SNS topics, etc… without regard to which microservice they belong to. The workflow studio lets you simply pick a Lambda function from a dropdown. There are no restrictions on which functions you can use because microservices are a logical construct.

If you're using Infrastructure as Code (IaC), it's a matter of exporting the arn of a resource and importing it into the template of another service. A bit harder, but still relatively easy.

Nothing stops you from going across microservices, and it will get the job done. So while it's not recommended, it's often the easiest approach to crossing service boundaries.


Intermediate

While invoking resources directly might be an anti-pattern, calling a cross-service API is not. If you have your resources behind an internal API, it's perfectly acceptable to call it. Calling an API provides loose coupling, which is much more acceptable in serverless and microservice environments.

Since Step Functions doesn't currently support calling external APIs natively, you have two options for incorporating this approach into your workflows: a proxy Lambda function or an API Gateway integration.

The Lambda function can be as simple or as complex as you need it to be. If you want to transform the response before returning it to the state machine, do it. If you want a straight pass-through, that's an option as well. The objective with this approach is to call an API using something like axios or requests.
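
A minimal proxy Lambda might look like this sketch, using only the standard library (the `url` field in the task input is my assumption, not an established contract):

```python
import json
import urllib.request

def shape_response(status: int, body: dict) -> dict:
    """Trim the downstream response to just what the workflow needs."""
    return {"statusCode": status, "data": body}

def handler(event, context=None):
    """Hypothetical proxy Lambda invoked as a Task state: calls the
    cross-service API named in the input and returns a shaped response."""
    req = urllib.request.Request(event["url"],
                                 headers={"Accept": "application/json"})
    with urllib.request.urlopen(req, timeout=10) as resp:
        return shape_response(resp.status, json.load(resp))
```

Keeping the response-shaping in its own function makes the transform easy to unit test without any network calls.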

The HTTP integration essentially creates a proxy from an API Gateway in your microservice to call an external endpoint. When going this route, you can use the API Gateway invoke SDK integration to make the call directly. This provides a higher-performing solution than the Lambda function.


Advanced

If the cross-service call you need to make is a long-running or multi-step process, you don't want a synchronous solution like the ones listed above. Instead, you need to pause execution and wait for a response in order to resume. Sheen Brisals shows us how to use EventBridge to do just that.

The EventBridge integration fires an event, pauses the state machine execution, waits for the event to be processed in another service, then resumes the workflow when the other service fires an event back. This is known as the callback pattern.
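
A callback task can be sketched as a Python dict like so (the state names, event source, and detail fields are hypothetical; `.waitForTaskToken` and `$$.Task.Token` are the real mechanism):

```python
import json

# Hypothetical callback-pattern task: publish an event carrying the task
# token, then pause until another service returns it.
wait_for_callback = {
    "NotifyAndWait": {
        "Type": "Task",
        # .waitForTaskToken pauses the execution until the token comes back
        "Resource": "arn:aws:states:::events:putEvents.waitForTaskToken",
        "Parameters": {
            "Entries": [{
                "Source": "order-service",
                "DetailType": "PaymentRequested",
                "Detail": {
                    "taskToken.$": "$$.Task.Token",
                    "orderId.$": "$.orderId",
                },
            }]
        },
        # Abort if the other microservice never answers
        "HeartbeatSeconds": 3600,
        "Next": "FulfillOrder",
    }
}
# The other service resumes the workflow by returning the token, e.g.:
# boto3.client("stepfunctions").send_task_success(
#     taskToken=token, output=json.dumps({"status": "PAID"}))
```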

The callback pattern is another way to provide loose coupling between your microservices. It does add a layer of complexity to your solution, but it provides the most flexibility and the highest reliability. Just make sure you configure the state machine heartbeat to abort execution if something goes wrong in the other microservice.

There are a few situations where Step Functions might not be the best AWS service for building workflows. How you handle large payloads, high volume arrays, or cross-service boundaries varies based on your comfort level.

If you pursue an option outside your comfort level, remember that the best solution isn't the one that works; it's the one that works and that you can effectively maintain. That means if there's a defect, you must know how to troubleshoot the problem and dive through traces.

Sometimes it's better to just go with Lambda.

It's not a bad thing to go with the simpler option based on the skills of your engineering team. Something we're all constantly working on is upskilling. Improving our comfort level with new cloud features, new architectural patterns, or entirely new services is part of working in the cloud. We love it.

Step Functions is an amazing alternative to Lambda functions in a multitude of use cases. It offers high traceability in asynchronous workflows and in some instances is cheaper to run than Lambda functions. There are even ways to eliminate the notorious serverless cold start by integrating API Gateway directly with an express state machine.

Step Functions is turning out to be the Swiss Army knife of the serverless world. It lets users do many things quickly and easily. It's just not always the most beginner-friendly.

I highly encourage you to try out Step Functions if you haven't already. The pros greatly outweigh the cons, and it offers a high degree of visibility into your server-side operations. You can visit my GitHub page for a variety of examples.

Happy coding!
