Event-driven architectures (EDAs) are notoriously difficult when it comes to error handling. Since we’re using serverless services, we’ve doubled down on the need for enhanced observability.
To help keep track of the state of our WebSocket, we’ve implemented dead letter queues (DLQs) so we can drop events in a single location when something goes wrong. There are two types of issues we can experience in our WebSocket:
Event delivery failures occur when EventBridge fails to put the event in our SQS queue or when the event fails to transfer from SQS to the processing Lambda. This could be caused by a system outage or improper configuration. Regardless of the root cause, we send the event to a DLQ for us to monitor and triage. Once it’s in the DLQ, we can investigate what is going on in the system and attempt to fix it.
To send an event delivery failure to a DLQ, we update our EventBridge rule to target the queue on a delivery failure.
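As a minimal sketch of what that rule update might look like, the helper below builds a single EventBridge target entry with a dead letter queue attached. The function name, ARNs, and retry values are hypothetical and chosen for illustration; only the `DeadLetterConfig` and `RetryPolicy` keys come from the EventBridge target format.

```python
def build_sqs_target(target_id: str, queue_arn: str, dlq_arn: str) -> dict:
    """Build one EventBridge target entry that delivers to an SQS queue
    and falls back to a dead letter queue when delivery fails.

    All names and values here are illustrative assumptions.
    """
    return {
        "Id": target_id,
        "Arn": queue_arn,
        # Events that EventBridge cannot deliver to the target land here.
        "DeadLetterConfig": {"Arn": dlq_arn},
        # Illustrative retry budget before the event is dead-lettered.
        "RetryPolicy": {
            "MaximumRetryAttempts": 5,
            "MaximumEventAgeInSeconds": 3600,
        },
    }


target = build_sqs_target(
    target_id="websocket-queue",  # hypothetical target id
    queue_arn="arn:aws:sqs:us-east-1:123456789012:websocket-events",
    dlq_arn="arn:aws:sqs:us-east-1:123456789012:websocket-events-dlq",
)
```

With boto3, a dictionary like this could be passed as `boto3.client("events").put_targets(Rule="my-rule", Targets=[target])`, where `my-rule` is a placeholder. The DLQ also needs a resource policy that allows EventBridge to send messages to it.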
Event processing failures occur in our code. We either have a bug in the code, the event doesn’t have the data we expect, or a service we’re calling in the code has been throttled or is experiencing a failure. Once again, we push these errors to a DLQ so we don’t have to be ready the instant something goes wrong.
To send an event processing failure to a DLQ, we make use of Lambda destinations to route the event on a failure.
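The article does not show its destination configuration, so the following is only a sketch under assumptions: it builds the arguments for Lambda's event invoke configuration, which governs asynchronous invocations, with an on-failure destination pointing at a DLQ. The function name and ARN are hypothetical.

```python
def build_failure_destination_config(
    function_name: str, dlq_arn: str, max_retries: int = 2
) -> dict:
    """Build arguments that route failed asynchronous invocations to a DLQ.

    Names and values are illustrative assumptions, not the article's setup.
    """
    # Lambda accepts 0 to 2 retries for asynchronous invocations.
    if not 0 <= max_retries <= 2:
        raise ValueError("max_retries must be between 0 and 2")
    return {
        "FunctionName": function_name,
        "MaximumRetryAttempts": max_retries,
        # After retries are exhausted, the invocation record goes here.
        "DestinationConfig": {"OnFailure": {"Destination": dlq_arn}},
    }


config = build_failure_destination_config(
    function_name="process-websocket-event",  # hypothetical function name
    dlq_arn="arn:aws:sqs:us-east-1:123456789012:processing-dlq",
)
```

With boto3, this could be applied as `boto3.client("lambda").put_function_event_invoke_config(**config)`. The function's execution role would also need permission to send messages to the destination queue.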
Sending errors to a DLQ is one thing, but how do you know when something needs your attention?
To know when things are wrong, we’ve implemented a CloudWatch alarm that watches the DLQ. Whenever the DLQ has one or more items in it, the alarm publishes to an SNS topic that notifies the concerned parties of a system failure.
Once the dead letter queue has been cleared of all items, the alarm will turn off and continue watching for the next incident.
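One way such an alarm might be defined is sketched below. It assumes the alarm watches the SQS `ApproximateNumberOfMessagesVisible` metric for the DLQ; the alarm name, queue name, topic ARN, and one-minute period are hypothetical choices, not details from our setup.

```python
def build_dlq_alarm(alarm_name: str, dlq_name: str, topic_arn: str) -> dict:
    """Build CloudWatch alarm arguments that fire when a DLQ is non-empty.

    Names, period, and statistic are illustrative assumptions.
    """
    return {
        "AlarmName": alarm_name,
        "Namespace": "AWS/SQS",
        "MetricName": "ApproximateNumberOfMessagesVisible",
        "Dimensions": [{"Name": "QueueName", "Value": dlq_name}],
        "Statistic": "Maximum",
        "Period": 60,
        "EvaluationPeriods": 1,
        # Alarm as soon as the queue holds one or more messages.
        "Threshold": 1,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        # Treat missing datapoints as healthy rather than as a breach.
        "TreatMissingData": "notBreaching",
        # The SNS topic that notifies the concerned parties.
        "AlarmActions": [topic_arn],
    }


alarm = build_dlq_alarm(
    alarm_name="websocket-dlq-not-empty",  # hypothetical alarm name
    dlq_name="websocket-events-dlq",
    topic_arn="arn:aws:sns:us-east-1:123456789012:websocket-alerts",
)
```

With boto3, these arguments could be passed as `boto3.client("cloudwatch").put_metric_alarm(**alarm)`. Because the threshold is 1, the alarm returns to the OK state once the queue is empty again.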
When an error lands in a DLQ, you have two options: take manual action or automatically retry. Consider starting with manual action only while you get used to troubleshooting event-driven errors. Figure out what the patterns are in resolving them before you try to automate.
Once you identify which errors can be retried and how you’d go about fixing them, you should start building infrastructure around that process. The LEGO team has a great video on how they automatically retry event delivery failures in their system.