Unzip and Gzip Incoming S3 Files With AWS Lambda | by nhammad | Mar, 2022

Easier, faster, and better

AWS Workflow

Some time ago I encountered a situation where the incoming S3 files were zipped. Each zipped file contained five text or CSV files. However, for further processing, I needed to extract the zipped content and convert it into gzipped format. Since there was a large influx of files, unzipping and gzipping them manually did not seem feasible.

The best way to automate the process seemed to be to use AWS Lambda functions. If you head to the Properties tab of your S3 bucket, you can set up an Event Notification for all object “create” events (or just PutObject events). As the destination, you can select the Lambda function where you will write your code to unzip and gzip files.

Now, every time a new .zip file is added to your S3 bucket, the Lambda function will be triggered. You can also add a prefix to your event notification settings, for example, if you only want to run the Lambda function when files are uploaded to a specific folder within the S3 bucket.
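The same notification can also be set up from code rather than the console. The sketch below builds the configuration that boto3’s put_bucket_notification_configuration call expects. The function names, the ARN, the bucket name, and the prefix are placeholders of mine, not values from this post. It also assumes S3 is already allowed to invoke the function; the console adds that permission for you, but from code you have to add it yourself.

```python
def build_notification_config(lambda_arn, prefix="", suffix=".zip"):
    """Build an S3 notification config that fires on every object-create event.

    prefix and suffix are optional key filters, e.g. prefix="incoming/".
    """
    rules = []
    if prefix:
        rules.append({"Name": "prefix", "Value": prefix})
    if suffix:
        rules.append({"Name": "suffix", "Value": suffix})

    entry = {
        "LambdaFunctionArn": lambda_arn,
        "Events": ["s3:ObjectCreated:*"],
    }
    if rules:
        entry["Filter"] = {"Key": {"FilterRules": rules}}
    return {"LambdaFunctionConfigurations": [entry]}


def apply_notification(s3_client, bucket, config):
    """Send the config to S3. This replaces the bucket's existing notification config."""
    s3_client.put_bucket_notification_configuration(
        Bucket=bucket,
        NotificationConfiguration=config,
    )


# Example usage (placeholder values, needs boto3 and AWS credentials):
#   import boto3
#   config = build_notification_config(
#       "arn:aws:lambda:eu-west-1:123456789012:function:unzip-and-gzip",
#       prefix="incoming/",
#   )
#   apply_notification(boto3.client("s3"), "my-source-bucket", config)
```

Filtering on the .zip suffix as well as the prefix keeps the function from being invoked for objects it cannot process.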

The Lambda function can then look like this:
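What follows is a minimal sketch of such a handler, not a definitive implementation. It assumes Python with boto3 (which the Lambda Python runtime provides). The bucket name, the folder, and the helper gzip_zip_members are illustrative choices of mine. The boto3 import is deferred into the handler only so that the helper can be tried locally without AWS.

```python
import gzip
import io
import os
import urllib.parse
import zipfile

# Placeholder settings; adjust them to your own setup.
destination_bucket = os.environ.get("DESTINATION_BUCKET", "my-destination-bucket")
final_file_path = "gzipped/"  # folder (key prefix) for the gzipped files


def gzip_zip_members(zip_bytes):
    """Yield (member_name, gzipped_bytes) for each file inside a zip archive."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue  # skip directory entries
            yield info.filename, gzip.compress(archive.read(info))


def lambda_handler(event, context, s3=None):
    if s3 is None:
        import boto3  # deferred so the helper above works without boto3
        s3 = boto3.client("s3")

    uploaded = []
    for record in event["Records"]:
        sourcebucketname = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications.
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])

        # Load the incoming zip file into an in-memory buffer.
        zip_bytes = s3.get_object(Bucket=sourcebucketname, Key=key)["Body"].read()

        # Gzip each file individually and upload it to the destination.
        for name, gz_bytes in gzip_zip_members(zip_bytes):
            target_key = f"{final_file_path}{name}.gz"
            s3.put_object(Bucket=destination_bucket, Key=target_key, Body=gz_bytes)
            uploaded.append(target_key)

    return {"uploaded": uploaded}
```

Because the whole archive is held in memory, the function’s memory setting has to be comfortably larger than the biggest zip file you expect.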

When an event triggers this Lambda function, the function extracts the key of the file that caused the trigger. Using the file key, we then load the incoming zip file into a buffer, unzip it, and read each file individually.

Within the loop, each file inside the zipped folder is compressed individually into a gzip file and then uploaded to the destination S3 bucket.

You can update the final_file_path parameter if you want to upload the files to a specific folder. Similarly, you can update parameters like sourcebucketname and destination_bucket according to your requirements. You can also upload the gzipped files to the same source bucket; in that case, restrict the trigger with a prefix or a .zip suffix so that the function’s own output does not invoke it again.

Another way to do the same would be to first read the S3 file into the /tmp folder and then unzip it for further processing. However, this method may crash if you start getting multiple files in your S3 bucket at the same time.
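For comparison, the /tmp variant could be sketched roughly as follows. The function names and the fixed file name incoming.zip are hypothetical. The s3 argument is assumed to be a boto3 S3 client, and download_file and upload_file are its standard transfer methods. Each invocation works in its own temporary directory, which is removed afterwards so that leftovers do not pile up in a reused execution environment.

```python
import gzip
import shutil
import tempfile
import zipfile
from pathlib import Path


def gzip_zip_on_disk(zip_path, out_dir):
    """Write one <name>.gz file into out_dir for every file inside zip_path."""
    out_dir = Path(out_dir)
    written = []
    with zipfile.ZipFile(zip_path) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue
            # Only the base name is used, so members with the same name in
            # different folders of the archive would overwrite each other.
            target = out_dir / (Path(info.filename).name + ".gz")
            # Stream member -> gzip file without loading it fully into memory.
            with archive.open(info) as src, gzip.open(target, "wb") as dst:
                shutil.copyfileobj(src, dst)
            written.append(target)
    return written


def process_via_tmp(s3, bucket, key, destination_bucket):
    """Download the zip to /tmp, gzip its members, and upload the results."""
    with tempfile.TemporaryDirectory(dir="/tmp") as workdir:
        zip_path = Path(workdir) / "incoming.zip"
        s3.download_file(bucket, key, str(zip_path))
        for gz_path in gzip_zip_on_disk(zip_path, workdir):
            s3.upload_file(str(gz_path), destination_bucket, gz_path.name)
```

This version needs less memory because it streams from disk, but it depends on the space available in /tmp instead.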

Note that by default, Lambda has a timeout of 3 seconds and only 128 MB of memory. If you have many files coming into your S3 bucket, you should change these parameters to their maximum values (900 seconds and 10,240 MB of memory):

Timeout = 900

Memory_size = 10240
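If you prefer to change these settings from code, one option is boto3’s update_function_configuration call, sketched below. The helper name and the function name in the usage comment are made up for illustration. The client is passed in as an argument so that you decide how and where it is created.

```python
def raise_lambda_limits(lambda_client, function_name, timeout=900, memory_size=10240):
    """Set the timeout (in seconds) and memory (in MB) of an existing Lambda function."""
    return lambda_client.update_function_configuration(
        FunctionName=function_name,
        Timeout=timeout,
        MemorySize=memory_size,
    )


# Example usage (placeholder function name, needs boto3 and AWS credentials):
#   import boto3
#   raise_lambda_limits(boto3.client("lambda"), "unzip-and-gzip")
```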

The AWS role that you are using to run your Lambda function will require certain permissions. Firstly, it will require access to S3 for reading and writing files. The following permissions are the main ones:

“s3:ListBucket”

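“s3:HeadObject” (note that this is not a separate IAM action; HeadObject requests are authorized through “s3:GetObject”)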

“s3:GetObject”

“s3:GetObjectVersion”

“s3:PutObject”

You should also have CloudWatch access so you can log and debug your code if required.

“logs:CreateLogGroup”

“logs:CreateLogStream”

“logs:PutLogEvents”
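Put together, the permissions listed above could be expressed as a single policy document along these lines. The bucket names are placeholders, and the way the actions are scoped to resources (list on the bucket, read on the source objects, write on the destination objects) is my own assumption about a reasonable split. HeadObject is left out because those requests are covered by s3:GetObject.

```python
import json


def build_policy(source_bucket, destination_bucket):
    """Return an IAM policy document (as a dict) for the unzip-and-gzip function."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {   # listing applies to the bucket itself
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": [f"arn:aws:s3:::{source_bucket}"],
            },
            {   # reading applies to objects in the source bucket
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:GetObjectVersion"],
                "Resource": [f"arn:aws:s3:::{source_bucket}/*"],
            },
            {   # writing applies to objects in the destination bucket
                "Effect": "Allow",
                "Action": ["s3:PutObject"],
                "Resource": [f"arn:aws:s3:::{destination_bucket}/*"],
            },
            {   # CloudWatch Logs access for logging and debugging
                "Effect": "Allow",
                "Action": [
                    "logs:CreateLogGroup",
                    "logs:CreateLogStream",
                    "logs:PutLogEvents",
                ],
                "Resource": ["arn:aws:logs:*:*:*"],
            },
        ],
    }


# Print the JSON that can be attached to the function's role.
print(json.dumps(build_policy("my-source-bucket", "my-destination-bucket"), indent=2))
```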

Replicating a similar workflow through Terraform is also quite straightforward. I’ll write about it in my next post! 🙂
