Splitting and merging JSON files from batch jobs in AWS

I am working on a project where I split a single file containing many sentences into chunks, which I then send to a third-party API for sentiment analysis.

The third-party API has a limit of 5,000 characters per request, which is why I split the file into chunks of 40 sentences each. Each chunk is sent to a batch job via AWS SQS and processed for sentiment analysis by the third-party API. I want to merge all of the processed files into one file, but I couldn't find the logic to merge them.
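The splitting step described above can be sketched as a small function. This is a minimal, self-contained sketch, not the asker's actual code: the function name and the simple join-by-space behavior are assumptions, while the 40-sentence chunk size and 5,000-character cap come from the question.

```python
def chunk_sentences(sentences, max_sentences=40, max_chars=5000):
    """Group sentences into chunks of at most `max_sentences` sentences
    and at most `max_chars` total characters (the API's request limit)."""
    chunks, current, size = [], [], 0
    for sentence in sentences:
        # Start a new chunk when either limit would be exceeded.
        if current and (len(current) == max_sentences
                        or size + len(sentence) > max_chars):
            chunks.append(" ".join(current))
            current, size = [], 0
        current.append(sentence)
        size += len(sentence) + 1  # +1 for the joining space
    if current:
        chunks.append(" ".join(current))
    return chunks
```

With 100 short sentences this yields three chunks (40, 40, and 20 sentences), each safely under the character cap.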


For example,

the input file,

chunk1: sentence1....sentence1... sentence1....

chunk2: sentence2....sentence2... sentence2....

The input file is separated into chunks. Each chunk is sent separately to a batch job via SQS. The batch job calls the external API for sentiment analysis, and each result is uploaded to the S3 bucket as a separate file.
Output file:

{"Chunk1": "sentence1....sentence1...sentence1....",
"Sentiment": "positive."}

All I want is the output in a single file, but I couldn't find the logic to merge the output files.

Logic I tried:

For each input file, I attach a UUID to every chunk as metadata and merge the results with another Lambda function. The problem is that I don't know when all of the chunks have been processed, and therefore when to invoke the Lambda function that merges the files.
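One way to make the UUID idea concrete is to also record the total chunk count when the file is split, then have each S3 upload event check whether every chunk output for that job now exists. The sketch below is hypothetical (the key layout `output/<job_id>/<chunk_index>.json` and the function name are assumptions); the list of keys stands in for an S3 `ListObjectsV2` result so the sketch stays self-contained.

```python
def all_chunks_processed(processed_keys, job_id, total_chunks):
    """Return True once every chunk output for `job_id` exists.

    `processed_keys` stands in for the keys returned by listing the
    S3 prefix f"output/{job_id}/"; each batch job is assumed to write
    its result to output/<job_id>/<chunk_index>.json.
    """
    prefix = f"output/{job_id}/"
    seen = {key for key in processed_keys if key.startswith(prefix)}
    return len(seen) == total_chunks
```

An S3-triggered Lambda could run this check on each upload and invoke the merge function only when it returns True.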

If you have any better logic to merge the files, please share it here.

This sounds like a perfect use case for AWS Step Functions. Step Functions lets you define ordered tasks (which can be implemented as Lambdas). One of the state types, Map, lets you kick off many tasks in parallel and wait for all of them to finish before proceeding to the next step.

So a quick high level state flow would be something like:

  1. The first state takes a file as input and breaks it up into multiple chunks.
  2. The second state is a Map state whose task takes one chunk as input, calls the sentiment-analysis API, and saves the output. The Map state kicks off a task for each chunk and waits for every sentiment result.
  3. The third and final task state takes all of the result files and combines them in whatever way you deem appropriate.
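The final merge step could look like the sketch below. It is an assumption-laden illustration, not a definitive implementation: it takes per-chunk JSON result documents keyed by chunk index (the field names follow the question's example output), orders them, and writes one combined JSON document. In a real Lambda the inputs would come from S3 rather than a dict.

```python
import json

def merge_chunk_outputs(chunk_jsons):
    """Merge per-chunk result documents into one JSON document.

    `chunk_jsons` maps each chunk's index to the JSON string the batch
    job wrote for that chunk, so ordering by index restores the
    original document order.
    """
    merged = [json.loads(chunk_jsons[index]) for index in sorted(chunk_jsons)]
    return json.dumps({"results": merged}, indent=2)
```

Sorting by the numeric chunk index keeps the merged output in the same order as the original file, regardless of which batch job finished first.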

It may take a bit of googling and reading the user guides, but your workflow is exactly the use case this service was designed for. It sounds like you already have some of these steps implemented as their own Lambda functions; you'll just need to tweak them to be compatible with how Step Functions passes data in and out, instead of using SQS.

That said, I'm not sure how you want to merge the files, since each section was analyzed separately and may have its own sentiment, and I'm not sure how to summarize the sentiment as a whole.