My team has just finished building a resilient, serverless, distributed data pipeline that scales seamlessly with the volume of input data. We used several tools, including Ansible, AWS Lambda, and Terraform, and learned many lessons along the way in the form of pitfalls, failures, and wins. This talk is about that system and those lessons.
Session type: Session
We wanted to build a serverless data pipeline for coding medical charts using NLP. However, we didn’t want it to be real-time, as most serverless systems are. So we used a queue and the alarms of a monitoring system (AWS CloudWatch) to build a serverless batch-processing pipeline.
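The queue-plus-alarm trigger could look roughly like the sketch below. It assumes the CloudWatch alarm watches the queue depth and notifies a Lambda function through an SNS topic; the SNS hop, the `handle_alarm` name, and the `start_batch` callback are illustrative assumptions, not details of our actual setup.

```python
import json


def alarm_fired(event):
    """Return True if any SNS record carries a CloudWatch alarm in the ALARM state."""
    for record in event.get("Records", []):
        # SNS delivers the CloudWatch alarm notification as a JSON string.
        message = json.loads(record["Sns"]["Message"])
        if message.get("NewStateValue") == "ALARM":
            return True
    return False


def handle_alarm(event, start_batch):
    """Start one batch run when the alarm fires; otherwise do nothing.

    `start_batch` is a hypothetical callback that kicks off the batch job.
    """
    if alarm_fired(event):
        start_batch()
        return "started"
    return "skipped"
```

Because nothing runs until the alarm changes state, work accumulates in the queue and is processed in batches rather than message by message.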
Next, we wanted the pipeline to be distributed as well as serverless and batch-oriented, so we used Ansible to set up a master-workers architecture. However, AWS Lambda has a time limit of 5 minutes, while our entire NLP pipeline takes 30 minutes to complete.
So we hit upon the idea of having Lambda create a master server and start Ansible there under nohup, so that the playbook keeps running after the Lambda function has returned. We then learned some very important lessons about monitoring that nohup process.
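One way to picture the hand-off from Lambda to the master is the sketch below. It assumes the master is an EC2 instance started with a user-data script; the playbook, inventory, and log paths, the function names, and the instance type are hypothetical. In a real Lambda function, `ec2_client` would be `boto3.client("ec2")`.

```python
def build_user_data(
    playbook="/opt/pipeline/pipeline.yml",      # hypothetical path
    inventory="/opt/pipeline/inventory.ini",    # hypothetical path
    log_path="/var/log/pipeline.log",           # hypothetical path
):
    """Build a boot script that starts Ansible detached from any session."""
    return (
        "#!/bin/bash\n"
        # nohup plus '&' lets the playbook outlive the process that started it.
        f"nohup ansible-playbook -i {inventory} {playbook} > {log_path} 2>&1 &\n"
    )


def launch_master(ec2_client, image_id, instance_type="t2.medium"):
    """Launch one master instance and return its instance id."""
    response = ec2_client.run_instances(
        ImageId=image_id,
        InstanceType=instance_type,
        MinCount=1,
        MaxCount=1,
        UserData=build_user_data(),
        # An OS-level shutdown on the master then terminates the instance.
        InstanceInitiatedShutdownBehavior="terminate",
    )
    return response["Instances"][0]["InstanceId"]
```

The Lambda function returns as soon as the instance is requested, so the 30-minute run never counts against the Lambda time limit.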
We then realized that Ansible can terminate the workers once their tasks are complete, but we also wanted to delete the master. So we reworked the Ansible playbook so that the master terminates itself once the workers have been terminated.
We also built a serverless API for querying the results of the data pipeline, using AWS Lambda and API Gateway. All of this had to be built with HIPAA compliance in mind, which means the data must be encrypted both at rest and in transit.
Along the way to building this complete architecture, we hit plenty of gotchas, failures, and pitfalls, as well as some huge wins, all of which taught us very important lessons.
This talk pinpoints those lessons, what we learned from them, and what the audience can take away from them.