The Importance of Developer Experience - Developer roles are evolving!

Introduction

Over the past five years, I have had the pleasure of working within different developer communities across various organizations. These experiences are where I discovered my passion for developer experience and Dev(Sec)Ops. Making life as easy as possible for developers and ensuring the right tools are in place is essential to improving developer retention and business outcomes. Nowadays, this is even more imperative because the role of a developer is no longer just software development; it's becoming key within every area of the SDLC.

The evolution of a developer ...

What do I mean by this? Let's quickly discuss the evolution.

Traditionally, a business analyst would hand a requirement over to a developer. That developer would then carry out the software development of that requirement. The developer would then hand it over to a tester to write unit tests, regression tests, etc. You would repeat that process until all requirements were completed. Then, a security engineer would run a security test, and they would attempt to understand which findings were false positives and which needed to be fixed. Once all vulnerabilities were solved, quality would come in and work with the business/technical analyst to ensure all documentation was complete. Then, you could release!

What's the theme you're seeing in this way of working? It's very waterfall, it's messy, and there are so many handover points that can slow the release.

Now, a more modern approach could be:

A scrum master works with a developer to draft user stories that meet the requirement. The developer carries out the actual software development. Once complete, they write unit tests (maybe even UI tests if it's a GUI, plus some regression tests). Alongside testing, the developer gets SCA and SAST reports of any vulnerabilities, fixing them during development if any arise. Additionally, the developer updates design documentation, READMEs, CHANGELOGs, etc. During this process, the security engineer, quality consultant, and business/systems analyst are ensuring it meets requirements (DevSecOps), meaning just before release there are no problems, and you can release straight away!

So, what's the main difference you see between the two? Straight away, you see the developer (or developers) carrying out far more of the tasks within the SDLC. This doesn't mean the other roles aren't involved or important; it just means there is a shift in who is primarily doing the work. You may still have testers for large applications, but the developer building the feature/bug fix writes the initial tests. Similarly, you may (and should) have a dedicated security engineer on hand to advise on vulnerabilities, but the developer goes in and remediates them. These two examples highlight the shift-left nature (nothing new) of how development is done.

Why? Why is this evolving in this way? Let's discuss:

  • The most important one is that the developer knows the codebase best. I'll give two examples of why this is critical:
    • Let's say you have a centralized testing team that provides unit/regression/UI testing capabilities. The developer does the work, then hands off to a tester to write the test(s). There is a one-to-two day SLA for that testing to be picked up. Then the tester has to get up to speed with the feature/bug that has been written/fixed; that ramp-up largely depends on the size of the change and can take anywhere from an hour to a day. The actual development of the tests likely takes the same amount of time either way, so no added time there. The work then gets PR'd, which needs another review by the developer who wrote the code to ensure it meets the requirement of the bug/feature, again adding to the total SLA. In comparison, if the developer who wrote the code writes the tests as well, there is zero SLA in the handover between testers, zero SLA in getting up to speed with the codebase, and zero SLA in the review, as the tests can go into the main PR into the dev/qa/main branch. Think about this at scale; there is SO much time saved, which equates to quicker value for the business.
    • Another example: security. Traditionally, a security person would look through a report and say, "Hey, you need to fix all critical and high vulnerabilities". The developer would try to pivot the conversation to fixing the vulnerabilities that posed the highest risk to the most critical parts of the codebase. However, the conversation would typically end in all critical and high vulnerabilities needing remediation, if not all of them. Although controversial, this is highly inefficient and generally wastes a considerable amount of time. Why? Just because a vulnerability is flagged as critical, it doesn't mean it's critical to your application. On the other hand, you may have a medium severity finding that directly affects the most important aspect of your application, with the largest surface area. This is why the developer who wrote the code should take accountability for reviewing and remediating vulnerabilities based on their criticality to the application, not their generic severity rating. This leads to quicker release cycles and more secure software, as developers fix the vulnerabilities that count early!
  • The second reason developers are becoming more involved is that more and more aspects of the SDLC are ... well ... becoming code. Think about traditional development; you would build software and hand it off to an infrastructure person to deploy it. Nowadays, the developer is that infrastructure person. The developer writes the code for the feature, then writes the IaC that supports that feature. You are also seeing CI/CD becoming config as code. Like the feature example above, the developer writes the feature, then the IaC, and any additions to the CI/CD landscape. More and more aspects of development are becoming code, which requires a developer (a small sketch of CI as code follows below).
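
To make "config as code" concrete, a CI pipeline nowadays is often just a small file committed next to the feature itself. Here is a minimal sketch, assuming a hypothetical Node.js project built with GitHub Actions (the file would live at .github/workflows/ci.yml):

name: ci
on:
  push:
    branches: [main]
  pull_request:

jobs:
  build-and-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
      - run: npm ci    # install dependencies from the lockfile
      - run: npm test  # run the developer-written unit tests

The developer who writes the feature also owns this file, evolving it in the same PRs as the application code.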

So, the above talks about the evolution of a developer, which paints a picture of why developer experience is so important. If the developer is doing more, you naturally want to maximize velocity to get the best value. Most importantly, though, you want to maximize how happy they are. A happy developer = better retention + productivity. I truly believe the more emphasis you put on developer experience, the more you will get back. So, what can you do to improve the developer experience? Below, I will discuss my three core principles when it comes to developer experience.

Developer First Toolset

Stand up tooling that follows developer-first principles. When you put more on developers and follow a DevSecOps approach, it's critical that the tooling within your toolchain focuses on developers.

More and more tools are starting to put developers at the heart of what they do, which paves the way for increased productivity and a better experience (we discussed why this is important above). Some great examples of tooling that should be developer-first are:

  • Security
  • CI
  • CD
  • Quality
  • Testing

You wouldn't hand a scientist equipment that made their experiments harder to run. You wouldn't give a medic devices that were difficult to use and made their life hard. So why would you hand developers tooling that doesn't make the life of a developer easy?

If you're a company looking to change the tooling to be more developer-focused, ensure you have developer(s) involved in the decision making.

Frictionless Processes

Focus on the process, not just the tools. Having the right tools is essential, but if you make them hard to use or implement them in a way that introduces friction, you won't get the most out of them. What can you do to ensure you have good foundations?

  • Automate: Try not to put any manual requests or SLAs on setup. Use the APIs, webhooks, etc., provided by tools to get set up in an automated fashion (see the workflow sketch after this list).
  • Provide suitable levels of access: Try not to limit access to tools, especially "just in case". Provide people with the autonomy to read, write and administer their solutions. There are times when you have to limit access, especially in large enterprises, but do so for the right reasons, not just for the sake of not knowing.
  • Open APIs: The most frustrating thing for a developer is being restricted in the art of the possible because APIs are disabled, usually because the team managing the tooling worries about what other teams may do with them. As above, there may be good reasons (security, etc.), but unless something is genuinely blocking you, open APIs and empower teams to be creative.
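
As a small, hypothetical example of the automation point, here is a GitHub Actions workflow that uses the GitHub CLI to switch on vulnerability alerts for a repository through the REST API, with no ticket or manual SLA in the loop (the workflow name, input, and secret are assumptions for illustration):

name: onboard-repository
on:
  workflow_dispatch:
    inputs:
      repository:
        description: Repository to configure (owner/name)
        required: true

jobs:
  configure:
    runs-on: ubuntu-latest
    steps:
      # One API call replaces a manual request into the tooling team's queue
      - name: Enable vulnerability alerts via the REST API
        run: gh api --method PUT "repos/${{ inputs.repository }}/vulnerability-alerts"
        env:
          GH_TOKEN: ${{ secrets.ORG_ADMIN_TOKEN }}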

These are just a few examples. The main point to get across is automation. Automation is a great way to remove friction, and it is especially important regarding the DevOps toolchain. Ensure that it's easy to get data between tool A and tool B. Automation and interconnectivity are essential to success, whether that's a testing tool talking to a project management tool or a security tool talking to a CI tool. Ensure you think about your process, not just the tools.

Empowerment & Trust

The last principle focuses on empowerment and trust. This behaviour generally gets overlooked, as it's easier to focus on the tools because you can measure and quantify success more easily. However, there are simple steps you can take to help drive a more open culture among developers:

  • Don't push back on developers for the sake of wanting to share an opinion. I have seen many scenarios where a developer shares a great thought but gets questioned or disputed by someone who doesn't know the area but wants to be in the conversation. Everyone should share ideas, collaborate, and be open, but ensure developers' voices are heard and listened to. This is a simple behaviour to adopt, but it will make a huge difference.
  • I mentioned this above in the process section, but nowadays (especially in larger enterprises), access is restricted and APIs are disabled. I know what you're thinking: what about security? You should never compromise on security, but think about what automated processes (key rotation, automatic access reviews, etc.) you can use to keep a high standard of security while still enabling access and APIs. The more you give to a developer, the more they will feel empowered and trusted, which will boost morale.

Focus on your people, and have trust in your developers.


Coordinating a multi-lambda software product

Introduction

The more I work on AWS, the more I understand the importance of well-architected solutions. Today, I would like to focus on the value of AWS Step Functions. What are Step Functions? The official description is:

AWS Step Functions is a low-code visual workflow service used to orchestrate AWS services, automate business processes, and build serverless applications. Workflows manage failures, retries, parallelization, service integrations, and observability so developers can focus on higher-value business logic.

To explain the value, I am going to use some hypothetical use cases, which are:

  • Use Case One: As a GitHub administrator, whenever a new developer joins the GitHub organisation, I would like to add them to a GitHub Team so that they can access repositories straight away. I also would like to fetch internal company information about that user (email, id, etc.) and add them to an internal DB for querying.
  • Use Case Two: As a GitHub administrator, whenever a GitHub Workflow completes, I would like to calculate the workflow cost and work out the total workflow count for the repository, so it's easy to do chargebacks per repository. I also would like to store the data in a DB so it can be queried historically.

Before we build out some architectures, let's set some principles:

  • Purposeful: Single-function lambdas for single use cases (not combining multiple pieces of business logic into one lambda). This is to promote reuse.
  • Event Driven: No polling or cron-triggered checks to see if things have changed. I would like an end-to-end event-driven architecture.
  • Stateful: No direct invoke chains (callback hell), e.g. Lambda A invoking Lambda B from within Lambda A's code and waiting for Lambda B to finish before returning success/fail.

Now, as with any architecture, there are multiple ways to build out this example system. I will show two examples below and compare & contrast the pros and cons of each, mainly focusing on how to use multiple lambdas together and why AWS Step Functions are beneficial.

Model One - SQS

AWS SQS Arch Design

Let's walk through the above diagram. We have a GitHub App configured on two events (A, B). We use a GitHub App to remove the "human" element of the connection, along with some other goodies like increased API rate limits. The GitHub App sends a payload to our API, but before it reaches the API, we use AWS Route 53 for our custom DNS record, which then proxies down to our AWS CloudFront distribution.

Once the payload reaches the API, we use the direct integration between the AWS HTTP API Gateway and AWS Lambda to process the data. Then, to communicate between the rest of the lambdas, we use AWS SQS to traffic data between them for processing. Finally, data ends up in the database, where you could use a service like AWS AppSync or another API Gateway to fetch it.
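
To make the pattern concrete, here is a minimal sketch of one queue-fronted lambda in this chain, written as an AWS SAM template fragment (the resource names, handler and runtime are hypothetical):

Resources:
  ProcessUserQueue:
    Type: AWS::SQS::Queue
    Properties:
      RedrivePolicy:
        deadLetterTargetArn: !GetAtt ProcessUserDeadLetterQueue.Arn
        maxReceiveCount: 3   # after three failed receives, park the message in the DLQ

  ProcessUserDeadLetterQueue:
    Type: AWS::SQS::Queue

  ProcessUserFunction:
    Type: AWS::Serverless::Function
    Properties:
      Handler: process_user.handler
      Runtime: python3.9
      CodeUri: src/
      Events:
        IncomingMessages:
          Type: SQS          # the queue fronts the function via an event source mapping
          Properties:
            Queue: !GetAtt ProcessUserQueue.Arn
            BatchSize: 10    # SQS delivers at most ten records per batch

Each lambda in the diagram repeats this queue/DLQ/function trio, which is exactly why the queues start to dominate the design.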

Let's talk about the pros:

  • Extensible: An individual SQS queue fronts each function, meaning you're able to quickly send data to that function from any service that can send structured data. Today, you may know about Lambda X needing to send data to Lambda Y; when a new Lambda Z comes along, it's easy to add it into the current architecture and have it send data to Lambda Y without breaking the current pattern.

Let's talk about the cons:

  • Clean Arch: It's a little messy. I am a big believer that most clean architectures are simple. Don't overcomplicate something by adding AWS services just because they could fit a need; look at alternatives to reduce your footprint. There are six SQS queues, and they seem to be the most predominant service in this design. Are they needed?
  • Problem Finding: How easy is it to really find problems? We have a dead letter queue configured so any messages that don't complete can be re-processed accordingly, but you only see one problem at a time; you don't see the history of where that data came from, where it has been, or how it was processed. You would have to write custom code to do this.
  • App Tracking: Amazon SQS requires you to implement application-level tracking, especially if your application uses multiple queues, which, in this case, it does.

Overall, this isn't a bad architecture; it fits the use case. But could it be fine-tuned?

Model Two - Step Functions

AWS Step Functions Arch Design

Both ingress patterns into AWS are the same. The main difference starts when you get past the AWS HTTP API Gateway and into the data processing.

As this solution has multiple lambdas, we use AWS Step Functions to coordinate how they interact. So, when a payload reaches the HTTP API, we trigger the AWS Step Functions state machine. Data is processed by each lambda and sent back to the state machine, which finally inserts the data into the DB and uses a custom AWS SNS topic to send an email on success/error.
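
As a rough sketch (the function, table, and topic names are placeholders), the heart of the state machine definition for use case one could look something like this:

StartAt: Fetch Internal User Details
States:
  Fetch Internal User Details:
    Type: Task
    Resource: arn:aws:states:::lambda:invoke
    Parameters:
      FunctionName: "${FetchUserDetailsFunctionArn}"
      Payload.$: "$"
    OutputPath: "$.Payload"   # unwrap the lambda response for the next state
    Next: Insert User Record
  Insert User Record:
    Type: Task
    Resource: arn:aws:states:::dynamodb:putItem   # native integration, no lambda needed
    Parameters:
      TableName: "${UsersTableName}"
      Item:
        UserId:
          S.$: "$.id"
        Email:
          S.$: "$.email"
    Next: Notify Outcome
  Notify Outcome:
    Type: Task
    Resource: arn:aws:states:::sns:publish
    Parameters:
      TopicArn: "${NotificationTopicArn}"
      Message: "User onboarding workflow completed"
    End: true

In practice, you would also attach a Catch to each Task so failures route to the same topic with an error message.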

Both models have similar architectures but differ in how data is communicated; let's discuss the details ...

Let's talk about the pros:

  • Less Code: We don't have to write a custom lambda to enter data into the DB. Step Functions have a native integration with DynamoDB (the putItem state in the sketch above), meaning we don't have to write code to do something that already exists. More information on integrations can be found here: Using AWS Step Functions with other services.
  • Less AWS Resource(s): No need for any queues. We use the state machine to send data to the following lambda in the chain. More information on how to send data within step functions can be found here: State Machine Data
  • Process Overview: It's easy to see the whole process in action. Step Functions easily allow you to see the data as it is processed. For more information, check out this link: Input and Output Processing in Step Functions
  • Easy to find problems: Don't you dislike having to crawl through CloudWatch events to find errors logged out from a lambda's console? Using AWS Step Functions allows you to quickly find errors via the Step Functions GUI, as you can crawl through the state machine's events to find problems. I find this link really useful for more information on debugging: Monitoring & Logging
  • Built in retries: Sometimes lambdas error and writes into DynamoDB fail. Although these failures are rare, if not handled correctly, they could cause downstream dilemmas. Step Functions have inbuilt retry capabilities that let you retry only on the specific errors you would like to retry on (see the sketch after this list). More information on this can be found here: Monitoring & Logging
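
As a hedged illustration of the retry point (the function name is again a placeholder, and the error names are the documented transient lambda errors), a Task state with a retry policy looks roughly like this:

Process User:
  Type: Task
  Resource: arn:aws:states:::lambda:invoke
  Parameters:
    FunctionName: "${ProcessUserFunctionArn}"
    Payload.$: "$"
  Retry:
    - ErrorEquals:           # only retry transient errors, not bad input
        - Lambda.ServiceException
        - Lambda.TooManyRequestsException
      IntervalSeconds: 2
      MaxAttempts: 3
      BackoffRate: 2         # waits 2s, then 4s, then 8s between attempts
  Next: Insert User Record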

Let's talk about the cons:

  • Con One: Step Functions have some pretty strict and small limits (I actually think this article is a nice summary of the limits: Think Twice Before Using Step Functions — Check the AWS Serverless Service Quotas). If you are processing lots of data, you would need to split your workflow into multiple state machines. One idea for architecting around this limit is a parent/child state machine pattern: a child state machine processes a single data entry at a time, invoked from a parent state machine that loops through the data but doesn't directly do any of the processing, so each execution stays within limits.
  • Con Two: If another system needs to reuse a specific function, there is no queue in front of it, making it harder to call and the architecture less extensible. Yes, you can still use AWS SQS with Step Functions, but unless you think a function is needed outside this use case, it likely isn't required.

Overall, I genuinely believe this architecture is cleaner and runs a more robust process than the previous design.

Going into detail about Step Functions

I would like to focus on two core parts of step functions that stand out to me:

Feature One: Built-in looping through arrays

Let's say you have a data set of 1,000 users. You could send all 1,000 users to the lambda via an SQS queue (but remember, you can only send ten records at a time), loop through all the users, process them accordingly, and send them back to the state machine. Or, you could use the inbuilt Map feature within Step Functions, which maps through the users array at the state machine level and sends one user record at a time to the lambda for processing. Why would you do this? It allows you to write less code within your lambda, with fewer loops and, hopefully, quicker processing. In my opinion, it also makes your code cleaner.

It looks a little like this within the state machine definition file:

Invoke Worker State Machine:
  Type: Map
  InputPath: "$.users"
  MaxConcurrency: 50
  Parameters:
    UserDetails.$: "$$.Map.Item.Value"
  # A Map state also requires an Iterator; the Iterator defined in the
  # Feature Two snippet below slots in here.

Feature Two: AWS Step Functions can call AWS Step Functions

As mentioned above, AWS Step Functions have some pretty strict (and small) limits, meaning you have to architect your solutions accordingly. An elegant aspect of Step Functions is that they can call other step functions. So if you are processing lots of data and reaching limits, you can split your workflow into a parent step function and a child step function, where the parent sends one data record at a time to the child to be processed individually.

It looks a little like this within the state machine definition file:

Iterator:
  StartAt: Invoke Worker State Machine Task
  States:
    Invoke Worker State Machine Task:
      Type: Task
      Resource: arn:aws:states:::states:startExecution.sync:2
      Parameters:
        StateMachineArn: "${ChildStateMachineArn}"
        Input:
          UserDetails.$: "$.UserDetails"
          # Associate the child execution with the parent execution ID
          # taken from the context object
          AWS_STEP_FUNCTIONS_STARTED_BY_EXECUTION_ID.$: "$$.Execution.Id"
      End: true

Taking the above two code snippets together, you are looping through the users array and sending one user at a time to a child execution. Let's say you have 1,000 users; you are spinning up 1,000 child step function executions, each processing one user, with up to 50 running concurrently (per the MaxConcurrency setting above).

These are just two features that I think make Step Functions a great resource when coordinating a multi-lambda solution. However, there are many more; check out the docs here for more information.

Conclusion

I have found Step Functions to be a great resource when working across lambdas. They give you more confidence in your design and allow you to write less code, and in most cases, less code is better code, right?