The Importance of Developer Experience - Developer roles are evolving!

Introduction

Over the past five years, I have had the pleasure of working within different developer communities across various organizations. These experiences are where I discovered my passion for developer experience and Dev(Sec)Ops. Making life as easy as possible for developers and ensuring the right tools are in place is essential to improving developer retention and business outcomes. Nowadays, this is even more imperative because the role of a developer is no longer just software development; it's becoming key within every area of the SDLC.

The evolution of a developer ...

What do I mean by this? Let's quickly discuss the evolution.

Traditionally, a business analyst would hand a requirement over to a developer. That developer would then carry out the software development of that requirement. The developer would then hand it over to a tester to write unit tests, regression tests, etc. You would repeat that process until all requirements were completed. Then, a security engineer would run a security test, and they would attempt to understand which findings were false positives and which needed to be fixed. Once all vulnerabilities were solved, quality would come in and work with the business/technical analyst to ensure all documentation was complete. Then, you could release!

What's the theme you're seeing in this way of working? It's very waterfall, it's messy, and there are so many handover points that can slow the release.

Now, a more modern approach could be:

A scrum master works with a developer to draft user stories that meet the requirement. The developer carries out the actual software development. Once complete, they write unit tests (maybe even UI tests if it's a GUI, plus some regression tests). Alongside testing, the developer gets SCA and SAST reports of any vulnerabilities, fixing them during development if any arise. Additionally, the developer updates design documentation, READMEs, CHANGELOGs, etc. During this process, the security engineer, quality consultant, and business/systems analyst are ensuring it meets requirements (DevSecOps), meaning just before release there are no problems, and you can release straight away!

So, what's the main difference you see between the two? Straight away, you see the developer (or developers) carrying out far more of the tasks within the SDLC. This doesn't mean the other roles aren't involved or important; it just means there is a shift in who is primarily doing the work. You may still have testers for large applications, but the developer building the feature/bug fix writes the initial tests. Similarly, you may (and should) have a dedicated security engineer on hand to advise on vulnerabilities, but the developer goes in and remediates them. These two examples highlight the shift-left nature (nothing new) of how development is done.

Why? Why is this evolving in this way? Let's discuss:

  • The most important one is that the developer knows the codebase best. I'll give two examples of why this is critical:
    • Let's say you have a centralized testing team that provides unit/regression/UI testing capabilities. The developer does the work, then hands off to a tester to write the test(s). There is a one-to-two day SLA for that testing to be picked up. Then the tester has to get up to speed with the feature/bug that has been written/fixed; that ramp-up largely depends on the size of the change and can take anywhere from an hour to a day. The actual development of the tests likely takes the same amount of time either way, so no added time there. The work then gets PR'd, which needs another review by the developer who wrote the code to ensure it meets the requirement of the bug/feature, again adding to the total SLA. In comparison, if the developer who wrote the code writes the tests as well, there is zero SLA in the handover between testers, zero SLA in getting up to speed with the codebase, and zero SLA in the review, as the tests can go into the main PR into the dev/qa/main branch. Think about this at scale; there is SO much time saved, which equates to quicker value for the business.
    • Another example: security. Traditionally, a security person would look through a report and say, "Hey, you need to fix all critical and high vulnerabilities". The developer would try to pivot the conversation to fixing the vulnerabilities that posed the highest risk to the most critical parts of the codebase. However, the conversation would typically end in all critical and high vulnerabilities needing remediation, if not all of them. Although controversial, this is highly inefficient and generally wastes a considerable amount of time. Why? Just because a vulnerability is flagged as critical, it doesn't mean it's critical to your application. On the other hand, you may have a medium severity finding that directly affects the most important aspect of your application, with the largest surface area. This is why the developer who wrote the code should take accountability for reviewing and remediating vulnerabilities based on their criticality to the application, not their generic severity rating. This leads to quicker release cycles and more secure software, as developers fix the vulnerabilities that count early!
  • The second reason developers are becoming more involved is that more and more aspects of the SDLC are ... well ... becoming code. Think about traditional development; you would build software and hand it off to an infrastructure person to deploy it. Nowadays, the developer is that infrastructure person. The developer writes the code for the feature, then writes the IaC that supports that feature. You are also seeing CI/CD becoming config as code. Like the feature example above, the developer writes the feature, then the IaC, and any additions to the CI/CD landscape. More and more aspects of development are becoming code, which requires a developer (a small sketch of CI as code follows below).
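
To make "config as code" concrete, a CI pipeline nowadays is often just a small file committed next to the feature itself. Here is a minimal sketch, assuming a hypothetical Node.js project built with GitHub Actions (the file would live at .github/workflows/ci.yml):

name: ci
on:
  push:
    branches: [main]
  pull_request:

jobs:
  build-and-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
      - run: npm ci    # install dependencies from the lockfile
      - run: npm test  # run the developer-written unit tests

The developer who writes the feature also owns this file, evolving it in the same PRs as the application code.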

So, the above talks about the evolution of a developer, which paints a picture of why developer experience is so important. If the developer is doing more, you naturally want to maximize velocity to get the best value. Most importantly, though, you want to maximize how happy they are. A happy developer = better retention + productivity. I truly believe the more emphasis you put on developer experience, the more you will get back. So, what can you do to improve the developer experience? Below, I will discuss my three core principles when it comes to developer experience.

Developer First Toolset

Stand up tooling that follows developer-first principles. When you put more on developers and follow a DevSecOps approach, it's critical that the tooling within your toolchain focuses on developers.

More and more tools are starting to put developers at the heart of what they do, which paves the way for increased productivity and a better experience (we discussed why this is important above). Some great examples of tooling that should be developer-first are:

  • Security
  • CI
  • CD
  • Quality
  • Testing

You wouldn't hand a scientist equipment that made their experiments harder to run. You wouldn't give a medic devices that were difficult to use and made their life hard. So why would you hand developers tooling that doesn't make the life of a developer easy?

If you're a company looking to change the tooling to be more developer-focused, ensure you have developer(s) involved in the decision making.

Frictionless Processes

Focus on the process, not just the tools. Having the right tools is essential, but if you make them hard to use or implement them in a way that introduces friction, you won't get the most out of them. What can you do to ensure you have good foundations?

  • Automate: Try not to put any manual requests or SLAs on setup. Use the APIs, webhooks, etc., provided by tools to get set up in an automated fashion (see the workflow sketch after this list).
  • Provide suitable levels of access: Try not to limit access to tools, especially "just in case". Provide people with the autonomy to read, write and administer their solutions. There are times when you have to limit access, especially in large enterprises, but do so for the right reasons, not just for the sake of not knowing.
  • Open APIs: The most frustrating thing for a developer is being restricted in the art of the possible because APIs are disabled, usually because the team managing the tooling worries about what other teams may do with them. As above, there may be good reasons (security, etc.), but unless something is genuinely blocking you, open APIs and empower teams to be creative.
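
As a small, hypothetical example of the automation point, here is a GitHub Actions workflow that uses the GitHub CLI to switch on vulnerability alerts for a repository through the REST API, with no ticket or manual SLA in the loop (the workflow name, input, and secret are assumptions for illustration):

name: onboard-repository
on:
  workflow_dispatch:
    inputs:
      repository:
        description: Repository to configure (owner/name)
        required: true

jobs:
  configure:
    runs-on: ubuntu-latest
    steps:
      # One API call replaces a manual request into the tooling team's queue
      - name: Enable vulnerability alerts via the REST API
        run: gh api --method PUT "repos/${{ inputs.repository }}/vulnerability-alerts"
        env:
          GH_TOKEN: ${{ secrets.ORG_ADMIN_TOKEN }}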

These are just a few examples. The main point to get across is automation. Automation is a great way to remove friction, and it is especially important regarding the DevOps toolchain. Ensure that it's easy to get data between tool A and tool B. Automation and interconnectivity are essential to success, whether that's a testing tool talking to a project management tool or a security tool talking to a CI tool. Ensure you think about your process, not just the tools.

Empowerment & Trust

The last principle focuses on empowerment and trust. This behaviour generally gets overlooked, as it's easier to focus on the tools because you can measure and quantify success more easily. However, there are simple steps you can take to help drive a more open culture among developers:

  • Don't push back on developers for the sake of wanting to share an opinion. I have seen many scenarios where a developer shares a great thought but gets questioned or disputed by someone who doesn't know the area but wants to be in the conversation. Everyone should share ideas, collaborate, and be open, but ensure developers' voices are heard and listened to. This is a simple behaviour to adopt, but it will make a huge difference.
  • I mentioned this above in the process section, but nowadays (especially in larger enterprises), access is restricted and APIs are disabled. I know what you're thinking: what about security? You should never compromise on security, but think about what automated processes (key rotation, automatic access reviews, etc.) you can use to keep a high standard of security while still enabling access and APIs. The more you give to a developer, the more they will feel empowered and trusted, which will boost morale.

Focus on your people, and have trust in your developers.


Coordinating a multi-lambda software product

Introduction

The more I work on AWS, the more I understand the importance of well-architected solutions. Today, I would like to focus on the value of AWS Step Functions. What are Step Functions? The official description is:

AWS Step Functions is a low-code visual workflow service used to orchestrate AWS services, automate business processes, and build serverless applications. Workflows manage failures, retries, parallelization, service integrations, and observability so developers can focus on higher-value business logic.

To explain the value, I am going to use some hypothetical use cases, which are:

  • Use Case One: As a GitHub administrator, whenever a new developer joins the GitHub organisation, I would like to add them to a GitHub Team so that they can access repositories straight away. I also would like to fetch internal company information about that user (email, id, etc.) and add them to an internal DB for querying.
  • Use Case Two: As a GitHub administrator, whenever a GitHub Workflow completes, I would like to calculate the workflow cost and work out the total workflow count for the repository, so it's easy to do chargebacks per repository. I also would like to store the data in a DB so it can be queried historically.

Before we build out some architectures, let's set some principles:

  • Purposeful: Single-function lambdas for single use cases (not combining multiple pieces of business logic into one lambda). This is to promote reuse.
  • Event Driven: No polling or cron-triggered checks to see if things have changed. I would like an end-to-end event-driven architecture.
  • Stateful: No direct invoke chains (callback hell), e.g. Lambda A invoking Lambda B from within Lambda A's code and waiting for Lambda B to finish before returning success/fail.

Now, as with any architecture, there are multiple ways to build out this example system. I will show two examples below and compare & contrast the pros and cons of each, mainly focusing on how to use multiple lambdas together and why AWS Step Functions are beneficial.

Model One - SQS

AWS SQS Arch Design

Let's walk through the above diagram. We have a GitHub App configured on two events (A, B). We use a GitHub App to remove the "human" element of the connection, along with some other goodies like increased API rate limits. The GitHub App sends a payload to our API, but before it reaches the API, we use AWS Route 53 for our custom DNS record, which then proxies down to our AWS CloudFront distribution.

Once the payload reaches the API, we use the direct integration between the AWS HTTP API Gateway and AWS Lambda to process the data. Then, to communicate between the rest of the lambdas, we use AWS SQS to traffic data between them for processing. Finally, data ends up in the database, where you could use a service like AWS AppSync or another API Gateway to fetch it.
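
To make the pattern concrete, here is a minimal sketch of one queue-fronted lambda in this chain, written as an AWS SAM template fragment (the resource names, handler and runtime are hypothetical):

Resources:
  ProcessUserQueue:
    Type: AWS::SQS::Queue
    Properties:
      RedrivePolicy:
        deadLetterTargetArn: !GetAtt ProcessUserDeadLetterQueue.Arn
        maxReceiveCount: 3   # after three failed receives, park the message in the DLQ

  ProcessUserDeadLetterQueue:
    Type: AWS::SQS::Queue

  ProcessUserFunction:
    Type: AWS::Serverless::Function
    Properties:
      Handler: process_user.handler
      Runtime: python3.9
      CodeUri: src/
      Events:
        IncomingMessages:
          Type: SQS          # the queue fronts the function via an event source mapping
          Properties:
            Queue: !GetAtt ProcessUserQueue.Arn
            BatchSize: 10    # SQS delivers at most ten records per batch

Each lambda in the diagram repeats this queue/DLQ/function trio, which is exactly why the queues start to dominate the design.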

Let's talk about the pros:

  • Extensible: An individual SQS queue fronts each function, meaning you're able to quickly send data to that function from any service that can send structured data. Today, you may know about Lambda X needing to send data to Lambda Y; when a new Lambda Z comes along, it's easy to add it into the current architecture and have it send data to Lambda Y without breaking the current pattern.

Let's talk about the cons:

  • Clean Arch: It's a little messy. I am a big believer that most clean architectures are simple. Don't overcomplicate something by adding AWS services just because they could fit a need; look at alternatives to reduce your footprint. There are six SQS queues, and they seem to be the most predominant service in this design. Are they needed?
  • Problem Finding: How easy is it to really find problems? We have a dead letter queue configured so any messages that don't complete can be re-processed accordingly, but you only see one problem at a time; you don't see the history of where that data came from, where it has been, or how it was processed. You would have to write custom code to do this.
  • App Tracking: Amazon SQS requires you to implement application-level tracking, especially if your application uses multiple queues, which, in this case, it does.

Overall, this isn't a bad architecture; it fits the use case. But could it be fine-tuned?

Model Two - Step Functions

AWS Step Functions Arch Design

Both ingress patterns into AWS are the same. The main difference starts when you get past the AWS HTTP API Gateway and into the data processing.

As this solution has multiple lambdas, we use AWS Step Functions to coordinate how they interact. So, when a payload reaches the HTTP API, we trigger the AWS Step Functions state machine. Data is processed by each lambda and sent back to the state machine, which finally inserts the data into the DB and uses a custom AWS SNS topic to send an email on success/error.
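
As a rough sketch (the function, table, and topic names are placeholders), the heart of the state machine definition for use case one could look something like this:

StartAt: Fetch Internal User Details
States:
  Fetch Internal User Details:
    Type: Task
    Resource: arn:aws:states:::lambda:invoke
    Parameters:
      FunctionName: "${FetchUserDetailsFunctionArn}"
      Payload.$: "$"
    OutputPath: "$.Payload"   # unwrap the lambda response for the next state
    Next: Insert User Record
  Insert User Record:
    Type: Task
    Resource: arn:aws:states:::dynamodb:putItem   # native integration, no lambda needed
    Parameters:
      TableName: "${UsersTableName}"
      Item:
        UserId:
          S.$: "$.id"
        Email:
          S.$: "$.email"
    Next: Notify Outcome
  Notify Outcome:
    Type: Task
    Resource: arn:aws:states:::sns:publish
    Parameters:
      TopicArn: "${NotificationTopicArn}"
      Message: "User onboarding workflow completed"
    End: true

In practice, you would also attach a Catch to each Task so failures route to the same topic with an error message.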

Both models have similar architectures but differ in how data is communicated; let's discuss the details ...

Let's talk about the pros:

  • Less Code: We don't have to write a custom lambda to enter data into the DB. Step Functions have a native integration with DynamoDB (the putItem state in the sketch above), meaning we don't have to write code to do something that already exists. More information on integrations can be found here: Using AWS Step Functions with other services.
  • Less AWS Resource(s): No need for any queues. We use the state machine to send data to the following lambda in the chain. More information on how to send data within step functions can be found here: State Machine Data
  • Process Overview: It's easy to see the whole process in action. Step Functions easily allow you to see the data as it is processed. For more information, check out this link: Input and Output Processing in Step Functions
  • Easy to find problems: Don't you dislike having to crawl through CloudWatch events to find errors logged out from a lambda's console? Using AWS Step Functions allows you to quickly find errors via the Step Functions GUI, as you can crawl through the state machine's events to find problems. I find this link really useful for more information on debugging: Monitoring & Logging
  • Built in retries: Sometimes lambdas error and writes into DynamoDB fail. Although these failures are rare, if not handled correctly, they could cause downstream dilemmas. Step Functions have inbuilt retry capabilities that let you retry only on the specific errors you would like to retry on (see the sketch after this list). More information on this can be found here: Monitoring & Logging
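
As a hedged illustration of the retry point (the function name is again a placeholder, and the error names are the documented transient lambda errors), a Task state with a retry policy looks roughly like this:

Process User:
  Type: Task
  Resource: arn:aws:states:::lambda:invoke
  Parameters:
    FunctionName: "${ProcessUserFunctionArn}"
    Payload.$: "$"
  Retry:
    - ErrorEquals:           # only retry transient errors, not bad input
        - Lambda.ServiceException
        - Lambda.TooManyRequestsException
      IntervalSeconds: 2
      MaxAttempts: 3
      BackoffRate: 2         # waits 2s, then 4s, then 8s between attempts
  Next: Insert User Record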

Let's talk about the cons:

  • Con One: Step Functions have some pretty strict and small limits (I actually think this article is a nice summary of the limits: Think Twice Before Using Step Functions — Check the AWS Serverless Service Quotas). If you are processing lots of data, you would need to split your workflow into multiple state machines. One idea for architecting around this limit is a parent/child state machine pattern: a child state machine processes a single data entry at a time, invoked from a parent state machine that loops through the data but doesn't directly do any of the processing, so each execution stays within limits.
  • Con Two: If another system needs to reuse a specific function, there is no queue in front of it, making it harder to call and the architecture less extensible. Yes, you can still use AWS SQS with Step Functions, but unless you think a function is needed outside this use case, it likely isn't required.

Overall, I genuinely believe this architecture is cleaner and runs a more robust process than the previous design.

Going into detail about Step Functions

I would like to focus on two core parts of step functions that stand out to me:

Feature One: Built-in looping through arrays

Let's say you have a data set of 1,000 users. You could send all 1,000 users to the lambda via an SQS queue (but remember, you can only send ten records at a time), loop through all the users, process them accordingly, and send them back to the state machine. Or, you could use the inbuilt Map feature within Step Functions, which maps through the users array at the state machine level and sends one user record at a time to the lambda for processing. Why would you do this? It allows you to write less code within your lambda, with fewer loops and, hopefully, quicker processing. In my opinion, it also makes your code cleaner.

It looks a little like this within the state machine definition file:

Invoke Worker State Machine:
  Type: Map
  InputPath: "$.users"
  MaxConcurrency: 50
  Parameters:
    UserDetails.$: "$$.Map.Item.Value"
  # A Map state also requires an Iterator; the Iterator defined in the
  # Feature Two snippet below slots in here.

Feature Two: AWS Step Functions can call AWS Step Functions

As mentioned above, AWS Step Functions have some pretty strict (and small) limits, meaning you have to architect your solutions accordingly. An elegant aspect of Step Functions is that they can call other step functions. So if you are processing lots of data and reaching limits, you can split your workflow into a parent step function and a child step function, where the parent sends one data record at a time to the child to be processed individually.

It looks a little like this within the state machine definition file:

Iterator:
  StartAt: Invoke Worker State Machine Task
  States:
    Invoke Worker State Machine Task:
      Type: Task
      Resource: arn:aws:states:::states:startExecution.sync:2
      Parameters:
        StateMachineArn: "${ChildStateMachineArn}"
        Input:
          UserDetails.$: "$.UserDetails"
          # Associate the child execution with the parent execution ID
          # taken from the context object
          AWS_STEP_FUNCTIONS_STARTED_BY_EXECUTION_ID.$: "$$.Execution.Id"
      End: true

Taking the above two code snippets together, you are looping through the users array and sending one user at a time to a child execution. Let's say you have 1,000 users; you are spinning up 1,000 child step function executions, each processing one user, with up to 50 running concurrently (per the MaxConcurrency setting above).

These are just two features that I think make Step Functions a great resource when coordinating a multi-lambda solution. However, there are many more; check out the docs here for more information.

Conclusion

I have found Step Functions to be a great resource when working across lambdas. They give you more confidence in your design and allow you to write less code, and in most cases, less code is better code, right?