Introduction
AWS Lambda is one of the main pioneers of Functions as a Service (FaaS). This paradigm introduced a significant shift in how we develop applications and reason about them. The promise is simple: you push your code, specify your handler, and the Functions runtime takes care of running it and scaling it on demand!
However, this apparent simplicity hides a learning curve that is not as flat as it seems. While using Lambda, I stumbled into many issues and learned many things that improved my usage over time and that I wish I had known from the start. Here is a short list of some of the most important things that changed how I use AWS Lambda.
Lambda timeouts cause cold starts!
Lambda functions can be configured to time out after a set duration. This is useful to guarantee an upper limit on the execution time of function invocations. For a long time, timeouts seemed harmless to me. In reality, their occurrence is much more problematic than you might think, especially in high-throughput or user-facing apps. Why? Because they come with a high cost, both performance- and observability-wise:
Performance-wise: When a Lambda function reaches a timeout (similarly to an out-of-memory error), Lambda does not only stop the current invocation. It brutally kills the whole execution environment with it! Think of it as a SIGKILL against the entire execution environment. Consequently, the environment becomes non-reusable, and the following invocation needs to spin up a new environment, incurring a cold start and causing a big latency spike in your endpoint. This behavior is documented in the AWS Lambda guide but is not visible enough considering its high impact. You can read about it in the invoke phase section of the Lambda runtime environment page.
Observability-wise: Lambda timeouts are literally like unplugging the execution environment from the power source. The logs will be cut off, and your code won't have the chance to report any metrics, logs, traces, or exceptions that would have been useful to debug the issue. In fact, you won't have much more information in the logs than Task timed out, and you will likely have no clue where exactly your code hung.
For these reasons, Lambda timeouts should never be part of an expected workflow: if they happen, it means you have a bug in your application that you should investigate and fix. Treat them as an emergency exit door and set the threshold high enough that they occur only on extremely rare occasions (ex. a bug that caused your code to hang). Instead, implement custom timeout logic in your code if you want to set a hard limit on client calls. You can use this opportunity to report an informative error message to the user, a custom status code, and some timing metrics showing what took so much time.
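As a minimal sketch, here is what such a custom timeout could look like in a Node.js handler. The callBackend() helper and the 5-second budget are hypothetical; the only Lambda-specific API used is context.getRemainingTimeInMillis():

// Hypothetical downstream call; replace with your own client.
const callBackend = async (event) => { /* ... */ };

exports.handler = async (event, context) => {
  // Keep a 1s safety margin below the configured Lambda timeout.
  const timeoutMs = Math.min(5000, context.getRemainingTimeInMillis() - 1000);

  let timer;
  const timeoutPromise = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error(`Call exceeded ${timeoutMs}ms`)), timeoutMs);
  });

  try {
    // Whichever settles first wins: the real call or our own timeout.
    const result = await Promise.race([callBackend(event), timeoutPromise]);
    return { statusCode: 200, body: JSON.stringify(result) };
  } catch (err) {
    // We still control the process here, so we can log, emit metrics,
    // and return a meaningful status code instead of letting Lambda kill the environment.
    console.error('Call failed or timed out', err);
    return { statusCode: 504, body: 'Upstream call timed out' };
  } finally {
    clearTimeout(timer); // don't leak the timer into the next invocation
  }
};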
Wondering how to report such custom metrics? Nice transition to the next bullet point!
Custom metrics from Lambdas via logs using EMF
AWS introduced a compelling way to report custom metrics from your Lambda functions: the Embedded Metric Format (EMF).
EMF lets you report custom metrics in the form of log statements shaped in a specific way described by the EMF specification. These logs are automatically intercepted, parsed, and transformed by CloudWatch into fully fledged metrics that you can graph and create alarms on. Your Lambda function can emit those logs using ordinary log reporters (ex. console.log for javascript, log4j for java, etc.) and does not need to perform network calls or depend on additional client libraries.
Here is an example of an EMF formatted log statement:
{
  "_aws": {
    "Timestamp": 1574109732004,
    "CloudWatchMetrics": [
      {
        "Namespace": "My Namespace",
        "Dimensions": [["functionVersion"]],
        "Metrics": [{ "Name": "time", "Unit": "Milliseconds" }]
      }
    ]
  },
  "functionVersion": "$LATEST",
  "time": 100,
  "requestId": "989ffbf8-9ace-4817-a57c-e4dd734019ee",
  "sessionId": "1234"
}
The _aws field contains the metric definition with its name, unit, and a list of dimensions. This needs to be present in every EMF log statement. The other fields contain the metric values, dimension values, and extra properties that attach additional context to the metric statement. The additional properties will not be captured by CloudWatch metrics but can be extremely useful when deep diving into the logs.
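For illustration, here is a minimal sketch of how a Node.js handler could emit the statement above with a plain console.log call. The namespace and metric name mirror the example, and durationMs is a hypothetical value measured by your own code:

exports.handler = async (event, context) => {
  const durationMs = 100; // hypothetical value measured in your handler

  // A single structured log line; CloudWatch parses it into a real metric.
  console.log(JSON.stringify({
    _aws: {
      Timestamp: Date.now(),
      CloudWatchMetrics: [{
        Namespace: 'My Namespace',
        Dimensions: [['functionVersion']],
        Metrics: [{ Name: 'time', Unit: 'Milliseconds' }],
      }],
    },
    functionVersion: context.functionVersion,
    time: durationMs,
    requestId: context.awsRequestId, // extra property, only visible in the logs
  }));

  return 'ok';
};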
Optimize your Lambdas' latency by running your init code once
Lambda execution environments are reused for multiple invocations. This is referred to as warm starts. Variables initialized outside your handler function 1 will keep their value on subsequent invocations running on the same environment. This is a perfect place to store database connections, HTTP pools, and other objects that are costly to create or initialize. That way, you won't need to recreate them on every invocation.
So instead of writing something like this (code example in Kotlin):
class Handler: RequestHandler<Request, String> {

    /**
     * This is your handler function 👇
     */
    override fun handleRequest(event: Request, context: Context): String {

        // dbPool initialized inside the handler function
        val dbPool = initializeDatabasePool();

        // Do something with your database pool
        ...

        return response;
    }
}
You could write this instead:
class Handler: RequestHandler<Request, String> {

    // dbPool initialized outside the handler function
    val dbPool = initializeDatabasePool();

    /**
     * This is your handler function 👇
     */
    override fun handleRequest(event: Request, context: Context): String {

        // Do something with your database pool
        ...

        return response;
    }
}
The initializeDatabasePool() call in the second snippet will only execute during the init phase (cold starts), and the result will be kept in the dbPool variable for later invocations that reuse the execution environment.
Background code will continue to run on the next invocations
Background processes, threads, and promises that were initiated by your Lambda function and did not complete when the function ended will resume on subsequent invocations if Lambda reuses the execution environment. This can result in surprising behaviors that are very hard to debug.
The explanation is simple: When a function invocation is done, i.e. your handler function returned, the execution environment is frozen, and any code that was still running is paused. When a new invocation starts, the execution environment unfreezes, and all the incomplete processes resume.
This can happen much more easily than you think! Take a look at this innocent-looking javascript code:
const values = await Promise.all([
  asyncFunction1(), // returns a promise
  asyncFunction2(), // returns a promise
])
Promise.all waits for all the given promises until they all resolve. If any of them fails, Promise.all returns early and stops waiting for the other promises. In the example above, if the asyncFunction1 promise fails, the Promise.all call returns and throws an error without waiting for the completion of asyncFunction2. In the meantime, asyncFunction2 is potentially still running in the background! If it didn't complete before the Lambda function returns, it will continue running during the following invocation, potentially causing a resource leak.
Thus, to keep your lambda logic predictable, ensure you are not leaking any background threads or promises across function calls, as they won’t magically disappear!
One way to fix the example above is to use Promise.allSettled instead of Promise.all. Promise.allSettled waits for all the promises even if some of them fail.
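As a rough sketch, the example above could be rewritten with Promise.allSettled like this (the error-handling policy shown here is just one possible choice, not part of the original example):

// Both promises are awaited even if one of them rejects, so nothing keeps
// running in the background into the next invocation.
const results = await Promise.allSettled([
  asyncFunction1(),
  asyncFunction2(),
])

const failures = results.filter((result) => result.status === 'rejected')
if (failures.length > 0) {
  // Everything has settled by now; surface the errors explicitly.
  throw new Error(`${failures.length} call(s) failed`)
}

const values = results.map((result) => result.value)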
Code size matters
A common mistake is thinking that the language runtime (ex. JVM) and the logic you put in your init code are the only factors that impact your functions’ cold starts. However, another critical factor that has a significant impact is the package size of your function.
During the init phase, Lambda starts by downloading the function’s package which is stored in an internal Amazon S3 bucket (or Amazon Elastic Container Registry if the function uses container packaging).
Thus, the heavier your package is, the longer the download phase will be. It can even reach a point where it outweighs all the other cold start phases.
Package size should therefore be an important consideration in your attempts to optimize your Lambdas' latency. Common techniques to reduce package size include splitting the code, trimming dependencies, and other language-specific techniques like tree shaking in the javascript world. In addition, keeping your package small will keep you further away from the dreaded 250 MB (unzipped) size limit above which Lambda will not let you upload your code anymore. Believe me, the frustration is huge when this happens!
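For instance, assuming a Node.js function and esbuild as the bundler (a hypothetical build script, not something Lambda itself prescribes), bundling with tree shaking ships only the code that is actually imported:

// build.js — run with: node build.js
const esbuild = require('esbuild');

esbuild.build({
  entryPoints: ['src/handler.js'],
  bundle: true,             // inline only the modules that are actually used
  minify: true,
  platform: 'node',
  target: 'node18',
  external: ['@aws-sdk/*'], // provided by the Lambda runtime, no need to ship it
  outfile: 'dist/handler.js',
}).catch(() => process.exit(1));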
1. The handler function is the function that contains the logic that runs on every invocation.