
Tasks

10 Oct, 2023

One problem with running long-running tasks on AWS in a serverless way is Lambda's 15-minute timeout.

An alternative is Fargate, but that requires setting up Docker containers, Elastic Container Registry and a VPC, so it feels a bit less serverless.

If your long-running task is just doing lots of smaller things, you can have a Lambda function that is called repeatedly by Step Functions until it completes. That's what I'll describe here.
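As a rough sketch of that pattern (not the code deployed by the template in this post, which only mocks success and failure), the handler can do one batch per invocation and hand back a cursor plus a done flag. The field names `items`, `cursor`, `batch_size` and `done` are made up for illustration, and the state machine would need something like a Choice state that loops back to the task while `done` is false.

```python
def process_batch(items, cursor, batch_size):
    """Handle one slice of the work and return the next cursor position."""
    end = min(cursor + batch_size, len(items))
    for item in items[cursor:end]:
        pass  # the real per-item work would go here
    return end


def handler(event, context):
    """Do one batch per invocation and report whether any work remains."""
    items = event["items"]
    cursor = event.get("cursor", 0)
    cursor = process_batch(items, cursor, event.get("batch_size", 100))
    # Step Functions would pass this return value back in as the next input.
    return {**event, "cursor": cursor, "done": cursor >= len(items)}
```

Each invocation only needs to stay well under the timeout. The progress lives in the state machine's input and output rather than in the function.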

(Another alternative is to use Lambda Powertools batch processing, which uses SQS behind the scenes.)

To make this work we also want retries and custom error handling. Specifically, we want an AbortError not to be retried, and that is a little tricky. Read this to understand the background:

Now we'll 'mock' the Lambda's behaviour using a value in DynamoDB, so that it either returns successfully, raises an error, or raises an AbortError (the Abort class in the template). Success ends the state machine execution, an error forces retries with backoff, and an AbortError moves to a fallback state which outputs the error and ends.
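To get a feel for what "retries with backoff" means with the values used in the template's Retry block (IntervalSeconds 20, MaxAttempts 4, BackoffRate 1.2), here is a small sketch of the nominal wait before each retry. The function name `retry_delays` is just for illustration, and it assumes no jitter and no maximum delay, neither of which the template sets.

```python
def retry_delays(interval_seconds, max_attempts, backoff_rate):
    """Nominal wait before each retry: the first wait is interval_seconds,
    and every later wait is the previous one times backoff_rate."""
    return [interval_seconds * backoff_rate ** n for n in range(max_attempts)]


delays = retry_delays(20, 4, 1.2)
print([round(d, 2) for d in delays], "total:", round(sum(delays), 2))
```

So a task that keeps raising Error waits roughly 20, 24, 28.8 and 34.56 seconds between attempts, a little under two minutes in total, before the execution fails.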

Deploy the stack by uploading it in the CloudFormation web UI (or any other way). You'll need to choose a name for the stack, such as Tasks, and a name for the DynamoDB table, such as Tasks. This will name the actual table TasksTasks.

AWSTemplateFormatVersion: "2010-09-09"

Description: An example template with an IAM role for a Lambda state machine.

Parameters:
  TableName:
    Description: The name of the table (the stack name will be prepended)
    Type: String
    Default: Tasks
    MinLength: "1"

Resources:
  LambdaExecutionRole:
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Version: "2012-10-17"
        Statement:
          - Effect: Allow
            Principal:
              Service: lambda.amazonaws.com
            Action: sts:AssumeRole

  LambdaFunction:
    Type: AWS::Lambda::Function
    Properties:
      Handler: index.handler
      Role: !GetAtt LambdaExecutionRole.Arn
      Code:
        ZipFile: |
          import boto3
          import os


          dynamodb = boto3.client('dynamodb', region_name=os.environ["AWS_REGION"])


          class Abort(Exception):
              pass

          class Error(Exception):
              pass

          def handler(event, context):
              response = dynamodb.get_item(
                  TableName=f'{os.environ["STACK_NAME"]}{os.environ["TABLE_NAME"]}',
                  Key={
                      'pk': {
                          'S': 'pk1',
                      },
                      'sk': {
                          'S': 'sk1',
                      }
                  }
              )
              # Normalise once so the comparisons below are case-insensitive
              data = response['Item']['data']['S'].lower().strip()
              print('Got data:', data)
              if data == 'abort':
                  raise Abort('This is an Abort error which should get caught and result in transition to the Fallback state!')
              elif data == 'error':
                  raise Error('This is an Error which should cause a retry!')
              else:
                  event['data'] = data
                  return event
      Runtime: python3.11
      Timeout: "25"
      Architectures:
        - arm64
      Environment:
        Variables:
          STACK_NAME: !Sub '${AWS::StackName}'
          TABLE_NAME: !Sub '${TableName}'

  DynamoDbTable:
    Type: AWS::DynamoDB::Table
    Properties:
      TableName: !Sub '${AWS::StackName}${TableName}'
      BillingMode: PAY_PER_REQUEST
      AttributeDefinitions:
        - AttributeName: pk
          AttributeType: S
        - AttributeName: sk
          AttributeType: S
      KeySchema:
        - AttributeName: pk
          KeyType: HASH
        - AttributeName: sk
          KeyType: RANGE
      TimeToLiveSpecification:
        AttributeName: TimeToLive
        Enabled: true

  LambdaDynamoPolicy:
    Type: AWS::IAM::ManagedPolicy
    Properties:
      ManagedPolicyName: !Sub '${AWS::StackName}LambdaDynamoPolicy'
      Description: Managed policy for a Lambda function launched by CloudFormation
      PolicyDocument:
        Version: "2012-10-17"
        Statement:
          - Effect: Allow
            Action:
              - dynamodb:GetItem
              - dynamodb:Query
              - dynamodb:PutItem
              - dynamodb:UpdateItem
            Resource:
              - !GetAtt DynamoDbTable.Arn

      # Define the role here, rather than the managed policy on the role, to avoid a circular dependency
      Roles:
        - !Ref LambdaExecutionRole

  LambdaCloudWatchPolicy:
    Type: AWS::IAM::ManagedPolicy
    Properties:
      ManagedPolicyName: !Sub '${AWS::StackName}LambdaCloudWatchPolicy'
      Description: Managed policy for a Lambda function launched by CloudFormation
      PolicyDocument:
        Version: "2012-10-17"
        Statement:
          - Effect: Allow
            Action:
              - logs:CreateLogStream
            Resource:
              - !Sub "arn:aws:logs:${AWS::Region}:${AWS::AccountId}:log-group:${LambdaLogGroup}:*"
          - Effect: Allow
            Action:
              - logs:PutLogEvents
            Resource:
              - !Sub "arn:aws:logs:${AWS::Region}:${AWS::AccountId}:log-group:${LambdaLogGroup}:*"

      # Define the role here, rather than the managed policy on the role, to avoid a circular dependency
      Roles:
        - !Ref LambdaExecutionRole

  # See https://github.com/aws-cloudformation/cloudformation-coverage-roadmap/issues/147
  # https://typicalrunt.me/2019/09/20/enforcing-least-privilege-when-logging-lambda-functions-to-cloudwatch/
  # WARNING: If the lambda function gets updated, its name will change, so the log group will change, so the old logs will get deleted despite the retention period here.
  LambdaLogGroup:
    Type: AWS::Logs::LogGroup
    Properties:
      LogGroupName: !Sub "/aws/lambda/${LambdaFunction}"
      RetentionInDays: 30

  StatesExecutionRole:
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Version: "2012-10-17"
        Statement:
          - Effect: Allow
            Principal:
              Service:
                - !Sub states.${AWS::Region}.amazonaws.com
            Action: sts:AssumeRole
      Path: /
      Policies:
        - PolicyName: StatesExecutionPolicy
          PolicyDocument:
            Version: "2012-10-17"
            Statement:
              - Effect: Allow
                Action:
                  - lambda:InvokeFunction
                Resource: '*'

  MyStateMachine:
    Type: AWS::StepFunctions::StateMachine
    Properties:
      RoleArn: !GetAtt StatesExecutionRole.Arn
      Definition:
        Comment: A Hello World example using an AWS Lambda function
        StartAt: HelloWorld
        States:
          HelloWorld:
            Type: Task
            Resource: !GetAtt LambdaFunction.Arn
            Retry:
              - ErrorEquals: [Abort]
                MaxAttempts: 0
              - ErrorEquals: [States.ALL]
                IntervalSeconds: 20
                MaxAttempts: 4
                BackoffRate: 1.2
            Catch:
              - ErrorEquals: [Abort]
                Next: Fallback
            End: true
          Fallback:
            Type: Pass
            Parameters:
              Cause.$: States.StringToJson($.Cause)
            End: true

Now have a play:
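One way to do that from Python, as a rough sketch: write the mock value into the pk1/sk1 item the Lambda reads, then start an execution and watch what it does in the Step Functions console. The helper names `set_mock_value`, `start_run` and `play` are hypothetical. The table name assumes you called both the stack and the table Tasks, and the state machine ARN has to be copied from the console because the template has no Outputs section.

```python
import json


def set_mock_value(dynamodb, table_name, value):
    """Write the value the Lambda reads: 'abort', 'error', or anything else for success."""
    dynamodb.put_item(
        TableName=table_name,
        Item={"pk": {"S": "pk1"}, "sk": {"S": "sk1"}, "data": {"S": value}},
    )


def start_run(stepfunctions, state_machine_arn, payload):
    """Start one execution of the state machine and return its ARN."""
    response = stepfunctions.start_execution(
        stateMachineArn=state_machine_arn,
        input=json.dumps(payload),
    )
    return response["executionArn"]


def play(value, state_machine_arn, table_name="TasksTasks"):
    """Set the mock value, then kick off an execution with an empty input."""
    import boto3  # imported here so the helpers above work without AWS access

    set_mock_value(boto3.client("dynamodb"), table_name, value)
    return start_run(boto3.client("stepfunctions"), state_machine_arn, {})
```

Calling `play("error", arn)` should show the task being retried with backoff, `play("abort", arn)` should go straight to the Fallback state, and any other value should succeed with that value in the output under `data`.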

Some useful plugins:

Format CloudFormation:

cfn-format -w tasks.yml

Future?

Copyright James Gardner 1996-2020 All Rights Reserved.