Migrating a Perl CGI to AWS Lambda

Using a custom runtime and event-driven static site generation

August 30, 2021 · 10 min · Jason Lavoie

Table of Contents

Motivation
Why Perl
Design
Public Website
Repository Bucket
IAM
Lambda Function
Custom Layer
S3 Notification Trigger
The Perl Function
Conclusion

Motivation

In migrating our NOC website from a traditional Apache server to a serverless architecture, I’ve needed to update or replace any dynamic components. For example, I replaced a WordPress installation with Hugo, which publishes static content to an S3 bucket served by CloudFront. In this particular case, the component was a CGI script that reads our firewall configurations and presents a web page for visualizing and searching the many object-groups and access-lists. I chose to migrate it to run as a Lambda function.

What makes this notable is that the CGI script was written in Perl. Perl is not one of AWS Lambda’s natively supported languages: Java, Go, PowerShell, Node.js, C#, Python, and Ruby (at the time of this writing). The Perl community has already done some work to leverage the Lambda Runtime API to run Perl, which I’ve used here.

Why Perl

Why not just rewrite the script in a different language? I considered this first, but given the complexity of the script, I thought that adding a simple handler function to wrap it would be a much more expedient and less risky change. Once the migration was complete, I could write a workalike replacement in, say, Python, and drop it in without impacting the rest of the system.

I originally wrote this tool in 2005 (over 16 years ago!) and it has had numerous updates and changes since then to meet evolving needs and to add new features (such as IPv6 support). It relies on parsing the firewall text configuration file into a data structure. To do this, according to a few greps I ran, it has 100+ if/elsif statements and almost as many regular expressions. Some of those are complex multi-line regexes, as this sample code snippet shows:

[...]
} elsif (/^(?:ipv6 )?access-list (\S+)(?: extended)? (deny|permit) (object-group \S+|\S+) (.*)$/) {
  my $int = 'acl';
  my ($name, $action, $proto, $ace) = ($1, $2, $3, $4);
  if ($ace =~ /(any[46]?|host\ \S+|object-group\ \S+|$ipmaskre)\ # src
               ()                                                # sport
               (any[46]?|host\ \S+|object-group\ \S+|$ipmaskre)  # dst
               (?:\ ((?:eq|lt|gt|neq)\ \S+|                      # dport
                     range\ \S+\ \S+|
                     object-group\ \S+|
                     \S+))?
              /x
      or
      $ace =~ /(any[46]?|host\ \S+|object-group\ \S+|$ipmaskre)\ # src
               (?:((?:eq|lt|gt|neq)\ \S+|                        # sport
                   range\ \S+\ \S+|
                   object-group\ \S+)\ )
               (any[46]?|host\ \S+|object-group\ \S+|$ipmaskre)  # dst
               (?:\ ((?:eq|lt|gt|neq)\ \S+|                      # dport
                     range\ \S+\ \S+|
                     object-group\ \S+|
                     \S+))?
              /x) {

    my ($src, $sport, $dst, $dport) = ($1, $2, $3, $4);
    $sport =~ s/^(.+)/port-object $1/ if ($sport !~ /^object-group /);
    $dport =~ s/^(.+)/port-object $1/ if ($dport !~ /^object-group /);
    cidrize($src, $dst);
    push @{$c{$int}{$name}}, [$action,$proto,$src,$sport,$dst,$dport];
[...]
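
To make the parsing approach more concrete, here is a minimal sketch (not the script’s actual parser) that applies one commented /x regex to extended access-list lines and stores the result in a hash of arrays. The parse_acl_line function, the $addr pattern, and the sample lines are all hypothetical, and the regex covers far fewer cases than the real one.

```perl
use strict;
use warnings;

# Simplified address pattern: "any", "host X" or "object-group X" only
my $addr = qr/any[46]?|host\ \S+|object-group\ \S+/;

# Parse one "access-list NAME extended ACTION PROTO SRC DST [eq PORT]" line.
# Returns a hash reference, or nothing when the line does not match.
sub parse_acl_line {
    my ($line) = @_;
    return unless $line =~ /^access-list\ (\S+)\ extended\ (deny|permit)\ (\S+)\  # name, action, proto
                            ($addr)\                                              # src
                            ($addr)                                               # dst
                            (?:\ (eq\ \S+))?                                      # optional dport
                            $/x;
    return {
        name => $1, action => $2, proto => $3,
        src  => $4, dst    => $5, dport => $6 // '',
    };
}

# Group the parsed entries by access-list name
my %acl;
for my $line (
    'access-list outside_in extended permit tcp any host 192.0.2.10 eq 443',
    'access-list outside_in extended deny ip any any',
) {
    my $ace = parse_acl_line($line) or next;
    push @{ $acl{ $ace->{name} } }, $ace;
}

printf "%s: %d entries\n", $_, scalar @{ $acl{$_} } for sort keys %acl;
```

Even in this reduced form, every new keyword needs another alternation or another branch, which is how the real script grew to its current size.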

It will be a nice project, some time in the future, to rewrite this with a couple of decades more programming experience and evolution of the available tools. My current plan is to rewrite it in Python using TextFSM or Genie. I will certainly share it here when I do. For now, the script “just works”™ and I don’t want to break it.

Design

In the current deployment, when the CGI script is called, it reads a local file with the current configuration of the device in question (supplied as a query parameter), parses the configuration, and returns the HTML to the browser. It is a “single-page app” in the sense that all the CSS and JavaScript are included in the response, to allow the user to interactively navigate and search the firewall rules. There is a small web form which sends a POST back to the script to allow the user to choose from a list of devices. The device configurations are on disk, and kept current by a different process.

In order to replicate this user experience, I leveraged some of my previous work where I automatically update a private S3 bucket with the device configurations whenever the corresponding GitHub repository changes. This post expands on that, taking the data from the private bucket and automatically generating the static HTML for the user to access.

In general, a task like this would be a great use for S3 Object Lambda, where the response to an S3 GET request is processed by a custom Lambda function. In this case, though, the generated HTML is always the same for a given input file, so it doesn’t make sense to re-process the data for every request. Instead, I decided to build it so the function is triggered by S3 events (when a new file from a push is written to the bucket) and the resulting HTML is written to a public bucket. In the real-world deployment this is an Okta-authenticated CloudFront distribution, and this architecture lends itself well to caching. For simplicity’s sake this example writes its output to a publicly-available S3 bucket.

Public Website

Create a simple S3 public website to serve the generated file(s). To prevent unintentional data exposure, AWS has made it the default for buckets to be private. The ACL and policy must be explicitly set to allow public access.

data "aws_iam_policy_document" "public-website" {
  statement {
    actions = ["s3:GetObject"]
    principals {
      identifiers = ["*"]
      type        = "*"
    }
    resources = ["${aws_s3_bucket.html.arn}/*"]
  }
}

resource "aws_s3_bucket" "html" {
  bucket = "${var.name}-html"
  acl    = "public-read"
}

resource "aws_s3_bucket_policy" "public" {
  bucket = aws_s3_bucket.html.id
  policy = data.aws_iam_policy_document.public-website.json
}

Repository Bucket

This is the bucket where the source data is stored. The mechanism for keeping this data current isn’t included in this post; in a previous post I documented one approach for automatically populating this bucket whenever a particular set of GitHub repositories is updated.

resource "aws_s3_bucket" "repos" {
  bucket = "${var.name}-repos"
  acl    = "private"
}

IAM

Create an IAM role that the Lambda function can assume that gives it access to read from the private repos bucket and write to the public HTML bucket.

data "aws_iam_policy_document" "fwacl" {
  statement {
    actions   = ["s3:GetObject"]
    resources = ["${aws_s3_bucket.repos.arn}/*"]
    effect    = "Allow"
  }
  statement {
    actions   = ["s3:PutObject"]
    resources = ["${aws_s3_bucket.html.arn}/*"]
    effect    = "Allow"
  }
}

data "aws_iam_policy_document" "assume-lambda" {
  statement {
    actions = ["sts:AssumeRole"]
    principals {
      type        = "Service"
      identifiers = ["lambda.amazonaws.com"]
    }
    effect = "Allow"
  }
}

resource "aws_iam_role" "fwacl" {
  name               = "assume-lambda-fwacl"
  assume_role_policy = data.aws_iam_policy_document.assume-lambda.json
  inline_policy {
    name   = "fwacl"
    policy = data.aws_iam_policy_document.fwacl.json
  }
}

Lambda Function

The Lambda function uses the AWS::Lambda module. This works by using the generic Amazon Linux 2 runtime and specifying the AWS::Lambda layer. The script uses some additional Perl modules, Paws and CGI. Paws is provided by another layer, as referenced in the AWS::Lambda documentation. Since CGI.pm is no longer in the Perl core, we also reference a custom layer (which we will build later) that includes this module.

Configure the function to use the IAM role created earlier, giving it permissions to access the necessary buckets.

An environment variable indicating the name of the output bucket is supplied to the function. The name of the input bucket and the input file(s) will be provided in the function payload by the S3 notification, so they do not need to be defined here.

The handler attribute tells AWS::Lambda where to find the handler routine. With this configuration, it will call the Perl subroutine named handler in the file fwacl.pl in the Lambda zip archive.

data "archive_file" "lambda" {
  type        = "zip"
  output_path = "lambda.zip"
  source_file = "fwacl.pl"
}

resource "aws_lambda_function" "fwacl" {
  filename      = "lambda.zip"
  function_name = "fwacl"
  role          = aws_iam_role.fwacl.arn
  handler       = "fwacl.handler"
  runtime       = "provided.al2"
  layers = [
    "arn:aws:lambda:${var.region}:445285296882:layer:perl-5-34-runtime-al2:2",
    "arn:aws:lambda:${var.region}:445285296882:layer:perl-5-34-paws-al2:2",
    aws_lambda_layer_version.cgipm.arn
  ]
  source_code_hash = data.archive_file.lambda.output_base64sha256
  environment {
    variables = {
      "OUTPUT_BUCKET" = aws_s3_bucket.html.id
    }
  }
}
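
As a rough illustration of how the handler value "fwacl.handler" maps onto the code, a minimal fwacl.pl might look like the sketch below. The handler body is a placeholder rather than the real one, the example-html bucket name is made up, and the trailing 1; follows the convention in the AWS::Lambda examples (my assumption being that the runtime loads the script with require, which needs a true return value).

```perl
# fwacl.pl: "fwacl" is the file name (minus .pl), "handler" is the subroutine
use strict;
use warnings;

sub handler {
    my ($payload, $context) = @_;

    # Placeholder logic: count the records and echo the configured bucket
    my $records = $payload->{Records} // [];
    return {
        output_bucket => $ENV{OUTPUT_BUCKET},
        count         => scalar @$records,
    };
}

# Run a quick local check only when executed directly, not when loaded
unless (caller) {
    local $ENV{OUTPUT_BUCKET} = 'example-html';    # hypothetical bucket name
    my $result = handler({ Records => [ {}, {} ] }, undef);
    print "$result->{output_bucket} $result->{count}\n";
}

1;
```

Running perl fwacl.pl locally exercises the subroutine without the Lambda runtime, which is a quick way to catch syntax errors before building the zip archive.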

Custom Layer

Here the custom layer is defined to provide the Perl CGI module.

resource "aws_lambda_layer_version" "cgipm" {
  filename   = "layer.zip"
  layer_name = "cgipm"

  compatible_runtimes = ["provided.al2"]
  source_code_hash    = filebase64sha256("layer.zip")
}

A bash script is used to build the layer locally so Terraform can upload it to AWS. Any Perl modules listed in the PERL_MODULES variable will be included in this layer.

#!/bin/bash
set -euo pipefail

# Modules to bundle in the layer; add more names to this array as needed
PERL_MODULES=(CGI)
DIR=layer
FILE=layer.zip

mkdir -p "$DIR"
# Install the modules into the layer directory using the AWS::Lambda build image
docker run --rm \
  -v "$(pwd):/var/task" \
  -v "$(pwd)/${DIR}/lib/perl5/site_perl:/opt/lib/perl5/site_perl" \
  shogo82148/p5-aws-lambda:build-5.34.al2 \
  cpanm --notest --no-man-pages "${PERL_MODULES[@]}"
cd "${DIR}" && zip -9 -r "../${FILE}" .

S3 Notification Trigger

The function is triggered whenever specific objects are written to the bucket. In our case, all the device configurations we care about for this script are in the Rancid core group, and are named in the format <location>-<role>fw-<active|standby>. The filter_prefix and filter_suffix are combined to match only the needed data, limiting the trigger to files below the specified directory whose names end with fw-active. S3 must also be given permission to invoke the function.

resource "aws_lambda_permission" "allow_bucket" {
  statement_id  = "AllowExecutionFromS3Bucket"
  action        = "lambda:InvokeFunction"
  function_name = aws_lambda_function.fwacl.arn
  principal     = "s3.amazonaws.com"
  source_arn    = aws_s3_bucket.repos.arn
}

resource "aws_s3_bucket_notification" "bucket_notification" {
  bucket = aws_s3_bucket.repos.id

  lambda_function {
    lambda_function_arn = aws_lambda_function.fwacl.arn
    events              = ["s3:ObjectCreated:*"]
    filter_prefix       = "rancid/core/configs/"
    filter_suffix       = "fw-active"
  }

  depends_on = [aws_lambda_permission.allow_bucket]
}
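
To show which keys this combination selects, here is a small sketch that applies the same prefix and suffix rule to a few object keys. The matches_filter helper and the sample keys are hypothetical; the real matching is done by S3 before the function is ever invoked.

```perl
use strict;
use warnings;

# Hypothetical helper mirroring the notification rule: the key must start
# with the prefix and end with the suffix.
sub matches_filter {
    my ($key, $prefix, $suffix) = @_;
    return $key =~ /\A\Q$prefix\E/ && $key =~ /\Q$suffix\E\z/ ? 1 : 0;
}

my ($prefix, $suffix) = ('rancid/core/configs/', 'fw-active');

for my $key (
    'rancid/core/configs/nyc-edgefw-active',     # made-up device names
    'rancid/core/configs/nyc-edgefw-standby',
    'rancid/lab/configs/lab-testfw-active',
    'rancid/core/configs/nyc-router1',
) {
    printf "%-42s %s\n", $key, matches_filter($key, $prefix, $suffix) ? 'trigger' : 'skip';
}
```

Only the first key would trigger the function: the standby unit fails the suffix test, the lab group fails the prefix test, and the router fails the suffix test.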

The Perl Function

Here I show the handler subroutine. AWS::Lambda calls it with two arguments, a payload and a context object. These are native Perl data structures, decoded from the AWS JSON payload and constructed from environment variables, respectively.

The needed modules are included via the use Paws and use CGI statements. The files from these packages are present on the Lambda’s disk by virtue of the added layers, and can be used normally.

I don’t show all the supporting subroutines, as they already existed in the original Perl script. This handler routine is essentially a wrapper around what was already there: it loops over the provided records (objects in the bucket that have changed), reads the data from S3 (GetObject), processes the data, and writes the output back to the HTML bucket (PutObject). Note that the ContentType attribute must be set so that S3 can provide the correct MIME type to the browser.

For input, it retrieves the needed information from the $payload object to determine the object key and bucket name to process. For this task only the source bucket and filename (object key) are needed, but see the sample payload for the other information that is available about the event and the object.

Sample payload for S3 event

{
  "Records": [
    {
      "eventVersion": "2.0",
      "eventSource": "aws:s3",
      "awsRegion": "us-east-1",
      "eventTime": "1970-01-01T00:00:00.000Z",
      "eventName": "ObjectCreated:Put",
      "userIdentity": {
        "principalId": "EXAMPLE"
      },
      "requestParameters": {
        "sourceIPAddress": "127.0.0.1"
      },
      "responseElements": {
        "x-amz-request-id": "EXAMPLE123456789",
        "x-amz-id-2": "EXAMPLE123/5678abcdefghijkl/mnopqrstuvwxyzABCDEFGH"
      },
      "s3": {
        "s3SchemaVersion": "1.0",
        "configurationId": "testConfigRule",
        "bucket": {
          "name": "example-bucket",
          "ownerIdentity": {
            "principalId": "EXAMPLE"
          },
          "arn": "arn:aws:s3:::example-bucket"
        },
        "object": {
          "key": "test/key",
          "size": 1024,
          "eTag": "0123456789abcdef0123456789abcdef",
          "sequencer": "0A1B2C3D4E5F678901"
        }
      }
    }
  ]
}
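
As a sketch of how those two fields are pulled out of the event, the snippet below decodes a trimmed-down payload with the core JSON::PP module and derives the device name from the key. The record_targets helper and the nyc-edgefw device name are hypothetical; in Lambda the payload arrives already decoded, so the decode_json step is only needed for local experiments.

```perl
use strict;
use warnings;
use JSON::PP qw(decode_json);

# Trimmed-down S3 event with only the fields this task needs
my $event = decode_json(<<'JSON');
{ "Records": [ { "eventName": "ObjectCreated:Put",
    "s3": { "bucket": { "name": "example-bucket" },
            "object": { "key": "rancid/core/configs/nyc-edgefw-active", "size": 1024 } } } ] }
JSON

# Return one [bucket, key] pair per record in the payload
sub record_targets {
    my ($payload) = @_;
    return map { [ $_->{s3}{bucket}{name}, $_->{s3}{object}{key} ] }
           @{ $payload->{Records} // [] };
}

for my $target (record_targets($event)) {
    my ($bucket, $key) = @$target;

    # Strip the directory and the "-active" suffix to get the device name
    (my $dev = $key) =~ s{.*/(.*)-active$}{$1};
    print "bucket=$bucket key=$key device=$dev\n";
}
```

Keys in S3 event notifications are URL-encoded, so a key containing spaces or special characters would need decoding first; the device names used here do not.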

If the script exits with an error, AWS::Lambda will provide the text of the error in the “errorMessage” field of the output. Otherwise, it will serialize the handler routine’s output and return a JSON string to Lambda. For troubleshooting and logging purposes, it is helpful to populate some data about what was processed.

use strict;
use POSIX;
use Paws;
use CGI qw/:standard *table/;

my $DEV;
my @DEVS=get_devices();

sub handler {
  my ($payload, $context) = @_;
  my $s3 = Paws->service('S3', region=>$ENV{'AWS_REGION'});
  my $output->{'output_bucket'} = my $output_bucket = $ENV{'OUTPUT_BUCKET'};

  for my $record (@{$payload->{'Records'}}) {
    # get request info
    my $bucket = $record->{'s3'}->{'bucket'}->{'name'};
    my $key = $record->{'s3'}->{'object'}->{'key'};

    ($DEV = $key) =~ s|.*/(.*)-active$|$1|;
    die "Invalid device $DEV" unless inarray($DEV, @DEVS);
    my $outkey = "${DEV}.html";

    # add some info to the returned data for troubleshooting
    $output->{$DEV}->{'name'} = $DEV;
    $output->{$DEV}->{'path'} = $key;
    $output->{$DEV}->{'file'} = $outkey;
    $output->{'count'}++;

    # retrieve configuration
    my $config = $s3->GetObject(Bucket=>$bucket, Key=>$key)->Body;

    # generate and write html
    $s3->PutObject(
      Bucket=>$output_bucket,
      ContentType=>"text/html",
      Key=>$outkey,
      Body=>generate_html(parse($config))
    );
  }
  return $output;
}

[...]
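
To try both outcomes without deploying, the sketch below wraps a stand-in handler in an eval and builds a response similar in shape to what is described above. It is a local approximation only: the invoke_locally helper, the stand-in handler, and the device names are hypothetical, and the real runtime’s response may carry additional fields.

```perl
use strict;
use warnings;
use JSON::PP;

# Stand-in for the real handler: accepts one known device, dies otherwise
sub handler {
    my ($payload, $context) = @_;
    my %known  = map { $_ => 1 } qw(nyc-edgefw);
    my $output = { count => 0 };

    for my $record (@{ $payload->{Records} // [] }) {
        my $key = $record->{s3}{object}{key};
        (my $dev = $key) =~ s{.*/(.*)-active$}{$1};
        die "Invalid device $dev\n" unless $known{$dev};
        $output->{$dev}{path} = $key;
        $output->{count}++;
    }
    return $output;
}

# Call the handler and turn an exception into an errorMessage structure
sub invoke_locally {
    my ($payload) = @_;
    my $result = eval { handler($payload, undef) };
    if (my $err = $@) {
        chomp $err;
        return { errorMessage => $err };
    }
    return $result;
}

my $json = JSON::PP->new->canonical;
for my $key ('rancid/core/configs/nyc-edgefw-active', 'rancid/core/configs/bogus-active') {
    my $event = { Records => [ { s3 => { object => { key => $key } } } ] };
    print $json->encode(invoke_locally($event)), "\n";
}
```

The first event prints the troubleshooting data returned by the handler, and the second prints a structure containing only the error text.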

Conclusion

This was a surprising amount of work to figure out. I was puzzled by quite a few things, until I finally just read the code for AWS::Lambda’s bootstrap.pm. At one point, I thought that using Paws might be overkill, so I explored a few simpler S3 Perl modules, but did not get any of them working as well.

Using a Perl Lambda is certainly a good tool to have in my toolchest, and I learned a lot building this. If I were doing it again, though, I think I’d more seriously consider spending the time and effort on rewriting it in Python. At this point it feels like that may have been a better investment in the future of the tool. When I eventually do rewrite it, I’ll compare that experience with the time I put in for this, about a week of my spare time, and be able to make a better judgement.
