Content Delivery Monitoring in AWS with CloudWatch

This post describes a way of monitoring a Tridion 9 combined Deployer by sending its health checks to a custom metric in AWS CloudWatch. The same approach can also be used for other Content Delivery services. Once the metric is available in CloudWatch, we can create alarms in case the service errors out or becomes unresponsive.

The overall architecture is as follows:
  • Content Delivery service sends heartbeat (or exposes HTTP endpoint) for monitoring
  • Monitoring Agent checks heartbeat (or HTTP health check) regularly and stores health state
  • AWS Lambda function:
    • runs regularly
    • reads the health state from Monitoring Agent
    • pushes custom metrics into CloudWatch
I am running the Deployer (installation docs) and Monitoring Agent (installation docs) on a t2.medium EC2 instance running CentOS on which I also installed the Systems Manager Agent (SSM Agent) (installation docs).

In my case I have a combined Deployer that I want to monitor. This consists of an Endpoint and a Worker. The Endpoint uses passive monitoring -- the Monitoring Agent accesses the Endpoint URL using HTTP(S) to read the health status. The Worker uses active monitoring -- it sends heartbeats to the Monitoring Agent reporting health status.

Configure Content Delivery Heartbeats

For my Deployer Worker, the monitoring heartbeats are configured in the file deployer-config.xml by adding the following configuration node:

<Monitoring ServiceType="DeployerWorker" Interval="60s" GenerateHeartbeat="true"/>

At the moment of writing, the documentation is a bit buggy -- I noticed the settings above work, although they yield validation exceptions in the logs.

Configure the Monitoring Agent

I'm using the Monitoring Agent to check the health status of the Deployer Endpoint. I'm using the following cd_monitor_conf.xml:

<?xml version="1.0" encoding="UTF-8"?>
<MonitoringAgentConfiguration Version="11.0">

    <StartupPeriod StartupValue="60s"/>

    <HeartbeatMonitoring ListenerPort="20131" EnableRemoteHeartbeats="true">
        <AutomaticServiceRegistration RegistrationFile="RegisteredServices.xml"/>

        <ServiceHealthMonitorBinding Name="HttpServiceHealthMonitor"/>

        <HttpServiceHealthMonitor ServiceType="DeployerEndpoint" PollInterval="60s" TimeoutInterval="30s">
            <Request URL="http://localhost:8084/mappings" RequestData=""/>
            <Response SuccessPattern="httpupload"/>
        </HttpServiceHealthMonitor>
    </HeartbeatMonitoring>

    <WebService ListenerPort="20132"/>
</MonitoringAgentConfiguration>

Notice that I am not using a Monitoring Agent Web Service, because it is not needed. Instead, I am using the netcat (nc) Unix command to retrieve the statuses from the Monitoring Agent:

echo "<StatusRequest/>" | nc localhost 20132

The Monitoring Agent has an in-built simple service that listens on server socket 20132 for incoming connections. If somebody sends the command <StatusRequest/> to this socket, the Monitoring Agent responds with an XML containing statuses for all components it monitors:
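For illustration, the response looks roughly like this (a sketch only -- the element names match what the Lambda function parses, but the root element and exact structure may differ per version):

```xml
<ServiceStatuses>
    <ServiceStatus>
        <ServiceType>Deployer Worker</ServiceType>
        <Status>OK</Status>
    </ServiceStatus>
    <ServiceStatus>
        <ServiceType>DeployerEndpoint</ServiceType>
        <Status>NotResponding</Status>
    </ServiceStatus>
</ServiceStatuses>
```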


The information in this XML is precisely what we want as custom metrics in AWS CloudWatch.

The Monitoring Agent server socket only listens for local connections, so it can't be accessed remotely. This dictates our architecture for retrieving this XML response and pushing it into CloudWatch. Enter the Lambda...

AWS Lambda Function

The function is triggered by a CloudWatch event that fires every so often. In my case, I chose every minute.

The lambda uses the boto3 API in order to:
  1. Run the nc command remotely on the Deployer instance using SSM API and capture its output
  2. Create custom metrics from the XML output using CloudWatch API
The code is written in Python 2.7 and looks like this:

import boto3
import time
from xml.dom.minidom import parseString

statuses = {"OK": 0, "Error": 1, "NotResponding": 2}
ssmClient = boto3.client('ssm')
cwClient = boto3.client('cloudwatch')

def lambda_handler(event, context):
    # Send the nc command to every instance tagged Type=Deployer
    response = ssmClient.send_command(
        Targets = [{'Key': 'tag:Type', 'Values': ['Deployer']}],
        DocumentName = 'AWS-RunShellScript',
        TimeoutSeconds = 30,
        Parameters = { 'commands': ['echo "<StatusRequest/>" | nc localhost 20132'] }
    )

    # Wait until the command has finished on all target instances
    commandId = response['Command']['CommandId']
    status = response['Command']['Status']
    while status == 'Pending' or status == 'InProgress':
        time.sleep(1)
        response = ssmClient.list_commands(CommandId = commandId)
        status = response['Commands'][0]['Status']

    response = ssmClient.list_command_invocations(CommandId = commandId)

    for invocation in response['CommandInvocations']:
        instanceId = invocation['InstanceId']
        instanceName = invocation['InstanceName']
        response = ssmClient.get_command_invocation(CommandId = commandId, InstanceId = instanceId)

        output = response['StandardOutputContent']
        if not output:
            continue

        dom = parseString(output)
        statusArray = dom.getElementsByTagName('ServiceStatus')

        for statusEl in statusArray:
            serviceType = statusEl.getElementsByTagName('ServiceType')[0].firstChild.nodeValue
            metricName = "SDL" + serviceType.replace(" ", "") + "Status"
            statusText = statusEl.getElementsByTagName('Status')[0].firstChild.nodeValue
            statusNumber = statuses[statusText]

            # Push the translated status as a custom metric
            cwClient.put_metric_data(
                Namespace = 'SDL Web',
                MetricData = [{
                    'Dimensions': [{
                        'Name': 'InstanceName',
                        'Value': instanceName
                    }],
                    'MetricName': metricName,
                    'Value': statusNumber,
                    'Unit': 'None'
                }]
            )

    return None

Brief code explanation:
  • Send the nc command to all instances tagged with tag Type equal to Deployer, since I don't feel like keeping track of instance IDs. Currently I have only one instance, but in a production environment the Deployer will be separated into an Endpoint and several Worker instances;
  • Wait until the command has finished executing on all target instances, i.e. its status is no longer Pending or InProgress;
  • Read each CommandInvocation within our command, so that we can retrieve the command output on individual instances;
  • Read the StandardOutputContent from each invocation and parse it into a DOM;
  • For each ServiceStatus node in the XML, translate the Status text into a numeric code (0 = OK, 1 = Error, 2 = NotResponding);
  • Push a custom metric into CloudWatch using the instance name as dimension, the ServiceType as metric name, and the translated Status as metric value.
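The translation performed in the inner loop can be sketched as a small pure function (a hypothetical helper, not part of the Lambda above):

```python
statuses = {"OK": 0, "Error": 1, "NotResponding": 2}

def to_metric(service_type, status_text):
    # Build the metric name by stripping spaces from the service type,
    # then translate the textual status into its numeric code
    metric_name = "SDL" + service_type.replace(" ", "") + "Status"
    return metric_name, statuses[status_text]
```

For example, to_metric("Deployer Worker", "Error") yields ("SDLDeployerWorkerStatus", 1).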

Eventually, when all is working, the metrics become available in CloudWatch under the SDL Web namespace.
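With the metric in place, the alarm mentioned in the introduction can be defined. Below is a sketch of possible alarm parameters (the alarm name and instance name are hypothetical examples); the dictionary would be passed to boto3's put_metric_alarm:

```python
# Sketch of a CloudWatch alarm on the Worker status metric.
# Status codes: 0 = OK, so any datapoint >= 1 means Error or NotResponding.
# AlarmName and the InstanceName dimension value are hypothetical.
alarm_kwargs = {
    'AlarmName': 'DeployerWorker-unhealthy',
    'Namespace': 'SDL Web',
    'MetricName': 'SDLDeployerWorkerStatus',
    'Dimensions': [{'Name': 'InstanceName', 'Value': 'deployer-01'}],
    'Statistic': 'Maximum',
    'Period': 60,
    'EvaluationPeriods': 1,
    'Threshold': 1,
    'ComparisonOperator': 'GreaterThanOrEqualToThreshold',
}
# boto3.client('cloudwatch').put_metric_alarm(**alarm_kwargs)
```

Using Maximum over a 60-second period means the alarm fires as soon as any monitored datapoint reports Error or NotResponding.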