
Content Delivery Monitoring in AWS with CloudWatch

This post describes a way of monitoring a Tridion 9 combined Deployer by publishing its health status as custom metrics in AWS CloudWatch. The same approach can also be used for other Content Delivery services. Once the metrics are available in CloudWatch, we can create alarms that fire when the service errors out or becomes unresponsive.

The overall architecture is as follows:
  • Content Delivery service sends heartbeat (or exposes HTTP endpoint) for monitoring
  • Monitoring Agent checks heartbeat (or HTTP health check) regularly and stores health state
  • AWS Lambda function:
    • runs regularly
    • reads the health state from Monitoring Agent
    • pushes custom metrics into CloudWatch
I am running the Deployer and the Monitoring Agent on a t2.medium EC2 instance running CentOS, on which I also installed the Systems Manager Agent (SSM Agent). See the installation docs of each component for the setup steps.

In my case I have a combined Deployer that I want to monitor. This consists of an Endpoint and a Worker. The Endpoint uses passive monitoring -- the Monitoring Agent accesses the Endpoint URL using HTTP(S) to read the health status. The Worker uses active monitoring -- it sends heartbeats to the Monitoring Agent reporting health status.

Configure Content Delivery Heartbeats

For my Deployer Worker, the monitoring heartbeats are configured in the file deployer-config.xml by adding the following configuration node:

<Monitoring ServiceType="DeployerWorker" Interval="60s" GenerateHeartbeat="true"/>

At the time of writing, the documentation is a bit buggy -- I noticed the settings above work, although they yield validation exceptions in the logs.

Configure the Monitoring Agent

I'm using the Monitoring Agent to check the health status of the Deployer Endpoint. I'm using the following cd_monitor_conf.xml:

<?xml version="1.0" encoding="UTF-8"?>
<MonitoringAgentConfiguration Version="11.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">

    <StartupPeriod StartupValue="60s"/>

    <HeartbeatMonitoring ListenerPort="20131" EnableRemoteHeartbeats="true">
        <AutomaticServiceRegistration RegistrationFile="RegisteredServices.xml"/>
        <Services/>
    </HeartbeatMonitoring>

    <ServiceHealthMonitorBindings>
        <ServiceHealthMonitorBinding Name="HttpServiceHealthMonitor"
            Class="com.tridion.monitor.polling.HTTPHealthMonitor"/>
    </ServiceHealthMonitorBindings>

    <ServiceHealthMonitors>
        <HttpServiceHealthMonitor ServiceType="DeployerEndpoint" PollInterval="60s" TimeoutInterval="30s">
            <Request URL="http://localhost:8084/mappings" RequestData=""/>
            <Response SuccessPattern="httpupload"/>
        </HttpServiceHealthMonitor>
    </ServiceHealthMonitors>

    <WebService ListenerPort="20132"/>
</MonitoringAgentConfiguration>
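
To make the passive check above more concrete, here is a rough Python 3 sketch of what polling a URL and matching a success pattern amounts to. It is not the Monitoring Agent's implementation: the function names are hypothetical, treating SuccessPattern as a regular expression is an assumption, and so is the mapping of failures to the Error and NotResponding statuses.

```python
import re
import urllib.error
import urllib.request


def body_is_healthy(body, success_pattern):
    """Return True when the response body matches the success pattern.

    Assumption: the pattern is treated as a regular expression.
    """
    return re.search(success_pattern, body) is not None


def check_http_health(url, success_pattern, timeout_seconds=30):
    """Poll a URL once and return "OK", "Error" or "NotResponding".

    The status mapping is an assumption made for illustration only.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout_seconds) as resp:
            body = resp.read().decode("utf-8", errors="replace")
    except urllib.error.HTTPError:
        # The service answered, but with an HTTP error status
        return "Error"
    except OSError:
        # Connection refused, DNS failure or timeout
        return "NotResponding"
    return "OK" if body_is_healthy(body, success_pattern) else "Error"
```

With the configuration above, the equivalent call would be check_http_health("http://localhost:8084/mappings", "httpupload").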

Notice that I am not using a Monitoring Agent Web Service, because it is not needed. Instead, I am using the netcat (nc) Unix command to retrieve the statuses from the Monitoring Agent:

echo "<StatusRequest/>" | nc localhost 20132

The Monitoring Agent has a simple built-in service that listens on server socket 20132 for incoming connections. If a client sends the command <StatusRequest/> to this socket, the Monitoring Agent responds with an XML document containing the statuses of all components it monitors:

<StatusResponse>
  <ServiceStatus>
    <ServiceType>DeployerEndpoint</ServiceType>
    <ServiceInstance></ServiceInstance>
    <ProcessId>-1</ProcessId>
    <Status>OK</Status>
    <StatusChangeTime>2018-12-22T15:51:07Z</StatusChangeTime>
    <LastReportTime>2018-12-22T15:50:07Z</LastReportTime>
    <MonitoredThreadCount>-1</MonitoredThreadCount>
  </ServiceStatus>
  <ServiceStatus>
    <ServiceType>DeployerWorker</ServiceType>
    <ServiceInstance>dummy</ServiceInstance>
    <ProcessId>7152</ProcessId>
    <Status>OK</Status>
    <StatusChangeTime>2018-12-22T15:50:10Z</StatusChangeTime>
    <LastReportTime>2018-12-22T17:21:14Z</LastReportTime>
    <MonitoredThreadCount>3</MonitoredThreadCount>
    <NonRespondingThreads></NonRespondingThreads>
  </ServiceStatus>
</StatusResponse>

The information in this XML is precisely what we want as custom metrics in AWS CloudWatch.
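
As a minimal sketch of how this XML maps onto metrics, the Python 3 snippet below parses a StatusResponse and derives a metric name and numeric value per service. The status codes and the "SDL" + ServiceType + "Status" naming mirror the Lambda function shown later in this post; the helper names and the shortened sample document are made up for illustration.

```python
from xml.dom.minidom import parseString

# Same status-to-number mapping as the Lambda function later in this post
STATUS_CODES = {"OK": 0, "Error": 1, "NotResponding": 2}

# Shortened version of the response shown above
SAMPLE = """<StatusResponse>
  <ServiceStatus>
    <ServiceType>DeployerEndpoint</ServiceType>
    <ServiceInstance></ServiceInstance>
    <Status>OK</Status>
  </ServiceStatus>
  <ServiceStatus>
    <ServiceType>DeployerWorker</ServiceType>
    <ServiceInstance>dummy</ServiceInstance>
    <Status>NotResponding</Status>
  </ServiceStatus>
</StatusResponse>"""


def element_text(parent, tag):
    """Return the text of the first <tag> child, or "" if absent or empty."""
    nodes = parent.getElementsByTagName(tag)
    if not nodes or nodes[0].firstChild is None:
        return ""
    return nodes[0].firstChild.data.strip()


def parse_status_response(xml_text):
    """Turn a StatusResponse document into a list of metric dictionaries."""
    metrics = []
    dom = parseString(xml_text)
    for status_el in dom.getElementsByTagName("ServiceStatus"):
        service_type = element_text(status_el, "ServiceType")
        status = element_text(status_el, "Status")
        metrics.append({
            "metric_name": "SDL" + service_type.replace(" ", "") + "Status",
            "status": status,
            # None signals a status text that the mapping does not know
            "value": STATUS_CODES.get(status),
        })
    return metrics
```

Calling parse_status_response(SAMPLE) yields one dictionary per ServiceStatus node; empty elements such as ServiceInstance are handled by element_text instead of raising an error.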

The Monitoring Agent server socket only listens on 127.0.0.1, so it can't be accessed remotely. This dictates how we retrieve the XML response and how we push it into CloudWatch. Enter the lambda...
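
For a quick check on the instance itself, the nc one-liner can be mirrored in Python 3 along these lines. This is a sketch under assumptions: the function names are hypothetical, and I assume the reply is complete once the closing StatusResponse tag arrives or the agent closes the connection.

```python
import socket

REQUEST = b"<StatusRequest/>\n"   # same payload that echo pipes into nc
END_TAG = b"</StatusResponse>"


def exchange(conn):
    """Send the status request over an open socket and return the reply text."""
    conn.sendall(REQUEST)
    received = b""
    while END_TAG not in received:
        chunk = conn.recv(4096)
        if not chunk:  # peer closed the connection
            break
        received += chunk
    return received.decode("utf-8", errors="replace")


def query_monitoring_agent(host="127.0.0.1", port=20132, timeout_seconds=10):
    """Connect to the Monitoring Agent socket and fetch the status XML."""
    with socket.create_connection((host, port), timeout=timeout_seconds) as conn:
        return exchange(conn)
```

Because the socket is bound to the loopback address, this only works when run on the Deployer instance, which is the same constraint the nc command has.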

AWS Lambda Function

The function is triggered by a scheduled CloudWatch event. In my case, I chose to fire it every minute.
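
I set up the trigger by hand, but the same schedule could be scripted roughly as follows. The sketch takes the boto3 clients as arguments (boto3.client('events') and boto3.client('lambda')); the function name, the rule name and the target Id are placeholders of my choosing.

```python
def schedule_every_minute(events_client, lambda_client, function_arn,
                          rule_name="cd-monitoring-every-minute"):
    """Create a one-minute schedule rule and point it at the Lambda function."""
    # Create (or update) the scheduled rule
    rule = events_client.put_rule(
        Name=rule_name,
        ScheduleExpression="rate(1 minute)",
        State="ENABLED",
    )
    # Allow the rule to invoke the function
    lambda_client.add_permission(
        FunctionName=function_arn,
        StatementId=rule_name + "-invoke",
        Action="lambda:InvokeFunction",
        Principal="events.amazonaws.com",
        SourceArn=rule["RuleArn"],
    )
    # Register the function as the target of the rule
    events_client.put_targets(
        Rule=rule_name,
        Targets=[{"Id": "1", "Arn": function_arn}],
    )
    return rule["RuleArn"]
```

Note that add_permission fails if a statement with the same StatementId already exists, so this helper is meant for a first-time setup rather than repeated runs.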

The lambda uses the boto3 API in order to:
  1. Run the nc command remotely on the Deployer instance using SSM API and capture its output
  2. Create custom metrics from the XML output using CloudWatch API
The code is written in Python 2.7 and looks like this:

import boto3
import time
from xml.dom.minidom import parseString

# Numeric metric value for each status text reported by the Monitoring Agent
statuses = {"OK": 0, "Error": 1, "NotResponding": 2}
ssmClient = boto3.client('ssm')
cwClient = boto3.client('cloudwatch')

def lambda_handler(event, context):
    # Run the nc command on every instance tagged Type=Deployer
    response = ssmClient.send_command(
        Targets = [{'Key':'tag:Type','Values':['Deployer']}],
        DocumentName = 'AWS-RunShellScript',
        TimeoutSeconds = 30,
        Parameters = { 'commands': ['echo "<StatusRequest/>" | nc localhost 20132'] }
    )

    # Poll until the command is no longer Pending or InProgress
    commandId = response['Command']['CommandId']
    status = response['Command']['Status']
    while status == 'Pending' or status == 'InProgress':
        time.sleep(2)
        response = ssmClient.list_commands(CommandId = commandId)
        status = response['Commands'][0]['Status']

    # One invocation exists per targeted instance
    response = ssmClient.list_command_invocations(CommandId = commandId)

    for invocation in response['CommandInvocations']:
        instanceId = invocation['InstanceId']
        instanceName = invocation['InstanceName']
        response = ssmClient.get_command_invocation(CommandId = commandId, InstanceId = instanceId)

        # Skip instances where the command produced no output
        output = response['StandardOutputContent']
        if not output:
            continue

        dom = parseString(output)
        statusArray = dom.getElementsByTagName('ServiceStatus')

        # Publish one metric per monitored service, e.g. SDLDeployerWorkerStatus
        for statusEl in statusArray:
            ServiceType = statusEl.getElementsByTagName('ServiceType')[0].firstChild.data
            MetricName = "SDL" + ServiceType.replace(" ", "") + "Status"
            Status = statusEl.getElementsByTagName('Status')[0].firstChild.data
            StatusNumber = statuses[Status]

            cwClient.put_metric_data(
                Namespace = 'SDL Web',
                MetricData = [{
                    'Dimensions': [{
                        'Name': 'InstanceName',
                        'Value': instanceName
                    }],
                    'MetricName': MetricName,
                    'Value': StatusNumber,
                    'Unit': 'None'
                }]
            )

    return None


Brief code explanation:
  • Send the nc command to all instances whose Type tag equals Deployer, since I don't feel like keeping track of instance IDs. Currently I have only one instance, but in a production environment the Deployer will be separated into an Endpoint and several Worker instances;
  • Wait until the command has finished executing on all target instances, i.e. its status is no longer Pending or InProgress;
  • Read each CommandInvocation of the command, so that we can retrieve the command output from the individual instances;
  • Read the StandardOutputContent from each invocation and parse it into a DOM;
  • For each ServiceStatus node in the XML, translate the Status text into a code (0 = OK, 1 = Error, 2 = NotResponding);
  • Push a custom metric into CloudWatch using the instance name as dimension (e.g. dev-deployer.mitza.net), "SDL" + ServiceType + "Status" as metric name, and the translated status as metric value.
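
Since the point of these metrics is alarming, here is a hedged sketch of how an alarm on one of them could be defined. The helper name is hypothetical, and the statistic, the number of evaluation periods and the choice to treat missing data as breaching are my assumptions, not settings taken from a running setup. The namespace, dimension and metric naming match the Lambda code above.

```python
def build_status_alarm(service_type, instance_name, alarm_actions=()):
    """Build keyword arguments for CloudWatch put_metric_alarm."""
    metric_name = "SDL" + service_type.replace(" ", "") + "Status"
    return {
        "AlarmName": "%s-%s" % (instance_name, metric_name),
        "Namespace": "SDL Web",
        "MetricName": metric_name,
        "Dimensions": [{"Name": "InstanceName", "Value": instance_name}],
        # Maximum over the period: any Error (1) or NotResponding (2) counts
        "Statistic": "Maximum",
        "Period": 60,
        # Assumption: require three bad minutes in a row before alarming
        "EvaluationPeriods": 3,
        "Threshold": 0,
        "ComparisonOperator": "GreaterThanThreshold",
        # Assumption: no data at all means the monitoring chain is broken
        "TreatMissingData": "breaching",
        "AlarmActions": list(alarm_actions),
    }
```

The result can be passed on as boto3.client('cloudwatch').put_metric_alarm(**build_status_alarm('DeployerWorker', 'dev-deployer.mitza.net')), optionally with SNS topic ARNs in alarm_actions.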

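Eventually, when all is working, the custom metrics show up in CloudWatch under the SDL Web namespace, one per service type (SDLDeployerEndpointStatus and SDLDeployerWorkerStatus in my case), with InstanceName as dimension.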






