Content Delivery Monitoring in AWS with CloudWatch

This post describes a way of monitoring a Tridion 9 combined Deployer by pushing its health checks into a custom metric in AWS CloudWatch. The same approach can also be used for other Content Delivery services. Once the metric is available in CloudWatch, we can create alarms in case the service errors out or becomes unresponsive.

The overall architecture is as follows:
  • Content Delivery service sends heartbeat (or exposes HTTP endpoint) for monitoring
  • Monitoring Agent checks heartbeat (or HTTP health check) regularly and stores health state
  • AWS lambda function:
    • runs regularly
    • reads the health state from Monitoring Agent
    • pushes custom metrics into CloudWatch
I am running the Deployer (installation docs) and the Monitoring Agent (installation docs) on a t2.medium EC2 instance running CentOS, on which I also installed the Systems Manager Agent (SSM Agent) (installation docs).

In my case I have a combined Deployer that I want to monitor. This consists of an Endpoint and a Worker. The Endpoint uses passive monitoring -- the Monitoring Agent accesses the Endpoint URL using HTTP(S) to read the health status. The Worker uses active monitoring -- it sends heartbeats to the Monitoring Agent reporting health status.

Configure Content Delivery Heartbeats

For my Deployer Worker, the monitoring heartbeats are configured in the file deployer-config.xml by adding the following configuration node:

<Monitoring ServiceType="DeployerWorker" Interval="60s" GenerateHeartbeat="true"/>

At the moment of writing this, the documentation is a bit buggy -- I noticed the settings above work, although they yield validation exceptions in the logs.

Configure the Monitoring Agent

I'm using the Monitoring Agent to check the health status of the Deployer Endpoint, with the following cd_monitor_conf.xml:

<?xml version="1.0" encoding="UTF-8"?>
<MonitoringAgentConfiguration Version="11.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">

    <StartupPeriod StartupValue="60s"/>

    <HeartbeatMonitoring ListenerPort="20131" EnableRemoteHeartbeats="true">
        <AutomaticServiceRegistration RegistrationFile="RegisteredServices.xml"/>
        <Services/>
    </HeartbeatMonitoring>

    <ServiceHealthMonitorBindings>
        <ServiceHealthMonitorBinding Name="HttpServiceHealthMonitor"
            Class="com.tridion.monitor.polling.HTTPHealthMonitor"/>
    </ServiceHealthMonitorBindings>

    <ServiceHealthMonitors>
        <HttpServiceHealthMonitor ServiceType="DeployerEndpoint" PollInterval="60s" TimeoutInterval="30s">
            <Request URL="http://localhost:8084/mappings" RequestData=""/>
            <Response SuccessPattern="httpupload"/>
        </HttpServiceHealthMonitor>
    </ServiceHealthMonitors>

    <WebService ListenerPort="20132"/>
</MonitoringAgentConfiguration>

Notice that I am not using a Monitoring Agent Web Service, because it is not needed. Instead, I am using the netcat (nc) Unix command to retrieve the statuses from the Monitoring Agent:

echo "<StatusRequest/>" | nc localhost 20132

The Monitoring Agent has a simple built-in service that listens on server port 20132 for incoming connections. If a client sends the command <StatusRequest/> to this socket, the Monitoring Agent responds with an XML document containing the statuses of all the components it monitors:

<StatusResponse>
  <ServiceStatus>
    <ServiceType>DeployerEndpoint</ServiceType>
    <ServiceInstance></ServiceInstance>
    <ProcessId>-1</ProcessId>
    <Status>OK</Status>
    <StatusChangeTime>2018-12-22T15:51:07Z</StatusChangeTime>
    <LastReportTime>2018-12-22T15:50:07Z</LastReportTime>
    <MonitoredThreadCount>-1</MonitoredThreadCount>
  </ServiceStatus>
  <ServiceStatus>
    <ServiceType>DeployerWorker</ServiceType>
    <ServiceInstance>dummy</ServiceInstance>
    <ProcessId>7152</ProcessId>
    <Status>OK</Status>
    <StatusChangeTime>2018-12-22T15:50:10Z</StatusChangeTime>
    <LastReportTime>2018-12-22T17:21:14Z</LastReportTime>
    <MonitoredThreadCount>3</MonitoredThreadCount>
    <NonRespondingThreads></NonRespondingThreads>
  </ServiceStatus>
</StatusResponse>

The information in this XML is precisely what we want as custom metrics in AWS CloudWatch.

The Monitoring Agent server socket only listens for connections on 127.0.0.1, so it can't be accessed remotely. This dictates how we retrieve the XML response and push it into CloudWatch. Enter the lambda...

AWS Lambda Function

The function is triggered by a scheduled CloudWatch Events rule that fires at a regular interval. In my case, I chose every minute.
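The scheduled rule can be created from the CloudWatch console or programmatically. Below is a minimal boto3 sketch of the latter, under the assumption that the function is already deployed; the function, rule, and target names (cd-monitoring-lambda, cd-monitoring-schedule) are placeholders:

import boto3

eventsClient = boto3.client('events')
lambdaClient = boto3.client('lambda')

# Placeholder names -- replace with your own function and rule names
functionName = 'cd-monitoring-lambda'
ruleName = 'cd-monitoring-schedule'

# Scheduled CloudWatch Events rule that fires every minute
rule = eventsClient.put_rule(Name = ruleName, ScheduleExpression = 'rate(1 minute)')

# Allow CloudWatch Events to invoke the function, then attach the function as a target
functionArn = lambdaClient.get_function(FunctionName = functionName)['Configuration']['FunctionArn']
lambdaClient.add_permission(
    FunctionName = functionName,
    StatementId = 'cd-monitoring-schedule-invoke',
    Action = 'lambda:InvokeFunction',
    Principal = 'events.amazonaws.com',
    SourceArn = rule['RuleArn']
)
eventsClient.put_targets(
    Rule = ruleName,
    Targets = [{'Id': 'cd-monitoring-lambda-target', 'Arn': functionArn}]
)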

The lambda uses the boto3 API in order to:
  1. Run the nc command remotely on the Deployer instance using SSM API and capture its output
  2. Create custom metrics from the XML output using CloudWatch API
The code is written in Python 2.7 and looks like this:

import boto3
import time
from xml.dom.minidom import parseString

statuses = {"OK": 0, "Error": 1, "NotResponding": 2}
ssmClient = boto3.client('ssm')
cwClient = boto3.client('cloudwatch')

def lambda_handler(event, context):
    # Run the status command remotely on all instances tagged Type=Deployer via SSM Run Command
    response = ssmClient.send_command(
        Targets = [{'Key':'tag:Type','Values':['Deployer']}],
        DocumentName = 'AWS-RunShellScript',
        TimeoutSeconds = 30,
        Parameters = { 'commands': ['echo "<StatusRequest/>" | nc localhost 20132'] }
    )

    commandId = response['Command']['CommandId']
    status = response['Command']['Status']
    # Wait until the SSM command has finished on all target instances
    while status == 'Pending' or status == 'InProgress':
        time.sleep(2)
        response = ssmClient.list_commands(CommandId = commandId)
        status = response['Commands'][0]['Status']

    # List the per-instance invocations so we can read each instance's output
    response = ssmClient.list_command_invocations(CommandId = commandId)

    for invocation in response['CommandInvocations']:
        instanceId = invocation['InstanceId']
        instanceName = invocation['InstanceName']
        response = ssmClient.get_command_invocation(CommandId = commandId, InstanceId = instanceId)

        output = response['StandardOutputContent']
        if not output:
            continue

        dom = parseString(output)
        statusArray = dom.getElementsByTagName('ServiceStatus')

        # Translate each ServiceStatus element into a CloudWatch custom metric
        for statusEl in statusArray:
            ServiceType = statusEl.getElementsByTagName('ServiceType')[0].firstChild.data
            MetricName = "SDL" + ServiceType.replace(" ", "") + "Status"
            Status = statusEl.getElementsByTagName('Status')[0].firstChild.data
            StatusNumber = statuses[Status]

            cwClient.put_metric_data(
                Namespace = 'SDL Web',
                MetricData = [{
                    'Dimensions': [{
                        'Name': 'InstanceName',
                        'Value': instanceName
                    }],
                    'MetricName': MetricName,
                    'Value': StatusNumber,
                    'Unit': 'None'
                }]
            )

    return None


Brief code explanation:
  • Send the nc command to all instances tagged with Type = Deployer, since I don't feel like keeping track of instance IDs. Currently I have only one instance, but in a production environment the Deployer would be split into an Endpoint and several Worker instances;
  • Wait until the command has finished executing on all target instances and its status is no longer Pending or InProgress;
  • Read each CommandInvocation within the command, so that we can retrieve the command output on individual instances;
  • Read the StandardOutputContent from each invocation and parse it into a DOM;
  • For each ServiceStatus node in the XML, translate the Status text into a numeric code (0 = OK, 1 = Error, 2 = NotResponding);
  • Push a custom metric into CloudWatch using the instance name as dimension (e.g. dev-deployer.mitza.net), the ServiceType as metric name, and the translated Status as metric value.
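For a quick smoke test outside Lambda, the handler can also be invoked directly from the bottom of the same script, assuming AWS credentials and a default region are configured locally and at least one running instance is tagged with Type = Deployer:

if __name__ == '__main__':
    # Local smoke test -- prints nothing on success; check the 'SDL Web'
    # namespace in CloudWatch for the freshly pushed datapoints
    lambda_handler({}, None)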

Eventually, when all is working, the custom metrics show up in CloudWatch under the 'SDL Web' namespace -- one metric per ServiceType, with the instance name as a dimension.
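From here, CloudWatch alarms can be defined on these metrics, as mentioned at the beginning of the post. Below is a minimal boto3 sketch for the Deployer Worker metric; the alarm name is a placeholder and the InstanceName dimension must match the one pushed by the lambda (e.g. dev-deployer.mitza.net):

import boto3

cwClient = boto3.client('cloudwatch')

# Alarm on the Deployer Worker status metric pushed by the lambda.
# Any value above 0 (i.e. Error or NotResponding) for two consecutive
# one-minute periods puts the alarm into ALARM state; missing data
# (e.g. the instance is down) is also treated as breaching.
cwClient.put_metric_alarm(
    AlarmName = 'DeployerWorker-Unhealthy',
    Namespace = 'SDL Web',
    MetricName = 'SDLDeployerWorkerStatus',
    Dimensions = [{'Name': 'InstanceName', 'Value': 'dev-deployer.mitza.net'}],
    Statistic = 'Maximum',
    Period = 60,
    EvaluationPeriods = 2,
    Threshold = 0,
    ComparisonOperator = 'GreaterThanThreshold',
    TreatMissingData = 'breaching'
)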