This post describes a way of monitoring a Tridion 9 combined Deployer by sending the health checks into a custom metric in CloudWatch in AWS. The same approach can also be used for other Content Delivery services. Once the metric is available in CloudWatch, we can create alarms in case the service errors out or becomes unresponsive.
The overall architecture is as follows:
- Content Delivery service sends heartbeat (or exposes HTTP endpoint) for monitoring
- Monitoring Agent checks heartbeat (or HTTP health check) regularly and stores health state
- AWS Lambda function:
  - runs regularly
  - reads the health state from Monitoring Agent
  - pushes custom metrics into CloudWatch
In my case I have a combined Deployer that I want to monitor. This consists of an Endpoint and a Worker. The Endpoint uses passive monitoring -- the Monitoring Agent accesses the Endpoint URL using HTTP(S) to read the health status. The Worker uses active monitoring -- it sends heartbeats to the Monitoring Agent reporting health status.
Configure Content Delivery Heartbeats
For my Deployer Worker, the monitoring heartbeats are configured in the file deployer-config.xml by adding the following configuration node:
<Monitoring ServiceType="DeployerWorker" Interval="60s" GenerateHeartbeat="true"/>
At the moment of writing this, the documentation is a bit buggy -- I noticed the settings above work, although they yield validation exceptions in the logs.
Configure the Monitoring Agent
I'm using the Monitoring Agent to check the health status of the Deployer Endpoint. I'm using the following cd_monitor_conf.xml:
<?xml version="1.0" encoding="UTF-8"?>
<MonitoringAgentConfiguration Version="11.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
    <StartupPeriod StartupValue="60s"/>
    <HeartbeatMonitoring ListenerPort="20131" EnableRemoteHeartbeats="true">
        <AutomaticServiceRegistration RegistrationFile="RegisteredServices.xml"/>
        <Services/>
    </HeartbeatMonitoring>
    <ServiceHealthMonitorBindings>
        <ServiceHealthMonitorBinding Name="HttpServiceHealthMonitor" Class="com.tridion.monitor.polling.HTTPHealthMonitor"/>
    </ServiceHealthMonitorBindings>
    <ServiceHealthMonitors>
        <HttpServiceHealthMonitor ServiceType="DeployerEndpoint" PollInterval="60s" TimeoutInterval="30s">
            <Request URL="http://localhost:8084/mappings" RequestData=""/>
            <Response SuccessPattern="httpupload"/>
        </HttpServiceHealthMonitor>
    </ServiceHealthMonitors>
    <WebService ListenerPort="20132"/>
</MonitoringAgentConfiguration>
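To sanity-check the Request URL and SuccessPattern before handing them to the Monitoring Agent, a small script can mimic the poll. This is only a rough sketch in Python 3: it assumes the pattern is searched for anywhere in the response body and treats it as a regular expression, which may not match exactly how HTTPHealthMonitor evaluates it. The function names are made up for illustration.

```python
import re
import urllib.request


def response_matches(body, success_pattern):
    """Return True if the pattern occurs anywhere in the response body.

    Assumption: the pattern is treated as a regular expression.
    """
    return re.search(success_pattern, body) is not None


def check_endpoint(url, success_pattern, timeout=30.0):
    """Fetch the URL and report whether the body matches the pattern.

    Any connection problem, HTTP error status or timeout counts as a failed check.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            body = resp.read().decode("utf-8", errors="replace")
    except OSError:  # URLError, HTTPError and socket timeouts all derive from OSError
        return False
    return response_matches(body, success_pattern)


if __name__ == "__main__":
    # Values taken from the cd_monitor_conf.xml above
    print(check_endpoint("http://localhost:8084/mappings", "httpupload"))
```

If this prints False on the Deployer host while the service is up, the URL or the pattern is probably the first thing to look at.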
Notice that I am not using a Monitoring Agent Web Service, because it is not needed. Instead, I am using the netcat (nc) Unix command to retrieve the statuses from the Monitoring Agent:
echo "<StatusRequest/>" | nc localhost 20132
The Monitoring Agent has an in-built simple service that listens on server socket 20132 for incoming connections. If somebody sends the command <StatusRequest/> to this socket, the Monitoring Agent responds with an XML containing statuses for all components it monitors:
<StatusResponse>
    <ServiceStatus>
        <ServiceType>DeployerEndpoint</ServiceType>
        <ServiceInstance></ServiceInstance>
        <ProcessId>-1</ProcessId>
        <Status>OK</Status>
        <StatusChangeTime>2018-12-22T15:51:07Z</StatusChangeTime>
        <LastReportTime>2018-12-22T15:50:07Z</LastReportTime>
        <MonitoredThreadCount>-1</MonitoredThreadCount>
    </ServiceStatus>
    <ServiceStatus>
        <ServiceType>DeployerWorker</ServiceType>
        <ServiceInstance>dummy</ServiceInstance>
        <ProcessId>7152</ProcessId>
        <Status>OK</Status>
        <StatusChangeTime>2018-12-22T15:50:10Z</StatusChangeTime>
        <LastReportTime>2018-12-22T17:21:14Z</LastReportTime>
        <MonitoredThreadCount>3</MonitoredThreadCount>
        <NonRespondingThreads></NonRespondingThreads>
    </ServiceStatus>
</StatusResponse>
The information in this XML is precisely what we want as custom metrics in AWS CloudWatch.
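As an illustration of how that response maps onto metric values, here is a minimal Python 3 sketch that reduces a StatusResponse to one number per service type. The status names and codes are the ones the Lambda function in this post uses; the UNKNOWN_STATUS fallback and the function name are my own additions for the example, and the sample is a trimmed copy of the response above.

```python
import xml.etree.ElementTree as ET

# Same mapping the Lambda function in this post uses
STATUS_CODES = {"OK": 0, "Error": 1, "NotResponding": 2}
# Hypothetical fallback for a status text that is not in the mapping
UNKNOWN_STATUS = 3


def parse_status_response(xml_text):
    """Return {ServiceType: numeric status} for every ServiceStatus node."""
    root = ET.fromstring(xml_text)
    result = {}
    for node in root.iter("ServiceStatus"):
        service_type = (node.findtext("ServiceType") or "").strip()
        status = (node.findtext("Status") or "").strip()
        result[service_type] = STATUS_CODES.get(status, UNKNOWN_STATUS)
    return result


SAMPLE = """
<StatusResponse>
  <ServiceStatus>
    <ServiceType>DeployerEndpoint</ServiceType>
    <Status>OK</Status>
  </ServiceStatus>
  <ServiceStatus>
    <ServiceType>DeployerWorker</ServiceType>
    <Status>OK</Status>
  </ServiceStatus>
</StatusResponse>
"""

if __name__ == "__main__":
    print(parse_status_response(SAMPLE))
    # prints {'DeployerEndpoint': 0, 'DeployerWorker': 0}
```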
The Monitoring Agent server socket only listens for connections to 127.0.0.1, so it can't be accessed remotely. This dictates our architecture on how to retrieve this XML response and how to push it into CloudWatch. Enter the lambda...
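Because the socket is bound to the loopback address, whatever asks for the status has to run on the Deployer host itself. If nc is not installed there, the same exchange can probably be done with a few lines of Python 3. This is a sketch under assumptions: I assume the agent either closes the connection after answering or that the closing </StatusResponse> tag marks the end of the reply, and the function name is invented for the example.

```python
import socket


def request_status(host="127.0.0.1", port=20132, timeout=5.0):
    """Send <StatusRequest/> to the Monitoring Agent socket and return the reply."""
    received = b""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        # Same payload that `echo "<StatusRequest/>"` pipes into nc
        sock.sendall(b"<StatusRequest/>\n")
        while True:
            try:
                data = sock.recv(4096)
            except socket.timeout:
                break  # nothing more arrived within the timeout
            if not data:
                break  # the agent closed the connection
            received += data
            if b"</StatusResponse>" in received:
                break  # assumed end-of-message marker
    return received.decode("utf-8", errors="replace")


if __name__ == "__main__":
    print(request_status())
```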
AWS Lambda Function
The function is triggered by a CloudWatch event that fires on a schedule. In my case, I chose every minute. The lambda uses the boto3 API in order to:
- Run the nc command remotely on the Deployer instance using SSM API and capture its output
- Create custom metrics from the XML output using CloudWatch API
The code is written in Python 2.7 and looks like this:
import boto3
import time
from xml.dom.minidom import parseString

# Numeric codes published to CloudWatch for each Monitoring Agent status
statuses = {"OK": 0, "Error": 1, "NotResponding": 2}

ssmClient = boto3.client('ssm')
cwClient = boto3.client('cloudwatch')

def lambda_handler(event, context):
    # Run the status request on every instance tagged Type=Deployer
    response = ssmClient.send_command(
        Targets = [{'Key': 'tag:Type', 'Values': ['Deployer']}],
        DocumentName = 'AWS-RunShellScript',
        TimeoutSeconds = 30,
        Parameters = {
            'commands': ['echo "<StatusRequest/>" | nc localhost 20132']
        }
    )
    commandId = response['Command']['CommandId']
    status = response['Command']['Status']

    # Poll until the command is no longer Pending or InProgress
    while status == 'Pending' or status == 'InProgress':
        time.sleep(2)
        response = ssmClient.list_commands(CommandId = commandId)
        status = response['Commands'][0]['Status']

    # One invocation per target instance
    response = ssmClient.list_command_invocations(CommandId = commandId)
    for invocation in response['CommandInvocations']:
        instanceId = invocation['InstanceId']
        instanceName = invocation['InstanceName']
        response = ssmClient.get_command_invocation(CommandId = commandId, InstanceId = instanceId)
        output = response['StandardOutputContent']
        if not output:
            continue

        # Publish one metric per ServiceStatus node in the XML output
        dom = parseString(output)
        statusArray = dom.getElementsByTagName('ServiceStatus')
        for statusEl in statusArray:
            ServiceType = statusEl.getElementsByTagName('ServiceType')[0].firstChild.data
            MetricName = "SDL" + ServiceType.replace(" ", "") + "Status"
            Status = statusEl.getElementsByTagName('Status')[0].firstChild.data
            StatusNumber = statuses[Status]
            cwClient.put_metric_data(
                Namespace = 'SDL Web',
                MetricData = [{
                    'Dimensions': [{
                        'Name': 'InstanceName',
                        'Value': instanceName
                    }],
                    'MetricName': MetricName,
                    'Value': StatusNumber,
                    'Unit': 'None'
                }]
            )

    return None
Brief code explanation:
- Send the nc command to all instances whose Type tag equals Deployer, since I don't feel like keeping track of instance IDs. Currently I have only one instance, but in a production environment the Deployer will be separated into an Endpoint and several Worker instances;
- Wait until the command has finished executing on all target instances, i.e. until its status is no longer Pending or InProgress;
- Read each CommandInvocation within our generic command, so that we can retrieve the command output from the individual instances;
- Read the StandardOutputContent from each invocation and parse it into a DOM;
- For each ServiceStatus node in the XML, translate the Status text into a code (0 = OK, 1 = Error, 2 = NotResponding);
- Push a custom metric into CloudWatch using the instance name as dimension (e.g. dev-deployer.mitza.net), a name derived from ServiceType as metric name (SDL + ServiceType + Status), and the translated Status as metric value.
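One thing the waiting step leaves open is what happens if the command never leaves Pending or InProgress: the loop then only ends when the Lambda itself times out. A possible variation is to poll against a deadline. This is a sketch in Python 3, not what the function above does. The helper name, its parameters and the choice to raise TimeoutError are my own, and the status getter is passed in so the helper does not depend on boto3.

```python
import time


def wait_for_status(get_status, timeout=30.0, interval=2.0,
                    pending=("Pending", "InProgress"),
                    sleep=time.sleep, clock=time.monotonic):
    """Poll get_status() until it leaves the pending states or the deadline passes.

    Returns the final status, or raises TimeoutError once `timeout` seconds
    have elapsed while the status is still pending.
    """
    deadline = clock() + timeout
    status = get_status()
    while status in pending:
        if clock() >= deadline:
            raise TimeoutError("command still %s after %s seconds" % (status, timeout))
        sleep(interval)
        status = get_status()
    return status


# Possible use inside the handler, reusing the names from the code above:
#   status = wait_for_status(
#       lambda: ssmClient.list_commands(CommandId=commandId)['Commands'][0]['Status'])
```

Whether to skip the run or publish a NotResponding value when the deadline is hit is a judgment call; the sketch only makes the condition visible.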
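Eventually, when all is working, the metrics are available in CloudWatch under the SDL Web namespace, one per service type and instance. With the sample response above, these are SDLDeployerEndpointStatus and SDLDeployerWorkerStatus.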
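With the metrics in place, the alarms mentioned at the start of the post can be defined on them. Below is a minimal sketch of how such an alarm might be created with boto3's put_metric_alarm. The namespace, dimension and metric names come from the Lambda code; the alarm name format, the three-period evaluation window, the choice to treat missing data as breaching, and the optional SNS topic ARN are all assumptions for illustration.

```python
def build_alarm_params(metric_name, instance_name, alarm_topic_arn=None):
    """Build keyword arguments for CloudWatch put_metric_alarm.

    The alarm fires when the status code is 1 (Error) or 2 (NotResponding).
    """
    params = {
        "AlarmName": "%s-%s" % (instance_name, metric_name),  # hypothetical naming scheme
        "Namespace": "SDL Web",
        "MetricName": metric_name,
        "Dimensions": [{"Name": "InstanceName", "Value": instance_name}],
        "Statistic": "Maximum",
        "Period": 60,              # the Lambda publishes once a minute
        "EvaluationPeriods": 3,    # assumed: three bad minutes in a row
        "Threshold": 1,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        # Assumed: no data at all usually means the instance or the Lambda is down
        "TreatMissingData": "breaching",
    }
    if alarm_topic_arn:
        params["AlarmActions"] = [alarm_topic_arn]
    return params


def create_alarm(params):
    """Create or update the alarm; needs AWS credentials and boto3."""
    import boto3  # imported here so the builder above works without boto3 installed
    boto3.client("cloudwatch").put_metric_alarm(**params)


if __name__ == "__main__":
    create_alarm(build_alarm_params("SDLDeployerWorkerStatus", "dev-deployer.mitza.net"))
```

Using Maximum as the statistic means a single Error or NotResponding sample within a period is enough to count that period as bad.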