Watchdog Monitor

Version 1.0.0

Last Updated Jun 10, 2019


In the simplest terms, this plugin provides a way for you to monitor devices on your network. Furthermore, you are likely interested in more than just "is the device up" monitoring. With that in mind, each monitored device can have multiple services monitored. But there is more to monitoring than just a screen showing the current state of everything. The various things you can do with this plugin are below:

  • Check the current state of a service.
  • See historical charts of a service.
  • See past events when a device or service was in a warning or error state.
  • Receive notifications when a service is in a warning or error state.
  • Schedule downtime windows where notifications are not sent (for example, maintenance windows).
  • Build custom dashboards to quickly see the state of your devices.

The above list is not a fully inclusive list, but it does cover the larger pieces.


Below we will quickly cover the definitions used throughout this manual so that you have a better understanding of what we are talking about in the later configuration sections.


Devices

A device is a physical thing to be monitored. By itself, having a device exist in the system does not automatically collect data about it - that is where service checks come in. However, a device contains information about the device, such as IP address, so that the various service checks know how to gather the data they need. While we refer to a device as a physical thing, it does not need to be actual physical hardware. A device could also be a virtualized server for example.

Service Check Components

A component is basically the logic that drives a service check. For example, running an ICMP Ping against an IP address is a component. It contains the logic required to perform that check.

Service Check Types

A service check type defines how that component performs its check and what options are set. For example, an ICMP Ping is useless unless we know what IP address to ping, or what return time is considered an error. The configured Types define these parameters. Out of the box you will find three different ICMP Ping service check types: Ping LAN, Ping VPN, and Ping WAN. Each one has different options configured because, for example, a device on the LAN is expected to respond faster than one over a VPN.

Service Checks

A service check is a single instance in the database that ties a service check type to a device. So the Ping LAN attached to the Rock Server device would be the instance of the service check. It records the results from performing ICMP Pings against the Rock Server.

Device Profiles

Trying to configure every single device individually would be a pain. In addition, it would be prone to errors. If you intend to configure eight devices the same way, you are more likely to accidentally forget something if you have to configure each one by itself. Profiles allow you to configure a "template" that all assigned devices will follow. As an example, you might create an HP Network Switch profile that is set up to monitor all these service check types:

  • Ping LAN
  • CPU Usage
  • CPU Temperature
  • Chassis Temperature
  • Power Supply Status

You would then assign all of your five HP switches to this profile. If you update the profile, all the devices pick up the new (or removed) service checks as well.

Device Groups

While on the surface these may sound like the same thing as Device Profiles, they serve very different purposes. A Device Group is simply what its name implies: a collection of devices. A group can contain devices of different profiles and a device can be a part of multiple groups. For example, you may have device Check-in Printer 01 in the following device groups: All Devices, Check-in Devices, and Printers.


Events

An event is when a specific service check (e.g. "Ping LAN on Rock Server") encounters a warning or error condition. An event stores the starting date time of the event and the ending date time of the event. This allows you to go back and review historical events and see how long they lasted.


Schedules

In the Watchdog Monitor plugin, schedules are slightly different than Rock schedules. They use much of the same back-end data, but provide some additional functionality required for this plugin to work. Think of them as "advanced schedules". You might set up a schedule that covers the following three weekly times:

  • Saturday 4pm - 7pm
  • Sunday 7am - 1pm
  • Wednesday 5pm - 8pm

This schedule might reflect your checkin schedule which covers multiple days of the week and different times on each day.

Notification Groups

Notification Groups allow you to define what you get notified about, who gets those notifications, when they will be notified, and how they get notified. Let's break that down a bit more. In the following paragraphs, let's assume we are talking about a notification group called Checkin Notifications for things related to weekend checkin.

You specify what you get notified about by adding individual devices or device groups to a notification group. For example, you might add the device groups Checkin iPads and Checkin Printers in addition to the single device Rock Server as they are all related to checkin working properly.

Each notification group specifies who gets notified by adding People to the group. So you might have your IT person in charge of configuring checkin, your weekend IT person and maybe the Children's Pastor all in the group. They all would like to know when checkin had issues.

Next we need to specify when they get notified. Each notification group is assigned an (advanced) schedule. If an event happens on a device and none of the notification groups that include that device has an active schedule at that moment, then no notification is sent.

Finally, we need to know how to notify those individuals. Each person in a notification group has a Notification Method setting that specifies if they want an E-mail and/or an SMS sent to them. In this example, the on-duty IT person might want an SMS so they can go check it right away. The Children's Pastor and IT person in charge of configuration might just want an e-mail so they know about it.


Downtimes

Another piece of the puzzle is downtimes. These let you schedule times when notifications should not be sent - overriding the notification group schedules. For example, you might have two iPads sent out for repair and they won't be back for three weeks. You don't want to remove them from the notification groups as that would mean you need to remember to put them back in. Instead, you can create a downtime for those two iPads with the appropriate date range so that they will be considered "known offline" for that date period and no notifications will be sent.
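To make the interaction between schedules and downtimes concrete, here is a minimal Python sketch. It is not the plugin's code: the function names, the tuple layout of a downtime, and the device names are all hypothetical, and it assumes a downtime simply wins over an active notification schedule.

```python
from datetime import date

def in_downtime(device, today, downtimes):
    """Return True if `device` is covered by any downtime on `today`.

    Each downtime is a hypothetical (devices, start_date, end_date) tuple.
    """
    return any(
        device in devices and start <= today <= end
        for devices, start, end in downtimes
    )

def should_notify(device, today, schedule_active, downtimes):
    """A downtime overrides an otherwise active notification schedule."""
    return schedule_active and not in_downtime(device, today, downtimes)

# Two iPads out for repair for three weeks (made-up names and dates).
DOWNTIMES = [({"Checkin iPad 03", "Checkin iPad 04"}, date(2019, 6, 10), date(2019, 6, 30))]
```

With this data, `should_notify("Checkin iPad 03", date(2019, 6, 15), True, DOWNTIMES)` is false, while the same call for a device that is not listed in the downtime is true.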

Data Collectors

A collector is a process that runs the service checks. Out of the box your Rock server will act as a data collector. But that may not be good enough. Your Rock server may be behind a strict firewall that prevents it from talking to the internal LAN. Or you may be cloud hosted, in which case your Rock server almost certainly cannot talk to your LAN. In those cases you can install one or more remote data collectors. These are Windows Services that run in the background, talk to your Rock server to determine what service checks need to be run, run them, and then upload the results to your Rock server.

Quick Start

Before you jump in and try to start configuring things, be sure to read the above section on the various definitions used throughout these docs. In order to show you how to quickly get started with monitoring a device, let's get you going with monitoring the response time of your web server to make sure the homepage is loading.

First, we need to create a schedule. You probably want to monitor most things 24/7, so we'll create a schedule to do that. On the Watchdog Monitor page, select the Schedules page and add your first schedule. Name it `24/7`. Next add a new schedule component by clicking the plus button again. For the Start Date / Time, use today's date and `12:00 am`. Duration should be 24 hours, recurring every 1 day. This will make a schedule that is always active.

Next we need a device profile. So go back to the Watchdog Monitor page and then into the Device Profiles page. Create a new profile and call it `Rock Server Profile`. Set the Icon Class to `fa fa-rockrms` and select the `24/7` schedule we just created for the Check Schedule. Now add a Service Check and select the `HTTP Service`. Save the profile. Later, you can go back and add in the host check for Ping LAN, but since Windows by default doesn't respond to ICMP (ping), let's just leave that off for now.

Now we need to create the actual device to be monitored. Once again, go back to the Watchdog Monitor page and down into the Devices page and add a new device. Call this `Rock Web Server`. For address you can either enter the DNS name of your Rock server (e.g. ``) or just `localhost` for quick testing. Select the Profile we just created and select `Rock Server` as the collector. This will make Rock itself gather the information. Click Save.

Now, you can either wait about 60 seconds for the data collector job to run, or you can click the Play button for the HTTP Service in the Service Checks grid. You should, hopefully, see a green `OK` state appear. You now have Rock monitoring itself to make sure the website is loading quickly.

There is one last thing we need to configure for this to be useful: e-mail alerts. Back on the Watchdog Monitor page, go down into the Notification Groups page and add a new notification group. Call it `Rock Server Alerts`, select the `24/7` schedule and the `Rock Web Server` device. Then tick all three State checkboxes and save. Add yourself as a member to the notification group and enable either Email or SMS notification methods (or both if you want) and Save. Rock will now send you e-mail notifications if anything happens on the Rock server. If you want the SMS notifications to work as well, go to the Watchdog Monitor Settings page and select an SMS Number to use for sending those notifications.

I know we all hate reading documentation, but I highly recommend reading over this entire document one time to familiarize yourself with the different features and service check types. Your next steps will probably be installing a local Data Collector service if you are cloud hosted and setting up a few Device Profiles to use when monitoring all your devices.


Service Check Types

Out of the box, we provide a number of standard service check types already configured. This should get you started pretty quickly, but you will probably want to customize these and/or create your own service check types to handle specific scenarios.

Edit Service Check Type

The first item to be configured with a service check type is the Provider (Component). This specifies the fundamental type of check to be performed. In this case, we are going to be configuring an ICMP Ping.

The timing of the checks is defined by the three settings: Check Interval, Recheck Interval and Recheck Count. When a service check is in a given state, the Check Interval is used to determine the number of minutes between checks. In this case, every five minutes the device will be pinged.

Before we talk about Recheck Count and Recheck Interval we need to discuss the concept of a "soft state". When a device is functioning normally, it's in the OK state. This is considered a "hard state": it's known to be true. If a single ping check comes back at 400ms that would be considered an Error state. However, this will not trigger an immediate notification. The reason is that this is considered a "soft state". We think the state might now be Error, but it could also just be a transient result. A blip in the network if you will. So a state change will go through a period of "soft state" before becoming a "hard state".

This is where the Recheck Count comes in. When a state changes (whether from OK to Error, vice-versa, or any other combination of state changes), it goes through this "soft state" for the number of checks specified in the Recheck Count. In this case, we run three additional checks before making the new state a "hard state". But three extra checks at five minutes each would mean waiting fifteen minutes before getting a notification. So when performing these "rechecks", the Recheck Interval is used instead of the Check Interval. In the above example, with a recheck interval of one minute, it will only wait an additional three minutes total (three checks at one minute intervals) before you get notified.
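The soft state and hard state behavior can be sketched as a small state machine. This is only an illustration in Python, not the plugin's actual implementation: the class and field names are hypothetical, and the exact way rechecks are counted is an assumption based on the description above.

```python
from dataclasses import dataclass

@dataclass
class ServiceCheckState:
    """Hypothetical model of the soft/hard state of one service check."""
    check_interval: int = 5     # minutes between checks while in a hard state
    recheck_interval: int = 1   # minutes between checks while in a soft state
    recheck_count: int = 3      # rechecks needed to confirm a new state
    hard_state: str = "OK"
    soft_state: str = "OK"
    rechecks_done: int = 0

    def record(self, result):
        """Record one check result.

        Returns (hard_state_changed, minutes_until_next_check).
        """
        if result == self.hard_state:
            # The result agrees with the confirmed state: drop any soft state.
            self.soft_state = result
            self.rechecks_done = 0
            return False, self.check_interval
        if result != self.soft_state:
            # First sighting of a different state: enter the soft state.
            self.soft_state = result
            self.rechecks_done = 0
            return False, self.recheck_interval
        # The same soft state was seen again: count it as a recheck.
        self.rechecks_done += 1
        if self.rechecks_done >= self.recheck_count:
            self.hard_state = result
            self.rechecks_done = 0
            return True, self.check_interval
        return False, self.recheck_interval
```

With the defaults shown, the first Error result only starts the soft state. The hard state flips to Error on the third recheck, three minutes later, and that is the point where a notification would be sent. A single Error followed by an OK result never changes the hard state.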

The final section contains the configuration options of the specific Provider. Each provider has its own options and you will need to see the in-line help for specific information. One thing to note is that nearly every text-type field supports Lava, as you can see in the example above for the Address field.

Device Profiles

Device profiles, as mentioned previously, allow you to setup a configuration that multiple devices use. Below we will show a sample of how you might configure monitoring your PFSense firewall devices.

Edit Device Profile

The Name is rather self-explanatory. This is what the device profile is called and what shows up when you need to select a profile. The Icon CSS Class is used to provide the icon for any device that uses this profile.

We're going to make another segue and talk about the difference between a "device state" and the "device overall state". The former is tied to the Host Service Check in the image above. The Host Service Check is used to determine if the device itself is up or down. Or, said another way, it's used to determine the "device state". When the "device state" reports that the device is in the Error state, none of the other service checks for that device are run. In this example, if we cannot successfully ping the device, then we almost certainly won't be able to run the three web related checks. One primary reason this is done is that in the case of a ping failure you would (hopefully) only receive a single error notification rather than four total.

So back to the sample screenshot, the Host Service Check is again what is used to determine if the device itself is up or down. The Check Schedule will be used to determine in what time period the service checks for the device will be executed. In this case, we are going to run checks seven days a week, twenty-four hours a day.

One device profile can inherit specific settings from another profile. These are which Service Checks are performed on the device as well as the SNMP settings, which specify how to authenticate for SNMP related checks. The bottom of the screenshot shows any service checks that are inherited from the parent profile(s). This inheritance is also why the Enabled check box is there. If a parent profile includes a service check that you don't want these devices to use, you can add the service check again and turn off the Enabled checkbox.

In this example, we are inheriting from SNMP Device which gives us the SNMP Uptime check. On top of that, we are going to add service checks to make sure that HTTP and HTTPS are working properly, as well as a check to make sure the SSL certificate is valid and not expiring in the near future. One final thing to note is the Collector Override setting. In a moment when we look at creating a Device, we will talk about the Collector the device uses. The Host Service Check will always use the configured Device Collector. But the other individual checks can override that and use a specific collector.
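The inheritance and Enabled override rules can be sketched in a few lines of Python. This is a rough model rather than the plugin's data structures: the dictionary layout, the function name, and the `Firewall Without SNMP` profile are hypothetical and exist only for illustration.

```python
def effective_checks(profile_name, profiles):
    """Return the enabled service checks for a profile, including inherited ones."""
    # Walk up the parent chain, starting from the profile itself.
    chain = []
    current = profile_name
    while current is not None:
        chain.append(current)
        current = profiles[current].get("parent")

    # Apply the oldest ancestor first so child settings override parent settings.
    enabled = {}
    for name in reversed(chain):
        enabled.update(profiles[name]["checks"])
    return [check for check, is_enabled in enabled.items() if is_enabled]

PROFILES = {
    "SNMP Device": {"parent": None, "checks": {"SNMP Uptime": True}},
    "PFSense Firewall": {
        "parent": "SNMP Device",
        "checks": {"HTTP Response": True, "HTTP Certificate": True},
    },
    # Re-adds the inherited check with Enabled turned off.
    "Firewall Without SNMP": {
        "parent": "SNMP Device",
        "checks": {"SNMP Uptime": False, "HTTP Response": True},
    },
}
```

Here `effective_checks("PFSense Firewall", PROFILES)` includes the inherited SNMP Uptime check, while the second child profile drops it by listing it again as disabled.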

Edit Device Profile SNMP

As you can see, the SNMP settings are very flexible and allow for just about any combination of settings that your devices may require. Some devices let you choose authentication and encryption options, others just mandate what you must support. So we decided to give you the whole kitchen sink. One thing to note: if you choose SNMP v1 it is actually running v2. In our testing there has been little difference, except that some devices claim to use v1 when they are actually using v2 - which causes some data to not return correctly unless we are also using v2.

Edit Device Profile NRPE

We'll talk more about NRPE checks later, but these are Nagios Passive checks. Meaning the data collector (Rock or other collector you install) will query the device via NRPE for its state. If you already have some servers using Nagios-style checks then you can configure some custom service check types to take advantage of those. One thing to note is that our NRPE checks do not support so-called "insecure encryption" that older Nagios systems use. This is a limitation of the libraries available to us and requires you to use either no encryption or full SSL certificate based encryption.


Devices

Okay, now we start getting into the fun stuff: actually adding a device so we can run checks against it.

Edit Device

Most of these fields should be self-explanatory. We'll only cover them briefly. The Name of the device is a user friendly name so you don't need to enter any DNS names here or anything like that. If you turn the Active checkbox off then all service checks will be disabled. Normally you will want to use a Downtime instead, but this could be useful if something is acting up at the end of the day as you are about to leave and you just want to turn off monitoring for the night before you go home.

The Address of the device is optional. If provided it can be either a DNS name or IP address. The Profile and Collector specify the obvious. The Parent is a way to build in automatic silencing of devices and service checks. For example, if the network switch for the Children's Building goes down, you don't want to get notifications about all the devices in the building. You know they are down because the switch is down. So you can build a virtual device tree. If a parent device is in an Error state then you will not receive notifications about any "child" devices.
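The parent rule can be pictured as a walk up the device tree. The sketch below is an assumption-laden illustration in Python: the function name and device names are made up, and whether the plugin looks only at the direct parent or at every ancestor is not stated here, so checking the whole chain is a guess.

```python
def is_suppressed(device, parents, states):
    """Return True if any ancestor of `device` is in the Error state."""
    seen = set()                      # guards against accidental loops in the tree
    parent = parents.get(device)
    while parent is not None and parent not in seen:
        if states.get(parent) == "Error":
            return True
        seen.add(parent)
        parent = parents.get(parent)
    return False

# Hypothetical device tree: child -> parent.
PARENTS = {
    "Check-in Printer 01": "Children's Building Switch",
    "Children's Building Switch": "Core Switch",
}
STATES = {"Core Switch": "OK", "Children's Building Switch": "Error"}
```

In this tree, notifications for the printer are silenced because its switch is down, but the switch itself still notifies because its own parent is OK.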

As we previously mentioned, each device can be a member of any number of groups. This allows you to collect like devices into a single group and monitor their status on the Dashboards collectively. As an example, you might have a group for Network Switches and put all your switches in that group. This way on your dashboard you can have a single monitored item called "Network Switches" and know at a glance that all your switches are good.

Finally, each device allows you to override the SNMP Settings and NRPE Settings inherited from the Profile. If all your PFSense Routers use the same SNMP settings except one, you don't need to create a whole new profile just to specify those settings.


Schedules

Schedules are fairly straightforward, though it may take you a few minutes to think through how these advanced schedules are constructed. These advanced schedules are simply a collection of the schedules you are already familiar with. To simplify configuration, a single schedule component cannot go past midnight.

Edit Schedule

In our example, we have a single component that is Daily at 12:00 AM and runs for 24 hours. This gives us a standard 24/7 type schedule. A more advanced setup might be for a "checkin schedule" that covers all the various times check-in devices are active and set up to be monitored. Such a schedule might contain the following component schedules:

  • Saturday at 4:30 PM and runs for 4 hours.
  • Sunday at 7:00 AM and runs for 5.5 hours.
  • Wednesday at 6:00 PM and runs for 2 hours.
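To show how an advanced schedule built from several components might be evaluated, here is a minimal Python sketch. The tuple layout and function name are hypothetical. Because a component cannot go past midnight, the sketch only compares against components that start on the same day.

```python
from datetime import datetime, timedelta

# Each hypothetical component: (weekday, start hour, start minute, duration in hours).
# Monday is 0 and Sunday is 6, following datetime.weekday().
CHECKIN_SCHEDULE = [
    (5, 16, 30, 4.0),   # Saturday at 4:30 PM for 4 hours
    (6, 7, 0, 5.5),     # Sunday at 7:00 AM for 5.5 hours
    (2, 18, 0, 2.0),    # Wednesday at 6:00 PM for 2 hours
]

def is_active(schedule, moment):
    """Return True if `moment` falls inside any component of the schedule."""
    for weekday, hour, minute, duration_hours in schedule:
        if moment.weekday() != weekday:
            continue
        start = moment.replace(hour=hour, minute=minute, second=0, microsecond=0)
        end = start + timedelta(hours=duration_hours)
        if start <= moment < end:
            return True
    return False
```

For example, `is_active(CHECKIN_SCHEDULE, datetime(2019, 6, 9, 12, 0))` is true because that Sunday noon falls inside the second component, while any time on a Monday is outside the schedule.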

Notification Groups

The notification group lets you create a very customized way to send notifications.

Edit Notification Group

The Schedule specifies when notifications for this group will be sent. This schedule applies to both the immediate notifications and the hourly notifications. Immediate notifications tell you about a single service that has changed state. The hourly notifications are aggregate and tell you all the service checks that are currently in a non-OK state.

You can dictate which states you want to be notified about by selecting them in the State checkboxes. For immediate notifications, these are the states that a service check must change to in order for the notification to be sent. For the hourly aggregate notifications, these are the states that the service check must be in for the notification to be sent. The exception to this, as we just mentioned, is that the OK state is always ignored for aggregate notifications. You don't really want an e-mail every hour telling you how many services are OK do you?
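The two filtering rules can be written down as a short Python sketch. The function names and the sample check names are hypothetical, and this only models the State checkboxes, not schedules or downtimes.

```python
def send_immediate(new_state, selected_states):
    """Immediate notification: the state the check changed TO must be selected."""
    return new_state in selected_states

def hourly_report(current_states, selected_states):
    """Hourly aggregate: checks currently in a selected state, never OK."""
    return {
        check: state
        for check, state in current_states.items()
        if state in selected_states and state != "OK"
    }

SELECTED = {"OK", "Warning", "Error"}   # all three State checkboxes ticked
CURRENT = {
    "Ping LAN on Rock Server": "OK",
    "HTTP Service on Rock Server": "Warning",
}
```

With all three states ticked, a change back to OK still sends an immediate "recovered" style notification, but the hourly report for `CURRENT` only lists the HTTP check that is in Warning.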

Each notification group can be tied to either Device Groups or individual Devices. A device can be referenced multiple times via individual reference and group reference. It will not cause any issues and you will not receive multiple notifications for it.

Notification Group Members

So we have configured what devices we are going to send notifications about, but we need to also specify who will receive those notifications. You can add people to the Members list and specify whether they receive Email notifications or SMS notifications (or both).

One final thing to note, is that when notifications are sent, the system builds an aggregate list of all notification groups and the people in them. Meaning, if a device is matched in three different notification groups and you are listed in two of the groups as Email notification and one group as both Email and SMS, then you will only receive one notification of each type. Meaning, one Email and one SMS.
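The aggregation amounts to taking the union of notification methods per person across all matching groups. A minimal Python sketch, assuming a hypothetical layout where each group maps a person to a set of methods:

```python
def merge_recipients(matching_groups):
    """Combine members of all matching groups into one entry per person."""
    methods_by_person = {}
    for group in matching_groups:
        for person, methods in group.items():
            # A set keeps each method only once, however many groups list it.
            methods_by_person.setdefault(person, set()).update(methods)
    return methods_by_person

# The device matched three groups; you are in all of them.
GROUPS = [
    {"you": {"Email"}},
    {"you": {"Email"}},
    {"you": {"Email", "SMS"}},
]
```

`merge_recipients(GROUPS)` yields a single entry for you with the methods Email and SMS, so one message of each type goes out.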

Data Collectors

Out of the box, your Rock server is configured to be a data collector. There is a Job that is configured to run every minute and process any service checks that need to be run - and are configured to use the Rock server as the data collector.

However, this may not work in your environment as your Rock server may not have access to your internal network. In this case, you can go to the Power Tools > External Applications and download the stand-alone installer. This will install a Windows Service that can run and perform service checks on that Windows server. It is worth noting that you can do this on as many Windows servers as you want. So if you want to put a data collector at each site, feel free. Another thing to keep in mind is security. Because you are sending potentially sensitive data (such as SNMP authentication settings) over the network, it is important to use an SSL connection from the collector to your server.

Once you have installed the remote Data Collector you need to configure it to talk to your server. Currently, this is done by going to your Defined Types page and looking for the Watchdog Monitor Collectors type. Open that up and add a new Value. The Value is just a user-friendly name that you will see when selecting Collectors. The Authentication Key can be anything, but we recommend a long sequence of random characters. This is used to identify your remote collectors and is also used as the password for the collector. As such, each collector must have a unique Authentication Key.
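If you want a quick way to produce a long random Authentication Key, one option is Python's standard `secrets` module. This is just a convenience sketch: the function name and the 48-byte length are arbitrary choices of ours, not requirements of the plugin.

```python
import secrets

def new_authentication_key(num_bytes=48):
    """Return a URL-safe random string built from `num_bytes` of randomness."""
    return secrets.token_urlsafe(num_bytes)

if __name__ == "__main__":
    # Generate one key per collector; never reuse a key between collectors.
    print(new_authentication_key())
```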

On the Windows server, you should see a new application in your Start Menu called Watchdog Monitor Collector Service. Run this and configure the URL used to communicate with your server (for example, and the Authentication Key you created for the collector. You will also need to enter the Shared Secret that is defined on the Settings page (described below).

While the collector will allow you to use a non-SSL port, please do not do this in production use. The Rock server will need to send sensitive information (such as NRPE and SNMP credentials) to the data collectors in order for them to do the work they need. Therefore, you should only use an SSL connection in production.


Settings

System-wide settings can be changed on this page, such as the notification templates. Out of the box, most of these values will probably be fine for you - other than the SMS Number.


If you want to customize the e-mail templates used, then you can create your own and then update the active e-mail templates here. The Service State Change Email specifies the system e-mail that is sent when a single service check has changed state (e.g. from OK to Warning). The Service State Email specifies the e-mail to be sent on a periodic basis telling you which services and devices have issues.

The same logic applies to the Service State Change SMS and Service State SMS message values. Additionally, if you plan to use SMS notifications then you will also need to specify which SMS number to use when sending those notifications.

The Collector Shared Secret is used as an additional layer of security when authenticating remote collectors - after all, you will be sending sensitive information to the remote collectors so that they can contact the devices they are monitoring. A unique value is generated for you on install, but if you ever need to change the value you can do so here. Just remember you will need to also update any remote collectors to use this new shared secret value.

Finally, the History Connection String can be used to override where the plugin stores historical data. For a smaller install, this would not be needed. Larger installs may put so much data into your database that you would rather keep outside the Rock database. This value should be either blank to indicate the internal Rock database, or a standard SQL Server connection string to connect to a remote server.


Dashboards

You can create as many dashboards as you want, each showing the same or different devices. The dashboards are designed with Lava so you can style them any way you want. Below we show two of the default dashboard styles you can use: a list, and buttons.

Dashboard Sample

The top section shows individual devices as a table. The Rock Server has a yellow background for its Name and Service Checks because one of the service checks is in a warning state. The Device State column is still green because the device itself is still OK, that is, it's responding to Pings correctly.

The second section also shows individual devices, but uses a layout similar to the internal Rock "page-menus" that show up as blocks. Again, the Rock Server shows a yellow background but the icon still shows up as green indicating that the overall state is warning, but the device state itself is still OK.

The final section also shows as a block, but these are showing the overall states of the two device groups we have defined. So we can see at a glance that something in our All Devices group is in a warning state. However, we can also see at a glance that everything at Home is working correctly.

Service Check Components

Collector Status

This is a fairly simple component to set up. When you have one or more remote collector services running, you have created a single point of failure: if the collector service stops running for some reason, then you won't know that a device might have entered an error state. Thus, you need a way to monitor the collector services as well. This service check does just that. When set up to run on your Rock Server collector, it will check any configured remote collectors to see when they were last active.

If any collector has not contacted the Rock server within the specified threshold limits, then this check will enter the warning or error state and alert you.

Note: This component must be run on the Rock server itself, which means you may need to specify the Rock server in the Collector Override when you attach it to a device profile.

DNS Blacklist Lookup

This component allows you to check if your mail server is present on one or more DNS Blacklists. These are free lists that many mail servers use to determine whether or not an incoming message is spam. If the sending server is listed in one of the lists then many mail servers will reject the message.

Whether you run your own on-premise mail server to send e-mail or use a mail provider like Mailgun or SendGrid, you can configure a service check with the IP address (or DNS name) of your mail server and monitor if it is on any spam blacklists. If so you can then follow up and determine why and work to get it removed. This allows you to stay ahead of the game and not find out you have been blacklisted after people start complaining they are not getting your e-mails - which usually happens some time after you got blacklisted.

You can choose from the existing list of possible DNS blacklists to query (these are the most common) or if you want to query against one or more lists that are not currently options, you can enter them in the Custom Lists field. These would be entered as one or more comma separated DNS list names.
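For background, DNS blacklists are conventionally queried by reversing the octets of the IPv4 address and appending the name of the list; a listed address resolves, an unlisted one does not. The Python sketch below illustrates that general convention only. It is not taken from the plugin, and `dnsbl.example.org` is a placeholder list name.

```python
import ipaddress
import socket

def dnsbl_query_name(ip, zone):
    """Build the DNSBL query name for an IPv4 address (octets reversed)."""
    octets = str(ipaddress.IPv4Address(ip)).split(".")
    return ".".join(reversed(octets)) + "." + zone

def is_listed(ip, zone):
    """Return True if the list answers for this address.

    Any lookup failure is treated as "not listed", which also hides
    resolver outages; a real check would want to tell those apart.
    """
    try:
        socket.gethostbyname(dnsbl_query_name(ip, zone))
        return True
    except OSError:
        return False
```

For example, `dnsbl_query_name("192.0.2.10", "dnsbl.example.org")` produces `10.2.0.192.dnsbl.example.org`, which is the name that would be looked up.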

DNS Lookup

This component allows you to configure a generic DNS lookup test. If you want to ensure that a specific hostname always resolves to a particular IP address you can set that up here. You can also configure it to just ensure that the name resolves to something rather than erroring out. This type of configuration helps you ensure that your DNS server is working at all so that you can investigate why it stopped responding to DNS queries.

The Hostname is the DNS name you want to resolve back to an IP address. Query Type allows you to select between an IPv4 A and an IPv6 AAAA lookup. If you want to verify the result against a specific, expected, value, then you can enter it in the Expected field. By default the component will use the default DNS server, but you can override that by specifying the name or IP address of the DNS server you wish to query in the Server field.

Currently only A and AAAA records are supported. We may include support for other query types in the future such as PTR and TXT.
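The lookup and the optional Expected comparison can be approximated with Python's standard library. This sketch uses the system resolver only (the Server override is not modelled), the function names are hypothetical, and treating a mismatch as an Error is our assumption about the resulting state.

```python
import socket

def resolve(hostname, query_type="A"):
    """Resolve `hostname` to a set of addresses using the system resolver."""
    family = socket.AF_INET if query_type == "A" else socket.AF_INET6
    infos = socket.getaddrinfo(hostname, None, family=family)
    # Each entry ends with a socket address whose first item is the IP.
    return {info[4][0] for info in infos}

def dns_state(addresses, expected=""):
    """Error if nothing resolved, or if an expected value is set and missing."""
    if not addresses:
        return "Error"
    if expected and expected not in addresses:
        return "Error"
    return "OK"
```

Leaving `expected` empty corresponds to the "just make sure it resolves to something" configuration described above.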

HTTP Certificate

Hopefully we all have our sites secured with an SSL certificate. Hopefully we also have some sort of automated renewal process in place, like using the Acme Certificate plugin. But sometimes we can't use automated renewal. Or maybe you want to monitor the SSL certificates of a non-Rock server or device. This component allows you to check if the SSL Certificate for the given web address is valid and not expiring too soon.

The URL checked is specified by the Address setting and must include the https:// prefix to work correctly. If the time until expiration is less than the Warning Threshold or Error Threshold values, as specified in days, then the check will enter the Warning and Error states respectively. The Timeout value allows you to specify how long to wait for the server to respond and is specified in milliseconds. This helps prevent the check from taking a really long time to report a failure if the server is offline or otherwise not responding in a timely manner.

The component currently checks both the expiration date as well as the validity of the certificate. Meaning if you try to check a self-signed certificate it will report an error because it will be treated as not valid. This also means if the certificate is for but you put in the Address field it will also report an error because the names do not match (normally you would have both names listed in your certificate though). In the future we may add an option to only check the expiration date.
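A comparable check can be sketched with Python's `ssl` module, which by default also validates the certificate chain and the host name. This is an illustration under our own assumptions, not the component's code: the function names and the example thresholds of 30 and 7 days are hypothetical.

```python
import socket
import ssl
from datetime import datetime, timezone

def days_until_expiry(hostname, port=443, timeout_ms=5000):
    """Connect over TLS, validate the certificate, and return days until it expires."""
    context = ssl.create_default_context()   # verifies the chain and the host name
    with socket.create_connection((hostname, port), timeout=timeout_ms / 1000) as sock:
        with context.wrap_socket(sock, server_hostname=hostname) as tls:
            cert = tls.getpeercert()
    expires = ssl.cert_time_to_seconds(cert["notAfter"])
    now = datetime.now(timezone.utc).timestamp()
    return (expires - now) / 86400

def certificate_state(days_left, warning_days, error_days):
    """Map the remaining lifetime onto OK, Warning, or Error."""
    if days_left < error_days:
        return "Error"
    if days_left < warning_days:
        return "Warning"
    return "OK"
```

A self-signed or mismatched certificate makes `wrap_socket` raise an `ssl.SSLCertVerificationError`, which a caller would report as an Error. With thresholds of 30 and 7 days, a certificate with 20 days left would be a Warning.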

HTTP Response

This component will test to make sure the given URL is responding in a timely manner. It does not care what the actual content returned is, as long as it is indicated by a 2xx success code from the server. You specify the Warning Threshold and Error Threshold values in milliseconds, and if the server takes longer than those values to respond then the check enters a Warning or Error state respectively. The Timeout specifies how long to wait for a response before giving up and recording it as a timeout.

The URL queried is specified by the Address field. It can be either an http:// or https:// address. Additionally, you don't need to limit it to just the root page of the site. If you have a decent amount of logic on a specific page of your site that takes a bit of time to process, you can set up a check to target that one page and make sure the time to process hasn't crept up to an unacceptable level.
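The timing logic can be approximated in Python with `urllib`. Treat this as a sketch under stated assumptions: the function names are hypothetical, and mapping timeouts and non-2xx responses straight to Error is our guess at reasonable behavior rather than a description of the component.

```python
import time
import urllib.request

def response_state(elapsed_ms, warning_ms, error_ms):
    """Map a response time onto OK, Warning, or Error."""
    if elapsed_ms > error_ms:
        return "Error"
    if elapsed_ms > warning_ms:
        return "Warning"
    return "OK"

def check_url(url, warning_ms, error_ms, timeout_ms):
    """Time a GET request; any failure or non-2xx answer counts as an Error."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout_ms / 1000) as response:
            response.read()
            status = response.status
    except OSError:
        # Covers timeouts, refused connections, and the HTTPError that
        # urllib raises for 4xx and 5xx responses.
        return "Error"
    elapsed_ms = (time.monotonic() - start) * 1000
    if not 200 <= status < 300:
        return "Error"
    return response_state(elapsed_ms, warning_ms, error_ms)
```

Pointing `check_url` at a single slow page rather than the site root is how you would watch the processing time of that page specifically.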



ICMP Ping

This is the most basic component we have. It simply tests whether a device is "alive" by sending an ICMP echo request, commonly called a ping, to the device. Normally the device responds, and the round trip time can be used to determine whether there is a network problem between the two devices. A device is not required to respond to a ping, and many firewalls (for example Azure's firewall) actually block them. But if it is a device on your own network, most likely it will respond to a ping.

So with this you can monitor devices to see if they are online and plugged into the network. This is often helpful with devices that are expected to be on and plugged in 24 hours a day, such as servers, printers, switches, etc.

The Address contains the hostname or IP address of the device to be pinged. If the response time is greater than the Warning Threshold or Error Threshold, specified in milliseconds, then the check will return a Warning or Error state respectively. The Number of Packets allows you to specify how many packets to send and receive. The average response time of all packets will be used in calculating the round trip time.
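The averaging described above is simple arithmetic, sketched here in Python. The function name is hypothetical, and treating "no replies at all" as an Error is our assumption for the sketch, not documented plugin behavior.

```python
from statistics import mean


def classify_ping(round_trip_times_ms, warning_ms, error_ms):
    """Average the per-packet round trip times and map the result to a state.

    Returns (state, average_ms). An empty list means no packet was answered,
    which this sketch assumes should count as an Error.
    """
    if not round_trip_times_ms:
        return "Error", None
    average_ms = mean(round_trip_times_ms)
    if average_ms > error_ms:
        return "Error", average_ms
    if average_ms > warning_ms:
        return "Warning", average_ms
    return "OK", average_ms
```

Sending more packets smooths out a single slow reply, at the cost of a slightly longer check.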

NRPE Value

Many organizations that already do some sort of monitoring have existing Nagios-style checks on their servers. Many of these operate over the NRPE protocol. If you have these checks installed, or plan to install them for better monitoring of your servers, you can use this component to check the state of those checks.

The Address field contains the hostname or IP address to be contacted to perform the check. After connecting, it sends the Command field as the check to be performed and waits for the results. The Timeout field specifies how many milliseconds to wait before giving up.

Nagios checks are capable of returning multiple performance index values. For example, their version of an ICMP Ping check returns two performance values: the round trip time and the packet loss percentage. We only support accessing one performance index, so you specify which one to retrieve with the Performance Index field. Normally these are ordered by importance, so the first index (zero) is usually the one you want.
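To show what a performance index refers to, here is a small Python sketch that picks one value out of standard Nagios plugin output, where the performance data follows a | character as space-separated label=value entries. The function name is hypothetical, and the sketch assumes labels without embedded spaces and a single line of output.

```python
import re


def performance_value(plugin_output, performance_index=0):
    """Return (label, value) for the performance data entry at the given index."""
    _, _, perfdata = plugin_output.partition("|")
    entries = perfdata.split()
    label, _, fields = entries[performance_index].partition("=")
    # The value is the leading number; a unit and thresholds may follow it.
    number = re.match(r"-?\d+(?:\.\d+)?", fields)
    if number is None:
        raise ValueError("no numeric value for " + label)
    return label, float(number.group())


sample = ("PING OK - Packet loss = 0%, RTA = 0.50 ms"
          "|rta=0.500000ms;100.000000;500.000000;0.000000 pl=0%;20;60;0")
```

For the sample line, index 0 selects the round trip time and index 1 selects the packet loss percentage.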

Another thing that Nagios checks do is return their own state of Ok, Warning and Error. This is based on the internal configuration. If you want to trust these results as truth then you can set the Trust Result to Yes. Doing so will ignore any comparison values you may enter.

Assuming you don't trust the result, you can specify a warning comparison type and value as well as an error comparison type and value. If the returned value matches the Warning Comparison Type and the Warning Comparison Value then the check is put into the Warning state. If the returned value matches the Error Comparison Type and the Error Comparison Value then the check is put into the Error state.

Finally, since we don't know what kind of value is being returned (temperature, disk space, etc.) we don't know what type of label to use when identifying the value. You will need to enter a Value Label to identify those values. For example, if you are running a check on how many days the device has been running, you would enter day in that field. It will automatically be pluralized as needed and will result in a final text string that looks something like 23 days.

Plugin Updates

This is a fun component. It allows you to check if any of the plugins you have installed have an update available. This only checks plugins that are actually installed on the server and are capable of being upgraded. This means if a plugin has an update but it requires a newer version of Rock than you have installed, it is not counted.

Currently there are no configurable options for this component. It will automatically enter a Warning state if there are plugin updates available. There is also a three day delay before an update is considered available. This allows time for the developer to do a final test install from the Rock Shop and pull the update if problems are discovered with the packaging.

Note: This component must be run on the Rock server itself, which means you may need to specify the Rock server in the Collector Override when you attach it to a device profile.

Printer State

Depending on how your printers operate, this could either be a very useful component or it could be useless to you. Basically, every printer reports a very generic state about itself via SNMP. The normal states are: Warming Up, Idle, and Printing. Usually if the printer is in the Unknown or Other state it means an error has occurred. But since there is no specification for when the different states are returned, it really comes down to testing with your printers and seeing if this component will give the results you need.

Configuration is straightforward. Simply specify the address of the printer and which states should trigger a warning or an error alert.

Printer Status

Most printers provide somewhat useful status flags via SNMP. This component will allow you to check those status flags and initiate alerts under certain conditions. For example, you may want to set up an alert for your helpdesk person when a printer's paper runs low so that they can go fill it up before it stops printing.

A printer can return no status values (indicating everything is probably OK), or it can return one or more status values. So it is perfectly valid for the status value returned by the printer to contain both "Low Paper" and "Low Toner" at the same time.

Bear in mind that not all printers provide the indicated status values. These are only the possible values that are defined in the specification. For example, a small home printer will probably never return the "Low Paper" status as that requires additional sensors. Many will just go from "normal" to "No Paper". Larger office printers will likely have the "Low Paper" sensors.

Another thing to consider, especially with larger units like copiers and such, is that they will return a tray-related condition if _any_ tray meets that condition. So if you have 4 trays with Plain Letter and the unit is configured to use all 4 trays in sequence and the first tray runs out, it will trigger a "No Paper" condition even though it will still print just fine (because the other 3 trays have paper). Other conditions are more likely to cause a full stop in printing, such as "Jammed", "Toner Missing", etc.
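For the curious, the reason a printer can report several conditions at once is that the standard SNMP object for this (hrPrinterDetectedErrorState in the HOST-RESOURCES-MIB) is a set of bit flags, numbered from the most significant bit of the first byte. The Python sketch below decodes such a value. We are assuming here that this is the object being read; the function name and the display names are our own.

```python
# Flag names in bit order, as defined for hrPrinterDetectedErrorState.
STATUS_FLAGS = [
    "Low Paper", "No Paper", "Low Toner", "No Toner",
    "Door Open", "Jammed", "Offline", "Service Requested",
    "Input Tray Missing", "Output Tray Missing", "Marker Supply Missing",
    "Output Near Full", "Output Full", "Input Tray Empty",
    "Overdue Preventive Maintenance",
]


def decode_status(octets):
    """Return the names of all flags set in the raw status bytes."""
    active = []
    for bit, name in enumerate(STATUS_FLAGS):
        byte_index, offset = divmod(bit, 8)
        # Bit 0 is the most significant bit of the first byte.
        if byte_index < len(octets) and octets[byte_index] & (0x80 >> offset):
            active.append(name)
    return active
```

An empty result corresponds to the printer returning no status values, and a value such as 0xA0 decodes to both "Low Paper" and "Low Toner".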

SNMP Uptime

If you are monitoring a network device such as a printer, network switch, UPS, etc. then it probably supports being monitored by a protocol called SNMP. Working with SNMP can be tricky, and while there is a component for checking any arbitrary value, we have packaged the most common check as its own self-contained component. This component will query the device via SNMP and check how long it has been up and online. If the uptime is below a certain threshold (indicating a recent reboot) then it will enter either a Warning or Error state.

You specify the hostname or IP address to connect to in the Address field. If the returned system uptime is less than the number of minutes specified in Warning Threshold then it will enter a Warning state. If the system uptime is less than the number of minutes specified in Error Threshold then it will enter an Error state. The Timeout indicates the number of milliseconds to wait for a response from the device.
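SNMP reports system uptime (the standard sysUpTime object, OID 1.3.6.1.2.1.1.3.0) in hundredths of a second, so the value has to be converted before it can be compared to thresholds given in minutes. The Python sketch below shows that conversion and the "less than" comparison. The function names are hypothetical and this is not the plugin's own code.

```python
TICKS_PER_MINUTE = 100 * 60  # sysUpTime counts hundredths of a second


def uptime_minutes(timeticks):
    """Convert an SNMP TimeTicks value to minutes."""
    return timeticks / TICKS_PER_MINUTE


def classify_uptime(timeticks, warning_minutes, error_minutes):
    """A short uptime suggests a recent reboot, so lower values are worse."""
    minutes = uptime_minutes(timeticks)
    if minutes < error_minutes:
        return "Error"
    if minutes < warning_minutes:
        return "Warning"
    return "OK"
```

With a Warning Threshold of 60 and an Error Threshold of 10, a device that rebooted five minutes ago (30,000 ticks) is in the Error state and returns to OK once it has been up for an hour.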

It should be noted that just because a device supports SNMP does not mean it will automatically respond to SNMP queries. You will need to configure the SNMP Settings to match the device's own configuration otherwise you will probably get timeout errors.

SNMP Value

So we just talked about the SNMP Uptime component. That is great if all you want to know is how long the device has been running. But SNMP actually exposes a lot of data for you to monitor. For example, most printers will report how much toner they have left, or how full the paper trays are. A network switch will often report the internal temperature. Most devices also report an "overall status" that wouldn't tell you specifically what is wrong, but would basically let you see remotely that pesky warning indicator on the switch stuffed in the closet on the other side of the building.

To achieve that, you have this component. This is probably one of the most difficult components to set up, purely because there is no standard for which OID number a device will use to transmit its data. You have to find these OID numbers in technical manuals or by trial and error. However, once you know it, you can re-use that same OID number to check other devices of the same make and model.

The Address, like most other checks, specifies the hostname or IP address to connect to. The OID is where you specify which value you are interested in, and is expressed as a long string of integers separated by periods. To ensure that the check does not sit waiting forever for a response, you can specify a timeout in milliseconds in the Timeout field.

Next you can specify a warning comparison type and value as well as an error comparison type and value. If the returned value matches the Warning Comparison Type and the Warning Comparison Value then the check is put into the Warning state. If the returned value matches the Error Comparison Type and the Error Comparison Value then the check is put into the Error state. Since SNMP values can be numerical or string values, the comparison types include two string comparisons. So if you are querying a string value that might contain the word "fail" if a problem exists, you can specify Contains fail to detect that condition.
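The comparison logic described above can be pictured with the following Python sketch. Only Contains is named in this manual; the other comparison type names, the case-insensitive matching, and the choice to test the Error condition before the Warning condition are all assumptions made for illustration.

```python
import operator

# Hypothetical comparison type names for numeric values.
NUMERIC_COMPARISONS = {
    "Equal To": operator.eq,
    "Greater Than": operator.gt,
    "Less Than": operator.lt,
}


def matches(comparison_type, returned_value, comparison_value):
    """Return True if the returned value satisfies the comparison."""
    if comparison_type in ("Contains", "Does Not Contain"):
        found = str(comparison_value).lower() in str(returned_value).lower()
        return found if comparison_type == "Contains" else not found
    compare = NUMERIC_COMPARISONS[comparison_type]
    return compare(float(returned_value), float(comparison_value))


def evaluate(returned_value, warning, error):
    """warning and error are (comparison_type, comparison_value) pairs."""
    if matches(error[0], returned_value, error[1]):
        return "Error"
    if matches(warning[0], returned_value, warning[1]):
        return "Warning"
    return "OK"
```

For example, a returned temperature of 96 with a warning of Greater Than 90 and an error of Greater Than 110 evaluates to Warning.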

Finally, since we don't know what kind of value is being returned (temperature, disk space, etc.) we don't know what type of label to use when identifying the value. You will need to enter a Value Label to identify those values. For example, if you are running a check on the temperature of the device, you would enter degree in that field. It will automatically be pluralized as needed and will result in a final text string that looks something like 96 degrees.

SQL Query

This is another fun component that you can use for lots of things. Since a SQL query has access to everything in your database, you can monitor just about anything stored there. Here are a few ideas:

  • Number of pending Email messages to be sent
  • Number of pending SMS messages to be sent
  • Number of "Web Prospects" in the database that need to be dealt with
  • Number of active workflows that are more than 90 days old
  • Number of connections that are more than 60 days old

Configuration is fairly straightforward. You simply enter a SQL query that returns a single row of data. One column must be named Value and will be used as the value for comparison and for historical charting. You may also specify a column named Summary, which will be used as the summary text if the check returns an OK status. You should design your queries to be fast, but just in case you have one that may take a long time to run, you can specify a timeout in seconds in the Query Timeout field.
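The single-row contract described above (a required Value column and an optional Summary column) can be sketched in Python as follows. This uses an in-memory SQLite database purely as a stand-in so the example is runnable; the function name and the PendingMessage table are hypothetical and have nothing to do with the real Rock schema.

```python
import sqlite3


def read_check_result(connection, sql):
    """Run the query and return (value, summary) from its first row.

    Summary is None when the query does not return a Summary column.
    """
    cursor = connection.execute(sql)
    columns = [column[0] for column in cursor.description]
    row = cursor.fetchone()
    if row is None or "Value" not in columns:
        raise ValueError("query must return one row with a Value column")
    result = dict(zip(columns, row))
    return result["Value"], result.get("Summary")


# Stand-in data: three queued messages in a made-up table.
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE PendingMessage (Id INTEGER)")
demo.executemany("INSERT INTO PendingMessage VALUES (?)", [(1,), (2,), (3,)])
```

A query such as SELECT COUNT(*) AS Value, 'Queue depth' AS Summary FROM PendingMessage then yields a value of 3 together with the summary text.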

Next you can specify a warning comparison type and value as well as an error comparison type and value. If the returned value matches the Warning Comparison Type and the Warning Comparison Value then the check is put into the Warning state. If the returned value matches the Error Comparison Type and the Error Comparison Value then the check is put into the Error state. Since query results can be numerical or string values, the comparison types include two string comparisons. So if you are querying a string value that might contain the word "fail" if a problem exists, you can specify Contains fail to detect that condition.

Finally, since we don't know what kind of value is being returned (temperature, disk space, etc.) we don't know what type of label to use when identifying the value. You will need to enter a Value Label to identify those values. For example, if you are running a check on how many people are in the database, you would enter person in that field. It will automatically be pluralized as needed and will result in a final text string that looks something like 8,419 people.

To give you an idea of the kinds of things you can do, this is the query we use to monitor the CPU usage on our Azure SQL instance (note: this only works on Azure SQL and not on an on-premises SQL Server).

    -- Average CPU usage over the last five minutes, plus a readable summary.
    SELECT
        CAST(AVG(avg_cpu_percent) AS decimal(18, 2)) AS [Value],
        'Currently using ' + CAST(CAST(AVG(avg_cpu_percent) AS decimal(18, 2)) AS varchar(10)) + '% of ' + CAST(MAX(dtu_limit) AS varchar(10)) + ' DTUs.' AS [Summary]
    FROM sys.dm_db_resource_stats
    WHERE [end_time] >= DATEADD(MINUTE, -5, GETDATE())

A bit of information on what the above is doing. An Azure SQL database exposes resource statistics through the sys.dm_db_resource_stats view, with one row for each 15 second interval. Because we have the check configured to run every five minutes, we take all rows from the past five minutes and average them together. This gives us a five minute average value. Next we want a pretty summary string, so we take that same five minute average and also pull the DTU size the database is currently configured for. The final result is a summary string like Currently using 4.28% of 20 DTUs.

Note: This component must be run on the Rock server itself, which means you may need to specify the Rock server in the Collector Override when you attach it to a device profile.

Warning: Most of the SQL queries you run do not need to be updated every five minutes. Set your Check Interval to an appropriate value. For example, if you are monitoring the number of people in the Web Prospects role, you probably don't need to update that value every five minutes. Configure it to run hourly, or maybe even daily.

TCP Port Open

Wouldn't it be nice if you could monitor your Exchange server to ensure it hadn't crashed? Or your Linux hosts to make sure they are still responding to SSH connections correctly? That is exactly what the TCP Port Open component is for. At its most basic level, it ensures that it can successfully connect to the host on the given port number. These are specified by the Address and Port fields. You can also specify the time in milliseconds to wait for a connection with the Timeout field.

But just connecting to the port doesn't necessarily tell you things are working correctly. Most services send some form of "hello" string when you first open a connection to them. If the port you are connecting to is one of these, then you can enter a value in the Signature field to match against that text data it sends. This is a regular expression field which means you can do some pretty advanced matching. To see a few examples of how this works, take a look at the SSH Service, SMTP Service and IMAP4 Service checks.
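To give a feel for how a connect-and-match check works, here is a minimal Python sketch. The function names are hypothetical, the state names mirror the ones in this manual, and the sample patterns mentioned afterwards are illustrative only; they are not the patterns used by the built-in service checks.

```python
import re
import socket


def signature_matches(banner, signature):
    """Return True if the regular expression is found in the banner text."""
    return re.search(signature, banner) is not None


def check_port(address, port, timeout_ms, signature=None):
    """Connect to the port and optionally match the service's greeting."""
    try:
        with socket.create_connection((address, port), timeout=timeout_ms / 1000) as sock:
            if not signature:
                return "OK"  # a successful connection is enough
            banner = sock.recv(1024).decode("ascii", errors="replace")
    except OSError:
        # Covers refused connections, unreachable hosts and timeouts.
        return "Error"
    return "OK" if signature_matches(banner, signature) else "Error"
```

An SSH server normally greets clients with a line beginning with SSH-2.0-, so a pattern such as ^SSH-2\.0- would confirm that the thing answering on port 22 really is an SSH service and not something else that happens to hold the port open.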