Watchdog Monitor

Version 1.0.0

Last Updated Oct 13, 2018

Overview

In the simplest terms, this plugin provides a way for you to monitor devices on your network. Furthermore, you are likely interested in more than just "is the device up" monitoring. With that in mind, each monitored device can have multiple services monitored. But there is more to monitoring than just a screen showing the current state of everything. The various things you can do with this plugin are below:

  • Check the current state of a service.
  • See historical charts of a service.
  • See past events when a device or service was in a warning or error state.
  • Receive notifications when a service is in a warning or error state.
  • Schedule downtime windows where notifications are not sent (for example, maintenance windows).
  • Build custom dashboards to quickly see the state of your devices.

The above list is not a fully inclusive list, but it does cover the larger pieces.

Definitions

Below we will quickly cover the definitions used throughout this manual so that you have a better understanding of what we are talking about in the later configuration sections.

Devices

A device is a physical thing to be monitored. By itself, having a device exist in the system does not automatically collect data about it - that is where service checks come in. However, a device contains information about the device, such as IP address, so that the various service checks know how to gather the data they need. While we refer to a device as a physical thing, it does not need to be actual physical hardware. A device could also be a virtualized server for example.

Service Check Components

A component is basically the logic that drives a service check. For example, running an ICMP Ping against an IP address is a component. It contains the logic required to perform that check.

Service Check Types

A service check type defines how that component performs its check and what options are set. For example, an ICMP Ping is useless unless we know what IP address to ping, or what return time is considered an error. The configured Types define these parameters. Out of the box you will find three different ICMP Ping service check types: Ping LAN, Ping VPN, and Ping WAN. Each one has different options configured because, for example, a device on the LAN is expected to respond faster than one over a VPN.

Service Checks

A service check is a single instance in the database that ties a service check type to a device. So the Ping LAN attached to the Rock Server device would be the instance of the service check. It records the results from performing ICMP Pings against the Rock Server.

Device Profiles

Trying to configure every single device individually would be a pain. It would also be prone to errors: if you intend to configure eight devices the same way, you are more likely to forget something if you have to configure each one by itself. Profiles allow you to configure a "template" that all assigned devices will follow. As an example, you might create an HP Network Switch profile that is set up to monitor all these service check types:

  • Ping LAN
  • CPU Usage
  • CPU Temperature
  • Chassis Temperature
  • Power Supply Status

You would then assign all of your five HP switches to this profile. If you update the profile, all the devices pick up the new (or removed) service checks as well.

Device Groups

While on the surface these may sound like the same thing as the Device Profiles they serve very different purposes. A Device Group is simply what its name implies: A collection of devices. A group can contain devices of different profiles and a device can be a part of multiple groups. For example, you may have device Check-in Printer 01 in the following device groups: All Devices, Check-in Devices, and Printers.

Events

An event is when a specific service check (e.g. "Ping LAN on Rock Server") encounters a warning or error condition. An event stores the starting date time of the event and the ending date time of the event. This allows you to go back and review historical events and see how long they lasted.

Schedules

In regards to the Watchdog Monitor plugin, schedules are slightly different than Rock schedules. They use much of the same back-end data, but provide some additional functionality required for this plugin to work. Think of them as "advanced schedules". You might set up a schedule that covers the following three weekly times:

  • Saturday 4pm - 7pm
  • Sunday 7am - 1pm
  • Wednesday 5pm - 8pm

This schedule might reflect your checkin schedule which covers multiple days of the week and different times on each day.

Notification Groups

Notification Groups allow you to define what you get notified about, who gets those notifications, when they will be notified, and how they get notified. Let's break that down a bit more. In the following paragraphs, let's assume we are talking about a notification group called Checkin Notifications for things related to weekend checkin.

You specify what you get notified about by adding individual devices or device groups to a notification group. For example, you might add the device groups Checkin iPads and Checkin Printers in addition to the single device Rock Server as they are all related to checkin working properly.

Each notification group specifies who gets notified by adding People to the group. So you might have your IT person in charge of configuring checkin, your weekend IT person and maybe the Children's Pastor all in the group. They all would like to know when checkin had issues.

Next we need to specify when they get notified. Each notification group is assigned an (advanced) schedule. If an event happens on a device and none of the notification groups that device belongs to has an active schedule, then no notification is sent.

Finally, we need to know how to notify those individuals. Each person in a notification group has a Notification Method setting that specifies if they want an E-mail and/or an SMS sent to them. In this example, the on-duty IT person might want an SMS so they can go check it right away. The Children's Pastor and IT person in charge of configuration might just want an e-mail so they know about it.

Downtimes

Downtimes are another piece of the puzzle. These let you schedule times when notifications should not be sent - overriding the notification group schedules. For example, you might have two iPads sent out for repair and they won't be back for three weeks. You don't want to remove them from the notification groups as that would mean you need to remember to put them back in. Instead, you can create a downtime for those two iPads with the appropriate date range so that they will be considered "known offline" for that date period and no notifications will be sent.

Data Collectors

A collector is a process that runs the service checks. Out of the box your Rock server will act as a data collector. But that may not be good enough. Your Rock server may be behind a strict firewall that prevents it from talking to the internal LAN. Or you may be cloud hosted, in which case your Rock server is for sure not able to talk to your LAN. You can install one or more remote data collectors. These are Windows Services that run in the background and talk to your Rock server to determine what service checks need to be run, run them, and then upload the results to your Rock server.

Configuration

Service Check Types

Out of the box, we provide a number of standard service check types already configured. This should get you started pretty quickly, but you will probably want to customize these and/or create your own service check types to handle specific scenarios.

Edit Service Check Type

The first item to be configured with a service check type is the Provider (Component). This specifies the fundamental type of check to be performed. In this case, we are going to be configuring an ICMP Ping.

The timing of the checks is defined by the three settings: Check Interval, Recheck Interval and Recheck Count. When a service check is in a given state, the Check Interval is used to determine the number of minutes between checks. In this case, every five minutes the device will be pinged.

Before we talk about Recheck Count and Recheck Interval we need to discuss the concept of a "soft state". When a device is functioning normally, it's in the OK state. This is considered a "hard state": it is known to be true. If a single ping check comes back at 400ms that would be considered an Error state. However, this will not trigger an immediate notification. The reason is that this is considered a "soft state". We think the state might now be Error, but it could also just be a transient result. A blip in the network if you will. So a state change will go through a period of "soft state" before becoming a "hard state".

This is where the Recheck Count comes in. When a state changes (whether from OK to Error, vice-versa, or any other combination of state changes), it goes through this "soft state" for the number of checks specified in the Recheck Count. In this case, we run three additional checks before making the new state a "hard state". But three extra checks at five minutes each would mean waiting fifteen minutes before getting a notification. That is why the Recheck Interval is used when performing these "rechecks". In the above example, with a Recheck Interval of one minute, it will only wait an additional three minutes total (three checks at one minute intervals) before you get notified.
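The soft-state behavior described above can be pictured as a small state machine. The following Python sketch is only an illustration and is not the plugin's actual code. The class name `ServiceCheckState` and its fields are hypothetical, and it assumes that a result which disagrees with the pending soft state restarts or cancels the recheck sequence.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ServiceCheckState:
    """Hypothetical model of hard/soft state tracking for one service check."""
    check_interval: int = 5      # minutes between checks while in a hard state
    recheck_interval: int = 1    # minutes between checks while in a soft state
    recheck_count: int = 3       # agreeing rechecks needed to confirm a change
    hard_state: str = "OK"
    soft_state: Optional[str] = None
    rechecks_done: int = 0

    def record(self, result: str) -> int:
        """Record one check result and return minutes until the next check."""
        if result == self.hard_state:
            # Result agrees with the confirmed state: drop any pending change.
            self.soft_state = None
            self.rechecks_done = 0
        elif result != self.soft_state:
            # First sighting of a different state: start the soft-state period.
            self.soft_state = result
            self.rechecks_done = 0
        else:
            # A recheck agreed with the pending soft state.
            self.rechecks_done += 1
            if self.rechecks_done >= self.recheck_count:
                # Enough agreeing rechecks: the new state becomes the hard state.
                self.hard_state = result
                self.soft_state = None
                self.rechecks_done = 0
        return self.recheck_interval if self.soft_state else self.check_interval


state = ServiceCheckState()
# One failing check followed by three failing rechecks.
intervals = [state.record("Error") for _ in range(4)]
print(intervals, state.hard_state)
```

With the settings from the example (five minute checks, three rechecks at one minute each), the first failure and the first two rechecks are each followed by a one minute wait, and the third recheck confirms the Error state and returns to the five minute interval.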

The final section is all the configuration options of the specific Provider. Each provider has its own options and you will need to see the in-line help for specific information. One thing to note is that nearly every text-type field supports Lava, as you can see in the example above for the Address field.

Device Profiles

Device profiles, as mentioned previously, allow you to set up a configuration that multiple devices use. Below we will show a sample of how you might configure monitoring your PFSense firewall devices.

Edit Device Profile

So the name is rather self-explanatory. This is what the device profile is called and what shows up when you need to select a profile. The Icon CSS Class is used to provide the icon for any device that uses this profile.

We're going to make another segue and talk about the difference between a "device state" and the "device overall state". The former is tied to the Host Service Check in the image above. The Host Service Check is used to determine if the device itself is up or down. Or, said another way, it's used to determine the "device state". When the "device state" reports that the device is in the Error state, none of the other service checks for that device are run. In this example, if we cannot successfully ping the device, then we almost certainly won't be able to run the three web related checks. One primary reason this is done is that in the case of a ping failure you would (hopefully) only receive a single error notification rather than four total.
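To make the distinction concrete, here is a rough Python sketch of how a Host Service Check could gate the remaining checks. The function name `run_device_checks` is hypothetical, and treating the overall state as the worst state among all checks is an assumption based on the dashboard description later in this manual, not a statement about the plugin's internals.

```python
# Ordering used to pick the "worst" state (an assumption for this sketch).
SEVERITY = {"OK": 0, "Warning": 1, "Error": 2}


def run_device_checks(host_check, service_checks):
    """Return (device_state, overall_state, results) for one device.

    host_check is a callable returning a state string; service_checks maps
    a check name to a callable returning a state string.
    """
    device_state = host_check()
    results = {}
    if device_state != "Error":
        # Only run the other checks when the device itself is reachable.
        for name, check in service_checks.items():
            results[name] = check()
    # Overall state is the worst of the device state and every check result.
    overall_state = max([device_state, *results.values()],
                        key=SEVERITY.__getitem__)
    return device_state, overall_state, results


checks = {"HTTP": lambda: "OK", "SSL Certificate": lambda: "Warning"}
print(run_device_checks(lambda: "OK", checks))
print(run_device_checks(lambda: "Error", checks))
```

In the second call the host check fails, so no other check runs and only one Error would be reported for the device.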

So back to the sample screenshot, the Host Service Check is again what is used to determine if the device itself is up or down. The Check Schedule will be used to determine in what time period the service checks for the device will be executed. In this case, we are going to run checks seven days a week, twenty-four hours a day.

One device profile can inherit specific settings from another profile. These are which Service Checks are performed on the device as well as the SNMP settings, which specify how to authenticate for SNMP related checks. The bottom of the screenshot shows any service checks that are inherited from the parent profile(s). This inheritance is also why the Enabled checkbox is there. If a parent profile includes a service check that you don't want these devices to use, you can add the service check again and turn off the Enabled checkbox.

In this example, we are inheriting from SNMP Device which gives us the SNMP Uptime check. On top of that, we are going to add service checks to make sure that HTTP and HTTPS are working properly, as well as a check to make sure the SSL certificate is valid and not expiring in the near future. One final thing to note is the Collector Override setting. In a moment when we look at creating a Device, we will talk about the Collector the device uses. The Host Service Check will always use the configured Device Collector. But the other individual checks can override that and use a specific collector.
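The inheritance and the Enabled override described in the last two paragraphs can be pictured with a short Python sketch. It is illustrative only: the dictionary layout, the function names, the check names other than SNMP Uptime, and the "Quiet Firewall" profile are all made up for this example.

```python
def effective_checks(profile, profiles):
    """Resolve {check name: enabled} for a profile, parents first."""
    checks = {}
    parent = profiles[profile].get("parent")
    if parent is not None:
        checks.update(effective_checks(parent, profiles))
    # Entries on the profile itself win over inherited ones, so adding a
    # check again with enabled=False switches off the inherited check.
    checks.update(profiles[profile].get("checks", {}))
    return checks


def enabled_checks(profile, profiles):
    """Return the sorted names of checks that will actually run."""
    resolved = effective_checks(profile, profiles)
    return sorted(name for name, enabled in resolved.items() if enabled)


profiles = {
    "SNMP Device": {"checks": {"SNMP Uptime": True}},
    "PFSense Firewall": {
        "parent": "SNMP Device",
        "checks": {"HTTP": True, "HTTPS": True, "SSL Certificate": True},
    },
    # Hypothetical child profile that turns off the inherited uptime check.
    "Quiet Firewall": {
        "parent": "PFSense Firewall",
        "checks": {"SNMP Uptime": False},
    },
}

print(enabled_checks("PFSense Firewall", profiles))
print(enabled_checks("Quiet Firewall", profiles))
```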

Edit Device Profile SNMP

As you can see, the SNMP settings are very flexible and allow for just about any combination of settings that your devices may require. Some devices let you choose authentication and encryption options, others just mandate what you must support. So we decided to give you the whole kitchen sink. One thing to note: if you choose SNMP v1 it is actually running v2, but in our testing there has been little difference except that some devices claim to use v1 when they are actually using v2 - which causes some data to not return correctly unless we are also using v2.

Edit Device Profile NRPE

We'll talk more about NRPE checks later, but these are Nagios Passive checks. Meaning the data collector (Rock or other collector you install) will query the device via NRPE for its state. If you already have some servers using Nagios-style checks then you can configure some custom service check types to take advantage of those. One thing to note is that our NRPE checks do not support so-called "insecure encryption" that older Nagios systems use. This is a limitation of the libraries available to us and requires you to use either no encryption or full SSL certificate based encryption.

Devices

Okay, now we start getting into the fun stuff. Actually adding a device so we can run checks against it.

Edit Device

Most of these fields should be self-explanatory, so we'll only cover them briefly. The Name of the device is a user friendly name so you don't need to enter any DNS names here or anything like that. If you turn the Active checkbox off then all service checks will be disabled. Normally you will want to use a Downtime instead, but this could be useful if something is acting up at the end of the day as you are about to leave and you just want to turn off monitoring for the night before you go home.

The Address of the device is optional. If provided it can be either a DNS name or IP address. The Profile and Collector specify the obvious. The Parent is a way to build in automatic silencing of devices and service checks. For example, if the network switch for the Children's Building goes down, you don't want to get notifications about all the devices in the building. You know they are down because the switch is down. So you can build a virtual device tree. If a parent device is in an Error state then you will not receive notifications about any "child" devices.
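The parent-based silencing can be sketched as a walk up the device tree. This Python example is a simplified illustration with hypothetical names. In particular, checking every ancestor (rather than only the direct parent) is an assumption, and the device names other than Check-in Printer 01 are invented.

```python
def is_suppressed(device, parents, states):
    """Return True if any ancestor of the device is in the Error state.

    parents maps a device name to its parent name (or is missing the key
    for top-level devices); states maps a device name to its state.
    """
    seen = set()
    current = parents.get(device)
    while current is not None and current not in seen:
        if states.get(current) == "Error":
            return True
        seen.add(current)          # guard against an accidental loop
        current = parents.get(current)
    return False


parents = {
    "Check-in Printer 01": "Children's Building Switch",
    "Children's Building Switch": "Core Switch",
}
states = {
    "Core Switch": "OK",
    "Children's Building Switch": "Error",
    "Check-in Printer 01": "Error",
}

# The printer is behind a switch that is down, so its notification is silenced.
print(is_suppressed("Check-in Printer 01", parents, states))
# The switch's own parent is fine, so the switch itself still notifies.
print(is_suppressed("Children's Building Switch", parents, states))
```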

As we previously mentioned, each device can be a member of any number of groups. This allows you to collect like devices into a single group and monitor their status on the Dashboards collectively. As an example, you might have a group for Network Switches and put all your switches in that group. This way on your dashboard you can have a single monitored item called "Network Switches" and know at a glance that all your switches are good.

Finally, each device allows you to override the SNMP Settings and NRPE Settings inherited from the Profile. If all your PFSense Routers use the same SNMP settings except one, you don't need to create a whole new profile just to specify those settings.

Schedules

Schedules are fairly straightforward, though it may take you a few minutes to think through how these advanced schedules are constructed. These advanced schedules are simply a collection of the schedules you are already familiar with. To simplify configuration, a single schedule component cannot go past midnight.

Edit Schedule

In our example, we have a single component that is Daily at 12:00 AM and runs for 24 hours. This gives us a standard 24/7 type schedule. A more advanced setup might be for a "checkin schedule" that covers all the various times check-in devices are active and set up to be monitored. Such a schedule might contain the following component schedules:

  • Saturday at 4:30 PM and runs for 4 hours.
  • Sunday at 7:00 AM and runs for 5.5 hours.
  • Wednesday at 6:00 PM and runs for 2 hours.
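As a rough illustration of how such a schedule could be evaluated, the Python sketch below encodes the three components listed above and tests whether a given moment falls inside any of them. The tuple layout and the function name `is_active` are assumptions made for this example. Because a component cannot go past midnight, comparing against the same calendar day is enough.

```python
from datetime import datetime, timedelta

# (weekday, start hour, start minute, duration in hours); Monday is 0.
CHECKIN_SCHEDULE = [
    (5, 16, 30, 4.0),   # Saturday at 4:30 PM for 4 hours
    (6, 7, 0, 5.5),     # Sunday at 7:00 AM for 5.5 hours
    (2, 18, 0, 2.0),    # Wednesday at 6:00 PM for 2 hours
]


def is_active(schedule, moment):
    """Return True if the moment falls inside any schedule component."""
    for weekday, hour, minute, duration in schedule:
        if moment.weekday() != weekday:
            continue
        start = moment.replace(hour=hour, minute=minute,
                               second=0, microsecond=0)
        # The end time is exclusive.
        if start <= moment < start + timedelta(hours=duration):
            return True
    return False


# October 13, 2018 was a Saturday.
print(is_active(CHECKIN_SCHEDULE, datetime(2018, 10, 13, 17, 0)))
print(is_active(CHECKIN_SCHEDULE, datetime(2018, 10, 15, 12, 0)))
```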

Notification Groups

The notification group lets you create a very customized way to send notifications.

Edit Notification Group

The Schedule specifies when notifications for this group will be sent. This schedule applies to both the immediate notifications and the hourly notifications. Immediate notifications are sent when a single service changes state and tell you about that one service. The hourly notifications are aggregate and tell you all the service checks that are currently in a non-OK state.

You can dictate which states you want to be notified about by selecting them in the State checkboxes. For immediate notifications, these are the states that a service check must change to in order for the notification to be sent. For the hourly aggregate notifications, these are the states that the service check must be in for the notification to be sent. The exception to this, as we just mentioned, is that the OK state is always ignored for aggregate notifications. You don't really want an e-mail every hour telling you how many services are OK, do you?

Each notification group can be tied to either Device Groups or individual Devices. A device can be referenced multiple times via individual reference and group reference. It will not cause any issues and you will not receive multiple notifications for it.

Notification Group Members

So we have configured what devices we are going to send notifications about, but we need to also specify who will receive those notifications. You can add people to the Members list and specify whether they receive Email notifications or SMS notifications (or both).

One final thing to note is that when notifications are sent, the system builds an aggregate list of all notification groups and the people in them. This means that if a device is matched in three different notification groups and you are listed in two of the groups as Email notification and one group as both Email and SMS, then you will only receive one notification of each type: one Email and one SMS.
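The aggregation described above amounts to merging each person's notification methods across every matching group. The Python sketch below shows the idea. The data layout, the function name `collect_recipients`, and the group and person names are all hypothetical.

```python
def collect_recipients(device, groups):
    """Return {person: set of methods}, merged across all matching groups."""
    recipients = {}
    for group in groups:
        if device not in group["devices"]:
            continue
        for person, methods in group["members"].items():
            # A set keeps each method once, however many groups list it.
            recipients.setdefault(person, set()).update(methods)
    return recipients


groups = [
    {"devices": {"Rock Server"}, "members": {"Ted": {"Email"}}},
    {"devices": {"Rock Server"}, "members": {"Ted": {"Email"}}},
    {"devices": {"Rock Server"}, "members": {"Ted": {"Email", "SMS"}}},
    {"devices": {"Check-in Printer 01"}, "members": {"Cindy": {"SMS"}}},
]

# Ted is matched three times but ends up with one Email and one SMS.
print(collect_recipients("Rock Server", groups))
```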

Data Collectors

Out of the box, your Rock server is configured to be a data collector. There is a Job that is configured to run every minute and process any service checks that need to be run - and are configured to use the Rock server as the data collector.

However, this may not work in your environment as your Rock server may not have access to your internal network. In this case, you can go to the Power Tools > External Applications and download the stand-alone installer. This will install a Windows Service that can run and perform service checks on that Windows server. It is worth noting that you can do this on as many Windows servers as you want. So if you want to put a data collector at each site, feel free. Another thing to keep in mind is security. Because you are sending potentially sensitive data (such as SNMP authentication settings) over the network, it is important to use an SSL connection from the collector to your server.

Once you have installed the remote Data Collector you need to configure it to talk to your server. Currently, this is done by going to your Defined Types page and looking for the Watchdog Monitor Collectors type. Open that up and add a new Value. The Value is just a user-friendly name that you will see when selecting Collectors. The Authentication Key can be anything, but we recommend a long sequence of random characters. This is used to identify your remote collectors and is also used as the password for the collector. As such, each collector must have a unique Authentication Key.
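If you need a quick way to produce a long random key, one option is a few lines of Python using the standard `secrets` module. This is just one possible approach and is not part of the plugin. The function name `new_authentication_key` is made up for this example.

```python
import secrets


def new_authentication_key(num_bytes: int = 32) -> str:
    """Return a URL-safe random string built from num_bytes random bytes."""
    return secrets.token_urlsafe(num_bytes)


# Generate a separate key for each collector.
print(new_authentication_key())
```

Run it once per collector and paste the result into the Authentication Key field, so that no two collectors share a key.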

On the Windows server, you should see a new application in your Start Menu called Watchdog Monitor Collector Service. Run this and configure the URL used to communicate with your server (for example, https://rock.rocksolidchurchdemo.com/) and the Authentication Key you created for the collector.

Dashboards

You can create as many dashboards as you want, each showing the same or different devices. The dashboards are designed with Lava so you can style them any way you want. Below we show two of the default dashboard styles you can use: a list, and buttons.

Dashboard Sample

The top section shows individual devices as a table. The Rock Server has a yellow background for its Name and Service Checks because one of the service checks is in a warning state. The Device State column is still green because the device itself is still OK; that is, it's responding to Pings correctly.

The second section also shows individual devices, but uses a layout similar to the internal Rock "page-menus" that show up as blocks. Again, the Rock Server shows a yellow background but the icon still shows up as green indicating that the overall state is warning, but the device state itself is still OK.

The final section also shows as a block, but these are showing the overall states of the two device groups we have defined. So we can see at a glance that something in our All Devices group is in a warning state. However, we can also see at a glance that everything at Home is working correctly.

Service Check Components

DNS Blacklist Lookup

This component allows you to check if your mail server is present on one or more DNS Blacklists. These are free lists that many mail servers use to determine whether or not an incoming message is spam. If the sending server is listed in one of the lists then many mail servers will reject the message.

Whether you run your own on-premise mail server to send e-mail or use a mail provider like Mailgun or SendGrid, you can configure a service check with the IP address (or DNS name) of your mail server and monitor if it is on any spam blacklists. If so you can then follow up and determine why and work to get it removed. This allows you to stay ahead of the game and not find out you have been blacklisted after people start complaining they are not getting your e-mails - which usually happens some time after you got blacklisted.

You can choose from the existing list of possible DNS blacklists to query (these are the most common) or if you want to query against one or more lists that are not currently options, you can enter them in the Custom Lists field. These would be entered as one or more comma separated DNS list names.
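For background, a DNS blacklist is normally queried by reversing the octets of the IPv4 address, appending the list's DNS name, and doing an ordinary A lookup: a successful answer means the address is listed. The Python sketch below shows that convention. It is not the plugin's code, the function names are hypothetical, and `zen.spamhaus.org` appears only as an example of a list name.

```python
import ipaddress
import socket


def dnsbl_query_name(ip: str, zone: str) -> str:
    """Build the DNS name to look up for an IPv4 address on one list."""
    octets = str(ipaddress.IPv4Address(ip)).split(".")
    return ".".join(reversed(octets)) + "." + zone


def parse_custom_lists(text: str) -> list:
    """Split a comma separated Custom Lists value into list names."""
    return [name.strip() for name in text.split(",") if name.strip()]


def is_listed(ip: str, zone: str) -> bool:
    """Return True if the lookup resolves, which means the IP is listed.

    Note: this simple sketch also returns False when the resolver itself
    fails, so it cannot tell "not listed" apart from "lookup failed".
    """
    try:
        socket.gethostbyname(dnsbl_query_name(ip, zone))
        return True
    except socket.gaierror:
        return False


print(dnsbl_query_name("192.0.2.10", "zen.spamhaus.org"))
print(parse_custom_lists("bl.example.org, dnsbl.example.net"))
```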

DNS Lookup

This component allows you to configure a generic DNS lookup test. If you want to ensure that a specific hostname always resolves to a particular IP address you can set that up here. You can also configure it to just ensure that the name resolves to something rather than erroring out. This type of configuration helps you ensure that your DNS server is working at all so that you can investigate why it stopped responding to DNS queries.

The Hostname is the DNS name you want to resolve back to an IP address. Query Type allows you to select between an IPv4 A and an IPv6 AAAA lookup. If you want to verify the result against a specific, expected, value, then you can enter it in the Expected field. By default the component will use the default DNS server, but you can override that by specifying the name or IP address of the DNS server you wish to query in the Server field.

Currently only A and AAAA records are supported. We may include support for other query types in the future such as PTR and TXT.

HTTP Certificate

Hopefully we all have our sites secured with an SSL certificate. Hopefully we also have some sort of automated renewal process in place, like using the Acme Certificate plugin. But sometimes we can't use automated renewal. Or maybe you want to monitor the SSL certificates of a non-Rock server or device. This component allows you to check if the SSL Certificate for the given web address is valid and not expiring too soon.

The URL checked is specified by the Address setting and must include the https:// prefix to work correctly. If the time until expiration is less than the Warning Threshold or Error Threshold values, as specified in days, then the check will enter the Warning and Error states respectively. The Timeout value allows you to specify how long to wait for the server to respond and is specified in milliseconds. This helps prevent the check from taking a really long time to report a failure if the server is offline or otherwise not responding in a timely manner.

The component currently checks both the expiration date as well as the validity of the certificate. Meaning if you try to check a self-signed certificate it will report an error because it will be treated as not valid. This also means if the certificate is for www.rocksolidchurchdemo.com but you put https://rock.rocksolidchurchdemo.com in the Address field it will also report an error because the names do not match (normally you would have both names listed in your certificate though). In the future we may add an option to only check the expiration date.
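The checks described above can be approximated in a few lines of Python with the standard `ssl` module. This is a sketch under assumptions, not the component's implementation: the function names are hypothetical, and it takes a hostname rather than a full URL. Python's default SSL context also rejects self-signed certificates and name mismatches, which mirrors the validity behavior described here.

```python
import socket
import ssl
from datetime import datetime, timezone


def classify_expiry(days_left, warning_days, error_days):
    """Map the days until expiration to a state using the two thresholds."""
    if days_left < error_days:
        return "Error"
    if days_left < warning_days:
        return "Warning"
    return "OK"


def days_until_expiry(hostname, port=443, timeout_ms=5000):
    """Connect with full validation and return days until the cert expires.

    Raises ssl.SSLCertVerificationError for an invalid certificate and
    OSError (including timeouts) if the server cannot be reached.
    """
    context = ssl.create_default_context()
    address = (hostname, port)
    with socket.create_connection(address, timeout=timeout_ms / 1000) as sock:
        with context.wrap_socket(sock, server_hostname=hostname) as tls:
            not_after = tls.getpeercert()["notAfter"]
    expires = datetime.fromtimestamp(ssl.cert_time_to_seconds(not_after),
                                     tz=timezone.utc)
    return (expires - datetime.now(timezone.utc)).total_seconds() / 86400


# Example thresholds: warn under 30 days, error under 7 days.
print(classify_expiry(45, 30, 7), classify_expiry(20, 30, 7),
      classify_expiry(3, 30, 7))
```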

HTTP Response

This component will test to make sure the given URL is responding in a timely manner. It does not care what the actual content returned is, as long as the server returns a 2xx success code. You specify the Warning Threshold and Error Threshold values in milliseconds, and if the server takes longer than those values to respond then the check enters a Warning or Error state respectively. The Timeout specifies how long to wait for a response before giving up and recording it as a timeout.

The URL queried is specified by the Address field. It can be either an http:// or https:// address. Additionally, you don't need to limit it to just the root page of the site. If you have a decent amount of logic on a specific page of your site that takes a bit of time to process, you can set up a check to target that one page and make sure the time to process hasn't crept up to an unacceptable level.

ICMP Ping

This is the most basic component we have. It simply tests if a device is "alive" by sending what is called a PING packet to the device. Normally a device will respond and you use the time difference to determine if there is a network problem between the two devices. A device is not required to respond to a Ping, and many firewalls (for example Azure's firewall) actually block them. But if it is a device on your own network, most likely it will respond to a Ping.

So with this you can monitor devices to see if they are online and plugged into the network. This is often helpful with devices that are expected to be on and plugged in 24 hours a day, such as servers, printers, switches, etc.

The Address contains the hostname or IP address of the device to be pinged. If the response time is greater than the Warning Threshold or Error Threshold, specified in milliseconds, then the check will return a Warning or Error state respectively. The Number of Packets allows you to specify how many packets to send and receive. The average response time of all packets will be used in calculating the round trip time.

NRPE Value

Many organizations that already do some sort of monitoring have pre-existing Nagios-style checks on their servers. Many of these operate over the NRPE protocol. If you have these checks installed, or plan to install them for better monitoring of your servers, you can use this component to check the state of those checks.

The Address field contains the hostname or IP address to be contacted to perform the check. After connecting it will send the Command field as the check to be performed and wait for the results. You can specify how long to wait by the Timeout field which is the number of milliseconds before it gives up.

Nagios checks are capable of returning multiple performance index values. For example, their version of an ICMP Ping check returns two performance values: the round trip time, and the packet loss percentage. We only support accessing one performance index so you specify which one to retrieve with the Performance Index field. Normally these are ordered by importance so the first index (zero) is usually the one you want.
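For reference, Nagios-style plugins put their performance values after a `|` character, as space separated `label=value[unit];warn;crit;min;max` items. The Python sketch below shows how one value could be picked out by index. It is a simplified illustration rather than the plugin's parser: the function name is hypothetical, the sample output line is made up to resemble a ping check, and labels containing spaces are not handled.

```python
import re


def performance_value(output: str, index: int = 0):
    """Return (label, value, unit) for one performance item by index."""
    _, _, perfdata = output.partition("|")
    items = perfdata.split()
    label, _, rest = items[index].partition("=")
    raw = rest.split(";")[0]            # drop the warn/crit/min/max fields
    match = re.match(r"-?\d+(?:\.\d+)?", raw)
    if match is None:
        raise ValueError("no numeric value in " + repr(items[index]))
    return label, float(match.group()), raw[match.end():]


sample = ("PING OK - Packet loss = 0%, RTA = 0.52 ms"
          "|rta=0.520000ms;100.000000;500.000000;0.000000 pl=0%;20;60;0")

print(performance_value(sample, 0))   # round trip time
print(performance_value(sample, 1))   # packet loss percentage
```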

Another thing that Nagios checks do is return their own state of Ok, Warning and Error. This is based on the internal configuration. If you want to trust these results as truth then you can set the Trust Result to Yes. Doing so will ignore any comparison values you may enter.

Assuming you don't trust the result, you can specify a warning comparison type and value as well as an error comparison type and value. If the returned value matches the Warning Comparison Type and the Warning Comparison Value then the check is put into the Warning state. If the returned value matches the Error Comparison Type and the Error Comparison Value then the check is put into the Error state.

Finally, since we don't know what kind of value is being returned (temperature, disk space, etc.) we don't know what type of label to use when identifying the value. You will need to enter a Value Label to identify those values. For example, if you are running a check on how many days the device has been running, you would enter day in that field. It will automatically be pluralized as needed and will result in a final text string that looks something like 23 days.

Plugin Updates

This is a fun component. It allows you to check if any of the plugins you have installed have an update available. This only checks plugins that are actually installed on the server and are capable of being upgraded. This means if a plugin has an update but it requires a newer version of Rock than you have installed it is not counted.

Currently there are no configurable options for this component. It will automatically enter a warning state if there are plugin updates available. There is also a three day delay before an update is considered available. This allows time for the developer to do a final test install from the Rock Shop and have time to pull it if problems are discovered with the packaging.

Note: This component must be run on the Rock server itself, which means you may need to specify the Rock server in the Collector Override when you attach it to a device profile.

SNMP Uptime

If you are monitoring a network device such as a printer, network switch, UPS, etc. then it probably supports being monitored by a protocol called SNMP. Working with SNMP can be tricky and while there is a component for checking any arbitrary value we gave you the most common one you will be using as a self-contained component. This component will query the device via SNMP and check how long it has been up and online. If it is below a certain threshold (indicating a recent reboot) then it will enter either a Warning or Error state.

You specify the hostname or IP address to connect to by the Address field. If the returned system uptime is less than the minutes specified in Warning Threshold then it will enter a Warning state. If the system uptime is less than the minutes specified in Error Threshold then it will enter an Error state. The Timeout indicates the number of milliseconds to wait for a response from the device.
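To show the threshold logic with numbers, here is a small Python sketch. It assumes the uptime comes from the standard SNMP sysUpTime value, which is reported in hundredths of a second. Whether the component reads exactly that value is an assumption, and the function names are hypothetical.

```python
def uptime_minutes(timeticks: int) -> float:
    """Convert SNMP TimeTicks (hundredths of a second) to minutes."""
    return timeticks / 100 / 60


def classify_uptime(timeticks, warning_minutes, error_minutes):
    """Return a state based on how recently the device rebooted."""
    minutes = uptime_minutes(timeticks)
    if minutes < error_minutes:
        return "Error"
    if minutes < warning_minutes:
        return "Warning"
    return "OK"


# Example thresholds: Warning under 60 minutes, Error under 10 minutes.
print(classify_uptime(30_000, 60, 10))      # 5 minutes of uptime
print(classify_uptime(180_000, 60, 10))     # 30 minutes of uptime
print(classify_uptime(8_640_000, 60, 10))   # 1 day of uptime
```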

It should be noted that just because a device supports SNMP does not mean it will automatically respond to SNMP queries. You will need to configure the SNMP Settings to match the device's own configuration otherwise you will probably get timeout errors.

SNMP Value

So we just talked about the SNMP Uptime component. That is great if all you want to know is how long the device has been running. But SNMP actually exposes a lot of data for you to monitor. For example, most printers will report how much toner they have left, or how full the paper trays are. A network switch will often report the internal temperature. Most devices also report an "overall status" that wouldn't tell you specifically what is wrong, but would basically let you see remotely that pesky warning indicator on the switch stuffed in the closet on the other side of the building.

To achieve that, you have this component. This is probably one of the most difficult components to set up, purely because there is no universal standard for which OID number a device will use to expose its data. You have to find these OID numbers in technical manuals or by trial and error. However, once you know an OID number, you can re-use it to check other devices of the same make and model.

The Address, as with most other checks, specifies the hostname or IP address to connect to. The OID is where you specify which value you are interested in, and is expressed as a sequence of integers separated by periods, such as 1.3.6.1.2.1.1.1.0. To ensure that the check does not sit waiting forever for a response, you can specify a timeout in milliseconds in the Timeout field.

Next you can specify a warning comparison type and value as well as an error comparison type and value. If the returned value matches the Warning Comparison Type and the Warning Comparison Value then the check is put into the Warning state. If the returned value matches the Error Comparison Type and the Error Comparison Value then the check is put into the Error state. Since SNMP values can be numerical or string values, the comparison types include two string comparisons. So if you are querying a string value that might contain the word "fail" if a problem exists, you can specify Contains fail to detect that condition.
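
To make the comparison rules above more concrete, here is a small sketch of how a returned value might be evaluated. It is an illustration only and not the plugin's code: the function names are hypothetical, the comparison type names other than Contains are invented for the example, the Contains match is assumed to ignore case, and Error is assumed to take precedence over Warning.

```python
def matches(value, comparison_type, comparison_value):
    """Return True when value satisfies the comparison.

    The comparison type names are hypothetical, except "Contains",
    which is mentioned in the text above.
    """
    if comparison_type == "Contains":
        # Assumption: string matching ignores case.
        return str(comparison_value).lower() in str(value).lower()
    if comparison_type == "Equal To":
        return str(value) == str(comparison_value)

    # The remaining comparisons are numeric.
    number, limit = float(value), float(comparison_value)
    if comparison_type == "Less Than":
        return number < limit
    if comparison_type == "Greater Than":
        return number > limit
    raise ValueError(f"Unknown comparison type: {comparison_type}")


def evaluate(value, warning, error):
    """warning and error are (comparison_type, comparison_value) pairs."""
    # Assumption: Error takes precedence over Warning.
    if matches(value, *error):
        return "Error"
    if matches(value, *warning):
        return "Warning"
    return "OK"


print(evaluate("Fan 2 failure detected", ("Contains", "degraded"), ("Contains", "fail")))
print(evaluate(72, ("Greater Than", 70), ("Greater Than", 85)))
```

The first call prints Error because the returned string contains "fail". The second prints Warning because 72 is above the warning value of 70 but not above the error value of 85.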

Finally, since we don't know what kind of value is being returned (temperature, disk space, etc.) we don't know what type of label to use when identifying the value. You will need to enter a Value Label to identify those values. For example, if you are running a check on the temperature of the device, you would enter degree in that field. It will automatically be pluralized as needed and will result in a final text string that looks something like 96 degrees.

SQL Query

This is another fun component that you can do lots of things with. Since a SQL query has access to everything in your database, you can monitor just about anything stored there. Here are a few ideas:

  • Number of pending Email messages to be sent
  • Number of pending SMS messages to be sent
  • Number of "Web Prospects" in the database that need to be dealt with
  • How many active workflows are more than 90 days old
  • How many connections are more than 60 days old

Configuration is fairly straightforward. You simply enter a SQL query that returns a single row of data. One column must be named Value and will be used as the value for comparison and for historical charting. You may also return a column named Summary, which will be used as the summary text if the check returns an OK status. You should design your queries to be fast, but just in case you have one that may take a long time to run, you can specify a timeout in seconds in the Query Timeout field.

Next you can specify a warning comparison type and value as well as an error comparison type and value. If the returned value matches the Warning Comparison Type and the Warning Comparison Value then the check is put into the Warning state. If the returned value matches the Error Comparison Type and the Error Comparison Value then the check is put into the Error state. Since the values returned by a SQL query can be numerical or string values, the comparison types include two string comparisons. So if you are querying a string value that might contain the word "fail" if a problem exists, you can specify Contains fail to detect that condition.

Finally, since we don't know what kind of value is being returned (temperature, disk space, etc.) we don't know what type of label to use when identifying the value. You will need to enter a Value Label to identify those values. For example, if you are running a check on how many people are in the database, you would enter person in that field. It will automatically be pluralized as needed and will result in a final text string that looks something like 8,419 people.

To give you an idea of the kinds of things you can do, this is the query we use to monitor the CPU usage on our Azure SQL instance (note: this only works on Azure SQL and not on an on-premises SQL Server).

SELECT
    CAST(AVG(avg_cpu_percent) AS decimal(18, 2)) AS [Value],
    'Currently using ' + CAST(CAST(AVG(avg_cpu_percent) AS decimal(18, 2)) AS varchar(10)) + '% of ' + CAST(MAX(dtu_limit) AS VARCHAR(10)) + ' DTUs.' AS [Summary]
FROM sys.dm_db_resource_stats
WHERE [end_time] >= DATEADD(MINUTE, -5, GETDATE())

Here is a bit of information on what the query above is doing. An Azure SQL database exposes statistical data through the sys.dm_db_resource_stats view, which holds one row of averages for every 15 seconds. Because we have the check configured to run every five minutes, we take all rows from the past five minutes and average them together. This gives us a five minute average value. Next we want a pretty summary string, so we take that same five minute average and also pull the DTU limit the database is currently configured for. The final result is a summary string like Currently using 4.28% of 20 DTUs.

Note: This component must be run on the Rock server itself, which means you may need to specify the Rock server in the Collector Override when you attach it to a device profile.

Warning: The values these SQL queries return usually do not need to be refreshed every five minutes. Set your Check Interval to an appropriate value. For example, if you are monitoring the number of people in the Web Prospects role, you probably don't need to update that value every five minutes. Configure it to run hourly, or maybe even daily.

TCP Port Open

Wouldn't it be nice if you could monitor your Exchange server to ensure it hadn't crashed? Or your Linux hosts to make sure they are still responding to SSH connections correctly? That is exactly what the TCP Port Open component is for. At its most basic level, it ensures that it can successfully connect to the host on the given port number. These are specified by the Address and Port fields. You can also specify the time in milliseconds to wait for a connection with the Timeout field.

But just connecting to the port doesn't necessarily tell you things are working correctly. Most services send some form of "hello" string when you first open a connection to them. If the port you are connecting to is one of these, then you can enter a value in the Signature field to match against that text data it sends. This is a regular expression field which means you can do some pretty advanced matching. To see a few examples of how this works, take a look at the SSH Service, SMTP Service and IMAP4 Service checks.
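
The behavior described above can be approximated with a short sketch. This is not the component's actual code: the function name check_tcp_port and its return values are hypothetical, and it assumes that a failed connection, a timeout, or a greeting that does not match the signature all count as an Error.

```python
import re
import socket


def check_tcp_port(address, port, timeout_ms, signature=None):
    """Connect to address:port and optionally match the greeting text.

    Returns "OK" or "Error". signature is a regular expression, like the
    Signature field; when it is None only the connection is tested.
    """
    timeout = timeout_ms / 1000  # the Timeout field is in milliseconds
    try:
        with socket.create_connection((address, port), timeout=timeout) as conn:
            if signature is None:
                return "OK"
            # Read whatever greeting the service sends first.
            banner = conn.recv(1024).decode("ascii", errors="replace")
    except OSError:
        # Covers refused connections, unreachable hosts, and timeouts.
        return "Error"

    return "OK" if re.search(signature, banner) else "Error"
```

For example, an SSH server typically opens with a version string beginning with SSH-2.0, so a call such as check_tcp_port("192.0.2.10", 22, 5000, r"^SSH-2\.0") (the address here is a placeholder) would only report OK when something that looks like an SSH service answers on that port.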