Service and Host Result Freshness Checks

Introduction

Nagios supports a feature that does "freshness" checking on the results of host and service checks. This feature is useful when you want to ensure that passive checks are being received as frequently as you want. Although freshness checking can be used in a number of situations, it is primarily useful when attempting to configure a distributed monitoring environment.

The purpose of "freshness" checking is to ensure that host and service checks are being provided passively by external applications on a regular basis. If the results of a particular host or service check (for which freshness checking has been enabled) are determined to be "stale", Nagios will force an active check of that host or service.

Host vs. Service Freshness Checking

The documentation below discusses service freshness checking. Host freshness checking (which is not documented separately) works in a similar way to service freshness checking - except, of course, that it's for hosts instead of services. If you need to configure host freshness checking, adjust the directions given below appropriately.

Configuring Service Freshness Checking

Before you configure a per-service freshness threshold, you must enable freshness checking using the check_service_freshness and service_freshness_check_interval directives in the main config file. If you were configuring host freshness checking, you would use the check_host_freshness and host_freshness_check_interval directives.
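As a minimal sketch, the corresponding lines in the main config file might look like the following. The interval values (in seconds) are illustrative assumptions, not recommendations; choose values that suit your environment.

```cfg
# Main config file: enable freshness checking program-wide.
# The 60-second intervals below are example values only.
check_service_freshness=1
service_freshness_check_interval=60

# Equivalent directives for host freshness checking
check_host_freshness=1
host_freshness_check_interval=60
```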

So how do you go about enabling freshness checking for a particular service? You need to configure service definitions as follows.

The check_freshness option in the service definition should be set to 1. This enables "freshness" checking for the service.
The freshness_threshold option in the service definition should be set to a value (in seconds) which reflects how "fresh" the results for the service should be.
The check_command option in the service definition should name a valid command that is used to actively check the service when it is detected as being "stale".

How The Freshness Threshold Works

Nagios periodically checks the "freshness" of the results for all services that have freshness checking enabled. The freshness_threshold option in each service definition is used to determine how "fresh" the results for each service should be. For example, if you set the freshness_threshold option to 60 for one of your services, Nagios will consider that service to be "stale" if its results are older than 60 seconds (1 minute). If you do not specify a value for the freshness_threshold option (or you set it to zero), Nagios will automatically calculate a "freshness" threshold to use by looking at either the normal_check_interval or retry_check_interval options (depending on what type of state the service is currently in).
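The comparison described above can be sketched in a few lines of shell. This is a simplified model for illustration only, not the actual Nagios implementation; the function name is_stale and its parameters are hypothetical, and all times are in seconds.

```shell
#!/bin/sh
# Illustration only: a simplified model of the staleness test.
# Usage: is_stale NOW LAST_RESULT_TIME THRESHOLD   (hypothetical helper)
is_stale() {
    now=$1
    last_result=$2
    threshold=$3
    age=$((now - last_result))
    # A result is stale once it is older than the threshold.
    [ "$age" -gt "$threshold" ]
}

# A result received 90 seconds ago, checked against a 60-second threshold
if is_stale 1000 910 60; then
    echo "stale"
else
    echo "fresh"
fi
```

With a freshness_threshold of 60, a result that is 90 seconds old is reported as stale, while one that is 50 seconds old would be considered fresh.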

What Happens When A Service Check Result Becomes "Stale"

If the check results of a service are found to be "stale" (as described above), Nagios will force an active check of the service by executing the command specified by the check_command option in the service definition. It is important to note that an active service check which is being forced because the service was detected as being "stale" gets executed even if active service checks are disabled on a program-wide or service-specific basis.

Working With Passive-Only Checks

As I mentioned earlier, freshness checking is of most use when you are dealing with services that get their results from passive checks. More often than not (as is the case with distributed monitoring setups), these services may be getting all of their results from passive checks - no results are obtained from active checks.

An example of a passive-only service might be one that reports the status of your nightly backup jobs. Perhaps you have an external script that submits the results of the backup job to Nagios once the backup is completed. In this case, all of the checks/results for the service are provided by an external application using passive checks. In order to ensure that the status of the backup job gets reported every day, you may want to enable freshness checking for the service. If the external script doesn't submit the results of the backup job, you can have Nagios fake a critical result by doing something like this...

Here's what the definition for the service might look like (some required options are omitted)...

define service{
	host_name		backup-server
	service_description	ArcServe Backup Job
	active_checks_enabled	0			; active checks are NOT enabled
	passive_checks_enabled	1			; passive checks are enabled (this is how results are reported)
	check_freshness		1
	freshness_threshold	93600			; 26 hour threshold, since backups may not always finish at the same time
	check_command		no-backup-report	; this command is run only if the service results are "stale"
	...other options...
	}

Notice that active checks are disabled for the service. This is because the results for the service are provided only by an external application using passive checks. Freshness checking is enabled and the freshness threshold has been set to 26 hours. This is a bit longer than 24 hours because backup jobs sometimes run late from day to day (depending on how much data there is to back up, how much network traffic is present, etc.). The no-backup-report command is executed only if the results of the service are determined to be "stale". The definition of the no-backup-report command might look like this...

define command{
	command_name	no-backup-report
	command_line	/usr/local/nagios/libexec/nobackupreport.sh
	}

The nobackupreport.sh script in your /usr/local/nagios/libexec directory might look something like this:

#!/bin/sh

/bin/echo "CRITICAL: Results of backup job were not reported!"

exit 2

If Nagios detects that the service results are stale, it will run the no-backup-report command as an active service check (even though active checks are disabled for this specific service - remember that this is a special case). This causes the /usr/local/nagios/libexec/nobackupreport.sh script to be executed, which returns a critical state. The service goes into a critical state (if it isn't already there) and someone will probably get notified of the problem.
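For the other half of this arrangement, the external backup script has to submit a passive result when the job finishes. A minimal sketch of that step follows, assuming results are submitted by writing a PROCESS_SERVICE_CHECK_RESULT line to the external command file. The command file path shown is an assumption (it varies by installation), and format_result is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: submit a passive service check result via the external command file.
# CMD_FILE is an assumed location; adjust it to match your installation.
CMD_FILE="${CMD_FILE:-/usr/local/nagios/var/rw/nagios.cmd}"

# Usage: format_result HOST SERVICE RETURN_CODE OUTPUT   (hypothetical helper)
# Prints one line in the form:
#   [timestamp] PROCESS_SERVICE_CHECK_RESULT;host;service;return_code;output
format_result() {
    printf '[%s] PROCESS_SERVICE_CHECK_RESULT;%s;%s;%s;%s\n' \
        "$(date +%s)" "$1" "$2" "$3" "$4"
}

# Append the result only if the command file exists and is writable.
if [ -w "$CMD_FILE" ]; then
    format_result "backup-server" "ArcServe Backup Job" 0 "OK: Backup completed" >> "$CMD_FILE"
else
    echo "Command file $CMD_FILE is not writable" >&2
fi
```

The host name and service description must match the service definition exactly; otherwise the result cannot be matched to the service, and the freshness check will eventually fire.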