I used to use custom scripts to monitor services and get alerts about failures until Bas Moussa introduced me to Nagios. Monitoring and self-healing have been flowers and rainbows since.

Nagios event handlers allow you to put in place measures to self-heal your services. The Nagios documentation on setting them up is superb.
How do you do this for a service on a remote host that you monitor using NRPE? There’s this approach

Using it, though, you’d have to send arguments to NRPE, which, as Nagios points out in a very loud way, is a security risk. The NRPE config file warns you with a subtle:

*** ENABLING THIS OPTION IS A SECURITY RISK! ***

and then they ask you to change dont_blame_nrpe=0 to dont_blame_nrpe=1. Pretty subtle stuff. It suggests, rather mildly, that it is something you want to find every way to avoid.
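
If you want to confirm that a remote host still has argument passing switched off, a small check along these lines may help. This is only a sketch: the function name nrpe_args_disabled is made up for illustration, and it assumes the option appears in the file as a plain dont_blame_nrpe=<value> line.

```shell
#!/bin/bash
# Hypothetical helper (not part of Nagios or NRPE): succeeds when the given
# nrpe.cfg does NOT contain an uncommented dont_blame_nrpe=1 line.
nrpe_args_disabled() {
    local cfg=$1
    # Refuse to guess if the file cannot be read.
    [[ -r $cfg ]] || { echo "cannot read $cfg" >&2; return 2; }
    # Look for an active (uncommented) line that enables command arguments.
    if grep -Eq '^[[:space:]]*dont_blame_nrpe[[:space:]]*=[[:space:]]*1' "$cfg"; then
        return 1
    fi
    return 0
}
```

On the remote host you could call it as nrpe_args_disabled /usr/local/nagios/etc/nrpe.cfg and check the exit status.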

Here’s what I did – on my main host, I defined a custom event handler for the service monitored by NRPE.

define service{
        use                             remote-service
        hostgroup_name                  my-prod-servers
        service_description             How hungry is the server?
        check_command                   check_nrpe!check_hunger_level
        event_handler                   my_event_handler!my_custom_hunger_command
        }

Then I defined the event handler command, my_event_handler, in commands.cfg

define command{
        command_name    my_event_handler
        command_line    $USER1$/eventhandlers/my_event_handler  -s $SERVICESTATE$ -t $SERVICESTATETYPE$ -a $SERVICEATTEMPT$ -H $HOSTADDRESS$ -c $ARG1$
        }

Notice that I pass the service state (-s), state type (-t), service attempt number (-a), host address (-H) and command (-c, taken from $ARG1$) as named arguments. The host address and the command matter most: the event handler uses them to decide which remote host to call and which NRPE command to run there. The script below is largely the same as the one in the Nagios documentation, except that I switched it to bash out of preference. It shows why we need the other arguments.
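
To make the macro substitution concrete, here is a rough imitation of what the expanded command line might look like on the third soft CRITICAL check. The function expand_handler_cmd is purely illustrative, and the sample values (the 192.0.2.10 address, and /usr/local/nagios/libexec for $USER1$) are assumptions, not output from a real Nagios run.

```shell
#!/bin/bash
# Illustrative only: mimic the substitution Nagios performs on command_line.
# Arguments stand in for $USER1$, $SERVICESTATE$, $SERVICESTATETYPE$,
# $SERVICEATTEMPT$, $HOSTADDRESS$ and $ARG1$, in that order.
expand_handler_cmd() {
    local user1=$1 state=$2 state_type=$3 attempt=$4 address=$5 arg1=$6
    printf '%s/eventhandlers/my_event_handler -s %s -t %s -a %s -H %s -c %s\n' \
        "$user1" "$state" "$state_type" "$attempt" "$address" "$arg1"
}

# Sample values (assumed): third soft CRITICAL check on a host at 192.0.2.10.
expand_handler_cmd /usr/local/nagios/libexec CRITICAL SOFT 3 192.0.2.10 my_custom_hunger_command
```

Running a line like the one it prints by hand, as the nagios user, is a handy way to exercise the handler without waiting for a real failure.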

So, the script:

#!/bin/bash

#Uncomment the next two lines for debugging. Check logs in /tmp to see how execution's being done
#exec 2> /tmp/nagioslog."$$"
#set -x

# Event handler script for My Apps
# To run, use my_event_handler -s $SERVICESTATE$ -t $SERVICESTATETYPE$ -a $SERVICEATTEMPT$ -H $HOSTADDRESS$ -c command

#
# Note: This script will only kick in if the service is
#       retried 3 times (in a "soft" state) or if the service somehow
#       manages to fall into a "hard" error state.
#

#function to display correct usage help
function usage(){
cat <<EOF
This script kicks in to perform remedial action on various services monitored by Nagios when their
state changes to near critical. It attempts to correct the issue so we don't go into a CRITICAL state
usage: my_event_handler -s <ServiceState> -t <StateType> -a <ServiceAttempts> -H <HostAddress> -c <Command>

OPTIONS:
  -s The service state (WARNING,UNKNOWN,CRITICAL)
  -t The state type (SOFT,HARD)
  -a The service attempts ( 1,2,3,4)
  -H The host address
  -c The command

EOF
}
serviceState=
stateType=
serviceAttempts=
hostname=
runCommand=

# What state is the service in?
while getopts "s:t:a:H:c:" OPTION
do
  case $OPTION in
        s)
          serviceState=$OPTARG
          ;;
        t)
          stateType=$OPTARG
          ;;
        a)
          serviceAttempts=$OPTARG
          ;;
        H)
          hostname=$OPTARG
          ;;
        c)
          runCommand=$OPTARG
          ;;
        ?)
          usage
          exit
          ;;
  esac
done

#Check that all the arguments have been provided
if [[ -z $serviceState ]] || [[ -z $stateType ]] || [[ -z $serviceAttempts ]] || [[ -z $hostname ]] || [[ -z $runCommand ]]
then
    usage
    exit 1
fi

case "$serviceState" in
        OK)
                # The service just came back up, so don't do anything...
                ;;
        WARNING)
                # Usually, we don't really care about warning states, since the service is probably still running...
                # If you have services for which you act even on warnings, handle them here.
                if [[ $runCommand == "my_custom_hunger_command" ]] && [[ $serviceAttempts -gt 2 ]]
                then
                        /usr/local/nagios/libexec/check_nrpe -H "$hostname" -c "$runCommand"
                fi
                ;;
        UNKNOWN)
                # We don't know what might be causing an unknown error, so don't do anything...
                ;;
        CRITICAL)
                # Aha!  The service appears to have a problem - perhaps we should restart the service...
                # Is this a "soft" or a "hard" state?
                case "$stateType" in
                # We're in a "soft" state, meaning that Nagios is in the middle of retrying the
                # check before it turns into a "hard" state and contacts get notified...
                SOFT)
                        # What check attempt are we on?  We don't want to restart the service on the first
                        # check, because it may just be a fluke!
                        case "$serviceAttempts" in
                        # Wait until the check has been tried 3 times before restarting the service.
                        # If the check fails on the 4th time (after we restart the service), the state
                        # type will turn to "hard" and contacts will be notified of the problem.
                        # Hopefully this will restart the service successfully, so the 4th check will
                        # result in a "soft" recovery.  If that happens no one gets notified because we
                        # fixed the problem!
                        3)
                                echo -n "Performing remedial action..."
                                # Call NRPE on the remote host
                                /usr/local/nagios/libexec/check_nrpe -H "$hostname" -c "$runCommand"
                                ;;
                        esac
                        ;;
                # The service somehow managed to turn into a hard error without getting fixed.
                # It should have been restarted by the code above, but for some reason it wasn't.
                # Let's give it one last try, shall we?
                # Note: Contacts have already been notified of a problem with the service at this
                # point (unless you disabled notifications for this service)
                HARD)
                        echo -n "Performing remedial action..."
                        /usr/local/nagios/libexec/check_nrpe -H "$hostname" -c "$runCommand"
                        ;;
                esac
                ;;
esac
exit 0
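
If you’d like to sanity-check the decision logic without touching NRPE at all, one option is to pull it into a function that only answers "should we act?". The sketch below is my reading of the rules in the script above, not a replacement for it, and the name should_remediate is made up for illustration.

```shell
#!/bin/bash
# Hypothetical helper mirroring the handler's rules; returns 0 when the
# remedial command would be sent to the remote host, non-zero otherwise.
should_remediate() {
    local state=$1 state_type=$2 attempt=$3 cmd=$4
    case "$state" in
        WARNING)
            # Only the hunger command acts on warnings, and only after 2 attempts.
            [[ $cmd == "my_custom_hunger_command" && $attempt -gt 2 ]]
            ;;
        CRITICAL)
            # Act on the 3rd soft attempt, or on any hard state.
            [[ $state_type == "HARD" ]] || [[ $state_type == "SOFT" && $attempt -eq 3 ]]
            ;;
        *)
            # OK, UNKNOWN and anything unexpected: do nothing.
            return 1
            ;;
    esac
}

# Print the decision for a few sample state combinations.
for args in "CRITICAL SOFT 1" "CRITICAL SOFT 3" "CRITICAL HARD 4" "WARNING SOFT 3" "OK HARD 1"; do
    # $args is intentionally unquoted so it splits into state, type and attempt.
    if should_remediate $args my_custom_hunger_command; then
        echo "$args -> act"
    else
        echo "$args -> skip"
    fi
done
```

Keeping the decision separate from the check_nrpe call means you can try out new rules safely before wiring them into the real handler.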

Then, in /usr/local/nagios/etc/nrpe.cfg on your remote host, add:

command[my_custom_hunger_command]=/usr/local/nagios/libexec/eventhandlers/my_custom_hunger_command
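
What goes inside my_custom_hunger_command depends entirely on your service, so treat the following as a hypothetical skeleton rather than the script I actually run. It assumes only the usual Nagios plugin convention (exit 0 for OK, 2 for CRITICAL, one line of output); the remedy function and the placeholder restart command are made up for illustration.

```shell
#!/bin/bash
# Hypothetical skeleton for a remedial script called through NRPE.
# remedy runs whatever command it is given and reports the result using
# Nagios plugin conventions: a single status line plus exit code 0 or 2.
remedy() {
    local output
    # Capture both stdout and stderr of the remedial command.
    if output=$("$@" 2>&1); then
        echo "OK - remedial action succeeded: $output"
        return 0
    fi
    echo "CRITICAL - remedial action failed: $output"
    return 2
}

# Placeholder invocation; uncomment and substitute your own command:
# remedy sudo /path/to/your/restart-command; exit $?
```

Remember to make the script executable by the user NRPE runs as.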

Lastly, restart NRPE on your remote host & Nagios on your main host.

Using this approach, you won’t have to send arguments to your remote hosts. The remedial command is called only when it needs to be, because the event handler checks the service state on the main host first.
Also, notice that you can reuse this event handler for multiple services:

define service{
        use                             remote-service
        hostgroup_name                  my-prod-servers
        service_description             Are the servers exercising?
        check_command                   check_nrpe!check_exercise_levels
        event_handler                   my_event_handler!my_custom_exercise_command
        }

In the event handler, $runCommand will be my_custom_exercise_command (and subsequently, the remote command called in NRPE is my_custom_exercise_command, which needs its own command[...] entry in nrpe.cfg on the remote host).

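Note:
To troubleshoot, uncomment the two debugging lines near the top of the script (exec 2> ... and set -x), then check the logs in /tmp.
If the remedial script you run on the remote host uses sudo, you’ll need to comment out the Defaults requiretty line in /etc/sudoers on that host, so that it reads #Defaults requiretty.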
