Clustering and High-Availability

check_pbsnodes

Description:

A Nagios plugin that calls the ‘showq’ command to detect crashed nodes in a high-performance computing cluster that uses Moab/Maui and Torque for job scheduling and queuing.

Example:
./check_pbsnodes -w 1 -c 2

This would send Nagios a warning if one node was unresponsive. If two nodes were down,
it would send Nagios a critical message. In addition, the plugin reports the names of the crashed nodes, along with the job IDs and the users who own them.
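
The threshold behaviour described above can be sketched as follows. This is a minimal illustration, not the plugin's actual code: the function name `nagios_status` is hypothetical, and it assumes the thresholds are compared as "at or above", which matches the example (`-w 1 -c 2`) but is not confirmed by the source. The exit codes 0, 1 and 2 are the standard Nagios plugin return codes for OK, WARNING and CRITICAL.

```python
# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL = 0, 1, 2


def nagios_status(down_count, warn, crit):
    """Map a count of unresponsive nodes to a Nagios (exit_code, label) pair.

    Assumption: a threshold is triggered when the count is at or above it.
    """
    if down_count >= crit:
        return CRITICAL, "CRITICAL"
    if down_count >= warn:
        return WARNING, "WARNING"
    return OK, "OK"


# With -w 1 -c 2: one node down warns, two nodes down is critical.
for count in (0, 1, 2):
    print(count, nagios_status(count, warn=1, crit=2))
```

A real plugin would print a single status line and then exit with the returned code so that Nagios can read the state.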

Current Version

0.2

Last Release Date

2011-05-26

Compatible With

  • Nagios 2.x
  • Nagios 3.x

License

GPL


Project Notes
FULL DESCRIPTION: This plugin tests for crashed nodes in a high-performance computing cluster. In such clusters, it is not uncommon for load on the compute nodes to reach very high levels. Under such load, many parts of the system may bog down and become unresponsive. For example, SSH logins may no longer work, and polling via Ganglia or Cacti may cease. Yet this does not mean that the compute node has crashed or is no longer doing the work assigned to it by the cluster scheduler.

Under such circumstances, the only reliable way to tell whether a node is really down is to check whether a job's remaining time has gone negative. Torque runs at a higher priority than the jobs it runs, so it is always guaranteed a processor time slice. If a job's walltime is exceeded and Torque is able to get a slice, it will kill the job. If it can't, that is because the node has crashed, and showq will show a negative time in the (time) REMAINING column.

Therefore, this plugin is designed to be run on the Cluster Service Node, where it calls the showq command, parses the output, and searches the REMAINING column for negative values. When it finds them, it should report the problem using correct Nagios syntax and include the crashed node names in the output string. It needs to be called from a remote plugin executor such as NRPE, or MRPE if using Mathias Kettner's Check_MK.
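
The parsing step described above could look roughly like the sketch below. It is an illustration under stated assumptions, not the plugin's actual code: the function name `find_overdue_jobs`, the sample text, and the column order (job ID, user, state, processors, remaining time) are all assumptions, and the exact showq layout differs between Moab and Maui versions. Looking up the node names for each overdue job is left out, because the source does not say how the plugin obtains them.

```python
import re

# Assumed format of a negative REMAINING value, e.g. -00:42:10 or -1:02:03:04.
NEGATIVE_TIME = re.compile(r"^-\d+(:\d+)*$")


def find_overdue_jobs(showq_output):
    """Return (job_id, user) pairs whose REMAINING field is negative.

    Assumes the column order JOBID USERNAME STATE PROCS REMAINING ...
    """
    overdue = []
    for line in showq_output.splitlines():
        fields = line.split()
        # Header and blank lines fail the pattern and are skipped.
        if len(fields) >= 5 and NEGATIVE_TIME.match(fields[4]):
            overdue.append((fields[0], fields[1]))
    return overdue


# Hypothetical sample text, not captured from a real cluster.
SAMPLE = """\
JOBID      USERNAME   STATE  PROCS   REMAINING            STARTTIME

1001       alice    Running      8    02:15:00  Mon May 23 10:00:00
1002       bob      Running     16   -00:42:10  Sun May 22 08:30:00
"""

print(find_overdue_jobs(SAMPLE))  # [('1002', 'bob')]
```

In practice the text would come from running showq on the Cluster Service Node (for example via `subprocess.run`), and the number of affected nodes would then be compared against the warning and critical thresholds.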