check_cdu – Monitor Server Technology Cabinet Distribution Unit (CDU) Products

Description:

A Nagios plugin with extensive and dynamic capabilities for monitoring Server Technology (ServerTech) Cabinet Distribution Units (CDUs). Monitor nearly any metric that a CDU provides. Chain as many or as few options together as needed to create custom checks that meet a specific need. Version 2.0 and later support Sentry4 products, or “PRO2” (firmware version 8+).

Current Version

2.3

Last Release Date

2018-03-12

Compatible With

  • Nagios 3.x
  • Nagios 4.x

License

GPL


Project Notes
(This is a partial dump of 'perldoc check_cdu.pl')

NAME
    check_cdu - Check various metrics from a Server Technology Cabinet Distribution Unit (CDU)

VERSION
    This documentation refers to check_cdu version 2.1

APPLICATION REQUIREMENTS
    Several standard Perl libraries are required for this program to function, namely Net::SNMP, Getopt::Std, Getopt::Long and Nagios::Plugin::Threshold.

GENERAL USAGE
        check_cdu.pl -H <hostname> -C <community> [-t SNMP timeout] [-p SNMP port]

REQUIRED ARGUMENTS
    Only the hostname and community are required. The timeout defaults to 2 seconds and the port to 161.

THRESHOLDS
    I opted to use the Nagios::Plugin::Threshold class to handle thresholds. In general I do not prefer Nagios::Plugin objects, but I simply could not avoid using the Threshold class. I apologize for the added dependency; I just could not afford to reinvent the wheel. The benefit is that the threshold logic used in this plugin follows the standard used in many other plugins. For reference, here are the general threshold guidelines:

        Range definition    Generate an alert if x...
        10                  < 0 or > 10, (outside the range of {0 .. 10})
        10:                 < 10, (outside {10 .. infinity})
        ~:10                > 10, (outside the range of {-infinity .. 10})
        10:20               < 10 or > 20, (outside the range of {10 .. 20})
        @10:20              >= 10 and <= 20, (inside the range of {10 .. 20})

    For the full, official documentation, read: http://nagiosplug.sourceforge.net/developer-guidelines.html#THRESHOLDFORMAT

FULL DOCUMENTATION
    check_cdu is intended to provide extremely flexible and extensive monitoring support for Server Technology Cabinet Distribution Units (CDU). In general the workflow for this application follows this procedure:

        1. Pull in an entire SNMP table using a Net::SNMP session and get_table().
        2. Re-enumerate these "flat" values into a structured hash.
        3. Evaluate any options or thresholds passed on the command line by the user.
        4. Process the command line options against the data collected from the CDU.
        5. Exit appropriately given the status results.

    This workflow is followed in slightly different ways depending on the desired options. The procedures are:

        1. General System
        2. Environment
        3. Towers (Sentry3 Products)
        4. Infeeds (Sentry3 Products)
        5. Cords (Sentry4 Products)
        6. Lines (Sentry4 Products)
        7. Phases (Sentry4 Products)
        8. Branches (Sentry4 Products)

  Environment
    Temperature and humidity (T/H) probes are an optional feature of a CDU. On most units, only two T/H ports exist. Some "Link" or "Expansion" CDUs also have T/H ports. When using an EMCU-1-1B even more T/H probes are available. This application is designed to support any number of T/H probes available to the system. The way these are identified varies between Sentry3 and Sentry4 products.

    Prior to running any checks for temperature or humidity, this plugin will check the T/H probe status. The following states will result in an UNKNOWN return:

        notFound
        readError
        lost
        noComm

    This applies to both temperature and humidity. The LowThresh and HighThresh states are ignored. Since there is no data available in any of the states listed above, it's not logical to initiate a WARNING or CRITICAL and roll someone out of bed. This behavior can easily be changed in the code, if desired.

    In its simplest form, the environment checks will query all available T/H probes connected to the system. Unfortunately, if any ports have no sensor connected the plugin will return an UNKNOWN state indicating that some sensors are notFound. In this case you will need to explicitly indicate which sensors to query (I have no way of knowing if a sensor isn't really there, or if it failed).

    The CDU has internal High/Low thresholds configured for both temperature and humidity, and this is done on a per-sensor basis. Without any arguments, this plugin will honor those values. Considering that there is only one high:low range, I opted to designate this as a WARNING threshold.
    This behavior can easily be changed in the code to the CRITICAL state, if desired, but it is NOT modifiable from the command line. A basic invocation would resemble:

        $ check_cdu.pl -H 192.168.0.1 -C public --temp --humid
        OK - BLDG_ROOM_RACK, Bottom-Rack-Inlet_F31(A1): 18C, Bottom-Rack-Inlet_F31(A1): 43%, Bottom-Rack-Exhaust_F32(A2): 33C, Bottom-Rack-Exhaust_F32(A2): 16%, Top-Rack-Inlet_F31(B1): 24C, Top-Rack-Inlet_F31(B1): 28%, Top-Rack-Exhaust_F32(B2): 36.5C, Top-Rack-Exhaust_F32(B2): 12%

    The plugin output always includes the systemLocation defined on the CDU first. The various objects queried are then returned in a comma separated list. For temperature and humidity probes, the sensor name is returned along with the ID in parentheses. If names haven't been set, the defaults will still be displayed. Finally the value is listed for each sensor.

    The temperature scale is automatically determined from the TempScale object provided via SNMP. For instances where the CDU is configured for one scale, but the user wants the plugin to report in another scale, the --fahrenheit and --celsius options are quite handy:

        $ check_cdu.pl -H 192.168.0.1 -C public --temp --humid --fahrenheit
        OK - BLDG_ROOM_RACK, Bottom-Rack-Inlet_F31(A1): 62.6F, Bottom-Rack-Inlet_F31(A1): 45%, Bottom-Rack-Exhaust_F32(A2): 91.4F

    --celsius works in a similar fashion. If a scale is passed to the plugin and the T/H probe is already configured for that scale, no error will occur. The values will be reported in the native scale for that sensor.

    Expanding on this basic functionality is the --ths option. --ths allows the user to select which sensors to query, based on the sensor ID (not the name!). --ths will automatically determine if the sensors exist, and exit UNKNOWN if they were not found. All of the regular sensor status checks are still performed.
        $ check_cdu.pl -H 192.168.0.1 -C public --temp --ths A1,B2 --fahrenheit
        OK - BLDG_ROOM_RACK, Bottom-Rack-Inlet_F31(A1): 62.6F, Top-Rack-Exhaust_F32(B2): 97.7F

    Note that I also left out the --humid option. Either option can be specified alone, or both together, providing maximum flexibility for designing purpose-built Nagios service checks.

    User supplied WARNING and CRITICAL thresholds can be applied to the temperature and humidity sensors using the --warning and --critical directives. This overrides the automatic threshold logic that relies upon the internal CDU configuration. Either --warning or --critical can be used, or both can be used together. When querying multiple temperature sensors, a single threshold is applied across all sensors. The same is true for querying multiple humidity sensors. Both temperature and humidity can be queried together in the same command, by "chaining" the thresholds together. Here are a couple of examples:

        $ check_cdu.pl -H 192.168.0.1 -C public --temp --fahrenheit --ths A1,B1 --warning 60:80
        OK - BLDG_ROOM_RACK, Bottom-Rack-Inlet_F31(A1): 62.6F, Top-Rack-Inlet_F31(B1): 77F

    (Query just the temperature from T/H probes A1 and B1 and apply a warning threshold to alarm if either sensor falls below 60F or rises above 80F.)

        $ check_cdu.pl -H 192.168.0.1 -C public --humid --ths A2,B2 --warning 10:70
        OK - BLDG_ROOM_RACK, Bottom-Rack-Exhaust_F32(A2): 18%, Top-Rack-Exhaust_F32(B2): 13%

    (Query just the humidity from T/H probes A2 and B2 and apply a warning threshold to alarm if either sensor falls below 10% or rises above 70% relative humidity.)

        $ check_cdu.pl -H 192.168.0.1 -C public --temp --humid --fahrenheit --ths A1 --warning 80,20: --critical 95,10:
        OK - BLDG_ROOM_RACK, Bottom-Rack-Inlet_F31(A1): 64.4F, Bottom-Rack-Inlet_F31(A1): 48%

    (Check just sensor A1, but query both temperature and humidity from this sensor. If the temperature rises above 80F or the humidity falls below 20% generate a WARNING.
    If the temperature rises above 95F or the humidity falls below 10% generate a CRITICAL.)

    IMPORTANT NOTE: When specifying both --temp and --humid the thresholds are chained together as temperature_threshold,humidity_threshold regardless of the order in which --temp and --humid are passed! In other words, the following are equivalent:

        '--temp --humid --warning 45,60' , '--humid --temp --warning 45,60'

    The following are NOT equivalent:

        '--temp --humid --warning 45,60', '--humid --temp --warning 60,45'

    Starting in version 1.3, monitoring of dewpoint temperature and dewpoint delta is supported. The CDU does not natively support dewpoint, but it can be calculated given temperature and humidity. Dewpoint is calculated using constants from J Applied Meteorology and Climatology and the dewpoint calculations provided at: http://en.wikipedia.org/wiki/Dew_point#Calculating_the_dew_point

    There are two ways to monitor dewpoint. The first is the "--dewtemp" option. This simply calculates the air temperature dewpoint of any given sensor and applies the user supplied thresholds to the value. The second is the "--dewdelta" directive, which calculates the difference between the air temperature and the calculated dewpoint. This is especially useful for determining how close a sensor is to reaching the dewpoint temperature, and hence when condensation might start forming within a data center. An example invocation would look like:

        $ check_cdu.pl -H 192.168.0.1 -C public --dewdelta --fahrenheit --ths A1,C1 --warning 10: --critical 5:
        OK - BLDG_ROOM_RACK, Bottom-Rack-Inlet(A1) Delta: 39.00F, Top-Rack-Inlet(C1) Delta: 40.37F

    This check would initiate a WARNING if the dewpoint is within 10F of the air temperature and a CRITICAL if the dewpoint is within 5F of the air temperature. I believe this would be a typical use for this function. The dewpoint temperature can never be greater than the air temperature, only less than or equal to it.
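The dewpoint and dewpoint delta calculations described above can be illustrated with a short Python sketch. This is not the plugin's Perl code. It assumes a Magnus-type approximation with the constants b = 17.368 and c = 238.88 C, one of the sets listed in the referenced Wikipedia article for temperatures between 0 C and 50 C; the function names are hypothetical.

```python
import math

# Assumed Magnus-type constants (valid for roughly 0 C to 50 C).
# The plugin's own constants are not shown in this documentation.
MAGNUS_B = 17.368
MAGNUS_C = 238.88  # degrees Celsius


def dew_point_c(temp_c, humidity_pct):
    """Approximate dewpoint (Celsius) from air temperature and relative humidity."""
    if not 0 < humidity_pct <= 100:
        raise ValueError("relative humidity must be in the range (0, 100]")
    gamma = math.log(humidity_pct / 100.0) + (MAGNUS_B * temp_c) / (MAGNUS_C + temp_c)
    return (MAGNUS_C * gamma) / (MAGNUS_B - gamma)


def dew_delta_c(temp_c, humidity_pct):
    """Air temperature minus dewpoint, in Celsius degrees."""
    return temp_c - dew_point_c(temp_c, humidity_pct)


def f_to_c(temp_f):
    """Convert an absolute temperature from Fahrenheit to Celsius."""
    return (temp_f - 32.0) * 5.0 / 9.0


def delta_c_to_f(delta_c):
    """Convert a temperature difference; no 32 degree offset applies."""
    return delta_c * 9.0 / 5.0


if __name__ == "__main__":
    # Sensor reading of 67.1F at 24% relative humidity.
    delta_f = delta_c_to_f(dew_delta_c(f_to_c(67.1), 24.0))
    print("Delta: %.2fF" % delta_f)
```

With these assumed constants, a reading of 67.1F at 24% relative humidity gives a delta of approximately 37.96F. At 100% relative humidity the computed dewpoint equals the air temperature, so the delta is zero.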
    Since the CDU does not have built-in thresholds for dewpoint, either --warning or --critical must be used in conjunction with --dewtemp or --dewdelta. As with the --temp and --humid options, chaining is supported with the dewpoint options. The order of the chained thresholds is always temp,humidity,dewpoint. You cannot specify --dewdelta and --dewtemp in the same invocation. A more complex example invocation would be:

        $ check_cdu.pl -H 192.168.0.1 -C public --temp --humid --dewdelta --fahrenheit --ths A1,C1 --warning 80,50,10: --critical 90,80,5:
        OK - BLDG_ROOM_RACK, Bottom-Rack-Inlet(A1): 67.1F, Bottom-Rack-Inlet(A1): 24%, Bottom-Rack-Inlet(A1) Delta: 37.96F, Top-Rack-Inlet(C1): 68F, Top-Rack-Inlet(C1): 22%, Top-Rack-Inlet(C1) Delta: 40.22F

    This command checks two sensors for temperature, humidity and dewpoint delta. Temperature: WARNING above 80, CRITICAL above 90. Humidity: WARNING above 50, CRITICAL above 80. Dewpoint delta: WARNING if less than 10 and CRITICAL if less than 5.

  Towers (Sentry3 Products)
    Tower state and statistics are checked using the --tower directive. If specified with no arguments, only the overall state of the tower(s) is checked. The ability to query a specific tower does not exist at this time.

    If the 'noComm' state is encountered for a tower a WARNING state is generated. This is likely only possible on a slave tower. If the master tower is in state 'noComm', I doubt you'd get this far with it ;) If the 'fanFail', 'overTemp' or 'nvmFail' states are encountered, the state is returned as CRITICAL. The 'outOfBalance' state returns WARNING.

    Various metrics from the tower can also be queried by passing them to the --tower directive as a comma separated list. At the time of development, these metrics are only supported on PIPS units. A regular SMART or SWITCHED CDU will likely not benefit from any of these enhancements. The plugin will correctly identify the absence of these metrics if you attempt to query them.
    The metrics are:

        VACapacity
        ApparentPower
        VACapacityUsed
        ActivePower
        Energy
        LineFrequency

    It is very important to note that the 'Status' checks are largely skipped when querying any of these metrics. The 'fanFail' and 'overTemp' states are completely ignored. If the 'noComm' state is encountered, the metric(s) are skipped and an UNKNOWN state is returned. Given this, to fully utilize the features of this plugin one should ALWAYS have a service check using just '--tower'. It was not logical to exit on WARNING/CRITICAL for a 'noComm' state multiple times (say, for instance, if there are separate service checks defined for every metric listed above).

    The towers are identified in the same way as the T/H probes, in the form NAME(ID): VALUE. These are all configurable on the CDU itself. Typically, a circuit name would be used for a tower name.

    Thresholds are applied in a similar manner to the --temp and --humid checks. ORDER DOES MATTER. The order in which the metrics are listed is the order in which the thresholds should be "chained". The same logic applies to these thresholds; see the THRESHOLDS section for specifics. Here are some examples:

        $ check_cdu.pl -H 192.168.0.1 -C public --tower
        OK - BLDG_ROOM_RACK, TowerA(A) Status: normal(0), TowerB(B) Status: normal(0)

        $ check_cdu.pl -H 192.168.0.1 -C public --tower ApparentPower,ActivePower,VACapacityUsed --warning 1200,1000,30
        OK - BLDG_ROOM_RACK, TowerA(A) ApparentPower: 993VA, TowerA(A) ActivePower: 939W, TowerA(A) VACapacityUsed: 9.1%, TowerB(B) ApparentPower: 927VA, TowerB(B) ActivePower: 870W, TowerB(B) VACapacityUsed: 8.5%

    (Check that ApparentPower does not exceed 1200VA, ActivePower does not exceed 1000W and the capacity used does not exceed 30%.
    If any of these scenarios occur, generate a WARNING.)

        $ check_cdu.pl -H 192.168.0.1 -C public --tower Energy --warning 10000 --critical 15000
        OK - BLDG_ROOM_RACK, TowerA(A) Energy: 6654kWh, TowerB(B) Energy: 7658kWh

    (If the kWh consumption of either tower exceeds 10,000 generate a WARNING. If it exceeds 15,000 generate a CRITICAL. Say you're in a co-lo paying for power utilization and your piggy bank will run dry if you use too much power ...)

        $ check_cdu.pl -H 192.168.0.1 -C public --tower VACapacity --warning 10800
        WARNING - BLDG_ROOM_RACK, TowerA(A) VACapacity: -1VA

    (This is a very bizarre but interesting scenario. I included VACapacity because it was there, but who would logically check a static value such as the capacity of a tower? Well, it turns out that this particular unit is slightly broken and the capacity is -1. This should provide some ideas on why it may be useful to monitor things that otherwise wouldn't make sense.)

  Infeeds (Sentry3 Products)
    Infeed state and statistics are checked using the --infeed directive. It is very similar to the --tower check. If specified with no arguments, the infeed 'Status' and 'LoadStatus' objects are checked. The ability to query a specific infeed does not exist at this time (and likely never will).

    The following infeed Statuses will generate a WARNING:

        noComm
        offWait
        onWait
        off
        reading

    A CRITICAL will be generated if the infeed has one of the following Statuses:

        offError
        onError
        offFuse
        onFuse

    The LoadStatus object is checked for each infeed as well. A WARNING is generated for the following LoadStatus conditions:

        noComm
        reading
        loadLow

    I wasn't sure what the 'reading' state was; this state is also present across many other CDU objects. There is a good chance this state simply implies that the value is currently being "read" or updated, and it's likely that this state will be ignored in future versions of the plugin if that is the case.
    The loadLow state must be determined by an internal CDU threshold; however, this threshold isn't available via SNMP, so I left it alone. A CRITICAL is generated for the other LoadStatus states:

        notOn
        loadHigh
        overLoad
        readError

    Simple modifications to the code can move these various Statuses between the CRITICAL and WARNING states if desired, but this is not possible from the command line.

    Similar to the --tower directive, many of these Status checks are skipped when querying specific metrics from the infeed. If any metrics are provided to --infeed, the infeed Status is checked for the 'noComm' status. If it is found, the plugin will append this to the UNKNOWN 'bucket' and skip checking the metric. The following infeed metrics are currently supported:

        PhaseVoltage *
        Voltage
        CapacityUsed *
        Power
        ApparentPower *
        Energy *
        LoadValue
        PhaseCurrent *
        CrestFactor *
        PowerFactor *

        * These metrics are only available on PIPS units.

    The infeeds are identified in the same way as the T/H probes, in the form NAME(ID): VALUE. These are all configurable on the CDU itself. Typically, a circuit name would be used for an infeed name.

    Thresholds are applied in a similar manner to the --temp and --humid checks. ORDER DOES MATTER. The order in which the metrics are listed is the order in which the thresholds should be "chained". The same logic applies to these thresholds; see the THRESHOLDS section for specifics.

    A special note on PowerFactor: An unloaded infeed will typically report -0.01 for the Power Factor. It does not seem logical to apply the provided threshold to this value. So if the Power Factor is less than 0 the threshold is not used and the state is simply assumed to be 'OK'.
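As an illustration of how the range notation from the THRESHOLDS section combines with the PowerFactor exception just described, here is a minimal Python sketch. The plugin itself relies on the Perl Nagios::Plugin::Threshold class; the function names below are hypothetical and the parsing covers only the range forms listed in this documentation.

```python
import math


def parse_range(spec):
    """Split a Nagios-style range string into (low, high, alert_inside)."""
    alert_inside = spec.startswith("@")
    if alert_inside:
        spec = spec[1:]
    if ":" in spec:
        low_text, high_text = spec.split(":", 1)
    else:
        low_text, high_text = "0", spec  # a bare "10" means the range 0..10
    if low_text == "~":
        low = -math.inf  # "~" stands for negative infinity
    elif low_text == "":
        low = 0.0
    else:
        low = float(low_text)
    high = math.inf if high_text == "" else float(high_text)
    return low, high, alert_inside


def should_alert(value, spec):
    """Return True when the value violates the given range definition."""
    low, high, alert_inside = parse_range(spec)
    in_range = low <= value <= high
    return in_range if alert_inside else not in_range


def power_factor_should_alert(value, spec):
    """Apply the documented exception: a negative Power Factor is treated as OK."""
    if value < 0:
        return False
    return should_alert(value, spec)


if __name__ == "__main__":
    for spec, value in (("10", 11), ("10:", 9), ("~:10", -50), ("@10:20", 15)):
        print(spec, value, should_alert(value, spec))
```

A threshold such as `--warning 0.8:` would therefore flag a Power Factor of 0.5 but would leave an unloaded infeed reporting -0.01 alone.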
    Some examples:

        $ check_cdu.pl -H 192.168.0.1 -C public --infeed
        OK - BLDG_ROOM_RACK, TowerA_InfeedA(AA) Status: on(1), TowerA_InfeedA(AA) LoadStatus: normal(0), TowerA_InfeedB(AB) Status: on(1), TowerA_InfeedB(AB) LoadStatus: normal(0), TowerA_InfeedC(AC) Status: on(1), TowerA_InfeedC(AC) LoadStatus: normal(0), TowerB_InfeedA(BA) Status: on(1), TowerB_InfeedA(BA) LoadStatus: normal(0), TowerB_InfeedB(BB) Status: on(1), TowerB_InfeedB(BB) LoadStatus: normal(0), TowerB_InfeedC(BC) Status: on(1), TowerB_InfeedC(BC) LoadStatus: normal(0)

    (This is a basic infeed check for a master/slave 3 phase CDU. There are 6 infeeds total across both towers, and two separate checks (Status, LoadStatus) are performed for each infeed. This is a lot of data.)

        $ check_cdu.pl -H 192.168.0.1 -C public --infeed LoadValue --warning 12 --critical 24
        OK - BLDG_ROOM_RACK, TowerA_InfeedA(AA) LoadValue: 4.07A, TowerA_InfeedB(AB) LoadValue: 3.21A, TowerA_InfeedC(AC) LoadValue: 1.62A, TowerB_InfeedA(BA) LoadValue: 3.61A, TowerB_InfeedB(BB) LoadValue: 2.76A, TowerB_InfeedC(BC) LoadValue: 1.73A

    (This is a simple load/current check which applies a warning and a critical threshold to the load of all 6 infeeds on a dual tower 3 phase CDU.)
        $ check_cdu.pl -H 192.168.0.1 -C public --infeed ApparentPower,CapacityUsed --warning 1000,20
        OK - BLDG_ROOM_RACK, TowerA_InfeedA(AA) ApparentPower: 673VA, TowerA_InfeedA(AA) CapacityUsed: 12.6%, TowerA_InfeedB(AB) ApparentPower: 0VA, TowerA_InfeedB(AB) CapacityUsed: 10.5%, TowerA_InfeedC(AC) ApparentPower: 317VA, TowerA_InfeedC(AC) CapacityUsed: 5.3%, TowerB_InfeedA(BA) ApparentPower: 575VA, TowerB_InfeedA(BA) CapacityUsed: 12%, TowerB_InfeedB(BB) ApparentPower: 0VA, TowerB_InfeedB(BB) CapacityUsed: 8.9%, TowerB_InfeedC(BC) ApparentPower: 348VA, TowerB_InfeedC(BC) CapacityUsed: 5.7%

    (Generate a warning if the ApparentPower of any infeed exceeds 1000VA, and generate a warning if the capacity used exceeds 20% on any infeed.)

    PhaseVoltage and PhaseCurrent use the PhaseID instead of the infeedID in the plugin output. Throughout our testing, it has been difficult to ascertain a difference between PhaseVoltage and Voltage. There is generally a considerable difference between PhaseCurrent and LoadValue; however, it most likely makes sense to check only one of these.

  Enhanced Infeed checks (Sentry3 Products)
    There are two additional metrics that can be checked with the '--infeed' directive. They are:

        LoadImbalance
        VoltageImbalance

    These metrics are not provided directly by the CDU; rather, they are computed internally by the plugin. Please note, these special metrics are ONLY available on 3 phase units.

    Some versions of the CDU firmware provide a '3-Phase Load Out-of-Balance Threshold' setting and the results are displayed on the 'istat' menu. None of this information is provided via SNMP. Thresholds are required for either of these computed metrics. Unlike the display in 'istat', only the out-of-balance infeed(s) will be displayed, not infeeds across the entire tower.
    I used a basic 3 phase motor load phase imbalance equation to generate the imbalance percentages for both current and voltage:

        Percent imbalance = maximum deviation from average / average of three phases * 100

    When an infeed is queried for either voltage or current imbalance, the plugin determines which tower the infeed is a part of. All infeed values (voltage or current) for that tower are then averaged together. The deviation from the average is then determined for this particular infeed, accommodating either a negative or positive delta from the average. This is then divided by the average and multiplied by 100 to determine the percent imbalance. This equation was pulled from the following document: http://support.fluke.com/educators/download/asset/2161031_b_w.pdf

    An example invocation of this check would look like:

        $ check_cdu.pl -H 192.168.0.1 -C public --infeed LoadImbalance --warning 20 --critical 30
        CRITICAL - BLDG_ROOM_RACK, TowerA_InfeedA(AA) LoadImbalance: 39.07%, TowerA_InfeedC(AC) LoadImbalance: 50.46%, TowerB_InfeedA(BA) LoadImbalance: 33.54%, TowerB_InfeedC(BC) LoadImbalance: 34.54%

    (Generate a WARNING if the load imbalance of any infeed exceeds 20%, and a CRITICAL if the imbalance exceeds 30%. Clearly, this is not a well balanced rack! Hence the need for such a check.)

    The same can be done for voltage; however, the margins should be much, much smaller than for load. This can be useful to detect bad incoming power conditions. Unfortunately this only evaluates an imbalance across the phases of a single tower. A more useful approach would be to judge imbalance between two separate towers, and hence two separate feeds/circuits which could be coming from two separate sources (i.e. UPS/utility). Currently that functionality does not exist.
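The per-infeed imbalance calculation described above can be sketched in a few lines of Python. This is an illustration of the stated equation only, not the plugin's Perl code; the function names, the sample readings and the handling of a zero average are assumptions.

```python
def percent_imbalance(tower_values, index):
    """Percent deviation of one infeed from the average of its tower.

    tower_values holds the readings (voltage or current) of every infeed
    on the same tower; index selects the infeed being evaluated.
    """
    average = sum(tower_values) / len(tower_values)
    if average == 0:
        # Assumption: an entirely unloaded tower is reported as balanced.
        return 0.0
    # abs() accommodates either a negative or a positive delta.
    deviation = abs(tower_values[index] - average)
    return deviation / average * 100.0


def out_of_balance(tower_values, limit_pct):
    """Return (index, percent) pairs for infeeds whose imbalance exceeds the limit."""
    results = []
    for index in range(len(tower_values)):
        pct = percent_imbalance(tower_values, index)
        if pct > limit_pct:
            results.append((index, pct))
    return results


if __name__ == "__main__":
    loads = [12.0, 10.0, 8.0]  # hypothetical amps on the three infeeds of one tower
    for index, pct in out_of_balance(loads, 15.0):
        print("infeed %d imbalance: %.2f%%" % (index, pct))
```

In this sketch only the infeeds that exceed the limit are returned, which mirrors the note above that only the out-of-balance infeed(s) are displayed.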
    Here is an example:

        $ check_cdu.pl -H 192.168.0.1 -C public --infeed VoltageImbalance --warning .5 --critical 2
        WARNING - BLDG_ROOM_RACK, TowerB_InfeedB(BB) VoltageImbalance: 0.65%

    (Generate a WARNING if the imbalance between voltages per infeed is greater than .5% and a CRITICAL if the imbalance is greater than 2%.)

  Cords (Sentry4 Products)
    Cord state and statistics are checked using the --cord directive. If specified with no arguments, only the Status and State metrics of the cord(s) are checked. The ability to query a specific cord does not exist at this time. If any state other than the normal one is encountered for either object, a WARNING is generated.

    There are other status objects that can be queried in addition to Status and State. Check any number of these objects by passing a comma separated list to the --cord directive. They do not accept thresholds. The "normal" state of each metric is hard-coded (usually either "normal" or "on"). Here is the full list of available "State" metrics:

        State
        Status
        ActivePowerStatus
        ApparentPowerStatus
        PowerFactorStatus
        OutOfBalanceStatus

    Other non-state metrics can be queried in the same way, but require a threshold. At this time these "metered" metrics do not honor any of the built-in thresholds available on the CDU. If you look in the code, I am collecting any available Warning/Alarm metrics, but I have not coded in the ability to use them. This is planned for a future version, I hope. If a metric is not available for some reason, the plugin will identify this. Here are the cord metrics:

        PowerCapacity
        ActivePower
        ApparentPower
        PowerUtilized
        PowerFactor
        Energy
        Frequency
        OutOfBalance

    It is very important to note that the 'Status' checks are largely skipped when querying any of these metrics. If the 'noComm' state is encountered, the metric(s) are skipped and an UNKNOWN state is returned.
    Given this, to fully utilize the features of this plugin one should ALWAYS have a service check using just '--cord', or one that specifies the "Status" checks explicitly.

    The naming convention of the cords is identical to how all the other resources are identified in the system.

    Thresholds are applied the same way as in other checks. ORDER DOES MATTER. The order in which the metrics are listed is the order in which the thresholds should be "chained". See the THRESHOLDS section for specifics. Here are some examples:

        $ check_cdu.pl -H 192.168.0.1 -C public --cord
        OK - BLDG_ROOM_RACK, Master_Cord_A(AA) Status: normal(0) State: on(1), Link1_Cord_A(BA) Status: normal(0) State: on(1)

        $ check_cdu.pl -H 192.168.0.1 -C public --cord ActivePowerStatus,OutOfBalanceStatus
        OK - BLDG_ROOM_RACK, Master_Cord_A(AA) ActivePowerStatus: normal(0), Master_Cord_A(AA) OutOfBalanceStatus: normal(0), Link1_Cord_A(BA) ActivePowerStatus: normal(0), Link1_Cord_A(BA) OutOfBalanceStatus: normal(0)

        $ check_cdu.pl -H 192.168.0.1 -C public --cord ActivePower,PowerUtilized --warning 2500,20 --critical 4000,50
        OK - BLDG_ROOM_RACK, Master_Cord_A(AA) ActivePower: 1442W, Master_Cord_A(AA) PowerUtilized: 8.1%, Link1_Cord_A(BA) ActivePower: 1511W, Link1_Cord_A(BA) PowerUtilized: 8.2%

        $ check_cdu.pl -H 192.168.0.1 -C public --cord PowerCapacity
        WARNING - BLDG_ROOM_RACK, Link1_Cord_A(BA) PowerCapacity: -1VA

    (This is a very bizarre but interesting scenario. I included PowerCapacity because it was there, but who would logically check a static value such as the capacity of a cord? Well, it turns out that this particular unit is slightly broken and the capacity is -1. This should provide some ideas on why it may be useful to monitor things that otherwise wouldn't make sense.)

  Lines (Sentry4 Products)
    Line state and statistics are checked using the --line directive. If specified with no arguments, only the Status and State metrics of the line(s) are checked.
    The ability to query a specific line does not exist at this time. If any state other than the normal one is encountered for either object, a WARNING is generated.

    There are other status objects that can be queried in addition to Status and State. Check any number of these objects by passing a comma separated list to the --line directive. They do not accept thresholds. The "normal" state of each metric is hard-coded (usually either "normal" or "on"). Here is the full list of available "State" metrics:

        State
        Status
        CurrentStatus

    Other non-state metrics can be queried in the same way, but require a threshold. At this time these "metered" metrics do not honor any of the built-in thresholds available on the CDU. If you look in the code, I am collecting any available Warning/Alarm metrics, but I have not coded in the ability to use them. This is planned for a future version, I hope. If a metric is not available for some reason, the plugin will identify this. Here are the line metrics:

        CurrentCapacity
        Current
        CurrentUtilized

    It is very important to note that the 'Status' checks are largely skipped when querying any of these metrics. If the 'noComm' state is encountered, the metric(s) are skipped and an UNKNOWN state is returned. Given this, to fully utilize the features of this plugin one should ALWAYS have a service check using just '--line', or one that specifies the "Status" checks explicitly.

    The naming convention of the lines is identical to how all the other resources are identified in the system.

    Thresholds are applied the same way as in other checks. ORDER DOES MATTER. The order in which the metrics are listed is the order in which the thresholds should be "chained". See the THRESHOLDS section for specifics.
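The "ORDER DOES MATTER" rule for chained thresholds can be pictured as a positional pairing of the metric list with each threshold list. The Python sketch below is an assumption about how such pairing could work, not a description of the plugin's internals; the function name is hypothetical, and rejecting mismatched list lengths is a choice made for this sketch.

```python
def chain_thresholds(metrics, warning=None, critical=None):
    """Pair each metric with the threshold in the same list position.

    metrics, warning and critical are comma separated strings, as they
    would be passed to a directive such as --line, --warning and --critical.
    """
    names = metrics.split(",")
    paired = {name: {"warning": None, "critical": None} for name in names}
    for level, spec in (("warning", warning), ("critical", critical)):
        if spec is None:
            continue
        parts = spec.split(",")
        if len(parts) != len(names):
            # Assumption: a mismatch is treated as a usage error in this sketch.
            raise ValueError("%s needs %d thresholds, got %d"
                             % (level, len(names), len(parts)))
        for name, part in zip(names, parts):
            paired[name][level] = part
    return paired


if __name__ == "__main__":
    result = chain_thresholds("Current,CurrentUtilized",
                              warning="5,40", critical="10,95")
    for name, levels in result.items():
        print(name, levels)
```

Swapping the metric order without swapping the threshold order would attach the 5/10 thresholds to CurrentUtilized and the 40/95 thresholds to Current, which is why the order matters.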
    Here are some examples:

        $ check_cdu.pl -H 192.168.0.1 -C public --line
        OK - BLDG_ROOM_RACK, AA:L1(AA1) Status: normal(0) State: on(1), AA:L2(AA2) Status: normal(0) State: on(1), AA:L3(AA3) Status: normal(0) State: on(1), AA:N(AA4) Status: normal(0) State: on(1), BA:L1(BA1) Status: normal(0) State: on(1), BA:L2(BA2) Status: normal(0) State: on(1), BA:L3(BA3) Status: normal(0) State: on(1), BA:N(BA4) Status: normal(0) State: on(1)

        $ check_cdu.pl -H 192.168.0.1 -C public --line CurrentStatus
        OK - BLDG_ROOM_RACK, AA:L1(AA1) CurrentStatus: normal(0), AA:L2(AA2) CurrentStatus: normal(0), AA:L3(AA3) CurrentStatus: normal(0), AA:N(AA4) CurrentStatus: normal(0), BA:L1(BA1) CurrentStatus: normal(0), BA:L2(BA2) CurrentStatus: normal(0), BA:L3(BA3) CurrentStatus: normal(0), BA:N(BA4) CurrentStatus: normal(0)

        $ check_cdu.pl -H 192.168.0.1 -C public --line Current,CurrentUtilized --warning 5,40 --critical 10,95
        OK - BLDG_ROOM_RACK, AA:L1(AA1) Current: 3.06A, AA:L1(AA1) CurrentUtilized: 9.5%, AA:L2(AA2) Current: 2.23A, AA:L2(AA2) CurrentUtilized: 6.9%, AA:L3(AA3) Current: 2.1A, AA:L3(AA3) CurrentUtilized: 6.5%, AA:N(AA4) Current: 1.05A, AA:N(AA4) CurrentUtilized: 3.2%, BA:L1(BA1) Current: 3.18A, BA:L1(BA1) CurrentUtilized: 9.9%, BA:L2(BA2) Current: 2.36A, BA:L2(BA2) CurrentUtilized: 7.3%, BA:L3(BA3) Current: 2.07A, BA:L3(BA3) CurrentUtilized: 6.4%, BA:N(BA4) Current: 1.1A, BA:N(BA4) CurrentUtilized: 3.4%

  Phases (Sentry4 Products)
    Read the documentation for Cords and Lines. Phases are handled the same way.

    Available "State" metrics:

        State
        Status
        VoltageStatus
        PowerFactorStatus
        Reactance

    Metered metrics:

        Voltage
        VoltageDeviation
        Current
        CrestFactor
        ActivePower
        ApparentPower
        PowerFactor
        Energy

    NOTE: Reactance is evaluated in terms of the following states:

        unknown
        capacitive
        inductive
        resistive

    I opted to choose "capacitive" as the "OK" state. This may well not suit your environment. YMMV.

  Branches (Sentry4 Products)
    Read the documentation for Cords and Lines.
    Branches are handled the same way.

    Available "State" metrics:

        State
        Status
        CurrentStatus

    Metered metrics:

        CurrentCapacity
        Current
        CurrentUtilized

  Contact Sensors
    Contact closure sensors (dry contacts) are available when the EMCU-1-1B unit is used. Each firmware version and even each CDU type can enumerate the sensors differently, so the IDs have been "simplified" for use in this plugin. Do not use E1, C1, etc. as the ID. Just use 1-4. The plugin figures the rest out automagically.

    A state/status of "normal(0)" returns an OK. Anything else returns a WARNING. I didn't bother to make this configurable, but you can hack the code yourself to change this if you want.

    If you don't explicitly specify which IDs to query, the script looks at all four of them.

        $ check_cdu.pl -H 192.168.0.1 -C public --contact 1,2
        OK - BLDG_ROOM_RACK, FRONT_DOOR(B1): normal(0), REAR_DOOR(B2): normal(0)

  Plugin Termination
    Numerous scenarios exist where the plugin will exit abnormally. This could be due to user input error, failure to retrieve required SNMP data, etc. In all identifiable cases, the plugin will exit with an UNKNOWN state and a descriptive message indicating the failure. Users should be aware that if all SNMP calls fail, monitoring of the CDU may be effectively rendered useless if UNKNOWN states are not reported (this is common). This is unlike plugins such as check_nrpe, which exit CRITICAL if an SSL negotiation error occurs!

    Throughout the workflow of the plugin, metrics are evaluated against thresholds and the results are placed into various 'buckets' reflecting the OK, WARNING, CRITICAL and UNKNOWN states. At the end of the workflow, reporting is done based upon the presence or absence of these buckets. If both CRITICAL and WARNING conditions exist, they are BOTH reported in the plugin_output text; however, the state is reported as CRITICAL.
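The bucket-based reporting described above can be sketched as follows. This Python illustration is not the plugin's code: the function name, the priority of UNKNOWN relative to WARNING, the omission of OK items when a problem exists, and the message used when nothing was evaluated are all assumptions. Only the rule that CRITICAL wins over WARNING while both are printed comes from the text. The exit codes are the standard Nagios plugin return codes.

```python
# Standard Nagios plugin return codes.
EXIT_CODES = {"OK": 0, "WARNING": 1, "CRITICAL": 2, "UNKNOWN": 3}

# Assumed reporting priority; only CRITICAL over WARNING is documented.
PRIORITY = ["CRITICAL", "WARNING", "UNKNOWN", "OK"]


def build_report(location, buckets):
    """Return (plugin_output, exit_code) from a dict of state -> list of results."""
    present = [state for state in PRIORITY if buckets.get(state)]
    if not present:
        # Assumption: nothing evaluated is reported as UNKNOWN.
        return "UNKNOWN - %s, no metrics were evaluated" % location, EXIT_CODES["UNKNOWN"]
    overall = present[0]
    # Assumption: OK results are only listed when no problem bucket is filled.
    shown = [state for state in present if state != "OK"] or ["OK"]
    parts = []
    for position, state in enumerate(shown):
        items = ", ".join(buckets[state])
        if position == 0:
            parts.append("%s - %s, %s" % (state, location, items))
        else:
            parts.append("%s - %s" % (state, items))
    return ", ".join(parts), EXIT_CODES[overall]


if __name__ == "__main__":
    text, code = build_report("BLDG_ROOM_RACK", {
        "CRITICAL": ["Bottom-Rack-Inlet_F31(A1): 43%"],
        "WARNING": ["Bottom-Rack-Inlet_F31(A1): 17C"],
    })
    print(text)
    print("exit code:", code)
```

Both problem buckets appear in the output text, while the exit code reflects only the most severe state.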
    An example of this can be seen in the following output:

        $ check_cdu.pl -H 192.168.0.1 -C public --temp --humid --ths A1 --warning 16,30 --critical 20,40
        CRITICAL - BLDG_ROOM_RACK, Bottom-Rack-Inlet_F31(A1): 43%, WARNING - Bottom-Rack-Inlet_F31(A1): 17C

    Some options end up producing a large amount of output, and this could easily exceed what Nagios can accept, or exceed character limits on various notification devices (maybe you're tweeting your CDU status for instance ;P). The '--oksummary' option exists to summarize the output for any type of check being done. If all metrics being checked are in state 'OK', the output suppresses the specifics of these metrics and simply reports 'N metrics are OK'. The version and location are also displayed in the plugin_output.

INCOMPATIBILITIES
    None. See Bugs.

BUGS AND LIMITATIONS
    None. If you experience any problems please contact me. (eric.schoeller coloradoDOTedu)

AUTHOR
    Eric Schoeller (eric.schoeller coloradoDOTedu)

LICENCE AND COPYRIGHT
    Copyright (c) 2013 Eric Schoeller (eric.schoeller coloradoDOTedu). All rights reserved.

    This module is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License. See L.

    This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.