Monitoring
Walrus is designed to support a robust monitoring system that can check on nearly anything. Between the plugin support already in the framework and the support for external monitoring scripts, most checks you can think of should be easy to add.
Some requirements for the monitoring system:
- Nagios compatibility
  - There is a wealth of Nagios plugins out there; we need to support them, including the plugins people have written for their own Nagios setups
- Reliability
  - A monitoring system that isn't running or doesn't run often enough is no good to anybody
- Introspection
  - A web interface that people can use to do most of the maintenance
  - This is expected to be built on the existing Walrus Framework
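As a rough illustration of what Nagios compatibility could involve, the sketch below runs one external plugin and interprets it according to the Nagios plugin convention: exit codes 0 to 3 mean OK, WARNING, CRITICAL, and UNKNOWN, and the first line of output may carry performance data after a "|". The names `run_plugin` and `PluginResult`, and the 30-second default timeout, are assumptions made for this example and are not part of Walrus.

```python
import subprocess
import sys
from dataclasses import dataclass

# Exit codes defined by the Nagios plugin convention.
NAGIOS_STATUS = {0: "OK", 1: "WARNING", 2: "CRITICAL", 3: "UNKNOWN"}


@dataclass
class PluginResult:
    status: str    # one of the NAGIOS_STATUS values
    message: str   # human-readable text before the "|" separator
    perfdata: str  # raw performance data after the "|", possibly empty


def run_plugin(argv, timeout_seconds=30.0):
    """Run one Nagios-style plugin and interpret its exit code and output.

    Hypothetical helper for illustration; not an existing Walrus API.
    """
    try:
        proc = subprocess.run(
            argv, capture_output=True, text=True, timeout=timeout_seconds
        )
    except subprocess.TimeoutExpired:
        # A hung plugin should not stall other checks; report it as UNKNOWN.
        return PluginResult("UNKNOWN", "plugin timed out", "")
    except OSError as exc:
        # The plugin is missing or not executable.
        return PluginResult("UNKNOWN", f"plugin could not be run: {exc}", "")

    # Exit codes outside 0-3 are treated as UNKNOWN.
    status = NAGIOS_STATUS.get(proc.returncode, "UNKNOWN")
    lines = proc.stdout.splitlines()
    first_line = lines[0] if lines else ""
    message, _, perfdata = first_line.partition("|")
    return PluginResult(status, message.strip(), perfdata.strip())


if __name__ == "__main__":
    # Stand-in for a real plugin: prints one status line and exits with 1.
    fake_plugin = [
        sys.executable,
        "-c",
        "import sys; print('DISK WARNING - 91% used | used=91%'); sys.exit(1)",
    ]
    print(run_plugin(fake_plugin))
```

Treating timeouts and launch failures as UNKNOWN rather than raising keeps one broken plugin from stopping the rest of a run, which is one way to approach the reliability requirement.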
States
We define several states. To make it easy for administrators to transition to a Walrus setup, we use the same terms as other well-known monitoring environments.
- Critical
  - Applied when intervention is demanded
  - E.g., a system is unreachable, a service is down, etc.
- Warning
  - Applied when something is outside of the expected range
  - E.g., load average is high, disk is nearly full, etc.
- Nominal
  - Nothing out of the ordinary, everything is as expected
- Unknown
  - Ideally rare, used when something is just starting
  - Could also be used when the "last known state" is older than some threshold?
This needs more thought: is there ever a reason to have Unknown? Yes: when a service is defined but monitoring hasn't run against it yet. Possibly also when it has been more than a few minutes since the item was last checked.
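The states and the two candidate uses of Unknown can be sketched as follows. This is only an illustration: the `State` enum, the `effective_state` helper, and the five-minute `STALE_AFTER` value are assumptions made for the example (the text above only says "a few minutes"). The exit-code table follows the Nagios plugin convention, where 0, 1, 2, and 3 mean OK, WARNING, CRITICAL, and UNKNOWN.

```python
from datetime import datetime, timedelta
from enum import Enum


class State(Enum):
    NOMINAL = "Nominal"
    WARNING = "Warning"
    CRITICAL = "Critical"
    UNKNOWN = "Unknown"


# Nagios plugin exit codes mapped onto the states above.
NAGIOS_EXIT_TO_STATE = {
    0: State.NOMINAL,
    1: State.WARNING,
    2: State.CRITICAL,
    3: State.UNKNOWN,
}

# Assumed threshold; the design leaves the exact value open.
STALE_AFTER = timedelta(minutes=5)


def effective_state(last_state, last_checked, now, stale_after=STALE_AFTER):
    """Return the state to report for an item (hypothetical helper)."""
    if last_state is None or last_checked is None:
        # The item is defined but monitoring has not run against it yet.
        return State.UNKNOWN
    if now - last_checked > stale_after:
        # The last known state is too old to trust.
        return State.UNKNOWN
    return last_state


if __name__ == "__main__":
    now = datetime.now()
    print(effective_state(State.CRITICAL, now - timedelta(minutes=2), now))
    print(effective_state(State.NOMINAL, now - timedelta(minutes=10), now))
```

Computing staleness when the state is read, instead of rewriting the stored state, keeps the last real result available for inspection.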
Basic Definition
For the purposes of Walrus, a monitoring task is defined as something that is run against something else and returns a value. All monitoring jobs run against a system (see Inventory Management for more information), and the results are stored in the monitoring system.
Monitoring tasks can be dynamic or static. A dynamic task is applied to a system because the system meets some condition. A static task is manually assigned to a system.
Ideally, all tasks will be dynamic, but static support is required for situations where one machine does something that no other machine does and you don't care to manage that process in Walrus.
The UI should let people see which tasks are running against which systems, and why each one applies.
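One way to picture the dynamic/static split, and the "why" the UI is meant to show, is the minimal sketch below. Everything in it is hypothetical: the `System` and `Task` classes, the use of tags as the matching condition, and the `tasks_for` helper are assumptions for illustration, not the Walrus data model.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional, Set


@dataclass
class System:
    name: str
    tags: Set[str] = field(default_factory=set)


@dataclass
class Task:
    name: str
    # Dynamic tasks carry a condition; static tasks leave it as None.
    condition: Optional[Callable[[System], bool]] = None
    # Human-readable explanation of the condition, shown as the "why".
    reason: str = ""
    # Static tasks list the systems they were manually assigned to.
    assigned_to: Set[str] = field(default_factory=set)


def tasks_for(system, tasks):
    """Return (task name, why) pairs for every task that applies to `system`."""
    applied = []
    for task in tasks:
        if task.condition is not None and task.condition(system):
            applied.append((task.name, f"dynamic: {task.reason}"))
        elif system.name in task.assigned_to:
            applied.append((task.name, "static: manually assigned"))
    return applied


if __name__ == "__main__":
    tasks = [
        Task("check_http", condition=lambda s: "web" in s.tags,
             reason="system has tag 'web'"),
        Task("check_backup_job", assigned_to={"db1"}),
    ]
    for system in (System("web1", {"web"}), System("db1", {"db"})):
        print(system.name, tasks_for(system, tasks))
```

Storing the reason next to the condition means the explanation shown to the user comes from the same record that caused the task to be applied.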
Requirements
Systems will be automatically added to monitoring as soon as they are added to Walrus. Dynamic tasks are created for each new system within a few minutes and begin running as appropriate.
Notes
Brief notes, some use cases:
- Oncall (24x7, all state changes)
- Someone who wants to subscribe to some notices
- Various types of notices: email, IM, IRC, SMS, etc
- Groupings and subsets
  - E.g., Bob wants to subscribe to red on tagA, yellow+ on tagB, green+ on tagC
- Notice levels
- Escalations for unanswered alert conditions
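The subscription use case for Bob can be sketched as a simple matching function. This is a tentative illustration only: it assumes red, yellow, and green correspond to Critical, Warning, and Nominal, that "+" means "this level or anything more severe", and that subscriptions are keyed by tag. The `Subscription` class and `subscribers_to_notify` function are hypothetical names, and where Unknown sits in the ordering is left open.

```python
from dataclasses import dataclass

# Assumed severity ordering: green=Nominal < yellow=Warning < red=Critical.
SEVERITY = {"Nominal": 0, "Warning": 1, "Critical": 2}


@dataclass(frozen=True)
class Subscription:
    subscriber: str
    tag: str
    min_state: str  # notify at this state or anything more severe


def subscribers_to_notify(subscriptions, system_tags, new_state):
    """Return the sorted names of subscribers who should hear about a state change."""
    level = SEVERITY[new_state]
    return sorted({
        sub.subscriber
        for sub in subscriptions
        if sub.tag in system_tags and level >= SEVERITY[sub.min_state]
    })


if __name__ == "__main__":
    bob = [
        Subscription("bob", "tagA", "Critical"),  # red on tagA
        Subscription("bob", "tagB", "Warning"),   # yellow+ on tagB
        Subscription("bob", "tagC", "Nominal"),   # green+ on tagC
    ]
    print(subscribers_to_notify(bob, {"tagB"}, "Warning"))
    print(subscribers_to_notify(bob, {"tagA"}, "Warning"))
```

Delivery channels (email, IM, IRC, SMS) and escalations would sit on top of this matching step and are not modeled here.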
