Thinking About Monitoring

Change #12 by Mark Smith at 2011-09-08 08:37:54.

Problems

Things that must be solved if a system like this is going to really work.

  • There are two types of issues -- immediate and long-term. Immediate issues are "the site is down!" type issues that have to be acted on quickly. Long-term issues are things like "the disk on foo is now at 80%". Traditional systems open tickets for the latter, or just re-alert every so often so people remember the problem exists.

Design Notes

  • Simple, easy-to-use interface. Actually get a user interface person to spend some time making the interface easy to use and flexible. Design it!

  • Chained events. When, say, "zombie procs on foo" fires, it should go up the chain to the parent event ("host down on foo") and check that event. This way the broadest alert fires, which should reduce alert storms and keep people from chasing the wrong problem if they start investigating before the parent alert fires.

  • Lots of API support. Writing an IRC bot should be easy.
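
The chained-events note above could be sketched roughly as follows. This is a minimal illustration only: the `Check` class, its `parent` field, and `broadest_failing` are hypothetical names, and a real system would run actual checks rather than the hard-coded lambdas used here.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Check:
    """A named check with an optional parent check (hypothetical structure)."""
    name: str
    is_failing: Callable[[], bool]
    parent: Optional["Check"] = None


def broadest_failing(check: Check) -> Optional[Check]:
    """Walk up the parent chain and return the highest failing check.

    Returns None if the given check itself is not failing.
    """
    if not check.is_failing():
        return None
    broadest = check
    node = check.parent
    while node is not None:
        if node.is_failing():
            broadest = node
        node = node.parent
    return broadest


# Hard-coded results stand in for real checks.
host_down = Check("host down on foo", is_failing=lambda: True)
zombies = Check("zombie procs on foo", is_failing=lambda: True, parent=host_down)

alert = broadest_failing(zombies)
print(alert.name)  # host down on foo
```

Only the check returned by `broadest_failing` would be alerted on, so a dead host produces one alert instead of one per service on it.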

Event Handling

  • I'm thinking about having the system support events. Not only does it monitor, but you can script certain events. I'm thinking about a way to, say, disable a data center. Or put things into maintenance mode from the control panel and have it take actions on the machine?

  • This could be useful so you have a one-stop place to interact with your system and kick off events. If you have a system that lets you script power cycling a box, you can set up something to handle that:

    1. Click "Power Cycle" on the machine status page
    2. A PreOffline event is fired
    3. An event listener handles shutting down services on the machine (and importantly, it can abort the offlining process if something goes wrong)
    4. The machine is removed from monitoring (silenced)
    5. The reboot is issued
    6. The Offline event is fired
    7. When the machine comes online, it goes through the normal boot-up sequence, which fires a PreOnline event
    8. An event listener takes the PreOnline event and tries to start up services, returning success or fail
    9. The machine either fires an Online event or a Failed event
    10. Monitoring is re-enabled for the machine and services

    That sounds complicated but I think in practice it can be made straightforward.
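
As a rough sketch of how the ten steps might fit together in code: the event names (PreOffline, Offline, PreOnline, Online, Failed) come from the list above, but `EventBus`, `power_cycle`, the "Aborted" return value, and the `silence`/`reboot`/`unsilence` callables are all hypothetical. The sketch also fires PreOnline inline, whereas in the design it would come from the machine's boot-up sequence.

```python
from typing import Callable, Dict, List

# A listener takes a hostname and returns True on success (assumed convention).
Listener = Callable[[str], bool]


class EventBus:
    """Tiny in-process event bus (hypothetical, for illustration only)."""

    def __init__(self) -> None:
        self._listeners: Dict[str, List[Listener]] = {}

    def on(self, event: str, listener: Listener) -> None:
        self._listeners.setdefault(event, []).append(listener)

    def fire(self, event: str, host: str) -> bool:
        # Stops at the first listener that reports failure.
        return all(fn(host) for fn in self._listeners.get(event, []))


def power_cycle(bus, host, silence, reboot, unsilence) -> str:
    # Steps 2-3: listeners shut services down and may abort the whole process.
    if not bus.fire("PreOffline", host):
        return "Aborted"
    silence(host)                             # step 4
    reboot(host)                              # step 5
    bus.fire("Offline", host)                 # step 6
    started = bus.fire("PreOnline", host)     # steps 7-8
    outcome = "Online" if started else "Failed"
    bus.fire(outcome, host)                   # step 9
    unsilence(host)                           # step 10
    return outcome


log: List[str] = []


def record(label: str) -> Listener:
    """Build a stand-in action that only records that it ran."""
    def action(host: str) -> bool:
        log.append(f"{label} {host}")
        return True
    return action


bus = EventBus()
bus.on("PreOffline", record("stop services on"))
bus.on("PreOnline", record("start services on"))

outcome = power_cycle(bus, "foo", record("silence"), record("reboot"), record("unsilence"))
print(outcome)  # Online
print(log)
```

Because a PreOffline listener can return False, the machine is never silenced or rebooted when shutting down services goes wrong.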

Dashboard Requirements

  • Showing recent alert changes and other system events so people can easily see what is going on.

  • Allow easy access to system-wide commands such as "STFU EVERYTHING" and maintenance buttons. These should probably be scriptable with events.

Architecture

MUST tolerate failure of any single component/server.

  1. The component that sits on a machine.
    • Executes event listener scripts.
    • Executes local monitoring scripts.
    • Provides an API to interface.
  2. The master scheduler.
    • Kicks off jobs to go monitor things -- manages the queues.
    • Provides an API to interface.
  3. The UI server.
    • HTTP interface for everything, doesn't actually do much work itself.
  4. Workers.
    • The monitoring components -- just use a simple Redis system?
    • Have monitoring be opt-in participation by each machine.

Event Queues

There are three queues of events:

  • Stuff that has to be processed by the master scheduler.
  • Stuff that has to be run on a machine (local environment).
  • Stuff that can be run anywhere.

The idea is to put as much as possible in the third queue, then the second queue, and only what is left in the first. We don't want many checks to happen from the master scheduler. Its queue is mostly going to hold things like "fire this escalation" -- stuff that we want to guarantee happens once.
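
One way the routing rule could look, as a minimal sketch: the `Queue` enum, the `Job` fields, and the example job names other than the escalation are assumptions made for illustration, not part of the design.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Queue(Enum):
    MASTER = "master"      # processed by the master scheduler
    LOCAL = "local"        # must run on a specific machine
    ANYWHERE = "anywhere"  # any worker can run it


@dataclass
class Job:
    name: str
    needs_exactly_once: bool = False   # e.g. firing an escalation
    needs_host: Optional[str] = None   # set when the job needs a local environment


def route(job: Job) -> Queue:
    """Pick the least restrictive queue that still satisfies the job."""
    if job.needs_exactly_once:
        return Queue.MASTER
    if job.needs_host is not None:
        return Queue.LOCAL
    return Queue.ANYWHERE


for job in (
    Job("fire this escalation", needs_exactly_once=True),
    Job("disk usage", needs_host="foo"),
    Job("HTTP check"),
):
    print(job.name, "->", route(job).value)
```

Anything that sets neither constraint falls through to the "anywhere" queue, which keeps the master scheduler's queue small by default.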

The master scheduler is something that runs on a few machines you designate, and one of them sets itself to be master. If more than one is running, the others will subscribe and process state updates but won't step in to do anything unless the active scheduler dies. Note: maybe this is too complicated? Maybe we should just let the sysadmin use heartbeat or similar to ensure only one is running at a time? That is easier for us but would probably hinder adoption.
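
The notes don't say how a standby would detect that the active scheduler has died. One common option is a time-limited lease that the master keeps renewing, sketched below. This is an assumed technique, not the design's stated mechanism; `MasterLease` and the scheduler names are hypothetical, and a real deployment would need the lease stored somewhere shared with an atomic compare-and-set, which this in-memory version does not provide.

```python
from typing import Optional


class MasterLease:
    """In-memory stand-in for a shared lease record (hypothetical)."""

    def __init__(self, ttl: float) -> None:
        self.ttl = ttl
        self.holder: Optional[str] = None
        self.expires_at = 0.0

    def try_acquire(self, node: str, now: float) -> bool:
        """Acquire or renew the lease; return True if `node` is now master."""
        expired = now >= self.expires_at
        if self.holder is None or self.holder == node or expired:
            self.holder = node
            self.expires_at = now + self.ttl
            return True
        return False


lease = MasterLease(ttl=10.0)
print(lease.try_acquire("sched-a", now=0.0))   # True: first to ask becomes master
print(lease.try_acquire("sched-b", now=5.0))   # False: sched-a still holds the lease
print(lease.try_acquire("sched-a", now=8.0))   # True: renewal, now valid until 18.0
print(lease.try_acquire("sched-b", now=15.0))  # False: lease was renewed
print(lease.try_acquire("sched-b", now=20.0))  # True: sched-a stopped renewing
```

Time is passed in explicitly so the behaviour is easy to follow; a standby would simply call `try_acquire` on a timer and start acting as master once it returns True.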

Another idea is to have the master scheduler just run on the cluster... it should really take very little processing time in an ideal world.

Configuration

YAML is easy for humans to read and write and just as easy for machines to manipulate. That is probably the easiest way to maintain a configuration file. We will probably suggest that people just use the Web UI to manage the configuration, but they're welcome to edit the config files directly. (The Web UI won't exist at first, of course...)

An API should exist for managing the config. That way people can wire the system into whatever their other systems are.

Where To Start

The very basic "get up and running" system has to support:

  • Define things to monitor (hosts) and allow tagging/classifying them somehow.
  • Define the checks we have available.
  • Specify which checks apply to which things -- by host, by tag, by group, etc.
  • When a check fires, allow the results to go to email.

At first we don't need to support Redis, queues, locking, or doing this in multiple locations. We can actually run a lot of checks from a central location for quite a while before we get down to the nitty gritty of a distributed system.
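
A minimal single-process sketch of that starting point, covering hosts with tags, check definitions, check-to-host assignment, and turning failures into an email. All class and function names, the sample hosts, and the `monitor@example.com` address are hypothetical; the sketch only builds the message, and actually sending it (for example with `smtplib.SMTP(...).send_message(msg)`) assumes a reachable SMTP server.

```python
from dataclasses import dataclass, field
from email.message import EmailMessage
from typing import Callable, List, Set


@dataclass
class Host:
    name: str
    tags: Set[str] = field(default_factory=set)


@dataclass
class Check:
    name: str
    run: Callable[[Host], bool]                    # returns True when healthy
    hosts: Set[str] = field(default_factory=set)   # applies to these host names
    tags: Set[str] = field(default_factory=set)    # or to hosts with any of these tags

    def applies_to(self, host: Host) -> bool:
        return host.name in self.hosts or bool(self.tags & host.tags)


def run_checks(hosts: List[Host], checks: List[Check]) -> List[str]:
    """Run every applicable check from one central place; collect failures."""
    failures = []
    for host in hosts:
        for check in checks:
            if check.applies_to(host) and not check.run(host):
                failures.append(f"{check.name} failed on {host.name}")
    return failures


def build_alert_email(failures: List[str], to_addr: str) -> EmailMessage:
    msg = EmailMessage()
    msg["Subject"] = f"{len(failures)} check(s) failing"
    msg["From"] = "monitor@example.com"  # placeholder sender
    msg["To"] = to_addr
    msg.set_content("\n".join(failures))
    return msg


hosts = [Host("foo", {"web"}), Host("bar", {"db"})]
# Stand-in check: pretend the disk is only full on "foo".
disk = Check("disk usage", run=lambda h: h.name != "foo", tags={"web", "db"})

failures = run_checks(hosts, [disk])
print(failures)  # ['disk usage failed on foo']
if failures:
    print(build_alert_email(failures, "ops@example.com")["Subject"])
```

Nothing here needs queues or locking, which matches the point above: a plain loop in one place can cover a lot of checks before the distributed pieces are worth building.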
