Mongoose

Change #4 by Mark Smith at 2011-06-14 01:10:39.

Mongoose is the monitoring and alerting subsystem. The project is designed to replace Nagios with a monitoring and alerting system that is simple to understand and easy to use.

It is also designed to tie into OpenTSDB, both as its main source of data and as a place to store alerts and other information that it gathers.
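
As a rough illustration of storing alert data in OpenTSDB, the sketch below formats one data point in OpenTSDB's line-based `put` format (`put <metric> <timestamp> <value> <tagk>=<tagv> ...`). The metric name `mongoose.alert.state`, the tag names, and the helper `format_put` are hypothetical; this page does not define a naming scheme.

```python
def format_put(metric, timestamp, value, tags):
    """Build one OpenTSDB 'put' line: put <metric> <ts> <value> <k=v> ..."""
    if not tags:
        # OpenTSDB requires at least one tag per data point.
        raise ValueError("at least one tag is required")
    # Sort tags so the output is stable regardless of dict ordering.
    tag_str = " ".join(f"{k}={v}" for k, v in sorted(tags.items()))
    return f"put {metric} {int(timestamp)} {value} {tag_str}"


# Hypothetical metric and tag names, for illustration only.
line = format_put(
    "mongoose.alert.state",
    1308013839,
    1,
    {"host": "db01", "monitor": "mysql_slave_lag"},
)
print(line)
```

A checker could send lines like this to OpenTSDB after each run, so alert history ends up next to the metrics that triggered it.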

Notes

  • When a monitor fails, it needs to kick off a re-monitor request up the chain of dependencies. For example, the MySQL Slave Lag monitor should depend on the MySQL Up monitor, which in turn depends on the Host Up monitor. We don't want to alert on MySQL Slave Lag if the real problem is that the host has gone offline.

  • The scheduler/master workers need to download a local copy of the configuration from the database, so that if the database becomes unavailable they can continue to monitor things and send out alerts.

  • The system should automatically monitor Mongoose itself, e.g., that the database is available, that a scheduler is active, etc.

  • Schedulers should probably talk to each other and make sure that someone is active and happy.
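
The dependency re-check described in the first note could be sketched as follows. This is a minimal illustration, not Mongoose's implementation: the `DEPENDS_ON` table, the monitor names, and the `run_check` callback (which returns True when a monitor is healthy) are all hypothetical.

```python
# Hypothetical dependency table: monitor -> the monitor it depends on.
DEPENDS_ON = {
    "mysql_slave_lag": "mysql_up",
    "mysql_up": "host_up",
    "host_up": None,
}


def find_root_failure(failed, run_check):
    """Walk up the dependency chain from a failed monitor.

    Re-runs each ancestor in turn and returns the highest failing
    monitor in the chain, which is the one worth alerting on.
    """
    root = failed
    parent = DEPENDS_ON.get(failed)
    while parent is not None:
        if run_check(parent):
            # The parent is healthy, so the failure starts at `root`.
            break
        root = parent
        parent = DEPENDS_ON.get(parent)
    return root


def host_is_down(name):
    """Simulated checks: the host is offline, so every check fails."""
    return False


def only_lag_is_bad(name):
    """Simulated checks: MySQL and the host are both healthy."""
    return True


print(find_root_failure("mysql_slave_lag", host_is_down))
print(find_root_failure("mysql_slave_lag", only_lag_is_bad))
```

In the first call the alert would be raised for `host_up` rather than for slave lag; in the second, the slave lag failure itself is the root cause.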

Overview

Mongoose is built on top of Gearman using a MySQL database backend for storing state and configuration information. (Although configuration information is likely to be found in a config file.)

A Mongoose installation is set up by taking N machines and configuring the following on each of them:

  • (1) Gearman Daemon
  • (1) Mongoose Scheduler (gearman worker)
  • (10+) Mongoose Checker (gearman worker)

Mongoose is configured in /etc/mongoose. Each worker watches the configuration and terminates itself if the configuration changes. The scheduler does the same.
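
One simple way for a process to notice a configuration change is to compare file modification times under the configuration directory. The sketch below assumes mtime polling; this page does not say how Mongoose actually watches /etc/mongoose, and the function names are made up for illustration.

```python
import os
import sys
import time


def snapshot(config_dir):
    """Map each file path under config_dir to its modification time."""
    state = {}
    for root, _dirs, files in os.walk(config_dir):
        for name in files:
            path = os.path.join(root, name)
            state[path] = os.stat(path).st_mtime_ns
    return state


def config_changed(config_dir, baseline):
    """True if any file was added, removed, or modified since baseline."""
    return snapshot(config_dir) != baseline


def watch_and_exit(config_dir="/etc/mongoose", interval=5.0):
    """Poll the config directory and exit the process when it changes."""
    baseline = snapshot(config_dir)
    while True:
        time.sleep(interval)
        if config_changed(config_dir, baseline):
            # Assumption: something external (init, a supervisor)
            # restarts the process so it picks up the new config.
            sys.exit(0)
```

A worker would run `watch_and_exit()` in a background thread or call `config_changed()` between jobs; which of the two fits better depends on how long individual checks run.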

We ensure that only one scheduler is active at a time through the simple expedient of having the scheduler start-up process look like this:

# Called when the scheduler starts, before it does anything else.
# insert_job_into_gearman and do_gearman_work are placeholders for the
# real Gearman client and worker calls.
sub start {
    while ( 1 ) {
        # Try to insert a mongoose-schedule job into Gearman and
        # ignore the result.
        insert_job_into_gearman( 'mongoose-schedule', { unique => 1 } );

        # Now try to work.
        do_gearman_work( 'mongoose-schedule', { timeout => 5 } );
    }
}

# The Gearman worker code, called when we get a job.
sub work {
    # If we get here, we know we're the only one to have this
    # job. Guaranteed by Gearman, as long as we're using the same
    # Gearman servers. (Can we guarantee that?)
    #
    # If the above is true, we're now the master scheduler.
}
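
To see why a unique job yields a single active scheduler, here is a toy in-memory model of the idea. It does not use Gearman or its API: `ToyQueue` is a hypothetical stand-in that coalesces jobs sharing a unique key and hands each job to at most one worker.

```python
import threading


class ToyQueue:
    """In-memory stand-in for a job server with unique-job coalescing."""

    def __init__(self):
        self._lock = threading.Lock()
        self._pending = set()   # unique keys waiting for a worker
        self._running = set()   # unique keys currently held by a worker

    def insert_unique(self, key):
        """Queue a job unless one with the same key is queued or running."""
        with self._lock:
            if key in self._pending or key in self._running:
                return False
            self._pending.add(key)
            return True

    def grab(self, key):
        """Hand the job to the caller if it is still pending."""
        with self._lock:
            if key not in self._pending:
                return False
            self._pending.remove(key)
            self._running.add(key)
            return True


def try_become_master(queue):
    """One pass of the start loop: insert the job, then try to take it."""
    queue.insert_unique("mongoose-schedule")   # result is ignored
    return queue.grab("mongoose-schedule")


queue = ToyQueue()
# Three schedulers each make one attempt; only the first gets the job.
results = [try_become_master(queue) for _ in range(3)]
print(results)
```

In the real system the job would presumably stay held for as long as the master's `work` function runs, so the other schedulers keep looping in `start` as standbys and can take over once the job becomes available again.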