1. the design and operation of mon

what is mon?
    -used by ISPs and NOCs for fault detection and
     alert generation
    -written in perl
    -GPLv2

features
    -portable (thanks to perl)
    	linux, solaris, bsd, cygwin
    -simple yet very adaptable design
    -can monitor anything, no clients required
    -configurable, extensible
    	-monitors, alerts, clients
    -good, supportive community

background
    -lots of scripts which test things + alert, run via cron
    -changing a test or an alert was a lot of work
     (had to change lots of scripts)
    -tracking the different monitors became cumbersome

design goals
    -simple to add alerts and monitors
    -no ties to a single reporting tool
    -simple way of cross-connecting tests and alerts
    -general enough to monitor anything

components
    -server
    -clients
    -monitors
    -alerts
    -traps

server
    -schedules tests
    -accepts remote traps
    -handles clients
    -alert management

configuration
    -config file ties together monitors and alerts
    -defines what is to be monitored
    -defines when alerts happen

monitors
    -test a condition
    -report a summary and detail status
    -exit reporting success/failure

monitors (cont'd)
    -can be written in any language
    -are called as separate processes
    -are usually short-lived (not much fear of mem leaks)
    -simple to write

examples of available monitors:
    fping, http, lpd, msql/mysql/oracle/postgres/informix/sybase, smtp,
    ldap, imap, disk quotas via snmp, reboot, telnet, pop3, processes, rpc,
    brocade fcal switches, generic tcp, traceroutes, router interfaces, ipsec
    tunnels, compaq chassis, foundry router chassis, dns, novell netware, nt
    services, samba, printers, ntp, bgp, radius, ups, more

alerts
    -accept input from the mon server
    -report the failure status detected by a monitor
    -exit

alerts (cont'd)
    -can be written in any language
    -are called as separate processes
    -are usually short-lived
    -simple to write

examples of available alerts:
    mail, snpp, trap, qpage, aim, bugzilla, gnats, hpov, sms, winpopup, netapp
    snapshot delete, log to file

alert management
    -alert decision logic in the server
    -intended to squelch repetitive alerts
    -dependencies

alert management (cont'd)
    -time period
    -alertafter
    -alertevery
    -numalerts
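
    The three knobs above can be pictured with a small model. This is a
    sketch under assumptions, not the server's actual logic: the hash keys,
    the function name should_alert and the exact ordering of the checks are
    made up for illustration. It treats alertafter as "N failures within a
    window", alertevery as a quiet interval after each alert, and numalerts
    as a cap on alerts per failure.

```perl
use strict;
use warnings;

# Simplified model of alert squelching (not the server's actual code).
# $cfg:   alertafter_count, alertafter_window, alertevery, numalerts
# $state: failures (list of times), last_alert, sent
# All times are in seconds; all names here are hypothetical.
sub should_alert {
    my ($cfg, $state, $now) = @_;

    # record this failure and forget the ones outside the alertafter window
    push @{ $state->{failures} }, $now;
    @{ $state->{failures} } =
        grep { $now - $_ <= $cfg->{alertafter_window} } @{ $state->{failures} };

    # alertafter: need at least N failures inside the window
    return 0 if @{ $state->{failures} } < $cfg->{alertafter_count};

    # alertevery: stay quiet until this much time has passed since the last alert
    return 0 if defined $state->{last_alert}
             && $now - $state->{last_alert} < $cfg->{alertevery};

    # numalerts: upper limit on alerts sent for this failure
    return 0 if $state->{sent} >= $cfg->{numalerts};

    $state->{last_alert} = $now;
    $state->{sent}++;
    return 1;
}

# one failure per minute; collect the decision made for each
my %cfg   = (alertafter_count => 2, alertafter_window => 600,
             alertevery => 100, numalerts => 2);
my %state = (failures => [], last_alert => undef, sent => 0);

my @decisions;
push @decisions, should_alert (\%cfg, \%state, $_) for (0, 60, 120, 180, 240, 300);
print "@decisions\n";    # prints: 0 1 0 1 0 0
```

    The first failure is held back by alertafter, the third and fifth by
    alertevery, and the last by numalerts.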

dependencies
    -perl expressions
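
    Since dependencies are perl expressions, one way to picture their
    evaluation is: replace each service reference with 1 or 0, then eval
    what is left. The sketch below is an assumption about the idea, not the
    server's code; deps_ok, the status hash layout and the token pattern
    (group and service joined by one or two colons) are hypothetical.

```perl
use strict;
use warnings;

# Simplified model of dependency evaluation (not the server's actual code).
# $expr:   dependency expression, e.g. "routers::ping && dns::query"
# $status: $status->{group}{service} is true when that service is up
sub deps_ok {
    my ($expr, $status) = @_;

    # replace every group::service token with 1 (up) or 0 (down)
    (my $perl = $expr) =~
        s{([\w.\-]+)::?([\w.\-]+)}{ $status->{$1}{$2} ? 1 : 0 }ge;

    # what remains is plain Perl such as "1 && 0"
    my $result = eval $perl;
    die "bad dependency expression '$expr': $@" if $@;
    return $result ? 1 : 0;
}

my %up = ("webserver.corp.com" => { fping => 1 }, routers => { ping => 0 });
print deps_ok ("webserver.corp.com::fping", \%up), "\n";
print deps_ok ("webserver.corp.com::fping && routers::ping", \%up), "\n";
```

    The expression is passed to eval, so this only makes sense for text
    that comes from a trusted config file.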

clients
    -"mon" protocol, port 2583
    -easy perl interface Mon::Client
    -get operational status of things monitored
    -disable/enable monitoring and alerting
    -acknowledge alerts sent
    -allows for alternate reports

clients (cont'd)
    -multiple web interfaces
    -command-line
    -WAP
    -2-way pager

simple example

    watch webserver.corp.com
	service fping
	    monitor fping.monitor
	    interval 1m
	    period wd {Sun-Sat}
	    	alert mail.alert trockij
		alertevery 24h
		upalert mail.alert trockij

more complex example

    watch webserver.corp.com
	service fping
	    monitor fping.monitor
	    interval 1m
	    period P1: wd {Sun-Sat}
	    	alert mail.alert trockij
		alertevery 12h
		upalert mail.alert trockij
	    period P2: wd {Sun-Sat}
	    	alert mail.alert trockij-pager
		alertevery 24h
		alertafter 3 10m
	    period P3: wd {Mon-Fri} hr {7am-10pm}
	    	alert mail.alert daytime-staff
		alertevery 4h
    	service http
	    monitor http.monitor
	    interval 2m
	    depend webserver.corp.com::fping
	    period wd {Sun-Sat}

making monitors
    monitors are simple
    expect a list of items to poll in @ARGV
    some standard env variables are set (MON_LOGDIR, etc.)
    perform checks on items
    first line of output is the summary line
    remaining lines are the detail (not interpreted by the server)
    exit status is zero on success, nonzero on failure

example monitor
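    #!/usr/bin/perl
    # example monitor: checks that each host answers "showmount -e"

    use strict;
    use warnings;

    my @failed;
    my $detail = "";

    # the items to check arrive in @ARGV
    foreach my $item (@ARGV) {
        my $output = `showmount -e $item 2>&1`;
        if ($?) {
            # nonzero status from showmount: the check failed
            push @failed, $item;
            $detail .= "$item failed:\n$output\n";
        }
        else {
            $detail .= "$item ok:\n$output\n";
        }
    }

    # first line is the summary (the failed items), the rest is detail
    print join (" ", @failed), "\n";
    print $detail;

    exit (@failed ? 1 : 0);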
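
    Seen from the receiving side, the monitor contract is small enough to
    sketch. The function below is an illustration only, not code from the
    server; parse_monitor_result and the keys of the hash it returns are
    hypothetical names. It splits a monitor's output into the summary line
    and the detail, and turns the exit status into ok / not ok.

```perl
use strict;
use warnings;

# Split a monitor's stdout and exit status into summary, detail and result.
# (hypothetical helper, not part of mon)
sub parse_monitor_result {
    my ($output, $exit_status) = @_;

    # first line is the summary, everything after it is the detail
    my ($summary, $detail) = split /\n/, $output, 2;

    return {
        ok      => ($exit_status == 0 ? 1 : 0),
        summary => defined $summary ? $summary : "",
        detail  => defined $detail  ? $detail  : "",
    };
}

my $r = parse_monitor_result ("nfs1 nfs2\nnfs1 failed:\ntimeout\n", 1);
print "ok=$r->{ok} summary=[$r->{summary}]\n";
print $r->{detail};
```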


making alerts
    @ARGV has some options supplied by server
    rest of @ARGV is from the config file
    first line of stdin is summary
    rest is detail
    perform whatever action desired


example alert
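    #!/usr/bin/perl

    use strict;
    use warnings;
    use Getopt::Std;

    # strip the options the server puts in front of the config-file args
    # (option letters are an assumption; check the mon docs for your version)
    my %opt;
    getopts ("s:g:h:t:l:u", \%opt);

    # first line of stdin is the summary
    chomp (my $summary = <STDIN>);

    # what is left in @ARGV comes from the config file: the recipients
    my $to = join (",", @ARGV);

    open (MAIL, "| /usr/lib/sendmail -oi -t") || die "sendmail: $!";

    print MAIL <<EOF;
    From: mon server
    To: $to
    Subject: ALERT $summary

    Something wicked this way comes.
    EOF

    close (MAIL);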


making clients
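    #!/usr/bin/perl

    use strict;
    use warnings;
    use Mon::Client;

    # talk to the mon server running on host "mon-bd2"
    my $cl = Mon::Client->new ("host" => "mon-bd2");

    $cl->connect;

    # operational status, indexed by group and then by service
    my %s = $cl->list_opstatus;

    $cl->disconnect;

    # print every status variable of service "service" in group "server"
    foreach my $var (keys %{$s{"server"}->{"service"}})
    {
        print "$var=$s{server}->{service}->{$var}\n";
    }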
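
    A client usually wants to boil that status hash down to something
    readable. The sketch below works on a plain hash shaped like the one
    above (group, then service, then status variables), so it runs without
    a server. The field names opstatus and last_summary, and the reading of
    opstatus 0 as "failed", are assumptions to verify against the
    Mon::Client documentation; failing_services is a hypothetical helper.

```perl
use strict;
use warnings;

# List the failing services in a status hash of the form
#   $status->{group}{service}{variable}
# Assumed (not confirmed here): opstatus == 0 means the service is failing,
# and last_summary holds the monitor's summary line.
sub failing_services {
    my ($status) = @_;
    my @lines;

    foreach my $group (sort keys %$status) {
        foreach my $service (sort keys %{ $status->{$group} }) {
            my $st = $status->{$group}{$service};
            next unless defined $st->{opstatus} && $st->{opstatus} == 0;
            my $summary = defined $st->{last_summary} ? $st->{last_summary} : "";
            push @lines, "$group/$service: $summary";
        }
    }
    return @lines;
}

# hand-made sample data standing in for the result of list_opstatus
my %sample = (
    webservers => {
        fping => { opstatus => 1, last_summary => "" },
        http  => { opstatus => 0, last_summary => "www2" },
    },
);
print "$_\n" for failing_services (\%sample);
```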


interesting applications
    home-brew failover
    	failover.alert
	several web servers
	each with eth0 admin and eth0:0 virtual addr
	mon server polls http servers
	on failure, failover.alert logs in via ssh to a
	secondary server and ifup's the dead virtual ip on eth0:1

interesting applications
    on-call schedules with Schedule::Oncall
    simple alert which loads the oncall schedule
    sends mail or whatever to the person on call

interesting applications
    debugging wan
    traceroute monitor
    show when path changes
    record history of traces
    call isp with evidence rather than speculation
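
    The "show when path changes" step comes down to comparing the hop list
    of the current trace with the previous one. A minimal sketch, assuming
    each trace has already been reduced to an ordered list of hop
    addresses; path_change is a hypothetical helper and the addresses are
    documentation examples, not real data.

```perl
use strict;
use warnings;

# Compare two traces (array refs of hop addresses, in order).
# Returns a description of the first hop that differs, or undef when the
# two paths are identical.
sub path_change {
    my ($old, $new) = @_;
    my $hops = @$old > @$new ? @$old : @$new;

    foreach my $i (0 .. $hops - 1) {
        my $was = defined $old->[$i] ? $old->[$i] : "(none)";
        my $now = defined $new->[$i] ? $new->[$i] : "(none)";
        return sprintf ("hop %d: %s -> %s", $i + 1, $was, $now)
            if $was ne $now;
    }
    return undef;
}

my @yesterday = qw(192.0.2.1 192.0.2.9  198.51.100.7);
my @today     = qw(192.0.2.1 192.0.2.33 198.51.100.7);

my $change = path_change (\@yesterday, \@today);
print defined $change ? "$change\n" : "path unchanged\n";
```

    Writing each change to a log along with the time gives the history of
    traces mentioned above.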

interesting applications
    print queues jamming
    clumsy unreliable printers, need to tune lprng
    catch them when they jam so we can collect data

experience
    useful as a debugging tool
    if it failed twice before, write a monitor
    helps the team stay in tune with system problems
    admin team knows problems before users report them


scalability
    parallelization of services
    one monitor per service, per process
    each monitor handles its own parallelization
    e.g. fping.monitor, phttp.monitor
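
    One way a monitor can handle its own parallelization is to fork one
    child per item and collect the exit statuses. This is a sketch of that
    general technique, not how fping.monitor or phttp.monitor are
    implemented; check_in_parallel and the check callback are hypothetical.

```perl
use strict;
use warnings;
use POSIX ();

# Run $check->($item) for every item, each in its own child process, and
# return the sorted list of items whose check failed.
sub check_in_parallel {
    my ($check, @items) = @_;
    my %item_for_pid;

    foreach my $item (@items) {
        my $pid = fork ();
        die "fork failed: $!" unless defined $pid;
        if ($pid == 0) {
            # child: report the result through the exit status and leave
            # at once, without flushing buffers inherited from the parent
            POSIX::_exit ($check->($item) ? 0 : 1);
        }
        $item_for_pid{$pid} = $item;
    }

    # parent: reap every child and note the ones that reported failure
    my @failed;
    while ((my $pid = wait ()) != -1) {
        next unless exists $item_for_pid{$pid};
        push @failed, $item_for_pid{$pid} if $? != 0;
    }
    return sort @failed;
}

# stand-in check: anything whose name starts with "down" counts as failed
my @failed = check_in_parallel (sub { $_[0] !~ /^down/ }, qw(www1 down2 www3));
print "failed: @failed\n";
```

    The total run time is then roughly that of the slowest single check
    rather than the sum of all of them.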

closing

