Overview and structure of HAD monitoring system
-----------------------------------------------

The HAD monitoring system is a Perl script intended to be scheduled as a
periodic OS job. It collects the logs of the monitored daemons and analyzes
whether the pool state, from the daemons' viewpoint, is valid at each moment.
The script reports invalid states by sending an e-mail to the pool
administrator, and it sends a periodic (normally daily) report about all the
events that happened in the pool between the last two launches of the script.

The script is called 'HadMonitoringSystem.pl'. The monitoring system package
also contains several directories and auxiliary files:

* InitializeState.pl - to bring the system to its initial state, i.e. to make
  the system forget the last log timestamp it processed, treat the initial
  pool state as one in which all the machines are down, restart the countdown
  for daily reports, etc.
* ClearLogs.pl - to clear the excess logs in all the log directories
* ClearArchives.pl - to clear all the archives in the archives directory
* Common.pm - global functions and definitions that can be used by any Perl
  script in the system
* Configuration - the configuration file of the monitoring system (see the
  "Configuration parameters" chapter below)
* State- files (such as 'State-Had' or 'State-Replication') - to keep daemon
  information that persists from one run of the monitoring system to the next
* ArchiveLogs - to store old zipped logs of previous runs
* DaemonLogs - to store temporary copies of the daemons' logs
* WarningLogs - to store warning messages generated by the system
* ErrorLogs - to store error messages generated by the system
* EventFiles - to store temporary copies of event files; each event file
  represents the history of the most significant events (from that daemon's
  viewpoint) that had happened to a specific daemon by the time the script ran
* OutputLogs - to store temporary copies of all
  the files that are considered the output of the system, to store the daily
  report files, the system activity log and the overall monitoring system
  history files
* Checkers - contains a script for each of the monitored daemons; these
  scripts help the main program to extract the necessary daemon events and
  statuses and to determine whether a status is valid, whether a warning
  should be sent, etc.

How it works
------------

The script starts by fetching all the monitored daemons' logs (including the
".old" logs) from the pool machines' collectors. When this is accomplished,
the script extracts from these logs the rows that stand for interesting events
that happened in the pool. Out of these plain logs, an event file is created
for each pool machine. The event files are significantly lighter than the
daemons' logs themselves.

By passing over all the event files simultaneously, the script merges the
event files of the different pool machines into one output file that contains
the chronologically ordered history of all the events, correct as of the
moment the script is run. This output is still too verbose for the monitoring
system administrator to read, because it contains successive entries between
which the pool state did not change. The next level of extraction is an epoche
log, which contains one entry per period during which the pool state did not
change.

The errors and warnings are generated while analyzing the daemons' logs. Each
daemon's specific functions for extracting the information from the logs are
located in the 'Checkers' directory. These functions determine what is
considered an error and what is considered a warning for the specific daemon.
So, for instance, for HAD the presence of two or more HADs in the pool, or of
none, is considered an error, while a warning is reported for a huge gap
between log lines or for an unexpected shutdown of one of the backup HADs.
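
As an illustration of the epoche extraction and of the HAD rule described
above, here is a minimal Perl sketch. It is not the code from the 'Checkers'
directory. The function names (sketch_build_epoches, sketch_validate_had), the
event layout, the numeric timestamps and the role names 'leader', 'backup' and
'down' are all assumptions made for the example. It also assumes that the HAD
rule counts the HADs in the leader role, which the text above does not state
explicitly.

```perl
use strict;
use warnings;

# Collapse a chronologically ordered event list into epoches: periods during
# which the status vector (machine name => role) does not change.
# Each event is [timestamp, machine, new_role]; this layout is hypothetical.
sub sketch_build_epoches {
    my ($events, $initial_status, $start_time, $end_time) = @_;
    my %status       = %$initial_status;
    my $epoche_start = $start_time;
    my @epoches;
    for my $event (@$events) {
        my ($time, $machine, $role) = @$event;
        # Skip successive entries between which the pool state did not change.
        next if defined $status{$machine} && $status{$machine} eq $role;
        push @epoches, { start => $epoche_start, end => $time, status => { %status } }
            if $time > $epoche_start;
        $status{$machine} = $role;
        $epoche_start     = $time;
    }
    # Close the last epoche at the end of the scanned period.
    push @epoches, { start => $epoche_start, end => $end_time, status => { %status } }
        if $end_time > $epoche_start;
    return \@epoches;
}

# Return "" if the status vector is valid, or an error message otherwise.
# Assumed rule: exactly one machine may hold the 'leader' role.
sub sketch_validate_had {
    my ($start, $end, $status) = @_;
    my @leaders = grep { $status->{$_} eq 'leader' } sort keys %$status;
    return "" if @leaders == 1;
    return "no leader HAD between $start and $end" if @leaders == 0;
    return "several leader HADs between $start and $end: " . join(', ', @leaders);
}

# Example run on made-up data (machine names and timestamps are placeholders).
my $epoches = sketch_build_epoches(
    [
        [ 10, 'machine-a', 'leader' ],
        [ 20, 'machine-b', 'backup' ],
        [ 30, 'machine-b', 'backup' ],    # repeated entry, no state change
        [ 40, 'machine-b', 'leader' ],
        [ 50, 'machine-a', 'backup' ],
    ],
    { 'machine-a' => 'down', 'machine-b' => 'down' },
    0, 100,
);
for my $epoche (@$epoches) {
    my $error = sketch_validate_had($epoche->{start}, $epoche->{end}, $epoche->{status});
    printf "%3d - %3d: %s\n", $epoche->{start}, $epoche->{end},
        $error eq "" ? "valid" : "ERROR: $error";
}
```

The validation function follows the same convention as the checker functions
of the system: an empty string means that the status is valid, any other
string is the error message.
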
RD's warnings and errors are similar to HAD's; in addition, RD issues a
warning when the timestamps of the state file replicas differ by more than
some predefined time. If the system was unable to fetch the current logs of
some daemons, it issues a warning message too.

The epoche, warning and error logs are all appended to consolidated epoche,
warning and error logs (which we also refer to as daily logs), which are sent
to the administrator at the end of the day. In addition, all these logs are
appended to epoche, warning and error history log files, so that the
administrator can trace the whole history of events from the very beginning.

In case of any error or warning, the corresponding error or warning log is
sent to the administrator immediately after the error or warning is
discovered. At the end of the day (whether or not there were errors or
warnings) the consolidated (daily) epoche log is sent to the administrator.

At the end of the scan, the latest pool state information is stored in the
state file, so that at the beginning of the next run the system knows where
it stopped scanning the daemon logs. All the logs that were produced during
this run, along with the daily reports, history logs, configuration and state
files, are zipped and moved to the archives directory. Archives that were
created more than some (configured) time ago are deleted automatically.

After all these operations are performed, the logs of this run are deleted.
Consolidated logs are deleted at the end of the day, after the daily report
has been sent to the administrator. History logs are not deleted
automatically.

Checkers
--------

To add another daemon to be monitored by the system, one needs to add a
corresponding checker to the 'Checkers' directory.
Suppose we want to add support for 'startd'. Then we need to create a file
named 'Startd.pl' in the 'Checkers' directory and to supply the following
functions in it:

* StartdValidate - given the epoche start/end timestamps and the status
  vector, the function should return "" if the status is valid, or an error
  message if it is invalid
* StartdDiscoverEvent - given a log line, the function must return the
  significant event (from the daemon's viewpoint) that the line contains
* StartdApplyStatus - given the event that happened and the previous status,
  the function must return the new status of the daemon after the event
* StartdGap - the function must return what is considered a log gap for this
  daemon
* StartdConfigurationInformation - the function must return the
  daemon-related configuration information

Currently the system supports HAD and RD monitoring.

Prerequisites
-------------

* The $CONDOR_CONFIG environment variable must point to the right Condor
  configuration file
* The condor_fetchlog and ssh utilities must be in your path when you run the
  system
* If you plan to use the offset calculation option, make sure you can log in
  as the given user on all the hosts with 'ssh' without a password
* Set the MONITORING_HOME environment variable to the name of the directory
  where the monitoring system is stored, or alternatively just run the main
  script from within that directory

Configuration parameters
------------------------

* IS_REPORT_SENT - defines whether any report (be it an error report or a
  regular daily report) is sent to the administrator. If set to 'no', no
  report is sent. Any other value tells the system to send the reports.
  Default: true
* IS_NO_EVENT_REPORT_SENT - defines whether a report containing no errors is
  sent to the administrator. If set to 'no', such a report is not sent. Any
  other value tells the system to send such reports.
  Default: true
* SMTP_SERVER - the full hostname (host and domain) of the SMTP server
* ERROR_REPORT_RECIPIENTS - comma-separated list of the error report
  recipients' addresses
* CONSOLIDATED_REPORT_RECIPIENTS - comma-separated list of the daily report
  recipients' addresses
* CONSOLIDATED_REPORT_FREQUENCY - defines how often the daily report is sent.
  If set to 'X', the report is sent once in every 'X' runs of the script
* STORE_OLD_ARCHIVES_DAYS - defines for how many days the system keeps old
  log archives. Default: 7
* MONITORED_DAEMONS - list of daemons to monitor; the names in it must appear
  as they do in DAEMON_LIST
* IS_OFFSET_CALCULATION_NEEDED - determines whether offset calculation is
  needed for the daemons; useful for WAN pools only. Default: true
* OFFSET_CALCULATION_SSH_USER - the user to log in as with 'ssh' in order to
  determine the date on the remote machines. Make sure that passwordless
  'ssh' login as this user is possible on all the hosts
* FICTIVE_SENDER_ADDRESS - the address from which the reports are sent to the
  administrators
* DEBUGGING_LEVEL - determines the level of debugging information that is
  written to the system activity log. The possible values are (in ascending
  order of verbosity): INFO, DEBUG. Default: INFO
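
To make the value conventions listed above concrete, here is a minimal Perl
sketch of how such parameters could be read and interpreted. It is not taken
from Common.pm. The 'NAME = value' file syntax, the '#' comments and all the
function names (sketch_parse_configuration, sketch_is_enabled,
sketch_recipients) are assumptions made for the example, and the addresses
are placeholders. Only the defaults and the "'no' disables, any other value
enables" rule come from the parameter list.

```perl
use strict;
use warnings;

# Defaults as stated in the parameter list above.
my %DEFAULTS = (
    IS_REPORT_SENT               => 'true',
    IS_NO_EVENT_REPORT_SENT      => 'true',
    STORE_OLD_ARCHIVES_DAYS      => 7,
    IS_OFFSET_CALCULATION_NEEDED => 'true',
    DEBUGGING_LEVEL              => 'INFO',
);

# Parse configuration text into a hash, starting from the defaults.
# Assumed syntax: one "NAME = value" per line, '#' starts a comment line.
sub sketch_parse_configuration {
    my ($text) = @_;
    my %config = %DEFAULTS;
    for my $line (split /\n/, $text) {
        next if $line =~ /^\s*(?:#|$)/;                      # comment or blank
        next unless $line =~ /^\s*(\w+)\s*=\s*(.*?)\s*$/;    # not NAME = value
        $config{$1} = $2;
    }
    return \%config;
}

# Report flags: 'no' disables the report, any other value enables it.
sub sketch_is_enabled {
    my ($config, $name) = @_;
    my $value = $config->{$name};
    return (defined $value && $value eq 'no') ? 0 : 1;
}

# Split a comma-separated recipients list into individual addresses.
sub sketch_recipients {
    my ($config, $name) = @_;
    my $value = $config->{$name};
    return () unless defined $value;
    $value =~ s/^\s+|\s+$//g;
    return grep { $_ ne '' } split /\s*,\s*/, $value;
}

# Example with placeholder values.
my $config = sketch_parse_configuration(<<'END');
# Placeholder values, not a recommended setup
IS_NO_EVENT_REPORT_SENT = no
ERROR_REPORT_RECIPIENTS = admin1@example.org, admin2@example.org
END

print "send reports:          ",
    (sketch_is_enabled($config, 'IS_REPORT_SENT') ? "yes" : "no"), "\n";
print "send no-event reports: ",
    (sketch_is_enabled($config, 'IS_NO_EVENT_REPORT_SENT') ? "yes" : "no"), "\n";
print "error recipients:      ",
    join(' | ', sketch_recipients($config, 'ERROR_REPORT_RECIPIENTS')), "\n";
print "keep archives (days):  $config->{STORE_OLD_ARCHIVES_DAYS}\n";
```

Parameters without a stated default, such as SMTP_SERVER, simply stay
undefined in this sketch when they are missing from the file.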