Description
Describe the bug
I have a Cacti system with 4000 hosts. The system was isolated by a network outage, and thold reported all hosts as DOWN. When the network recovered, thold started reporting the hosts as up, but because there were so many (and because I run a local script via Status Change Command for each notification, which increases processing time), it took more than 5 minutes to work through the down hosts and generate the emails. The next poller run kicked in, and the thold process never reached `thold_debug('Down device checks finished.');`. As a result, the cleanup step was skipped (e.g. `thold_debug('Thold Log Cleanup finished.');`). On the next polling cycle thold starts sending the recovered-from-DOWN emails all over again, and it can never recover by itself because it never has time to finish.
To break the loop I had to truncate the table `plugin_thold_host_failed`.
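For reference, the manual workaround was a single statement against the Cacti database (run it with your usual MySQL/MariaDB client against whatever schema your installation uses):

```sql
-- Clears thold's record of failed hosts, breaking the notification loop.
-- Note: this discards the down-host history, so any pending recovery
-- notifications are lost as well.
TRUNCATE TABLE plugin_thold_host_failed;
```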
To Reproduce
Steps to reproduce the behavior:
- Configure a bunch of hosts (e.g. 10) for monitoring
- Enable `Status Change Command` and set it to a script which (for instance) does `sleep 60`. This is to artificially slow down the execution of the plugin, forcing it to run longer than the polling cycle.
- Enable email notifications for down/up hosts
- Isolate all 10 hosts from the network, to force them down.
- Wait for the hosts to be down, then recover the network to force them up.
- Observe that thold doesn't finish within a polling cycle and doesn't clear `plugin_thold_host_failed`, causing an alarm loop on every polling cycle.
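The slowdown script from the steps above can be sketched as follows (the path `/tmp/thold_slow.sh` is illustrative; point thold's Status Change Command at wherever you save it):

```shell
# Create a hypothetical Status Change Command script that delays thold
# by 60 seconds per notification, so even 10 hosts exceed a typical
# 5-minute polling cycle.
cat > /tmp/thold_slow.sh <<'EOF'
#!/bin/sh
# Artificial delay to reproduce the bug; does nothing else.
sleep 60
EOF
chmod +x /tmp/thold_slow.sh
```

With 10 hosts recovering at once, thold spends at least 600 seconds in notifications, guaranteeing it is still running when the next poller cycle starts.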
Expected behavior
Hosts for which a down/up notification has already been sent should be marked in the respective tables and skipped on the next polling cycle, preventing such a loop. Also, if thold doesn't finish and is killed off, it should report this in the logs (similar to how the main poller does).
Plugin (please complete the following information):
- Version: 1.8.2
- Source: github
- Identifier: official release