thold does not clear plugin_thold_host_failed when it doesn't have time to finish, causing a loop #737

@mad-ady

Describe the bug
I have a Cacti system with 4000 hosts. A network outage isolated this system, and thold reported all hosts as DOWN. When the network recovered, thold started to report the hosts as up. Because there were so many hosts, and because I run a local script as the Status Change Command for each notification (which increases processing time), it took more than 5 minutes to go through the down hosts and generate the emails. The next poller run kicked in, and the thold process never reached thold_debug('Down device checks finished.');. As a result the cleanup was skipped as well (e.g. it never reached thold_debug('Thold Log Cleanup finished.');). On the next polling cycle thold starts sending the recovered-from-DOWN emails all over again, and it can't recover by itself because it never has time to finish.

To break the loop I had to truncate the table plugin_thold_host_failed.

To Reproduce
Steps to reproduce the behavior:

  1. Configure a bunch of hosts (e.g. 10) for monitoring
  2. Enable Status Change Command and set it to a script that, for instance, runs sleep 60. This artificially slows down the plugin so that its run takes longer than the polling cycle.
  3. Enable email notifications for down/up hosts
  4. Isolate all 10 hosts from the network, to force them down.
  5. Wait for the hosts to be down, then recover the network to force them up.
  6. Observe that thold doesn't finish during a polling cycle and doesn't clear plugin_thold_host_failed, causing an alarm loop every polling cycle.

Expected behavior
Hosts for which a down/up notification has been sent should be marked in the respective tables and skipped on the next polling cycle, to prevent such a loop. Also, if thold doesn't finish and is killed off, this should be reported in the logs (similar to how the main poller does it).
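
To make the expected behavior concrete, here is a minimal sketch of the idea. It is not thold's actual code: thold is written in PHP, and this uses Python with an in-memory SQLite table as a stand-in. The function name process_recovered_hosts, its parameters, the log message, and the assumed single-column layout of plugin_thold_host_failed (a host_id per row) are all hypothetical and only for illustration. The point is that each host's row is cleared right after its notification is sent, and that an overrun is logged instead of passing silently.

```python
import sqlite3
import time


def process_recovered_hosts(conn, notify, deadline, clock=time.monotonic, log=print):
    """Notify recovered hosts one by one, clearing each row as soon as it is handled.

    Returns the number of hosts left unprocessed when the deadline was hit.
    """
    # Assumed (hypothetical) schema: one row per failed host, keyed by host_id.
    host_ids = [row[0] for row in conn.execute(
        "SELECT host_id FROM plugin_thold_host_failed ORDER BY host_id")]

    for index, host_id in enumerate(host_ids):
        if clock() >= deadline:
            remaining = len(host_ids) - index
            # Report the overrun instead of being killed off silently.
            log(f"WARNING: thold ran out of time, {remaining} host(s) left for the next cycle")
            return remaining

        notify(host_id)  # e.g. send the email and run the Status Change Command
        # Clear this host immediately so an interrupted run never repeats it.
        conn.execute("DELETE FROM plugin_thold_host_failed WHERE host_id = ?", (host_id,))
        conn.commit()

    return 0
```

With this structure, a run that is cut short only leaves the hosts it did not reach in the table, so the next polling cycle continues where the previous one stopped rather than re-sending every notification.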

Plugin (please complete the following information):

  • Version: 1.8.2
  • Source: github
  • Identifier: official release
