Skip to content

Consider max error when determining if time is synced#10318

Draft
bnaecker wants to merge 5 commits intomainfrom
ben/include-max-error-in-time-sync
Draft

Consider max error when determining if time is synced#10318
bnaecker wants to merge 5 commits intomainfrom
ben/include-max-error-in-time-sync

Conversation

@bnaecker
Copy link
Copy Markdown
Collaborator

  • Parse more data from chronyc tracking, including the max offset, root delay, and root dispersion. Compute maximum error from the last two, and use that when determining if time is synced.
  • Add new NTP admin server version with extra data
  • Log more details in reconciler so we can see why time isn't synced.

- Parse more data from `chronyc tracking`, including the max offset,
  root delay, and root dispersion. Compute maximum error from the last
  two, and use that when determining if time is synced.
- Add new NTP admin server version with extra data
- Log more details in reconciler so we can see why time isn't synced.
@bnaecker bnaecker marked this pull request as draft April 23, 2026 23:10
@bnaecker
Copy link
Copy Markdown
Collaborator Author

Intended to address #7668

Comment thread ntp-admin/src/http_entrypoints.rs Outdated
Comment thread ntp-admin/src/http_entrypoints.rs Outdated
});
};

let max_error = root_delay / 2.0 + root_dispersion;
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reference for this?

Copy link
Copy Markdown
Collaborator Author

@bnaecker bnaecker Apr 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, this was my guess until I found a better estimate. There does seem to be a reference in the chronyc man pages, under the description of "Root dispersion":

An absolute bound on the computer’s clock accuracy (assuming the stratum-1 computer is correct) is given by:

clock_error <= |system_time_offset| + root_dispersion + (0.5 * root_delay)

I'll update this to use that and include a link.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's also the definition from RFC 5905 and the full algorithm in Section 11.2.3.

My original guess came from the fact that the dispersion is defined in the RFC as the maximum error accumulated along the path from us to the root server; since the RTT is assumed to be symmetric, we can divide by 2 to get the time to get just from us to the root, and then add the maximum error.

I think chrony's value is better than just delay / 2 + dispersion. That is the maximum error on chrony's internal estimate of the true time. But it is constantly updating the local tick frequency on the system to slew the clock so that it matches the upstream time. That means that if you just ask the OS "What time is it?", you see both the errors due to the fact that the clock is currently being slewed plus whatever this error is. I think we need both, but I could use some validation on that point.

Comment thread ntp-admin/api/src/lib.rs
// | example for the next person.
// v
// (next_int, IDENT),
(2, ADD_MAX_ERROR_AND_OFFSET),
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This API is (or is supposed to be) client-side-versioned: #8769

I see it doesn't have a comment to that effect here. Can we do this without a version bump? If it's just extra data for debugging, maybe we can add the data in this release, use a client for the older version in this release, and then bump the client back to "latest" in the next release?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, I forgot to check that. I can add a comment here for the future too.

We can do this without a version bump in a few ways. The easiest would be to not change the type at all, only the definition of sync. I added the data so that clients could log why it wasn't synced, but we can also log it locally to help understand.

I'm not sure I understand the multi-release process. Do you mean:

  • Add this new version, which is backwards compatible
  • In the clients (sled-agent only?), pin to the previous version so we ignore the new fields
  • Release that in rN
  • Update the clients to the new version
  • Release that in rN+1

Is that right?

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants