In a recent upgrade of our monitoring infrastructure, I moved network monitoring off physical hardware and onto virtual machines running on our VMware infrastructure. The migration was a complete success except for one small issue: clock drift.
One of the many data points we monitor on servers and network gear is whether their configured time is in sync with the rest of the infrastructure. This is done by querying each device's current time (usually via NTP) and comparing it to the local monitoring server’s clock (also synced via NTP). If the offset is larger than a threshold, an alert is raised. The status of the NTP servers themselves (number of peers, stratum, and so on) is monitored separately.
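As a rough illustration of that check, here is a minimal Python sketch. The function name `classify_offset` and the 0.5 and 1.0 second thresholds are assumptions made for the example; they are not the values our monitoring actually uses.

```python
def classify_offset(offset_secs: float, warn: float = 0.5, crit: float = 1.0) -> str:
    """Return a Nagios-style state for a measured clock offset.

    The warn/crit thresholds are hypothetical defaults for illustration.
    The sign of the offset is ignored: a clock that is fast is as wrong
    as one that is slow.
    """
    magnitude = abs(offset_secs)
    if magnitude >= crit:
        return "CRITICAL"
    if magnitude >= warn:
        return "WARNING"
    return "OK"
```

Note that the comparison is only as trustworthy as the monitoring server's own clock: if the local clock jumps, every device it checks appears to be off by the same amount.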
The problem was that we would intermittently, usually in the middle of the night, receive a flood of alerts covering every device that a particular monitoring server was watching:
Subject: PROBLEM: NTP time is CRITICAL on host lab-lb-1
Date: Mon, 21 Feb 2022 03:19:20 -0500

PROBLEM: NTP time is CRITICAL on host lab-lb-1

Service: NTP time
Host: lab-lb-1
Alias:
Address: lab-lb-1
Host Group Hierarchy: Opsview > Networking > Lab
State: CRITICAL
Date & Time: Mon Feb 21 03:19:19 UTC 2022

Additional Information:
NTP CRITICAL: Offset -1.490670264 secs
Soon after, we would receive notice that the alarm had cleared:
Subject: RECOVERY: NTP time is OK on host lab-lb-1
Date: Mon, 21 Feb 2022 03:39:20 -0500

RECOVERY: NTP time is OK on host lab-lb-1

Service: NTP time
Host: lab-lb-1
Alias:
Address: lab-lb-1
Host Group Hierarchy: Opsview > Networking > Lab
State: OK
Date & Time: Mon Feb 21 03:39:19 UTC 2022

Additional Information:
NTP OK: Offset -0.0007096529007 secs
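To get a feel for how large and how long an event was, the two notifications can be picked apart with a few lines of Python. This is only a sketch: the helper name `parse_offset` is mine, and the regular expression assumes the "Offset N secs" wording shown in the alerts above.

```python
import re
from email.utils import parsedate_to_datetime

# Matches the "Offset <number> secs" text from the Additional Information line.
OFFSET_RE = re.compile(r"Offset\s+(-?\d+(?:\.\d+)?)\s+secs")


def parse_offset(additional_info: str) -> float:
    """Extract the reported offset, in seconds, from an alert's info line."""
    match = OFFSET_RE.search(additional_info)
    if match is None:
        raise ValueError(f"no offset found in: {additional_info!r}")
    return float(match.group(1))


# Date headers copied from the PROBLEM and RECOVERY emails.
problem_sent = parsedate_to_datetime("Mon, 21 Feb 2022 03:19:20 -0500")
recovery_sent = parsedate_to_datetime("Mon, 21 Feb 2022 03:39:20 -0500")
event_duration = recovery_sent - problem_sent  # time between the two emails
```

For this pair of emails, the gap between the PROBLEM and RECOVERY notifications works out to 20 minutes. That is only an upper bound on how long the clock was actually off, since it also includes the check interval.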
Investigation and a solution
This was quite annoying. I worked with our VMware administrators to identify the source of the problem and to ensure that VMware was not configured to modify the VM’s clock. I confirmed that during an event, the local clock of the affected monitoring server was indeed off by over a second, and that NTP would eventually correct it. We found that disabling vMotion for the monitoring servers helped with the daytime issues, but we were still seeing alert floods in the early morning hours.
I finally got annoyed enough to dig a bit deeper and came up with a solution. VMware published a great whitepaper on timekeeping on VMs, which was well worth my time to read. That said, the real key was this knowledge base entry, which explained there are two types of time corrections (their naming):
- Periodic time sync
- One-off time sync
The first, off by default, runs every minute. The second, on by default, runs “once” during certain events: vMotion, taking or restoring a snapshot, adjusting disk size, and restarting VMware Tools on the VM. With vSphere 7.0U1 or above, both of these options are under the “VMware Tools” settings. We are running an older version, so I had to change these options manually in the “Advanced” configuration settings.
Here are the steps to change these settings using the vSphere client.
Shut down the VM
Either locally halt the machine or choose “Actions”, “Power”, “Power Off” in the vSphere client.
Once the VM is halted, select “Actions”, “Settings…”, and click on the “VM Options” tab. Under “Advanced”, click “Edit Configuration…”.
Per the KB article, we want to change the following seven settings:
time.synchronize.continue = FALSE
time.synchronize.restore = FALSE
time.synchronize.resume.disk = FALSE
time.synchronize.shrink = FALSE
time.synchronize.tools.startup = FALSE
time.synchronize.tools.enable = FALSE
time.synchronize.resume.host = FALSE
In the “Configuration Parameters” dialog, click the “Add Configuration Params” button seven times to give you enough blank fields.
Fill in the setting name and value for each entry, and click “Ok”.
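With seven entries typed in by hand, it is easy to miss one or mistype a name. If you have the VM's configuration available as text, a quick check along these lines can confirm that every setting is present and set to FALSE. This is a minimal sketch, assuming the parameters appear as simple `name = "value"` lines; the helper names `parse_settings` and `unset_sync_settings` are hypothetical.

```python
import re

# The seven settings listed above.
REQUIRED_SETTINGS = (
    "time.synchronize.continue",
    "time.synchronize.restore",
    "time.synchronize.resume.disk",
    "time.synchronize.shrink",
    "time.synchronize.tools.startup",
    "time.synchronize.tools.enable",
    "time.synchronize.resume.host",
)

# Assumed line format: name = "value" (the quotes are treated as optional).
_LINE_RE = re.compile(r'^\s*([\w.]+)\s*=\s*"?([^"]*?)"?\s*$')


def parse_settings(text: str) -> dict[str, str]:
    """Collect name/value pairs from configuration text, skipping other lines."""
    settings = {}
    for line in text.splitlines():
        match = _LINE_RE.match(line)
        if match:
            settings[match.group(1)] = match.group(2)
    return settings


def unset_sync_settings(text: str) -> list[str]:
    """Return the required settings that are missing or not set to FALSE."""
    settings = parse_settings(text)
    return [
        name
        for name in REQUIRED_SETTINGS
        if settings.get(name, "").upper() != "FALSE"
    ]
```

An empty list from `unset_sync_settings` means all seven settings are in place; anything else names the entries to go back and fix.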
Boot the VM
With the new settings, power on the VM.
It has been almost a week since I made these changes, and we haven’t had any false NTP alerts from these monitoring servers. It is possible that this change is masking an underlying problem, so I have some time scheduled with a VMware administrator to do additional investigation and testing. For now, I feel better knowing that VMware isn’t mucking with the local clock and that the servers rely on NTP alone to keep their time synchronized.