Ops Notes

NVIDIA Driver/Library Mismatch: Don't Reboot, Try These 5 Fixes First

InfraOps Router · AI & ML Infrastructure

The Symptom: nvidia-smi Just Dies

3 AM. PagerDuty screaming. Every GPU node on our training cluster throwing the same error:

Failed to initialize NVML: Driver/library version mismatch

nvidia-smi is dead in the water. But lsmod | grep nvidia shows the kernel module is still loaded. Classic.

This exact thing has bitten me at least 5 times in the last year. The culprit is almost always the same: unattended-upgrades or a manual apt upgrade silently bumped nvidia-kernel-dkms or libnvidia-compute-*, but the kernel module already loaded in memory was never swapped out (or, less often, its DKMS rebuild never ran). So now the running kernel module (the driver) and the user-space libraries report different versions, and NVML refuses to initialize.

Root Cause: Who Broke It?

Two scenarios, and they’re both stupid:

  1. Auto-upgrades: Ubuntu/Debian’s unattended-upgrades quietly updates NVIDIA packages, but the old kernel module stays loaded, and the DKMS rebuild does not always run or succeed. This is the #1 cause, especially if you have security updates set to auto-install.
  2. Install method conflict: You installed via .run file and then used apt. Or vice versa. The two methods leave artifacts that step on each other.

Worst case I saw: someone ran apt install nvidia-driver-550 while a 545 .run install was still partially alive. nvidia-smi called the new library, but the kernel module was still the old one.

The Fix: No Reboot Required

Important caveat: if you can reboot, just do it. sudo reboot is the simplest fix. But for production, remote machines, or when you’re feeling stubborn, try this.

Step 1: Check the Version Gap

First, figure out exactly how far apart things are:

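# Version of the kernel module that is currently loaded
cat /proc/driver/nvidia/version

# Version of the kernel module on disk (what modprobe would load)
modinfo -F version nvidia

# Versions of the installed user-space packages
dpkg -l | grep nvidia

# Or ask nvidia-smi itself (newer drivers only; may not work while the mismatch is active)
nvidia-smi --version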

If the kernel module says 550.90.07 and the libraries say 550.120, you’ve got a mismatch.
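The comparison can also be scripted. This is a minimal sketch, not a vetted tool: extract_driver_version is a made-up helper name, and it assumes the version in /proc/driver/nvidia/version is the first dotted number on the first line (e.g. 550.90.07) and that modinfo can find the nvidia module on disk.

```shell
#!/bin/sh
# Sketch: compare the loaded NVIDIA kernel module with the one on disk.
# extract_driver_version is a hypothetical helper, not part of any NVIDIA tool.

# Print the first dotted version number (e.g. 550.90.07) found on stdin.
extract_driver_version() {
    grep -oE '[0-9]+\.[0-9]+(\.[0-9]+)?' | head -n 1
}

# Only run the comparison on a machine where the NVIDIA module is loaded.
if [ -r /proc/driver/nvidia/version ]; then
    loaded=$(extract_driver_version < /proc/driver/nvidia/version)
    on_disk=$(modinfo -F version nvidia 2>/dev/null)
    echo "loaded module:  ${loaded:-unknown}"
    echo "module on disk: ${on_disk:-unknown}"
    if [ -z "$loaded" ] || [ -z "$on_disk" ]; then
        echo "could not determine one of the versions"
    elif [ "$loaded" = "$on_disk" ]; then
        echo "loaded module matches the one on disk; compare against dpkg -l next"
    else
        echo "loaded module differs from the one on disk; a module reload should pick up $on_disk"
    fi
fi
```

If the two versions differ, the new module is already built and a reload (Step 3) is likely enough. If they match but the packages are newer, the DKMS build probably never happened.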

Step 2: Kill Stale Processes (This Is Key)

Any process that still has /dev/nvidia* open keeps the old kernel module pinned, so it cannot be unloaded. You need to nuke them all.

# Find everything using NVIDIA devices
sudo lsof /dev/nvidia*

# Or list the same processes with fuser
sudo fuser -v /dev/nvidia*

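You’ll see python, Xorg, gnome-shell, nvidia-persistenced, etc. Kill them (try a plain kill first and reach for -9 only if the process ignores it):

sudo kill -9 <PID>

If you’re running training jobs, stop them first. Services such as Xorg or nvidia-persistenced may be respawned by their service manager, so stop those with systemctl instead of killing them. This step is not optional: skip it and nothing else will work.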
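Killing with -9 straight away gives training jobs no chance to flush checkpoints. A gentler pattern is sketched below: send SIGTERM, wait a grace period, then escalate. stop_pid is a hypothetical helper name, and the 10-second default is an arbitrary assumption, not a recommendation from any NVIDIA documentation.

```shell
#!/bin/sh
# Sketch: stop a process politely, escalate to SIGKILL only if needed.
# stop_pid is a hypothetical helper name.

# usage: stop_pid <pid> [grace_seconds]
stop_pid() {
    pid=$1
    grace=${2:-10}
    # Ask nicely first; if the signal cannot be sent, the process is already gone.
    kill -TERM "$pid" 2>/dev/null || return 0
    waited=0
    while [ "$waited" -lt "$grace" ]; do
        kill -0 "$pid" 2>/dev/null || return 0   # process has exited
        sleep 1
        waited=$((waited + 1))
    done
    # Still alive after the grace period: force it.
    kill -KILL "$pid" 2>/dev/null
    return 0
}

# Example, commented out on purpose because it stops every GPU user.
# Run as root; fuser prints the PIDs on stdout and the rest on stderr.
# for p in $(fuser /dev/nvidia* 2>/dev/null); do
#     stop_pid "$p" 15
# done
```

Re-run sudo lsof /dev/nvidia* afterwards; an empty result means the module is free to unload.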

Step 3: Unload and Reload the Kernel Module

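# Unload in dependency order (dependents first, core module last)
sudo rmmod nvidia_uvm
sudo rmmod nvidia_drm
sudo rmmod nvidia_modeset
sudo rmmod nvidia

# Reload (this picks up the new version from disk)
sudo modprobe nvidia
sudo modprobe nvidia_uvm   # needed for CUDA workloads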

If rmmod complains about the module being in use, you didn’t kill everything. Go back to Step 2.
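On headless nodes some of these modules (nvidia_drm, nvidia_modeset) may not be loaded at all, and rmmod on a missing module just adds noise. The sketch below unloads only what /proc/modules lists and stops at the first failure. The function names loaded_nvidia_modules and reload_nvidia are made up for this example, and it assumes the four module names used above are the only NVIDIA modules on the box.

```shell
#!/bin/sh
# Sketch: unload only the NVIDIA modules that are actually loaded.
# loaded_nvidia_modules and reload_nvidia are hypothetical helper names.

# Print loaded NVIDIA modules in a safe unload order (dependents first).
# The optional argument is a file in /proc/modules format, handy for testing.
loaded_nvidia_modules() {
    modules_file=${1:-/proc/modules}
    for mod in nvidia_uvm nvidia_drm nvidia_modeset nvidia; do
        # Each /proc/modules line starts with the module name followed by a space.
        if grep -q "^${mod} " "$modules_file"; then
            echo "$mod"
        fi
    done
}

# Unload each module in turn, stop at the first failure, then reload. Run as root.
reload_nvidia() {
    for m in $(loaded_nvidia_modules); do
        if ! rmmod "$m"; then
            echo "rmmod $m failed; something may still be using the GPU" >&2
            return 1
        fi
    done
    modprobe nvidia
}
```

Calling reload_nvidia as root performs the unload and reload in one go; on failure it leaves the remaining modules loaded so you can go back to Step 2.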

Step 4: Rebuild initramfs (Future-Proofing)

This prevents the same issue after your next reboot. The old initramfs may still contain the previous module version.

sudo update-initramfs -u

Step 5: Verify

nvidia-smi

Still broken? Time for the nuclear option.

The Nuclear Option: Full Reinstall

# 1. Purge everything NVIDIA (add -s first for a dry run that only prints what would be removed)
sudo apt purge "nvidia*" "libnvidia*"

# 2. If you ever used a .run file, uninstall that too
sudo /path/to/NVIDIA-Linux-*.run --uninstall

# 3. Clean up
sudo apt autoremove
sudo apt autoclean

# 4. Fresh install
sudo apt update
sudo apt install nvidia-driver-550

# 5. Reboot
sudo reboot

Never mix .run files with apt. Pick one method and stick with it. I use apt because updates are cleaner.

Fix Comparison

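| Method                          | Reboot Required? | Risk Level                 | Best For              | Success Rate      |
|---------------------------------|------------------|----------------------------|-----------------------|-------------------|
| Kill processes + reload module  | No               | Medium (task interruption) | Production, remote    | ~70%              |
| Full purge + reinstall          | Yes              | Low                        | Dev machines, testing | ~95%              |
| Version locking (apt-mark hold) | No               | Low                        | Prevention            | 100% (prevention) |
| Manual DKMS rebuild             | No               | High                       | Post-kernel-upgrade   | ~80%              |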

Prevention: Lock It Down

I learned this the hard way. Now every GPU server in our fleet has this:

sudo apt-mark hold nvidia-driver-550 libnvidia-compute-550 nvidia-kernel-dkms

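This stops unattended-upgrades from silently bumping your driver. Package names vary by distro (Ubuntu names the DKMS package nvidia-dkms-550, Debian uses nvidia-kernel-dkms), so check dpkg -l | grep nvidia and hold the names that are actually installed. When you want to upgrade, run apt-mark unhold first, then update manually.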
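A hold only helps if it covers every NVIDIA package on the machine. The following sketch lists installed NVIDIA packages that are not held yet. unheld_packages is a hypothetical helper name, and the nvidia* and libnvidia* patterns are an assumption about how the packages on your distro are named.

```shell
#!/bin/sh
# Sketch: list installed NVIDIA packages that are not on hold.
# unheld_packages is a hypothetical helper name.

# usage: unheld_packages <installed_list_file> <held_list_file>
# Each file holds one package name per line.
# Prints the installed names that are missing from the held list.
unheld_packages() {
    grep -vxF -f "$2" "$1"
}

# Only run the real check on a Debian/Ubuntu system.
if command -v dpkg-query >/dev/null 2>&1 && command -v apt-mark >/dev/null 2>&1; then
    installed=$(mktemp)
    held=$(mktemp)
    # Keep only packages whose dpkg status ends in "installed".
    dpkg-query -W -f='${Status} ${Package}\n' 'nvidia*' 'libnvidia*' 2>/dev/null \
        | awk '$3 == "installed" { print $4 }' > "$installed"
    apt-mark showhold > "$held"
    unheld_packages "$installed" "$held" \
        || echo "nothing to hold: all installed NVIDIA packages are held, or none are installed"
    rm -f "$installed" "$held"
fi
```

Review the printed names before passing them to sudo apt-mark hold; holding a metapackage without its dependencies (or the reverse) can leave apt in a state where manual upgrades need extra unhold steps.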

FAQ

How to fix nvidia graphics driver issue?

First, determine if it’s a driver or hardware issue. Run sudo dmesg | grep -iE "nvidia|nvrm" (the driver logs under the NVRM prefix, which a plain grep nvidia misses). For mismatch errors, follow the steps above. If you see NVRM: failed to initialize, it’s usually a kernel/driver compatibility problem.

How to fix NVIDIA driver not compatible?

Incompatibility usually shows up after kernel upgrades, for example kernel 6.8 with the NVIDIA 545 driver. Options: downgrade the kernel, upgrade the driver to 550+, or force a DKMS rebuild with sudo dkms install -m nvidia -v <version>.

How do I fix NVIDIA driver installer error?

Common installer errors: “You appear to be running an X server” (kill X or switch to runlevel 3), “Unable to find the kernel source tree” (install linux-headers-$(uname -r)), “CC version check failed” (wrong gcc version). Add --no-opengl-files to the .run installer to avoid conflicts.

Has NVIDIA fixed their driver issues?

No, and they probably never will completely. NVIDIA’s driver lives outside the kernel tree and is largely closed-source, so kernel API changes regularly break it. The open-source nouveau driver is stable but slow. The 550 series is decent, but 545 and 535 both had nasty bugs. Stick with LTS kernel + LTS driver combos.

Final Thoughts

This mismatch problem is a fundamental issue with closed-source drivers in the Linux ecosystem. NVIDIA’s update strategy is frankly garbage — every new version fixes old bugs and introduces new ones. I’ve seen a single unattended-upgrades default config take down a production cluster for 3 hours.

My advice: lock your driver version, update manually on your schedule. For AI training clusters, stability beats freshness every time.

And if you’re stuck on a remote machine without reboot access? The kill-processes + reload-module combo in Steps 2-3 works about 70% of the time. The other 30%? You’re calling your ops team to hit the power button.