The Symptom: nvidia-smi Just Dies
3 AM. PagerDuty screaming. Every GPU node on our training cluster throwing the same error:
Failed to initialize NVML: Driver/library version mismatch
nvidia-smi is dead in the water. But lsmod | grep nvidia shows the kernel module is still loaded. Classic.
This exact thing has bitten me at least 5 times in the last year. The culprit is almost always the same: unattended-upgrades or a manual apt upgrade silently bumped nvidia-kernel-dkms or libnvidia-compute-*, but the kernel module already loaded in memory never got rebuilt and reloaded. So now the running kernel module (the driver) and the user-space libraries on disk are at different versions, and NVML refuses to talk to the driver.
Root Cause: Who Broke It?
Two scenarios, and they’re both stupid:
- Auto-upgrades: Ubuntu/Debian’s unattended-upgrades quietly updates NVIDIA packages but doesn’t trigger a DKMS rebuild. This is the #1 cause, especially if you have security updates set to auto-install.
- Install method conflict: You installed via a .run file and then used apt. Or vice versa. The two methods leave artifacts that step on each other.
Worst case I saw: someone ran apt install nvidia-driver-550 while a 545 .run install was still partially alive. nvidia-smi called the new library, but the kernel module was still the old one.
The Fix: No Reboot Required
Important caveat: if you can reboot, just do it. sudo reboot is the simplest fix. But for production, remote machines, or when you’re feeling stubborn, try this.
Step 1: Check the Version Gap
First, figure out exactly how far apart things are:
# See the loaded kernel module version
cat /proc/driver/nvidia/version
# See user-space library versions
dpkg -l | grep nvidia
# Or this
nvidia-smi --version
If kernel module says 550.90.07 and libraries say 550.120, you’ve got a mismatch.
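The comparison in Step 1 can be scripted. Below is a minimal sketch, assuming the module is named nvidia and that the on-disk module reported by modinfo has already been rebuilt to match the upgraded libraries. The helper names nvrm_version and check_gap are made up for illustration.

```shell
#!/usr/bin/env bash
# Sketch: compare the loaded NVIDIA kernel module with the one installed on disk.
# nvrm_version and check_gap are illustrative helper names, not NVIDIA tooling.

# Extract the driver version from /proc/driver/nvidia/version-style text on stdin.
nvrm_version() {
  grep -m1 '^NVRM version:' | grep -oE '[0-9]+\.[0-9]+(\.[0-9]+)?' | head -n1
}

check_gap() {
  if [ ! -r /proc/driver/nvidia/version ]; then
    echo "nvidia kernel module is not loaded"
    return 0
  fi
  loaded=$(nvrm_version < /proc/driver/nvidia/version)
  # modinfo reads the module file on disk, i.e. what the next modprobe would load
  ondisk=$(modinfo -F version nvidia 2>/dev/null)
  echo "loaded:  ${loaded:-unknown}"
  echo "on disk: ${ondisk:-unknown}"
  if [ -n "$loaded" ] && [ "$loaded" = "$ondisk" ]; then
    echo "kernel module is current"
  else
    echo "mismatch: the loaded module is stale (or the on-disk version is unknown)"
  fi
}

check_gap
```

If the two versions differ, the module in memory is the stale half of the pair, which is exactly what Steps 2 and 3 are meant to fix.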
Step 2: Kill Stale Processes (This Is Key)
Processes holding old library handles will block the new version. You need to nuke them.
# Find everything using NVIDIA devices
sudo lsof /dev/nvidia*
# Or get a per-process listing (user, PID, access type, command)
sudo fuser -v /dev/nvidia*
You’ll see python, Xorg, gnome-shell, etc. Kill them:
sudo kill -9 <PID>
If you’re running training jobs, stop them first. This step is not optional — skip it and nothing else will work.
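Reaching straight for kill -9 gives a training job no chance to flush a checkpoint. A gentler variant, sketched below, sends SIGTERM first and only escalates to SIGKILL after a timeout. stop_gracefully and stop_nvidia_users are hypothetical helper names, and the 10-second default is an arbitrary assumption.

```shell
#!/usr/bin/env bash
# Sketch: ask a process to exit with SIGTERM, escalate to SIGKILL after a timeout.
# stop_gracefully / stop_nvidia_users are illustrative names; 10 s is an arbitrary default.

stop_gracefully() {
  target=$1
  timeout=${2:-10}
  # If the signal cannot be delivered, the process is already gone (or not ours).
  kill -TERM "$target" 2>/dev/null || return 0
  waited=0
  while kill -0 "$target" 2>/dev/null; do
    if [ "$waited" -ge "$timeout" ]; then
      kill -KILL "$target" 2>/dev/null
      break
    fi
    sleep 1
    waited=$((waited + 1))
  done
  return 0
}

# Stop every process that has an NVIDIA device node open (run as root).
stop_nvidia_users() {
  # fuser prints the PIDs on stdout; file names and access flags go to stderr
  for p in $(fuser /dev/nvidia* 2>/dev/null); do
    stop_gracefully "$p"
  done
}
```

Nothing runs until you call stop_nvidia_users yourself, so you can source the file on a production box and still review the fuser listing first.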
Step 3: Unload and Reload the Kernel Module
# Unload in dependency order
sudo rmmod nvidia_uvm
sudo rmmod nvidia_drm
sudo rmmod nvidia_modeset
sudo rmmod nvidia
# Reload (this picks up the new version)
sudo modprobe nvidia
If rmmod complains about the module being in use, you didn’t kill everything. Go back to Step 2.
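On a headless node nvidia_drm or nvidia_uvm may not be loaded at all, and rmmod on a missing module just adds noise. Here is a sketch that only touches what /proc/modules lists, in the same dependency order as above. loaded_nvidia_modules and reload_nvidia are hypothetical names.

```shell
#!/usr/bin/env bash
# Sketch: unload only the NVIDIA modules that are actually loaded, in dependency order.
# loaded_nvidia_modules / reload_nvidia are illustrative names.

# Print loaded NVIDIA modules, dependents first. Optional arg: a /proc/modules-style file.
loaded_nvidia_modules() {
  modules_file=${1:-/proc/modules}
  for mod in nvidia_uvm nvidia_drm nvidia_modeset nvidia; do
    # Each /proc/modules line starts with the module name followed by a space
    if grep -q "^${mod} " "$modules_file" 2>/dev/null; then
      echo "$mod"
    fi
  done
}

# Unload, then load the freshly installed module (run as root).
reload_nvidia() {
  for mod in $(loaded_nvidia_modules); do
    rmmod "$mod" || { echo "$mod is still in use; go back to Step 2" >&2; return 1; }
  done
  modprobe nvidia
}
```

The function stops at the first rmmod failure instead of ploughing on, so a busy module leaves the remaining ones loaded and the node in a known state.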
Step 4: Rebuild initramfs (Future-Proofing)
This prevents the same issue after your next reboot. The existing initramfs may still contain a copy of the old module, so regenerate it.
sudo update-initramfs -u
Step 5: Verify
nvidia-smi
Still broken? Time for the nuclear option.
The Nuclear Option: Full Reinstall
# 1. Purge everything NVIDIA (add -s first if you want a dry run; with -s nothing is removed)
sudo apt purge "nvidia*" "libnvidia*"
# 2. If you ever used a .run file, uninstall that too
sudo /path/to/NVIDIA-Linux-*.run --uninstall
# 3. Clean up
sudo apt autoremove
sudo apt autoclean
# 4. Fresh install
sudo apt update
sudo apt install nvidia-driver-550
# 5. Reboot
sudo reboot
Never mix .run files with apt. Pick one method and stick with it. I use apt because updates are cleaner.
Fix Comparison
| Method | Reboot Required? | Risk Level | Best For | Success Rate |
|---|---|---|---|---|
| Kill processes + reload module | No | Medium (task interruption) | Production, remote | ~70% |
| Full purge + reinstall | Yes | Low | Dev machines, testing | ~95% |
| Version locking (apt-mark hold) | No | Low | Prevention | 100% (prevention) |
| Manual DKMS rebuild | No | High | Post-kernel-upgrade | ~80% |
Prevention: Lock It Down
I learned this the hard way. Now every GPU server in our fleet has this:
sudo apt-mark hold nvidia-driver-550 libnvidia-compute-550 nvidia-kernel-dkms
This stops unattended-upgrades from silently bumping your driver. When you want to upgrade, apt-mark unhold first, then update manually.
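Holds get lost when a machine is reimaged or someone forgets to re-hold after an upgrade, so it helps to audit them. Here is a small sketch built on apt-mark showhold. missing_holds is a hypothetical helper, and the package list simply mirrors the command above.

```shell
#!/usr/bin/env bash
# Sketch: report which expected packages are NOT on hold.
# missing_holds is an illustrative name; it reads `apt-mark showhold` output on stdin.

missing_holds() {
  held=$(cat)
  for pkg in "$@"; do
    # -x: match the whole line, -F: treat the package name as a fixed string
    printf '%s\n' "$held" | grep -qxF -- "$pkg" || echo "$pkg"
  done
}

# Only run the audit where apt-mark exists (Debian/Ubuntu).
if command -v apt-mark >/dev/null 2>&1; then
  apt-mark showhold | missing_holds nvidia-driver-550 libnvidia-compute-550 nvidia-kernel-dkms
fi
```

Anything it prints is a package that unattended-upgrades is still free to bump. An empty result means every listed package is held.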
FAQ
How do I fix an NVIDIA graphics driver issue?
First, determine if it’s a driver or hardware issue. Run dmesg | grep -iE 'nvidia|nvrm' (the driver logs under the NVRM prefix, which a plain grep for nvidia can miss). For mismatch errors, follow the steps above. If you see NVRM: failed to initialize, it’s usually a kernel/driver compatibility problem.
How do I fix an incompatible NVIDIA driver?
Incompatibility usually shows up after a kernel upgrade. For example, kernel 6.8 with the NVIDIA 545 driver. Options: downgrade the kernel, upgrade the driver to 550+, or force a DKMS rebuild with sudo dkms install -m nvidia -v <version>.
How do I fix NVIDIA driver installer errors?
Common installer errors: “You appear to be running an X server” (kill X or switch to runlevel 3), “Unable to find the kernel source tree” (install linux-headers-$(uname -r)), “CC version check failed” (wrong gcc version). Add --no-opengl-files to the .run installer to avoid conflicts.
Has NVIDIA fixed their driver issues?
No, and they probably never will completely. NVIDIA’s driver is closed-source, so every kernel API change breaks things. The open-source nouveau driver is stable but slow. The 550 series is decent, but 545 and 535 both had nasty bugs. Stick with LTS kernel + LTS driver combos.
Final Thoughts
This mismatch problem is a fundamental issue with closed-source drivers in the Linux ecosystem. NVIDIA’s update strategy is frankly garbage — every new version fixes old bugs and introduces new ones. I’ve seen a single unattended-upgrades default config take down a production cluster for 3 hours.
My advice: lock your driver version, update manually on your schedule. For AI training clusters, stability beats freshness every time.
And if you’re stuck on a remote machine without reboot access? The kill-processes + reload-module combo in Steps 2-3 works about 70% of the time. The other 30%? You’re calling your ops team to hit the power button.