Doctrine: Fixing NVIDIA 580 NVML Mismatch and Blacklist Trap on Ubuntu 24.04
This doctrine captures a real-world failure pattern on Ubuntu 24.04 with the NVIDIA 580 driver: a version mismatch between NVML and the kernel module, followed by a hidden blacklist that prevents the driver from loading at all. The end state is a clean, aligned stack: 580.126.09 kernel module + 580.126.09 NVML, with the GPU fully online.
- nvidia-smi initially reports: Failed to initialize NVML: Driver/library version mismatch
- Later, after partial fixes: NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver
- /proc/driver/nvidia/version either shows an older version (580.95.05) or does not exist at all.
1. Initial symptom: NVML mismatch
1.1 The error
nvidia-smi
Failed to initialize NVML: Driver/library version mismatch
NVML library version: 580.126
Check the kernel module version:
cat /proc/driver/nvidia/version
Example problematic output:
NVRM version: NVIDIA UNIX Open Kernel Module for x86_64 580.95.05
GCC version: gcc version 13.3.0 (Ubuntu 13.3.0-6ubuntu2~24.04)
1.2 Diagnosis
- NVML (user-space) version: 580.126
- Kernel module version: 580.95.05
NVML and the kernel module must match exactly. Any difference in the version string (here 580.95.05 in the kernel versus 580.126 in user space) causes NVML to refuse initialization and report the mismatch error.
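The comparison described above can be scripted. What follows is a minimal sketch, not a definitive tool: it assumes the user-space library is installed as libnvidia-ml.so.<version> under /usr/lib/x86_64-linux-gnu (an assumption about the library location), and the function names are made up for illustration.

```shell
# Print the driver version found in text shaped like /proc/driver/nvidia/version.
nvrm_version() {
    grep -m1 '^NVRM version:' | grep -oE '[0-9]+\.[0-9]+(\.[0-9]+)?' | head -n1
}

# Print the version suffix of a libnvidia-ml.so.<version> file name.
nvml_file_version() {
    basename "$1" | sed -n 's/^libnvidia-ml\.so\.\([0-9][0-9.]*\)$/\1/p'
}

# Succeed only when both version strings are non-empty and identical.
versions_match() {
    [ -n "$1" ] && [ "$1" = "$2" ]
}

# Example run; prints nothing when the driver is not loaded.
# The library directory below is an assumption; adjust it for your system.
if [ -r /proc/driver/nvidia/version ]; then
    kernel_ver=$(nvrm_version < /proc/driver/nvidia/version)
    for lib in /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.*.*; do
        [ -e "$lib" ] || continue
        user_ver=$(nvml_file_version "$lib")
        if versions_match "$kernel_ver" "$user_ver"; then
            echo "aligned: $kernel_ver"
        else
            echo "mismatch: kernel=$kernel_ver user-space=$user_ver"
        fi
    done
fi
```

Reading the version from the library file name avoids calling nvidia-smi, which cannot initialize NVML while the mismatch exists.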
2. Confirming the installed driver and DKMS state
2.1 Check DKMS
dkms status
Example:
nvidia/580.126.09, 6.14.0-37-generic, x86_64: installed
This means DKMS successfully built the 580.126.09 module for the current kernel (6.14.0-37-generic), but it does not guarantee the kernel is actually loading it.
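To avoid reading the DKMS output by eye, the check can be wrapped in a small helper. This is a sketch that assumes the `dkms status` line format shown in the example above (older DKMS releases print a different layout); the function name dkms_installed_for is hypothetical.

```shell
# Succeed if `dkms status` text on stdin reports the nvidia module as
# installed for the given driver version ($1) and kernel release ($2).
# Assumes lines shaped like: nvidia/<version>, <kernel>, <arch>: installed
dkms_installed_for() {
    awk -v want="nvidia/$1, $2," '
        index($0, want) == 1 && /: installed/ { found = 1 }
        END { exit !found }
    '
}

# Example run against the live system, only if dkms is available.
# 580.126.09 is the version used in this article; adjust as needed.
if command -v dkms >/dev/null 2>&1; then
    if dkms status 2>/dev/null | dkms_installed_for 580.126.09 "$(uname -r)"; then
        echo "DKMS build present for $(uname -r)"
    else
        echo "no installed DKMS build for $(uname -r)"
    fi
fi
```

A positive result still only says the module was built and installed on disk; whether the kernel loads it is a separate question.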
2.2 Check the module on disk
modinfo nvidia | grep -E '^(filename|version):'
Example:
filename: /lib/modules/6.14.0-37-generic/updates/dkms/nvidia.ko.zst
version: 580.126.09
2.3 Check the driver package
dpkg -l | grep -E 'nvidia-driver|linux-modules-nvidia'
Example:
ii nvidia-driver-580 580.126.09-0ubuntu0.24.04.1 amd64 NVIDIA driver metapackage
At this point, the correct driver is installed and the correct module exists on disk, but the kernel either loads the wrong one or none at all.
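The three possible outcomes (nothing loaded, an old module loaded, or everything aligned) can be told apart by comparing the on-disk version with the loaded one. A minimal sketch, assuming the loaded module exposes its version in /sys/module/nvidia/version; the function name and the state labels are illustrative only.

```shell
# Classify the driver state from two version strings:
#   $1 = version of the module on disk (e.g. from `modinfo -F version nvidia`)
#   $2 = version of the loaded module (empty when nothing is loaded)
driver_state() {
    if [ -z "$2" ]; then
        echo "not-loaded"
    elif [ "$1" = "$2" ]; then
        echo "aligned"
    else
        echo "stale-module-loaded"
    fi
}

# Example run. /sys/module/nvidia/version exists only while the module is loaded.
if command -v modinfo >/dev/null 2>&1; then
    on_disk=$(modinfo -F version nvidia 2>/dev/null || true)
    loaded=""
    if [ -r /sys/module/nvidia/version ]; then
        loaded=$(cat /sys/module/nvidia/version)
    fi
    echo "on disk: ${on_disk:-none}, state: $(driver_state "$on_disk" "$loaded")"
fi
```

A "stale-module-loaded" result corresponds to the NVML mismatch in section 1, while "not-loaded" corresponds to the "driver not running" state.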
3. Transition: from mismatch to “driver not running”
After purging/reinstalling or other partial fixes, the system may move from a mismatch to a “driver not running” state:
nvidia-smi
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver.
Make sure that the latest NVIDIA driver is installed and running.
cat /proc/driver/nvidia/version
cat: /proc/driver/nvidia/version: No such file or directory
This means the NVIDIA kernel module is not loaded at all.
Check:
lsmod | grep -i nvidia
If this returns nothing, the module is not loaded.
4. The real culprit: modprobe alias to “off” (blacklist)
4.1 Attempt to load the module
sudo modprobe nvidia
Example error:
modprobe: ERROR: ../libkmod/libkmod-module.c:968 kmod_module_insert_module() could not find module by name='off'
modprobe: ERROR: could not insert 'off': Unknown symbol in module, or unknown parameter (see dmesg)
This is the key: something told modprobe to load a fake module called off instead of the real nvidia module.
4.2 Locate the blacklist/alias rules
grep -R "alias nvidia" /etc/modprobe.d /usr/lib/modprobe.d
grep -R "install nvidia" /etc/modprobe.d /usr/lib/modprobe.d
Example output:
/etc/modprobe.d/zz-nvidia-unblacklist.conf:alias nvidia nvidia
/etc/modprobe.d/zz-nvidia-unblacklist.conf:alias nvidia_drm nvidia_drm
/etc/modprobe.d/zz-nvidia-unblacklist.conf:alias nvidia_modeset nvidia_modeset
/etc/modprobe.d/zz-nvidia-unblacklist.conf:alias nvidia_uvm nvidia_uvm
/usr/lib/modprobe.d/blacklist-nvidia.conf:alias nvidia off
/usr/lib/modprobe.d/blacklist-nvidia.conf:alias nvidia-drm off
/usr/lib/modprobe.d/blacklist-nvidia.conf:alias nvidia-modeset off
The distro-provided file /usr/lib/modprobe.d/blacklist-nvidia.conf is explicitly blocking NVIDIA:
alias nvidia off
alias nvidia-drm off
alias nvidia-modeset off
This forces any attempt to load NVIDIA modules to instead try to load a non-existent module named off, which fails and leaves the GPU driver unloaded.
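Since the trap is any alias that points at off, not just the three NVIDIA ones, a broader search can be useful. The following is a sketch under the assumption that only files ending in .conf matter (modprobe ignores other file names in these directories); find_off_aliases is a hypothetical helper name.

```shell
# Print every active (non-comment) "alias <name> off" line found in *.conf
# files directly under the given directories, as file:line:content.
find_off_aliases() {
    for dir in "$@"; do
        [ -d "$dir" ] || continue
        grep -HnE '^[[:space:]]*alias[[:space:]]+[^[:space:]]+[[:space:]]+off[[:space:]]*$' \
            "$dir"/*.conf 2>/dev/null || true
    done
    return 0
}

# Scan the two directories used elsewhere in this article.
find_off_aliases /etc/modprobe.d /usr/lib/modprobe.d
```

Commented-out lines and renamed files such as blacklist-nvidia.conf.disabled are deliberately skipped, so an empty result after the fix suggests no such alias remains active.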
Ubuntu 24.04 ships both open kernel modules and proprietary DKMS modules for NVIDIA. This blacklist is used in some configurations to disable the proprietary modules. When you explicitly install nvidia-driver-580, this blacklist becomes incorrect and must be neutralized.
5. The fix: un-blacklist NVIDIA and rebuild initramfs
5.1 Add an explicit un-blacklist override (optional but clean)
If not already present, create an override in /etc/modprobe.d:
echo "alias nvidia nvidia" | sudo tee /etc/modprobe.d/zz-nvidia-unblacklist.conf
echo "alias nvidia_drm nvidia_drm" | sudo tee -a /etc/modprobe.d/zz-nvidia-unblacklist.conf
echo "alias nvidia_modeset nvidia_modeset" | sudo tee -a /etc/modprobe.d/zz-nvidia-unblacklist.conf
echo "alias nvidia_uvm nvidia_uvm" | sudo tee -a /etc/modprobe.d/zz-nvidia-unblacklist.conf
5.2 Disable the distro blacklist file
This is the decisive step: neutralize the file that aliases NVIDIA to off.
sudo mv /usr/lib/modprobe.d/blacklist-nvidia.conf \
/usr/lib/modprobe.d/blacklist-nvidia.conf.disabled
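If the rename is going to be repeated (for example in a provisioning script), an idempotent wrapper is handy. This is a sketch only; disable_modprobe_conf is a hypothetical name, and it relies on modprobe reading only files that end in .conf.

```shell
# Rename a modprobe config file to <name>.disabled so modprobe ignores it.
# Prints what it did; does nothing if the file is already absent.
disable_modprobe_conf() {
    conf=$1
    if [ -e "$conf" ]; then
        mv -- "$conf" "$conf.disabled" && echo "disabled: $conf"
    else
        echo "nothing to do: $conf not found"
    fi
}

# Usage, run as root because the file lives under /usr/lib:
#   disable_modprobe_conf /usr/lib/modprobe.d/blacklist-nvidia.conf
```

Before renaming, `dpkg -S /usr/lib/modprobe.d/blacklist-nvidia.conf` can show whether a package owns the file; if one does, a later upgrade of that package may put the file back.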
5.3 Rebuild initramfs
sudo update-initramfs -u
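The rebuild matters on the assumption that the initramfs carries its own copy of the modprobe.d configuration, so a stale image could still apply the blacklist during early boot. A hedged sketch for checking the rebuilt image, assuming lsinitramfs (from initramfs-tools) is installed and the image is at /boot/initrd.img-<kernel release>; the helper name is made up.

```shell
# Succeed if a file listing on stdin (one path per line, as printed by
# lsinitramfs) still contains a modprobe.d/blacklist-nvidia.conf entry.
listing_has_nvidia_blacklist() {
    grep -qE '(^|/)modprobe\.d/blacklist-nvidia\.conf$'
}

# Example run, only where lsinitramfs and the image are available.
# The image path is an assumption based on the usual Ubuntu naming.
image="/boot/initrd.img-$(uname -r)"
if command -v lsinitramfs >/dev/null 2>&1 && [ -r "$image" ]; then
    if lsinitramfs "$image" | listing_has_nvidia_blacklist; then
        echo "blacklist still inside $image"
    else
        echo "no nvidia blacklist inside $image"
    fi
fi
```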
5.4 Reboot
sudo reboot
6. Post-fix verification
6.1 Check loaded modules
lsmod | grep -i nvidia
Expected: entries for nvidia, nvidia_drm, nvidia_modeset, nvidia_uvm.
6.2 Check nvidia-smi
nvidia-smi
Example healthy output:
| NVIDIA-SMI 580.126.09 Driver Version: 580.126.09 CUDA Version: 13.0 |
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| 0 NVIDIA GeForce RTX 4070 ... Off | 00000000:01:00.0 Off | N/A |
6.3 Check kernel module version
cat /proc/driver/nvidia/version
Example:
NVRM version: NVIDIA UNIX x86_64 Kernel Module 580.126.09 Wed Jan 7 22:59:56 UTC 2026
GCC version: gcc version 13.3.0 (Ubuntu 13.3.0-6ubuntu2~24.04)
- nvidia-smi works and shows your GPU(s).
- /proc/driver/nvidia/version matches the NVML version (580.126.09).
- No more NVML mismatch, no more “driver not running”.
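These success criteria can be condensed into one pass/fail check. A minimal sketch, assuming 580.126.09 is the target version as in this article; verify_stack is a hypothetical helper, and the versions are passed in as arguments so the comparison logic can be exercised without a GPU.

```shell
# Compare the expected driver version ($1) with the versions reported by
# the loaded kernel module ($2) and by nvidia-smi ($3).
verify_stack() {
    expected=$1
    kernel=$2
    smi=$3
    if [ "$kernel" = "$expected" ] && [ "$smi" = "$expected" ]; then
        echo "OK: stack aligned at $expected"
    else
        echo "FAIL: expected=$expected kernel=${kernel:-none} nvidia-smi=${smi:-none}"
        return 1
    fi
}

# Example run, only where the driver is loaded and nvidia-smi is installed.
if command -v nvidia-smi >/dev/null 2>&1 && [ -r /sys/module/nvidia/version ]; then
    verify_stack 580.126.09 \
        "$(cat /sys/module/nvidia/version)" \
        "$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n1)" || true
fi
```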
7. Doctrine summary
- Detect mismatch: NVML 580.126 vs kernel module 580.95.05 → NVML refuses to initialize.
- Confirm DKMS: dkms status shows 580.126.09 built for the current kernel.
- Confirm module on disk: modinfo nvidia points to updates/dkms with version 580.126.09.
- Notice driver not loading: lsmod empty, /proc/driver/nvidia missing.
- Attempt manual load: sudo modprobe nvidia → error about module off.
- Locate blacklist: /usr/lib/modprobe.d/blacklist-nvidia.conf contains alias nvidia off etc.
- Neutralize blacklist: rename that file and optionally add an explicit un-blacklist in /etc/modprobe.d.
- Rebuild initramfs and reboot.
- Verify: modules loaded, nvidia-smi OK, versions aligned at 580.126.09.
This pattern is a classic Ubuntu 24.04 NVIDIA trap: a combination of version drift and a hidden blacklist. Once you know to look for alias nvidia off, the path from “dead GPU” to “fully online” becomes deterministic.