Doctrine: Fixing NVIDIA 580 NVML Mismatch and Blacklist Trap on Ubuntu 24.04

Googled777


GPU Ubuntu 24.04

This doctrine captures a real-world failure pattern on Ubuntu 24.04 with the NVIDIA 580 driver: a version mismatch between NVML and the kernel module, followed by a hidden blacklist that prevents the driver from loading at all. The end state is a clean, aligned stack: 580.126.09 kernel module + 580.126.09 NVML, with the GPU fully online.

Symptom summary
- nvidia-smi initially reports: Failed to initialize NVML: Driver/library version mismatch
- Later, after partial fixes: NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver
- /proc/driver/nvidia/version either shows an older version (580.95.05) or does not exist at all.

1. Initial symptom: NVML mismatch

1.1 The error

nvidia-smi
Output:
Failed to initialize NVML: Driver/library version mismatch
NVML library version: 580.126

Check the kernel module version:
cat /proc/driver/nvidia/version
Example problematic output:
NVRM version: NVIDIA UNIX Open Kernel Module for x86_64  580.95.05
GCC version:  gcc version 13.3.0 (Ubuntu 13.3.0-6ubuntu2~24.04)

1.2 Diagnosis

  • NVML (user-space) version: 580.126
  • Kernel module version: 580.95.05

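NVML and the loaded kernel module must match exactly. Any difference in the version string (here 580.95.05 in the kernel vs 580.126 in user space) makes NVML refuse to initialize and report the mismatch error. This typically happens when a package upgrade replaces the user-space libraries while an older module is still the one loaded in the running kernel.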
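The comparison above can be scripted. The following is a minimal sketch, not part of the original procedure: the function names nvrm_version and compare_nvidia_versions are hypothetical, and it assumes the module on disk tracks the installed user-space libraries, which is normally true when both come from the same driver package.

```shell
# Print the first dotted version number found on the "NVRM version:" line of stdin.
nvrm_version() {
    grep -m1 '^NVRM version:' | grep -oE '[0-9]+\.[0-9]+(\.[0-9]+)?' | head -n1
}

# Compare the loaded module version with the version of the module on disk.
# Both arguments are plain version strings such as 580.126.09.
compare_nvidia_versions() {
    loaded=$1
    ondisk=$2
    if [ -z "$loaded" ]; then
        # Nothing could be read from /proc, so no module is loaded.
        echo "not-loaded"
    elif [ "$loaded" = "$ondisk" ]; then
        echo "match"
    else
        echo "mismatch: loaded=$loaded on-disk=$ondisk"
    fi
}
```

On a machine where the module is loaded, a possible invocation is: compare_nvidia_versions "$(nvrm_version < /proc/driver/nvidia/version)" "$(modinfo -F version nvidia)"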


2. Confirming the installed driver and DKMS state

2.1 Check DKMS

dkms status
Example:
nvidia/580.126.09, 6.14.0-37-generic, x86_64: installed

This means DKMS successfully built the 580.126.09 module for the current kernel (6.14.0-37-generic), but it does not guarantee the kernel is actually loading it.
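To make the DKMS check repeatable, a small filter can confirm that the expected version is built for the running kernel. This is a sketch under one assumption: dkms status prints lines in the "nvidia/VERSION, KERNEL, ARCH: installed" layout shown above (older DKMS releases use a slightly different layout). The function name dkms_nvidia_installed is hypothetical.

```shell
# Succeed if the dkms status text on stdin lists nvidia/<version>
# as installed for <kernel>; fail otherwise.
dkms_nvidia_installed() {
    version=$1
    kernel=$2
    # -F treats the pattern as a fixed string, so dots in the version are literal.
    grep -F "nvidia/$version, $kernel," | grep -q ': installed'
}
```

A possible invocation is: dkms status | dkms_nvidia_installed 580.126.09 "$(uname -r)" && echo "DKMS build present"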

2.2 Check the module on disk

modinfo nvidia | grep -E '^(filename|version):'
Example:
filename:       /lib/modules/6.14.0-37-generic/updates/dkms/nvidia.ko.zst
version:        580.126.09

2.3 Check the driver package

dpkg -l | grep -E 'nvidia-driver|linux-modules-nvidia'
Example:
ii  nvidia-driver-580  580.126.09-0ubuntu0.24.04.1  amd64  NVIDIA driver metapackage

At this point, the correct driver is installed and the correct module exists on disk, but the kernel either loads the wrong one or none at all.


3. Transition: from mismatch to “driver not running”

After purging/reinstalling or other partial fixes, the system may move from a mismatch to a “driver not running” state:

nvidia-smi
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver.
Make sure that the latest NVIDIA driver is installed and running.

cat /proc/driver/nvidia/version
cat: /proc/driver/nvidia/version: No such file or directory

This means the NVIDIA kernel module is not loaded at all.

Check:
lsmod | grep -i nvidia
If this returns nothing, the module is not loaded.
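The two failure states described so far produce distinct nvidia-smi messages, so they can be told apart mechanically. The sketch below classifies nvidia-smi output read from stdin; the function name classify_smi_output and the labels it prints are hypothetical, and the matched strings are the ones quoted in this article.

```shell
# Classify nvidia-smi output (read from stdin) into one of four states.
classify_smi_output() {
    output=$(cat)
    case $output in
        *"Driver/library version mismatch"*)
            echo "version-mismatch" ;;
        *"couldn't communicate with the NVIDIA driver"*)
            echo "module-not-loaded" ;;
        *"Driver Version:"*)
            echo "ok" ;;
        *)
            echo "unknown" ;;
    esac
}
```

A possible invocation is: nvidia-smi 2>&1 | classify_smi_output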

4. The real culprit: modprobe alias to “off” (blacklist)

4.1 Attempt to load the module

sudo modprobe nvidia
Example error:
modprobe: ERROR: ../libkmod/libkmod-module.c:968 kmod_module_insert_module() could not find module by name='off'
modprobe: ERROR: could not insert 'off': Unknown symbol in module, or unknown parameter (see dmesg)

This is the key: a configuration rule has aliased nvidia to a non-existent module named off, so modprobe tries to load off instead of the real nvidia module.

4.2 Locate the blacklist/alias rules

grep -R "alias nvidia" /etc/modprobe.d /usr/lib/modprobe.d
grep -R "install nvidia" /etc/modprobe.d /usr/lib/modprobe.d
Example output:
/etc/modprobe.d/zz-nvidia-unblacklist.conf:alias nvidia nvidia
/etc/modprobe.d/zz-nvidia-unblacklist.conf:alias nvidia_drm nvidia_drm
/etc/modprobe.d/zz-nvidia-unblacklist.conf:alias nvidia_modeset nvidia_modeset
/etc/modprobe.d/zz-nvidia-unblacklist.conf:alias nvidia_uvm nvidia_uvm
/usr/lib/modprobe.d/blacklist-nvidia.conf:alias nvidia off
/usr/lib/modprobe.d/blacklist-nvidia.conf:alias nvidia-drm off
/usr/lib/modprobe.d/blacklist-nvidia.conf:alias nvidia-modeset off

The distro-provided file /usr/lib/modprobe.d/blacklist-nvidia.conf is explicitly blocking NVIDIA:

  • alias nvidia off
  • alias nvidia-drm off
  • alias nvidia-modeset off

This forces any attempt to load NVIDIA modules to instead try to load a non-existent module named off, which fails and leaves the GPU driver unloaded.
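To search specifically for the harmful rules and skip harmless aliases, the grep can be narrowed to lines that alias an NVIDIA module name to off. This is a sketch; the function name find_nvidia_off_aliases is hypothetical, and the directories to scan are passed as arguments.

```shell
# List every "alias <nvidia module> off" rule under the given directories,
# printed as file:line:content. Returns non-zero when nothing is found.
find_nvidia_off_aliases() {
    grep -rEn '^[[:space:]]*alias[[:space:]]+nvidia[-_a-z]*[[:space:]]+off[[:space:]]*$' "$@" 2>/dev/null
}
```

A possible invocation is: find_nvidia_off_aliases /etc/modprobe.d /usr/lib/modprobe.d. The effective configuration as modprobe sees it can also be dumped with modprobe --showconfig and searched the same way.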

Why this file exists
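Ubuntu 24.04 ships both open kernel modules and proprietary DKMS modules for NVIDIA, and some configurations use this blacklist to keep the NVIDIA modules from loading. One likely source is nvidia-prime: selecting the integrated-only profile with prime-select intel writes a blacklist-nvidia.conf containing these aliases so that the discrete GPU stays off. Any header comment in the file and the output of prime-select query can help confirm that. When you explicitly install nvidia-driver-580 and expect the GPU to be active, this blacklist is incorrect and must be neutralized.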

5. The fix: un-blacklist NVIDIA and rebuild initramfs

5.1 Add an explicit un-blacklist override (optional but clean)

If not already present, create an override in /etc/modprobe.d:

echo "alias nvidia nvidia" | sudo tee /etc/modprobe.d/zz-nvidia-unblacklist.conf
echo "alias nvidia_drm nvidia_drm" | sudo tee -a /etc/modprobe.d/zz-nvidia-unblacklist.conf
echo "alias nvidia_modeset nvidia_modeset" | sudo tee -a /etc/modprobe.d/zz-nvidia-unblacklist.conf
echo "alias nvidia_uvm nvidia_uvm" | sudo tee -a /etc/modprobe.d/zz-nvidia-unblacklist.conf

5.2 Disable the distro blacklist file

This is the decisive step: neutralize the file that aliases NVIDIA to off.

sudo mv /usr/lib/modprobe.d/blacklist-nvidia.conf \
         /usr/lib/modprobe.d/blacklist-nvidia.conf.disabled
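Renaming works because modprobe only reads files in these directories whose names end in .conf. A guarded version of the rename, safe to run more than once, might look like the sketch below; the function name disable_modprobe_conf is hypothetical, and it needs to run from a root shell because it calls mv directly.

```shell
# Rename a modprobe config file to <name>.disabled so modprobe ignores it.
# Does nothing (and succeeds) if the file is already gone.
disable_modprobe_conf() {
    conf=$1
    if [ ! -f "$conf" ]; then
        echo "nothing to do: $conf not found"
        return 0
    fi
    mv -- "$conf" "$conf.disabled" && echo "disabled: $conf"
}
```

A possible invocation from a root shell is: disable_modprobe_conf /usr/lib/modprobe.d/blacklist-nvidia.conf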

5.3 Rebuild initramfs

sudo update-initramfs -u
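Before rebooting, it can be worth checking that the rebuilt initramfs no longer carries an active copy of the blacklist. The sketch below filters a file listing for active blacklist-nvidia .conf files under a modprobe.d directory; the function name initramfs_blacklist_entries is hypothetical, and it assumes the listing comes from lsinitramfs (part of initramfs-tools) with one path per line.

```shell
# Print any active (.conf) blacklist-nvidia files found in a listing of
# initramfs paths read from stdin. No output means the blacklist is gone.
initramfs_blacklist_entries() {
    grep -E '(^|/)modprobe\.d/.*blacklist-nvidia.*\.conf$'
}
```

A possible invocation is: lsinitramfs "/boot/initrd.img-$(uname -r)" | initramfs_blacklist_entries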

5.4 Reboot

sudo reboot

6. Post-fix verification

6.1 Check loaded modules

lsmod | grep -i nvidia
Expected: entries for nvidia, nvidia_drm, nvidia_modeset, nvidia_uvm.
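Rather than reading the lsmod output by eye, the expected module list can be checked directly. This sketch reads lsmod output from stdin and prints whichever of the four expected modules are absent; the function name missing_nvidia_modules is hypothetical.

```shell
# Print each expected NVIDIA module that does not appear in the lsmod
# output read from stdin. No output means all four are loaded.
missing_nvidia_modules() {
    # The first column of lsmod is the module name.
    loaded=$(awk '{ print $1 }')
    for module in nvidia nvidia_drm nvidia_modeset nvidia_uvm; do
        # -x requires the whole line to match, so nvidia does not match nvidia_drm.
        printf '%s\n' "$loaded" | grep -qx "$module" || echo "$module"
    done
}
```

A possible invocation is: lsmod | missing_nvidia_modules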

6.2 Check nvidia-smi

nvidia-smi
Example healthy output:
| NVIDIA-SMI 580.126.09             Driver Version: 580.126.09     CUDA Version: 13.0     |
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
|   0  NVIDIA GeForce RTX 4070 ...    Off |   00000000:01:00.0 Off |                  N/A |

6.3 Check kernel module version

cat /proc/driver/nvidia/version
Example:
NVRM version: NVIDIA UNIX x86_64 Kernel Module  580.126.09  Wed Jan  7 22:59:56 UTC 2026
GCC version:  gcc version 13.3.0 (Ubuntu 13.3.0-6ubuntu2~24.04)

Success condition
- nvidia-smi works and shows your GPU(s).
- /proc/driver/nvidia/version matches the NVML version (580.126.09).
- No more NVML mismatch, no more “driver not running”.

7. Doctrine summary

  1. Detect mismatch: NVML 580.126 vs kernel module 580.95.05 → NVML refuses to initialize.
  2. Confirm DKMS: dkms status shows 580.126.09 built for the current kernel.
  3. Confirm module on disk: modinfo nvidia points to updates/dkms with version 580.126.09.
  4. Notice driver not loading: lsmod empty, /proc/driver/nvidia missing.
  5. Attempt manual load: sudo modprobe nvidia → error about module off.
  6. Locate blacklist: /usr/lib/modprobe.d/blacklist-nvidia.conf contains alias nvidia off etc.
  7. Neutralize blacklist: rename that file and optionally add an explicit un-blacklist in /etc/modprobe.d.
  8. Rebuild initramfs and reboot.
  9. Verify: modules loaded, nvidia-smi OK, versions aligned at 580.126.09.

This pattern is a classic Ubuntu 24.04 NVIDIA trap: a combination of version drift and a hidden blacklist. Once you know to look for alias nvidia off, the path from “dead GPU” to “fully online” becomes deterministic.
