Immich, CUDA, and Unkillable Containers: When GPU Memory Won’t Let Go
Why Immich jobs on NVIDIA GPUs can leave Docker and ML workers in an unkillable state until reboot.
1. The scenario observed on Signal Raider
During deep Immich stress testing on Signal Raider, the following pattern emerged:
- Immich ML jobs run with GPU acceleration enabled (CLIP, faces, OCR, video analysis).
- After several heavy jobs, the GPU starts throwing CUDA errors and OOMs.
- Attempting to stop Immich and Docker fails — containers refuse to die.
- Processes remain “alive” but stuck, holding onto GPU allocations that cannot be freed.
- Only a full system reboot restores normal operation and frees the GPU.
The key red flag: the system tried to kill Immich and Docker, but couldn’t, because the GPU memory was tied up in a state the driver could no longer unwind.
2. What’s actually happening under the hood
2.1 Processes enter D‑state (uninterruptible sleep)
When a process is blocked inside the kernel (for example, waiting on I/O or a GPU driver operation), it can enter a state known as D‑state (uninterruptible sleep). In this state:
- Signals are ignored: even SIGKILL cannot terminate it.
- Docker cannot stop the container because the kernel won't reap the process.
- Resources are held (including GPU memory and file descriptors) until the blocking operation completes, which it never does in this failure mode.
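A quick way to confirm this on the host is to scan /proc for tasks in D-state. The sketch below is a minimal example for a Linux host; where the kernel exposes /proc/<pid>/wchan, it also prints the kernel symbol each task is blocked in, which in this failure mode typically points into the GPU driver.

```python
"""List Linux tasks stuck in D-state (uninterruptible sleep) by scanning /proc.

Minimal sketch; run it on the Docker host. Tasks reported here ignore SIGKILL
until the kernel operation they are blocked in completes.
"""
import os


def d_state_processes():
    stuck = []
    for pid in filter(str.isdigit, os.listdir("/proc")):
        try:
            with open(f"/proc/{pid}/stat") as f:
                stat = f.read()
            with open(f"/proc/{pid}/comm") as f:
                comm = f.read().strip()
            with open(f"/proc/{pid}/wchan") as f:
                wchan = f.read().strip() or "-"
        except OSError:
            continue  # process exited while we were scanning
        # /proc/<pid>/stat is "pid (comm) state ..."; split after the last ')'.
        state = stat.rsplit(")", 1)[1].split()[0]
        if state == "D":
            stuck.append((int(pid), comm, wchan))
    return stuck


if __name__ == "__main__":
    for pid, comm, wchan in d_state_processes():
        print(f"PID {pid} ({comm}) is in D-state, blocked in kernel symbol: {wchan}")
```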
2.2 GPU memory in a “wrong” address space
CUDA uses several types of allocations:
- Device memory on the GPU.
- Pinned host memory in system RAM.
- Unified / managed memory shared between CPU and GPU.
- Driver‑managed internal pools used by TensorRT and ONNX Runtime.
In this failure mode, some allocations end up tied to a half-broken GPU context. From the OS perspective, the memory looks "allocated in the wrong address space" only in the sense that it belongs to a context the driver can no longer cleanly release.
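One way to see this from the host is to ask NVML which processes still own GPU memory. The sketch below assumes the nvidia-ml-py bindings (import name pynvml) are installed; a PID that Docker believes is gone but that still appears in this list is exactly a context the driver has not managed to release.

```python
"""Ask NVML which PIDs still hold device memory on GPU 0 (sketch, assumes the
nvidia-ml-py bindings are installed)."""
import pynvml

pynvml.nvmlInit()
try:
    handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # first GPU
    mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
    print(f"VRAM used: {mem.used / 2**20:.0f} MiB of {mem.total / 2**20:.0f} MiB")
    for proc in pynvml.nvmlDeviceGetComputeRunningProcesses(handle):
        used = proc.usedGpuMemory
        used_str = f"{used / 2**20:.0f} MiB" if used is not None else "unknown"
        print(f"PID {proc.pid} still holds {used_str} of device memory")
finally:
    pynvml.nvmlShutdown()
```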
2.3 ONNX Runtime + TensorRT lifecycle
Immich’s ML worker uses ONNX Runtime with:
- CUDAExecutionProvider for GPU inference.
- TensorRT for engine optimizations.
Over multiple jobs, the following can happen:
- Large TensorRT workspaces are allocated and partially freed.
- Graph capture buffers are created during CUDA graph optimization.
- Model weights and engine caches remain resident between jobs.
If a job dies during one of these phases, it can leave behind stale allocations and broken state inside the driver.
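The knobs that govern these workspaces and caches are exposed as execution-provider options when the session is created. The sketch below is not Immich's configuration; the model path and the size limits are illustrative assumptions, but the option names are standard ONNX Runtime provider options, and capping them bounds how much state can pile up per session.

```python
"""Bound ONNX Runtime's per-session GPU usage via execution-provider options.
Sketch only: the model path and the size limits are illustrative assumptions,
not Immich's actual configuration."""
import onnxruntime as ort

GiB = 1024 ** 3

trt_options = {
    "device_id": 0,
    "trt_max_workspace_size": 2 * GiB,  # cap TensorRT's scratch workspace
    "trt_engine_cache_enable": True,    # reuse built engines across runs
    "trt_engine_cache_path": "/tmp/trt-cache",
}
cuda_options = {
    "device_id": 0,
    "gpu_mem_limit": 1 * GiB,                     # cap the CUDA EP memory arena
    "arena_extend_strategy": "kSameAsRequested",  # avoid power-of-two overshoot
}

session = ort.InferenceSession(
    "model.onnx",  # hypothetical model file
    providers=[
        ("TensorrtExecutionProvider", trt_options),
        ("CUDAExecutionProvider", cuda_options),  # fallback for unsupported ops
    ],
)
print(session.get_providers())
```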
2.4 Docker is not the boss of the GPU
Docker can only kill containers if:
- The kernel can deliver signals and terminate the processes.
- The processes can exit their kernel wait and release resources.
- The GPU driver is willing to tear down the CUDA context.
In this failure mode, the process is stuck in a GPU driver call that never completes, so neither Docker nor kill -9 can break it free.
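You can watch that chain break from the host: ask Docker to stop the ML container, then check what state it reports afterwards. A minimal sketch; the container name is an assumption based on the default Immich compose file, so adjust it for your deployment.

```python
"""Watch a stop attempt fail from the host. Sketch only: the container name is
an assumption based on the default Immich compose file."""
import subprocess

CONTAINER = "immich_machine_learning"  # adjust to your deployment

# Ask Docker to stop the container: SIGTERM, a 30 s grace period, then SIGKILL.
subprocess.run(["docker", "stop", "-t", "30", CONTAINER], check=False)

# If the main process sits in D-state inside a driver call, it survives both
# signals and the container keeps reporting itself as running.
status = subprocess.run(
    ["docker", "inspect", "-f", "{{.State.Status}}", CONTAINER],
    capture_output=True, text=True, check=False,
).stdout.strip()
print(f"{CONTAINER} status after stop attempt: {status or 'not found'}")
```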
3. Why Immich triggers this under repeated heavy workloads
Immich is designed as a CPU‑first app with optional GPU acceleration. The GPU path is fast, but its lifecycle is not deeply engineered for repeated, heavy ML jobs. Under stress testing, the following pattern emerges:
- First large job – GPU performs well, memory is clean, everything succeeds.
- Second/third job – ONNX Runtime and TensorRT reuse existing allocations and add more.
- Fourth/fifth heavy job – memory becomes fragmented, stale graph capture state exists, TensorRT workspaces linger.
- Eventually – the CUDA allocator hits a state where new allocations fail and old ones cannot be fully freed.
At that point:
- Immich’s ML worker may crash or hang.
- Processes become stuck in D‑state inside GPU driver calls.
- Docker cannot stop or kill the containers.
- The GPU context is effectively poisoned until reboot.
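One mitigation for this accumulation pattern is to isolate each heavy job in a short-lived child process, so that a clean exit lets the driver tear down the entire CUDA context instead of reusing it for the next job. This is a sketch of the pattern, not how Immich's ML worker is actually structured, and it cannot rescue a process that is already stuck in D-state; run_inference_job is a hypothetical stand-in for the real work.

```python
"""Run each heavy GPU job in a short-lived child process so its CUDA context
dies with the process. Sketch of the isolation pattern only."""
from multiprocessing import get_context


def run_inference_job(job_id: int) -> None:
    # Hypothetical placeholder: build the ONNX Runtime session and run the job
    # entirely inside this child process.
    print(f"processing job {job_id} in an isolated CUDA context")


def process_jobs(job_ids):
    ctx = get_context("spawn")  # fresh interpreter, fresh CUDA context per job
    for job_id in job_ids:
        worker = ctx.Process(target=run_inference_job, args=(job_id,))
        worker.start()
        worker.join(timeout=600)  # don't wait forever on a wedged job
        if worker.is_alive():
            worker.kill()         # best effort; a D-state child ignores this too
            worker.join(timeout=30)


if __name__ == "__main__":
    process_jobs(range(5))
```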
4. Recognizing the failure mode in practice
Signs that you’ve hit this GPU lifecycle failure mode:
- Immich logs show repeated CUDA errors and OOMs even though VRAM appears free.
- docker stop / docker kill on the Immich ML container hangs or fails.
- kill -9 on the ML process has no effect.
- nvidia-smi shows processes that won't go away.
- System load may show tasks in D-state (uninterruptible sleep).
Attempts to fix it with:
- Restarting the container → fails or leaves stuck processes.
- Restarting Docker → may help partially, but often leaves GPU processes behind.
- Resetting the GPU via nvidia-smi --gpu-reset → often fails if processes are stuck (a scripted attempt is sketched below).
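For completeness, here is what that reset attempt looks like when scripted. Sketch only: nvidia-smi --gpu-reset must run as root and requires that nothing still holds the GPU, which is precisely what a wedged process prevents.

```python
"""Try a driver-level GPU reset before resorting to a reboot. Sketch only:
requires root, and it is expected to fail while any process still has the
GPU open."""
import subprocess

result = subprocess.run(
    ["nvidia-smi", "--gpu-reset", "-i", "0"],  # reset GPU index 0
    capture_output=True, text=True, check=False,
)
if result.returncode == 0:
    print("GPU reset succeeded; the driver could tear down all contexts.")
else:
    print("GPU reset failed, as expected when processes are wedged:")
    print(result.stderr.strip() or result.stdout.strip())
    print("A full reboot is the only remaining way to clear the GPU state.")
```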
The only reliable fix: reboot the system, which fully resets the GPU driver and context.
5. Practical conclusions for Immich on NVIDIA GPUs
In practice, this leads to a few operational rules:
- Use GPU Immich for private/family archives with incremental uploads and small batches.
- Avoid repeated giant imports back‑to‑back on a single GPU node.
- Expect to reboot occasionally if you do serious torture‑testing or bulk migration.
- Don’t blame Docker when containers become unkillable — the real issue is inside the GPU driver and CUDA allocator.
Architecturally, this is not a reason to abandon Immich. It’s a reason to understand its limits: a powerful, GPU‑accelerated, private photo system that behaves beautifully under normal use, but shows deep ML stack cracks when pushed into repeated, datacenter‑style workloads.