llama-server

The Docker image optionally includes a compiled llama.cpp server binary for GPU-accelerated LLM inference. The binary is installed at /opt/llama-server and managed at runtime by ComfyUI Studio's LLM module.

Overview

llama-server is a standalone HTTP server from the llama.cpp project. It loads GGUF-format language models and serves an OpenAI-compatible chat/completion API. In the Docker image, it is compiled from source with CUDA support so that inference runs on the GPU.

The binary is used by ComfyUI Studio's LLM page (/admin/llm) to:

  • Start/stop an LLM server with a selected GGUF model
  • Run chat conversations directly on the pod GPU
  • Provide AI-assisted prompt generation for image/video workflows
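Because the server exposes an OpenAI-compatible API, requests follow the standard chat-completions wire format. A minimal sketch of building such a request in Python (the port matches the default documented below; the helper itself is illustrative, not part of the codebase):

```python
import json


def build_chat_request(prompt: str, port: int = 8080) -> tuple[str, bytes]:
    """Build an OpenAI-style chat-completions request for a local llama-server."""
    url = f"http://127.0.0.1:{port}/v1/chat/completions"
    payload = {
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.7,
    }
    return url, json.dumps(payload).encode("utf-8")


url, body = build_chat_request("Describe a cinematic sunset over mountains.")
# Sending it requires a running server, e.g.:
#   req = urllib.request.Request(url, data=body,
#                                headers={"Content-Type": "application/json"})
#   urllib.request.urlopen(req).read()
```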

Compilation

The llama-server build step in the Dockerfile:

RUN if [ "${ENABLE_LLM}" = "true" ]; then \
      CUDA_ARCHS="75;80;86;89;90;100" && \
      if [ "$LLAMA_CPP_VERSION" = "latest" ]; then \
        git clone --depth 1 https://github.com/ggerganov/llama.cpp.git /tmp/llama.cpp; \
      else \
        git clone --depth 1 --branch "$LLAMA_CPP_VERSION" \
          https://github.com/ggerganov/llama.cpp.git /tmp/llama.cpp; \
      fi && \
      cd /tmp/llama.cpp && \
      cmake -B build \
          -DGGML_CUDA=ON \
          -DGGML_NATIVE=OFF \
          -DCMAKE_BUILD_TYPE=Release \
          -DCMAKE_CUDA_ARCHITECTURES="$CUDA_ARCHS" \
          -DLLAMA_BUILD_TESTS=OFF \
          -DCMAKE_EXE_LINKER_FLAGS=-Wl,--allow-shlib-undefined && \
      cmake --build build --config Release -j$(nproc) --target llama-server && \
      cp build/bin/llama-server /opt/llama-server && \
      rm -rf /tmp/llama.cpp; \
    fi

Build Flags

The CMake flags are based on the llama.cpp official CUDA Dockerfile:

| Flag | Value | Purpose |
|------|-------|---------|
| GGML_CUDA | ON | Enable CUDA backend for GPU-accelerated inference |
| GGML_NATIVE | OFF | Disable native CPU detection -- the build machine's CPU is not the runtime CPU |
| CMAKE_BUILD_TYPE | Release | Optimized release build |
| CMAKE_CUDA_ARCHITECTURES | 75;80;86;89;90;100 | Compile CUDA kernels for all supported GPU architectures |
| LLAMA_BUILD_TESTS | OFF | Skip test compilation to reduce build time |
| CMAKE_EXE_LINKER_FLAGS | -Wl,--allow-shlib-undefined | Allow unresolved shared-library symbols at link time |

CUDA Architecture Targets

The CMAKE_CUDA_ARCHITECTURES value covers every GPU generation from Turing through Blackwell:

| SM | GPU Generation | Example GPUs |
|----|----------------|--------------|
| 75 | Turing | RTX 2080 Ti, RTX 2070, T4 |
| 80 | Ampere | A100 |
| 86 | Ampere | A6000, A5000, RTX 3090, RTX 3060 |
| 89 | Ada Lovelace | L40S, L4, RTX 4090, RTX 4060 |
| 90 | Hopper | H100, H200 |
| 100 | Blackwell | B100, B200 |

Each target architecture adds compilation time because nvcc must generate device code for that specific SM version. This is why building for all six architectures takes significantly longer than building for a single one.
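As a quick sanity check, a GPU's compute capability can be compared against the compiled arch list (a hypothetical helper; the SM table above is the source of truth):

```python
# SM versions baked into the image, from CMAKE_CUDA_ARCHITECTURES
COMPILED_ARCHS = {75, 80, 86, 89, 90, 100}


def is_supported(compute_capability: str) -> bool:
    """Return True if a GPU's compute capability (e.g. '8.6') has a compiled kernel."""
    major, minor = compute_capability.split(".")
    return int(major) * 10 + int(minor) in COMPILED_ARCHS


print(is_supported("8.6"))  # RTX 3090 (Ampere) -> True
print(is_supported("7.0"))  # V100 (Volta) -> False: Volta predates the Turing baseline
```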

The --allow-shlib-undefined Linker Flag

This flag deserves special explanation. During the Docker build, the CUDA toolkit provides stub libraries (e.g., libcuda.so) for linking. However, these stubs do not include the versioned soname (libcuda.so.1) that the linker expects when resolving dependencies.

At runtime, the NVIDIA Container Toolkit mounts the real GPU driver libraries from the host into the container. These real libraries provide the correct soname. The --allow-shlib-undefined flag tells the linker to accept unresolved symbols at link time, deferring resolution to runtime when the real driver is available.

Without this flag, the build would fail with:

/usr/bin/ld: cannot find -lcuda

or produce a binary that fails symbol resolution checks.
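One way to check from Python whether the real driver library is resolvable in the current environment (a diagnostic sketch using the standard-library `ctypes.util.find_library`; the result depends on whether the NVIDIA driver is mounted):

```python
from ctypes.util import find_library

# Inside a GPU container the NVIDIA Container Toolkit mounts libcuda.so.1,
# so this resolves to a library name; without the driver it returns None.
driver = find_library("cuda")
if driver:
    print(f"GPU driver library resolved: {driver}")
else:
    print("libcuda not resolvable -- expected outside a GPU container")
```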

Version Pinning

The LLAMA_CPP_VERSION build argument controls which llama.cpp release is built:

| Value | Behavior |
|-------|----------|
| b8505 (default) | Clones the b8505 git tag -- a known-good release pinned on 2026-03-24 |
| latest | Clones HEAD without a tag (--depth 1, no --branch) |
| Custom tag | Clones the specified tag (e.g., b8400) |

The default is pinned to ensure reproducible builds. The llama.cpp project moves fast -- using latest provides the newest features but risks build failures from upstream changes.

To update the pinned version, change the ARG LLAMA_CPP_VERSION=b8505 default in the Dockerfile.
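A candidate value can be checked against the expected shapes before editing the Dockerfile (a hypothetical helper; the `b`-prefixed tag convention is llama.cpp's release naming):

```python
import re


def is_valid_llama_cpp_version(value: str) -> bool:
    """Accept 'latest' or a llama.cpp build tag like 'b8505'."""
    return value == "latest" or re.fullmatch(r"b\d+", value) is not None


print(is_valid_llama_cpp_version("b8505"))   # True
print(is_valid_llama_cpp_version("latest"))  # True
print(is_valid_llama_cpp_version("v1.0"))    # False: not a llama.cpp build tag
```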

Build Time

Compilation time depends on the build environment:

| Environment | Time | Reason |
|-------------|------|--------|
| Machine with target GPU | ~10 min | Typically faster hardware with more cores than a hosted CI runner |
| CI runner without GPU (e.g., GitHub Actions) | ~60 min | Few vCPUs; compiling CUDA kernels for all 6 architectures dominates the build |

The build uses all available cores (-j$(nproc)) and only compiles the llama-server target (not the full llama.cpp suite).

Skipping the Build

Set ENABLE_LLM=false to skip llama-server compilation entirely:

docker build -f docker/production/Dockerfile -t comfyui-studio \
  --build-arg ENABLE_LLM=false \
  .

This saves ~60 minutes on CI builds. The LLM page in ComfyUI Studio will show an error if llama-server is not found at the configured path.

Layer Caching

The llama-server build step is intentionally placed before the custom node installation in the Dockerfile:

# llama-server compilation (expensive, ~60 min)
RUN if [ "${ENABLE_LLM}" = "true" ]; then ...

# Custom nodes (cheaper, ~15 min)
COPY docker/production/nodes.txt /tmp/nodes.txt
COPY docker/production/install_nodes.sh /tmp/install_nodes.sh
RUN /tmp/install_nodes.sh ...

This means changing nodes.txt (adding or removing a custom node) does not invalidate the llama-server layer. The expensive 60-minute compilation is cached and reused. Only changes to the Dockerfile lines above the node installation -- such as changing LLAMA_CPP_VERSION, ENABLE_LLM, or the CUDA/PyTorch versions -- trigger a recompilation.

Runtime

At runtime, the llama-server binary is managed by backend/llm_server.py:

  • Start: Launches the binary as a subprocess with the selected GGUF model, binding to 127.0.0.1 on the configured port (default 8080)
  • Stop: Sends SIGTERM, waits 10 seconds, then SIGKILL if needed
  • Status: Checks if the subprocess is alive via poll()
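The lifecycle above can be sketched with `subprocess` (a simplified sketch, not the actual backend/llm_server.py implementation; `--model`, `--host`, and `--port` are llama-server's standard CLI options):

```python
import subprocess


class LlamaServerProcess:
    """Minimal lifecycle manager mirroring the start/stop/status behavior described above."""

    def __init__(self, binary="/opt/llama-server"):
        self.binary = binary
        self.proc = None

    def start(self, model_path, port=8080):
        cmd = [self.binary, "--model", model_path,
               "--host", "127.0.0.1", "--port", str(port)]
        self.proc = subprocess.Popen(cmd)

    def stop(self, grace=10):
        if self.proc is None:
            return
        self.proc.terminate()            # SIGTERM first
        try:
            self.proc.wait(timeout=grace)
        except subprocess.TimeoutExpired:
            self.proc.kill()             # SIGKILL if it ignores SIGTERM
            self.proc.wait()
        self.proc = None

    def is_running(self):
        # poll() returns None while the subprocess is alive
        return self.proc is not None and self.proc.poll() is None
```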

Configuration is stored at STUDIO_DIR/llm/config.json and includes:

| Parameter | Default | Description |
|-----------|---------|-------------|
| n_gpu_layers | 99 | Number of model layers to offload to GPU (99 = all) |
| ctx_size | 8192 | Context window size in tokens |
| threads | 4 | Number of CPU threads for non-GPU operations |
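Loading this config with fallbacks to the documented defaults might look like (a sketch; any schema beyond the three documented keys is an assumption):

```python
import json
from pathlib import Path

# Defaults documented for STUDIO_DIR/llm/config.json
DEFAULTS = {"n_gpu_layers": 99, "ctx_size": 8192, "threads": 4}


def load_llm_config(studio_dir: str) -> dict:
    """Read STUDIO_DIR/llm/config.json, falling back to documented defaults."""
    path = Path(studio_dir) / "llm" / "config.json"
    config = dict(DEFAULTS)
    if path.exists():
        config.update(json.loads(path.read_text()))
    return config
```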

GGUF model files are stored in STUDIO_DIR/llm/models/ and downloaded via the LLM page from HuggingFace.

Environment Variables

| Variable | Default | Purpose |
|----------|---------|---------|
| LLAMA_SERVER_PATH | /opt/llama-server | Path to the compiled binary |
| LLAMA_SERVER_PORT | 8080 | Port the server listens on |
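Resolving these variables with their documented defaults can be sketched as (an illustrative helper, not the backend's actual code):

```python
import os


def llama_server_settings() -> tuple[str, int]:
    """Resolve binary path and port from the documented environment variables."""
    path = os.environ.get("LLAMA_SERVER_PATH", "/opt/llama-server")
    port = int(os.environ.get("LLAMA_SERVER_PORT", "8080"))
    return path, port
```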