LLM API

Endpoints for managing LLM (Large Language Model) GGUF models and the llama.cpp server. The LLM server binary is baked into the Docker image at /opt/llama-server and compiled with CUDA support.


GET /api/admin/llm/models

List LLM models from the catalog with download/presence status.

Auth: Required

Response: 200 OK

{
  "version": 3,
  "date": "2026-03-21 08:00",
  "categories": [
    {
      "id": "general",
      "name": "General",
      "models": [
        {
          "name": "Qwen2.5 7B Instruct Q4_K_M",
          "file": "qwen2.5-7b-instruct-q4_k_m.gguf",
          "dest": ".",
          "size_gb": 4.4,
          "status": "present",
          "on_disk_bytes": 4724464640,
          "source": "huggingface",
          "hf_repo": "Qwen/Qwen2.5-7B-Instruct-GGUF"
        }
      ]
    }
  ],
  "summary": {
    "total": 5,
    "present": 2
  }
}
curl https://your-pod.runpod.io/api/admin/llm/models \
  -H "X-API-Key: your-api-key"
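A client can flatten the categorized catalog to recompute the `summary` counts shown above. This is a sketch, not part of the official docs: the function name is hypothetical, field names follow the sample JSON, and the `"absent"` status value in the test data is an assumption (only `"present"` appears in the example).

```python
def summarize_catalog(catalog: dict) -> dict:
    """Return total/present counts plus the filenames already on disk."""
    models = [
        m
        for category in catalog.get("categories", [])
        for m in category.get("models", [])
    ]
    present = [m["file"] for m in models if m.get("status") == "present"]
    return {"total": len(models), "present": len(present), "present_files": present}

catalog = {
    "categories": [
        {"id": "general", "name": "General", "models": [
            {"file": "qwen2.5-7b-instruct-q4_k_m.gguf", "status": "present"},
            {"file": "another-model.gguf", "status": "absent"},
        ]},
    ],
}
print(summarize_catalog(catalog))
```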

POST /api/admin/llm/models/download/{filename}

Queue an LLM model for download.

Auth: Required

Path parameters:

Parameter  Description
filename   The GGUF model filename from the catalog

Response: 200 OK

{
  "status": "queued",
  "file": "qwen2.5-7b-instruct-q4_k_m.gguf"
}

If already downloading or queued:

{
  "status": "downloading",
  "file": "qwen2.5-7b-instruct-q4_k_m.gguf"
}

Error response: 404 Not Found

{
  "detail": "LLM model 'nonexistent.gguf' not found in catalog"
}
curl -X POST https://your-pod.runpod.io/api/admin/llm/models/download/qwen2.5-7b-instruct-q4_k_m.gguf \
  -H "X-API-Key: your-api-key"
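Since both `"queued"` and `"downloading"` mean the file is not yet on disk, a client typically keeps polling GET /api/admin/llm/models until the model shows `"present"`. A minimal sketch, assuming those two documented statuses are the only in-flight states (the helper name and the polling idea are assumptions, not documented behavior):

```python
def needs_polling(download_response: dict) -> bool:
    """True while the model is still being fetched and status should be re-checked."""
    return download_response.get("status") in ("queued", "downloading")

print(needs_polling({"status": "queued", "file": "qwen2.5-7b-instruct-q4_k_m.gguf"}))
print(needs_polling({"status": "downloading", "file": "qwen2.5-7b-instruct-q4_k_m.gguf"}))
```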

DELETE /api/admin/llm/models/{filename}

Delete an LLM model file from disk.

Auth: Required

Path parameters:

Parameter  Description
filename   The GGUF model filename

Response: 200 OK

{
  "status": "deleted",
  "file": "qwen2.5-7b-instruct-q4_k_m.gguf"
}

Error response: 404 Not Found

{
  "detail": "LLM model 'nonexistent.gguf' not found in catalog"
}
curl -X DELETE https://your-pod.runpod.io/api/admin/llm/models/qwen2.5-7b-instruct-q4_k_m.gguf \
  -H "X-API-Key: your-api-key"

GET /api/admin/llm/status

Get the LLM server status, active model, configuration, and health.

Auth: Required

Response: 200 OK

{
  "running": true,
  "active_model": "qwen2.5-7b-instruct-q4_k_m.gguf",
  "config": {
    "n_gpu_layers": 99,
    "ctx_size": 8192,
    "threads": 4,
    "temp": 0.7,
    "top_p": 0.9,
    "top_k": 40,
    "repeat_penalty": 1.1
  },
  "health": {
    "status": "ok"
  },
  "port": 8080,
  "binary_exists": true
}

Field          Description
running        Whether the llama-server process is alive
active_model   Currently loaded model filename, or null
config         Server configuration (internal keys starting with _ are excluded)
health         Health check response from the llama-server /health endpoint, or null if not running
port           The port llama-server is listening on
binary_exists  Whether the /opt/llama-server binary exists
curl https://your-pod.runpod.io/api/admin/llm/status \
  -H "X-API-Key: your-api-key"
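A readiness probe can be derived from the fields above. Treating "running and health.status == ok" as ready is an interpretation of the documented fields, not a stated contract; note that health is null when the server is stopped.

```python
def llm_ready(status: dict) -> bool:
    """True only when the process is alive and its /health probe reports ok."""
    health = status.get("health") or {}  # health is null when not running
    return bool(status.get("running")) and health.get("status") == "ok"

print(llm_ready({"running": True, "health": {"status": "ok"}}))   # server up and healthy
print(llm_ready({"running": False, "health": None}))              # server stopped
```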

POST /api/admin/llm/start

Start the llama-server with a specific model and optional config overrides.

Auth: Required

Request body:

{
  "model": "qwen2.5-7b-instruct-q4_k_m.gguf",
  "n_gpu_layers": 99,
  "ctx_size": 8192,
  "threads": 4,
  "temp": 0.7,
  "top_p": 0.9,
  "top_k": 40,
  "repeat_penalty": 1.1
}

Field           Type     Required  Description
model           string   Yes       GGUF model filename to load
n_gpu_layers    integer  No        Number of layers to offload to the GPU
ctx_size        integer  No        Context window size in tokens
threads         integer  No        Number of CPU threads
temp            float    No        Sampling temperature
top_p           float    No        Top-p (nucleus) sampling threshold
top_k           integer  No        Top-k sampling cutoff
repeat_penalty  float    No        Repetition penalty

Response: 200 OK

{
  "status": "started",
  "model": "qwen2.5-7b-instruct-q4_k_m.gguf",
  "port": 8080
}

Error responses:

  • 400 Bad Request -- Missing model field or invalid JSON
  • 404 Not Found -- Model file not found on disk
  • 500 Internal Server Error -- Failed to start llama-server
curl -X POST https://your-pod.runpod.io/api/admin/llm/start \
  -H "X-API-Key: your-api-key" \
  -H "Content-Type: application/json" \
  -d '{"model": "qwen2.5-7b-instruct-q4_k_m.gguf", "ctx_size": 8192}'
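A hypothetical request-body builder for this endpoint, using the field names from the table above. The validation itself (requiring model, rejecting unknown keys client-side) is an assumption about good hygiene, not documented server behavior:

```python
# Override fields accepted by POST /api/admin/llm/start, per the table above.
ALLOWED_OVERRIDES = {
    "n_gpu_layers", "ctx_size", "threads",
    "temp", "top_p", "top_k", "repeat_penalty",
}

def build_start_payload(model: str, **overrides) -> dict:
    """Assemble a start request, failing fast on missing/unknown fields."""
    if not model:
        raise ValueError("'model' is required")
    unknown = set(overrides) - ALLOWED_OVERRIDES
    if unknown:
        raise ValueError(f"unknown fields: {sorted(unknown)}")
    return {"model": model, **overrides}

print(build_start_payload("qwen2.5-7b-instruct-q4_k_m.gguf", ctx_size=8192))
```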

POST /api/admin/llm/stop

Stop the running llama-server process.

Auth: Required

Request body: None

Response: 200 OK

{
  "status": "stopped"
}

If the server is already stopped:

{
  "status": "already_stopped"
}
curl -X POST https://your-pod.runpod.io/api/admin/llm/stop \
  -H "X-API-Key: your-api-key"

POST /api/admin/llm/config

Update the LLM server configuration without restarting. The config is persisted to STUDIO_DIR/llm/config.json.

Auth: Required

Request body:

{
  "n_gpu_layers": 99,
  "ctx_size": 16384,
  "temp": 0.8
}

Accepts the same fields as POST /api/admin/llm/start (except model).

Response: 200 OK

{
  "status": "saved",
  "config": {
    "n_gpu_layers": 99,
    "ctx_size": 16384,
    "threads": 4,
    "temp": 0.8,
    "top_p": 0.9,
    "top_k": 40,
    "repeat_penalty": 1.1
  }
}
curl -X POST https://your-pod.runpod.io/api/admin/llm/config \
  -H "X-API-Key: your-api-key" \
  -H "Content-Type: application/json" \
  -d '{"ctx_size": 16384, "temp": 0.8}'
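The saved-config response above suggests overrides are layered onto the persisted config. A sketch of those merge semantics, also dropping keys prefixed with `_` (the status endpoint documents that convention for internal keys); the exact server-side merge behavior is an assumption:

```python
def merge_config(current: dict, overrides: dict) -> dict:
    """Layer overrides onto the persisted config, excluding internal '_' keys."""
    merged = {**current, **overrides}
    return {k: v for k, v in merged.items() if not k.startswith("_")}

current = {"n_gpu_layers": 99, "ctx_size": 8192, "threads": 4,
           "temp": 0.7, "top_p": 0.9, "top_k": 40, "repeat_penalty": 1.1}
print(merge_config(current, {"ctx_size": 16384, "temp": 0.8}))
```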

POST /api/admin/llm/chat

Chat completion via the llama-server. Proxies to the llama.cpp /v1/chat/completions endpoint. Supports both streaming and non-streaming modes.

Auth: Required

Request body:

{
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"}
  ],
  "stream": true,
  "temperature": 0.7,
  "top_p": 0.9,
  "top_k": 40,
  "repeat_penalty": 1.1,
  "max_tokens": 2048
}
Field           Type     Required  Description
messages        array    Yes       Array of {role, content} message objects
stream          boolean  No        Enable streaming (default: false)
temperature     float    No        Overrides the server config for this request
top_p           float    No        Overrides the server config for this request
top_k           integer  No        Overrides the server config for this request
repeat_penalty  float    No        Overrides the server config for this request
max_tokens      integer  No        Maximum number of tokens to generate

Response (non-streaming): 200 OK

Returns the standard OpenAI-compatible chat completion response from llama.cpp.

{
  "id": "chatcmpl-...",
  "object": "chat.completion",
  "choices": [
    {
      "index": 0,
      "message": {"role": "assistant", "content": "Hello! How can I help you?"},
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 20,
    "completion_tokens": 8,
    "total_tokens": 28
  }
}

Response (streaming): 200 OK with text/event-stream

Returns a Server-Sent Events stream in OpenAI-compatible format:

data: {"id":"chatcmpl-...","object":"chat.completion.chunk","choices":[{"delta":{"content":"Hello"},"index":0}]}

data: {"id":"chatcmpl-...","object":"chat.completion.chunk","choices":[{"delta":{"content":"!"},"index":0}]}

data: [DONE]
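A stream in this format can be consumed by parsing each data: line, stopping at [DONE], and joining the content deltas. A minimal sketch using only the chunk shape shown above (the function name is hypothetical):

```python
import json

def collect_stream(lines) -> str:
    """Join the assistant's content deltas from an OpenAI-style SSE stream."""
    parts = []
    for line in lines:
        line = line.strip()
        if not line.startswith("data: "):
            continue  # skip blank separator lines
        payload = line[len("data: "):]
        if payload == "[DONE]":
            break
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            parts.append(choice.get("delta", {}).get("content") or "")
    return "".join(parts)

stream = [
    'data: {"id":"chatcmpl-1","object":"chat.completion.chunk",'
    '"choices":[{"delta":{"content":"Hello"},"index":0}]}',
    'data: {"id":"chatcmpl-1","object":"chat.completion.chunk",'
    '"choices":[{"delta":{"content":"!"},"index":0}]}',
    "data: [DONE]",
]
print(collect_stream(stream))  # → Hello!
```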

Error responses:

  • 400 Bad Request -- Invalid JSON body
  • 503 Service Unavailable -- LLM server is not running or unreachable
# Non-streaming
curl -X POST https://your-pod.runpod.io/api/admin/llm/chat \
  -H "X-API-Key: your-api-key" \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Hello!"}]}'

# Streaming
curl -N -X POST https://your-pod.runpod.io/api/admin/llm/chat \
  -H "X-API-Key: your-api-key" \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Hello!"}], "stream": true}'