LLM Assistant

The LLM management page is available at /admin/llm. Use it to download GGUF language models, control the llama-server process, and chat with a locally running LLM.

Layout

The page has three main sections:

Server control panel -- start/stop the LLM server, select model and configuration
Model catalog -- list of available GGUF models with download/delete
Chat area -- interactive chat interface (visible when the server is running)

Server Control Panel

A bordered panel at the top of the page containing the following elements.

Status Indicator

A colored dot with text showing the server state:

Green dot + "Running" + model name -- server is active
Grey dot + "Stopped" -- server is not running

Model Selector

A dropdown populated with only downloaded models (those with status: present). Users select which model to load before starting the server.

Configuration Fields

GPU Layers (n_gpu_layers) -- number of layers offloaded to GPU (default: 99, meaning all layers)
Context Size (ctx_size) -- context window in tokens (default: 8192)
Threads -- CPU threads for inference (default: 4)

Configuration is persisted to STUDIO_DIR/llm/config.json on start.
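As a rough illustration of how these fields could be persisted, the sketch below stores them as JSON under STUDIO_DIR/llm/config.json. The class name `LlmConfig`, the helper functions, and the JSON key names (in particular `threads`) are assumptions made for this example and are not taken from the application's code.

```python
import json
from dataclasses import asdict, dataclass, fields
from pathlib import Path


@dataclass
class LlmConfig:
    """Hypothetical container; defaults mirror the values documented above."""
    n_gpu_layers: int = 99  # 99 is documented as "all layers"
    ctx_size: int = 8192    # context window, in tokens
    threads: int = 4        # CPU threads for inference (key name assumed)


def config_path(studio_dir: str) -> Path:
    # Documented location: STUDIO_DIR/llm/config.json
    return Path(studio_dir) / "llm" / "config.json"


def save_config(cfg: LlmConfig, studio_dir: str) -> Path:
    """Write the config as JSON, creating the llm/ directory if needed."""
    path = config_path(studio_dir)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(asdict(cfg), indent=2))
    return path


def load_config(studio_dir: str) -> LlmConfig:
    """Read the config, falling back to defaults when no file exists."""
    path = config_path(studio_dir)
    if not path.exists():
        return LlmConfig()
    data = json.loads(path.read_text())
    known = {f.name for f in fields(LlmConfig)}
    # Ignore unknown keys so an older or newer file does not raise.
    return LlmConfig(**{k: v for k, v in data.items() if k in known})
```

Falling back to defaults when the file is missing keeps a first run working before anything has been saved.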

Server Actions

Start button -- calls POST /api/admin/llm/start with the selected model and config. The backend spawns a llama-server subprocess with the specified parameters.
Stop button -- calls POST /api/admin/llm/stop to terminate the subprocess.
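The start action can be pictured as assembling a llama-server command line from the selected model and the configuration fields. The sketch below is one plausible way to do that, not the backend's actual code: the function name `build_llama_server_cmd` is hypothetical, and the choice of flags is an assumption based on llama-server's standard command-line options. The binary path and the default host and port are the ones given under Architecture Notes.

```python
LLAMA_SERVER_BIN = "/opt/llama-server"  # location given in Architecture Notes


def build_llama_server_cmd(
    model_path,
    n_gpu_layers=99,
    ctx_size=8192,
    threads=4,
    host="127.0.0.1",
    port=8080,
):
    """Return an argv list suitable for subprocess.Popen (hypothetical helper)."""
    return [
        LLAMA_SERVER_BIN,
        "--model", str(model_path),
        "--n-gpu-layers", str(n_gpu_layers),
        "--ctx-size", str(ctx_size),
        "--threads", str(threads),
        "--host", host,  # loopback only, so the server is not exposed externally
        "--port", str(port),
    ]
```

Passing the arguments as a list rather than a shell string avoids quoting problems with model filenames that contain spaces.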

On page load, the frontend checks server status via GET /api/admin/llm/status and updates the indicator accordingly. If the server is already running, the panel shows which model is loaded and the server port.
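The mapping from a status response to the indicator can be sketched as a small pure function. The response shape is not specified in this document, so the keys `running`, `model`, and `port` below are hypothetical, as is the function name.

```python
def status_indicator(status: dict) -> dict:
    """Map a status payload (assumed keys) to the dot colour and label text."""
    if status.get("running"):
        parts = ["Running"]
        if status.get("model"):
            parts.append(str(status["model"]))
        if status.get("port"):
            parts.append(f"port {status['port']}")
        return {"color": "green", "text": " - ".join(parts)}
    # Anything else, including an empty payload, is shown as stopped.
    return {"color": "grey", "text": "Stopped"}
```

For example, `status_indicator({"running": True, "model": "example.gguf", "port": 8080})` yields a green indicator labelled with the model name and port, while an empty payload yields the grey "Stopped" state.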

Model Catalog

Below the server panel is a list of LLM model cards, loaded from:

GET /api/admin/llm/models

Each card shows:

Model name -- display name from the catalog
File size -- in GB
Quantization type -- e.g., Q8_0, Q4_K_M (shown as a tag)
Status badge -- present (green), missing (grey), downloading (yellow with animation)
Actions:
- Download -- POST /api/admin/llm/models/download/{filename} to queue the download
- Delete -- DELETE /api/admin/llm/models/{filename} with confirm dialog

Download progress is shown with a progress bar on the card. Downloads use the same shared download queue as model/LoRA downloads.

Models are stored in STUDIO_DIR/llm/models/.
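The three status values can be derived from the models directory and the set of in-flight downloads. The following is a minimal sketch under assumptions: the function names, the `filename` key on catalog entries, and the idea of passing active downloads as a set are all illustrative, not the application's actual data model.

```python
from pathlib import Path


def model_status(filename, models_dir, downloading=frozenset()):
    """Return 'downloading', 'present', or 'missing' for one catalog entry."""
    if filename in downloading:
        return "downloading"
    return "present" if (Path(models_dir) / filename).is_file() else "missing"


def selectable_models(catalog, models_dir, downloading=frozenset()):
    """Filenames eligible for the model selector: only fully present files."""
    return [
        entry["filename"]
        for entry in catalog
        if model_status(entry["filename"], models_dir, downloading) == "present"
    ]
```

Checking the download set before the filesystem means a partially written file is reported as downloading rather than present, which keeps it out of the model selector.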

Chat Area

The chat area is visible when the server is running and provides a simple chat interface:

Message history -- scrollable area showing user and assistant messages in alternating styles (user messages have a yellow-tinted background, assistant messages have a subtle border)
Input field -- a textarea for typing messages
Send button -- submits the message to the LLM

The chat sends POST /api/admin/llm/chat with the conversation history. The response streams back from the llama-server running on 127.0.0.1:LLAMA_SERVER_PORT. Messages are rendered in a monospace font with whitespace preserved.

Error responses (server not running, model not loaded) are displayed in red-tinted message bubbles.
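To make the streaming step more concrete, the sketch below shows how a streamed reply could be reassembled if the backend forwards the conversation to llama-server's OpenAI-compatible chat completions endpoint with streaming enabled. That forwarding path is an assumption; this document does not say which llama-server endpoint is used. The function names are hypothetical.

```python
import json
from typing import Iterable, Iterator


def build_chat_payload(history: list, stream: bool = True) -> dict:
    """history is a list of {"role": ..., "content": ...} messages."""
    return {"messages": history, "stream": stream}


def iter_stream_text(lines: Iterable[str]) -> Iterator[str]:
    """Yield text fragments from server-sent-event lines of a streamed reply."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and non-data fields
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(data)
        for choice in chunk.get("choices", []):
            text = choice.get("delta", {}).get("content")
            if text:
                yield text
```

Joining the yielded fragments in order gives the full assistant message, and appending each fragment as it arrives gives the incremental display.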

Architecture Notes

The llama-server binary is baked into the Docker image at /opt/llama-server, compiled with CUDA support for GPU inference. It runs as a subprocess managed by llm_server.py, listening on 127.0.0.1:8080 by default (not exposed externally).

Only one model can be loaded at a time. Starting the server with a new model automatically stops the previous instance.
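The one-model-at-a-time rule amounts to stopping any existing subprocess before spawning a new one. The class below is a minimal sketch of that pattern using the standard library; it is not the contents of llm_server.py, and the name `SingleServer` and its methods are hypothetical.

```python
import subprocess
from typing import Optional, Sequence


class SingleServer:
    """Keeps at most one child process alive at a time (hypothetical helper)."""

    def __init__(self) -> None:
        self._proc: Optional[subprocess.Popen] = None
        self.model: Optional[str] = None

    def is_running(self) -> bool:
        return self._proc is not None and self._proc.poll() is None

    def stop(self, timeout: float = 10.0) -> None:
        """Terminate the child, escalating to kill if it does not exit in time."""
        if self._proc is None:
            return
        if self._proc.poll() is None:
            self._proc.terminate()
            try:
                self._proc.wait(timeout=timeout)
            except subprocess.TimeoutExpired:
                self._proc.kill()
                self._proc.wait()
        self._proc = None
        self.model = None

    def start(self, model: str, cmd: Sequence[str]) -> None:
        """Replace any previous instance, then spawn the new command."""
        self.stop()
        self._proc = subprocess.Popen(list(cmd))
        self.model = model
```

Calling `start` twice in a row with different models leaves only the second process running, which matches the behaviour described above.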

| Endpoint | Method | Purpose |
|---|---|---|
| `/api/admin/llm/models` | GET | List LLM model catalog with status |
| `/api/admin/llm/models/download/{filename}` | POST | Queue model download |
| `/api/admin/llm/models/{filename}` | DELETE | Delete model from disk |
| `/api/admin/llm/status` | GET | Server running status |
| `/api/admin/llm/start` | POST | Start llama-server with model |
| `/api/admin/llm/stop` | POST | Stop llama-server |
| `/api/admin/llm/chat` | POST | Send chat message |
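
For scripting against the endpoints listed above, the methods and paths can be kept in one route table. The sketch below only builds request objects and does not send them. The helper name `llm_request`, the action keys, and the base URL in the usage note are assumptions, and request bodies are omitted because their schema is not described here.

```python
from urllib.parse import quote
from urllib.request import Request

# Method and path for each endpoint, as listed in the table above.
ROUTES = {
    "list_models": ("GET", "/api/admin/llm/models"),
    "download": ("POST", "/api/admin/llm/models/download/{filename}"),
    "delete": ("DELETE", "/api/admin/llm/models/{filename}"),
    "status": ("GET", "/api/admin/llm/status"),
    "start": ("POST", "/api/admin/llm/start"),
    "stop": ("POST", "/api/admin/llm/stop"),
    "chat": ("POST", "/api/admin/llm/chat"),
}


def llm_request(base_url: str, action: str, filename: str = "") -> Request:
    """Build (but do not send) a urllib Request for one endpoint."""
    method, template = ROUTES[action]
    if "{filename}" in template:
        if not filename:
            raise ValueError(f"action {action!r} requires a filename")
        # Percent-encode so spaces and slashes cannot alter the path.
        path = template.format(filename=quote(filename, safe=""))
    else:
        path = template
    return Request(base_url.rstrip("/") + path, method=method)
```

For instance, `llm_request("http://localhost:8000", "status")` produces a GET request for the status endpoint, where the host and port are placeholders for wherever the application is served.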