REST API Reference

By default, CSGHub-Lite listens on localhost port 11435 and exposes API endpoints compatible with both Ollama and OpenAI.

API Overview

| Method | Path | Description |
|--------|------|-------------|
| GET | /api/health | Service health check |
| GET | /api/tags | List all local models |
| GET | /api/ps | List currently running models |
| POST | /api/show | Show detailed information for a specific model |
| POST | /api/pull | Pull a model (supports streaming response) |
| POST | /api/stop | Stop and unload a model |
| DELETE | /api/delete | Remove local model files |
| POST | /api/generate | Text generation (supports streaming) |
| POST | /api/chat | Chat conversation generation (supports streaming) |
| POST | /v1/chat/completions | OpenAI-compatible chat endpoint |
| GET | /v1/models | OpenAI-compatible model list endpoint |
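
The pull endpoint streams progress as it downloads. Assuming the Ollama-style pull format (newline-delimited JSON chunks carrying `status`, `completed`, and `total` fields — an assumption based on the stated Ollama compatibility, not confirmed by this page), each chunk can be turned into a readable progress line:

```python
import json

def format_pull_progress(chunk_line):
    """Render one streamed /api/pull chunk as human-readable progress.
    Field names (status/completed/total) are assumed from Ollama's pull API."""
    chunk = json.loads(chunk_line)
    status = chunk.get("status", "")
    completed, total = chunk.get("completed"), chunk.get("total")
    if completed is not None and total:
        return f"{status}: {100 * completed / total:.1f}%"
    return status

# Illustrative chunks, not captured from a live server:
print(format_pull_progress('{"status": "downloading", "completed": 50, "total": 200}'))
# downloading: 25.0%
print(format_pull_progress('{"status": "success"}'))
# success
```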

Interface Examples

Chat API

```shell
curl http://localhost:11435/api/chat -d '{
  "model": "Qwen/Qwen3-0.6B-GGUF",
  "messages": [{"role": "user", "content": "Hello!"}]
}'
```
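
Without `"stream": false`, the chat endpoint streams its reply. Assuming the Ollama-style format (newline-delimited JSON, each chunk carrying a partial `message.content`, final chunk marked `"done": true` — an assumption based on the stated Ollama compatibility), the chunks can be assembled like this:

```python
import json

def assemble_chat_stream(lines):
    """Concatenate the partial message.content fields of a streamed
    /api/chat response (newline-delimited JSON, Ollama-style)."""
    parts = []
    for line in lines:
        if not line.strip():
            continue
        chunk = json.loads(line)
        parts.append(chunk.get("message", {}).get("content", ""))
        if chunk.get("done"):
            break
    return "".join(parts)

# Illustrative chunks, not captured from a live server:
sample = [
    '{"message": {"role": "assistant", "content": "Hel"}, "done": false}',
    '{"message": {"role": "assistant", "content": "lo!"}, "done": true}',
]
print(assemble_chat_stream(sample))
# Hello!
```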

Non-Streaming Text Generation (Generate API)

```shell
curl http://localhost:11435/api/generate -d '{
  "model": "Qwen/Qwen3-0.6B-GGUF",
  "prompt": "Write a line of poetry about programming",
  "stream": false
}'
```
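
The same request can be made from Python with only the standard library. A minimal sketch — the request fields mirror the curl example above, while the `response` field name in the reply is an assumption carried over from Ollama's generate API:

```python
import json
import urllib.request

API_BASE = "http://localhost:11435"  # default CSGHub-Lite address

def build_generate_payload(model, prompt, stream=False):
    """Request body for /api/generate, mirroring the curl example."""
    return {"model": model, "prompt": prompt, "stream": stream}

def generate(model, prompt):
    """Send a non-streaming generate request and return the generated text.
    Requires a running CSGHub-Lite instance."""
    body = json.dumps(build_generate_payload(model, prompt)).encode()
    req = urllib.request.Request(
        f"{API_BASE}/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["response"]  # "response" field assumed from Ollama
```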

Get Running Models

```shell
curl http://localhost:11435/api/ps
```

OpenAI-Compatible Call (Python)

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11435/v1", api_key="unused")

response = client.chat.completions.create(
    model="Qwen/Qwen3-0.6B-GGUF",
    messages=[{"role": "user", "content": "Hello, please introduce yourself."}]
)

print(response.choices[0].message.content)
```
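
Streaming also works through the OpenAI client by passing `stream=True` (a standard OpenAI API parameter; whether CSGHub-Lite honors it on /v1/chat/completions is an assumption, not confirmed by this page). A sketch of collecting the incremental chunks:

```python
def collect_stream(chunks):
    """Join the incremental delta.content fields of a streamed
    chat.completions response into the full reply."""
    parts = []
    for chunk in chunks:
        delta = chunk.choices[0].delta
        if delta.content:  # the final chunk's content may be None
            parts.append(delta.content)
    return "".join(parts)

# With a live server:
#   stream = client.chat.completions.create(
#       model="Qwen/Qwen3-0.6B-GGUF",
#       messages=[{"role": "user", "content": "Hello!"}],
#       stream=True,
#   )
#   print(collect_stream(stream))
```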