Runner Deployment Guide

📘 Overview

CSGHub Runner is the core component of the CSGHub platform responsible for executing model training, inference, and job scheduling workloads.

The Runner communicates with the CSGHub Server and dynamically creates and destroys user workloads within the Kubernetes cluster.

This Helm Chart provides a standardized way to deploy the Runner, supporting flexible configuration, external resource integration, and automated resource management.

⚙️ System Requirements

Item	Description
Kubernetes Version	v1.28+
Helm Version	v3.12+
Network Requirement	Nodes must be able to reach the CSGHub Server and, if required, external image registries
Permissions	cluster-admin or equivalent privileges to create namespaces and RBAC resources
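
The version requirements above can be checked before installing. The following is a minimal sketch, assuming `sort -V` (version sort) is available locally; the helper name `version_ge` is illustrative and not part of Helm or CSGHub.

```shell
#!/bin/sh
# version_ge A B: succeeds when version A >= version B (a leading "v" is ignored).
# Relies on `sort -V`, which is an assumption about the local coreutils.
version_ge() {
  a="${1#v}"; b="${2#v}"
  [ "$(printf '%s\n%s\n' "$b" "$a" | sort -V | head -n 1)" = "$b" ]
}

# Compare the installed Helm client against the documented minimum (v3.12).
if command -v helm >/dev/null 2>&1; then
  # `helm version --short` prints something like v3.12.3+g<commit>; drop the suffix.
  helm_version="$(helm version --short 2>/dev/null | cut -d+ -f1)"
  if version_ge "$helm_version" "v3.12"; then
    echo "Helm $helm_version: OK"
  else
    echo "Helm $helm_version: needs v3.12+"
  fi
else
  echo "helm not found on PATH"
fi
```

The same helper can be applied to the server version reported by `kubectl version` to check the v1.28+ requirement.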

📦 Installation Steps

1️⃣ Add the Helm Repository

helm repo add csghub https://charts.opencsg.com/csghub
helm repo update

2️⃣ Create a Namespace (Optional)

kubectl create namespace csghub

3️⃣ Deploy the Runner

You’ll need the following values, most of which come from the main CSGHub deployment:

domain

The base domain used to expose the Runner service. The Runner is published on a subdomain of it: if your domain is example.com, the Runner is exposed at runner.example.com by default.

externalUrl
```
helm get notes csghub -n csghub | grep -A 6 'Access your CSGHub'
```
Use this command to get the CSGHub external access URL.

hubAPIToken

Read the API token from the csghub-core ConfigMap:

kubectl get cm csghub-core -o yaml -n csghub | grep 'API_TOKEN' | awk '{print $NF}'
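
To reuse the token in later commands, the pipeline above can be wrapped in a small helper and captured in a variable. This is a sketch only: the function name `extract_api_token`, the variable `HUB_API_TOKEN`, and the sample key and value are hypothetical, and stripping double quotes is an added assumption about how the value may be rendered in YAML.

```shell
#!/bin/sh
# extract_api_token: reads ConfigMap YAML on stdin and prints the last field of
# the first line mentioning API_TOKEN, with any double quotes removed.
extract_api_token() {
  grep 'API_TOKEN' | head -n 1 | awk '{print $NF}' | tr -d '"'
}

# Demonstration on made-up input; the key name and value are illustrative only.
sample_output='data:
  EXAMPLE_API_TOKEN: "0123abcd"'
printf '%s\n' "$sample_output" | extract_api_token

# Against a live cluster (requires kubectl and access to the csghub namespace):
if command -v kubectl >/dev/null 2>&1; then
  HUB_API_TOKEN="$(kubectl get cm csghub-core -o yaml -n csghub 2>/dev/null | extract_api_token)"
fi
```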

region

A custom label to identify the cluster region (e.g., cn-north).

registry
```
helm get notes csghub -n csghub | grep -A 8 'Distribution Registry'
```
This returns the registry domain, username, and password. The insecure flag depends on whether HTTPS is enabled.

objectStore
```
helm get notes csghub -n csghub | grep -A 8 'Minio Console'
```
This provides the endpoint, accessKey, and secretKey.
- bucket, region, and pathStyle are fixed values.
- secure depends on whether HTTPS is enabled.
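
Because `registry.insecure` and `objectStore.secure` both follow from whether the endpoint is served over HTTPS, they can be derived from the URL scheme. A minimal sketch; the function name `uses_https` and the example URLs are hypothetical placeholders.

```shell
#!/bin/sh
# uses_https URL: prints "true" for https:// URLs and "false" otherwise.
uses_https() {
  case "$1" in
    https://*) echo "true" ;;
    *)         echo "false" ;;
  esac
}

# Placeholder endpoints, not a real deployment.
object_store_url="https://minio.example.com"
registry_url="http://registry.example.com"

# objectStore.secure is true when the object store is served over HTTPS.
object_store_secure="$(uses_https "$object_store_url")"

# registry.insecure is the inverse: true when the registry is NOT served over HTTPS.
if [ "$(uses_https "$registry_url")" = "true" ]; then
  registry_insecure="false"
else
  registry_insecure="true"
fi

echo "objectStore.secure=$object_store_secure"
echo "registry.insecure=$registry_insecure"
```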

Deploy

💡 Tip:

The object store and image registry can be replaced with external infrastructure.

Deployment in mainland China:

--set global.image.registry="opencsg-registry.cn-beijing.cr.aliyuncs.com"

--set global.imageRegistry="opencsg-registry.cn-beijing.cr.aliyuncs.com/opencsghq"

💡 Tip: For long-term management, it’s recommended to save custom configurations in a custom-values.yaml file.

helm install runner csghub/runner \
  --namespace csghub \
  --create-namespace \
  --set global.gateway.external.domain="example.com" \
  --set externalUrl="<csghub external_url>" \
  --set hubAPIToken="<csghub hub_api_token>" \
  --set region="<region name>" \
  --set registry.registry="<csghub registry>" \
  --set registry.repository="csghub" \
  --set registry.username="<csghub registry username>" \
  --set registry.password="<csghub registry password>" \
  --set registry.insecure="<if registry is insecure>" \
  --set objectStore.endpoint="<csghub minio>" \
  --set objectStore.accessKey="<csghub minio username>" \
  --set objectStore.secretKey="<csghub minio password>" \
  --set objectStore.bucket="csghub-registry" \
  --set objectStore.region="cn-north-1" \
  --set objectStore.secure="false" \
  --set objectStore.pathStyle="true"
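
The same settings can be kept in a custom-values.yaml file instead of repeated --set flags. The sketch below mirrors the flags of the install command one to one; every `<...>` entry is a placeholder to fill in, and writing the file to the current directory is an assumption.

```shell
#!/bin/sh
# Write the --set flags from the install command as a values file.
cat > custom-values.yaml <<'EOF'
global:
  gateway:
    external:
      domain: "example.com"
externalUrl: "<csghub external_url>"
hubAPIToken: "<csghub hub_api_token>"
region: "<region name>"
registry:
  registry: "<csghub registry>"
  repository: "csghub"
  username: "<csghub registry username>"
  password: "<csghub registry password>"
  insecure: false   # set to true if the registry is not served over HTTPS
objectStore:
  endpoint: "<csghub minio>"
  accessKey: "<csghub minio username>"
  secretKey: "<csghub minio password>"
  bucket: "csghub-registry"
  region: "cn-north-1"
  secure: false
  pathStyle: true
EOF

# Print the install command rather than running it, so the file can be reviewed first.
echo "helm install runner csghub/runner --namespace csghub --create-namespace -f custom-values.yaml"
```

Keeping credentials in a file also keeps them out of shell history; restrict the file's permissions accordingly.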

🧾 Configuration Reference (values.yaml)

Global Configuration

Parameter	Default	Description
global.gateway.external.domain	example.com	Base domain for the platform
global.gateway.tls.enabled	false	Whether to enable TLS
global.image.tag	-	Image version tag

Runner Configuration

Parameter	Default	Description
name	runner	Resource name prefix (used for domain exposure)
region	region-0	Runner region identifier
interval	60	Communication interval with the Server (in seconds)
namespace	spaces	Default namespace for user workloads
autoConfigure	true	Auto-install knative, argo, and lws components
kymlMode	create	Cluster resource management mode (create/update/replace)
mergingNamespace	disable	Namespace merging mode (multi/single/disable)
usePublicDomain	true	Use public domain for access (false may restrict functionality)

Package & Image Management

Parameter	Default	Description
pipIndexUrl	https://pypi.tuna.tsinghua.edu.cn/simple/	Custom PyPI mirror
extraBuildArgs	[]	Additional Kaniko args
modelRegistry	OpenCSG ACR	Model image registry URL

GPU Configuration

Parameter	Default	Description
gpuModelLabel.typeLabel	nvidia.com/gpu.product	GPU model label key
gpuModelLabel.capacityLabel	nvidia.com/gpu	GPU capacity label
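
To see which nodes carry these keys, the cluster can be queried with kubectl. A minimal sketch using the default keys from the table above; the function name `list_gpu_nodes` is hypothetical, and the second query assumes `nvidia.com/gpu` is reported under each node's allocatable resources.

```shell
#!/bin/sh
# Keys taken from the gpuModelLabel defaults.
GPU_TYPE_LABEL="nvidia.com/gpu.product"

# list_gpu_nodes: print nodes with their GPU model label, then allocatable GPU counts.
list_gpu_nodes() {
  # -L adds the label's value as an extra column; empty means the label is missing.
  kubectl get nodes -L "$GPU_TYPE_LABEL" || return 1
  # Dots inside the resource name are escaped for the custom-columns path.
  kubectl get nodes \
    -o custom-columns='NODE:.metadata.name,GPUS:.status.allocatable.nvidia\.com/gpu'
}

if command -v kubectl >/dev/null 2>&1; then
  list_gpu_nodes || echo "could not query the cluster"
else
  echo "kubectl not found on PATH"
fi
```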

Knative Serving Configuration

💡 Note: These parameters are deprecated since v1.12.0 and retained for backward compatibility.

Parameter	Default	Description
knative.serving.domain	"example.com"	Knative service domain suffix
knative.serving.services	[]	Legacy configuration

RBAC Configuration

Parameter	Default	Description
rbac.create	true	Whether to create ServiceAccount & Roles
rbac.serviceAccountName	runner-admin	ServiceAccount name (currently fixed)

Logging & Monitoring

Parameter	Default	Description
logging.level	info	Log level (info/debug/error)
logcollector.enabled	false	Enable log collector
logcollector.loki.address	""	Loki service address
tempo.address	""	Tempo tracing endpoint

The Loki service is not exposed by default. To use it, enable it in the main CSGHub chart with loki.gateway.enabled=true.

Tempo tracing is currently internal only; external exposure is planned.

External Resources

🔹 Image Registry

registry:
  registry: "registry.example.com"
  repository: "csghub"
  username: "user"
  password: "pass"
  insecure: false

🔹 Object Storage

objectStore:
  endpoint: "https://minio.example.com"
  accessKey: "admin"
  secretKey: "password"
  bucket: "csghub-registry"
  region: "us-east-1"
  secure: true
  pathStyle: true

Resource & Scheduling

Parameter	Default	Description
resources		Pod resource requests/limits config
nodeSelector		Node selector
tolerations	[]	Tolerations
affinity		Affinity rules

🔍 Verify Deployment

Check the status:

kubectl get pods -n csghub
kubectl get svc -n csghub

View Runner logs:

kubectl logs -f deploy/runner-runner -n csghub
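
For scripted installs, it can help to block until the Runner has finished rolling out. This is a sketch under assumptions: the Deployment name is taken from the log command above, and the function name `wait_for_runner` and the 120s default timeout are illustrative.

```shell
#!/bin/sh
NAMESPACE="csghub"
DEPLOYMENT="deploy/runner-runner"   # same name as in the log command

# wait_for_runner [timeout]: block until the Deployment has rolled out, or time out.
wait_for_runner() {
  kubectl rollout status "$DEPLOYMENT" -n "$NAMESPACE" --timeout="${1:-120s}"
}

if command -v kubectl >/dev/null 2>&1; then
  if wait_for_runner 120s; then
    echo "Runner is ready"
  else
    # Show the most recent events to help explain the failure.
    echo "Runner did not become ready; recent events:"
    kubectl get events -n "$NAMESPACE" --sort-by=.lastTimestamp | tail -n 20
  fi
else
  echo "kubectl not found on PATH"
fi
```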

🔄 Upgrade & Uninstall

Upgrade Chart

helm upgrade runner csghub/runner -n csghub -f custom-values.yaml

Uninstall Chart

helm uninstall runner -n csghub

🧠 Troubleshooting

Issue	Solution
Runner cannot reach Server	Verify `externalUrl` and `hubAPIToken` are configured correctly
Knative not auto-installed	Ensure `autoConfigure: true` and proper cluster permissions
GPU job not scheduled	Check node GPU labels and drivers
Image pull failed	Verify registry credentials and `image.pullSecrets` settings
