Runner Deployment Guide

📘 Overview

CSGHub Runner is the core component of the CSGHub platform responsible for executing model training, inference, and job scheduling workloads.

Through the Runner, the system communicates with the CSGHub Server, dynamically creating and destroying user workloads within the Kubernetes cluster.

This Helm Chart provides a standardized way to deploy the Runner, supporting flexible configuration, external resource integration, and automated resource management.


⚙️ System Requirements

| Item | Description |
| --- | --- |
| Kubernetes Version | v1.28+ |
| Helm Version | v3.12+ |
| Network Requirement | Nodes must access the CSGHub Server and external image registries (if required) |
| Permissions | cluster-admin or equivalent privileges to create namespaces and RBAC resources |
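
The version requirements above can be checked before installing. The following is a minimal sketch, not part of the chart: the helper names version_ge and check_min_version are hypothetical, and the comparison assumes a sort that supports -V (version sort), as in GNU coreutils.

```shell
#!/bin/sh
# Hypothetical helper (not part of the chart): succeeds when version $1 >= version $2.
# Assumes `sort -V` (version sort) is available, as in GNU coreutils.
version_ge() {
  _have="${1#v}"   # strip a leading "v", e.g. v1.28.3 -> 1.28.3
  _want="${2#v}"
  _lowest="$(printf '%s\n%s\n' "$_have" "$_want" | sort -V | head -n 1)"
  [ "$_lowest" = "$_want" ]
}

# Hypothetical helper: check_min_version TOOL FOUND_VERSION MINIMUM_VERSION
# Prints a one-line verdict and returns non-zero when the version is too old.
check_min_version() {
  if version_ge "$2" "$3"; then
    echo "$1 $2: OK (>= $3)"
  else
    echo "$1 $2: too old (need >= $3)"
    return 1
  fi
}
```

For example, check_min_version helm "$(helm version --template '{{.Version}}')" 3.12 compares the local Helm client against the minimum in the table, and kubectl auth can-i create namespaces reports whether the current user is allowed to create namespaces.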

📦 Installation Steps

1️⃣ Add the Helm Repository

helm repo add csghub https://charts.opencsg.com/repository/csghub
helm repo update

2️⃣ Create a Namespace (Optional)

kubectl create namespace csghub

3️⃣ Deploy the Runner

You’ll need the following information, most of which comes from the CSGHub main service:

  • domain

    Provide a subdomain for exposing the Runner service.

    If your domain is example.com, the Runner will be exposed at runner.example.com by default.

  • externalUrl

    helm get notes csghub -n csghub | grep -A 6 'Access your CSGHub'

    Use this command to get the CSGHub external access URL.

  • hubAPIToken

    kubectl get cm csghub-core -o yaml -n csghub | grep 'API_TOKEN' | awk '{print $NF}'

  • region

    A custom label to identify the cluster region (e.g., cn-north).

  • registry

    helm get notes csghub -n csghub | grep -A 8 'Distribution Registry'

    This provides the registry domain, username, and password. The insecure flag depends on whether HTTPS is enabled.

  • objectStore

    helm get notes csghub -n csghub | grep -A 8 'Minio Console'

    This provides the endpoint, accessKey, and secretKey.

    • bucket, region, and pathStyle are fixed values.
    • secure depends on whether HTTPS is enabled.
  • Deploy

    💡 Tip: The object store and image registry can be replaced with external infrastructure.

    helm install runner csghub/runner \
    --namespace csghub \
    --create-namespace \
    --set global.ingress.domain="example.com" \
    --set externalUrl="<csghub external_url>" \
    --set hubAPIToken="<csghub hub_api_token>" \
    --set region="<region name>" \
    --set registry.registry="<csghub registry>" \
    --set registry.repository="csghub" \
    --set registry.username="<csghub registry username>" \
    --set registry.password="<csghub registry password>" \
    --set registry.insecure="<if registry is insecure>" \
    --set objectStore.endpoint="<csghub minio>" \
    --set objectStore.accessKey="<csghub minio username>" \
    --set objectStore.secretKey="<csghub minio password>" \
    --set objectStore.bucket="csghub-registry" \
    --set objectStore.region="cn-north-1" \
    --set objectStore.secure="false" \
    --set objectStore.pathStyle="true"
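
The lookup commands above can feed shell variables, so that values are not copied by hand into the install command. This is a minimal sketch under stated assumptions: the function names extract_api_token and require_value are hypothetical, and the filter simply mirrors the grep and awk pipeline shown for hubAPIToken.

```shell
#!/bin/sh
# Hypothetical helper: read a ConfigMap YAML dump on stdin and print the last
# field of each line containing API_TOKEN (same filter as the hubAPIToken step).
extract_api_token() {
  grep 'API_TOKEN' | awk '{print $NF}'
}

# Hypothetical helper: require_value NAME VALUE
# Fails when VALUE is empty or still looks like a <placeholder>.
require_value() {
  case "$2" in
    ''|'<'*'>') echo "missing value for $1" >&2; return 1 ;;
    *) return 0 ;;
  esac
}

# Intended use against a live cluster (left commented out in this sketch):
# HUB_API_TOKEN="$(kubectl get cm csghub-core -o yaml -n csghub | extract_api_token)"
# require_value hubAPIToken "$HUB_API_TOKEN" || exit 1
```

Checking each value before running helm install catches an unreplaced placeholder early, instead of surfacing later as a Runner that cannot reach the Server.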

💡 Tip: For long-term management, it’s recommended to save custom configurations in a custom-values.yaml file.
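
As a sketch of that approach, the --set flags from the install command map onto a values file roughly as follows. The key layout is inferred from the flag paths above, every value is a placeholder to replace, and the function name write_custom_values is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: write a values file mirroring the --set flags above.
# All values are placeholders; replace them before installing.
write_custom_values() {
  cat > "${1:-custom-values.yaml}" <<'EOF'
global:
  ingress:
    domain: "example.com"
externalUrl: "<csghub external_url>"
hubAPIToken: "<csghub hub_api_token>"
region: "<region name>"
registry:
  registry: "<csghub registry>"
  repository: "csghub"
  username: "<csghub registry username>"
  password: "<csghub registry password>"
  insecure: false        # depends on whether HTTPS is enabled
objectStore:
  endpoint: "<csghub minio>"
  accessKey: "<csghub minio username>"
  secretKey: "<csghub minio password>"
  bucket: "csghub-registry"
  region: "cn-north-1"
  secure: false          # depends on whether HTTPS is enabled
  pathStyle: true
EOF
}

write_custom_values custom-values.yaml

# Then install from the file instead of passing individual --set flags:
# helm install runner csghub/runner -n csghub --create-namespace -f custom-values.yaml
```

Keeping the file under version control (with secrets handled separately) makes later upgrades repeatable, since the same file can be passed to helm upgrade with -f.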


🧾 Configuration Reference (values.yaml)

Global Configuration

| Parameter | Default | Description |
| --- | --- | --- |
| global.ingress.domain | example.com | Base domain for the platform |
| global.ingress.tls.enabled | false | Enable TLS or not |
| global.image.tag | - | Image version tag |

Runner Configuration

| Parameter | Default | Description |
| --- | --- | --- |
| name | runner | Resource name prefix (used for domain exposure) |
| region | region-0 | Runner region identifier |
| interval | 60 | Communication interval with the Server (in seconds) |
| namespace | spaces | Default namespace for user workloads |
| autoConfigure | true | Auto-install knative, argo, and lws components |
| kymlMode | update | Cluster resource management mode (create/update/replace) |
| mergingNamespace | disable | Namespace merging mode (multi/single/disable) |
| usePublicDomain | true | Use public domain for access (false may restrict functionality) |
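
Two of these settings accept only a fixed set of values. A minimal sketch of a pre-install check follows; the helper name validate_choice is hypothetical, and the allowed values are taken from the table above.

```shell
#!/bin/sh
# Hypothetical helper: validate_choice NAME VALUE ALLOWED...
# Succeeds only when VALUE equals one of the ALLOWED words.
validate_choice() {
  name="$1"; value="$2"; shift 2
  for allowed in "$@"; do
    [ "$value" = "$allowed" ] && return 0
  done
  echo "invalid $name: '$value' (allowed: $*)" >&2
  return 1
}

# The defaults from the table pass the check.
validate_choice kymlMode update create update replace
validate_choice mergingNamespace disable multi single disable
```

A check like this is only a convenience for scripted installs; the chart itself remains the authority on which values it accepts.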

Package & Image Management

| Parameter | Default | Description |
| --- | --- | --- |
| pipIndexUrl | https://pypi.tuna.tsinghua.edu.cn/simple/ | Custom PyPI mirror |
| extraBuildArgs | [] | Additional Kaniko args |
| modelRegistry | OpenCSG ACR | Model image registry URL |

GPU Configuration

| Parameter | Default | Description |
| --- | --- | --- |
| gpuModelLabel.typeLabel | nvidia.com/gpu.product | GPU model label key |
| gpuModelLabel.capacityLabel | nvidia.com/gpu | GPU capacity label |
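
To see which values these keys carry on a given cluster, the nodes can be listed with kubectl. This is a minimal sketch: the function names are hypothetical, and the keys are the defaults from the table, which assume the NVIDIA labels and GPU resource are already present on the nodes.

```shell
#!/bin/sh
# Hypothetical helper: list nodes with the GPU model label as an extra column.
# -L (--label-columns) prints the value of the given label for each node.
show_gpu_models() {
  kubectl get nodes -L "${1:-nvidia.com/gpu.product}"
}

# Hypothetical helper: list each node's reported capacity for the default
# nvidia.com/gpu key. Dots inside the key are escaped for custom-columns.
show_gpu_capacity() {
  kubectl get nodes \
    -o 'custom-columns=NODE:.metadata.name,GPUS:.status.capacity.nvidia\.com/gpu'
}
```

An empty model column or a capacity of <none> on a GPU node suggests the labels or device plugin are missing, which is one reason a GPU job may stay unscheduled.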

Knative Serving Configuration

💡 Note: These parameters have been deprecated since v1.12.0 and are retained only for backward compatibility.

| Parameter | Default | Description |
| --- | --- | --- |
| knative.serving.domain | "example.com" | Knative service domain suffix |
| knative.serving.services | [] | Legacy configuration |

RBAC Configuration

| Parameter | Default | Description |
| --- | --- | --- |
| rbac.create | true | Whether to create ServiceAccount & Roles |
| rbac.serviceAccountName | runner-admin | ServiceAccount name (currently fixed) |

Logging & Monitoring

| Parameter | Default | Description |
| --- | --- | --- |
| logging.level | info | Log level (info/debug/error) |
| logcollector.enabled | false | Enable log collector |
| logcollector.loki.address | "" | Loki service address |
| tempo.address | "" | Tempo tracing endpoint |

  • Loki service is not exposed by default.

    To use it, enable it in the main CSGHub chart with loki.ingress.enabled=true.

  • Tempo tracing is currently internal only; external exposure is planned.


External Resources

🔹 Image Registry

registry:
  registry: "registry.example.com"
  repository: "csghub"
  username: "user"
  password: "pass"
  insecure: false

🔹 Object Storage

objectStore:
  endpoint: "https://minio.example.com"
  accessKey: "admin"
  secretKey: "password"
  bucket: "csghub-registry"
  region: "us-east-1"
  secure: true
  pathStyle: true

Resource & Scheduling

| Parameter | Default | Description |
| --- | --- | --- |
| resources | | Pod resource requests/limits config |
| nodeSelector | | Node selector |
| tolerations | [] | Tolerations |
| affinity | | Affinity rules |

🔍 Verify Deployment

Check the status:

kubectl get pods -n csghub
kubectl get svc -n csghub

View Runner logs:

kubectl logs -f deploy/runner-runner -n csghub
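
For scripted installs, waiting on the rollout is often more convenient than polling the pod list. A minimal sketch, assuming the Deployment is named runner-runner as in the log command above; the function name wait_for_runner and the 120s default timeout are illustrative choices, not chart settings.

```shell
#!/bin/sh
# Hypothetical helper: wait_for_runner [NAMESPACE] [TIMEOUT]
# Blocks until the Runner Deployment finishes rolling out, or fails once the
# timeout expires (default 120s, an arbitrary choice for this sketch).
wait_for_runner() {
  kubectl rollout status deploy/runner-runner \
    -n "${1:-csghub}" --timeout="${2:-120s}"
}

# Example: wait_for_runner csghub 300s
```

The command exits non-zero on timeout, so it can gate later steps in a deployment script.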

🔄 Upgrade & Uninstall

Upgrade Chart

helm upgrade runner csghub/runner -n csghub -f custom-values.yaml

Uninstall Chart

helm uninstall runner -n csghub

🧠 Troubleshooting

| Issue | Solution |
| --- | --- |
| Runner cannot reach Server | Verify externalUrl and hubAPIToken are configured correctly |
| Knative not auto-installed | Ensure autoConfigure: true and proper cluster permissions |
| GPU job not scheduled | Check node GPU labels and drivers |
| Image pull failed | Verify registry credentials and image.pullSecrets settings |
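
For the first issue, a quick reachability test of the configured externalUrl can rule out basic network problems. A minimal sketch, assuming curl is available where the check runs; the function name check_url and the 10 second default are hypothetical choices.

```shell
#!/bin/sh
# Hypothetical helper: check_url URL [MAX_SECONDS]
# Succeeds when the URL answers with a non-error HTTP status.
# -f fails on HTTP errors (status >= 400), -sS stays quiet but still prints
# errors, --max-time bounds the whole request in seconds.
check_url() {
  curl -fsS --max-time "${2:-10}" -o /dev/null "$1"
}

# Example, with a placeholder URL:
# check_url "<csghub external_url>" && echo "externalUrl is reachable"
```

The result is most meaningful when the check runs from a cluster node or a pod, since that is where the Runner's traffic originates.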