Runner
1. Overview
The Runner is the core component of the CSGHub platform. It executes computational workloads such as model training and inference, and handles task scheduling.
The Runner communicates with the control plane (CSGHub Server) and dynamically creates and destroys user workloads in the Kubernetes cluster.
This chart provides a standardized Helm-based deployment that supports flexible configuration, integration with external dependencies, and automated resource management.
2. Environment Requirements
| Item | Requirement |
|---|---|
| Kubernetes Version | v1.33+ |
| Helm Version | v3.12+ |
| Network Requirements | Cluster nodes must be able to access the CSGHub Server and external image registries (if internal registries are disabled). |
| Permissions | Requires cluster-admin, or permission to create namespaces and RBAC resources (these are created automatically during deployment). |
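The version requirements above can be checked before installing. The following is a minimal sketch and not part of the chart: `version_ge` is a hypothetical helper name, and it assumes a `sort` that supports `-V` (version sort, as in GNU coreutils). The Helm check only runs when `helm` is on the PATH.

```shell
# Hypothetical preflight helper (not shipped with the chart).
# version_ge A B succeeds when version A is at least version B.
# A leading "v" is ignored; ordering relies on `sort -V`.
version_ge() {
  have="${1#v}"
  want="${2#v}"
  # If the required version sorts first (or ties), the installed one is new enough.
  [ "$(printf '%s\n%s\n' "$want" "$have" | sort -V | head -n 1)" = "$want" ]
}

# Check the local Helm client against the v3.12+ requirement, if helm is installed.
if command -v helm >/dev/null 2>&1; then
  helm_version="$(helm version --template '{{.Version}}')"
  if version_ge "$helm_version" "3.12"; then
    echo "helm $helm_version: OK"
  else
    echo "helm $helm_version: older than v3.12" >&2
  fi
fi
```

The same helper can be given the cluster's server version (for example the `gitVersion` reported by `kubectl version`) to compare against 1.33.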
3. Deployment Steps
Add Helm Repository
helm repo add csghub https://charts.opencsg.com/csghub
helm repo update
Create Namespace (Optional)
kubectl create namespace csghub
Deploy Runner
Obtain the following information from the CSGHub main service:
- domain: A second-level domain used to expose the Runner service. If the domain is example.com, the service is exposed at runner.example.com.
- externalUrl: The URL of CSGHub, obtained with:
  helm get notes csghub -n csghub | grep -A 6 'Access your CSGHub'
- hubAPIToken:
  kubectl get cm csghub-core -o yaml -n csghub | grep 'API_TOKEN' | awk '{print $NF}'
- region: A custom parameter used to identify the cluster region (e.g., cn-north).
- registry.*:
  helm get notes csghub -n csghub | grep -A 8 'Distribution Registry'
  The output provides the <domain>, username, and password. Set insecure based on whether the externalUrl uses HTTPS.
- objectStore:
  helm get notes csghub -n csghub | grep -A 8 'Minio Console'
  The output provides endpoint, accessKey, and secretKey. bucket, region, and pathStyle are fixed values. Set secure based on whether the externalUrl uses HTTPS.
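The derived values in the list above can be computed mechanically. The helper names below are hypothetical, `csghub.example.com` is a placeholder host, and the mapping from URL scheme to flags is an assumption read from the list: an `https://` externalUrl is taken to mean `registry.insecure=false` and `objectStore.secure=true`.

```shell
# Hypothetical helpers; names are illustrative and not part of the chart.

# runner_host DOMAIN prints the host the Runner is exposed under.
runner_host() {
  printf 'runner.%s\n' "$1"
}

# uses_https URL succeeds when the URL starts with https://
uses_https() {
  case "$1" in
    https://*) return 0 ;;
    *) return 1 ;;
  esac
}

# Assumed mapping: HTTPS -> registry.insecure=false, objectStore.secure=true.
registry_insecure() {
  if uses_https "$1"; then echo "false"; else echo "true"; fi
}
object_store_secure() {
  if uses_https "$1"; then echo "true"; else echo "false"; fi
}

runner_host "example.com"                          # prints runner.example.com
registry_insecure "https://csghub.example.com"     # prints false
object_store_secure "https://csghub.example.com"   # prints true
```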
Execute Deployment
💡 Tip: Object storage and container registries can be directly integrated with external infrastructure.
For deployment in China, add:
  --set global.image.registry="opencsg-registry.cn-beijing.cr.aliyuncs.com"
  --set global.imageRegistry="opencsg-registry.cn-beijing.cr.aliyuncs.com/opencsghq"
Note: It is recommended to write custom configurations into a custom-values.yaml file for easier upgrades and version management.
helm install runner csghub/runner \
--namespace csghub \
--create-namespace \
--set global.gateway.external.domain="example.com" \
--set externalUrl="<csghub external_url>" \
--set hubAPIToken="<csghub hub_api_token>" \
--set region="<region name>" \
--set registry.registry="<csghub registry>" \
--set registry.repository="csghub" \
--set registry.username="<csghub registry username>" \
--set registry.password="<csghub registry password>" \
--set-string registry.insecure="<true if csghub registry uses plain HTTP, otherwise false>" \
--set objectStore.endpoint="<csghub minio>" \
--set objectStore.accessKey="<csghub minio username>" \
--set objectStore.secretKey="<csghub minio password>" \
--set objectStore.bucket="csghub-registry" \
--set objectStore.region="cn-north-1" \
--set-string objectStore.secure="false" \
--set-string objectStore.pathStyle="true"
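In line with the note about custom-values.yaml, the `--set` flags above can be kept in a values file instead. This sketch writes such a file with a shell heredoc. The keys mirror the flags in the command, every angle-bracket value is a placeholder to replace, and the three settings passed with `--set-string` are quoted so they remain strings (an assumption about how the chart expects them).

```shell
# Write the settings from the install command into a values file.
# All angle-bracket values are placeholders; replace them with the data gathered earlier.
cat > custom-values.yaml <<'EOF'
global:
  gateway:
    external:
      domain: "example.com"
externalUrl: "<csghub external_url>"
hubAPIToken: "<csghub hub_api_token>"
region: "<region name>"
registry:
  registry: "<csghub registry>"
  repository: "csghub"
  username: "<csghub registry username>"
  password: "<csghub registry password>"
  insecure: "<true or false>"   # quoted string, as with --set-string
objectStore:
  endpoint: "<csghub minio>"
  accessKey: "<csghub minio username>"
  secretKey: "<csghub minio password>"
  bucket: "csghub-registry"
  region: "cn-north-1"
  secure: "false"
  pathStyle: "true"
EOF

# Then install with the file instead of individual flags:
#   helm install runner csghub/runner --namespace csghub --create-namespace -f custom-values.yaml
```

Keeping the file under version control makes later `helm upgrade ... -f custom-values.yaml` runs reproducible.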
4. Configuration Details
4.1 Global Configuration
| Parameter | Default Value | Description |
|---|---|---|
| global.gateway.external.domain | example.com | Base access domain for the platform |
| global.gateway.tls.enabled | false | Whether to enable TLS |
| global.image.tag | - | Image version tag |
4.2 Service Configuration
| Parameter | Default Value | Description |
|---|---|---|
| name | runner | Name used to identify Runner resources (also used in the exposed domain) |
| region | region-0 | Regional identifier for the Runner |
| interval | 60 | Communication interval between Runner and Server (seconds) |
| namespace | spaces | Default namespace for user workloads |
| autoConfigure (Deprecated) | true | Whether to automatically install dependencies such as Knative, Argo, and LWS |
| kymlMode (Deprecated) | create | Initialization mode for cluster resources (create/update/replace) |
| mergingNamespace | disable | Namespace merging mode (multi/single/disable) |
| usePublicDomain | true | Whether to use public domains for app access (setting this to false may limit features) |
4.3 Package and Image Management
| Parameter | Default Value | Description |
|---|---|---|
| pipIndexUrl | https://pypi.tuna.tsinghua.edu.cn/simple/ | Custom pip source |
| extraBuildArgs | [] | Additional Kaniko build arguments |
| model.registry | opencsg-registry.cn-beijing.cr.aliyuncs.com | Model image registry address |
4.4 GPU Configuration
| Parameter | Default Value | Description |
|---|---|---|
| gpuModelLabel.typeLabel | nvidia.com/gpu.product | GPU model label |
| gpuModelLabel.capacityLabel | nvidia.com/gpu | GPU capacity label |
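To see which values these settings would match on a given cluster, the node labels can be listed with kubectl. A small sketch follows. `gpu_nodes` is a hypothetical wrapper, and it assumes the default label key from the table above (nvidia.com/gpu.product) is what GPU nodes carry. The `-L` flag adds one column per label key.

```shell
# gpu_nodes [LABEL_KEY]: list nodes together with the value of the GPU model label.
# Hypothetical wrapper; defaults to the chart's default gpuModelLabel.typeLabel.
gpu_nodes() {
  label="${1:-nvidia.com/gpu.product}"
  # -L adds a column showing the value of the given label for each node.
  kubectl get nodes -L "$label"
}
```

Calling `gpu_nodes` with no argument uses the default key. Nodes that show an empty value in the extra column do not carry the label, so GPU workloads selecting on it would not be scheduled there.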
4.5 Knative Serving Configuration
💡 Tip: Since v1.12.0, the following parameters are deprecated and retained only for backward compatibility.
| Parameter | Default Value | Description |
|---|---|---|
| knative.serving.domain | example.com | Knative service domain suffix |
| knative.serving.services | [] | Legacy service configuration (deprecated) |
| knative.serving.autoscaler.enableScaleToZero | true | Allow idle KSVC instances to be shut down automatically (scale to zero) |
| knative.serving.autoscaler.scaleToZeroGracePeriod | 60m | Grace period before an idle KSVC instance is shut down |
4.6 RBAC Configuration
| Parameter | Default Value | Description |
|---|---|---|
| rbac.create | true | Automatically create the ServiceAccount (SA), Role, and RoleBinding |
| rbac.serviceAccountName | csghub-runner | SA name (currently not modifiable) |
4.7 Logging and Monitoring
| Parameter | Default Value | Description |
|---|---|---|
| logging.level | info | Log level (info/debug/error) |
| logcollector.enabled | false | Enable log collector |
| logcollector.loki.address | "" | Loki service address |
| tempo.address | "" | Tempo tracing address |
Note: Loki is not exposed by default on the CSGHub side; set loki.gateway.enabled=true in the CSGHub deployment to expose it. Support for exposing Tempo externally is coming soon.
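As an illustration of how the logging parameters fit together, the sketch below enables the log collector on an existing release. `enable_log_collector` is a hypothetical wrapper. It assumes the release name and namespace used elsewhere in this guide and that a custom-values.yaml file holds the rest of the configuration. The Loki address passed to it is whatever endpoint the CSGHub side exposes.

```shell
# enable_log_collector LOKI_ADDRESS: turn on log collection for the runner release.
# Hypothetical wrapper; release name, chart, and namespace follow this guide.
enable_log_collector() {
  helm upgrade runner csghub/runner \
    --namespace csghub \
    -f custom-values.yaml \
    --set logcollector.enabled=true \
    --set logcollector.loki.address="$1"
}
```

For example, `enable_log_collector "http://loki.example.com"` (a placeholder address). Values given with `--set` take precedence over those in the file, so the two settings can later be moved into custom-values.yaml without changing behavior.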
4.8 External Resource Configuration
Registry:
registry:
  registry: "registry.example.com"
  repository: "csghub"
  username: "user"
  password: "pass"
  insecure: false
Object Store:
objectStore:
  endpoint: "https://minio.example.com"
  accessKey: "admin"
  secretKey: "password"
  bucket: "csghub-registry"
  region: "us-east-1"
  secure: true
  pathStyle: true
4.9 Resource and Scheduling Configuration
| Parameter | Default Value | Description |
|---|---|---|
| resources | {} | Pod requests/limits configuration |
| nodeSelector | {} | Node selector |
| tolerations | [] | Tolerations configuration |
| affinity | {} | Affinity configuration |
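A sketch of what these scheduling values might look like, written to a separate override file. It assumes the chart passes the fields through to the Runner pod in the standard Kubernetes form. The sizes, the node selector, and the `dedicated=csghub` taint are illustrative values only, and `scheduling-values.yaml` is a hypothetical file name.

```shell
# Write an illustrative scheduling override file; all values are examples.
cat > scheduling-values.yaml <<'EOF'
resources:
  requests:
    cpu: "500m"
    memory: "512Mi"
  limits:
    cpu: "1"
    memory: "1Gi"
nodeSelector:
  kubernetes.io/os: linux
tolerations:
  - key: "dedicated"        # hypothetical taint key
    operator: "Equal"
    value: "csghub"
    effect: "NoSchedule"
EOF
```

The file can be supplied as an additional `-f scheduling-values.yaml` argument to `helm install` or `helm upgrade`. When several `-f` files are given, later files override earlier ones.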
5. Verify Deployment
After deployment, verify the status using:
kubectl get pods -n csghub
kubectl get svc -n csghub
To view Runner logs:
kubectl logs -f deploy/runner-runner -n csghub
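The checks above can be combined into one step that waits for the Runner to become ready before printing logs. This is a sketch under the assumption that the Deployment is named runner-runner in the csghub namespace, as in the log command above. `verify_runner` is a hypothetical helper, and the timeout and line count are arbitrary.

```shell
# verify_runner [NAMESPACE] [DEPLOYMENT]: wait for the rollout, then show recent logs.
# Hypothetical helper; defaults follow the names used in this guide.
verify_runner() {
  ns="${1:-csghub}"
  deploy="${2:-runner-runner}"
  # Block until the Deployment reports all replicas ready, or give up after 120s.
  kubectl rollout status "deploy/$deploy" -n "$ns" --timeout=120s &&
    kubectl logs "deploy/$deploy" -n "$ns" --tail=50
}
```

Running `verify_runner` returns a non-zero exit status if the rollout does not complete in time, which makes it usable in a deployment script.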
6. Upgrade and Uninstallation
6.1 Upgrade Chart
helm upgrade runner csghub/runner -n csghub -f custom-values.yaml
6.2 Uninstall Chart
helm uninstall runner -n csghub
7. FAQ
| Issue | Solution |
|---|---|
| Runner cannot connect to Server | Verify that externalUrl and hubAPIToken are correct. |
| Knative service not installed | Ensure autoConfigure is set to true and that the cluster grants the necessary permissions. |
| GPU tasks failed to schedule | Check that GPU drivers are installed and that node labels are set correctly. |
| Image pull failure | Verify registry access permissions and image.pullSecrets configuration. |