Runner

1. Overview

The Runner is the core component of the CSGHub platform, responsible for executing computational tasks such as model training, inference, and task scheduling.

The Runner communicates with the CSGHub Server (the main control service) and dynamically creates and destroys user workloads within the Kubernetes cluster.

This Chart provides a standardized deployment method via Helm, supporting flexible configuration, integration of external dependencies, and automated resource management.

2. Environment Requirements

| Item | Requirement |
| --- | --- |
| Kubernetes Version | v1.33+ |
| Helm Version | v3.12+ |
| Network Requirements | Cluster nodes must be able to access the CSGHub Server and external image registries (if internal registries are disabled). |
| Permissions | Requires cluster-admin, or the ability to create namespaces and RBAC resources (these are created automatically during deployment). |
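
The version requirements above can be checked before installing. The following is a minimal sketch rather than part of the chart: version_ge is a hypothetical helper, and it assumes a sort that supports -V (GNU coreutils or a recent BSD sort).

```shell
#!/usr/bin/env bash
# version_ge A B: succeeds when version A >= version B (a leading "v" is ignored).
# Hypothetical helper for illustration; relies on `sort -V` (version sort).
version_ge() {
  local a="${1#v}" b="${2#v}"
  # If B is the smaller (or equal) of the two, then A >= B.
  [ "$(printf '%s\n%s\n' "$a" "$b" | sort -V | head -n1)" = "$b" ]
}

# Check the Helm client only when it is installed, so the snippet is safe to run anywhere.
if command -v helm >/dev/null 2>&1; then
  helm_ver="$(helm version --template '{{.Version}}')"
  if version_ge "$helm_ver" "v3.12.0"; then
    echo "helm $helm_ver meets the v3.12+ requirement"
  else
    echo "helm $helm_ver is older than v3.12"
  fi
fi
```

The cluster side can be checked in a similar spirit with kubectl version (server version) and kubectl auth can-i create namespaces (permissions), run against the target cluster.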

3. Deployment Steps

Add Helm Repository

helm repo add csghub https://charts.opencsg.com/csghub
helm repo update

Create Namespace (Optional)

kubectl create namespace csghub

Deploy Runner

Obtain the following information from the CSGHub main service:

  • domain: Provide a second-level domain used to expose the runner service. If the domain provided is example.com, the service will be exposed under runner.example.com.

  • externalUrl:

    helm get notes csghub -n csghub | grep -A 6 'Access your CSGHub'

    Use the CSGHub access URL shown in the output as externalUrl.

  • hubAPIToken:

    kubectl get cm csghub-core -o yaml -n csghub | grep 'API_TOKEN' | awk '{print $NF}'

  • region: A custom parameter used to identify the cluster region (e.g., cn-north).

  • registry.*:

    helm get notes csghub -n csghub | grep -A 8 'Distribution Registry'

    The output provides the registry <domain>, username, and password. Set insecure to false if the externalUrl uses HTTPS, and to true otherwise.

  • objectStore:

    helm get notes csghub -n csghub | grep -A 8 'Minio Console'

    The output provides endpoint, accessKey, and secretKey.

    bucket, region, and pathStyle are fixed values (csghub-registry, cn-north-1, and true in the install command below). Set secure to true if the externalUrl uses HTTPS, and to false otherwise.
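
Both registry.insecure and objectStore.secure follow from the scheme of externalUrl. The sketch below spells out that rule, assuming HTTPS maps to insecure=false and secure=true; url_uses_https and the example URL are hypothetical and not part of the chart.

```shell
# url_uses_https URL: succeeds when URL starts with "https://".
# Hypothetical helper for illustration.
url_uses_https() {
  case "$1" in
    https://*) return 0 ;;
    *)         return 1 ;;
  esac
}

EXTERNAL_URL="https://csghub.example.com"   # placeholder; use your real externalUrl

if url_uses_https "$EXTERNAL_URL"; then
  REGISTRY_INSECURE="false"; OBJECTSTORE_SECURE="true"
else
  REGISTRY_INSECURE="true";  OBJECTSTORE_SECURE="false"
fi

echo "registry.insecure=$REGISTRY_INSECURE objectStore.secure=$OBJECTSTORE_SECURE"
```

If the registry or object store is served on a different scheme than the main CSGHub URL, set each flag from its own endpoint instead.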

Execute Deployment

💡 Tip: Object storage and container registries can be directly integrated with external infrastructure.

For deployment in China, add:

--set global.image.registry="opencsg-registry.cn-beijing.cr.aliyuncs.com"

--set global.imageRegistry="opencsg-registry.cn-beijing.cr.aliyuncs.com/opencsghq"

Note: It is recommended to write custom configurations into a custom-values.yaml file for easier upgrades and version management.

helm install runner csghub/runner \
--namespace csghub \
--create-namespace \
--set global.gateway.external.domain="example.com" \
--set externalUrl="<csghub external_url>" \
--set hubAPIToken="<csghub hub_api_token>" \
--set region="<region name>" \
--set registry.registry="<csghub registry>" \
--set registry.repository="csghub" \
--set registry.username="<csghub registry username>" \
--set registry.password="<csghub registry password>" \
--set-string registry.insecure="<true if the csghub registry uses HTTP, otherwise false>" \
--set objectStore.endpoint="<csghub minio>" \
--set objectStore.accessKey="<csghub minio username>" \
--set objectStore.secretKey="<csghub minio password>" \
--set objectStore.bucket="csghub-registry" \
--set objectStore.region="cn-north-1" \
--set-string objectStore.secure="false" \
--set-string objectStore.pathStyle="true"
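
Following the note above about custom-values.yaml, the same settings can be kept in a file instead of individual --set flags. This is a sketch that mirrors the flags one to one; every value is a placeholder, and the booleans are quoted only to mirror --set-string (whether the chart also accepts unquoted booleans is not confirmed here).

```shell
# Write the same settings as the --set flags above into custom-values.yaml.
# All values are placeholders; replace them with the values gathered earlier.
cat > custom-values.yaml <<'EOF'
global:
  gateway:
    external:
      domain: "example.com"
externalUrl: "<csghub external_url>"
hubAPIToken: "<csghub hub_api_token>"
region: "<region name>"
registry:
  registry: "<csghub registry>"
  repository: "csghub"
  username: "<csghub registry username>"
  password: "<csghub registry password>"
  insecure: "false"        # quoted to mirror --set-string
objectStore:
  endpoint: "<csghub minio>"
  accessKey: "<csghub minio username>"
  secretKey: "<csghub minio password>"
  bucket: "csghub-registry"
  region: "cn-north-1"
  secure: "false"
  pathStyle: "true"
EOF

# Then install with the file instead of individual flags:
#   helm install runner csghub/runner --namespace csghub --create-namespace -f custom-values.yaml
```

Keeping the file under version control makes later helm upgrade runs reproducible, but note that it contains credentials.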

4. Configuration Details

4.1 Global Configuration

| Parameter | Default Value | Description |
| --- | --- | --- |
| global.gateway.external.domain | example.com | Base access domain for the platform |
| global.gateway.tls.enabled | false | Whether to enable TLS |
| global.image.tag | - | Image version tag |

4.2 Service Configuration

| Parameter | Default Value | Description |
| --- | --- | --- |
| name | runner | Name used to identify runner resources (includes exposed domain) |
| region | region-0 | Regional identifier for the Runner |
| interval | 60 | Communication interval between Runner and Server (seconds) |
| namespace | spaces | Default namespace for user workloads |
| autoConfigure (Deprecated) | true | Whether to automatically install dependencies like Knative, Argo, LWS |
| kymlMode (Deprecated) | create | Initialization mode for cluster resources (create/update/replace) |
| mergingNamespace | disable | Namespace merging mode (multi/single/disable) |
| usePublicDomain | true | Whether to use public domains for app access (false may limit features) |

4.3 Package and Image Management

| Parameter | Default Value | Description |
| --- | --- | --- |
| pipIndexUrl | https://pypi.tuna.tsinghua.edu.cn/simple/ | Custom pip source |
| extraBuildArgs | [] | Additional Kaniko build arguments |
| model.registry | opencsg-registry.cn-beijing.cr.aliyuncs.com | Model image registry address |

4.4 GPU Configuration

| Parameter | Default Value | Description |
| --- | --- | --- |
| gpuModelLabel.typeLabel | nvidia.com/gpu.product | GPU model label |
| gpuModelLabel.capacityLabel | nvidia.com/gpu | GPU capacity label |

4.5 Knative Serving Configuration

💡 Tip: Since v1.12.0, the following parameters are deprecated and used only for backward compatibility.

| Parameter | Default Value | Description |
| --- | --- | --- |
| knative.serving.domain | example.com | Knative service domain suffix |
| knative.serving.services | [] | Legacy service configuration (deprecated) |
| knative.serving.autoscaler.enableScaleToZero | true | Enable KSVC instance auto-shutdown |
| knative.serving.autoscaler.scaleToZeroGracePeriod | 60m | Tolerance time for KSVC instance shutdown |

4.6 RBAC Configuration

| Parameter | Default Value | Description |
| --- | --- | --- |
| rbac.create | true | Automatically create SA, Role, and RoleBinding |
| rbac.serviceAccountName | csghub-runner | SA name [Currently not modifiable] |

4.7 Logging and Monitoring

| Parameter | Default Value | Description |
| --- | --- | --- |
| logging.level | info | Log level (info/debug/error) |
| logcollector.enabled | false | Enable log collector |
| logcollector.loki.address | "" | Loki service address |
| tempo.address | "" | Tempo tracing address |

Note: Loki is not exposed by default on the CSGHub side. You must set loki.gateway.enabled=true in CSGHub. Tempo support for external exposure is coming soon.

4.8 External Resource Configuration

Registry:

registry:
  registry: "registry.example.com"
  repository: "csghub"
  username: "user"
  password: "pass"
  insecure: false

Object Store:

objectStore:
  endpoint: "https://minio.example.com"
  accessKey: "admin"
  secretKey: "password"
  bucket: "csghub-registry"
  region: "us-east-1"
  secure: true
  pathStyle: true

4.9 Resource and Scheduling Configuration

| Parameter | Default Value | Description |
| --- | --- | --- |
| resources | {} | Pod requests/limits configuration |
| nodeSelector | {} | Node selector |
| tolerations | [] | Tolerations configuration |
| affinity | {} | Affinity configuration |
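
As an illustration of these parameters, the sketch below writes a values fragment with requests/limits, a node selector, and a toleration. It assumes the chart passes these keys through to the Runner pod spec in the standard Kubernetes form; the file name scheduling-values.yaml, the resource figures, and the dedicated=csghub taint are hypothetical examples, not chart defaults.

```shell
# Write an example scheduling fragment; every value here is illustrative.
cat > scheduling-values.yaml <<'EOF'
resources:
  requests:
    cpu: "500m"
    memory: "512Mi"
  limits:
    cpu: "1"
    memory: "1Gi"
nodeSelector:
  kubernetes.io/os: linux
tolerations:
  - key: "dedicated"        # hypothetical taint key on the target nodes
    operator: "Equal"
    value: "csghub"
    effect: "NoSchedule"
EOF

# Apply it on top of the main values file (Helm accepts multiple -f flags):
#   helm upgrade runner csghub/runner -n csghub -f custom-values.yaml -f scheduling-values.yaml
```

When several -f files are given, later files override earlier ones, so the fragment can be kept separate per cluster.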

5. Verify Deployment

After deployment, verify the status using:

kubectl get pods -n csghub
kubectl get svc -n csghub

To view Runner logs:

kubectl logs -f deploy/runner-runner -n csghub
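
To spot problem pods quickly, the output of kubectl get pods can be filtered by its STATUS column. This is a minimal sketch: not_ready_pods is a hypothetical helper, and the pod names in the sample are made up.

```shell
# not_ready_pods: reads `kubectl get pods --no-headers` output on stdin and
# prints the name of every pod whose STATUS is neither Running nor Completed.
# Columns are NAME READY STATUS RESTARTS AGE, so STATUS is field 3.
not_ready_pods() {
  awk '$3 != "Running" && $3 != "Completed" { print $1 }'
}

# Example with canned output (pod names are made up):
sample='runner-runner-7c9f8d-abcde   1/1   Running            0   5m
runner-job-xyz12             0/1   CrashLoopBackOff   3   2m'
printf '%s\n' "$sample" | not_ready_pods

# Against a real cluster:
#   kubectl get pods -n csghub --no-headers | not_ready_pods
```

A pod can be Running while its containers are still not ready, so the READY column is worth checking as well; kubectl rollout status deploy/runner-runner -n csghub waits until the Runner deployment has finished rolling out.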

6. Upgrade and Uninstallation

6.1 Upgrade Chart

helm upgrade runner csghub/runner -n csghub -f custom-values.yaml

6.2 Uninstall Chart

helm uninstall runner -n csghub

7. FAQ

| Issue | Solution |
| --- | --- |
| Runner cannot connect to Server | Verify that externalUrl and hubAPIToken are correct. |
| Knative service not installed | Ensure autoConfigure: true and that the cluster has the necessary permissions. |
| GPU tasks failed to schedule | Check that node labels are set and GPU drivers are correctly installed. |
| Image pull failure | Verify registry access permissions and the image.pullSecrets configuration. |