Dataflow Deployment Guide

📘 Overview

CSGHub Dataflow is the data-flow management and labeling subsystem of the CSGHub platform.

It handles data processing, annotation, preprocessing, and distribution for model training workflows.

This Helm chart deploys Dataflow and its dependencies (Label Studio, Redis, PostgreSQL, MongoDB, etc.) in a Kubernetes environment.

The chart supports both built-in mode (all dependencies deployed automatically) and external resource mode (connect to managed databases and caches).


⚙️ System Requirements

| Item | Description |
| --- | --- |
| Kubernetes Version | v1.28+ |
| Helm Version | v3.12+ |
| Network | Cluster nodes must be able to reach the CSGHub main service (externalUrl) |
| Permissions | Requires permissions to create Namespace, Service, PVC, Ingress, etc. |
| Cluster Storage | Must support ReadWriteMany persistent volumes |
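
The version requirements above can be checked before installing. The following is a minimal sketch, not part of the chart: `version_ge` is a hypothetical helper that relies on `sort -V` (GNU coreutils or BusyBox) for version ordering. The Helm and kubectl checks run only when those binaries are present.

```shell
# Hypothetical helper: succeeds when version $1 is >= version $2.
# A leading "v" is ignored; ordering is delegated to `sort -V`.
version_ge() {
  vg_have="${1#v}"
  vg_want="${2#v}"
  [ "$(printf '%s\n%s\n' "$vg_want" "$vg_have" | sort -V | head -n 1)" = "$vg_want" ]
}

# Compare the local Helm client against the v3.12 minimum.
if command -v helm >/dev/null 2>&1; then
  helm_version="$(helm version --template '{{.Version}}')"
  if version_ge "$helm_version" "v3.12.0"; then
    echo "Helm $helm_version: OK"
  else
    echo "Helm $helm_version: older than v3.12"
  fi
fi

# List storage classes so you can confirm that one of them is backed by a
# provisioner supporting ReadWriteMany (this depends on the provisioner and
# is not shown directly in the listing).
if command -v kubectl >/dev/null 2>&1; then
  kubectl get storageclass || echo "Could not query the cluster"
fi
```

The same helper can be applied to the Kubernetes server version reported by `kubectl version`.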

🧩 1. Preparation

Add the CSGHub Helm Repository

helm repo add csghub https://charts.opencsg.com/repository/csghub
helm repo update

Create Namespace (Optional)

kubectl create namespace csghub

🏗️ 2. Deploy Dataflow

Basic Installation (with built-in dependencies)

For testing or development, you can use the default configuration:

  • Get the externalUrl of CSGHub:

    helm get notes csghub -n csghub | grep -A 6 'Access your CSGHub'
  • Install Dataflow:

    helm install dataflow csghub/dataflow \
    --namespace csghub \
    --create-namespace \
    --set global.ingress.domain="example.com" \
    --set externalUrl="<csghub externalUrl>" \
    --set dataflow.postgresql.database="csghub_dataflow" \
    --set labelStudio.postgresql.database="csghub_label_studio"

This installation automatically deploys:

  • Dataflow main service
  • Label Studio annotation service
  • Built-in PostgreSQL, Redis, and MongoDB
  • Built-in NGINX Ingress Controller
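
To preview what the chart would create before installing, the manifests can be rendered locally with `helm template` and summarized by resource kind. This is a rough sketch: `count_kinds` is a hypothetical helper, and the `--set` values are the example defaults from this guide rather than values for a real cluster.

```shell
# Hypothetical helper: reads rendered manifests (multi-document YAML) on
# stdin and prints one "Kind=count" line per top-level resource kind.
count_kinds() {
  awk '/^kind:/ { print $2 }' | sort | uniq -c | awk '{ print $2 "=" $1 }'
}

# Render the chart without touching the cluster, then summarize the output.
# Runs only when Helm is installed; the csghub repository must be added first.
if command -v helm >/dev/null 2>&1; then
  helm template dataflow csghub/dataflow \
    --namespace csghub \
    --set global.ingress.domain="example.com" \
    --set externalUrl="https://csghub.example.com" \
    | count_kinds
fi
```

The summary makes it easy to spot whether StatefulSets, PersistentVolumeClaims, and Ingress resources are rendered as expected for the chosen values.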

Using External Resources

For production environments, it is recommended to use managed external services:

helm install dataflow csghub/dataflow \
--namespace csghub \
--create-namespace \
--set dataflow.postgresql.database="csghub_dataflow" \
--set labelStudio.postgresql.database="csghub_label_studio" \
-f custom-values.yaml

Example custom-values.yaml:

global:
  ingress:
    domain: "csghub.company.com"
    tls:
      enabled: true
      secretName: "csghub-tls"

  postgresql:
    enabled: false
    external:
      host: "pg.company.com"
      port: 5432
      user: "csghub"
      password: "******"
      sslmode: "require"

  redis:
    enabled: false
    external:
      host: "redis.company.com"
      port: 6379
      password: "******"

  mongo:
    enabled: false
    external:
      host: "mongo.company.com"
      port: 27017
      user: "admin"
      password: "******"

externalUrl: "https://csghub.company.com"
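
The `******` values in the example are masks, not working passwords. A small pre-flight check can catch a values file that still contains them. This is a sketch under assumptions: `has_placeholder_secret` is a hypothetical helper, and `VALUES_FILE` is an illustrative variable name, not something the chart reads.

```shell
# Hypothetical helper: succeeds (and prints the matching lines) when the
# given file still contains the literal "******" mask from the example.
has_placeholder_secret() {
  grep -n -F -- '******' "$1"
}

# Warn before installing if the values file was not filled in.
values_file="${VALUES_FILE:-custom-values.yaml}"
if [ -f "$values_file" ] && has_placeholder_secret "$values_file" >/dev/null; then
  echo "Replace the ****** placeholders in $values_file before installing." >&2
fi
```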

⚙️ 3. Key Configuration Parameters

Global Configuration (global)

| Parameter | Default | Description |
| --- | --- | --- |
| global.edition | ee | Edition: Community (ce) / Enterprise (ee) |
| global.ingress.domain | example.com | Base domain for ingress access |
| global.image.tag | v1.12.0 | Default image version tag |
| global.persistence.size | 10Gi | Default persistent volume size |
| global.postgresql.enabled | true | Enable built-in PostgreSQL |
| global.redis.enabled | true | Enable built-in Redis |
| global.mongo.enabled | true | Enable built-in MongoDB |

Dataflow Service Configuration

| Parameter | Default | Description |
| --- | --- | --- |
| externalUrl | https://csghub.example.com | CSGHub main system URL |
| dataflow.image.repository | opencsghq/dataflow | Dataflow image repository |
| dataflow.image.tag | v1.12.0 | Dataflow image tag |
| dataflow.persistence.size | 100Gi | Persistent volume size |
| dataflow.postgresql | | Override PostgreSQL settings |
| dataflow.redis | | Override Redis settings |
| dataflow.mongo | | Override MongoDB settings |

Worker Configuration

| Parameter | Default | Description |
| --- | --- | --- |
| worker.logging.level | info | Logging level for Celery Worker |

Label Studio Configuration

| Parameter | Default | Description |
| --- | --- | --- |
| labelStudio.image.repository | opencsghq/label-studio | Label Studio image |
| labelStudio.image.tag | v1.12.0 | Label Studio image tag |
| labelStudio.persistence.size | 100Gi | Persistent volume size |
| labelStudio.securityContext.runAsUser | 0 | Container user UID |
| labelStudio.postgresql.database | csghub_label_studio | Label Studio DB name |

Built-in Dependencies

| Component | Parameter | Default | Description |
| --- | --- | --- | --- |
| PostgreSQL | postgresql.image.repository | opencsghq/postgres | Built-in database image |
| | postgresql.databases | [csghub_dataflow, csghub_label_studio] | Pre-created databases |
| | postgresql.persistence.size | 50Gi | Persistent volume size |
| Redis | redis.image.repository | redis | Redis image |
| | redis.persistence.size | 10Gi | Persistent volume size |
| MongoDB | mongo.image.repository | opencsghq/mongo | MongoDB image |
| | mongo.persistence.size | 10Gi | Persistent volume size |

🔍 4. Verify Deployment

Check running Pods:

kubectl get pods -n csghub

Check services:

kubectl get svc -n csghub

Functional testing requires that Dataflow can connect to the main CSGHub service at the configured externalUrl.
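
Beyond listing pods, readiness can be checked in a scriptable way. The sketch below assumes the `csghub` namespace used in this guide. `not_ready_pods` is a hypothetical helper that filters `kubectl get pods --no-headers` output. Note that `kubectl wait` with `--all` also waits on pods from finished jobs, which never become Ready, so a timeout is not always a failure; the fallback listing skips Completed pods for that reason.

```shell
# Hypothetical helper: reads `kubectl get pods --no-headers` output on stdin
# and prints the names of pods that are neither Completed nor both Running
# and fully Ready (READY column "x/y" with x equal to y).
not_ready_pods() {
  awk '{
    split($2, ready, "/")
    if ($3 != "Completed" && ($3 != "Running" || ready[1] != ready[2])) print $1
  }'
}

if command -v kubectl >/dev/null 2>&1; then
  # Block until every pod reports Ready, or give up after five minutes and
  # list the pods that still need attention.
  kubectl wait --for=condition=Ready pods --all -n csghub --timeout=300s \
    || kubectl get pods -n csghub --no-headers | not_ready_pods
fi
```

Any pod name printed by the fallback can then be inspected with `kubectl describe pod <name> -n csghub` or `kubectl logs <name> -n csghub`.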


🔄 5. Upgrade and Uninstall

Upgrade Chart

helm upgrade dataflow csghub/dataflow -n csghub -f custom-values.yaml

Uninstall Chart

helm uninstall dataflow -n csghub

🧠 FAQ

| Issue | Cause | Solution |
| --- | --- | --- |
| Dataflow cannot reach main CSGHub | externalUrl misconfigured | Verify URL and TLS settings |
| Label Studio failed to start | Database or PVC misconfigured | Check PostgreSQL/Mongo mount paths |
| Image pull failure | Missing registry credentials | Add image.pullSecrets configuration |
| Redis/Mongo not starting | Conflict with external config | Disable built-in components and redeploy |