Dataflow
1. Overview
Dataflow is the data stream management and annotation subsystem within the CSGHub platform. It is designed to handle model training data, annotation tasks, data preprocessing, and distribution workflows.
The Helm chart deploys Dataflow and its dependencies (Label Studio, Redis, PostgreSQL, and MongoDB) to a Kubernetes cluster. It supports both an all-in-one installation (built-in mode) and connection to externally managed resources.
2. Environment Requirements
| Item | Description |
|---|---|
| Kubernetes Version | v1.33+ |
| Helm Version | v3.12+ |
| Network | Cluster nodes must be able to access the CSGHub main service (externalUrl) |
| Permissions | Permission to create Namespaces, Services, PVCs, Gateways, etc. |
| Storage | Requires storage volumes that support ReadWriteMany (RWX) |
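The version minimums in the table can be checked before installing. The following is a minimal sketch, not part of the chart: the helper name `version_ge` is hypothetical, and it assumes a `sort` that supports `-V` (version sort, as in GNU coreutils).

```shell
# Hypothetical helper: succeeds when version $1 is greater than or equal to $2.
# A leading "v" is stripped from both arguments. Assumes `sort -V` is available.
version_ge() (
  have="${1#v}"
  want="${2#v}"
  lowest="$(printf '%s\n%s\n' "$want" "$have" | sort -V | head -n 1)"
  [ "$lowest" = "$want" ]
)

# Example usage against the Helm minimum listed above (requires helm on PATH):
#   helm_version="$(helm version --template '{{.Version}}')"
#   version_ge "$helm_version" "3.12.0" || echo "Helm 3.12+ is required" >&2
# Permissions can be spot-checked with kubectl's access review:
#   kubectl auth can-i create namespaces
#   kubectl auth can-i create persistentvolumeclaims -n csghub
```

Pass full three-part versions (for example `1.33.0` rather than `1.33`), since version sort treats the shorter string as lower.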
3. Deployment
3.1 Add Helm Repository
helm repo add csghub https://charts.opencsg.com/csghub
helm repo update
3.2 Create Namespace (Optional)
kubectl create namespace csghub
3.3 Deploy Dataflow
- Obtain externalUrl:
  Get the CSGHub access address using the following command:
  helm get notes csghub -n csghub | grep -A 6 'Access your CSGHub'
- Execute Installation:
  💡 Tip for China-based deployments:
  Add these flags to use local mirrors:
  - --set global.image.registry="opencsg-registry.cn-beijing.cr.aliyuncs.com"
  - --set global.imageRegistry="opencsg-registry.cn-beijing.cr.aliyuncs.com/opencsghq"

  helm install dataflow csghub/dataflow \
    --namespace csghub \
    --create-namespace \
    --set global.gateway.external.domain="example.com" \
    --set externalUrl="<csghub externalUrl>" \
    --set dataflow.postgresql.database="csghub_dataflow" \
    --set labelStudio.postgresql.database="csghub_label_studio"

  This will automatically start:
  - Dataflow Main Service
  - Label Studio Annotation Service
  - Built-in PostgreSQL, Redis, and MongoDB
  - Built-in Gateway API Controller
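The optional mirror flags can be folded into the install command with a small wrapper. This is a sketch only: the function name `dataflow_install_cmd` and the `USE_CN_MIRROR` variable are hypothetical, and the function prints the command for review rather than running it.

```shell
# Hypothetical helper: prints the helm install command from this section,
# appending the mirror flags only when USE_CN_MIRROR=1.
# Arguments: $1 = gateway domain, $2 = CSGHub externalUrl.
dataflow_install_cmd() (
  domain="$1"
  external_url="$2"
  set -- helm install dataflow csghub/dataflow \
    --namespace csghub \
    --create-namespace \
    --set "global.gateway.external.domain=${domain}" \
    --set "externalUrl=${external_url}" \
    --set "dataflow.postgresql.database=csghub_dataflow" \
    --set "labelStudio.postgresql.database=csghub_label_studio"
  if [ "${USE_CN_MIRROR:-0}" = "1" ]; then
    set -- "$@" \
      --set "global.image.registry=opencsg-registry.cn-beijing.cr.aliyuncs.com" \
      --set "global.imageRegistry=opencsg-registry.cn-beijing.cr.aliyuncs.com/opencsghq"
  fi
  # Print for review; values containing spaces would need extra quoting.
  printf '%s ' "$@"
  printf '\n'
)

# Example usage:
#   USE_CN_MIRROR=1 dataflow_install_cmd example.com "https://csghub.example.com"
```

Printing the command first makes it easy to confirm the exact flag set before anything is applied to the cluster.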
4. Using External Resources
For production environments, use externally managed databases and caches:
helm upgrade --install dataflow csghub/dataflow \
--namespace csghub \
--create-namespace \
--set dataflow.postgresql.database="csghub_dataflow" \
--set labelStudio.postgresql.database="csghub_label_studio" \
-f custom-values.yaml
Example custom-values.yaml:
global:
  gateway:
    external:
      domain: "company.com"
      tls:
        enabled: true
        secretName: "csghub-tls"
  postgresql:
    enabled: false
    external:
      host: "pg.company.com"
      port: 5432
      user: "csghub"
      password: "******"
      sslmode: "require"
  redis:
    enabled: false
    external:
      host: "redis.company.com"
      port: 6379
      password: "******"
  mongo:
    enabled: false
    external:
      host: "mongo.company.com"
      port: 27017
      user: "admin"
      password: "******"
externalUrl: "https://csghub.company.com"
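To keep real passwords out of custom-values.yaml, one option is a second values file generated from environment variables and passed with an additional `-f` flag; Helm merges multiple values files, with later files taking precedence. The sketch below is an assumption-laden illustration: the function name and the variable names `PG_PASSWORD`, `REDIS_PASSWORD`, and `MONGO_PASSWORD` are hypothetical, and it assumes the postgresql, redis, and mongo blocks are nested under global (consistent with parameter names such as global.postgresql.enabled).

```shell
# Hypothetical helper: prints a values overlay holding only the passwords,
# read from environment variables. Fails if any variable is unset or empty.
# Assumes the external blocks are nested under `global`.
render_secret_values() (
  : "${PG_PASSWORD:?set PG_PASSWORD}"
  : "${REDIS_PASSWORD:?set REDIS_PASSWORD}"
  : "${MONGO_PASSWORD:?set MONGO_PASSWORD}"
  cat <<EOF
global:
  postgresql:
    external:
      password: "${PG_PASSWORD}"
  redis:
    external:
      password: "${REDIS_PASSWORD}"
  mongo:
    external:
      password: "${MONGO_PASSWORD}"
EOF
)

# Example usage:
#   umask 077
#   render_secret_values > secret-values.yaml
#   helm upgrade --install dataflow csghub/dataflow -n csghub \
#     -f custom-values.yaml -f secret-values.yaml
```

Passwords containing double quotes or backslashes would need YAML escaping, which this sketch does not attempt.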
5. Configuration Parameters
5.1 Global Configuration
| Parameter | Default | Description |
|---|---|---|
| global.edition | ee | Edition: ce (Community) / ee (Enterprise) |
| global.gateway.external.domain | example.com | Access domain |
| global.image.tag | v1.16.0 | Image version tag |
| global.persistence.size | 10Gi | Default PV size |
| global.postgresql.enabled | true | Enable built-in PostgreSQL |
| global.redis.enabled | true | Enable built-in Redis |
| global.mongo.enabled | true | Enable built-in MongoDB |
5.2 Service Configuration
| Parameter | Default | Description |
|---|---|---|
| externalUrl | https://csghub.example.com | CSGHub main system URL |
| dataflow.image.repository | opencsghq/dataflow | Dataflow image repository |
| dataflow.persistence.size | 100Gi | Dataflow PV size |
5.3 Label Studio Configuration
| Parameter | Default | Description |
|---|---|---|
| labelStudio.image.repository | opencsghq/label-studio | Label Studio image repository |
| labelStudio.persistence.size | 100Gi | Annotation data PV size |
| labelStudio.securityContext.runAsUser | 0 | Container user UID |
| labelStudio.postgresql.database | csghub_label_studio | Database name for Label Studio |
5.4 Built-in Third-party Components
| Component | Parameter | Default | Description |
|---|---|---|---|
| PostgreSQL | postgresql.image.repository | opencsghq/postgres | Built-in database image |
| PostgreSQL | postgresql.databases | [csghub_dataflow, csghub_label_studio] | Databases created automatically at startup |
| PostgreSQL | postgresql.persistence.size | 50Gi | Persistent volume storage size |
| Redis | redis.image.repository | redis | |
| Redis | redis.persistence.size | 10Gi | |
| MongoDB | mongo.image.repository | opencsghq/mongo | |
| MongoDB | mongo.persistence.size | 10Gi | |
6. Verification
# Check Pod status
kubectl get pods -n csghub
# Verify services
kubectl get svc -n csghub
Note: Full functional verification requires successful integration with the CSGHub main system.
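The pod listing can be filtered down to the pods that still need attention. The helper below is a hypothetical sketch, not part of the chart; it parses the default column layout of `kubectl get pods --no-headers` (NAME, READY, STATUS, ...) and skips pods in the Completed state.

```shell
# Hypothetical helper: reads `kubectl get pods --no-headers` output on stdin
# and prints the name of each pod that is not Running with all containers ready.
# Pods whose STATUS is Completed (finished jobs) are skipped.
not_ready_pods() {
  awk '
    $3 == "Completed" { next }
    {
      split($2, ready, "/")
      if ($3 != "Running" || ready[1] != ready[2]) print $1
    }
  '
}

# Example usage:
#   kubectl get pods -n csghub --no-headers | not_ready_pods
# To block until pods report Ready instead of polling:
#   kubectl wait --for=condition=Ready pods --all -n csghub --timeout=300s
```

An empty result means every listed pod is either Running with all containers ready or Completed.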
7. Upgrade & Uninstallation
7.1 Upgrade Chart
helm upgrade dataflow csghub/dataflow -n csghub -f custom-values.yaml
7.2 Uninstall Chart
helm uninstall dataflow -n csghub
8. FAQ
- Dataflow cannot access main system: Ensure externalUrl and TLS settings are correctly configured.
- Label Studio startup failure: Check for database connection issues or PVC mounting path errors.
- Image pull failure: Ensure image.pullSecrets are added if using a private registry.
- Redis/Mongo failed to start: Check for configuration conflicts between built-in and external resource settings.