
Hardware Requirements

1. Description

CSGHub is a cloud-native AI hosting platform that runs the following core workload types:

  • Control Plane Services (API / Web / Scheduling)
  • Data Plane Services (Model, Dataset, Artifact storage)
  • Computing Tasks (Dataflow / Runner / Inference tasks)
  • Optional AI Components (GPU / Knative / Argo)

Therefore, hardware requirements are highly dependent on the deployment scale and usage scenarios.

2. Deployment Mode Classification

| Deployment Mode | Target Scenarios | Characteristics |
| --- | --- | --- |
| Docker (Single Machine) | Development / Demo | Simple, low resource consumption |
| Single-node K8s | Testing / POC | Close to production architecture |
| Standard K8s Cluster | Production | Scalable, high-availability cluster |
| Large-scale Production | Large-scale production environments | Multi-node redundancy |

3. Testing/Development Environment (Minimum Configuration)

Applicable to:

  • Functional verification
  • Local development
  • Individual use

3.1 Recommended Configuration

| Resource | Configuration |
| --- | --- |
| CPU | 4 cores |
| Memory | 8 GB |
| Storage | ≥ 200 GB (SSD) |
| Network | ≥ 1 Gbps |

3.2 Notes

  • Can use Docker.
  • Not recommended to enable: Dataflow, large-scale Runner, or GPU inference.
  • Local disk (hostPath) can be used for storage.
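
For a single-node test setup, the hostPath option above can be sketched as a manually created PersistentVolume. This is illustrative only: the path `/data/csghub`, the volume name, and the class name `local-test` are assumptions, not CSGHub defaults, and hostPath is unsuitable for production.

```shell
# Sketch: a hostPath PersistentVolume for single-node testing only.
# Path, volume name, and StorageClass name are illustrative assumptions.
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: PersistentVolume
metadata:
  name: csghub-local-pv
spec:
  capacity:
    storage: 200Gi
  accessModes:
    - ReadWriteOnce
  storageClassName: local-test
  hostPath:
    path: /data/csghub
EOF
```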

4. Standard Production Environment

Applicable to:

  • Team use (10–100 people)
  • Model / Dataset management
  • Medium-scale task scheduling

4.1 Cluster Scale

  • 3 to 5-node Kubernetes cluster.

4.2 Per-node Configuration

| Resource | Recommended |
| --- | --- |
| CPU | 8 ~ 16 cores |
| Memory | 16 ~ 32 GB |
| Storage | ≥ 1 TB SSD |
| Network | 1 ~ 10 Gbps |

4.3 Total Resources (Example)

  • Total CPU: ≥ 32 Cores
  • Total Memory: ≥ 64 GB
  • Storage: ≥ 3 TB

5. Large-scale Production (High Load)

Applicable to:

  • Multi-team / Multi-tenant
  • High-frequency task scheduling
  • AI Inference / Training
  • Large-scale datasets

5.1 Cluster Scale

  • 5 to 20+ nodes.

5.2 Per-node Configuration

| Resource | Recommended |
| --- | --- |
| CPU | 16 ~ 64 cores |
| Memory | 64 ~ 256 GB |
| Storage | ≥ 2 TB NVMe SSD |
| Network | ≥ 10 Gbps |

6. GPU Resources (Optional)

Applicable to:

  • Model inference
  • AI training
  • Model evaluation

6.1 GPU Configuration

| Scenario | GPU |
| --- | --- |
| Lightweight Inference | 1 × T4 / L4 |
| Medium Load | 1 ~ 4 × A10 / A100 |
| Large-scale Training | Multi-node GPU clusters |

6.2 Requirements

  • GPU nodes must have the NVIDIA driver installed, and the NVIDIA Device Plugin must be deployed in the cluster.
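
Once the driver is installed and the Device Plugin DaemonSet is deployed (see the NVIDIA `k8s-device-plugin` project for the current manifest), GPU nodes should advertise the `nvidia.com/gpu` resource. A quick check, assuming `kubectl` access to the cluster:

```shell
# After the NVIDIA driver and Device Plugin are in place, each GPU node
# should report nvidia.com/gpu under its Capacity/Allocatable resources:
kubectl describe nodes | grep -B 1 -A 1 "nvidia.com/gpu"
```

If no `nvidia.com/gpu` line appears, GPU workloads cannot be scheduled.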

7. Storage Requirements (Critical)

7.1 Mandatory Capabilities

  • ✅ CSI support
  • ✅ Dynamic Provisioning support
  • ✅ At least one StorageClass
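
The three capabilities above can be verified with standard `kubectl` commands (assuming cluster access):

```shell
# CSI support: at least one CSI driver should be registered.
kubectl get csidrivers

# Dynamic provisioning: at least one StorageClass should exist;
# the default class is marked "(default)" in the NAME column.
kubectl get storageclass
```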

7.2 Storage Type Recommendations

| Type | Usage | Recommendation |
| --- | --- | --- |
| Local SSD | Testing | ✅ Recommended |
| NAS / NFS | RWX scenarios | ⚠️ Average performance |
| Distributed Storage (Ceph / Longhorn) | Production | ✅ Recommended |
| Object Storage (S3) | Datasets / Models | ✅ Recommended |

7.3 RWX (ReadWriteMany) Requirements

The following components must support RWX:

  • Dataflow
  • CSGShip
  • Parts of task scheduling

👉 Failure to meet this requirement results in task failures and data that cannot be shared between pods.
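
Before enabling Dataflow, it is worth probing whether a StorageClass can actually provision RWX volumes. A minimal sketch, assuming a class named `cephfs` (substitute your RWX-capable class):

```shell
# Sketch: probe RWX support. The class name "cephfs" is an assumption.
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: rwx-probe
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: cephfs
  resources:
    requests:
      storage: 1Gi
EOF

# The claim should reach the Bound phase; a claim stuck in Pending
# usually means the class cannot provision RWX volumes.
kubectl get pvc rwx-probe
```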

7.4 Storage Capacity Estimation

Total Storage = (Model Size × Quantity) + (Dataset Size × Quantity) + Build Cache (~20%) + Logs (~10%)

Example: 500 GB of models + 2 TB of datasets + ~750 GB of cache and logs ≈ 3.25 TB total.
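
As a quick sanity check, the estimate can be scripted. The sizes below are the example's, in GB (2 TB taken as 2000 GB); cache and logs are computed as percentages of the model-plus-dataset base:

```shell
# Storage estimate from the formula above, in GB.
MODELS_GB=500       # total model size
DATASETS_GB=2000    # total dataset size (2 TB)
BASE=$((MODELS_GB + DATASETS_GB))
CACHE=$((BASE * 20 / 100))   # build cache, ~20% of base
LOGS=$((BASE * 10 / 100))    # logs, ~10% of base
TOTAL=$((BASE + CACHE + LOGS))
echo "${TOTAL} GB"           # prints "3250 GB", i.e. about 3.25 TB
```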

8. Component Resource Consumption

| Component | CPU | Memory | Storage | Characteristics |
| --- | --- | --- | --- | --- |
| API / Web | Low | Low | Low | Control plane |
| Dataflow | Med | Med | High | Heavy I/O dependency |
| Runner | High | Med | Med | Elastic scaling |
| Knative | Med | Med | Low | Auto-scaling |
| Argo | Med | Med | Med | Workflow scheduling |
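
The table can guide per-component Kubernetes resource requests. A minimal sketch for a medium-CPU component; the deployment name, image, and numbers are illustrative assumptions, not CSGHub defaults:

```shell
# Sketch: requests/limits for a medium-CPU, I/O-heavy component.
# All names, the image, and the values are illustrative only.
kubectl apply -f - <<'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: dataflow-example
spec:
  replicas: 1
  selector:
    matchLabels: {app: dataflow-example}
  template:
    metadata:
      labels: {app: dataflow-example}
    spec:
      containers:
        - name: dataflow
          image: example.com/dataflow:latest
          resources:
            requests: {cpu: "2", memory: 4Gi}
            limits:   {cpu: "4", memory: 8Gi}
EOF
```

Setting requests keeps the scheduler honest about node capacity; limits cap runaway components before they starve neighbors.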

9. Deployment Method vs. Hardware Advice

| Method | Configuration Details |
| --- | --- |
| Docker Single Machine | CPU ≥ 4 cores, RAM ≥ 8 GB (for demos) |
| K8s Single-node | CPU ≥ 8 cores, RAM ≥ 16 GB |
| Standard K8s | ≥ 3 nodes, 8C / 16GB per node |
| High Availability | K8s: 3 masters (4C/8GB), ≥ 3 workers (8C/16GB); PostgreSQL: 3 nodes (4C/8GB); Object Storage: 4 nodes (4C/8GB); Gitaly: 3 nodes (8C/16GB) |

10. Common Problems & Risks

  • Resource Depletion: Leads to Pod OOMKilled, scheduling failures, and stuck tasks. I/O bottlenecks are the most common issue.
  • Storage Issues: Lack of RWX causes Dataflow startup failure. Slow I/O degrades training/inference performance.
  • Network Issues: Insufficient bandwidth slows model pulling. High latency causes service instability.
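
A few standard `kubectl` commands surface these failure modes quickly (assuming cluster access; `kubectl top` requires metrics-server):

```shell
# Spot CPU/memory pressure on nodes (requires metrics-server).
kubectl top nodes

# Find pods that are stuck or were evicted/killed by resource pressure.
kubectl get pods -A | grep -E 'Pending|Evicted|CrashLoopBackOff'

# Review recent events for scheduling, storage, and OOM errors.
kubectl get events -A --sort-by=.lastTimestamp | tail -20
```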

11. Final Summary Recommendations

  • Testing Environment: 4C / 8GB / 200GB.
  • Production Environment: 8C+ / 16GB+ / 1TB+.
  • Preferred Deployment: Kubernetes cluster.
  • Storage Priority: RWX + High I/O.
  • AI Scenario: Dedicated GPU nodes are highly suggested.