Software Requirements

1. Description

CSGHub is built on a cloud-native technology stack. Its core runtime environment includes:

  • Operating System (Linux)
  • Container Runtime (Docker / containerd)
  • Kubernetes Cluster (Recommended)
  • Storage System (CSI / Object Storage)
  • Optional AI / Scheduling Components

👉 Software version compatibility directly affects:

  • Deployment success rate
  • System stability
  • Future upgrade capabilities

2. Operating System

2.1 Supported Operating Systems

Distribution      | Version Requirement
Ubuntu            | ≥ 20.04 (Recommended: 22.04 LTS for better stability)
CentOS            | ≥ 7.9 (Recommendation: Migrate to Rocky / AlmaLinux 8+ as CentOS 7 is EOL)
Debian            | ≥ 11 (Recommended: 12, compatible with latest cloud-native components)
openSUSE / SLES   | Recent versions (≥ 15 SP4, ensuring compatibility with K8s/Runtime)

2.2 System Configuration

Item          | Requirement
Architecture  | x86_64 / ARM64 (x86_64 has better compatibility)
Kernel        | ≥ 4.18 (Recommended ≥ 5.x for better CSI and GPU support)
File System   | ext4 / xfs (Recommended: xfs for large-scale storage and higher IO)
Time Sync     | Mandatory (NTP / chrony), offset ≤ 1s (to avoid certificate/scheduling errors)

Configurations to avoid:

  • Stripped-down OS: Missing basic tools (curl, wget, vim) will cause script failures.
  • Disabled cgroups/namespaces: These are core dependencies for K8s; disabling them prevents services from starting.
  • Non-standard Linux: Component compatibility cannot be guaranteed on obscure or highly customized forks.
  • Mixed Architecture: Mixing x86_64 and ARM64 nodes in a single cluster is not recommended.
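
The kernel and architecture requirements above can be checked with a small script. The following is a minimal sketch, assuming bash and GNU `sort -V`; the function names `version_ge`, `check_kernel`, and `check_arch` are illustrative and are not part of CSGHub.

```shell
#!/usr/bin/env bash
# Illustrative helpers (hypothetical names, not shipped with CSGHub).

# version_ge A B: succeeds when version A is greater than or equal to version B.
# Relies on GNU `sort -V` (version sort).
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# check_kernel [RELEASE]: compare a kernel release (default: `uname -r`) with 4.18.
check_kernel() {
    local release="${1:-$(uname -r)}"
    # Keep only the part before the first "-", e.g. "5.15.0-91-generic" -> "5.15.0".
    local numeric="${release%%-*}"
    if version_ge "$numeric" "4.18"; then
        echo "kernel $release: OK"
    else
        echo "kernel $release: below 4.18"
        return 1
    fi
}

# check_arch [ARCH]: accept x86_64 and ARM64 (`uname -m` reports ARM64 as aarch64).
check_arch() {
    local arch="${1:-$(uname -m)}"
    case "$arch" in
        x86_64|aarch64|arm64) echo "arch $arch: OK" ;;
        *) echo "arch $arch: unsupported"; return 1 ;;
    esac
}
```

Calling `check_kernel` and `check_arch` without arguments inspects the current host; passing a value explicitly is useful when checking a release string collected from another node.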

3. Container Runtime

3.1 Docker (For Single-node / Simple Environments)

  • Docker: ≥ 20.10 (Recommended 24.0+)
  • Docker Compose: ≥ 2.x (Recommended 2.20+)

3.2 containerd (For Kubernetes Clusters)

  • containerd: ≥ 1.6 (Recommended 1.7+ for K8s 1.30+ compatibility)
  • runc: Recent stable version (≥ 1.1.7)

3.3 Notes

  • Most Kubernetes distributions use containerd as the default runtime; it is preferred for production due to superior performance and stability.
  • Docker should only be used for local dev/debug or small-scale single-node trials.
  • Enable mirror acceleration (e.g., Aliyun/Huawei Cloud) to prevent image pull timeouts.
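
For Docker, mirror acceleration is configured through the `registry-mirrors` key in `/etc/docker/daemon.json`. The sketch below only renders that file's content; the function name `render_docker_mirror_config` and the mirror URL in the usage comment are placeholders, and an existing `daemon.json` with other settings should be merged by hand rather than overwritten. Docker applies these mirrors to Docker Hub pulls, and containerd is configured separately.

```shell
#!/usr/bin/env bash
# render_docker_mirror_config URL...: print a daemon.json body listing the given
# registry mirrors. Hypothetical helper; the URLs are supplied by the caller.
render_docker_mirror_config() {
    local first=1 url
    printf '{\n  "registry-mirrors": ['
    for url in "$@"; do
        # Separate entries with a comma after the first one.
        if [ "$first" -eq 1 ]; then first=0; else printf ', '; fi
        printf '"%s"' "$url"
    done
    printf ']\n}\n'
}

# Usage (placeholder URL; review the output before writing it anywhere):
#   render_docker_mirror_config https://mirror.example.com | sudo tee /etc/docker/daemon.json
#   sudo systemctl restart docker
```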

4. Kubernetes Requirements (Core)

4.1 Version Requirements

  • Kubernetes: ≥ 1.30 (Recommended 1.30~1.32 to support the latest CSI/GPU plugins)
  • Helm: ≥ 3.12 (Recommended 3.14+)

4.2 Mandatory Capabilities

  • CNI Plugin: Required (Calico, Flannel, or Cilium; Cilium is recommended for AI due to eBPF acceleration).
  • CSI Driver: Required for dynamic volume management and data persistence.
  • LoadBalancer: Recommended for production (Cloud LB or MetalLB) to ensure high availability.

4.3 Additional Requirements

  • Node Count: Test/Dev ≥ 1 node; Production ≥ 3 nodes.
  • API Server Accessibility: Unrestricted communication between all nodes and the API Server.
  • RBAC: Enabled for secure permission allocation to platform components.
  • etcd: Production environments must deploy a 3-node etcd cluster to avoid single points of failure.
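
The node-count requirement can be verified from the output of `kubectl get nodes --no-headers`. This is a minimal sketch; `count_ready_nodes` and `check_node_count` are hypothetical helper names, and the sketch deliberately counts only nodes whose STATUS is exactly `Ready`, so cordoned nodes (`Ready,SchedulingDisabled`) are excluded.

```shell
#!/usr/bin/env bash
# count_ready_nodes: read `kubectl get nodes --no-headers` output on stdin and
# print how many nodes have a STATUS column (field 2) of exactly "Ready".
count_ready_nodes() {
    awk '$2 == "Ready" { n++ } END { print n + 0 }'
}

# check_node_count MIN: succeed when at least MIN Ready nodes are listed on stdin.
check_node_count() {
    local min="$1" ready
    ready="$(count_ready_nodes)"
    echo "ready nodes: $ready (required: $min)"
    [ "$ready" -ge "$min" ]
}

# Usage:
#   kubectl get nodes --no-headers | check_node_count 3   # production
#   kubectl get nodes --no-headers | check_node_count 1   # test/dev
```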

5. Storage Component Dependencies

5.1 Mandatory Capabilities

  • CSI Plugin: Compatible with K8s version; supports dynamic creation, deletion, and expansion.
  • StorageClass: A default StorageClass must be configured for automatic PV/PVC provisioning.
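
Whether a default StorageClass exists can be read from `kubectl get sc`, which appends `(default)` to the name of the default class. The helper below is a sketch with a hypothetical name (`default_storage_classes`); it only parses that listing.

```shell
#!/usr/bin/env bash
# default_storage_classes: read `kubectl get sc --no-headers` output on stdin and
# print the names of the classes that kubectl marks with "(default)".
default_storage_classes() {
    awk '$2 == "(default)" { print $1 }'
}

# Usage:
#   kubectl get sc --no-headers | default_storage_classes
#
# If nothing is printed, an existing class can be marked as default
# (replace <name> with a class from `kubectl get sc`):
#   kubectl patch storageclass <name> \
#     -p '{"metadata":{"annotations":{"storageclass.kubernetes.io/is-default-class":"true"}}}'
```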

5.2 Recommended Storage Software

Type                | Recommended Software
Distributed Block   | Longhorn (Lightweight/Simple) or Ceph (Large-scale/High load)
File Storage (RWX)  | NFS (Simple) or CephFS (High performance/Concurrent)
Object Storage      | MinIO or S3 (Compatible with large model/dataset storage)

5.3 Special Requirements

  • ReadWriteMany (RWX): Mandatory for components like Dataflow and CSGShip to allow multi-Pod data sharing.
  • High IO (SSD/NVMe): Critical for model inference (low latency) and AI training (high throughput).
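
One way to probe RWX support is to create a small test claim that requests the `ReadWriteMany` access mode. The sketch below prints such a manifest; the function name `render_rwx_pvc`, the claim name, and the StorageClass name in the usage comment are all placeholders chosen for illustration.

```shell
#!/usr/bin/env bash
# render_rwx_pvc NAME STORAGE_CLASS SIZE: print a PersistentVolumeClaim manifest
# that requests ReadWriteMany access. Hypothetical helper for a quick probe.
render_rwx_pvc() {
    cat <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: $1
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: $2
  resources:
    requests:
      storage: $3
EOF
}

# Usage (placeholder names):
#   render_rwx_pvc rwx-probe nfs-client 1Gi | kubectl apply -f -
#   kubectl get pvc rwx-probe
#   kubectl delete pvc rwx-probe
```

A claim that reaches the Bound state suggests the class can provision RWX volumes. Classes that use the WaitForFirstConsumer binding mode keep the claim Pending until a Pod mounts it, so Pending alone is not proof of failure.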

5.4 NVIDIA Components

Component          | Requirement
NVIDIA Driver      | ≥ 580 (Must support the required CUDA version)
CUDA               | Match driver (Recommended 12.2+)
Container Toolkit  | Mandatory for GPU scheduling within containers
Device Plugin      | Mandatory for K8s GPU resource management
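
Once the driver, Container Toolkit, and Device Plugin are in place, each GPU node should advertise an allocatable `nvidia.com/gpu` resource. The following sketch sums that value across nodes; `total_gpus` is a hypothetical helper name, and the input is assumed to come from the `kubectl` command shown in the comment.

```shell
#!/usr/bin/env bash
# total_gpus: read "NODE GPU_COUNT" lines on stdin and print the sum of the
# numeric counts. Nodes without GPUs show "<none>" and are skipped.
total_gpus() {
    awk '$2 ~ /^[0-9]+$/ { total += $2 } END { print total + 0 }'
}

# Usage:
#   kubectl get nodes --no-headers \
#     -o 'custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu' \
#     | total_gpus
```

A result of 0 on a cluster with GPU hardware usually points to one of the components in the table above being missing or not running.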

6. Optional Component Dependencies

6.1 AI / ML Components

  • Volcano Device Plugin: v1.11.0 (Advanced GPU scheduling like slicing/sharing).
  • TensorFlow / PyTorch: Recent versions (Platform supports one-click deployment of these images).

6.2 Observability & Security

  • Grafana / Prometheus: For monitoring GPU, network, and container metrics.
  • ELK Stack: For log collection and analysis.
  • NetworkPolicy: For Pod-to-Pod communication security via CNI.

7. Verification Checklist

Before deploying CSGHub, run these commands to ensure environment readiness:

# Check K8s nodes and version
kubectl get nodes
kubectl version

# Check Helm version
helm version

# Check for default StorageClass
kubectl get sc

# Check GPU environment (For AI scenarios)
nvidia-smi

# Check Runtime status
systemctl status containerd # or docker
containerd --version

# Check Time Sync and Kernel
timedatectl status
uname -r
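
Because the commands above fail with a generic "command not found" error when a tool is absent, it can help to check for the tools first. This is a minimal sketch; `missing_tools` is a hypothetical helper name, and the tool list in the usage comment is only a suggestion based on this page.

```shell
#!/usr/bin/env bash
# missing_tools TOOL...: print each named tool that is not found on PATH.
# Prints nothing when every tool is available.
missing_tools() {
    local tool
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Usage:
#   missing_tools kubectl helm containerd curl wget timedatectl
```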

8. Troubleshooting

  • Version Incompatibility:
    • Symptom: Component startup errors.
    • Fix: Upgrade or downgrade the software to meet the requirements; always back up etcd before K8s upgrades.
  • GPU Not Recognized:
    • Symptom: GPU resources show as 0 in K8s.
    • Fix: Verify that the NVIDIA driver supports the installed GPU; restart the container runtime after installing the Container Toolkit.
  • Storage Mounting Failure:
    • Symptom: PVC stuck in Pending.
    • Fix: Check CSI driver status and ensure the StorageClass name matches.
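
For the storage case, listing the claims that are stuck in Pending is a useful first step. The helper below is a sketch with a hypothetical name (`pending_pvcs`); it parses the output of `kubectl get pvc -A --no-headers`, where the first three columns are NAMESPACE, NAME, and STATUS.

```shell
#!/usr/bin/env bash
# pending_pvcs: read `kubectl get pvc -A --no-headers` output on stdin and print
# "namespace/name" for every claim whose STATUS column (field 3) is Pending.
pending_pvcs() {
    awk '$3 == "Pending" { print $1 "/" $2 }'
}

# Usage:
#   kubectl get pvc -A --no-headers | pending_pvcs
#   kubectl describe pvc <name> -n <namespace>   # the Events section shows the provisioning error
```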

9. Final Advice

  1. Environment: Kubernetes + containerd is the standard for production.
  2. Version Control: Strictly follow recommended versions; avoid "beta" or "preview" releases.
  3. Storage: Prioritize High IO SSDs and ensure RWX support is active.
  4. Verification: Always perform the environment check before starting the CSGHub installation script.