Software Requirements
1. Description
CSGHub is built on a cloud-native technology stack. Its core runtime environment includes:
- Operating System (Linux)
- Container Runtime (Docker / containerd)
- Kubernetes Cluster (Recommended)
- Storage System (CSI / Object Storage)
- Optional AI / Scheduling Components
👉 Software version compatibility directly affects:
- Deployment success rate
- System stability
- Future upgrade capabilities
2. Operating System
2.1 Supported Operating Systems
| Distribution | Version Requirement |
|---|---|
| Ubuntu | ≥ 20.04 (Recommended: 22.04 LTS for better stability) |
| CentOS | ≥ 7.9 (Recommendation: Migrate to Rocky / AlmaLinux 8+ as CentOS 7 is EOL) |
| Debian | ≥ 11 (Recommended: 12, compatible with latest cloud-native components) |
| openSUSE / SLES | Recent versions (≥ 15 SP4, ensuring compatibility with K8s/Runtime) |
2.2 System Configuration
| Item | Requirement |
|---|---|
| Architecture | x86_64 / ARM64 (x86_64 has better compatibility) |
| Kernel | ≥ 4.18 (Recommended ≥ 5.x for better CSI and GPU support) |
| File System | ext4 / xfs (Recommended: xfs for large-scale storage and higher IO) |
| Time Sync | Mandatory (NTP / chrony), offset ≤ 1s (to avoid certificate/scheduling errors) |
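The kernel floor in the table above can be spot-checked mechanically. The helper below is a minimal pure-shell sketch (the 4.18 minimum comes from the table; the function name `kernel_ok` is ours, not a standard tool):

```shell
# kernel_ok VERSION -> succeeds when VERSION is at least 4.18 (the table's floor)
kernel_ok() {
  v=${1%%-*}                                # strip "-generic"-style suffixes
  major=${v%%.*}
  minor=$(printf '%s' "$v" | cut -d. -f2)
  [ "$major" -gt 4 ] || { [ "$major" -eq 4 ] && [ "$minor" -ge 18 ]; }
}

if kernel_ok "$(uname -r)"; then
  echo "kernel $(uname -r) meets the >= 4.18 requirement"
fi
```

Time-sync status is covered by the `timedatectl status` check in the verification checklist (Section 7).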
2.3 Not Recommended
- Stripped-down OS: Missing basic tools (curl, wget, vim) will cause script failures.
- Disabled cgroup/namespace: Core dependencies for K8s; disabling them prevents services from starting.
- Non-standard Linux: Component compatibility cannot be guaranteed on obscure or highly customized forks.
- Mixed Architecture: Not recommended to mix x86_64 and ARM64 nodes in a single cluster.
3. Container Runtime
3.1 Docker (For Single-node / Simple Environments)
- Docker: ≥ 20.10 (Recommended 24.0+)
- Docker Compose: ≥ 2.x (Recommended 2.20+)
3.2 Containerd (Recommended for Production)
- containerd: ≥ 1.6 (Recommended 1.7+ for K8s 1.30+ compatibility)
- runc: Recent stable version (≥ 1.1.7)
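To confirm the installed runtime versions on a node, and that containerd's effective config uses the systemd cgroup driver (the usual pairing with kubelet), something like the following can be run (assumes containerd and runc are already installed):

```shell
containerd --version
runc --version
# Kubernetes docs recommend SystemdCgroup = true for the runc shim;
# inspect the effective default configuration:
containerd config default | grep SystemdCgroup
```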
3.3 Notes
- Kubernetes uses containerd by default; it is preferred for production due to superior performance and stability.
- Docker should only be used for local dev/debug or small-scale single-node trials.
- Enable mirror acceleration (e.g., Aliyun/Huawei Cloud) to prevent image pull timeouts.
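For Docker, mirror acceleration is enabled via `/etc/docker/daemon.json`; the sketch below uses a placeholder endpoint (substitute your provider's accelerator URL — containerd configures mirrors separately in its own `config.toml`):

```shell
# Placeholder endpoint -- substitute your cloud provider's accelerator URL.
sudo tee /etc/docker/daemon.json <<'EOF'
{
  "registry-mirrors": ["https://<your-mirror-endpoint>"]
}
EOF
sudo systemctl restart docker
```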
4. Kubernetes Requirements (Core)
4.1 Version Requirements
- Kubernetes: ≥ 1.30 (Recommended 1.30~1.32 to support the latest CSI/GPU plugins)
- Helm: ≥ 3.12 (Recommended 3.14+)
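The version floors above can be gated mechanically before installation. The helper below is a small pure-shell sketch using `sort -V`; feed it the versions reported by `kubectl version` and `helm version` (the function name `version_ge` is ours):

```shell
# version_ge ACTUAL MINIMUM -> succeeds when ACTUAL >= MINIMUM (semver-ish compare)
version_ge() {
  [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n1)" = "$2" ]
}

version_ge "1.31.2" "1.30" && echo "Kubernetes 1.31.2 meets the >= 1.30 floor"
version_ge "3.14.0" "3.12" && echo "Helm 3.14.0 meets the >= 3.12 floor"
```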
4.2 Mandatory Capabilities
- CNI Plugin: Required (Calico, Flannel, or Cilium; Cilium is recommended for AI due to eBPF acceleration).
- CSI Driver: Required for dynamic volume management and data persistence.
- LoadBalancer: Recommended for production (Cloud LB or MetalLB) to ensure high availability.
4.3 Additional Requirements
- Node Count: Test/Dev ≥ 1 node; Production ≥ 3 nodes.
- API Server Accessibility: Unrestricted communication between all nodes and the API Server.
- RBAC: Enabled for secure permission allocation to platform components.
- etcd: Production environments must deploy a 3-node etcd cluster to avoid single points of failure.
5. Storage Component Dependencies
5.1 Mandatory Capabilities
- CSI Plugin: Compatible with K8s version; supports dynamic creation, deletion, and expansion.
- StorageClass: A default StorageClass must be configured for automatic PV/PVC provisioning.
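Marking an existing class as the cluster default is a one-line patch; `longhorn` below is an assumed class name — substitute whatever `kubectl get sc` reports:

```shell
kubectl patch storageclass longhorn -p \
  '{"metadata": {"annotations": {"storageclass.kubernetes.io/is-default-class": "true"}}}'
kubectl get sc    # the chosen class should now show "(default)"
```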
5.2 Recommended Storage Solutions
| Type | Recommended Software |
|---|---|
| Distributed Block | Longhorn (Lightweight/Simple) or Ceph (Large-scale/High load) |
| File Storage (RWX) | NFS (Simple) or CephFS (High performance/Concurrent) |
| Object Storage | MinIO or S3 (Compatible with large model/dataset storage) |
5.3 Special Requirements
- ReadWriteMany (RWX): Mandatory for components like Dataflow and CSGShip to allow multi-Pod data sharing.
- High IO (SSD/NVMe): Critical for model inference (low latency) and AI training (high throughput).
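A quick way to verify RWX support is to request it explicitly. The claim below is a throwaway sketch (the name `rwx-probe`, class `cephfs`, and size are all assumptions — adjust to your environment):

```shell
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: rwx-probe              # hypothetical test claim
spec:
  accessModes: ["ReadWriteMany"]
  storageClassName: cephfs     # assumption: an RWX-capable class exists
  resources:
    requests:
      storage: 1Gi
EOF
kubectl get pvc rwx-probe      # should reach Bound if the backend supports RWX
```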
5.4 NVIDIA Components
| Component | Requirement |
|---|---|
| NVIDIA Driver | ≥ 580 (Must support the required CUDA version) |
| CUDA | Match driver (Recommended 12.2+) |
| Container Toolkit | Mandatory for GPU scheduling within containers |
| Device Plugin | Mandatory for K8s GPU resource management |
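Once all four components are installed, the chain can be verified end to end — host driver first, then what the Device Plugin advertises to Kubernetes:

```shell
nvidia-smi                                          # host driver sees the GPU
kubectl describe nodes | grep -i 'nvidia.com/gpu'   # Device Plugin exposes GPUs as schedulable resources
```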
6. Optional Component Dependencies
6.1 AI / ML Components
- Volcano Device Plugin: v1.11.0 (Advanced GPU scheduling like slicing/sharing).
- TensorFlow / PyTorch: Recent versions (Platform supports one-click deployment of these images).
6.2 Observability & Security
- Grafana / Prometheus: For monitoring GPU, network, and container metrics.
- ELK Stack: For log collection and analysis.
- NetworkPolicy: For Pod-to-Pod communication security via CNI.
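As a NetworkPolicy sketch, the manifest below denies all ingress traffic within a namespace (the policy name and the `csghub` namespace are assumptions; enforcement requires a CNI that implements NetworkPolicy, e.g. Calico or Cilium):

```shell
kubectl apply -f - <<'EOF'
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress   # hypothetical policy name
  namespace: csghub            # assumption: CSGHub's namespace
spec:
  podSelector: {}              # selects every Pod in the namespace
  policyTypes: ["Ingress"]     # no ingress rules listed -> all ingress denied
EOF
```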
7. Verification Checklist
Before deploying CSGHub, run these commands to confirm the environment is ready:

```shell
# Check K8s nodes and version
kubectl get nodes
kubectl version              # note: the --short flag was removed in recent kubectl
# Check Helm version
helm version
# Check for default StorageClass
kubectl get sc
# Check GPU environment (for AI scenarios)
nvidia-smi
# Check runtime status
systemctl status containerd  # or docker
containerd --version
# Check time sync and kernel
timedatectl status
uname -r
```
8. Troubleshooting
- Version Incompatibility:
- Symptom: Component startup errors.
- Fix: Upgrade/downgrade software to meet the requirements; always back up etcd before K8s upgrades.
- GPU Not Recognized:
- Symptom: GPU resources show as 0 in K8s.
- Fix: Verify NVIDIA Driver matches GPU; restart runtime after installing Toolkit.
- Storage Mounting Failure:
- Symptom: PVC stuck in Pending.
- Fix: Check CSI driver status and ensure the StorageClass name matches.
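For the GPU case above, a common remediation on containerd hosts is to (re)wire the NVIDIA runtime and restart the services; `nvidia-ctk` ships with the Container Toolkit (the device-plugin label below is an assumption based on the default manifest):

```shell
sudo nvidia-ctk runtime configure --runtime=containerd   # writes the nvidia runtime into containerd's config
sudo systemctl restart containerd
# Force the Device Plugin to re-register (assumption: default DaemonSet label)
kubectl -n kube-system delete pod -l name=nvidia-device-plugin-ds
```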
9. Final Advice
- Environment: Kubernetes + containerd is the standard for production.
- Version Control: Strictly follow recommended versions; avoid "beta" or "preview" releases.
- Storage: Prioritize High IO SSDs and ensure RWX support is active.
- Verification: Always perform the environment check before starting the CSGHub installation script.