Network Requirements
1. Description
As a distributed cloud-native AI hosting platform, the network is the core pillar of CSGHub's stable operation. It primarily handles the following responsibilities:
- Inter-node Communication: Ensures collaboration between Kubernetes nodes for scheduling and management.
- Inter-service Calls: Facilitates efficient interaction between internal microservices.
- Data Transmission: Manages uploads and downloads of large files (models, datasets, container images), directly impacting efficiency.
- External Access: Supports user access via Web UI and APIs, ensuring stability and experience.
- AI Task Communication: Provides support for Dataflow, Runner, and inference tasks.
👉 Network performance directly determines the following key indicators:
- Task Efficiency: Latency and bandwidth affect completion speeds for AI and data processing tasks.
- Model Loading Speed: Loading large models (GB to TB scale) depends entirely on transmission capacity.
- System Stability: Jitter or packet loss leads to service timeouts, Pod restarts, and task failures.
2. Network Architecture Layers
To ensure clarity and high performance, the platform network is divided into four functional layers:
2.1 Inter-node Network
- Purpose: Communication between K8s nodes and interaction with the control plane.
- Requirements:
- Bandwidth: ≥ 1 Gbps (Test/Dev); ≥ 10 Gbps (Production).
- Latency: ≤ 1 ms (Same data center deployment).
- Packet Loss: Near 0%.
2.2 Pod Network (CNI)
- Purpose: Communication between Pods and external access to Pod services.
- Recommendations:
- Calico: High compatibility and stability, suitable for most scenarios.
- Cilium: High performance and low latency via eBPF, ideal for high-concurrency AI scenarios.
- Security: Must support NetworkPolicy for secure isolation between Pods.
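As an illustration of the kind of NetworkPolicy the chosen CNI must support, here is a minimal example; the namespace and label names (`csghub`, `app: csghub-server`, `role: frontend`) and the port are placeholders, not values from this document:

```yaml
# Allow ingress to csghub-server Pods only from Pods labeled role: frontend;
# all other ingress traffic to the selected Pods is denied.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-frontend-only
  namespace: csghub
spec:
  podSelector:
    matchLabels:
      app: csghub-server
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              role: frontend
      ports:
        - protocol: TCP
          port: 8080
```

Both Calico and Cilium enforce standard NetworkPolicy objects like this one, so the policy remains portable if the CNI is swapped later.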
2.3 Storage Network
- Purpose: Communication for distributed storage (Ceph/Longhorn), NFS/NAS, and Object Storage (S3).
- Requirements:
- Bandwidth: ≥ 10 Gbps (Essential for large file I/O).
- Isolation: Recommended independent VLAN to avoid bandwidth contention.
- Latency: As low as possible to prevent bottlenecks in training/processing tasks.
2.4 External Access Network
- Purpose: Gateway for Web UI access, API calls, and asset uploads/downloads.
- Core Components:
- Ingress/Gateway Controller: Nginx or EnvoyGateway for routing and SSL termination.
- LoadBalancer / NodePort: Use LoadBalancer for high availability in production.
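A sketch of a `type: LoadBalancer` Service fronting the ingress controller, as recommended for production; the names, namespace, and selector labels are illustrative (they follow common ingress-nginx conventions) rather than taken from this document:

```yaml
# Expose the ingress controller through a cloud LoadBalancer on 80/443.
# SSL termination happens at the ingress controller behind this Service.
apiVersion: v1
kind: Service
metadata:
  name: ingress-nginx-controller
  namespace: ingress-nginx
spec:
  type: LoadBalancer
  selector:
    app.kubernetes.io/name: ingress-nginx
  ports:
    - name: http
      port: 80
      targetPort: 80
    - name: https
      port: 443
      targetPort: 443
```

On bare-metal clusters without a cloud load balancer, the same Service can fall back to NodePort (e.g., the 30080/30443 ports listed in section 4), at the cost of built-in high availability.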
3. Scenario-specific Requirements
| Scenario | Bandwidth Requirement | Storage Network | External Egress |
|---|---|---|---|
| Test / Dev | ≥ 100 Mbps | Shared with node network | N/A |
| SME Production | ≥ 1 Gbps per node | ≥ 10 Gbps (recommended) | ≥ 100 Mbps |
| Large-scale AI | ≥ 10 Gbps per node | 10–25 Gbps | ≥ 1 Gbps |
3.1 Large File Transfer Optimization
CSGHub routinely transfers massive files: models at GB–TB scale, large datasets, and container images. These place significant pressure on the network.
3.2 Optimization Suggestions
- Deploy an internal image registry (e.g., Harbor) to save external bandwidth.
- Prioritize Object Storage (S3) for models and datasets to leverage high concurrency.
- Avoid cross-region deployment; prefer same-datacenter setups to minimize latency.
4. Port Requirements
| Component | Port Configuration |
|---|---|
| Web / API / Git SSH | 80 / 443 / 22 (K8s NodePort: 30080 / 30443 / 30022) |
| Casdoor (Auth) | 8000 (Docker deployment only) |
| CSGHub / API | 8001 / 8002 (Docker deployment only) |
| MinIO (S3) | 9000 (Docker deployment only) |
5. High-Performance Network Advice (AI / Training)
For scenarios like distributed training or large-scale inference:
- RDMA: Remote Direct Memory Access for low-latency, high-bandwidth data transfer.
- InfiniBand: High-performance interconnect for large GPU clusters.
- GPU Direct: Direct GPU-to-GPU communication without CPU intervention.
6. Bandwidth Estimation Methods
6.1 Model Download Estimation
Bandwidth ≈ Model Size × Concurrent Downloads / Target Download Time
Example: a 10 GB model, 10 concurrent users, 60 s target time → (10 GB × 8 bits/byte × 10) / 60 s ≈ 13.3 Gbps; provision with headroom (e.g., ≥ 16 Gbps).
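The estimate above can be sketched as a small helper; the function name and the explicit gigabyte-to-gigabit conversion are illustrative choices, not part of this document:

```python
def download_bandwidth_gbps(model_size_gb: float,
                            concurrent_downloads: int,
                            target_seconds: float) -> float:
    """Estimate required bandwidth in Gbps for concurrent model downloads.

    Converts gigabytes to gigabits (x8) before dividing by the target time.
    """
    total_gigabits = model_size_gb * 8 * concurrent_downloads
    return total_gigabits / target_seconds

# 10 GB model, 10 concurrent users, 60 s target time
print(round(download_bandwidth_gbps(10, 10, 60), 1))  # → 13.3
```

Note the ×8 factor: model sizes are quoted in gigabytes while link capacity is quoted in gigabits per second, and skipping the conversion understates the requirement by almost an order of magnitude.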
6.2 Dataset Transfer Estimation
Bandwidth ≈ Dataset Size / Expected Loading Time
Recommend reserving ~30% redundancy.
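A corresponding sketch for the dataset estimate, with the ~30% redundancy folded in; the function name, the example dataset size, and the load time are assumptions for illustration:

```python
def dataset_bandwidth_gbps(dataset_size_gb: float,
                           load_seconds: float,
                           redundancy: float = 0.30) -> float:
    """Estimate bandwidth (Gbps) to load a dataset, with ~30% headroom."""
    base_gbps = dataset_size_gb * 8 / load_seconds  # GB -> Gb, per second
    return base_gbps * (1 + redundancy)

# Hypothetical: 500 GB dataset to be loaded within 10 minutes (600 s)
print(round(dataset_bandwidth_gbps(500, 600), 2))  # → 8.67
```

The redundancy factor absorbs protocol overhead and transient congestion, so the plan target stays achievable even when the link is not perfectly clean.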
7. Common Risks and Troubleshooting
- Insufficient Bandwidth: Slow model pulls or failed Pod starts. Suggestion: Upgrade bandwidth or use internal registries.
- High Latency: Service timeouts or UI lag. Suggestion: Use same-datacenter deployment or RDMA.
- Instability: Frequent Pod restarts or storage disconnects. Suggestion: Check switches and VLAN isolation.
- DNS Issues: Service inaccessible or image pull failures. Suggestion: Use stable DNS and check resolution.
8. Recommended Network Topology
- Simple Architecture (SME Production): Cost-effective for 10-100 person teams.
- High-Performance Architecture (Large-scale/AI): Multi-tenant support with high availability and RDMA support.
9. Summary Recommendations
- Baseline: Test/Dev ≥ 100 Mbps; Production ≥ 1 Gbps.
- Production Pick: Node & Storage bandwidth ≥ 10 Gbps.
- Storage First: Ensure high bandwidth and low latency; use independent VLANs.
- AI Optimization: Leverage RDMA/InfiniBand for distributed efficiency.
- Risk Control: Regular network health checks and reserved bandwidth redundancy.