Network Requirements
1. Description
As a distributed cloud-native AI hosting platform, the network is the core pillar of CSGHub's stable operation. It primarily handles the following responsibilities:
- Inter-node Communication: Ensures collaboration between Kubernetes nodes for scheduling and management.
- Inter-service Calls: Facilitates efficient interaction between internal microservices.
- Data Transmission: Manages uploads and downloads of large files (models, datasets, container images), directly impacting efficiency.
- External Access: Supports user access via Web UI and APIs, ensuring stability and experience.
- AI Task Communication: Provides support for Dataflow, Runner, and inference tasks.
👉 Network performance directly determines the following key indicators:
- Task Efficiency: Latency and bandwidth affect completion speeds for AI and data processing tasks.
- Model Loading Speed: Loading large models (GB to TB scale) depends entirely on transmission capacity.
- System Stability: Jitter or packet loss leads to service timeouts, Pod restarts, and task failures.
2. Network Architecture Layers
To ensure clarity and high performance, the platform network is divided into four functional layers:
2.1 Inter-node Network
- Purpose: Communication between K8s nodes and interaction with the control plane.
- Requirements:
- Bandwidth: ≥ 1 Gbps (Test/Dev); ≥ 10 Gbps (Production).
- Latency: ≤ 1 ms (Same data center deployment).
- Packet Loss: Near 0%.
2.2 Pod Network (CNI)
- Purpose: Communication between Pods and external access to Pod services.
- Recommendations:
- Calico: High compatibility and stability, suitable for most scenarios.
- Cilium: High performance and low latency via eBPF, ideal for high-concurrency AI scenarios.
- Security: Must support NetworkPolicy for secure isolation between Pods.
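As an illustration of the kind of NetworkPolicy the chosen CNI must support, here is a minimal example; the namespace and label names (`csghub`, `app: csghub-server`, `role: frontend`) and the port are placeholders, not values from this document:

```yaml
# Allow ingress to csghub-server Pods only from Pods labeled role: frontend;
# all other ingress traffic to the selected Pods is denied.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-frontend-only
  namespace: csghub
spec:
  podSelector:
    matchLabels:
      app: csghub-server
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              role: frontend
      ports:
        - protocol: TCP
          port: 8080
```

Both Calico and Cilium enforce standard NetworkPolicy objects like this one, so the policy remains portable if the CNI is swapped later.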
2.3 Storage Network
- Purpose: Communication for distributed storage (Ceph/Longhorn), NFS/NAS, and Object Storage (S3).
- Requirements:
- Bandwidth: ≥ 10 Gbps (Essential for large file I/O).
- Isolation: Recommended independent VLAN to avoid bandwidth contention.
- Latency: As low as possible to prevent bottlenecks in training/processing tasks.
2.4 External Access Network
- Purpose: Gateway for Web UI access, API calls, and asset uploads/downloads.
- Core Components:
- Ingress/Gateway Controller: Nginx or EnvoyGateway for routing and SSL termination.
- LoadBalancer / NodePort: Use LoadBalancer for high availability in production.
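A sketch of a `type: LoadBalancer` Service fronting the ingress controller, as recommended for production; the names, namespace, and selector labels are illustrative (they follow common ingress-nginx conventions) rather than taken from this document:

```yaml
# Expose the ingress controller through a cloud LoadBalancer on 80/443.
# SSL termination happens at the ingress controller behind this Service.
apiVersion: v1
kind: Service
metadata:
  name: ingress-nginx-controller
  namespace: ingress-nginx
spec:
  type: LoadBalancer
  selector:
    app.kubernetes.io/name: ingress-nginx
  ports:
    - name: http
      port: 80
      targetPort: 80
    - name: https
      port: 443
      targetPort: 443
```

On bare-metal clusters without a cloud load balancer, the same Service can fall back to NodePort (e.g., the 30080/30443 ports listed in section 4), at the cost of built-in high availability.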
3. Scenario-specific Requirements
| Scenario | Bandwidth Requirement | Storage Network | External Egress |
|---|---|---|---|
| Test / Dev | ≥ 100 Mbps | Shared with node network | N/A |
| SME Production | ≥ 1 Gbps per node | ≥ 10 Gbps (recommended) | ≥ 100 Mbps |
| Large-scale AI | ≥ 10 Gbps per node | 10–25 Gbps | ≥ 1 Gbps |
3.1 Large File Transfer Optimization
CSGHub routinely transfers massive files: models at GB–TB scale, large datasets, and container images. These place significant pressure on the network.
3.2 Optimization Suggestions
- Deploy an internal image registry (e.g., Harbor) to save external bandwidth.
- Prioritize Object Storage (S3) for models and datasets to leverage high concurrency.
- Avoid cross-region deployment; prefer same-datacenter setups to minimize latency.
4. Port Requirements
| Component | Port Configuration |
|---|---|
| Web / API / Git SSH | 80 / 443 / 22 (K8s NodePort: 30080 / 30443 / 30022) |
| Casdoor (Auth) | 8000 (Docker deployment only) |
| CSGHub / API | 8001 / 8002 (Docker deployment only) |
| MinIO (S3) | 9000 (Docker deployment only) |
5. High-Performance Network Advice (AI / Training)
For scenarios like distributed training or large-scale inference:
- RDMA: Remote Direct Memory Access for low-latency, high-bandwidth data transfer.
- InfiniBand: High-performance interconnect for large GPU clusters.
- GPU Direct: Direct GPU-to-GPU communication without CPU intervention.
6. Bandwidth Estimation Methods
6.1 Model Download Estimation
Bandwidth ≈ Model Size × Concurrent Downloads / Target Download Time
Example: a 10 GB model, 10 concurrent users, 60 s target time → (10 GB × 8 bits/byte × 10) / 60 s ≈ 13.3 Gbps; provision with headroom (e.g., ≥ 16 Gbps).
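The estimate above can be sketched as a small helper; the function name and the explicit gigabyte-to-gigabit conversion are illustrative choices, not part of this document:

```python
def download_bandwidth_gbps(model_size_gb: float,
                            concurrent_downloads: int,
                            target_seconds: float) -> float:
    """Estimate required bandwidth in Gbps for concurrent model downloads.

    Converts gigabytes to gigabits (x8) before dividing by the target time.
    """
    total_gigabits = model_size_gb * 8 * concurrent_downloads
    return total_gigabits / target_seconds

# 10 GB model, 10 concurrent users, 60 s target time
print(round(download_bandwidth_gbps(10, 10, 60), 1))  # → 13.3
```

Note the ×8 factor: model sizes are quoted in gigabytes while link capacity is quoted in gigabits per second, and skipping the conversion understates the requirement by almost an order of magnitude.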
6.2 Dataset Transfer Estimation
Bandwidth ≈ Dataset Size / Expected Loading Time
Recommend reserving ~30% redundancy.
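A corresponding sketch for the dataset estimate, with the ~30% redundancy folded in; the function name, the example dataset size, and the load time are assumptions for illustration:

```python
def dataset_bandwidth_gbps(dataset_size_gb: float,
                           load_seconds: float,
                           redundancy: float = 0.30) -> float:
    """Estimate bandwidth (Gbps) to load a dataset, with ~30% headroom."""
    base_gbps = dataset_size_gb * 8 / load_seconds  # GB -> Gb, per second
    return base_gbps * (1 + redundancy)

# Hypothetical: 500 GB dataset to be loaded within 10 minutes (600 s)
print(round(dataset_bandwidth_gbps(500, 600), 2))  # → 8.67
```

The redundancy factor absorbs protocol overhead and transient congestion, so the plan target stays achievable even when the link is not perfectly clean.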
7. Common Risks and Troubleshooting
- Insufficient Bandwidth: Slow model pulls or failed Pod starts. Suggestion: Upgrade bandwidth or use internal registries.
- High Latency: Service timeouts or UI lag. Suggestion: Use same-datacenter deployment or RDMA.
- Instability: Frequent Pod restarts or storage disconnects. Suggestion: Check switches and VLAN isolation.
- DNS Issues: Service inaccessible or image pull failures. Suggestion: Use stable DNS and check resolution.
8. Recommended Network Topology
- Simple Architecture (SME Production): Cost-effective for 10-100 person teams.
- High-Performance Architecture (Large-scale/AI): Multi-tenant support with high availability and RDMA support.
9. Summary Recommendations
- Baseline: Test/Dev ≥ 100 Mbps; Production ≥ 1 Gbps.
- Production Pick: Node & Storage bandwidth ≥ 10 Gbps.
- Storage First: Ensure high bandwidth and low latency; use independent VLANs.
- AI Optimization: Leverage RDMA/InfiniBand for distributed efficiency.
- Risk Control: Regular network health checks and reserved bandwidth redundancy.