Submitted

Submission 31 Jul 2024 7:28:18pm EDT

In this talk, we will present recent advances in network communication protocols, focusing on the extension of TCP to support Collective Communication (CC) semantics. This effort originated as device memory TCP (Devmem TCP), introduced by Google at the last NetDev conference, and marks a pivotal evolution in the AI network landscape. Today, NCCL is the predominant library for mapping CC semantics onto both RDMA and TCP. Unlike traditional point-to-point communication, CC enables communication within a group of peers, which is crucial for the increasingly complex and performance-sensitive interactions in AI networks.

These enhancements simplify the framework for CC semantics and introduce direct device access and the potential for random access, moving beyond conventional stream-only access. They are essential for a broad range of applications across AI, high-performance computing (HPC), and storage, including NVMe over TCP. We anticipate that the evolution of TCP semantics will inspire diverse implementations within the industry, as Google's Falcon and AWS's EFA exemplify under RDMA semantics. Our work extends these innovations to TCP, broadening its applicability and its potential for widespread adoption.

For practical deployment, we have enabled this enhanced TCP on Intel NICs, specifically the Intel IPU series with the IDPF driver, so that it can be used in established environments such as MPI for HPC and NCCL for AI. During the session, we will discuss our implementation strategy for these NICs and give an update on our progress. We will also present detailed performance data showing that the enhanced TCP is comparable to RDMA and outperforms standard TCP, particularly at larger packet sizes. The talk aims to give a comprehensive overview of this work and its potential impact on the future of network communications.

A. Singhai, S. He, S. Samudrala

Anjali Singhai <anjali.singhai@intel.com>

Shaopeng He <shaopeng.he@intel.com>

Sridhar Samudrala <sridhar.samudrala@intel.com>

Submission Type: Talk
Submission Label: Nuts and Bolts
Estimated Length Of Time For Presentation (in minutes): 30
Attendance: Physically