Western Digital has announced that its OpenFlex Data24 4000 Series storage platform delivered high-performance results in the latest MLPerf Storage V2 benchmarks. These tests, widely recognized as a standard for measuring how well storage serves Artificial Intelligence workloads, show the platform’s ability to supply data to powerful AI systems at very high speeds, a critical factor in developing and running complex AI applications. The tests were conducted in collaboration with PEAK:AIO and used KIOXIA’s NVMe SSDs.
Key Takeaways:
- Western Digital’s OpenFlex Data24 platform demonstrated high-speed data delivery for AI tasks in the MLPerf Storage V2 tests.
- In one key test, the system reached a sustained read throughput of 106.5 GB/s, feeding 36 simulated NVIDIA H100 GPUs.
- The system uses NVMe-oF technology, which allows for fast, shared access to storage, similar to having the drives directly attached to each computer.
- The validation involved partners PEAK:AIO, providing a specialized AI data server, and KIOXIA, whose CM7-V Series SSDs were used in the storage platform.
The growth of AI in India and globally has created a massive need for powerful computing hardware. At the heart of this are Graphics Processing Units (GPUs), like the NVIDIA H100, which handle the intense calculations for training AI models. However, these GPUs need data to be fed to them at extremely high speeds. If the storage system is slow, the expensive GPUs sit idle, waiting for data. This slowdown increases the time and cost of AI projects. Western Digital’s recent benchmark results address this exact problem.
The company’s OpenFlex Data24 is what is known as an “Ethernet bunch of flash” (EBOF). It uses a technology called NVMe-oF (Non-Volatile Memory Express over Fabrics) to connect multiple servers to a shared pool of fast flash storage over a standard Ethernet network. This setup, known as disaggregated storage, allows companies to scale their computing power and storage capacity independently, offering more flexibility and potentially lower costs. For example, a company can add more GPUs without having to buy new storage servers each time.
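To make the NVMe-oF idea concrete, the sketch below shows how a Linux host might discover and attach remote namespaces from an EBOF using the standard nvme-cli utility, wrapped in Python. This is a minimal illustration, not Western Digital’s tooling: the transport, IP address, and NQN are hypothetical placeholders, not values from the benchmark setup.

```python
# Minimal sketch: attach NVMe-oF namespaces from an EBOF with nvme-cli.
# Requires root and the nvme-cli package; all addresses/NQNs below are
# hypothetical placeholders, not the benchmarked configuration.
import subprocess

EBOF_ADDR = "192.168.10.20"   # hypothetical IP of one EBOF Ethernet port
NVME_PORT = "4420"            # conventional NVMe-oF service port

def discover_targets(addr: str, port: str) -> str:
    """List the NVMe-oF subsystems advertised at this address."""
    out = subprocess.run(
        ["nvme", "discover", "-t", "rdma", "-a", addr, "-s", port],
        capture_output=True, text=True, check=True,
    )
    return out.stdout

def connect_namespace(addr: str, port: str, nqn: str) -> None:
    """Attach one remote namespace; it then appears as a local /dev/nvmeXnY."""
    subprocess.run(
        ["nvme", "connect", "-t", "rdma", "-a", addr, "-s", port, "-n", nqn],
        check=True,
    )

if __name__ == "__main__":
    print(discover_targets(EBOF_ADDR, NVME_PORT))
    # In a real setup the NQN is copied from the discovery output:
    connect_namespace(EBOF_ADDR, NVME_PORT, "nqn.2020-01.example:ebof-slot0")
```

Once attached, the remote drive behaves like a local NVMe device, which is what lets compute nodes and the flash pool scale independently.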
The MLPerf benchmarks simulate real-world AI workloads. The tests included:
- 3D U-Net: This workload, common in medical imaging, involves reading large files and places a heavy demand on storage bandwidth. In this test, the OpenFlex Data24 platform on its own saturated 36 simulated H100 GPUs at a sustained read throughput of 106.5 GB/s. When configured with a PEAK:AIO AI Data Server, it delivered 64.9 GB/s to 22 simulated GPUs from a single server (a per-GPU breakdown of these figures is sketched after this list).
- ResNet50: This is a common image-classification workload that involves reading many smaller files. Here, the platform supported 186 simulated H100 GPUs. With the PEAK:AIO server, it saturated 52 simulated GPUs from a single point of connection.
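As a quick sanity check on the 3D U-Net figures, a few lines of Python show what each simulated GPU demands from storage. Only the published throughput and GPU counts are used; the per-GPU rate is derived arithmetic, not a reported number.

```python
# Back-of-the-envelope check on the published 3D U-Net figures: dividing
# each reported throughput by the number of simulated H100 GPUs it fed
# gives the per-GPU bandwidth this workload demands.
results = {
    "3D U-Net, EBOF alone":        (106.5, 36),  # GB/s, simulated GPUs
    "3D U-Net, via PEAK:AIO node": (64.9, 22),
}
for name, (gbps, gpus) in results.items():
    print(f"{name}: {gbps / gpus:.2f} GB/s per simulated GPU")
# Both configurations land near ~2.95 GB/s per GPU, the sustained read
# rate each simulated H100 needs to stay busy on this workload.
```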
These results are significant for India’s rapidly growing data center market. As more Indian companies adopt AI, the need for efficient infrastructure that maximizes the use of costly GPUs becomes very important. Technologies like the one tested by Western Digital offer a path to build scalable and cost-effective AI systems.
“These results validate Western Digital’s disaggregated architecture as a powerful enabler and cornerstone of next-generation AI infrastructure, maximizing GPU utilization while minimizing footprint, complexity and overall total cost of ownership,” said Kurt Chan, a vice president at Western Digital.
The setup’s ability to connect up to 12 host machines without a switch simplifies the network design for smaller AI clusters, making high-performance AI more accessible.
Related FAQs
Ques 1: What is NVMe-oF and why is it important for AI?
Ans: NVMe-oF stands for Non-Volatile Memory Express over Fabrics. It is a technology that extends the high performance of NVMe flash storage over an Ethernet network. For AI, this allows multiple servers with powerful GPUs to share a central, high-speed storage pool with low latency. This improves efficiency and simplifies data management for scalable, disaggregated AI infrastructure.
Ques 2: What does disaggregated storage mean?
Ans: Disaggregated storage is an IT infrastructure design where the storage is separate from the compute (CPU and GPU) resources. They are connected by a high-speed network, like Ethernet. This allows an organization to scale its storage and compute independently based on demand, which offers more flexibility.
Ques 3: What is the role of the simulated NVIDIA H100 GPU in these tests?
Ans: In the MLPerf tests, simulated H100 GPUs are used to generate I/O load patterns typical of real-world AI servers accessing storage during training. The goal is to evaluate how many of these powerful GPUs the storage system can effectively support without causing a bottleneck. The number of GPUs the storage can keep “saturated”, meaning fed with data fast enough that they do not stall, is a key performance metric.
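The toy sketch below illustrates the idea behind a simulated-GPU load test: stream a dataset the way a training loop would and measure whether storage can sustain each accelerator’s demand. It is not the actual MLPerf Storage harness; the file path is hypothetical, and the ~2.96 GB/s per-GPU figure is the one derived from the 3D U-Net result above.

```python
# Toy illustration (not the MLPerf Storage harness) of measuring how many
# simulated GPUs a storage path could keep fed. Path and per-GPU rate
# are assumptions for the sketch.
import time

CHUNK = 8 * 1024 * 1024      # 8 MiB reads, typical of large-file workloads
PER_GPU_NEED_GBPS = 2.96     # derived above from 106.5 GB/s / 36 GPUs

def measure_read_gbps(path: str, seconds: float = 5.0) -> float:
    """Stream a file sequentially and return the observed rate in GB/s."""
    read = 0
    start = time.perf_counter()
    with open(path, "rb") as f:
        while time.perf_counter() - start < seconds:
            buf = f.read(CHUNK)
            if not buf:
                f.seek(0)    # loop over the file, like another epoch
                continue
            read += len(buf)
    return read / (time.perf_counter() - start) / 1e9

if __name__ == "__main__":
    gbps = measure_read_gbps("/mnt/dataset/shard0.bin")  # hypothetical path
    print(f"Observed {gbps:.1f} GB/s -> could saturate "
          f"~{int(gbps / PER_GPU_NEED_GBPS)} simulated GPUs")
```

A real benchmark run also accounts for caching, parallel clients, and compute time per batch, which is why MLPerf uses a standardized harness rather than a simple loop like this.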
Ques 4: Why was the PEAK:AIO AI Data Server used in some tests?
Ans: PEAK:AIO’s AI Data Server is a high-performance software-defined storage (SDS) solution. It was used to test the Western Digital platform in a realistic deployment scenario, where a software layer manages and serves large volumes of data at high speed to GPU infrastructure. The collaboration demonstrates the performance of the hardware both on its own and with an SDS layer.
Ques 5: How do these results benefit a company?
Ans: These results show that the storage architecture can maximize GPU utilization while minimizing complexity and overall total cost of ownership. By getting faster results and reducing infrastructure sprawl, companies can scale their AI workloads confidently without the high upfront costs or power demands of some alternative solutions. The platform also allows up to 12 hosts to be connected without a switch, simplifying deployment.