This is Intel Gaudi 3 the New 128GB HBM2e AI Chip in the Wild

April 19, 2024

Intel Gaudi 3 OAM Working Sample Package 1

At Intel Vision 2024, we saw the Intel Gaudi 3 AI accelerator. This is Intel’s newest dedicated AI accelerator in a very hot market. The new Gaudi 3 is a big leap over the previous generation Gaudi 2, which we have seen before, and is set to go into volume production later in 2024.

As a quick note: I attended the Intel Vision 2024 as an “Influencer.” The press was not invited, but analysts were. The event is being held about 25 min from the STH studio in Scottsdale, so that is how I got a badge. If you want the sub-30 second short on our new shorts channel, here you go:

Intel Gaudi 3 128GB HBM2e AI Chip in the Wild

Of course, most folks like to see specs, and the Gaudi 3 is an interesting part. The new part still uses HBM2e but is using eight stacks to achieve 128GB of capacity. Intel has up to 1.835PFLOPS of FP8 compute. It should also be noted that simply supporting FP8 is non-trivial as some accelerators do not support it at this point. The new part has 64 tensor processor cores and 8 matrix math engines.

We have been tracking Habana’s Gaudi since 2019, and the company later in 2019 was acquired by Intel when it had its Gaudi 1 generation. That was at the point that Nervana was still around and a lot of the action was happening around Facebook’s choice for AI acceleration. Since 2019’s Intel acquisition, there was a Gaudi 2 generation that Intel started putting effort behind in late 2022 as AI took off. The Gaudi 3 is the successor with huge upgrades in compute and bandwidth, and a shrink from 7nm to 5nm.

Just for some sense of scale as to how much bigger Gaudi 3 is, here is Gaudi 3 next to Gaudi 2.

Here is Gaudi 2 and Gaudi 3 in hand. It is plain to see when you are standing next to them that the Gaudi 3 silicon package is much larger.

On that silicon package, there are two dies, each with 48MB of SRAM, 16 tensor processing cores, and a media engine. Something that Habana Gaudi 1 did, and we have seen continue, is that Gaudi 3 uses Ethernet to scale up and out. In this generation, that means the 24 network interfaces are 200GbE up from 100GbE in Gaudi 2 and only 10x 100GbE in Gaudi 1. Intel uses Ethernet as the interconnect between AI accelerators in a chassis, but also to scale out to multiple AI accelerators in a data center. Compare that to NVIDIA with NVLink/ NVSwitch, Infiniband, and Ethernet in a HGX H100 platform.

Here is a closer look at the die. The thermal paste is there because this was a working card that was pulled. If you are wondering what is between the 8x 16GB HBM2e packages, we were told that is filler silicon to make the structure of the package.

Intel Gaudi 3 OAM Working Sample Package Close 1

One can also see the line between the two pieces of silicon that make up the main compute, SRAM, and networking portion of the AI accelerator.

Of course, a single AI accelerator today with LLM’s is less exciting. What is more exciting is a collection of AI accelerators. Here is an OCP UBB with the 8x Intel Gaudi 3 OAM accelerators. All are listed at 900W each, but we heard there may be more room with liquid-cooled variants. TSMC has gotten decent voltage frequency scaling and NVIDIA has been taking advantage of that as well. The system below is around 7.2kW+ of accelerators, not taking into account other components.

Here is the OAM package bottom for the Gaudi 3.

Gaudi 3 using Ethernet to scale was a key concept of Habana. The idea was that network admins fundamentally do not want to manage many types of data fabrics. Ethernet is ubiquitous so the aim was to use that, and this was before the idea of UltraEthernet came about. Of course, in 2024, network switch speeds are so much faster today with a modern 51.2T switch that can handle a ton of 200GbE devices. In 2019, with Gaudi 1, when 32-port 100GbE was still decently high-end, the network bandwidth and topologies were a bit more limited. The OAM package above takes 21 of the 24 lanes and uses 3x 200GbE to connect to each of the other seven OAM packages. The remaining 3x 200GbE links are then fed to OSFP connectors at the rear of the chassis.

We saw a Supermicro X14 Gaudi 3 platform at Intel Vision 2024. That platform has six OSFP connections on the top where the Gaudi 3 OAM UBB resides. That should help map to the diagram above.

Supermicro SYS 822GA NGR3 Intel Gaudi 3 8 Way 2

Those six OSFP ports can then be connected to switches and scale-out topologies built. Intel color-coded the networks in its materials to help better illustrate this. The benefit here is that building big Ethernet networks is fairly well understood.

Intel Gaudi 3 AI OAM Scale Out To Data Center

Intel says that the new Gaudi 3 is more power efficient, and sometimes faster than a NVIDIA H100 in inferencing. Our sense is that Gaudi 3 is going to find its home in the AI inference market.

At the same time, Intel Gaudi 3 can also offer faster training than the NVIDIA H100. While the NVIDIA H200 is announced, we are still a bit from seeing training numbers for that as it starts ramping in volume production, and then later this year the NVIDIA Blackwell products will see another jump. Gaudi 3 is really competing on price here. The power efficiency message will be more powerful later in 2024 when organizations have to look at the power footprints they have and decide when to keep or retire machines to free up power capacity.

To that end, Intel also has the Gaudi PCIe CEM, the HL-338. This is an add-in card with a 600W TDP. It takes a lot from chassis to go from the 300-350W PCIe accelerators of just a year or two ago to 600W TDP today.

All of these accelerators are sampling in the first half of this year. The air-cooled and liquid-cooled variants will go into production in the second half of this year. That ties to the Granite Rapids-AP Xeon 6 Supermicro platform shown above.

The important thing to remember is that these will be contemporaries of the NVIDIA H200 and Blackwell, so Intel will likely offer them at substantially lower cost than NVIDIA. Next up, in 2025 is Falcon Shores, which will be a GPU architecture designed for AI as well. Intel says that it will be making the transition easy for users of Gaudi 3.

Final Words

Overall, the Intel Gaudi 3 is a big step up with more memory, more compute, and faster interconnect.

The market needs more compute, so the Intel Gaudi 3 will sell well. If you saw our Touring the Intel AI Playground Inside the Intel Developer Cloud piece, since 2023 Intel has been supply constrained on Gaudi 2. As a result, they are selling the parts and putting them in the Intel Developer Cloud to both try and run production workloads.

Hopefully, later in 2024, we get to check out systems with the new parts.

4 COMMENTS

M April 20, 2024 At 6:36 am

What are those small Green PCBs/chips on most od the edges when looking at the accelerator from above?
Rob April 20, 2024 At 1:45 pm

@M, the SER-DES for the PCIe x16 and the 200 GbE x24.
Chrishafferty April 22, 2024 At 12:34 pm

Yawn no CUDA equals no developers and no software. Nice brick.
Ziple April 23, 2024 At 5:46 am

The fact that there won’t be a Gaudi 4 but just Falcon Shore means no one will invest in making software for it.
That was honest from Intel to say there won’t be a follow up, but that also ensured no sales.

This site uses Akismet to reduce spam. Learn how your comment data is processed.

Intel Gaudi 3 128GB HBM2e AI Chip in the Wild

Final Words

RELATED ARTICLESMORE FROM AUTHOR

ASRock Industrial NUC BOX-155H Mini PC Review

Intel Gaudi 3 24x 200GbE AI Accelerator The Short Video

NVIDIA RTX A400 and RTX A1000 Lower-Cost Single Slot GPUs

4 COMMENTS

LEAVE A REPLY

RELATED ARTICLES MORE FROM AUTHOR