Positron’s Asimov Accelerators Set to Challenge Nvidia’s Rubin GPUs
At first glance, Positron’s next-generation Asimov accelerators, named after the science fiction writer, may not look like much next to Nvidia’s Rubin GPUs.
However, the Arm-backed AI firm claims its inference chip can generate five times more tokens per dollar while consuming a fifth of the power of Nvidia’s latest parts.
Those claims rest on the premise that Asimov is purpose-built for large-scale inference workloads, a strategy now backed by an additional $230 million in fresh capital.
In contrast to the GPUs that Nvidia and AMD are known for, Positron’s Asimov diverges significantly in design.
The Asimov accelerators drop the high-bandwidth memory (HBM) used in their predecessors, the Atlas systems, in favor of LPDDR5x.
That choice allows memory to be expanded via Compute Express Link (CXL), from 864GB up to 2.3TB per chip.
The extra capacity leaves more room for large language model (LLM) weights and for the key-value (KV) caches that hold model state during inference.
The trade-off is that while LPDDR5x is cheaper and more capacious than HBM, it is also considerably slower.
Nvidia’s newly unveiled Rubin GPUs, by comparison, pack 288GB of HBM4 delivering 22 TB/s of peak bandwidth, while Positron’s Asimov accelerators top out at roughly 3 TB/s.
However, the company maintains that its chips can actually use 90 percent of that bandwidth, compared to the roughly 30 percent Nvidia’s GPUs reportedly achieve in practice.
That claim applies mainly to the on-package LPDDR5x, however. Any memory added via CXL is limited by the chip’s 32 lanes of PCIe 6.0, good for around 256 GB/s of bandwidth.
Positron reportedly intends to use that CXL memory to store key-value caches (KV-cache), which in theory should eliminate much of the complexity and overhead normally associated with KV-cache offloading.
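Positron hasn’t published any KV-cache math, but the appeal of a multi-terabyte memory pool is easy to illustrate. The sketch below estimates cache size for a hypothetical 70B-class model with grouped-query attention; all of the model dimensions are illustrative assumptions, none of them come from Positron.

```python
# Rough KV-cache sizing for a hypothetical 70B-class transformer.
# Every dimension below is an illustrative assumption, not a Positron figure.
N_LAYERS = 80        # decoder layers
N_KV_HEADS = 8       # grouped-query attention KV heads
HEAD_DIM = 128       # dimension per head
DTYPE_BYTES = 2      # FP16/BF16

def kv_bytes_per_token() -> int:
    # One key vector and one value vector per KV head, per layer.
    return 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM * DTYPE_BYTES

per_token = kv_bytes_per_token()
context = 131_072                              # a 128K-token context window
total_gib = per_token * context / 2**30

print(f"{per_token} bytes/token")              # 327680 bytes, i.e. 320 KiB
print(f"{total_gib:.0f} GiB per 128K-token sequence")  # 40 GiB
```

At 40 GiB per long sequence, roughly seven concurrent 128K-token sessions would fill Rubin’s 288GB of HBM4 with cache alone, which is exactly the situation a cheap CXL-attached pool is aimed at.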
It is worth noting that even if Positron’s claim that HBM-based GPUs achieve only about 30 percent of their peak bandwidth holds up, Rubin’s memory would still be about 2.4 times faster.
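The arithmetic behind that 2.4x figure is straightforward, taking both vendors’ numbers at face value:

```python
# Effective-bandwidth comparison using the figures quoted above.
rubin_peak, rubin_util = 22.0, 0.30     # TB/s peak, claimed real-world utilization
asimov_peak, asimov_util = 3.0, 0.90    # TB/s peak, Positron's claimed utilization

rubin_eff = rubin_peak * rubin_util     # 6.6 TB/s delivered
asimov_eff = asimov_peak * asimov_util  # 2.7 TB/s delivered

print(f"Rubin effective:  {rubin_eff:.1f} TB/s")
print(f"Asimov effective: {asimov_eff:.1f} TB/s")
print(f"Ratio: {rubin_eff / asimov_eff:.1f}x")  # ~2.4x in Rubin's favor
```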
And that is before accounting for compute, a subject on which Positron has said comparatively little.
Positron says its 400-watt chip features a 512×512 systolic array running at 2 GHz, with support for TF32, FP16/BF16, FP8, NVFP4, and Int4 datatypes.
The array is fed by a set of Armv9 cores and can be reconfigured into other shapes, such as 128×512 or 512×128, to suit the workload. A concrete teraFLOPS figure, however, has yet to be disclosed.
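A back-of-the-envelope estimate is possible if you assume, and it is only an assumption, that each element of the array performs one multiply-accumulate per cycle:

```python
# Peak-throughput estimate for a 512x512 systolic array clocked at 2 GHz.
# Assumes one fused multiply-accumulate (2 FLOPs) per processing element
# per cycle, which Positron has not confirmed.
ROWS, COLS = 512, 512
CLOCK_HZ = 2e9
FLOPS_PER_MAC = 2

peak_flops = ROWS * COLS * FLOPS_PER_MAC * CLOCK_HZ
print(f"{peak_flops / 1e15:.2f} petaFLOPS")  # ~1.05 PFLOPS dense
```

Under that assumption a four-chip Titan would land around 4 dense petaFLOPS at 16-bit precision, but treat the number as an upper bound derived from the array geometry, not a spec.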
That said, raw compute is only one piece of the puzzle. The performance of generative AI models rarely comes down to a single chip; as Google’s TPUs and Amazon’s Trainium have shown, how well chips scale often matters more than per-chip specs.
Each Asimov accelerator offers 16 Tbps, or 2 TB/s, of chip-to-chip bandwidth, putting the interconnect nearly on par with the memory.
Four Asimov chips will make up Positron’s Titan compute platform, though these systems are better thought of as compute blades, akin to those in Nvidia’s NVL72 racks, than as standalone machines.
Positron claims it can stitch up to 4,096 Titan systems into a single scale-up domain with more than 32 petabytes of memory between them.
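That 32-petabyte figure checks out against the per-chip capacities quoted earlier. The exact mix of memory configurations isn’t known; the sketch below assumes every chip carries the top-end 2.3TB:

```python
# Scale-up domain memory, assuming every chip has the maximum
# 2.3TB CXL-expanded configuration quoted earlier.
chips_per_titan = 4
titans = 4_096
tb_per_chip = 2.3

total_pb = titans * chips_per_titan * tb_per_chip / 1_000
print(f"{total_pb:.1f} PB")  # ~37.7 PB, consistent with "over 32 petabytes"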
That is achieved with a pure chip-to-chip mesh rather than the switched scale-up fabrics found in Nvidia’s and AMD’s rack-scale architectures.
In this respect, Positron’s scale-up fabric has more in common with Amazon’s Trainium 2 clusters or Google’s TPUs, which use a variety of rings and 2D and 3D torus topologies.
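To make the topology difference concrete, here is a minimal sketch of neighbor lookup in a 2D torus of the kind Google’s TPUs use. It is purely illustrative and says nothing about how Positron actually wires its mesh:

```python
# Neighbors of a node in a 2D torus: a grid whose edges wrap around, so
# every node has exactly four direct links and no packet switch is needed.
def torus_neighbors(x: int, y: int, w: int, h: int) -> list[tuple[int, int]]:
    return [
        ((x - 1) % w, y),  # west (wraps at the edge)
        ((x + 1) % w, y),  # east
        (x, (y - 1) % h),  # south
        (x, (y + 1) % h),  # north
    ]

# Even a corner node in a 4x4 torus has four neighbors, thanks to wraparound.
print(torus_neighbors(0, 0, 4, 4))  # [(3, 0), (1, 0), (0, 3), (0, 1)]
```

The trade-off noted below follows directly: the wiring is fixed by the `%` arithmetic, so changing the shape of the fabric means physically rewiring links rather than reprogramming a switch.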
While the mesh approach avoids the need for power-hungry packet switches, it makes reconfiguration harder.
Google gets around this with optical circuit switches, which connect chips flexibly, rather like a telephone switchboard.

Amazon, by contrast, opted for switched fabrics in Trainium 3, citing better scalability for inference workloads.
Positron has yet to say how it will handle cluster provisioning, but an answer shouldn’t be far off: the Asimov accelerators are slated to start shipping next year.
Source: Theregister.com.