Large Language Models (LLMs) with 400 billion parameters demand high-performance hardware with plenty of memory. Even in optimized form, such models need at least 200GB of RAM, which makes the iPhone 17 Pro seem an unlikely candidate for running a 400B LLM.
Yet a recent video shows someone doing exactly that. The feat, impressive as it is, relied on some clever engineering, so let's look at the specifics.
Performance Limits and Required Patience
An open-source project named Flash-MoE has managed to run the model on an iPhone 17 Pro, as demonstrated by @anemll.
Still, while the flagship smartphone can run such a computationally demanding model, it comes with significant drawbacks: generation speed sits at a dismal 0.6 tokens per second, roughly one word every 1.5 to 2 seconds.
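To put that rate in perspective, here is a quick back-of-the-envelope calculation of how long a typical reply would take at 0.6 tokens per second. The 1.3 tokens-per-word figure is a common rule of thumb for English text, not a number from the demo itself.

```python
# Rough estimate of reply latency at the demonstrated generation speed.
TOKENS_PER_SECOND = 0.6   # speed reported in the demo
TOKENS_PER_WORD = 1.3     # assumed average for English text

def reply_time_seconds(words: int) -> float:
    """Seconds needed to generate a reply of the given word count."""
    return words * TOKENS_PER_WORD / TOKENS_PER_SECOND

print(f"100-word reply: {reply_time_seconds(100) / 60:.1f} minutes")
```

A modest 100-word answer would take over three and a half minutes to generate.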
Even patient users, or those who can do something else while a query churns, are likely to find the experience frustrating. Still, the mere fact that a 400B LLM runs on a smartphone at all hints at future optimizations that could make on-device LLMs genuinely capable.
The achievement rests on a key decision: not loading the entire LLM into memory, a near impossibility given that the iPhone 17 Pro ships with only 12GB of LPDDR5X RAM.
Instead, Flash-MoE streams weights directly from the device's SSD to the GPU. The 'MoE' in its name stands for Mixture of Experts, an architecture that activates only a fraction of the 400B parameters for each token it generates.
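The two ideas combine naturally: because an MoE layer routes each token to only a few "experts", only those experts' weights need to be read from storage. The sketch below is illustrative only, not Flash-MoE's actual code; the layer sizes and router are hypothetical toys, and NumPy's memory-mapped file stands in for streaming weights from the SSD rather than holding them all in RAM.

```python
import os
import tempfile
import numpy as np

N_EXPERTS, TOP_K = 16, 2   # hypothetical layer configuration
D_MODEL = 8                # toy hidden size

rng = np.random.default_rng(0)

# On a real device this file would hold the full expert weights on flash
# storage; the OS pages in only the slices that are actually touched.
path = os.path.join(tempfile.mkdtemp(), "experts.npy")
weights = np.lib.format.open_memmap(
    path, mode="w+", dtype=np.float32, shape=(N_EXPERTS, D_MODEL, D_MODEL)
)
weights[:] = rng.standard_normal(weights.shape, dtype=np.float32)

def moe_forward(x: np.ndarray, router: np.ndarray) -> np.ndarray:
    """Route one token through its top-k experts; only those experts'
    weight slices are ever read from the memory-mapped file."""
    scores = router @ x                    # (N_EXPERTS,) routing logits
    top = np.argsort(scores)[-TOP_K:]      # indices of the chosen experts
    gates = np.exp(scores[top]) / np.exp(scores[top]).sum()  # softmax gate
    # Weighted sum of the selected experts' outputs.
    return sum(g * (weights[e] @ x) for g, e in zip(gates, top))

router = rng.standard_normal((N_EXPERTS, D_MODEL)).astype(np.float32)
token = rng.standard_normal(D_MODEL).astype(np.float32)
out = moe_forward(token, router)
print(out.shape)
```

With 2 of 16 experts active, only about an eighth of the expert weights are touched per token, which is what makes streaming from storage viable at all, if slow.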
Beyond the technical feat, running an LLM locally also brings a significant privacy advantage: responses are generated without relying on an active internet connection.
However, this comes at the cost of substantial battery drain. Developers also often ship compressed, or 'quantized', versions of LLMs, but even a quantized model with 400 billion parameters would still need at least 200GB of RAM if loaded in full, putting it far out of reach for the iPhone 17 Pro.
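The 200GB figure follows directly from the parameter count. A quick calculation of the weight footprint at common precisions (ignoring activations and the KV cache) shows why even aggressive 4-bit quantization cannot fit 400B parameters into 12GB of RAM:

```python
# Memory needed just to store the weights of a 400B-parameter model
# at different quantization levels.
PARAMS = 400e9  # 400 billion parameters

def weight_gb(bits_per_param: float) -> float:
    """Gigabytes of storage for the weights at the given precision."""
    return PARAMS * bits_per_param / 8 / 1e9

for bits in (16, 8, 4):
    print(f"{bits}-bit: {weight_gb(bits):.0f} GB")
```

At 16-bit precision the weights alone take 800GB; even at 4 bits per parameter they still occupy 200GB, matching the article's minimum and dwarfing the phone's 12GB of RAM.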

In short, this demonstration shows that while it is technically possible to run a 400B LLM on a smartphone, admittedly at a laborious 0.6 tokens per second, there remains a wide gap between running such a model and using it in a practical, user-friendly way.
Source link: Wccftech.com.