Large Language Models (LLMs) with 400 billion parameters demand high-performance hardware with plenty of memory. Even in optimized form, such models need at least 200GB of RAM, which makes the iPhone 17 Pro seem an unlikely candidate for running a 400B LLM.
Yet a recent video shows someone doing exactly that. The feat, impressive as it is, relied on some clever engineering, so let's look at the specifics.
Performance Limits and Required Patience
An open-source project named Flash-MoE has managed to run the model on an iPhone 17 Pro, as demonstrated by @anemll.
Still, while the flagship smartphone can run such a computationally demanding model, it comes with significant drawbacks: generation speed sits at a dismal 0.6 tokens per second, roughly one word every 1.5 to 2 seconds.
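To put that rate in perspective, here is a quick back-of-the-envelope calculation of how long a typical reply would take at 0.6 tokens per second. The 1.3 tokens-per-word figure is a common rule of thumb for English text, not a number from the demo itself.

```python
# Rough estimate of reply latency at the demonstrated generation speed.
TOKENS_PER_SECOND = 0.6   # speed reported in the demo
TOKENS_PER_WORD = 1.3     # assumed average for English text

def reply_time_seconds(words: int) -> float:
    """Seconds needed to generate a reply of the given word count."""
    return words * TOKENS_PER_WORD / TOKENS_PER_SECOND

print(f"100-word reply: {reply_time_seconds(100) / 60:.1f} minutes")
```

A modest 100-word answer would take over three and a half minutes to generate.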
Even patient users, or those who can do something else while a query churns, are likely to find the experience frustrating. Still, the mere fact that a 400B LLM runs on a smartphone at all hints at future optimizations that could make on-device LLMs genuinely capable.
The achievement rests on a key decision: not loading the entire LLM into memory, a near impossibility given that the iPhone 17 Pro ships with only 12GB of LPDDR5X RAM.
Instead, Flash-MoE streams weights directly from the device's SSD to the GPU. The 'MoE' in its name stands for Mixture of Experts, an architecture that activates only a fraction of the 400B parameters for each token it generates.
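The two ideas combine naturally: because an MoE layer routes each token to only a few "experts", only those experts' weights need to be read from storage. The sketch below is illustrative only, not Flash-MoE's actual code; the layer sizes and router are hypothetical toys, and NumPy's memory-mapped file stands in for streaming weights from the SSD rather than holding them all in RAM.

```python
import os
import tempfile
import numpy as np

N_EXPERTS, TOP_K = 16, 2   # hypothetical layer configuration
D_MODEL = 8                # toy hidden size

rng = np.random.default_rng(0)

# On a real device this file would hold the full expert weights on flash
# storage; the OS pages in only the slices that are actually touched.
path = os.path.join(tempfile.mkdtemp(), "experts.npy")
weights = np.lib.format.open_memmap(
    path, mode="w+", dtype=np.float32, shape=(N_EXPERTS, D_MODEL, D_MODEL)
)
weights[:] = rng.standard_normal(weights.shape, dtype=np.float32)

def moe_forward(x: np.ndarray, router: np.ndarray) -> np.ndarray:
    """Route one token through its top-k experts; only those experts'
    weight slices are ever read from the memory-mapped file."""
    scores = router @ x                    # (N_EXPERTS,) routing logits
    top = np.argsort(scores)[-TOP_K:]      # indices of the chosen experts
    gates = np.exp(scores[top]) / np.exp(scores[top]).sum()  # softmax gate
    # Weighted sum of the selected experts' outputs.
    return sum(g * (weights[e] @ x) for g, e in zip(gates, top))

router = rng.standard_normal((N_EXPERTS, D_MODEL)).astype(np.float32)
token = rng.standard_normal(D_MODEL).astype(np.float32)
out = moe_forward(token, router)
print(out.shape)
```

With 2 of 16 experts active, only about an eighth of the expert weights are touched per token, which is what makes streaming from storage viable at all, if slow.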
Beyond the technical feat, running an LLM locally also brings a significant privacy advantage: responses are generated without relying on an active internet connection.
However, this comes at the cost of substantial battery drain. Developers also often ship compressed, or 'quantized', versions of LLMs, but even a quantized model with 400 billion parameters would still need at least 200GB of RAM if loaded in full, putting it far out of reach for the iPhone 17 Pro.
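The 200GB figure follows directly from the parameter count. A quick calculation of the weight footprint at common precisions (ignoring activations and the KV cache) shows why even aggressive 4-bit quantization cannot fit 400B parameters into 12GB of RAM:

```python
# Memory needed just to store the weights of a 400B-parameter model
# at different quantization levels.
PARAMS = 400e9  # 400 billion parameters

def weight_gb(bits_per_param: float) -> float:
    """Gigabytes of storage for the weights at the given precision."""
    return PARAMS * bits_per_param / 8 / 1e9

for bits in (16, 8, 4):
    print(f"{bits}-bit: {weight_gb(bits):.0f} GB")
```

At 16-bit precision the weights alone take 800GB; even at 4 bits per parameter they still occupy 200GB, matching the article's minimum and dwarfing the phone's 12GB of RAM.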

In short, this demonstration shows that while it is technically possible to run a 400B LLM on a smartphone, admittedly at a laborious 0.6 tokens per second, there remains a wide gap between running such a model and using it in a practical, user-friendly way.
Source link: Wccftech.com.