Nvidia cuts AI inference costs by up to tenfold with Blackwell architecture and open-source models

In a blog post, Nvidia reports that leading inference service providers, including Baseten, DeepInfra, Fireworks AI and Together AI, can reduce the cost of processing a single token by as much as 10× compared with previous hardware generations such as the Hopper platform. They do so by combining Blackwell with optimized software stacks and open-source models.

The Blackwell platform, based on Nvidia’s newly designed microarchitecture, was built specifically to handle AI workloads while increasing both throughput and energy efficiency. As a result, more tokens can be processed on the same amount of infrastructure. Because the cost of that infrastructure stays roughly the same, the higher throughput directly lowers the operational cost per token.
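The link between throughput and cost per token can be made concrete with a little arithmetic. The sketch below is illustrative only: the function name `cost_per_million_tokens`, the hourly infrastructure cost and both throughput figures are assumptions chosen for the example, not numbers from Nvidia or the providers named above.

```python
def cost_per_million_tokens(hourly_cost_usd: float, tokens_per_second: float) -> float:
    """Cost in USD to process one million tokens at a sustained throughput."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    tokens_per_hour = tokens_per_second * 3600
    return hourly_cost_usd / tokens_per_hour * 1_000_000


# Hypothetical figures, chosen only to illustrate the arithmetic.
HOURLY_COST_USD = 10.0
baseline = cost_per_million_tokens(HOURLY_COST_USD, tokens_per_second=1_000)
faster = cost_per_million_tokens(HOURLY_COST_USD, tokens_per_second=10_000)

print(f"baseline: ${baseline:.2f} per million tokens")
print(f"faster:   ${faster:.2f} per million tokens")
print(f"reduction factor: {baseline / faster:.1f}x")
```

With the hourly cost held constant, a tenfold gain in throughput yields a tenfold drop in cost per token. In practice the hourly cost of newer hardware may differ, so the real factor depends on both terms.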

Deployment examples show the broad economic impact of this approach. In healthcare, Sully.ai — using Blackwell together with open-source models — achieved a 90% reduction in inference costs while also shortening response times, improving the viability of automating tasks such as medical coding and clinical documentation workflows.

Other use cases include gaming platforms and customer-support tools, where companies reported token-cost reductions of between 4× and 10× when running Blackwell with low-precision formats (such as NVFP4) and open-source models instead of relying on expensive proprietary API providers.
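The article quotes savings both as a percentage (90%) and as a multiplier (4× to 10×). These are two ways of expressing the same quantity, and a short sketch shows how to convert between them. The helper names `factor_to_percent` and `percent_to_factor` are hypothetical and introduced here only for illustration.

```python
def factor_to_percent(factor: float) -> float:
    """Convert an 'N times cheaper' factor into a percentage cost reduction."""
    if factor <= 0:
        raise ValueError("factor must be positive")
    return (1 - 1 / factor) * 100


def percent_to_factor(percent: float) -> float:
    """Convert a percentage cost reduction into an 'N times cheaper' factor."""
    if not 0 <= percent < 100:
        raise ValueError("percent must be in [0, 100)")
    return 1 / (1 - percent / 100)


for factor in (4, 10):
    print(f"{factor}x cheaper = {factor_to_percent(factor):.0f}% reduction")

print(f"90% reduction = {percent_to_factor(90):.0f}x cheaper")
```

Read this way, the 90% reduction reported in the healthcare example sits at the top of the 4× to 10× range, while a 4× reduction corresponds to a 75% saving.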

This shift in the cost model is important not only for cloud providers, but also for enterprises and startups that want to scale AI-based applications without massive financial outlays. A substantial drop in cost per token could make AI less exclusive to the largest players and significantly more accessible to smaller organizations.

Industry analyses indicate that the cost reduction is driven not only by the hardware itself but also by the tight integration of hardware and software: optimized drivers, algorithms and open-source models run more efficiently on the Blackwell platform, maximizing utilization of compute resources.

The new inference cost structure could have a meaningful impact on the pace of commercialization of AI solutions in sectors such as healthcare, services and entertainment — especially in use cases where every processed token translates directly into operating expenses. Lower costs may also reduce barriers to entry for companies building products on top of large language models.
