GPT-4, PaLM, Claude, Bard, LaMDA, Chinchilla, Sparrow – the list of large language models on the market continues to grow. But behind their remarkable capabilities lie substantial costs that users are only beginning to discover.

While LLMs offer tremendous potential, understanding their economic implications is crucial for businesses and individuals considering their adoption.

First, building and training LLMs is expensive. It requires thousands of Graphics Processing Units, or GPUs, which offer the parallel processing power needed to handle the massive datasets these models learn from. The GPUs alone can cost millions of dollars. According to a technical overview of OpenAI’s GPT-3 language model, training required at least $5 million worth of GPUs.

These models require many, many training runs as they are developed and tuned, so the final cost is far in excess of this figure. The electricity alone used to train GPT-3 is estimated to have cost $100 million. And training is getting more expensive.

Ever larger models require more computational power and longer training times. Model size is also scaling faster than hardware capabilities, requiring a greater number of expensive processors. The size of training datasets is also increasing, meaning longer power-draining training iterations. State-of-the-art models are trained for weeks or months to reach optimal performance, racking up costs.

OpenAI itself says that the amount of compute used in the largest AI training runs has been increasing exponentially, doubling every few months. At that rate – doubling roughly every six months – GPT-3’s $100 million electricity bill would grow to $400 million a year later for training the next iteration of the model. That’s enough electricity to power thousands of homes for a year.
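The arithmetic behind that jump is simple compounding. A minimal sketch, assuming cost tracks compute growth and doubles every six months; the starting figure and doubling period are illustrative, not measured:

```python
# Projects training cost forward under exponential compute growth.
# Assumption: cost scales with compute, which doubles every 6 months.

def projected_cost(start_cost_usd: float, months: float,
                   doubling_months: float = 6.0) -> float:
    """Cost after `months` have elapsed, given a fixed doubling period."""
    return start_cost_usd * 2 ** (months / doubling_months)

# $100M today becomes $400M a year later (two doublings).
print(projected_cost(100e6, 12))  # 400000000.0
```

Two doublings in twelve months turns $100 million into $400 million, matching the figure above.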

Sam Mugel, Chief Technology Officer of Multiverse, which uses tensor networks and quantum computing to bring down costs, estimates that training the next generation of large language models will pass $1 billion within a few years.

The process also requires skilled AI engineers, who come with hefty salaries, and armies of human reviewers.

The cost doesn’t end there. Running inference on the models, once trained, is also expensive. Estimates suggest that in January 2023, ChatGPT used nearly 30,000 GPUs to handle hundreds of millions of daily user requests. Sajjad Moazeni, a University of Washington assistant professor of electrical and computer engineering, says those queries may consume around 1 GWh each day, the equivalent of the daily energy consumption for about 33,000 U.S. households.
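That household comparison is easy to sanity-check. A back-of-envelope sketch, assuming an average U.S. household uses about 30 kWh per day (a commonly cited approximation, not a figure from the article's source):

```python
# Sanity check: 1 GWh/day of inference energy vs. household consumption.
# Assumption: ~30 kWh/day for an average U.S. household.

daily_inference_kwh = 1_000_000     # 1 GWh expressed in kWh
household_kwh_per_day = 30          # assumed U.S. average

households = daily_inference_kwh / household_kwh_per_day
print(round(households))  # 33333 -- in line with the ~33,000 cited above
```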

All this computing can only take place in data centers, where hundreds of thousands of processing units, memory and storage devices are housed in server racks, along with internal infrastructure for cooling the servers with water and air. Rising demand from large models is pushing up data center prices across the board, according to Raul Martynek, CEO at DataBank, which builds and operates data centers.

Meanwhile, organizations that don’t build their own models burn through cash accessing them via APIs. Every token these statistical brains produce costs money, and because LLMs churn out verbose filler as often as gems, much of that spend is wasted. Bills can mount quickly.
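To see how quickly token charges compound, here is a hypothetical bill estimator; the per-token prices, traffic volume, and token counts below are placeholders for illustration, not any vendor’s actual rates:

```python
# Hypothetical monthly API bill estimator.
# All prices and usage numbers are illustrative assumptions.

def monthly_api_cost(requests_per_day: int,
                     prompt_tokens: int,
                     completion_tokens: int,
                     usd_per_1k_prompt: float = 0.03,
                     usd_per_1k_completion: float = 0.06) -> float:
    """Estimated cost over a 30-day month, billed per 1,000 tokens."""
    per_request = (prompt_tokens / 1000 * usd_per_1k_prompt
                   + completion_tokens / 1000 * usd_per_1k_completion)
    return per_request * requests_per_day * 30

# 50,000 requests/day at 500 prompt + 700 completion tokens each
print(f"${monthly_api_cost(50_000, 500, 700):,.0f}")  # roughly $85,500/month
```

Note that verbose completions dominate the bill at these placeholder rates, which is why wasted output tokens matter.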

“LLMs get bigger, not smarter,” said Olivier Gaudin, CEO & co-founder of Sonar, which has a suite of products to help developers write clean code.

Recent research reveals that a substantial portion of the computational operations in large models is wasted: over 99% of floating-point operations can involve zero values, contributing nothing to the output. Relying solely on massive models that consume vast amounts of resources is unsustainable in the long term.
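The sparsity phenomenon is easy to illustrate in miniature: after a ReLU activation, roughly half the values in a zero-centered input become exactly zero, and any multiplication against them is wasted work. A toy sketch with synthetic numbers, not measurements from any real model:

```python
# Toy illustration of activation sparsity: ReLU zeroes out negative
# inputs, so downstream multiplications against them are wasted FLOPs.
# Data here is synthetic Gaussian noise, not real model activations.
import random

random.seed(0)
pre_activations = [random.gauss(0, 1) for _ in range(100_000)]
post_relu = [max(0.0, x) for x in pre_activations]

zero_fraction = sum(1 for x in post_relu if x == 0.0) / len(post_relu)
print(f"{zero_fraction:.0%} of activations are zero")  # roughly 50%
```

Real models exhibit far more extreme sparsity than this toy example, which is what the 99% figure refers to.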

Running LLMs on powerful machines in the cloud, coupled with the API usage fees and coding costs, can become prohibitively expensive for large-scale applications.

Diffblue, a company founded by Oxford researchers, uses reinforcement learning to automatically generate unit tests for Java code. They replaced the test generator with a call to OpenAI LLMs and analyzed the tests produced by running it against several large open-source projects, comparing the results with those of their own product, Diffblue Cover.

The experiments showed that the OpenAI LLM model was effective at understanding general instructions, but the model’s output lacked certain elements required for the code to compile, such as package declarations, imports, correct type declarations, and method calls. These deficiencies had to be fixed through post-processing. After fixing package declarations, only 14% of the produced tests compiled and passed, and after additionally fixing imports, still only 32% compiled and passed.
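The kind of post-processing described might look like the following sketch, which prepends a missing package declaration and imports to a generated Java test before compilation. This is a hypothetical illustration; the package name and import list are invented, and Diffblue’s actual pipeline is not public:

```python
# Hypothetical post-processing step for LLM-generated Java tests:
# add a package declaration and imports if the model omitted them.
# Package name and imports below are illustrative placeholders.
import re

def patch_generated_test(source: str, package: str,
                         imports: list[str]) -> str:
    """Prepend missing package/import lines to generated Java source."""
    header = []
    if not re.search(r"^\s*package\s", source, re.MULTILINE):
        header.append(f"package {package};")
    for imp in imports:
        if f"import {imp};" not in source:
            header.append(f"import {imp};")
    return "\n".join(header + [source]) if header else source

generated = "public class FooTest {\n    // ... test methods ...\n}\n"
patched = patch_generated_test(generated, "com.example", ["org.junit.Test"])
print(patched.splitlines()[0])  # package com.example;
```

Even with fixes like this applied automatically, most of the generated tests in the experiment still failed to compile and pass.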

Moreover, the generated tests did not always meet the desired criteria, necessitating manual review. In the end, Diffblue Cover’s reinforcement learning approach proved much cheaper and more efficient than an LLM.

These cost considerations are crucial as AI continues to revolutionize industries. When deciding whether to build or fine-tune an LLM, companies need to consider factors like resources, data quality and size, technical expertise, and business strategy alignment.

Companies should make sure their teams understand the new scaling laws that dictate optimal model size for a given dataset size, as well as rigorous data curation, incremental pre-training, and techniques to mitigate training instability such as regularization and learning rate decay.
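The best-known of these scaling laws, from DeepMind’s Chinchilla work (Hoffmann et al., 2022), is often summarized as roughly 20 training tokens per model parameter for compute-optimal training. A sketch of that rule of thumb, which is an approximation rather than a hard law:

```python
# Chinchilla-style rule of thumb: compute-optimal training uses
# roughly 20 tokens per model parameter. An approximation, not a law.

TOKENS_PER_PARAM = 20  # approximate compute-optimal ratio

def compute_optimal_tokens(n_params: float) -> float:
    """Rough compute-optimal training-token budget for a model size."""
    return TOKENS_PER_PARAM * n_params

# A 70B-parameter model would want on the order of 1.4 trillion tokens.
print(f"{compute_optimal_tokens(70e9):.2e}")  # 1.40e+12
```

The practical lesson is that a smaller model trained on more data can match a larger, undertrained one at a fraction of the inference cost.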

In the meantime, the hefty computing costs of large language models pose a major challenge for developing profitable real-world generative AI apps. Companies eager to harness the new technology need to make sure they aren’t using a sledgehammer to crack a nut.