
Small Language Models

Jun 19, 2024
Filed under: #tech #ai

A lot of the talk in the news is about the big players in the LLM space: OpenAI with ChatGPT, Microsoft Copilot (which is built on OpenAI’s models), and Google’s Gemini, which powers their big AI features. At this point, those models have gotten so large in the AI race that the companies no longer publish their actual sizes, so we don’t really know how big these models have become. This matters because size gives an indication of the power consumption and cost of running these models at scale. It’s estimated that generating a single image uses about as much power as fully charging a cell phone. That may not sound like much at first, but consider how many images you might generate before finding one you like. And then multiply that by who-knows-how-many users.

Whatever the size may be, these powerful LLMs cannot run on normal hardware. First there’s the memory required to load the model weights onto the GPU. Next is the memory required to hold and compute on the text being sent in and generated, and that requirement increases significantly with larger prompts and outputs. And then of course there’s the raw compute power needed to crunch every number in that big model. At this moment, the fastest and largest consumer GPU is the RTX 4090 with 24GB of memory (and it retails somewhere between $1800 and $2200 at the time of writing, just for the GPU). There are ways to put multiple GPUs in one computer and link them together (with some performance penalty), but that gets into bitcoin mining enthusiast territory for sure. The ‘average’ consumer GPU has 8GB of memory these days.
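To get a feel for those numbers, a rough back-of-the-envelope calculation helps: the weights alone take roughly (number of parameters) × (bytes per parameter) of GPU memory, before you count the prompt and output. Here’s a quick sketch (the model sizes and precisions are my own illustrative picks, not vendor specs):

```python
# Back-of-the-envelope: GPU memory needed just to hold model weights.
# Ballpark figures only; the context (prompt + output) needs memory on top.

BYTES_PER_PARAM = {
    "fp16": 2.0,  # 16-bit floats, the common default for GPU inference
    "int4": 0.5,  # 4-bit quantized, popular for running models locally
}

def weight_memory_gb(params_billions: float, precision: str) -> float:
    """Approximate memory in GB to load the weights at a given precision."""
    return params_billions * BYTES_PER_PARAM[precision]

for name, size_b in [("Phi-3-mini (3.8B)", 3.8), ("Llama 3 8B", 8.0), ("Llama 3 70B", 70.0)]:
    for precision in BYTES_PER_PARAM:
        print(f"{name} at {precision}: ~{weight_memory_gb(size_b, precision):.1f} GB")
```

An 8-billion-parameter model at 16-bit precision already wants around 16GB just for its weights: more than the ‘average’ 8GB card, but within reach of a 24GB RTX 4090. Quantize it down to 4 bits and it needs roughly 4GB.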

However, in the year and a half since ChatGPT was revealed, research hasn’t stood still. And even though OpenAI’s GPT models have arguably improved well beyond the original ChatGPT (GPT-3.5), more accessible models have become publicly available that can realistically run on consumer hardware, with capabilities similar to or better than that original ChatGPT version. And although these are not exactly small, they are called “Small Language Models” (or SLMs).

[Image: small language models vs. large language models as David and Goliath, according to Bing Copilot]

These small(er) models also come in different sizes with different capabilities. The larger, well-known ones include Meta’s Llama, Mistral’s models, and Google’s Gemma. Some much smaller models, like Microsoft’s Phi-3-mini and Google’s Gemini Nano, are so small they can run on phones. Besides this obviously being much better for the environment, having small AI models that can run locally is also good news for low-connectivity scenarios, privacy, and more.

Now clearly, it will be hard to get all the exact same capabilities out of the small models. Many of them are designed and trained to do well on specific tasks - think of a specialized model that covers a subset of what ChatGPT can do. For any techies out there though, these are exciting developments, as it means you can experiment with building AI applications without racking up a cloud bill or taking a dependency on an AI provider. Many of these models are freely available, although some licenses do not allow commercial use. And multi-modal models are now becoming available as well, allowing you to interpret or generate audio and images locally.

Beyond the excitement in DIY and developer communities, these smaller models are also leading to more practical applications, where devices run the easier tasks on local models and only use the expensive and power-hungry LLMs for the more complicated ones. Both Android and Apple phones will see more and more features that simply run locally (which is good news for your privacy as well as your mobile data plan). Windows, too, is getting more built-in features to host and run models, and will even ship standard models that Windows app developers can use in their own apps.

The popular place to go for models is Hugging Face. Think of it as a GitHub for AI models. And just like with GitHub, you shouldn’t just grab any model and run it locally - for security reasons, since some model file formats can contain executable code. But all the big players that have open source models release them there (so search for the official accounts of Microsoft, Meta, Google, etc.). Just double-check the license if you want to use these models for anything commercial.
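If you’d rather script the download than click through the website, the huggingface_hub Python library can fetch a full model snapshot. A minimal sketch (the repo id here is Microsoft’s official Phi-3-mini repository, but verify the publisher and license on the model page yourself):

```python
# Sketch: download all files of a model from Hugging Face.
# The repo id is assumed to be Microsoft's official Phi-3-mini repo;
# always verify the account and the license on the model page first.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(repo_id="microsoft/Phi-3-mini-4k-instruct")
print(f"Model files are in: {local_dir}")
```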

And finally, if you’re just looking to play with these models without needing to write any code or do any crazy setup… check out LM Studio. The app is available for free for Windows, Mac and Linux, lets you download and load models from Hugging Face directly, and opens a chat window. And the nice thing is, it can use GPU or just CPU (obviously performance will vary). Just two limitations to keep in mind: first, commercial use of LM Studio is not immediately permitted; you have to contact the company about those use cases. Second, LM Studio is based on llama.cpp, which means it only supports a specific subset of model types - something we will discuss in a following article.
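One handy LM Studio feature for developers: it can run a local server that mimics the OpenAI API, so code written against OpenAI can point at a model on your own machine instead. A minimal sketch, assuming you’ve loaded a model and started the server on its default address (check the app for the actual port):

```python
# Sketch: chat with a model served by LM Studio's local server, which
# exposes an OpenAI-compatible API. Assumes the default address
# http://localhost:1234 - verify the port in the LM Studio app.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:1234/v1",
    api_key="lm-studio",  # any placeholder works; the local server ignores it
)

response = client.chat.completions.create(
    model="local-model",  # the server uses whichever model you loaded in the app
    messages=[{"role": "user", "content": "Explain small language models in two sentences."}],
)
print(response.choices[0].message.content)
```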

If you want to try it out and you’re not sure about your hardware, try the Phi-3-mini model from Microsoft. It’s completely free (MIT license), and it’s so small it will run on most recent computers even without a GPU.
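And if you’d rather run it from code than from LM Studio, here’s a minimal sketch using Hugging Face’s transformers library (the model id and generation settings are my own picks; the weights download automatically on first run):

```python
# Minimal sketch: run Phi-3-mini locally with Hugging Face transformers.
# Uses the GPU when available and falls back to CPU otherwise.
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

model_id = "microsoft/Phi-3-mini-4k-instruct"  # official Microsoft repo, MIT license
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",  # fp16/bf16 on GPU, fp32 on CPU
    device_map="auto",   # picks GPU if present, otherwise CPU (needs `accelerate`)
    # note: older transformers versions may require trust_remote_code=True
)

generate = pipeline("text-generation", model=model, tokenizer=tokenizer)
result = generate("Question: What is a small language model?\nAnswer:", max_new_tokens=80)
print(result[0]["generated_text"])
```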


There is no comment section here, but I would love to hear your thoughts! Get in touch!
