65B LLaMA on CPU
16 years ago a dog ate my AI book. At the time (and well before that), a common argument for «why we still don't have working AI and it is always 10 years away» was that we couldn't make AI work even at 1% or 0.1% of human speed, even on the supercomputers of the day – therefore it's not about GFLOPS.
This weekend I ran the gpt4-alpaca-lora_mlp-65B language model in my home lab on CPU (using llama.cpp; given the model size, there is zero chance of running it on a consumer GPU). This model is arguably the best open LLM (this week), and 65 billion parameters is no joke: in single precision it won't even fit in 256 GB of RAM. If you let it spill into swap, even on an NVMe drive, it runs at ~1 token per minute (limited by swap speed), which is about 0.5% of human speed. Even at this snail's pace it can still show superhuman performance on memorization-related tasks. It is clear that getting here was not possible 20 years ago – training time would have been prohibitive even with unlimited government funding.
And this is where the unexpectedly open approach of Meta has proven superior to the closed, dystopian megacorp approach of OpenAI: in the 10 weeks since LLaMA was released into the wild, not only have derivative models been trained, but 2-, 4-, and 5-bit quantization has enabled larger models on consumer hardware. In my case, with 5-bit quantization the model fits into 64 GB of RAM and runs at ~2 tokens per second (on 64 cores), which is probably 70-90% of my own speed in best shape.
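The memory arithmetic behind the two paragraphs above can be sketched in a few lines of Python. This is a rough estimate of raw weight size only – real GGML files add per-block quantization scales, and inference needs extra RAM for the context (KV cache) – but it shows why fp32 blows past 256 GB while 5-bit fits comfortably in 64 GB:

```python
# Back-of-envelope sizing of LLaMA-65B weights at different precisions.
# Illustrative only: actual llama.cpp file sizes are slightly larger
# due to quantization scale factors and metadata.
PARAMS = 65e9  # 65 billion parameters

def weights_gb(params: float, bits_per_param: float) -> float:
    """Approximate size of the raw weights in GB (1 GB = 1e9 bytes)."""
    return params * bits_per_param / 8 / 1e9

fp32 = weights_gb(PARAMS, 32)  # single precision: ~260 GB
q5 = weights_gb(PARAMS, 5)     # 5-bit quantization: ~41 GB

print(f"fp32:  {fp32:.0f} GB -> fits in 256 GB RAM? {fp32 <= 256}")
print(f"5-bit: {q5:.1f} GB -> fits in 64 GB RAM?  {q5 <= 64}")
```

The ~6.4x shrink from fp32 to 5-bit is exactly what moves a 65B model from "swap-bound on a server" to "resident in RAM on a beefy workstation".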
For comparison, I tried Replit-code-v1.3b, a 2.7B-parameter model optimized for coding. After the 65B monster, Replit feels like a breeze and shows very good performance despite its size. This is a good reminder that smaller, field-specific models should be used wherever possible.
It feels like "1 trillion parameters will be enough for everybody", but such models will probably not be practical for another 2 years. Meanwhile, the key enablers of AI proliferation could be growth of consumer GPU RAM beyond 24 GB (sadly unlikely to happen due to commercial interests) and smaller, field-specific models – which is where I will be looking with much more interest.