run inference on llama models on macbook m1


by msx47

[edited ]

5 minutes read
Paz Arando on UnsplashPaz Arando on Unsplash

Getting Llama

Llama is being distributed on torrents for some time now. The legality is grey so please excersise judgement. The model can be obtained via this GitHub repository. Follow the instructions and download the base models. The download size is around 220GB.


From their GitHub

Inference of Facebook's LLaMA model in pure C/C++

This project is a C++ implemetation and is a Apple silicon first-class citizen. It uses 4-bit quantization to run the model on a macbook.

This is a new project and is being developed rapidly. I'll update this post if any breaking changes happen.

Setting up

Clone the llama.cpp GitHub repository. Now open this directory in a terminal and run make (must have gcc installed). This will compile the source files and result in 2 executables, main and quantize. We'll leave them for now.

Next up download the llama models and place them inside the models directory in llama.cpp. The llama.cpp/models directory should look like this now.

Now we need to transform the .pth models from the downlaod to valid ggml files. For this we can use the file. Make sure you have python3 installed using

python3 --version

Now install a few dependencies with pip or conda

For pip use:

pip3 install numpy torch sentencepiece 

For conda use:

conda install numpy torch sentencepiece

The llama models has 4 variants of different parameter size. You can choose which one to run based on your RAM size.

| RAM | Model         |
| 16G | 7B and 13B    |
| 32G | 30B and lower |
| 64G | 65B and lower |

Based on the above table run the following to convert the models to ggml:

python3 ./models/13B 1

This will take a while (~10-15 mins) depending on the model you selected.

Make the executable with

chmod +x

After the ggml files are generated, run the following command to quantize the models. The argument to quantize script would be the model you transformed to ggml in the previous step.

./ 13B

This also might take around 10-15 minutes to complete.

When the quantization is complete you can run the inference by providing it a prompt. To do so enter the following command.

./main -m ./models/13B/ggml-model-q4_0.bin -p "Building a website can be done in 10 simple steps:" -t 8 -n 512

This command will load the model and genrate a response based on your prompt. If you get a segmentation fault or a very slow response here then that is probably due to low RAM.

The project is being worked on currently and will have faster load times for models soon. I'll update this post in case of a major change.