llama-cpp-python: A Python Binding for llama.cpp

llama-cpp-python is a Python binding for the llama.cpp library, which lets you run a wide range of LLMs locally. Compatible models (in GGUF format) can be found on Hugging Face.
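
As an illustration, a GGUF model file can be fetched programmatically with the huggingface_hub package. The repository and file names below are placeholders chosen for this sketch, not something prescribed by llama-cpp-python; substitute the model you actually want.

from huggingface_hub import hf_hub_download

# Download an example quantized GGUF file into ./models (repo and filename are placeholders)
model_path = hf_hub_download(
    repo_id="TheBloke/Llama-2-7B-Chat-GGUF",
    filename="llama-2-7b-chat.Q4_K_M.gguf",
    local_dir="models",
)
print(model_path)  # local path to the downloaded .gguf file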

Key Notes:

  • New versions of llama-cpp-python use GGUF model files (as opposed to GGML).
  • Converting GGML models to GGUF involves a command like this:
				
python ./convert-llama-ggmlv3-to-gguf.py --eps 1e-5 --input models/openorca-platypus2-13b.ggmlv3.q4_0.bin --output models/openorca-platypus2-13b.gguf.q4_0.bin

Installation Instructions

There are several installation options based on your system and preferences:

1. CPU Only Installation:

Install the package using:

%pip install --upgrade --quiet llama-cpp-python
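
To confirm the install, a quick sanity check (run from a regular shell rather than a notebook) is to import the package and print its version:

python -c "import llama_cpp; print(llama_cpp.__version__)"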

2. Installation with OpenBLAS/cuBLAS/CLBlast:

llama.cpp supports several BLAS backends for faster processing. You can install with the cuBLAS (CUDA) backend using:

					!CMAKE_ARGS="-DGGML_CUDA=on" FORCE_CMAKE=1 pip install llama-cpp-python
				
			
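
The heading above also mentions OpenBLAS; as a hedged sketch (the CMake flag names have changed between releases, so check the llama-cpp-python README for your version), an OpenBLAS build looks like:

!CMAKE_ARGS="-DGGML_BLAS=ON -DGGML_BLAS_VENDOR=OpenBLAS" pip install llama-cpp-python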

If you have previously installed the CPU-only version, force a clean reinstall with the CUDA backend enabled:

					!CMAKE_ARGS="-DGGML_CUDA=on" FORCE_CMAKE=1 pip install --upgrade --force-reinstall llama-cpp-python --no-cache-dir
				
			
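
To check whether the reinstalled wheel was actually built with GPU support, recent versions of the bindings expose a low-level helper; treat its availability as an assumption and verify it against your installed version:

python -c "import llama_cpp; print(llama_cpp.llama_supports_gpu_offload())"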

3. Installation with Metal (macOS):

To install with Metal support on macOS, use:

					!CMAKE_ARGS="-DLLAMA_METAL=on" FORCE_CMAKE=1 pip install llama-cpp-python
				
			
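
Because this compiles native code, the Xcode command-line tools need to be present; if the build fails early, installing them from a terminal is usually the fix:

xcode-select --install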

4. Installation for Windows:

For Windows, it is recommended to compile from source. The prerequisites include:

  • Git
  • Python
  • CMake
  • Visual Studio Community

Clone the llama-cpp-python repository and set the necessary environment variables before compiling.
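
A typical clone looks like the following; the --recursive flag pulls in the bundled llama.cpp submodule, which the source build needs:

git clone --recursive https://github.com/abetlen/llama-cpp-python.git
cd llama-cpp-python

With the sources in place, set the build variables (switch GGML_CUDA to ON if you have an NVIDIA GPU and want CUDA support):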

set FORCE_CMAKE=1
set CMAKE_ARGS=-DGGML_CUDA=OFF

Then install the package in editable mode:

python -m pip install -e .

Usage

Once the installation is complete, use LlamaCpp in LangChain to interact with your models. Here’s an example:

from langchain_community.llms import LlamaCpp
from langchain_core.callbacks import CallbackManager, StreamingStdOutCallbackHandler
from langchain_core.prompts import PromptTemplate

# Define the prompt template
template = """Question: {question}

Answer: Let's work this out step by step to make sure we have the correct answer."""

prompt = PromptTemplate.from_template(template)

# Set up callback manager for token-wise streaming
callback_manager = CallbackManager([StreamingStdOutCallbackHandler()])

# Example using a local GGUF model
llm = LlamaCpp(
    model_path="/path/to/your/model.gguf",
    temperature=0.75,
    max_tokens=2000,
    top_p=1,
    callback_manager=callback_manager,
    verbose=True,
)

question = "What NFL team won the Super Bowl in the year Justin Bieber was born?"
llm.invoke(question)
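
Note that the prompt template defined above is not applied by the direct llm.invoke(question) call. To actually use it, the template can be piped into the model as a runnable chain, reusing the objects already defined in the example:

# Chain the prompt template into the model, then invoke with the template variables
chain = prompt | llm
chain.invoke({"question": question})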

Using the GPU

If you have a GPU, set the number of layers to offload to it and the batch size for parallel token processing:

n_gpu_layers = -1  # Number of layers to offload to the GPU (-1 offloads all layers)
n_batch = 512  # Number of tokens processed in parallel; pick a value that fits in your GPU's VRAM

Example:

llm = LlamaCpp(
    model_path="/path/to/your/model.gguf",
    n_gpu_layers=n_gpu_layers,
    n_batch=n_batch,
    callback_manager=callback_manager,
    verbose=True,
)
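
For a Metal build on Apple Silicon, a commonly suggested starting point (the values here are suggestions rather than requirements) is to offload a single layer and keep n_batch at 512:

# Hedged Metal configuration sketch; adjust the values for your machine
llm = LlamaCpp(
    model_path="/path/to/your/model.gguf",
    n_gpu_layers=1,
    n_batch=512,
    callback_manager=callback_manager,
    verbose=True,
)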

If you have further inquiries, please contact our support team.

FAQs (Frequently Asked Questions)

1. What is llama-cpp?

llama-cpp is a lightweight C++ implementation of Meta’s LLaMA (Large Language Model Meta AI). It is optimized for running LLaMA models efficiently on consumer hardware without requiring a GPU.

2. What is llama-cpp used for?

llama-cpp is designed for efficiency and portability. It allows users to run large language models on CPUs with minimal dependencies, making it ideal for edge computing, offline AI, and lightweight AI applications.

3. Does llama-cpp require a GPU?

No, llama-cpp is optimized for running entirely on a CPU. However, it does support GPU acceleration through Metal (macOS), CUDA (NVIDIA GPUs), and OpenCL (AMD GPUs) for better performance.

4. Which models does llama-cpp support?

llama-cpp primarily supports Meta’s LLaMA models (LLaMA, LLaMA 2, and LLaMA 3). It also works with other models in GGUF format, including Mistral, Falcon, and OpenAssistant models.

5. What is the GGUF format?

GGUF is the model file format used by current versions of llama-cpp. It reduces file size, improves loading times, and ensures compatibility with a wide range of model architectures.

6. Does llama-cpp support LoRA fine-tuning?

Yes, llama-cpp supports LoRA (Low-Rank Adaptation), allowing users to adapt LLaMA models to specific tasks while using less computational power.
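
As an illustrative sketch, an adapter can be applied when loading a model through the Python binding; the lora_path parameter exists in recent llama-cpp-python releases but may differ in yours, and the file paths are placeholders:

from llama_cpp import Llama

# Load a base GGUF model and apply a LoRA adapter at load time (paths are placeholders)
llm = Llama(
    model_path="models/base-model.gguf",
    lora_path="adapters/my-task-lora.bin",
)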

7. How much RAM do I need to run llama-cpp?

The required RAM depends on the model size:

  • 7B model → ~4GB RAM
  • 13B model → ~10GB RAM
  • 30B model → ~20GB+ RAM
  • 65B model → ~40GB+ RAM

For larger models, using a swap file or running on a GPU with enough VRAM is recommended.
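
As a rough back-of-the-envelope check (weights only, ignoring context/KV-cache and other runtime overhead), memory use is approximately the parameter count times the bytes per weight of the chosen quantization:

# Approximate weight memory for a 13B model quantized to 4 bits per weight
params = 13e9
bits_per_weight = 4
approx_gb = params * bits_per_weight / 8 / 1e9
print(f"~{approx_gb:.1f} GB before runtime overhead")  # ~6.5 GB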

8. Is llama-cpp on a CPU faster than running on a GPU?

It depends. Compared with low-end GPUs, llama-cpp on a CPU can be comparable or even faster thanks to its optimizations. However, high-end GPUs (e.g., an RTX 4090) will significantly outperform CPU-only execution.

9. Can I run llama-cpp on a phone or a Raspberry Pi?

Yes! llama-cpp is lightweight and optimized for ARM-based devices, including Android phones and the Raspberry Pi. However, performance will be limited by the hardware’s capabilities.

10. Does llama-cpp support GPU acceleration?

Yes! llama-cpp supports GPU acceleration via CUDA, Metal, OpenCL, and Vulkan to improve performance on supported hardware. However, its efficiency depends on your GPU model and available VRAM.
