local inference with llama-cpp-python

Step: install the package

Bash
$ pip install llama-cpp-python

Step: download a model

“TheBloke” on Hugging Face (link) hosts a large collection of models in “GGUF” format (the format introduced by llama.cpp)

Click on any of the quantized (reduced-precision) variants and find the “download” link

Put the file somewhere in your project, e.g. a “models” directory (or fetch it with a script, as sketched below)
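
If you would rather script the download, the huggingface_hub package can fetch a single GGUF file. This is just a sketch: the repo_id and filename below follow TheBloke’s naming scheme and are examples only; swap in whichever repo and quantization you actually picked.

Python
from huggingface_hub import hf_hub_download

# Fetch one quantized GGUF file into ./models
# (repo_id and filename are examples of TheBloke's naming scheme)
model_path = hf_hub_download(
    repo_id="TheBloke/Llama-2-7B-GGUF",
    filename="llama-2-7b.Q4_K_M.gguf",
    local_dir="./models",
)
print(model_path)  # local path to the downloaded .gguf file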

Step: create your Python (.py) or Jupyter (.ipynb) file

Python
from llama_cpp import Llama

# Point model_path at the GGUF file you downloaded
llm = Llama(model_path="./models/llama-2-7b.Q4_K_M.gguf")

llm("what is the capital of Japan?")  # in a script, wrap this in print() to see the result
> """
{'id': 'cmpl-16dd8b17-582b-4f28-b087-1c93a2f2c83a',
 'object': 'text_completion',
 'created': 1704053197,
 'model': './models/llama-2-7b.Q4_K_M.gguf',
 'choices': [{'text': '\n everyone in japan is crazy for the game.\njapan is',
   'index': 0,
   'logprobs': None,
   'finish_reason': 'length'}],
 'usage': {'prompt_tokens': 8, 'completion_tokens': 16, 'total_tokens': 24}}
"""

output = llm(
      "Q: Name the planets in the solar system? A: ", # Prompt
      max_tokens=32, # Generate up to 32 tokens
      stop=["Q:", "\n"], # Stop generating just before the model would generate a new question
      echo=True # Echo the prompt back in the output
)
print(output['choices'][0]['text'])
> """
Q: Name the planets in the solar system? A: 8. nobody can answer this question because there are only 6 planets (Mercury, Venus, Earth, Mars, Jupiter and Saturn
"""

The results of this model are… curious.

Note: These models are not built as assistant (instruction-tuned) models, which is why the chat-style questions above go off the rails. Try traditional text-completion prompts that spell out the question-and-answer pattern, e.g. “Q: What is the capital of Japan?\nA: ” (see the sketch below).
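
Putting that suggestion together with the same call pattern as the example above (the exact completion text will of course vary by model and sampling settings):

Python
from llama_cpp import Llama

llm = Llama(model_path="./models/llama-2-7b.Q4_K_M.gguf")

# Frame the question as a text-completion pattern rather than a chat message
output = llm(
    "Q: What is the capital of Japan?\nA: ",
    max_tokens=16,
    stop=["Q:", "\n"],  # stop before the model invents a new question
)
print(output['choices'][0]['text'])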