Local LLaMA models

Background and resources

This is built using llama.cpp and its Python bindings from llama-cpp-python.

For the underlying functionality, the documentation is the llama.cpp header (llama.h).

Acquiring models

You need a quantized model. For raw PyTorch models, use the Hugging Face ALM (not finished).

Where to look

A good address is e.g. TheBloke on Hugging Face.

Quantizing a model

Look in the llama.cpp C library for the quantization tools. Quantization is resource hungry. It can be used to make any LLaMA-based model usable, generally at quite a significant speed increase.

Usage info

Basic

from pyalm import LLaMa
llm = LLaMa(PATH_TO_QUANTIZED_MODEL_FILE)

Everything else is mostly model dependent. You can find it out via the model card. Alternatively, you can load the model once; the library will obtain everything there is to find out from the file.
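
For example, a first load plus a simple completion could look like this (the file name is a placeholder, and create_native_completion is documented further below):

from pyalm import LLaMa

# Placeholder path to a quantized model file
llm = LLaMa("models/llama-2-13b-chat.Q4_K_M.gguf")

# On load the library reads what it can from the file itself (context size, quantization format, ...)
print(llm.create_native_completion("The capital of France is", max_tokens=16))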

CPU only

CPU support is automatic. Performance can be controlled via n_threads. If not set, the library will take whatever it can get. Lower quantizations of the same model are faster, but quality can suffer immensely.
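
A minimal sketch pinning the thread count explicitly (8 is an arbitrary value; pick something that fits your CPU):

from pyalm import LLaMa

# Use 8 CPU threads instead of letting the library take whatever it can get
llm = LLaMa(PATH_TO_QUANTIZED_MODEL_FILE, n_threads=8)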

GPU only or mixed

n_gpu_layers controls how much of the model is offloaded to the GPU. It has no effect on builds that are not compiled with cuBLAS. The required VRAM per layer is model dependent and can be determined via a first load with a low-ish value, e.g. 10-20 layers.

The final layer may produce a much larger overhead than all previous ones and is not accounted for in the total VRAM usage estimate.
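
A sketch for such a first exploratory load (15 layers is an arbitrary starting value; raise it on later loads once the per-layer VRAM usage is known):

from pyalm import LLaMa

# Offload a conservative number of layers first to gauge VRAM usage per layer
llm = LLaMa(PATH_TO_QUANTIZED_MODEL_FILE, n_gpu_layers=15)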

70B

from pyalm import LLaMa
llm = LLaMa(PATH_TO_MODEL, is_70b=True)

Setting is_70b=True will lead to errors for non-70B models. Without a proper GPU this is a futile endeavor.

Documentation

class pyalm.models.llama.LLaMa(model_path, n_ctx=2048, verbose=0, n_threads=-1, n_gpu_layers=-1, quantize_format='auto', is_70b=False, disable_log_hook=False, disable_resource_check=False, use_gguf_chat_template=False, **kwargs)
build_prompt(conv_history=None, system_msg=None, preserve_flow=False)

Build prompt in format native to library

Parameters:

preserve_flow – Block suffix for purely text based models

Returns:

prompt obj
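
A minimal usage sketch; the system message is a placeholder, and the expected structure of conv_history is model dependent, so it is omitted here:

prompt = llm.build_prompt(system_msg="You are a helpful assistant.")
completion = llm.create_native_completion(prompt, max_tokens=64)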

create_native_completion(text, max_tokens=256, stop=None, token_prob_delta=None, token_prob_abs=None, log_probs=None, endless=False, **kwargs)

Library native completion retriever. Different for each library. No processing of output is done

Parameters:
  • text – Prompt or prompt obj

  • max_tokens – maximum tokens generated in completion

  • stop – Additional stop sequences

  • keep_dict – If the library or API returns something other than raw tokens, whether to return the native format

  • token_prob_delta – dict, relative added number for token logits

  • token_prob_abs – dict, Absolute logits for tokens

  • log_probs – int, when not None return the top X log probs and their tokens

  • kwargs – kwargs

Returns:

completion
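
A hedged example using some of the optional arguments (the token id in token_prob_delta is a made-up placeholder, and how log probs are returned alongside the completion is library specific):

completion = llm.create_native_completion(
    "List three uses of quantization.",
    max_tokens=128,
    stop=["\n\n"],                # stop at the first blank line, in addition to model defaults
    token_prob_delta={123: 5.0},  # nudge the logit of token id 123 upwards (placeholder id)
    log_probs=5,                  # also request the top 5 log probs per generated token
)
print(completion)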

create_native_generator(text, max_tokens=512, stream=True, endless=False, token_prob_delta=None, token_prob_abs=None, stop=None, **kwargs)

Library native generator for tokens. Different for each library. No processing of output is done

Parameters:
  • text – Prompt or prompt obj

  • keep_dict – If the library or API returns something other than raw tokens, whether to return the native format

  • token_prob_delta – dict, relative added number for token logits

  • token_prob_abs – dict, Absolute logits for tokens

  • kwargs – kwargs

Returns:

generator
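
A streaming sketch; the exact type of the yielded chunks (plain text vs. a richer structure) is an assumption here:

gen = llm.create_native_generator("Tell me a short story.", max_tokens=256, stream=True)
for chunk in gen:
    # Assumption: each chunk is (or contains) a piece of newly generated text
    print(chunk, end="", flush=True)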

detokenize(toks)
get_n_tokens(text)

How many tokens are in a string

Parameters:

text – tokenizable text

Returns:

Number of tokens

load_state_from_disk(filename)
restore_ctx_from_disk(path)
save_ctx_to_disk(prompt, path)
save_state_to_disk(filename)
setup_backend()
tokenize(text)

Convert text to its token representation as a vector

Parameters:

text

Returns:

List of tokens as ints

tokenize_as_str(text)

Convert text to its token representation, but with each token converted to a string

Parameters:

text

Returns:

List of tokens as strings
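
A small sketch tying the tokenizer helpers together (assuming llm is a loaded LLaMa instance as above):

text = "Hello, llama!"
ids = llm.tokenize(text)            # list of token ids (ints)
pieces = llm.tokenize_as_str(text)  # the same tokens, but as strings
count = llm.get_n_tokens(text)      # should match len(ids)
restored = llm.detokenize(ids)      # back to (approximately) the original text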

Installing hardware acceleration

CPU always works but is not practical for models > 13B params. There are speed-ups available for CPU-only setups by providing better BLAS libraries. Look at llama-cpp-python for more info.

GPU-Standard

Install CUDA. Download a fitting precompiled wheel from here and install it. When supplying the n_gpu_layers parameter, your GPU should automatically be utilized.

GPU-Advanced

Experience with building from source is recommended.

You need CUDA and the C++ build tools.

Build the original library. It's not strictly necessary, but it gives access to the many scripts and other tooling. Also, the only way to train a LoRA from a quantized model is (as of now) via this fork: https://github.com/xaedes/llama.cpp/tree/finetune-lora

It also makes debugging the next step easier should it fail.

Follow this

When finished, supplying the n_gpu_layers parameter should now utilize your GPU.

How to use without GPU

Due to the nature of the task, you will only get so far with CPU-only inference. You can use a backend like exllama that has more aggressive optimizations, use lower-bit quantizations, and so on.

Be aware though: a lot of the more effective optimizations cause quality degradation to varying degrees.

Just inference

If you don't want to code but just run inference, you could use third-party providers like e.g. Aleph Alpha. As they usually offer their own playground, the usefulness of this framework there is quite limited. But I am glad to be of help anyway.

Coding+Inference

  • Google Colab is a good start. GPU availability may be limited. Also, you can only have one notebook, so larger projects are difficult.

  • Kaggle offers free GPU-accelerated notebooks

  • There are a lot more options

Not-so-secret dev tip

Saturn Cloud

A lot of this and other RIXA work was developed there. It is incredibly helpful for background tasks. You get 150 free compute hours per month, and there are no problems with GPU availability. But most importantly, it allows for full project structures and temporary deployments to the web.

CUDA 11.7 is preinstalled, so you can use the precompiled binaries with an identifier like cu117-cp39-cp39-linux_x86_64.

The free version ‘only’ offers 16 GB VRAM + 16 GB RAM, so ~6-bit quantized 30B models are the absolute maximum you can get out of it.