Local LLaMa models¶
Background and resources¶
This is built using llama.cpp and its Python bindings from llama-cpp-python.
The main documentation is the llama.h header.
Acquiring models¶
You need a quantized model. For raw PyTorch models use the Hugging Face ALM (not finished).
Where to look¶
A good address is e.g. TheBloke on Hugging Face.
Quantizing a model¶
See the llama.cpp C library for the quantization tooling. Quantization is resource hungry, but it can be used to make any Llama-based model usable, generally with quite a significant speed increase.
Usage info¶
Basic
from pyalm import LLaMa
llm = LLaMa(PATH_TO_QUANTIZED_MODEL_FILE)
Everything else is mostly model dependent. You can find the relevant settings on the model card. Alternatively, you can load the model a single time; the library will obtain everything there is to find out from the file.
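For example, the context window is one of these model-dependent settings. A minimal sketch of passing it explicitly via the documented n_ctx parameter (the value 4096 is purely illustrative, take the real one from the model card):

from pyalm import LLaMa
# n_ctx is the context length; 4096 is just an example value
llm = LLaMa(PATH_TO_QUANTIZED_MODEL_FILE, n_ctx=4096)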
CPU only¶
CPU support is automatic. Performance can be controlled via n_threads. If not set, the library will take whatever it can get.
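A minimal sketch of pinning the thread count via the documented n_threads parameter (the value 8 is illustrative; pick something that fits your CPU):

from pyalm import LLaMa
# n_threads=-1 (the default) lets the library take whatever it can get; set it explicitly to limit CPU usage
llm = LLaMa(PATH_TO_QUANTIZED_MODEL_FILE, n_threads=8)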
Lower quantizations of the same model are faster but quality can suffer immensely.
GPU only or mixed¶
n_gpu_layers controls how much of the model is offloaded to a GPU. It has no effect on builds that are not compiled with CUBLAS.
The required VRAM per layer is model dependent and can be found out via a first load with a low-ish value, e.g. 10-20 layers.
The final layer may produce a much larger overhead than all previous ones and is not accounted for in the total VRAM usage estimate.
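A minimal sketch of such a first load with a conservative offload (20 layers is illustrative; raise it until your VRAM is nearly full):

from pyalm import LLaMa
# Offload 20 layers to the GPU for a first test; only has an effect on builds compiled with CUBLAS
llm = LLaMa(PATH_TO_QUANTIZED_MODEL_FILE, n_gpu_layers=20)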
70B¶
from pyalm import LLaMa
llm = LLaMa(PATH_TO_MODEL, is_70b=True)
This will lead to errors for non-70B models. Without a proper GPU this is a futile endeavor.
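A slightly more realistic sketch, assuming a CUBLAS build with enough VRAM; the n_gpu_layers value of 40 is purely illustrative and depends on your card:

from pyalm import LLaMa
# is_70b enables the 70B-specific handling; combine it with GPU offloading
llm = LLaMa(PATH_TO_MODEL, is_70b=True, n_gpu_layers=40)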
Documentation¶
- class pyalm.models.llama.LLaMa(model_path, n_ctx=2048, verbose=0, n_threads=-1, n_gpu_layers=-1, quantize_format='auto', is_70b=False, disable_log_hook=False, disable_resource_check=False, use_gguf_chat_template=False, **kwargs)¶
- build_prompt(conv_history=None, system_msg=None, preserve_flow=False)¶
Build a prompt in the format native to the library
- Parameters:
preserve_flow – Block suffix for purely text-based models
- Returns:
prompt obj
- create_native_completion(text, max_tokens=256, stop=None, token_prob_delta=None, token_prob_abs=None, log_probs=None, endless=False, **kwargs)¶
Library-native completion retriever. Different for each library. No processing of the output is done. (A short usage sketch follows the class reference below.)
- Parameters:
text – Prompt or prompt obj
max_tokens – maximum tokens generated in completion
stop – Additional stop sequences
keep_dict – If the library or API returns something other than raw tokens, whether to return the native format
token_prob_delta – dict, relative added number for token logits
token_prob_abs – dict, absolute logits for tokens
log_probs – int, when not None return the top X log probs and their tokens
kwargs – kwargs
- Returns:
completion
- create_native_generator(text, max_tokens=512, stream=True, endless=False, token_prob_delta=None, token_prob_abs=None, stop=None, **kwargs)¶
Library-native generator for tokens. Different for each library. No processing of the output is done.
- Parameters:
text – Prompt or prompt obj
keep_dict – If the library or API returns something other than raw tokens, whether to return the native format
token_prob_delta – dict, relative added number for token logits
token_prob_abs – dict, absolute logits for tokens
kwargs – kwargs
- Returns:
generator
- detokenize(toks)¶
- get_n_tokens(text)¶
How many tokens are in a string
- Parameters:
text – tokenizable text
- Returns:
amount
- load_state_from_disk(filename)¶
- restore_ctx_from_disk(path)¶
- save_ctx_to_disk(prompt, path)¶
- save_state_to_disk(filename)¶
- setup_backend()¶
- tokenize(text)¶
Convert text to its token (vector) representation
- Parameters:
text –
- Returns:
List of tokens as ints
- tokenize_as_str(text)¶
Convert text to its token representation, but with each token converted to a string
- Parameters:
text –
- Returns:
List of tokens as strings
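A short usage sketch tying the documented methods together. The prompt strings, stop sequence, and token limits are illustrative only, and the returned objects come in the library's native, unprocessed format:

from pyalm import LLaMa

llm = LLaMa(PATH_TO_QUANTIZED_MODEL_FILE)

# Token helpers
print(llm.get_n_tokens("Hello world"))      # number of tokens in the string
print(llm.tokenize_as_str("Hello world"))   # the same tokens as strings

# Library-native completion; no post-processing is applied to the output
completion = llm.create_native_completion("Q: What is quantization?\nA:",
                                          max_tokens=128, stop=["\nQ:"])
print(completion)

# Streaming via the native generator; chunks are yielded in the library's native format
for chunk in llm.create_native_generator("Tell me about llamas.", max_tokens=64):
    print(chunk)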
Installing hardware acceleration¶
CPU always works but is not practical for models > 13B params. Speed-ups for CPU-only setups are available by providing better BLAS libraries. Look at the llama-cpp-python documentation for more info.
GPU-Standard¶
Install CUDA. Download a fitting precompiled wheel from here and install it.
When supplying the n_gpu_layers parameter, your GPU should automatically be utilized.
GPU-Advanced¶
Recommended only if you have experience with building from source.
You need CUDA and the C++ build tools.
Build the original llama.cpp library. This is not strictly necessary, but it gives access to the numerous scripts and other tooling, makes debugging the next step easier should it fail, and is currently the only way to train a LoRA from a quantized model, via this fork: https://github.com/xaedes/llama.cpp/tree/finetune-lora
Follow this.
When finished, supplying the n_gpu_layers parameter should now utilize your GPU.
How to use without GPU¶
Due to the nature of the task you will only get so far with CPU-only inference. You can use a backend like exllama that applies more aggressive optimizations, use lower-bit quantizations, and so on.
Be aware though: a lot of the more effective optimizations cause quality degradation to varying degrees.
Just inference¶
If you don't want to code but just run inference, you could use third-party providers like e.g. Aleph-Alpha. As they usually offer their own playground, the usefulness of this framework there is quite limited, but I am glad to be of help anyway.
Coding+Inference¶
Google Colab is a good start, although GPU availability may be limited and you can only have one notebook, so larger projects are difficult.
Kaggle offers free GPU-accelerated notebooks.
There are plenty of other options.
Not-so-secret dev tip¶
A lot of this and other RIXA functionality was developed there. It is incredibly helpful for background tasks: you get 150 free compute hours per month, there are no problems with GPU availability, and, most importantly, it allows for full project structures and temporary deployments onto the web.
CUDA 11.7 is preinstalled, so you can use precompiled binaries with an identifier like this:
cu117-cp39-cp39-linux_x86_64
The free version 'only' offers 16 GB VRAM + 16 GB RAM, so roughly 6-bit quantized 30B models are the absolute maximum you can get out of it.