Local LLaMa models¶
Background and resources¶
This is built using llama.cpp and its Python bindings from llama-cpp-python.
The main documentation is the llama.h header.
Acquiring models¶
You need a quantized model. For raw PyTorch models use the Hugging Face ALM (not finished).
Where to look¶
A good address is e.g. TheBloke on Hugging Face.
Quantizing a model¶
See the llama.cpp C library for the quantization tooling. Quantization is resource hungry, but it can be used to make any Llama-based model usable, generally with quite a significant speed increase.
Usage info¶
Basic
from pyalm import LLaMa
llm = LLaMa(PATH_TO_QUANTIZED_MODEL_FILE)
Everything else is mostly model dependent. You can find the relevant settings on the model card. Alternatively, you can load the model a single time; the library will obtain everything there is to find out from the file.
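For example, the context window is one of these model-dependent settings. A minimal sketch of passing it explicitly via the documented n_ctx parameter (the value 4096 is purely illustrative, take the real one from the model card):

from pyalm import LLaMa
# n_ctx is the context length; 4096 is just an example value
llm = LLaMa(PATH_TO_QUANTIZED_MODEL_FILE, n_ctx=4096)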
CPU only¶
CPU support is automatic. Performance can be controlled via n_threads. If not set, the library will take whatever it can get.
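A minimal sketch of pinning the thread count via the documented n_threads parameter (the value 8 is illustrative; pick something that fits your CPU):

from pyalm import LLaMa
# n_threads=-1 (the default) lets the library take whatever it can get; set it explicitly to limit CPU usage
llm = LLaMa(PATH_TO_QUANTIZED_MODEL_FILE, n_threads=8)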
Lower quantizations of the same model are faster but quality can suffer immensely.
GPU only or mixed¶
n_gpu_layers controls how much of the model is offloaded to a GPU. It has no effect on builds that are not compiled with CUBLAS.
The required VRAM per layer is model dependent and can be found out via a first load with a low-ish value, e.g. 10-20 layers.
The final layer may produce a much larger overhead than all previous ones and is not accounted for in the total VRAM usage estimate.
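A minimal sketch of such a first load with a conservative offload (20 layers is illustrative; raise it until your VRAM is nearly full):

from pyalm import LLaMa
# Offload 20 layers to the GPU for a first test; only has an effect on builds compiled with CUBLAS
llm = LLaMa(PATH_TO_QUANTIZED_MODEL_FILE, n_gpu_layers=20)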
70B¶
from pyalm import LLaMa
llm = LLaMa(PATH_TO_MODEL, is_70b=True)
This will lead to errors for non-70B models. Without a proper GPU this is a futile endeavor.
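A slightly more realistic sketch, assuming a CUBLAS build with enough VRAM; the n_gpu_layers value of 40 is purely illustrative and depends on your card:

from pyalm import LLaMa
# is_70b enables the 70B-specific handling; combine it with GPU offloading
llm = LLaMa(PATH_TO_MODEL, is_70b=True, n_gpu_layers=40)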
Documentation¶
- class pyalm.models.llama.LLaMa(model_path, n_ctx=2048, verbose=0, n_threads=-1, n_gpu_layers=-1, quantize_format='auto', is_70b=False, disable_log_hook=False, disable_resource_check=False, use_gguf_chat_template=False, **kwargs)¶
- build_prompt(conv_history=None, system_msg=None, preserve_flow=False)¶
Build a prompt in the format native to the library
- Parameters:
preserve_flow – Block suffix for purely text-based models
- Returns:
prompt obj
- create_native_completion(text, max_tokens=256, stop=None, token_prob_delta=None, token_prob_abs=None, log_probs=None, endless=False, **kwargs)¶
Library-native completion retriever. Different for each library. No processing of the output is done. (A short usage sketch follows the class reference below.)
- Parameters:
text – Prompt or prompt obj
max_tokens – maximum tokens generated in completion
stop – Additional stop sequences
keep_dict – If the library or API returns something other than raw tokens, whether to return the native format
token_prob_delta – dict, relative added number for token logits
token_prob_abs – dict, absolute logits for tokens
log_probs – int, when not None return the top X log probs and their tokens
kwargs – kwargs
- Returns:
completion
- create_native_generator(text, max_tokens=512, stream=True, endless=False, token_prob_delta=None, token_prob_abs=None, stop=None, **kwargs)¶
Library-native generator for tokens. Different for each library. No processing of the output is done.
- Parameters:
text – Prompt or prompt obj
keep_dict – If the library or API returns something other than raw tokens, whether to return the native format
token_prob_delta – dict, relative added number for token logits
token_prob_abs – dict, absolute logits for tokens
kwargs – kwargs
- Returns:
generator
- detokenize(toks)¶
- get_n_tokens(text)¶
How many tokens are in a string
- Parameters:
text – tokenizable text
- Returns:
amount
- load_state_from_disk(filename)¶
- restore_ctx_from_disk(path)¶
- save_ctx_to_disk(prompt, path)¶
- save_state_to_disk(filename)¶
- setup_backend()¶
- tokenize(text)¶
Convert text to its token (vector) representation
- Parameters:
text –
- Returns:
List of tokens as ints
- tokenize_as_str(text)¶
Convert text to its token representation, but with each token converted to a string
- Parameters:
text –
- Returns:
List of tokens as strings
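A short usage sketch tying the documented methods together. The prompt strings, stop sequence, and token limits are illustrative only, and the returned objects come in the library's native, unprocessed format:

from pyalm import LLaMa

llm = LLaMa(PATH_TO_QUANTIZED_MODEL_FILE)

# Token helpers
print(llm.get_n_tokens("Hello world"))      # number of tokens in the string
print(llm.tokenize_as_str("Hello world"))   # the same tokens as strings

# Library-native completion; no post-processing is applied to the output
completion = llm.create_native_completion("Q: What is quantization?\nA:",
                                          max_tokens=128, stop=["\nQ:"])
print(completion)

# Streaming via the native generator; chunks are yielded in the library's native format
for chunk in llm.create_native_generator("Tell me about llamas.", max_tokens=64):
    print(chunk)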
Installing hardware acceleration¶
CPU always works but is not practical for models > 13B params. Speed-ups for CPU-only setups are available by providing better BLAS libraries. Look at the llama-cpp-python documentation for more info.
GPU-Standard¶
Install CUDA. Download a fitting precompiled wheel from here and install it.
When supplying the n_gpu_layers parameter, your GPU should automatically be utilized.
GPU-Advanced¶
Recommended only if you have experience with building from source.
You need CUDA and the C++ build tools.
Build the original llama.cpp library. This is not strictly necessary, but it gives access to the numerous scripts and other tooling, makes debugging the next step easier should it fail, and is currently the only way to train a LoRA from a quantized model, via this fork: https://github.com/xaedes/llama.cpp/tree/finetune-lora
Follow this.
When finished, supplying the n_gpu_layers parameter should now utilize your GPU.
How to use without GPU¶
Due to the nature of the task you will only get so far with CPU-only inference. You can use a backend like exllama that applies more aggressive optimizations, use lower-bit quantizations, and so on.
Be aware though: a lot of the more effective optimizations cause quality degradation to varying degrees.
Just inference¶
If you don't want to code but just run inference, you could use third-party providers like e.g. Aleph-Alpha. As they usually offer their own playground, the usefulness of this framework there is quite limited, but I am glad to be of help anyway.
Coding+Inference¶
Google Colab is a good start, although GPU availability may be limited and you can only have one notebook, so larger projects are difficult.
Kaggle offers free GPU-accelerated notebooks.
There are plenty of other options.
Not-so-secret dev tip¶
A lot of this and other RIXA functionality was developed there. It is incredibly helpful for background tasks: you get 150 free compute hours per month, there are no problems with GPU availability, and, most importantly, it allows for full project structures and temporary deployments onto the web.
CUDA 11.7 is preinstalled, so you can use precompiled binaries with an identifier like this:
cu117-cp39-cp39-linux_x86_64
The free version 'only' offers 16 GB VRAM + 16 GB RAM, so roughly 6-bit quantized 30B models are the absolute maximum you can get out of it.