# Local LLaMa models

## Background and resources
This is built using [llama.cpp](https://github.com/ggerganov/llama.cpp) and its Python bindings from [llama-cpp-python](https://github.com/abetlen/llama-cpp-python). The underlying documentation is the llama.cpp [header](https://github.com/ggerganov/llama.cpp/blob/master/llama.h).

## Acquiring models
You need a quantized model. For raw PyTorch models use the Hugging Face ALM (not finished).

### Where to look
A good address is e.g. [TheBloke](https://huggingface.co/TheBloke).

### Quantizing a model
Look in the [C library](https://github.com/ggerganov/llama.cpp). Quantization is resource hungry, but it can be used to make almost any LLaMa-based model usable, generally with quite a significant speed increase.

## Usage info
Basic usage:
```python
from pyalm import LLaMa
llm = LLaMa(PATH_TO_QUANTIZED_MODEL_FILE)
```
Everything else is mostly model dependent. You can find that out via the model card. Alternatively, you can load the model a single time; the library will obtain everything there is to find out from the file.

### CPU only
CPU support is automatic. Performance can be controlled via `n_threads`. If not set, the library will take whatever it can get. Lower quantizations of the same model are faster, but quality can suffer immensely.

### GPU only or mixed
`n_gpu_layers` controls how much of the model is offloaded to a GPU. It has no effect on versions that are not compiled with cuBLAS. The required VRAM per layer is model dependent and can be found out via a first load with a low-ish value like 10-20 layers. The final layer may produce a much larger overhead than all previous ones and is not accounted for in the total VRAM usage estimate.
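A minimal sketch of the two knobs described above, assuming `n_threads` and `n_gpu_layers` are accepted as constructor keyword arguments (check the documentation section below for the exact signature) and using a placeholder model path:

```python
from pyalm import LLaMa

# Placeholder path - point this at your own quantized model file
PATH_TO_QUANTIZED_MODEL_FILE = "models/llama-2-13b.Q4_K_M.gguf"

# CPU only: pin the thread count instead of letting the library grab every core
llm_cpu = LLaMa(PATH_TO_QUANTIZED_MODEL_FILE, n_threads=8)

# Mixed CPU/GPU: start with a low-ish number of offloaded layers (10-20),
# check the reported VRAM usage on the first load, then raise the value
# until VRAM is nearly full (leave headroom for the final layer's overhead).
llm_gpu = LLaMa(PATH_TO_QUANTIZED_MODEL_FILE, n_gpu_layers=20)
```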
### 70B
```python
from pyalm import LLaMa
llm = LLaMa(PATH_TO_MODEL, is_70b=True)
```
This will lead to errors for non-70B models. Without a proper GPU this is a futile endeavor.

## Documentation
```{eval-rst}
.. automodule:: pyalm.models.llama
    :members:
    :undoc-members:
```

## Installing hardware acceleration
CPU always works but is not _goal oriented_ for models with more than ~13B parameters. There are CPU-only speed-ups available via providing better BLAS libraries. Look at [llama-cpp-python](https://github.com/abetlen/llama-cpp-python) for more info.

### GPU-Standard
Install CUDA. Download a fitting precompiled wheel from [here](https://github.com/jllllll/llama-cpp-python-cuBLAS-wheels/releases/tag/wheels) and install it. When supplying the `n_gpu_layers` parameter your GPU should automatically be utilized.

### GPU-Advanced
*Experience with building from source is recommended.*

You need CUDA and the C++ build tools. Build the original [library](https://github.com/ggerganov/llama.cpp). This is not strictly necessary, but it gives access to the endless scripts and other tooling, makes debugging the next step easier should it fail, and (as of now) the only way to train a LoRA from a quantized model is via this fork: https://github.com/xaedes/llama.cpp/tree/finetune-lora.

Then follow [this](https://github.com/abetlen/llama-cpp-python). When finished, supplying the `n_gpu_layers` parameter should now utilize your GPU.

## How to use without GPU
Due to the nature of the task you will only get so far with CPU-only inference. You can use a backend like exllama that has more aggressive optimizations, use lower-bit quantizations and so on. Be aware though: a lot of the more effective optimizations cause quality degradation to varying degrees.

### Just inference
If you don't want to code but just infer, you could use third-party providers like e.g. Aleph-Alpha. As they usually offer their own playground, the usefulness of this framework there is quite limited. But I am glad to be of help anyway.

### Coding+Inference
* Google Colab is a good start. GPU availability may be limited. Also you can only have one notebook, so larger projects are difficult.
* Kaggle offers free GPU-accelerated notebooks.
* There is a lot more out there.

### Not-so-secret dev tip
[Saturncloud](https://saturncloud.io/). A lot of this and other RIXA stuff was developed there. It is incredibly helpful for background tasks. You get 150 free compute hours/month, there are no problems with GPU availability, and most importantly it allows for full project structures and temporary deployments to the web. CUDA is preinstalled (11.7), so you can use the precompiled wheels with an identifier like `cu117-cp39-cp39-linux_x86_64`. The free version 'only' contains 16 GB VRAM + 16 GB RAM, so a ~6-bit quantized 30B model is the absolute maximum you can get out of it.
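To see why that is roughly the limit, here is a back-of-the-envelope sketch (assuming about 6.5 effective bits per weight for a ~6-bit quantization and ignoring KV cache and other runtime overhead):

```python
# Rough memory estimate for quantized model weights only, to sanity-check
# what fits on a given machine. Numbers are approximations.
def weight_footprint_gib(n_params_billion: float, bits_per_weight: float) -> float:
    return n_params_billion * 1e9 * bits_per_weight / 8 / 1024**3

# A ~6-bit 30B model needs roughly this much memory for its weights...
print(round(weight_footprint_gib(30, 6.5), 1), "GiB")  # ~22.7 GiB
# ...which only fits on the free tier by splitting it across the 16 GB of
# VRAM (offloaded layers) and the 16 GB of system RAM.
```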