 
g4l is a high-level Python library that lets you run language models locally through the llama.cpp bindings. It is a sister project to @gpt4free, which also provides AI, but via the internet and external providers, as well as additional features such as text retrieval from documents.
Pull requests are welcome!
- GUI / playground
- Support function calling & image models
- TTS / STT models
- Blog article creator (use multiple queries to produce a high-quality blog article with efficient style prompting and context retrieval)
- Allow for passing of more arguments
- Improve compatibility / Unittests.
- Native binding implementation / more low-level usage of llama-cpp-python
- Ability to finetune models on datasets / dataset generator
- Optimise for devices with low memory and compute (the current minimum is 8 GB of RAM, and a GPU is preferred)
- Blog articles explaining usage and how LLMs work.
- Better model list / optimised parameters
- Create custom local benchmarking.
To use G4L, you need to have the llama.cpp Python bindings installed. You can install them using pip:
pip3 install -U llama-cpp-python
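Depending on your platform and llama-cpp-python version, the default pip build may be CPU-only. GPU backends are usually enabled by rebuilding with the CMAKE_ARGS environment variable; the exact flag names vary between versions, so the Metal example below (for Apple Silicon) is only a sketch:

# Example only: rebuild llama-cpp-python with the Metal backend enabled
# (flag names differ across llama-cpp-python versions and for CUDA/ROCm builds)
CMAKE_ARGS="-DLLAMA_METAL=on" pip3 install -U --force-reinstall --no-cache-dir llama-cpp-python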
- Clone the G4L repository:
git clone https://github.com/gpt4free/gpt4local
- Navigate to the cloned directory:
cd gpt4local
- Install the required dependencies:
pip install -r requirements.txt
- Download the desired models in the GGUF format from HuggingFace. You can find a variety of quantized .gguf models on TheBloke's page (a Python download sketch is shown right after this list).
- Place the downloaded models in the ./models folder.
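If you prefer to script the download, the huggingface_hub package can fetch a single .gguf file straight into ./models. The repository and file name below are only examples; substitute the model you actually want:

from huggingface_hub import hf_hub_download  # pip install huggingface_hub

# Example only: download one q4_0 quantized GGUF file into the local ./models folder
hf_hub_download(
    repo_id   = "TheBloke/Mistral-7B-Instruct-v0.2-GGUF",  # example repository
    filename  = "mistral-7b-instruct-v0.2.Q4_0.gguf",      # example file name
    local_dir = "./models",
)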
Some popular models include:
The models are available in different quantization levels, such as q2_0, q4_0, q5_0, and q8_0. Higher quantization 'bit counts' (4 bits or more) generally preserve more quality, whereas lower levels compress the model further, which can lead to a significant loss in quality. The standard quantization level is q4_0.
Keep in mind the memory requirements for different model sizes (a rough sizing sketch follows this list):
- 7b parameters: ~8 GB of RAM
- 13b parameters: ~16 GB of RAM
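These figures roughly follow from simple arithmetic: the on-disk size of a quantized GGUF file is approximately the parameter count times the bits per weight divided by 8, plus headroom for the KV cache and the runtime itself. A minimal sketch of that estimate (the numbers are approximations, not exact sizes):

def approx_gguf_size_gb(params_billions: float, bits_per_weight: float) -> float:
    # Rough on-disk size of a quantized model: parameters * bits / 8, in GB
    return params_billions * bits_per_weight / 8

print(approx_gguf_size_gb(7, 4))    # ~3.5 GB for a 7b model at ~4 bits/weight
print(approx_gguf_size_gb(13, 4))   # ~6.5 GB for a 13b model at ~4 bits/weight
# Add a few GB of headroom for the context window / KV cache and the runtime,
# which is why ~8 GB of RAM for 7b and ~16 GB for 13b are comfortable minimums.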
According to chat.lmsys.org, the best models are:
- Best 7b model: Mistral-7B-Instruct-v0.2
- Best open-source model: Qwen1.5-72B-Chat (available here)
from g4l.local import LocalEngine
engine = LocalEngine(
    gpu_layers = -1,  # use all GPU layers
    cores      = 0    # use all CPU cores
)
response = engine.chat.completions.create(
    model    = 'orca-mini-3b-gguf2-q4_0',
    messages = [{"role": "user", "content": "hi"}],
    stream   = True
)
for token in response:
    print(token.choices[0].delta.content or "", end="", flush=True)

Note: The model parameter must match the file name of the .gguf model you placed in ./models, without the .gguf extension!
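If you would rather collect the whole reply as a single string instead of printing tokens as they arrive, you can replace the loop above with something like this; the sketch only relies on the streaming interface shown above:

reply = ""
for token in response:
    # delta.content can be empty or None on the final chunk, so guard it
    reply += token.choices[0].delta.content or ""
print(reply)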
from g4l.local import LocalEngine, DocumentRetriever
engine = LocalEngine(
    gpu_layers = -1,  # use all GPU layers
    cores      = 0,   # use all CPU cores
    document_retriever = DocumentRetriever(
        files       = ['einstein-albert.pdf'], 
        embed_model = 'SmartComponents/bge-micro-v2', # https://huggingface.co/spaces/mteb/leaderboard
    )
)
response = engine.chat.completions.create(
    model    = 'mistral-7b-instruct',
    messages = [
        {
            "role": "user", "content": "how was einstein's work in the laboratory"
        }
    ],
    stream   = True
)
for token in response:
    print(token.choices[0].delta.content or "", end="", flush=True)

Note: The embedding model will be downloaded upon first use, but it is really small and lightweight.
G4L provides a DocumentRetriever class that allows you to retrieve relevant information from documents based on a query. Here's an example of how to use it:
from g4l.local import DocumentRetriever
engine = DocumentRetriever(
    files=['einstein-albert.txt'], 
    embed_model='SmartComponents/bge-micro-v2', # https://huggingface.co/spaces/mteb/leaderboard
    verbose=True,
)
retrieval_data = engine.retrieve('what inventions did he do')
for node_with_score in retrieval_data:
    node = node_with_score.node
    score = node_with_score.score
    text = node.text
    metadata = node.metadata
    page_label = metadata['page_label']
    file_name = metadata['file_name']
    
    print(f"Text: {text}")
    print(f"Score: {score}")
    print(f"Page Label: {page_label}")
    print(f"File Name: {file_name}")
    print("---")You can also get a ready-to-go prompt for the language model using the retrieve_for_llm method:
retrieval_data = engine.retrieve_for_llm('what inventions did he do')
print(retrieval_data)

The prompt template used by retrieve_for_llm is as follows:
prompt = (f'Context information is below.\n'
    + '---------------------\n'
    + f'{context_batches}\n'
    + '---------------------\n'
    + 'Given the context information and not prior knowledge, answer the query.\n'
    + f'Query: {query_str}\n'
    + 'Answer: ')

G4L provides several configuration options to customize the behavior of the LocalEngine. Here are some of the available options:
- gpu_layers: The number of layers to offload to the GPU. Use -1 to offload all layers.
- cores: The number of CPU cores to use. Use 0 to use all available cores.
- use_mmap: Whether to use memory mapping for faster model loading. Default is True.
- use_mlock: Whether to lock the model in memory to prevent swapping. Default is False.
- offload_kqv: Whether to offload key, query, and value tensors to the GPU. Default is True.
- context_window: The maximum context window size. Default is 4900.
You can pass these options when creating an instance of LocalEngine:
engine = LocalEngine(
    gpu_layers = -1,
    cores      = 0,
    use_mmap   = True,
    use_mlock  = False,
    offload_kqv= True,
    context_window = 4900
)

Benchmarks were run on a 2022 MacBook Air M2 with 8 GB of RAM.
PC: MacBook Air M2
CPU/GPU: M2 chip
Cores: All (8)
GPU Layers: All
GPU Offload: 100%
Without power adapter (on battery):
Model: mistral-7b-instruct-v2
Number of iterations: 5
Average loading time: 1.85s
Average total tokens: 48.20
Average total time: 5.34s
Average speed: 9.02 t/s
With power adapter (plugged in):
Model: mistral-7b-instruct-v2
Number of iterations: 5
Average loading time: 1.88s
Average total tokens: 317
Average total time: 17.7s
Average speed: 17.9 t/s
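If you want to reproduce numbers like these on your own hardware, a minimal timing sketch over the streaming interface could look roughly like this. The prompt and model name are placeholders, and counting one streamed chunk as one token is only an approximation:

import time
from g4l.local import LocalEngine

engine = LocalEngine(gpu_layers = -1, cores = 0)

start = time.perf_counter()
response = engine.chat.completions.create(
    model    = 'mistral-7b-instruct',   # placeholder: any .gguf model in ./models
    messages = [{"role": "user", "content": "Write three sentences about the moon."}],
    stream   = True
)

tokens = 0
for chunk in response:
    if chunk.choices[0].delta.content:
        tokens += 1                     # rough: one streamed chunk ~ one token

elapsed = time.perf_counter() - start
print(f"{tokens} tokens in {elapsed:.2f}s -> {tokens / elapsed:.2f} t/s")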
- I have built G4L so that you can use language models in a familiar way with a quick installation, while preserving maximum performance.
- Using the direct Python bindings, I was able to max out the performance by using 100% GPU, CPU, and RAM.
- I tried different third-party packages that wrap llama.cpp, like LmStudio, which still performed well, but in my case reached ~7.83 tokens/s compared to 9.02 t/s with the native llama.cpp Python bindings.