Model compatibility
LocalAI is compatible with the models supported by llama.cpp, and also supports GPT4ALL-J and cerebras-GPT with ggml.
Note
LocalAI will attempt to automatically load models which are not explicitly configured for a specific backend. You can specify the backend to use by configuring a model with a YAML file. See the advanced section for more details.
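For example, a minimal model config file might look like the following sketch (the file and model names are illustrative; gpt4all-j is one of the backends listed in the table below):
name: gpt4all-j
# Explicitly select the backend instead of relying on automatic detection
backend: gpt4all-j
parameters:
  # Model file, relative to the models path
  model: ggml-gpt4all-j.bin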
Hardware requirements
Depending on the model you are attempting to run, you might need more RAM or CPU resources. Check out also here for gguf-based backends. rwkv is less expensive on resources.
Model compatibility table
Besides llama-based models, LocalAI is also compatible with other architectures. The table below lists all the compatible model families and the associated binding repository.
| Backend and Bindings | Compatible models | Completion/Chat endpoint | Capability | Embeddings support | Token stream support | Acceleration |
|---|---|---|---|---|---|---|
| llama.cpp | Vicuna, Alpaca, LLaMa | yes | GPT and Functions | yes** | yes | CUDA, openCL, cuBLAS, Metal |
| gpt4all-llama | Vicuna, Alpaca, LLaMa | yes | GPT | no | yes | N/A |
| gpt4all-mpt | MPT | yes | GPT | no | yes | N/A |
| gpt4all-j | GPT4ALL-J | yes | GPT | no | yes | N/A |
| falcon-ggml (binding) | Falcon (*) | yes | GPT | no | no | N/A |
| gpt2 (binding) | GPT2, Cerebras | yes | GPT | no | no | N/A |
| dolly (binding) | Dolly | yes | GPT | no | no | N/A |
| gptj (binding) | GPTJ | yes | GPT | no | no | N/A |
| mpt (binding) | MPT | yes | GPT | no | no | N/A |
| replit (binding) | Replit | yes | GPT | no | no | N/A |
| gptneox (binding) | GPT NeoX, RedPajama, StableLM | yes | GPT | no | no | N/A |
| starcoder (binding) | Starcoder | yes | GPT | no | no | N/A |
| bloomz (binding) | Bloom | yes | GPT | no | no | N/A |
| rwkv (binding) | rwkv | yes | GPT | no | yes | N/A |
| bert (binding) | bert | no | Embeddings only | yes | no | N/A |
| whisper | whisper | no | Audio | no | no | N/A |
| stablediffusion (binding) | stablediffusion | no | Image | no | no | N/A |
| langchain-huggingface | Any text generators available on HuggingFace through API | yes | GPT | no | no | N/A |
| piper (binding) | Any piper onnx model | no | Text to voice | no | no | N/A |
| falcon (binding) | Falcon *** | yes | GPT | no | yes | CUDA |
| sentencetransformers | BERT | no | Embeddings only | yes | no | N/A |
| bark | bark | no | Audio generation | no | no | yes |
| autogptq | GPTQ | yes | GPT | yes | no | N/A |
| exllama | GPTQ | yes | GPT only | no | no | N/A |
| diffusers | SD,… | no | Image generation | no | no | N/A |
| vall-e-x | Vall-E | no | Audio generation and Voice cloning | no | no | CPU/CUDA |
| vllm | Various GPTs and quantization formats | yes | GPT | no | no | CPU/CUDA |
| exllama2 | GPTQ | yes | GPT only | no | no | N/A |
| transformers-musicgen |  | no | Audio generation | no | no | N/A |
| tinydream | stablediffusion | no | Image | no | no | N/A |
| coqui | Coqui | no | Audio generation and Voice cloning | no | no | CPU/CUDA |
| petals | Various GPTs and quantization formats | yes | GPT | no | no | CPU/CUDA |
Note: any backend name listed above can be used in the backend field of the model configuration file (See the advanced section).
Tested with:
Note: You might need to convert some models from older formats to the new format. For indications, see the README in llama.cpp, for instance to run gpt4all.
Subsections of Model compatibility
RWKV
A full example on how to run a rwkv model is in the examples.
Note: rwkv models need the rwkv backend specified in the YAML config file, and an associated tokenizer must be provided alongside the model file:
36464540 -rw-r--r-- 1 mudler mudler 1.2G May 3 10:51 rwkv_small
36464543 -rw-r--r-- 1 mudler mudler 2.4M May 3 10:51 rwkv_small.tokenizer.json
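As a minimal sketch, a YAML config for the files above might look like this (the name is illustrative; the rwkv_small.tokenizer.json file is expected to sit next to the model file as shown above):
name: rwkv_small
backend: rwkv
parameters:
  # Relative to the models path
  model: rwkv_small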
🦙 llama.cpp
llama.cpp is a popular port of Facebook’s LLaMA model in C/C++.
Note
The ggml file format has been deprecated. If you are using ggml models and configuring your model with a YAML file, use the llama-ggml backend instead. If you are relying on automatic detection of the model, you should be fine. For gguf models, use the llama backend. The go backend is deprecated as well, but is still available as go-llama. The go backend still supports features not available in the mainline: speculative sampling and embeddings.
Features
The llama.cpp model supports the following features:
Setup
LocalAI supports llama.cpp models out of the box. You can use the llama.cpp model in the same way as any other model.
Manual setup
It is sufficient to copy the ggml or gguf model files into the models folder. You can then refer to the model with the model parameter in the API calls.
You can optionally create an associated YAML model config file to tune the model’s parameters or apply a template to the prompt.
Prompt templates are useful for models that are fine-tuned towards a specific prompt.
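For instance, a config sketch that tunes a couple of parameters and applies a chat template might look like the following (model and template names are hypothetical; the template refers to a separate .tmpl file in the models folder):
name: wizardlm
parameters:
  # Relative to the models path
  model: wizardlm-13b.Q4_K_M.gguf
  temperature: 0.2
  top_p: 0.8
template:
  # Refers to a wizardlm-chat.tmpl file in the models folder
  chat: wizardlm-chat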
Automatic setup
LocalAI supports model galleries which are indexes of models. For instance, the huggingface gallery contains a large curated index of models from the huggingface model hub for ggml or gguf models.
If you have the galleries enabled, you can start chatting with models from huggingface by running:
curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
"model": "TheBloke/WizardLM-13B-V1.2-GGML/wizardlm-13b-v1.2.ggmlv3.q2_K.bin",
"messages": [{"role": "user", "content": "Say this is a test!"}],
"temperature": 0.1
}'
LocalAI will automatically download and configure the model in the model directory.
Models can be also preloaded or downloaded on demand. To learn about model galleries, check out the model gallery documentation.
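For example, preloading a gallery model on demand can be done with the /models/apply endpoint; here is a sketch reusing the model id from the request above (it assumes the galleries are enabled):
curl http://localhost:8080/models/apply -H "Content-Type: application/json" -d '{
  "id": "huggingface@TheBloke/WizardLM-13B-V1.2-GGML/wizardlm-13b-v1.2.ggmlv3.q2_K.bin"
}'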
YAML configuration
To use the llama.cpp backend, specify llama as the backend in the YAML file:
name: llama
backend: llama
parameters:
  # Relative to the models path
  model: file.gguf.bin
In the example above we specify llama as the backend to restrict loading to gguf models only.
For instance, to use the llama-ggml backend for ggml models:
name: llama
backend: llama-ggml
parameters:
  # Relative to the models path
  model: file.ggml.bin
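Once a config file like the above is in place, the model can be referenced by its name in the API calls, for example:
curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
  "model": "llama",
  "messages": [{"role": "user", "content": "Say this is a test!"}],
  "temperature": 0.1
}'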
Reference
🦙 Exllama
Exllama is “a more memory-efficient rewrite of the HF transformers implementation of Llama for use with quantized weights”.
Prerequisites
This is an extra backend: it is already available in the container images and there is nothing to do for the setup.
If you are building LocalAI locally, you need to install exllama manually first.
Model setup
Download the model as a folder inside the model directory and create a YAML file specifying the exllama backend. For instance with the TheBloke/WizardLM-7B-uncensored-GPTQ model:
$ git lfs install
$ cd models && git clone https://huggingface.co/TheBloke/WizardLM-7B-uncensored-GPTQ
$ ls models/
.keep WizardLM-7B-uncensored-GPTQ/ exllama.yaml
$ cat models/exllama.yaml
name: exllama
parameters:
  model: WizardLM-7B-uncensored-GPTQ
backend: exllama
# ...
Test with:
curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
"model": "exllama",
"messages": [{"role": "user", "content": "How are you?"}],
"temperature": 0.1
}'
🦙 AutoGPTQ
AutoGPTQ is an easy-to-use LLM quantization package with user-friendly APIs, based on the GPTQ algorithm.
Prerequisites
This is an extra backend: it is already available in the container images and there is nothing to do for the setup.
If you are building LocalAI locally, you need to install AutoGPTQ manually.
Model setup
The models are automatically downloaded from huggingface the first time they are used, if not already present. It is possible to define models via a YAML config file, or just by querying the endpoint with the huggingface repository model name. For example, create a YAML config file in models/:
name: orca
backend: autogptq
model_base_name: "orca_mini_v2_13b-GPTQ-4bit-128g.no-act.order"
parameters:
  model: "TheBloke/orca_mini_v2_13b-GPTQ"
# ...
Test with:
curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
"model": "orca",
"messages": [{"role": "user", "content": "How are you?"}],
"temperature": 0.1
}'
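As mentioned above, it should also be possible to skip the YAML file and query the endpoint with the huggingface repository model name directly, for instance:
curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
  "model": "TheBloke/orca_mini_v2_13b-GPTQ",
  "messages": [{"role": "user", "content": "How are you?"}],
  "temperature": 0.1
}'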
🐶 Bark
Bark allows you to generate audio from text prompts.
Setup
This is an extra backend: it is already available in the container and there is nothing to do for the setup.
Model setup
There is nothing to be done for the model setup. You can already start to use bark. The models will be downloaded the first time you use the backend.
Usage
Use the tts endpoint by specifying the bark backend:
curl http://localhost:8080/tts -H "Content-Type: application/json" -d '{
"backend": "bark",
"input":"Hello!"
}' | aplay
To specify a voice from https://github.com/suno-ai/bark#-voice-presets ( https://suno-ai.notion.site/8b8e8749ed514b0cbf3f699013548683?v=bc67cff786b04b50b3ceb756fd05f68c ), use the model parameter:
curl http://localhost:8080/tts -H "Content-Type: application/json" -d '{
"backend": "bark",
"input":"Hello!",
"model": "v2/en_speaker_4"
}' | aplay
Vall-E-X
VALL-E-X is an open source implementation of Microsoft’s VALL-E X zero-shot TTS model.
Setup
The backend will automatically download the required files in order to run the model.
This is an extra backend: it is already available in the container and there is nothing to do for the setup. If you are building manually, you need to install Vall-E-X manually first.
Usage
Use the tts endpoint by specifying the vall-e-x backend:
curl http://localhost:8080/tts -H "Content-Type: application/json" -d '{
"backend": "vall-e-x",
"input":"Hello!"
}' | aplay
Voice cloning
In order to use the voice cloning capabilities, you must create a YAML configuration file to set up a model:
name: cloned-voice
backend: vall-e-x
parameters:
  model: "cloned-voice"
vall-e:
  # The path to the audio file to be cloned
  # relative to the models directory
  audio_path: "path-to-wav-source.wav"
Then you can specify the model name in the requests:
curl http://localhost:8080/tts -H "Content-Type: application/json" -d '{
"backend": "vall-e-x",
"model": "cloned-voice",
"input":"Hello!"
}' | aplay
vLLM
vLLM is a fast and easy-to-use library for LLM inference.
LocalAI has a built-in integration with vLLM, and it can be used to run models. You can check out vllm performance here.
Setup
Create a YAML file for the model you want to use with vllm.
To set up a model, you just need to specify the model name in the YAML config file:
name: vllm
backend: vllm
parameters:
  model: "facebook/opt-125m"
# Uncomment to specify a quantization method (optional)
# quantization: "awq"
The backend will automatically download the required files in order to run the model.
Usage
Use the completions endpoint by specifying the vllm backend:
curl http://localhost:8080/v1/completions -H "Content-Type: application/json" -d '{
"model": "vllm",
"prompt": "Hello, my name is",
"temperature": 0.1, "top_p": 0.1
}'
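The chat endpoint should work the same way, assuming the model can handle plain chat prompts (instruction-tuned models may additionally need a prompt template):
curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
  "model": "vllm",
  "messages": [{"role": "user", "content": "Hello, how are you?"}],
  "temperature": 0.1
}'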
🧨 Diffusers
Diffusers is the go-to library for state-of-the-art pretrained diffusion models for generating images, audio, and even 3D structures of molecules. LocalAI has a diffusers backend which allows image generation using the diffusers library.
(Generated with AnimagineXL)
Note: currently only image generation is supported. It is experimental, so you might encounter issues with models which weren’t tested yet.
Setup
This is an extra backend: it is already available in the container and there is nothing to do for the setup.
Model setup
The models will be downloaded automatically from huggingface the first time you use the backend.
Create a model configuration file in the models directory, for instance to use Linaqruf/animagine-xl with CPU:
name: animagine-xl
parameters:
  model: Linaqruf/animagine-xl
backend: diffusers
cuda: true
f16: true
diffusers:
  scheduler_type: euler_a
Local models
You can also use local models, or modify some parameters like clip_skip, scheduler_type, for instance:
name: stablediffusion
parameters:
  model: toonyou_beta6.safetensors
backend: diffusers
step: 30
f16: true
cuda: true
diffusers:
  pipeline_type: StableDiffusionPipeline
  enable_parameters: "negative_prompt,num_inference_steps,clip_skip"
  scheduler_type: "k_dpmpp_sde"
  cfg_scale: 8
  clip_skip: 11
Configuration parameters
The following parameters are available in the configuration file:
| Parameter | Description | Default |
|---|---|---|
| f16 | Force the usage of float16 instead of float32 | false |
| step | Number of steps to run the model for | 30 |
| cuda | Enable CUDA acceleration | false |
| enable_parameters | Parameters to enable for the model | negative_prompt,num_inference_steps,clip_skip |
| scheduler_type | Scheduler type | k_dpp_sde |
| cfg_scale | Configuration scale | 8 |
| clip_skip | Clip skip | None |
| pipeline_type | Pipeline type | AutoPipelineForText2Image |
Several scheduler types are available:
| Scheduler | Description |
|---|---|
| ddim | DDIM |
| pndm | PNDM |
| heun | Heun |
| unipc | UniPC |
| euler | Euler |
| euler_a | Euler a |
| lms | LMS |
| k_lms | LMS Karras |
| dpm_2 | DPM2 |
| k_dpm_2 | DPM2 Karras |
| dpm_2_a | DPM2 a |
| k_dpm_2_a | DPM2 a Karras |
| dpmpp_2m | DPM++ 2M |
| k_dpmpp_2m | DPM++ 2M Karras |
| dpmpp_sde | DPM++ SDE |
| k_dpmpp_sde | DPM++ SDE Karras |
| dpmpp_2m_sde | DPM++ 2M SDE |
| k_dpmpp_2m_sde | DPM++ 2M SDE Karras |
Available pipeline types:
| Pipeline type | Description |
|---|---|
| StableDiffusionPipeline | Stable diffusion pipeline |
| StableDiffusionImg2ImgPipeline | Stable diffusion image to image pipeline |
| StableDiffusionDepth2ImgPipeline | Stable diffusion depth to image pipeline |
| DiffusionPipeline | Diffusion pipeline |
| StableDiffusionXLPipeline | Stable diffusion XL pipeline |
Usage
Text to Image
Use the image generation endpoint with the model name from the configuration file:
curl http://localhost:8080/v1/images/generations \
-H "Content-Type: application/json" \
-d '{
"prompt": "<positive prompt>|<negative prompt>",
"model": "animagine-xl",
"step": 51,
"size": "1024x1024"
}'
Image to Image
https://huggingface.co/docs/diffusers/using-diffusers/img2img
An example model (GPU):
name: stablediffusion-edit
parameters:
  model: nitrosocke/Ghibli-Diffusion
backend: diffusers
step: 25
cuda: true
f16: true
diffusers:
  pipeline_type: StableDiffusionImg2ImgPipeline
  enable_parameters: "negative_prompt,num_inference_steps,image"
IMAGE_PATH=/path/to/your/image
(echo -n '{"file": "'; base64 $IMAGE_PATH; echo '", "prompt": "a sky background","size": "512x512","model":"stablediffusion-edit"}') |
curl -H "Content-Type: application/json" -d @- http://localhost:8080/v1/images/generations
Depth to Image
https://huggingface.co/docs/diffusers/using-diffusers/depth2img
name: stablediffusion-depth
parameters:
  model: stabilityai/stable-diffusion-2-depth
backend: diffusers
step: 50
# Enable float16 and CUDA (GPU) acceleration
f16: true
cuda: true
diffusers:
  pipeline_type: StableDiffusionDepth2ImgPipeline
  enable_parameters: "negative_prompt,num_inference_steps,image"
  cfg_scale: 6
(echo -n '{"file": "'; base64 ~/path/to/image.jpeg; echo '", "prompt": "a sky background","size": "512x512","model":"stablediffusion-depth"}') |
curl -H "Content-Type: application/json" -d @- http://localhost:8080/v1/images/generations
img2vid
name: img2vid
parameters:
  model: stabilityai/stable-video-diffusion-img2vid
backend: diffusers
step: 25
# Enable float16 and CUDA (GPU) acceleration
f16: true
cuda: true
diffusers:
  pipeline_type: StableVideoDiffusionPipeline
(echo -n '{"file": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/svd/rocket.png?download=true","size": "512x512","model":"img2vid"}') |
curl -H "Content-Type: application/json" -X POST -d @- http://localhost:8080/v1/images/generations
txt2vid
name: txt2vid
parameters:
  model: damo-vilab/text-to-video-ms-1.7b
backend: diffusers
step: 25
# Enable float16 and CUDA (GPU) acceleration
f16: true
cuda: true
diffusers:
  pipeline_type: VideoDiffusionPipeline
  cuda: true
(echo -n '{"prompt": "spiderman surfing","size": "512x512","model":"txt2vid"}') |
curl -H "Content-Type: application/json" -X POST -d @- http://localhost:8080/v1/images/generations