Model compatibility
LocalAI is compatible with the models supported by llama.cpp, and also supports GPT4ALL-J and cerebras-GPT with ggml.
Note
LocalAI will attempt to automatically load models which are not explicitly configured for a specific backend. You can specify the backend to use by configuring a model with a YAML file. See the advanced section for more details.
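For example, a minimal model config file might look like the following sketch (the file and model names are illustrative; gpt4all-j is one of the backends listed in the table below):
name: gpt4all-j
# Explicitly select the backend instead of relying on automatic detection
backend: gpt4all-j
parameters:
  # Model file, relative to the models path
  model: ggml-gpt4all-j.bin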
Hardware requirements
Depending on the model you are attempting to run, you might need more RAM or CPU resources. Check out also here for gguf-based backends. rwkv is less expensive on resources.
Model compatibility table
Besides llama-based models, LocalAI is also compatible with other architectures. The table below lists all the compatible model families and the associated binding repository.
| Backend and Bindings | Compatible models | Completion/Chat endpoint | Capability | Embeddings support | Token stream support | Acceleration |
|---|---|---|---|---|---|---|
| llama.cpp | Vicuna, Alpaca, LLaMa | yes | GPT and Functions | yes** | yes | CUDA, openCL, cuBLAS, Metal |
| gpt4all-llama | Vicuna, Alpaca, LLaMa | yes | GPT | no | yes | N/A |
| gpt4all-mpt | MPT | yes | GPT | no | yes | N/A |
| gpt4all-j | GPT4ALL-J | yes | GPT | no | yes | N/A |
| falcon-ggml (binding) | Falcon (*) | yes | GPT | no | no | N/A |
| gpt2 (binding) | GPT2, Cerebras | yes | GPT | no | no | N/A |
| dolly (binding) | Dolly | yes | GPT | no | no | N/A |
| gptj (binding) | GPTJ | yes | GPT | no | no | N/A |
| mpt (binding) | MPT | yes | GPT | no | no | N/A |
| replit (binding) | Replit | yes | GPT | no | no | N/A |
| gptneox (binding) | GPT NeoX, RedPajama, StableLM | yes | GPT | no | no | N/A |
| starcoder (binding) | Starcoder | yes | GPT | no | no | N/A |
| bloomz (binding) | Bloom | yes | GPT | no | no | N/A |
| rwkv (binding) | rwkv | yes | GPT | no | yes | N/A |
| bert (binding) | bert | no | Embeddings only | yes | no | N/A |
| whisper | whisper | no | Audio | no | no | N/A |
| stablediffusion (binding) | stablediffusion | no | Image | no | no | N/A |
| langchain-huggingface | Any text generators available on HuggingFace through API | yes | GPT | no | no | N/A |
| piper (binding) | Any piper onnx model | no | Text to voice | no | no | N/A |
| falcon (binding) | Falcon *** | yes | GPT | no | yes | CUDA |
| sentencetransformers | BERT | no | Embeddings only | yes | no | N/A |
| bark | bark | no | Audio generation | no | no | yes |
| autogptq | GPTQ | yes | GPT | yes | no | N/A |
| exllama | GPTQ | yes | GPT only | no | no | N/A |
| diffusers | SD,… | no | Image generation | no | no | N/A |
| vall-e-x | Vall-E | no | Audio generation and Voice cloning | no | no | CPU/CUDA |
| vllm | Various GPTs and quantization formats | yes | GPT | no | no | CPU/CUDA |
| exllama2 | GPTQ | yes | GPT only | no | no | N/A |
| transformers-musicgen |  | no | Audio generation | no | no | N/A |
| tinydream | stablediffusion | no | Image | no | no | N/A |
| coqui | Coqui | no | Audio generation and Voice cloning | no | no | CPU/CUDA |
| petals | Various GPTs and quantization formats | yes | GPT | no | no | CPU/CUDA |
Note: any backend name listed above can be used in the backend field of the model configuration file (See the advanced section).
Tested with:
Note: You might need to convert some models from older formats to the new format. For indications, see the README in llama.cpp, for instance to run gpt4all.
Subsections of Model compatibility
RWKV
A full example on how to run a rwkv model is in the examples.
Note: rwkv models need the rwkv backend specified in the YAML config file, and an associated tokenizer must be provided alongside the model file:
36464540 -rw-r--r-- 1 mudler mudler 1.2G May 3 10:51 rwkv_small
36464543 -rw-r--r-- 1 mudler mudler 2.4M May 3 10:51 rwkv_small.tokenizer.json
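As a minimal sketch, a YAML config for the files above might look like this (the name is illustrative; the rwkv_small.tokenizer.json file is expected to sit next to the model file as shown above):
name: rwkv_small
backend: rwkv
parameters:
  # Relative to the models path
  model: rwkv_small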
🦙 llama.cpp
llama.cpp is a popular port of Facebook’s LLaMA model in C/C++.
Note
The ggml file format has been deprecated. If you are using ggml models and configuring your model with a YAML file, use the llama-ggml backend instead. If you are relying on automatic detection of the model, you should be fine. For gguf models, use the llama backend. The go backend is deprecated as well, but is still available as go-llama. The go backend still supports features not available in the mainline: speculative sampling and embeddings.
Features
The llama.cpp model supports the following features:
Setup
LocalAI supports llama.cpp models out of the box. You can use the llama.cpp model in the same way as any other model.
Manual setup
It is sufficient to copy the ggml or gguf model files into the models folder. You can then refer to the model with the model parameter in the API calls.
You can optionally create an associated YAML model config file to tune the model’s parameters or apply a template to the prompt.
Prompt templates are useful for models that are fine-tuned towards a specific prompt.
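For instance, a config sketch that tunes a couple of parameters and applies a chat template might look like the following (model and template names are hypothetical; the template refers to a separate .tmpl file in the models folder):
name: wizardlm
parameters:
  # Relative to the models path
  model: wizardlm-13b.Q4_K_M.gguf
  temperature: 0.2
  top_p: 0.8
template:
  # Refers to a wizardlm-chat.tmpl file in the models folder
  chat: wizardlm-chat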
Automatic setup
LocalAI supports model galleries which are indexes of models. For instance, the huggingface gallery contains a large curated index of models from the huggingface model hub for ggml or gguf models.
If you have the galleries enabled, you can start chatting with models from huggingface by running:
curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
"model": "TheBloke/WizardLM-13B-V1.2-GGML/wizardlm-13b-v1.2.ggmlv3.q2_K.bin",
"messages": [{"role": "user", "content": "Say this is a test!"}],
"temperature": 0.1
}'
LocalAI will automatically download and configure the model in the model directory.
Models can be also preloaded or downloaded on demand. To learn about model galleries, check out the model gallery documentation.
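For example, preloading a gallery model on demand can be done with the /models/apply endpoint; here is a sketch reusing the model id from the request above (it assumes the galleries are enabled):
curl http://localhost:8080/models/apply -H "Content-Type: application/json" -d '{
  "id": "huggingface@TheBloke/WizardLM-13B-V1.2-GGML/wizardlm-13b-v1.2.ggmlv3.q2_K.bin"
}'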
YAML configuration
To use the llama.cpp backend, specify llama as the backend in the YAML file:
name: llama
backend: llama
parameters:
  # Relative to the models path
  model: file.gguf.bin
In the example above we specify llama as the backend to restrict loading to gguf models only.
For instance, to use the llama-ggml backend for ggml models:
name: llama
backend: llama-ggml
parameters:
  # Relative to the models path
  model: file.ggml.bin
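Once a config file like the above is in place, the model can be referenced by its name in the API calls, for example:
curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
  "model": "llama",
  "messages": [{"role": "user", "content": "Say this is a test!"}],
  "temperature": 0.1
}'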
Reference
🦙 Exllama
Exllama is “a more memory-efficient rewrite of the HF transformers implementation of Llama for use with quantized weights”.
Prerequisites
This is an extra backend: it is already available in the container images and there is nothing to do for the setup.
If you are building LocalAI locally, you need to install exllama manually first.
Model setup
Download the model as a folder inside the model directory and create a YAML file specifying the exllama backend. For instance with the TheBloke/WizardLM-7B-uncensored-GPTQ model:
$ git lfs install
$ cd models && git clone https://huggingface.co/TheBloke/WizardLM-7B-uncensored-GPTQ
$ ls models/
.keep WizardLM-7B-uncensored-GPTQ/ exllama.yaml
$ cat models/exllama.yaml
name: exllama
parameters:
  model: WizardLM-7B-uncensored-GPTQ
backend: exllama
# ...
Test with:
curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
"model": "exllama",
"messages": [{"role": "user", "content": "How are you?"}],
"temperature": 0.1
}'
🦙 AutoGPTQ
AutoGPTQ is an easy-to-use LLM quantization package with user-friendly APIs, based on the GPTQ algorithm.
Prerequisites
This is an extra backend: it is already available in the container images and there is nothing to do for the setup.
If you are building LocalAI locally, you need to install AutoGPTQ manually.
Model setup
The models are automatically downloaded from huggingface the first time they are used, if not already present. It is possible to define models via a YAML config file, or just by querying the endpoint with the huggingface repository model name. For example, create a YAML config file in models/:
name: orca
backend: autogptq
model_base_name: "orca_mini_v2_13b-GPTQ-4bit-128g.no-act.order"
parameters:
  model: "TheBloke/orca_mini_v2_13b-GPTQ"
# ...
Test with:
curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
"model": "orca",
"messages": [{"role": "user", "content": "How are you?"}],
"temperature": 0.1
}'
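As mentioned above, it should also be possible to skip the YAML file and query the endpoint with the huggingface repository model name directly, for instance:
curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
  "model": "TheBloke/orca_mini_v2_13b-GPTQ",
  "messages": [{"role": "user", "content": "How are you?"}],
  "temperature": 0.1
}'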
🐶 Bark
Bark allows you to generate audio from text prompts.
Setup
This is an extra backend: it is already available in the container and there is nothing to do for the setup.
Model setup
There is nothing to be done for the model setup. You can already start to use bark. The models will be downloaded the first time you use the backend.
Usage
Use the tts endpoint by specifying the bark backend:
curl http://localhost:8080/tts -H "Content-Type: application/json" -d '{
"backend": "bark",
"input":"Hello!"
}' | aplay
To specify a voice from https://github.com/suno-ai/bark#-voice-presets ( https://suno-ai.notion.site/8b8e8749ed514b0cbf3f699013548683?v=bc67cff786b04b50b3ceb756fd05f68c ), use the model parameter:
curl http://localhost:8080/tts -H "Content-Type: application/json" -d '{
"backend": "bark",
"input":"Hello!",
"model": "v2/en_speaker_4"
}' | aplay
Vall-E-X
VALL-E-X is an open source implementation of Microsoft’s VALL-E X zero-shot TTS model.
Setup
The backend will automatically download the required files in order to run the model.
This is an extra backend: it is already available in the container and there is nothing to do for the setup. If you are building manually, you need to install Vall-E-X manually first.
Usage
Use the tts endpoint by specifying the vall-e-x backend:
curl http://localhost:8080/tts -H "Content-Type: application/json" -d '{
"backend": "vall-e-x",
"input":"Hello!"
}' | aplay
Voice cloning
In order to use the voice cloning capabilities, you must create a YAML configuration file to set up a model:
name: cloned-voice
backend: vall-e-x
parameters:
  model: "cloned-voice"
vall-e:
  # The path to the audio file to be cloned
  # relative to the models directory
  audio_path: "path-to-wav-source.wav"
Then you can specify the model name in the requests:
curl http://localhost:8080/tts -H "Content-Type: application/json" -d '{
"backend": "vall-e-x",
"model": "cloned-voice",
"input":"Hello!"
}' | aplay
vLLM
vLLM is a fast and easy-to-use library for LLM inference.
LocalAI has a built-in integration with vLLM, and it can be used to run models. You can check out vllm performance here.
Setup
Create a YAML file for the model you want to use with vllm.
To set up a model, you just need to specify the model name in the YAML config file:
name: vllm
backend: vllm
parameters:
  model: "facebook/opt-125m"
# Uncomment to specify a quantization method (optional)
# quantization: "awq"
The backend will automatically download the required files in order to run the model.
Usage
Use the completions endpoint by specifying the vllm backend:
curl http://localhost:8080/v1/completions -H "Content-Type: application/json" -d '{
"model": "vllm",
"prompt": "Hello, my name is",
"temperature": 0.1, "top_p": 0.1
}'
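The chat endpoint should work the same way, assuming the model can handle plain chat prompts (instruction-tuned models may additionally need a prompt template):
curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
  "model": "vllm",
  "messages": [{"role": "user", "content": "Hello, how are you?"}],
  "temperature": 0.1
}'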
🧨 Diffusers
Diffusers is the go-to library for state-of-the-art pretrained diffusion models for generating images, audio, and even 3D structures of molecules. LocalAI has a diffusers backend which allows image generation using the diffusers library.
(Generated with AnimagineXL)
Note: currently only image generation is supported. It is experimental, so you might encounter issues with models which weren’t tested yet.
Setup
This is an extra backend: it is already available in the container and there is nothing to do for the setup.
Model setup
The models will be downloaded automatically from huggingface the first time you use the backend.
Create a model configuration file in the models directory, for instance to use Linaqruf/animagine-xl with CPU:
name: animagine-xl
parameters:
  model: Linaqruf/animagine-xl
backend: diffusers
cuda: true
f16: true
diffusers:
  scheduler_type: euler_a
Local models
You can also use local models, or modify some parameters like clip_skip, scheduler_type, for instance:
name: stablediffusion
parameters:
  model: toonyou_beta6.safetensors
backend: diffusers
step: 30
f16: true
cuda: true
diffusers:
  pipeline_type: StableDiffusionPipeline
  enable_parameters: "negative_prompt,num_inference_steps,clip_skip"
  scheduler_type: "k_dpmpp_sde"
  cfg_scale: 8
  clip_skip: 11
Configuration parameters
The following parameters are available in the configuration file:
| Parameter | Description | Default |
|---|---|---|
| f16 | Force the usage of float16 instead of float32 | false |
| step | Number of steps to run the model for | 30 |
| cuda | Enable CUDA acceleration | false |
| enable_parameters | Parameters to enable for the model | negative_prompt,num_inference_steps,clip_skip |
| scheduler_type | Scheduler type | k_dpp_sde |
| cfg_scale | Configuration scale | 8 |
| clip_skip | Clip skip | None |
| pipeline_type | Pipeline type | AutoPipelineForText2Image |
Several scheduler types are available:
| Scheduler | Description |
|---|---|
| ddim | DDIM |
| pndm | PNDM |
| heun | Heun |
| unipc | UniPC |
| euler | Euler |
| euler_a | Euler a |
| lms | LMS |
| k_lms | LMS Karras |
| dpm_2 | DPM2 |
| k_dpm_2 | DPM2 Karras |
| dpm_2_a | DPM2 a |
| k_dpm_2_a | DPM2 a Karras |
| dpmpp_2m | DPM++ 2M |
| k_dpmpp_2m | DPM++ 2M Karras |
| dpmpp_sde | DPM++ SDE |
| k_dpmpp_sde | DPM++ SDE Karras |
| dpmpp_2m_sde | DPM++ 2M SDE |
| k_dpmpp_2m_sde | DPM++ 2M SDE Karras |
Available pipeline types:
| Pipeline type | Description |
|---|---|
| StableDiffusionPipeline | Stable diffusion pipeline |
| StableDiffusionImg2ImgPipeline | Stable diffusion image to image pipeline |
| StableDiffusionDepth2ImgPipeline | Stable diffusion depth to image pipeline |
| DiffusionPipeline | Diffusion pipeline |
| StableDiffusionXLPipeline | Stable diffusion XL pipeline |
Usage
Text to Image
Use the image generation endpoint with the model name from the configuration file:
curl http://localhost:8080/v1/images/generations \
-H "Content-Type: application/json" \
-d '{
"prompt": "<positive prompt>|<negative prompt>",
"model": "animagine-xl",
"step": 51,
"size": "1024x1024"
}'
Image to Image
https://huggingface.co/docs/diffusers/using-diffusers/img2img
An example model (GPU):
name: stablediffusion-edit
parameters:
  model: nitrosocke/Ghibli-Diffusion
backend: diffusers
step: 25
cuda: true
f16: true
diffusers:
  pipeline_type: StableDiffusionImg2ImgPipeline
  enable_parameters: "negative_prompt,num_inference_steps,image"
IMAGE_PATH=/path/to/your/image
(echo -n '{"file": "'; base64 $IMAGE_PATH; echo '", "prompt": "a sky background","size": "512x512","model":"stablediffusion-edit"}') |
curl -H "Content-Type: application/json" -d @- http://localhost:8080/v1/images/generations
Depth to Image
https://huggingface.co/docs/diffusers/using-diffusers/depth2img
name: stablediffusion-depth
parameters:
  model: stabilityai/stable-diffusion-2-depth
backend: diffusers
step: 50
# Enable float16 and CUDA (GPU) acceleration
f16: true
cuda: true
diffusers:
  pipeline_type: StableDiffusionDepth2ImgPipeline
  enable_parameters: "negative_prompt,num_inference_steps,image"
  cfg_scale: 6
(echo -n '{"file": "'; base64 ~/path/to/image.jpeg; echo '", "prompt": "a sky background","size": "512x512","model":"stablediffusion-depth"}') |
curl -H "Content-Type: application/json" -d @- http://localhost:8080/v1/images/generations
img2vid
name: img2vid
parameters:
  model: stabilityai/stable-video-diffusion-img2vid
backend: diffusers
step: 25
# Enable float16 and CUDA (GPU) acceleration
f16: true
cuda: true
diffusers:
  pipeline_type: StableVideoDiffusionPipeline
(echo -n '{"file": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/svd/rocket.png?download=true","size": "512x512","model":"img2vid"}') |
curl -H "Content-Type: application/json" -X POST -d @- http://localhost:8080/v1/images/generations
txt2vid
name: txt2vid
parameters:
  model: damo-vilab/text-to-video-ms-1.7b
backend: diffusers
step: 25
# Enable float16 and CUDA (GPU) acceleration
f16: true
cuda: true
diffusers:
  pipeline_type: VideoDiffusionPipeline
  cuda: true
(echo -n '{"prompt": "spiderman surfing","size": "512x512","model":"txt2vid"}') |
curl -H "Content-Type: application/json" -X POST -d @- http://localhost:8080/v1/images/generations