What Are Local Models?
Local models are optional but powerful AI capabilities built into PiecesOS that allow you to run Large Language Models (LLMs) directly on your device instead of relying on cloud-based AI processing.
Unlike cloud models—which require an internet connection—local models run entirely on your device, providing complete privacy and offline functionality.
Why Use Local Models?
PiecesOS provides local AI processing directly through its built-in infrastructure, making it faster, more stable, and easier to use.
Local models can be downloaded on demand, and PiecesOS keeps them compatible with its supported LLMs, so new local models can be integrated soon after they are released.
Additionally, many developers and organizations prefer local LLMs over cloud-hosted models for reasons such as:
- Stronger data security, since proprietary code and sensitive queries stay 100% local.
- Faster response times, since there are no network delays during generative AI processing and local inference.
- Offline accessibility, so you can keep working even without an internet connection.
- Enterprise compliance, since all AI queries remain within company-managed environments.
How It Works
Local models are integrated with PiecesOS to enable local model inference and generative AI capabilities.
Here's how local models work with PiecesOS:
- Serve on-device LLMs, reducing cloud dependency and enhancing privacy.
- Download on-demand, so you only install the models you need for your workflow (sketched below).
- Support a curated set of models, all optimized for performance with efficient quantization.
- Ensure compatibility with PiecesOS through automatic version management.
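PiecesOS handles the download and version management for you, but the underlying on-demand flow is easy to picture. The sketch below uses the standalone `ollama` Python client against a local Ollama server purely as an illustration (the system requirements below reference Ollama as the local runtime); the model tag is just an example, not a statement about which models Pieces ships.

```python
# Conceptual sketch of on-demand local model download and inference.
# PiecesOS manages this flow internally; this example uses the standalone
# `ollama` Python client (pip install ollama) against a local Ollama server
# only to illustrate the idea. The model tag below is an example.
import ollama

MODEL = "llama3.1:8b"  # example ~8B model; roughly 6-8 GB of VRAM at Q4_K_M

# Download the model on demand (only missing layers are fetched).
ollama.pull(MODEL)

# Run inference entirely on-device; no data leaves the machine.
response = ollama.chat(
    model=MODEL,
    messages=[{"role": "user", "content": "Summarize what a KV cache is."}],
)
print(response["message"]["content"])
```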
Using Local vs Cloud Models
PiecesOS supports both cloud-based and local AI models for Conversational Search.
Users who prefer on-device AI for speed, privacy, or offline access can download local models directly through PiecesOS.
Supported cloud providers and example models include:
- OpenAI: GPT-5.2 Pro, GPT-5.2, GPT-5.1, GPT-5 Thinking, GPT-5, GPT-5 Fast, o4 Mini, o3 Pro, o3 Mini, o3, o1, GPT-4.1, GPT-4o, GPT-4o Mini
- Anthropic: Claude 4.5 Opus, Claude 4.5 Sonnet, Claude 4.5 Haiku, Claude 4 Sonnet, Claude 3.7 Sonnet, Claude 3.5 Sonnet, Claude 3.5 Haiku
- Google: Gemini 3 Pro Preview, Gemini 3 Flash Preview, Gemini 2.5 Pro, Gemini 2.5 Flash, Gemini 2.5 Flash Lite, Gemini 2 Flash Lite
See the full Cloud Models list →
| Feature | Cloud AI (Default) | Local AI |
|---|---|---|
| Processing Location | Cloud-based (requires internet) | On-device (runs locally) |
| Performance | Dependent on internet speed | Potentially faster response times (no network latency) |
| Data Privacy | Data only sent to cloud if included as context in a Conversational Search chat—governed by provider privacy policies | 100% local (no data transmission from local device) |
| Model Availability | Uses several cloud-hosted models | Download models on-demand through PiecesOS |
| Storage Requirements | Minimal outside of the PiecesOS installation | Several GBs (model storage) |
| Offline Support | No | Yes |
Required Specifications
Local models use more system resources than cloud-based AI. To run local LLMs smoothly, your device should meet the following minimum specifications. These guidelines are based on the Ollama documentation and publicly available community resources such as LocalLLM.in's VRAM requirements guide.
Minimum System Requirements
| Component | Minimum |
|---|---|
| Operating System | macOS 11.0 (Big Sur) or later, Windows 10 or later, or Ubuntu 18.04 or later |
| RAM | 8GB for 3B models; 16GB for 7B models; 32GB for 13B models |
| CPU | Modern CPU with at least 4 cores (8 cores recommended for 13B models) |
| GPU (optional) | 6GB+ VRAM recommended for faster inference |
| Storage | At least 12GB free for Ollama and base models; more for larger models |
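As a quick sanity check of a machine against the table above, the following sketch reads the host's RAM, physical core count, and free disk space and compares them to the minimum figures listed. It assumes the third-party `psutil` package is installed; the thresholds are copied from the table and are guidelines rather than hard limits.

```python
# Compare this machine against the minimum specs above.
# Requires the third-party psutil package (pip install psutil).
import shutil
import psutil

GIB = 1024 ** 3

# RAM needed per rough model-size tier (from the RAM row above).
RAM_TIERS_GB = {"3B": 8, "7B": 16, "13B": 32}

ram_gb = psutil.virtual_memory().total / GIB
cores = psutil.cpu_count(logical=False) or psutil.cpu_count()
free_gb = shutil.disk_usage(".").free / GIB

print(f"RAM: {ram_gb:.1f} GiB, physical cores: {cores}, free disk: {free_gb:.1f} GiB")

for size, needed in RAM_TIERS_GB.items():
    status = "OK" if ram_gb >= needed else "insufficient"
    print(f"  {size} models: need {needed} GB RAM -> {status}")

if cores < 4:
    print("  CPU: fewer than 4 physical cores; expect slow inference")
if free_gb < 12:
    print("  Disk: less than 12 GB free; not enough for Ollama plus base models")
```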
VRAM Guidelines by Model Size
If you use a dedicated GPU, the amount of VRAM you need depends on the model size and quantization. The following are general guidelines for running models at Q4_K_M quantization:
| VRAM | Typical model size |
|---|---|
| 3–4 GB | 3–4B parameter models (at a modest context length, e.g., 4k) |
| 6–8 GB | 7–9B models (e.g., Llama 3.1 8B, Qwen3 8B) |
| 10–12 GB | 12–14B models (e.g., Gemma 3 12B, Qwen3 14B) |
| 16–24 GB | 22–35B models (e.g., Gemma 3 27B, Qwen3 32B) |
| 48 GB+ | 70B+ models (e.g., Llama 3.3 70B, Qwen2.5 72B) |
Total VRAM usage includes model weights, KV cache (which grows with context length), and system overhead. For more precise requirements for a specific model, refer to LocalLLM.in's Ollama VRAM requirements guide.
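For a rough feel for where those numbers come from, a back-of-envelope estimate is: model weights (about 4.5 to 5 bits per parameter at Q4_K_M) plus the KV cache (which scales with layer count, KV heads, head dimension, and context length) plus a fixed allowance for runtime overhead. The sketch below encodes that arithmetic; the constants are approximations, and the Llama 3.1 8B figures in the example (32 layers, 8 KV heads, head dimension 128) are commonly cited values for that architecture rather than figures taken from this page.

```python
# Back-of-envelope VRAM estimate: weights + KV cache + runtime overhead.
# Constants are rough assumptions (Q4_K_M is ~4.85 bits per weight; KV cache
# stored in fp16); real usage depends on the exact architecture and runtime.

def estimate_vram_gib(
    params_b: float,        # parameter count in billions
    n_layers: int,
    n_kv_heads: int,        # KV heads (GQA models have fewer than attention heads)
    head_dim: int,
    context_len: int,
    bits_per_weight: float = 4.85,   # approx. for Q4_K_M
    kv_bytes_per_elem: int = 2,      # fp16 KV cache
    overhead_gib: float = 1.0,       # runtime buffers, rough allowance
) -> float:
    gib = 1024 ** 3
    weights = params_b * 1e9 * bits_per_weight / 8 / gib
    # KV cache per token = 2 (K and V) * layers * kv_heads * head_dim * bytes
    kv_cache = 2 * n_layers * n_kv_heads * head_dim * kv_bytes_per_elem * context_len / gib
    return weights + kv_cache + overhead_gib

# Example: Llama 3.1 8B (32 layers, 8 KV heads, head_dim 128) at an 8k context.
print(f"{estimate_vram_gib(8.0, 32, 8, 128, 8192):.1f} GiB")  # ~6.5 GiB, consistent with the 6-8 GB row
```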