Pure Rust LLM inference engine with Soul learning and hierarchical memory.
Status: v1.0.0 (development complete). The project is fully functional but is no longer under active development.
A local LLM inference engine written entirely in Rust. It runs GGUF and safetensors models on your PC, with a unique Soul system that lets the AI learn and remember across conversations.
Key features:

- Pure-Rust inference for GGUF and safetensors models (CPU and CUDA)
- Soul system: the AI learns and remembers across conversations
- Hierarchical memory with LoRA-based Soul promotion
- OpenAI-compatible HTTP server
- Desktop GUI (Tauri 2.0 + Svelte 5)

Install:

```bash
# Homebrew (macOS / Linux)
brew tap imonoonoko/bitllama && brew install bitllama

# Windows (winget)
winget install imonoonoko.BitLlama

# Or download a binary from GitHub Releases
```
Quick start:

```bash
# Download a model from Hugging Face
bitllama pull bartowski/gemma-2-2b-it-GGUF

# Run it
bitllama run ~/.bitllama/models/gemma-2-2b-it-Q4_K_M.gguf

# Teach it something and save a Soul
bitllama learn "My name is Onoko" --model model.gguf --save onoko.soul

# Run with that Soul loaded
bitllama run model.gguf --soul onoko.soul

# Serve an OpenAI-compatible API (POST /v1/chat/completions)
bitllama serve model.gguf --port 8000
```
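Once the server is running, any OpenAI-style client can talk to it. A minimal sketch with curl, assuming the request follows the standard OpenAI chat-completions schema (the model name and prompt are placeholders, not values from this project):

```bash
# Assumes `bitllama serve model.gguf --port 8000` is running locally;
# "model.gguf" and the prompt are illustrative placeholders.
BODY='{"model": "model.gguf", "messages": [{"role": "user", "content": "Hello!"}]}'
curl -s -X POST http://localhost:8000/v1/chat/completions \
     -H "Content-Type: application/json" \
     -d "$BODY" || echo "(server not running on port 8000)"
```

The same endpoint should work with any OpenAI SDK pointed at `http://localhost:8000/v1`.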
BitLlama Desktop is a GUI built with Tauri 2.0 and Svelte 5.

```bash
# Install
winget install imonoonoko.BitLlamaDesktop

# Or build from source
cd bitllama-desktop && npm install && npx tauri build
```
| Model | Format | Chat Template |
|---|---|---|
| Llama-2 7B/13B | GGUF | llama2 |
| Llama-3 8B | GGUF | llama3 |
| Gemma-2 2B/9B | GGUF | gemma |
| Gemma-3 | GGUF | gemma |
| Qwen2.5 0.5B-7B | GGUF | chatml |
| Mistral 7B | GGUF | mistral |
| BitNet 2B4T | safetensors | bitnet |
GGUF quantizations: Q4_K_M, Q6_K, Q8_0, F16.
Benchmarks on an RTX 4060 Ti (8 GB) at Q4_K_M:
| Model | Speed | vs llama.cpp |
|---|---|---|
| Llama-2 7B | 45.4 tok/s | 90% |
| Mistral 7B | 42.1 tok/s | 89% |
| Gemma-2 2B | 75.1 tok/s | 74% |
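As a rough sanity check on why a 7B model at Q4_K_M fits an 8 GB card, Q4_K_M averages about 4.8 bits per weight (a general figure for this quantization family, not a number measured from BitLlama):

```bash
# Approximate weight memory for 7e9 parameters at ~4.8 bits/weight
python3 -c "print(f'{7e9 * 4.8 / 8 / 1024**3:.1f} GiB')"
# → 3.9 GiB
```

That leaves several gigabytes free for the KV cache and activations alongside the weights.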
```
Bit-TTT-Engine/
├── crates/
│   ├── bit_llama/       # CLI application
│   ├── rust_engine/     # Core inference engine (GGUF, CUDA, LoRA, KV cache)
│   └── bit_converter/   # Model conversion utilities
├── bitllama-desktop/    # Desktop GUI (Tauri 2.0 + Svelte 5)
└── docs/                # Documentation
```
```
Conversations → Episodes (L0) → Sleep → Facts (L1) → Concepts (L2) → Worldview (L3)
                                                          ↓
                            Soul Promotion (LoRA fine-tuning from stable patterns)
```
Building from source:

```bash
# CLI
cargo build --release -p bit_llama

# With CUDA
cargo build --release -p bit_llama --features cuda

# Desktop
cd bitllama-desktop && npm install && npx tauri build

# Tests
cargo test --no-default-features --lib
```
Requirements: Rust 1.75+, CUDA 12.x (optional)
This project was developed solo over three weeks (Jan-Feb 2026).
MIT License — see LICENSE.
Built with Rust by @imonoonoko