Ifty's Blog: Thoughts & Ideas
20 Fresh Uncensored LLMs You Can Run Locally — March 2026 Roundup
Tags: LLM, Uncensored AI, Local AI, Open Source, GGUF, Abliteration

A complete roundup of 20 new uncensored LLMs released in March 2026 that you can run locally. Includes GLM-4.7 Flash, Dark Champion, Qwen3 MoE, and more — with hardware guides, use cases, and a FAQ.

Ifthe Kharul Islam · 14 min read

The world of open, unfiltered AI models just had one of its busiest months ever. Between late February and early March 2026, the community shipped a wave of 20 brand-new uncensored large language models (LLMs) — all designed to run on your own machine, no cloud required.

While the big AI labs keep tightening their safety filters, a growing group of independent builders is doing the opposite: unlocking these models so you can run them privately, locally, and without restrictions. Here's everything you need to know.

Abliteration vs Fine-Tuning — What's the Difference?

Before we jump into the list, let's clear up the two main ways these models are made:

Abliteration (the "surgical" method): Think of this like removing the safety switch from the model's brain while leaving the rest alone. The technique finds the internal direction associated with refusals and edits it out of the weights, with no retraining. The model keeps nearly all of its smarts and simply stops saying "I can't help with that." The Heretic tool is the go-to for this approach.

Uncensored Fine-Tuning (the "retraining" method): This works by feeding the model thousands of examples where it learns to answer freely. It works well, but if the training data isn't top-quality, the model can lose a bit of its edge.

Most of the March 2026 releases lean on abliteration — it's the cleaner, smarter way to get the job done.

The 20 New Releases at a Glance

#01 — GLM-4.7-Flash-Grande-Heretic-UNCENSORED (42B MoE)

The heavyweight of the month. 42 billion total parameters, but only 3 billion activate per token, so it runs at the speed of a much smaller model on mid-range GPUs while delivering top-tier reasoning. 200K context window, fantastic for complex coding and agent-like workflows. Needs about 16GB of VRAM at Q4.

  • Base Model: GLM-4.7-Flash (Zhipu AI)
  • Method: Abliteration (Heretic)
  • VRAM: 16GB (Q4) / 28GB (Q8)
  • Format: GGUF (Imatrix)
  • Best for: Agentic Coding, Long Context
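
To make the "3 billion active out of 42 billion" idea concrete, here is a minimal sketch of top-k expert routing, the general mechanism behind sparse MoE layers. The expert count, the gate scores, and k=2 are illustrative assumptions; the listing doesn't state GLM-4.7-Flash's actual router configuration.

```python
import math

def top_k_route(gate_scores, k):
    """Pick the k highest-scoring experts and softmax-normalize their scores."""
    ranked = sorted(range(len(gate_scores)), key=lambda i: gate_scores[i], reverse=True)
    chosen = ranked[:k]
    peak = max(gate_scores[i] for i in chosen)  # subtract the max for numerical stability
    exps = {i: math.exp(gate_scores[i] - peak) for i in chosen}
    total = sum(exps.values())
    return {i: exps[i] / total for i in chosen}

def moe_forward(x, experts, gate_scores, k):
    """Run only the selected experts and blend their outputs by routing weight."""
    weights = top_k_route(gate_scores, k)
    return sum(w * experts[i](x) for i, w in weights.items())

# Hypothetical toy setup: 8 tiny "experts" (each just scales its input), 2 active per token.
experts = [lambda x, s=s: s * x for s in range(1, 9)]
gate_scores = [0.1, 2.0, -1.0, 0.5, 1.5, 0.0, -0.5, 0.3]

weights = top_k_route(gate_scores, k=2)
output = moe_forward(3.0, experts, gate_scores, k=2)
active_fraction = len(weights) / len(experts)  # share of experts that ran for this token
```

Only two of the eight toy experts run for this token, but all eight still have to sit in memory. That's why VRAM needs track total parameters while speed tracks active ones.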

#02 — GLM-4.7-Flash-NEO-CODE-Imatrix-MAX (30B MoE)

A coding-focused fork of the GLM-4.7 series. Uses Imatrix MAX compression to keep quality high even at low precision. If you're a developer who wants zero-refusal coding help on a 16GB card, this is your pick.

  • Base Model: GLM-4.7-Flash (Zhipu AI)
  • Method: Abliteration (Heretic NEO)
  • VRAM: 14GB (Q4) / 20GB (Q8)
  • Format: GGUF (Imatrix MAX)
  • Best for: Coding, Dev Work

#03 — Llama-3.2-8X3B-MOE-Dark-Champion-Instruct (18.4B MoE)

Built from eight Llama 3.2 3B experts stitched together. Punches like a 24B model but runs fast — 50+ tokens per second on a 16GB GPU. Perfect for creative writing, roleplay, and long-form fiction without content blocks.

  • Base Model: Meta Llama 3.2 3B (x8 MoE)
  • Method: Abliteration
  • VRAM: 10GB (Q4) / 16GB (Q8)
  • Format: GGUF
  • Best for: Creative Writing, Roleplay, Fast Inference

#04 — GLM-4.7-Flash-Heretic-NEO-CODE (Feb 26 build)

A second community release of the same NEO-CODE config. Different calibration data means slightly different performance — good to try if the first one didn't quite click for you.

  • Base Model: GLM-4.7-Flash
  • Method: Abliteration
  • VRAM: 14GB (Q4) / 20GB (Q8)
  • Format: GGUF (Imatrix MAX)
  • Best for: Coding, Imatrix

#05 — GEITje-7b-uncensored (GGUF)

A Dutch-language model built on Mistral 7B, trained on 10 billion Dutch tokens. If you do research, journalism, or creative work in Dutch, this is a gem — and it runs on just 5GB of VRAM.

  • Base Model: GEITje-7B (Mistral 7B / Dutch)
  • Method: Uncensored Fine-Tune
  • VRAM: 5GB (Q4) / 8GB (Q8)
  • Format: GGUF
  • Best for: Dutch Language, Multilingual, Research

#06 — GEITje-7b-uncensored (Native Weights)

Same model, full precision. Meant for researchers who want to fine-tune further or run benchmarks without any quality loss from compression.

  • Base Model: GEITje-7B (Mistral 7B / Dutch)
  • Method: Uncensored Fine-Tune
  • VRAM: 14GB (Native BF16)
  • Format: Native PyTorch / Safetensors
  • Best for: Research, Fine-Tuning

#07 — Venice Uncensored (24B, Q8)

From Venice.ai — a privacy-first platform. Based on Dolphin Mistral 24B with all system-level safety prompts stripped out. Needs 20GB of VRAM but delivers near-perfect quality. Great for RTX 3090/4090 owners.

  • Base Model: Dolphin Mistral 24B Venice Edition
  • Method: Architectural De-alignment
  • VRAM: 20GB (Q8)
  • Format: GGUF (Q8_0)
  • Best for: Privacy-First, High Fidelity

#08 — Qwen3-30B-A3B-Claude-4.5-Opus-Abliterated-V2 (MLX)

A wild hybrid: Qwen3 MoE architecture mixed with reasoning patterns borrowed from Claude 4.5 Opus, then fully unlocked. Built for Apple Silicon — hits 100+ tokens per second on M4 Max. The go-to for uncensored RAG on Macs.

  • Base Model: Qwen3 30B-A3B + Claude 4.5 Opus distillation
  • Method: Abliteration V2 + Knowledge Distillation
  • VRAM: 24GB+ (MLX on Apple Silicon)
  • Format: MLX (6-bit)
  • Best for: Apple Silicon, RAG, Reasoning

#09 — Darkhn-M3.2-36B-Animus-V12-Heretic (36B)

Twelfth version of the legendary Darkhn series. Near-zero refusal rate, 128K context window, and outstanding coherence for very long creative sessions. Needs 22GB VRAM at Q4.

  • Base Model: Llama 3.2 MoE (Custom Darkhn Architecture)
  • Method: Abliteration (Heretic V12)
  • VRAM: 22GB (Q4) / 36GB (Q8)
  • Format: GGUF / ARM Quants
  • Best for: Creative Writing, Roleplay, Long Context

#10 — GLM-4.7-Flash-NEO-CODE (Feb 23 — Original Build)

The first widely-shared release in this lineage. Most battle-tested, with the largest amount of community feedback behind it. A safe bet if you want tried-and-true performance.

  • Base Model: GLM-4.7-Flash
  • Method: Abliteration
  • VRAM: 14GB (Q4) / 20GB (Q8)
  • Format: GGUF (Imatrix MAX)
  • Best for: Coding, MoE, Established

#11 — GLM5.Uncensored

An early peek at the next-gen GLM-5 architecture with safety filters removed. Parameter count is fuzzy, but early tests show better multilingual skills and deeper reasoning than GLM-4.7. For advanced users and researchers.

  • Base Model: GLM-5 (Zhipu AI)
  • Method: Full Uncensored Fine-Tune
  • VRAM: 20GB+ (Estimated)
  • Format: Native Weights
  • Best for: Next-Gen, Research, Multilingual

#12 — Qwen3-4B-Thinking-2507-SFT-Uncensored (4B)

A tiny powerhouse. Only 3GB VRAM at Q4, but has a "Thinking Mode" that lets it reason through tough problems step-by-step — like a mini DeepSeek R1. The best thinking model you can run on a basic laptop.

  • Base Model: Qwen3 4B (Thinking Mode)
  • Method: Supervised Fine-Tuning (SFT)
  • VRAM: 3GB (Q4) / 5GB (Q8)
  • Format: Native / GGUF
  • Best for: Edge Hardware, Thinking Mode, Small

#13 — Llama-3.2-3B-Instruct-uncensored (GGUF)

Just 2.5GB at Q4. Runs on practically anything — Raspberry Pi clusters, old laptops, you name it. Clean, fast, and refuses nothing.

  • Base Model: Meta Llama 3.2 3B Instruct
  • Method: Uncensored Fine-Tune
  • VRAM: 2.5GB (Q4) / 4GB (Q8)
  • Format: GGUF
  • Best for: Edge Hardware, Portable

#14 — Mistral-Nemo-2407-12B-Uncensored-HERETIC (12B)

The Mistral-NVIDIA collaboration, unlocked. Has a thinking-mode layer for complex multi-step problems. Hits the sweet spot at 8GB VRAM for security research and technical analysis.

  • Base Model: Mistral Nemo 12B (Mistral AI x NVIDIA)
  • Method: Abliteration (Heretic)
  • VRAM: 8GB (Q4) / 14GB (Q8)
  • Format: GGUF
  • Best for: Coding, Security Research, Thinking Mode

#15 — Llama-3.2-3B-Instruct-uncensored (Native Weights)

Full-precision version of #13. For researchers who want to fine-tune, merge, or benchmark without compression getting in the way.

  • Base Model: Meta Llama 3.2 3B Instruct
  • Method: Uncensored Fine-Tune
  • VRAM: 6GB (Native BF16)
  • Format: Native Safetensors
  • Best for: Research, Fine-Tuning

#16 — Llama3.1-70b-Uncensored (70B)

The big one. Needs 40GB+ VRAM (multi-GPU or Mac Studio Ultra territory). Built for enterprise-grade private deployments — law firms, hospitals, research labs running air-gapped AI.

  • Base Model: Meta Llama 3.1 70B
  • Method: Uncensored Fine-Tune
  • VRAM: 40GB (Q4) / 72GB (Q8)
  • Format: Native Safetensors
  • Best for: Enterprise, Multi-GPU

#17 — ripd-anthropic-saferlhf-gemma-2b-uncensored-v1-biased-bt (2B)

One of the most technically interesting releases. This tiny 2B model (about 2GB of VRAM) reverses Anthropic's SaferLHF safety training, essentially unlearning the alignment. More of a research artifact than a daily driver, but it gives unique insight into how alignment works under the hood.

  • Base Model: Google Gemma 2B
  • Method: Reverse-RLHF (RIPO)
  • VRAM: 2GB (Q4)
  • Format: Native
  • Best for: Research, Reverse RLHF

#18 — ripd-anthropic-saferlhf-gemma-2b-uncensored-v1-seed-bt (2B)

The seed variant of the reverse-engineering project above. Uses a different random starting point for the Bradley-Terry preference reversal — letting researchers compare how initial conditions shape the final behavior.

  • Base Model: Google Gemma 2B
  • Method: Reverse-RLHF (Seed Variant)
  • VRAM: 2GB (Q4)
  • Format: Native
  • Best for: Research, Reverse RLHF

#19 — GPT-OSS-20B-Abliterated-NEO-Imatrix (20B)

The middle-ground champion. 12GB VRAM at Q4. Strong across coding, logic, and creative writing. A versatile all-rounder for people who want one model that does it all.

  • Base Model: OpenAI GPT-OSS 20B (Community Open Weight)
  • Method: Abliteration (NEO) + Imatrix
  • VRAM: 12GB (Q4) / 18GB (Q8)
  • Format: GGUF (Imatrix)
  • Best for: Coding, Creative Writing, Mixed Tasks

#20 — Qwen2.5-coder-Uncensored (7B/14B/32B)

A coding-specialized model with safety filters removed. Lets security researchers and developers explore exploit analysis, penetration testing, and software security without hitting "I can't discuss that" walls.

  • Base Model: Qwen 2.5 Coder
  • Method: Uncensored Fine-Tune
  • VRAM: 5GB – 20GB
  • Format: Native
  • Best for: Coding, Security, Coder Model

Quick Comparison Table

| #  | Model                           | Size      | Min VRAM | Method           | Best For               |
|----|---------------------------------|-----------|----------|------------------|------------------------|
| 01 | GLM-4.7-Flash-Grande Heretic    | 42B MoE   | 16GB     | Abliteration     | Agentic Coding         |
| 02 | GLM-4.7-Flash NEO-CODE          | 30B MoE   | 14GB     | Abliteration     | Dev Work               |
| 03 | Llama-3.2 Dark Champion         | 18.4B MoE | 10GB     | Abliteration     | Roleplay / Fiction     |
| 05 | GEITje-7b-uncensored            | 7B        | 5GB      | SFT              | Dutch Language         |
| 07 | Venice Uncensored               | 24B       | 20GB     | De-alignment     | Privacy-First          |
| 08 | Qwen3-30B-A3B Abliterated V2    | 30B MoE   | 24GB     | Abliteration V2  | Apple Silicon / RAG    |
| 09 | Darkhn Animus V12               | 36B       | 22GB     | Abliteration V12 | Long Creative Writing  |
| 12 | Qwen3-4B-Thinking               | 4B        | 3GB      | SFT              | Edge Devices / Thinking |
| 14 | Mistral Nemo 12B Uncensored     | 12B       | 8GB      | Abliteration     | Research / Analysis    |
| 16 | Llama3.1-70B-Uncensored         | 70B       | 40GB     | SFT              | Enterprise             |
| 19 | GPT-OSS-20B NEO Imatrix         | 20B       | 12GB     | Abliteration     | Mixed Tasks            |
| 20 | Qwen2.5-coder-Uncensored        | Variable  | 5GB+     | SFT              | Security / Code        |

What Can Your Hardware Actually Run?

Your VRAM decides which models from this list you can actually use. Here's the breakdown:

  • 8GB VRAM (RTX 4060 / RX 7600 Tier): Models #05, #12, #13, #15, #17, #18, and the 7B variant of #20, plus Mistral Nemo 12B (#14) at Q4. Covers the compact thinking models, GEITje 7B, the Llama 3.2 3B variants, and the 2B Gemma research models.

  • 16GB VRAM (RTX 4080 / RX 7800 XT Tier): The sweet spot. Unlocks Dark Champion 18.4B (#03), the GLM-4.7 Flash NEO-CODE builds (#02, #04, #10), GPT-OSS 20B (#19), and GLM-4.7-Flash-Grande (#01) at Q4. This is where most people will land in 2026.

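  • 24GB VRAM (RTX 3090 / RTX 4090 Tier): Venice Uncensored Q8 (#07), the GLM-4.7 NEO-CODE variants (#02, #04, #10) at Q8 precision, and Darkhn Animus V12 (#09) at Q4. Qwen3-30B-A3B (#08) also sits in this tier, but as an MLX build it needs an Apple Silicon Mac with 24GB+ of unified memory rather than a discrete GPU.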

  • 40GB+ (Multi-GPU / Apple Ultra Tier): Full access to everything, including the massive Llama 3.1 70B (#16) and GLM-4.7-Flash-Grande 42B (#01) at Q8 (28GB). Note that a single RTX 5090 has 32GB, enough for #01 at Q8 but not for the 70B. A Mac Studio Ultra with 192GB unified memory can run every model on this list at Q8 precision.
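
If you'd rather check your own card than read tiers, the lookup is easy to script. The sketch below copies the lowest listed VRAM figure for a dozen models from the entries above; the `headroom_gb` parameter is my own addition, an assumption that you'll want to leave some memory free for the context (KV cache) and the rest of your system.

```python
# Minimum VRAM in GB, taken from the per-model listings above (lowest listed quant).
MIN_VRAM_GB = {
    "#01 GLM-4.7-Flash-Grande-Heretic": 16,
    "#02 GLM-4.7-Flash-NEO-CODE": 14,
    "#03 Dark Champion 18.4B MoE": 10,
    "#05 GEITje-7b-uncensored": 5,
    "#07 Venice Uncensored (Q8)": 20,
    "#08 Qwen3-30B-A3B Abliterated V2 (MLX)": 24,
    "#09 Darkhn Animus V12": 22,
    "#12 Qwen3-4B-Thinking": 3,
    "#13 Llama-3.2-3B-Instruct-uncensored": 2.5,
    "#14 Mistral Nemo 12B HERETIC": 8,
    "#16 Llama3.1-70b-Uncensored": 40,
    "#19 GPT-OSS-20B Abliterated": 12,
}

def models_that_fit(vram_gb, headroom_gb=0.0):
    """Return model names whose listed minimum fits the budget, largest first."""
    budget = vram_gb - headroom_gb
    fits = [(need, name) for name, need in MIN_VRAM_GB.items() if need <= budget]
    return [name for need, name in sorted(fits, reverse=True)]

eight_gb_picks = models_that_fit(8)
sixteen_gb_with_headroom = models_that_fit(16, headroom_gb=2)
```

With a 2GB reserve, a 16GB card no longer has room for #01, which is a fair reminder that the listed figures are minimums, not comfortable targets.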

What People Use These For

Creative Writing & Fiction

Models like Darkhn Animus V12 (#09) and Dark Champion MoE (#03) are built for long-form creative content, dark-themed fiction, and mature narratives — without the content filters getting in the way.

Cybersecurity & Red Teaming

GPT-OSS 20B (#19) and Qwen2.5-coder (#20) let security pros analyze malware, explore exploit logic, and understand adversarial techniques freely. These are often deployed in air-gapped forensic labs.

Agentic Coding & Automation

The GLM-4.7 Flash family (#01, #02, #10) leads here — tool use, UI generation, and multi-step code execution at speeds that rival cloud APIs.

Multilingual Research

GEITje (#05, #06) is the definitive uncensored Dutch-language model, critical for researchers, journalists, and policy analysts working in the Dutch linguistic sphere.

Apple-Native RAG Pipelines

Qwen3-30B MLX (#08) on M4 Max hits 100+ tokens per second — the best uncensored option for Mac users who want private RAG on their own documents.

AI Alignment Research

The Gemma 2B reverse-RLHF pair (#17, #18) gives academic researchers a window into how safety training is mechanically applied — and how it can be systematically reversed.

Quantization Breakthroughs: Beyond GGUF

MXFP4 (Microscaling)

MXFP4 uses shared exponents across small groups of weights, cutting the "quantization tax" that usually hurts small models (3B–7B). A 7B model can now perform nearly as well as its 16-bit counterpart while using 70% less VRAM.
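
The "70% less VRAM" figure is easy to sanity-check with a bit of arithmetic. This sketch assumes the MXFP4 layout as I understand it from the OCP Microscaling spec: 4-bit elements in blocks of 32, each block sharing one 8-bit scale. It counts weights only, so the KV cache and activations come on top.

```python
def bits_per_weight(element_bits, scale_bits, block_size):
    """Average storage per weight once the shared block scale is amortized."""
    return element_bits + scale_bits / block_size

def weight_gb(params_billion, bpw):
    """Weight footprint in GB: (params * bits per weight) / 8 bits per byte."""
    return params_billion * bpw / 8

# Assumed MXFP4 layout: 4-bit elements, one 8-bit scale per block of 32 weights.
mxfp4_bpw = bits_per_weight(element_bits=4, scale_bits=8, block_size=32)

fp16_gb = weight_gb(7, 16)           # a 7B model at 16-bit
mxfp4_gb = weight_gb(7, mxfp4_bpw)   # the same model in MXFP4
saving = 1 - mxfp4_gb / fp16_gb      # fraction of weight memory saved
```

Under those assumptions a 7B model drops from 14GB to roughly 3.7GB of weights, a saving of about 73%, which lines up with the ballpark quoted above.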

EXL2 for High-Speed Inference

For NVIDIA users, EXL2 has become the go-to format for the GLM-4.7 42B Grande series. It keeps the most important layers (like the attention mechanism) at higher precision while compressing less critical MLP layers more aggressively — smart efficiency.

Private Knowledge Bases (RAG)

The real magic happens when you pair an uncensored local LLM with your own documents. By hooking up Retrieval-Augmented Generation (RAG) with a 128K+ context window (standard on GLM-4.7 and Llama 3.2 MoE), you get a private "second brain" — it knows all your data and leaks none of it.
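
Here's a minimal sketch of the retrieval half of RAG, so you can see what "hooking up" actually means. To stay dependency-free it ranks documents by word-overlap cosine similarity; real setups normally swap that for an embedding model and a vector store. The notes and the prompt wording are made up for illustration.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query, documents, top_k=2):
    """Return up to top_k documents that share vocabulary with the query, best first."""
    q = Counter(tokenize(query))
    scored = [(cosine(q, Counter(tokenize(d))), d) for d in documents]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [d for score, d in scored[:top_k] if score > 0]

def build_prompt(query, documents, top_k=2):
    """Put the retrieved snippets in front of the question for the local model."""
    context = "\n".join(f"- {d}" for d in retrieve(query, documents, top_k))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

# Hypothetical private notes standing in for your own documents.
notes = [
    "The backup server is rebooted every Sunday at 02:00.",
    "Invoices are archived in the finance share after 90 days.",
    "The office coffee machine needs descaling once a month.",
]
prompt = build_prompt("When is the backup server rebooted?", notes, top_k=1)
```

The resulting `prompt` string is what you would hand to the local model. Nothing in this loop touches the network, which is the whole privacy argument.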


Frequently Asked Questions

Is it legal to run uncensored models locally?

Yes, in most places. Running abliterated or uncensored models on your own hardware for research and personal use falls under open research principles. But laws vary by country, so always check your local regulations. The important thing is that YOU are responsible for what you generate with these tools.

What's the difference between abliteration and fine-tuning?

Abliteration surgically removes the refusal mechanism from a model's weights without retraining. The model keeps nearly all of its original capability and simply stops refusing. Fine-tuning retrains the model on examples of compliant answers, which works but can slightly degrade capabilities if the training data isn't top-notch. Most high-quality March 2026 releases use abliteration.

Which model is best for creative writing?

Two stand out: Darkhn Animus V12 (#09) at 36B for the absolute best long-form coherence and character consistency, and Dark Champion 18.4B MoE (#03) if you're on 16GB VRAM and need fast inference with great creative quality.

What's the smallest uncensored model I can run?

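For everyday use, Llama 3.2 3B (#13) is the smallest at 2.5GB VRAM in Q4, great for basic tasks on any device. If you want reasoning, Qwen3-4B-Thinking (#12) needs just 3GB in Q4 and, despite its tiny size, has a chain-of-thought "Thinking Mode" that lets it work through complex problems. The 2B Gemma pair (#17, #18) is smaller still at 2GB, but those are research artifacts rather than daily drivers.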

Can I run these on a Mac?

Absolutely. Qwen3-30B-A3B-Abliterated-V2 (#08) is built specifically for Apple Silicon in MLX format and hits 100+ tokens/sec on M4 Max. Most GGUF models also run well on Macs via Ollama or LM Studio using Metal acceleration.
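
Once a GGUF model is loaded in Ollama, you can talk to it from a script as well as from the chat window. This sketch assumes Ollama is running on its default local port (11434) and uses its `/api/generate` endpoint; `my-local-model` is a placeholder, not a real model tag, so swap in whatever name you imported the model under.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint

def build_request(model, prompt):
    """Build a non-streaming generate request for a locally served model."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def generate(model, prompt, timeout=120):
    """Send the prompt to the local server and return the generated text."""
    with urllib.request.urlopen(build_request(model, prompt), timeout=timeout) as resp:
        return json.loads(resp.read())["response"]

# Usage (needs a running Ollama server and an already imported model):
#   print(generate("my-local-model", "Summarize RAG in one sentence."))
```

The request goes to localhost only, so the prompt and the answer stay on your machine.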

Do I need multiple GPUs?

Only for the biggest models. The 70B Llama (#16) needs 40GB+ VRAM, and a couple of others (#01 and #09) go past 24GB at Q8. At Q4, everything else on the list runs on a single consumer GPU, and 16GB is the most common sweet spot for 2026.

What is RAG and why should I care?

RAG (Retrieval-Augmented Generation) lets you connect an LLM to your own private documents. Instead of the model guessing or pulling from internet training data, it reads from YOUR files. Pair it with an uncensored local model, and you get a completely private AI that works on your data without any leaks to the cloud.

Are these models safe to use for work?

Use caution. These models have no safety filters — that's the point. For regulated industries (healthcare, law, finance), always run a secondary aligned model as a review layer before using any output in production. Never feed sensitive customer data into a model unless you fully understand your compliance obligations.

What's the best all-rounder model on this list?

GPT-OSS-20B (#19) at 12GB VRAM is the strongest middle-ground pick — good at coding, logic, and creative writing without demanding high-end hardware. If you have 16GB, GLM-4.7-Flash-Grande (#01) is the premium choice for agentic coding and complex tasks.

Why would anyone want an uncensored model?

Several legit reasons: cybersecurity researchers need to analyze malware without hitting refusal walls. Creative writers don't want their stories interrupted by content filters. Companies running air-gapped private AI want full control over what their models can and can't discuss. Researchers study alignment by comparing censored vs. uncensored behavior. The common thread: control and privacy.


The Bottom Line

March 2026 proved one thing: the community-driven push for open, unfiltered AI isn't slowing down — it's speeding up. The GLM-4.7 Heretic family now owns the mid-range space, while Darkhn and Dark Champion still rule creative work. And for anyone on a tight budget, the Qwen3-4B Thinking model at just 3GB VRAM is nothing short of remarkable.

The age of private, uncensored AI isn't coming — it's already here. 🚀