Building a Production MLIR Compiler: Architecture and Design Decisions
After spending months building a production-grade MLIR-based compiler for a domain-specific architecture, I want to share the architectural decisions and patterns that shaped the project. This post focuses on the high-level design — how to structure an MLIR compiler that goes from PyTorch models and C/C++ code all the way down to assembly.
The Three-Layer Architecture
Our compiler is structured as three distinct layers, each with clear responsibilities:
Frontend Layer (Clang + PyTorch)
↓
Midend Layer (MLIR Dialects + Passes)
↓
Backend Layer (LLVM CodeGen)
This separation is more than organizational — it enables independent development velocity. The midend can evolve without rebuilding LLVM. The frontend team can add new op coverage without touching optimization passes.
Why Separate the Midend?
One critical decision was building the midend as a standalone CMake project that links against a pre-built LLVM/MLIR installation. The alternative — embedding everything into the LLVM tree — has a steep cost: every change requires rebuilding a large chunk of LLVM.
Our approach:
# Step 1: Build LLVM + MLIR + Clang (once, rarely)
cmake -G Ninja -S llvm/llvm -B llvm/build \
-DLLVM_ENABLE_PROJECTS="clang;mlir"
ninja -C llvm/build
# Step 2: Build midend standalone (fast iteration)
cmake -G Ninja -S . -B build \
-DMLIR_DIR=$PWD/llvm/build/lib/cmake/mlir
ninja -C build
A midend rebuild takes seconds instead of minutes. This matters enormously during pass development.
Dialect Design Philosophy
We defined two custom MLIR dialects, each following a fundamentally different design pattern:
Pattern 1: LLVMIR Sub-Dialect (for compute ops)
Our primary compute dialect follows the NVVM/ROCDL model — it lives as a sub-dialect of the LLVM dialect. Each operation maps 1:1 to an LLVM intrinsic:
ftm.vload_w → llvm.ftm.vload.w
ftm.vfmadd → llvm.ftm.vfmadd
ftm.vstore_w → llvm.ftm.vstore.w
This design has a key advantage: lowering to LLVM IR is trivial. Each op translates directly to an intrinsic call. No complex lowering logic needed.
We used TableGen helper classes to keep definitions consistent:
// Pure ops (no side effects)
class FTM_PureUnaryOp<...> // 1 in, 1 out
class FTM_PureBinaryOp<...> // 2 in, 1 out
// Memory ops (with side effects)
class FTM_LoadOp<...> // reads memory
class FTM_StoreOp<...> // writes memory
Pattern 2: Standalone Dialect (for system ops)
System-level operations (DMA, synchronization, clock management) use a different pattern — a standalone dialect where each op lowers to an llvm.call targeting a pre-built library function:
// Before lowering
csl.dma_start(%config, %src, %dst, %size)
// After CSL → LLVM lowering
llvm.call @csl_dma_start(%config, %src, %dst, %size)
This encapsulates hardware details behind a stable function interface. When the hardware library changes, only the library needs updating — the compiler passes remain stable.
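To make the shape of this lowering concrete, here is a minimal sketch of a conversion pattern that rewrites the op into an llvm.call. The op class name (csl::DmaStartOp) and the runtime symbol are taken from the example above; treat the surrounding boilerplate as illustrative rather than our exact code.

// Minimal sketch: lower a CSL system op to a call into the pre-built library.
// The op class csl::DmaStartOp is assumed to be the generated class for
// csl.dma_start; everything else is standard MLIR dialect-conversion API.
#include "mlir/Dialect/LLVMIR/LLVMDialect.h"
#include "mlir/IR/BuiltinOps.h"
#include "mlir/Transforms/DialectConversion.h"

using namespace mlir;

struct DmaStartLowering : OpConversionPattern<csl::DmaStartOp> {
  using OpConversionPattern::OpConversionPattern;

  LogicalResult
  matchAndRewrite(csl::DmaStartOp op, OpAdaptor adaptor,
                  ConversionPatternRewriter &rewriter) const override {
    // The runtime entry point is declared once at module scope by the pass.
    auto module = op->getParentOfType<ModuleOp>();
    auto callee = module.lookupSymbol<LLVM::LLVMFuncOp>("csl_dma_start");
    if (!callee)
      return rewriter.notifyMatchFailure(op, "runtime function not declared");
    // One-to-one replacement: operands pass straight through to the call.
    rewriter.replaceOpWithNewOp<LLVM::CallOp>(op, callee,
                                              adaptor.getOperands());
    return success();
  }
};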
The 205-Op Challenge
Our compute dialect defines 205+ operations. Managing this at scale requires discipline:
- Consistent naming convention: A v prefix for vector ops, the operation name, then a type suffix (d for f64, s32 for f32)
- Category-based organization: Memory ops, FP compare, precision conversion, dot products, etc.
- Exhaustive testing: Every op has at least one FileCheck test validating the lowering path
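One thing that keeps this volume manageable: the op definitions live entirely in TableGen, and the C++ side registers them in a single place via the generated op list. Below is a sketch of that registration, assuming the dialect class is named FTMDialect and the generated file is FTMOps.cpp.inc (both names are inferred from the ftm. prefix, not taken from our tree).

// Sketch of dialect initialization with a TableGen-generated op list.
#include "FTM/FTMDialect.h"
#include "FTM/FTMOps.h"

void ftm::FTMDialect::initialize() {
  // All 205+ op classes come out of TableGen; adding a new op only touches
  // the .td files, never this registration code.
  addOperations<
#define GET_OP_LIST
#include "FTM/FTMOps.cpp.inc"
      >();
}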
Pass Pipeline Orchestration
The midend chains 18 passes in a specific order. Getting this order right was the hardest part of the midend's design. Our pipeline:
Pre-optimization (cleanup)
→ eliminate-memref-copy
→ optimize-padding
→ simplify-linalg-ops
Kernel decomposition
→ convert-conv-to-im2col
→ linalg-fusion
Memory hierarchy tiling (outer)
→ linalg-tiling (AM locality)
→ insert-dma
Vectorization (inner)
→ linalg-tiling (vector width)
Register optimization
→ simplify-vector
→ hoist-vector-transfers
→ unroll-loops
→ promote-accumulators
→ forward-store-to-load
→ merge-vector-loads
Dialect lowering
→ vector → ftm
→ math → ftm
→ arith → ftm
→ csl → llvm
Key insight: The order between memory hierarchy tiling and vectorization tiling is critical. You must tile for memory locality first, insert DMA operations, and then tile for vectorization within each memory-local tile. Reversing this order produces incorrect DMA boundaries.
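In the PassManager this constraint is just an ordering of addPass calls, but it is worth writing down explicitly. The pass factory names below are illustrative stand-ins for our real passes, not actual entry points:

// Sketch of the tiling/DMA ordering, with illustrative pass factory names.
#include <memory>
#include "mlir/Pass/Pass.h"
#include "mlir/Pass/PassManager.h"

// Illustrative factory declarations; the real pass names differ.
std::unique_ptr<mlir::Pass> createMemoryTilingPass();
std::unique_ptr<mlir::Pass> createInsertDmaPass();
std::unique_ptr<mlir::Pass> createVectorTilingPass();

void addTilingStages(mlir::OpPassManager &pm) {
  // 1. Tile for memory locality first, so each tile fits the AM level.
  pm.addPass(createMemoryTilingPass());
  // 2. Insert DMA transfers at the boundaries of those memory-local tiles.
  pm.addPass(createInsertDmaPass());
  // 3. Only then tile the inner loops down to the hardware vector width.
  pm.addPass(createVectorTilingPass());
}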
Build System Integration
We use LLVM as a git submodule with a patch-based integration strategy:
- Patches stored as unified diffs in /patches/
- Applied automatically via apply-patches.sh
- Can be regenerated from working-tree changes
This avoids maintaining a separate LLVM fork while keeping our changes version-controlled. When upstream LLVM updates, we rebase patches — much lighter than a full fork merge.
Testing Strategy
With 18 passes and 205+ ops, testing is non-negotiable:
- Per-pass directories: Each pass has its own test directory with implemented/ and unimplemented/ subdirectories
- FileCheck + lit: Standard MLIR testing infrastructure
- Kernel integration tests: Full end-to-end compilation of BLAS kernels, FFT, etc.
- 141 tests total: 48 backend + 93 midend
The unimplemented/ directories with XFAIL markers serve as living documentation of known limitations — when you fix something, you move the test and change the expectation.
Takeaways
- Separate your midend build — fast iteration on passes is worth the initial setup cost
- Choose dialect patterns deliberately — LLVMIR sub-dialect for compute, standalone for system ops
- Pass ordering is the hard problem — document why each pass precedes the next
- Patch-based LLVM integration beats fork maintenance for smaller teams
- Test per-pass, not just end-to-end — localized failures are infinitely easier to debug
In the next post, I’ll dive into the cost-model-driven tiling framework — how we automatically decide which dimensions to vectorize based on memory access patterns and hardware instruction costs.