MLIR Compilation Framework for FT-Matrix

A production-grade compiler built on LLVM/MLIR that compiles PyTorch models and C/C++ code to FT-Matrix processors, paired with a systematic benchmark framework for evaluating optimization effectiveness.

Architecture

The compiler follows a three-layer design:

  • Frontend: Clang-based C/C++ compilation + PyTorch TorchDynamo capture with 350+ op lowerings and three compilation modes (MLIR inspection / assembly generation / JIT execution)
  • Midend: 18-pass MLIR optimization pipeline with two custom dialects: the FTM dialect (205 operations mapping hardware intrinsics) and the CSL dialect (25 system-level operations for DMA, barrier synchronization, and core management)
  • Backend: LLVM SelectionDAG instruction selection with post-increment addressing optimization and assembly generation

Key Technical Contributions

  • Cost-model-driven tiling: Three-phase framework consisting of access pattern analysis, instruction latency cost evaluation, and optimal vectorization dimension selection
  • Multi-level memory tiling: Two-level tiling for on-chip memory hierarchy with auto-inserted multi-channel DMA (double buffering, scatter-gather support)
  • Register blocking & accumulator promotion: M-dimension blocking for matmul with accumulator reuse; automatic conversion of load/store patterns to register-carried values across reduction loops
  • Store-to-load forwarding: Epilogue fusion that eliminates intermediate memory round-trips between producer-consumer operations
  • Elementwise fusion: Fusing chains of elementwise operations for improved data locality
  • Conv2D im2col decomposition: Transforms convolution into matrix multiplication for efficient vectorization
  • Padding optimization: Border-only padding strategy that initializes only the padded border rather than the whole buffer, a ~96% reduction compared to full-buffer initialization
  • Pure Python PyTorch frontend: TorchDynamo integration with Linalg lowering, supporting model capture → MLIR → assembly end-to-end (demonstrated with Qwen3-0.6B LLM)
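
The im2col decomposition listed above can be illustrated with a minimal sketch in plain Python. This is not the compiler's implementation: the function names (`im2col`, `conv2d_direct`, `conv2d_im2col`) are hypothetical, and the sketch assumes a single channel, stride 1, no padding, and the cross-correlation convention common in ML frameworks. It only shows why unfolding the input windows into rows turns a convolution into a matrix product, which is the form a vectorizing backend handles well.

```python
def im2col(image, kh, kw):
    """Unfold each kh x kw window of a 2D image into one flattened row.

    Assumes stride 1 and no padding (illustrative simplification).
    Returns the patch matrix plus the output height and width.
    """
    h, w = len(image), len(image[0])
    out_h, out_w = h - kh + 1, w - kw + 1
    patches = []
    for i in range(out_h):
        for j in range(out_w):
            patches.append(
                [image[i + di][j + dj] for di in range(kh) for dj in range(kw)]
            )
    return patches, out_h, out_w


def conv2d_direct(image, kernel):
    """Reference sliding-window convolution (cross-correlation form)."""
    kh, kw = len(kernel), len(kernel[0])
    h, w = len(image), len(image[0])
    return [
        [
            sum(image[i + di][j + dj] * kernel[di][dj]
                for di in range(kh) for dj in range(kw))
            for j in range(w - kw + 1)
        ]
        for i in range(h - kh + 1)
    ]


def conv2d_im2col(image, kernel):
    """Same convolution expressed as a (patch matrix) x (flattened kernel) product."""
    kh, kw = len(kernel), len(kernel[0])
    patches, out_h, out_w = im2col(image, kh, kw)
    flat_kernel = [v for row in kernel for v in row]
    # Matrix-vector product: one dot product per output pixel.
    flat_out = [sum(a * b for a, b in zip(p, flat_kernel)) for p in patches]
    # Reshape the flat result back to out_h x out_w.
    return [flat_out[r * out_w:(r + 1) * out_w] for r in range(out_h)]


image = [[4 * r + c for c in range(4)] for r in range(4)]  # 4x4 test input
kernel = [[1, 0], [0, -1]]                                 # 2x2 test kernel
result = conv2d_im2col(image, kernel)
```

With several output channels, the flattened kernels stack into a matrix and the dot products above become a single GEMM. The cost is that the patch matrix duplicates overlapping input elements.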

Benchmark Framework

A systematic evaluation platform that measures compiler optimization effectiveness through:

  • Three-way comparison: Scalar baseline vs. compiler-optimized vs. hand-written assembly, reporting speedup ratios and percentage of peak performance
  • Ablation studies: 8 incremental optimization variants per kernel (scalar → pre-optimization → memory tiling → DMA insertion → vectorization → register blocking → loop unrolling → full pipeline), isolating each pass’s contribution
  • End-to-end simulation: Cycle-accurate simulation with automatic test data generation and NumPy reference verification for correctness
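
The metrics behind the three-way comparison and the ablation studies can be sketched as follows. This is an assumed formulation rather than the framework's actual code: the function names are hypothetical, and the cycle counts are made-up placeholders chosen for easy arithmetic, not measured results. The sketch takes simulated cycle counts per variant and derives the cumulative speedup over the scalar baseline, the incremental speedup contributed by each added pass, and the compiler's performance as a percentage of a hand-written reference.

```python
def ablation_report(cycles_by_variant):
    """Compute cumulative and per-step speedups for ordered ablation variants.

    cycles_by_variant: list of (variant_name, cycle_count), ordered from the
    scalar baseline to the full pipeline.
    """
    baseline = cycles_by_variant[0][1]
    previous = baseline
    report = []
    for name, cycles in cycles_by_variant:
        report.append({
            "variant": name,
            # Speedup relative to the scalar baseline.
            "cumulative_speedup": baseline / cycles,
            # Speedup relative to the previous variant, isolating one pass.
            "step_speedup": previous / cycles,
        })
        previous = cycles
    return report


def percent_of_reference(compiler_cycles, reference_cycles):
    """Compiler performance as a percentage of a faster reference (e.g. hand-written assembly)."""
    return 100.0 * reference_cycles / compiler_cycles


# Placeholder cycle counts for illustration only (NOT measured data).
example = [
    ("scalar", 8000),
    ("memory tiling", 4000),
    ("vectorization", 500),
    ("full pipeline", 250),
]
report = ablation_report(example)
pct = percent_of_reference(compiler_cycles=250, reference_cycles=200)

for row in report:
    print(f"{row['variant']:>14}: x{row['cumulative_speedup']:.1f} total, "
          f"x{row['step_speedup']:.1f} from this step")
print(f"compiler reaches {pct:.0f}% of the reference")
```

Because the per-step speedups multiply to the cumulative speedup, the ordering of variants matters: a pass's measured contribution depends on which passes were already enabled before it.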

Benchmarks include:

  • 5 BLAS kernels (GEMM, GEMV, dot product, axpby, scal) with hand-written reference implementations
  • 5 additional kernels (softmax, layernorm, GELU, elementwise fusion, broadcast add)
  • Model-level benchmarks (LeNet CNN, Transformer encoder)

Performance Results

  • ~57x speedup on matrix kernels vs. scalar baseline
  • ~44x speedup on transformer inference vs. scalar baseline
  • Compiler output achieves ~75% of hand-tuned assembly performance
  • End-to-end LLM compilation demonstrated (Qwen3-0.6B)

Scale

  • 25,000+ lines of code across frontend, midend, and backend
  • 205+ custom MLIR operations in the FTM dialect + 25 system-level operations in the CSL dialect
  • 146 compiler tests + comprehensive benchmark suite covering 12 kernels and 2 models