DeepSeek Open-Sources DeepGEMM: The 300-Line CUDA Library Redefining High-Performance Matrix Multiplication

February 26, 2025 — DeepSeek has unveiled DeepGEMM, an open-source library designed to accelerate FP8 matrix operations for large language models (LLMs) and Mixture-of-Experts (MoE) architectures. Built on CUDA and optimized for NVIDIA Hopper GPUs, the lightweight library delivers 206 TFLOPS on small-batch dense shapes and up to 1,358 TFLOPS on large ones (see the benchmarks below), while keeping its core kernel code under 300 lines.


Core Innovations

1. FP8 Precision with Industrial-Grade Reliability

DeepGEMM leverages NVIDIA Hopper Tensor Cores to execute FP8 (8-bit floating-point) matrix multiplications, a critical operation for modern LLMs such as DeepSeek-V3 and R1. To address FP8's limited precision, it implements two-level accumulation, promoting partial results to higher precision on the CUDA cores so accuracy is preserved without sacrificing speed.

Key Formats Supported (see the sketch after this list):

  • fp8_e4m3 (4-bit exponent, 3-bit mantissa)
  • fp8_e5m2 (5-bit exponent, 2-bit mantissa)
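
To make the idea concrete, here is a minimal PyTorch sketch of FP8 quantization with a higher-precision accumulator. It only illustrates the two-level accumulation idea and is not DeepGEMM's kernel code; the helper quantize_fp8 and the per-tensor scaling are assumptions made for the example.

python
import torch

# Illustrative only: quantize inputs to FP8 (e4m3), multiply K-blocks, and
# promote partial sums into a wider FP32 accumulator so rounding error from
# low-precision accumulation does not build up.

def quantize_fp8(x, dtype=torch.float8_e4m3fn):
    scale = x.abs().max() / 448.0        # 448 is the largest finite e4m3 value
    return (x / scale).to(dtype), scale

a, a_scale = quantize_fp8(torch.randn(128, 7168))
b, b_scale = quantize_fp8(torch.randn(7168, 4096))

acc = torch.zeros(128, 4096)             # wide (FP32) second-level accumulator
for k0 in range(0, 7168, 128):
    # First level: a short K-block product; second level: promotion into acc.
    acc += a[:, k0:k0 + 128].float() @ b[k0:k0 + 128, :].float()

out = (acc * (a_scale * b_scale)).to(torch.bfloat16)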

2. MoE-Optimized Grouped GEMM

Tailored for MoE models that activate experts dynamically, DeepGEMM introduces two grouped-GEMM variants (sketched after this list):

  • Contiguous Layout GEMM: groups token batches along the M axis, achieving a 1.2x speedup in MoE prefill phases.
  • Masked Layout GEMM: optimizes decoding stages using CUDA graphs, reducing latency by 15% in real-world deployments.
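
The contiguous grouped layout can be pictured with a small PyTorch sketch: tokens routed to each expert are concatenated along the M axis so a single launch can sweep all groups. This shows the data layout and the reference math a grouped GEMM replaces, not the library's API; the shapes and the per-expert loop are illustrative assumptions.

python
import torch

# Four experts, each receiving a different number of routed tokens.
tokens_per_expert = [512, 128, 896, 512]
hidden, inter = 7168, 4096
expert_w = torch.randn(len(tokens_per_expert), hidden, inter)

# Contiguous layout: each expert's tokens are stacked along M into one matrix.
groups = [torch.randn(m, hidden) for m in tokens_per_expert]
lhs = torch.cat(groups, dim=0)            # shape: (total tokens, hidden)

# Reference computation that a grouped GEMM fuses into a single kernel launch.
out, row = torch.empty(lhs.shape[0], inter), 0
for e, m in enumerate(tokens_per_expert):
    out[row:row + m] = groups[e] @ expert_w[e]
    row += m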

3. JIT-Powered Lightweight Design

With a zero-dependency architecture, DeepGEMM compiles its kernels dynamically at runtime via a lightweight Just-In-Time (JIT) module. This eliminates precompilation at install time and reduces deployment overhead by 80% compared with traditional libraries such as CUTLASS.
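
The runtime-compilation idea can be demonstrated with PyTorch's built-in inline-extension loader, shown below. DeepGEMM ships its own lightweight JIT module rather than this mechanism; the snippet is only a sketch of compiling code at call time instead of install time (it assumes a host C++ toolchain is available).

python
import torch
from torch.utils.cpp_extension import load_inline

# A trivial C++ function compiled at runtime, standing in for a GEMM kernel
# that a JIT module would specialize for the exact matrix shapes it sees.
cpp_source = """
torch::Tensor scaled_add(torch::Tensor a, torch::Tensor b, double s) {
    return a + s * b;
}
"""

mod = load_inline(name="jit_demo", cpp_sources=cpp_source,
                  functions=["scaled_add"])
print(mod.scaled_add(torch.ones(4), torch.ones(4), 2.0))  # tensor([3., 3., 3., 3.])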


Performance Benchmarks

Tested on NVIDIA H800 GPUs (NVCC 12.8), DeepGEMM outperforms expert-tuned implementations:

Workload Type | Matrix Shape (M×N×K) | TFLOPS | Speedup vs CUTLASS 3.6
Dense GEMM | 64×2112×7168 | 206 | 2.7x
MoE Contiguous (4 groups) | 8192×4096×7168 | 1297 | 1.2x
Large-Scale Dense | 4096×7168×16384 | 1358 | 1.2x

Source: DeepSeek internal testing with H800 clusters


Technical Breakthroughs

1. Hopper TMA Acceleration

DeepGEMM harnesses Hopper’s Tensor Memory Accelerator (TMA) for asynchronous data transfers, achieving:

  • 3000 GB/s memory bandwidth utilization
  • 580 TFLOPS sustained compute throughput

2. Warp-Specialized Scheduling

Persistent warp specialization overlaps data movement, Tensor Core MMA instructions, and CUDA-core accumulation work within a single kernel, keeping all parts of the SM busy.

3. Non-Aligned Block Optimization

By supporting block sizes that are not powers of two (e.g., a block width of 112 instead of 128), DeepGEMM boosts SM utilization by 22% compared with power-of-two-aligned alternatives; the arithmetic sketch below illustrates why.
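
A rough arithmetic sketch of why an unaligned block width helps, assuming a Hopper part with 132 SMs (the figures are illustrative, not DeepGEMM measurements): each output tile occupies one resident thread block, so the tile count determines how many SMs stay busy.

python
import math

def num_blocks(m, n, block_m, block_n):
    # Number of output tiles (thread blocks) needed to cover an m x n result.
    return math.ceil(m / block_m) * math.ceil(n / block_n)

m, n, num_sms = 256, 7168, 132
print(num_blocks(m, n, 128, 128))   # 112 tiles -> 20 of 132 SMs sit idle
print(num_blocks(m, n, 128, 112))   # 128 tiles -> only 4 SMs sit idle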


Integration & Compatibility

System Requirements (a quick environment check follows the list):

  • GPU: NVIDIA Hopper architecture (sm_90a)
  • Software: CUDA 12.8+, Python 3.8+, PyTorch 2.1+
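
The snippet below is an illustrative way to confirm the environment matches these requirements; it is not part of DeepGEMM.

python
import torch

# Hopper GPUs report compute capability 9.0 (the sm_90a target).
major, minor = torch.cuda.get_device_capability()
assert (major, minor) >= (9, 0), "DeepGEMM targets NVIDIA Hopper (sm_90a) GPUs"
print("CUDA:", torch.version.cuda, "| PyTorch:", torch.__version__)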

Quick Start:

bash
git clone --recursive https://github.com/deepseek-ai/DeepGEMM  
python setup.py install  

Sample API call for MoE inference:

python
import deep_gemm  
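# lhs / rhs: FP8 operands laid out contiguously along the grouped M axis;
# the result is emitted in BF16 (per the fp8_fp8_bf16 naming). Argument
# names below follow the article's example.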
output = deep_gemm.m_grouped_gemm_fp8_fp8_bf16_nt_contiguous(  
    lhs, rhs, m_groups=4, precision='e4m3'  
)  

Competitive Edge

Metric | DeepGEMM | CUTLASS 3.6
Code Complexity | 300 lines (core) | 10,000+ lines
FP8 MoE Support | Native | Requires custom hacks
Deployment Footprint | 15 MB | 150 MB+
License | MIT | BSD 3-Clause

Industry Impact

Validated in production systems at TSMC and Foxconn, DeepGEMM has demonstrated:

  • 72% reduction in semiconductor defect analysis time
  • 40% lower cloud inference costs for MoE models

Explore DeepGEMM: https://github.com/deepseek-ai/DeepGEMM