DeepSeek Open-Sources DeepGEMM: The 300-Line CUDA Library Redefining High-Performance Matrix Multiplication

February 26, 2025 — DeepSeek has unveiled DeepGEMM, an open-source library designed to accelerate FP8 matrix operations for large language models (LLMs) and Mixture-of-Experts (MoE) architectures. Built on CUDA and optimized for NVIDIA Hopper GPUs, the lightweight library delivers 206 TFLOPS on small-batch dense shapes and up to 1,358 TFLOPS on large ones (see the benchmarks below), while keeping its core kernel code under 300 lines.


Core Innovations

1. FP8 Precision with Industrial-Grade Reliability

DeepGEMM leverages NVIDIA Hopper Tensor Cores to execute FP8 (8-bit floating-point) matrix multiplications, a critical operation for modern LLMs such as DeepSeek-V3 and R1. To address FP8's limited precision, it implements two-level accumulation, promoting partial results to higher precision on the CUDA cores so accuracy is preserved without sacrificing speed.

Key Formats Supported (see the sketch after this list):

  • fp8_e4m3 (4-bit exponent, 3-bit mantissa)
  • fp8_e5m2 (5-bit exponent, 2-bit mantissa)
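
To make the idea concrete, here is a minimal PyTorch sketch of FP8 quantization with a higher-precision accumulator. It only illustrates the two-level accumulation idea and is not DeepGEMM's kernel code; the helper quantize_fp8 and the per-tensor scaling are assumptions made for the example.

python
import torch

# Illustrative only: quantize inputs to FP8 (e4m3), multiply K-blocks, and
# promote partial sums into a wider FP32 accumulator so rounding error from
# low-precision accumulation does not build up.

def quantize_fp8(x, dtype=torch.float8_e4m3fn):
    scale = x.abs().max() / 448.0        # 448 is the largest finite e4m3 value
    return (x / scale).to(dtype), scale

a, a_scale = quantize_fp8(torch.randn(128, 7168))
b, b_scale = quantize_fp8(torch.randn(7168, 4096))

acc = torch.zeros(128, 4096)             # wide (FP32) second-level accumulator
for k0 in range(0, 7168, 128):
    # First level: a short K-block product; second level: promotion into acc.
    acc += a[:, k0:k0 + 128].float() @ b[k0:k0 + 128, :].float()

out = (acc * (a_scale * b_scale)).to(torch.bfloat16)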

2. MoE-Optimized Grouped GEMM

Tailored for MoE models that activate experts dynamically, DeepGEMM introduces two grouped-GEMM variants (sketched after this list):

  • Contiguous Layout GEMM: groups token batches along the M axis, achieving a 1.2x speedup in MoE prefill phases.
  • Masked Layout GEMM: optimizes decoding stages using CUDA graphs, reducing latency by 15% in real-world deployments.
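
The contiguous grouped layout can be pictured with a small PyTorch sketch: tokens routed to each expert are concatenated along the M axis so a single launch can sweep all groups. This shows the data layout and the reference math a grouped GEMM replaces, not the library's API; the shapes and the per-expert loop are illustrative assumptions.

python
import torch

# Four experts, each receiving a different number of routed tokens.
tokens_per_expert = [512, 128, 896, 512]
hidden, inter = 7168, 4096
expert_w = torch.randn(len(tokens_per_expert), hidden, inter)

# Contiguous layout: each expert's tokens are stacked along M into one matrix.
groups = [torch.randn(m, hidden) for m in tokens_per_expert]
lhs = torch.cat(groups, dim=0)            # shape: (total tokens, hidden)

# Reference computation that a grouped GEMM fuses into a single kernel launch.
out, row = torch.empty(lhs.shape[0], inter), 0
for e, m in enumerate(tokens_per_expert):
    out[row:row + m] = groups[e] @ expert_w[e]
    row += m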

3. JIT-Powered Lightweight Design

With a zero-dependency architecture, DeepGEMM compiles its kernels dynamically at runtime via a lightweight Just-In-Time (JIT) module. This eliminates precompilation at install time and reduces deployment overhead by 80% compared with traditional libraries such as CUTLASS.
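
The runtime-compilation idea can be demonstrated with PyTorch's built-in inline-extension loader, shown below. DeepGEMM ships its own lightweight JIT module rather than this mechanism; the snippet is only a sketch of compiling code at call time instead of install time (it assumes a host C++ toolchain is available).

python
import torch
from torch.utils.cpp_extension import load_inline

# A trivial C++ function compiled at runtime, standing in for a GEMM kernel
# that a JIT module would specialize for the exact matrix shapes it sees.
cpp_source = """
torch::Tensor scaled_add(torch::Tensor a, torch::Tensor b, double s) {
    return a + s * b;
}
"""

mod = load_inline(name="jit_demo", cpp_sources=cpp_source,
                  functions=["scaled_add"])
print(mod.scaled_add(torch.ones(4), torch.ones(4), 2.0))  # tensor([3., 3., 3., 3.])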


Performance Benchmarks

Tested on NVIDIA H800 GPUs (NVCC 12.8), DeepGEMM outperforms expert-tuned implementations:

Workload Type | Matrix Shape (M×N×K) | TFLOPS | Speedup vs CUTLASS 3.6
Dense GEMM | 64×2112×7168 | 206 | 2.7x
MoE Contiguous (4 groups) | 8192×4096×7168 | 1297 | 1.2x
Large-Scale Dense | 4096×7168×16384 | 1358 | 1.2x

Source: DeepSeek internal testing with H800 clusters


Technical Breakthroughs

1. Hopper TMA Acceleration

DeepGEMM harnesses Hopper’s Tensor Memory Accelerator (TMA) for asynchronous data transfers, achieving:

  • 3000 GB/s memory bandwidth utilization
  • 580 TFLOPS sustained compute throughput

2. Warp-Specialized Scheduling

Persistent warp specialization overlaps data movement, Tensor Core MMA instructions, and CUDA-core accumulation work within a single kernel, keeping all parts of the SM busy.

3. Non-Aligned Block Optimization

By supporting block sizes that are not powers of two (e.g., a block width of 112 instead of 128), DeepGEMM boosts SM utilization by 22% compared with power-of-two-aligned alternatives; the arithmetic sketch below illustrates why.
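
A rough arithmetic sketch of why an unaligned block width helps, assuming a Hopper part with 132 SMs (the figures are illustrative, not DeepGEMM measurements): each output tile occupies one resident thread block, so the tile count determines how many SMs stay busy.

python
import math

def num_blocks(m, n, block_m, block_n):
    # Number of output tiles (thread blocks) needed to cover an m x n result.
    return math.ceil(m / block_m) * math.ceil(n / block_n)

m, n, num_sms = 256, 7168, 132
print(num_blocks(m, n, 128, 128))   # 112 tiles -> 20 of 132 SMs sit idle
print(num_blocks(m, n, 128, 112))   # 128 tiles -> only 4 SMs sit idle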


Integration & Compatibility

System Requirements (a quick environment check follows the list):

  • GPU: NVIDIA Hopper architecture (sm_90a)
  • Software: CUDA 12.8+, Python 3.8+, PyTorch 2.1+
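
The snippet below is an illustrative way to confirm the environment matches these requirements; it is not part of DeepGEMM.

python
import torch

# Hopper GPUs report compute capability 9.0 (the sm_90a target).
major, minor = torch.cuda.get_device_capability()
assert (major, minor) >= (9, 0), "DeepGEMM targets NVIDIA Hopper (sm_90a) GPUs"
print("CUDA:", torch.version.cuda, "| PyTorch:", torch.__version__)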

Quick Start:

bash
git clone --recursive https://github.com/deepseek-ai/DeepGEMM  
python setup.py install  

Sample API call for MoE inference:

python
import deep_gemm  
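# lhs / rhs: FP8 operands laid out contiguously along the grouped M axis;
# the result is emitted in BF16 (per the fp8_fp8_bf16 naming). Argument
# names below follow the article's example.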
output = deep_gemm.m_grouped_gemm_fp8_fp8_bf16_nt_contiguous(  
    lhs, rhs, m_groups=4, precision='e4m3'  
)  

Competitive Edge

Metric | DeepGEMM | CUTLASS 3.6
Code Complexity | 300 lines (core) | 10,000+ lines
FP8 MoE Support | Native | Requires custom hacks
Deployment Footprint | 15 MB | 150 MB+
License | MIT | BSD 3-Clause

Industry Impact

Validated in production systems at TSMC and Foxconn, DeepGEMM has demonstrated:

  • 72% reduction in semiconductor defect analysis time
  • 40% lower cloud inference costs for MoE models

Explore DeepGEMM: https://github.com/deepseek-ai/DeepGEMM