MoE Quantization Calibration Pipeline

PHASE 0: Quantize & Evaluate (GPTQ W4A16 · TRT-LLM)

→

PHASE 1: Analyze Expert Routing (token → expert mapping)

→

PHASE 2: Extract Text Patterns (Nemotron-3-Super)

→

PHASE 3: Generate Synthetic Data (NVIDIA Data Designer)

↺ Repeat phases 1→3 until expert activation is balanced
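
The repeat step can be read as a simple control loop. The sketch below is only an illustration under stated assumptions: `calibrate_until_balanced`, the three callables, the `target` threshold, and `max_rounds` are hypothetical names that do not come from the pipeline itself, and the toy stand-ins operate on per-expert counts rather than on real text.

```python
def calibrate_until_balanced(dataset, analyze, extract, generate,
                             target=0.9, max_rounds=5):
    """Repeat phases 1 to 3 until the balance score reaches `target`.

    All three callables are hypothetical stand-ins:
      analyze(dataset)     -> balance score in [0, 1]   (Phase 1)
      extract(dataset)     -> guidelines                (Phase 2)
      generate(guidelines) -> new calibration dataset   (Phase 3)
    """
    history = []
    for _ in range(max_rounds):
        balance = analyze(dataset)
        history.append(balance)
        if balance >= target:
            break  # balanced enough, stop iterating
        guidelines = extract(dataset)
        dataset = generate(guidelines)
    return dataset, history


# Toy stand-ins: the "dataset" is just a list of per-expert activation counts.
def toy_analyze(counts):
    return min(counts) / max(counts)

def toy_extract(counts):
    # "Guidelines" here simply name the experts that lag behind the busiest one.
    boost = [i for i, v in enumerate(counts) if v < max(counts)]
    return {"counts": counts, "boost": boost}

def toy_generate(guidelines):
    # Pretend the new data adds 3 activations to every lagging expert.
    return [v + 3 if i in guidelines["boost"] else v
            for i, v in enumerate(guidelines["counts"])]

final_counts, history = calibrate_until_balanced(
    [8, 2, 2], toy_analyze, toy_extract, toy_generate)
print(final_counts, history)
```

Capping the number of rounds is a design choice made here so the loop terminates even if the balance target is never reached.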

Quantize the initial model and evaluate it on benchmark tasks

IN: Base model checkpoint
OUT: Quantized checkpoint, benchmark results

QUANTIZE: GPTQ_W4A16 · CALIB_SIZE: 128

NeMo Evaluator Benchmark Results

GSM8K: 67.25%
GPQA Diamond: 22.73%
MMLU-Pro: 38.29%

Analyze expert-activation statistics on the calibration data

IN: Calibration dataset (D0_128), base model weights
OUT: Token-expert routing analysis
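
One way to turn a token → expert mapping into a balance figure is sketched below. It assumes the routing decisions have already been recorded as one tuple of expert ids per token. The function name `expert_activation_stats` and the choice of normalized entropy as the balance score are assumptions for illustration, not the pipeline's documented metric.

```python
import math
from collections import Counter

def expert_activation_stats(routing, num_experts):
    """Compute per-expert activation shares and a balance score.

    routing: list with one tuple per token, holding the expert ids
             selected for that token (assumed input format).
    Returns (shares, balance), where balance is the entropy of the
    shares divided by log(num_experts): 1.0 means perfectly uniform.
    """
    counts = Counter()
    for experts in routing:
        counts.update(experts)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("routing contains no expert activations")
    shares = [counts.get(e, 0) / total for e in range(num_experts)]
    entropy = -sum(p * math.log(p) for p in shares if p > 0)
    balance = entropy / math.log(num_experts) if num_experts > 1 else 1.0
    return shares, balance

# Toy example: 4 tokens, top-2 routing over 4 experts; expert 0 is always chosen.
routing = [(0, 1), (0, 2), (0, 1), (0, 3)]
shares, balance = expert_activation_stats(routing, num_experts=4)
print(shares)   # [0.5, 0.25, 0.125, 0.125]
print(balance)
```

Normalizing by log(num_experts) keeps the score comparable across models with different expert counts.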

Extract the text patterns that cause experts to be activated frequently or rarely

IN: Annotated token samples, Nemotron-3-Super (vLLM)
OUT: Per-domain guidelines
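
Before a language model can describe the text patterns, the frequently and rarely activated experts have to be singled out and paired with example tokens. The sketch below shows one plausible way to do that. The helper names and the `low`/`high` thresholds (multiples of the uniform share) are hypothetical, and the step that prompts Nemotron-3-Super is deliberately left out.

```python
from collections import defaultdict

def split_experts_by_share(shares, low=0.75, high=1.5):
    """Label experts as scarce or frequent relative to a uniform share.

    low/high are assumed thresholds, expressed as multiples of 1/num_experts.
    """
    uniform = 1.0 / len(shares)
    scarce = [e for e, s in enumerate(shares) if s < low * uniform]
    frequent = [e for e, s in enumerate(shares) if s > high * uniform]
    return scarce, frequent

def samples_for_experts(tokens, routing, expert_ids):
    """Collect the tokens that were routed to each expert in expert_ids."""
    wanted = set(expert_ids)
    samples = defaultdict(list)
    for token, experts in zip(tokens, routing):
        for e in experts:
            if e in wanted:
                samples[e].append(token)
    return dict(samples)

# Toy data: four tokens with top-2 routing over four experts.
tokens = ["def", "42", "return", "theorem"]
routing = [(0, 1), (0, 2), (0, 1), (0, 3)]
shares = [0.5, 0.25, 0.125, 0.125]

scarce, frequent = split_experts_by_share(shares)
scarce_samples = samples_for_experts(tokens, routing, scarce)
print(scarce, frequent)
print(scarce_samples)
```

The per-expert samples gathered this way are the kind of annotated input that could then be summarized into per-domain guidelines.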

Generate a synthetic calibration dataset that achieves balanced expert activation

IN: Per-domain guidelines
OUT: Synthetic calibration dataset
