Conversation

@JamesBrianD (Collaborator) commented Dec 2, 2025

Fix SGL-101

Motivation

Fused MoE Pallas Kernel
Integrate the fused MoE Pallas kernel to improve MoE performance on TPU.

Modifications

  • Adopt the fused MoE kernel from the TPU inference implementation (vllm-project/tpu-inference); a plain-JAX reference of the computation being fused follows this list.
  • Add the fused MoE layer implementation.
  • Add fused-MoE support/adaptation for MoE models.
  • Add benchmarks.
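
For context, here is a minimal, unfused reference of the MoE forward pass that the Pallas kernel fuses (top-k routing, per-expert gated MLP, weighted combine). This is an illustrative sketch only; tensor names, shapes, and the activation ordering are assumptions and do not reflect the kernel's actual signature.

# Unfused MoE reference (illustrative; not the kernel's actual API).
import jax
import jax.numpy as jnp

def moe_reference(x, router_w, w1, w2, top_k=2):
    # x:        [tokens, hidden]
    # router_w: [hidden, num_experts]
    # w1:       [num_experts, hidden, 2 * intermediate]  (gate_proj | up_proj)
    # w2:       [num_experts, intermediate, hidden]
    probs = jax.nn.softmax(x @ router_w, axis=-1)
    weights, experts = jax.lax.top_k(probs, top_k)           # [tokens, top_k]
    weights = weights / jnp.sum(weights, axis=-1, keepdims=True)

    out = jnp.zeros_like(x)
    for k in range(top_k):
        idx = experts[:, k]                                   # expert id per token
        gate_up = jnp.einsum("th,thi->ti", x, w1[idx])        # gather expert weights, project
        gate, up = jnp.split(gate_up, 2, axis=-1)
        y = jnp.einsum("ti,tih->th", jax.nn.silu(gate) * up, w2[idx])
        out = out + weights[:, k, None] * y
    return out

The naive version above materializes per-token expert weights via gathers; a fused kernel typically avoids that by grouping tokens per expert and tiling the matmuls on-chip, which is the kind of work the Pallas kernel is meant to take over.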

Accuracy Tests

Run fused_moe and epmoe against a 1-layer Grok-2 model; the two responses below return identical output_ids for the same prompt (the decoded text is meaningless because only one layer is loaded).

 curl -X POST 'http://127.0.0.1:30000/generate' -d '{"sampling_params": {"max_new_tokens": 10}, "text": "the capital of France is"}' -H 'Content-Type: application/json'
{"text":" Steady Borrow Clip McKe Steady Borrow Clip McKe Steady Borrow","output_ids":[118126,87778,63731,116174,118126,87778,63731,116174,118126,87778],"meta_info":{"id":"2401c3804bc74f1b87212916dfde5a81","finish_reason":{"type":"length","length":10},"prompt_tokens":24,"completion_tokens":10,"cached_tokens":0,"cache_miss_count":0,"e2e_latency":0.0681540966033935

 curl -X POST 'http://127.0.0.1:30000/generate' -d '{"sampling_params": {"max_new_tokens": 10}, "text": "the capital of France is"}' -H 'Content-Type: application/json'
{"text":" Steady Borrow Clip McKe Steady Borrow Clip McKe Steady Borrow","output_ids":[118126,87778,63731,116174,118126,87778,63731,116174,118126,87778],"meta_info":{"id":"d7fadfa226a34309bfb038bdd68328c9","finish_reason":{"type":"length","length":10},"prompt_tokens":24,"completion_tokens":10,"cached_tokens":0,"cache_miss_count":0,"e2e_latency":0.06584501266479492}}

JAX_COMPILATION_CACHE_DIR=/tmp/jit_cache python3 -u -m sgl_jax.launch_server --model-path Qwen/Qwen3-30B-A3B --trust-remote-code --dist-init-addr=0.0.0.0:10011 --nnodes=1 --tp-size=4 --ep-size=4 --device=tpu --random-seed=3 --node-rank=0 --mem-fraction-static=0.85 --chunked-prefill-size=2048 --download-dir=/dev/shm --dtype=bfloat16 --max-running-requests 256 --skip-server-warmup --page-size=128

(screenshot of the Qwen3-30B-A3B eval result)
# Grok2
evalscope eval  --model /models/xai-grok-2 --api-url http://127.0.0.1:30000/v1/chat/completions --api-key EMPTY --eval-type openai_api --datasets gsm8k --eval-batch-size 256
+------------+-----------+----------+----------+-------+---------+---------+
| Model      | Dataset   | Metric   | Subset   |   Num |   Score | Cat.0   |
+============+===========+==========+==========+=======+=========+=========+
| xai-grok-2 | gsm8k     | mean_acc | main     |  1319 |  0.9613 | default |
+------------+-----------+----------+----------+-------+---------+---------+

Benchmarking and Profiling

Baseline (PR 349, pure EP)

============ Serving Benchmark Result ============
Backend:                                 sgl-jax
Traffic request rate:                    inf
Max request concurrency:                 16
Successful requests:                     128
Benchmark duration (s):                  116.44
Total input tokens:                      131072
Total generated tokens:                  131072
Total generated tokens (retokenized):    131025
Request throughput (req/s):              1.10
Input token throughput (tok/s):          1125.67
Output token throughput (tok/s):         1125.67
Total token throughput (tok/s):          2251.34
Concurrency:                             16.00
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   14550.81
Median E2E Latency (ms):                 14575.58
---------------Time to First Token----------------
Mean TTFT (ms):                          611.44
Median TTFT (ms):                        614.03
P99 TTFT (ms):                           1081.60
---------------Inter-Token Latency----------------
Mean ITL (ms):                           13.63
Median ITL (ms):                         13.17
P95 ITL (ms):                            13.70
P99 ITL (ms):                            13.91
Max ITL (ms):                            1009.14
==================================================

Fused MoE with EP=4

============ Serving Benchmark Result ============
Backend:                                 sgl-jax
Traffic request rate:                    inf
Max request concurrency:                 16
Successful requests:                     128
Benchmark duration (s):                  101.19
Total input tokens:                      131072
Total generated tokens:                  131072
Total generated tokens (retokenized):    131013
Request throughput (req/s):              1.26
Input token throughput (tok/s):          1295.30
Output token throughput (tok/s):         1295.30
Total token throughput (tok/s):          2590.59
Concurrency:                             16.00
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   12645.18
Median E2E Latency (ms):                 12649.56
---------------Time to First Token----------------
Mean TTFT (ms):                          441.50
Median TTFT (ms):                        442.33
P99 TTFT (ms):                           779.29
---------------Inter-Token Latency----------------
Mean ITL (ms):                           11.93
Median ITL (ms):                         11.60
P95 ITL (ms):                            12.17
P99 ITL (ms):                            12.38
Max ITL (ms):                            722.26
==================================================
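
Summarizing the two runs above: fused MoE with EP=4 lifts total token throughput by ~15% (2590.59 vs 2251.34 tok/s), reduces mean TTFT by ~28% (441.50 vs 611.44 ms), mean E2E latency by ~13% (12645.18 vs 14550.81 ms), and mean ITL by ~12% (11.93 vs 13.63 ms) relative to the pure-EP baseline.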

Checklist

  • Please use English, otherwise it will be closed.
  • The purpose of the PR, or link existing issues this PR will resolve.
  • The test plan, such as providing test command.
  • (Optional) The necessary documentation update.

@gemini-code-assist

Summary of Changes

Hello @JamesBrianD, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a highly optimized, fused Mixture of Experts (MoE) implementation specifically designed for JAX on TPUs. It provides a new configuration option to select between the existing Expert Parallel MoE and this new fused kernel, which integrates multiple MoE operations into a single, efficient computation. The changes include the core kernel, a wrapper layer, and modifications to the Grok model and weight loading utility to support this new backend, aiming to significantly improve performance for MoE models on TPU hardware.

Highlights

  • New MoE Backend Configuration: A new MoEBackend enum is introduced in model_config.py to allow selection between 'epmoe' (native Expert Parallel MoE), 'fused' (TPU-optimized fused kernel), and 'auto' (automatic selection based on ep_size).
  • Fused MoE Kernel Implementation: A new file kernel.py is added, containing a TPU-friendly fused Mixture of Experts (MoE) kernel. This kernel is adapted from vllm-project/tpu-inference and uses JAX Pallas for optimized execution on TPUs, handling Top-K selection, expert computation, and aggregation efficiently.
  • Fused MoE Layer Integration: A new FusedEPMoE layer is added in fused_moe.py. This layer wraps the new fused kernel, providing an interface for model integration and including logic for auto-selecting tile sizes based on model dimensions for optimal TPU performance.
  • Grok Model Adaptation: The GrokMoEBlock in grok.py is updated to conditionally use either the existing EPMoE layer or the new FusedEPMoE layer based on the moe_backend configuration. This involves changes to how MoE weights are mapped and processed during loading.
  • Weight Loading Enhancements: The WeightLoader in weight_utils.py is modified to support the new fused MoE weight format. It now includes logic to fuse gate_proj and up_proj weights into a single w1 parameter for the FusedEPMoE layer (a hedged sketch of this fusion follows the list).
  • Unit Tests for Fused MoE: New test files fused_moe_v1_test.py and test_fused_moe.py are added to validate the correctness and functionality of the fused MoE kernel and layer, covering various configurations, activations, and quantization schemes.
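
To make the weight-fusion point concrete, here is a minimal sketch of what concatenating per-expert gate_proj and up_proj into a single w1 tensor could look like. The function name, tensor layout, and axis order are assumptions for illustration; the PR's WeightLoader may arrange these differently.

# Hypothetical helper for fusing gate/up projections into w1 (illustrative only).
import jax.numpy as jnp

def fuse_gate_up(gate_proj, up_proj):
    # gate_proj, up_proj: [num_experts, hidden, intermediate]
    # returns w1:         [num_experts, hidden, 2 * intermediate]
    assert gate_proj.shape == up_proj.shape
    return jnp.concatenate([gate_proj, up_proj], axis=-1)

Fusing the two projections lets a single matmul per expert cover the gate/up stage, with the result split before the activation; this matches the w1 layout assumed in the reference sketch earlier in this page.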

@JamesBrianD changed the title from "fused moe layer, edit grok" to "Integrate Fused Moe Kernel, Edit Grok" on Dec 2, 2025
@JamesBrianD linked an issue on Dec 2, 2025 that may be closed by this pull request
@JamesBrianD force-pushed the integrate-fused-moe branch 2 times, most recently from 21d6567 to 1f2130c on December 2, 2025 10:51
@JamesBrianD requested a review from Prayer3th on December 2, 2025 11:47
@JamesBrianD force-pushed the integrate-fused-moe branch 2 times, most recently from 9c60856 to 9281a57 on December 2, 2025 13:31
@JamesBrianD changed the title from "Integrate Fused Moe Kernel, Edit Grok" to "Integrate Fused Moe Kernel, Edit Grok, Qwen-MOE and Ling" on Dec 2, 2025
@JamesBrianD force-pushed the integrate-fused-moe branch 10 times, most recently from b07af1c to 43f8198 on December 5, 2025 10:58
@JamesBrianD force-pushed the integrate-fused-moe branch 7 times, most recently from 57a995b to 03ce0a3 on December 8, 2025 08:30
@JamesBrianD force-pushed the integrate-fused-moe branch 9 times, most recently from 0cd8d47 to 675be04 on December 9, 2025 11:30
@JamesBrianD force-pushed the integrate-fused-moe branch 18 times, most recently from 788ebc4 to b4a2aa6 on December 11, 2025 06:20

Development

Successfully merging this pull request may close these issues.

[Feature] Integrate fused moe kernel
