Conversation

@JamesBrianD (Collaborator) commented Dec 2, 2025

Fix SGL-101

Motivation

Fused MoE Pallas Kernel
Integrate the fused MoE Pallas kernel to improve MoE performance on TPU.

Modifications

  • Adopt the fused MoE kernel from the TPU inference implementation (vllm-project/tpu-inference); a plain-JAX reference of the computation being fused follows this list.
  • Add the fused MoE layer implementation.
  • Add fused-MoE support/adaptation for MoE models.
  • Add benchmarks.
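
For context, here is a minimal, unfused reference of the MoE forward pass that the Pallas kernel fuses (top-k routing, per-expert gated MLP, weighted combine). This is an illustrative sketch only; tensor names, shapes, and the activation ordering are assumptions and do not reflect the kernel's actual signature.

# Unfused MoE reference (illustrative; not the kernel's actual API).
import jax
import jax.numpy as jnp

def moe_reference(x, router_w, w1, w2, top_k=2):
    # x:        [tokens, hidden]
    # router_w: [hidden, num_experts]
    # w1:       [num_experts, hidden, 2 * intermediate]  (gate_proj | up_proj)
    # w2:       [num_experts, intermediate, hidden]
    probs = jax.nn.softmax(x @ router_w, axis=-1)
    weights, experts = jax.lax.top_k(probs, top_k)           # [tokens, top_k]
    weights = weights / jnp.sum(weights, axis=-1, keepdims=True)

    out = jnp.zeros_like(x)
    for k in range(top_k):
        idx = experts[:, k]                                   # expert id per token
        gate_up = jnp.einsum("th,thi->ti", x, w1[idx])        # gather expert weights, project
        gate, up = jnp.split(gate_up, 2, axis=-1)
        y = jnp.einsum("ti,tih->th", jax.nn.silu(gate) * up, w2[idx])
        out = out + weights[:, k, None] * y
    return out

The naive version above materializes per-token expert weights via gathers; a fused kernel typically avoids that by grouping tokens per expert and tiling the matmuls on-chip, which is the kind of work the Pallas kernel is meant to take over.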

Accuracy Tests

Run fused_moe and epmoe against a 1-layer Grok-2 model; the two responses below return identical output_ids for the same prompt (the decoded text is meaningless because only one layer is loaded).

 curl -X POST 'http://127.0.0.1:30000/generate' -d '{"sampling_params": {"max_new_tokens": 10}, "text": "the capital of France is"}' -H 'Content-Type: application/json'
{"text":" Steady Borrow Clip McKe Steady Borrow Clip McKe Steady Borrow","output_ids":[118126,87778,63731,116174,118126,87778,63731,116174,118126,87778],"meta_info":{"id":"2401c3804bc74f1b87212916dfde5a81","finish_reason":{"type":"length","length":10},"prompt_tokens":24,"completion_tokens":10,"cached_tokens":0,"cache_miss_count":0,"e2e_latency":0.0681540966033935

 curl -X POST 'http://127.0.0.1:30000/generate' -d '{"sampling_params": {"max_new_tokens": 10}, "text": "the capital of France is"}' -H 'Content-Type: application/json'
{"text":" Steady Borrow Clip McKe Steady Borrow Clip McKe Steady Borrow","output_ids":[118126,87778,63731,116174,118126,87778,63731,116174,118126,87778],"meta_info":{"id":"d7fadfa226a34309bfb038bdd68328c9","finish_reason":{"type":"length","length":10},"prompt_tokens":24,"completion_tokens":10,"cached_tokens":0,"cache_miss_count":0,"e2e_latency":0.06584501266479492}}

JAX_COMPILATION_CACHE_DIR=/tmp/jit_cache python3 -u -m sgl_jax.launch_server --model-path Qwen/Qwen3-30B-A3B --trust-remote-code --dist-init-addr=0.0.0.0:10011 --nnodes=1 --tp-size=4 --ep-size=4 --device=tpu --random-seed=3 --node-rank=0 --mem-fraction-static=0.85 --chunked-prefill-size=2048 --download-dir=/dev/shm --dtype=bfloat16 --max-running-requests 256 --skip-server-warmup --page-size=128

(screenshot of the Qwen3-30B-A3B eval result)
# Grok2
evalscope eval  --model /models/xai-grok-2 --api-url http://127.0.0.1:30000/v1/chat/completions --api-key EMPTY --eval-type openai_api --datasets gsm8k --eval-batch-size 256
+------------+-----------+----------+----------+-------+---------+---------+
| Model      | Dataset   | Metric   | Subset   |   Num |   Score | Cat.0   |
+============+===========+==========+==========+=======+=========+=========+
| xai-grok-2 | gsm8k     | mean_acc | main     |  1319 |  0.9613 | default |
+------------+-----------+----------+----------+-------+---------+---------+

Benchmarking and Profiling

Baseline (PR 349, pure EP)

============ Serving Benchmark Result ============
Backend:                                 sgl-jax
Traffic request rate:                    inf
Max request concurrency:                 16
Successful requests:                     128
Benchmark duration (s):                  116.44
Total input tokens:                      131072
Total generated tokens:                  131072
Total generated tokens (retokenized):    131025
Request throughput (req/s):              1.10
Input token throughput (tok/s):          1125.67
Output token throughput (tok/s):         1125.67
Total token throughput (tok/s):          2251.34
Concurrency:                             16.00
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   14550.81
Median E2E Latency (ms):                 14575.58
---------------Time to First Token----------------
Mean TTFT (ms):                          611.44
Median TTFT (ms):                        614.03
P99 TTFT (ms):                           1081.60
---------------Inter-Token Latency----------------
Mean ITL (ms):                           13.63
Median ITL (ms):                         13.17
P95 ITL (ms):                            13.70
P99 ITL (ms):                            13.91
Max ITL (ms):                            1009.14
==================================================

Fused MoE with EP=4

============ Serving Benchmark Result ============
Backend:                                 sgl-jax
Traffic request rate:                    inf
Max request concurrency:                 16
Successful requests:                     128
Benchmark duration (s):                  101.19
Total input tokens:                      131072
Total generated tokens:                  131072
Total generated tokens (retokenized):    131013
Request throughput (req/s):              1.26
Input token throughput (tok/s):          1295.30
Output token throughput (tok/s):         1295.30
Total token throughput (tok/s):          2590.59
Concurrency:                             16.00
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   12645.18
Median E2E Latency (ms):                 12649.56
---------------Time to First Token----------------
Mean TTFT (ms):                          441.50
Median TTFT (ms):                        442.33
P99 TTFT (ms):                           779.29
---------------Inter-Token Latency----------------
Mean ITL (ms):                           11.93
Median ITL (ms):                         11.60
P95 ITL (ms):                            12.17
P99 ITL (ms):                            12.38
Max ITL (ms):                            722.26
==================================================
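
Summarizing the two runs above: fused MoE with EP=4 lifts total token throughput by ~15% (2590.59 vs 2251.34 tok/s), reduces mean TTFT by ~28% (441.50 vs 611.44 ms), mean E2E latency by ~13% (12645.18 vs 14550.81 ms), and mean ITL by ~12% (11.93 vs 13.63 ms) relative to the pure-EP baseline.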

Checklist

  • Please use English, otherwise it will be closed.
  • The purpose of the PR, or link existing issues this PR will resolve.
  • The test plan, such as providing test command.
  • (Optional) The necessary documentation update.

@gemini-code-assist

Summary of Changes

Hello @JamesBrianD, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a highly optimized, fused Mixture of Experts (MoE) implementation specifically designed for JAX on TPUs. It provides a new configuration option to select between the existing Expert Parallel MoE and this new fused kernel, which integrates multiple MoE operations into a single, efficient computation. The changes include the core kernel, a wrapper layer, and modifications to the Grok model and weight loading utility to support this new backend, aiming to significantly improve performance for MoE models on TPU hardware.

Highlights

  • New MoE Backend Configuration: A new MoEBackend enum is introduced in model_config.py to allow selection between 'epmoe' (native Expert Parallel MoE), 'fused' (TPU-optimized fused kernel), and 'auto' (automatic selection based on ep_size).
  • Fused MoE Kernel Implementation: A new file kernel.py is added, containing a TPU-friendly fused Mixture of Experts (MoE) kernel. This kernel is adapted from vllm-project/tpu-inference and uses JAX Pallas for optimized execution on TPUs, handling Top-K selection, expert computation, and aggregation efficiently.
  • Fused MoE Layer Integration: A new FusedEPMoE layer is added in fused_moe.py. This layer wraps the new fused kernel, providing an interface for model integration and including logic for auto-selecting tile sizes based on model dimensions for optimal TPU performance.
  • Grok Model Adaptation: The GrokMoEBlock in grok.py is updated to conditionally use either the existing EPMoE layer or the new FusedEPMoE layer based on the moe_backend configuration. This involves changes to how MoE weights are mapped and processed during loading.
  • Weight Loading Enhancements: The WeightLoader in weight_utils.py is modified to support the new fused MoE weight format. It now includes logic to fuse gate_proj and up_proj weights into a single w1 parameter for the FusedEPMoE layer (a hedged sketch of this fusion follows the list).
  • Unit Tests for Fused MoE: New test files fused_moe_v1_test.py and test_fused_moe.py are added to validate the correctness and functionality of the fused MoE kernel and layer, covering various configurations, activations, and quantization schemes.
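
To make the weight-fusion point concrete, here is a minimal sketch of what concatenating per-expert gate_proj and up_proj into a single w1 tensor could look like. The function name, tensor layout, and axis order are assumptions for illustration; the PR's WeightLoader may arrange these differently.

# Hypothetical helper for fusing gate/up projections into w1 (illustrative only).
import jax.numpy as jnp

def fuse_gate_up(gate_proj, up_proj):
    # gate_proj, up_proj: [num_experts, hidden, intermediate]
    # returns w1:         [num_experts, hidden, 2 * intermediate]
    assert gate_proj.shape == up_proj.shape
    return jnp.concatenate([gate_proj, up_proj], axis=-1)

Fusing the two projections lets a single matmul per expert cover the gate/up stage, with the result split before the activation; this matches the w1 layout assumed in the reference sketch earlier in this page.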

@JamesBrianD changed the title from "fused moe layer, edit grok" to "Integrate Fused Moe Kernel, Edit Grok" on Dec 2, 2025
@JamesBrianD linked an issue on Dec 2, 2025 that may be closed by this pull request
@JamesBrianD force-pushed the integrate-fused-moe branch 2 times, most recently from 21d6567 to 1f2130c on December 2, 2025 10:51
@JamesBrianD requested a review from Prayer3th on December 2, 2025 11:47
@JamesBrianD force-pushed the integrate-fused-moe branch 2 times, most recently from 9c60856 to 9281a57 on December 2, 2025 13:31
@JamesBrianD changed the title from "Integrate Fused Moe Kernel, Edit Grok" to "Integrate Fused Moe Kernel, Edit Grok, Qwen-MOE and Ling" on Dec 2, 2025
@JamesBrianD force-pushed the integrate-fused-moe branch 10 times, most recently from b07af1c to 43f8198 on December 5, 2025 10:58
@JamesBrianD force-pushed the integrate-fused-moe branch 7 times, most recently from 57a995b to 03ce0a3 on December 8, 2025 08:30
@JamesBrianD force-pushed the integrate-fused-moe branch 9 times, most recently from 0cd8d47 to 675be04 on December 9, 2025 11:30
@JamesBrianD force-pushed the integrate-fused-moe branch 18 times, most recently from 788ebc4 to b4a2aa6 on December 11, 2025 06:20

Development

Successfully merging this pull request may close these issues.

[Feature] Integrate fused moe kernel
