Integrate Fused MoE Kernel, Edit Grok, Qwen-MoE and Ling #511
Summary of Changes (Gemini Code Assist)
This pull request introduces an optimized, fused Mixture of Experts (MoE) implementation designed for JAX on TPUs. It adds a configuration option for selecting between the existing Expert Parallel MoE and the new fused kernel, which combines multiple MoE operations into a single, efficient computation. The changes include the core kernel, a wrapper layer, and modifications to the Grok model and the weight-loading utility to support the new backend, with the aim of significantly improving the performance of MoE models on TPU hardware.
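As an illustration of that configuration switch, here is a minimal hypothetical sketch; the option name, values, and wiring below are assumptions and may not match the PR's actual code.

```python
# Hypothetical sketch of a backend switch; names are illustrative, not the PR's API.
from dataclasses import dataclass

@dataclass
class MoEConfig:
    # "epmoe": existing expert-parallel MoE path
    # "fused_moe": new fused Pallas kernel path
    moe_backend: str = "epmoe"

    def __post_init__(self):
        allowed = {"epmoe", "fused_moe"}
        if self.moe_backend not in allowed:
            raise ValueError(f"moe_backend must be one of {allowed}, got {self.moe_backend!r}")

# Example: request the fused kernel for a model's MoE layers.
cfg = MoEConfig(moe_backend="fused_moe")
```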
Fix SGL-101
Motivation
Fused MoE Pallas Kernel
Integrate the fused MoE Pallas kernel to gain performance improvements for MoE models on TPU; a reference sketch of the work the kernel fuses is shown below.
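For context, the sketch below is a minimal unfused reference MoE written in plain JAX, not the Pallas kernel from this PR; it spells out the routing, top-k dispatch, per-expert MLP, and weighted-combine steps that the fused kernel performs in a single launch. All names and shapes are illustrative.

```python
import jax
import jax.numpy as jnp

def reference_moe(x, router_w, w1, w2, top_k=2):
    """Unfused reference MoE.

    x: [T, H] tokens, router_w: [H, E], w1: [E, H, F], w2: [E, F, H].
    """
    logits = x @ router_w                              # [T, E] router scores
    weights, experts = jax.lax.top_k(logits, top_k)    # [T, K] selected experts
    weights = jax.nn.softmax(weights, axis=-1)         # normalize routing weights

    def one_token(xt, wt, et):
        h = jnp.einsum('h,khf->kf', xt, w1[et])        # up-projection per selected expert
        h = jax.nn.gelu(h)
        y = jnp.einsum('kf,kfh->kh', h, w2[et])        # down-projection per selected expert
        return jnp.einsum('k,kh->h', wt, y)            # weighted combine of expert outputs

    return jax.vmap(one_token)(x, weights, experts)

# Tiny smoke test with illustrative shapes.
T, H, F, E = 8, 64, 256, 4
key = jax.random.PRNGKey(0)
x = jax.random.normal(key, (T, H))
router_w = jax.random.normal(key, (H, E))
w1 = jax.random.normal(key, (E, H, F))
w2 = jax.random.normal(key, (E, F, H))
print(reference_moe(x, router_w, w1, w2).shape)        # (8, 64)
```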
Modifications
Accuracy Tests
Use fused_moe and epmoe in a 1-layer grok2 model.

JAX_COMPILATION_CACHE_DIR=/tmp/jit_cache python3 -u -m sgl_jax.launch_server --model-path Qwen/Qwen3-30B-A3B --trust-remote-code --dist-init-addr=0.0.0.0:10011 --nnodes=1 --tp-size=4 --ep-size=4 --device=tpu --random-seed=3 --node-rank=0 --mem-fraction-static=0.85 --chunked-prefill-size=2048 --download-dir=/dev/shm --dtype=bfloat16 --max-running-requests 256 --skip-server-warmup --page-size=128
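A simple way to check that the two backends agree numerically is to feed identical inputs through both and compare the outputs within a bfloat16-appropriate tolerance; the helper below is a hypothetical sketch, not part of this PR.

```python
import jax.numpy as jnp

def outputs_match(out_fused, out_ep, rtol=2e-2, atol=2e-2):
    # Compare in float32; bfloat16 kernels generally need looser tolerances.
    return bool(jnp.allclose(out_fused.astype(jnp.float32),
                             out_ep.astype(jnp.float32),
                             rtol=rtol, atol=atol))
```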
Benchmarking and Profiling
Baseline (PR 349, pure EP)
Fused MoE with EP=4
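For reproducing latency comparisons like the above, a generic micro-benchmark helper can time a jitted forward pass on TPU; the sketch below is an assumption-laden illustration, not the harness used to produce these results.

```python
import time
import jax

def time_forward(fn, *args, warmup=3, iters=20):
    """Average wall-clock seconds per call of an already-jitted fn."""
    for _ in range(warmup):
        jax.block_until_ready(fn(*args))   # warm up / compile outside the timed loop
    t0 = time.perf_counter()
    for _ in range(iters):
        jax.block_until_ready(fn(*args))   # sync so device work is fully counted
    return (time.perf_counter() - t0) / iters
```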
Checklist