
Conversation

@denyszhak

Change:
Adds SGLang as a new inference backend.

Testing:

  1. Compiles on macOS, unit tests pass, Docker image builds successfully.
  2. I would love help testing it on Linux with an NVIDIA GPU, since it supports both Docker and host Python installation paths.

Closes #431

@gemini-code-assist
Contributor

Summary of Changes

Hello @denyszhak, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the model-runner by integrating SGLang as a new inference backend. This addition broadens the range of supported model formats, particularly for safetensors, and introduces the necessary infrastructure to manage backends that communicate over TCP, ensuring greater flexibility and compatibility within the system.

Highlights

  • New SGLang Backend: Introduced SGLang as a new inference backend, expanding support for safetensors models.
  • Dockerfile Update: Added a new multi-stage build in the Dockerfile to install SGLang, including Python environment setup.
  • TCP Communication for SGLang: SGLang communicates over TCP, diverging from other backends that use Unix sockets; the scheduler was updated to handle this.
  • Backend Interface Enhancement: Added a UsesTCP() method to the inference.Backend interface to allow backends to declare their communication protocol (see the sketch after this list).
  • Safetensors Model Prioritization: Modified the scheduler to prefer vLLM for safetensors models, with SGLang as a fallback if vLLM is unavailable.
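
A minimal sketch of what that interface change could look like; only UsesTCP() comes from this PR, the other method is a placeholder, not the repository's actual inference.Backend definition:

```golang
package inference

// Backend is sketched with a placeholder method set; the repository's
// real interface has more methods than shown here.
type Backend interface {
	Name() string // placeholder for the existing method set

	// UsesTCP reports whether the backend serves over TCP (as SGLang
	// does) rather than a Unix socket, so the scheduler can pick the
	// matching dialer when connecting to the runner.
	UsesTCP() bool
}
```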
Ignored Files
  • Ignored by pattern: .github/workflows/** (2)
    • .github/workflows/promote-to-latest.yml
    • .github/workflows/release.yml
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

| Feature | Command | Description |
| --- | --- | --- |
| Code Review | `/gemini review` | Performs a code review for the current pull request in its current state. |
| Pull Request Summary | `/gemini summary` | Provides a summary of the current pull request in its current state. |
| Comment | `@gemini-code-assist` | Responds in comments when explicitly tagged, both in pull request comments and review comments. |
| Help | `/gemini help` | Displays a list of available commands. |

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double-check it and use code with caution.

@denyszhak changed the title from "Add sglang backend" to "Add SGLang backend" on Dec 4, 2025
Contributor

@sourcery-ai (bot) left a comment


Hey there - I've reviewed your changes - here's some feedback:

Blocking issues:

  • Detected non-static command inside Command. Audit the input to 'exec.Command'. If unverified user data can reach this call site, this is a code injection vulnerability. A malicious actor can inject a malicious script to execute arbitrary code. (link)
  • Detected non-static command inside Command. Audit the input to 'exec.Command'. If unverified user data can reach this call site, this is a code injection vulnerability. A malicious actor can inject a malicious script to execute arbitrary code. (link)

General comments:

  • The SGLang Docker integration seems inconsistent: initFromDocker detects installation solely via /opt/sglang-env/sglang, but the Dockerfile only sets up a venv and version file without creating that binary, so either the detection logic should rely on the venv/version file or the image should install a wrapper at /opt/sglang-env/sglang to match the runtime expectations.
  • The TCP configuration for SGLang is currently hard-coded to 127.0.0.1 and a fixed base port in runner.go/sglang_config.go; consider feeding the host/port via configuration or environment so multiple instances or non-local setups can avoid port clashes and have more flexible networking.
  • Right now the SGLang backend is always created and registered in main.go regardless of platform, with Run failing at runtime when SupportsSGLang is false; you may want to gate backend construction/registration on platform.SupportsSGLang() to avoid picking an unusable backend on unsupported OSes (see the sketch after this list).
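
A minimal sketch of the gating suggested in the last point; only platform.SupportsSGLang() is named by the review, everything else (the registration helper, constructor, and interface stub) is an illustrative stand-in so the pattern compiles in isolation:

```golang
package main

// Stand-ins for the real pkg/inference and platform packages.
type Backend interface{ UsesTCP() bool }

func supportsSGLang() bool      { return false } // stand-in for platform.SupportsSGLang()
func newSGLangBackend() Backend { return nil }   // stand-in for the sglang constructor

// registerBackends constructs SGLang only where it can actually run, so
// the scheduler never picks a backend whose Run would fail at runtime.
func registerBackends(register func(name string, b Backend)) {
	if supportsSGLang() {
		register("sglang", newSGLangBackend())
	}
}

func main() {
	registerBackends(func(name string, b Backend) { /* add to scheduler */ })
}
```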
Prompt for AI Agents
Please address the comments from this code review:

## Overall Comments
- The SGLang Docker integration seems inconsistent: `initFromDocker` detects installation solely via `/opt/sglang-env/sglang`, but the Dockerfile only sets up a venv and version file without creating that binary, so either the detection logic should rely on the venv/version file or the image should install a wrapper at `/opt/sglang-env/sglang` to match the runtime expectations.
- The TCP configuration for SGLang is currently hard-coded to `127.0.0.1` and a fixed base port in `runner.go`/`sglang_config.go`; consider feeding the host/port via configuration or environment so multiple instances or non-local setups can avoid port clashes and have more flexible networking.
- Right now the SGLang backend is always created and registered in `main.go` regardless of platform, with `Run` failing at runtime when `SupportsSGLang` is false; you may want to gate backend construction/registration on `platform.SupportsSGLang()` to avoid picking an unusable backend on unsupported OSes.

## Individual Comments

### Comment 1
<location> `pkg/inference/backends/sglang/sglang.go:101-110` </location>
<code_context>
+func (s *sglang) initFromDocker() error {
</code_context>

<issue_to_address>
**issue (bug_risk):** Docker-based installation path and `binaryPath` do not align, so the "docker" path is effectively never used

In the Dockerfile we only create a venv at `/opt/sglang-env` and install the `sglang` package, but we never create an executable at `/opt/sglang-env/sglang`. `initFromDocker` uses the presence of `binaryPath()` (`/opt/sglang-env/sglang`) to detect a Docker-provided binary, so in the current image this check always fails with `fs.ErrNotExist` and the code always falls back to `initFromHost`, even in the container. That also means the `sandboxPath` branch in `Run` that depends on `binaryPath()` is never exercised.

To fix this, either point `binaryPath()` to the venv Python (e.g. `/opt/sglang-env/bin/python3`) and keep `-m sglang.launch_server`, or create a launcher at `/opt/sglang-env/sglang` in the Dockerfile so the current detection logic works as intended.
</issue_to_address>
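
A sketch of the first option, under the receiver and method names visible in the diff; the path is the venv interpreter the Dockerfile actually creates:

```golang
package sglang

type sglang struct{ /* fields elided */ }

// binaryPath points at the venv interpreter from the Dockerfile, which
// does exist in the image, so initFromDocker's os.Stat check succeeds
// and Run keeps launching via `-m sglang.launch_server`.
func (s *sglang) binaryPath() string {
	return "/opt/sglang-env/bin/python3"
}
```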

### Comment 2
<location> `pkg/inference/scheduling/loader_test.go:50-51` </location>
<code_context>
 	return m.usesExternalModelMgmt
 }

+func (m *mockBackend) UsesTCP() bool {
+	return false
+}
+
</code_context>

<issue_to_address>
**suggestion (testing):** Add tests that cover the new TCP vs Unix-socket behavior in the runner

Right now the only test usage of `Backend.UsesTCP()` is this mock always returning `false`, so the TCP branch in `runner.run` isn’t exercised. Please add tests (e.g. in `runner_test.go`) that:

- use a backend mock with `UsesTCP() == true` and assert the transport dials `tcp` to `127.0.0.1:<base+slot>`;
- use a backend mock with `UsesTCP() == false` and assert it still dials `unix` using `RunnerSocketPath`;
- cover an error from `RunnerSocketPath` when `UsesTCP()` is false.

This will validate the new connection logic for both TCP-based (SGLang) and existing backends.

Suggested implementation:

```golang
func (m *mockBackend) UsesTCP() bool {
	return m.usesTCP
}

```

To fully implement your review comment you will also need to:

1. Update the `mockBackend` struct definition in this file to include a `usesTCP bool` field:
   - This allows tests to construct backends with either `UsesTCP() == true` or `UsesTCP() == false`.
2. In `runner_test.go` (or the appropriate runner test file), add tests that:
   - Create a backend mock with `usesTCP: true` and assert that the runner dials `tcp` with `127.0.0.1:<base+slot>`.
   - Create a backend mock with `usesTCP: false` and assert that the runner dials `unix` using `RunnerSocketPath`.
   - Use a backend mock with `usesTCP: false` and a `RunnerSocketPath` implementation that returns an error, and assert that the runner surfaces that error.
3. Ensure those tests construct `mockBackend` instances with the new `usesTCP` field set appropriately, and that any existing test helpers/constructors for `mockBackend` are updated to set `usesTCP` to `false` by default to preserve current behavior.

These changes, together with the edit above, will let the tests cover both TCP-based and Unix-socket-based behavior in the runner without altering production code paths.
</issue_to_address>
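
One possible shape for such a test, complementing the suggestion above; dialTarget is a hypothetical seam over the transport selection in runner.run, and the base port and socket path are illustrative assumptions:

```golang
package scheduling

import "testing"

func TestRunnerDialTarget(t *testing.T) {
	tests := []struct {
		name        string
		usesTCP     bool
		slot        int
		wantNetwork string
		wantAddress string
	}{
		// 30000 as the base port and the socket path are assumptions.
		{"tcp backend dials base+slot", true, 2, "tcp", "127.0.0.1:30002"},
		{"unix backend dials RunnerSocketPath", false, 0, "unix", "/run/model-runner/runner.sock"},
	}
	for _, tt := range tests {
		t.Run(tt.name, func(t *testing.T) {
			b := &mockBackend{usesTCP: tt.usesTCP}
			network, address := dialTarget(b, tt.slot)
			if network != tt.wantNetwork || address != tt.wantAddress {
				t.Errorf("dialTarget() = (%q, %q), want (%q, %q)",
					network, address, tt.wantNetwork, tt.wantAddress)
			}
		})
	}
}
```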

### Comment 3
<location> `pkg/inference/backends/sglang/sglang_config_test.go:48-57` </location>
<code_context>
+		expected    []string
+		expectError bool
+	}{
+		{
+			name: "empty safetensors path should error",
+			bundle: &mockModelBundle{
+				safetensorsPath: "",
+			},
+			mode:        inference.BackendModeCompletion,
+			config:      nil,
+			expected:    nil,
+			expectError: true,
+		},
+		{
+			name: "basic args without context size",
</code_context>

<issue_to_address>
**suggestion (testing):** Consider adding a test for invalid socket format to exercise the error path in GetArgs

Right now `TestGetArgs` only covers a valid socket (`"127.0.0.1:30000"`), but `GetArgs` returns an error when `net.SplitHostPort(socket)` fails. Please add a table entry with an invalid socket (e.g. `"127.0.0.1"` or `"30000"`) and wire it so that malformed input is passed into `GetArgs`, asserting `expectError: true` to cover this error path.

Suggested implementation:

```golang
		{
			name: "empty safetensors path should error",
			bundle: &mockModelBundle{
				safetensorsPath: "",
			},
			mode:        inference.BackendModeCompletion,
			config:      nil,
			expected:    nil,
			expectError: true,
		},
		{
			name: "invalid socket format should error",
			bundle: &mockModelBundle{
				safetensorsPath: "/path/to/model/model.safetensors",
			},
			mode: inference.BackendModeCompletion,
			config: &inference.BackendConfiguration{
				// TODO: adjust field name to match the actual socket/host:port field used by GetArgs.
				// This should be set to an invalid value (no port) so that net.SplitHostPort fails.
				Socket: "127.0.0.1",
			},
			expected:    nil,
			expectError: true,
		},
		{
			name: "basic args without context size",

```

1. Adjust the `Socket` field name in the `inference.BackendConfiguration` literal to match your actual configuration type (e.g. it might be `Socket`, `Address`, `ListenAddr`, `GRPCHostPort`, etc.), ensuring this is the field that `GetArgs` passes to `net.SplitHostPort`.
2. Confirm that `TestGetArgs` actually constructs and passes the `config` field from the table entry into `GetArgs`. If not, update the test loop to do so, so that the invalid socket value flows into `GetArgs`.
3. If `GetArgs` requires additional mandatory configuration fields, add them to this test case as needed so that the only failure cause is the malformed socket format.
</issue_to_address>
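
For context on why those inputs exercise the error path: net.SplitHostPort only accepts host:port pairs, so both suggested invalid values fail with a "missing port in address" error. A standalone check:

```golang
package main

import (
	"fmt"
	"net"
)

func main() {
	for _, s := range []string{"127.0.0.1:30000", "127.0.0.1", "30000"} {
		host, port, err := net.SplitHostPort(s)
		// Only the first input parses; the other two return
		// "missing port in address" errors.
		fmt.Printf("%q -> host=%q port=%q err=%v\n", s, host, port, err)
	}
}
```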

### Comment 4
<location> `pkg/inference/backends/sglang/sglang_config_test.go:239` </location>
<code_context>
+	}
+}
+
+func TestGetContextLength(t *testing.T) {
+	tests := []struct {
+		name          string
</code_context>

<issue_to_address>
**suggestion (testing):** Consider basic integration-style tests for the sglang backend Install/Run flows

The configuration helpers (`GetArgs`, `GetContextLength`) are well covered, but the higher-level behavior in `sglang.go` (Docker vs host Python detection, `Status()` contents, and how `Run` chooses between the binary vs `python3 -m sglang.launch_server`) isn’t exercised. It would be valuable to add tests that, using fakes for `platform.SupportsSGLang`, filesystem checks, and `exec.Command`, verify that:

- `Install` sets `status` and errors correctly for unsupported platforms, missing `python3`, or missing the `sglang` package.
- `Run` fails clearly if no binary and no `pythonPath` are available, and that it passes `Args` and `socket` correctly to `backends.RunBackend`.

These can be shallow unit tests over the key branches rather than full end-to-end tests.
</issue_to_address>
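
One way to make those branches testable without full end-to-end runs is to give the backend injectable seams; the struct and field names below are illustrative assumptions, not the PR's code:

```golang
package sglang

import (
	"os"
	"os/exec"
)

// testSeams collects the dependencies Install/Run consult, so unit tests
// can fake platform support, filesystem checks, and python invocations.
type testSeams struct {
	supportsSGLang func() bool                    // fake for platform.SupportsSGLang
	fileExists     func(path string) bool         // fake for os.Stat-based checks
	runPython      func(py, snippet string) error // fake for exec.Command(py, "-c", snippet).Run()
}

func defaultSeams() testSeams {
	return testSeams{
		supportsSGLang: func() bool { return true },
		fileExists: func(path string) bool {
			_, err := os.Stat(path)
			return err == nil
		},
		runPython: func(py, snippet string) error {
			return exec.Command(py, "-c", snippet).Run()
		},
	}
}
```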

### Comment 5
<location> `pkg/inference/backends/sglang/sglang.go:138` </location>
<code_context>
	if err := exec.Command(pythonPath, "-c", "import sglang").Run(); err != nil {
</code_context>

<issue_to_address>
**security (go.lang.security.audit.dangerous-exec-command):** Detected non-static command inside Command. Audit the input to 'exec.Command'. If unverified user data can reach this call site, this is a code injection vulnerability. A malicious actor can inject a malicious script to execute arbitrary code.

*Source: opengrep*
</issue_to_address>

### Comment 6
<location> `pkg/inference/backends/sglang/sglang.go:144` </location>
<code_context>
	output, err := exec.Command(pythonPath, "-c", "import sglang; print(sglang.__version__)").Output()
</code_context>

<issue_to_address>
**security (go.lang.security.audit.dangerous-exec-command):** Detected non-static command inside Command. Audit the input to 'exec.Command'. If unverified user data can reach this call site, this is a code injection vulnerability. A malicious actor can inject a malicious script to execute arbitrary code.

*Source: opengrep*
</issue_to_address>
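
A hedged sketch of the audit these two findings ask for: constrain pythonPath to known install locations before it ever reaches exec.Command. The function is hypothetical and the allowlist entries are assumptions drawn from the paths used elsewhere in this PR:

```golang
package sglang

import "fmt"

// validatePythonPath rejects interpreter paths outside the expected
// install locations, so unverified data cannot reach exec.Command.
func validatePythonPath(pythonPath string) error {
	allowed := map[string]bool{
		"/opt/sglang-env/bin/python3": true, // Docker venv interpreter
		"/usr/bin/python3":            true, // host interpreter
	}
	if !allowed[pythonPath] {
		return fmt.Errorf("sglang: refusing to exec untrusted python path %q", pythonPath)
	}
	return nil
}
```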

Sourcery is free for open source - if you like our reviews please consider sharing them ✨
Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.

Comment on lines +101 to +110
```golang
func (s *sglang) initFromDocker() error {
	sglangBinaryPath := s.binaryPath()

	if _, err := os.Stat(sglangBinaryPath); err != nil {
		return err
	}

	versionBytes, err := os.ReadFile(sglangVersionFile)
	if err != nil {
		s.log.Warnf("could not get sglang version: %v", err)
```

issue (bug_risk): Docker-based installation path and binaryPath do not align, so the "docker" path is effectively never used

In the Dockerfile we only create a venv at /opt/sglang-env and install the sglang package, but we never create an executable at /opt/sglang-env/sglang. initFromDocker uses the presence of binaryPath() (/opt/sglang-env/sglang) to detect a Docker-provided binary, so in the current image this check always fails with fs.ErrNotExist and the code always falls back to initFromHost, even in the container. That also means the sandboxPath branch in Run that depends on binaryPath() is never exercised.

To fix this, either point binaryPath() to the venv Python (e.g. /opt/sglang-env/bin/python3) and keep -m sglang.launch_server, or create a launcher at /opt/sglang-env/sglang in the Dockerfile so the current detection logic works as intended.


```golang
	s.pythonPath = pythonPath

	if err := exec.Command(pythonPath, "-c", "import sglang").Run(); err != nil {
```

security (go.lang.security.audit.dangerous-exec-command): Detected non-static command inside Command. Audit the input to 'exec.Command'. If unverified user data can reach this call site, this is a code injection vulnerability. A malicious actor can inject a malicious script to execute arbitrary code.

Source: opengrep

```golang
		return ErrSGLangNotFound
	}

	output, err := exec.Command(pythonPath, "-c", "import sglang; print(sglang.__version__)").Output()
```

security (go.lang.security.audit.dangerous-exec-command): Detected non-static command inside Command. Audit the input to 'exec.Command'. If unverified user data can reach this call site, this is a code injection vulnerability. A malicious actor can inject a malicious script to execute arbitrary code.

Source: opengrep

@gemini-code-assist (bot) left a comment

Code Review

This pull request introduces sglang as a new inference backend, which is a great addition. The implementation includes creating a new Docker image variant, adding the necessary backend logic, and integrating it with the scheduler. A new UsesTCP method has been added to the Backend interface to support sglang's communication protocol, and this has been implemented across all existing backends.

My review focuses on a few key areas to improve the robustness and security of the new backend:

  • Ensuring proper GPU support in the Dockerfile.
  • Correctly handling unimplemented memory estimation to prevent scheduler issues.
  • Enabling sandboxing for security.
  • Expanding platform support to include macOS as intended.

Overall, the changes are well-structured. Addressing these points will make the sglang backend more robust and secure.

Comment on lines +227 to +230
```golang
	return inference.RequiredMemory{
		RAM:  1,
		VRAM: 1,
	}, nil
```

critical

Returning a sentinel value of 1 for RAM and VRAM is dangerous as it signals to the scheduler that the model requires almost no memory. This can lead to the scheduler overallocating models, causing out-of-memory errors. Since memory estimation is not yet implemented for this backend, it's safer to return an error, similar to how the mlx backend handles it. This will prevent the scheduler from attempting to manage memory for this backend until the feature is implemented.

```golang
	return inference.RequiredMemory{}, ErrNotImplemented
```
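
For reference, the sentinel could live at package level, mirroring the mlx backend's approach the comment cites; the exact name and message here are assumptions:

```golang
package sglang

import "errors"

// ErrNotImplemented signals that memory estimation is not yet available,
// letting the scheduler fail fast instead of trusting a bogus estimate.
var ErrNotImplemented = errors.New("sglang: memory estimation not implemented")
```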

```dockerfile
# Install uv and SGLang as modelrunner user
RUN curl -LsSf https://astral.sh/uv/install.sh | sh \
    && ~/.local/bin/uv venv --python /usr/bin/python3 /opt/sglang-env \
    && ~/.local/bin/uv pip install --python /opt/sglang-env/bin/python "sglang==${SGLANG_VERSION}"
```

high

The current pip install command for sglang will only install the base package, which may not include dependencies for GPU acceleration (like CUDA). To ensure sglang can leverage NVIDIA GPUs, you should install it with the [all] extra, which includes all necessary dependencies for features like FlashAttention and CUDA.

```dockerfile
    && ~/.local/bin/uv pip install --python /opt/sglang-env/bin/python "sglang[all]==${SGLANG_VERSION}"
```

```golang
		return fmt.Errorf("sglang: no docker binary at %s and no python runtime configured; did you forget to call Install?", binaryPath)
	}
	binaryPath = s.pythonPath
	sandboxPath = ""
```

high

By setting sandboxPath = "", you are disabling the sandboxing mechanism for the sglang backend when it runs from a Python installation (which is the case even inside the Docker container). Running model inference without a sandbox can pose a security risk, as it allows the model code to have broader access to the system. To maintain proper isolation, the sandbox should remain enabled. The sglangDir (/opt/sglang-env) can be used as the sandbox root.

@ericcurtin left a comment

LGTM. Have a look through the bot comments to see if they make sense. One NVIDIA/x86_64 test could be helpful, if somebody would like to give @denyszhak a hand.

@doringeman
Contributor

I can test it on Linux with NVIDIA GPU.

@ericcurtin
Contributor

@denyszhak try to make the Sourcery build green if you can; it has a certain tolerance threshold and you have not met it. Consider what it's saying and try to fix a few of the issues. If you feel Sourcery is ultimately being harsh after considering all its comments, let us know.

@doringeman
Contributor

Hey @denyszhak,

I gave this PR a go and I spotted a few things missing:

With my changes, you can now test it (even the CPU-only variant) by using:

  • Terminal 1:

    make docker-run-sglang

    or

    make docker-run-sglang-cuda

  • Terminal 2:

    MODEL_RUNNER_HOST=http://localhost:8080 docker model run ai/smollm2-vllm hi

I noticed it doesn't work, though. I get the same error for both the CPU and CUDA variants.

```console
$ MODEL_RUNNER_HOST=http://localhost:8080 docker model run ai/smollm2-vllm hi
Failed to generate a response: error response: status=500 body=unable to load runner: error waiting for runner to be ready: SGLang terminated unexpectedly: SGLang exit status: exit status 1
with output: Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/opt/sglang-env/lib/python3.12/site-packages/sglang/launch_server.py", line 6, in <module>
    from sglang.srt.server import launch_server
  File "/opt/sglang-env/lib/python3.12/site-packages/sglang/srt/server.py", line 35, in <module>
    import aiohttp
ModuleNotFoundError: No module named 'aiohttp'
```

I hope this helps. Thanks! 🙌

@denyszhak
Author

> Hey @denyszhak, I gave this PR a go and I spotted a few things missing: […]

Thank you! I will go over this soon

@dougyster

@denyszhak I checked out @doringeman's branch (feature/add-sglang-backend) and can confirm that the Docker image builds successfully, the SGLang backend installs correctly, and the server starts fine; I did this on an H200 cluster, running make docker-run-sglang-cuda in one terminal.

However, when I tried running a model with the following command in a separate terminal:

```sh
curl -X POST http://localhost:8080/engines/sglang/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "ai/smollm2", "messages": [{"role": "user", "content": "hi"}], "max_tokens": 20}'
```

but got the following message:

```console
unable to load runner: error waiting for runner to be ready: failed to get SGLang arguments: safetensors path required by SGLang backend
```

It seems the SGLang backend only supports safetensors-formatted models, and the test model is in GGUF format (for llama) - is there a compatible model that I should use for testing?

@ericcurtin
Contributor

@dougyster welcome! The DMR way would be to use the models from here with the -vllm suffix:

https://hub.docker.com/u/ai?page=1&search=-vllm

They are the safetensors ones.

@dougyster

> @dougyster welcome! The DMR way would be to use the models from here with the -vllm suffix […]

Awesome! Will try this out and thanks for the warm welcome :)

@ericcurtin
Contributor

ericcurtin commented Dec 4, 2025

When this is tested end to end, if you have the SGLang backend selected, it would be just:

```console
$ docker model run ai/smollm2-vllm
>
```

and you should have a chatbot that can talk to SGLang. (The daemon and client in this codebase are called model-runner and model-cli, so for testing we often run ./model-cli run ai/smollm2-vllm.)

