
Feature Request: Allow EPP to dynamically discover ports based on matching Pods #1965

@delavet


Background

Currently, EPP uses Pods with fixed ports as the basic unit of request scheduling. This approach does not adapt well to scenarios where multiple model servers may start dynamically on different ports within the same Pod. We would like to generalize EPP so that it can discover the inference server port(s) from the matched Pod itself, rather than from the InferencePool definition.

One potential use case for this scenario is as follows:

llm-d-fast-model-actuation (https://github.com/llm-d-incubation/llm-d-fast-model-actuation) is a project aiming to speed up scale-out, at least for simple model server deployment patterns. Its key techniques are using vLLM sleep/wake to maintain some low-GPU-resource idle instances, and a launcher process that spring-loads the startup of child processes. The launcher has only one awake child at a time, plus some sleeping ones; switching which child is awake is fast, so the launcher Pod serves different models at different times. A controller manages this, including adjusting the launcher Pod's labels so that it matches the right InferencePool. Note that even a vLLM instance in sleep mode is handling HTTP requests on its inference port, so the children of one launcher have to use different inference ports. That in turn forces the InferencePools to use different inference ports, which adds unwanted coupling between InferencePool objects. It would be better if the inference port number(s) could come from a label or annotation on the matching Pods.

Proposal

Currently, in the InferencePool API, the ports of the backend model servers are declared through the required targetPorts field.

https://github.com/kubernetes-sigs/gateway-api-inference-extension/blob/main/api/v1/inferencepool_types.go#L81C2-L81C13

However, with this approach, it is not possible to identify inference servers that are dynamically started on different ports within a Pod.

We suggest making the targetPorts field optional and adding a new annotation-based mechanism to the EPP Model Server Protocol that lets EPP dynamically discover model servers running on multiple dynamic ports, thereby serving the use case above. Specifically, an example of the annotation would look like this:

annotations:
  inference.networking.x-k8s.io/port-discovery: '[{"inferencePool":"qwen-pool","number":8007}]'

This can be parsed into the following data structure:

type InferencePort struct {
   InferencePool string            `json:"inferencePool"`
   Number        int32             `json:"number"`
   Attributes    map[string]string `json:"attributes,omitempty"`
}

type InferencePorts []InferencePort
  • Each InferencePort declares a model server instance running on that port.
  • The inferencePool field specifies which inference pool the instance belongs to (since services running on different ports may belong to different inference pools).
  • An optional attributes field allows users to attach metadata to the model server running on that port. This is particularly useful when the inference servers launched on different ports play distinct roles, such as prefill and decode.
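To make the annotation format concrete, here is a minimal sketch of decoding the proposed annotation value into the InferencePorts structure above. The parsePortDiscovery helper is hypothetical; only the annotation key and the struct come from this proposal.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// InferencePort mirrors the structure proposed above.
type InferencePort struct {
	InferencePool string            `json:"inferencePool"`
	Number        int32             `json:"number"`
	Attributes    map[string]string `json:"attributes,omitempty"`
}

type InferencePorts []InferencePort

// parsePortDiscovery decodes the JSON value of the proposed
// inference.networking.x-k8s.io/port-discovery annotation.
// (Hypothetical helper name, for illustration only.)
func parsePortDiscovery(value string) (InferencePorts, error) {
	var ports InferencePorts
	if err := json.Unmarshal([]byte(value), &ports); err != nil {
		return nil, fmt.Errorf("invalid port-discovery annotation: %w", err)
	}
	return ports, nil
}

func main() {
	value := `[{"inferencePool":"qwen-pool","number":8007},
	           {"inferencePool":"qwen-pool","number":8008,"attributes":{"role":"prefill"}}]`
	ports, err := parsePortDiscovery(value)
	if err != nil {
		panic(err)
	}
	for _, p := range ports {
		fmt.Printf("pool=%s port=%d attrs=%v\n", p.InferencePool, p.Number, p.Attributes)
	}
}
```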

At the implementation level, EPP can watch Pods and, based on the proposed annotation, expand each Pod into one or more "virtual pods", which are then recorded in the local datastore. This approach keeps full compatibility with the rest of EPP's Pod-based scheduling logic. Such an implementation has already been proposed in #1663 to support DP; we would only need to modify the creation logic of the "virtual pods" so that EPP creates them according to the annotation.
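The expansion step could be sketched as follows. This is not the EPP datastore API: VirtualPod and virtualPods are hypothetical stand-ins, Pods are reduced to a name plus annotations, and when the annotation is absent the sketch falls back to the pool's static targetPorts, preserving today's behavior.

```go
package main

import (
	"encoding/json"
	"fmt"
)

type InferencePort struct {
	InferencePool string            `json:"inferencePool"`
	Number        int32             `json:"number"`
	Attributes    map[string]string `json:"attributes,omitempty"`
}

// VirtualPod is a hypothetical stand-in for an EPP datastore entry:
// one schedulable endpoint per (pod, port) pair.
type VirtualPod struct {
	PodName    string
	Port       int32
	Attributes map[string]string
}

const portDiscoveryAnnotation = "inference.networking.x-k8s.io/port-discovery"

// virtualPods expands one Pod into the virtual pods belonging to the given
// InferencePool. Without the annotation it uses the pool's static targetPorts.
func virtualPods(podName string, annotations map[string]string, pool string, targetPorts []int32) ([]VirtualPod, error) {
	raw, ok := annotations[portDiscoveryAnnotation]
	if !ok {
		// Annotation absent: fall back to the InferencePool's targetPorts.
		out := make([]VirtualPod, 0, len(targetPorts))
		for _, p := range targetPorts {
			out = append(out, VirtualPod{PodName: podName, Port: p})
		}
		return out, nil
	}
	var ports []InferencePort
	if err := json.Unmarshal([]byte(raw), &ports); err != nil {
		return nil, err
	}
	var out []VirtualPod
	for _, p := range ports {
		if p.InferencePool != pool {
			continue // this port serves a different InferencePool
		}
		out = append(out, VirtualPod{PodName: podName, Port: p.Number, Attributes: p.Attributes})
	}
	return out, nil
}

func main() {
	ann := map[string]string{
		portDiscoveryAnnotation: `[{"inferencePool":"qwen-pool","number":8007},{"inferencePool":"llama-pool","number":8008}]`,
	}
	vps, err := virtualPods("launcher-0", ann, "qwen-pool", nil)
	if err != nil {
		panic(err)
	}
	fmt.Println(vps)
}
```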

cc @kfswain @MikeSpreitzer @osswangxining @shmuelk
