Background
Currently, EPP uses pods with fixed ports as the basic unit for request scheduling. However, this approach does not adapt well to scenarios where multiple model servers may start dynamically on different ports. We would like to generalize EPP so that it can discover the inference server port dynamically from the matched pod, instead of taking it from the InferencePool definition.
One potential use case for this scenario is as follows:
https://github.com/llm-d-incubation/llm-d-fast-model-actuation is a project aiming to speed up scale-out, at least for simple model server deployment patterns. The key techniques are using vLLM sleep/wake to maintain some low-GPU-resource idle instances, and a launcher process to spring-load startup of child processes. The launcher has only one awake child at a time, plus some sleeping ones; switching which one is awake is fast. Thus, the launcher pod serves different models at different times. A controller manages this, including adjusting the launcher pod's labels to match the right InferencePool. Remember that even a vLLM instance in sleep mode is handling HTTP requests on its inference port, so the children of one launcher have to use different inference ports. That means the InferencePools have to have different inference ports. I do not like that; it adds constraints between InferencePool objects. So it would be better if the inference port number(s) could come from a label or annotation on the matching pods.
Proposal
Currently, in the InferencePool API, the ports of the backend model servers are identified through a required targetPorts field. However, this approach cannot identify inference servers that are started dynamically on different ports within a pod.
We suggest making the targetPorts field optional and adding a new annotation-based mechanism to the EPP Model Server Protocol that helps EPP dynamically discover model servers running on multiple dynamic ports, thereby serving the use case described above. Specifically, an example of the annotation would look like this:
```yaml
annotations:
  inference.networking.x-k8s.io/port-discovery: '[{"inferencePool":"qwen-pool","number":8007}]'
```

This can be parsed into the following data structure:
```go
type InferencePort struct {
	InferencePool string            `json:"inferencePool"`
	Number        int32             `json:"number"`
	Attributes    map[string]string `json:"attributes,omitempty"`
}

type InferencePorts []InferencePort
```

- Each `InferencePort` declares a model server instance running on that port.
- The `inferencePool` field specifies which inference pool the instance belongs to, since servers running on different ports may belong to different inference pools.
- For any additional requirements, an optional `Attributes` field allows users to attach metadata to the model server running on that port. This is particularly useful when the inference servers launched on different ports have distinct roles, such as prefill and decode.
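For concreteness, here is a minimal sketch of how EPP could parse this annotation into the structure above; the constant and function names here are illustrative, not part of the proposal:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// InferencePort mirrors the structure proposed above.
type InferencePort struct {
	InferencePool string            `json:"inferencePool"`
	Number        int32             `json:"number"`
	Attributes    map[string]string `json:"attributes,omitempty"`
}

type InferencePorts []InferencePort

// portDiscoveryAnnotation is the proposed annotation key.
const portDiscoveryAnnotation = "inference.networking.x-k8s.io/port-discovery"

// parsePortDiscovery extracts the declared inference ports from a pod's
// annotations. It returns nil when the annotation is absent.
func parsePortDiscovery(annotations map[string]string) (InferencePorts, error) {
	raw, ok := annotations[portDiscoveryAnnotation]
	if !ok {
		return nil, nil // annotation absent: caller falls back to targetPorts
	}
	var ports InferencePorts
	if err := json.Unmarshal([]byte(raw), &ports); err != nil {
		return nil, fmt.Errorf("invalid %s annotation: %w", portDiscoveryAnnotation, err)
	}
	return ports, nil
}
```

Returning nil for pods without the annotation would let EPP fall back to the InferencePool's targetPorts, which keeps that field optional rather than removed.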
At the implementation level, EPP can watch pods and, based on the predefined annotation, transform a pod into one or more "virtual pods", which are then recorded in the local datastore. This keeps full compatibility with the rest of EPP's pod-based scheduling logic. Such an implementation has already been proposed in #1663 to support DP; we only need to modify the creation logic of the "virtual pods" so that EPP creates them according to the annotation.
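Building on the parsing sketch above, the creation logic might look roughly like the following. The `virtualPod` type and `virtualPodsFor` function are hypothetical stand-ins for whatever EPP's datastore actually records, not the real types from #1663:

```go
import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// virtualPod is a hypothetical stand-in for the record EPP keeps in its
// local datastore; the real type and fields may differ.
type virtualPod struct {
	Name       string            // e.g. "<pod-name>:<port>" to keep entries unique
	Address    string            // the real pod's IP
	Port       int32             // the dynamically discovered inference port
	Attributes map[string]string // role metadata, e.g. prefill vs. decode
}

// virtualPodsFor expands one real pod into virtual pods, one per declared
// InferencePort whose inferencePool matches the pool this EPP serves.
// It reuses parsePortDiscovery from the sketch above.
func virtualPodsFor(pod *corev1.Pod, poolName string) ([]virtualPod, error) {
	ports, err := parsePortDiscovery(pod.Annotations)
	if err != nil {
		return nil, err
	}
	var out []virtualPod
	for _, p := range ports {
		if p.InferencePool != poolName {
			continue // this entry belongs to a different pool
		}
		out = append(out, virtualPod{
			Name:       fmt.Sprintf("%s:%d", pod.Name, p.Number),
			Address:    pod.Status.PodIP,
			Port:       p.Number,
			Attributes: p.Attributes,
		})
	}
	return out, nil
}
```

Filtering on the `inferencePool` field means a single launcher pod can carry entries for several pools without its virtual pods colliding, and keying each virtual pod by pod name plus port keeps the datastore entries unique.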