
Deploy As A Standalone Request Scheduler

The endpoint picker (EPP) is, at its core, a smart request scheduler for LLM requests. It currently implements a number of LLM-specific load-balancing optimizations, including:

  • Prefix-cache aware scheduling
  • Load-aware scheduling

When used with the Gateway API, EPP works as an ext-proc extension to an Envoy-based proxy fronting model servers running in a Kubernetes cluster; examples of such proxies are cloud-managed ones like GKE's L7 load balancer and open-source counterparts like Istio and kgateway. EPP as an ext-proc here offers several key advantages:

  • It utilizes robust, pre-existing L7 proxies, including both managed and open-source options.
  • Seamless integration with the Kubernetes networking ecosystem via the Gateway API, which allows for:
      • Transforming a Kubernetes gateway into an inference scheduler using familiar APIs.
      • Leveraging Gateway API features like traffic splitting for gradual rollouts (sketched below) and HTTP rule matching.
      • Access to provider-specific features.
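
For example, in the Gateway-integrated mode a gradual rollout can be expressed as a weighted HTTPRoute. This is only a sketch: the Gateway name and the second InferencePool are hypothetical, and the InferencePool API group may differ between releases.

apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: llm-route
spec:
  parentRefs:
  - name: inference-gateway                   # hypothetical Gateway
  rules:
  - backendRefs:
    - group: inference.networking.x-k8s.io    # API group may differ by release
      kind: InferencePool
      name: vllm-llama3-8b-instruct
      weight: 90
    - group: inference.networking.x-k8s.io
      kind: InferencePool
      name: vllm-llama3-8b-instruct-new       # hypothetical canary pool
      weight: 10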

These benefits are critical for online services, including MaaS (Model-as-a-Service) offerings, which require multi-tenancy, high availability, scalability, and streamlined operations.

However, for some batch inference workloads, tight integration with the Gateway API and the need to deploy an external proxy separately are in practice an operational overhead. Consider an offline RL post-training job, where the sampler, the inference service in the job, is a single tenant/workload whose lifecycle is tied to the training job; this inference service is specific to the job and is continuously updated during post-training, so it would not serve any other traffic. A simpler deployment mode lowers the barrier to adopting the EPP for such single-tenant workloads.

How

A proxy is deployed as a sidecar to the EPP. The proxy and EPP still communicate via the ext-proc protocol, but over localhost. For endpoint discovery, you configure a label selector for the model server pods as a flag to EPP instead of relying on an InferencePool resource.
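
To make the wiring concrete, here is an illustrative Envoy bootstrap fragment for the sidecar. This is a sketch, not the chart's actual configuration: the listener port (8081), the EPP ext-proc port (9002), the processing mode, and the x-gateway-destination-endpoint header name are assumptions to verify against the chart's rendered config.

static_resources:
  listeners:
  - name: inference
    address:
      socket_address: { address: 0.0.0.0, port_value: 8081 }  # client-facing port (assumed)
    filter_chains:
    - filters:
      - name: envoy.filters.network.http_connection_manager
        typed_config:
          "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
          stat_prefix: inference
          route_config:
            virtual_hosts:
            - name: model_servers
              domains: ["*"]
              routes:
              - match: { prefix: "/" }
                route: { cluster: model_servers }
          http_filters:
          # EPP runs in the same pod; Envoy streams each request to it over gRPC
          # and EPP replies with the model-server endpoint to use.
          - name: envoy.filters.http.ext_proc
            typed_config:
              "@type": type.googleapis.com/envoy.extensions.filters.http.ext_proc.v3.ExternalProcessor
              grpc_service:
                envoy_grpc: { cluster_name: epp }
              processing_mode:
                request_header_mode: SEND
                request_body_mode: BUFFERED   # body is needed for model-aware scheduling (assumed mode)
          - name: envoy.filters.http.router
            typed_config:
              "@type": type.googleapis.com/envoy.extensions.filters.http.router.v3.Router
  clusters:
  # EPP's ext-proc gRPC server over localhost (port assumed).
  - name: epp
    type: STATIC
    typed_extension_protocol_options:
      envoy.extensions.upstreams.http.v3.HttpProtocolOptions:
        "@type": type.googleapis.com/envoy.extensions.upstreams.http.v3.HttpProtocolOptions
        explicit_http_config: { http2_protocol_options: {} }
    load_assignment:
      cluster_name: epp
      endpoints:
      - lb_endpoints:
        - endpoint:
            address:
              socket_address: { address: 127.0.0.1, port_value: 9002 }
  # Requests go to whichever model-server pod EPP picked, conveyed in a
  # per-request header (header name assumed).
  - name: model_servers
    type: ORIGINAL_DST
    lb_policy: CLUSTER_PROVIDED
    original_dst_lb_config:
      use_http_header: true
      http_header_name: x-gateway-destination-endpoint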

Example

Prerequisites

A cluster with:

  • Support for one of the three most recent Kubernetes minor releases.
  • Support for services of type LoadBalancer. For kind clusters, follow the kind LoadBalancer guide to get services of type LoadBalancer working.
  • Support for sidecar containers (enabled by default since Kubernetes v1.29) to run the model server deployment.

Tools:

  • kubectl
  • Helm

Steps

Deploy Sample Model Server

This guide shows three options for the sample model server: a GPU-based vLLM deployment, a CPU-based vLLM deployment, and the vLLM simulator. Choose one.

GPU-based model server

For this setup, you will need 3 GPUs to run the sample model server; adjust the number of replicas as needed. Create a Hugging Face secret to download the model meta-llama/Llama-3.1-8B-Instruct, and ensure that the token grants access to this model.

Deploy a sample vLLM deployment with the proper protocol to work with the LLM Instance Gateway.

kubectl create secret generic hf-token --from-literal=token=$HF_TOKEN # Your Hugging Face Token with access to the set of Llama models
kubectl apply -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/raw/main/config/manifests/vllm/gpu-deployment.yaml
CPU-based model server

Warning

CPU deployment can be unreliable: the pods may crash or restart because of resource constraints.

This setup uses the official vllm-cpu image, which, according to the documentation, can run vLLM on the x86 CPU platform. For this setup, we use approximately 9.5GB of memory and 12 CPUs for each replica.

While it is possible to deploy the model server with fewer resources, this is not recommended. For example, in our tests, loading the model with 8GB of memory and 1 CPU was possible but took almost 3.5 minutes, and inference requests took an unreasonably long time. In general, there is a tradeoff between the memory and CPU we allocate to the pods and the performance: the more memory and CPU we allocate, the better the performance.

After trying multiple configurations of these values, we settled in this sample on 9.5GB of memory and 12 CPUs for each replica, which gives reasonable response times. You can increase these numbers and potentially get even better response times. To modify the allocated resources, adjust the numbers in cpu-deployment.yaml as needed (see the sketch below).
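
For reference, the values above correspond to a standard Kubernetes resources stanza on the vLLM container in cpu-deployment.yaml; the exact layout in the manifest may differ, but it is roughly:

resources:
  requests:
    cpu: "12"
    memory: 9.5Gi
  limits:
    cpu: "12"
    memory: 9.5Gi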

Deploy the CPU-based sample vLLM deployment with the proper protocol to work with the LLM Instance Gateway.

kubectl apply -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/raw/main/config/manifests/vllm/cpu-deployment.yaml

vLLM simulator

This option uses the vLLM simulator to simulate a backend model server. This setup uses the least amount of compute resources, does not require GPUs, and is ideal for test/dev environments.

To deploy the vLLM simulator, run the following command.

kubectl apply -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/raw/main/config/manifests/vllm/sim-deployment.yaml
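
Whichever option you chose, wait for the model server pods to become ready before continuing. A minimal check, assuming the pods carry the app=vllm-llama3-8b-instruct label that the EPP selector below uses:

kubectl wait --for=condition=Ready pod -l app=vllm-llama3-8b-instruct --timeout=15m
kubectl get pods -l app=vllm-llama3-8b-instruct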

Deploy Endpoint Picker Extension with Envoy sidecar

Deploy an Endpoint Picker Extension named vllm-llama3-8b-instruct that selects from endpoints carrying the label app=vllm-llama3-8b-instruct and listening on port 8000. The Helm install command automatically installs the endpoint-picker-specific resources.

Set the chart version and, if applicable, your provider, then run the Helm install.

export EPP_STANDALONE_CHART_VERSION=v0
export PROVIDER=<YOUR_PROVIDER> # optional; can be set to gke, since GKE needs its own EPP monitoring resources
helm install vllm-llama3-8b-instruct \
  --dependency-update \
  --set inferenceExtension.endpointsServer.endpointSelector="app=vllm-llama3-8b-instruct" \
  --set provider.name=$PROVIDER \
  --version $EPP_STANDALONE_CHART_VERSION \
  oci://us-central1-docker.pkg.dev/k8s-staging-images/gateway-api-inference-extension/charts/epp-standalone

Try it out

Wait until the EPP deployment is ready.
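
For example, assuming the chart names the EPP deployment <release>-epp, matching the service name used in the request below:

kubectl rollout status deployment/vllm-llama3-8b-instruct-epp --timeout=5m
kubectl get deployment vllm-llama3-8b-instruct-epp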

Once your epp-standalone pod is running, install the curl pod as follows:

cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: curl
  labels:
    app: curl
spec:
  containers:
  - name: curl
    image: curlimages/curl:7.83.1
    imagePullPolicy: IfNotPresent
    command:
      - tail
      - -f
      - /dev/null
  restartPolicy: Never
EOF

Send an inference request:

kubectl exec curl -- curl -i http://vllm-llama3-8b-instruct-epp:8081/v1/completions \
-H 'Content-Type: application/json' \
-d '{"model": "food-review-1","prompt": "Write as if you were a critic: San Francisco","max_tokens": 100,"temperature": 0}'
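
Alternatively, you can skip the curl pod and port-forward the EPP service to your workstation; this sketch reuses the service name and port from the command above:

kubectl port-forward svc/vllm-llama3-8b-instruct-epp 8081:8081 &
curl -i http://localhost:8081/v1/completions \
-H 'Content-Type: application/json' \
-d '{"model": "food-review-1","prompt": "Write as if you were a critic: San Francisco","max_tokens": 100,"temperature": 0}'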

Cleanup

Run the following commands to remove all resources created by this guide.

The following instructions assume you would like to clean up ALL resources that were created in this guide. Please be careful not to delete resources you'd like to keep.

  1. Uninstall the EPP, curl pod and model server resources:
helm uninstall vllm-llama3-8b-instruct
kubectl delete -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/raw/main/config/manifests/inferenceobjective.yaml --ignore-not-found
kubectl delete -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/raw/main/config/manifests/vllm/cpu-deployment.yaml --ignore-not-found
kubectl delete -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/raw/main/config/manifests/vllm/gpu-deployment.yaml --ignore-not-found
kubectl delete -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/raw/main/config/manifests/vllm/sim-deployment.yaml --ignore-not-found
kubectl delete secret hf-token --ignore-not-found
kubectl delete pod curl --ignore-not-found