The project aims to design and deploy a scalable, efficient cloud-based inference service for large language models (LLMs) using Kubernetes on Google Cloud. It uses vLLM, an open-source library for high-throughput, memory-efficient LLM inference and serving, to reduce GPU memory consumption and response latency.
Before proceeding, ensure you have:
- A Google Cloud Platform (GCP) account
- `gcloud` CLI installed and authenticated
- `kubectl` CLI installed and configured
- Docker installed and configured to push images to a container registry
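A quick way to confirm the command-line tools are installed is a small shell check like the one below. It is a minimal sketch that only tests whether `gcloud`, `kubectl`, and `docker` are on your `PATH`; it does not verify authentication or registry access (`gcloud auth list` and `docker login` are the usual places to check those). The function name `check_prereqs` is hypothetical.

```bash
# Hypothetical helper: print each missing prerequisite CLI and return non-zero
# if any are absent. Only checks PATH; authentication is not verified.
check_prereqs() {
  missing=0
  for tool in gcloud kubectl docker; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "missing: $tool"
      missing=1
    fi
  done
  return "$missing"
}
```

Usage: `check_prereqs && echo "all prerequisites found"`.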
Create a GKE cluster on GCP, then point `kubectl` at it by fetching its credentials:
```bash
gcloud container clusters get-credentials final-project --region us-central1 --project cml-finals
```

Check existing deployments and services:
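```bash
kubectl get deployments
kubectl get svc
```

Each deployment step below starts from the repository root. Deploy the database service:

```bash
cd db-service
kubectl apply -f deployment-db.yaml
kubectl apply -f service-db.yaml
```

Deploy the pub-sub service:

```bash
cd pub-sub-service/deployment
kubectl apply -f deployment-pub-sub.yaml
kubectl apply -f service-pub-sub.yaml
```

Ensure both services are running before proceeding.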
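Rather than re-running `kubectl get pods` by hand, you can block until the deployments are ready. The sketch below wraps `kubectl wait`, assuming the database and pub-sub Deployments live in the current namespace; note that `--all` waits on every Deployment in that namespace, not only these two. The function name and the 300-second default timeout are illustrative choices, not part of the project.

```bash
# Hypothetical helper: wait until every Deployment in the current namespace
# reports the Available condition; exits non-zero if the timeout expires first.
wait_for_deployments() {
  timeout="${1:-300s}"   # any kubectl duration, e.g. 90s or 5m
  kubectl wait --for=condition=Available deployment --all --timeout="$timeout"
}
```

Usage: `wait_for_deployments 5m && kubectl get pods`. The same helper can be reused after the later deployment steps; the LLM service may need a longer timeout, since pulling a large image and loading a model can take several minutes.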
Install the NVIDIA device plugin so that Kubernetes can schedule GPU workloads:

```bash
kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.13.0/nvidia-device-plugin.yml
```

Build, push, and deploy the LLM service:

```bash
cd llm-service
docker build . -t ad060398/llm-service --no-cache --platform=linux/amd64
docker push ad060398/llm-service
kubectl apply -f deployment/deployment-llm-service.yaml
kubectl apply -f deployment/service-llm-service.yaml
```

Build, push, and deploy the API server:

```bash
cd api-server
docker build . -t ad060398/api-server --no-cache --platform=linux/amd64
docker push ad060398/api-server
kubectl apply -f deployment/deployment-api-server.yaml
kubectl apply -f deployment/service-api-server.yaml
```

Monitor the deployment:

```bash
kubectl get pods              # List pods and their status
kubectl logs <pod_name>       # View logs for a specific pod
kubectl get svc               # List services
```

Retrieve the external IP of the API server from the `EXTERNAL-IP` column of `kubectl get svc` and use it to test the service. Submit a prompt to `/chat`, then query `/status/<job_id>` for the job's progress:

```bash
curl -X POST http://<external-ip>/chat \
  -H "Content-Type: application/json" \
  -d '{"text": "Hello, LLM!"}'
curl http://<external-ip>/status/<job_id>
```

Run the load test with Locust:

```bash
locust -f load_test.py
```

Then open the Locust web UI at:

http://localhost:8089

In the web interface, set the number of users, the spawn rate, and the host (`http://<external-ip>`), then start the test.
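For repeatable runs, Locust can also run without the web UI. The sketch below looks up the API server's external IP with a JSONPath query and passes it to Locust's `--headless` mode. It assumes the API server Service is of type `LoadBalancer` (which is what gives it an external IP) and that `load_test.py` is in the current directory; the service-name argument, the function names, and the default user count, spawn rate, and run time are placeholders.

```bash
# Hypothetical helper: print the first external IP assigned to a LoadBalancer Service.
api_external_ip() {
  kubectl get svc "$1" -o jsonpath='{.status.loadBalancer.ingress[0].ip}'
}

# Hypothetical helper: run load_test.py headless against that IP.
# Args: service name, users (default 50), spawn rate (default 5), run time (default 2m).
run_headless_load_test() {
  service="$1"; users="${2:-50}"; spawn_rate="${3:-5}"; run_time="${4:-2m}"
  ip=$(api_external_ip "$service") || return 1
  if [ -z "$ip" ]; then
    # The Service exists but GKE has not provisioned the load balancer yet.
    echo "no external IP yet for service $service" >&2
    return 1
  fi
  locust -f load_test.py --headless \
    --users "$users" --spawn-rate "$spawn_rate" --run-time "$run_time" \
    --host "http://$ip"
}
```

Usage: `run_headless_load_test <api-server-service> 100 10 5m`, where `<api-server-service>` is the API server's service name as listed by `kubectl get svc`.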
These steps set up and deploy all services required for the project. Make sure each service is running before moving on to the next step. If you encounter issues, use `kubectl logs` and `kubectl describe` to debug deployment errors.
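When a pod misbehaves, the same few commands usually come up. The sketch below bundles `kubectl describe` with the pod's current and previous container logs, and adds a check that nodes advertise `nvidia.com/gpu`, which is worth running if the LLM service pod stays `Pending` after the NVIDIA device plugin is installed. The function names are hypothetical; `<pod_name>` comes from `kubectl get pods`.

```bash
# Hypothetical helper: print describe output plus current and previous logs
# for one pod (previous logs exist only if a container has restarted).
debug_pod() {
  pod="$1"
  kubectl describe pod "$pod"               # status, events, scheduling and image-pull errors
  kubectl logs "$pod" --all-containers      # logs from the running containers
  kubectl logs "$pod" --all-containers --previous 2>/dev/null \
    || echo "(no previous container logs)"  # logs from before the last restart, if any
}

# Hypothetical helper: list allocatable GPUs per node. A GPU column showing
# <none> means no GPUs are being advertised on that node.
gpu_capacity() {
  kubectl get nodes "-o=custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu"
}
```

Usage: `debug_pod <pod_name>` for a failing pod, and `gpu_capacity` when a GPU-requesting pod cannot be scheduled.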