# Vertex AI - Self Deployed Models
Deploy and use your own models on Vertex AI through Model Garden or custom endpoints.
## Model Garden
> **Tip:** All OpenAI compatible models from Vertex Model Garden are supported.
### Using Model Garden
Almost all Vertex Model Garden models are OpenAI compatible.
**OpenAI Compatible Models**
| Property | Details |
|---|---|
| Provider Route | vertex_ai/openai/{MODEL_ID} |
| Vertex Documentation | Model Garden LiteLLM Inference, Vertex Model Garden |
| Supported Operations | /chat/completions, /embeddings |
**SDK**
```python
from litellm import completion
import os

## set ENV variables
os.environ["VERTEXAI_PROJECT"] = "hardy-device-38811"
os.environ["VERTEXAI_LOCATION"] = "us-central1"

response = completion(
    model="vertex_ai/openai/<your-endpoint-id>",
    messages=[{"content": "Hello, how are you?", "role": "user"}]
)
```
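The table above also lists `/embeddings` as a supported operation. A minimal sketch of calling it through the same route, assuming the Model Garden endpoint you deployed actually serves an embedding model (the endpoint ID below is a placeholder):

```python
from litellm import embedding
import os

os.environ["VERTEXAI_PROJECT"] = "hardy-device-38811"
os.environ["VERTEXAI_LOCATION"] = "us-central1"

# Same vertex_ai/openai/ route, but using the embeddings operation
response = embedding(
    model="vertex_ai/openai/<your-embedding-endpoint-id>",
    input=["Hello, how are you?"],
)
print(response.data[0]["embedding"][:5])  # first few embedding values
```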
**Proxy**

1. Add to config
```yaml
model_list:
  - model_name: llama3-1-8b-instruct
    litellm_params:
      model: vertex_ai/openai/5464397967697903616
      vertex_ai_project: "my-test-project"
      vertex_ai_location: "us-east1"
```
2. Start proxy
```bash
litellm --config /path/to/config.yaml

# RUNNING at http://0.0.0.0:4000
```
3. Test it! (the `model` value is the `model_name` from your config)

```bash
curl --location 'http://0.0.0.0:4000/chat/completions' \
--header 'Authorization: Bearer sk-1234' \
--header 'Content-Type: application/json' \
--data '{
    "model": "llama3-1-8b-instruct",
    "messages": [
        {
            "role": "user",
            "content": "what llm are you"
        }
    ]
}'
```
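Because the proxy exposes an OpenAI-compatible `/chat/completions` endpoint, you can also call it with the OpenAI Python SDK instead of curl. A minimal sketch, assuming the proxy is running at `http://0.0.0.0:4000` with the virtual key `sk-1234` from the example above:

```python
from openai import OpenAI

# Point the OpenAI client at the LiteLLM proxy
client = OpenAI(api_key="sk-1234", base_url="http://0.0.0.0:4000")

response = client.chat.completions.create(
    model="llama3-1-8b-instruct",  # the 'model_name' from the config
    messages=[{"role": "user", "content": "what llm are you"}],
)
print(response.choices[0].message.content)
```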
**Non-OpenAI Compatible Models**

For endpoints that are not OpenAI compatible, call them through the `vertex_ai/{MODEL_ID}` route directly:

```python
from litellm import completion
import os

## set ENV variables
os.environ["VERTEXAI_PROJECT"] = "hardy-device-38811"
os.environ["VERTEXAI_LOCATION"] = "us-central1"

response = completion(
    model="vertex_ai/<your-endpoint-id>",
    messages=[{"content": "Hello, how are you?", "role": "user"}]
)
```
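Streaming works through the same route by passing `stream=True`, which returns an iterator of chunks. A minimal sketch (the endpoint ID is a placeholder):

```python
from litellm import completion
import os

os.environ["VERTEXAI_PROJECT"] = "hardy-device-38811"
os.environ["VERTEXAI_LOCATION"] = "us-central1"

# stream=True yields OpenAI-style chunks as they arrive
response = completion(
    model="vertex_ai/<your-endpoint-id>",
    messages=[{"content": "Hello, how are you?", "role": "user"}],
    stream=True,
)
for chunk in response:
    print(chunk.choices[0].delta.content or "", end="")
```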
## Gemma Models (Custom Endpoints)
Deploy Gemma models on custom Vertex AI prediction endpoints with OpenAI-compatible format.
| Property | Details |
|---|---|
| Provider Route | vertex_ai/gemma/{MODEL_NAME} |
| Vertex Documentation | Vertex AI Prediction |
| Required Parameter | api_base - Full prediction endpoint URL |
**Proxy Usage:**
1. Add to config.yaml
```yaml
model_list:
  - model_name: gemma-model
    litellm_params:
      model: vertex_ai/gemma/gemma-3-12b-it-1222199011122
      api_base: https://ENDPOINT.us-central1-PROJECT.prediction.vertexai.goog/v1/projects/PROJECT_ID/locations/us-central1/endpoints/ENDPOINT_ID:predict
      vertex_project: "my-project-id"
      vertex_location: "us-central1"
```
2. Start proxy
```bash
litellm --config /path/to/config.yaml
```
3. Test it
```bash
curl http://0.0.0.0:4000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer sk-1234" \
  -d '{
    "model": "gemma-model",
    "messages": [{"role": "user", "content": "What is machine learning?"}],
    "max_tokens": 100
  }'
```
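The proxy also accepts the standard OpenAI `stream` parameter if you want server-sent events instead of a single JSON response. A sketch of the same request with streaming enabled:

```bash
curl http://0.0.0.0:4000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer sk-1234" \
  -d '{
    "model": "gemma-model",
    "messages": [{"role": "user", "content": "What is machine learning?"}],
    "max_tokens": 100,
    "stream": true
  }'
```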
**SDK Usage:**
```python
from litellm import completion

response = completion(
    model="vertex_ai/gemma/gemma-3-12b-it-1222199011122",
    messages=[{"role": "user", "content": "What is machine learning?"}],
    api_base="https://ENDPOINT.us-central1-PROJECT.prediction.vertexai.goog/v1/projects/PROJECT_ID/locations/us-central1/endpoints/ENDPOINT_ID:predict",
    vertex_project="my-project-id",
    vertex_location="us-central1",
)
```
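If you don't want to rely on application-default credentials, LiteLLM also accepts a `vertex_credentials` parameter containing a service-account key. A minimal sketch, assuming the key is stored as a local JSON file (the path is a placeholder):

```python
import json
from litellm import completion

# Load a service-account key and pass it explicitly (path is a placeholder)
with open("path/to/vertex_service_account.json", "r") as f:
    vertex_credentials = json.dumps(json.load(f))

response = completion(
    model="vertex_ai/gemma/gemma-3-12b-it-1222199011122",
    messages=[{"role": "user", "content": "What is machine learning?"}],
    api_base="https://ENDPOINT.us-central1-PROJECT.prediction.vertexai.goog/v1/projects/PROJECT_ID/locations/us-central1/endpoints/ENDPOINT_ID:predict",
    vertex_project="my-project-id",
    vertex_location="us-central1",
    vertex_credentials=vertex_credentials,
)
```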
## MedGemma Models (Custom Endpoints)
Deploy MedGemma models on custom Vertex AI prediction endpoints with OpenAI-compatible format. MedGemma models use the same `vertex_ai/gemma/` route.
| Property | Details |
|---|---|
| Provider Route | vertex_ai/gemma/{MODEL_NAME} |
| Vertex Documentation | Vertex AI Prediction |
| Required Parameter | api_base - Full prediction endpoint URL |
**Proxy Usage:**
1. Add to config.yaml
```yaml
model_list:
  - model_name: medgemma-model
    litellm_params:
      model: vertex_ai/gemma/medgemma-2b-v1
      api_base: https://ENDPOINT.us-central1-PROJECT.prediction.vertexai.goog/v1/projects/PROJECT_ID/locations/us-central1/endpoints/ENDPOINT_ID:predict
      vertex_project: "my-project-id"
      vertex_location: "us-central1"
```
2. Start proxy
```bash
litellm --config /path/to/config.yaml
```
3. Test it
```bash
curl http://0.0.0.0:4000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer sk-1234" \
  -d '{
    "model": "medgemma-model",
    "messages": [{"role": "user", "content": "What are the symptoms of hypertension?"}],
    "max_tokens": 100
  }'
```
**SDK Usage:**
```python
from litellm import completion

response = completion(
    model="vertex_ai/gemma/medgemma-2b-v1",
    messages=[{"role": "user", "content": "What are the symptoms of hypertension?"}],
    api_base="https://ENDPOINT.us-central1-PROJECT.prediction.vertexai.goog/v1/projects/PROJECT_ID/locations/us-central1/endpoints/ENDPOINT_ID:predict",
    vertex_project="my-project-id",
    vertex_location="us-central1",
)
```
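For async applications, the same call works with `litellm.acompletion`, which mirrors the `completion` signature. A minimal sketch under the same assumptions as the example above:

```python
import asyncio
from litellm import acompletion

async def main():
    # Same parameters as the synchronous example above
    response = await acompletion(
        model="vertex_ai/gemma/medgemma-2b-v1",
        messages=[{"role": "user", "content": "What are the symptoms of hypertension?"}],
        api_base="https://ENDPOINT.us-central1-PROJECT.prediction.vertexai.goog/v1/projects/PROJECT_ID/locations/us-central1/endpoints/ENDPOINT_ID:predict",
        vertex_project="my-project-id",
        vertex_location="us-central1",
    )
    print(response.choices[0].message.content)

asyncio.run(main())
```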