Your first voyage

This guide walks through deploying Modelplane on a local kind cluster and using it to serve a model on GKE. By the end you’ll have a working OpenAI-compatible endpoint serving Qwen 2.5 0.5B.

The whole process takes about 45 minutes. Most of that time is spent waiting for GKE to provision the GPU cluster and for the inference stack to install. The GKE cluster with an L4 GPU costs roughly $2-3 per hour while it runs.

Prerequisites

You need the following tools installed:

kind
kubectl
Helm
Docker (or a compatible credential helper) for registry authentication.

You also need:

A GCP project with the GKE API enabled.
A GCP service account key (JSON) with permissions to create GKE clusters, VPCs, and IAM bindings. The Editor role works for trying things out.
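
Before starting, it can help to confirm that the tools listed above are on your PATH. The following is a minimal sketch, not part of Modelplane: missing_tools is a hypothetical helper name, and the usage line assumes the tools are installed under their usual command names (kind, kubectl, helm, docker).

```shell
# Print the name of every required command that is not on PATH.
# Returns non-zero if at least one command is missing.
# (missing_tools is a hypothetical helper, not a Modelplane command.)
missing_tools() {
  rc=0
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      printf '%s\n' "$tool"
      rc=1
    fi
  done
  return "$rc"
}

# Usage: missing_tools kind kubectl helm docker
```

An empty result with a zero exit status means every listed tool was found.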

Create a kind cluster

The control plane runs in a local kind cluster. It needs no special configuration.

kind create cluster --name modelplane

Install Crossplane

Modelplane is built on Crossplane v2. Install it with Helm:

helm repo add crossplane-stable https://charts.crossplane.io/stable
helm repo update crossplane-stable
helm install crossplane crossplane-stable/crossplane \
  --namespace crossplane-system --create-namespace \
  --wait

Apply prerequisites

Modelplane needs a few Kubernetes resources that Crossplane can’t compose for itself: a shared namespace, RBAC for Gateway API and MetalLB resources, and a runtime config for provider-helm.

kubectl apply -f - <<'EOF'
# Shared namespace for Modelplane infrastructure.
apiVersion: v1
kind: Namespace
metadata:
  name: modelplane-system
---
# Grant Crossplane permissions to compose Gateway API, MetalLB, and
# Service/EndpointSlice routing resources. This ClusterRole is aggregated
# into Crossplane's role automatically.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: crossplane-compose-modelplane
  labels:
    rbac.crossplane.io/aggregate-to-crossplane: "true"
rules:
  - apiGroups: [""]
    resources: ["namespaces"]
    verbs: ["*"]
  # Selectorless Service plus EndpointSlice composed by ModelEndpoint to route
  # the control plane gateway to a remote model endpoint.
  - apiGroups: [""]
    resources: ["services"]
    verbs: ["*"]
  - apiGroups: ["discovery.k8s.io"]
    resources: ["endpointslices"]
    verbs: ["*"]
  - apiGroups: ["gateway.networking.k8s.io"]
    resources: ["gateways", "gatewayclasses", "httproutes"]
    verbs: ["*"]
  - apiGroups: ["gateway.envoyproxy.io"]
    resources: ["backends"]
    verbs: ["*"]
  - apiGroups: ["metallb.io"]
    resources: ["ipaddresspools", "l2advertisements"]
    verbs: ["*"]
  - apiGroups: ["protection.crossplane.io"]
    resources: ["usages"]
    verbs: ["*"]
---
# Give provider-helm a deterministic ServiceAccount name so we can grant it
# permissions. Without this, the SA name has a random hash.
apiVersion: pkg.crossplane.io/v1beta1
kind: DeploymentRuntimeConfig
metadata:
  name: provider-helm-modelplane
spec:
  serviceAccountTemplate:
    metadata:
      name: provider-helm-modelplane
---
# Grant provider-helm cluster-admin. Helm charts install arbitrary Kubernetes
# resources and need broad permissions.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: provider-helm-modelplane
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: cluster-admin
subjects:
  - kind: ServiceAccount
    name: provider-helm-modelplane
    namespace: crossplane-system
---
# Apply the runtime config to provider-helm automatically by matching its OCI
# image prefix.
apiVersion: pkg.crossplane.io/v1beta1
kind: ImageConfig
metadata:
  name: provider-helm-modelplane
spec:
  matchImages:
    - type: Prefix
      prefix: xpkg.upbound.io/upbound/provider-helm
  runtime:
    configRef:
      name: provider-helm-modelplane
---
# Pull secret for Modelplane packages. The package registry requires
# authentication. The next step creates the pull secret.
apiVersion: pkg.crossplane.io/v1beta1
kind: ImageConfig
metadata:
  name: modelplane-pull-secret
spec:
  matchImages:
    - type: Prefix
      prefix: xpkg.upbound.io/modelplane/
  registry:
    authentication:
      pullSecretRef:
        name: upbound-pull-secret
EOF
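
To confirm that the manifest above was applied in full, each object can be looked up by name. The following is a minimal sketch: check_prerequisites is a hypothetical helper, the object names are taken from the manifest, and the group-qualified resource names assume the Crossplane package CRDs are installed.

```shell
# Check that each object from the prerequisites manifest exists.
# Prints one line per missing object; returns non-zero if any are missing.
# (check_prerequisites is a hypothetical helper, not a Modelplane command.)
check_prerequisites() {
  rc=0
  for ref in \
    namespace/modelplane-system \
    clusterrole/crossplane-compose-modelplane \
    clusterrolebinding/provider-helm-modelplane \
    deploymentruntimeconfig.pkg.crossplane.io/provider-helm-modelplane \
    imageconfig.pkg.crossplane.io/provider-helm-modelplane \
    imageconfig.pkg.crossplane.io/modelplane-pull-secret
  do
    if ! kubectl get "$ref" >/dev/null 2>&1; then
      printf 'missing: %s\n' "$ref"
      rc=1
    fi
  done
  return "$rc"
}

# Usage: check_prerequisites
```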

Install Modelplane

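Modelplane is packaged as a Crossplane Configuration. The package registry requires authentication, so create a pull secret first, replacing the <robot-id> and <robot-token> placeholders with your registry credentials. Then install the Configuration. Installing it also pulls the providers and composition functions it depends on.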

kubectl create secret docker-registry upbound-pull-secret \
  --docker-server=xpkg.upbound.io \
  --docker-username='<robot-id>' \
  --docker-password='<robot-token>' \
  -n crossplane-system

kubectl apply -f - <<'EOF'
apiVersion: pkg.crossplane.io/v1
kind: Configuration
metadata:
  name: modelplane
spec:
  package: xpkg.upbound.io/modelplane/modelplane:v0.1.0-dev.125.g0cba874
EOF

Wait for the Configuration and all its dependencies to become healthy. This pulls several container images and takes a few minutes.

kubectl get configuration modelplane --watch
# Wait until HEALTHY shows True, then Ctrl-C.
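
As an alternative to watching by hand, kubectl wait can block until a status condition is met, which is convenient in scripts. The following is a minimal sketch: wait_for_condition is a hypothetical wrapper, and the 10 minute default timeout is an assumption based on the "few minutes" estimate above, not a documented value.

```shell
# Block until a resource reports the given status condition, or time out.
# Arguments: <resource> <condition> [timeout] (timeout defaults to 10m,
# which is an assumption rather than a documented value).
wait_for_condition() {
  kubectl wait "$1" --for="condition=$2" --timeout="${3:-10m}"
}

# Usage: wait_for_condition configuration.pkg.crossplane.io/modelplane Healthy
```

The same wrapper could in principle be pointed at other resources in this guide, provided they expose a matching condition. The READY columns shown later suggest a Ready condition, but this guide does not confirm it.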

Configure GCP credentials

Create a Secret with your GCP service account key, then create a ProviderConfig that references it.

kubectl create secret generic gcp-creds \
  --from-file=credentials=/path/to/sa-key.json \
  -n crossplane-system

kubectl apply -f - <<'EOF'
apiVersion: gcp.m.upbound.io/v1beta1
kind: ClusterProviderConfig
metadata:
  name: default
spec:
  projectID: my-gcp-project  # Replace with your GCP project ID.
  credentials:
    source: Secret
    secretRef:
      namespace: crossplane-system
      name: gcp-creds
      key: credentials
EOF

Create the InferenceGateway

The InferenceGateway installs Envoy Gateway and MetalLB on the control plane cluster and creates a Gateway that routes traffic to model endpoints.

kubectl apply -f examples/platform/inference-gateway.yaml

Wait for it to become ready (~3-5 minutes):

kubectl get ig default --watch

Create an InferenceClass and InferenceCluster

An InferenceClass defines a hardware recipe (GPU type, count, provisioning config). An InferenceCluster references it to provision GPU node pools.

Apply the L4 InferenceClass, then edit the cluster example to set your GCP project ID and apply it:

kubectl apply -f examples/platform/inference-class-gke-l4.yaml

# Edit examples/platform/inference-cluster-gke.yaml and set
# spec.cluster.gke.project to your GCP project ID.
kubectl apply -f examples/platform/inference-cluster-gke.yaml

This provisions a GKE cluster with an L4 GPU and installs the inference stack. It’s the longest step, taking roughly 20-30 minutes.

kubectl get ic --watch
# Wait until READY shows True, then Ctrl-C.

Deploy a model

When a ModelDeployment does not reference a ModelCache, the inference engine fetches model weights directly from the source (e.g. Hugging Face) at pod startup. The deployment must supply any required credentials via the engine container’s env (e.g. HF_TOKEN), and the engine image must support fetching from that source. For large models or frequent restarts, a ModelCache avoids repeated downloads; see examples/cache/ for cached single-pod and multi-node deployments.

Create the ml-team namespace, deploy the model, and create a ModelService to expose it:

kubectl create namespace ml-team
kubectl apply -f examples/deployment/model-deployment.yaml
kubectl apply -f examples/deployment/model-service.yaml

The deployment’s nodeSelector declares the GPU its model needs as a DRA (Dynamic Resource Allocation) device request (here, a GPU with at least 24Gi of memory). The scheduler matches that request against each cluster’s GPU pools and pins the ModelReplica to a pool that satisfies it. The same request then becomes the DRA ResourceClaim through which the serving pod binds its GPU. Wait for the deployment to become ready:

kubectl get md -n ml-team --watch
# Wait until REPLICAS shows 1, then Ctrl-C.
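
The pool-matching step described above can be pictured as a filter over the available GPU pools. The following toy sketch only illustrates that idea and is not Modelplane's scheduler: pools_satisfying, the pool names, and the one-number-per-pool input format are all hypothetical, and a real device request can constrain more than memory.

```shell
# Print the names of pools whose GPU memory meets the requested minimum.
# Argument: minimum GPU memory in Gi.
# Input on stdin: one "<pool-name> <gpu-memory-in-Gi>" pair per line.
# (Toy illustration only; names and input format are hypothetical.)
pools_satisfying() {
  awk -v min="$1" '$2 + 0 >= min + 0 { print $1 }'
}

# Usage: printf 'pool-a 16\npool-b 24\n' | pools_satisfying 24
```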

Talk to the model

The gateway endpoint is only reachable from inside the kind Docker network. Use a pod to send a request:

kubectl run -i --rm curl-test \
  --image=curlimages/curl \
  --restart=Never \
  -- curl -s http://172.18.255.200/ml-team/qwen/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-0.5B-Instruct",
    "messages": [{"role": "user", "content": "What is Crossplane in one sentence?"}],
    "max_tokens": 100
  }'

You can also get the endpoint URL from the ModelService status:

kubectl get ms qwen -n ml-team -o jsonpath='{.status.address}'
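
Reading the address from the status avoids hard-coding the gateway IP in the request. The following is a minimal sketch: chat_completions_url is a hypothetical helper, and it assumes that .status.address holds the base URL of the model endpoint (for example http://172.18.255.200/ml-team/qwen), to which the OpenAI-style path is appended.

```shell
# Build the chat completions URL from a ModelService base address.
# Strips one trailing slash so the result never contains "//v1".
# (Hypothetical helper; assumes the address is a base URL without the
# /v1/... path.)
chat_completions_url() {
  printf '%s\n' "${1%/}/v1/chat/completions"
}

# Usage:
#   address=$(kubectl get ms qwen -n ml-team -o jsonpath='{.status.address}')
#   chat_completions_url "$address"
```

The resulting URL can be passed to curl in place of the literal address in the request above.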

Clean up

Delete the ModelDeployment before the InferenceCluster. If you delete the cluster first, the deployment gets stuck reconciling against a cluster Crossplane is tearing down.

Delete the InferenceCluster with foreground cascading deletion. The inference stack runs on the workload cluster and must uninstall while that cluster’s API server and kubeconfig still exist. Foreground deletion holds the cluster until the stack is uninstalled. The default (background) deletion tears everything down at once, which leaves the stack’s Helm releases unable to reach the cluster and can orphan cloud resources. For example, an orphaned load balancer security group blocks the VPC from deleting.

Wait for the cluster to be fully deprovisioned before deleting the kind cluster. If you delete the kind cluster while Crossplane is still cleaning up, Crossplane orphans the cloud resources.

kubectl delete md --all -n ml-team
kubectl delete ms --all -n ml-team
kubectl delete ic --all --cascade=foreground

# Wait for the InferenceCluster to be fully deleted.
kubectl get ic --watch
# Wait until no resources remain, then Ctrl-C.

kind delete cluster --name modelplane
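
Because deleting the kind cluster too early orphans cloud resources, a scripted teardown may want a guard in front of the last command. The following is a minimal sketch: safe_to_delete_kind is a hypothetical helper that reuses the kubectl get ic listing from the steps above and treats a failed lookup as "not safe" rather than as "nothing left".

```shell
# Succeed only when the control plane reports no remaining InferenceClusters.
# Returns 0 when none remain, 1 when at least one remains, and 2 when the
# listing itself failed (for example, the cluster is unreachable).
# (safe_to_delete_kind is a hypothetical helper, not a Modelplane command.)
safe_to_delete_kind() {
  listing=$(kubectl get ic --no-headers 2>/dev/null) || return 2
  [ -z "$listing" ]
}

# Usage: safe_to_delete_kind && kind delete cluster --name modelplane
```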