Deploying Models

ModelStudio supports three deployment paths:

  • LLM models — deployed as KubeAI Model CRDs, sourced from HuggingFace Hub

  • ML models — deployed as KServe InferenceService CRDs, sourced from MLflow or HuggingFace Hub

  • NVIDIA NIM models — registered from the NVIDIA NGC catalog and deployed in-cluster on GPU nodes via the k8s-nim-operator


LLM Models (KubeAI)

Discovering LLM Models

Use the LLM Catalog page to search HuggingFace. Models are shown across four tabs:

  • Trending — models gaining momentum

  • Most Downloaded — sorted by all-time download count

  • Recent — newest additions

  • Search Results — results for your query or HuggingFace URL/slug

You can filter by:

| Filter | Options |
| --- | --- |
| Task | Text Generation, Embeddings, Reranking, Text to Speech (Speech Recognition and Image Generation are coming in future KubeAI releases) |
| Precision | GGUF (quantized), Safetensors, PyTorch |
| Quantization | Q4_K_M, Q4_0, Q5_K_M, Q6_K, Q8_0, FP16, BF16 |
| Input modality | Text, Image, Audio, Video, File |
| Output modality | Text, Image, Audio, Embeddings, Video |

LLM Catalog filters


The LLM Deploy Form

Clicking Deploy on a model card opens the deploy modal.

Deploy modal form

Fields

| Field | Description | Example |
| --- | --- | --- |
| Model | Pre-filled from the LLM Catalog card | Qwen/Qwen2.5-0.5B-Instruct |
| Quantization | Weight precision level; lower = smaller, faster, slightly less accurate | Q4_K_M |
| Engine | Inference runtime (auto-selected based on model type) | OLlama, vLLM, Infinity, FasterWhisper |
| Features | Inference capabilities to enable | TextGeneration, TextEmbedding, Reranking, TextToSpeech |
| Resource Profile | CPU/GPU/memory allocation preset | cpu-4c-8g |
| Replicas | Number of serving pods | 1 |
| Scope | Private (only you) or Shared (all workspace users) | Private |

Supported features in the current KubeAI release: TextGeneration, TextEmbedding, Reranking, TextToSpeech. Support for SpeechToText, ImageText (Vision), and ImageGeneration is planned for future KubeAI releases.
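As a rough illustration of what the form produces, the sketch below assembles a KubeAI Model manifest as a plain Python dict. The `apiVersion`, `url`, `engine`, `features`, `resourceProfile`, and `replicas` keys follow the upstream KubeAI Model schema as commonly documented. How ModelStudio actually maps each form field onto the CRD is an assumption, as are the helper name `build_kubeai_model`, the example CRD name, and the `:1` count suffix on the resource profile.

```python
import json
from typing import Iterable


def build_kubeai_model(name: str, hf_repo: str, engine: str,
                       features: Iterable[str], resource_profile: str,
                       replicas: int) -> dict:
    """Return a KubeAI Model manifest as a dict (nothing is sent to a cluster)."""
    return {
        "apiVersion": "kubeai.org/v1",
        "kind": "Model",
        "metadata": {"name": name},
        "spec": {
            # Assumption: HuggingFace sources are referenced with an hf:// URL.
            "url": f"hf://{hf_repo}",
            "engine": engine,
            "features": list(features),
            # Assumption: upstream KubeAI writes profiles as "<name>:<count>".
            "resourceProfile": resource_profile,
            "replicas": replicas,
        },
    }


manifest = build_kubeai_model(
    name="alice--qwen2-5-0-5b-instruct",   # hypothetical private-scope CRD name
    hf_repo="Qwen/Qwen2.5-0.5B-Instruct",
    engine="VLLM",                          # upstream enum spelling, as an assumption
    features=["TextGeneration"],
    resource_profile="cpu-4c-8g:1",
    replicas=1,
)
print(json.dumps(manifest, indent=2))
```

The dict is only printed here; applying it to a cluster is left to whatever Kubernetes client your environment uses.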

Resource Profiles

Resource profiles are cluster-defined presets. The list is fetched dynamically from /api/resource-profiles based on the nodes available in your cluster. Contact your cluster admin to add custom profiles.
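If you want to inspect the available profiles outside the UI, a request along these lines may work. This is a minimal sketch: only the /api/resource-profiles path comes from this guide. The host name, the bearer-token header, and the assumption that the response is JSON are all hypothetical.

```python
import json
import urllib.request

BASE_URL = "https://modelstudio.example.com"  # hypothetical host


def resource_profiles_request(base_url: str, token: str) -> urllib.request.Request:
    """Build (but do not send) a GET request for the resource profile list."""
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/api/resource-profiles",
        headers={
            "Authorization": f"Bearer {token}",  # assumption: bearer-token auth
            "Accept": "application/json",
        },
        method="GET",
    )


def fetch_resource_profiles(base_url: str, token: str):
    """Send the request and decode the body, assuming it is JSON."""
    with urllib.request.urlopen(resource_profiles_request(base_url, token)) as resp:
        return json.load(resp)


req = resource_profiles_request(BASE_URL, "example-token")
print(req.get_method(), req.full_url)
```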


LLM Deployment Lifecycle

After you click Deploy, the KubeAI Model CRD is created. KubeAI reconciles the resource:

Pending → Downloading → Starting → Running

| Status | Meaning |
| --- | --- |
| Pending | CRD created, pods not yet scheduled |
| Downloading | Model weights being pulled from HuggingFace |
| Starting | Pods scheduled, inference server initializing |
| Running | Model ready to serve requests |
| Failed | Pod could not start; check cluster logs |

Monitor progress in the Dashboard model table or the LLM Models page.
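For scripted monitoring, the lifecycle above can be modeled as a small state set with a polling loop. This is a sketch only: the status strings come from the table, while `wait_until_settled` and the `get_status` callable (which would wrap however you read the status in your environment) are hypothetical.

```python
import time
from enum import Enum
from typing import Callable


class LLMStatus(str, Enum):
    PENDING = "Pending"
    DOWNLOADING = "Downloading"
    STARTING = "Starting"
    RUNNING = "Running"
    FAILED = "Failed"


# States after which the status no longer progresses on its own.
TERMINAL = {LLMStatus.RUNNING, LLMStatus.FAILED}


def wait_until_settled(get_status: Callable[[], str],
                       poll_seconds: float = 5.0,
                       max_polls: int = 120) -> LLMStatus:
    """Poll get_status() until the model is Running or Failed."""
    for _ in range(max_polls):
        status = LLMStatus(get_status())
        if status in TERMINAL:
            return status
        time.sleep(poll_seconds)
    raise TimeoutError("model did not reach Running or Failed in time")


# Demo with a canned sequence instead of a real cluster.
canned = iter(["Pending", "Downloading", "Starting", "Running"])
result = wait_until_settled(lambda: next(canned), poll_seconds=0)
print(result.value)
```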


Managing Deployed LLM Models

All your LLM deployments are visible on the LLM Models page.

Edit a Model

Click the Edit (pencil) icon to update:

  • Resource profile

  • Replica count

The update is applied via a PATCH /api/models/{id} call and takes effect immediately.
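The same update could in principle be issued from a script. The sketch below only builds the request. The PATCH /api/models/{id} path is from this guide, but the JSON field names `resourceProfile` and `replicas`, the host, and the use of the CRD name as the `{id}` value are assumptions.

```python
import json
import urllib.request
from typing import Optional


def build_model_patch(base_url: str, model_id: str,
                      resource_profile: Optional[str] = None,
                      replicas: Optional[int] = None) -> urllib.request.Request:
    """Build (but do not send) a PATCH request that edits a deployed model."""
    payload = {}
    if resource_profile is not None:
        payload["resourceProfile"] = resource_profile  # hypothetical field name
    if replicas is not None:
        payload["replicas"] = replicas                 # hypothetical field name
    if not payload:
        raise ValueError("nothing to update")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/api/models/{model_id}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PATCH",
    )


patch_req = build_model_patch("https://modelstudio.example.com",
                              "alice--qwen2-0-5b", replicas=2)
print(patch_req.get_method(), patch_req.full_url, patch_req.data.decode())
```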

Delete a Model

Click the Delete (trash) icon. The KubeAI CRD is removed and all pods are terminated. This action cannot be undone.

Promote / Demote Scope

Use the scope toggle button to move a model between Private and Shared:

  • Promote to Shared — the model becomes visible and usable by all workspace users

  • Demote to Private — the model is restricted back to your account

Note: promoting renames the CRD from {username}--{model-id} to shared--{model-id}.

Open in Playground

Click the Playground icon on any running model to jump directly to the inference workbench with that model pre-selected.


LLM CRD Naming Convention

| Scope | CRD Name Pattern | Example |
| --- | --- | --- |
| Private | {username}--{model-id} | alice--qwen2-0-5b |
| Shared | shared--{model-id} | shared--qwen2-0-5b |

Model IDs are sanitized to be Kubernetes-safe (lowercase alphanumeric and -, max 63 characters). Colons, slashes, underscores, and dots in the original HuggingFace model ID are converted to -.
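The sanitization rule can be approximated as follows. This is a sketch of the rule as described, not ModelStudio's actual code: replacing every disallowed character (not only colons, slashes, underscores, and dots), trimming stray hyphens, and the helper names are assumptions. Whether the organization prefix of a HuggingFace ID is dropped, and whether the 63-character limit applies before or after the scope prefix is added, is not stated here.

```python
import re


def sanitize_model_id(model_id: str, max_len: int = 63) -> str:
    """Lowercase the ID and map anything outside [a-z0-9-] to a hyphen."""
    cleaned = re.sub(r"[^a-z0-9-]", "-", model_id.lower())
    # Assumption: truncate, then trim hyphens left at either end.
    return cleaned[:max_len].strip("-")


def crd_name(username: str, model_id: str, shared: bool = False) -> str:
    """Apply the {username}--{model-id} / shared--{model-id} patterns."""
    owner = "shared" if shared else username
    return f"{owner}--{sanitize_model_id(model_id)}"


print(crd_name("alice", "Qwen2-0.5B"))
print(crd_name("alice", "Qwen2-0.5B", shared=True))
```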


ML Models (KServe)

ML model deployment requires KServe to be enabled in the platform. The ML Registry and ML Models pages are hidden when KServe is not available.

Discovering ML Models

Use the ML Registry page. It has two tabs:

  • MLflow — lists registered models from your connected MLflow model registry. Each model shows its registered versions and lifecycle stages (None, Staging, Production, Archived). Requires MLflow to be configured in the platform. All MLflow model formats (sklearn, XGBoost, LightGBM, TensorFlow, PyTorch, ONNX, MLflow pyfunc, HuggingFace) are deployable.

  • HuggingFace ML — HuggingFace models filtered to ML-oriented task types. Only text-based pipeline tasks are currently supported by kserve-huggingfaceserver. Models with vision, audio, tabular, or multimodal pipeline tags show a Coming Soon badge and cannot be deployed yet.

    Supported HuggingFace ML pipeline tasks: text-generation, text2text-generation, text-classification, token-classification, fill-mask, question-answering, summarization, translation, feature-extraction

    Coming Soon (not yet deployable): image-classification, object-detection, tabular-classification, tabular-regression, image-segmentation, audio-classification, and other vision/audio/multimodal tasks

Click Deploy on any supported model card to open the ML deploy form.
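The deployability rule for HuggingFace ML models amounts to a membership check on the model's pipeline tag. A minimal sketch, using the supported task list above (the function names are hypothetical, and treating a missing tag as not deployable is an assumption):

```python
from typing import Optional

# Text-based pipeline tasks listed as supported in this guide.
SUPPORTED_TASKS = frozenset({
    "text-generation", "text2text-generation", "text-classification",
    "token-classification", "fill-mask", "question-answering",
    "summarization", "translation", "feature-extraction",
})


def is_deployable(pipeline_tag: Optional[str]) -> bool:
    """True when the pipeline tag is one of the supported text tasks."""
    return pipeline_tag in SUPPORTED_TASKS


def card_label(pipeline_tag: Optional[str]) -> str:
    """Label a model card would show under this rule."""
    return "Deploy" if is_deployable(pipeline_tag) else "Coming Soon"


print(card_label("summarization"))
print(card_label("image-classification"))
```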


The ML Deploy Form

Fields

| Field | Description | Example |
| --- | --- | --- |
| Model Name | Auto-filled from MLflow model name or HuggingFace model ID | my-classifier |
| Model Format | Framework used to train the model | sklearn, xgboost, lightgbm, tensorflow, pytorch, onnx, mlflow, huggingface |
| Source | Where the model artifact lives | mlflow, huggingface |
| Min Replicas | Minimum number of serving pods | 1 |
| Max Replicas | Maximum pods for autoscaling | 3 |
| CPU Request / Limit | Pod CPU allocation | 500m / 2 |
| Memory Request / Limit | Pod memory allocation | 512Mi / 4Gi |
| GPU | Optional GPU resource key and count | nvidia.com/gpu: 1 |
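The form fields above map naturally onto a KServe InferenceService. The sketch below builds such a manifest as a Python dict using standard KServe v1beta1 keys (`minReplicas`, `maxReplicas`, `modelFormat`, `storageUri`, `resources`). The exact manifest ModelStudio generates is not documented here, so the helper name, the example resource name, and the storage URI are hypothetical.

```python
import json
from typing import Optional, Tuple


def build_inference_service(name: str, model_format: str, storage_uri: str,
                            min_replicas: int = 1, max_replicas: int = 3,
                            cpu: Tuple[str, str] = ("500m", "2"),
                            memory: Tuple[str, str] = ("512Mi", "4Gi"),
                            gpu: Optional[Tuple[str, int]] = None) -> dict:
    """Return an InferenceService manifest as a dict (not applied anywhere)."""
    if min_replicas > max_replicas:
        raise ValueError("min_replicas must not exceed max_replicas")
    requests = {"cpu": cpu[0], "memory": memory[0]}
    limits = {"cpu": cpu[1], "memory": memory[1]}
    if gpu is not None:
        key, count = gpu
        # Kubernetes extended resources need identical request and limit.
        requests[key] = str(count)
        limits[key] = str(count)
    return {
        "apiVersion": "serving.kserve.io/v1beta1",
        "kind": "InferenceService",
        "metadata": {"name": name},
        "spec": {"predictor": {
            "minReplicas": min_replicas,
            "maxReplicas": max_replicas,
            "model": {
                "modelFormat": {"name": model_format},
                "storageUri": storage_uri,
                "resources": {"requests": requests, "limits": limits},
            },
        }},
    }


isvc = build_inference_service(
    name="ml-p-myclassifier-a1b2c3d4",
    model_format="sklearn",
    storage_uri="s3://example-bucket/models/my-classifier",  # hypothetical
    gpu=("nvidia.com/gpu", 1),
)
print(json.dumps(isvc, indent=2))
```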

ModelStudio automatically resolves the correct KServe runtime based on the model format:

| Format | KServe Runtime | Notes |
| --- | --- | --- |
| sklearn | kserve-sklearnserver | |
| xgboost | kserve-xgbserver | |
| lightgbm | kserve-lgbserver | |
| tensorflow | kserve-tensorflow-serving | |
| pytorch | kserve-torchserve | |
| onnx | kserve-tritonserver | |
| mlflow | kserve-mlserver (pyfunc) | Auto-detects framework flavor from MLmodel file |
| huggingface | kserve-huggingfaceserver | Text-based pipeline tasks only; vision/audio/multimodal coming soon |

For MLflow-sourced models, leaving the format as mlflow enables auto-detection: the backend reads the MLmodel artifact file and selects the purpose-built runtime (e.g. kserve-sklearnserver for sklearn flavors) instead of the generic mlserver.
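The auto-detection step can be pictured as a lookup over the flavors recorded in the MLmodel file. In this sketch the format-to-runtime mapping is taken from the table above and the `flavors` key is standard MLflow. The flavor priority order, the function name, and the choice to take an already-parsed MLmodel dict are assumptions rather than the backend's actual logic.

```python
from typing import Optional

FORMAT_TO_RUNTIME = {
    "sklearn": "kserve-sklearnserver",
    "xgboost": "kserve-xgbserver",
    "lightgbm": "kserve-lgbserver",
    "tensorflow": "kserve-tensorflow-serving",
    "pytorch": "kserve-torchserve",
    "onnx": "kserve-tritonserver",
    "mlflow": "kserve-mlserver",
    "huggingface": "kserve-huggingfaceserver",
}

# Assumption: the first framework flavor found in this order wins.
FLAVOR_PRIORITY = ["sklearn", "xgboost", "lightgbm",
                   "tensorflow", "pytorch", "onnx"]


def resolve_runtime(model_format: str, mlmodel: Optional[dict] = None) -> str:
    """Pick a runtime; for 'mlflow', prefer a purpose-built one if a flavor matches."""
    if model_format == "mlflow" and mlmodel:
        flavors = mlmodel.get("flavors", {})
        for flavor in FLAVOR_PRIORITY:
            if flavor in flavors:
                return FORMAT_TO_RUNTIME[flavor]
    return FORMAT_TO_RUNTIME[model_format]


# A parsed MLmodel file typically lists python_function plus the native flavor.
parsed = {"flavors": {"python_function": {}, "sklearn": {}}}
print(resolve_runtime("mlflow", parsed))
print(resolve_runtime("mlflow", {"flavors": {"python_function": {}}}))
```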

HuggingFace ML models are limited to text-based pipeline tasks by the current kserve-huggingfaceserver runtime. Attempting to deploy a vision, audio, or multimodal HuggingFace model is blocked in the UI with a Coming Soon indicator.


ML Deployment Lifecycle

After clicking Deploy, a KServe InferenceService CRD is created:

Pending → Running
         → Failed

| Status | Meaning |
| --- | --- |
| Pending | InferenceService created, model not yet loaded |
| Running | Model loaded and ready to serve requests |
| Failed | Pod could not start or model failed to load |

Monitor progress in the ML Models table.


Managing Deployed ML Models

All your KServe deployments are visible on the ML Models page.

You can filter by format, source, status, and owner.

Delete a Model

Click the Delete icon. The KServe InferenceService CRD is removed. You can only delete models you own.

Shared ML models are visible to all workspace users but can only be deleted by the model owner.


KServe InferenceService Naming

| Scope | Name Pattern | Example |
| --- | --- | --- |
| Private | ml-p-{model_name}-{user_hash8} | ml-p-myclassifier-a1b2c3d4 |
| Shared | ml-s-{model_name}-{user_hash8} | ml-s-myclassifier-a1b2c3d4 |

Model names are truncated and sanitized to fit Kubernetes naming constraints. The 8-character user hash ensures uniqueness when multiple users deploy the same model name.
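A naming scheme of this shape could be implemented as below. The prefixes and the 8-character hash suffix come from the table. The hash function (SHA-256 of the username here), the sanitization rule, and the 63-character budget are assumptions made for illustration, so the hashes produced will not match real ModelStudio names.

```python
import hashlib
import re


def inference_service_name(model_name: str, username: str,
                           shared: bool = False, max_len: int = 63) -> str:
    """Build ml-p-/ml-s-{model_name}-{user_hash8}, truncating the model name to fit."""
    prefix = "ml-s-" if shared else "ml-p-"
    # Assumption: the 8-character hash is derived from the username.
    user_hash = hashlib.sha256(username.encode("utf-8")).hexdigest()[:8]
    clean = re.sub(r"[^a-z0-9-]", "-", model_name.lower()).strip("-")
    # Leave room for the prefix, the joining hyphen, and the hash.
    budget = max_len - len(prefix) - len(user_hash) - 1
    clean = clean[:budget].rstrip("-")
    return f"{prefix}{clean}-{user_hash}"


private_name = inference_service_name("myclassifier", "alice")
shared_name = inference_service_name("myclassifier", "alice", shared=True)
print(private_name)
print(shared_name)
```

Because the hash depends only on the user in this sketch, two users deploying the same model name get different resource names, which is the uniqueness property described above.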


NVIDIA NIM Models

NIM deployment requires the NIM feature to be enabled in the platform. The NIM tab in the Catalog and the NGC API Keys section in Settings are hidden when NIM is not available.

Prerequisites

  • An NGC Personal API Key — add it in Settings → NGC API Keys

  • GPU-enabled nodes and the k8s-nim-operator installed by your cluster admin. Contact your admin to confirm availability.


Discovering NIM Models

Go to the Catalog page and select the NIM tab. The NIM catalog shows models available from NVIDIA NGC, organized by category:

| Category | Examples |
| --- | --- |
| LLM | LLaMA 3.1, Mistral, Gemma, Phi |
| Embedding | NV-Embed, E5-Mistral |
| Reranking | NV-RerankQA |
| Vision-Language | VILA, PaliGemma |
| Speech | Parakeet ASR |

Use the search bar to find a specific model by name.


Registering a NIM Model

Click Register on any NIM catalog card to open the registration form.

Fields

| Field | Description | Default |
| --- | --- | --- |
| Display Name | Human-readable name shown in Playground and model lists | Model ID basename |
| GPU Count | Number of GPUs per replica | 1 |
| Replicas | Number of serving instances | 1 |
| Inference Engine (advanced) | auto (vLLM for LLMs, operator picks for non-LLMs), vllm, or tensorrt_llm | auto |

NIM inference runs on your cluster’s GPU nodes. The NGC API Key saved in Settings is used for both NGC catalog metadata and the in-cluster image-pull secret. Click Register to create the registration.


NIM Deployment Lifecycle

After registering a NIM model, the deployment goes through several phases:

Pending → Deploying (cache) → Deploying (service) → Running

| Status | Meaning | Typical Duration |
| --- | --- | --- |
| Pending | Registration created, deployment initializing | Seconds |
| Deploying (cache) | Model weights being downloaded to cluster storage | 10–25 min for LLM models |
| Deploying (service) | Inference container starting, TRT engines compiling | 2–5 min on first start |
| Running | Ready to serve inference requests | |

The Playground shows a status banner and disables inference for NIM models that are not yet Running.

First-time deployment: TensorRT (TRT) engine compilation happens on the first start and takes 2–5 minutes. Subsequent restarts on the same node are faster because compiled engines are cached.


Managing Registered NIM Models

All registered NIM models are shown in the NIM Models section of the Dashboard and on the LLM Models page.

Available Actions

| Action | Description |
| --- | --- |
| View status | Live status with ready replica count |
| Scale replicas | Change replica count up or down (set to 0 to release GPU) |
| Restart pods | Rolling restart of inference pods |
| View logs | Tail the last N lines from the container |
| View events | Recent events from the deployment |
| Unregister | Removes registration and all cluster resources |

Unregister: this removes the inference deployment and the cached model weights from the cluster. This action cannot be undone.


NIM Model Availability

Registered NIM models are visible to all authenticated workspace users, but inference is only available once the deployment reaches Running status.

There is no private/shared scope toggle for NIM models — all registered NIM models are workspace-wide.