# Deploying Models

ModelStudio supports three deployment paths:

- **LLM models**: deployed as [KubeAI](https://github.com/substratusai/kubeai) `Model` CRDs, sourced from HuggingFace Hub
- **ML models**: deployed as [KServe](https://kserve.github.io/website/) `InferenceService` CRDs, sourced from MLflow or HuggingFace Hub
- **NVIDIA NIM models**: registered from the NVIDIA NGC catalog and deployed in-cluster on GPU nodes via the k8s-nim-operator

---

## LLM Models (KubeAI)

### Discovering LLM Models

Use the **LLM Catalog** page to search HuggingFace. Models are shown across four tabs:

- **Trending**: models gaining momentum
- **Most Downloaded**: sorted by all-time download count
- **Recent**: newest additions
- **Search Results**: results for your query or HuggingFace URL/slug

You can filter by:

| Filter | Options |
|--------|---------|
| Task | Text Generation, Embeddings, Reranking, Text to Speech *(Speech Recognition and Image Generation are planned for future KubeAI releases)* |
| Precision | GGUF (quantized), Safetensors, PyTorch |
| Quantization | Q4_K_M, Q4_0, Q5_K_M, Q6_K, Q8_0, FP16, BF16 |
| Input modality | Text, Image, Audio, Video, File |
| Output modality | Text, Image, Audio, Embeddings, Video |

![LLM Catalog filters](./media/catalog-filters.png)

---

### The LLM Deploy Form

Clicking **Deploy** on a model card opens the deploy modal.

![Deploy modal form](./media/deploy-modal.png)

#### Fields

| Field | Description | Example |
|-------|-------------|---------|
| **Model** | Pre-filled from the LLM Catalog card | `Qwen/Qwen2.5-0.5B-Instruct` |
| **Quantization** | Weight precision level; lower precision gives a smaller, faster, slightly less accurate model | `Q4_K_M` |
| **Engine** | Inference runtime (auto-selected based on model type) | `OLlama`, `vLLM`, `Infinity`, `FasterWhisper` |
| **Features** | Inference capabilities to enable | `TextGeneration`, `TextEmbedding`, `Reranking`, `TextToSpeech` |
| **Resource Profile** | CPU/GPU/memory allocation preset | `cpu-4c-8g` |
| **Replicas** | Number of serving pods | `1` |
| **Scope** | `Private` (only you) or `Shared` (all workspace users) | `Private` |

> **Supported features in the current KubeAI release:** `TextGeneration`, `TextEmbedding`, `Reranking`, `TextToSpeech`. Support for `SpeechToText`, `ImageText` (Vision), and `ImageGeneration` is planned for future KubeAI releases.

#### Resource Profiles

Resource profiles are cluster-defined presets. The list is fetched dynamically from `/api/resource-profiles` based on the nodes available in your cluster. Contact your cluster admin to add custom profiles.

---

### LLM Deployment Lifecycle

After you click **Deploy**, the KubeAI `Model` CRD is created. KubeAI then reconciles the resource through the following states:

```
Pending → Downloading → Starting → Running
```

| Status | Meaning |
|--------|---------|
| `Pending` | CRD created, pods not yet scheduled |
| `Downloading` | Model weights being pulled from HuggingFace |
| `Starting` | Pods scheduled, inference server initializing |
| `Running` | Model ready to serve requests |
| `Failed` | Pod could not start; check cluster logs |

Monitor progress in the **Dashboard** model table or the **LLM Models** page.

---

### Managing Deployed LLM Models

All your LLM deployments are visible on the **LLM Models** page.

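
For scripted monitoring, the lifecycle states listed in the previous section can be polled until a deployment settles. The following is a minimal sketch, not part of ModelStudio: `fetch_status` is a hypothetical callable that returns the current status string (how it obtains that string is left to the caller), and the timeout and interval defaults are arbitrary assumptions.

```python
import time
from typing import Callable

# Status values taken from the LLM lifecycle table.
IN_PROGRESS = {"Pending", "Downloading", "Starting"}
SETTLED = {"Running", "Failed"}


def wait_until_settled(
    fetch_status: Callable[[], str],  # hypothetical: returns the current status
    timeout_s: float = 1800.0,        # assumed default, not a documented value
    interval_s: float = 10.0,         # assumed default, not a documented value
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll until the status is Running or Failed, or the timeout elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        status = fetch_status()
        if status in SETTLED:
            return status
        if status not in IN_PROGRESS:
            raise ValueError(f"unexpected status: {status!r}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"still {status} after {timeout_s} seconds")
        sleep(interval_s)
```

The `sleep` parameter is injectable so the loop can be exercised without real delays; a caller would typically branch on the returned value and inspect cluster logs when it is `Failed`.
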
#### Edit a Model

Click the **Edit** (pencil) icon to update:

- Resource profile
- Replica count

The update is applied via a `PATCH /api/models/{id}` call and takes effect immediately.

#### Delete a Model

Click the **Delete** (trash) icon. The KubeAI CRD is removed and all pods are terminated. This action cannot be undone.

#### Promote / Demote Scope

Use the scope toggle button to move a model between **Private** and **Shared**:

- **Promote to Shared**: the model becomes visible and usable by all workspace users
- **Demote to Private**: the model is restricted back to your account

> Note: promoting renames the CRD from `{username}--{model-id}` to `shared--{model-id}`.

#### Open in Playground

Click the **Playground** icon on any running model to jump directly to the inference workbench with that model pre-selected.

---

### LLM CRD Naming Convention

| Scope | CRD Name Pattern | Example |
|-------|-----------------|---------|
| Private | `{username}--{model-id}` | `alice--qwen2-0-5b` |
| Shared | `shared--{model-id}` | `shared--qwen2-0-5b` |

Model IDs are sanitized to be Kubernetes-safe (lowercase alphanumeric and `-`, max 63 characters). Colons, slashes, underscores, and dots in the original HuggingFace model ID are converted to `-`.

---

## ML Models (KServe)

> ML model deployment requires KServe to be enabled in the platform. The **ML Registry** and **ML Models** pages are hidden when KServe is not available.

### Discovering ML Models

Use the **ML Registry** page. It has two tabs:

- **MLflow**: lists registered models from your connected MLflow model registry. Each model shows its registered versions and lifecycle stages (`None`, `Staging`, `Production`, `Archived`). Requires MLflow to be configured in the platform. All MLflow model formats (sklearn, XGBoost, LightGBM, TensorFlow, PyTorch, ONNX, MLflow pyfunc, HuggingFace) are deployable.
- **HuggingFace ML**: HuggingFace models filtered to ML-oriented task types.
**Only text-based pipeline tasks are currently supported** by `kserve-huggingfaceserver`. Models with vision, audio, tabular, or multimodal pipeline tags show a **Coming Soon** badge and cannot be deployed yet.

**Supported HuggingFace ML pipeline tasks:** `text-generation`, `text2text-generation`, `text-classification`, `token-classification`, `fill-mask`, `question-answering`, `summarization`, `translation`, `feature-extraction`

**Coming Soon (not yet deployable):** `image-classification`, `object-detection`, `tabular-classification`, `tabular-regression`, `image-segmentation`, `audio-classification`, and other vision/audio/multimodal tasks

Click **Deploy** on any supported model card to open the ML deploy form.

---

### The ML Deploy Form

#### Fields

| Field | Description | Example |
|-------|-------------|---------|
| **Model Name** | Auto-filled from MLflow model name or HuggingFace model ID | `my-classifier` |
| **Model Format** | Framework used to train the model | `sklearn`, `xgboost`, `lightgbm`, `tensorflow`, `pytorch`, `onnx`, `mlflow`, `huggingface` |
| **Source** | Where the model artifact lives | `mlflow`, `huggingface` |
| **Min Replicas** | Minimum number of serving pods | `1` |
| **Max Replicas** | Maximum pods for autoscaling | `3` |
| **CPU Request / Limit** | Pod CPU allocation | `500m` / `2` |
| **Memory Request / Limit** | Pod memory allocation | `512Mi` / `4Gi` |
| **GPU** | Optional GPU resource key and count | `nvidia.com/gpu: 1` |

ModelStudio automatically resolves the correct KServe runtime based on the model format:

| Format | KServe Runtime | Notes |
|--------|---------------|-------|
| `sklearn` | kserve-sklearnserver | |
| `xgboost` | kserve-xgbserver | |
| `lightgbm` | kserve-lgbserver | |
| `tensorflow` | kserve-tensorflow-serving | |
| `pytorch` | kserve-torchserve | |
| `onnx` | kserve-tritonserver | |
| `mlflow` | kserve-mlserver (pyfunc) | Auto-detects framework flavor from MLmodel file |
| `huggingface` | kserve-huggingfaceserver | Text-based pipeline tasks only; vision/audio/multimodal coming soon |

For MLflow-sourced models, leaving the format as `mlflow` enables auto-detection: the backend reads the `MLmodel` artifact file and selects the purpose-built runtime (e.g. `kserve-sklearnserver` for sklearn flavors) instead of the generic mlserver.

> **HuggingFace ML models** are limited to text-based pipeline tasks by the current `kserve-huggingfaceserver` runtime. Attempting to deploy a vision, audio, or multimodal HuggingFace model is blocked in the UI with a **Coming Soon** indicator.

---

### ML Deployment Lifecycle

After you click **Deploy**, a KServe `InferenceService` CRD is created. A deployment starts in `Pending` and ends in either `Running` or `Failed`:

```
Pending → Running
Pending → Failed
```

| Status | Meaning |
|--------|---------|
| `Pending` | InferenceService created, model not yet loaded |
| `Running` | Model loaded and ready to serve requests |
| `Failed` | Pod could not start or model failed to load |

Monitor progress in the **ML Models** table.

---

### Managing Deployed ML Models

All your KServe deployments are visible on the **ML Models** page. You can filter by format, source, status, and owner.

#### Delete a Model

Click the **Delete** icon. The KServe `InferenceService` CRD is removed. You can only delete models you own.

> Shared ML models are visible to all workspace users but can only be deleted by the model owner.

---

### KServe InferenceService Naming

| Scope | Name Pattern | Example |
|-------|-------------|---------|
| Private | `ml-p-{model_name}-{user_hash8}` | `ml-p-myclassifier-a1b2c3d4` |
| Shared | `ml-s-{model_name}-{user_hash8}` | `ml-s-myclassifier-a1b2c3d4` |

Model names are truncated and sanitized to fit Kubernetes naming constraints. The 8-character user hash ensures uniqueness when multiple users deploy the same model name.

---

## NVIDIA NIM Models

> NIM requires the NIM feature to be enabled in the platform. The **NIM** tab in Catalog and NGC API Keys in Settings are hidden when NIM is not available.

### Prerequisites

- An [NGC Personal API Key](https://org.ngc.nvidia.com/setup/personal-keys), added in **Settings → NGC API Keys**
- GPU-enabled nodes and the k8s-nim-operator installed by your cluster admin. Contact your admin to confirm availability.

---

### Discovering NIM Models

Go to the **Catalog** page and select the **NIM** tab. The NIM catalog shows models available from NVIDIA NGC, organized by category:

| Category | Examples |
|----------|---------|
| LLM | LLaMA 3.1, Mistral, Gemma, Phi |
| Embedding | NV-Embed, E5-Mistral |
| Reranking | NV-RerankQA |
| Vision-Language | VILA, PaliGemma |
| Speech | Parakeet ASR |

Use the search bar to find a specific model by name.

---

### Registering a NIM Model

Click **Register** on any NIM catalog card to open the registration form.

#### Fields

| Field | Description | Default |
|-------|-------------|---------|
| **Display Name** | Human-readable name shown in Playground and model lists | Basename of the model ID |
| **GPU Count** | Number of GPUs per replica | `1` |
| **Replicas** | Number of serving instances | `1` |
| **Inference Engine** *(advanced)* | `auto` (vLLM for LLMs, operator picks for non-LLMs), `vllm`, or `tensorrt_llm` | `auto` |

NIM inference runs on your cluster's GPU nodes. The NGC API Key saved in **Settings** is used for both NGC catalog metadata and the in-cluster image-pull secret.

Click **Register** to create the registration.

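
To make the form defaults and the `auto` engine rule concrete, here is a small illustrative sketch. It is not ModelStudio's implementation: the `NimRegistration` class, the `resolve_engine` helper, the `is_llm` flag, and the sample model ID are hypothetical names introduced for this example, and it assumes the model ID basename is the part after the last `/`.

```python
from dataclasses import dataclass
from typing import Optional

# Engine options listed in the registration form.
ENGINES = ("auto", "vllm", "tensorrt_llm")


@dataclass
class NimRegistration:
    """Hypothetical container mirroring the registration form fields."""

    model_id: str
    display_name: Optional[str] = None  # defaults to the model ID basename
    gpu_count: int = 1                  # GPUs per replica
    replicas: int = 1                   # serving instances
    engine: str = "auto"                # advanced field

    def __post_init__(self) -> None:
        if self.engine not in ENGINES:
            raise ValueError(f"engine must be one of {ENGINES}")
        if self.display_name is None:
            # Assumption: the basename is the segment after the last "/".
            self.display_name = self.model_id.rsplit("/", 1)[-1]


def resolve_engine(reg: NimRegistration, is_llm: bool) -> Optional[str]:
    """Return the engine to request, or None to let the operator choose."""
    if reg.engine != "auto":
        return reg.engine
    # "auto": vLLM for LLMs, operator's choice for everything else.
    return "vllm" if is_llm else None


# Illustrative model ID only; real IDs come from the NIM catalog.
example = NimRegistration(model_id="example-org/example-llm")
```

With these defaults, `example` requests one GPU and one replica, and `resolve_engine(example, is_llm=True)` selects `vllm`, while a non-LLM model yields `None` so the operator can pick.
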

---

### NIM Deployment Lifecycle

After registering a NIM model, the deployment goes through several phases:

```
Pending → Deploying (cache) → Deploying (service) → Running
```

| Status | Meaning | Typical Duration |
|--------|---------|-----------------|
| `Pending` | Registration created, deployment initializing | Seconds |
| `Deploying` (cache) | Model weights being downloaded to cluster storage | 10–25 min for LLM models |
| `Deploying` (service) | Inference container starting, TensorRT (TRT) engines compiling | 2–5 min on first start |
| `Running` | Ready to serve inference requests | n/a |

The Playground shows a status banner and disables inference for NIM models that are not yet Running.

> **First-time deployment:** TRT engine compilation happens on the first start and takes 2–5 minutes. Subsequent restarts on the same node are faster because compiled engines are cached.

---

### Managing Registered NIM Models

All registered NIM models are shown in the **NIM Models** section of the **Dashboard** and on the **LLM Models** page.

#### Available Actions

| Action | Description |
|--------|-------------|
| View status | Live status with ready replica count |
| Scale replicas | Change replica count up or down (set to 0 to release GPU) |
| Restart pods | Rolling restart of inference pods |
| View logs | Tail the last N lines from the container |
| View events | Recent events from the deployment |
| Unregister | Removes registration and all cluster resources |

> **Unregister:** Unregistering removes the inference deployment and cached model weights from the cluster. This action cannot be undone.

---

### NIM Model Availability

Registered NIM models are visible to all authenticated workspace users, but inference is only available once the deployment reaches Running status. There is no private/shared scope toggle for NIM models: all registered NIM models are workspace-wide.