hugging-face-space-deployer
Create, configure, and deploy Hugging Face Spaces for showcasing ML models. Supports Gradio, Streamlit, and Docker SDKs with templates for common use cases like chat interfaces, image generation, and model comparisons.
# Hugging Face Space Deployer
A skill for AI engineers to create, configure, and deploy interactive ML demos on Hugging Face Spaces.
## CRITICAL: Pre-Deployment Checklist
**Before writing ANY code, gather this information about the model:**
### 1. Check Model Type (LoRA Adapter vs Full Model)
**Use the HF MCP tool to inspect the model files:**
```
hf-skills - Hub Repo Details (repo_ids: ["username/model"], repo_type: "model")
```
**Look for these indicators:**
| Files Present | Model Type | Action Required |
|---------------|------------|-----------------|
| `model.safetensors` or `pytorch_model.bin` | Full model | Load directly with `AutoModelForCausalLM` |
| `adapter_model.safetensors` + `adapter_config.json` | LoRA/PEFT adapter | Must load base model first, then apply adapter with `peft` |
| Only config files, no weights | Broken/incomplete | Ask user to verify |
**If adapter_config.json exists, check for `base_model_name_or_path` to identify the base model.**
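The table above can be expressed as a small check. This is only a sketch: `classify_model_files` and `base_model_of` are hypothetical helper names, and it assumes the repo's file list has already been fetched (for example with `huggingface_hub.list_repo_files`). Following the decision tree later in this document, the presence of `adapter_config.json` alone marks an adapter. Treating sharded checkpoints such as `model-00001-of-00002.safetensors` as full models is an assumption that goes beyond the table.

```python
import json

FULL_WEIGHT_PREFIXES = ("model", "pytorch_model")
WEIGHT_SUFFIXES = (".safetensors", ".bin")


def classify_model_files(filenames):
    """Map a repo's top-level file list to a model type from the table."""
    names = {name for name in filenames if "/" not in name}
    if "adapter_config.json" in names:
        return "lora_adapter"
    # Assumption: sharded weight files also count as a full model.
    for name in names:
        if name.startswith(FULL_WEIGHT_PREFIXES) and name.endswith(WEIGHT_SUFFIXES):
            return "full_model"
    return "incomplete"


def base_model_of(adapter_config_text):
    """Return base_model_name_or_path from adapter_config.json text, or None."""
    return json.loads(adapter_config_text).get("base_model_name_or_path")
```

A result of `"incomplete"` corresponds to the "ask user to verify" row rather than a hard failure.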
### 2. Check Inference API Availability
Visit the model page on the HF Hub and look for the "Inference Providers" widget on the right side.
**Indicators that model HAS Inference API:**
- Inference widget visible on model page
- Model from known provider: `meta-llama`, `mistralai`, `HuggingFaceH4`, `google`, `stabilityai`, `Qwen`
- High download count (>10,000) with standard architecture
**Indicators that model DOES NOT have Inference API:**
- Personal namespace (e.g., `GhostScientist/my-model`)
- LoRA/PEFT adapter (adapters generally have no direct Inference API)
- Missing `pipeline_tag` in model metadata
- No inference widget on model page
### 3. Check Model Metadata
- Ensure `pipeline_tag` is set (e.g., `text-generation`)
- Add `conversational` tag for chat models
### 4. Determine Hardware Needs
| Model Size | Recommended Hardware |
|------------|---------------------|
| < 3B parameters | ZeroGPU (free) or CPU |
| 3B - 7B parameters | ZeroGPU or T4 |
| > 7B parameters | A10G or A100 |
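As a rough illustration, the size table can be encoded as a lookup. The function names here are hypothetical, and the return values are the table's own labels rather than exact hardware flavors. The memory estimate relies only on the fact that fp16 stores two bytes per parameter; it covers the weights alone and ignores activations and the KV cache.

```python
def recommend_hardware(params_billions):
    """Return the table's hardware labels for a parameter count given in billions."""
    if params_billions < 3:
        return ("ZeroGPU", "CPU")
    if params_billions <= 7:
        return ("ZeroGPU", "T4")
    return ("A10G", "A100")


def fp16_weight_gb(params_billions):
    """Approximate size of the weights alone in GB: 2 bytes per parameter in fp16."""
    return params_billions * 2
```

For example, a 7B model needs about 14 GB for its fp16 weights before any inference overhead.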
### 5. Ask User If Unclear
**If you cannot determine the model type, ASK THE USER:**
> "I'm analyzing your model to determine the best deployment strategy. I found:
> - [what you found about files]
> - [what you found about inference API]
>
> Is this model:
> 1. A full model you trained/uploaded?
> 2. A LoRA/PEFT adapter on top of another model?
> 3. Something else?
>
> Also, would you prefer:
> A. Free deployment with ZeroGPU (may have queue times)
> B. Paid GPU for faster response (~$0.60/hr)"
## Hardware Options
| Hardware | Use Case | Cost |
|----------|----------|------|
| `cpu-basic` | Simple demos, Inference API apps | Free |
| `cpu-upgrade` | Faster CPU inference | ~$0.03/hr |
| **`zero-a10g`** | **Models needing GPU on-demand (recommended for most)** | **Free (with quota)** |
| `t4-small` | Small GPU models (<7B) | ~$0.60/hr |
| `t4-medium` | Medium GPU models | ~$0.90/hr |
| `a10g-small` | Large models (7B-13B) | ~$1.50/hr |
| `a10g-large` | Very large models (30B+) | ~$3.15/hr |
| `a100-large` | Largest models | ~$4.50/hr |
**ZeroGPU Note:** ZeroGPU (`zero-a10g`) provides free GPU access on-demand. The Space runs on CPU, and when a user triggers inference, a GPU is allocated temporarily (~60-120 seconds). **After deployment, you must manually set the runtime to "ZeroGPU" in Space Settings > Hardware.**
## Deployment Decision Tree
```
Analyze Model
│
├── Does it have adapter_config.json?
│   └── YES → It's a LoRA adapter
│       ├── Find base_model_name_or_path in adapter_config.json
│       └── Use Template 3 (LoRA + ZeroGPU)
│
├── Does it have model.safetensors or pytorch_model.bin?
│   └── YES → It's a full model
│       └── Is it from a major provider with inference widget?
│           ├── YES → Use Inference API (Template 1)
│           └── NO → Use ZeroGPU (Template 2)
│
└── Neither found?
    └── ASK USER - model may be incomplete
```
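The same tree can be written as a short function, which makes the order of the checks explicit: the adapter check comes first, so a repo that contains both an adapter config and full weights is still treated as an adapter. `choose_template` and its boolean parameters are hypothetical names used for illustration.

```python
def choose_template(has_adapter_config, has_full_weights, major_provider_with_widget):
    """Walk the decision tree and return the template to use, or 'ask user'."""
    if has_adapter_config:
        return "Template 3 (LoRA + ZeroGPU)"
    if has_full_weights:
        if major_provider_with_widget:
            return "Template 1 (Inference API)"
        return "Template 2 (ZeroGPU)"
    # Neither adapter config nor weights: the model may be incomplete.
    return "ask user"
```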
## Dependencies
**For Inference API (cpu-basic, free):**
```
gradio>=5.0.0
huggingface_hub>=0.26.0
```
**For ZeroGPU full models (zero-a10g, free with quota):**
```
gradio>=5.0.0
torch
transformers
accelerate
spaces
```
**For ZeroGPU LoRA adapters (zero-a10g, free with quota):**
```
gradio>=5.0.0
torch
transformers
accelerate
spaces
peft
```
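If the deployment strategy is chosen programmatically, the three dependency lists above can be kept in one place and written out as `requirements.txt`. This is a minimal sketch; the strategy keys and the `requirements_text` helper are hypothetical names, while the package lists are copied from the blocks above.

```python
ZEROGPU_BASE = ["gradio>=5.0.0", "torch", "transformers", "accelerate", "spaces"]

REQUIREMENTS = {
    "inference_api": ["gradio>=5.0.0", "huggingface_hub>=0.26.0"],
    "zerogpu_full": ZEROGPU_BASE,
    "zerogpu_lora": ZEROGPU_BASE + ["peft"],  # adapters additionally need peft
}


def requirements_text(strategy):
    """Return the requirements.txt contents for a deployment strategy."""
    return "\n".join(REQUIREMENTS[strategy]) + "\n"
```

The result can be saved with `pathlib.Path("requirements.txt").write_text(requirements_text("zerogpu_lora"))`.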
## CLI Commands (CORRECT Syntax)
```bash
# Create Space
hf repo create my-space-name --repo-type space --space-sdk gradio
# Upload files
hf upload username/space-name ./local-folder --repo-type space
# Preview which model files would be downloaded (dry run, nothing is fetched)
hf download username/model-name --local-dir ./model-check --dry-run
# Check what files exist in a model
hf download username/model-name --local-dir /tmp/check --dry-run 2>&1 | grep -E '\.(safetensors|bin|json)'
```
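The create and upload steps can also be driven from Python through `huggingface_hub.HfApi`, whose `create_repo` and `upload_folder` methods accept the arguments used below. The sketch takes the API object as a parameter so it can be exercised without network access; `deploy_space` is a hypothetical helper name, and the repo and folder names are the same placeholders as in the CLI examples.

```python
def deploy_space(api, repo_id, folder, sdk="gradio"):
    """Create the Space if it does not exist, then upload a local folder to it."""
    api.create_repo(repo_id=repo_id, repo_type="space", space_sdk=sdk, exist_ok=True)
    api.upload_folder(folder_path=folder, repo_id=repo_id, repo_type="space")
    return f"https://huggingface.co/spaces/{repo_id}"


# Usage (needs `pip install huggingface_hub` and a logged-in token):
#   from huggingface_hub import HfApi
#   deploy_space(HfApi(), "username/space-name", "./local-folder")
```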
## Template 1: Inference API (For Supported Models)
**Use when:** The model has an inference widget, comes from a major provider, or explicitly supports the serverless API.
```python
import gradio as gr
from huggingface_hub import InferenceClient

MODEL_ID = "HuggingFaceH4/zephyr-7b-beta"  # Must support Inference API!
client = InferenceClient(MODEL_ID)


def respond(message, history, system_message, max_tokens, temperature, top_p):
    messages = [{"role": "system", "content": system_message}]
    for user_msg, assistant_msg in history:
        if user_msg:
            messages.append({"role": "user", "content": user_msg})
        if assistant_msg:
            messages.append({"role": "assistant", "content": assistant_msg})
    messages.append({"role": "user", "content": message})

    response = ""
    for token in client.chat_completion(
        messages,
        max_tokens=max_tokens,
        stream=True,
        temperature=temperature,
        top_p=top_p,
    ):
        delta = token.choices[0].delta.content or ""
        response += delta
        yield response


demo = gr.ChatInterface(
    respond,
    title="Chat Assistant",
    description="Powered by Hugging Face Inference API",
    additional_inputs=[
        gr.Textbox(value="You are a helpful assistant.", label="System message"),
        gr.Slider(minimum=1, maximum=2048, value=512, step=1, label="Max tokens"),
        gr.Slider(minimum=0.1, maximum=2.0, value=0.7, step=0.1, label="Temperature"),
        gr.Slider(minimum=0.1, maximum=1.0, value=0.95, step=0.05, label="Top-p"),
    ],
    examples=[
        ["Hello! How are you?"],
        ["Write a Python function to sort a list"],
    ],
)

if __name__ == "__main__":
    demo.launch()
```
**requirements.txt:**
```
gradio>=5.0.0
huggingface_hub>=0.26.0
```
**README.md:**
```yaml
---
title: My Chat App
emoji: 💬
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 5.9.1
app_file: app.py
pinned: false
license: apache-2.0
---
```
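The message-building step in `respond` assumes that Gradio passes `history` as `(user, assistant)` pairs. Pulling that step into its own function, as sketched below, lets it be checked without calling the Inference API. `build_messages` is a hypothetical name; if the chat interface is configured with `type="messages"`, history arrives as role/content dicts instead and this loop would not apply.

```python
def build_messages(system_message, history, message):
    """Convert (user, assistant) history pairs into chat-completion messages."""
    messages = [{"role": "system", "content": system_message}]
    for user_msg, assistant_msg in history:
        # Skip empty turns so no blank messages are sent to the model.
        if user_msg:
            messages.append({"role": "user", "content": user_msg})
        if assistant_msg:
            messages.append({"role": "assistant", "content": assistant_msg})
    messages.append({"role": "user", "content": message})
    return messages
```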
## Template 2: ZeroGPU Full Model (For Models Without Inference API)
**Use when:** Full model (has model.safetensors) but no Inference API support.
```python
import gradio as gr
import spaces
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "username/my-full-model"

# Load tokenizer at startup
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Global model - loaded lazily on first GPU call for faster Space startup
model = None


def load_model():
    global model
    if model is None:
        model = AutoModelForCausalLM.from_pretrained(
            MODEL_ID,
            torch_dtype=torch.float16,
            device_map="auto",
        )
    return model


@spaces.GPU(duration=120)
def generate_response(message, history, system_message, max_tokens, temperature, top_p):
    model = load_model()
    messages = [{"role": "system", "content": system_message}]
    for user_msg, assistant_msg in history:
        if user_msg:
            messages.append({"role": "user", "content": user_msg})
        if assistant_msg:
            messages.append({"role": "assistant", "content": assistant_msg})
    messages.append({"role": "user", "content": message})

    text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
    )
    inputs = tokenize
```