# vLLM Ascend Serving
Manage the lifecycle of a single-node, colocated vllm-ascend online service on a workspace-managed, ready remote container.
This skill takes structured parameters, handles all SSH escaping and remote execution internally, and returns machine-readable JSON. The agent never needs to construct raw shell commands for service management.
## Use this skill when
- the user asks to start / launch / pull up a vllm-ascend service on a managed machine
- the user asks to restart or relaunch a service (possibly with changed flags or env)
- the user asks to check if a running service is alive / ready
- the user asks to stop a running service
- another skill needs to start a service (e.g. ascend-memory-profiling)
## Do not use this skill when
- the task is adding, verifying, repairing, or removing a machine (use machine-management)
- the task is syncing code to the remote container (use remote-code-parity)
- the task is running benchmarks (a separate skill's responsibility)
- the task is offline inference
- the machine is not yet ready in inventory
## Critical rules
- `start` automatically runs remote-code-parity before launching. If parity fails, the start is blocked. `status` and `stop` do not require parity.
- All remote execution goes through the scripts — never construct raw SSH commands for serving.
- Keep local runtime state only under `.vaws-local/serving/`.
- Progress is emitted on stderr as `__VAWS_SERVING_PROGRESS__=<json>`; the final result is printed to stdout as JSON (see the sketch after this list).
- macOS / Linux / WSL: `python3 ...`
- Windows: `py -3 ...`
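For example, a caller can split the two streams like this (a sketch only; the progress payload schema is whatever the scripts emit):

```bash
# stdout carries the final JSON result; stderr carries progress lines.
python3 .agents/skills/vllm-ascend-serving/scripts/serve_start.py \
  --machine blue-a --relaunch >result.json 2>progress.log
grep '^__VAWS_SERVING_PROGRESS__=' progress.log | tail -n 1
jq .status result.json
```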
## Public entry points
### Start a service
```bash
python3 .agents/skills/vllm-ascend-serving/scripts/serve_start.py \
  --machine <alias-or-ip> \
  --model <remote-weight-path> \
  [--served-model-name <name>] \
  [--tp <N>] [--dp <N>] \
  [--devices <0,1,2,3>] \
  [--extra-env KEY=VALUE ...] \
  [--port <N>] \
  [--health-timeout <seconds>] \
  [--wrap-script <remote-path>] \
  [--skip-parity] \
  [-- <extra vllm serve args>]
```
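A filled-in invocation might look like this (the alias, model path, and trailing vllm args are illustrative, not prescribed):

```bash
python3 .agents/skills/vllm-ascend-serving/scripts/serve_start.py \
  --machine blue-a \
  --model /data/models/Qwen3-32B \
  --served-model-name Qwen3-32B \
  --tp 4 --devices 0,1,2,3 \
  --extra-env VLLM_LOGGING_LEVEL=INFO \
  -- --max-model-len 32768
```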
### Launch wrapping (`--wrap-script`)
The serving skill supports a generic `--wrap-script` mechanism. When provided, the vLLM launch command is written as `_serve.sh` in the runtime directory, and the wrapper script is called with two arguments: `$1` = serve script path, `$2` = runtime directory.
This is used by other skills (e.g. ascend-memory-profiling) to wrap the service launch process without the serving skill needing to know the wrapping details. The serving skill is agnostic to what the wrapper does.
The `wrap_script` path is recorded in the serving state so downstream skills can detect it.
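A minimal wrapper might look like the sketch below. This is illustrative only: real wrappers are owned by the calling skills, and `my_profiler` is a placeholder for whatever tool the wrapper interposes.

```bash
#!/usr/bin/env bash
# Hypothetical wrapper: $1 = generated serve script, $2 = runtime directory.
serve_script="$1"
runtime_dir="$2"
# The wrapper decides how the launch command runs; here it is wrapped in a
# placeholder profiler that writes into the runtime directory.
exec my_profiler --output "$runtime_dir/profile" -- bash "$serve_script"
```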
### Relaunch with previous config
```bash
# Exact same config
python3 .agents/skills/vllm-ascend-serving/scripts/serve_start.py \
  --machine <alias> --relaunch

# Add a debug env
python3 .agents/skills/vllm-ascend-serving/scripts/serve_start.py \
  --machine <alias> --relaunch --extra-env VLLM_LOGGING_LEVEL=DEBUG

# Remove an env from previous config
python3 .agents/skills/vllm-ascend-serving/scripts/serve_start.py \
  --machine <alias> --relaunch --unset-env MY_DEBUG_FLAG

# Remove a vllm arg from previous config (use = to avoid argparse ambiguity)
python3 .agents/skills/vllm-ascend-serving/scripts/serve_start.py \
  --machine <alias> --relaunch --unset-args=--enforce-eager

# Relaunch with a different model
python3 .agents/skills/vllm-ascend-serving/scripts/serve_start.py \
  --machine <alias> --relaunch --model /data/models/OtherModel
```
### Probe NPU device availability
```bash
python3 .agents/skills/vllm-ascend-serving/scripts/serve_probe_npus.py \
  --machine <alias-or-ip>
```
Reports which NPU devices are free and which are busy (with PID and HBM details). The probe runs on the bare-metal host, giving cross-container visibility.
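The result is JSON on stdout, so it can be piped straight into `jq` (the field names follow the script's own schema, which is not reproduced here):

```bash
python3 .agents/skills/vllm-ascend-serving/scripts/serve_probe_npus.py \
  --machine blue-a | jq .
```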
### Check status
```bash
python3 .agents/skills/vllm-ascend-serving/scripts/serve_status.py \
  --machine <alias-or-ip>
```
### Stop
```bash
python3 .agents/skills/vllm-ascend-serving/scripts/serve_stop.py \
  --machine <alias-or-ip> [--force]
```
## Local state
Per-machine launch state is stored under `.vaws-local/serving/<alias>.json`.
This file records the last successful launch parameters (model, tp, devices, env, extra args, port, pid, log paths, runtime_dir, wrap_script). It is the basis for `--relaunch` and is read by other skills (e.g. ascend-memory-profiling) in attach mode.
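For example, to inspect the recorded config for a machine aliased `blue-a` (the field selection below assumes the names listed above; the exact schema may differ):

```bash
jq '{model, tp, devices, port, pid, runtime_dir, wrap_script}' \
  .vaws-local/serving/blue-a.json
```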
## Workflow
### 1. Resolve the target machine
The `--machine` argument is looked up in the local machine inventory. The machine must already be managed and ready.
### 2. Stop any existing service
If a previous service is recorded for this machine, it is stopped before launching a new one.
### 3. Run remote-code-parity (start only)
Unless `--skip-parity` is passed, `parity_sync.py` is called to ensure the container has the current local code. If parity fails, the start is blocked.
### 4. Probe NPUs
NPU availability is checked via `npu-smi info` on the bare-metal host (not the container). Host-level probing sees processes from all containers, bypassing PID namespace isolation. Devices with HBM usage above 4 GB are also marked busy, to catch cross-container occupancy:
- If `--devices` is specified, those devices are verified to be free. If any are busy, the start is blocked with the conflict details.
- If `--devices` is not specified but `--tp` is given, the first N free devices are selected automatically, where N = TP × DP (N defaults to TP when DP is not set); see the example after this list.
- If the NPU probe fails (e.g. a driver issue), the failure is treated as a non-fatal warning and the launch continues with the user-specified devices.
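For example, a start with `--tp 4 --dp 2` and no `--devices` claims the first 8 free NPUs (alias and model path are placeholders):

```bash
# N = TP × DP = 4 × 2 = 8 free devices are auto-selected
python3 .agents/skills/vllm-ascend-serving/scripts/serve_start.py \
  --machine blue-a --model /data/models/Qwen3-32B --tp 4 --dp 2
```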
### 5. Validate and launch
- Model path is checked for existence on the remote container.
- A free port is auto-detected (or the explicit `--port` is used).
- A bash launch script is built internally with proper escaping — the agent never sees or edits this script.
- The process is started via `nohup` + `disown` and detached from the SSH session.
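The generated script is internal, but the detachment pattern is roughly this shape (a sketch, not the actual generated code):

```bash
# Illustrative only: launch detached so the server outlives the SSH session.
nohup bash _serve.sh >stdout.log 2>stderr.log </dev/null &
pid=$!
disown
echo "$pid" > serve.pid   # hypothetical; the real script records the PID in state
```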
### 6. Wait for readiness
The script polls `/health` and `/v1/models` until both return success or the timeout expires.
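The polling is roughly equivalent to this loop (a sketch; the script's actual interval and checks may differ, and the URL is a placeholder):

```bash
base_url=http://10.0.0.8:38721          # placeholder
deadline=$((SECONDS + 600))             # cf. --health-timeout
until curl -sf "$base_url/health" >/dev/null \
   && curl -sf "$base_url/v1/models" >/dev/null; do
  (( SECONDS < deadline )) || { echo "health timeout" >&2; exit 1; }
  sleep 5
done
```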
### 6a. Diagnose launch failure before any code change
If the service fails during engine initialization or the health check times out:
- Read both `stdout.log` and `stderr.log` from the remote runtime directory — vllm often logs the actual Python exception to stdout, not stderr.
- Identify the actual exception type and message before hypothesizing a cause.
- Do not modify source code to work around a launch failure until the root cause is confirmed from logs.
- If the root cause is unclear, try the simplest launch configuration first (e.g. tp-only, no speculative decoding, no graph mode) and incrementally add features to isolate the failing component.
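Outside the serving scripts and for diagnosis only: if you have shell access to the container, one way to surface the exception is to grep the tails of both logs. `<container>` and `<runtime_dir>` are placeholders; the real paths come from the start result's `log_stdout` / `log_stderr` fields.

```bash
ssh <container> \
  "tail -n 200 <runtime_dir>/stdout.log <runtime_dir>/stderr.log" \
  | grep -nE 'Traceback|Error|Exception'
```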
### 7. Return structured JSON
On success:
```json
{
  "status": "ready",
  "machine": "blue-a",
  "base_url": "http://10.0.0.8:38721",
  "port": 38721,
  "pid": 12345,
  "served_model_name": "Qwen3-32B",
  "model": "/data/models/Qwen3-32B",
  "log_stdout": "/vllm-workspace/.vaws-runtime/serving/.../stdout.log",
  "log_stderr": "/vllm-workspace/.vaws-runtime/serving/.../stderr.log"
}
```
On failure, the result includes `stderr_tail` for diagnosis.
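A caller can branch on `status` and either smoke-test the endpoint or print the tail (a sketch; the alias, model path, and tp value are placeholders):

```bash
out=$(python3 .agents/skills/vllm-ascend-serving/scripts/serve_start.py \
  --machine blue-a --model /data/models/Qwen3-32B --tp 4 2>/dev/null)
if [ "$(jq -r .status <<<"$out")" = "ready" ]; then
  curl -s "$(jq -r .base_url <<<"$out")/v1/models" | jq .
else
  jq -r '.stderr_tail // empty' <<<"$out"
fi
```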
## Reference files
- `.agents/skills/vllm-ascend-serving/references/behavior.md`
- `.agents/skills/vllm-ascend-serving/references/command-recipes.md`
- `.agents/skills/vllm-ascend-serving/references/acceptance.md`