vllm-ascend-serving — a Claude Code skill from the vllm-ascend-workspace community for managing single-node colocated vLLM Ascend serving, with SSH escaping and remote execution handled internally and machine-readable JSON output.

v1.0.0

About This Skill

This skill manages the lifecycle of a single-node colocated vLLM Ascend online service. It handles SSH escaping and remote execution internally and returns machine-readable JSON, making it well suited to AI agents that need seamless integration with vLLM Ascend services.

Features

Start services with an automatic remote-code-parity sync
Restart services with changed flags or environment variables
Check service status using JSON output
Stop running services securely
Manage services on macOS, Linux, WSL, and Windows platforms

Core Topics

maoxx241
Updated: 4/10/2026

Killer-Skills Review

Decision support comes first. Repository text comes second.

Reference-Only Page Review Score: 8/11

This page remains useful for operators, but Killer-Skills treats it as reference material instead of a primary organic landing page.

  • Original recommendation layer
  • Concrete use-case guidance
  • Explicit limitations and caution

Review Score: 8/11
Quality Score: 36
Canonical Locale: en
Detected Body Locale: en


Core Value

Empowers agents to manage the lifecycle of single-node colocated vLLM Ascend online services over SSH, returning machine-readable JSON and supporting features such as remote-code-parity syncing and the wrap-script mechanism, driven by Python 3 scripts.

Suitable Agent Types

Well suited to AI agents that need seamless integration with vLLM Ascend services; SSH escaping and remote execution are handled internally.

Key Capabilities · vllm-ascend-serving

Automating vLLM Ascend service launches with structured parameters
Restarting services with changed flags or environment variables
Checking the status of running services for aliveness and readiness
Stopping running services securely
Integrating with other skills like ascend-memory-profiling for advanced functionality

! Usage Limitations and Requirements

  • Requires a ready remote container and a managed machine
  • Dependent on remote-code-parity for start operations
  • Limited to online service management, excluding tasks like machine management, code syncing, and offline inference

Why this page is reference-only

  • The current locale does not satisfy the locale-governance contract.
  • The underlying skill quality score is below the review floor.

Source Boundary

The section below is supporting source material from the upstream repository. Use the Killer-Skills review above as the primary decision layer.


FAQ and Installation Steps

The questions and steps below match the page's structured data so search engines can understand the page content.

? FAQ

What is vllm-ascend-serving?

It manages the lifecycle of a single-node colocated vLLM Ascend online service, handling SSH escaping and remote execution internally and returning machine-readable JSON for seamless agent integration.

How do I install vllm-ascend-serving?

Run: npx killer-skills add maoxx241/vllm-ascend-workspace/vllm-ascend-serving. It supports 19+ IDEs and agents, including Cursor, Windsurf, VS Code, and Claude Code.

Which scenarios suit vllm-ascend-serving?

Typical scenarios include: automating vLLM Ascend service launches with structured parameters; restarting services with changed flags or environment variables; checking running services for aliveness and readiness; stopping running services securely; and integrating with other skills such as ascend-memory-profiling for advanced functionality.

Which IDEs or agents does vllm-ascend-serving support?

The skill is compatible with Cursor, Windsurf, VS Code, Trae, Claude Code, OpenClaw, Aider, Codex, OpenCode, Goose, Cline, Roo Code, Kiro, Augment Code, Continue, GitHub Copilot, Sourcegraph Cody, and Amazon Q Developer. The Killer-Skills CLI installs it anywhere with a single command.

What are vllm-ascend-serving's limitations?

It requires a ready remote container and a managed machine; start operations depend on remote-code-parity; and it is limited to online service management, excluding machine management, code syncing, and offline inference.

Installation Steps

  1. Open a terminal

    Open a terminal or command line in your project directory.

  2. Run the install command

    Run: npx killer-skills add maoxx241/vllm-ascend-workspace/vllm-ascend-serving. The CLI detects your IDE or AI agent automatically and completes the configuration.

  3. Start using the skill

    vllm-ascend-serving is now enabled and can be invoked in the current project immediately.

! Reference-Page Mode

This page remains usable as an installation and lookup reference, but Killer-Skills no longer treats it as a primary indexable landing page. Read the review conclusions above before deciding whether to consult the upstream repository documentation.

Imported Repository Instructions

The section below is supporting source material from the upstream repository. Use the Killer-Skills review above as the primary decision layer.

Supporting Evidence

vllm-ascend-serving

Manage vLLM Ascend services with this AI agent skill, designed for developers to streamline service lifecycle management and improve productivity.

SKILL.md

vLLM Ascend Serving

Manage the lifecycle of a single-node colocated vllm-ascend online service on a workspace-managed ready remote container.

This skill takes structured parameters, handles all SSH escaping and remote execution internally, and returns machine-readable JSON. The agent never needs to construct raw shell commands for service management.

Use this skill when

  • the user asks to start / launch / pull up a vllm-ascend service on a managed machine
  • the user asks to restart or relaunch a service (possibly with changed flags or env)
  • the user asks to check if a running service is alive / ready
  • the user asks to stop a running service
  • another skill needs to start a service (e.g. ascend-memory-profiling)

Do not use this skill when

  • the task is adding, verifying, repairing, or removing a machine (use machine-management)
  • the task is syncing code to the remote container (use remote-code-parity)
  • the task is running benchmarks (a separate skill's responsibility)
  • the task is offline inference
  • the machine is not yet ready in inventory

Critical rules

  • start automatically runs remote-code-parity before launching. If parity fails, start is blocked.
  • status and stop do not require parity.
  • All remote execution goes through the scripts — never construct raw SSH commands for serving.
  • Keep local runtime state only under .vaws-local/serving/.
  • Progress on stderr as __VAWS_SERVING_PROGRESS__=<json>, final result on stdout as JSON.
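The stream protocol above can be consumed with a few lines of Python. This is a minimal sketch; the payload field (`stage`) is illustrative, not part of the documented contract:

```python
import json

PREFIX = "__VAWS_SERVING_PROGRESS__="

def parse_streams(stderr_text: str, stdout_text: str):
    """Split progress events (stderr) from the final JSON result (stdout)."""
    events = []
    for line in stderr_text.splitlines():
        if line.startswith(PREFIX):
            # Everything after the marker is a JSON payload.
            events.append(json.loads(line[len(PREFIX):]))
    result = json.loads(stdout_text)  # final machine-readable result
    return events, result
```

An agent can surface `events` as live status while waiting for `result`.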

Cross-platform launcher rule

  • macOS / Linux / WSL: python3 ...
  • Windows: py -3 ...

Public entry points

Start a service

```bash
python3 .agents/skills/vllm-ascend-serving/scripts/serve_start.py \
  --machine <alias-or-ip> \
  --model <remote-weight-path> \
  [--served-model-name <name>] \
  [--tp <N>] [--dp <N>] \
  [--devices <0,1,2,3>] \
  [--extra-env KEY=VALUE ...] \
  [--port <N>] \
  [--health-timeout <seconds>] \
  [--wrap-script <remote-path>] \
  [--skip-parity] \
  [-- <extra vllm serve args>]
```

Launch wrapping (--wrap-script)

The serving skill supports a generic --wrap-script mechanism. When provided, the vLLM launch command is written as _serve.sh in the runtime directory, and the wrapper script is called with two arguments: $1 = serve script path, $2 = runtime directory.

This is used by other skills (e.g. ascend-memory-profiling) to wrap the service launch process without the serving skill needing to know the wrapping details. The serving skill is agnostic to what the wrapper does.

The wrap_script path is recorded in the serving state so downstream skills can detect it.
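A wrapper might be sketched as follows. This is hypothetical, not the mechanism ascend-memory-profiling actually uses; only the two-argument contract ($1 = serve script, $2 = runtime directory) comes from the doc, and the `wrapper.log` name is invented:

```python
import os

def build_wrapped_launch(serve_script: str, runtime_dir: str):
    """Record the wrap inside the runtime directory, then return the
    command a real wrapper would hand off to, e.g. via os.execvp(cmd[0], cmd)."""
    with open(os.path.join(runtime_dir, "wrapper.log"), "a") as f:
        f.write(f"wrapping {serve_script}\n")
    return ["bash", serve_script]
```

Because the serving skill is agnostic to the wrapper's behavior, anything from profiler injection to environment setup can happen before the hand-off.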

Relaunch with previous config

```bash
# Exact same config
python3 .agents/skills/vllm-ascend-serving/scripts/serve_start.py \
  --machine <alias> --relaunch

# Add a debug env
python3 .agents/skills/vllm-ascend-serving/scripts/serve_start.py \
  --machine <alias> --relaunch --extra-env VLLM_LOGGING_LEVEL=DEBUG

# Remove an env from previous config
python3 .agents/skills/vllm-ascend-serving/scripts/serve_start.py \
  --machine <alias> --relaunch --unset-env MY_DEBUG_FLAG

# Remove a vllm arg from previous config (use = to avoid argparse ambiguity)
python3 .agents/skills/vllm-ascend-serving/scripts/serve_start.py \
  --machine <alias> --relaunch --unset-args=--enforce-eager

# Relaunch with a different model
python3 .agents/skills/vllm-ascend-serving/scripts/serve_start.py \
  --machine <alias> --relaunch --model /data/models/OtherModel
```

Probe NPU device availability

```bash
python3 .agents/skills/vllm-ascend-serving/scripts/serve_probe_npus.py \
  --machine <alias-or-ip>
```

Returns which NPU devices are free and which are busy (with PID and HBM details), probed on the bare-metal host for cross-container visibility.

Check status

```bash
python3 .agents/skills/vllm-ascend-serving/scripts/serve_status.py \
  --machine <alias-or-ip>
```

Stop

```bash
python3 .agents/skills/vllm-ascend-serving/scripts/serve_stop.py \
  --machine <alias-or-ip> [--force]
```

Local state

Per-machine launch state is stored under .vaws-local/serving/<alias>.json.

This file records the last successful launch parameters (model, tp, devices, env, extra args, port, pid, log paths, runtime_dir, wrap_script). It is the basis for --relaunch and is read by other skills (e.g. ascend-memory-profiling) in attach mode.

Workflow

1. Resolve the target machine

The --machine argument is looked up in the local machine inventory. The machine must already be managed and ready.

2. Stop any existing service

If a previous service is recorded for this machine, it is stopped before launching a new one.

3. Run remote-code-parity (start only)

Unless --skip-parity is passed, parity_sync.py is called to ensure the container has the current local code. If parity fails, start is blocked.

4. Probe NPUs

NPU availability is checked via npu-smi info on the bare-metal host (not the container). Host-level probing sees processes from all containers, bypassing PID namespace isolation. Devices with HBM usage above 4 GB are also marked busy to catch cross-container occupancy:

  • If --devices is specified, those devices are verified to be free. If any are busy, start is blocked with the conflict details.
  • If --devices is not specified but --tp is given, the first N free devices are automatically selected, where N = TP × DP (defaults to TP when DP is not set).
  • If NPU probe fails (e.g. driver issue), it is treated as a non-fatal warning and launch continues with user-specified devices.
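The device-selection rules above can be sketched as a small function. This illustrates the documented behavior only and is not the skill's implementation:

```python
def select_devices(free, tp, dp=None, requested=None):
    """Pick NPU devices per the probe rules.

    free: device ids the host probe reported free.
    requested: explicit --devices list, or None for auto-selection.
    """
    if requested is not None:
        busy = [d for d in requested if d not in free]
        if busy:
            # Start is blocked with the conflict details.
            raise RuntimeError(f"requested devices busy: {busy}")
        return list(requested)
    n = tp * (dp or 1)  # N = TP x DP; N defaults to TP when DP is not set
    if len(free) < n:
        raise RuntimeError(f"need {n} free devices, only {len(free)} available")
    return sorted(free)[:n]  # first N free devices
```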

5. Validate and launch

  • Model path is checked for existence on the remote container.
  • A free port is auto-detected (or the explicit --port is used).
  • A bash launch script is built internally with proper escaping — the agent never sees or edits this script.
  • The process is started via nohup + disown and detached from the SSH session.
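Free-port auto-detection is commonly done by binding to port 0 and letting the kernel choose. A local sketch of the technique (the real skill performs this step on the remote container, not locally):

```python
import socket

def find_free_port() -> int:
    """Ask the OS for a currently unused TCP port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("", 0))  # port 0 = kernel picks an ephemeral port
        return s.getsockname()[1]
```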

6. Wait for readiness

The script polls /health and /v1/models until both return success or the timeout expires.
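The readiness loop can be sketched with an injectable probe, keeping the transport (urllib, curl over SSH, ...) out of the picture; this is illustrative, not the script's actual code:

```python
import time

def wait_ready(probe, timeout=600.0, interval=2.0,
               clock=time.monotonic, sleep=time.sleep):
    """Poll until both endpoints answer, or the timeout expires.

    probe(path) -> True on HTTP success for that path; clock/sleep are
    injectable so the loop can be tested without real waiting."""
    deadline = clock() + timeout
    while clock() < deadline:
        if probe("/health") and probe("/v1/models"):
            return True
        sleep(interval)
    return False
```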

6a. Diagnose launch failure before any code change

If the service fails during engine initialization or health check timeout:

  • Read both stdout.log and stderr.log from the remote runtime directory — vllm often logs the actual Python exception to stdout, not stderr.
  • Identify the actual exception type and message before hypothesizing a cause.
  • Do not modify source code to work around a launch failure until the root cause is confirmed from logs.
  • If the root cause is unclear, try the simplest launch configuration first (e.g. tp-only, no speculative decoding, no graph mode) and incrementally add features to isolate the failing component.

7. Return structured JSON

On success:

```json
{
  "status": "ready",
  "machine": "blue-a",
  "base_url": "http://10.0.0.8:38721",
  "port": 38721,
  "pid": 12345,
  "served_model_name": "Qwen3-32B",
  "model": "/data/models/Qwen3-32B",
  "log_stdout": "/vllm-workspace/.vaws-runtime/serving/.../stdout.log",
  "log_stderr": "/vllm-workspace/.vaws-runtime/serving/.../stderr.log"
}
```

On failure, includes stderr_tail for diagnosis.
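An agent consuming that result might summarize it like this sketch; the success fields match the JSON above, while the failure side assumes only that `status` is not "ready" and that `stderr_tail` may be present:

```python
import json

def summarize_result(raw: str) -> str:
    """Turn the skill's stdout JSON into a one-line agent-facing summary."""
    r = json.loads(raw)
    if r.get("status") == "ready":
        return f'{r["served_model_name"]} ready at {r["base_url"]} (pid {r["pid"]})'
    # On failure the skill includes stderr_tail for diagnosis.
    tail = r.get("stderr_tail", "")
    return f'launch failed: {tail.splitlines()[-1] if tail else "no stderr captured"}'
```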

Reference files

  • .agents/skills/vllm-ascend-serving/references/behavior.md
  • .agents/skills/vllm-ascend-serving/references/command-recipes.md
  • .agents/skills/vllm-ascend-serving/references/acceptance.md

Related Skills

Looking for an alternative to vllm-ascend-serving, or similar community skills to pair with it? Explore the related open-source skills below.

View All

openclaw-release-maintainer (openclaw) — Your own personal AI assistant. Any OS. Any platform. The lobster way. 🦞

widget-generator (f) — Generates customizable plugin widgets for the prompts.chat feedback system.

flags (vercel) — React framework.

pr-review (pytorch) — Tensors and dynamic neural networks in Python with strong GPU acceleration.