vllm-ascend-serving — for Claude Code

Tags: vllm-ascend-serving, vllm-ascend-workspace, community, ide skills, vllm-ascend, ascend-memory-profiling, machine-management, remote-code-parity, status

v1.0.0

About this skill

Best fit: ideal for AI agents that need vLLM Ascend serving. Summary: manage the lifecycle of a single-node colocated vllm-ascend online service on a workspace-managed, ready remote container. Supports Claude Code, Cursor, and Windsurf workflows.

Features

vLLM Ascend Serving

Use this skill when:
  • the user asks to start / launch / pull up a vllm-ascend service on a managed machine
  • the user asks to restart or relaunch a service (possibly with changed flags or env)
  • the user asks to check if a running service is alive / ready

Author: maoxx241
Updated: 4/10/2026

Skill Overview

Start with fit, limitations, and setup before diving into the repository.


Why use this skill

vllm-ascend-serving helps agents manage the lifecycle of a single-node colocated vllm-ascend online service on a workspace-managed, ready remote container.

Best suited for

AI agents that need vLLM Ascend serving on managed remote machines.

Actionable use cases for vllm-ascend-serving

  • Start / launch / pull up a vllm-ascend service on a managed machine.
  • Restart or relaunch a service (possibly with changed flags or env).
  • Check whether a running service is alive / ready.

Safety & limitations

  • Do not use this skill for machine management, code syncing, benchmarking, or offline inference; those tasks belong to other skills.
  • Other skills (e.g. ascend-memory-profiling) may invoke this one when they need to start a service.
  • status and stop do not require code parity.


FAQ and installation steps

These questions and steps mirror the structured data on this page.

Frequently asked questions

What is vllm-ascend-serving?

vllm-ascend-serving manages the lifecycle of a single-node colocated vllm-ascend online service on a workspace-managed, ready remote container. It supports Claude Code, Cursor, and Windsurf workflows.

How do I install vllm-ascend-serving?

Run: npx killer-skills add maoxx241/vllm-ascend-workspace. It works with Cursor, Windsurf, VS Code, Claude Code, and more than 19 other IDEs.

What can I use vllm-ascend-serving for?

Key use cases: starting / launching a vllm-ascend service on a managed machine, restarting or relaunching a service (possibly with changed flags or env), and checking whether a running service is alive / ready.

Which IDEs are compatible with vllm-ascend-serving?

This skill is compatible with Cursor, Windsurf, VS Code, Trae, Claude Code, OpenClaw, Aider, Codex, OpenCode, Goose, Cline, Roo Code, Kiro, Augment Code, Continue, GitHub Copilot, Sourcegraph Cody, and Amazon Q Developer. Use the killer-skills CLI for a uniform installation.

Are there limitations with vllm-ascend-serving?

Yes. Do not use it for machine management, code syncing, benchmarking, or offline inference; other skills (e.g. ascend-memory-profiling) may invoke it to start a service; and status and stop do not require code parity.

How to install the skill

  1. Open a terminal

    Open your terminal or command line in the project directory.

  2. Run the install command

    Run: npx killer-skills add maoxx241/vllm-ascend-workspace. The CLI detects your IDE or agent automatically and sets up the skill.

  3. Use the skill

    The skill is now active. Your AI agent can use vllm-ascend-serving immediately in the current project.
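
For convenience, step 2 as a single shell command, copied from the steps above:

bash
# Installs the skill; the CLI auto-detects your IDE or agent.
npx killer-skills add maoxx241/vllm-ascend-workspace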

Source Notes

This page remains useful for installation and source reference. Before relying on it, review the fit, limitations, and upstream repository notes above.

Upstream Repository Material

The section below is adapted from the upstream repository. Use it as supporting material alongside the fit, use-case, and installation summary on this page.

Upstream Source

vllm-ascend-serving


SKILL.md (read-only)

vLLM Ascend Serving

Manage the lifecycle of a single-node colocated vllm-ascend online service on a workspace-managed, ready remote container.

This skill takes structured parameters, handles all SSH escaping and remote execution internally, and returns machine-readable JSON. The agent never needs to construct raw shell commands for service management.

Use this skill when

  • the user asks to start / launch / pull up a vllm-ascend service on a managed machine
  • the user asks to restart or relaunch a service (possibly with changed flags or env)
  • the user asks to check if a running service is alive / ready
  • the user asks to stop a running service
  • another skill needs to start a service (e.g. ascend-memory-profiling)

Do not use this skill when

  • the task is adding, verifying, repairing, or removing a machine (use machine-management)
  • the task is syncing code to the remote container (use remote-code-parity)
  • the task is running benchmarks (a separate skill's responsibility)
  • the task is offline inference
  • the machine is not yet ready in inventory

Critical rules

  • start automatically runs remote-code-parity before launching. If parity fails, start is blocked.
  • status and stop do not require parity.
  • All remote execution goes through the scripts — never construct raw SSH commands for serving.
  • Keep local runtime state only under .vaws-local/serving/.
  • Progress on stderr as __VAWS_SERVING_PROGRESS__=<json>, final result on stdout as JSON.
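
A minimal sketch of how a caller might separate the two streams; the machine alias, model path, and output filenames here are illustrative, not part of the skill:

bash
# Final JSON result goes to stdout, progress events to stderr.
python3 .agents/skills/vllm-ascend-serving/scripts/serve_start.py \
  --machine blue-a --model /data/models/Qwen3-32B \
  > result.json 2> progress.log

# Progress events are stderr lines prefixed with __VAWS_SERVING_PROGRESS__=;
# strip the prefix to recover the JSON payloads.
grep '^__VAWS_SERVING_PROGRESS__=' progress.log | sed 's/^[^=]*=//'

# The final result is a single JSON object on stdout.
python3 -c 'import json; print(json.load(open("result.json"))["status"])'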

Cross-platform launcher rule

  • macOS / Linux / WSL: python3 ...
  • Windows: py -3 ...

Public entry points

Start a service

bash
python3 .agents/skills/vllm-ascend-serving/scripts/serve_start.py \
  --machine <alias-or-ip> \
  --model <remote-weight-path> \
  [--served-model-name <name>] \
  [--tp <N>] [--dp <N>] \
  [--devices <0,1,2,3>] \
  [--extra-env KEY=VALUE ...] \
  [--port <N>] \
  [--health-timeout <seconds>] \
  [--wrap-script <remote-path>] \
  [--skip-parity] \
  [-- <extra vllm serve args>]

Launch wrapping (--wrap-script)

The serving skill supports a generic --wrap-script mechanism. When provided, the vLLM launch command is written as _serve.sh in the runtime directory, and the wrapper script is called with two arguments: $1 = serve script path, $2 = runtime directory.

This is used by other skills (e.g. ascend-memory-profiling) to wrap the service launch process without the serving skill needing to know the wrapping details. The serving skill is agnostic to what the wrapper does.

The wrap_script path is recorded in the serving state so downstream skills can detect it.
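
A minimal wrapper sketch under the contract above; the logging line is illustrative, and a real wrapper (e.g. ascend-memory-profiling) would add its own instrumentation instead:

bash
#!/usr/bin/env bash
# Called by the serving skill as: wrapper.sh <serve-script> <runtime-dir>
serve_script="$1"   # $1 = path to the generated _serve.sh
serve_dir="$2"      # $2 = runtime directory

# Illustrative pre-launch hook; profiling or env setup would go here.
echo "wrapping launch of ${serve_script}" >> "${serve_dir}/wrapper.log"

# Hand control to the actual vLLM launch command.
exec bash "${serve_script}"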

Relaunch with previous config

bash
# Exact same config
python3 .agents/skills/vllm-ascend-serving/scripts/serve_start.py \
  --machine <alias> --relaunch

# Add a debug env
python3 .agents/skills/vllm-ascend-serving/scripts/serve_start.py \
  --machine <alias> --relaunch --extra-env VLLM_LOGGING_LEVEL=DEBUG

# Remove an env from previous config
python3 .agents/skills/vllm-ascend-serving/scripts/serve_start.py \
  --machine <alias> --relaunch --unset-env MY_DEBUG_FLAG

# Remove a vllm arg from previous config (use = to avoid argparse ambiguity)
python3 .agents/skills/vllm-ascend-serving/scripts/serve_start.py \
  --machine <alias> --relaunch --unset-args=--enforce-eager

# Relaunch with a different model
python3 .agents/skills/vllm-ascend-serving/scripts/serve_start.py \
  --machine <alias> --relaunch --model /data/models/OtherModel

Probe NPU device availability

bash
python3 .agents/skills/vllm-ascend-serving/scripts/serve_probe_npus.py \
  --machine <alias-or-ip>

Returns which NPU devices are free and which are busy (with PID and HBM details); probing happens on the bare-metal host for cross-container visibility.

Check status

bash
python3 .agents/skills/vllm-ascend-serving/scripts/serve_status.py \
  --machine <alias-or-ip>

Stop

bash
python3 .agents/skills/vllm-ascend-serving/scripts/serve_stop.py \
  --machine <alias-or-ip> [--force]

Local state

Per-machine launch state is stored under .vaws-local/serving/<alias>.json.

This file records the last successful launch parameters (model, tp, devices, env, extra args, port, pid, log paths, runtime_dir, wrap_script). It is the basis for --relaunch and is read by other skills (e.g. ascend-memory-profiling) in attach mode.
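
A sketch of what such a state file might contain; the field list above comes from the source, but the exact key names and values here are assumptions for illustration:

json
{
  "model": "/data/models/Qwen3-32B",
  "served_model_name": "Qwen3-32B",
  "tp": 4,
  "devices": "0,1,2,3",
  "env": {"VLLM_LOGGING_LEVEL": "DEBUG"},
  "extra_args": ["--enforce-eager"],
  "port": 38721,
  "pid": 12345,
  "log_stdout": "/vllm-workspace/.vaws-runtime/serving/.../stdout.log",
  "log_stderr": "/vllm-workspace/.vaws-runtime/serving/.../stderr.log",
  "runtime_dir": "/vllm-workspace/.vaws-runtime/serving/...",
  "wrap_script": null
}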

Workflow

1. Resolve the target machine

The --machine argument is looked up in the local machine inventory. The machine must already be managed and ready.

2. Stop any existing service

If a previous service is recorded for this machine, it is stopped before launching a new one.

3. Run remote-code-parity (start only)

Unless --skip-parity is passed, parity_sync.py is called to ensure the container has the current local code. If parity fails, start is blocked.

4. Probe NPUs

NPU availability is checked via npu-smi info on the bare-metal host (not the container). Host-level probing sees processes from all containers, bypassing PID namespace isolation. Devices with HBM usage above 4 GB are also marked busy to catch cross-container occupancy:

  • If --devices is specified, those devices are verified to be free. If any are busy, start is blocked with the conflict details.
  • If --devices is not specified but --tp is given, the first N free devices are automatically selected, where N = TP × DP (defaults to TP when DP is not set).
  • If NPU probe fails (e.g. driver issue), it is treated as a non-fatal warning and launch continues with user-specified devices.
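
For example, with --tp 4 --dp 2 the skill auto-selects the first 4 × 2 = 8 free devices; the alias and model path below are illustrative:

bash
# No --devices given: the first TP × DP = 8 free NPUs are chosen automatically.
python3 .agents/skills/vllm-ascend-serving/scripts/serve_start.py \
  --machine blue-a --model /data/models/Qwen3-32B --tp 4 --dp 2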

5. Validate and launch

  • Model path is checked for existence on the remote container.
  • A free port is auto-detected (or the explicit --port is used).
  • A bash launch script is built internally with proper escaping — the agent never sees or edits this script.
  • The process is started via nohup + disown and detached from the SSH session.
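
The detach step has roughly the following shape; this is only a sketch, since the real launch script is generated internally and never seen or edited by the agent:

bash
# Launch the generated serve script detached from the SSH session.
nohup bash _serve.sh > stdout.log 2> stderr.log < /dev/null &
disown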

6. Wait for readiness

The script polls /health and /v1/models until both return success or the timeout expires.
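
An equivalent manual readiness probe; the base URL and port are taken from the example result below and are illustrative:

bash
# Both endpoints must succeed before the service is considered ready.
curl -sf http://10.0.0.8:38721/health && \
  curl -sf http://10.0.0.8:38721/v1/models && echo ready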

6a. Diagnose launch failure before any code change

If the service fails during engine initialization or health check timeout:

  • Read both stdout.log and stderr.log from the remote runtime directory — vllm often logs the actual Python exception to stdout, not stderr.
  • Identify the actual exception type and message before hypothesizing a cause.
  • Do not modify source code to work around a launch failure until the root cause is confirmed from logs.
  • If the root cause is unclear, try the simplest launch configuration first (e.g. tp-only, no speculative decoding, no graph mode) and incrementally add features to isolate the failing component.
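
From a shell inside the remote container, the first diagnostic step might look like this; the "..." segment stands for the run-specific runtime directory and is left elided as in the upstream paths:

bash
# vllm often logs the actual Python exception to stdout.log, not stderr.log.
tail -n 200 /vllm-workspace/.vaws-runtime/serving/.../stdout.log
tail -n 200 /vllm-workspace/.vaws-runtime/serving/.../stderr.log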

7. Return structured JSON

On success:

json
{
  "status": "ready",
  "machine": "blue-a",
  "base_url": "http://10.0.0.8:38721",
  "port": 38721,
  "pid": 12345,
  "served_model_name": "Qwen3-32B",
  "model": "/data/models/Qwen3-32B",
  "log_stdout": "/vllm-workspace/.vaws-runtime/serving/.../stdout.log",
  "log_stderr": "/vllm-workspace/.vaws-runtime/serving/.../stderr.log"
}

On failure, includes stderr_tail for diagnosis.
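
A hedged sketch of the failure shape; only the stderr_tail field is confirmed above, and the other fields and values are assumptions:

json
{
  "status": "failed",
  "machine": "blue-a",
  "stderr_tail": "...last lines of stderr.log for diagnosis..."
}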

Reference files

  • .agents/skills/vllm-ascend-serving/references/behavior.md
  • .agents/skills/vllm-ascend-serving/references/command-recipes.md
  • .agents/skills/vllm-ascend-serving/references/acceptance.md

Related skills

Looking for an alternative to vllm-ascend-serving or another community skill for your workflow? Explore these related open-source skills.

openclaw-release-maintainer (openclaw)

Use this skill for release and publish-time workflows. Covers ai, assistant, and crustacean workflows. Supports Claude Code, Cursor, and Windsurf.

widget-generator (f)

Generates customizable widget plugins for the prompts.chat feed system. Covers ai, artificial-intelligence, and awesome-list workflows. Supports Claude Code.

flags (vercel)

Use this skill when adding or changing framework feature flags in Next.js internals. Covers blog, browser, and compiler workflows. Supports Claude Code, Cursor, and Windsurf.

pr-review (pytorch)

Usage modes: if the user invokes /pr-review with no arguments, no review is performed. Covers autograd, deep-learning, and gpu workflows. Supports Claude Code, Cursor, and Windsurf.