# vLLM Ascend Serving
Manage the lifecycle of a single-node, colocated vllm-ascend online service on a workspace-managed, ready remote container.
This skill takes structured parameters, handles all SSH escaping and remote execution internally, and returns machine-readable JSON. The agent never needs to construct raw shell commands for service management.
## Use this skill when
- the user asks to start / launch / pull up a vllm-ascend service on a managed machine
- the user asks to restart or relaunch a service (possibly with changed flags or env)
- the user asks to check if a running service is alive / ready
- the user asks to stop a running service
- another skill needs to start a service (e.g. ascend-memory-profiling)
## Do not use this skill when
- the task is adding, verifying, repairing, or removing a machine (use machine-management)
- the task is syncing code to the remote container (use remote-code-parity)
- the task is running benchmarks (a separate skill's responsibility)
- the task is offline inference
- the machine is not yet ready in inventory
## Critical rules
- `start` automatically runs remote-code-parity before launching. If parity fails, the start is blocked. `status` and `stop` do not require parity.
- All remote execution goes through the scripts — never construct raw SSH commands for serving.
- Keep local runtime state only under `.vaws-local/serving/`.
- Progress is emitted on stderr as `__VAWS_SERVING_PROGRESS__=<json>`; the final result is printed to stdout as JSON (see the sketch after this list).
- macOS / Linux / WSL: `python3 ...`
- Windows: `py -3 ...`
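For example, a caller can split the two streams like this (a sketch only; the progress payload schema is whatever the scripts emit):

```bash
# stdout carries the final JSON result; stderr carries progress lines.
python3 .agents/skills/vllm-ascend-serving/scripts/serve_start.py \
  --machine blue-a --relaunch >result.json 2>progress.log
grep '^__VAWS_SERVING_PROGRESS__=' progress.log | tail -n 1
jq .status result.json
```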
## Public entry points
### Start a service
```bash
python3 .agents/skills/vllm-ascend-serving/scripts/serve_start.py \
  --machine <alias-or-ip> \
  --model <remote-weight-path> \
  [--served-model-name <name>] \
  [--tp <N>] [--dp <N>] \
  [--devices <0,1,2,3>] \
  [--extra-env KEY=VALUE ...] \
  [--port <N>] \
  [--health-timeout <seconds>] \
  [--wrap-script <remote-path>] \
  [--skip-parity] \
  [-- <extra vllm serve args>]
```
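A filled-in invocation might look like this (the alias, model path, and trailing vllm args are illustrative, not prescribed):

```bash
python3 .agents/skills/vllm-ascend-serving/scripts/serve_start.py \
  --machine blue-a \
  --model /data/models/Qwen3-32B \
  --served-model-name Qwen3-32B \
  --tp 4 --devices 0,1,2,3 \
  --extra-env VLLM_LOGGING_LEVEL=INFO \
  -- --max-model-len 32768
```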
### Launch wrapping (`--wrap-script`)
The serving skill supports a generic `--wrap-script` mechanism. When provided, the vLLM launch command is written as `_serve.sh` in the runtime directory, and the wrapper script is called with two arguments: `$1` = serve script path, `$2` = runtime directory.
This is used by other skills (e.g. ascend-memory-profiling) to wrap the service launch process without the serving skill needing to know the wrapping details. The serving skill is agnostic to what the wrapper does.
The `wrap_script` path is recorded in the serving state so downstream skills can detect it.
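A minimal wrapper might look like the sketch below. This is illustrative only: real wrappers are owned by the calling skills, and `my_profiler` is a placeholder for whatever tool the wrapper interposes.

```bash
#!/usr/bin/env bash
# Hypothetical wrapper: $1 = generated serve script, $2 = runtime directory.
serve_script="$1"
runtime_dir="$2"
# The wrapper decides how the launch command runs; here it is wrapped in a
# placeholder profiler that writes into the runtime directory.
exec my_profiler --output "$runtime_dir/profile" -- bash "$serve_script"
```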
### Relaunch with previous config
```bash
# Exact same config
python3 .agents/skills/vllm-ascend-serving/scripts/serve_start.py \
  --machine <alias> --relaunch

# Add a debug env
python3 .agents/skills/vllm-ascend-serving/scripts/serve_start.py \
  --machine <alias> --relaunch --extra-env VLLM_LOGGING_LEVEL=DEBUG

# Remove an env from previous config
python3 .agents/skills/vllm-ascend-serving/scripts/serve_start.py \
  --machine <alias> --relaunch --unset-env MY_DEBUG_FLAG

# Remove a vllm arg from previous config (use = to avoid argparse ambiguity)
python3 .agents/skills/vllm-ascend-serving/scripts/serve_start.py \
  --machine <alias> --relaunch --unset-args=--enforce-eager

# Relaunch with a different model
python3 .agents/skills/vllm-ascend-serving/scripts/serve_start.py \
  --machine <alias> --relaunch --model /data/models/OtherModel
```
### Probe NPU device availability
```bash
python3 .agents/skills/vllm-ascend-serving/scripts/serve_probe_npus.py \
  --machine <alias-or-ip>
```
Reports which NPU devices are free and which are busy (with PID and HBM details). The probe runs on the bare-metal host, giving cross-container visibility.
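The result is JSON on stdout, so it can be piped straight into `jq` (the field names follow the script's own schema, which is not reproduced here):

```bash
python3 .agents/skills/vllm-ascend-serving/scripts/serve_probe_npus.py \
  --machine blue-a | jq .
```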
### Check status
```bash
python3 .agents/skills/vllm-ascend-serving/scripts/serve_status.py \
  --machine <alias-or-ip>
```
### Stop
```bash
python3 .agents/skills/vllm-ascend-serving/scripts/serve_stop.py \
  --machine <alias-or-ip> [--force]
```
## Local state
Per-machine launch state is stored under `.vaws-local/serving/<alias>.json`.
This file records the last successful launch parameters (model, tp, devices, env, extra args, port, pid, log paths, runtime_dir, wrap_script). It is the basis for `--relaunch` and is read by other skills (e.g. ascend-memory-profiling) in attach mode.
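For example, to inspect the recorded config for a machine aliased `blue-a` (the field selection below assumes the names listed above; the exact schema may differ):

```bash
jq '{model, tp, devices, port, pid, runtime_dir, wrap_script}' \
  .vaws-local/serving/blue-a.json
```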
## Workflow
### 1. Resolve the target machine
The `--machine` argument is looked up in the local machine inventory. The machine must already be managed and ready.
### 2. Stop any existing service
If a previous service is recorded for this machine, it is stopped before launching a new one.
### 3. Run remote-code-parity (start only)
Unless `--skip-parity` is passed, `parity_sync.py` is called to ensure the container has the current local code. If parity fails, the start is blocked.
### 4. Probe NPUs
NPU availability is checked via `npu-smi info` on the bare-metal host (not the container). Host-level probing sees processes from all containers, bypassing PID namespace isolation. Devices with HBM usage above 4 GB are also marked busy, to catch cross-container occupancy:
- If `--devices` is specified, those devices are verified to be free. If any are busy, the start is blocked with the conflict details.
- If `--devices` is not specified but `--tp` is given, the first N free devices are selected automatically, where N = TP × DP (N defaults to TP when DP is not set); see the example after this list.
- If the NPU probe fails (e.g. a driver issue), the failure is treated as a non-fatal warning and the launch continues with the user-specified devices.
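For example, a start with `--tp 4 --dp 2` and no `--devices` claims the first 8 free NPUs (alias and model path are placeholders):

```bash
# N = TP × DP = 4 × 2 = 8 free devices are auto-selected
python3 .agents/skills/vllm-ascend-serving/scripts/serve_start.py \
  --machine blue-a --model /data/models/Qwen3-32B --tp 4 --dp 2
```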
### 5. Validate and launch
- Model path is checked for existence on the remote container.
- A free port is auto-detected (or the explicit `--port` is used).
- A bash launch script is built internally with proper escaping — the agent never sees or edits this script.
- The process is started via `nohup` + `disown` and detached from the SSH session.
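The generated script is internal, but the detachment pattern is roughly this shape (a sketch, not the actual generated code):

```bash
# Illustrative only: launch detached so the server outlives the SSH session.
nohup bash _serve.sh >stdout.log 2>stderr.log </dev/null &
pid=$!
disown
echo "$pid" > serve.pid   # hypothetical; the real script records the PID in state
```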
### 6. Wait for readiness
The script polls `/health` and `/v1/models` until both return success or the timeout expires.
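The polling is roughly equivalent to this loop (a sketch; the script's actual interval and checks may differ, and the URL is a placeholder):

```bash
base_url=http://10.0.0.8:38721          # placeholder
deadline=$((SECONDS + 600))             # cf. --health-timeout
until curl -sf "$base_url/health" >/dev/null \
   && curl -sf "$base_url/v1/models" >/dev/null; do
  (( SECONDS < deadline )) || { echo "health timeout" >&2; exit 1; }
  sleep 5
done
```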
### 6a. Diagnose launch failure before any code change
If the service fails during engine initialization or the health check times out:
- Read both `stdout.log` and `stderr.log` from the remote runtime directory — vllm often logs the actual Python exception to stdout, not stderr.
- Identify the actual exception type and message before hypothesizing a cause.
- Do not modify source code to work around a launch failure until the root cause is confirmed from logs.
- If the root cause is unclear, try the simplest launch configuration first (e.g. tp-only, no speculative decoding, no graph mode) and incrementally add features to isolate the failing component.
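Outside the serving scripts and for diagnosis only: if you have shell access to the container, one way to surface the exception is to grep the tails of both logs. `<container>` and `<runtime_dir>` are placeholders; the real paths come from the start result's `log_stdout` / `log_stderr` fields.

```bash
ssh <container> \
  "tail -n 200 <runtime_dir>/stdout.log <runtime_dir>/stderr.log" \
  | grep -nE 'Traceback|Error|Exception'
```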
### 7. Return structured JSON
On success:
```json
{
  "status": "ready",
  "machine": "blue-a",
  "base_url": "http://10.0.0.8:38721",
  "port": 38721,
  "pid": 12345,
  "served_model_name": "Qwen3-32B",
  "model": "/data/models/Qwen3-32B",
  "log_stdout": "/vllm-workspace/.vaws-runtime/serving/.../stdout.log",
  "log_stderr": "/vllm-workspace/.vaws-runtime/serving/.../stderr.log"
}
```
On failure, the result includes `stderr_tail` for diagnosis.
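A caller can branch on `status` and either smoke-test the endpoint or print the tail (a sketch; the alias, model path, and tp value are placeholders):

```bash
out=$(python3 .agents/skills/vllm-ascend-serving/scripts/serve_start.py \
  --machine blue-a --model /data/models/Qwen3-32B --tp 4 2>/dev/null)
if [ "$(jq -r .status <<<"$out")" = "ready" ]; then
  curl -s "$(jq -r .base_url <<<"$out")/v1/models" | jq .
else
  jq -r '.stderr_tail // empty' <<<"$out"
fi
```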
## Reference files
- `.agents/skills/vllm-ascend-serving/references/behavior.md`
- `.agents/skills/vllm-ascend-serving/references/command-recipes.md`
- `.agents/skills/vllm-ascend-serving/references/acceptance.md`