Skill: Legacy Ray Dev TPU
Use this skill only when you specifically need the legacy Ray-backed dev TPU workflow. Prefer .agents/skills/dev-tpu/SKILL.md for the current Iris-backed path.
scripts/ray/dev_tpu.py can reserve a temporary TPU VM, sync the repo, and run commands remotely. It is good for:
- quick test and benchmark loops,
- memory debugging,
- profiling and trace capture,
- short experiments where you want direct shell access.
It is a bad fit for long unattended experiments or many concurrent TPU commands.
Critical concurrency rule
Run at most one TPU job at a time on a given dev TPU VM. Do not launch concurrent TPU commands from separate shells, tmux panes, or background jobs against the same dev TPU.
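Operationally, the rule means chaining jobs in one shell so each waits for the previous one, rather than backgrounding them. A minimal local sketch (`run_job` is a hypothetical stand-in for a real `dev_tpu.py execute` invocation, not part of the script):

```bash
# Illustration only: run_job stands in for a real `dev_tpu.py execute` call.
# Chaining with && serializes the jobs and stops if the first one fails;
# backgrounding with & would violate the one-job-at-a-time rule.
run_job() { echo "running: $1"; }
run_job "test-suite" && run_job "benchmark"
```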
Commands
- allocate: reserve a TPU VM and keep it alive while the command runs. This also writes an SSH alias into ~/.ssh/config.
- connect: open an interactive shell on the TPU.
- execute: sync local files to remote ~/marin/ unless --no-sync is passed, then run one command.
- watch: rsync and restart on local file changes.
Prerequisites
- Authenticate to GCP and set up the Marin development environment.
```bash
gcloud auth login
gcloud config set project hai-gcp-models
gcloud auth application-default login
make dev_setup
```
- Ensure your SSH public key is in project metadata:
https://console.cloud.google.com/compute/metadata?resourceTab=sshkeys&project=hai-gcp-models&scopeTab=projectMetadata
Quick Start
Allocate:
```bash
RAY_AUTH_MODE=token uv run scripts/ray/dev_tpu.py \
  --config infra/marin-us-east5-a.yaml \
  allocate
```
Connect interactively:
```bash
RAY_AUTH_MODE=token uv run scripts/ray/dev_tpu.py \
  --config infra/marin-us-east5-a.yaml \
  connect
```
Run one command with sync:
```bash
RAY_AUTH_MODE=token uv run scripts/ray/dev_tpu.py \
  --config infra/marin-us-east5-a.yaml \
  execute -- uv run --package levanter --group test pytest lib/levanter/tests/kernels/test_pallas_fused_cross_entropy_loss.py
```
dev_tpu.py creates an SSH alias for a TPU VM monitored by Ray. By default it combines your username with the config's cluster_name to produce a name like dev-<cluster_name>-<user>.
Stop allocation by pressing Ctrl-C in the terminal that is running allocate.
Agent Usage
Always pass --tpu-name to avoid collisions with other agents.
```bash
export TPU_NAME="${USER}-$(git rev-parse --abbrev-ref HEAD | tr '/' '-')-$(date +%H%M%S)"
```
Then reuse that name for allocate, connect, and execute.
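If the shell may not be inside a git checkout, a slightly more defensive variant of the same naming scheme can be used (the "nobranch" and "agent" fallbacks here are assumptions for illustration, not part of the script):

```bash
# Sketch: same naming scheme as above, but tolerate a missing git repo
# or an unset USER. "nobranch" and "agent" are hypothetical fallbacks.
branch=$(git rev-parse --abbrev-ref HEAD 2>/dev/null | tr '/' '-')
export TPU_NAME="${USER:-agent}-${branch:-nobranch}-$(date +%H%M%S)"
echo "$TPU_NAME"
```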
Practical Patterns
Use repeatable -e KEY=VALUE with execute:
```bash
RAY_AUTH_MODE=token uv run scripts/ray/dev_tpu.py \
  --config infra/marin-us-east5-a.yaml \
  --tpu-name "$TPU_NAME" \
  execute -e LIBTPU_INIT_ARGS="--xla_tpu_scoped_vmem_limit_kib=50000" -- \
  uv run --package levanter --extra tpu lib/levanter/scripts/bench/bench_moe_hillclimb.py
```
Notes:
- .levanter.yaml, .marin.yaml, and .config environment entries are injected automatically.
- execute already wraps the command in bash -c; do not pass your own bash -c.
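The effective behavior described in the notes above can be mimicked locally; this is a sketch of the assumed semantics (exported -e pairs, then the command under bash -c), not the actual implementation:

```bash
# Local stand-in: execute is assumed to export each -e KEY=VALUE pair and
# then run the joined command under bash -c. env(1) reproduces that here.
env LIBTPU_INIT_ARGS="--xla_tpu_scoped_vmem_limit_kib=50000" \
  bash -c 'echo "limit flag: $LIBTPU_INIT_ARGS"'
```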
Fast inner loop
Skip sync with --no-sync when the remote checkout is already current:
```bash
RAY_AUTH_MODE=token uv run scripts/ray/dev_tpu.py \
  --config infra/marin-us-east5-a.yaml \
  --tpu-name "$TPU_NAME" \
  execute --no-sync -- uv run --package levanter --group test pytest lib/levanter/tests/kernels/test_pallas_fused_cross_entropy_loss.py
```
Or SSH directly:
```bash
ssh "dev-tpu-${TPU_NAME}"
cd ~/marin
source ~/.local/bin/env
```
Run remote TPU commands sequentially.
Copy remote artifacts
```bash
scp "dev-tpu-${TPU_NAME}:~/marin/<remote-path>" "<local-path>"
```
Common examples include profiles, traces, logs, and checkpoints. For example:
```bash
mkdir -p ".profiles/${TPU_NAME}"
scp "dev-tpu-${TPU_NAME}:~/marin/.profiles/<run_name>/plugins/profile/*/*" ".profiles/${TPU_NAME}/"
```
Multiple clusters
When using multiple clusters at once, always pass explicit --config and --tpu-name.
Example naming:
- infra/marin-us-central1.yaml with --tpu-name "${USER}-central1"
- infra/marin-us-east5-a.yaml with --tpu-name "${USER}-east5"
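One way to keep such names consistent is to derive the --tpu-name suffix from the config filename; this is a hypothetical convention for illustration, not something the script enforces:

```bash
# Sketch: map infra/marin-<region>.yaml to <user>-<region>.
config=infra/marin-us-east5-a.yaml
region=$(basename "$config" .yaml)   # marin-us-east5-a
region=${region#marin-}              # us-east5-a
tpu_name="${USER:-agent}-${region}"
echo "$tpu_name"
```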
Troubleshooting
Could not infer TPU type from config
Pass --tpu-type explicitly:
```bash
uv run scripts/ray/dev_tpu.py --config <config> allocate --tpu-type v5p-8
```
SSH configuration ... not found
Run allocate first for that --tpu-name, then retry connect or execute.
Verify cleanup after allocate
After finishing work, stop allocation with Ctrl-C in the terminal running allocate.
Recommended verification:
- Confirm the allocator exited cleanly.
- Confirm no local allocate process is still running for that TPU name.
- Confirm the local alias state is cleaned up:
```bash
RAY_AUTH_MODE=token uv run scripts/ray/dev_tpu.py \
  --config <config> \
  --tpu-name <name> execute --no-sync -- /bin/bash -lc 'echo ok'
```
Expected result after cleanup: it should fail with SSH configuration ... not found.
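The check for a lingering local allocate process can be done mechanically with pgrep; a sketch (the bracketed first character keeps the pattern from matching this shell's own command line, and <name> is a placeholder for your TPU name):

```bash
# Sketch: look for a leftover allocate process for one TPU name.
# pgrep -f exits nonzero when nothing matches. <name> is a placeholder.
tpu_name="<name>"
if ! pgrep -f "[d]ev_tpu.py.*allocate.*${tpu_name}" >/dev/null; then
  echo "no stray allocate process for ${tpu_name}"
fi
```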
TPU busy or stale lockfile
If TPU init fails due to lock contention:
```bash
sudo rm -f /tmp/libtpu_lockfile
sudo lsof -t /dev/vfio/* | xargs -r sudo kill -9
```
Then rerun the command.
execute feels slow
It syncs with rsync before each run by default. Use --no-sync or direct SSH for repeated runs.
Reference Examples
Run tests:
```bash
RAY_AUTH_MODE=token uv run scripts/ray/dev_tpu.py \
  --config infra/marin-us-east5-a.yaml \
  --tpu-name "$TPU_NAME" \
  execute -- uv run --package levanter --group test pytest lib/levanter/tests/kernels/test_pallas_fused_cross_entropy_loss.py
```
Run a benchmark:
```bash
RAY_AUTH_MODE=token uv run scripts/ray/dev_tpu.py \
  --config infra/marin-us-east5-a.yaml \
  --tpu-name "$TPU_NAME" \
  execute -e LIBTPU_INIT_ARGS="--xla_tpu_scoped_vmem_limit_kib=50000" -- \
  uv run --package levanter --extra tpu lib/levanter/scripts/bench/bench_moe_mlp_profile.py
```