Skill: Legacy Ray Dev TPU
Use this skill only when you specifically need the legacy Ray-backed dev TPU workflow. Prefer .agents/skills/dev-tpu/SKILL.md for the current Iris-backed path.
scripts/ray/dev_tpu.py can reserve a temporary TPU VM, sync the repo, and run commands remotely. It is good for:
- quick test and benchmark loops,
- memory debugging,
- profiling and trace capture,
- short experiments where you want direct shell access.
It is a bad fit for long unattended experiments or many concurrent TPU commands.
Critical concurrency rule
Run at most one TPU job at a time on a given dev TPU VM. Do not launch concurrent TPU commands from separate shells, tmux panes, or background jobs against the same dev TPU.
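Operationally, the rule means chaining jobs in one shell so each waits for the previous one, rather than backgrounding them. A minimal local sketch (`run_job` is a hypothetical stand-in for a real `dev_tpu.py execute` invocation, not part of the script):

```bash
# Illustration only: run_job stands in for a real `dev_tpu.py execute` call.
# Chaining with && serializes the jobs and stops if the first one fails;
# backgrounding with & would violate the one-job-at-a-time rule.
run_job() { echo "running: $1"; }
run_job "test-suite" && run_job "benchmark"
```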
Commands
- allocate: reserve a TPU VM and keep it alive while the command runs. This also writes an SSH alias into ~/.ssh/config.
- connect: open an interactive shell on the TPU.
- execute: sync local files to remote ~/marin/ unless --no-sync is passed, then run one command.
- watch: rsync and restart on local file changes.
Prerequisites
- Authenticate to GCP and set up the Marin development environment.
```bash
gcloud auth login
gcloud config set project hai-gcp-models
gcloud auth application-default login
make dev_setup
```
- Ensure your SSH public key is in project metadata:
https://console.cloud.google.com/compute/metadata?resourceTab=sshkeys&project=hai-gcp-models&scopeTab=projectMetadata
Quick Start
Allocate:
```bash
RAY_AUTH_MODE=token uv run scripts/ray/dev_tpu.py \
  --config infra/marin-us-east5-a.yaml \
  allocate
```
Connect interactively:
```bash
RAY_AUTH_MODE=token uv run scripts/ray/dev_tpu.py \
  --config infra/marin-us-east5-a.yaml \
  connect
```
Run one command with sync:
```bash
RAY_AUTH_MODE=token uv run scripts/ray/dev_tpu.py \
  --config infra/marin-us-east5-a.yaml \
  execute -- uv run --package levanter --group test pytest lib/levanter/tests/kernels/test_pallas_fused_cross_entropy_loss.py
```
dev_tpu.py creates an SSH alias for a TPU VM monitored by Ray. By default it combines your username with the config's cluster_name to produce a name like dev-<cluster_name>-<user>.
Stop allocation by pressing Ctrl-C in the terminal that is running allocate.
Agent Usage
Always pass --tpu-name to avoid collisions with other agents.
```bash
export TPU_NAME="${USER}-$(git rev-parse --abbrev-ref HEAD | tr '/' '-')-$(date +%H%M%S)"
```
Then reuse that name for allocate, connect, and execute.
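If the shell may not be inside a git checkout, a slightly more defensive variant of the same naming scheme can be used (the "nobranch" and "agent" fallbacks here are assumptions for illustration, not part of the script):

```bash
# Sketch: same naming scheme as above, but tolerate a missing git repo
# or an unset USER. "nobranch" and "agent" are hypothetical fallbacks.
branch=$(git rev-parse --abbrev-ref HEAD 2>/dev/null | tr '/' '-')
export TPU_NAME="${USER:-agent}-${branch:-nobranch}-$(date +%H%M%S)"
echo "$TPU_NAME"
```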
Practical Patterns
Use repeatable -e KEY=VALUE with execute:
```bash
RAY_AUTH_MODE=token uv run scripts/ray/dev_tpu.py \
  --config infra/marin-us-east5-a.yaml \
  --tpu-name "$TPU_NAME" \
  execute -e LIBTPU_INIT_ARGS="--xla_tpu_scoped_vmem_limit_kib=50000" -- \
  uv run --package levanter --extra tpu lib/levanter/scripts/bench/bench_moe_hillclimb.py
```
Notes:
- .levanter.yaml, .marin.yaml, and .config environment entries are injected automatically.
- execute already wraps the command in bash -c; do not pass your own bash -c.
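The effective behavior described in the notes above can be mimicked locally; this is a sketch of the assumed semantics (exported -e pairs, then the command under bash -c), not the actual implementation:

```bash
# Local stand-in: execute is assumed to export each -e KEY=VALUE pair and
# then run the joined command under bash -c. env(1) reproduces that here.
env LIBTPU_INIT_ARGS="--xla_tpu_scoped_vmem_limit_kib=50000" \
  bash -c 'echo "limit flag: $LIBTPU_INIT_ARGS"'
```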
Fast inner loop
Skip sync with --no-sync when the remote checkout is already current:
```bash
RAY_AUTH_MODE=token uv run scripts/ray/dev_tpu.py \
  --config infra/marin-us-east5-a.yaml \
  --tpu-name "$TPU_NAME" \
  execute --no-sync -- uv run --package levanter --group test pytest lib/levanter/tests/kernels/test_pallas_fused_cross_entropy_loss.py
```
Or SSH directly:
```bash
ssh "dev-tpu-${TPU_NAME}"
cd ~/marin
source ~/.local/bin/env
```
Run remote TPU commands sequentially.
Copy remote artifacts
```bash
scp "dev-tpu-${TPU_NAME}:~/marin/<remote-path>" "<local-path>"
```
Common examples include profiles, traces, logs, and checkpoints. For example:
```bash
mkdir -p ".profiles/${TPU_NAME}"
scp "dev-tpu-${TPU_NAME}:~/marin/.profiles/<run_name>/plugins/profile/*/*" ".profiles/${TPU_NAME}/"
```
Multiple clusters
When using multiple clusters at once, always pass explicit --config and --tpu-name.
Example naming:
- infra/marin-us-central1.yaml with --tpu-name "${USER}-central1"
- infra/marin-us-east5-a.yaml with --tpu-name "${USER}-east5"
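One way to keep such names consistent is to derive the --tpu-name suffix from the config filename; this is a hypothetical convention for illustration, not something the script enforces:

```bash
# Sketch: map infra/marin-<region>.yaml to <user>-<region>.
config=infra/marin-us-east5-a.yaml
region=$(basename "$config" .yaml)   # marin-us-east5-a
region=${region#marin-}              # us-east5-a
tpu_name="${USER:-agent}-${region}"
echo "$tpu_name"
```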
Troubleshooting
Could not infer TPU type from config
Pass --tpu-type explicitly:
```bash
uv run scripts/ray/dev_tpu.py --config <config> allocate --tpu-type v5p-8
```
SSH configuration ... not found
Run allocate first for that --tpu-name, then retry connect or execute.
Verify cleanup after allocate
After finishing work, stop allocation with Ctrl-C in the terminal running allocate.
Recommended verification:
- Confirm the allocator exited cleanly.
- Confirm no local allocate process is still running for that TPU name.
- Confirm the local alias state is cleaned up:
```bash
RAY_AUTH_MODE=token uv run scripts/ray/dev_tpu.py \
  --config <config> \
  --tpu-name <name> execute --no-sync -- /bin/bash -lc 'echo ok'
```
Expected result after cleanup: it should fail with SSH configuration ... not found.
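The check for a lingering local allocate process can be done mechanically with pgrep; a sketch (the bracketed first character keeps the pattern from matching this shell's own command line, and <name> is a placeholder for your TPU name):

```bash
# Sketch: look for a leftover allocate process for one TPU name.
# pgrep -f exits nonzero when nothing matches. <name> is a placeholder.
tpu_name="<name>"
if ! pgrep -f "[d]ev_tpu.py.*allocate.*${tpu_name}" >/dev/null; then
  echo "no stray allocate process for ${tpu_name}"
fi
```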
TPU busy or stale lockfile
If TPU init fails due to lock contention:
```bash
sudo rm -f /tmp/libtpu_lockfile
sudo lsof -t /dev/vfio/* | xargs -r sudo kill -9
```
Then rerun the command.
execute feels slow
It syncs with rsync before each run by default. Use --no-sync or direct SSH for repeated runs.
Reference Examples
Run tests:
```bash
RAY_AUTH_MODE=token uv run scripts/ray/dev_tpu.py \
  --config infra/marin-us-east5-a.yaml \
  --tpu-name "$TPU_NAME" \
  execute -- uv run --package levanter --group test pytest lib/levanter/tests/kernels/test_pallas_fused_cross_entropy_loss.py
```
Run a benchmark:
```bash
RAY_AUTH_MODE=token uv run scripts/ray/dev_tpu.py \
  --config infra/marin-us-east5-a.yaml \
  --tpu-name "$TPU_NAME" \
  execute -e LIBTPU_INIT_ARGS="--xla_tpu_scoped_vmem_limit_kib=50000" -- \
  uv run --package levanter --extra tpu lib/levanter/scripts/bench/bench_moe_mlp_profile.py
```