# Investigate Golden Flake
Debug a golden test failure from CI logs, identify the root cause, and fix it.
Input: The user provides a GitHub Actions URL (run or job link) for a failing golden test.
## Step 1: Get the CI failure details
Extract the run ID and job ID from the URL and fetch the failed logs:
```bash
gh run view <run_id> --job <job_id> --log-failed
```
If the user gave a run URL without a job ID, list jobs first:
```bash
gh run view <run_id> --json jobs --jq '.jobs[] | "\(.databaseId) \(.name) \(.conclusion)"'
```
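If you want to automate the ID extraction, a minimal sketch follows. It assumes the standard GitHub Actions URL shapes (`.../actions/runs/<run_id>` for run links and `.../actions/runs/<run_id>/job/<job_id>` for job links); the helper name is illustrative, not part of this repo.

```python
import re

def parse_actions_url(url: str):
    """Hypothetical helper: pull run and job IDs out of a GitHub Actions URL.

    Assumed shapes:
      .../actions/runs/<run_id>               (run link)
      .../actions/runs/<run_id>/job/<job_id>  (job link)
    """
    m = re.search(r"/actions/runs/(\d+)(?:/job/(\d+))?", url)
    if not m:
        raise ValueError(f"not a GitHub Actions run/job URL: {url}")
    # job_id is None for bare run links; list jobs with gh in that case
    return m.group(1), m.group(2)
```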
Identify from the logs:
- Which golden test failed (e.g., `test_golden_dark_depths_combo`)
- The diff output — the `_json_diff` lines showing expected vs. actual values
- What phase failed — prompt comparison (`assert_golden_prompt`) or export comparison (`assert_golden_export`)
- The RPC trace — the `[RPC #N] -> tools/call(...)` lines show the exact sequence of MCP tool calls
## Key invariant: priority serialization
The XMage server serializes priority — only one player has it at a time. A bridge/LLM only becomes active when its player has priority. Both bridges can NEVER be active simultaneously. This means:
- Game replay MUST be deterministic. Given the same deck and the same sequence of player decisions, the game state progression is fully determined. If a golden test produces different results across runs, that's a bug in the Java bridge code, not an inherent property of the system.
- Never modify test scripts to work around nondeterminism. Changing `pass_priority()` to `pass_priority(until="precombat_main")` or adding `my_turn` yields to "make it deterministic" is papering over a bug. The plain `pass_priority()` should already be deterministic because only one bridge processes callbacks at a time.
- The fix is always in the Java bridge. If the `firstPass` logic, `actionsPassed` counter, or auto-pass loop produces different results across runs, something is violating the priority serialization invariant. Find where and fix it.
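The invariant can be stated as a property: replaying the same decision sequence twice must produce the same state. A toy Python sketch of that property (names illustrative, not the real bridge code):

```python
def replay(step, initial_state, decisions):
    """With priority serialized, game evolution is a pure fold over decisions."""
    state = initial_state
    for decision in decisions:
        state = step(state, decision)
    return state

def is_deterministic(step, initial_state, decisions):
    # Two replays of the same decision sequence must agree. If they do not,
    # `step` has a hidden input (a race, shared mutable state, timing),
    # which in this system means a bug in the Java bridge.
    return replay(step, initial_state, decisions) == replay(step, initial_state, decisions)
```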
## Bridge contract: block for decisions, not callbacks
Every MCP tool call should block until the bridge reaches a real player decision point (or a terminal condition like game over), not merely until the next callback arrives. Treat this as the external contract even if the current implementation uses `pendingAction` internally.
`pendingAction` is an implementation detail, not the semantic boundary. Do not describe or "fix" the system as if "next callback" and "next decision" were equivalent.
- A callback can be transient. The callback thread may emit a follow-up callback that is immediately auto-resolved, merged into a later prompt, or replaced by a more meaningful decision before the model should act.
- "I observed a callback" is not success. `choose_action()` and `pass_priority()` are only behaving correctly if they return at a stable prompt the player should answer next.
- When debugging hangs or prompt drift, ask this first: "Did the bridge block/return on a callback boundary, or on a true decision boundary?" If the answer is "callback boundary," that is likely the root cause.
- A warning like `Action changed before choices were fetched` is not harmless churn. It is evidence that the bridge detected a callback but lost the stable decision handoff that the MCP contract actually requires.
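The "block for decisions, not callbacks" contract can be sketched in a few lines of Python. This is a toy model, not the bridge's actual implementation; the class and its `transient` flag are hypothetical:

```python
import threading

class DecisionGate:
    """Toy sketch of the contract above: callers block until a *stable*
    player decision is available, not merely until the next callback."""

    def __init__(self):
        self._cond = threading.Condition()
        self._decision = None  # latest stable decision, or None

    def on_callback(self, action, transient):
        # Callback thread: transient callbacks (auto-resolved, superseded,
        # or merged into a later prompt) must not wake waiting tool calls.
        with self._cond:
            if transient:
                return
            self._decision = action
            self._cond.notify_all()

    def wait_for_decision(self, timeout=None):
        # Tool-call thread: returns only at a real decision boundary.
        with self._cond:
            self._cond.wait_for(lambda: self._decision is not None, timeout)
            decision, self._decision = self._decision, None
            return decision
```

A bridge that instead woke `wait_for_decision` on every callback would exhibit exactly the prompt drift described above.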
## Step 2: Classify the failure
Golden flakes fall into a few known categories. Match the diff against these patterns:
- Deterministic prompt text drift: If the diff is only wording/punctuation in prompt `content` fields and reproduces locally every time, check recent Python prompt-rendering changes first. Refactors that "only" touch strings (for example swapping an em dash for a hyphen in a chat reminder) can break prompt goldens without any Java race.
### Race conditions (timing-dependent)
- `game_seq` drift: Same payload, different `game_seq` values. Caused by `lastGameView` being updated from asynchronous `gameUpdate` callbacks instead of the authoritative callback. Look at `updateLastGameView()` call sites in `BridgeCallbackHandler.java`.
- Empty `get_game_history`: `cursor: N -> 0`, `event_count: N -> 0`, history becomes `No game events recorded yet.`. Race between `handleGameOver()` cache population and Python's post-script `get_game_history` call.
- `bridge_join` timeout: Timeout waiting for bridge/potato to join the table. Usually a `keepAlive` loop issue where the previous game's cleanup races with the next game's setup. Check `gameFinishedLatch`, `handleGameOver`, and `client.stop()` ordering.
- Missing or extra callbacks: `pass_priority` or `choose_action` returns different results. The bridge auto-pass loop (`actionsPassed` counter, `firstPass` logic) saw a different number of server callbacks. This violates the priority serialization invariant — find where the bridge is seeing callbacks out of order or dropping/duplicating them.
### Nondeterministic ordering
- `llmEvents` / `llmTrace` order: Events at the same `seq` for different players interleave differently. Check if `_strip_volatile` sorts by `(seq, player)`.
- Embedded JSON key order: Same data, different serialization order. Check if `_normalize_embedded_json` is being applied.
- Short-ID churn: Same cards get different `pN` IDs across runs. Check `ShortIdRegistry` and whether IDs are assigned in a deterministic order.
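The first two ordering fixes are simple to express. A hedged sketch of both normalizations, assuming a hypothetical event shape with `seq`, `player`, and a JSON-carrying `payload_json` string field (the real helpers live in `tests/golden_helpers.py` and may differ):

```python
import json

def normalize_events(events):
    """Illustrative sketch of the two normalizations named above."""
    # 1. Stable ordering: sort by (seq, player) so events that share a seq
    #    cannot interleave differently across runs.
    ordered = sorted(events, key=lambda e: (e["seq"], e["player"]))
    # 2. Canonical embedded JSON: re-serialize JSON-carrying string fields
    #    with sorted keys so dict key order cannot differ across runs.
    for event in ordered:
        if "payload_json" in event:
            event["payload_json"] = json.dumps(
                json.loads(event["payload_json"]), sort_keys=True
            )
    return ordered
```

Note the distinction from `_strip_volatile`'s real job: these fields are deterministic data in a nondeterministic presentation, so they get normalized, not stripped.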
### Logic bugs exposed by timing
- Auto-pass behavior change: A `pass_priority` call passes more or fewer times than expected. Check the `firstPass` optimization, `yieldUntil*` flags, and playable filtering in `passPriority()`.
- Stale response: A callback gets answered by a response meant for a different callback. Look for `skip()`, `waitResponseOpen()`, and `sendPlayerBoolean()` ordering issues.
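The stale-response pattern has a standard shape of fix: tag each response window with a generation counter so a late answer cannot be consumed by the wrong callback. A toy Python sketch (class and method names are hypothetical, not the bridge's API):

```python
import threading

class ResponseSlot:
    """Toy sketch: tag each response window with a generation counter so a
    late response aimed at callback N cannot answer callback N+1."""

    def __init__(self):
        self._lock = threading.Lock()
        self._generation = 0
        self.response = None

    def open_for_callback(self):
        # Called when a new callback starts waiting for a response.
        with self._lock:
            self._generation += 1
            self.response = None
            return self._generation

    def submit(self, generation, value):
        # Reject responses produced for an earlier, already-superseded callback.
        with self._lock:
            if generation != self._generation:
                return False  # stale
            self.response = value
            return True
```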
## Step 3: Read the relevant code
Based on the classification, read the key files. Almost all golden flakes trace back to:
- `Mage.Client.Bridge/src/main/java/mage/client/bridge/BridgeCallbackHandler.java` — the central 5000+ line file. Key sections:
  - `passPriority()` (~line 3019): The main priority-passing loop with auto-pass logic
  - `handleGameOver()` (~line 5382): Game cleanup, bridge event caching, `activeGames` removal
  - `pullBridgeEvents()` (~line 2709): Pulls events from server, falls back to cache
  - `getGameHistory()` (~line 2737): Formats bridge events into human-readable history
  - `handleGameUpdate()`: Asynchronous game state updates from the server
- `tests/golden_helpers.py` — test infrastructure:
  - `run_golden_scenario()` (~line 611): Orchestrates the full test (spectator, bridges, replay, comparison)
  - `_strip_volatile()` (~line 993): Removes timing-dependent fields from export data
  - `_normalize_prompt_for_golden()` (~line 924): Normalizes prompt data (strips short IDs, sorts JSON)
  - `assert_golden_export()` (~line 1022): Export comparison with normalization
  - `assert_golden_prompt()` (~line 949): Prompt comparison
- `src/magebench/pilot/replay.py` — replay script execution:
  - `execute_replay_script()`: Runs the scripted tool calls, captures post-script `get_game_state` + `get_game_history`
- `tests/conftest.py` — fixtures for XMage server, bridge sessions, spectator process
Also read the specific golden test file (`tests/test_golden_<name>.py`) and its golden files (`tests/golden/prompts/<name>.json` and `tests/golden/exports/<name>.json`).
## Step 4: Trace the bug
Follow the execution path to understand what happened:
- Read the diff carefully. The `[N].content.field: expected -> actual` format tells you exactly which message in the prompt/export array changed and which field diverged.
- Cross-reference with the RPC trace. The `[RPC #N]` lines in the test stdout map to the message array indices. Find the RPC that produced the divergent result.
- Identify the threading model. Golden tests run two replay bridges + a spectator concurrently. Player A and Player B each run on their own thread, and the XMage server fires callbacks on a separate callback thread. Identify which threads are involved in the race.
- Check ordering constraints. Most races are about operations happening in the wrong order. Look for:
  - `activeGames.remove()` vs. cache population (`handleGameOver`)
  - `actionLock.notifyAll()` vs. state updates
  - `client.stop()` vs. pending RPCs
  - `gameFinishedLatch.countDown()` vs. next game setup
## Step 5: Fix the root cause
Critical rules from `AGENTS.md`:
- Never add graceful fallbacks or silent defaults. If the data should be there, fix the ordering so it IS there. Don't add `if empty: wait` or `if null: return default`.
- Never strip a field from `_strip_volatile` to mask the flake. `_strip_volatile` is for fields that are inherently volatile (timestamps, wall-clock durations). If a field like `game_seq` is nondeterministic, fix the source of nondeterminism.
- Never re-run CI to work around a failure. The flake has a root cause — find it.
Common fix patterns:
- Reorder operations: Move state population before the signal that makes it visible to other threads (e.g., populate the cache before removing from `activeGames`).
- Use the authoritative callback: The XMage server sends specific callbacks (game-over, game-update) with authoritative state. Use that state instead of racing with asynchronous pushes.
- Wait for the spectator to actually watch before replay starts: `wait-for-ready` only proves the table exists. If the observer auto-watches later, it can inject hand-permission dialogs mid-combat and strand both bridges in `passPriority`. The correct barrier belongs after `bridge_join` and before replay, using an explicit "spectator is watching" signal from the observer.
- "Watching" must mean post-initialization, not merely attached: A signal emitted from `watchGame()` can still be too early if the observer has not yet processed a real `GameView` and issued its initial hand-permission requests. Prefer signaling readiness from the first callback path that carries a populated `GameView` (for example `init(...)` / `updateGame(...)` after `requestHandPermissions(...)`); otherwise the harness can resume replay in the middle of spectator setup and recreate the same flake.
- Fresh observer fixtures can still lose the first real `GAME_INIT`: Recreating the spectator JVM per test does not eliminate observer-side startup races. If `ObserverGamePane` defers `gamePanel.installComponents()` with `SwingUtilities.invokeLater(...)` and `watchGame()` starts immediately, a synchronous `GAME_INIT` can arrive before the layered pane is wired, leaving `/wait-for-watching` stuck on an empty game. Fix the pane-install ordering in the observer UI rather than adding more waits in Python.
- KeepAlive spectators must reset per-game watch state: `ObserverGamePanel` is reused across games, so stale fields like `watchingSignaled`, `permissionsRequested`, and the current `gameDir` can suppress the next game's readiness signal even when the right callback path fires. If a later golden test times out in `/wait-for-watching`, inspect `watchGame(...)` for missing state resets before adding new timing logic.
- Search spectator logs for stray `Ending game...` lines during the next scenario: `FeedbackPanel.endWithTimeout()` schedules a delayed `btnRight.doClick()` when a game ends. In keepAlive mode, a stale auto-close task can survive into the next game and click that game's live feedback button, causing premature `GAME_OVER`, phantom responses, or `pass_priority` stalls. If CI artifacts show `Ending game...` shortly after a new `keepAlive: received command`, inspect both `FeedbackPanel` timer invalidation and whether `GamePanel.cleanUp()` explicitly cancels any pending auto-close task before the old pane is orphaned.
- Search spectator logs for `IllegalComponentStateException` during fast combat starts: If the observer hits `component must be showing on the screen to determine its location` from `CombatManager.getParentPoint()` / `getLocationOnScreen()`, the observer can enter the modal error path and stop processing callbacks while the game is still in combat. This commonly shows up in very fast scenarios that reach attacks or blockers within a second of `keepAlive: auto-watching table ...`. The fix belongs in combat arrow rendering: only compute on-screen coordinates for components that are actually showing, rather than adding waits to the test.
- Fix the bridge's callback handling: If the auto-pass loop sees different numbers of callbacks across runs, the bridge is violating the priority serialization invariant. Look for places where callbacks are dropped, duplicated, or processed out of order (e.g., `pendingAction` being overwritten, `actionLock` contention, `GAME_UPDATE` callbacks interfering with priority callbacks).
- Preserve the decision boundary across follow-up callbacks: If a tool waits for a callback and then re-reads live bridge state later, it can lose the actual prompt the player should answer. Carry the exact observed decision forward, or explicitly skip transient auto-resolved callbacks, instead of assuming the next callback is the next decision.
- Do not reintroduce fixed post-action waits in `choose_action` as a hang workaround: A bounded wait can make a replay hang disappear locally, but it also causes real prompt regressions when the next legitimate decision arrives just after the timeout. If the CI failure is a concede / `game_end_signal` timeout, investigate concede wakeup and end-game ordering first instead of changing `choose_action` blocking behavior.
Never work around nondeterminism by modifying test scripts. If `pass_priority()` is nondeterministic, the fix is in the Java bridge, not in switching to `pass_priority(until="precombat_main")`. See "Key invariant: priority serialization" above.
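The "reorder operations" pattern can be made concrete with a toy Python model of the `handleGameOver()` race described in Step 2. The class and field names here are illustrative stand-ins for the Java bridge, not its actual code:

```python
import threading

class GameRegistry:
    """Toy model of the handleGameOver() race: the event cache must be
    populated *before* the game is removed from the set of active games,
    because removal is the signal that sends readers to the cache."""

    def __init__(self):
        self._lock = threading.Lock()
        self.active_games = {"g1": ["game events"]}
        self.event_cache = {}

    def handle_game_over(self, game_id):
        with self._lock:
            # Correct order: populate the cache first, then remove.
            # Reversing these two lines reintroduces the window where a
            # reader sees neither the live game nor the cached events.
            self.event_cache[game_id] = self.active_games[game_id]
            del self.active_games[game_id]

    def pull_events(self, game_id):
        with self._lock:
            if game_id in self.active_games:
                return self.active_games[game_id]
            # Fallback after game over: only safe because the cache was
            # populated before the removal above.
            return self.event_cache.get(game_id, [])
```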
## Step 6: Verify the fix
- `make build` — ensure Java compiles
- `make check` — lint, typecheck, unit tests
- Run the specific golden test multiple times:

  ```bash
  for i in 1 2 3 4 5; do echo "=== Run $i ==="; xvfb-run make test-golden K=<test_name> 2>&1 | grep -E 'PASSED|FAILED|Error'; done
  ```
All runs should pass. A flake fix that doesn't survive 5 consecutive local runs isn't a fix.
## Step 7: Commit and PR
Use `/pr` to create the pull request. The PR description should include:
- The root cause (which threads raced, what the ordering issue was)
- The fix (what was reordered/changed and why)
- That local golden tests passed N/N consecutive runs