4-atari-hard: Go-Explore (exploration phase) on Montezuma's Revenge + benchmark #132
Closed
dnddnjs wants to merge 1 commit into
… benchmark

Go-Explore phase 1 (Ecoffet et al. 2019 / Nature 2021), no neural net: an archive of downscaled-frame cells (11x8, 9 gray levels), emulator state save/restore to return to frontier cells, repeated random actions to explore from them. 12 explorer processes over raw gymnasium ALE (envpool exposes no clone API, hence the separate env_go_explore.py).

Result: best end-of-episode score 31,000 at 500M agent steps (~5.5h on a Mac Studio M4 Max), single seed, replay-verified (re-executing the stored 5,336-action demo from reset reproduces the score exactly).

Deterministic protocol (no sticky actions): a trajectory-search result, not an RL policy score; see the README caveat.
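The cell representation described above (11x8 downscale, 9 gray levels) can be sketched as follows. This is a minimal illustration, not the code in env_go_explore.py: the function name `frame_to_cell`, the block-mean downscale, and the floor-based quantization are assumptions, and the actual implementation may use a library resize instead.

```python
import numpy as np

# Cell geometry as stated in the PR text; the names are hypothetical.
CELL_W, CELL_H, GRAY_LEVELS = 11, 8, 9


def frame_to_cell(frame_gray: np.ndarray) -> bytes:
    """Map a 2-D uint8 grayscale frame to a hashable archive key."""
    h, w = frame_gray.shape
    # Block edges for an area-average downscale (assumption: the real
    # code may call an image-library resize rather than averaging blocks).
    rows = np.linspace(0, h, CELL_H + 1).astype(int)
    cols = np.linspace(0, w, CELL_W + 1).astype(int)
    small = np.empty((CELL_H, CELL_W), dtype=np.float64)
    for i in range(CELL_H):
        for j in range(CELL_W):
            block = frame_gray[rows[i]:rows[i + 1], cols[j]:cols[j + 1]]
            small[i, j] = block.mean()
    # Quantize 0..255 intensities into GRAY_LEVELS buckets (0..8).
    levels = np.floor(small / 255.0 * (GRAY_LEVELS - 1)).astype(np.uint8)
    return levels.tobytes()


if __name__ == "__main__":
    frame = np.random.default_rng(0).integers(0, 256, (210, 160), dtype=np.uint8)
    archive = {frame_to_cell(frame): {"seen": 1}}
    print(len(next(iter(archive))))  # one byte per downscaled pixel
```

Two frames that differ only in fine detail collapse to the same key, which is what lets the archive treat them as one cell.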
Superseded by #133 (combined Go-Explore exploration + robustification).
dnddnjs pushed a commit that referenced this pull request Jun 12, 2026
…ation + robustification) Merge the three Montezuma results into a single table with a protocol column, trim the prose to one note. Restores the exploration row (31,000, replay-verified) that lived only on the superseded #132 branch.
dnddnjs added a commit that referenced this pull request Jun 12, 2026
…ification) (#133)

* Ignore local CLAUDE.md collaboration notes

* 4-atari-hard: PPO + RND scaffold for Montezuma's Revenge

  New chapter for hard-exploration Atari. PPO with Random Network Distillation (Burda et al., 2018) as the curiosity bonus.

  - env.py: ALE/MontezumaRevenge-v5 (and pitfall, private_eye) with the standard Atari preprocessing, no FireResetEnv, no LifeLossTerminalEnv (uninterrupted episodes so intrinsic returns can chain across deaths).
  - 1-ppo-rnd.py: two-value-head ActorCritic, RND target/predictor with LeakyReLU, single-frame normalized input clipped to [-5, 5], obs RMS seeded by 50 rollouts of a random agent, intrinsic reward scaled by running std of discounted intrinsic returns, dual GAE (extrinsic episodic + intrinsic non-episodic), predictor updated on 25% of each minibatch, combined advantage A = 2*A_ext + 1*A_int.

  Not run end-to-end yet. Sanity-checked static shapes and module wiring.

* 4-atari-hard: add envpool count-based exploration

* 4-atari-hard: PPO+RND on Montezuma's Revenge + benchmark

  PPO+RND made reproducible and resumable. Shared run plumbing (seed, metrics.jsonl, periodic/milestone/best checkpoints, resume, final summary) lives in env.py's RunLogger, keeping the algorithm file focused. 512 parallel envs crack the first-key bottleneck (128 envs never scored in 50M); final mean per-game return ~3120 @ 65M steps, single seed (M4 Max), above the paper PPO baseline (2497). Adds a README benchmark row. Count-based exploration is deferred to a later PR (not yet trained/benchmarked).

* README: link the Montezuma PPO+RND W&B report

* 6-atari-go-explore: Go-Explore Phase 1 (exploration) for Montezuma

  Restore-based archive exploration (Ecoffet et al. 2019/2021), no neural net: 11x8x9 downscaled-frame cells, 1/sqrt(seen+1) selection, repeated random actions (p=0.95), raw-score accept rule, virtual DONE cell, global experience log with prev_id chains (demo source for robustification), 12-worker spawn pool over raw gymnasium ALE clone/restore.

  Run contract: --seed/--total-frames/--run-dir/--ckpt-every/--resume; explog flushed as compressed chunks; checkpoint = archive+log+RNG at batch boundaries. Smoke: 23k steps/s aggregate, first key at 100k steps.

* 6-atari-go-explore: resolve flushed explog chunks from the resumed run's dir

  Cross-run-dir resume (harness relaunches into a fresh run dir) could not see chunks flushed by the original run; chunk lookup now falls back to the ancestor run's explog dir and resume fails loudly if any chunk is unreachable.

* Move Go-Explore into 4-atari-hard alongside PPO+RND

  Same hard-exploration domain, two paradigms side by side: 1-ppo-rnd.py (gradient + intrinsic reward, envpool) and 2-go-explore.py (archive + emulator restore, raw ALE). Go-Explore keeps its own plumbing in env_go_explore.py since the two stacks share nothing.

* 4-atari-hard/2-go-explore: use sampling-time captures in the archive walk

  A result earlier in the same batch can replace a cell; walking a later result against the cell's CURRENT score/trajectory stitched actions executed from the old state onto the new prefix, fabricating scores no single playthrough achieved. sample() now freezes snapshot/score/trajectory per pick and the walk uses the capture, matching the official Go-Explore, which ships these values inside each task. Caught by publish-time demo replay verification (score mismatch).

* 4-atari-hard: Go-Explore robustification (backward algorithm): demo extract + GRU PPO

  extract_demo.py: pull the best Phase-1 trajectory from the GE checkpoint + experience log, replay-verify it reproduces the archived score (31,000), truncate after the last reward, save actions/rewards/periodic ALE states.

  env_robustify.py: ReplayResetEnv (episodes restore to a demo point and play forward under sticky actions; success = raw score >= demo return; lag/success kills) + ResetManager curriculum (starting points march backward as the agent matches the demo, forward-cumsum move rule per atari-reset, nudge forward on collapse).

  3-robustify.py: recurrent (GRU) PPO over N restore-capable ALE envs, truncated BPTT with done-masked state, advantage chains cut at artificial success resets, periodic from-reset sticky eval -> final.json.

  Single-machine scaled port of openai/atari-reset; SIL/multi-demo/autoscale are off-by-default flags. Runs end-to-end; curriculum logic covered by harness T0.

* 4-atari-hard/3-robustify: eval honors game_over + faithful resume RNG

  Preflight audit found the from-reset eval reused ReplayResetEnv with the training-curriculum kills active, so eval episodes were cut by lag/success-kill before game_over; value_mean reported a key-but-slower-than-demo policy as ~0. Disable both kills in evaluate() and cap at the standard 18000-frame Montezuma episode (4500 agent steps) so eval runs from reset to a real game_over. Also checkpoint torch + per-env RNG state and restore them on resume (was global numpy RNG only), so the kill->resume contract is faithful for the deterministic streams (MPS policy sampling has no bit determinism).

* 4-atari-hard/extract_demo: --max-rewards to truncate at the Kth reward

  Cutting the demo just after the first reward (--max-rewards 1) yields a short first-key-only demo (~250 actions vs ~5300), a far shorter horizon for the robustification backward curriculum to bootstrap on.

* 4-atari-hard/3-robustify: --ent-coef flag for entropy tuning

  The first-key robustification curriculum plateaus where as_good_as_demo caps ~0.34: the policy commits before reliably executing the demo suffix under sticky actions. Expose the entropy bonus as a flag (default unchanged) to test whether more exploration breaks the plateau.

* README: Go-Explore robustification, single-machine negative result

  Document the backward-algorithm robustification (3-robustify.py): the curriculum bootstraps with a first-key demo + 128 envs but plateaus ~22% of the way, with no from-reset score on a single machine. Honest negative result, single seed, no benchmark row claimed.

* README: consolidate Montezuma into one table (RND + Go-Explore exploration + robustification)

  Merge the three Montezuma results into a single table with a protocol column, trim the prose to one note. Restores the exploration row (31,000, replay-verified) that lived only on the superseded #132 branch.

---------

Co-authored-by: soyoung park <ssoyyoung.p@gmail.com>
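The Go-Explore Phase 1 commit in the message above names two rules: cells are selected with weight 1/sqrt(seen+1), and exploration uses repeated random actions with p=0.95. A minimal sketch of both, assuming a plain dict of visit counts and hypothetical function names (`selection_weights`, `pick_cell`, `explore_actions`) that are not taken from 2-go-explore.py:

```python
import math
import random


def selection_weights(seen_counts):
    """Weight each cell by 1/sqrt(seen + 1), the rule named in the commit."""
    return {cell: 1.0 / math.sqrt(n + 1) for cell, n in seen_counts.items()}


def pick_cell(seen_counts, rng):
    """Sample one cell to return to, favouring rarely selected cells."""
    weights = selection_weights(seen_counts)
    cells = list(weights)
    return rng.choices(cells, weights=[weights[c] for c in cells], k=1)[0]


def explore_actions(n_steps, n_actions, rng, repeat_prob=0.95):
    """Random actions where each step repeats the previous action with
    probability repeat_prob and otherwise draws a fresh uniform action."""
    actions = []
    action = rng.randrange(n_actions)
    for _ in range(n_steps):
        if rng.random() >= repeat_prob:
            action = rng.randrange(n_actions)
        actions.append(action)
    return actions


if __name__ == "__main__":
    rng = random.Random(0)
    seen = {"start": 99, "ladder": 8, "key_room": 0}  # illustrative counts
    print(selection_weights(seen))
    print(pick_cell(seen, rng))
    print(explore_actions(10, n_actions=18, rng=rng))
```

With these illustrative counts a never-selected cell carries weight 1.0 against 0.1 for a cell selected 99 times, so the frontier is revisited far more often than well-explored states.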
Go-Explore phase 1 (exploration only) on Montezuma's Revenge — the archive + emulator-restore paradigm, side by side with the PPO+RND row.
Best end-of-episode score: 31,000 at 500M agent steps (~5.5h, Mac Studio M4 Max, 12 explorer processes, no neural network). Single seed. Replay-verified: re-executing the stored 5,336-action trajectory from reset reproduces exactly 31,000.
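The replay check described above amounts to re-running the stored actions from reset and comparing the summed reward with the archived score. A minimal sketch, assuming an environment with the gymnasium `reset`/`step` signature; `replay_score`, `verify_demo`, and the `ToyEnv` stand-in are hypothetical and not taken from the PR:

```python
def replay_score(env, actions):
    """Re-execute a stored action list from reset and return the raw score."""
    env.reset()
    total = 0.0
    for action in actions:
        _obs, reward, terminated, truncated, _info = env.step(action)
        total += reward
        if terminated or truncated:
            break
    return total


def verify_demo(env, actions, expected_score):
    """True only if the replay reproduces the archived score exactly."""
    return replay_score(env, actions) == expected_score


class ToyEnv:
    """Deterministic stand-in for the emulator (assumption, for illustration):
    action 1 pays 100 points, every other action pays nothing."""

    def reset(self, seed=None):
        self.t = 0
        return 0, {}

    def step(self, action):
        self.t += 1
        reward = 100.0 if action == 1 else 0.0
        return self.t, reward, False, False, {}


if __name__ == "__main__":
    demo = [1, 0, 1, 1]
    print(verify_demo(ToyEnv(), demo, expected_score=300.0))
```

An exact-equality check only makes sense under the deterministic protocol the PR states (no sticky actions); with stochastic action repeats the same action list would not be expected to reproduce the score.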
Protocol notes (also in the README block):
W&B (full metrics history + gameplay video): https://wandb.ai/rlcode/rl-atari-hard-go-explore/runs/m6ox4l3m
(Single-seed diagnostic run; merge is a human decision.)