4-atari-hard: Go-Explore (exploration phase) on Montezuma's Revenge + benchmark by dnddnjs · Pull Request #132 · rlcode/reinforcement-learning

dnddnjs · 2026-06-08T00:46:36Z

Go-Explore phase 1 (exploration only) on Montezuma's Revenge — the archive + emulator-restore paradigm, side by side with the PPO+RND row.

Best end-of-episode score: 31,000 at 500M agent steps (~5.5h, Mac Studio M4 Max, 12 explorer processes, no neural network). Single seed. Replay-verified: re-executing the stored 5,336-action trajectory from reset reproduces exactly 31,000.
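
The replay check can be sketched as follows. This is a minimal illustration on a toy deterministic environment, not the repository's code: `ToyEnv`, `replay_score`, and `verify_demo` are hypothetical names, and the real check re-executes the stored trajectory against the ALE from reset.

```python
class ToyEnv:
    """Stand-in deterministic environment (hypothetical; not the ALE API)."""

    def reset(self):
        self.position = 0
        return self.position

    def step(self, action):
        # Deterministic transition: the same action sequence always
        # produces the same positions and rewards.
        self.position += action
        reward = 100 if self.position % 3 == 0 else 0
        done = self.position >= 9
        return self.position, reward, done


def replay_score(make_env, actions):
    """Re-execute `actions` from reset and return the accumulated raw score."""
    env = make_env()
    env.reset()
    total = 0
    for action in actions:
        _, reward, done = env.step(action)
        total += reward
        if done:
            break
    return total


def verify_demo(make_env, actions, claimed_score):
    """Raise if replaying the stored actions does not reproduce the claimed score."""
    replayed = replay_score(make_env, actions)
    if replayed != claimed_score:
        raise ValueError(f"replay gave {replayed}, archive claims {claimed_score}")
    return replayed


demo_actions = [1, 2, 1, 2, 3]            # toy demo, not the 5,336-action one
verify_demo(ToyEnv, demo_actions, 300)    # passes: positions 3, 6, 9 each pay 100
```

A check of this shape only makes sense under a deterministic protocol: with sticky actions, the same stored actions would not be guaranteed to reach the same states.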

Protocol notes (also in the README block):

- Deterministic ALE (no sticky actions, frameskip 4, fixed seed): required by restore-based exploration, and therefore not comparable to the sticky-action RL rows.
- Score = best end-of-episode trajectory found by search, not an RL policy score; the paper's robustification phase is not run here.
- Reference: the Nature paper's exploration-phase mean without domain knowledge is 24,758 at the same 2B-frame budget (500M agent steps x frameskip 4), averaged over 50+ seeds versus our single seed. Rooms found: 24.

W&B (full metrics history + gameplay video): https://wandb.ai/rlcode/rl-atari-hard-go-explore/runs/m6ox4l3m

(Single-seed diagnostic run; merge is a human decision.)

… benchmark

Go-Explore phase 1 (Ecoffet et al. 2019 / Nature 2021), no neural net: an archive of downscaled-frame cells (11x8, 9 gray levels), emulator state save/restore to return to frontier cells, repeated random actions to explore from them. 12 explorer processes over raw gymnasium ALE (envpool exposes no clone API, hence the separate env_go_explore.py).

Result: best end-of-episode score 31,000 at 500M agent steps (~5.5h on a Mac Studio M4 Max), single seed, replay-verified (re-executing the stored 5,336-action demo from reset reproduces the score exactly). Deterministic protocol (no sticky actions) -- a trajectory-search result, not an RL policy score; see the README caveat.
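
As a rough illustration of the cell representation named in this commit (11x8 cells, 9 gray levels), the sketch below maps a grayscale frame to a hashable key. It is an assumption-laden sketch: the function name `cell_key`, block averaging as the downscaling method, and the quantization formula are illustrative choices, not taken from env_go_explore.py.

```python
def cell_key(frame, width=11, height=8, levels=9, max_pixel=255):
    """Map a grayscale frame (rows of ints in 0..max_pixel) to a cell key.

    Assumption: downscale by averaging rectangular blocks, then quantize
    each block mean into `levels` gray levels. The repository may use a
    different interpolation or rounding rule.
    """
    rows, cols = len(frame), len(frame[0])
    key = []
    for cy in range(height):
        y0, y1 = cy * rows // height, (cy + 1) * rows // height
        for cx in range(width):
            x0, x1 = cx * cols // width, (cx + 1) * cols // width
            block = [frame[y][x] for y in range(y0, y1) for x in range(x0, x1)]
            mean = sum(block) / len(block)
            # Quantize to an integer in 0..levels-1.
            key.append(min(levels - 1, int(mean * levels / (max_pixel + 1))))
    return tuple(key)   # tuples are hashable, so they can index the archive


# Two 210x160 frames that differ by one dim pixel fall into the same cell.
black = [[0] * 160 for _ in range(210)]
noisy = [row[:] for row in black]
noisy[0][0] = 10
same_cell = cell_key(black) == cell_key(noisy)
```

The coarse key is what lets the archive treat visually similar frames as one frontier cell rather than storing every distinct frame.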

dnddnjs · 2026-06-12T03:21:00Z

Superseded by #133 (combined Go-Explore exploration + robustification).

…ation + robustification)

Merge the three Montezuma results into a single table with a protocol column, trim the prose to one note. Restores the exploration row (31,000, replay-verified) that lived only on the superseded #132 branch.

…ification) (#133)

* Ignore local CLAUDE.md collaboration notes

* 4-atari-hard: PPO + RND scaffold for Montezuma's Revenge

  New chapter for hard-exploration Atari. PPO with Random Network Distillation (Burda et al., 2018) as the curiosity bonus.
  - env.py: ALE/MontezumaRevenge-v5 (and pitfall, private_eye) with the standard Atari preprocessing, no FireResetEnv, no LifeLossTerminalEnv (uninterrupted episodes so intrinsic returns can chain across deaths).
  - 1-ppo-rnd.py: two-value-head ActorCritic, RND target/predictor with LeakyReLU, single-frame normalized input clipped to [-5, 5], obs RMS seeded by 50 rollouts of a random agent, intrinsic reward scaled by running std of discounted intrinsic returns, dual GAE (extrinsic episodic + intrinsic non-episodic), predictor updated on 25% of each minibatch, combined advantage A = 2*A_ext + 1*A_int.

  Not run end-to-end yet. Sanity-checked static shapes and module wiring.

* 4-atari-hard: add envpool count-based exploration

* 4-atari-hard: PPO+RND on Montezuma's Revenge + benchmark

  PPO+RND made reproducible and resumable. Shared run plumbing (seed, metrics.jsonl, periodic/milestone/best checkpoints, resume, final summary) lives in env.py's RunLogger, keeping the algorithm file focused. 512 parallel envs crack the first-key bottleneck (128 envs never scored in 50M); final mean per-game return ~3120 @ 65M steps, single seed (M4 Max), above the paper PPO baseline (2497). Adds a README benchmark row. Count-based exploration is deferred to a later PR (not yet trained/benchmarked).

* README: link the Montezuma PPO+RND W&B report

* 6-atari-go-explore: Go-Explore Phase 1 (exploration) for Montezuma

  Restore-based archive exploration (Ecoffet et al. 2019/2021), no neural net: 11x8x9 downscaled-frame cells, 1/sqrt(seen+1) selection, repeated random actions (p=0.95), raw-score accept rule, virtual DONE cell, global experience log with prev_id chains (demo source for robustification), 12-worker spawn pool over raw gymnasium ALE clone/restore.

  Run contract: --seed/--total-frames/--run-dir/--ckpt-every/--resume; explog flushed as compressed chunks; checkpoint = archive+log+RNG at batch boundaries. Smoke: 23k steps/s aggregate, first key at 100k steps.

* 6-atari-go-explore: resolve flushed explog chunks from the resumed run's dir

  Cross-run-dir resume (harness relaunches into a fresh run dir) could not see chunks flushed by the original run; chunk lookup now falls back to the ancestor run's explog dir and resume fails loudly if any chunk is unreachable.

* Move Go-Explore into 4-atari-hard alongside PPO+RND

  Same hard-exploration domain, two paradigms side by side: 1-ppo-rnd.py (gradient + intrinsic reward, envpool) and 2-go-explore.py (archive + emulator restore, raw ALE). Go-Explore keeps its own plumbing in env_go_explore.py since the two stacks share nothing.

* 4-atari-hard/2-go-explore: use sampling-time captures in the archive walk

  A result earlier in the same batch can replace a cell; walking a later result against the cell's CURRENT score/trajectory stitched actions executed from the old state onto the new prefix, fabricating scores no single playthrough achieved. sample() now freezes snapshot/score/trajectory per pick and the walk uses the capture, matching the official Go-Explore, which ships these values inside each task. Caught by publish-time demo replay verification (score mismatch).

* 4-atari-hard: Go-Explore robustification (backward algorithm): demo extract + GRU PPO

  - extract_demo.py: pull the best Phase-1 trajectory from the GE checkpoint + experience log, replay-verify it reproduces the archived score (31,000), truncate after the last reward, save actions/rewards/periodic ALE states.
  - env_robustify.py: ReplayResetEnv (episodes restore to a demo point and play forward under sticky actions; success = raw score >= demo return; lag/success kills) + ResetManager curriculum (starting points march backward as the agent matches the demo, forward-cumsum move rule per atari-reset, nudge forward on collapse).
  - 3-robustify.py: recurrent (GRU) PPO over N restore-capable ALE envs, truncated BPTT with done-masked state, advantage chains cut at artificial success resets, periodic from-reset sticky eval -> final.json.

  Single-machine scaled port of openai/atari-reset; SIL/multi-demo/autoscale are off-by-default flags. Runs end-to-end; curriculum logic covered by harness T0.

* 4-atari-hard/3-robustify: eval honors game_over + faithful resume RNG

  Preflight audit found the from-reset eval reused ReplayResetEnv with the training-curriculum kills active, so eval episodes were cut by lag/success-kill before game_over, and value_mean reported a key-but-slower-than-demo policy as ~0. Disable both kills in evaluate() and cap at the standard 18000-frame Montezuma episode (4500 agent steps) so eval runs from reset to a real game_over. Also checkpoint torch + per-env RNG state and restore them on resume (was global numpy RNG only), so the kill->resume contract is faithful for the deterministic streams (MPS policy sampling has no bit determinism).

* 4-atari-hard/extract_demo: --max-rewards to truncate at the Kth reward

  Cutting the demo just after the first reward (--max-rewards 1) yields a short first-key-only demo (~250 actions vs ~5300), a far shorter horizon for the robustification backward curriculum to bootstrap on.

* 4-atari-hard/3-robustify: --ent-coef flag for entropy tuning

  The first-key robustification curriculum plateaus where as_good_as_demo caps ~0.34: the policy commits before reliably executing the demo suffix under sticky actions. Expose the entropy bonus as a flag (default unchanged) to test whether more exploration breaks the plateau.

* README: Go-Explore robustification, single-machine negative result

  Document the backward-algorithm robustification (3-robustify.py): the curriculum bootstraps with a first-key demo + 128 envs but plateaus ~22% of the way, with no from-reset score on a single machine. Honest negative result, single seed, no benchmark row claimed.

* README: consolidate Montezuma into one table (RND + Go-Explore exploration + robustification)

  Merge the three Montezuma results into a single table with a protocol column, trim the prose to one note. Restores the exploration row (31,000, replay-verified) that lived only on the superseded #132 branch.

---------

Co-authored-by: soyoung park <ssoyyoung.p@gmail.com>
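
The sampling-time capture fix described in the commit log can be illustrated with a small sketch. Everything here is hypothetical scaffolding rather than the code in 2-go-explore.py: the `Cell`, `Pick`, `capture`, `sample`, and `walk` names, the strictly-higher-score accept rule, and carrying the `seen` count across a replacement are all assumptions. Only the 1/sqrt(seen+1) selection weight and the idea of freezing snapshot/score/trajectory per pick come from the commit messages.

```python
import math
import random
from dataclasses import dataclass


@dataclass
class Cell:
    """One archive entry (field names are illustrative)."""
    snapshot: object      # emulator state to restore to
    score: float          # raw score of the trajectory that reached it
    trajectory: tuple     # actions from reset that reach the snapshot
    seen: int = 0         # times this cell was picked for exploration


@dataclass(frozen=True)
class Pick:
    """Cell contents frozen at sampling time."""
    cell_key: object
    snapshot: object
    score: float
    trajectory: tuple


def capture(archive, key):
    """Freeze the cell's current contents and count the pick."""
    cell = archive[key]
    cell.seen += 1
    return Pick(key, cell.snapshot, cell.score, cell.trajectory)


def sample(archive, rng):
    """Choose a cell with weight 1/sqrt(seen+1) and return its capture."""
    keys = list(archive)
    weights = [1.0 / math.sqrt(archive[k].seen + 1) for k in keys]
    return capture(archive, rng.choices(keys, weights=weights, k=1)[0])


def walk(archive, pick, steps):
    """Merge one exploration result, given as (action, reward, key, snapshot) tuples.

    Scores and trajectories extend the frozen pick, never the cell's
    current contents, which an earlier result may have replaced.
    """
    score, trajectory = pick.score, pick.trajectory
    for action, reward, key, snapshot in steps:
        score += reward
        trajectory = trajectory + (action,)
        known = archive.get(key)
        if known is None or score > known.score:   # simplified accept rule
            archive[key] = Cell(snapshot, score, trajectory,
                                known.seen if known else 0)


# Both picks are taken before any result of the batch is merged.
archive = {"start": Cell("s0", 0.0, ()), "B": Cell("sB_old", 0.0, (3,))}
pick_start = capture(archive, "start")
pick_b = capture(archive, "B")

walk(archive, pick_start, [(1, 100.0, "B", "sB_new")])   # replaces cell B
walk(archive, pick_b, [(2, 50.0, "C", "sC")])            # was run from sB_old
# Cell C holds score 50.0 via (3, 2). Reading B's current contents instead
# would have recorded 150.0 via (1, 2), a trajectory nobody actually played.
```

Freezing the values in the pick keeps every archived score tied to one real action sequence, which is the property the replay verification checks.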

dnddnjs mentioned this pull request Jun 12, 2026

4-atari-hard: Go-Explore on Montezuma's Revenge (exploration + robustification) #133

Merged

dnddnjs closed this Jun 12, 2026

dnddnjs wants to merge 1 commit into master from ai/montezuma-go-explore
