4-atari-hard: Go-Explore on Montezuma's Revenge (exploration + robustification) by dnddnjs · Pull Request #133 · rlcode/reinforcement-learning

dnddnjs · 2026-06-12T03:20:50Z

Go-Explore on Montezuma's Revenge, both phases.

Phase 1, exploration (2-go-explore.py): restore-based archive exploration over knowledge-free downscaled cells. This is a deterministic trajectory search, so the score is a search result, not an RL-policy score, and it is not cross-compared with the sticky-action RND numbers.

Phase 2, robustification (3-robustify.py): use the backward algorithm (Salimans & Chen 2018; Ecoffet et al. 2021) to distil the demo into a recurrent policy that plays under sticky actions. Reported as an honest single-machine negative result:

- Bootstrap works: a first-key-only demo (~250 actions) with 128 envs makes the curriculum retreat immediately (as_good_as_demo → 1.0), whereas the full ~5,300-action demo at 16 envs never moves.
- It then plateaus ~22% of the way back through the demo: the policy masters the last ~55 actions but not the earlier platforming. Neither 10× frames nor a 100× entropy bonus changes this, which points to a scale ceiling relative to the original's hundreds to thousands of envs. There is no from-reset score, so no benchmark row is claimed.

README updated with the honest write-up. Single seed throughout. Supersedes #132 (exploration phase alone). Merge left to a maintainer.

New chapter for hard-exploration Atari. PPO with Random Network Distillation (Burda et al., 2018) as the curiosity bonus.

- env.py: ALE/MontezumaRevenge-v5 (and pitfall, private_eye) with the standard Atari preprocessing, no FireResetEnv, no LifeLossTerminalEnv (uninterrupted episodes so intrinsic returns can chain across deaths).
- 1-ppo-rnd.py: two-value-head ActorCritic, RND target/predictor with LeakyReLU, single-frame normalized input clipped to [-5, 5], obs RMS seeded by 50 rollouts of a random agent, intrinsic reward scaled by running std of discounted intrinsic returns, dual GAE (extrinsic episodic + intrinsic non-episodic), predictor updated on 25% of each minibatch, combined advantage A = 2*A_ext + 1*A_int.

Not run end-to-end yet. Sanity-checked static shapes and module wiring.
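The dual GAE and the combined advantage described above can be sketched roughly as follows. This is a minimal pure-Python illustration, not the code in 1-ppo-rnd.py: the function names `gae` and `combined_advantage` are hypothetical, and the discount and lambda values are left as parameters because the commit message does not state them. The only detail taken from the text is that the extrinsic stream is episodic, the intrinsic stream is not, and the two are mixed as A = 2*A_ext + 1*A_int.

```python
from typing import List, Sequence


def gae(
    rewards: Sequence[float],
    values: Sequence[float],
    dones: Sequence[bool],
    gamma: float,
    lam: float,
    episodic: bool,
) -> List[float]:
    """Generalized advantage estimation over one rollout.

    `values` must hold len(rewards) + 1 entries (the last one is the
    bootstrap value). With episodic=False the done flags are ignored,
    so returns chain across episode boundaries.
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        # Only the episodic stream stops bootstrapping at a terminal step.
        nonterminal = 0.0 if (episodic and dones[t]) else 1.0
        delta = rewards[t] + gamma * values[t + 1] * nonterminal - values[t]
        running = delta + gamma * lam * nonterminal * running
        advantages[t] = running
    return advantages


def combined_advantage(
    a_ext: Sequence[float],
    a_int: Sequence[float],
    ext_coef: float = 2.0,  # coefficients from the commit message
    int_coef: float = 1.0,
) -> List[float]:
    """Mix the two advantage streams as A = 2*A_ext + 1*A_int."""
    return [ext_coef * e + int_coef * i for e, i in zip(a_ext, a_int)]
```

Calling `gae` once with `episodic=True` on extrinsic rewards and once with `episodic=False` on intrinsic rewards, then passing both results to `combined_advantage`, gives the quantity a PPO loss would consume under these assumptions.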

PPO+RND made reproducible and resumable. Shared run plumbing (seed, metrics.jsonl, periodic/milestone/best checkpoints, resume, final summary) lives in env.py's RunLogger, keeping the algorithm file focused. 512 parallel envs crack the first-key bottleneck (128 envs never scored in 50M); final mean per-game return ~3120 @ 65M steps, single seed (M4 Max), above the paper PPO baseline (2497). Adds a README benchmark row. Count-based exploration is deferred to a later PR (not yet trained/benchmarked).

Restore-based archive exploration (Ecoffet et al. 2019/2021), no neural net: 11x8x9 downscaled-frame cells, 1/sqrt(seen+1) selection, repeated random actions (p=0.95), raw-score accept rule, virtual DONE cell, global experience log with prev_id chains (demo source for robustification), 12-worker spawn pool over raw gymnasium ALE clone/restore. Run contract: --seed/--total-frames/--run-dir/--ckpt-every/--resume; explog flushed as compressed chunks; checkpoint = archive+log+RNG at batch boundaries. Smoke: 23k steps/s aggregate, first key at 100k steps.
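As a rough illustration of two of the rules listed above, the sketch below shows cell selection weighted by 1/sqrt(seen+1) and action sampling that repeats the previous action with probability 0.95. The function names and the dict-based archive are hypothetical stand-ins, not the structures used in 2-go-explore.py.

```python
import math
import random
from typing import Dict, Hashable, Optional


def selection_weights(seen_counts: Dict[Hashable, int]) -> Dict[Hashable, float]:
    """Normalized selection probability per cell, proportional to 1/sqrt(seen+1)."""
    raw = {cell: 1.0 / math.sqrt(n + 1) for cell, n in seen_counts.items()}
    total = sum(raw.values())
    return {cell: w / total for cell, w in raw.items()}


def pick_cell(seen_counts: Dict[Hashable, int], rng: random.Random) -> Hashable:
    """Sample one cell to return to; rarely seen cells are favoured."""
    cells = list(seen_counts)
    weights = [1.0 / math.sqrt(seen_counts[c] + 1) for c in cells]
    return rng.choices(cells, weights=weights, k=1)[0]


def next_action(
    prev_action: Optional[int],
    n_actions: int,
    rng: random.Random,
    repeat_prob: float = 0.95,  # p=0.95 from the commit message
) -> int:
    """Repeat the previous action with probability repeat_prob, else sample uniformly."""
    if prev_action is not None and rng.random() < repeat_prob:
        return prev_action
    return rng.randrange(n_actions)
```

With counts `{"a": 0, "b": 3}` the raw weights are 1 and 0.5, so cell `a` is chosen about twice as often as cell `b`.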

…n's dir

Cross-run-dir resume (harness relaunches into a fresh run dir) could not see chunks flushed by the original run; chunk lookup now falls back to the ancestor run's explog dir and resume fails loudly if any chunk is unreachable.
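The fallback lookup described here could look something like the sketch below. It is an assumption-laden illustration: `find_chunk`, the chunk file names, and the way ancestor run dirs are passed in are all hypothetical; only the `explog` directory and the fail-loudly behaviour come from the commit message.

```python
from pathlib import Path
from typing import Sequence, Union

PathLike = Union[str, Path]


def find_chunk(name: str, run_dir: PathLike, ancestor_dirs: Sequence[PathLike]) -> Path:
    """Locate an explog chunk, checking the current run dir first and then
    each ancestor run dir in order. Raises instead of silently skipping."""
    searched = []
    for directory in [run_dir, *ancestor_dirs]:
        candidate = Path(directory) / "explog" / name
        searched.append(str(candidate))
        if candidate.exists():
            return candidate
    # Resume must not continue with a partial log, so fail loudly.
    raise FileNotFoundError(
        f"explog chunk {name!r} unreachable; searched: {searched}"
    )
```

Raising on the first unreachable chunk keeps a resumed run from rebuilding trajectories out of an incomplete experience log.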

Same hard-exploration domain, two paradigms side by side: 1-ppo-rnd.py (gradient + intrinsic reward, envpool) and 2-go-explore.py (archive + emulator restore, raw ALE). Go-Explore keeps its own plumbing in env_go_explore.py since the two stacks share nothing.

…walk

A result earlier in the same batch can replace a cell; walking a later result against the cell's CURRENT score/trajectory stitched actions executed from the old state onto the new prefix, fabricating scores no single playthrough achieved. sample() now freezes snapshot/score/trajectory per pick and the walk uses the capture, matching the official Go-Explore, which ships these values inside each task. Caught by publish-time demo replay verification (score mismatch).
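The fix can be pictured with the small sketch below, in which each pick carries its own immutable copy of the cell's state. The `Cell` and `Task` classes and the `walk_result` helper are hypothetical simplifications; the real sample() and walk in 2-go-explore.py are not shown in this PR page.

```python
from dataclasses import dataclass
from typing import Dict, Sequence, Tuple


@dataclass
class Cell:
    """Mutable archive entry; may be replaced mid-batch by a better result."""
    snapshot: bytes
    score: int
    trajectory: Tuple[int, ...]


@dataclass(frozen=True)
class Task:
    """Immutable capture of a cell at the moment it was picked."""
    cell_key: str
    snapshot: bytes
    score: int
    trajectory: Tuple[int, ...]


def sample(archive: Dict[str, Cell], cell_key: str) -> Task:
    """Freeze snapshot/score/trajectory so later archive updates cannot leak in."""
    cell = archive[cell_key]
    return Task(cell_key, cell.snapshot, cell.score, tuple(cell.trajectory))


def walk_result(
    task: Task, new_actions: Sequence[int], reward_gained: int
) -> Tuple[int, Tuple[int, ...]]:
    """Extend the captured prefix, never the cell's current contents."""
    return task.score + reward_gained, task.trajectory + tuple(new_actions)
```

If the cell is overwritten after `sample()` returns, `walk_result` still builds on the score and prefix that the explored actions were actually executed from, so the stitched trajectory corresponds to one real playthrough.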

…extract + GRU PPO

extract_demo.py: pull the best Phase-1 trajectory from the GE checkpoint + experience log, replay-verify it reproduces the archived score (31,000), truncate after the last reward, save actions/rewards/periodic ALE states.

env_robustify.py: ReplayResetEnv (episodes restore to a demo point and play forward under sticky actions; success = raw score >= demo return; lag/success kills) + ResetManager curriculum (starting points march backward as the agent matches the demo, forward-cumsum move rule per atari-reset, nudge forward on collapse).

3-robustify.py: recurrent (GRU) PPO over N restore-capable ALE envs, truncated BPTT with done-masked state, advantage chains cut at artificial success resets, periodic from-reset sticky eval -> final.json.

Single-machine scaled port of openai/atari-reset; SIL/multi-demo/autoscale are off-by-default flags. Runs end-to-end; curriculum logic covered by harness T0.
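To make the backward curriculum concrete, here is a deliberately simplified sketch. It swaps the forward-cumsum move rule mentioned above for a plain success-rate threshold, so it is not the ResetManager logic from env_robustify.py; the class name, step size, and thresholds are all hypothetical.

```python
class BackwardCurriculum:
    """Toy backward curriculum: the start point moves toward the beginning of
    the demo when the agent matches it, and nudges forward on collapse.

    Simplification: a fixed success-rate threshold replaces the
    forward-cumsum move rule of atari-reset.
    """

    def __init__(
        self,
        demo_len: int,
        step: int = 5,                      # hypothetical move size, in actions
        move_threshold: float = 0.2,        # hypothetical
        collapse_threshold: float = 0.02,   # hypothetical
    ) -> None:
        self.demo_len = demo_len
        self.step = step
        self.move_threshold = move_threshold
        self.collapse_threshold = collapse_threshold
        # Begin just before the end of the demo.
        self.start = max(demo_len - step, 0)

    def update(self, success_rate: float) -> int:
        """Move the start index given the recent as-good-as-demo rate."""
        if success_rate >= self.move_threshold:
            self.start = max(self.start - self.step, 0)
        elif success_rate <= self.collapse_threshold:
            self.start = min(self.start + self.step, self.demo_len - 1)
        return self.start

    def progress(self) -> float:
        """Fraction of the demo the curriculum has retreated through."""
        return 1.0 - self.start / self.demo_len
```

Under this toy rule, a curriculum over a 250-action demo that stalls with `start` near action 195 would report `progress()` of about 0.22, which is the kind of plateau the PR description reports.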

Preflight audit found the from-reset eval reused ReplayResetEnv with the training-curriculum kills active, so eval episodes were cut by lag/success-kill before game_over — value_mean reported a key-but-slower-than-demo policy as ~0. Disable both kills in evaluate() and cap at the standard 18000-frame Montezuma episode (4500 agent steps) so eval runs from reset to a real game_over. Also checkpoint torch + per-env RNG state and restore them on resume (was global numpy RNG only), so the kill->resume contract is faithful for the deterministic streams (MPS policy sampling has no bit determinism).
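The eval change amounts to running each episode until a real game over or the step cap, with no curriculum kills in between. A minimal sketch under stated assumptions: `evaluate_episode`, `step_fn`, and `policy` are hypothetical names, and the environment is reduced to a callable that returns `(reward, game_over)`. The 18000-frame cap and its 4500-agent-step equivalent come from the commit message.

```python
from typing import Callable, Tuple

MAX_EPISODE_FRAMES = 18_000
FRAME_SKIP = 4  # implied by 18000 frames == 4500 agent steps
MAX_AGENT_STEPS = MAX_EPISODE_FRAMES // FRAME_SKIP


def evaluate_episode(
    step_fn: Callable[[int], Tuple[float, bool]],
    policy: Callable[[int], int],
    max_steps: int = MAX_AGENT_STEPS,
) -> Tuple[float, int, bool]:
    """Play from reset until game over or the step cap, whichever comes first.

    No lag or success kills are applied here: the only exits are a real
    game_over from the environment or reaching max_steps.
    """
    total_reward = 0.0
    steps = 0
    game_over = False
    while not game_over and steps < max_steps:
        reward, game_over = step_fn(policy(steps))
        total_reward += reward
        steps += 1
    return total_reward, steps, game_over
```

Returning the `game_over` flag alongside the score lets the caller distinguish episodes that ended naturally from those truncated by the cap.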

Cutting the demo just after the first reward (--max-rewards 1) yields a short first-key-only demo (~250 actions vs ~5300), a far shorter horizon for the robustification backward curriculum to bootstrap on.
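Truncating at the Kth reward is simple enough to sketch directly. The function below is a hypothetical illustration of what `--max-rewards K` is described as doing, not the code in extract_demo.py; it treats any non-zero reward as a reward event, which is an assumption.

```python
from typing import List, Sequence, Tuple


def truncate_after_kth_reward(
    actions: Sequence[int], rewards: Sequence[float], k: int
) -> Tuple[List[int], List[float]]:
    """Keep the demo up to and including the step that yields the k-th
    non-zero reward. If fewer than k rewards occur, keep the whole demo."""
    if k < 1:
        raise ValueError("k must be >= 1")
    if len(actions) != len(rewards):
        raise ValueError("actions and rewards must have the same length")
    seen = 0
    for i, reward in enumerate(rewards):
        if reward != 0:
            seen += 1
            if seen == k:
                return list(actions[: i + 1]), list(rewards[: i + 1])
    return list(actions), list(rewards)
```

For example, with rewards `[0, 0, 100, 0, 300]` and `k=1`, the demo is cut to its first three steps, ending on the step that earned the first reward.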

The first-key robustification curriculum plateaus where as_good_as_demo caps ~0.34: the policy commits before reliably executing the demo suffix under sticky actions. Expose the entropy bonus as a flag (default unchanged) to test whether more exploration breaks the plateau.

Document the backward-algorithm robustification (3-robustify.py): the curriculum bootstraps with a first-key demo + 128 envs but plateaus ~22% of the way, with no from-reset score on a single machine. Honest negative result, single seed, no benchmark row claimed.

…ation + robustification)

Merge the three Montezuma results into a single table with a protocol column, trim the prose to one note. Restores the exploration row (31,000, replay-verified) that lived only on the superseded #132 branch.

# Conflicts: # README.md

dnddnjs and others added 14 commits May 24, 2026 12:08

Ignore local CLAUDE.md collaboration notes

eaa1c15

4-atari-hard: add envpool count-based exploration

9b8097f

README: link the Montezuma PPO+RND W&B report

1d3b510

4-atari-hard/extract_demo: --max-rewards to truncate at the Kth reward

4e38dcb


dnddnjs mentioned this pull request Jun 12, 2026

4-atari-hard: Go-Explore (exploration phase) on Montezuma's Revenge + benchmark #132

Closed

ssoyyoung added 2 commits June 12, 2026 12:25

Merge remote-tracking branch 'origin/master' into ai/go-explore

a77ebc2

# Conflicts: # README.md

dnddnjs merged commit 3d421f6 into master Jun 12, 2026

dnddnjs merged 16 commits into master from ai/go-explore
