4-atari-hard: Go-Explore on Montezuma's Revenge (exploration + robustification) #133
Merged
New chapter for hard-exploration Atari: PPO with Random Network Distillation (Burda et al., 2018) as the curiosity bonus.
- env.py: ALE/MontezumaRevenge-v5 (and pitfall, private_eye) with the standard Atari preprocessing, but no FireResetEnv and no LifeLossTerminalEnv (uninterrupted episodes, so intrinsic returns can chain across deaths).
- 1-ppo-rnd.py: two-value-head ActorCritic, RND target/predictor with LeakyReLU, single-frame normalized input clipped to [-5, 5], obs RMS seeded by 50 rollouts of a random agent, intrinsic reward scaled by the running std of discounted intrinsic returns, dual GAE (extrinsic episodic + intrinsic non-episodic), predictor updated on 25% of each minibatch, combined advantage A = 2*A_ext + 1*A_int.
Not run end-to-end yet; only static shapes and module wiring have been sanity-checked.
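The dual-GAE step and the combined advantage A = 2*A_ext + 1*A_int can be sketched as below. This is a minimal pure-Python illustration, not the code in 1-ppo-rnd.py: the function names, the `dones[t]` convention (True when the episode ends after step t), and the discount values in the usage are assumptions made for the example.

```python
def gae(rewards, values, last_value, dones, gamma, lam, episodic):
    """Generalized advantage estimation over one rollout (hypothetical helper)."""
    advantages = [0.0] * len(rewards)
    running = 0.0
    next_value = last_value
    for t in reversed(range(len(rewards))):
        # Episodic streams stop bootstrapping at a terminal step;
        # non-episodic streams ignore `dones` and keep chaining across deaths.
        mask = 0.0 if (episodic and dones[t]) else 1.0
        delta = rewards[t] + gamma * next_value * mask - values[t]
        running = delta + gamma * lam * mask * running
        advantages[t] = running
        next_value = values[t]
    return advantages


def combined_advantage(a_ext, a_int, ext_coef=2.0, int_coef=1.0):
    """A = ext_coef * A_ext + int_coef * A_int, element-wise."""
    return [ext_coef * e + int_coef * i for e, i in zip(a_ext, a_int)]


# Toy rollout: the episode ends after step 0 (gamma = lam = 1 for readability).
rewards = [1.0, 1.0]
values = [0.0, 0.0]
dones = [True, False]
a_ext = gae(rewards, values, 0.0, dones, gamma=1.0, lam=1.0, episodic=True)
a_int = gae(rewards, values, 0.0, dones, gamma=1.0, lam=1.0, episodic=False)
combined = combined_advantage(a_ext, a_int)
```

In the toy rollout the extrinsic advantage at step 0 does not see the reward after the terminal, while the intrinsic one does, which is the point of treating the intrinsic stream as non-episodic.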
PPO+RND made reproducible and resumable. Shared run plumbing (seed, metrics.jsonl, periodic/milestone/best checkpoints, resume, final summary) lives in env.py's RunLogger, keeping the algorithm file focused. 512 parallel envs crack the first-key bottleneck (128 envs never scored in 50M); final mean per-game return ~3120 @ 65M steps, single seed (M4 Max), above the paper PPO baseline (2497). Adds a README benchmark row. Count-based exploration is deferred to a later PR (not yet trained/benchmarked).
Restore-based archive exploration (Ecoffet et al. 2019/2021), no neural net: 11x8x9 downscaled-frame cells, 1/sqrt(seen+1) selection, repeated random actions (p=0.95), raw-score accept rule, virtual DONE cell, global experience log with prev_id chains (demo source for robustification), 12-worker spawn pool over raw gymnasium ALE clone/restore. Run contract: --seed/--total-frames/--run-dir/--ckpt-every/--resume; explog flushed as compressed chunks; checkpoint = archive+log+RNG at batch boundaries. Smoke: 23k steps/s aggregate, first key at 100k steps.
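The 1/sqrt(seen+1) selection rule and the repeated-random-action policy (p=0.95) can be sketched as follows. This is an illustrative sketch only: `selection_weights`, `pick_cell` and `next_action` are hypothetical names, and the real 2-go-explore.py may organize the archive differently.

```python
import math
import random


def selection_weights(seen_counts):
    """Weight each cell by 1/sqrt(seen + 1): rarely visited cells are favored."""
    return [1.0 / math.sqrt(n + 1) for n in seen_counts]


def pick_cell(cell_ids, seen_counts, rng):
    """Sample one cell id in proportion to its selection weight."""
    weights = selection_weights(seen_counts)
    return rng.choices(cell_ids, weights=weights, k=1)[0]


def next_action(prev_action, n_actions, rng, repeat_p=0.95):
    """Repeat the previous action with probability repeat_p, else resample."""
    if prev_action is not None and rng.random() < repeat_p:
        return prev_action
    return rng.randrange(n_actions)


rng = random.Random(0)
weights = selection_weights([0, 3, 8])
chosen = pick_cell(["a", "b", "c"], [0, 3, 8], rng)
```

A cell seen 0 times gets weight 1, one seen 3 times gets 0.5, and one seen 8 times gets 1/3, so selection pressure decays slowly rather than cutting off well-explored cells entirely.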
…n's dir Cross-run-dir resume (harness relaunches into a fresh run dir) could not see chunks flushed by the original run; chunk lookup now falls back to the ancestor run's explog dir and resume fails loudly if any chunk is unreachable.
Same hard-exploration domain, two paradigms side by side: 1-ppo-rnd.py (gradient + intrinsic reward, envpool) and 2-go-explore.py (archive + emulator restore, raw ALE). Go-Explore keeps its own plumbing in env_go_explore.py since the two stacks share nothing.
…walk
A result earlier in the same batch can replace a cell; walking a later result against the cell's CURRENT score/trajectory stitched actions executed from the old state onto the new prefix, fabricating scores no single playthrough achieved. sample() now freezes snapshot/score/trajectory per pick and the walk uses the capture, matching the official Go-Explore, which ships these values inside each task. Caught by publish-time demo replay verification (score mismatch).
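The fix described above amounts to capturing the cell's values at pick time and never re-reading the archive during the walk. A minimal sketch, assuming a dict-based archive and hypothetical names (`Pick`, `sample_pick`, `walk_result`) that are not taken from the PR:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pick:
    """Immutable capture of a cell at selection time (hypothetical structure)."""
    cell_key: tuple
    snapshot: bytes
    score: float
    trajectory: tuple


def sample_pick(archive, cell_key):
    """Freeze snapshot/score/trajectory so later archive updates cannot leak in."""
    cell = archive[cell_key]
    return Pick(cell_key, cell["snapshot"], cell["score"], tuple(cell["trajectory"]))


def walk_result(pick, new_actions, reward_gained):
    """Extend the FROZEN prefix, not whatever the archive holds now."""
    return pick.score + reward_gained, pick.trajectory + tuple(new_actions)


archive = {(1, 2): {"snapshot": b"old", "score": 100.0, "trajectory": [0, 1]}}
pick = sample_pick(archive, (1, 2))

# An earlier result in the same batch replaces the cell before this walk is processed.
archive[(1, 2)] = {"snapshot": b"new", "score": 400.0, "trajectory": [5, 5, 5]}

score, trajectory = walk_result(pick, new_actions=[2, 3], reward_gained=50.0)
```

Because the walk extends the frozen prefix, the resulting score and action sequence correspond to a single playthrough that actually happened from the old snapshot.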
…extract + GRU PPO
- extract_demo.py: pull the best Phase-1 trajectory from the GE checkpoint + experience log, replay-verify it reproduces the archived score (31,000), truncate after the last reward, save actions/rewards/periodic ALE states.
- env_robustify.py: ReplayResetEnv (episodes restore to a demo point and play forward under sticky actions; success = raw score >= demo return; lag/success kills) + ResetManager curriculum (starting points march backward as the agent matches the demo, forward-cumsum move rule per atari-reset, nudge forward on collapse).
- 3-robustify.py: recurrent (GRU) PPO over N restore-capable ALE envs, truncated BPTT with done-masked state, advantage chains cut at artificial success resets, periodic from-reset sticky eval -> final.json.
Single-machine scaled port of openai/atari-reset; SIL/multi-demo/autoscale are off-by-default flags. Runs end-to-end; curriculum logic covered by harness T0.
Preflight audit found the from-reset eval reused ReplayResetEnv with the training-curriculum kills active, so eval episodes were cut by lag/success-kill before game_over — value_mean reported a key-but-slower-than-demo policy as ~0. Disable both kills in evaluate() and cap at the standard 18000-frame Montezuma episode (4500 agent steps) so eval runs from reset to a real game_over. Also checkpoint torch + per-env RNG state and restore them on resume (was global numpy RNG only), so the kill->resume contract is faithful for the deterministic streams (MPS policy sampling has no bit determinism).
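The per-env RNG part of the kill->resume contract can be illustrated with the standard library alone. This is a sketch under assumptions: it uses `random.Random` streams as stand-ins for the per-env generators, and `capture_rng`/`restore_rng` are hypothetical names; in a PyTorch run the analogous calls would be `torch.get_rng_state()` and `torch.set_rng_state()`.

```python
import random


def capture_rng(env_rngs):
    """Snapshot the state of every per-env RNG stream for a checkpoint."""
    return [rng.getstate() for rng in env_rngs]


def restore_rng(env_rngs, states):
    """Put every per-env RNG stream back to its checkpointed state."""
    for rng, state in zip(env_rngs, states):
        rng.setstate(state)


# Four independent per-env streams, seeded deterministically.
env_rngs = [random.Random(1000 + i) for i in range(4)]
checkpoint = capture_rng(env_rngs)

before_kill = [rng.random() for rng in env_rngs]   # draws made after the checkpoint

restore_rng(env_rngs, checkpoint)                  # simulated resume
after_resume = [rng.random() for rng in env_rngs]  # same draws are reproduced
```

Restoring only a global generator would leave these per-env streams at whatever state a fresh process gives them, so the resumed run would diverge from the original at the first sampled sticky action.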
Cutting the demo just after the first reward (--max-rewards 1) yields a short first-key-only demo (~250 actions vs ~5300), a far shorter horizon for the robustification backward curriculum to bootstrap on.
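Truncating a demo after the k-th reward can be sketched as below. The function name and the list-based demo format are assumptions for illustration; only the `--max-rewards` behavior (cut just after the first reward when set to 1, after the last reward otherwise) comes from the commit messages.

```python
def truncate_after_rewards(actions, rewards, max_rewards=None):
    """Keep the demo up to and including the step that yields the cutoff reward.

    max_rewards=None cuts after the last reward; max_rewards=k (k >= 1) cuts
    after the k-th reward, or after the last one if there are fewer than k.
    """
    hits = [t for t, r in enumerate(rewards) if r != 0]
    if not hits:
        # No reward anywhere: nothing to anchor the cut on, keep the demo as-is.
        return list(actions), list(rewards)
    if max_rewards is None:
        cut = hits[-1]
    else:
        cut = hits[min(max_rewards, len(hits)) - 1]
    return list(actions[: cut + 1]), list(rewards[: cut + 1])


actions = [0, 1, 2, 3, 4]
rewards = [0, 100, 0, 300, 0]
first_key_demo = truncate_after_rewards(actions, rewards, max_rewards=1)
full_demo = truncate_after_rewards(actions, rewards)
```

With `max_rewards=1` the toy demo shrinks to the two actions that reach the first reward, mirroring how the first-key demo shortens the horizon the backward curriculum has to cover.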
The first-key robustification curriculum plateaus with as_good_as_demo capped at ~0.34: the policy commits before it can reliably execute the demo suffix under sticky actions. Expose the entropy bonus as a flag (default unchanged) to test whether more exploration breaks the plateau.
Document the backward-algorithm robustification (3-robustify.py): the curriculum bootstraps with a first-key demo + 128 envs but plateaus ~22% of the way, with no from-reset score on a single machine. Honest negative result, single seed, no benchmark row claimed.
…ation + robustification) Merge the three Montezuma results into a single table with a protocol column, trim the prose to one note. Restores the exploration row (31,000, replay-verified) that lived only on the superseded #132 branch.
# Conflicts: # README.md
Go-Explore on Montezuma's Revenge, both phases.

Phase 1 (exploration, 2-go-explore.py): restore-based archive exploration (knowledge-free downscaled cells), a deterministic trajectory search. The score is a search result, not an RL-policy score, so it is not cross-compared with the sticky-action RND numbers.

Phase 2 (robustification, 3-robustify.py): distil the demo into a recurrent policy that plays under sticky actions via the backward algorithm (Salimans & Chen 2018; Ecoffet et al. 2021). Reported as an honest single-machine negative result.

README updated with the honest write-up. Single seed throughout. Supersedes #132 (exploration phase alone). Merge left to a maintainer.