
4-atari-hard: Go-Explore on Montezuma's Revenge (exploration + robustification) #133

Merged
dnddnjs merged 16 commits into master from ai/go-explore on Jun 12, 2026

dnddnjs (Contributor) commented Jun 12, 2026

Go-Explore on Montezuma's Revenge, both phases.

Phase 1 — exploration (2-go-explore.py): restore-based archive exploration (knowledge-free downscaled cells), a deterministic trajectory search. The score is a search result, not an RL-policy score, so it is not cross-compared with the sticky-action RND numbers.

Phase 2 — robustification (3-robustify.py): distil the demo into a recurrent policy that plays under sticky actions via the backward algorithm (Salimans & Chen 2018; Ecoffet et al. 2021). Reported as an honest single-machine negative result:

  • Bootstrap works — a first-key-only demo (~250 actions) + 128 envs makes the curriculum retreat immediately (as_good_as_demo → 1.0); the full ~5,300-action demo at 16 envs never moves.
  • But the curriculum plateaus ~22% of the way back through the demo: the policy masters the last ~55 actions but not the earlier platforming. Neither 10× frames nor a 100× entropy bonus changes this, which points to a scale ceiling relative to the original's hundreds to thousands of envs. There is no from-reset score, so no benchmark row is claimed.

README updated with the honest write-up. Single seed throughout. Supersedes #132 (exploration phase alone). Merge left to a maintainer.

dnddnjs and others added 14 commits May 24, 2026 12:08
New chapter for hard-exploration Atari. PPO with Random Network
Distillation (Burda et al., 2018) as the curiosity bonus.

- env.py: ALE/MontezumaRevenge-v5 (and pitfall, private_eye) with the
  standard Atari preprocessing, no FireResetEnv, no LifeLossTerminalEnv
  (uninterrupted episodes so intrinsic returns can chain across deaths).
- 1-ppo-rnd.py: two-value-head ActorCritic, RND target/predictor with
  LeakyReLU, single-frame normalized input clipped to [-5, 5], obs RMS
  seeded by 50 rollouts of a random agent, intrinsic reward scaled by
  running std of discounted intrinsic returns, dual GAE (extrinsic
  episodic + intrinsic non-episodic), predictor updated on 25% of each
  minibatch, combined advantage A = 2*A_ext + 1*A_int.
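The dual-GAE combination in the last bullet can be sketched as follows. This is a minimal pure-Python illustration, not the code in 1-ppo-rnd.py: the `gae` and `combined_advantage` helpers, their argument names, and the discount values in the usage note are assumptions. Only the episodic/non-episodic split and the 2:1 weighting come from the commit message.

```python
def gae(rewards, values, last_value, dones, gamma, lam, episodic):
    """Generalized advantage estimation over one rollout (hypothetical helper).

    dones[t] is True when the episode ended after step t. With
    episodic=False the done flags are ignored, so returns chain across
    episode boundaries (the non-episodic intrinsic stream).
    """
    advantages = [0.0] * len(rewards)
    next_value, running = last_value, 0.0
    for t in reversed(range(len(rewards))):
        not_done = 0.0 if (episodic and dones[t]) else 1.0
        delta = rewards[t] + gamma * next_value * not_done - values[t]
        running = delta + gamma * lam * not_done * running
        advantages[t] = running
        next_value = values[t]
    return advantages


def combined_advantage(adv_ext, adv_int, ext_coef=2.0, int_coef=1.0):
    # A = 2 * A_ext + 1 * A_int, the weighting stated in the commit message.
    return [ext_coef * e + int_coef * i for e, i in zip(adv_ext, adv_int)]
```

A caller would run `gae` twice on the same rollout, once with `episodic=True` on extrinsic rewards and once with `episodic=False` on normalized intrinsic rewards, each with its own value head, then pass both results to `combined_advantage`. The RND paper uses a larger discount for the extrinsic stream than for the intrinsic one; whether this script keeps those values is not stated here.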

Not run end-to-end yet. Sanity-checked static shapes and module wiring.

PPO+RND made reproducible and resumable. Shared run plumbing (seed,
metrics.jsonl, periodic/milestone/best checkpoints, resume, final summary)
lives in env.py's RunLogger, keeping the algorithm file focused. 512 parallel
envs crack the first-key bottleneck (128 envs never scored in 50M); final mean
per-game return ~3120 @ 65M steps, single seed (M4 Max), above the paper PPO
baseline (2497). Adds a README benchmark row. Count-based exploration is
deferred to a later PR (not yet trained/benchmarked).

Restore-based archive exploration (Ecoffet et al. 2019/2021), no neural
net: 11x8x9 downscaled-frame cells, 1/sqrt(seen+1) selection, repeated
random actions (p=0.95), raw-score accept rule, virtual DONE cell, global
experience log with prev_id chains (demo source for robustification),
12-worker spawn pool over raw gymnasium ALE clone/restore.
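The cell mapping, selection weight, and repeated-action rule above can be sketched in a few lines. This is an illustrative reading, not the code in 2-go-explore.py: it assumes "11x8x9" means an 11-wide by 8-high grid with 9 grey levels, that frames are grayscale rows of 0..255 integers at least 8 rows by 11 columns, and that p=0.95 is the probability of repeating the previous action. All function names are hypothetical.

```python
import math
import random


def cell_key(frame, width=11, height=8, levels=9):
    """Map a grayscale frame (rows of 0..255 ints) to a hashable cell key."""
    rows, cols = len(frame), len(frame[0])
    key = []
    for by in range(height):
        y0, y1 = by * rows // height, (by + 1) * rows // height
        for bx in range(width):
            x0, x1 = bx * cols // width, (bx + 1) * cols // width
            block = [frame[y][x] for y in range(y0, y1) for x in range(x0, x1)]
            mean = sum(block) / len(block)
            # Quantize the block mean into `levels` intensity bins.
            key.append(min(levels - 1, int(mean * levels / 256)))
    return tuple(key)


def selection_weight(times_seen):
    # 1 / sqrt(seen + 1): rarely seen cells are picked more often.
    return 1.0 / math.sqrt(times_seen + 1)


def sample_cell(seen_counts, rng):
    """Pick a cell key from {cell_key: times_seen} in proportion to its weight."""
    keys = list(seen_counts)
    weights = [selection_weight(seen_counts[k]) for k in keys]
    return rng.choices(keys, weights=weights, k=1)[0]


def next_action(prev_action, n_actions, rng, repeat_prob=0.95):
    """Repeat the previous action with probability repeat_prob, else pick uniformly."""
    if prev_action is not None and rng.random() < repeat_prob:
        return prev_action
    return rng.randrange(n_actions)
```

Passing an explicit `random.Random` instance rather than using the module-level generator keeps each worker's stream separately seedable, which matters for the checkpoint and resume contract described next.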

Run contract: --seed/--total-frames/--run-dir/--ckpt-every/--resume;
explog flushed as compressed chunks; checkpoint = archive+log+RNG at
batch boundaries. Smoke test: 23k steps/s aggregate, first key at 100k steps.

…n's dir

Cross-run-dir resume (harness relaunches into a fresh run dir) could not
see chunks flushed by the original run; chunk lookup now falls back to
the ancestor run's explog dir and resume fails loudly if any chunk is
unreachable.
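A minimal sketch of the fallback lookup described above, assuming a hypothetical layout in which each run dir holds an `explog/` subdirectory of chunk files. The function name, arguments, and directory layout are illustrative, not the repository's.

```python
from pathlib import Path


def find_chunk(chunk_name, run_dir, ancestor_run_dir=None):
    """Locate an experience-log chunk, falling back to the ancestor run.

    Raises FileNotFoundError instead of silently skipping a chunk, so a
    resume with an incomplete log fails loudly.
    """
    candidates = [Path(run_dir) / "explog" / chunk_name]
    if ancestor_run_dir is not None:
        candidates.append(Path(ancestor_run_dir) / "explog" / chunk_name)
    for path in candidates:
        if path.is_file():
            return path
    searched = ", ".join(str(p) for p in candidates)
    raise FileNotFoundError(f"chunk {chunk_name!r} unreachable; searched: {searched}")
```

Resolving every chunk once at resume time, before any exploration step runs, surfaces a missing chunk immediately rather than at demo-extraction time.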

Same hard-exploration domain, two paradigms side by side: 1-ppo-rnd.py
(gradient + intrinsic reward, envpool) and 2-go-explore.py (archive +
emulator restore, raw ALE). Go-Explore keeps its own plumbing in
env_go_explore.py since the two stacks share nothing.

…walk

A result earlier in the same batch can replace a cell; walking a later
result against the cell's CURRENT score/trajectory stitched actions
executed from the old state onto the new prefix, fabricating scores no
single playthrough achieved. sample() now freezes snapshot/score/
trajectory per pick and the walk uses the capture — matching the
official Go-Explore, which ships these values inside each task.
Caught by publish-time demo replay verification (score mismatch).
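The fix amounts to making each pick an immutable capture. The sketch below illustrates the idea with hypothetical `Task`, `Archive`, and `walk` names. It stores trajectories as plain tuples for brevity, whereas the real log uses prev_id chains.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    """Values captured at pick time; later archive updates cannot change them."""
    cell_key: str
    snapshot: bytes
    score: float
    trajectory: tuple


class Archive:
    def __init__(self):
        # cell_key -> {"snapshot": bytes, "score": float, "trajectory": list}
        self.cells = {}

    def sample(self, cell_key):
        cell = self.cells[cell_key]
        # Copy the trajectory so the task does not alias mutable archive state.
        return Task(cell_key, cell["snapshot"], cell["score"],
                    tuple(cell["trajectory"]))


def walk(task, new_actions, new_rewards):
    """Score and trajectory of a result, relative to the frozen pick."""
    return task.score + sum(new_rewards), task.trajectory + tuple(new_actions)
```

If the cell is replaced between `sample()` and `walk()`, the result is still stitched onto the prefix it was actually executed from, so every recorded score corresponds to a single replayable playthrough.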

…extract + GRU PPO

extract_demo.py: pull the best Phase-1 trajectory from the GE checkpoint +
experience log, replay-verify it reproduces the archived score (31,000),
truncate after the last reward, save actions/rewards/periodic ALE states.

env_robustify.py: ReplayResetEnv (episodes restore to a demo point and play
forward under sticky actions; success = raw score >= demo return; lag/success
kills) + ResetManager curriculum (starting points march backward as the agent
matches the demo, forward-cumsum move rule per atari-reset, nudge forward on
collapse).
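The curriculum idea can be sketched as a small state machine. This is a deliberately simplified stand-in for ResetManager: it uses a single success-rate threshold rather than the forward-cumsum move rule of atari-reset, and the class name, thresholds, and step size are all hypothetical.

```python
class BackwardCurriculum:
    """Move the episode start point backward through a demo as the agent succeeds."""

    def __init__(self, demo_len, move_threshold=0.2, collapse_threshold=0.05, step=1):
        self.demo_len = demo_len
        self.move_threshold = move_threshold          # assumed value
        self.collapse_threshold = collapse_threshold  # assumed value
        self.step = step
        self.start = demo_len - 1  # begin right before the end of the demo

    def update(self, successes, episodes):
        """Update the start index from the latest batch of episodes."""
        rate = successes / episodes if episodes else 0.0
        if rate >= self.move_threshold:
            # Agent matches the demo often enough: start earlier.
            self.start = max(0, self.start - self.step)
        elif rate < self.collapse_threshold:
            # Performance collapsed: nudge the start point forward again.
            self.start = min(self.demo_len - 1, self.start + self.step)
        return self.start

    def progress(self):
        """Fraction of the demo the curriculum has retreated through."""
        return 1.0 - self.start / (self.demo_len - 1) if self.demo_len > 1 else 1.0
```

Under this reading, a curriculum that stalls "~22% of the way" is one whose `start` index stops decreasing after covering roughly the last fifth of the demo.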

3-robustify.py: recurrent (GRU) PPO over N restore-capable ALE envs, truncated
BPTT with done-masked state, advantage chains cut at artificial success resets,
periodic from-reset sticky eval -> final.json. Single-machine scaled port of
openai/atari-reset; SIL/multi-demo/autoscale are off-by-default flags. Runs
end-to-end; curriculum logic covered by harness T0.
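"Done-masked state" can be illustrated without a deep-learning framework. The sketch below unrolls a toy recurrent cell over one window for a single env and zeroes the hidden state whenever the previous step ended an episode. The `unroll` helper and the toy cell are assumptions for illustration; 3-robustify.py uses a GRU, where the carried-in state `h0` would additionally be detached from the graph to truncate backpropagation.

```python
def unroll(cell, h0, observations, dones_before):
    """Unroll a recurrent cell over one truncated-BPTT window.

    dones_before[t] is True when the step before t ended an episode, so the
    state fed into step t must restart from zeros rather than leak memory
    across the episode boundary.
    """
    h = list(h0)
    outputs = []
    for obs, done in zip(observations, dones_before):
        if done:
            h = [0.0] * len(h)
        h = cell(h, obs)
        outputs.append(h)
    return outputs, h


def toy_cell(h, obs):
    # Stand-in for a GRU update: decay the old state and add the observation.
    return [0.5 * value + obs for value in h]
```

The returned final state is what the next window would receive as `h0`. The same mask matters at artificial success resets, where the commit message also cuts the advantage chain.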

Preflight audit found the from-reset eval reused ReplayResetEnv with the
training-curriculum kills active, so eval episodes were cut by lag/success-kill
before game_over — value_mean reported a key-but-slower-than-demo policy as ~0.
Disable both kills in evaluate() and cap at the standard 18000-frame Montezuma
episode (4500 agent steps) so eval runs from reset to a real game_over.

Also checkpoint torch + per-env RNG state and restore them on resume (was global
numpy RNG only), so the kill->resume contract is faithful for the deterministic
streams (MPS policy sampling has no bit determinism).
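The per-env part of that contract can be sketched with the standard library. The example uses `random.Random` instances as stand-ins for whatever per-env generators the script holds, and both helper names are hypothetical. For the torch stream, `torch.get_rng_state()` and `torch.set_rng_state()` would play the analogous role.

```python
import pickle
import random


def save_rng_states(env_rngs):
    """Serialize the state of every per-env generator into one blob."""
    return pickle.dumps([rng.getstate() for rng in env_rngs])


def load_rng_states(blob, env_rngs):
    """Restore generator states in place; the env count must match."""
    states = pickle.loads(blob)
    if len(states) != len(env_rngs):
        raise ValueError(f"checkpoint has {len(states)} RNG states, "
                         f"but {len(env_rngs)} envs were created")
    for rng, state in zip(env_rngs, states):
        rng.setstate(state)
```

A resumed run that restores these states draws the same sticky-action decisions the killed run would have drawn, which is what makes the deterministic streams faithful across a kill and resume.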

Cutting the demo just after the first reward (--max-rewards 1) yields a short
first-key-only demo (~250 actions vs ~5300), a far shorter horizon for the
robustification backward curriculum to bootstrap on.
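The truncation itself is simple list slicing. The sketch below assumes the demo is stored as parallel per-step `actions` and `rewards` lists and that `max_rewards` mirrors the `--max-rewards` flag. The function name and error handling are illustrative, not taken from extract_demo.py.

```python
def truncate_demo(actions, rewards, max_rewards=None):
    """Cut a demo just after its last rewarded step.

    With max_rewards=k, cut just after the k-th rewarded step instead,
    e.g. max_rewards=1 keeps only the prefix up to the first reward.
    """
    if max_rewards is not None and max_rewards < 1:
        raise ValueError("max_rewards must be at least 1")
    reward_steps = [t for t, r in enumerate(rewards) if r != 0]
    if not reward_steps:
        raise ValueError("demo contains no reward to truncate after")
    if max_rewards is not None:
        reward_steps = reward_steps[:max_rewards]
    end = reward_steps[-1] + 1  # keep the rewarded step itself
    return actions[:end], rewards[:end]
```

Dropping everything after the cut matters for the backward curriculum because success is defined as matching the demo's return: trailing unrewarded actions would only lengthen the horizon without changing the target.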

The first-key robustification curriculum plateaus with as_good_as_demo capped at ~0.34:
the policy commits before it can reliably execute the demo suffix under sticky actions.
Expose the entropy bonus as a flag (default unchanged) to test whether more
exploration breaks the plateau.

Document the backward-algorithm robustification (3-robustify.py): the curriculum
bootstraps with a first-key demo + 128 envs but plateaus ~22% of the way, with no
from-reset score on a single machine. Honest negative result, single seed, no
benchmark row claimed.

…ation + robustification)

Merge the three Montezuma results into a single table with a protocol column,
trim the prose to one note. Restores the exploration row (31,000, replay-verified)
that lived only on the superseded #132 branch.
@dnddnjs dnddnjs merged commit 3d421f6 into master Jun 12, 2026