Hierarchical Reinforcement Learning
Ronald Parr, Duke University
(c) 2005 Ronald Parr
From the ICML 2005 workshop on Rich Representations for Reinforcement Learning

Why?
- Knowledge transfer/injection
- Biases exploration
- Faster solutions (even if the model is known)

Why Not?
- Some cool ideas and algorithms, but no killer apps or wide acceptance yet
- A good idea that needs more refinement:
  - More user friendliness
  - More rigor in problem specification and in measures of progress
    (e.g., Improvement = Flat - (Hierarchical + Hierarchy); in what units?)

Introductory Comments
- For any X: hierarchical X is nearly as old as X
- Methods are very hard to evaluate
- Measuring improvement:
  - Improvement = Flat - (Hierarchical + Hierarchy)
  - In what units?
- Algorithmic desiderata: convergence? optimality?

Overview
- Temporal abstraction
- Goal abstraction
- Challenges
- (These are not orthogonal)

Temporal Abstraction
- What's the issue? We want "macro" actions that span multiple time steps
- Advantages:
  - Avoid dealing with (exploring, or computing values for) less desirable states
  - Reuse experience across problems/regions
- What's not obvious (except in hindsight):
  - Dealing with the Markov assumption
  - Getting the math right (stability)

Looking Back
- Classical planning analogs: macro-actions, chunking, EBL
- RL roots: Maes & Brooks 90; Singh 91; Mahadevan & Connell 93
  - Grappled with representation and the Markov assumption
- MDP/control roots are oldest: Forestier & Varaiya 78
  - Got the math right, ignored representation

Stable Temporal Abstraction
- Recall the dynamic programming (Bellman) operator:
    (T V)(s) = max_a [ R(s, a) + γ Σ_{s'} P(s' | s, a) V(s') ]
- T is a contraction in maximum norm
- This can be used to prove:
  - Convergence of value iteration (and bounds)
  - Convergence of policy iteration (and bounds)
  - Convergence of TD and Q-learning
- (Not necessarily the only way to prove these things!)

The Big(?) Insight
- Replace the Bellman operator above with
    (T' V)(s) = max_μ [ R(s, μ) + Σ_{s'} F(s, μ, s') V(s') ]
  where the macro actions μ replace primitive actions, and the generalized transition function F replaces the discounted one-step transitions γ P(s' | s, a)

State Transitions → Macro Transitions
- F plays the role of a generalized transition function
- It is more general:
  - It need not be a probability; it is a coefficient expressing the value of one state in terms of others
  - It may be:
    - P itself (the special case)
    - An arbitrary SMDP (the discount varies with state, etc.)
    - The discounted probability of following a policy / running a program

What's So Special?
- The modified Bellman operator T' is also a contraction in max norm
- Free goodies!
  - Optimality (hierarchical optimality)
  - Convergence and stability

Historical Comment
- Forestier & Varaiya studied average reward (and recognized that macro actions implicitly define transition functions)
- Recognized informally/intuitively by many
- Initially formalized by Sutton (95)
- Collective forehead smack (97-98):
  - Mahadevan
  - Precup, Sutton & Singh: options
  - Hauskrecht et al.: macro actions
  - Parr & Russell: HAMs
  - Dietterich: MAXQ (though this is not the most important contribution of MAXQ)

Using Temporal Abstraction
- Accelerate convergence (usually)
- Avoid uninteresting states:
  - Improve exploration in RL
  - Avoid computing all values for MDPs
- Can finesse partial observability (a little)
- Simplify the state space with "funnel" states

Funneling
- Proposed by Forestier & Varaiya 78
- Define a "supervisor" MDP over boundary states
- The supervisor selects policies at the boundaries to push the system back into the nominal states and keep it there
- [Figure: a nominal region surrounded by boundary states; the control-theory version of a maze world!]

HAMs
- A teaspoon more general than some of the others
- Partially specified finite state machines
- Machine states can, conditionally on the observation:
  - Output an action
  - Switch to another state
  - Encode a set of choices (a choice point)
- HAM + MDP = SMDP (over the cross-product state space)
- NB: a traditional policy is a special case
- Extended by Andre

Why This Isn't Enough
- Many problems still have too many states!
- Funneling is tricky:
  - It doesn't happen in some problems
  - It is hard to guarantee
  - Controllers can get "stuck"
- Requires (extensive?) knowledge of the environment

Where Do Macros Come From?
- Not directly addressed!
- Sometimes written by hand, or specified implicitly through a program
- Sometimes the result of solving a sub-MDP:
  - Pick a subset of states
  - Make certain states special/absorbing
  - Assign "pseudo-rewards"
- On the relationship between pseudo-rewards and optimality: Parr (98); Hauskrecht et al. (98)

Burning Issues
- Is there a better way to define macro actions?
- Is there a better approach to large state spaces?

Overview
- Temporal abstraction
- Goal/state abstraction
- Challenges
- (These are not orthogonal)

Goal/State Abstraction
- Why are these together? Abstract goals typically imply abstract states
- Makes sense for classical planning:
  - Classical planning uses state sets, implicit in its use of state variables
  - What about factored MDPs?
- Does this make sense for RL?
  - No goals
  - Markov property issues

Feudal RL (Dayan & Hinton 95)
- Lords dictate subgoals to serfs (subgoals = reward functions?)
- Demonstrated on a navigation task
- Markov property problem; stability? optimality?
- A NIPS paper without equations!

MAXQ (Dietterich 98)
- Included temporal abstraction
- Handled subgoals/subtasks elegantly:
  - Subtasks with repeated structure can appear in multiple copies throughout the state space
  - Subtasks can be isolated without violating the Markov property
- Separated subtask reward from completion reward
- Introduced "safe" state abstraction
- Example taxi/logistics domain:
  - Subtasks move between locations
  - High-level tasks pick up and drop off assets

A-LISP (Andre & Russell 02)
- Combined and extended ideas from HAMs, MAXQ, and function approximation
- Allowed partially specified LISP programs
- Very powerful when the stars aligned:
  - Halting
  - "Safe" abstraction
  - Function approximation

Why Isn't Everybody Doing It?
- Totally "safe" state abstraction is rare and hard to guarantee without domain knowledge
- "Safe" function approximation is hard too
- Developing hierarchies is hard (like threading a needle in some cases):
  - Bad choices can make things worse
  - Mistakes are not always obvious at first

Overview
- Temporal abstraction
- Goal/state abstraction
- Challenges
- (These are not orthogonal)

Usability
- Make hierarchical RL more user friendly!

Measuring Progress
- Hierarchical RL is not a well-defined problem
- No benchmarks; most hammers have customized nails
- We need compelling "real" problems
- What can we learn from HTN planning?

Automatic Hierarchy Discovery
- Hard in other contexts too (classical planning)
- Within a single problem:
  - The battle is lost if all states are considered (a polynomial speedup at best)
  - If fewer states are considered, when do we stop?
- Across problems:
  - Is considering all states OK for a few problems?
  - Generalize to other problems in the class
  - How do we measure progress?

Promising Ideas
- Idea: bottlenecks are interesting... maybe
- Exploit:
  - Connectivity (Andre 98, McGovern 01)
  - Ease of changing state variables (Hengst 02)
- Issues:
  - Noise
  - Is this less work than learning a model?
  - What is the relationship between hierarchy and model?

Representation
- Model, hierarchy, and value function should all be integrated in some meaningful way
- "Safe" state abstraction is a kind of factorization
- We need approximately safe state abstraction
- Factored models with approximation?
  - Boutilier et al.
  - Guestrin, Koller & Parr (linear function approximation)
  - Relatively clean for the discrete case

A Possible Path
- Combine hierarchies with factored MDPs: Guestrin & Gordon (UAI 02)
- Subsystems are defined over subsets of variables (the subsets can even overlap)
- Approximate linear programming formulation
- A principled method of:
  - Combining subsystem solutions
  - Iteratively improving subsystem solutions
- Can be applied hierarchically

Conclusion
- Two types of abstraction: temporal and state/goal
- Both are powerful, but knowledge heavy
- We need a language to talk about the relationship between model, hierarchy, and function approximation
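The contraction argument from the Stable Temporal Abstraction slide can be checked numerically. Below is a minimal sketch in Python/NumPy on a hypothetical 2-state, 2-action MDP; all transition probabilities and rewards are made up for illustration:

```python
import numpy as np

gamma = 0.9
# P[a, s, s2]: probability of moving from s to s2 under action a (hypothetical)
P = np.array([[[0.8, 0.2],
               [0.1, 0.9]],
              [[0.5, 0.5],
               [0.3, 0.7]]])
# R[s, a]: immediate reward for taking action a in state s (hypothetical)
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])

def bellman(V):
    """(T V)(s) = max_a [ R(s,a) + gamma * sum_s2 P(s2|s,a) V(s2) ]."""
    Q = R + gamma * np.einsum("ast,t->sa", P, V)
    return Q.max(axis=1)

# Contraction in max norm: ||T U - T V||_inf <= gamma * ||U - V||_inf.
rng = np.random.default_rng(0)
U, V = rng.normal(size=2), rng.normal(size=2)
assert np.abs(bellman(U) - bellman(V)).max() <= gamma * np.abs(U - V).max() + 1e-12

# Iterating T therefore converges geometrically to the unique fixed point V*.
Vstar = np.zeros(2)
for _ in range(500):
    Vstar = bellman(Vstar)
print(np.abs(bellman(Vstar) - Vstar).max())  # residual is tiny
```

Because each application of T shrinks the max-norm distance between any two value functions by a factor of γ, value iteration converges from any starting point, which is exactly the property the slide leverages for its convergence results and bounds.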
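The generalized transition function F from the State Transitions → Macro Transitions slide can be computed in closed form for a simple macro that runs a fixed internal policy until it hits a termination state. A sketch with hypothetical transition numbers:

```python
import numpy as np

gamma = 0.9
# One-step transitions under the macro's fixed internal policy (hypothetical);
# state 2 is the macro's termination state.
P = np.array([[0.6, 0.3, 0.1],
              [0.2, 0.5, 0.3],
              [0.0, 0.0, 1.0]])
nonterm, term = [0, 1], [2]
Pnn = P[np.ix_(nonterm, nonterm)]   # transitions that keep the macro running
Pnt = P[np.ix_(nonterm, term)]      # transitions that end the macro
# F(s, s') = E[gamma^tau * 1{s_tau = s'}]: the *discounted* probability of
# terminating in s'.  Conditioning on the first step gives
#   F = gamma * Pnt + gamma * Pnn @ F,
# so F = (I - gamma * Pnn)^{-1} @ (gamma * Pnt).
F = np.linalg.solve(np.eye(len(nonterm)) - gamma * Pnn, gamma * Pnt)
print(F.ravel())
```

Note that each row of F sums to E[gamma^tau] <= gamma < 1, so F is a coefficient matrix rather than a probability distribution, and the modified operator T' built from F remains a max-norm contraction, which is the source of the "free goodies" on the What's So Special? slide.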
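The sub-MDP recipe from the Where Do Macros Come From? slide (pick a subset of states, make certain states absorbing, assign pseudo-rewards) can be sketched as follows; the state space, transition numbers, and the pseudo-reward value of 1.0 are all hypothetical choices, not a prescription from the talk:

```python
import numpy as np

gamma = 0.9
# Sub-MDP over 3 states with 2 actions; state 2 is the absorbing exit state.
# P[a, s, s2] (hypothetical numbers).
P = np.array([[[0.7, 0.2, 0.1],
               [0.1, 0.6, 0.3],
               [0.0, 0.0, 1.0]],
              [[0.3, 0.6, 0.1],
               [0.4, 0.1, 0.5],
               [0.0, 0.0, 1.0]]])
# Pseudo-reward of 1.0 earned on entering the exit state, zero elsewhere.
pseudo = np.array([0.0, 0.0, 1.0])
R = np.einsum("ast,t->sa", P, pseudo)  # expected pseudo-reward per step
R[2, :] = 0.0                          # exit is absorbing: no further reward

# Value iteration on the sub-MDP; V(s) becomes the best discounted
# probability of reaching the exit, and the greedy policy is the macro's
# internal policy.
V = np.zeros(3)
for _ in range(200):
    Q = R + gamma * np.einsum("ast,t->sa", P, V)
    V = Q.max(axis=1)
macro_policy = Q.argmax(axis=1)
print(macro_policy)
```

Solving the sub-MDP this way yields one macro per pseudo-reward assignment; as the slide notes, how the pseudo-rewards relate to optimality of the overall solution is the subject of Parr (98) and Hauskrecht et al. (98).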