Hierarchical Reinforcement Learning

Slide 1: Hierarchical Reinforcement Learning
Ronald Parr, Duke University
© 2005 Ronald Parr
From the ICML 2005 Rich Representations for Reinforcement Learning Workshop

Slide 2: Why?
- Knowledge transfer/injection
- Biases exploration
- Faster solutions (even if the model is known)

Slide 3: Why Not?
- Some cool ideas and algorithms, but no killer apps or wide acceptance yet.
- A good idea that needs more refinement:
  - More user friendliness
  - More rigor in problem specification and measures of progress
  - Improvement = Flat - (Hierarchical + Hierarchy): in what units?

Slide 4: Introductory Comments
- For any X, hierarchical X is nearly as old as X
- Very hard to evaluate methods
- Measuring improvement: Improvement = Flat - (Hierarchical + Hierarchy), but in what units?
- Algorithmic desiderata: convergence? optimality?

Slide 5: Overview
- Temporal abstraction
- Goal abstraction
- Challenges
- (Not orthogonal)

Slide 6: Temporal Abstraction
- What's the issue? We want "macro" actions (multiple time steps)
- Advantages:
  - Avoid dealing with (exploring/computing values for) less desirable states
  - Reuse experience across problems/regions
- What's not obvious (except in hindsight):
  - Dealing with the Markov assumption
  - Getting the math right (stability)

Slide 7: Looking Back
- Classical planning analogs: macro-actions, chunking, EBL
- RL roots (grappling with representation and the Markov assumption): Maes & Brooks 90; Singh 91; Mahadevan & Connell 93
- MDP/control roots are oldest (got the math right, ignored representation): Forestier & Varaiya 78

Slide 8: Stable Temporal Abstraction
- Recall the dynamic programming operator T
- T is a contraction in maximum norm
- This can be used to prove:
  - Convergence of value iteration (and bounds)
  - Convergence of policy iteration (and bounds)
  - Convergence of TD and Q-learning
- (Not necessarily the only way to prove these things!)

Slide 9: The Big(?) Insight
- Replace the Bellman operator with a generalized operator defined over macro transitions (a sketch in standard notation follows Slide 11)

Slide 10: State Transitions → Macro Transitions
- F plays the role of a generalized transition function
- More general: it need not be a probability; it is a coefficient expressing the value of one state in terms of others
- F may be:
  - P (the special case)
  - An arbitrary SMDP (discount varies with state, etc.)
  - The discounted probability of following a policy / running a program

Slide 11: What's So Special?
- The modified Bellman operator T' is also a contraction in max norm
- Free goodies: optimality (hierarchical optimality), convergence and stability
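The two operators contrasted in Slides 8-11 can be written out concretely. This is only a sketch in standard MDP notation; the symbols R_mu, F_mu, and the macro set M(s) are illustrative names rather than the slides' own equations:

```latex
% Standard Bellman (dynamic programming) operator: a max-norm contraction.
(TV)(s) = \max_{a}\Big[\, R(s,a) + \gamma \sum_{s'} P(s' \mid s,a)\, V(s') \Big]

% Generalized operator over macro actions.  R_\mu(s) is the expected discounted
% reward accumulated while macro \mu runs from s, and F_\mu(s,s') is the
% discounted probability that \mu terminates in s'.
(T'V)(s) = \max_{\mu \in \mathcal{M}(s)}\Big[\, R_\mu(s) + \sum_{s'} F_\mu(s,s')\, V(s') \Big]

% If every macro runs for at least one step, \sum_{s'} F_\mu(s,s') \le \gamma < 1,
% so T' is also a max-norm contraction and the usual convergence arguments apply.
```

The "free goodies" on Slide 11 follow because every convergence and stability argument that relies only on the contraction property of T carries over unchanged to T'.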
Slide 12: Historical Comment
- Forestier & Varaiya studied average reward
- (They recognized that macro actions implicitly define transition functions)
- Recognized informally/intuitively by many
- Initially formalized by Sutton (95)
- Collective forehead smack (97-98):
  - Mahadevan
  - Precup, Sutton & Singh: options
  - Hauskrecht et al.: macro actions
  - Parr & Russell: HAMs
  - Dietterich: MAXQ (though this is not the most important contribution of MAXQ)

Slide 13: Using Temporal Abstraction
- Accelerate convergence (usually)
- Avoid uninteresting states:
  - Improve exploration in RL
  - Avoid computing all values for MDPs
- Can finesse partial observability (a little)
- Simplify the state space with "funnel" states

Slide 14: Funneling
- Proposed by Forestier & Varaiya 78
- Define a "supervisor" MDP over boundary states
- The supervisor selects policies at the boundaries to push the system back into nominal states and keep it there
- [Figure: a nominal region surrounded by boundary states; the control theory version of a maze world!]

Slide 15: HAMs
- A teaspoon more general than some of the others
- Partially specified finite state machines
- States can, conditionally on the observation: output an action; switch to another state; encode a set of choices (a choice point)
- HAM + MDP = SMDP (cross-product state space)
- NB: a traditional policy is a special case
- Extended by Andre

Slide 16: Why This Isn't Enough
- Many problems still have too many states!
- Funneling is tricky: it doesn't happen in some problems, it is hard to guarantee, and controllers can get "stuck"
- Requires (extensive?) knowledge of the environment

Slide 17: Where Do Macros Come From?
- Not directly addressed!
- Sometimes by hand, or implicitly through a program
- Sometimes as the result of solving some sub-MDP:
  - Pick a subset of states
  - Make certain states special/absorbing
  - Assign "pseudo-rewards"
- On the relationship between pseudo-rewards and optimality: Parr (98), Hauskrecht et al. (98)

Slide 18: Burning Issues
- A better way to define macro actions?
- A better approach to large state spaces?

Slide 19: Overview
- Temporal abstraction
- Goal/state abstraction
- Challenges
- (Not orthogonal)

Slide 20: Goal/State Abstraction
- Why are these together? Abstract goals typically imply abstract states
- Makes sense for classical planning: classical planning uses state sets, implicit in the use of state variables
- What about factored MDPs?
- Does this make sense for RL? No goals; Markov property issues

Slide 21: Feudal RL (Dayan & Hinton 95)
- Lords dictate subgoals to serfs
- Subgoals = reward functions?
- Demonstrated on a navigation task
- Markov property problem
- Stability? Optimality?
- A NIPS paper without equations!

Slide 22: MAXQ (Dietterich 98)
- Included temporal abstraction
- Handled subgoals/tasks elegantly:
  - Subtasks with repeated structure can appear in multiple copies throughout the state space
  - Subtasks can be isolated without violating the Markov property
  - Separated subtask reward from completion reward
- Introduced "safe" abstraction
- Example taxi/logistics domain: subtasks move between locations; high-level tasks pick up and drop off assets
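To make the decomposition concrete, here is a minimal sketch of the MAXQ value decomposition on a taxi-like hierarchy. The task names and table layout are illustrative choices, not Dietterich's exact formulation:

```python
# Minimal sketch of the MAXQ value decomposition on a taxi-like hierarchy.
# Task names and the table layout are illustrative, not Dietterich's exact code.
from collections import defaultdict

# Task hierarchy: each composite task lists its children (subtasks or
# primitive actions).  Primitive actions do not appear as keys.
HIERARCHY = {
    "Root":     ["Get", "Put"],
    "Get":      ["Navigate", "Pickup"],
    "Put":      ["Navigate", "Dropoff"],
    "Navigate": ["north", "south", "east", "west"],
}

# Learned tables (e.g. filled in by MAXQ-Q learning):
#   V[(prim, s)]         expected immediate reward of a primitive action in s
#   C[(task, s, child)]  completion value: reward earned after `child` finishes
V = defaultdict(float)
C = defaultdict(float)

def is_primitive(task):
    return task not in HIERARCHY

def value(task, s):
    """Recursive MAXQ decomposition: V(i, s) = max_a [ V(a, s) + C(i, s, a) ]."""
    if is_primitive(task):
        return V[(task, s)]
    return max(value(child, s) + C[(task, s, child)]
               for child in HIERARCHY[task])
```

Separating V (reward earned inside a subtask) from C (reward earned after it completes) is what lets a repeated subtask such as Navigate share its values across the many contexts in which it occurs.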
Slide 23: A-LISP (Andre & Russell 02)
- Combined and extended ideas from HAMs, MAXQ, and function approximation
- Allowed partially specified LISP programs
- Very powerful when the stars aligned: halting, "safe" abstraction, function approximation

Slide 24: Why Isn't Everybody Doing It?
- Totally "safe" state abstraction is rare and hard to guarantee without domain knowledge
- "Safe" function approximation is hard too
- Developing hierarchies is hard (like threading a needle in some cases): bad choices can make things worse, and mistakes are not always obvious at first

Slide 25: Overview
- Temporal abstraction
- Goal/state abstraction
- Challenges
- (Not orthogonal)

Slide 26: Usability
- Make hierarchical RL more user friendly!

Slide 27: Measuring Progress
- Hierarchical RL is not a well-defined problem
- No benchmarks; most hammers have customized nails
- Need compelling "real" problems
- What can we learn from HTN planning?

Slide 28: Automatic Hierarchy Discovery
- Hard in other contexts (classical planning)
- Within a single problem:
  - The battle is lost if all states are considered (a polynomial speedup at best)
  - If fewer states are considered, when do we stop?
- Across problems:
  - Is considering all states OK for a few problems?
  - Generalize to other problems in the class
  - How do we measure progress?

Slide 29: Promising Ideas
- Idea: bottlenecks are interesting... maybe (a rough sketch follows the conclusion)
- Exploit connectivity (Andre 98, McGovern 01) or the ease of changing state variables (Hengst 02)
- Issues: noise; is it less work than learning a model?; what is the relationship between hierarchy and model?

Slide 30: Representation
- Model, hierarchy, and value function should all be integrated in some meaningful way
- "Safe" state abstraction is a kind of factorization
- We need approximately safe state abstraction
- Factored models with approximation? Boutilier et al.; Guestrin, Koller & Parr (linear function approximation)
- Relatively clean for the discrete case

Slide 31: A Possible Path
- Combine hierarchies with factored MDPs: Guestrin & Gordon (UAI 02)
- Subsystems defined over variable subsets (the subsets can even overlap)
- Approximate LP formulation
- A principled method of combining subsystem solutions and iteratively improving them
- Can be applied hierarchically

Slide 32: Conclusion
- Two types of abstraction: temporal and state/goal
- Both are powerful, but knowledge heavy
- We need a language to talk about the relationship between model, hierarchy, and function approximation
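As a concrete illustration of the bottleneck idea on Slide 29, here is a crude connectivity-based sketch: score states by how much more often they lie on successful trajectories than on unsuccessful ones, and propose the top scorers as subgoals for new options. The function and its interface are illustrative only; this is not McGovern's diverse-density method or Hengst's HEXQ.

```python
# Crude sketch of connectivity-based bottleneck detection for option discovery.
# Illustrates the idea only; not a published algorithm.
from collections import Counter

def candidate_subgoals(successful_trajectories, unsuccessful_trajectories,
                       top_k=5):
    """Score each state by how much more often it appears on successful
    trajectories than on unsuccessful ones; frequent 'must-pass' states
    (doorways, bottlenecks) score highly and become subgoal candidates."""
    pos = Counter(s for traj in successful_trajectories for s in set(traj))
    neg = Counter(s for traj in unsuccessful_trajectories for s in set(traj))
    n_pos = max(len(successful_trajectories), 1)
    n_neg = max(len(unsuccessful_trajectories), 1)
    scores = {s: pos[s] / n_pos - neg[s] / n_neg for s in pos}
    return sorted(scores, key=scores.get, reverse=True)[:top_k]
```

Each candidate subgoal would then seed an option whose policy is learned to reach it, tying hierarchy discovery back to the temporal-abstraction machinery from the first half of the talk.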