Ren Min: Ph.D. candidate at the Institute of Automation, Chinese Academy of Sciences; received his bachelor's degree from the National University of Defense Technology; research interest: biometric recognition.
Talk Title: Chapter 3: Finite Markov Decision Processes
Talk Abstract: In this chapter we introduce the formal problem of finite Markov decision processes, or finite MDPs, which we try to solve in the rest of the book. This problem involves evaluative feedback, as in bandits, but also an associative aspect—choosing different actions in different situations. MDPs are a classical formalization of sequential decision making, where actions influence not just immediate rewards, but also subsequent situations, or states, and through those future rewards. Thus, MDPs involve delayed reward and the need to trade off immediate and delayed reward. Whereas in bandit problems we estimated the value q*(a) of each action a, in MDPs we estimate the value q*(s, a) of each action a in each state s, or we estimate the value v*(s) of each state given optimal action selections. These state-dependent quantities are essential to accurately assigning credit for long-term consequences to individual action selections.
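For reference, the optimal value functions q*(s, a) and v*(s) mentioned in the abstract satisfy the Bellman optimality equations of Chapter 3 (standard Sutton-and-Barto notation; γ is the discount factor and p(s', r | s, a) the MDP dynamics):

v_*(s) = \max_{a} \sum_{s', r} p(s', r \mid s, a) \, \bigl[ r + \gamma \, v_*(s') \bigr]

q_*(s, a) = \sum_{s', r} p(s', r \mid s, a) \, \bigl[ r + \gamma \max_{a'} q_*(s', a') \bigr]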
Spotlight:
Move from concrete examples to abstraction, giving a formal description of the reinforcement learning problem;
Build a mathematical model based on Markov processes (a minimal sketch follows below).
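As a companion to the second point, here is a minimal sketch of how a finite MDP might be represented and its optimal state values v*(s) computed by iterating the Bellman optimality update (value iteration, which the book covers later). The state/action names, dynamics table, and discount factor below are all hypothetical and not taken from the talk.

# Minimal finite-MDP sketch (hypothetical example, not from the talk).
# dynamics[s][a] is a list of (probability, next_state, reward) triples,
# i.e. a tabular form of p(s', r | s, a).

GAMMA = 0.9  # assumed discount factor

dynamics = {
    "s0": {"stay": [(1.0, "s0", 0.0)],
           "go":   [(0.8, "s1", 1.0), (0.2, "s0", 0.0)]},
    "s1": {"stay": [(1.0, "s1", 0.0)],
           "go":   [(1.0, "s0", 2.0)]},
}

def value_iteration(dynamics, gamma=GAMMA, tol=1e-8):
    """Compute v*(s) for every state by repeatedly applying the Bellman optimality update."""
    v = {s: 0.0 for s in dynamics}
    while True:
        delta = 0.0
        for s, actions in dynamics.items():
            # Best expected return over actions: max_a sum p(s',r|s,a)[r + gamma v(s')]
            best = max(
                sum(p * (r + gamma * v[s2]) for p, s2, r in outcomes)
                for outcomes in actions.values()
            )
            delta = max(delta, abs(best - v[s]))
            v[s] = best
        if delta < tol:
            return v

if __name__ == "__main__":
    print(value_iteration(dynamics))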