In this post we introduce the basic ideas of Markov Decision Processes (MDPs). MDPs are a mathematical formalization of sequential decision making. An agent takes an action in an uncertain environment, and two things happen as a result. One, the agent receives a reward, and two, the state of the environment changes in response to the action. The goal is to take actions so as to maximize some measure of the cumulative reward. Formally, we characterize an MDP as a sequence $S_0, A_0, R_1, S_1, A_1, R_2, S_2, \dots$, where $S_t$, $R_t$, and $A_t$ are the state, reward, and action at time $t$, respectively.
State Dynamics, Rewards, and Policies
The dynamics of an MDP are characterized by a four-argument “dynamics” probability distribution $p(s', r \mid s, a)$, which gives the probability of receiving a reward $r$ and transitioning to a new state $s'$, given that the agent takes action $a$ in state $s$. The agent's behavior is characterized by a “policy” probability distribution $\pi(a \mid s)$, which gives the probability of taking action $a$ in state $s$. It is in this function that the “intelligence” of the agent resides.
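To make these two distributions concrete, here is a minimal Python sketch of one way to represent them for a toy two-state MDP. The states, actions, rewards, and probabilities below are invented purely for illustration; they are not part of any standard example.

```python
import random

# A toy MDP with states "A", "B" and actions "stay", "go" (all invented
# for illustration). The dynamics p(s', r | s, a) are encoded as a map
# from (state, action) to a list of (next_state, reward, probability).
dynamics = {
    ("A", "stay"): [("A", 0.0, 1.0)],
    ("A", "go"):   [("B", 1.0, 0.9), ("A", 0.0, 0.1)],
    ("B", "stay"): [("B", 0.0, 1.0)],
    ("B", "go"):   [("A", 2.0, 1.0)],
}

# The policy pi(a | s): probability of each action in each state.
policy = {
    "A": {"stay": 0.2, "go": 0.8},
    "B": {"stay": 0.5, "go": 0.5},
}

def step(state, action):
    """Sample (next_state, reward) from the dynamics p(s', r | s, a)."""
    outcomes = dynamics[(state, action)]
    u = random.random()
    cumulative = 0.0
    for next_state, reward, prob in outcomes:
        cumulative += prob
        if u <= cumulative:
            return next_state, reward
    return outcomes[-1][:2]  # guard against floating-point round-off

def sample_action(state):
    """Sample an action from the policy pi(a | s)."""
    actions, probs = zip(*policy[state].items())
    return random.choices(actions, weights=probs)[0]
```

Together, `sample_action` and `step` are enough to roll out the sequence $S_0, A_0, R_1, S_1, \dots$ described above.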
Returns and Value Functions
The dynamics and the policy induce a state value function on the space of all states, an action value function on the space of all (state, action) tuples, and a concept of returns. The return at time $t$ is defined as

$$G_t = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1},$$

where $\gamma \in [0, 1)$ is a discount factor introduced to make sure that the sum converges for MDPs that do not end. Thus, the return is a discounted sum of all future rewards. The goal of an optimal policy is to maximize the expected return. Thus, a policy $\pi$ induces the following value function

$$v_\pi(s) = \mathbb{E}_\pi\left[G_t \mid S_t = s\right],$$

and the following action value function

$$q_\pi(s, a) = \mathbb{E}_\pi\left[G_t \mid S_t = s, A_t = a\right].$$
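For a finite sequence of rewards, the discounted return can be computed with a short backward recursion ($G_t = R_{t+1} + \gamma G_{t+1}$). A minimal sketch, with an arbitrary reward sequence and discount factor chosen for illustration:

```python
def discounted_return(rewards, gamma=0.9):
    """Compute G = sum_k gamma^k * R_{k+1} for a finite reward list,
    by accumulating backwards: G_t = R_{t+1} + gamma * G_{t+1}."""
    g = 0.0
    for reward in reversed(rewards):
        g = reward + gamma * g
    return g

discounted_return([1.0, 0.0, 2.0], gamma=0.5)  # 1 + 0.5*0 + 0.25*2 = 1.5
```

Averaging such returns over many episodes starting from a given state $s$ gives a Monte Carlo estimate of $v_\pi(s)$.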
A policy $\pi_*$ is said to be optimal if its value function is at least as good as that of any other policy in every state, i.e.

$$v_{\pi_*}(s) \geq v_\pi(s) \quad \text{for all states } s \text{ and all policies } \pi.$$
The goal of “solving” an MDP is to find the optimal policy and the corresponding value function (though finding the value function is not mandatory).
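When the dynamics are known and tabular, one classical way to solve an MDP is value iteration, a dynamic-programming method that repeatedly applies the Bellman optimality backup $v(s) \leftarrow \max_a \sum_{s', r} p(s', r \mid s, a)\,[r + \gamma v(s')]$. A self-contained sketch, reusing the same invented toy dynamics encoding from before:

```python
# Toy dynamics p(s', r | s, a), encoded as (next_state, reward, probability)
# triples; the states, actions, and numbers are invented for illustration.
dynamics = {
    ("A", "stay"): [("A", 0.0, 1.0)],
    ("A", "go"):   [("B", 1.0, 0.9), ("A", 0.0, 0.1)],
    ("B", "stay"): [("B", 0.0, 1.0)],
    ("B", "go"):   [("A", 2.0, 1.0)],
}
states = ["A", "B"]
actions = ["stay", "go"]

def backup(v, s, a, gamma):
    """One-step lookahead: sum_{s',r} p(s',r|s,a) * (r + gamma * v(s'))."""
    return sum(p * (r + gamma * v[s2]) for s2, r, p in dynamics[(s, a)])

def value_iteration(gamma=0.9, tol=1e-8):
    """Iterate v(s) <- max_a backup(v, s, a) until the values stop moving,
    then read off a greedy (optimal) policy from the converged values."""
    v = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            best = max(backup(v, s, a, gamma) for a in actions)
            delta = max(delta, abs(best - v[s]))
            v[s] = best
        if delta < tol:
            break
    pi = {s: max(actions, key=lambda a: backup(v, s, a, gamma))
          for s in states}
    return v, pi
```

On this toy MDP, only action “go” ever yields reward, so value iteration converges to a policy that chooses “go” in both states.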