
Markov Decision Processes Toolbox

The Markov Decision Processes (MDP) Toolbox provides packages for MATLAB, GNU Octave, Scilab, and R. This page covers the R package, MDPtoolbox.

Installation

MDPtoolbox can be installed with the following code. When a dialog or message asks you to choose a CRAN mirror, pick any mirror near you.

install.packages(c("MDPtoolbox"), dependencies = TRUE)

You may need to set a proxy before running the code above, for example:

Sys.setenv("http_proxy"="http://130.153.8.66:8080/")

Preparation

You need to run the following code every time you open R to use this package.

library(MDPtoolbox)

This makes R load the MDPtoolbox package.
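
If you prefer, installation and loading can be combined with the standard require() idiom; a minimal sketch in plain base R, nothing specific to MDPtoolbox:

if (!require(MDPtoolbox)) {
  # install on first use, then attach the package
  install.packages("MDPtoolbox", dependencies = TRUE)
  library(MDPtoolbox)
}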

Example

A textbook case from Hillier and Lieberman (2005).

Case

4 states.

State  Condition
0      Good as new
1      Operable - minor deterioration
2      Operable - major deterioration
3      Inoperable - output of unacceptable quality

3 actions.

Decision  Action                               Relevant states
1         Do nothing                           0, 1, 2
2         Overhaul (return system to state 1)  2
3         Replace (return system to state 0)   1, 2, 3

3 transition matrices, each of size 4 by 4 (one per action).

Theoretical transition probability matrix for the “Do nothing” action:
P^{(1)}=\left(\begin{array}{cccc}
    0 & 7/8 & 1/16 & 1/16 \\
    0 & 3/4 & 1/8  & 1/8  \\
    0 & 0   & 1/2  & 1/2  \\
    0 & 0   & 0    & 1
  \end{array}\right)

Theoretical transition probability matrix for the “Overhaul” action. Overhaul is only relevant in state 2, so the other rows are left as zeros here:
P^{(2)}=\left(\begin{array}{cccc}
    0 & 0   & 0    & 0    \\
    0 & 0   & 0    & 0   \\
    0 & 1   & 0    & 0    \\
    0 & 0   & 0    & 0
  \end{array}\right)

Theoretical transition probability matrix for the “Replace” action:
P^{(3)}=\left(\begin{array}{cccc}
    1 & 0   & 0    & 0    \\
    1 & 0   & 0    & 0   \\
    1 & 0   & 0    & 0    \\
    1 & 0   & 0    & 0
  \end{array}\right)

Cost table (cost = reward * -1). A dash marks an action that is not relevant in that state.

State/Action      1      2      3
0                 0      -      -
1              1000      -   6000
2              3000   4000   6000
3                 -      -   6000

Reward table

State/Action      1      2      3
0                 0      -      -
1             -1000      -  -6000
2             -3000  -4000  -6000
3                 -      -  -6000

Extension for practice

Extended transition probability matrix. MDPtoolbox requires every row of each action's transition matrix to sum to 1, so the undefined rows of P^{(2)} are filled in: states 0 and 1 keep the “Do nothing” dynamics, and state 3 stays inoperable:


P^{(2)}=\left(\begin{array}{cccc}
    0 & 7/8 & 1/16 & 1/16 \\
    0 & 3/4 & 1/8  & 1/8  \\
    0 & 1   & 0    & 0    \\
    0 & 0   & 0    & 1
  \end{array}\right)

Extended reward table, with a cost assigned to every state and action:

State/Action      1      2      3
0                 0  -4000  -6000
1             -1000  -4000  -6000
2             -3000  -4000  -6000
3             -8000  -8000  -6000

Setup in R

Configuration

n.state <- 4   # number of states
n.action <- 3  # number of actions
P <- array(0, c(n.state, n.state, n.action))  # transition probabilities P[s, s', a]
R <- array(0, c(n.state, n.action))           # rewards R[s, a]

Definition of the set of transition probability matrices

P[,,1] <- matrix(c(   # action 1: do nothing
    0, 7/8, 1/16, 1/16,
    0, 3/4, 1/8,  1/8,
    0, 0,   1/2,  1/2,
    0, 0,   0,    1),
  nrow=n.state, ncol=n.state, byrow=TRUE)
P[,,2] <- matrix(c(   # action 2: overhaul (extended for practice)
    0, 7/8, 1/16, 1/16,
    0, 3/4, 1/8,  1/8,
    0, 1,   0,    0,
    0, 0,   0,    1),
  nrow=n.state, ncol=n.state, byrow=TRUE)
P[,,3] <- matrix(c(   # action 3: replace
    1, 0, 0, 0,
    1, 0, 0, 0,
    1, 0, 0, 0,
    1, 0, 0, 0),
  nrow=n.state, ncol=n.state, byrow=TRUE)
dimnames(P)[[1]] <- c("new","minor det","major det", "inoperable")
dimnames(P)[[2]] <- c("new","minor det","major det", "inoperable")
dimnames(P)[[3]] <- c("do nothing", "overhaul", "replace")
P
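
Each row of every slice of P must sum to 1. A quick sanity check in base R:

apply(P, 3, rowSums)  # every entry should be exactly 1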

Definition of a reward matrix

R <- array(0, c(n.state, n.action))
R[, 1] <- -c(0, 1000, 3000, 8000)     # action 1: do nothing
R[, 2] <- -c(4000, 4000, 4000, 8000)  # action 2: overhaul
R[, 3] <- -c(6000, 6000, 6000, 6000)  # action 3: replace
rownames(R) <- c("new","minor det","major det", "inoperable")
colnames(R) <- c("do nothing", "overhaul", "replace")
R
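
Before solving, you can ask MDPtoolbox to validate the problem definition; mdp_check should return an empty string when P and R are well formed:

mdp_check(P, R)  # "" means the (P, R) pair defines a valid MDP
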
Policy Optimization

MDP solvers for infinite-horizon problems:

mdp_LP(P, R, 0.9)
mdp_policy_iteration(P, R, 0.9)
mdp_policy_iteration_modified(P, R, 0.9)
mdp_Q_learning(P, R, 0.9)
mdp_value_iteration(P, R, 0.9)
mdp_value_iterationGS(P, R, 0.9)

The third argument, 0.9, is the discount factor.
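
Each solver returns a list containing the optimal policy and its value function. A minimal sketch with policy iteration; the fields policy, V, and iter follow the MDPtoolbox documentation:

sol <- mdp_policy_iteration(P, R, 0.9)
colnames(R)[sol$policy]  # action chosen in each state, by name
sol$V                    # expected total discounted reward from each state
sol$iter                 # number of policy-iteration steps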

Policy Evaluation

Evaluation of a given policy for infinite-horizon problems. The policy c(1, 1, 2, 3) below means: do nothing in states 0 and 1, overhaul in state 2, and replace in state 3.

mdp_eval_policy_iterative(P, R, 0.99,
  policy=c(1,1,2,3), V0=c(0,0,0,0), epsilon=0.0001, max_iter=4000)
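
The same policy can also be evaluated in closed form, by direct matrix inversion, with mdp_eval_policy_matrix, which takes the same P, R, discount, and policy arguments:

mdp_eval_policy_matrix(P, R, 0.99, c(1,1,2,3))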