
Markov Decision Processes Toolbox

The Markov Decision Processes (MDP) Toolbox provides packages for MATLAB, GNU Octave, Scilab, and R. This page covers the R package, MDPtoolbox.

Installation

MDPtoolbox can be installed with the following code. When a dialog or message asks you to choose a CRAN mirror, pick any mirror near you.

install.packages(c("MDPtoolbox"), dependencies = TRUE)

You may need to set a proxy before running the code above, for example:

Sys.setenv("http_proxy"="http://130.153.8.66:8080/")

Preparation

You need to run the following code every time you open R to use this package.

library(MDPtoolbox)

This makes R load the MDPtoolbox package.
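
If you prefer, installation and loading can be combined with the standard require() idiom; a minimal sketch in plain base R, nothing specific to MDPtoolbox:

if (!require(MDPtoolbox)) {
  # install on first use, then attach the package
  install.packages("MDPtoolbox", dependencies = TRUE)
  library(MDPtoolbox)
}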

Example

A textbook case from Hillier and Lieberman (2005).

Case

4 states.

State  Condition
0      Good as new
1      Operable - minor deterioration
2      Operable - major deterioration
3      Inoperable - output of unacceptable quality

3 actions.

Decision  Action                               Relevant states
1         Do nothing                           0, 1, 2
2         Overhaul (return system to state 1)  2
3         Replace (return system to state 0)   1, 2, 3

3 transition matrices, each of size 4 by 4 (one per action).

Theoretical transition probability matrix for the “Do nothing” action:
P^{(1)}=\left(\begin{array}{cccc}
    0 & 7/8 & 1/16 & 1/16 \\
    0 & 3/4 & 1/8  & 1/8  \\
    0 & 0   & 1/2  & 1/2  \\
    0 & 0   & 0    & 1
  \end{array}\right)

Theoretical transition probability matrix for the “Overhaul” action. Overhaul is only relevant in state 2, so the other rows are left as zeros here:
P^{(2)}=\left(\begin{array}{cccc}
    0 & 0   & 0    & 0    \\
    0 & 0   & 0    & 0   \\
    0 & 1   & 0    & 0    \\
    0 & 0   & 0    & 0
  \end{array}\right)

Theoretical transition probability matrix for the “Replace” action:
P^{(3)}=\left(\begin{array}{cccc}
    1 & 0   & 0    & 0    \\
    1 & 0   & 0    & 0   \\
    1 & 0   & 0    & 0    \\
    1 & 0   & 0    & 0
  \end{array}\right)

Cost table (cost = reward * -1). A dash marks an action that is not relevant in that state.

State/Action      1      2      3
0                 0      -      -
1              1000      -   6000
2              3000   4000   6000
3                 -      -   6000

Reward table

State/Action      1      2      3
0                 0      -      -
1             -1000      -  -6000
2             -3000  -4000  -6000
3                 -      -  -6000

Extension for practice

Extended transition probability matrix. MDPtoolbox requires every row of each action's transition matrix to sum to 1, so the undefined rows of P^{(2)} are filled in: states 0 and 1 keep the “Do nothing” dynamics, and state 3 stays inoperable:


P^{(2)}=\left(\begin{array}{cccc}
    0 & 7/8 & 1/16 & 1/16 \\
    0 & 3/4 & 1/8  & 1/8  \\
    0 & 1   & 0    & 0    \\
    0 & 0   & 0    & 1
  \end{array}\right)

Extended reward table, with a cost assigned to every state and action:

State/Action      1      2      3
0                 0  -4000  -6000
1             -1000  -4000  -6000
2             -3000  -4000  -6000
3             -8000  -8000  -6000

Setup in R

Configuration

n.state <- 4   # number of states
n.action <- 3  # number of actions
P <- array(0, c(n.state, n.state, n.action))  # transition probabilities P[s, s', a]
R <- array(0, c(n.state, n.action))           # rewards R[s, a]

Definition of the set of transition probability matrices

P[,,1] <- matrix(c(   # action 1: do nothing
    0, 7/8, 1/16, 1/16,
    0, 3/4, 1/8,  1/8,
    0, 0,   1/2,  1/2,
    0, 0,   0,    1),
  nrow=n.state, ncol=n.state, byrow=TRUE)
P[,,2] <- matrix(c(   # action 2: overhaul (extended for practice)
    0, 7/8, 1/16, 1/16,
    0, 3/4, 1/8,  1/8,
    0, 1,   0,    0,
    0, 0,   0,    1),
  nrow=n.state, ncol=n.state, byrow=TRUE)
P[,,3] <- matrix(c(   # action 3: replace
    1, 0, 0, 0,
    1, 0, 0, 0,
    1, 0, 0, 0,
    1, 0, 0, 0),
  nrow=n.state, ncol=n.state, byrow=TRUE)
dimnames(P)[[1]] <- c("new","minor det","major det", "inoperable")
dimnames(P)[[2]] <- c("new","minor det","major det", "inoperable")
dimnames(P)[[3]] <- c("do nothing", "overhaul", "replace")
P
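
Each row of every slice of P must sum to 1. A quick sanity check in base R:

apply(P, 3, rowSums)  # every entry should be exactly 1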

Definition of a reward matrix

R <- array(0, c(n.state, n.action))
R[, 1] <- -c(0, 1000, 3000, 8000)     # action 1: do nothing
R[, 2] <- -c(4000, 4000, 4000, 8000)  # action 2: overhaul
R[, 3] <- -c(6000, 6000, 6000, 6000)  # action 3: replace
rownames(R) <- c("new","minor det","major det", "inoperable")
colnames(R) <- c("do nothing", "overhaul", "replace")
R
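
Before solving, you can ask MDPtoolbox to validate the problem definition; mdp_check should return an empty string when P and R are well formed:

mdp_check(P, R)  # "" means the (P, R) pair defines a valid MDP
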
Policy Optimization

MDP solvers for infinite-horizon problems:

mdp_LP(P, R, 0.9)
mdp_policy_iteration(P, R, 0.9)
mdp_policy_iteration_modified(P, R, 0.9)
mdp_Q_learning(P, R, 0.9)
mdp_value_iteration(P, R, 0.9)
mdp_value_iterationGS(P, R, 0.9)

The third argument, 0.9, is the discount factor.
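
Each solver returns a list containing the optimal policy and its value function. A minimal sketch with policy iteration; the fields policy, V, and iter follow the MDPtoolbox documentation:

sol <- mdp_policy_iteration(P, R, 0.9)
colnames(R)[sol$policy]  # action chosen in each state, by name
sol$V                    # expected total discounted reward from each state
sol$iter                 # number of policy-iteration steps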

Policy Evaluation

Evaluation of a given policy for infinite-horizon problems. The policy c(1, 1, 2, 3) below means: do nothing in states 0 and 1, overhaul in state 2, and replace in state 3.

mdp_eval_policy_iterative(P, R, 0.99,
  policy=c(1,1,2,3), V0=c(0,0,0,0), epsilon=0.0001, max_iter=4000)
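
The same policy can also be evaluated in closed form, by direct matrix inversion, with mdp_eval_policy_matrix, which takes the same P, R, discount, and policy arguments:

mdp_eval_policy_matrix(P, R, 0.99, c(1,1,2,3))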