# SMHS DecisionTheory

## Scientific Methods for Health Sciences - Decision Theory

### Overview

Decision theory is concerned with determining the optimal course of action when several alternatives are available and their consequences cannot be forecast with certainty. In other words, decision theory is a method for making decisions in the presence of statistical knowledge when some uncertainty is involved. In this section, we present an introduction to decision theory and illustrate its application with specific examples. Sample R code is also provided to show how these ideas can be applied in practice.

### Motivation

Suppose a drug company is deciding whether to market a new drug. Two of the main factors to consider are the proportion of people for whom the drug will prove effective ($\theta_{1}$) and the proportion of the market the drug will capture ($\theta_{2}$). Both factors are generally unknown, even when experiments are conducted to obtain statistical information about them. This is a typical application of decision theory: the ultimate purpose is to decide whether to market the drug, how much of it to market, and so on. So, what is decision theory and how does it work?

### Theory

- Decision theory is concerned with the problem of making decisions in the presence of statistical knowledge that sheds light on some of the uncertainty involved in the decision problem. In most cases, we assume that these uncertainties can be treated as unknown numerical quantities, represented by $\theta$, which may be a scalar, a vector or a matrix.
- Classical statistics is directed towards the use of sample information in making inferences about $\theta$, without regard to the use to which those inferences are to be put. Decision theory, in contrast, tries to combine the sample information with other relevant aspects of the problem in order to make the optimal decision. Two kinds of non-sample information are typically relevant. The first is knowledge of the possible consequences of the decision, quantified by determining the loss that would be incurred for each possible decision and each possible value of $\theta$. The second is prior information about $\theta$, that is, information arising from sources other than the statistical investigation. Generally speaking, prior information comes from past experience with similar situations involving similar $\theta$. We denote an action by $a$ and the set of all possible actions under consideration by $\ell$.
- The uncertain quantity $\theta$ that affects the decision process is commonly referred to as the state of nature. It is clearly important to consider what the possible states of nature are when making decisions. We use the symbol $\Theta$ to denote the set of all possible states of nature (the parameter space) and $\theta$ for a particular state (the parameter). The loss function is an important element of decision theory. If a particular action $a_{1}$ is taken and $\theta_{1}$ turns out to be the true state of nature, then a loss $L(\theta_{1},a_{1})$ is incurred. The loss function $L(\theta,a)$ is therefore assumed to be defined for all $(\theta,a) \in \Theta\times\ell$. For technical convenience, only loss functions satisfying $L(\theta,a)\ge -K>-\infty$ will be considered.
- In a statistical investigation, the outcome is denoted $X$, which is often a vector $X=(X_{1},X_{2},\ldots,X_{n})$, where the $X_{i}$ are independent observations from a common distribution. A particular realization of $X$ is denoted $x$, and the set of possible outcomes is the sample space, denoted $\mathcal{L}$, usually a subset of $R^{n}$, the $n$-dimensional Euclidean space. The distribution of $X$ depends on the unknown state of nature $\theta$. Let $P_{\theta}(A)$ or $P_{\theta}(X\in A)$ denote the probability of the event $A\subset \mathcal{L}$ when $\theta$ is the true state of nature. For simplicity, $X$ is assumed to be either a continuous or a discrete random variable with density $f(x|\theta)$. If $X$ is continuous, then $P_{\theta}(A)=\int_{A} f(x|\theta)dx$; if $X$ is discrete, then $P_{\theta}(A)=\sum_{x\in A} f(x|\theta)$.
- Example: In the drug example above, assume it is desired to estimate $\theta_{2}$, which is a proportion, so $\Theta=\{\theta_{2}:0\le\theta_{2}\le 1\}=[0,1]$. The action is the estimate of $\theta_{2}$, hence $\ell=[0,1]$: in estimation problems the action and parameter spaces coincide ($\ell \equiv \Theta$). The company might determine the loss function (in units of utility) to be

$$ L(\theta_{2},a) = \begin{cases} \theta_{2}-a, & \text{if }\theta_{2}-a\ge 0 \\ 2(a-\theta_{2}), & \text{if } \theta_{2}-a\le 0 \end{cases}. $$

- Note that an overestimate of demand is considered twice as costly as an underestimate of demand, and that otherwise the loss is linear in the error. We could perform a sample survey to obtain reasonable information about $\theta_{2}$. For example, assume $n$ people are interviewed and the number $X$ who would buy the drug is observed. It might be reasonable to assume that $X$ is $B(n,\theta_{2})$, which has density function $f(x|\theta_{2})={n\choose x} \theta_{2}^{x} (1-\theta_{2})^{n-x}$. There could well be considerable prior information about $\theta_{2}$ arising from previous introductions of similar new drugs into the market. Assume new drugs tend to capture between $\frac{1}{10}$ and $\frac{1}{5}$ of the market, with all values between $\frac{1}{10}$ and $\frac{1}{5}$ being equally likely. This prior information could be modeled by giving $\theta_{2}$ a $U(0.1,0.2)$ prior density, i.e. letting $\pi(\theta_{2})=10I_{(0.1,0.2)}(\theta_{2})$. The above choices of $L$, $f$, and $\pi$ are quite crude, and usually much more detailed constructions are required to obtain satisfactory results. The techniques for doing this will be developed as we proceed.

- The main point to remember is that a well-defined loss function and explicit prior information are needed in decision theory. Many statisticians use statistical inference as a shield to ward off consideration of losses and prior information. This is a mistake for the following reasons: (1) reports from statistical inference should be constructed so that they can be easily utilized in individual decision-making; (2) the investigator may very well possess information about losses and about the prior; (3) the choice of an inference (beyond mere data summarization) can be viewed as a decision problem in which the action space is the set of all possible inference statements and the loss function reflects the success in conveying knowledge.
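The three ingredients of the drug example (loss, sampling density and prior) can be written down directly in R. This is only a minimal sketch: the function names `loss`, `sampling_density` and `prior`, and the survey values `n = 20` and `x = 3`, are illustrative assumptions, not part of the original example.

```r
# Loss from the drug example: an overestimate (a > theta) costs twice as much
# as an underestimate of the same size.
loss <- function(theta, a) ifelse(theta >= a, theta - a, 2 * (a - theta))

# Binomial sampling density f(x | theta2) when n people are interviewed.
sampling_density <- function(x, n, theta) dbinom(x, size = n, prob = theta)

# U(0.1, 0.2) prior density: equal to 10 on (0.1, 0.2) and 0 elsewhere.
prior <- function(theta) dunif(theta, min = 0.1, max = 0.2)

loss(0.15, 0.10)                           # underestimate by 0.05
loss(0.15, 0.20)                           # overestimate by 0.05, penalized twice as much
prior(c(0.05, 0.15))                       # outside and inside the prior's support
sampling_density(3, n = 20, theta = 0.15)  # hypothetical survey: 3 buyers out of 20
```

Keeping the three pieces as separate functions mirrors the way the theory treats them: the loss and the prior are inputs chosen by the decision maker, while the sampling density comes from the survey design.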

### Expected Loss, Decision Rules and Risk

- Bayesian Expected Loss: if $\pi^{*}(\theta)$ is the believed probability distribution of $\theta$ at the time of decision-making, the Bayesian expected loss of an action $a$ is $\rho(\pi^{*},a)=E^{\pi^*} L(\theta,a)=\int_{\Theta}{L(\theta,a)dF^{\pi^*}(\theta)}.$

- For the drug example given above, assume no data are obtained, so that the believed distribution of $\theta_{2}$ is simply the prior $\pi(\theta_{2})=10 I_{(0.1,0.2)}(\theta_{2})$. Then, with $\pi^{*}=\pi$,

$$\begin{aligned} \rho(\pi^{*},a) &= E^{\pi^*}L(\theta_{2},a)=\int_{\Theta}L(\theta_{2},a)dF^{\pi^*}(\theta_{2})=\int_{0}^{1}L(\theta_{2},a)\pi(\theta_{2})d\theta_{2} \\ &= \int_{0}^{a}2(a-\theta_{2})10I_{(0.1,0.2)}(\theta_{2})d\theta_{2}+\int_{a}^{1}(\theta_{2}-a)10I_{(0.1,0.2)}(\theta_{2})d\theta_{2}. \end{aligned}$$

- Thus,

$$\rho(\pi,a)= \begin{cases} 0.15-a, & \text{if } a\le 0.1 \\ 15a^{2}-4a+0.3, & \text{if } 0.1\le a\le 0.2 \\ 2a-0.3, & \text{if } a\ge 0.2 \end{cases} .$$

- We write $\pi^{*}$ instead of $\pi$ in the definition because $\pi$ usually refers to the initial prior distribution of $\theta$, while $\pi^{*}$ typically represents the final (posterior) distribution of $\theta$ after seeing the data.
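The Bayesian expected loss for the no-data drug example can be checked numerically. The sketch below integrates the loss against the $U(0.1,0.2)$ prior with `integrate()` and compares the result with the piecewise form $0.15-a$, $15a^{2}-4a+0.3$, $2a-0.3$; the function names and the grid of actions are assumptions made for illustration.

```r
loss  <- function(theta, a) ifelse(theta >= a, theta - a, 2 * (a - theta))
prior <- function(theta) dunif(theta, min = 0.1, max = 0.2)

# Bayesian expected loss: integrate loss * prior over the support of the prior.
expected_loss <- function(a) {
  integrate(function(th) loss(th, a) * prior(th),
            lower = 0.1, upper = 0.2, rel.tol = 1e-8)$value
}

# Piecewise closed form obtained by carrying out the integration by hand.
rho <- function(a) {
  ifelse(a <= 0.1, 0.15 - a,
         ifelse(a <= 0.2, 15 * a^2 - 4 * a + 0.3, 2 * a - 0.3))
}

a_grid <- c(0.05, 0.12, 0.15, 0.25)   # illustrative actions, one or more per branch
numeric_rho <- sapply(a_grid, expected_loss)
cbind(a = a_grid, numeric = numeric_rho, closed_form = rho(a_grid))
```

The two columns should agree up to the integration tolerance, which is a quick way to catch sign or branch errors in a hand-derived expected loss.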

- Frequentist Risk: the frequentist (classical, non-Bayesian) school of decision theory adopts a quite different expected loss, based on an average over the random $X$.
- A decision rule $δ(x)$ is a function from $\mathcal{L}$ into $\ell$. If $X=x$ is the observed value of the sample information, then $δ(x)$ is the action that will be taken. (For a no-data problem, a decision rule is simply an action.) Two decision rules, $δ_1$ and $δ_2$, are considered equivalent if $P_θ(δ_1 (X)=δ_2 (X))=1$ for all $θ$. For the drug example above, $δ(x)=\frac{x}{n}$ is the standard decision rule for estimating $θ_2$. This estimator does not make use of the loss function or the prior information given in the example. It will be seen later how to develop estimators that do so.
- The risk function of a decision rule $δ(x)$ is defined by $R(θ,δ)=E_θ^X [L(θ,δ(X))]=\int_{\mathcal{L}}{L(θ,δ(x))dF^X (x|θ)}$. For a no-data problem, $R(θ,δ)=L(θ,δ)$. To a frequentist, it is desirable to use a decision rule $δ$ that has small $R(θ,δ)$. However, whereas the Bayesian expected loss of an action is a single number, the risk is a function on $Θ$, and since $θ$ is unknown we have a problem in saying what ‘small’ means. The following partial ordering of decision rules is a first step in defining a ‘good’ decision rule.
- A decision rule $δ_1$ is R-better than a decision rule $δ_2$ if $R(θ,δ_1) \leq R(θ,δ_2)$ for all $θ∈Θ$, with strict inequality for some $θ$. A rule $δ_1$ is R-equivalent to $δ_2$ if $R(θ,δ_1 )=R(θ,δ_2)$ for all $θ$.
- A decision rule $δ$ is admissible if there exists no R-better decision rule. A decision rule $δ$ is inadmissible if there does exist an R-better decision rule.
- It is fairly clear that an inadmissible decision rule should not be used, since a decision rule with smaller risk can be found. Unfortunately, there is usually a large class of admissible decision rules for a particular problem. These rules will have risk functions, which cross, i.e., which are better in different places. An example of these ideas is given below.
- Example: assume $X \sim N(θ,1)$, and we want to estimate $θ$ under the loss $L(θ,a)=(θ-a)^2$. Consider the decision rules $δ_c (x)=cx$. Then $R(θ,δ_c )=E_θ^X L(θ,δ_c (X))=E_θ^X (θ-cX)^2=E_θ^X (c[θ-X]+[1-c]θ)^2=$ $c^2 E_θ^X (θ-X)^2+2c(1-c)θE_θ^X (θ-X)+(1-c)^2 θ^2=c^2+(1-c)^2 θ^2$, because $E_θ^X (θ-X)=0$ and $E_θ^X (θ-X)^2=1$. For $c>1$, $R(θ,δ_1 )=1<c^2+(1-c)^2 θ^2=R(θ,δ_c)$ for all $θ$, so $δ_1$ is R-better than $δ_c$. Hence the rules $δ_c$ are inadmissible for $c>1$. On the other hand, for $0≤c≤1$, $δ_c$ is admissible.
- The Bayes risk of a decision rule $δ$ with respect to a prior distribution $\pi$ on $Θ$ is defined as $r(π,δ)=E^π [R(θ,δ)]$. For the example just mentioned, suppose that $π(θ)$ is a $N(0,\tau^2 )$ density. Then, for the decision rule $δ_c$, $r(π,δ_c )=E^π [R(θ,δ_c )]=E^π [c^2+(1-c)^2 θ^2 ]=c^2+(1-c)^2 E^π [θ^2 ]=c^2+(1-c)^2 \tau ^2$. The Bayes risk of a decision rule will be seen to play an important role in virtually any approach to decision theory.
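The risk of the rules $δ_c(x)=cx$ in the normal example can be checked by simulation. The sketch below compares a Monte Carlo estimate of $E_θ^X(θ-cX)^2$ with the closed form $c^2+(1-c)^2θ^2$, confirms on a grid that $δ_1$ has smaller risk than $δ_{1.5}$, and evaluates the Bayes risk under a $N(0,\tau^2)$ prior. The values of `theta`, the grid, the number of simulations and `tau` are arbitrary illustrative choices.

```r
set.seed(1)

# Closed-form risk of delta_c(x) = c * x for X ~ N(theta, 1), squared-error loss.
risk_exact <- function(theta, cc) cc^2 + (1 - cc)^2 * theta^2

# Monte Carlo estimate of the same risk: average loss over simulated X.
risk_mc <- function(theta, cc, n_sim = 1e5) {
  x <- rnorm(n_sim, mean = theta, sd = 1)
  mean((theta - cc * x)^2)
}

theta <- 2   # illustrative true state of nature
for (cc in c(0.5, 1, 1.5)) {
  cat(sprintf("c = %.1f  exact = %.3f  simulated = %.3f\n",
              cc, risk_exact(theta, cc), risk_mc(theta, cc)))
}

# delta_1 versus delta_1.5 on a grid of theta: the risk of delta_1 is lower everywhere.
theta_grid <- seq(-5, 5, by = 0.25)
dominates <- all(risk_exact(theta_grid, 1) < risk_exact(theta_grid, 1.5))
dominates

# Bayes risk under theta ~ N(0, tau^2): replace theta^2 by its prior mean tau^2.
bayes_risk <- function(cc, tau) cc^2 + (1 - cc)^2 * tau^2
bayes_risk(cc = 0.5, tau = 1)
```

A grid check like `dominates` can only suggest R-betterness, since it looks at finitely many values of $θ$; the algebraic inequality is what establishes it for all $θ$.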

#### Decision principles

Below are the major methods of actually making a decision or choosing a decision rule.

- The Conditional Bayes Decision Principle: choose an action $a\in \ell$ that minimizes $ρ(π^*,a)$ (assuming the minimum is attained). Such an action is called a Bayes action and is denoted $a^{π^*}$. Returning to the drug example above, we have shown that

$$ρ(π,a)= \begin{cases} 0.15-a, & \text{if } a\le 0.1 \\ 15a^{2}-4a+0.3, & \text{if } 0.1\le a\le 0.2 \\ 2a-0.3, & \text{if } a\ge 0.2 \end{cases} $$

- This expected loss is minimized at $a^*=2/15\approx 0.1333$ (setting the derivative of the middle branch to zero gives $30a-4=0$), where $ρ(π,a^*)=\frac{1}{30}\approx 0.0333$. This would be the estimate of $θ_2$, the market share of the new drug, assuming no data were available.

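- You can try the following R code:

```r
# Bayesian expected loss rho(pi, a) for the drug example (piecewise in a)
quadraticF <- function(x) {
  result <- rep(0, length(x))
  low  <- x < 0.1
  mid  <- (0.1 <= x) & (x < 0.2)
  high <- 0.2 <= x
  result[low]  <- 0.15 - x[low]
  result[mid]  <- 15 * x[mid]^2 - 4 * x[mid] + 0.3
  result[high] <- 2 * x[high] - 0.3
  return(result)
}

quadraticF(2/15)   # expected loss at the Bayes action

# One-dimensional minimization over [0.1, 0.2]; "Brent" is the optim()
# method for bounded one-parameter problems
optim(0.15, quadraticF, method = "Brent", lower = 0.1, upper = 0.2)
# $par
# [1] 0.1333333
# $value
# [1] 0.03333333

plot(quadraticF, 0.1, 0.2)
```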

#### Frequentist Decision Principles

- The Bayes risk principle: a decision rule $δ_1$ is preferred to a rule $δ_2$ if $r(π,δ_1 )<r(π,δ_2)$. A decision rule that minimizes $r(π,δ)$ is optimal; it is called a Bayes rule and is denoted $δ^π$. The quantity $r(π)=r(π,δ^π)$ is called the Bayes risk for $π$. In the normal example above, $r(π,δ_c )=c^2+(1-c)^2 \tau ^2$ when $π$ is $N(0,\tau^2)$. Setting the derivative with respect to $c$ to zero, $2c-2(1-c)\tau^2=0$, shows that $c_0=\frac{\tau^2}{1+\tau^2}$ is the best value. Thus $δ_{c_0}$ has the smallest Bayes risk among all estimators of the form $δ_c$.

- The **minimax** principle: a decision rule $δ_1^*$ is preferred to a rule $δ_2^*$ if $\sup_{θ\in Θ} {R(θ,δ_1^* )}<\sup_{θ\in Θ} {R(θ,δ_2^* )}$. A rule $δ^{*M}$ is a minimax decision rule if it minimizes $\sup_{θ\in Θ} {R(θ,δ^* )}$ among all randomized rules in $\mathcal{D}^*$ (the class of all randomized decision rules), i.e., if $\sup_{θ\in Θ} {R(θ,δ^{*M})}=\inf_{δ^* \in \mathcal{D}^* } {\sup_{θ\in Θ} {R(θ,δ^* )}}$. For the example above, with the decision rule $δ_c$, $\sup_θ{R(θ,δ_c )}=\sup_θ {[c^2+(1-c)^2 θ^2 ]}=\begin{cases} 1, & \text{if } c=1 \\ \infty, & \text{if } c \not= 1 \end{cases} $. According to the minimax principle, $δ_1$ is the best among the rules $δ_c$.

- The invariance principle: if two problems have identical formal structures (i.e., have the same sample space, parameter space, densities, and loss function), then the same decision rule should be used in each problem.
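The Bayes risk and minimax principles can be compared numerically for the family $δ_c(x)=cx$. The sketch below minimizes the Bayes risk $c^2+(1-c)^2\tau^2$ with `optimize()` and approximates the worst-case risk by a maximum over a finite grid of $θ$; the prior standard deviation `tau = 2`, the grid and the candidate values of $c$ are assumptions chosen only for illustration.

```r
# Bayes risk of delta_c under theta ~ N(0, tau^2) and squared-error loss.
bayes_risk <- function(cc, tau) cc^2 + (1 - cc)^2 * tau^2

tau <- 2   # hypothetical prior standard deviation
fit <- optimize(bayes_risk, interval = c(0, 1), tau = tau)
fit$minimum            # numerical minimizer of the Bayes risk
tau^2 / (1 + tau^2)    # closed-form c_0 for comparison

# Frequentist risk of delta_c as a function of theta.
risk <- function(theta, cc) cc^2 + (1 - cc)^2 * theta^2

# Worst-case risk over a finite grid of theta. This is only a stand-in for the
# supremum, which is infinite for every c != 1 once theta is unbounded.
theta_grid <- seq(-50, 50, by = 0.5)
worst_case <- sapply(c(0.5, 0.8, 1), function(cc) max(risk(theta_grid, cc)))
worst_case
```

With these values the Bayes risk principle selects a shrinkage factor below 1, while the grid-based worst case keeps growing with the range of $θ$ for every $c \neq 1$ and stays at 1 for $c=1$, which is the behaviour the minimax principle reacts to.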

#### Standard loss functions

Typically, decision rules are analyzed under a few standard loss functions, including:

- Squared-Error Loss: $L(θ,a)=(θ-a)^2$. This loss is straightforward and simple, and it is familiar to statisticians because of its similarity to classical least squares theory. There are a number of situations in which squared-error loss may be appropriate. For example, in many statistical problems for which a loss symmetric in $(θ-a)$ is suitable, the exact functional form of the loss is not crucial to the conclusion. Squared-error loss may then be a useful approximation to the true loss. Another situation in which squared-error loss can arise is when the loss is the negative of an expected utility, $L(θ,a)= - U(θ,a)= -E_{θ,a} [U(Z)]$. A generalization of squared-error loss is the weighted squared-error loss $L(θ,a)= w(θ) (θ-a)^2$, where $(θ-a)^2$ is weighted by a function of $θ$. Another variant of squared-error loss is quadratic loss: if $θ=(θ_1,…,θ_p )^t$ is a vector to be estimated by $a=(a_1,…,a_p )^t$, and $Q$ is a $p\times p$ positive definite matrix, then $L(θ,a)= (θ-a)^t Q(θ-a)$ is called quadratic loss. When $Q$ is diagonal, this reduces to $L(θ,a)= \sum_{i=1}^p{q_i (θ_i-a_i )^2}$, which is a natural extension of squared-error loss to the multivariate situation.

- Linear Loss: when the utility function is approximately linear, the loss function will tend to be linear. Thus the linear loss is

- $L(θ,a)= \begin{cases} K_0 (θ- a), & \text{if } θ-a\ge 0 \\ K_1 (a-θ), & \text{if } θ-a<0 \end{cases} $, where the constants $K_0$ and $K_1$ can be chosen to reflect the relative importance of underestimation and overestimation. These constants will usually be different. When they are equal, the loss is equivalent to $L(θ,a)=|θ- a|$, which is called absolute error loss. If $K_0$ and $K_1$ are functions of $θ$, then the loss will be called weighted linear loss.
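To see how the choice of loss changes the resulting action, the sketch below computes Bayes actions under squared-error and linear loss for a $U(0.1,0.2)$ distribution on $θ$ by direct numerical minimization of the expected loss. It relies on standard facts that are not derived in this section: squared-error loss leads to the mean, absolute error loss to the median, and linear loss to the $K_0/(K_0+K_1)$ quantile. The helper names and the use of `integrate()` with `optimize()` are illustrative assumptions.

```r
sq_loss  <- function(theta, a) (theta - a)^2
lin_loss <- function(theta, a, K0 = 1, K1 = 1) {
  ifelse(theta >= a, K0 * (theta - a), K1 * (a - theta))
}

# Expected loss of action a when theta ~ U(0.1, 0.2), by numerical integration.
exp_loss <- function(a, loss, ...) {
  integrate(function(th) loss(th, a, ...) * dunif(th, 0.1, 0.2),
            lower = 0.1, upper = 0.2, rel.tol = 1e-8)$value
}

# Bayes action: the action in [0.1, 0.2] with the smallest expected loss.
bayes_action <- function(loss, ...) {
  optimize(function(a) exp_loss(a, loss, ...),
           interval = c(0.1, 0.2), tol = 1e-8)$minimum
}

bayes_action(sq_loss)                    # close to the mean of U(0.1, 0.2)
bayes_action(lin_loss)                   # K0 = K1: close to the median
bayes_action(lin_loss, K0 = 1, K1 = 2)   # close to the 1/3 quantile
qunif(1/3, min = 0.1, max = 0.2)         # that quantile, for comparison
```

With $K_0=1$ and $K_1=2$ the linear loss coincides with the loss used in the drug example, so the last Bayes action should match the no-data estimate $2/15$ obtained there.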

### Applications

- This article suggests that decision-making under uncertainty is, at least partly, case-based. It proposes a model in which cases are primitive, and provides a simple axiomatization of a decision rule that chooses a “best” act based on its past performance in similar cases. Each act is evaluated by the sum of the utility levels that resulted from using this act in past cases, each weighted by the similarity of that past case to the problem at hand. The formal model of case-based decision theory naturally gives rise to the notions of satisficing decisions and aspiration levels.

- This article describes naturalistic decision-making, that is, the attempt to understand how human decision makers actually make decisions in complex real-world settings and to learn how to support those processes. Section A introduces the main themes of naturalistic decision-making, describes classical decision theory in order to discuss some of its limitations, and presents examples of the types of decisions that need to be explained. Section B presents examples of naturalistic research paradigms that have emerged within the last few years. Section C examines a range of issues concerning the need to develop methodology to conduct research in naturalistic settings. Section D examines applications and extensions of naturalistic decision-making. Section E attempts to evaluate the issues raised by this book.

### Problems

A pharmaceutical company purchases the raw chemicals necessary to manufacture a certain drug for $\$$0.89 per unit and sells the drug for $\$$1.39 per unit. Any units unsold by the expiration date are disposed of at a price of $\$$0.39 per unit. Assume the probability distribution of the consumer demand for the drug is as follows:

| Demand x | Probability p(x) |
|---|---|
| 50 | 0.15 |
| 60 | 0.25 |
| 70 | 0.20 |
| 80 | 0.15 |
| 90 | 0.10 |
| 100 | 0.08 |
| 110 | 0.07 |

(1) Assume that there is no goodwill loss (from patients) associated with the drug being out of stock. How many units of the drug should the company manufacture to minimize the expected loss?

(2) If the goodwill loss is $\$$0.20 per unit of unsatisfied demand, how many units should be produced?

(3) Suppose the company traditionally makes 80 units of the drug (for a given period). When shown the answer to (1) above, the company's executives state that a goodwill cost for being out of stock has been incorporated. What is the implicit per-unit goodwill cost associated with the inventory policy of 80 units?

(4) In a given decision problem, the cost of act $i$ is $c_{i}(x)= (x - a_{i} )^{2}$, $i = 1,2,\ldots,n$. In words, the cost is a quadratic function of the uncertain quantity $x$, and the $a_{i}$ are known constants. Show that the optimum act is the one that minimizes $(\mu- a_{i})^{2}$, where $\mu$ is the expected value of $x$.

A problem frequently faced by production departments concerns the number of items that must be scheduled for production in order to fill an order for a specified number of items. Each item may be good or defective, and there is uncertainty about the total number of good items in the production run. If too few items are produced, there may not be enough good items to fill the order; if too many items are produced, there may be more good items available than are required and the surplus items will be wasted.

As a simple illustration, consider the case of a production manager who has an order for 5 units of a particular product. The cost of setting up a production run is $\$$150 regardless of the size of the run. The variable manufacturing cost amounts to $\$$20 per unit produced. For a number of reasons, it is not possible to test each unit before the next unit is produced; rather, all the items in the run are tested at the same time, and the good items are separated from the defective ones at that stage. From past experience, the long-run fraction of defective units is estimated to be about 10$\%$.

If there are fewer than 5 good units in the first run, a second run must be scheduled. The same comments apply for the second run as for the first. If the number of good units in the second run falls short of the number required, a third run may be necessary. Theoretically, a fourth, fifth, . . . , run may be required if the number of good units in three runs still falls short of the requirements.

If more good units are produced in any one run than are required, the surplus good units are sold as scrap at $\$$10 per unit. Defective units are worthless.

(5) Under what conditions is it reasonable to suppose that the probability distribution of the number of good units in a run of n units is binomial with parameters n and p = 0.90? When answering the following questions, assume that these conditions are satisfied.

(6) In order to simplify the problem initially, assume that if a second run is required, it is possible to avoid having any defectives in that run by employing workers with more skill. The variable manufacturing cost in the second run will increase to $\$$40 per unit. How many units should be produced in the first run so that the expected cost per order is minimized?

(7) Suppose that if a second run is needed, it will be made under the same conditions as the first. However, if a third run is required, skilled workers will be employed, there will be no defective units in this run, and the variable cost will be $\$$40 per unit. Without doing any calculations, explain how you would determine the policy that minimizes the expected cost per order.

### References

- Chapter 3 Decision Theory / York University
- Decision Theory Wikipedia
- Statistical Decision Theory and Bayesian Analysis / James O. Berger

- SOCR Home page: http://www.socr.umich.edu
