Introduction to Decision Theory

Sudheesh Kumar Kattumannil
University of Hyderabad

1 Introduction

Decision theory, as the name implies, is concerned with the process of making decisions; it was introduced by Abraham Wald. Unlike classical statistics, which is directed only towards the use of sampling information in making inferences about unknown numerical quantities, decision theory attempts to combine the sampling information with knowledge of the consequences of our decisions. The extension to statistical decision theory covers decision making in the presence of statistical knowledge, which provides some information where there is uncertainty. The elements of decision theory are quite logical and perhaps even intuitive. The classical approach to decision theory facilitates the use of sample information in making inferences about the unknown quantities. Other relevant information includes that of the possible consequences, which are
quantified by the loss function, and the prior information, which arises from sources other than the statistical investigation. The use of Bayesian analysis in statistical decision theory is natural. Their unification provides a foundational framework for building and solving decision problems. That is, in decision theory an attempt is made to combine the sample information with the loss function and the prior information. The basic ideas of decision theory and of decision-theoretic methods lend themselves to a variety of applications and to computational and analytic advances. Several decision models have been discussed in the statistical literature. These include classical decision models, Bayesian decision models, maximum entropy models, Markov decision models, hidden Markov models, Dempster-Shafer evidence models and fuzzy decision models. In classical decision models there are three basic elements: the state of nature, the action space and the loss function.
State of nature: The unknown quantity θ which affects the decision process is commonly called the state of nature. The symbol Θ will be used to denote the set of all possible states of nature; it is the parameter space.

Action space: The space of all possible values of the decisions/actions/rules/estimators. Decisions are more commonly called actions in the literature. A particular action will be denoted by a, and the set of all possible actions under consideration by A.

Definition 1 A decision function d is a statistic that takes values in A. That is, d is a Borel-measurable function that maps Rⁿ into A. The class of all decision functions that map Rⁿ into A is the decision space, denoted by D. Note that, if X = x is observed, then we take the action d(x) ∈ A.

Remark In non-data problems the decision space and the action space are the same.

Loss function: The loss function, denoted by L(θ, a), is a non-negative function on Θ × A. It measures how much we lose by choosing action a when θ is the true state of nature. The loss function L(θ, a) maps Θ × A into the set of real numbers and defines a cost to the statistician when he takes the action a and the true value of the parameter is θ. In estimation, in particular, it measures the accuracy of an estimator d of θ. Commonly used loss functions are the squared error loss function (L(θ, a) = (θ − a)²), the absolute error loss function (L(θ, a) = |θ − a|) and the 0−1 loss (L(θ, a) = I(|θ − a| > m)), where I denotes the indicator function. The loss function usually satisfies the following properties: L(a, a) = 0, and L(θ, a) is a non-decreasing function of |θ − a|.

Definition 2 The average of the loss function is called the risk function, usually denoted by R(θ, d), and given by

R(θ, d) = E[L(θ, d(X))] = ∫ L(θ, d(x)) f(x|θ) dx,

where f(x|θ) is the density of X. A risk function R(θ, d) characterizes the performance of the rule d for each value of the parameter θ ∈ Θ. Since the risk function is defined as an average loss with respect to the sample space, it is called the frequentist risk.
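To make the frequentist risk concrete, the following minimal sketch approximates R(θ, d) by Monte Carlo for two rules, the sample mean and the sample median, under squared error loss. The N(θ, 1) model, the sample size n = 20 and the grid of θ values are illustrative assumptions, not part of the notes.

```python
import numpy as np

# Monte Carlo approximation of the frequentist risk
# R(theta, d) = E[L(theta, d(X))] under squared error loss.
# The N(theta, 1) model, n = 20 and the theta grid are
# illustrative assumptions only.
rng = np.random.default_rng(0)
n, reps = 20, 100_000

def risk(d, theta):
    x = rng.normal(theta, 1.0, size=(reps, n))   # draws from f(x | theta)
    return np.mean((theta - d(x)) ** 2)          # average squared error loss

for theta in [-1.0, 0.0, 2.0]:
    r_mean = risk(lambda x: x.mean(axis=1), theta)
    r_median = risk(lambda x: np.median(x, axis=1), theta)
    print(f"theta={theta:+.1f}  R(mean)={r_mean:.4f}  R(median)={r_median:.4f}")
```

For this model the sample mean has risk about 1/n = 0.05 for every θ, while the median has larger risk (about π/(2n)), illustrating how risk functions let us compare rules.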
When the loss function is given by L(θ, a) = (θ − a)², then R(θ, d) = E((θ − d(X))²), the mean squared error.

Remark In non-data problems the frequentist risk and the Bayes risk are the same.

Under the above setup the problem of decision theory can be stated as follows. Given an action space A and a loss function L(θ, a), find a decision rule d in D such that the risk R(θ, d) is minimum in some sense. There are several principles for assigning a preference among the rules in D, as mentioned in the beginning. Here we discuss the minimax rule and the Bayes rule.

Definition 3 The principle of minimax is to choose d∗ ∈ D so that

sup_θ R(θ, d∗) ≤ sup_θ R(θ, d), for all d ∈ D.

If such a rule d∗ exists, it is called the minimax decision rule.

Before defining the Bayes rule, we define the Bayes risk with respect to a prior probability π(θ).

Definition 4 The Bayes risk of a decision function d is defined as

R(π, d) = E_π(R(θ, d)) = ∫ R(θ, d) π(θ) dθ,

where the expectation is taken with respect to θ.
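The two criteria can differ. The toy sketch below picks, from two rules with known risks on a finite parameter grid, the one minimizing the supremum risk versus the one minimizing the prior-averaged (Bayes) risk; all numbers are invented for illustration.

```python
import numpy as np

# Hypothetical risks R(theta_i, d) of two rules on a finite
# parameter grid; the numbers are illustrative only.
risks = {
    "d1": np.array([0.2, 0.5, 0.9]),   # R(theta_i, d1)
    "d2": np.array([0.6, 0.6, 0.6]),   # R(theta_i, d2), constant risk
}
prior = np.array([0.5, 0.3, 0.2])      # assumed prior pi(theta_i)

minimax_rule = min(risks, key=lambda d: risks[d].max())
bayes_rule = min(risks, key=lambda d: risks[d] @ prior)
print("minimax rule:", minimax_rule)   # d2: sup risk 0.6 < 0.9
print("Bayes rule:", bayes_rule)       # d1: Bayes risk 0.43 < 0.6
```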
Once the posterior distribution π(θ|x) is available, the Bayes risk can be written as

R(π, d) = ∫_χ f(x) ∫_Θ L(θ, d) π(θ|x) dθ dx,

where f(x) is the marginal probability density function of X found from the joint probability density of X and θ.

Definition 5 A Bayes decision rule is a decision function d∗ which minimizes the Bayes risk. That is, d∗ satisfies

R(π, d∗) = inf_d R(π, d).

Theorem 1 Consider the problem of estimation of a parameter θ ∈ Θ with respect to the quadratic loss function. Then the Bayes rule is the posterior mean.

Proof: A Bayes decision rule is a decision function d(x) which minimizes the Bayes risk. Writing the Bayes risk in its posterior form as above,

R(π, d) = ∫_χ f(x) ∫_Θ L(θ, d) π(θ|x) dθ dx,
and since f(x) ≥ 0, minimizing R(π, d) is the same as minimizing, for each x, the inner integral

∫_Θ L(θ, d) π(θ|x) dθ.

Hence, under the squared error loss function we have to minimize

E_{θ|x}(θ − d(X))² = ∫_Θ (θ − d(x))² π(θ|x) dθ.

Consider

φ(d) = E_{θ|x}((θ − d)²) = d² − 2d E_{θ|x}(θ) + E_{θ|x}(θ²).

Using calculus, the minimizing d is the solution of the equation φ′(d) = 0, that is, 2d − 2 E_{θ|x}(θ) = 0. Hence d = E_{θ|x}(θ), the posterior mean.
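As an illustration of Theorem 1: for a Binomial(n, θ) observation with a Beta(α, β) prior, the posterior is Beta(α + x, β + n − x) (a standard conjugacy fact), so the Bayes estimator under squared error loss is its mean. The sketch below checks this against a direct numerical minimization of the posterior expected loss; the particular numbers are assumptions for illustration.

```python
import numpy as np
from scipy import stats, optimize

# Binomial likelihood with a conjugate Beta prior; the numbers
# (n, x, alpha, beta) are illustrative assumptions.
n, x = 20, 7
alpha, beta = 2.0, 2.0
post = stats.beta(alpha + x, beta + n - x)   # posterior pi(theta | x)

# Theorem 1: under squared error loss the Bayes rule is the posterior mean.
print("posterior mean:", post.mean())        # (alpha + x) / (alpha + beta + n)

# Numerical check: minimize the posterior expected loss E[(theta - d)^2 | x].
loss = lambda d: post.expect(lambda t: (t - d) ** 2)
print("argmin of posterior expected loss:",
      optimize.minimize_scalar(loss, bounds=(0, 1), method="bounded").x)
```

Both lines print the same value, (α + x)/(α + β + n) = 0.375.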
Theorem 2 Consider the problem of estimation of a parameter θ ∈ Θ with respect to the absolute error loss function. Then the Bayes rule is the posterior median.

Proof: Recall the rule for differentiating under the integral sign (it can easily be proved from first principles):

d/dx ∫_{a(x)}^{b(x)} φ(x, t) dt = ∫_{a(x)}^{b(x)} ∂φ(x, t)/∂x dt + φ(x, b(x)) b′(x) − φ(x, a(x)) a′(x).

Recall also that a median m of a random variable X is defined by P(X ≥ m) ≥ 1/2 and P(X ≤ m) ≥ 1/2. Here we have to minimize the function given by

φ(d) = E_{θ|x}|θ − d(X)| = ∫_{θ≤d} (d(x) − θ) π(θ|x) dθ + ∫_{θ≥d} (θ − d(x)) π(θ|x) dθ.

Differentiating under the integral sign (the boundary terms vanish, since both integrands are zero at θ = d),

φ′(d) = ∫_{θ≤d} π(θ|x) dθ − ∫_{θ≥d} π(θ|x) dθ = P(θ ≤ d) − P(θ ≥ d).

Hence the minimizing d is the solution of P(θ ≤ d) − P(θ ≥ d) = 0, so that d is the posterior median.

Theorem 3 Consider the problem of estimation of a parameter θ ∈ Θ with respect to the linear loss function given by L(θ, a) = K0(θ − a) when θ ≥ a and L(θ, a) = K1(a − θ) when θ < a. Then the Bayes rule is the K0/(K0 + K1) percentile of the posterior distribution.

Proof: Along the same lines as the proof of Theorem 2, we arrive at

K0 P(θ ≥ d) = K1 P(θ ≤ d),

which can be written as K1 P(θ ≤ d) = K0 (1 − P(θ ≤ d)). Solving this equation, we obtain

P(θ ≤ d) = K0/(K0 + K1).

Hence the Bayes rule is the K0/(K0 + K1) percentile of the posterior distribution.
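Continuing the same assumed Beta-Binomial illustration, Theorems 2 and 3 say the Bayes rules under absolute error and linear loss are posterior quantiles; the values of K0 and K1 below are assumptions.

```python
from scipy import stats

# Same illustrative Beta-Binomial posterior as before.
n, x, alpha, beta = 20, 7, 2.0, 2.0
post = stats.beta(alpha + x, beta + n - x)

# Theorem 2: absolute error loss -> posterior median.
print("posterior median:", post.ppf(0.5))

# Theorem 3: linear loss with K0 (cost of underestimation) and
# K1 (cost of overestimation) -> the K0/(K0+K1) quantile.
K0, K1 = 3.0, 1.0                      # assumed costs
print("K0/(K0+K1) quantile:", post.ppf(K0 / (K0 + K1)))
```

Note that a larger K0 (underestimation is costlier) pushes the estimate above the median, as the K0/(K0 + K1) quantile formula predicts.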
Theorem 4 Consider the problem of estimation of a parameter θ ∈ Θ with respect to the loss function given by L(θ, C) = b × length(C) − I(θ ∈ C). Then the Bayes rule is given by C = {θ : π(θ|x) ≥ b}, where π(θ|x) is a unimodal density.

Proof: The Bayes decision rule is the set C which minimizes

∫_Θ L(θ, C) π(θ|x) dθ.

Hence, under the given loss function we have to find the C which minimizes

φ(C) = ∫_Θ (b × length(C) − I(θ ∈ C)) π(θ|x) dθ = b × length(C) − ∫_C π(θ|x) dθ.

Let C = [c1, c2]; then the above expression can be written as

φ(c1, c2) = b(c2 − c1) − ∫_{c1}^{c2} π(θ|x) dθ.

To minimize φ(c1, c2), we solve the equations ∂φ(c1, c2)/∂c1 = 0 and ∂φ(c1, c2)/∂c2 = 0. Now ∂φ(c1, c2)/∂c1 = 0 implies −b + π(c1|x) = 0, and ∂φ(c1, c2)/∂c2 = 0 implies b − π(c2|x) = 0. That is, π(c1|x) = b and π(c2|x) = b. Since π(θ|x) is unimodal, any θ0 lying between c1 and c2 satisfies π(θ0|x) > b. Hence the decision rule is given by

C = [c1, c2] = {θ : π(θ|x) ≥ b}.
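Theorem 4 characterizes a highest posterior density set. A grid-based sketch of C = {θ : π(θ|x) ≥ b}, again using the illustrative Beta posterior; the value of b is an assumption.

```python
import numpy as np
from scipy import stats

# Illustrative unimodal posterior (same assumed numbers as before).
post = stats.beta(2.0 + 7, 2.0 + 20 - 7)

b = 1.0                                      # assumed cost per unit length
grid = np.linspace(0, 1, 10001)
inside = post.pdf(grid) >= b                 # C = {theta : pi(theta|x) >= b}
c1, c2 = grid[inside][0], grid[inside][-1]
print(f"C = [{c1:.3f}, {c2:.3f}], posterior mass {post.cdf(c2) - post.cdf(c1):.3f}")
```

Raising b shrinks the interval (shorter sets are cheaper but capture less posterior mass), which is exactly the trade-off the loss function encodes.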
Next we state a theorem which gives the relationship between minimax and Bayes estimators.

Theorem 5 If the risk function corresponding to a Bayes estimator d∗ is constant, then d∗ is a minimax estimator.

The proof is very simple and is omitted as an exercise.

Example 1 Suppose that the state of nature and the action space each contain two points, that is, Θ = {θ1, θ2} and A = {a1, a2}, where a1 is to invest in the bond, a2 is not to invest in the bond, θ1 means no default occurs, and θ2 means a default occurs. The prior probability is given by P(θ2) = 0.1 and P(θ1) + P(θ2) = 1, so P(θ1) = 0.9. The loss function is given by

L(θ1, a1) = −500, L(θ2, a1) = 1000, L(θ1, a2) = −300, L(θ2, a2) = −300.

Find a minimax rule and a Bayes rule for the above problem. In a non-data problem we know that the risk is equal to the loss, hence R(θ, d) = L(θ, a). So we have

sup_θ L(θ, a1) = sup{−500, 1000} = 1000

and

sup_θ L(θ, a2) = sup{−300, −300} = −300.

Clearly the minimax decision rule is not to invest in the bond.
Now, the Bayes risk is given by

R(π, a1) = (0.9)(−500) + (0.1)(1000) = −350 and R(π, a2) = (0.9)(−300) + (0.1)(−300) = −300.

Since R(π, a1) < R(π, a2), the Bayes decision rule is to invest in the bond.
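A short sketch verifying the calculations of Example 1:

```python
import numpy as np

# Loss matrix for Example 1: rows are states (theta1, theta2),
# columns are actions (a1 = invest, a2 = do not invest).
L = np.array([[-500.0, -300.0],
              [1000.0, -300.0]])
prior = np.array([0.9, 0.1])                  # P(theta1), P(theta2)

actions = ["a1 (invest)", "a2 (do not invest)"]
minimax = actions[np.argmin(L.max(axis=0))]   # minimize worst-case loss
bayes = actions[np.argmin(prior @ L)]         # minimize prior-expected loss
print("minimax rule:", minimax)               # a2: sup losses are 1000 vs -300
print("Bayes rule:", bayes)                   # a1: Bayes risks are -350 vs -300
```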