Final Thesis

Bayesian adaptation using conditionally Gaussian priors

By Botond Szabó
Applied Mathematics
Eötvös Loránd Tudományegyetem
Department of Probability Theory and Statistics
2010

Supervisors: Prof. dr. V. Prokaj, Prof. dr. J. H. van Zanten, Prof. dr. A. W. van der Vaart

Contents

1 Introduction
  1.1 Notation
2 Reproducing Kernel Hilbert Space
  2.1 RKHS in general
  2.2 The Riemann-Liouville process
  2.3 Rescaled multiple integrated Brownian motion
3 Preliminary results
  3.1 The fundamental theorem
  3.2 Determination of constants
4 Main Theorem
5 Examples
  5.1 The Extended Riemann-Liouville process
  5.2 The Rescaled process

Preface

This master thesis was prepared at the Vrije Universiteit Amsterdam between October 2008 and June 2009, and was extended and corrected at the Eötvös Loránd Tudományegyetem between September 2009 and June 2010. I would like to thank Harry van Zanten for his patience, his helpfulness and the fruitful discussions we had while I was working on the master project. It has been a pleasure to work with him. Furthermore, I would like to thank Vilmos Prokaj for his careful reading of the manuscript, for his technical help with LaTeX and, finally, for the fruitful conversations on the topic. I am also thankful to Aad van der Vaart for the helpful answers to my questions.

1 Introduction

Consider the problem of estimating a probability density p0 relative to a dominating measure µ on a sample space (X, A), based on an iid sample X1, ..., Xn from this density. Typical choices for X are R, R^d, [0, 1] or [0, 1]^d; in our case X = [0, 1] and µ is the Lebesgue measure. We assume that p0 > 0 and that log p0 is bounded.

Gaussian processes can be used to construct prior distributions in infinite-dimensional statistical settings. For example, in nonparametric density estimation they can serve as a prior distribution on a collection of probability densities (relative to the dominating measure µ above). Let W be a Borel measurable, zero-mean Gaussian random element in L∞(X), and define the prior distribution through the random density

p_W(x) = e^{W_x} / ∫ e^{W_x} µ(dx).    (1.1)

These functions are nonnegative, since the Gaussian process is exponentiated, and they are normalized, so they are indeed probability densities. We refer to this prior distribution as Π. According to Bayes' rule, the prior distribution and the iid sample yield a posterior distribution. We work in the frequentist setup, which means that the data are sampled from a fixed "true distribution". The corresponding posterior distribution often contracts around this fixed true distribution, which is referred to as posterior consistency.
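The construction in (1.1) is easy to simulate. The following sketch, not part of the thesis, draws a density from the prior by sampling a Gaussian process on a grid of [0, 1] and normalizing its exponential; the squared-exponential covariance used here is only an illustrative choice, and any zero-mean Gaussian process could be plugged in.

```python
import numpy as np

def sample_prior_density(n_grid=200, length_scale=0.2, seed=0):
    """Draw one random density p_W(x) = exp(W_x) / int exp(W_s) ds on [0, 1]."""
    rng = np.random.default_rng(seed)
    x = np.linspace(0.0, 1.0, n_grid)
    # Zero-mean Gaussian process with a squared-exponential covariance (illustrative choice).
    cov = np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2 / length_scale ** 2)
    w = rng.multivariate_normal(np.zeros(n_grid), cov + 1e-10 * np.eye(n_grid))
    # Exponentiate and normalize by a trapezoidal approximation of the integral.
    unnormalized = np.exp(w)
    density = unnormalized / np.trapz(unnormalized, x)
    return x, density

x, p = sample_prior_density()
print("integral of sampled density:", np.trapz(p, x))  # approximately 1
```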

We study the rate of contraction of the posterior distribution: the radius of the balls in which the posterior distribution puts most of its mass, as a function of n. The formal definition is that the posterior distribution has rate of contraction at least εn if, for a sufficiently large constant M,

Πn(p : d(p, p0) ≥ M εn | X1, ..., Xn) → 0 in probability as n → ∞,

where d is the Hellinger or the total variation distance. The posterior distribution is the random measure given by

Πn(B | X1, X2, ..., Xn) = ∫_B ∏_{i=1}^n p(Xi) dΠ(p) / ∫ ∏_{i=1}^n p(Xi) dΠ(p).    (1.2)

Since the posterior distribution puts all of its mass on the support of the prior, consistency can hold only if the parameter w0 defining the true distribution of the data belongs to this support.

The Hölder space of order α > 0 is denoted by C^α[0, 1]. It is the collection of functions f ∈ C[0, 1] that have ⌊α⌋ continuous derivatives, where ⌊α⌋ is the largest integer strictly smaller than α, with the ⌊α⌋-th derivative f^(⌊α⌋) being Lipschitz continuous of order β for all 0 < β < α − ⌊α⌋. We also refer to the elements of C^α[0, 1] as α-regular functions.

In the simplest case we would like to estimate α-regular density functions, where α > 0 is a known parameter. By a good choice of the prior distribution of the form (1.1) we can achieve the contraction rate ε_{n,α} = n^{−α/(2α+1)}; for examples see the fourth chapter of [8].

Let X1, X2, ..., Xn be an iid sample from a density p ∈ P, where P is some set of probability densities. The minimax rate for estimating p is, up to a constant multiplier, the square root of

inf_{p̂n} sup_{p∈P} E_p d²(p, p̂n),

where the infimum is taken over all density estimators p̂n = p̂n(· | X1, ..., Xn) and d is a metric, for example the Hellinger or the total variation distance. It is known that the minimax rate for estimating an α-regular density p0 on [0, 1]^d is ε_{n,α} = n^{−α/(2α+d)}; see the introduction of [3].

Hence, with a good choice of the prior distribution, we can construct estimators with the help of the posterior distribution (1.2) that achieve the minimax rate. In this paper we deal with the generalization of this problem to the case where the regularity of the density p0 is unknown. We then use a set of models P^α indexed by α, where α ∈ A ⊂ R+, and the model P^α can be used to estimate α-regular functions. Assume that P^α ⊂ P^β for all α > β (α, β ∈ A). We can put a prior on the values of α and let the data choose the correct one through the corresponding posterior distribution. This Bayesian procedure fits within the framework of adaptive estimation, which focuses on constructing estimators that automatically choose a best model from a given set of models. An estimator on a given set of models is called rate adaptive if it attains the same rate of convergence that would have been attained if only the best model had been used. For example, when the model P^α contains the α-regular densities on [0, 1]^d and the set of models covers all β-regular densities with β > 0, an estimator is called rate adaptive if it attains the rate n^{−α/(2α+d)} whenever the true density is α-smooth, for any α > 0.

Our main goal is to construct a rate adaptive Bayesian estimation procedure. For this we have to put a prior distribution on the set of models. For simplicity this prior is discrete, and to achieve rate adaptivity it varies with the sample size. This is an area of intensive research; see e.g. [3], where a result similar to our main theorem was proved and applied to LogSpline models.

This paper is organized as follows. In Chapter 2 the notion of a Reproducing Kernel Hilbert Space, the Riemann-Liouville process and the rescaled Gaussian process are discussed. Chapter 3 prepares the ground for our main theorem, which is given in Chapter 4. The last chapter contains examples.

1.1 Notation

The ε-covering number of a metric space (P, d), denoted by N(ε, P, d), is the minimal number of ε-balls needed to cover P. The ε-packing number D(ε, P, d) of P is the maximal number of points in P such that the distance between every pair of points is at least ε. These two definitions are related by the inequalities

N(ε, P, d) ≤ D(ε, P, d) ≤ N(ε/2, P, d).

Since we are only interested in rates of convergence, the additional constant 1/2 is not important, hence we can replace the ε-covering number by the ε-packing number. The set of centers of ε-balls covering P is called an ε-net.

We use the Hellinger, total variation and L2 distances, the Kullback-Leibler divergence K(·, ·) and the discrepancy V(·, ·). For two probability densities p and q relative to the measure µ they are defined as

h(p, q) = ( ∫ |√p − √q|² dµ )^{1/2},
‖p − q‖1 = ∫ |p − q| dµ,
‖p − q‖2 = ( ∫ |p − q|² dµ )^{1/2},
K(p, q) = ∫ p log(p/q) dµ,
V(p, q) = ∫ p log²(p/q) dµ.

Except for the L2 distance, they do not depend on the dominating measure µ.

For a random element W in L∞(X) we define the concentration function of W around w ∈ L∞(X) by

ϕ_w(ε) = − log P(‖W − w‖∞ < ε),    (1.3)

where ‖·‖∞ is the supremum norm. Let L(p) denote the n-sample likelihood ratio

L(p) = ∏_{i=1}^n (p/p0)(Xi).    (1.4)

Furthermore, we use the notation H1 for the unit ball of the Hilbert space H.
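As a quick numerical illustration of these definitions (not part of the thesis), the sketch below approximates the Hellinger, total variation and L2 distances and the discrepancies K and V for two densities given on a grid of [0, 1]; the two Beta densities are arbitrary stand-ins.

```python
import numpy as np

def divergences(p, q, x):
    """Approximate h, ||.||_1, ||.||_2, K and V for densities p, q sampled on the grid x."""
    hell = np.sqrt(np.trapz((np.sqrt(p) - np.sqrt(q)) ** 2, x))
    tv = np.trapz(np.abs(p - q), x)
    l2 = np.sqrt(np.trapz((p - q) ** 2, x))
    kl = np.trapz(p * np.log(p / q), x)
    v = np.trapz(p * np.log(p / q) ** 2, x)
    return hell, tv, l2, kl, v

x = np.linspace(1e-3, 1 - 1e-3, 2000)
p = 6 * x * (1 - x)            # Beta(2, 2) density on [0, 1]
q = 12 * x ** 2 * (1 - x)      # Beta(3, 2) density on [0, 1]
print(divergences(p, q, x))
```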

2 Reproducing Kernel Hilbert Space

In this chapter we first give the definition and some properties of the Reproducing Kernel Hilbert Space [1], and after that we give two examples [9], which will be applied in Chapter 5.

2.1 RKHS in general

Let W = (Wt : t ∈ T) be a zero-mean Gaussian stochastic process on the probability space (Ω, U, P). In this paper we deal with the case T = [0, 1]. The finite-dimensional distributions of such a process are determined by the covariance function K : T × T → R, defined by K(s, t) = E Ws Wt. To a Gaussian process W we isometrically assign a function space (H, ‖·‖_H), called the Reproducing Kernel Hilbert Space (RKHS). The RKHS is important for us because it determines the support and the geometry of the Gaussian measure given by the Gaussian process Wt. Next we give the classical definition, and after that the definition of the RKHS of a centered stochastic process.

Let H be a Hilbert space of functions defined on T. A function K : T × T → C is a reproducing kernel of the Hilbert space H if and only if

∀t ∈ T : K(·, t) ∈ H,   and   ∀t ∈ T, ∀ϕ ∈ H : ⟨ϕ, K(·, t)⟩ = ϕ(t),

where the last condition is called the reproducing property. A Hilbert space of complex-valued functions which possesses a reproducing kernel is called a Reproducing Kernel Hilbert Space. Next we give two examples of RKHSs.

Example 2.1. Let H be a finite-dimensional complex vector space of functions with basis (f1, f2, ..., fn), and define the inner product ⟨·, ·⟩_H by the numbers ⟨fi, fj⟩_H := g_{i,j}, where G = (g_{i,j}) is an arbitrary Hermitian and positive definite matrix. Then (H, ⟨·, ·⟩_H) is a Hilbert space. Let G^{−1} = (g^{i,j}) denote the inverse of the matrix G. Then the bivariate function

K(x, y) = Σ_{i,j=1}^n ḡ^{i,j} f_i(x) f̄_j(y)

is the reproducing kernel of the Hilbert space H, hence H is an RKHS.

Example 2.2. Let T = [0, 1] and H = {f : f(0) = 0, f is absolutely continuous, f′ ∈ L2[0, 1]}. H is a Hilbert space with the inner product

⟨ϕ, ψ⟩ = ∫_{[0,1]} ϕ′ ψ̄′ dλ.

Then H has reproducing kernel K(x, y) = min{x, y}, hence H is an RKHS.

The next theorem gives an alternative characterization of a Reproducing Kernel Hilbert Space.

Theorem 2.3 (Theorem 1 of [1]). A Hilbert space H of complex-valued functions on T has a reproducing kernel if and only if all the evaluation functionals δt, t ∈ T, are continuous on H.

A function K : T × T → C is called a positive type function (or positive definite function) if for all n ≥ 1, all (a1, ..., an) ∈ C^n and all (x1, ..., xn) ∈ T^n,

Σ_{i=1}^n Σ_{j=1}^n a_i ā_j K(x_i, x_j) ∈ R+,

where R+ denotes the set of nonnegative real numbers. The link between reproducing kernels and positive type functions is the following:

Lemma 2.4 (Lemma 2 of [1]). Any reproducing kernel is a positive type function.

Theorem 2.5 (Moore-Aronszajn theorem (Theorem 3 of [1])). Let K be a positive type function on T × T. There exists exactly one Hilbert space H_K of functions on T with K as a reproducing kernel.

Another property of RKHSs is that if H1 and H2 are RKHSs on T1 and T2, then H1 ⊗ H2 is an RKHS on T1 × T2.

Next we give the definition of the Hilbert space generated by a centered process. Take an arbitrary square integrable centered process {Xt : t ∈ T} and denote by L(X) the linear space spanned by the variables Xt, t ∈ T. Let L̄(X) be the closure of L(X) in L2(Ω, A, P). The space L̄(X) is equipped with the inner product induced by that of L2(Ω, A, P),

⟨U, V⟩_{L2(Ω,A,P)} = E(U V).

We call the space L̄(X) the Hilbert space generated by the centered process X. The RKHS generated by the centered process X is then characterized by the following theorem.

Theorem 2.6 (Loève's theorem (Theorem 35 of [1])). The Hilbert space L̄(X) generated by the centered process {Xt, t ∈ T} with covariance function K is congruent to the RKHS H_K.
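To make the reproducing property of Example 2.2 concrete, the following small check (my own illustration, not part of the thesis) approximates ⟨f, K(·, t)⟩ = ∫_0^1 f′(s) ∂_s min(s, t) ds numerically for a test function with f(0) = 0 and confirms that it returns f(t); the particular test function and evaluation point are arbitrary choices.

```python
import numpy as np

def inner_product(f_prime, g_prime, s):
    """<f, g>_H = int_0^1 f'(s) g'(s) ds, approximated on the grid s."""
    return np.trapz(f_prime(s) * g_prime(s), s)

t = 0.37                                    # evaluation point (arbitrary)
f = lambda s: np.sin(3 * s) * s             # test function with f(0) = 0
f_prime = lambda s: 3 * np.cos(3 * s) * s + np.sin(3 * s)
k_prime = lambda s: (s < t).astype(float)   # derivative of s -> min(s, t)

s = np.linspace(0.0, 1.0, 20001)
print(inner_product(f_prime, k_prime, s), "vs", f(t))   # both approximately f(0.37)
```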

2.2 The Riemann-Liouville process

For α > 0 and W a standard Brownian motion, the Riemann-Liouville process with parameter α, RL(α) for short, is defined as

R^α_t = ∫_0^t (t − s)^{α−1/2} dW_s,   t ∈ [0, 1].

The process R^α_t is a zero-mean Gaussian process with continuous sample paths. The α-order Riemann-Liouville fractional integral of a measurable function f is defined as

I^α_{0+} f(t) = (1/Γ(α)) ∫_0^t (t − s)^{α−1} f(s) ds.

Hence we can also view R^α_t as a multiple of the (α − 1/2)-fractional integral of the derivative of Brownian motion (dWt): for α ≥ 1/2 the Riemann-Liouville process equals Γ(α + 1/2) I^{α−1/2}_{0+} W. The transformation I^α_{0+} maps β-regular functions into (α + β)-regular functions, see [7]. Since Brownian motion is 1/2-regular, the RL(α) process is a good model for α-regular densities.

Theorem 2.7 (Theorem 4.2 of [8]). The RKHS of the Riemann-Liouville process with parameter α > 0 is H = I^{α+1/2}_{0+}(L2[0, 1]) and the RKHS norm is given by

‖I^{α+1/2}_{0+} f‖_H = ‖f‖2 / Γ(α + 1/2).

We can use the RL(α) process to approximate functions in C^α[0, 1] that have [α] vanishing derivatives at zero, where [α] denotes the floor of α. By adding random elements in polynomial form the trajectories of the process no longer need to have [α] vanishing derivatives, so we define the process

X^α_t = Σ_{i=0}^{[α]} (t^i / i!) Z_i + (1/Γ(α + 1/2)) R^α_t,

where Z_0, ..., Z_{[α]}, R^α_t are independent, the Z_i are standard normal variables and R^α_t is an RL(α) process.

Theorem 2.8 (Theorem 4.3 of [8]). For all α > 0 the support of the process X^α_t is the whole space C[0, 1]. For any w ∈ C^α[0, 1], the concentration function of X^α_t satisfies ϕ_w(ε) = O(ε^{−1/α}) as ε ↓ 0.

By Theorem 2.8 the concentration function of X^α_t satisfies ϕ_w(ε) = O(ε^{−1/α}), so the inequality

ϕ_w(ε_{n,α}) ≤ n ε²_{n,α}    (2.1)

holds with ε_{n,α} = c n^{−α/(2α+1)}, which is exactly the minimax rate for estimating α-regular functions (w0 ∈ C^α[0, 1]).

The RKHS of the process X^α_t, where k = α − 1/2 ∈ N, is the set of functions f : [0, 1] → R that are k times differentiable, with k-th derivative absolutely continuous and square integrable derivative f^(k+1), equipped with the inner product

⟨f, g⟩_{Hα} = Σ_{i=0}^k f^(i)(0) g^(i)(0) + ∫_0^1 f^(k+1)(s) g^(k+1)(s) ds,

which defines the norm

‖f‖²_{Hα} = Σ_{i=0}^k f^(i)(0)² + ∫_0^1 f^(k+1)(s)² ds,    (2.2)

see Section 10 of [9].
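A direct way to get a feel for RL(α) is to simulate it. The sketch below (not part of the thesis) discretizes the stochastic integral defining R^α_t as a Riemann sum over Brownian increments and adds the random polynomial part to obtain X^α_t; the grid size and the value of α are arbitrary illustration choices.

```python
import numpy as np
from math import gamma, floor

def sample_extended_rl(alpha=1.5, n_grid=1000, seed=1):
    """Simulate R^alpha_t and X^alpha_t on an equidistant grid of [0, 1]."""
    rng = np.random.default_rng(seed)
    t = np.linspace(0.0, 1.0, n_grid + 1)
    dW = rng.normal(0.0, np.sqrt(1.0 / n_grid), n_grid)   # Brownian increments
    # Riemann-sum approximation of R^alpha_t = int_0^t (t - s)^(alpha - 1/2) dW_s.
    R = np.zeros(n_grid + 1)
    for i in range(1, n_grid + 1):
        R[i] = np.sum((t[i] - t[:i]) ** (alpha - 0.5) * dW[:i])
    # Add the random polynomial part to obtain X^alpha_t.
    Z = rng.normal(size=floor(alpha) + 1)
    poly = sum(Z[i] * t ** i / gamma(i + 1) for i in range(floor(alpha) + 1))
    X = poly + R / gamma(alpha + 0.5)
    return t, R, X

t, R, X = sample_extended_rl()
```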

2.3 Rescaled multiple integrated Brownian motion

In this subsection we use the results of [10]. Our first goal is to give a prior distribution on α-regular functions for all α ∈ (0, k + 1], for k large enough. The main idea is to rescale a sufficiently smooth process, with rescaling constants that depend on the sample size. Given a fixed Gaussian process (Wt : t ≥ 0) indexed by the positive time axis and scaling constants cn > 0, we use the rescaled sample path

t ↦ W_{t/cn},   t ∈ [0, 1],

as a prior model for a given function w0 : [0, 1] → R. By rescaling we change the appearance of the process by stretching or shrinking its sample paths. There are two cases. If cn → ∞, then we stretch the sample path t ↦ Wt on the short interval [0, 1/cn] to [0, 1]; typically this has the effect of smoothing the sample paths. In the second case, scaling factors cn → 0 use the sample path on the long interval [0, 1/cn] and shrink it to the interval [0, 1]; this usually makes the prior rougher.

We will use self-similar processes as smooth prior processes in this section. If W is the k-fold integrated Brownian motion (plus an independent polynomial part), then using W as a prior on the (k + 1/2)-regular functions yields the optimal convergence rate for the posterior, see [8]. We will see that for any α ∈ (0, k + 1] there exist scaling sequences cn such that, choosing the prior based on the rescaled process W_{t/cn}, the posterior attains the optimal contraction rate whenever the true density is α-regular. If α ∈ (0, k + 1/2) then cn → 0, if α ∈ (k + 1/2, k + 1] then cn → ∞, and for α = k + 1/2 no rescaling is needed.

Consider an α-order, self-similar, zero-mean Gaussian process (Wt : t ≥ 0), which means that the processes (c^α W_{t/c} : t ≥ 0) and (Wt : t ≥ 0) are equal in distribution for every c > 0. So the rescaled process (W_{t/c} : t ≥ 0) has the same law as the process (c^{−α} Wt : t ≥ 0); hence the time-rescaled and the vertically rescaled processes have the same RKHS and the same small ball probabilities. We use the notation H_W for the RKHS of the process Wt and ϕ0(ε; W) = −log P(‖W‖ ≤ ε) for the exponent in the small ball probability. The RKHS of the process W^c = (W_{t/c} : 0 ≤ t ≤ 1) is the set of functions H_W equipped with the norm ‖h‖_{W^c} = c^α ‖h‖_W. The centered small ball exponent of W^c satisfies −log P(‖W^c‖ ≤ ε) = ϕ0(c^α ε; W).

We deal with the k-fold integrated Brownian motion, which has k derivatives at 0 equal to 0; hence its RKHS satisfies a similar condition. A better choice of prior is therefore obtained by adding an independent polynomial part to the process. We consider the process

V^{c,a}_t = (I^k_{0+} B)_{t/c} + (1/√a) Σ_{i=0}^k Z_i t^i / i!    (2.3)

for scaling factors c, a > 0, B a standard Brownian motion and independent standard normal variables Z0, Z1, ..., Zk, independent of B. Theorem 2.6 of [10] gives a centered small deviation bound for the process V^{c,a} and describes the approximation of smooth functions by elements of its RKHS H^{c,a}.

Theorem 2.9 The process Vtc,a given in (2.3) satises for ε > 0 small enough, 1 − log P ( sup |Vtc,a | ≤ 2ε) . ck+1/2 ε 0≤t≤1 Moreover, for w ∈ C β 1  k+1/2 1 + k log √ . aε [0, 1] and β ≤ k + 1, inf{khk2Hc,a : kh − wk∞ ≤ ε} . c2k+1 1 (2k+2−2β)/β 1 ((2k−2β)/β)∨0 +a . ε ε where the notation . is used for "smaller then or equal to a universal constant c,a times" and P is the probability dened by the process Vt . In the proof of Theorem 2.9 ([10]) it was shown that the RKHS of the process Vtc,a , which is the Sobolev space H k+1 [0, 1] of functions h : [0, 1] R that are k -times continuously dierentiable with absolutely continuous k th derivative that is the integral of function h(k+1) ∈ L2 [0, 1], equipped with the norm with square khk2Hc,a = c2k+1 kh(k+1) k22 + a k X h(i) (0)2 . i=0 Theorem 3.2 from the same paper gives for all α ∈ (0, k + 1], for an arbitrary k , a sequence of processes, which can be used

in the prior distribution, such that the posterior distribution achieves the minimax rate. Theorem 2.10 For α > 0 and k ∈ N, let Vtn be the modied k -fold integrated Brownian motion dened in (2.3) with the scaling constant c replaced by cn,α = n 1) α−(k+ 2 1 )(1+2α) (k+ 2 and a replaced by a sequence of an,α satisfying 1+2α−2k an,α ≤ n 1+2α . α n Dene the prior Πn as in (1.1), with the Gaussian process V Then if log p0 ∈ C α [0, 1] and α ≤ k + 1, we have E0 Πα (p : h(p, p0 ) > M εn,α |X1 , . , Xn ) 0 for all M large enough, where εn,α = n α − 1+2α 12 and h is the Hellinger distance. http://www.doksihu So we can use the rescaled processes in (1.1) to estimate α-regular functions for all α > 0 and it achieves the minimax rate in the Bayesian estimation procedure. Remark 2.11 In the proof of Theorem 210 it was shown that the concentration function ϕw (ε) for w ∈ C α [0, 1] satises ϕw (εn,α ) ≤ nε2n,α , α
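To see how the scaling constants of Theorem 2.10 behave, the short sketch below (my own illustration, not from the thesis) evaluates c_{n,α}, the upper bound for a_{n,α} and the rate ε_{n,α} for a few regularities α around the unscaled case α = k + 1/2.

```python
def scaling_constants(n, alpha, k):
    """c_{n,alpha}, the bound on a_{n,alpha} and eps_{n,alpha} from Theorem 2.10."""
    c = n ** ((alpha - (k + 0.5)) / ((k + 0.5) * (1 + 2 * alpha)))
    a_bound = n ** ((1 + 2 * alpha - 2 * k) / (1 + 2 * alpha))
    eps = n ** (-alpha / (1 + 2 * alpha))
    return c, a_bound, eps

for alpha in (1.0, 2.5, 3.0):    # with k = 2, alpha = k + 1/2 = 2.5 gives c = 1 (no rescaling)
    print(alpha, scaling_constants(n=10_000, alpha=alpha, k=2))
```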

3 Preliminary results

Recall from the first chapter that ε_{n,α} = n^{−α/(2α+1)} is the minimax rate for estimating α-regular functions on the interval [0, 1]. By a good choice of the Gaussian process W^α we obtain a prior distribution Π^α according to (1.1) such that the posterior contraction rate achieves the minimax rate. Our main goal is to give a minimax estimation of an α-regular function, where α > 0 is an unknown parameter.

First we give a hierarchical prior on the densities. Let λn be a discrete probability distribution on a finite subset of R+. We construct a new, hierarchical prior Πn that first chooses α according to λn and then p according to Π^α for the chosen α:

Πn = Σ_{α∈Bn} λn(α) Π^α.

The corresponding posterior distribution is given by

Πn(B | X1, ..., Xn) = ∫_B ∏_{i=1}^n p(Xi) dΠn(p) / ∫ ∏_{i=1}^n p(Xi) dΠn(p)
                   = ∫∫_B ∏_{i=1}^n p(Xi) dΠ^α(p) λn(dα) / ∫∫ ∏_{i=1}^n p(Xi) dΠ^α(p) λn(dα).    (3.1)

In Section 2.2 and Section 2.3 different choices of the Gaussian processes W^α were given such that the concentration function ϕ^α_w defined by W^α satisfies ϕ_w(ε_{n,α}) ≤ n ε²_{n,α} for every w ∈ C^α[0, 1], where ε_{n,α} = n^{−α/(2α+1)}.

In the following, four conditions are given that are used in the next theorem.

1. For all α > 0 there exist a set W^α ⊂ L∞ and a sequence of positive numbers ε_{n,α} ↘ 0 such that n ε²_{n,α} ↗ ∞ and, for every w ∈ W^α,

ϕ_w(ε_{n,α}) ≤ n ε²_{n,α}   if n is large enough,    (3.2)

where ϕ_w(ε) is the concentration function of W^α defined in (1.3). Furthermore, assume that for all 0 < α < β,

ε_{n,α} / ε_{n,β} → ∞.    (3.3)

2. For all 0 < α < β, where α, β ∈ N, and for all c1, c2 > 0,

(λn(α) / λn(β)) e^{c1 n ε²_{n,β}} Π^α(c2 ε_{n,β} < h(p, p0) ≤ c2 ε_{n,α}) → 0   as n → ∞.

3. For all β > 0 there exists a constant δ > 0 not depending on n such that

λn(β) > δ e^{−n ε²_{n,β}} Σ_{α≥β} λn(α).

4. For all 0 < α < β, H^β_1 ⊆ K H^α_1, where H^α_1 is the unit ball of H^α.

Conditions 2 and 3 are satisfied if we choose the prior distribution λn as

λn(α) ∼ c_α e^{−n ε²_{n,α}}    (3.4)

with an arbitrary convergent sequence of positive numbers c_α, regardless of the prior Π^α.

Theorem 3.1. Assume that Conditions 1, 2, 3 and 4 hold. If log p0 ∈ W^β for some β ∈ N, then

E0 Πn(p : h(p, p0) > M ε_{n,β} | X1, ..., Xn) → 0

for M large enough.

From this theorem it follows that, by a good choice of the Gaussian processes W^α for all α ∈ N and with the special choice of λn given in (3.4), we can construct an estimator, for example

p̂0 = argmax_q Πn(p : h(p, q) < M ε_{n,β} | X1, ..., Xn),

which achieves the minimax rate whenever we estimate β-regular functions with an unknown parameter β ∈ N. This holds because the balls of radius M ε_{n,β} around p0 and around p̂0 intersect, so that h(p̂0, p0) ≤ 2 M ε_{n,β}.
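The prior (3.4) is straightforward to compute. The sketch below (not from the thesis; the grid of regularities and the constants c_α are arbitrary choices) evaluates the weights λn(α) proportional to c_α exp(−n ε²_{n,α}) with ε_{n,α} = n^{−α/(2α+1)} and draws a regularity from them, which is the first stage of sampling from the hierarchical prior Πn.

```python
import numpy as np

def model_weights(n, alphas, c=None):
    """lambda_n(alpha) proportional to c_alpha * exp(-n * eps_{n,alpha}^2)."""
    alphas = np.asarray(alphas, dtype=float)
    c = np.ones_like(alphas) if c is None else np.asarray(c, dtype=float)
    log_w = np.log(c) - n * n ** (-2 * alphas / (2 * alphas + 1))
    log_w -= log_w.max()                     # avoid underflow before normalizing
    w = np.exp(log_w)
    return w / w.sum()

alphas = [0.5, 1.0, 1.5, 2.0, 3.0]
lam = model_weights(n=500, alphas=alphas)
rng = np.random.default_rng(0)
alpha_drawn = rng.choice(alphas, p=lam)      # first stage of the hierarchical prior
print(dict(zip(alphas, np.round(lam, 4))), alpha_drawn)
```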

In the next chapter we give the generalization of Theorem 3.1, where the same statement holds for β ∈ R+, not only for natural numbers.

3.1 The fundamental theorem

Theorem 2.1 of [2] is a very important theorem in this topic and we will refer to it as the fundamental theorem.

Theorem 3.2 (The fundamental theorem). Let Πn be a sequence of prior probability measures supported on some set of probability measures P. Suppose that for a sequence εn with εn → 0 and nε²n → ∞, a constant C > 0 and sets Pn ⊂ P, we have

log D(εn, Pn, d) ≤ n ε²n;    (3.5)
Πn(P^c_n) ≤ e^{−n ε²n (C+4)};    (3.6)
Πn(p : K(p0, p) ≤ ε²n, V(p0, p) ≤ ε²n) ≥ e^{−C n ε²n}.    (3.7)

Then for sufficiently large M we have that

Πn(p : h(p, p0) ≥ M εn | X1, ..., Xn) → 0

in (P0)^n-probability, where h is the Hellinger metric.

We give some explanation of the conditions. The first and the third conditions of the theorem are the essential ones. Condition (3.5) says that Pn is not too big. It holds for all ε′n > εn as soon as it holds for εn, so it is a restriction on the minimal value of εn. When the densities are uniformly bounded, we can also use the L2 distance, which is then bounded above by a multiple of the Hellinger distance. If the densities are uniformly bounded and uniformly bounded away from zero, then the Hellinger, total variation and L2 distances are equivalent. Condition (3.6) says that Pn is a good approximation of the prior. Conditions (3.5) and (3.6) together can be interpreted as saying that the part P^c_n of P that barely receives prior mass can be neglected, so the prior effectively lives on a set Pn that is not too large. Condition (3.7) is the other main determinant of the posterior rate given by the theorem. It says that the prior measure puts a sufficient amount of mass in a small neighborhood of the true density p0. This condition is also satisfied for all ε′n > εn if it is satisfied for εn, hence it is another restriction on the minimal value of εn.

The assertion of the theorem is that the posterior mass outside a ball of radius proportional to εn is approximately zero in probability. The in-probability statement can be improved to an almost sure assertion under stronger conditions.

In the proof of Theorem 3.2 the existence of tests ϕn : X^n → [0, 1] is used. We use the word test, as in randomized hypothesis testing, for the function giving the rejection probability. These ϕn are tests of p0 versus {p : d(p, p0) > ε}, the complement of the ball of radius ε around p0. For every pair p0 and p1 in the model P there exist tests ϕn such that, for some universal constant K,

E^n_0 ϕn ≤ exp(−K n d²(p0, p1)),
sup_{p : d(p, p1) < d(p0, p1)/2} E^n_p (1 − ϕn) ≤ exp(−K n d²(p0, p1)),

where d can be either the Hellinger or the total variation distance. With the help of these tests it was shown in the proof of the fundamental theorem ([2]) that:

Lemma 3.3. Assume that the conditions of the fundamental theorem hold. Then there exist a sequence of tests ϕn and a set of events Qn ⊂ X^n such that

E0 Πn(p : h(p, p0) > M εn | X1, ..., Xn) ϕn ≤ 2 e^{−K n ε²n},
E0 Πn(p : h(p, p0) > M εn | X1, ..., Xn) (1 − ϕn) I_{Qn} ≤ 2 e^{−2 n ε²n},
E0 I_{Q^c_n} ≤ 1 / (n ε²n).

In the proof of the fundamental theorem, Lemma 8.1 of [2] was also applied; we will need it later.

Lemma 3.4. For every ε > 0 and every probability measure Π on the set {p : K(p0, p) ≤ ε², V(p0, p) ≤ ε²} we have, for every C > 0,

P0( ∫ ∏_{i=1}^n (p/p0)(Xi) dΠ(p) ≤ exp(−(C + 1) n ε²) ) ≤ 1 / (C² n ε²).

In the proof of our main theorem we will use a version of Borell's inequality given in [8]. Let W be a Borel measurable, zero-mean Gaussian random element in L∞(X). Recall that for the concentration function ϕ0 of W,

e^{−ϕ0(ε)} = P(W ∈ ε B1),

where B1 is the unit ball of (L∞(X), ‖·‖∞).

Theorem 3.1 (Theorem 5.1 of [8]). For any ε > 0 and M ≥ 0,

P(W ∈ ε B1 + M H1) ≥ Φ(Φ^{−1}(e^{−ϕ0(ε)}) + M),

where Φ is the distribution function of the standard normal distribution and H1 is the unit ball of the RKHS of W.

3.2 Determination of constants

In the proofs of the following theorems the constant multipliers are hidden, hence we copy their proofs and extend them with the computation of sufficiently large constants. First we give Lemma 8 of [4].

Lemma 3.5. For every pair of probability densities p and q we have

E_p log(p/q) ≤ h²(p, q) (2 + log ‖p/q‖∞),    (3.8)
E_p log²(p/q) ≤ h²(p, q) (3 + log ‖p/q‖∞)².    (3.9)

Proof. In both inequalities there is nothing to prove if p = q, that is, if h(p, q) = 0, so we can and do assume that h(p, q) > 0. Let r̃ : (0, ∞) → R be the function r̃(x) = 2(x − 1 − log x)/(x − 1)² for x ≠ 1 and r̃(1) = 1. Then r̃ is continuous. Put also r(x) = r̃(√x). With this notation

log x = 2(√x − 1) − r(x)(√x − 1)², or equivalently

log √ √ 1 = 2(1 − x) + r(x)( x − 1)2 . x (3.10) The following representation of r̃ is useful: Z 1 Z 1 y y 1 r̃(x) = dy = dy. 2 0 (x − 1)y + 1 0 xy + 1 − y From this, it is easily seen that (i) r̃ and hence also r is non-negative and decreasing. (ii) r̃ is concave, since r̃0 < 0 for x > 0. This can be seen from Z 1 1 0 y2 r̃ (x) = − dy. 2 2 0 (yx + (1 − y)) Another way to express the derivative r̃0 (x)  0 x − 1 − log(x) 1 0 r̃ (x) = 2 (x − 1)2 1 − x1 1 1 = − 2 r̃(x) (x − 1)2 x−12 1 1 =− + ( r̃(x) − 1) . x 1−x If x ∈ (0, 1] then r̃(x) ≥ 1, hence 12 r̃0 (x)+1/x > 0 and therefore 2 log(x)+ r̃(x) is increasing. Hence on (0, 1) we have the following estimate also r̃(x) ≤ 1 − 2 log(x), 1 r(x) = r̃(x1/2 ) ≤ 1 − 2 log x1/2 = 1 + log . x 18 http://www.doksihu (iii) For b ≥ 1 the function xb r̃(x) is increasing. Hence xb r(x) is also increasing for b ≥ 2 0 Remark 3.6 Using the concavity of r̃(x)+2 log(x) and

the fact that r̃ (1) = 4/3 we also obtain the next sharper estimate on (0, 1)     1 4 1 1 ≤ r̃(x) ≤ − + x + log . −2 + 3x + log x 3 3 x Using (3.10) with X = q/p and integrating with respect to p we obtain that  √ 2  Ep log(p/q) = Ep log 1/X = h2 (p, q) + Ep r(X) X − 1 . To estimate the last term on the right take ε ∈ (0, 1) and split the integral according to X > ε or X ≤ ε, and use that √ √ r(X)( X − 1)2 χ(X>ε) ≤ r(ε)( X − 1)2 since r is decreasing by Property (ii). For the other part take b > 2 then using Property (i) of r √ X b r(X b ) √ r(X)( X − 1)2 χ(X≤ε) ≤ ( X − 1)2 χ(X≤ε) ≤ εb r(ε)X −b . Xb Taking expectation we obtain that h  εb Ep (p/q)b i Ep log(p/q) ≤ h2 (p, q) 1 + r(ε) 1 + h2 (p, q) h  1  εb Ep (p/q)b i ≤ h2 (p, q) 1 + 1 + log 1+ . ε h2 (p, q) To nish the proof of the rst part of the statement choose εb < 1 depending on b such that εbb Ep (p/q)b 0 as b ∞ and 1/εb

kp/qk∞ > 1. The inequality kp/qk∞ > 1 follows from the facts that p 6= q and both p and q are probability densities. With this choice we obtain Ep log(p/q) ≤ h2 (p, q)(2 + log kp/qk∞ ) This proves (3.8) To prove the second inequality, note that  2 √ √ 1 log + 2( x − 1) = r2 (x)( x − 1)4 . x 19 http://www.doksihu Substituting X = q/p and integrating on the set {X ≤ c} with c ≥ 4 we get by applying the Cauchy Schwartz inequality Ep (χ(X≤c) log2 X) + 4H 2 − 4Ep (χ(X≤c) log2 X)1/2 H ≤ √  Ep r2 (X)( X − 1)4 χ(X≤c) , (3.11) √  where H 2 = Ep χ(X≤c) ( X − 1)2 . Denote the right hand side of the inequality by A and put z = Ep (χ(X≤c) log2 X)1/2 Then (311) can be written as z 2 − 4zH + 4H 2 − A ≤ 0 This gives that z≤ 4H + √ √ 16H 2 − 16H 2 + 4A = 2H + A 2 (3.12) A can be bounded as above, by taking ε ∈ (0, 1) and b ≥ 2 and splitting the integral according to X ≥ ε or X < ε. Here we use that c ≥ 4

hence √ √ √ ( X − 1)4 χ(X≤c) ≤ ( X − 1)2 ( c − 1)2 χ(X≤c) . Thus √ A ≤ Ep r2 (X)( X − 1)4 χ(X≤c)  √ ε2b Ep (p/q)2b  ≤ ( c − 1)2 H 2 r2 (ε) 1 + H2 2  √ ε2b Ep (p/q)2b  ≤ ( c − 1)2 H 2 1 + log(1/ε) 1 + H2 2b We choose again a sequence εb such that 1/εb kp/qk∞ and ε2b b Ep (p/q) 0 when b ∞. This gives √ A ≤ ( c − 1)2 H 2 (1 + log kp/qk∞ ) Plugging this into (3.12) with c = 4 yields Ep (χ(q≤4p) log2 p/q) ≤ H 2 (2 + (1 + log kp/q k∞ ))2 ≤ H 2 (3 + log kp/q k∞ )2 . √ To nish the proof we use that supx≥4 log2 x/( x − 1)2 ≤ log2 4 ≤ 2. This gives that √ Ep χ(X≥4) log2 X ≤ 2Ep χ(X≥4) ( X − 1)2 . √ Since H 2 + Ep χ(X≥4) ( X − 1)2 = h2 (p, q) we obtain that √  Ep log2 p/q ≤ H 2 (3 + log kp/q k∞ )2 + 2Ep χ(X≥4) ( X − 1)2 ≤ h2 (p, q) (3 + log kp/q k∞ )2 . 20 http://www.doksihu In the next lemma we use the notation pw = cew , where c is the normalizing constant. Lemma

3.7 (Lemma 3.1 of [8]). For bounded measurable functions v, w : X → R we have the following:

h(p_v, p_w) ≤ ‖v − w‖∞ e^{‖v−w‖∞/2};
K(p_v, p_w) ≲ ‖v − w‖²∞ e^{‖v−w‖∞} (1 + ‖v − w‖∞);
V(p_v, p_w) ≲ ‖v − w‖²∞ e^{‖v−w‖∞} (1 + ‖v − w‖∞)²,

where the notation ≲ means smaller than or equal to a universal constant times.

Next we prove a version of the previous lemma.

Lemma 3.8. For any measurable functions v, w : X → R with ‖v − w‖∞ < (1/3) log(16/9) we have the following:

K(p_v, p_w) ≤ 4 ‖v − w‖²∞;
V(p_v, p_w) ≤ 16 ‖v − w‖²∞.

Proof. We use that log ‖p/q‖∞ = ‖log(p/q)‖∞, inequality (3.8) and the bound

‖log(p_v / p_w)‖∞ ≤ 2 ‖v − w‖∞.

From these and Lemma 3.5 we get that

E_{p_v} log(p_v / p_w) ≤ h²(p_v, p_w) (2 + log ‖p_v / p_w‖∞) ≤ ‖v − w‖²∞ e^{‖v−w‖∞} (2 + 2‖v − w‖∞).    (3.13)

It is easy to check that e^x ≤ √2 if x ≤ 0.5 log 2, and (1 + x) < √2 if x < √2 − 1. So if ‖v − w‖∞ < (1/3) log(16/9), then from (3.13) we can see that

E_{p_v} log(p_v / p_w) ≤ 4 ‖v − w‖²∞.

We prove the second part of the lemma in the same way. One can check that e^x ≤ (16/9)^{1/3} if x ≤ (1/3) log(16/9), and (1 + x) < (16/9)^{1/3} if x < (16/9)^{1/3} − 1. Hence for all v, w : X → R for which ‖v − w‖∞ < (1/3) log(16/9) holds,

V(p_v, p_w) ≤ h²(p_v, p_w) (3 + log ‖p_v / p_w‖∞)²
           ≤ ‖v − w‖²∞ e^{‖v−w‖∞} (3 + 2‖v − w‖∞)²
           ≤ 9 ‖v − w‖²∞ e^{‖v−w‖∞} (1 + ‖v − w‖∞)²
           ≤ 16 ‖v − w‖²∞.
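As a sanity check of Lemma 3.8 (my own illustration, not part of the thesis), the sketch below builds the normalized densities p_v and p_w for two bounded functions with small sup-distance and verifies numerically that K(p_v, p_w) ≤ 4‖v − w‖²∞ and V(p_v, p_w) ≤ 16‖v − w‖²∞; the particular choice of v and w is arbitrary.

```python
import numpy as np

x = np.linspace(0.0, 1.0, 5000)
v = 0.3 * np.sin(2 * np.pi * x)
w = v + 0.1 * np.cos(3 * np.pi * x)          # sup-distance 0.1 < (1/3) log(16/9)

def normalize(g):
    """p_g = exp(g) / int exp(g) dx."""
    e = np.exp(g)
    return e / np.trapz(e, x)

pv, pw = normalize(v), normalize(w)
sup_dist = np.max(np.abs(v - w))
K = np.trapz(pv * np.log(pv / pw), x)
V = np.trapz(pv * np.log(pv / pw) ** 2, x)
print(K <= 4 * sup_dist ** 2, V <= 16 * sup_dist ** 2)   # both should print True
```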

The next theorem gives a construction of a sequence of sets Bn, which helps us to construct a sequence of sets Pn for Theorem 3.2 above.

Theorem 3.9 (Theorem 2.1 of [8]). Let W be a Borel measurable, zero-mean Gaussian random element in a separable Banach space (B, ‖·‖), with RKHS (H, ‖·‖_H), and let w0 be contained in the closure of H in B. For a sequence εn > 0 satisfying (3.2) for ϕ_{w0} given by (1.3), and any C > 1 with exp[−C n ε²n] < 1/2, there exists a measurable set Bn ⊂ B such that

log N(3εn, Bn, ‖·‖) ≤ 6 C n ε²n,    (3.14)
P(W ∉ Bn) ≤ exp[−C n ε²n],    (3.15)
P(‖W − w0‖² < 4 ε²n) ≥ exp[−n ε²n].    (3.16)

The method of construction is presented in the next chapter, in the proof of our main theorem. The sets Pn are then constructed as

Pn = { e^{w_x} / ∫ e^{w_x} dx : w ∈ Bn }.

The three assertions of this theorem can be matched one by one with the conditions of the fundamental theorem. The main difference between them is that (3.14), (3.15) and (3.16) are stated in the norm of the Banach space, while in the fundamental theorem the assumptions are in terms of metrics appropriate to the statistical problem under consideration (in our case usually the Hellinger or the total variation metric). Hence we need Lemma 3.7 or Lemma 3.8 to ensure that the sets P_{n,α} constructed above satisfy the conditions of the fundamental theorem. Finally we state Theorem 3.1 of [8].

Theorem 3.10. Let W be a Borel measurable, zero-mean tight Gaussian random element in L∞(X). Suppose that w0 = log p0 is contained in the support of W, and let ϕ_{w0} be the function in (1.3) with ‖·‖ the uniform norm on L∞. Furthermore assume that εn satisfies (3.2). Then the posterior distribution relative to the prior Π satisfies

E0 Π(p_w : h(p_w, p_{w0}) > M εn | X1, X2, ..., Xn) → 0

for a sufficiently large constant M, which is independent of the choice of W.

Remark 3.11. In the proof of Theorem 3.10 in [8] it was shown that the three conditions of the fundamental theorem are satisfied. So if the conditions of Theorem 3.10 hold, then we can apply the results of Lemma 3.3.

4 Main Theorem

In this chapter we give a hierarchical prior for which the posterior contraction rate is rate adaptive. Let {W^α : α ∈ A} denote a collection of zero-mean Gaussian random elements, where A ⊆ (0, ∞), and define for all α ∈ A the prior distribution Π^α according to (1.1) with W^α.

Then define ε_{n,α} for all α ∈ A such that it satisfies (3.2), where the concentration function is given by (1.3). Furthermore, assume that n ε²_{n,α} = n^{g(α)}, where g : [0, ∞) → (0, ∞) is a strictly decreasing continuous function. By a good choice of the distribution λn we generalize Theorem 3.1.

Theorem 4.1. There is a hierarchical model, depending on the sample size, such that the rate of contraction is rate adaptive. More precisely, assume that for all α < β (α, β ∈ A)

H^β_1 ⊆ K H^α_1    (4.1)

holds, where W^α is a zero-mean Gaussian random element and H^α_1 is the unit ball of the RKHS of W^α. Moreover, assume that ε_{n,α} satisfies (3.2) for all w ∈ W^α and all α ∈ A, and that n ε²_{n,α} = n^{g(α)} with a strictly decreasing continuous function g : [0, ∞) → (0, ∞). Then there exists a prior distribution λn such that the hierarchical posterior distribution (3.1) satisfies

E0 Πn(p : h(p, p0) > M ε_{n,β} | X1, ..., Xn) → 0   if log p0 ∈ W^β    (4.2)

for β ∈ A and M large enough.

Proof. First we give the construction of λn.

Lemma 4.2. Let A ⊂ (0, ∞) be the set of α parameters and g : [0, ∞) → (0, ∞) a strictly decreasing function. For an arbitrary constant H > 0 we can construct a finite set Bn ⊂ A such that

#Bn ≤ (2 g(0) / log H) log n    (4.3)

and there exists α̃n ≥ max Bn such that for n large enough we have n^{g(α̃n)} − log log n → ∞. Moreover, for all β ∈ A there exists a sequence βn ∈ Bn such that for n large enough we have

0 ≤ g(βn) − g(β) < log H / log n    (4.4)

and for all α ∈ Bn ∩ (0, βn),

g(α) − g(βn) ≥ log H / (2 log n).    (4.5)

Let Bn be the set obtained from Lemma 4.2 with the constant H sufficiently large (H = 130² + 1 is an appropriate choice), and define λn as

λn(α) = e^{−n ε²_{n,α}} I_{Bn}(α) / Σ_{α∈Bn} e^{−n ε²_{n,α}}.    (4.6)

Assume that log p0 ∈ W^β for some β ∈ A. Then by Lemma 4.2 there exists a sequence βn satisfying (4.4). Next we show that

E0 Π(p : h(p, p0) > M′ ε_{n,βn} | X1, ..., Xn) → 0    (4.7)

if M′ is large enough. This formula is the same as (4.2) except that β is replaced by βn. After showing this we are finished, since from (4.4)

(1/2) log H ≥ log ε_{n,βn} − log ε_{n,β}

holds. Hence from this and (4.7),

E0 Π(p : h(p, p0) > M′ H^{1/2} ε_{n,β} | X1, ..., Xn) ≤ E0 Π(p : h(p, p0) > M′ ε_{n,βn} | X1, ..., Xn) → 0.

Denoting by D the set {p : h(p, p0) ≥ M′ ε_{n,βn}}, we have

Π(D | X1, ..., Xn) = Σ_α λn(α) ∫_D L(p) Π^α(dp) / Σ_α λn(α) ∫ L(p) Π^α(dp)
 ≤ Σ_{α<βn} λn(α) ∫_D L(p) Π^α(dp) / Σ_α λn(α) ∫ L(p) Π^α(dp) + Π′(D | X1, ..., Xn)
 = I + II,

where

Π′ = Σ_{α≥βn} λ′n(α) Π^α,    (4.8)

uniformly bounded in n, while in our case it may go to innity with n. λn given in (4.6) and L(p) dened in (14) R P α α<βn λn (α) D L(p)Π (dp) R E0 P − 0 as n ∞, α α λn (α) L(p)Π (dp) Lemma 4.3 For 0 β where D = {p : h(p, p0 ) ≥ M εn,βn }, βn is given by (4.4) and log p0 ∈ W The following lemma ensures that the expected value of the expression II also goes to 0. Lemma 4.4 For the hierarchical prior distribution Π0n given by (4.8) E0 Π0n (p : h(p, p0 ) ≥ M 0 εn,βn |X1 , . , Xn ) − 0 as n ∞, β where βn is given by (4.4) and log p0 ∈ W Finally one can see that from Lemma 4.3 and Lemma 44 the convergence (4.7) holds Remark 4.5 Assume that A has no accumulation point and there exists a constant δ > 0 such that for every distinct pair of α, β ∈ A |α − β| > δ holds. In this case we can use Theorem 3.1 (with small modications), instead of using Theorem 4.1, since the expression I contains only uniformly bounded amount

of points for all n. The estimation of II is proved in a similar way in both of the theorems. The application of Theorem 31 makes the λn (α) prior distribution simpler, because we do not have to deal with the Bn sets, but we can say that the support of the prior λn is the whole set A for all n. Next we give the proofs of the lemmas used in the proof of Theorem 4.1 Proof of Lemma 4.2 In the proof we apply Example 29 of [6] Choose α en % ∞ such that ng(eαn ) − log log n ∞. Let Bn ⊂ (0, α en ] ∩ A 25 http://www.doksihu be such that {g(α) log n : α ∈ Bn } is a log H -net in the set {g(α) log n : α ∈ (0, α en ] ∩ A} and there doesn't exist α1 , α2 ∈ Bn (α1 6= α2 ), such that H . We can choose the set Bn for example by greedy |g(α1 ) − g(α2 )| < 2log log n algorithm. Since α̃n % ∞, g(β) log n is in the set for n large enough Dene the sequence βn as βn = max{α ≤ β : α ∈ Bn }. In this case |g(βn ) − g(β)| < log H/ log

n =⇒ βn β . One can easily see that 2g(0) #Bn < log log n. H Proof of Lemma 4.3 Let D = D1 ∪ D2 , where D1 = {p : h(p, p0 ) ≥ M εn,α } and D2 = {p : M εn,βn < h(p, p0 ) ≤ M εn,α }. We can write R R P α X D L(p)Πα (dp) α<βn λn (α) D L(p)Π (dp) R R1 P ≤ α (dp) λ (α) L(p)Π L(p)Πα (dp) n α α<β n α∈Bn R X λn (α) D L(p)Πα (dp) R2 + . βn λ n (βn ) L(p)Π (dp) α<β n α∈Bn From Remark 3.11 there exist tests ϕαn and a set of events Qn ⊂ X n , such that 2 E0 Πα (p : h(p, p0 ) > M εn,α |X1 , . , Xn )ϕαn ≤ 2e−Knεn,α , (410) 2 E0 Πα (p ∈ Qn : h(p, p0 ) > M εn,α |X1 , . , Xn )(1 − ϕαn )IQn ≤ 2e−2nεn,α , (411) 1 (4.12) E0 IQcn ≤ 2 , nεn,α where K is an universal constant. From the seventh section of [2] in the Hellinger distance's case this universal constant is K = 12 . One can easily see that for an arbitrary constant c > 0, and for enough large x 2e−cx ≤ 1 . x (4.13)

Specially for c = 2 it is true for all x and for c = 12 for x ≥ 0.715 From assumption nε2n,α = ng(α) , where g(α) > 0 for all α > 0, nε2n,α > 1 for all n ∈ N. So for all α > 0 and n ∈ N (413) holds with x = nε2n,α and c = 2 or c = 21 . Hence from (410), (411), (412) and (413) one can easily see R L(p)Πα (dp) D1 R E0 = E0 Πα (p : h(p, p0 ) > M εn,α |X1 , . , Xn ) α L(p)Π (dp) 3 ≤ 2 . nεn,α 26 (4.14) http://www.doksihu From (4.3), (414) and the inequality εn,α > εn,βn for all α < βn R X D L(p)Πα (dp) X 3 R1 E0 ≤ nε2n,α L(p)Πα (dp) α<βn α∈Bn α<βn α∈Bn 6g(0) log n log H nε2n,βn 6g(0) log n ≤ log H ng(β) ≤ holds, where the right side goes to 0 as n % ∞. Next, we study the second term. From Lemma 34 and the fact that nε2n,βn ∞, it follows that Z 6 2 L(p)Πβn (dp) ≥e−2·2 nεn,βn   (4.15) × Πβn p : K(p0 , p) ≤ 26 ε2n,βn , V (p0 , p) ≤ 26 ε2n,βn on an event An with P0

-probability tending to 1. From Theorem 39 and Lemma 3.8 for an arbitrary w0 ∈ W βn , which for log p0 = w0 ,   Πβn p : K(p0 , p) ≤ 26 ε2n,βn , V (p0 , p) ≤ 26 ε2n,βn Lemma 3.8 ≥ P (4kW − w0 k2∞ ≤ 26 ε2n,βn , 16kW − w0 k2∞ ≤ 26 ε2n,βn ) = P (kW − w0 k∞ ≤ 2εn,βn ) (3.16) 2 ≥ e−nεn,βn . (4.16) Hence by (4.15) and (416) Z 7 2 2 L(p)Πβn (dp) ≥ e−2 nεn,βn e−nεn,βn 7 2 = e−(2 +1)nεn,βn on the event An . In the following inequality we apply Fubini's theorem, the fact that E0 L(p) = 1 and the denition of distribution λn . Hence R X λn (α) 7 X λn (α) D L(p)Πα (dp) (2 +1)nε2n,βn α R2 I ≤ e Π (D2 ) E0 A n λn (βn ) λn (βn ) L(p)Πβn (dp) α<βn α∈Bn α<βn α∈Bn X ≤ 7 2 2 e(2 +1+1)nεn,βn −nεn,α α<βn α∈Bn ≤ 27 2g(0) −(H 12 −27 −2)nε2n,β n log n, e log H (4.17) http://www.doksihu where the last inequality comes from (4.5), because for all α ∈ Bn ∧

(0, βn ): 1 log H + log εn,βn ≤ log εn,α 4 1 εn,βn H 4 ≤ εn,α 1 −nε2n,βn H 2 ≥ −nε2n,α . Finally from the denition of H (H = 1302 + 1 > (27 + 2)2 ) we get that the right hand side of (4.17) goes to 0 Proof of Lemma 4.4 Theorem 32 says that if the following conditions hold for ε0n,βn = 8εn,βn , where ε0n,βn & 0, n(ε0n,βn )2 ∞, and Pn,βn measurable sets (4.18) log N (ε0n,βn , Pn,βn , h) ≤ c1 n(ε0n,βn )2 , 0 2 c Π0 (Pn,β ) ≤ c2 e−(c3 +4)n(εn,βn ) , n  0 2 Π0 p : K(p0 , p) ≤ (ε0n,βn )2 , V (p0 , p) ≤ (ε0n,βn )2 ≥ e−c3 n(εn,βn ) , (4.19) (4.20) then E0 Π0 (p : h(p, p0 ) ≥ M ∗ ε0n,βn |X (n) ) = E0 Π0 (D|X (n) ) 0, with M 0 = 8M ∗ . So our goal is to verify these conditions Since βn β , there exists N > 0, such that for all n > N : ε0n,βn ≤ ε0n, β 0 2 as n ∞. From assumption n(ε0n,βn )2 ≥ n(ε0n,β )2 , where the right hand side goes to innity, so n(ε0n,βn )2 ∞. Lemma 4.6

There is a δ > 0 such that we have X 2 λn (β) > δe−nεn,β λn (α) (4.21) α≥β α∈Bn for all n and β ∈ Bn . According to (4.9) Lemma 46 says that exists δ > 0 such that λ0n (βn ) > δ exp{−n(ε0n,βn )2 }. From this, (416) and ϕw0α (ε0n,α ) ≤ n(ε0n,α )2 we get that   Π0 p : K(p0 , p) ≤ (ε0n,βn )2 , V (p0 , p) ≤ (ε0n,βn )2   = Π0 p : K(p0 , p) ≤ 26 ε2n,βn , V (p0 , p) ≤ 26 ε2n,βn   ≥ λ0n (βn )Πβn p : K(p0 , p) ≤ 26 ε2n,βn , V (p0 , p) ≤ 26 ε2n,βn ≥ λ0n (βn )P(kW βn − w0 k∞ ≤ 2εn,βn ) 2 ≥ δe−2nεn,βn = δe−n2 2 −5 (ε0 n,βn ) 28 http://www.doksihu for some δ > 0 independent of n. Hence the prior mass condition (420) is satised. We choose Pn,βn equal to Pn,βn = {pw : w ∈ ε0n,βn B1 + Mnβn Hβ1 n }, 4 with B1 the unit ball in L∞ (X ) and for all α > 0 0 2 Mnα = 2KΦ−1 (1 − e−(c3 +4)n(εn,α ) ), with Φ the cdf of the standard normal distribution.

By the proof of Theorem 2.1 of [8] we can verify, that these sets satisfy the right entropy condition (4.18), see below Let h1 , . , hn be contained in Mnβn Hβ1 n and be ε0n,βn /2-separated for the norm k · k, such that doesn't exist h ∈ Mnβn Hβ1 n , which for |hj − h| > ε0n,βn /2 for every j ∈ {1, ., n} Then the k · k-balls hj + (ε0n,βn /4)B1 of radius ε0n,βn /4 around these hj points are disjoint, hence 1≥ N X P (W βn j=1 ≥ N X 1 ε0n,βn ∈ hj + B1 ) 4 2 e− 2 khj kHβn P (W βn ∈ j=1 1 βn 2 ε0n,βn B1 ) 4 0 ≥ N e− 2 (Mn ) e−ϕ0 (εn,βn /4) , (4.22) where the second inequality follows from 4.16 of [5] Since the hj + (ε0n,βn /2)B1 balls cover Mnβn Hβ1 n 1 βn 2 0 N (ε0n,βn /2, Mnβn Hβ1 n , k · k) ≤ N ≤ e 2 (Mn ) +ϕ0 (εn,βn /4) , where the last inequality follows from (4.22) By its denition, any point of the set Bn = {(ε0n,βn /4)B1 + Mnβn Hβ1 n } is within distance ε0n,βn /4 of some point of

Mnβn Hβ1 n . From the denition of ε0n,βn : ε0n,βn /4 > εn,βn , hence (32) holds with ε0n,βn /4 also. These implies that log N   ε0  n,βn ε0n,βn , Bn , k · k ≤ log N , Mnβn Hβ1 n , k · k 4 2 1 βn 2 ≤ Mn + ϕ0 (ε0n,βn /4) 2 ≤ 5(c3 + 4)K 2 n(ε0n,βn )2 + n(ε0n,βn /4)2 3 ≤ (5c3 + 21)K 2 n(ε0n,βn )2 29 (4.23) http://www.doksihu q 0 2 by the denition of Mnβn if e−(c3 +4)n(εn,βn ) < 12 , because Φ−1 (y) ≥ − 52 log y1 and is negative for every y ∈ (0, 21 ). From the proof of Theorem 31 of [8] log N (ε0n,βn , Pn,βn , h) ≤ log N (3/4)ε0n,βn , Bn , k · k  (4.24) holds. From (423) and (424) we get that log N (ε0n,βn , Pn,βn , h) ≤ (5c3 + 21)K 2 n(ε0n,βn )2 , which is the same up to a constant multiplier as (4.18), with c1 = 6(c3 + 4)K 2 By concentration assumption (4.1) and Borell's inequality (Theorem 31) c Πα (Pn,β ) = P(W α ∈ / (ε0n,βn /4)B1 + Mnβn Hβ1 n ) n ≤ P(W α ∈ / (ε0n,βn /4)B1

+ Mnβn α H ) K 1 Mnβn ) K  for α ≥ βn , where cn = Φ−1 Πα (ε0n,βn /4)B1 . Since ε0n,α ≤ ε0n,βn , we have  cn ≥ Φ−1 Πα (ε0n,α /4)B1 and Mnα ≤ Mnβn . From the following expressions ≤ 1 − Φ(cn + − 1 1 0 2 Mnβn ≤ − Mnα = Φ−1 (e−(c3 +4)n(εn,α ) ) 2K 2K  −1 −n(ε0n,α /4)2 ≤ Φ (e ) ≤ Φ−1 Πα (ε0n,α /4)B1 ≤ cn (4.25) 1 1 Mnβn , hence cn + K1 Mnβn ≥ 2K Mnβn . The inequalwe can see that cn ≥ − 2K  0 ity (4.25) follows from P W α ∈ / (ε0n,α /4)B1 = e−ϕ0 (εn,α /4) and ϕ0 (ε0n,α /4) ≤ n(ε0n,α /4)2 . So we can write that c Πα (Pn,β ) ≤ 1 − Φ(cn + n ≤ 1 − Φ( Mnβn ) K Mnβn 0 2 ) = e−(c3 +4)n(εn,βn ) . 2K Hence the last condition (4.19) is also satised Proof of Lemma 4.6 We use the notation kn = X 2 e−nεn,α α∈Bn during the proof. An upper bound can be formulated for the right hand side 30 http://www.doksihu of (4.21) 2 e−nεn,β X 2 λn (α) ≤

e−nεn,β X λn (α) α∈Bn α≥β α∈Bn ≤ 2g(0) −nε2n,β log n −nε2n,αe n e e log H kn 2 e−nεn,β 2g(0) −ng(αen ) = e log n. kn log H (4.26) From the denition of α en the convergence log log n −∞ eng(αen )  g(αen ) holds. So if n is big enough, then 2g(0)/ log H e−n log n < 1 and from this and (4.26) 2 −nε2n,β e e−nεn,β = λn (β). λn (α) ≤ k n α≥β X α∈Bn Choosing δ suciently small (4.21) holds for all n 31 http://www.doksihu 5 Examples In this chapter we give two possible constructions of the models {P α : α ∈ A}, where A ⊂ R+ and a λn distribution on the models for which the Bayesian estimation is rate adaptive. 5.1 The Extended Riemann-Liouville process In the rst example we deal with the (α) Xt = k X ti i=0 i! k Zi + (I0+ W )t process, where k = α − 1/2 ∈ N. In section 22 and section 32 we have shown that we can apply the Riemann-Liouville process to estimate α-regular functions,

where α is known (log p0 ∈ W α = C α [0, 1]), and the posterior contraction rate is the minimax rate. We would like to extend our estimation procedure for unknown α. Our goal is to show that the posterior contraction rate is rate adaptive in the special case when our true density is α-regular, where α − 1/2 ∈ N. First we prove that (41) holds for this we need the following lemma. Lemma 5.1 For all α, β ∈ N, α < β and for all f β -times dierentable, f (β) absolut continuous, square integrable functions: kf kα ≤ √ ekf kβ , where kf kα = α X (i) Z 1 2 f (0) + f (α+1) (s)2 ds. 0 i=0 Proof. First we will show that for all α ∈ N kf k1 ≤ √ ekf kα . The following inequality holds (a0 + a1 + a2 ak 2 1 1 + · · · + ) ≤ (1 + 1 + + · · · + )(a20 + a21 + · · · + a2k ) 2! k! 2! k! 2 2 2 ≤ e(a0 + a1 + · · · + ak ), (5.1) because 32 http://www.doksihu 2 ai aj ai a a a a2 a2 √ j ≤ ( √ i )2 + ( √ j )2 ≤ i + j .

= 2√ i! j! j! i! i!j! i!j! i!j! i!j! For an arbitrary s ∈ (0, 1) Z s (2) (2) f (3) (z1 ) dz1 f (s) = f (0) + Z0 s Z z1 (2) (3) f (4) (z2 ) dz2 dz1 = f (0) + f (0) + 0 Z z2 Z s Z0 z1 (4) (2) (3) f (5) (z3 ) dz3 dz2 dz1 = . f (0) + = f (0) + sf (0) + 0 0 0 s2 sα−2 (α) = f (2) (0) + sf (3) (0) + f (4) (0) + · · · + f (0) 2! (α − 2)! Z s Z z1 Z zα−2 f (α+1) (zα−1 ) dzα−1 . dz2 dz1 . + 0 0 0 s2 sα−2 (α) = f (2) (0) + sf (3) (0) + f (4) (0) + · · · + f (0) 2! (α − 2)! Z s 1 + f (α+1) (zα−1 )(s − zα−1 )α−1 dzα−1 . (α − 1)! 0 Hence (5.1) f (2) (s)2 ≤ e f (2) (0)2 + s2 f (3) (0)2 + s4 f (4) (0)2 + . + s2(α−2) f (α) (0)2 + Z s f (α+1) α−1 (z)(s − z) dz 2 ! 0 Jensen, s≤1,CSB ≤ Z s + e f (2) (0)2 + f (3) (0)2 + f (4) (0)2 + . + f (α) (0)2 f (α+1) (z)2 dz 0 s∈(0,1) ≤ Z s ! (s − z)2(α−1) dz 0 e f (2) (0)2 + f (3) (0)2 + f (4) (0)2 + . + f (α) (0)2 Z 1 + ! f (α+1) (z)2 dz . 0 With

the help of the above we can nish the proof, since 33 (5.2) http://www.doksihu kf k21 = f (0)2 + f (1) (0)2 + Z 1 f (1) (z) dz 0 (5.2) ≤ e f (0)2 + f (1) (0)2 + f (2) (0)2 + . + f (α) (0)2 + Z 1 ! f (α+1) (z)2 dz 0 = ekf k2α . Applying this result to f (α−1) and α0 = β − α we get that kf k2α = α−2 X (i) 2 f (0) + f (α−1) 2 (0) + f (α) Z 1 2 (0) + f (α+1) (s)2 ds 0 i=0 β α−2 X X ≤e f (i) (0)2 + e f (i) (0)2 + i=0 i=α−1 Z 1 f (β+1) (zt )2 dzt  0 = ekf k2β . The norm kf kHα given in (2.2) is equal to the norm kf kα−1/2 for all f ∈ Hα and α ∈ N. Hence from Lemma 51 for all α < β , where α − 1/2, β − 1/2 ∈ N √ eHα1 ⊇ Hβ1 . With λn dened in (4.6) the conditions of Theorem 41 hold, hence we get the minimax rate in contraction. 5.2 The Rescaled process In the second example we use in the construction of the prior distribution the process k 1 X ti n,α k Vt = (I0+ B)t/cn,α + √ Zi ,

an,α i=0 i! where an,α and cn,α are given in section 2.3 In section 23 and section 32 we have shown that for a known parameter α > 0 we can estimate α-regular functions applying this process and we will achieve the minimax rate in the α Bayesian procedure (εn,α = n− 2α+1 ). We would like to extend this procedure for unknown parameter α. For this we take rst a k large enough constant and after that, with the help of Theorem 4.1, we show that for all unknown α α ∈ (0, k + 1] the posterior contraction rate is εn,α = n− 2α+1 . So we want to apply Theorem 4.1, but for this we have to check rst that its conditions hold 34 http://www.doksihu c ,a From the RKHS of the process Vt n,α n,α and the denition of the sequences an,α , cn,α one can see that for all k ∈ N0 and for all β > α (β, α ∈ (0, k + 1]) the inequalities cn,α ≤ cn,β , an,α ≤ an,β hold. Hence the condition Hα1 ⊇ Hβ1 is satised. From the proof of Theorem 210 one can

see that (32) holds for α εn,α = n− 2α+1 . Finally we give the prior distribution on the models. The prior Παn dened by the substitution of the process Vtn,α into the prior distribution (1.1) gives α the minimax contraction rate εn,α = n− 2α+1 , when log p0 ∈ C α [0, 1]. With probability λn (α), dened in (4.6), we choose the prior Παn , which was dened by the help of the process Vtn,α . Now we can apply Theorem 4.1, since all of its conditions hold The theorem says that for all β ∈ (0, k + 1] and log p0 ∈ C β [0, 1] the hierarchical β posterior convergence rate is εn,β = n− 1+2β , which is exactly the minimax rate. A more general approach is to rescale the smooth Gaussian distribution with an independent random variable A. This random variable could be for example the gamma distribution. It is an extension of the example to derive a similar result to the process k X ti Zi , A i=0 i! 1 k Vtα = (I0+ B)Ct + √ where A and C are independent
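The randomly rescaled prior mentioned in the last paragraph is also easy to simulate. The sketch below (my own illustration, not part of the thesis) draws a gamma-distributed scale A and an independent scale C, builds the k-fold integrated Brownian motion by repeated cumulative summation, and adds the polynomial part scaled by 1/√A; the gamma parameters and the grid are arbitrary choices.

```python
import numpy as np
from math import gamma as gamma_fn

def sample_randomly_rescaled(k=2, n_grid=2000, seed=3):
    """One draw of V^alpha_t = (I_{0+}^k B)_{C t} + A^{-1/2} sum_i Z_i t^i / i! on [0, 1]."""
    rng = np.random.default_rng(seed)
    A = rng.gamma(shape=2.0, scale=1.0)      # random vertical scale (illustrative parameters)
    C = rng.gamma(shape=2.0, scale=1.0)      # random time scale, independent of A
    # Simulate B on [0, C] and integrate it k times by cumulative sums.
    s = np.linspace(0.0, C, n_grid)
    ds = s[1] - s[0]
    B = np.concatenate(([0.0], np.cumsum(rng.normal(0.0, np.sqrt(ds), n_grid - 1))))
    IkB = B.copy()
    for _ in range(k):
        IkB = np.concatenate(([0.0], np.cumsum(IkB[:-1] * ds)))
    # Evaluating (I^k B) at the rescaled times C*t, t in [0, 1], is just IkB on this grid.
    t = s / C
    Z = rng.normal(size=k + 1)
    poly = sum(Z[i] * t ** i / gamma_fn(i + 1) for i in range(k + 1))
    return t, IkB + poly / np.sqrt(A)

t, V = sample_randomly_rescaled()
```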

random variables and are independent from the B Brownian motion also. 35 http://www.doksihu References [1] A. Berlinet and C Thomas-Agnan Reproducing Kernel Hilbert Spaces in Probability and Statistics. Kluwer Academic Publishers, Dordrecht, The Netherlands, 2004. [2] Subhashis Ghosal, Jayanta K. Ghosh, and Aad W van der Vaart Convergence rates of posterior distributions Ann. Statist, 28(2):500531, 2000. [3] Subhashis Ghosal, Jüri Lember, and Aad van der Vaart. Nonparametric Bayesian model selection and averaging. Electron J Stat, 2:6389, 2008 [4] Subhashis Ghosal and Aad van der Vaart. Posterior convergence rates of Dirichlet mixtures at smooth densities. Ann Statist, 35(2):697723, 2007. [5] James Kuelbs, Wenbo V. Li, and Werner Linde The Gaussian measure of shifted balls. Probab Theory Related Fields, 98(2):143162, 1994 [6] Jüri Lember and Aad van der Vaart. On universal Bayesian adaptation Statist. Decisions, 25(2):127152, 2007 [7] Stefan G. Samko, Anatoly A Kilbas,

and Oleg I Marichev Fractional integrals and derivatives. Gordon and Breach Science Publishers, Yver- don, 1993. Theory and applications, Edited and with a foreword by S M. Nikolski, Translated from the 1987 Russian original, Revised by the authors. [8] A. W van der Vaart and J H van Zanten Rates of contraction of posterior distributions based on Gaussian process priors. Ann Statist, 36(3):14351463, 2008. [9] A. W van der Vaart and J H van Zanten Reproducing kernel Hilbert spaces of Gaussian priors. In Pushing the limits of contemporary statistics: contributions in honor of Jayanta K. Ghosh, volume 3 of Inst Math Stat Collect., pages 200222 Inst Math Statist, Beachwood, OH, 2008 [10] A.W van der Vaart and JH van Zanten Bayesian inference with rescaled Gaussian process priors. Electron J Stat, 1:433448 (electronic), 2007 36