Massive data
West Nile
Covariance Matching Constrained Kriging

Covariance Matching Constrained Kriging (CMCK)
Download R code to carry out CMCK     Print-friendly pdf version of this page
Spatial Concept: Expectations
Lucio Fontana
Museum Ludwig, Koln, Germany
The text does not read well for all browsers. Click here for a pdf version of this page.

The motivating example on this webpage involves a region whose soils have been contaminated by tetrachlorodibenzo-p-dioxin (TCDD)

The motivating example on this webpage involves a region whose soils have been contaminated by tetrachlorodibenzo-p-dioxin (TCDD). Remediation is required when TCDD concentrations are above predetermined levels. That is, for a geostatistical process {Z(s) : s D IRd} and a given threshold t, we are interested in spatial prediction of such nonlinear functionals as I(Z(A) t) based on data Z (Z(s1),,Z(sn))', where Z(A) ave{Z(s): s A} and A D.

The prediction of linear functionals of Z, such as Z(s0) at a known point s0 or Z(A) for a given block can be carried out via kriging (e.g., Matheron, 1963; Journel and Huijbregts, 1978, Chapter V; Cressie 1993b, Chapter 3). When predicting nonlinear functionals of Z, the major thrust in geostatistics has been to use nonlinear predictors such as indicator kriging (Journel, 1983), indicator cokriging (Lajaunie, 1990), and disjunctive kriging (Matheron, 1976). While these methods are appropriate for nonlinear functionals like I(Z(s0) t), they do not generalize to the problem of predicting I(Z(A) t). Clearly, conditional simulation (e.g., Deutsch and Journel, 1992) based on specifying more than second moments can be used for inference. Restricting to specification of only second moments, we would like to explore covariance matching constrained kriging (CMCK) due to Aldworth and Cressie (2003), where constraints are added to the usual kriging equations that force elements of the variance matrix of a vector of linear predictors to match those from the corresponding predictands. The CMCK predictor is unbiased, has approximate optimal mean squared prediction error, handles additive measurement error straightforwardly, and can predict nonlinear functionals of Z(A) just as easily as those of Z(s0). Its strength is its generality for handling many types of nonlinearity.


CMCK Equations
The universal kriging (UK) block predictor of Z(A) is well known to be,

The universal kriging (UK) block predictor of Z(A) is well known to be,

where S = var(Z); c(A) = (C(s1, A),,C(sn, A))' and C(s, A) (A Cθ(s-u)du)/|A|; Cθ(s-u) cov(Z(s), Z(u)); X (xj(si)) is an n p matrix of explanatory variables; x(A) A x(u)du/|A|; and is the best linear unbiased estimator of , namely

Now consider prediction of g(Z(A)), where g is a smooth nonlinear function and |A| > 0 or A is countable. Cressie (1993a) proposed the constrained kriging (CK) predictor , which is an optimal linear predictor that is unbiased for Z(A) (as is ) and satisfies the extra constraint, . Cressie (1993a) showed that .

In the spirit of wanting the predictor's statistical properties to match those of the target quantity, Aldworth and Cressie (2003) extended this methodology to include further constraints, where (some) covariances are matched in addition to the variance(s). This covariance-matching constrained kriging (CMCK) predictor includes as a special case.

Consider the problem of predicting the M-dimensional vector, Z(A) (Z(A1),, Z(AM))', where Ai D; i = 1, , M and A {A1,, AM}. The CMCK predictor, given by Aldworth and Cressie (2003), is:

where XM (xj(Ai)) is an M p matrix, CM (c(A1),, c(AM)) is an n M matrix, and K is an M M matrix defined as follows. Write and Under the assumption that both of these M M matrices are positive-definite, there exist nonsingular matrices P1 and Q1 such that P=P1'P1 and Q=Q1'Q1. Then K Q1-1P1. Notice that the two constraints, and are satisfied; the latter constraint involves covariances as well as variances.

When the CMCK predictor is not defined (i.e., P or Q is not positive-definite), the covariance constraints can be relaxed until it is defined. Cressie and Johannesson (2001) took the approach of including the (M 1) nearest data locations to prediction region A, and then predicting Z(A) using the implied M(M+1)/2 variance-covariance constraints. This is what we shall do in the next section on TCDD contamination. Also, in the next section, we shall detrend the data first and apply CMCK to the residual process R() with constant trend, namely E(R()) = β0. This amounts to putting p = 1 and x(s) 1 in the formulas above.


Download R code to carry out CMCK.     Print-friendly pdf version of this page
TCDD contamination in soil


In environmental-remediation problems, soil contamination at one location often leads to contamination at other locations because of the conductivity properties of soil. This has been demonstrated by many case studies such as for the TCDD (tetrachlorodibenzo-p-dioxin) data, which were analyzed by Zirschky and Harris (1986) and Waller and Gotway (2004). The analysis we give below has been taken from Cressie, Zhang, and Craigmile (2005).

In 1971, a truck transporting dioxin-contaminated residues dumped an unknown quantity of waste in a rural area of Missouri in order to prevent citations for being overweight. Although the highest concentration occurred where the waste was dumped, contamination had spread to the shoulders of an adjacent state highway. In November 1983, The US Environmental Protection Agency (EPA) collected soil samples along transects and measured the TCDD concentration (in μg/kg) in each sample. Samples were composited along the transects. In our analysis, we consider only the data close to the dumping location, where the TCDD concentrations tend to be larger. (Recall that our goal is to estimate high extrema and their extent.) These data are based on 50-foot transects and there are no non-detects in the region.

Thus, we consider measurements of 31 samples of TCDD within a region of D, a 458 78 square-foot rectangle. The direction parallel to the highway and the transects is defined to be the x-direction, and the y-direction is then perpendicular to the highway. The x-coordinates of the data are the x-values of the start of each transect. A plot of the data and their spatial locations is shown in Figure 1(a), where the decimal point represents the (x,y) coordinate of the data.

Spatial Analysis

The TCDD data appear to be lognormally distributed; see Figure 1(b). Let {Z(si) : i =1,...,31} denote the observed log concentrations. Based on some exploratory plots and regression analysis, we considered the model, for the log-concentration process, where μ(s) = x(s)'β, β = (β0, β1)', and x(s) (1, y2)' denotes a quadratic trend in the y-direction for s = (x, y). The ordinary least squares estimates of the regression coefficients are and. Recall that the x-axis runs along the center of the highway; then we conclude that the quadratic surface of log concentrations in the y-direction is probably due to the drainage system along the highway, which is designed to let water run off the road quickly. See Figure 1(c).

We then removed the trend component and analyzed the spatial process generated by the residuals:


Histograms of the residuals show that normality is a fairly good approximation (given the small sample size); see Figure 2(a).


Fig 1. The top panel displays the spatial location and value of the log TCDD concentrations; the decimal points represent the (x, y) coordinates of the data. The bottom left panel shows a histogram of the log TCDD concentrations. The bottom right panel displays the estimated trend for the log TCDD concentrations.


Directional empirical semivariograms of the residuals demonstrated that the residual process is severely anisotropic. To make the empirical semivariograms look isotropic, we multiplied the y-values by a factor of 7 (and the x-values were left unchanged). With this transformation, the data almost lie on a square coordinate system (the transformed y-values range between [-266, 280] and the x-values range between [-230, 228]). We then considered various semivariogram models for the residual process. Using the weighted-least-squares method (Cressie, 1993b, p. 99), we deemed that the spherical model fitted best.


The general form for the covariance function derived from the isotropic spherical model is,

where h denotes the Euclidean distance on the transformed coordinate system and the vector of spatial parameters is θ = (c0, s2, r)'. The semivariogram is given by,


The estimate of the nugget effect, c0, was 0; the estimate of the range, r, was 91.61; and the estimate of the partial sill, s2, was 0.36. Figure 2(b) shows the estimated semivariogram . Notice that the estimated value implies a stronger spatial dependence than is apparent from the empirical semivariogram values, also shown in Figure 2(b). The estimated covariance function is


Fig 2. The left panel displays a histogram of the residuals. The right panel shows the empirical semivariogram for the residuals; the solid line is the estimated spherical semivariogram .

Comparison of prediction methods

Based on the spatial analysis we carried out in the previous section, we now consider prediction of the log-concentration process. Our aim is to estimate the value and extent of high extremes of the log TCDD concentration process in a region B. For this analysis, we let B D, the rectangular region that encloses all the observed sites. We shall predict on a 2 2 ft. grid of points throughout D. We denote this discrete approximation to D by {sj*: j =1,..., m = 9200}.

For the CMCK predictor, we first predicted the residual process R() using CMCK and then we added the estimated trend component, , to obtain predictions of the log-concentration process, Z(). The exceedance set for the process Z() in the region B associated with an absolute threshold t IR is defined by , which we estimate by for some generic predictor ; see Craigmile et al. (2005). The two panels of Figure 3 display the exceedance sets obtained when we use ordinary kriging plus trend and CMCK plus trend with M = 2 (i.e., one neighbor and constraining the variance and the covariance between nearest neighbors). In each case, we let the threshold be the 90th percentile of the inverse averaged Gaussian cumulative distribution function (averaged over B). The spatial patterns of exceedances show two regions of exceedance at around x 100 feet and CMCK shows another region of exceedance at around x -100 feet. Ordinary kriging estimates 3.16% of the prediction locations are in the exceedance set and CMCK estimates 17.09% of the prediction locations are in the exceedance set.


Fig 3. The gray regions denote those locations with predicted values that exceed the 90th percentile of the inverse ACDF for ordinary kriging plus trend and CMCK plus trend.


Aldworth, J

Aldworth, J. and N. Cressie (2003). Prediction of nonlinear spatial functionals. Journal of Statistical Planning and Inference 112, 3-41.

Craigmile, P. F., N. Cressie, T. J. Santner, and Y. Rao (2004). Bayesian inferences on environmental exceedances and their spatial locations. Submitted for review.

Cressie, N. (1993a). Aggregation in geostatistical problems. In A. Soares (Ed.), Geostatistics Troia 1992, vol. 1, 25-36. Kluwer: Dordrecht.

Cressie, N. (1993b). Statistics for Spatial Data (Revised edition). Wiley: New York.

Cressie, N. and G. Johannesson (2001). Kriging for cut-offs and other difficult problems. In P. Monestiez, D. Allard, and R. Froidevaux (Eds.) geoENV III Geostatistics for Environmental Applications . 299-310. Kluwer: Dordrecht.

Cressie, N., J. Zhang, and P. Craigmile (2005). Geostatistical prediction of spatial extremes and their extent. In P. Renard et al. (Eds.) geoENV 2004 - Geostatistics for Environmental Applications . Kluwer: Dordrecht, forthcoming.

Deutsch, C. V. and A. G. Journel (1992). GSLIB: Geostatistical Software Library and User's Guide, Oxford University Press: New York.

Journel, A. G. (1983). Nonparametric estimation of spatial distributions. Journal of the International Association for Mathematical Geology 15, 445-468.

Journel, A. G. and C. J. Huijbregts (1978). Mining Geostatistics. Academic Press: London.

Lajaunie, C. (1990). Comparing some approximate methods for building local confidence intervals for predicting regionalized variables. Mathematical Geology 22, 123-144.

Matheron, G. (1963). Principles of geostatistics. Economic Geology 58, 1246-1266.

Matheron, G. (1976). A simple substitute for conditional expectation: the disjunctive kriging. In M. Guarascio, M. David, and C. Huijbregts (Eds.), Advanced Geostatistics in the Mining Industry, pp. 221-236. Reidel: Dordrecht.

Waller, L. A. and C. A. Gotway (2004). Applied Spatial Analysis for Public Health Data. Wiley: New York.

Zirschky, J. and D. Harris (1986). Geostatistical analysis of hazardous waste site data. Journal of Environmental Engineering 112, 770-783.


Download R code to carry out CMCK     Print-friendly pdf version of this page