The motivating example on this webpage involves a region whose soils
have been contaminated by tetrachlorodibenzo-p-dioxin (TCDD)
The
motivating example on this webpage involves a region whose soils have been
contaminated by tetrachlorodibenzo-p-dioxin (TCDD).
Remediation is required when TCDD concentrations are above predetermined
levels. That is, for a geostatistical process {Z(s) : sÎDÌIRd} and a given
threshold t, we are interested in spatial
prediction of such nonlinear functionals as I(Z(A) ³t) based on data Zº (Z(s1),…,Z(sn))', whereZ(A) ºave{Z(s):sÎA} and AÌD.
The prediction of linearfunctionals
of Z, such as Z(s0)
at a known point s0
or Z(A) for a given block can be carried out via kriging (e.g., Matheron, 1963; Journel and Huijbregts, 1978, Chapter V; Cressie 1993b, Chapter 3). When predicting nonlinearfunctionals of Z,
the major thrust in geostatistics has been to use
nonlinear predictors such as indicator kriging (Journel,
1983), indicator cokriging (Lajaunie,
1990), and disjunctive kriging (Matheron, 1976).
While these methods are appropriate for nonlinear functionals
like I(Z(s0) ³t), they do not generalize to the problem of
predicting I(Z(A) ³t). Clearly,
conditional simulation (e.g., Deutsch and Journel,
1992) based on specifying more than second moments can be used for inference. Restricting to specification of only second moments, we would like to explore covariance matching
constrained kriging (CMCK) due to Aldworth and
Cressie (2003), where constraints are added to the usual kriging equations that
force elements of the variance matrix of a vector of linear predictors to match
those from the corresponding predictands. The CMCK
predictor is unbiased, has approximate optimal mean squared prediction error,
handles additive measurement error straightforwardly, and can predict nonlinear
functionals of Z(A) just as
easily as those of Z(s0). Its strength is its
generality for handling many types of nonlinearity.
CMCK Equations
The universal kriging (UK) block predictor of Z(A) is well known to be,
The
universal kriging (UK)
block predictor of Z(A) is well known to be,
where
S = var(Z); c(A)
= (C(s1,A),…,C(sn,A))' and C(s,A) º (∫ACθ(s-u)du)/|A|; Cθ(s-u) º cov(Z(s), Z(u)); Xº (xj(si))
is an n´p matrix of
explanatory variables; x(A) º∫A x(u)du/|A|;
and is the best linear unbiased estimator of , namely,
Now
consider prediction of g(Z(A)),
where g is a smooth nonlinear
function and |A| > 0 or A is countable. Cressie (1993a) proposed
the constrained kriging (CK) predictor, which is an optimal linear
predictor that is unbiased for Z(A) (as is ) and satisfies the extra
constraint, . Cressie (1993a) showed that .
In
the spirit of wanting the predictor's statistical properties to match those of
the target quantity, Aldworth and Cressie (2003) extended this methodology to
include further constraints, where (some) covariances are matched in addition
to the variance(s). This covariance-matching
constrained kriging (CMCK) predictorincludes as a special case.
Consider
the problem of predicting the M-dimensional
vector, Z(A) º (Z(A1),…,Z(AM))', where AiÌD; i
= 1, …, M and A º {A1,…,AM}. The CMCK
predictor, given by Aldworth and Cressie (2003), is:
where
XMº (xj(Ai)) is an M ´ p matrix, CM
º (c(A1),…,c(AM)) is an n´M matrix,
and K is an M ´M matrix defined as follows. Write and Under the assumption that both of these M´M matrices are
positive-definite, there exist nonsingular matrices P1 and Q1
such that P=P1'P1
and Q=Q1'Q1.
Then K ºQ1-1P1. Notice that the two constraints,
and are satisfied; the latter
constraint involves covariances as well as variances.
When the
CMCK predictor is not defined (i.e., P
or Q is not positive-definite), the
covariance constraints can be relaxed until it is defined. Cressie and
Johannesson (2001) took the approach of including the (M – 1) nearest data locations to prediction region A, and then predicting Z(A)
using the implied M(M+1)/2 variance-covariance
constraints.This is what we shall do in
the next section on TCDD contamination. Also, in the next section, we shall
detrend the data first and apply CMCK to the residual process R(×) with constant trend, namely E(R(×))
= β0. This amounts to putting p = 1 and x(s) º 1 in the formulas above.
In
environmental-remediation problems, soil contamination at one location often
leads to contamination at other locations because of the conductivity
properties of soil. This has been demonstrated by many case studies such as for
the TCDD (tetrachlorodibenzo-p-dioxin) data, which
were analyzed by Zirschky and Harris (1986) and Waller and Gotway
(2004). The analysis we give below has been taken from
Cressie, Zhang, and Craigmile (2005).
In 1971, a truck transporting dioxin-contaminated
residues dumped an unknown quantity of waste in a rural area of Missouri in order to
prevent citations for being overweight. Although the highest concentration
occurred where the waste was dumped, contamination had spread to the shoulders
of an adjacent state highway. In November 1983, The US Environmental Protection
Agency (EPA) collected soil samples along transects and measured the TCDD
concentration (in μg/kg) in each sample. Samples were composited along the transects. In
our analysis, we consider only the data close to the dumping location, where
the TCDD concentrations tend to be larger. (Recall that our goal is to estimate
high extrema and their extent.) These data are based
on 50-foot transects and there are no non-detects in the
region.
Thus, we consider measurements of 31 samples of TCDD
within a region of D, a 458 ´ 78 square-foot rectangle. The direction parallel to
the highway and the transects is defined to be the x-direction, and the y-direction is then perpendicular to the
highway. The x-coordinates of the
data are the x-values of the start of each transect. A plot of the
data and their spatial locations is shown in Figure 1(a), where the decimal
point represents the (x,y)
coordinate of the data.
Spatial Analysis
The TCDD data appear to
be lognormally distributed; see Figure 1(b).Let {Z(si) : i =1,...,31} denote the observed
log concentrations. Based on some exploratory plots and regression analysis, we
considered the model, for the log-concentration process, where
μ(s) = x(s)'β, β = (β0, β1)', and x(s)
º
(1, y2)' denotes a
quadratic trend in the y-direction
for s = (x, y). The ordinary least
squares estimates of the regression coefficients are and.
Recall that the x-axis runs along the
center of the highway; then we conclude that the quadratic surface of log
concentrations in the y-direction is
probably due to the drainage system along the highway, which is designed to let
water run off the road quickly.See
Figure 1(c).
We then removed the
trend component and analyzed the spatial process generated by the residuals:
.
Histograms of the
residuals show that normality is a fairly good approximation (given the small
sample size); see Figure 2(a).
Fig1.The
top panel displays the spatial location and value of the log TCDD
concentrations; the decimal points represent the (x,y) coordinates of the data.The bottom left panel shows a histogram of the log TCDD
concentrations.The bottom right panel
displays the estimated trend for the log TCDD concentrations.
Directional empirical semivariograms of the residuals demonstrated that the
residual process is severely anisotropic. To make the empirical semivariograms look isotropic, we multiplied the y-values by a factor of 7 (and the x-values were left unchanged). With this
transformation, the data almost lie on a square coordinate system (the
transformed y-values range between
[-266, 280] and the x-values range
between [-230, 228]).We then considered
various semivariogram models for the residual process.
Using the weighted-least-squares method (Cressie, 1993b, p. 99), we deemed that
the spherical model fitted best.
The general form for the
covariance function derived from the isotropic spherical model is,
whereh denotes the Euclidean distance on the transformed coordinate
system and the vector of spatial parameters is θ = (c0,s2,r)'. The semivariogram
is given by,
.
The estimate of the
nugget effect, c0, was 0;
the estimate of the range, r, was
91.61; and the estimate of the partial sill, s2, was 0.36. Figure 2(b)
shows the estimated semivariogram.
Notice that the estimated value implies a stronger spatial dependence than is
apparent from the empirical semivariogram values,
also shown in Figure 2(b).The estimated
covariance function is
Fig 2.The left panel
displays a histogram of the residuals.The right panel shows the empirical semivariogram
for the residuals; the solid line is the estimated spherical semivariogram.
Comparison of prediction
methods
Based on the spatial analysis we carried out in the previous section, we now consider
prediction of the log-concentration process. Our aim is to estimate the value
and extent of high extremes of the log TCDD concentration process in a region B. For this analysis, we let BºD, the rectangular region
that encloses all the observed sites. We shall predict on a 2 ´ 2 ft. grid of points throughout D. We denote this discrete approximation to D by {sj*: j =1,...,m = 9200}.
For the CMCK predictor, we first predicted the
residual process R(×)
using CMCK and then we added the estimated trend component, , to obtain predictions of the
log-concentration process, Z(×).
The exceedance set for the process Z(×) in the region B
associated with an absolute threshold tÎIR is
defined by , which we estimate by for some generic predictor ; see Craigmile
et al. (2005). The two
panels of Figure 3
display the exceedance sets obtained when we use ordinary kriging plus trend
and CMCK plus trend with M=2
(i.e., one neighbor and constraining the variance and the covariance between
nearest neighbors). In each case, we let the threshold be the 90th
percentile of the inverse averaged Gaussian cumulative distribution function (averaged over B). The
spatial patterns of
exceedances show two regions of exceedance at around x» 100 feet and CMCK shows
another region of exceedance at around x»
-100 feet. Ordinary kriging estimates 3.16% of the prediction locations are in
the exceedance set and CMCK estimates 17.09% of the prediction locations are in
the exceedance set.
Fig 3.The gray regions denote those locations with predicted
values that exceed the 90th percentile of the inverse ACDF for ordinary kriging
plus trend and CMCK plus trend.
References
Aldworth, J
Aldworth, J. and N. Cressie (2003).Prediction of nonlinear spatial functionals.Journal
of Statistical Planning and Inference 112, 3-41.
Craigmile, P. F., N. Cressie, T. J. Santner, and Y. Rao (2004).Bayesian inferences on
environmental exceedances and their spatial
locations.Submitted for review.
Cressie, N. (1993a). Aggregation in geostatistical problems.In A. Soares
(Ed.), GeostatisticsTroia 1992, vol.
1, 25-36.Kluwer:
Dordrecht.
Cressie, N. (1993b). Statistics for Spatial Data (Revised
edition). Wiley:New York.
Cressie, N. and G. Johannesson
(2001).Kriging for cut-offs and other difficult
problems.In P. Monestiez,
D. Allard, and R. Froidevaux (Eds.) geoENV III – Geostatistics
for Environmental Applications .299-310.Kluwer: Dordrecht.
Cressie, N., J. Zhang, and P. Craigmile (2005).Geostatistical prediction of spatial extremes and their extent.In P. Renard et al.(Eds.) geoENV 2004- Geostatistics for Environmental
Applications .Kluwer: Dordrecht, forthcoming.
Deutsch, C. V. and A. G. Journel (1992).GSLIB: Geostatistical Software Library and User's
Guide, OxfordUniversity Press: New York.
Journel, A. G. (1983). Nonparametric
estimation of spatial distributions.Journal
of the International Association for Mathematical Geology 15, 445-468.
Journel, A. G. and C. J. Huijbregts (1978).Mining Geostatistics. Academic Press:London.
Lajaunie, C. (1990). Comparing some
approximate methods for building local confidence intervals for predicting
regionalized variables.Mathematical
Geology 22, 123-144.
Matheron, G. (1963). Principles of geostatistics.Economic
Geology 58, 1246-1266.
Matheron, G. (1976). A simple substitute for conditional
expectation: the disjunctive kriging. In M. Guarascio,
M. David, and C. Huijbregts (Eds.), Advanced Geostatistics
in the Mining Industry, pp. 221-236. Reidel:Dordrecht.
Waller, L. A. and C. A. Gotway (2004).Applied Spatial
Analysis for Public Health Data. Wiley:New
York.
Zirschky, J. and D. Harris (1986).Geostatistical analysis of hazardous waste site data.Journal of Environmental Engineering 112, 770-783.