Publications of Tao Shi
Modeling of spatio-temporal processes on a global scale can be challenging due to difficulties caused by the size of spatio-temporal datasets and the specification of the spatio-temporal interactions. In this article, we present a hierarchical statistical model that includes a spatio-temporal random effects (STRE) model as the dynamical component and a temporally independent spatial component for the pixel-scale variation. Optimal spatio-temporal predictions are derived in terms of a low-dimensional Kalman filter, resulting in a methodology we call fixed rank filtering (FRF). A simulation study is carried out to quantify the possible gains to be made by modeling the temporal variability and incorporating both current and past data in the FRF, compared to successive, spatial-only predictions using fixed rank kriging (FRK). Method-of-moments estimators of the parameters are proposed, and the methodology is applied to a large remote-sensing, spatio-temporal dataset; rapid implementation of FRF (along with a measure of its uncertainty) is demonstrated.
Datasets from remote-sensing platforms and sensor networks are often spatial, temporal, and very large. Processing massive amounts of data to provide current estimates of the (hidden) state is challenging, even for the Kalman filter. A large number of spatial locations observed through time can quickly lead to an overwhelmingly high-dimensional statistical model. Dimension reduction without sacrificing complexity is our goal in this article. We demonstrate how a spatio-temporal random effects (STRE) component of a statistical model reduces the problem to one of fixed dimension with a very fast statistical solution, namely fixed rank filtering (FRF). This is compared to successive, spatial-only predictions based on an analogous spatial random effects (SRE) model, and the value of exploiting temporal dependence is demonstrated. A remote sensing dataset of aerosol optical depth (AOD), from the Multi-angle Imaging SpectroRadiometer (MISR) instrument on the Terra satellite, is analyzed using FRF. We obtain rapid production of optimal, gap-filled, filtered AOD predictions, along with their prediction standard errors, and demonstrate their superiority over successive, spatial-only AOD predictions when there are large gaps in the current data. We processed over 100,000 spatio-temporal data: parameter estimation took 54.8 seconds to compute, and predictions and their standard errors took 11.2 seconds to compute.
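The dimension reduction described above can be illustrated with a minimal sketch. It assumes a simplified model z_t = S eta_t + noise with an r-dimensional state eta_t = H eta_{t-1} + innovation, and it folds any fine-scale spatial variation into white noise with variance sigma2. The names `frf_step`, `S`, `H`, `U`, and `sigma2`, and all numerical values, are hypothetical choices for illustration and are not taken from the paper. The point is that the update only solves r x r systems, however many locations are observed, and that rows of S for unobserved locations can simply be dropped.

```python
import numpy as np

def frf_step(eta, P, z, S, H, U, sigma2):
    """One predict/update step of a Kalman filter on an r-dimensional state.

    z and S contain only the locations observed at this time step.
    """
    # Predict: propagate the low-dimensional state and its covariance.
    eta_pred = H @ eta
    P_pred = H @ P @ H.T + U
    # Update in information (precision) form. Because the measurement noise
    # is sigma2 * I, only r x r matrices are inverted.
    P_pred_inv = np.linalg.inv(P_pred)
    P_new = np.linalg.inv(P_pred_inv + S.T @ S / sigma2)
    eta_new = P_new @ (P_pred_inv @ eta_pred + S.T @ z / sigma2)
    return eta_new, P_new

# Synthetic illustration: n locations, r basis functions, 70% of the
# locations missing at each time step.
rng = np.random.default_rng(0)
n, r = 500, 5
S = rng.normal(size=(n, r))          # hypothetical basis-function matrix
H = 0.9 * np.eye(r)                  # hypothetical propagator
U = 0.1 * np.eye(r)                  # hypothetical innovation covariance
sigma2 = 0.25

eta, P = np.zeros(r), np.eye(r)
eta_true = rng.normal(size=r)
for t in range(10):
    eta_true = H @ eta_true + rng.multivariate_normal(np.zeros(r), U)
    obs = rng.random(n) < 0.3
    z = S[obs] @ eta_true + rng.normal(scale=np.sqrt(sigma2), size=obs.sum())
    eta, P = frf_step(eta, P, z, S[obs], H, U, sigma2)

# Gap-filled predictions at all n locations, with standard errors
# given by the square root of diag(S P S^T).
pred = S @ eta
se = np.sqrt(np.einsum("ij,jk,ik->i", S, P, S))
```

The information-form update used here is algebraically equivalent to the usual Kalman gain form; it is chosen because its cost grows linearly with the number of observed locations.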
This paper focuses on obtaining clustering information about a distribution from its i.i.d. samples. We develop theoretical results to understand and use clustering information contained in the eigenvectors of data adjacency matrices based on a radial kernel function with a sufficiently fast tail decay. In particular, we provide population analyses to gain insights into which eigenvectors should be used and when the clustering information for the distribution can be recovered from the sample. We learn that a fixed number of top eigenvectors might at the same time contain redundant clustering information and miss relevant clustering information. We use this insight to design the Data Spectroscopic clustering (DaSpec) algorithm that utilizes properly selected eigenvectors to determine the number of clusters automatically and to group the data accordingly. Our findings extend the intuitions underlying existing spectral techniques such as spectral clustering and Kernel Principal Components Analysis, and provide new understanding into their usability and modes of failure. Simulation studies and experiments on real world data are conducted to show the potential of our algorithm. In particular, DaSpec is found to handle unbalanced groups and recover clusters of different shapes better than the competing methods.
The difference between aerosol optical depths (AODs) retrieved from the Multi-angle Imaging SpectroRadiometer (MISR) and Moderate Resolution Imaging Spectroradiometer (MODIS) is examined over mainland Southeast Asia from a spatial perspective. Though ideally the differences between these measurement types should be small and randomly distributed over space, our analysis suggested that MISR/MODIS AOD differences have a strong negative relationship with MODIS AODs and tend to be spatially clustered. In this paper, we quantify the spatial dependence in MISR/MODIS AOD differences and explore the extent to which the spatial patterns in these differences can be explained by variables that reflect influence of the environment and human activities. While these variables show a significant relationship with MISR/MODIS AOD differences, the results also suggest further research is needed to fully understand the spatial dependence of these differences.
Biomass burning is a major source of black carbon aerosols. These aerosols have negative human health impacts and can affect the radiation budget and climate directly and indirectly. Uncertainty regarding the contribution of biomass burning to the concentration of aerosols is higher in Southeast Asia than in some other regions of substantial biomass burning because of other sources of pollution such as significant fossil fuel combustion. The slash-and-burn agricultural tradition is still evident in the region. Significant expansion of cash-crop production is also associated with biomass burning, as is the seasonal burning of crop residue. The effects of such land-use processes extend into the atmosphere, and localized events have regional and global implications for air pollution-related health effects and climate. This paper synthesizes the issue of biomass burning and aerosols in the context of land-use practices in Southeast Asia, and makes suggestions of how to use available data sources in an integrated analysis.
Global climate models predict that the strongest dependences of surface air temperatures on increasing atmospheric carbon dioxide levels will occur in the Arctic. A systematic study of these dependences requires accurate arctic-wide measurements, especially of cloud coverage. Thus cloud detection in the Arctic is extremely important, but it is also challenging because clouds and ice- and snow-covered surfaces have similar remote sensing characteristics.
This paper proposes two new operational arctic cloud detection algorithms using Multi-angle Imaging SpectroRadiometer (MISR) imagery. The key idea is to identify cloud-free surface pixels in the imagery, instead of cloudy pixels as in the existing MISR operational algorithms. Through extensive exploratory data analysis and using domain knowledge, three physically useful features have been identified that differentiate surface pixels well from cloudy pixels. The first algorithm, Enhanced Linear Correlation Matching (ELCM), thresholds the features with either fixed or data-adaptive cut-off values. Furthermore, probability labels are obtained by using ELCM labels as training data for Fisher's Quadratic Discriminant Analysis (QDA), leading to the second (ELCM-QDA) algorithm. Both algorithms are automated and computationally efficient for operational processing of the massive MISR data.
Based on five million expert-labeled pixels, ELCM results are significantly better both in terms of accuracy (92%) and coverage (100%) when compared with two MISR operational algorithms, one with an accuracy of 80% and coverage of 27% and the other with an accuracy of 83% and a coverage of 70%. The ELCM-QDA probability prediction is also consistent with the expert labels and more informative. In conclusion, ELCM and ELCM-QDA provide the best performances to date among all available operational algorithms using MISR data.
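The two-stage idea in this abstract (threshold labels first, then QDA trained on those labels to obtain probabilities) can be sketched as follows. This is a generic illustration only: the three synthetic features, the cut-off values of 0.5 and -0.5, and the function names `fit_qda` and `qda_posterior` are assumptions made for the example, and are not the features or thresholds of ELCM.

```python
import numpy as np

def fit_qda(X, y):
    """Estimate a mean, covariance, and prior for each class."""
    params = {}
    for c in np.unique(y):
        Xc = X[y == c]
        params[c] = (Xc.mean(axis=0), np.cov(Xc, rowvar=False), len(Xc) / len(X))
    return params

def qda_posterior(X, params):
    """Posterior class probabilities under class-wise Gaussian densities."""
    log_scores = []
    for c in sorted(params):
        mu, cov, prior = params[c]
        diff = X - mu
        maha = np.einsum("ij,ij->i", diff @ np.linalg.inv(cov), diff)
        _, logdet = np.linalg.slogdet(cov)
        log_scores.append(np.log(prior) - 0.5 * (logdet + maha))
    log_scores = np.column_stack(log_scores)
    log_scores -= log_scores.max(axis=1, keepdims=True)  # numerical stability
    p = np.exp(log_scores)
    return p / p.sum(axis=1, keepdims=True)

# Synthetic pixels with three hypothetical features (1 = clear, 0 = cloud).
rng = np.random.default_rng(2)
n = 2000
truth = rng.integers(0, 2, size=n)
X = rng.normal(loc=np.where(truth[:, None] == 1, 1.0, -1.0), scale=0.5, size=(n, 3))

# Stage 1: conservative thresholds label only the confident pixels.
clear = (X > 0.5).all(axis=1)
cloudy = (X < -0.5).all(axis=1)
labeled = clear | cloudy

# Stage 2: QDA trained on the threshold labels gives a probability
# for every pixel, including those the thresholds left unlabeled.
params = fit_qda(X[labeled], clear[labeled].astype(int))
prob_clear = qda_posterior(X, params)[:, 1]
accuracy = ((prob_clear > 0.5).astype(int) == truth).mean()
```

No expert labels are used in training here; `truth` is consulted only to score the result, which mirrors the role of the expert-labeled pixels in the evaluation.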
In climate models, aerosol forcing is the major source of uncertainty in climate forcing over the industrial period. To reduce this uncertainty, instruments on satellites have been put in place to collect global data. However, missing and noisy observations impose considerable difficulties for scientists researching the global distribution of aerosols, aerosol transportation, and comparisons between satellite observations and global climate model outputs. In this paper, we fit a Spatial Mixed Effects (SME) statistical model to predict the missing values, denoise the observed values, and quantify the spatial-prediction uncertainties. The computations associated with the SME model scale linearly with the number of data points, which makes it feasible to process massive global satellite data. We apply the methodology, which is called Fixed Rank Kriging (FRK), to the level-3 Aerosol Optical Depth (AOD) dataset collected by NASA's Multi-angle Imaging SpectroRadiometer (MISR) instrument flying on the Terra satellite. Overall, our results were superior to those from non-statistical methods and, importantly, FRK has an uncertainty measure associated with it that can be used for comparisons over different regions or at different time points.
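One way to see why the computations can scale linearly in the number of data points is the Sherman-Morrison-Woodbury identity. The sketch below assumes a simplified covariance Sigma = S K S^T + sigma2 I with r basis functions and omits the fine-scale variation term and parameter estimation. The Gaussian basis functions, `K`, `sigma2`, and the function names are hypothetical choices for illustration, not the paper's settings.

```python
import numpy as np

def frk_predict(z, S_obs, S_pred, K, sigma2):
    """Kriging predictor S_pred K S_obs^T Sigma^{-1} z without forming Sigma."""
    # r x r matrix; building it costs O(n r^2).
    A = np.linalg.inv(K) + S_obs.T @ S_obs / sigma2
    # Woodbury: Sigma^{-1} z = z/sigma2 - S A^{-1} S^T z / sigma2^2.
    Sigma_inv_z = z / sigma2 - S_obs @ np.linalg.solve(A, S_obs.T @ z) / sigma2**2
    pred = S_pred @ (K @ (S_obs.T @ Sigma_inv_z))
    # Standard error of the smooth component: A^{-1} is the posterior
    # covariance of the r random effects.
    se = np.sqrt(np.einsum("ij,jk,ik->i", S_pred, np.linalg.inv(A), S_pred))
    return pred, se

def gaussian_basis(x, centers, width):
    """Hypothetical one-dimensional Gaussian basis functions."""
    return np.exp(-0.5 * ((x[:, None] - centers[None, :]) / width) ** 2)

rng = np.random.default_rng(1)
centers = np.linspace(0.0, 1.0, 8)
x_obs = rng.random(2000)                 # observed locations
x_new = np.linspace(0.0, 1.0, 50)        # prediction locations
S_obs = gaussian_basis(x_obs, centers, 0.15)
S_new = gaussian_basis(x_new, centers, 0.15)
K = np.eye(len(centers))
sigma2 = 0.1
z = S_obs @ rng.normal(size=len(centers)) + rng.normal(scale=np.sqrt(sigma2), size=x_obs.size)

pred, se = frk_predict(z, S_obs, S_new, K, sigma2)
```

Only an 8 x 8 system is solved here, although there are 2000 observations; a direct solve with the 2000 x 2000 covariance matrix gives the same predictor at much higher cost.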
Expert labels were used to evaluate arctic cloud detection accuracies of several methods based on Multi-angle Imaging SpectroRadiometer (MISR) angular radiances and Moderate Resolution Imaging Spectroradiometer (MODIS) spectral radiances. The accuracy of cloud detections was evaluated relative to 5.086 million expert labels applied to 7.114 million 1.1-km resolution pixels with valid radiances from 57 scenes. The accuracy of the MODIS operational cloud mask was 90.72% for the 32 partly cloudy scenes and 93.37% for the 25 completely clear and overcast scenes. An automated, simple threshold algorithm based on three features extracted from MISR radiances and the MODIS operational cloud mask agreed with each other for 74.91% of the pixels in the 32 partly cloudy scenes and 78.44% of the pixels in the 25 completely clear and overcast scenes. These subsets of pixels had, relative to the expert labels, classification accuracies of 96.53% for the 32 partly cloudy scenes and 99.05% for the 25 clear and overcast scenes. Fisher's quadratic discriminant analysis (QDA) trained on expert labels from the 32 partly cloudy scenes was applied to MISR radiances, three features based on MISR radiances, MODIS radiances, and the six features of the MODIS operational cloud mask. The resulting classification accuracies were 87.51%, 88.45%, 96.43%, and 95.61%, respectively. The accuracies increased to 96.98% (96.71%) when QDA with expert labels was applied to combined radiances (features) from both MISR and MODIS. Providing a sufficient number of expert labels for characterization of rapidly changing arctic cloud and surface conditions is a daunting task. To process data operationally, a second group of classifiers, also QDA-based, used as training labels those pixels for which the MISR automated, simple threshold and MODIS operational cloud mask algorithms agreed.
Training the QDA classifier on these automatic labels using MISR radiances, three features based on MISR radiances, MODIS radiances, and the six features of the MODIS operational cloud mask led to accuracies of 85.23%, 88.05%, 93.62%, and 93.55%, respectively, for the 32 partly cloudy scenes. For combined radiances (features) from both MISR and MODIS, accuracies are 93.74% (93.40%) for the 32 partly cloudy scenes. A scheme that combines training a QDA classifier with MISR and MODIS automatic labels for the 32 partly cloudy scenes and thresholding of three MISR features for classification (with 95.39% accuracy) of the 25 completely clear and overcast scenes produced an accuracy of 94.51% for the 57 scenes, the highest classification rate of any automated procedure that was tested in the study. These results suggest that both MISR and MODIS radiances are sufficient for cloud detection in daytime polar regions.
Gaussian kernel regularization is widely used in the machine learning literature and has proved successful in many empirical experiments. The periodic version of Gaussian kernel regularization has been shown to be minimax rate optimal in estimating functions in any finite order Sobolev space. However, for a data set with n points, the computational complexity of the Gaussian kernel regularization method is of order O(n^3). In this paper we propose to use binning to reduce the computation of Gaussian kernel regularization in both regression and classification. For periodic Gaussian kernel regression, we show that the binned estimator achieves the same minimax rates as the unbinned estimator, but the computation is reduced to O(m^3) with m as the number of bins. To achieve the minimax rate in the kth order Sobolev space, m needs to be of order O(kn^{1/(2k+1)}), which makes the binned estimator computation of order O(n) for k = 1, and even less for larger k. Our simulations show that the binned estimator (binning 120 data points into 20 bins in our simulation) provides almost the same accuracy with only 0.4% of the computation time. For classification, binning with L2-loss Gaussian kernel regularization and Gaussian kernel Support Vector Machines is tested in a polar cloud detection problem.
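The binning idea can be sketched for regression as follows. This is a minimal illustration under stated assumptions: it uses an ordinary (non-periodic) Gaussian kernel, equal-width bins on [0, 1], unweighted bin averages, and the same hypothetical bandwidth and penalty (both 0.1) for the binned and unbinned fits. The sample size of 120 and the 20 bins match the simulation setting mentioned in the abstract; everything else, including the function names, is an assumption of the example rather than the paper's estimator.

```python
import numpy as np

def gaussian_kernel(a, b, bandwidth):
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / bandwidth) ** 2)

def kernel_ridge_fit(x, y, bandwidth, lam):
    """Solve (K + lam I) alpha = y; cost is cubic in len(x)."""
    K = gaussian_kernel(x, x, bandwidth)
    return np.linalg.solve(K + lam * np.eye(len(x)), y)

def bin_data(x, y, m):
    """Average the responses within m equal-width bins on [0, 1]."""
    edges = np.linspace(0.0, 1.0, m + 1)
    idx = np.clip(np.digitize(x, edges) - 1, 0, m - 1)
    counts = np.bincount(idx, minlength=m)
    sums = np.bincount(idx, weights=y, minlength=m)
    keep = counts > 0                      # drop empty bins
    centers = 0.5 * (edges[:-1] + edges[1:])
    return centers[keep], sums[keep] / counts[keep]

rng = np.random.default_rng(3)
n, m = 120, 20
x = rng.random(n)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=n)
grid = np.linspace(0.0, 1.0, 200)
truth = np.sin(2 * np.pi * grid)

# Unbinned fit: an n x n linear system.
alpha_full = kernel_ridge_fit(x, y, 0.1, 0.1)
fit_full = gaussian_kernel(grid, x, 0.1) @ alpha_full

# Binned fit: at most an m x m linear system.
xb, yb = bin_data(x, y, m)
alpha_bin = kernel_ridge_fit(xb, yb, 0.1, 0.1)
fit_bin = gaussian_kernel(grid, xb, 0.1) @ alpha_bin

rmse_full = np.sqrt(np.mean((fit_full - truth) ** 2))
rmse_bin = np.sqrt(np.mean((fit_bin - truth) ** 2))
```

Comparing `rmse_bin` with `rmse_full` on different seeds gives a rough feel for how little accuracy is lost when the linear system shrinks from 120 to 20 unknowns.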
In this paper we develop a spectral framework for estimating mixture distributions, specifically Gaussian mixture models. In physics, spectroscopy is often used for the identification of substances through their spectrum. Treating a kernel function K(x,y) as "light" and the sampled data as "substance", the spectrum of their interaction (eigenvalues and eigenvectors of the kernel matrix K) unveils certain aspects of the underlying parametric distribution p, such as the parameters of a Gaussian mixture. Our approach extends the intuitions and analyses underlying the existing spectral techniques, such as spectral clustering and Kernel Principal Components Analysis (KPCA).
We construct algorithms to estimate parameters of Gaussian mixture models, including the number of mixture components, their means and covariance matrices, which are important in many practical applications. We provide a theoretical framework and show encouraging experimental results.
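A small experiment illustrates how the spectrum of a kernel matrix can reveal mixture structure. The sketch below draws data from a well-separated two-component Gaussian mixture, forms a Gaussian kernel matrix, and counts the leading eigenvectors that have no sign change beyond a small tolerance. That counting rule, the tolerance of 1e-3, the bandwidth, and the choice to inspect four eigenvectors are assumptions made for illustration and are not necessarily the selection rule used in the paper.

```python
import numpy as np

rng = np.random.default_rng(4)
# Two well-separated components with unequal sizes.
x = np.concatenate([rng.normal(0.0, 1.0, 120), rng.normal(10.0, 1.0, 80)])
labels = np.repeat([0, 1], [120, 80])

# Gaussian kernel matrix of the sample.
bandwidth = 1.0
K = np.exp(-0.5 * ((x[:, None] - x[None, :]) / bandwidth) ** 2)

# eigh returns eigenvalues in ascending order; reorder to descending.
eigvals, eigvecs = np.linalg.eigh(K)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

def no_sign_change(v, tol=1e-3):
    """True if v has no entries above tol and below -tol at the same time."""
    return not ((v > tol).any() and (v < -tol).any())

# Inspect a few leading eigenvectors and keep those without sign changes.
top = eigvecs[:, :4]
selected = [j for j in range(top.shape[1]) if no_sign_change(top[:, j])]
n_components = len(selected)

# Assign each point to the selected eigenvector with the largest magnitude.
assign = np.abs(top[:, selected]).argmax(axis=1)
agreement = (assign == labels).mean()
```

With components this far apart the kernel matrix is nearly block diagonal, so each selected eigenvector is concentrated on one component; with overlapping components the picture is less clean and the tolerance matters more.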
In climate models, aerosol forcing is the major source of uncertainty in climate forcing over the industrial period. To reduce this uncertainty, instruments on satellites have been put in place to collect global data. However, missing and noisy observations impose considerable difficulties for scientists researching global aerosol distribution, aerosol transportation, and comparisons between satellite observations and global-climate-model outputs. In this paper, we propose a Spatial Mixed Effects (SME) statistical model to predict the missing values, denoise the observed values, and quantify the spatial-prediction uncertainties. The computations associated with the SME model scale linearly with the number of data points, which makes it feasible to process massive global satellite data. We apply our proposed methodology, which we call Fixed Rank Kriging (FRK), to the level-3 Aerosol Optical Depth dataset collected by NASA's Multi-angle Imaging SpectroRadiometer (MISR) instrument flying on the Terra satellite. Overall, our results were superior to those from nonstatistical methods and, importantly, FRK has an uncertainty measure associated with it.
Clouds play a major role in controlling Earth's climate, and cloud detection is a crucial step in Numerical Weather Prediction and Global Climate Models. Multi-angle Imaging SpectroRadiometer (MISR) and Moderate Resolution Imaging Spectroradiometer (MODIS) were launched in 1999 by NASA to provide multi-angle and hyper-spectral data to detect clouds. However, cloud detection algorithms using either MISR or MODIS data separately do not take full advantage of the data collected by both sensors.
In this paper, we propose and test two schemes to combine MISR and MODIS data for cloud detection in polar regions. Both schemes are follow-ups of a two-step polar cloud detection algorithm using MISR data: Enhanced Linear Correlation Matching Classification followed by Quadratic Discriminant Analysis (ELCMC-QDA). The first scheme maps the MODIS cloud detection results to the MISR grid using a nearest neighbor method, then reports only the pixels on which the ELCMC-QDA results (from MISR) and the MODIS operational results agree. This scheme improves the classification accuracy, but reduces the coverage of the results. Instead of combining the MISR and MODIS results directly, the second scheme uses the pixels on which the ELCMC results and the MODIS operational results agree as the training data for the QDA on MISR features, and outputs the results from the QDA. Both schemes are tested over a region where expert labels show that neither the MISR nor the MODIS operational algorithm works well (according to expert labels, 53% and 12.72% misclassification rates for the MISR and MODIS operational algorithms, respectively). The first scheme has an error rate of only 0.72%, but leaves 68.72% of pixels unclassified. The second scheme reaches a 2.93% misclassification rate, which is smaller than the 4.09% rate from ELCMC-QDA, and it provides full coverage. Hence we propose using QDA on the pixels where ELCMC and MODIS agree as an algorithm to fuse the MISR and MODIS information for polar cloud detection.
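The trade-off in the first scheme between accuracy and coverage can be made concrete with a short sketch. It assumes the two label maps are already on a common grid (the nearest neighbor regridding step is not shown), and the function names, the marker -1 for unclassified pixels, and the tiny label arrays are hypothetical.

```python
import numpy as np

def fuse_by_agreement(labels_a, labels_b, unclassified=-1):
    """Keep a label only where both classifiers agree."""
    labels_a = np.asarray(labels_a)
    labels_b = np.asarray(labels_b)
    return np.where(labels_a == labels_b, labels_a, unclassified)

def coverage_and_error(fused, truth, unclassified=-1):
    """Fraction of pixels classified, and error rate among those pixels."""
    fused = np.asarray(fused)
    truth = np.asarray(truth)
    covered = fused != unclassified
    coverage = covered.mean()
    error = (fused[covered] != truth[covered]).mean() if covered.any() else float("nan")
    return coverage, error

# Toy example with 8 pixels (1 = cloud, 0 = clear).
truth  = [1, 1, 0, 0, 1, 0, 1, 0]
misr   = [1, 0, 0, 0, 1, 1, 1, 0]   # hypothetical MISR-based labels
modis  = [1, 1, 0, 1, 1, 1, 1, 0]   # hypothetical MODIS-based labels

fused = fuse_by_agreement(misr, modis)
coverage, error = coverage_and_error(fused, truth)
print(fused.tolist())   # [1, -1, 0, -1, 1, 1, 1, 0]
print(coverage)         # 0.75
```

In the second scheme the agreed pixels would instead serve as training labels for a classifier that is then applied to every pixel, which restores full coverage.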