Tao Shi's Research
With the explosion of information technology, many scientists are confronted with problems of analyzing and modeling extremely large datasets with complicated structures. To organize, summarize, transmit, visualize, and analyze these massive datasets, new statistical methods and efficient algorithms need to be developed. My current research interests include machine learning, statistical methodology and computation on massive data, and statistical applications in atmospheric science and geoscience.
1. Machine Learning and Data Mining
Machine learning algorithms and data mining techniques have been widely used and proved successful in many practical problems. Analyzing statistical properties of these methods has been a hot research area in Statistics. In my current research with Professor Mikhail Belkin (CSE, OSU), Professor Yoonkyung Lee (Statistics, OSU) and Professor Bin Yu (Statistics, U.C. Berkeley), we study properties and computation of spectral algorithms, such as spectral estimation of mixture models, spectral clustering algorithms, semi-supervised learning algorithms and Support Vector Machines.
1.1. Data Spectroscopic clustering (DaSpec)
This research focuses on obtaining clustering information in a distribution when i.i.d. data are given. First, we develop theoretical results for understanding and using clustering information contained in the eigenvectors of data adjacency matrices based on a radial kernel function (with a sufficiently fast tail decay). We provide population analyses to give insights into which eigenvectors should be used and when the clustering information for the distribution can be recovered from the data. In particular, we find that the top eigenvectors do not contain all the clustering information. Second, we use heuristics from these analyses to design the Data Spectroscopic clustering (DaSpec) algorithm, which uses properly selected top eigenvectors, determines the number of clusters, gives data labels, and provides a classification rule for future data, all based on only one eigendecomposition. Our findings not only extend and go beyond the intuitions underlying existing spectral techniques (e.g. spectral clustering and Kernel Principal Components Analysis), but also provide insights about their usability and modes of failure. Simulation studies and experiments on real-world data are conducted to show the promise of our proposed Data Spectroscopic clustering algorithm relative to k-means and one spectral method (Ng et al.). In particular, DaSpec seems to be able to handle unbalanced groups and recover clusters of different shapes better than competing methods.
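As a rough illustration of how clustering information can sit in eigenvectors of a Gaussian kernel matrix, and not only in the leading ones, the sketch below builds a toy one-dimensional data set with two unbalanced groups and keeps the eigenvectors whose entries do not change sign. This is a simplified sketch under stated assumptions, not the DaSpec implementation: the selection rule (no sign change up to a tolerance), the toy data, the bandwidth `sigma`, the tolerance `eps`, the cutoff `n_top` and all function names are hypothetical choices made for illustration.

```python
import numpy as np


def gaussian_kernel_matrix(x, sigma):
    """Return K[i, j] = exp(-(x_i - x_j)^2 / (2 sigma^2)) for 1-D data x."""
    diff = x[:, None] - x[None, :]
    return np.exp(-(diff ** 2) / (2.0 * sigma ** 2))


def no_sign_change(v, eps):
    """True if all entries of v larger than eps in magnitude share one sign."""
    big = v[np.abs(v) > eps]
    return bool(np.all(big > 0) or np.all(big < 0))


def spectral_labels(x, sigma=1.0, n_top=10, eps=1e-6):
    """Label points with the sign-consistent eigenvectors among the top n_top."""
    K = gaussian_kernel_matrix(x, sigma)
    eigvals, eigvecs = np.linalg.eigh(K)           # one eigendecomposition
    order = np.argsort(eigvals)[::-1][:n_top]      # largest eigenvalues first
    kept_ranks = [r for r, j in enumerate(order)
                  if no_sign_change(eigvecs[:, j], eps)]
    # Each point joins the kept eigenvector on which it has the largest weight.
    weights = np.abs(eigvecs[:, order[kept_ranks]])
    return weights.argmax(axis=1), kept_ranks


# Toy data (hypothetical): a large group of 40 points and a small group of 5.
x = np.concatenate([np.linspace(-1.0, 1.0, 40), np.linspace(9.0, 10.0, 5)])
labels, kept_ranks = spectral_labels(x)
print("ranks of kept eigenvectors:", kept_ranks)
print("group sizes:", np.bincount(labels).tolist())
```

In this toy setting the number of kept eigenvectors plays the role of the number of clusters, and the eigenvector belonging to the small group should not be the second-largest one, because the large group contributes more than one large eigenvalue.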
1.2. Spectroscopic Estimation of Gaussian Mixture Models
Mixture models, particularly Gaussian mixture models, are a widely used modeling tool with applications ranging from speech recognition to image segmentation and data compression. In scientific practice, researchers fit mixture models to the data in order to extract meaningful structure, which is encoded in the locations and other parameters of the mixture components. However, fitting these models is a difficult computational problem, and it is generally not computationally feasible to find the optimal solution. Various heuristic algorithms, notably Expectation Maximization (EM) with k-means initialization, are commonly used in practice. Typically these methods do not automatically find the correct number of components and are susceptible to local maxima.
In our research, we develop algorithms for estimating parameters of Gaussian mixture models, based on ideas from the spectral theory of operators. These algorithms differ from previous work in that they use eigenfunctions and eigenvalues of certain integral operators associated with the data to provide explicit estimates for the parameters of the mixture model, including the number of components. Moreover, the resulting optimization problems do not have local maxima. Our results show that these algorithms are computationally efficient and provide good estimates of model parameters. We investigate the theoretical guarantees of the estimation accuracy by analyzing perturbation of the spectra of the constructed operators.
Related publications:
Data Spectroscopy: Learning Mixture Models with Eigenspaces of Convolution Operators
Shi, T., Belkin, M., and Yu, B. (2008)
Proceedings of the 25th International Conference on Machine Learning (ICML 2008)
Data Spectroscopy: Eigenspaces of Convolution Operators and Clustering
Shi, T., Belkin, M. and Yu, B. (2008) [ software ]
Preprint 812, Department of Statistics, The Ohio State University
2. Statistical Methodology and Computation on Massive Data
Computation involving large matrix manipulation is usually the bottleneck of many statistical methods, such as kernel methods in non-parametric statistics and machine learning, and kriging in spatial statistics. The computational complexity of inverting an n by n matrix is O(n^3), which prevents us from handling large data sets. I currently work with Professor Noel Cressie (Statistics, OSU) and Professor Bin Yu (Statistics, U.C. Berkeley) on developing computational methods applicable to large data sets.
2.1. Statistical Modelling and Computation on Massive Spatial Data
Kriging, or spatial best linear unbiased prediction (spatial BLUP), has become very popular in the earth and environmental sciences, where it is sometimes known as optimum interpolation. Kriging methodology is able to produce maps of optimal predictions and associated prediction standard errors from incomplete and noisy spatial data (Cressie 1993, Ch. 3). However, solving the kriging equations directly involves inversion of an n x n covariance matrix Σ, and n data may require O(n^3) computations to obtain Σ^(-1). Under these circumstances, straightforward kriging on global MISR Aerosol Optical Depth (AOD) data is impossible. I am working with Professor Noel Cressie (Statistics, OSU) on methods to carry out kriging on massive data sets. In Shi and Cressie (2007), we look for classes of covariance functions for which kriging can be done exactly. We use a spatial covariance function based on what we call a Spatial Random Effects (SRE) model, which leads to a Spatial Mixed Effects (SME) model for the data process. In our application, we use multi-resolution basis functions to capture the spatial dependence in the data. The kriging computations that follow from the SME model can be carried out using Fixed Rank Kriging (FRK), proposed in Cressie and Johannesson (2006). It was shown there that FRK scales linearly in the number of data, so it is capable of handling the very large global datasets collected by NASA's satellite instruments.
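To illustrate why a fixed-rank covariance makes the inversion cheap, here is a minimal sketch that assumes the covariance has the low-rank-plus-diagonal form Σ = S K S^T + σ² I, with S an n x r matrix of basis-function values and K an r x r matrix. Under that assumption the Sherman-Morrison-Woodbury identity needs only an r x r inverse. The matrix sizes, the random S and K, and the function name `woodbury_inverse` are illustrative assumptions; this is not the FRK code of Cressie and Johannesson.

```python
import numpy as np


def woodbury_inverse(S, K, sigma2):
    """Invert S K S^T + sigma2 * I using only an r x r inverse (r = S.shape[1])."""
    n, r = S.shape
    # r x r matrix: K^{-1} + S^T S / sigma2
    small = np.linalg.inv(K) + S.T @ S / sigma2
    # Sigma^{-1} = I / sigma2 - S small^{-1} S^T / sigma2^2
    return np.eye(n) / sigma2 - S @ np.linalg.solve(small, S.T) / sigma2 ** 2


rng = np.random.default_rng(0)
n, r, sigma2 = 200, 5, 0.5
S = rng.normal(size=(n, r))          # hypothetical basis-function matrix
A = rng.normal(size=(r, r))
K = A @ A.T + np.eye(r)              # symmetric positive definite r x r matrix

Sigma = S @ K @ S.T + sigma2 * np.eye(n)
direct = np.linalg.inv(Sigma)        # O(n^3) reference computation
fast = woodbury_inverse(S, K, sigma2)
max_abs_diff = float(np.max(np.abs(direct - fast)))
print("largest entrywise difference:", max_abs_diff)
```

The sketch forms the full n x n inverse only so that it can be compared with the direct computation. In a prediction setting one would apply the same identity to vectors, which keeps the work linear in n for fixed r.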
2.2. Binning in Gaussian Kernel Regularization
Gaussian kernel regularization is widely used in the machine learning literature and has proved successful in many empirical experiments (Wahba 1990, Smola et al. 1998, and Lin and Brown 2004). The periodic version of Gaussian kernel regularization has been shown (Lin and Brown 2004) to be minimax rate optimal in estimating functions in any finite order Sobolev space. However, for a data set with n points, the computational complexity of the Gaussian kernel regularization method is of order O(n^3).
In Shi and Yu (2006), we propose to use binning to reduce the computation of Gaussian kernel regularization in both regression and classification. For periodic Gaussian kernel regression, we show that the binned estimator achieves the same minimax rates as the unbinned estimator, but the computation is reduced to O(m^3), with m the number of bins. To achieve the minimax rate in the kth order Sobolev space, m needs to be of order O(k n^(1/(2k+1))), which makes the binned estimator computation of order O(n) for k = 1, and even less for larger k. Our simulations show that the binned estimator (binning 120 data points into 20 bins in our simulation) provides almost the same accuracy with only 0.4% of the computation time. For classification, binning with L2-loss Gaussian kernel regularization and Gaussian kernel Support Vector Machines is tested in a polar cloud detection problem, and binned SVM provides the best results with the least computational burden.
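The cost comparison above can be checked with simple arithmetic. The sketch below bins a toy data set of 120 points into 20 equal-width bins, summarises each bin by its centre and mean response, and compares the cubic operation counts of an m x m and an n x n solve. The toy design, the toy responses and the function name `bin_data` are hypothetical, and summarising a bin by its centre and mean is an assumption about the binning step, not a statement of the exact estimator in Shi and Yu (2006).

```python
def bin_data(xs, ys, m, lo=0.0, hi=1.0):
    """Average ys within m equal-width bins on [lo, hi]; return centres, means, counts."""
    width = (hi - lo) / m
    sums = [0.0] * m
    counts = [0] * m
    for x, y in zip(xs, ys):
        j = min(int((x - lo) / width), m - 1)   # clamp x == hi into the last bin
        sums[j] += y
        counts[j] += 1
    centres = [lo + (j + 0.5) * width for j in range(m)]
    means = [s / c if c else float("nan") for s, c in zip(sums, counts)]
    return centres, means, counts


n, m = 120, 20
xs = [(i + 0.5) / n for i in range(n)]      # toy design: equally spaced points
ys = [x * (1.0 - x) for x in xs]            # toy responses (hypothetical)
centres, means, counts = bin_data(xs, ys, m)

cost_ratio = (m / n) ** 3                   # cubic cost of m x m versus n x n solve
print(f"{m} bins, {counts[0]} points per bin, cost ratio = {cost_ratio:.4f}")
```

With n = 120 and m = 20 the ratio is (1/6)^3, a little under half a percent, which is of the same order as the reduction in computation time reported above.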
Related publications:
Global Statistical Analysis of MISR Aerosol Data: A Massive Data Product from NASA's TERRA Satellite
Shi, T., and Cressie N. (2007)
Environmetrics, 18, 665-680
Data Mining of MISR Aerosol Product using Spatial Statistics
Shi, T., and Cressie, N. (2007)
Proceedings of the 2007 IEEE Symposium on Computational Intelligence and Data Mining, 712-719
Satellite Data: Massive but Sparse
Cressie, N., Shi, T., and Johannesson, G. (2006)
Proceedings of the 2nd NASA Data Mining Workshop
Binning in Gaussian Kernel Regularization
Shi, T. and Yu, B. (2006)
Statistica Sinica 16, 541-567
3. Statistical Applications in Atmospheric Science and Geoscience
Climate change refers to any significant change in measures of climate (such as temperature, precipitation or wind) lasting for an extended period (decades or longer). According to the US Environmental Protection Agency, "As through much of its history, the Earth's climate is changing. Right now it is getting warmer. Most of the warming in recent decades is very likely the result of human activities (IPCC 2007)". To study the natural variability and the effect of human activities, global climate data have been collected and climate models have been developed. New statistical methodology and computational algorithms are needed to analyze and model these massive and complicated data. Collaborating with geoscientists and statisticians, I am involved in several projects:
3.1. Polar Cloud Detection using EOS Data
NASA's Earth Observing System (EOS) is designed for studying the Earth from space using a multiple-instrument, multiple-satellite approach. The EOS program is critical for improving our scientific understanding of ongoing natural and human-induced global climate change (such as global warming) and providing a sound scientific basis for developing global environmental policies. A key to predicting climate change is to observe and understand the global distribution of clouds, their physical properties (such as thickness and droplet size), and their relationship to regional and global climates. The Multi-angle Imaging SpectroRadiometer (MISR) and the Moderate Resolution Imaging Spectroradiometer (MODIS), on the first EOS satellite TERRA, were launched in 1999 to provide scientists with data for global cloud research. However, clouds above snow- and ice-covered surfaces over polar regions are especially difficult to detect because their temperature and reflectivity are similar to those of the surface. I collaborated with MISR science team members Dr. Amy Braverman (Jet Propulsion Laboratory, NASA) and Professor Eugene Clothiaux (Department of Meteorology, PSU) on building computationally efficient polar cloud detection algorithms for TERRA data.
Daytime Polar Cloud Detection Using MISR Data
The massive data size (389.22 Megabytes of raw data per minute) and the lack of ground truth in polar regions are major obstacles to building and validating online cloud detection algorithms. Based on his expertise, Professor Clothiaux labeled around 5 million pixels of MISR data collected over Greenland during summer 2002, and these highly valuable validation data provide us with a solid base for building and testing new algorithms. Drawing on physical knowledge, we constructed three features using the multi-angle data provided by MISR (Shi et al. 2008, 2004, 2002). Since no expert labels are available in the online data processing, we built a threshold algorithm on these features with fixed or data-driven thresholds. We call it the Enhanced Linear Correlation Matching (ELCM) algorithm. When compared against the expert labels, the ELCM algorithm provides a much higher agreement rate and better coverage than the MISR operational algorithms. Moreover, the ELCM results provide labels for training Quadratic Discriminant Analysis (QDA) to report a probability of cloudiness for each pixel. As a result of our research, the MISR science team is working to implement the ELCM algorithm in the online data processing and to make the results part of the MISR data products available to scientists.
Improving Polar Cloud Detection by Fusing MISR and MODIS Data
Fusing the data from different instruments is one of NASA's high priorities in the multiple-instrument and multiple-satellite EOS system, since the combined information of two or more sources of complementary data can validate and improve the results from each part. The Moderate Resolution Imaging Spectroradiometer (MODIS) is a hyper-spectral instrument and it provides a wider spectral coverage than MISR. Since MISR and MODIS provide concurrent ground coverage, fusing the multi-angle and hyper-spectral data from these two instruments should improve the cloud detection accuracy over polar regions. In Shi et al. (2007, 2006, 2004), we propose methods to improve polar cloud detection by fusing MISR and MODIS data. When compared with expert labels, the pixels on which the MISR ELCM results and the MODIS operational cloud mask agree are highly accurate. Therefore, those pixels may serve as a good source of accurate labels for training classifiers. A Quadratic Discriminant Analysis (QDA) classifier is trained on all MISR and MODIS features using the agreed pixels, and the classifier is then applied to the full data. The QDA classifier provides an error rate much lower than those obtained using either MISR or MODIS data alone.
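The train-on-agreement idea described above can be sketched on synthetic data. In the example below, two hypothetical single-feature threshold rules stand in for the two cloud masks, a small hand-written QDA is fitted only on the pixels where the rules agree, and the fitted classifier is then applied to every pixel. Everything here is an assumption made for illustration: the two Gaussian features, the threshold 1.25, the sample size and the function names do not come from the MISR or MODIS processing chains.

```python
import numpy as np


def fit_qda(X, y):
    """Estimate per-class mean, covariance and prior for classes 0 and 1."""
    params = []
    for c in (0, 1):
        Xc = X[y == c]
        params.append((Xc.mean(axis=0), np.cov(Xc, rowvar=False), len(Xc) / len(X)))
    return params


def predict_qda(params, X):
    """Assign each row of X to the class with the larger Gaussian log-posterior."""
    scores = []
    for mean, cov, prior in params:
        diff = X - mean
        maha = np.einsum("ij,ij->i", diff @ np.linalg.inv(cov), diff)
        scores.append(-0.5 * np.linalg.slogdet(cov)[1] - 0.5 * maha + np.log(prior))
    return np.argmax(np.vstack(scores), axis=0)


rng = np.random.default_rng(0)
n = 2000
truth = np.repeat([0, 1], n)                       # 0 = clear, 1 = cloudy (synthetic)
X = rng.normal(size=(2 * n, 2)) + 2.5 * truth[:, None]

# Two hypothetical single-feature threshold rules standing in for the two masks.
mask_a = (X[:, 0] > 1.25).astype(int)
mask_b = (X[:, 1] > 1.25).astype(int)

agree = mask_a == mask_b                           # pixels where both rules agree
params = fit_qda(X[agree], mask_a[agree])          # train on agreed labels only
fused = predict_qda(params, X)                     # then classify every pixel

err_a = float(np.mean(mask_a != truth))
err_b = float(np.mean(mask_b != truth))
err_fused = float(np.mean(fused != truth))
print(f"rule A: {err_a:.3f}, rule B: {err_b:.3f}, fused QDA: {err_fused:.3f}")
```

The agreed pixels are a biased sample of each class, so the fitted Gaussians are only approximations. The sketch is meant to show the workflow, not to reproduce the reported error rates.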
Related publications:
Arctic Clouds Detection using MISR Data with Case Studies
Shi, T., Yu, B., Clothiaux, E. E., and Braverman, A. J. (2008)
Journal of the American Statistical Association 103 (482), 584-593
Detection of Daytime Arctic Clouds using MISR and MODIS Data
Shi, T., Clothiaux, E. E., Yu, B., Braverman, A. J. and Groff, G. N. (2007)
Remote Sensing of Environment 107, 172-184
Polar Cloud Detection using MISR and MODIS Data
Shi, T., Yu, B., Clothiaux, E. E., and Braverman, A. J. (2006)
Proceedings of the 2nd NASA Data Mining Workshop
Fusing MISR and MODIS Information for Polar Cloud Detection
Shi, T., Yu, B., Clothiaux, E. E., and Braverman, A. J. (2004)
Proceedings of the 38th Asilomar Conference on Signals, Systems, and Computers, vol. 2, 1705-1709, IEEE Press
MISR Cloud/Ice Classification Using Linear Correlation Matching
Shi, T., Yu, B., and Braverman, A. J. (2002)
Technical Report #630, Department of Statistics, U.C. Berkeley
3.2. FLAMES
The Fire-Land-Atmosphere Modeling and Evaluation for Southeast Asia (FLAMES) project is a collaboration between researchers in the Departments of Geography and Statistics at The Ohio State University. The project is funded by NASA's Research Opportunities for Space and Earth Science (ROSES-2005 Award #NNG06GD31G) as part of the Land-Cover/Land-Use Change Program and is endorsed by the Global Land Project, a joint research agenda of the International Human Dimensions Programme (IHDP) and the International Geosphere-Biosphere Programme (IGBP).
This project focuses on studying the generation and transport of aerosols produced by forest fires and pollution. The tools we use include Bayesian statistical modelling, global and local atmospheric transport models, satellite measurements from MISR and MODIS, climate and weather records, and ground measurements. The results from this study will enhance our understanding of the patterns of aerosol transport and provide more accurate information for studying the connection between aerosols and the effects of fires and pollution on our atmosphere.
Related publications:
Spatial Characteristics of the Difference between MISR and MODIS Aerosol Optical Depth Retrievals over Mainland Southeast Asia
Xiao, N., Shi, T., Calder, C. A., Munroe, D. K., Berrett, C., Wolfinbarger, S., and Li, D. (2008)
To appear in Remote Sensing of Environment
The Relationships Between Biomass Burning, Land-cover/use Change, and the Distribution of Carbonaceous Aerosols in Mainland Southeast Asia: A Review and Synthesis
Munroe, D. K., Wolfinbarger, S. R., Calder, C. A., Shi, T., Xiao, N., Lam, C. Q. and Li, D. (2008)
To appear in Journal of Land Use Science
Fire-land-atmosphere Modeling and Evaluation for Southeast Asia
Munroe, D.K., Xiao, N., Calder, C.A., and Shi, T. (2008)
Newsletter of the Global Land Project. Issue No. 3. January, 2008