Statistical Learning comes in two forms: supervised and unsupervised.
The supervised form starts with a training set that contains both
explanatory variables and a response variable; the goal is to learn
enough about the association to predict the response for a new "test
case" of the explanatory variables. Perhaps the simplest example of
this is linear regression, where the form of the association and the
distribution of the response are specified, and the only thing to be
learned is the set of coefficients relating the response to the
explanatory variables. Dozens of more general techniques have been
developed, and a general theory has emerged, based on a principle
that the complexity of the explanation should be limited by the
amount of training data.
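
As a minimal sketch of the supervised setting described above, the following fits a one-variable linear regression by ordinary least squares and predicts the response for a new test case. The data values and the function names `fit_line` and `predict` are illustrative assumptions, not taken from the talk.

```python
# Simple linear regression: learn an intercept and slope from a training
# set, then predict the response for a new test case.
# All data below are made up for illustration.

def fit_line(xs, ys):
    """Return (intercept, slope) minimizing the sum of squared errors."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

def predict(coeffs, x_new):
    """Apply the learned coefficients to a new explanatory value."""
    intercept, slope = coeffs
    return intercept + slope * x_new

# Training set: explanatory variable and response (hypothetical values).
train_x = [1.0, 2.0, 3.0, 4.0]
train_y = [3.0, 5.0, 7.0, 9.0]

coeffs = fit_line(train_x, train_y)
test_prediction = predict(coeffs, 5.0)  # response for the new test case
print(coeffs, test_prediction)
```

Here the only quantities learned are the two coefficients; the linear form of the association is fixed in advance, which is what makes this the simplest case.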
The unsupervised form is best known to statisticians through the
method of clustering. Here all variables are explanatory and the
goal is to infer natural groupings. An example is to learn about
"functional" groupings of genes from their patterns of expression in
a target class of cells.
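
To make the idea of inferring groupings concrete, here is a small sketch that clusters hypothetical gene expression profiles with plain k-means. This is a stand-in technique chosen for brevity, not COSA or any method specific to the talk; the profiles, the `kmeans` function, and the choice of two clusters are all assumptions.

```python
import math

def kmeans(points, k, iterations=20):
    """Basic k-means; centers start at the first k points (deterministic)."""
    centers = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assign each point to its nearest center (Euclidean distance).
        labels = [
            min(range(k), key=lambda c: math.dist(p, centers[c]))
            for p in points
        ]
        # Move each center to the mean of the points assigned to it.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Hypothetical expression levels of four genes across three conditions.
profiles = [
    [0.1, 0.2, 0.1],
    [5.0, 5.1, 4.9],
    [0.2, 0.1, 0.3],
    [4.8, 5.2, 5.0],
]

labels = kmeans(profiles, k=2)
print(labels)
```

No response variable appears anywhere: the grouping is inferred from the explanatory variables alone.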
The intention of this talk is to introduce some modern learning
methods, with an emphasis on SVM (Support Vector Machines) as a
supervised learning method, and COSA (Clustering of Objects using
Subsets of Attributes) as an unsupervised method. Both methods will
be discussed in the context of constructing new drugs that may
operate through unknown pathways.