SMHS BigDataBigSci CrossVal LDA QDA

Big Data Science and Cross Validation - Foundation of LDA and QDA for prediction, dimensionality reduction or forecasting

===Summary===

Both LDA (Linear Discriminant Analysis) and QDA (Quadratic Discriminant Analysis) use probabilistic models of the class conditional distribution of the data \(P(X|Y=k)\) for each class \(k\). Their predictions are obtained by applying [http://wiki.socr.umich.edu/index.php/SMHS_BayesianInference#Bayesian_Rule Bayes' theorem]:

\begin{equation}
P(Y=k|X)=\frac{P(X|Y=k)P(Y=k)}{P(X)}=\frac{P(X|Y=k)P(Y=k)}{\sum_{l}P(X|Y=l)P(Y=l)}
\end{equation}

and we select the class \(k\) that '''maximizes''' this conditional probability (the maximum a posteriori decision rule).


In linear and quadratic discriminant analysis, \(P(X|Y)\) is modeled as a multivariate Gaussian distribution with density:

\begin{equation}
P(X|Y=k)=\frac{1}{(2\pi)^{n/2}|\Sigma_k|^{1/2}}\exp\Big(-\frac{1}{2}(X-\mu_k)^T\Sigma_k^{-1}(X-\mu_k)\Big)
\end{equation}


This model can be used to classify data by using the training data to estimate:

(1) the class prior probabilities \(P(Y = k)\) by counting the proportion of observed instances of class \(k\),

(2) the class means \(μ_k\) by computing the empirical sample class means, and

(3) the covariance matrices by computing either the empirical sample class covariance matrices or a regularized estimator (e.g., lasso), as illustrated in the short R sketch below.
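For instance, a minimal R sketch (assuming the built-in iris data and the MASS package; the variable names are illustrative) shows where these estimated quantities appear in fitted lda()/qda() objects:

 library("MASS")
 lda.fit <- lda(Species ~ ., data = iris)
 lda.fit$prior      # class prior probabilities P(Y=k), estimated from the class proportions
 lda.fit$means      # empirical class means mu_k
 qda.fit <- qda(Species ~ ., data = iris)
 qda.fit$scaling    # per-class sphering transforms derived from the class-specific covariance estimates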


In the linear case (LDA), the Gaussians for each class are assumed to share the same covariance matrix:

\(Σ_k=Σ\) for each class \(k\). This leads to linear decision surfaces between classes. This is clear from comparing the log-probability ratios of 2 classes (\(k\) and \(l\) ):

\(LOR=\log\Big(\frac{P(Y=k|X)}{P(Y=l|X)}\Big)\) (LOR = 0 ↔ the two probabilities are identical, i.e., same class)

\(LOR=\log\Big(\frac{P(Y=k|X)}{P(Y=l|X)}\Big)=0 \Leftrightarrow (\mu_k-\mu_l)^T\Sigma^{-1}X=\frac{1}{2}\big({\mu_k}^T\Sigma^{-1}\mu_k-{\mu_l}^T\Sigma^{-1}\mu_l\big) \)


But, in the more general, quadratic case of QDA, there are no assumptions on the covariance matrices \(Σ_k\) of the Gaussians, leading to quadratic decision surfaces.

==LDA (Linear Discriminant Analysis)==

===Mathematical formulation===

Fisher's [http://en.wikipedia.org/wiki/Linear_discriminant_analysis Linear discriminant analysis] is a technique aiming to identify a ''linear combination of features'' that characterizes, or separates, two or more groups of objects. The linear combination can be used to predict, or linearly classify, heterogeneous datasets.

LDA is related to [http://www.socr.umich.edu/people/dinov/courses/DSPA_notes/05_DimensionalityReduction.html principal component analysis, independent component analysis, and factor analysis]. However, LDA explicitly attempts to model the difference between the (binary or multinomial) classes using continuous independent variables. Discriminant correspondence analysis is the analogue for categorical independent variables.

For simplicity, assume we are looking for a binary classification of an outcome variable, \(\mathcal{Y}=\{0, 1\}\). The ''decision boundary of the Bayes classifier'', representing a linear hyperplane separating the objects in the two possible outcome labels/classes, is defined as \(D(h^*)=\{x: P(Y=1|X=x)=P(Y=0|X=x)\}\).

To explicitly derive the Bayes classifier's decision boundary, we will denote the likelihood function by \( P(X=x|Y=y) = f_y(x) \) and the prior probability \( P(Y=y) = \pi_y \).

LDA assumes that both outcome classes have multivariate normal (Gaussian) distributions and share the same variance-covariance matrix \(\Sigma\).

Then, \( P(X=x|Y=y) = f_y(x) = \frac{1}{ (2\pi)^{d/2}|\Sigma|^{1/2} }\exp\left( -\frac{1}{2} (x - \mu_y)^\top \Sigma^{-1} (x - \mu_y) \right)\), and we can derive the Bayes classifier decision boundary by setting \( P(Y=1|X=x) = P(Y=0|X=x) \) and expanding this equation as follows:

\(Pr(Y=1|X=x)=Pr(Y=0|X=x)\)
\( \frac{Pr(X=x|Y=1)Pr(Y=1)}{Pr(X=x)}=\frac{Pr(X=x|Y=0)Pr(Y=0)}{Pr(X=x)}\) (Bayes' rule)
\( Pr(X=x|Y=1)Pr(Y=1)=Pr(X=x|Y=0)Pr(Y=0)\) (canceling the denominators)
\( f_1(x)\pi_1=f_0(x)\pi_0\)
\( \frac{1}{ (2\pi)^{d/2}|\Sigma|^{1/2} }\exp\left( -\frac{1}{2} (x - \mu_1)^\top \Sigma^{-1} (x - \mu_1) \right)\pi_1=\frac{1}{ (2\pi)^{d/2}|\Sigma|^{1/2} }\exp\left( -\frac{1}{2} (x - \mu_0)^\top \Sigma^{-1} (x - \mu_0) \right)\pi_0\)
\( \exp\left( -\frac{1}{2} (x - \mu_1)^\top \Sigma^{-1} (x - \mu_1) \right)\pi_1=\exp\left( -\frac{1}{2} (x - \mu_0)^\top \Sigma^{-1} (x - \mu_0) \right)\pi_0\)
\( -\frac{1}{2} (x - \mu_1)^\top \Sigma^{-1} (x - \mu_1) + \log(\pi_1)=-\frac{1}{2} (x - \mu_0)^\top \Sigma^{-1} (x - \mu_0) +\log(\pi_0)\) (log-transform both sides)
\( \log(\frac{\pi_1}{\pi_0})-\frac{1}{2}\left( x^\top\Sigma^{-1}x + \mu_1^\top\Sigma^{-1}\mu_1 - 2x^\top\Sigma^{-1}\mu_1 - x^\top\Sigma^{-1}x - \mu_0^\top\Sigma^{-1}\mu_0 + 2x^\top\Sigma^{-1}\mu_0 \right)=0\) (expand)
\( \log(\frac{\pi_1}{\pi_0})-\frac{1}{2}\left( \mu_1^\top\Sigma^{-1} \mu_1-\mu_0^\top\Sigma^{-1}\mu_0 - 2x^\top\Sigma^{-1}(\mu_1-\mu_0) \right)=0.\) (cancel and factor terms)

Therefore, the Bayes classifier's decision boundary \(D(h^*)\) is linear, \(ax+b=0\), in terms of the argument \(x\).

The 2-class LDA classification formulation can be generalized to multinomial classification using \(k \ge 2\) classes. For any pair of class indices \(m\) and \(n\), the corresponding \(m\)-class to \(n\)-class (Bayes classifier decision) boundary will be \( D(h^*)=\log(\frac{\pi_m}{\pi_n})-\frac{1}{2}\left( \mu_m^\top\Sigma^{-1} \mu_m-\mu_n^\top\Sigma^{-1}\mu_n - 2x^\top\Sigma^{-1}(\mu_m-\mu_n) \right)=0\).

In the special case when both classes have the same number of samples, the resulting decision boundary would represent a linear hyperplane exactly halfway between the mean centers of the sample points in the pair of classes.

Although the assumptions underlying LDA may not hold for most real-world data, it usually performs quite well in practice and often provides near-optimal classifications. For instance, the Z-Score credit risk model designed by Edward Altman in 1968 and [http://pages.stern.nyu.edu/~ealtman/Zscores.pdf revisited in 2000] is essentially a weighted LDA. This model has demonstrated an 85-90% success rate in predicting bankruptcy, and for this reason it is still in use today.

LDA assumptions include:

* The data in each class has a Gaussian distribution,
* The mean rather than the variance is the discriminating factor, and
* In certain situations, LDA may over-fit the training data.

===R code (LDA)===

LDA is similar to GLM (e.g., ANOVA and regression analyses), as it also attempts to express one dependent variable as a linear combination of other features or data elements. However, ANOVA uses categorical independent variables and a continuous dependent variable, whereas LDA has continuous independent variables and a categorical dependent variable (i.e., Dx/class label). Logistic regression and probit regression are more similar to LDA than ANOVA, as they also explain a categorical variable by the values of continuous independent variables.

predfun.lda = function(train.x, train.y, test.x, test.y, neg)
{
  require("MASS")                                          # for the lda() function
  lda.fit = lda(train.x, grouping=train.y)
  ynew = predict(lda.fit, test.x)$class
  out.lda = confusionMatrix(test.y, ynew, negative=neg)    # confusionMatrix() from the crossval package
  return( out.lda )
}
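Mirroring the crossval::crossval call used for ''k''-NN below, the LDA prediction function can be cross-validated to produce the cv.out.lda object compared further down (X and Y denote the predictor matrix and the outcome; the K, B and neg settings follow the ''k''-NN example):

 cv.out.lda = crossval::crossval(predfun.lda, X, Y, K=5, B=2, neg="1")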

==QDA (Quadratic Discriminant Analysis)==

===Mathematical Formulation===

[http://en.wikipedia.org/wiki/Quadratic_classifier Quadratic Discriminant Analysis] searches for a more complex, ''quadratic boundary'' which represents a second order combination of the observed features. QDA assumes that each class \(k\) has its own variance-covariance matrix \(\Sigma_k\).

The mathematical derivation of the QDA Bayes classifier's decision boundary \(D(h^*)\) is similar to that of LDA. For simplicity, we'll still consider a binary classification for the outcome \( \mathcal{Y}=\{0, 1\}\):

\( Pr(Y=1|X=x)=Pr(Y=0|X=x)\)
\( \frac{Pr(X=x|Y=1)Pr(Y=1)}{Pr(X=x)}=\frac{Pr(X=x|Y=0)Pr(Y=0)}{Pr(X=x)}\) (Bayes' Rule)
\( Pr(X=x|Y=1)Pr(Y=1)=Pr(X=x|Y=0)Pr(Y=0)\) (cancel denominators)
\( f_1(x)\pi_1=f_0(x)\pi_0\)
\( \frac{1}{ (2\pi)^{d/2}|\Sigma_1|^{1/2} }\exp\left( -\frac{1}{2} (x - \mu_1)^\top \Sigma_1^{-1} (x - \mu_1) \right)\pi_1=\frac{1}{ (2\pi)^{d/2}|\Sigma_0|^{1/2} }\exp\left( -\frac{1}{2} (x - \mu_0)^\top \Sigma_0^{-1} (x - \mu_0) \right)\pi_0\)
\( \frac{1}{|\Sigma_1|^{1/2} }\exp\left( -\frac{1}{2} (x - \mu_1)^\top \Sigma_1^{-1} (x - \mu_1) \right)\pi_1=\frac{1}{|\Sigma_0|^{1/2} }\exp\left( -\frac{1}{2} (x - \mu_0)^\top \Sigma_0^{-1} (x - \mu_0) \right)\pi_0\) (by cancellation)
\( -\frac{1}{2}\log(|\Sigma_1|)-\frac{1}{2} (x - \mu_1)^\top \Sigma_1^{-1} (x - \mu_1)+\log(\pi_1)=-\frac{1}{2}\log(|\Sigma_0|)-\frac{1}{2} (x - \mu_0)^\top \Sigma_0^{-1} (x - \mu_0)+\log(\pi_0)\) (log transform of both sides)
\( \log(\frac{\pi_1}{\pi_0})-\frac{1}{2}\log(\frac{|\Sigma_1|}{|\Sigma_0|})-\frac{1}{2}\left( x^\top\Sigma_1^{-1}x + \mu_1^\top\Sigma_1^{-1}\mu_1 - 2x^\top\Sigma_1^{-1}\mu_1 - x^\top\Sigma_0^{-1}x - \mu_0^\top\Sigma_0^{-1}\mu_0 + 2x^\top\Sigma_0^{-1}\mu_0 \right)=0\) (expand out)
\( \log(\frac{\pi_1}{\pi_0})-\frac{1}{2}\log(\frac{|\Sigma_1|}{|\Sigma_0|})-\frac{1}{2}\left( x^\top(\Sigma_1^{-1}-\Sigma_0^{-1})x + \mu_1^\top\Sigma_1^{-1}\mu_1 - \mu_0^\top\Sigma_0^{-1}\mu_0 - 2x^\top(\Sigma_1^{-1}\mu_1-\Sigma_0^{-1}\mu_0) \right)=0.\)

The QDA decision boundary \( D(h^*): \{ax^2+bx+c=0\}\) is quadratic in \(x\).

The general multinomial outcome case is similarly derived by replacing the binary classes with a pair of class labels \( m ,n\) and computing the QDA Bayes classifier decision boundary \(D(h^*): \{ \log(\frac{\pi_m}{\pi_n})-\frac{1}{2}\log(\frac{|\Sigma_m|}{|\Sigma_n|})-\frac{1}{2}\left( x^\top(\Sigma_m^{-1}-\Sigma_n^{-1})x + \mu_m^\top\Sigma_m^{-1}\mu_m - \mu_n^\top\Sigma_n^{-1}\mu_n - 2x^\top(\Sigma_m^{-1}\mu_m-\Sigma_n^{-1}\mu_n) \right)=0\} \).


===R code (QDA)===

predfun.qda = function(train.x, train.y, test.x, test.y, neg)
{
  require("MASS")                                          # for the qda() function
  qda.fit = qda(train.x, grouping=train.y)
  ynew = predict(qda.fit, test.x)$class
  out.qda = confusionMatrix(test.y, ynew, negative=neg)
  return( out.qda )
}
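As with LDA, the QDA prediction function can be plugged into the same cross-validation wrapper to produce the cv.out.qda object used in the classifier comparison below (again mirroring the settings of the ''k''-NN call):

 cv.out.qda = crossval::crossval(predfun.qda, X, Y, K=5, B=2, neg="1")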

==k-Nearest Neighbors algorithm==

k-Nearest Neighbors algorithm (k-NN) is a non-parametric method for either classification or regression, where the input consists of the k closest training examples in the feature space, but the output depends on whether k-NN is used for classification or regression:

*In ''k''-NN '''classification''', the output is a class membership (label). Objects in the testing data are classified by a majority vote of their neighbors: each object is assigned to the class that is most common among its ''k'' nearest neighbors (''k'' is a small positive integer). When ''k''=1, the object is assigned to the class of its single nearest neighbor.
*In ''k''-NN '''regression''', the output is the property value for the object, computed as the average of the values of its ''k'' nearest neighbors.

#X = as.matrix(input)     # Predictor variables X = as.matrix(input.short2)

#Y = as.matrix(output)     # Outcome


#KNN (k-nearest neighbors)

    library("class")
    #knn.fit.test <- knn(X, X, cl = Y, k=3, prob=F); predict(as.matrix(knn.fit.test), X)$class
    #table(knn.fit.test, Y); confusionMatrix(Y, knn.fit.test, negative="1")
    #This can be used for polytomous variable (multiple classes)

    predfun.knn = function(train.x, train.y, test.x, test.y, neg)
    {
    require("class")
    knn.fit = knn(train.x, test.x, cl = train.y, prob=T) 	# knn is already a prediction function!!!
    #ynew = predict(knn.fit, test.x)$class			# no need for another prediction in this case
    out.knn = confusionMatrix(test.y, knn.fit, negative=neg)
    return( out.knn )
    }
cv.out.knn = crossval::crossval(predfun.knn, X, Y, K=5, B=2, neg="1")

Compare all four classifiers (lda, qda, knn, and logit):

diagnosticErrors(cv.out.lda$stat); diagnosticErrors(cv.out.qda$stat); diagnosticErrors(cv.out.knn$stat); 
diagnosticErrors(cv.out.logit$stat);
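The cv.out.logit object is assumed to come from an analogous logistic-regression prediction function defined elsewhere on this page; a minimal hypothetical sketch using base R glm() (the function and variable names here are illustrative) could look like:

 predfun.logit = function(train.x, train.y, test.x, test.y, neg)
 {
   dat = data.frame(train.x); dat$y = train.y
   glm.fit = glm(y ~ ., data = dat, family = binomial)              # logistic regression on the training fold
   p.new = predict(glm.fit, newdata = data.frame(test.x), type = "response")
   ynew = ifelse(p.new > 0.5, 1, 0)                                 # classify with a 0.5 probability cutoff
   out.logit = confusionMatrix(test.y, ynew, negative = neg)
   return( out.logit )
 }
 cv.out.logit = crossval::crossval(predfun.logit, X, Y, K=5, B=2, neg="1")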


[[Image:SMHS BigDataBigSci CrossVal5.png]]

Now let’s look at the actual prediction models!

There are different approaches to split (partition) the data into training and testing sets.

#TRAINING: 75% of the sample size

sample_size <- floor(0.75 * nrow(input))
##set the seed to make your partition reproducible
set.seed(1234)
input.train.ind <- sample(seq_len(nrow(input)), size = sample_size)
input.train <- input[input.train.ind, ]
output.train <- as.matrix(output)[input.train.ind, ]

#TESTING DATA

input.test <- input[-input.train.ind, ]
output.test <- as.matrix(output)[-input.train.ind, ]
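The same training/testing partition could also be drawn in a stratified way, balancing the outcome classes across the two sets; a hedged alternative sketch, assuming the caret package is installed and using new variable names so the objects above are not overwritten:

 library("caret")
 train.ind <- createDataPartition(as.factor(output), p = 0.75, list = FALSE)   # stratified on the outcome
 input.train.strat <- input[train.ind, ]
 input.test.strat  <- input[-train.ind, ]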

==k-Means Clustering (k-MC)==

k-MC aims to partition ''n'' observations into ''k'' clusters, where each observation belongs to the cluster with the nearest mean, which acts as a prototype of the cluster. k-MC partitions the data space into Voronoi cells. In general, there is no computationally tractable exact solution (the problem is NP-hard), but there are efficient algorithms that converge quickly to local optima (e.g., the expectation-maximization algorithm for mixtures of Gaussian distributions, via the iterative refinement approach employed by both algorithms<sup>2</sup>).


kmeans_model <- kmeans(input.train, 2)
layout(matrix(1,1))
plot(input.train, col = kmeans_model$cluster)
points(kmeans_model$centers, col = 1:2, pch = 8, cex = 2)
##cluster centers "fitted" to each obs.:
fitted.kmeans <- fitted(kmeans_model);  head(fitted.kmeans)
resid.kmeans <- (input.train - fitted(kmeans_model))
#define the sum of squares function
ss <- function(data) sum(scale(data, scale = FALSE)^2)
##Equalities
cbind(kmeans_model[c("betweenss", "tot.withinss", "totss")], 	# the same two columns
    c(ss(fitted.kmeans), ss(resid.kmeans),  ss(input.train)))
#validation
stopifnot(all.equal(kmeans_model$totss,        ss(input.train)),
    all.equal(kmeans_model$tot.withinss, ss(resid.kmeans)),
##these three are the same:
    all.equal(kmeans_model$betweenss,    ss(fitted.kmeans)),
    all.equal(kmeans_model$betweenss, kmeans_model$totss - kmeans_model$tot.withinss),
##and hence also
    all.equal(ss(input.train), ss(fitted.kmeans) + ss(resid.kmeans))
)
kmeans(input.train,1)$withinss     # trivial one-cluster, (its W.SS == ss(input.train))

<sup>2</sup> http://escholarship.org/uc/item/1rb70972


'''(1) k-Nearest Neighbor Classification'''

library("class")
knn_model <- knn(train= input.train, input.test, cl=as.factor(output.train), k=2)
plot(knn_model)
summary(knn_model)
attributes(knn_model)

#cross-validation
knn_model.cv <- knn.cv(train= input.train, cl=as.factor(output.train), k=2)
summary(knn_model.cv)
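Since knn.cv() returns leave-one-out cross-validated class predictions for the training set, a simple illustrative way to summarize its performance is the misclassification rate:

 mean(knn_model.cv != as.factor(output.train))     # LOOCV misclassification rate of the k-NN model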

==Appendix: R Debugging==

Most programs that give incorrect results are impacted by logical errors. When errors (bugs, exceptions) occur, we need to explore further; this process of identifying and fixing bugs is called “debugging”.

R tools for debugging include: traceback(), debug(), browser(), trace(), and recover().

'''traceback():''' Failing R functions immediately report the error to the screen. Calling traceback() shows in which function the error occurred: it prints the list of functions that were called before the error, in reverse order.

f1<-function(x) { r<- x-g1(x); r }

g1<-function(y) { r<-y*h1(y); r }

h1<-function(z) { r<-log(z); if(r<10) r^2 else r^3}

f1(-1)

Error in if (r < 10) r^2 else r^3 : missing value where TRUE/FALSE needed In addition: Warning message: In log(z) : NaNs produced

traceback() 
3: h1(y)
2: g1(x)
1: f1(-1)

debug()

traceback() does not tell you where the error is. To find out which line causes the error, we may step through the function using debug().

debug(foo) flags the function foo() for debugging. undebug(foo) unflags the function.

When a function is flagged for debugging, each statement in the function is executed one at a time. After a statement is executed, the function suspends and the user can interact with the R shell.

This allows us to inspect a function line-by-line.

Example: compute sum of squared error SS

## compute sum of squares 
SS<-function(mu,x) {  d<-x-mu; d2<-d^2; ss<-sum(d2);  ss }
set.seed(100);  x<-rnorm(100); SS(1,x)
## to debug
debug(SS); SS(1,x)
debugging in: SS(1, x)
debug: {
    d <- x - mu
    d2 <- d^2
    ss <- sum(d2)
    ss
}

In the debugging shell (“Browse[1]>”), users can:

• Typing n (next) executes the current line and prints the next one;

• Typing c (continue) executes the rest of the function without stopping;

• Typing Q quits the debugging session;

• Typing ls() lists all objects in the local environment;

• Typing an object name or print(<object name>) shows the current value of that object.


Example:

debug(SS)
SS(1,x)
debugging in: SS(1, x)
debug: {
    d <- x - mu
    d2 <- d^2
    ...
Browse[1]> n
debug: d <- x - mu              ## the next command
Browse[1]> ls()                 ## current environment: [1] "mu" "x"        ## there is no d yet
Browse[1]> n                    ## go one step
debug: d2 <- d^2                ## the next command
Browse[1]> ls()                 ## current environment: [1] "d" "mu" "x"    ## d has been created
Browse[1]> d[1:3]               ## first three elements of d: [1] -1.5021924 -0.8684688 -1.0789171
Browse[1]> hist(d)              ## histogram of d
Browse[1]> where                ## current position in the call stack: where 1: SS(1, x)
Browse[1]> n
debug: ss <- sum(d2)
Browse[1]> Q                    ## quit the debugger

undebug(SS)                     ## remove the debug flag, stop the debugging process
SS(1,x)                         ## calling SS again now runs without debugging

You can label a function for debugging while debugging another function.

f<-function(x) { r<-x-g(x); r }
g<-function(y) { r<-y*h(y); r }
h<-function(z) { r<-log(z); if(r<10) r^2 else r^3 }
debug(f)			## If you only debug f, you will not step into g
f(-1)
Browse[1]> n
Browse[1]> n
Error in if (r < 10) r^2 else r^3 : missing value where TRUE/FALSE needed In addition: Warning message: 
In log(z) : NaNs produced

But we can also label ''g'' and ''h'' for debugging while we debug ''f'':

f(-1)
Browse[1]>  n
Browse[1]> debug(g) 
Browse[1]> debug(h) 
Browse[1]> n

Inserting a call to browser() in a function will pause the execution of the function at the point where browser() is called. This is similar to using debug(), except that you can control where execution gets paused.

Example:

h<-function(z) {
browser() 	## a break point inserted here 
r<-log(z); if(r<10) r^2 else r^3
}

f(-1)
Browse[1]> ls() 
Browse[1]> z
Browse[1]> n
Browse[1]>  n
Browse[1]> ls()
Browse[1]> c

Calling trace() on a function allows inserting new code into a function. The syntax for trace() may be challenging.

as.list(body(h)) 
trace("h",quote(if(is.nan(r)) {browser()}), at=3, print=FALSE)
f(1)
f(-1)

trace("h",quote(if(z<0) {z<-1}), at=2, print=FALSE)
f(-1)
untrace("h")

During the debugging process, recover() allows checking the status of variables in upper-level functions. recover() can also be registered as an error handler via options() (e.g., options(error=recover)). When functions throw exceptions, execution stops at the point of failure; browsing the function calls and examining the environment may indicate the source of the problem.
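For example, a short illustrative sketch using the f/g/h functions defined above:

 options(error = recover)     # register recover() as the global error handler
 f(-1)                        # on error, R lists the active call-stack frames
 ## enter a frame number to browse its environment (e.g., the frame for h(y)),
 ## inspect variables with ls() or by name, then select 0 to exit
 options(error = NULL)        # restore the default error handling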

==See also==

* [[SMHS_BigDataBigSci_SEM| Structural Equation Modeling (SEM)]]
* [[SMHS_BigDataBigSci_GCM| Growth Curve Modeling (GCM)]]
* [[SMHS_BigDataBigSci_GEE| Generalized Estimating Equation (GEE) Modeling]]
* [[SMHS_BigDataBigSci|Back to Big Data Science]]