# Difference between revisions of "SMHS BigDataBigSci CrossVal LDA QDA"

## Big Data Science and Cross Validation - Foundation of LDA and QDA for prediction, dimensionality reduction or forecasting

### Summary

Both LDA (Linear Discriminant Analysis) and QDA (Quadratic Discriminant Analysis) use probabilistic models of the class conditional distribution of the data P(X|Y=k) for each class k. Their predictions are obtained by using Bayesian theorem (http://wiki.socr.umich.edu/index.php/SMHS_BayesianInference#Bayesian_Rule):

P(Y=k│X)=(P(X|Y=k)P(Y=k))/(P(X))=(P(X|Y=k)P(Y=k))/(∑_(l=0)^∞▒〖P(X|Y=l)P(Y=l)〗), FIX FORMULA!!

and we select the class k, which maximizes this conditional probability (maximum likelihood estimation).

In linear and quadratic discriminant analysis, P(X|Y) is modelled as a multivariate Gaussian distribution with density:

P(X│Y=k)=1/((2π)^n |Σ_k |^(1⁄2) ) 〖×e〗^((-1/2 (X-μ_k )^T Σ_k^(-1) (X-μ_k )) ) FIX FORMULA!!

This model can be used to classify data by using the training data to estimate:

(1) the class prior probabilities P(Y = k) by counting the proportion of observed instances of class k,

(2) the class means μk by computing the empirical sample class means, and

(3) the covariance matrices by computing either the empirical sample class covariance matrices, or by using a regularized estimator, e.g., lasso).

In the linear case (LDA), the Gaussians for each class are assumed to share the same covariance matrix:

Σk=Σ for each class k. This leads to linear decision surfaces between classes. This is clear from comparing the log-probability ratios of 2 classes (k and l):

LOR=log((P(Y=k│X))/(P(Y=l│X) )) (the LOR=0 ↔the two probabilities are identical, i.e., same class)

LOR=log((P(Y=k│X))/(P(Y=l│X) ))=0 □(⇔┬ ) (μ_k-μ_l )^T Σ^(-1) (μ_k-μ_l )=1/2 (〖μ_k〗^T Σ^(-1) μ_k-〖μ_l〗^T Σ^(-1) μ_l ) FIX FORMULAS!!

But, in the more general, quadratic case of QDA, there are no assumptions on the covariance matrices Σk of the Gaussians, leading to quadratic decision surfaces.

## LDA (Linear Discriminant Analysis)

#LDA is similar to GLM (e.g., ANOVA and regression analyses), as it also attempt to express one dependent variable as a linear combination of other features or data elements, However, ANOVA uses categorical independent variables and a continuous dependent variable, whereas LDA has continuous independent variables and a categorical dependent variable (i.e. Dx/class label). Logistic regression and probit regression are more similar to LDA than ANOVA, as they also explain a categorical variable by the values of continuous independent variables.

```predfun.lda = function(train.x, train.y, test.x, test.y, neg)
{
require("MASS")
lda.fit = lda(train.x, grouping=train.y)
ynew = predict(lda.fit, test.x)\$\\$\$class
out.lda = confusionMatrix(test.y, ynew, negative=neg)
return( out.lda )
}

predfun.qda = function(train.x, train.y, test.x, test.y, neg)

{

require("MASS") # for lda function

qda.fit = qda(train.x, grouping=train.y)

ynew = predict(qda.fit, test.x)\$\\$\$class
```

out.qda = confusionMatrix(test.y, ynew, negative=neg)

return( out.qda )

}

## k-Nearest Neighbors algorithm

k-Nearest Neighbors algorithm (k-NN) is a non-parametric method for either classification or regression, where the input consists of the k closest training examples in the feature space, but the output depends on whether k-NN is used for classification or regression:

• In k-NN classification, the output is a class membership (labels). Objects in the testing data are classified by a majority vote of their neighbors. Each object is assigned to a class that is most common among its k nearest neighbors (k is always a small positive integer). When k=1, then an object is assigned to the class of its single nearest neighbor.
• In k-NN regression, the output is the property value for the object representing the average of the values of its k nearest neighbors.

#X = as.matrix(input) # Predictor variables X = as.matrix(input.short2)

#Y = as.matrix(output) # Outcome

#KNN (k-nearest neighbors)

library("class")

#knn.fit.test <- knn(X, X, cl = Y, k=3, prob=F); predict(as.matrix(knn.fit.test), X)\$\\$\$class </mark> <mark>#table(knn.fit.test, Y); confusionMatrix(Y, knn.fit.test, negative="1")</mark> <mark>#This can be used for polytomous variable (multiple classes)</mark> predfun.knn = function(train.x, train.y, test.x, test.y, neg) { require("class") knn.fit = knn(train.x, test.x, cl = train.y, prob=T) <mark># knn is already a prediction function!!!</mark> #ynew = predict(knn.fit, test.x)\$\\$\$class # no need of another prediction, in this case

out.knn = confusionMatrix(test.y, knn.fit, negative=neg)

return( out.knn )

}

cv.out.knn = crossval::crossval(predfun.knn, X, Y, K=5, B=2, neg="1")

Compare all 3 classifiers (lda, qda, knn, and logit)

diagnosticErrors(cv.out.lda\$\\$\$stat); diagnosticErrors(cv.out.qda\$\\$\$stat); diagnosticErrors(cv.out.qda\$\\$\$stat); diagnosticErrors(cv.out.logit\$\\$\$stat);

Now let’s look at the actual prediction models!

There are different approaches to split the data (partition the data) into Training and Testing sets.

#TRAINING: 75% of the sample size

sample_size <- floor(0.75 * nrow(input))

##set the seed to make your partition reproducible

set.seed(1234)

input.train.ind <- sample(seq_len(nrow(input)), size = sample_size)

input.train <- input[input.train.ind, ]

output.train <- as.matrix(output)[input.train.ind, ]

#TESTING DATA

input.test <- input[-input.train.ind, ]

output.test <- as.matrix(output)[-input.train.ind, ]

## k-Means Clustering (k-MC)

k-MC aims to partition n observations into k clusters where each observation belongs to the cluster with the nearest mean which acts as a prototype of a cluster. The k-MC partitions the data space into Voronoi cells. In general there is no computationally tractable solution (NP-hard problem), but there are efficient algorithms that converge quickly to local optima (e.g., expectation-maximization algorithm for mixtures of Gaussian distributions via an iterative refinement approach employed by both algorithms2).

kmeans_model <- kmeans(input.train, 2)

layout(matrix(1,1))

plot(input.train, col = kmeans_model\$\\$\$cluster) points(kmeans_model\$\\$\$centers, col = 1:2, pch = 8, cex = 2)

##cluster centers "fitted" to each obs.:

resid.kmeans <- (input.train - fitted(kmeans_model))

#define the sum of squares function

ss <- function(data) sum(scale(data, scale = FALSE)^2)

##Equalities

cbind(kmeans_model[c("betweenss", "tot.withinss", "totss")], # the same two columns

c (ss(fitted.kmeans), ss(resid.kmeans), ss(input.train)))

#validation

stopifnot(all.equal(kmeans_model\$totss, ss(input.train)),

all.equal(kmeans_model\$\\$\$tot.withinss, ss(resid.kmeans)),</blockquote> ##these three are the same: <blockquote>all.equal(kmeans_model\$\\$\$betweenss, ss(fitted.kmeans)),

all.equal(kmeans_model\$\\$\$betweenss, kmeans_model\$\\$\$totss - kmeans_model\$\\$\$tot.withinss),</blockquote> ##and hence also <blockquote>all.equal(ss(input.train), ss(fitted.kmeans) + ss(resid.kmeans))</blockquote> ) kmeans(input.train,1)\$\\$\$withinss     # trivial one-cluster, (its W.SS == ss(input.train))

(1)## k-Nearest Neighbor Classification

library("class")

knn_model <- knn(train= input.train, input.test, cl=as.factor(output.train), k=2)

plot(knn_model)

summary(knn_model)

attributes(knn_model)

#cross-validation

knn_model.cv <- knn.cv(train= input.train, cl=as.factor(output.train), k=2)

summary(knn_model.cv)

## Appendix: R Debugging

Most programs that give incorrect results are impacted by logical errors. When errors (bugs, exceptions) occur, we need explore deeper -- this procedure to identify and fix bugs is “debugging”.

R tools for debugging: traceback(), debug() browser() trace() recover()

traceback(): Failing R functions report to the screen immediately the error. Calling traceback() will show the function where the error occurred. The traceback() function prints the list of functions that were called before the error occurred. The function calls are printed in reverse order.

f1<-function(x) { r<- x-g1(x); r }

g1<-function(y) { r<-y*h1(y); r }

h1<-function(z) { r<-log(z); if(r<10) r^2 else r^3}

f1(-1)

Error in if (r < 10) r^2 else r^3 : missing value where TRUE/FALSE needed In addition: Warning message: In log(z) : NaNs produced