Difference between revisions of "SMHS BigDataBigSci CrossVal LDA QDA"
(→See also) |
|||
Line 2: | Line 2: | ||
===Summary=== | ===Summary=== | ||
− | Both LDA (Linear Discriminant Analysis) and QDA (Quadratic Discriminant Analysis) use probabilistic models of the class conditional distribution of the data | + | Both LDA (Linear Discriminant Analysis) and QDA (Quadratic Discriminant Analysis) use probabilistic models of the class conditional distribution of the data \(P(X|Y=k)\) for each class \(k\). Their predictions are obtained by using [http://wiki.socr.umich.edu/index.php/SMHS_BayesianInference#Bayesian_Rule Bayesian theorem]: |
\begin{equation} | \begin{equation} | ||
− | P(Y=k|X)=\frac{P(X|Y=k)P(Y=k)}{P(X)}=\frac{P(X|Y=k)P(Y=k)}{\sum_{l=0}^∞P(X|Y=l)P(Y=l) | + | P(Y=k|X)=\frac{P(X|Y=k)P(Y=k)}{P(X)}=\frac{P(X|Y=k)P(Y=k)}{\sum_{l=0}^∞P(X|Y=l)P(Y=l)} |
\end{equation} | \end{equation} | ||
− | and we select the class | + | and we select the class \(k\), which '''maximizes''' this conditional probability (maximum likelihood estimation). |
− | In linear and quadratic discriminant analysis, | + | In linear and quadratic discriminant analysis, \(P(X|Y)\) is modeled as a multivariate Gaussian distribution with density: |
\begin{equation} | \begin{equation} | ||
Line 20: | Line 20: | ||
This model can be used to classify data by using the training data to '''estimate''': | This model can be used to classify data by using the training data to '''estimate''': | ||
− | (1) the class prior probabilities | + | (1) the class prior probabilities \(P(Y = k)\) by counting the proportion of observed instances of class \(k\), |
− | (2) the class means | + | (2) the class means \(μ_k\) by computing the empirical sample class means, and |
(3) the covariance matrices by computing either the empirical sample class covariance matrices, or by using a regularized estimator, e.g., lasso). | (3) the covariance matrices by computing either the empirical sample class covariance matrices, or by using a regularized estimator, e.g., lasso). | ||
Line 29: | Line 29: | ||
In the <u>linear case</u> (LDA), the Gaussians for each class are assumed to share the same covariance matrix: | In the <u>linear case</u> (LDA), the Gaussians for each class are assumed to share the same covariance matrix: | ||
− | + | \(Σ_k=Σ\) for each class \(k\). This leads to linear decision surfaces between classes. This is clear from comparing the log-probability ratios of 2 classes (\(k\) and \(l\) ): | |
− | + | \(LOR=log\Big(\frac{P(Y=k│X)}{P(Y=l│X)}\Big)\) | |
(the LOR=0 ↔the two probabilities are identical, i.e., same class) | (the LOR=0 ↔the two probabilities are identical, i.e., same class) | ||
− | + | \(LOR=log\Big(\frac{P(Y=k│X}{P(Y=l│X)}\Big)=0 ⇔ (\mu_k-\mu_l)^T\sum^{-1}(\mu_k-\mu_1)=\frac{1}{2}({\mu_k}^T\sum^{-1}\mu_k-{\mu_l}^T\sum^{-1}\mu_l) \) | |
− | But, in the more general, <u>quadratic case</u> of QDA, there are no assumptions on the covariance matrices | + | But, in the more general, <u>quadratic case</u> of QDA, there are no assumptions on the covariance matrices \(Σ_k\) of the Gaussians, leading to quadratic decision surfaces. |
==LDA (Linear Discriminant Analysis)== | ==LDA (Linear Discriminant Analysis)== | ||
− | + | LDA is similar to GLM (e.g., ANOVA and regression analyses), as it also attempts to express one dependent variable as a linear combination of other features or data elements, However, ANOVA uses categorical independent variables and a continuous dependent variable, whereas LDA has continuous independent variables and a categorical dependent variable (i.e. Dx/class label). Logistic regression and probit regression are more similar to LDA than ANOVA, as they also explain a categorical variable by the values of continuous independent variables. | |
predfun.lda = function(train.x, train.y, test.x, test.y, neg) | predfun.lda = function(train.x, train.y, test.x, test.y, neg) | ||
Line 47: | Line 47: | ||
require("MASS") | require("MASS") | ||
lda.fit = lda(train.x, grouping=train.y) | lda.fit = lda(train.x, grouping=train.y) | ||
− | ynew = predict(lda.fit, test.x) | + | ynew = predict(lda.fit, test.x)\(\\(\(class |
out.lda = confusionMatrix(test.y, ynew, negative=neg) | out.lda = confusionMatrix(test.y, ynew, negative=neg) | ||
return( out.lda ) | return( out.lda ) | ||
Line 58: | Line 58: | ||
require("MASS") # for lda function | require("MASS") # for lda function | ||
qda.fit = qda(train.x, grouping=train.y) | qda.fit = qda(train.x, grouping=train.y) | ||
− | ynew = predict(qda.fit, test.x) | + | ynew = predict(qda.fit, test.x)\(\\(\(class |
out.qda = confusionMatrix(test.y, ynew, negative=neg) | out.qda = confusionMatrix(test.y, ynew, negative=neg) | ||
return( out.qda ) | return( out.qda ) | ||
Line 79: | Line 79: | ||
library("class") | library("class") | ||
− | <span style="background-color: #32cd32">#knn.fit.test <- knn(X, X, cl = Y, k=3, prob=F); predict(as.matrix(knn.fit.test), X) | + | <span style="background-color: #32cd32">#knn.fit.test <- knn(X, X, cl = Y, k=3, prob=F); predict(as.matrix(knn.fit.test), X) \(\\(\(class </span> |
<span style="background-color: #32cd32">#table(knn.fit.test, Y); confusionMatrix(Y, knn.fit.test, negative="1")</span> | <span style="background-color: #32cd32">#table(knn.fit.test, Y); confusionMatrix(Y, knn.fit.test, negative="1")</span> | ||
<span style="background-color: #32cd32">#This can be used for polytomous variable (multiple classes)</span> | <span style="background-color: #32cd32">#This can be used for polytomous variable (multiple classes)</span> | ||
Line 87: | Line 87: | ||
require("class") | require("class") | ||
knn.fit = knn(train.x, test.x, cl = train.y, prob=T) <span style="background-color: #32cd32"># knn is already a prediction function!!!</span> | knn.fit = knn(train.x, test.x, cl = train.y, prob=T) <span style="background-color: #32cd32"># knn is already a prediction function!!!</span> | ||
− | #ynew = predict(knn.fit, test.x) | + | #ynew = predict(knn.fit, test.x)\(\\(\(class # no need of another prediction, in this case |
out.knn = confusionMatrix(test.y, knn.fit, negative=neg) | out.knn = confusionMatrix(test.y, knn.fit, negative=neg) | ||
return( out.knn ) | return( out.knn ) | ||
Line 95: | Line 95: | ||
Compare all 3 classifiers (lda, qda, knn, and logit) | Compare all 3 classifiers (lda, qda, knn, and logit) | ||
− | diagnosticErrors(cv.out.lda | + | diagnosticErrors(cv.out.lda\(\\(\(stat); diagnosticErrors(cv.out.qda\(\\(\(stat); diagnosticErrors(cv.out.qda\(\\(\(stat); |
− | diagnosticErrors(cv.out.logit | + | diagnosticErrors(cv.out.logit\(\\(\(stat); |
Line 114: | Line 114: | ||
output.train <- as.matrix(output)[input.train.ind, ] | output.train <- as.matrix(output)[input.train.ind, ] | ||
− | + | TESTING DATA | |
input.test <- input[-input.train.ind, ] | input.test <- input[-input.train.ind, ] | ||
Line 121: | Line 121: | ||
==k-Means Clustering (k-MC)== | ==k-Means Clustering (k-MC)== | ||
− | k-MC aims to partition ''n'' observations into ''k'' clusters where each observation belongs to the cluster with the nearest mean which acts as a prototype of a cluster. The k-MC partitions the data space into Voronoi cells. In general there is no computationally tractable solution (NP-hard problem), but there are efficient algorithms that converge quickly to local optima (e.g., expectation-maximization algorithm for mixtures of Gaussian distributions via an iterative refinement approach employed by both algorithms<sup>2</sup>). | + | k-MC aims to partition ''n'' observations into ''k'' clusters where each observation belongs to the cluster with the nearest mean which acts as a prototype of a cluster. The k-MC partitions the data space into Voronoi cells. In general, there is no computationally tractable solution (NP-hard problem), but there are efficient algorithms that converge quickly to local optima (e.g., expectation-maximization algorithm for mixtures of Gaussian distributions via an iterative refinement approach employed by both algorithms<sup>2</sup>). |
kmeans_model <- kmeans(input.train, 2) | kmeans_model <- kmeans(input.train, 2) | ||
layout(matrix(1,1)) | layout(matrix(1,1)) | ||
− | plot(input.train, col = kmeans_model | + | plot(input.train, col = kmeans_model\(\\(\(cluster) |
− | points(kmeans_model | + | points(kmeans_model\(\\(\(centers, col = 1:2, pch = 8, cex = 2) |
##cluster centers "fitted" to each obs.: | ##cluster centers "fitted" to each obs.: | ||
Line 140: | Line 140: | ||
#validation | #validation | ||
− | stopifnot(all.equal(kmeans_model | + | stopifnot(all.equal(kmeans_model\(\\(\(totss, ss(input.train)), |
− | all.equal(kmeans_model | + | all.equal(kmeans_model\(\\(\(tot.withinss, ss(resid.kmeans)), |
##these three are the same: | ##these three are the same: | ||
− | all.equal(kmeans_model | + | all.equal(kmeans_model\(\\(\(betweenss, ss(fitted.kmeans)), |
− | all.equal(kmeans_model | + | all.equal(kmeans_model\(\\(\(betweenss, kmeans_model\(\\(\(totss - kmeans_model\(\\(\(tot.withinss), |
##and hence also | ##and hence also | ||
all.equal(ss(input.train), ss(fitted.kmeans) + ss(resid.kmeans)) | all.equal(ss(input.train), ss(fitted.kmeans) + ss(resid.kmeans)) | ||
) | ) | ||
− | kmeans(input.train,1) | + | kmeans(input.train,1)\(\\(\(withinss # trivial one-cluster, (its W.SS == ss(input.train)) |
<sup>2</sup>http://escholarship.org/uc/item/1rb70972 | <sup>2</sup>http://escholarship.org/uc/item/1rb70972 | ||
− | '''(1) | + | '''(1) k-Nearest Neighbor Classification''' |
library("class") | library("class") | ||
Line 161: | Line 161: | ||
attributes(knn_model) | attributes(knn_model) | ||
− | + | #cross-validation | |
knn_model.cv <- knn.cv(train= input.train, cl=as.factor(output.train), k=2) | knn_model.cv <- knn.cv(train= input.train, cl=as.factor(output.train), k=2) | ||
summary(knn_model.cv) | summary(knn_model.cv) |
Revision as of 09:52, 19 October 2020
Contents
Big Data Science and Cross Validation - Foundation of LDA and QDA for prediction, dimensionality reduction or forecasting
Summary
Both LDA (Linear Discriminant Analysis) and QDA (Quadratic Discriminant Analysis) use probabilistic models of the class-conditional distribution of the data, \(P(X|Y=k)\), for each class \(k\). Their predictions are obtained by applying Bayes' theorem:
\begin{equation} P(Y=k|X)=\frac{P(X|Y=k)P(Y=k)}{P(X)}=\frac{P(X|Y=k)P(Y=k)}{\sum_{l}P(X|Y=l)P(Y=l)} \end{equation}
and we select the class \(k\) that maximizes this posterior probability (a maximum a posteriori, MAP, decision rule).
In linear and quadratic discriminant analysis, \(P(X|Y)\) is modeled as a multivariate Gaussian distribution with density:
\begin{equation} P(X|Y=k)=\frac{1}{(2\pi)^{n/2}|\Sigma_k|^{1/2}}\exp\Big(-\frac{1}{2}(x-\mu_k)^T\Sigma_k^{-1}(x-\mu_k)\Big) \end{equation}
This model can be used to classify data by using the training data to estimate:
(1) the class prior probabilities \(P(Y = k)\) by counting the proportion of observed instances of class \(k\),
(2) the class means \(μ_k\) by computing the empirical sample class means, and
(3) the covariance matrices, by computing either the empirical sample class covariance matrices or a regularized estimator (e.g., lasso).
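For concreteness, here is a minimal R sketch of these three estimation steps on simulated two-class data (the object names X and y below are illustrative and not part of the original code):

 # Minimal sketch (simulated data): estimate priors, class means, pooled covariance
 set.seed(1)
 X <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),     # class "A" features
            matrix(rnorm(100, mean = 2), ncol = 2))     # class "B" features
 y <- factor(rep(c("A", "B"), each = 50))               # class labels
 priors <- table(y) / length(y)                         # (1) class priors P(Y=k)
 means  <- lapply(levels(y), function(k) colMeans(X[y == k, ]))   # (2) class means mu_k
 Sigma  <- Reduce(`+`, lapply(levels(y), function(k) {  # (3) pooled covariance (LDA assumption)
             (sum(y == k) - 1) * cov(X[y == k, ])
           })) / (length(y) - nlevels(y))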
In the linear case (LDA), the Gaussians for each class are assumed to share the same covariance matrix:
\(\Sigma_k=\Sigma\) for each class \(k\). This leads to linear decision surfaces between classes, which is clear from comparing the log-probability ratios of two classes \(k\) and \(l\):

\(LOR=\log\Big(\frac{P(Y=k|X)}{P(Y=l|X)}\Big)\) (LOR = 0 ⇔ the two posterior probabilities are equal, i.e., \(X\) lies on the decision boundary between the two classes)

\(LOR=\log\Big(\frac{P(Y=k|X)}{P(Y=l|X)}\Big)=0 ⇔ (\mu_k-\mu_l)^T\Sigma^{-1}X=\frac{1}{2}\big({\mu_k}^T\Sigma^{-1}\mu_k-{\mu_l}^T\Sigma^{-1}\mu_l\big)\) (assuming equal class priors).

In the more general quadratic case (QDA), there are no constraints on the covariance matrices \(\Sigma_k\) of the Gaussians, which leads to quadratic decision surfaces.
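Equivalently (a standard reformulation, not stated in the original text), classification can be written in terms of per-class discriminant functions \(\delta_k(x)\), with the predicted class being \(\arg\max_k \delta_k(x)\). QDA keeps the quadratic term in \(x\):

\begin{equation} \delta_k^{QDA}(x)=-\frac{1}{2}\log|\Sigma_k|-\frac{1}{2}(x-\mu_k)^T\Sigma_k^{-1}(x-\mu_k)+\log P(Y=k), \end{equation}

whereas with a shared covariance \(\Sigma\) the quadratic term cancels across classes and the LDA discriminant is linear in \(x\):

\begin{equation} \delta_k^{LDA}(x)=x^T\Sigma^{-1}\mu_k-\frac{1}{2}{\mu_k}^T\Sigma^{-1}\mu_k+\log P(Y=k). \end{equation}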
LDA (Linear Discriminant Analysis)
LDA is similar to GLM (e.g., ANOVA and regression analyses), as it also attempts to express one dependent variable as a linear combination of other features or data elements. However, ANOVA uses categorical independent variables and a continuous dependent variable, whereas LDA has continuous independent variables and a categorical dependent variable (i.e., Dx/class label). Logistic regression and probit regression are more similar to LDA than ANOVA, as they also explain a categorical variable by the values of continuous independent variables.
 predfun.lda = function(train.x, train.y, test.x, test.y, neg)
 {
   require("MASS")                                         # for the lda() function
   lda.fit = lda(train.x, grouping=train.y)
   ynew = predict(lda.fit, test.x)$class                   # predicted class labels
   out.lda = confusionMatrix(test.y, ynew, negative=neg)   # crossval::confusionMatrix
   return( out.lda )
 }
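Assuming the predictor matrix X and outcome Y defined later in this section, predfun.lda can be passed to crossval() exactly like the k-NN prediction function below (a sketch mirroring that call):

 cv.out.lda = crossval::crossval(predfun.lda, X, Y, K=5, B=2, neg="1")   # 5-fold CV, 2 repeats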
QDA (Quadratic Discriminant Analysis)
 predfun.qda = function(train.x, train.y, test.x, test.y, neg)
 {
   require("MASS")                                         # for the qda() function
   qda.fit = qda(train.x, grouping=train.y)
   ynew = predict(qda.fit, test.x)$class                   # predicted class labels
   out.qda = confusionMatrix(test.y, ynew, negative=neg)   # crossval::confusionMatrix
   return( out.qda )
 }
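The corresponding cross-validated QDA error object, used in the classifier comparison below, can be obtained the same way (same assumptions as for LDA above):

 cv.out.qda = crossval::crossval(predfun.qda, X, Y, K=5, B=2, neg="1")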
k-Nearest Neighbors algorithm
The k-Nearest Neighbors algorithm (k-NN) is a non-parametric method for either classification or regression. The input consists of the k closest training examples in the feature space, and the output depends on whether k-NN is used for classification or regression:

- In k-NN classification, the output is a class membership (label). Objects in the testing data are classified by a majority vote of their neighbors: each object is assigned to the class that is most common among its k nearest neighbors (k is typically a small positive integer). When k=1, an object is simply assigned to the class of its single nearest neighbor (see the short example after this list).
- In k-NN regression, the output is a property value for the object, computed as the average of the values of its k nearest neighbors.
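As a quick, self-contained illustration of k-NN classification (using R's built-in iris data, not part of the original analysis):

 library("class")
 set.seed(1234)
 train.idx <- sample(1:150, 100)                       # random 100/50 train/test split of iris
 knn.pred  <- knn(train = iris[train.idx, 1:4],        # training features
                  test  = iris[-train.idx, 1:4],       # testing features
                  cl    = iris$Species[train.idx],     # training labels
                  k     = 3)
 table(knn.pred, iris$Species[-train.idx])             # confusion table on the test set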
 #X = as.matrix(input)          # Predictor variables
 X = as.matrix(input.short2)
#Y = as.matrix(output) # Outcome
#KNN (k-nearest neighbors)
library("class") #knn.fit.test <- knn(X, X, cl = Y, k=3, prob=F); predict(as.matrix(knn.fit.test), X) \(\\(\(class #table(knn.fit.test, Y); confusionMatrix(Y, knn.fit.test, negative="1") #This can be used for polytomous variable (multiple classes) predfun.knn = function(train.x, train.y, test.x, test.y, neg) { require("class") knn.fit = knn(train.x, test.x, cl = train.y, prob=T) # knn is already a prediction function!!! #ynew = predict(knn.fit, test.x)\(\\(\(class # no need of another prediction, in this case out.knn = confusionMatrix(test.y, knn.fit, negative=neg) return( out.knn ) } cv.out.knn = crossval::crossval(predfun.knn, X, Y, K=5, B=2, neg="1")
Compare all four classifiers (lda, qda, knn, and logit):
 diagnosticErrors(cv.out.lda$stat); diagnosticErrors(cv.out.qda$stat); diagnosticErrors(cv.out.knn$stat);
 diagnosticErrors(cv.out.logit$stat);
Now let’s look at the actual prediction models!
There are different approaches to split the data (partition the data) into Training and Testing sets.
#TRAINING: 75% of the sample size
 sample_size <- floor(0.75 * nrow(input))
 ## set the seed to make your partition reproducible
 set.seed(1234)
 input.train.ind <- sample(seq_len(nrow(input)), size = sample_size)
 input.train <- input[input.train.ind, ]
 output.train <- as.matrix(output)[input.train.ind, ]
#TESTING DATA
 input.test <- input[-input.train.ind, ]
 output.test <- as.matrix(output)[-input.train.ind, ]
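An alternative split strategy (hypothetical here, since the original uses a plain random sample) is a stratified partition that preserves the outcome's class proportions, e.g., with the caret package:

 # sketch, assuming the caret package is installed and output holds the class labels
 library("caret")
 train.ind   <- createDataPartition(as.factor(output), p = 0.75, list = FALSE)
 input.train <- input[train.ind, ];  input.test <- input[-train.ind, ]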
k-Means Clustering (k-MC)
k-MC aims to partition n observations into k clusters, where each observation belongs to the cluster with the nearest mean, which serves as the cluster's prototype. k-MC partitions the data space into Voronoi cells. In general, there is no computationally tractable exact solution (the problem is NP-hard), but there are efficient algorithms that converge quickly to local optima (e.g., the expectation-maximization algorithm for mixtures of Gaussian distributions; both k-MC and EM employ a similar iterative refinement approach²).
 kmeans_model <- kmeans(input.train, 2)
 layout(matrix(1,1))
 plot(input.train, col = kmeans_model$cluster)
 points(kmeans_model$centers, col = 1:2, pch = 8, cex = 2)
##cluster centers "fitted" to each obs.: fitted.kmeans <- fitted(kmeans_model); head(fitted.kmeans) resid.kmeans <- (input.train - fitted(kmeans_model)) #define the sum of squares function ss <- function(data) sum(scale(data, scale = FALSE)^2)
 ##Equalities
 cbind(kmeans_model[c("betweenss", "tot.withinss", "totss")],   # the same two columns
       c(ss(fitted.kmeans), ss(resid.kmeans), ss(input.train)))
 #validation
 stopifnot(all.equal(kmeans_model$totss,        ss(input.train)),
           all.equal(kmeans_model$tot.withinss, ss(resid.kmeans)),
           ##these three are the same:
           all.equal(kmeans_model$betweenss, ss(fitted.kmeans)),
           all.equal(kmeans_model$betweenss, kmeans_model$totss - kmeans_model$tot.withinss),
           ##and hence also
           all.equal(ss(input.train), ss(fitted.kmeans) + ss(resid.kmeans))
 )
 kmeans(input.train, 1)$withinss   # trivial one-cluster case (its W.SS == ss(input.train))
² http://escholarship.org/uc/item/1rb70972
(1) k-Nearest Neighbor Classification
library("class") knn_model <- knn(train= input.train, input.test, cl=as.factor(output.train), k=2) plot(knn_model) summary(knn_model) attributes(knn_model) #cross-validation knn_model.cv <- knn.cv(train= input.train, cl=as.factor(output.train), k=2) summary(knn_model.cv)
Appendix: R Debugging
Most programs that give incorrect results suffer from logical errors. When errors (bugs, exceptions) occur, we need to dig deeper; the process of identifying and fixing bugs is called “debugging”.
R tools for debugging: traceback(), debug(), browser(), trace(), and recover().
traceback(): A failing R function immediately reports its error to the screen. Calling traceback() shows where the error occurred: it prints the list of functions that were called before the error, in reverse order.
f1<-function(x) { r<- x-g1(x); r }
g1<-function(y) { r<-y*h1(y); r }
h1<-function(z) { r<-log(z); if(r<10) r^2 else r^3}
f1(-1)
 Error in if (r < 10) r^2 else r^3 : missing value where TRUE/FALSE needed
 In addition: Warning message:
 In log(z) : NaNs produced
 traceback()
 3: h1(y)
 2: g1(x)
 1: f1(-1)

debug()
traceback() does not tell you where the error is. To find out which line causes the error, we can step through the function using debug().
debug(foo) flags the function foo() for debugging. undebug(foo) unflags the function.
When a function is flagged for debugging, each statement in the function is executed one at a time. After a statement is executed, the function suspends and the user can interact with the R shell.
This allows us to inspect a function line-by-line.
Example: compute sum of squared error SS
 ## compute sum of squares
 SS <- function(mu, x) { d <- x-mu; d2 <- d^2; ss <- sum(d2); ss }
 set.seed(100); x <- rnorm(100); SS(1, x)
 ## to debug
 debug(SS); SS(1,x)
 debugging in: SS(1, x)
 debug: {
     d <- x - mu
     d2 <- d^2
     ss <- sum(d2)
     ss
 }
In the debugging shell (“Browse[1]>”), users can:
• n (next) executes the current line and prints the next one;
• c (continue) executes the rest of the function without stopping;
• Q quits the debugging session;
• ls() lists all objects in the local environment;
• typing an object's name, or print(<object name>), shows the current value of that object.
Example:
 debug(SS)
 SS(1,x)
 debugging in: SS(1, x)
 debug: {
     d <- x - mu
     d2 <- d^2
     ...
 Browse[1]> n
 debug: d <- x - mu            ## the next command
 Browse[1]> ls()               ## current environment
 [1] "mu" "x"                  ## there is no d yet
 Browse[1]> n                  ## go one step
 debug: d2 <- d^2              ## the next command
 Browse[1]> ls()               ## current environment
 [1] "d"  "mu" "x"             ## d has been created
 Browse[1]> d[1:3]             ## first three elements of d
 [1] -1.5021924 -0.8684688 -1.0789171
 Browse[1]> hist(d)            ## histogram of d
 Browse[1]> where              ## current position in the call stack
 where 1: SS(1, x)
 Browse[1]> n
 debug: ss <- sum(d2)
 Browse[1]> Q                  ## quit the debugger
 undebug(SS)                   ## remove the debug flag, stop debugging
 SS(1,x)                       ## calling SS again now runs without debugging
You can also flag a function for debugging while debugging another function:
 f <- function(x) { r <- x - g(x); r }
 g <- function(y) { r <- y * h(y); r }
 h <- function(z) { r <- log(z); if(r < 10) r^2 else r^3 }
 debug(f)   ## If you only debug f, you will not go into g
 f(-1)
 Browse[1]> n
 Browse[1]> n
 Error in if (r < 10) r^2 else r^3 : missing value where TRUE/FALSE needed
 In addition: Warning message:
 In log(z) : NaNs produced
But we can also flag g and h for debugging while we debug f:
 f(-1)
 Browse[1]> n
 Browse[1]> debug(g)
 Browse[1]> debug(h)
 Browse[1]> n
Inserting a call to browser() in a function pauses execution at the point where browser() is called. This is similar to using debug(), except that you control exactly where execution gets paused.
Example:
 h <- function(z) {
     browser()    ## a break point inserted here
     r <- log(z)
     if(r < 10) r^2 else r^3
 }
 f(-1)
 Browse[1]> ls()
 Browse[1]> z
 Browse[1]> n
 Browse[1]> n
 Browse[1]> ls()
 Browse[1]> c
Calling trace() on a function allows new code to be inserted into it; the syntax for trace() can be challenging.
 as.list(body(h))
 trace("h", quote(if(is.nan(r)) {browser()}), at=3, print=FALSE)
 f(1)
 f(-1)
 trace("h", quote(if(z<0) {z<-1}), at=2, print=FALSE)
 f(-1)
 untrace("h")
During the debugging process, recover() allows checking the status of variables in upper-level functions. recover() can also be registered as an error handler via options() (e.g., options(error = recover)). When a function throws an exception, execution then stops at the point of failure, and browsing the function calls and examining their environments may indicate the source of the problem.
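A minimal sketch of this workflow, reusing the toy f/g/h functions defined above:

 options(error = recover)   # register recover() as the error handler
 f(-1)                      # on error, R lists the active calls (f, g, h) and
                            # prompts for a frame number to browse; inspect z, r there
 options(error = NULL)      # restore the default error handling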
See also
- Back to Big Data Science and Cross-Validation
- Structural Equation Modeling (SEM)
- Growth Curve Modeling (GCM)
- Generalized Estimating Equation (GEE) Modeling
- Back to Big Data Science
- SOCR Home page: http://www.socr.umich.edu