SMHS MixtureModeling
Scientific Methods for Health Sciences - Mixture Modeling
Overview
A mixture model is a probabilistic model for representing the presence of subpopulations within an overall population, without requiring that an observed data set identify the sub-population to which an individual observation belongs. In this section, we present a general introduction to mixture modeling: the structure of a mixture model, various types of mixture models, the estimation of their parameters, and applications of mixture models in studies. The implementation of mixture modeling in R will also be discussed in the attached articles.
Motivation
A mixture distribution represents the probability distribution of observations in the overall population. Problems associated with mixture distributions concern deriving the properties of the overall population from those of the sub-populations. We use mixture models to make statistical inferences about the properties of the sub-populations given only observations on the pooled population, without sub-population identity information. Mixture models are not the same as models for compositional data, whose components are constrained to sum to a constant value (1, 100%, etc.). What is the structure of a mixture model, and how can we estimate its parameters?
Theory
1) Structure of a mixture model: a distribution $f$ is a mixture of $K$ component distributions $f_{1}, f_{2}, \cdots, f_{K}$ if $f(x) = \sum_{k=1}^{K} \lambda_{k}f_{k}(x)$, with the $\lambda_{k}$ being the mixing weights, $\lambda_{k} > 0, \sum_{k}\lambda_{k} = 1$. Equivalently, $Z \sim Mult(\lambda_{1}, \cdots, \lambda_{K})$ and $X|Z \sim f_{Z}$, where the discrete random variable $Z$ indicates which component $X$ is drawn from. Different parametric families for the $f_{k}$ generate different parametric mixture models, e.g., Gaussian, Binomial or Poisson; the components may all be Gaussians with different parameters, or all Poisson distributions with different means. The model can then be expressed as $f(x) = \sum_{k=1}^{K} \lambda_{k}f(x;\theta_{k})$, and the parameter vector of the mixture model is $\theta = (\lambda_{1}, \cdots, \lambda_{K}, \theta_{1}, \cdots, \theta_{K})$. When $K=1$, we get a simple parametric distribution of the usual sort, and density estimation reduces to estimating the parameters by maximum likelihood. If $K=n$, the number of observations, we are back to kernel density estimation.
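As a concrete illustration of this two-stage construction (first draw the hidden label $Z$, then draw $X$ from the chosen component), here is a minimal R sketch; the function names rmix2 and dmix2 and the parameter values are illustrative choices, not part of the original text.

# Minimal illustrative sketch: simulate from, and evaluate the density of,
# a two-component Gaussian mixture
rmix2 <- function(n, lambda, mu, sigma) {
  z <- sample(1:2, size=n, replace=TRUE, prob=lambda)  # Z ~ Mult(lambda_1, lambda_2)
  rnorm(n, mean=mu[z], sd=sigma[z])                    # X | Z = k  ~  N(mu_k, sigma_k^2)
}
dmix2 <- function(x, lambda, mu, sigma) {
  # f(x) = sum_k lambda_k f(x; theta_k)
  lambda[1]*dnorm(x, mu[1], sigma[1]) + lambda[2]*dnorm(x, mu[2], sigma[2])
}
x <- rmix2(1000, lambda=c(0.4, 0.6), mu=c(0, 5), sigma=c(1, 2))

Comparing a histogram of x with the curve of dmix2 over the same range gives a quick check that the simulated draws follow the intended mixture density (this is essentially Problem 1 below).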
2) Estimating parametric mixture models: assume independent samples, so that the likelihood for observations $x_{1}, x_{2}, \cdots, x_{n}$ is $\prod_{i=1}^{n}f(x_{i};\theta)$.
We take the logarithm to turn the product into a sum: $l(\theta) = \sum_{i=1}^{n} \log f(x_{i};\theta) = \sum_{i=1}^{n} \log \sum_{k=1}^{K} \lambda_{k} f(x_{i}; \theta_{k})$.
We take the derivative of this with respect to one parameter, say $\theta_{j}$: $\frac{\partial l}{\partial \theta_{j}}=\sum_{i=1}^{n}\frac{1}{\sum_{k=1}^{K}\lambda_{k}f(x_{i};\theta_{k})}\lambda_{j}\frac{\partial f(x_{i};\theta_{j})}{\partial \theta_{j}}=\sum_{i=1}^{n}\frac{\lambda_{j}f(x_{i};\theta_{j})}{\sum_{k=1}^{K}\lambda_{k}f(x_{i};\theta_{k})}\frac{\partial \log f(x_{i};\theta_{j})}{\partial \theta_{j}}$.
If we just had an ordinary parametric model, on the other hand, the derivative of the log-likelihood would be $\sum_{i=1}^{n}\frac{\partial \log f(x_{i};\theta_{j})}{\partial \theta_{j}}.$ Maximizing the likelihood for a mixture model is therefore like doing a weighted likelihood maximization, where the weight of $x_{i}$ for component $j$ is $w_{ij} = \frac{\lambda_{j}f(x_{i};\theta_{j})}{\sum_{k=1}^{K}\lambda_{k}f(x_{i};\theta_{k})}.$
Remember that $\lambda_{j}$ is the probability that the hidden class variable $Z$ is $j$, so the numerator in the weights is the joint probability of getting $Z=j$ and $X=x_{i}$. The denominator is the marginal probability of getting $X=x_{i}$, so the ratio is the conditional probability of $Z=j$ given $X=x_{i}$: $w_{ij} = \frac{\lambda_{j}f(x_{i};\theta_{j})}{\sum_{k=1}^{K}\lambda_{k}f(x_{i};\theta_{k})} = p(Z=j | X=x_{i}; \theta).$
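The short R sketch below (our own illustration; the function name posterior.weights and its argument names are ours) computes these membership weights $w_{ij}$ for a univariate Gaussian mixture, making the joint-over-marginal construction explicit.

# Illustrative sketch (not part of the original example): posterior membership
# weights w_ij = p(Z=j | X=x_i; theta) for a univariate Gaussian mixture.
# x: numeric vector of observations; lambda, mu, sigma: length-K parameter vectors
posterior.weights <- function(x, lambda, mu, sigma) {
  k <- length(lambda)
  # joint probabilities lambda_j * f(x_i; theta_j), one column per component
  joint <- sapply(1:k, function(j) lambda[j] * dnorm(x, mean=mu[j], sd=sigma[j]))
  joint / rowSums(joint)  # divide each row by the marginal f(x_i)
}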
- EM algorithm: (1) start with guesses about the mixture components $\theta_{1}, \cdots, \theta_{K}$ and the mixing weights $\lambda_{1}, \cdots, \lambda_{K}$; (2) until nothing changes very much: using the current parameter guesses, calculate the weights $w_{ij}$ (E-step); using the current weights, maximize the weighted likelihood to get new parameter estimates (M-step); (3) return the final parameter estimates (including mixing proportions) and cluster probabilities. A compact sketch of this loop for univariate Gaussian mixtures is given after this list.
- Non-parametric mixture modeling: replace the M-step of EM by some other way of estimating the distribution of each mixture component. This could be a fast and crude estimate of the parameters, or it could even be a non-parametric density estimator.
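Below is a compact, self-contained R sketch of the EM loop for a univariate Gaussian mixture with $K$ components. It is our own illustrative implementation, not the article's code (the worked example that follows uses the mixtools package); the function name em.gaussian.mix, the quantile-based starting values, and the convergence tolerance are choices made for this sketch.

# Illustrative EM sketch for a K-component univariate Gaussian mixture
# (crude starting values; no protection against degenerate components)
em.gaussian.mix <- function(x, k, max.iter=200, tol=1e-6) {
  n <- length(x)
  # initial guesses: equal weights, means spread across the data, common sd
  lambda <- rep(1/k, k)
  mu <- quantile(x, probs=(1:k)/(k+1), names=FALSE)
  sigma <- rep(sd(x), k)
  loglik.old <- -Inf
  for (iter in 1:max.iter) {
    # E-step: posterior membership weights w_ij
    joint <- sapply(1:k, function(j) lambda[j] * dnorm(x, mu[j], sigma[j]))
    w <- joint / rowSums(joint)
    # M-step: weighted maximum-likelihood estimates for each component
    nk <- colSums(w)
    lambda <- nk / n
    mu <- sapply(1:k, function(j) sum(w[, j] * x) / nk[j])
    sigma <- sapply(1:k, function(j) sqrt(sum(w[, j] * (x - mu[j])^2) / nk[j]))
    # stop once the log-likelihood no longer changes very much
    loglik <- sum(log(rowSums(joint)))
    if (abs(loglik - loglik.old) < tol) break
    loglik.old <- loglik
  }
  list(lambda=lambda, mu=mu, sigma=sigma, posterior=w, loglik=loglik)
}

For instance, em.gaussian.mix(snoq, k=2) on the precipitation data used below should give estimates broadly comparable to the normalmixEM fit, up to the usual sensitivity of EM to starting values.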
3) Computational examples in R: Snoqualmie Falls Revisited (analyzed using the mixtools package in R)
RCODE:
snoqualmie <- read.csv("http://www.stat.cmu.edu/~cshalizi/402/lectures/16-glm-practicals/snoqualmie.csv", header=FALSE)
snoqualmie.vector <- na.omit(unlist(snoqualmie))
snoq <- snoqualmie.vector[snoqualmie.vector > 0]
summary(snoq)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   1.00    6.00   19.00   32.28   44.00  463.00
# Two-component Gaussian mixture
library(mixtools)
snoq.k2 <- normalmixEM(snoq, k=2, maxit=100, epsilon=0.01)
summary(snoq.k2)
summary of normalmixEM object:
           comp 1    comp 2
lambda   0.557524  0.442476
mu      10.266172 60.009468
sigma    8.510244 44.997240
loglik at estimate: -32681.21
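Besides the component parameters reported by summary(), the object returned by normalmixEM also stores the posterior membership probabilities in its $posterior component (this field is used further below to classify individual days). The short snippet here is added for illustration and is not part of the original transcript; the variable name day.hard.classes is ours.

head(snoq.k2$posterior)                                      # weights w_ij for the first few wet days
day.hard.classes <- apply(snoq.k2$posterior, 1, which.max)   # most probable component per day
table(day.hard.classes)                                      # how many days fall in each component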
# Function to add Gaussian mixture components, vertically scaled, to the
# current plot
# Presumes the mixture object has the structure used by mixtools
plot.normal.components <- function(mixture, component.number, ...) {
  curve(mixture$lambda[component.number] *
          dnorm(x, mean=mixture$mu[component.number],
                sd=mixture$sigma[component.number]),
        add=TRUE, ...)
}

plot(hist(snoq, breaks=101), col="grey", border="grey", freq=FALSE,
     xlab="Precipitation (1/100 inch)", main="Precipitation in Snoqualmie Falls")
lines(density(snoq), lty=2)

[[Image:SMHS_MixtureModel_Fig1.png|500px]]

Histogram (grey) for precipitation on wet days in Snoqualmie Falls. The dashed line is a kernel density estimate, which is not completely satisfactory.

plot(hist(snoq, breaks=101), col="grey", border="grey", freq=FALSE,
     xlab="Precipitation (1/100 inch)", main="Precipitation in Snoqualmie Falls")
lines(density(snoq), lty=2)
sapply(1:2, plot.normal.components, mixture=snoq.k2)

[[Image:SMHS_MixtureModel_Fig2.png|500px]]

As in the previous figure, plus the components of a mixture of two Gaussians fitted to the data by the EM algorithm (dashed lines). These are scaled by the mixing weights of the components.

# Function to calculate the cumulative distribution function of a Gaussian
# mixture model
# Presumes the mixture object has the structure used by mixtools
# Doesn't implement some of the usual options for CDF functions in R, like
# returning the log probability, or the upper tail probability
pnormmix <- function(x, mixture) {
  lambda <- mixture$lambda
  k <- length(lambda)
  pnorm.from.mix <- function(x, component) {
    lambda[component] * pnorm(x, mean=mixture$mu[component],
                              sd=mixture$sigma[component])
  }
  pnorms <- sapply(1:k, pnorm.from.mix, x=x)
  return(rowSums(pnorms))
}
#### Figure 3
# Distinct values in the data
distinct.snoq <- sort(unique(snoq))
# Theoretical CDF evaluated at each distinct value
tcdfs <- pnormmix(distinct.snoq, mixture=snoq.k2)
# Empirical CDF evaluated at each distinct value
# ecdf(snoq) returns an object which is a _function_, suitable for application
# to new vectors
ecdfs <- ecdf(snoq)(distinct.snoq)
# Plot them against each other
plot(tcdfs, ecdfs, xlab="Theoretical CDF", ylab="Empirical CDF", xlim=c(0,1),
     ylim=c(0,1))
# Main diagonal for visual reference
abline(0,1)
Calibration plot for the two-component Gaussian mixture. For each distinct value of precipitation x, plot the fraction of days predicted by the mixture model to have $\leq x$ precipitation on the horizontal axis, versus the actual fraction of days $\leq x$.
# Probability density function for a Gaussian mixture
# Presumes the mixture object has the structure used by mixtools
dnormalmix <- function(x, mixture, log=FALSE) {
  lambda <- mixture$lambda
  k <- length(lambda)
  # Calculate share of likelihood for all data for one component
  like.component <- function(x, component) {
    lambda[component] * dnorm(x, mean=mixture$mu[component],
                              sd=mixture$sigma[component])
  }
  # Create array with likelihood shares from all components over all data
  likes <- sapply(1:k, like.component, x=x)
  # Add up contributions from components
  d <- rowSums(likes)
  if (log) {
    d <- log(d)
  }
  return(d)
}

# Log likelihood function for a Gaussian mixture, potentially on new data
loglike.normalmix <- function(x, mixture) {
  loglike <- dnormalmix(x, mixture, log=TRUE)
  return(sum(loglike))
}
loglike.normalmix(snoq, mixture=snoq.k2)
[1] -32681.21

# Evaluate various numbers of Gaussian components by data-set splitting
# (i.e., very crude cross-validation)
n <- length(snoq)
data.points <- 1:n
data.points <- sample(data.points)  # Permute randomly
train <- data.points[1:floor(n/2)]  # First random half is training
test <- data.points[-(1:floor(n/2))]  # 2nd random half is testing
candidate.component.numbers <- 2:10
loglikes <- vector(length=1+length(candidate.component.numbers))
# k=1 needs special handling
mu <- mean(snoq[train])  # MLE of mean
sigma <- sd(snoq[train])*sqrt((n-1)/n)  # MLE of standard deviation
loglikes[1] <- sum(dnorm(snoq[test], mu, sigma, log=TRUE))
for (k in candidate.component.numbers) {
  mixture <- normalmixEM(snoq[train], k=k, maxit=400, epsilon=1e-2)
  loglikes[k] <- loglike.normalmix(snoq[test], mixture=mixture)
}
loglikes
 [1] -17647.93 -16336.12 -15796.02 -15554.33 -15398.04 -15337.47 -15297.61
 [8] -15285.60 -15286.75 -15288.88
plot(x=1:10, y=loglikes, xlab="Number of mixture components",
     ylab="Log-likelihood on testing data")

[[Image:SMHS_MixtureModel_Fig4.png|500px]]

Log-likelihoods of mixture models of different sizes, fit to a random half of the data for training and evaluated on the other half of the data for testing.

snoq.k9 <- normalmixEM(snoq, k=9, maxit=400, epsilon=1e-2)
plot(hist(snoq, breaks=101), col="grey", border="grey", freq=FALSE,
     xlab="Precipitation (1/100 inch)", main="Precipitation in Snoqualmie Falls")
lines(density(snoq), lty=2)
sapply(1:9, plot.normal.components, mixture=snoq.k9)

[[Image:SMHS_MixtureModel_Fig5.png|500px]]

As in the previous figures, but with the nine-component Gaussian mixture.

# Assignments for distinct.snoq and ecdfs are redundant if you've already done them
distinct.snoq <- sort(unique(snoq))
tcdfs <- pnormmix(distinct.snoq, mixture=snoq.k9)
ecdfs <- ecdf(snoq)(distinct.snoq)
plot(tcdfs, ecdfs, xlab="Theoretical CDF", ylab="Empirical CDF", xlim=c(0,1),
     ylim=c(0,1))
abline(0,1)

[[Image:SMHS_MixtureModel_Fig6.png|500px]]

Calibration plot for the nine-component Gaussian mixture.

plot(0, xlim=range(snoq.k9$mu), ylim=range(snoq.k9$sigma), type="n",
     xlab="Component mean", ylab="Component standard deviation")
points(x=snoq.k9$mu, y=snoq.k9$sigma, pch=as.character(1:9),
       cex=sqrt(0.5+5*snoq.k9$lambda))
Characteristics of the components of the nine-component Gaussian mixture. The horizontal axis gives the component mean, the vertical axis its standard deviation. The area of the number representing each component is proportional to the component's mixing weight.
plot(density(snoq), lty=2, ylim=c(0,0.04),
     main=paste("Comparison of density estimates\n",
                "Kernel vs. Gaussian mixture"),
     xlab="Precipitation (1/100 inch)")
curve(dnormalmix(x, snoq.k9), add=TRUE)
Dashed line: kernel density estimate. Solid line: the nine-component Gaussian mixture. Notice that the mixture, unlike the KDE, gives negligible probability to negative precipitation.
# Do the classes of the Gaussian mixture make sense as annual weather patterns?
# Most probable class for each day:
day.classes <- apply(snoq.k9$posterior, 1, which.max)
# Make a copy of the original, structured data set to edit
snoqualmie.classes <- snoqualmie
# Figure out which days had precipitation
wet.days <- (snoqualmie > 0) & !(is.na(snoqualmie))
# Replace actual precipitation amounts with classes
snoqualmie.classes[wet.days] <- day.classes
# Problem: the number of the classes doesn't correspond to e.g. amount of
# precipitation expected. Solution: label by expected precipitation, not by
# class number.
snoqualmie.classes[wet.days] <- snoq.k9$mu[day.classes]
plot(0, xlim=c(1,366), ylim=range(snoq.k9$mu), type="n", xaxt="n",
     xlab="Day of year", ylab="Expected precipitation (1/100 inch)")
axis(1, at=1+(0:11)*30)
for (year in 1:nrow(snoqualmie.classes)) {
  points(1:366, snoqualmie.classes[year,], pch=16, cex=0.2)
}
Plot of days classified according to the nine-component mixture. Horizontal axis: days of the year, numbered from 1 to 366 to handle leap years. Vertical axis: expected amount of precipitation on that day according to the most probable class for the day.
# Next line is currently (5 April 2011) used to invoke a bug-patch kindly
# provided by Dr. Derek Young; the patch will be incorporated in the next
# update to mixtools, so should not be needed after April 2011
source("http://www.stat.cmu.edu/~cshalizi/402/lectures/20-mixture-examples/bootcomp.R")
snoq.boot <- boot.comp(snoq, max.comp=10, mix.type="normalmix", maxit=400, epsilon=1e-2)
# Running this takes about 5 minutes
# The histograms in the next figure are automatically produced as a side-effect
# of running boot.comp()
Histograms produced by boot.comp(). The vertical red lines mark the observed difference in log-likelihoods.
# Contour plot of a bivariate Gaussian density (using the mvtnorm package)
library(mvtnorm)
x.points <- seq(-3, 3, length.out=100)
y.points <- x.points
z <- matrix(0, nrow=100, ncol=100)
mu <- c(1, 1)
sigma <- matrix(c(2,1,1,1), nrow=2)
for (i in 1:100) {
  for (j in 1:100) {
    z[i,j] <- dmvnorm(c(x.points[i], y.points[j]), mean=mu, sigma=sigma)
  }
}
contour(x.points, y.points, z)
Applications
1) This article demonstrates the mixture modeling and expectation maximization (EM) activity applied to the problem of 2D point cluster segmentation. It illustrates ways to use EM and mixture modeling to obtain cluster classifications of points in 2D using the SOCR Charts activity and the SOCR Modeler, with specific examples.
2) This article presents the SOCR activity that demonstrates random sampling and the fitting of mixture models to data. The SOCR mixture-model applet shows how unimodal distributions come together as building blocks to form the backbone of many complex processes; it allows computing probabilities and critical values for these mixture distributions, and enables inference on such complicated processes.
3) This article presented a mixture model approach for the analysis of microarray gene expression data. Microarrays have emerged as powerful tools allowing investigators to assess the expression of thousands of genes in different tissues and organisms. Statistical treatment of the resulting data remains a substantial challenge. Investigators using microarray expression studies may wish to answer questions about the statistical significance of differences in expression of any of the genes under study, avoiding false positive and false negative results. This paper developed a sequence of procedures involving finite mixture modeling and bootstrap inference to address these issues in studies involving many thousands of genes and illustrated the use of these techniques with a dataset involving calorically restricted mice.
4) This article is concerned with estimating a probability density function of human skin color, using a finite Gaussian mixture model, whose parameters are estimated through the EM algorithm. Hawkins' statistical test on the normality and homoscedasticity (common covariance matrix) of the estimated Gaussian mixture models is performed and McLachlan's bootstrap method is used to test the number of components in a mixture. Experimental results show that the estimated Gaussian mixture model fits skin images from a large database. Applications of the estimated density function in image and video databases are presented.
Software
R code for examples in Chapter 20 (see references)
Problems
1) Write a function to simulate from a Gaussian mixture model. Check if it works by comparing a density estimated on its output to the theoretical density.
2) Work through the E-step and M-step for a mixture of two Poisson distributions.
3) Code up the EM algorithm for a mixture of $K$ Gaussians. Simulate data from $K=3$ Gaussians. How well does the code assign data points to components when the actual Gaussian parameters are used as the initial guess, and how does the assignment change when other initial parameters are given?
4) Write a function to fit a mixture of exponential distributions using the EM algorithm.
References
Casella, George, and Roger L. Berger. Statistical Inference.
- SOCR Home page: http://www.socr.umich.edu