2.8. Density Estimation#

Density estimation walks the line between unsupervised learning, feature engineering, and data modeling. Some of the most popular and useful density estimation techniques are mixture models such as Gaussian Mixtures (GaussianMixture), and neighbor-based approaches such as the kernel density estimate (KernelDensity). Gaussian Mixtures are discussed more fully in the context of clustering, because the technique is also useful as an unsupervised clustering scheme.

Density estimation is a very simple concept, and most people are already familiar with one common density estimation technique: the histogram.

2.8.1. Density Estimation: Histograms#

A histogram is a simple visualization of data where bins are defined, and the number of data points within each bin is tallied. An example of a histogram can be seen in the upper-left panel of the following figure:

[Figure: four panels over the same data, comparing a histogram, the same histogram with shifted bins, a top-hat kernel density estimate, and a Gaussian kernel density estimate.]

A major problem with histograms, however, is that the choice of binning can have a disproportionate effect on the resulting visualization. Consider the upper-right panel of the above figure. It shows a histogram over the same data, with the bins shifted right. The results of the two visualizations look entirely different, and might lead to different interpretations of the data.
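To see this concretely, here is a minimal sketch (an added illustration, not part of the original text) of the bin-shift effect using numpy.histogram; the sample and the half-bin offset are arbitrary choices:

>>> import numpy as np
>>> rng = np.random.default_rng(0)
>>> x = np.concatenate([rng.normal(0, 1, 30), rng.normal(5, 1, 20)])
>>> bins = np.arange(-4, 9, 1.0)                          # bin edges on the integers
>>> counts, _ = np.histogram(x, bins=bins)
>>> counts_shifted, _ = np.histogram(x, bins=bins + 0.5)  # same width, shifted grid
>>> # counts and counts_shifted tally the same sample, yet can suggest
>>> # noticeably different shapes for the underlying distribution.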

Intuitively, one can also think of a histogram as a stack of blocks, one block per point. By stacking the blocks in the appropriate grid space, we recover the histogram. But what if, instead of stacking the blocks on a regular grid, we center each block on the point it represents, and sum the total height at each location? This idea leads to the lower-left visualization. It is perhaps not as clean as a histogram, but the fact that the data drive the block locations means that it is a much better representation of the underlying data.

This visualization is an example of a kernel density estimation, in this case with a top-hat kernel (i.e. a square block at each point). We can recover a smoother distribution by using a smoother kernel. The bottom-right plot shows a Gaussian kernel density estimate, in which each point contributes a Gaussian curve to the total. The result is a smooth density estimate which is derived from the data, and functions as a powerful non-parametric model of the distribution of points.
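This contrast can be reproduced in a few lines; the following sketch (an illustration added here, not the code behind the original figure) fits a top-hat and a Gaussian kernel to the same 1D sample, with arbitrary data and bandwidth:

>>> import numpy as np
>>> from sklearn.neighbors import KernelDensity
>>> x = np.array([-2.1, -1.3, -0.4, 1.9, 5.1, 6.2])[:, None]  # 1D data as a column
>>> grid = np.linspace(-5, 10, 200)[:, None]
>>> for kernel in ['tophat', 'gaussian']:
...     kde = KernelDensity(kernel=kernel, bandwidth=0.75).fit(x)
...     log_dens = kde.score_samples(grid)  # np.exp(log_dens): blocky for 'tophat', smooth for 'gaussian'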

2.8.2. Kernel Density Estimation#

Kernel density estimation in scikit-learn is implemented in the KernelDensity estimator, which uses the Ball Tree or KD Tree for efficient queries (see Nearest Neighbors for a discussion of these). Though the above example uses a 1D data set for simplicity, kernel density estimation can be performed in any number of dimensions, though in practice the curse of dimensionality causes its performance to degrade in high dimensions.
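The tree used for these queries can be chosen through the estimator's algorithm parameter; a small sketch with made-up 3D data:

>>> import numpy as np
>>> from sklearn.neighbors import KernelDensity
>>> rng = np.random.default_rng(42)
>>> X3 = rng.normal(size=(1000, 3))            # 1000 points in 3 dimensions
>>> kde = KernelDensity(kernel='gaussian', bandwidth=0.5,
...                     algorithm='kd_tree').fit(X3)   # or 'ball_tree' / 'auto'
>>> log_density = kde.score_samples(X3[:5])    # log-density at the first 5 points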

In the following figure, 100 points are drawn from a bimodal distribution, and the kernel density estimates are shown for three choices of kernels:

[Figure: kernel density estimates of 100 points from a bimodal distribution, for three different kernels.]

It’s clear how the kernel shape affects the smoothness of the resulting distribution. The scikit-learn kernel density estimator can be used as follows:

>>> from sklearn.neighbors import KernelDensity
>>> import numpy as np
>>> X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
>>> kde = KernelDensity(kernel='gaussian', bandwidth=0.2).fit(X)
>>> kde.score_samples(X)
array([-0.41075698, -0.41075698, -0.41076071, -0.41075698, -0.41075698,
       -0.41076071])

Here we have used kernel='gaussian', as seen above. Mathematically, a kernel is a positive function \(K(x;h)\) which is controlled by the bandwidth parameter \(h\). Given this kernel form, the density estimate at a point \(y\) within a group of points \(x_i; i=1\cdots N\) is given by:

\[\rho_K(y) = \sum_{i=1}^{N} K(y - x_i; h)\]

The bandwidth here acts as a smoothing parameter, controlling the tradeoff between bias and variance in the result. A large bandwidth leads to a very smooth (i.e. high-bias) density distribution. A small bandwidth leads to an unsmooth (i.e. high-variance) density distribution.
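Since score_samples returns the log of the density, the Gaussian case can be checked by hand against the example above, assuming each Gaussian contribution is normalized to integrate to one and the sum is averaged over the \(N\) points (which matches the output shown earlier). Reusing X and kde from the previous snippet:

>>> import numpy as np
>>> h, d = 0.2, X.shape[1]
>>> sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)  # pairwise squared distances
>>> norm = len(X) * (2 * np.pi * h ** 2) ** (d / 2)            # N times the Gaussian normalization
>>> log_dens = np.log(np.exp(-sq_dists / (2 * h ** 2)).sum(axis=1) / norm)
>>> np.allclose(log_dens, kde.score_samples(X))
True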

The parameter bandwidth controls this smoothing. One can either set this parameter manually or use Scott's or Silverman's estimation methods.
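In sufficiently recent scikit-learn versions, bandwidth also accepts the strings 'scott' and 'silverman' to select these rules (treat the exact version requirement and the fitted bandwidth_ attribute as assumptions to check against your installation):

>>> from sklearn.neighbors import KernelDensity
>>> kde = KernelDensity(kernel='gaussian', bandwidth='scott').fit(X)
>>> kde.bandwidth_  # the bandwidth actually used, estimated by Scott's rule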

KernelDensity implements several common kernel forms, which are shown in the following figure:

[Figure: the available kernel profiles.]

Kernels’ mathematical expressions#

The form of these kernels is as follows; a short sketch tracing each profile numerically appears after the list:

  • Gaussian kernel (kernel = 'gaussian')

    \(K(x; h) \propto \exp(- \frac{x^2}{2h^2} )\)

  • Tophat kernel (kernel = 'tophat')

    \(K(x; h) \propto 1\) if \(x < h\)

  • Epanechnikov kernel (kernel = 'epanechnikov')

    \(K(x; h) \propto 1 - \frac{x^2}{h^2}\)

  • Exponential kernel (kernel = 'exponential')

    \(K(x; h) \propto \exp(-x/h)\)

  • Linear kernel (kernel = 'linear')

    \(K(x; h) \propto 1 - x/h\) if \(x < h\)

  • Cosine kernel (kernel = 'cosine')

    \(K(x; h) \propto \cos(\frac{\pi x}{2h})\) if \(x < h\)
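Each profile can be traced numerically by fitting the estimator to a single point at the origin and evaluating the density on a grid; a short sketch (an added illustration):

>>> import numpy as np
>>> from sklearn.neighbors import KernelDensity
>>> grid = np.linspace(-3, 3, 601)[:, None]
>>> for kernel in ['gaussian', 'tophat', 'epanechnikov',
...                'exponential', 'linear', 'cosine']:
...     kde = KernelDensity(kernel=kernel, bandwidth=1.0).fit([[0.0]])
...     density = np.exp(kde.score_samples(grid))  # traces K(x; h=1) over the grid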

The kernel density estimator can be used with any of the valid distance metrics (see DistanceMetric for a list of available metrics), though the results are properly normalized only for the Euclidean metric. One particularly useful metric is the Haversine distance, which measures the angular distance between points on a sphere. Here is an example of using a kernel density estimate for a visualization of geospatial data, in this case the distribution of observations of two different species on the South American continent:

[Figure: kernel density estimates of the geographic distributions of two South American species.]
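A minimal sketch of the Haversine setup (the coordinates are made-up sightings; note that the metric expects (latitude, longitude) in radians):

>>> import numpy as np
>>> from sklearn.neighbors import KernelDensity
>>> latlon_deg = np.array([[-10.0, -60.0], [-12.5, -61.0], [-11.2, -59.5]])
>>> X_rad = np.radians(latlon_deg)          # haversine works on radians
>>> kde = KernelDensity(kernel='gaussian', bandwidth=0.03,
...                     metric='haversine').fit(X_rad)
>>> log_density = kde.score_samples(np.radians([[-11.0, -60.2]]))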

One other useful application of kernel density estimation is to learn a non-parametric generative model of a dataset in order to efficiently draw new samples from this generative model. Here is an example of using this process to create a new set of hand-written digits, using a Gaussian kernel learned on a PCA projection of the data:

[Figure: a grid of new hand-written digits drawn from the kernel density model.]

The “new” data consists of linear combinations of the input data, with weights probabilistically drawn given the KDE model.
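That pipeline can be sketched roughly as follows; the component count and the hard-coded bandwidth (roughly what the full example's cross-validated grid search selects) are simplifications to treat as assumptions:

>>> from sklearn.datasets import load_digits
>>> from sklearn.decomposition import PCA
>>> from sklearn.neighbors import KernelDensity
>>> X_digits, _ = load_digits(return_X_y=True)
>>> pca = PCA(n_components=15, whiten=True)
>>> X_proj = pca.fit_transform(X_digits)           # project 64 pixels down to 15 dimensions
>>> kde = KernelDensity(kernel='gaussian', bandwidth=3.79).fit(X_proj)
>>> new_proj = kde.sample(44, random_state=0)      # draw 44 new points in PCA space
>>> new_digits = pca.inverse_transform(new_proj)   # map back to 8x8 pixel space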

Examples

  • Simple 1D Kernel Density Estimation: computation of simple kernel density estimates in one dimension.

  • Kernel Density Estimation: an example of using Kernel Density estimation to learn a generative model of the hand-written digits data, and drawing new samples from this model.

  • Kernel Density Estimate of Species Distributions: an example of Kernel Density estimation using the Haversine distance metric to visualize geospatial data.
