API Documentation

NaiveKDE

class KDEpy.NaiveKDE.NaiveKDE(kernel='gaussian', bw=1, norm=2)[source]

This class implements a naive computation of a kernel density estimate. The advantages are that choices of bandwidth, norms, weights and grids are straightforward – the user can do almost anything. The disadvantage is that computations are slow on more than a couple of thousand data points.

Parameters:
  • kernel (str) – The kernel function. See cls._available_kernels.keys() for choices.
  • bw (float, str or array-like) – Bandwidth or bandwidth selection method. If a float is passed, it is the standard deviation of the kernel. If a string it passed, it is the bandwidth selection method, see cls._bw_methods.keys() for choices. If an array-like it passed, it is the bandwidth of each point.
  • norm (float) – The p-norm used to compute the distances in higher dimensions.

Examples

>>> data = np.random.randn(2**10)
>>> # (1) Automatic bw selection using Improved Sheather Jones
>>> x, y = NaiveKDE(bw='ISJ').fit(data).evaluate()
>>> # (2) Explicit choice of kernel and bw (standard deviation of kernel)
>>> x, y = NaiveKDE(kernel='triweight', bw=0.5).fit(data).evaluate()
>>> weights = data + 10
>>> # (3) Using a grid and weights for the data
>>> y = NaiveKDE(kernel='epa', bw=0.5).fit(data, weights).evaluate(x)

References

  • Silverman, B. W. Density Estimation for Statistics and Data Analysis. Boca Raton: Chapman and Hall, 1986.
  • Wand, M. P., and M. C. Jones. Kernel Smoothing. London ; New York: Chapman and Hall/CRC, 1995.
  • Scipy implementation, at scipy.stats.gaussian_kde.

Methods

__call__(*args, **kwargs) Call self as a function.
evaluate([grid_points]) Evaluate on grid points.
fit(data[, weights]) Fit the KDE to the data.
evaluate(grid_points=None)[source]

Evaluate on grid points.

Parameters:grid_points (array-like, int, tuple or None) – A grid (mesh) to evaluate on. High dimensional grids must have shape (obs, dims). If an integer is passed, it’s the number of grid points on an equidistant grid. If a tuple is passed, it’s the number of grid points in each dimension. If None, a grid will be automatically created.
Returns:y – If a grid is supplied, y is returned. If no grid is supplied, a tuple (x, y) is returned.
Return type:array-like

Examples

>>> kde = NaiveKDE().fit([1, 3, 4, 7])
>>> # Two ways to evaluate, either with a grid or without
>>> x, y = kde.evaluate()
>>> x, y = kde.evaluate(256)
>>> y = kde.evaluate(x)
fit(data, weights=None)[source]

Fit the KDE to the data. This validates the data and stores it. Computations are performed when the KDE is evaluated on a grid.

Parameters:
  • data (array-like) – The data points. High dimensional data must have shape (obs, dims).
  • weights (array-like) – One weight per data point. Must have shape (obs,). If None is passed, uniform weights are used.
Returns:

Returns the instance.

Return type:

self

Examples

>>> data = [1, 3, 4, 7]
>>> weights = [3, 4, 2, 1]
>>> kde = NaiveKDE().fit(data, weights=None)
>>> kde = NaiveKDE().fit(data, weights=weights)
>>> x, y = kde()

TreeKDE

class KDEpy.TreeKDE.TreeKDE(kernel='gaussian', bw=1, norm=2.0)[source]

This class implements a tree-based computation of a kernel density estimate. It works by segmenting the space recursively into smaller parts.

This makes computing a kernel density estimate at a location easier, since we are able to query the tree structure for nearby points instead of having to evaluate the kernel function on all data points. For kernels without finite support, their support is approximated. The scipy k-d tree is used as the underlying algorithm.

Parameters:
  • kernel (str) – The kernel function. See cls._available_kernels.keys() for choices.
  • bw (float, str or array-like) – Bandwidth or bandwidth selection method. If a float is passed, it is the standard deviation of the kernel. If a string it passed, it is the bandwidth selection method, see cls._bw_methods.keys() for choices. If an array-like it passed, it is the bandwidth of each point.
  • norm (float) – The p-norm used to compute the distances in higher dimensions.

Examples

>>> data = np.random.randn(2**10)
>>> # (1) Automatic bw selection using Improved Sheather Jones
>>> x, y = TreeKDE(bw='ISJ').fit(data).evaluate()
>>> # (2) Explicit choice of kernel and bw (standard deviation of kernel)
>>> x, y = TreeKDE(kernel='triweight', bw=0.5).fit(data).evaluate()
>>> weights = data + 10
>>> # (3) Using a grid and weights for the data
>>> y = TreeKDE(kernel='epa', bw=0.5).fit(data, weights).evaluate(x)

References

  • Friedman, Jerome H., Jon Louis Bentley, and Raphael Ari Finkel. An Algorithm for Finding Best Matches in Logarithmic Expected Time. ACM Trans. Math. Softw. 3, no. 3 (September 1977): 209–226. https://doi.org/10.1145/355744.355745.
  • Maneewongvatana, Songrit, and David M. Mount. It’s Okay to Be Skinny, If Your Friends Are Fat. In Center for Geometric Computing 4th Annual Workshop on Computational Geometry, 2:1–8, 1999.
  • Silverman, B. W. Density Estimation for Statistics and Data Analysis. Boca Raton: Chapman and Hall, 1986. Page 99 for reference to kd-tree.
  • Scipy implementation, at scipy.spatial.KDTree.

Methods

__call__(*args, **kwargs) Call self as a function.
evaluate([grid_points, eps]) Evaluate on grid points.
fit(data[, weights]) Fit the KDE to the data.
evaluate(grid_points=None, eps=0.001)[source]

Evaluate on grid points.

Parameters:
  • grid_points (array-like, int, tuple or None) – A grid (mesh) to evaluate on. High dimensional grids must have shape (obs, dims). If an integer is passed, it’s the number of grid points on an equidistant grid. If a tuple is passed, it’s the number of grid points in each dimension. If None, a grid will be automatically created.
  • eps (float) – The maximal total error in absolute terms when estimating the effective support of a kernel which has infinite support. Setting this too high will produced a jagged estimate.
Returns:

y – If a grid is supplied, y is returned. If no grid is supplied, a tuple (x, y) is returned.

Return type:

array-like

Examples

>>> kde = TreeKDE().fit([1, 3, 4, 7])
>>> # Two ways to evaluate, either with a grid or without
>>> x, y = kde.evaluate()
>>> x, y = kde.evaluate(256)
>>> y = kde.evaluate(x)
fit(data, weights=None)[source]

Fit the KDE to the data. This validates the data and stores it. Computations are performed upon evaluation on a grid.

Parameters:
  • data (array-like) – The data points.
  • weights (array-like) – One weight per data point. Numbers of observations must match the data points.
Returns:

Returns the instance.

Return type:

self

Examples

>>> data = [1, 3, 4, 7]
>>> weights = [3, 4, 2, 1]
>>> kde = TreeKDE().fit(data, weights=None)
>>> kde = TreeKDE().fit(data, weights=weights)
>>> x, y = kde()

FFTKDE

class KDEpy.FFTKDE.FFTKDE(kernel='gaussian', bw=1, norm=2)[source]

This class implements a convolution (FFT) based computation of a KDE. While this implementation is very fast, there are some limitations: (1) the bandwidth must be constant, (2) the KDE must be evaluated on an equidistant grid and (3) the grid must encompass every data point. The finer the grid, the smaller the error.

The evaluation step is split into two phases. First the \(N\) data points are binned using a linear binning routine on an equidistant grid x with \(n\) grid points. This runs in \(O(N 2^d)\) time. Then the kernel is evaluated once on \(\leq n\) points and the result of the kernel evaluation and the binned data is convolved. Using the convolution theorem, this step runs in \(O(n \log n)\) time. While \(N\) may be millions, \(n\) is typically 2**10. The total running time of the algorithm is \(O(N d^2 + n \log n)\). See references for more information.

The implementation is reminiscent of the one found in statsmodels. However, ulike the statsmodels implementation every kernel is available for FFT computation, weighted data is available for FFT computation, and no large temporary arrays are created.

Parameters:
  • kernel (str) – The kernel function. See cls._available_kernels.keys() for choices.
  • bw (float or str) – Bandwidth or bandwidth selection method. If a float is passed, it is the standard deviation of the kernel. If a string it passed, it is the bandwidth selection method, see cls._bw_methods.keys() for choices.

Examples

>>> data = np.random.randn(2**10)
>>> # (1) Automatic bw selection using Improved Sheather Jones
>>> x, y = FFTKDE(bw='ISJ').fit(data).evaluate()
>>> # (2) Explicit choice of kernel and bw (standard deviation of kernel)
>>> x, y = FFTKDE(kernel='triweight', bw=0.5).fit(data).evaluate()
>>> weights = data + 10
>>> # (3) Using a grid and weights for the data
>>> y = FFTKDE(kernel='epa', bw=0.5).fit(data, weights).evaluate(x)
>>> # (4) If you supply your own grid, it must be equidistant
>>> y = FFTKDE().fit(data)(np.linspace(-10, 10, num=2**12))

References

  • Wand, M. P., and M. C. Jones. Kernel Smoothing. London ; New York: Chapman and Hall/CRC, 1995. Pages 182-192.
  • Statsmodels implementation, at statsmodels.nonparametric.kde.KDEUnivariate.

Methods

__call__(*args, **kwargs) Call self as a function.
evaluate([grid_points]) Evaluate on equidistant grid points.
fit(data[, weights]) Fit the KDE to the data.
evaluate(grid_points=None)[source]

Evaluate on equidistant grid points.

Parameters:grid_points (array-like, int, tuple or None) – A grid (mesh) to evaluate on. High dimensional grids must have shape (obs, dims). If an integer is passed, it’s the number of grid points on an equidistant grid. If a tuple is passed, it’s the number of grid points in each dimension. If None, a grid will be automatically created.
Returns:y – If a grid is supplied, y is returned. If no grid is supplied, a tuple (x, y) is returned.
Return type:array-like

Examples

>>> kde = FFTKDE().fit([1, 3, 4, 7])
>>> # Two ways to evaluate, either with a grid or without
>>> x, y = kde.evaluate()
>>> x, y = kde.evaluate(256)
>>> y = kde.evaluate(x)
fit(data, weights=None)[source]

Fit the KDE to the data. This validates the data and stores it. Computations are performed upon evaluation on a specific grid.

Parameters:
  • data (array-like) – The data points.
  • weights (array-like) – One weight per data point. Must have same shape as the data.
Returns:

Returns the instance.

Return type:

self

Examples

>>> data = [1, 3, 4, 7]
>>> weights = [3, 4, 2, 1]
>>> kde = FFTKDE().fit(data, weights=None)
>>> kde = FFTKDE().fit(data, weights=weights)
>>> x, y = kde()