Comparison

In this section we will compare the fast FFTKDE with three popular implementations.

This page is inspired by Kernel Density Estimation in Python, where Jake VanderPlas (the original author of the sklearn implementation) compared kernel density estimators in 2013.

Note

Times will vary from computer to computer, and should only be used to compare the relative speed of the algorithms. The CPU used here was an Intel Core i5-6400.

_images/profiling_1D_epa.png

The graph above shows the speed when using an Epanechnikov kernel, for various data sizes.

Feature summary

The table below summarizes the features available across the libraries.

Feature / Library scipy sklearn statsmodels KDEpy.FFTKDE
Number of kernels 1 (gauss) 6 7 (6 slow) 9
Number of dimensions Any Any Any Any
Weighted data points No No Non-FFT Yes
Automatic bandwidth NR None NR, CV NR, ISJ
Time \(N=10^6\) 42 s 22 s 0.27 s 0.01 s
Time \(N=10^2 \times 10^2\) 0.5 s 1.6 s 1.3 s 0.005 s

The choice of kernel is typically not important, but it might be nice to experiment with different functions. Most users will be interested in 1D or 2D estimates, but it’s assuring to know that every implementation generalizes to arbitrary dimensions \(d\). Being able to weight each data point individually and use a fast algorithm is a great benefit of FFTKDE, which is not found in the other libraries. statsmodels implements a fast algorithm for unweighed data using the Gaussian kernel in 1D, but everything else runs using a naive algorithm, which is many orders of magnitude slower.

Automatic bandwidth selection is not available out-of-the-box in sklearn, but every other implementation has one or more options. Normal reference rules (NR) assume a normal distribution when selecting the optimal bandwidth, cross validation (CV) minimizes an error function and the improved Sheather-Jones (ISJ) algorithm provides an asymptotically optimal bandwidth as the number of data points \(N \to \infty\). The ISJ algorithm is also very robust to multimodal distributions, which NR is not.

The times for the one-dimensional \(N = 10^6\) data points were computed taking the median of 5 runs. The kernel was Gaussian and the number of grid points were chosen to be \(n=2^{10}\). The times for the 2D \(N=10^2 \times 10^2\) data points were also based on the median of 5 runs using a Gaussian kernel.

Speed in 1D

We run the algorithms 20 times on normally distributed data and compare the medians of the running times. The plot below compares the speed of the implementations with a Gaussian kernel. The 1D statsmodels implementation uses a similar algorithm when the kernel is Gaussian, and the performance is therefore somewhat comparable.

FFTKDE is initially slower because it solves a non-linear equation to obtain a support threshold for the Gaussian kernel (which does not have finite support). This is a constant cost, and as \(N \to \infty\) the algorithm is orders of magnitude faster than the competitors.

_images/profiling_1D_gauss.png

Switching to the Epanechnikov kernel (scipy falls back to Gaussian, the only kernel implemented) the picture is very different. FFTKDE knows the support and does not solve an equation, while statsmodels is now forced to use a naive algorithm. The difference is tremendous.

_images/profiling_1D_epa.png

Speed in 2D

We run the 2D algorithms 20 times on normally distributed data and compare medians of the running times. FFTKDE is fast because of the underlying algorithm – it implements a \(d\)-dimensional linear binning routine and uses \(d\)-dimensional convolution.

_images/profiling_2D_gauss.png

Speed as dimension increases

We will now look at how FFTKDE performs as the dimensionality of the data increases. The plot below shows how speed is affected as the dimension \(d\) increases across different number of data points \(N\). The running time increases exponentially with \(d\), and this is no surprise as the binning routine has a theoretical runtime \(\mathcal{O}(N 2^d )\).

_images/profiling_ND.png

The \(N\)-dimensional binning routine has runtime \(\mathcal{O}(N 2^d)\), and the convolution (using FFT if needed) has runtime \(n \log n\), where \(n\) is the number of grid points. The runtime of the complete algorithm is \(\mathcal{O}\left( N2^d + n \log n\right)\).