{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Introduction\n", "\n", "This Jupyter Notebook introduces [Kernel Density Estimation](https://en.wikipedia.org/wiki/Kernel_density_estimation) (KDE) alongside [KDEpy](https://kdepy.readthedocs.io/en/latest/).\n", "Kernel density estimation is an approach to solving the following problem.\n", "\n", "> **Problem.** Given a set of $N$ data points $\\{x_1, x_2, \\dots, x_N\\}$, estimate the probability density function from which the data is drawn.\n", "\n", "There are roughly two approaches to solving this problem:\n", "\n", "1. Assume a **parametric form**, e.g. a [normal distribution](https://en.wikipedia.org/wiki/Normal_distribution), and estimate the *parameters* $\\mu$ and $\\sigma$. These parameters uniquely determine the distribution, and are typically found using the [maximum likelihood principle](https://en.wikipedia.org/wiki/Maximum_likelihood_estimation). The advantage of this approach is that we only need to estimate a few parameters, while the disadvantage is that the chosen parametric form might not fit the data very well.\n", "2. Use **kernel density estimation**, a non-parametric method in which we let the data speak for itself. The idea is to place a *kernel function* $K$ on each data point $x_i$, and let the estimated probability density function be the average of the $N$ kernel functions, i.e. their sum scaled by $1/N$ so that the estimate integrates to one.\n", "\n", "Assuming a parametric form is a perfectly valid approach, especially if there is evidence that the data follows a known theoretical distribution.\n", "However, for [exploratory data analysis](https://en.wikipedia.org/wiki/Exploratory_data_analysis) we might want to assume as little as possible when plotting a distribution, and this is where kernel density estimation and KDEpy come to the rescue." 
] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "from scipy.stats import norm\n", "import matplotlib.pyplot as plt\n", "from KDEpy import FFTKDE # Fastest 1D algorithm\n", "\n", "np.random.seed(123) # Seed NumPy's global generator for reproducible results\n", "\n", "distribution = norm() # Create standard normal distribution (mean 0, std 1)\n", "data = distribution.rvs(32) # Draw 32 random samples" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## The histogram\n", "\n", "A KDE may be thought of as an extension of the familiar [histogram](https://en.wikipedia.org/wiki/Histogram). \n", "The purpose of the KDE is to estimate an unknown probability density function $f(x)$ given data sampled from it. \n", "A natural first thought is to use a histogram: it’s well known, simple to understand, and often works reasonably well.\n", "\n", "To see how the histogram performs on the data generated above, we'll plot the true distribution alongside a histogram. \n", "As seen below, the histogram does a fairly poor job on this small sample, since:\n", "- The locations of the bins and the number of bins are both arbitrary choices.\n", "- The estimated distribution is discontinuous (not smooth), while the true distribution is continuous (smooth)." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "