statsutils/fm.ipynb

{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "%matplotlib widget\n",
    "from collections.abc import Iterable\n",
    "\n",
    "import matplotlib.pyplot as plt\n",
    "import numpy as np\n",
    "from scipy.stats import chi2\n",
    "import math\n",
    "from fractions import Fraction as F"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# FM\n",
    "This contains automations for A-level Further Maths stats ordered in the same way as they are on [integral maths](https://my.integralmaths.org/)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "def std_deviation(variance):\n",
    "    return math.sqrt(variance)\n",
    "\n",
    "def NOT(p):\n",
    "    return 1 - p\n",
    "\n",
    "# Independent events\n",
    "def AND(*ps):\n",
    "    return math.prod(ps)\n",
    "\n",
    "def OR(*ps):\n",
    "    return 1 - AND(map(NOT, ps))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Permutations and Combinations\n",
    "These just use the `math.perm` and `math.comb` functions throughout, which are defined as follows.\n",
    "\n",
    "- Permutations (pick) are ordered\n",
    "- Combinations (choose) are unordered"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "def permutations(n: int, take: int) -> int:\n",
    "    return int(math.factorial(n) / math.factorial(n - take))\n",
    "\n",
    "def combinations(n: int, take: int) -> int:\n",
    "    return int(permutations(n, take) / math.factorial(take))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Discrete Random Variables (DRVs)\n",
    "\n",
    "This section includes basic stats operations on DRVs.\n",
    "\n",
    "Note:\n",
    "- Expected value (expectation) = mean\n",
    "- Standard deviation = sqrt(variance)\n",
    "- $E(aX + b) = aE(X) + b$\n",
    "- $Var(aX + b) = a^2 Var(X)$"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "class DiscreteRandomVariable:\n",
    "    values: list[F]\n",
    "    probabilities: list[F]\n",
    "    size: int\n",
    "\n",
    "    # items are in the form of (value, probability)\n",
    "    # assumed that sum(probabilities) = 1\n",
    "    def __init__(self, items: list[tuple[F, F]]):\n",
    "        self.values = []\n",
    "        self.probabilities = []\n",
    "        for item in items:\n",
    "            self.values.append(item[0])\n",
    "            self.probabilities.append(item[1])\n",
    "        self.size = len(items)\n",
    "\n",
    "    def copy(self):\n",
    "        c = DiscreteRandomVariable([])\n",
    "        c.values = self.values.copy()\n",
    "        c.probabilities = self.probabilities.copy()\n",
    "        c.size = self.size\n",
    "        return c\n",
    "\n",
    "    def expectation(self):\n",
    "        return sum(map(math.prod, zip(self.values, self.probabilities)))\n",
    "\n",
    "    def variance(self):\n",
    "        X2 = self.copy()\n",
    "        X2.values = map(lambda x : x**2, X2.values)\n",
    "        return X2.expectation() - self.expectation()**2\n",
    "\n",
    "    def variance_alt(self):\n",
    "        u = self.expectation()\n",
    "\n",
    "        X_u = self.copy()\n",
    "        X_u.values = map(lambda x : (x - u) ** 2, X_u.values)\n",
    "\n",
    "        return X_u.expectation()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Discrete Distributions\n",
    "### Binomial\n",
    "- n independent trials\n",
    "- all trials have a probability p of success\n",
    "- $ X \\sim B(n, p) $\n",
    "\n",
    "### Poisson\n",
    "- infinite independent trials\n",
    "- ... at a uniform mean rate\n",
    "- these are defined by their mean (or expected value), λ\n",
    "- mean = variance\n",
    "- given 2 PDs, X and Y with respective means x and y, X + Y has mean x + y. assumes independent X and Y\n",
    "- $ X \\sim P(λ) $\n",
    "\n",
    "### Geometric\n",
    "- trials until success\n",
    "- all trials have a probability p of success\n",
    "- $ X \\sim Geo(p) $\n",
    "\n",
    "### Discrete Uniform\n",
    "- Specific case of a DRV"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "class BinomialDistribution:\n",
    "    n: int\n",
    "    p: F\n",
    "    \n",
    "    def __init__(self, n: int, p: F):\n",
    "        self.n, self.p = n, p\n",
    "\n",
    "    def expectation(self):\n",
    "        return self.n * self.p\n",
    "\n",
    "    def variance(self):\n",
    "        return self.n * self.p * NOT(self.p)\n",
    "    \n",
    "    def P(self, x: int):\n",
    "        return combinations(self.n, x) * self.p**x * NOT(self.p)**(self.n - x)\n",
    "\n",
    "class PoissonDistribution:\n",
    "    u: F\n",
    "\n",
    "    def __init__(self, u: F):\n",
    "        self.u = u\n",
    "\n",
    "    def expectation(self):\n",
    "        return self.u\n",
    "\n",
    "    def variance(self):\n",
    "        return self.u\n",
    "\n",
    "    def P(self, x: int):\n",
    "        return math.e**-self.u * self.u**x / math.factorial(x)\n",
    "\n",
    "class GeometricDistribution:\n",
    "    p: F\n",
    "\n",
    "    def __init__(self, p: F):\n",
    "        self.p = p\n",
    "    \n",
    "    def expectation(self):\n",
    "        return 1 / self.p\n",
    "\n",
    "    def variance(self):\n",
    "        return (1 - self.p) / self.p**2\n",
    "    \n",
    "    def P(self, x: int):\n",
    "        return self.p * NOT(self.p)**(x - 1)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Chi-squared Tests\n",
    "\n",
    "Chi squared stat:\n",
    "\n",
    "$ \\frac{(observed - expected)^2}{expected} $\n",
    "\n",
    "### Distribution Test\n",
    "Expected values are calculated by distribution.\n",
    "\n",
    "### Independence Test\n",
    "Expected values are calculated assuming independence using row and column totals.\n",
    "\n",
    "= $ \\frac{rowTotal \\times columnTotal}{total} $"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "def chi2_stat(observed: list[int], expected: list[int]) -> int:\n",
    "    return sum([\n",
    "        (obs - exp)**2 / exp\n",
    "        for obs, exp in zip(observed, expected)\n",
    "    ])\n",
    "\n",
    "def independent_expected(observed: list[list[int]]) -> list[list[int]]:\n",
    "    row_totals = [sum(row) for row in observed]\n",
    "    col_totals = [sum(col) for col in zip(*observed)]\n",
    "    total = sum(row_totals)\n",
    "\n",
    "    return [\n",
    "        [\n",
    "            row_totals[x] * col_totals[y] / total\n",
    "            for y in range(len(observed[0]))\n",
    "        ]\n",
    "        for x in range(len(observed))\n",
    "    ]\n",
    "\n",
    "def flatten(l: list[any]) -> list[any]:\n",
    "    return list(np.array(l).flatten())\n",
    "\n",
    "def chi2_critical_value(significance_level: float, degrees_of_freedom: int) -> float:\n",
    "    return chi2.ppf(1 - significance_level, df=degrees_of_freedom)\n",
    "\n",
    "def degrees_of_freedom(values: list[list[int]]) -> int:\n",
    "    return (len(values) - 1) * (len(values[0]) - 1)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Bivariate Data\n",
    "\n",
    "### Product Moment Correlation\n",
    "$ r = \\frac{\\sum{(x_i - \\bar{x})(y_i - \\bar{y})}}{\\sqrt{\\sum{(x_i - \\bar{x})^2} \\times \\sum{(y_i - \\bar{y})^2}}} $\n",
    "- $ -1 < r < 1 $\n",
    "- positive $ r $: positive correlation\n",
    "- negative $ r $: negative correlation\n",
    "- $ r = 0 $: no correlation\n",
    "\n",
    "### Spearman's Rank Correlation\n",
    "$ r_s = 1 - \\frac{6\\sum{(x_i - y_i)^2}}{n(n^2 - 1)} $\n",
    "- used when:\n",
    "  - data is given in a ranked form\n",
    "  - data is not from a bivariate normal distribution (is not linear)\n",
    "- $ -1 < r_s < 1 $\n",
    "- positive $ r_s $: positive correlation (not necessarily linear)\n",
    "- negative $ r_s $: negative correlation (not necessarily linear)\n",
    "- $ r_s = 0 $: no correlation\n",
    "\n",
    "### Linear Regression\n",
    "$ y = \\bar{y} - b\\bar{x} + bx $ where $ b = \\frac{\\sum{(x_i - \\bar{x})(y_i - \\bar{y})}}{\\sum{(x_i - \\bar{x})^2}} $"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "def product_moment_cc(x: list[int], y: list[int]) -> F:\n",
    "    x_avg = F(sum(x), len(x))\n",
    "    y_avg = F(sum(y), len(y))\n",
    "    return sum([\n",
    "        (x_i - x_avg) * (y_i - y_avg)\n",
    "        for x_i, y_i in zip(x, y)\n",
    "    ]) / math.sqrt(sum([\n",
    "        (x_i - x_avg)**2\n",
    "        for x_i in x\n",
    "    ]) * sum([\n",
    "        (y_i - y_avg)**2\n",
    "        for y_i in y\n",
    "    ]))\n",
    "\n",
    "def spearman_rank_cc(x: list[int], y: list[int]) -> F:\n",
    "    n = len(x)\n",
    "    return 1 - F(6 * sum([\n",
    "        (x_i - y_i)**2\n",
    "        for x_i, y_i in zip(x, y)\n",
    "    ]), n * (n**2 - 1))"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.2 (default, Feb 28 2021, 17:03:44) \n[GCC 10.2.1 20210110]"
  },
  "orig_nbformat": 4,
  "vscode": {
   "interpreter": {
    "hash": "31f2aee4e71d21fbe5cf8b01ff0e069b9275f58929596ceb00d14d90e3e16cd6"
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}