Unit 1: Statistics & Data Analysis

What is Statistics?
Statistics is the science of collecting, organising & analysing data [cite: 1125, 1127, 1129]. It generally deals with the tabulation and interpretation of numerical data [cite: 1141, 1143].

Types of Statistics

  • Descriptive Statistics: Consists of organising & summarising data [cite: 1210, 1211, 1212]. Includes Measures of Central Tendency (Mean, Median, Mode) and Measures of Dispersion (Variance, Standard Deviation) [cite: 1214, 1215, 1216, 1217].
  • Inferential Statistics: Consists of using data you have measured to form conclusions about the population [cite: 1224, 1226]. Involves Hypothesis testing, Z-test, T-test, Chi-square, ANOVA [cite: 1228, 1229, 1230, 1231, 1232].

Formulas for Central Tendency

Sample Mean: The arithmetic average of the observed values [cite: 1442, 1443].

$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{x_1 + x_2 + ... + x_n}{n}$$ [cite: 1322]

Sample Standard Deviation (note the $n-1$ denominator, which corrects bias when estimating from a sample):

$$S = \sqrt{\frac{\sum (x - \bar{x})^2}{n-1}}$$ [cite: 1519]
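As a quick check, here is a minimal Python sketch of both formulas; the data values are made up for illustration:

```python
import math

data = [4, 8, 6, 5, 3, 7]  # made-up sample values

n = len(data)
mean = sum(data) / n  # x̄ = (x1 + x2 + ... + xn) / n
variance = sum((x - mean) ** 2 for x in data) / (n - 1)  # sample variance, n - 1 denominator
std = math.sqrt(variance)  # S = sqrt(sample variance)

print(mean, std)  # 5.5 1.8708...
```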

Hypothesis Testing

Hypothesis testing is a form of statistical inference that uses data from a sample to draw conclusions about a population parameter [cite: 1910, 1912, 1914]. A minimal worked test follows the list below.

  • Null Hypothesis ($H_0$): There is no significant difference between the specified populations [cite: 1888, 1891, 1892].
  • Alternative Hypothesis ($H_a$ or $H_1$): The sample observations are influenced by some non-random cause [cite: 1901, 1904, 1905].
  • Type I Error (probability $\alpha$): Rejecting $H_0$ when it is true [cite: 1998, 2013].
  • Type II Error (probability $\beta$): Failing to reject (accepting) $H_0$ when it is false [cite: 1998, 2013].
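
A minimal one-sample t-test sketch, assuming `scipy` is available; the sample values and the hypothesised mean of 5.0 are made up for illustration:

```python
from scipy import stats

sample = [5.1, 4.9, 5.3, 5.0, 4.8, 5.2]  # made-up measurements
mu0 = 5.0  # hypothesised population mean (H0: mu = 5.0)

t_stat, p_value = stats.ttest_1samp(sample, popmean=mu0)

alpha = 0.05  # significance level, the accepted probability of a Type I error
if p_value < alpha:
    print(f"p = {p_value:.3f} < {alpha}: reject H0")
else:
    print(f"p = {p_value:.3f} >= {alpha}: fail to reject H0")
```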

Unit 2: Linear Regression & ANOVA

Aim: To find the best-fit line with minimal error [cite: 42]. This is typically solved statistically by Ordinary Least Squares (OLS) [cite: 16, 17, 19].

Simple Linear Regression

Equation of a straight line:

$$Y = a + bX + \epsilon$$ [cite: 330]

Where:
$Y$ = Dependent Variable [cite: 331]
$X$ = Independent Variable [cite: 332]
$a$ = Intercept [cite: 333]
$b$ = Slope [cite: 334, 335]
$\epsilon$ = Residual Error [cite: 336, 337]
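
As a sketch (on made-up data), the OLS estimates can be computed directly from the standard closed-form formulas $b = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sum (x - \bar{x})^2}$ and $a = \bar{y} - b\bar{x}$:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])  # made-up independent variable X
y = np.array([2.1, 4.3, 6.2, 8.1, 9.9])  # made-up dependent variable Y

# OLS closed form for the slope and intercept
b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a = y.mean() - b * x.mean()

y_hat = a + b * x        # fitted line
residuals = y - y_hat    # epsilon, the residual errors
print(f"a = {a:.3f}, b = {b:.3f}")  # a = 0.300, b = 1.940
```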

Cost Function & Gradient Descent

The squared error cost function averages the squared errors over all $m$ training examples [cite: 100, 116]. We want to minimize $J(\theta_0, \theta_1)$ [cite: 85].

$$J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} (h_\theta(x_i) - y_i)^2$$ [cite: 115, 121]

Gradient Descent is the optimizer: repeat the update below (simultaneously for $j = 0, 1$) until convergence, where $\alpha$ is the learning rate [cite: 122, 123, 188].

$$\theta_j = \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1)$$ [cite: 191]
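
A minimal gradient-descent sketch for $h_\theta(x) = \theta_0 + \theta_1 x$ on the same made-up data as above; the learning rate and iteration count are illustrative assumptions, not values from the notes:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.3, 6.2, 8.1, 9.9])
m = len(x)

theta0, theta1 = 0.0, 0.0  # initial parameters
alpha = 0.05               # learning rate (assumed value)

for _ in range(2000):      # "repeat until convergence" loop
    h = theta0 + theta1 * x                 # hypothesis h_theta(x)
    # Partial derivatives of J(theta0, theta1) = (1/2m) * sum((h - y)^2)
    grad0 = (1 / m) * np.sum(h - y)
    grad1 = (1 / m) * np.sum((h - y) * x)
    theta0 -= alpha * grad0                 # simultaneous update
    theta1 -= alpha * grad1

print(f"theta0 = {theta0:.3f}, theta1 = {theta1:.3f}")  # converges to the OLS a, b above
```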

Overfitting & Underfitting

  • Overfitting: Low Bias, High Variance [cite: 254, 255, 266, 267]. Train accuracy is high (e.g., 90%) while test accuracy is low (e.g., 70%) [cite: 249, 250, 252, 256]. Addressed via Ridge and Lasso Regression [cite: 222, 223, 226]; see the sketch after this list.
  • Underfitting: High Bias [cite: 261, 270]; the model is too simple, so accuracy is poor on both train and test data [cite: 268, 269].
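
As a sketch of the overfitting remedy, assuming `scikit-learn` and `numpy` are available (the dataset and penalty strength are made up): Ridge adds a penalty on large coefficients, trading a little bias for lower variance.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 10))               # few samples, many features: easy to overfit
y = X[:, 0] + 0.1 * rng.normal(size=20)     # only the first feature actually matters

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)          # alpha controls the coefficient penalty

# Ridge shrinks the coefficients toward zero relative to plain OLS
print(np.abs(ols.coef_).sum(), np.abs(ridge.coef_).sum())
```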

Gauss-Markov Theorem (BLUE)

The Gauss-Markov Theorem states that the least squares estimator has the smallest variance (and hence the smallest Mean Squared Error) among all linear unbiased estimators [cite: 585, 586]. OLS is therefore the Best Linear Unbiased Estimator (BLUE) [cite: 565, 566, 605].
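
For reference, the standard Gauss-Markov conditions (not spelled out in these notes) on the errors of $Y = a + bX + \epsilon$ are zero mean, constant variance (homoscedasticity), and uncorrelated errors:

$$\mathbb{E}[\epsilon_i] = 0, \qquad \mathrm{Var}(\epsilon_i) = \sigma^2, \qquad \mathrm{Cov}(\epsilon_i, \epsilon_j) = 0 \; (i \neq j)$$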

ANOVA & ANCOVA

  • ANOVA: Analysis of Variance. Compares the means of 2 or more groups [cite: 642, 649, 650]. Assumes normality, absence of outliers, and homogeneity of variance [cite: 673, 675, 681]. See the sketch after this list.
  • ANCOVA: Analysis of Covariance. ANOVA + Covariate [cite: 383, 385, 387]. Provides a way of statistically controlling for the (linear) effect of extraneous variables one does not want to examine [cite: 391, 392, 393].
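
A minimal one-way ANOVA sketch, assuming `scipy` is available; the three groups and their scores are made up for illustration:

```python
from scipy import stats

# Made-up scores for three groups (e.g. three teaching methods)
group_a = [85, 88, 90, 86, 87]
group_b = [78, 82, 80, 79, 81]
group_c = [90, 92, 91, 89, 93]

# One-way ANOVA: H0 says all group means are equal
f_stat, p_value = stats.f_oneway(group_a, group_b, group_c)
print(f"F = {f_stat:.2f}, p = {p_value:.4f}")  # a small p-value rejects H0
```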