5  Regression

Author

Abhijith M S

Published

March 24, 2025

5.1 Linear Regression

  • Many engineering and scientific problems aim to determine relationships between variables.

    • Example: In a chemical process, the relationship between output, temperature, and catalyst amount is of interest.
    • Knowing this relationship allows predicting outputs for different temperature and catalyst values.
  • Typically, there is a single response variable \(Y\) (dependent variable) that depends on a set of input variables \(x_1, x_2, \dots, x_r\) (independent variables).

  • The simplest relationship between \(Y\) and \(x_1, x_2, \dots, x_r\) is a linear relationship: \[ Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_r x_r \tag{5.1}\]

    • Here, \(\beta_0, \beta_1, \dots, \beta_r\) are constants.
  • In practice, exact predictions are rarely possible due to random errors.

    • The relationship is better expressed as: \[ Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_r x_r + e \tag{5.2}\]

      • \(e\) represents the random error, assumed to have a mean of \(0\).
  • Alternatively, the relationship can be expressed in terms of the expected value: \[ E[Y \mid \mathbf{x}] = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_r x_r \]

    • \(\mathbf{x} = (x_1, x_2, \dots, x_r)\) is the set of independent variables.

    • \(E[Y \mid \mathbf{x}]\) is the expected response given the inputs \(\mathbf{x}\).

  • Equation 5.2 is called a linear regression equation.

    • It describes the regression of \(Y\) on the independent variables \(x_1, x_2, \dots, x_r\).

    • The quantities \(\beta_0, \beta_1, \dots, \beta_r\) are called regression coefficients, and must usually be estimated from a set of data.

  • A regression equation can be classified based on the number of independent variables:

    • Simple regression equation: Contains a single independent variable (\(r = 1\)).
    • Multiple regression equation: Contains more than one independent variable (\(r > 1\)).
  • A simple linear regression model assumes a linear relationship between the mean response and a single independent variable.

    • It is expressed as: \[ Y = \alpha + \beta x + e \]

      • \(x\) is the value of the independent variable (also called the input level).

      • \(Y\) is the response.

      • \(e\) represents the random error, assumed to be a random variable with mean \(0\).

      • Regression analysis is used to find the mean value of the dependent variable \(Y\) given the independent variables \(\mathbf{x}\).

        • The mean value of \(Y\) given \(\mathbf{x}\) is denoted as \(E[Y \mid \mathbf{x}]\).

        • By estimating the regression coefficients \(\beta_0, \beta_1, \dots, \beta_r\), we can predict the mean response for any given set of input variables.

        • This is particularly useful in understanding the central tendency of the response variable and making informed decisions based on the predicted mean.

import numpy as np
import matplotlib.pyplot as plt

# Set random seed for reproducibility
np.random.seed(42)

# Parameters for the linear regression model
alpha = 2.0  # Intercept
beta = 1.5   # Slope
num_samples = 100  # Number of data points

# Generate synthetic data
x = np.linspace(0, 10, num_samples)  # Independent variable (input)
e = np.random.normal(0, 1, num_samples)  # Random error with mean 0
Y = alpha + beta * x + e  # Dependent variable (response)

# Plot the data points
plt.scatter(x, Y, color='blue', label='Data Points')

# Plot the true regression line (without error)
plt.plot(x, alpha + beta * x, color='red', label='True Regression Line')

# Add labels and title
plt.xlabel('Independent Variable (x)')
plt.ylabel('Dependent/ Response Variable (Y)')
plt.title('Simple Linear Regression Demonstration')
plt.legend()

# Show the plot
plt.grid(True)
plt.show()

5.1.1 Least Squares Estimators Of The Regression Parameters

  • As mentioned previously, a simple linear regression model assumes a linear relationship between the mean response and a single independent variable.

    • It is expressed as: \[ Y = \alpha + \beta x + e \]

      • \(x\) is the value of the independent variable (also called the input level).

      • \(Y\) is the response.

      • \(e\) represents the random error, assumed to be a random variable with mean \(0\).

  • Suppose the responses \(Y_i\) corresponding to the input values \(x_i\) for \(i = 1, \dots, n\) are observed and used to estimate \(\alpha\) and \(\beta\) in a simple linear regression model.

    • Let \(A\) be the estimator of \(\alpha\) and \(B\) be the estimator of \(\beta\).
    • The estimated response for the input \(x_i\) is \(A + Bx_i\).
    • The difference between the actual response \(Y_i\) and the estimated response, \((Y_i - A - Bx_i)\), is called the residual.
  • The sum of squares of residuals or residual sum of squares (SSR) is given by: \[ SSR = \sum_{i=1}^n (Y_i - A - Bx_i)^2 \]

  • The method of least squares chooses \(A\) and \(B\) to minimize \(SSR\).

    • To find the minimizing values, differentiate \(SSR\) with respect to \(A\) and \(B\): \[ \frac{\partial SSR}{\partial A} = -2 \sum_{i=1}^n (Y_i - A - Bx_i) \] \[ \frac{\partial SSR}{\partial B} = -2 \sum_{i=1}^n x_i (Y_i - A - Bx_i) \]

    • Setting the partial derivatives to zero yields the normal equations: \[ \sum_{i=1}^n Y_i = nA + B \sum_{i=1}^n x_i \] \[ \sum_{i=1}^n x_i Y_i = A \sum_{i=1}^n x_i + B \sum_{i=1}^n x_i^2 \]

  • Let \(\overline{Y} = \frac{\sum_{i=1}^n Y_i}{n}\) and \(\overline{x} = \frac{\sum_{i=1}^n x_i}{n}\).

    • The first normal equation can be rewritten as: \[ A = \overline{Y} - B \overline{x} \tag{5.3}\]
  • Substituting \(A = \overline{Y} - B \overline{x}\) into the second normal equation: \[ \sum_{i=1}^n x_i Y_i = (\overline{Y} - B \overline{x}) n \overline{x} + B \sum_{i=1}^n x_i^2 \]

    Simplifying: \[ B \left( \sum_{i=1}^n x_i^2 - n \overline{x}^2 \right) = \sum_{i=1}^n x_i Y_i - n \overline{x} \overline{Y} \]

    Solving for \(B\):

    \[ B = \frac{\sum_{i=1}^n x_i Y_i - n \overline{x} \overline{Y}}{\sum_{i=1}^n x_i^2 - n \overline{x}^2} \]

  • Using Equation 5.3 and the fact that \(n \overline{x} = \sum_{i=1}^n x_i\), we obtain the estimators for \(\alpha\) and \(\beta\).

Least Squares Estimators
  • The least squares estimators of \(\beta\) and \(\alpha\) for the data set \((x_i, Y_i)\), where \(i = 1, \dots, n\), are given by:
    • The estimator for \(\beta\) (\(B\)): \[ B = \frac{\sum_{i=1}^n x_i Y_i - \overline{x} \sum_{i=1}^n Y_i}{\sum_{i=1}^n x_i^2 - n \overline{x}^2} \]

    • The estimator for \(\alpha\) (\(A\)): \[ A = \overline{Y} - B \overline{x} \]

    • Here:

      • \(\overline{x} = \frac{\sum_{i=1}^n x_i}{n}\) is the mean of the independent variable \(x\).
      • \(\overline{Y} = \frac{\sum_{i=1}^n Y_i}{n}\) is the mean of the dependent variable \(Y\).
Problem

The raw material used in the production of a certain synthetic fiber is stored in a location without humidity control. Measurements of the relative humidity in the storage location and the moisture content of a sample of the raw material were taken over 15 days, resulting in the following data (in percentages):

Relative Humidity (%) 46 53 29 61 36 39 47 49 52 38 55 32 57 54 44
Moisture Content (%) 12 15 7 17 10 11 11 12 14 9 16 8 18 14 12

Use linear regression to express moisture content as a function of the relative humidity.
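
The following is a minimal sketch of a solution, applying the least squares estimator formulas above directly to this data rather than relying on a library routine:

    import numpy as np

    # Relative humidity (x) and moisture content (Y) from the table above
    x = np.array([46, 53, 29, 61, 36, 39, 47, 49, 52, 38, 55, 32, 57, 54, 44], dtype=float)
    Y = np.array([12, 15, 7, 17, 10, 11, 11, 12, 14, 9, 16, 8, 18, 14, 12], dtype=float)

    n = len(x)
    x_bar = x.mean()
    Y_bar = Y.mean()

    # Least squares estimators:
    # B = (sum(x_i Y_i) - n*x_bar*Y_bar) / (sum(x_i^2) - n*x_bar^2),  A = Y_bar - B*x_bar
    B = (np.sum(x * Y) - n * x_bar * Y_bar) / (np.sum(x ** 2) - n * x_bar ** 2)
    A = Y_bar - B * x_bar

    print(f"Estimated slope B: {B}")
    print(f"Estimated intercept A: {A}")
    print(f"Fitted line: moisture = {A:.3f} + {B:.3f} * humidity")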

The Coefficient of Determination and the Sample Correlation Coefficient

  • Suppose we want to measure the amount of variation in the set of response values \(Y_1, \dots, Y_n\) corresponding to the input values \(x_1, \dots, x_n\).

    • A standard measure of variation in statistics is given by: \[ SST = \sum_{i=1}^n (Y_i - \overline{Y})^2 \]

    (SST stands for ‘Total Sum of Squares’.)

    • Here, \(\overline{Y} = \frac{\sum_{i=1}^n Y_i}{n}\) is the mean of the response values.

    • If all \(Y_i\) are equal (i.e., \(Y_i = \overline{Y}\) for all \(i\)), then \(SST = 0\).

  • The variation in the response values \(Y_i\) arises from two factors:

    1. Different input values: The input values \(x_i\) are different, leading to different mean values for the responses \(Y_i\).

    2. Inherent variance: Even after accounting for the input values, each response \(Y_i\) has a variance \(\sigma^2\) and will not exactly equal the predicted value at its input \(x_i\).

  • To determine how much variation is due to the different input values and how much is due to inherent variance, consider:

    • The quantity: \[ SSR = \sum_{i=1}^n (Y_i - A - Bx_i)^2 \]

      measures the remaining variation in the response values after accounting for the input values.

    • The difference: \[ SST - SSR \]

      represents the variation in the response values explained by the different input values.

      • The coefficient of determination (\(R^2\)) is defined as the proportion of the total variation in the response values that is explained by the input values: \[ R^2 = \frac{SST - SSR}{SST} \]

        • \(R^2\) ranges from 0 to 1.
        • An \(R^2\) value of 1 indicates that the regression model perfectly explains the variation in the response values.
        • An \(R^2\) value of 0 indicates that the regression model does not explain any of the variation in the response values.
      • The sample correlation coefficient (\(r\)) measures the strength and direction of the linear relationship between the independent variable \(x\) and the dependent variable \(Y\): \[ r = \frac{\sum_{i=1}^n (x_i - \overline{x})(Y_i - \overline{Y})}{\sqrt{\sum_{i=1}^n (x_i - \overline{x})^2 \sum_{i=1}^n (Y_i - \overline{Y})^2}} \]

        • Its magnitude satisfies \(|r| = \sqrt{R^2}\), and the sign of \(r\) is the sign of the slope estimator \(B\).
        • \(r\) ranges from -1 to 1.
        • An \(r\) value of 1 indicates a perfect positive linear relationship.
        • An \(r\) value of -1 indicates a perfect negative linear relationship.
        • An \(r\) value of 0 indicates no linear relationship.
Problem

Find the coefficient of determination and the sample correlation coefficient of the previous regression problem.

    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import r2_score
    import matplotlib.pyplot as plt

    # Data for relative humidity and moisture content
    humidity = np.array([46, 53, 29, 61, 36, 39, 47, 49, 52, 38, 55, 32, 57, 54, 44]).reshape(-1, 1)
    moisture = np.array([12, 15, 7, 17, 10, 11, 11, 12, 14, 9, 16, 8, 18, 14, 12])

    # Fit the linear regression model
    model = LinearRegression().fit(humidity, moisture)

    # Print the coefficients
    print(f"Intercept: {model.intercept_}")
    print(f"Coefficient: {model.coef_[0]}")

    # Predict the moisture content
    moisture_pred = model.predict(humidity)

    # Calculate the coefficient of determination (R^2)
    R_squared = r2_score(moisture, moisture_pred)
    print(f"Coefficient of Determination (R^2): {R_squared}")

    # Calculate the sample correlation coefficient (r)
    correlation_coefficient = np.sqrt(R_squared)
    print(f"Sample Correlation Coefficient (r): {correlation_coefficient}")

    # Plot the data points
    plt.scatter(humidity, moisture, color='blue', label='Data Points')

    # Plot the regression line
    plt.plot(humidity, moisture_pred, color='red', label='Regression Line')

    # Add labels and title
    plt.xlabel('Relative Humidity (%)')
    plt.ylabel('Moisture Content (%)')
    plt.title('Linear Regression: Moisture Content vs. Relative Humidity')
    plt.legend()

    # Show the plot
    plt.grid(True)
    plt.show()
Intercept: -2.5104576516877177
Coefficient: 0.32320356181404014
Coefficient of Determination (R^2): 0.9113639730826797
Sample Correlation Coefficient (r): 0.9546538498757965

5.1.2 Multiple Linear Regression

  • Overview:
    • In most applications, the response \(Y\) of an experiment is better predicted using multiple independent input variables rather than a single one.

    • A typical scenario involves \(k\) input variables, and the response \(Y\) is related to them by the equation: \[ Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + e \]

      where:

      • \(x_j\) is the level of the \(j^{th}\) input variable (\(j = 1, \dots, k\)),
      • \(e\) is a random error term, assumed to be normally distributed with mean \(0\) and constant variance \(\sigma^2\).
  • Parameters:
    • The parameters \(\beta_0, \beta_1, \dots, \beta_k\) and \(\sigma^2\) are unknown and must be estimated from the data.
    • The data consists of observed values \(Y_1, Y_2, \dots, Y_n\), where each \(Y_i\) corresponds to a set of input levels \(x_{i1}, x_{i2}, \dots, x_{ik}\).
  • Expected Value:
    • The expected value of \(Y_i\) is given by: \[ E[Y_i] = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_k x_{ik} \]

Determining the Least Squares Estimators

  • Least Squares Estimation:
    • Let \(B_0, B_1, \dots, B_k\) denote estimators of \(\beta_0, \beta_1, \dots, \beta_k\).

    • The sum of squared differences between the observed \(Y_i\) and their estimated expected values is: \[ \sum_{i=1}^n (Y_i - B_0 - B_1 x_{i1} - B_2 x_{i2} - \cdots - B_k x_{ik})^2 \]

    • The least squares estimators are the values of \(B_0, B_1, \dots, B_k\) that minimize the above sum of squared differences.

  • Method:
    • To determine the least squares estimators \(B_0, B_1, \dots, B_k\), we take partial derivatives of the sum of squared differences with respect to each estimator:
      • First, with respect to \(B_0\),
      • Then, with respect to \(B_1\),
      • And so on, up to \(B_k\).
    • By setting these partial derivatives equal to \(0\), we obtain a system of \(k + 1\) equations.
  • System of Equations:
    • The partial derivatives yield the following equations: \[ \sum_{i=1}^n (Y_i - B_0 - B_1 x_{i1} - B_2 x_{i2} - \cdots - B_k x_{ik}) = 0 \] \[ \sum_{i=1}^n x_{i1} (Y_i - B_0 - B_1 x_{i1} - \cdots - B_k x_{ik}) = 0 \] \[ \sum_{i=1}^n x_{i2} (Y_i - B_0 - B_1 x_{i1} - \cdots - B_k x_{ik}) = 0 \] \[ \vdots \] \[ \sum_{i=1}^n x_{ik} (Y_i - B_0 - B_1 x_{i1} - \cdots - B_k x_{ik}) = 0 \]
  • Normal Equations:
    • Rewriting the above equations, the least squares estimators \(B_0, B_1, \dots, B_k\) satisfy the following normal equations: \[ \sum_{i=1}^n Y_i = n B_0 + B_1 \sum_{i=1}^n x_{i1} + B_2 \sum_{i=1}^n x_{i2} + \cdots + B_k \sum_{i=1}^n x_{ik} \] \[ \sum_{i=1}^n x_{i1} Y_i = B_0 \sum_{i=1}^n x_{i1} + B_1 \sum_{i=1}^n x_{i1}^2 + B_2 \sum_{i=1}^n x_{i1} x_{i2} + \cdots + B_k \sum_{i=1}^n x_{i1} x_{ik} \] \[ \vdots \] \[ \sum_{i=1}^n x_{ik} Y_i = B_0 \sum_{i=1}^n x_{ik} + B_1 \sum_{i=1}^n x_{ik} x_{i1} + B_2 \sum_{i=1}^n x_{ik} x_{i2} + \cdots + B_k \sum_{i=1}^n x_{ik}^2 \]
    • These equations form a linear system that can be solved to find the least squares estimators \(B_0, B_1, \dots, B_k\).

5.1.3 Normal Equations in Matrix Notation

  • Normal Equations:
    • The normal equations for the least squares estimators \(B_0, B_1, \dots, B_k\) are given by: \[ \sum_{i=1}^n Y_i = n B_0 + B_1 \sum_{i=1}^n x_{i1} + B_2 \sum_{i=1}^n x_{i2} + \cdots + B_k \sum_{i=1}^n x_{ik} \] \[ \sum_{i=1}^n x_{i1} Y_i = B_0 \sum_{i=1}^n x_{i1} + B_1 \sum_{i=1}^n x_{i1}^2 + B_2 \sum_{i=1}^n x_{i1} x_{i2} + \cdots + B_k \sum_{i=1}^n x_{i1} x_{ik} \] \[ \vdots \] \[ \sum_{i=1}^n x_{ik} Y_i = B_0 \sum_{i=1}^n x_{ik} + B_1 \sum_{i=1}^n x_{ik} x_{i1} + B_2 \sum_{i=1}^n x_{ik} x_{i2} + \cdots + B_k \sum_{i=1}^n x_{ik}^2 \]
  • Matrix Notation:
    • To simplify the solution of the normal equations, we introduce matrix notation.
    • Let:
      • \(Y\) be the response vector: \[ Y = \begin{bmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{bmatrix} \]
      • \(X\) be the design matrix: \[ X = \begin{bmatrix} 1 & x_{11} & x_{12} & \cdots & x_{1k} \\ 1 & x_{21} & x_{22} & \cdots & x_{2k} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n1} & x_{n2} & \cdots & x_{nk} \end{bmatrix} \]

      • \(\beta\) be the parameter vector: \[ \beta = \begin{bmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_k \end{bmatrix} \]

      • \(e\) be the error vector: \[ e = \begin{bmatrix} e_1 \\ e_2 \\ \vdots \\ e_n \end{bmatrix} \]

5.1.4 Matrix Representation of Multiple Regression

  • Dimensions:
    • \(Y\) is an \(n \times 1\) matrix (response vector),
    • \(X\) is an \(n \times p\) matrix (design matrix), where \(p = k + 1\),
    • \(\beta\) is a \(p \times 1\) matrix (parameter vector),
    • \(e\) is an \(n \times 1\) matrix (error vector).
  • Multiple Regression Model:
    • The multiple regression model can be written in matrix form as: \[ Y = X \beta + e \]
  • Least Squares Estimators:
    • Let \(B\) be the matrix of least squares estimators: \[ B = \begin{bmatrix} B_0 \\ B_1 \\ \vdots \\ B_k \end{bmatrix} \]
    • The normal equations can be written in matrix form as: \[ X^T Y = X^T X B \] where:
      • \(X^T\) is the transpose of the design matrix \(X\),
      • \(X^T X\) is a \(p \times p\) matrix,
      • \(X^T Y\) is a \(p \times 1\) vector.
  • Solution:
    • The least squares estimators \(B\) can be obtained by solving the matrix equation: \[ B = (X^T X)^{-1} X^T Y \] where \((X^T X)^{-1}\) is the inverse of the matrix \(X^T X\).
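
As a short illustration with synthetic data (the parameter values below are made up), the estimators can be obtained by solving the normal equations \(X^T X B = X^T Y\) directly; np.linalg.solve is used here instead of forming the explicit inverse, which is generally more stable numerically:

    import numpy as np

    # Synthetic data with k = 2 input variables (illustrative values only)
    np.random.seed(0)
    n = 50
    x1 = np.random.uniform(0, 10, n)
    x2 = np.random.uniform(0, 5, n)
    Y = 1.0 + 2.0 * x1 - 0.5 * x2 + np.random.normal(0, 1, n)

    # Design matrix X with a leading column of ones (n x p, where p = k + 1)
    X = np.column_stack([np.ones(n), x1, x2])

    # Solve the normal equations X^T X B = X^T Y for B
    B = np.linalg.solve(X.T @ X, X.T @ Y)
    print(f"Least squares estimators (B_0, B_1, B_2): {B}")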

Residuals and Sum of Squared Residuals (SSR)

  • Residuals:
    • Let \(r_i\) denote the \(i^{th}\) residual, which is the difference between the observed response \(Y_i\) and the predicted value from the regression model: \[ r_i = Y_i - B_0 - B_1 x_{i1} - B_2 x_{i2} - \cdots - B_k x_{ik}, \quad i = 1, \dots, n \]

    • The residual vector \(r\) is defined as: \[ r = \begin{bmatrix} r_1 \\ r_2 \\ \vdots \\ r_n \end{bmatrix} \]

    • In matrix notation, the residual vector can be expressed as: \[ r = Y - XB \] where:

      • \(Y\) is the response vector,
      • \(X\) is the design matrix,
      • \(B\) is the matrix of least squares estimators.
  • Sum of Squared Residuals (SSR):
    • The sum of squared residuals (SSR) is given by: \[ SSR = \sum_{i=1}^n r_i^2 \]

    • In matrix notation, the SSR can be written as: \[ SSR = r^T r \] where \(r^T\) is the transpose of the residual vector \(r\).

    • Substituting \(r = Y - XB\), we get: \[ SSR = (Y - XB)^T (Y - XB) \]

    • Expanding the matrix product: \[ SSR = [Y^T - (XB)^T](Y - XB) \]

      \[ SSR = Y^T Y - Y^T XB - B^T X^T Y + B^T X^T XB \]

  • Simplification Using Normal Equations:
    • From the normal equations, we know: \[ X^T XB = X^T Y \]
    • Substituting this into the expression for SSR: \[ SSR = Y^T Y - Y^T XB - B^T X^T Y + B^T X^T XB \] \[ SSR = Y^T Y - Y^T XB \] (since \(B^T X^T Y = B^T X^T XB\) from the normal equations).
  • Transpose Property:
    • Since \(Y^T XB\) is a scalar (a \(1 \times 1\) matrix), it is equal to its transpose: \[ Y^T XB = (Y^T XB)^T = B^T X^T Y \]
    • Therefore, the expression for SSR simplifies further to: \[ SSR = Y^T Y - B^T X^T Y \]
  • Final Computational Formula for SSR:
    • The sum of squared residuals (SSR) can be computed using the formula: \[ SSR = Y^T Y - B^T X^T Y \]
    • This formula is computationally efficient but requires careful handling to avoid roundoff errors.
  • Interpretation:
    • The term \(Y^T Y = \sum_{i=1}^n Y_i^2\) is the total (uncentered) sum of squares of the response values.
    • The term \(B^T X^T Y\) is the portion of that sum of squares accounted for by the fitted regression model.
    • The difference \(Y^T Y - B^T X^T Y\) gives the unexplained variation (SSR), which is minimized in the least squares method.
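
A brief numerical check (again using made-up synthetic data) that the computational formula agrees with the definition \(SSR = r^T r\):

    import numpy as np

    # Synthetic data with two input variables (illustrative values only)
    np.random.seed(0)
    n = 50
    x1 = np.random.uniform(0, 10, n)
    x2 = np.random.uniform(0, 5, n)
    Y = 1.0 + 2.0 * x1 - 0.5 * x2 + np.random.normal(0, 1, n)
    X = np.column_stack([np.ones(n), x1, x2])

    # Least squares estimators from the normal equations
    B = np.linalg.solve(X.T @ X, X.T @ Y)

    # SSR from the residual vector r = Y - XB
    r = Y - X @ B
    SSR_from_residuals = r @ r

    # SSR from the computational formula Y^T Y - B^T X^T Y
    SSR_from_formula = Y @ Y - B @ (X.T @ Y)

    print(f"SSR from residuals: {SSR_from_residuals}")
    print(f"SSR from formula:   {SSR_from_formula}")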

Coefficient of Multiple Determination (\(R^2\))

  • Definition:
    • The coefficient of multiple determination, denoted by \(R^2\), measures the proportion of the total variation in the response variable \(Y\) that is explained by the regression model.
    • It is defined as: \[ R^2 = 1 - \frac{SSR}{\sum_{i=1}^n (Y_i - \overline{Y})^2} \] where:
      • \(SSR\) is the sum of squared residuals (unexplained variation),
      • \(\sum_{i=1}^n (Y_i - \overline{Y})^2\) is the total sum of squares (SST) (total variation in \(Y\)),
      • \(\overline{Y}\) is the mean of the observed response values \(Y_i\).
  • Range of \(R^2\):
    • \(R^2\) ranges between \(0\) and \(1\):
      • \(R^2 = 0\): The regression model explains none of the variation in \(Y\).
      • \(R^2 = 1\): The regression model explains all of the variation in \(Y\).
  • Formula in Terms of SST and SSR:
    • The formula for \(R^2\) can also be written as: \[ R^2 = \frac{SST - SSR}{SST} \]

      where:

      • \(SST = \sum_{i=1}^n (Y_i - \overline{Y})^2\) is the total sum of squares,
      • \(SSR = \sum_{i=1}^n (Y_i - \hat{Y}_i)^2\) is the sum of squared residuals.
  • Usefulness:
    • \(R^2\) is a useful measure of the goodness of fit of the regression model.
    • A higher \(R^2\) indicates that the model explains a larger proportion of the variation in the response variable \(Y\).
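
Continuing the same kind of synthetic example, \(R^2\) can be computed from SST and SSR as sketched below:

    import numpy as np

    # Synthetic data with two input variables (illustrative values only)
    np.random.seed(0)
    n = 50
    x1 = np.random.uniform(0, 10, n)
    x2 = np.random.uniform(0, 5, n)
    Y = 1.0 + 2.0 * x1 - 0.5 * x2 + np.random.normal(0, 1, n)
    X = np.column_stack([np.ones(n), x1, x2])

    # Fit the model and compute the residual and total sums of squares
    B = np.linalg.solve(X.T @ X, X.T @ Y)
    SSR = np.sum((Y - X @ B) ** 2)        # unexplained variation
    SST = np.sum((Y - Y.mean()) ** 2)     # total variation

    R_squared = (SST - SSR) / SST
    print(f"Coefficient of multiple determination (R^2): {R_squared}")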

Probabilistic Perspective of Linear Regression

  • Overview:
    • Linear regression can also be viewed from a probabilistic perspective.
    • In this view, the response variable \(Y\) is considered a random variable with a probability distribution that depends on the input variables \(x_1, x_2, \dots, x_r\).
  • Assumptions:
    • The response variable \(Y\) is normally distributed with mean \(\mu_Y\) and variance \(\sigma^2\).
    • The mean \(\mu_Y\) is a linear function of the input variables: \[ \mu_Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_r x_r \]
    • The variance \(\sigma^2\) is constant and does not depend on the input variables.
  • Model:
    • The linear regression model can be written as: \[ Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_r x_r + \epsilon \] where \(\epsilon\) is a random error term that follows a normal distribution with mean \(0\) and variance \(\sigma^2\): \[ \epsilon \sim N(0, \sigma^2) \]
  • Likelihood Function:
    • Given a set of observed data points \((x_i, Y_i)\) for \(i = 1, \dots, n\), the likelihood function represents the probability of observing the data given the parameters \(\beta_0, \beta_1, \dots, \beta_r\) and \(\sigma^2\).
    • The likelihood function is given by: \[ L(\beta_0, \beta_1, \dots, \beta_r, \sigma^2) = \prod_{i=1}^n \frac{1}{\sqrt{2 \pi \sigma^2}} \exp \left( -\frac{(Y_i - \beta_0 - \beta_1 x_{i1} - \dots - \beta_r x_{ir})^2}{2 \sigma^2} \right) \]
  • Log-Likelihood Function:
    • The log-likelihood function is the natural logarithm of the likelihood function: \[ \log L(\beta_0, \beta_1, \dots, \beta_r, \sigma^2) = -\frac{n}{2} \log (2 \pi \sigma^2) - \frac{1}{2 \sigma^2} \sum_{i=1}^n (Y_i - \beta_0 - \beta_1 x_{i1} - \dots - \beta_r x_{ir})^2 \]
  • Maximum Likelihood Estimation (MLE):
    • The parameters \(\beta_0, \beta_1, \dots, \beta_r\) and \(\sigma^2\) can be estimated by maximizing the log-likelihood function.
    • The maximum likelihood estimators (MLEs) of \(\beta_0, \beta_1, \dots, \beta_r\) are the same as the least squares estimators.
    • The MLE of \(\sigma^2\) is given by: \[ \hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^n (Y_i - \hat{Y}_i)^2 \] where \(\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_{i1} + \dots + \hat{\beta}_r x_{ir}\) are the predicted values from the regression model.
  • Inference:
    • The probabilistic perspective allows for statistical inference about the regression parameters.
    • Confidence intervals and hypothesis tests can be constructed for the parameters \(\beta_0, \beta_1, \dots, \beta_r\) and \(\sigma^2\).
    import numpy as np
    from scipy.stats import norm

    # Generate synthetic data
    np.random.seed(42)
    n = 100
    x = np.linspace(0, 10, n)
    beta_0 = 2.0
    beta_1 = 1.5
    sigma = 1.0
    epsilon = np.random.normal(0, sigma, n)
    Y = beta_0 + beta_1 * x + epsilon

    # Maximum Likelihood Estimation (MLE) for beta_0, beta_1, and sigma^2
    X = np.vstack([np.ones(n), x]).T

    beta_hat = np.linalg.inv(X.T @ X) @ X.T @ Y
    
    Y_hat = X @ beta_hat
    sigma_hat = np.sqrt(np.sum((Y - Y_hat) ** 2) / n)

    print(f"Estimated beta_0: {beta_hat[0]}")
    print(f"Estimated beta_1: {beta_hat[1]}")
    print(f"Estimated sigma^2: {sigma_hat ** 2}")

    # Plot the data and the fitted line
    import matplotlib.pyplot as plt

    plt.scatter(x, Y, color='blue', label='Data Points')
    plt.plot(x, Y_hat, color='red', label='Fitted Line')
    plt.xlabel('Independent Variable (x)')
    plt.ylabel('Dependent Variable (Y)')
    plt.title('Probabilistic Linear Regression')
    plt.legend()
    plt.grid(True)
    plt.show()
Estimated beta_0: 1.8271871459226277
Estimated beta_1: 1.5137932673366563
Estimated sigma^2: 0.8149047134980785

5.2 Logistic Regression

  • Overview:
    • Logistic regression is used when the response variable is categorical, typically binary (0 or 1).
    • It models the probability that a given input point belongs to a particular category.
  • Logistic Function:
    • The logistic function (also called the sigmoid function) is used to model the probability: \[ P(Y = 1 \mid \mathbf{x}) = \frac{1}{1 + \exp(-(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_r x_r))} \]
  • Logistic Regression Model:
    • The logistic regression model can be written as: \[ \log \left( \frac{P(Y = 1 \mid \mathbf{x})}{1 - P(Y = 1 \mid \mathbf{x})} \right) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_r x_r \]
    • Here, the left-hand side is the log-odds (logit) of the probability of the response being 1.
  • Estimation of Parameters:
    • The parameters \(\beta_0, \beta_1, \dots, \beta_r\) are estimated using the method of maximum likelihood.
    • The likelihood function for logistic regression, writing \(p_i = P(Y_i = 1 \mid \mathbf{x}_i)\), is given by: \[ L(\beta_0, \beta_1, \dots, \beta_r) = \prod_{i=1}^n p_i^{Y_i} (1 - p_i)^{1 - Y_i} \]
  • Log-Likelihood Function:
    • The log-likelihood function is: \[ \log L(\beta_0, \beta_1, \dots, \beta_r) = \sum_{i=1}^n \left[ Y_i \log p_i + (1 - Y_i) \log (1 - p_i) \right] \]
  • Fitting the Model:
    • The parameters are estimated by maximizing the log-likelihood function using numerical optimization techniques.
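
A minimal sketch of this fitting procedure, assuming synthetic binary data and using scipy.optimize.minimize on the negative log-likelihood (libraries such as scikit-learn, used further below, perform an equivalent optimization internally):

    import numpy as np
    from scipy.optimize import minimize

    # Synthetic binary data (illustrative parameter values)
    np.random.seed(42)
    n = 100
    x = np.linspace(0, 10, n)
    true_beta = np.array([-5.0, 1.0])          # beta_0, beta_1
    X = np.column_stack([np.ones(n), x])       # design matrix with intercept column
    p_true = 1 / (1 + np.exp(-(X @ true_beta)))
    Y = np.random.binomial(1, p_true)

    def neg_log_likelihood(beta):
        """Negative log-likelihood of the logistic regression model."""
        p = 1 / (1 + np.exp(-(X @ beta)))
        p = np.clip(p, 1e-12, 1 - 1e-12)       # avoid log(0)
        return -np.sum(Y * np.log(p) + (1 - Y) * np.log(1 - p))

    # Maximize the log-likelihood by minimizing its negative
    result = minimize(neg_log_likelihood, x0=np.zeros(2), method='BFGS')
    print(f"Estimated (beta_0, beta_1): {result.x}")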

Checking the Goodness of Fit

5.2.1 Confusion Matrix, Accuracy, and ROC Curve

  • Confusion Matrix:

    • A confusion matrix is a table used to evaluate the performance of a classification model.
    • It summarizes the counts of true positive (TP), true negative (TN), false positive (FP), and false negative (FN) predictions.
    • The matrix helps in understanding the types of errors the model is making.
  • Accuracy:

    • Accuracy is a metric that measures the proportion of correct predictions made by the model.
    • It is calculated as the sum of true positives and true negatives divided by the total number of predictions: \[ \text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \]
    • While accuracy is a useful metric, it may not be sufficient for imbalanced datasets where one class is more frequent than the other.

    # Calculate sensitivity (recall) and specificity
    # (TN, FP, FN, TP are the counts from a 2x2 confusion matrix, e.g. conf_matrix.ravel())
    TN, FP, FN, TP = conf_matrix.ravel()
    sensitivity = TP / (TP + FN)
    specificity = TN / (TN + FP)

    print(f"Sensitivity (Recall): {sensitivity}")
    print(f"Specificity: {specificity}")

  • ROC Curve:

    • The Receiver Operating Characteristic (ROC) curve is a graphical representation of a classification model’s performance.
    • It plots the true positive rate (TPR) against the false positive rate (FPR) at various threshold settings.
    • The area under the ROC curve (AUC) is a single scalar value that summarizes the model’s ability to discriminate between positive and negative classes.
    • An AUC value of 1 indicates perfect classification, while an AUC value of 0.5 indicates no discriminative power (equivalent to random guessing).
  import numpy as np
  from sklearn.linear_model import LogisticRegression
  from sklearn.metrics import confusion_matrix, accuracy_score, roc_curve, auc
  import matplotlib.pyplot as plt

  # Generate synthetic data for logistic regression
  np.random.seed(42)
  n = 100
  x = np.linspace(0, 10, n)
  beta_0 = -5.0
  beta_1 = 1.0
  prob = 1 / (1 + np.exp(-(beta_0 + beta_1 * x)))
  Y = np.random.binomial(1, prob, n)

  # Reshape x for sklearn
  x = x.reshape(-1, 1)

  # Fit the logistic regression model
  model = LogisticRegression().fit(x, Y)

  # Predict probabilities
  Y_prob = model.predict_proba(x)[:, 1]

  # Predict class labels
  Y_pred = model.predict(x)

  # Calculate confusion matrix
  conf_matrix = confusion_matrix(Y, Y_pred)
  print("Confusion Matrix:")
  print(conf_matrix)

  # Calculate accuracy
  accuracy = accuracy_score(Y, Y_pred)
  print(f"Accuracy: {accuracy}")

  # Calculate ROC curve and AUC
  fpr, tpr, _ = roc_curve(Y, Y_prob)
  roc_auc = auc(fpr, tpr)
  print(f"AUC: {roc_auc}")

  # Plot ROC curve
  plt.figure()
  plt.plot(fpr, tpr, color='darkorange', lw=2, label=f'ROC curve (area = {roc_auc:.2f})')
  plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
  plt.xlim([0.0, 1.0])
  plt.ylim([0.0, 1.05])
  plt.xlabel('False Positive Rate')
  plt.ylabel('True Positive Rate')
  plt.title('Receiver Operating Characteristic (ROC) Curve')
  plt.legend(loc="lower right")
  plt.grid(True)
  plt.show()
Confusion Matrix:
[[47  5]
 [ 5 43]]
Accuracy: 0.9
AUC: 0.9647435897435896