Logistic regression: Another example using Python 3 with numpy and matplotlib.

The goal of logistic regression in the context of machine learning is to develop a model capable of predicting a categorical output ($y$) given a set of input features ($X$). Like linear regression, training this model involves determining a set of parameters (or weights) $\theta$ that combine the input features linearly; in logistic regression this linear combination is then passed through a sigmoid function to produce the model's output. Choosing these parameters is an optimization problem in which we look for a minimum of a loss or cost function ($J(\theta)$) in the $\theta$ parameter space. The basic way to approach this problem numerically is to implement the batch gradient descent algorithm.

Below are the vectorized forms of the logistic regression hypothesis $h_\theta(X)$, the cost function $J(\theta)$, and the gradient descent algorithm. Logistic regression classifies samples based on a categorical or discrete output variable $y \in \{0, 1\}$. The hypothesis therefore maps $\theta^T X$ to a number between 0 and 1, which is treated as the probability that the sample has the output value 1. Usually a decision threshold of 0.5 is used: when $h_\theta(X) \geq 0.5$ we predict $y = 1$, and when $h_\theta(X) < 0.5$ we predict $y = 0$.

Hypothesis:

$0 \leq h_\theta(X) \leq 1$

$h_\theta(X) = g(\theta^T X)$

$g(z) = \frac{1}{1 + e^{-z}}$

Cost function:

$J(\theta) = -\frac{1}{m}\left(\log(g(X\theta))^T y + \log(1 - g(X\theta))^T (1 - y)\right)$

Gradient descent:

$\theta := \theta - \frac{\alpha}{m}X^T(g(X\theta) - \vec{y})$

Where $X$ is the feature matrix containing $m$ examples, $\theta$ is a column vector of model parameters, $\vec{y}$ is the column vector of outputs, and $\alpha$ is the learning rate.
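
To make the dimensions concrete, here is a minimal shape check on a small hypothetical dataset (separate from the example below): with $m$ examples and $n$ features plus a leading column of ones for the bias term, $X$ is $m \times (n + 1)$, $\theta$ is $(n + 1) \times 1$, and both $X\theta$ and $g(X\theta)$ are $m \times 1$, matching $\vec{y}$.

import numpy as np

# Hypothetical toy data: m = 4 examples, n = 2 features plus a bias column
X_toy = np.hstack([np.ones((4, 1)), np.random.rand(4, 2)])  # shape (4, 3)
theta_toy = np.zeros((3, 1))                                # shape (3, 1)
y_toy = np.array([[0], [0], [1], [1]])                      # shape (4, 1)

z = X_toy @ theta_toy          # shape (4, 1): one value of theta^T x per example
print(z.shape == y_toy.shape)  # True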

In [1]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
# Only necessary if you have a HiDPI display
%config InlineBackend.figure_format = 'retina'
# Persistent random data
np.random.seed(0)

Define functions for each equation and algorithm.

In [2]:
def sigmoid(z):
    """Apply the sigmoid function element-wise to the
    input array z."""
    return 1 / (1 + np.exp(-z))

def calculate_cost(theta, X, y):
    """Compute the logistic regression cost J(theta) for the
    parameters theta, feature matrix X, and output vector y."""
    m = X.shape[0]
    h = sigmoid(X * theta)
    J = - (1 / m) * (np.log(h).T * y + np.log(1 - h).T * (1 - y))
    return J

def gradient_descent(initial_theta, X, y, alpha=0.1, maxiter=100, tol=0.1, return_costs=False):
    """Minimize J(theta) by batch gradient descent, starting from
    initial_theta and stopping when the cost drops below tol or
    after maxiter iterations. If return_costs is True, also return
    the list of cost values at each iteration."""
    m = X.shape[0]
    i = 0
    theta = initial_theta
    costs = []
    while i < maxiter:
        # Vectorized update: theta := theta - (alpha / m) * X^T (g(X theta) - y)
        theta = theta - (alpha / m) * X.T * (sigmoid(X * theta) - y)
        cost = calculate_cost(theta, X, y)
        costs.append(float(cost))
        if cost <= tol:
            if return_costs:
                return theta, costs
            else:
                return theta
        i = i + 1
    if return_costs:
        return theta, costs
    else:
        return theta
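
As a quick sanity check (not part of the original example), the sigmoid should return 0.5 at $z = 0$ and approach 0 and 1 for large negative and positive inputs:

print(sigmoid(np.array([-10, 0, 10])))
# Approximately [4.54e-05, 0.5, 0.99995]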

Next, we define an initial guess for $\theta$ (all zeros), $X$ (built from two groups called $X_a$ and $X_b$ here), and $y$.

In [3]:
n_pts = 100
m = 2

bias = np.ones(n_pts)
Xa = np.array([bias,
               np.random.normal(10, 2, n_pts),
               np.random.normal(12, 2, n_pts)])
Xb = np.array([bias,
               np.random.normal(5, 2, n_pts),
               np.random.normal(6, 2, n_pts)])

initial_theta = np.matrix([np.zeros(m + 1)]).T
X = np.append(Xa, Xb, axis=1).T
y = np.matrix(np.append(np.zeros(n_pts), np.ones(n_pts))).T

fig, ax = plt.subplots(figsize=(4,4))
ax.scatter(X[:n_pts,1], X[:n_pts,2], color='lightcoral',
          label='$y = 0$')
ax.scatter(X[n_pts:,1], X[n_pts:,2], color='lightblue',
          label='$y = 1$')
ax.set_title('Sample Dataset')
ax.set_xlabel('$x_1$')
ax.set_ylabel('$x_2$')
ax.legend(loc=4);

Run the gradient descent algorithm.

In [4]:
theta, costs = gradient_descent(initial_theta, X, y, alpha=0.01, maxiter=10000, return_costs=True)
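
With the fitted parameters we can apply the 0.5 decision threshold described above to classify the training points. This is a minimal sketch (not part of the original example) that reuses the theta, X, and y defined above:

# Predict y = 1 whenever h_theta(x) >= 0.5, otherwise y = 0
h = np.asarray(sigmoid(X * theta))    # predicted probabilities, shape (200, 1)
predictions = (h >= 0.5).astype(int)
accuracy = np.mean(predictions == np.asarray(y))
print(f'Training accuracy: {accuracy:.2f}')
# Expect the accuracy to be high but below 1, since the two groups overlap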

Now let's visualize the results.

In [5]:
fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(8,4))

# Decision boundary: theta_0 + theta_1 * x1 + theta_2 * x2 = 0,
# solved for x2 as a function of x1
x1 = np.array([X[:,1].min()-1, X[:,1].max()+1])
x2 = - theta.item(0) / theta.item(2) + x1 * (- theta.item(1) / theta.item(2))

# Plot decision boundary
ax[0].plot(x1, x2, color='k', ls='--', lw=2)

ax[0].scatter(X[:n_pts,1], X[:n_pts,2], color='lightcoral', label='$y = 0$')
ax[0].scatter(X[n_pts:,1], X[n_pts:,2], color='lightblue', label='$y = 1$')
ax[0].set_title('$x_1$ vs. $x_2$')
ax[0].set_xlabel('$x_1$')
ax[0].set_ylabel('$x_2$')
ax[0].legend(loc=4)

ax[1].plot(costs, color='r')
ax[1].set_ylim(0,ax[1].get_ylim()[1])
ax[1].set_title(r'$J(\theta)$ vs. Iteration')
ax[1].set_xlabel('Iteration')
ax[1].set_ylabel(r'$J(\theta)$')

fig.tight_layout()

Our resulting model seems to classify most of the points correctly; however, some points close to the border of the two groups are misclassified. This happens for two reasons: 1) the groups overlap, so there isn't a clear separation between them; and 2) logistic regression can only construct a linear boundary between the two groups. There are a few ways to get around this. One is to introduce polynomial features (such as $x_{1}^2$ or $x_{2}^2$), as sketched below. Another would be to use a different classification algorithm capable of producing more complex boundaries, such as a neural network or a support vector classifier.
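
As a rough illustration of the polynomial-feature option (a minimal sketch, not part of the example above, reusing the X, y, and gradient_descent defined earlier), one could append squared feature columns to the feature matrix and rerun gradient descent; the decision boundary in the original $x_1$, $x_2$ space would then be a curve rather than a straight line:

# Append hypothetical polynomial feature columns x1^2 and x2^2 to X
X_poly = np.hstack([X, X[:, 1:2]**2, X[:, 2:3]**2])
initial_theta_poly = np.matrix(np.zeros((X_poly.shape[1], 1)))
theta_poly = gradient_descent(initial_theta_poly, X_poly, y,
                              alpha=0.01, maxiter=10000)
# Note: with the larger-magnitude squared features, feature scaling
# (e.g. standardizing each column) would likely be needed for
# gradient descent to converge well at this learning rate.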