Single Neuron with Backpropagation

20-01-2026 · backpropagation · sigmoid · gradient-descent · mse-loss

Train a single sigmoid neuron end-to-end with manual gradients.

Problem

Implement the full training loop for a single sigmoid neuron. Given features, labels, initial weights, bias, a learning rate, and epoch count, perform forward passes, compute mean squared error, backpropagate analytic gradients for weights and bias, and update parameters each step. Return the final rounded parameters and a history of MSE values per epoch for quick plotting or inspection.

Code

import numpy as np
 
def sigmoid(x):
    return 1 / (1 + np.exp(-x))
 
def train_neuron(
    features: np.ndarray,
    labels: np.ndarray,
    initial_weights: np.ndarray,
    initial_bias: float,
    learning_rate: float,
    epochs: int,
) -> tuple[np.ndarray, float, list[float]]:
    weights = np.array(initial_weights, dtype=float)  # float copy so in-place updates work
    bias = float(initial_bias)
    features = np.array(features, dtype=float)
    labels = np.array(labels, dtype=float)
    mse_values = []
    mse_values = []
 
    for _ in range(epochs):
        z = np.dot(features, weights) + bias
        predictions = sigmoid(z)
        mse = np.mean((predictions - labels) ** 2)
        mse_values.append(round(mse, 4))
 
        # Backprop: residual times sigmoid derivative, shared by both gradients
        delta = (predictions - labels) * predictions * (1 - predictions)
        weight_gradients = (2 / len(labels)) * np.dot(features.T, delta)
        bias_gradient = (2 / len(labels)) * np.sum(delta)
 
        # Update weights and bias
        weights -= learning_rate * weight_gradients
        bias -= learning_rate * bias_gradient
 
    # Round once after training, so the loop stays cheap and the return
    # is well-defined even when epochs == 0
    return np.round(weights, 4), round(bias, 4), mse_values
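A quick way to sanity-check the loop is to train on a toy dataset and watch the per-epoch MSE history shrink. The sketch below restates the functions compactly so it runs on its own; the data, step size, and epoch count are arbitrary illustration values:

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def train_neuron(features, labels, initial_weights, initial_bias, learning_rate, epochs):
    weights = np.array(initial_weights, dtype=float)
    bias = float(initial_bias)
    features = np.array(features, dtype=float)
    labels = np.array(labels, dtype=float)
    mse_values = []
    for _ in range(epochs):
        predictions = sigmoid(features @ weights + bias)
        mse_values.append(round(float(np.mean((predictions - labels) ** 2)), 4))
        delta = (predictions - labels) * predictions * (1 - predictions)
        weights -= learning_rate * (2 / len(labels)) * (features.T @ delta)
        bias -= learning_rate * (2 / len(labels)) * delta.sum()
    return np.round(weights, 4), round(bias, 4), mse_values

# Three points, two features; a handful of gradient-descent steps
w, b, history = train_neuron(
    features=[[1.0, 2.0], [2.0, 1.0], [-1.0, -2.0]],
    labels=[1, 0, 0],
    initial_weights=[0.1, -0.2],
    initial_bias=0.0,
    learning_rate=0.1,
    epochs=10,
)
print(w, b)
print(history)  # one MSE per epoch; should trend downward for a small step size
```

With a small enough learning rate the history is monotonically non-increasing, which is the cheapest smoke test for a sign error in the update.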

Notes

  • Forward pass: $z = Xw + b$; predictions $\hat{y} = \sigma(z) = \frac{1}{1 + e^{-z}}$.
  • Loss (per epoch): mean squared error $\mathrm{MSE} = \frac{1}{N}\sum_i (\hat{y}_i - y_i)^2$.
  • Gradients:
    • Error term $e = \hat{y} - y$.
    • Sigmoid derivative $\sigma'(z) = \hat{y}(1 - \hat{y})$.
    • Chain rule: multiply the residual by the local derivative before projecting: $e \odot \sigma'(z)$ is what feeds both parameter gradients.
    • Weight gradient $\nabla_w = \frac{2}{N} X^\top \bigl(e \odot \sigma'(z)\bigr)$.
    • Bias gradient $\nabla_b = \frac{2}{N}\sum_i e_i\, \sigma'(z_i)$.
  • Update rule per epoch: $w \leftarrow w - \eta \nabla_w$, $b \leftarrow b - \eta \nabla_b$, where $\eta$ is the learning rate.
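The analytic gradients above can be verified numerically with central finite differences: perturb each parameter by a small $\epsilon$ and compare the resulting slope of the loss against the chain-rule result. A self-contained sketch (random data and $\epsilon = 10^{-6}$ are illustration choices):

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def mse(w, b, X, y):
    p = sigmoid(X @ w + b)
    return np.mean((p - y) ** 2)

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
y = rng.integers(0, 2, size=5).astype(float)
w = rng.normal(size=3)
b = 0.1

# Analytic gradients via the chain rule: e * sigma'(z) feeds both
p = sigmoid(X @ w + b)
delta = (p - y) * p * (1 - p)
grad_w = (2 / len(y)) * (X.T @ delta)
grad_b = (2 / len(y)) * delta.sum()

# Central finite differences: (L(theta + eps) - L(theta - eps)) / (2 * eps)
eps = 1e-6
num_w = np.zeros_like(w)
for i in range(len(w)):
    e = np.zeros_like(w)
    e[i] = eps
    num_w[i] = (mse(w + e, b, X, y) - mse(w - e, b, X, y)) / (2 * eps)
num_b = (mse(w, b + eps, X, y) - mse(w, b - eps, X, y)) / (2 * eps)

# Both gaps should be near zero if the analytic gradients are correct
print(np.max(np.abs(grad_w - num_w)), abs(grad_b - num_b))
```

This kind of gradient check is worth running once whenever the backward pass is written by hand; it catches dropped factors (like the $\frac{2}{N}$) and sign errors immediately.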