Activation Functions

22-01-2026 · activation-functions · neural-networks · deep-learning

Overview of common activation functions with their mathematical equations, derivatives, and Python implementations.

Sigmoid

(Sigmoid plot — image credits: https://www.geeksforgeeks.org/machine-learning/derivative-of-the-sigmoid-function/)

  • Mathematical function that maps any real-valued number into a value between 0 and 1
  • Non-linear activation function, which gives the network the capacity to model non-linear relationships in the data

Equation: $\sigma(x) = \frac{1}{1 + e^{-x}}$

Derivative: $\sigma'(x) = \sigma(x)(1 - \sigma(x))$

```python
import math

def sigmoid(z: float) -> float:
    # logistic function: maps any real z into the open interval (0, 1)
    result = 1 / (1 + math.exp(-z))
    return round(result, 4)
```
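The derivative can be computed directly from the identity $\sigma'(x) = \sigma(x)(1 - \sigma(x))$, reusing the sigmoid value rather than differentiating from scratch. A minimal sketch; `sigmoid_derivative` is an illustrative name, not part of the original:

```python
import math

def sigmoid(z: float) -> float:
    return 1 / (1 + math.exp(-z))

def sigmoid_derivative(z: float) -> float:
    # sigma'(z) = sigma(z) * (1 - sigma(z)); peaks at 0.25 when z = 0
    s = sigmoid(z)
    return s * (1 - s)
```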

Softmax

  • Mathematical function that converts a vector of real-valued scores into a probability distribution
  • Generalization of sigmoid to multiple classes, ensuring all outputs sum to 1
  • Non-linear activation function commonly used in the output layer for multi-class classification

Equation:

$$\text{softmax}(x_i) = \frac{e^{x_i}}{\sum_{j=1}^{n} e^{x_j}}$$

Derivative:

For the softmax function, the partial derivative of output $\text{softmax}(x_i)$ with respect to input $x_j$ is:

When $i = j$: $\frac{\partial\, \text{softmax}(x_i)}{\partial x_j} = \text{softmax}(x_i)(1 - \text{softmax}(x_i))$

When $i \neq j$: $\frac{\partial\, \text{softmax}(x_i)}{\partial x_j} = -\text{softmax}(x_i) \cdot \text{softmax}(x_j)$

```python
import numpy as np

def softmax(scores: list[float]) -> list[float]:
    # subtract the max score before exponentiating to avoid overflow
    # for large inputs; this shift does not change the result
    scores_np = np.array(scores)
    exps = np.exp(scores_np - np.max(scores_np))
    probabilities = exps / np.sum(exps)
    return probabilities.tolist()
```
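The two derivative cases combine into a single Jacobian matrix, $J_{ij} = s_i(\delta_{ij} - s_j)$ where $s = \text{softmax}(x)$. A minimal sketch using NumPy; `softmax_jacobian` is an illustrative name, not part of the original:

```python
import numpy as np

def softmax_jacobian(scores: list[float]) -> np.ndarray:
    # J[i][j] = s_i * (delta_ij - s_j): the diagonal holds the i = j
    # case, the off-diagonal entries the i != j case
    s = np.exp(np.array(scores) - np.max(scores))
    s /= s.sum()
    return np.diag(s) - np.outer(s, s)
```

Each row of the Jacobian sums to zero, reflecting that the softmax outputs are constrained to sum to 1.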

ReLU (Rectified Linear Unit)

  • Simple activation function that outputs the input directly if it is positive, otherwise outputs zero
  • Addresses the vanishing gradient problem common in sigmoid and tanh functions
  • Computationally efficient and helps with sparse activations in deep networks

Equation: $\text{ReLU}(x) = \max(0, x)$

Derivative:

When $x > 0$: $\text{ReLU}'(x) = 1$

When $x \leq 0$: $\text{ReLU}'(x) = 0$

```python
def relu(x: float) -> float:
    return max(0.0, x)
```
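The piecewise derivative above translates directly into code. A minimal sketch; `relu_derivative` is an illustrative name, not part of the original (note that the derivative at exactly $x = 0$ is undefined; returning 0 there is a common convention):

```python
def relu_derivative(x: float) -> float:
    # gradient is 1 on the positive side, 0 elsewhere
    # (0 at x = 0 by convention)
    return 1.0 if x > 0 else 0.0
```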

Leaky ReLU

  • Variant of ReLU that allows a small, non-zero gradient when the input is negative
  • Addresses the "dying ReLU" problem where neurons can become inactive and stop learning
  • Uses a small positive slope (typically 0.01) for negative inputs instead of zero
  • Non-linear activation function that maintains gradient flow even for negative inputs

Equation: $\text{LeakyReLU}(x) = \max(\alpha x, x)$

where α\alpha is a small positive constant (typically 0.01).

Derivative:

When $x > 0$: $\text{LeakyReLU}'(x) = 1$

When $x \leq 0$: $\text{LeakyReLU}'(x) = \alpha$

```python
def leaky_relu(x: float, alpha: float = 0.01) -> float:
    # for 0 < alpha < 1, max(alpha * x, x) picks x when x > 0
    # and alpha * x when x <= 0
    return max(alpha * x, x)
```
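As with ReLU, the derivative is piecewise constant, but the negative side keeps the small slope $\alpha$ instead of going to zero. A minimal sketch; `leaky_relu_derivative` is an illustrative name, not part of the original:

```python
def leaky_relu_derivative(x: float, alpha: float = 0.01) -> float:
    # slope is 1 for positive x and alpha for x <= 0, so the
    # gradient never vanishes entirely
    return 1.0 if x > 0 else alpha
```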