What is the difference between forward and reverse mode automatic differentiation?
TL;DR: Mathematically, forward and reverse mode differentiation differ only in the order in which we compute a sequence of matrix products. In practice, reverse mode differentiation is more complicated and should be reserved for functions with many inputs and few outputs. Otherwise, use forward mode differentiation.
Consider a computer program that maps an input $x_0$ to an output $x_T$ through a sequence of intermediate states
$$x_t = f_t(x_{t-1}), \quad t = 1, \dots, T.$$
We are interested in computing the output Jacobian (the "full gradient")
$$J = \frac{\partial x_T}{\partial x_0} = J_T J_{T-1} \cdots J_1,$$
where $J_t = f_t'(x_{t-1})$ denotes the Jacobian matrix of $f_t$ evaluated at $x_{t-1}$.
By associativity of matrix multiplication, i.e.
$$J_T \left( J_{T-1} \left( \cdots \left( J_2 J_1 \right) \right) \right) = \left( \left( \left( J_T J_{T-1} \right) J_{T-2} \right) \cdots \right) J_1,$$
we are free to choose the order in which the product is evaluated.
The first form, where we evaluate the expression from right to left, is called forward mode differentiation. The second form, where we evaluate the expression from left to right, is called reverse mode differentiation. In both cases, when we evaluate the full gradient, we do not need to build all these Jacobian matrices explicitly (which could be costly for large sizes). Instead, it is sufficient to be able to compute the Jacobian-vector products for forward mode and vector-Jacobian products for reverse mode.
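To make the cost difference concrete, here is a small sketch in Julia, with plain random matrices standing in for the Jacobians $J_t$ (the sizes are made up for illustration): both orders give the same Jacobian, but the sizes of the intermediate products differ.

```julia
using LinearAlgebra

J1 = randn(10, 1000)   # Jacobian of the first step (many inputs)
J2 = randn(10, 10)     # Jacobian of the middle step
J3 = randn(1, 10)      # Jacobian of the last step (scalar output)

# Forward mode: evaluate right to left, starting from the input side.
J_forward = J3 * (J2 * J1)   # intermediate J2 * J1 has size 10 x 1000

# Reverse mode: evaluate left to right, starting from the output side.
J_reverse = (J3 * J2) * J1   # intermediate J3 * J2 has size 1 x 10

J_forward ≈ J_reverse        # true: both orders give the same Jacobian
```

With a 1000-dimensional input and a scalar output, the reverse grouping only ever multiplies small row vectors, while the forward grouping carries a 10 × 1000 block through every step.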
Forward mode differentiation
The pushforward function (or Jacobian-vector product) of a differentiable function $f$ at a point $x$ is defined as
$$\operatorname{pushforward}(f, x, \dot{x}) = f'(x) \, \dot{x},$$
where $f'(x)$ is the Jacobian of $f$ at $x$ and $\dot{x}$ is a tangent vector ("seed") of the same size as $x$.
Since both the gradients and the states can be computed at the same time, a sequential program can be differentiated in a single forward sweep:
- Initialize the state $x_0$ and the seed $\dot{x}_0$, given as an identity matrix whose size matches the dimension of $x_0$.
- For $t = 1, \dots, T$, compute
  $$x_t = f_t(x_{t-1}), \qquad \dot{x}_t = J_t \dot{x}_{t-1} = \operatorname{pushforward}(f_t, x_{t-1}, \dot{x}_{t-1}).$$
  Now $x_{t-1}$ and $\dot{x}_{t-1}$ are no longer needed and can be discarded.
- Return the full gradient $\dot{x}_T = J_T J_{T-1} \cdots J_1$.
Here we assume that the computer has access to the pushforward functions of all the elemental functions $f_t$ that make up the program.
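As a minimal sketch of the loop above, assuming each program step comes with a pushforward function `(x, ẋ) -> (f_t(x), J_t * ẋ)` (the names `forward_mode` and `pushforwards` are illustrative, not from any particular AD package):

```julia
# Hedged sketch of the forward-mode loop. Each element of `pushforwards` is
# assumed to be a function (x, ẋ) -> (f_t(x), J_t * ẋ), one per program step.
function forward_mode(pushforwards, x0, ẋ0)
    x, ẋ = x0, ẋ0
    for pf in pushforwards
        x, ẋ = pf(x, ẋ)   # the previous x and ẋ can now be discarded
    end
    return x, ẋ           # output x_T and full gradient ẋ_T
end

# Example: the program x -> 3 * sin(x), split into two steps, at x0 = 2.0.
pushforwards = [
    (x, ẋ) -> (sin(x), cos(x) * ẋ),
    (x, ẋ) -> (3 * x, 3 * ẋ),
]
forward_mode(pushforwards, 2.0, 1.0)   # (3sin(2), 3cos(2))
```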
At first glance, forward mode differentiation seems simple and efficient. It does not require storing all previous states. However, for use cases where the input of the program is high-dimensional and the output is low-dimensional, it can be computationally expensive: the seed $\dot{x}_0$ is then an identity matrix with as many columns as there are inputs, and every one of those columns has to be pushed through the program. This is often the case in deep learning, where the input $x_0$ contains all the network weights (often millions of them) and the output $x_T$ is a single scalar loss value.
Forward mode AD in 15 lines of Julia code
A pleasant way of implementing forward mode AD is to define a dual number type that carries both a value and its derivative (tangent):
import Base: +, * # The functions + and * are not "our" functions
struct Dual
    x     # primal value
    xdot  # tangent (derivative) value
end
Dual(x::Real) = Dual(x, 0) # Convert real number to dual with xdot=0
gradient(f, x) = f(Dual(x, 1)).xdot # Gradient is obtained with seed xdot=1
+(a::Dual, b::Dual) = Dual(a.x + b.x, a.xdot + b.xdot)
+(a::Dual, b::Real) = a + Dual(b)
+(a::Real, b::Dual) = Dual(a) + b
*(a::Dual, b::Dual) = Dual(a.x * b.x, a.xdot * b.x + a.x * b.xdot) # Product rule
*(a::Dual, b::Real) = a * Dual(b)
*(a::Real, b::Dual) = Dual(a) * b
gradient(x -> x * x * x, 4) # Returns 48
gradient(x -> 3x + 5, 2) # Returns 3
Furthermore, when inspecting the generated LLVM code, we see that the compiler is able to infer the return value 3 at compile time (even before the code is run with the input value 2!):
@code_llvm gradient(x -> 3x + 5, 2)
; Function Signature: gradient(Main.var"#31#32", Int64)
; @ REPL[15]:1 within `gradient`
define i64 @julia_gradient_2898(i64 signext %"x::Int64") #0 {
top:
; ┌ @ REPL[29]:1 within `#31`
; │┌ @ REPL[10]:1 within `+` @ REPL[17]:1 @ int.jl:87
ret i64 3
; └└
}
Gradients of large programs can thus be expected to generate similarly efficient code.
Reverse mode differentiation
The pullback function (or vector-Jacobian product) of a differentiable function $f$ at a point $x$ is defined as
$$\operatorname{pullback}(f, x, \bar{x}) = f'(x)^{\mathsf{T}} \bar{x},$$
where $\bar{x}$ is an adjoint (cotangent) vector of the same size as the output $f(x)$.
For convenience, we also define the (partially applied) pullback function at a given state $x_{t-1}$:
$$\operatorname{pullback}_t(\bar{x}) = \operatorname{pullback}(f_t, x_{t-1}, \bar{x}) = J_t^{\mathsf{T}} \bar{x}.$$
The canonical reverse mode differentiation algorithm is implemented using two for-loops, a forward pass to compute (and store!) the states and a backward pass to compute the gradients (a sketch in code follows the list):
- Assign the initial state $x_0$.
- Forward pass: for $t = 1, \dots, T$, compute
  $$x_t = f_t(x_{t-1})$$
  and store the pullback $\operatorname{pullback}_t$ for later use (this may require storing the full state $x_{t-1}$ so that $\operatorname{pullback}_t$ can be called).
- Initialize the final adjoint seed $\bar{x}_T$ as an identity matrix of the same size as the output vector $x_T$ (typically $x_T$ and $\bar{x}_T$ are both scalars).
- Backward pass: for decreasing $t = T, \dots, 1$, compute
  $$\bar{x}_{t-1} = \operatorname{pullback}_t(\bar{x}_t) = J_t^{\mathsf{T}} \bar{x}_t.$$
  Now $\bar{x}_t$ and $\operatorname{pullback}_t$ are no longer needed and can be discarded.
- Return the full gradient $\bar{x}_0 = \left( J_T J_{T-1} \cdots J_1 \right)^{\mathsf{T}}$.
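A minimal sketch of this two-pass scheme, under the assumption that each step is given as a function `x -> (f_t(x), pullback_t)`; again, the names are illustrative rather than taken from an existing package:

```julia
# Hedged sketch of the two-pass reverse-mode loop. Each element of `steps` is
# assumed to be a function x -> (f_t(x), pullback_t), with pullback_t(x̄) = Jₜ' * x̄.
function reverse_mode(steps, x0, x̄T)
    # Forward pass: compute the states and store the pullbacks.
    x = x0
    pullbacks = []
    for step in steps
        x, pb = step(x)
        push!(pullbacks, pb)   # the stored pullback may capture the state (memory!)
    end
    # Backward pass: propagate the adjoint in reverse order.
    x̄ = x̄T
    for pb in reverse(pullbacks)
        x̄ = pb(x̄)
    end
    return x, x̄               # output x_T and full gradient x̄_0
end

# Same example program as before: x -> 3 * sin(x) at x0 = 2.0, with seed x̄T = 1.0.
steps = [
    x -> (sin(x), x̄ -> cos(x) * x̄),
    x -> (3 * x, x̄ -> 3 * x̄),
]
reverse_mode(steps, 2.0, 1.0)   # (3sin(2), 3cos(2))
```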
For high-dimensional inputs and low-dimensional outputs, reverse mode differentiation is the method of choice. However, the two for-loops running in opposite order ("forward" and "back-propagation") and the requirement to store all the intermediate states of the forward pass can cause quite some headaches (and computer memory issues). Multiple strategies exist to mitigate these issues:
- Checkpointing: if storing all the states $x_0, \dots, x_{T-1}$ takes up too much space, we can store only every 10th state and recompute the 9 missing states between the current and the next stored state when they are needed in the backward pass (see the sketch after this list).
- Reverse accumulation (for the bravehearted): do a forward pass to compute $x_T$, but do not store any intermediate states. In the backward pass, recompute $x_{t-1} = f_t^{-1}(x_t)$ alongside $\bar{x}_{t-1}$. However, the program components $f_t$ may be badly conditioned or even non-invertible, in which case this method should not be used.
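A minimal sketch of the checkpointing strategy, reusing the `x -> (f_t(x), pullback_t)` step format from the sketch above and storing only every `k`-th state (the function name and interface are illustrative):

```julia
# Hedged sketch of checkpointed reverse mode. Only every k-th state is stored;
# missing states are recomputed from the nearest checkpoint during the backward pass.
function reverse_mode_checkpointed(steps, x0, x̄T; k = 10)
    T = length(steps)
    checkpoints = Dict(0 => x0)        # stored states x_t, keyed by t
    x = x0
    for t in 1:T                       # forward pass: discard the pullbacks
        x, _ = steps[t](x)
        t % k == 0 && (checkpoints[t] = x)
    end
    y = x                              # program output x_T
    x̄ = x̄T
    for t in T:-1:1                    # backward pass
        s = k * div(t - 1, k)          # nearest stored checkpoint at or before t - 1
        xs = checkpoints[s]
        for r in (s + 1):(t - 1)       # recompute the missing states
            xs, _ = steps[r](xs)
        end
        _, pb = steps[t](xs)           # rebuild pullback_t at the state x_{t-1}
        x̄ = pb(x̄)
    end
    return y, x̄
end

reverse_mode_checkpointed(steps, 2.0, 1.0; k = 1)   # same result as reverse_mode above
```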
How "automatic" is automatic differentiation?
AD engines work by decomposing programs into mathematical building blocks for which derivative rules are known, such as addition, multiplication, and the exponential. Consider for example the power function $x \mapsto x^n$, which is probably implemented as some variant of repeated multiplication,
$$x^n = x \cdot x \cdots x.$$
Differentiating this implementation means applying the product rule over and over, producing a sum of $n$ terms, where each term equals $x^{n-1} \dot{x}$, which, if the coefficients are properly merged, gives the same expression as the textbook rule $(x^n)' = n x^{n-1}$.
WARNING
Numerically speaking, this rule is actually wrong for certain edge cases: at $x = 0$ with $n = 0$, for instance, $n x^{n-1}$ evaluates to $0 \cdot 0^{-1} = \mathrm{NaN}$ in floating point, while the true derivative is $0$. Even simple textbook rules therefore require some care.
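To make this concrete, here is a sketch of how such a dedicated (merged) rule could be attached to the Dual type defined above. It is an illustration only, subject to the warning above, and not a complete or numerically robust rule:

```julia
import Base: ^   # extend ^ for our Dual type

# A dedicated (merged) power rule, so that x^n uses n * x^(n-1) directly instead
# of applying the product rule n - 1 times. Not numerically safe for all x and n.
^(a::Dual, n::Integer) = Dual(a.x^n, n * a.x^(n - 1) * a.xdot)

gradient(x -> x^3, 4)   # Returns 48, same as gradient(x -> x * x * x, 4)
```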
Similar arguments can be made for larger algorithms, such as solving a linear system $A x = b$: instead of letting the AD engine differentiate through every internal operation of the solver, we can register a derivative rule for the solve as a whole.
For forward mode, it is probably fine not to implement the rule for iterative solvers. For reverse mode differentiation of a linear solve using the conjugate gradient method for a symmetric positive definite matrix $A$, the pullback $\bar{b} = A^{-\mathsf{T}} \bar{x}$ can be computed by calling the very same conjugate gradient solver on the adjoint $\bar{x}$, since $A^{\mathsf{T}} = A$.
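As a hedged sketch of this idea (the names `solve`, `solve_pullback`, and the hand-rolled `cg` are illustrative, not from a specific package): because $A$ is symmetric, the adjoint system is solved with the very same conjugate gradient routine as the forward solve.

```julia
using LinearAlgebra

# Minimal conjugate gradient solver for a symmetric positive definite matrix A.
function cg(A, b; tol = 1e-10, maxiter = length(b))
    x = zero(b)
    r = b - A * x
    p = copy(r)
    rs = dot(r, r)
    for _ in 1:maxiter
        Ap = A * p
        α = rs / dot(p, Ap)
        x = x + α * p
        r = r - α * Ap
        rs_new = dot(r, r)
        sqrt(rs_new) < tol && break
        p = r + (rs_new / rs) * p
        rs = rs_new
    end
    return x
end

solve(A, b) = cg(A, b)   # forward computation: x = A⁻¹ b

# Pullback for the whole solve: given the adjoint x̄ of the solution x, return (Ā, b̄).
# In general b̄ = A⁻ᵀ x̄ and Ā = -b̄ x'; since A is symmetric, A⁻ᵀ = A⁻¹, so the
# adjoint system can be solved with the very same cg routine.
function solve_pullback(A, b, x, x̄)
    b̄ = cg(A, x̄)
    Ā = -b̄ * x'
    return Ā, b̄
end

A = [4.0 1.0; 1.0 3.0]   # symmetric positive definite
b = [1.0, 2.0]
x = solve(A, b)
Ā, b̄ = solve_pullback(A, b, x, [1.0, 0.0])   # adjoints for the first solution component
```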
Finally, the gradient of a program that solves a (partial) differential equation might (or might not) be better computed by first deriving a mathematical equation for the gradient of the exact continuous solution and then discretizing it, instead of discretizing the equation and then differentiating the discretization.
Conclusion
While many programs can be differentiated using a naive AD engine that only knows generic rules for elementary functions, large programs that are costly to evaluate should be analyzed for potential performance improvements. This is especially important for reverse mode differentiation, where it can also be rewarding to choose a suitable checkpointing strategy. Users should also consider whether they are interested in the exact gradient of the numerical implementation, or whether the gradient of the mathematical function the program approximates can be computed more efficiently.
See also
- ChainRules.jl documentation: Many nice explanations
- The SciML book: A book on scientific machine learning, including AD of differential equations
- Automatic differentiation from scratch: Nice example of forward mode AD in Julia