To find a local minimum of a function using gradient descent, one takes steps proportional to the negative of the gradient of the function at the current point.
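As a minimal sketch of this update rule (the quadratic function, starting point, and step size below are illustrative assumptions, not from the source):

```python
# Gradient descent sketch: minimize f(x) = (x - 3)^2, whose gradient is
# f'(x) = 2 * (x - 3). Each step moves opposite to the gradient.

def grad(x):
    return 2.0 * (x - 3.0)

x = 0.0              # assumed starting point
lr = 0.1             # learning rate (step size)
for _ in range(100):
    x -= lr * grad(x)   # step proportional to the NEGATIVE gradient

print(round(x, 4))   # prints 3.0, the local (here global) minimum
```

Since the error shrinks by a constant factor each step on this quadratic, the iterate converges to the minimizer at x = 3.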
Stochastic gradient descent (SGD)
A stochastic approximation of gradient descent for minimizing an objective function that is a sum of functions.
The true gradient is approximated by the gradient of a single, randomly chosen summand.
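A toy sketch of this approximation (the data, seed, and learning-rate schedule are assumptions chosen for illustration):

```python
import random

# SGD sketch: minimize f(x) = sum_i (x - a_i)^2, whose minimizer is the
# mean of the a_i. Each step uses the gradient of ONE randomly chosen
# term of the sum, not the full gradient.

data = [1.0, 2.0, 3.0, 6.0]          # toy data; the minimum is at mean = 3.0
x = 0.0
random.seed(0)
for step in range(2000):
    a = random.choice(data)          # pick a single summand at random
    lr = 0.5 / (step + 1)            # decaying learning rate
    x -= lr * 2.0 * (x - a)          # gradient of (x - a)^2 alone
```

With this 1/t decay the iterate is exactly the running mean of the sampled points, so it settles near 3.0 despite never seeing the full gradient.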
Initialization of a network
Usually, the biases of a neural network are set to zero, while the weights are initialized with independent and identically distributed zero-mean Gaussian noise.
The variance of the noise is chosen in such a way that the magnitudes of the input signals do not change drastically.
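A sketch of one such variance choice (the 1/fan-in scaling below is a common convention, assumed here rather than taken from the source):

```python
import math
import random

# Initialization sketch: zero biases, i.i.d. zero-mean Gaussian weights.
# With w ~ N(0, 1/n_in) and unit-scale inputs, the pre-activation
# sum_i w_i * x_i keeps roughly unit variance, so signal magnitudes
# are preserved from layer to layer.

def init_layer(n_in, n_out, rng):
    std = 1.0 / math.sqrt(n_in)   # variance 1/n_in (assumed scaling rule)
    weights = [[rng.gauss(0.0, std) for _ in range(n_in)]
               for _ in range(n_out)]
    biases = [0.0] * n_out        # biases are set to zero
    return weights, biases

rng = random.Random(0)
W, b = init_layer(256, 128, rng)
```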
Learning rate
The scalar by which the negative of the gradient is multiplied in gradient descent.
Backpropagation
An algorithm, relying on an iterative application of the chain rule, for efficiently computing the derivative of a neural network with respect to all of its parameters and feature vectors.
Objective function
The function being minimized in an optimization process, such as SGD.
Preprocessing of the input
The input to a neural network is often mean-subtracted, contrast-normalized, and whitened.
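A sketch of the first two steps (mean subtraction and per-feature scaling; full whitening, which also decorrelates features via the covariance matrix, is omitted here, and the helper name is an assumption):

```python
# Preprocessing sketch: subtract each feature's mean and divide by its
# standard deviation, so every column has zero mean and unit variance.

def standardize(rows):
    n = len(rows)
    d = len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    stds = [(sum((r[j] - means[j]) ** 2 for r in rows) / n) ** 0.5 or 1.0
            for j in range(d)]          # guard against zero-variance columns
    return [[(r[j] - means[j]) / stds[j] for j in range(d)] for r in rows]

X = standardize([[1.0, 10.0], [3.0, 30.0], [5.0, 50.0]])
```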
One-hot vector
A vector containing a one in a single entry and zeros elsewhere.
Cross-entropy
Commonly used to quantify the difference between two probability distributions. In the case of neural networks, one of the distributions is the output of the softmax, while the other is the one-hot vector corresponding to the correct class.
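A sketch of this loss (the raw scores below are assumptions; note that because the target is one-hot, the sum over classes collapses to a single negative log-probability):

```python
import math

# Cross-entropy sketch: softmax the network's raw scores, then take the
# negative log-probability of the correct class. The one-hot target
# selects exactly one term of the cross-entropy sum.

def softmax(scores):
    m = max(scores)                          # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def cross_entropy(scores, target_index):
    probs = softmax(scores)
    return -math.log(probs[target_index])

loss = cross_entropy([2.0, 1.0, 0.1], target_index=0)
```

The loss is small when the softmax puts high probability on the correct class and grows without bound as that probability goes to zero.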
Adversarial perturbation
A perturbation added to the input of the network or one of the feature vectors it computes.
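One common way to construct such a perturbation is the fast gradient sign method of Goodfellow et al.; the sketch below assumes the gradient of the loss with respect to the input has already been computed, and the values shown are illustrative:

```python
# Fast-gradient-sign sketch: nudge each input coordinate by +/- eps in
# the direction that INCREASES the loss, given the gradient of the loss
# with respect to the input.

def perturb(x, grad_wrt_x, eps=0.1):
    return [xi + eps * (1.0 if g > 0 else -1.0 if g < 0 else 0.0)
            for xi, g in zip(x, grad_wrt_x)]

x_adv = perturb([0.5, -0.2, 0.0], grad_wrt_x=[0.3, -0.8, 0.0])
```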