Analyses of Deep Learning (STATS 385)

Accepts as input:

feature vector of size $H \times W \times D$
probability $p$

Outputs another feature vector of the same size. At train time, each neuron in the output is independently set to the value of the corresponding input neuron with probability $p$, and to zero otherwise. At test time, the output feature vector equals the input scaled by $p$, which matches the expected value of the train-time output.
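The train and test behaviour described above (commonly called dropout) can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the function name `dropout`, the `train` flag and the `rng` argument are assumptions made for the example, and $p$ is treated as the keep probability, as in the text.

```python
import numpy as np

def dropout(x, p, train, rng=None):
    """Illustrative dropout layer; p is the probability of KEEPING a neuron."""
    if train:
        rng = np.random.default_rng() if rng is None else rng
        # Each neuron is kept independently with probability p, zeroed otherwise.
        mask = rng.random(x.shape) < p
        return x * mask
    # At test time, scale by p so the output equals the expected train-time output.
    return x * p

x = np.ones((4, 4, 3))  # hypothetical H x W x D feature vector
train_out = dropout(x, p=0.8, train=True, rng=np.random.default_rng(0))
test_out = dropout(x, p=0.8, train=False)
```

Many libraries instead rescale by $1/p$ at train time and leave the test-time output untouched; the two conventions agree in expectation up to that constant factor.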

Weight decay

Soft $L_2$ constraint on the parameters of the network, implemented as a penalty on their squared $L_2$ norm. In each iteration of SGD, every parameter is additionally decreased by its value times a small constant, which sets the strength of the regularization.
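As a rough sketch of the update just described, the step below applies plain SGD to a loss with an added penalty $\frac{\lambda}{2}\|w\|_2^2$. The names `sgd_step_with_weight_decay`, `lr` and `weight_decay` are hypothetical, chosen only for this example; here the per-step shrinkage constant is `lr * weight_decay`.

```python
import numpy as np

def sgd_step_with_weight_decay(w, grad, lr, weight_decay):
    """One SGD step on loss + (weight_decay / 2) * ||w||^2 (illustrative)."""
    # The L2 penalty adds weight_decay * w to the gradient, so every step
    # shrinks each parameter by lr * weight_decay times its current value.
    return w - lr * grad - lr * weight_decay * w

w = np.array([1.0, -2.0, 0.5])
zero_grad = np.zeros_like(w)  # with no data gradient, only the decay acts
w_next = sgd_step_with_weight_decay(w, zero_grad, lr=0.1, weight_decay=0.01)
```

With a zero data gradient, every parameter is multiplied by $1 - 0.1 \cdot 0.01 = 0.999$, which shows the "soft" pull toward zero in isolation.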

Max norm constraints

Hard $L_2$ constraint on the parameters of the network. This is done by imposing an upper bound on the $L_2$ norm of every filter and using projected gradient descent to enforce the constraint.
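A minimal sketch of the projection step, assuming each row of a 2-D array holds one flattened filter. The helper names `project_max_norm` and `projected_sgd_step` and the parameter `max_norm` are assumptions for illustration only.

```python
import numpy as np

def project_max_norm(filters, max_norm):
    """Project each filter (one per row) onto the L2 ball of radius max_norm."""
    norms = np.linalg.norm(filters, axis=1, keepdims=True)
    # Filters already inside the ball are left unchanged; the others are
    # rescaled so that their norm equals max_norm exactly.
    scale = np.minimum(1.0, max_norm / np.maximum(norms, 1e-12))
    return filters * scale

def projected_sgd_step(filters, grad, lr, max_norm):
    """Gradient step followed by projection back onto the constraint set."""
    return project_max_norm(filters - lr * grad, max_norm)

filters = np.array([[3.0, 4.0], [0.3, 0.4]])  # row norms 5.0 and 0.5
projected = project_max_norm(filters, max_norm=1.0)
```

In this example the first filter is rescaled to unit norm while the second, already within the bound, is returned unchanged. That is the difference from weight decay, which shrinks every parameter at every step.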

Data augmentation

Creating additional training samples by perturbing existing ones. In image classification, this includes randomly flipping the input, cropping random subregions from it, etc.
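The two perturbations mentioned above, random flipping and random cropping, might look as follows for a single $H \times W \times D$ image. This is only a sketch: the function name `augment`, the 0.5 flip probability and the crop size are assumptions chosen for the example.

```python
import numpy as np

def augment(image, crop_h, crop_w, rng):
    """Random horizontal flip, then a random crop, of an H x W x D image."""
    if rng.random() < 0.5:
        # Flip left-right by reversing the width axis.
        image = image[:, ::-1, :]
    h, w, _ = image.shape
    # Pick the top-left corner so that the crop fits inside the image.
    top = rng.integers(0, h - crop_h + 1)
    left = rng.integers(0, w - crop_w + 1)
    return image[top:top + crop_h, left:left + crop_w, :]

rng = np.random.default_rng(0)
image = np.arange(32 * 32 * 3, dtype=np.float32).reshape(32, 32, 3)
sample = augment(image, crop_h=28, crop_w=28, rng=rng)
```

Calling `augment` repeatedly on the same image yields different training samples, all of which keep the original label.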

Stanford University, Fall 2019