Gaussian Initialization
In a Gaussian initialization, we sample each initial weight i.i.d. from a normal distribution. Choosing initial weights randomly has a few distinct advantages. First, random weights are unlikely to land at a local minimum, saddle point, or other bad part of the optimization landscape (symmetric or constant weights, on the other hand, often do). Second, a random initialization breaks symmetries and prevents multiple filters from learning the same or similar concepts.
As a general rule of thumb:
- It is fine to initialize any bias to zero (or a constant). In fact this is often recommended.
- It is fine to initialize the weights of the last layer to zero, if there is no non-linearity. In fact this is often recommended.
- Never initialize weights of layers inside the network to zero. This leads to zero gradients, and the network will not train. A short sketch applying these rules follows the list.
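As an illustration, here is a minimal sketch of these rules applied to a small, hypothetical two-layer network (the layer sizes and the 0.01 standard deviation are made up for the example):
import torch

net = torch.nn.Sequential(
    torch.nn.Linear(128, 64),   # hidden layer
    torch.nn.ReLU(),
    torch.nn.Linear(64, 10),    # last layer, no non-linearity after it
)
torch.nn.init.normal_(net[0].weight, mean=0.0, std=0.01)  # inner weights: random, never zero
torch.nn.init.constant_(net[0].bias, 0.0)                 # biases to zero is fine
torch.nn.init.zeros_(net[2].weight)                       # zero is fine for the final linear layer
torch.nn.init.constant_(net[2].bias, 0.0)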
The Gaussian initialization has two parameters: the mean and the standard deviation of the normal distribution. You will almost always choose a mean of zero. Tuning the standard deviation can be tricky, and there are several heuristics that can help. In class, we will almost exclusively use the Xavier initialization, which heuristically adjusts the standard deviation with the size of the layer.
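As a rough sketch of that heuristic (the layer sizes below are made up for the example; PyTorch's Xavier-normal variant uses this same formula, scaled by an optional gain):
import math

fan_in, fan_out = 256, 128                  # hypothetical layer sizes
std = math.sqrt(2.0 / (fan_in + fan_out))   # Xavier heuristic: std shrinks as the layer grows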
PyTorch Usage
import torch

conv_layer = torch.nn.Conv2d(16, 16, kernel_size=3)          # Conv2d needs a kernel size
torch.nn.init.normal_(conv_layer.weight, mean=0.0, std=0.01)  # Gaussian initialization of the weights
torch.nn.init.constant_(conv_layer.bias, 0.0)                 # biases to zero
Refer to torch.nn.init.normal_() for more details.
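For the Xavier initialization mentioned above, one option (a sketch using PyTorch's built-in torch.nn.init.xavier_normal_, which picks the standard deviation from the layer's fan-in and fan-out) would be:
torch.nn.init.xavier_normal_(conv_layer.weight)  # std chosen automatically from the layer size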