Feed Forward: Also called a multilayer perceptron. There are no feedback connections: information flows from the input x, through the intermediate computations that define f, and finally to the output y. These networks are typically represented by composing together many different functions, one per layer. Each layer consists of many parallel units, each of which represents a vector-to-scalar function, so the layer as a whole computes a vector-to-vector function.
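Below is a minimal sketch of this chain-of-functions view in NumPy. The layer sizes, the tanh hidden activations, and the linear output layer are illustrative assumptions, not details from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two hidden layers and an output layer: f(x) = f3(f2(f1(x))).
W1, b1 = rng.standard_normal((4, 3)), np.zeros(4)   # layer 1: R^3 -> R^4
W2, b2 = rng.standard_normal((4, 4)), np.zeros(4)   # layer 2: R^4 -> R^4
W3, b3 = rng.standard_normal((1, 4)), np.zeros(1)   # output:  R^4 -> R^1

def f1(x): return np.tanh(W1 @ x + b1)
def f2(h): return np.tanh(W2 @ h + b2)
def f3(h): return W3 @ h + b3        # linear output layer

x = rng.standard_normal(3)
y = f3(f2(f1(x)))                    # information flows forward only: x -> h1 -> h2 -> y
print(y)
```

Each of f1, f2, f3 is one layer; each row of a weight matrix corresponds to one unit computing a vector-to-scalar function of the previous layer's output.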
SGD: Stochastic gradient descent is an extension of the gradient descent algorithm and the main way to train large linear models on very large datasets. For a fixed model size, the cost of each SGD update does not depend on the training set size m (see the sketch after the Minibatch entry).
Minibatch: The gradient of the cost function is an expectation over the training data, so it can be estimated approximately from a small set of samples. On each step of the algorithm, we sample a minibatch of examples drawn uniformly from the training set. The minibatch size is typically a relatively small number of examples, ranging from one to a few hundred, and it is usually held fixed as the training set size m grows. We may therefore fit a training set with billions of examples using updates computed on only a hundred examples.
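A minimal sketch of minibatch SGD for linear regression with a squared-error cost; the dataset, the minibatch size of 64, and the learning rate are illustrative assumptions. The point is that each update touches only batch_size examples, so its cost is independent of m.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 100_000                                    # training set size
X = rng.standard_normal((m, 5))
true_w = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y = X @ true_w + 0.1 * rng.standard_normal(m)

w = np.zeros(5)
lr, batch_size = 0.1, 64                       # per-update cost depends on 64, not on m

for step in range(500):
    idx = rng.integers(0, m, size=batch_size)  # minibatch drawn uniformly
    err = X[idx] @ w - y[idx]
    grad = X[idx].T @ err / batch_size         # minibatch estimate of the expected gradient
    w -= lr * grad                             # SGD update: w <- w - lr * grad

print(np.round(w, 2))                          # close to true_w
```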
Backpropagation: A method used in artificial neural networks to compute the gradient of the cost function with respect to the weights, which is needed to update the weights of the network during training.
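A minimal sketch of backpropagation for a single example in a one-hidden-layer network with sigmoid hidden units, a linear output, and a squared-error cost; the sizes and variable names are assumptions for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
W1, b1 = 0.5 * rng.standard_normal((4, 3)), np.zeros(4)
W2, b2 = 0.5 * rng.standard_normal((1, 4)), np.zeros(1)

x = rng.standard_normal(3)                # input
t = np.array([1.0])                       # target

# Forward pass: keep the intermediate values needed by the backward pass.
z1 = W1 @ x + b1
a1 = sigmoid(z1)
z2 = W2 @ a1 + b2
y = z2                                    # linear output unit
cost = 0.5 * np.sum((y - t) ** 2)

# Backward pass: apply the chain rule layer by layer.
delta2 = y - t                            # dC/dz2 for squared error with linear output
grad_W2 = np.outer(delta2, a1)
grad_b2 = delta2
delta1 = (W2.T @ delta2) * a1 * (1 - a1)  # dC/dz1, using sigmoid'(z1) = a1 * (1 - a1)
grad_W1 = np.outer(delta1, x)
grad_b1 = delta1
```

These gradients are then used by the weight update of gradient descent or SGD.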
Cost derivative: Gives the slope of f(x) at the point x; it shows how a small change in the input leads to a change in the output of the cost function. The cost function is designed to measure how well the model's beliefs correspond with reality, and we use a training algorithm to minimize that cost function.
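A one-parameter sketch of this idea for an assumed squared-error cost C(w) = (w*x - t)^2; the numbers are made up for illustration. The derivative (slope) predicts how a small change in w changes the cost, which is what gradient descent exploits.

```python
x, t = 2.0, 3.0                        # one input and its target

def cost(w):
    return (w * x - t) ** 2

def cost_derivative(w):
    return 2.0 * (w * x - t) * x       # dC/dw by the chain rule

w, eps = 0.0, 1e-5
slope = cost_derivative(w)
numeric = (cost(w + eps) - cost(w)) / eps
print(slope, numeric)                  # a small change dw changes C by about slope * dw

w -= 0.1 * slope                       # a gradient descent step moves w downhill on C
```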
Sigmoid: σ is the logistic sigmoid function. The sigmoid function saturates when its argument is very positive or very negative, meaning that the function becomes very flat and insensitive to small changes in its input.
Sigmoid prime: Derivative of the sigmoid function.
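A minimal NumPy sketch of the sigmoid and its derivative; the names sigmoid and sigmoid_prime are common conventions assumed here, not identifiers from the text.

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid: sigma(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    """Derivative of the sigmoid: sigma'(z) = sigma(z) * (1 - sigma(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)

# Saturation: for large |z| the derivative is close to zero, so the output barely
# responds to small changes in the input.
print(sigmoid_prime(np.array([-10.0, 0.0, 10.0])))   # ~[4.5e-05, 0.25, 4.5e-05]
```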