2) Active Learning
A.
Definition: The paper restricts its definition of active learning to the simple and intuitive form of concept learning via membership queries. In a membership query, the learner queries a point in the input domain and an oracle returns the classification of that point.
Reason: In many formal problems, active learning is provably more powerful than passive learning from randomly given examples.
Example: A simple example is locating a boundary on the unit line interval. To achieve an expected position error of less than ε, one would need to draw O((1/ε) ln(1/ε)) random training examples. If one is allowed to make membership queries sequentially, then binary search is possible and, assuming a uniform distribution, a position error of ε may be reached with O(ln(1/ε)) queries.
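The contrast between sequential queries and random examples can be sketched in a few lines. This is only an illustration under stated assumptions: the boundary value 0.37, the tolerance 1e-3, and all function names are hypothetical and do not come from the paper.

```python
import random

def binary_search_boundary(oracle, eps):
    """Locate the boundary with sequential membership queries (bisection)."""
    lo, hi = 0.0, 1.0          # the boundary is known to lie in [lo, hi]
    queries = 0
    while hi - lo > 2 * eps:   # the midpoint is within eps once width <= 2*eps
        mid = (lo + hi) / 2
        queries += 1
        if oracle(mid):        # positive label: boundary is at or left of mid
            hi = mid
        else:                  # negative label: boundary is right of mid
            lo = mid
    return (lo + hi) / 2, queries

def random_sampling_boundary(oracle, n_examples, rng):
    """Estimate the boundary from n_examples uniformly random labeled points."""
    lo, hi = 0.0, 1.0
    for _ in range(n_examples):
        x = rng.random()
        if oracle(x):
            hi = min(hi, x)    # tightest positive example seen so far
        else:
            lo = max(lo, x)    # tightest negative example seen so far
    return (lo + hi) / 2

boundary = 0.37                       # hypothetical true boundary
oracle = lambda x: x >= boundary      # points at or right of it are positive
estimate, queries = binary_search_boundary(oracle, eps=1e-3)
passive = random_sampling_boundary(oracle, n_examples=queries, rng=random.Random(0))
print(f"active:  {queries} queries, error {abs(estimate - boundary):.5f}")
print(f"passive: {queries} examples, error {abs(passive - boundary):.5f}")
```

Each query halves the interval that can still contain the boundary, which is why the number of queries grows only logarithmically in 1/ε.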
B.
We can maintain two subsets: S, the set of all "most specific" consistent concepts, and G, the set of all "most general" consistent concepts. We then query instances that fall in the "difference" of S and G. If an instance in this region proves positive, then some s in S will have to generalize to accommodate the new information; if it proves negative, some g in G will have to be modified to exclude it. In either case, the version space (the space of plausible hypotheses) is reduced with every query.
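A minimal sketch of this idea, assuming a deliberately simple concept class: closed intervals over an integer domain, so that S and G each reduce to a single interval. The concept class, the positive seed example, and every name below are illustrative assumptions, not the paper's setup.

```python
def region_of_uncertainty(s, g):
    """Points inside the most general interval g but outside the most specific s."""
    (s_lo, s_hi), (g_lo, g_hi) = s, g
    return [x for x in range(g_lo, g_hi + 1) if not s_lo <= x <= s_hi]

def active_interval_search(oracle, seed, domain):
    """Learn an interval concept by querying only where S and G disagree."""
    s = (seed, seed)    # most specific hypothesis covering the positive seed
    g = domain          # most general hypothesis: the whole domain
    queries = 0
    while True:
        candidates = region_of_uncertainty(s, g)
        if not candidates:              # S and G coincide: concept identified
            return s, queries
        x = candidates[len(candidates) // 2]
        queries += 1
        if oracle(x):                   # positive: generalize S to cover x
            s = (min(s[0], x), max(s[1], x))
        elif x < s[0]:                  # negative on the left: shrink G
            g = (x + 1, g[1])
        else:                           # negative on the right: shrink G
            g = (g[0], x - 1)

target = (12, 30)                       # hypothetical true concept
oracle = lambda x: target[0] <= x <= target[1]
learned, n_queries = active_interval_search(oracle, seed=20, domain=(0, 99))
print(learned, n_queries)
```

Every query lands in the region where S and G disagree, so each answer removes at least one point from that region, and the loop stops once the two hypotheses agree.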
This kind of query selection is called active version-space search.
C.
An exponential function appears to govern the relationship between the number of samples and the learner's error rate. For the SG neural network, the exponential curve fitted through linear regression has a = -0.0218 and b = 0.071 with r^2 = 0.995. By comparison, the same fit for the randomly sampled networks achieved only r^2 = 0.981.
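One common way to fit an exponential curve with linear regression is to regress the logarithm of the error on the number of samples. The sketch below assumes the form error = exp(a*n + b), which is an assumption about the fit and not confirmed by these notes. The data points are synthetic, generated from made-up coefficients, and are not the paper's measurements.

```python
import math

def fit_exponential(ns, errors):
    """Fit error = exp(a*n + b) by ordinary least squares on log(error)."""
    ys = [math.log(e) for e in errors]
    n_mean = sum(ns) / len(ns)
    y_mean = sum(ys) / len(ys)
    sxx = sum((n - n_mean) ** 2 for n in ns)
    sxy = sum((n - n_mean) * (y - y_mean) for n, y in zip(ns, ys))
    a = sxy / sxx                      # slope in log space
    b = y_mean - a * n_mean            # intercept in log space
    ss_res = sum((y - (a * n + b)) ** 2 for n, y in zip(ns, ys))
    ss_tot = sum((y - y_mean) ** 2 for y in ys)
    r2 = 1.0 - ss_res / ss_tot         # coefficient of determination
    return a, b, r2

# Synthetic, noise-free data from hypothetical coefficients (not from the paper).
true_a, true_b = -0.02, 0.05
ns = list(range(10, 160, 10))
errors = [math.exp(true_a * n + true_b) for n in ns]

a, b, r2 = fit_exponential(ns, errors)
print(f"a = {a:.4f}, b = {b:.4f}, r^2 = {r2:.4f}")
```

A negative slope a means the error shrinks exponentially as samples are added, and an r^2 closer to 1 means the exponential form describes the data better.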
3) You don't need that fancy classifier
Thesis: Although a large number of comparative studies have been conducted in attempts to establish the relative superiority of the more sophisticated methods, these comparisons often fail to take into account important aspects of real problems. As a result, the apparent superiority of more sophisticated methods may be something of an illusion, and simple methods typically yield performance almost as good as more sophisticated methods.
Evidence:
In many cases, the simple models accounted for over 90% of the predictive power that could be achieved by "the best" model the author could find. Many papers make impractical assumptions: that the distributions from which the design points and the new points are drawn are the same, that the classes are well defined, that the labels are 100% accurate and the definitions will not change, that the costs of different kinds of misclassification are known accurately, and so on. In real applications, however, these additional assumptions will often not hold. This means that apparent small (laboratory) gains in performance might not be realized in practice.
Many of the comparative studies in the literature are based on brief descriptions of data sets, containing no background information at all on such possible additional sources of variation due to breakdown of implicit assumptions of the kind illustrated above. Their conclusions are therefore questionable.
I partly agree with this argument. Indeed, the assumptions of many comparative studies are questionable, their experimental data may differ greatly from real situations, and the gains they claim are often too small to make a real difference in practice. But there have been real breakthroughs in classifier technology, such as deep learning, which achieves the best results on image classification problems. With more computing power and larger data sets, we can expect sophisticated models such as deep neural networks to achieve better results both in the laboratory and in real applications.
4) Dropout
A.
Dropout is a technique that can prevent neural networks from overfitting and provides a way of approximately combining exponentially many different neural network architectures efficiently.
During training, it randomly drops units along with all their incoming and outgoing connections.
During testing, no units are dropped out but the weights of the network are scaled-down versions of the trained weights.
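The training and testing behaviour can be sketched for a single linear layer. This is a toy illustration in plain Python, assuming a retention probability p_keep for each unit; the function names, weights, and activations are all hypothetical.

```python
import random

def dropout_train(activations, p_keep, rng):
    """Training: keep each unit with probability p_keep, otherwise zero it."""
    return [a if rng.random() < p_keep else 0.0 for a in activations]

def scale_weights_for_test(weights, p_keep):
    """Testing: keep every unit but scale its outgoing weights by p_keep."""
    return [[w * p_keep for w in row] for row in weights]

def linear(weights, inputs):
    """Plain matrix-vector product, one output per weight row."""
    return [sum(w * x for w, x in zip(row, inputs)) for row in weights]

rng = random.Random(0)
p_keep = 0.5
activations = [1.0, 2.0, 3.0, 4.0]       # hypothetical hidden-unit outputs
weights = [[0.5, -1.0, 0.25, 1.0]]       # hypothetical outgoing weights

# Average the layer output over many random dropout masks.
n_masks = 20000
train_avg = sum(
    linear(weights, dropout_train(activations, p_keep, rng))[0]
    for _ in range(n_masks)
) / n_masks

test_out = linear(scale_weights_for_test(weights, p_keep), activations)[0]
print(f"mean over masks: {train_avg:.3f}, scaled test output: {test_out:.3f}")
```

For this linear layer, scaling the weights at test time reproduces the expected output over all dropout masks, which is the sense in which a single scaled network approximates the average of the many thinned networks.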
B.
Model combination means training multiple different models and combining their outputs into a final result.
Model combination can reduce overfitting and improve performance. A single trained model is prone to overfitting, but averaging the results of multiple models reduces it.
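The benefit of averaging can be illustrated with a toy simulation. It rests on a strong idealization: each model's prediction is the true value plus independent Gaussian noise, which real models trained on the same data will not satisfy exactly. All names and numbers are hypothetical.

```python
import random

def ensemble_average(predictions):
    """Combine models by taking the plain mean of their predictions."""
    return sum(predictions) / len(predictions)

def mean_squared_error(estimates, truth):
    """Average squared distance between the estimates and the true value."""
    return sum((e - truth) ** 2 for e in estimates) / len(estimates)

rng = random.Random(0)
truth = 1.0                    # hypothetical quantity every model tries to predict
n_models, n_trials = 10, 2000

single, combined = [], []
for _ in range(n_trials):
    # Each model = truth + independent noise (idealized assumption).
    preds = [truth + rng.gauss(0.0, 1.0) for _ in range(n_models)]
    single.append(preds[0])
    combined.append(ensemble_average(preds))

mse_single = mean_squared_error(single, truth)
mse_combined = mean_squared_error(combined, truth)
print(f"single model MSE: {mse_single:.3f}, averaged MSE: {mse_combined:.3f}")
```

Under this assumption the averaged prediction has a much smaller error than any single model, because the independent errors partly cancel.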
C.
In the experiments, dropout improved generalization performance on all data sets compared to neural networks that did not use dropout.
For example, on the MNIST data set, the best-performing neural networks for the permutation-invariant setting that do not use dropout or unsupervised pretraining achieve an error of about 1.60%. With dropout, the error drops to 1.35%.
D.
Advantages:
Dropout is efficient and scales to very large network sizes, and at test time it makes it easy to approximately combine the results of many different neural network architectures.
By contrast, Bayesian neural nets are slow to train and difficult to scale to very large network sizes, and it is expensive to get predictions from many large nets at test time.
Disadvantages:
In dropout, each model is weighted equally, whereas in a Bayesian neural network each model is weighted taking into account the prior and how well the model fits the data, which is the more correct approach.
Bayesian neural nets are better in domains where data is scarce, and dropout is better where data is plentiful.
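The difference between the two weighting schemes can be shown with a toy calculation. The sketch assumes a small finite set of models with known priors and data likelihoods, which is a simplification: a real Bayesian neural network averages over a continuous posterior. All numbers are made up for illustration.

```python
def equal_weight_average(predictions):
    """Dropout-style combination: every model counts the same."""
    return sum(predictions) / len(predictions)

def bayesian_model_average(predictions, priors, likelihoods):
    """Weight each model by prior * likelihood, normalized to sum to 1."""
    unnormalized = [p * l for p, l in zip(priors, likelihoods)]
    total = sum(unnormalized)
    weights = [u / total for u in unnormalized]
    return sum(w * pred for w, pred in zip(weights, predictions))

predictions = [0.2, 0.8]     # hypothetical predictions from two models
priors = [0.5, 0.5]          # hypothetical prior belief in each model
likelihoods = [0.1, 0.9]     # hypothetical fit of each model to the data

equal = equal_weight_average(predictions)
bayes = bayesian_model_average(predictions, priors, likelihoods)
print(equal, bayes)
```

The Bayesian average leans toward the model that explains the data better, while the equal-weight average ignores how well each model fits.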