Group 1 – Deep Learning
• What is gradient descent and how is it applied to deep learning
• Describe how the gradient descent process works
• What is stochastic gradient descent
• What is mini-batch stochastic gradient descent
• What are some advantages of stochastic gradient descent over non-stochastic (full-batch) gradient descent
• How are partial derivatives used in the gradient descent algorithm
• What does it mean to flatten a matrix and how is this concept applied to deep learning
• What is a hidden layer
• What are forward propagation and backpropagation
• How are the partial derivatives used in gradient descent computed at run time
• What is a partial derivative
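As a reference point for the gradient descent questions above, here is a minimal sketch of mini-batch stochastic gradient descent. It assumes a one-feature linear model with mean squared error loss, so the partial derivatives can be written out by hand. The function name `minibatch_sgd`, the toy data, and the hyperparameter values are illustrative choices, not part of the course material.

```python
import random

def minibatch_sgd(xs, ys, lr=0.05, batch_size=4, epochs=500, seed=0):
    """Fit y ~ w * x + b by mini-batch SGD on mean squared error."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # a new random batch order every epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Partial derivatives of the batch loss with respect to w and b.
            grad_w = sum(2 * (w * xs[i] + b - ys[i]) * xs[i] for i in batch) / len(batch)
            grad_b = sum(2 * (w * xs[i] + b - ys[i]) for i in batch) / len(batch)
            # Step against the gradient, scaled by the learning rate.
            w -= lr * grad_w
            b -= lr * grad_b
    return w, b

# Toy, noise-free data generated from y = 3x + 1 (an assumption for illustration).
xs = [i / 10 for i in range(20)]
ys = [3.0 * x + 1.0 for x in xs]
w, b = minibatch_sgd(xs, ys)
```

Setting `batch_size=1` gives plain stochastic gradient descent, and setting it to `len(xs)` gives full-batch gradient descent. In a deep network the same update is applied to every weight, with the partial derivatives obtained by backpropagation instead of a hand-written formula.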
Group 2 – PCA
• What is a loading vector
• What is a score
• What is a projection
• What is a scree plot
• Name 3 use cases for PCA in data science
• Assuming a training data matrix of size 100×3, what is the size of the resulting loading vector matrix
• How would we calculate the score for the 5th row of data using the 2nd principal component. What is the shape of the resulting score.
• How can we perform feature selection using PCA
• What data pre-processing is required in order to perform PCA
• How can we get the original data back from PCA scores
• Describe how PCA can be used to implement a recommender system
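The questions above on loading vectors, scores, and recovering the original data can be illustrated with a small sketch. To stay dependency-free it assumes only 2 features, where the eigenvectors of the covariance matrix have a closed form. The data values and names such as `score_row5_pc2` are made up for illustration. With p features the loading matrix is p×p, so a 100×3 training matrix gives a 3×3 loading matrix.

```python
import math

data = [[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0],
        [2.3, 2.7], [2.0, 1.6], [1.0, 1.1], [1.5, 1.6], [1.1, 0.9]]
n = len(data)

# Pre-processing: center each column (scaling is also common when units differ).
means = [sum(row[j] for row in data) / n for j in range(2)]
centered = [[row[j] - means[j] for j in range(2)] for row in data]

# Sample covariance matrix [[a, b], [b, c]].
a = sum(r[0] * r[0] for r in centered) / (n - 1)
b = sum(r[0] * r[1] for r in centered) / (n - 1)
c = sum(r[1] * r[1] for r in centered) / (n - 1)

# Closed-form eigenvectors of a symmetric 2x2 matrix.
# Each column of `loadings` is one loading vector; column 0 has the largest variance.
theta = 0.5 * math.atan2(2 * b, a - c)
loadings = [[math.cos(theta), -math.sin(theta)],
            [math.sin(theta), math.cos(theta)]]

def score(row, component):
    """Project one centered row onto one loading vector; the result is a scalar."""
    return sum(row[j] * loadings[j][component] for j in range(2))

scores = [[score(r, k) for k in range(2)] for r in centered]
score_row5_pc2 = scores[4][1]  # 5th row, 2nd component (0-based indices 4 and 1)

# Reconstruction: scores times the transposed loadings, plus the column means.
reconstructed = [[sum(s[k] * loadings[j][k] for k in range(2)) + means[j]
                  for j in range(2)] for s in scores]
```

Keeping all components reproduces the data exactly (up to rounding). Dropping the later components gives a lower-dimensional approximation instead.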
Group 3 – Logistic Regression / Regularization
• Describe how logistic regression works from a high-level point of view
• What is the characteristic shape of a logistic regression function
• Is logistic regression considered a linear model
• How does the logistic regression training process affect the shape of the characteristic curve. In other words, how do the weights and intercept affect the shape of the curve.
• What is regularization
• Name one model evaluation metric which would indicate regularization may help your model perform better.
• How can regularization be used for feature selection in regression
• How is L1 regularization calculated
• How is L2 regularization calculated
• How is elastic net regularization calculated
• What is model flexibility in regression
• How can we increase / decrease model flexibility
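For the questions above on how the L1, L2, and elastic net penalties are calculated, the following sketch shows one common parameterization. It assumes a single strength `lam` and a mixing weight `l1_ratio` between 0 and 1. Libraries differ in details such as a factor of 1/2 on the L2 term, so treat the exact scaling as an assumption. The intercept is normally left out of `weights`.

```python
def l1_penalty(weights, lam):
    """Lasso penalty: lam times the sum of absolute weights."""
    return lam * sum(abs(w) for w in weights)

def l2_penalty(weights, lam):
    """Ridge penalty: lam times the sum of squared weights."""
    return lam * sum(w * w for w in weights)

def elastic_net_penalty(weights, lam, l1_ratio):
    """Weighted mix of the L1 and L2 penalties (l1_ratio=1 is pure L1)."""
    return (l1_ratio * l1_penalty(weights, lam)
            + (1 - l1_ratio) * l2_penalty(weights, lam))

weights = [0.5, -2.0, 0.0, 1.5]  # hypothetical fitted coefficients
l1 = l1_penalty(weights, lam=0.1)
l2 = l2_penalty(weights, lam=0.1)
en = elastic_net_penalty(weights, lam=0.1, l1_ratio=0.5)
```

The penalty is added to the training loss, so a larger `lam` shrinks the weights and reduces model flexibility. The L1 term can drive some weights exactly to zero, which is why it can serve as a feature selection tool.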
Group 4 – Assessing Model Accuracy
• What is generalization performance
• What is a loss function
• Describe the bias / variance trade off
• What is bias
• What is variance
• When is a model optimized in terms of bias and variance
• How does model complexity / flexibility relate to the size of the training data
• How would you know if a model is overfitting
• What is K-Fold cross validation and how does it work
• What is a confusion matrix and how is it used in the data science process
• What is a ROC curve and how is it used in the data science process
• How does a confusion matrix relate to the ROC curve
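The relationship between a confusion matrix and the ROC curve, asked about above, can be sketched as follows. The sketch assumes binary labels (0 or 1) and a model that outputs a score per observation. The labels, scores, and function names are hypothetical.

```python
def confusion_matrix(y_true, scores, threshold):
    """Return (tp, fp, tn, fn) when predicting 1 for scores >= threshold."""
    tp = fp = tn = fn = 0
    for y, s in zip(y_true, scores):
        pred = 1 if s >= threshold else 0
        if pred == 1 and y == 1:
            tp += 1
        elif pred == 1 and y == 0:
            fp += 1
        elif pred == 0 and y == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn

def roc_points(y_true, scores):
    """One (false positive rate, true positive rate) pair per threshold."""
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted positive
    for t in sorted(set(scores), reverse=True):
        tp, fp, tn, fn = confusion_matrix(y_true, scores, t)
        points.append((fp / (fp + tn), tp / (tp + fn)))
    return points

y_true = [0, 0, 1, 1, 0, 1]
scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7]
cm = confusion_matrix(y_true, scores, threshold=0.5)
roc = roc_points(y_true, scores)
```

Each threshold produces one confusion matrix, and each confusion matrix contributes one point to the ROC curve. This sketch assumes both classes are present, otherwise one of the rates would divide by zero.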
Group 5 – Statistical Learning
• What is statistical learning
• What is reducible error
• What is irreducible error
• What is a parametric model
• What is a non-parametric model
• What is prediction
• What is inference
• What is maximum likelihood estimation and how does it work
Group 6 – Other
• Describe TF-IDF
• What is TF
• What is IDF
• How is TF-IDF used in machine learning
• Describe how TF-IDF could be used to determine the sentiment of text. Describe the entire process, starting with a data frame containing text, and walk through the steps needed to determine text sentiment.
• What is Spark
• What is Hadoop
• What is a resilient distributed dataset
• What is a lineage graph
• Why does Spark use lazy execution
• What major improvement does Spark have over Hadoop
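For the TF, IDF, and TF-IDF questions in this group, here is a minimal sketch of one common definition. It assumes term frequency is the count divided by document length and inverse document frequency is the natural log of N over the document frequency. Libraries often use smoothed variants, so the exact formula is an assumption. The tiny tokenized corpus is made up.

```python
import math
from collections import Counter

docs = [["good", "movie"], ["bad", "movie"], ["good", "good", "plot"]]

def tf(term, doc):
    """Share of the document's tokens that are this term."""
    return Counter(doc)[term] / len(doc)

def idf(term, docs):
    """log(N / number of documents containing the term).

    Assumes the term occurs in at least one document.
    """
    df = sum(1 for d in docs if term in d)
    return math.log(len(docs) / df)

def tfidf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

vocab = sorted({t for d in docs for t in d})
# One TF-IDF feature vector per document, in vocabulary order.
features = [[tfidf(t, d, docs) for t in vocab] for d in docs]
```

For a sentiment task, vectors like `features` would be paired with sentiment labels and passed to a classifier such as logistic regression.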
Group 7 – Random Forest
• Describe how random forest models work from a high-level point of view.
• What is the difference between random forest and regular decision trees.
• How does random forest prevent high correlation between trees in the forest
• Describe the training process.
• Describe one or more similarities between random forest and GBM
• How does the number of trees in the forest relate to variance in the predicted value
• How does the tree depth relate to model bias / variance
• What is entropy
• What is the Gini index
• What is information gain and how does it relate to the training process
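The entropy, Gini index, and information gain questions above can be made concrete with a short sketch. It assumes class labels stored in plain lists and a binary split into `left` and `right` children. The function names and example labels are illustrative.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy in bits of the class distribution in a node."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

def information_gain(parent, left, right):
    """Parent entropy minus the size-weighted entropy of the two children."""
    n = len(parent)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted

parent = [1, 1, 0, 0]
gain_perfect = information_gain(parent, [1, 1], [0, 0])  # children are pure
gain_useless = information_gain(parent, [1, 0], [1, 0])  # children match the parent
```

During training, a tree evaluates candidate splits at each node and keeps the one with the highest gain (or the largest impurity decrease when the Gini index is used).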
Group 8 – GBM
• Describe boosting from a high-level point of view
• Describe how gradient boosting machines (GBM) work from a high-level point of view.
• Describe some differences between random forest and GBM
• Describe some similarities between random forest and GBM
• Describe some characteristics of individual trees in a GBM forest
• Describe the GBM training process
• How does GBM prevent correlation between trees
• How does the learning rate in GBM relate to the number of trees in the forest
• How does tree depth relate to model accuracy in GBM
Group 9 – Bootstrap Sampling / Bagging
• Describe bootstrap sampling from a high-level point of view
• Describe bagging from a high-level point of view
• What is the purpose of bootstrap sampling
• Is it possible to have duplicate observations in a bootstrap sample
• What is an out of bag sample
• Describe how out of bag samples are useful in the data science workflow
• What are some important details concerning the bootstrap sampling process.
• Describe a specific data science situation where bootstrap sampling could be useful
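The bootstrap and out of bag questions above can be explored with a small sketch. It assumes observations are identified by their row index and that a bootstrap sample has the same size as the original data. The function name `bootstrap_sample` and the sample size of 1000 are arbitrary choices.

```python
import random

def bootstrap_sample(n, rng):
    """Draw n row indices with replacement; unseen rows form the out of bag set."""
    in_bag = [rng.randrange(n) for _ in range(n)]
    out_of_bag = sorted(set(range(n)) - set(in_bag))
    return in_bag, out_of_bag

rng = random.Random(0)
n = 1000
in_bag, out_of_bag = bootstrap_sample(n, rng)

has_duplicates = len(set(in_bag)) < len(in_bag)  # sampling with replacement repeats rows
oob_fraction = len(out_of_bag) / n               # roughly 1/e, about 0.37, for large n
```

In bagging, each model is trained on its own `in_bag` rows, and its `out_of_bag` rows can serve as a built-in validation set for that model.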