
• Fintech is most accurately described as: A

• the application of technology to the financial services industry

• the replacement of government-issued money with electronic currencies.

• the clearing and settling of securities trades through distributed ledger technology.

• Applications of fintech that are relevant to investment management include which of the following? B

• high frequency trading, natural language processing, risk analysis, algorithmic trading, and robo-advisory services.

• text analytics, natural language processing, risk analysis, algorithmic trading, and robo-advisory services.

• high frequency trading, natural language processing, risk analysis, algorithmic trading, and stock price prediction.

• Which of the following technological developments is most likely to be useful for analyzing Big Data? A

• machine learning.

• high-latency capture.

• the Internet of Things.

• Big data can be described as potentially useful information that is generated in the economy and includes: B

• structured data from traditional sources.

• structured and unstructured data from both traditional and non-traditional sources.

• unstructured data from non-traditional sources.

• Big data can be characterized by 5 ‘Vs’. What are they? A

• volume, velocity, veracity, value and variety.

• volume, velocity, vicinity, value and validity.

• volume, velocity, volatility and variability.

• Data science describes methods for processing and visualizing data. Processing methods include which of the following?

• capture, curation, storage, search and processing.

• capture, curation, storage, search and transfer.

• capture, cleaning, storage, search and modeling.

• Alternative data from non-traditional sources include: D

• financial related data from social media posts, online reviews, email, and website visits.

• business data from bank records and retail scanner data.

• sensor-generated data, such as geolocation, temperature, humidity and PM2.5 air quality.

• all of the above.

• Artificial intelligence identifies patterns in certain types of data. Which types are they? C

• qualitative, unstructured data.

• quantitative, structured data.

• both qualitative, unstructured data and quantitative, structured data.

• Artificial intelligence holds promise in fintech, but how do we overcome the challenges posed by data?

• select sparse, low quality data, treat outliers, handle missing data and biases in data.

• select large volume, high quality data, treat outliers, handle missing data, or biases in data.

• select small volume, high quality data, treat outliers, missing data, or biases in data.

• In supervised machine learning, how do we train a model for inference?

• the input and output data are labelled, the machine learns to model the outputs from the inputs, and new data is used to infer from the model.

• the input is unlabeled, the machine learns to model the outputs from the inputs, and new data is used to infer from the model.

• none of the above.

• In unsupervised machine learning, how do we train a model for inference?

• the input and output data are labelled, the machine learns to model the outputs from the inputs, and new data is used to infer from the model.

• the input data is unlabeled, the machine learns to describe the structure of the data.

• none of the above.

• Neural networks are an example of artificial intelligence in that they are programmed to process information in a way similar to:

• rule based expert systems.

• the human brain.

• all of the above.

• Machine learning enables computer systems to improve task performance over time through:

• a computer algorithm is given inputs of source data and outputs of target data. The algorithm either models the output data based on the input data or learns to recognize patterns in the input data.

• a computer algorithm is given inputs of source data or outputs of target data. The algorithm either models the output data based on the input data or learns to recognize patterns in the input data.

• all of the above.

• Deep learning is a technique that uses layers of neural networks to identify patterns and may be employed in:

• supervised learning only.

• unsupervised learning only.

• both supervised and unsupervised learning.

• Machine learning can produce models that overfit or underfit the data. What does overfitting mean?

• learns the input and output data too exactly, treats true parameters as noise, and identifies incorrect patterns and relationships. Resultant model is too complex.

• learns the input and output data too exactly, treats noise as true parameters, and identifies incorrect patterns and relationships. Resultant model is too complex.

• machine fails to identify actual patterns and relationships, treating true parameters as noise. This means the model is not complex enough.

• Machine learning can produce models that overfit or underfit the data. What does underfitting mean?

• learns the input and output data too exactly, treats true parameters as noise, and identifies incorrect patterns and relationships. Resultant model is too complex.

• learns the input and output data too exactly, treats noise as true parameters, and identifies incorrect patterns and relationships. Resultant model is too complex.

• machine fails to identify actual patterns and relationships, treating true parameters as noise. The model is not complex enough.

• In the use of Machine Learning (ML):

• some techniques are termed ‘black boxes’, due to data biases.

• human judgement is unnecessary as algorithms continuously learn from data.

• training data can be learned too precisely, resulting in wrong predictions.

• Text analytics refers to the analysis of:

• structured data, such as text.

• unstructured data, such as voice.

• all of the above.

• Text analytics is a suitable application for which of the following?

• economic trend analysis.

• large, structured databases.

• public but not private information.

• Natural language processing refers to the use of artificial intelligence to interpret:

• speech recognition, language translation, text mining and sentiment analysis.

• speech-to-text and text-to-speech translation.

• natural language generation and topic analysis.

• Visualization of data includes which of the following?

• word frequency or word cloud.

• charts and graphs.

• all of the above.

• Which of the following equations least accurately represents Return On Equity?

• (net profit margin)(equity turnover).

• (net profit margin)(total asset turnover)(assets / equity).

• (ROA)(interest burden)(tax retention rate).

• Which of these are common types of quantitative trading strategies?

• forecasting.

• mean reversion.

• correlation/cointegration.
• all of the above.

• A stock is observed to have an average price of 50 with a +/- 5 variation over the past 100 trading days. You buy when the stock reaches 45 and sell when it reaches 55. What kind of arbitrage is this?

• carry.

• statistical.

• merger.
• liquidation.

• Which of these are valid uses of backtesting? Select all that apply.

• quantify the hypothetical performance of your strategy for comparison with other strategies.

• predict likely capital requirements, trade frequency and risk for your portfolio.
• ensure that your strategy will be profitable in live trading.
• determine your maximum drawdown for your strategy in live trading.
• all of the above.

• Enterprise value is defined as the market value of equity plus:

• the face value of debt minus cash and short-term investments.

• the market value of debt minus cash and short-term investments.

• cash and short-term investments minus the market value of debt.

• Which of the following is least likely a rationale for using price multiples?

• price multiples are easily calculated.

• the fundamental p/e ratio is insensitive to its inputs.

• the use of forward values in the divisor incorporates expectations about the future.

• Statistical arbitrage and index arbitrage account for most of the volume in quantitative trading. Please select the examples of stat arb from the choices below:

• selling an asset on one trading venue at 110 and simultaneously buying it back for 109.90 at a different trading venue.

• selling an asset on one trading venue at 110 and buying it back later for 109 at a different trading venue.

• selling a basket of stocks that matches the composition of the S&P 500 for $300,000 and simultaneously buying 1000 shares of the SPY ETF for $299.70.
• none of the above.

• Which of the following is/are NOT a component of a time-series? Select all that apply.
• trend.
• noise.
• seasonality.
• none of the above

• Which of the following best describes the idea of early stopping used in Machine Learning?
• an improved version of the backpropagation algorithm.

• train the model until a local minimum in the error function is reached.
• add a momentum term to the weight update in the generalized linear rule, so that training converges more quickly.
• evaluate the model on a test dataset after every epoch of training; stop training when the generalization error starts to increase.

• You built a model and calculated its Root Mean Squared Error (RMSE) on test set. Later, you added more samples to your test set and validated the model again. What would happen to the new RMSE value?
• increase.
• decrease.
• stay the same.
• any of the above could potentially happen.

• Tuning which of the following hyperparameter(s) may cause overfitting of a random forest model? Select all that apply.
• learning rate.
• random initial state.
• depth of trees.
• n_jobs.
• none of the above.

• You develop an ML algorithm to predict the number of views per article on a website. Your model is based on content features of the article and features of the author. Which of the following evaluation metric(s) would you choose? Select all that apply.
• mean squared error (MSE).
• accuracy.
• f1-score.
• r-squared.
• max absolute weight.

• What challenge(s) may you face if you have applied one-hot encoding to a categorical feature?
• some categories of a categorical variable in the test dataset are not present in the train dataset.
• because of low cardinality, it doesn’t work well for NLP problems.
• train and test set would always have same distribution.
• both a and b.
• none of the above.

• Choose the correct option(s) regarding the k-NN algorithm. Select all that apply.
• usually k-NN performs better if all variables have the same scale.
• k-NN works well with a small number of input variables, but struggles when the number of inputs is very large.
• k-NN can be used for both categorical and continuous target variables.
• a and b.
• a, b and c.

• Which of the statements below are correct for K-Means clustering? Select all that apply.
• k-means is sensitive to cluster center initializations.
• bad initialization can lead to poor convergence speed.
• bad initialization can lead to bad overall clustering.
• a and c
• none of the above.

• Which of the statements below are correct for the Gradient Boosting on decision trees?
• always use decision stumps as weak classifiers.
• missing values are handled by the algorithm, no need for imputation.
• it is recommended to reduce number of dimensions with PCA before training the model.
• tuning the number of trees will not cause overfitting.

• Which of the following techniques can be used to deal with overfitting of neural networks? Select all that apply.
• dropout.
• regularization.
• batch normalization.
• a and b.
• none of the above.

• What is true regarding the One-Vs-All method in Logistic Regression?
• one needs to fit only 1 model to classify into n classes.
• one needs to fit n-1 models to classify into n classes.
• one needs to fit n models in n-class classification problem.
• none of the above.

• Which of the diagrams below illustrates the idea of low bias and high variance estimation?

(Diagram options A, B, C and D are not reproduced here.)

—- End of Section 1 —-

Section II. Answer 10 short questions out of the given 20 questions. (4 marks each for a total of 40 marks)

• Explain in detail what bias and variance are in the bias-variance tradeoff.
• Bias – Variance
The bias is an error from erroneous assumptions in the learning algorithm. High bias can cause an algorithm to miss the relevant relations between features and target outputs (underfitting). The variance is an error from sensitivity to small fluctuations in the training set. High variance can cause an algorithm to model the random noise in the training data, rather than the intended outputs (overfitting).
• Bias:
“Bias is the error introduced in your model due to oversimplification of the machine learning algorithm.” It can lead to underfitting. When you train your model, the model makes simplified assumptions to make the target function easier to understand.
Low bias machine learning algorithms – Decision Trees, k-NN and SVM
High bias machine learning algorithms – Linear Regression, Logistic Regression
• Variance:
“Variance is the error introduced in your model due to an overly complex machine learning algorithm; the model also learns noise from the training dataset and performs badly on the test dataset.” It can lead to high sensitivity and overfitting.
Normally, as you increase the complexity of your model, you will see a reduction in error due to lower bias. However, this only happens up to a point. As you continue to make your model more complex, you end up overfitting it, and your model will start suffering from high variance.
• Bias-variance tradeoff:
The goal of any supervised machine learning algorithm is to have low bias and low variance to achieve good prediction performance.
• The k-nearest neighbors algorithm has low bias and high variance, but the trade-off can be changed by increasing the value of k which increases the number of neighbors that contribute to the prediction and in turn increases the bias of the model.
• The support vector machine algorithm has low bias and high variance, but the trade-off can be changed by increasing the C parameter that influences the number of violations of the margin allowed in the training data which increases the bias but decreases the variance.
There is no escaping the relationship between bias and variance in machine learning. Increasing the bias will decrease the variance. Increasing the variance will decrease the bias.
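
As an illustration of the k-NN point above, the following is a minimal sketch (not part of the original answer) that varies k on synthetic data, assuming scikit-learn is available: a very small k fits the training data almost perfectly but generalizes worse (low bias, high variance), while a large k smooths the fit (higher bias, lower variance).

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# synthetic classification data; in practice this would be your labelled dataset
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

for k in (1, 5, 25, 100):
    model = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    # compare train vs. test accuracy: a large gap signals high variance (overfitting)
    print(k, model.score(X_train, y_train), model.score(X_test, y_test))
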
• Describe examples of alternative data and the types of machine learning algorithms that may be suitable for each type. How might alternative data be useful in finance?
Alternative data is non-traditional data that can be used in the investment process. There are various categories of alternative data, including consumer transaction data, geolocation data and sentiment data, and thousands of relevant datasets are currently available to sophisticated investors. The benefits: these datasets may be integrated into the investment process primarily because they provide a greater volume of data and information than traditional datasets, offer unforeseen insight, and give managers a competitive edge. Web-crawled data is frequently used to track website e-commerce activity and includes product prices, product listings, promotions, reviews, public commentary, press releases, changes to corporate websites and government filings; natural language processing algorithms are used to interpret these and tie them to predictions of corporate earnings that may impact stock prices. Macro funds track Chinese exports to predict demand for products and, potentially, corporate earnings (supervised learning). Asset managers track active job postings as an alpha signal of future profitability. Discretionary managers use online retail data to identify improving sell-through trends for products (supervised machine learning).

• Describe reinforcement learning and its application in Finance using a scenario to illustrate your understanding.
• Reinforcement Learning is learning what to do and how to map situations to actions. The goal is to maximize a numerical reward signal. The learner is not told which action to take, but instead must discover which actions yield the maximum reward. Reinforcement learning is inspired by how human beings learn; it is based on a reward/penalty mechanism.
• Reinforcement learning has been successfully implemented in investment management, with applications including price dynamics forecasting, portfolio management, market making, valuation of derivatives and financial risk management.
• A good example is derivative valuation, where outdated valuation models like Black-Scholes are being replaced by deep neural networks and reinforcement learning (a minimal toy sketch of Q-learning follows).
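
To make the reward/penalty idea concrete, here is a hypothetical toy sketch of tabular Q-learning for a two-action (hold/buy) decision on a coarsely discretized trend state. The environment, state transitions and reward are simulated and purely illustrative; they are not a realistic trading model.

import numpy as np

n_states, n_actions = 3, 2    # states: 0 = down-trend, 1 = flat, 2 = up-trend; actions: 0 = hold, 1 = buy
Q = np.zeros((n_states, n_actions))
alpha, gamma, epsilon = 0.1, 0.95, 0.1
rng = np.random.default_rng(0)

state = 1
for step in range(10000):
    # epsilon-greedy action selection
    action = rng.integers(n_actions) if rng.random() < epsilon else int(Q[state].argmax())
    next_state = rng.integers(n_states)   # simulated market transition
    # reward: buying in an up-trend state pays off, buying otherwise costs a small penalty
    reward = 1.0 if (action == 1 and state == 2) else (-0.1 if action == 1 else 0.0)
    # Q-learning update: immediate reward plus discounted best future value
    Q[state, action] += alpha * (reward + gamma * Q[next_state].max() - Q[state, action])
    state = next_state

print(np.round(Q, 3))   # the learned action values favor buying only in the up-trend state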

• Describe the ARIMA model, contrasting its differences with regression. Illustrate your understanding through an example for when ARIMA is suitable compared with regression.
The ARIMA model combines an autoregressive (AR) component, an integration (I) step, i.e. differencing to make the series stationary, and a moving-average (MA) component. The model has three parameters: p, d and q. The p parameter indicates how many prior periods are taken into consideration to explain the autocorrelation, and the q parameter indicates how many prior error terms are considered to capture sudden changes. The d parameter signifies the degree of differencing: we predict the difference between one period and the next rather than the value of the new period itself. The autoregressive (AR) term tells us that this is indeed a regression, but the "auto" means the series is regressed on itself, because the observations occur over time: we would like to know whether, by using past observations, we can predict future values. We could not do this with an ordinary linear regression, because its observations carry no time sequence; we can think of them as all occurring at the same time. A time-series model, however, allows some number of lags of a variable to help predict the next value. For example, in an AR(1) model we use the one-lagged (immediately prior) value to predict the next. The MA part tells us there is a moving-average component; these are the unobserved error terms.
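
As a minimal sketch (assuming the statsmodels library), the following fits an ARIMA(1, 1, 1) to a simulated price series; with real data, series would be a pandas Series of observed prices.

import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

series = pd.Series(100 + np.cumsum(np.random.normal(size=250)))  # simulated price level

# ARIMA(p=1, d=1, q=1): one AR lag, first differencing, one MA term
model = ARIMA(series, order=(1, 1, 1))
fit = model.fit()
print(fit.summary())
print(fit.forecast(steps=5))   # forecast the next 5 periods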

• What is R2? What are some other metrics that could be better than R2 and why?
In simple linear regression, R² is the square of the correlation coefficient; more generally it is calculated using the formula
1 – (Residual Sum of Squares / Total Sum of Squares).

This value represents the fraction of the variation in the dependent variable that is explained by the model. When more variables are introduced into a linear regression model, R-squared always increases or at least remains constant, because with ordinary least squares the sum of squared errors never increases when more variables are added to the model; hence R-squared never decreases. The adjusted R-squared is a modified version of R-squared that accounts for the number of predictors in the model. The adjusted R-squared increases only if a new term improves the model more than would be expected by chance, and it decreases when a predictor improves the model by less than expected by chance.
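
A minimal sketch of the two formulas, with illustrative numbers; p is the number of predictors in the fitted model.

import numpy as np

def r_squared(y_true, y_pred):
    ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    return 1.0 - ss_res / ss_tot

def adjusted_r_squared(r2, n, p):
    # penalizes additional predictors; only rises if they add real explanatory power
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)

y_true = np.array([3.0, 4.5, 5.0, 7.2])
y_pred = np.array([2.8, 4.7, 5.1, 7.0])
r2 = r_squared(y_true, y_pred)
print(r2, adjusted_r_squared(r2, n=len(y_true), p=2))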

• What is K-Means Clustering? How Can You Select K For K-means?
K-means clustering is a basic unsupervised learning algorithm. It is a method of grouping data into a predefined number of clusters, K, in order to find similarity in the data.

It involves defining K centers, one for each cluster, with K being predefined. The K points are initially selected at random as cluster centers, and the objects are assigned to their nearest cluster center. The objects within a cluster are as closely related to one another as possible and differ as much as possible from the objects in other clusters. K-means clustering works very well for large sets of data. K itself can be selected with methods such as the elbow method, which plots the within-cluster sum of squares against K and picks the value where further increases give little improvement, or the silhouette score; a sketch of the elbow method follows.
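
A minimal sketch of the elbow method mentioned above, assuming scikit-learn and synthetic data; in practice you would plot inertia against K and pick the "elbow" where the curve flattens.

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)
for k in range(1, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, km.inertia_)   # within-cluster sum of squares; flattens out near the true K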

• What is linear regression? Name 3 methods in which linear regression may be used.
It is the most commonly used method for predictive analytics. Linear regression is used to describe the relationship between a dependent variable and one or more independent variables. The main task in linear regression is fitting a single line through a scatter plot.

Linear regression involves the following three steps: determining and analyzing the correlation and direction of the data, deploying the estimation of the model, and ensuring the usefulness and validity of the model.

It is extensively used in scenarios where a cause-and-effect model comes into play, for example when you want to know the effect of a certain action in order to determine the various possible outcomes and the extent to which the cause determines the final outcome.
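
A minimal sketch of fitting a linear regression on a simulated cause-and-effect relationship (hypothetical advertising spend vs. sales), assuming scikit-learn.

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
spend = rng.uniform(0, 100, size=(200, 1))                 # hypothetical advertising spend
sales = 3.0 * spend[:, 0] + 50 + rng.normal(0, 10, 200)    # simulated response

model = LinearRegression().fit(spend, sales)
print(model.coef_, model.intercept_)     # estimated slope and intercept
print(model.score(spend, sales))         # R-squared of the fit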

• What exactly is Logistic Regression? Describe some applications of Logistic Regression in Finance.
It is a statistical technique or model used to analyze a dataset and predict a binary outcome; the outcome must be binary, i.e. zero or one, yes or no.

Applications of logistic regression in finance include predicting or modeling bank failures using the quarterly reports submitted by all U.S.-based commercial banks to the FDIC, credit card defaults, home loan defaults, etc.
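
A hypothetical sketch of modeling a binary default outcome with logistic regression, assuming scikit-learn; the features and data are simulated purely for illustration.

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))                            # e.g. leverage, income, delinquency score
p = 1 / (1 + np.exp(-(0.8 * X[:, 0] - 1.2 * X[:, 1])))    # simulated true default probability
y = rng.binomial(1, p)                                    # 1 = default, 0 = no default

clf = LogisticRegression().fit(X, y)
print(clf.coef_)                                          # direction/strength of each feature
print(clf.predict_proba(X[:5])[:, 1])                     # predicted default probabilities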

• What Are The Types Of Biases That Can Occur During Sampling?
Selection bias
• Selection bias is the bias introduced by the selection of individuals, groups or data for analysis in such a way that proper randomization is not achieved, with the result that the sample obtained is not representative of the population intended to be analyzed. It is sometimes referred to as the selection effect. The phrase “selection bias” most often refers to the distortion of a statistical analysis resulting from the method of collecting samples. If selection bias is not considered, some conclusions of the study may not be accurate.

• The types of selection bias include:
• Sampling bias: It is a systematic error due to a non-random sample of a population causing some members of the population to be less likely to be included than others resulting in a biased sample.
• Time interval: A trial may be terminated early at an extreme value (often for ethical reasons), but the extreme value is likely to be reached by the variable with the largest variance, even if all variables have a similar mean.
• Data: When specific subsets of data are chosen to support a conclusion or rejection of bad data on arbitrary grounds, instead of according to previously stated or generally agreed criteria.
• Attrition: Attrition bias is a kind of selection bias caused by attrition (loss of participants) discounting trial subjects/tests that did not run to completion.
• Undercoverage bias: sampling too few observations from a segment of the population.
• Survivorship bias: excluding stocks of companies that are no longer trading. Survivorship bias, together with optimization bias, look-ahead bias and drawdown tolerance bias, is one of the four main potential weaknesses of backtesting.

• Explain Cross-validation.
It is a model validation technique for evaluating how the outcomes of a statistical analysis will generalize to an independent data set. It is mainly used in settings where the objective is forecasting and one wants to estimate how accurately a model will perform in practice.

The goal of cross-validation is to set aside a portion of the data to test the model during the training phase (i.e. a validation data set) in order to limit problems like overfitting and to get insight into how the model will generalize to an independent data set.
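
A minimal sketch of 5-fold cross-validation with scikit-learn on synthetic data.

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(scores, scores.mean())   # out-of-fold accuracy per fold, and its average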

• Differentiate between univariate, bivariate and multivariate analysis.
These are descriptive statistical analysis techniques which can be differentiated based on the number of variables involved at a given point of time. For example, the pie charts of sales based on territory involve only one variable and can be referred to as univariate analysis.

If the analysis attempts to understand the relationship between two variables at a time, as in a scatterplot, it is referred to as bivariate analysis. For example, analyzing sales volume together with spending can be considered an example of bivariate analysis.

Analysis that deals with the study of more than two variables to understand the effect of variables on the responses is referred to as multivariate analysis.

• A test has a true positive rate of 100% and false positive rate of 5%. There is a population with a 1/1000 rate of having the condition the test identifies. Considering a positive test, what is the probability of having that condition?
Suppose you are being tested for a disease. If you have the illness, the test will always say you have it. However, if you don't have the illness, 5% of the time the test will still say you have the illness, and 95% of the time it will correctly say you don't. Thus there is a 5% false positive rate when you do not have the illness.

Out of 1000 people, the 1 person who has the disease will get a true positive result. Of the remaining 999 people, about 5% will get a false positive result.

So close to 50 people will get a false positive result for the disease.

This means that out of 1000 people, about 51 people will test positive even though only one person has the illness. There is therefore only about a 2% probability (1/51) of actually having the disease even if your test says you do.
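
The same arithmetic written out with Bayes' theorem:

p_condition = 1 / 1000          # prior probability of having the condition
p_pos_given_condition = 1.0     # true positive rate
p_pos_given_healthy = 0.05      # false positive rate

p_positive = (p_pos_given_condition * p_condition
              + p_pos_given_healthy * (1 - p_condition))
p_condition_given_positive = p_pos_given_condition * p_condition / p_positive
print(p_condition_given_positive)   # ~0.0196, i.e. about 2%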

• During analysis, how do you treat missing values?
The extent of the missing values is identified after identifying the variables with missing values. If any patterns are identified, the analyst should concentrate on them, as they could lead to interesting and meaningful business insights. If no patterns are identified, the missing values can be substituted with mean or median values (imputation) or they can simply be ignored. There are several factors to consider when answering this question:

Understand the problem statement and the data before deciding. One approach is assigning a default value, which can be the mean, minimum or maximum value; for a categorical variable, a default category can be assigned to the missing values.

If the variable follows a normal distribution, the mean is a reasonable default. Whether to treat missing values at all is another important consideration: if 80% of the values for a variable are missing, it is usually better to drop the variable than to treat the missing values.
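
A minimal sketch of these treatments with pandas; the DataFrame and the 80% threshold are illustrative assumptions.

import numpy as np
import pandas as pd

df = pd.DataFrame({"income": [50, np.nan, 70, 65, np.nan],
                   "segment": ["A", "B", np.nan, "B", "A"]})

print(df.isna().mean())                                    # share of missing values per column
df = df.dropna(axis=1, thresh=int(0.2 * len(df)))          # drop columns more than ~80% missing
df["income"] = df["income"].fillna(df["income"].median())  # impute numeric with median (or mean)
df["segment"] = df["segment"].fillna("unknown")            # default category for categorical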

• What is the difference between Bayesian Estimate and Maximum Likelihood Estimation (MLE)?
In Bayesian estimation we have some knowledge about the data/problem (a prior). There may be several values of the parameters which explain the data, so we can consider multiple parameter settings, e.g. 5 gammas and 5 lambdas, that do this. As a result of Bayesian estimation, we get multiple models for making multiple predictions, i.e. one for each pair of parameters but with the same prior. So, if a new example needs to be predicted, computing the weighted sum of these predictions serves the purpose.

Maximum likelihood does not take the prior into consideration (it ignores the prior), so it is like being a Bayesian while using some kind of flat prior.
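
A minimal sketch contrasting the two estimates for a coin's heads probability, using a Beta prior; the counts and prior are illustrative.

heads, tails = 7, 3

# Maximum likelihood: ignores any prior
mle = heads / (heads + tails)

# Bayesian: a Beta(a, b) prior combined with the data gives a Beta posterior;
# the posterior mean is one natural point estimate.
a, b = 2, 2                                   # prior belief: roughly a fair coin
posterior_mean = (heads + a) / (heads + tails + a + b)

print(mle, posterior_mean)                    # 0.70 vs ~0.64: the prior pulls the estimate back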

• What do you understand by Recall and Precision? Illustrate your understanding with a detailed example.
Recall measures “Of all the actual true samples how many did we classify as true?”

Precision measures “Of all the samples we classified as true how many are actually true?”

We will explain this with a simple example for better understanding –

Imagine that your wife has given you a surprise on your anniversary every year for the last 12 years. One day, all of a sudden, your wife asks: “Darling, do you remember all the anniversary surprises from me?”

This simple question puts your life in danger. To save your life, you need to recall all 12 anniversary surprises from memory. Thus, Recall (R) is the ratio of the number of events you can correctly recall to the number of all correct events. If you can recall all 12 surprises correctly then the recall ratio is 1 (100%), but if you can recall only 10 of the 12 surprises correctly then the recall ratio is 0.83 (83.3%).

However, you might be wrong in some cases. For instance, you answer 15 times; 10 of the surprises you guess are correct and 5 are wrong. This implies that your recall ratio is 83.3% (10 of 12) but the precision is only 66.67% (10 of 15).
Precision is the ratio of the number of events you can correctly recall to the number of all events you recall (a combination of correct and wrong recalls).
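
A minimal sketch computing the same two ratios with scikit-learn, encoding the example above as labels (12 real surprises, 15 answers, 10 of them correct).

from sklearn.metrics import precision_score, recall_score

# 1 = an actual surprise / a claimed surprise, 0 = not
y_true = [1] * 12 + [0] * 5              # 12 real surprises, 5 things that never happened
y_pred = [1] * 10 + [0] * 2 + [1] * 5    # 10 recalled correctly, 2 forgotten, 5 wrong guesses

print(recall_score(y_true, y_pred))      # 10/12 ~ 0.83
print(precision_score(y_true, y_pred))   # 10/15 ~ 0.67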

• Can you explain the difference between a Test Set and a Validation Set?
The validation set can be considered part of the training set, as it is used for parameter selection and to avoid overfitting of the model being built. The test set, on the other hand, is used for testing or evaluating the performance of a trained machine learning model.

In simple terms, the differences can be summarized as-

The training set is used to fit the parameters, i.e. the weights.
The test set is used to assess the performance of the model, i.e. to evaluate its predictive power and generalization. The validation set is used to tune the hyperparameters.
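
A minimal sketch of carving out the three sets with scikit-learn; the 60/20/20 split is an illustrative choice.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)

# fit on X_train, tune hyperparameters against X_val, report final performance on X_test only
print(len(X_train), len(X_val), len(X_test))   # 600 / 200 / 200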

• Give some situations where you will use an SVM over a Random Forest Machine Learning algorithm and vice-versa.
SVM and Random Forest are both used in classification problems.
• If you are sure that your data is free of outliers and clean, then go for SVM. If, on the other hand, your data might contain outliers, then Random Forest would be the better choice.
• Generally, SVM consumes more computational power than Random Forest, so if you are constrained by memory or compute, go for the Random Forest machine learning algorithm.
• Random Forest gives you a very good idea of variable importance in your data, so if you want variable importance, choose the Random Forest machine learning algorithm.
• Random Forest machine learning algorithms are preferred for multiclass problems.
• SVM is preferred for high-dimensional problem sets, such as text classification, but as a good data scientist you should experiment with both and test for accuracy, or use an ensemble of many machine learning techniques (a comparison sketch follows this list).
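
A minimal sketch of the "experiment with both" advice: fit an SVM and a Random Forest on the same synthetic data and compare held-out accuracy, assuming scikit-learn.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for model in (SVC(), RandomForestClassifier(random_state=0)):
    model.fit(X_train, y_train)
    print(type(model).__name__, model.score(X_test, y_test))   # held-out accuracy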

• What do you understand by statistical power of sensitivity and how do you calculate it?
Sensitivity is commonly used to validate the accuracy of a classifier (logistic regression, SVM, Random Forest, etc.). Sensitivity is simply “correctly predicted TRUE events / total actual TRUE events”. True events here are the events which were actually true and which the model also predicted as true.

The calculation of sensitivity is straightforward:

Sensitivity = True Positives / Positives in Actual Dependent Variable

Where, True positives are Positive events which are correctly classified as Positives.
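
A minimal sketch of the calculation from a confusion matrix, assuming scikit-learn and illustrative labels.

from sklearn.metrics import confusion_matrix

y_actual = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred   = [1, 1, 1, 0, 0, 0, 1, 0, 0, 0]

tn, fp, fn, tp = confusion_matrix(y_actual, y_pred).ravel()
sensitivity = tp / (tp + fn)     # correctly predicted TRUE events / total actual TRUE events
print(sensitivity)               # 3/4 = 0.75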

• What is Random Forest? How does it work?
Random Forest is a versatile machine learning method capable of performing both regression and classification tasks. It is also used for dimensionality reduction and for handling missing values and outlier values. It is a type of ensemble learning method, where a group of weak models combine to form a powerful model.

In a Random Forest, we grow multiple trees as opposed to a single tree. To classify a new object based on its attributes, each tree gives a classification. The forest chooses the classification having the most votes (over all the trees in the forest) and, in the case of regression, it takes the average of the outputs of the different trees.
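
A minimal sketch of a Random Forest classifier on synthetic data, including the variable-importance view mentioned in the previous answer, assuming scikit-learn.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=8, n_informative=3, random_state=0)
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

print(rf.predict(X[:5]))           # majority vote over all trees in the forest
print(rf.feature_importances_)     # relative importance of each feature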

Section 3: Answer 2 questions out of the given 3. (10 marks each for a total of 20 marks)

• What are the various aspects of a machine learning process? Discuss the components involved in solving a problem using machine learning. Illustrate your understanding through an industry example. What do you think is the most critical aspect and how do you overcome this challenge?
• Domain knowledge: This is the first step wherein we need to understand how to extract the various features from the data and learn more about the data that we are dealing with. It has got more to do with the type of domain that we are dealing with and familiarizing the system to learn more about it.
• Feature Selection: This step deals with selecting features from the set of features that we have. Sometimes there are a great many features and we have to make an intelligent decision about which features to select to go ahead with our machine learning endeavor.
• Algorithm: This is a vital step, since the algorithm we choose will have a major impact on the entire machine learning process. You can choose between linear and nonlinear algorithms. Some of the algorithms used are Support Vector Machines, Decision Trees, Naïve Bayes, K-Means Clustering, etc.
• Training: This is the most important part of the machine learning technique, and this is where it differs from traditional programming. The training is done based on the data that we have, providing more real-world experience. With each subsequent training step the machine gets better and smarter and is able to make improved decisions.
• Evaluation: In this step we evaluate the decisions taken by the machine in order to decide whether they are up to the mark or not. There are various metrics involved in this process, and we have to apply each of them closely to decide on the efficacy of the whole machine learning endeavor.
• Optimization: This process involves improving the performance of the machine learning process using various optimization techniques. Optimization of machine learning is one of the most vital components, wherein the performance of the algorithm is vastly improved. The best part of optimization techniques is that machine learning is not just a consumer of optimization techniques but also provides new ideas for optimization.
• Testing: Here various tests are carried out, some of them on unseen test cases. The data is partitioned into test and training sets, and there are various testing techniques, like cross-validation, to deal with multiple situations. (A minimal end-to-end sketch follows this list.)
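
A minimal end-to-end sketch mapping the steps above onto code (feature selection, algorithm choice, training, evaluation via cross-validation, hyperparameter optimization and final testing), assuming scikit-learn and synthetic data.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

pipe = Pipeline([("select", SelectKBest(k=10)),                        # feature selection
                 ("model", RandomForestClassifier(random_state=0))])   # algorithm choice
search = GridSearchCV(pipe, {"model__max_depth": [3, 5, None]}, cv=5)  # optimization
search.fit(X_train, y_train)                                           # training + evaluation (CV)
print(search.best_params_, search.score(X_test, y_test))               # final testing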

• How do data management procedures like missing data handling make selection bias worse?
Missing value treatment is one of the primary tasks a data scientist must carry out before starting data analysis. There are multiple methods for missing value treatment, and if not done properly they can result in selection bias. Let's look at a few missing value treatment approaches and their impact on selection:

Complete case treatment: Complete case treatment is when you remove an entire row of data even if a single value is missing. You can introduce selection bias if your values are not missing at random and have some pattern. Assume you are conducting a survey and a few people didn't specify their gender. Would you remove all those people? Couldn't that tell a different story?

Available case analysis: Say you are trying to calculate a correlation matrix, so you remove missing values only from the variables needed for each particular correlation coefficient. In this case the resulting values will not be fully consistent, as each coefficient is computed from a different subset of the data.

Mean substitution: In this method missing values are replaced with the mean of the other available values. This can bias your distribution: e.g., the standard deviation is understated, and correlation and regression estimates, which depend on variation around the mean, are distorted.

Hence, various data management procedures can introduce selection bias into your data if not applied carefully.
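
A small sketch of the mean-substitution distortion described above: imputing with the mean visibly shrinks the standard deviation; the data and missingness rate are simulated.

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
x = pd.Series(rng.normal(loc=0, scale=1, size=1000))
x_missing = x.copy()
x_missing[rng.random(1000) < 0.3] = np.nan        # knock out ~30% of values at random

imputed = x_missing.fillna(x_missing.mean())      # mean substitution
print(x.std(), imputed.std())                     # the imputed standard deviation is clearly smaller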

• Describe how machine learning works with respect to financial data, the various types of machine learning, and when each is most suitable for application.
Describe the 3 types of machine learning, the types of alternative data, and the algorithms suitable for each type of alternative data.

Machine learning is the process of generating predictive power using past data (memory). Training is done once on that history, and the resulting predictions can fail in the future if the data distribution changes.

What are the investment pipelines that uncover investment signals using time-series forecasting with ARIMA, SARIMA, etc.?

The end-to-end pipeline for machine learning with financial data can be summarized below.

This includes the various process steps of Alpha Factor Research given below.
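
The pipeline summary and Alpha Factor Research steps referred to above are not reproduced in this document. As a hedged, minimal illustration of one forecasting step in such a pipeline, the sketch below fits a SARIMAX model (statsmodels) to simulated monthly returns and produces a one-step-ahead forecast as a crude signal.

import numpy as np
import pandas as pd
from statsmodels.tsa.statespace.sarimax import SARIMAX

returns = pd.Series(np.random.normal(0, 0.02, size=120))   # simulated monthly returns

model = SARIMAX(returns, order=(1, 0, 1), seasonal_order=(1, 0, 0, 12))
fit = model.fit(disp=False)
signal = fit.forecast(steps=1)          # next-period expected return as a crude signal
print(signal)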

—- End of Mock Exam —-