Big Data Analytics – Mid Term
1. Which of the following is the output representation of a date object?
A "2000/01/11"
B 2000/01/11
C "2000 01 11"
D 2000 01 11
2. Complete the code to return the expected output:
—— Code ——–
x <- 5
while (.........) {
  print(x)
  x <- x - 1
}
——– Expected Output ——–
[1] 5
[1] 4
[1] 3
[1] 2
[1] 1
A x == 0
B x >= 0
C x < 3
D x > 0
3. Complete the code to return the expected output:
—— Code ——–
x <- ......... (
  a = c(1, 2, 3),
  b = c('uno', 'dos', 'tres')
)
x
——– Expected Output ——–
$a
[1] 1 2 3

$b
[1] "uno" "dos" "tres"
A data.frame
B dataframe
C vector
D list
4. Complete the code to return the expected output:
—— Code ——–
numbers <- list(
  x = c(1, 2, 3),
  y = c('one', 'two', 'three'),
  z = c('uno', 'dos', 'tres')
)
.........(numbers)
——– Expected Output ——–
[1] "x" "y" "z"
A name
B colname
C names
D colnames
5. Complete the code to return the expected output:
—— Code ——–
.......
typeof(x)
——– Expected Output ——–
[1] "logical"
A x <- "True"
B x <- True
C x <- "TRUE"
D x <- TRUE
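As a quick reference for this distinction, runnable in any R session: a bare `TRUE` is a logical value, while quoting it produces a character string (and `True`, capitalised that way, is not defined in base R).

```r
# Bare TRUE is R's logical type; quoted "TRUE" is a character string.
typeof(TRUE)    # [1] "logical"
typeof("TRUE")  # [1] "character"
```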
6. Complete the code to return the expected output:
—— Code ——–
.......
is.na(x)
——– Expected Output ——–
[1] TRUE
A x <- "TRUE"
B x <- nan()
C x <- NA
D x <- is.na
7. What is the output of the following code?
p <- 2
raise_to_power <- function(x) { x^p }
raise_to_power(10)
A 1024
B 200
C 100
D An Error
8. What is the output of the following code?
square <- function(x) {
  if (!is.numeric(x)) {
    print("x must be numeric")
  } else {
    x ^ 2
  }
}
square("Ten")
A An Error
B “One Hundred”
C 100
D “x must be numeric”
9. Which of the following would not be a suitable situation in which to use an apply-style function?
A Fit the same model to hundreds of simulated data sets
B Simulate data from a Normal distribution ten times, setting a different mean each time
C Creating a vector of average values for each of a hundred columns of a data frame
D Applying a function to many data sets where the value of one of the arguments is dependent upon the output of the previous iteration
10. What is the output of the following code?
gapminder %>% summarize(mean = mean(lifeExp)) %>% group_by(continent)
A A table of the mean life expectancy per continent
B An Error
C The mean of all life expectancies
D A new variable with the mean of life expectancies
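For comparison, a sketch of the same pipeline with the two verbs in the order that would produce a per-continent table (assuming the dplyr and gapminder packages are installed and loaded):

```r
# group_by() before summarize() yields one mean per continent;
# summarize() before group_by() collapses everything to a single
# overall mean first, leaving group_by() nothing useful to group.
library(dplyr)
library(gapminder)

gapminder %>%
  group_by(continent) %>%
  summarize(mean = mean(lifeExp))
```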
11. Complete the code to return the expected output:
—— Code ——–
library(stringr)
x <- c("This string has many characters.", NA, "This string has many characters too!")
......
——– Expected Output ——–
[1] 32 NA 36
A n(x)
B char(x)
C str(x)
D nchar(x)
12. Following is a preview of the data frames x, y, and z. Complete the code to get the expected output:
—— data ——–
> x
   p q
1 20 5
2 21 6
> y
   p    r
1 20 Jane
2 22 John
> z
   p     s
1 20  TRUE
2 23 FALSE
—— Code ——–
library(dplyr)
x %>%
  left_join(y, by = "p") %>%
  ........(z, by = "p")
——– Expected Output ——–
   p  q    r     s
1 20  5 Jane  TRUE
2 23 NA   NA FALSE
A left_join
B merge
C inner_join
D right_join
13. Following is a preview of the data frame df. Complete the code to get the expected output:
—— data ——–
      country    Y1980    Y1981
1 Afghanistan 21.48678 21.46552
2     Albania 25.22533 25.23981
3     Algeria 22.25703 22.34745
—— Code ——–
......(df, year, bmi, -country)
——– Expected Output ——–
      country  year      bmi
1 Afghanistan Y1980 21.48678
2     Albania Y1980 25.22533
3     Algeria Y1980 22.25703
4 Afghanistan Y1981 21.46552
5     Albania Y1981 25.23981
6     Algeria Y1981 22.34745
A explode
B gather
C merge
D separate
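A self-contained sketch of this wide-to-long reshape, rebuilding df from the previewed values (assuming the tidyr package is installed); tidyr's gather() reproduces the expected output:

```r
# Reshape the wide data frame (one column per year) into long format:
# gather(df, year, bmi, -country) stacks every column except country,
# putting the old column names in `year` and the values in `bmi`.
library(tidyr)

df <- data.frame(
  country = c("Afghanistan", "Albania", "Algeria"),
  Y1980   = c(21.48678, 25.22533, 22.25703),
  Y1981   = c(21.46552, 25.23981, 22.34745)
)

gather(df, year, bmi, -country)
```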
14. Which function is used to open csv files?
A read.csv
B readcsv
C read_csv
D read
15. The bias-variance tradeoff describes:
A Technical details underpinning model fitting that we don’t need to worry about.
B Incorporating expert knowledge of a process at the expense of model simplicity.
C Balancing how well a model fits with how well it generalizes to new data.
D Choosing to focus on only one of a model's abilities to explain or predict a process.
16. Which of the following best describes linear regression?
A An algorithm that reduces the number of variables for data-modeling to a set of so-called principal variables.
B An algorithm that describes a factor response variable as a function of one or more predictor variables.
C An algorithm that splits data into groups based on feature similarities.
D An algorithm that describes a continuous response variable as a function of one or more predictor variables.
17. Which of the following best describes feature engineering?
A Finding data and setting up a process to import it into R ready for analysis.
B Creating or updating variables to provide the most relevant information for model fitting.
C Selecting which variables should be included in a given model based on model fit.
D Any activity that is performed to allow a model to be put into production environments.
18. Which of the following situations is best suited for the utilization of linear regression?
A An e-commerce company that wants to predict whether or not a customer will purchase a particular item.
B A retail company that wants to predict their total sales to understand the impact of online promotions.
C A video-hosting company that wants to know if users have a positive or negative feeling to a given video.
D A healthcare company that wants to use tumor data to predict whether a new tumor is benign or malignant.
19. Consider the summary below from a linear model. What does this model suggest regarding the relationship between the variables value and status, where status is a percentage and value is dollars in thousands?
Call:
lm(formula = value ~ status, data = Boston)

Residuals:
     Min       1Q   Median       3Q      Max
-15.168   -3.990   -1.318    2.034   24.500

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)  34.55384    0.56263   61.41   <2e-16 ***
status       -0.95005    0.03873  -24.53   <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6.216 on 504 degrees of freedom
Multiple R-squared: 0.5441, Adjusted R-squared: 0.5432
F-statistic: 601.6 on 1 and 504 DF, p-value: < 2.2e-16
A For every $1000 increase in value we expect a reduction in status.
B For every percentage increase in status we expect a reduction in value.
C There is no statistically-significant relationship between value and status.
D For every 34.55384 unit increase in value we expect a reduction in status.
20. Which of the following metrics would not be used when assessing the performance of a regression model?
A Accuracy.
B Root Mean Square Error.
C Median Absolute Deviation.
D Akaike Information Criteria.
21. What is the output of the following code?
model <- glm(
  formula = Acceptance ~ GPA,
  data = MedGPA,
  family = binomial
)
predict(
  model,
  newdata = data.frame(GPA = 4),
  type = 'response'
)
A A binary 0 or 1.
B A probability.
C An Error.
D A continuous prediction.
22. Complete the code to get the expected output:
—— Code ——–
library(Metrics)
model <- glm(Benign ~ Cell.size, data = train, family = binomial)
pred_class <- predict(model, newdata = test, type = "response")
pred_class <- ifelse(pred_class > 0.5, 1, 0)
........(actual = test$Benign, predicted = pred_class)
——– Expected Output ——–
[1] 0.9310345
A accuracy
B mae
C LogLik
D rmse
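For intuition, classification accuracy can also be computed by hand without the Metrics package; the vectors below are made-up illustrations, not the exam's train/test data:

```r
# Threshold predicted probabilities at 0.5, then score the proportion
# of predicted classes that match the true labels.
actual     <- c(1, 0, 1, 1, 0)
predicted  <- c(0.9, 0.2, 0.6, 0.4, 0.1)
pred_class <- ifelse(predicted > 0.5, 1, 0)

mean(pred_class == actual)  # [1] 0.8
```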
23. Before we fit a model to our data we should consider centering and scaling the data so that:
A It’s easier to interpret the model output because the variables are the same values.
B A feature does not have more influence on the model because of larger or smaller values.
C We can remove outliers from the data because they will all be on the same range.
D We can use both the original and scaled version of the variables in our model.
24. Which of the following statements best describes cross-validation?
A Splits our data into a single train/test set, we validate our model once on the test set.
B Splits our data into B test sets, with every observation appearing a random number of times.
C Splits our data into K train/test sets, with every observation appearing once across all test sets.
D Is used to find the final values for all parameters in our model, after training and testing.
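A hand-rolled sketch of the K-fold idea, using the built-in mtcars data as a stand-in: each observation lands in exactly one test fold across the K iterations.

```r
# Randomly assign each row to one of k folds, then rotate which fold
# serves as the test set; every row is tested exactly once overall.
set.seed(1)
k <- 5
folds <- sample(rep(1:k, length.out = nrow(mtcars)))

for (i in 1:k) {
  test_set  <- mtcars[folds == i, ]
  train_set <- mtcars[folds != i, ]
  # fit the model on train_set and evaluate it on test_set here
}
```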
25. Which one of the following statements describes unsupervised learning?
A A machine learning problem where we predict a categorical variable as a function of both continuous and categorical variables.
B A machine learning problem where we seek to understand whether observations fit into distinct groups based on their similarities.
C A machine learning problem where we model a response variable as a function of a set of predictor variables.
D A machine learning problem that deals with the importing and cleaning of a dataset, such that it is tidy.
26. Which of the following best describes a decision tree?
A An algorithm that reduces the number of variables to a set of principal variables.
B An algorithm that describes a continuous response variable as a function of one or more predictor variables.
C An algorithm with a sequence of if-else questions about the individual features used to infer the target variable.
D An algorithm that splits unlabeled data into groups based on feature similarities.
27. In which of the following situations might you not want to use a decision tree?
A When model output should be easily interpretable.
B When there is non-linearity in the dataset.
C When there is high variance in the dataset.
D When the model needs to be trained quickly.
28. In which of the following situations would you recommend the use of a generalized linear model with a non-Gaussian distribution rather than a simple linear model?
A Where the response variable is continuous, and the feature variable is continuous.
B Where the response variable is continuous, and the feature variable is categorical.
C Where the response variable is continuous, and the feature variable is binary.
D Where the response variable is binary, and the feature variable is continuous.
29. In which of the following situations should you not use a random forest algorithm to model data?
A When model performance is important.
B When model interpretability is important.
C When there is non-linearity in the data.
D When there is high variance in the dataset.
30. What is the main difference between gradient boosting and other tree-based models?
A Gradient boosting addresses bias by growing trees all at the same time, and based on the original data.
B Unlike other tree-based models, gradient boosting is efficient as it learns by using bootstrap sampling.
C Gradient boosting addresses overfitting by using information from previously grown trees.
D Gradient boosting addresses underfitting by focusing only on the most significant variables in the original dataset.
31. When we train a classification tree algorithm on the iris data, what is the accuracy of the model?
A 0.82
B 0.85
C 0.97
D 0.98