DSCC 201/401 Midterm Exam Review Topics
Midterm Exam: Wednesday, March 31, 9:00-10:00 a.m. EDT
DSCC 201: Blackboard (Online Only)
DSCC 401: Wegmans Hall, Room 1400
Topics to Review:
1. Hardware and Infrastructure for Data Science
a. What is a Linux cluster? What are the main components of a Linux cluster and what function do they perform?
b. What is InfiniBand and how does it differ from Ethernet?
c. What is Slurm?
d. What is a parallel file system?
e. What is a GPU? Why do we care about precision of calculations? What are the tradeoffs
between high precision and high performance? What is a tensor core?
f. What are the differences between GPUs and CPUs (e.g. architecture, number of cores,
amount of memory available, etc.)?
g. Virtual machines: Advantages and disadvantages. What are containers?
h. Theoretical computing capacity equation (FLOPS) – How to calculate FLOPS given
number of cores, frequency, instructions per clock cycle, etc.
i. LINPACK benchmark – What is it? What is the difference between theoretical value and
benchmark value? Why is there a difference?
j. Cloud classifications: SaaS, PaaS, and IaaS
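Worked example for item h (a minimal sketch, not an official formula sheet): theoretical peak FLOPS is commonly computed as number of cores x clock frequency x floating-point operations per clock cycle. The function name and the hardware numbers below are hypothetical and chosen only to illustrate the arithmetic.

```python
def peak_flops(cores, frequency_hz, flops_per_cycle):
    """Theoretical peak = cores x clock frequency (Hz) x FLOPs per cycle per core."""
    return cores * frequency_hz * flops_per_cycle


# Hypothetical node: 24 cores at 2.5 GHz, 16 floating-point operations
# per clock cycle per core (assumed values, not a real system).
peak = peak_flops(cores=24, frequency_hz=2.5e9, flops_per_cycle=16)
print(f"{peak / 1e12:.2f} TFLOPS")  # prints "0.96 TFLOPS"
```

For a whole cluster, multiply the per-node figure by the number of nodes. A measured LINPACK result (item i) is expected to come in below this theoretical value.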
2. Linux Command Line Environment and Bash Shell Scripting
a. What is the Linux kernel? What is Bash?
b. What is a CLI? How do you view the contents of a file (cat, more, less)?
c. Common Linux commands: ls, cat, grep, cp, mv, mkdir, rm, wc, etc. What do they do?
d. How to navigate directories in a Linux file system (What are /, ~, ., and ..?)
e. What is the difference between | and >? What are $( ) and $(( )) in Bash?
f. What command do you use to search for strings in a file?
g. What are the commands to copy and move files?
h. What is a Bash script and how is it the same as or different from the command line environment?
i. How is Bash similar to or different from a compiled language (e.g. C++) or a scripting language
like R?
j. What is the syntax for conditionals and loops in Bash?
3. Software Parallelization Models and Techniques
a. What is parallel computing and why is it important?
b. What is Amdahl’s Law and how do you calculate speedup? Why is that useful to do?
c. What is linear scaling? How do we get close to linear scaling?
d. What is a node, CPU, core, task, and thread? How are they different?
e. What are the classifications in Flynn’s Taxonomy?
f. How are OpenMP and MPI similar and different?
g. What is CUDA?
h. What is a Slurm script? What do the following commands do: squeue, sbatch, scancel,
and sinfo?
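Worked example for item b (a minimal sketch under stated assumptions): Amdahl's Law gives the speedup as S = 1 / ((1 - p) + p / n), where p is the fraction of the run time that can be parallelized and n is the number of processors. The function name and the value p = 0.9 are hypothetical, used only for illustration.

```python
def amdahl_speedup(parallel_fraction, n_processors):
    """Speedup predicted by Amdahl's Law: 1 / ((1 - p) + p / n)."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_processors)


# Assume 90% of the run time is parallelizable (hypothetical value).
for n in (1, 10, 100):
    print(n, round(amdahl_speedup(0.9, n), 2))
```

With p = 0.9 the speedup stays below 1 / (1 - p) = 10 no matter how many processors are added, which is why shrinking the serial fraction matters for getting close to linear scaling (item c).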
4. R for Data Analysis
a. What is a vector? What is a list? How are they different?
b. What is a matrix? What is a data frame? How are they different?
c. How to assign a vector to a variable (e.g. using the “:” operator and the “seq” function)
d. Syntax for filtering rows and columns in a matrix and a data frame
e. What is a factor? Why are factors useful?
f. Syntax for extracting and manipulating data in a data frame
g. Difference between running R in “batch mode” and interactively
h. What is a “workspace image” and why does R ask if you want to save this?
i. What is data pre-processing?
j. Provide definitions and examples of each of the following: data cleaning, data
integration, data transformation, and data reduction
k. What is the definition of machine learning?
l. What is supervised learning and unsupervised learning?
m. What do the following libraries provide: class, cluster, and caret?
5. SAS
a. Know the use and syntax of the data step (infile, dsd, etc.)
b. Know the use and syntax of proc (means, print, freq)
c. Syntax of commenting in SAS code
d. What are the differences between valid and invalid SAS variable names?
e. How are SAS formats used (e.g. date formats in SAS)?
Details of the Midterm Exam (Subject to Change):
• 27 questions: 25 multiple choice, 2 short answer or essay questions
• Exam will last only 1 hour and will begin at 9:00 a.m. EDT on Wednesday, March 31st
• The use of smart phones, cell phones, tablets, laptops, calculators, and all other electronic devices to assist in answering any question is prohibited during the exam session. The exam will be “closed book” and no additional materials are allowed to be used for assistance in answering the questions.
• By submitting answers to the exam, you attest that you have not given or received any unauthorized help on this exam, and that all work is your own.
• DSCC 201 ONLY:
o Log in to Blackboard at 9:00 a.m. on Wednesday, March 31st and navigate to Learning Modules; the Midterm Exam will appear at the bottom. Click this link to begin the exam.
o On average, spend 2 minutes on each multiple choice question and 5 minutes on each short answer/essay question.
o Exam questions will be randomized for all students.
o One question will be displayed at a time.
o Once your answer to a question is submitted, the answer is final, and you cannot go back and change your answers to previous questions.
• DSCC 401 ONLY:
o Arrive at Wegmans Hall, Room 1400 before 9:00 a.m.
o Follow all required social distancing protocols and wear a mask.
o Exam book and answer sheet will be distributed.
o On average, spend 2 minutes on each multiple choice question and 5 minutes on each short answer/essay question.
o Exam will end at 10:00 a.m. EDT.
• Make sure to review the instructions before starting the test in case there are any changes to
the format, structure, presentation, etc. that differ from what is presented here.