Semester review; exam preview
DATA1002/1902 Lecture 13B
Prof Alan Fekete
University of Sydney
DATA1002 sem2 2021 – Lecture 13B 1
COMMONWEALTH OF AUSTRALIA
Copyright Regulations 1969
WARNING
This material has been reproduced and communicated to you by or on
behalf of the University of Sydney pursuant to Part VB of the Copyright Act
1968 (the Act). The material in this communication may be subject to
copyright under the Act. Any further copying or communication of this
material by you may be the subject of copyright protection under the Act.
Do not remove this notice.
DATA1002 and DATA1902
• This lecture provides information on
DATA1002
• For DATA1902 students, see the Wednesday
lecture and its slides
Agenda
• Semester review
• Written exam preview
• Administrative issues
• More study of Data Science
Role of the unit
• “This unit covers computation and data handling,
integrating sophisticated use of existing
productivity software, e.g. spreadsheets, with the
development of custom software using the
general-purpose Python language. It will focus on
skills directly applicable to data-driven decision-
making. Students will see examples from many
domains, and be able to write code to automate
the common processes of data science, such as
data ingestion, format conversion, cleaning,
summarization, creation and application of a
predictive model.” [from UoS outline]
DATA1002 Learning outcomes 1
• 1. Ability to automate a computational
process, when given a clear account of the
algorithm to be applied. This will be done by
writing Python programs with core techniques
of procedural programming.
• Grok modules; Tutorials; Lecture 4B; Python
coding test and practice
DATA1002 Learning outcomes 2
• 2. Knowledge of Python syntax and semantics,
to trace and understand idiomatic code typical
of data science activities, including features
such as user-defined functions, exception-
raising and handling.
• Grok modules; lectures 2A, 3A, 3B [a bit], 4A,
5A, 7A, 11B; Tutorials
DATA1002 Learning outcomes 3
• 3. Experience with automation of the computational
process needed for examples of the various activities in
the data science pipeline: data ingestion and cleaning,
data format conversion, data summarization, visual and
tabular presentation of the results from
summarization, creation of a predictive model of a
given form, application of a predictive model to new
data, evaluation of a predictive model (and also,
automation of a pipeline that scripts use of existing
tools for these activities). The examples students have
seen will cover a diversity of application domains
• Lectures 1B, 4B, 5B [a bit], 7A [a bit], 8AB [a bit], 9B [a
bit]; Labs, Tutorials, Project Stages
DATA1002 Learning outcomes 4
• 4. Experience with both spreadsheets, and
programs in Python, for automatically
performing computational processes of data
science, and awareness of the similarities and
differences between tools.
• Lectures 2B, 8AB [a bit]; Labs 2,3
DATA1002 Learning outcomes 5
• 5. Understanding of main issues for data
management in connection with data science
activities, including value of data, importance of
metadata (that describes the format and meaning
of data, constraints on the data, origins of the
data, restrictions on use of the data, etc), issues
when sharing data across time and across users
(eg value of a manager role, access control,
persistence, recovery)
• Lectures 1B, 5B, 11A; Lab 1; Project Stage 1
DATA1002 Learning outcomes 6
• 6. Understanding of how data sets are
represented in computer files, in particular,
the many-to-many relationship between the
physical representation and the logical
representation; advantages and disadvantages
of different representations.
• Lectures 3B, 6A, 7B
DATA1002 Learning outcomes 7
• 7. Understanding of principles of charting and
information presentation, and ability to
produce good charts using both Python
libraries and spreadsheets; also capability to
evaluate charts for effectiveness in
communication.
• Lectures 8AB; Project stage 2; Labs
DATA1002 Learning outcomes 8
• 8. Understand the principles of machine
learning and its role in data science, in
particular the creation, use, and limitations of
predictive models for regression and
classification tasks, issues of over-fitting and
under-fitting, and evaluation of models.
• Lectures 9A, 9B, 12A, 13A; Project stage 3;
Labs
Big ideas from DATA1002
• Data science work activities
• Variety of predictive models
• Data should be managed (including metadata)
• Data representation choices
• Power of automating computation (and how to
do it)
• Notional machine for Python computation
• Libraries and their value
• Communication (techniques and principles)
• Evaluation, recognition of limitations
DATA1002 Assessment
• Weekly Python tasks (worth 10%, marked for
participation)
• Weekly quizzes (worth 10%)
• Python coding tests (worth 10%, in lecture slots
week 10)
• Project stage 1 (worth 5%, week 6)
• Project stage 2 (worth 10%, week 9)
• Project stage 3 (worth 5%, week 12)
• Proctored 2hr online exam (worth 50%)
There are also zero-weight practice coding tests
Reminder
• Minimum requirement: It is a policy of the School of
Computer Science that in order to pass this unit, a
student must achieve at least 40% in the written
examination. A student must also achieve an overall
final mark of 50 or more. Any student not meeting
these requirements may be given a maximum final
mark of no more than 45 regardless of their average.
– From official unit outline on Canvas
• Warning: Canvas report of your overall mark so far
may be inaccurate
– Calculate for yourself, using marks from tasks and the
announced weightings
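As a rough sketch of the self-calculation suggested above, the following Python combines component marks using the announced weightings and then applies the minimum-requirement rule. The function name, the dictionary keys, and the sample marks are hypothetical, and the code assumes that failing either requirement caps the final mark at 45; the official unit outline remains the authoritative statement of the rule.

```python
# Announced weightings (percent of the unit), taken from the Assessment slide.
WEIGHTS = {
    "weekly_python": 10,
    "weekly_quizzes": 10,
    "coding_tests": 10,
    "project_stage1": 5,
    "project_stage2": 10,
    "project_stage3": 5,
    "exam": 50,
}


def overall_mark(percent_scores):
    """Return (weighted mark, final mark after the minimum requirement).

    percent_scores maps each component name to a score out of 100.
    Assumption: missing either requirement caps the final mark at 45.
    """
    weighted = sum(WEIGHTS[name] * percent_scores[name] / 100 for name in WEIGHTS)
    meets_requirements = percent_scores["exam"] >= 40 and weighted >= 50
    final = weighted if meets_requirements else min(weighted, 45)
    return weighted, final


# Hypothetical marks, for illustration only.
example = {
    "weekly_python": 100,
    "weekly_quizzes": 80,
    "coding_tests": 70,
    "project_stage1": 60,
    "project_stage2": 75,
    "project_stage3": 80,
    "exam": 35,
}
weighted, final = overall_mark(example)
print(weighted, final)
```

With these made-up marks the weighted total is above 50, but the exam score is below 40%, so the capped value is what would be reported under the stated assumption.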
Agenda
• Semester review
• Written exam preview
• Administrative issues
• More study of Data Science
DATA1002 Exam
• Worth 50% of the unit
– And also, minimum requirement of 40 points on exam
in order to Pass the unit
• Proctored (Review+) online exam
• 2 hr duration (plus a bit), done as Canvas quiz
• Scheduled by Exams Office during exam period
• Mix of question types (including written text), will cover
the content of lectures, tutes, labs, and assessments
(including Python programming concepts and skills)
ProctorU Review+ (“Type B”) Exam
• Please see
https://canvas.sydney.edu.au/courses/23380
• There are hardware and software
requirements on your environment
– Make sure you set everything up NOW
– Then do the practice Review+ exam to check it all
works properly: enrol at
https://canvas.sydney.edu.au/enroll/8WH4F7
– In case of technical difficulties: get help from Uni
ICT
General principles
• The final exam will allow you to provide evidence that
you have achieved the learning outcomes of the unit
• Questions cover “knowledge” and “doing”
– You should not be surprised or tricked by any questions
– They align with the learning experiences through the
semester
• You will need to be quite fast in reading/understanding
the questions
– So, get familiar with the style and wording, from this
lecture!
Differences from previous years' exams
• 2018 and 2019: 2hr exam invigilated in person
– Answers handwritten on the printed exam paper
– Allowed: One A4 sheet of handwritten and/or typed notes
double sided
• 2020: 3 hr exam online
– download pdf of questions, produce pdf of answers and upload
– open book
• This year: 2 hr proctored exam online
– answer by typing within Canvas, eg multichoice, fill in text box
– You are allowed to look at hand-written or printed notes (on
paper), but no online material or communication with other
people
Exam Structure
• Exam is available as Canvas “quiz”, at the
time scheduled by the Exams Office
– Answer each question in the quiz (select
appropriate choice, enter text in textbox etc)
– You must sit exactly when scheduled
• The exam is on a special canvas site “Final
Exam for DATA1002”
– This is not the same as usual Canvas site for this
unit
– Make sure you visit this site as soon as it
becomes available (should be on November 15)
Timing
• Your exam is 2 hours and 10 minutes long (130 minutes). This includes 10
minutes of reading time, but you can start writing whenever you are
ready. You are strongly encouraged to use this time to carefully plan and
structure your response before you start writing.
– If you have an academic adjustment, it is supposed to have been added
automatically.
• Quiz buffer time: You will be allowed a buffer time of 40 minutes in case
you experience any technical issues starting your exam. This means that
you have 40 minutes to begin the exam and still get the full time allowed
to complete the exam. If you are unable to start your exam within the
buffer time, you should apply for special consideration. Buffer time does
NOT mean you have extra time to complete your exam.
• Please keep track of your time. Your quiz timer may not update if you
have an internet connection issue. Use the time on your computer so that
you always know how much time you have left. Only questions completed
within the exam time will be marked.
Warning: pay attention to time zone (and possible changes where you are).
All instructions refer to Sydney daylight-savings time.
Examples
The following assume start time of 13:00 – if you have some
different arrangement, your finish will also vary
• Start exam at 13:00, finish and submit by 15:10
• Start exam at 13:35, finish and submit by 15:45
– Start is within buffer, so you get the full allowed time
• Start exam at 13:50, finish and submit by 15:50
– Start is after buffer, so extra delay is taken from your time to sit
– However, if you are unable to start by 13:40 (end of buffer) you
are advised to apply for special consideration rather than start
late
• We advise you to be ready and try to start at 13:00; the
buffer may be needed to deal with technical issues
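The examples above all follow one rule: the deadline is 130 minutes after your start, except that a start later than the end of the 40-minute buffer is treated as if it happened at the end of the buffer. A minimal sketch of that arithmetic follows; the function and helper names are hypothetical, and the 13:00 scheduled start is simply the one assumed in the examples.

```python
from datetime import datetime, timedelta

EXAM_MINUTES = 130    # 2 hours plus 10 minutes of reading time
BUFFER_MINUTES = 40   # a start within this window still gets the full time


def hhmm(text):
    """Parse a clock time such as '13:35' (the date part is irrelevant here)."""
    return datetime.strptime(text, "%H:%M")


def latest_finish(scheduled_start, actual_start):
    """Return the time by which the exam must be finished and submitted."""
    buffer_end = scheduled_start + timedelta(minutes=BUFFER_MINUTES)
    # Starting after the buffer does not move the deadline any later.
    effective_start = min(actual_start, buffer_end)
    return effective_start + timedelta(minutes=EXAM_MINUTES)


scheduled = hhmm("13:00")
for start in ("13:00", "13:35", "13:50"):
    finish = latest_finish(scheduled, hhmm(start))
    print(start, "->", finish.strftime("%H:%M"))
```

This reproduces the three finish times listed above: 15:10, 15:45 and 15:50.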
Submit
• Do not push submit button at bottom of the quiz,
until you are ready to finish work (or timer is
about to expire)
– Exam “quiz” is single-attempt
– System should be saving whatever you typed in
textboxes etc, as long as the internet connection is
maintained
– If the connection is broken during the exam, your machine
crashes, etc, then you will need to get “special
consideration” based on the technical difficulty (apply
as soon as you can)
Resources
• Exam conditions: This is a restricted open
book exam. You are not permitted to use third
party communication or collaboration apps or
websites. Access to any such app or website is
strictly prohibited during your exam and is a
serious academic integrity breach.
– Materials permitted: Handwritten notes, printed
notes and textbooks.
– Materials provided: None
Warning: permitted materials are on paper only;
do not have other device (phone, tablet, laptop etc) nearby;
do not have other tabs or browsers open on your working machine
Notes
• Evidence shows that (i) choosing what facts to
have in notes, and then (ii) writing it out by
hand, is very beneficial for learning
• Once written out by hand, you may want to
type up again and have a printed-out version
• Remember that time is scarce in the exam, so
you need to be able to quickly find things in
the notes
– Organisation is vital
Exam Structure
• Do 19 questions worth in total 100 points
• Part A (Q1 to 10): 20 points [common to
data1002/1902]
• Part B (Q11 to 16): 60 points [common to
data1002/1902]
• Part C (Q17-19): 20 points [only for
data1002 students]
– PartCAdv will be discussed in Wed lecture for
data1902 only
Dataset used in some exam questions
Several of the questions in this exam refer to a comma-separated-values dataset
called employment_sector_technology.csv
This dataset was produced by some transformations and cleaning, on the data downloaded
from https://databank.worldbank.org/reports.aspx?source=jobs#
The first row is a header line, giving the names of the fields, whose meanings are as follows:
• Jurisdiction (that is, country name)
• Region
• Employment in agriculture (% of total employment)
• Employment in industry (% of total employment)
• Employment in services (% of total employment)
• GDP growth (annual %)
• Individuals using the Internet (% of population)
• Fixed broadband Internet subscribers (per 100 people)
• Mobile cellular subscriptions (per 100 people)
Here are the first few lines of the file
Jurisdiction,Region,Emp_agric,Emp_industry,Emp_services,GDP-growth,Internet_use,Broadband,Mobile
Australia,Oceania,2.6,19.4,77.9,2.8,88.2,30.6,110.1
Austria,Europe,4.3,25.6,70.1,1.5,84.3,29.0,163.8
Belgium,Europe,1.3,21.3,77.5,1.4,86.5,37.6,110.5
Canada,Americas,1.9,19.5,78.5,1.4,91.2,36.9,84.7
Chile,Americas,9.5,23.0,67.5,1.3,83.6,16.2,130.1
Czech Republic,Europe,2.9,38.1,59.0,2.6,76.5,28.9,117.7
Denmark,Europe,2.5,18.6,78.8,2.0,97.0,42.6,122.3
Estonia,Europe,3.9,29.7,66.4,2.1,87.2,30.1,144.6
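For practice, here is one possible way to read a file in this format into Python. It is only a sketch: it assumes the header is a single line in the real file (it wraps on the slide), the function name `load_rows` is hypothetical, and an in-memory copy of the first three data rows stands in for the file so that the code runs on its own.

```python
import csv
import io

# Header and first three data rows as shown above, with the header on one line.
SAMPLE = """Jurisdiction,Region,Emp_agric,Emp_industry,Emp_services,GDP-growth,Internet_use,Broadband,Mobile
Australia,Oceania,2.6,19.4,77.9,2.8,88.2,30.6,110.1
Austria,Europe,4.3,25.6,70.1,1.5,84.3,29.0,163.8
Belgium,Europe,1.3,21.3,77.5,1.4,86.5,37.6,110.5
"""


def load_rows(file_obj):
    """Return a list of dictionaries, one per jurisdiction.

    Keys are the header names. Jurisdiction and Region stay as strings;
    every other field is converted to a float.
    """
    text_fields = {"Jurisdiction", "Region"}
    rows = []
    for record in csv.DictReader(file_obj):
        row = {}
        for key, value in record.items():
            row[key] = value if key in text_fields else float(value)
        rows.append(row)
    return rows


rows = load_rows(io.StringIO(SAMPLE))
print(rows[0]["Jurisdiction"], rows[0]["Internet_use"])
```

To use the real file instead, open it with `open("employment_sector_technology.csv", newline="")` and pass the file object to `load_rows`.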
Part A
• 10 factual automatically-graded questions,
each worth 2 points
• Questions can be multiple-choice, multiple-
answers, fill-in-blanks, etc
• Cover the range of content of unit
DATA1002 sem2 2021 – Lecture 13B 30
Part B
• 6 longer/deeper questions (hand-graded by
tutors and lecturer), each worth 10 points
Question 11
• 10 points
• “[Situation described] Write an explanation
for [target reader] about [particular aspects of
the situation]”
– Marking will reflect both content and effective
communication
Question 12
• 10 points
• Trace code given to indicate what would be
printed at indicated points in the execution.
• We encourage you to draw a notional machine
diagram (on paper) to help you to answer this
question.
– Note that you cannot use Grok or any
programming environment during the exam
– Similar to question in tute13
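To practise this style, you could trace a short snippet like the one below on paper before running it. This is a made-up practice example, not taken from the exam or from tute13; all names are hypothetical. It exercises the notional-machine idea that two names can refer to the same list object.

```python
def add_total(values):
    # values refers to the same list object as the caller's variable.
    values.append(sum(values))
    return len(values)


scores = [3, 4]
alias = scores            # a second name for the same list, not a copy
count = add_total(alias)
print(count)              # point A
print(scores)             # point B
alias = alias + [0]       # builds a new list; scores is unchanged
print(scores, alias)      # point C
```

At point A the output is 3, at point B it is [3, 4, 7], and at point C it is [3, 4, 7] [3, 4, 7, 0]. A diagram with one list box and two arrows pointing to it makes the result at point B easy to see.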
Question 13
• 10 points
• “Using the employment_sector_technology.csv dataset described above,
write well-documented and easily readable python code that will print
[some calculation described, example of output format given] You do not
need to deal with mis-formatted files or other errors. You are allowed to
use a library like Pandas, but this is not required. It is important that your
comments should clearly describe the data structure used for storing the
data in your program (eg if you use a dictionary, you must explain what
the keys and values represent; if you use Pandas, you must indicate the
indices of the dataframes your code refers to, etc). Ensure you use the
‘Preformatted’ block option to keep your code readable, accessed by
clicking the dropdown saying ‘Paragraph’ in the answer box font settings.”
– Marking will reflect appropriate logic, knowledge of Python techniques, and
documentation
– Note that you cannot use Grok or any programming environment during the
exam
– Similar to question in tute13
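As an illustration of the expected level of documentation, here is a sketch of an answer to a made-up task: print the average Internet use for each region. The task, the function name and the output format are hypothetical, and the code assumes the header is a single line in the real file. The eight data rows from the dataset slide are embedded so that the example runs on its own; note how the comments say what the dictionary keys and values represent.

```python
import csv
import io

# Rows shown on the dataset slide, with the header on a single line.
SAMPLE = """Jurisdiction,Region,Emp_agric,Emp_industry,Emp_services,GDP-growth,Internet_use,Broadband,Mobile
Australia,Oceania,2.6,19.4,77.9,2.8,88.2,30.6,110.1
Austria,Europe,4.3,25.6,70.1,1.5,84.3,29.0,163.8
Belgium,Europe,1.3,21.3,77.5,1.4,86.5,37.6,110.5
Canada,Americas,1.9,19.5,78.5,1.4,91.2,36.9,84.7
Chile,Americas,9.5,23.0,67.5,1.3,83.6,16.2,130.1
Czech Republic,Europe,2.9,38.1,59.0,2.6,76.5,28.9,117.7
Denmark,Europe,2.5,18.6,78.8,2.0,97.0,42.6,122.3
Estonia,Europe,3.9,29.7,66.4,2.1,87.2,30.1,144.6
"""


def average_by_region(file_obj, field):
    """Return a dictionary mapping region name to the mean of `field`."""
    # values_by_region: key = region name (string),
    # value = list of floats, one per jurisdiction in that region.
    values_by_region = {}
    for record in csv.DictReader(file_obj):
        values_by_region.setdefault(record["Region"], []).append(float(record[field]))

    # averages: key = region name, value = mean of that region's values.
    averages = {}
    for region, values in values_by_region.items():
        averages[region] = sum(values) / len(values)
    return averages


averages = average_by_region(io.StringIO(SAMPLE), "Internet_use")
# Print one line per region, in alphabetical order, to one decimal place.
for region in sorted(averages):
    print(f"{region}: {averages[region]:.1f}")
```

A Pandas version would be shorter, but a plain dictionary keeps the data structure explicit, which is what the question asks the comments to describe.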
Question 14
• 10 points
• “[person] has created the following chart [image shown below]
from the employment_sector_technology.csv dataset which we
described above, showing [some aspect of the situation]. This chart
presents information from 4 of the data attributes, namely: [list of
data attributes shown]
For this chart:
• State what kind of chart this is (eg is it pie chart, scatter plot, etc) [1
point]
• State what type of encoding has been used for each of the 4 data
attributes shown [4 points]
• State 2 issues with this chart that limit its usefulness [2 points]
• Describe an alternative chart and/or encoding scheme you believe
would be a better choice for this data and explain why [3 points]”
– Marking will reflect knowledge of charting principles, their application
in this specific situation, skill in evaluation
– Similar to question in lab13
Question 15
• 10 points
• For a given situation involving the storage and
handling of data in a project, discuss concerns
or consequences, and indicate
recommendations for doing it better.
– Marking will reflect knowledge of data
management principles, linking these to described
situation, and sensible decisions
Question 16
• 10 points
• “You are working on a data analysis project that involves a new
algorithm which uses a predictive model built with machine
learning. [description of the task, and an issue that has been
identified in how the system is operating for different cases].
State 1 possible reason why that issue may be occurring.
State 1 possible solution to mitigate this issue, and explain whether
you believe this solution will completely or partially correct the
problem and why. If there is no solution possible, or if you believe
the issue should be left without being changed, state this, and
explain why this is the case.”
– Marking will reflect understanding of ethical issues, links to this
specific situation, clarity of thought [but we do not require a particular
ethical decision (eg you can get marks whether you propose a change,
or propose to leave things unchanged, as long as you show good
awareness of the issues and justify your approach carefully)]
– Similar to question in lab13
Part C
• [Only for students in data1002]
• 3 questions
Question 17
• 8 points
• [One type of work you did in some stage of
the project] Describe what this activity
involves, explain why it is important in data
science. “In your answer, give an example of
[this activity] which you (or someone in your
group) [did] during the Project work.”
– Marking will reflect knowledge and having
appropriate detailed specific example
Question 18
• 4 points
• Show the output produced from executing
given Python code.
– Similar in style to question 12
Question 19
• 8 points
• Describe [some issue in data management],
and how this can be important/useful.
– Similar in style to Question 15
Exam technique
• Plan how you will allocate time (wisely)
– Use “reading time” to check your understanding
– Also to plan time allocation to questions
• Answer everything (get the “easy marks” in each part)
– Even if you don’t know the full answer, show that you have some
relevant knowledge
– Respond to question details (eg if it asks for “explain” and
“example”, then you should provide both in your answer)
• Write clearly and efficiently
– Start with outline/bullet points, then expand if you have time
– No need for fancy style
– When question describes a target reader, you should
communicate effectively for that target (see lecture 6B)
How much to write?
• A 10 point question would be expected to take
about 10-12 minutes to answer
– Including thinking, typing, checking, revising
• A good answer can often be done in two or three
focused paragraphs
• You need to show the marker that you know and
understand the concepts
– And you ought to answer the specific question that is
asked
• Watch the instructions carefully
Agenda
• Semester review
• Written exam preview
• Administrative issues
• More study of Data Science
Scheduling
• Written exam scheduled by Exams Office
– Time is described in Sydney timezone
– Special arrangements available for those who are
in timezone where the written exam schedule is
late at night or very early morning
• Apply through official process
• You will instead be scheduled in replacement exam
period
Illness
• If you are unwell, and it seems that you won’t be able to
demonstrate your knowledge/skill properly, then you can
request special consideration
• Follow the same procedure as during semester (get medical
person to fill out special USyd form, scan and attach when
you fill in the online form, within 3 days)
– (or, make “student declaration”)
• Usual outcome: an alternate test in “replacement exam
period”
• If you become sick during the exam itself, submit whatever
you have done
– And apply for special consideration
• The University goal is to get a fair assessment of what you
have achieved
Technical and logistical issues
• Final exam is quite time constrained
• Make sure you will have a good place to work
(quiet for those 2+ hours, comfortable place
for typing, reliable internet, no-one else in
room, etc)
• Make sure you know how to use tools
• If anything goes wrong technically, apply for
special consideration on this basis
– With “student declaration” as evidence
During the exam
• Teaching staff are not allowed to communicate with
students during the exam
• If any MCQ question seems wrong or confusing
(typo etc): pick the best answer you can
(afterwards, report it to us in private post on Ed)
• If any essay question seems wrong or confusing:
note this at start of your answer, say how you are
interpreting the question, then answer based on
that
Academic integrity
• You must not get assistance from other people
or use resources other than what is allowed
• You must not reveal the questions (neither
during the exam, nor afterwards)
Agenda
• Semester review
• Written exam preview
• Administrative issues
• More study of Data Science
What’s next
• DATA2001/2901 Data Science: Big Data and Data Diversity [semester 1]
• “This course focuses on methods and techniques to efficiently explore and analyse
large data collections. Where are hot spots of pedestrian accidents across a city?
What are the most popular travel locations according to user postings on a travel
website? The ability to combine and analyse data from various sources and from
databases is essential for informed decision making in both research and industry.
Students will learn how to ingest, combine and summarise data from a variety of
data models which are typically encountered in data science projects, such as
relational, semi-structured, time series, geospatial, image, text. As well as
reinforcing their programming skills through experience with relevant Python
libraries, this course will also introduce students to the concept of declarative data
processing with SQL, and to analyse data in relational databases. Students will be
given data sets from, eg., social media, transport, health and social sciences, and
be taught basic explorative data analysis and mining techniques in the context of
small use cases. The course will further give students an understanding of the
challenges involved with analysing large data volumes, such as the idea to
partition and distribute data and computation among multiple computers for
processing of ‘Big Data’.” [UoS outline in Handbook]
Final General advice
• Be prepared
– Ready for technical issues and environment
• See https://canvas.sydney.edu.au/courses/23380/pages/record+-help-and-faqs
– Ready pedagogically
• Do examprep-MCQ quiz, lab13 and tute13 quiz (all under exam-like
conditions)
• Work through revision exercises
• Make a set of notes
• Watch Ed discussions
– Ready mentally and physically
• Be well-rested, and reasonably fed (but not over-full)
– Have plenty of water (in a clear bottle or glass)
• Relax
Good luck!