COMP20008 Elements of Data Processing

Visualisation – I
School of Computing and Information Systems
© University of Melbourne 2022


The power of ‘preattentive perception’

Data types
small/medium/large; grade or mark H1 H2A H2B; score 50, 83, 96
Nominal/Categorical
Which state or territory you live in: VIC; SA; TAS; NSW; ACT; NT; WA; QLD
What city are you travelling from?
Date of birth
Datestamp for a process
True or false value, eg: Are you a full-time student?
A search query on an e-commerce website: “electric sit-stand desk”
New daily cases of COVID-19 since July 1 2020: 72, 72, 65, 101, 67, 124, 164, 118, 156, 281, ..
Winners age grade 40-45 yo men's medium distance: [ , , ]
Numerical Continuous
Distance travelled: 245.7 km
Numerical Discrete
Number of people in each household; Number of students enrolled in a subject…

Simple descriptive statistics
• Count (Frequency) – For each category
• Min, Max, Range
• Quartiles, and Percentiles.
• Mean, Median, Mode
• Variance, Standard deviation
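
As a minimal sketch, the statistics listed above can be computed with Python's standard library. The numeric values are the first ten daily case counts quoted on the data types slide (that list is truncated there); the `states` list is a made-up categorical sample, and `statistics.quantiles` is called with its default method, which is only one of several quartile conventions.

```python
import statistics
from collections import Counter

# First ten daily case counts from the data types slide (the slide's list is truncated).
cases = [72, 72, 65, 101, 67, 124, 164, 118, 156, 281]

# Hypothetical categorical sample, used only to illustrate counting per category.
states = ["VIC", "NSW", "VIC", "QLD", "VIC", "NSW"]

counts = Counter(states)                 # count (frequency) for each category
minimum, maximum = min(cases), max(cases)
value_range = maximum - minimum

# Quartiles: three cut points that split the sorted data into four parts.
q1, q2, q3 = statistics.quantiles(cases, n=4)
# Percentiles: 99 cut points; index 89 is the 90th percentile.
percentiles = statistics.quantiles(cases, n=100)

mean = statistics.mean(cases)
median = statistics.median(cases)
mode = statistics.mode(cases)
variance = statistics.variance(cases)    # sample variance (divides by n - 1)
std_dev = statistics.stdev(cases)        # sample standard deviation

print(counts.most_common())
print(minimum, maximum, value_range)
print(q1, q2, q3, percentiles[89])
print(mean, median, mode)
print(round(variance, 2), round(std_dev, 2))
```

The population versions (`statistics.pvariance`, `statistics.pstdev`) divide by n instead of n - 1; which one is appropriate depends on whether the data is treated as a sample or as the whole population.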

Basic visualisation
• Line plots
• Boxplots
• Histograms
• Bar charts
• Scatter plots
• Heatmap
• Parallel Coordinate plots

https://ourworldindata.org/coronavirus/country/australia?country=~AUS

A boxplot shows the distribution of a set of data points (e.g. distance) based on a 5-number summary:
• Minimum: the smallest value in the data
• First quartile (Q1): middle data point below the median
• Median: the midway data point
• Third quartile (Q3): middle data point above the median
• Maximum: the largest value in the data
• IQR = interquartile range = Q3 - Q1
Each of the four sections between these five values holds 25% of the data.

Boxplots – Patterns
Symmetric or skewed? Tightly or loosely grouped?
[Figure: boxplots with Q1, median and Q3 marked, showing left-skewed and right-skewed examples]

Outliers and boxplots
Median: the middle data point (once sorted)
Q1: middle data point below the median
Q3: middle data point above the median
IQR = interquartile range = Q3 – Q1
Whiskers (inner fence)
• Upper-limit: Q3 + 1.5 × IQR
• Upper inner fence: Highest data point ≤ Upper-limit
• Lower-limit: Q1 – 1.5 × IQR
• Lower inner fence: Lowest data point ≥ Lower-limit
Suspected outliers (circle)
• > 1.5 × IQR below Q1 or above Q3
Outliers (black dot)
• > 3 × IQR below Q1 or above Q3
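
The whisker and outlier rules can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than a reference implementation: the distances are made up, the function name `boxplot_summary` is hypothetical, Q1 and Q3 are taken as the medians of the lower and upper halves (matching the definitions on this slide), and "suspected" is read as more than 1.5 × IQR but at most 3 × IQR outside the box.

```python
from statistics import median

def boxplot_summary(values):
    """Quartiles, limits, whiskers and outliers (needs at least two values)."""
    data = sorted(values)
    half = len(data) // 2
    med = median(data)
    # Q1 / Q3: medians of the points below / above the median.
    # For an odd-length list the median itself is left out of both halves.
    q1 = median(data[:half])
    q3 = median(data[len(data) - half:])
    iqr = q3 - q1

    lower_limit = q1 - 1.5 * iqr
    upper_limit = q3 + 1.5 * iqr
    # Whiskers end at the most extreme data points that lie inside the limits.
    lower_whisker = min(x for x in data if x >= lower_limit)
    upper_whisker = max(x for x in data if x <= upper_limit)

    def distance_outside_box(x):
        return max(q1 - x, x - q3, 0)

    suspected = [x for x in data if 1.5 * iqr < distance_outside_box(x) <= 3 * iqr]
    outliers = [x for x in data if distance_outside_box(x) > 3 * iqr]
    return {
        "q1": q1, "median": med, "q3": q3, "iqr": iqr,
        "lower_limit": lower_limit, "upper_limit": upper_limit,
        "whiskers": (lower_whisker, upper_whisker),
        "suspected_outliers": suspected, "outliers": outliers,
    }

# Hypothetical distances in km.
distances = [12, 15, 16, 18, 19, 20, 22, 23, 25, 40, 70]
summary = boxplot_summary(distances)
for name, value in summary.items():
    print(f"{name}: {value}")
```

Plotting libraries often compute quartiles by interpolation, so their boxes and whiskers can differ slightly from this median-of-halves convention on small datasets.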

Visualisation – II

Basic Visualisation
✓ Line plots
✓ Boxplots
• Histograms
• Bar charts
• Scatter plots
• Heatmap
• Parallel Coordinate plots

Histograms
www.education.vic.gov.au
• Frequency distribution of a set of continuous data points.
• Inspect the underlying distribution (shape): is it normal? Is it skewed? Are there outliers?

Histograms with equal width bins
• Commonly used histograms
• x-axis: Divide the range of values into consecutive, non-overlapping, equal-width intervals.
• y-axis: height proportional to the frequency of the bin
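
A minimal sketch of equal-width binning follows. The heights are hypothetical, the function name `equal_width_histogram` is illustrative, and the convention that each bin is closed on the left with the last bin also including the maximum is an assumption (libraries differ on this detail).

```python
def equal_width_histogram(values, n_bins):
    """Return (edges, counts) for n_bins consecutive, equal-width bins.

    Assumes the values are not all identical, so the bin width is non-zero.
    """
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    edges = [lo + i * width for i in range(n_bins + 1)]
    counts = [0] * n_bins
    for v in values:
        # Bins are [left, right); the last bin also includes the maximum value.
        index = min(int((v - lo) / width), n_bins - 1)
        counts[index] += 1
    return edges, counts

# Hypothetical heights in cm.
heights_cm = [150, 152, 155, 158, 160, 161, 163, 165, 168, 170, 172, 175, 180, 190]
edges, counts = equal_width_histogram(heights_cm, n_bins=4)

# Draw each bin as a text bar whose length is the bin's frequency.
for left, right, count in zip(edges, edges[1:], counts):
    print(f"{left:5.0f}-{right:<5.0f} {'#' * count} ({count})")
```

Calling the function again with a different `n_bins` changes the counts, and with them the apparent shape of the distribution.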

Histogram with variable width bins
• Not very common
• x-axis: Divide the range of values into consecutive, non-overlapping, variable-width intervals.
• y-axis: height proportional to frequency density, i.e. the number of cases per unit of the variable. The rectangle has its area proportional to the frequency.
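
The relationship between frequency, bin width and frequency density can be checked with a short sketch. The travel times and bin edges below are hypothetical, and the function name `frequency_density` is illustrative only.

```python
def frequency_density(values, edges):
    """Counts, widths and frequency densities for bins with increasing edges."""
    n_bins = len(edges) - 1
    counts = [0] * n_bins
    for v in values:
        for i in range(n_bins):
            is_last = i == n_bins - 1
            # Bins are [left, right); the last bin also includes its right edge.
            if edges[i] <= v < edges[i + 1] or (is_last and v == edges[-1]):
                counts[i] += 1
                break
    widths = [right - left for left, right in zip(edges, edges[1:])]
    densities = [count / width for count, width in zip(counts, widths)]
    return counts, widths, densities

# Hypothetical travel times in minutes, with narrow bins where the data is dense.
times = [1, 2, 3, 4, 6, 7, 8, 9, 9, 12, 15, 18, 25, 40, 60]
bin_edges = [0, 5, 10, 20, 60]
counts, widths, densities = frequency_density(times, bin_edges)

for count, width, density in zip(counts, widths, densities):
    # Bar height is the density; bar area (height x width) recovers the frequency.
    print(f"width={width:>2} count={count} height={density:.3f} area={density * width:.1f}")
```

The last bin holds as many points as the third but is four times as wide, so its bar is drawn a quarter as tall.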

Histogram with variable width bins
By Qwfp at English Wikipedia, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=20290683

Histograms – patterns
• Symmetric? Left/right skewed, unimodal, bimodal, multimodal?

Histograms – cont.
• Histograms of the same dataset may look different with different bin sizes
• Problem: it is hard to choose an appropriate bin size for a histogram
• Too small → normal objects fall in empty/rare bins (false positives)
• Too big → outliers fall in frequent bins (false negatives)

Iris dataset
• Well-known dataset of 150 objects, introduced by a statistician (https://en.wikipedia.org/wiki/Iris_flower_data_set)
• Four features:
• Petal width
• Petal length
• Sepal width
• Sepal length
• Three flower species (classes):
• Setosa
• Virginica
• . . Mohlenbrock. USDA NRCS. 1995. Northeast wetland flora: Field office guide to plant species. Northeast National Technical Center, Chester, PA. Courtesy of USDA NRCS Wetland Science Institute.

Histogram – petal width of Iris flowers
Histograms of the same dataset may look different with different bin sizes

Outliers and histograms
Paternity case:
“The study of outliers”, V. Barnett, Journal of the Royal Statistical Society, 27(3), 1978

Bar charts
• Summarise data points over a categorical variable.
X-axis: categorical variable
Y-axis: numeric value

Bar charts vs histograms
• Histograms:
X-axis is intervals of a numeric variable
Y-axis is the frequency or frequency density
Only sensible to be ordered in one way
• Bar charts:
X-axis is a categorical variable
Y-axis is a numeric quantity
Can be in any order
They look similar but they have different semantics.
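
The "can be in any order" point for bar charts can be illustrated with a small sketch. The survey answers are hypothetical and `text_bar_chart` is an illustrative helper that draws bars as text; it is not part of any library.

```python
from collections import Counter

# Hypothetical survey answers: one categorical value per respondent.
transport = ["train", "car", "tram", "train", "bike", "train", "car", "tram", "train"]

# The numeric quantity on the y-axis is here the count for each category.
counts = Counter(transport)

def text_bar_chart(pairs):
    """Render (label, value) pairs as one text bar per category."""
    return [f"{label:<6}{'#' * value} {value}" for label, value in pairs]

# Categories have no inherent order, so either ordering is a valid bar chart.
alphabetical = sorted(counts.items())
by_frequency = counts.most_common()

print("\n".join(text_bar_chart(alphabetical)))
print()
print("\n".join(text_bar_chart(by_frequency)))
```

Reordering histogram bins in the same way would break the numeric x-axis, which is why a histogram has only one sensible order.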

Visualisation – III

Basic Visualisation
✓ Line plots
✓ Boxplots
✓ Histograms
✓ Bar charts
• Scatter plots
• Heatmap
• Parallel Coordinate plots

Scatter plots
Two numeric variables
https://www.data-to-viz.com/graph/scatter.html

Scatter plots
• X-axis: one numeric variable
• Y-axis: the other numeric variable
• Each dot is a data point whose two values are its x and y coordinates.
https://datavizcatalogue.com/methods/scatterplot.html

Scatter plots – patterns
Relationship between two variables
https://datavizcatalogue.com/methods/scatterplot.html

Outliers and scatter plots

Outliers detection with PLS regression for NIR spectroscopy in Python



More than 2 features with scatter plots
1. Bubble plots
• A special scatter plot representing 3-dimensional data
• Size of circle around a point indicates the value of the 3rd dimension.

2. Enhanced scatter plots
• Use colours for the values of the 3rd dimension.

3. Scatterplot matrix
• A matrix of scatter plots of all pairs of dimensions (variables)
• Inspect many relationships simultaneously.
• Convenient for spotting correlation between variables
• Spotting outliers
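
A scatterplot matrix has one panel for every pair of variables. The sketch below enumerates those pairs and, as a numeric stand-in for what one would read off each panel, computes Pearson's correlation coefficient. The measurements are hypothetical, and using Pearson's r here is an assumption of this example rather than something the slides prescribe.

```python
from itertools import combinations
from math import sqrt

# Hypothetical measurements: one list per variable, one position per data object.
data = {
    "height_cm": [150, 160, 165, 172, 180, 185],
    "weight_kg": [52, 58, 63, 70, 80, 84],
    "commute_min": [40, 15, 55, 20, 35, 25],
}

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)

# One entry per off-diagonal panel: d variables give d * (d - 1) / 2 distinct pairs.
panels = {(a, b): pearson(data[a], data[b]) for a, b in combinations(data, 2)}

for (a, b), r in panels.items():
    print(f"{a} vs {b}: r = {r:+.2f}")
```

A coefficient near +1 or -1 corresponds to a panel whose dots fall close to a straight line; the panel itself still needs inspecting, because non-linear patterns and outliers are not captured by a single number.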

‘Overplotting’ in scatter plots
When there are many data points, dots tend to overlap
– Reduce dot size
– Sampling
– Jitter (for moderate overplotting)
– Use other plots
See https://python-graph-gallery.com/134-how-to-avoid-overplotting-with-python/
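
Two of the remedies above, jitter and sampling, can be sketched on the data itself before plotting. The points are randomly generated placeholders, the jitter amount of 0.2 and sample size of 100 are arbitrary choices, and the helper names `jitter` and `subsample` are illustrative.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Hypothetical overplotted data: 1000 points on a 5 x 5 grid of integer coordinates.
points = [(rng.randint(1, 5), rng.randint(1, 5)) for _ in range(1000)]

def jitter(pts, amount, rng):
    """Shift each point by uniform noise in [-amount, +amount] on both axes."""
    return [
        (x + rng.uniform(-amount, amount), y + rng.uniform(-amount, amount))
        for x, y in pts
    ]

def subsample(pts, k, rng):
    """Keep a simple random sample of k points (without replacement)."""
    return rng.sample(pts, k)

jittered = jitter(points, 0.2, rng)
sampled = subsample(points, 100, rng)

print(len(set(points)), "distinct positions before jitter")
print(len(set(jittered)), "distinct positions after jitter")
print(len(sampled), "points after sampling")
```

Jitter keeps every point but moves it slightly, so the amount should stay small relative to the spacing of the true values; sampling keeps positions exact but discards points.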

Visualisation – IV

Basic Visualisation
✓ Line plots
✓ Boxplots
✓ Histograms
✓ Bar charts
✓ Scatter plots
• Heat maps
• Parallel Coordinates plots

Which Iris is which?
Which one is Setosa and which one is Virginica?

Heat maps
• Plot the data matrix
• Individual values contained in a matrix are represented as colours
• This can be useful when objects are sorted according to class/type
• Typically, features are normalised or standardised to prevent one attribute from dominating the plot

Drawing a heat map
-1.5 0.9 2.3 0.5
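
As a minimal sketch of how matrix values become colours, the four values above can be mapped linearly onto a grey scale, with the smallest value drawn black and the largest white. The grey scale and the function name `to_hex_colours` are assumptions of this example; real heat maps usually use a richer colour map.

```python
def to_hex_colours(matrix):
    """Map every value to a grey hex colour using the matrix-wide min and max.

    Assumes the matrix contains at least two different values.
    """
    flat = [value for row in matrix for value in row]
    lo, hi = min(flat), max(flat)
    colours = []
    for row in matrix:
        # Rescale each value to 0..255, then use it for red, green and blue alike.
        levels = [round(255 * (value - lo) / (hi - lo)) for value in row]
        colours.append([f"#{g:02x}{g:02x}{g:02x}" for g in levels])
    return colours

# One row of a data matrix: the four values shown on this slide.
matrix = [[-1.5, 0.9, 2.3, 0.5]]
colours = to_hex_colours(matrix)
print(colours)
```

Because the scale is shared across the whole matrix, a feature with a much larger range would take up most of the colour scale, which is why features are typically normalised or standardised first.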

Heat map – (standardised) Iris data
[Columns have been standardised to have a mean of zero and standard deviation of 1]
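
Column standardisation of this kind subtracts each column's mean and divides by its standard deviation. The sketch below uses hypothetical measurements (not the real Iris data) and the population standard deviation; whether the slide's plot used the population or the sample version is not stated, so that choice is an assumption.

```python
from statistics import mean, pstdev

def standardise_columns(rows):
    """Return rows whose columns each have mean 0 and standard deviation 1.

    Assumes no column is constant, so every standard deviation is non-zero.
    """
    columns = list(zip(*rows))
    means = [mean(column) for column in columns]
    stds = [pstdev(column) for column in columns]  # population standard deviation
    return [
        [(value - m) / s for value, m, s in zip(row, means, stds)]
        for row in rows
    ]

# Hypothetical flower measurements in cm: rows are objects, columns are features.
rows = [
    [5.0, 3.4, 1.5, 0.2],
    [6.1, 2.9, 4.6, 1.4],
    [6.8, 3.1, 5.7, 2.2],
    [5.5, 2.5, 4.0, 1.3],
]
standardised = standardise_columns(rows)

for row in standardised:
    print([round(value, 2) for value in row])
```

After this step every column is on the same scale, so no single feature dominates the colour range of the heat map.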

How common is your birthday?
https://www.abc.net.au/news/2017-12-13/australias-most-and-least-popular-birthdays-revealed/9241978

Parallel Coordinates
• A widely used visualisation technique for exploring multi-dimensional data sets
• Use a set of parallel axes (coordinate axes)
• The values of each data object are plotted as a point on each corresponding coordinate axis and the points are connected by a line.
• Thus, each data object is represented as a line
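
The construction described above can be sketched as a data transformation: each data object becomes a list of (axis, height) points that a plotting routine would join with line segments. The measurements are hypothetical, `to_polylines` is an illustrative name, and rescaling every axis to the range 0 to 1 is one possible choice, made here only so that the axes are comparable.

```python
def to_polylines(rows):
    """Turn each data object into a list of (axis_index, height) points.

    Each axis is rescaled to [0, 1] from its own minimum and maximum, which
    assumes that no feature is constant.
    """
    columns = list(zip(*rows))
    lows = [min(column) for column in columns]
    highs = [max(column) for column in columns]
    polylines = []
    for row in rows:
        line = [
            (axis, (value - lo) / (hi - lo))
            for axis, (value, lo, hi) in enumerate(zip(row, lows, highs))
        ]
        polylines.append(line)
    return polylines

# Hypothetical objects with four features each.
rows = [
    [5.0, 3.4, 1.5, 0.2],
    [6.1, 2.9, 4.6, 1.4],
    [6.8, 3.1, 5.7, 2.2],
]
polylines = to_polylines(rows)

for line in polylines:
    print([(axis, round(height, 2)) for axis, height in line])
```

Each printed list is one line in the plot: the first number of a pair selects the vertical axis and the second gives the height at which the line crosses it.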

Parallel coordinates – Iris data
Note: standardised measurements by subtracting mean and dividing by the standard deviation
https://www.data-to-viz.com/graph/parallel.html

Patterns in parallel coordinates
• Reveal distinct classes (groups) of objects
• Show data characteristics such as
• different data distributions
• associations of feature pairs

Patterns – cont.
• Highlight specific patterns of association on different features
http://joules.de/files/heinrich_parallel_2015.pdf


Axes scaling with parallel coordinates
Scaling of Axes
• Inconsistent scaling can lead to misinterpretation
https://aedeegee.github.io/cgf12.pdf

Axes scaling – cont.
• Axes scaling affects the visualisation
• May choose to scale all features via a pre-processing step
https://www.data-to-viz.com/graph/parallel.html
https://aedeegee.github.io/cgf12.pdf

Axes ordering in parallel coordinates
Ordering of axes
• Influences the relationships that can be seen. Correlations between pairs of features may only be visible in certain orderings
• Can decrease the clutter
• Can reveal distinct classes more clearly
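
One way to see why ordering matters: line segments only join neighbouring axes, so a given order directly shows just the feature pairs that sit next to each other. The sketch below lists those pairs for one order of the four Iris features and counts the distinct orders; the helper name `adjacent_pairs` is illustrative.

```python
from itertools import permutations

features = ["sepal length", "sepal width", "petal length", "petal width"]

def adjacent_pairs(order):
    """Feature pairs drawn on neighbouring axes for a given axis order."""
    return list(zip(order, order[1:]))

# An order and its mirror image show the same pairs, so keep one of each.
distinct_orders = [p for p in permutations(features) if p[0] < p[-1]]

print(adjacent_pairs(features))
print(len(distinct_orders), "distinct axis orders for", len(features), "features")
```

With four features there are six possible pairs but any single order shows only three of them, so more than one order is needed to inspect every pairwise relationship.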

Parallel coordinates – ordering of axes
https://www.data-to-viz.com/graph/parallel.html

Very high dimensional data?
• Parallel coordinates lead to clutter and overplotting with very large datasets and very many dimensions
• Not enough space to draw all lines
• Difficult to trace a line for a data object
• Only look at an important subset of attributes
• Domain experts
• Feature selection techniques
• Dimensionality reduction techniques: covered later in the subject

Elements of a good visualisation
• Meaningful title
• Appropriate scales, annotation
• Suitability to the dataset and the context of the data question
• Can be interpreted on its own.
• Caption can be used to explain the context, the dataset, and a brief interpretation of the plot, where appropriate.
• Contains no information that is redundant or unimportant to the plot.

• Visualisation tools allow a quick summary of the data
• Easy to glean the important features
• Can be a visual tool to help analysis
• Assist in getting to know your data
• Excellent communication tool
• Given your data:
• What are the best ways to visualise the information?
• Different aspects of the data may lend themselves better to different visualisations