COMP 3115 Exploratory Data Analysis and Visualization
Lecture 5: Basic Data Visualization Dr.
§ More about Matplotlib
§ Visualize Data Distribution
– Histogram
– Quantile Plots
§ Scatterplot
§ Line Chart
Visualization using Matplotlib
§ A popular plotting library in Python.
§ Provides a set of drawing APIs through the pyplot module, and hides the complex structure composed of many drawing objects behind this set of APIs.
§ To install (https://matplotlib.org/stable/users/installing/index.html):
§ Use Anaconda’s GUI
§ With Conda: conda install matplotlib
§ With PyPI (Python Package Index): pip install -U matplotlib
§ Reference:
§ Examples: https://matplotlib.org/gallery/index.html
§ User guide: https://matplotlib.org/tutorials/index.html
np.linspace(start, stop, num) returns a one-dimensional ndarray with num evenly spaced samples over the interval ‘[start, stop]’.
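A minimal sketch of how np.linspace is typically combined with a line plot. The sine function and the choice of 100 samples are illustrative assumptions, not taken from the slides.

```python
import numpy as np
import matplotlib.pyplot as plt

# 100 evenly spaced samples over [0, 10]; both endpoints are included
x = np.linspace(0, 10, 100)
y = np.sin(x)  # illustrative function (assumption)

fig, ax = plt.subplots()
ax.plot(x, y)  # draws the points joined by straight line segments
```

In a script, call plt.show() afterwards to display the figure; in a notebook the figure is usually rendered automatically.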
Styling Line Plot: Line Colors
Styling Line Plot: Line Style
Styling Line Plot: Axes limits
Styling Line Plot: Label
§ A figure can have many kinds of labels
– figure title: plt.title()
– axis labels: plt.xlabel(), plt.ylabel()
– legends: plt.legend()
Styling Line Plot: Simple Legend
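The styling options listed on the preceding slides (line color, line style, axes limits, labels, legend) can be sketched together as follows. The plotted functions, colors, and limits are illustrative assumptions.

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 10, 200)

plt.figure()
# color and linestyle control the appearance; label feeds the legend
plt.plot(x, np.sin(x), color="blue", linestyle="-", label="sin(x)")
plt.plot(x, np.cos(x), color="red", linestyle="--", label="cos(x)")

# axes limits
plt.xlim(0, 10)
plt.ylim(-1.5, 1.5)

# labels: figure title, axis labels, legend
plt.title("Sine and cosine")
plt.xlabel("x")
plt.ylabel("y")
plt.legend(loc="upper right")
```

The legend entries come from the label keyword passed to each plt.plot() call, so lines without a label do not appear in the legend.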
More about scatter plots
More Symbols for Scatter Plots
Plot Bar Chart with plt.bar()
Plot Error Bars by plt.errorbar()
Plot Pie Chart with plt.pie()
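A hedged sketch of the four chart types named above, drawn on one figure. All data values, category names, and the marker choice are made up for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)  # fixed seed so the synthetic data is reproducible
fig, axes = plt.subplots(2, 2, figsize=(8, 6))

# Scatter plot: the marker keyword selects the symbol ("^" is a triangle)
x = rng.normal(size=30)
axes[0, 0].scatter(x, x + rng.normal(scale=0.5, size=30), marker="^")
axes[0, 0].set_title("scatter")

# Bar chart: one bar per category
categories = ["A", "B", "C"]
axes[0, 1].bar(categories, [5, 3, 7])
axes[0, 1].set_title("bar")

# Error bars: yerr gives the half-length of each vertical error bar
t = np.arange(5)
axes[1, 0].errorbar(t, t ** 2, yerr=1.5, fmt="o-")
axes[1, 0].set_title("errorbar")

# Pie chart: wedge sizes are proportional to the values
axes[1, 1].pie([40, 35, 25], labels=categories)
axes[1, 1].set_title("pie")

fig.tight_layout()
```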
Visualize Data Distribution
§ The ability to visualize the distribution shape in exploratory data analysis is important.
§ First, we can use it to summarize a data set to better understand general characteristics such as shape, spread, or location. In turn, this information can be used to suggest transformations or probabilistic models for the data.
§ Second, we can use these methods to check model assumptions, such as symmetry, normality, etc.
Histograms
§ A histogram is a way to graphically summarize or describe a data set by visually conveying its distribution using vertical bars.
Univariate Histograms
§ A frequency histogram is obtained by first creating a set of bins or intervals that cover the range of the data set.
– It is important that these bins do not overlap and that they have equal width.
§ We then count the number of observations that fall into each bin.
§ To visualize this information, we place a bar at each bin, where the height of the bar corresponds to the frequency.
Example of Frequency Histograms
§ Suppose we have the following temperature values and we want to visualize their distribution using a frequency histogram with 7 bins:
[64, 65, 68, 69, 70, 71, 72, 72, 75, 75, 80, 81, 83, 85]
§ Partition the data into bins
– Compute the bin width w = (85-64)/7 = 3
– The bins are [64, 67) [67, 70) [70, 73) [73, 76) [76, 79) [79, 82) [82, 85]
§ Visualize the count of each bin with a bar
– The height of the bar corresponds to the frequency
Histogram in Python
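A minimal sketch using the temperature example: np.histogram computes the bin edges and counts, and the hist method draws the bars. The axis labels and edge color are illustrative choices.

```python
import numpy as np
import matplotlib.pyplot as plt

temps = [64, 65, 68, 69, 70, 71, 72, 72, 75, 75, 80, 81, 83, 85]

# With an integer `bins`, np.histogram uses equal-width bins over [min, max];
# every bin is half-open except the last one, which includes its right edge.
counts, edges = np.histogram(temps, bins=7)
print(edges)   # bin edges
print(counts)  # number of observations per bin

fig, ax = plt.subplots()
ax.hist(temps, bins=7, edgecolor="black")
ax.set_xlabel("Temperature")
ax.set_ylabel("Frequency")
```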
Relative Frequency
§ A relative frequency shows the proportion of cases that fall in each category. All the numbers in a relative frequency table sum to 1.
[Tables: a frequency table and the corresponding relative frequency table]
Relative Frequency Histograms
§ Relative frequency histograms are obtained by mapping the height of each bar to the relative frequency of observations that fall into the bin.
– Frequency: the number of observations that fall into each bin
– Relative frequency: the fraction of observations that fall into each bin, i.e., the frequency divided by the total number of observations
§ The shape of the two histograms is the same, but the vertical axes represent different quantities.
Example of Relative Frequency Histograms
Frequency Histograms vs. Relative Frequency Histograms
§ [64, 65, 68, 69, 70, 71, 72, 72, 75, 75, 80, 81, 83, 85]
– Frequency histogram: bar height denotes the frequency, i.e., the number of observations that fall into each bin.
– Relative frequency histogram: bar height denotes the relative frequency, i.e., the fraction of observations that fall into each bin.
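One way to sketch the comparison in Python is to pass a weight of 1/n per observation, so that the bar heights become fractions instead of counts. Using the weights keyword for this is one possible approach, not necessarily the one used in the lecture.

```python
import numpy as np
import matplotlib.pyplot as plt

temps = np.array([64, 65, 68, 69, 70, 71, 72, 72, 75, 75, 80, 81, 83, 85])
n = len(temps)

counts, edges = np.histogram(temps, bins=7)
rel_freq = counts / n  # frequency divided by the total number of observations

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(9, 3))
ax1.hist(temps, bins=7, edgecolor="black")
ax1.set_ylabel("Frequency")

# A weight of 1/n per observation turns counts into relative frequencies
ax2.hist(temps, bins=7, weights=np.full(n, 1 / n), edgecolor="black")
ax2.set_ylabel("Relative frequency")
fig.tight_layout()
```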
Density Histogram
§ Frequency and relative frequency histograms do not represent meaningful probability densities, since the total area of the bars does not equal one.
§ A density histogram is a histogram that has been normalized so that the total area of the bars is one.
[Figure panels: Frequency Histogram, Relative Frequency Histogram, Density Histogram]
Density Histogram
§ A density histogram is given by the following equation
$$f(x) = \frac{v_k}{n h}, \quad x \in B_k$$
– $B_k$ is the k-th bin
– $v_k$ is the number of data points that fall into the k-th bin
– $h$ is the width of the bins and $n$ is the total number of data points
§ $f(x)$ is a probability density since
– $f(x)$ is nonnegative
– $\int f(x)\,dx = 1$
Density Histograms
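A short sketch, again on the temperature example, that computes the density heights v_k / (n h) by hand and compares them with the density=True option of np.histogram. The variable names are illustrative.

```python
import numpy as np
import matplotlib.pyplot as plt

temps = np.array([64, 65, 68, 69, 70, 71, 72, 72, 75, 75, 80, 81, 83, 85])
n = len(temps)

counts, edges = np.histogram(temps, bins=7)
h = edges[1] - edges[0]          # common bin width
density = counts / (n * h)       # f(x) = v_k / (n h) for x in bin k

# density=True requests the same normalization from NumPy
np_density, _ = np.histogram(temps, bins=7, density=True)

# Total area of the bars: sum of height * width
total_area = np.sum(density * h)

fig, ax = plt.subplots()
ax.hist(temps, bins=7, density=True, edgecolor="black")
ax.set_ylabel("Density")
```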
The effect of bin width h in histograms
§ The bin width h determines the smoothness of the histogram.
§ Small values of h produce histograms with a lot of variation in the heights of the bars, while large bin widths yield smoother histograms.
§ Rules of thumb
– Number of bins = sqrt(n)
– Number of bins = log2(n) + 1
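The two rules of thumb can be evaluated for the temperature example as sketched below. Rounding up to the next integer is an assumption, since the rules as stated generally give non-integer values.

```python
import math
import numpy as np

temps = [64, 65, 68, 69, 70, 71, 72, 72, 75, 75, 80, 81, 83, 85]
n = len(temps)

# Rounding up is an assumption; the slide states the rules without rounding
bins_sqrt = math.ceil(math.sqrt(n))
bins_log2 = math.ceil(math.log2(n) + 1)

counts_sqrt, edges_sqrt = np.histogram(temps, bins=bins_sqrt)
counts_log2, edges_log2 = np.histogram(temps, bins=bins_log2)

print(bins_sqrt, counts_sqrt)
print(bins_log2, counts_log2)
```

NumPy also accepts the strings "sqrt" and "sturges" for the bins argument, which select closely related rules.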
The effect of bin width h in histograms
§ Notice that for the larger bin widths, we have only one peak.
§ As the smoothing parameter h gets smaller, the histogram displays more variation and spurious peaks appear in the histogram estimate.
Bivariate Histograms
§ We can easily extend the univariate density histogram to the bivariate case. The bivariate histogram is defined as
$$f(\mathbf{x}) = \frac{v_k}{n h_1 h_2}, \quad \mathbf{x} \in B_k$$
– $B_k$ is the k-th bin
– $v_k$ is the number of data points that fall into the k-th bin
– $h_1$ and $h_2$ are the side lengths of the rectangular base of each bin (cuboid)
– $n$ is the total number of data points
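A hedged sketch of a bivariate density histogram using np.histogram2d on synthetic data. The sample size, the 10 x 10 grid, and the heat-map rendering are assumptions; the slides may have used a 3-D bar rendering instead.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)  # synthetic data, fixed seed
n = 500
x = rng.normal(size=n)
y = rng.normal(size=n)

counts, xedges, yedges = np.histogram2d(x, y, bins=(10, 10))
h1 = xedges[1] - xedges[0]       # bin side length along x
h2 = yedges[1] - yedges[0]       # bin side length along y
density = counts / (n * h1 * h2) # f(x) = v_k / (n h1 h2)

# density=True requests the same normalization from NumPy
H, _, _ = np.histogram2d(x, y, bins=(10, 10), density=True)
total_volume = density.sum() * h1 * h2

fig, ax = plt.subplots()
# First index of `density` runs along x, so transpose for plotting
mesh = ax.pcolormesh(xedges, yedges, density.T)
fig.colorbar(mesh, ax=ax, label="Density")
```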
Summary of Histograms
§ Visually conveying data distribution using vertical bars
§ Frequency Histograms
– Bar height denotes frequency
§ Relative Frequency Histograms
– Bar height denotes relative frequency
§ Density Histogram
– Bar height denotes probability density
§ The effect of bin width h in histograms
– h can be seen as a smoothing parameter
Boxplots
§ Boxplots (sometimes called box-and-whisker diagrams) are an excellent way
– To visualize summary statistics such as the median
– To study the distribution of the data
– To supplement multivariate displays with univariate information.
Procedure for Boxplots: Recap of Quartiles and IQR
§ Order the data from smallest to largest
§ Find Q2, Q1 and Q3
– Q2: Median
– Q1: Median of the lower half
– Q3: Median of the upper half
§ Compute the interquartile range (IQR)
– IQR = Q3 – Q1
§ Compute the lower limit (LL) and the upper limit (UL) (i.e., the 1.5*IQR rule)
– LL = Q1 – 1.5*IQR
– UL = Q3 + 1.5*IQR
§ Observations below LL or above UL are potential outliers
§ Boxplots for different univariate samples can be plotted together for visually comparing the corresponding distributions.
§ They can be plotted either horizontally or vertically.
Boxplot for One Variable
Boxplot for Multiple Variables
Boxplot for Multiple Variables for Different Groups
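A minimal sketch of side-by-side boxplots for several samples. The three synthetic samples and their names are assumptions made for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)  # synthetic data, fixed seed
samples = [rng.normal(loc=m, scale=1.0, size=100) for m in (0, 1, 2)]

fig, ax = plt.subplots()
# One box per sample; by default the whiskers follow the 1.5*IQR rule
# and points beyond them are drawn individually as potential outliers.
result = ax.boxplot(samples)
ax.set_xticklabels(["A", "B", "C"])
ax.set_ylabel("Value")
```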
Summary of Boxplot
§ Five number summary of a distribution
– “Minimum”: lower limit
– “Maximum”: upper limit
Quantile Plots
§ As an alternative to the boxplots, we can use quantile-based plots to visually compare the distributions of two sample sets.
§ Quantile plots are also appropriate when we want to compare a known theoretical distribution and a sample set.
§ In making the comparisons, we might want to know how the distributions are shifted relative to each other, or to check model assumptions such as normality.
Quantile Plots
§ Probability Plots
– Used to compare sample quantiles with the quantiles from a known theoretical distribution, such as normal, exponential, etc.
§ Quantile-quantile plots (i.e., q-q plots)
– A q-q plot is used to determine whether two random samples were generated by the same distribution.
Probability Plots
§ A Probability plot is one where the theoretical quantiles are plotted against the ordered data (i.e., the sample quantiles).
§ The main purpose of probability plots is to visually determine whether or not the data could have been generated from the given theoretical distribution.
§ If the sample distribution is similar to the theoretical one, then we would expect the relationship to follow an approximate straight line.
§ Departures from a linear relationship are an indication that the distributions are different.
Technical Details of Probability Plots
§ Let us assume that we have a set of numbers x1, x2, …, xn and we wish to visually study whether the normality assumption is reasonable.
§ 1. Sort the data from smallest to largest to obtain the order statistics x(1), x(2), …, x(n).
§ 2. Define n empirical sample fractions, p1, p2, …, pn, where pi = (i-0.5)/n. This definition is somewhat arbitrary; we may also use pi = i/(n+1).
§ 3. Find a set of numbers, z1, z2, …, zn, that would be expected from data that exactly follow the normal distribution, i.e., zi is the pi-th quantile of the standard normal distribution.
§ 4. Construct a scatter plot with the pairs x(1) and z1, x(2) and z2, and so on.
Example of Probability Plots
§ Construct a normal probability plot for the data set given below and determine if the data follow an approximately normal distribution.
[3.7, 2.7, 3.3, 1.3, 2.2, 3.1]
§ 1. Sort the data from smallest to largest
[1.3, 2.2, 2.7, 3.1, 3.3, 3.7]
§ 2. Get the sample fractions and the corresponding z values from the normal distribution.
[Table: observation i and sample fraction (i-0.5)/n]
Example of Probability Plots
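The example can be sketched in Python with the standard library's NormalDist for the normal quantiles. The axis assignment (theoretical quantiles on the horizontal axis) is an assumption; either orientation is used in practice.

```python
from statistics import NormalDist
import matplotlib.pyplot as plt

data = [3.7, 2.7, 3.3, 1.3, 2.2, 3.1]

# Step 1: sort the data
x_sorted = sorted(data)
n = len(x_sorted)

# Step 2: sample fractions p_i = (i - 0.5) / n
p = [(i - 0.5) / n for i in range(1, n + 1)]

# Step 3: z_i is the p_i-th quantile of the standard normal distribution
z = [NormalDist().inv_cdf(pi) for pi in p]

# Step 4: scatter plot of the pairs (z_i, x_(i))
fig, ax = plt.subplots()
ax.scatter(z, x_sorted)
ax.set_xlabel("Theoretical normal quantiles z")
ax.set_ylabel("Ordered data x")

for i, (pi, zi, xi) in enumerate(zip(p, z, x_sorted), start=1):
    print(f"{i}  x={xi:.1f}  p={pi:.4f}  z={zi:.3f}")
```

If the points fall close to a straight line, the normality assumption is plausible for this sample.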
Probability Plots
– Points close to a straight line: the data were generated by a normal distribution.
– Points departing from a straight line: the data were NOT generated by a normal distribution.
Quantile-Quantile Plot
§ The q-q plot was originally proposed to visually compare two distributions by graphing the quantiles of one versus the quantiles of the other.
§ Either or both of these distributions may be empirical or theoretical.
§ Thus, the probability plot is a special case of the q-q plot.
– Probability plot: one distribution is empirical and another is theoretical.
Quantile-Quantile Plot on two data sets
§ Suppose we have two data sets consisting of univariate measurements.
§ We denote the order statistics for the first data set by x(1), x(2), …, x(n)
§ We denote the order statistics for the second data set by y(1), y(2), …, y(m).
§ If m = n, we simply plot the sample quantiles of one data set versus the other data set as points. Basically, just plot x(i) vs y(i).
§ If m != n, interpolation of one data set is needed.
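A hedged sketch of both cases. The helper name qq_points is hypothetical, and for m != n it keeps the smaller sample and interpolates the larger one at the fractions (i-0.5)/n using np.quantile's default linear interpolation, which is one of several reasonable choices.

```python
import numpy as np
import matplotlib.pyplot as plt

def qq_points(x, y):
    """Return paired quantiles of two samples for a q-q plot."""
    x = np.sort(np.asarray(x, dtype=float))
    y = np.sort(np.asarray(y, dtype=float))
    if len(x) == len(y):
        return x, y                      # plot x(i) vs y(i) directly
    if len(x) < len(y):
        p = (np.arange(1, len(x) + 1) - 0.5) / len(x)
        return x, np.quantile(y, p)      # interpolate the larger sample
    p = (np.arange(1, len(y) + 1) - 0.5) / len(y)
    return np.quantile(x, p), y

rng = np.random.default_rng(0)           # synthetic samples, fixed seed
a = rng.normal(size=50)
b = rng.normal(size=80)
qa, qb = qq_points(a, b)

fig, ax = plt.subplots()
ax.scatter(qa, qb)
lims = [min(qa.min(), qb.min()), max(qa.max(), qb.max())]
ax.plot(lims, lims, linestyle="--")      # reference line y = x
ax.set_xlabel("Quantiles of first sample")
ax.set_ylabel("Quantiles of second sample")
```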
Quantile-Quantile Plot
– Two data sets (i.e., x and y) from different distributions
– Two data sets (i.e., x and y) from the same distribution
Quantile Plots
§ Visually compare the distributions of two sample sets.
§ Probability Plots
– Used to compare sample quantiles with the quantiles from a known theoretical distribution
§ Quantile-Quantile Plots
– Used to determine whether two random samples were generated by the same distribution.
Scatterplot
§ The scatterplot is a visualization technique that enjoys widespread use in data analysis and is a powerful way to convey information about the relationship between two variables.
§ To construct one of these plots in 2-D, we simply plot the individual (xi , yi) pairs as points or some other symbol. For 3-D scatterplots, we add the third dimension and plot the (xi , yi , zi) triplets as points.
Scatterplot in Python
Scatterplot for visualizing relationship between two variables
Positive Correlated Relationship
Negative Correlated Relationship
Scatterplot on uncorrelated Data
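The three cases named above can be sketched on synthetic data as follows. The noise level and sample size are arbitrary assumptions, and the sample correlation coefficient is printed in each panel title for reference.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)  # synthetic data, fixed seed
x = rng.normal(size=200)
noise = rng.normal(scale=0.3, size=200)

datasets = {
    "Positive": x + noise,            # y increases with x
    "Negative": -x + noise,           # y decreases with x
    "Uncorrelated": rng.normal(size=200),
}

fig, axes = plt.subplots(1, 3, figsize=(10, 3))
corrs = {}
for ax, (name, y) in zip(axes, datasets.items()):
    ax.scatter(x, y, s=10)
    corrs[name] = np.corrcoef(x, y)[0, 1]   # Pearson correlation
    ax.set_title(f"{name} (r = {corrs[name]:.2f})")
fig.tight_layout()
```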
Scatterplot on Iris Data
– No strong correlation between ‘sepal_width’ and ‘sepal_length’
– Strong correlation between ‘petal_width’ and ‘petal_length’
Scatterplot Matrices
§ Scatterplot matrices are suitable for multivariate data, when p > 2. They show all possible 2-D scatterplots, where the axis of each plot is given by one of the variables.
§ The scatterplots are then arranged in a matrix-like layout for easy viewing and comprehension.
§ Some implementations of the scatterplot matrix show the plots in the lower triangular portion of the matrix layout only, since showing both triangles is somewhat redundant.
Scatterplot Matrices
§ Scatterplot Matrices on Iris Data
§ Each cell in a matrix-like layout is a scatterplot of two variables (e.g., ‘sepal_length’ vs ‘sepal_width’).
§ The diagonal cells show the histogram of a corresponding variable (e.g., sepal_length etc).
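A minimal sketch of a scatterplot matrix built directly from Matplotlib subplots. Synthetic variables named a, b, and c stand in for the Iris columns, which are not loaded here; the slides may have used a library helper such as pandas.plotting.scatter_matrix or seaborn.pairplot instead.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)  # synthetic stand-in data, fixed seed
n = 150
a = rng.normal(size=n)
data = {
    "a": a,
    "b": 2 * a + rng.normal(scale=0.5, size=n),  # correlated with a
    "c": rng.normal(size=n),                     # independent of a
}
names = list(data)
p = len(names)

fig, axes = plt.subplots(p, p, figsize=(7, 7))
for i, row in enumerate(names):
    for j, col in enumerate(names):
        ax = axes[i, j]
        if i == j:
            ax.hist(data[row], bins=15)            # diagonal: histogram
        else:
            ax.scatter(data[col], data[row], s=5)  # off-diagonal: scatterplot
        if i == p - 1:
            ax.set_xlabel(col)
        if j == 0:
            ax.set_ylabel(row)
fig.tight_layout()
```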
Line Chart
§ A line chart (or line graph) is a type of chart which displays information as a series of data points connected by straight line segments.
§ It is similar to a scatterplot except that the data points are ordered (typically by their x-axis value) and joined with straight line segments.
§ A line chart is often used to visualize a trend in data over intervals of time.
Line Chart for Time Series Data
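A hedged sketch of a line chart for time series data. The daily dates and the random-walk values are synthetic and chosen only for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)  # synthetic data, fixed seed
# 30 consecutive days; the end date is exclusive
days = np.arange("2024-01-01", "2024-01-31", dtype="datetime64[D]")
values = 20 + np.cumsum(rng.normal(size=days.size))  # random walk around 20

fig, ax = plt.subplots()
# Points are ordered by date and joined with straight line segments
(line,) = ax.plot(days, values, marker="o", linestyle="-")
ax.set_xlabel("Date")
ax.set_ylabel("Value")
fig.autofmt_xdate()  # rotate the date tick labels so they do not overlap
```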
§ Visualize Data Distribution
– Histogram: conveys data distribution using vertical bars
– Boxplot: visualize summary statistics such as the median, quartiles
– Quantile Plots: compare two distributions
§ Scatterplot
– A two-dimensional data visualization that uses dots to represent the values obtained for two different variables
– Conveys information about the relationship between the two variables
§ Line Chart
– Often used to visualize trends