COMP9321:
Data services engineering
Term 1, 2021
Week 4: Data Visualisation

http://extremepresentation.typepad.com/files/choosing-a-good-chart-09.pdf
The right paradigm

The right paradigm
Pie Charts:

• Although commonly used, not considered an effective form. The human brain is not wired to parse round shape areas and arcs
• Normally other graphs can do the same job (e.g., bar graphs)
• Maybe OK when showing two variables (<, >, similar, etc.)
Learning Data Visualization by Bill Shander, Lynda.com

The right paradigm
Histograms
• Histograms are useful for viewing (or really discovering) the distribution of data points
• The use of bins (discretization) really helps us see the “bigger picture”, whereas if we use all of the data points without discrete bins, there would probably be a lot of noise in the visualization, making it hard to see what is really going on
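A minimal sketch of the binning point above, assuming synthetic data (1,000 draws from a normal distribution, invented for illustration) and an arbitrary choice of 20 bins. The left panel draws every point individually; the right panel shows the same values discretized into bins.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Synthetic data (an assumption for illustration only)
rng = np.random.default_rng(seed=0)
values = rng.normal(loc=50, scale=10, size=1000)

fig, (ax_raw, ax_hist) = plt.subplots(1, 2, figsize=(10, 4))

# Left: every data point drawn individually, which mostly shows noise
ax_raw.plot(values, marker=".", linestyle="none")
ax_raw.set_title("All data points, no bins")

# Right: the same data discretized into 20 bins, which reveals the distribution
counts, bin_edges, _ = ax_hist.hist(values, bins=20)
ax_hist.set_title("Histogram with 20 bins")

# In a notebook or interactive session, drop the Agg line and call plt.show()
```

Changing `bins` is worth trying: too few bins hide the shape, too many bring the noise back.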

The right paradigm: hierarchical data
To show the “connections” between and the hierarchy of objects.
A tree diagram:
http://mbostock.github.io/d3/talk/20111018/tree.html
– Default for showing hierarchy (e.g., org chart)
– Any situation where you have a parent, which has children (and grandchildren)
A node link diagram:
http://mbostock.github.io/d3/talk/20111116/force-collapsible.html
– Showing a lot of links between objects
Tree map:
http://mbostock.github.io/d3/talk/20111018/treemap.html
– Size of each category
Chord diagram:
https://bost.ocks.org/mike/uberdata/
– Complex data -> very difficult to parse the information (e.g., between dots far apart)
– Interactivity could help parse (https://bost.ocks.org/mike/fisheye/)

The right paradigm: showing data on maps
On an existing map API like Google API …
Place markers (https://www.latlong.net)
– specific location (e.g., building), centre of a region
Layers (data associated with the regions on a map)
• Point clustering (http://bl.ocks.org/andrewxhill/raw/8360694/)
– display aggregated number/data points per region
• Choropleth map (http://leafletjs.com/examples/choropleth/)
– display divided geographical areas or regions that are coloured, shaded or patterned in relation to a data variable.
• Heat map (https://onemilliontweetmap.com/)
• Flow map
– show the movement of information or objects from one location to another and their amount (thickness of lines, colours)
– https://datavizcatalogue.com/methods/flow_map.html
– https://www.iom.int/world-migration

Three tricks for doing more with less
• Multiple plots
– simple, easily interpretable subplots
– can be beautiful but overwhelming
• Hybrid plots
– a scatter plot of histograms
– or a venn-diagram of histograms, etc.
• Multiple axes
– plot two (or more) different things on one graph
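Two of these tricks can be sketched in Matplotlib as follows. This is a minimal illustration, not a recommended design: the yearly figures and the `avg_price` series are hypothetical values invented for the example. `plt.subplots` gives multiple plots, and `Axes.twinx` gives a second y-axis on the same graph.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt

# Hypothetical yearly figures, invented purely for illustration
years = [1999, 2000, 2001, 2002, 2003, 2004]
toys_sold = [120, 150, 170, 160, 210, 240]
avg_price = [9.5, 9.8, 10.4, 11.0, 10.7, 11.5]

# Multiple plots: one small, simple subplot per quantity
fig, (ax_sold, ax_price) = plt.subplots(1, 2, figsize=(10, 4))
ax_sold.plot(years, toys_sold)
ax_sold.set_title("Toys sold")
ax_price.plot(years, avg_price)
ax_price.set_title("Average price")

# Multiple axes: two different quantities on one graph, each with its own y-axis
fig2, ax_left = plt.subplots()
ax_left.bar(years, toys_sold, color="lightblue")
ax_left.set_ylabel("Number of toys sold")
ax_right = ax_left.twinx()  # second y-axis sharing the same x-axis
ax_right.plot(years, avg_price, color="darkgreen")
ax_right.set_ylabel("Average price")

# In a notebook or interactive session, drop the Agg line and call plt.show()
```

With two y-axes the reader has to work out which scale belongs to which series, so labelling both axes matters more than usual.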

Hybrid plots
Courtesy of http://addictedtor.free.fr/graphiques/addNote.php?graph=78

Hybrid plots
Courtesy of http://addictedtor.free.fr/graphiques/addNote.php?graph=109

Multiple plots
Courtesy of Cognitive Science Society. Used with permission.

Multiple plots
Courtesy of Cognitive Science Society. Used with permission.
Baker, Tenenbaum, & Saxe (2007)

Multiple plots
Courtesy of Andrew Gelman. Used with permission.

Multiple axes
[Figure: number of toys sold per year (1999 to 2004) for Toy A through Toy I. Figure by MIT OpenCourseWare.]

Multiple axes
[Figure: P(report) and Log(observed/chance frequency) against serial position (0=target), for Guess 1 and Guess 2.]

Two tradeoffs
• Informativeness vs. readability
– Too little information can conceal data
– But too much information can be overwhelming
– Possible solution: hierarchical organization?
• Data-centric vs. viewer-centric
– Viewers are accustomed to certain types of visualization
– But novel visualizations can be truer to data

To put it together …
So … there are many many options for visualising data (including a lot of fancy and interactive ones from the latest tools and libraries)
But let’s try to have some basic competency on this:
• Accuracy is important, and so is having a clear story to tell
• You need to be ready to do some basic data prep and pre-analysis before visualisation
• Know the right paradigm (form) to use for the story
• Be aware of your own limitations as a ‘non-expert’ (visualisation is not easy)
Actually, a lot of experts recommend “sketching the idea out” with pen and paper.

Data Visualization using Pandas/Matplotlib
• There are many excellent plotting libraries in Python and I recommend exploring more than one in order to create presentable graphics.
• In this course we are focusing on the Matplotlib library. It is the foundation for many other plotting libraries and plotting support in higher-level libraries such as Pandas.
• Don’t get confused by Matplotlib’s many ways of plotting the same thing. Pandas is our access point.

Data Visualization in Python
• Matplotlib: low level, provides lots of freedom
• Pandas Visualization: easy to use interface, built on Matplotlib
• Seaborn: high-level interface, great default styles
• ggplot: based on R’s ggplot2, uses Grammar of Graphics
• Plotly: can create interactive plots

Matplotlib and Dataframes


• Under the hood, pandas plots graphs with the matplotlib library. This is usually pretty convenient since it allows you to just .plot your graphs.
• When you use .plot on a dataframe, you sometimes pass things to it and sometimes you don’t.
– .plot() plots every column against the index
– .plot(x='col1') puts col1 on the x-axis and plots every other column against it
– .plot(x='col1', y='col2') plots one specific column against another specific column
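The three call forms above can be compared side by side. This is a minimal sketch on a hypothetical dataframe: `col1` and `col2` echo the placeholder names in the bullets, while `col3` and all the values are invented for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical dataframe with three numeric columns
df = pd.DataFrame({
    "col1": [1, 2, 3, 4],
    "col2": [10, 20, 25, 30],
    "col3": [5, 3, 8, 6],
})

fig, (ax_a, ax_b, ax_c) = plt.subplots(1, 3, figsize=(12, 3))

# Index on the x-axis, one line per column (three lines here)
df.plot(ax=ax_a)

# col1 on the x-axis, one line per remaining column (two lines here)
df.plot(x="col1", ax=ax_b)

# col1 on the x-axis, col2 on the y-axis (a single line)
df.plot(x="col1", y="col2", ax=ax_c)

# In a notebook or interactive session, drop the Agg line and call plt.show()
```

Passing `ax=` is optional; it is used here only so that the three results land in one figure.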

Matplotlib and Dataframes
    country    year  unemployment
35  Australia  2015  6.063658
36  Australia  2016  5.723454
37  USA        1980  7.141667
38  USA        1981  7.600000
39  USA        1982  9.708333
If you use: df.plot()
http://jonathansoma.com/lede/algorithms-2017/classes/fuzziness-matplotlib/understand-df-plot-in-pandas/

Matplotlib and Dataframes
• The major use case for .plot() is when you have a meaningful index, which usually happens in two situations:
– You’ve just done a .value_counts() or a .groupby()
– You’ve used .set_index, probably with dates
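Both situations can be sketched on a small hypothetical dataframe (the dates and values below are invented for illustration). In each case the index carries the meaning, so a bare `.plot` needs no `x` argument.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical data, invented purely for illustration
df = pd.DataFrame({
    "date": pd.to_datetime(["2021-01-01", "2021-02-01", "2021-03-01", "2021-04-01"]),
    "country": ["Australia", "USA", "USA", "Australia"],
    "unemployment": [6.4, 6.3, 6.0, 5.5],
})

fig, (ax_counts, ax_dates) = plt.subplots(1, 2, figsize=(10, 4))

# Situation 1: value_counts() returns a Series indexed by country
counts = df["country"].value_counts()
counts.plot(kind="bar", ax=ax_counts)

# Situation 2: set_index with dates puts the dates on the x-axis
by_date = df.set_index("date")
by_date["unemployment"].plot(ax=ax_dates)

# In a notebook or interactive session, drop the Agg line and call plt.show()
```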

Matplotlib and Dataframes
• So if you do:
df.groupby('country')['unemployment'].mean().plot(kind='bar')
You’ll get a bar chart with one bar per country, showing its mean unemployment.
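A runnable version of this call, assuming the dataframe contains only the five rows shown in the earlier table (the original dataset is presumably larger, so the means here are illustrative):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Only the five rows visible in the table above (an assumption about the data)
df = pd.DataFrame(
    {
        "country": ["Australia", "Australia", "USA", "USA", "USA"],
        "year": [2015, 2016, 1980, 1981, 1982],
        "unemployment": [6.063658, 5.723454, 7.141667, 7.600000, 9.708333],
    },
    index=[35, 36, 37, 38, 39],
)

# groupby + mean yields a Series indexed by country, a meaningful index for .plot
mean_by_country = df.groupby("country")["unemployment"].mean()
ax = mean_by_country.plot(kind="bar")
ax.set_ylabel("Mean unemployment")

# In a notebook or interactive session, drop the Agg line and call plt.show()
```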

Matplotlib and Dataframes
• What about .plot(x='col1', y='col2')?
• Let’s try it on the same data as before:
df.plot(x='year', y='unemployment')
Talk about connected: all rows are drawn as one line, so Australia’s points are joined to the USA’s.

Matplotlib and Dataframes
• Groupby to do it right.
• Create a single graph
import matplotlib.pyplot as plt
fig, ax = plt.subplots()
df.groupby('country').plot(x='year', y='unemployment', ax=ax, legend=False)
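A variation on the groupby call, sketched with an explicit loop over the groups so that each line can carry its country as a legend label. It assumes the dataframe holds only the five rows from the earlier table; the loop is a swapped-in alternative for illustration, not the form used on the slide.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Only the five rows visible in the earlier table (an assumption about the data)
df = pd.DataFrame({
    "country": ["Australia", "Australia", "USA", "USA", "USA"],
    "year": [2015, 2016, 1980, 1981, 1982],
    "unemployment": [6.063658, 5.723454, 7.141667, 7.600000, 9.708333],
})

# One shared Axes, so every country lands on the same graph
fig, ax = plt.subplots()

# Each group is plotted as its own line, so countries are no longer joined together
for country, group in df.groupby("country"):
    group.plot(x="year", y="unemployment", ax=ax, label=country)

ax.set_ylabel("unemployment")

# In a notebook or interactive session, drop the Agg line and call plt.show()
```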

Matplotlib without dataframes
import matplotlib.pyplot as plt
plt.plot([1, 2, 3, 4], [10, 20, 25, 30], color='lightblue', linewidth=3)
plt.scatter([0.3, 3.8, 1.2, 2.5], [11, 25, 9, 26], color='darkgreen', marker='^')
plt.xlim(0.5, 4.5)
plt.show()

Matplotlib with/without Pandas
With Pandas
Without Pandas
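A minimal sketch of the comparison named here: the same bar chart drawn once through pandas and once with Matplotlib alone. The two values are rounded means computed from the five table rows shown earlier and are used only for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Rounded means of the five table rows shown earlier (illustrative values)
countries = ["Australia", "USA"]
mean_unemployment = [5.89, 8.15]

fig, (ax_pd, ax_plt) = plt.subplots(1, 2, figsize=(10, 4))

# With Pandas: the Series index supplies the bar labels
series = pd.Series(mean_unemployment, index=countries)
series.plot(kind="bar", ax=ax_pd, title="With Pandas")

# Without Pandas: labels and heights are passed to Matplotlib directly
ax_plt.bar(countries, mean_unemployment)
ax_plt.set_title("Without Pandas")

# In a notebook or interactive session, drop the Agg line and call plt.show()
```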

Matplotlib Conclusion
• Many details, scattered around the documentation
• Stack Overflow is King
• DO YOUR Activities

Useful Read
• Book: The Functional Art by Alberto Cairo (Chapters 1, 2, and 3)
• http://jonathansoma.com/lede/algorithms-2017/classes/fuzziness-matplotlib/how-pandas-uses-matplotlib-plus-figures-axes-and-subplots/
• https://pythonspot.com/visualize-data-with-pandas/