DATA SCIENCE BASICS
COMP2420/COMP6420 INTRODUCTION TO DATA MANAGEMENT, ANALYSIS AND SECURITY
WEEK 1, LECTURE 2 (SEMESTER 1 2022)
of Computing
College of Engineering and Computer Science
Credit: Dr Ramesh Sankaranarayana (Honorary Senior Lecturer)
Acknowledgement of Country
We acknowledge and celebrate the First Australians on whose traditional lands we meet, and pay our respect to the elders of the Ngunnawal people past and present.
01 Data Science
03 Python Basics
04 Data Structures
05 Control Structures
06 Functions
DATA SCIENCE
Data and information
Data is the new oil
Attribution: World Economic Forum 2011
Definition
Data science is a “concept to unify statistics, data analysis, machine learning and their related methods” in order to “understand and analyze actual phenomena” with data
Knowledge Pyramid
Source: https://www.ontotext.com/knowledgehub/fundamentals/dikw-pyramid/
Data Science helps you climb the Pyramid
Data Science Clock
Goals and Use
Turn data into data products
Data => exploratory analysis => knowledge models => product / decision making
Data => predictive models => evaluate / interpret => product / decision making
Example applications?
Can you name some example applications?
Marketing: predict the characteristics of high lifetime value (LTV) customers, which can be used to support customer segmentation, identify upsell opportunities, and support other marketing initiatives
Logistics: forecast how many of which items we need and where we will need them, which enables us to track inventory and prevent out-of-stock situations
Examples (contd…)
Healthcare: analyze survival statistics for different patient attributes (age, blood type, gender, etc.) and treatments; predict risk of readmission based on patient attributes, medical history, etc.
Transaction Databases: Recommender systems (Netflix), Fraud Detection (Security and Privacy)
Wireless Sensor Data: Smart Home, Real-time Monitoring, Internet of Things
Examples (contd…)
Text Data, Social Media Data: Product Review and Consumer Satisfaction (Facebook, Twitter, LinkedIn), E-discovery
Software Log Data: Automatic Troubleshooting (Splunk)
Genotype and Phenotype Data: Epic, 23andMe, Patient-Centered Care, Personalized Medicine
Data Science: One Definition
Attribution: (data scientist and entrepreneur) http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram
Data Science
Modern Data Scientist
Danger Zone!!
Ronny Kohavi* keynote at KDD 2015
People are incredibly clever at explaining “very surprising results”. Unfortunately most very surprising results are caused by data pipeline errors.
Beware “HiPPOs” (Highest Paid Person’s Opinion)
*(Previous Vice President and Technical Fellow, Airbnb)
KDD: Knowledge Discovery and Data Mining (one of top conferences in the field)
Danger Zone!!
One Example: Gone Wrong!
Epidemiological modeling of online social network dynamics Jan 2014
What’s Hard?
1. Overcoming assumptions
2. Making ad-hoc explanations of data patterns
3. Overgeneralizing
4. Communication
5. Not checking enough (validate models, data pipeline integrity, etc.)
6. Using statistical tests correctly
7. Prototype to Production transitions
8. Data pipeline complexity (who do you ask?)
How many of you have learnt Python?
Compiled vs interpreted
https://www.geeksforgeeks.org/difference-between-compiled-and-interpreted-language/
Statically vs dynamically typed
https://medium.com/android-news/magic-lies-here-statically-typed-vs-dynamically-typed-languages-d151c7f95e2b
Created by Dutch programmer Guido van Rossum, 1991
Provided administrative scripting for the Amoeba distributed OS
Interpreted and dynamically typed
Syntax from C
Modula-3 inspired keyword arguments and imports
Open source from the beginning
Why Python for Data Science?
Great for scripting and applications. Rapid prototyping
Offers improved library support for data science
ipython, numpy, scipy, matplotlib, pandas, …
Strong High Performance Computation support
cython, numba
Load balancing tasks MP, GPU MapReduce
Python Characteristics
http://www.sahosofttutorials.com/Course/Python/102/
Python: Course environment
Anaconda is used to install Python
The version of Python is 3.9 (in the latest Anaconda)
All modules needed for the course come pre-installed with Anaconda
We are specifically interested in using JupyterLab
We will use Jupyter notebooks and the IPython shell as our working environment for Python
PYTHON BASICS
Formatting
Uses indentation to delimit code
Generally we indent with four spaces
Comments start with #
End of line indicates end of a statement (almost always)
Colons (:) start a new block
# this is a comment
print("Hello World")
a = 1
# A colon starts the content of a block, such as an if block
# The space (indentation) before print(a) says it's part of the if block
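A minimal sketch of how a colon and indentation together delimit a block. The variable `a` follows the slide; the condition `a == 1` and the list `messages` are assumptions added only for illustration.

```python
a = 1
messages = []  # hypothetical list used to record which lines ran

if a == 1:  # the colon opens the if block
    messages.append("inside the if block")  # indented: runs only when a == 1
messages.append("after the if block")  # not indented: always runs

print(messages)
```

Changing `a` to any other value leaves only the second message in the list, because the indented line belongs to the if block.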
Python is made of various modules
To use a module you need to import it
Four main ways to import
import math
import math as mt
from math import *
from math import pi, sin
To see what’s available inside a module, use dir(math)
To read its documentation, use help(math)
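The import styles above can be compared side by side. This is a small sketch using only the standard `math` module; the name `public_names` is an illustrative assumption.

```python
import math
import math as mt
from math import pi, sin

# All three styles reach the same module contents.
print(math.sqrt(16))  # via the full module name
print(mt.floor(2.7))  # via the alias
print(sin(pi / 2))    # via directly imported names

# dir() returns the module's attribute names as a list of strings.
public_names = [name for name in dir(math) if not name.startswith("_")]
print("sqrt" in public_names)
```

The `from math import *` form is usually avoided in larger programs, since it makes it hard to tell where a name came from.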
“a storage location paired with an associated symbolic name (an identifier) which contains a value”
https://computersciencewiki.org/index.php/Variables
Variable names in Python can contain:
Alphanumeric characters a-z, A-Z, 0-9 (but a name cannot start with a digit)
Underscore _
Cannot be a keyword
and, as, assert, break, class, continue, def, del, elif, else, except, finally, for, from, global, if, import …
Convention: names start with
lowercase for variables
Uppercase for classes
The = character is assignment
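One way to check the naming rules programmatically is sketched below, using the standard `keyword` module and `str.isidentifier()`. The candidate names are made up for illustration.

```python
import keyword

# Hypothetical candidate names, chosen to hit each rule.
candidates = ["total_2", "2total", "class", "my-var", "_hidden"]

valid_names = []
for name in candidates:
    # str.isidentifier() checks the character rules;
    # keyword.iskeyword() checks for reserved words.
    is_valid = name.isidentifier() and not keyword.iskeyword(name)
    print(name, is_valid)
    if is_valid:
        valid_names.append(name)
```

The full list of reserved words for the running interpreter is available as `keyword.kwlist`.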
Data Types
Dynamically typed language: Type is determined at assignment.
a = 2.0 # float
a = 'hi' # str
a = True # bool
a = None # NoneType (null type)
Type Casting
float(2) # int to float
int(2.5) # float to int (truncates to 2)
str(c) # any to str
int('2') # str to int, only if the str contains only numerals
3 * 2.0 # (6.0) int to float, implicit
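A short sketch of dynamic typing and explicit casting. The variable names are illustrative assumptions; only built-in functions are used.

```python
a = 2.0
first_type = type(a).__name__   # 'float'
a = "hi"                        # the same name can be rebound to a different type
second_type = type(a).__name__  # 'str'

as_float = float(2)   # int to float
as_int = int(2.5)     # float to int, truncates toward zero
as_str = str(3.14)    # any value to str
from_str = int("2")   # str to int

try:
    int("2.5")        # not an integer literal, so the conversion fails
except ValueError as err:
    print("conversion failed:", err)
```

To turn a string such as "2.5" into an integer, one option is to go through float first: `int(float("2.5"))`.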
Arithmetic
+, -, *, / standard math operators (division always gives float)
** power (3**2 means 3²)
//, % give the quotient and remainder of a division
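The operators above can be tried directly. A minimal sketch; the names `quotient` and `remainder` are illustrative.

```python
print(7 / 2)    # true division, always a float
print(6 / 3)    # still a float, even when the result is whole
print(7 // 2)   # floor quotient
print(7 % 2)    # remainder
print(3 ** 2)   # power

# divmod() returns the floor quotient and the remainder in one call.
quotient, remainder = divmod(7, 2)

# Floor division rounds toward negative infinity, not toward zero.
negative_quotient = -7 // 2
```

For any integers `n` and `d` (with `d` non-zero), `n == (n // d) * d + (n % d)` holds.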
Comparison
<, >, <=, >= ordering
== equality
!= not equal
can compare almost any two values of same type
comparison always returns a bool: True or False
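A brief sketch of comparisons on different types. The names `different_types_equal` and `ordering_error` are assumptions for illustration.

```python
print(2 < 3)               # numbers compare by value
print("apple" < "banana")  # strings compare lexicographically
print([1, 2] == [1, 2])    # lists compare element by element

# Equality across unrelated types is allowed and simply gives False.
different_types_equal = (1 == "1")

try:
    1 < "1"                # ordering across unrelated types is not defined
except TypeError as err:
    ordering_error = str(err)
    print("cannot order:", ordering_error)
```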
DATA STRUCTURES
Strings can be delimited by matching single or double quotation marks
double_s = "Hello World"
single_s = 'Hello World'
escaped_s = 'Isn\'t it'
another_s = "Isn\'t it"
Multiple line string
s = """This is first.
this is second."""
Long string
s = "my name is \
A list is a container of objects that need not be of the same type
x = [1, 2, 'a']
len(x) # (3)
x.append('3') # (x = [1, 2, 'a', '3'])
x[-1] # ('3')
x[:2] # ([1, 2])
x[1:] # ([2, 'a', '3'])
x[1:3] # ([2, 'a'])
x[0] = 3 # ([3, 2, 'a', '3'])
You can slice a string too, but it’s immutable
3 in [1, 3, 5] # (True)
a, b = [1, 2] # (a=1, b=2)
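The difference between a mutable list and an immutable string can be sketched as follows. The names `first_two` and `s` are illustrative assumptions.

```python
x = [1, 2, "a"]
first_two = x[:2]  # slicing builds a new list and leaves x unchanged
x[0] = 3           # lists are mutable, so item assignment works

s = "Hello World"
print(s[:5])       # slicing a string works the same way

try:
    s[0] = "J"     # strings are immutable, so item assignment fails
except TypeError as err:
    print("cannot modify string:", err)

# To "change" a string, build a new one instead.
s = "J" + s[1:]
```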
A tuple is a container of objects that is immutable
t = (10, 40, 'A')
print(type(t), len(t)) # (<class 'tuple'> 3)
x, y, z = t # unpack the tuple
print(y, z) # (40 A)
t[0] = 20 # (TypeError: 'tuple' object does not support item assignment)
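Because a tuple cannot be modified in place, a changed version has to be built as a new tuple. A minimal sketch, where the name `changed` is an assumption for illustration:

```python
t = (10, 40, "A")
x, y, z = t  # unpacking assigns each element to a name

try:
    t[0] = 20  # item assignment is rejected for tuples
except TypeError as err:
    print("cannot modify tuple:", err)

# Build a new tuple from a one-element tuple and a slice of the old one.
changed = (20,) + t[1:]
print(t, changed)
```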
Dictionaries
A collection of {key: value} pairs. Equivalent to hash maps in other programming languages
data = {}
data['a'] = 'Mon'
data['b'] = 'Tue'
data[2] = 100
print(data) # ({'a': 'Mon', 'b': 'Tue', 2: 100})
data[4] = 'Hi' # ({'a': 'Mon', 'b': 'Tue', 2: 100, 4: 'Hi'})
data[4] = [1,2,3]
print(data[4][1]) # (2)
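A few other common dictionary operations, sketched with the same keys as above. The key "c" and the names `fallback` and `keys` are illustrative assumptions.

```python
data = {"a": "Mon", "b": "Tue", 2: 100}

# get() returns a default instead of raising KeyError for a missing key.
fallback = data.get("c", "missing")

# The in operator tests for keys, not values.
print("a" in data, "Mon" in data)

del data["b"]      # remove a key and its value
keys = list(data)  # keys come back in insertion order (Python 3.7+)
print(keys)
```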
CONTROL STRUCTURES
x = 3
if x >= 3:
    print(x, "is greater than or equal to 3")
elif x > 2 and x < 3:
    print(x, "is greater than 2 but less than 3")
else:
    print(x, "is less than or equal to 2")
# (3 is greater than or equal to 3)
a = 3
b = 2
x = 'hi' if a < b else 'hello' # (x='hello')
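The same branching logic can be wrapped in a function so each branch is easy to exercise. This is a sketch only; the function name `describe` and its return strings are assumptions.

```python
def describe(x):
    """Return a short description of where x falls relative to 2 and 3."""
    if x >= 3:
        return "greater than or equal to 3"
    elif 2 < x < 3:  # chained comparison, same as x > 2 and x < 3
        return "between 2 and 3"
    else:
        return "less than or equal to 2"


for value in (3, 2.5, 2):
    print(value, "is", describe(value))

# A conditional expression chooses between two values in a single line.
label = "small" if 1 < 2 else "large"
```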
Loop over a list
letters = ['a', 'b']
for letter in letters:
    print(letter)
# output: a, then b, one per line
Loop over a range of values
for i in range(1, 4, 2):
    print(i)
# output: 1, then 3, one per line
for (contd...)
Loop over a list with indexes required
for i, letter in enumerate(letters):
    print(i, letter)
Loop over dictionary items:
d = {1: 'a', 2: 'b'}
for k, v in d.items():
    print(k, v)
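To make the loop results easy to inspect, the sketch below collects them into lists instead of printing. The list names are assumptions for illustration.

```python
letters = ["a", "b"]

odd_values = []
for i in range(1, 4, 2):  # start at 1, stop before 4, step by 2
    odd_values.append(i)

indexed = []
for i, letter in enumerate(letters):  # enumerate yields (index, item) pairs
    indexed.append((i, letter))

d = {1: "a", 2: "b"}
pairs = []
for k, v in d.items():  # items() yields (key, value) pairs
    pairs.append((k, v))

print(odd_values, indexed, pairs)
```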
Definition
Functions are defined using def
def add_one(x):
    """
    This is the docstring, written to aid the user
    when using the help functionality on a function.
    """
    return x + 1
Provide abstraction and organization
Function can be called only after it is defined
return used to define the return value, if not specified it’s None
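A small sketch of the points above: a docstring that help() can display, and a function without a return statement that yields None. The function `log_value` is a hypothetical example.

```python
def add_one(x):
    """Return x plus one."""
    return x + 1


def log_value(x):
    """Print x; there is no return statement, so the result is None."""
    print("value:", x)


result = add_one(2)
nothing = log_value(2)

# The docstring is stored on the function and is what help() shows.
print(add_one.__doc__)
print(result, nothing)
```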
Positional arguments
def print_three(a, b, c):
    print(a, b, c)
print_three(1, 2, 3) # (1 2 3)
Named arguments
def print_three(a, b, c):
    print(a, b, c)
print_three(c=1, b=2, a=3) # (3 2 1)
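Positional and named arguments can also be mixed in one call, as long as the positional ones come first. In this sketch `print_three` additionally returns its arguments as a tuple, which is an assumption added so the binding can be inspected.

```python
def print_three(a, b, c):
    print(a, b, c)
    return (a, b, c)  # added for illustration only


positional = print_three(1, 2, 3)
named = print_three(c=1, b=2, a=3)
mixed = print_three(1, c=3, b=2)  # positional first, then named in any order

try:
    print_three(1, a=2, b=3)  # a would receive two values
except TypeError as err:
    print("invalid call:", err)
```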
Tools for the data scientist
• Need tools in order to carry out data science
• Statistical and programming tools
• Python is the programming language used in this course
• Will have a quick look at useful Python packages
• These packages will be used in the labs and assignments
NumPy and Friends
The most commonly used and useful packages are those in the NumPy stack
All packages in the stack build on NumPy. The most common of these packages are:
• SciPy - collection of mathematical algorithms
• Matplotlib & Seaborn - plotting libraries
• ipython via Jupyter - interactive environment
• Pandas - data analysis library
• SymPy - symbolic computation library
NumPy: short for Numerical Python
It is the fundamental package required for high performance computing and data analysis
It provides:
A powerful n-dimensional array object
Fast and efficient functions to operate on entire arrays (no loops)
Tools for integrating C/C++ and Fortran Code
Linear algebra, Fourier transforms, and random number capabilities, etc.
Attribution: NumPy
NUMPY BASICS
ndarray object
ndarray object is the backbone of NumPy. It’s an n-dimensional array of homogeneous data types, with many operations being performed in compiled code for performance.
Differences with standard Python sequences:
• ndarray is of fixed size. Changing size is equivalent to creating a new one
• All elements must be of same data type
• More efficient mathematical operations than built-in sequence types
NumPy supports the following data types, among others:
* int8, int16, int32, int64
* float16, float32, float64
NumPy data types are represented by instances of the numpy.dtype class
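A minimal sketch of how a dtype is chosen, inspected and converted, assuming NumPy is installed. The array names are illustrative.

```python
import numpy as np

# Request a specific dtype when creating the array.
ints = np.array([1, 2, 3], dtype=np.int8)

# astype() returns a new array converted to another dtype.
floats = ints.astype(np.float32)

# Mixed int and float input is upcast so all elements share one dtype.
mixed = np.array([1, 2.5])

print(ints.dtype, ints.itemsize)      # itemsize is the bytes per element
print(floats.dtype, floats.itemsize)
print(mixed.dtype)
```

Smaller dtypes save memory but have a narrower range; an int8 element can only hold values from -128 to 127.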
NumPy (contd..)
import numpy as np
x = np.float32(1.0)
print(x, x.dtype) # 1.0 float32
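y = np.int_([1,10,100])
print(y, y.dtype) # [  1  10 100] int64
z = np.array(['a','b','c'])
print(z, z.dtype) # ['a' 'b' 'c'] <U1
a = np.arange(5) # [0, 1, 2, 3, 4], inferred from the outputs below
b = np.arange(5) # inferred from the outputs below
print(a > 3) # assumed comparison; any test true only for the last element gives this output
# array([False, False, False, False, True], dtype=bool)
print(10*np.sin(a))
# array([0., 8.41470985, 9.09297427, 1.41120008, -7.56802495])
print(a*b) # array([ 0, 1, 4, 9, 16])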
Multiplication
Basic multiplication is element-wise. To do matrix multiplication you need to call the dot function.
x = np.arange(4)
y = np.arange(4,8)
x.shape = (2,2) # [[0, 1], [2, 3]]
y.shape = (2,2) # [[4, 5], [6, 7]]
print(x*y) # array([[0, 5], [12, 21]])
print(np.dot(x,y)) # array([[6, 7], [26, 31]])
print(np.dot(y,x)) # array([[10, 19], [14, 27]])
Note: The same results can be computed with explicit Python loops, but that is much slower
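For comparison, here is what matrix multiplication looks like when written with explicit loops. This is a sketch for illustration, assuming NumPy is installed; the helper name `matmul_loops` is hypothetical and not part of NumPy.

```python
import numpy as np


def matmul_loops(x, y):
    """Multiply two 2-D arrays with explicit Python loops (illustrative only)."""
    rows, inner = x.shape
    inner_y, cols = y.shape
    if inner != inner_y:
        raise ValueError("inner dimensions must match")
    out = np.zeros((rows, cols), dtype=np.result_type(x, y))
    for i in range(rows):
        for j in range(cols):
            for k in range(inner):
                out[i, j] += x[i, k] * y[k, j]
    return out


x = np.arange(4).reshape(2, 2)     # [[0, 1], [2, 3]]
y = np.arange(4, 8).reshape(2, 2)  # [[4, 5], [6, 7]]

print(matmul_loops(x, y))
print(np.dot(x, y))
print(x @ y)  # the @ operator also performs matrix multiplication
```

All three calls give the same matrix; the loop version does the work in the Python interpreter, while `np.dot` and `@` run in compiled code.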
Other Functions
NumPy provides several other functions, such as exp, sin, max and min, that operate on arrays directly
x = np.random.random((2,3))
print(x) # array([[0.43035233, 0.07244463, 0.83782813],
# [0.14712878, 0.92679831, 0.40962548]])
print(x.sum()) # 2.8241776619621159
print(x.max()) # 0.926798306793
print(x.min(axis=0)) # [ 0.14712878 0.07244463 0.40962548]
print(x.max(axis=1)) # [ 0.83782813 0.92679831]
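The effect of the axis argument is easier to follow on fixed values than on random ones. A small sketch, assuming NumPy is installed; the array contents are made up for illustration.

```python
import numpy as np

x = np.array([[1, 5, 3],
              [4, 2, 6]])

total = x.sum()               # no axis: reduce over every element
column_min = x.min(axis=0)    # axis=0: reduce down the rows, one result per column
row_max = x.max(axis=1)       # axis=1: reduce across the columns, one result per row

print(total, column_min, row_max)
```

For a 2 x 3 array, reducing along axis 0 gives 3 values and reducing along axis 1 gives 2 values.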
NumPy provides various other linear algebra functions, which we will cover in the labs
Live demo next
PANDAS will be in a recorded lecture
Image credit: on flickr