DATA SCIENCE BASICS
COMP2420/COMP6420 INTRODUCTION TO DATA MANAGEMENT, ANALYSIS AND SECURITY
WEEK 1, LECTURE 2 (SEMESTER 1 2022)
School of Computing

College of Engineering and Computer Science
Credit: Dr Ramesh Sankaranarayana (Honorary Senior Lecturer)

Acknowledgement of Country
We acknowledge and celebrate the First Australians on whose traditional lands we meet, and pay our respect to the elders of the Ngunnawal people past and present.

01 Data Science
03 Python Basics
04 Data Structures
05 Control Structures
06 Functions

DATA SCIENCE

Data and information

Data is the new oil
Attribution: World Economic Forum 2011

Definition
Data science is a “concept to unify statistics, data analysis, machine learning and their related methods” in order to “understand and analyze actual phenomena” with data

Knowledge Pyramid
Source: https://www.ontotext.com/knowledgehub/fundamentals/dikw-pyramid/
Data Science helps you climb the Pyramid

Data Science Clock

Goals and Use
Turn data into data products
Data => exploratory analysis => knowledge models => product / decision making
Data => predictive models => evaluate / interpret => product / decision making

Example applications?
Can you name some example applications?

Marketing: predict the characteristics of high lifetime value (LTV) customers, which can be used to support customer segmentation, identify upsell opportunities, and support other marketing initiatives
Logistics: forecast how many of which items we need and where we will need them, which lets us track inventory and prevent out-of-stock situations

Examples (contd…)
Healthcare: analyze survival statistics for different patient attributes (age, blood type, gender, etc.) and treatments; predict risk of readmission based on patient attributes, medical history, etc.
Transaction Databases: Recommender systems (Netflix), Fraud Detection (Security and Privacy)
Wireless Sensor Data: Smart Home, Real-time Monitoring, Internet of Things

Examples (contd…)
Text Data, Social Media Data: Product Review and Consumer Satisfaction (Facebook, Twitter, LinkedIn), E-discovery
Software Log Data: Automatic Troubleshooting (Splunk)
Genotype and Phenotype Data: Epic, 23andMe, Patient-Centered Care, Personalized Medicine

Data Science: One Definition
Attribution: (data scientist and entrepreneur) http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram

Data Science

Modern Data Scientist

Danger Zone!!
Ronny Kohavi* keynote at KDD 2015
People are incredibly clever at explaining “very surprising results”. Unfortunately most very surprising results are caused by data pipeline errors.
Beware “HiPPOs” (Highest-Paid Person’s Opinion)
*(Previous Vice President and Technical Fellow, Airbnb)
KDD: Knowledge Discovery and Data Mining (one of top conferences in the field)

One Example: Gone Wrong!
Epidemiological modeling of online social network dynamics Jan 2014

What’s Hard?
1. Overcoming assumptions
2. Making ad-hoc explanations of data patterns
3. Overgeneralizing
4. Communication
5. Not checking enough (validate models, data pipeline integrity, etc.)
6. Using statistical tests correctly
7. Prototype to Production transitions
8. Data pipeline complexity (who do you ask?)

How many of you have learnt Python?

Compiled vs interpreted
https://www.geeksforgeeks.org/difference-between-compiled-and-interpreted-language/
Statically vs dynamically typed
https://medium.com/android-news/magic-lies-here-statically-typed-vs-dynamically-typed-languages-d151c7f95e2b
Created by Dutch programmer Guido van Rossum, 1991
Provided administrative scripting for the Amoeba distributed OS
Interpreted and dynamically typed
Syntax from C
Modula-3 inspired keyword arguments and imports
Open source from beginning

Why Python for Data Science?
Great for scripting and applications. Rapid prototyping
Offers improved library support for data science
ipython, numpy, scipy, matplotlib, pandas, …
Strong High Performance Computation support
cython, numba
Load balancing tasks MP, GPU MapReduce

Python Characteristics
http://www.sahosofttutorials.com/Course/Python/102/

Python: Course environment
Python is installed through Anaconda
The Python version is 3.9 (in the latest Anaconda)
All modules needed for the course come pre-installed with Anaconda
We are specifically interested in using JupyterLab
We will use Jupyter notebooks and the IPython shell as our working environment for Python

PYTHON BASICS

Formatting
Uses indentation to delimit code
Generally we indent with four spaces
Comments start with #
End of line indicates end of a statement (almost always)
Colons (:) start a new block

# this is a comment
print("Hello World")
a = 1
if a == 1: # Colon starts the content for if
    print(a) # space in print(a) to say it's part of if block

Python is made of various modules
To use a module you need to import it
Four main ways to import
import math
import math as mt
from math import *
from math import pi, sin
To see what’s available inside a module, use:
dir(math)
help(math)
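
The import forms can be compared side by side. A minimal sketch using only the standard-library math module; the variable names (area, same_area, half_turn, public_names) are illustrative and not from the slides.

```python
import math                 # 1. import the module; access names as math.<name>
import math as mt           # 2. the same module under a shorter alias
from math import pi, sin    # 3. import selected names directly

# All three routes reach the same objects.
area = math.pi * 2 ** 2     # area of a circle with radius 2
same_area = mt.pi * 2 ** 2
half_turn = sin(pi / 2)

# 4. "from math import *" also works, but it copies every public name into
#    the current namespace, so it is usually avoided outside quick experiments.

# dir() lists the names a module defines; filter out the private ones.
public_names = [name for name in dir(math) if not name.startswith("_")]
print(area, half_turn, public_names[:5])
```

The aliased form is the one used most in data science code, for example import numpy as np.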

A variable is “a storage location paired with an associated symbolic name (an identifier) which contains a value”
https://computersciencewiki.org/index.php/Variables
Variable names in Python can contain:
alphanumerical characters a-z, A-Z, 0-9 (but cannot start with a digit)
Underscore _
Cannot be a keyword
and, as, assert, break, class, continue, def, del, elif, else, except, finally, for, from, global, if, import …
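
The naming rules can be checked programmatically. A small sketch using the standard keyword module and str.isidentifier; the helper is_valid_name and the candidate names are hypothetical, chosen only for illustration.

```python
import keyword

# keyword.kwlist holds the reserved words of the running Python version.
print(len(keyword.kwlist), keyword.kwlist[:5])

def is_valid_name(name):
    """Return True if name can be used as a variable name (hypothetical helper)."""
    return name.isidentifier() and not keyword.iskeyword(name)

candidates = ["total_2", "_hidden", "2fast", "class", "my-var"]
results = {name: is_valid_name(name) for name in candidates}
print(results)
```

Note that isidentifier also accepts non-ASCII letters, which is broader than the a-z, A-Z, 0-9 rule given above.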

Convention: names start with
lowercase for variables
Uppercase for classes
The = character is assignment

Data Types
Dynamically typed language: Type is determined at assignment.
a=2.0 float
a='hi' str
a=True bool
a=None null type (NoneType)
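
Dynamic typing can be observed with the built-in type(). A minimal sketch; the list types_seen is an illustrative name used to collect the results.

```python
# The same name is rebound to values of different types.
a = 2.0
types_seen = [type(a).__name__]

a = "hi"
types_seen.append(type(a).__name__)

a = True
types_seen.append(type(a).__name__)

a = None
types_seen.append(type(a).__name__)

print(types_seen)
```

The type belongs to the value, not to the name, which is why no declaration is needed before reassigning a.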

Type Casting
float(2) int to float
int(2.5) float to int (truncates toward zero, gives 2)
str(c) any to str
int('2') str to int, only if str contains only numerals
3 * 2.0 (6.0) int to float, implicit
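
The casts above, including a failing one, can be tried as follows. A short sketch; the variable names failed and mixed are illustrative.

```python
print(float(2))      # int to float
print(int(2.5))      # float to int: the fractional part is dropped
print(int(-2.5))     # truncation is toward zero, not rounding down
print(str(3.14))     # any value to str
print(int("2"))      # str to int works for a string of digits

# A string that is not a whole number raises ValueError.
try:
    int("2.5")
except ValueError as err:
    failed = True
    print("cannot convert:", err)
else:
    failed = False

mixed = 3 * 2.0      # the int is promoted to float implicitly
print(mixed, type(mixed).__name__)
```

To convert a string such as "2.5" to an integer, go through float first: int(float("2.5")).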

Arithmetic
+, -, *, / standard math operators (division always gives float)
** power (3**2 means 3²)
//, % give the floor quotient and the remainder for division
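
A quick check of the division operators, including the behaviour for negative numbers. A minimal sketch with illustrative variable names.

```python
print(7 / 2)     # true division always returns a float
print(6 / 3)     # still a float, even when the result is whole
print(3 ** 2)    # power

q, r = 7 // 2, 7 % 2              # floor quotient and remainder
same = divmod(7, 2)               # built-in that returns both at once

# // rounds toward negative infinity, so the remainder takes the sign
# of the divisor.
neg_q, neg_r = -7 // 2, -7 % 2

# The identity (a // b) * b + (a % b) == a always holds.
identity_holds = (neg_q * 2 + neg_r == -7)
print(q, r, same, neg_q, neg_r, identity_holds)
```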

Comparison
<, >, <=, >= ordering
== equality
!= not equal
can compare almost any two values of same type
comparison always returns a bool: True or False
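
The "almost any two values of same type" rule can be probed directly. A sketch with illustrative names; it relies on Python 3 behaviour, where ordering unrelated types raises TypeError.

```python
print(2 < 3, 2.0 == 2, "apple" < "banana")   # numbers, mixed numerics, strings

in_range = 1 < 2 <= 2              # comparisons can be chained
lists_ordered = [1, 2] < [1, 3]    # sequences compare element by element

# Ordering two unrelated types is an error.
try:
    1 < "a"
except TypeError:
    mixed_ok = False
else:
    mixed_ok = True

# Equality between unrelated types is allowed and simply gives False.
different = (1 == "1")
print(in_range, lists_ordered, mixed_ok, different)
```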

DATA STRUCTURES

Strings
Can be delimited by matching single or double quotation marks
double_s = "Hello World"
single_s = 'Hello World'
escaped_s = 'Isn\'t it'
another_s = "Isn\'t it"
Multiple line string
s = """ This is first.
this is second."""
Long string
s = "my name is \
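
The string forms can be written out in full as below. A minimal sketch; the text in long_s is a placeholder and not the wording of the original slide, which is cut off.

```python
double_s = "Hello World"
single_s = 'Hello World'
escaped_s = 'Isn\'t it'        # backslash escapes the quote
another_s = "Isn't it"         # no escape needed inside double quotes

multi = """This is first.
This is second."""             # triple quotes keep the line break

# A trailing backslash joins the next source line; no newline is stored.
long_s = "a long string \
continued on the next line"

sliced = double_s[:5]          # strings slice like lists
print(multi)
print(long_s)
print(sliced)
```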

Lists
A list is a container of objects that need not be of the same type
x=[1,2,'a']
len(x) #(3)
x.append('3') #(x=[1,2,'a','3'])
x[-1] #('3')
x[:2] #([1,2])
x[1:] #([2,'a','3'])
x[1:3] #([2,'a'])
x[0]=3 #([3,2,'a','3'])
You can slice a string too, but it's immutable
3 in [1,3,5] #(True)
a,b = [1,2] #(a=1, b=2)
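
The list operations can be run end to end, together with the contrast between mutable lists and immutable strings. A sketch; names such as first_two, string_is_immutable and copy_of_x are illustrative.

```python
x = [1, 2, "a"]
x.append("3")                  # x is now [1, 2, 'a', '3']
last = x[-1]                   # negative indexes count from the end
first_two = x[:2]              # slices return new lists
middle = x[1:3]
x[0] = 3                       # lists are mutable

s = "hello"
try:
    s[0] = "H"                 # strings can be sliced but not modified
except TypeError:
    string_is_immutable = True
else:
    string_is_immutable = False

# A full slice makes a shallow copy, so changing it leaves x untouched.
copy_of_x = x[:]
copy_of_x.append(99)

a, b = [1, 2]                  # unpacking
print(x, last, first_two, middle, copy_of_x, a, b)
```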

Tuples
A tuple is an immutable container of objects
t = (10, 40, 'A')
print(type(t), len(t)) #(<class 'tuple'> 3)
x,y,z = t #(x=10, y=40, z='A')
print(y,z) #(40 A)
t[0] = 20 #(TypeError: 'tuple' object does not support item assignment)
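
Tuple immutability and its main practical consequence (tuples can be dictionary keys) are shown below. A minimal sketch; t2, single and lookup are illustrative names.

```python
t = (10, 40, "A")
x, y, z = t                    # unpacking works as for lists

try:
    t[0] = 20                  # item assignment is not allowed
except TypeError as err:
    message = str(err)

t2 = (20,) + t[1:]             # build a new tuple instead of modifying
single = (5,)                  # a one-element tuple needs the trailing comma
lookup = {(0, 1): "edge"}      # tuples are hashable, so they can be dict keys
print(message, t2, single, lookup[(0, 1)])
```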

Dictionaries
A collection of {key: value} pairs. Equivalent to hash maps in other programming languages
data = {}
data['a'] = 'Mon'
data['b'] = 'Tue'
data[2] = 100
print(data) #({'a': 'Mon', 'b': 'Tue', 2: 100})
data[4] = 'Hi' #({'a': 'Mon', 'b': 'Tue', 2: 100, 4: 'Hi'})
data[4] = [1,2,3]
print(data[4][1]) #(2)
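
A few more dictionary operations that come up often. A sketch with illustrative names; it assumes Python 3.7 or later, where dictionaries keep insertion order.

```python
data = {}
data["a"] = "Mon"
data["b"] = "Tue"
data[2] = 100                      # keys may be of different (hashable) types

keys_in_order = list(data)         # keys come back in insertion order
missing = data.get("z", "n/a")     # .get avoids KeyError for absent keys
has_b = "b" in data                # membership tests look at keys
removed = data.pop(2)              # remove a key and return its value
pairs = [(k, v) for k, v in data.items()]
print(keys_in_order, missing, has_b, removed, pairs)
```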

CONTROL STRUCTURES

x = 3
if x >= 3:
    print(x,"is greater than or equal to 3")
elif x > 2 and x < 3:
    print(x,"is greater than 2 but less than 3")
else:
    print(x,"is less than or equal to 2")
#(3 is greater than or equal to 3)
a=3
b=2 # assumed value; b is not defined on this slide
x = 'hi' if a < b else 'hello' #(x='hello')

Loop over a list
letters = ['a','b']
for letter in letters:
    print(letter)
# output: a, b (one per line)
Loop over a range of values
for i in range(1,4,2):
    print(i)
# output: 1, 3 (one per line)

for (contd...)
Loop over a list with indexes required
for i,letter in enumerate(letters):
    print(i,letter)
Loop over dictionary items:
d = {1:'a', 2:'b'}
for k,v in d.items():
    print(k,v)

FUNCTIONS

Definition
Functions are defined using def
def add_one(x):
    """This is the docstring, written to aid the user when using the help functionality of a function."""
    return x+1
Provide abstraction and organization
A function can be called only after it is defined
return is used to define the return value; if not specified, it's None

Positional arguments
def print_three(a,b,c):
    print(a,b,c)
print_three(1,2,3) #(1 2 3)

Named arguments
def print_three(a,b,c):
    print(a,b,c)
print_three(c=1,b=2,a=3) #(3 2 1)

Tools for the data scientist
• Need tools in order to carry out data science
• Statistical and programming tools
• Python is the programming language used in this course
• Will have a quick look at useful Python packages
• These packages will be used in the labs and assignments

NumPy and Friends
Most commonly used and useful packages are those in the NumPy stack
All packages in the stack build on NumPy
Most common of these packages are:
• SciPy - collection of mathematical algorithms
• Matplotlib & Seaborn - plotting libraries
• IPython via Jupyter - interactive environment
• Pandas - data analysis library
• SymPy - symbolic computation library

NumPy
Short for Numerical Python
Is the fundamental package required for high performance computing and data analysis
It provides:
• A powerful n-dimensional array object
• Fast and efficient functions to operate on entire arrays (no loops)
• Tools for integrating C/C++ and Fortran code
• Linear algebra, Fourier transforms, and random number capabilities, etc.
Attribution: NumPy

NUMPY BASICS

ndarray object
ndarray object is the backbone of NumPy. It's an n-dimensional array of homogeneous data types, with many operations being performed in compiled code for performance.
Differences with standard Python sequences:
• ndarray is of fixed size. Changing size is equivalent to creating a new one
• All elements must be of the same data type
• More efficient mathematical operations than built-in sequence types

NumPy supports the following data types:
* int8, int16, int32, int64
* float16, float32, float64, …
NumPy data types are stored in the numpy.dtype class

NumPy (contd..)
import numpy as np
x = np.float32(1.0)
print(x, x.dtype) # 1.0 float32
y = np.int_([1,10,100])
print(y, y.dtype) # [1 10 100] int64
z = np.array(['a','b','c'])
print(z, z.dtype) # ['a' 'b' 'c'] <U1

a = np.arange(5) # assumed definition (missing from the source), chosen to match the outputs shown
b = np.arange(5) # assumed definition (missing from the source)
print(a > 3)
# array([False, False, False, False, True], dtype=bool)
print(10*np.sin(a))
# array([0., 8.41470985, 9.09297427, 1.41120008, -7.56802495])
print(a*b) # array([ 0, 1, 4, 9, 16])
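
The differences between an ndarray and a plain Python list can be seen in a few lines. A minimal sketch, assuming NumPy is installed; the variable names are illustrative.

```python
import numpy as np

values = [1, 2, 3]
arr = np.array(values)

# The same operator means different things: lists repeat, arrays
# multiply element by element.
list_times_two = values * 2
arr_times_two = arr * 2

# Fixed size: np.append returns a new array and leaves the original alone.
bigger = np.append(arr, 4)

# Homogeneous dtype: mixing int and float upcasts every element to float.
upcast = np.array([1, 2.5])

print(list_times_two, arr_times_two, bigger, arr.shape, upcast.dtype)
```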

Multiplication
Basic multiplication is element-wise. To do matrix multiplication you need to call the dot function.
x = np.arange(4)
y = np.arange(4,8)
x.shape = (2,2) # [[0, 1], [2, 3]]
y.shape = (2,2) # [[4, 5], [6, 7]]
print(x*y) # array([[0, 5], [12, 21]])
print(np.dot(x,y)) # array([[6, 7], [26, 31]])
print(np.dot(y,x)) # array([[10, 19], [14, 27]])
Note: The same results can be computed with explicit Python loops, but that is much slower
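
The loop-based alternative mentioned in the note can be sketched for comparison. The helper matmul_loops is hypothetical and written only to show what np.dot replaces; the @ operator is NumPy's operator form of matrix multiplication for 2-D arrays.

```python
import numpy as np

x = np.arange(4).reshape(2, 2)       # [[0, 1], [2, 3]]
y = np.arange(4, 8).reshape(2, 2)    # [[4, 5], [6, 7]]

elementwise = x * y                  # multiplies matching positions
matrix = np.dot(x, y)                # row-by-column matrix product
also_matrix = x @ y                  # operator form of the same product

def matmul_loops(p, q):
    """Naive triple-loop matrix product, for comparison only."""
    rows, inner, cols = p.shape[0], p.shape[1], q.shape[1]
    out = np.zeros((rows, cols), dtype=p.dtype)
    for i in range(rows):
        for j in range(cols):
            for k in range(inner):
                out[i, j] += p[i, k] * q[k, j]
    return out

looped = matmul_loops(x, y)
print(elementwise, matrix, looped, sep="\n")
```

The loop version runs in the Python interpreter one element at a time, while np.dot runs in compiled code, which is where the speed difference comes from.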

Other Functions
NumPy provides several other functions like exp, sin, max, min, etc., that operate on arrays directly
x = np.random.random((2,3))
print(x) # array([[0.43035233, 0.07244463, 0.83782813],
# [0.14712878, 0.92679831, 0.40962548]])
print(x.sum()) # 2.8241776619621159
print(x.max()) # 0.926798306793
print(x.min(axis=0)) # [ 0.14712878 0.07244463 0.40962548]
print(x.max(axis=1)) # [ 0.83782813 0.92679831]
NumPy provides various other linear algebra functions, which we will cover in the labs
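
The axis argument is easier to follow on fixed numbers than on random ones. A small sketch with a hand-picked array; the values and variable names are illustrative, and the seeded generator is one way to make random output repeatable.

```python
import numpy as np

x = np.array([[1.0, 5.0, 3.0],
              [4.0, 2.0, 6.0]])

total = x.sum()              # no axis: reduce over every element
col_min = x.min(axis=0)      # axis=0 collapses the rows: one value per column
row_max = x.max(axis=1)      # axis=1 collapses the columns: one value per row
exp_x = np.exp(x)            # element-wise functions keep the shape

# A seeded generator gives the same "random" numbers on every run.
rng = np.random.default_rng(0)
sample = rng.random((2, 3))  # uniform values in [0, 1)

print(total, col_min, row_max, exp_x.shape, sample.shape)
```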

Live demo next
PANDAS will be in a recorded lecture
