COMP9414: Artificial Intelligence

Lecture 6c: Data Science

Wayne Wobcke

e-mail: w. .au

UNSW ©W. Wobcke et al. 2019–2021

Overview

• Methodology
• Bias
• Overfitting
• Combining Datasets
• Slicing and Dicing
• Validation




Feature Engineering

Example: mobile phone data includes the locations of cell towers

• Location is Angkor Wat and stay is 1 day ⇒ tourist? (see the sketch below)
• Or, journey is “similar to” typical tourist trips ⇒ tourist
• Location is a shopping centre ⇒ shopping (if not home)?
• Most frequently called person ⇒ spouse? (if married)
• Spouse ⇒ opposite gender (use as a check)
• Location is a port and person is a truck driver ⇒ shipment
• Destination(s) of the truck ⇒ type of shipment?

Methodology: emphasis on dealing with multiple levels of uncertainty
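The rule-of-thumb features above can be prototyped directly on call records. Below is a minimal sketch only, assuming a hypothetical pandas DataFrame of call-detail records with illustrative column names (user_id, tower_loc, timestamp); the location labels and the one-day threshold are placeholders, not part of the lecture.

import pandas as pd

# Hypothetical call-detail records (column names and values are illustrative only).
cdr = pd.DataFrame({
    "user_id":   [1, 1, 1, 2, 2],
    "tower_loc": ["Angkor Wat", "Angkor Wat", "Phnom Penh", "Shopping Centre", "Home"],
    "timestamp": pd.to_datetime(["2021-03-01 09:00", "2021-03-01 17:00",
                                 "2021-03-02 10:00", "2021-03-05 12:00",
                                 "2021-03-05 20:00"]),
})

def engineer_features(cdr):
    """Derive per-user features of the kind suggested above."""
    feats = pd.DataFrame(index=cdr["user_id"].unique())
    # Distinct days seen at the tourist site: a one-day stay suggests "tourist".
    at_site = cdr[cdr["tower_loc"] == "Angkor Wat"]
    days = at_site.groupby("user_id")["timestamp"].agg(lambda t: t.dt.normalize().nunique())
    feats["days_at_angkor_wat"] = days.reindex(feats.index, fill_value=0)
    feats["maybe_tourist"] = feats["days_at_angkor_wat"] == 1
    # Visits to a shopping centre, excluding the (assumed known) home tower.
    shop = (cdr["tower_loc"] == "Shopping Centre").groupby(cdr["user_id"]).sum()
    feats["shopping_visits"] = shop.reindex(feats.index, fill_value=0)
    return feats

print(engineer_features(cdr))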


Data Science Methodology

• Methodology: what statistics/machine learning textbooks cover
  ◮ Methods, models, theorems, estimators, techniques, tools
• Meta-methodology: the knowledge and practices that support this
  ◮ How is it decided what “concepts” to measure?
  ◮ How is it decided how these concepts are defined?
  ◮ How is it decided how these concepts are measured (with what data)?
  ◮ How is the robustness or reliability of results checked?
  ◮ How are the results validated (internally and externally)?
  ◮ How do the results influence policy/decision making?

Meta-methodology receives little emphasis in textbooks, but is very important to learn


What Data Science is Not (A Caricature)

• Choose a complex concept/statistic/indicator to measure
  ◮ Poverty/wealth indicators, food security map
• Choose a number of large-ish datasets
  ◮ Mobile phone data, satellite data, admin data, survey data
• Choose a number of “covariates” in addition
  ◮ Nighttime lights, land use, etc.
• Throw all the data into a standard method in R/Python, … (see the sketch below)
  ◮ Decision Trees, Random Forests, XGBoost, Neural Networks, …
• Get mixed results (to the extent they are validated …)
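For concreteness, this is roughly the pattern being caricatured, as a minimal sketch on synthetic data; the choice of a random forest via scikit-learn stands in for any of the methods listed and is not drawn from the lecture.

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for "all the data": assorted covariates concatenated with
# no thought about what they measure or how they relate to the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 200))   # 500 districts, 200 assorted covariates
y = rng.normal(size=500)          # some poverty indicator

model = RandomForestRegressor(n_estimators=200, random_state=0)
scores = cross_val_score(model, X, y, cv=5, scoring="r2")
print(scores.mean())  # a number comes out either way; by itself it says little about validity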


Bias

Bias = a method's propensity to generalize in a particular way (for good or bad)

• Dataset is not representative of the population (see the sketch below)
  ◮ Only people in areas with phone towers have phones
  ◮ Only people who are literate can send text messages
  ◮ Only poorer people need “access” to phone credits
• Training data “discriminates” against certain groups
  ◮ Learner trained only on white male faces
• Learner generalizes from the “wrong” features
  ◮ White background (all the snow leopard pictures were taken in winter)
• Learner “misses” relevant features
  ◮ Seasonal effects of population movement (food shortages)
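One simple check for the first kind of bias, as a minimal sketch: compare the composition of the training sample against known population shares. The figures and the use of a chi-square goodness-of-fit test are illustrative assumptions, not from the lecture.

import numpy as np
from scipy.stats import chisquare

# Hypothetical sample composition vs. census shares (all numbers made up).
train_counts = np.array([620, 140, 40])           # urban, rural with towers, remote
population_share = np.array([0.55, 0.30, 0.15])   # e.g. from census data

expected = population_share * train_counts.sum()
stat, p_value = chisquare(train_counts, f_exp=expected)
print(f"chi-square = {stat:.1f}, p = {p_value:.3g}")
# A tiny p-value flags that remote areas (no phone towers) are badly under-sampled,
# so a model trained on this data may not generalize to them.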


Human Element of Data Science

Essential when data is limited in quality or quantity (most of the time)

• Human suggests relevant features
  ◮ A protest is less likely to be violent if the venue is private
  ◮ AfPak ontology of events of interest to conflict progression
• Human defines useful indicators
  ◮ A village is safe if its market is open at night
• Human validates model output (see the sketch below)
  ◮ Check agreement with the model on a 15% random sample
  ◮ Verify the main features used by the model
  ◮ Define a baseline for comparative performance
  ◮ Cross-check model output against other datasets
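A minimal sketch of the 15% review sample and the comparative baseline, assuming a hypothetical table of per-village model predictions (all names and sizes are illustrative): draw a random sample for human review, then compare model agreement against a majority-class baseline once the human labels are in.

import numpy as np
import pandas as pd

# Hypothetical model output: one predicted label per village.
preds = pd.DataFrame({
    "village_id": range(200),
    "model_label": np.random.default_rng(1).choice(["safe", "unsafe"], size=200, p=[0.7, 0.3]),
})

# Step 1: draw a 15% random sample for human review.
review = preds.sample(frac=0.15, random_state=0).copy()
review["human_label"] = None  # to be filled in by the analyst after field checks

# Step 2: once human labels exist, compute agreement and a trivial baseline.
def agreement(df):
    return (df["model_label"] == df["human_label"]).mean()

def baseline_accuracy(df):
    # Predict the most common human label everywhere; the model should beat this.
    majority = df["human_label"].mode()[0]
    return (df["human_label"] == majority).mean()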


Local Contextual Assumptions

Food Consumption Score

• 2100 calories per day, estimated by weighting food types (see the sketch below)
• Weights are motivated, but oil and sugar “need adjustment”
• Locally validated (seasonal effects, local variations)
  ◮ North Sudan vs South Sudan
  ◮ Seasonal variation in Cameroon
• Correlate with other measures (admin data, surveys)

Ideally measures capacity(?), not behaviour

Impossible to learn even with a lot of data; expertise is needed
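A minimal sketch of a weighted food consumption score. The weights below are the commonly cited WFP food-group weights (note the down-weighted sugar and oil, the groups said to “need adjustment”); the exact weights, thresholds and any local corrections are assumptions here and should come from the locally validated specification.

# Commonly cited WFP food-group weights (assumed here, not taken from the lecture).
FOOD_GROUP_WEIGHTS = {
    "staples": 2.0, "pulses": 3.0, "vegetables": 1.0, "fruit": 1.0,
    "meat_fish": 4.0, "milk": 4.0, "sugar": 0.5, "oil": 0.5,
}

def food_consumption_score(days_eaten):
    """days_eaten: food group -> days consumed in the last 7 days (0-7)."""
    return sum(FOOD_GROUP_WEIGHTS[g] * min(d, 7) for g, d in days_eaten.items())

# Example household recall over the last 7 days (invented figures):
household = {"staples": 7, "pulses": 2, "vegetables": 5, "fruit": 1,
             "meat_fish": 1, "milk": 0, "sugar": 6, "oil": 7}
print(food_consumption_score(household))  # 14 + 6 + 5 + 1 + 4 + 0 + 3 + 3.5 = 36.5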


Overfitting

Overfitting = fitting the given data too closely, so the model does not work in other contexts

Example: how not to measure a wealth index (Blumenstock et al. 2015)

• Mobile phone data with 5088 features and 856 labelled examples
• Choose features based on the whole dataset, not the training set (see the sketch below)
• Don't consider what is Rwanda-specific about this data
• Use non-standard methodology drawn from another paper
• Ignore sensible (human-generated) baselines
• 5-fold cross-validation produces 5 models, not one

Claim(?): Many neural network/deep learning models overfit
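The feature-selection point is the one most easily shown in code. The sketch below uses synthetic noise data and scikit-learn, which is my own illustration rather than the paper's methodology: selecting features on the whole dataset before cross-validating leaks information and inflates the score, while selecting inside each training fold does not.

import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Synthetic stand-in for "many features, few labelled examples".
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2000))
y = rng.normal(size=300)          # pure noise: nothing should be predictable

# Leaky: select features on the *whole* dataset, then cross-validate.
keep = SelectKBest(f_regression, k=20).fit(X, y).get_support()
leaky = cross_val_score(Ridge(), X[:, keep], y, cv=5, scoring="r2")

# Correct: feature selection happens inside each training fold.
pipe = make_pipeline(SelectKBest(f_regression, k=20), Ridge())
honest = cross_val_score(pipe, X, y, cv=5, scoring="r2")

print(f"leaky R^2  : {leaky.mean():.2f}")   # optimistically high
print(f"honest R^2 : {honest.mean():.2f}")  # near or below zero, as it should be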


Pipelined Processes

• ADB poverty mapping (land use → regression)
• Errors in Phase 1 are most likely systematic, not random (see the sketch below)
  ◮ Gauss-Markov assumptions do not hold
  ◮ Need to estimate the error empirically rather than rely on theory
  ◮ Relies on a “ground-truth” dataset
• Methods vs models
  ◮ Works (better) for the Philippines, not Thailand: why?
  ◮ Trade off generality of the method against “local validation”
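A minimal sketch of what "estimate the error empirically" might mean for such a pipeline, on invented numbers: compare the Phase 1 land-use estimates against a small ground-truth sample and look for systematic region-level bias that would propagate into the Phase 2 regression. Column names and figures are hypothetical, not from the ADB work.

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical ground-truth sample: true built-up share for 100 villages.
truth = pd.DataFrame({
    "region": ["A"] * 50 + ["B"] * 50,
    "true_built": rng.uniform(0.2, 0.6, 100),
})

# Stage-1 estimates, with a deliberate systematic over-estimate in region B.
truth["est_built"] = (truth["true_built"]
                      + np.where(truth["region"] == "B", 0.10, 0.0)
                      + rng.normal(0, 0.02, 100))

bias_by_region = (truth["est_built"] - truth["true_built"]).groupby(truth["region"]).mean()
print(bias_by_region)
# Region B shows a ~0.10 systematic bias: errors do not average out, and any
# regression built on the stage-1 output inherits this bias.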


Combining Datasets

Using only one type of data is insufficient for many purposes

• Especially social media data (Twitter, Facebook)
• Especially with complex metrics and indicators
  ◮ Population health estimated from images of a hospital car park
  ◮ Rainfall locations and amounts estimated from satellite data
• Need triangulation/corroboration, not increased uncertainty
  ◮ Need to “correlate” independent data sources (see the sketch below)
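A minimal sketch of such a corroboration check, on made-up numbers: compare a satellite-derived rainfall estimate against independent rain-gauge (admin) readings for the same districts. The Spearman rank correlation is my choice of measure, not something specified in the lecture.

import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)

# Hypothetical readings for the same 30 districts from two independent sources.
gauge = rng.gamma(shape=2.0, scale=30.0, size=30)          # rain gauges (mm)
satellite = gauge * rng.normal(1.0, 0.15, size=30) + 5.0   # noisy, slightly biased estimate

rho, p = spearmanr(satellite, gauge)
print(f"Spearman rho = {rho:.2f} (p = {p:.3g})")
# Strong rank agreement corroborates both sources; weak agreement means at least
# one of them cannot be trusted for this purpose, and combining them would only
# add uncertainty rather than reduce it.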


Validation


Slicing and Dicing

• Data may only be reliable in certain contexts
  ◮ May be able to determine that an event occurred, but not its details
  ◮ Sentiment analysis is notoriously inaccurate
• May want to analyse subgroups by region, status, etc.
  ◮ “Big data” can soon become “small data”
  ◮ Need statistical methods to assess reliability (see the sketch below)
  ◮ Map the quality of the data to the quality of the resulting decision
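A minimal sketch of such a reliability check, on synthetic data: after slicing by region and status, report the size of each slice and an interval around a simple event rate. The Jeffreys (Beta) interval is one reasonable choice among several; column names and proportions are invented.

import numpy as np
import pandas as pd
from scipy.stats import beta

rng = np.random.default_rng(0)
n = 10_000
df = pd.DataFrame({
    "region": rng.choice(["North", "South", "East"], size=n, p=[0.6, 0.3, 0.1]),
    "status": rng.choice(["settled", "displaced"], size=n, p=[0.9, 0.1]),
    "event":  rng.random(n) < 0.05,
})

for (region, status), g in df.groupby(["region", "status"]):
    k, m = int(g["event"].sum()), len(g)
    lo, hi = beta.ppf([0.025, 0.975], k + 0.5, m - k + 0.5)  # 95% Jeffreys interval
    print(f"{region:5s}/{status:9s}: n = {m:5d}, rate = {k/m:.3f}  [{lo:.3f}, {hi:.3f}]")
# Small slices (e.g. East/displaced) end up with wide intervals: the "big data"
# has quietly become small data, and decisions should reflect that uncertainty.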


Conclusion

Is data fit for (what) purpose?

• No model is ever perfect (especially learned models)
• Statistical correlations are usually very weak
• Contextualize models to local circumstances
• Cross-check model outputs against other datasets
• Express the uncertainty associated with conclusions/decisions
• “Big data” methods can provide “early warning” signals
• Complement traditional measures with different time scales
• Continually validate models as assumptions vary
