Data Science @ RPI http://www.cs.rpi.edu/research/groups/datascience/
MD-MIS 637-Fall 2020
MIS 637: Data Analytics and Machine Learning
School of Business
Introduction Continued
Fall 2020
Intro from the Text:
Data Mining and Analysis: Fundamental Concepts and Algorithms, Mohammed J. Zaki and Wagner Meira, Jr.,
Cambridge University Press, 2014
Modified by MD
Traditional Hypothesis Driven Research
Hypothesis → (Design) → Experiment → Data → (Data analysis) → Result
Data Driven Science
Process/Experiment
Data
No Prior Hypothesis
New Science of Data
Bioinformatics
Datasets:
Genomes
Protein structure
DNA/Protein arrays
Interaction Networks
Pathways
Metagenomics
Integrative Science
Systems Biology
Network Biology
Astro-Informatics: US National Virtual Observatory (NVO)
New Astronomy
Local vs. Distant Universe
Rare/exotic objects
Census of active galactic nuclei
Search for extra-solar planets
Turn anyone into an astronomer
Ecological Informatics
Analyze complex ecological data from a highly-distributed set of field stations, laboratories, research sites, and individual researchers
Geo-Informatics/Data
Temporal-Data: Services in WSN (Wireless Sensor Networks)
[Figure: a sample data communication in conventional networks versus a sample data communication in WSN. Labels: data sender, data receiver, sink node, gateway, core network (e.g. Internet), end-user; example payloads: “Some bits 01100011100” and “Fire!”]
Cheminformatics
AAACCTCATAGGAAGCATACCAGGAATTACATCA…
Structural Descriptors
Physicochemical Descriptors
Topological Descriptors
Geometrical Descriptors
Materials Informatics
Temporal-Data: Economics & Finance
World Wide Web
What is Data Analytics & Machine Learning?
The iterative and interactive process of discovering valid, novel, useful, and understandable patterns or models in massive databases
What is DA & ML?
Valid: generalize to the future
Novel: what we don’t know
Useful: be able to take some action
Understandable: leading to insight
Iterative: takes multiple passes
Interactive: human in the loop
Why DA & ML?
Massive amounts of data being collected in different disciplines
Biology, Chemistry, Materials science, Astronomy, Ecology, Geology, Economics, and many more
Search for a systematic way to address the challenges across/at the intersection of the diverse fields
Leverage the unique strengths of each area
Techniques from bioinformatics can be applied to other areas (like network intrusion detection)
Game theory from Economics can be applied to problems in CS
Database development in Astronomy can help Ecology applications
Enable Data-informatics: bio-, chem-, eco-, geo-, astro-, materials- informatics
Why DA & ML?
Dynamic nature of modern data sets: streams
Massive and distributed datasets: tera-/peta-scale
Various modalities:
Tables
Images
Video
Audio
Text, hyper-text, “semantic” text
Networks
Spreadsheets
Multi-lingual
Data Analytics: Main Goals
Prediction: What? (Opaque)
Description: Why? (Transparent)
[Figure labels: Model; Age, Salary, CarType; High/Low Risk; outlier]
Data Analytics & Machine Learning: Main Techniques
Association rules: detect sets of attributes that frequently co-occur, and rules among them, e.g., 90% of the people who buy book X also buy book Y (the rule's confidence), and 10% of all shoppers buy both (its support).
Sequence mining (categorical): discover sequences of events that commonly occur together, e.g., in a set of DNA sequences, ACGTC is followed by GTCA after a gap of 9, with 30% probability.
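The percentages in the association-rule example can be reproduced by simple counting. The sketch below is illustrative only: the basket data are hypothetical, chosen so that the rule "X implies Y" comes out at 10% support and 90% confidence, and the function names `support` and `confidence` are our own, not from the text.

```python
# Hypothetical baskets: 9 shoppers buy X and Y, 1 buys only X, 80 buy only Z.
baskets = [{"X", "Y"}] * 9 + [{"X"}] + [{"Z"}] * 80

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Estimate P(consequent | antecedent) as support(A and C) / support(A)."""
    both = support(set(antecedent) | set(consequent), transactions)
    return both / support(antecedent, transactions)

rule_support = support({"X", "Y"}, baskets)          # share of all shoppers buying both
rule_confidence = confidence({"X"}, {"Y"}, baskets)  # share of X buyers who also buy Y
print(f"support={rule_support:.2f}, confidence={rule_confidence:.2f}")
```

Real association-rule miners avoid scanning every candidate itemset this way; the counting above only shows what the two numbers mean.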
Data Mining & Analytics: Main Techniques
Classification and regression: classification assigns a new data record to one of several predefined categories or classes; regression predicts real-valued fields. Both are also called supervised learning.
Clustering: partition the dataset into subsets or groups such that elements of a group share a common set of properties, with high within-group similarity and low between-group similarity. Also called unsupervised learning.
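To make the supervised/unsupervised contrast concrete, here is a minimal sketch using two stand-in techniques that the slides do not name: a 1-nearest-neighbor classifier (uses labels) and Lloyd's k-means (uses no labels). All data values, labels, and starting centroids are hypothetical.

```python
import math

def nearest_neighbor_label(query, labeled_points):
    """Classification: return the label of the closest labeled training point."""
    _, label = min(labeled_points, key=lambda pl: math.dist(query, pl[0]))
    return label

def kmeans(points, centroids, iterations=10):
    """Clustering: Lloyd's k-means, starting from the given centroids."""
    clusters = []
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            idx = min(range(len(centroids)),
                      key=lambda i: math.dist(p, centroids[i]))
            clusters[idx].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [
            tuple(sum(c) / len(cluster) for c in zip(*cluster)) if cluster
            else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters

# Supervised: hypothetical (age, salary in thousands) records with known risk labels.
training = [((25, 30), "high"), ((30, 35), "high"),
            ((50, 90), "low"), ((55, 95), "low")]
predicted = nearest_neighbor_label((52, 88), training)

# Unsupervised: the same kind of points, but no labels are given.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
          (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centers, groups = kmeans(points, centroids=[(1.0, 1.0), (8.0, 8.0)])
```

The classifier needs the risk labels to answer a query, while k-means discovers the two groups from distances alone.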
Data Mining & Analytics: Main Techniques
Deviation detection: find the record(s) most different from the other records, i.e., find all outliers. These may be discarded as noise or may be the “interesting” ones.
Similarity search: given a database of objects and a “query” object, find the object(s) within a user-defined distance of the query object, or find all pairs within some distance of each other.
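Both tasks above can be sketched in a few lines. The choices here are assumptions made for illustration: outliers are flagged with a z-score threshold (one of many possible definitions of "most different"), distance is Euclidean, and all data values are hypothetical.

```python
import math
import statistics
from itertools import combinations

def zscore_outliers(values, threshold=2.0):
    """Deviation detection: values more than `threshold` std devs from the mean."""
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)
    if stdev == 0:
        return []
    return [v for v in values if abs(v - mean) / stdev > threshold]

def range_query(query, objects, radius):
    """Similarity search: all objects within `radius` of the query object."""
    return [obj for obj in objects if math.dist(query, obj) <= radius]

def close_pairs(objects, radius):
    """All pairs of objects within `radius` of each other."""
    return [(a, b) for a, b in combinations(objects, 2)
            if math.dist(a, b) <= radius]

readings = [10, 11, 9, 10, 12, 10, 11, 50]
outliers = zscore_outliers(readings)

objects = [(1, 1), (3, 4), (0, 2), (6, 8)]
neighbors = range_query((0, 0), objects, radius=2.5)
pairs = close_pairs(objects, radius=2.5)
```

Whether the flagged reading is noise or the interesting record is a domain decision, which is why the process keeps a human in the loop.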
Knowledge Discovery Process
Original Data → (Selection) → Target Data → (Preprocessing) → Preprocessed Data → (Transformation) → Transformed Data → (Data Analytics) → Patterns → (Interpretation) → Knowledge
Data Mining & Analytics Process
Understand application domain
Prior knowledge, user goals
Create target dataset
Select data, focus on subsets
Data cleaning and transformation
Remove noise and outliers, handle missing values
Select features, reduce dimensions
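As one possible illustration of the cleaning and transformation step, the sketch below fills missing values with the column mean and rescales a column to [0, 1]. Mean imputation and min-max scaling are assumed stand-ins, not techniques named in the slides, and the `ages` column is hypothetical.

```python
def impute_mean(column):
    """Replace missing values (None) with the mean of the observed values."""
    observed = [v for v in column if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in column]

def min_max_scale(column):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(column), max(column)
    if hi == lo:
        # A constant column carries no information; map it to all zeros.
        return [0.0 for _ in column]
    return [(v - lo) / (hi - lo) for v in column]

ages = [20, None, 40, 60]       # hypothetical column with one missing value
clean = impute_mean(ages)       # missing entry becomes the mean of 20, 40, 60
scaled = min_max_scale(clean)   # smallest value maps to 0.0, largest to 1.0
```

Scaling matters for the distance-based techniques above, since an unscaled attribute with a large range would otherwise dominate every distance.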
Data Mining Process
Apply data mining algorithm
Associations, sequences, classification, clustering, etc.
Interpret, evaluate and visualize patterns
What’s new and interesting?
Iterate if needed
Manage discovered knowledge
Close the loop
Components of Data Mining Methods
Representation: language for patterns/models, expressive power
Evaluation: scoring methods for deciding what is a good fit of model to data
Search: method for enumerating patterns/models
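The three components can be seen side by side in a toy example. This is a sketch under assumed choices: the representation is a line through the origin, the evaluation is sum of squared errors, and the search is a plain grid over candidate slopes. The data points and the candidate grid are hypothetical.

```python
def predict(slope, x):
    """Representation: the model family is lines through the origin, y = slope * x."""
    return slope * x

def sum_squared_error(slope, data):
    """Evaluation: score the fit of a candidate model to the data (lower is better)."""
    return sum((y - predict(slope, x)) ** 2 for x, y in data)

def grid_search(data, candidates):
    """Search: enumerate candidate models and keep the best-scoring one."""
    return min(candidates, key=lambda s: sum_squared_error(s, data))

data = [(1, 2.1), (2, 3.9), (3, 6.2)]        # hypothetical (x, y) observations
candidates = [i / 10 for i in range(0, 41)]  # slopes 0.0, 0.1, ..., 4.0
best_slope = grid_search(data, candidates)
```

Changing any one component gives a different method: a richer representation (e.g. adding an intercept), a different score, or a smarter search than exhaustive enumeration.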
New Science of Data
New data models: dynamic, streaming, etc.
New mining, learning, and statistical algorithms that offer timely and reliable inference and information extraction: online, approximate
Self-aware, intelligent continuous data monitoring and management
Data and model compression
Data provenance
Data security and privacy
Data sensation: visual, aural, tactile
Knowledge validation: domain experts
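As an example of what an "online" algorithm for streaming data can look like, the sketch below keeps a running mean and variance using Welford's method, updating in constant time and memory per value instead of storing the stream. The class name `RunningStats` and the sample readings are our own illustrative choices.

```python
class RunningStats:
    """Welford's online algorithm: update mean and variance one value at a time."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the current mean

    def update(self, value):
        """Fold one new observation into the running statistics."""
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (value - self.mean)

    @property
    def variance(self):
        """Population variance of the values seen so far (0.0 if none)."""
        return self._m2 / self.count if self.count else 0.0

stats = RunningStats()
for reading in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:  # hypothetical stream
    stats.update(reading)
print(stats.mean, stats.variance)
```

The same pattern (a small summary updated per arriving record) underlies many streaming and approximate methods, though each statistic needs its own update rule.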
Data Science Core Areas
Data Analytics and Machine Learning
Mathematical Modeling and Optimization
Databases and Data warehousing
High Performance Computing
Data Compression/Representation
Statistics, Algebra, and Geometry
Visualization, Sonification
Social/ethical/legal Dimensions
Application Domains
Biology, medicine, chemistry, astronomy, finance, economics, geology, environment, materials, large-scale simulations, national security, WWW