Data Science @ RPI http://www.cs.rpi.edu/research/groups/datascience/

MD-MIS 637-Fall 2020
MIS 637: Data Analytics and Machine Learning

School of Business

Introduction Continued

Fall 2020

Intro from the Text:

Data Mining and Analysis: Fundamental Concepts and Algorithms, Mohammed J. Zaki and Wagner Meira Jr.,
Cambridge University Press, 2014
Modified by MD


Traditional Hypothesis Driven Research
Hypothesis → Experiment → Data → Result
Diagram labels: Design, Data analysis

Data Driven Science
Process/Experiment

Data
No Prior Hypothesis
New Science of Data

Bioinformatics
Datasets:
Genomes
Protein structure
DNA/Protein arrays
Interaction Networks
Pathways
Metagenomics
Integrative Science
Systems Biology
Network Biology


Astro-Informatics: US National Virtual Observatory (NVO)

New Astronomy
Local vs. Distant Universe
Rare/exotic objects
Census of active galactic nuclei
Search for extra-solar planets
Turn anyone into an astronomer


Ecological Informatics

Analyze complex ecological data from a highly-distributed set of field stations, laboratories, research sites, and individual researchers


Geo-Informatics/Data


Temporal-Data: Services in WSN (Wireless Sensor Networks)

Figure: A sample data communication in conventional networks. Labels: Data Sender, Data Receiver, "Some bits 01100011100".
Figure: A sample data communication in WSN. Labels: Sink node, Gateway, Core network (e.g. Internet), End-user, "Fire!".

Cheminformatics

AAACCTCATAGGAAGCATACCAGGAATTACATCA…
Structural Descriptors
Physicochemical Descriptors
Topological Descriptors
Geometrical Descriptors


Materials Informatics


Temporal-Data: Economics & Finance


World Wide Web


What is Data Analytics & Machine Learning?
The iterative and interactive process of discovering valid, novel, useful, and understandable patterns or models in massive databases

What is DA & ML?
Valid: generalize to the future
Novel: what we don’t know
Useful: be able to take some action
Understandable: leading to insight
Iterative: takes multiple passes
Interactive: human in the loop


Why DA & ML?
Massive amounts of data being collected in different disciplines
Biology, Chemistry, Materials science, Astronomy, Ecology, Geology, Economics, and many more
Search for a systematic way to address the challenges across/at the intersection of the diverse fields
Leverage the unique strengths of each area
Techniques from bioinformatics can be applied to other areas (like network intrusion detection)
Game theory from Economics can be applied to problems in CS
Database development in Astronomy can help Ecology applications
Enable Data-informatics: bio-, chem-, eco-, geo-, astro-, materials- informatics

Why DA & ML?
Dynamic nature of modern data sets: streams
Massive and distributed datasets: tera-/peta-scale
Various modalities:
Tables
Images
Video
Audio
Text, hyper-text, “semantic” text
Networks
Spreadsheets
Multi-lingual


Data Analytics: Main Goals
Prediction: What? Opaque
Description: Why? Transparent

Figure: a Model takes Age, Salary, and CarType as inputs and outputs High/Low Risk; one point is marked as an outlier.


Data Analytics & Machine Learning: Main Techniques
Association rules: detect sets of attributes that frequently co-occur, and rules among them, e.g., 90% of the people who buy book X also buy book Y (the rule's confidence), and 10% of all shoppers buy both (its support)
Sequence mining (categorical): discover sequences of events that commonly occur together, e.g., in a set of DNA sequences, ACGTC is followed by GTCA after a gap of 9, with 30% probability
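
The book X / book Y example above can be read as a rule X => Y scored by two numbers: support (how often X and Y appear together) and confidence (how often Y appears given X). The following is a minimal sketch of those two measures, not a full rule-mining algorithm. The function names and the toy transaction list are hypothetical and chosen only to reproduce the 90% / 10% figures.

```python
from typing import FrozenSet, Iterable, List


def support(transactions: List[FrozenSet[str]], items: Iterable[str]) -> float:
    """Fraction of transactions that contain every item in `items`."""
    wanted = frozenset(items)
    hits = sum(1 for t in transactions if wanted <= t)
    return hits / len(transactions)


def confidence(transactions: List[FrozenSet[str]],
               antecedent: Iterable[str],
               consequent: Iterable[str]) -> float:
    """Estimate P(consequent | antecedent) as support(A and C) / support(A)."""
    antecedent = frozenset(antecedent)
    base = support(transactions, antecedent)
    if base == 0:
        raise ValueError("antecedent never occurs; confidence is undefined")
    return support(transactions, antecedent | frozenset(consequent)) / base


# Hypothetical toy data: 90 shoppers, 9 buy both X and Y, 1 buys only X,
# and 80 buy something else. So 10% buy both, and 9 of the 10 buyers of X
# also buy Y.
transactions = ([frozenset({"X", "Y"})] * 9
                + [frozenset({"X"})]
                + [frozenset({"Z"})] * 80)

print(f"support(X, Y)      = {support(transactions, {'X', 'Y'}):.2f}")
print(f"confidence(X => Y) = {confidence(transactions, {'X'}, {'Y'}):.2f}")
```

A real miner would enumerate candidate itemsets and keep only those above user-chosen support and confidence thresholds; this sketch only scores one given rule.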


Data Mining & Analytics: Main Techniques
Classification and regression: classification assigns a new data record to one of several predefined categories or classes; regression predicts real-valued fields. Both are forms of supervised learning.
Clustering: partition the dataset into subsets or groups such that elements of a group share a common set of properties, with high within-group similarity and low between-group similarity. Also called unsupervised learning.
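
To make the supervised / unsupervised contrast concrete, here is a minimal sketch in plain Python. The two techniques are illustrative choices, not ones named on the slide: a nearest-centroid classifier (labels are given) and k-means via Lloyd's algorithm (labels are not given). All names and the toy points are hypothetical.

```python
import math
import random
from typing import Dict, List, Sequence, Tuple

Point = Tuple[float, ...]


def centroid(points: Sequence[Point]) -> Point:
    """Component-wise mean of a non-empty collection of points."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))


def nearest(point: Point, centers: Sequence[Point]) -> int:
    """Index of the center closest to `point` (Euclidean distance)."""
    return min(range(len(centers)), key=lambda i: math.dist(point, centers[i]))


def fit_nearest_centroid(labeled: Dict[str, List[Point]]) -> Dict[str, Point]:
    """Supervised: compute one centroid per predefined class label."""
    return {label: centroid(pts) for label, pts in labeled.items()}


def classify(model: Dict[str, Point], point: Point) -> str:
    """Assign `point` to the class whose centroid is closest."""
    labels = list(model)
    return labels[nearest(point, [model[name] for name in labels])]


def kmeans(points: List[Point], k: int, iters: int = 20, seed: int = 0) -> List[int]:
    """Unsupervised: Lloyd's algorithm; returns a cluster index per point."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # start from k distinct data points
    assign = [0] * len(points)
    for _ in range(iters):
        assign = [nearest(p, centers) for p in points]
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:  # keep the old center if a cluster goes empty
                centers[c] = centroid(members)
    return assign


# Hypothetical toy data: two well-separated groups in the plane.
low = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1)]
high = [(5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]

model = fit_nearest_centroid({"low": low, "high": high})
print(classify(model, (1.1, 1.0)))   # uses the given class labels
print(kmeans(low + high, k=2))       # discovers groups without labels
```

The classifier needs labeled examples to build its model, while k-means sees only the points and returns arbitrary cluster indices that carry no class meaning by themselves.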


Data Mining & Analytics: Main Techniques
Deviation detection: find the records that differ most from the other records, i.e., find the outliers. These may be thrown away as noise or may be the “interesting” ones.
Similarity search: given a database of objects, and a “query” object, find the object(s) that are within a user-defined distance of the queried object, or find all pairs within some distance of each other.
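
Both tasks above can be sketched in a few lines. The sketch below assumes a simple z-score rule for deviation detection and a linear scan with Euclidean distance for similarity search. These are illustrative choices, and the function names, threshold, and data are hypothetical.

```python
import math
import statistics
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def zscore_outliers(values: Sequence[float], threshold: float = 2.0) -> List[int]:
    """Indices of values more than `threshold` standard deviations from the mean."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:  # all values identical: nothing deviates
        return []
    return [i for i, v in enumerate(values) if abs(v - mean) / std > threshold]


def range_query(db: Sequence[Point], query: Point, radius: float) -> List[int]:
    """Indices of database objects within `radius` of the query object."""
    return [i for i, p in enumerate(db) if math.dist(p, query) <= radius]


def similar_pairs(db: Sequence[Point], radius: float) -> List[Tuple[int, int]]:
    """All index pairs (i, j) with i < j whose distance is at most `radius`."""
    return [(i, j)
            for i in range(len(db))
            for j in range(i + 1, len(db))
            if math.dist(db[i], db[j]) <= radius]


# Hypothetical toy data.
readings = [10, 11, 9, 10, 12, 10, 11, 50]
db = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (5.0, 5.0)]

print(zscore_outliers(readings))          # index of the reading that stands out
print(range_query(db, (0.0, 0.0), 1.5))   # objects near the query
print(similar_pairs(db, 1.5))             # all close pairs
```

The linear scan costs one distance computation per object (and one per pair for `similar_pairs`); large databases typically rely on an index structure instead.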


Knowledge Discovery Process

Original Data → (Selection) → Target Data → (Preprocessing) → Preprocessed Data → (Transformation) → Transformed Data → (Data Analytics) → Patterns → (Interpretation) → Knowledge


Data Mining & Analytics Process
Understand application domain
Prior knowledge, user goals
Create target dataset
Select data, focus on subsets
Data cleaning and transformation
Remove noise and outliers, handle missing values
Select features, reduce dimensions
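
The cleaning and transformation steps above might look like the following minimal sketch. It assumes records are plain dictionaries, treats `None` as a missing value, drops incomplete records (one option among several), and rescales with min-max normalization. The field names echo the Age/Salary example from the earlier slide; the values and the `id` field are hypothetical.

```python
from typing import Dict, List, Optional, Sequence

Record = Dict[str, Optional[float]]


def drop_missing(records: List[Record]) -> List[Record]:
    """Keep only records in which every field has a value."""
    return [r for r in records if all(v is not None for v in r.values())]


def select_features(records: List[Record], keep: Sequence[str]) -> List[Record]:
    """Reduce dimensions by keeping only the named fields."""
    return [{name: r[name] for name in keep} for r in records]


def min_max_scale(records: List[Record]) -> List[Dict[str, float]]:
    """Rescale every field to [0, 1]; constant fields map to 0.0."""
    if not records:
        return []
    fields = list(records[0])
    lo = {f: min(r[f] for r in records) for f in fields}
    hi = {f: max(r[f] for r in records) for f in fields}
    return [
        {f: (r[f] - lo[f]) / (hi[f] - lo[f]) if hi[f] > lo[f] else 0.0
         for f in fields}
        for r in records
    ]


# Hypothetical raw data with one incomplete record.
raw = [
    {"id": 1, "age": 25, "salary": 30000},
    {"id": 2, "age": None, "salary": 52000},
    {"id": 3, "age": 45, "salary": 90000},
    {"id": 4, "age": 35, "salary": 60000},
]

clean = drop_missing(raw)
features = select_features(clean, ["age", "salary"])
scaled = min_max_scale(features)
print(scaled)
```

Dropping records is the bluntest way to handle missing values; imputing them is a common alternative when data is scarce.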


Data Mining Process
Apply data mining algorithm
Associations, sequences, classification, clustering, etc.
Interpret, evaluate and visualize patterns
What’s new and interesting?
Iterate if needed
Manage discovered knowledge
Close the loop


Components of Data Mining Methods
Representation: language for patterns/models, expressive power
Evaluation: scoring methods for deciding what is a good fit of model to data
Search: method for enumerating patterns/models
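
One way to see the three components side by side is a deliberately tiny example. The sketch below assumes straight lines as the model language (representation), the sum of squared errors as the score (evaluation), and exhaustive enumeration over a small grid of candidates (search). All names, the grid, and the data are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

Data = List[Tuple[float, float]]


def make_line(slope: float, intercept: float) -> Callable[[float], float]:
    """Representation: models are lines of the form y = slope * x + intercept."""
    return lambda x: slope * x + intercept


def squared_error(model: Callable[[float], float], data: Data) -> float:
    """Evaluation: sum of squared residuals; lower means a better fit."""
    return sum((model(x) - y) ** 2 for x, y in data)


def grid_search(data: Data,
                slopes: Sequence[float],
                intercepts: Sequence[float]) -> Tuple[float, float]:
    """Search: enumerate candidate models and keep the best-scoring one."""
    candidates = ((s, b) for s in slopes for b in intercepts)
    return min(candidates, key=lambda sb: squared_error(make_line(*sb), data))


# Hypothetical data generated from y = 2x + 1.
data = [(0, 1), (1, 3), (2, 5), (3, 7)]
best = grid_search(data, slopes=[0, 1, 2, 3], intercepts=[0, 1, 2])
print(best)
```

Swapping any one component (a richer model family, a different score, a smarter search such as gradient descent) gives a different method while the other two stay the same.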


New Science of Data
New data models: dynamic, streaming, etc.
New mining, learning, and statistical algorithms that offer timely and reliable inference and information extraction: online, approximate
Self-aware, intelligent continuous data monitoring and management
Data and model compression
Data provenance
Data security and privacy
Data sensation: visual, aural, tactile
Knowledge validation: domain experts
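
As an illustration of the "online" style of algorithm mentioned above, the sketch below keeps a running mean and variance over a stream using Welford's method. This technique is not named on the slide; it is chosen here as a standard example. Each value is seen once and memory use is constant. The class name and the sample stream are hypothetical.

```python
class RunningStats:
    """Single-pass mean and variance over a data stream (Welford's method)."""

    def __init__(self) -> None:
        self.n = 0        # number of values seen so far
        self.mean = 0.0   # running mean
        self._m2 = 0.0    # running sum of squared deviations from the mean

    def update(self, x: float) -> None:
        """Fold one new observation into the summary."""
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self) -> float:
        """Population variance of the values seen so far (0.0 if fewer than 2)."""
        return self._m2 / self.n if self.n > 1 else 0.0


# Hypothetical stream; in practice the values would arrive one at a time.
stats = RunningStats()
for value in [2, 4, 4, 4, 5, 5, 7, 9]:
    stats.update(value)

print(stats.n, stats.mean, stats.variance)
```

Because the summary is updated incrementally, an estimate is available at any point in the stream without storing or rescanning past data.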

Data Science Core Areas
Data Analytics and Machine Learning
Mathematical Modeling and Optimization
Databases and Data warehousing
High Performance Computing
Data Compression/Representation
Statistics, Algebra, and Geometry
Visualization, Sonification
Social/ethical/legal Dimensions
Application Domains
Biology, medicine, chemistry, astronomy, finance, economics, geology, environment, materials, large-scale simulations, national security, WWW
