407706 Protocol Analysis and Design
COMP723 – Data Mining and Knowledge Engineering
Lecture on Assignment 1 and Misc topics on data mining
Parma Nand (PN) – Or Text Mining
Assignment 1 General Comments
Assignment task in perspective
TWO classification algorithms
Purpose of Abstract
Results
Need to discuss the results obtained: compare, contrast, identify trends, etc.
Not only screen dumps.
Analysis
Examples
Why a particular category has a very high or low rate.
Category by category comparison
Algorithm comparison on overall results
Extreme variations
Results after feature manipulations
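A minimal sketch of the category-by-category and overall comparison listed above. The label values, the two prediction lists, and the function names are made up for illustration only; they are not assignment data or a prescribed method.

```python
from collections import defaultdict

def per_category_accuracy(true_labels, predicted_labels):
    """Fraction of items in each true category that were predicted correctly."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, pred in zip(true_labels, predicted_labels):
        total[truth] += 1
        if truth == pred:
            correct[truth] += 1
    return {cat: correct[cat] / total[cat] for cat in total}

def overall_accuracy(true_labels, predicted_labels):
    """Fraction of all items predicted correctly."""
    matches = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return matches / len(true_labels)

# Hypothetical labels and predictions from two algorithms (illustration only).
y_true = ["mild", "mild", "severe", "none", "severe", "none"]
algo_a = ["mild", "severe", "severe", "none", "severe", "mild"]
algo_b = ["mild", "mild", "mild", "none", "severe", "none"]

for name, preds in [("A", algo_a), ("B", algo_b)]:
    print(name, overall_accuracy(y_true, preds), per_category_accuracy(y_true, preds))
```

A table built from numbers like these supports the discussion: which category each algorithm struggles with, and whether the overall winner also wins in every category.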
Results
Need to answer the task question.
Accuracies of the two machine learning algorithms (MLAs) on side effect prediction, with and without pre-processing.
Result presentation
Methodology
Should reflect the tasks as closely as possible.
Describe which features were used and why. Why have you used only 3 text columns?
What is Big Data?
No single definition; here is one from Wikipedia:
Big data is the term for a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications.
The challenges include capture, curation, storage, search, sharing, transfer, analysis, and visualization.
The trend to larger data sets is due to the additional information derivable from analysis of a single large set of related data, as compared to separate smaller sets with the same total amount of data, allowing correlations to be found to “spot business trends, determine quality of research, prevent diseases, link legal citations, combat crime, and determine real-time roadway traffic conditions.”
12+ TBs of tweet data every day
25+ TBs of log data every day
? TBs of data every day
2+ billion people on the Web by end 2011
30 billion RFID tags today (1.3B in 2005)
4.6 billion camera phones worldwide
100s of millions of GPS-enabled devices sold annually
76 million smart meters in 2009… 200M by 2014
CERN’s Large Hadron Collider (LHC) generates 15 PB a year
Variety (Complexity)
Relational Data (Tables/Transaction/Legacy Data)
Text Data (Web)
Semi-structured Data (XML)
Graph Data
Social Network, Semantic Web (RDF), …
Streaming Data
You can only scan the data once
A single application can be generating/collecting many types of data
Big Public Data (online, weather, finance, etc)
To extract knowledge, all these types of data need to be linked together
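The "scan the data once" constraint on streaming data can be sketched with a single-pass (online) statistic. The example below uses Welford's method for a running mean and variance; the class name, the generator, and the readings are hypothetical stand-ins for a real stream.

```python
class RunningStats:
    """Mean and sample variance computed in one pass, without storing the stream."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        return self._m2 / (self.count - 1) if self.count > 1 else 0.0

def sensor_stream():
    """Hypothetical stream: each reading is seen exactly once, then gone."""
    for reading in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
        yield reading

stats = RunningStats()
for value in sensor_stream():
    stats.update(value)

print(stats.count, stats.mean, stats.variance)
```

Memory use stays constant regardless of stream length, which is the property a one-scan algorithm needs.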
Velocity
Data is being generated fast and needs to be processed fast
Online Data Analytics
Late decisions mean missed opportunities
Examples
E-Promotions: based on your current location, your purchase history, and what you like, send promotions right now for the store next to you
Healthcare monitoring: sensors monitor your activities and body; any abnormal measurements require immediate reaction
Data model- old versus new
The Model of Generating/Consuming Data has Changed
Old Model: Few companies are generating data, all others are consuming data
New Model: all of us are generating data, and all of us are consuming data
Cloud Computing
IT resources provided as a service
Compute, storage, databases, queues
Clouds leverage economies of scale of commodity hardware
Cheap storage, high bandwidth networks & multicore processors
Geographically distributed data centers
Offerings from Microsoft, Amazon, Google, …
Classification of Cloud Computing based on Service Provided
Infrastructure as a service (IaaS)
Offering hardware-related services using the principles of cloud computing. These could include storage services (database or disk storage) or virtual servers.
Amazon EC2, Amazon S3, Rackspace Cloud Servers and Flexiscale.
Platform as a Service (PaaS)
Offering a development platform on the cloud.
Google App Engine, Microsoft Azure, Salesforce.com’s Force.com.
Software as a service (SaaS)
Offering a complete software application on the cloud. Users can access a software application hosted by the cloud vendor on a pay-per-use basis. This is a well-established sector.
Salesforce.com’s offering in the online Customer Relationship Management (CRM) space, Google’s Gmail, Microsoft’s Hotmail, Google Docs.
Enabling Technology: Virtualization
[Figure: two stacks, listed bottom to top]
Traditional Stack: Hardware, Operating System, Apps
Virtualized Stack: Hardware, Hypervisor, multiple OS instances, Apps
Everything as a Service
Utility computing = Infrastructure as a Service (IaaS)
Why buy machines when you can rent cycles?
Examples: Amazon’s EC2, Rackspace
Platform as a Service (PaaS)
Give me a nice API and take care of the maintenance, upgrades, …
Example: Google App Engine
Software as a Service (SaaS)
Just run it for me!
Example: Gmail, Salesforce
Cloud Services
Amazon Elastic Compute Cloud
Google App Engine
Microsoft Azure
GoGrid
AppNexus
Types of NN
Kohonen Self-organizing Neural Network
Input vectors of arbitrary dimension are fed to a discrete map made up of neurons, usually arranged in one or two dimensions.
Trained with unsupervised learning, so it does not use error correction.
Applications include recognizing patterns in data, for example in medical analysis
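A rough sketch of how a Kohonen map can be trained, assuming a 1-D row of neurons, a fixed neighbourhood radius of 1, and a linearly decaying learning rate. All function names, parameters, and data here are illustrative assumptions, not a reference implementation.

```python
import random

def best_matching_unit(weights, x):
    """Index of the neuron whose weight vector is closest to input x."""
    def sq_dist(w):
        return sum((wi - xi) ** 2 for wi, xi in zip(w, x))
    return min(range(len(weights)), key=lambda i: sq_dist(weights[i]))

def som_update(weights, x, rate, radius=1):
    """Pull the winning neuron (and, more weakly, its neighbours) toward x."""
    bmu = best_matching_unit(weights, x)
    updated = []
    for i, w in enumerate(weights):
        if abs(i - bmu) <= radius:
            influence = 1.0 if i == bmu else 0.5  # neighbours move half as far
            step = rate * influence
            updated.append([wi + step * (xi - wi) for wi, xi in zip(w, x)])
        else:
            updated.append(list(w))
    return updated

def train_som(data, n_neurons=4, epochs=50, lr=0.5, seed=0):
    """No target labels and no error signal: only competition between neurons."""
    rng = random.Random(seed)
    dim = len(data[0])
    weights = [[rng.random() for _ in range(dim)] for _ in range(n_neurons)]
    for epoch in range(epochs):
        rate = lr * (1 - epoch / epochs)  # decaying learning rate
        for x in data:
            weights = som_update(weights, x, rate)
    return weights

# Made-up 2-D points forming two loose groups.
data = [[0.0, 0.1], [0.1, 0.0], [0.9, 1.0], [1.0, 0.9]]
trained = train_som(data)
print(trained)
```

The update uses only the distance between inputs and neuron weights, which is why no error correction against a known answer is involved.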
Convolutional Neural Network (CNN)
Contains one or more convolution layers with pooling
Allows deeper networks with fewer parameters
Does well on image/text analysis
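To illustrate convolution with pooling and the parameter saving it brings, here is a small pure-Python sketch on a 1-D signal. The signal, the kernel, and the function names are made up for illustration; real CNNs learn their kernels and usually work on 2-D inputs.

```python
def conv1d(signal, kernel):
    """Valid (no padding) 1-D sliding dot product, as used in a convolution layer."""
    k = len(kernel)
    return [
        sum(signal[i + j] * kernel[j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]

def relu(values):
    """Keep positive responses, zero out the rest."""
    return [max(0.0, v) for v in values]

def max_pool1d(values, size=2):
    """Keep the strongest response in each non-overlapping window."""
    return [max(values[i:i + size]) for i in range(0, len(values) - size + 1, size)]

signal = [0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0]
edge_kernel = [-1.0, 1.0]  # hand-picked kernel that responds to a rising edge

conv_out = conv1d(signal, edge_kernel)
features = max_pool1d(relu(conv_out))

# The same 2 kernel weights are reused at every position, whereas a fully
# connected layer from 9 inputs to 8 outputs would need 9 * 8 weights.
shared_weights = len(edge_kernel)
dense_weights = len(signal) * len(conv_out)
print(features, shared_weights, dense_weights)
```

Weight sharing is what keeps the parameter count low enough to stack many such layers.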
Types of NN…
Recursive Neural Network (RNN)
Deep neural network formed by applying the same set of weights recursively over a structured input, so that each substructure is processed in the same way to produce a prediction.
Recurrent Neural Network (RNN)
A recurrent neural network (RNN), unlike a feedforward neural network, is a variant of a recursive artificial neural network in which connections between neurons form a directed cycle.
The output depends on both the current input and the neuron state from the previous step.
Does well with handwriting /speech recognition
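The dependence of the output on both the current input and the previous state can be sketched with a single recurrent unit. This is a toy with one scalar hidden state; the weight values and function names are arbitrary assumptions chosen for illustration.

```python
import math

def rnn_step(x, h_prev, w_x, w_h, b):
    """New state mixes the current input with the previous state."""
    return math.tanh(w_x * x + w_h * h_prev + b)

def run_rnn(inputs, w_x=0.5, w_h=0.8, b=0.0):
    """Feed a sequence through the unit, carrying the state forward."""
    h = 0.0
    states = []
    for x in inputs:
        h = rnn_step(x, h, w_x, w_h, b)
        states.append(h)
    return states

# A single pulse followed by silence: the state keeps a fading trace of it.
states = run_rnn([1.0, 0.0, 0.0])
print(states)
```

Even though the second and third inputs are zero, the state stays positive because it is fed back through `w_h`; this feedback loop is the directed cycle.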
Types of NN…
Long short-term memory (LSTM)
Variation of the recurrent NN designed to model temporal sequences and their long- and short-term dependencies more accurately.
Implemented in blocks with “logic gates”
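The gates can be sketched as follows, assuming the common formulation with forget, input, and output gates and a scalar state for readability. The parameter names in the dictionary are hypothetical; real implementations use weight matrices and learned values.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, p):
    """One step of a scalar LSTM block; p holds weights (w*, u*) and biases (b*)."""
    f = sigmoid(p["wf"] * x + p["uf"] * h_prev + p["bf"])    # forget gate
    i = sigmoid(p["wi"] * x + p["ui"] * h_prev + p["bi"])    # input gate
    o = sigmoid(p["wo"] * x + p["uo"] * h_prev + p["bo"])    # output gate
    g = math.tanh(p["wg"] * x + p["ug"] * h_prev + p["bg"])  # candidate value
    c = f * c_prev + i * g   # long-term cell state: keep some old, add some new
    h = o * math.tanh(c)     # short-term output exposed to the next step
    return h, c

# With all parameters zero, every gate is half open and the candidate is 0.
zero_params = {k: 0.0 for k in
               ["wf", "uf", "bf", "wi", "ui", "bi", "wo", "uo", "bo", "wg", "ug", "bg"]}
h, c = lstm_step(x=1.0, h_prev=0.0, c_prev=2.0, p=zero_params)
print(h, c)
```

The gates take values between 0 and 1, so they act as soft switches on what the cell forgets, stores, and outputs, rather than as hard logic gates.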
Kaggle
One of the most resource-rich data science/data mining sites.
Competitions, courses, jobs, datasets, etc.