INF 553: Foundations and Applications of Data Mining (Summer 2020)
Yao-Yi Chiang
Associate Professor (Research), Spatial Sciences Institute
Associate Director, Integrated Media Systems Center
Spatial Computing Lab
University of Southern California
Thanks for source slides and material to: J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets http://www.mmds.org

• What is Data Mining?
• About THIS Course

What is Data Mining? Knowledge Discovery from Data

Data contains value and knowledge

• But to extract the knowledge, data needs to be
• Stored
• Managed
• And ANALYZED ← this class
Big Data Lifecycle
Data Mining
Data Mining ≈ Big Data ≈ Predictive Analytics ≈ Data Science

What is Data Mining?
• Given lots of data
• Discover patterns and models that are:
• Valid: hold on new data with some certainty
• Useful: should be possible to act on the item
• Unexpected: non-obvious to the system
• Understandable: humans should be able to interpret the pattern

Data Mining Tasks
• Descriptive methods
• Find human-interpretable patterns that describe the data
• Example: Clustering
• Predictive methods
• Use some variables to predict unknown or future values of other variables
• Example: Recommender systems

Meaningfulness of Analytic Answers
• A risk with “Data mining” is that an analyst can “discover” patterns that are meaningless
• Bonferroni’s principle:
• If you look in more places for interesting patterns than your amount of data will support, you are bound to find crap

Meaningfulness of Analytic Answers
Example:
• We want to find (unrelated) people who at least twice have stayed at the same hotel on the same day
• $10^9$ people being tracked (1 billion)
• 1,000 days ~ 3 years
• Each person stays in a hotel 1% of the time (1 day out of 100)
• Hotels hold 100 people (so $10^5$ hotels)
• Enough to hold the 1% of a billion people who visit a hotel on any given day
• If everyone behaves randomly will the data mining detect anything suspicious?

Meaningfulness of Analytic Answers (Cont’d)
• $10^9$ people, 1,000 days, 1% hotel stay, $10^5$ hotels
• The probability of any two people both deciding to visit a hotel on any given day is 0.0001 (i.e., 1% × 1%)
• The chance that they will visit the same hotel on one given day is $0.0001 / 10^5 = 10^{-9}$; on two given days, $(10^{-9})^2 = 10^{-18}$
• The number of pairs of people is $C(10^9, 2) \approx 5 \times 10^{17}$
• The number of pairs of days is $C(10^3, 2) \approx 5 \times 10^5$
• Expected number of “suspicious” pairs of people:
• $5 \times 10^{17} \times 5 \times 10^5 \times 10^{-18} = 250{,}000$!
• … too many combinations to check – we need to have some additional evidence to find “suspicious” pairs of people in some more efficient way
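
The arithmetic in this example can be checked with a short Python sketch. The parameter values come from the slide; the function name `expected_suspicious_pairs` and the variable names are illustrative assumptions, not part of the course material. The code uses exact binomial coefficients, so it gives about 249,750, slightly below the slide's 250,000, which rounds both pair counts up.

```python
from math import comb


def expected_suspicious_pairs(people, days, p_stay, hotels):
    """Expected number of pairs of people who share a hotel on two different days,
    assuming everyone picks hotel days and hotels independently at random."""
    # Both people are in some hotel on a given day, and it is the same hotel.
    p_same_hotel_one_day = (p_stay * p_stay) / hotels
    # The same coincidence has to happen on two specific days.
    p_same_hotel_two_days = p_same_hotel_one_day ** 2
    pairs_of_people = comb(people, 2)
    pairs_of_days = comb(days, 2)
    return pairs_of_people * pairs_of_days * p_same_hotel_two_days


# Values from the slide: 10^9 people, 1,000 days, 1% hotel stays, 10^5 hotels.
expected = expected_suspicious_pairs(
    people=10**9, days=1_000, p_stay=0.01, hotels=10**5
)
print(f"Expected 'suspicious' pairs under pure chance: {expected:,.0f}")
```

Because purely random behavior already yields hundreds of thousands of such pairs, a match on this criterion alone says almost nothing about any individual pair.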

What matters when dealing with data?
• Usage
• Quality
• Context
• Streaming
• Scalability

Data Mining: Cultures
• Data mining overlaps with:
• Databases: Large-scale data, simple(r) queries
• Machine learning: Small data, Complex models
• CS Theory: Algorithms
• Different cultures:
• To a DB person, data mining is an extreme form of analytic processing: queries that examine large amounts of data
• Result is the query answer
• To a ML person, data-mining is the inference of models
• Result is the parameters of the model
• In this class we will do both!
[Figure: diagram relating Data Mining to CS Theory, Machine Learning, and Database systems]

About THIS Course

This Course
• This course overlaps with machine learning, statistics, artificial intelligence, and databases, but puts more stress on
• Scalability (big data)
• Algorithms
• Computing architectures
• Automation for handling large data
[Figure: diagram relating Data Mining to Statistics, Machine Learning, and Database systems]

What will we learn?
• We will learn to mine different types of data:
• Data is high dimensional
• Data is a graph
• Data is infinite/never-ending
• Data is labeled
• We will learn to use different models of computation:
• MapReduce
• Streams and online algorithms
• Single machine in-memory
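
As a rough illustration of the MapReduce model listed above, here is a minimal single-process Python sketch of the map, shuffle, and reduce phases applied to word counting. It is not Spark and not the course's homework setup. The function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `word_count`) and the sample documents are assumptions made for this example.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key, as a framework would do
    between the map and reduce phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single count."""
    return key, sum(values)


def word_count(documents):
    """Run the three phases in sequence on an in-memory list of documents."""
    emitted = [pair for doc in documents for pair in map_phase(doc)]
    grouped = shuffle_phase(emitted)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Hypothetical sample input, used only to exercise the sketch.
counts = word_count(["data mining", "mining massive data sets", "data"])
print(counts)
```

In a real cluster the map and reduce calls would run in parallel on different machines. Here they run sequentially, only to show how the data flows between phases.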

What will we learn?
• We will learn to solve real-world problems:
• Recommender systems
• Market Basket Analysis
• Spam detection
• Duplicate document detection
• We will learn various “tools”:
• Linear algebra (Rec. Sys., Communities)
• Optimization (stochastic gradient descent)
• Dynamic programming (frequent itemsets)
• Hashing (LSH, Bloom filters)

How It All Fits Together
• High dim. data: Locality sensitive hashing; Clustering; Dimensionality reduction
• Graph data: PageRank, SimRank; Community Detection; Spam Detection
• Infinite data: Filtering data streams; Web advertising; Queries on streams
• Machine learning: SVM; Decision Trees; Perceptron, kNN
• Apps: Recommender systems; Association Rules; Duplicate document detection

2020 INF553 Course Staff
• TAs:
• Yifan Xu (xuyifan@usc.edu)
• Dan Feldman (danf@usc.edu)
• Graders:
• TBD
• Office hours:
• Instructor: Wednesday after class
• TAs: See Piazza for TA office hours and locations

Course Logistics
• Course website:
• piazza.com/usc/summer2020/inf553
• Lecture slides (same day after the lecture)
• Homework
• Readings
• Mining of Massive Datasets
• J. Leskovec, A. Rajaraman and J. Ullman
• Free online: http://www.mmds.org or on Blackboard
• Other relevant papers

Logistics: Communication
• Discussion board on Piazza:
• Use the discussion board for all questions and public
communication with the course staff
a post on the discussion board

Logistics: Communication
• We will post course announcements to Piazza (make sure you check it regularly)
• Emails:
• Do not use emails unless it’s personal!

Work for the Course
• Four homework assignments: 50%
• Theoretical and programming questions
• MapReduce (with Spark)
• Finding Frequent Itemsets
• Recommendation Systems
• Clustering
• Detecting Communities in a Social Network
• Assignments take lots of time. Start early!!
• Please work on your own code!

Work for the Course
• Homework policy:
• No regrading
• One-week late penalty: 20%
• 0 points after one week
• Free five-day extensions
• You can use these five days on homework however you want until the last day of the class
• No more extension days will be given for any reason

Work for the Course
• Not-So-Short weekly quizzes: 30%
• Not-So-Short in-class quizzes every week
• We will drop your two lowest quizzes
• Comprehensive exam: 20%
• Wednesday, June 29, 4pm–5:30pm
• The exam will cover everything taught in class
• No Final exam
• It’s going to be fun and hard work.

Prerequisites
• Algorithms
• Dynamic programming, basic data structures
• Basic probability
• Moments, typical distributions, MLE, …
• Programming
• Your choice, but Python will be very useful
• Some of the homework will require you to use Scala only
• We provide some background, but the class will be fast paced

Course Grade

What’s after the class
• Directed Research
• Course producer or grader positions
• Paid RA positions

To-do items
• Download the textbook
• Install Spark on your machine (http://spark.apache.org/)
• Play with datasets http://grouplens.org/datasets/movielens/ and http://jmcauley.ucsd.edu/data/amazon/links.html
• Sign up for Piazza
• Email me a photo of you if your GRS photo is outdated…

One more thing…
• Please send me an email if you want to audit the class
• Check with your advisor for the last day to drop a class without a mark of “W”