Data Mining: Concepts and Techniques
Chapter 1
Qiang (Chan) Ye, Faculty of Computer Science, Dalhousie University

Chapter 1. Introduction
n Why Data Mining?
n What Is Data Mining?
n What Kind of Data Can Be Mined?
n What Kinds of Patterns Can Be Mined?
n What Technologies Are Used?
n What Kind of Applications Are Targeted?
n Major Issues in Data Mining
n A Brief History of Data Mining and Data Mining Society
n Summary

Why Data Mining?
n The Explosive Growth of Data: from terabytes to petabytes
n Data collection and data availability
n Automated data collection tools, database systems, Web, computerized society
n Major sources of abundant data
n Business: Web, e-commerce, transactions, stocks, …
n Science: Remote sensing, bioinformatics, scientific simulation, …
n Society and everyone: news, digital cameras, YouTube
n We are drowning in data, but starving for knowledge!

Why Data Mining?
n Data mining: Automated analysis of massive data sets. It turns a large collection of data into knowledge.
n Google's Flu Trends:
n Google receives hundreds of millions of queries every day.
n Each query can be viewed as a transaction where the user describes her or his information need.
n Google's Flu Trends uses specific search terms as indicators of flu activity.
n It found a close relationship between the number of people who search for flu-related information and the number of people who actually have flu symptoms.

Evolution of Sciences
n Before 1600, empirical science
n 1600-1950s, theoretical science
n Each discipline has grown a theoretical component. Theoretical models often motivate experiments and generalize our understanding.
n 1950s-1990s, computational science
n Over the last 50 years, most disciplines have grown a third, computational branch (e.g., empirical, theoretical, and computational ecology, or physics, or linguistics).
n Computational Science traditionally meant simulation. It grew out of our inability to find closed-form solutions for complex mathematical models.
n 1990-now, data science
n The flood of data from new scientific instruments and simulations
n The ability to economically store and manage petabytes of data online
n The Internet and computing Grid that make all these archives universally accessible
n Scientific info. management, acquisition, organization, query, and visualization tasks scale almost linearly with data volumes. Data mining is a major new challenge!

Evolution of Database Technology
n Data collection, database creation, Database Management System (DBMS)
n Relational data model, Relational DBMS (RDBMS) implementation
n RDBMS, advanced data models (extended-relational, OO, deductive, etc.)
n Application-oriented DBMS (spatial, scientific, engineering, etc.)
n Data mining, data warehousing, multimedia databases, and Web databases
n 2000s
n Stream data management and mining
n Data mining and its applications
n Web technology (XML, data integration) and global information systems

What Is Data Mining?
n Many people treat data mining as a synonym for knowledge discovery from data (KDD), another popularly used term.
n Others view data mining as merely an essential step in the process of knowledge discovery.
n The knowledge discovery process, illustrated in the following figure, is an iterative sequence of the following steps:
n 1) Data cleaning (to remove noise and inconsistent data)
n 2) Data integration (where multiple data sources may be combined)
n 3) Data selection (where data relevant to the analysis task are retrieved from the database)

What Is Data Mining?
n 4) Data transformation (where data are transformed and consolidated into forms appropriate for mining by performing summary or aggregation operations)
n 5) Data mining (an essential process where intelligent methods are applied to extract data patterns)
n 6) Pattern evaluation (to identify the truly interesting patterns representing knowledge based on interestingness measures)
n 7) Knowledge presentation (where visualization and knowledge representation techniques are used to present mined knowledge to users)
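
The seven steps can be sketched as a small pipeline. This is only a minimal illustration, not a reference implementation: the record fields (customer, item, amount), the two sample sources, and every function name are hypothetical, and "interestingness" is reduced to a simple count threshold.

```python
from collections import Counter

# Hypothetical raw records from two sources; None marks a missing value.
store_a = [
    {"customer": "c1", "item": "computer", "amount": 900.0},
    {"customer": "c2", "item": "software", "amount": None},  # noisy record
]
store_b = [
    {"customer": "c1", "item": "software", "amount": 120.0},
    {"customer": "c3", "item": "computer", "amount": 950.0},
    {"customer": "c3", "item": "software", "amount": 80.0},
]

def clean(records):
    """Step 1: drop records with missing values."""
    return [r for r in records if all(v is not None for v in r.values())]

def integrate(*sources):
    """Step 2: combine multiple sources into one collection."""
    return [r for source in sources for r in source]

def select(records, items):
    """Step 3: keep only records relevant to the analysis task."""
    return [r for r in records if r["item"] in items]

def transform(records):
    """Step 4: aggregate records into one item set per customer."""
    baskets = {}
    for r in records:
        baskets.setdefault(r["customer"], set()).add(r["item"])
    return baskets

def mine(baskets):
    """Step 5: count how many baskets contain each item pair."""
    counts = Counter()
    for items in baskets.values():
        ordered = sorted(items)
        for i, a in enumerate(ordered):
            for b in ordered[i + 1:]:
                counts[(a, b)] += 1
    return counts

def evaluate(counts, min_count):
    """Step 6: keep only patterns that meet the threshold."""
    return {pair: n for pair, n in counts.items() if n >= min_count}

def present(patterns):
    """Step 7: render the surviving patterns for a user."""
    return [f"{a} + {b}: {n} customers" for (a, b), n in sorted(patterns.items())]

records = integrate(clean(store_a), clean(store_b))
baskets = transform(select(records, {"computer", "software"}))
patterns = evaluate(mine(baskets), min_count=2)
report = present(patterns)
```

Steps 1 through 4 only reshape the data; the pattern itself is found in step 5 and filtered in step 6.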

What Is Data Mining?
n Steps 1 through 4 of the process are different forms of data preprocessing, where data are prepared for mining.
n The data mining step may interact with the user or a knowledge base.
n The interesting patterns are presented to the user and may be stored as new knowledge in the knowledge base.

What Is Data Mining?
n The preceding view shows data mining as one step in the knowledge discovery process, albeit an essential one because it uncovers hidden patterns for evaluation.
n However, "data mining" is often used to refer to the entire knowledge discovery process.
n In this course, we adopt a broad view of data mining functionality.
n Namely, data mining is the process of discovering interesting patterns and knowledge from a large amount of data.

Data Mining: On What Kinds of Data?
n The most basic forms of data for mining applications are:
n Database data (Section 1.3.1)
n Data warehouse data (Section 1.3.2)
n Transactional data (Section 1.3.3).
n The concepts and techniques presented in this book focus on these types of data.
n Data mining can also be applied to other forms of data (e.g., data streams, ordered/sequence data, graph or networked data, spatial data, text data, multimedia data, and the WWW).
n Techniques for mining of these kinds of data are briefly introduced in Chapter 13.

Database Data
n A relational database is a collection of tables, each of which is assigned a unique name. Each table consists of a set of attributes (columns or fields) and usually stores a large set of tuples (records or rows).
n A relational database for AllElectronics, the example store used to illustrate concepts throughout this book, includes the following tables:
n Customer
n Item
n Employee
n Branch

Database Data
n In addition, the database includes the following relationship tables:
n Purchases (customer purchases items, creating a sales transaction handled by an employee)
n Items sold (lists items sold in a given transaction)
n Works at (employee works at a branch of AllElectronics).
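
A relational schema of this kind can be tried out with Python's built-in sqlite3 module. The sketch below is an assumption-laden illustration: the slides name only the tables, so every column name (cust_id, item_id, trans_id, qty, price) and every sample row is hypothetical, and only four of the seven tables are created.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer (cust_id TEXT PRIMARY KEY, name TEXT);
CREATE TABLE item (item_id TEXT PRIMARY KEY, name TEXT, price REAL);
-- Relationship tables link the entity tables through their keys.
CREATE TABLE purchases (
    trans_id TEXT PRIMARY KEY,
    cust_id TEXT REFERENCES customer(cust_id)
);
CREATE TABLE items_sold (
    trans_id TEXT REFERENCES purchases(trans_id),
    item_id TEXT REFERENCES item(item_id),
    qty INTEGER
);
INSERT INTO customer VALUES ('C1', 'Smith');
INSERT INTO item VALUES ('I3', 'computer', 900.0);
INSERT INTO item VALUES ('I8', 'software', 100.0);
INSERT INTO purchases VALUES ('T100', 'C1');
INSERT INTO items_sold VALUES ('T100', 'I3', 1);
INSERT INTO items_sold VALUES ('T100', 'I8', 2);
""")

# Total amount spent per customer, joining entity and relationship tables.
rows = conn.execute("""
SELECT c.name, SUM(i.price * s.qty)
FROM purchases p
JOIN customer c ON c.cust_id = p.cust_id
JOIN items_sold s ON s.trans_id = p.trans_id
JOIN item i ON i.item_id = s.item_id
GROUP BY c.name
""").fetchall()
```

The query returns one tuple per customer, which is the kind of aggregate that mining methods often take as input.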

Data Warehouse
n Suppose that AllElectronics is a successful international company with branches around the world.
n Each branch has its own set of databases.
n The president of AllElectronics has asked you to provide an analysis of the company's sales per item type per branch for the third quarter.
n This is a difficult task, particularly since the relevant data are spread out over several databases physically located at numerous sites.
n If AllElectronics had a data warehouse, this task would be easy.

Data Warehouse
n A data warehouse is a repository of information collected from multiple sources, stored under a unified schema, and usually located at a single site.
n Data warehouses are constructed via a process that includes the following steps:
n data cleaning
n data integration
n data transformation
n data loading
n periodic data refreshing
n The details of this process are discussed in Chapters 3 and 4.

Data Warehouse
n The following figure shows the typical framework for construction and use of a data warehouse for AllElectronics.
[Figure: framework of a data warehouse for AllElectronics, including a preprocessing stage]

Transactional Data
n In general, each record in a transactional database captures a transaction, such as:
n A customer's purchase
n A flight booking
n A user's clicks on a web page.
n A transaction typically includes:
n A unique transaction identity number (trans ID)
n A list of the items making up the transaction, such as the items purchased in the transaction.

Mining Frequent Patterns, Associations and Correlations
n Frequent patterns: e.g. what items are frequently purchased together in your Walmart?
n Mining frequent patterns leads to the discovery of interesting associations and correlations within data.
n Example association at AllElectronics:
buys(X, "computer") => buys(X, "software") [support = 1%, confidence = 50%]
X is a variable representing a customer.
A confidence, or certainty, of 50% means that if a customer buys a computer, there is a 50% chance that the customer will buy software as well.
A 1% support means that 1% of all the transactions under analysis show that computer and software are purchased together.
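
Support and confidence can be computed directly from a list of transactions. The following is a minimal sketch with four made-up transactions, so the resulting numbers (25% and 50%) are illustrative and do not reproduce the AllElectronics figures; the function names are hypothetical.

```python
# Each transaction is the set of items bought together (made-up data).
transactions = [
    {"computer", "software"},
    {"computer"},
    {"software", "printer"},
    {"printer"},
]

def support(itemset, transactions):
    """Fraction of all transactions that contain every item in itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Fraction of transactions with the antecedent that also hold the consequent."""
    both = support(antecedent | consequent, transactions)
    return both / support(antecedent, transactions)

# Rule: buys(X, "computer") => buys(X, "software")
rule_support = support({"computer", "software"}, transactions)
rule_confidence = confidence({"computer"}, {"software"}, transactions)
```

Here one of four transactions contains both items (support 0.25), and one of the two computer purchases also includes software (confidence 0.5).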

Classification and Regression for Predictive Analysis
n Classification is the process of finding a model (or function) that describes and distinguishes data classes.
n The model is derived based on the analysis of a set of training data (i.e., data objects for which the class labels are known).
n The model is used to predict the class label of objects for which the class label is unknown.
n Whereas classification predicts categorical (discrete, unordered) labels, regression models continuous-valued functions. That is, regression is used to predict missing or unavailable numerical data values rather than (discrete) class labels.
n The term prediction refers to both numeric prediction and class label prediction.
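
The difference between the two tasks shows up in the type of the predicted value. The sketch below uses a nearest-neighbor classifier and a least-squares line purely as stand-ins (the slides do not prescribe any particular method), and the income and spending figures are made up.

```python
def nearest_neighbor_label(train, x):
    """Classification: return the class label of the closest training object."""
    closest = min(train, key=lambda pair: abs(pair[0] - x))
    return closest[1]

def fit_line(points):
    """Regression: least-squares fit of y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Training data with known class labels: (income in $1000s, label).
labeled = [(20, "budget"), (25, "budget"), (80, "premium"), (95, "premium")]
predicted_label = nearest_neighbor_label(labeled, 70)  # a discrete class label

# Training data with numeric targets: (income in $1000s, yearly spend in $).
numeric = [(20, 200.0), (40, 400.0), (60, 600.0)]
slope, intercept = fit_line(numeric)
predicted_spend = slope * 50 + intercept  # a continuous value
```

The classifier returns one of the labels seen in training, while the regression model can return a number that never appeared in the training data.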

Cluster Analysis
n Unlike classification, which analyzes class-labeled (training) data sets, clustering analyzes data objects without class labels.
n In many cases, class-labeled data may simply not exist at the beginning.
n Clustering can be used to generate class labels for a group of data.
n The objects are clustered or grouped according to the following principles:
n Maximizing the intraclass similarity
n Minimizing the interclass similarity
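
One common way to pursue both principles is k-means clustering, which the slides do not name; it is used here only as an example technique. The sketch runs on one-dimensional made-up values with hand-picked starting centers, and the function name is hypothetical.

```python
def kmeans_1d(values, centers, iterations=10):
    """Lloyd's algorithm on 1-D data, starting from the given centers."""
    centers = list(centers)
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for v in values:
            # Assign each value to its nearest center.
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Move each center to the mean of its cluster; keep it if the cluster is empty.
        centers = [
            sum(c) / len(c) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
    return centers, clusters

values = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]  # unlabeled objects
centers, clusters = kmeans_1d(values, centers=[1.0, 2.0])
```

No class labels are supplied; the two groups emerge from the distances alone, and the cluster index can then serve as a generated class label.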

Outlier Analysis
n Outlier: A data object that does not comply with the general behavior of the data
n Many data mining methods regard outliers as noise or exceptions. Consequently, outliers are simply discarded.
n However, in some applications (e.g., fraud detection) the rare events can be more interesting than the more regularly occurring ones.
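
A simple way to flag such rare events is a z-score test: an object is suspicious when it lies many standard deviations from the mean. This is one of many possible approaches and is not taken from the slides; the threshold of 2.0 and the transaction amounts are assumptions chosen for illustration.

```python
from statistics import mean, pstdev

def zscore_outliers(values, threshold=2.0):
    """Return the values lying more than `threshold` standard deviations from the mean."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    if sigma == 0:
        return []  # all values identical, so nothing stands out
    return [v for v in values if abs(v - mu) / sigma > threshold]

# Made-up card transaction amounts; one is far larger than the rest.
amounts = [20.0, 22.0, 19.0, 21.0, 23.0, 20.0, 500.0]
suspicious = zscore_outliers(amounts)
```

In a fraud detection setting the flagged value would be kept for inspection rather than discarded as noise.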

Data Mining: Confluence of Multiple Disciplines

Machine Learning
n Machine learning investigates how computers can learn.
n Here, we list the classic problems in machine learning that are highly related to data mining:
n Supervised learning is basically a synonym for classification. The supervision in the learning comes from the labeled examples in the training data set.
n Unsupervised learning is essentially a synonym for clustering. The learning process is unsupervised since the input examples are not class labeled.
n Semi-supervised learning is a class of machine learning techniques that make use of both labeled and unlabeled examples when learning a model.
n Active learning is a machine learning approach that lets users play an active role in the learning process. An active learning approach can ask a user (e.g., a domain expert) to label an example, which may be from a set of unlabeled examples or synthesized by the learning program.

Information Retrieval
n Information retrieval (IR) is the science of searching for information in documents.
n Documents can be text or multimedia, and may reside on the Web.
n The differences between traditional information retrieval and database systems are twofold:
n Information retrieval assumes that (1) the data are unstructured; and (2) the queries are formed mainly by keywords, which do not have complex structures (unlike SQL queries in database systems).
n Database systems assume that (1) the data are structured; and (2) the queries are based on complex structures (such as SQL queries)
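
The two sets of assumptions can be contrasted in a few lines. The sketch below is illustrative only: the documents, the item table, its column names, and the keyword scoring function are all hypothetical, and real IR systems use far richer ranking than a keyword count.

```python
import sqlite3

# Information retrieval: unstructured text, queried by keywords.
documents = {
    "d1": "data mining turns large data sets into knowledge",
    "d2": "relational databases store tuples in tables",
    "d3": "mining frequent patterns in transactional data",
}

def keyword_search(query, documents):
    """Rank documents by how many distinct query keywords they contain."""
    keywords = set(query.lower().split())
    scores = {}
    for doc_id, text in documents.items():
        words = set(text.lower().split())
        score = len(keywords & words)
        if score:
            scores[doc_id] = score
    # Highest score first; ties broken by document id.
    return sorted(scores, key=lambda d: (-scores[d], d))

ir_hits = keyword_search("data mining", documents)

# Database system: structured data, queried with a structured (SQL) query.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE item (name TEXT, type TEXT, price REAL)")
conn.executemany(
    "INSERT INTO item VALUES (?, ?, ?)",
    [
        ("laptop", "computer", 900.0),
        ("editor", "software", 100.0),
        ("desktop", "computer", 700.0),
    ],
)
db_hits = conn.execute(
    "SELECT name FROM item WHERE type = ? AND price > ? ORDER BY name",
    ("computer", 800.0),
).fetchall()
```

The keyword query returns a ranked list of loosely matching documents, while the SQL query returns exactly the rows that satisfy its conditions.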

Applications of Data Mining
As a highly application-driven discipline, data mining has seen great successes in many applications. Here are some examples:
n Web page analysis: page ranking, page classification/clustering
n Business intelligence: basket data analysis for targeted marketing
n Biological data analysis: classification, cluster analysis (microarray data analysis), biological sequence analysis, biological network analysis
n Medical data analysis: classification and clustering

Major Issues in Data Mining (1)
n Mining Methodology
n Mining various and new kinds of knowledge
n Mining knowledge in multi-dimensional space
n Data mining – An interdisciplinary effort: it can be substantially enhanced by integrating new methods from multiple disciplines.
n Boosting the power of discovery in a networked environment
n Handling noise, uncertainty, and incompleteness of data
n Pattern evaluation and pattern- or constraint-guided mining
n User Interaction
n Interactive mining
n Incorporation of background knowledge
n Ad hoc data mining and data mining query languages
n Presentation and visualization of data mining results

Major Issues in Data Mining (2)
n Efficiency and Scalability
n Efficiency and scalability of data mining algorithms
n Parallel, distributed, and incremental mining methods
n Diversity of database types
n Handling complex types of data
n Mining dynamic, networked, and global data repositories
n Data mining and society
n Social impacts of data mining
n Privacy-preserving data mining
n Invisible data mining: More systems should have built-in data mining functions so that users do not need to have the knowledge of data mining algorithms when using the systems

A Brief History of Data Mining Society
n 1989 IJCAI Workshop on Knowledge Discovery in Databases
n Knowledge Discovery in Databases (G. Piatetsky-Shapiro and W. Frawley,
n 1991-1994 Workshops on Knowledge Discovery in Databases
n Advances in Knowledge Discovery and Data Mining (U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, 1996)
n 1995-1998 International Conferences on Knowledge Discovery in Databases and Data Mining (KDD'95-98)
n Journal of Data Mining and Knowledge Discovery (1997)
n ACM SIGKDD conferences since 1998
n More conferences on data mining
n PAKDD (1997), PKDD (1997), SIAM-Data Mining (2001), (IEEE) ICDM (2001), etc.
n ACM Transactions on KDD starting in 2007

Conferences and Journals on Data Mining
n KDD Conferences
n ACM SIGKDD Int. Conf. on Knowledge Discovery in Databases and Data Mining (KDD)
n SIAM Data Mining Conf. (SDM)
n IEEE Int. Conf. on Data Mining
n European Conf. on Machine Learning and Principles and Practices of Knowledge Discovery and Data Mining (ECML-PKDD)
n Pacific-Asia Conf. on Knowledge Discovery and Data Mining
n Int. Conf. on Web Search and Data Mining (WSDM)
n Other related conferences
n DB conferences: ACM SIGMOD, VLDB, ICDE, EDBT, ICDT, …
n Web and IR conferences: WWW, SIGIR, WSDM
n ML conferences: ICML, NIPS
n PR conferences: CVPR
n Journals
n ACM Trans. on KDD
n IEEE Trans. on Knowledge and Data Engineering
n IEEE Trans. on Big Data
n ACM SIGKDD Explorations
n Data Mining and Knowledge Discovery (DAMI or DMKD)

Where to Find References? DBLP, CiteSeer, Google
n Data mining and KDD
n Conferences: ACM-SIGKDD, IEEE-ICDM, SIAM-DM, PKDD, PAKDD, etc.
n Journal: ACM TKDD, DMKD, KDD Explorations
n Database systems
n Conferences: ACM-SIGMOD, ACM-PODS, VLDB, IEEE-ICDE, EDBT, ICDT, DASFAA
n Journals: IEEE-TKDE, ACM-TODS/TOIS, JIIS, J. ACM, VLDB J., Info. Sys., etc.
n AI & Machine Learning
n Conferences: Machine learning (ML), AAAI, IJCAI, COLT (Learning Theory), CVPR, NIPS, etc.
n Journals: Machine Learning, Artificial Intelligence, Knowledge and Information Systems, IEEE-PAMI, etc.
n Web and IR
n Conferences: SIGIR, WWW, CIKM, etc.
n Journals: WWW: Internet and Web Information Systems,
n Statistics
n Conferences: Joint Stat. Meeting, etc.
n Journals: Annals of Statistics, etc.
n Visualization
n Conference proceedings: CHI, ACM-SIGGraph, etc.
n Journals: IEEE Trans. visualization and computer graphics, etc.

Summary
n Data mining: Discovering interesting patterns and knowledge from massive amounts of data
n A natural evolution of database technology, in great demand, with wide applications
n A KDD process includes data cleaning, data integration, data selection, transformation, data mining, pattern evaluation, and knowledge presentation
n Mining can be performed with a variety of data
n Data mining functionalities: association, classification, clustering, outlier and trend analysis, etc.
n Data mining technologies and applications
n Major issues in data mining