
COMP9318: Data Warehousing
and Data Mining

— L7: Classification and Prediction —

n Problem definition and preliminaries


n Classification:
n predicts categorical class labels (discrete or nominal)
n classifies data (constructs a model) based on the

training set and the values (class labels) in a
classifying attribute and uses it in classifying new data

n Prediction (aka. Regression):
n models continuous-valued functions, i.e., predicts

unknown or missing values
n Typical Applications

n credit approval
n target marketing
n medical diagnosis
n treatment effectiveness analysis

Classification vs. Prediction

Classification and Regression

• Given a new object o, map it to a feature vector x = (x1, x2, . . . , xd)^T
• Learn a classification function f(x)
• Predict the output (class label) y = f(x)
• Binary classification: y ∈ {0, 1} (sometimes {−1, +1})
• Multi-class classification: y ∈ {1, 2, …, K}
• Regression: y ∈ ℝ

Examples of Classification Problem
n Text categorization:

n Input object: a document = a sequence of words
n Input features: x = word frequencies

n freq(democrats)=2, freq(basketball)=0, …
n x = [1, 2, 0, …]^T

n Class label:
n ‘Politics’:
n ‘Sport’:

Doc: Months of campaigning
and weeks of round-the-clock
efforts in Iowa all came down to
a final push Sunday, …

Topic:
Politics

Sport

Examples of Classification Problem

n Image Classification:

n Input object: an image = a matrix of RGB values
n Input features = Color histogram

n pixel_count(red) = 1004, pixel_count(blue) = 23000
n x = [1004, 23000, …]^T

n Class label
n ‘bird image’:
n ‘non-bird image’:

Which images are birds,
which are not?

Supervised Learning

• How to find f()?
• In supervised learning, we are given a set of
  training examples: {(xi, yi)}, i = 1, …, n
• Independent and identically distributed (i.i.d.)
  assumption
• A critical assumption for machine learning theory

Machine Learning Terminologies

n Supervised learning has input labelled data

n #instances x #attributes matrix/table

n #attributes = #features + 1
n 1 (usu. the last attribute) is for the class attribute

n Labelled data is split into 2 or 3 disjoint subsets

n Training data è Build a model
n Training error = 1.0 – #correctly_classified / #training_instances

n Validation/development data è Select/refine the model

n Testing data è Evaluate the model
n Testing/generalization error = 1.0 – #correctly_classified / #testing_instances

n We mainly discuss binary classification here
n i.e., #labels = 2

n Overview of the whole process (not including
cross-validation)


Classification—A Two-Step Process

n Model construction: describing a set of predetermined classes
n Each tuple/sample is assumed to belong to a predefined class,

as determined by the class label attribute
n The set of tuples used for model construction is training set
n The model is represented as classification rules, decision trees,

or mathematical formulae
n Model usage: for classifying future or unknown objects

n Estimate accuracy of the model
n The known label of test sample is compared with the

classified result from the model
n Accuracy rate is the percentage of test set samples that are

correctly classified by the model
n Test set is independent of training set, otherwise over-fitting

will occur
n If the accuracy is acceptable, use the model to classify data

tuples whose class labels are not known

In practice, much more than this simple view


Classification Process (1):
Preprocessing & Feature Engineering

Raw Data:

EID Name Title EMP_Date Track
101 Mike Assistant Prof 2013-01-01 3-year contract
102 Mary Assistant Prof 2009-04-15 Continuing
103 Bill Scientia Professor 2014-02-03 Continuing
110 Jim Associate Prof 2009-03-14 Continuing
121 Dave Associate Prof 2009-12-02 1-year contract
234 Anne Professor 2013-03-21 Future Fellow
188 Sarah Student Officer 2008-01-17 Continuing

Training Data (after preprocessing and feature engineering; the last
attribute, TENURED, is the class label attribute):

Gender RANK YEARS TENURED
M Assistant Prof 3 no
F Assistant Prof 7 yes
M Professor 2 yes
M Associate Prof 7 yes
M Associate Prof 6 no
F Professor 3 no

Classification Process (2): Training

Training Data (class label attribute: TENURED), with the model’s
predictions on the training set:

Gender RANK YEARS TENURED Predicted
M Assistant Prof 3 no no
F Assistant Prof 7 yes yes
M Professor 2 yes yes
M Associate Prof 7 yes yes
M Associate Prof 6 no yes
F Professor 3 no yes

Classification Algorithms; θ
        ↓ training
Classifier (Model):

IF rank = ‘professor’
OR years ≥ 6
THEN tenured = ‘yes’

training error = 33.3%


Classification Process (3): Evaluate the
Model on Testing Data

Classifier
(Model)

Testing Data

(M, Jeff, Professor, 4, yes)
(F, Merlisa, Asso Prof, 5, yes)

IF rank = ‘professor’
OR years ≥ 6
THEN tenured = ‘yes’

testing error = 50%


Classification Process (4): Use the Model
in Production

Classifier
(Model)

Unseen Data
IF rank = ‘professor’
OR years ≥ 6
THEN tenured = ‘yes’ (Pedro, Assi Prof, 8, ?)

How to judge a model?
n Based on training error or testing error?

n Testing error
n Otherwise, this is a kind of data snooping è overfitting

n What if there are multiple models to choose from?
n Use a separate validation/development set to select among them

n Can we trust the error values on
the development set?

n Need “large” dev set
è less data for training
n k-fold cross-validation
n k=n: leave-one-out


10-fold CV (figure): split all the labelled data into a test set and a
training set; further split a “test/development set” from the training set.
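A minimal sketch of the k-fold splitting just described (plain Python; training a model on each fold and averaging the per-fold errors is left to the caller):

```python
import random

def k_fold_splits(n_examples, k=10, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation;
    k = n_examples gives leave-one-out."""
    indices = list(range(n_examples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]          # k roughly equal folds
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test

# Typical use: fit a model on each `train` split, measure its error on `test`,
# and report the average error over the k folds.
```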

Exercise: Problem definition and
Feature Engineering

n How to formulate the following into ML problems?
What kind of resources do you need? What are
the features you think may be most relevant?
1. Predict the sale trend of a particular product

in the next month

2. Design an algorithm to produce better top-10
ranking results for queries


Supervised vs. Unsupervised
Learning

n Supervised learning (classification)
n Supervision: The training data (observations,

measurements, etc.) are accompanied by labels
indicating the class of the observations

n New data is classified based on the training set
n Unsupervised learning (clustering)

n The class labels of the training data are unknown
n Given a set of measurements, observations, etc. with

the aim of establishing the existence of classes or
clusters in the data


n Decision Tree classifier
n ID3
n Other variants


Training Dataset

age income student credit_rating buys_computer
<=30 high no fair no
<=30 high no excellent no
31…40 high no fair yes
>40 medium no fair yes
>40 low yes fair yes
>40 low yes excellent no
31…40 low yes excellent yes
<=30 medium no fair no
<=30 low yes fair yes
>40 medium yes fair yes
<=30 medium yes excellent yes
31…40 medium no excellent yes
31…40 high yes fair yes
>40 medium no excellent no

This
follows an
example
from
Quinlan’s
ID3


Output: A Decision Tree for “buys_computer”

age?
  <=30   → student?
             no  → no
             yes → yes
  31…40  → yes
  >40    → credit rating?
             excellent → no
             fair      → yes

Partition for age <= 30:
age income student credit_rating buys_computer
<=30 high no fair no
<=30 high no excellent no
<=30 medium no fair no
<=30 low yes fair yes
<=30 medium yes excellent yes

(The full 14-tuple training set, shown on the previous slide, sits at the
root; the other per-branch partitions follow.)

Partition for age > 40:
age income student credit_rating buys_computer
>40 medium no fair yes
>40 low yes fair yes
>40 low yes excellent no
>40 medium yes fair yes
>40 medium no excellent no

Partition for age 31…40:
age income student credit_rating buys_computer
31…40 high no fair yes
31…40 low yes excellent yes
31…40 medium no excellent yes
31…40 high yes fair yes


Extracting Classification Rules from Trees
n Represent the knowledge in the form of IF-THEN rules

n Rules are easier for humans to understand
n One rule is created for each path from the root to a leaf

n Each attribute-value pair along a path forms a conjunction
n The leaf node holds the class prediction
n Example

IF age = “<=30” AND student = “no” THEN buys_computer = “no”
IF age = “<=30” AND student = “yes” THEN buys_computer = “yes”
IF age = “31…40” THEN buys_computer = “yes”
IF age = “>40” AND credit_rating = “excellent” THEN buys_computer = “no”
IF age = “>40” AND credit_rating = “fair” THEN buys_computer = “yes”


Algorithm for Decision Tree Induction
n Basic algorithm (a greedy algorithm)

n Input: Attributes are categorical (if continuous-valued, they are
discretized in advance)

n Overview: Tree is constructed in a top-down recursive divide-and-
conquer manner

n At start, all the training examples are at the root
n Samples are partitioned recursively based on selected test-

attributes
n Test-attributes are selected on the basis of a heuristic or

statistical measure (e.g., information gain)
n Three conditions for stopping partitioning (i.e., boundary conditions)

n There are no samples left, OR
n All samples for a given node belong to the same class, OR
n There are no remaining attributes for further partitioning

(majority voting is employed for classifying the leaf)

Exercise: Write down the pseudo-code of the induction algorithm
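One possible answer to the exercise; a minimal ID3-style sketch under the assumptions stated above (categorical attributes, information gain as the selection measure). The function names are my own choosing:

```python
from collections import Counter
import math

def entropy(labels):
    """I(s1, ..., sm) over the class labels of one partition."""
    total = len(labels)
    return -sum(c / total * math.log2(c / total) for c in Counter(labels).values())

def info_gain(rows, labels, attr):
    """Information gain of splitting on a categorical attribute `attr`."""
    n = len(labels)
    by_value = {}
    for row, y in zip(rows, labels):
        by_value.setdefault(row[attr], []).append(y)
    return entropy(labels) - sum(len(part) / n * entropy(part)
                                 for part in by_value.values())

def build_tree(rows, labels, attributes):
    """Greedy top-down induction (ID3 style). `rows` are dicts mapping
    attribute name -> categorical value. Returns either a class label (leaf)
    or a pair (attribute, {value: subtree})."""
    if len(set(labels)) == 1:                 # all samples in the same class
        return labels[0]
    if not attributes:                        # no attributes left: majority vote
        return Counter(labels).most_common(1)[0][0]
    best = max(attributes, key=lambda a: info_gain(rows, labels, a))
    subtrees = {}
    for value in {row[best] for row in rows}:
        idx = [i for i, row in enumerate(rows) if row[best] == value]
        subtrees[value] = build_tree([rows[i] for i in idx],
                                     [labels[i] for i in idx],
                                     [a for a in attributes if a != best])
    return (best, subtrees)
```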

Decision Tree Induction Algorithm


Attribute Selection Measure:
Information Gain (ID3/C4.5)

n Select the attribute with the highest information gain
n S contains si tuples of class Ci for i = {1, …, m}
n information measures info required to classify any

arbitrary tuple

n entropy of attribute A with values {a1,a2,…,av}

n information gained by branching on attribute A

    I(s1, s2, …, sm) = − Σ_{i=1..m} (si / s) · log2(si / s)            (expected information)

    E(A) = Σ_{j=1..v} [ (s1j + … + smj) / s ] · I(s1j, …, smj)         (entropy after splitting on A)

    Gain(A) = I(s1, s2, …, sm) − E(A)                                  (information gained by branching on A)


Attribute Selection by Information

Gain Computation

g Class P: buys_computer = “yes”

g Class N: buys_computer = “no”

g I(p, n) = I(9, 5) =0.940

g Compute the entropy for age:
(5/14) I(2,3) means “age <=30” has 5 out of 14 samples, with 2 yes’es and 3 no’s.

age     pi  ni  I(pi, ni)
<=30    2   3   0.971
30…40   4   0   0
>40     3   2   0.971

Hence
E(age) = (5/14) I(2,3) + (4/14) I(4,0) + (5/14) I(3,2) = 0.694
Gain(age) = I(p, n) − E(age) = 0.246

Similarly,
Gain(income) = 0.029
Gain(student) = 0.151
Gain(credit_rating) = 0.048

income  pi  ni  I(pi, ni)
high    2   2   1
medium  4   2   0.918
low     3   1   0.811

Q: what’s the extreme/worst case?
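A small sketch reproducing the numbers above, with I and Gain as defined on the previous slide (plain Python; the class counts are supplied directly):

```python
import math

def expected_info(counts):
    """I(s1, ..., sm): expected information, given per-class counts."""
    s = sum(counts)
    return -sum(c / s * math.log2(c / s) for c in counts if c > 0)

def gain(parent_counts, branch_counts):
    """Gain(A) = I(parent) - E(A); branch_counts lists the class counts of
    each branch produced by splitting on attribute A."""
    s = sum(parent_counts)
    e_a = sum(sum(b) / s * expected_info(b) for b in branch_counts)
    return expected_info(parent_counts) - e_a

print(f"{expected_info([9, 5]):.3f}")                   # I(9,5) = 0.940
print(f"{gain([9, 5], [[2, 3], [4, 0], [3, 2]]):.3f}")  # Gain(age) ≈ 0.247
# (the slide's 0.246 comes from rounding I and E to three decimals first)
```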


Other Attribute Selection Measures
and Splitting Choices

n Gini index (CART, IBM IntelligentMiner)
n All attributes are assumed continuous-valued
n Assume there exist several possible split values for

each attribute
n May need other tools, such as clustering, to get the

possible split values
n Can be modified for categorical attributes

n Induces binary split => binary decision trees


Gini Index (IBM IntelligentMiner)

n If a data set T contains examples from n classes, the gini index
gini(T) is defined as

    gini(T) = 1 − Σ_{j=1..n} (pj)²

where pj is the relative frequency of class j in T.

n If a data set T is split into two subsets T1 and T2 with sizes
N1 and N2 respectively, the gini index of the split data is defined as

    gini_split(T) = (N1/N)·gini(T1) + (N2/N)·gini(T2)

n The attribute that provides the smallest gini_split(T) is chosen to
split the node (need to enumerate all possible splitting
points for each attribute).


Case I: Numerical Attributes

Age Car Class
20 … Y
20 … N
20 … N
25 … N
25 … Y
30 … Y
30 … Y
30 … Y
40 … Y
40 … Y

Candidate split values (midpoints): 22.5, 27.5

Sorted Age values: 20 20 20 25 25 30 30 30 40 40

Cut = 22.5      Y  N
  Age <  22.5   1  2
  Age >= 22.5   6  1

Cut = 27.5      Y  N
  Age <  27.5   2  3
  Age >= 27.5   5  0

Gini(S) = 1 − Σ_j (pj)²
Gini_split(S) = (n1/n)·Gini(S1) + (n2/n)·Gini(S2)

Cut = 22.5: Gini_split(S) = 3/10·Gini(1,2) + 7/10·Gini(6,1) = 0.30
Cut = 27.5: Gini_split(S) = 5/10·Gini(2,3) + 5/10·Gini(5,0) = 0.24

Exercise: compute gini indexes for other splits
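A sketch of the computation for this exercise, with gini and gini_split as defined two slides back (plain Python; class counts are given as (Y, N) pairs):

```python
def gini(counts):
    """gini(T) = 1 - sum_j p_j^2, given per-class counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def gini_split(partitions):
    """gini_split = sum_k (N_k / N) * gini(T_k); partitions lists the
    per-branch class counts."""
    n = sum(sum(p) for p in partitions)
    return sum(sum(p) / n * gini(p) for p in partitions)

# The two candidate cuts from the slide:
print(f"{gini_split([(1, 2), (6, 1)]):.2f}")   # cut = 22.5 -> 0.30
print(f"{gini_split([(2, 3), (5, 0)]):.2f}")   # cut = 27.5 -> 0.24
```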


Case II: Categorical Attributes

attrib list for Car Class=Y Class=N
M 5 0

T 0 2

S 2 1

count matrix

          Y  N
{M, T}    5  2
{S}       2  1     Gini_split(S) = 7/10·Gini(5,2) + 3/10·Gini(2,1) = 0.42

          Y  N
{M, S}    7  1
{T}       0  2     Gini_split(S) = 8/10·Gini(7,1) + 2/10·Gini(0,2) = 0.18

          Y  N
{T, S}    2  3
{M}       5  0     Gini_split(S) = 5/10·Gini(2,3) + 5/10·Gini(5,0) = 0.24

Need to consider all possible splits !

Age Car Class
20 M Y

30 M Y

25 T N

30 S Y

40 S Y

20 T N

30 M Y

25 M Y

40 M Y

20 S N
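A sketch of how the enumeration of all binary groupings of a categorical attribute could look; it reuses gini_split from the previous sketch and the count matrix above (each grouping is visited twice, which is harmless here):

```python
from itertools import combinations

def best_binary_split(values, class_counts):
    """Enumerate binary groupings of a categorical attribute and return the
    one with the smallest gini_split.
    class_counts: dict value -> (yes_count, no_count)."""
    values = list(values)
    best = None
    for r in range(1, len(values)):
        for left in combinations(values, r):
            right = [v for v in values if v not in left]
            parts = []
            for group in (left, right):
                ys = sum(class_counts[v][0] for v in group)
                ns = sum(class_counts[v][1] for v in group)
                parts.append((ys, ns))
            score = gini_split(parts)          # helper from the sketch above
            if best is None or score < best[0]:
                best = (score, set(left), set(right))
    return best

# Count matrix from the slide: Car in {M, T, S}.
print(best_binary_split("MTS", {"M": (5, 0), "T": (0, 2), "S": (2, 1)}))
# -> (0.175, {'T'}, {'M', 'S'}), i.e. the {M,S} vs {T} split, gini_split ≈ 0.18
```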


ID3

age?
  <=30   → student?
             no  → no
             yes → yes
  31…40  → yes
  >40    → credit rating?
             excellent → no
             fair      → yes

(ID3 builds this tree top-down, repeatedly choosing the attribute with the
highest information gain; the per-branch data partitions are the same as on
the “Output: A Decision Tree for buys_computer” slide.)


CART/SPRINT

age?
  <=30  → student?
            no  → no
            yes → yes
  >30   → credit rating?
            excellent → no
            fair      → yes

(CART/SPRINT induce binary splits on age; illustrative of the shape only.)


Avoid Overfitting in Classification

n Overfitting: An induced tree may overfit the training data
n Too many branches, some may reflect anomalies due

to noise or outliers
n Poor accuracy for unseen samples

n Two approaches to avoid overfitting
n Prepruning: Halt tree construction early—do not split a

node if this would result in the goodness measure
falling below a threshold

n Difficult to choose an appropriate threshold
n Postpruning: Remove branches from a “fully grown”

tree—get a sequence of progressively pruned trees
n Use a set of data different from the training data to

decide which is the “best pruned tree”

Overfitting Example /1
Name  Body Temperature  Gives Birth  Four-legged  Hibernates  Is Mammal?

Salamander Cold-blooded No Yes Yes No
Guppy Cold-blooded Yes No No No
Eagle Warm-blooded No No No No
Poorwill Warm-blooded No No Yes No
Platypus Warm-blooded No Yes Yes Yes

• Lack of representative samples
• Existence of noise

Overfit tree (training error = 0%, but does not generalize!):

Body temperature?
  cold → not a mammal
  warm → Hibernates?
           no  → not a mammal
           yes → Four-legged?
                   yes → mammal
                   no  → not a mammal

(Human? Dog?: both are warm-blooded non-hibernators, so this tree
misclassifies them as non-mammals.)

Overfitting Example /2
Name  Body Temperature  Gives Birth  Four-legged  Hibernates  Is Mammal?

Salamander Cold-blooded No Yes Yes No
Guppy Cold-blooded Yes No No No
Eagle Warm-blooded No No No No
Poorwill Warm-blooded No No Yes No
Platypus Warm-blooded No Yes Yes Yes

• Lack of representative samples
• Existence of noise

Simpler tree (training error = 20%, but generalizes!):

Body temperature?
  cold → not a mammal
  warm → Gives Birth?
           yes → mammal
           no  → not a mammal

(Human? Dog?: now classified correctly as mammals.)

Overfitting


n Overfitting: model too complex è training error keeps decreasing,
but testing error increases

n Underfitting: model too simple è both training and testing errors are large.

Overfitting


Overfitting examples in Regression &
Classification


DT Pruning Methods

n Use a separate validation set

n Estimation of generalization/test errors

n Use all the data for training

n but apply a statistical test (e.g., chi-square)
to estimate whether expanding or pruning a
node may improve the entire distribution

n Use minimum description length (MDL) principle

n halting growth of the tree when the encoding
is minimized

Pessimistic Post-pruning

n Observed on the training data
n e(t): #errors on a leaf node t of the tree T
n e(T) = ∑t in T e(t)

n What’s the generalization error (i.e., error on testing data) on T?
n Use pessimistic estimates
n e′(t) = e(t) + 0.5
n e′(T) = e(T) + 0.5N, where N is the number of leaf nodes in T

n What’s the generalization error on root(T) only?
n e′(root(T)) = e(T) + 0.5

n Post-pruning from bottom-up
n If generalization error reduces after pruning, replace sub-tree by a

leaf node
n Use majority voting to decide the class label


Better estimate of generalization
error in C4.5:
Use the upper 75% confidence
bound from the training error of a
node, assuming a binomial
distribution

Example
Class = Yes 20
Class = No 10


Error = 10/30

A?

A1 A3A2

Class = Yes 8
Class = No 4

Class = Yes 3
Class = No 4

Class = Yes 4
Class = No 1

Class = Yes 5
Class = No 1

A4

n Training error before splitting on
A = 10/30

n Pessimistic error = (10+0.5)/30
n Training error after splitting on A

= 9/30
n Pessimistic error = (9 +

4*0.5)/30 = 11/30

Prune the subtree at A
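The same pruning decision, expressed as a small sketch (plain Python; the leaf class counts are taken from the figure above):

```python
def pessimistic_error(leaf_counts):
    """Pessimistic estimate e'(T) = e(T) + 0.5 * (#leaves), where e(T) counts
    the training tuples misclassified by each leaf's majority class."""
    errors = sum(sum(c) - max(c) for c in leaf_counts)
    return errors + 0.5 * len(leaf_counts)

n = 30                                                            # tuples reaching node A
as_leaf    = pessimistic_error([(20, 10)])                        # (10 + 0.5)   = 10.5
as_subtree = pessimistic_error([(8, 4), (3, 4), (4, 1), (5, 1)])  # (9 + 4*0.5)  = 11.0
print(as_leaf / n, as_subtree / n)   # 10.5/30 < 11/30 -> prune the subtree at A
```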



Classification in Large Databases

n Classification—a classical problem extensively studied by
statisticians and machine learning researchers

n Scalability: Classifying data sets with millions of examples
and hundreds of attributes with reasonable speed

n Why decision tree induction in data mining?
n relatively faster learning speed (than other classification

methods)
n convertible to simple and easy to understand

classification rules
n can use SQL queries for accessing databases
n comparable classification accuracy with other methods


Enhancements to basic decision
tree induction

n Allow for continuous-valued attributes
n Dynamically define new discrete-valued attributes that

partition the continuous attribute value into a discrete
set of intervals, one-hot encoding, or specialized DT
learning algorithms

n Handle missing attribute values
n Assign the most common value of the attribute
n Assign probability to each of the possible values

n Attribute construction
n Create new attributes based on existing ones that are

sparsely represented
n This reduces fragmentation, repetition, and replication


n Bayesian Classifiers


Bayesian Classification: Why?

n Probabilistic learning: Calculate explicit probabilities for
hypothesis, among the most practical approaches to
certain types of learning problems

n Incremental: Each training example can incrementally
increase/decrease the probability that a hypothesis is
correct. Prior knowledge can be combined with observed
data.

n Probabilistic prediction: Predict multiple hypotheses,
weighted by their probabilities

n Standard: Even when Bayesian methods are
computationally intractable, they can provide a standard
of optimal decision making against which other methods
can be measured


Bayesian Theorem: Basics

n Let X be a data sample whose class label is unknown
n Let h be a hypothesis that X belongs to class C
n For classification problems, determine P(h|X): the

probability that the hypothesis holds given the observed
data sample X

n P(h): prior probability of hypothesis h (i.e. the initial
probability before we observe any data, reflects the
background knowledge)

n P(X): probability that sample data is observed
n P(X|h) : probability of observing the sample X, given that

the hypothesis holds

n Given training data X, the posterior probability of a hypothesis
h, P(h|X), follows the Bayes theorem:

    P(h|X) = P(X|h) · P(h) / P(X)

n Informally, this can be written as
posterior = likelihood x prior / evidence

n MAP (maximum a posteriori) hypothesis:

    h_MAP = argmax_h P(h|X) = argmax_h P(X|h) · P(h)

n Practical difficulty: requires initial knowledge of many
probabilities, significant computational cost


Bayesian Theorem


Training dataset

age income student credit_rating buys_computer
<=30 high no fair no
<=30 high no excellent no
30…40 high no fair yes
>40 medium no fair yes
>40 low yes fair yes
>40 low yes excellent no
31…40 low yes excellent yes
<=30 medium no fair no
<=30 low yes fair yes
>40 medium yes fair yes
<=30 medium yes excellent yes
31…40 medium no excellent yes
31…40 high yes fair yes
>40 medium no excellent no

Hypotheses:
C1:buys_computer=
‘yes’
C2:buys_computer=
‘no’

Data sample
X = (age<=30, Income=medium, Student=yes, Credit_rating=Fair)

Sparsity Problem

n Maximum likelihood estimate of P(h)
n Let p be the probability that the class is C1
n Consider a training example x and its label y
n L(x) = p^y (1−p)^(1−y)                                   [data likelihood]
n L(X) = ∏_{x in X} L(x)
n l(X) = log L(X) = ∑_{x in X} y·log(p) + (1−y)·log(1−p)   [log data likelihood]
n To maximize l(X), let dl(X)/dp = 0 è p = (∑y)/n
n P(C1) = 9/14, P(C2) = 5/14
n ML estimate of P(X|h) = ?
n Requires O(2^d) training examples, where d is the #features
  [curse of dimensionality]

Naïve Bayes Classifier

n Use a model
n Assumption: attributes are conditionally independent:

    P(X | Ci) = ∏_{k=1..n} P(xk | Ci)

n The product of occurrence of say 2 elements x1 and x2, given the current
  class is C, is the product of the probabilities of each element taken
  separately, given the same class: P([x1,x2] | C) = P(x1 | C) · P(x2 | C)
n No dependence relation between attributes given the class
n Greatly reduces the computation cost, only count the class distribution
  è Only need to estimate P(xk | Ci)

Naïve Bayesian Classifier: Example

n Compute P(X|Ci) for each class,
  X = (age<=30, income=medium, student=yes, credit_rating=fair)

P(age=“<=30” | buys_computer=“yes”) = 2/9 = 0.222
P(age=“<=30” | buys_computer=“no”) = 3/5 = 0.6
P(income=“medium” | buys_computer=“yes”) = 4/9 = 0.444
P(income=“medium” | buys_computer=“no”) = 2/5 = 0.4
P(student=“yes” | buys_computer=“yes”) = 6/9 = 0.667
P(student=“yes” | buys_computer=“no”) = 1/5 = 0.2
P(credit_rating=“fair” | buys_computer=“yes”) = 6/9 = 0.667
P(credit_rating=“fair” | buys_computer=“no”) = 2/5 = 0.4

P(X|Ci) (likelihood):
P(X|buys_computer=“yes”) = 0.222 x 0.444 x 0.667 x 0.667 = 0.044
P(X|buys_computer=“no”) = 0.6 x 0.4 x 0.2 x 0.4 = 0.019

P(X|Ci) · P(Ci):
P(X|buys_computer=“yes”) · P(buys_computer=“yes”) = 0.028
P(X|buys_computer=“no”) · P(buys_computer=“no”) = 0.007

X belongs to class “buys_computer=yes”

The Need for Smoothing

n Pr[Xi = vj | Ck] could still be 0, if not observed in the training data
n makes Pr[Ck | X] = 0, regardless of the other likelihood values Pr[Xt = vt | Ck]
n Add-1 Smoothing
n reserve a small amount of probability for unseen events
n (conditional) probabilities of observed events have to be adjusted to make
  the total probability equal 1.0

Add-1 Smoothing

n Without smoothing: Pr[Xi = vj | Ck] = Count(Xi = vj, Ck) / Count(Ck)
n With smoothing:    Pr[Xi = vj | Ck] = [Count(Xi = vj, Ck) + 1] / [Count(Ck) + B]
n What’s the right value for B?
n make ∑_{vj} Pr[Xi = vj | Ck] = 1
n Explain the above constraint
n Pr[Xi | Ck]: given Ck, the (conditional) probability of Xi taking a specific value
n Xi must take one of the values, hence summing over all possible values for
  Xi, the probabilities should sum up to 1.0
n B = |dom(Xi)|, i.e., # of values Xi can take
n Many other kinds of smoothing exist (esp. in natural language modelling)

Smoothing Example

age income student credit_rating buys_computer
30…40 high no fair no
30…40 high no excellent no
30…40 high no fair yes
>40 medium no fair yes
>40 low yes fair yes
>40 low yes excellent no
31…40 low yes excellent yes
30…40 medium no fair no
<=30 low yes fair yes
>40 medium yes fair yes
<=30 medium yes excellent yes
31…40 medium no excellent yes
31…40 high yes fair yes
>40 medium no excellent no

Class:
C1:buys_computer=
‘yes’
C2:buys_computer=
‘no’

Data sample
X = (age<=30, Income=medium, Student=yes, Credit_rating=Fair)

Note: there is no instance of age <= 30 in the “No” class.
B = 3 for age and income (3 possible values each); B = 2 for student and
credit_rating (2 possible values each).

Consider the “No” Class

n Compute P(X|Ci) for the “No” class,
  X = (age<=30, income=medium, student=yes, credit_rating=fair)

P(age=“<=30” | buys_computer=“no”) = (0 + 1) / (5 + 3) = 0.125
P(income=“medium” | buys_computer=“no”) = (2 + 1) / (5 + 3) = 0.375
P(student=“yes” | buys_computer=“no”) = (1 + 1) / (5 + 2) = 0.286
P(credit_rating=“fair” | buys_computer=“no”) = (2 + 1) / (5 + 2) = 0.429

P(X|Ci):
P(X|buys_computer=“no”) = 0.125 x 0.375 x 0.286 x 0.429 = 0.00575

P(X|Ci) · P(Ci):
P(X|buys_computer=“no”) · P(buys_computer=“no”) = 0.00205

Probabilities        Without Smoothing    With Smoothing
Pr[<=30 | No]        0 / 5                1 / 8
Pr[30..40 | No]      3 / 5                4 / 8
Pr[>=40 | No]        2 / 5                3 / 8

(with smoothing, the conditional probabilities for age sum up to 1)
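A sketch of the add-1 (more generally, add-α) estimate that reproduces the smoothed column above; the counts are entered by hand:

```python
from collections import Counter

def smoothed_conditional(value, feature_counts, class_count, domain_size, alpha=1):
    """Pr[Xi = v | Ck] = (Count(Xi = v, Ck) + alpha) / (Count(Ck) + alpha * B)."""
    return (feature_counts.get(value, 0) + alpha) / (class_count + alpha * domain_size)

# Counts of `age` within the "No" class of the smoothing-example table
# (no instance of age <= 30 in that class):
age_counts_no = Counter({"30..40": 3, ">40": 2})
for v in ("<=30", "30..40", ">40"):
    print(v, smoothed_conditional(v, age_counts_no, class_count=5, domain_size=3))
# -> 1/8 = 0.125, 4/8 = 0.5, 3/8 = 0.375 (the three values sum to 1)
```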

How to Handle Numeric Values

n Need to model the distribution of Pr[Xi | Ck]
n Method 1:
n Assume it to be Gaussian è Gaussian Naïve Bayes

    Pr[Xi = vj | Ck] = 1 / sqrt(2π·σi²) · exp( −(vj − μik)² / (2·σi²) )

n Method 2:
n Use binning to discretize the feature values
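A direct transcription of the density above as a helper (a sketch only; in practice the mean and variance would be estimated per feature and class from the training data):

```python
import math

def gaussian_likelihood(v, mean, var):
    """Pr[Xi = v | Ck] under the Gaussian assumption, with `mean` and `var`
    estimated from the class-Ck training tuples for feature Xi."""
    return math.exp(-(v - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)
```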

Text Classification

n NB has been widely used in text classification
(aka., text categorization)

n Outline:
n Applications
n Language Model
n Two Classification Methods


Based on “Chap 13: Text classification & Naive Bayes”
in Introduction to Information Retrieval

•http://nlp.stanford.edu/IR-book/

Document Classification                                               13.1

Classes: ML, Planning, Semantics, Garb.Coll., Multimedia, GUI, …
         (grouped in the figure under (AI), (Programming), (HCI))

Training Data (characteristic words per class):
  ML:         learning, intelligence, algorithm, reinforcement, network, …
  Planning:   planning, temporal, reasoning, plan, language, …
  Semantics:  programming, semantics, language, proof, …
  Garb.Coll.: garbage, collection, memory, optimization, region, …
  Multimedia: …
  GUI:        …

Test Data: “planning language proof intelligence”

(Note: in real life there is often a hierarchy, not present in the above
problem statement; and also, you get papers on ML approaches to Garb. Coll.)

More Text Classification Examples:
Many search engine functionalities use classification

Assign labels to each document or web-page:
n Labels are most often topics such as Yahoo-categories

e.g., “finance,” “sports,” “news>world>asia>business”
n Labels may be genres

e.g., “editorials”, “movie-reviews”, “news”
n Labels may be opinion on a person/product

e.g., “like”, “hate”, “neutral”
n Labels may be domain-specific

e.g., “interesting-to-me” : “not-interesting-to-me”
e.g., “contains adult language” : “doesn’t”
e.g., language identification: English, French, Chinese, …
e.g., search vertical: about Linux versus not
e.g., “link spam” : “not link spam”

Challenge in Applying NB to Text

n Need to model P(text | class)
n e.g., P(“a b c d” | class)

n Extremely sparse è Need a model to help
n 1st method:

n Based on a statistical language model
n Turns out to be the bag of words model if using unigrams

n è multinomial NB
n 2nd method:

n View a text as a set of tokens è a Boolean vector in
{0, 1}|V|, where V is the vocabulary.

n è Bernoulli NB

Data Mining: Concepts and Techniques 59

sklearn.naive_bayes.MultinomialNB

sklearn.naive_bayes.BernoulliNB
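A minimal usage sketch of the two scikit-learn classes named above; the three-document corpus is made up for illustration:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

docs = ["planning language proof intelligence",
        "learning algorithm reinforcement network",
        "garbage collection memory optimization"]
labels = ["Planning", "ML", "Garb.Coll."]

model = make_pipeline(CountVectorizer(), MultinomialNB(alpha=1.0))  # alpha = add-1 smoothing
model.fit(docs, labels)
print(model.predict(["memory optimization for planning"]))
```

Swapping MultinomialNB for BernoulliNB turns the word counts into presence/absence features, i.e. the Bernoulli (set-of-tokens) model discussed below.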

Unigram and higher-order Language
models in Information Retrieval

n P(“a b c d”) =

n P(a) * P(b | a) * P(c | a b) * P(d | a b c)

n Unigram model: 0-th order Markov Model

n P(d | a b c) = P(d)

n Bigram Language Models: 1st order Markov
Model

n P(d | a b c) = P(d | c)

n P(a b c d) = P(a) * P(b | a) * P(c | b) * P(d | c)

n The same with class-conditional probabilities,
i.e., P(“a b c d” | C)

Easy.
Effective!

Naïve Bayes via a class conditional
language model = multinomial NB

n Effectively, the probability of each class is
done as a class-specific unigram language
model

Class

w1 w2 w3 w4 w5 w6

Probabilistic Graphical Model: arrow means conditional
dependency.

Using Multinomial Naive Bayes
Classifiers to Classify Text: Basic method

For an example text of 4 tokens, w1 w2 w3 w4:

    P(Cj | w1 w2 w3 w4) ∝ P(Cj) · P(w1 w2 w3 w4 | Cj)
                        ∝ P(Cj) · ∏_{i=1..4} P(wi | Cj)          (unigram model)
                        ∝ P(Cj) · ∏_{i=1..|V|} P(xi | Cj)^θi     (bag of words; multinomial)

n Essentially, the classification is independent of the
  positions of the words
n Use the same parameters for each position
n Result is the bag of words model, i.e. text ≡ {xi : θi}
  e.g., “to be or not to be” ≡ {to: 2, be: 2, or: 1, not: 1}

Naïve Bayes: Learning

n From training corpus, extract V = Vocabulary
n Calculate the required P(cj) and P(xk | cj) terms
n For each cj in C do
n docsj ¬ subset of documents for which the target class is cj
n P(cj) ¬ |docsj| / |total # documents|
n Textj ¬ single document containing all docsj
n for each word xk in Vocabulary
n nk ¬ number of occurrences of xk in Textj
n P(xk | cj) ¬ (nk + α) / (n + α·|Vocabulary|),
  where n is the total number of word positions in Textj

Naïve Bayes: Classifying

n positions ¬ all word positions in current document
which contain tokens found in Vocabulary

n Return cNB, where

    cNB = argmax_{cj ∈ C}  P(cj) · ∏_{i ∈ positions} P(xi | cj)
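A from-scratch sketch of the learning and classifying procedures above, with add-α smoothing and the computations done in log space (as recommended on the underflow-prevention slide below); the function names are my own:

```python
import math
from collections import Counter

def train_multinomial_nb(docs, labels, alpha=1.0):
    """Learn log P(c) and log P(x_k | c) with add-alpha smoothing.
    docs: list of token lists; labels: list of class labels."""
    vocab = {w for d in docs for w in d}
    prior, cond = {}, {}
    for c in set(labels):
        docs_c = [d for d, y in zip(docs, labels) if y == c]
        prior[c] = math.log(len(docs_c) / len(docs))
        counts = Counter(w for d in docs_c for w in d)     # the "mega-document" Text_c
        total = sum(counts.values())
        cond[c] = {w: math.log((counts[w] + alpha) / (total + alpha * len(vocab)))
                   for w in vocab}
    return prior, cond, vocab

def classify(tokens, prior, cond, vocab):
    """c_NB = argmax_c [ log P(c) + sum_i log P(x_i | c) ]; unseen words skipped."""
    scores = {c: prior[c] + sum(cond[c][w] for w in tokens if w in vocab)
              for c in prior}
    return max(scores, key=scores.get)
```

With α = 1 this matches the add-1 smoothing described earlier.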

Naive Bayes: Time Complexity

n Training Time: O(|D|Ld + |C||V|)
where Ld is the average length of a document in D.
n Assumes V and all Di , ni, and nij pre-computed in O(|D|Ld)

time during one pass through all of the data.

n Generally just O(|D|Ld) since usually |C||V| < |D|Ld
n Test Time: O(|C| Lt) where Lt is the average length of a test document.
n Very efficient overall, linearly proportional to the time needed to just
  read in all the data. Why?

Underflow Prevention: log space

n Multiplying lots of probabilities, which are between 0 and 1 by definition,
  can result in floating-point underflow.
n Since log(xy) = log(x) + log(y), it is better to perform all computations by
  summing logs of probabilities rather than multiplying probabilities.
n The class with the highest final un-normalized log probability score is
  still the most probable.

    cNB = argmax_{cj ∈ C} [ log P(cj) + ∑_{i ∈ positions} log P(xi | cj) ]

n Note that the model is now just a max of a sum of weights…

Bernoulli Model

n V = {a, b, c, d, e} = {x1, x2, x3, x4, x5}
n Feature functions fi(text) = 1 if text contains xi
n The feature functions extract a vector in {0, 1}^|V| from any text
n Apply NB directly

              a b c d e   C
“a b c d”     1 1 1 1 0   +
“d e b”       0 1 0 1 1   −

Two Models /1

n Model 1: Multinomial = Class conditional unigram
n One feature Xi for each word position in the document
n The feature’s values are all words in the dictionary
n Value of Xi is the word in position i
n Naïve Bayes assumption:
n Given the document’s topic, the word in one position in the document tells
  us nothing about words in other positions
n Second assumption:
n Word appearance does not depend on position

    P(Xi = w | c) = P(Xj = w | c)   for all positions i, j, word w, and class c

n Just have one multinomial feature predicting all words

Two Models /2

n Model 2: Multivariate Bernoulli
n One feature Xw for each word in the dictionary
n Xw = true in document d if w appears in d
n Naïve Bayes assumption:
n Given the document’s topic, the appearance of one word in the document tells
  us nothing about the chances that another word appears
n This is the model used in the binary independence model in classic
  probabilistic relevance feedback in hand-classified data (Maron in IR was a
  very early user of NB)

Parameter estimation

n Multivariate Bernoulli model:
    P̂(Xw = t | cj) = fraction of documents of topic cj in which word w appears
n Multinomial model:
    P̂(Xi = w | cj) = fraction of times in which word w appears across all
    documents of topic cj
n Can create a mega-document for topic j by concatenating all documents in
  this topic
n Use the frequency of w in the mega-document

Classification

n Multinomial vs Multivariate Bernoulli?
n The multinomial model is almost always more effective in text applications!
n See results figures later
n See IIR sections 13.2 and 13.3 for worked examples with each model

Feature selection for NB

n In general, feature selection is necessary for multivariate Bernoulli NB.
n Otherwise you suffer from noise, multi-counting
n “Feature selection” really means something different for multinomial NB:
  it means dictionary truncation
n The multinomial NB model only has 1 feature
n This “feature selection” normally isn’t needed for multinomial NB, but may
  help a fraction with quantities that are badly estimated

NB Model Comparison: WebKB (results figure)

Naïve Bayes on spam email (results figure)   13.6

Violation of NB Assumptions

n Conditional independence
n “Positional independence”
n Examples?

n Two identical sensors at the same location
n Equivalent training dataset
n Note: P(s1|C) = P(s2|C) no matter what

Raining: P(+,+,r) = 3/8, P(−,−,r) = 1/8    Sunny: P(+,+,s) = 1/8, P(−,−,s) = 3/8

Observations:
S1 S2 Class
+  +  r
+  +  r
+  +  r
−  −  r
+  +  s
−  −  s
−  −  s
−  −  s

P(si = + | r) = ?   P(si = − | r) = ?   P(si = + | s) = ?   P(si = − | s) = ?
P(r) = ?   P(s) = ?
n P(r | ++) = ?
n P(s | ++) = ?

Prediction:
S1 S2 Class
+  +  ???
P(si = + | r) = 3/4    P(si = − | r) = 1/4
P(si = + | s) = 1/4    P(si = − | s) = 3/4
P(r) = 1/2             P(s) = 1/2

Naïve Bayes (unnormalized):
n P(r | ++) ∝ P(r) · P(s1 = + | r) · P(s2 = + | r) = 9/32
n P(s | ++) ∝ P(s) · P(s1 = + | s) · P(s2 = + | s) = 1/32

Normalized:
n P(r | ++) = 9/10
n P(s | ++) = 1/10

(In the training data, “+ +” occurs 4 times, 3 of them with class r, so the
actual posterior is P(r | ++) = 3/4, not 9/10.)

S1 S2 Class
+  +  r
+  +  r
+  +  r
−  −  r
+  +  s
−  −  s
−  −  s
−  −  s

Problem: Posterior probability estimation not accurate
Reason:  Correlation between features
Fix:     Use logistic regression classifier

Naïve Bayes Posterior Probabilities

n Classification results of naïve Bayes (the class with maximum posterior
  probability) are usually fairly accurate.
n However, due to the inadequacy of the conditional independence assumption,
  the actual posterior-probability numerical estimates are not.
n Output probabilities are commonly very close to 0 or 1.
n Correct estimation ⇒ accurate prediction, but correct probability
  estimation is NOT necessary for accurate prediction (just need the right
  ordering of probabilities)

Naïve Bayesian Classifier: Comments

n Advantages:
n Easy to implement
n Good results obtained in most of the cases
n Disadvantages
n Assumption: class conditional independence, therefore loss of accuracy
n Practically, dependencies exist among variables
n E.g., hospitals: patients: Profile: age, family history, etc.; Symptoms:
  fever, cough, etc.; Disease: lung cancer, diabetes, etc.
n Dependencies among these cannot be modeled by the Naïve Bayesian Classifier
n Better methods?
n Bayesian Belief Networks
n Logistic regression / maxent

n (Linear Regression and) Logistic Regression Classifier
n See LR slides

n Other Classification Methods

Instance-Based Methods

n Instance-based learning:
n Store training examples and delay the processing (“lazy evaluation”) until
  a new instance must be classified
n Typical approaches
n k-nearest neighbor approach
n Instances represented as points in a Euclidean space.

The k-Nearest Neighbor Algorithm

n All instances correspond to points in the n-D space.
n The nearest neighbors are defined in terms of Euclidean distance.
n The target function could be discrete- or real-valued.
n For discrete-valued targets, k-NN returns the most common value among the
  k training examples nearest to xq.
n Voronoi diagram: the decision boundary induced by 1-NN for a typical set of
  training examples.

(figure: 1-NN decision boundary around + / − training points and query point xq)

Effect of k

• k acts as a smoother

Discussion on the k-NN Algorithm

n The k-NN algorithm for continuous-valued target functions
n Calculate the mean values of the k nearest neighbors
n Distance-weighted nearest neighbor algorithm
n Weight the contribution of each of the k neighbors according to their
  distance to the query point xq
n giving greater weight to closer neighbors, e.g., w ≡ 1 / d(xq, xi)²
n Similarly, for real-valued target functions
n Robust to noisy data by averaging the k nearest neighbors
n Curse of dimensionality:
n kNN search becomes very expensive in high-dimensional space
n High-dimensional indexing methods, e.g., LSH
n distance between neighbors could be dominated by irrelevant attributes
n To overcome it, stretch axes or eliminate the least relevant attributes.
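A minimal sketch of the discrete-valued k-NN rule described above (Euclidean distance, majority vote); the toy points are made up:

```python
import math
from collections import Counter

def knn_predict(query, examples, k=3):
    """k-NN for a discrete-valued target: majority vote among the k training
    examples nearest to `query` under Euclidean distance.
    examples: list of (feature_vector, label) pairs."""
    neighbours = sorted(examples, key=lambda ex: math.dist(query, ex[0]))[:k]
    return Counter(label for _, label in neighbours).most_common(1)[0][0]

# toy usage: two classes clustered around (0,0) and (5,5)
train = [((0, 0), "A"), ((1, 0), "A"), ((0, 1), "A"),
         ((5, 5), "B"), ((5, 6), "B"), ((6, 5), "B")]
print(knn_predict((4.5, 5.2), train, k=3))   # -> "B"
```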
Remarks on Lazy vs. Eager Learning

n Instance-based learning: lazy evaluation
n Decision-tree and Bayesian classification: eager evaluation
n Key differences
n Lazy methods may consider the query instance xq when deciding how to
  generalize beyond the training data D
n Eager methods cannot, since they have already chosen the global
  approximation before seeing the query
n Efficiency: Lazy - less time training but more time predicting
n Accuracy
n Lazy methods effectively use a richer hypothesis space since they use many
  local linear functions to form an implicit global approximation to the
  target function
n Eager: must commit to a single hypothesis that covers the entire instance
  space