Data analysis. Data management. Lecture 6

1. LECTURE 6

Data analysis.
Data management.
G.Ordabayeva

2.

1) Data analysis bases
2) Characteristics of data sample
3) Classification, Prediction
4) Classification by Decision Tree Induction
5) What is data mining?
6) What is “big data”?

1) Data analysis bases
Data analysis is a process of inspecting, cleansing,
transforming, and modeling data with the goal of
discovering useful information, informing
conclusions, and supporting decision-making.
Data analysis has multiple facets and approaches,
encompassing diverse techniques under a variety of
names, and is used in different business, science,
and social science domains.

4. Data Analytics

Accumulation of raw data captured from various
sources (e.g. discussion boards, emails, exam logs,
chat logs in e-learning systems) can be used to
identify fruitful patterns and relationships (Bose,
2009).
- Exploratory visualization: uses exploratory data
analytics to capture relationships that are perhaps
unknown or at least less formally formulated
- Confirmatory visualization: theory-driven

5.

2) Characteristics of data sample
In any report or article, the structure of the
sample must be accurately described. It is
especially important to exactly determine the
structure of the sample (and specifically the size of
the subgroups) when subgroup analyses will be
performed during the main analysis phase.
The characteristics of the data sample can be
assessed by looking at:
- Basic statistics of important variables
- Scatter plots
- Correlations and associations
- Cross-tabulations
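
The checks listed above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library; the sample data, the pass mark of 60, and the names `sample`, `pearson`, and `crosstab` are assumptions made for the example and do not come from the lecture.

```python
from collections import Counter
from statistics import mean, median, stdev

# Hypothetical sample: (group, hours_studied, exam_score) per respondent.
sample = [
    ("A", 2, 55), ("A", 4, 62), ("A", 6, 71),
    ("B", 3, 58), ("B", 5, 69), ("B", 8, 80),
]

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

hours = [h for _, h, _ in sample]
scores = [s for _, _, s in sample]

# Basic statistics of an important variable.
basic_stats = {"mean": mean(scores), "median": median(scores), "stdev": stdev(scores)}

# Size of each subgroup, needed before any subgroup analysis.
subgroup_sizes = Counter(group for group, _, _ in sample)

# Correlation between two variables (the numeric summary of a scatter plot).
r = pearson(hours, scores)

# Cross-tabulation: group versus "passed" (assumed pass mark: 60).
crosstab = Counter((group, score >= 60) for group, _, score in sample)

print(basic_stats, dict(subgroup_sizes), round(r, 3), dict(crosstab))
```

Reporting `subgroup_sizes` alongside the basic statistics makes it easy to see whether each subgroup is large enough for the planned analysis.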

6. 3) Classification, Prediction

Classification
- predicts categorical class labels (discrete or nominal)
- classifies data (constructs a model) based on the training set and
the values (class labels) in a classifying attribute and uses it in
classifying new data
Prediction
- models continuous-valued functions, for example, predicts
unknown or missing values
Typical applications:
- Credit approval
- Target marketing
- Medical diagnosis
- Fraud detection

7. Classification—A Two-Step Process

Model construction: describing a set of predetermined classes
- Each tuple/sample is assumed to belong to a predefined class,
as determined by the class label attribute
- The set of tuples used for model construction is the training set
- The model is represented as classification rules, decision trees,
or mathematical formulae
Model usage: classifying future or unknown objects
- Estimate the accuracy of the model
- The known label of each test sample is compared with the
classified result from the model
- Accuracy rate is the percentage of test set samples that are
correctly classified by the model
- The test set must be independent of the training set; otherwise
the accuracy estimate is over-optimistic and over-fitting goes
undetected

8. Classification Process (1): Model Construction

Training Data -> Classification Algorithms -> Classifier (Model)

Training Data:
NAME   RANK             YEARS   TENURED
Mike   Assistant Prof   4       no
Mary   Assistant Prof   10      yes
Bill   Professor        5       yes
Jim    Associate Prof   11      yes
Dave   Assistant Prof   5       no
Anne   Associate Prof   3       no

Classifier (Model):
IF rank = ‘professor’
OR years > 6
THEN tenured = ‘yes’
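
As a quick check of the model construction step, the rule on this slide can be written as a function and compared against the training tuples. This is only a sketch: the rule and the data are taken from the slide, while the names `training_data`, `classify`, and `mismatches` are assumptions made for the example. The rule is written by hand here rather than learned by an algorithm.

```python
# Training set from the slide: (name, rank, years, tenured).
training_data = [
    ("Mike", "Assistant Prof", 4, "no"),
    ("Mary", "Assistant Prof", 10, "yes"),
    ("Bill", "Professor", 5, "yes"),
    ("Jim", "Associate Prof", 11, "yes"),
    ("Dave", "Assistant Prof", 5, "no"),
    ("Anne", "Associate Prof", 3, "no"),
]

def classify(rank, years):
    """Rule from the slide: IF rank = 'professor' OR years > 6 THEN tenured = 'yes'."""
    return "yes" if rank.lower() == "professor" or years > 6 else "no"

# Names of training tuples whose known label disagrees with the rule.
mismatches = [
    name
    for name, rank, years, tenured in training_data
    if classify(rank, years) != tenured
]

print(mismatches)
```

An empty `mismatches` list means the rule is consistent with every training tuple, which says nothing yet about how it behaves on independent test data.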

9. Classification Process (2): Use the Model in Prediction

Testing Data -> Classifier

Testing Data:
NAME      RANK             YEARS   TENURED
Tom       Assistant Prof   2       no
Merlisa   Associate Prof   7       no
George    Professor        5       yes
Joseph    Assistant Prof   7       yes

Unseen Data -> Classifier
(George, Professor, 5)
Tenured?
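
The model usage step can be sketched the same way: apply the rule from the previous slide to the testing data, compute the accuracy rate, and then classify the unseen tuple. The rule and the data come from the slides; the function and variable names are assumptions made for the example.

```python
def classify(rank, years):
    """Rule from the model construction slide."""
    return "yes" if rank.lower() == "professor" or years > 6 else "no"

# Testing set from the slide: (name, rank, years, known tenured label).
test_data = [
    ("Tom", "Assistant Prof", 2, "no"),
    ("Merlisa", "Associate Prof", 7, "no"),
    ("George", "Professor", 5, "yes"),
    ("Joseph", "Assistant Prof", 7, "yes"),
]

# Compare the known label of each test sample with the model's result.
correct = sum(
    classify(rank, years) == tenured for _, rank, years, tenured in test_data
)
accuracy = correct / len(test_data)

# Classify the unseen tuple (George, Professor, 5).
unseen_prediction = classify("Professor", 5)

print(f"accuracy = {accuracy:.2f}, George tenured? {unseen_prediction}")
```

The rule misclassifies Merlisa (7 years, but not tenured), so the accuracy on this test set is 3 out of 4 even though the rule fits the training data perfectly.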

10. Issues (1): Data Preparation

Data cleaning
Preprocess data in order to reduce noise and handle
missing values
Relevance analysis (feature selection)
Remove the irrelevant or redundant attributes
Data transformation
Generalize and/or normalize data
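
The three preparation steps can be illustrated with a short Python sketch. Everything here is an assumption made for the example: the records, the choice of mean imputation for missing values, the decision to treat `id` as an irrelevant attribute, and min-max scaling as the normalization method. Other choices are equally valid.

```python
from statistics import mean

# Hypothetical records; None marks a missing value.
records = [
    {"id": 1, "age": 25, "income": 30000},
    {"id": 2, "age": None, "income": 52000},
    {"id": 3, "age": 41, "income": 75000},
    {"id": 4, "age": 33, "income": None},
]

def impute_mean(rows, field):
    """Data cleaning: replace missing values of `field` with the mean of known values."""
    fill = mean(r[field] for r in rows if r[field] is not None)
    return [{**r, field: fill if r[field] is None else r[field]} for r in rows]

def drop_fields(rows, fields):
    """Relevance analysis: remove attributes judged irrelevant or redundant."""
    return [{k: v for k, v in r.items() if k not in fields} for r in rows]

def min_max(rows, field):
    """Data transformation: rescale `field` to the range [0, 1]."""
    values = [r[field] for r in rows]
    lo, hi = min(values), max(values)
    span = hi - lo
    return [{**r, field: (r[field] - lo) / span if span else 0.0} for r in rows]

prepared = records
for field in ("age", "income"):
    prepared = impute_mean(prepared, field)
prepared = drop_fields(prepared, {"id"})
for field in ("age", "income"):
    prepared = min_max(prepared, field)

print(prepared)
```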

11. Issues (2): Evaluating Classification Methods

Predictive accuracy
Speed and scalability
time to construct the model
time to use the model
Robustness
handling noise and missing values
Scalability
efficiency in disk-resident databases
Interpretability:
understanding and insight provided by the model
Goodness of rules
decision tree size
compactness of classification rules

12. 4) Classification by Decision Tree Induction

Decision tree
A flow-chart-like tree structure
Internal node denotes a test on an attribute
Branch represents an outcome of the test
Leaf nodes represent class labels or class distribution
Decision tree generation consists of two phases
Tree construction
At start, all the training examples are at the root
Partition examples recursively based on selected attributes
Tree pruning
Identify and remove branches that reflect noise or outliers
Use of decision tree: Classifying an unknown sample
Test the attribute values of the sample against the decision tree
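
The slide does not say how the splitting attribute is selected. One common measure, used by Quinlan's ID3, is information gain: the reduction in class entropy after partitioning on an attribute. The sketch below assumes that measure and applies it to the tenure training data from slide 8. The threshold `years > 6` is borrowed from that slide's rule, and the function names are assumptions made for the example.

```python
from collections import Counter
from math import log2

# Training tuples from slide 8.
rows = [
    {"rank": "Assistant Prof", "years": 4, "tenured": "no"},
    {"rank": "Assistant Prof", "years": 10, "tenured": "yes"},
    {"rank": "Professor", "years": 5, "tenured": "yes"},
    {"rank": "Associate Prof", "years": 11, "tenured": "yes"},
    {"rank": "Assistant Prof", "years": 5, "tenured": "no"},
    {"rank": "Associate Prof", "years": 3, "tenured": "no"},
]

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(rows, test, label="tenured"):
    """Entropy before the split minus the weighted entropy after it.

    `test` maps a row to the outcome of the attribute test (the branch it follows).
    """
    before = entropy([r[label] for r in rows])
    partitions = {}
    for r in rows:
        partitions.setdefault(test(r), []).append(r[label])
    after = sum(len(p) / len(rows) * entropy(p) for p in partitions.values())
    return before - after

gain_rank = information_gain(rows, lambda r: r["rank"])
gain_years = information_gain(rows, lambda r: r["years"] > 6)

print(f"gain(rank) = {gain_rank:.4f}, gain(years > 6) = {gain_years:.4f}")
```

Under this measure the test `years > 6` separates the classes better than `rank`, so it would be chosen for the root, and the same procedure would then be applied recursively to each partition.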

13. Training Dataset

This follows an example from Quinlan’s ID3

age      income   student   credit_rating
<=30     high     no        fair
<=30     high     no        excellent
31…40    high     no        fair
>40      medium   no        fair
>40      low      yes       fair
>40      low      yes       excellent
31…40    low      yes       excellent
<=30     medium   no        fair
<=30     low      yes       fair
>40      medium   yes       fair
<=30     medium   yes       excellent
31…40    medium   no        excellent
31…40    high     yes       fair
>40      medium   no        excellent

14. Output: A Decision Tree for “buys_computer”

age?
  <=30: student?
    no: no
    yes: yes
  31..40: yes
  >40: credit rating?
    excellent: no
    fair: yes
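
Using a decision tree means testing a sample's attribute values from the root down to a leaf. The sketch below encodes the “buys_computer” tree as nested dictionaries and walks it. It assumes the middle age branch is 31..40 and leads directly to "yes", as in the standard version of this example; the dictionary layout and the names `tree` and `classify` are assumptions made for the illustration.

```python
# Internal nodes name the attribute they test; leaves are class labels.
tree = {
    "attribute": "age",
    "branches": {
        "<=30": {
            "attribute": "student",
            "branches": {"no": "no", "yes": "yes"},
        },
        "31..40": "yes",
        ">40": {
            "attribute": "credit_rating",
            "branches": {"excellent": "no", "fair": "yes"},
        },
    },
}

def classify(node, sample):
    """Follow the branch matching the sample's attribute value until a leaf is reached."""
    while isinstance(node, dict):
        node = node["branches"][sample[node["attribute"]]]
    return node

young_student = {"age": "<=30", "income": "low", "student": "yes", "credit_rating": "fair"}
senior_excellent = {"age": ">40", "income": "medium", "student": "no", "credit_rating": "excellent"}

print(classify(tree, young_student), classify(tree, senior_excellent))
```

Note that `income` never appears in the tree: an attribute that is not selected during construction plays no part in classification.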

15. What is Data Mining?

Data mining (knowledge discovery from data)
- Extraction of interesting (non-trivial, implicit,
previously unknown and potentially useful) patterns
or knowledge from huge amounts of data
Alternative name
- Knowledge discovery in databases (KDD)
What is not data mining?
- Query processing
- Expert systems or statistical programs

16. Data Mining: A KDD Process

Data mining: core of the knowledge discovery process

Databases
-> Data Cleaning, Data Integration
-> Data Warehouse
-> Selection
-> Task-relevant Data
-> Data Mining
-> Pattern Evaluation

17. Architecture: Typical Data Mining System

Components, from top to bottom:
- Graphical user interface
- Pattern evaluation
- Data mining engine
- Knowledge-base
- Database or data warehouse server
- Data cleaning & data integration, filtering
- Databases, Data Warehouse

18. Data Mining: Confluence of Multiple Disciplines

Data Mining draws on:
- Database Systems
- Machine Learning
- Algorithm
- Statistics
- Visualization
- Other Disciplines

19. Multi-Dimensional View of Data Mining

Data to be mined
- Relational, data warehouse, transactional, stream,
object-oriented/relational, active, spatial, time-series,
text, multi-media, heterogeneous, WWW
Knowledge to be mined
- Characterization, discrimination, association,
classification, clustering, trend/deviation, outlier
analysis, etc.
- Multiple/integrated functions and mining at multiple
levels

20.

6) What is “big data”?
“Big Data are high-volume, high-velocity, and/or
high-variety information assets that require new
forms of processing to enable enhanced decision
making, insight discovery and process optimization”.
Complicated (intelligent) analysis of data may make
small data “appear” to be “big”.
Bottom line: any data that exceeds our current
processing capability can be regarded as “big”.

21. Computational View of Big Data

Data Visualization
Data Access
Data Understanding
Data Analysis
Data Integration
Formatting, Cleaning
Storage
Data

22.

Thank you for your attention!
