Data analysis. Data management. Lecture 6
1. LECTURE 6
Data analysis. Data management.
G.Ordabayeva
2.
1) Data analysis basics
2) Characteristics of data sample
3) Classification, Prediction
4) Classification by Decision Tree Induction
5) What is data mining?
6) What is “big data”?
3. 1) Data analysis basics
Data analysis is a process of inspecting, cleansing, transforming, and modeling data with the goal of discovering useful information, informing conclusions, and supporting decision-making.
Data analysis has multiple facets and approaches, encompassing diverse techniques under a variety of names, and is used in different business, science, and social science domains.
4. Data Analytics
Accumulation of raw data captured from various sources (e.g. discussion boards, emails, exam logs, chat logs in e-learning systems) can be used to identify fruitful patterns and relationships (Bose, 2009).
Exploratory visualization: uses exploratory data analytics by capturing relationships that are perhaps unknown or at least less formally formulated
Confirmatory visualization: theory-driven
5. 2) Characteristics of data sample
In any report or article, the structure of the sample must be accurately described. It is especially important to determine the exact structure of the sample (specifically, the size of the subgroups) when subgroup analyses will be performed during the main analysis phase.
The characteristics of the data sample can be
assessed by looking at:
- Basic statistics of important variables
- Scatter plots
- Correlations and associations
- Cross-tabulations
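The checks listed above can be sketched with the Python standard library. This is only an illustration: the sample records, the variable names, and the pass mark of 60 are hypothetical and do not come from the lecture. Scatter plots are left out because they need a plotting library.

```python
from collections import Counter
from statistics import mean, median, stdev

# Hypothetical sample: (group, hours_studied, exam_score). Not from the lecture.
sample = [
    ("A", 2, 55), ("A", 4, 62), ("A", 6, 71),
    ("B", 1, 48), ("B", 5, 68), ("B", 8, 83),
]

groups = [g for g, _, _ in sample]
hours = [h for _, h, _ in sample]
scores = [s for _, _, s in sample]

# Structure of the sample: size of each subgroup.
subgroup_sizes = Counter(groups)

# Basic statistics of an important variable.
summary = {"mean": mean(scores), "median": median(scores), "stdev": stdev(scores)}

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

# Correlation between two numeric variables.
r = pearson(hours, scores)

# Cross-tabulation: subgroup versus "passed" (assumed pass mark: 60).
crosstab = Counter((g, s >= 60) for g, _, s in sample)
```

Reporting `subgroup_sizes` first is deliberate: it shows whether each subgroup is large enough for the subgroup analyses planned later.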
6. 3) Classification, Prediction
Classification
- predicts categorical class labels (discrete or nominal)
- classifies data (constructs a model) based on the training set and
the values (class labels) in a classifying attribute and uses it in
classifying new data
Prediction
- models continuous-valued functions, for example, predicts
unknown or missing values
Typical applications:
- Credit approval
- Target marketing
- Medical diagnosis
- Fraud detection
7. Classification—A Two-Step Process
Model construction: describing a set of predetermined classes
- Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute
- The set of tuples used for model construction: training set
- The model is represented as classification rules, decision trees, or mathematical formulae
Model usage: for classifying future or unknown objects
- Estimate accuracy of the model
- The known label of each test sample is compared with the classified result from the model
- Accuracy rate is the percentage of test set samples that are correctly classified by the model
- The test set must be independent of the training set; otherwise the accuracy estimate is over-optimistic and over-fitting goes undetected
8. Classification Process (1): Model Construction
Training Data -> Classification Algorithms -> Classifier (Model)

Training Data:
NAME   RANK            YEARS  TENURED
Mike   Assistant Prof  4      no
Mary   Assistant Prof  10     yes
Bill   Professor       5      yes
Jim    Associate Prof  11     yes
Dave   Assistant Prof  5      no
Anne   Associate Prof  3      no

Classifier (Model):
IF rank = ‘professor’ OR years > 6
THEN tenured = ‘yes’
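The rule on this slide can be written as a small Python function and checked against the six training rows. This is a sketch: the function name `rule` and the tuple layout are illustrative choices, and the rank comparison ignores case because the rule says ‘professor’ while the table says "Professor".

```python
# Training data from the slide: (name, rank, years, tenured).
training_data = [
    ("Mike", "Assistant Prof", 4, "no"),
    ("Mary", "Assistant Prof", 10, "yes"),
    ("Bill", "Professor", 5, "yes"),
    ("Jim", "Associate Prof", 11, "yes"),
    ("Dave", "Assistant Prof", 5, "no"),
    ("Anne", "Associate Prof", 3, "no"),
]

def rule(rank, years):
    """IF rank = 'professor' OR years > 6 THEN tenured = 'yes', else 'no'."""
    return "yes" if rank.lower() == "professor" or years > 6 else "no"

# Names of training samples that the rule gets wrong.
training_errors = [
    name for name, rank, years, tenured in training_data
    if rule(rank, years) != tenured
]
print(training_errors)  # [] : the rule fits all six training rows
```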
9. Classification Process (2): Use the Model in Prediction
Testing Data -> Classifier

Testing Data:
NAME     RANK            YEARS  TENURED
Tom      Assistant Prof  2      no
Merlisa  Associate Prof  7      no
George   Professor       5      yes
Joseph   Assistant Prof  7      yes

Unseen Data: (George, Professor, 5) -> Classifier -> Tenured?
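As a sketch of the model-usage step, the rule from the model-construction slide (IF rank = ‘professor’ OR years > 6 THEN tenured = ‘yes’) can be applied to the four test rows to estimate accuracy, and then to the unseen sample. The function name `predict_tenured` is illustrative and not part of the lecture.

```python
# Testing data from the slide: (name, rank, years, known tenured label).
test_data = [
    ("Tom", "Assistant Prof", 2, "no"),
    ("Merlisa", "Associate Prof", 7, "no"),
    ("George", "Professor", 5, "yes"),
    ("Joseph", "Assistant Prof", 7, "yes"),
]

def predict_tenured(rank, years):
    """Rule learned from the training set in the model-construction step."""
    return "yes" if rank.lower() == "professor" or years > 6 else "no"

# Compare the known label of each test sample with the classified result.
misclassified = [
    name for name, rank, years, tenured in test_data
    if predict_tenured(rank, years) != tenured
]
accuracy = (len(test_data) - len(misclassified)) / len(test_data)

print(accuracy)       # 0.75
print(misclassified)  # ['Merlisa'] : 7 years but not tenured

# Classify the unseen sample (George, Professor, 5).
print(predict_tenured("Professor", 5))  # yes
```

The rule fits the training rows exactly but misclassifies one of four test rows, which is why accuracy is estimated on data the model has not seen.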
10. Issues (1): Data Preparation
Data cleaning
- Preprocess data in order to reduce noise and handle missing values
Relevance analysis (feature selection)
- Remove the irrelevant or redundant attributes
Data transformation
- Generalize and/or normalize data
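Two of these preparation steps can be sketched in a few lines. The slide does not prescribe specific techniques, so this example assumes mean imputation for missing values and min-max scaling for normalization; the `years` values, including the missing entry, are hypothetical.

```python
from statistics import mean

# Hypothetical attribute values; None marks a missing value.
years = [4, 10, None, 11, 5, 3]

# Data cleaning: replace each missing value with the mean of the observed ones.
observed = [y for y in years if y is not None]
fill_value = mean(observed)
cleaned = [fill_value if y is None else y for y in years]

# Data transformation: min-max normalization to the range [0, 1].
lowest, highest = min(cleaned), max(cleaned)
normalized = [(y - lowest) / (highest - lowest) for y in cleaned]
```

Mean imputation is only one option. Dropping incomplete rows or using the median are common alternatives, and the right choice depends on how much data is missing and why.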
11. Issues (2): Evaluating Classification Methods
Predictive accuracy
Speed and scalability
- time to construct the model
- time to use the model
Robustness
- handling noise and missing values
Scalability
- efficiency in disk-resident databases
Interpretability
- understanding and insight provided by the model
Goodness of rules
- decision tree size
- compactness of classification rules
12. 4) Classification by Decision Tree Induction
Decision tree
- A flow-chart-like tree structure
- Internal node denotes a test on an attribute
- Branch represents an outcome of the test
- Leaf nodes represent class labels or class distribution
Decision tree generation consists of two phases
- Tree construction: at start, all the training examples are at the root; examples are partitioned recursively based on selected attributes
- Tree pruning: identify and remove branches that reflect noise or outliers
Use of decision tree: classifying an unknown sample
- Test the attribute values of the sample against the decision tree
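The slide does not say how attributes are selected during tree construction. A common choice, and the one used by ID3, is information gain, so the sketch below assumes that measure. It scores the RANK attribute on the six tenure training rows from the model-construction slide; the function names are illustrative.

```python
from collections import Counter, defaultdict
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(rows, attribute_index, label_index=-1):
    """Entropy reduction obtained by partitioning rows on one attribute."""
    labels = [row[label_index] for row in rows]
    partitions = defaultdict(list)
    for row in rows:
        partitions[row[attribute_index]].append(row[label_index])
    remainder = sum(
        len(part) / len(rows) * entropy(part) for part in partitions.values()
    )
    return entropy(labels) - remainder

# (rank, tenured) pairs taken from the tenure training data.
rows = [
    ("Assistant Prof", "no"), ("Assistant Prof", "yes"), ("Professor", "yes"),
    ("Associate Prof", "yes"), ("Assistant Prof", "no"), ("Associate Prof", "no"),
]

base_entropy = entropy([label for _, label in rows])  # 3 yes, 3 no
gain_rank = information_gain(rows, 0)
```

With three "yes" and three "no" labels the starting entropy is 1 bit, and splitting on rank reduces it by about 0.21 bits. During construction, the attribute with the largest gain would be tested at the current node, and the same procedure would be repeated on each partition.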
13. Training Dataset
This follows an example from Quinlan’s ID3.

age     income  student  credit_rating
<=30    high    no       fair
<=30    high    no       excellent
31…40   high    no       fair
>40     medium  no       fair
>40     low     yes      fair
>40     low     yes      excellent
31…40   low     yes      excellent
<=30    medium  no       fair
<=30    low     yes      fair
>40     medium  yes      fair
<=30    medium  yes      excellent
31…40   medium  no       excellent
31…40   high    yes      fair
>40     medium  no       excellent
14. Output: A Decision Tree for “buys_computer”
age?
  <=30: student?
    no: no
    yes: yes
  31…40: yes
  >40: credit rating?
    excellent: no
    fair: yes
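The tree on this slide can be encoded as nested Python data and walked from root to leaf to classify a sample. This is a sketch: the tuple-and-dict representation and the function name `classify` are illustrative choices, not part of the lecture.

```python
# Internal node: (attribute, {branch value: subtree}); leaf: a class label string.
buys_computer_tree = ("age", {
    "<=30": ("student", {"no": "no", "yes": "yes"}),
    "31…40": "yes",
    ">40": ("credit_rating", {"excellent": "no", "fair": "yes"}),
})

def classify(node, sample):
    """Test the sample's attribute values against the tree until a leaf is reached."""
    while not isinstance(node, str):
        attribute, branches = node
        node = branches[sample[attribute]]
    return node

young_student = {"age": "<=30", "income": "low", "student": "yes", "credit_rating": "fair"}
senior_excellent = {"age": ">40", "income": "medium", "student": "no", "credit_rating": "excellent"}

print(classify(buys_computer_tree, young_student))     # yes
print(classify(buys_computer_tree, senior_excellent))  # no
```

Note that `income` never appears in the tree, so it has no effect on the prediction.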
15. 5) What is Data Mining?
Data mining (knowledge discovery from data)
- Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) patterns or knowledge from huge amounts of data
Alternative name
- Knowledge discovery in databases (KDD)
Not data mining
- Query processing
- Expert systems or statistical programs
16. Data Mining: A KDD Process
Data mining: the core of the knowledge discovery process
Process steps, from source data to result:
Databases
-> Data Cleaning and Data Integration
-> Data Warehouse
-> Selection
-> Task-relevant Data
-> Data Mining
-> Pattern Evaluation
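The KDD steps can be pictured as a chain of small functions, each feeding the next. The sketch below is a toy illustration only: the two source "databases", the field names, the frequency-count "mining" step, and the `min_support` threshold are all hypothetical and stand in for much larger real components.

```python
from collections import Counter

# Hypothetical source databases; None marks a dirty field.
db_a = [{"customer": "c1", "item": "laptop"}, {"customer": "c2", "item": None}]
db_b = [{"customer": "c3", "item": "laptop"}, {"customer": "c4", "item": "phone"}]

def clean(records):
    """Data cleaning: drop records that contain missing values."""
    return [r for r in records if all(v is not None for v in r.values())]

def integrate(*sources):
    """Data integration: combine several cleaned sources into one warehouse."""
    return [r for source in sources for r in source]

def select(records, field):
    """Selection: keep only the task-relevant attribute."""
    return [r[field] for r in records]

def mine(values):
    """Data mining (toy version): count how often each value occurs."""
    return Counter(values)

def evaluate(patterns, min_support):
    """Pattern evaluation: keep patterns that occur at least min_support times."""
    return {p: n for p, n in patterns.items() if n >= min_support}

warehouse = integrate(clean(db_a), clean(db_b))
task_relevant = select(warehouse, "item")
patterns = evaluate(mine(task_relevant), min_support=2)
print(patterns)  # {'laptop': 2}
```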
17. Architecture: Typical Data Mining System
Components, from the user-facing layer down to the data sources:
- Graphical user interface
- Pattern evaluation
- Data mining engine
- Knowledge-base
- Database or data warehouse server
- Data cleaning & data integration, filtering
- Databases, data warehouse
18. Data Mining: Confluence of Multiple Disciplines
Data mining draws on:
- Database systems
- Machine learning
- Algorithm
- Statistics
- Visualization
- Other disciplines
19. Multi-Dimensional View of Data Mining
Data to be mined
- Relational, data warehouse, transactional, stream, object-oriented/relational, active, spatial, time-series, text, multi-media, heterogeneous, WWW
Knowledge to be mined
- Characterization, discrimination, association, classification, clustering, trend/deviation, outlier analysis, etc.
- Multiple/integrated functions and mining at multiple levels
20. 6) What is “big data”?
“Big Data are high-volume, high-velocity, and/or high-variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization”.
Complicated (intelligent) analysis of data may make small data “appear” to be “big”.
Bottom line: Any data that exceeds our current capability of processing can be regarded as “big”.
21. Computational View of Big Data
Data Visualization
Data Access
Data Understanding
Data Analysis
Data Integration
Formatting, Cleaning
Storage
Data
22. Thank you for your attention!