                # Statistics toolbox

## 2. Decision Tree functions

treefit
treeprune
treedisp
treetest
treeval
Fit a tree-based model for classification
or regression
Produce a sequence of subtrees by
pruning
Show classification or regression tree
graphically
Compute error rate for tree
Compute fitted value for decision tree
applied to data

## 3. Функция ‘treefit’ - fit a tree-based model for classification or regression. Syntax: t = treefit(X,y)

Пример:
t = treefit(meas,species);
treedisp(t,'names',{'SL'
'SW' 'PL' 'PW'});

## 4. Cluster analysis functions

cluster
clusterdata Create clusters from a data set
cophenet
Calculate the cophenetic correlation coefficient
dendrogram Plot a hierarchical tree in a dendrogram graph
inconsistent Calculate the inconsistency values of objects in a
cluster hierarchy tree
kmeans
K-means clustering
Link objects in a dataset into a hierarchical tree of
binary clusters
pdist
Calculate the pairwise distance between objects in a
dataset
silhouette Silhouette plot for clustered data
squareform Reformat output of pdist function from vector to
square matrix

## 5. Функция kmeans

IDX = kmeans(X,k)
[IDX,C] = kmeans(X,k)
[IDX,C,sumd] = kmeans(X,k)
[IDX,C,sumd,D] = kmeans(X,k)
[...] = kmeans(...,'param1',val1,'param2',val2,...)
• IDX = kmeans(X, k) partitions the points in the n-by-p data matrix X into k
clusters. This iterative partitioning minimizes the sum, over all clusters, of
the within-cluster sums of point-to-cluster-centroid distances. Rows of X
correspond to points, columns correspond to variables. By default, kmeans
uses squared Euclidean distances.
• IDX - n-by-1 vector containing the cluster indices of each point.
• C - k-by-p matrix cluster centroid locations.
• sumd - 1-by-k vector within-cluster sums of point-to-centroid distances.
• D - n-by-k matrix of distances from each point to every centroid.

## 6. Параметр ‘distance’

• 'sqEuclidean‘ - Squared Euclidean distance (default).
• 'cityblock‘ - Sum of absolute differences, i.e., L1.
• 'cosine‘ - One minus the cosine of the included angle between
points (treated as vectors).
• 'correlation‘ - One minus the sample correlation between
points (treated as sequences of values).
• 'Hamming‘ - Percentage of bits that differ (only suitable for
binary data).

## 7. Параметр ‘start’

• Method used to choose the initial cluster centroid positions,
sometimes known as "seeds". Valid starting values are:
• 'sample‘ - Select k observations from X at random (default).
• 'uniform‘ - Select k points uniformly at random from the range
of X. Not valid with Hamming distance.
• 'cluster‘ - Perform a preliminary clustering phase on a random
10% subsample of X. This preliminary phase is itself initialized
using 'sample'.
• ‘Matrix’ - k-by-p matrix of centroid starting locations. In this
case, you can pass in [] for k, and kmeans infers k from the
first dimension of the matrix. You can also supply a 3dimensional array, implying a value for the 'replicates'
parameter from the array's third dimension.

## 8. Classification

4.5
4
Sepal width
gscatter(meas(:,1),
meas(:,2),
species,'rgb','osd');
xlabel('Sepal length');
ylabel('Sepal width');
setosa
versicolor
virginica
3.5
3
2.5
2
4
4.5
5
5.5
6
6.5
Sepal length
7
7.5
8

## 9. Linear and quadratic discriminant analysis

Sepal width
linclass = classify(meas(:,1:2),
meas(:,1:2),species);
4.5
numobs = size(meas,1);
4
setosa
versicolor
virginica
3.5
hold on;
3
2.5
'kx');
hold off;
2
4
4.5
5
5.5
6
6.5
Sepal length
7
7.5
8

## 10. Visualization regioning the plane

[x,y] = meshgrid(4:.1:8,2:.1:4.5);
x = x(:);
4.5
y = y(:);
j = classify([x y],meas(:,1:2), 4
species);
gscatter(x,y,j,'grb','sod') 3.5
y
versicolor
setosa
virginica
3
2.5
2
4
4.5
5
5.5
6
x
6.5
7
7.5
8

## 11. Decision trees

tree = treefit(meas(:,1:2), species);
[dtnum,dtnode,dtclass] = treeval(tree, meas(:,1:2));
4.5
versicolor
setosa
virginica
4
y
3.5
3
2.5
2
4
4.5
5
5.5
6
x
6.5
7
7.5
8

## 13. Тестирование качества классификации

0.8
Cross-validation
Resubstitution
0.7
Cost (misclassification error)
resubcost = treetest(tree,'resub');
[cost,secost,ntermnodes,bestlevel] =
treetest(tree,'cross',meas(:,1:2),species);
plot(ntermnodes,cost,'b-',
ntermnodes,resubcost,'r--')
figure(gcf);
xlabel('Number of terminal nodes');
ylabel('Cost (misclassification error)')
legend('Cross-validation','Resubstitution')
0.6
0.5
0.4
0.3
0.2
0.1
0
5
10
15
Number of terminal nodes
20

## 14. Выбор уровня

Cost (misclassification error)
[mincost,minloc] = min(cost);
cutoff = mincost + secost(minloc);
hold on
plot([0 20], [cutoff cutoff], 'k:')
plot(ntermnodes(bestlevel+1),
cost(bestlevel+1), 'mo')
legend('Cross-validation',
'Resubstitution', 'Min + 1 std.
err.','Best choice')
hold off
0.8
Cross-validation
Resubstitution
Min + 1 std. err.
Best choice
0.7
0.6
0.5
0.4
0.3
0.2
0.1
0
5
10
15
Number of terminal nodes
20

## 15. Оптимальное дерево классификации

prunedtree = treeprune(tree,bestlevel);
treedisp(prunedtree)
cost(bestlevel+1)
>> ans = 0.22

## 16. Дендрограмма классификации ирисов

eucD = pdist(meas,'euclidean');