DON'T WRITE YOUR OWN CODE to solve problems that are already well known and solved by others who understand the problems and how to implement the various algorithms.
Certainly don't do so, not until you have exhausted the various tools you can find. You have described a basic clustering problem. So the very first thing you should consider is if kmeans can solve your problem. (As I recall, there are a variety of clustering algorithms. But I think kmeans actually includes many of those I recall learning about many years ago.) If kmeans does not solve your problem, then do a search for other clustering algorithms. Anyway, you should do some quick reading no matter what. You might look in these places:
And of course, kmeans.
help kmeans
KMEANS K-means clustering.
IDX = KMEANS(X, K) partitions the points in the N-by-P data matrix X
into K clusters. This partition minimizes the sum, over all clusters, of
the within-cluster sums of point-to-cluster-centroid distances. Rows of X
correspond to points, columns correspond to variables. Note: when X is a
vector, KMEANS treats it as an N-by-1 data matrix, regardless of its
orientation. KMEANS returns an N-by-1 vector IDX containing the cluster
indices of each point. By default, KMEANS uses squared Euclidean
distances.
KMEANS treats NaNs as missing data, and ignores any rows of X that
contain NaNs.
[IDX, C] = KMEANS(X, K) returns the K cluster centroid locations in
the K-by-P matrix C.
[IDX, C, SUMD] = KMEANS(X, K) returns the within-cluster sums of
point-to-centroid distances in the K-by-1 vector sumD.
[IDX, C, SUMD, D] = KMEANS(X, K) returns distances from each point
to every centroid in the N-by-K matrix D.
[ ... ] = KMEANS(..., 'PARAM1',val1, 'PARAM2',val2, ...) specifies
optional parameter name/value pairs to control the iterative algorithm
used by KMEANS. Parameters are:
'Display' - Level of display output. Choices are 'off' (the default),
'iter', and 'final'.
'Distance' - Distance measure, in P-dimensional space, that KMEANS
should minimize with respect to. Choices are:
'sqeuclidean' - Squared Euclidean distance (the default).
'cityblock' - Sum of absolute differences, a.k.a. L1 distance
'cosine' - One minus the cosine of the included angle
between points (treated as vectors).
'correlation' - One minus the sample correlation between points
(treated as sequences of values).
'hamming' - Percentage of bits that differ (only suitable
for binary data).
'EmptyAction' - Action to take if a cluster loses all of its member
observations. Choices are:
'singleton' - Create a new cluster consisting of the one
observation furthest from its centroid (the
default).
'error' - Treat an empty cluster as an error.
'drop' - Remove any clusters that become empty, and set
the corresponding values in C and D to NaN.
'MaxIter' - Maximum number of iterations allowed. Default is 100.
'OnlinePhase' - Flag indicating whether KMEANS should perform an "on-line
update" phase in addition to a "batch update" phase. The on-line phase
can be time consuming for large data sets, but guarantees a solution
that is a local minimum of the distance criterion, i.e., a partition of
the data where moving any single point to a different cluster increases
the total sum of distances. 'off' (the default) or 'on'.
'Options' - Options for the iterative algorithm used to minimize the
fitting criterion, as created by STATSET. Choices of STATSET
parameters are:
'UseParallel' - If true and if a parpool of the Parallel Computing
Toolbox is open, compute in parallel. If the
Parallel Computing Toolbox is not installed, or a
parpool is not open, computation occurs in serial
mode. Default is 'default', meaning serial
computation.
'UseSubstreams' - Set to true to compute in parallel in a reproducible
fashion. Default is false. To compute reproducibly,
set Streams to a type allowing substreams:
'mlfg6331_64' or 'mrg32k3a'.
'Streams' - These fields specify whether to perform clustering
from multiple 'Start' values in parallel, and how
to use random numbers when generating the starting
points. For information on these fields see
PARALLELSTATS.
NOTE: If 'UseParallel' is TRUE and 'UseSubstreams' is FALSE,
then the length of 'Streams' must equal the number of workers
used by KMEANS. If a parallel pool is already open, this
will be the size of the parallel pool. If a parallel pool
is not already open, then MATLAB may try to open a pool for
you (depending on your installation and preferences).
To ensure more predictable results, it is best to use
the PARPOOL command and explicitly create a parallel pool
prior to invoking KMEANS with 'UseParallel' set to TRUE.
'Replicates' - Number of times to repeat the clustering, each with a
new set of initial centroids. A positive integer, default is 1.
'Start' - Method used to choose initial cluster centroid positions,
sometimes known as "seeds". Choices are:
'plus' - The Default. Select K observations from X according
to the k-means++ algorithm: the first cluster center
is chosen uniformly at random from X, after which
each subsequent cluster center is chosen randomly
from the remaining data points with probability
proportional to its distance from the point's
closest existing cluster center.
'sample' - Select K observations from X at random.
'uniform' - Select K points uniformly at random from the range
of X. Not valid for Hamming distance.
'cluster' - Perform preliminary clustering phase on random 10%
subsample of X. This preliminary phase is itself
initialized using 'sample'.
matrix - A K-by-P matrix of starting locations. In this case,
you can pass in [] for K, and KMEANS infers K from
the first dimension of the matrix. You can also
supply a 3D array, implying a value for 'Replicates'
from the array's third dimension.
Example:
X = [randn(20,2)+ones(20,2); randn(20,2)-ones(20,2)];
opts = statset('Display','final');
[cidx, ctrs] = kmeans(X, 2, 'Distance','city', ...
'Replicates',5, 'Options',opts);
plot(X(cidx==1,1),X(cidx==1,2),'r.', ...
X(cidx==2,1),X(cidx==2,2),'b.', ctrs(:,1),ctrs(:,2),'kx');
See also LINKAGE, CLUSTERDATA, SILHOUETTE.
Documentation for kmeans
doc kmeans
Other uses of kmeans
gpuArray/kmeans tall/kmeans