Your main question seems to be how to calculate
distances between a data matrix and some set of
points ("centers").

For this you can write a function that takes as
input a data matrix and your set of points and
returns distances for each row (point) in the data
matrix to all the "centers".

Here is such a function:

```
myEuclid <- function(points1,
points2) {
distanceMatrix <- matrix(NA,
nrow=dim(points1)[1], ncol=dim(points2)[1])
for(i in 1:nrow(points2)) {
distanceMatrix[,i] <-
sqrt(rowSums(t(t(points1)-points2[i,])^2))
}
distanceMatrix
}
```

`points1`

is the data matrix with
points as rows and dimensions as columns.
`points2`

is the matrix of centers
(points as rows again). The first line of code
just defines the answer matrix (which will have as
many rows as there are rows in the data matrix and
as many columns as there are centers). So the
point `i,j`

in the result matrix will
be the distance from the *ith* point to the
*jth* center.

Then the for loop iterates over all centers.
For each center it computes the euclidean distance
from each point to the current center and returns
the result. This line here:
`sqrt(rowSums(t(t(points1)-points2[i,])^2))`

is euclidean distance. Inspect it closer and look
up the formula if you have any troubles with that.
(the transposes there are mainly done to make sure
subtraction is being done row-wise).

Now you can also implement k-means
algorithm:

```
myKmeans <- function(x, centers,
distFun, nItter=10) {
clusterHistory <- vector(nItter,
mode="list")
centerHistory <- vector(nItter,
mode="list")
for(i in 1:nItter) {
distsToCenters <- distFun(x, centers)
clusters <- apply(distsToCenters, 1,
which.min)
centers <- apply(x, 2, tapply,
clusters, mean)
# Saving history
clusterHistory[[i]] <- clusters
centerHistory[[i]] <- centers
}
list(clusters=clusterHistory,
centers=centerHistory)
}
```

As you can see it's also a very simple function
- it takes data matrix, centers, your distance
function (the one defined above) and number of
wanted iterations.

The clusters are defined by assigning the
closest center for each point. And centers are
updated as a mean of the points assigned to that
center. Which is a basic k-means algorithm).

Let's try it out. Define some random points (in
2d, so number of columns = 2)

```
mat <- matrix(rnorm(100), ncol=2)
```

Assign 5 random points from that matrix as
initial centers:

```
centers <- mat[sample(nrow(mat),
5),]
```

Now run the algorithm:

```
theResult <- myKmeans(mat, centers,
myEuclid, 10)
```

Here are the centers in the 10th iteration:

```
theResult$centers[[10]]
[,1] [,2]
1 -0.1343239 1.27925285
2 -0.8004432 -0.77838017
3 0.1956119 -0.19193849
4 0.3886721 -1.80298698
5 1.3640693 -0.04091114
```

Compare that with implemented
`kmeans`

function:

```
theResult2 <- kmeans(mat, centers,
10, algorithm="Forgy")
theResult2$centers
[,1] [,2]
1 -0.1343239 1.27925285
2 -0.8004432 -0.77838017
3 0.1956119 -0.19193849
4 0.3886721 -1.80298698
5 1.3640693 -0.04091114
```

Works fine. Our function however tracks the
iterations. We can plot the progress over the
first 4 iterations like this:

```
par(mfrow=c(2,2))
for(i in 1:4) {
plot(mat, col=theResult$clusters[[i]],
main=paste("itteration:", i), xlab="x", ylab="y")
points(theResult$centers[[i]], cex=3, pch=19,
col=1:nrow(theResult$centers[[i]]))
}
```

Nice.

However this simple design allows for much
more. For example if we want to use another kind
of distance (not euclidean) we can just use any
function that takes data and centers as inputs.
Here is one for correlation distances:

```
myCor <- function(points1, points2)
{
return(1 - ((cor(t(points1),
t(points2))+1)/2))
}
```

And we then can do Kmeans based on those:

```
theResult <- myKmeans(mat, centers,
myCor, 10)
```

The resulting picture for 4 iterations then
looks like this:

Even thou we specified 5 clusters - there were
2 left at the end. That is because for 2
dimensions the correlation can have to values -
either +1 or -1. Then when looking for the
clusters each point get's assigned to one center,
even if it has the same distance to multiple
centers - the first one get's chosen.

Anyway this is now getting out of scope. The
bottom line is that there are many possible
distance metrics and one simple function allows
you to use any distance you want and track the
results over iterations.