
How to compute distances between centroids and data matrix (for kmeans algorithm)

Your main question seems to be how to calculate distances between a data matrix and some set of points ("centers").

For this you can write a function that takes as input a data matrix and your set of points and returns distances for each row (point) in the data matrix to all the "centers".

Here is such a function:

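myEuclid <- function(points1, points2) {
    # One row per data point, one column per center
    distanceMatrix <- matrix(NA, nrow=dim(points1)[1], ncol=dim(points2)[1])
    for(i in 1:nrow(points2)) {
        # Euclidean distance from every point to center i
        distanceMatrix[,i] <- sqrt(rowSums(t(t(points1)-points2[i,])^2))
    }
    distanceMatrix
}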

points1 is the data matrix with points as rows and dimensions as columns. points2 is the matrix of centers (points as rows again). The first line of code just defines the answer matrix, which has as many rows as there are rows in the data matrix and as many columns as there are centers. So element [i, j] of the result matrix is the distance from the ith point to the jth center.

Then the for loop iterates over all centers. For each center it computes the Euclidean distance from every point to that center and stores the result in the corresponding column of the distance matrix. The line sqrt(rowSums(t(t(points1)-points2[i,])^2)) is the Euclidean distance. Inspect it closely and look up the formula if you have any trouble with it. (The transposes are there mainly to make sure the subtraction is done row-wise.)
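To see why the double transpose matters, here is a small sketch with made-up numbers (the names pts and center are only illustrative). R recycles a vector down the columns of a matrix, so subtracting a center directly from the data matrix pairs up the wrong coordinates. Transposing first lines the recycling up with each point's coordinates.

```r
# Three made-up 2-D points, one per row
pts <- matrix(c(0, 0,
                3, 4,
                6, 8), ncol = 2, byrow = TRUE)
center <- c(3, 4)

# Naive subtraction recycles `center` down the columns, so row 2 comes out wrong
naive <- pts - center

# t(pts) has one column per point, so recycling subtracts `center` from each point
diffs <- t(t(pts) - center)
d <- sqrt(rowSums(diffs^2))
print(d)  # 5 0 5

# sweep() expresses the same row-wise subtraction more explicitly
d2 <- sqrt(rowSums(sweep(pts, 2, center)^2))
```

Both d and d2 hold the distances from each point to the center. sweep() is an alternative way of writing the same subtraction, not what the function above uses.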

Now you can also implement the k-means algorithm:

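myKmeans <- function(x, centers, distFun, nItter=10) {
    # Lists that keep the cluster labels and centers of every iteration
    clusterHistory <- vector(nItter, mode="list")
    centerHistory <- vector(nItter, mode="list")

    for(i in 1:nItter) {
        distsToCenters <- distFun(x, centers)
        # Assign each point to its closest center
        clusters <- apply(distsToCenters, 1, which.min)
        # Update each center as the mean of the points assigned to it
        centers <- apply(x, 2, tapply, clusters, mean)
        # Saving history
        clusterHistory[[i]] <- clusters
        centerHistory[[i]] <- centers
    }

    list(clusters=clusterHistory, centers=centerHistory)
}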


As you can see, it's also a very simple function: it takes the data matrix, the centers, your distance function (the one defined above) and the number of iterations you want.

The clusters are defined by assigning each point to its closest center, and the centers are updated as the mean of the points assigned to them. This is the basic k-means algorithm.
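A single assignment-and-update step can be traced by hand on a tiny example. The sketch below uses six made-up points and takes two of them as starting centers. The data and the variable names are illustrative assumptions, not part of the example that follows.

```r
# Six made-up 2-D points forming two obvious groups
x <- matrix(c( 0,  0,
               0,  1,
               1,  0,
              10, 10,
              10, 11,
              11, 10), ncol = 2, byrow = TRUE)
centers <- x[c(1, 4), ]  # start from the 1st and 4th point

# Euclidean distance from every point (rows) to every center (columns)
dists <- sapply(seq_len(nrow(centers)), function(j) {
    sqrt(rowSums(t(t(x) - centers[j, ])^2))
})

# Assignment step: index of the closest center for each point
clusters <- apply(dists, 1, which.min)

# Update step: per-cluster mean of every coordinate
newCenters <- apply(x, 2, tapply, clusters, mean)
print(clusters)
print(newCenters)
```

The first three points end up in cluster 1 and the last three in cluster 2, and each new center is the average of its three points.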

Let's try it out. Define some random points (in 2d, so number of columns = 2)

mat <- matrix(rnorm(100), ncol=2)

Assign 5 random points from that matrix as initial centers:

centers <- mat[sample(nrow(mat), 5),]

Now run the algorithm:

theResult <- myKmeans(mat, centers, myEuclid, 10)

Here are the centers in the 10th iteration:

        [,1]        [,2]
1 -0.1343239  1.27925285
2 -0.8004432 -0.77838017
3  0.1956119 -0.19193849
4  0.3886721 -1.80298698
5  1.3640693 -0.04091114

Compare that with the built-in kmeans function:

theResult2 <- kmeans(mat, centers, 10, algorithm="Forgy")

        [,1]        [,2]
1 -0.1343239  1.27925285
2 -0.8004432 -0.77838017
3  0.1956119 -0.19193849
4  0.3886721 -1.80298698
5  1.3640693 -0.04091114

Works fine. Our function however tracks the iterations. We can plot the progress over the first 4 iterations like this:

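for(i in 1:4) {
    plot(mat, col=theResult$clusters[[i]], main=paste("iteration:", i), xlab="x", ylab="y")
    # The col argument is an assumption (the original call was cut off here):
    # it colors each center to match its cluster
    points(theResult$centers[[i]], cex=3, pch=19, col=1:nrow(theResult$centers[[i]]))
}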


However, this simple design allows for much more. For example, if we want to use another kind of distance (not Euclidean), we can use any function that takes data and centers as inputs. Here is one for correlation distances:

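myCor <- function(points1, points2) {
    # The rescaling after cor(t(points1), ...) is an assumption (the original line was cut off):
    # it maps correlation +1 to distance 0 and correlation -1 to distance 1
    return(1 - ((cor(t(points1), t(points2)) + 1) / 2))
}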

And we then can do Kmeans based on those:

theResult <- myKmeans(mat, centers,
myCor, 10)

The resulting picture for 4 iterations then looks like this:

[Figure: clusters and centers over the first 4 iterations, using correlation distance]

Even though we specified 5 clusters, only 2 were left at the end. That is because in 2 dimensions the correlation can take only two values: either +1 or -1. Then, when looking for the clusters, each point gets assigned to one center; if it has the same distance to multiple centers, the first one gets chosen.
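Both effects are easy to check in isolation. In the sketch below the vectors p, q and r are arbitrary made-up 2-D points. With only two coordinates each, any pair of them (with non-constant coordinates) is either perfectly positively or perfectly negatively correlated. which.min returns the first position when several entries tie for the minimum.

```r
# Arbitrary made-up 2-D points
p <- c(0.3, 1.2)   # second coordinate larger than the first
q <- c(-2.0, 0.5)  # same direction as p
r <- c(4.0, 1.0)   # opposite direction

corSame <- cor(p, q)      # +1 (up to floating point)
corOpposite <- cor(p, r)  # -1 (up to floating point)
print(c(corSame, corOpposite))

# Ties are broken in favour of the first minimum
firstMin <- which.min(c(0, 1, 0, 0))
print(firstMin)  # 1
```

So with a correlation-based distance in 2 dimensions every point sees at most two distinct distance values, and all centers that tie for the smaller one lose out to whichever of them comes first.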

Anyway this is now getting out of scope. The bottom line is that there are many possible distance metrics and one simple function allows you to use any distance you want and track the results over iterations.
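As one more illustration, here is a minimal sketch of a Manhattan (city-block) distance with the same interface: data matrix first, matrix of centers second, and a points-by-centers distance matrix as the result. The name myManhattan and the test matrices are made up for this example.

```r
# Hypothetical Manhattan distance with the same signature as the functions above
myManhattan <- function(points1, points2) {
    distanceMatrix <- matrix(NA, nrow = nrow(points1), ncol = nrow(points2))
    for (i in seq_len(nrow(points2))) {
        # Sum of absolute coordinate differences to center i
        distanceMatrix[, i] <- rowSums(abs(t(t(points1) - points2[i, ])))
    }
    distanceMatrix
}

# Made-up points and centers
pts <- matrix(c(0, 0,
                1, 2), ncol = 2, byrow = TRUE)
ctrs <- matrix(c(0, 0,
                 3, 3), ncol = 2, byrow = TRUE)
res <- myManhattan(pts, ctrs)
print(res)
```

Keep in mind that updating centers as means is only guaranteed to decrease the objective for squared Euclidean distance, so with other distances the procedure is a heuristic and may not settle the same way.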

Categories : R

© Copyright 2017 Publishing Limited. All rights reserved.