The gist for this lambda function can be found here.

## The goal

The goal in this post is:

Create a function to classify data using K-nearest neighbors (KNN) in Excel

## A solution

Here’s a lambda function called KNN:

```
KNN =LAMBDA(x, trn, k,
LET(
_trnc, COLUMNS(trn),
_X, INDEX(trn, , 1) : INDEX(trn, , _trnc - 1),
_y, INDEX(trn, , _trnc),
_br, BYROW(_X, LAMBDA(r, SQRT(SUMXMY2(r, x)))),
_f, FILTER(_y, _br <= SMALL(_br, k)),
_fs, FREQ.SIMPLE(_f),
INDEX(_fs, 1, 1)
)
);
FREQ.SIMPLE =LAMBDA(data,
LET(
d, INDEX(data,,1),
u, UNIQUE(d),
X, N(u = TRANSPOSE(d)),
Y, SEQUENCE(ROWS(d), 1, 1, 0),
mp, MMULT(X,Y),
c, CHOOSE({1,2}, u, mp),
SORT(c, 2, -1)
)
);
```

I’ve also included the definition of the FREQ.SIMPLE lambda function. That function produces a two-column frequency table of counts of unique values in a column of data. For details of how that function works, you can read this post.

KNN has three parameters and as such accepts three arguments.

- x – an observation (row) in need of classification. This is an array of numerical measurements about an observation which you want to classify. This is one row and one or more columns.
- trn – an array of training data which is already classified. This array will have COLUMNS(x) + 1 columns. The additional column is because the training set includes a column on the right for the classification of each row. In the example below, the species of the flower.
- k – the number of observations in the training set to use to determine the class of the observation x. For example, if k=5, then the function will use the 5 observations (rows) in the training set which are closest to the observation x in order to determine to which class x should belong.

## How it works

Here’s how it works:

As you can see, we pass row 19 as the first argument (x – the observation needing classification), rows 4:18 as the second argument (trn – the training set) and S20 to k. The function then uses these arguments to predict what the class should be (shown in the yellow column).

Using different values of K can produce different results.

If you’d like to use this function, you can grab the code from the gist linked at the top of this page.

## Let’s break it down

As a reminder, this function is defined as:

```
KNN =LAMBDA(x, trn, k,
LET(
_trnc, COLUMNS(trn),
_X, INDEX(trn, , 1) : INDEX(trn, , _trnc - 1),
_y, INDEX(trn, , _trnc),
_br, BYROW(_X, LAMBDA(r, SQRT(SUMXMY2(r, x)))),
_f, FILTER(_y, _br <= SMALL(_br, k)),
_fs, FREQ.SIMPLE(_f),
INDEX(_fs, 1, 1)
)
);
```

We define some names with bound values:

- _trnc – this is the number of columns in the training set and is calculated with COLUMNS(trn).
- _X – here we use the INDEX function twice, separated by a colon to remove the right-most column from the training data. In other words, return columns 1 to _trnc-1 from the array passed to the trn parameter. By separating two calls of the INDEX function by a colon, we create a reference similar to the form A1:B1.
- _y – here we return the right-most column from the training data trn by INDEXing on _trnc – the count of the columns in the training data.
- _br – we use the BYROW function to iterate through each row in _X (the training data without the classification) and we use the function SQRT(SUMXMY2(r, x)) to compare each row r with the unknown observation x. This function calculates the Euclidean distance between two points. This produces a single-column array with the Euclidean distance between the new row and each row in the training set.
- _f – here we use FILTER to get those rows from the training set with the k smallest Euclidean distances. The classification of these rows will be used to determine what the classification of the new row will be.
- _fs – uses the FREQ.SIMPLE function to count the occurence of each unique class present in the variable _f. In other words – we are trying to determine which class in _f is most frequent.
- Finally, we use INDEX to return the first row from _fs. Because FREQ.SIMPLE produces a frequency table sorted in descending order, the first row is also the row with the class that appears most frequently in the k nearest neighbors.

## In summary

We saw how to classify data using K-nearest neighbors (KNN) in Excel.

We used the reference form of the INDEX function to manipulate arrays into different dimensions (remove a column, select a row).

We used SQRT and SUMXMY2 to calculate the Euclidean distance between two arrays of equal dimension, then selected the K-smallest distances between the unknown x and the training data.

We used the FREQ.SIMPLE lambda to calculate a simple frequency table.

Finally we returned the class with of most frequent occurrence in those training set observations with k-smallest distances from the unknown observation.

This is the K-nearest neighbors algorithm in Excel.