The gist for this lambda function can be found here.
The goal
The goal in this post is:
Create a function to classify data using K-nearest neighbors (KNN) in Excel
A solution
Here’s a lambda function called KNN:
KNN =LAMBDA(x, trn, k,
LET(
_trnc, COLUMNS(trn),
_X, INDEX(trn, , 1) : INDEX(trn, , _trnc - 1),
_y, INDEX(trn, , _trnc),
_br, BYROW(_X, LAMBDA(r, SQRT(SUMXMY2(r, x)))),
_f, FILTER(_y, _br <= SMALL(_br, k)),
_fs, FREQ.SIMPLE(_f),
INDEX(_fs, 1, 1)
)
);
FREQ.SIMPLE =LAMBDA(data,
LET(
d, INDEX(data,,1),
u, UNIQUE(d),
X, N(u = TRANSPOSE(d)),
Y, SEQUENCE(ROWS(d), 1, 1, 0),
mp, MMULT(X,Y),
c, CHOOSE({1,2}, u, mp),
SORT(c, 2, -1)
)
);
I’ve also included the definition of the FREQ.SIMPLE lambda function. That function produces a two-column frequency table of counts of unique values in a column of data. For details of how that function works, you can read this post.
KNN has three parameters and as such accepts three arguments.
- x – an observation (row) in need of classification. This is an array of numerical measurements about an observation which you want to classify. This is one row and one or more columns.
- trn – an array of training data which is already classified. This array will have COLUMNS(x) + 1 columns. The additional column is because the training set includes a column on the right for the classification of each row. In the example below, the species of the flower.
- k – the number of observations in the training set to use to determine the class of the observation x. For example, if k=5, then the function will use the 5 observations (rows) in the training set which are closest to the observation x in order to determine to which class x should belong.
How it works
Here’s how it works:
As you can see, we pass row 19 as the first argument (x – the observation needing classification), rows 4:18 as the second argument (trn – the training set) and S20 to k. The function then uses these arguments to predict what the class should be (shown in the yellow column).
Using different values of K can produce different results.
If you’d like to use this function, you can grab the code from the gist linked at the top of this page.
Let’s break it down
As a reminder, this function is defined as:
KNN =LAMBDA(x, trn, k,
LET(
_trnc, COLUMNS(trn),
_X, INDEX(trn, , 1) : INDEX(trn, , _trnc - 1),
_y, INDEX(trn, , _trnc),
_br, BYROW(_X, LAMBDA(r, SQRT(SUMXMY2(r, x)))),
_f, FILTER(_y, _br <= SMALL(_br, k)),
_fs, FREQ.SIMPLE(_f),
INDEX(_fs, 1, 1)
)
);
We define some names with bound values:
- _trnc – this is the number of columns in the training set and is calculated with COLUMNS(trn).
- _X – here we use the INDEX function twice, separated by a colon to remove the right-most column from the training data. In other words, return columns 1 to _trnc-1 from the array passed to the trn parameter. By separating two calls of the INDEX function by a colon, we create a reference similar to the form A1:B1.
- _y – here we return the right-most column from the training data trn by INDEXing on _trnc – the count of the columns in the training data.
- _br – we use the BYROW function to iterate through each row in _X (the training data without the classification) and we use the function SQRT(SUMXMY2(r, x)) to compare each row r with the unknown observation x. This function calculates the Euclidean distance between two points. This produces a single-column array with the Euclidean distance between the new row and each row in the training set.
- _f – here we use FILTER to get those rows from the training set with the k smallest Euclidean distances. The classification of these rows will be used to determine what the classification of the new row will be.
- _fs – uses the FREQ.SIMPLE function to count the occurence of each unique class present in the variable _f. In other words – we are trying to determine which class in _f is most frequent.
- Finally, we use INDEX to return the first row from _fs. Because FREQ.SIMPLE produces a frequency table sorted in descending order, the first row is also the row with the class that appears most frequently in the k nearest neighbors.
In summary
We saw how to classify data using K-nearest neighbors (KNN) in Excel.
We used the reference form of the INDEX function to manipulate arrays into different dimensions (remove a column, select a row).
We used SQRT and SUMXMY2 to calculate the Euclidean distance between two arrays of equal dimension, then selected the K-smallest distances between the unknown x and the training data.
We used the FREQ.SIMPLE lambda to calculate a simple frequency table.
Finally we returned the class with of most frequent occurrence in those training set observations with k-smallest distances from the unknown observation.
This is the K-nearest neighbors algorithm in Excel.
Shiro says:
may i ask you where can i add this new function ?
OP says:
Yes of course – you should load the Excel Labs add-in, use the Advanced Formula Environment, then follow the instructions to import from a gist. You can read how to do that under point 2 on this post.
Almas Сейтжанов says:
can you please to show all formulas in cells V20
OP says:
Hi
The formula in V20 is the same as the FREQ.SIMPLE function defined in the gist:
https://gist.github.com/ncalm/54b4c306468273679ae07d872fbdec4d