# excel-lambda-LEV: Calculate the Levenshtein Distance in Excel

The gist for this lambda function can be found here.

A friend of mine challenged me to implement a LAMBDA to calculate the Levenshtein Distance in Excel.

As a reminder, the Levenshtein Distance represents the fewest number of insertions, replacements or deletions that are necessary to change one text string into another. For example, given the words “kitten” and “sitting”, we need these operations to change the former to the latter:

1. Replace the k with an s
2. Replace the e with an i
3. Insert a g at the end

We can think of this as a measurement of similarity between two strings. The smaller the number of operations, the more similar the strings are.

I used this page on the Wagner-Fischer algorithm to guide my work.

I call it LEV:

```=LAMBDA(a,b,[ii],[jj],[arr],
LET(
i,IF(ISOMITTED(ii),1,ii),
j,IF(ISOMITTED(jj),1,jj),
a_i,MID(a,i,1),
b_j,MID(b,j,1),
init_array,MAKEARRAY(
LEN(a)+1,
LEN(b)+1,
LAMBDA(r,c,IFS(r=1,c-1,c=1,r-1,TRUE,0))
),
cost,N(NOT(a_i=b_j)),
this_arr,IF(ISOMITTED(arr),init_array,arr),
option_a,INDEX(this_arr,i+1-1,j+1)+1,
option_b,INDEX(this_arr,i+1,j+1-1)+1,
option_c,INDEX(this_arr,i+1-1,j+1-1)+cost,
new_val,MIN(option_a,option_b,option_c),
overlay,MAKEARRAY(
LEN(a)+1,
LEN(b)+1,
LAMBDA(r,c,IF(AND(r=i+1,c=j+1),new_val,0))
),
new_arr,this_arr+overlay,
new_i,IF(i=LEN(a),IF(j=LEN(b),i+1,1),i+1),
new_j,IF(i<>LEN(a),j,IF(j=LEN(b),j+1,j+1)),
is_end,AND(new_i>LEN(a),new_j>LEN(b)),
IF(is_end,new_val,LEV(a,b,new_i,new_j,new_arr))
)
)
```

LEV has two required parameters:

1. a – a string you want to compare with b
2. b – a string you want to compare with a

There are three optional parameters which are used by the recursion, but generally you would never need to use them:

1. [ii] – an integer representing the [ii]th character position of string a
2. [jj] – an integer representing the [jj]th character position of string b
3. [arr] – an interim array created by LEV
Don’t worry too much about them unless you want to know the details of how this works, which I’ll explain below.

This is what LEV does: You can grab it and use it without reading the rest of the post, but if you want to understand how it works, then read on.

### How it’s done

Let’s break it down.
```=LAMBDA(a,b,[ii],[jj],[arr],
LET(
i,IF(ISOMITTED(ii),1,ii),
j,IF(ISOMITTED(jj),1,jj),
a_i,MID(a,i,1),
b_j,MID(b,j,1),
init_array,MAKEARRAY(
LEN(a)+1,
LEN(b)+1,
LAMBDA(r,c,IFS(r=1,c-1,c=1,r-1,TRUE,0))
),

```

If you read the wiki page about the Wagner-Fischer algorithm, this is all going to make more sense. I also recommend this medium article.

We start by using LET to create some variables:

Let’s take the kitten/sitting example. In the image above, I’ve named the cells in B1:B4,B7:B8 to correspond to the variables in the formulas.

You can see that because ii and jj are blank, i and j both become 1. ii=”” is a substitute expression for ISOMITTED(ii).

a_i returns “k”, the [i]th letter from kitten and b_i returns “s”, the [j]th letter from sitting. In this example, because i=1, we are getting the first letter from kitten. Because j=1 we are getting the 1st letter from sitting.

The length of a and b are shown in rows 13 and 14.

The initial array is shown in B17:I23. Because this is a dynamic array, the formula is only in B17.

The algorithm will begin in position [2,2] of init_array. It will fill in each of the 0 cells until it reaches the bottom-right corner. The value that is placed in the bottom-right corner is the Levenshtein Distance

```			cost,N(NOT(a_i=b_j)),
this_arr,IF(ISOMITTED(arr),init_array,arr),
option_a,INDEX(this_arr,i+1-1,j+1)+1,
option_b,INDEX(this_arr,i+1,j+1-1)+1,
option_c,INDEX(this_arr,i+1-1,j+1-1)+cost,
new_val,MIN(option_a,option_b,option_c),
overlay,MAKEARRAY(
LEN(a)+1,
LEN(b)+1,
LAMBDA(r,c,IF(AND(r=i+1,c=j+1),new_val,0))
),
new_arr,this_arr+overlay,
```

We continue to define more calculations in LET:

At this point, if you’re not familiar with this algorithm, I encourage you to read the medium article I linked above, which I will link again here. Each cell in our array is calculated using the expression above.

If you’re not familiar with this kind of expression, don’t worry. The brace { is how options are presented. In this case, we have two options:

1. Where min(i,j) = 0, for when we are comparing character position 0 in either string (i or j) with any other character position in the other string. In this option, we put max(i,j) in the array. This is what creates the row headers and column headers of init_array
2. “otherwise”, which represents every cell in the array that is not on the first row and is not in the first column. So this is the section of the init_array which is currently filled with zeroes. So we know that each cell currently holding a zero (except [1,1]), needs to be filled with the minimum of those three options
```			new_arr,this_arr+overlay,
new_i,IF(i=LEN(a),IF(j=LEN(b),i+1,1),i+1),
new_j,IF(i<>LEN(a),j,IF(j=LEN(b),j+1,j+1)),
is_end,AND(new_i>LEN(a),new_j>LEN(b)),
IF(is_end,new_val,LEV(a,b,new_i,new_j,new_arr))
)
)
```

Now that we have calculated the new value and the new array, we are ready to move to the next cell and calculate that. Some more calculations, remembering that the array we are populating has the character positions i of a on the rows and the character positions j of b on the columns:

The final expression of LET is IF(is_end,new_val,LEV(a,b,new_i,new_j,new_arr)).

If we are at the end, then return new_val. Otherwise, call LEV with a, b, new_i, new_j and new_arr.

Put simply: Move to the next zero-cell in the now-updated array and perform the same calculations involving options a, b and c and the cost as defined above.

The algorithm continues in this way until it reaches the bottom-right corner, at which point it returns the result to the worksheet.

Here is a mocked up example of the state of all those variables during the first iteration: A word of caution: you’ve seen that this involves LEN(a)*LEN(b) iterations to get to a result, so the longer the strings, the more iterations needed.

### To sum up

This was a fun experiment!

Now we know how to calculate the Levenshtein Distance in Excel using LAMBDA. There are all kinds of uses for this – name matching, address matching, product matching and so on.

I hope that this post has inspired you to think of ideas to use recursive lambda functions to solve tasks you are working on.

If you’ve got something you want to solve and aren’t sure where to get started, you can ask me directly in the comments here.