Recently, I faced a problem of this measure in Excel:
- File1 contained a list of names and some data
- File2 contained a list of names and some data
- File1 and File2 had many common names but the names didn't match exactly
- Names in File1 had to be matched with names in File2 and some data retrieved for the name from File2 into File1
The problem was, Excel functions like VLOOKUP match exact strings, the match won't work if strings differ (VLOOKUP has a flag for approximate matching but it's very inadequate and trivial). Microsoft also provides a fuzzy-match plugin for excel but I needed this functionality for my office and office IT guys won't let me install the plugin. You might want to check it out if your IT people aren't that brutal :D
So, I wrote a small macro containing is a fuzzy version of VLOOKUP. It is kind of rough and not very inefficient but does the job. Please go through the comments in the macro.
As an example consider branch_data.xslx as File1 and population.xslx as File2.
branch_data.xlsm is the macro enabled file with the array function FuzzyVLookup applied on columns Population, Matched CIty and Confidence(%) which matches city names from population.xslx . Press Alt+F11 to view the macro.
Keep all files in same folder and the function in branch_data.xslm should work( remember to enable macros when prompted).
Note: This function returns multiple values and uses array formula. Check the "Method to Return a Variable-Size Result Array" at http://support.microsoft.com/kb/110693 .
The macro uses Levenshtein's Algorithm for fuzzy matching of strings( the algorithm calculates the number of insertions and deletions in one string to convert it into other string and uses that metrics as a measure of match)
Credits:
Implementation of Levenshtein in VBA
Here is the code itself( for those who are too lazy to download the branch_data.xlsm file) :
'------------------------------------------------------------------------------------------------------------
Here is the code itself( for those who are too lazy to download the branch_data.xlsm file) :
'------------------------------------------------------------------------------------------------------------
Option Explicit ' Levenshtein3 tweaked for UTLIMATE speed and CORRECT results ' Solution based on Longs ' Intermediate arrays holding Asc()make difference ' even Fixed length Arrays have impact on speed (small indeed) ' Levenshtein version 3 will return correct percentage(0...100) Function FuzzyPercent(ByVal String1 As String, ByVal String2 As String) As Long Dim I As Long, j As Long, string1_length As Long, string2_length As Long Dim distance(0 To 60, 0 To 50) As Long, smStr1(1 To 60) As Long, smStr2(1 To 50) As Long Dim min1 As Long, min2 As Long, min3 As Long, minmin As Long, MaxL As Long string1_length = Len(String1): string2_length = Len(String2) distance(0, 0) = 0 For I = 1 To string1_length: distance(I, 0) = I: smStr1(I) = Asc(LCase(Mid$(String1, I, 1))): Next For j = 1 To string2_length: distance(0, j) = j: smStr2(j) = Asc(LCase(Mid$(String2, j, 1))): Next For I = 1 To string1_length For j = 1 To string2_length If smStr1(I) = smStr2(j) Then distance(I, j) = distance(I - 1, j - 1) Else min1 = distance(I - 1, j) + 1 min2 = distance(I, j - 1) + 1 min3 = distance(I - 1, j - 1) + 1 If min2 < min1 Then If min2 < min3 Then minmin = min2 Else minmin = min3 Else If min1 < min3 Then minmin = min1 Else minmin = min3 End If distance(I, j) = minmin End If Next Next ' Levenshtein3 will properly return a percent match (100%=exact) based on similarities and Lengths etc... MaxL = string1_length: If string2_length > MaxL Then MaxL = string2_length FuzzyPercent = 100 - CLng(distance(string1_length, string2_length) * 100 / MaxL) End Function Function FuzzyVLookup(ByVal LookupValue As String, _ ByVal TableArray As Range, _ ByVal ValIndex As Integer, _ Optional ValIndex1 As Integer) As Variant '******************************************************************************** '**This function must be called by selecting three columns then entering the function and pressing ctrl+shift+enter**' '**this function compares the LookUpValue with the values in 1st column of TableArray range and returns the **' '**values from columns valIndex, ValIndex1 from the range and also percentage match **' '**LookupValue: the value for which a match is to be found in a range of values**' '**TableArray: the range in which the match for LookUpValue is to be found **' '**ValIndex: index of a column in TableArray range whose value is to be retrieved on match**' '**ValIndex1: (optional) additional index of a column in TableArray range whose value is to be retrieved on match**' Dim R As Range Dim strListString As String Dim strWork As String Dim I As Integer Dim lEndRow As Long Dim Row As Integer Dim sngCurPercent As Long Dim sngMinPercent As Long Dim arr As Variant '-------------------------------------------------------------- '-- Validation -- '-------------------------------------------------------------- ReDim arr(1 To 5) Row = 0 sngMinPercent = 0 lEndRow = TableArray.Rows.Count If VarType(TableArray.Cells(lEndRow, 1).Value) = vbEmpty Then lEndRow = TableArray.Cells(lEndRow, 1).End(xlUp).Row End If '--------------- '-- Main loop -- '--------------- For Each R In Range(TableArray.Cells(1, 1), TableArray.Cells(lEndRow, 1)) strListString = R.Offset(0, 0).Text 'the city name in the range is in column 0 '------------------------------------------------ '-- Fuzzy match strings & get percentage match -- '------------------------------------------------ sngCurPercent = FuzzyPercent(String1:=LookupValue, _ String2:=strListString) If sngCurPercent >= sngMinPercent Then Row = R.Row sngMinPercent = sngCurPercent End If Next R '----------------------------------- '-- Return column entry specified -- '----------------------------------- arr(1) = TableArray.Cells(Row, ValIndex) 'return the column value for matched row at ValIndex arr(2) = TableArray.Cells(Row, ValIndex1) 'return the column value for matched row at ValIndex1 arr(3) = sngMinPercent 'return the match % for matched row FuzzyVLookup = arr End Function
Had used a lot of subs before. First time used a function in VBA. Cool!
ReplyDelete@Aayush Used functions because I wanted to be able to apply fuzzy match to a particular row(it's easy to do it from a unction because i knew the row from which function was called)
ReplyDeleteAFAIK subs would have required some manipulation to calculate the rows on which to apply the match.