## When you can’t rely on common unique IDs…

**Table of Contents**

Introduction

Method used

Java code

Testing the tool

**Introduction**

It happens quite often during projects that you need to match users from one system (for instance the Meta-Directory) to another system (for instance a HR System), but you can’t use any common unique identifiers between the two systems! Maybe in the HR System, you will only find first name and last name, and no login names…

This little tip may help you find matching users based on their first name and last name. The tool is calculating “distances” between first names and last names in the different systems. Based on the score, you will find the exact matches (same first name, same last name, distance = 0) and very close matches (maybe one letter or two inverted characters difference).

**Method used**

The method is using the Levenshtein algorithm to calculate the distance between two strings. This distance is equal to the minimum characters to add, insert or replace to go from one string to the other one.

For instance, the distance between two identical stings is 0 (obvious).

The distance between “Henry” and “Henri” is 1 as you need to replace “y” by “i”.

The distance between “Marianne” and “Marian” is 2 as you need to delete two letters.

Here is the algorithm in pseudo-code:

int LevenshteinDistance(char s[1..m], char t[1..n]) // d is a table with m+1 rows and n+1 columns declare int d[0..m, 0..n] for i from 0 to m d[i, 0] := i for j from 0 to n d[0, j] := j for i from 1 to m for j from 1 to n { if s[i] = t[j] then cost := 0 else cost := 1 d[i, j] := minimum( d[i-1, j] + 1, // deletion d[i, j-1] + 1, // insertion d[i-1, j-1] + cost // substitution ) } return d[m, n]

**Java code**

Based on the pseudo code, here is the Java implementation of the distance calculation:

private static int minimum(int a, int b, int c){ if (a<=b && a<=c) return a; if (b<=a && b<=c) return b; return c; } public static int computeLevenshteinDistance(String aStr1, String aStr2) { char[] str1 = aStr1.toCharArray(); char[] str2 = aStr2.toCharArray(); int[][] distance = new int[str1.length+1][]; for(int i=0; i<=str1.length; i++){ distance[i] = new int[str2.length+1]; distance[i][0] = i; } for(int j=0; j<str2.length+1; j++) distance[0][j]=j; for(int i=1; i<=str1.length; i++) for(int j=1;j<=str2.length; j++) distance[i][j]= minimum(distance[i-1][j]+1, distance[i][j-1]+1, distance[i-1][j-1]+((str1[i-1]==str2[j-1])?0:1)); return distance[str1.length][str2.length]; }

The tool that has been built, based on this calculation, takes 3 arguments in the command-line:

- first name, last name, unique ID export from the first system
- first name, last name, unique ID export from the second system
- a CSV result file to store the matches along with scores calculated

Here is the full source code of the tool explained (not that big):

First, the different imports to be able to read files, write files, manipulate the data:

import java.io.BufferedReader; import java.io.BufferedWriter; import java.io.FileReader; import java.io.FileWriter; import java.io.PrintWriter; import java.util.Iterator; import java.util.TreeSet; public class Distance {

As we have seen, here are the Levenshtein distance calculation methods:

private static int minimum(int a, int b, int c){ if (a<=b && a<=c) return a; if (b<=a && b<=c) return b; return c; } public static int computeLevenshteinDistance(String aStr1, String aStr2) { char[] str1 = aStr1.toCharArray(); char[] str2 = aStr2.toCharArray(); int[][] distance = new int[str1.length+1][]; for(int i=0; i<=str1.length; i++){ distance[i] = new int[str2.length+1]; distance[i][0] = i; } for(int j=0; j<str2.length+1; j++) distance[0][j]=j; for(int i=1; i<=str1.length; i++) for(int j=1;j<=str2.length; j++) distance[i][j]= minimum(distance[i-1][j]+1, distance[i][j-1]+1, distance[i-1][j-1]+((str1[i-1]==str2[j-1])?0:1)); return distance[str1.length][str2.length]; }

In the main class, we load the two CSV files (“,” is the separator used here) and load them in TreeSet objects, so they are ordered by last name:

public static void main( String[] args ) throws Exception { BufferedReader oInSys1 = new BufferedReader( new FileReader(args[0]) ); BufferedReader oInSys2 = new BufferedReader( new FileReader(args[1]) ); PrintWriter oOutResult = new PrintWriter(new BufferedWriter(new FileWriter(args[2]))); String oLine = ""; // Load Sys1 data TreeSet oSys1Set = new TreeSet(); while( ( oLine = oInSys1.readLine() ) != null ) { String[] oTokens = oLine.split( "," ); String oSn = oTokens[0].toUpperCase(); String oGivenName = oTokens[1].toUpperCase(); String oSys1ID = oTokens[2]; String oSys1Data = oSn + "~" + oGivenName + "~" + oSys1ID; oSys1Set.add( oSys1Data ); } // Load Sys2 data TreeSet oSys2Set = new TreeSet(); while( ( oLine = oInSys2.readLine() ) != null ) { String[] oTokens = oLine.split(","); String oSn = oTokens[0].toUpperCase(); String oGivenName = oTokens[1].toUpperCase(); String oUniqueID = oTokens[2]; String oSys2Data = oSn + "~" + oGivenName + "~" + oUniqueID; oSys2Set.add( oSys2Data ); }

Now we can cycle through all entries in the second system and for each entry, cycle in all the entries of the second system to calculate distances and find the best match:

// Cycle through Sys2 data and find closest Sys1 string Iterator oSys2Iterator = oSys2Set.iterator(); while( oSys2Iterator.hasNext() ) { String oSys2Data = (String) oSys2Iterator.next(); String[] oSys2Tokens = oSys2Data.split("~"); String oSys2Sn = oSys2Tokens[0]; String oSys2GivenName = oSys2Tokens[1]; String oSys2UniqueID = oSys2Tokens[2]; String oBestMatch = ""; String oBestSn = ""; String oBestGivenName = ""; double oBestScore = 999999; // Find closest Sys1 string Iterator oSys1Iterator = oSys1Set.iterator(); while( oSys1Iterator.hasNext() ) { String oSys1Data = (String) oSys1Iterator.next(); String[] oSys1Tokens = oSys1Data.split("~"); String oSys1Sn = oSys1Tokens[0]; String oSys1GivenName = oSys1Tokens[1]; String oSys1ID = oSys1Tokens[2];

The calculation is using the distance between last names and between first names, giving more importance to the last name. You can experiment by changing a bit the way the final score is calculated…

double oScore = java.lang.Math.sqrt( java.lang.Math.pow( Distance.computeLevenshteinDistance( oSys2Sn , oSys1Sn ) * 2, 2 ) + java.lang.Math.pow( Distance.computeLevenshteinDistance( oSys2GivenName , oSys1GivenName ), 2 ) ); if ( oScore < oBestScore ) { oBestScore = oScore; oBestMatch = oSys1ID; oBestSn = oSys1Sn; oBestGivenName = oSys1GivenName; }

If an exact match is found, we remove the entry from the first set of data:

if ( oScore == 0 ) { oSys1Set.remove( oSys1Data ); break; } }

Finally, we save the best match to the result file:

String oResultData = oSys2Sn + "," + oSys2GivenName + "," + oSys2UniqueID + "," + oBestSn + "," + oBestGivenName + "," + oBestMatch + "," + oBestScore; oOutResult.write( oResultData + "\n" ); } oOutResult.close(); oInSys2.close(); oInSys1.close(); } }

We can compile the source code of the tool using the following command:

javac Distance.java

The result is a Distance.class file we can package in a jar file using the following command:

jar cvf distance.jar Distance.class

**Testing the tool**

Let’s test the tool by processing two CSV files with exact matches and similar matches.

The first CSV file sys1.csv corresponding to the first system is the following:

William,Abdo,wabdo Eddie,Austin,eaustin Jerry,Dunn,jdunn Deborah,Fair,dfair Marianne,Griffith,mgriffith Jeff,Healey,jhealey Clarence,Holloway,chooloway Clifford,Hough,chough Randy,Krumins,rkrumins Larry,Larsen,llarsen Sean,Murphy,smurphy Marie,Saenz,msaenz

The second CSV file sys2.csv corresponding to the second system is the following:

Willy,Abdo,10001 Eddy,Austin,10002 Jeremy,Dunn,10003 Deborah,Fair,10004 Marian,Griffith,10005 Jeff,Healy,10006 Clarence,Halloway,10007 Clifford,Hough,10008 Randy,Krumins,10009 Larry,Larsen,10010 Sean,Murphy,10011 Mary,Saenz,10012

We can run the tool by executing the following command:

java -classpath distance.jar Distance sys1.csv sys2.csv result.csv

Here is the resulting CSV file:

CLARENCE,HALLOWAY,10007,CLARENCE,HOLLOWAY,chooloway,1.0 CLIFFORD,HOUGH,10008,CLIFFORD,HOUGH,chough,0.0 DEBORAH,FAIR,10004,DEBORAH,FAIR,dfair,0.0 EDDY,AUSTIN,10002,EDDIE,AUSTIN,eaustin,4.0 JEFF,HEALY,10006,JEFF,HEALEY,jhealey,1.0 JEREMY,DUNN,10003,JERRY,DUNN,jdunn,4.0 LARRY,LARSEN,10010,LARRY,LARSEN,llarsen,0.0 MARIAN,GRIFFITH,10005,MARIANNE,GRIFFITH,mgriffith,4.0 MARY,SAENZ,10012,MARIE,SAENZ,msaenz,4.0 RANDY,KRUMINS,10009,RANDY,KRUMINS,rkrumins,0.0 SEAN,MURPHY,10011,SEAN,MURPHY,smurphy,0.0 WILLY,ABDO,10001,WILLIAM,ABDO,wabdo,6.0

We can see entries from the second systems with possible matches in the first system and the score calculated based on the distance between names. If you order by scores, you will first see the exact matches and then names that are very similar (scores 1 or 2).

Using this technique on a few thousands of entries can really help in the data cleaning, matching process, before putting in production a new driver.

As always, a great tool which will be very useful.

It might be an idea to implement this in the Analyzer product, for it is not in there yet, to my knowledge.

Grtz

David

Thanks a lot !!!

This is a problem I have to deal with regularly when we need to take lists from different systems and match, say IDs on one system to logins on a third system. Just spent an hour on doing this manually from data dumps from our eProcurement system and our ERP system, and I was getting ready to work on a more permanent solution. You just saved me a few hours of programming,

Thanks again,

Shivaji

Hello,

Thanks for the feedback, I always appreciate! Of course this can be improved, but could be a good base for special matching algorithm. I remember that I also tested another way where all vowels are removed (“Reza Kalfane” becomes “Rz Klfn”), and depending on the language, this can also help a bit in the matching process…

Cheers,

Réza