Matching Users Between Systems



By: rkalfane

November 13, 2008 7:27 am

Reads: 156

Comments:3

Rating:0

When you can’t rely on common unique IDs…

Table of Contents

        Introduction

        Method used

        Java code

        Testing the tool

Introduction

It happens quite often during projects that you need to match users from one system (for instance the Meta-Directory) to another system (for instance a HR System), but you can’t use any common unique identifiers between the two systems! Maybe in the HR System, you will only find first name and last name, and no login names…

This little tip may help you find matching users based on their first name and last name. The tool is calculating “distances” between first names and last names in the different systems. Based on the score, you will find the exact matches (same first name, same last name, distance = 0) and very close matches (maybe one letter or two inverted characters difference).

Method used

The method is using the Levenshtein algorithm to calculate the distance between two strings. This distance is equal to the minimum characters to add, insert or replace to go from one string to the other one.

For instance, the distance between two identical stings is 0 (obvious).

The distance between “Henry” and “Henri” is 1 as you need to replace “y” by “i”.

The distance between “Marianne” and “Marian” is 2 as you need to delete two letters.

Here is the algorithm in pseudo-code:

int LevenshteinDistance(char s[1..m], char t[1..n])
  // d is a table with m+1 rows and n+1 columns
  declare int d[0..m, 0..n]

  for i from 0 to m
      d[i, 0] := i
  for j from 0 to n
      d[0, j] := j

  for i from 1 to m
      for j from 1 to n
      {
          if s[i] = t[j] then cost := 0
                         else cost := 1
          d[i, j] := minimum(
                               d[i-1, j] + 1,     // deletion
                               d[i, j-1] + 1,     // insertion
                               d[i-1, j-1] + cost   // substitution
                           )
      }

  return d[m, n]
  
 

Java code

Based on the pseudo code, here is the Java implementation of the distance calculation:

    private static int minimum(int a, int b, int c){
        if (a<=b && a<=c)
            return a;
        if (b<=a && b<=c)
            return b;
        return c;
    }
    
    public static int computeLevenshteinDistance(String aStr1, String aStr2) {
    	
    	char[] str1 = aStr1.toCharArray(); 
    	char[] str2 = aStr2.toCharArray();
    	
        int[][] distance = new int[str1.length+1][];

        for(int i=0; i<=str1.length; i++){
            distance[i] = new int[str2.length+1];
            distance[i][0] = i;
        }
        for(int j=0; j<str2.length+1; j++)
            distance[0][j]=j;

        for(int i=1; i<=str1.length; i++)
            for(int j=1;j<=str2.length; j++)
                  distance[i][j]= minimum(distance[i-1][j]+1, distance[i][j-1]+1, 
                                          distance[i-1][j-1]+((str1[i-1]==str2[j-1])?0:1));
       
        return distance[str1.length][str2.length];
    }
	
 

The tool that has been built, based on this calculation, takes 3 arguments in the command-line:

  • first name, last name, unique ID export from the first system
  • first name, last name, unique ID export from the second system
  • a CSV result file to store the matches along with scores calculated

Here is the full source code of the tool explained (not that big):

First, the different imports to be able to read files, write files, manipulate the data:

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.PrintWriter;
import java.util.Iterator;
import java.util.TreeSet;

public class Distance 
{	


 

As we have seen, here are the Levenshtein distance calculation methods:

    private static int minimum(int a, int b, int c){
        if (a<=b && a<=c)
            return a;
        if (b<=a && b<=c)
            return b;
        return c;
    }
    
    public static int computeLevenshteinDistance(String aStr1, String aStr2) {
    	
    	char[] str1 = aStr1.toCharArray(); 
    	char[] str2 = aStr2.toCharArray();
    	
        int[][] distance = new int[str1.length+1][];

        for(int i=0; i<=str1.length; i++){
            distance[i] = new int[str2.length+1];
            distance[i][0] = i;
        }
        for(int j=0; j<str2.length+1; j++)
            distance[0][j]=j;

        for(int i=1; i<=str1.length; i++)
            for(int j=1;j<=str2.length; j++)
                  distance[i][j]= minimum(distance[i-1][j]+1, distance[i][j-1]+1, 
                                          distance[i-1][j-1]+((str1[i-1]==str2[j-1])?0:1));
       
        return distance[str1.length][str2.length];
    }
	
 

In the main class, we load the two CSV files (“,” is the separator used here) and load them in TreeSet objects, so they are ordered by last name:

	public static void main( String[] args ) throws Exception
	{
		 BufferedReader oInSys1 = new BufferedReader( new FileReader(args[0]) );
		 BufferedReader oInSys2 = new BufferedReader( new FileReader(args[1]) );
 		 PrintWriter oOutResult = new PrintWriter(new BufferedWriter(new FileWriter(args[2])));

		 String oLine = "";
		 
		 // Load Sys1 data
		 TreeSet oSys1Set = new TreeSet();
		 while( ( oLine = oInSys1.readLine() ) != null )
		 {
			 String[] oTokens = oLine.split( "," );
			 String oSn = oTokens[0].toUpperCase();
			 String oGivenName = oTokens[1].toUpperCase();
			 String oSys1ID = oTokens[2];
			 String oSys1Data = oSn + "~" + oGivenName + "~" + oSys1ID;
			 oSys1Set.add( oSys1Data );
		 }
		 
		 // Load Sys2 data
		 TreeSet oSys2Set = new TreeSet();
		 while( ( oLine = oInSys2.readLine() ) != null )
		 {
			 String[] oTokens = oLine.split(",");
			 String oSn = oTokens[0].toUpperCase();
			 String oGivenName = oTokens[1].toUpperCase();
			 String oUniqueID = oTokens[2];
			 String oSys2Data = oSn + "~" + oGivenName + "~" + oUniqueID;
			 oSys2Set.add( oSys2Data );
		 }
		 
 

Now we can cycle through all entries in the second system and for each entry, cycle in all the entries of the second system to calculate distances and find the best match:

	 
		 // Cycle through Sys2 data and find closest Sys1 string
		 Iterator oSys2Iterator = oSys2Set.iterator();
		 while( oSys2Iterator.hasNext() )
		 {
			 String oSys2Data = (String) oSys2Iterator.next();
			 String[] oSys2Tokens = oSys2Data.split("~");
			 String oSys2Sn = oSys2Tokens[0];
			 String oSys2GivenName = oSys2Tokens[1];
			 String oSys2UniqueID = oSys2Tokens[2];
			 
			 String oBestMatch = "";
			 String oBestSn = "";
			 String oBestGivenName = "";
			 double oBestScore = 999999;
			 
			 // Find closest Sys1 string
			 Iterator oSys1Iterator = oSys1Set.iterator();
			 while( oSys1Iterator.hasNext() )
			 {
				 String oSys1Data = (String) oSys1Iterator.next();
				 String[] oSys1Tokens = oSys1Data.split("~");
				 String oSys1Sn = oSys1Tokens[0];
				 String oSys1GivenName = oSys1Tokens[1];
				 String oSys1ID = oSys1Tokens[2];
				 
 

The calculation is using the distance between last names and between first names, giving more importance to the last name. You can experiment by changing a bit the way the final score is calculated…

				 double oScore = java.lang.Math.sqrt( 
						 java.lang.Math.pow( Distance.computeLevenshteinDistance( oSys2Sn , oSys1Sn ) * 2, 2 ) 
						 + java.lang.Math.pow( Distance.computeLevenshteinDistance( oSys2GivenName , oSys1GivenName ), 2 ) );
				 
				 if ( oScore < oBestScore ) 
				 {
					 oBestScore = oScore;
					 oBestMatch = oSys1ID;
					 oBestSn = oSys1Sn;
					 oBestGivenName = oSys1GivenName;
				 }
				 
 

If an exact match is found, we remove the entry from the first set of data:

			 
				 if ( oScore == 0 )
				 {
					 oSys1Set.remove( oSys1Data );
					 break;
				 }
			 }
			 
 

Finally, we save the best match to the result file:

		 
			 String oResultData = oSys2Sn + "," + oSys2GivenName + "," + oSys2UniqueID + "," + oBestSn + "," + oBestGivenName + "," + oBestMatch + "," + oBestScore;
			 oOutResult.write( oResultData + "\n" );
		 }

		 oOutResult.close();
		 oInSys2.close();
		 oInSys1.close();
	}
}

 

We can compile the source code of the tool using the following command:

javac Distance.java 

The result is a Distance.class file we can package in a jar file using the following command:

jar cvf distance.jar Distance.class 

Testing the tool

Let’s test the tool by processing two CSV files with exact matches and similar matches.

The first CSV file sys1.csv corresponding to the first system is the following:

William,Abdo,wabdo
Eddie,Austin,eaustin
Jerry,Dunn,jdunn
Deborah,Fair,dfair
Marianne,Griffith,mgriffith
Jeff,Healey,jhealey
Clarence,Holloway,chooloway
Clifford,Hough,chough
Randy,Krumins,rkrumins
Larry,Larsen,llarsen
Sean,Murphy,smurphy
Marie,Saenz,msaenz

 

The second CSV file sys2.csv corresponding to the second system is the following:

Willy,Abdo,10001
Eddy,Austin,10002
Jeremy,Dunn,10003
Deborah,Fair,10004
Marian,Griffith,10005
Jeff,Healy,10006
Clarence,Halloway,10007
Clifford,Hough,10008
Randy,Krumins,10009
Larry,Larsen,10010
Sean,Murphy,10011
Mary,Saenz,10012

 

We can run the tool by executing the following command:

java -classpath distance.jar Distance sys1.csv sys2.csv result.csv 

Here is the resulting CSV file:

CLARENCE,HALLOWAY,10007,CLARENCE,HOLLOWAY,chooloway,1.0
CLIFFORD,HOUGH,10008,CLIFFORD,HOUGH,chough,0.0
DEBORAH,FAIR,10004,DEBORAH,FAIR,dfair,0.0
EDDY,AUSTIN,10002,EDDIE,AUSTIN,eaustin,4.0
JEFF,HEALY,10006,JEFF,HEALEY,jhealey,1.0
JEREMY,DUNN,10003,JERRY,DUNN,jdunn,4.0
LARRY,LARSEN,10010,LARRY,LARSEN,llarsen,0.0
MARIAN,GRIFFITH,10005,MARIANNE,GRIFFITH,mgriffith,4.0
MARY,SAENZ,10012,MARIE,SAENZ,msaenz,4.0
RANDY,KRUMINS,10009,RANDY,KRUMINS,rkrumins,0.0
SEAN,MURPHY,10011,SEAN,MURPHY,smurphy,0.0
WILLY,ABDO,10001,WILLIAM,ABDO,wabdo,6.0

 

We can see entries from the second systems with possible matches in the first system and the score calculated based on the distance between names. If you order by scores, you will first see the exact matches and then names that are very similar (scores 1 or 2).

distance_Clipboard01.png

Using this technique on a few thousands of entries can really help in the data cleaning, matching process, before putting in production a new driver.

VN:F [1.9.22_1171]
Rating: 0.0/5 (0 votes cast)

Tags:
Categories: Identity Manager, Technical Solutions

Disclaimer: As with everything else at NetIQ Cool Solutions, this content is definitely not supported by NetIQ, so Customer Support will not be able to help you if it has any adverse effect on your environment.  It just worked for at least one person, and perhaps it will be useful for you too.  Be sure to test in a non-production environment.

3 Comments

  1. By:dvandermaas

    As always, a great tool which will be very useful.
    It might be an idea to implement this in the Analyzer product, for it is not in there yet, to my knowledge.
    Grtz
    David

  2. By:Shivaji

    Thanks a lot !!!

    This is a problem I have to deal with regularly when we need to take lists from different systems and match, say IDs on one system to logins on a third system. Just spent an hour on doing this manually from data dumps from our eProcurement system and our ERP system, and I was getting ready to work on a more permanent solution. You just saved me a few hours of programming,

    Thanks again,

    Shivaji

  3. By:rkalfane

    Hello,

    Thanks for the feedback, I always appreciate! Of course this can be improved, but could be a good base for special matching algorithm. I remember that I also tested another way where all vowels are removed (“Reza Kalfane” becomes “Rz Klfn”), and depending on the language, this can also help a bit in the matching process…

    Cheers,

    Réza

Comment