30/09/2018, 18:14

Fuzzy string matching using cosine similarity

That’s how they calculate a match of two text in unit of percent using cosine similarity algorithm.

Another dev's two cents – 20 Sep 15

Fuzzy string matching using cosine similarity

Lately i've been dealing quite a bit with mining unstructured data[1]. This often involved determining the similarity of Strings and blocks of text. One of the more interesting algorithms i came across was the Cosine Similarity algorithm. It's a...

Working with strings

The first step is to turn our strings into vectors that we can work with. We do this by considering the Term Frequency of unique words or letters in a given string. Taking the following two strings as an example,

a. He is the hero Gotham deserves
b. but not the one it needs right now.

We can identify 13 unique words shared between these two strings. We
build our vector by noting how many times each word occurred in each
string.

      Word       a        b
1.    the        1        1  
2.    he         1        0  
3.    is         1        0  
4.    it         0        1  
5.    right      0        1  
6.    needs      0        1  
7.    but        0        1  
8.    hero       1        0  
9.    deserves   1        0  
10.   not        0        1  
11.   now        0        1  
12.   one        0        1  
13.   gotham     1        0  

Now we have two vectors

vector a =(1,1,1,0,0,0,0,1,1,0,0,0,1)

vector b =(1,0,0,1,1,1,1,0,0,1,1,1,0)

With this we can easily calculate the dot product and magnitudes of the vectors using the method that was illustrated earlier.

a . b = 1, ||a|| = 2.44948974278, ||b|| = 2.82842712475

Cosine similarity = 1 / (2.44948974278) * (2.82842712475)
Cosine similarity = 0.144337567297

So, string a match 14.43 % with string b.

Working with letter
For word cosine similarity, we consider unique characters in the string to construct our vector,

a. bob
b. rob

I’m sure you already see where this is going, but i’ll show you anyway

      Letter       a        b
1.    b            2        1  
2.    o            1        1  
3.    r            0        1  

Now we just compute the cosine similarity using the vectors

a . b = 3, ||a|| = 2.2360679775, ||b|| = 1.73205080757

Cosine similarity = 3 / (2.2360679775) * (1.73205080757)
Cosine similarity = 0.774596669241

It turns out, ‘bob’ and ‘rob’ are indeed similar; approximately 77% according to the results.

Java code:

 /**
 * @param terms values to analyze
 * @return a map containing unique 
 * terms and their frequency
 */
public static Map<String, Integer> getTermFrequencyMap(String[] terms) {
    Map<String, Integer> termFrequencyMap = new HashMap<>();
    for (String term : terms) {
        Integer n = termFrequencyMap.get(term);
        n = (n == null) ? 1 : ++n;
        termFrequencyMap.put(term, n);
    }
    return termFrequencyMap;
}

/**
 * @param text1 
 * @param text2 
 * @return cosine similarity of text1 and text2
 */
public static double cosineSimilarity(String text1, String text2) {
    //Get vectors
    Map<String, Integer> a = getTermFrequencyMap(text1.split("\W+"));
    Map<String, Integer> b = getTermFrequencyMap(text2.split("\W+"));

    //Get unique words from both sequences
    HashSet<String> intersection = new HashSet<>(a.keySet());
    intersection.retainAll(b.keySet());

    double dotProduct = 0, magnitudeA = 0, magnitudeB = 0;

    //Calculate dot product
    for (String item : intersection) {
        dotProduct += a.get(item) * b.get(item);
    }

    //Calculate magnitude a
    for (String k : a.keySet()) {
        magnitudeA += Math.pow(a.get(k), 2);
    }

    //Calculate magnitude b
    for (String k : b.keySet()) {
        magnitudeB += Math.pow(b.get(k), 2);
    }

    //return cosine similarity
    return dotProduct / Math.sqrt(magnitudeA * magnitudeB);
}

This code splits the given strings into words using the regex pattern W+. To adapt this for characters, (?!^) should be used instead.

... viết 20:26 ngày 30/09/2018

we might see this tutorial use in some modern IDE, to list the matching our code with some keyword, sort by decrease the percent of similarity of code.

Bài liên quan
0