12/08/2018, 15:10

Item-based recommendation

Last month, I introduced some basic concepts of recommendation based on user ratings and showed a simple way to evaluate the data and recommend items. In this post, I'll show you how to move from user-based to item-based recommendation. Many of the terms and Ruby methods were already presented in the last post about recommendation, so I recommend reading it first before taking on the challenge below.

From the last post, we know that the source data has the following format:

RATINGS = {
    'John': {
      'Kong': 7.0,
      'John Wick': 7.5,
      'Logan': 6.0,
      'Split': 5.5,
      'Moana': 6.5,
      'La La La Land': 8.0
    },
    'Lee': {
      'Kong': 6.5,
      'John Wick': 5.0,
      'Logan': 4.5,
      'Split': 4,
      'La La La Land': 6.0,
      'Moana': 7.0
    },
   ...

This data was collected from the reviews of individual users, so it is grouped by user. If we want to switch to item-based recommendation, the data needs to be reformatted to

{
  'movie_1': {
    user_1: 1.0,
    user_2: 2.0,
    ...
  },
  'movie_2': {
    user_1: 1.0,
    user_2: 2.0,
    ...
  },
  ...
}

So the first thing we need to do is transform the data to the new format:

def convert_to_items_based ratings
  {}.tap do |items_ratings|
    # Collect every movie name once, then iterate over the movies
    ratings.values.flat_map(&:keys).uniq.each do |movie|
      items_ratings[movie] = {}
      ratings.each do |user, rate|
        user_rate = rate[movie]
        items_ratings[movie][user] = user_rate unless user_rate.nil?
      end
    end
  end
end
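
As a quick check of the target shape, here is a compact sketch of the same inversion run on a tiny sample. The name invert_ratings and the two-user sample hash are hypothetical and only for illustration; they are not part of the original code.

```ruby
# Hypothetical two-user sample, not the full RATINGS hash from the post.
sample_ratings = {
  'John': { 'Kong': 7.0, 'Logan': 6.0 },
  'Lee':  { 'Kong': 6.5, 'Moana': 7.0 }
}

# Compact equivalent of convert_to_items_based (illustrative name):
# flips user => movie => rating into movie => user => rating.
def invert_ratings(ratings)
  ratings.each_with_object({}) do |(user, reviews), by_item|
    reviews.each do |movie, rating|
      (by_item[movie] ||= {})[user] = rating
    end
  end
end

by_item = invert_ratings(sample_ratings)
p by_item
# by_item is equal to:
#   { 'Kong': { 'John': 7.0, 'Lee': 6.5 }, 'Logan': { 'John': 6.0 }, 'Moana': { 'Lee': 7.0 } }
```

Movies that a user has not rated simply do not get an entry for that user, which matches the nil check in convert_to_items_based.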

And now we have item-based data from which to make item-based recommendations.

As I presented in the last post, to rank the ratings we have to employ a method that measures the similarity between items: Euclidean distance or Pearson correlation. Although the two methods differ in approach, theory, and implementation, both give us a score for each pair of items, and from those scores we can tell how similar the items are. From that, we can build a list of the top items most similar to a given item:

def top_matches data, target_item, n = 5
  scores = data.map do |item, _|
    next if target_item == item
    # In this case, I use the Pearson score
    {}.tap do |item_rating|
      item_rating[item] = pearson_correlation(data, target_item, item)
    end
  end.compact

  # Sort by score, highest first, and keep the top n
  scores.sort_by{|item_rating| item_rating.values.first }.reverse.take(n)
end
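
top_matches calls pearson_correlation(data, target_item, item), which comes from the previous post and is not repeated here. For readers who want to run the code on its own, a minimal sketch might look like the following. It is an assumption about how the helper behaves (compare only the users who rated both items, return 0.0 when there is no overlap or no variance), not necessarily the exact code from the previous post.

```ruby
# Sketch of a Pearson correlation between two entries of a ratings hash.
# Assumption: data maps item => { user => rating }, the shape produced
# by convert_to_items_based.
def pearson_correlation(data, first, second)
  # Only users who rated both items can be compared
  shared = data[first].keys & data[second].keys
  n = shared.size
  return 0.0 if n.zero?

  xs = shared.map { |user| data[first][user].to_f }
  ys = shared.map { |user| data[second][user].to_f }

  sum_x = xs.sum
  sum_y = ys.sum
  sum_x_sq = xs.sum { |x| x**2 }
  sum_y_sq = ys.sum { |y| y**2 }
  sum_xy = xs.zip(ys).sum { |x, y| x * y }

  numerator = sum_xy - (sum_x * sum_y / n)
  variance_product = (sum_x_sq - sum_x**2 / n) * (sum_y_sq - sum_y**2 / n)
  # No variance in one of the items (or rounding noise): treat as "no correlation"
  return 0.0 if variance_product <= 0

  numerator / Math.sqrt(variance_product)
end
```

Returning 0.0 for the degenerate cases keeps top_matches from failing on items with too little shared data, at the cost of treating "unknown" the same as "uncorrelated".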

The top_matches method returns the n items with the highest similarity scores, computed here with Pearson correlation (Euclidean distance could be used in the same way). Now we need to calculate the similar items for each movie:

def calculate_similar_items ratings, n = 10
  {}.tap do |similar_items|
    item_ratings = convert_to_items_based ratings
    item_ratings.each do |movie, movie_ratings|
      # Log progress, since this step can be slow on a large dataset
      puts "[INFO]: #{movie} - #{movie_ratings.length} ratings"
      scores = top_matches(item_ratings, movie, n)
      similar_items[movie] = scores
    end
  end
end

Now let's try it:

2.4.1 :005 > calculate_similar_items RATINGS
[INFO]: Kong - 5 ratings
[INFO]: John Wick - 7 ratings
[INFO]: Logan - 4 ratings
[INFO]: Split - 7 ratings
[INFO]: Moana - 6 ratings
[INFO]: La La La Land - 6 ratings
 => {:Kong=>[{:Moana=>0.8058229640253802}, {:Logan=>0.6546536707079758}, {:"John Wick"=>0.39929785312496224}, {:Split=>0.2795084971874737}, {:"La La La Land"=>0.0}], :"John Wick"=>[{:Logan=>0.5703518254720301}, {:Split=>0.5111815065740504}, {:Kong=>0.39929785312496224}, {:"La La La Land"=>0.16297339597886237}, {:Moana=>-0.5213601623400473}], :Logan=>[{:"La La La Land"=>0.9116377679037143}, {:Split=>0.7230210236376229}, {:Kong=>0.6546536707079758}, {:"John Wick"=>0.5703518254720301}, {:Moana=>-0.3503292361635921}], :Split=>[{:Logan=>0.7230210236376229}, {:"John Wick"=>0.5111815065740504}, {:"La La La Land"=>0.38822469593451137}, {:Kong=>0.2795084971874737}, {:Moana=>-0.09059377806311973}], :Moana=>[{:Kong=>0.8058229640253802}, {:Split=>-0.09059377806311973}, {:Logan=>-0.3503292361635921}, {:"John Wick"=>-0.5213601623400473}, {:"La La La Land"=>-0.574620465390228}], :"La La La Land"=>[{:Logan=>0.9116377679037143}, {:Split=>0.38822469593451137}, {:"John Wick"=>0.16297339597886237}, {:Kong=>0.0}, {:Moana=>-0.574620465390228}]}

As you can see from the code and the output, I log each item as it is processed, because with a large dataset the calculation can take much longer than expected. As explained in the last post, the similarity score of each item ranges from -1 to 1 because I use Pearson correlation to calculate it. The closer the score is to 1, the more similar the item is to the target item; the closer it is to -1, the more it differs from the target item. Because we ask for the 10 most similar items and the dataset holds only limited information, some movies in the result above get a negative score. We can prevent that by keeping only the positive values. In a real system with a large dataset, the similarity scores between items will be more stable.
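
Keeping only the positive values could be done as a small post-processing step. The sketch below assumes the hash shape returned by calculate_similar_items; the method name keep_positive_matches is hypothetical, and the sample scores are rounded from the output above.

```ruby
# Keep only neighbours whose similarity score is strictly positive.
# similar_items has the shape returned by calculate_similar_items:
#   { movie => [{ other_movie => score }, ...] }
def keep_positive_matches(similar_items)
  similar_items.transform_values do |matches|
    matches.select { |match| match.values.first > 0 }
  end
end

# Scores for Moana, rounded from the console output above
sample = {
  'Moana': [{ 'Kong': 0.8058 }, { 'Split': -0.0906 }, { 'Logan': -0.3503 }]
}

positive_only = keep_positive_matches(sample)
# positive_only is equal to { 'Moana': [{ 'Kong': 0.8058 }] }
```

Note that a score of exactly 0.0 is dropped as well, since it carries no information about similarity.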

Now we're ready to give recommendations based on the similarity scores. However, each person has their own taste, so we need to personalize the recommendations. The similarity scores calculated above are the same for every user, so we need to mix them with each user's own reviews to recommend items that are specific to that user. In addition, when users visit an item on a web page, they expect suggestions for items they have never seen before, so the recommended items should differ from the items in their browsing history. The easiest way to do this is to multiply the similarity scores by the user's previous ratings. The table below shows how this works for the user Jack:

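Movie      | Rating | Kong   | x.Kong  | Logan   | x.Logan | La La La Land | x.La La La Land
John Wick  | 9.0    | 0.3993 | 3.5937  |  0.5704 |  5.1336 |  0.16297      |   1.46673
Moana      | 4.0    | 0.8058 | 3.2232  | -0.3053 | -1.2212 | -0.57462      |  -2.29848
Split      | 8.0    | 0.2795 | 2.236   |  0.7230 |  5.784  |  0.38822      |   3.10576
Total      |        | 1.4846 | 9.0529  |  0.9881 |  9.6964 | -0.02343      |   2.27401
Normalized |        |        | 6.09787 |         |  9.8132 |               | -97.05548

(A column named after a movie holds the similarity to that movie, an "x." column holds Rating multiplied by that similarity, and Normalized is the "x." total divided by the similarity total.)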
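
The arithmetic behind one column of the table can be reproduced in a few lines. This is only a sketch of the Kong column using the numbers from the table; the variable names are illustrative.

```ruby
# Jack's ratings and the similarities to Kong, copied from the table above
jack_ratings    = { 'John Wick' => 9.0, 'Moana' => 4.0, 'Split' => 8.0 }
kong_similarity = { 'John Wick' => 0.3993, 'Moana' => 0.8058, 'Split' => 0.2795 }

# "x.Kong" total: each rating weighted by that movie's similarity to Kong
weighted_sum = jack_ratings.sum { |movie, rating| rating * kong_similarity[movie] }

# "Kong" total: the sum of the similarities, used to normalize
similarity_sum = kong_similarity.values.sum

predicted = weighted_sum / similarity_sum

puts format('weighted sum:   %.4f', weighted_sum)   # 9.0529
puts format('similarity sum: %.4f', similarity_sum) # 1.4846
puts format('predicted:      %.4f', predicted)      # 6.0979
```

The same calculation for La La La Land divides by a similarity total of -0.02343, which is why its normalized value lands far outside the rating scale.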

Based on that, we get a method that provides item-based recommendations:

def item_based_recommendation ratings, user
  user_ratings = ratings[user]
  scores = {}
  total_sim = {}
  similar_items = calculate_similar_items ratings

  user_ratings.each do |movie, rating|
    similar_items[movie].each do |sim_movie|
      sim_movie_name = sim_movie.keys.first
      sim_movie_score = sim_movie.values.first
      # Ignore items the user has already reviewed
      next unless user_ratings[sim_movie_name].nil?

      # Sum of similarity * rating for each candidate item
      scores[sim_movie_name] ||= 0
      scores[sim_movie_name] += sim_movie_score * rating

      # Sum of similarities, used to normalize the score
      total_sim[sim_movie_name] ||= 0
      total_sim[sim_movie_name] += sim_movie_score
    end
  end

  rankings = scores.map do |item, score|
    {}.tap{|rec| rec[item] = score/total_sim[item]}
  end
  rankings.sort_by{|rank| rank.values.first }.reverse
end

In comparison, item-based recommendation is significantly faster than user-based recommendation when generating a list of recommendations from a large dataset, but the item similarity data has to be maintained regularly. There is also a difference in accuracy that depends on how "sparse" the dataset is. For example, if every user rates every movie, the dataset is dense (not "sparse"); if each user provides only a few ratings, the dataset is sparse. Item-based filtering usually outperforms user-based filtering on sparse datasets. That said, user-based filtering is simpler to implement and has no extra steps, so it suits systems with smaller datasets. Showing users which people share their interests would be quite strange on a shopping website, but it may be a good choice for link-sharing or music sites.

Finally, everything I've shared here is just a reference. If you find that your recommendation system works much better than my idea, that's no problem, because we build recommendation systems to reduce the gap between users and our system. So feel free to contact me and share your ideas!
