Part 1. Predict Dota 2 match winner by the first 5 minutes of the game. Gradient Boosting.
Introduction Dota 2 is a computer game in the MOBA (Multiplayer Online Battle Arena) genre. It is played by two teams, called Radiant and Dire which consist of five players each. The main goal of the game is to destroy other team's “Ancient", located at the opposite corners of the map. ...
Introduction
Dota 2 is a computer game in the MOBA (Multiplayer Online Battle Arena) genre. It is played by two teams, called Radiant and Dire which consist of five players each. The main goal of the game is to destroy other team's “Ancient", located at the opposite corners of the map. Matches are generated from a queue, taking into account the level of the game all players, known as MMR (Match Making Rank).
Each of the players choose one hero to play with from a pool of 112 heroes in the drafting stage of the game (sometimes called 'picks').
Description of the match
1. Players select "heroes"
Each of the players choose one hero to play with from a pool of 112 heroes in the drafting stage of the game (sometimes called 'picks'). Each hero has a set of features that define his role in the team and playstyle. Among these features there are his basic attribute (Strength, Agility or Intelligence) and unique set of 4 (or for some heroes even more) skills.
2. Main part
Players can receive gold and experience for killing other people's heroes or other units. The accumulated experience influences the level of the hero, which in turn makes it possible to improve abilities. For accumulated gold, players buy items that improve the characteristics of heroes or give them new abilities.
After death, the hero goes to the "tavern" and is reborn only after some time, so the team loses the player for a while, but the player can prematurely redeem the hero from the tavern for a certain amount of gold.
During the game, teams develop their heroes, defend their part of the field and attack the enemy.
3. End of the game
The game ends when one of the teams destroys a certain number of "towers" of the enemy and destroys the throne
For the first 5 minutes of the game, predict which of the teams will win: Radiant or Dire?
The training set consists of matches, for which all of the ingame events (like kills, item purchase etc.) as well as match outcome are know. You are given only the first 5 minutes of each match and you need to predict the likelihood of Radiant victory.
The data for this contest is based on the archive of public games by OpenDota.com (formerly YASP).
6 rows × 102 columns
start_time | lobby_type | r1_hero | r1_level | r1_xp | r1_gold | r1_lh | r1_kills | r1_deaths | r1_items | r2_hero | r2_level | r2_xp | r2_gold | r2_lh | r2_kills | r2_deaths | r2_items | r3_hero | r3_level | r3_xp | r3_gold | r3_lh | r3_kills | r3_deaths | r3_items | r4_hero | r4_level | r4_xp | r4_gold | r4_lh | r4_kills | r4_deaths | r4_items | r5_hero | r5_level | r5_xp | r5_gold | r5_lh | r5_kills | r5_deaths | r5_items | d1_hero | d1_level | d1_xp | d1_gold | d1_lh | d1_kills | d1_deaths | d1_items | d2_hero | d2_level | d2_xp | d2_gold | d2_lh | d2_kills | d2_deaths | d2_items | d3_hero | d3_level | d3_xp | d3_gold | d3_lh | d3_kills | d3_deaths | d3_items | d4_hero | d4_level | d4_xp | d4_gold | d4_lh | d4_kills | d4_deaths | d4_items | d5_hero | d5_level | d5_xp | d5_gold | d5_lh | d5_kills | d5_deaths | d5_items | first_blood_time | first_blood_team | first_blood_player1 | first_blood_player2 | radiant_bottle_time | radiant_courier_time | radiant_flying_courier_time | radiant_tpscroll_count | radiant_boots_count | radiant_ward_observer_count | radiant_ward_sentry_count | radiant_first_ward_time | dire_bottle_time | dire_courier_time | dire_flying_courier_time | dire_tpscroll_count | dire_boots_count | dire_ward_observer_count | dire_ward_sentry_count | dire_first_ward_time | duration | radiant_win | tower_status_radiant | tower_status_dire | barracks_status_radiant | barracks_status_dire |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1430198770 | 7 | 11 | 5 | 2098 | 1489 | 20 | 0 | 0 | 7 | 67 | 3 | 842 | 991 | 10 | 0 | 0 | 4 | 29 | 5 | 1909 | 1143 | 10 | 0 | 0 | 8 | 20 | 3 | 757 | 741 | 6 | 0 | 0 | 7 | 105 | 3 | 732 | 658 | 4 | 0 | 1 | 11 | 4 | 3 | 1058 | 996 | 12 | 0 | 0 | 6 | 42 | 4 | 1085 | 986 | 12 | 0 | 0 | 4 | 21 | 5 | 2052 | 1536 | 23 | 0 | 0 | 6 | 37 | 3 | 742 | 500 | 2 | 0 | 0 | 8 | 84 | 3 | 958 | 1003 | 3 | 1 | 0 | 9 | 7.0 | 1.0 | 9.0 | 134.0 | -80.0 | 244.0 | 2 | 2 | 2 | 0 | 35.0 | 103.0 | -84.0 | 221.0 | 3 | 4 | 2 | 2 | -52.0 | 2874 | 1 | 1796 | 0 | 51 | 0 | |
1430220345 | 0 | 42 | 4 | 1188 | 1033 | 9 | 0 | 1 | 12 | 49 | 4 | 1596 | 993 | 10 | 0 | 1 | 7 | 67 | 4 | 1506 | 1502 | 18 | 1 | 0 | 7 | 37 | 3 | 669 | 631 | 7 | 0 | 0 | 7 | 26 | 2 | 415 | 539 | 1 | 0 | 0 | 5 | 39 | 5 | 1960 | 1384 | 16 | 0 | 0 | 8 | 88 | 3 | 640 | 566 | 1 | 0 | 1 | 5 | 79 | 3 | 720 | 1350 | 2 | 2 | 0 | 12 | 7 | 2 | 440 | 583 | 0 | 0 | 0 | 7 | 12 | 4 | 1470 | 1622 | 24 | 0 | 0 | 9 | 54.0 | 1.0 | 7.0 | 173.0 | -80.0 | 2 | 0 | 2 | 0 | -20.0 | 149.0 | -84.0 | 195.0 | 5 | 4 | 3 | 1 | -5.0 | 2463 | 1 | 1974 | 0 | 63 | 1 | ||
1430227081 | 7 | 33 | 4 | 1319 | 1270 | 22 | 0 | 0 | 12 | 98 | 3 | 1314 | 775 | 6 | 0 | 0 | 6 | 20 | 3 | 1297 | 909 | 0 | 1 | 0 | 6 | 27 | 5 | 2360 | 2096 | 26 | 1 | 1 | 6 | 4 | 3 | 1395 | 1627 | 27 | 0 | 0 | 9 | 22 | 5 | 2305 | 2028 | 19 | 1 | 1 | 10 | 66 | 3 | 1024 | 959 | 19 | 0 | 1 | 10 | 86 | 3 | 755 | 620 | 3 | 0 | 0 | 8 | 29 | 4 | 1319 | 667 | 4 | 0 | 0 | 7 | 80 | 3 | 1350 | 1512 | 25 | 0 | 0 | 7 | 224.0 | 0.0 | 3.0 | 63.0 | -82.0 | 2 | 5 | 2 | 1 | -39.0 | 45.0 | -77.0 | 221.0 | 3 | 4 | 3 | 1 | 13.0 | 2130 | 0 | 0 | 1830 | 0 | 63 | ||
1430263531 | 1 | 29 | 4 | 1779 | 1056 | 14 | 0 | 0 | 5 | 30 | 2 | 539 | 539 | 1 | 0 | 0 | 6 | 75 | 5 | 2037 | 1139 | 15 | 0 | 0 | 6 | 37 | 2 | 591 | 499 | 0 | 0 | 0 | 6 | 41 | 3 | 712 | 1075 | 12 | 0 | 0 | 6 | 96 | 5 | 1878 | 1174 | 17 | 0 | 0 | 6 | 48 | 3 | 732 | 1468 | 22 | 0 | 0 | 10 | 15 | 4 | 1681 | 1051 | 11 | 0 | 0 | 7 | 102 | 2 | 674 | 537 | 1 | 0 | 0 | 7 | 20 | 2 | 510 | 499 | 0 | 0 | 0 | 7 | 208.0 | -75.0 | 0 | 3 | 2 | 0 | -30.0 | 124.0 | -80.0 | 184.0 | 0 | 4 | 2 | 0 | 27.0 | 1459 | 0 | 1920 | 2047 | 50 | 63 | |||||
1430282290 | 7 | 13 | 4 | 1431 | 1090 | 8 | 1 | 0 | 8 | 27 | 2 | 629 | 552 | 0 | 0 | 1 | 7 | 30 | 3 | 884 | 927 | 0 | 1 | 0 | 8 | 72 | 3 | 925 | 1439 | 16 | 1 | 0 | 11 | 93 | 4 | 1482 | 880 | 7 | 0 | 0 | 8 | 26 | 3 | 704 | 586 | 1 | 0 | 2 | 9 | 69 | 3 | 1169 | 1665 | 20 | 1 | 0 | 7 | 22 | 3 | 1055 | 638 | 1 | 0 | 0 | 9 | 25 | 5 | 1815 | 1275 | 18 | 0 | 0 | 8 | 8 | 4 | 1119 | 904 | 6 | 0 | 1 | 7 | -21.0 | 1.0 | 6.0 | 166.0 | -81.0 | 181.0 | 1 | 4 | 2 | 0 | 46.0 | 182.0 | -80.0 | 225.0 | 6 | 3 | 3 | 0 | -16.0 | 2449 | 0 | 4 | 1974 | 3 | 63 | |
1430284186 | 1 | 11 | 5 | 1961 | 1461 | 19 | 0 | 1 | 6 | 20 | 2 | 441 | 686 | 4 | 0 | 0 | 5 | 28 | 4 | 1874 | 1438 | 22 | 0 | 0 | 4 | 25 | 2 | 528 | 800 | 1 | 1 | 0 | 9 | 65 | 3 | 799 | 785 | 6 | 0 | 1 | 6 | 55 | 3 | 847 | 785 | 7 | 0 | 1 | 7 | 52 | 2 | 455 | 967 | 2 | 1 | 0 | 11 | 3 | 2 | 279 | 916 | 0 | 1 | 0 | 10 | 73 | 5 | 2065 | 2565 | 26 | 0 | 0 | 13 | 48 | 5 | 2029 | 1781 | 29 | 0 | 0 | 8 | 78.0 | 1.0 | 7.0 | 35.0 | -85.0 | 182.0 | 5 | 4 | 2 | 1 | -27.0 | 2.0 | -86.0 | 212.0 | 4 | 4 | 4 | 0 | -43.0 | 1453 | 0 | 512 | 2038 | 0 | 63 |
Description of features
- match_id: match id in Dataset
- start_time: match start time (unixtime)
- lobby_type: type of room with gamers
- dataset for each gamers (team of the Radiant has prefix rN, Dire — dN):
- r1_hero: hero gamer
- r1_level: max level of herouse (for first 5 minutes)
- r1_xp: max expirience
- r1_gold: max gold
- r1_lh: count of dead units
- r1_kills: count of killed heroes
- r1_deaths: count of heroes death
- r1_items: count of purchased items
- features of event "first blood"
- first_blood_time: game time of first blood
- first_blood_team: first blood team (0 — Radiant, 1 — Dire)
- first_blood_player1: Player1 involved in an event "first blood"
- first_blood_player2: Player2 involved in an event "first blood"
- features for each team (prefix radiant_ and dire_)
- radiant_bottle_time: time of acquisition of item "bottle"
- radiant_courier_time: time of acquisition of item "courier"
- radiant_flying_courier_time: time of acquisition of item "flying_courier"
- radiant_tpscroll_count: count of items "tpscroll" for first 5 minutes
- radiant_boots_count: count of items "boots"
- radiant_ward_observer_count: count of items "ward_observer"
- radiant_ward_sentry_count: count of items "ward_sentry"
- radiant_first_ward_time: The time of installation by the command of the first "observer", i.e. An object that allows you to see part of the playing field
- Match result
- duration: duration
- radiant_win: if win the Radiant command then 1, else 0
- Status of towers and barracks by the end of the match
- tower_status_radiant
- tower_status_dire
- barracks_status_radiant
- barracks_status_dire
Quality metric
As a quality metric, we will use the area under the ROC curve (AUC-ROC). Note that AUC-ROC is a quality metric for the algorithm that issues the first-class membership estimates. To do this, you need to get predictions using the function predict_proba. It returns two columns: the first contains ratings for belonging to the zero class, the second to the first class. We need values from the second column:
Pred = clf.predict_proba (X_test) [:, 1]
Gradien Boosting as method of possible solution
We will use method of Gradient Boosting. It is not very demanding on data, restores non-linear dependencies, and works well on many data sets, which determines its popularity. It is quite reasonable to try it first. May be next article we will try method of logistic regression.
Process of solution
- Read features from file
import pandas X_train = pandas.read_csv('features.csv', index_col='match_id')
Remove features about end of match, such as duration, radiant_win, tower_status_radiant, tower_status_dire, barracks_status_radiant, barracks_status_dire
X_test = X_train.drop("duration",1).drop("radiant_win",1).drop("tower_status_radiant",1).drop("tower_status_dire",1).drop("barracks_status_radiant",1).drop("barracks_status_dire",1)
- Check the selection for skipped passes with the count () function, which for each column shows the number of filled values.
# find a max value maxCount = 0 C = X_test.count() for x in range(0, len(C)): if C[x] > maxCount: maxCount = C[x] # OUT: maxCount = 97230 # features with skipped data for idx,item in enumerate(C): if item < maxCount: print( C.keys()[idx],': ', maxCount-item)
# OUT: first_blood_time : 19553 first_blood_team : 19553 first_blood_player1 : 19553 first_blood_player2 : 43987 radiant_bottle_time : 15691 radiant_courier_time : 692 radiant_flying_courier_time : 27479 radiant_first_ward_time : 1836 dire_bottle_time : 16143 dire_courier_time : 676 dire_flying_courier_time : 26098 dire_first_ward_time : 1826
Missing data is used event "first blood". If "first blood" have't been happened for 5 minutes, then values set a skipped value. For example, skipped data in first_blood_time and first_blood_team means nobody has been died for first 5 minutes.
# Replace skipped data to 0 X_test = X_test.fillna(0) # Target value y_test = features['radiant_win']
We will forget that there are categorical attributes in the sample, and we will try to teach a gradient boosting over trees on the existing matrix "objects-signs". Lock the split generator for cross-validation in 5 blocks (KFold), do not forget to shuffle the sample (shuffle = True), because the data in the table is sorted by time, and without undue effects you may encounter unwanted effects when evaluating the quality.
Evaluate the quality of GradientBoostingClassifier using this cross-validation, try different number of trees (10, 20, 30, 40) and measure elapsed time.
from sklearn.ensemble import GradientBoostingClassifier import time import datetime import numpy as np from sklearn.metrics import roc_auc_score from sklearn.model_selection import KFold # cross-validation kf = KFold(n_splits=5, shuffle=True) X = X_test.as_matrix() y = y_test.as_matrix() for ne in [10,20,30,40]: start_time = datetime.datetime.now() # Gradient Boosting grd = GradientBoostingClassifier(n_estimators=ne, verbose=False, random_state=241) # teach grd.fit(X, y) minQuality = 1; for train_index, test_index in kf.split(X): X_train, X_cross_test = X[train_index], X[test_index] y_train, y_cross_test = y[train_index], y[test_index] # predict pred1 = grd.predict_proba(X_train)[:, 1] # find quality auc1 = roc_auc_score(y_train, pred1) if auc1 < minQuality: minQuality = auc1 print('estimators : ', ne) print("minQuality : ", minQuality) print('Time elapsed : ', datetime.datetime.now() - start_time)
# OUT: estimators : 10 minQuality : 0.671158427379 Time elapsed : 0:00:07.285712 estimators : 20 minQuality : 0.690562120847 Time elapsed : 0:00:13.752695 estimators : 30 minQuality : 0.699039946513 Time elapsed : 0:00:20.728346 estimators : 40 minQuality : 0.704060457972 Time elapsed : 0:00:26.331095
Is the optimum achieved at the tested values of the parameter n_estimators, or is the quality likely to continue to grow with its further increase?
if we take 60 or 80 trees, quality will increase insignificantly.
estimators : 60; min quality : 0.7128636129 estimators : 70; min quality : 0.715895358026 estimators : 80; min quality : 0.718241911848
Conclusions
We can predict the outcome of a match with an accuracy of 71% using the method of gradient boosting. Next time I will use logistic regression for prediction and will check fineal model by new data. It is interesting to compare result of another method.
Links:
- The competition: Dota 2: Win Probability Prediction
- Repository with solution code Github.com
- Datasets