The U.S. Open main draw begins this morning and for the fourth year in a row, I will not be able to attend. Gone are the good ol’ days of working for the USTA and getting to take the trip up to New York to take it all in.
Since I cannot go, I decided to utilize Markov Chain models and Monte Carlo simulations to predict who will win.
A Markov model of a tennis match takes a few initial inputs, such as each player’s probability of winning a point on serve, and steps through an entire match to give the probability that player A beats player B. A Monte Carlo simulation then plays out the entire tournament over and over using those match probabilities. Even if you can do the math, one of the most difficult parts is creating the initial inputs to run the Markov model.
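To make the Markov idea concrete, here is a minimal sketch of its smallest building block: the chance that a server holds a single game, given a fixed probability `p` of winning each point. It uses the standard closed-form result for a game with deuce and assumes every point is independent. The function name `game_win_prob` is only for illustration, and this is not necessarily how the code behind these predictions is organized.

```python
def game_win_prob(p: float) -> float:
    """Probability the server wins a game when each point is won with probability p."""
    q = 1.0 - p
    # Server wins 4 points while the returner wins 0, 1, or 2:
    # 1, 4, and 10 orderings respectively (the server always wins the last point).
    before_deuce = p**4 * (1 + 4 * q + 10 * q**2)
    # Reach 3-3 (deuce) in C(6, 3) = 20 orderings, then win from deuce,
    # which happens with probability p^2 / (p^2 + q^2).
    from_deuce = 20 * p**3 * q**3 * (p**2 / (p**2 + q**2))
    return before_deuce + from_deuce


if __name__ == "__main__":
    for p in (0.50, 0.62, 0.70):
        print(f"point win prob {p:.2f} -> hold prob {game_win_prob(p):.3f}")
```

The same chaining idea extends from games to sets and from sets to the match, which is why small differences in point-level inputs produce large differences in match-level probabilities.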
MY METHODOLOGY
I decided to experiment with an idea that begins with something I read in the PhD dissertation Dr. Kamran Aslam wrote at USC. Dr. Aslam and his advisor, Dr. Paul K. Newton, published portions of this work several times, including in the Journal of Quantitative Analysis in Sports back in 2009.
Dr. Aslam’s idea is to start by finding the overall mean probability of winning a point while returning. This is defined as the returning average of ‘the field’. Let’s say this is 0.33. Then, if Roger Federer is playing Novak Djokovic and Roger wins 0.40 of his return points on average, he is 0.07 better than ‘the field’. If Novak’s average is 0.41, then he is 0.08 better than ‘the field’.
Then, if Roger wins 0.70 of his service points, you subtract Novak’s margin ‘above the field’ (0.08), making Roger’s effective serving percentage 0.62.
Likewise, if Novak wins 0.68 of his service points, subtracting Roger’s margin (0.07) makes his effective serving percentage 0.61. Therefore, the inputs to the program for Roger would be 0.62 (his effective serving percentage) and 0.39 (one minus Novak’s effective serving percentage). If you ran this for Novak, the inputs would be 0.61 and 0.38.
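The arithmetic in the Federer and Djokovic example can be written as a short sketch. The numbers are the illustrative ones from the text, and the function name `effective_serve` is a made-up label for this example rather than anything from Dr. Aslam’s paper.

```python
def effective_serve(serve_pct: float, opp_return_pct: float, field_return_pct: float) -> float:
    """Server's point-win rate, reduced by how far the opponent's return beats the field."""
    return serve_pct - (opp_return_pct - field_return_pct)


FIELD_RETURN = 0.33  # illustrative field average from the example

federer_serve = effective_serve(0.70, opp_return_pct=0.41, field_return_pct=FIELD_RETURN)
djokovic_serve = effective_serve(0.68, opp_return_pct=0.40, field_return_pct=FIELD_RETURN)

# Model inputs for each player: (own effective serve, one minus opponent's effective serve)
federer_inputs = (round(federer_serve, 2), round(1 - djokovic_serve, 2))
djokovic_inputs = (round(djokovic_serve, 2), round(1 - federer_serve, 2))

if __name__ == "__main__":
    print("Federer inputs: ", federer_inputs)
    print("Djokovic inputs:", djokovic_inputs)
```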
Modifications
To get the data, I scraped all serving and receiving stats from the ATP website for each player in the draw. I also decided to scale the data.
Scaling
Using only hard court results for the 2014 season, I scaled the data based on the level of competition. This allowed me to include all Challenger data as well as ATP-level data, both of which are available on the ATP site. If an opponent was ranked inside the top 64, no scaling was done. If the opponent was ranked between 65 and 128, I scaled the numbers down by 1.5%. If the opponent was ranked between 129 and 192, I scaled them down another 1.5%. I scaled them another 1.5% for opponents ranked 193 to 256, and another 1.5% for those ranked outside the top 256.
For some matches, the opponent’s ranking is listed as N/A. In those cases, the scaling was done based on the player’s own ranking, which seemed to be close enough to the opponent’s actual ranking, except in a few instances.
Scaling this way may not be the best solution, but this is a solid starting point.
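As a rough sketch of the tiers described above, the function below applies one 1.5% step for each block of 64 ranking places beyond the top 64, capped at four steps. Two things here are assumptions for illustration: that the steps add up (1.5%, 3%, 4.5%, 6%) and that they are applied as a multiplier on the raw percentage. The names `rank_tier` and `scale_stat` are hypothetical.

```python
from typing import Optional

STEP = 0.015  # 1.5% per tier


def rank_tier(rank: int) -> int:
    """0 for ranks 1-64, 1 for 65-128, 2 for 129-192, 3 for 193-256, 4 beyond that."""
    return min(4, (rank - 1) // 64)


def scale_stat(raw_pct: float, opponent_rank: Optional[int], own_rank: int) -> float:
    """Discount a serve or return percentage by the opponent's ranking tier.

    If the opponent's ranking is missing (N/A), fall back to the player's own ranking.
    """
    rank = opponent_rank if opponent_rank is not None else own_rank
    # Assumption: tiers accumulate and act as a multiplier on the raw percentage.
    return raw_pct * (1.0 - STEP * rank_tier(rank))


if __name__ == "__main__":
    print(scale_stat(0.70, opponent_rank=30, own_rank=10))     # top-64 opponent: unchanged
    print(scale_stat(0.70, opponent_rank=None, own_rank=68))   # N/A: uses own ranking tier
    print(scale_stat(0.70, opponent_rank=300, own_rank=68))    # outside 256: four steps
```

The fallback to the player’s own ranking is exactly where the Challenger results can end up under-scaled, since a highly ranked player facing unranked opponents gets a smaller discount than the opposition deserves.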
I then found ‘the field’ by averaging the scaled percentages of all players in the tournament. Five players have not played on the hard courts yet this season, so I removed them when calculating the field. Also, rather than placing zeroes in the data for them, I substituted numbers slightly below the averages for both serving and receiving.
In future versions, I may substitute scaled, full-season statistics, irrespective of surface, for these players.
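A minimal sketch of that step follows: average the scaled percentages of players who have hard court data, then give the players without data a value slightly below that average. The player labels, the sample numbers, and the `shortfall` of 0.01 are all hypothetical placeholders, not the values actually used.

```python
from statistics import mean
from typing import Dict, Optional


def field_average(stats: Dict[str, Optional[float]]) -> float:
    """Mean of the scaled percentages, skipping players with no data (None)."""
    return mean(v for v in stats.values() if v is not None)


def fill_missing(stats: Dict[str, Optional[float]], shortfall: float = 0.01) -> Dict[str, float]:
    """Replace missing values with the field average minus a small, assumed shortfall."""
    avg = field_average(stats)
    return {name: (v if v is not None else avg - shortfall) for name, v in stats.items()}


if __name__ == "__main__":
    # Hypothetical scaled return percentages; None marks a player with no hard court matches.
    scaled_return = {"Player A": 0.40, "Player B": 0.36, "Player C": None}
    print(field_average(scaled_return))
    print(fill_missing(scaled_return))
```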
Noah Rubin
Then there was the case of Noah Rubin, who had played only one hard court match, where he put up some pretty good numbers despite losing last week in Winston-Salem. In this case, I decided to manually adjust his percentages down, closer to the values I entered by hand for the five players who had not played a single hard court match.
Shortcomings
Most of the problems come from too little data on some players. Some of this can be handled by using more stringent scaling for Challenger-level matches. The two most noticeable cases are Gilles Muller and Jared Donaldson. Muller won the Guadalajara Challenger and none of his opponents’ rankings are listed in the data, so those matches were scaled only according to his own No. 68 ranking. Donaldson also had a lot of Challenger results that were not scaled sufficiently.
Coding
Jeff Sackmann at tennisabstract.com published some Python code to run the Markov models a few years ago (here’s a link to his 2014 predictions, which you may like more than mine). He uses similar inputs and generates the probability that player A wins the match. I modified Jeff’s code for my purposes, then wrapped it within a Monte Carlo simulation and ran it 50,000 times.
I am not posting my entire code on GitHub just yet, but hope to soon. I need to refine my entire process, soup-to-nuts, before I feel comfortable with that.
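Until then, here is a minimal sketch of what a Monte Carlo wrapper around a match model can look like. It is not the code used for this post: `win_prob` stands in for whatever Markov match model supplies the probability that the first player beats the second, the draw size is assumed to be a power of two, and all names are hypothetical.

```python
import random
from collections import Counter
from typing import Callable, Dict, List, Optional


def simulate_tournament(draw: List[str],
                        win_prob: Callable[[str, str], float],
                        rng: random.Random) -> Dict[str, int]:
    """Play one single-elimination bracket.

    Returns the round index (0 = first round) in which each player lost;
    the champion is recorded with the index one past the final.
    """
    exits: Dict[str, int] = {}
    alive = list(draw)
    round_no = 0
    while len(alive) > 1:
        winners = []
        # Adjacent players in the list meet each other, as in a printed draw.
        for a, b in zip(alive[0::2], alive[1::2]):
            a_wins = rng.random() < win_prob(a, b)
            winner, loser = (a, b) if a_wins else (b, a)
            exits[loser] = round_no
            winners.append(winner)
        alive = winners
        round_no += 1
    exits[alive[0]] = round_no
    return exits


def monte_carlo(draw: List[str],
                win_prob: Callable[[str, str], float],
                trials: int,
                seed: Optional[int] = None) -> Dict[str, Counter]:
    """Repeat the bracket `trials` times and tally each player's exit round."""
    rng = random.Random(seed)
    tallies = {player: Counter() for player in draw}
    for _ in range(trials):
        for player, rnd in simulate_tournament(draw, win_prob, rng).items():
            tallies[player][rnd] += 1
    return tallies


if __name__ == "__main__":
    # Toy four-player draw with a coin-flip match model, purely for illustration.
    results = monte_carlo(["A", "B", "C", "D"], lambda a, b: 0.5, trials=1000, seed=42)
    for player, counts in results.items():
        print(player, dict(sorted(counts.items())))
```

Each player’s tallies sum to the number of trials, which is the same structure as the results table: one count per round of exit, plus a count of titles.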
THE RESULTS
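The table below shows how far each player advanced across the 50,000 trials. Each round column counts the trials in which the player was eliminated in that round, W counts the trials in which he won the tournament, and PCT is the share of trials won. For instance, Roger Federer lost in the first round in 1,802 of the 50,000 trials, but won the tournament 16,895 times.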
Federer seems to be the biggest winner here with Rafael Nadal out. I know this isn’t perfect, but it is a good start and something to work with moving forward. There are some basic assumptions I make and some data that needs refining, but overall I am satisfied with the outcome.
| PLAYER | R1 | R2 | R3 | R16 | Q | S | F | W | PCT |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Roger-Federer | 1802 | 1399 | 4752 | 5083 | 5099 | 8057 | 6913 | 16895 | 33.8% |
| Tomas-Berdych | 4984 | 1008 | 5294 | 6528 | 7710 | 11934 | 5119 | 7423 | 14.8% |
| Novak-Djokovic | 244 | 18702 | 4269 | 3604 | 5764 | 4425 | 5658 | 7334 | 14.7% |
| Andy-Murray | 795 | 7560 | 7183 | 7213 | 13748 | 5073 | 4877 | 3551 | 7.1% |
| Gilles-Muller | 5497 | 25948 | 3678 | 3013 | 3864 | 2726 | 2705 | 2569 | 5.1% |
| Milos-Raonic | 3273 | 16891 | 7391 | 7462 | 5420 | 5007 | 2704 | 1852 | 3.7% |
| Stan-Wawrinka | 10010 | 5230 | 11653 | 5295 | 7827 | 5558 | 2753 | 1674 | 3.3% |
| Kei-Nishikori | 9637 | 4005 | 2004 | 16234 | 7388 | 6449 | 2777 | 1506 | 3.0% |
| David-Ferrer | 6078 | 14349 | 3023 | 8772 | 9880 | 5189 | 1576 | 1133 | 2.3% |
| Blaz-Kavcic | 6213 | 9611 | 16077 | 5177 | 6755 | 3917 | 1524 | 726 | 1.5% |
| Marin-Cilic | 17736 | 6164 | 6662 | 8257 | 6488 | 3217 | 905 | 571 | 1.1% |
| Peter-Gojowczyk | 6533 | 24512 | 5850 | 5298 | 3446 | 2674 | 1152 | 535 | 1.1% |
| Jared-Donaldson | 21742 | 3033 | 10349 | 5732 | 6508 | 1532 | 677 | 427 | 0.9% |
| David-Goffin | 2882 | 3878 | 15412 | 13738 | 10691 | 2144 | 836 | 419 | 0.8% |
| Roberto-Bautista-Agut | 9661 | 6577 | 12280 | 16611 | 2253 | 1578 | 685 | 355 | 0.7% |
| Adrian-Mannarino | 9823 | 8095 | 13559 | 14600 | 1878 | 1270 | 528 | 247 | 0.5% |
| Paolo-Lorenzi | 4830 | 17542 | 13070 | 6225 | 6295 | 1285 | 515 | 238 | 0.5% |
| Simone-Bolelli | 12534 | 12440 | 6899 | 10541 | 4619 | 2052 | 681 | 234 | 0.5% |
| Facundo-Bagnis | 22299 | 5898 | 6488 | 11127 | 2308 | 1093 | 590 | 197 | 0.4% |
| Bernard-Tomic | 16933 | 18795 | 2488 | 5200 | 4245 | 1712 | 432 | 195 | 0.4% |
| Igor-Sijsling | 10360 | 6161 | 21214 | 6118 | 3278 | 2052 | 626 | 191 | 0.4% |
| Ernests-Gulbis | 10555 | 16235 | 7808 | 10532 | 2450 | 1743 | 487 | 190 | 0.4% |
| Gael-Monfils | 28258 | 2975 | 8844 | 4478 | 4152 | 860 | 292 | 141 | 0.3% |
| Dominic-Thiem | 13221 | 16989 | 7085 | 8985 | 1991 | 1272 | 317 | 140 | 0.3% |
| Ivo-Karlovic | 12126 | 5162 | 26714 | 3018 | 1564 | 938 | 342 | 136 | 0.3% |
| Jo-Wilfried-Tsonga | 20942 | 8099 | 8426 | 7785 | 3365 | 856 | 395 | 132 | 0.3% |
| Richard-Gasquet | 13310 | 18723 | 9403 | 4126 | 3460 | 664 | 222 | 92 | 0.2% |
| Yen-Hsun-Lu | 15007 | 14542 | 16524 | 1747 | 1306 | 530 | 254 | 90 | 0.2% |
| Benoit-Paire | 20522 | 10624 | 8481 | 6731 | 2643 | 642 | 277 | 80 | 0.2% |
| Grigor-Dimitrov | 15605 | 14089 | 10471 | 5488 | 3458 | 618 | 197 | 74 | 0.1% |
| Philipp-Kohlschreiber | 27701 | 5789 | 5829 | 8094 | 1545 | 654 | 317 | 71 | 0.1% |
| Marcos-Baghdatis | 32264 | 5563 | 4588 | 4262 | 2312 | 804 | 148 | 59 | 0.1% |
| John-Isner | 14229 | 12162 | 12604 | 8587 | 1547 | 599 | 217 | 55 | 0.1% |
| Kevin-Anderson | 3544 | 18025 | 16794 | 7219 | 3297 | 922 | 151 | 48 | 0.1% |
| Juan-Monaco | 29058 | 7290 | 6559 | 4912 | 1679 | 336 | 124 | 42 | 0.1% |
| Sam-Querrey | 2888 | 23684 | 19638 | 1877 | 1236 | 444 | 192 | 41 | 0.1% |
| Alexander-Kudryavtsev | 24959 | 5369 | 15596 | 2145 | 1205 | 570 | 117 | 39 | 0.1% |
| Bradley-Klahn | 21459 | 11049 | 12642 | 2322 | 1899 | 430 | 162 | 37 | 0.1% |
| Evgeny-Donskoy | 25041 | 5464 | 15495 | 2100 | 1199 | 560 | 116 | 25 | 0.1% |
| Steve-Johnson | 22309 | 10788 | 9442 | 5674 | 1130 | 539 | 93 | 25 | 0.1% |
| Dudi-Sela | 751 | 25740 | 14179 | 6185 | 2661 | 380 | 84 | 20 | 0.0% |
| Tommy-Robredo | 19699 | 17062 | 5206 | 5582 | 1758 | 570 | 106 | 17 | 0.0% |
| Radek-Stepanek | 11794 | 30753 | 3478 | 2102 | 1464 | 282 | 111 | 16 | 0.0% |
| Andreas-Beck | 2542 | 27796 | 11149 | 6347 | 1739 | 323 | 89 | 15 | 0.0% |
| Sergiy-Stakhovsky | 15749 | 11889 | 12577 | 7060 | 2052 | 550 | 109 | 14 | 0.0% |
| Julien-Benneteau | 29478 | 9134 | 6133 | 3817 | 1163 | 200 | 61 | 14 | 0.0% |
| Wayne-Odesnik | 40363 | 3301 | 1383 | 3746 | 849 | 295 | 54 | 9 | 0.0% |
| James-McGee | 10881 | 25214 | 7925 | 4574 | 1153 | 192 | 52 | 9 | 0.0% |
| Andrey-Kuznetsov | 28541 | 9826 | 9132 | 1393 | 891 | 159 | 49 | 9 | 0.0% |
| Jiri-Vesely | 39990 | 3818 | 3995 | 1132 | 811 | 202 | 44 | 8 | 0.0% |
| Tatsuma-Ito | 27691 | 9767 | 7701 | 3842 | 650 | 298 | 43 | 8 | 0.0% |
| Lleyton-Hewitt | 45016 | 1036 | 2024 | 1202 | 499 | 183 | 34 | 6 | 0.0% |
| Ivan-Dodig | 21405 | 16081 | 8031 | 3614 | 590 | 241 | 32 | 6 | 0.0% |
| Mikhail-Youzhny | 14304 | 19251 | 10826 | 4501 | 894 | 191 | 27 | 6 | 0.0% |
| Marco-Chiudinelli | 20849 | 21573 | 4124 | 2442 | 820 | 164 | 22 | 6 | 0.0% |
| Dustin-Brown | 33067 | 12283 | 1366 | 1966 | 1030 | 242 | 41 | 5 | 0.0% |
| Jan-Lennard-Struff | 21002 | 16351 | 8368 | 3712 | 434 | 111 | 17 | 5 | 0.0% |
| Blaz-Rola | 23408 | 15137 | 9150 | 1331 | 812 | 118 | 40 | 4 | 0.0% |
| Gilles-Simon | 10414 | 5720 | 27011 | 4809 | 1745 | 265 | 32 | 4 | 0.0% |
| Thomaz-Bellucci | 17702 | 25481 | 4806 | 1182 | 643 | 167 | 15 | 4 | 0.0% |
| Feliciano-Lopez | 28595 | 13364 | 5638 | 2021 | 284 | 85 | 9 | 4 | 0.0% |
| Jeremy-Chardy | 22487 | 19476 | 5697 | 1275 | 794 | 223 | 45 | 3 | 0.0% |
| Fernando-Verdasco | 26592 | 13988 | 7748 | 1047 | 529 | 77 | 17 | 2 | 0.0% |
| Jerzy-Janowicz | 25303 | 14188 | 7428 | 2276 | 656 | 132 | 15 | 2 | 0.0% |
| Ryan-Harrison | 34395 | 9438 | 4306 | 1399 | 417 | 34 | 9 | 2 | 0.0% |
| Dusan-Lajovic | 24697 | 14643 | 7500 | 2296 | 715 | 128 | 20 | 1 | 0.0% |
| Alejandro-Falla | 27513 | 16945 | 4151 | 827 | 449 | 97 | 17 | 1 | 0.0% |
| Edouard-Roger-Vasselin | 30301 | 13073 | 3280 | 2566 | 648 | 120 | 11 | 1 | 0.0% |
| Jack-Sock | 17867 | 26513 | 1780 | 3228 | 486 | 114 | 11 | 1 | 0.0% |
| Fabio-Fognini | 15873 | 23836 | 7021 | 2983 | 230 | 47 | 9 | 1 | 0.0% |
| Sam-Groth | 16829 | 31352 | 1087 | 534 | 148 | 41 | 8 | 1 | 0.0% |
| Guillermo-Garcia-Lopez | 34993 | 9023 | 5491 | 333 | 124 | 27 | 8 | 1 | 0.0% |
| Illya-Marchenko | 29151 | 16700 | 2530 | 1245 | 321 | 47 | 5 | 1 | 0.0% |
| Victor-Estrella-Burgos | 39640 | 4215 | 5357 | 598 | 145 | 39 | 5 | 1 | 0.0% |
| Kenny-De-Schepper | 39445 | 7622 | 1860 | 930 | 112 | 26 | 4 | 1 | 0.0% |
| Mikhail-Kukushkin | 28998 | 13406 | 5509 | 1915 | 134 | 34 | 3 | 1 | 0.0% |
| Andreas-Seppi | 34251 | 8382 | 5378 | 1699 | 261 | 27 | 1 | 1 | 0.0% |
| Matthias-Bachinger | 38206 | 11021 | 552 | 173 | 41 | 5 | 1 | 1 | 0.0% |
| Daniel-Gimeno-Traver | 9294 | 29670 | 6064 | 4411 | 431 | 97 | 33 | 0 | 0.0% |
| Vasek-Pospisil | 37466 | 7425 | 2746 | 1885 | 410 | 58 | 10 | 0 | 0.0% |
| Lukas-Rosol | 3411 | 36289 | 9252 | 834 | 174 | 33 | 7 | 0 | 0.0% |
| Lukas-Lacko | 36779 | 9154 | 2435 | 1390 | 181 | 57 | 4 | 0 | 0.0% |
| Paul-Henri-Mathieu | 44503 | 5116 | 256 | 70 | 41 | 10 | 4 | 0 | 0.0% |
| Tobias-Kamke | 9550 | 7288 | 27984 | 4629 | 464 | 82 | 3 | 0 | 0.0% |
| Marinko-Matosevic | 48198 | 762 | 571 | 318 | 113 | 35 | 3 | 0 | 0.0% |
| Tim-Smyczek | 20643 | 22118 | 5015 | 2062 | 125 | 34 | 3 | 0 | 0.0% |
| Marcos-Giron | 35771 | 8081 | 4583 | 1414 | 124 | 24 | 3 | 0 | 0.0% |
| Pere-Riba | 40177 | 4868 | 3526 | 1305 | 104 | 17 | 3 | 0 | 0.0% |
| Teymuraz-Gabashvili | 22253 | 21265 | 5983 | 390 | 91 | 15 | 3 | 0 | 0.0% |
| Andreas-Haider-Maurer | 40339 | 4504 | 3543 | 1481 | 108 | 23 | 2 | 0 | 0.0% |
| Filip-Krajinovic | 29357 | 16801 | 2890 | 900 | 40 | 10 | 2 | 0 | 0.0% |
| Donald-Young | 43787 | 3968 | 1813 | 305 | 106 | 20 | 1 | 0 | 0.0% |
| Denis-Istomin | 36690 | 9754 | 2577 | 701 | 260 | 17 | 1 | 0 | 0.0% |
| Jarkko-Nieminen | 37874 | 4004 | 7699 | 332 | 73 | 17 | 1 | 0 | 0.0% |
| Nicolas-Mahut | 32298 | 15471 | 1808 | 306 | 102 | 14 | 1 | 0 | 0.0% |
| Alejandro-Gonzalez | 24004 | 22641 | 2772 | 478 | 101 | 3 | 1 | 0 | 0.0% |
| Andrey-Golubev | 34127 | 13201 | 2166 | 478 | 25 | 2 | 1 | 0 | 0.0% |
| Yoshihito-Nishioka | 45170 | 3981 | 714 | 113 | 21 | 0 | 1 | 0 | 0.0% |
| Damir-Dzumhur | 43922 | 4573 | 607 | 655 | 215 | 28 | 0 | 0 | 0.0% |
| Benjamin-Becker | 43467 | 5726 | 551 | 194 | 54 | 8 | 0 | 0 | 0.0% |
| Pablo-Andujar | 32133 | 16181 | 760 | 850 | 70 | 6 | 0 | 0 | 0.0% |
| Marcel-Granollers | 12044 | 29804 | 7917 | 210 | 19 | 6 | 0 | 0 | 0.0% |
| Martin-Klizan | 12104 | 35990 | 1429 | 390 | 82 | 5 | 0 | 0 | 0.0% |
| Nick-Kyrgios | 35696 | 10478 | 3088 | 667 | 66 | 5 | 0 | 0 | 0.0% |
| Santiago-Giraldo | 27747 | 17902 | 4055 | 243 | 48 | 5 | 0 | 0 | 0.0% |
| Frank-Dancevic | 17016 | 28639 | 3479 | 755 | 108 | 3 | 0 | 0 | 0.0% |
| Taro-Daniel | 46727 | 2871 | 309 | 78 | 13 | 2 | 0 | 0 | 0.0% |
| Dmitry-Tursunov | 25996 | 21351 | 2271 | 317 | 64 | 1 | 0 | 0 | 0.0% |
| Aleksandr-Nedovyesov | 39119 | 9397 | 1234 | 239 | 10 | 1 | 0 | 0 | 0.0% |
| Niels-Desein | 47118 | 1643 | 1091 | 139 | 8 | 1 | 0 | 0 | 0.0% |
| Radu-Albot | 39586 | 4073 | 6027 | 279 | 35 | 0 | 0 | 0 | 0.0% |
| Leonardo-Mayer | 9476 | 29457 | 10544 | 512 | 11 | 0 | 0 | 0 | 0.0% |
| Albert-Ramos-Vinolas | 33171 | 16487 | 255 | 76 | 11 | 0 | 0 | 0 | 0.0% |
| Federico-Delbonis | 23729 | 20814 | 5281 | 167 | 9 | 0 | 0 | 0 | 0.0% |
| Noah-Rubin | 26271 | 19393 | 4197 | 131 | 8 | 0 | 0 | 0 | 0.0% |
| Matthew-Ebden | 40450 | 4523 | 4807 | 213 | 7 | 0 | 0 | 0 | 0.0% |
| Joao-Sousa | 32984 | 15840 | 1045 | 125 | 6 | 0 | 0 | 0 | 0.0% |
| Michael-Llodra | 40706 | 8643 | 555 | 93 | 3 | 0 | 0 | 0 | 0.0% |
| Robin-Haase | 49205 | 666 | 115 | 11 | 3 | 0 | 0 | 0 | 0.0% |
| Pablo-Cuevas | 46456 | 3144 | 374 | 24 | 2 | 0 | 0 | 0 | 0.0% |
| Steve-Darcis | 37896 | 11966 | 124 | 14 | 0 | 0 | 0 | 0 | 0.0% |
| Jurgen-Melzer | 37956 | 11030 | 1005 | 9 | 0 | 0 | 0 | 0 | 0.0% |
| Albert-Montanes | 40524 | 8732 | 738 | 6 | 0 | 0 | 0 | 0 | 0.0% |
| Pablo-Carreno-Busta | 47458 | 2446 | 93 | 3 | 0 | 0 | 0 | 0 | 0.0% |
| Diego-Schwartzman | 49756 | 234 | 8 | 2 | 0 | 0 | 0 | 0 | 0.0% |
| Maximo-Gonzalez | 47112 | 2751 | 136 | 1 | 0 | 0 | 0 | 0 | 0.0% |
| Carlos-Berlocq | 49249 | 733 | 17 | 1 | 0 | 0 | 0 | 0 | 0.0% |
| Borna-Coric | 46589 | 3335 | 76 | 0 | 0 | 0 | 0 | 0 | 0.0% |
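Because each row records the round in which a player exited, the counts can be turned into the probability of reaching any given round by summing from the right. The sketch below does this for Roger Federer’s row from the table; the function name `reach_probabilities` is just for illustration.

```python
from typing import Dict, List

ROUNDS = ["R1", "R2", "R3", "R16", "Q", "S", "F", "W"]


def reach_probabilities(exit_counts: List[int]) -> Dict[str, float]:
    """Probability of reaching each round (for 'W', of winning the title).

    exit_counts holds the number of trials ending in each round, in ROUNDS order.
    """
    total = sum(exit_counts)
    remaining = total  # trials in which the player is still alive at this round
    probs = {}
    for name, count in zip(ROUNDS, exit_counts):
        probs[name] = remaining / total
        remaining -= count
    return probs


if __name__ == "__main__":
    federer = [1802, 1399, 4752, 5083, 5099, 8057, 6913, 16895]
    for name, p in reach_probabilities(federer).items():
        print(f"{name:>3}: {p:.1%}")
```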