In November 2015, Airbnb challenged data scientists to predict in what country a new user would make his or her first booking. The data consisted of new user attributes like age, gender, signup method, and browser along with some user session data. The competition was hosted by Kaggle and included 1,462 teams.
I ended up placing 20th out of 1,462 teams (top 2%). Three tricks helped me land near the top.
This challenge suffered from a common problem known as class imbalance: most people didn’t book a trip, a large number of people booked their first trip to the U.S., and a small number of people booked their first trip to one of the 9 other locations. Instead of building a single classifier, I built one for each of these scenarios.
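One way to read "one classifier per scenario" is as a cascade: a first model estimates whether the user books at all, a second estimates whether a booking goes to the U.S., and a third splits the remaining bookings among the other destinations. The sketch below shows only how the outputs of such stages could be multiplied into a single ranked list. The cascade structure, the function names, and the `no_booking` label are assumptions made for illustration, not details taken from the competition solution.

```python
def combine_stage_probabilities(p_booked, p_us_given_booked, p_dest_given_non_us):
    """Merge three staged model outputs into one distribution (hypothetical helper).

    p_booked:            P(user books at all)
    p_us_given_booked:   P(destination is the U.S. | user books)
    p_dest_given_non_us: dict of P(destination | user books outside the U.S.),
                         expected to sum to 1
    """
    probs = {
        "no_booking": 1.0 - p_booked,       # assumed label for "did not book"
        "US": p_booked * p_us_given_booked,
    }
    p_non_us = p_booked * (1.0 - p_us_given_booked)
    for dest, p in p_dest_given_non_us.items():
        probs[dest] = p_non_us * p
    return probs


def top_k(probs, k=3):
    """Return the k most likely outcomes, most likely first."""
    return sorted(probs, key=probs.get, reverse=True)[:k]


# Made-up stage outputs for a single user.
example = combine_stage_probabilities(
    p_booked=0.4,
    p_us_given_booked=0.7,
    p_dest_given_non_us={"FR": 0.5, "IT": 0.3, "GB": 0.2},
)
print(top_k(example))  # ['no_booking', 'US', 'FR']
```

Because each stage sees a more balanced slice of the data than a single 12-way model would, the rare destinations are not drowned out by the dominant classes during training.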
I used cross-validation to tune the hyperparameters of my model, but I specified my objective function as weighted ROC AUC. This resulted in deep trees that generated better probability estimates for each target class. That was important for this competition because the evaluation metric took into account how good your second and third destination guesses were for each user.
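For readers unfamiliar with the metric, here is a minimal sketch of one common definition of weighted ROC AUC for a multi-class problem: compute a one-vs-rest AUC per class, then average with weights proportional to class frequency. The exact weighting used in the competition solution is not stated, so treat this definition and the helper names `binary_auc` and `weighted_ovr_auc` as assumptions.

```python
from collections import Counter


def binary_auc(labels, scores):
    """AUC via the rank-sum (Mann-Whitney U) formulation; ties share an average rank.

    Requires at least one positive (1) and one negative (0) label.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        # Extend j across a run of tied scores.
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2.0 + 1.0  # 1-based average rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    rank_sum = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (rank_sum - n_pos * (n_pos + 1) / 2.0) / (n_pos * n_neg)


def weighted_ovr_auc(y_true, proba, classes):
    """One-vs-rest AUC per class, averaged with class-frequency weights.

    proba[i][c] is the predicted probability of classes[c] for sample i.
    Every class must appear in y_true, and no class may cover all samples.
    """
    counts = Counter(y_true)
    total = 0.0
    for c_idx, c in enumerate(classes):
        labels = [1 if y == c else 0 for y in y_true]
        scores = [row[c_idx] for row in proba]
        total += counts[c] * binary_auc(labels, scores)
    return total / len(y_true)


# Small binary check: two of the four positive/negative pairs... three of four are ordered correctly.
print(binary_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```

In a tuning loop, a score like this would be computed on each held-out fold and averaged, and the hyperparameter setting with the highest average would be kept. Since AUC rewards the ordering of predicted probabilities rather than only the top guess, it fits a competition that scores ranked predictions.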
Airbnb’s user destination frequencies have changed a lot since 2010, so I threw out all the data from before 2014. This helped my models train more quickly and gave me a nice boost on the leaderboard.
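The date cutoff is simple to apply. The sketch below assumes each user record is a dict carrying an ISO-formatted signup date under a field named `date_account_created`. That field name, the helper `keep_recent`, and the sample records are illustrative assumptions.

```python
from datetime import date

CUTOFF = date(2014, 1, 1)  # keep records dated on or after this day


def keep_recent(rows, cutoff=CUTOFF, date_field="date_account_created"):
    """Return only the rows whose date_field (ISO 'YYYY-MM-DD') is on or after cutoff."""
    return [r for r in rows if date.fromisoformat(r[date_field]) >= cutoff]


# Made-up records for illustration.
users = [
    {"id": "u1", "date_account_created": "2010-06-28"},
    {"id": "u2", "date_account_created": "2013-12-31"},
    {"id": "u3", "date_account_created": "2014-01-01"},
    {"id": "u4", "date_account_created": "2014-05-17"},
]
recent = keep_recent(users)
print([u["id"] for u in recent])  # ['u3', 'u4']
```

Dropping older rows trades training-set size for a class distribution closer to the one the test users were drawn from, which is the drift described above.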