In August 2016, Bosch challenged data scientists to predict internal failures of components along its assembly lines using thousands of measurements and tests for each component. The competition was hosted by Kaggle, ran for three months, and drew 1,373 teams.
I joined forces with three other Kagglers, and together we placed 14th. As the lowest-ranked Kaggler in the group, I was surprised that the team quickly promoted me to team leader and encouraged me to take on all model ensembling duties to build our best overall model.
Unfortunately, this competition was plagued by what I call a “soft” data leak: the training dataset revealed information about the holdout set that would not be available in a real-world setting. The origin of the leak was unclear, and it only gave hints about which parts might fail. It did not guarantee that those parts would fail, which is what I’d call a “hard” leak. The best models for this competition were the ones that took significant advantage of this leak.
The size of the data also created some major headaches: it was Kaggle’s biggest competition dataset to date. While my teammates leveraged large AWS clusters, I did some clever data manipulation on my 16 GB laptop to reduce the overall size of the data and still build an effective gradient boosting model using XGBoost.
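To give a flavor of what shrinking a dataset on a laptop can look like, here is a minimal sketch of one common approach. It is not the exact manipulation I used, and it rests on assumptions: the data is a CSV with many mostly-empty numeric columns, and the `load_compact` helper, the `min_fill` threshold, and the sample column names are all made up for illustration. It streams the file twice, drops columns that are rarely filled, and stores the rest as 4-byte floats instead of Python objects.

```python
import csv
import io
from array import array


def load_compact(csv_file, min_fill=0.05):
    """Two-pass load: drop mostly-empty columns, keep the rest as float32.

    `csv_file` must be a seekable text file object. `min_fill` is a
    hypothetical threshold: the minimum share of non-empty cells a column
    needs in order to be kept.
    """
    reader = csv.reader(csv_file)
    header = next(reader)

    # Pass 1: count non-empty cells per column without holding rows in memory.
    rows = 0
    filled = [0] * len(header)
    for row in reader:
        rows += 1
        for i, cell in enumerate(row):
            if cell != "":
                filled[i] += 1
    keep = [i for i, n in enumerate(filled) if rows and n / rows >= min_fill]

    # Pass 2: re-read the file, storing only the kept columns as 4-byte floats.
    csv_file.seek(0)
    reader = csv.reader(csv_file)
    next(reader)  # skip the header
    columns = {header[i]: array("f") for i in keep}
    for row in reader:
        for i in keep:
            cell = row[i]
            # Missing values become NaN so every column has the same length.
            columns[header[i]].append(float(cell) if cell else float("nan"))
    return columns


# Tiny made-up sample: column f2 is always empty, f3 is filled in 1 of 4 rows.
sample = io.StringIO("Id,f1,f2,f3\n1,0.5,,\n2,0.25,,1.5\n3,,,\n4,0.75,,\n")
cols = load_compact(sample, min_fill=0.25)
print(sorted(cols))
```

Columns stored this way can be handed to a gradient boosting library after conversion to whatever matrix type it expects. The trade-off is reduced precision and the loss of any signal that lived in the dropped columns, so the threshold is worth tuning against validation results.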