Spatial Data Analysis in R

In this tutorial you will learn how to: define geometries (points, lines, polygons); plot those geometries; execute spatial joins (which points are contained in a polygon?); get the distance between a set of points; and do all of the above within the context of geospatial data (e.g. cities, roads, counties). Important: This tutorial is based on sf version 0.5-3 and ggplot2 version 2.2.1.900. The Basics: To get started, we need to learn how to define and operate on abstract geometries without a coordinate reference system (CRS).
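As a rough illustration of two of the operations listed in this excerpt (testing which points fall inside a polygon, and measuring distances between points), here is a minimal sketch in plain Python. It is not the sf API the tutorial is built on; the helper names `point_in_polygon` and `pairwise_distances` and the sample coordinates are assumptions made up for this example, and the geometries are abstract (no CRS).

```python
import math


def point_in_polygon(point, polygon):
    """Ray-casting test for a point against a polygon given as (x, y) vertices.

    Points exactly on the boundary are not handled specially.
    """
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through the point matter.
        if (y1 > y) != (y2 > y):
            # x-coordinate where this edge crosses that horizontal line.
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def pairwise_distances(points):
    """Euclidean distance matrix for a list of (x, y) points."""
    return [[math.dist(p, q) for q in points] for p in points]


# Hypothetical sample data: a 4x4 square and three points.
square = [(0, 0), (4, 0), (4, 4), (0, 4)]
points = [(1, 1), (5, 2), (3, 3)]

# "Spatial join" in miniature: keep the points contained in the polygon.
contained = [p for p in points if point_in_polygon(p, square)]
distances = pairwise_distances(points)

print(contained)
print(distances[0][2])
```

With real geospatial data the distance calculation also depends on the coordinate reference system, which this sketch deliberately ignores.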

Gradient Boosting Explained

If linear regression were a Toyota Camry, then gradient boosting would be a UH-60 Blackhawk Helicopter. A particular implementation of gradient boosting, XGBoost, is consistently used to win machine learning competitions on Kaggle. Unfortunately, many practitioners (including my former self) use it as a black box. It’s also been butchered to death by a host of drive-by data scientists’ blogs. As such, the purpose of this article is to lay the groundwork for classical gradient boosting, intuitively and comprehensively.

Guide to Model Stacking (i.e. Meta Ensembling)

Introduction: Stacking (also called meta ensembling) is a model ensembling technique used to combine information from multiple predictive models to generate a new model. Oftentimes the stacked model (also called the 2nd-level model) will outperform each of the individual models due to its smoothing nature and its ability to highlight each base model where it performs best and discredit each base model where it performs poorly. For this reason, stacking is most effective when the base models are significantly different.

Convert More Sales Leads With Machine Learning

The Problem: You sell software that helps stores manage their inventory. You collect leads on thousands of potential customers, and your strategy is to cold-call them and pitch your product. You can only make 100 phone calls per day, so you want to identify leads with a high probability of converting to a sale. By calling leads randomly, you only generate about two sales per day - a 2% hit ratio.
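To make the arithmetic in this excerpt concrete, here is a small Python sketch comparing random calling with calling the highest-scored leads first. The 100 calls per day and the 2% baseline come from the excerpt; the `expected_sales` helper and the lead scores are hypothetical, invented only so that the average conversion rate across all leads is 2%. They are not results from the article.

```python
def expected_sales(probabilities, calls_per_day=100):
    """Expected sales when calling the highest-probability leads first."""
    ranked = sorted(probabilities, reverse=True)
    return sum(ranked[:calls_per_day])


calls_per_day = 100

# Baseline from the excerpt: random calls convert at about 2%.
baseline = calls_per_day * 0.02

# Hypothetical model scores (assumption): 100 promising leads at 11% and
# 900 weak leads at 1%, which averages out to the same 2% overall.
scores = [0.11] * 100 + [0.01] * 900

targeted = expected_sales(scores, calls_per_day)

print(baseline)  # about 2 expected sales per day
print(targeted)  # about 11 expected sales per day under these made-up scores
```

The gain depends entirely on how well the scores separate good leads from bad ones; if every lead had the same score, ranking would do no better than calling at random.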

Random Forest From Top To Bottom

In three months (as of June 2016) the New Orleans Saints will play a football game against the Atlanta Falcons. I want to know who will win. I ask my friend and he says the Saints. Technically this is a predictive model, but it’s probably not worth much. I can improve upon this model by asking other people who they think will win. Someone might pick the Saints because we have a better quarterback.

Introduction to Naive Bayes

I think there’s a rule somewhere that says “You can’t call yourself a data scientist until you’ve used a Naive Bayes classifier”. It’s extremely useful, yet beautifully simplistic. This article is my attempt at laying the groundwork for Naive Bayes in a practical and intuitive fashion. Motivating Problem: Let’s start with a problem to motivate our formulation of Naive Bayes. (Feel free to follow along using the Python script or R script found here.)