
Random Forest From Top To Bottom

In three months (as of June 2016) the New Orleans Saints will play a football game against the Atlanta Falcons. I want to know who will win. I ask my friend and he says the Saints. Technically this is a predictive model, but it’s probably not worth much. I can improve upon this model by asking other people who they think will win. Someone might pick the Saints because we have a better quarterback.

Introduction to Naive Bayes

I think there’s a rule somewhere that says “You can’t call yourself a data scientist until you’ve used a Naive Bayes classifier”. It’s extremely useful, yet beautifully simple. This article is my attempt at laying the groundwork for Naive Bayes in a practical and intuitive fashion. Motivating Problem: Let’s start with a problem to motivate our formulation of Naive Bayes. (Feel free to follow along using the Python script or R script found here.

Quick Guide to Regex in Python

The purpose of this guide is to bridge the gap between understanding what regular expressions are and how to use them in Python. If you’re brand new to regular expressions, I highly recommend checking out RegexOne. For this guide, we’ll use Python’s re module, which makes using regular expressions a breeze. Setup:
import re  # import the re module
sentence = "We bought our Golden Retriever, Snuggles, for $30 on 1/1/2015 at 1017 Main St.

Quick Guide to Regex in R

The purpose of this guide is to bridge the gap between understanding what regular expressions are and how to use them in R. If you’re brand new to regular expressions, I highly recommend checking out RegexOne. Hadley Wickham’s stringr package makes using regular expressions in R a breeze. I use it to avoid the complexity of base R’s regex functions grep, grepl, regexpr, gregexpr, sub and gsub, where even the function names are cryptic.

How to Calculate Customer Retention and Churn

Here’s a practical guide for calculating customer retention and churn from transaction data. Preface: The general idea of customer retention is self-explanatory: it’s a measure of how well a business retains its customers. Unfortunately, the specifics of how to calculate a retention metric are not so clear. Likewise, customer churn is the complement of retention: it’s a measure of how many customers end their relationship with a business - i.

Logistic Regression Fundamentals

Logistic regression is a generalized linear model most commonly used for classifying binary data. Its output is a continuous value between 0 and 1 (commonly representing the probability of some event occurring), and its input can be a multitude of real-valued and discrete predictors. Motivating Problem: Suppose you want to predict the probability someone is a homeowner based solely on their age. You might have a dataset like