
Introduction to Naive Bayes

I think there’s a rule somewhere that says “You can’t call yourself a data scientist until you’ve used a Naive Bayes classifier”. It’s extremely useful, yet beautifully simple. This article is my attempt at laying the groundwork for Naive Bayes in a practical and intuitive fashion. Motivating Problem: Let’s start with a problem to motivate our formulation of Naive Bayes. (Feel free to follow along using the Python script or R script found here.

Quick Guide to Regex in Python

The purpose of this guide is to bridge the gap between understanding what regular expressions are and knowing how to use them in Python. If you’re brand new to regular expressions, I highly recommend checking out RegexOne. For this guide, we’ll use Python’s re module, which makes using regular expressions a breeze. Setup: import re # import the re module sentence = "We bought our Golden Retriever, Snuggles, for $30 on 1/1/2015 at 1017 Main St.
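
As a rough illustration of what the re module offers, here is a minimal sketch using three of its standard functions (re.findall, re.search and re.sub). The text string is a hypothetical example made up for this sketch, not the sentence from the guide.

```python
import re

# Hypothetical example text (an assumption, not the guide's own sentence)
text = "Order 66 shipped on 3/14/2020 for $25"

# re.findall returns every non-overlapping match as a list of strings
numbers = re.findall(r"\d+", text)

# re.search returns a match object for the first match, or None if there is none
date = re.search(r"\d{1,2}/\d{1,2}/\d{4}", text)

# re.sub replaces every match of the pattern with the replacement string
masked = re.sub(r"\$\d+", "$??", text)

print(numbers)       # ['66', '3', '14', '2020', '25']
print(date.group())  # 3/14/2020
print(masked)        # Order 66 shipped on 3/14/2020 for $??
```

Raw strings (the r"..." prefix) keep backslashes such as \d from being interpreted by Python before the regex engine sees them.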

Quick Guide to Regex in R

The purpose of this guide is to bridge the gap between understanding what regular expressions are and knowing how to use them in R. If you’re brand new to regular expressions, I highly recommend checking out RegexOne. Hadley Wickham’s stringr package makes using regular expressions in R a breeze. I use it to avoid the complexity of base R’s regex functions grep, grepl, regexpr, gregexpr, sub and gsub, where even the function names are cryptic.

How to Calculate Customer Retention and Churn

Here’s a practical guide for calculating customer retention and churn from transaction data. Preface: The general idea of customer retention is self-explanatory: it’s a measure of how well a business retains its customers. Unfortunately, the specifics of how to calculate a retention metric are not so clear. Customer churn is the complement of retention: it’s a measure of how many customers end their relationship with a business - i.
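
To make the complement relationship concrete, here is a minimal sketch under one assumed definition: retention is the share of customers who transacted in one month and transacted again in the next. The guide itself may define the metric differently. The transaction log and the active_customers helper are hypothetical and exist only for illustration.

```python
from datetime import date

# Hypothetical transaction log: (customer_id, transaction_date)
transactions = [
    ("a", date(2020, 1, 5)),
    ("b", date(2020, 1, 20)),
    ("c", date(2020, 1, 25)),
    ("a", date(2020, 2, 3)),
    ("c", date(2020, 2, 14)),
    ("d", date(2020, 2, 20)),
]

def active_customers(transactions, year, month):
    """Return the set of customers with at least one transaction in the given month."""
    return {cid for cid, d in transactions if d.year == year and d.month == month}

jan = active_customers(transactions, 2020, 1)  # {'a', 'b', 'c'}
feb = active_customers(transactions, 2020, 2)  # {'a', 'c', 'd'}

# Assumed definition: share of January customers who returned in February
retention = len(jan & feb) / len(jan)

# Churn is the complement of retention
churn = 1 - retention

print(f"retention = {retention:.3f}, churn = {churn:.3f}")
```

Here customers a and c return in February while b does not, so two of the three January customers are retained. Customer d is new in February and does not affect January's retention.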

Logistic Regression Fundamentals

Logistic regression is a generalized linear model most commonly used for classifying binary data. Its output is a continuous value between 0 and 1 (commonly representing the probability of some event occurring), and its input can be a multitude of real-valued and discrete predictors. Motivating Problem: Suppose you want to predict the probability someone is a homeowner based solely on their age. You might have a dataset like
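
The reason the output stays between 0 and 1 is the logistic (sigmoid) function, which squashes any real-valued linear combination of predictors into that range. Below is a minimal sketch for the homeowner example. The intercept and slope values are hypothetical numbers chosen for illustration, not coefficients fitted to any data.

```python
import math

def sigmoid(z: float) -> float:
    """Logistic function: maps any real number to a value strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical coefficients (assumptions for illustration, not fitted values)
intercept = -4.0
slope = 0.1

def prob_homeowner(age: float) -> float:
    """Modeled probability of being a homeowner at the given age."""
    return sigmoid(intercept + slope * age)

for age in (20, 40, 60):
    print(age, round(prob_homeowner(age), 3))
```

With these made-up coefficients the linear term is zero at age 40, so the modeled probability there is exactly 0.5, and it rises with age because the slope is positive.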

Decision Trees in R using rpart

R’s rpart package provides a powerful framework for growing classification and regression trees. To see how it works, let’s get started with a minimal example. Motivating Problem: First let’s define a problem. There’s a common scam amongst motorists whereby a person will slam on his brakes in heavy traffic with the intention of being rear-ended. The person will then file an insurance claim for personal injury and damage to his vehicle, alleging that the other driver was at fault.