# Convert More Sales Leads With Machine Learning

## The Problem

You sell software that helps stores manage their inventory. You collect leads on thousands of potential customers, and your strategy is to cold-call them and pitch your product. You can only make 100 phone calls per day, so you want to identify leads with a high probability of converting to a sale. By calling leads randomly, you only generate about two sales per day, a 2% *hit ratio*. If you could be smarter about who you target, you could increase your sales with no extra resources…

## The Solution

Machine Learning. You tracked important data on your leads: attributes like Facebook likes, phone number, type of business, etc. Now you can use machine learning techniques to build a model to help you separate the strong leads from the weak ones. In this article I’ll present an example dataset with example solutions to illustrate how this works. If you’d like to download the data and solutions, you can find them in the [Rank Sales Leads](https://github.com/ben519/MLPB/tree/master/Problems/Rank%20Sales%20Leads) directory of the [Machine Learning Problem Bible](https://github.com/ben519/MLPB) on GitHub.

## The Data

### train

| LeadID | CompanyName | TypeOfBusiness | FacebookLikes | TwitterFollowers | Website | PhoneNumber | Contact | Sale |
|---|---|---|---|---|---|---|---|---|
| 1 | Drugs R Us | drug store | NA | NA | NA | 5042538336 | general line | FALSE |
| 4 | Joe's Hobby Shop | hobbies & toys | NA | NA | NA | 5043304798 | other | FALSE |
| 5 | The Corner | corner store | 26 | NA | thecornerstore.com | 6461045953 | owner | TRUE |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 26 | Juicy Burger | restaurant | 112 | 20 | NA | 3103833242 | other | FALSE |
| 27 | Comic Corner | books | 89 | 67 | comiccorner.com | 5044820790 | manager | TRUE |
| 28 | Smokes and Beer | corner store | NA | NA | NA | 5046995720 | other | FALSE |

### test

| LeadID | CompanyName | TypeOfBusiness | FacebookLikes | TwitterFollowers | Website | PhoneNumber | Contact | Sale |
|---|---|---|---|---|---|---|---|---|
| 2 | The Law Offices of Smith | law office | 87 | 46 | smithlaw.net | 3109859670 | other | FALSE |
| 3 | Bait n Tackle | NA | 22 | NA | NA | 6464315297 | owner | TRUE |
| 6 | Smokie's Bar & Grill | restaurant | 215 | NA | NA | 6461572195 | general line | FALSE |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 22 | Tools & Supplies | NA | NA | NA | NA | 3101508786 | owner | FALSE |
| 29 | Reading Rainbow | books | 6 | NA | readingrainbowstore.us | 5049169286 | owner | FALSE |
| 30 | Katz Claims | law office | 144 | 36 | katzclaims.com | 5045337861 | owner | FALSE |

## The Objective

This is important! The simplest and most common objective function for most binary classification algorithms is accuracy rate. This is not what we want. For our sample training dataset, just 35% of leads convert to a sale, and in practice this number can be **much** lower (less than 1%). With so few positives, the most *accurate* model will often be the one that *predicts every lead will not become a sale*.

Instead, we’re interested in accurately **ranking** the likelihood that each lead becomes a sale. Then we can pick out the best leads to pursue with confidence that our overall conversion rate will be higher than pursuing a set of random leads. A very easy and common objective function for this is [Area Under the ROC Curve](https://en.wikipedia.org/wiki/Receiver_operating_characteristic#Area_under_the_curve). There’s a lot of information about it, so I encourage you to [Google it](https://www.google.com/search?q=area+under+the+roc+curve).

### In short...

Suppose we randomly sort our leads and predict that every lead converts to a sale. Then we can measure the cumulative *True Positive Rate* (portion of all sales captured up to position i) and the cumulative *False Positive Rate* (portion of all non-sales captured up to position i). Now we can plot the cumulative True Positive Rate vs the cumulative False Positive Rate which, in theory, will generate a $y = x$ line from 0 to 1. So, for a point (x, y) like (0.5, 0.5), the interpretation is that we have to pursue 50% of the total non-converting leads in order to catch 50% of the converting leads.

**Randomly Sorted Test Data**

| LeadID | Sale | Prediction | FalsePositiveRate | TruePositiveRate |
|---|---|---|---|---|
| 12 | FALSE | TRUE | 0.125 | 0.0 |
| 13 | TRUE | TRUE | 0.125 | 0.5 |
| 6 | FALSE | TRUE | 0.250 | 0.5 |
| 22 | FALSE | TRUE | 0.375 | 0.5 |
| 19 | FALSE | TRUE | 0.500 | 0.5 |
| 11 | FALSE | TRUE | 0.625 | 0.5 |
| 30 | FALSE | TRUE | 0.750 | 0.5 |
| 29 | FALSE | TRUE | 0.875 | 0.5 |
| 3 | TRUE | TRUE | 0.875 | 1.0 |
| 2 | FALSE | TRUE | 1.000 | 1.0 |
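The two rate columns in this table can be recomputed from the `Sale` column alone. Here is a minimal sketch in plain Python; the function name `cumulative_rates` is made up for illustration and is not part of the linked implementations. It assumes the list contains at least one sale and one non-sale.

```python
def cumulative_rates(sales):
    """Return (false_positive_rate, true_positive_rate) after each position.

    `sales` is a list of booleans in the order the leads would be called.
    Assumes at least one True and one False (otherwise a rate is undefined).
    """
    total_pos = sum(sales)               # number of leads that became sales
    total_neg = len(sales) - total_pos   # number of leads that did not
    seen_pos = seen_neg = 0
    rates = []
    for sale in sales:
        if sale:
            seen_pos += 1
        else:
            seen_neg += 1
        rates.append((seen_neg / total_neg, seen_pos / total_pos))
    return rates


# Sale column of the randomly sorted test data
# (LeadIDs 12, 13, 6, 22, 19, 11, 30, 29, 3, 2)
random_order = [False, True, False, False, False,
                False, False, False, True, False]

for fpr, tpr in cumulative_rates(random_order):
    print(f"{fpr:.3f}\t{tpr:.1f}")
```

Each printed row corresponds to one row of the table: the first value is the share of non-sales called so far, the second is the share of sales found so far.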

Now suppose we build a model that predicts the probability each lead will convert. We sort the data from the strongest lead to the weakest lead and remeasure the cumulative True Positive Rate and False Positive Rate. Plotting this over our previous example yields a curve through the points below.

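**Model Sorted Test Data**

| LeadID | Sale | Prediction | FalsePositiveRate | TruePositiveRate |
|---|---|---|---|---|
| 19 | FALSE | TRUE | 0.125 | 0.0 |
| 3 | TRUE | TRUE | 0.125 | 0.5 |
| 6 | FALSE | TRUE | 0.250 | 0.5 |
| 13 | TRUE | TRUE | 0.250 | 1.0 |
| 30 | FALSE | TRUE | 0.375 | 1.0 |
| 29 | FALSE | TRUE | 0.500 | 1.0 |
| 2 | FALSE | TRUE | 0.625 | 1.0 |
| 11 | FALSE | TRUE | 0.750 | 1.0 |
| 22 | FALSE | TRUE | 0.875 | 1.0 |
| 12 | FALSE | TRUE | 1.000 | 1.0 |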
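To put a number on how much better the model ordering is, the area under each curve can be computed directly from the `Sale` column of the two tables. The sketch below is an illustration in plain Python, not the code from the linked solutions, and the name `auc_from_order` is hypothetical. It treats the curve as a step function, which is exact when no two leads share the same score, and it assumes at least one sale and one non-sale.

```python
def auc_from_order(sales):
    """Area under the ROC curve traced by calling leads in the given order.

    `sales` is a list of booleans ordered from strongest to weakest lead.
    Assumes no tied scores and at least one True and one False.
    """
    total_pos = sum(sales)
    total_neg = len(sales) - total_pos
    seen_pos = 0
    area = 0.0
    for sale in sales:
        if sale:
            seen_pos += 1   # curve moves up
        else:
            # curve moves right by 1/total_neg at the current height
            area += (seen_pos / total_pos) / total_neg
    return area


# Sale columns copied from the two tables above
random_order = [False, True, False, False, False,
                False, False, False, True, False]
model_order = [False, True, False, True, False,
               False, False, False, False, False]

print(auc_from_order(random_order))  # 0.5
print(auc_from_order(model_order))   # 0.8125
```

The same number can be read as the fraction of (sale, non-sale) pairs in which the sale is ranked ahead of the non-sale: 13 of 16 pairs for the model ordering.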

For reference, the point (0.25, 1.0) in the model-sorted data means that our model exhausted 25% of the bad leads in order to identify 100% of the good leads. This curve is called the Receiver Operating Characteristic (ROC) curve. Measuring the area under the ROC curve gives an indication of how well the leads are ranked. A perfect model will have AUC ROC = 1, and a random-guess model should have AUC ROC = 0.5.

One last bit about our objective: it doesn’t matter how well we rank the weakest leads because we only have the resources to call the top leads. For this reason, partial AUC ROC would be an even better objective function, but since it’s a little more complex we’ll stick to maximizing AUC ROC.

## The Models

I built three models to solve this problem. Each model has some benefits and drawbacks which I’ll discuss, but I’m not going to display their results. This would be counterproductive for a few reasons:

1. There’s not enough data to make confident model comparisons
2. I made up the data
3. I didn’t spend a lot of time tuning each model’s hyperparameters

Instead, think of this as a starter kit/guide to solving your own problem.

### Logistic Regression

This is probably the purest model of the three, with the strongest statistical foundation. It can work really well, but it’ll quickly fall apart when important assumptions are violated (as they often are when working with real-world data).

#### Pros

- Can work extremely well (if the data is very clean and satisfies a bunch of nice properties)
- Fast
- Simple. You could build it in an Excel spreadsheet

#### Cons

- Requires imputation for missing values
- Poorly handles non-monotonic relationships and spikes (e.g. imputing -1 for NA values where all other values are $\geq 0$)
- Requires one-hot encoding for unordered categorical features
- Can be difficult to measure variable importance
- R implementation doesn’t directly provide a regularization term

#### Implementations

- [rank_leads_logreg.R](https://github.com/ben519/MLPB/blob/master/Problems/Rank%20Sales%20Leads/rank_leads_logreg.R)
- [rank_leads_logreg.py](https://github.com/ben519/MLPB/blob/master/Problems/Rank%20Sales%20Leads/rank_leads_logreg.py)

### Random Forest

This is probably my favorite solution. It’s easy to set up and it almost always works. Plus it’s pretty darn cool.

#### Pros

- Almost always performs well
- Easy to understand the model’s logic
- Can handle complex interactions of multiple variables
- Does a good job of modeling non-monotonic relationships and spikes in the data
- Easy to measure the importance of each predictor
- Easy to tune the model’s hyperparameters

#### Cons

- Can be difficult to explain exactly why a certain prediction was made on test data
- Can be difficult to diagnose issues
- Requires imputation for missing values
- Scikit-Learn requires one-hot encoding for unordered categorical features, which can hurt predictive performance and speed
- R implementation only allows categorical features with up to 53 unique categories

#### Implementations

- [rank_leads_rf1.R](https://github.com/ben519/MLPB/blob/master/Problems/Rank%20Sales%20Leads/rank_leads_rf1.R)
- [rank_leads_rf1.py](https://github.com/ben519/MLPB/blob/master/Problems/Rank%20Sales%20Leads/rank_leads_rf1.py)

### Gradient Boosting (XGBoost)

In terms of performance, this one’s probably the best. XGBoost shows up in something like half of all winning Kaggle solutions. However, it’s fairly complex, can be difficult to tune, and its predictions are even more mystical than random forest’s.

#### Pros

- Usually performs extremely well
- Very fast
- Smart handling of missing values
- Easy to measure variable importance

#### Cons

- Can be difficult to explain exactly why a certain prediction was made on test data
- Difficult to prepare the dataset for model training and predictions
- Difficult to tune hyperparameters
- Requires one-hot encoding of categorical features

#### Implementations

- [rank_leads_xgb.R](https://github.com/ben519/MLPB/blob/master/Problems/Rank%20Sales%20Leads/rank_leads_xgb.R)
- [rank_leads_xgb.py](https://github.com/ben519/MLPB/blob/master/Problems/Rank%20Sales%20Leads/rank_leads_xgb.py)
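Several of the cons listed for these models come down to two preprocessing steps: imputing missing values and one-hot encoding unordered categorical features. The following is a rough sketch of both in plain Python, using a few values from the train table. It is not taken from the linked implementations. The helper names and the choice to pair a fill value with a separate "was missing" indicator column are illustrative assumptions; the indicator is one way to avoid the spike that a bare -1 fill would create.

```python
def impute_with_flag(values, fill=0):
    """Replace missing values (None) with `fill` and return a 0/1 missing flag.

    The flag column is an illustrative choice: it lets a linear model learn
    a separate effect for 'missing' instead of treating the fill value as real.
    """
    filled = [fill if v is None else v for v in values]
    missing = [1 if v is None else 0 for v in values]
    return filled, missing


def one_hot(values, prefix):
    """Turn category labels into 0/1 indicator columns keyed by column name.

    Missing labels (None) get a 0 in every indicator column.
    """
    categories = sorted({v for v in values if v is not None})
    return {
        f"{prefix}_{cat}": [1 if v == cat else 0 for v in values]
        for cat in categories
    }


# FacebookLikes and TypeOfBusiness for LeadIDs 1, 4, 5, 26, 27, 28 (NA -> None)
facebook_likes = [None, None, 26, 112, 89, None]
type_of_business = ["drug store", "hobbies & toys", "corner store",
                    "restaurant", "books", "corner store"]

likes_filled, likes_missing = impute_with_flag(facebook_likes)
business_columns = one_hot(type_of_business, "TypeOfBusiness")

print(likes_filled)
print(likes_missing)
for name, column in business_columns.items():
    print(name, column)
```

Whatever encoding is chosen, the categories and fill values should be derived from the training data and then reused unchanged on the test data, so that both end up with the same columns.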