# Predicting Facebook Results with R

What if you had a crystal ball that could accurately predict the performance of your Facebook campaign before any impressions are served?

Although this doesn’t exist, you can get eerily close with your predictions if you have historical data.

In this post, I’ll show you step by step how to export this historical data and turn it into a predictive model.

A streaming service I work with spends seven figures a year promoting NFL games.

Here’s the structure of the campaign:

• The campaign is optimized for conversions, defined as a free trial of the company’s service
• Ad sets are broken down by NFL team, one ad set per team. Each ad set also uses interest targeting, daily budgets, and auto-bidding
• Ads for each game are nested under the relevant teams’ ad sets. Ads typically ran from 24-36 hours before a game started until an hour into the game

For me, the first step was to go into Ads Manager and export all of the data to a spreadsheet.

I selected all of the historic NFL ads at the ad level and hit Export > Export Table Data. Luckily, on the ad level, the ads were labeled with the week the game was played. This made it slightly easier to manually input fields.

Because the ads were optimized for conversions, all of the conversion data was available in this sheet.

It looked something like this:

[Image: spreadsheet export with the manually added columns highlighted in blue]

The columns highlighted in blue were the ones I added manually; the other columns (spend, conversions) were all available in the export.

In blue are the variables that I hypothesized would affect conversions, such as promoted team and week of the season.

This was all manual data entry, so you’ll have to decide whether or not this analysis would be worth your time. For me, it was a no-brainer considering the volume of spend as well as the plethora of historical data.

Once the raw data was ready, I imported it into R.

R is free statistical software that is especially useful for modeling data. My goal was to explore the relationship between conversions (the response variable) and all of the other factors listed above: promoted team, opposing team, etc. (the predictor variables).

First, in R, we’ll set the working directory and load the libraries required for this type of modeling:

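```
# set the working directory (Ctrl+Shift+H in RStudio)
setwd(...)

# plyr is listed before dplyr, the order dplyr recommends when both are attached
lib = c("ggplot2", "GGally", "data.table", "AER", "plyr", "dplyr")

# install any packages that are missing
new.packages <- lib[!(lib %in% installed.packages()[,"Package"])]
if(length(new.packages)) install.packages(new.packages)

# load the libraries
invisible(lapply(lib, library, character.only = TRUE))

# read the exported data
data = fread(input = "nfl_2016.csv")
```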

Then, we’ll explore the data:

`str(data)`

This prints every column in the dataset along with its type. Again, our objective is to create a model to predict Results, or conversions.

The potential predictor variables include promoted team, opposing team, channel, week, Thuuz rating (an algorithmic excitement rating for sports), promoted team’s Elo rating, opposing team’s Elo rating, and spend.

`reqColumns = c("Results", "Team", "Opp_Team", "Channel_Market", "Week", "Thuuz_Rating", "Team_EloRating", "Opp_Team_EloRating", "Spend")`

`reqData = data[, reqColumns, with = FALSE]`

Now we’ll summarize reqData and check whether the training data has any missing values:

```
summary(reqData)
paste("Training data has ", sum(is.na(reqData)), " missing values", sep = "")
```

From str(data), we can see that there are a few variables that are categorical. Since that’s the case, we have to convert them to factor variables:

```
reqData[, ":="(
  Team = as.factor(Team),
  Opp_Team = as.factor(Opp_Team),
  Channel_Market = as.factor(Channel_Market),
  Week = as.factor(Week)
)]
```

With this, R will encode each qualitative variable as a set of 0/1 indicator (dummy) columns when fitting, which lets us use them in our regression.
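To make the coding concrete, here’s a minimal sketch in base R. The channel names are hypothetical (the real Channel_Market levels aren’t listed in this post); `model.matrix()` just shows the columns a regression would actually see.

```r
# Hypothetical channel values; the real Channel_Market levels are not shown in the post
channel <- factor(c("CBS", "FOX", "NBC", "FOX", "CBS"))

# model.matrix() expands the factor into an intercept plus one 0/1 column
# for every level except the first (the reference level, here "CBS")
dummies <- model.matrix(~ channel)
print(dummies)
```

Each row gets a 1 in the column that matches its channel, and a row for the reference level is all zeros apart from the intercept.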

Visually, here’s what coding the variable “channel” would look like:

[Image: dummy coding of the “channel” variable]

Next, we’ll train a tobit model that generates predictions. Tobit is used here since Results, or Facebook conversions, can’t be a negative number.

We’ll start by creating a model that contains all of the predictor variables:

`model1 <- tobit(Results ~ Team + Opp_Team + Channel_Market + Week + Thuuz_Rating + Team_EloRating + Opp_Team_EloRating + Spend , left=0, data=reqData)`

`summary(model1)`

If there are no errors (and there are none in this particular model), we’ll proceed with generating predictions for Results:

```
allPred = predict(model1, newdata = reqData)
allPred[allPred < 0] = 0
```

The allPred[allPred <0] = 0 code makes sure that in any instance that the prediction is less than 0, it becomes 0. Again, this is necessary since it’s impossible to have negative conversions from an ad.
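As a small illustration of that clamping step, the sketch below uses made-up numbers in place of real model output and shows that `pmax()` from base R gives the same result as the indexed assignment.

```r
# Made-up values standing in for the output of predict(); not real model output
raw_pred <- c(12.4, -3.1, 0.0, 57.9, -0.2)

# Approach used in this post: overwrite the negative entries in place
clamped_a <- raw_pred
clamped_a[clamped_a < 0] <- 0

# Equivalent one-liner: element-wise maximum of each prediction and 0
clamped_b <- pmax(raw_pred, 0)

print(identical(clamped_a, clamped_b))
```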

Here’s what it looks like visually:

[Image: predictions with negative values set to 0]

To look at the values that have been predicted, we’ll input:

`allPred`

Next, we’ll look at the general performance of model1 on the training data:

`summary(reqData$Results - allPred)`

This returns something like this:

| Min. | 1st Qu. | Median | Mean | 3rd Qu. | Max. |
|---|---|---|---|---|---|
| -103.494 | -10.049 | 0.000 | -1.499 | 8.019 | 230.988 |

Those residuals come from the same data the model was trained on, and we started out by simply picking every predictor variable, so we need to cross-validate the model.

Cross-validation will show us which model will be the most accurate at predicting the response variable.

First, we’ll set the parameters for k-fold cross-validation.

Wikipedia has a great definition of k-fold cross-validation:

> In k-fold cross-validation, the original sample is randomly partitioned into k equal sized subsamples. Of the k subsamples, a single subsample is retained as the validation data for testing the model, and the remaining k − 1 subsamples are used as training data. The cross-validation process is then repeated k times (the folds), with each of the k subsamples used exactly once as the validation data.

Here’s the code used:

```
k = 4
reqData$id = sample(1:k, nrow(reqData), replace = T)
list = 1:k
```

A common choice for k-fold cross-validation is 10 folds, but since our dataset is fairly small (only 260 rows), we’ll use only 4 folds.
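One thing worth knowing about the fold assignment: drawing ids with `replace = T` gives folds of slightly different sizes on each run. The sketch below compares that with a balanced alternative; the seed and the balanced variant are my own assumptions for illustration, not part of the original workflow.

```r
set.seed(42)   # assumed seed, only so the split is reproducible
n <- 260       # number of rows mentioned in this post
k <- 4

# Independent draws, as in the code above: fold sizes vary from run to run
fold_random <- sample(1:k, n, replace = TRUE)

# Shuffled repeating labels: every fold gets n / k rows (give or take one)
fold_balanced <- sample(rep_len(1:k, n))

print(table(fold_random))
print(table(fold_balanced))
```

Either approach works for this analysis; the balanced version simply guarantees that no fold ends up unusually small.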

We’ll then create prediction and test set data frames:

```
prediction = data.frame()
testsetCopy = data.frame()
```

Next, we’ll create a progress bar to show the status of cross-validation:

```
progress.bar = create_progress_bar("text")
progress.bar$init(k)
```

Here’s the bulk of the work. This loop runs the k-fold cross-validation:

```
for(i in 1:k){
  # rows whose id is not i form the training set
  trainingset <- subset(reqData, id %in% list[-i])
  # rows whose id is i form the test set
  testset <- subset(reqData, id %in% c(i))

  # run a tobit model
  mymodel <- tobit(Results ~ Team + Opp_Team + Channel_Market + Week + Thuuz_Rating + Team_EloRating + Opp_Team_EloRating + Spend, left=0, data=trainingset)

  # predict on the test set without the response (column 1, Results)
  temp <- as.data.frame(predict(mymodel, testset[,-1]))
  temp[temp<0] = 0
  # append this iteration's predictions to the end of the prediction data frame
  prediction <- rbind(prediction, temp)

  # append this iteration's test set to the test set copy data frame
  # keep only the Results column
  testsetCopy <- rbind(testsetCopy, as.data.frame(testset[,1]))

  progress.bar$step()
}
```

Once this has all been run, the progress bar should show 100% without any errors.

The heavy lifting is now done! Now we’ll compare the predictions vs. the actual Results values:

```
result <- cbind(prediction, testsetCopy[, 1])
names(result) <- c("Predicted", "Actual")
result$Difference <- abs(result$Actual - result$Predicted)
```

For this model, we’ll use Mean Absolute Error as a way to evaluate the accuracy:

`summary(result$Difference)`

This will return something like this:

| Min. | 1st Qu. | Median | Mean | 3rd Qu. | Max. |
|---|---|---|---|---|---|
| 0.000 | 2.649 | 13.071 | 28.204 | 34.923 | 420.808 |

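The Mean column is the mean absolute error. The lower it is, the closer the model’s predictions are to the actual values on data it wasn’t trained on.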
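If you’d rather get that single number directly instead of reading it off `summary()`, a tiny helper does the job. The function name `mae` and the example values are my own, chosen only to make the arithmetic easy to follow.

```r
# Mean absolute error: the average size of the prediction misses
mae <- function(actual, predicted) mean(abs(actual - predicted))

# Made-up example values
actual    <- c(10, 0, 25, 40)
predicted <- c(12, 3, 20, 40)

# The misses are 2, 3, 5 and 0, so the result is (2 + 3 + 5 + 0) / 4
print(mae(actual, predicted))
```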

So far, we’ve only tried a model with every predictor variable. However, the possibility is high that 1) there’s an interaction between Spend and other variables and 2) not every variable should be selected in this model.

An interaction between Spend and another variable (for example, Team) makes sense because the two can combine to have an effect that’s more than just the sum of its parts.

To illustrate, consider a team that has performed so poorly that there has not been a single conversion from promoting its games. If Spend increases for this team, it’s unlikely that conversions will increase as much as they would for a strong-performing team.
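Here’s a rough sketch of what an interaction term adds to a model, using a toy data frame with two hypothetical teams. The `Team:Spend` column lets the effect of Spend differ by team instead of forcing one shared slope.

```r
# Toy data with hypothetical teams "A" and "B"; not taken from the real export
toy <- data.frame(
  Team  = factor(c("A", "A", "B", "B")),
  Spend = c(100, 200, 100, 200)
)

# Main effects plus the Team-by-Spend interaction
X <- model.matrix(~ Team + Spend + Team:Spend, data = toy)
print(X)
```

The `TeamB:Spend` column is zero for team A and equal to Spend for team B, so its coefficient measures how much team B’s return on spend differs from team A’s.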

Keeping this in mind, we’ll go back to the summary of our original model where we input every predictor variable.

Looking at the right-hand column of p-values, we can identify that channel, week, Elo Rating, and spend are most likely to be significant, so we’ll reduce our model to only include these four variables.

`mymodel2 <- tobit(Results ~ Channel_Market + Week + Team_EloRating + Spend , left=0, data=trainingset)`

After running this model through k-fold cross-validation, we’ll again look at the mean absolute error:

`summary(result$Difference)`

| Min. | 1st Qu. | Median | Mean | 3rd Qu. | Max. |
|---|---|---|---|---|---|
| 0.000 | 1.595 | 8.600 | 19.149 | 24.753 | 424.013 |

It looks like the mean has gone down, from 28.2 to 19.1. This means that the simplified model is stronger than the one with every variable, which likely overfit the data.
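Since the same cross-validation loop gets re-run for every candidate model, it can help to wrap it in a function that takes a formula. The sketch below is one way to do that under some assumptions: the name `cv_mae` and its arguments are hypothetical, it expects a plain data frame, it uses `lm()` as a stand-in fitting function so it runs without extra packages, and the data is simulated rather than the real export.

```r
# Hypothetical helper: k-fold cross-validated mean absolute error for a formula
cv_mae <- function(formula, data, k = 4, fit_fun = lm, seed = 1) {
  set.seed(seed)                                  # assumed seed for repeatable folds
  fold <- sample(rep_len(1:k, nrow(data)))        # balanced random fold ids
  response <- all.vars(formula)[1]                # name of the left-hand-side variable
  abs_err <- numeric(0)
  for (i in 1:k) {
    train <- data[fold != i, , drop = FALSE]
    test  <- data[fold == i, , drop = FALSE]
    fit   <- fit_fun(formula, data = train)
    # clamp negative predictions at 0, as in the rest of this post
    pred  <- pmax(as.numeric(predict(fit, newdata = test)), 0)
    abs_err <- c(abs_err, abs(test[[response]] - pred))
  }
  mean(abs_err)
}

# Simulated data, only to show the call; Noise is unrelated to Results
set.seed(7)
sim <- data.frame(Spend = runif(80, 0, 500), Noise = rnorm(80))
sim$Results <- pmax(0.1 * sim$Spend + rnorm(80, sd = 5), 0)

print(cv_mae(Results ~ Spend, sim))
print(cv_mae(Results ~ Spend + Noise, sim))
```

In principle you could pass `fit_fun = AER::tobit` to match the models in this post, though I haven’t run the helper that way here. Comparing the returned numbers across formulas is the same comparison as reading the Mean column above.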

Now, we’ll also start to account for possible interactions, using the notation below:

`mymodel3 <- tobit(Results ~ Channel_Market + Week + Team_EloRating + Spend + Channel_Market:Spend + Week:Spend + Team_EloRating:Spend, left=0, data=trainingset)`

However, when checking the summary, I got the following error:

`summary(tobit(Results ~ Channel_Market + Week + Team_EloRating + Spend + Channel_Market:Spend + Week:Spend + Team_EloRating:Spend, left=0, data=reqData))`

`Error in solve.default(vcov.hyp) :`
`Lapack routine dgesv: system is exactly singular: U[1,1] = 0`

This error means that a matrix the summary needs to invert is singular. That typically happens when there are too many factor levels for the amount of data, so some terms have little or no variation to estimate from. Even if the mean absolute error is low, these models should be disregarded.

As a simplified illustration, imagine the following data:

William 1 Pass
William 2 Fail
Arnold 1 Pass
Arnold 2 Pass
Randy 1 Fail
Randy 2 Fail

If we were to train a model on this data to predict the last column from the first two variables, it could get every row right. There wouldn’t be any variance left over, and the error would be zero.

But if this model sees new data, the result would not be as pleasant.
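The toy data above can be checked in a few lines of base R. In this sketch the column names `Name`, `Attempt`, and `Pass` are my own labels (the table has no headers), Pass/Fail is coded as 1/0, and `lm()` is used as a simple stand-in for the tobit model.

```r
# The six rows from the example; column names are assumed labels
toy <- data.frame(
  Name    = factor(c("William", "William", "Arnold", "Arnold", "Randy", "Randy")),
  Attempt = factor(c(1, 2, 1, 2, 1, 2)),
  Pass    = c(1, 0, 1, 1, 0, 0)   # Pass = 1, Fail = 0
)

# With both variables and their interaction, there is one parameter per row
X <- model.matrix(~ Name * Attempt, data = toy)
print(dim(X))

# The fit is "perfect", but nothing is left over to estimate uncertainty from
fit <- lm(Pass ~ Name * Attempt, data = toy)
print(max(abs(residuals(fit))))
print(df.residual(fit))
```

Six parameters for six rows means the model has simply memorized the data, which is why a zero error here says nothing about how it would do on a new row.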

Eventually, we get to a fairly simple model that can predict Results:

`mymodelfinal <- tobit(Results ~ Team_EloRating + Spend + Week:Spend + Channel_Market:Spend , left=0, data=trainingset)`

Once this has been done, it’s simple to export the data and use it to predict outcomes!