Testing Our Ad LTV Models for Accuracy


Hi, my name is Adam and I’m excited to be sharing my first post here. As the Lead Data Analyst at Soomla, my job is to measure and improve our Ad Revenue attribution and to provide the team with actionable insights about the product. To do that, we need an accurate and reliable way to measure the performance of our different attribution methods. This post describes how we assess and optimize the accuracy of our revenue attribution models and how we choose the best algorithm for each ad network and ad type. I’ll describe the KPIs we use, the methodology behind the evaluation process, and finally some insights we’ve gleaned from the data so far.


To Measure or Not To Measure – That Isn’t Even a Question

Having a measurable indicator of success is crucial to any productive venture, and measuring the accuracy of our Ad revenue attribution is no exception. Our models calculate user ad revenue in proportion to the measured impressions, clicks or installs, which is essentially a curve-fitting or regression problem. Common goodness-of-fit metrics such as R² or Mean Squared Error give some perspective on which algorithms align best with the reported revenue, but they don’t tell us how well we’re doing in actual dollar terms, in a form that is translatable and comparable across models. Since we work with many different kinds of models, we also need a KPI that is model-agnostic, meaning that it applies equally to linear, non-linear, and machine learning models. We have therefore chosen to measure ourselves by the absolute percentage error of our predictions. This KPI is easy to calculate and understand, and it can be translated into real monetary terms across several dimensions: by app, country, ad type, network, etc.
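As a sketch, the KPI for a single app can be written as below. This is our reading of the text, consistent with the click-based worked example later in this post (placement-level absolute errors are summed and divided by the app's total reported revenue); the symbols are introduced here for illustration only.

```latex
\mathrm{APE}_{\mathrm{app}} = 100\% \times \frac{\sum_{p \in P} \lvert \hat{R}_p - R_p \rvert}{\sum_{p \in P} R_p}
```

Here P is the set of the app's placements, R_p is the revenue the ad network's reporting API reports for placement p, and \hat{R}_p is the revenue a model attributes to it. One natural way to get the same KPI by country, ad type or network is to sum the numerator and denominator over that grouping instead of over a single app.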

Automatic for the People

Another major improvement we’ve made in recent months is turning our previously manual self-testing procedure into a fully automated one. Using Data Science libraries and toolkits in Python, we’ve turned the accuracy measurement process into scripts that can ingest and process data on thousands of apps and tens of thousands of placements, over millions of rows of observations, in just a matter of minutes. In the past we could estimate our accuracy for each ad type/network/platform combination on only a handful of apps. Now we can analyze all of the apps and placements that meet our Data Validation criteria, which makes the process much more consistent, accurate, and efficient. A larger sample size also means that our results are more robust and reliable.

Data Validation (DV) – The process of ensuring that we’re seeing the same number of impressions and other events as reported by the publisher APIs.

Known Unknowns

Although SOOMLA’s core product is the attribution of Ad revenue to the user level, that attribution depends on an accurate recording of the activity in the app. Since accuracy is paramount to us, we keep reminding ourselves of the Data/Computer Science maxim “garbage in, garbage out”: we have to account for technical or implementation issues that can create noise in the data, and in general ensure that our measurement data is in line with in-app activity. By comparing our user-level measurements to the aggregated reporting available from the Ad Networks, we can perform Data Validation on our data and find the apps that clear our DV Threshold.

DV Threshold – The maximum allowed discrepancy between our data and the data from the publisher API. When we train our models, we set this threshold at 10%. In other words, any truth set we use to train our algorithms must agree with our measured data to within 10% (at least 90% agreement).

Quantifying this aspect of SOOMLA’s product also helps us identify where we can improve our data collection. Our KPI for data validation is likewise simple and revenue-centric: it is the percentage of total revenue generated by apps that clear our DV threshold.
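A minimal sketch of how the DV threshold and the DV revenue KPI could be computed, assuming per-app totals are already available. The record type, field names and app IDs are hypothetical; only impressions are compared here (the post mentions impressions and other events), and the discrepancy is assumed to be measured relative to the publisher API count. The 10% threshold is the one stated above.

```python
from dataclasses import dataclass

DV_THRESHOLD = 0.10  # max allowed discrepancy vs. the publisher API (10%)


@dataclass
class AppDV:
    """Hypothetical per-app DV record for one network and period."""
    app_id: str
    measured_impressions: int  # impressions we recorded in the app
    api_impressions: int       # impressions reported by the publisher API
    api_revenue: float         # revenue reported by the publisher API (USD)


def dv_discrepancy(app: AppDV) -> float:
    """Relative discrepancy of our count vs. the API count (assumed definition)."""
    if app.api_impressions == 0:
        return 0.0 if app.measured_impressions == 0 else float("inf")
    return abs(app.measured_impressions - app.api_impressions) / app.api_impressions


def passes_dv(app: AppDV, threshold: float = DV_THRESHOLD) -> bool:
    """True if the app's data is close enough to the API to be used as a truth set."""
    return dv_discrepancy(app) <= threshold


def dv_revenue_coverage(apps: list[AppDV], threshold: float = DV_THRESHOLD) -> float:
    """DV KPI: % of total API-reported revenue generated by apps that clear the threshold."""
    total = sum(a.api_revenue for a in apps)
    if total == 0:
        return 0.0
    cleared = sum(a.api_revenue for a in apps if passes_dv(a, threshold))
    return 100.0 * cleared / total


# Made-up example data.
apps = [
    AppDV("app_a", 9_500, 10_000, 600.0),   # 5% discrepancy: cleared
    AppDV("app_b", 7_000, 10_000, 300.0),   # 30% discrepancy: excluded
    AppDV("app_c", 10_800, 10_000, 100.0),  # 8% discrepancy: cleared
]
training_apps = [a.app_id for a in apps if passes_dv(a)]
print(training_apps, dv_revenue_coverage(apps))  # ['app_a', 'app_c'] 70.0
```

In this toy data, the two apps that clear the threshold generate $700 of the $1,000 total, so the DV KPI is 70%, and only their truth sets would be used for training.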


Crouching Algorithm, Hidden Dimension

Now that we’ve established what data we use and how we measure our performance, the question remains: how do we actually evaluate our algorithms? We use many different methods, but for this use case we rely on what we call the “dimension removal method”.

Dimension Removal Method – A method for estimating the accuracy of an Ad Revenue measurement algorithm. We use internal breakdowns of the reporting APIs as truth sets, hide these true data points from the algorithm, and ask it to calculate the answer without knowing it.

The basic idea is to interpolate the placement-level revenue using SOOMLA’s revenue attribution methods and compare the simulated data points to the actual revenue distribution. First, we collect truth sets (analogous to labeled data in machine learning) of actual ad revenue from the reporting APIs of different ad networks, broken down to the placement level. Next, we “hide” the placement dimension from the models and let them estimate the revenue for each placement based on each algorithm’s independent variables. Finally, we calculate the absolute difference between each model’s revenue assignment and the actual revenue from the APIs.

Truth Set / Training Data – Data that contains the known quantity or label that the algorithm is predicting. In this case, we’re referring to the daily publisher API reporting for each placement.
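The hide/estimate/compare loop described above can be sketched in Python with pandas. This is a minimal illustration under stated assumptions, not SOOMLA’s implementation: the function and column names (`app`, `date`, `placement`, `revenue`) are hypothetical, and the “model” is the simple proportional (share-of-metric) allocation from the click-based example in the next section, standing in for the production models, which the post does not describe. The sample rows mirror that example.

```python
import pandas as pd


def dimension_removal_eval(truth: pd.DataFrame, metric: str) -> pd.DataFrame:
    """Hide the placement dimension, re-estimate it from `metric`, and score it.

    Assumes one row per (app, date, placement) with the API-reported `revenue`
    (the truth set) and our measured event count in column `metric`.
    """
    df = truth.copy()
    keys = ["app", "date"]
    # 1. Hide: the model only sees each app's daily revenue, not the placement split.
    app_revenue = df.groupby(keys)["revenue"].transform("sum")
    # 2. Estimate: allocate that revenue in proportion to each placement's share
    #    of the metric (hypothetical proportional model). If an app recorded no
    #    events at all, nothing can be allocated, so the estimate is 0.
    metric_total = df.groupby(keys)[metric].transform("sum")
    df["predicted"] = (
        app_revenue * df[metric] / metric_total.where(metric_total > 0)
    ).fillna(0.0)
    # 3. Compare the estimate with the hidden, API-reported placement revenue.
    df["abs_error"] = (df["predicted"] - df["revenue"]).abs()
    return df


truth = pd.DataFrame({
    "app": ["demo_app"] * 3,
    "date": ["2019-04-01"] * 3,
    "placement": ["placement_1", "placement_2", "placement_3"],
    "clicks": [25, 60, 15],
    "revenue": [20.0, 60.0, 20.0],
})
scored = dimension_removal_eval(truth, metric="clicks")
# predicted is 25/60/15 vs. revenue 20/60/20, so abs_error is 5/0/5
print(scored[["placement", "predicted", "revenue", "abs_error"]])
```

Using `groupby(...).transform("sum")` keeps one row per placement, so the hidden truth and the estimate sit side by side and the error can later be rolled up to any level.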

While this might seem like a limited approach, it actually provides SOOMLA with an abundance of true data points to learn from. Some stats about our training data in April:

  • 6,957,819 true data points with placement-level ad revenue in mobile apps
  • We worked with 3,044,994 of these data points to learn and improve our algorithms and methods
  • 794,390 data points were used to evaluate the best performing models

The Paradox of Choice

I started by discussing how we test our accuracy, and as we dove into more detail you probably realized that the abundance of data points we collect on every single impression allows us to use more than one model for calculating the ad revenue attribution. In fact, we have more than 8 models in production, and we are continuously working on new ones as well as on ways to improve existing ones. With so many options, choosing could be quite hard. However, our accuracy testing methods and our ability to receive truth sets prove highly beneficial on this front as well: they allow us to quickly iterate on, test and compare new models and new versions on large amounts of historical data and select the top performer.

For example, with a simple click-based model, if we’ve recorded that a placement represented 25% of all clicks for a particular app, that placement is credited with 25% of the revenue. For an app with $100 revenue, that equals $25. If the placement actually generated only $20, the absolute error for that placement is $5. If the app has two more placements with 60% and 15% of clicks that actually generated $60 and $20 respectively, their errors are $0 and $5, so the app has a $10 absolute error, or a 10% absolute percentage error.
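The arithmetic of this worked example can be checked with a few lines of plain Python. The placement names and the helper function are hypothetical; the click shares and revenues are the ones from the paragraph above.

```python
def click_based_attribution(app_revenue: float, clicks: dict[str, int]) -> dict[str, float]:
    """Credit each placement with the app's revenue in proportion to its share of clicks."""
    total_clicks = sum(clicks.values())
    return {p: app_revenue * c / total_clicks for p, c in clicks.items()}


# Recorded clicks and API-reported (true) revenue per placement, from the example.
clicks = {"placement_1": 25, "placement_2": 60, "placement_3": 15}
actual = {"placement_1": 20.0, "placement_2": 60.0, "placement_3": 20.0}
app_revenue = sum(actual.values())  # $100 for the whole app

predicted = click_based_attribution(app_revenue, clicks)
total_abs_error = sum(abs(predicted[p] - actual[p]) for p in clicks)
ape = 100.0 * total_abs_error / app_revenue  # absolute percentage error, in %

print(predicted)  # {'placement_1': 25.0, 'placement_2': 60.0, 'placement_3': 15.0}
print(total_abs_error, ape)  # 10.0 10.0
```

Note that the over-credit on placement_1 and the under-credit on placement_3 do not cancel out: taking absolute values means both count against the model.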

You can also look at the example below highlighting the error rates for 3 different models.

Table: Example of App Error Calculation (error rates for 3 different ad LTV models, with truth sets from the ad networks’ APIs)

This process is completed for all of the metrics we collect and for all apps, ad types, operating systems and networks, which allows us to calculate our KPI at whatever level of analysis we need – by app, ad type, network, country or any combination thereof. We run this process together with other evaluation methods once a month, allowing us to consistently evaluate, select, and track the top-performing algorithms and spot trends in our accuracy data.
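Because errors are computed at the placement level, they can be rolled up to any dimension and used to pick a winning model per segment. The sketch below assumes a hypothetical `scored` table with one row per placement-day per candidate model, holding the API-reported `revenue` and that model’s `abs_error`. It aggregates APE by any list of columns and keeps the lowest-APE model for each network/ad type/OS combination. Selecting purely on lowest APE is a simplifying assumption (the post notes that other evaluation methods are run alongside this one), and the numbers are made up.

```python
import pandas as pd


def ape_by(scored: pd.DataFrame, dims: list[str]) -> pd.Series:
    """APE in %: summed absolute error / summed API revenue, at the level of `dims`."""
    agg = scored.groupby(dims)[["abs_error", "revenue"]].sum()
    return 100.0 * agg["abs_error"] / agg["revenue"]


def best_model_per_segment(scored: pd.DataFrame, segment: list[str]) -> pd.DataFrame:
    """Keep the model with the lowest APE in each segment (e.g. network/ad type/OS)."""
    ape = ape_by(scored, segment + ["model"]).rename("ape").reset_index()
    best_rows = ape.groupby(segment)["ape"].idxmin()  # row label of the min per segment
    return ape.loc[best_rows.to_numpy()].reset_index(drop=True)


# Hypothetical scored rows: one per placement-day per candidate model.
scored = pd.DataFrame({
    "network": ["net_a", "net_a", "net_b", "net_b"],
    "ad_type": ["interstitial", "interstitial", "rewarded_video", "rewarded_video"],
    "os": ["android", "android", "ios", "ios"],
    "model": ["click_based", "install_based", "click_based", "install_based"],
    "revenue": [100.0, 100.0, 50.0, 50.0],
    "abs_error": [10.0, 4.0, 5.0, 15.0],
})
print(ape_by(scored, ["network", "model"]))
print(best_model_per_segment(scored, ["network", "ad_type", "os"]))
```

Running the selection once per monthly batch would give the month-by-month record of winning models that the later sections draw on.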

All Ads Are Equal – But Some Ads Are More Equal Than Others

Now that we’ve covered the main points of our methodology, let’s get into some of the insights and takeaways from our research. If you want to see a full report of our accuracy figures across all ad types and networks, you can follow this link and explore it yourself. You can also compare the predictions of our top-performing algorithms to the benchmark accuracy of a standard impression-based attribution method.

Apr 2019 - Accuracy Test Results

Variance Is The Spice of Life

One of the main takeaways from our research is that error rates change significantly depending on the model used to interpolate the ad revenue per impression. SOOMLA evaluates at least 8 different ad LTV models for each ad network, ad type and platform. When the best model is chosen, interstitial ads and rewarded videos perform very well across all networks, with very low error. When the wrong model is chosen, however, error rates are much higher and the model cannot accurately attribute revenue. This further emphasizes the need for continuous testing against true ad revenue data.

If you want to learn more about each one of the models we use and how we obtain truth-sets, track installs and identify advertiser campaigns from the publisher side, you can schedule a deep dive session here.

Steady As She Goes

One consequence of tracking algorithm selection month by month is that we can see which algorithms perform best over time. We would expect consistency in which method works best for each network, as certain payout models tend to be dominant in each network, and it turns out that the data validates this theory. Looking at our models for Q1 of 2019, we see that on Vungle, for example, install-based algorithms were the top performers for all ad types. On Facebook, click-based methods perform best, with the exception of rewarded video ads on Android devices, where install-based methods are the top performers. Operating system differences also show up on Tapjoy: install-based methods are consistently preferred for interstitial ads on Android devices, while on iOS, video completions are the best metric for assigning revenue.

For a Data professional such as myself who is new to the Ad-tech industry, this kind of analysis is very interesting and helpful for learning about what characterizes each network and where the greatest improvements can be made in attributing Ad LTV on the user level. 

I’m looking forward to sharing more posts like this as we continue working on improving our methods and accuracy. If you have any questions, comments or feedback about this article in particular and our work on accurately calculating User Ad LTV, send me an email at adam@soomla.com!

