Identity Graph in AdTech

2024-08-06

Introduction

Graph databases are useful in many AdTech use cases, especially those related to identity mapping and user profiles. Because they can quickly retrieve transitively connected records or traverse the whole graph, they are an attractive option for lookalike modeling, behavior analysis, and building user profiles that combine all of a user’s IDs, devices, and interests. Graph databases offer more advanced query functionality than high-speed key–value stores and can handle AdTech workloads as long as you manage write latency.

One interesting capability of graph databases is nontraditional multistep lookups, which are useful for ID mapping and GDPR-like privacy use cases. For example, if you receive a stream of ID mapping pairs that link email hashes to device IDs and email hashes to other identifiers, a graph database lets you query the data in any direction. In this example, you can retrieve all device IDs for a list of email hashes. This capability is also very useful for privacy use cases, when users ask to see all of their information but provide only one ID, or ask to delete their profiles completely by providing only one or a few IDs, such as an email address or a phone number.
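The multistep lookup described above can be illustrated with a minimal in-memory sketch. This is not how any particular graph database implements it; the `IdentityGraph` class, the `(kind, value)` ID tuples, and the `devices_for_emails` helper are hypothetical names chosen for illustration. The sketch stores each ID pair as an undirected edge and uses a breadth-first traversal to collect every ID connected to a starting ID.

```python
from collections import defaultdict, deque


class IdentityGraph:
    """Toy identity graph: IDs are (kind, value) tuples, pairs are undirected edges."""

    def __init__(self):
        self._adj = defaultdict(set)

    def add_pair(self, id_a, id_b):
        # Store the edge in both directions so lookups work from either side.
        self._adj[id_a].add(id_b)
        self._adj[id_b].add(id_a)

    def connected_ids(self, start):
        """Return every ID reachable from `start`, including `start` itself."""
        if start not in self._adj:
            return set()
        seen = {start}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for neighbor in self._adj[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
        return seen

    def delete_profile(self, start):
        """Remove the whole connected profile, given any one of its IDs."""
        ids = self.connected_ids(start)
        for node in ids:
            self._adj.pop(node, None)
        return ids


def devices_for_emails(graph, email_hashes):
    """Collect device ID values reachable from a list of email hashes."""
    devices = set()
    for email_hash in email_hashes:
        for kind, value in graph.connected_ids(("email_hash", email_hash)):
            if kind == "device":
                devices.add(value)
    return devices
```

With pairs such as email hash `e1` to device `d1`, email hash `e1` to cookie `c1`, and cookie `c1` to device `d2`, calling `devices_for_emails(graph, ["e1"])` reaches both devices, and `delete_profile` called with any single ID removes the whole connected profile.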

Graph query capabilities are also useful for fraud detection in AdTech. They can help you analyze connections among devices, users, clicks, and IP addresses to spot unusual patterns or detect a cluster of bots. Or, if you have a database of bots’ IP addresses, you can retrieve all activity related to those IP addresses, such as impressions, clicks, and users, and mark it as potential fraud.
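As a rough sketch of the bot-IP case, the following walks two hops: from known bot IP addresses to the users seen on them, and from those users to all of their events. The event field names (`type`, `user_id`, `ip`) and the `flag_suspicious` function are assumptions made for this example, not part of any specific product.

```python
def flag_suspicious(events, bot_ips):
    """Return (flagged_users, flagged_events) for a list of event dicts.

    Hop 1: bot IP -> users seen on that IP.
    Hop 2: flagged users -> all of their events, on any IP.
    """
    bot_ips = set(bot_ips)
    flagged_users = {e["user_id"] for e in events if e["ip"] in bot_ips}
    flagged_events = [
        e for e in events
        if e["ip"] in bot_ips or e["user_id"] in flagged_users
    ]
    return flagged_users, flagged_events
```

In a graph database the same idea would be expressed as a traversal query; the list comprehension here only stands in for that traversal.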

Write latency is not the strongest attribute of graph databases

The AdTech industry often uses fast key–value or document databases for actions like enriching events in real time, capping budgets and impressions, or maintaining other types of counters. Graph databases have more complex underlying data structures and use different types of indexes to enable quick transitive queries. As a result, write latency is higher and write throughput is much lower than in key–value databases. We ran various comparison tests and repeated write workload tests on graph databases, aiming for 100,000 record (ID pair) writes per second under a fixed budget, and we were not able to reach that performance goal within the given budget.

To adapt graph databases to AdTech data volumes and latency requirements, one technique we think can help is record buffering, so that you can write in batches or in bulk. For identity mapping use cases, it makes sense to deduplicate records before writing them into the graph database, because during the ID exchange process, such as cookie matching or other types of tracking, the same ID pair may enter the system multiple times within an hour. In our experience, we have seen cases where 90% of a few weeks’ worth of data consisted of duplicate records.
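A minimal sketch of buffering with deduplication might look like the following. The `DedupBuffer` class and its `flush_fn` callback are hypothetical; `flush_fn` stands in for whatever bulk-write call your graph database client provides. The sketch assumes ID pairs are undirected and that IDs are mutually comparable (for example, all strings), and it only removes duplicates that arrive within the same buffer window.

```python
class DedupBuffer:
    """Collects ID pairs, drops duplicates, and writes them out in batches."""

    def __init__(self, flush_fn, max_size=1000):
        self._flush_fn = flush_fn   # hypothetical bulk-write callback
        self._max_size = max_size   # flush once this many unique pairs are pending
        self._pending = set()
        self.received = 0           # pairs seen, including duplicates
        self.written = 0            # unique pairs handed to flush_fn

    def add(self, id_a, id_b):
        self.received += 1
        # Normalize order so (a, b) and (b, a) count as the same pair.
        pair = (id_a, id_b) if id_a <= id_b else (id_b, id_a)
        self._pending.add(pair)
        if len(self._pending) >= self._max_size:
            self.flush()

    def flush(self):
        if not self._pending:
            return
        batch = sorted(self._pending)
        self._flush_fn(batch)       # one bulk write instead of many single writes
        self._pending.clear()
        self.written += len(batch)
```

A caller would also flush on a timer so that pairs do not wait indefinitely in a quiet period. Comparing `received` with `written` gives a quick view of how much write traffic deduplication removes; to catch duplicates across windows, you would need a longer-lived set or cache of recently written pairs.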

Cookie deprecation will lower your write throughput expectations

We took the goal of 100,000 records per second from our experience with typical AdTech workloads and customer expectations; however, the majority of such ID mapping traffic usually consists of pairs of cookies from different partners. Because third-party cookies often have a short life span, cookie-matching services produce a lot of traffic. With the deprecation of third-party cookies and the adoption of ID systems such as Unified ID 2.0, which should eventually result in more stable and consistent user IDs, the incoming traffic of ID pairs will likely be significantly lower and easier to deduplicate. Wider usage of privacy-preserving ad APIs, such as Google’s Privacy Sandbox in Chrome, will also reduce the need for ID mapping in tracking and conversion attribution use cases, because such APIs cover many AdTech use cases inside the user’s browser.

Another idea for reducing the load on an identity graph database is to split ID traffic by geographical region (for example, by country), which may also help with complying with region-specific privacy laws, such as GDPR.
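One way to picture the regional split is a small routing step in front of the graph writes. The country-to-region mapping, the region names, and the function names below are all hypothetical; a real deployment would define its own regions based on its legal and infrastructure requirements.

```python
from collections import defaultdict

# Hypothetical country-to-region mapping, for illustration only.
REGION_BY_COUNTRY = {"DE": "eu", "FR": "eu", "US": "us", "CA": "us"}
DEFAULT_REGION = "other"


def region_for(country_code):
    """Map a two-letter country code to a regional graph partition."""
    return REGION_BY_COUNTRY.get(country_code.upper(), DEFAULT_REGION)


def partition_by_region(records):
    """Group (country_code, id_a, id_b) records into per-region lists of ID pairs."""
    partitions = defaultdict(list)
    for country_code, id_a, id_b in records:
        partitions[region_for(country_code)].append((id_a, id_b))
    return dict(partitions)
```

Each regional list can then be written to its own graph instance, so that no single database has to absorb the full global write load.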

Conclusion

There are challenges in using graph databases for AdTech, especially around write latency and throughput; batching and deduplicating records before writing them to the database may help. If you keep graph writes off the critical path and stream them in later instead, a graph database such as Aerospike Graph, with its hybrid memory architecture, high performance at scale, and cost efficiency, can be a standout choice for managing large identity datasets within memory and budget limitations.

Author: Alexey Rosolovsky
