Growing Technical Architecture at Glispa: Five Lessons Learned

by Bastian Quilitz, VP of Technology

This piece is the first in our new Tech at Heart series, where we look at topics surrounding our operational and advertising technology. The following was adapted from a speech Bastian made at our exclusive Tech Open Air event, entitled 'Challenges of Scale'. The event was co-hosted with Stack Overflow at Glispa's Berlin HQ. Take a look at our video of the event by clicking here, or at the end of this post.

As tech companies grow, they inevitably face the challenges that growth brings. Every tech company approaches these challenges in its own way, but not every company succeeds in scaling its IT infrastructure to manage and crunch large amounts of data.

For us at Glispa Global Group, it was a long journey. We learned a lot, we saw things work and we saw things fail. We’d like to share some key learnings from our own experience in growing Glispa from a small ad network to a global adtech company.

1. Focus on what brings value to the company.
Tracking campaign performance is key for every advertising company. Tracking data is used for many things, from optimization and fraud detection to billing advertisers and paying publishers. At Glispa we faced an important decision early on: whether to build our own tracking solution or to use a third-party tracking provider. We decided to focus on what brings value: our priority for scaling the business was building out operational technology. So we decided against building our own tracking technology and licensed it instead. This allowed us to focus on building the tools that let us handle more campaigns and process more invoices, increase efficiency, and give people more time to work with clients and grow the business.

Putting something as critical as tracking on the shoulders of our development team, which had only four or five people back then, would have been very risky. Thanks to the success of our operational technology in supporting business growth, we were able to build our own tracking technology at a later stage.

2. Things will break: be prepared
When we built our tracking system, we wanted no compromises on availability, as every minute of downtime is costly and harms our reputation. We evaluated many different technologies and decided to use Cassandra, a scalable, fault-tolerant, decentralized data store. Our server components are written in Java, and we picked Vert.x as the core framework. Vert.x is a framework for building reactive applications: it is event-driven and non-blocking, similar to Node.js.
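To make "event-driven and non-blocking" concrete, here is a minimal sketch of the idea. It uses Python's asyncio purely for brevity and is not Vert.x or our production code; the handler names and delays are hypothetical. Two simulated requests share a single thread, and the one with the shorter I/O wait finishes first even though it was started second.

```python
import asyncio


async def handle_request(name: str, io_delay: float, finished: list) -> None:
    # Awaiting the simulated I/O hands control back to the event loop
    # instead of blocking the thread while we wait.
    await asyncio.sleep(io_delay)
    finished.append(name)


async def serve() -> list:
    finished = []
    # Both requests are in flight at the same time on one thread.
    await asyncio.gather(
        handle_request("slow", 0.2, finished),   # hypothetical slow lookup
        handle_request("fast", 0.01, finished),  # hypothetical fast lookup
    )
    return finished


if __name__ == "__main__":
    print(asyncio.run(serve()))
```

In a blocking, thread-per-request design the slow request would hold its thread for the whole wait; here the loop is free to serve other requests in the meantime.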

For global presence and elastic scale we decided to use Amazon Web Services (AWS) and deployed the system to two regions, the US and the EU, with a cross-region Cassandra cluster, some tracking servers and some API servers. This rather simple setup was possible because Cassandra handles the complexity of cross-data-center logic.

High-level overview of the first iteration of our tracking platform

We learned that adopting technology is easy; mastering it is the tricky part! We spent a lot of time tweaking Cassandra for our workloads. We had a period in which our cluster suffered from high load, and nodes kept dropping from the ring. Adding more nodes did not help, and after a few days we really did not know how to continue. Finding the solution required several days of digging into Cassandra internals. We eventually traced the root cause to the way Cassandra handled TCP keep-alive. After patching Cassandra and changing some settings, our cluster was back to normal. We're now a lot more vigilant when it comes to the stability of our system.
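For readers unfamiliar with TCP keep-alive, the sketch below shows what the setting looks like at the socket level. It is an illustration only and does not reproduce the Cassandra patch or the settings we changed; the function name and the timing values are assumptions. The three tuning options are platform-specific, so the code checks that they exist before using them.

```python
import socket


def enable_keepalive(sock: socket.socket, idle_s: int = 60,
                     interval_s: int = 10, probes: int = 5) -> None:
    """Ask the OS to probe idle connections so dead peers get detected."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # Tuning knobs (illustrative values): seconds of idle time before the
    # first probe, seconds between probes, and failed probes before the
    # connection is dropped. Not every platform exposes all three.
    if hasattr(socket, "TCP_KEEPIDLE"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle_s)
    if hasattr(socket, "TCP_KEEPINTVL"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval_s)
    if hasattr(socket, "TCP_KEEPCNT"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, probes)


if __name__ == "__main__":
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        enable_keepalive(s)
        print(s.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0)
```

Without keep-alive, a connection whose peer has silently disappeared can stay open indefinitely, which is the kind of behaviour that is hard to see from OS-level dashboards alone.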

3. You’d better be sure what’s happening
The issues with Cassandra taught us that OS-level monitoring is not enough. It gave us very limited insight into what was actually going on inside our application. After this experience we added more and more monitoring to our systems, to the point where we can understand the timing of every step in our request pipeline. Aside from timing, we keep track of error rates, queue sizes, and lots more. Proper monitoring has helped us a lot in scaling our software, and it speeds up troubleshooting. We currently use Grafana on top of InfluxDB for monitoring our systems.
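As a rough illustration of per-step instrumentation, the sketch below times each pipeline step and counts errors in memory. The class, the step names and the sample request are hypothetical, and shipping the numbers to InfluxDB is deliberately left out; it is a sketch of the idea, not our monitoring code.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class PipelineMetrics:
    def __init__(self) -> None:
        self.timings_ms = defaultdict(list)  # step name -> durations in ms
        self.errors = defaultdict(int)       # step name -> failed runs

    @contextmanager
    def step(self, name: str):
        start = time.perf_counter()
        try:
            yield
        except Exception:
            self.errors[name] += 1
            raise
        finally:
            # Record the duration whether the step succeeded or failed.
            elapsed_ms = (time.perf_counter() - start) * 1000.0
            self.timings_ms[name].append(elapsed_ms)

    def error_rate(self, name: str) -> float:
        runs = len(self.timings_ms[name])
        return self.errors[name] / runs if runs else 0.0


if __name__ == "__main__":
    metrics = PipelineMetrics()
    with metrics.step("parse"):
        params = dict(p.split("=") for p in "campaign=42&pub=7".split("&"))
    try:
        with metrics.step("enrich"):
            raise ValueError("lookup failed")  # simulated failure
    except ValueError:
        pass
    print(dict(metrics.timings_ms), metrics.error_rate("enrich"))
```

Wrapping every step the same way means a slow or failing stage shows up under its own name rather than as a general rise in machine load.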

4. The importance of isolation
Proper isolation is essential to achieving the highest availability. This is reflected in our architecture today through the Event Exchange Platform (EXP), our system for processing and enriching billions of events. The data required for operation is available in every cluster, and we isolate clusters from one another by using a pull strategy, which prevents a local problem from spreading. Each cluster has its own local part of the EXP. The central EXP component collects data from the other data centres in pull mode: if the link breaks, the data will still be stored in the local cluster, ensuring no data loss until the link is restored. The same applies to the system configuration, which is replicated to every cluster through a pull mechanism.
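The pull strategy can be sketched in a few lines. This is a simplified model under stated assumptions, not the EXP implementation: the class names are hypothetical, events are plain strings held in memory, and a `link_up` flag stands in for a real network failure.

```python
class LocalEventStore:
    """Per-cluster buffer: events stay here regardless of the link state."""

    def __init__(self) -> None:
        self._events = []

    def append(self, event: str) -> None:
        self._events.append(event)

    def read_from(self, offset: int) -> list:
        return self._events[offset:]


class CentralCollector:
    """Central side: remembers how far it has read and pulls the rest."""

    def __init__(self, store: LocalEventStore) -> None:
        self.store = store
        self.offset = 0      # index of the next event to pull
        self.collected = []

    def pull(self, link_up: bool) -> int:
        if not link_up:
            # Nothing is lost: events keep accumulating in the local store.
            return 0
        batch = self.store.read_from(self.offset)
        self.collected.extend(batch)
        self.offset += len(batch)
        return len(batch)


if __name__ == "__main__":
    store = LocalEventStore()
    central = CentralCollector(store)
    store.append("click-1")
    central.pull(link_up=True)
    store.append("click-2")
    central.pull(link_up=False)   # outage: the local cluster keeps working
    store.append("click-3")
    central.pull(link_up=True)    # link restored: the backlog is collected
    print(central.collected)
```

Because the local cluster never waits on the central component, an outage on the central side or on the link cannot back up into the cluster that is serving traffic.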

Our Tracking platform today is centered around an Event Platform (EXP) processing billions of events every month.

5. Your team is what keeps it together
Having the right people and the right team setup is the most crucial factor of all. We've found that creating a fail-safe environment helps promote innovation, and we understand that failure is a part of work. As our team and our projects grew, we adjusted the way we work. We made the decision to create cross-functional, self-organized teams. They are empowered to make decisions themselves, try out new things, and take ownership of the systems they build. This way they can do what they do best: building excellent, scalable software and troubleshooting issues with the system.

How we evolved
We learned a lot over the years, building our tracking solution and other systems. We reached limits, we failed, we tried again, improved and iterated. Evolutionary development with an agile, iterative and incremental approach has become part of our process.

Scaling is about more than just technology. It requires people: experts, talented engineers and, perhaps most importantly, the experience to put the pieces together. It's important for us to live a culture of learning and improvement. We found that the best way to cultivate a culture of innovation was to ensure a fail-safe environment. Being agile also helps to minimize risk: an iterative, incremental approach limits the scope of failure. This culture and environment helped us to scale the team as well as the technologies more efficiently.

As VP of Technology at Glispa Global Group, Bastian is responsible for building and scaling Glispa’s services and infrastructure, overseeing the company’s technology assets, and shaping a technology vision for the business. He has been part of the Glispa team since 2010.

Watch the video of our Tech Open Air 2016 event, co-hosted with Stack Overflow, below!

To watch the recorded live stream of the event, including all speeches, click here.