Responding to multi-million concurrency is an interesting challenge confronted by ad-tech vendors. Amagi THUNDERSTORM, our Server-Side Ad Insertion (SSAI) platform achieved this feat for the first time across news channels, during the last US presidential elections in November 2020.
We have been handling exponential scale since then and decided to crystalize some of our learnings and insights in this blog post.
Read on to understand the key considerations and challenges we encounter while designing systems to achieve concurrency at scale. Take a look at some of the preparations that went into handling a huge spike in production traffic, scaling limitations, and overcoming challenges faced during peak load.
SSAI requirements during a normal load
We were handling ad monetization for around 100K concurrent users for a typical linear channel on a free ad-supported streaming TV (FAST) platform, with close to 500 million impressions per month. Internal components and most of the AWS services we use were within the limits to support this requirement. All the previous stress tests, performance tests, and scaling tests done were to accommodate a maximum of 100K concurrent users for a channel.
Anticipation of a huge surge in traffic
During the US presidential elections, we anticipated up to 2 million concurrent users, especially during the presidential debates and election days. Due to the nature of the event, it was difficult to judge when the surges would occur or what the magnitude of the surges would be. Hence, we decided to prepare for the highest load to give end-users a smooth and enjoyable experience during the events.
Our preparations included precautionary measures like pre-scaling of some components and services - and giving some heads-up but not too earlier than the events.
We identified components where the load is a factor of number of users, pre-scaled and over-provisioned till 80% of the highest load expected and let our in-house automatic scaling component handle the load after that. This was done to avoid any component failures because of the bursty nature of requests.
Along with these, several monitoring and alerting scripts were added to promptly identify issues anywhere in the system.
Partnering with AWS and solving scaling challenges
With the high load, we worked with the AWS support team to increase limits for most of the services we use. In the case of components, we decided to make code changes and quickly deploy in production.
S3 bucket per-prefix upload limit
Amagi THUNDERSTORM stores all transcoded and re-timestamped AD segments in S3 bucket with channel and current DATE as prefixes. Since several segments were being created and uploaded concurrently in bursts and only for a few channels, it breached the per-prefix upload limit of 3,500 and started to throttle with the exception, *Service: Amazon S3; Status Code: 503; Error Code: SlowDown;*
S3 is designed in such a way that it spins up machines based on the load. To handle the sudden spike in S3 PUT API calls and avoid exceptions, we created more prefixes under the current prefix, balanced the load between them, and changed the prefix structure.
We use Lambda to re-timestamp segments on the fly for each user and serves fillers in case of any errors. AWS has a limit of per region burst concurrency and the best count is 3000. Lambda started to throttle once this limit was reached.
To cope with the sudden spike in lambda calls, we used AWS lambda provision concurrency, a feature that keeps functions initialized and hyper-ready to respond in double-digit milliseconds. This helped us ensure all the requests were served by initialized instances with very low latency.
To circumvent this, we used our in-house automatic scaling component to scale Kinesis shards out or in using split/merge mechanism by maintaining 2^k shards at any given point of time.
Elasticache-Redis no. of connections limit
Elasticache-Redis is used to store replaced ad segments details and quartile tracking URLs per user. 65,000 is the maximum number of clients that can connect to elasticache. So, we hit this limit when close to 4,000 clients, with min_active_connections as 50.
We decreased each client to have a minimum of 10 connections and a maximum of 50, so each client will have a min of 10 connections and it will go till 50 if it cannot use any of the already created connections. This gave us 4,000 * 10 connections, which is acceptable.
ELB per target group instances
There is a limit on how many EC2 instances can be launched in a particular target group. The default limit is 1,000. With a huge load, more than 1,000 app servers were needed.
The ideal solution is to have multiple target groups and attach them to a single ELB based on weightage, which could give us more instances per ELB. The next phase is to have a multiple ELB based design.
AWS batch and transcoding of ads
We use AWS Batch to do transcoding of ad assets, which is very CPU-intensive. Here we were using SPOT_CAPACITY_OPTIMIZED allocation strategy. Since there were close to 10,000 jobs, the number of instances required rose to more than 2,000. We had configured the compute environment to allocate c5.xlarge instances. However, due to higher number of instances, all the subnets were filled and there were no IPs left to assign new machines.
We tuned AWS batch to use BEST_FIT_PROGRESSIVE instead of SPOT_CAPACITY_OPTIMIZED which could select additional instance types that are large enough to meet the requirements of the jobs in the queue.
For communication across microservices, we needed to use different pipelines based on the service. Since scale across components were in a different order, we ended up using different event streaming platforms or different connection strings of the same streaming platform. As the load was heavy and live streaming shouldn’t stop at any point in time, restarting the component/s was not an option.
We therefore needed a solution to change the communication pipeline at runtime without affecting the end-user. We introduced per topic-based connection config maps, which would define the connection configurations for a topic. Earlier, we were using global configuration for all connections, so we had a good fallback to global configuration, if per topic configuration was not defined. The configs were reloaded every 20s based on version/modified date so when per component config gets introduced in 20s, it would start sending a different host accordingly.
In addition to what we discussed so far, a few more challenges need to be considered while designing such high concurrency systems:
Reducing latency in communication pipeline (say, Kafka to SQS to gRPC)
Periodic re-fetch of new users (using Bloom Filters can help, with the right design approach)
Scaling static public IP allocation for pixel trackers (using a managed NAT is a possible solution)
This post is a joint contribution of the entire Amagi THUNDERSTORM development team and we are eagerly working on some of the bigger challenges in designing next generation designs for higher concurrency and lower latency. We will come back with more on this topic in future. Until then, thank you for reading!