Enabling real-time esports tracking with streaming video object detection


The esports industry has seen tremendous growth in recent years. Each tournament is streamed live to millions of viewers around the world, increasing the demand for live updates on games and players, for example for live betting.
Such information can sometimes be accessed directly through the games’ servers and API integrations, but in most cases, it is limited in volume and accessibility.

To improve the experience of watching these tournaments, Abios Gaming provides an API for live information on games, teams, and players. To strengthen Abios Gaming’s offering and enable real-time monitoring of esports games, Modulai joined forces with their tech team to build a deep learning object detection solution that extracts information on the fly from real-time video streams of gaming tournaments. The system was required to detect both text and icons, and the extracted information was then used to update the status of live matches in games such as Counter-Strike: Global Offensive (CS:GO), League of Legends, Dota 2, and Fortnite. Moreover, the framework needed to be general enough to be easily applied to new games in an AWS production environment.

In this post, we will walk through how we addressed the first selected use case (CS:GO). We will introduce some essential methodological pieces (such as scene text detection [1] [2] to detect and recognize text from game frames, and Mask R-CNN [3] to detect icons) as well as synthetic data generation. We then describe the results and end with some conclusions and lessons learned.

Counter-Strike: Global Offensive

CS:GO [4] is a multiplayer first-person shooter game in which two teams of five players, the Terrorists and the Counter-Terrorists, fight against each other. A player who dies is removed from the game until the following round. A team wins a round when all players on the opposing team are dead; alternatively, the Terrorists win by planting the bomb and letting it detonate, and the Counter-Terrorists by defusing it. There are different game modes, but those were not considered in this use case. In competitive tournaments, the first team to win 16 rounds is declared the winner.
At the beginning of each round, players have an amount of money to spend on buying equipment, such as rifles, grenades, knives, and protections. Winning a round results in more money available in the following one.
While watching a live stream of the game, information about the match is available through an overlay showing details such as the weapons equipped, the health of the players, the team score, and more. This information is shown both as text (e.g. available money, player name) and icons (e.g. equipped weapon, available grenades), and is the information Abios needs to extract for its API. However, competitive event organizers design a personalized overlay for each event, making it non-trivial to develop heuristics that automatically extract the relevant information using only the game frames. An example of a frame with overlay is shown below.

Technical methodology

In this section, we describe the algorithms behind our approach for text detection and recognition, as well as the model used for detecting icons. In particular, we use a text detection and localization model, pre-trained on the SynthText dataset to detect text in the images, meaning that given an image, the model will return bounding boxes for those parts of the image containing text. The boxes are cropped and fed to STR, a model responsible for parsing the text in each of these input boxes. For the icons, a Mask R-CNN is trained on generated synthetic data.

Text detection, localization, and recognition

The text detection and localization model (TDL) is a scene text detection method that identifies text by exploring characters and the affinities between them. It uses a Convolutional Neural Network (CNN) architecture to predict both character regions and the affinity between characters, as illustrated below.

The architecture has an encoder-decoder structure based on VGG-16 [5] with batch normalization, with skip-connections to the decoder at different depths, similar to the U-Net architecture [6]. Most scene text detection datasets contain only word-level annotations, and hence lack the information necessary to train character-level models. During training, character and affinity region annotations are therefore generated from synthetic datasets such as SynthText [7].

Training is performed on such annotations and can be combined with weakly supervised learning on datasets with word-level annotations. The training loss is simply a sum of mean squared error over pixels for character region score and affinity score, multiplied by a confidence score in case of using weakly supervised learning. At inference, TDL uses a post-processing pipeline to combine character region and affinity region predictions to obtain word-level bounding boxes.
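
A minimal NumPy sketch of this loss, assuming per-pixel ground-truth region and affinity score maps and an optional per-pixel confidence map (the function and argument names are ours, not from the papers):

```python
import numpy as np

def tdl_loss(pred_region, pred_affinity, gt_region, gt_affinity, confidence=None):
    """Pixel-wise mean squared error on character-region and affinity
    scores, optionally weighted by a confidence map (weak supervision)."""
    if confidence is None:
        confidence = np.ones_like(gt_region)
    region_term = confidence * (pred_region - gt_region) ** 2
    affinity_term = confidence * (pred_affinity - gt_affinity) ** 2
    return float(np.mean(region_term + affinity_term))
```

With full character-level supervision the confidence map is simply 1 everywhere; under weak supervision it down-weights pixels from words whose pseudo character annotations are unreliable.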

The work in [2] introduces a new framework for performing Scene Text Recognition (STR). The framework is composed of four stages; it takes an image as input and outputs the text contained in it. Note that, while in TDL the input can be any scene image containing text, in STR the input has to be the “text image”, i.e. only the portion of the image containing the text. The four stages of the STR framework are:

  • Transformation: In this stage the input image is normalized. The transformation is performed with the so-called “Thin-Plate Spline” (TPS), a variant of the Spatial Transformer Network [8], to simplify the feature extraction phase and remove the need for learning invariant representations. STR allows users to either use TPS or leave the input image unchanged.
  • Feature Extraction: The input is now processed with a CNN architecture to extract a visual feature map. The available architectures are VGG [5], ResNet [9], and Gated RCNN [10].
  • Sequence Modeling: The visual features are further processed and transformed into contextual features using a Bidirectional LSTM [11], adding contextual information. As for the transformation step, the sequence modeling stage is optional.
  • Prediction: The final stage of STR predicts a sequence of characters from the previous output. There are two available options: Connectionist Temporal Classification [12] or Attention mechanism [13]. 
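
Since two of the four stages are optional, the framework is essentially a configurable composition of stage modules. A toy sketch of that idea (the stage callables here are stand-ins, not the real TPS/CNN/BiLSTM/decoder implementations):

```python
def build_str_pipeline(transform=None, extract=None, sequence=None, predict=None):
    """Compose the four STR stages into one callable; the transformation
    and sequence-modeling stages may be omitted, mirroring the
    framework's configurable module combinations."""
    stages = [s for s in (transform, extract, sequence, predict) if s is not None]

    def run(image):
        out = image
        for stage in stages:
            out = stage(out)  # each stage consumes the previous stage's output
        return out

    return run
```

A concrete STR instance would plug in, e.g., TPS for `transform`, a ResNet for `extract`, a BiLSTM for `sequence`, and a CTC or attention decoder for `predict`.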

For more information on STR, the reader is referred to [2].

Mask R-CNN

Mask R-CNN [3] is a CNN architecture that performs both object detection and instance segmentation. It extends Faster R-CNN [14] by adding a branch that predicts a segmentation mask for each Region of Interest (RoI), in parallel with the existing object detection branch.

The first part of Mask R-CNN is a Region Proposal Network (RPN), identical to the one proposed in Faster R-CNN, which outputs a set of RoIs. What sets Mask R-CNN apart from earlier attempts is that each of these RoIs is fed to the network head, which consists of two parallel branches: one performs bounding box regression and classification, while the other predicts a binary segmentation mask. Thus, Mask R-CNN only predicts the binary mask in the segmentation branch and leaves the classification task to the object detection branch. The image is taken from [3].
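
This decoupling can be illustrated with a toy NumPy sketch (shapes and names are ours): the mask branch emits one mask per class for each RoI, and the class chosen by the detection branch selects which mask to keep.

```python
import numpy as np

def select_masks(class_scores, mask_logits, threshold=0.5):
    """For each RoI, keep the mask of the class predicted by the
    detection branch and binarize it with a sigmoid + threshold.
    class_scores: (N, K) per-RoI class scores;
    mask_logits:  (N, K, H, W) per-class mask logits."""
    classes = class_scores.argmax(axis=1)                    # (N,) predicted class per RoI
    per_roi = mask_logits[np.arange(len(classes)), classes]  # (N, H, W) selected masks
    probs = 1.0 / (1.0 + np.exp(-per_roi))                   # per-pixel sigmoid
    return classes, probs > threshold
```

Because each mask is binarized independently, there is no per-pixel competition between classes, which is exactly the property that distinguishes Mask R-CNN's masks from a semantic-segmentation softmax.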

Approach and results

Text recognition

The pipeline for recognizing text from real-time game frames is composed of a TDL and an STR step. Each frame is first processed with TDL, obtaining as output a list of bounding boxes containing text. These boxes are cropped and fed to STR, which then outputs the corresponding text. Pre-trained versions of TDL and STR were used, specifically versions trained to recognize special characters, which are commonly used in player and team names. An example of recognized text from a frame is shown below.
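
The glue between the two steps is simple box cropping. A sketch, assuming axis-aligned (x, y, w, h) boxes from TDL and a frame as a NumPy array (the actual TDL output format may differ, e.g. quadrilaterals):

```python
import numpy as np

def crop_text_boxes(frame, boxes):
    """Cut out each detected text region so it can be fed to STR.
    frame: (H, W, C) image array; boxes: list of (x, y, w, h) tuples."""
    crops = []
    for x, y, w, h in boxes:
        crops.append(frame[y:y + h, x:x + w])  # row slice = y, column slice = x
    return crops
```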

Detecting game icons 

Deep learning methods have been successfully used to detect logos (which to some extent are very similar to icons) [17], [18], thanks in part to the possibility of generating synthetic data [19]. However, in the existing literature, such methods are used to predict logos in natural images, where logos come in different shapes, lighting conditions, and orientations, and may be occluded or only partially visible. In tasks of similar complexity, traditional computer vision methods have shown poor performance and have been outperformed by deep-learning-based models.

In our use case, the problem complexity is much lower. The icons always have the same orientation, being perpendicular to the screen, and some icons have a mirrored counterpart, depending on which side of the screen the team is located. For such problems, traditional methods like template matching are usually a good starting point. However, template matching did not provide satisfactory results: with different overlays using different icon sizes, it often missed the correct icon and produced false positives. In the following sections, we describe the deep-learning-based solution we developed, which consists of dataset creation and model training.
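
To make the baseline concrete, here is a minimal template-matching sketch (a plain sum-of-squared-differences sweep in NumPy rather than OpenCV's matchTemplate): it finds a region that matches the template's size exactly, which is precisely where it breaks down when different overlays render icons at different scales.

```python
import numpy as np

def match_template(image, template):
    """Slide the template over a grayscale image and return the
    top-left (row, col) position with the smallest sum of squared
    differences."""
    H, W = image.shape
    h, w = template.shape
    best, best_pos = np.inf, (0, 0)
    for y in range(H - h + 1):
        for x in range(W - w + 1):
            score = np.sum((image[y:y + h, x:x + w] - template) ** 2)
            if score < best:
                best, best_pos = score, (y, x)
    return best_pos
```

Rescaling the template even slightly makes the per-pixel differences non-zero everywhere, so a multi-scale search (or a scale-invariant detector) is needed as soon as icon sizes vary.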

Synthetic Data Generation for icon detection

In this case, there were a total of 67 icons that could potentially be identified. These include the kind of weapon equipped, the grenades owned by each player, and whether the bomb is planted or not, among others. Examples of icons are shown below.

These icons were overlaid onto a large number of random images to form synthetic training data. We used websites such as https://picsum.photos for backgrounds, as well as thousands of frames from the game itself. For each background, a random number of icons was chosen and randomly positioned on top of the image, taking care not to place overlapping icons (which never happens in CS:GO). The color, position, size, orientation (original or mirrored), and transparency of each icon were randomly selected.
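
A simplified sketch of that generation loop, with icons and backgrounds as NumPy arrays and the color/size/transparency augmentations omitted (the rejection-sampling placement is ours; the real pipeline may differ):

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed for reproducibility

def place_icons(background, icons, max_tries=50):
    """Paste icons at random non-overlapping positions; returns the
    composed image and an (x, y, w, h) box for each placed icon."""
    canvas = background.copy()
    H, W = canvas.shape[:2]
    boxes = []
    for icon in icons:
        h, w = icon.shape[:2]
        for _ in range(max_tries):
            x, y = rng.integers(0, W - w), rng.integers(0, H - h)
            # accept only if the new box is separated from every placed box
            if all(x + w <= bx or bx + bw <= x or y + h <= by or by + bh <= y
                   for bx, by, bw, bh in boxes):
                canvas[y:y + h, x:x + w] = icon
                boxes.append((x, y, w, h))
                break
    return canvas, boxes
```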

Once an icon is added to the background image, it is straightforward to obtain the corresponding bounding box. Moreover, by using the OpenCV function findContours(), it is possible to get quite precise segmentation masks, especially for icons that do not contain holes. The class (corresponding icon), bounding box, and segmentation mask were then saved in a standard object detection format, such as COCO or Pascal VOC.
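
Deriving the bounding box and a COCO-style annotation from a binary icon mask can be sketched without OpenCV (field names follow the COCO convention; the contour-based `segmentation` entry is omitted here):

```python
import numpy as np

def mask_to_coco_annotation(mask, category_id, annotation_id, image_id):
    """Compute the tight (x, y, w, h) box around a binary mask and wrap
    it in a COCO-style annotation dict."""
    ys, xs = np.where(mask)
    x, y = int(xs.min()), int(ys.min())
    w, h = int(xs.max() - x + 1), int(ys.max() - y + 1)
    return {
        "id": annotation_id,
        "image_id": image_id,
        "category_id": category_id,
        "bbox": [x, y, w, h],        # COCO uses [x, y, width, height]
        "area": int(mask.sum()),     # COCO stores the mask area, not w*h
        "iscrowd": 0,
    }
```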

Mask R-CNN for Icons

Finally, we trained Mask R-CNN on the generated synthetic images and used the trained model to detect icons in real CS:GO frames. We found that training the model on the combined tasks of object detection and segmentation gave superior results compared to training on object detection alone (our earliest attempts used dated architectures based only on bounding boxes). Also, objects such as grenades are relatively small and can be challenging to detect. We obtained significant improvements by reducing the RPN anchor sizes, as suggested in [20].
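
What "reducing the RPN anchor sizes" amounts to can be seen by generating the anchor shapes the RPN tiles over the image. A sketch (the default sizes below mirror common Faster/Mask R-CNN settings, not necessarily the exact values we used):

```python
import numpy as np

def anchor_shapes(sizes, aspect_ratios):
    """Return (w, h) for every size/ratio combination, keeping the
    anchor area equal to size**2 as in Faster R-CNN."""
    shapes = []
    for s in sizes:
        for r in aspect_ratios:
            h = s * np.sqrt(r)   # taller anchors for larger ratios
            w = s / np.sqrt(r)   # width shrinks so that w * h == s**2
            shapes.append((w, h))
    return np.array(shapes)

# Typical defaults vs. a shifted-down set for small objects like grenades.
default = anchor_shapes([32, 64, 128, 256, 512], [0.5, 1.0, 2.0])
small_icons = anchor_shapes([8, 16, 32, 64, 128], [0.5, 1.0, 2.0])
```

Shifting the size range down gives the RPN anchors that can actually overlap a tiny grenade icon well enough to be assigned as positives during training.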

Conclusions and lessons learned

  • OCR models trained on large bodies of synthetic text with high variability can vastly outperform traditional frameworks (such as Tesseract) in situations where the text is varied and deformed.
  • Although the icon-detection part, in essence, is a template matching problem, the scale invariance we were after made it more feasible to approach it as an object detection problem.
  • Synthetic data generation has again (it is becoming a Modulai speciality by now) proven to be an integral part of a project where annotated data is scarce or costly. By sampling from distributions with high variability (transformations of the icons) and using a large set of background images, we were able to train a robust and well-performing model that surpassed expectations.


Gianluigi Silvestri

ML Engineer

Puya Sharif

Co-founder, ML engineer
