Unlocking the Complexities of Synthetic Data: Challenges, Lessons & The Way Forward
June 08, 2023

While synthetic data has immense potential, it’s important to remember that not all synthetic data is created equal. Synthetic data is a by-product of the process and theory used to generate it, and it can fall prey to the same challenges that often plague real data:
1. Biased Data (Generation): One of the most common challenges in working with data is bias. It occurs when training data is drawn from familiar patterns and sources, leading to suboptimal performance when the model encounters different classes, groups, environments, or contexts. Essentially, biased data generation limits the model's exposure to the full spectrum of potential inputs, impeding its ability to handle novel scenarios effectively.
2. Overfitting: Overfitting occurs when a model achieves exceptional accuracy on validation data but fails on new, unseen instances. Rather than learning to discern the essential features that define the target class, an overfitted model tends to "memorize" the training and validation data. As a result, it struggles to adapt to variations in the data distribution, ultimately leading to subpar performance when faced with real-world inputs (a simple way to measure this gap is sketched after this list).
3. Inadequate Diversity: The key to successful model training is a diverse range of examples. Without this diversity, the model's understanding of the problem remains limited, leading to poor performance when faced with novel inputs. To truly grasp the data's intricacies and patterns, a rich and varied dataset encompassing the entire problem space is imperative.
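As a concrete illustration of point 2, the sketch below measures the gap between validation accuracy and accuracy on a set of genuinely unseen, real-world-like examples; a large gap is the classic symptom of a memorizing model. It assumes a PyTorch classifier and standard data loaders, and the helper names are hypothetical, not part of any particular library.

```python
import torch

@torch.no_grad()
def accuracy(model, loader, device="cpu"):
    """Fraction of correctly classified samples over a data loader."""
    model.eval()
    correct, total = 0, 0
    for images, labels in loader:
        images, labels = images.to(device), labels.to(device)
        preds = model(images).argmax(dim=1)
        correct += (preds == labels).sum().item()
        total += labels.numel()
    return correct / total

def generalization_gap(model, val_loader, real_world_loader, device="cpu"):
    """A large gap between validation accuracy and accuracy on unseen,
    real-world-like data suggests the model has memorized its training
    set rather than learned the features that define the target class."""
    return (accuracy(model, val_loader, device)
            - accuracy(model, real_world_loader, device))
```

Tracking this gap over the course of training, rather than validation accuracy alone, makes overfitting visible before the model ever reaches production.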
A Lesson in Poor Performance of Synthetic Data
In this publication by Lockheed Martin, the limitations of a model trained on synthetic data were brought to light. However, upon closer examination, it becomes apparent that the study had several shortcomings:
- Lockheed Martin used only three different aircraft models.
- All airplanes were placed on a tarmac background.
As a result, the classifier developed a bias towards this specific contextual factor, making decisions primarily based on the image background rather than the foreground subject. This oversight in data generation produced a classifier predisposed to treat the tarmac background as a defining feature, rather than discerning the essential characteristics of the C-130 aircraft itself.
Attention heatmap showing where the neural network looks to make its prediction (red: area of focus). The network focused primarily on the background in 2 out of 4 images.
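Heatmaps like the one above can be reproduced with a Grad-CAM-style visualization, which highlights the image regions a classifier relies on and makes background bias easy to spot. Below is a minimal sketch assuming a PyTorch ResNet-style classifier; the model, target layer, and input are illustrative and are not taken from the Lockheed Martin study.

```python
import torch
import torch.nn.functional as F
from torchvision import models

def grad_cam(model, image, target_layer, class_idx=None):
    """Compute a Grad-CAM heatmap for a (1, 3, H, W) input tensor,
    showing which regions drive the classifier's prediction."""
    activations, gradients = {}, {}

    def fwd_hook(_, __, output):
        activations["value"] = output

    def bwd_hook(_, __, grad_output):
        gradients["value"] = grad_output[0]

    h1 = target_layer.register_forward_hook(fwd_hook)
    h2 = target_layer.register_full_backward_hook(bwd_hook)

    model.eval()
    logits = model(image)
    if class_idx is None:
        class_idx = logits.argmax(dim=1).item()
    model.zero_grad()
    logits[0, class_idx].backward()

    h1.remove()
    h2.remove()

    # Weight each feature map by the mean of its gradients, then combine.
    weights = gradients["value"].mean(dim=(2, 3), keepdim=True)
    cam = F.relu((weights * activations["value"]).sum(dim=1, keepdim=True))
    cam = F.interpolate(cam, size=image.shape[2:], mode="bilinear",
                        align_corners=False)
    cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)
    return cam.squeeze().detach()  # (H, W) heatmap in [0, 1]

# Usage: check whether the classifier attends to the aircraft or the tarmac.
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
heatmap = grad_cam(model, torch.rand(1, 3, 224, 224), model.layer4)
```

Overlaying the returned heatmap on the input image makes it immediately obvious when a model is keying on the background rather than the object of interest.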
It’s unsurprising that Lockheed Martin’s study concluded that synthetic data does not provide measurable benefits. However, their study epitomizes the challenges of training a computer vision (CV) model. People assume CV models learn the way a human does: that if you collect 1,000 photos of an object in the real world, the model will be robust enough to perform reliably in production based on those images.
While the CV model may perform well under typical scenarios, eventually the real world’s noise and complexity will come into play. Cameras will jitter, sensors will be affected by weather, skies will turn orange from wildfire, objects will show up in the least expected places, and vehicles you only expect to see in the movies will suddenly barrel down the road.
While the challenges posed by the complexities of the real world may seem daunting for computer vision models, there are strategies and approaches that can help mitigate these issues.
The Bifrost Approach to Synthetic Data
Bifrost works to overcome the challenges of building CV models, whether with real or synthetic data. The goal is the right blend of similarity to how things will be perceived in the production deployment and diversity of conditions. We achieve that through four key areas:
1. Parametrically Infinite Variation of Assets
a. Textures, colors, orientations, sizes, weapon configurations, etc. Bifrost can generate 1,000s of different variations of any asset
2. A similarly diverse range of backgrounds (tarmac, forest, grass, etc.)
a. Using a diverse set of backgrounds, we ensure the model focuses on the features of the asset itself rather than the background.
3. Domain Adaptation:
a. A common issue with synthetic data is a distinct difference between computer-generated and real imagery. AI models can pick up on this difference, and with unoptimized synthetic imagery, performance can drop the longer you train on it. Our data, however, has been specifically tuned to emulate specific sensors at the pixel and feature level, resulting in more stable training over time and higher accuracy overall.
4. Sensor-Specific Data Generation
a. Bifrost-generated data is built to match specific sensor attributes, optimizing performance for that particular sensor. Such specificity greatly improves performance. Our post-processing techniques emulate real sensor properties and artifacts (a toy sketch of points 1, 2, and 4 follows this list).
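To make points 1, 2, and 4 concrete, here is a toy sketch of what parametric scene randomization and sensor emulation can look like. The parameter names, value ranges, and post-processing steps are hypothetical illustrations, not Bifrost’s actual pipeline or API.

```python
import random
import numpy as np

# Hypothetical parameter ranges; a real pipeline would feed these values
# to a renderer to produce labelled images.
BACKGROUNDS = ["tarmac", "forest", "grass", "desert", "snow"]
LIVERIES = ["grey", "camo", "white", "bare-metal"]

def sample_scene_params():
    """Points 1 & 2: draw a random asset configuration and background so
    that no single context dominates the training set."""
    return {
        "background": random.choice(BACKGROUNDS),
        "livery": random.choice(LIVERIES),
        "heading_deg": random.uniform(0, 360),
        "scale": random.uniform(0.5, 1.5),
        "sun_elevation_deg": random.uniform(5, 85),
    }

def emulate_sensor(image, noise_std=0.01, blur_kernel=3, vignette_strength=0.2):
    """Point 4: post-process a rendered float image (H, W, 3) in [0, 1]
    to mimic sensor noise, optical blur, and vignetting."""
    h, w, _ = image.shape
    # Simple box blur as a stand-in for the optical point-spread function.
    pad = blur_kernel // 2
    padded = np.pad(image, ((pad, pad), (pad, pad), (0, 0)), mode="edge")
    blurred = np.zeros_like(image)
    for dy in range(blur_kernel):
        for dx in range(blur_kernel):
            blurred += padded[dy:dy + h, dx:dx + w]
    blurred /= blur_kernel ** 2
    # Radial vignette: darken gradually toward the corners.
    yy, xx = np.mgrid[0:h, 0:w]
    r = np.sqrt(((yy - h / 2) / (h / 2)) ** 2 + ((xx - w / 2) / (w / 2)) ** 2)
    vignette = 1.0 - vignette_strength * np.clip(r, 0, 1) ** 2
    # Additive Gaussian read noise on top of the blurred, vignetted image.
    noisy = blurred * vignette[..., None] + np.random.normal(0, noise_std, image.shape)
    return np.clip(noisy, 0.0, 1.0)
```

The key idea is that the asset and its context are varied independently, while the sensor model is applied uniformly, so the only consistent signal left for the network to exploit is the object itself.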
As our F-16 benchmark shows, when synthetic data is implemented correctly, it forces the neural network to focus only on the object of interest (the F-16) to make its decision, rather than taking shortcuts and relying on background information.
Focus heatmap for a classifier trained on Bifrost synthetic data. Notice how focus is primarily on the main body of the aircraft.
The result is a trained model that performs better across a wider range of environments, scenarios and objects. In other words, more performance out of the box!
Want to learn more about Bifrost and how we are enabling companies to build better computer vision models, faster? Reach out at hello@bifrost.ai or here!
Want to play around with Bifrost’s F-16 dataset? Access it here!