7investing advisors Anirban Mahanti and Simon Erickson discuss the progress being made using synthetic data in AI.
7investing lead advisor Simon Erickson recently attended the MIT EmTech Digital conference. He’ll be sharing his key takeaways in an upcoming 7investing Advisor Update, though he also had a recent conversation with fellow lead advisor Anirban Mahanti about the use of synthetic data to train AI models.
This is intriguing, because synthetic data was already being used in the research Anirban was doing 15 years ago with his colleagues and students. The two briefly discussed the topic, how and where synthetic data could be useful for AI, and a few companies that might be interesting for investors.
Simon Erickson
Andrew Ng, who built a good portion of the AI that powers Google and Baidu, thinks data-centric AI is the wave of the future. That is, focusing on using new tools to systematically label the data that feeds the AI, rather than feeding it a ton of unstructured data.
Also a ton of discussion about using synthetic data to train AI models now. Saves a ton of time and money that’s typically been spent on manually annotating and labeling data.
_
Anirban Mahanti
Did you get a sense of why synthetic data is such a big deal? Machine learning folks have always used synthetic data. Also, there are semi-supervised approaches where one mixes labelled data with unlabelled (but real-world) data to improve the accuracy of models. My colleagues, students, and I were using a mix of labelled and unlabelled data to train models back in 2005/2006.
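For readers who have not seen a semi-supervised approach, here is a minimal sketch of one common variant, self-training (also called pseudo-labeling). It is not the method Anirban's group used; the classifier (a 1-D nearest-centroid rule), the data points, and all function names are assumptions made up for illustration.

```python
# Toy self-training (pseudo-labeling) with a 1-D nearest-centroid classifier.
# All data and names here are hypothetical, made up for illustration.

def centroids(points, labels):
    """Mean of the points in each class."""
    sums, counts = {}, {}
    for x, y in zip(points, labels):
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}

def predict(cents, x):
    """Label of the closest centroid."""
    return min(cents, key=lambda y: abs(x - cents[y]))

def self_train(labelled_x, labelled_y, unlabelled_x, rounds=5):
    """Fit on labelled data, then repeatedly pseudo-label the unlabelled pool and refit."""
    cents = centroids(labelled_x, labelled_y)
    for _ in range(rounds):
        pseudo_y = [predict(cents, x) for x in unlabelled_x]
        cents = centroids(labelled_x + unlabelled_x, labelled_y + pseudo_y)
    return cents

# Only two hand-labelled points, plus a larger pool of unlabelled ones.
labelled_x = [1.0, 9.0]
labelled_y = ["low", "high"]
unlabelled_x = [0.5, 1.5, 2.0, 3.0, 7.0, 8.0, 8.5, 9.5]

model = self_train(labelled_x, labelled_y, unlabelled_x)
print(model)
print(predict(model, 4.0))
```

The point of the sketch is the ratio: two labelled examples do the initial work, and the unlabelled pool refines the class centres without anyone annotating it by hand.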
_
Simon Erickson
It was the Unity (NYSE: U) guy who was really bullish on it. A few notes from that session: Manually annotating training data takes time and is expensive. Plus, there are privacy concerns. Using synthetic data to pre-train your AI models (especially computer vision) saves time and money. “We’re not trying to replicate the real world. We’re trying to efficiently train the models.” Unity is being used to create this synthetic data. [SAE note: This will be extremely important in the Metaverse.]
Simulation intelligence & digital twins: Simulations use control models to adapt to the real world, which in turn uses sensors attached to IoT devices to improve the simulations. A continual feedback loop.
He seemed to suggest it was in fact a big deal that should become the standard for training AI models. You then complement the models trained on synthetic data with real-world sensors, or other checks that verify whether the physics and models are accurate. So the feedback loop is important too.
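The feedback loop described here can be sketched in a few lines, under heavy simplifying assumptions. The simulator below is a stand-in (output equals a single gain times the input), the sensor readings are invented, and the update rule is just one simple choice; none of it comes from the conference session or from Unity's tooling.

```python
# Toy sketch: pre-train on simulated data, then correct the simulator with real readings.
# The simulator, the "sensor" readings, and every number are hypothetical.

def simulate(gain, inputs):
    """Stand-in simulator: the output is assumed to be gain * input."""
    return [gain * x for x in inputs]

def fit_gain(inputs, outputs):
    """Least-squares slope through the origin: sum(x*y) / sum(x*x)."""
    return sum(x * y for x, y in zip(inputs, outputs)) / sum(x * x for x in inputs)

inputs = [1.0, 2.0, 3.0, 4.0]

# Step 1: pre-train the model purely on synthetic data from the simulator.
sim_gain = 2.0
model_gain = fit_gain(inputs, simulate(sim_gain, inputs))

# Step 2: real-world sensor readings disagree with the simulation.
sensor_readings = [2.5, 5.0, 7.5, 10.0]

# Step 3: feedback loop. Nudge the simulator toward the sensors,
# then retrain the model on fresh synthetic data.
for _ in range(20):
    real_gain = fit_gain(inputs, sensor_readings)
    sim_gain += 0.5 * (real_gain - sim_gain)
    model_gain = fit_gain(inputs, simulate(sim_gain, inputs))

print(round(model_gain, 3))
```

The model never trains on the sensor readings directly; they are used only to correct the simulator, which then produces better synthetic training data.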
_
Anirban Mahanti
Okay … seems like industry folks are now catching up to the decades-old state of the art!
Synthetic data & simulations are used by both Waymo & Tesla for their autonomous driving neural nets. Tesla, for instance, talked at its 2021 AI Day about using manual labeling, automatic labeling, and simulations for its training data sets. Here’s a link to some approaches being used by Waymo. Maybe there’s something specific about the gaming environment and/or metaverse?
_
Simon Erickson
Yes I think so. It seems like most of the enterprise world isn’t as far along with AI as those outside of the industry might assume. There was a stat earlier in the program that said only like 20% of the companies who had deployed AI are getting a commercial AI out of it. The rest are still figuring it out.
_
Anirban Mahanti
That makes sense to me. Real AI is perhaps only used at a few top firms.
_
Simon Erickson
So if it’s true those 80% are still getting up and running, maybe synthetic data is what they need to consider. Yes exactly.
_
Anirban Mahanti
Yes, so that is how they teach ML/AI at school. You use synthetic data sets made available by researchers.
Or labelled data made available by researchers.
Apple (NASDAQ: AAPL), Alphabet (NASDAQ: GOOG), Tesla (NASDAQ: TSLA) are probably the leaders. Then maybe Microsoft (NASDAQ: MSFT) and Facebook (NASDAQ: FB). Rest possibly still very early days.
_
Simon Erickson
I think you were 15 years ahead of the curve if you were using it in your models back in 2006, Anirban!
_
Anirban Mahanti
A decade+ gap between state of the art and most of the industry makes sense.
_
Simon Erickson
“Data” was definitely the key word of the day though.
A decade ago everyone was just unleashing the AI on unstructured data sets. Now they’re getting smarter about what they’re training the models with.