Synthesis AI, a startup developing a platform that generates synthetic data to train AI systems, today announced it has raised $17 million in a Series A funding round led by 468 Capital with participation from Sorenson Ventures, Strawberry Creek Ventures, Bee Partners, PJC, iRobot Ventures, Boom Capital and Kubera Venture Capital. CEO and founder Yashar Behzadi said the proceeds will go toward product R&D, growing the company’s team and expanding research, especially in the area of mixed real and synthetic data.
Synthetic data, or data that is artificially created rather than captured in the real world, is increasingly used in data science as demand for AI systems grows. The advantages are obvious: while collecting real-world data to develop an AI system is expensive and laborious, a theoretically infinite amount of synthetic data can be generated to fit any criteria. For example, a developer can use synthetic images of cars and other vehicles to develop a system capable of differentiating between makes and models.
Unsurprisingly, Gartner predicted that 60% of the data used for the development of AI and analytics projects will be synthetic by 2024. One survey called the use of synthetic data “one of the most promising general techniques on the rise” in AI.
But synthetic data has limitations. Although it can mimic many properties of real data, it is not an exact copy. And the quality of synthetic data depends on the quality of the algorithm that created it.
Behzadi, of course, claims that Synthesis AI has taken significant steps to overcome these technical hurdles. A former scientist at government IT services company SAIC and creator of PopSlate, a smartphone case with an integrated E Ink display, Behzadi founded Synthesis AI in 2019 with the goal, in his own words, of “solving the data problem in AI and transform[ing] the computer vision paradigm.”
“As companies develop new hardware, new models, or expand their geographic and customer base, new training data is needed to ensure the models are working properly,” Behzadi told TechCrunch via email. “Companies are also grappling with ethical issues related to model bias and consumer privacy in human-centric products. It is clear that a new paradigm is needed to build the next generation of computer vision.”
In most AI systems, labels – which can take the form of callouts or annotations – are used during the development process to “train” the system to recognize certain objects. Teams normally have to painstakingly add labels to real-world images, but synthetic tools like Synthesis eliminate the need – in theory.
Synthesis AI’s cloud-based platform enables companies to generate synthetic image data with labels using a combination of artificial intelligence, procedural generation and VFX rendering technologies. For customers developing algorithms to tackle challenges like face recognition and driver monitoring, for example, Synthesis AI has generated around 100,000 “synthetic people” spanning different genders, ages, BMIs, skin tones and ethnicities. Using the platform, data scientists can customize the avatars’ poses as well as their hair, facial hair, clothing (e.g., masks and goggles), and environmental aspects like lighting and even the “lens type” of the virtual camera.
“Major companies in the AR, VR and metaverse space are using our diverse digital humans and a rich set of 3D facial and body landmarks to create more realistic and emotive avatars,” Behzadi said. “[Meanwhile,] our smartphone and consumer device customers use synthetic data to understand the performance of various camera modules… Several of our customers are building a car driver and occupant detection system. They leveraged synthetic data from thousands of individuals in the car’s cabin in various situations and environments to determine the optimal camera placement and overall setup to ensure the best performance.”
Some of the areas Synthesis AI endorses are controversial, it’s worth pointing out, such as facial recognition and “emotion detection.” Gender and racial bias is a well-documented phenomenon in facial analysis, attributable to shortcomings in the datasets used to train the algorithms. (Generally speaking, an algorithm developed from images of people with homogeneous facial structures and skin colors will perform worse on “face types” to which it has not been exposed.) Recent research highlights the consequences, showing that some production systems classify the emotions expressed by Black people as more negative. Computer vision tools like Zoom’s virtual backgrounds and Twitter’s automatic photo cropping, too, have historically disadvantaged people with darker skin.
But Behzadi is optimistic that Synthesis AI can reduce these biases by generating sample data – for example, diverse faces – that otherwise would not be collected. He also asserts that Synthesis AI’s synthetic data confers privacy and fair use benefits, primarily in that it is not tied to personally identifiable information (although some research disagrees) and is not copyrighted (unlike many images on the public web).
“In addition to building better models, Synthesis is focused on the ethical development of AI by reducing bias, preserving privacy, and democratizing access… [The platform] delivers perfectly labeled data on demand at increased speeds and at reduced cost compared to human-in-the-loop labeling approaches,” Behzadi said. “AI is being driven by high-quality, labeled data. As the AI space shifts from model-centric AI to data-centric AI, data is becoming the primary driver of competition.”
Indeed, synthetic data – depending on how it is applied – has the potential to address many of the development challenges faced by companies trying to operationalize AI. MIT researchers recently found a way to classify images using synthetic data. Nvidia researchers have explored a way to use synthetic data created in virtual environments to train robots to pick up objects. And almost every major autonomous vehicle company uses simulation data to supplement the real-world data it collects from cars on the road.
But again, not all synthetic data is created equal. Datasets must be transformed in order to make them usable by systems that create synthetic data, and assumptions made during those transformations can lead to undesirable results. A STAT report found that Watson Health, IBM’s beleaguered life sciences division, often gave poor and dangerous cancer treatment advice because the platform’s models were trained using erroneous, synthetic patient records rather than real data. And in a January 2020 study, researchers at Arizona State University showed that an AI system trained on a dataset of images of professors can create very realistic synthetic faces – but synthetic faces that are predominantly male and white, because the system amplifies biases contained in the original dataset.
Matthew Guzdial, an assistant professor of computer science at the University of Alberta, points out that Synthesis AI’s own white paper acknowledges that training a model on synthetic data alone generally makes it perform worse.
“I don’t see anything that really stands out here [with Synthesis’ platform]. That’s pretty standard, from a synthetic data perspective. In some cases, they are able to use synthetic data in combination with real data to help a model usefully generalize,” he told TechCrunch via email. “[G]enerally, I advise my students against using synthetic data because I find it too easy to introduce biases that actually worsen your final model… Since synthetic data is generated algorithmically (e.g., with a function), the easiest thing for a model to learn is to simply replicate the behavior of that function, rather than the actual problem you’re trying to approximate.”
Robin Röhm, co-founder of the Apheris data analysis platform, argues that quality checks must be developed for each new set of synthetic data in order to prevent misuse. The party generating and validating the dataset must have specific knowledge about how the data will be applied, he says, or run the risk of creating an inaccurate – and possibly harmful – system.
Behzadi agrees in principle, but he has his eye on expanding the number of applications Synthesis AI supports and fending off rivals like Mostly AI, Rendered.ai, YData, Datagen and Synthetaic. With over $24 million in funding and Fortune 50 customers in consumer, metaverse and robotics, Synthesis AI plans to launch new products targeting new and existing verticals, including photo enhancement, teleconferencing, smart homes and smart assistants.
“With unparalleled breadth and depth of representative human data, Synthesis AI has established itself as the gold standard provider for production-level synthetic data… The company has delivered over 10 million labeled images to support the most advanced computer vision companies in the world,” Behzadi said. “Synthesis AI has 20 employees and will grow to 50 by the end of the year.”