{"id":92708,"date":"2023-08-14T12:07:08","date_gmt":"2023-08-14T12:07:08","guid":{"rendered":"https:\/\/www.techopedia.com"},"modified":"2023-10-31T09:42:51","modified_gmt":"2023-10-31T09:42:51","slug":"the-pitfalls-of-training-ai-with-made-up-data","status":"publish","type":"post","link":"https:\/\/www.techopedia.com\/the-pitfalls-of-training-ai-with-made-up-data","title":{"rendered":"The Pitfalls of Training AI With Made-Up Data"},"content":{"rendered":"
AI is growing up, entering our lives and the workplace as the possibility of an Einstein in your pocket catches on.<\/p>\n
Whether it is writing an essay, creating complex artwork, reviewing policies, creating custom code, or writing an after-dinner speech for you, it’s already beginning to transform how we work and live.<\/p>\n
However, artificial intelligence<\/a> (AI) depends on data to do what it does.<\/p>\n Take the prompt “Create me a picture of a rose” as an example. Before getting to work, the AI first needs to learn from the data available to it.<\/p>\n It needs to learn about the typical rose shape, colors, design, petal arrangement \u2014 all the characteristics that make a rose a rose.<\/p>\n Where does the data it learns from come from? In the case discussed here, it comes from AI-generated data<\/a>, also known as synthetic data<\/a>.<\/p>\n While our focus today is training an AI system with AI-generated data, an AI system is generally trained with a mix of AI-generated and real-world data.<\/p>\n The process is designed around the legal, ethical, and confidentiality constraints on acquiring real-world data.<\/p>\n But data is critical if you are to build realistic AI systems \u2014 synthetic news readers, for example<\/a> \u2014 and given the lack of real-world data, generating synthetic data, which imitates real-world data, becomes vital.<\/p>\n For example, an AI system might be able to generate a detailed image of an airplane cockpit, but it will not exactly match the image of a real-world cockpit.<\/p>\n The source AI system generates synthetic data that is used to train the target AI model, which could be a neural network<\/a> or another machine learning algorithm<\/a>.<\/p>\n The synthetic data is as close as possible to real-world data and enables the target AI system to learn about the object the data describes, including its shapes, colors, and configuration details.<\/p>\n The synthetic data is mixed with appropriate real-world data. For example, the AI-generated image of an airplane cockpit dashboard is combined with an actual image of a cockpit dashboard.<\/p>\n This is an opportunity for the AI learning model<\/a> to learn from the data. 
It can learn not only to identify the component parts of the data, for example, the fuel meter and the altimeter, but also to distinguish between synthetic and real-world data.<\/p>\n The target AI model learns from the mixed data set<\/a>.<\/p>\n For example, suppose the objective is to enable the AI model to learn from images of different types of dogs. An acceptable outcome is that it can identify the dogs\u2019 breed names and categorize them as sheepdogs, hounds, etc.<\/p>\n The AI model is provided with a limited collection of real dog images and a larger collection of synthetic ones.<\/p>\n The learning model studies the various characteristics and parameters and learns to recognize patterns and draw inferences.<\/p>\n For example, dogs with short tails might be identified as Dobermans, or those with prominent and acutely triangular ears might be identified as German Shepherds.<\/p>\n The learning model also learns not to overgeneralize from a single parameter. For example, Dobermans have short tails, but not all dogs with short tails are Dobermans.<\/p>\n One of the most notable real-world examples of AI trained by AI-generated data is PilotNet, the self-driving car project by NVIDIA<\/a>.<\/p>\n PilotNet is a deep learning system that learns real-time driving from both synthetic data and the observation of human drivers, who drive a special car designed to collect data on driving, road conditions, traffic signs, lane markings, vehicles, and pedestrians.<\/p>\n Driving is a complex task because it involves both skills and decision-making within an extremely short period of time. 
As the human driver drives the car, PilotNet gathers data, and the relevant data is marked as highlighted pixels.<\/p>\n The deep learning system behind the self-driving car must control the driving based on the highlighted pixels that identify various objects on the road, such as pedestrians, traffic signals, and vehicles.<\/p>\n The main benefits<\/a> of training AI with synthetic data are:<\/p>\n Synthetic data is both an advantage and a limitation because it is not<\/em> real-world data, regardless of quality.<\/p>\nTraining an Artificial Intelligence<\/span><\/h2>\n
Step 1: Generating Synthetic Data<\/h3>\n
Step 2: Training data preparation<\/h3>\n
Step 3: Training the AI model<\/h3>\n
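The three steps (generating synthetic data, mixing it with a smaller real-world set, and training the target model) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the two features (tail length and ear height), the class centres, the sample sizes, and the nearest-centroid classifier standing in for the target model. The "real" data is also simulated here, so the sketch shows the shape of the workflow only, not how any production system is built.

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical two-feature description of a dog: (tail_length_cm, ear_height_cm).
# The class centres are made-up illustration values, not real measurements.
CENTRES = {"doberman": (8.0, 9.0), "german_shepherd": (30.0, 13.0)}


def sample(label, n, noise):
    """Draw n noisy feature vectors around the centre for `label`."""
    cx, cy = CENTRES[label]
    return [((random.gauss(cx, noise), random.gauss(cy, noise)), label)
            for _ in range(n)]


# Step 1: a stand-in "source system" generates a large synthetic set.
synthetic = sample("doberman", 200, 3.0) + sample("german_shepherd", 200, 3.0)

# Step 2: mix it with a much smaller real-world set (simulated here as well).
real = sample("doberman", 20, 2.0) + sample("german_shepherd", 20, 2.0)
training_set = synthetic + real
random.shuffle(training_set)


# Step 3: train the target model, here a simple nearest-centroid classifier.
def train(rows):
    """Return the mean feature vector (centroid) of each label."""
    sums = {}
    for (x, y), label in rows:
        sx, sy, count = sums.get(label, (0.0, 0.0, 0))
        sums[label] = (sx + x, sy + y, count + 1)
    return {label: (sx / c, sy / c) for label, (sx, sy, c) in sums.items()}


def predict(model, point):
    """Return the label whose centroid is closest to `point`."""
    x, y = point
    return min(model,
               key=lambda lbl: (model[lbl][0] - x) ** 2 + (model[lbl][1] - y) ** 2)


model = train(training_set)

# Evaluate on held-out "real" samples only, since that is what matters in use.
held_out = sample("doberman", 50, 2.0) + sample("german_shepherd", 50, 2.0)
accuracy = sum(predict(model, p) == lbl for p, lbl in held_out) / len(held_out)
print(f"held-out accuracy: {accuracy:.2f}")
```

Using two features instead of tail length alone reflects the point about not overgeneralizing from a single parameter. Evaluating only on the real-world samples is a deliberate choice, because a model that scores well on synthetic data alone may still fail in actual use.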
Using Data in the Real World<\/span><\/h2>\n
Benefits of Synthetic Data<\/span><\/h2>\n
\n
Limitations and Issues<\/span><\/h2>\n