> To generate what is called synthetic data, researchers train generative AI models using real human medical information, then ask the models to create data sets with statistical properties that represent, but do not include, human data.
famously, "garbage in, garbage out"
but thanks to AI, we now have the exciting innovation that you can inject garbage into the middle of the process.
you have data from actual humans. it has some statistical properties.
you could look at those statistical properties, and do research on them, looking for hidden correlations or whatever. that's been possible for decades, no need for LLMs.
or, you can take those statistical properties, ask a chatbot to generate synthetic data based on them, and then do research on that synthetic data. but...why?
any valid conclusions from the research will be based on statistical properties that were already present in the original data. the extra LLM step gains nothing, and adds the risk that the research is faulty because it latched onto some correlation the LLM made up.
this is like taking an image, saving it as a JPEG with 5% quality (or some other lossy process), and then asking an AI to upscale and enhance it for you. in the best-case, all you get is a reconstruction of the original. and realistically you'll almost certainly introduce misleading artifacts and noise.
or, scramble an egg, take a picture, and ask the chatbot to generate a picture for you of what the unbroken egg might have looked like. maybe it'll do a decent job of it...but 5 minutes ago you had the unbroken egg in your hand.
LLMs cannot reverse entropy. they cannot unscramble the egg. you can easily add randomness to a data set, but you cannot easily remove it.
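to make this concrete, here's a toy sketch (not any real synthetic-data pipeline — the "generator" is just an empirical mean and covariance, which is the simplest possible stand-in for a fitted generative model). the synthetic data's correlation can only echo the real data's correlation, plus sampling noise; anything in the synthetic set that isn't in the real set is an artifact:

```python
import numpy as np

rng = np.random.default_rng(0)

# "real" data: two correlated variables (a hypothetical stand-in for patient data)
n = 5000
x = rng.normal(size=n)
y = 0.6 * x + 0.8 * rng.normal(size=n)
real = np.column_stack([x, y])

# whatever the generator is, all it can learn are statistics of the real data.
# here the "model" is just the empirical mean and covariance.
mean = real.mean(axis=0)
cov = np.cov(real, rowvar=False)

# "synthetic" data: fresh samples from the fitted distribution
synthetic = rng.multivariate_normal(mean, cov, size=n)

real_corr = np.corrcoef(real, rowvar=False)[0, 1]
synth_corr = np.corrcoef(synthetic, rowvar=False)[0, 1]

# best case: the synthetic correlation tracks the real one, with extra noise.
# a correlation found in `synthetic` but not in `real` is pure generator artifact.
print(real_corr, synth_corr)
```

any research done on `synthetic` can, at best, rediscover what `mean` and `cov` already encoded — and you had those (and the real data) before the detour.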