A lot of researchers think their job is to build models. They don't want to collect their own data, so they go find whatever dataset they can on Kaggle or from a previous paper or wherever.
This is backwards. The model is the easy part. Getting good data is 99% of the job, and nearly any clown can make a good model once you hand them a good dataset.
skvmb•25m ago
As a clown, I can confirm.
If you hand me a clean, well-labeled, representative dataset, I can make the model do a respectable little dance by lunch.
If you hand me a Kaggle CSV with duplicated rows, target leakage, mislabeled outcomes, and columns named final_final_v2_REAL, suddenly I’m not doing ML anymore. I’m doing archaeology with a red nose on.
The model is the balloon animal. The dataset is the elephant you had to drag into the tent.
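For what it's worth, the first pass of that archaeology can be sketched in a few lines of plain Python. This is only a rough illustration: `audit_rows`, the column names, and the toy rows are all made up for the example, and the leakage check is a crude heuristic (a column whose every value maps to exactly one target value), not proof of leakage. It doesn't attempt to catch mislabeled outcomes, which usually need a human looking at the records.

```python
from collections import Counter


def audit_rows(rows, target):
    """Crude pre-modeling checks on a list of dict rows (hypothetical helper)."""
    # Duplicated rows: count every row that exactly repeats an earlier one.
    counts = Counter(tuple(sorted(r.items())) for r in rows)
    duplicates = sum(n - 1 for n in counts.values())
    unique_rows = len(counts)

    # Leakage red flag: a feature where each value maps to a single target value.
    # Constant columns and ID-like columns (all values distinct) are skipped,
    # since they satisfy that condition trivially.
    suspects = []
    for col in (c for c in rows[0] if c != target):
        targets_by_value = {}
        for r in rows:
            targets_by_value.setdefault(r[col], set()).add(r[target])
        distinct = len(targets_by_value)
        if 1 < distinct < unique_rows and all(
            len(t) == 1 for t in targets_by_value.values()
        ):
            suspects.append(col)

    return {"duplicates": duplicates, "leak_suspects": suspects}


# Toy rows, invented purely for illustration.
rows = [
    {"age": 61, "discharge_code": "D", "died": 1},
    {"age": 61, "discharge_code": "D", "died": 1},  # exact duplicate
    {"age": 45, "discharge_code": "H", "died": 0},
    {"age": 61, "discharge_code": "H", "died": 0},
    {"age": 45, "discharge_code": "D", "died": 1},
]
report = audit_rows(rows, "died")
```

On these toy rows the report flags one duplicate and marks `discharge_code` as a leakage suspect, since it was presumably recorded after the outcome was known.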
nradov•21m ago
For a lot of clinical decision support use cases, you don't even need fancy AI models to get accurate results. If you have good-quality, cleansed data, you can literally just import it into Excel and run a simple linear regression analysis. But unfortunately that won't get you a reputation as an "AI thought leader".
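To show how little machinery that is, here is a minimal sketch of a one-variable least-squares fit in plain Python, the same kind of calculation a spreadsheet trendline does. The function name `fit_line` and the numbers are invented for illustration; they are not clinical data.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept (one predictor)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and sum of cross-deviations of x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up values that lie exactly on y = 2x + 1.
doses = [1.0, 2.0, 3.0, 4.0]
responses = [3.0, 5.0, 7.0, 9.0]
slope, intercept = fit_line(doses, responses)
```

With clean inputs like these the fit recovers the underlying line directly; the hard part is getting inputs that clean in the first place.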
Legend2440•36m ago