Here's something I've found really useful in my DeOldify research: keep a separate, huge master set of images collected from various sources (e.g. Open Images), then use Jupyter notebooks made specifically to generate training datasets from that master and output them elsewhere. 1/
In the notebooks I do a lot of filtering out of images from the master, depending on the task. You can filter out images for being grayscale when you're doing colorization, for example, or filter out images below a minimum resolution, etc. I've found this helps data quality a lot. 2/
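A rough sketch of what one of those filtering notebook cells can look like. The paths, the minimum-size threshold, and the grayscale test here are just illustrative, not DeOldify's actual code:

```python
# Minimal sketch of the "generate a dataset from the master" pattern:
# walk the master set, keep only images that pass the task's filters,
# and copy the survivors to a task-specific output directory.
from pathlib import Path
import shutil

import numpy as np
from PIL import Image

MASTER_DIR = Path("/mnt/slow_drive/master_images")   # huge, cheap storage
OUTPUT_DIR = Path("/mnt/fast_ssd/colorization_set")  # filtered, task-specific set
MIN_SIDE = 256                                       # illustrative minimum width/height

def is_grayscale(img: Image.Image, tol: int = 8) -> bool:
    """Treat an image as grayscale if its color channels are (nearly) identical."""
    if img.mode in ("1", "L", "LA"):
        return True
    rgb = np.asarray(img.convert("RGB"), dtype=np.int16)
    # Max per-pixel spread across R, G, B; a small spread means effectively grayscale.
    spread = (rgb.max(axis=2) - rgb.min(axis=2)).max()
    return spread <= tol

def keep_for_colorization(path: Path) -> bool:
    try:
        with Image.open(path) as img:
            if min(img.size) < MIN_SIDE:
                return False
            return not is_grayscale(img)
    except OSError:
        return False  # corrupt or unreadable files get filtered out too

OUTPUT_DIR.mkdir(parents=True, exist_ok=True)
for src in MASTER_DIR.rglob("*.jpg"):
    if keep_for_colorization(src):
        shutil.copy2(src, OUTPUT_DIR / src.name)
```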
The bigger your master set is, the more aggressively you can filter and still end up with a big enough dataset. This also gives you the freedom to use sloppy but good-enough filtering techniques that have a lot of false positives, such as blurry-image detection. 3/
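For blur, the classic variance-of-the-Laplacian trick is exactly that kind of sloppy-but-good-enough filter. A minimal sketch, with a made-up threshold you'd tune by eye:

```python
# Rough blur filter: variance of the Laplacian as a sharpness score.
# The threshold is arbitrary and will discard some perfectly sharp images
# (false positives), which is fine when the master set is big enough.
import cv2

def is_probably_blurry(path: str, threshold: float = 100.0) -> bool:
    img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    if img is None:
        return True  # unreadable file -> drop it
    sharpness = cv2.Laplacian(img, cv2.CV_64F).var()
    return sharpness < threshold
```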
For the master you can use a huge and fairly slow/cheap drive as opposed to a fancy SSD/NVMe. Save the speedy drive space for the resulting filtered and processed datasets. 4/
Replying to @citnaj
This sounds a lot like your own image data lake pattern: raw, transformed, curated, published images, on cheap storage. How do you track the models, the versions, and which inputs/parameters affected each model? How are outcomes measured, and what metrics would you review?
Replying to @andrew_sears
Model versions: GitHub! Inputs/parameters/outcomes: well-documented experiment Jupyter notebooks that I save on a separate backup drive (to free up space), along with TensorBoard outputs and model checkpoints. The point is that I can reproduce things with ease and not think too much about it.
Replying to @citnaj @andrew_sears
Does that mean you add new code whenever you want to change parameters, or do you save things only when you find something interesting? Also, I'm guessing you don't fix your seeds?
Replying to @mariokostelac @andrew_sears
I just try to make sure that experiments at any point in time can be 100% reproduced in terms of how they functioned, though not necessarily down to the exact data used (fixed seeds). Benchmarks do use the same exact images, though, so effectively you get the same outcome there.
The key point is that I consider the benchmarks to be the more useful point of comparison, as opposed to validation/training loss, because the latter can easily change (for example, when the loss function itself is modified) without that being a problem.
Replying to @citnaj @andrew_sears
How do you go about creating benchmarks? Some dataset with metrics you define? If so, how long did it take you to figure out how to create a good benchmark for your use case?
I honestly don't try to reinvent the wheel too much there. I try to find the benchmarks that reliably reflect how we rate the images visually, and we try to do that rigorously. FID > SSIM > PSNR in that regard, which isn't unexpected. But they're already available! 1/
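A minimal sketch of scoring a fixed benchmark set with those off-the-shelf metrics, here using scikit-image for SSIM and PSNR. The paths are placeholders, and for FID you'd reach for an existing implementation (such as the pytorch-fid package) rather than writing your own:

```python
# Score a benchmark set against ground truth with existing metrics.
# Assumes 8-bit RGB images (data_range=255) and scikit-image >= 0.19
# for the channel_axis argument; directory names are placeholders.
from pathlib import Path

import numpy as np
from PIL import Image
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def load_rgb(path: Path) -> np.ndarray:
    return np.asarray(Image.open(path).convert("RGB"))

reference_dir = Path("benchmark/ground_truth")  # the same exact images every run
restored_dir = Path("benchmark/model_output")

ssim_scores, psnr_scores = [], []
for ref_path in sorted(reference_dir.glob("*.png")):
    ref = load_rgb(ref_path)
    out = load_rgb(restored_dir / ref_path.name)
    ssim_scores.append(structural_similarity(ref, out, channel_axis=-1, data_range=255))
    psnr_scores.append(peak_signal_noise_ratio(ref, out, data_range=255))

print(f"SSIM: {np.mean(ssim_scores):.4f}  PSNR: {np.mean(psnr_scores):.2f} dB")
```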