Here's something I've found really useful in my DeOldify research: keep a separate, huge master set of images built from various sources (e.g. Open Images), then use Jupyter notebooks written specifically to generate training datasets from that master and output them elsewhere. 1/
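A minimal sketch of what such a notebook cell could look like; the paths and the `passes_filters` helper are hypothetical placeholders I'm assuming for illustration, not actual DeOldify code:

```python
from pathlib import Path
from shutil import copyfile

# Hypothetical locations: a big, cheap master drive and a smaller, fast
# drive holding the filtered training set.
MASTER_DIR = Path("/mnt/master_images")
DATASET_DIR = Path("/mnt/fast_ssd/colorize_v1")
DATASET_DIR.mkdir(parents=True, exist_ok=True)

def passes_filters(img_path: Path) -> bool:
    """Placeholder for the task-specific filters (grayscale, resolution, blur, ...)."""
    return True

# Walk the master set once and copy only the survivors to the dataset drive.
for img_path in MASTER_DIR.rglob("*.jpg"):
    if passes_filters(img_path):
        copyfile(img_path, DATASET_DIR / img_path.name)
```

The filters are what change from task to task; the master set and the copy loop stay the same.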
In the notebooks I do a lot of filtering of images out of the master, depending on the task. You can filter out grayscale images when you're doing colorization, for example, or enforce a minimum resolution, etc. I've found this helps data quality a lot. 2/
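As a rough illustration of those two filters (not the actual notebook code; the Pillow-based checks and the 256px threshold below are my own assumptions):

```python
from PIL import Image
import numpy as np

MIN_SIDE = 256  # assumed minimum resolution; tune per task

def is_grayscale(img: Image.Image, tol: int = 8) -> bool:
    """Treat an image as grayscale if its RGB channels are (nearly) identical."""
    if img.mode in ("L", "LA", "1"):
        return True
    rgb = np.asarray(img.convert("RGB"), dtype=np.int16)
    # Max per-pixel channel spread; a small spread everywhere means effectively gray.
    spread = (rgb.max(axis=2) - rgb.min(axis=2)).max()
    return spread <= tol

def big_enough(img: Image.Image) -> bool:
    return min(img.size) >= MIN_SIDE

def keep_for_colorization(path) -> bool:
    try:
        with Image.open(path) as img:
            return big_enough(img) and not is_grayscale(img)
    except OSError:
        return False  # unreadable/corrupt files get filtered out too
```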
The bigger your master set is, the more aggressive the filters you can apply while still getting a big enough dataset at the end. This also gives you the freedom to use sloppy-but-good-enough filtering techniques that have a lot of false positives, such as blurry-image detection. 3/
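For example, one common cheap-and-sloppy blur check is the variance of the Laplacian via OpenCV; the threshold here is an assumption on my part and will happily discard some sharp images, which is fine when the master set is big:

```python
import cv2

BLUR_THRESHOLD = 100.0  # assumed cutoff; known to flag some sharp images too

def looks_blurry(path: str) -> bool:
    """Variance-of-Laplacian heuristic: low variance means few sharp edges."""
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    if gray is None:
        return True  # unreadable images get dropped as well
    return cv2.Laplacian(gray, cv2.CV_64F).var() < BLUR_THRESHOLD
```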
For the master you can use a huge, fairly slow/cheap drive as opposed to a fancy SSD/NVMe. Save the speedy drive space for the resulting filtered and processed datasets. 4/
Replying to @citnaj
This sounds a lot like your own image data lake pattern. Raw, transformed, curated, published images. On cheap storage. How do you track the models, versions and what inputs / parameters affected the model? How are outcomes measured / what metrics would you review?
Metrics part II: The point is that validation loss can change for a variety of reasons and so isn't comparable across experiments, but these metrics remain stable. That, and they're a way to double-check that progress is actually being made.