Here's something I've found really useful in my DeOldify research: keep a separate huge master set of images created from various sources (e.g. Open Images), then use Jupyter notebooks made specifically to generate training datasets from that master and output them elsewhere. 1/
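A minimal sketch of what one of those dataset-generating notebook cells might look like (the directory paths and the `keep` filter are hypothetical placeholders, not DeOldify's actual code):

```python
# Minimal sketch of a dataset-generating notebook cell.
# Paths and the filter are hypothetical, not DeOldify's actual code.
from pathlib import Path
from shutil import copy2

MASTER_DIR = Path('/mnt/slow_drive/master_images')    # huge, cheap storage
OUTPUT_DIR = Path('/mnt/fast_ssd/colorize_train_v1')  # task-specific dataset
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)

def keep(path: Path) -> bool:
    """Placeholder filter; real notebooks chain several checks here."""
    return path.suffix.lower() in {'.jpg', '.jpeg', '.png'}

for img_path in MASTER_DIR.rglob('*'):
    if img_path.is_file() and keep(img_path):
        copy2(img_path, OUTPUT_DIR / img_path.name)
```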
In the notebooks I do a lot of filtering out of images from the master, depending on the task. You can filter out images for being grayscale when you're doing colorization, for example, or filter for a minimum resolution, etc. I've found this helps data quality a lot. 2/
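For example, the grayscale and minimum-resolution checks could look something like this with Pillow; the thresholds are arbitrary assumptions, not the values used in DeOldify:

```python
# Example grayscale and min-resolution filters for a colorization dataset.
# Thresholds here are arbitrary examples.
from PIL import Image
import numpy as np

def is_grayscale(path, tolerance=8) -> bool:
    """Treat an image as grayscale if its RGB channels barely differ."""
    img = Image.open(path).convert('RGB')
    arr = np.asarray(img, dtype=np.int16)
    channel_spread = (arr.max(axis=2) - arr.min(axis=2)).mean()
    return channel_spread < tolerance

def meets_min_resolution(path, min_side=256) -> bool:
    with Image.open(path) as img:
        return min(img.size) >= min_side

def keep_for_colorization(path) -> bool:
    return meets_min_resolution(path) and not is_grayscale(path)
```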
The bigger your master set is, the more aggressive the filters you can apply while still getting a big enough dataset at the end. This also gives you the freedom to use sloppy-but-good-enough filtering techniques that have a lot of false positives, such as blurry image detection. 3/
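One common sloppy-but-good-enough blur check is the variance of the Laplacian via OpenCV; the threshold below is a guess and will flag some sharp images as blurry (false positives), which is fine when the master set is big enough to absorb the loss:

```python
# "Sloppy but good enough" blur detection: variance of the Laplacian.
# The threshold is a guess; some sharp images will be rejected too.
import cv2

def looks_blurry(path, threshold=100.0) -> bool:
    img = cv2.imread(str(path), cv2.IMREAD_GRAYSCALE)
    if img is None:          # unreadable file: treat as a reject
        return True
    return cv2.Laplacian(img, cv2.CV_64F).var() < threshold
```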
For the master you can use a huge and fairly slow/cheap drive as opposed to a fancy SSD/NVMe. Save the speedy drive space for the resulting filtered and processed datasets. 4/
Replying to @citnaj
Would a metadata database be helpful here? Store each image's location plus the metadata like grayscale, resolution, etc.? I'd imagine if you have many images on a slow mechanical hard drive, this could speed up the filtering quite a bit?
Sure, that would be a really smart idea actually and I have yet to do it LOL. I'd just suggest keeping it simple: make a simple local file with key/value entries, i.e. just treat it as a cache. Then you can just as easily delete it if you want to start fresh.
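For what it's worth, a minimal sketch of that kind of local key/value cache using the standard-library shelve module might look like this; the cache path and stored fields are just examples:

```python
# Minimal sketch of a local key/value metadata cache using shelve.
# The cache path and the stored fields are examples only.
import shelve
from PIL import Image

CACHE_PATH = '/mnt/slow_drive/image_metadata_cache'  # delete this file to start fresh

def get_metadata(path, cache):
    key = str(path)
    if key not in cache:                  # only read from the slow drive once
        with Image.open(path) as img:
            cache[key] = {
                'size': img.size,
                'mode': img.mode,         # 'L' usually means grayscale
            }
    return cache[key]

with shelve.open(CACHE_PATH) as cache:
    meta = get_metadata('/mnt/slow_drive/master_images/example.jpg', cache)
    print(meta)
```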