Any web archiving / wget gurus out there who can help me with a weird problem I'm running into while trying to crawl a site from localhost with wget? http://qanda.digipres.org/1166/crawl-website-localhost-preserving-files-source-directory …
Replying to @bitsgalore
the content you are serving with apache is just static (no cgi, ssi, php)? i think that warcit https://github.com/webrecorder/warcit … could be better suited for this task
Replying to @atomotic
Update: so this mostly worked, but like wget the WARC still turns up as 85646 separate captures in pywb (see screenshot).
@IlyaKreymer maybe you know how to get around this? pic.twitter.com/VUeJyGVA4X
Replying to @atomotic @IlyaKreymer
Yes, actually I started out with webrecorderplayer, but the website/WARC appears to be too large for it (it becomes completely unresponsive after indexing the WARC). This is also why I switched to pywb for testing the WARC.
Replying to @bitsgalore @IlyaKreymer
and you've used the autoindexing in pywb? try to run `cdx-indexer -j file.warc.gz > index.cdxj` and replace the current index in collection/__/indexes
Replying to @atomotic @IlyaKreymer
Hmm ... will the result be any different from the index that is built when using wb-manager add (which creates a CDX at the same location)?
Replying to @bitsgalore @IlyaKreymer
in theory it should not be different. it's just a way to verify that the cdx is complete.
Replying to @atomotic @IlyaKreymer
OK, so this was useful in the end: I replaced the auto-generated index.cdxj of one of my "working" (i.e. single-capture) WARCs with one generated by cdx-indexer. After that it showed up as 85885 captures in pywb! So it seems the index is to blame here and not the WARC.
@bitsgalore not sure I fully understand the issue yet. It seems like the same url is repeated 80k times, somehow. If you can send us a copy of the WARC, we can help figure out what's going on. Could you send either the WARC or a link to download it to support@webrecorder.io?
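One way to check the "same url repeated" hunch before sending anything is to count the entries per URL key in the index itself. The following is only a rough sketch: it assumes the index is a plain-text CDXJ file with one entry per line in the form `<urlkey> <timestamp> <json>`, and the names `count_captures` and `main` are made up for illustration (they are not part of pywb).

```python
import sys
from collections import Counter


def count_captures(lines):
    """Count index entries per URL key in an iterable of CDXJ lines.

    Assumes each entry looks like: "<urlkey> <timestamp> <json>".
    Lines with fewer than two space-separated fields are skipped.
    """
    counts = Counter()
    for line in lines:
        parts = line.strip().split(" ", 2)
        if len(parts) < 2:
            continue
        counts[parts[0]] += 1
    return counts


def main(path, top=10):
    """Print the total entry count and the most frequent URL keys."""
    with open(path, encoding="utf-8") as handle:
        counts = count_captures(handle)
    total = sum(counts.values())
    print(f"{total} index entries, {len(counts)} distinct URL keys")
    for urlkey, n in counts.most_common(top):
        print(f"{n:8d}  {urlkey}")


if __name__ == "__main__" and len(sys.argv) > 1:
    # Hypothetical usage: python count_cdxj.py index.cdxj
    main(sys.argv[1])
```

If a single URL key accounts for nearly all of the roughly 85k entries, that would point at the index listing every record under one URL; if the keys are mostly distinct, the problem is more likely in how the captures are presented.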
Replying to @IlyaKreymer @atomotic
Thanks, I'll send you (a link to) the WARC when I'm back in the office on Monday.