I want to write a blog post called "Hadoop is slow" but I still don't really understand if/why Hadoop is slow. halp. https://gist.github.com/jvns/9381521489a99e888c0e …
@squarecog @peterseibel @ansate @b0rk oh, thanks. So the reducers just do a single pass merge of the sorted output from each mapper?
@avibryant @peterseibel @ansate @b0rk wait till you learn about mapper spills and merges and serialization...
@squarecog @peterseibel @ansate @b0rk not to mention combiners. 'Course, @scalding can't use those.
@avibryant @squarecog @peterseibel @ansate @b0rk wait, scalding can't use combiners? Why?
@venusatuluri @avibryant @peterseibel @ansate @b0rk cascading has its own in-memory version to avoid IO overhead.
@scalding @squarecog @avibryant actually, scalding has its own as well, and the typed API exposes it individually: sumByLocalKeys
@venusatuluri @scalding @avibryant Why have custom combiners at all? Does the default do wasteful IO? I thought combiners were in-process.
@venusatuluri @scalding @avibryant it does wasteful IO. There are pros and cons to its approach, but I generally find more cons than pros.
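The in-memory map-side aggregation the thread attributes to cascading and scalding (as opposed to Hadoop's default combiner, which runs against spilled map output) can be sketched in a few lines. This is a toy Python model, not any real cascading or scalding API; the function name and flush threshold are illustrative. Partial sums live in a bounded in-process hash map and get flushed downstream only when it fills, so duplicate keys are pre-aggregated without a write-sort-reread cycle.

```python
from collections import Counter

def map_side_combine(records, flush_threshold=3):
    """Toy model of in-memory map-side combining (illustrative only,
    not the cascading/scalding implementation): partial sums are held
    in a bounded hash map and flushed when it fills, so most duplicate
    keys never reach the shuffle as separate records."""
    cache = Counter()
    emitted = []                 # what would actually be shuffled to reducers
    for key, value in records:
        cache[key] += value
        if len(cache) >= flush_threshold:    # cache full: flush partial sums
            emitted.extend(cache.items())
            cache.clear()
    emitted.extend(cache.items())            # final flush at end of input
    return emitted

# Duplicate keys are summed in memory instead of being written out one
# record at a time and re-read by a disk-based combiner pass.
records = [("a", 1)] * 4 + [("b", 1)] * 2
print(map_side_combine(records))   # -> [('a', 4), ('b', 2)]
```

The trade-off the thread is pointing at: Hadoop's combiner still pays serialization and disk IO on the spill path, while the hash-map approach avoids that IO at the cost of holding partial aggregates in task memory.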
End of conversation

New conversation
@avibryant @peterseibel @ansate @b0rk sort of. Multiple passes are possible, depending on the number of mappers and some settings.
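Why the reducer merge isn't always single-pass can be sketched directly. This is an illustrative Python model, assuming a Hadoop-style merge factor (the `io.sort.factor` setting caps how many sorted runs are merged at once); the function name and run sizes are made up for the example.

```python
import heapq

def multi_pass_merge(runs, merge_factor=10):
    """Sketch of a reducer-side merge: sorted runs (roughly one per
    mapper) are merged at most `merge_factor` at a time, mirroring
    Hadoop's io.sort.factor. With more runs than the factor allows,
    intermediate merges produce new runs and another pass is needed,
    so the merge is not always single-pass."""
    passes = 0
    while len(runs) > 1:
        passes += 1
        runs = [list(heapq.merge(*runs[i:i + merge_factor]))
                for i in range(0, len(runs), merge_factor)]
    return runs[0], passes

# 25 sorted runs with a merge factor of 10: pass 1 leaves 3 runs,
# pass 2 produces the final sorted output.
runs = [sorted((j * 7) % 20 for j in range(i, i + 4)) for i in range(25)]
merged, passes = multi_pass_merge(runs, merge_factor=10)
```

Each extra pass rewrites the intermediate data, which is one concrete reason more mappers (more runs) can mean more IO on the reduce side.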
@squarecog @avibryant @peterseibel @b0rk multiple, based on available Xmx, data size, #disks, combiner... basically Hadoop MR doesn't fail.
@squarecog @avibryant @peterseibel @b0rk also multiple simultaneous merges: mem-to-mem, mem-to-disk, disk-to-disk; again based on those factors.
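The mapper-side spills mentioned earlier in the thread follow the same pattern. Below is an illustrative Python sketch, not Hadoop code: map output accumulates in a fixed-size buffer (think `io.sort.mb` inside the task's Xmx), each overflow is sorted and "spilled", and the spill files are merged at the end. More data or a smaller buffer means more spills and more merge IO, which is how Hadoop trades disk work for never running out of memory.

```python
import heapq

def map_with_spills(records, buffer_limit=4):
    """Toy model of the map-side spill mechanism (names and sizes are
    illustrative): output fills a bounded buffer; on overflow the
    buffer is sorted and spilled; all spills are merged at the end
    into the single sorted map output the reducers fetch."""
    spills, buffer = [], []
    for rec in records:
        buffer.append(rec)
        if len(buffer) >= buffer_limit:      # buffer full: sort + spill
            spills.append(sorted(buffer))
            buffer = []
    if buffer:
        spills.append(sorted(buffer))        # final spill of the remainder
    merged = list(heapq.merge(*spills))      # merge spills into one sorted run
    return merged, len(spills)

# 10 records with a buffer of 4 produce 3 spills that must be re-read
# and merged; the same data with a bigger buffer would spill once.
merged, spill_count = map_with_spills(list(range(10))[::-1], buffer_limit=4)
```

This is also where the default combiner runs in real Hadoop: against each spill and again during the merge, which is the "wasteful IO" the first conversation complains about.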
End of conversation