@j_houg I'm not sure Spark offers anything more at a model level. Implementation is different, but logically isomorphic, no?
@j_houg @posco @dxbydt_jasq so join is fundamental, then - can't be expressed as other primitives?
@avibryant @posco @dxbydt_jasq You end up cogrouping two pairRDDs and specifying a partitioner. https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala#L481 …
@avibryant @posco @dxbydt_jasq partitioner is for output. If input rdds are already partitioned the same way, the join goes faster.
@avibryant @posco @dxbydt_jasq So, I guess the primitive used there is cogroup w added info about partitioning so data is task local.
@j_houg @avibryant @dxbydt_jasq The primitive is the same: shuffling. Cogrouping does not need sorting, especially if everything is in mem.
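
To make the thread's point concrete, here is a rough sketch of a join built from a shuffle plus a cogroup, with no sorting. It is plain Python over in-memory lists, not Spark code: `partition`, `cogroup`, and `join` are illustrative stand-ins written for this example, and the default of 4 buckets is an arbitrary assumption.

```python
from collections import defaultdict
from itertools import product


def partition(pairs, num_partitions, key_fn=hash):
    """Shuffle step: route each (key, value) pair to a bucket chosen by its key."""
    buckets = [[] for _ in range(num_partitions)]
    for key, value in pairs:
        buckets[key_fn(key) % num_partitions].append((key, value))
    return buckets


def cogroup(left, right):
    """Group two lists of pairs by key: key -> (left values, right values)."""
    groups = defaultdict(lambda: ([], []))
    for key, value in left:
        groups[key][0].append(value)
    for key, value in right:
        groups[key][1].append(value)
    return groups


def join(left, right, num_partitions=4):
    """Inner join built from partition + cogroup; hashing only, no sorting."""
    left_parts = partition(left, num_partitions)
    right_parts = partition(right, num_partitions)
    joined = []
    # Both sides use the same partitioner, so matching keys land in the
    # same bucket index and each bucket pair can be handled on its own.
    for left_bucket, right_bucket in zip(left_parts, right_parts):
        for key, (lvals, rvals) in cogroup(left_bucket, right_bucket).items():
            # Keys present on only one side yield an empty product.
            joined.extend((key, pair) for pair in product(lvals, rvals))
    return joined
```

If the inputs were already bucketed by the same partitioner, the two `partition` calls could be skipped, which is the "join goes faster" case mentioned above.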