@j_houg @posco @dxbydt_jasq what's spark's secondary sort model?
Replying to @avibryant
@avibryant @posco @dxbydt_jasq It doesn't sort. The shuffle recently (https://databricks.com/blog/2014/10/10/spark-petabyte-sort.html) started being sort-based, but reduceByKey...
Replying to @j_houg
@avibryant @posco @dxbydt_jasq (seemingly the closest analog to reducers) combines on the map side.
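[Editor's note: a minimal local sketch of the map-side combine described above. Plain Python lists stand in for RDD partitions, and the helper names `combine_locally` / `merge_combined` are hypothetical, not Spark API.]

```python
from collections import defaultdict

# Sketch of the map-side combine behind reduceByKey: each partition
# pre-aggregates locally, so at most one record per key per partition
# needs to be shuffled.

def combine_locally(partition):
    """Map-side combine within one partition: (key, value) pairs -> dict."""
    acc = defaultdict(int)
    for key, value in partition:
        acc[key] += value
    return dict(acc)

def merge_combined(partials):
    """Reduce side: merge the small per-partition dicts."""
    acc = defaultdict(int)
    for partial in partials:
        for key, value in partial.items():
            acc[key] += value
    return dict(acc)
```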
Replying to @j_houg
@avibryant @posco @dxbydt_jasq So, ordering is irrelevant. groupByKey doesn't guarantee any ordering.
Replying to @j_houg
@avibryant @posco @dxbydt_jasq Also, all the values associated with a key after groupBy must fit in memory.
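[Editor's note: a local model of the groupByKey behavior in the two tweets above — every value for a key is materialized in one in-memory list (hence the memory caveat), and the order within that list is simply arrival order, which is not guaranteed. Plain Python, not Spark's API.]

```python
from collections import defaultdict

def group_by_key(records):
    """Collect all values for each key into a single in-memory list."""
    groups = defaultdict(list)
    for key, value in records:
        groups[key].append(value)
    return dict(groups)
```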
Replying to @j_houg
@j_houg @posco @dxbydt_jasq so join is fundamental, then - can't be expressed as other primitives?
Replying to @avibryant
@avibryant @posco @dxbydt_jasq You end up cogrouping two pairRDDs and specifying a partitioner. https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala#L481
Replying to @j_houg
@avibryant @posco @dxbydt_jasq partitioner is for output. If input rdds are already partitioned the same way, the join goes faster.
Replying to @j_houg
@avibryant @posco @dxbydt_jasq So, I guess the primitive used there is cogroup with added info about partitioning so data is task-local.
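[Editor's note: a local sketch of join as cogroup plus a per-key cross product, echoing the PairRDDFunctions code linked above. Plain dicts and lists replace pair RDDs and partitioners; the function names are hypothetical.]

```python
def cogroup(left, right):
    """For every key in either input, collect the values from both sides."""
    keys = {k for k, _ in left} | {k for k, _ in right}
    return {
        k: ([v for lk, v in left if lk == k],
            [w for rk, w in right if rk == k])
        for k in keys
    }

def join(left, right):
    """Inner join: cross product of the two value lists for each key."""
    return [
        (k, (v, w))
        for k, (vs, ws) in cogroup(left, right).items()
        for v in vs
        for w in ws
    ]
```

Keys present on only one side get an empty list on the other, which is why outer joins fall out of the same primitive.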
Replying to @j_houg
@avibryant @posco @dxbydt_jasq Is there a more primitive way to impl cogroup?
@j_houg @posco @dxbydt_jasq it's easy to implement with secondary sort + a way to stream over the sorted group on the reduce side.
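[Editor's note: a local sketch of the construction in the last tweet — cogroup via secondary sort. Tag each record with its source (0 = left, 1 = right), sort once by (key, tag), then stream over consecutive key runs; within a run the tag ordering puts all left values before all right values, so neither side's group has to be materialized before the pass begins. `cogroup_sorted` is a hypothetical name, not Spark API.]

```python
from itertools import groupby

def cogroup_sorted(left, right):
    """Cogroup two (key, value) sequences using a secondary sort on a source tag."""
    tagged = [((k, 0), v) for k, v in left] + [((k, 1), v) for k, v in right]
    tagged.sort(key=lambda kv: kv[0])  # (key, source tag): the secondary sort
    result = []
    # Stream over runs of equal keys; the tag cleanly separates the two sides.
    for key, run in groupby(tagged, key=lambda kv: kv[0][0]):
        ls, rs = [], []
        for (_, tag), value in run:
            (ls if tag == 0 else rs).append(value)
        result.append((key, (ls, rs)))
    return result
```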