My understanding is that on desktop it has some kind of internal HW queue of fragments coming from the shading units to reduce global memory traffic?
Replying to @pcwalton @jhaberstro
I just saw your code. You're computing blending for all tiles in a single thread. When you use the rasterizer, the pixel shader runs computeCoverage() in different threads, and the final blending stage is then performed serially. [1/2]
If you want to beat the ROP, you need N threads to call computeCoverage() and store the result into shared memory, then a barrier, then merge the results using a parallel sum algorithm and have one thread store the final combined value. [2/2]
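A minimal sketch of that scheme, simulated on the CPU in Python rather than written as a real compute shader. Everything here is an assumption for illustration: `compute_coverage` is a hypothetical stand-in for computeCoverage(), the `layers` layout is made up, and the plain loops stand in for N GPU threads running in lockstep with a barrier between phases.

```python
def compute_coverage(layer, pixel):
    """Hypothetical stand-in for computeCoverage(): one layer's coverage at one pixel."""
    return layer[pixel]


def workgroup_blend(layers, pixel):
    """Simulate one workgroup: N 'threads' write to shared memory, then tree-reduce."""
    n = len(layers)
    if n == 0:
        return 0.0

    # Phase 1: each "thread" tid computes its own coverage and stores it in shared memory.
    shared = [compute_coverage(layers[tid], pixel) for tid in range(n)]
    # A barrier would go here on a GPU, so every store is visible before merging.

    # Phase 2: parallel sum. At each step, threads whose id is a multiple of
    # 2 * stride add in the partial result that sits `stride` slots away.
    stride = 1
    while stride < n:
        for tid in range(0, n, 2 * stride):
            if tid + stride < n:
                shared[tid] += shared[tid + stride]
        # Another barrier would go here before the next, wider step.
        stride *= 2

    # Phase 3: a single thread (tid 0) stores the final combined value.
    return shared[0]


layers = [[0.25, 0.5], [0.125, 0.0], [0.5, 0.25]]
print(workgroup_blend(layers, 0))
```

The merge takes about log2(N) steps instead of N, which is where the potential win over a single serial thread would come from. Within one step no active thread reads a slot that another active thread writes, so the sequential simulation gives the same result a lockstep workgroup would.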
Replying to @matiasgoldberg @jhaberstro
Yeah, I’ve thought about doing it that way. But other compute-based vector rendering solutions I’ve seen don’t work this way; they interpret command lists sequentially per tile.
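For contrast, a rough sketch of the "interpret command lists sequentially per tile" approach, again simulated in Python. The command format (an RGB color plus an alpha) and the names `over`, `render_tile`, and `render` are assumptions made up for this example, not taken from any particular renderer.

```python
def over(dst, src_rgb, src_alpha):
    """Blend a source color over a destination color (standard 'over' with straight alpha)."""
    return tuple(s * src_alpha + d * (1.0 - src_alpha) for s, d in zip(src_rgb, dst))


def render_tile(commands, background=(0.0, 0.0, 0.0)):
    """One 'thread' walks its tile's command list in order, blending as it goes."""
    color = background
    for rgb, alpha in commands:
        color = over(color, rgb, alpha)
    return color


def render(tiles):
    """Tiles are independent, so on a GPU each would get its own thread or workgroup."""
    return [render_tile(commands) for commands in tiles]


# Hypothetical scene: opaque red, then half-transparent blue on top.
tile = [((1.0, 0.0, 0.0), 1.0), ((0.0, 0.0, 1.0), 0.5)]
print(render([tile]))
```

Parallelism here comes only from having many tiles; the work inside each tile is strictly serial, which keeps the code simple.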
Replying to @pcwalton @jhaberstro
It's simplicity vs. complexity. Unfortunately, writing efficient compute code isn't straightforward, i.e., the obvious way usually isn't the fastest.
Replying to @matiasgoldberg @jhaberstro
Yeah, the thing is that the rasterizer is often the simplest of all… let the silicon do the hard work :)
Replying to @pcwalton @jhaberstro
Just to repeat myself, this is what I'm talking about: pic.twitter.com/58EtIrcYvs
If you devote enough time you should beat the ROP, because you know in advance how many tiles there are (the rasterizer doesn't, and needs to do some juggling for load balancing).
However, if it's not worth spending your time on that (e.g. it already consumes little time), then just let the rasterizer do it, which is much simpler code.
Replying to @matiasgoldberg @jhaberstro
Yeah, it depends a lot on the vector scene… Interestingly, doing mask tile *generation* sequentially per pixel in compute has been a win across a variety of hardware, unlike compositing. That stage has a lot of microtriangles, so perhaps it’s due to avoiding vertex shader overhead.
So it’s interesting that so far a hybrid approach has seemed best. Mask tile generation—vector filling—is fastest with compute, while pure raster blending is fastest with rasterization.