Tweets
Pinned Tweet
https://youtu.be/Pc7F-n5278w - My KVM Forum talk on the past year's improvements to KVM nested virtualization. A lot to present in a very short time slot. Highly recommend viewing the slides appendix for many more technical details and nVMX mechanisms! (slides: https://events.linuxfoundation.org/wp-content/uploads/2017/12/Improving-KVM-x86-Nested-Virtualization-Liran-Alon-Oracle.pdf)
I later saw this great talk: https://www.youtube.com/watch?v=Ii_pEXKKYUg that explains how RISC may achieve CISC-like perf with clever micro-arch tricks. The main concepts are Macro-Fusion, in which the decoder generates a single uOp for multiple MacroOps, and, like CISC, having both 2- and 4-byte instructions.
https://twitter.com/Liran_Alon/status/1223215852146786305
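To make the Macro-Fusion idea concrete, a tiny C example of my own (not from the talk): the compare-and-branch pairs that a simple loop compiles to are exactly the MacroOp sequences a modern decoder can fuse into single uOps.

```c
/* Illustration only: a compare-and-branch pair is the classic macro-fusion
 * candidate. The loop condition and the if() below typically compile to
 * cmp+jcc pairs, which recent x86 decoders fuse into a single uOp each,
 * so two MacroOps occupy roughly one pipeline slot. */
#include <stddef.h>

size_t count_below(const int *v, size_t n, int limit)
{
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {    /* cmp i,n + jb   -> one fused uOp */
        if (v[i] < limit)               /* cmp v[i],limit + jcc -> one fused uOp */
            count++;
    }
    return count;
}
```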
RAP/XFG, which is hash-based, seems better. Interesting observation that 7% of indirect branches have >100 valid targets if only the prototype is hashed. Thought: add a compiler annotation for non-cross-module func_ptrs & targets that makes the linker add a unique argument to the hash of these branch targets? (2/2)
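A minimal userspace simulation of the hash-check scheme, assuming nothing beyond the idea above (this is not the actual RAP/XFG codegen, and the hash constant is made up): the compiler would attach a hash of the prototype to every indirectly-callable function, and the call site compares it against the hash of the prototype it expects before branching.

```c
/* Conceptual simulation of prototype-hash CFI; not real RAP/XFG codegen.
 * HASH_INT_FN_INT is a made-up stand-in for a hash of "int (*)(int)". */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define HASH_INT_FN_INT 0x9e3779b97f4a7c15ULL

struct cfi_target {
    uint64_t proto_hash;   /* would be emitted by the compiler next to the function */
    int (*fn)(int);
};

static int double_it(int x) { return 2 * x; }

static const struct cfi_target double_it_cfi = { HASH_INT_FN_INT, double_it };

static int checked_call(const struct cfi_target *t, int arg)
{
    /* Call-site check: the target's hash must match the prototype hash
     * this call site was compiled against. */
    if (t->proto_hash != HASH_INT_FN_INT)
        abort();                       /* CFI violation */
    return t->fn(arg);
}

int main(void)
{
    printf("%d\n", checked_call(&double_it_cfi, 21));
    return 0;
}
```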
https://outflux.net/slides/2020/lca/cfi.pdf
@kees_cook Slides on Linux kernel CFI. Clang's jmp-table based CFI seems quite bad, as it adds many opcodes before the indirect branch, executes an additional jmp and requires global call-site visibility (e.g. it doesn't work for cross-module branches). (1/2)
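A rough C model of the jmp-table scheme described in the slides (my illustration of the concept, not Clang's actual codegen): all valid targets of one prototype are collected into a single table, and the call site verifies the pointer is in that table before branching through it. The extra checking code and the need to see every target at link time are the downsides noted above.

```c
/* Conceptual model of jump-table CFI; not Clang's actual codegen. */
#include <stdio.h>
#include <stdlib.h>

typedef int (*int_op)(int);

static int inc(int x) { return x + 1; }
static int neg(int x) { return -x; }

/* The "jump table": every address-taken function of type int(int). */
static const int_op jump_table[] = { inc, neg };
#define TABLE_LEN (sizeof(jump_table) / sizeof(jump_table[0]))

static int checked_indirect_call(int_op target, int arg)
{
    for (size_t i = 0; i < TABLE_LEN; i++)
        if (jump_table[i] == target)
            return target(arg);        /* pointer is a known-valid target */
    abort();                           /* CFI violation */
}

int main(void)
{
    printf("%d\n", checked_indirect_call(neg, 5));
    return 0;
}
```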
https://gamozolabs.github.io/metrology/2019/12/30/load-port-monitor.html : Cool work using MLPDS to observe micro-arch usage of the Load Port. Idea: signal the sibling thread to perform some op to observe, sleep X cycles & then observe Load Port stale data with MLPDS. Slowly enlarge X to sequence the multiple loads done by a single observed op.
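A purely structural skeleton of that sweep, to make the idea concrete; trigger_sibling_op() and mlpds_observe_load_port() are hypothetical placeholders for the victim operation and the MLPDS leak primitive, which are not reproduced here.

```c
/* Structural sketch only: the two helpers are hypothetical placeholders,
 * not a working MLPDS primitive. */
#include <stdint.h>
#include <stdio.h>
#include <x86intrin.h>

static void trigger_sibling_op(void) { /* placeholder: kick the observed op on the sibling HT */ }
static uint64_t mlpds_observe_load_port(void) { return 0; /* placeholder: MLPDS leak of stale load-port data */ }

static void delay_cycles(uint64_t cycles)
{
    uint64_t start = __rdtsc();
    while (__rdtsc() - start < cycles)
        _mm_pause();
}

int main(void)
{
    /* Slowly enlarge X to sequence the loads the observed op performs. */
    for (uint64_t x = 0; x < 10000; x += 100) {
        trigger_sibling_op();
        delay_cycles(x);
        printf("X=%llu stale=%llx\n", (unsigned long long)x,
               (unsigned long long)mlpds_observe_load_port());
    }
    return 0;
}
```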
Awesome talk by @damageboy on how he re-implemented C# Array.Sort() using CPU vector operations (AVX2) to avoid the branch-misprediction perf penalty, resulting in a significantly faster implementation.
https://twitter.com/damageboy/status/1209153363620835330
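Not his code, but a tiny taste of the branch-free style in C intrinsics: a vectorized compare-exchange built from AVX2 min/max replaces the data-dependent branch a scalar sort would take on every comparison.

```c
/* Branch-free compare-exchange of 8 int32 lanes with AVX2 min/max: there is
 * no data-dependent branch for the predictor to mispredict. C illustration
 * of the idea, not the C# implementation from the talk. Build with -mavx2. */
#include <immintrin.h>
#include <stdio.h>

static void vec_compare_exchange(__m256i *a, __m256i *b)
{
    __m256i lo = _mm256_min_epi32(*a, *b);
    __m256i hi = _mm256_max_epi32(*a, *b);
    *a = lo;
    *b = hi;
}

int main(void)
{
    int x[8] = { 7, 1, 9, 3, 8, 2, 6, 4 };
    int y[8] = { 5, 0, 2, 9, 1, 3, 7, 8 };
    __m256i a = _mm256_loadu_si256((const __m256i *)x);
    __m256i b = _mm256_loadu_si256((const __m256i *)y);

    vec_compare_exchange(&a, &b);

    _mm256_storeu_si256((__m256i *)x, a);
    _mm256_storeu_si256((__m256i *)y, b);
    for (int i = 0; i < 8; i++)
        printf("%d %d\n", x[i], y[i]);   /* per lane: x[i] <= y[i] */
    return 0;
}
```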
http://infocenter.arm.com/help/topic/com.arm.doc.dai0274b/DAI0274B_migrating_from_ia32_to_arm.pdf : Stumbled by accident upon this nice (but very old) PDF that compares the x86 and ARM architectures. Mostly from a userspace perspective, but I still found it interesting.
Also, on ARM64, wmb()+writeX_relaxed() compared to writel() unnecessarily turns dma_wmb() into wmb(), as dma_wmb()==DMB(OSHST) is sufficient to flush the WCBs. I'm not sure whether a write to the doorbell (UC Device memory) does an implicit wmb()==DSB(ST) anyway, as on Intel x86. Any ARM expert here?..
Having said that, I wonder whether in these scenarios it's sufficiently OK to just use wmb()+writeX_relaxed() for the doorbell write, even though it executes an unnecessary SFENCE on Intel. Probably because it causes the implicit SFENCE on the write to UC to be much faster? This is all very weird... (3/3)
This applies to some NIC drivers I recently reviewed. They have a feature where the Tx descriptor is written to a PCI BAR mapped as WC (instead of to memory) to avoid one DMA read. Thus, only on AMD do they require a wmb() before writing to the doorbell (UC). For example, the mlx4 BlueFlame feature. (2/3)
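For reference, a kernel-style sketch of the pattern under discussion, with illustrative names (this is not mlx4's actual code): the descriptor goes into the WC BAR, then the WC buffers must be flushed before the UC doorbell write.

```c
/* Kernel-style sketch of the WC-descriptor + doorbell pattern; the structs
 * and names are illustrative, not taken from mlx4. */
#include <linux/types.h>
#include <linux/io.h>
#include <asm/barrier.h>

struct tx_desc { u64 addr; u32 len; u32 flags; };

struct tx_ring {
    void __iomem *wc_desc_bar;   /* Tx descriptor area, BAR mapped as WC */
    void __iomem *doorbell;      /* doorbell register, mapped as UC      */
    u32 next_prod;
};

static void post_tx_desc(struct tx_ring *ring, const struct tx_desc *desc)
{
    /* Write the descriptor straight into the WC BAR to save one DMA read. */
    memcpy_toio(ring->wc_desc_bar, desc, sizeof(*desc));

    /*
     * Flush the write-combining buffers before ringing the doorbell.
     * Per the observation above: on Intel the UC doorbell write below
     * flushes them implicitly, on AMD it may not, hence the explicit wmb().
     */
    wmb();

    writel(ring->next_prod, ring->doorbell);
}
```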
Encountered a strange x86 cache-coherency inconsistency: Intel guarantees to flush the WCBs on a read/write of UC mem, but @AMD does so only for reads. If true, should Linux have a new flush_wcb_writeX() util that differs between CPU vendors? (1/3)
@fagiolinux @_msw_ @DanielMarcovit3 @_AlexGraf
pic.twitter.com/kT3Tw6VBKz
https://01.org/blogs/2019/introducing-intel-data-streaming-accelerator Intel DSA: similar to Intel QuickData but with Scalable IOV support, CRC generate/test support & memcmp-delta generation support. It's also the first time I see the new "posted write" interface (e.g. ENQCMD) being used, aimed at being a generic HW accelerator interface.
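A very rough sketch of what the posted-write submission flow looks like from software; the descriptor layout and work-queue portal below are simplified placeholders, not the real DSA/idxd definitions, and actually running this needs DSA hardware plus kernel PASID/portal setup.

```c
/* Rough sketch of ENQCMD-based submission; fake_dsa_desc is a placeholder,
 * not the real DSA descriptor format. Build with -menqcmd. */
#include <immintrin.h>
#include <stdint.h>

struct fake_dsa_desc {           /* 64-byte command, placeholder layout */
    uint32_t opcode_flags;
    uint64_t src, dst, size;
    uint8_t  pad[32];
} __attribute__((aligned(64)));

int submit(void *wq_portal, const struct fake_dsa_desc *desc)
{
    /* ENQCMD is a posted 64-byte write to the device's work-queue portal;
     * a non-zero return means the queue did not accept the command and the
     * caller should retry or back off. */
    return _enqcmd(wq_portal, desc);
}
```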
I also wonder how these entries are split between "translation entries" and "paging-structure-cache entries". I couldn't find this information specified anywhere. I do hope these numbers don't include the latter and that there are additional entries for that somewhere. :) (2/2)
Surprised by interesting numbers in the Intel Optimization Guide, section 2.5.5.2 (L1 DCache). Turns out there are only 4 DTLB entries for 1GB pages! Crazy! So mapping all guest memory as 1GB pages in EPT may be less efficient? Worth benchmarking! (1/2)
@_msw_ @Karim_Allah @rsinghal1
pic.twitter.com/wmlyFJBp8z
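A crude sketch of such a benchmark (my own, assuming 1GB hugepages have been reserved on the kernel command line, e.g. hugepagesz=1G hugepages=8): touch more distinct 1GB pages than the 4 reported DTLB entries and compare against staying within 4.

```c
/* Crude benchmark sketch: stride over N distinct 1GB huge pages; with N > 4
 * the reported 4-entry 1GB DTLB should start missing. Assumes 1GB hugepages
 * are reserved (hugepagesz=1G hugepages=8 on the kernel cmdline). */
#define _GNU_SOURCE
#include <stdio.h>
#include <time.h>
#include <sys/mman.h>

#ifndef MAP_HUGE_1GB
#define MAP_HUGE_1GB (30 << 26)            /* log2(1GB) << MAP_HUGE_SHIFT */
#endif

#define GB (1ULL << 30)

static double touch_pages(volatile char *mem, int npages, long iters)
{
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < iters; i++)
        mem[(i % npages) * GB]++;          /* one access per distinct 1GB page */
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
}

int main(void)
{
    int npages = 8;                        /* > 4, to exceed the 1GB DTLB */
    void *mem = mmap(NULL, npages * GB, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB | MAP_HUGE_1GB,
                     -1, 0);
    if (mem == MAP_FAILED) { perror("mmap"); return 1; }

    printf("4 pages: %.3fs\n", touch_pages(mem, 4, 100000000L));
    printf("8 pages: %.3fs\n", touch_pages(mem, 8, 100000000L));
    return 0;
}
```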
Given Intel DDIO provides the device with direct access to a limited set of LLC ways, I would also expect a non-temporal store instruction that not only writes directly to the LLC, but can be hinted to write to the DDIO-accessible LLC ways. E.g. to accelerate NIC/NVMe submissions. (3/3)
i.e. the producer isn't expected to read the descriptors it writes to the submission queue. Thus, there is no need to load their cache-lines into the producer's L1/L2, which also hurts the consumer's latency when reading them. Thoughts? (2/3)
Q: A producer/consumer ring is a common pattern for high-perf communication between 2 CPU cores or between a CPU core & a device. Thus, I expected Intel to have a non-temporal store instruction that writes to the LLC without polluting L1/L2. Useful also with device DDIO. But MOVNT* bypasses the LLC as well. Why? (1/3)
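For context, what exists today with intrinsics (the descriptor layout is illustrative): a streaming store keeps the producer's L1/L2 clean, but, as noted above, it bypasses the LLC entirely instead of landing in DDIO-reachable ways.

```c
/* What MOVNT gives today: the producer's L1/L2 stay clean, but the store
 * streams past the LLC too, which is exactly the complaint above.
 * The descriptor layout is illustrative. */
#include <immintrin.h>
#include <stdint.h>

struct desc { uint64_t addr, len, flags, cookie; };   /* 32 bytes */

/* ring_slot must be 32-byte aligned, as ring descriptors normally are. */
void produce(struct desc *ring_slot, const struct desc *d)
{
    __m256i v = _mm256_loadu_si256((const __m256i *)d);
    _mm256_stream_si256((__m256i *)ring_slot, v);   /* VMOVNTDQ: non-temporal */
    _mm_sfence();   /* order the NT store before a later tail/doorbell update */
}
```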
https://plundervolt.com/doc/plundervolt.pdf Attack on SGX: lower the CPU's operating voltage via an undocumented MSR to cause complex instructions to produce wrong results. A malicious host can use this before an ECALL to make the enclave's MUL & AES-NI instructions malfunction. Can lead to SGX leaking secrets to the host.
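The fault model is easy to picture with a loop in the spirit of the one the paper uses to find the faulting voltage (my simplified version; the constants are arbitrary, and actually lowering the voltage requires the undocumented MSR):

```c
/* Simplified fault-detection loop: the product should always equal the
 * precomputed value; under a sufficiently lowered voltage the multiply
 * occasionally produces a wrong result. Detection only; the undervolting
 * itself is done elsewhere via the undocumented MSR. */
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    volatile uint64_t a = 0xAE0000ULL;
    volatile uint64_t b = 0x18ULL;
    const uint64_t expected = 0xAE0000ULL * 0x18ULL;

    for (uint64_t i = 0; i < 100000000ULL; i++) {
        uint64_t r = a * b;                 /* the kind of instruction that faults */
        if (r != expected) {
            printf("fault after %llu iterations: %llx != %llx\n",
                   (unsigned long long)i, (unsigned long long)r,
                   (unsigned long long)expected);
            return 1;
        }
    }
    printf("no fault observed\n");
    return 0;
}
```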
https://nullprogram.com/blog/2019/12/09/ Nice and short blog post on when it's useful to use the C "restrict" keyword to limit the effects of pointer aliasing in order to aid compiler optimizations.
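The canonical kind of example (mine, not from the post): telling the compiler that dst cannot alias src lets it keep loads in registers and vectorize the loop instead of reloading after every store.

```c
/* With restrict, the compiler may assume dst and src never overlap, so it
 * can load src[i] once, keep intermediates in registers, and vectorize.
 * Without it, every store to dst[i] could invalidate a cached src value. */
#include <stddef.h>

void scale_add(float *restrict dst, const float *restrict src,
               float k, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] += k * src[i];
}
```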
https://youtu.be/xLbx-iqjZxk AWS Networking news: VPC traffic mirroring (+ LB integration), VPC ingress routing (Custom routes on IGW/VGW), multicast routing (Clone packet to vNICs group), VPC inter-region peering & accelerated site-to-site VPN (VPN to Edge -> Direct-Connect to AWS)
https://www.youtube.com/watch?v=Cqa1BHos1sg … AWS Compute news: Nitro Enclaves, Graviton2, Inf1 instances, Compute Optimizer, Outpost + Rack-slot security-key holds PK, Local Zones (City-local EC2 servers), Wavelength (EC2 servers at 5G city aggregation center -> single-digit ms latency).