In case there are any NT kernel devs listening: from cold start, an exe that touches ~1gb of memory takes over 100ms to do so due to page handling. With 2MB pages enabled, this drops down to 30ms. This suggests to me that a "MEM_REALLY_COMMIT" VirtualAlloc flag would help?
Because when you VirtualAlloc 1gb: if we know the total time, including page faulting, to use the memory is 30ms with 2MB pages, one assumes that even with 4k pages, if VirtualAlloc did the page prep right there in bulk, it could at least get something closer to 30ms?
Replying to @cmuratori
Don't you think it's simply the price you have to pay for setting up 512 times as many page table entries? And could PrefetchVirtualMemory help in your case (don't know if you're doing I/O or just reserving pages)?
Replying to @molecularmusing
I will try adding a PVM right after the VirtualAlloc and see if anything changes. As for it taking 80ms to set up the pagetable entries, while _possible_, we are talking about 300 million cycles here. Maybe that is the cost of 1gb of 4k pagetable entries, but I doubt it?
Replying to @cmuratori @molecularmusing
So, absent a kernel dev telling me otherwise, my assumption would definitely be that most of the time is in repeatedly handling faults and doing the pages serially, instead of once in bulk.
Replying to @cmuratori @molecularmusing
How do you measure? If you make an ETW trace and open in WPA with symbols, you should be able to verify your assumption pretty quickly. (There'd be some ntoskrnl!KiPageFault frames in your stack I think)
Replying to @CarePackage17 @cmuratori
Second that. What I do find weird is that I would have thought that most of the time is spent on zeroing the pages, but then you wouldn't see such a difference between 4k and 2MB pages - in both cases, 1GB has to be zeroed.
Out of curiosity, did you try doing this in several threads just to see if this is serialized in the kernel?
Replying to @molecularmusing @CarePackage17
Threads are complicated. With MEM_LARGE_PAGES, they do not help. I believe this is because it is already at or close to memory bandwidth, and while you do get marginally more by adding cores, it's probably not enough to offset the synchronization? I would have to do more testing.
Without MEM_LARGE_PAGES, they _do_ help, and again, this suggests to me that it is page fault handling: you are basically spreading the page fault handling over multiple threads, reducing the total time spent handling the faults.