24-core CPU and I can’t type an email (part one)

I wasn’t looking for trouble. I wasn’t trying to build Chrome a thousand times in a weekend; I was just engaging in that most mundane of 21st-century tasks, writing an email at 10:30 am. And suddenly gmail hung. I kept typing, but for several seconds no characters appeared on screen. Then, suddenly, gmail caught up and I resumed my very important email. Then it happened again, only this time gmail went unresponsive for even longer. Well, that’s funny…

I have trouble resisting a good performance mystery but in this case the draw was particularly strong. I work on Chrome, on Windows, focused on performance. Investigating this hang was actually my job. And after a lot of false starts and some hard work I figured out how Chrome, gmail, Windows, and our IT department were working together to prevent me from typing an email, and in the process I found a way to save a significant amount of memory for some web pages in Chrome.

This investigation had so many rabbit holes that I’m going to save some of the digressions for a follow-on post, but this one will entirely explain the hangs.

As usual I had UIforETW running in the background, tracing to circular buffers, so I just had to type Ctrl+Win+R and the buffers, representing the last thirty seconds or so of system activity, were saved to disk. I loaded the trace into Windows Performance Analyzer (WPA), but I couldn’t definitively find the hang.

When a Windows program stops pumping messages there will be ETW events emitted to indicate exactly where this happened, so those types of hangs are trivial to find. But apparently Chrome kept on pumping messages. I looked for a time when one of Chrome’s crucial threads either went into a busy loop or went completely idle, but I couldn’t find anything conclusive. There were some spots where Chrome went mostly idle, but even then all of the key threads kept chugging along, so I couldn’t be sure where the hang was – Chrome might have been going idle because nothing was happening:


UIforETW has a built-in key logger which is often useful when identifying key points in a trace. However, for obvious security reasons it defaults to anonymizing the keystrokes, treating every number typed as a ‘1’ and every letter as an ‘A’. This was making it hard to find the exact moment of the hang so I turned the input tracing from “Private” to “Full” and waited for the hang. At around 10:30 the next morning the hang happened again. I saved the trace buffers, and put these notes into UIforETW’s trace information field:

Was typing “defer to those with more scuba experience” and gmail hung around the end of “those” and then resumed partway through “experience”. gmail tab was PID 27368.

Clearly just a normal work-related discussion of commute methods, but the important thing is that now I had a way of locating the hang in the ETW trace. I loaded the trace, looked at the keyboard data in the Generic Events field (the events are emitted by UIforETW itself, each one is a purple diamond in the screen shot below) and suddenly I could see exactly where the hang was, and clearly the dip in CPU usage was 100% correlated:


Okay, but why is Chrome stopping? Here’s some foreshadowing: not shown in these screenshots is that WmiPrvSE.exe was consuming an entire CPU hyperthread whenever this happened. But that shouldn’t matter. I have a 24-core/48-thread machine so consuming one hyperthread means my machine is still about 98% idle.

I then zoomed in to the clearly-significant period where Chrome was idle and looked in particular at CrRendererMain in chrome.exe (27368), the gmail tab process.

Aside: I just want to say thank you to 2015 me for asking Microsoft to improve their thread naming mechanisms, and thank you to Microsoft for doing everything I suggested – thread names in WPA are great!

The problem suddenly became clearer. During the 2.81 s hang this thread was scheduled and ran 440 separate times. Normally running every 6 ms is enough to make a program responsive, but for some reason it wasn’t making any progress. I noticed that every time it woke up it was on the same stack. Simplified down it was:

chrome_child.dll (stack base)
KernelBase.dll!VirtualAlloc
ntoskrnl.exe!MiCommitVadCfgBits
ntoskrnl.exe!MiPopulateCfgBitMap
ntoskrnl.exe!ExAcquirePushLockExclusiveEx
ntoskrnl.exe!KeWaitForSingleObject (stack leaf)

Chrome is calling VirtualAlloc, which tries to update some “CfgBits” and needs to acquire a lock. At first I assumed that Chrome was calling VirtualAlloc 440 times, which seemed like a weird thing to do, but the reality was perhaps weirder. Chrome called VirtualAlloc once and a lock needed to be acquired. Chrome was signaled when that lock was available but – for 439 times in a row – when Chrome woke up and tried to acquire the lock it was unavailable. The lock had been reacquired by the process that had just released it.

This is because Windows locks are, by design, not fair and if a thread releases a lock and then tries to reacquire it immediately then it can, in cases like this, reacquire it every single time. Starvation. More on this next time.
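A toy model (purely illustrative — this is not how Windows pushlocks are actually implemented) shows the shape of the starvation: the waiter is signaled on every release, but the releaser, already running on a CPU, re-grabs the lock before the waiter can be scheduled, so only the final release succeeds:

```python
def unfair_lock_wakeups(releases_during_scan):
    """Model an unfair lock: the waiter is woken on every release, but the
    still-running releaser reacquires the lock before the waiter is
    scheduled. The waiter only wins on the last release, when nobody
    re-grabs the lock. Returns (futile_wakeups, acquired)."""
    wakeups = 0
    acquired = False
    for i in range(releases_during_scan):
        wakeups += 1                        # waiter signaled on each release
        is_last = (i == releases_during_scan - 1)
        acquired = is_last                  # unfair: loses every earlier race
    return wakeups, acquired

# 440 scheduled runs during the 2.81 s hang, one successful acquisition:
print(unfair_lock_wakeups(440))  # -> (440, True)
```

This matches the observed pattern: hundreds of wakeups, each one a scheduling round-trip, adding up to a multi-second hang for a single VirtualAlloc call.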

The Readying Process – the one that signaled Chrome to go grab the lock – in every case was WmiPrvSE.exe, releasing the lock on this stack:

ntoskrnl.exe!KiSystemServiceCopyEnd (stack base)
ntoskrnl.exe!NtQueryVirtualMemory
ntoskrnl.exe!MmQueryVirtualMemory
ntoskrnl.exe!MiUnlockAndDereferenceVad
ntoskrnl.exe!ExfTryToWakePushLock (stack leaf)

I got distracted by WMI for a while (more on that next time) but eventually I wrote a program to recreate WMI’s behavior. I had CPU sampling data that showed me where WmiPrvSE.exe was spending its time (some edits and simplifications made):

WmiPerfClass.dll!EnumSelectCounterObjects (stack base)
WmiPerfClass.dll!ConvertCounterPath
pdh.dll!PdhiTranslateCounter
pdh.dll!GetSystemPerfData
KernelBase.dll!blah-blah-blah
advapi32.dll!blah-blah-blah
perfproc.dll!blah-blah-blah
perfproc.dll!GetProcessVaData
ntdll.dll!NtQueryVirtualMemory
ntoskrnl.exe!NtQueryVirtualMemory
ntoskrnl.exe!MmQueryVirtualMemory
ntoskrnl.exe!MiQueryAddressSpan
ntoskrnl.exe!MiQueryAddressState
ntoskrnl.exe!MiGetNextPageTable (stack leaf)

It turns out that reproducing the slow scanning from the sampling data was quite easy. My VirtualScan program just called NtQueryVirtualMemory in the obvious loop to scan the address space of a specified process. The code worked, it took a really long time to scan the gmail process (10-15 seconds), and it triggered the hang. But why?

The advantage of writing the scanning loop myself was that I could start generating statistics. NtQueryVirtualMemory returns data on each contiguous range of address space that has matching attributes. That might mean all reserved, all committed with particular protection settings, etc. The gmail process had about 26,000 of these blocks but I found another process (WPA, as it turns out) that had 16,000 memory blocks but could be scanned very quickly.
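A sketch of that kind of scanning loop, in Python via ctypes — this is my own approximation of VirtualScan, not its actual source. It uses VirtualQueryEx, the documented user-mode wrapper over NtQueryVirtualMemory; only the scan itself is Windows-specific, and the statistics logic is portable:

```python
import ctypes

MEM_COMMIT = 0x1000
EXECUTE_MASK = 0xF0  # PAGE_EXECUTE / _READ / _READWRITE / _WRITECOPY

class MEMORY_BASIC_INFORMATION(ctypes.Structure):
    _fields_ = [("BaseAddress", ctypes.c_void_p),
                ("AllocationBase", ctypes.c_void_p),
                ("AllocationProtect", ctypes.c_uint32),
                ("RegionSize", ctypes.c_size_t),
                ("State", ctypes.c_uint32),
                ("Protect", ctypes.c_uint32),
                ("Type", ctypes.c_uint32)]

def scan_process(hprocess, max_addr=1 << 47):
    """Walk a process's address space one region at a time (Windows only)."""
    regions, mbi, addr = [], MEMORY_BASIC_INFORMATION(), 0
    while addr < max_addr:
        if not ctypes.windll.kernel32.VirtualQueryEx(
                hprocess, ctypes.c_void_p(addr),
                ctypes.byref(mbi), ctypes.sizeof(mbi)):
            break
        regions.append((addr, mbi.RegionSize, mbi.State, mbi.Protect))
        addr += mbi.RegionSize
    return regions

def summarize(regions):
    """Count committed blocks and the executable subset (portable logic)."""
    committed = [r for r in regions if r[2] == MEM_COMMIT]
    executable = [r for r in committed if r[3] & EXECUTE_MASK]
    return len(committed), len(executable)
```

Each VirtualQueryEx call returns one contiguous run of pages with matching attributes, so the number of loop iterations is exactly the block count that matters here.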

At some point I looked at the gmail process with vmmap and noticed that the gmail process had a fair amount of memory (361,836 KiB) and a lot of separate blocks (49,719) coming from a single reservation in the Shareable category – a reservation with a size of 2,147,483,648 KiB, aka 2 TiB. Wot?


I happened to know that the 2 TiB reservation is used for Control Flow Guard (CFG), and I remembered that “CFG” showed up on the call stacks where the gmail Chrome process was waiting – MiCommitVadCfgBits was on the stack. Maybe the large number of blocks in the CFG region were the problem!

Control Flow Guard (CFG) is used to try to stop exploits. Its 2 TiB reservation is a sparse bitmap which indicates which addresses (across 128 TiB of user-mode address space) are valid indirect branch targets. So I modified my virtual memory scanner to count how many blocks were in the CFG region (just look for the 2 TiB reservation) and how many blocks were tagged as executable. Since CFG memory is used to describe executable memory I expected to see one block of CFG memory for each block of executable memory in the process. Instead I saw 98 blocks of executable memory and 24,866 blocks of committed CFG memory. A significant discrepancy:
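The sizes are self-consistent: a 2 TiB bitmap describing 128 TiB of user-mode address space works out to one bitmap byte per 64 bytes of address space (often described as two bits per 16-byte chunk). A quick sanity check of the arithmetic — my own, not from any Windows source:

```python
TiB = 1 << 40
USER_VA = 128 * TiB        # 47-bit user-mode address space
CFG_RESERVATION = 2 * TiB  # the 2,147,483,648 KiB reservation vmmap shows

RATIO = USER_VA // CFG_RESERVATION  # address-space bytes per bitmap byte
assert RATIO == 64

def cfg_bitmap_bytes(code_bytes):
    """Upper bound on CFG bitmap touched by a block of executable memory."""
    return (code_bytes + RATIO - 1) // RATIO

# A full 128 MiB CodeRange touches at most 2 MiB of bitmap -- small, unless
# new randomly placed ranges keep committing fresh bitmap pages forever:
print(cfg_bitmap_bytes(128 * 2**20) // 2**20, "MiB")  # -> 2 MiB
```

So any single CodeRange is cheap; the damage comes from many of them at scattered addresses, each committing its own slice of the bitmap.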

           Scan time   Committed   Page tables   Committed blocks
    Total:   41.763s  1457.7 MiB      67.7 MiB   32,112 (98 code blocks)
    CFG:     41.759s   353.3 MiB      59.2 MiB   24,866

vmmap shows reservations and committed memory as blocks, while my scanning tool only counts committed blocks – that’s why vmmap shows 49,684 blocks where my scanning tool reports 24,866.

In hindsight it is obvious, but what if the CFG bitmap never gets cleaned up? What if memory gets allocated in the CFG bitmap when you allocate executable memory, but doesn’t get freed when you free the executable memory? That would explain this behavior.

End-to-end repro

The next step was to write a VAllocStress program that allocates and then frees thousands of blocks of executable memory at random addresses. The executable needs to have CFG enabled and needs to be 64-bit; I hit some roadblocks because I didn’t really know what I was trying to do, but it worked. After allocating and freeing that many blocks of executable memory the program sits in a loop where it occasionally tries to allocate/free more executable memory, noting when this is slow. Here is my basic algorithm for VAllocStress:

  1. Spin in a loop a lot of times:
    1. Allocate some executable memory with VirtualAlloc at a random address
    2. Free the memory
  2. Then, spin in an infinite loop:
    1. Sleep for 500 ms (don’t want to be a CPU hog)
    2. Allocate some executable memory with VirtualAlloc at a fixed location
    3. Print a message if the VirtualAlloc call takes more than ~500 ms
    4. Free the memory
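In Python with ctypes the same algorithm looks roughly like this. This is a hedged sketch of my own, not the actual VAllocStress source (which is C++ and linked at the end), and the stress loop itself only runs on Windows:

```python
import ctypes
import random
import time

MEM_COMMIT_RESERVE = 0x3000       # MEM_COMMIT | MEM_RESERVE
MEM_RELEASE = 0x8000
PAGE_EXECUTE_READWRITE = 0x40

def random_address(bits=47, align=1 << 16):
    """Random 64 KiB-aligned address in the user-mode address space."""
    return random.randrange(0, 1 << bits, align)

def valloc_stress(iterations=20000):  # Windows only
    k32 = ctypes.windll.kernel32
    k32.VirtualAlloc.restype = ctypes.c_void_p
    k32.VirtualAlloc.argtypes = [ctypes.c_void_p, ctypes.c_size_t,
                                 ctypes.c_uint32, ctypes.c_uint32]
    k32.VirtualFree.argtypes = [ctypes.c_void_p, ctypes.c_size_t,
                                ctypes.c_uint32]
    # Step 1: fragment the CFG bitmap with scattered executable allocations.
    for _ in range(iterations):
        p = k32.VirtualAlloc(random_address(), 0x1000,
                             MEM_COMMIT_RESERVE, PAGE_EXECUTE_READWRITE)
        if p:
            k32.VirtualFree(p, 0, MEM_RELEASE)
    # Step 2: loop forever, timing occasional allocations to detect hangs.
    while True:
        time.sleep(0.5)
        start = time.perf_counter()
        p = k32.VirtualAlloc(None, 0x1000,
                             MEM_COMMIT_RESERVE, PAGE_EXECUTE_READWRITE)
        elapsed = time.perf_counter() - start
        if elapsed > 0.5:
            print(f"VirtualAlloc took {elapsed:.3f} s")
        if p:
            k32.VirtualFree(p, 0, MEM_RELEASE)
```

Freeing the memory with MEM_RELEASE is the crucial part: the executable pages go away, but (on affected Windows versions) the CFG bitmap pages they committed stay behind.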

That’s it. It’s super simple. And it was so satisfying when it worked. I ran it and then got my VirtualScan program to scan my VAllocStress process. I quickly got confirmation that the CFG block was horribly fragmented, and the scanning took a long time. And my VAllocStress program hung for the duration of the scan!

At this point I had simulated the CFG fragmentation, the long scanning times, and the hang. Huzzah!

Root-cause

It turns out that v8 (Chrome’s JavaScript engine) has CodeRange objects for managing code-gen, and each CodeRange object is constrained to a 128 MiB address range. This should be small enough to avoid the rampant CFG allocation I was seeing.

But what if you have multiple CodeRange objects, and what if those get allocated at random addresses and then freed? I instrumented the CodeRange constructor, left gmail running, and found the smoking gun. Every couple of minutes another CodeRange object was created (and destroyed). Being in the debugger made it easy to find that WorkerThread::Start was allocating these objects, and suddenly everything makes sense:

  1. gmail uses service workers, probably for implementing offline mode
  2. These appear and disappear every few minutes because that’s how service workers behave
  3. Each worker thread gets a temporary CodeRange object which allocates some executable pages for JITted JavaScript code from a random location in the 47-bit process address space
  4. The 2 TiB CFG memory reservation gets some entries every time new code pages are allocated
  5. The CFG allocations are never freed
  6. NtQueryVirtualMemory is painfully slow when scanning CFG memory (about 1 ms per block) for reasons that I don’t understand

The slow scanning of CFG memory is actually fixed in Windows 10 RS4 (April 2018 Update), which briefly had me wondering if all of this investigation was pointless. But it’s not.

Memory

The CFG reservation starts out as just an address reservation – no memory is actually committed. As executable pages are allocated, however, parts of the CFG reservation are turned into committed memory, using real memory pages. And those pages never get freed. If you keep allocating and freeing randomly located blocks of executable memory then the CFG region will grow arbitrarily large! Well, that’s not quite true. The CFG memory block is best thought of as a cache with bounded size. However that is little consolation when the bounded size is 2 TiB, per process!
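The page-table numbers follow from the same sparsity. On x64, each 4 KiB page table maps 2 MiB of virtual address space, so committed bitmap pages scattered thinly across a 2 TiB region each drag in most of a page table of their own. Some rough arithmetic (mine, using standard x64 paging constants):

```python
PAGE = 4096
PT_SPAN = 2 * 2**20   # one 4 KiB page table maps 512 PTEs * 4 KiB = 2 MiB

def page_table_cost(committed_bytes, spread_blocks):
    """Worst case: every scattered block lands under its own page table."""
    tables = max(spread_blocks, committed_bytes // PT_SPAN)
    return tables * PAGE

# ~353.3 MiB of CFG bitmap committed in ~24,866 scattered blocks:
cost = page_table_cost(int(353.3 * 2**20), 24866)
print(round(cost / 2**20, 1), "MiB of page tables")  # -> 97.1 MiB worst case
```

The observed 59.2 MiB of page tables is below this worst case because some blocks do share page tables, but the order of magnitude matches: sparse commits make the mapping overhead comparable to the memory being mapped.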

The worst I’ve seen so far is when my gmail tab had been running for eight days and had accumulated 353.3 MiB of CFG memory, and 59.2 MiB of page-tables to map that memory, for a total waste of about 400 MiB. For some reason most people see either no symptoms or much milder symptoms than I do.

Resolution

The v8 (Chrome’s JavaScript engine) team has updated their code so that it reuses addresses for CodeRange objects, which will squish this bug quite nicely. Microsoft already fixed their CFG memory scanning performance problems. Maybe some day Microsoft will free CFG regions when executable memory is freed, at least in the easy cases where large address ranges are freed. And a vmmap bug was reported.

Most of my coworkers and our customers never noticed this problem. I only noticed it because:

  • I had gmail’s offline mode enabled
  • I was running on an old version of Windows 10
  • Our IT department was running regular WMI scans of our computers
  • I was paying attention
  • I got lucky

It seems highly improbable that, as one of the most qualified people to diagnose this bug, I was the first to notice it. If any of the factors above had not been true then I would not have experienced this hang and the associated memory waste might never have been found.

By the way, the reason that the hangs kept happening at 10:30 am is just because that’s when our IT team ran their inventory scans. If you want to trigger a scan manually, you can go to Control Panel | Configuration Manager | Actions | Select ‘Hardware Inventory Cycle’ and then Run Now.


Source code

If you want to play along at home you can grab the source code to VAllocStress and VirtualScan from github, here.

Rabbit holes

This investigation included a lot of rabbit holes. Some were important (page tables), some were instructive (locks), and some were purely a waste of time (WMI details). I’ll talk more about these, and a vmmap bug, in a follow-up post. If you want more gory details – some of which ended up being wrong or irrelevant – you can follow the whole investigation at crbug.com/870054.




ALL CAPS

[image: trump_case.png]
kbrint: Fake news. SAD.

focaccia sandwiches for a crowd


Last year, Alexandra Stafford published a very good book about bread. It sprang from a recipe for the peasant bread her mother made often when she was growing up. When she shared it on her site, it went viral, which is no surprise given that it’s no-knead, comes together in under five minutes, rises in about an hour, and after a brief second rise, you bake it in buttered bowls that form it into a blond, buttery crusted bread that she boasts is “the antithesis of artisan.” Because there are no hidden tricks; no steam ovens, special flours, lames to score the crust, or bannetons to shape the loaves. Her central tenet is that “good bread can be made without a starter, without a slow or cold fermentation, without an understanding of bakers’ percentages, without being fluent in the baking vernacular: hydration, fermentation, biga, poolish, soaker, autolyse, barm.” (None of those words appear in the book.) She knows that there are a lot of no-knead breads out there, but this is the only one that can be started at 4pm and be on the dinner table at 7.

[photos: what you'll need; whisk flour, salt, and yeast; add water; let it proof for an hour]

I realize you’re thinking, as I briefly worried before I read it, how does one write an entire cookbook based on one recipe? But Stafford is a gifted recipe developer, and there isn’t a thing in this book — one part breads (with all types of flours, grains, and shapes, including pizzas, flatbreads, rolls and buns), one part toasts (including sandwiches, tartines, stratas, panzanellas, soups, summer puddings and so much more), and one part crumbs (a celebration of crunchy gratin toppings, stuffing, burgers, eggplant parmesan, fish sticks, meatballs, and brown bettys) — that I didn’t want to make. (I suspect that having four kids to feed ensures that these recipes were vetted by the most finicky of reviewer classes.) It’s also a gorgeous book, with a focus and format that my inner, long-surrendered organized person finds deeply pleasing.


IT is urbanizing, McDonald’s gets it, but Woonsocket doesn’t (yet)

2 Shares

My favorite UK TV producer once had to sell his house in Wimbledon and move to an apartment in Central London just to get his two adult sons to finally leave home. Now something similar seems to be happening in American IT. Some people are calling it age discrimination. I’m not sure I’d go that far, but the strategy is clear: IT is urbanizing — moving to city centers where the labor force is perceived as being younger and more agile.

The poster child for this tactic is McDonald’s, based for 47 years in Oak Brook, Illinois, which just this summer moved to a new Intergalactic HQ downtown in the Chicago Loop. Not everybody has left the old digs. McDonald’s has opened a software division at the new HQ specifically working on McDonald’s cloud offerings, which is to say working on the future of McDonald’s IT.

The old guys and gals are generally back in the burbs, while the new Dev/Ops Cloud folks are in the city.  This is likely by design. McDonald’s techies can get their Cloud training online in either location, but if you are in the suburbs you can’t get the Cloud charge codes because you are not going to the meetings downtown.  Move to the city if you want to work on Cloud.  Charge codes are the way they starve old practices at McDonald’s.

I don’t see this change in IT structure as an accounting function. It is a sunk-cost problem: the corporations see the issue, but they can’t get their older staff to adapt to the innovations. Clayton Christensen covered this in a number of books.

What has changed at the board level is companies like Ford realize now that they are a software company that makes cars.  Liberty Mutual said as much.  The Internet of Things, Cloud, IT Innovation, have all struck a chord, and CEOs are listening. 

The issue is finding how to get your staff to adapt to change. The new answer is to move. Starve the older teams with fewer charge codes, and give them all the training they want, since that is very cheap now and all online. Reward them for training and make it a part of their MBO (Management by Objectives). The training courses are now free across many corporations (SafariBooks, PluralSight, INE, etc.).

As an old IT guy it is possible for me to see this as age discrimination and a sneaky (if expensive) workaround for laws against that practice. But it’s really more a matter of innovation discrimination, since fogies like me who are willing to do the classes and make the accompanying physical move — that is older workers who are eager for new experiences of all types — are generally free to join the future.

Still, the COBOL crowd will be pissed when they figure out what’s happening.

When I started work on this column the poster child wasn’t McDonald’s, it was Aetna, the giant health insurer, which earlier this year announced it was moving from Hartford, Connecticut, where it had lived for more than a century, to New York, NY. Though the $9 million in subsidies Aetna was to receive for this $85 million move got a lot of press, I really doubt they were doing it for the tax savings.

Here’s the rationale I was hearing from inside Aetna:

  • Aetna needed to move into the 21st Century.
  • The existing Hartford staff is mainly baby boomers stuck in 1990.
  • The move discharges the baby boomers who will not take the move package.
  • The IT department got to move into a city that is the capital of FinTech.
  • The transition would transform Aetna tech with less risk than if they stayed in Hartford with old IT thinking.

Then something funny happened: CVS offered to buy Aetna for $69 billion (the deal has yet to close) and strongly suggested Aetna remain in Hartford, closer to CVS HQ in Woonsocket — yes, Woonsocket — Rhode Island. So Aetna cancelled its move to Manhattan, giving back the city’s $9 million, also forcing me to flip this column on its head.

Here’s where I am going to make a bold prediction. IF the deal closes (it looks likely) and CVS buys Aetna sometime in the next few months, a come-to-Jesus meeting will be held in Woonsocket after which the combined company will resume that move into New York City.

Here is why. I have been noticing that many developers my age don’t want to learn something new. Docker is new. Cloud Foundry changes everything in software development. That knocks down multiple monolithic structures. When you look at IT performance you see some baby boomers, some Generation X’ers, but more and more Y and Z generation workers who have adopted new ways of seeing IT. This problem is at the core of IBM, HP, Dell, and others. Baby boomers can’t quit: they are broke if they do. But they also can’t adjust, so they strangle innovation to stay on top.

Ultimately companies sell their homes in Wimbledon, so to speak, moving into the big urban tech centers, urbanizing to join the biggest corporate headquarters real estate boom in history — a real estate boom ironically driven by virtualization.


My old friend Adam Dorrell’s company, CustomerGauge, is having an event on September 14th at the Computer History Museum in Mountain View, CA. I’ll be there as host, moderator and valet parking attendant. CustomerGauge is a Software-as-a-Service (SaaS) platform that measures and reports on customer feedback in real time.

Polyhedra Viewer

jwz
This is amazing:

This application visualizes the relationships between the convex, regular-faced polyhedra. The 120 solids presented here can be transformed into each other by a network of operations. Select a solid below to manipulate it and to explore its relationships with other polyhedra.

Previously, previously, previously, previously, previously, previously.

kbrint: Cool!

Mac and Cheese



We have made provolone crisps before. Today I took that idea further by adding pasta to the browning cheese. The blanched rings became embedded, really bonded, into the cheese. I was able to trim the crisp cheese so that each tube had its own lacy cheese cap. Time to explore its potential.

 

