Older blog entries for wli (starting at number 10)

Shared pagetables are festering. Instead I appear to be going after scsi_malloc() and trimming down the sizes of various data structures, and backporting things too. Oh well, one can only do so much. lazy_buddy and gang_cpu are sort of on hold, though progress seems to be happening.

Looks like I need to do things faster. All the stuff I've been talking about really needs to happen, which I've known all along, but it's being impressed on me anew by various things I've been observing. No sleep this weekend...

28 May 2002 (updated 31 May 2002 at 03:19 UTC)

Pushed another lazy_buddy out the door and testers/reviewers are nowhere to be found. Maybe I should just keep moving until they do chime in. Shared pagetables, here I come...

Looks like openlogging.org fell off the net and I can't bk commit anything, which is somewhat painful as I'd like to avoid batching unrelated things together in a given changeset.

The bug reporter seems to have indicated that the rmap13 bug was created by an independent patch used in combination with it. What a relief!

A number of strategies seem to have surfaced for dealing with kva exhaustion:

  1. making kernel/user address spaces disjoint
  2. dynamically mapping large data structures
  3. reserving a region of per-process kva for windowing potentially large things predominantly accessed from the context of their creators
  4. shrinking the size of various data structures
  5. shrinking caches more aggressively
  6. reserving a larger global windowing region
  7. reserving per-cpu windowing regions with scheduler support
  8. daemons parked in front of large structures stealing the user portion of the address space for windowing their oversized data structure
  9. statically reserving per-process kva for dynamically mapping things like pagetables with strong process affinity
  10. doing nothing whatsoever and using 64-bit hardware instead (supposedly the preferred course of action, which is not really acceptable to those I'm helping)
Highmem is really evil.
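
To make option 2 concrete, here's a minimal sketch using the stock kmap()/kunmap() highmem interfaces; struct big_thing and big_thing_update() are made up for illustration, not code from any patch:

    /*
     * Rough sketch of option 2: keep the structure in highmem pages
     * that aren't direct-mapped, and only borrow kva for the duration
     * of each access instead of pinning it permanently.
     */
    #include <linux/highmem.h>
    #include <linux/mm.h>

    struct big_thing {
        struct page *page;      /* backing highmem page */
        unsigned long offset;   /* offset of the entry within the page */
    };

    static void big_thing_update(struct big_thing *bt, unsigned long val)
    {
        /* Map the page into kva only while we touch it... */
        char *vaddr = kmap(bt->page);

        *(unsigned long *)(vaddr + bt->offset) = val;

        /* ...and give the kva back immediately. */
        kunmap(bt->page);
    }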

Making fork() not copy pte's for file-backed vma's seems to have some very difficult-to-trace issues. As best I can tell, things somehow end up faulting in garbage.
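
The idea itself is simple enough; a sketch of roughly what it looks like (pte_copy_needed() is a made-up helper name, and this is not the actual patch):

    /*
     * Sketch only: shared file-backed mappings can be repopulated
     * from the page cache by the fault path, so in principle the
     * eager pte copy at fork() time can be skipped for them.
     */
    static inline int pte_copy_needed(struct vm_area_struct *vma)
    {
        if (vma->vm_file && (vma->vm_flags & VM_SHARED))
            return 0;
        return 1;
    }

    /* ...then dup_mmap() would only call copy_page_range() for vmas
     * where pte_copy_needed() returns nonzero. */

The restriction to VM_SHARED is the cautious version: private file-backed mappings can hold COW'd anonymous pages that have diverged from the file, which is one obvious way to end up faulting in garbage if their ptes are skipped.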

Poking around the fault paths has led me to consider beautifying them somewhat by fleshing out the segment-driver-like approach, along with some purely non-semantic beautification of the rbtree manipulation code. I might also try using a different kind of tree if I feel like going in for the long haul. Not that I haven't done that before.

pagemap_lru_lock things are going well. I'm just taking it slowly from here, so that bugs I may have missed don't show up in larger groups than can be effectively handled, and I'm not feeding stuff to riel until the issues from the last batch of things sent to him have been handled. There's an annoying one where the bug reporter has vaporized and I can't get anything close to useful info about what happened, which I'd like to get a grip on; but short of literally flying out to meet the guy, sitting on his doorstep until he reappears, and borrowing his box to debug on (which isn't going to happen), I'm not sure what can really be done.

It looks like rmap is getting close to (or at) parity on NUMA-Q now with a rollup of the pending changes, so I'm slowing that down and keeping things stable. Now I'm helping to chase highmem stability issues, with some fork() efficiency issues queued at a lower priority. Highmem is evil. Very evil. It's going to take a bunch of us grinding away at it full-time to make this stuff work. The niceness of direct-mapping the kernel virtual address space turns into a kva-exhaustion horror beyond imagination as dynamic:direct ratios go up. There will be much pain.

Trying to debug the races with the pagemap_lru_lock breakup all week. Mostly just singlestepping and trying to debug the simulator. Nothing to see here, move on.

So nothing really got done this weekend. It wouldn't have helped to run the benchmarks when the analysis code I wanted to use them as testcases for wasn't ready. Reimplementing math libraries that have no free equivalents is a big PITA.

I got a real profile instead of a description, and the signs point to too many calls to add_timer(), mod_timer(), and del_timer() rather than cache-blowing in cascade_timers(), which surprised me but relieves me of the burden of writing the umpteenth priority queue. It also appears to be specific to ip_conntrack, which I'm not sure is one of my priorities.

Following the yellow brick profile...

mbligh managed to get some testing in on the pte_chain_freelist racefix, and it appears to survive booting and running some benchmarks. Per-zone freelists should now follow after poking around for further races in the rest of this round of auditing.

Looks like the rest of today will be spent on entertainment types of things. Maybe some code will come out late tonight.

Found a race in an audit of one of the pagemap_lru_lock breakups that appears to be common to all of them, but it's unclear whether it's the only one left. After the pagemap_lru_lock was broken up, the pte_chain_freelist, which is global, was left naked. Apparently after I survived that one, one of the init_idle() races came out and I ran out of time on the machine.
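
The shape of the fix is simple enough: give the freelist a lock of its own now that pagemap_lru_lock no longer covers it. A sketch only, with made-up function names, not the literal patch; struct pte_chain is as in the rmap patch (a ->next link plus the pte pointer):

    #include <linux/spinlock.h>
    #include <asm/pgtable.h>

    struct pte_chain {
        struct pte_chain *next;
        pte_t *ptep;
    };

    static struct pte_chain *pte_chain_freelist;
    static spinlock_t pte_chain_freelist_lock = SPIN_LOCK_UNLOCKED;

    /* Take one entry off the global freelist under its own lock. */
    static struct pte_chain *pte_chain_pop(void)
    {
        struct pte_chain *pc;

        spin_lock(&pte_chain_freelist_lock);
        pc = pte_chain_freelist;
        if (pc)
            pte_chain_freelist = pc->next;
        spin_unlock(&pte_chain_freelist_lock);
        return pc;
    }

    /* Return an entry to the global freelist under the same lock. */
    static void pte_chain_push(struct pte_chain *pc)
    {
        spin_lock(&pte_chain_freelist_lock);
        pc->next = pte_chain_freelist;
        pte_chain_freelist = pc;
        spin_unlock(&pte_chain_freelist_lock);
    }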

Discovered in some additional testing that the incomplete gamma function for the chi^2 CDF has convergence problems when either a or x > 7, so it appears it will need the continued fraction expansion in that domain. The Kolmogorov-Smirnov CDF code is spewing complete garbage, and the other CDFs are on the back burner.
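
For reference, the textbook way out (this is the standard series/continued-fraction split, not my code as it stands): use the series for the regularized lower incomplete gamma P(a,x) when x < a + 1 and the continued fraction for Q(a,x) = 1 - P(a,x) otherwise, evaluated with modified Lentz; the chi^2 CDF with k degrees of freedom is then P(k/2, x/2).

    /* Textbook sketch of the regularized lower incomplete gamma. */
    #include <math.h>

    #define ITMAX 200
    #define EPS   3.0e-12
    #define FPMIN 1.0e-300

    static double igamma_P(double a, double x)
    {
        double lg = lgamma(a);

        if (x < a + 1.0) {              /* series converges quickly here */
            double ap = a, sum = 1.0 / a, del = sum;
            int n;

            for (n = 0; n < ITMAX; n++) {
                ap += 1.0;
                del *= x / ap;
                sum += del;
                if (fabs(del) < fabs(sum) * EPS)
                    break;
            }
            return sum * exp(-x + a * log(x) - lg);
        } else {                        /* continued fraction for Q(a,x) */
            double b = x + 1.0 - a, c = 1.0 / FPMIN, d = 1.0 / b, h = d;
            int i;

            for (i = 1; i <= ITMAX; i++) {
                double an = -i * (i - a), del;

                b += 2.0;
                d = an * d + b;
                if (fabs(d) < FPMIN) d = FPMIN;
                c = b + an / c;
                if (fabs(c) < FPMIN) c = FPMIN;
                d = 1.0 / d;
                del = d * c;
                h *= del;
                if (fabs(del - 1.0) < EPS)
                    break;
            }
            return 1.0 - exp(-x + a * log(x) - lg) * h;
        }
    }

    /* chi^2 CDF with k degrees of freedom. */
    static double chi2_cdf(double x, double k)
    {
        return x <= 0.0 ? 0.0 : igamma_P(0.5 * k, 0.5 * x);
    }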

The queue is propagating things upward and downward and finding the right levels to put things at, but the stratified trees that are supposed to be what gets bubbled around still need to be plugged in. There appear to be some bugs around using the right chain field at the right time with them: when it's time to bubble things down from one level to another, or when things need to be chained off each other by insertion in a tree of a deeper nesting level, something is getting the nesting level of the tree node wrong.

Found two more classes of bugs in another audit of the waitqueue code. One is that the leader against which other waiters on an object are chained is not actually considered an element of the list by the list_head routines, but only a sentinel, which caused the reference to it to be lost. The other is that some of the code that assumed it had a unique reference to the queue wasn't actually removing it from the comb list.
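
For the record, the sentinel half of it comes down to the basic list_head property. A generic illustration only (struct waiter and wake_followers_only() are made up; this isn't the comb code itself):

    #include <linux/list.h>
    #include <linux/sched.h>

    struct waiter {
        struct list_head chain;
        struct task_struct *task;
    };

    /*
     * list_for_each() never visits the node the iteration starts from;
     * that node is a pure sentinel. So if followers are chained directly
     * off the leader's own embedded list_head and the wakeup path
     * iterates from there, the leader itself is silently skipped and the
     * reference to it is lost.
     */
    void wake_followers_only(struct waiter *leader)
    {
        struct list_head *pos;

        list_for_each(pos, &leader->chain) {
            struct waiter *w = list_entry(pos, struct waiter, chain);

            wake_up_process(w->task);   /* leader->task never woken here */
        }
    }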

Some more people I've never heard of chimed in on the hashing thread and put in a few words to perpetuate the confusion between "random" and "uniformly distributed". This is never going to end. I didn't actually bother answering the post; I'm going to hold out until I have some hashtable analysis code others can use to reproduce my results, and then speak in numbers. As long as it's rooted in terminology and anecdotes, no one will ever admit what's going on.

Minimal progress on the stratified trees, but some pointy-haired issues came up that distracted me for a while.

Reviewed some small changes from Sam Ortiz that look good for getting SGI's discontigmem stuff to play happily with the removal of ->virtual. It's not clear whether it will perform well, as it apparently takes some doing (in terms of CPU cycles; the code itself is not that bad) to remap mem_map array indices to page frame number offsets. Hopefully that won't be too bad, but if it is, ->virtual is #ifdef'd and can be brought back that way.
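
My guess at the shape of the translation in question, using the 2.4-era pg_data_t fields (my_page_to_pfn is a made-up name; this is not Sam's actual code):

    #include <linux/mm.h>
    #include <linux/mmzone.h>

    /*
     * Without ->virtual, finding the pfn for a page on discontigmem
     * means figuring out which node's mem_map the page lives in, then
     * offsetting from that node's starting pfn. The per-node walk is
     * where the extra CPU cycles come from.
     */
    static inline unsigned long my_page_to_pfn(struct page *page)
    {
        int nid;

        for (nid = 0; nid < numnodes; nid++) {
            pg_data_t *pgdat = NODE_DATA(nid);

            if (page >= pgdat->node_mem_map &&
                page <  pgdat->node_mem_map + pgdat->node_size)
                return (page - pgdat->node_mem_map) +
                       (pgdat->node_start_paddr >> PAGE_SHIFT);
        }
        BUG();
        return 0;
    }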

Tried to take a harder look at the discontigmem thing itself but there's quite a bit there to wade through. I think I'll be waiting for the separated patch so I don't need to guess at which chunk corresponds to which feature it's trying to implement. If it were smaller (which I'm not sure it can be) it'd be easier to get a full picture, but it'll only get smaller by becoming multiple patches.

Finally got a look at Pat's NUMA-Q discontigmem patch and was very impressed; the code is very clean and very readable. I'll have to take a harder look to be sure I've done due diligence with respect to it not breaking other things, but it's very nice.

The hashing flamewar apparently degenerated to the name-calling level, though the name-caller does not have a particularly good reputation. I don't care. I'll continue collecting hash table metrics and their measurements from test runs. Sounds like I might be having a benchmark weekend. Again.
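
The kind of numbers I mean, as a standalone sketch rather than the actual analysis code: take the bucket occupancy counts, compute the chi^2 statistic against the uniform expectation, and record the worst chain depth alongside it.

    /* Sketch of a hash table metric: chi^2 over bucket occupancy
     * counts against the uniform expectation, plus max chain length. */
    struct hash_metrics {
        double chisq;           /* sum over buckets of (O - E)^2 / E */
        unsigned long maxchain; /* worst-case bucket depth */
    };

    struct hash_metrics
    measure_buckets(const unsigned long *count, unsigned long nbuckets,
                    unsigned long nitems)
    {
        struct hash_metrics m = { 0.0, 0 };
        double expected = (double)nitems / (double)nbuckets;
        unsigned long i;

        for (i = 0; i < nbuckets; i++) {
            double diff = (double)count[i] - expected;

            m.chisq += diff * diff / expected;
            if (count[i] > m.maxchain)
                m.maxchain = count[i];
        }
        return m;
    }

Run through a chi^2 CDF with nbuckets - 1 degrees of freedom, the chisq figure becomes a probability, which is the part anecdotes can't argue with.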

