ssp is currently certified at Master level.

Name: Søren Sandmann
Member since: 2002-04-20 20:12:43

Homepage: ssp.impulsetrain.com

### Recent blog entries by ssp

The Radix Heap is a priority queue that has better caching behavior than the well-known binary heap, but also two restrictions: (a) that all the keys in the heap are integers and (b) that you can never insert a new item that is smaller than all the other items currently in the heap.

These restrictions are not that severe. The Radix Heap still works in many algorithms that use heaps as a subroutine: Dijkstra’s shortest-path algorithm, Prim’s minimum spanning tree algorithm, various sweepline algorithms in computational geometry.

Here is how it works. If we assume that the keys are 32 bit integers, the radix heap will have 33 buckets, each one containing a list of items. We also maintain one global value last_deleted, which is initially MIN_INT and otherwise contains the last value extracted from the queue.

The invariant is this:

The items in bucket $k$ differ from last_deleted in bit $k - 1$, but not in bit $k$ or higher. The items in bucket 0 are equal to last_deleted.

For example, if we compare an item from bucket 10 to last_deleted, we will find that bits 31–10 are equal, bit 9 is different, and bits 8–0 may or may not be different.

Here is an example of a radix heap where the last extracted value was 7:

As an example, consider the item 13 in bucket 4. The bit pattern of 7 is 0111 and the bit pattern of 13 is 1101, so the highest bit that is different is bit number 3. Therefore the item 13 belongs in bucket $3 + 1 = 4$. Buckets 1, 2, and 3 are empty, but that’s because a number that differs from 7 in bits 0, 1, or 2 would be smaller than 7 and so isn’t allowed in the heap according to restriction (b).

Operations
When a new item is inserted, it has to be added to the correct bucket. How can we compute the bucket number? We have to find the highest bit where the new item differs from last_deleted. This is easily done by XORing them together and then finding the highest bit in the result. Adding one then gives the bucket number:

bucket_no = highest_bit (new_element XOR last_deleted) + 1


where highest_bit(x) is a function that returns the highest set bit of x, or $-1$ if x is 0.

Inserting the item clearly preserves the invariant because the new item will be in the correct bucket, and last_deleted didn’t change, so all the existing items are still in the right place.

Extracting the minimum involves first finding the minimal item by walking the lowest-numbered non-empty bucket and finding the minimal item in that bucket. Then that item is deleted and last_deleted is updated. Then the bucket is walked again and all the items are redistributed into new buckets according to the new last_deleted item.

The extracted item will be the minimal one in the data structure because we picked the minimal item in the redistributed bucket, and all the buckets with lower numbers are empty. And if there were a smaller item in one of the buckets with higher numbers, it would be differing from last_deleted in one of the more significant bits, say bit $k$. But since the items in the redistributed bucket are equal to last_deleted in bit $k$, the hypothetical smaller item would then have to also be smaller than last_deleted, which it can’t be because of restriction (b) mentioned in the introduction. Note that this argument also works for two-complement signed integers.

We have to be sure this doesn’t violate the invariant. First note that all the items that are being redistributed will satisfy the invariant because they are simply being inserted. The items in a bucket with a higher number $k$ were all different from the old last_deleted in the $(k-1)$th bit. This bit must then necessarily also be different from the $(k-1)$th bit in the new last_deleted, because if it weren’t, the new last_deleted would itself have belonged in bucket $k$. And finally, since the bucket being redistributed is the lowest-numbered non-empty one, there can’t be any items in a bucket with a lower number. So the invariant still holds.

In the example above, if we extract the two ‘7’s from bucket 0 and the ‘8’ from bucket 4, the new heap will look like this:

Notice that bucket 4, where the ‘8’ came from, is now empty.

Performance
Inserting into the radix heap takes constant time because all we have to do is add the new item to a list. Determining the highest set bit can be done in constant time with an instruction such as bsr.

The performance of extraction is dominated by the redistribution of items. When a bucket is redistributed, it ends up being empty. To see why, remember that all the items are different from last_deleted in the $(k - 1)$th bit. Because the new last_deleted comes from bucket $k$, the items are now all equal to last_deleted in the $(k - 1)th$ bit. Hence they will all be redistributed to a lower-numbered bucket.

Now consider the life-cycle of a single element. In the worst case it starts out being added to bucket 31 and every time it is redistributed, it moves to a lower-numbered bucket. When it reaches bucket 0, it will be next in line for extraction. It follows that the maximum number of redistributions that an element can experience is 31.

Since a redistribution takes constant time per element distributed, and since an element will only be redistributed $d$ times, where $d$ is the number of bits in the element, it follows that the amortized time complexity of extraction is $O(d)$. In practice we will often do better though, because most items will not move through all the buckets.

Caching performance
Some descriptions of the radix heap recommend implementing the buckets as doubly linked lists, but that would be a mistake because linked lists have terrible cache locality. It is better to implement them as dynamically growing arrays. If you do that, the top of the buckets will tend to be hot which means the per-item number of cache misses during redistribution of a bucket will tend to be $O(1/B)$, where $B$ is the number of integers in a cache line. This means the amortized cache-miss complexity of extraction will be closer to $O(d/B)$ than to $O(d)$.

In a regular binary heap, both insertion and extraction require $\Theta(\log n)$ swaps in the worst case, and each swap (except for those very close to the top of the heap) will cause a cache miss.

In other words, if $d = \Theta(\log n)$, extraction from a radix heap will tend to generate $\Theta(\log n / B)$ cache misses, where a binary heap will require $\Theta(\log n)$.

Syndicated 2013-05-25 00:00:00 from Søren Sandmann Pedersen

Fast Multiplication of Normalized 16 bit Numbers with SSE2

If you are compositing pixels with 16 bits per component, you often need this computation:

uint16_t a, b, r;

r = (a * b + 0x7fff) / 65535;


There is a well-known way to do this quickly without a division:

uint32_t t;

t = a * b + 0x8000;

r = (t + (t >> 16)) >> 16;


Since we are compositing pixels we want to do this with SSE2 instructions, but because the code above uses 32 bit arithmetic, we can only do four operations at a time, even though SSE registers have room for eight 16 bit values. Here is a direct translation into SSE2:

a = punpcklwd (a, 0);
b = punpcklwd (b, 0);
a = pmulld (a, b);
b = psrld (a, 16);
a = psrld (a, 16);
a = packusdw (a, 0);


But there is another way that better matches SSE2:

uint16_t lo, hi, t, r;

hi = (a * b) >> 16;
lo = (a * b) & 0xffff;

t = lo >> 15;
hi += t;
t = hi ^ 0x7fff;

if ((int16_t)lo > (int16_t)t)
lo = 0xffff;
else
lo = 0x0000;

r = hi - lo;


This version is better because it avoids the unpacking to 32 bits. Here is the translation into SSE2:

t = pmulhuw (a, b);
a = pmullw (a, b);
b = psrlw (a, 15);
b = pxor (t, 0x7fff);
a = pcmpgtw (a, b);
a = psubw (t, a);


This is not only shorter, it also makes use of the full width of the SSE registers, computing eight results at a time.

Unfortunately SSE2 doesn’t have 8-bit variants of pmulhuw, pmullw, and psrlw, so we can’t use this trick for the more common case where pixels have 8 bits per component.

Exercise: Why does the second version work?

Syndicated 2013-05-16 05:00:56 from ssp

Sysprof 1.1.8

A new version 1.1.8 of Sysprof is out.

This is a release candidate for 1.2.0 and contains mainly bug fixes.

Syndicated 2013-05-16 05:00:56 from ssp

Gamma Correction vs. Premultiplied Pixels

Pixels with 8 bits per channel are normally sRGB encoded because that allocates more bits to darker colors where human vision is the most sensitive. (Actually, it’s really more of a historical accident, but sRGB nevertheless remains useful for this reason). The relationship between sRGB and linear RGB is that you get an sRGB pixel by raising each component of a linear pixel to the power of $1/2.2$.

A lot of graphics software does alpha blending directly on these sRGB pixels using alpha values that are linearly coded (ie., an alpha value of 0 means no coverage, 0.5 means half coverage, and 1 means full coverage). Because alpha blending is best done with premultiplied pixels, such systems store pixels in this format:

[ alpha,  alpha * red_s,  alpha * green_s,  alpha * blue_s ]


where alpha is linearly coded, and (red_s, green_s, blue_s) are sRGB coded. As long as you are happy with blending in sRGB, this works well. Also, if you simply discard the alpha channel of such pixels and display them directly on a monitor, it will look as if the pixels were alpha blended (in the sRGB space) on top of a black background, which is the desired result.

But what if you want to blend in linear RGB? If you use the format above, some expensive conversions will be required. To convert to premultiplied linear, you have to first divide by alpha, then raise each color to 2.2, then multiply by alpha. To convert back, you must divide by alpha, raise to $1/2.2$, then multiply with alpha.

The conversions can be avoided if you store the pixels linearly, ie., keeping the premultiplication, but coding red, green, and blue linearly instead of as sRGB. This makes blending fast, but the downside is that you need deeper pixels. With only 8 bits per pixel, the linear coding loses too much precision in darker tones. Another problems is that to display these pixels, you will either have to convert them to sRGB, or if the video card can scan them out directly, you have to make sure that the gamma ramp is set to compensate for the fact that the monitor expects sRGB pixels.

[ alpha,  alpha_s * red_s,  alpha_s * green_s,  alpha_s * blue_s ]


That is, the alpha channel is stored linearly, and the color channels are stored in sRGB, premultiplied with the alpha value raised to 1/2.2. Ie., the red component is now

(red * alpha)^(1/2.2),


where before it was

alpha * red^(1/2.2).


It is sufficient to use 8 bits per channel with this format because of the sRGB encoding. Discarding the alpha channel and displaying the pixels on a monitor will produce pixels that are alpha blended (in linear space) against black, as desired.

You can convert to linear RGB simply by raising the R, G, and B components to 2.2, and back by raising to $1/2.2$. Or, if you feel like cheating, use an exponent of 2 so that the conversions become a multiplication and a square root respectively.

This is also the pixel format to use with texture samplers that implement the sRGB OpenGL extensions (textures and framebuffers). These extensions say precisely that the R, G, and B components are raised to 2.2 before texture filtering, and raised to 1/2.2 after the final raster operation.

Syndicated 2013-05-16 05:00:56 from ssp

Over is not Translucency

The Porter/Duff Over operator, also known as the “Normal” blend mode in Photoshop, computes the amount of light that is reflected when a pixel partially covers another:

The fraction of bg that is covered is denoted alpha. This operator is the correct one to use when the foreground image is an opaque mask that partially covers the background:

A photon that hits this image will be reflected back to your eyes by either the foreground or the background, but not both. For each foreground pixel, the alpha value tells us the probability of each:

$a \cdot \text{fg} + (1 - a) \cdot \text{bg}$

This is the definition of the Porter/Duff Over operator for non-premultiplied pixels.

But if alpha is interpreted as translucency, then the Over operator is not the correct one to use. The Over operator will act as if each pixel is partially covering the background:

Which is not how translucency works. A translucent material reflects some light and lets other light through. The light that is let through is reflected by the background and interacts with the foreground again.

Let’s look at this in more detail. Please follow along in the diagram to the right. First with probability $a$, the photon is reflected back towards the viewer:

$a \cdot \text{fg}$

With probability $(1 - a)$, it passes through the foreground, hits the background, and is reflected back out. The photon now hits the backside of the foreground pixel. With probability $(1 - a)$, the foreground pixel lets the photon back out to the viewer. The result so far:

\begin{align*} &a\cdot \text{fg} \\ +&(1 - a) \cdot \text{bg} \cdot (1 - a) \end{align*}

But we are not done yet, because with probability $a$ the foreground pixel reflects the photon once again back towards the background pixel. There it will be reflected, hit the backside of the foreground pixel again, which lets it through to our eyes with probability $(1 - a)$. We get another term where the final $(1 - a)$ is replaced with $a \cdot \text{fg} \cdot \text {bg} \cdot (1 - a)$:

\begin{align*} &a\cdot \text{fg} \\ +&(1 - a) \cdot \text{bg} \cdot (1 - a)\\ +&(1 - a) \cdot \text{bg} \cdot a \cdot \text{fg} \cdot \text{bg} \cdot (1 - a) \end{align*}

And so on. In each round, we gain another term which is identical to the previous one, except that it has an additional $a \cdot \text{fg} \cdot \text{bg}$ factor:

\begin{align*} &a\cdot \text{fg} \\ +&(1 - a) \cdot \text{bg} \cdot (1 - a)\\ +&(1 - a) \cdot \text{bg} \cdot a \cdot \text{fg} \cdot \text{bg} \cdot (1 - a)\\ +&(1 - a) \cdot \text{bg} \cdot a \cdot \text{fg} \cdot \text{bg} \cdot a \cdot \text{fg} \cdot \text{bg} \cdot (1 - a) \\ +&\cdots \end{align*}

or more compactly:

$\displaystyle a \cdot \text{fg} + (1 - a)^2 \cdot \text{bg} \cdot \sum_{i=0}^\infty (a \cdot \text{fg} \cdot \text{bg})^i$

Because we are dealing with pixels, both $a$, $\text{fg}$, and $\text{bg}$ are less than 1, so the sum is a geometric series:

$\displaystyle \sum_{i=0}^\infty x^i = \frac{1}{1 - x}$

Putting them together, we get:

$\displaystyle a \cdot \text{fg} + \frac{(1 - a)^2 \cdot bg}{1 - a \cdot \text{fg} \cdot \text{bg}}$

I have sidestepped the issue of premultiplication by assuming that background alpha is 1. The calculations with premultipled colors are similar, and for the color components, the result is simply:

$\displaystyle r = \text{fg} + \frac{(1 - a_\text{fg})^2 \cdot \text{bg}}{1 - \text{fg}\cdot\text{bg}}$

The issue of destination alpha is more complicated. With the Over operator, both foreground and background are opaque masks, so the light that survives both has the same color as the input light. With translucency, the transmitted light has a different color, which means the resulting alpha value must in principle be different for each color component. But that’s not possible for ARGB pixels. A similar argument to the above shows that the resulting alpha value would be:

$\displaystyle r = 1 - \frac{(1 - a)\cdot (1 - b)}{1 - \text{fg} \cdot \text{bg}}$

where $b$ is the background alpha. The problem is the dependency on $\text{fg}$ and $\text{bg}$. If we simply assume for the purposes of the alpha computation that $\text{fg}$ and $\text{bg}$ are equal to $a$ and $b$, we get this:

$\displaystyle r = 1 - \frac{(1 - a)\cdot (1 - b)}{1 - a \cdot b}$

which is equal to

$\displaystyle a + \frac{(1 - a)^2 \cdot b}{1 - a \cdot b}$

Ie., exactly the same computation as the one for the color channels. So we can define the Translucency Operator as this:

$\displaystyle r = \text{fg} + \frac{(1 - a)^2 \cdot \text{bg}}{1 - \text{fg} \cdot \text{bg}}$

for all four channels.

Here is an example of what the operator looks like. The image below is what you will get if you use the Over operator to implement a selection rectangle. Mouse over to see what it would look like if you used the Translucency operator.

Both were computed in linear RGB. Typical implementations will often compute the Over operator in sRGB, so that’s what see if you actually select some icons in Nautilus. If you want to compare all three, open these in tabs:

Over, in sRGB

Translucency, in linear RGB

Over, in linear RGB

And for good measure, even though it makes zero sense to do this,

Translucency, in sRGB

Syndicated 2013-05-16 05:00:56 from ssp

23 older entries...

Others have certified ssp as follows:

[ Certification disabled because you're not logged in. ]