Brief Glamor Hacks

Eric Anholt started writing Glamor a few years ago. The goal was to provide credible 2D acceleration based solely on the OpenGL API, in particular, to implement the X drawing primitives, both core and Render extension, without any GPU-specific code. When he started, the thinking was that fixed-function devices were still relevant, so that original code didn't insist upon "modern" OpenGL features like GLSL shaders. That made the code less efficient and hard to write.

Glamor used to be a side-project within the X world; seen as something that really wasn't very useful; something that any credible 2D driver would replace with custom highly-optimized GPU-specific code. Eric and I both hoped that Glamor would turn into something credible and that we'd be able to eliminate all of the horror-show GPU-specific code in every driver for drawing X text, rectangles and composited images. That hadn't happened though, until now.

Fast forward to the last six months. Eric has spent a bunch of time cleaning up Glamor internals, and in fact he's had it merged into the core X server for version 1.16 which will be coming up this July. Within the Glamor code base, he's been cleaning some internal structures up and making life more tolerable for Glamor developers.

Using libepoxy

A big part of the cleanup was a transition all of the extension function calls to use his other new project, libepoxy, which provides a sane, consistent and performant API to OpenGL extensions for Linux, Mac OS and Windows. That library is awesome, and you should use it for everything you do with OpenGL because not using it is like pounding nails into your head. Or eating non-tasty pie.

Using VBOs in Glamor

One thing he recently cleaned up was how to deal with VBOs during X operations. VBOs are absolutely essential to modern OpenGL applications; they're really the only way to efficiently pass vertex data from application to the GPU. And, every other mechanism is deprecated by the ARB as not a part of the blessed 'core context'.

Glamor provides a simple way of getting some VBO space, dumping data into it, and then using it through two wrapping calls which you use along with glVertexAttribPointer as follows:

pointer = glamor_get_vbo_space(screen, size, &offset);
glVertexAttribPointer(attribute_location, count, type,
              GL_FALSE, stride, offset);
memcpy(pointer, data, size);
glamor_put_vbo_space(screen);

glamor_get_vbo_space allocates the specified amount of VBO space and returns a pointer to that along with an 'offset', which is suitable to pass to glVertexAttribPointer. You dump your data into the returned pointer, call glamor_put_vbo_space and you're all done.

Actually Drawing Stuff

At the same time, Eric has been optimizing some of the existing rendering code. But, all of it is still frankly terrible. Our dream of credible 2D graphics through OpenGL just wasn't being realized at all.

On Monday, I decided that I should go play in Glamor for a few days, both to hack up some simple rendering paths and to familiarize myself with the insides of Glamor as I'm getting asked to review piles of patches for it, and not understanding a code base is a great way to help introduce serious bugs during review.

I started with the core text operations. Not because they're particularly relevant these days as most applications draw text with the Render extension to provide better looking results, but instead because they're often one of the hardest things to do efficiently with a heavy weight GPU interface, and OpenGL can be amazingly heavy weight if you let it.

Eric spent a bunch of time optimizing the existing text code to try and make it faster, but at the bottom, it actually draws each lit pixel as a tiny GL_POINT object by sending a separate x/y vertex value to the GPU (using the above VBO interface). This code walks the array of bits in the font and checking each one to see if it is lit, then checking if the lit pixel is within the clip region and only then adding the coordinates of the lit pixel to the VBO. The amazing thing is that even with all of this CPU and GPU work, the venerable 6x13 font is drawn at an astonishing 3.2 million glyphs per second. Of course, pure software draws text at 9.3 million glyphs per second.

I suspected that a more efficient implementation might be able to draw text a bit faster, so I decided to just start from scratch with a new GL-based core X text drawing function. The plan was pretty simple:

  1. Dump all glyphs in the font into a texture. Store them in 1bpp format to minimize memory consumption.

  2. Place raw (integer) glyph coordinates into the VBO. Place four coordinates for each and draw a GL_QUAD for each glyph.

  3. Transform the glyph coordinates into the usual GL range (-1..1) in the vertex shader.

  4. Fetch a suitable byte from the glyph texture, extract a single bit and then either draw a solid color or discard the fragment.

This makes the X server code surprisingly simple; it computes integer coordinates for the glyph destination and glyph image source and writes those to the VBO. When all of the glyphs are in the VBO, it just calls glDrawArrays(GL_QUADS, 0, 4 * count). The results were "encouraging":

1: fb-text.perf
2: glamor-text.perf
3: keith-text.perf

       1                 2                           3                 Operation
------------   -------------------------   -------------------------   -------------------------
   9300000.0      3160000.0 (     0.340)     18000000.0 (     1.935)   Char in 80-char line (6x13) 
   8700000.0      2850000.0 (     0.328)     16500000.0 (     1.897)   Char in 70-char line (8x13) 
   6560000.0      2380000.0 (     0.363)     11900000.0 (     1.814)   Char in 60-char line (9x15) 
   2150000.0       700000.0 (     0.326)      7710000.0 (     3.586)   Char16 in 40-char line (k14) 
    894000.0       283000.0 (     0.317)      4500000.0 (     5.034)   Char16 in 23-char line (k24) 
   9170000.0      4400000.0 (     0.480)     17300000.0 (     1.887)   Char in 80-char line (TR 10) 
   3080000.0      1090000.0 (     0.354)      7810000.0 (     2.536)   Char in 30-char line (TR 24) 
   6690000.0      2640000.0 (     0.395)      5180000.0 (     0.774)   Char in 20/40/20 line (6x13, TR 10) 
   1160000.0       351000.0 (     0.303)      2080000.0 (     1.793)   Char16 in 7/14/7 line (k14, k24) 
   8310000.0      2880000.0 (     0.347)     15600000.0 (     1.877)   Char in 80-char image line (6x13) 
   7510000.0      2550000.0 (     0.340)     12900000.0 (     1.718)   Char in 70-char image line (8x13) 
   5650000.0      2090000.0 (     0.370)     11400000.0 (     2.018)   Char in 60-char image line (9x15) 
   2000000.0       670000.0 (     0.335)      7780000.0 (     3.890)   Char16 in 40-char image line (k14) 
    823000.0       270000.0 (     0.328)      4470000.0 (     5.431)   Char16 in 23-char image line (k24) 
   8500000.0      3710000.0 (     0.436)      8250000.0 (     0.971)   Char in 80-char image line (TR 10) 
   2620000.0       983000.0 (     0.375)      3650000.0 (     1.393)   Char in 30-char image line (TR 24) 

This is our old friend x11perfcomp, but slightly adjusted for a modern reality where you really do end up drawing billions of objects (hence the wider columns). This table lists the performance for drawing a range of different fonts in both poly text and image text variants. The first column is for Xephyr using software (fb) rendering, the second is for the existing Glamor GL_POINT based code and the third is the latest GL_QUAD based code.

As you can see, drawing points for every lit pixel in a glyph is surprisingly fast, but only about 1/3 the speed of software for essentially any size glyph. By minimizing the use of the CPU and pushing piles of work into the GPU, we manage to increase the speed of most of the operations, with larger glyphs improving significantly more than smaller glyphs.

Now, you ask how much code this involved. And, I can truthfully say that it was a very small amount to write:

 Makefile.am        |    2 
 glamor.c           |    5 
 glamor_core.c      |    8 
 glamor_font.c      |  181 ++++++++++++++++++++
 glamor_font.h      |   50 +++++
 glamor_priv.h      |   26 ++
 glamor_text.c      |  472 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 glamor_transform.c |    2 
 8 files changed, 741 insertions(+), 5 deletions(-)

Let's Start At The Very Beginning

The results of optimizing text encouraged me to start at the top of x11perf and see what progress I could make. In particular, looking at the current Glamor code, I noticed that it did all of the vertex transformation with the CPU. That makes no sense at all for any GPU built in the last decade; they've got massive numbers of transistors dedicated to performing precisely this kind of operation. So, I decided to see what I could do with PolyPoint.

PolyPoint is absolutely brutal on any GPU; you have to pass it two coordinates for each pixel, and so the very best you can do is send it 32 bits, or precisely the same amount of data needed to actually draw a pixel on the frame buffer. With this in mind, one expects that about the best you can do compared with software is tie. Of course, the CPU version is actually computing an address and clipping, but those are all easily buried in the cost of actually storing a pixel.

In any case, the results of this little exercise are pretty close to a tie -- the CPU draws 190,000,000 dots per second and the GPU draws 189,000,000 dots per second. Looking at the vertex and fragment shaders generated by the compiler, it's clear that there's room for improvement.

The fragment shader is simply pulling the constant pixel color from a uniform and assigning it to the fragment color in this the simplest of all possible shaders:

uniform vec4 color;
void main()
{
       gl_FragColor = color;
};

This generates five instructions:

Native code for point fragment shader 7 (SIMD8 dispatch):
   START B0
   FB write target 0
0x00000000: mov(8)          g113<1>F        g2<0,1,0>F                      { align1 WE_normal 1Q };
0x00000010: mov(8)          g114<1>F        g2.1<0,1,0>F                    { align1 WE_normal 1Q };
0x00000020: mov(8)          g115<1>F        g2.2<0,1,0>F                    { align1 WE_normal 1Q };
0x00000030: mov(8)          g116<1>F        g2.3<0,1,0>F                    { align1 WE_normal 1Q };
0x00000040: sendc(8)        null            g113<8,8,1>F
                render ( RT write, 0, 4, 12) mlen 4 rlen 0      { align1 WE_normal 1Q EOT };
   END B0

As this pattern is actually pretty common, it turns out there's a single instruction that can replace all four of the moves. That should actually make a significant difference in the run time of this shader, and this shader runs once for every single pixel.

The vertex shader has some similar optimization opportunities, but it only runs once for every 8 pixels -- with the SIMD format flipped around, the vertex shader can compute 8 vertices in parallel, so it ends up executing 8 times less often. It's got some redundant moves, which could be optimized by improving the copy propagation analysis code in the compiler.

Of course, improving the compiler to make these cases run faster will probably make a lot of other applications run faster too, so it's probably worth doing at some point.

Again, the amount of code necessary to add this path was tiny:

 Makefile.am        |    1 
 glamor.c           |    2 
 glamor_polyops.c   |  116 ++++++++++++++++++++++++++++++++++++++++++++++++++--
 glamor_priv.h      |    8 +++
 glamor_transform.c |  118 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 glamor_transform.h |   51 ++++++++++++++++++++++
 6 files changed, 292 insertions(+), 4 deletions(-)

Discussion of Results

These two cases, text and points, are probably the hardest operations to accelerate with a GPU and yet a small amount of OpenGL code was able to meet or beat software easily. The advantage of this work over traditional GPU 2D acceleration implementations should be pretty clear -- this same code should work well on any GPU which offers a reasonable OpenGL implementation. That means everyone shares the benefits of this code, and everyone can contribute to making 2D operations faster.

All of these measurements were actually done using Xephyr, which offers a testing environment unlike any I've ever had -- build and test hardware acceleration code within a nested X server, debugging it in a windowed environment on a single machine. Here's how I'm running it:

$ ./Xephyr  -glamor :1 -schedMax 2000 -screen 1024x768 -retro

The one bit of magic here is the '-schedMax 2000' flag, which causes Xephyr to update the screen less often when applications are very busy and serves to reduce the overhead of screen updates while running x11perf.

Future Work

Having managed to accelerate 17 of the 392 operations in x11perf, it's pretty clear that I could spend a bunch of time just stepping through each of the remaining ones and working on them. Before doing that, we want to try and work out some general principles about how to handle core X fill styles. Moving all of the stipple and tile computation to the GPU will help reduce the amount of code necessary to fill rectangles and spans, along with improving performance, assuming the above exercise generalizes to other primitives.

Getting and Testing the Code

Most of the changes here are from Eric's glamor-server branch:

git://people.freedesktop.org/~anholt/xserver glamor-server

The two patches shown above, along with a pair of simple clean up patches that I've written this week are available here:

git://people.freedesktop.org/~keithp/xserver glamor-server

Of course, as this now uses libepoxy, you'll need to fetch, build and install that before trying to compile this X server.

Because you can try all of this out in Xephyr, it's easy to download and build this X server and then run it right on top of your current driver stack inside of X. I'd really like to hear from people with Radeon or nVidia hardware to know whether the code works, and how it compares with fb on the same machine, which you get when you elide the '-glamor' argument from the example Xephyr command line above.