Shared Memory Fences

In our last adventure, "dri3k first steps", one of the 'future work' items was to deal with synchronization between the direct rendering application and the X server. DRI2 "handles" this by performing a round trip each time the application starts using a buffer that was being used by the X server.

As DRI3 manages buffer allocation within the application, there's really no reason to talk to the server, so this implicit serialization point just isn't available to us. As I mentioned last time, James Jones and Aaron Plattner added an explicit GPU serialization system to the Sync extension. These SyncFences serialize rendering between two X clients, and within the server, hooks are provided for the driver to use hardware-specific serialization primitives.

The existing Linux DRM interfaces queue rendering to the GPU in the order requests are made to the kernel, so we don't need the ability to serialize within the GPU, we just need to serialize requests to the kernel. Simple CPU-based serialization gating access to the GPU will suffice here, at least for the current set of drivers. GPU access which is not mediated by the kernel will presumably require serialization that involves the GPU itself. We'll leave that for a future adventure though; the goal today is to build something that works with the current Linux DRM interfaces.

SyncFence Semantics

The semantics required by SyncFences are that multiple clients block on a fence which a single client then triggers. All of the blocked clients start executing requests immediately after the trigger fires.

There are four basic operations on SyncFences:

  • Trigger. Mark the fence as ready and wake up all waiting clients.

  • Await. Block until the fence is ready.

  • Query. Retrieve the current state of the fence.

  • Reset. Unset the fence; future Await requests will block.

SyncFences are essentially the same as the Events provided by Python and other systems; of course, all of the names have been changed to keep things interesting. I'll call them Fences here, to be consistent with current X usage.
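
Concretely, those four operations map onto a small C API. Here's the shape of the interface the rest of this post fills in (the int32_t is assumed to live in a page of memory shared between the X server and the application):

/* The four Fence operations, as implemented below */
int  fence_await(int32_t *f);    /* Await: block until triggered      */
int  fence_trigger(int32_t *f);  /* Trigger: mark ready, wake waiters */
int  fence_query(int32_t *f);    /* Query: 1 if triggered, else 0     */
void fence_reset(int32_t *f);    /* Reset: future Awaits will block   */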

Using Pthread Primitives

One fact about pthreads that I recently learned is that the synchronization primitives (mutexes, barriers and semaphores) are actually supposed to work across process boundaries if those objects are placed in shared memory mapped by each process. That seemed like a great simplification for this project: allocate a page of shared memory, map it into the X server and the direct rendering application, and use the existing pthreads APIs.
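
For the record, that approach would have looked something like the sketch below: a process-shared mutex and condition variable placed in the shared page. The struct name and layout here are just for illustration; pthread_cond_broadcast would play the role of Trigger and pthread_cond_wait the role of Await.

#include <pthread.h>
#include <stdint.h>

/* Hypothetical layout for a pthreads-based fence living in shared memory */
struct pshared_fence {
    pthread_mutex_t lock;
    pthread_cond_t  cond;
    int32_t         triggered;
};

static void pshared_fence_init(struct pshared_fence *f)
{
    pthread_mutexattr_t mattr;
    pthread_condattr_t  cattr;

    /* PTHREAD_PROCESS_SHARED is what lets these objects work across
     * process boundaries when placed in shared memory */
    pthread_mutexattr_init(&mattr);
    pthread_mutexattr_setpshared(&mattr, PTHREAD_PROCESS_SHARED);
    pthread_mutex_init(&f->lock, &mattr);

    pthread_condattr_init(&cattr);
    pthread_condattr_setpshared(&cattr, PTHREAD_PROCESS_SHARED);
    pthread_cond_init(&f->cond, &cattr);

    f->triggered = 0;
}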

Alas, the pthread objects are architecture specific. I'm pretty sure that when that spec was written, no one ever thought of running multiple architectures within the same memory space. I went and looked at the code to check, and found that each of these objects has a different size and structure on the x86 and x86_64 architectures. That makes it pretty hard to use this API within X, as we often have both 32- and 64-bit applications talking to the same (presumably 64-bit) X server.

As a last resort, I read through a bunch of articles on using futexes directly within applications and decided that it was probably possible to implement what I needed in an architecture-independent fashion.

Futexes

Linux Futexes live in this strange limbo of being a not-quite-public kernel interface. Glibc uses them internally to implement locking primitives, but it doesn't export any direct interface to the system call. Certainly they're easy to use incorrectly, but it's unusual in the Linux space to have our fundamental tools locked away 'for our own safety'.

Fortunately, we can still get at futexes by creating our own syscall wrappers.

#include <unistd.h>
#include <time.h>
#include <sys/syscall.h>
#include <linux/futex.h>

static inline long sys_futex(void *addr1, int op, int val1,
                 struct timespec *timeout, void *addr2, int val3)
{
    return syscall(SYS_futex, addr1, op, val1, timeout, addr2, val3);
}

For this little exercise, I created two simple wrappers, one to block on a futex:

/* Returns immediately (EWOULDBLOCK) if *addr no longer equals value */
static inline int futex_wait(int32_t *addr, int32_t value) {
    return sys_futex(addr, FUTEX_WAIT, value, NULL, NULL, 0);
}

and one to wake up all futex waiters:

#include <limits.h>     /* INT_MAX means "wake every waiter" */

static inline int futex_wake(int32_t *addr) {
    return sys_futex(addr, FUTEX_WAKE, INT_MAX, NULL, NULL, 0);
}

Atomic Memory Operations

I need atomic memory operations to keep separate cores from seeing different values of the fence. GCC provides a few such primitives, and I picked __sync_bool_compare_and_swap and __sync_val_compare_and_swap. I also need fetch and store operations that the compiler won't shuffle around:

/* A compiler barrier: it keeps the compiler from reordering accesses
 * around it, but does not emit a CPU memory fence */
#define barrier() __asm__ __volatile__("": : :"memory")

static inline void atomic_store(int32_t *f, int32_t v)
{
    barrier();
    *f = v;
    barrier();
}

static inline int32_t atomic_fetch(int32_t *a)
{
    int32_t v;
    barrier();
    v = *a;
    barrier();
    return v;
}

If your machine doesn't make these two operations atomic, then you would redefine these as needed.
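
For example, one hypothetical redefinition for a machine that needs real hardware ordering would use GCC's __sync_synchronize() builtin in place of the compiler-only barrier. This is just a sketch; the code below relies on the versions above.

static inline void atomic_store_mb(int32_t *f, int32_t v)
{
    __sync_synchronize();   /* full memory barrier before the store */
    *f = v;
    __sync_synchronize();   /* full memory barrier after the store */
}

static inline int32_t atomic_fetch_mb(int32_t *a)
{
    int32_t v;

    __sync_synchronize();
    v = *a;
    __sync_synchronize();
    return v;
}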

Futex-based Fences

The wake-all semantics of Fences greatly simplify reasoning about the operation, as there's no need to ensure that only a single thread runs past Await; the only requirement is that no threads pass the Await operation until the fence is triggered.

A Fence is defined by a single 32-bit integer which can take one of three values:

  • 0 - The fence is not triggered, and there are no waiters.
  • 1 - The fence is triggered (there can be no waiters at this point).
  • -1 - The fence is not triggered, and there are waiters (one or more).
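
Spelling out the operations described below, the only transitions the fence value ever makes are these:

/* Fence value transitions:
 *
 *    0 -> -1   Await: first waiter arrives
 *    0 ->  1   Trigger: no waiters present
 *   -1 ->  1   Trigger: mark ready, then wake all waiters
 *    1 ->  0   Reset
 */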

With those, I built the fence operations as follows. Here's Await:

#include <errno.h>      /* errno, EWOULDBLOCK */

int fence_await(int32_t *f)
{
    /* Loop until the fence value is seen as 1 (triggered) */
    while (__sync_val_compare_and_swap(f, 0, -1) != 1) {
        /* The kernel only puts us to sleep if the value is still -1 */
        if (futex_wait(f, -1)) {
            if (errno != EWOULDBLOCK)
                return -1;
        }
    }
    return 0;
}

The basic requirement that the thread not run until the fence is triggered is met by fetching the current value of the fence and comparing it with 1. Until it is signaled, that comparison will return false.

The compare_and_swap operation makes sure the fence is -1 before the thread calls futex_wait: either it was already -1 because there were other waiters, or it was 0 and is now -1 because this thread is the first waiter. This needs to be an atomic operation so that the fence value will be seen as -1 by the trigger operation if there are any threads in the syscall.

The futex_wait call will return once the fence has been triggered and the waiters woken; because the kernel re-checks the value before sleeping, it also ensures that the thread won't block if the trigger occurs between the swap and the syscall.

Here's the Trigger function:

int fence_trigger(int32_t *f)
{
    /* If the fence was 0 (or already 1), the swap is all that's needed */
    if (__sync_val_compare_and_swap(f, 0, 1) == -1) {
        /* Old value was -1: there may be waiters.  Mark the fence
         * triggered, then wake everyone blocked in futex_wait */
        atomic_store(f, 1);
        if (futex_wake(f) < 0)
            return -1;
    }
    return 0;
}

The atomic compare_and_swap operation will make sure that no Await thread swaps the 0 for a -1 while the trigger is changing the value from 0 to 1; either the Await switches from 0 to -1 or the Trigger switches from 0 to 1.

If the value before the compare_and_swap was -1, then there may be threads waiting on the Fence. In that case, the Fence is marked triggered with an atomic store (constructed from two barriers and a regular store operation), followed by the futex_wake call to unblock all Awaiting threads.

The Query function is just an atomic fetch:

int fence_query(int32_t *f)
{
    return atomic_fetch(f) == 1;
}

Reset requires a compare_and_swap so that it doesn't disturb things if the fence has already been reset and there are threads waiting on it:

void fence_reset(int32_t *f)
{
    __sync_bool_compare_and_swap(f, 1, 0);
}
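
To see all of the pieces working together, here's the sort of throwaway test I'd use to exercise the fence across two processes. It's purely illustrative, not part of the DRI3 code, and assumes the futex wrappers and fence functions above are pasted in ahead of it.

#include <stdint.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/wait.h>

int main(void)
{
    /* One page of anonymous shared memory holds the fence value */
    int32_t *fence = mmap(NULL, sizeof (int32_t),
                          PROT_READ | PROT_WRITE,
                          MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (fence == MAP_FAILED)
        return 1;

    *fence = 0;                 /* not triggered, no waiters */

    pid_t pid = fork();
    if (pid == 0) {
        /* Child: block until the parent triggers the fence */
        fence_await(fence);
        printf("child: awake, query = %d\n", fence_query(fence));
        _exit(0);
    }

    /* Parent: give the child a moment, then trigger.  The sleep isn't
     * needed for correctness; if the trigger fires first, the child's
     * Await simply returns immediately. */
    sleep(1);
    fence_trigger(fence);
    waitpid(pid, NULL, 0);
    fence_reset(fence);         /* ready for another round */
    return 0;
}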

A Request for Review

Ok, so we've all tried to create synchronization primitives only to find that our 'obvious' implementations were full of holes. I'd love to hear from you if you've identified any problems in the above code, or if you can figure out how to use the existing glibc primitives for this operation.