Playing around with LLC controls on Sandybridge today.

What I wanted to know was how caching back buffers affects application performance. The reason for this question is that when the driver uses page flipping to display a new frame, we force the new scanout buffer to be uncached, because the display hardware only scans out from main memory. However, when we go back to using this buffer as the next back buffer, we don't turn the caching bits back on, as that would require rewriting GTT entries to change the caching mode.

This means that swapping via page flipping and swapping via copying produce very different main-memory access patterns during rendering -- page-flipping applications render to an uncached back buffer while copying applications render to a cached one.
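In i915 terms, the difference looks something like this (a simplified sketch; the function and cache-level names are the real ones, and they show up in the patch at the end of this post):

/*
 * Page flip:
 *   back buffer becomes the scanout buffer
 *     i915_gem_object_pin_to_display_plane()
 *       -> i915_gem_object_set_cache_level(obj, I915_CACHE_NONE)
 *   nothing ever sets the buffer back to I915_CACHE_LLC, so once it has
 *   been flipped to, all later rendering to it as a back buffer is
 *   uncached.
 *
 * Copy swap:
 *   back buffer is blitted into the scanout buffer
 *   the back buffer itself is never pinned to a display plane, so it
 *   keeps its I915_CACHE_LLC mode and rendering to it stays cached.
 */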

What effect does the back-buffer caching mode have on application performance?

The changes

In the kernel, I have three modes:

1) Existing code. Set buffer objects to cache mode none when preparing them as scanout buffers. The first time the application flips a back buffer to the scanout, caching is disabled on that buffer forever.

2) Flip back and forth. Set buffers to cache mode none when used as scanout; when that stops, flip them back to cache mode llc. This involves a whole lot of clflushing and GTT rewriting, obviously.

3) Flush to memory, but don't disable caching. In this mode, the buffer gets flushed (from both the CPU and the GPU) before being used as a scanout buffer, but the cache mode isn't changed.

Benchmarking with glxgears

The first test was pretty simple -- I hacked up glxgears (yes, glxgears is not a benchmark) to redraw the same frame multiple times (including the clear) per swap, to provide a scalable amount of work per swap. No pixel operations, so this is just writing to the back buffer. The hacked-up loop looks roughly like the sketch below, and the numbers in the table (frames per second) are the median of 5 runs, except for those obviously limited by the refresh rate.
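The hack, as a sketch rather than the actual diff -- reps, draw_frame(), dpy and win stand in for glxgears' real option handling, drawing code and X state:

#include <GL/gl.h>
#include <GL/glx.h>

extern Display *dpy;            /* glxgears' existing X connection */
extern Window win;              /* and its window */
extern void draw_frame(void);   /* the existing gear-drawing code */

static void
redraw(int reps)
{
    int i;

    for (i = 0; i < reps; i++) {
        /* each rep clears the whole back buffer ... */
        glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT);
        /* ... then draws the gears on top; nothing ever reads
         * the back buffer back */
        draw_frame();
    }
    /* one swap per 'reps' redraws */
    glXSwapBuffers(dpy, win);
}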

Reps    Existing    Flip      Flush
  10      60          60        60
  20      60          60        60
  40      42.070      34.113    37.130
  80      21.259      17.957    18.658

Note the linear scaling with reps (once you get below the refresh rate). I think this means that the swap overhead, including the flushing and GTT rewriting, is in the noise. Given that my screen is (yuck) 1366x768 pixels, or just over 4MB, and my total LLC is 4MB, it seems like the performance differences here are likely caused by the rendering evicting everything else from the cache.
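(For the record: 1366 x 768 pixels x 4 bytes per pixel = 4,196,352 bytes, so a single back buffer just about fills the LLC by itself.)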

What's wrong with benchmarking with glxgears?

Of course, glxgears is not a benchmark, and I realized while collecting this data that, in particular, it never reads from the back buffer, and most pixels are written only once, by the clear operation at the start of the rendering code. So it forces everything else out of the cache at the start of each drawing cycle, but doesn't stick anything useful into the cache.

Benchmarking nexuiz, a (more) real application

For the next test, I ran the nexuiz demo1.dem in all three modes. The numbers weren't nearly as stable as the glxgears results, and I'm afraid I didn't write them all down, but I got between 45 and 50 frames per second in all three modes. None significantly faster, none significantly slower.

So, for these two tests, caching the back buffer has no positive effect on overall rendering performance. Obviously, I need to collect data from more applications to see whether that holds in general. I sure hope it does, because the alternative would be to find some heuristic to decide when to enable caching.

The patch

This is on top of kernel version 3.5.
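The three modes above are selected at compile time by the two #defines the patch adds to i915_gem.c:

/*
 * JUST_FLUSH 0, FLIP_CACHING 0 -> mode 1: existing code; scanout
 *                                 buffers go to cache mode none and stay there
 * JUST_FLUSH 0, FLIP_CACHING 1 -> mode 2: set the cache mode back to
 *                                 I915_CACHE_LLC when unpinned from the display plane
 * JUST_FLUSH 1, FLIP_CACHING 0 -> mode 3: flush before scanout, but
 *                                 leave the cache mode alone
 */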

diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index b0b676a..f69fba8 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -1335,6 +1335,10 @@ int __must_check
 i915_gem_object_pin_to_display_plane(struct drm_i915_gem_object *obj,
                     u32 alignment,
                     struct intel_ring_buffer *pipelined);
+
+void
+i915_gem_object_unpin_from_display_plane(struct drm_i915_gem_object *obj);
+
 int i915_gem_attach_phys_object(struct drm_device *dev,
                struct drm_i915_gem_object *obj,
                int id,
diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index 288d7b8..b606bd2 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -2851,6 +2851,9 @@ int i915_gem_object_set_cache_level(struct drm_i915_gem_object *obj,
    return 0;
 }

+#define JUST_FLUSH 0
+#define FLIP_CACHING   0
+
 /*
  * Prepare buffer for display plane (scanout, cursors, etc).
  * Can be called from an uninterruptible phase (modesetting) and allows
@@ -2883,7 +2886,11 @@ i915_gem_object_pin_to_display_plane(struct drm_i915_gem_object *obj,
     * of uncaching, which would allow us to flush all the LLC-cached data
     * with that bit in the PTE to main memory with just one PIPE_CONTROL.
     */
+#if JUST_FLUSH
+   ret = i915_gem_object_finish_gpu(obj);
+#else
    ret = i915_gem_object_set_cache_level(obj, I915_CACHE_NONE);
+#endif
    if (ret)
        return ret;

@@ -2913,6 +2920,16 @@ i915_gem_object_pin_to_display_plane(struct drm_i915_gem_object *obj,
    return 0;
 }

+void
+i915_gem_object_unpin_from_display_plane(struct drm_i915_gem_object *obj)
+{
+   i915_gem_object_unpin(obj);
+#if FLIP_CACHING
+   if (HAS_LLC(obj->base.dev))
+       i915_gem_object_set_cache_level(obj, I915_CACHE_LLC);
+#endif
+}
+
 int
 i915_gem_object_finish_gpu(struct drm_i915_gem_object *obj)
 {
diff --git a/drivers/gpu/drm/i915/intel_display.c b/drivers/gpu/drm/i915/intel_display.c
index a8538ac..76a2012 100644
--- a/drivers/gpu/drm/i915/intel_display.c
+++ b/drivers/gpu/drm/i915/intel_display.c
@@ -1821,7 +1821,7 @@ err_interruptible:
 void intel_unpin_fb_obj(struct drm_i915_gem_object *obj)
 {
    i915_gem_object_unpin_fence(obj);
-   i915_gem_object_unpin(obj);
+   i915_gem_object_unpin_from_display_plane(obj);
 }

 static int i9xx_update_plane(struct drm_crtc *crtc, struct drm_framebuffer *fb,