I’ve recently received an Oculus Rift Development Kit Mk. II, and since I’m on Linux, there is no official SDK for me and I’m pretty much out there on my own. But that’s OK; it’s given me a chance to experiment with the DK2 as a black box, investigate ways I could support it in my VR toolkit under Linux, and improve Vrui’s user experience while I’m at it. I also managed to score a genuine Oculus VR Latency Tester, and did a set of experiments with interesting results. If you just want to see those results, skip to the end.
The Woes of Windows
If you’ve been paying attention to the Oculus subreddit since the first DK2s were delivered to developers/enthusiasts, you’ll have noticed a consensus that the user experience of the DK2 and the SDK that drives it could be somewhat improved. Granted, it’s a developer’s kit and not a consumer product, but even developers seem to be spending more time getting the DK2 to run smoothly, or run at all, than actually developing for it (or at least that’s the impression I get from the communal bellyaching).
So what appear to be the main sticking points with the SDK, which practically, since Linux support doesn’t exist and Mac OS X is treated like a red-headed stepchild, means the Windows SDK? First, a bit of background. Under Windows, there are two distinct ways to drive the Rift: in “extended display” or legacy mode, and in “direct-to-Rift” mode. Extended display mode is what happens when someone plugs a Rift, or any other display, into a Windows computer: the display manager opens a new screen for the newly-connected display, and places it somewhere next to the existing display(s), such that the desktop now spans all of them. In extended mode, there appear to be the following problems:
- Sometimes, the VR software doesn’t find the Rift’s display, and places the VR window in some random location, maybe sticking halfway out to the side etc. This means the software becomes unusable in that instance.
- Alternatively, the VR software shows severe “judder,” where the world jerks back and forth while the user smoothly moves her head.
- Or, related to that, the Rift’s display only displays 60 frames a second, instead of its native refresh rate of 75Hz.
- There appears to be more end-to-end latency than there should be.
As it turns out, all these effects boil down to how Windows, and Windows’ 3D graphics libraries, OpenGL and Direct3D, handle multiple displays. I don’t have an explanation for the first problem besides a bug in the SDK, but the second to fourth can squarely be blamed on the OS. (Disclaimer: I know pretty much nothing about Windows or Direct3D, so the following is slightly conjecture-y.) Deep down, the last three issues are caused by the way Windows synchronizes graphics updates with their respective displays. But first, some background (bear with me, this will be on the test).
The Anatomy of a Video Frame
How do a video card and a display device work together to create an image? These days, all displays are raster displays, and there is only one way. The video card maintains a frame buffer, a grid of pixels representing the image on the display screen (in reality the frame buffer is fragmented, but that doesn’t affect the principle). Let’s ignore display scaling and assume that the frame buffer has the same number of pixels as the display. A very low-level part of the graphics card, the video controller, has the job of sending those pixels from the graphics card’s memory to the display. And because the cable connecting the two is a serial cable, the pixels have to be sent one after the other. To achieve this, the video controller runs in a tight loop: it sends a frame of pixels by traversing the frame buffer in left-to-right, top-to-bottom order, starting over from the top-left (almost) immediately after reaching the bottom-right. There are three major rates involved here: the pixel clock, the rate at which individual pixels are sent across the cable (usually measured in MHz); the horizontal retrace frequency, the rate at which entire rows of pixels are sent (usually measured in kHz); and the vertical retrace frequency, the rate at which entire frames of pixels are sent (usually measured in Hz).
To give an example: a 60Hz 1080p video signal used for computer monitors (1920×1080 pixels) has a pixel clock of 148.5MHz, a horizontal retrace frequency of 67.5kHz, and a vertical retrace frequency of 60Hz. “But wait,” you say, “those numbers don’t work out!” And you’re right.
More background required. In the olden days of cathode ray tubes (CRTs), displays were pretty dumb. A CRT creates a picture by tracing an electron beam across its display surface, in left-to-right, top-to-bottom order. That beam is modulated directly by the (analog) pixel data arriving over the serial display cable, without any buffering or processing. But there was a problem: when the electron beam had finished tracing out a row of pixels, it had to return to the left edge to trace out the next row, and that took some time. The same happened when the beam had to return from the bottom-right corner to the top-left corner at the end of a frame. The simplest solution to the problem is still with us, regardless of CRTs having gone the way of the dodo: add some padding to the video signal. The real pixel data, say 1920 horizontal pixels of a 1080p frame, are embedded into a larger video line: it starts with the horizontal retrace, a synchronization pulse to return the electron beam to the leftmost position, followed by a period of unused pixels, the horizontal back porch, then the actual pixel data, then another period of unused pixels, the horizontal front porch (note that front and back porch seem reversed because they’re relative to the sync pulse, not the pixel data). Same for frames: a full video frame starts with the vertical retrace, a synchronization pulse to return the electron beam to the top-left corner, followed by the vertical back porch, then the real pixel rows with their per-line padding, and finally the vertical front porch. See Figure 1.
Let’s rework the example: in our 60Hz 1080p video signal, the horizontal sync pulse is 56 pixels, the horizontal back porch is 139 pixels, the horizontal front porch is 85 pixels, the vertical sync pulse is 6 pixel rows, the vertical back porch is 37 pixel rows, and the vertical front porch is 2 pixel rows. Now the numbers make sense: 148.5MHz divided by 2200 total pixels per video line is 67.5kHz, and that divided by 1125 total pixel rows per video frame is 60Hz. Ta da.
Update: Just FYI, the Rift DK2’s video timings in its native (unrotated) 1080×1920 mode are: pixel clock 164.9816 MHz, number of visible columns 1080, horizontal front porch 33, horizontal sync pulse 10, horizontal back porch 15 (total line width 1138 and horizontal retrace frequency 144.975 kHz), number of visible rows 1920, vertical front porch 1, vertical sync pulse 6, vertical back porch 6 (total frame height 1933 and vertical retrace frequency 75 Hz).
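The arithmetic in both examples can be checked with a few lines of Python. This is only a sketch: the `VideoTiming` class and its field names are made up for illustration, and the numbers are copied from the text above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VideoTiming:
    """Illustrative container for one video mode; all names are hypothetical."""
    pixel_clock_hz: float
    h_visible: int
    h_front_porch: int
    h_sync: int
    h_back_porch: int
    v_visible: int
    v_front_porch: int
    v_sync: int
    v_back_porch: int

    @property
    def h_total(self) -> int:
        # Total pixels per video line, including the padding.
        return self.h_visible + self.h_front_porch + self.h_sync + self.h_back_porch

    @property
    def v_total(self) -> int:
        # Total pixel rows per video frame, including the padding.
        return self.v_visible + self.v_front_porch + self.v_sync + self.v_back_porch

    @property
    def line_rate_hz(self) -> float:
        # Horizontal retrace frequency.
        return self.pixel_clock_hz / self.h_total

    @property
    def refresh_rate_hz(self) -> float:
        # Vertical retrace frequency.
        return self.line_rate_hz / self.v_total


# Timings as quoted in the text.
HD_1080P60 = VideoTiming(148.5e6, 1920, 85, 56, 139, 1080, 2, 6, 37)
RIFT_DK2 = VideoTiming(164.9816e6, 1080, 33, 10, 15, 1920, 1, 6, 6)

for name, t in (("1080p60", HD_1080P60), ("Rift DK2", RIFT_DK2)):
    print(f"{name}: {t.h_total} x {t.v_total} total, "
          f"{t.line_rate_hz / 1e3:.3f} kHz, {t.refresh_rate_hz:.3f} Hz")
```

Both modes come out at the stated line rates and refresh rates once the porches and sync pulses are counted.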
A Literal Race Condition
How does all this affect judder and latency on the Rift DK2 in extended mode? Directly and fundamentally, that’s how. Modern 3D graphics applications generate images by drawing primitives, primarily triangles or bitmaps. These primitives are drawn in some application-defined order, but due to the user’s freedom to view a 3D scene from almost arbitrary viewpoints, these primitives don’t arrive in the frame buffer nicely ordered. In other words, 3D graphics applications don’t draw images nicely from left-to-right, top-to-bottom, but draw small primitives in random order all over the place. And that causes contention with the video controller’s scanout, which continuously races across the frame buffer in left-to-right, top-to-bottom order. Imagine a simple 3D scene: a white rectangle in the middle of the frame, occluded by a red rectangle in front of it. The application draws the white rectangle first, then the red one. Now imagine that, due to bad luck, the video controller scans out the frame buffer region containing the rectangles directly between the application drawing the white one and the red one. As a result, the video controller only sees the white rectangle and sends its pixels to the display, where it is shown to the user, violating the application’s intent. The viewer seeing an incorrect image is such a bad problem that the brute-force solution to it is firmly entrenched in graphics programmers’ lizard brains: double buffering.
Instead of drawing directly to the frame buffer, applications draw to a separate, hidden, buffer, the back buffer. Once the image is completely drawn and therefore correct, the entire back buffer is copied into the frame buffer, from where the video controller sends it to the display. But doesn’t that cause another race condition? What if the copy operation is overtaken by scanout? The result is tearing, where the lower part of the displayed video frame contains the new image, and the upper part still contains the old image (because it was sent earlier). If an application can draw images very quickly, it’s even possible to have multiple generations of images on the screen at the same time, in vertical bands.
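To make the tearing scenario concrete, here is a toy model in Python. It assumes the buffer copy is instantaneous and ignores blanking; the function name `scanned_generations` and its parameters are invented for this sketch.

```python
from bisect import bisect_right


def scanned_generations(visible_rows, line_period, swap_times):
    """Return, for each row, how many buffer copies had completed when
    scanout read that row. Generation 0 is the old image.

    Toy assumptions: row r is read at time r * line_period after scanout
    starts, and each copy happens instantaneously at its swap time.
    """
    swaps = sorted(swap_times)
    return [bisect_right(swaps, row * line_period) for row in range(visible_rows)]


# One copy lands in the middle of scanout: old image on top, new one below.
print(scanned_generations(8, 1.0, [3.5]))
# A very fast application copies twice per frame: three bands.
print(scanned_generations(8, 1.0, [2.5, 5.5]))
```

Every row belongs to exactly one complete image, which is why tearing is less objectionable than a scanout race against half-drawn primitives.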
Tearing is an artifact, but a much less objectionable one than scanout race, because while there are multiple partial images on the screen at once, at least all those images are themselves correct (meaning, none of them would show the white rectangle from our example). Still, tearing should be solved. Is there any time where the video controller is not reading from the frame buffer? Why, yes — conveniently, during the vertical blanking interval, the combination of front porch, sync pulse, and back porch. The solution to tearing is vertical retrace synchronization, or vsync for short, where a new back buffer is copied into the frame buffer while the video controller is idly generating signal padding originally meant for ancient CRT monitors. So the canonical 3D graphics rendering loop is as follows:
- Draw image, consisting of primitives such as triangles or bitmaps, into a back buffer.
- Wait until the video controller enters the vertical blanking interval.
- Copy the back buffer into the frame buffer.
- Rinse and repeat.
This loop is so ingrained that many graphics programmers, myself unfortunately included, can’t even conceive of doing it otherwise (more on that later). But for the Rift in extended mode, there is a problem: wait until which video controller enters the vertical blanking interval? As it turns out, Windows’ display manager always syncs to the primary display. And, since the Rift makes a poor primary display, that’s usually the wrong one. “But wait,” you say, “wrong synchronization only causes tearing, which ain’t so bad!” Unfortunately, wrong. There is another very clever technology at play here, dubbed time warp. Without explaining it in detail, time warp reduces apparent display latency, the bane of VR, by predicting the viewer’s head motion over the duration of a video frame. But to do that correctly, the real display latency, between applying time warp and the time-warped image being presented to the viewer, must be very predictable.
Did I mention that I got my hands on a latency tester? Here’s a preview of one experiment. The two diagrams in Figure 2 are histograms of measured display latencies, with the vertical axis latency in ms, and the horizontal axis number of measurements. The measurements were taken on a Rift DK2, running at 75Hz in extended mode next to a 60Hz primary display. The left histogram is when vsync is locked to the Rift’s display, and everything looks great. There is a clear spike at about 4.3ms, and latency is highly predictable from frame to frame. Not so in the right diagram: here vsync is locked to the main monitor, and as a result, real display latency is all over the place. Reader, meet judder: due to unpredictable real latency, time warp will randomly over- or undershoot the viewer’s real head motion during a video frame, and the view will jerk back and forth rapidly even during smooth head motions.
Apparently, the only workable extended mode under Windows is with the Rift as primary display, which is inconvenient because UI components will appear by default on the Rift’s screen, where they will be near-impossible to use. An oft-proposed workaround is to force the main display and the Rift to run at the same frame rate, either both at 60Hz, or both at 75Hz. That reduces the problem, but does not eradicate it. For one, two displays running at 60Hz don’t usually run at the exact same frequency, due to minor differences in video timing. To lock two displays to each other, one would have to use the exact same pixel clocks, resolutions, and horizontal and vertical blanking periods. Windows itself does not offer such fine control of video timings; as a result, most displays run at real refresh rates of, say, 59.96Hz or 60.02Hz. Furthermore, even when two displays do run at the exact same frequency, they do not necessarily run at the same phase. One might be in the vertical blanking interval, the other might be in the middle of its frame. This would lead to more predictable display latency, but would cause consistent tearing.
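A small Python sketch shows why mismatched or nearly matched refresh rates never settle into a stable relationship. It is a deliberately simplified model: both displays are assumed to start a frame at the same instant, and the function name `vsync_phases` is made up for illustration.

```python
from fractions import Fraction


def vsync_phases(sync_hz, display_hz, count):
    """For the first `count` vsync events of the display the application is
    synced to, return where each one falls inside the other display's
    refresh cycle, as a fraction of that cycle (0 <= phase < 1).

    Rates may be ints or decimal strings so the arithmetic stays exact.
    """
    ratio = Fraction(display_hz) / Fraction(sync_hz)
    return [(k * ratio) % 1 for k in range(count)]


# Synced to a 60Hz primary display while the Rift runs at 75Hz:
# the swap lands at a different point of the Rift's scanout every frame.
print([str(p) for p in vsync_phases(60, 75, 5)])

# Two nominally identical displays at 59.96Hz and 60.02Hz slowly drift apart.
drift_per_frame = vsync_phases("59.96", "60.02", 2)[1]
print(drift_per_frame, "of a frame per frame, a full cycle every",
      round(float(1 / drift_per_frame)), "frames")
```

In the 60Hz/75Hz case the phase cycles through four different values, and in the nearly matched case it creeps through the whole frame over several seconds, so neither setup gives a fixed offset between buffer swap and scanout.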
To solve all these problems, and some we haven’t discussed yet, Oculus introduced “Direct-to-Rift Mode.” A custom OS kernel driver prevents Windows’ display manager from asserting control over the Rift’s display, and allows Oculus’ SDK to write images directly to the Rift. This immediately addresses the main problems with extended mode:
- VR software does not need to find the Rift’s display in Windows’ extended desktop; by writing to the Rift directly, stereoscopic images always end up precisely in the right locations.
- The kernel driver can generate a custom vsync signal for the Rift’s display, which prevents tearing, makes display latency predictable and time warp work, lets the Rift sync at its native refresh rate of 75Hz, and therefore lets applications render at 75 frames per second.
- Direct-to-Rift mode reduces display latency.
The first two bullet points are obvious, but the last threw me for a loop, until I was able to overcome my double-buffer conditioning. How exactly does “directly writing to a display” differ from, and have lower latency than, the way it’s normally done, i.e., handing 3D primitives to a graphics library (OpenGL or Direct3D), and letting the latter write to the graphics card’s frame buffer? To understand that, we have to look back at video signal timings and the canonical rendering loop.
The main issue with vsync-ed double buffering is that, in a typical application, it incurs at least Tframe, or one full frame, of latency, e.g., 16.67ms at 60Hz. Step 2 in the loop waits for the vertical sync pulse to occur, then copies the back buffer, and then repeats the application’s inner loop. But copying the back buffer is an extremely fast operation (it’s typically implemented as a simple pointer swap), and therefore the application loop will start immediately after retrace. And the first thing typical application loops do is poll input devices, or, in the case of VR, the head tracker. Now let’s assume that the application is very efficient, and can perform its inner loop and render the new frame in a few milliseconds. But it won’t be able to present the new frame to the video controller until the back buffer is swapped at the next sync pulse, which is almost exactly one frame after the input devices have been polled. Meaning, in the ideal case, motion-to-photon latency is one full frame plus whatever latency is inherent in the display, or (Tframe + Tdisplay).
Now, if the application had a hard upper bound Tmax on its inner loop’s execution time, it could do something clever. It could pause for (Tframe – Tmax) after the vertical retrace, and then run the next loop iteration. This means input devices are now polled Tmax before the next sync pulse, reducing motion-to-photon latency to (Tmax + Tdisplay). The sticky issue is that it is often impossible to find a tight upper bound on application processing time, because it depends on many unpredictable factors such as I/O load, viewing conditions, and AI activity. And if the application takes only a fraction of a millisecond too long, it misses an entire frame, causing a disruptive skip in the VR display.
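The two latency formulas, and the cost of overrunning Tmax, can be written down as a short Python sketch. The function names and the example values (3ms for Tmax, 5ms for Tdisplay) are assumptions chosen for illustration, not measurements.

```python
import math


def canonical_latency_ms(frame_ms, display_ms):
    """Poll input right after vsync, present at the next vsync:
    Tframe + Tdisplay."""
    return frame_ms + display_ms


def delayed_latency_ms(frame_ms, t_max_ms, display_ms, actual_ms=None):
    """Pause for (Tframe - Tmax) after vsync, then poll input and render.

    If the real processing time `actual_ms` exceeds the budget Tmax, the
    frame misses its vsync and is shown one or more frames late.
    """
    if not 0 <= t_max_ms <= frame_ms:
        raise ValueError("Tmax must lie between 0 and Tframe")
    work_ms = t_max_ms if actual_ms is None else actual_ms
    overrun_ms = max(0.0, work_ms - t_max_ms)
    missed_frames = math.ceil(overrun_ms / frame_ms)
    return t_max_ms + missed_frames * frame_ms + display_ms


FRAME_MS = 1000.0 / 75.0  # one frame at 75Hz
print(canonical_latency_ms(FRAME_MS, 5.0))                 # Tframe + Tdisplay
print(delayed_latency_ms(FRAME_MS, 3.0, 5.0))              # Tmax + Tdisplay
print(delayed_latency_ms(FRAME_MS, 3.0, 5.0, actual_ms=3.1))  # missed a frame
```

The last call shows the cliff: overrunning the budget by 0.1ms costs a whole extra frame, which is why a conservative Tmax is the safer choice.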
And that’s the core idea behind time warp. Unlike main application processing, the process of time warping a rendered image based on newly-sampled head tracking data is very predictable, and it is easy to measure a tight upper bound. Given such a bound Twarp, the new rendering loop becomes:
- Sample input devices and run application processing.
- Draw image, consisting of primitives such as triangles or bitmaps, into a temporary frame buffer.
- Wait until Twarp before the video controller enters the vertical blanking interval.
- Sample or predict up-to-date head tracking state, then time-warp the rendered buffer into the back buffer.
- Copy the back buffer into the frame buffer.
- Rinse and repeat.
That’s better, but there are still problems. For one, if anything goes wrong during time warp processing (for example, if the process is temporarily pre-empted), an entire frame is lost. Second, and I didn’t realize that until an Oculus engineer confirmed it, there is no way in Windows to wait for a time before a vertical sync without introducing an entire frame of latency, completely defeating the purpose.
And here is the major insight that didn’t occur to me until late last night: unlike application rendering in step 2, time warping in step 4 proceeds in a left-to-right, top-to-bottom, orderly fashion. Meaning, double-buffering the result of time warping is completely unnecessary. If time warping starts exactly at the vertical sync, and writes directly into the video controller’s frame buffer, then time warping can start writing pixel rows from the top while the video controller is twiddling its thumbs during the vertical blanking interval. When the video controller finally starts scanning rows, time warping will have a large enough head start that it can race scanout all the way down the frame, and beat it to the finish. And that’s how Direct-to-Rift mode reduces latency. It not only circumvents a somewhat stupid Windows API, but it shifts time warp processing from before the vertical sync to after the vertical sync, where it has all the time in the world (almost an entire retrace period) to do its thing. There is even enough time to adjust the motion prediction interval on a line-by-line basis, so that each line is time-warped to the precise instant in time when it is presented to the viewer by the Rift’s OLED screen’s low-persistence scanout. And isn’t it funny that this 21st century advanced deep magic rendering method only works because 1940s analog TVs couldn’t move their electron beams fast enough?
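The race can be sketched numerically in Python, using the DK2 timings quoted in the update above (line rate 144.975 kHz, 6 rows of sync pulse, 6 rows of back porch, 1920 visible rows). This is a simplified model, not the Oculus driver’s actual method: it assumes time warp starts at the beginning of the sync pulse and writes rows at a constant rate, and all names are invented for the sketch.

```python
LINE_PERIOD_S = 1.0 / 144_975.0   # time to scan out one row on the DK2
BLANK_ROWS_AFTER_SYNC = 6 + 6     # sync pulse + back porch, in rows
VISIBLE_ROWS = 1920


def row_scanout_time(row):
    """Seconds after the start of the sync pulse at which scanout reads
    visible row `row` (0 = first row sent). This is also the instant a
    per-line motion prediction would target."""
    return (BLANK_ROWS_AFTER_SYNC + row) * LINE_PERIOD_S


def warp_wins_race(row_write_time_s):
    """True if a warp that starts at the sync pulse and needs
    `row_write_time_s` per row finishes every row before scanout reads it."""
    return all((row + 1) * row_write_time_s <= row_scanout_time(row)
               for row in range(VISIBLE_ROWS))


print(f"head start: {row_scanout_time(0) * 1e6:.1f} microseconds")
print(warp_wins_race(LINE_PERIOD_S))          # as fast as scanout: stays ahead
print(warp_wins_race(1.01 * LINE_PERIOD_S))   # 1% slower: caught before the end
```

Under these assumptions the head start is only a dozen rows, so the warp has to sustain roughly the scanout’s own row rate; what it gains is nearly the whole frame period to do its work.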
And that’s another reason why Direct-to-Rift mode and its kernel module are required: apparently, under Windows it’s impossible to write directly into a managed display’s frame buffer. Ouch.
So What About Linux and Vrui?
After reading the complaints in the Oculus forums, I started investigating whether it would be possible to solve them under Linux, without having to write a custom display driver and side-step (and possibly re-invent) a huge chunk of functionality. Turns out, the outlook is very good:
- The XRANDR extension to the X Window System protocol lets clients enumerate all displays connected to a server, including their vendor and model names and their positions and sizes within the server’s virtual desktop. That’s all that’s needed to find the Rift’s display, and place a window smack-dab into it. There’s even an event for asynchronous notification, meaning that a user can re-arrange her displays while a VR application is running, for example to temporarily mirror the Rift’s display onto the main display, and the application keeps working properly. That’s neat.
- At least using the Nvidia graphics driver (and who would use anything else, anyway?), OpenGL can vsync to any connected display, primary or secondary. That means applications can render at the Rift’s native frame rate, without tearing or judder (at least the kind of judder caused by vsync mismatch).
- Under a non-compositing window manager, regular double-buffering under OpenGL does not incur undue latency (and there is a window manager hint to tell a compositing manager to grant pass-through to a window). This means the canonical application loop has exactly one frame latency, and a delayed application loop has exactly Tmax latency, plus whatever is inherent in the display. This was confirmed via latency tester: a fast-switching LCD desktop monitor had around 12ms latency from drawing a rectangle to the rectangle showing up on the screen, and the Rift DK2’s OLED display had around 6ms latency. These latencies were achieved with a conservative estimate for Tmax, to avoid dropped frames.
- OpenGL and the X Window System allow applications to draw directly into a video controller’s frame buffer, allowing single-buffered rendering and racing the video controller’s scanout. There is a GLX extension (GLX_SGI_video_sync) that suspends the caller until the next vertical retrace, to start the front buffer scanout race just at the right time. With front buffer rendering, display latency on the Rift DK2 dropped to about 4.3ms (see Figure 2).
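As a rough illustration of the first point, the sketch below locates a display by parsing the text printed by the `xrandr` command-line tool. It is not how Vrui does it: matching on the DK2’s native 1080×1920 size is a stand-in heuristic (the vendor and model names live in the EDID data, which this sketch does not decode), the sample output is invented, and the function names are hypothetical.

```python
import re

# Matches lines such as "HDMI-0 connected 1080x1920+1920+0 (normal left ...".
_OUTPUT_RE = re.compile(
    r"^(?P<name>\S+) connected (?:primary )?"
    r"(?P<width>\d+)x(?P<height>\d+)\+(?P<x>\d+)\+(?P<y>\d+)",
    re.MULTILINE,
)


def find_outputs(xrandr_text):
    """Return name, position and size of every connected, active output."""
    return [
        {"name": m["name"], "x": int(m["x"]), "y": int(m["y"]),
         "width": int(m["width"]), "height": int(m["height"])}
        for m in _OUTPUT_RE.finditer(xrandr_text)
    ]


def find_by_size(xrandr_text, width, height):
    """Return the first output with the given pixel size, or None."""
    for output in find_outputs(xrandr_text):
        if (output["width"], output["height"]) == (width, height):
            return output
    return None


# Invented sample output for a desktop with a monitor and a DK2 to its right.
SAMPLE = """\
Screen 0: minimum 8 x 8, current 3000 x 1920, maximum 16384 x 16384
DVI-I-1 connected primary 1920x1080+0+0 (normal left inverted right x axis y axis) 531mm x 299mm
   1920x1080     60.00*+
HDMI-0 connected 1080x1920+1920+0 (normal left inverted right x axis y axis) 71mm x 126mm
   1080x1920     75.00*+  72.00    60.00
DP-0 disconnected (normal left inverted right x axis y axis)
"""

print(find_by_size(SAMPLE, 1080, 1920))
```

On a live system the text could come from `subprocess.run(["xrandr", "--query"], capture_output=True, text=True).stdout`; the returned position and size are what an application would need to place a borderless window exactly over the Rift’s display.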
So does this mean that the equivalent of “Direct-to-Rift mode” can be achieved under Linux with all built-in means? Unless there is some really deep black magic going on in the Oculus display driver, I’d say yes. I need to point out at this juncture that everything I’m saying here about the inner workings of the Oculus display driver is pure conjecture, as it’s closed source.
One final question: why is the DK2’s minimal latency at 4.3ms? Shouldn’t it be basically nil, due to the OLED screen’s nearly instantaneous reaction time? No, for a simple reason: as discussed above, pixel data is fed to the display one pixel at a time, at a constant rate (the pixel clock). This means that, if the refresh rate is 75Hz, it takes about 13.3ms to scan out an entire frame. More precisely, display latency at the very top of the screen (the right edge in the DK2’s case) is just the total vertical blanking interval; latency in the middle is about 6.7ms, and latency at the left edge is about 13.3ms. I measured latency by holding the tester against the right lens, meaning I sampled at about a quarter down the entire frame, or approximately 4ms from vertical sync.