I have talked many times about the importance of eye tracking for head-mounted displays, but so far, eye tracking has been limited to the very high end of the HMD spectrum. Not anymore. SensoMotoric Instruments, a company with around 20 years of experience in vision-based eye tracking hardware and software, unveiled a prototype integrating the camera-based eye tracker from their existing eye tracking glasses with an off-the-shelf Oculus Rift DK1 HMD (see Figure 1). Fortunately for me, SMI were showing their eye-tracked Rift at the 2014 Augmented World Expo, and offered to bring it up to my lab to let me have a look at it.
As can be seen in Figure 1, this is not a finished product yet. Besides the fact that the Rift DK1 is no longer for sale, and that Rift DK2s, which are supposed to start shipping in July, have a slightly different internal layout, the modifications required to fit the existing eye tracker into the cramped interior of the Rift are somewhat invasive. SMI engineers had to cut a square hole into the bottom-most portion of both of the Rift’s lenses in order to give the tracking cameras an unobstructed view of the user’s eyes, and had to place several infrared LEDs around the rim of each lens to create the illumination needed for 3D eye tracking (more on that later). Christian Villwock, SMI’s director, assured me that their next-generation tracking camera won’t require cutting chunks out of the lenses, and will be small enough to be packaged entirely into the Rift’s lens holders; in other words, eye tracking could be added to a Rift as an after-market mod, by exchanging the original lens holders with modified ones. But leaving predictions of the future aside, let’s focus on the important questions: how does it work, how well does it work, and what are the immediate and long-term applications?
Without going into the nitty-gritty technical details (and potentially spilling trade secrets), the most important thing about SMI’s eye tracker is that it tracks the user’s pupil position and gaze direction in three dimensions. Each of the two cameras is a 3D camera, not entirely dissimilar to the first-generation Kinect. Based on the depth data captured by each camera, SMI’s software is able to derive a detailed 3D model of the user’s eyes, including cornea shape, pupil position, and even features of the retina. This level of detail sets the system apart from two-dimensional eye trackers, which primarily track the centroid of the user’s pupil in the camera’s 2D image plane. In one of the first posts on this blog, I posited that future HMDs should contain one stereo camera per eye for 3D eye tracking; it didn’t occur to me that per-eye 3D cameras would be even better. The most important measurements provided by the 3D eye model are the 3D position of the pupil’s center point, which is required for properly calibrated projection, and a 3D gaze direction vector, which can serve as input for gaze-based interaction schemes. Both are measured relative to the cameras, but since each camera is rigidly attached to the inside of the Rift, it is possible to transform both measurements to “HMD space,” and then to the VR application’s model space based on head tracking data.
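To make that chain of transformations concrete, here is a minimal sketch in Python. It is not SMI’s or Oculus’ code: the function names, the camera mounting offset, and the head pose are all made-up placeholders, and I am assuming plain rigid transformations (a rotation followed by a translation). The one detail worth noticing is that the pupil position is a point and gets rotated and translated, whereas the gaze direction is a vector and only gets rotated.

```python
# Illustrative sketch only: all names and numbers below are hypothetical.
# Rotations are 3x3 row-major matrices, positions are in meters.

def rotate(rotation, v):
    """Apply a rotation matrix to a vector (sufficient for directions)."""
    return tuple(sum(rotation[i][j] * v[j] for j in range(3)) for i in range(3))


def transform_point(rotation, translation, p):
    """Rotate, then translate (needed for positions such as the pupil center)."""
    rotated = rotate(rotation, p)
    return tuple(rotated[i] + translation[i] for i in range(3))


# Assumed fixed mounting of one tracking camera inside the HMD.
CAM_TO_HMD_ROT = ((1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0))
CAM_TO_HMD_TRANS = (0.03, -0.02, 0.01)

# Assumed head pose from head tracking: a quarter turn about the vertical
# (y) axis, with the HMD located at (1.0, 1.7, 0.0) in model space.
HMD_TO_MODEL_ROT = ((0.0, 0.0, 1.0), (0.0, 1.0, 0.0), (-1.0, 0.0, 0.0))
HMD_TO_MODEL_TRANS = (1.0, 1.7, 0.0)


def eye_to_model_space(pupil_cam, gaze_cam):
    """Chain camera space -> HMD space -> model space for one eye."""
    pupil_hmd = transform_point(CAM_TO_HMD_ROT, CAM_TO_HMD_TRANS, pupil_cam)
    gaze_hmd = rotate(CAM_TO_HMD_ROT, gaze_cam)
    pupil_model = transform_point(HMD_TO_MODEL_ROT, HMD_TO_MODEL_TRANS, pupil_hmd)
    gaze_model = rotate(HMD_TO_MODEL_ROT, gaze_hmd)
    return pupil_model, gaze_model


# Pupil 4 cm in front of the camera, eye looking straight down the -z axis.
pupil_model, gaze_model = eye_to_model_space((0.0, 0.0, 0.04), (0.0, 0.0, -1.0))
print(pupil_model, gaze_model)
```

In a real system the camera-to-HMD transformation would be a fixed property of the hardware, while the HMD-to-model transformation would change every frame with the head tracking data.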
Now that we have a basic idea of how it works, how well does it work? The crucial cornerstones of eye tracking performance are accuracy and latency. Let’s address latency first, because that’s where I have to guesstimate. SMI’s demo included a 2D calibration/accuracy utility (see Figure 2), which simply displayed four small dots arranged around a central dot, covering a visual angle of (estimated) 25° from the center in all directions. The user’s current visual focus point was indicated by a larger circle, apparently as fast as the firmware/software could extract and display it. By moving my eyes around as fast as I could (and that’s pretty fast), and watching the display play catch-up, I estimated an end-to-end system latency of somewhere between 50ms and 100ms. Take these numbers with a grain of salt; there is no way of measuring such short delays without gizmos. I arrived at these numbers by mentally comparing the response time of the tracking system to response times of other systems with known latency (such as my 3D tracking systems). The most important result is this: the current system latency is too high for one of the more forward-looking applications of eye tracking, foveated rendering. I clearly saw the tracking circle lagging behind my eye movements, and I would have noticed reduced rendering resolutions and “pop” from the software switching between levels of detail just as clearly.
I can go into more detail regarding accuracy, and it’s a fairly interesting story, too, because it sheds some light on the calibration processes that are still required in practice for a theoretically “calibration-free” eye-tracked HMD. These observations are again based on the calibration utility illustrated in Figure 2. When the utility starts up, it performs a one-point calibration: it draws a small dot in the center of the 3D display, and asks the user to look at that dot with both eyes and press a button or key. Afterwards, it presents the aforementioned pattern of dots, and the real-time tracking circle. As it turned out, the accuracy of that default setup wasn’t very good. It is possible that my contact lenses make my corneas’ shapes too different from the ideal (I’m quite near-sighted), but whatever the reason, there was a significant mismatch in gaze direction tracking (see Figure 3).
However, the utility also contained a three-point calibration procedure, showing a sequence of three dots around the periphery of the field of view (I estimated around 40° off-center), asking the user to look at each dot and press a button. After that calibration step, which took maybe ten seconds tops, tracking results were much improved (see Figure 4). But here’s a surprising observation: even after three-point calibration, tracking accuracy was very sensitive to HMD placement. In other words, any very slight shift in the position of the Rift on my face, even just putting pressure on the top or side of the Rift without really moving it, caused a noticeable shift in the tracking data.
And that shouldn’t happen. The benefit of 3D eye tracking is precisely that the system should be able to automatically adjust to changes in the user’s eye position, either due to taking off the HMD and putting it back on, or due to facial movements or rapid head motions. Why didn’t that work here? My current hunch is that the calibration utility itself is purely 2D. In other words, it does not actually take the user’s 3D eye positions and the internal geometry of the Rift into account. Instead, it directly maps offsets between the user’s 3D gaze direction vector and the forward direction established by one-point calibration, optionally using scale factors from three-point calibration, to displacements on a virtual 2D image plane positioned at some distance in front of the user. That would explain why it can’t react properly to moving the HMD, and why it needs one- or three-point calibration in the first place: it doesn’t actually use the full data provided by the tracking system. Another clue: the demos I saw were Unity-based, and the Oculus plug-in for Unity doesn’t expose any of the Rift-internal projection parameters that would be required for full 3D gaze tracking.
So there’s hope that gaze direction calculations based on 3D positions and vectors and the Rift’s full internal geometry would indeed work as they should, and not require additional calibration, but unfortunately that’s pure conjecture at this point, because I didn’t get to see the raw eye tracking data, and wasn’t able to sic my own VR software on it. I’ll definitely look into that when/if I get my hands on an eye-tracked Rift for good.
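To illustrate the difference I am conjecturing about, here is a deliberately simplified sketch in Python. It is not how SMI’s utility works (I have not seen its code), it ignores the Rift’s lenses and their distortion entirely, and the function names, the virtual plane distance, and the 5 mm eye shift are all assumptions made up for the example. It only contrasts a mapping that uses the gaze direction alone with one that intersects the full 3D gaze ray (eye position plus direction) with a virtual image plane in HMD space.

```python
# Simplified geometric illustration; names and numbers are hypothetical.
# HMD space convention assumed here: the user looks along the -z axis, and
# the virtual image plane sits at z = -plane_distance.

def gaze_point_direction_only(gaze_dir, plane_distance):
    """Map a gaze direction to the plane, pretending the eye is at the origin."""
    dx, dy, dz = gaze_dir
    if dz >= 0.0:
        raise ValueError("gaze direction does not point towards the plane")
    scale = plane_distance / -dz
    return (dx * scale, dy * scale)


def gaze_point_full_3d(eye_pos, gaze_dir, plane_distance):
    """Intersect the ray eye_pos + s * gaze_dir with the plane z = -plane_distance."""
    ex, ey, ez = eye_pos
    dx, dy, dz = gaze_dir
    if dz >= 0.0:
        raise ValueError("gaze direction does not point towards the plane")
    s = (-plane_distance - ez) / dz
    return (ex + dx * s, ey + dy * s)


gaze = (0.1, 0.0, -1.0)          # looking slightly to the right
centered_eye = (0.0, 0.0, 0.0)   # eye where calibration assumed it to be
shifted_eye = (0.005, 0.0, 0.0)  # HMD slipped 5 mm sideways on the face

flat = gaze_point_direction_only(gaze, 1.0)
full_centered = gaze_point_full_3d(centered_eye, gaze, 1.0)
full_shifted = gaze_point_full_3d(shifted_eye, gaze, 1.0)
print(flat, full_centered, full_shifted)
```

With the eye where calibration left it, both mappings agree. Once the eye position changes, only the full 3D version moves the result along with it; the direction-only version has no way of knowing that anything happened, which is the kind of behavior I observed.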
Oh, about cutting holes into the Rift’s lenses: the holes were visible when using the Rift, as small blurry dark squares at the very bottom edge of the field of view. But I didn’t really find them annoying. Your mileage may vary, and they’re supposed to go away with the next prototype.
Finally: what’s eye tracking good for? I already alluded to several applications, but here’s a bullet-point list, in approximate order from “should work right now” to “long-term”:
Automatic and dynamic calibration: In my opinion, miscalibration in HMDs is an important cause of simulator sickness. Right now, the Oculus SDK contains an external calibration utility where users can enter (or measure) interpupillary distance and approximate eye relief, or, in other words, the rough 3D position of their eyes in HMD space. But many users don’t do it, or not often enough, or don’t do it when giving their Rift to a friend to try. With 3D eye tracking, that calibration should not only be automatic (but see some caveats regarding the current tracking software above), but also dynamic — eye tracking can, well, track the user’s 3D eye position during use to react properly to eye movement as users look around the virtual 3D space, or any subtle movements of the HMD relative to the user’s head. I believe that automatic calibration is an important property for consumer-facing HMDs, and that dynamic calibration is an important improvement to user experience.
Gaze-based interaction: With head tracking, the user’s head essentially becomes another 3-DOF input device that can be used to interact with the virtual environment, or with user interface elements embedded into that environment (see “Gaze-directed Text Entry in VR Using Quikwrite,” for example). But with only head tracking, users have to move their heads to interact, and require some form of aiming reticle to receive visual feedback on their interactions. With eye tracking and 3D gaze direction vectors, on the other hand, explicit feedback is no longer necessary (we know what we’re looking at), and it is much easier and faster to move one’s eyes than one’s head (top saccade speed is about 900°/s; head rotation is probably not faster than 180°/s). One of the demos in SMI’s package demonstrated this effectively: it was a relatively simple scene, a vault or wine cellar of some kind, with lots of boxes and barrels inside (see Figure 5). By pressing a button on a game controller, users could “shoot” along their current viewing direction. That is, there was no aiming reticle or explicit way to aim; it was literally “look and shoot.” And it worked, with some minor problems probably caused by the same 2D approach to integrating 3D gaze direction into the Unity engine that I believe is behind the accuracy issues in the calibration utility. Basically, aiming didn’t quite follow viewing direction at the periphery of the Rift’s field of view, but the targets (boxes and barrels) were large enough that a user not actively trying to break the system probably wouldn’t have noticed, and three-point calibration (or, as I believe, proper 3D calculation throughout the pipeline) would have fixed the problem.
Social interaction: Social interactions based on eye contact are a special case of gaze-based interaction. This is something that cannot be approximated by head tracking alone: looking at someone straight on and looking at them furtively out of the corners of one’s eyes are different social cues. Proper eye tracking (which should easily detect when the user blinks) could enable new forms of NPC interactions in games, such as staring down a potential foe, or less robotic NPC behavior by having NPCs react to the user looking at them (or not). This might even be a way to get around a fundamental problem in VR tele-presence and tele-collaboration: if everybody is wearing HMDs, nobody can see the other participants’ eyes. Eye tracking might support better animated face models, albeit with potential uncanny valley issues.
Foveated rendering: The primate (and human) eye has much higher resolution in the center than in the periphery. This fact could be exploited to significantly increase rendering performance (and therefore reduce rendering latency) by presenting virtual objects close to the user’s viewing direction at full detail, and rapidly reducing detail as objects are located farther out in the user’s peripheral vision. Without going into details, the potential performance gains are huge. The tricky part, of course, is ensuring that the user never sees the lower-resolution parts of the displayed scene. In other words, eye tracking needs to be accurate enough, and have low enough latency, that the user’s eye can never move outside the current full-resolution image zone before the display system catches up. And, as mentioned above, humans can move their eyes really fast. To give a ballpark estimate: assume that a foveated rendering system uses a full-resolution area of 9° around the current viewing direction (that’s very small compared to the Rift’s 100° field of view). Then, assuming a 900°/s saccade speed, the eye can cross those 9° in 9°/(900°/s) = 1/100s, so the display system must have a total end-to-end latency, including eye tracking and display refresh, of less than 1/100s, or 10ms. If latency is higher, the user’s eyes will be able to “outrun” the full-resolution zone, and see not only a low-resolution render, but also a distinct “popping” effect when the display system catches up and renders the now-foveated part of the scene at full detail. Both of these effects are very annoying, as we have learned from our out-of-core, multi-resolution 3D point cloud and global topography viewers (which use a form of foveated rendering to display terabyte-sized data sets at VR frame rates on regular computers). As mentioned above, SMI’s eye-tracked Rift prototype does not yet have low enough latency to make foveated rendering effective.
Simulation of accommodation: This one I consider a bad idea, but it comes up all the time, so there it is. Among the primary depth cues, the only one that cannot be simulated by current and near-future HMDs is accommodation, the fact that the eyes’ lenses (just like in regular cameras) have to change focal length based on the distance to observed objects. In an HMD, however, all light entering the eye comes from a flat screen that is typically (and specifically, in the Rift) projected out into infinity by the HMD’s internal optics. In other words, all virtual objects, no matter their virtual distances from the viewer, will appear focused at the same distance. This leads to conflicts because our visual system is trained from a very young age to couple accommodation to vergence, the amount by which our eyes are crossed when focusing on an object. So, in practice, if a virtual object is located far away, accommodation will work naturally: the eyes will focus on infinity, and the object will appear sharp. If a virtual object is close, on the other hand, the eyes will change focus to the near distance, but because the light is (optically) coming from infinitely far away, the virtual object will appear blurry. This conflict between accommodation and the other depth cues can cause eye strain, and potentially depth perception problems while using HMDs (and for a short time thereafter). Now the idea is the following: eye tracking can measure vergence, so why not artificially blur objects in post-processing, based on their distance from the user’s current focal plane? In theory, that should make the display seem more “real.” Well, two problems. For one, latency rears its ugly head again. If vergence detection can’t keep up, users might focus on an object at a different depth, and it might take time for that object to become “unblurred.” That’s basically the same annoying effect as visible low resolution and “popping” in foveated rendering.
But worse, while it’s possible to blur objects in post-processing, it’s not possible to unblur objects that are optically out of focus. In other words, when the user focuses on a far-away object, close-by objects can be blurred for a more realistic picture, but if the user focuses on a nearby object, it will appear blurry no matter what, and now all other objects will be rendered blurry as well. And looking at blurry images is really annoying, because it messes with the eyes’ accommodation reflex. But like I said, this is my opinion, and others disagree.
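Both latency arguments in the list above (foveated rendering, and blur that has to follow vergence) boil down to the same back-of-the-envelope arithmetic, so here it is as a small Python sketch. The function names are made up, and I am reading the “9° around the current viewing direction” from my ballpark estimate as an angular radius, which is the reading under which the 10ms figure comes out. The 50ms and 100ms inputs are my rough latency guesses for the prototype, not measurements.

```python
# Back-of-the-envelope latency budget; function names are hypothetical.

def latency_budget_ms(full_res_radius_deg, saccade_speed_deg_per_s):
    """Longest end-to-end latency before a saccade can leave the full-resolution zone."""
    return 1000.0 * full_res_radius_deg / saccade_speed_deg_per_s


def overshoot_deg(latency_ms, full_res_radius_deg, saccade_speed_deg_per_s):
    """How far (in degrees) the eye can get past the zone's edge within the given latency."""
    travelled = saccade_speed_deg_per_s * latency_ms / 1000.0
    return max(0.0, travelled - full_res_radius_deg)


print(latency_budget_ms(9.0, 900.0))     # 10.0 ms budget
print(overshoot_deg(50.0, 9.0, 900.0))   # 36.0 degrees past the edge at 50 ms
print(overshoot_deg(100.0, 9.0, 900.0))  # 81.0 degrees past the edge at 100 ms
```

In other words, at my estimated 50ms to 100ms, a fast saccade could in principle carry the eye tens of degrees outside the full-resolution zone before the display catches up, which matches how clearly I saw the tracking circle lag behind.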
So here’s the bottom line: SMI have a working 3D eye tracker that fits into the cramped interior of an Oculus Rift, currently with some (visible) modifications. Tracking latency and accuracy should be good enough for automatic and dynamic calibration and gaze-based interaction, but I can’t be sure because I only saw two demos: one 2D demo that (I think) suffered from not using the full 3D tracking data, and one Unity-based 3D demo that used view-based aiming with large enough targets to cover up tracking problems. Latency is currently not low enough for foveated rendering or simulation of accommodation (not that I think the latter should be done, anyway). If I were in charge of Oculus VR, I would start serious negotiations to integrate SMI’s eye tracker into the consumer version of the Oculus Rift, either out-of-the-box or as an officially supported option or sanctioned third-party modification.