An Eye-tracked Oculus Rift

I have talked many times about the importance of eye tracking for head-mounted displays, but so far, eye tracking has been limited to the very high end of the HMD spectrum. Not anymore. SensoMotoric Instruments, a company with around 20 years of experience in vision-based eye tracking hardware and software, unveiled a prototype integrating the camera-based eye tracker from their existing eye tracking glasses with an off-the-shelf Oculus Rift DK1 HMD (see Figure 1). Fortunately for me, SMI were showing their eye-tracked Rift at the 2014 Augmented World Expo, and offered to bring it up to my lab to let me have a look at it.

Figure 1: SMI’s after-market modified Oculus Rift with one 3D eye tracking camera per eye. The current tracking cameras need square cut-outs at the bottom edge of each lens to provide an unobstructed view of the user’s eyes; future versions will not require such extensive modifications.

As can be seen in Figure 1, this is not a finished product yet. Besides the fact that the Rift DK1 is no longer for sale, and that Rift DK2s, which are supposed to start shipping in July, have a slightly different internal layout, the modifications required to fit the existing eye tracker into the cramped interior of the Rift are somewhat invasive. SMI engineers had to cut a square hole into the bottom-most portion of both of the Rift’s lenses in order to give the tracking cameras an unobstructed view of the user’s eyes, and had to place several infrared LEDs around the rim of each lens to create the illumination needed for 3D eye tracking (more on that later). Christian Villwock, SMI’s director, assured me that their next-generation tracking camera won’t require cutting chunks out of the lenses, and will be small enough to be packaged entirely into the Rift’s lens holders; in other words, eye tracking could be added to a Rift as an after-market mod, by exchanging the original lens holders with modified ones. But leaving predictions of the future aside, let’s focus on the important questions: how does it work, how well does it work, and what are the immediate and long-term applications?

Without going into the nitty-gritty technical details (and potentially spilling trade secrets), the most important thing about SMI’s eye tracker is that it tracks the user’s pupil position and gaze direction in three dimensions. Each of the two cameras is a 3D camera, not entirely dissimilar to the first-generation Kinect. Based on the depth data captured by each camera, SMI’s software is able to derive a detailed 3D model of the user’s eyes, including cornea shape, pupil position, and even features of the retina. This level of detail sets the system apart from two-dimensional eye trackers, which primarily track the centroid of the user’s pupil in the camera’s 2D image plane. In one of the first posts on this blog, I posited that future HMDs should contain one stereo camera per eye for 3D eye tracking; it didn’t occur to me that per-eye 3D cameras would be even better. The most important measurements provided by the 3D eye model are the 3D position of the pupil’s center point, which is required for properly calibrated projection, and a 3D gaze direction vector, which can serve as input for gaze-based interaction schemes, both relative to the cameras. But since the camera is rigidly attached to the inside of the Rift, it is possible to transform both measurements to “HMD space,” and then to the VR application’s model space based on head tracking data.
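The chain of transformations described above can be sketched roughly as follows. This is an illustration only, not SMI’s software: the function names, the camera mounting values, and the head pose are all made up for the example. The one thing worth noticing is that the pupil position is a point (rotated and translated), while the gaze direction is a vector (rotated only).

```python
import math

def mat_vec(m, v):
    """Multiply a 3x3 matrix (tuple of rows) by a 3-vector."""
    return tuple(sum(m[r][c] * v[c] for c in range(3)) for r in range(3))

def transform_point(rotation, translation, p):
    """Points are rotated and then translated."""
    r = mat_vec(rotation, p)
    return tuple(r[i] + translation[i] for i in range(3))

def transform_direction(rotation, d):
    """Direction vectors are rotated only; translation does not apply."""
    return mat_vec(rotation, d)

def rotation_y(angle_deg):
    """Right-handed rotation about the vertical (Y) axis."""
    a = math.radians(angle_deg)
    c, s = math.cos(a), math.sin(a)
    return ((c, 0.0, s), (0.0, 1.0, 0.0), (-s, 0.0, c))

# Hypothetical fixed mounting of one tracking camera inside the HMD.
# Because the camera is rigidly attached, this would be measured once.
CAM_TO_HMD_ROT = rotation_y(0.0)
CAM_TO_HMD_TRANS = (0.032, -0.02, 0.0)  # meters, invented values

def eye_to_world(pupil_cam, gaze_cam, head_rot, head_pos):
    """Camera space -> HMD space -> world/model space."""
    pupil_hmd = transform_point(CAM_TO_HMD_ROT, CAM_TO_HMD_TRANS, pupil_cam)
    gaze_hmd = transform_direction(CAM_TO_HMD_ROT, gaze_cam)
    # head_rot and head_pos would come from the head tracker every frame.
    pupil_world = transform_point(head_rot, head_pos, pupil_hmd)
    gaze_world = transform_direction(head_rot, gaze_hmd)
    return pupil_world, gaze_world
```

For example, with the head turned 90° to the left, a gaze vector pointing straight ahead in camera space (along negative Z) ends up pointing along negative X in world space, while the pupil position additionally picks up the head’s position.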

Figure 2: Display of SMI’s calibration utility, approximately embedded into the Oculus Rift’s full field of view. The outer dots were located about 25° away from the central dot.

Now that we have a basic idea of how it works, how well does it work? The crucial cornerstones of eye tracking performance are accuracy and latency. Let’s address latency first, because that’s where I have to guesstimate. SMI’s demo included a 2D calibration/accuracy utility (see Figure 2), which simply displayed four small dots arranged around a central dot, covering a visual angle of (estimated) 25° from the center in all directions. The user’s current visual focus point was indicated by a larger circle, apparently as fast as the firmware/software could extract and display it. By moving my eyes around as fast as I could (and that’s pretty fast), and watching the display play catch-up, I estimated an end-to-end system latency of somewhere between 50ms and 100ms. Take these numbers with a grain of salt; there is no way of measuring such short delays without gizmos. I arrived at these numbers by mentally comparing the response time of the tracking system to response times of other systems with known latency (such as my 3D tracking systems). The most important result is this: the current system latency is too high for one of the more forward-looking applications of eye tracking, foveated rendering. I clearly saw the tracking circle lagging behind my eye movements, and I would have noticed reduced rendering resolutions and “pop” from the software switching between levels of detail just as clearly.

I can go into more detail regarding accuracy, and it’s a fairly interesting story, too, because it sheds some light on the calibration processes that are still required in practice for a theoretically “calibration-free” eye-tracked HMD. These observations are again based on the calibration utility illustrated in Figure 2. When the utility starts up, it performs a one-point calibration: it draws a small dot in the center of the 3D display, and asks the user to look at that dot with both eyes and press a button or key. Afterwards, it presents the aforementioned pattern of dots, and the real-time tracking circle. As it turned out, the accuracy of that default setup wasn’t very good. It is possible that my contact lenses make my corneas’ shapes too different from the ideal (I’m quite near-sighted), but whatever the reason, there was a significant mismatch in gaze direction tracking (see Figure 3).

Figure 3: Gaze tracking accuracy after one-point calibration. The tracked gaze direction is significantly offset from the real gaze direction. This diagram is drawn from memory to illustrate the offset’s magnitude.

However, the utility also contained a three-point calibration procedure, showing a sequence of three dots around the periphery of the field of view (I estimated around 40° off-center), asking the user to look at each dot and press a button. After that calibration step, which took maybe ten seconds tops, tracking results were much improved (see Figure 4). But here’s a surprising observation: even after three-point calibration, tracking accuracy was very sensitive to HMD placement. In other words, any very slight shift in the position of the Rift on my face, even just putting pressure on the top or side of the Rift without really moving it, caused a noticeable shift in the tracking data.

Figure 4: Gaze tracking accuracy after three-point calibration. There was still a visible offset between real and tracked gaze direction, but it was small enough that gaze-directed user interaction would have worked without problems. This diagram is drawn from memory to illustrate the offset’s magnitude.

And that shouldn’t happen. The benefit of 3D eye tracking is precisely that the system should be able to automatically adjust to changes in the user’s eye position, either due to taking off the HMD and putting it back on, or due to facial movements or rapid head motions. Why didn’t that work here? My current hunch is that the calibration utility itself is purely 2D. In other words, it does not actually take the user’s 3D eye positions and the internal geometry of the Rift into account, and directly maps offsets between the user’s 3D gaze direction vector and the forward direction established by one-point calibration, optionally using scale factors from three-point calibration, to displacements on a virtual 2D image plane positioned at some distance in front of the user. That would explain why it can’t react properly to moving the HMD, and why it needs one- or three-point calibration in the first place: it doesn’t actually use the full data provided by the tracking system. Another clue: the demos I saw were Unity-based, and the Oculus plug-in for Unity doesn’t expose any of the Rift-internal projection parameters that would be required for full 3D gaze tracking.
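To make that hunch concrete, here is a minimal sketch of the difference between a direction-only 2D mapping and a mapping that also uses the 3D eye position. It is purely hypothetical: I have not seen the utility’s code, both function names are invented, the virtual plane distance is arbitrary, and the sketch ignores the Rift’s lenses entirely.

```python
import math

def gaze_to_plane_2d(gaze, forward, scale, plane_dist):
    """Hypothetical 2D mapping: uses only the angular offset between the
    gaze vector and the calibrated forward vector. The eye position never
    enters the calculation."""
    yaw = math.atan2(gaze[0], -gaze[2]) - math.atan2(forward[0], -forward[2])
    pitch = math.atan2(gaze[1], -gaze[2]) - math.atan2(forward[1], -forward[2])
    return (plane_dist * math.tan(yaw) * scale[0],
            plane_dist * math.tan(pitch) * scale[1])

def gaze_to_plane_3d(eye_pos, gaze, plane_dist):
    """3D mapping: intersect the ray eye_pos + t * gaze with the plane
    z = -plane_dist in HMD space (viewer looks along negative Z)."""
    t = (-plane_dist - eye_pos[2]) / gaze[2]
    return (eye_pos[0] + t * gaze[0], eye_pos[1] + t * gaze[1])

# The HMD slips 5 mm sideways, so the eye sits at x = +0.005 m in HMD
# space. The user keeps looking at the central dot at (0, 0, -1).
plane_dist = 1.0                    # arbitrary virtual plane distance, meters
eye_after_slip = (0.005, 0.0, 0.0)
gaze_after_slip = (-0.005, 0.0, -1.0)  # from the shifted eye toward the dot

print(gaze_to_plane_2d(gaze_after_slip, (0.0, 0.0, -1.0), (1.0, 1.0), plane_dist))
print(gaze_to_plane_3d(eye_after_slip, gaze_after_slip, plane_dist))
```

In this toy setup the 2D mapping reports a gaze point about 5 mm to the side of the dot even though the user never stopped looking at it, while the ray-based version still lands on the dot. That is the kind of sensitivity to HMD placement a direction-only approach would produce, although the real optics would change the magnitudes.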

So there’s hope that gaze direction calculations based on 3D positions and vectors and the Rift’s full internal geometry would indeed work as they should, and not require additional calibration, but unfortunately that’s pure conjecture at this point, because I didn’t get to see the raw eye tracking data, and wasn’t able to sic my own VR software on it. I’ll definitely look into that when/if I get my hands on an eye-tracked Rift for good.

Oh, about cutting holes into the Rift’s lenses: the holes were visible when using the Rift, as small blurry dark squares at the very bottom edge of the field of view. But I didn’t really find them annoying. Your mileage may vary, and they’re supposed to go away with the next prototype.

Finally: what’s eye tracking good for? I already alluded to several applications, but here’s a bullet-point list, in approximate order of “should work right now” to “long-term:”

Automatic and dynamic calibration: In my opinion, miscalibration in HMDs is an important cause of simulator sickness. Right now, the Oculus SDK contains an external calibration utility where users can enter (or measure) interpupillary distance and approximate eye relief, or, in other words, the rough 3D position of their eyes in HMD space. But many users don’t do it, or not often enough, or don’t do it when giving their Rift to a friend to try. With 3D eye tracking, that calibration should not only be automatic (but see some caveats regarding the current tracking software above), but also dynamic — eye tracking can, well, track the user’s 3D eye position during use to react properly to eye movement as users look around the virtual 3D space, or any subtle movements of the HMD relative to the user’s head. I believe that automatic calibration is an important property for consumer-facing HMDs, and that dynamic calibration is an important improvement to user experience.

Gaze-based interaction: With head tracking, the user’s head becomes essentially another 3-DOF input device that can be used to interact with the virtual environment, or with user interface elements embedded into that environment (see “Gaze-directed Text Entry in VR Using Quikwrite,” for example). But with only head tracking, users have to move their heads to interact, and require some form of aiming reticle to receive visual feedback on their interactions. With eye tracking and 3D gaze direction vectors, on the other hand, explicit feedback is no longer necessary (we know what we’re looking at), and it is much easier and faster to move one’s eyes than one’s head (top saccade speed is about 900°/s; head rotation is probably not faster than 180°/s). One of the demos in SMI’s package demonstrated this effectively: it was a relatively simple scene, a vault or wine cellar of some kind, with lots of boxes and barrels inside (see Figure 5). By pressing a button on a game controller, users could “shoot” along their current viewing direction. Meaning, there was no aiming reticle or explicit way to aim; it was literally “look and shoot.” And it worked, with some minor problems probably caused by the same 2D approach to integrating 3D gaze direction into the Unity engine that I believe is behind the accuracy issues in the calibration utility. Basically, aiming didn’t quite follow viewing direction at the periphery of the Rift’s field of view, but the targets (boxes and barrels) were large enough that a user not actively trying to break the system probably wouldn’t have noticed, and three-point calibration (or, as I believe, proper 3D calculation throughout the pipeline) would have fixed the problem.

Figure 5: Second of SMI’s demos showing gaze-directed interaction, in this case shooting. Users can aim at barrels or boxes stacked in a wine cellar-like environment without the need for visual feedback such as aiming reticles, or an explicit aiming mechanism.

Social interaction: A special case of gaze-based interaction is social interaction based on eye contact. This is something that cannot be approximated by head tracking alone: looking at someone straight on or furtively out of the corners of one’s eyes are different social cues. Proper eye tracking (which should easily detect when the user blinks) could enable new forms of NPC interactions in games, such as staring down a potential foe, or less robotic NPC behavior by having NPCs react to the user looking at them (or not). This might even be a way to get around a fundamental problem in VR tele-presence and tele-collaboration: if everybody is wearing HMDs, nobody can see the other participants’ eyes. Eye tracking might support better animated face models, albeit with potential uncanny valley issues.

Foveated rendering: The primate (and human) eye has much higher resolution in the center than in the periphery. This fact could be exploited to significantly increase rendering performance (and therefore reduce rendering latency) by presenting virtual objects close to the user’s viewing direction at full detail, and rapidly reducing detail as objects are located farther out in the user’s peripheral vision. Without going into details, the potential performance gains are huge. The tricky part, of course, is ensuring that the user never sees the lower-resolution parts of the displayed scene. In other words, eye tracking needs to be accurate enough, and have low-enough latency, that the user’s eye can never move outside the current full-resolution image zone before the display system catches up. And, as mentioned above, humans can move their eyes really fast. To give a ballpark estimate: assume that a foveated rendering system uses a full-resolution area extending 9° in all directions from the current viewing direction (that’s very small compared to the Rift’s 100° field of view). Then, assuming a 900°/s saccading speed, the eye can reach the edge of that area in 9/900 of a second, so the display system must have a total end-to-end latency, including eye tracking and display refresh, of less than 1/100s, or 10ms. If latency is higher, the user’s eyes will be able to “outrun” the full-resolution zone, and see not only a low-resolution render, but also a distinct “popping” effect when the display system catches up and renders the now-foveated part of the scene at full detail. Both of these effects are very annoying, as we have learned from our out-of-core, multi-resolution 3D point cloud and global topography viewers (which use a form of foveated rendering to display terabyte-sized data sets at VR frame rates on regular computers). As mentioned above, SMI’s eye-tracked Rift prototype does not yet have low enough latency to make foveated rendering effective.

Simulation of accommodation: This one I consider a bad idea, but it comes up all the time, so there it is. Among the primary depth cues, the only one that cannot be simulated by current and near-future HMDs is accommodation, the fact that the eyes’ lenses, just like in regular cameras, have to change focal length based on the distance to observed objects. In an HMD, however, all light entering the eye comes from a flat screen that is typically (and specifically, in the Rift) projected out into infinity by the HMD’s internal optics. In other words, all virtual objects, no matter their virtual distances from the viewer, will appear focused at the same distance. This leads to conflicts because our visual system is trained from a very young age to couple accommodation to vergence, the amount by which our eyes are crossed when focusing on an object. So, in practice, if a virtual object is located far away, accommodation will work naturally: the eyes will focus on infinity, and the object will appear sharp. If a virtual object is close, on the other hand, the eyes will change focus to the near distance, but because the light is (optically) coming from infinitely far away, the virtual object will appear blurry. This conflict between accommodation and the other depth cues can cause eye strain, and potentially depth perception problems while using HMDs (and for a short time thereafter). Now the idea is the following: eye tracking can measure vergence, so why not artificially blur objects in post-processing, based on their distance from the user’s current focal plane? In theory, that should make the display seem more “real.” Well, two problems. For one, latency rears its ugly head again. If vergence detection can’t keep up, users might focus on an object at a different depth, and it might take time for that object to become “unblurred.” That’s basically the same annoying effect as visible low resolution and “popping” in foveated rendering.
But worse, while it’s possible to blur objects in post-processing, it’s not possible to unblur objects that are optically out of focus. In other words, when the user focuses on a far-away object, close-by objects can be blurred for a more realistic picture, but if the user focuses on a nearby object, it will appear blurry no matter what, and now all other objects will be rendered blurry as well. And looking at blurry images is really annoying, because it messes with the eyes’ accommodation reflex. But like I said, this is my opinion, and others disagree.
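Returning to the ballpark estimate in the foveated rendering item above: the arithmetic is simple enough to put into a few lines. This is a deliberately crude sketch with invented function names. It assumes the eye moves at peak saccade speed from the first instant, so it ignores saccade acceleration and anything the visual system itself may hide during a saccade, and it is therefore on the pessimistic side.

```python
def max_latency_s(radius_deg, saccade_speed_deg_s=900.0):
    """Longest end-to-end latency (seconds) before an eye moving at
    saccade_speed_deg_s can leave a full-resolution zone that extends
    radius_deg in all directions from the viewing direction."""
    return radius_deg / saccade_speed_deg_s

def min_radius_deg(latency_s, saccade_speed_deg_s=900.0):
    """Smallest full-resolution radius (degrees) that the eye cannot
    outrun within latency_s seconds."""
    return latency_s * saccade_speed_deg_s

# The example from the text: a 9-degree zone and 900 degrees/s saccades.
print("latency budget: %.0f ms" % (1000.0 * max_latency_s(9.0)))

# The other way around, for the 50ms-100ms I estimated for the prototype.
for latency_ms in (50.0, 100.0):
    radius = min_radius_deg(latency_ms / 1000.0)
    print("%.0f ms latency needs a radius of about %.0f degrees"
          % (latency_ms, radius))
```

Under these assumptions, a 50ms latency would already call for a full-resolution zone with a radius of about 45°, which is close to the Rift’s entire field of view and leaves little room for savings.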

So here’s the bottom line: SMI have a working 3D eye tracker that fits into the cramped interior of an Oculus Rift, currently with some (visible) modifications. Tracking latency and accuracy should be good enough for automatic and dynamic calibration and gaze-based interaction, but I can’t be sure because I only saw two demos: one 2D demo that (I think) suffered from not using the full 3D tracking data, and one Unity-based 3D demo that used view-based aiming with large enough targets to cover up tracking problems. Latency is currently not low enough for foveated rendering or simulation of accommodation (not that I think the latter should be done, anyway). If I were in charge of Oculus VR, I would start serious negotiations to integrate SMI’s eye tracker into the consumer version of the Oculus Rift, either out-of-the-box or as an officially supported option or sanctioned third-party modification.

33 thoughts on “An Eye-tracked Oculus Rift”

  1. Nice post Oliver. Very much looking forward to foveated rendering in LiDAR Viewer…!
    BTW, have you run any VRUI applications with the Rift DK2, or have you yet to lay your hands on one?

  2. Thanks again for a great post. The best part about it is I’ve learned the term for the idea I’ve been thinking of for a while and which I started a discussion for on Reddit today and I got a little slated for being a bit late to the party.

    Anyway, Foveated Rendering!! I think this needs to happen.

    Graphics Cards need to step up their game to get us to where we need to be. Foveated Rendering makes the most sense as a temporary solution until we have implants.

    Hopefully eye tracking latency can be improved. I wonder what the latency for getting eye movement data directly from the brain is like. I know the EPOC has had its issues due to the problems with getting an accurate reading through the skull but again maybe a job for implants? Yay for the future!

  3. Nice post! I have a question: what if we make the “foveated area” larger? Not 9° but maybe 50° in radius, so that the latency can be higher, up to 50ms? It can still reduce the GPU burden quite significantly.

    • It’s true; the foveated area can be enlarged to tolerate higher latencies. The 9° were just an example chosen to make the math easier. However, 50° radial would be 100° diagonal, and that’s pretty much the Rift’s entire field-of-view, so there wouldn’t be any savings (there is additional overhead from doing level-of-detail rendering in the first place).

      I don’t think that foveated rendering will start paying off before end-to-end latency drops to around 20ms.

  4. Any idea when affordable lightfield displays small enough to go on HMDs and affordable graphics cards capable of rendering lightfields fast enough will reach the market?

  5. Random thought: With the social interaction part it would also be interesting to have information about the shape of the eye-lid. How closed is it .. is the player squinting etc.
    Would also be interesting in combination with the camera that will be included with the DK2 (although I have no idea about its specs). You could then combine it with the rest of the facial expression of the player.

  6. Accommodation will come sooner than you think due to light field vr goggles. google for “NVIDIA Research’s near-eye light field display prototype” to see a prototype.

    • You’re right about light field displays being a viable option to simulate all depth cues including accommodation, but I disagree about them becoming competitive to regular (magnifier-based) HMDs any time soon. Based on the paper describing NVIDIA’s prototype, they used 1280×720 microdisplays from a Sony HMZ-T1 HMD, and achieved a spatial resolution of 146×78 pixels (light field oversampling factor 9x), over a 30°×16° field of view, with a (sufficient) representable accommodation range from 30cm to infinity. To achieve resolution and field of view on the same order as the Rift DK1, they would need a screen with roughly 11500×7200 pixels. I didn’t take into account that the oversampling factor needed to create a light field depends on the ratio of eye relief and accommodation range. To achieve a 100° field of view, they would either need a very large screen, or would have to move the screen very close to the eyes, at the cost of even more oversampling and higher required resolution.

      • How large is “very large”?

        Couldn’t they just make it curved to reduce the area needed to cover the whole field of view?

        • Good question. Based on the pictures in the paper I linked, eye relief on the prototype is around 2″. To get 100° field of view, you’d need a display around 5″ tall and 10″ wide at that distance. Problem: you can’t do it, because the left and right screens would now overlap based on an IPD average of 2.5″. So you need to get closer to the eye, say to 1″ eye relief, which requires twice the pixels horizontally and vertically to achieve the same accommodation range based on the math of light fields. Making the screens cylindrical would reduce required size somewhat, but not by a significant amount.

          • Hm, they couldn’t get enough resolution with screens with spherical curvature at a radius of about the distance between the center of the eyeballs and the closest part of the center line of the nose?

            Btw, since the pupils have a limited range of positions they can be in relative to the screen, can’t some pixels of the cells away from the center be discarded to squeeze out a bigger apparent resolution without needing more physical pixels? (I guess the lenslet array would then need to be composed of unique lenslets, since each cell would be a bit different from the others.)

  7. What if the eye tracking was used to measure vergence and adjust real accommodation by actually changing the focus of the lenses (mechanically) instead of simulating accommodation by blurring the image?

    This is a huge problem with AR OST-HMDs — objects that are supposed to be on the table in front of you end up looking like they are on a big screen under the floor. Is it possible to mitigate this with eye tracking and dynamically focusing optics?

    I guess latency (eye tracking + focal adjustment) could be a deal-breaker for this application though.

    • It’s theoretically feasible, but I’m not sure if it’s practical. I think that the mechanical complexity of dynamically adjusting lenses, and the problems in accurate vergence tracking with sufficiently low latency, would outweigh the gains from simulating accommodation. But I may be biased because I’m fairly tolerant to accommodation/vergence conflict myself. I think that, in the long run, near-eye light field or near-eye true holographic displays would be the technology of choice.

      • I wonder if near-sighted people will initially be less bothered by the accommodation/vergence conflict because they already have extensive experience with two different relationships between accommodation and vergence. I’m used to focusing my eyes to “infinity” when reading something 9″ away from my eyes.

        • Yes, that is precisely why I personally am not bothered much by it. I have been wearing glasses since I was 11, and now contacts for the last ten years (-3 diopters). I effectively untrained my vergence/accommodation reflex while wearing glasses. I got it back with contacts, but then I put the glasses back on last weekend, and my depth perception was completely fubar while I was wearing them.

        • I don’t quite understand. I am also nearsighted, but I am always wearing my glasses when I am not sleeping. Are you saying that for someone who removes their glasses often, they get accustomed to disassociating vergence from accommodation because there are two different relationships?

          • I can only speak for myself, but as someone who used to wear glasses every waking minute, I still noticed that my vergence/accommodation response became a lot weaker as a result. It got stronger again when I switched to contacts, which approximate natural vision much more closely (if someone had told me that was the case, I would have switched a decade earlier).

          • I only wear contacts and not glasses, but I frequently read or watch a movie on my iPad after I’ve taken my contacts out.
            I have -4 diopter contacts, so I can only focus to a distance of about 9″ without my contacts. Immediately after I put my contacts in, it’s usually hard to focus on close objects.

            As people age, they lose the ability to focus at close distances, so the Rift’s always-infinite focus may be a positive for some people :-)

  8. What about simulating accommodation by physically moving the lens just fractions of a mm closer to or farther from the screen? This should allow for the eye to accommodate focus much as it does naturally. Of course it will be similar to the whole scene shifting out of, and then back into focus as our eye accommodates but if it is done quickly enough and accurately enough everything outside of our foveal vision will be ignored anyways.

    If this would work (and I can’t fathom why it wouldn’t) then I suppose the remaining question is does accommodation add enough (anything?) to either eye relief or sense of presence to be worth it and additionally is the pay off large enough to be worth the high cost of physical movements of the lenses. I would expect the answer is no, but still, I’m curious as to whether or not this is a theoretical possibility.

  9. That’s really interesting. I guess the usability of foveated rendering also depends on the relative size of the high-resolution area. Rendering only 50° out of 100° at high resolution may give around a 50% performance boost, and it may work with 50-100 ms latency. Not a huge performance boost, but may be good enough for now and definitely better than nothing. What do you think?

  10. Did they confirm that the cameras were definitely depth-sensing cameras (time-of-flight, structured light, whatever), and not just ‘cameras that can measure the pupil position in 3D’?

    I ask, because I suspect from your description of the issues with HMD movement and contact lenses that they may be measuring multiple Purkinje images and using those to infer the pupil’s 3D position (assuming a non-squishing eye with perfect rotational symmetry). The multiple IR illuminators may also work into this, if they’re pulsed in a pattern to produce multiple offset Purkinje images.

    • Fantastic question; unfortunately, thinking about it now, I can’t confirm that it’s really a 3D camera. Christian explained it to me as all 3D, and described the camera as using the same scanning tech that’s also found in some true 3D cameras, but not being an expert in eye tracking, I didn’t ask the follow-up questions you just posed. In short, I might have misunderstood him and filled in blanks coming from my 3D capture / 3D rendering background. If I did indeed misunderstand how the cameras work, that would invalidate a lot of what I wrote here.

      I’ll try getting clarification/confirmation, and I might have to do a mea culpa and post a follow-up article. Thanks for bringing it up.

      • No problem, thanks for the reply. If you can get an update on their camera tech (and it’s not under NDA) I’d love to hear more about it.

  11. I have a question regarding the vergence and accommodation. I have noticed in the Rift when switching from viewing a close object to a distant landscape that the vergence just seems a tad off to me for the distant viewing. Very subtle, that I think most users wouldn’t even notice. I get this sensation that if vergence were adjusted in real-time with eye tracking when snapping to a distant view, it would feel more realistic (if it were fast enough). I can’t tell if it is the “image offset” when snapping to a distant view that needs to be “separated” or “come together” just a tad more for my eyes, and I’m not talking about the blur factor at all here. Is this accommodation I am talking about, or is there some other factor like my eyes not actually being focused at infinity? I hope this makes sense.

  12. A well tested solution is available from Ergoneers and Sensics, right off the shelf. Here you can have a look into the details: http://goo.gl/sIwR2O. But they also have a high quality kit to integrate it in your own HMD.

    Here is a sample video in a virtual supermarket, including virtual areas of interest http://goo.gl/wJ4zE8

  13. I’d love to play with that one!

    Concerning your latency estimates for Foveated rendering, wouldn’t you think some extra latency tolerance is available because of “Saccadic masking” : “…the brain selectively blocks visual processing during eye movements in such a way that neither the motion of the eye (and subsequent motion blur of the image) nor the gap in visual perception is noticeable to the viewer”. (http://en.wikipedia.org/wiki/Saccadic_masking). This effect might even be abused to “discard” images during fast head+eye movements.

    I do (want to) believe the use of dual eye-tracking (measure vergence) for simulation of accommodation is useful to some extent. Having experience with various VR setups, I got used to disconnecting vergence/real accommodation. I always felt one major cause for visual strain was the ‘sharp/high contrast’ distractions of nearby objects when looking in the distance. A rough estimate would already be great for ‘separate’ blurring treatment for a limited number of zones, or even just foreground and background. I recall that perception work (e.g. Cutting’s 1997 “How the eye measures reality and virtual reality”) summarized that it is ordinal at best, so maybe no need for super fidelity at that part. Also, the vergence/accommodation speed is quite slow and low frequency and I expect not so latency sensitive.

  14. Did some research and a quick back of envelope calculation, this may be handy for someone:
    Given saccade accelerations of 1500deg/s/s, a human fovea of 2°, and a foveated rendering HMD with a 9° central high-res area, once a saccade starts, you’d have only 73ms before your fovea leaves the high-res display area.

    Saccade accelerations may seem fast, but (if I’ve got this right) cheap high-speed hobby servos (0.06 sec) can move at the same speed. So tracking may be a challenge, but at least it’s within the range of possibility.