A Closer Look at the Oculus Rift

I have to make a confession: I’ve been playing with the Oculus Rift HMD for almost a year now, and have been supporting it in Vrui for a long time as well, but I haven’t really spent much time using it in earnest. I’m keenly aware of the importance of calibrating head-mounted displays, of course, and noticed right away that the scale of virtual objects seen through the Rift was way off for me, but I never got around to doing anything about it. Until now, that is.

First impressions from the Oculus Rift dev kit

My friend Serban got his Oculus Rift dev kit in the mail today, and he called me over to check it out. I will hold back a thorough evaluation until I get the Rift supported natively in my own VR software, so that I can run a direct head-to-head comparison with my other HMDs, and also my screen-based holographic display systems (the head-tracked 3D TVs, and of course the CAVE), using the same applications. Specifically, I will use the Quake III Arena viewer to test the level of “presence” provided by the Rift; as I mentioned in my previous post, there are some very specific physiological effects brought out by that old chestnut, and my other HMDs are severely lacking in that department, and I hope that the Rift will push it close to the level of the CAVE. But here are some early impressions.

Figure 1: What it would look like to unbox an Oculus VR dev kit, if one were to have such a thing.

How Milo met CAVE

I just read an interesting article, a behind-the-scenes look at the infamous “Milo” demo Peter Molyneux did at 2009’s E3 to introduce Project Natal, i.e., Kinect.

This article is related to VR in two ways. First, the usual progression of overhyping the capabilities of some new technology and then falling flat on one’s face because not even one’s own developers know what the new technology’s capabilities actually are is something that should be very familiar to anyone working in the VR field.

But here’s the quote that really got my interest (emphasis is mine):

Others recall worrying about the presentation not being live, and thinking people might assume it was fake. Milo worked well, they say, but filming someone playing produced an optical illusion where it looked like Milo was staring at the audience rather than the player. So for the presentation, the team hired an actress to record a version of the sequence that would look normal on camera, then had her pretend to play along with the recording. … “We brought [Claire] in fairly late, probably in the last two or three weeks before E3, because we couldn’t get it to [look right],” says a Milo team member. “And we said, ‘We can’t do this. We’re gonna have to make a video.’ So she acted to a video. Was that obvious to you?” Following Molyneux’s presentation, fans picked apart the video, noting that it looked fake in certain places.

Gee, sounds familiar? This is, of course, the exact problem posed by filming a holographic display, and a person inside interacting with it. In a holographic display, the images on the screens are generated for the precise point of view of the person using it, not the camera. This means it looks wrong when filmed straight up. If, on the other hand, it’s filmed so it looks right on camera, then the person inside will have a very hard time using it properly. Catch 22.

With the “Milo” demo, the problem was similar. Because the game was set up to interact with whoever was watching it, it ended up interacting with the camera, so to speak, instead of with the player. Now, if the Milo software had been set up with the level of flexibility of proper VR software, it would have been an easy fix to adapt the character’s gaze direction etc. to a filming setting, but since game software in the past never had to deal with this kind of non-rigid environment, it typically ends up fully vertically integrated, and making this tiny change would probably have taken months of work (that’s kind of what I meant when I said “not even one’s own developers know what the new technology’s capabilities actually are” above). Am I saying that Milo failed because of the demo video? No. But I don’t think it helped, either.

The take-home message here is that mainstream games are slowly converging towards approaches that have been embodied in proper VR software for a long time now, without really noticing it, and are repeating old mistakes. The Oculus Rift will really bring that out front and center. And I am really hoping it won’t fall flat on its face simply because software developers didn’t do their homework.

How head tracking makes holographic displays

I’ve talked about “holographic displays” a lot, most recently in my analysis of the upcoming zSpace display. What I haven’t talked about is how exactly such holographic displays work, what makes them “holographic” as opposed to just stereoscopic, and why that is a big deal.

Teaser: A user interacting with a virtual object inside a holographic display.

Seeing “The Hobbit” in 3D

I’m on vacation in Mexico right now, and yesterday evening my brother-in-law took my wife and me to see “The Hobbit,” in 3D, in quite the fancy movie theater, with reclining seats and footrests and to-the-seat service and such.

I don’t want to talk about the movie per se, short of mentioning that I liked it, a lot, but about the 3D. Or the “stereo,” I should say, as I mentioned previously. My overall impression was that it was done very well. Obviously, the movie was shot in stereo (otherwise I’d have refused to see it that way), and obviously a lot of planning went into that aspect of it. There was also no apparent eye fatigue, or any other typical side effect of bad stereo, and considering how damn long the movie was, and that I was consciously looking for conversion problems or artifacts, that means someone was doing something right. As a technical note to cinemas: there was a dirty spot on the screen, a bit off to the side (looked as if someone had thrown a soda at the screen a while ago), and that either degraded the screen polarization, or was otherwise slightly visible in the image, and was a bit distracting. So, keep your stereo screens immaculately clean! Another very slightly annoying thing was due to the subtitles (the entire movie was shown in English with Spanish subtitles, and then there were the added subtitles when characters spoke Elvish or the Dark Tongue), and even though I didn’t read the subtitles, I still automatically looked at them whenever they popped up, and that was distracting because they were sticking out from the screen quite a bit.

Good stereo vs. bad stereo

I received an email about a week ago that reminded me that, even though stereoscopic movies and 3D graphics have been around for at least six decades, there are still some widespread misconceptions out there. Those need to be addressed urgently, especially given stereo’s hard push into the mainstream over the last few years. While, this time around, the approaches to stereo are generally better than the last time “3D” hit the local multiplex (just compare Avatar and Friday the 13th 3D), and the wide availability of commodity stereoscopic display hardware is a major boon to people like me, we are already beginning to see a backlash. And if there’s a way to do things better, to avoid that backlash, then I think it’s important to do it.

So here’s the gist of this particular issue: there are primarily two ways of setting up a movie camera, or a virtual movie camera in 3D computer graphics, to capture stereoscopic images — one is used by the majority of existing 3D graphics software, and seemingly also by the “3D” movie industry, and the other one is correct.

Toe-in vs skewed frustum

So, how do you set up a stereo camera? The basic truth is that stereoscopy works by capturing two slightly different views of the same 3D scene, and presenting these views separately to the viewers’ left and right eyes. The devil, as always, lies in the details.

Say you have two regular video cameras, and want to film a “3D” movie (OK, I’m going to stop putting “3D” in quotes now. My pedantic point is that 3D movies are not actually 3D, they’re stereoscopic. Carry on). What do you do? If you put them next to each other, with their viewing directions exactly parallel, you’ll see that it doesn’t quite give the desired effect. When viewing the resulting footage, you’ll notice that everything in the scene, up to infinity, appears to float in front of your viewing screen. This is because the two cameras, being parallel, are stereo-focused on the infinity plane. What you want, instead, is that near objects float in front of the screen, and that far objects float behind the screen. Let’s call the virtual plane separating “in-front” and “behind” objects the stereo-focus plane.
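The parallel-camera case can be checked with a minimal pinhole-camera sketch. The function names, the 65 mm camera separation, and the unit focal length are illustrative assumptions, not values from any real camera rig. The sketch projects the same point into both cameras and reports the horizontal disparity (right image minus left image).

```python
def project_x(point_x, point_z, camera_x, focal=1.0):
    """Horizontal image coordinate of a point seen by a pinhole camera
    at (camera_x, 0, 0) looking straight down the +z axis."""
    return focal * (point_x - camera_x) / point_z


def parallel_disparity(point_x, point_z, separation=0.065, focal=1.0):
    """Right-image x minus left-image x for two parallel cameras."""
    left = project_x(point_x, point_z, -separation / 2, focal)
    right = project_x(point_x, point_z, +separation / 2, focal)
    return right - left


# Disparity for points at increasing depth (in meters).
for depth in (0.5, 2.0, 10.0, 1000.0):
    print(depth, parallel_disparity(0.0, depth))
```

Under these assumptions the disparity is negative (crossed) at every finite depth, which is what makes everything float in front of the screen, and it only approaches zero, i.e., the screen plane, as depth goes to infinity.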

So how do you control the position of the stereo-focus plane? When using two normal cameras, the only solution is to rotate both slightly inwards, so that their viewing direction lines intersect exactly in the desired stereo-focus plane. This approach is often called toe-in stereo, and it sort-of works — under a very lenient definition of the words “sort-of” and “works.”

The fundamental problem with toe-in stereo is that it makes sense intuitively — after all, don’t our eyes rotate inwards when we focus on nearby objects? — but that our intuition does not correspond to how 3D movies are shown. 3D (or any other kind of) movies are not projected directly onto our retinas, they are projected onto screens, and those screens are in turn viewed by us, i.e., they project onto our retinas.

Now, when a normal camera records a movie, the assumption is that the movie will later be projected onto a screen that is orthogonal to the projector’s projection direction, which is implicitly the same as the camera’s viewing direction (the undesirable effect of non-orthogonal projection is called keystoning). In a toe-in stereo camera, on the other hand, there are two viewing directions, at a slight angle towards each other. But, in the theater, the cameras’ views are projected onto the same screen, meaning that at least one, but typically both, of the component images will exhibit keystoning (see Figures 1 and 2).

Figure 1: The implied viewing directions and screen orientations caused by a toe-in stereo camera based on two on-axis projection cameras. The discrepancy between the screen orientations implied by the cameras’ models and the real screen causes keystone distortion, which leads to 3D convergence issues and eye strain.

Figure 2: The left stereo image shows the keystoning effect caused by toe-in stereo. A viewer will not be able to merge these two views into a single 3D object. The right stereo image shows the correct result of using skewed-frustum stereo. You can try for yourself using a pair of red/blue anaglyphic glasses.

The bad news is that keystoning from toe-in stereo leads to problems in 3D vision. Because the left/right views of captured objects or scenes do not actually look like they would if viewed directly with the naked eye, our brains refuse to merge those views and perceive the 3D objects therein, causing a breakdown of the 3D illusion. When keystoning is less severe, our brains are flexible enough to adapt, but our eyes will dart around trying to make sense of the mismatching images, which leads to eye strain and potentially headaches. Because keystoning is more severe towards the left and right edges of the image, toe-in stereo generally works well enough for convergence around the center of the images, and generally breaks down towards the edges.
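To put rough numbers on the keystoning, here is a small sketch that projects one 3D point through two toed-in pinhole cameras and measures the vertical mismatch between the two images. Everything in it (the function names, the 65 mm separation, the 2 m stereo-focus distance, the test points) is an assumption chosen for illustration.

```python
import math


def project(point, camera_x, yaw, focal=1.0):
    """Project a 3D point through a pinhole camera at (camera_x, 0, 0).

    yaw is the rotation about the vertical axis in radians; positive yaw
    turns the viewing direction from +z towards +x.
    """
    dx, dy, dz = point[0] - camera_x, point[1], point[2]
    x_cam = dx * math.cos(yaw) - dz * math.sin(yaw)  # along the camera's right axis
    z_cam = dx * math.sin(yaw) + dz * math.cos(yaw)  # along the viewing direction
    return (focal * x_cam / z_cam, focal * dy / z_cam)


def vertical_disparity(point, separation, toe_in_angle):
    """Right-image y minus left-image y for the same 3D point."""
    left = project(point, -separation / 2, +toe_in_angle)
    right = project(point, +separation / 2, -toe_in_angle)
    return right[1] - left[1]


separation = 0.065
focus_distance = 2.0
# Toe-in angle that makes both viewing directions meet on the stereo-focus plane.
toe_in = math.atan((separation / 2) / focus_distance)

center = (0.0, 0.5, 2.0)  # above the image center
corner = (1.0, 0.5, 2.0)  # same height, but off to the side

print(vertical_disparity(center, separation, toe_in))  # essentially zero
print(vertical_disparity(corner, separation, toe_in))  # clearly non-zero
print(vertical_disparity(corner, separation, 0.0))     # no toe-in: zero
```

In this model the vertical mismatch vanishes along the vertical center line of the image and grows towards the left and right edges, which matches the observation that toe-in stereo holds up near the center and breaks down towards the edges.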

And this is why I think a good portion of current 3D movies are based on toe-in stereo (I haven’t watched enough 3D movies to tell for sure, and the ones I’ve seen were too murky to really tell): I have spoken with 3D movie experts (an IMAX 3D film crew, to be precise), and they told me the two basic rules of thumb for good stereo in movies: artificially reduce the amount of eye separation, and keep the action, and therefore the viewer’s eyes, in the center of the screen. Taken together, these two rules exactly address the issues caused by toe-in stereo, but of course they’re only treating the symptom, not the cause. As an aside: when we showed this camera crew how we are doing stereo in the CAVE, they immediately accused us of breaking the two rules. What they forgot is that stereo in the CAVE obviously works, including for them, and does not cause eye strain, meaning that those rules are only workarounds for a problem that doesn’t exist in the first place if stereo is done properly.

So what is the correct way of doing it? It can be derived by simple geometry. If a 3D movie or stereo 3D graphics are to be shown on a particular screen, and will be seen by a viewer positioned somewhere in front of that screen, then the two viewing volumes for the viewer’s eyes are exactly the two pyramids defined by each eye, and the four corners of the screen. In technical terms, this leads to skewed-frustum stereo. The following video explains this pretty well, better than I could here in words or a single diagram, even though it is primarily about head tracking and the screen/viewer camera model:

In a nutshell, skewed-frustum stereo works exactly as ordered. Even stereo pairs with very large disparity can be viewed without convergence problems or eye strain, and there are no problems when looking towards the edge of the image.
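The "pyramid defined by each eye and the four corners of the screen" translates into very little code. The following is a minimal sketch, not the implementation of any particular toolkit; the function name, the screen description (lower-left corner plus two unit edge vectors), and the example dimensions are all assumptions. It returns the left/right/bottom/top extents at the near plane, which is the form OpenGL's glFrustum expects.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def skewed_frustum(eye, screen_origin, screen_right, screen_up,
                   width, height, near):
    """Return (left, right, bottom, top) at the near plane for one eye.

    screen_origin is the lower-left screen corner; screen_right and
    screen_up are unit vectors along the screen's edges. All positions
    use the same coordinate system and units.
    """
    # Screen normal (right x up), pointing from the screen towards the viewer.
    normal = (screen_right[1] * screen_up[2] - screen_right[2] * screen_up[1],
              screen_right[2] * screen_up[0] - screen_right[0] * screen_up[2],
              screen_right[0] * screen_up[1] - screen_right[1] * screen_up[0])
    offset = sub(eye, screen_origin)
    eye_x = dot(offset, screen_right)  # eye position along the screen's width
    eye_y = dot(offset, screen_up)     # eye position along the screen's height
    distance = dot(offset, normal)     # eye distance from the screen plane
    scale = near / distance
    return ((0.0 - eye_x) * scale, (width - eye_x) * scale,
            (0.0 - eye_y) * scale, (height - eye_y) * scale)


# A 2 m x 1 m screen centered on the origin, viewer 1 m in front of it.
screen = dict(screen_origin=(-1.0, -0.5, 0.0), screen_right=(1.0, 0.0, 0.0),
              screen_up=(0.0, 1.0, 0.0), width=2.0, height=1.0, near=0.1)
print(skewed_frustum((-0.0325, 0.0, 1.0), **screen))  # left eye
print(skewed_frustum((+0.0325, 0.0, 1.0), **screen))  # right eye
```

The two frusta come out asymmetric in opposite directions, and neither involves a rotation: both eyes look along the screen normal. A complete renderer would additionally need a view matrix that moves each eye to the origin and aligns the screen with the coordinate axes.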

To allow for a real and direct comparison, I prepared two stereoscopic images (using red/blue anaglyphic stereo) of the same scene from the same viewpoint and with the same eye separation, one using toe-in stereo, one using skewed-frustum stereo. They need to be large and need to be seen at original size to appreciate the effect, which is why I’m only linking them here. Ideally, switch back-and-forth between the images several times and focus on the structure close to the upper-left corner. The effect is subtle, but noxious:

Good (skewed-frustum) stereo vs bad (toe-in) stereo.

I generated these using the Nanotech Construction Kit and Vrui; as it turns out, Vrui is flexible enough to support bad stereo, but at least it was considerably harder setting it up than good stereo. So that’s a win, I guess.

There are only two issues to be aware of: for one, objects at infinity will have the exact separation of the viewer’s eyes, so if the programmed-in eye separation is larger than the viewer’s actual eye separation, convergence for very far away objects will fail (in reality, objects can’t be farther away than infinity, or at least our brains seem to think so). Fortunately, the distribution of eye separations in the general population is quite narrow; just stick close to the smaller end. But it’s a thing to keep in mind when producing stereoscopic images for a small screen, and then showing them on a large screen: eye separation scales with screen size when baked into a video. This is why, ideally, stereoscopic 3D graphics should be generated specifically for the size of the screen on which they will be shown, and for the expected position of the audience.
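The arithmetic behind the screen-size issue is short enough to spell out. This sketch uses a simple geometric model (eyes on a line parallel to the screen, object straight ahead); the function names and the example screen widths are made up for illustration.

```python
def screen_parallax(eye_separation, screen_distance, object_distance):
    """On-screen separation of an object's left and right images.

    Positive values place the object behind the screen, negative values
    in front of it, and zero puts it exactly on the screen.
    """
    return eye_separation * (1.0 - screen_distance / object_distance)


def rescaled_parallax(parallax, authored_width, shown_width):
    """Parallax after a baked stereo image is scaled to another screen."""
    return parallax * shown_width / authored_width


# At infinity, parallax equals the programmed-in eye separation.
at_infinity = screen_parallax(0.065, 2.0, float("inf"))
print(at_infinity)

# The same footage, authored for a 1 m wide screen, shown on a 10 m wide one.
print(rescaled_parallax(at_infinity, 1.0, 10.0))
```

With these example numbers, a 65 mm separation baked into footage for a 1 m wide screen turns into 0.65 m on a 10 m wide screen, far more than any viewer's eyes can diverge to follow.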

The other issue is that virtual objects very close to the viewer will appear blurry. This is because when the brain perceives an object to be at a certain distance, it will tell the eyes to focus their lenses to that distance (a process called accommodation). But in stereoscopic imaging, the light reaching the viewer’s eyes from close-by virtual objects will still come from the actual screen, which is much farther away, and so the eyes will focus on the wrong plane, and the entire image will appear blurry.

Unfortunately, there’s nothing we can do about that right now, but at least it’s a rather subtle effect. In our CAVE, users standing in the center can see virtual objects floating only a few inches in front of their eyes quite clearly, even though the walls, i.e., the actual screens, are four feet away. This focus miscue does have a noticeable after-effect: after having used the CAVE for an extended period of time, say a few hours, the real world will look somewhat “off,” in a way that’s hard to describe, for a few minutes after stepping out. But this appears to be only a temporary effect.

Taking it back to the real 3D movie world: the physical analogy to skewed-frustum stereo is lens shift. Instead of rotating the two cameras inwards, one has to shift their lenses inwards. The amount of shift is, again, determined by the distance to the desired stereo-focus plane. Technically, creating lens-shift stereo cameras should be feasible (after all, lens shift photography is all the rage these days), so everybody should be using them. And some 3D movie makers might very well already do that — I’m not a part of that crowd, but from what I hear, at least some don’t.
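For a feel of how much shift is involved, here is a back-of-the-envelope sketch. It assumes an idealized pinhole (thin-lens) camera focused far away, and the function name and example numbers are my own placeholders rather than specifications of any real lens.

```python
def lens_shift(focal_length, camera_separation, focus_plane_distance):
    """Lateral shift per camera that centers the stereo-focus plane.

    camera_separation and focus_plane_distance must share one unit; the
    result is in the unit of focal_length.
    """
    return focal_length * (camera_separation / 2.0) / focus_plane_distance


# 35 mm lenses, cameras 65 mm apart, stereo-focus plane 3 m away.
print(lens_shift(35.0, 0.065, 3.0))  # shift in millimeters
```

Under these assumptions the required shift is well under half a millimeter per camera, and it shrinks further as the stereo-focus plane moves away.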

In the 3D graphics world, where cameras are entirely virtual, it should be even easier to do stereo right. However, many graphics applications use the standard camera model (focus point, viewing direction, up vector, field-of-view), and can only represent non-skewed frusta. The fact that this camera model, as commonly implemented, does not support proper stereo, is just another reason why it shouldn’t be used.

So here’s the bottom line: Toe-in stereo is only a rough approximation of correct stereo, and it should not be used. If you find yourself wondering how to specify the toe-in angle in your favorite graphics software, hold it right there, you’re doing it wrong. The fact that toe-in stereo is still used — and seemingly widely used — could explain the eye strain and discomfort large numbers of people report with 3D movies and stereoscopic 3D graphics. Real 3D movie cameras should use lens shift, and virtual stereoscopic cameras should use skewed frusta, aka off-axis projection. While the standard 3D graphics camera model can be generalized to support skewed frusta, why not just replace it with a model that can do it without additional thought, and is more flexible and more generally applicable to boot?

Update: With the Oculus Rift in developers’ hands now, I’m getting a lot of questions about whether this article applies to head-mounted displays in general, and the Rift specifically. Short answer: it does. There isn’t any fundamental difference between large screens far away from the viewer, and small screens right in front of the viewer’s eyes. The latter add a wrinkle because they necessarily need to involve lenses and their concomitant distortions so that viewers are able to focus on the screens, but the principle remains the same. One important difference is that small screens close to the viewer’s eyes are more sensitive to miscalibration, so doing stereo right is, if anything, even more important than on large-screen displays. And yes, the official Oculus Rift software does use off-axis projection, even though the SDK documentation flat-out denies it.

Standard camera model considered harmful

With apologies to Edsger W. Dijkstra (and pretty much everyone else).

So what’s wrong with the canonical 3D graphics camera model? To recap, in this model a camera is defined by a focus point (the “eye”), a viewing direction, an “up” vector, a screen aspect ratio, and a field-of-view angle (“fov”) (see Figure 1). Throw in a near- and far-plane distance, and together these parameters uniquely define a viewing frustum in model space, and hence the modelview and projection matrices required to render a view of a 3D scene. Nothing wrong with that, per se.

Figure 1: The standard 3D graphics camera model, defined by a focus point position, viewing direction, “up” vector, and screen aspect ratio (ratio of screen width to screen height, not shown in diagram).
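As a point of reference, this is roughly how the standard model turns its parameters into a frustum. The sketch assumes the field-of-view angle is the vertical one, as in gluPerspective; the function name is hypothetical.

```python
import math


def standard_frustum(fov_y_degrees, aspect, near):
    """(left, right, bottom, top) at the near plane for the standard model."""
    top = near * math.tan(math.radians(fov_y_degrees) / 2.0)
    right = top * aspect
    return (-right, right, -top, top)


print(standard_frustum(60.0, 16.0 / 9.0, 0.1))
```

Whatever the inputs, the result is symmetric around the viewing direction: left is always the negative of right, and bottom the negative of top. The parameters simply have no room for a frustum that is skewed relative to the screen.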

The problem arises when this same camera model is applied to (semi-) immersive environments, such as when one wants to adapt an existing graphics package or game engine to, say, a 3D TV or a head-mounted display with only minimal changes. There are two main problems with that: for one, the standard camera model does not support proper stereo image generation, leading to 3D vision problems, eye strain, and discomfort (but that’s a topic for another post).

The problem I want to discuss here is the implicit link between the camera model and viewpoint navigation. In this context, viewpoint navigation is the mechanism by which a 3D graphics application represents the viewer moving through the virtual 3D environment. For example, in a typical first-person video game, the player will be represented in the game world as some kind of avatar, and the camera model will be attached to that avatar’s head. The game engine will provide some mechanism for the player to directly control the position and viewing direction of the avatar, and therefore the camera, in the virtual world (this could be the just-as-canonical WASD+mouse navigation metaphor, or something else). But no matter the details, the bottom line is that the game engine is always in complete control of the player avatar’s — and the camera’s — position and orientation in the game world.

But in immersive display environments, where the user’s head is tracked inside some tracking volume, this is no longer true. In such environments, there are two ways to change the camera position: using the normal navigation metaphor, or simply physically moving inside the tracking volume. Obviously, the game engine has no way to control the latter (short of tethers or an electric shock collar). The problem is that the standard camera model, or rather its implementation in common graphics engines, does not account for this separation.

This is not really a technical problem, as there are ways to manipulate the camera model to yield the correct end result, but it’s a thinking problem. Having to think inside this ill-fitting camera model causes real headaches for developers, who will be running into lots of detail problems and edge cases when trying to implement correct camera behavior in immersive environments.

My preferred approach is to use an entirely different way to think about navigation in virtual worlds, and the camera model that directly follows from it. In this paradigm, a display system is not represented by a virtual camera, but by a physical display environment consisting of a collection of screens and viewers. For example, a typical desktop system would consist of a single screen (the actual monitor), and a single viewer (the user) sitting in a fixed position in front of it (fixed because typical desktop systems have no way to detect the viewer’s actual position). At the other extreme, a CAVE environment would consist of four to six large screens in fixed positions forming a cube, and a viewer whose position and orientation is measured in real time by a head tracking system (see Figure 2). The beauty is that this simple environment model is extremely flexible; it can support any number of screens in arbitrary positions, including moving screens, and any number of fixed or tracked viewers. It can support non-rectangular screens without problems (but that’s a topic for another post), and non-flat screens can be tessellated to a desired precision. So far, I have not found a single concrete display environment that cannot be described by this model, or at least approximated to arbitrary precision.

Figure 2: Photo of a CAVE environment consisting of four screens (three walls and one floor) and one viewer (in this case a camera on a tripod). Note how the image of the 3D protein model in the CAVE spans all four screens, but still appears seamless to the camera.

In more detail, a screen is defined by its position, orientation, width, and height (let’s ignore non-rectangular screens for now). A viewer, on the other hand, is solely defined by the position of its two eyes (two eyes instead of one to support proper stereo; sorry, spiders and Martians need not apply). All screens and viewers forming one display environment are defined in the same coordinate system, called physical space because it refers to real-world entities, namely display screens and users.

How does this environment model affect navigation? Instead of moving a virtual camera through virtual space, navigation now moves an entire environment, i.e., a collection of any number of screens and viewers, through the virtual space, still under complete program control (fans of Dr Who are free to call it the “Tardis model”). Additionally, any viewers can freely move through the environment, at least if they’re head-tracked, and this part is not under program control. From a mathematical point of view, this means that viewers can freely walk through physical space, whereas physical space as a whole is mapped into the virtual world by the graphics engine, effected by the so-called navigation transformation.
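The split between program-controlled navigation and viewer-controlled head motion can be sketched with a few data types. This is an illustration of the idea only, not Vrui's actual data structures; all class and field names are hypothetical, and the navigation transformation is restricted to scale, rotation about the vertical axis, and translation to keep it short.

```python
import math
from dataclasses import dataclass


@dataclass
class Screen:
    origin: tuple   # lower-left corner, in physical space
    right: tuple    # unit vector along the screen's width
    up: tuple       # unit vector along the screen's height
    width: float
    height: float


@dataclass
class Viewer:
    left_eye: tuple   # eye positions in physical space; a head tracker
    right_eye: tuple  # overwrites these, the application never does


@dataclass
class Navigation:
    """Maps physical space into the virtual world: scale, yaw, translate."""
    translation: tuple = (0.0, 0.0, 0.0)
    yaw: float = 0.0    # rotation about the vertical (y) axis, in radians
    scale: float = 1.0

    def to_virtual(self, p):
        c, s = math.cos(self.yaw), math.sin(self.yaw)
        x, y, z = (self.scale * v for v in p)
        return (c * x + s * z + self.translation[0],
                y + self.translation[1],
                -s * x + c * z + self.translation[2])


screen = Screen((-1.0, 0.0, -1.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), 2.0, 2.0)
viewer = Viewer((-0.0325, 1.7, 0.0), (0.0325, 1.7, 0.0))
navigation = Navigation(translation=(10.0, 0.0, 5.0))

# The program moves the whole environment through the virtual world ...
navigation.translation = (12.0, 0.0, 5.0)
# ... while the viewer moves inside it, as reported by head tracking.
viewer.left_eye = (0.2, 1.6, 0.3)

print(navigation.to_virtual(viewer.left_eye))
```

The application only ever touches `navigation`; the tracker only ever touches `viewer`. Where the eye ends up in the virtual world is the combination of both, and neither side needs to know about the other.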

At first glance, it seems that this model does nothing but add another intermediate coordinate system and is therefore superfluous (and mathematically speaking, that’s true), but in my experience, this model makes it a lot more straightforward to think about navigation and user motion in immersive environments, and therefore makes it easier to develop novel and correct navigation metaphors that work in all circumstances. The fact that it treats all possible display environments from desktops to HMDs and CAVEs in a unified way is just a welcome bonus.

The really neat effect of this environment model is that it directly implies a camera model as well (see Figure 3). Using the standard model, it is quite a tricky prospect to maintain the collection of virtual cameras that are required to render to a multi-screen environment, and ensure that they correspond to desired viewpoint changes in the virtual world, and the viewer’s motion inside the display environment. Using the viewer/screen model, there is no extra camera model to maintain. It turns out that a viewing frustum is also uniquely identified by a combination of a flat (rectangular) screen, a focus point position, and near- and far-plane distances. However, the first two components are directly provided by the environment model, and the latter two parameters can be chosen more or less arbitrarily. As a result, the screen/viewer model has no free parameters to define a viewing frustum besides the two plane distances, and the resulting viewing frusta will always lead to correct projections, which are also automatically seamless across multiple screens (but that’s a topic for another post).

Figure 3: The screen/viewer camera model defined by the position, orientation, and size of a screen, and the position of a focus point, in some 3D coordinate system. Apart from the near- and far-plane distances, the model has no free parameters other than those that can be measured directly via calibration and head tracking.

Looking at it mathematically again, one screen/viewer pair uniquely defines a viewing frustum in the physical coordinate space in which the screen and viewer are defined, and hence a modelview and a projection matrix. Now, the mapping from physical space to virtual world space is typically also expressed as a matrix, meaning that this model really just adds yet another modelview matrix. And since the product of two matrices is a matrix, it boils down to the same projection pipeline as the standard model. As I mentioned earlier, the model does not introduce any new capabilities, it just makes it easier to think about the existing capabilities.
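The "product of two matrices is a matrix" argument can be demonstrated in a few lines. This sketch uses translation-only 4x4 matrices to stay short (a real navigation transformation would also carry rotation and scale), and all names and numbers are placeholders.

```python
def translation(tx, ty, tz):
    return [[1.0, 0.0, 0.0, tx],
            [0.0, 1.0, 0.0, ty],
            [0.0, 0.0, 1.0, tz],
            [0.0, 0.0, 0.0, 1.0]]


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def transform(m, p):
    v = (p[0], p[1], p[2], 1.0)
    return tuple(sum(m[i][k] * v[k] for k in range(4)) for i in range(3))


# Virtual world -> physical space: the inverse navigation transformation.
virtual_to_physical = translation(-10.0, 0.0, -5.0)
# Physical space -> eye space: moves the tracked eye position to the origin.
physical_to_eye = translation(0.0325, -1.7, -2.0)
# The combined modelview matrix that the rendering pipeline ends up using.
modelview = matmul(physical_to_eye, virtual_to_physical)

point = (10.0, 1.7, 4.0)
two_steps = transform(physical_to_eye, transform(virtual_to_physical, point))
one_step = transform(modelview, point)
print(two_steps, one_step)
```

Both routes land on the same eye-space point (up to floating-point rounding), so the extra coordinate system costs nothing at render time; it only exists to keep the thinking straight.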

So the bottom line is that the viewer/screen model makes it simpler to reason about program-controlled navigation, completely removes the need for an explicit camera model and the extra work required to keep it consistent with the display environment, and — if the display environment was measured properly — automatically leads to distortion-free and seamless images even across multiple screens, and to always correct and eye strain-free stereo displays.

Although this model stems from the immersive environment world, applying it in the desktop realm has immediate practical benefits. For one, it supports proper stereo without extra work from the application developer; additionally, it supports flexible multi-display configurations where users can put their displays however they like, and get correct and seamless images without special application support. It even provides correct desktop head-tracking for free. Sounds like a win-win to me.

Will the Oculus Rift make you sick?

Head-mounted displays (HMDs) are making a comeback! Yay!

I don’t think there’s need to introduce the Oculus Rift HMD. Everyone’s heard of it, and everyone’s psyched – including me.

However, HMDs are prone to certain issues, and while that shouldn’t deter us from embracing them, we should be careful to do it right this time. The last thing the VR field needs right now is a viral YouTube video along the lines of “Oh, an Oculus Rift! Cool! Let me try it on… Wow, that’s awesoBLEEAAARRGHHH.”

To back up a little: when HMDs became a thing in the 80s, they tended to induce dizziness and nausea in viewers, after a relatively short time of using them. Interestingly, HMDs had generally worse effects than other types of immersive display environments such as CAVEs. The basic theory of simulation sickness is based on virtual motion, and does not account for this difference.

The commonly stated explanation for this difference is display lag. In an HMD, the screens move with the viewer’s head, and any delay will cause the virtual world to move along with the viewer until the display system catches up. Imagine wearing an HMD and quickly turning your head to the side. Say it takes 30ms total until this motion is noticed by the head tracking system, the application updates its internal state, renders the new state, and refreshes the HMD’s screens. During this interval, the world will turn with you, and it will snap back to its original orientation once the delay time has passed. The real world does not behave like that, and because HMD-based graphics tap deeply into our brain’s visual system, this is very disorienting and adds to the discomfort. In a CAVE, on the other hand, the screens do not move with the viewer. Delay will still cause a disturbance in the projection of the virtual world, as the actual viewer position will not match the virtual one, but because screens are large and relatively far away, this will be barely noticeable. So far, so good.
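A rough calculation shows why the same delay hurts so much more in an HMD. The sketch below is a simplification under stated assumptions: a head turning at 100 degrees per second, the eyes sitting about 0.1 m from the rotation axis (so they translate at roughly 0.17 m/s), a CAVE wall four feet (about 1.2 m) away, and a small-angle approximation for the fixed-screen case. The function names are hypothetical.

```python
import math


def hmd_lag_error_deg(head_speed_deg_per_s, latency_s):
    """Angle by which an HMD image trails a rotating head."""
    return head_speed_deg_per_s * latency_s


def fixed_screen_lag_error_deg(eye_speed_m_per_s, latency_s,
                               screen_distance_m, object_distance_m):
    """Approximate angular error of a virtual point shown on a fixed screen
    when it was rendered for a stale eye position."""
    stale_offset = eye_speed_m_per_s * latency_s
    on_screen_error = stale_offset * abs(
        1.0 - screen_distance_m / object_distance_m)
    return math.degrees(on_screen_error / screen_distance_m)


print(hmd_lag_error_deg(100.0, 0.030))
print(fixed_screen_lag_error_deg(0.17, 0.030, 1.2, float("inf")))
```

With these numbers the HMD image is off by about 3 degrees during the turn, while the worst case on the fixed screen (an object at infinity) is about a quarter of a degree, and objects near the screen plane are hardly affected at all.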

Alas, there is an additional, often overlooked factor: display calibration. Any immersive graphics system, be it an HMD, a CAVE, or something else, needs to exactly replicate how virtual objects are projected onto the system’s real screens, and then seen by the user (how exactly that works is a topic for another post). The bottom line is that the graphics software needs to know the absolute positions and orientations of all screens, and the absolute positions of the viewer’s eyes. Determining this is the job of head tracking and system calibration. But in an HMD, unlike in a CAVE, the tolerances for calibration are very tight. The screens are very small and very close to the viewer’s eyes, which doesn’t leave much room for error (see Figure 1). Even worse, there is no way to precisely don an HMD short of putting screws into one’s skull; every time you put it on, it sits slightly differently. And that means any pre-configured projection parameters will not match reality.

Figure 1: Diagram of a hypothetical HMD for calibration purposes. The HMD consists of small real screens mounted directly in front of the viewer’s eyes, and uses optics to create larger virtual screens at a longer distance away to allow users to properly focus on those screens. For proper calibration, graphics software needs to know the precise positions of the viewer’s pupils and the exact positions and sizes of the virtual screens, in some coordinate system. Head tracking will provide the mapping from this viewer-attached coordinate system to the world coordinate system to allow users to look and walk around.

These mismatches have several effects. For one, imagine that a viewer wears an HMD slightly askew, so that the two screens have different vertical positions in front of their respective eyes. If the software does not account for that, the two stereo images will be vertically displaced, something that does not happen in real life. The viewer’s eyes will make up for it, up to a point, by moving up/down independently, but that is an unnatural motion and causes eye strain. It’s the same effect as watching a 3D movie in a theater while not holding one’s head level — it will hurt later.

Another, more subtle, effect is that in a miscalibrated display system the virtual world does not behave as the real world would. Do a simple experiment: fire up some first-person video game that allows view configuration, such as Doom3, and set a high field of view. Then rotate the view and observe. The virtual world will display a strong distortion effect, meaning that the sizes of objects, and their internal angles, change as the viewpoint changes. This is an extreme example, but even slight discrepancies are subconsciously unsettling, because our visual system is very good at detecting if something is not right with the world, and it tells us that by making us sick.
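
The size of this effect is easy to estimate. In a standard planar perspective projection, a point at angle θ off the view axis lands at a screen position proportional to tan(θ), so a small object near the edge of the screen is stretched radially by a factor of 1/cos²(θ) compared to the same object in the screen’s center. If the viewer’s eye sits exactly where the projection assumes it to be, this stretch is cancelled by viewing the screen at an angle; if not, it shows up as distortion. The sketch below (the function name is my own choice) prints the edge stretch for a few horizontal fields of view.

```python
import math


def edge_stretch(horizontal_fov_deg):
    """Radial stretch of a small object at the left or right screen edge,
    relative to the screen center, for a planar perspective projection."""
    half_angle = math.radians(horizontal_fov_deg / 2.0)
    return 1.0 / math.cos(half_angle) ** 2


for fov in (60.0, 90.0, 120.0):
    print(f"FOV {fov:5.1f} degrees: edge stretch {edge_stretch(fov):.2f}x")
```

At a programmatic field of view of 120 degrees, objects at the screen edge come out about four times wider than in the center, which is why they visibly swell and shrink as the view rotates.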

Even in non-immersive 3D graphics, too large a discrepancy between the real field of view (how large the screen looms in our visual field) and the programmatic field of view is known to cause motion sickness, and immersive 3D graphics with the same issue will be much worse. FOV discrepancy is only one symptom of miscalibration, but it’s the one that’s easiest to demonstrate; the others are more subtle (but that’s a topic for another post). In the end, miscalibration is a nasty problem because it is subtle, very hard to correct, and causes significant ill effects.

I noticed these things when I started experimenting with my own HMDs a while ago (I have an eMagin Z800 3DVisor and a Sony HMZ-T1). I experimented with rapid motions, but those didn’t really make me dizzy. I did notice, however, that the world didn’t seem solid, but as if it were made of jelly. I expected that, not having done proper calibration yet, so I used an interactive calibration utility to set up the system just so. After that, the world seemed stable, and interestingly I didn’t notice any more issues from lag. I have not done any further experiments, but my hunch is that miscalibration is actually a bigger problem than lag. (Disclosure: while I was using a low-latency Intersense IS-900 tracking system, the computer running the show was fairly old, and the Quake3 renderer had no particular performance tweaking, so I estimate the total system delay at around 30ms.)

So what’s the take-home message from this wall of text? If we want HMDs to succeed, we need to treat them properly in our graphics software. We need to use proper projection models instead of the standard camera model (but that’s a topic for another post), and not simply apply ad-hoc stereo models such as toe-in etc. (but that’s a topic for another post). It might work for a demo, but it won’t be pretty, and it will make our users sick. Instead, we need to know exactly how the HMD is laid out internally (screen placement and size, effects from the optical system in front of the screens, lens distortions, etc.), and, just as importantly, we need to know exactly where the viewer’s eyes are with respect to the screens (see Figure 1). This last one is the hard part. Maybe a future perfect HMD will contain one pair of stereo cameras per screen that will accurately track the viewer’s pupils and allow the graphics software to set up the projection parameters correctly, no matter how the HMD is worn and how the viewer moves. But until then, we need to come up with a practical approach, and we need to find simple methods to calibrate HMDs on the fly, and teach our users how to use those methods.

Well, and, of course, we mustn’t forget about minimizing lag, either. That would be too easy.

Oh, and by the way, want to get a quick glimpse of just how immersive the Oculus Rift will be (going by current specs)? If your monitor is X inches wide, put your eye X/2 inches in front of the monitor’s center — that’s about what it will look like. If you want to play a first-person game from that viewpoint and have it look right, set the horizontal field of view to 90 degrees.
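
The 90 degrees follow from simple trigonometry: a flat screen of width w, seen head-on from a distance d in front of its center, covers a horizontal angle of 2·atan(w/(2d)). Here is a small sketch to check that, and to compute the matching field of view for any other setup. The function name and the example monitor width of 24 inches are assumptions for illustration.

```python
import math


def real_fov_degrees(screen_width, eye_distance):
    """Horizontal angle covered by a flat screen seen from a point
    centered in front of it; both arguments in the same length unit."""
    return math.degrees(2.0 * math.atan2(screen_width / 2.0, eye_distance))


width = 24.0  # hypothetical monitor width in inches

# Eye at half the monitor's width, as suggested above
print(real_fov_degrees(width, width / 2.0))

# For comparison: eye at a full monitor width, a more typical desk distance
print(real_fov_degrees(width, width))
```

The first call yields the 90 degrees mentioned above; the second shows that the same monitor covers a much narrower angle from a typical viewing distance.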