How do all the technical components of VR headsets, e.g., screens, lenses, tracking, etc., actually come together to create realistic-looking virtual environments? Specifically, why do virtual environments in VR look more “real” compared to when viewed via other media, for example panoramic video?
The reason I’m bringing this up again is that the question keeps getting asked, and that it’s really kinda hard to answer. Most attempts to answer it fall back on technical aspects, such as stereoscopy, head tracking, etc., but I find that this approach somewhat misses the point by focusing on individual components, or at least gets mired in technical details that don’t make much sense to those who have to ask the question in the first place.
I prefer to approach the question from the opposite end: not through what VR hardware produces, but instead through how the viewer perceives 3D objects and/or environments, and how either the real world on the one hand, or virtual reality displays on the other, create the appropriate visual input to support that perception.
The downside with that approach is that it doesn’t lend itself to short answers. In fact, last summer I gave a 25-minute talk about this exact topic at the 2016 VRLA Summer Expo. It may not be news, but I haven’t linked this video from here before, and it’s probably still timely:
Why do virtual objects close to my face appear blurry when wearing a VR headset? My vision is fine!
And why does the real world look strange immediately after a long VR session?
These are another two (related ones) of those frequently-asked questions about VR and head-mounted displays (HMDs) that I promised to address a while back.
Here’s the short answer: In all currently-available HMDs, the screens creating the virtual imagery are at a fixed optical distance from the user. But our eyes have evolved to automatically adjust their optical focus based on the perceived distance to objects, virtual or real, that they are looking at. So when a virtual object appears to be mere inches in front of the user’s face, but the screens showing images of that object are — optically — several meters away, the user’s eyes will focus on the wrong distance, and as a result, the virtual object will appear blurry (the same happens, albeit less pronounced, when a virtual object appears to be very far away). This effect is called accommodation-vergence conflict, and besides being a nuisance, it can also cause eye strain or headaches during prolonged VR sessions, and can cause vision problems for a short while after such sessions.
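To put some rough numbers on that conflict, here is a tiny back-of-the-envelope sketch in Python. All values below are assumptions I picked for illustration, not measurements of any particular headset:

```python
import math

# Illustrative numbers only (assumed, not taken from any specific HMD):
ipd = 0.063             # interpupillary distance in meters (about 63 mm)
object_distance = 0.25  # apparent distance of the virtual object, in meters
screen_distance = 2.0   # optical distance of the HMD's screens, in meters

# Vergence: the eyes rotate inward so their view directions cross at the object.
vergence_angle = 2.0 * math.degrees(math.atan2(ipd / 2.0, object_distance))

# Accommodation is conveniently measured in diopters (1 / distance in meters).
demanded = 1.0 / object_distance  # focus distance suggested by vergence
provided = 1.0 / screen_distance  # focus distance the screens actually provide

print(f"Vergence angle: {vergence_angle:.1f} degrees")
print(f"Accommodation mismatch: {demanded - provided:.1f} diopters")
```

The larger that mismatch in diopters, the blurrier a nearby virtual object will look, and the harder the eyes will try, and fail, to bring it into focus.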
“It can’t be comfortable or healthy to stare at a screen a few inches in front of your eyes.”
The popularity of Google Cardboard, and the upcoming commercial releases of the Oculus Rift, HTC Vive, and other modern head-mounted displays (HMDs) have raised interest in virtual reality and VR devices in parts of the population who have never been exposed to, or had reason to care about, VR before. Together with the fact that VR, as a medium, is fundamentally different from other media with which it often gets lumped in, such as 3D cinema or 3D TV, this leads to a number of common misunderstandings and frequently-asked questions. Therefore, I am planning to write a series of articles addressing these questions one at a time.
First up: How is it possible to see anything on a screen that is only a few inches in front of one’s face?
Short answer: In HMDs, there are lenses between the screens (or screen halves) and the viewer’s eyes to solve exactly this problem. These lenses project the screens out to a distance where they can be viewed comfortably (for example, in the Oculus Rift CV1, the screens are rumored to be projected to a distance of two meters). This also means that, if you need glasses or contact lenses to clearly see objects several meters away, you will need to wear your glasses or lenses in VR.
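For those curious how a lens can push a screen sitting an inch or two from the eye out to a comfortable viewing distance, here is a quick thin-lens sketch. The focal length and screen placement below are made-up values chosen to land the virtual image at roughly two meters; they are not the actual specs of the Rift or any other headset:

```python
# Thin-lens relation: 1/f = 1/d_o + 1/d_i, with a negative d_i meaning a
# virtual image on the same side of the lens as the screen.
# All numbers are assumptions for illustration only.
focal_length = 0.040     # lens focal length in meters (assumed: 40 mm)
screen_to_lens = 0.0392  # screen placed just inside the focal length (assumed)

image_distance = 1.0 / (1.0 / focal_length - 1.0 / screen_to_lens)

print(f"Virtual image appears {-image_distance:.2f} m in front of the lens")
# Moving the screen closer to the focal plane pushes the virtual image farther
# out; placing it exactly at the focal plane sends it to (optical) infinity.
```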
I have briefly mentioned HoloLens, Microsoft’s upcoming see-through Augmented Reality headset, in a previous post, but today I got the chance to try it for myself at Microsoft’s “Build 2015” developers’ conference. Before we get into the nitty-gritty, a disclosure: Microsoft invited me to attend Build 2015, meaning they waived my registration fee, and they gave me, like all other attendees, a free HP Spectre x360 notebook (from which I’m typing right now because my vintage 2008 MacBook Pro finally kicked the bucket). On the downside, I had to take Amtrak and BART to downtown San Francisco twice, because I wasn’t able to get a one-on-one demo slot on the first day, and got today’s 10am slot after some finagling and calling in of favors. I guess that makes us even.
So, on to the big question: is HoloLens real? Given Microsoft’s track record with product announcements (see 2009’s Project Natal trailer and especially the infamous Milo “demo”), there was some well-deserved skepticism regarding the HoloLens teaser released in January, and even the on-stage demo that was part of the Build 2015 keynote:
I have talked many times about the importance of eye tracking for head-mounted displays, but so far, eye tracking has been limited to the very high end of the HMD spectrum. Not anymore. SensoMotoric Instruments, a company with around 20 years of experience in vision-based eye tracking hardware and software, unveiled a prototype integrating the camera-based eye tracker from their existing eye tracking glasses with an off-the-shelf Oculus Rift DK1 HMD (see Figure 1). Fortunately for me, SMI were showing their eye-tracked Rift at the 2014 Augmented World Expo, and offered to bring it up to my lab to let me have a look at it.
Figure 1: SMI’s after-market modified Oculus Rift with one 3D eye tracking camera per eye. The current tracking cameras need square cut-outs at the bottom edge of each lens to provide an unobstructed view of the user’s eyes; future versions will not require such extensive modifications.
Update: Please tear your eyes away from the blue lady and also read this follow-up post. It turns out things are worse than I thought. Now back to your regularly scheduled entertainment.
I somehow missed this when it was hot a few weeks or so ago, but I just found out about an interesting Kickstarter project: HOLOVISION — A Life Size Hologram. Don’t bother clicking the link, the project page has been taken down following a DMCA complaint and might not ever be up again.
Why do I think it’s worth talking about? Because, while there is an actual design for something called Holovision, and that design is theoretically feasible, and possibly even practical, the public’s impression of the product advertised on Kickstarter is decidedly not. The concept imagery associated with the Kickstarter project presents this feasible technology in a way that (intentionally?) taps into people’s misconceptions about holograms (and I’m talking about the “real” kind of holograms, those involving lasers and mirrors and beam splitters). In other words, it might not be a scam per se, and it might even be unintentional, but it is definitely creating a false impression that might lead to very disappointed backers.
I’ve talked about “holographic displays” a lot, most recently in my analysis of the upcoming zSpace display. What I haven’t talked about is how exactly such holographic displays work, what makes them “holographic” as opposed to just stereoscopic, and why that is a big deal.
Teaser: A user interacting with a virtual object inside a holographic display.
I received an email about a week ago that reminded me that, even though stereoscopic movies and 3D graphics have been around for at least six decades, there are still some widespread misconceptions out there. Those need to be addressed urgently, especially given stereo’s hard push into the mainstream over the last few years. This time around, the approaches to stereo are generally better than the last time “3D” hit the local multiplex (just compare Avatar and Friday the 13th 3D), and the wide availability of commodity stereoscopic display hardware is a major boon to people like me. Even so, we are already beginning to see a backlash. And if there’s a way to do things better, to avoid that backlash, then I think it’s important to do it.
So here’s the gist of this particular issue: there are primarily two ways of setting up a movie camera, or a virtual movie camera in 3D computer graphics, to capture stereoscopic images — one is used by the majority of existing 3D graphics software, and seemingly also by the “3D” movie industry, and the other one is correct.
Toe-in vs skewed frustum
So, how do you set up a stereo camera? The basic truth is that stereoscopy works by capturing two slightly different views of the same 3D scene, and presenting these views separately to the viewers’ left and right eyes. The devil, as always, lies in the details.
Say you have two regular video cameras, and want to film a “3D” movie (OK, I’m going to stop putting “3D” in quotes now. My pedantic point is that 3D movies are not actually 3D, they’re stereoscopic. Carry on). What do you do? If you put them next to each other, with their viewing directions exactly parallel, you’ll see that it doesn’t quite give the desired effect. When viewing the resulting footage, you’ll notice that everything in the scene, up to infinity, appears to float in front of your viewing screen. This is because the two cameras, being parallel, are stereo-focused on the infinity plane. What you want, instead, is that near objects float in front of the screen, and that far objects float behind the screen. Let’s call the virtual plane separating “in-front” and “behind” objects the stereo-focus plane.
So how do you control the position of the stereo-focus plane? When using two normal cameras, the only solution is to rotate both slightly inwards, so that their viewing direction lines intersect exactly in the desired stereo-focus plane. This approach is often called toe-in stereo, and it sort-of works — under a very lenient definition of the words “sort-of” and “works.”
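For concreteness, the toe-in angle follows from simple trigonometry: each camera is rotated inward by the angle whose tangent is half the camera separation divided by the distance to the desired stereo-focus plane. Here is a minimal sketch with assumed numbers (and to be clear, this is the setup being critiqued in the rest of this post, not a recommendation):

```python
import math

# Toe-in geometry: rotate each camera inward so the two view directions
# cross at the desired stereo-focus plane. Numbers are assumed for illustration.
camera_separation = 0.065  # baseline between the two cameras, in meters
focus_distance = 3.0       # distance to the desired stereo-focus plane, in meters

toe_in_angle = math.degrees(math.atan2(camera_separation / 2.0, focus_distance))
print(f"Each camera is rotated inward by {toe_in_angle:.2f} degrees")
```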
The fundamental problem with toe-in stereo is that it makes sense intuitively — after all, don’t our eyes rotate inwards when we focus on nearby objects? — but that our intuition does not correspond to how 3D movies are shown. 3D (or any other kind of) movies are not projected directly onto our retinas, they are projected onto screens, and those screens are in turn viewed by us, i.e., they project onto our retinas.
Now, when a normal camera records a movie, the assumption is that the movie will later be projected onto a screen that is orthogonal to the projector’s projection direction, which is implicitly the same as the camera’s viewing direction (the undesirable effect of non-orthogonal projection is called keystoning). In a toe-in stereo camera, on the other hand, there are two viewing directions, at a slight angle towards each other. But, in the theater, the cameras’ views are projected onto the same screen, meaning that at least one, but typically both, of the component images will exhibit keystoning (see Figures 1 and 2).
Figure 1: The implied viewing directions and screen orientations caused by a toe-in stereo camera based on two on-axis projection cameras. The discrepancy between the screen orientations implied by the cameras’ models and the real screen causes keystone distortion, which leads to 3D convergence issues and eye strain.
Figure 2: The left stereo image shows the keystoning effect caused by toe-in stereo. A viewer will not be able to merge these two views into a single 3D object. The right stereo image shows the correct result of using skewed-frustum stereo. You can try for yourself using a pair of red/blue anaglyphic glasses.
The bad news is that keystoning from toe-in stereo leads to problems in 3D vision. Because the left/right views of captured objects or scenes do not actually look like they would if viewed directly with the naked eye, our brains refuse to merge those views and perceive the 3D objects therein, causing a breakdown of the 3D illusion. When keystoning is less severe, our brains are flexible enough to adapt, but our eyes will dart around trying to make sense of the mismatching images, which leads to eye strain and potentially headaches. Because keystoning is more severe towards the left and right edges of the image, toe-in stereo generally works well enough for convergence around the center of the images, and generally breaks down towards the edges.
And this is why I think a good portion of current 3D movies are based on toe-in stereo (I haven’t watched enough 3D movies to tell for sure, and the ones I’ve seen were too murky to really tell): I have spoken with 3D movie experts (an IMAX 3D film crew, to be precise), and they told me the two basic rules of thumb for good stereo in movies: artificially reduce the amount of eye separation, and keep the action, and therefore the viewer’s eyes, in the center of the screen. Taken together, these two rules exactly address the issues caused by toe-in stereo, but of course they’re only treating the symptom, not the cause. As an aside: when we showed this camera crew how we are doing stereo in the CAVE, they immediately accused us of breaking the two rules. What they forgot is that stereo in the CAVE obviously works, including for them, and does not cause eye strain, meaning that those rules are only workarounds for a problem that doesn’t exist in the first place if stereo is done properly.
So what is the correct way of doing it? It can be derived by simple geometry. If a 3D movie or stereo 3D graphics are to be shown on a particular screen, and will be seen by a viewer positioned somewhere in front of that screen, then the two viewing volumes for the viewer’s eyes are exactly the two pyramids defined by each eye, and the four corners of the screen. In technical terms, this leads to skewed-frustum stereo. The following video explains this pretty well, better than I could here in words or a single diagram, even though it is primarily about head tracking and the screen/viewer camera model:
In a nutshell, skewed-frustum stereo works exactly as ordered. Even stereo pairs with very large disparity can be viewed without convergence problems or eye strain, and there are no problems when looking towards the edge of the image.
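For the graphics programmers in the audience, here is a minimal numpy sketch of how a skewed-frustum (off-axis) projection can be built directly from an eye position and the corners of the screen, following the geometric argument above. This is not Vrui’s actual implementation; the function name, conventions, and all numbers in the example are mine, for illustration only:

```python
import numpy as np

def off_axis_projection(eye, screen_ll, screen_lr, screen_ul, near, far):
    """Skewed-frustum (off-axis) projection for one eye.

    eye       -- 3D eye position in world space
    screen_ll -- screen's lower-left corner in world space
    screen_lr -- screen's lower-right corner
    screen_ul -- screen's upper-left corner
    Returns a 4x4 matrix mapping world coordinates to clip coordinates
    (OpenGL-style, column-vector convention).
    """
    # Orthonormal screen basis: right, up, and normal (pointing toward the viewer).
    vr = screen_lr - screen_ll
    vu = screen_ul - screen_ll
    vr = vr / np.linalg.norm(vr)
    vu = vu / np.linalg.norm(vu)
    vn = np.cross(vr, vu)
    vn = vn / np.linalg.norm(vn)

    # Vectors from the eye to three of the screen corners.
    va = screen_ll - eye
    vb = screen_lr - eye
    vc = screen_ul - eye

    # Distance from the eye to the screen plane.
    d = -np.dot(va, vn)

    # Frustum extents on the near plane. This is where the skew comes from:
    # left/right and bottom/top are generally not symmetric around zero.
    l = np.dot(vr, va) * near / d
    r = np.dot(vr, vb) * near / d
    b = np.dot(vu, va) * near / d
    t = np.dot(vu, vc) * near / d

    # Standard glFrustum-style matrix built from the asymmetric extents.
    P = np.array([
        [2 * near / (r - l), 0.0,                 (r + l) / (r - l),             0.0],
        [0.0,                2 * near / (t - b),  (t + b) / (t - b),             0.0],
        [0.0,                0.0,                -(far + near) / (far - near),  -2 * far * near / (far - near)],
        [0.0,                0.0,                -1.0,                           0.0]])

    # Rotate world space into the screen basis and move the eye to the origin.
    M = np.identity(4)
    M[0, :3] = vr
    M[1, :3] = vu
    M[2, :3] = vn
    T = np.identity(4)
    T[:3, 3] = -eye

    return P @ M @ T

# Example: a 4 m wide, 2.5 m tall screen in the z=0 plane, with the left eye
# 1 m in front of it and slightly off-center (all numbers assumed).
eye = np.array([-0.032, 1.2, 1.0])
ll  = np.array([-2.0, 0.0, 0.0])
lr  = np.array([ 2.0, 0.0, 0.0])
ul  = np.array([-2.0, 2.5, 0.0])
print(off_axis_projection(eye, ll, lr, ul, near=0.1, far=100.0))
```

The point to notice is that the left/right and bottom/top frustum extents come out asymmetric whenever the eye is not centered in front of the screen; that asymmetry is exactly the “skew” in skewed-frustum stereo, and the two eyes simply feed two different eye positions into the same function.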
To allow for a real and direct comparison, I prepared two stereoscopic images (using red/blue anaglyphic stereo) of the same scene from the same viewpoint and with the same eye separation, one using toe-in stereo, one using skewed-frustum stereo. They need to be large and need to be seen at original size to appreciate the effect, which is why I’m only linking them here. Ideally, switch back-and-forth between the images several times and focus on the structure close to the upper-left corner. The effect is subtle, but noxious:
I generated these using the Nanotech Construction Kit and Vrui; as it turns out, Vrui is flexible enough to support bad stereo, but at least setting it up was considerably harder than setting up good stereo. So that’s a win, I guess.
There are only two issues to be aware of: for one, objects at infinity will be drawn with an on-screen disparity exactly equal to the programmed-in eye separation, so if that separation is larger than the viewer’s actual eye separation, convergence for very far-away objects will fail (in reality, objects can’t be farther away than infinity, or at least our brains seem to think so). Fortunately, the distribution of eye separations in the general population is quite narrow; just stick close to the smaller end. But it’s a thing to keep in mind when producing stereoscopic images for a small screen and then showing them on a large screen: eye separation scales with screen size when it is baked into a video. This is why, ideally, stereoscopic 3D graphics should be generated specifically for the size of the screen on which they will be shown, and for the expected position of the audience.
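To make that scaling issue concrete, here is a tiny sketch with made-up numbers:

```python
# Why baked-in eye separation scales with screen size: a point at infinity is
# rendered with an on-screen disparity equal to the eye separation used during
# rendering, and that disparity is a fixed fraction of the image width.
# All numbers are assumptions for illustration.
render_eye_sep = 0.065       # eye separation used when rendering, in meters
render_screen_width = 3.0    # width of the screen the content was made for, in meters
playback_screen_width = 9.0  # width of the screen it actually ends up on, in meters

disparity_at_infinity = render_eye_sep * (playback_screen_width / render_screen_width)
print(f"Disparity at infinity on the big screen: {disparity_at_infinity * 100:.1f} cm")
# If this exceeds the viewer's actual eye separation (about 6.5 cm), their eyes
# would have to diverge to fuse far-away objects, which they cannot do.
```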
The other issue is that virtual objects very close to the viewer will appear blurry. This is because when the brain perceives an object to be at a certain distance, it will tell the eyes to focus their lenses to that distance (a process called accommodation). But in stereoscopic imaging, the light reaching the viewer’s eyes from close-by virtual objects will still come from the actual screen, which is much farther away, and so the eyes will focus on the wrong plane, and the entire image will appear blurry.
Unfortunately, there’s nothing we can do about that right now, but at least it’s a rather subtle effect. In our CAVE, users standing in the center can see virtual objects floating only a few inches in front of their eyes quite clearly, even though the walls, i.e., the actual screens, are four feet away. This focus miscue does have a noticeable after-effect: after having used the CAVE for an extended period of time, say a few hours, the real world will look somewhat “off,” in a way that’s hard to describe, for a few minutes after stepping out. But this appears to be only a temporary effect.
Taking it back to the real 3D movie world: the physical analogy to skewed-frustum stereo is lens shift. Instead of rotating the two cameras inwards, one has to shift their lenses inwards. The amount of shift is, again, determined by the distance to the desired stereo-focus plane. Technically, creating lens-shift stereo cameras should be feasible (after all, lens shift photography is all the rage these days), so everybody should be using them. And some 3D movie makers might very well already do that — I’m not a part of that crowd, but from what I hear, at least some don’t.
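The required shift follows from similar triangles: a point on the rig’s center axis at the stereo-focus distance projects off-center by half the baseline times the focal length over that distance, so that is how far the lens (or, equivalently, the sensor) has to move to re-center it. A quick sketch with assumed numbers:

```python
# Lens-shift geometry, by similar triangles. Numbers are assumed for illustration.
baseline = 0.065      # separation between the two cameras, in meters
focal_length = 0.035  # lens focal length in meters (assumed: 35 mm)
focus_distance = 3.0  # distance to the desired stereo-focus plane, in meters

shift = (baseline / 2.0) * focal_length / focus_distance
print(f"Shift each camera's lens inward by {shift * 1000:.2f} mm")
```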
In the 3D graphics world, where cameras are entirely virtual, it should be even easier to do stereo right. However, many graphics applications use the standard camera model (focus point, viewing direction, up vector, field-of-view), and can only represent non-skewed frusta. The fact that this camera model, as commonly implemented, does not support proper stereo, is just another reason why it shouldn’t be used.
So here’s the bottom line: Toe-in stereo is only a rough approximation of correct stereo, and it should not be used. If you find yourself wondering how to specify the toe-in angle in your favorite graphics software, hold it right there, you’re doing it wrong. The fact that toe-in stereo is still used — and seemingly widely used — could explain the eye strain and discomfort large numbers of people report with 3D movies and stereoscopic 3D graphics. Real 3D movie cameras should use lens shift, and virtual stereoscopic cameras should use skewed frusta, aka off-axis projection. While the standard 3D graphics camera model can be generalized to support skewed frusta, why not just replace it with a model that can do it without additional thought, and is more flexible and more generally applicable to boot?
Update: With the Oculus Rift in developers’ hands now, I’m getting a lot of questions about whether this article applies to head-mounted displays in general, and the Rift specifically. Short answer: it does. There isn’t any fundamental difference between large screens far away from the viewer, and small screens right in front of the viewer’s eyes. The latter add a wrinkle because they necessarily need to involve lenses and their concomitant distortions so that viewers are able to focus on the screens, but the principle remains the same. One important difference is that small screens close to the viewer’s eyes are more sensitive to miscalibration, so doing stereo right is, if anything, even more important than on large-screen displays. And yes, the official Oculus Rift software does use off-axis projection, even though the SDK documentation flat-out denies it.