3D Movies and the VR community: a match made in heaven or hell?

I know I’m several years late to the party talking about the recent 3D movie renaissance, but bear with me. I want to talk not about 3D movies, but about their influence on the VR field, good and bad.

First, the good. It’s impossible to deny the huge impact 3D movies have had on VR, simply by commodifying 3D display hardware. I’m going to go out on a limb and say that without Avatar, you wouldn’t be able to go into an electronics store and pick up a 70″ 3D TV for $2100. And without that crucial component, we would not be able to build low-cost fully-immersive 3D display systems for $7000. And we wouldn’t have neat toys like Sony’s HMZ-T1 or the upcoming Oculus Rift either — although the latter is designed for gaming from the ground up, I don’t think the Kickstarter would have taken off if 3D movies weren’t a thing right now.

And the effect goes beyond simply making real VR cheaper: it puts real VR within reach of a much larger segment of people. $7000 is still a bit much to spend on home entertainment, but it's inside the equipment budget of many scientists. And those are my target audience. We are not selling low-cost VR systems per se, but we're giving away the designs to build them, and the software to run them. And we've "sold" dozens of them, primarily to scientists who work with 3D data that is too complex to meaningfully analyze with desktop 3D visualization, but who don't have the budget to build "professional" systems. Now, dozens is absolutely zilch in mainstream terms, but for our niche it's a big deal, and it's just taking off. We're even getting them into high schools now. And we're not the only ones "selling" them.

The end result is that many more people are getting exposed to real immersive 3D display environments, and to the practical benefits that they offer for their daily work. That will benefit us all.

But there are some downsides to the 3D movie renaissance as well, and while those can be addressed, we first need to be aware of them. For one, while 3D movies are definitely in the public consciousness, I've found that hardly anybody is exactly bonkers about them. Roger Ebert is an extreme example (I think Mr. Ebert is wrong in the sense that he claims 3D does not work in principle, whereas I think 3D does not work in many concrete implementations seen in theaters right now, but that's a topic for another post), but the majority of people I speak to are decidedly "meh" about 3D movies. They say "3D doesn't work for me" or "I get headaches" or "I get dizzy" etc.

Now that is a problem for VR as a whole, because there is no distinction in the public mind between 3D movies and real immersive 3D graphics. Meaning that people think that VR doesn’t work. But it does. I just did a quick guesstimate, and in the seven years we’ve had our CAVE, I’ve probably brought 1000 people through there, from every segment of the population. It has worked for every single one of them. How do I know? Everyone who enters the CAVE goes through the training course — a beach ball-sized globe hanging in the middle of the CAVE, shown in this video:

(Oh boy, just looking at this six-year-old video, the user interface in Vrui has improved so much. It’s almost embarrassing.)

I ask every single person to step in, touch the globe, and then indicate how big it is. And they all do the same thing: use both hands to make a cradling gesture around a virtual object that's not actually there. If the 3D effect didn't work for them, they couldn't do it. QED. Before you ask: I'm aware that a significant percentage of the general population has no stereo vision at all, but immersive 3D graphics works for them as well because it provides motion parallax. I know because one of my best friends has monocular vision, and it works for him. He even co-stars with me in a silly video.

The upshot is that the conversation goes differently now. It used to be that I would talk to "VR virgins" about what I do; they had no preconceptions about 3D, were curious, tried the CAVE, and it worked for them and they liked it. These days, I talk about the CAVE, they immediately say that 3D doesn't work for them, and they're very reluctant to try it. I twist their arms to get them in there nonetheless, and it works for them, and they like it. This is not a problem when I have someone there in person, but it is a problem when I can't just stuff the person I'm describing VR to into a VR system, as in, say, when writing a proposal to beg for money. And that's bad news, big time (but it's a topic for another post).

There is another interesting change in behavior: let’s say I have a group of people coming in for a tour (yeah, we sometimes get strongarmed into doing those). Used to be, they would come into the CAVE room, and stand around not sure what to expect or what to do. These days, they immediately sit down at the conference table, grab a pair of 3D glasses if they find one, and get ready to be entertained. I then have to tell them that no, that’s not how it works, would they please put the non-head tracked glasses down until later, get up, and get ready to get into the CAVE itself and see it properly? It’s pretty funny, actually.

The other downside is that the use of the word “3D” for movies has watered down that term even more. Now there are:

  • "3D graphics" for projected 2D images of 3D scenes (virtual and real photos or movies alike), i.e., basically everything anybody has ever done. The end results of 3D graphics are decidedly 2D, but the term was coined to distinguish it from 2D graphics, i.e., pictures of scenes playing out in flatland.
  • “3D movies” meaning stereoscopic movies shown on stereoscopic displays. In my opinion, a better term would be “2D plus depth” movies (or they could just go with “stereo movies,” you know), because most directors at this time treat the stereoscopic dimension as a separate entity from the other two dimensions, as something that can be tweaked and played with. And I think that’s one cause of the problem, because they’re messing with people’s brains. And don’t even get me started on “upconverted” 3D movies, oh my.
  • “3D displays” meaning stereoscopic displays, those used to show 3D movies. They are a necessary component to create 3D images, but not 3D by themselves.
  • “3D displays” meaning immersive 3D displays like CAVEs. The distinguishing feature of these is that they show three-dimensional scenes and objects in a way similar enough to how we would perceive the same scenes and objects if they were real that our brains accept the illusion, and allow us to work with them as if they were real — and this last bit is really the main point. The difference between this and “3D movies” cannot be overstated. I would rather call these displays “holographic,” but then I get flak from the “holograms are only holograms if they’re based on lasers and interference” crowd, who are technically correct (and isn’t that the best form of correctness?) because that’s how the word was defined, but it’s wrong because these displays look and feel exactly like holograms — they are free-standing, solid-appearing, touchable virtual objects. After all, “hologram,” loosely translated from Greek, means “shows the whole thing.” And that’s exactly what immersive 3D displays do.

And I probably missed a few. So there’s clearly a confusion of terms, and we need to find ways to distinguish what real immersive 3D graphics does from what 3D movies do, and need to do it in ways that don’t create unrealistic expectations, either. Don’t reference “the Matrix,” try not to mention holodecks (but it’s so tempting!), don’t say it’s an indistinguishable replication of reality (in other words, don’t say “virtual reality,” ha!). Ideally, don’t say anything — show them.

In summary, "3D" is now widely embedded in the public consciousness, and the VR community has to deal with it. There are obvious and huge benefits, but there are some downsides as well, and those have to be addressed. They can be addressed — fortunately, immersive 3D graphics are not the same as 3D movies — but it takes care and effort. Time to get started.

Good stereo vs. bad stereo

I received an email about a week ago that reminded me that, even though stereoscopic movies and 3D graphics have been around for at least six decades, there are still some widespread misconceptions out there. Those need to be addressed urgently, especially given stereo's hard push into the mainstream over the last few years. This time around, the approaches to stereo are generally better than the last time "3D" hit the local multiplex (just compare Avatar and Friday the 13th 3D), and the wide availability of commodity stereoscopic display hardware is a major boon to people like me, but we are already beginning to see a backlash. And if there's a way to do things better, to avoid that backlash, then I think it's important to do it.

So here’s the gist of this particular issue: there are primarily two ways of setting up a movie camera, or a virtual movie camera in 3D computer graphics, to capture stereoscopic images — one is used by the majority of existing 3D graphics software, and seemingly also by the “3D” movie industry, and the other one is correct.

Toe-in vs skewed frustum

So, how do you set up a stereo camera? The basic truth is that stereoscopy works by capturing two slightly different views of the same 3D scene, and presenting these views separately to the viewers’ left and right eyes. The devil, as always, lies in the details.

Say you have two regular video cameras, and want to film a “3D” movie (OK, I’m going to stop putting “3D” in quotes now. My pedantic point is that 3D movies are not actually 3D, they’re stereoscopic. Carry on). What do you do? If you put them next to each other, with their viewing directions exactly parallel, you’ll see that it doesn’t quite give the desired effect. When viewing the resulting footage, you’ll notice that everything in the scene, up to infinity, appears to float in front of your viewing screen. This is because the two cameras, being parallel, are stereo-focused on the infinity plane. What you want, instead, is that near objects float in front of the screen, and that far objects float behind the screen. Let’s call the virtual plane separating “in-front” and “behind” objects the stereo-focus plane.

So how do you control the position of the stereo-focus plane? When using two normal cameras, the only solution is to rotate both slightly inwards, so that their viewing direction lines intersect exactly in the desired stereo-focus plane. This approach is often called toe-in stereo, and it sort-of works — under a very lenient definition of the words “sort-of” and “works.”
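For concreteness, here is a minimal sketch of what such a toe-in rig amounts to; the variable names and values are my own, purely for illustration, and this is the setup I'm arguing against:

    #include <cmath>
    #include <cstdio>

    int main() {
        // Hypothetical toe-in stereo rig: two cameras, separated by roughly
        // human eye distance, each rotated inwards so their view directions
        // intersect in the desired stereo-focus plane.
        const double pi = 3.14159265358979323846;
        double cameraSeparation = 0.065; // meters
        double focusDistance = 5.0;      // distance to stereo-focus plane, meters

        // Each camera is rotated inwards by half the total convergence angle:
        double toeInAngle = std::atan2(cameraSeparation * 0.5, focusDistance);
        std::printf("toe-in angle per camera: %.3f degrees\n",
                    toeInAngle * 180.0 / pi);
        // The two rotated view directions imply two differently-oriented
        // screens; a single real screen can't satisfy both, hence keystoning.
        return 0;
    }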

The fundamental problem with toe-in stereo is that it makes sense intuitively — after all, don’t our eyes rotate inwards when we focus on nearby objects? — but that our intuition does not correspond to how 3D movies are shown. 3D (or any other kind of) movies are not projected directly onto our retinas, they are projected onto screens, and those screens are in turn viewed by us, i.e., they project onto our retinas.

Now, when a normal camera records a movie, the assumption is that the movie will later be projected onto a screen that is orthogonal to the projector’s projection direction, which is implicitly the same as the camera’s viewing direction (the undesirable effect of non-orthogonal projection is called keystoning). In a toe-in stereo camera, on the other hand, there are two viewing directions, at a slight angle towards each other. But, in the theater, the cameras’ views are projected onto the same screen, meaning that at least one, but typically both, of the component images will exhibit keystoning (see Figures 1 and 2).

Figure 1: The implied viewing directions and screen orientations caused by a toe-in stereo camera based on two on-axis projection cameras. The discrepancy between the screen orientations implied by the cameras’ models and the real screen causes keystone distortion, which leads to 3D convergence issues and eye strain.

Figure 2: The left stereo image shows the keystoning effect caused by toe-in stereo. A viewer will not be able to merge these two views into a single 3D object. The right stereo image shows the correct result of using skewed-frustum stereo. You can try for yourself using a pair of red/blue anaglyphic glasses.

The bad news is that keystoning from toe-in stereo leads to problems in 3D vision. Because the left/right views of captured objects or scenes do not actually look like they would if viewed directly with the naked eye, our brains refuse to merge those views and perceive the 3D objects therein, causing a breakdown of the 3D illusion. When keystoning is less severe, our brains are flexible enough to adapt, but our eyes will dart around trying to make sense of the mismatching images, which leads to eye strain and potentially headaches. Because keystoning is more severe towards the left and right edges of the image, toe-in stereo generally works well enough for convergence around the center of the images, and generally breaks down towards the edges.

And this is why I think a good portion of current 3D movies are based on toe-in stereo (I haven’t watched enough 3D movies to tell for sure, and the ones I’ve seen were too murky to really tell): I have spoken with 3D movie experts (an IMAX 3D film crew, to be precise), and they told me the two basic rules of thumb for good stereo in movies: artificially reduce the amount of eye separation, and keep the action, and therefore the viewer’s eyes, in the center of the screen. Taken together, these two rules exactly address the issues caused by toe-in stereo, but of course they’re only treating the symptom, not the cause. As an aside: when we showed this camera crew how we are doing stereo in the CAVE, they immediately accused us of breaking the two rules. What they forgot is that stereo in the CAVE obviously works, including for them, and does not cause eye strain, meaning that those rules are only workarounds for a problem that doesn’t exist in the first place if stereo is done properly.

So what is the correct way of doing it? It can be derived by simple geometry. If a 3D movie or stereo 3D graphics are to be shown on a particular screen, and will be seen by a viewer positioned somewhere in front of that screen, then the two viewing volumes for the viewer’s eyes are exactly the two pyramids defined by each eye, and the four corners of the screen. In technical terms, this leads to skewed-frustum stereo. The following video explains this pretty well, better than I could here in words or a single diagram, even though it is primarily about head tracking and the screen/viewer camera model:

In a nutshell, skewed-frustum stereo works exactly as ordered. Even stereo pairs with very large disparity can be viewed without convergence problems or eye strain, and there are no problems when looking towards the edge of the image.
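To make the geometry concrete, here is a minimal sketch of how the skewed-frustum parameters fall directly out of the eye position and the screen rectangle. It assumes a coordinate system with the screen centered in the z=0 plane and the eye on the positive-z side looking along -z; the names are mine, not those of any particular API, though the six resulting numbers are exactly what, e.g., OpenGL's glFrustum() expects:

    #include <cstdio>

    struct Frustum { double left, right, bottom, top, zNear, zFar; };

    // Derive the off-axis (skewed) frustum for one eye. The screen is the
    // rectangle [-w/2, w/2] x [-h/2, h/2] in the z=0 plane; the eye sits at
    // (ex, ey, ez) with ez > 0, looking along -z.
    Frustum skewedFrustum(double w, double h, double ex, double ey, double ez,
                          double zNear, double zFar) {
        // The factor zNear/ez projects offsets in the screen plane onto the
        // near plane as seen from the eye.
        double s = zNear / ez;
        Frustum f;
        f.left   = (-w * 0.5 - ex) * s;
        f.right  = ( w * 0.5 - ex) * s;
        f.bottom = (-h * 0.5 - ey) * s;
        f.top    = ( h * 0.5 - ey) * s;
        f.zNear = zNear;
        f.zFar = zFar;
        return f;
    }

    int main() {
        // Left and right eyes, 0.063m apart, centered 0.7m in front of a
        // 0.52m x 0.32m screen (roughly a 24" desktop monitor):
        for (int eye = 0; eye < 2; ++eye) {
            double ex = (eye == 0 ? -1.0 : 1.0) * 0.063 * 0.5;
            Frustum f = skewedFrustum(0.52, 0.32, ex, 0.0, 0.7, 0.1, 100.0);
            std::printf("eye x=%+.4f: l=%+.4f r=%+.4f b=%+.4f t=%+.4f\n",
                        ex, f.left, f.right, f.bottom, f.top);
        }
        return 0;
    }

Note that the two frusta differ only in their horizontal skew; neither camera is rotated, so there is no keystoning.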

To allow for a real and direct comparison, I prepared two stereoscopic images (using red/blue anaglyphic stereo) of the same scene from the same viewpoint and with the same eye separation, one using toe-in stereo, one using skewed-frustum stereo. They need to be large and need to be seen at original size to appreciate the effect, which is why I’m only linking them here. Ideally, switch back-and-forth between the images several times and focus on the structure close to the upper-left corner. The effect is subtle, but noxious:

Good (skewed-frustum) stereo vs bad (toe-in) stereo.

I generated these using the Nanotech Construction Kit and Vrui; as it turns out, Vrui is flexible enough to support bad stereo, but at least it was considerably harder to set up than good stereo. So that's a win, I guess.

There are only two issues to be aware of: for one, objects at infinity are drawn with an on-screen separation exactly equal to the programmed-in eye separation, so if that separation is larger than the viewer's actual eye separation, convergence for very far away objects will fail (in reality, objects can't be farther away than infinity, or at least our brains seem to think so). Fortunately, the distribution of eye separations in the general population is quite narrow; just stick close to the smaller end. But it's a thing to keep in mind when producing stereoscopic images for a small screen and then showing them on a large screen: eye separation baked into a video scales with screen size. This is why, ideally, stereoscopic 3D graphics should be generated specifically for the size of the screen on which they will be shown, and for the expected position of the audience.
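To put a number on that scaling, here is the arithmetic, with made-up but plausible values:

    #include <cstdio>

    int main() {
        // Disparity baked into a stereo video scales with the screen it is
        // shown on. Suppose footage was produced for a 0.5m-wide screen with
        // a 0.06m eye separation, then shown on a 5m-wide theater screen:
        double producedScreenWidth = 0.5, shownScreenWidth = 5.0;
        double bakedEyeSeparation = 0.06;
        double effectiveSeparation =
            bakedEyeSeparation * (shownScreenWidth / producedScreenWidth);
        // Points at infinity end up 0.6m apart on-screen, roughly ten times
        // any viewer's real eye separation, so the eyes would have to diverge.
        std::printf("effective eye separation: %.2f m\n", effectiveSeparation);
        return 0;
    }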

The other issue is that virtual objects very close to the viewer will appear blurry. This is because when the brain perceives an object to be at a certain distance, it will tell the eyes to focus their lenses to that distance (a process called accommodation). But in stereoscopic imaging, the light reaching the viewer’s eyes from close-by virtual objects will still come from the actual screen, which is much farther away, and so the eyes will focus on the wrong plane, and the entire image will appear blurry.

Unfortunately, there’s nothing we can do about that right now, but at least it’s a rather subtle effect. In our CAVE, users standing in the center can see virtual objects floating only a few inches in front of their eyes quite clearly, even though the walls, i.e., the actual screens, are four feet away. This focus miscue does have a noticeable after-effect: after having used the CAVE for an extended period of time, say a few hours, the real world will look somewhat “off,” in a way that’s hard to describe, for a few minutes after stepping out. But this appears to be only a temporary effect.

Taking it back to the real 3D movie world: the physical analogy to skewed-frustum stereo is lens shift. Instead of rotating the two cameras inwards, one has to shift their lenses inwards. The amount of shift is, again, determined by the distance to the desired stereo-focus plane. Technically, creating lens-shift stereo cameras should be feasible (after all, lens shift photography is all the rage these days), so everybody should be using them. And some 3D movie makers might very well already do that — I’m not a part of that crowd, but from what I hear, at least some don’t.

In the 3D graphics world, where cameras are entirely virtual, it should be even easier to do stereo right. However, many graphics applications use the standard camera model (focus point, viewing direction, up vector, field-of-view), and can only represent non-skewed frusta. The fact that this camera model, as commonly implemented, does not support proper stereo is just another reason why it shouldn't be used.

So here’s the bottom line: Toe-in stereo is only a rough approximation of correct stereo, and it should not be used. If you find yourself wondering how to specify the toe-in angle in your favorite graphics software, hold it right there, you’re doing it wrong. The fact that toe-in stereo is still used — and seemingly widely used — could explain the eye strain and discomfort large numbers of people report with 3D movies and stereoscopic 3D graphics. Real 3D movie cameras should use lens shift, and virtual stereoscopic cameras should use skewed frusta, aka off-axis projection. While the standard 3D graphics camera model can be generalized to support skewed frusta, why not just replace it with a model that can do it without additional thought, and is more flexible and more generally applicable to boot?

Update: With the Oculus Rift in developers’ hands now, I’m getting a lot of questions about whether this article applies to head-mounted displays in general, and the Rift specifically. Short answer: it does. There isn’t any fundamental difference between large screens far away from the viewer, and small screens right in front of the viewer’s eyes. The latter add a wrinkle because they necessarily need to involve lenses and their concomitant distortions so that viewers are able to focus on the screens, but the principle remains the same. One important difference is that small screens close to the viewer’s eyes are more sensitive to miscalibration, so doing stereo right is, if anything, even more important than on large-screen displays. And yes, the official Oculus Rift software does use off-axis projection, even though the SDK documentation flat-out denies it.

Standard camera model considered harmful

With apologies to Edsger W. Dijkstra (and pretty much everyone else).

So what’s wrong with the canonical 3D graphics camera model? To recap, in this model a camera is defined by a focus point (the “eye”), a viewing direction, an “up” vector, a screen aspect ratio, and a field-of-view angle (“fov”) (see Figure 1). Throw in a near- and far-plane distance, and together these parameters uniquely define a viewing frustum in model space, and hence the modelview and projection matrices required to render a view of a 3D scene. Nothing wrong with that, per se.

Figure 1: The standard 3D graphics camera model, defined by a focus point position, viewing direction, “up” vector, and screen aspect ratio (ratio of screen width to screen height, not shown in diagram).
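Concretely, here is how the frustum falls out of these parameters (a minimal sketch with made-up values). Note that the field-of-view/aspect parametrization can only ever produce a symmetric, on-axis frustum:

    #include <cmath>
    #include <cstdio>

    int main() {
        const double pi = 3.14159265358979323846;
        double fovDegrees = 60.0;    // vertical field of view
        double aspect = 16.0 / 9.0;  // screen width / height
        double zNear = 0.1;

        double top = zNear * std::tan(fovDegrees * 0.5 * pi / 180.0);
        double bottom = -top;        // forced to mirror top
        double right = top * aspect;
        double left = -right;        // forced to mirror right
        // The frustum is always centered on the viewing direction; the
        // horizontal skew that off-axis stereo needs simply cannot be
        // expressed with these parameters.
        std::printf("l=%+.4f r=%+.4f b=%+.4f t=%+.4f\n",
                    left, right, bottom, top);
        return 0;
    }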

The problem arises when this same camera model is applied to (semi-) immersive environments, such as when one wants to adapt an existing graphics package or game engine to, say, a 3D TV or a head-mounted display with only minimal changes. There are two main problems with that: for one, the standard camera model does not support proper stereo image generation, leading to 3D vision problems, eye strain, and discomfort (but that’s a topic for another post).

The problem I want to discuss here is the implicit link between the camera model and viewpoint navigation. In this context, viewpoint navigation is the mechanism by which a 3D graphics application represents the viewer moving through the virtual 3D environment. For example, in a typical first-person video game, the player will be represented in the game world as some kind of avatar, and the camera model will be attached to that avatar’s head. The game engine will provide some mechanism for the player to directly control the position and viewing direction of the avatar, and therefore the camera, in the virtual world (this could be the just-as-canonical WASD+mouse navigation metaphor, or something else). But no matter the details, the bottom line is that the game engine is always in complete control of the player avatar’s — and the camera’s — position and orientation in the game world.

But in immersive display environments, where the user’s head is tracked inside some tracking volume, this is no longer true. In such environments, there are two ways to change the camera position: using the normal navigation metaphor, or simply physically moving inside the tracking volume. Obviously, the game engine has no way to control the latter (short of tethers or an electric shock collar). The problem is that the standard camera model, or rather its implementation in common graphics engines, does not account for this separation.

This is not really a technical problem, as there are ways to manipulate the camera model to yield the correct end result, but it's a thinking problem. Having to think inside this ill-fitting camera model causes real headaches for developers, who will be running into lots of detail problems and edge cases when trying to implement correct camera behavior in immersive environments.

My preferred approach is to use an entirely different way to think about navigation in virtual worlds, and the camera model that directly follows from it. In this paradigm, a display system is not represented by a virtual camera, but by a physical display environment consisting of a collection of screens and viewers. For example, a typical desktop system would consist of a single screen (the actual monitor), and a single viewer (the user) sitting in a fixed position in front of it (fixed because typical desktop systems have no way to detect the viewer's actual position). At the other extreme, a CAVE environment would consist of four to six large screens in fixed positions forming a cube, and a viewer whose position and orientation is measured in real time by a head tracking system (see Figure 2). The beauty is that this simple environment model is extremely flexible; it can support any number of screens in arbitrary positions, including moving screens, and any number of fixed or tracked viewers. It can support non-rectangular screens without problems (but that's a topic for another post), and non-flat screens can be tessellated to a desired precision. So far, I have not found a single concrete display environment that cannot be described by this model, or at least approximated to arbitrary precision.

Figure 2: Photo of a CAVE environment consisting of four screens (three walls and one floor) and one viewer (in this case a camera on a tripod). Note how the image of the 3D protein model in the CAVE spans all four screens, but still appears seamless to the camera.

In more detail, a screen is defined by its position, orientation, width, and height (let’s ignore non-rectangular screens for now). A viewer, on the other hand, is solely defined by the position of its two eyes (two eyes instead of one to support proper stereo; sorry, spiders and Martians need not apply). All screens and viewers forming one display environment are defined in the same coordinate system, called physical space because it refers to real-world entities, namely display screens and users.

How does this environment model affect navigation? Instead of moving a virtual camera through virtual space, navigation now moves an entire environment, i.e., a collection of any number of screens and viewers, through the virtual space, still under complete program control (fans of Dr Who are free to call it the “Tardis model”). Additionally, any viewers can freely move through the environment, at least if they’re head-tracked, and this part is not under program control. From a mathematical point of view, this means that viewers can freely walk through physical space, whereas physical space as a whole is mapped into the virtual world by the graphics engine, effected by the so-called navigation transformation.
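In code, the whole model is small enough to sketch directly. The types and names below are my own shorthand, not Vrui's actual API:

    #include <vector>

    struct Vec3 { double x, y, z; };

    // A screen in physical space: position, orientation, and size.
    struct Screen {
        Vec3 origin;        // e.g., lower-left corner, in physical coordinates
        Vec3 xAxis, yAxis;  // unit vectors along the screen's edges
        double width, height;
    };

    // A viewer is just two eye positions, also in physical space, updated
    // continuously by head tracking (or assumed fixed, on a desktop).
    struct Viewer {
        Vec3 leftEye, rightEye;
    };

    // 4x4 homogeneous transform; maps physical space into the virtual world.
    struct Transform { double m[4][4]; };

    struct Environment {
        std::vector<Screen> screens;  // 1 for a desktop, 4-6 for a CAVE
        std::vector<Viewer> viewers;
        Transform navigation;         // the one thing under program control
    };

    int main() { return 0; }

Navigation metaphors only ever modify the single navigation transform, moving all screens and viewers through the virtual world together, while head tracking updates the eye positions independently of the program.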

At first glance, it seems that this model does nothing but add another intermediate coordinate system and is therefore superfluous (and mathematically speaking, that’s true), but in my experience, this model makes it a lot more straightforward to think about navigation and user motion in immersive environments, and therefore makes it easier to develop novel and correct navigation metaphors that work in all circumstances. The fact that it treats all possible display environments from desktops to HMDs and CAVEs in a unified way is just a welcome bonus.

The really neat effect of this environment model is that it directly implies a camera model as well (see Figure 3). Using the standard model, it is quite a tricky prospect to maintain the collection of virtual cameras that are required to render to a multi-screen environment, and ensure that they correspond to desired viewpoint changes in the virtual world, and the viewer’s motion inside the display environment. Using the viewer/screen model, there is no extra camera model to maintain. It turns out that a viewing frustum is also uniquely identified by a combination of a flat (rectangular) screen, a focus point position, and near- and far-plane distances. However, the first two components are directly provided by the environment model, and the latter two parameters can be chosen more or less arbitrarily. As a result, the screen/viewer model has no free parameters to define a viewing frustum besides the two plane distances, and the resulting viewing frusta will always lead to correct projections, which are also automatically seamless across multiple screens (but that’s a topic for another post).

Figure 3: The screen/viewer camera model defined by the position, orientation, and size of a screen, and the position of a focus point, in some 3D coordinate system. Besides the near- and far-plane distances, the model has no free parameters: everything else can be measured directly via calibration and head tracking.

Looking at it mathematically again, one screen/viewer pair uniquely defines a viewing frustum in the physical coordinate space in which the screen and viewer are defined, and hence a modelview and a projection matrix. Now, the mapping from physical space to virtual world space is typically also expressed as a matrix, meaning that this model really just adds yet another modelview matrix. And since the product of two matrices is a matrix, it boils down to the same projection pipeline as the standard model. As I mentioned earlier, the model does not introduce any new capabilities, it just makes it easier to think about the existing capabilities.

So the bottom line is that the viewer/screen model makes it simpler to reason about program-controlled navigation, completely removes the need for an explicit camera model and the extra work required to keep it consistent with the display environment, and — if the display environment was measured properly — automatically leads to distortion-free and seamless images even across multiple screens, and to always correct and eye strain-free stereo displays.

Although this model stems from the immersive environment world, applying it in the desktop realm has immediate practical benefits. For one, it supports proper stereo without extra work from the application developer; additionally, it supports flexible multi-display configurations where users can put their displays however they like, and get correct and seamless images without special application support. It even provides correct desktop head-tracking for free. Sounds like a win-win to me.