How head tracking makes holographic displays

I’ve talked about “holographic displays” a lot, most recently in my analysis of the upcoming zSpace display. What I haven’t talked about is how exactly such holographic displays work, what makes them “holographic” as opposed to just stereoscopic, and why that is a big deal.

Teaser: A user interacting with a virtual object inside a holographic display.

Before we start, here’s why it’s a big deal to answer these questions. With “3D” movies and displays having become a thing, many more people have been exposed to stereoscopic imagery, and the response is not always positive — there’s even a backlash beginning to form. That’s a problem for people like me, because non-experts don’t know that there’s a fundamental difference between “3D” movies and immersive visualization / virtual reality. Whereas the former have serious issues, the latter just works. Here’s why.

In a nutshell, the difference between stereoscopic and holographic displays is that the former add depth perception to normal 2D images, whereas the latter present virtual objects exactly as if they were real. If done correctly, the illusion is so good that there is absolutely no effort required by the viewer: no need to squint or cross your eyes. The illusion is so good, in fact, that it’s impossible not to be fooled. And why is that important? Once we get our brain to accept virtual objects as real, we can apply interaction techniques that we’ve used all our lives to do actual work (at least work involving 3D data or objects) much more efficiently than otherwise possible. In other words, we can cut through several layers of artificial and indirect user interfaces. Instead of having to worry about how to, say, rotate an object 90° around the X axis, users can just do it by picking it up with their hands, and worry about important things instead.

The basic observation when looking at a real 3D object is this: when you look at it from a different position, it looks different. This fact is so deeply built into our visual system that if an object doesn’t look different when we change position, our brains refuse to believe it’s real. And that is the basic problem with stereoscopic displays: because they present the same view of virtual objects no matter where you look at them from, they’re not convincing. It’s somewhat OK if viewers don’t move, like when they’re sitting in a movie theater seat, but it breaks down pretty much anywhere else, because humans move all the time, without really noticing it consciously. But our brains notice.

So that’s where head tracking comes in. With it, the computer knows at all times where the viewer’s eyes are located in 3D space (let’s only talk about single viewers for now). The software can then update its display of virtual objects based on the viewer’s eye positions, so that objects do look different as the viewer moves around. But that’s not the whole trick; after all, every 3D graphics program has some built-in way to change the view of displayed objects — what’s the difference with head tracking?

The difference is that with head tracking, the views of virtual objects don’t just change in some random way, but they change in precisely the same way as the views of identical real objects would. Pulling this off, however, requires some extra work. To understand it, we need to look at how our eyes work, and do some basic geometry. Figure 1 is a top-down view of a pair of disembodied eyeballs looking at a real object. The principle is very simple: light from the object enters the eyes through their pupils and lenses, and excites nerve endings on the retina in the back of the eyeballs. Nerve impulses travel along the optic nerve to the brain, where the visual cortex takes the input from both eyes, and reconstructs a mental 3D model of the observed objects.

Figure 1: Top-down view of a viewer’s two eyes observing a real object. Light from the object enters the eyes and hits the retinas. Impulses travel along the optic nerve to the brain, where the visual cortex creates a mental 3D model based on the two retinal imprints.

Now let’s take those light rays entering the eyes, and extend them backwards, through the real 3D object. Then let’s place a screen somewhere behind the object, and let’s create a pair of images (one for the left eye, one for the right eye) on the screen that exactly match the light travelling “backwards” from the eyes through the object (see Figure 2). Because light travels in a straight line, creating those images is straightforward if we know the exact positions of both eyes, the real object, and the screen (keep that in mind for later).

Figure 2: The light rays entering the eyes are reversed, and create a pair of stereoscopic images on a screen behind the real object.
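
To make the "extend the rays backwards" step concrete, here is a minimal Python sketch that intersects the ray from an eye through an object point with the screen plane. The function name, the coordinate layout (screen in the plane z = 0, viewer on the +z side) and all numbers are assumptions chosen for illustration, not values from any particular display.

```python
import numpy as np

def project_to_screen(eye, point, screen_origin, screen_normal):
    """Intersect the ray from `eye` through `point` with the screen plane.

    The plane is given by any point on it (`screen_origin`) and its normal.
    Returns the 3D position on the screen where the point must be drawn.
    """
    eye = np.asarray(eye, dtype=float)
    point = np.asarray(point, dtype=float)
    origin = np.asarray(screen_origin, dtype=float)
    normal = np.asarray(screen_normal, dtype=float)

    direction = point - eye
    denom = np.dot(normal, direction)
    if abs(denom) < 1e-12:
        raise ValueError("ray is parallel to the screen plane")
    t = np.dot(normal, origin - eye) / denom
    if t <= 0:
        raise ValueError("screen is not in front of the eye along this ray")
    return eye + t * direction

# Assumed setup: viewer 0.6 m in front of the screen, eyes 6.4 cm apart,
# object point floating 0.2 m in front of the screen center.
left_eye = (-0.032, 0.0, 0.6)
right_eye = (0.032, 0.0, 0.6)
obj = (0.0, 0.0, 0.2)

left_image = project_to_screen(left_eye, obj, (0, 0, 0), (0, 0, 1))
right_image = project_to_screen(right_eye, obj, (0, 0, 0), (0, 0, 1))
print(left_image, right_image)
```

In this setup the left eye's image lands to the right of the right eye's image, which is what you would expect for an object floating in front of the screen.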

Finally, let’s remove the real object (see Figure 3). Now, very carefully compare Figures 1 and 3. Although the real object is gone, and the light hitting the eyes now comes from the screen instead, the light impressions on the viewer’s retinas are exactly the same as before. This means that, as far as the viewer is concerned, what is now a pair of images on a screen looks exactly like a real object; in other words, for the viewer, the real object is still there. And that’s why holographic displays just work: because there is no way for the viewer to distinguish the virtual images from real objects, there is no way not to be fooled by the illusion.

Figure 3: The real object is removed, and the light entering the eyes now comes from the screen. Because the retinal imprints from the screen are exactly the same as those from the real object, the brain’s visual cortex still perceives the same object.

Let’s get into the technical details a bit more. As I stated above, to pull off this illusion, the software needs to know the position of the viewer’s eyes, the position (and size and orientation) of the screen, and the position of the virtual object. Based on these data, the software can set up a virtual camera that will create exactly the required images. Of course, if the software is stuck using the standard 3D graphics camera model, that’s still pretty hard. But with the viewer-screen camera model, viewer position and screen layout are all that’s needed.
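
As a rough illustration of what a viewer-screen camera has to compute, the sketch below derives an off-axis (asymmetric) viewing frustum from an eye position and three screen corners. This is one common formulation, sometimes called generalized perspective projection; it is not necessarily how any specific toolkit implements it, and the function and parameter names are hypothetical. The four returned extents are the kind of values a glFrustum-style projection takes; the matching view rotation and translation are omitted here.

```python
import numpy as np

def off_axis_frustum(eye, lower_left, lower_right, upper_left, near):
    """Return (left, right, bottom, top) frustum extents at the near plane.

    All positions must be in the same coordinate system. The screen is
    described by three of its corners; `near` is the near-plane distance.
    """
    pe = np.asarray(eye, dtype=float)
    pa = np.asarray(lower_left, dtype=float)
    pb = np.asarray(lower_right, dtype=float)
    pc = np.asarray(upper_left, dtype=float)

    # Orthonormal screen basis: right, up, and normal (pointing at the viewer).
    vr = (pb - pa) / np.linalg.norm(pb - pa)
    vu = (pc - pa) / np.linalg.norm(pc - pa)
    vn = np.cross(vr, vu)
    vn /= np.linalg.norm(vn)

    # Vectors from the eye to the screen corners.
    va, vb, vc = pa - pe, pb - pe, pc - pe

    # Perpendicular distance from the eye to the screen plane.
    dist = -np.dot(va, vn)
    if dist <= 0:
        raise ValueError("eye must be in front of the screen")

    scale = near / dist
    left = float(np.dot(vr, va) * scale)
    right = float(np.dot(vr, vb) * scale)
    bottom = float(np.dot(vu, va) * scale)
    top = float(np.dot(vu, vc) * scale)
    return left, right, bottom, top

# Assumed 0.5 m x 0.3 m screen centered at the origin in the plane z = 0.
corners = ((-0.25, -0.15, 0.0), (0.25, -0.15, 0.0), (-0.25, 0.15, 0.0))
centered = off_axis_frustum((0.0, 0.0, 0.6), *corners, near=0.1)
shifted = off_axis_frustum((0.1, 0.0, 0.6), *corners, near=0.1)
print(centered)
print(shifted)
```

With the eye centered, the frustum is symmetric; once the eye moves sideways, the left and right extents become unequal while their difference stays the same, because the screen itself has not changed size.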

The last important point is in the details. This should go without saying, but of course the software needs to know the positions of all relevant things in the same coordinate system. That poses a problem, because normally display systems are not calibrated. Users don’t know exactly how big their screens are, let alone exactly where they are positioned and how they are oriented, and if they have head tracking systems, those will work in their own coordinate systems. Meaning that building a holographic display takes very careful global calibration, something that’s not only difficult, but also requires special tools. That’s why a pre-integrated, factory-calibrated display like the zSpace is such a big deal: it already knows all the information that’s required to create the perfect illusion; the software only has to provide the virtual objects. It’s the first plug-and-play holographic display.
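
To show what "the same coordinate system" means in practice, here is a small sketch that maps a head position reported in a tracker's own coordinates into screen coordinates using a rigid transform. The calibration values (a 90° rotation about the vertical axis and a 0.3 m offset) are entirely made up for the example; finding the real ones is exactly the hard calibration step described above.

```python
import numpy as np

def make_rigid_transform(rotation, translation):
    """Build a 4x4 homogeneous matrix from a 3x3 rotation and a translation."""
    m = np.eye(4)
    m[:3, :3] = np.asarray(rotation, dtype=float)
    m[:3, 3] = np.asarray(translation, dtype=float)
    return m

def transform_point(matrix, point):
    """Apply a 4x4 homogeneous transform to a 3D point."""
    homogeneous = np.append(np.asarray(point, dtype=float), 1.0)
    return (matrix @ homogeneous)[:3]

# Hypothetical calibration result: the tracker is rotated 90 degrees about
# the vertical (y) axis relative to the screen, and its origin sits 0.3 m
# above the screen center.
rot_y_90 = np.array([[0.0, 0.0, 1.0],
                     [0.0, 1.0, 0.0],
                     [-1.0, 0.0, 0.0]])
tracker_to_screen = make_rigid_transform(rot_y_90, [0.0, 0.3, 0.0])

# Head position as the tracker reports it, in the tracker's own coordinates.
head_in_tracker = [-0.6, -0.3, 0.0]
head_in_screen = transform_point(tracker_to_screen, head_in_tracker)
print(head_in_screen)
```

Only after this step can eye positions and screen corners be fed into the same projection math; an error in the transform shows up directly as virtual objects that swim or distort when the viewer moves.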

Now that the principle is clear, everything else follows. If the viewer moves but the virtual objects stay the same, the images on the screen move as well. There is a simple analogy to explain their motion: pretend that the viewer’s eyes are small flashlights. Then light is actually coming from the eyes and hitting the objects, and what we see on the screens are exactly the shadows of the objects. As the flashlights move, the shadows move in exactly the way real shadows would.
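
The flashlight analogy can be worked out with similar triangles in one dimension. The setup is an assumption made for illustration: the screen lies in the plane z = 0, the eye is at lateral position e and distance D from the screen, and the virtual point sits straight in front of the screen center at distance d from the screen (with d < D, and d negative for points behind the screen). The shadow lands on the screen at lateral position s:

```latex
\[
  s \;=\; e + \frac{D}{D - d}\,(0 - e) \;=\; -\frac{d}{D - d}\, e
\]
```

For a point in front of the screen (d > 0) the image moves opposite to the eye; for a point behind the screen (d < 0) it moves in the same direction as the eye; and a point exactly on the screen (d = 0) does not move at all. With D = 0.6 m and d = 0.2 m, for example, the image shifts by half the eye's displacement, in the opposite direction.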

Here’s a slide show I made a few years ago trying to explain the same thing, using the same diagrams (and I’m also talking about the difference between head-mounted displays and CAVEs towards the end):

So what happens if there is no screen behind the virtual object, to be a source of light hitting the eyes? Simple: the virtual object gets cut off. This is a very disturbing effect, because apparently an object gets occluded by something that’s behind it. Our brains have a really hard time believing that, which is why it has to be avoided at all costs. This is why virtual reality environments like CAVEs have multiple very large screens. The screens are not used to actually show images; they are simply there so that there is always some screen area behind any virtual object that the viewer might be looking at. Interestingly, the viewer-screen camera model works for multiple screens without any extra effort; the resulting images will automatically be seamless.
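
The cut-off condition can be checked directly: a virtual point is visible only if the ray from the eye through the point lands inside a screen rectangle. The sketch below tests this for a single screen; the function name and the screen description (one corner plus two edge vectors) are assumptions for the example.

```python
import numpy as np

def hits_screen(eye, point, corner, right_edge, up_edge):
    """Return True if the ray from `eye` through `point` hits the screen.

    The screen is the rectangle spanned by `right_edge` and `up_edge`
    starting at `corner`; the two edges are assumed to be perpendicular.
    """
    eye = np.asarray(eye, dtype=float)
    point = np.asarray(point, dtype=float)
    corner = np.asarray(corner, dtype=float)
    right_edge = np.asarray(right_edge, dtype=float)
    up_edge = np.asarray(up_edge, dtype=float)

    normal = np.cross(right_edge, up_edge)
    direction = point - eye
    denom = np.dot(normal, direction)
    if abs(denom) < 1e-12:
        return False  # ray runs parallel to the screen plane
    t = np.dot(normal, corner - eye) / denom
    if t <= 0:
        return False  # screen plane is behind the eye along this ray

    # Express the hit position as fractions along the two screen edges.
    hit = eye + t * direction - corner
    u = np.dot(hit, right_edge) / np.dot(right_edge, right_edge)
    v = np.dot(hit, up_edge) / np.dot(up_edge, up_edge)
    return bool(0.0 <= u <= 1.0 and 0.0 <= v <= 1.0)

# Assumed 0.5 m x 0.3 m desktop screen in the plane z = 0, eye 0.6 m away.
screen = ((-0.25, -0.15, 0.0), (0.5, 0.0, 0.0), (0.0, 0.3, 0.0))
eye = (0.0, 0.0, 0.6)

centered_ok = hits_screen(eye, (0.0, 0.0, 0.2), *screen)
cut_off = hits_screen(eye, (0.2, 0.0, 0.3), *screen)
print(centered_ok, cut_off)
```

The second point is within the screen's width (0.2 m off center on a screen that extends to 0.25 m), but it floats so far in front that its image would fall past the screen edge, so it gets cut off. With several screens, as in a CAVE, the same test would simply be repeated per screen.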

The cut-off effect is, however, a problem for small-screen holographic displays like the zSpace. And, going against many people’s expectations, the same problem applies to “real” laser-based holograms. They also need a screen behind them to be visible. There is no such thing as a holographic projector, like the “help us, Obi-Wan Kenobi” kind!

Here is a video showing a CAVE, a large-scale holographic display. In this case, it was the video camera that was head-tracked, which is why the virtual globe looks almost as real as the real person. Pay special attention to the edges and corners of the CAVE (which is basically a big cube): although two or three screens come together at right angles, the globe still looks seamless.

Now a confession: I lied above. The light coming from the screen does not look exactly like the light from a real object would. The one thing that head-tracked stereoscopic holographic displays cannot create — and what distinguishes them from laser-based holograms — is focus. Our eyes have to focus differently for light coming from different distances. In a holographic display, while the light appears to come from a virtual object, it still really comes from a screen somewhere behind the object. Because our eyes are otherwise fooled by the illusion, they will automatically focus on the position of the virtual object, causing the object to appear slightly blurry (since the light is actually coming from somewhere else). Fortunately, this is a subtle effect that most people don’t notice until it’s pointed out, but it is a difference, and it’s even a concern for its potential (but unconfirmed) long-term effects.
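
To get a feel for the size of the focus difference, one can compare the distance the light really comes from (the screen) with the distance it appears to come from (the virtual object), expressed in diopters, which are simply inverse meters. This is a simplified sketch: the function name and the example distances are assumptions, and it says nothing about how a given viewer's eyes will actually respond.

```python
def focus_mismatch_diopters(eye_to_screen_m, eye_to_object_m):
    """Difference between screen focus and apparent object focus, in diopters.

    Both distances are measured from the eye, in meters, and must be positive.
    """
    if eye_to_screen_m <= 0 or eye_to_object_m <= 0:
        raise ValueError("distances must be positive")
    return abs(1.0 / eye_to_object_m - 1.0 / eye_to_screen_m)

# Assumed viewing distance of 0.6 m to the screen.
on_screen = focus_mismatch_diopters(0.6, 0.6)    # object in the screen plane
pop_out = focus_mismatch_diopters(0.6, 0.4)      # object 0.2 m in front
far_behind = focus_mismatch_diopters(0.6, 3.0)   # object 2.4 m behind
print(on_screen, pop_out, far_behind)
```

The mismatch is zero for objects in the screen plane and grows as objects move away from it in either direction, which suggests keeping important content close to the screen plane where possible.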

Oh, and cool story: Why is the CAVE called CAVE? “CAVE automated virtual environment” is obviously a (recursive) backronym, so why CAVE? I think it works at three levels: for one, standing in a CAVE feels somewhat like standing in a cave. OK, weak. Second, there’s Plato’s Allegory of the Cave. Basically, it says that we perceive reality not directly, but only through its impression on our senses (or, as Plato put it, through its shadows on a cave wall). That’s pretty deep. But here’s the joke: in a CAVE, that’s literally true. As I explained above, head-tracked stereoscopic holographic displays present objects through their shadows on one or more CAVE walls, projected from light sources in the viewer’s eyes. Whoa.