The VR software gap

When it comes to VR in the public’s mind, it’s all about the hardware. And that’s understandable, when there’s all that new and shiny tech out there: the Oculus Rift, the Leap Motion Leap, the Razer Hydra, the newly-announced Sony HMZ-T2 (couldn’t find an actual Sony link), you name it. But with that comes the unstated assumption that the hardware is all you need, that if you just buy the gadget, it will somehow work on its own. And at least when it comes to VR, that’s simply not the truth. Without proper software running it, the gadget is nothing but a glorified paperweight (of course the reverse is just as true, but I’m a software guy, so there).

The emphasis here is on proper software. Because all that shiny tech that came out in the past, and that nobody remembers (or tries to purge from their minds — remember the Virtual Boy? You’re welcome!), it all came with software. Just not software anybody was willing to use.

Which is why I was delighted to read this recent interview with Valve Software’s Michael Abrash. Here’s a guy who gets it:

So first, I’ll tell you what’s necessary for VR to work well. For VR to work well, you need display technology that gives you an image both your brain and eye are happy with. Trust me, that’s much harder than you think. Even if it was just a HUD, people wouldn’t be that happy, because you’re always moving. Your head is never still. And this is moving relative to the world, and if your brain is trying to fuse it, that can be rather tiresome. I’ll tell you there are lots of issues with getting that image up in front of you.

I couldn’t agree more. Here’s what I have to add to this statement: as a game developer, Mr. Abrash should not have to worry about it in the first place. Should game developers worry about implementing the projective geometry and arithmetic necessary to turn triangles forming a 3D world into pixels on a screen in massively-parallel special-purpose silicon? No, that’s what OpenGL or Direct3D are for. Should game developers worry about how to scan the hardware of a keyboard to read key presses, or how to safely send data packets across a heterogeneous network of interconnected computers? No, that’s what the operating system is for.

Along the same lines, someone else should have to worry about how to properly display a 3D virtual world on a head-mounted display so that it looks correct and doesn’t cause eye strain or nausea, because that’s really hard, and really important. And while the Michael Abrashes and John Carmacks of this world can surely do it, others will get it wrong. I know that because others have been getting it wrong, for going on twenty years now. And it’s the wrong approaches that sour people on the whole VR idea.

But the problem is that, at this time, game developers still have to worry about it, because there is no equivalent of OpenGL for VR yet, in the sense of a widely-accepted, industry-standard toolkit that has all the functionality required to build successful applications on top of it. Now, there are plenty of VR toolkits out there — and, for disclosure, I created one of them — but none fulfill these criteria. Let’s talk about that.

Like other support or middleware software, VR toolkits can work at several levels of abstraction. I’m going to use “standard” 3D graphics toolkits as analogies here, assuming that those who have read this far know about such things.

At the low level, we have things that are the equivalent of OpenGL itself, or the glut windowing toolkit built on top of it. These are things that give you minimum abstractions, and offload any higher-level functionality to individual applications. Take glut: it will open a display window for you (no easy task in itself), and allow you to query the mouse, but if you want to use the mouse to rotate your 3D scene in the window, you’re on your own, pal. Result: glut developers roll their own navigation interfaces, and if their actual goal is something besides navigation, and they just do the bare minimum, the results usually suck hard.
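
To make “low-level” concrete, here is roughly what the bare minimum looks like in glut. This is a sketch only, with made-up names and an empty scene: you get a 3D-capable double-buffered window and raw mouse coordinates, and everything beyond that, from navigation to interaction metaphors, is entirely your problem.

    // Minimal glut skeleton: a window and raw mouse events, nothing more.
    #include <GL/glut.h>

    static int lastX = 0, lastY = 0; // last mouse position, in window pixels

    void display(void)
    {
        glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT);
        // ... draw your 3D scene here; any notion of a "camera" is your problem ...
        glutSwapBuffers();
    }

    void motion(int x, int y)
    {
        // glut hands you raw pixel coordinates; turning mouse drags into, say, a
        // trackball rotation of the scene is left entirely to the application.
        lastX = x;
        lastY = y;
        glutPostRedisplay();
    }

    int main(int argc, char* argv[])
    {
        glutInit(&argc, argv);
        glutInitDisplayMode(GLUT_DOUBLE | GLUT_RGB | GLUT_DEPTH);
        glutCreateWindow("Bare-bones 3D window");
        glutDisplayFunc(display);
        glutMotionFunc(motion);
        glutMainLoop();
        return 0;
    }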

The equivalent to glut in the VR world would be a toolkit that opens windows for you and sets them up to do proper stereo, and gives you abstract input devices, typically represented as 4×4 homogeneous matrices. If you want to use those input devices to do anything, let’s hope you really grok projective geometry. The results are often, let’s say, somewhat glut-ish in nature.

The canonical example of a low-level VR toolkit is the cavelib, but there are many others. I want to mention one other, because I might catch flak for it: VR Juggler. Now I haven’t looked at version 3.0 yet, but in the VR Juggler I know, the above abstractions are what you get. There is a lot of work going on under the hood, with clever ways of dynamically managing input devices and displays etc., but in the end what you get is a number of set-up-for-3D windows, and a bunch of matrices. Everything else is up to you. Don’t get me wrong: I’m not saying that these toolkits are bad, I’m merely saying that they’re low-level. If low-level is what you want or need, these are for you.

On the other extreme, there are high-level toolkits (who’da thunk?). These are basically content creation and management engines, equivalent to things like commercial or open-source game engines, think id Tech, Unreal Engine, Ogre, Horde 3D, etc. These are very powerful and easy to use — at least relative to their feature sets — but they are written with a particular purpose in mind. You could probably tweak the Unreal Engine to do 3D visualization of volumetric data, but you’d really be better off not doing it.

The only high-level VR toolkit I know by more than just name is WorldViz, but I think it’s pretty canonical. It’s very easy to put together a 3D world with it, and show it on a wide variety of VR display devices, but if you have more specific needs, it will be so much harder to punch through the high abstraction layer to get to the guts you need to get to.

A quick secondary analogy: low-level is like raw X11, high level is like a certain office software suite, and the middle level is like gtk+ or Qt. You can see where I’m going: nobody has been writing apps in raw X11 for twenty years (with very good reason), and the really exciting part is in the middle, because developing apps in that unnamed office software suite is for code monkeys (that was a joke).

I haven’t seen many medium-level VR toolkits. In the non-VR world, scene graph toolkits like OpenSceneGraph or OpenSG would qualify for this level, but while there exist some VR embeddings of these toolkits, those are not quite standard, and — I believe — are still lacking in the input department.

It was this lack of medium-level software that led me to start my own VR toolkit back in the day. There’s much to be said about that, but it’s a topic for another post. For now, I just want to mention that what separates it from low-level are its built-in 3D interaction metaphors, such as navigation. If you want to rotate your scene with the mouse, you don’t have to reinvent the wheel. But if you do really want to make your own navigation metaphor, there’s an “official” way to do so — and that’s what separates it from high-level toolkits.

But back on topic. Why do I insist that game developers use VR middleware, instead of working on the bare metal themselves? I already mentioned that there’s the danger of getting it wrong, and having middleware that does it right prevents that, but there’s another reason that holds even if all game developers do it right.

Games have been rolling their own user interfaces since day 1, and there’s a certain appeal to having vastly different looking interfaces in different games that fit with the visual style of each game, but here’s the thing: take away the skin, and they all work the same. You don’t have to read the manual to know how to navigate a game’s menu (or if you do, you should ask for your money back), and if you play certain genres, say first-person shooters, you know that they all use WASD+mouse, so you’re right at home. But imagine if games used functionally different interfaces. Simple example: imagine half of FPS games looking up when you push the mouse forward, and the other half looking down, and there being no way to change that. Now imagine you’re really good at one, and try the other. You’ll love it.

And that’s a problem in VR, because the number of potential ways of doing the same thing, multiplied by the number of fundamentally different input devices (gamepad? Wiimote? data glove? Kinect? else?) would lead to an explosion of mutually incompatible choices. Using a common middleware, which is based on tested and working interaction metaphors, and allows users to pick their own favorite metaphors out of a large pool and use them across all applications, would really help here.

To break up this Wall of Text, I’m going to throw in two related videos. Both show VR “games,” with somewhat different ways of incorporating the players’ bodies into the action. The first one is a pretty straight-up FPS. It’s decidedly old-school, being based on maps and models from 1997’s Descent (best game ever!), so look past the dated graphics and observe the seamless integration of the player, and the physical user interface, particularly the aiming. Keep in mind the catch-22 of VR: in order to film this, the user can’t see properly, which is why my aim is so poor. If done for real, it’s much better. Please watch both halves, the second (starting at 2:18) makes it clearer what’s going on in the first:

The second video also shows an FPS, at least on the surface, being based on maps from Doom 3. But the video is not only a lot more whimsical (feel free to roll your eyes), it also doesn’t feature shooting, and shows a larger variety of bodily interactions, including being able to draw free-handedly in 3D space for an ad-hoc noobs’ round of tic-tac-toe. It’s not a game, as it’s meant to show remote collaboration with virtual holograms, but it was a blast nonetheless, some hardware trouble notwithstanding:

Back to middleware, and the final issue: think configuring a desktop PC game is bad? With all the drivers and graphics options and knobs and twiddles, and FAQs on the web on how to get it running in wide-screen etc.? VR is a hundred times worse, and on top of that, if you get it slightly wrong, it will make you sick. Now imagine you’ve just set it up perfectly in one game, and have to do it all over again for the next game, and the knobs and dials you have to twiddle are completely different. If there’s common middleware, you only have to do it once. Wouldn’t it be nice if same-genre games of today, like FPSs, at least shared their mouse and keyboard settings? Or if you tell one to run at 1920×1080, the next one will, too? One can dream, right? Well, with a good medium-level toolkit, that is exactly what happens.

I am not proposing that all games should work exactly the same, not even games in the same genre. Even when using a medium-level toolkit with powerful built-in user interface features, there is still a huge amount of design space for individual games to establish their own look & feel, or provide special-purpose interaction metaphors — allowing that is the whole point of medium-level toolkits — but at least the fundamentals are strong.

I kept the best news for last: a good medium-level VR toolkit not only works in actual VR, it also works splendidly on a desktop — in fact, an application based on a VR toolkit done right is functionally indistinguishable from a native desktop application. I have plenty of VR applications to prove it. That means that game developers will not have to make separate VR and desktop versions of their games. VR (or the desktop, depending on your perspective) will come for free.

So here’s my call to arms: now is exactly the right time to get going on that middleware thing. We can’t wait until great consumer-level VR hardware hits the mainstream market; the software has to be already there and ready the moment it does. The people making VR middleware, and the people who should be using it, should already be talking at this point. Are they?

KeckCAVES on Mars

You might have heard that NASA has a new rover on Mars. What you might not know is that KeckCAVES is quite involved with that mission. One of KeckCAVES’ core scientists, Dawn Sumner, is a member of the Curiosity Science Team. Dawn talks about her experiences as tactical long term planner for the rover’s science mission, and co-investigator on several of the rover’s cameras, on her blog, Dawn on Mars.

Immersive 3D visualization has been used at several stages of mission planning and preparation, including selection of the rover’s landing site. Crusta, the virtual globe software developed by KeckCAVES, was used to create a high-resolution global topography model of Mars, merging the best-quality data available for the entire planet and each of the originally proposed landing sites. Crusta’s ability to run in an immersive 3D display environment such as KeckCAVES’ CAVE, allowing users to virtually walk on the surface of Mars at 1:1 (or any other) scale, and to create maps by drawing directly on the 3D surface, was important in weighing the relative merits of the four proposed sites from engineering and scientific viewpoints.

Dawn made the following video after Gale Crater, her preferred landing site, had been selected for the mission to illustrate her rationale. The video is stereoscopic and can be viewed using red/blue anaglyphic glasses or several other stereo viewing methods:

We filmed this video entirely virtually. Dawn is working with Crusta on a low-cost immersive 3D environment based on a 3D TV, which means she perceived Crusta’s Mars model as a tangible 3D object and was able to interact with it via natural gestures, using an optically-tracked Nintendo Wii controller as an input device and pointing out features of interest on the surface with her fingers. Dawn herself was filmed by two Kinect 3D video cameras, and the combination of virtual Mars and virtual Dawn was rendered into a stereo movie file in real-time while she was working with the software.

Now that Curiosity is on Mars, we are planning to continue using Crusta to visualize and evaluate its progress, and we hope that Crusta will soon help plan and execute the rover’s journey up Mt. Sharp (NASA have their own 3D path planning software, but we believe Crusta has useful complementary features).

Furthermore, as the rover progresses, it will send high-resolution stereo images from its mast-mounted navigation camera. Several KeckCAVES developers are working on software to convert these stereo images into ultra-high resolution digital terrain models, and to register these to, and integrate them with, Crusta’s existing Mars topography model as they become available.

We already tried this process with stereo imagery from the previous two Mars rovers, Spirit and Opportunity. We took the highest-resolution orbital topography data available, collected by the HiRISE camera, and merged it with the rover data, which is approximately 1000 times more dense. The following figure shows the result (click to embiggen):

The white arrow in panel A shows the location of the rover’s high-resolution data patch shown in panels B and C. In panel C, a stratum of rock — identified by its different color — was selected, and a plane was fit to the selected points (highlighted in green) to measure the stratum’s bedding angle.

The above images were created with LiDAR Viewer, another KeckCAVES software package. LiDAR Viewer is used to visually analyze very large 3D point clouds, such as those resulting from laser scanning surveys, or, in this case, orbital and terrestrial stereo imagery.

The terrain data we expect from Curiosity’s stereo cameras will be even higher resolution than that. The end result will be an integrated global Martian topography model with local patches down to millimeter resolution, allowing a scientist in the CAVE to virtually pick up individual pebbles.

3D Movies and the VR community: a match made in heaven or hell?

I know I’m several years late to the party talking about the recent 3D movie renaissance, but bear with me. I want to talk not about 3D movies, but about their influence on the VR field, good and bad.

First, the good. It’s impossible to deny the huge impact 3D movies have had on VR, simply by commodifying 3D display hardware. I’m going to go out on a limb and say that without Avatar, you wouldn’t be able to go into an electronics store and pick up a 70″ 3D TV for $2100. And without that crucial component, we would not be able to build low-cost fully-immersive 3D display systems for $7000. And we wouldn’t have neat toys like Sony’s HMZ-T1 or the upcoming Oculus Rift either — although the latter is designed for gaming from the ground up, I don’t think the Kickstarter would have taken off if 3D movies weren’t a thing right now.

And the effect goes beyond simply making real VR cheaper. It is that now real VR is affordable for a much larger segment of people. $7000 is still a bit much to spend for home entertainment, but it’s inside the equipment budget for many scientists. And those are my target audience. We are not selling low-cost VR systems per se, but we’re giving away the designs to build them, and the software to run them. And we’ve “sold” dozens of them, primarily to scientists who work with 3D data that is too complex to meaningfully analyze with desktop 3D visualization, but who don’t have the budget to build “professional” systems. Now, dozens is absolutely zilch in mainstream terms, but for our niche it’s a big deal, and it’s just taking off. We’re even getting them into high schools now. And we’re not the only ones “selling” them.

The end result is that many more people are getting exposed to real immersive 3D display environments, and to the practical benefits that they offer for their daily work. That will benefit us all.

But there are some downsides to the 3D movie renaissance as well, and while those can be addressed, we first need to be aware of them. For one, while 3D movies are definitely in the public consciousness, I found that nobody is exactly bonkers about them. Roger Ebert is an extreme example (I think that Mr. Ebert is wrong in the sense that he claims 3D does not work in principle, whereas I think 3D does not work in many concrete implementations seen in theaters right now, but that’s a topic for another post), but the majority of people I speak to are decidedly “meh” about 3D movies. They say “3D doesn’t work for me” or “I get headaches” or “I get dizzy” etc.

Now that is a problem for VR as a whole, because there is no distinction in the public mind between 3D movies and real immersive 3D graphics. Meaning that people think that VR doesn’t work. But it does. I just did a quick guesstimate, and in the seven years we’ve had our CAVE, I’ve probably brought 1000 people through there, from every segment of the population. It has worked for every single one of them. How do I know? Everyone who enters the CAVE goes through the training course — a beach ball-sized globe hanging in the middle of the CAVE, shown in this video:

(Oh boy, just looking at this six-year-old video, the user interface in Vrui has improved so much. It’s almost embarrassing.)

I ask every single person to step in, touch the globe, and then indicate how big it is. And they all do the same thing: use both hands to make a cradling gesture around a virtual object that’s not actually there. If the 3D effect didn’t work for them, they couldn’t do it. QED. Before you ask: I’m aware that a significant percentage of the general population have no stereo vision at all, but immersive 3D graphics works for them as well because it provides motion parallax. I know because one of my best friends has monocular vision, and it works for him. He even co-stars with me in a silly video.

The upshot is that the conversation goes differently now. It used to be that I talk to “VR virgins” about what I do, and they have no pre-conception of 3D, are curious, try the CAVE, and it works for them and they like it. These days, I talk about the CAVE, they immediately say that 3D doesn’t work for them, and they’re very reluctant to try the CAVE. I twist their arms to get them in there nonetheless, and it works for them, and they like it. This is not a problem if I have someone there in person, but it’s a problem when I can’t just stuff the person I’m describing VR to into a VR system, as in, say, when you’re writing a proposal to beg for money. And that’s bad news, big time (but it’s a topic for another post).

There is another interesting change in behavior: let’s say I have a group of people coming in for a tour (yeah, we sometimes get strongarmed into doing those). Used to be, they would come into the CAVE room, and stand around not sure what to expect or what to do. These days, they immediately sit down at the conference table, grab a pair of 3D glasses if they find one, and get ready to be entertained. I then have to tell them that no, that’s not how it works, would they please put the non-head tracked glasses down until later, get up, and get ready to get into the CAVE itself and see it properly? It’s pretty funny, actually.

The other downside is that the use of the word “3D” for movies has watered down that term even more. Now there are:

  • “3D graphics” for projected 2D images of 3D scenes, i.e., virtual and real photos or movies, i.e., basically everything anybody has ever done. The end results of 3D graphics are decidedly 2D, but the term was coined to distinguish it from 2D graphics, i.e., pictures of scenes playing in flatland.
  • “3D movies” meaning stereoscopic movies shown on stereoscopic displays. In my opinion, a better term would be “2D plus depth” movies (or they could just go with “stereo movies,” you know), because most directors at this time treat the stereoscopic dimension as a separate entity from the other two dimensions, as something that can be tweaked and played with. And I think that’s one cause of the problem, because they’re messing with people’s brains. And don’t even get me started on “upconverted” 3D movies, oh my.
  • “3D displays” meaning stereoscopic displays, those used to show 3D movies. They are a necessary component to create 3D images, but not 3D by themselves.
  • “3D displays” meaning immersive 3D displays like CAVEs. The distinguishing feature of these is that they show three-dimensional scenes and objects in a way similar enough to how we would perceive the same scenes and objects if they were real that our brains accept the illusion, and allow us to work with them as if they were real — and this last bit is really the main point. The difference between this and “3D movies” cannot be overstated. I would rather call these displays “holographic,” but then I get flak from the “holograms are only holograms if they’re based on lasers and interference” crowd, who are technically correct (and isn’t that the best form of correctness?) because that’s how the word was defined, but it’s wrong because these displays look and feel exactly like holograms — they are free-standing, solid-appearing, touchable virtual objects. After all, “hologram,” loosely translated from Greek, means “shows the whole thing.” And that’s exactly what immersive 3D displays do.

And I probably missed a few. So there’s clearly a confusion of terms, and we need to find ways to distinguish what real immersive 3D graphics does from what 3D movies do, and need to do it in ways that don’t create unrealistic expectations, either. Don’t reference “the Matrix,” try not to mention holodecks (but it’s so tempting!), don’t say it’s an indistinguishable replication of reality (in other words, don’t say “virtual reality,” ha!). Ideally, don’t say anything — show them.

In summary, “3D” is now widely embedded in the public consciousness, and the VR community has to deal with it. There are obvious and huge benefits, but there are some downsides as well, and those have to be addressed. They can be addressed — fortunately, immersive 3D graphics are not the same as 3D movies — but it takes care and effort. Time to get started.

Running WordPress on Linux

When I got talked into starting a blog, I decided I’d at least have some fun with it and install the blog software myself — after all, I’m already running a web server; why not use that? Should be easy, right?

Well, turns out it was easy, except one “little snag.” I could see the blog template, change settings, post and comment — but I couldn’t upload media. Some digging turned up that WordPress stuffs everything besides media into its underlying MySQL database, but media files are directly uploaded to the directory structure managed by the web server running the WordPress software. Whenever I tried uploading media, I got the dreaded “Unable to create directory – Is its parent directory writable by the server?” error message.

Now, being an old Linux hand, I blamed permission issues right away, and started doing the usual tests — find out under which user account apache and php scripts are run, chown the offending directory tree to that user, set proper permissions on that directory tree, all to no avail. I even tried the last-resort fix of making the entire wordpress directory world-writable (don’t try this at home), but even that didn’t work. And that had me stumped.

So I turned to Google, and found out that about 10 million other people had already run into the same problem. There were lots of proposed fixes, most of them bogus of course (use ftp to upload media; manually create a new uploads/<year>/<month> directory every month; make everything world-writable; disable your plug-ins; re-install WordPress; change hosting services; exorcise your server; do a rain dance, etc.). Not a single one of them worked, including the rain dance.

Now here was the actual issue: my web server is sitting on a modern Linux kernel, and if you know your stuff you know where I’m going with this: it’s running SELinux. And of course SELinux was the culprit. It turns out that apache runs under a very restrictive SELinux regime (as it should), and this regime makes all files underneath the root html directory read-only, no matter what the UNIX file permissions say. You can chmod 0777 to your heart’s content, it will make not an ounce of difference. But once you re-label WordPress’ upload directory to httpd_sys_rw_content_t, everything is peachy keen.

So why is this so hard? SELinux is not to blame; it’s doing exactly what it’s supposed to be doing. It’s running apache in the most restrictive way possible that allows it to run at all. Changing the label of a directory is a one-line command, and there’s even a man page (httpd_selinux) that explains what needs to be done. So why on earth did not a single one of the answers I found mention SELinux at all? (Of course, in hindsight, if you Google for “WordPress SELinux media upload” you get lots of good advice, but if you are already suspecting SELinux is involved, you don’t need to Google anymore).

I think the basic problem is that the fact that SELinux has completely changed the way UNIX does security has stayed under the radar for most people. UNIX file permissions are no longer the first line of defense against unauthorized file accesses. I myself completely forgot about SELinux while frantically trying to solve the problem; I only remembered it when I had already wasted a good part of a day. What I’m saying is, SELinux needs a really good PR campaign. Oh, and somebody needs to write a blog post containing the keywords “WordPress cannot upload images” and “SELinux” and “httpd_sys_rw_content_t” so that poor noobs like me will be able to find it on Google in the future. Oh, I guess I just did.

So here’s my solution, where /var/www/html/wordpress is my wordpress root directory, and apache is running as user apache:

  1. Chown the entire /var/www/html hierarchy to root:apache and make it readable for user and group only:
    • chown -R root:apache /var/www/html
    • chmod -R g-w,o-rwx /var/www/html
  2. Create (if it doesn’t already exist) and chown /var/www/html/wordpress/wp-content/uploads to apache:apache and make sure it’s user-writable (it should already be):
    • mkdir /var/www/html/wordpress/wp-content/uploads
    • chown -R apache:apache /var/www/html/wordpress/wp-content/uploads
    • chmod -R u+w,g-w,o-rwx /var/www/html/wordpress/wp-content/uploads
  3. Change the SELinux context of /var/www/html/wordpress/wp-content/uploads to httpd_sys_rw_content_t:
    • chcon -R --type=httpd_sys_rw_content_t /var/www/html/wordpress/wp-content/uploads
  4. Optionally, make the change permanent, i.e., make it survive a complete file system re-label (see the note right after this list):
    • semanage fcontext -a -t httpd_sys_rw_content_t "/var/www/html/wordpress/wp-content/uploads(/.*)?"
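
A follow-up note on step 4, in case it saves someone a round of head-scratching: as far as I know, semanage only records the default context for future labeling; to apply that context to files that already exist, and to double-check the result, restorecon and ls -Z should do the trick:

    • restorecon -R -v /var/www/html/wordpress/wp-content/uploads
    • ls -Z /var/www/html/wordpress/wp-content/uploads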

Oh, and if you think my advice to make all of /var/www/html verboten for other is too restrictive, at least do the world a favor and take the WordPress configuration file (you know, the one that contains your MySQL database’s admin password in plain text) and make that one user- and group-only.

Ah hell, while you’re at it, just save yourself a lot of trouble and go ahead and read this.

Update: I finally figured out how to get notification emails from WordPress. SELinux was the problem again.

Good stereo vs. bad stereo

I received an email about a week ago that reminded me that, even though stereoscopic movies and 3D graphics have been around for at least six decades, there are still some wide-spread misconceptions out there. Those need to be addressed urgently, especially given stereo’s hard push into the mainstream over the last few years. While, this time around, the approaches to stereo are generally better than the last time “3D” hit the local multiplex (just compare Avatar and Friday the 13th 3D), and the wide availability of commodity stereoscopic display hardware is a major boon to people like me, we are already beginning to see a backlash. And if there’s a way to do things better, to avoid that backlash, then I think it’s important to do it.

So here’s the gist of this particular issue: there are primarily two ways of setting up a movie camera, or a virtual movie camera in 3D computer graphics, to capture stereoscopic images — one is used by the majority of existing 3D graphics software, and seemingly also by the “3D” movie industry, and the other one is correct.

Toe-in vs skewed frustum

So, how do you set up a stereo camera? The basic truth is that stereoscopy works by capturing two slightly different views of the same 3D scene, and presenting these views separately to the viewers’ left and right eyes. The devil, as always, lies in the details.

Say you have two regular video cameras, and want to film a “3D” movie (OK, I’m going to stop putting “3D” in quotes now. My pedantic point is that 3D movies are not actually 3D, they’re stereoscopic. Carry on). What do you do? If you put them next to each other, with their viewing directions exactly parallel, you’ll see that it doesn’t quite give the desired effect. When viewing the resulting footage, you’ll notice that everything in the scene, up to infinity, appears to float in front of your viewing screen. This is because the two cameras, being parallel, are stereo-focused on the infinity plane. What you want, instead, is that near objects float in front of the screen, and that far objects float behind the screen. Let’s call the virtual plane separating “in-front” and “behind” objects the stereo-focus plane.

So how do you control the position of the stereo-focus plane? When using two normal cameras, the only solution is to rotate both slightly inwards, so that their viewing direction lines intersect exactly in the desired stereo-focus plane. This approach is often called toe-in stereo, and it sort-of works — under a very lenient definition of the words “sort-of” and “works.”

The fundamental problem with toe-in stereo is that it makes sense intuitively — after all, don’t our eyes rotate inwards when we focus on nearby objects? — but that our intuition does not correspond to how 3D movies are shown. 3D (or any other kind of) movies are not projected directly onto our retinas, they are projected onto screens, and those screens are in turn viewed by us, i.e., they project onto our retinas.

Now, when a normal camera records a movie, the assumption is that the movie will later be projected onto a screen that is orthogonal to the projector’s projection direction, which is implicitly the same as the camera’s viewing direction (the undesirable effect of non-orthogonal projection is called keystoning). In a toe-in stereo camera, on the other hand, there are two viewing directions, at a slight angle towards each other. But, in the theater, the cameras’ views are projected onto the same screen, meaning that at least one, but typically both, of the component images will exhibit keystoning (see Figures 1 and 2).

Figure 1: The implied viewing directions and screen orientations caused by a toe-in stereo camera based on two on-axis projection cameras. The discrepancy between the screen orientations implied by the cameras’ models and the real screen causes keystone distortion, which leads to 3D convergence issues and eye strain.

Figure 2: The left stereo image shows the keystoning effect caused by toe-in stereo. A viewer will not be able to merge these two views into a single 3D object. The right stereo image shows the correct result of using skewed-frustum stereo. You can try for yourself using a pair of red/blue anaglyphic glasses.

The bad news is that keystoning from toe-in stereo leads to problems in 3D vision. Because the left/right views of captured objects or scenes do not actually look like they would if viewed directly with the naked eye, our brains refuse to merge those views and perceive the 3D objects therein, causing a breakdown of the 3D illusion. When keystoning is less severe, our brains are flexible enough to adapt, but our eyes will dart around trying to make sense of the mismatching images, which leads to eye strain and potentially headaches. Because keystoning is more severe towards the left and right edges of the image, toe-in stereo generally works well enough for convergence around the center of the images, and generally breaks down towards the edges.

And this is why I think a good portion of current 3D movies are based on toe-in stereo (I haven’t watched enough 3D movies to tell for sure, and the ones I’ve seen were too murky to really tell): I have spoken with 3D movie experts (an IMAX 3D film crew, to be precise), and they told me the two basic rules of thumb for good stereo in movies: artificially reduce the amount of eye separation, and keep the action, and therefore the viewer’s eyes, in the center of the screen. Taken together, these two rules exactly address the issues caused by toe-in stereo, but of course they’re only treating the symptom, not the cause. As an aside: when we showed this camera crew how we are doing stereo in the CAVE, they immediately accused us of breaking the two rules. What they forgot is that stereo in the CAVE obviously works, including for them, and does not cause eye strain, meaning that those rules are only workarounds for a problem that doesn’t exist in the first place if stereo is done properly.

So what is the correct way of doing it? It can be derived by simple geometry. If a 3D movie or stereo 3D graphics are to be shown on a particular screen, and will be seen by a viewer positioned somewhere in front of that screen, then the two viewing volumes for the viewer’s eyes are exactly the two pyramids defined by each eye, and the four corners of the screen. In technical terms, this leads to skewed-frustum stereo. The following video explains this pretty well, better than I could here in words or a single diagram, even though it is primarily about head tracking and the screen/viewer camera model:

In a nutshell, skewed-frustum stereo works exactly as ordered. Even stereo pairs with very large disparity can be viewed without convergence problems or eye strain, and there are no problems when looking towards the edge of the image.
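
For the graphics programmers in the audience, here is a minimal sketch of how one might set up a skewed frustum for one eye in plain fixed-function OpenGL, assuming the simple case of a flat screen centered straight ahead of the viewer; all names and numbers are made up for illustration, and this is not lifted from Vrui or any other toolkit:

    // Off-axis ("skewed frustum") stereo for one eye, for a flat screen centered
    // straight ahead of the viewer. All lengths must be in the same units.
    #include <GL/gl.h>

    void setupEye(double screenWidth, double screenHeight, // physical screen size
                  double screenDist,                       // viewer-to-screen distance
                  double eyeOffset,                        // +/- half the eye separation
                  double zNear, double zFar)
    {
        // Project the screen rectangle, as seen from this eye, onto the near plane:
        double s = zNear / screenDist;
        double left   = (-0.5 * screenWidth  - eyeOffset) * s;
        double right  = ( 0.5 * screenWidth  - eyeOffset) * s;
        double bottom = (-0.5 * screenHeight) * s;
        double top    = ( 0.5 * screenHeight) * s;

        glMatrixMode(GL_PROJECTION);
        glLoadIdentity();
        glFrustum(left, right, bottom, top, zNear, zFar); // skewed, never rotated

        glMatrixMode(GL_MODELVIEW);
        glLoadIdentity();
        // Shift the world sideways by the eye offset instead of toeing in a camera:
        glTranslated(-eyeOffset, 0.0, 0.0);
        // ... append the regular world-to-viewer transformation here ...
    }

Called once per eye, with a negative offset for the left eye and a positive one for the right, the two frusta converge exactly in the plane of the screen without any rotation, which is the whole point.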

To allow for a real and direct comparison, I prepared two stereoscopic images (using red/blue anaglyphic stereo) of the same scene from the same viewpoint and with the same eye separation, one using toe-in stereo, one using skewed-frustum stereo. They need to be large and need to be seen at original size to appreciate the effect, which is why I’m only linking them here. Ideally, switch back-and-forth between the images several times and focus on the structure close to the upper-left corner. The effect is subtle, but noxious:

Good (skewed-frustum) stereo vs bad (toe-in) stereo.

I generated these using the Nanotech Construction Kit and Vrui; as it turns out, Vrui is flexible enough to support bad stereo, but at least setting it up was considerably harder than setting up good stereo. So that’s a win, I guess.

There are only two issues to be aware of: for one, objects at infinity will show an on-screen separation exactly equal to the eye separation used to generate the images, so if the programmed-in eye separation is larger than the viewer’s actual eye separation, convergence for very far away objects will fail (in reality, objects can’t be farther away than infinity, or at least our brains seem to think so). Fortunately, the distribution of eye separations in the general population is quite narrow; just stick close to the smaller end. But it’s a thing to keep in mind when producing stereoscopic images for a small screen, and then showing them on a large screen: eye separation scales with screen size when baked into a video. This is why, ideally, stereoscopic 3D graphics should be generated specifically for the size of the screen on which they will be shown, and for the expected position of the audience.
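
To put rough numbers on that scaling effect, here is a back-of-the-envelope sketch; all values are assumed, but representative:

    // Baked-in disparity scales with the screen, but real eyes do not.
    double scaledInfinityDisparity()
    {
        double eyeSep = 0.065; // eye separation assumed when the content was authored, in meters
        double smallW = 0.53;  // width of the roughly 24" monitor the content was authored for
        double bigW   = 1.43;  // width of the roughly 65" TV it ends up being shown on

        // Objects at infinity were baked in with eyeSep of on-screen separation; blowing
        // the image up to the big screen scales that separation along with everything else.
        // The result is about 0.175 m, far wider than anyone's eyes, so fusing points at
        // infinity would require the eyes to diverge.
        return eyeSep * bigW / smallW;
    }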

The other issue is that virtual objects very close to the viewer will appear blurry. This is because when the brain perceives an object to be at a certain distance, it will tell the eyes to focus their lenses to that distance (a process called accommodation). But in stereoscopic imaging, the light reaching the viewer’s eyes from close-by virtual objects will still come from the actual screen, which is much farther away, and so the eyes will focus on the wrong plane, and the entire image will appear blurry.

Unfortunately, there’s nothing we can do about that right now, but at least it’s a rather subtle effect. In our CAVE, users standing in the center can see virtual objects floating only a few inches in front of their eyes quite clearly, even though the walls, i.e., the actual screens, are four feet away. This focus miscue does have a noticeable after-effect: after having used the CAVE for an extended period of time, say a few hours, the real world will look somewhat “off,” in a way that’s hard to describe, for a few minutes after stepping out. But this appears to be only a temporary effect.

Taking it back to the real 3D movie world: the physical analogy to skewed-frustum stereo is lens shift. Instead of rotating the two cameras inwards, one has to shift their lenses inwards. The amount of shift is, again, determined by the distance to the desired stereo-focus plane. Technically, creating lens-shift stereo cameras should be feasible (after all, lens shift photography is all the rage these days), so everybody should be using them. And some 3D movie makers might very well already do that — I’m not a part of that crowd, but from what I hear, at least some don’t.

In the 3D graphics world, where cameras are entirely virtual, it should be even easier to do stereo right. However, many graphics applications use the standard camera model (focus point, viewing direction, up vector, field-of-view), and can only represent non-skewed frusta. The fact that this camera model, as commonly implemented, does not support proper stereo, is just another reason why it shouldn’t be used.

So here’s the bottom line: Toe-in stereo is only a rough approximation of correct stereo, and it should not be used. If you find yourself wondering how to specify the toe-in angle in your favorite graphics software, hold it right there, you’re doing it wrong. The fact that toe-in stereo is still used — and seemingly widely used — could explain the eye strain and discomfort large numbers of people report with 3D movies and stereoscopic 3D graphics. Real 3D movie cameras should use lens shift, and virtual stereoscopic cameras should use skewed frusta, aka off-axis projection. While the standard 3D graphics camera model can be generalized to support skewed frusta, why not just replace it with a model that can do it without additional thought, and is more flexible and more generally applicable to boot?

Update: With the Oculus Rift in developers’ hands now, I’m getting a lot of questions about whether this article applies to head-mounted displays in general, and the Rift specifically. Short answer: it does. There isn’t any fundamental difference between large screens far away from the viewer, and small screens right in front of the viewer’s eyes. The latter add a wrinkle because they necessarily need to involve lenses and their concomitant distortions so that viewers are able to focus on the screens, but the principle remains the same. One important difference is that small screens close to the viewer’s eyes are more sensitive to miscalibration, so doing stereo right is, if anything, even more important than on large-screen displays. And yes, the official Oculus Rift software does use off-axis projection, even though the SDK documentation flat-out denies it.

Whither Leap Motion?

Leap Motion’s Leap, an optical tracking system that lets users interact with computers in three dimensions directly with their hands, has been the talk of the town recently. So what’s my take on it, and particularly its use for immersive graphics?

Cool story, bro. Two months ago, a group of researchers from UC Davis and I visited the company in their San Francisco offices to see the device for ourselves. Several of Leap Motion’s engineers had seen our booth at the recent Bay Area Maker Faire, and invited us to bring one of our low-cost semi-immersive displays (a 3D TV with a Razer Hydra 6-DOF input device) and show our stuff. We obliged, packed our things, and down along I-80 to SF we went. We showed them ours, they showed us theirs, and fun was had by all.

So what’s the intelligence gathered from this visit? There’s good news, and there’s bad news. The good news is the hardware. Leap Motion have been touting the Leap as a much more precise alternative to the Kinect, and they have that absolutely right. The precision, resolution, and responsiveness of the device are exactly what they claim. Interestingly, I did not glean that insight from the actual software demos they were showing, but from a very simple utility that just showed the raw 3D point cloud of everything that entered the device’s capture space, and identified hands, fingers, and other gadgets such as pencils accurately and in real time. Having done extensive work with the Kinect, I can say that it’s an entirely different kind of tracking, altogether.

So what’s the bad news? Well, as usual, it’s the software and application side. Leap Motion’s company line is that the Leap will make mouse and keyboard obsolete. Not so fast there, buckaroo. Probably 99.99% of computer interactions done by normal people are two-dimensional in nature, and the mouse/keyboard are really good at those. You would not want to use a free-space 3D interface for intrinsically 2D interactions, which is, incidentally, my only gripe with the famous Minority Report interface (but that’s a topic for another post). The end result from doing that already has a fitting name: “Gorilla Arm.” I think I can speak to that because that’s exactly what happens when you’re doing 2D tasks (like using a web browser or filling in a spreadsheet) in an immersive display environment. Trust me, it’s not something you want to do if you can avoid it.

On the other hand, if you’re one of the minority of people who use their computers for 3D tasks, e.g., 3D modeling, sculpting, or, naturally, immersive 3D graphics, it’s an entirely different story. For such applications in the desktop realm, the Leap is a godsend. Instead of having to do the mental gymnastics of using a 2D input device to perform 3D interactions, you just interact directly with the 3D data. This is, again, exactly what’s happening in immersive graphics, and yes, it’s something you definitely do want to do.

So that’s good news, right? Well, yeah, but… The problem here is, and it’s a big problem, that in order to pipe 3D interactions captured by a device like the Leap into a 3D application, you have to punch through the existing 2D-based user interface of that application. The previous approach companies developing novel 3D input devices (think all the data gloves, 3D mice, etc. that have come out and failed over the years) have taken is to provide some form of mouse emulation, so that their devices can be used immediately with existing software. This does not work, ever. In this setup, 3D interactions performed with the device are first boiled down to 2D by the device’s driver, fed into the application, and then turned back into 3D interactions using whatever interface paradigm the application is using. The first step, going from 3D to 2D, is already awkward, and the second step is typically optimized for particular 2D devices, such as mice, which a “simulated” mouse device is most decidedly not. In other words, there are two levels of ill-fitting interface paradigms stacked on top of each other.

So what needs to be done? The answer is quite simple: if you want to effectively use the Leap with a piece of 3D software, that software has to explicitly support the Leap, and needs to use appropriate direct 3D interaction metaphors. Meaning the application developers have to buy into the Leap, dream up good problem-specific 3D interaction metaphors, do studies or experiments to fine-tune them, and then include them in their software. That takes a lot of time and money, and they won’t do it unless there is high demand, i.e., the Leap is already a widely-used device. But it won’t become a widely-used device unless a lot of widely-used 3D software already supports it in an effective way.

So it’s a classical chicken-and-egg problem. Unless you happen to use a certain VR development toolkit that is based around exactly this idea: providing device-optimized 3D interaction metaphors outside of an application’s purview, so that hardware developers can integrate their devices into existing applications without having to change those applications in any way, or even getting to their source code. But I digress…

Back on topic, what Leap Motion need to do is find at least one “killer application,” and do their utmost to get that application just exactly right. And then they have to bundle that application with every device sold. If the people buying their device are stuck with playing Fruit Ninja, or navigating with Google Earth (another thing a mouse is really good at, because Google successfully boiled down the interaction to 2D, and Leap’s Google Earth plug-in doesn’t add any new functionality) or have to use the device to write emails, they won’t recommend it to their friends.

By the way: will the Leap work out-of-the box for 3D video games? Hard to say, but I’m skeptical. They show a “finger gun” control scheme for first-person shooters — again implemented via mouse emulation — but doing that for more than a few minutes will lead to a very sore shoulder. Not that it’s a bad idea in itself — see below for a video showing exactly that interface in a CAVE — but unless the Leap is integrated into a fully calibrated desktop system, it won’t allow a player to actually aim with the “finger gun;” it will be just an equally indirect replacement for moving the mouse left-to-right.

On their web site, Leap Motion mention CAD and clay modeling as applications that inspired them to develop it. Could these be killer applications? Time will tell, but it’s at least a good starting point. So, go ahead and do it! I happen to have a 3D virtual clay modeling application with direct 3D interaction metaphors lying around, just saying…

Now, to restate my overall point after all this skepticism. From what I’ve personally seen, the Leap is an awesome device. I will definitely buy at least one when it comes out. That’s because all the software I’m developing and using on a daily basis is already poised to work with it, due to its input abstraction paradigm. Give me a low-level driver, and the rest is gravy — please, give me a low-level driver! But will the device succeed in the mainstream market, given the issues discussed here? Will it sell hundreds of millions of units, as they hope? For that to happen, I think, they’ll have to do significantly more than what they showed us. Maybe that’s why they pushed back the release date by half a year — here’s hoping.

Standard camera model considered harmful

With apologies to Edsger W. Dijkstra (and pretty much everyone else).

So what’s wrong with the canonical 3D graphics camera model? To recap, in this model a camera is defined by a focus point (the “eye”), a viewing direction, an “up” vector, a screen aspect ratio, and a field-of-view angle (“fov”) (see Figure 1). Throw in a near- and far-plane distance, and together these parameters uniquely define a viewing frustum in model space, and hence the modelview and projection matrices required to render a view of a 3D scene. Nothing wrong with that, per se.

Figure 1: The standard 3D graphics camera model, defined by a focus point position, viewing direction, “up” vector, and screen aspect ratio (ratio of screen width to screen height, not shown in diagram).
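
For concreteness, here is how the standard model typically surfaces in code, in classic OpenGL/GLU terms; the numbers are placeholders, and gluLookAt takes a look-at point rather than a viewing direction, but it encodes the same information:

    // The standard camera model, spelled out with the classic GLU helpers:
    #include <GL/glu.h>

    void setupStandardCamera(double aspect) // aspect = screen width / screen height
    {
        glMatrixMode(GL_PROJECTION);
        glLoadIdentity();
        //             fov (deg), aspect, near, far -> always a symmetric, on-axis frustum
        gluPerspective(60.0,      aspect, 0.1,  1000.0);

        glMatrixMode(GL_MODELVIEW);
        glLoadIdentity();
        //        focus point ("eye"),  look-at point,   "up" vector
        gluLookAt(0.0, 0.0, 5.0,        0.0, 0.0, 0.0,   0.0, 1.0, 0.0);
    }

Note that gluPerspective can only ever produce a symmetric, on-axis frustum, which is precisely the limitation that causes trouble for stereo and for immersive displays.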

The problem arises when this same camera model is applied to (semi-) immersive environments, such as when one wants to adapt an existing graphics package or game engine to, say, a 3D TV or a head-mounted display with only minimal changes. There are two main problems with that: for one, the standard camera model does not support proper stereo image generation, leading to 3D vision problems, eye strain, and discomfort (but that’s a topic for another post).

The problem I want to discuss here is the implicit link between the camera model and viewpoint navigation. In this context, viewpoint navigation is the mechanism by which a 3D graphics application represents the viewer moving through the virtual 3D environment. For example, in a typical first-person video game, the player will be represented in the game world as some kind of avatar, and the camera model will be attached to that avatar’s head. The game engine will provide some mechanism for the player to directly control the position and viewing direction of the avatar, and therefore the camera, in the virtual world (this could be the just-as-canonical WASD+mouse navigation metaphor, or something else). But no matter the details, the bottom line is that the game engine is always in complete control of the player avatar’s — and the camera’s — position and orientation in the game world.

But in immersive display environments, where the user’s head is tracked inside some tracking volume, this is no longer true. In such environments, there are two ways to change the camera position: using the normal navigation metaphor, or simply physically moving inside the tracking volume. Obviously, the game engine has no way to control the latter (short of tethers or an electric shock collar). The problem is that the standard camera model, or rather its implementation in common graphics engines, does not account for this separation.

This is not really a technical problem, as there are ways to manipulate the camera model to yield the correct end result, but it’s a thinking problem. Having to think inside this ill-fitting camera model causes real headaches for developers, who will be running into lots of detail problems and edge cases when trying to implement correct camera behavior in immersive environments.

My preferred approach is to use an entirely different way to think about navigation in virtual worlds, and the camera model that directly follows from it. In this paradigm, a display system is not represented by a virtual camera, but by a physical display environment consisting of a collection of screens and viewers. For example, a typical desktop system would consist of a single screen (the actual monitor), and a single viewer (the user) sitting in a fixed position in front of it (fixed because typical desktop systems have no way to detect the viewer’s actual position). At the other extreme, a CAVE environment would consist of four to six large screens in fixed positions forming a cube, and a viewer whose position and orientation is measured in real time by a head tracking system (see Figure 2). The beauty is that this simple environment model is extremely flexible; it can support any number of screens in arbitrary positions, including moving screens, and any number of fixed or tracked viewers. It can support non-rectangular screens without problems (but that’s a topic for another post), and non-flat screens can be tessellated to a desired precision. So far, I have not found a single concrete display environment that cannot be described by this model, or at least approximated to arbitrary precision.

Figure 2: Photo of a CAVE environment consisting of four screens (three walls and one floor) and one viewer (in this case a camera on a tripod). Note how the image of the 3D protein model in the CAVE spans all four screens, but still appears seamless to the camera.

In more detail, a screen is defined by its position, orientation, width, and height (let’s ignore non-rectangular screens for now). A viewer, on the other hand, is solely defined by the position of its two eyes (two eyes instead of one to support proper stereo; sorry, spiders and Martians need not apply). All screens and viewers forming one display environment are defined in the same coordinate system, called physical space because it refers to real-world entities, namely display screens and users.
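
To make that concrete, a bare-bones sketch of the two entities involved might look like the following; this is just an illustration with made-up names, not Vrui’s actual (and much richer) classes:

    // A display environment is just a list of screens and a list of viewers,
    // all expressed in one common physical coordinate system (e.g., meters).
    #include <vector>

    struct Vec3 { double x, y, z; };

    struct Screen
    {
        Vec3 origin;          // lower-left corner of the screen in physical space
        Vec3 xAxis, yAxis;    // unit vectors along the screen's width and height
        double width, height; // physical size of the screen
    };

    struct Viewer
    {
        // Two eye positions, to support proper stereo; updated continuously if the
        // viewer is head-tracked, or set to a reasonable fixed default if not.
        Vec3 leftEye, rightEye;
    };

    struct Environment
    {
        std::vector<Screen> screens;
        std::vector<Viewer> viewers;
        // The navigation transformation that maps this entire physical space into
        // the virtual world lives alongside, under full application control.
    };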

How does this environment model affect navigation? Instead of moving a virtual camera through virtual space, navigation now moves an entire environment, i.e., a collection of any number of screens and viewers, through the virtual space, still under complete program control (fans of Dr Who are free to call it the “Tardis model”). Additionally, any viewers can freely move through the environment, at least if they’re head-tracked, and this part is not under program control. From a mathematical point of view, this means that viewers can freely walk through physical space, whereas physical space as a whole is mapped into the virtual world by the graphics engine, effected by the so-called navigation transformation.

At first glance, it seems that this model does nothing but add another intermediate coordinate system and is therefore superfluous (and mathematically speaking, that’s true), but in my experience, this model makes it a lot more straightforward to think about navigation and user motion in immersive environments, and therefore makes it easier to develop novel and correct navigation metaphors that work in all circumstances. The fact that it treats all possible display environments from desktops to HMDs and CAVEs in a unified way is just a welcome bonus.

The really neat effect of this environment model is that it directly implies a camera model as well (see Figure 3). Using the standard model, it is quite a tricky prospect to maintain the collection of virtual cameras that are required to render to a multi-screen environment, and ensure that they correspond to desired viewpoint changes in the virtual world, and the viewer’s motion inside the display environment. Using the viewer/screen model, there is no extra camera model to maintain. It turns out that a viewing frustum is also uniquely identified by a combination of a flat (rectangular) screen, a focus point position, and near- and far-plane distances. However, the first two components are directly provided by the environment model, and the latter two parameters can be chosen more or less arbitrarily. As a result, the screen/viewer model has no free parameters to define a viewing frustum besides the two plane distances, and the resulting viewing frusta will always lead to correct projections, which are also automatically seamless across multiple screens (but that’s a topic for another post).

Figure 3: The screen/viewer camera model, defined by the position, orientation, and size of a screen, and the position of a focus point, in some 3D coordinate system. Apart from the near- and far-plane distances, the model has no free parameters beyond those that can be measured directly via calibration and head tracking.
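
For the curious, here is a sketch of how such a frustum can be computed from a screen rectangle and an eye position; it follows the well-known generalized perspective projection approach (e.g. Kooima’s formulation), and the variable names and example numbers are mine, not any toolkit’s:

    import numpy as np

    def screen_frustum(pa, pb, pc, eye, near, far):
        # pa, pb, pc: lower-left, lower-right, upper-left screen corners, and
        # eye: the focus point, all in physical space. Returns glFrustum-style
        # parameters plus the transform that moves the eye into screen-aligned
        # space.
        pa, pb, pc, eye = (np.asarray(v, float) for v in (pa, pb, pc, eye))
        vr = pb - pa; vr /= np.linalg.norm(vr)            # screen right
        vu = pc - pa; vu /= np.linalg.norm(vu)            # screen up
        vn = np.cross(vr, vu); vn /= np.linalg.norm(vn)   # screen normal, toward the eye

        va, vb, vc = pa - eye, pb - eye, pc - eye
        d = -np.dot(va, vn)                               # eye-to-screen-plane distance
        left = np.dot(vr, va) * near / d
        right = np.dot(vr, vb) * near / d
        bottom = np.dot(vu, va) * near / d
        top = np.dot(vu, vc) * near / d

        rot = np.eye(4); rot[:3, :3] = np.vstack((vr, vu, vn))  # rotate world into screen axes
        trans = np.eye(4); trans[:3, 3] = -eye                  # move the eye to the origin
        return (left, right, bottom, top, near, far), rot @ trans

    # Example: the desktop screen from the earlier sketch, seen by the left eye.
    params, view = screen_frustum(pa=(-0.3, -0.17, 0.0), pb=(0.3, -0.17, 0.0),
                                  pc=(-0.3, 0.17, 0.0), eye=(-0.03, 0.0, 0.75),
                                  near=0.1, far=100.0)
    print(params)  # an asymmetric frustum, because the eye sits slightly left of center

Note that every input to this function comes either from the environment description (the screen corners) or from head tracking (the eye position); the only free choices are the near and far distances.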

Looking at it mathematically again, one screen/viewer pair uniquely defines a viewing frustum in the physical coordinate space in which the screen and viewer are defined, and hence a modelview and a projection matrix. Now, the mapping from physical space to virtual world space is typically also expressed as a matrix, meaning that this model really just adds yet another modelview matrix. And since the product of two matrices is a matrix, it boils down to the same projection pipeline as the standard model. As I mentioned earlier, the model does not introduce any new capabilities, it just makes it easier to think about the existing capabilities.
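
To illustrate that last point, here is a tiny sketch (with placeholder identity matrices standing in for the real ones): the physical-space detour multiplies into one ordinary modelview matrix, so the downstream pipeline never notices it.

    import numpy as np

    screen_view = np.eye(4)  # physical space -> screen-aligned eye space (as in the previous sketch)
    navigation = np.eye(4)   # physical space -> virtual world space (the navigation transformation)

    # Virtual world -> physical space -> eye space, collapsed into one matrix:
    modelview = screen_view @ np.linalg.inv(navigation)
    # Clip-space position = projection @ modelview @ vertex, exactly as in the
    # standard model.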

So the bottom line is that the viewer/screen model makes it simpler to reason about program-controlled navigation, completely removes the need for an explicit camera model and the extra work required to keep it consistent with the display environment, and — if the display environment was measured properly — automatically leads to distortion-free and seamless images even across multiple screens, and to always correct and eye strain-free stereo displays.

Although this model stems from the immersive environment world, applying it in the desktop realm has immediate practical benefits. For one, it supports proper stereo without extra work from the application developer; additionally, it supports flexible multi-display configurations where users can put their displays however they like, and get correct and seamless images without special application support. It even provides correct desktop head-tracking for free. Sounds like a win-win to me.

Will the Oculus Rift make you sick?

Head-mounted displays (HMDs) are making a comeback! Yay!

I don’t think there’s need to introduce the Oculus Rift HMD. Everyone’s heard of it, and everyone’s psyched – including me.

However, HMDs are prone to certain issues, and while that shouldn’t deter us from embracing them, we should be careful to do it right this time. The last thing the VR field needs right now is a viral YouTube video along the lines of “Oh, an Oculus Rift! Cool! Let me try it on… Wow, that’s awesoBLEEAAARRGHHH.”

To back up a little: when HMDs became a thing in the 80s, they tended to induce dizziness and nausea in viewers after relatively short periods of use. Interestingly, HMDs had generally worse effects than other types of immersive display environments such as CAVEs. The basic theory of simulator sickness is based on virtual motion (seeing motion that the body does not feel), and it does not account for this difference.

The commonly stated explanation for this difference is display lag. In an HMD, the screens move with the viewer’s head, and any delay will cause the virtual world to move along with the viewer until the display system catches up. Imagine wearing an HMD and quickly turning your head to the side. Say it takes 30 ms total for the head tracking system to notice the motion, the application to update its internal state and render it, and the HMD’s screens to refresh. During this interval, the world turns with you, and it snaps back to its original orientation once the delay has passed. The real world does not behave like that, and because HMD-based graphics tap deeply into our brain’s visual system, this is very disorienting and adds to the discomfort. In a CAVE, on the other hand, the screens do not move with the viewer. Delay will still disturb the projection of the virtual world, because the viewer’s actual position no longer matches the one used for rendering, but since the screens are large and relatively far away, this is barely noticeable. So far, so good.
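
To put a rough number on how disturbing that can be, here is a back-of-the-envelope sketch; the individual delay components and the head-turn rate are assumptions for illustration, not measurements:

    # Back-of-the-envelope latency budget (all numbers are assumptions):
    tracking_ms = 5.0    # tracker sampling and transmission
    app_ms = 5.0         # application state update
    render_ms = 10.0     # rendering one frame
    scanout_ms = 10.0    # display refresh / scan-out
    total_s = (tracking_ms + app_ms + render_ms + scanout_ms) / 1000.0  # 0.03 s

    head_turn_rate = 200.0  # degrees per second; a quick but ordinary head turn
    error_deg = head_turn_rate * total_s
    print(error_deg)  # 6.0 degrees: how far the world "sticks" to your head before snapping back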

Alas, there is an additional, often overlooked factor: display calibration. Any immersive graphics system, whether HMD, CAVE, or something else, needs to model exactly how virtual objects are projected onto the system’s real screens and then seen by the user (how exactly that works is a topic for another post). The bottom line is that the graphics software needs to know the absolute positions and orientations of all screens, and the absolute positions of the viewer’s eyes. Determining these is the job of head tracking and system calibration. But in an HMD, unlike in a CAVE, the tolerances for calibration are very tight. The screens are very small and very close to the viewer’s eyes, which doesn’t leave much room for error (see Figure 1). Even worse, there is no way to don an HMD precisely, short of putting screws into one’s skull; every time you put it on, it sits slightly differently. And that means any pre-configured projection parameters will not match reality.

Figure 1: Diagram of a hypothetical HMD for calibration purposes. The HMD consists of small real screens mounted directly in front of the viewer’s eyes, and uses optics to create larger virtual screens farther away, so that users can properly focus on them. For proper calibration, the graphics software needs to know the precise positions of the viewer’s pupils and the exact positions and sizes of the virtual screens, in some coordinate system. Head tracking provides the mapping from this viewer-attached coordinate system to the world coordinate system, allowing users to look and walk around.
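
In the screen/viewer terms used earlier, such an HMD is just two small virtual screens and two pupil positions expressed in a head-attached coordinate system. Here is a sketch of the calibration data involved; every number below is a made-up placeholder, not a real HMD spec:

    import numpy as np

    # Hypothetical HMD calibration data, all in a coordinate system attached to
    # the head tracker (meters). Real values would come from the HMD's optics
    # and from per-user calibration.
    hmd = {
        # Virtual screens created by the optics, roughly 1 m in front of the
        # eyes (lower-left corner, right axis, up axis, width, height):
        "left_screen":  {"origin": np.array([-0.5, -0.28, -1.0]),
                         "right": np.array([1.0, 0.0, 0.0]),
                         "up": np.array([0.0, 1.0, 0.0]),
                         "width": 0.9, "height": 0.56},
        "right_screen": {"origin": np.array([-0.4, -0.28, -1.0]),
                         "right": np.array([1.0, 0.0, 0.0]),
                         "up": np.array([0.0, 1.0, 0.0]),
                         "width": 0.9, "height": 0.56},
        # Pupil positions; this is the part that shifts every time the HMD is
        # put on, and therefore needs (re-)calibration:
        "left_pupil":  np.array([-0.032, 0.0, 0.0]),
        "right_pupil": np.array([0.032, 0.0, 0.0]),
    }
    # Each frame, head tracking supplies the transformation from this
    # head-attached space into physical space; applying it to all four entries
    # yields an ordinary screen/viewer configuration for rendering.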

These mismatches have several effects. For one, imagine that a viewer wears an HMD slightly askew, so that the two screens have different vertical positions in front of their respective eyes. If the software does not account for that, the two stereo images will be vertically displaced, something that does not happen in real life. The viewer’s eyes will make up for it, up to a point, by moving up/down independently, but that is an unnatural motion and causes eye strain. It’s the same effect as watching a 3D movie in a theater while not holding one’s head level — it will hurt later.

Another, more subtle effect is that in a miscalibrated display system the virtual world does not behave as the real world would. Try a simple experiment: fire up a first-person video game that lets you configure the view, such as Doom3, and set a very high field of view. Then rotate the view and observe. The virtual world shows strong distortion: the apparent sizes of objects, and their internal angles, change as the viewpoint rotates. This is an extreme example, but even slight discrepancies are subconsciously unsettling, because our visual system is very good at detecting when something is not right with the world, and it tells us so by making us sick.

Even in non-immersive 3D graphics, too large a discrepancy between the real field of view (how much of our visual field the screen covers) and the programmatic field of view is known to cause motion sickness, and the same mismatch in immersive 3D graphics is much worse. FOV discrepancy is only one symptom of miscalibration, but it’s the one that’s easiest to demonstrate; the others are more subtle (but that’s a topic for another post). In the end, miscalibration is a nasty problem because it is subtle, very hard to correct, and causes significant ill effects.

I noticed these things when I started experimenting with my own HMDs a while ago (I have an eMagin Z800 3DVisor and a Sony HMZ-T1). I experimented with rapid motions, but those didn’t really make me dizzy. I did notice, however, that the world didn’t seem solid, but looked as if it were made of jelly. I had expected that, not having done proper calibration yet, so I used an interactive calibration utility to set up the system just so. After that, the world seemed stable, and interestingly I no longer noticed any issues from lag. Not having done any further experiments, my hunch is that miscalibration is actually a bigger problem than lag. (Disclosure: while I was using a low-latency Intersense IS-900 tracking system, the computer running the show was fairly old and the Quake3 renderer had no particular performance tweaking, so I estimated the total system delay at around 30 ms.)

So what’s the take-home message from this wall of text? If we want HMDs to succeed, we need to treat them properly in our graphics software. We need to use proper projection models instead of the standard camera model (but that’s a topic for another post), and not simply apply ad-hoc stereo models such as toe-in (but that’s a topic for another post). Those might work for a demo, but they won’t be pretty, and they will make our users sick. Instead, we need to know exactly how the HMD is laid out internally (screen placement and size, the effects of the optical system in front of the screens, lens distortions, etc.), and, just as importantly, we need to know exactly where the viewer’s eyes are with respect to the screens (see Figure 1). This last part is the hard one. Maybe a future perfect HMD will contain one pair of stereo cameras per screen to accurately track the viewer’s pupils and let the graphics software set up the projection parameters correctly, no matter how the HMD is worn and how the viewer moves. But until then, we need a practical approach: simple methods to calibrate HMDs on the fly, and clear instructions so users actually apply them.

Well, and, of course, we mustn’t forget about minimizing lag, either. That would be too easy.

Oh, and by the way, want to get a quick glimpse of just how immersive the Oculus Rift will be (going by current specs)? If your monitor is X inches wide, put your eye X/2 inches in front of the monitor’s center — that’s about what it will look like. If you want to play a first-person game from that viewpoint and have it look right, set the horizontal field of view to 90 degrees.
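
Why 90 degrees? Because at a distance of half the screen width, each screen edge sits exactly 45 degrees off-center. A two-line check, with the width chosen arbitrarily:

    import math

    X = 24.0  # monitor width in inches; any value gives the same answer
    distance = X / 2.0
    fov = 2.0 * math.degrees(math.atan((X / 2.0) / distance))
    print(fov)  # 90.0 degrees, independent of X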