Standard camera model considered harmful

With apologies to Edsger W. Dijkstra (and pretty much everyone else).

So what’s wrong with the canonical 3D graphics camera model? To recap, in this model a camera is defined by a focus point (the “eye”), a viewing direction, an “up” vector, a screen aspect ratio, and a field-of-view angle (“fov”) (see Figure 1). Throw in a near- and far-plane distance, and together these parameters uniquely define a viewing frustum in model space, and hence the modelview and projection matrices required to render a view of a 3D scene. Nothing wrong with that, per se.
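As a point of reference, here is a minimal sketch of how the standard model's parameters are typically turned into a camera basis and a symmetric viewing frustum. The function names are made up for illustration, and the sketch assumes that the fov is the vertical field of view (as in GLU's gluPerspective); the text above does not say which one is meant.

```python
import math

def cross(a, b):
    """Cross product of two 3-vectors."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def normalize(v):
    """Scale a 3-vector to unit length."""
    length = math.sqrt(sum(c * c for c in v))
    return tuple(c / length for c in v)

def camera_basis(direction, up):
    """Orthonormal (right, up, forward) basis from a viewing direction and an "up" hint."""
    forward = normalize(direction)
    right = normalize(cross(forward, up))
    true_up = cross(right, forward)  # "up" made exactly perpendicular to forward
    return right, true_up, forward

def symmetric_frustum(fov_y_deg, aspect, near, far):
    """Frustum extents (left, right, bottom, top, near, far) on the near plane.

    Assumption: fov_y_deg is the full vertical field of view in degrees.
    """
    top = near * math.tan(math.radians(fov_y_deg) / 2.0)
    right = top * aspect
    return (-right, right, -top, top, near, far)
```

Note that the frustum is always centered on the viewing direction; nothing in these parameters describes where a physical screen sits relative to the viewer's eyes.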

Figure 1: The standard 3D graphics camera model, defined by a focus point position, viewing direction, “up” vector, field-of-view angle, and screen aspect ratio (ratio of screen width to screen height, not shown in diagram).

The problem arises when this same camera model is applied to (semi-) immersive environments, such as when one wants to adapt an existing graphics package or game engine to, say, a 3D TV or a head-mounted display with only minimal changes. There are two main problems with that: for one, the standard camera model does not support proper stereo image generation, leading to 3D vision problems, eye strain, and discomfort (but that’s a topic for another post).

The problem I want to discuss here is the implicit link between the camera model and viewpoint navigation. In this context, viewpoint navigation is the mechanism by which a 3D graphics application represents the viewer moving through the virtual 3D environment. For example, in a typical first-person video game, the player will be represented in the game world as some kind of avatar, and the camera model will be attached to that avatar’s head. The game engine will provide some mechanism for the player to directly control the position and viewing direction of the avatar, and therefore the camera, in the virtual world (this could be the just-as-canonical WASD+mouse navigation metaphor, or something else). But no matter the details, the bottom line is that the game engine is always in complete control of the player avatar’s — and the camera’s — position and orientation in the game world.

But in immersive display environments, where the user’s head is tracked inside some tracking volume, this is no longer true. In such environments, there are two ways to change the camera position: using the normal navigation metaphor, or simply physically moving inside the tracking volume. Obviously, the game engine has no way to control the latter (short of tethers or an electric shock collar). The problem is that the standard camera model, or rather its implementation in common graphics engines, does not account for this separation.

This is not really a technical problem, as there are ways to manipulate the camera model to yield the correct end result, but it’s a thinking problem. Having to think inside this ill-fitting camera model causes real headaches for developers, who run into lots of detail problems and edge cases when trying to implement correct camera behavior in immersive environments.

My preferred approach is to use an entirely different way to think about navigation in virtual worlds, and the camera model that directly follows from it. In this paradigm, a display system is not represented by a virtual camera, but by a physical display environment consisting of a collection of screens and viewers. For example, a typical desktop system would consist of a single screen (the actual monitor), and a single viewer (the user) sitting in a fixed position in front of it (fixed because typical desktop systems have no way to detect the viewer’s actual position). At the other extreme, a CAVE environment would consist of four to six large screens in fixed positions forming a cube, and a viewer whose position and orientation are measured in real time by a head tracking system (see Figure 2). The beauty is that this simple environment model is extremely flexible; it can support any number of screens in arbitrary positions, including moving screens, and any number of fixed or tracked viewers. It can support non-rectangular screens without problems (but that’s a topic for another post), and non-flat screens can be tessellated to a desired precision. So far, I have not found a single concrete display environment that cannot be described by this model, or at least approximated to arbitrary precision.

Figure 2: Photo of a CAVE environment consisting of four screens (three walls and one floor) and one viewer (in this case a camera on a tripod). Note how the image of the 3D protein model in the CAVE spans all four screens, but still appears seamless to the camera.

In more detail, a screen is defined by its position, orientation, width, and height (let’s ignore non-rectangular screens for now). A viewer, on the other hand, is solely defined by the position of its two eyes (two eyes instead of one to support proper stereo; sorry, spiders and Martians need not apply). All screens and viewers forming one display environment are defined in the same coordinate system, called physical space because it refers to real-world entities, namely display screens and users.
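To make the two definitions concrete, here is a minimal sketch of how a screen and a viewer might be stored. The class and field names are hypothetical, and representing the screen's orientation by two unit vectors along its width and height is just one possible choice. The example values describe an imagined desktop setup in meters, with X pointing right, Y pointing from the viewer toward the screen, and Z pointing up.

```python
import math
from dataclasses import dataclass
from typing import Tuple

Vec3 = Tuple[float, float, float]

@dataclass
class Screen:
    origin: Vec3   # lower-left corner, in physical space
    x_axis: Vec3   # unit vector along the screen's width
    y_axis: Vec3   # unit vector along the screen's height
    width: float
    height: float

    def point(self, u: float, v: float) -> Vec3:
        """Physical-space position at fraction (u, v) across the screen; (0, 0) is the lower-left corner."""
        return tuple(o + u * self.width * x + v * self.height * y
                     for o, x, y in zip(self.origin, self.x_axis, self.y_axis))

@dataclass
class Viewer:
    left_eye: Vec3    # eye positions, in the same physical space as the screens
    right_eye: Vec3

    def eye_separation(self) -> float:
        return math.dist(self.left_eye, self.right_eye)

# Hypothetical desktop environment: one 0.60 m x 0.34 m monitor, 0.6 m in front of a fixed viewer.
monitor = Screen(origin=(-0.30, 0.60, -0.17),
                 x_axis=(1.0, 0.0, 0.0), y_axis=(0.0, 0.0, 1.0),
                 width=0.60, height=0.34)
user = Viewer(left_eye=(-0.032, 0.0, 0.0), right_eye=(0.032, 0.0, 0.0))
```

A head-tracked environment would use the same structures; the tracking system would simply overwrite the viewer's eye positions every frame.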

How does this environment model affect navigation? Instead of moving a virtual camera through virtual space, navigation now moves an entire environment, i.e., a collection of any number of screens and viewers, through the virtual space, still under complete program control (fans of Dr Who are free to call it the “Tardis model”). Additionally, any viewers can freely move through the environment, at least if they’re head-tracked, and this part is not under program control. From a mathematical point of view, this means that viewers can freely walk through physical space, whereas physical space as a whole is mapped into the virtual world by the graphics engine, effected by the so-called navigation transformation.
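The split between the two kinds of motion can be sketched in a few lines. This is an illustration only: the Navigation class is hypothetical, and it restricts the navigation transformation to a rotation about the vertical (Z) axis plus a translation, whereas a real one could be any rigid or even scaling transformation.

```python
import math

class Navigation:
    """Program-controlled mapping from physical space into the virtual world.

    Assumption for this sketch: Z is up, and the transformation is a heading
    rotation about Z followed by a translation.
    """

    def __init__(self, heading_deg=0.0, position=(0.0, 0.0, 0.0)):
        self.heading_deg = heading_deg
        self.position = position

    def to_virtual(self, p):
        """Map a physical-space point into virtual-world coordinates."""
        a = math.radians(self.heading_deg)
        c, s = math.cos(a), math.sin(a)
        x, y, z = p
        px, py, pz = self.position
        return (c * x - s * y + px, s * x + c * y + py, z + pz)

nav = Navigation(position=(10.0, 0.0, 0.0))

# Head tracking reports the eye in physical space; the program cannot change this.
eye_physical = (0.0, 0.0, 1.7)
eye_start = nav.to_virtual(eye_physical)

# The user physically steps half a meter forward: only the tracked position changes.
eye_physical = (0.0, 0.5, 1.7)
eye_after_step = nav.to_virtual(eye_physical)

# The program navigates (say, via WASD): only the navigation transformation changes.
nav.position = (10.0, 5.0, 0.0)
eye_after_nav = nav.to_virtual(eye_physical)
```

The same to_virtual mapping would be applied to every screen corner as well, which is what moving the entire environment through the virtual space amounts to.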

At first glance, it seems that this model does nothing but add another intermediate coordinate system and is therefore superfluous (and mathematically speaking, that’s true), but in my experience, this model makes it a lot more straightforward to think about navigation and user motion in immersive environments, and therefore makes it easier to develop novel and correct navigation metaphors that work in all circumstances. The fact that it treats all possible display environments from desktops to HMDs and CAVEs in a unified way is just a welcome bonus.

The really neat effect of this environment model is that it directly implies a camera model as well (see Figure 3). Using the standard model, it is quite a tricky prospect to maintain the collection of virtual cameras that are required to render to a multi-screen environment, and ensure that they correspond to desired viewpoint changes in the virtual world, and the viewer’s motion inside the display environment. Using the viewer/screen model, there is no extra camera model to maintain. It turns out that a viewing frustum is also uniquely identified by a combination of a flat (rectangular) screen, a focus point position, and near- and far-plane distances. However, the first two components are directly provided by the environment model, and the latter two parameters can be chosen more or less arbitrarily. As a result, the screen/viewer model has no free parameters to define a viewing frustum besides the two plane distances, and the resulting viewing frusta will always lead to correct projections, which are also automatically seamless across multiple screens (but that’s a topic for another post).
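The following sketch shows one way the frustum could be derived from a screen and a focus point. The function names are hypothetical and this is not code from any particular toolkit. It assumes the screen is given by its lower-left corner, two unit vectors along its width and height, and its size, and it produces extents and matrices in the usual OpenGL convention (camera looking down the negative Z axis).

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def screen_frustum(origin, x_axis, y_axis, width, height, eye, near, far):
    """Off-axis frustum extents (left, right, bottom, top, near, far) on the near plane."""
    normal = cross(x_axis, y_axis)       # points from the screen toward the viewer
    to_origin = sub(origin, eye)         # from the eye to the screen's lower-left corner
    distance = -dot(to_origin, normal)   # distance from the eye to the screen plane
    if distance <= 0.0:
        raise ValueError("eye must be in front of the screen")
    scale = near / distance              # project screen edges onto the near plane
    left = dot(to_origin, x_axis) * scale
    right = (dot(to_origin, x_axis) + width) * scale
    bottom = dot(to_origin, y_axis) * scale
    top = (dot(to_origin, y_axis) + height) * scale
    return (left, right, bottom, top, near, far)

def frustum_matrix(l, r, b, t, n, f):
    """Projection matrix in the layout used by glFrustum (row-major nested lists)."""
    return [[2 * n / (r - l), 0.0, (r + l) / (r - l), 0.0],
            [0.0, 2 * n / (t - b), (t + b) / (t - b), 0.0],
            [0.0, 0.0, -(f + n) / (f - n), -2 * f * n / (f - n)],
            [0.0, 0.0, -1.0, 0.0]]

def view_matrix(x_axis, y_axis, eye):
    """Physical space -> eye space: axes aligned with the screen, eye at the origin."""
    normal = cross(x_axis, y_axis)
    return [[x_axis[0], x_axis[1], x_axis[2], -dot(x_axis, eye)],
            [y_axis[0], y_axis[1], y_axis[2], -dot(y_axis, eye)],
            [normal[0], normal[1], normal[2], -dot(normal, eye)],
            [0.0, 0.0, 0.0, 1.0]]

# Example: a 2 x 2 screen one unit in front of the origin (X right, Y forward, Z up).
screen = dict(origin=(-1.0, 1.0, -1.0), x_axis=(1.0, 0.0, 0.0),
              y_axis=(0.0, 0.0, 1.0), width=2.0, height=2.0)
centered = screen_frustum(eye=(0.0, 0.0, 0.0), near=0.5, far=100.0, **screen)
shifted = screen_frustum(eye=(0.5, 0.0, 0.0), near=0.5, far=100.0, **screen)
```

With the eye centered in front of the screen the frustum is symmetric; once the eye moves sideways the frustum becomes skewed so that its edges still pass through the screen's edges. Nothing but the screen, the eye position, and the two plane distances went into the calculation.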

Figure 3: The screen/viewer camera model defined by the position, orientation, and size of a screen, and the position of a focus point, in some 3D coordinate system. Apart from the near- and far-plane distances, the model has no free parameters other than those that can be measured directly via calibration and head tracking.

Looking at it mathematically again, one screen/viewer pair uniquely defines a viewing frustum in the physical coordinate space in which the screen and viewer are defined, and hence a modelview and a projection matrix. Now, the mapping from physical space to virtual world space is typically also expressed as a matrix, meaning that this model really just adds yet another modelview matrix. And since the product of two matrices is a matrix, it boils down to the same projection pipeline as the standard model. As I mentioned earlier, the model does not introduce any new capabilities; it just makes it easier to think about the existing capabilities.
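Written out, the chain of matrices might look as follows. The symbols are my own shorthand, not established notation: $M$ takes an object's model coordinates into the virtual world, $N$ is the navigation transformation from physical space into the virtual world, $V$ is the view matrix of one screen/viewer pair (physical space to eye space), and $P$ is that pair's projection matrix. Points are treated as column vectors, as in OpenGL.

```latex
\begin{aligned}
p_{\mathrm{clip}} &= P \, V \, N^{-1} \, M \, p_{\mathrm{model}} \\
\text{modelview} &= V \, N^{-1} \, M
\end{aligned}
```

The inverse appears because $N$ maps physical space into the virtual world, while rendering has to go the other way. Compared to the standard model, the only extra factor is $N^{-1}$, and it collapses into the modelview product like any other matrix.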

So the bottom line is that the viewer/screen model makes it simpler to reason about program-controlled navigation, completely removes the need for an explicit camera model and the extra work required to keep it consistent with the display environment, and — if the display environment was measured properly — automatically leads to distortion-free and seamless images even across multiple screens, and to always correct and eye strain-free stereo displays.

Although this model stems from the immersive environment world, applying it in the desktop realm has immediate practical benefits. For one, it supports proper stereo without extra work from the application developer; additionally, it supports flexible multi-display configurations where users can put their displays however they like, and get correct and seamless images without special application support. It even provides correct desktop head-tracking for free. Sounds like a win-win to me.

13 thoughts on “Standard camera model considered harmful”

  1. Wow, I understood only half of this. I will be thinking of this next time I see people bash stuff like the Rift, now and forever. Thanks, man!

  5. I wonder how the screen of an HMD like the Oculus Rift can fit in this viewer/screen model. In other words, what are the screen origin, width, height and focal point values when modeling the OR screen and the eyeballs of the user wearing it?
    (Yes, fancy English. Sorry, I’m not a native English speaker.)

    • It’s extremely straightforward. The screen position and origin are defined in some convenient local coordinate system, and the eye positions relative to the screen are measured and expressed in the same system. For example, querying the Oculus Rift’s firmware gives the following for the two screens, where the X axis goes from left to right and the Z axis goes up:

      Left screen:
      origin (-2.94803, 1.96063, -1.84252)
      width 2.94803
      height 3.68504

      Right screen:
      origin (0.0, 1.96063, -1.84252)
      width 2.94803
      height 3.68504

      All measurements are in inches.

      The left and right eye positions, in the same coordinate system, are then:

      leftEyePosition (-1.25, 0.0, 0.0)
      rightEyePosition (1.25, 0.0, 0.0)

      in this case, yielding an IPD of 2.5″ or 6.35 cm. The radial lens distortion correction is done by a post-rendering distortion shader, and works inside this coordinate system.

      Finally, the mapping from screen-relative to world coordinates is handled by the tracking unit.

      The beauty of the viewer/screen model is how easily it applies to all kinds of displays, including HMDs. There is no messing around with viewing directions and aspect ratios and FOVs and post-skewing matrices, like the Oculus Rift Unity embedding has to do.
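As a rough sketch of what can be done with the numbers quoted above, the per-eye frustum extents follow from a few subtractions. This assumes that each screen's origin is its lower-left corner, that its width runs along +X and its height along +Z, and that the screen plane is perpendicular to the Y axis; none of this is stated explicitly with the values, and the function names are made up for illustration.

```python
import math

# Screen and eye values as quoted above, in inches.
SCREENS = {
    "left":  {"origin": (-2.94803, 1.96063, -1.84252), "width": 2.94803, "height": 3.68504},
    "right": {"origin": (0.0, 1.96063, -1.84252), "width": 2.94803, "height": 3.68504},
}
EYES = {"left": (-1.25, 0.0, 0.0), "right": (1.25, 0.0, 0.0)}

def extents_at_screen(screen, eye):
    """Screen edges relative to the eye, plus the eye-to-screen distance along Y."""
    ox, oy, oz = screen["origin"]
    ex, ey, ez = eye
    left = ox - ex
    right = left + screen["width"]
    bottom = oz - ez
    top = bottom + screen["height"]
    distance = oy - ey
    return left, right, bottom, top, distance

def field_of_view_deg(low, high, distance):
    """Full angle in degrees subtended by the interval [low, high] at the given distance."""
    return math.degrees(math.atan2(high, distance) - math.atan2(low, distance))

for side in ("left", "right"):
    l, r, b, t, d = extents_at_screen(SCREENS[side], EYES[side])
    print(side, "eye: horizontal fov", field_of_view_deg(l, r, d),
          "vertical fov", field_of_view_deg(b, t, d))
```

Under these assumptions each eye gets a horizontally asymmetric frustum, with more of its screen on the outer side than on the nose side, and the two frusta are mirror images of each other. Scaling the four extents by near/distance gives the values a glFrustum-style projection would need.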

      • I thought that the lenses had to be taken into account and hence that things were more complicated than they really are. In my mind I was imagining a screen placed at an infinite distance but with finite area and I could not imagine how the dimensions of such a thing could be calculated.
        Having been taught the conventional camera model, I never thought about it the way you describe in this post. I’m fascinated by how easy it all suddenly becomes.

        • You’re bringing up an important detail. With HMDs, because the screens are attached to your head, you have the freedom to place them anywhere in 3D space by sliding them along the pyramid defined by the real screen and the real eye. Some HMDs use “virtual screens,” defined by their optics, which are some distance away, and typically set up such that the two view pyramids exactly intersect at the virtual screen position. This makes it simpler for old 3D software, because the left and right screens are in the same position. In those cases, the calibration data for the device would give the position of the virtual screen, and you would put that into the viewer/screen model. That’s how my Z800 and HMZ-T1 do it.

          The Oculus people weren’t concerned with backwards compatibility, so they went the simple route and used the real position and size of the screens.

  6. I’m trying to implement Head-Coupled Perspective using head tracking on a 2D or 3D monitor in OpenGL. I was planning on generating asymmetric frusta using the standard camera model, which as you said, is doable, but hard to wrap one’s head around. My question is, are you recommending just a new way of thinking about rendering immersive 3D graphics using a library such as OpenGL, or are you recommending a new way for library developers to model 3D environments? Would it be possible to do what you are suggesting in OpenGL, or would I have to use a graphics library which modeled environments using screens and viewers, or develop my own graphics library, or extend OpenGL with functions which generated a traditional camera model with asymmetric frusta from a screen/viewer model?

    • The viewer/screen model is just a different way of thinking about camera setup, but in the end it boils down to 4×4 projection matrices, just like any other. In other words, you need code somewhere that takes a viewer/screen description and creates the necessary matrices, but the benefit is you only need to do it once. You could think of this code as an OpenGL extension, or part of some OpenGL utility library (like the gluLookAt function in GLU), or part of the application.

      For example, my Vrui VR toolkit is based on OpenGL, and Vrui applications are expected to do their own rendering using OpenGL, but all matrix setup is done via the viewer/screen model. This happens underneath the toolkit API. Applications are shielded from having to set up and manage windows and matrices.

      Even if you have to make your own conversion code, the model still pays off when you need to combine head tracking with user-controlled motion such as mouse look or WASD walking. There’ve been a lot of questions on the Oculus developer subreddit about this, but in the viewer/screen model it’s trivial. Head tracking only moves the viewer, and explicit motion moves the coordinate system containing the (head-tracked) viewer and the screen.
