# A Closer Look at the Oculus Rift

I have to make a confession: I’ve been playing with the Oculus Rift HMD for almost a year now, and have been supporting it in Vrui for a long time as well, but I haven’t really spent much time using it in earnest. I’m keenly aware of the importance of calibrating head-mounted displays, of course, and noticed right away that the scale of virtual objects seen through the Rift was way off for me, but I never got around to doing anything about it. Until now, that is.

The primary reason for my negligence was that I didn’t know enough about the internal details of the Rift to completely understand how to account for different viewer parameters, i.e., mainly the positions of the viewer’s eyes in front of the screens/lenses. Vrui uses the viewer/screen camera model, which makes it very easy to calibrate a wide range of VR display environments because calibration is based entirely on directly measurable parameters: the size and position of screens, and the position of the viewer’s eyes relative to those screens, in real-world measurement units.

The Rift’s firmware exposes the basic required parameters: screen size and position, and distance from the viewer’s pupils to the screen. Screen size is reported as 149.76mm x 93.6mm for both halves, and because there’s only one physical screen, the left half starts at -74.88mm along the X axis, and the right half at 0.0mm, when using a convenient HMD-centered coordinate system. Both screens start at -46.8mm along the Z axis (I like Z going up), and the viewer’s pupils are 49.8mm from the screen. The Rift SDK assumes that the viewer’s face is symmetrical, i.e., that the left pupil is at -ipd/2 and the right at +ipd/2 along X where ipd is the viewer’s inter-pupillary distance in mm, and both are at 0.0mm along Z (by the way, Vrui is more flexible than that and handles asymmetric conditions without extra effort).

That would be all that’s needed to set up a perfectly calibrated display, if it weren’t for those darned lenses. The lenses’ image distortion, and their larger effects on the Rift’s viewing parameters, are treated as black magic by the Rift SDK and its documentation (quote: “Here, lens separation is used instead of the eye separation due to the properties of the collimated light”; sure, that explains it). The Rift’s firmware only reports the horizontal distance between the lenses’ centers and the coefficients for the post-rendering lens distortion correction shaders. It doesn’t report the lenses’ distance from the screen, or anything about the lenses’ optical properties.
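Ignoring the lenses for a moment, the firmware-reported numbers above are enough to write down a viewer/screen frustum for each eye. The following is only a minimal sketch of that idea, not Vrui's actual code; the function names and the `(x, z)` tuple layout are made up for illustration, and the lens effects discussed in the rest of this post are deliberately left out.

```python
# Viewer/screen model without lenses: the frustum is defined by the screen
# rectangle and the eye position, both in the same physical units (mm here).
# Axes follow the text: X to the right, Z up, HMD-centered origin.

HALF_WIDTH = 74.88   # width of one screen half, mm (149.76 / 2)
HEIGHT = 93.6        # screen height, mm
EYE_DIST = 49.8      # pupil-to-screen distance, mm


def frustum_from_screen(eye, screen_min, screen_max, eye_to_screen, near):
    """Return (left, right, bottom, top) of the frustum at the near plane.

    eye, screen_min and screen_max are (x, z) pairs in the screen plane;
    eye_to_screen is the perpendicular distance from pupil to screen.
    """
    scale = near / eye_to_screen  # similar triangles: screen -> near plane
    left = (screen_min[0] - eye[0]) * scale
    right = (screen_max[0] - eye[0]) * scale
    bottom = (screen_min[1] - eye[1]) * scale
    top = (screen_max[1] - eye[1]) * scale
    return left, right, bottom, top


def rift_frusta(ipd, near=1.0):
    """Frusta for both eyes, assuming a symmetric face as the SDK does."""
    half_h = HEIGHT / 2
    left = frustum_from_screen((-ipd / 2, 0.0), (-HALF_WIDTH, -half_h),
                               (0.0, half_h), EYE_DIST, near)
    right = frustum_from_screen((ipd / 2, 0.0), (0.0, -half_h),
                                (HALF_WIDTH, half_h), EYE_DIST, near)
    return left, right


if __name__ == "__main__":
    for ipd in (64.0, 60.0):
        print(ipd, rift_frusta(ipd))
```

Each eye's left and right extents differ in magnitude, so the frusta come out off-axis as soon as the pupil is not centered in front of its screen half, and changing `ipd` changes the skew directly.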

So when I tried doing what I normally do and configured the eye positions according to my own IPD, things got worse — virtual objects appeared even more distorted than when using the SDK-default eye separation (and fixed lens separation) of 64mm. That’s when I decided to shelve the issue for later, and then never got around to un-shelving it.

At least, until we started buying additional Rifts for an undisclosed project we’re cooking, and most test users reported issues with scale and motion sickness. We can’t have that, so deeper exploration was in order. Since the SDK documentation wasn’t helpful, and the Googles didn’t turn up anything useful either, I figured I’d have to write a small “HMD simulator” to finally grok what’s really going on, and enable proper calibration when running Vrui on the Rift. Here’s the result:

As this video is rather long (my longest, actually, at 21:50m), here’s the executive summary: lens distortion affects the viewer/screen model in rather counter-intuitive ways, and while there is a simple approximation to get things more or less right even for viewers with IPDs other than 64mm, doing it properly would require precise knowledge of the Rift’s lenses, i.e., their geometric shape and the index of refraction of their material. Oh, and it would require eye tracking as well, but I knew that going in. The good news is that I now know how to create the proper approximating calibration, and that doing it on a per-user basis is straightforward. Turns out the viewer/screen model works even with lenses involved.

Update: Following a suggestion by reader TiagoTiago, see this follow-up post for an improved approximation to correct 3D rendering under lack of eye tracking.

For future reference, let me set a few things straight that really threw me off when trying to parse the Oculus SDK documentation. The main document (Oculus VR SDK Overview, version 0.2.5) is almost obfuscating in its description, and if it had done a better job, it would have saved me a lot of time. So let’s complain a little.

At the top of page 23 (section 5.5.1), the doc says that “unlike stereo TVs, rendering inside of the Rift does not require off-axis or asymmetric projection,” which is then followed by two pages deriving — guess what — an off-axis or asymmetric projection matrix. Yeah. Multiplying an on-axis projection matrix (P in the doc) from the left with a translation matrix (H in the doc) results in an off-axis projection matrix. If you don’t believe me, calculate matrix P’ = HP yourself, invert it, and multiply the eight corners of the normalized device coordinate view volume, (±1, ±1, ±1, 1), with that matrix. The eight transformed vertices will indeed form a skewed truncated pyramid in eye space. What I don’t understand is why the doc makes a big deal out of the projection’s on- or off-axisness in the first place.
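For anyone who would rather let the computer do the checking, here is a small pure-Python sketch of that experiment. The shift `h = 0.15` and the near/far values are arbitrary illustrative numbers, not values taken from the SDK, and the helper names are made up; only the screen size and eye distance come from the firmware values quoted earlier.

```python
from itertools import product


def matmul(a, b):
    """Product of two 4x4 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def invert(m):
    """Gauss-Jordan inverse of a 4x4 matrix with partial pivoting."""
    n = 4
    aug = [list(map(float, row)) + [float(i == j) for j in range(n)]
           for i, row in enumerate(m)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def transform(m, v):
    """Apply a 4x4 matrix to a homogeneous point and divide by w."""
    out = [sum(m[i][k] * v[k] for k in range(4)) for i in range(4)]
    return [c / out[3] for c in out[:3]]


def eye_space_corners(h, half_w=74.88, height=93.6, eye_dist=49.8,
                      z_near=0.1, z_far=100.0):
    """Map the eight NDC corners (+-1, +-1, +-1, 1) back through (HP)^-1."""
    a = half_w / height              # aspect ratio from physical sizes
    t = height / (2 * eye_dist)      # tan(fov / 2)
    P = [[1 / (a * t), 0, 0, 0],
         [0, 1 / t, 0, 0],
         [0, 0, z_far / (z_near - z_far), z_far * z_near / (z_near - z_far)],
         [0, 0, -1, 0]]
    H = [[1, 0, 0, h], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
    inv = invert(matmul(H, P))
    return {c: transform(inv, list(c) + [1.0])
            for c in product((-1.0, 1.0), repeat=3)}


if __name__ == "__main__":
    for corner, point in sorted(eye_space_corners(0.15).items()):
        print(corner, point)
```

With `h = 0` the left and right x extents are mirror images of each other; with any nonzero `h` they are not, and the pyramid is skewed.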

Edit: I just found out I can directly embed $\LaTeX$ code into my posts! OMG!

While I’m talking about matrix P, the doc derives it through a field-of-view angle, following the convention used by the canonical camera model. First the doc calculates $\phi_\mathrm{fov} = 2 \cdot \tan^{-1}\bigl(\frac{\mathrm{VScreenSize}}{2 \cdot \mathrm{EyeToScreenDistance}}\bigr)$, but in the derivation of P, $\phi_\mathrm{fov}$ is only used in the form of $\tan (\phi_\mathrm{fov} / 2) = \tan \Bigl(2 \cdot \tan^{-1}\bigl(\frac{\mathrm{VScreenSize}}{2 \cdot \mathrm{EyeToScreenDistance}}\bigr) / 2\Bigr)$, which conveniently cancels out to $\frac{\mathrm{VScreenSize}}{2 \cdot \mathrm{EyeToScreenDistance}}$.

P also uses $a$, the screens’ aspect ratio, which the doc calculates as $a = \frac{\mathrm{HResolution}}{2 \cdot \mathrm{VResolution}}$, and that is conceptually wrong. Aspect ratio should be calculated as $a = \frac{\mathrm{HScreenSize} / 2}{\mathrm{VScreenSize}}$, i.e., in physical units like all other parameters. This makes sense, because HScreenSize/2 and VScreenSize are precisely the width and height of the screen half to which the projection matrix applies. What’s the difference, you ask? The result is the same, that much is true. But it’s only the same because the Rift’s LCD uses square pixels. Think about it: if the Rift’s screen had twice the horizontal resolution, but the same physical size, i.e., 1:2 rectangular pixels, the aspect ratio between horizontal and vertical field-of-view would still be 0.8, and not 1.6. I know LCDs typically have square pixels, but see below why having done it right would have simplified things.
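To make the square-pixel argument concrete, here is a tiny sketch comparing the two ways of computing the aspect ratio. The 1280×800 resolution is an assumption on my part (it is the panel resolution of the first development kit, and it is what makes the doc's formula come out at 0.8); the 2560×800 panel is the purely hypothetical 1:2-pixel screen from the thought experiment above.

```python
def aspect_from_resolution(h_resolution, v_resolution):
    """Aspect ratio of one screen half as the SDK doc computes it."""
    return h_resolution / (2 * v_resolution)


def aspect_from_size(h_screen_size, v_screen_size):
    """Aspect ratio of one screen half from physical dimensions."""
    return (h_screen_size / 2) / v_screen_size


if __name__ == "__main__":
    size_mm = (149.76, 93.6)  # physical size reported by the firmware
    for label, res in (("square pixels", (1280, 800)),
                       ("hypothetical 1:2 pixels", (2560, 800))):
        print(label,
              "by resolution:", aspect_from_resolution(*res),
              "by size:", aspect_from_size(*size_mm))
```

The size-based value does not depend on the pixel grid at all, which is the point: the field of view is a property of the physical screen, not of how finely it is sampled.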

In toto, P is presented in the doc as

$P = \begin{pmatrix} \frac{1}{a \cdot \tan(\phi_\mathrm{fov} / 2)} & 0 & 0 & 0 \\ 0 & \frac{1}{\tan(\phi_\mathrm{fov} / 2)} & 0 & 0 \\ 0 & 0 & \frac{z_\mathrm{far}}{z_\mathrm{near} - z_\mathrm{far}} & \frac{z_\mathrm{far} \cdot z_\mathrm{near}}{z_\mathrm{near} - z_\mathrm{far}} \\ 0 & 0 & -1 & 0 \end{pmatrix}$

where $a = \frac{\mathrm{HResolution}}{2 \cdot \mathrm{VResolution}}$ and $\phi_\mathrm{fov} = 2 \cdot \tan^{-1}\bigl(\frac{\mathrm{VScreenSize}}{2 \cdot \mathrm{EyeToScreenDistance}}\bigr)$.

Had the doc calculated aspect ratio correctly, based on physical sizes (see above), and not insisted on using a field-of-view angle in the first place, it could have derived P as

$P = \begin{pmatrix} \frac{2 \cdot \mathrm{EyeToScreenDistance}}{\mathrm{HScreenSize} / 2} & 0 & 0 & 0 \\ 0 & \frac{2 \cdot \mathrm{EyeToScreenDistance}}{\mathrm{VScreenSize}} & 0 & 0 \\ 0 & 0 & \frac{z_\mathrm{far}}{z_\mathrm{near} - z_\mathrm{far}} & \frac{z_\mathrm{far} \cdot z_\mathrm{near}}{z_\mathrm{near} - z_\mathrm{far}} \\ 0 & 0 & -1 & 0 \end{pmatrix}$

Wouldn’t that have been easier? (Update: If the above doesn’t look like a normal OpenGL projection matrix to you, then that’s because it isn’t one. Please see my correction post on the matter.)
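As a quick sanity check that the two forms of P really agree, the sketch below computes the two non-trivial diagonal entries both ways. The function names are made up for illustration, and the 1280×800 resolution is again an assumption not stated in the text; the physical sizes are the firmware values quoted earlier.

```python
import math


def p_entries_via_fov(h_resolution, v_resolution, v_screen_size, eye_dist):
    """(P[0][0], P[1][1]) following the doc: aspect ratio plus FOV angle."""
    a = h_resolution / (2 * v_resolution)
    fov = 2 * math.atan(v_screen_size / (2 * eye_dist))
    t = math.tan(fov / 2)
    return 1 / (a * t), 1 / t


def p_entries_via_sizes(h_screen_size, v_screen_size, eye_dist):
    """(P[0][0], P[1][1]) directly from physical sizes, no angle involved."""
    return (2 * eye_dist / (h_screen_size / 2),
            2 * eye_dist / v_screen_size)


if __name__ == "__main__":
    print(p_entries_via_fov(1280, 800, 93.6, 49.8))
    print(p_entries_via_sizes(149.76, 93.6, 49.8))
```

Both calls print the same pair of numbers up to floating-point rounding, but the second version never takes an arctangent only to undo it with a tangent.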

To be honest, the above complaints are mere nitpicks. What really caused problems for me was the calculation of the translation (or, rather, skew) matrix H. Based on the Rift’s internal layout, and the rules of 3D perception (and the viewer/screen model), the displacement value h should have been based on the viewer’s eye distance, not the Rift’s lens distance (this is also where the doc waves its hands and refers to the special “properties of the collimated light”).

Since this setup of H deviates from physical reality, and lens distortion correction is a black box, this is where I gave up on first reading. After having run the simulation, it is now perfectly clear: the truth is that using lens distance for h is not supposed to make sense — it’s a performance optimization. What’s really happening here is that a component of lens distortion correction is bleeding into projection set-up. Looking through a lens from an off-axis position introduces a lateral shift, which should, in principle, be corrected by the post-rendering lens distortion correction shader. But since shift is a linear operation, it can be factored out and be put into the projection matrix, where it incurs no extra cost while saving an addition in the per-pixel shader. So using lens distance for h in matrix H is a composite of using eye distance during projection, and the difference of eye distance and lens distance during distortion correction. That’s all perfectly fine, but optimization has no place in explanation — or as the master said, “premature optimization is the root of all evil.”
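The factoring argument can be checked in a few lines. This is a deliberately stripped-down sketch: it models only the constant lateral shift and ignores the radial part of the distortion correction entirely, and the values of `h_eye` and `h_lens` are made-up numbers rather than anything measured on a Rift.

```python
def ndc_x(x_eye, z_eye, p00, h):
    """Horizontal NDC coordinate after applying H*P and the w divide."""
    w = -z_eye                      # clip-space w for a point at depth z_eye
    return (p00 * x_eye + h * w) / w


def shader_shift(x_ndc, delta):
    """A constant per-pixel shift, as a post-rendering shader might add."""
    return x_ndc + delta


if __name__ == "__main__":
    p00 = 2 * 49.8 / 74.88          # from the physical sizes quoted earlier
    h_eye, h_lens = 0.10, 0.15      # hypothetical shift values
    for x, z in ((0.0, -1.0), (0.3, -2.0), (-1.2, -5.0)):
        folded = ndc_x(x, z, p00, h_lens)
        two_step = shader_shift(ndc_x(x, z, p00, h_eye), h_lens - h_eye)
        print(folded, two_step)
```

Projecting with the eye-based shift and then adding the eye-to-lens difference per pixel lands on the same coordinate as projecting once with the lens-based shift, which is why the shift can be folded into H for free.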

I don’t want to get into the derivation of lens distortion correction and field-of-view (section 5.5.4 in the doc), where $\phi_\mathrm{fov}$ makes an appearance again, besides saying that using physical sizes instead of field-of-view would have made this simpler as well.

So I guess what I’m really saying is that explaining the Rift’s projection setup using the viewer/screen model would have been considerably better than bending over backwards and framing it in terms of the canonical camera model. But I think that’s what they call fighting a lost battle.

## 32 thoughts on “A Closer Look at the Oculus Rift”

1. (continuing the discussion from http://doc-ok.org/?p=730#comment-1248 )

What I mean is, since any ray entering the pupil perfectly straight is always gonna be passing thru the center of the eyeball, couldn’t you project the environment on an imaginary sphere sharing the same center as the eyeball, and calculate the distortions that way? I guess my original question could be rephrased as: What would change if you just consider all directions as being stared at straight on, instead of calculating the peripheral vision for each direction?

• That clarifies it. It’s true, when the eye focuses on a 3D point, the ray from that point enters the pupil on-axis, and in that sense a 3D projection onto an imaginary sphere could simulate that. But the problem is that at any time, the eye can only point in a single direction, and at that time, all rays from other points on the imaginary sphere would enter the pupil at a different angle than prescribed by the model, and the instantaneous impression on the retina would be distorted. I would have to write a new simulator to properly explain it, but the world would still wobble like Jell-O as the eye flits around. While the effect would be different in detail from what’s happening now, it would be just as wrong.

• Would the distortions be as noticeable if the correction targeted the straight-on rays for all eyeball positions, instead of correcting the peripheral vision for a static eyeball?

• OK, I think what you’re describing is the equivalent of placing the virtual camera position at the center of each eyeball, and leaving everything else the same. I just tried that with the simulator I’m using in the video, and the effect would be as you describe: as the viewer foveates on any point in the 3D scene, that point is perceived at the proper position and with locally correct size, but everything else appears distorted, and the distortions change rapidly as the viewer’s eyes dart around the scene. It would be hard to do a real test in the Rift without first getting rid of all the other distortions, but I have a hunch that it would feel very weird indeed.

• Short answer: no. 🙂 See follow-up post.

And thank you.

2. Btw, is there such a thing as a digital camera that has the same size and optical properties as the human eye? Are they affordable?

If the answer to both is yes, I was thinking, perhaps you could make a rig with a few servos to turn the camera, then show some test patterns rendered in 3D in the Oculus Rift and automatically extract a model of its optics, or calibration parameters to counter-distort 3D scenes?

• I don’t think it’s even necessary to have a camera with the same properties as the human eye. I was thinking about a calibration rig where you remove the lenses from the Rift and put them onto a printed checkerboard pattern, and place a normal camera or webcam at the position of the viewer’s left or right eye relative to the lens. Then you take a picture, and run a standard lens distortion estimation algorithm. This would result in a set of correction parameters for each point you sample, and you could pick the optimal set of parameters for a given viewer setup and apply it. If you had eye-tracking, it could even be dynamic and real-time.

Still, having a precise geometric model of the lens(es) and calculating correction parameters on-the-fly would be my preferred approach.

• Wouldn’t it be better to not disassemble it, to make sure everything stays exactly in place, and to not ignore unknown details that might be present (like distance between the pixels and the lens, the angle of the screen relative to the lens etc)?

And since it’s a screen, you could have the program test the validity of the calculations immediately and automatically by applying the estimated calibration and checking if everything is where it is supposed to be.

• Fortunately, the Rift’s lenses are meant to be exchangeable. But you’re right, the removable lens holders don’t sit directly on the screen, so precise lens-screen distance remains unknown. Screen/lens angle is 0 degrees based on design.

3. It sounds like an HMD really needs a dynamic lens positioner that would scan the wearer’s face, determine the eye separation, distance from the screen and any other “yaw” type variances and auto-adjust the lenses into place. An “automatic” solution would be ideal, but there is no reason to think that one couldn’t write a manual calibration tool. However, I don’t think that the current Rift lets you finely adjust the position of the lenses in order to perform such a calibration.

• Just knowing the viewer’s pupil positions relative to the lenses in real time would be the major step. Being able to adjust the lens separation to exactly match the viewer’s eye separation would be the icing on the cake.

4. Pingback: VR Movies | Doc-Ok.org

5. Pingback: How to Measure your IPD | Doc-Ok.org

6. I am wondering what the lens correction is doing in your simulator. Is it for image distortion due to the lens? Or is there more to it?

• In this simulation, the lens correction undoes every effect of the lens on the display. Besides radial distortion and other side effects, this also includes virtual screen projection and magnification.

• I actually downloaded and tried the lens simulator, but it seems I am unable to replicate the effect of the “oculus rift hack” as it appears in your video. Is it possible the newer version of your lens simulator does something different? I downloaded through the link you shared in the Youtube comments.

7. I am experimenting with the Rift and a toed-in stereo camera rig placed in front of it. The images from the cameras are displayed in front of each eye on a rectangular plane in front of each virtual camera. The distance of such a plane from the virtual camera is unimportant, since its size and scale are computed so that the real camera FOV is correctly mapped into the virtual camera FOV.

Now, there is a discrepancy in eye convergence when looking (through the Rift) at a virtual object at 1 unit from me and a real object at 1 meter from the real camera (I want it to be the same). I assume such a discrepancy can be solved by toeing in the two real cameras at the right angle, but I need to know the relationship between distance (in units) from the virtual camera and the corresponding eye convergence in the Rift for that virtual distance.

In the scene, the virtual cameras are parallel and use the SDK matrix. Can I get such information from it and decide the toe-in angle accordingly? Or is there no other way than doing it experimentally?

• Step 1: Don’t toe in the physical cameras. If you do, you must toe in the rectangles on which you display the camera video by the same angle, but that only works if the cameras are in the exact same position as the viewer’s real eyes, otherwise toe-in affects the apparent inter-pupillary distance of the captured stereoscopic video.

There is only one way to capture stereoscopic video that matches real-world perception and perception of virtual objects 1:1, and that is to have the optical centers of the physical cameras located in the optical centers of the user’s eyes (i.e., pupils or eyeball rotation center, for different trade-offs). Without gouging out the user’s eyes, this can only be achieved by folding the camera’s light path, as in this diagram:

Without a rig like that, the closest approximation to 1:1 stereo pass-through is to place the stereo cameras such that the distance between their optical centers matches the user’s inter-pupillary distance, have the cameras pointed straight ahead (zero toe-in), and push them as close to the user’s eyes as possible.

• Thank you for the answer!

Yes, I read in other research papers that mirrors should be used. My goal is to experiment with video see-through using high field-of-view cameras (mirrors are way too complex for me and would reduce that FOV) and mono or stereo computer vision at the same time.
So, when using fisheye lenses for example my rig is positioned parallel. With standard wide-angle lenses instead, I want all possible stereo FOV for stereo vision and this is why I toe them in.

Now yes planes are accordingly rotated to correct keystoning and camera sensors are at IPD. I tried this setup and I could match convergence of eyes when focusing on a real marker and a virtual object on top of it… but only at a specific distance. Such distance depends on the toe-in angle.

So what you are saying is that the less I toe in the cameras, the larger the distance range over which virtual and real stereo will match? Since Oculus virtual cameras use off-axis matrices (and not simple parallel cameras), shouldn’t there exist a corresponding toe-in configuration that approximates it (with keystoning)? Real cameras could then match that angle for best approximation. This was the idea.

• The most important consideration to get pass-through AR right is the distance between the optical centers of the cameras and the user’s eyes. If those two match, the orientation of the cameras (toe-in or not or even toe-out) is less important, as it would only affect the resulting field of view.

If the optical centers do not coincide, then the only viewing direction for which reality and pass-through match is the line through both optical centers, and the only point in space where 3D appears correct is the intersection between the left- and right-eye lines (if they intersect at all, that is). Meaning you could fine-tune camera placement for an intended ideal viewing point by sliding the cameras along a line parallel to the user’s interpupillary vector. The toe-in angle between the cameras doesn’t affect this, as long as you account for it correctly by drawing the video feeds at the correct position and orientation.

The deviation from ideal representation gets larger as the distance between the cameras’ and eyes’ optical centers increases, so you want to make that as small as possible.

The internal geometry of the HMD, and the projection matrices it uses, have nothing to do with it.

• Ok so summing up my setup is: two cameras in toe-in configuration, optical center in front of each eye. In the oculus, on each eye a different plane is rendered in the front and can be rotated to tackle keystoning. A marker is detected by left camera and a virtual object is placed in the scene (using left virtual camera as reference): in both views virtual object appears perfectly overlapped.

Now with depth perception virtual object and marker obviously cannot match. But I would like to minimize such discrepancy. I saw that if the offset between the eye and the real camera is equal to the offset between the screen and the zero-parallax plane, the error is also a constant offset of the same size. I did a geogebra file to test it: http://tube.geogebra.org/student/m2066371
By moving Z and S points, target T one can play around with it. T’ is what is actually perceived by the eyes, so TT’ is the offset I am talking about.

The problem is I don’t know where the “screen” is in my case on the Oculus. Is it at infinity in my setup or somewhere else? If I knew, I could choose the toe-in angle accordingly. I assumed that since parallel cameras require shifting the images (or just using off-axis projection), there must be an equivalent toe-in version with the obvious keystoning drawback.

I then wish to experiment with it further: I can, for instance, apply this offset to the virtual object relative to the marker, or just leave it as it is.

• OK, I got it. There are two main choices. One, you can set things up such that the discrepancy between where objects are seen and where they are is constant across the entire field of view. To do that, you need to display your video frames by rooting the camera’s capture pyramid at the viewer’s eye position instead of the camera’s optical center. For example, if you have 90° FOV cameras looking straight ahead, you would draw two textured rectangles 2 units wide (one per camera), exactly 1 unit in front of the viewer’s left or right eye. Which unit you choose is completely arbitrary. As a result, if your cameras’ inter-“pupillary” distance matches the viewer’s, then all objects will appear offset by the vector from the cameras’ centers of projection to the viewer’s. At this point, toeing the cameras in or out makes zero difference, as long as the display rectangles are rotated by the same amount.

The second option is to choose one particular distance at which objects appear in their exact right position, and pay for that by increasing misalignment when moving away from that center distance. To do that, again for 90° FOV cameras, you draw a rectangle 2 units wide exactly 1 unit in front of the cameras’ centers of projection. Here, you choose your unit such that the rectangles are drawn at the distance where you want exact matches. This will deform all objects, and toeing the cameras in or out in this configuration will add non-linear distortions to the perceived objects. As this is hard to explain, here is a small simulation program to play with all this interactively.

• Thanks for the simulator. Looks awesome I will play with it and Vrui.

I have probably already inadvertently chosen the first option: using your example, at 90° the plane is 2 units wide and 1 unit away from the XYZ position of the virtual cameras used for the Oculus (which I think is what you referred to with “viewer’s eye position”). If I increase the distance, the dimensions of the plane also increase accordingly, so that the user cannot tell it changed.

I am not sure on the difference between rooting the camera’s capture pyramid and rooting camera’s optical center, so I hope I got it.

Regarding the second option, instead, you said: “you choose your unit such that the rectangles are drawn at the distance where you want exact matches”; you meant you change the plane distance without changing its dimension accordingly, am I right?

• When moving the image display rectangles, you have to adjust their sizes so that the apparent field of view from the chosen center of projection stays constant. You’ll see how I do it in the simulator.
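The sizing rule discussed in this thread boils down to one line of trigonometry. The sketch below is not taken from the simulator; the function names are made up, and it assumes an ideal pinhole camera with a symmetric horizontal field of view.

```python
import math


def rect_width(fov_deg, distance):
    """Width of a display rectangle that subtends fov_deg at the given distance."""
    return 2 * distance * math.tan(math.radians(fov_deg) / 2)


def apparent_fov_deg(width, distance):
    """Field of view a centered rectangle covers from the center of projection."""
    return math.degrees(2 * math.atan(width / (2 * distance)))


if __name__ == "__main__":
    for d in (1.0, 2.5, 10.0):
        w = rect_width(90.0, d)
        print(d, w, apparent_fov_deg(w, d))
```

For a 90° camera this reproduces the "2 units wide at 1 unit distance" rectangle from the example above, and scaling the width together with the distance leaves the apparent field of view unchanged.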