**Note:** This is part 3 of a four-part series. [Part 1] [Part 2] [Part 4]

In the previous part of this ongoing series of posts, I described how the Oculus Rift DK2’s tracking LEDs can be identified in the video stream from the tracking camera via their unique blinking patterns, which spell out 10-bit binary numbers. In this post, I will describe how that information can be used to estimate the 3D position and orientation of the headset relative to the camera: the first important step in full positional head tracking.

3D pose estimation, or the problem of reconstructing the 3D position and orientation of a known object relative to a single 2D camera, also known as the perspective-n-point problem, is a well-researched topic in computer vision. In the special case of the Oculus Rift DK2, it is the foundation of positional head tracking. As I tried to explain in this video, an inertial measurement unit (IMU) by itself cannot track an object’s absolute position over time, because positional drift builds up rapidly and cannot be controlled without an external 3D reference frame. 3D pose estimation via an external camera provides exactly such a reference frame.

So why was it so important to be able to identify LEDs in the tracking camera’s video stream? At its heart, 3D pose estimation is a multi-dimensional non-linear optimization problem. Given a known model, i.e., a collection of 3D points such as the DK2’s tracking LEDs, a camera with known intrinsic parameters, and a set of 2D points in the camera’s image, such as the set of extracted LED blobs, one can try to reconstruct the unknown position (tx, ty, tz) and orientation (yaw, pitch, roll) of the model with respect to the camera (in reality, one would never use yaw, pitch, and roll angles to do this, but that’s a technical detail). In theory, the approach is simple. Given a candidate set of unknown parameters (tx, ty, tz, yaw, pitch, roll), one takes the set of 3D model points, transforms them by the rigid body transformation defined by the six parameters, projects them into image space using the camera’s intrinsic parameters, and then calculates the sum of their squared distances from the true observed image points (this is called reprojection error). This process defines an error function F(tx, ty, tz, yaw, pitch, roll), and the problem is reduced to finding the set of parameters that globally minimizes the value of the error function.
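To make the error function concrete, here is a minimal sketch of evaluating the reprojection error for one candidate parameter set. This is illustrative only: it uses a bare pinhole camera model, hypothetical names, and Euler angles for readability, even though (as noted above) a real implementation would parameterize rotation differently.

```python
import numpy as np

def euler_to_matrix(yaw, pitch, roll):
    """Rotation matrix from yaw/pitch/roll angles (illustration only; a real
    implementation would avoid Euler angles, as the text points out)."""
    cy, sy = np.cos(yaw), np.sin(yaw)
    cp, sp = np.cos(pitch), np.sin(pitch)
    cr, sr = np.cos(roll), np.sin(roll)
    Rz = np.array([[cy, -sy, 0.0], [sy, cy, 0.0], [0.0, 0.0, 1.0]])
    Ry = np.array([[cp, 0.0, sp], [0.0, 1.0, 0.0], [-sp, 0.0, cp]])
    Rx = np.array([[1.0, 0.0, 0.0], [0.0, cr, -sr], [0.0, sr, cr]])
    return Rz @ Ry @ Rx

def reprojection_error(params, model_points, observed, focal_length):
    """Sum of squared distances between projected model points and their
    associated observed image points -- the error function F described above."""
    tx, ty, tz, yaw, pitch, roll = params
    R = euler_to_matrix(yaw, pitch, roll)
    t = np.array([tx, ty, tz])
    cam = model_points @ R.T + t                      # rigid-body transform
    projected = focal_length * cam[:, :2] / cam[:, 2:3]  # pinhole projection
    return np.sum((projected - observed) ** 2)
```

Minimizing this function over the six parameters is then a standard non-linear least-squares problem.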

But here’s a problem: Given a predicted image point, *which one* of the observed image points should it be compared to? Without knowing anything else, one would have to test all possible associations of observed and predicted image points, and pick the association which yields the smallest error after optimization. Unfortunately, there are a lot of possible associations; in general, if there are N predicted image points and M<=N observed image points, then there are N!/(N-M)! possible associations. To pick an example, for N=40 (number of LEDs on DK2) and M=20, there are 335,367,096,786,357,081,410,764,800,000 potential associations to test, and that’s a large number even for a computer. There are many heuristic and/or iterative methods to establish associations automatically, but they tend to be rather slow and fragile. The best approach, obviously, is to somehow make it possible to identify a-priori which observed image point belongs to which 3D model point, and the DK2’s flashing 10-bit patterns do exactly that. Well played, Oculus.
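The count quoted above is just the number of partial permutations, N!/(N-M)!, which is easy to verify:

```python
import math

def associations(n, m):
    """Number of ways to assign each of m observed image points to a
    distinct one of n predicted points: n! / (n - m)!"""
    return math.factorial(n) // math.factorial(n - m)

print(associations(40, 20))  # the 30-digit figure quoted in the text
```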

# The implementation

Pose estimation algorithms fall into two very broad categories: iterative methods and ab-initio algebraic methods. The former are based on relatively simple steps that take some approximation to the desired pose, and calculate another slightly different pose that hopefully yields a smaller reprojection error, and then repeat that same process until the approximation is considered good enough, or does not improve any further. The latter algorithms tend to be more complex, and typically apply linear algebra methods to directly calculate pose candidates in a single step.

In general, iterative methods can be fast and highly accurate, but they require an initial (guessed) pose that is already close to the ideal pose, or they might take very long to find a good solution, or don’t find one at all. Ab-initio methods, on the other hand, tend to be slower and less accurate, but they can find a good pose estimate without requiring an initial guess (and it’s *so* embarrassing when the guess turns out wrong and tracking comes to a grinding halt).

For my experiments, I chose a combination of an ab-initio method and an iterative method. The ab-initio method is the one described in a 2008 paper by Lepetit, Moreno-Noguer, and Fua. The authors claim it is faster and more accurate than other ab-initio methods, and I took their word for it (an added benefit is that it is relatively easy to implement). As it turns out, Lepetit et al.’s method is fairly robust, but its accuracy could be better. I decided to improve it by adding a few (50) steps of an iterative pose estimation method, using the final result of Lepetit et al.’s algorithm as input for the iterative algorithm. Additionally, when there is already a good pose estimate (judged via its reprojection error) from the previous frame, I skip the ab-initio stage altogether and go right into iterative optimization.

For the iterative method, I jumped on the wayback machine and dusted off the 3D pose estimation algorithm I wrote for my Wiimote tracking experiments in 2007. It is a straight application of the non-linear optimization strategy I described above, using the Levenberg-Marquardt algorithm (LMA). LMA is nice because it adapts itself between the robustness of gradient descent and the speed of Gauss-Newton iteration (GNA). In this special application, where I have very good initial pose estimates (either from the ab-initio stage or from the previous video frame), straight-up Gauss-Newton would probably get better performance, but you go with what you have.
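For illustration, here is a minimal Levenberg-Marquardt loop with a finite-difference Jacobian. This is not the author's code, and it is shown on a generic residual function rather than the full 6-parameter pose problem; the adaptive damping factor is what blends gradient descent (large lambda) with Gauss-Newton iteration (small lambda).

```python
import numpy as np

def levenberg_marquardt(residual_fn, x0, n_steps=50, lam=1e-3):
    """Minimal LMA sketch: damped Gauss-Newton steps on a residual vector,
    with the damping factor adapted based on whether a step helped."""
    x = np.asarray(x0, dtype=float)
    eps = 1e-6
    err = np.sum(residual_fn(x) ** 2)
    for _ in range(n_steps):
        r = residual_fn(x)
        # Numeric Jacobian: J[i, j] = d r_i / d x_j
        J = np.empty((r.size, x.size))
        for j in range(x.size):
            dx = np.zeros_like(x); dx[j] = eps
            J[:, j] = (residual_fn(x + dx) - r) / eps
        # Solve the damped normal equations (Marquardt's diagonal scaling):
        JTJ = J.T @ J
        delta = np.linalg.solve(JTJ + lam * np.diag(np.diag(JTJ)), -J.T @ r)
        new_err = np.sum(residual_fn(x + delta) ** 2)
        if new_err < err:            # accept step, shift toward Gauss-Newton
            x, err, lam = x + delta, new_err, lam * 0.5
        else:                        # reject step, shift toward gradient descent
            lam *= 10.0
    return x
```

For pose refinement, the residual vector would be the per-LED differences between projected and observed image positions, as in the reprojection error described earlier.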

# Improved LED Identification

With a pose estimation algorithm in place, it is now possible to go back and improve the LED identification algorithm. Before, my code identified an LED by observing its flashing patterns over at least 10 frames, and continually updating its bit pattern as the video stream progressed. Due to the differential detection mechanism, rapid movements of the headset in the camera’s depth direction would sometimes lead to bit errors, which could often be corrected relying on the LED patterns’ Hamming distance of 3. But the rare bit error would slip through, and a misidentified LED is *very* bad news for both ab-initio and iterative pose estimation algorithms. The new algorithm has two cases. If there is no good pose estimate yet (at the beginning of tracking, or after tracking is temporarily lost), the old algorithm is applied, and it takes 10 frames or more to regain LED IDs and pose estimation. If, on the other hand, there is a good pose for the current frame, then the new algorithm projects all LEDs that should be visible to the camera, based on the current pose estimate and the LEDs’ emission directions (which are fortunately provided by the DK2’s firmware), into the camera’s image where the next iteration of the LED identification loop will pick them up. As a result, if a new LED comes into view due to changes in the headset’s position or orientation, it will be picked up immediately instead of 10 or more frames later. And as long as the pose estimation loop maintains a good estimate, the original identification algorithm won’t be run at all. The results are solid, as can be seen in the video below:
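The projection step of the new identification algorithm can be sketched like this (assumed names and a bare pinhole projection; the visibility test simply checks that an LED's emission direction faces the camera):

```python
import numpy as np

def predict_visible_leds(led_positions, led_directions, R, t, focal_length):
    """Given a pose estimate (R, t) mapping headset space to camera space,
    project every LED whose emission cone faces the camera into the image.
    Returns a dict from LED index to predicted 2D image position."""
    predictions = {}
    for i, (p, d) in enumerate(zip(led_positions, led_directions)):
        p_cam = R @ p + t          # LED position in camera space
        d_cam = R @ d              # emission direction in camera space
        if p_cam[2] <= 0.0:
            continue               # behind the camera
        # Visible only if the LED emits toward the camera, i.e. the angle
        # between its emission direction and the direction back to the
        # camera is less than 90 degrees:
        if np.dot(d_cam, -p_cam) <= 0.0:
            continue
        predictions[i] = focal_length * p_cam[:2] / p_cam[2]
    return predictions
```

The predicted positions, labeled with their LED indices, are then handed to the next iteration of the identification loop, which only has to confirm them instead of accumulating ten fresh bits.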

# Next steps

As can be seen in the video, there are still tracking drop-outs, where the LED identification and pose estimation algorithms cannot keep up with fast headset movements. This is because neither algorithm has any idea how the headset moves between video frames, which arrive only every 16.666ms. If an LED blob moves by more than 10 pixels (arbitrary cut-off) between frames, the algorithm will no longer treat it as the same LED, and it will lose its accumulated ID bits and have to acquire 10 new ones.

Fortunately, the DK2’s IMU can be used to predict the motion of the headset between frames, which I expect (hope?) will make LED identification robust in the presence of fast motions.

The other main benefit of combining IMU measurements with optical pose estimation is noise reduction. Due to random noise, exacerbated by the LEDs’ flashing patterns, raw position and orientation estimates are rather noisy, especially if only a few LEDs are visible, and even more so if those few are (almost) coplanar. The pose estimation algorithm cannot do much to reduce this noise on its own, but the IMU provides an independent measurement of relative motion: if the pose estimator reports a relative motion that is not compatible with the IMU readings, that motion can be filtered out as noise.

This means the next big step is to combine the IMU-based tracking algorithm that I already have with the new pose estimation algorithm.

# Pose estimation performance and quality

I still have to run thorough tests, but here are some early estimates of the algorithm’s performance and the quality of the results.

I measured the pose estimation loop’s wall-clock run-time on my 2.8 GHz Intel Core i7 CPU. This loop includes:

- Extract bright blobs from the incoming 8-bit greyscale video frame.
- Create an 8-bit RGB output video frame in which identified blobs are colored bright green (this is only for debugging/visualization purposes, and would be removed in a production tracking implementation).
- Match roughly disk-shaped blobs from the current frame with those extracted from the previous frame, to feed their size differences into the differential bit decoder, and accumulate their 10-bit IDs and LED associations.
- Write all identified blobs to a file (also only for debugging).
- If there are four or more identified blobs and no good current pose estimate, run the ab-initio pose estimation method.
- If there are four or more identified blobs, run the iterative pose estimation method on the current pose, which comes either from the previous frame or the ab-initio method.
- If there is a good current pose, project all theoretically visible LED positions into the image frame and label them with their respective 10-bit IDs and LED indices, to aid LED identification in the next frame.
- Sort all extracted or projected 2D LED positions into a kd-tree for fast matching in the next frame.
- Push all results (colored video frame, current pose estimate, current set of identified LEDs) to the main thread for real-time visualization.
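The frame-to-frame blob matching step, with the roughly 10-pixel cutoff mentioned later, can be sketched as a nearest-neighbor search. The brute-force version below stands in for the kd-tree lookup the loop actually uses:

```python
import numpy as np

def match_blobs(prev_blobs, curr_blobs, max_dist=10.0):
    """Greedy nearest-neighbor matching of blob centroids between frames.
    A blob that moved farther than max_dist pixels is treated as new, losing
    its accumulated ID bits. (Brute-force sketch; the real loop sorts the
    previous frame's points into a kd-tree for fast lookups.)"""
    matches = {}   # index in curr_blobs -> index in prev_blobs
    used = set()
    for ci, c in enumerate(curr_blobs):
        dists = [np.hypot(c[0] - p[0], c[1] - p[1]) for p in prev_blobs]
        for pi in np.argsort(dists):
            if dists[pi] <= max_dist and int(pi) not in used:
                matches[ci] = int(pi)
                used.add(int(pi))
                break
    return matches
```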

Without any optimization (yet), the loop currently takes around 1.6ms to execute. A large chunk of that is creating the colored output frame. The combined ab-initio/iterative pose estimation algorithm takes around 0.16ms.

To get a rough idea of pose estimation quality, I placed both the tracking camera and the headset onto the same table, 25″ away from each other, and collected a long stretch of data (2326 frames) without touching or moving anything. According to LibreOffice Calc:

- Reprojection error RMS: average 0.164648 pixels, standard deviation 0.011238 pixels (this means that the lens distortion and intrinsic parameters I calculated are fantastic).
- x position: average -0.069664m, standard deviation 0.000023m.
- y position: average 0.032276m, standard deviation 0.000071m.
- z position (camera distance): average 0.632714m, standard deviation 0.000114m (so the headset was only 24.91″ from the camera — sue me).
- Yaw angle: average -179.286°, standard deviation 0.020°.
- Pitch angle: average 2.584°, standard deviation 0.155°.
- Roll angle: average 5.720°, standard deviation 0.107°.

In other words, in this artificial setup, position is precise to 0.1mm (with z being much worse than x and y), and orientation angles are precise to 0.15° (with pitch being worst). The latter isn’t great, but since IMUs are really good at orientation tracking, I expect that proper sensor fusion will improve that. Optical tracking and inertial tracking go together like, uh, two things that go really well together.

Not sure if it helps, but Oculus is using OpenCV for this work. From the obj filenames in the SDK along with the symbol names used in them, it’s pretty easy to see what OpenCV modules they are using.

I just found out that the ab-initio method I’m using is in OpenCV as an option for their PnP algorithm (it’s called epnp), and a Levenberg-Marquardt optimization-based method was the standard last time I checked. It would be interesting to know which method the official run-time is using.

Pingback: Hacking the Oculus Rift DK2 | Doc-Ok.org

Impressive post, thanks for the info

I saw you wrote a tutorial on using OpenCV’s POSIT algorithm. I was going to use POSIT for this myself, or rather I did, but it went to pieces because in most cases the camera sees only one face of the headset, and all those LEDs are almost coplanar. I was surprised (or I should say taken aback) by how badly POSIT worked on most test cases, even when the thickness of the 3D model was more than 10% of the width/height. I wasn’t expecting POSIT to break down that early.

But I found a great replacement in Lepetit et al.’s method, so I’m not looking back.

yes, you are right, it does not work with co-planar points.

Very interesting!

Awesome! I’m really pleased that so much progress is being made towards a driver for Oculus and other future HMD.

I’ve really enjoyed this and your other writeups. I really learned a lot following along, and it made the field substantially more accessible; I can’t thank you enough.

Is geometric algebra useful in your work? ie http://www.amazon.com/Geometric-Algebra-Computer-Science-Revised/dp/0123749425

Geometric algebra is another framework to think about geometry, and in some ways a more elegant one than “traditional” vector algebra. In the end it’s the same thing, just like C++ and Lisp are two different ways to think about programming, but any program you can write in one you can write in the other.

Specifically for 2D/3D geometry, I think neither of the two are at an appropriate abstraction level. In vector algebra, if you want to know the angle between two vectors, you need to think “aha, cross product,” in geometric algebra you need to think “aha, wedge product.” Doesn’t really make it easier, does it. What you should have instead is a way to directly ask “what’s the angle between these two vectors?” It gets much worse once you start thinking about transformations, and rotations especially.

What one should use is a higher-level geometry library that encapsulates geometric objects into classes with a use-driven interface. So a “Plane” class would have methods to get the distance to a point, intersect a ray, intersect two planes, etc., and whether a plane is expressed as a point and a normal vector or a bivector is irrelevant. That’s why the Vrui VR toolkit has a high-level geometry library doing exactly this. If you want to rotate a vector or point, you create a rotation object and say rot.transform(point), and you don’t need to know whether a rotation is implemented as a matrix or a quaternion or a rotor.

Ah yes I take your point about needing higher level tools, thanks for the thoughtful reply.

I’m working my way through “Linear and Geometric Algebra” by Macdonald as a recovering english major long ago turned into software developer.

I’m enjoying the material but was wondering how widely it is used; I don’t see it in course catalogs so far.

P.S. great to see you on github 🙂

(just subscribing to the comments feed)

Hi,

really nice work. Is it possible to obtain your implementation, or do you plan to put it on GitHub, so that we can use and contribute to your work?

With best regards,

Marc

https://github.com/Doc-Ok/OpticalTracking

Hi, I’ve tried to get it working, but I couldn’t determine the V4L device name. How did you do that?

I finally got that, but I’ve noticed that the Oculus USB ID 0x0021 your software uses differs from mine, 0x0201.

Best regards,

Marc

The DK2 system is listed as three separate USB devices. 0x0021 is the headset, 0x0201 is the camera, and 0x2021 is the device root hub. My software uses 0x0021 to talk to the headset; it talks to the camera via the Linux V4L2 interface, without using USB IDs. You can run “LEDFinder list” to display the names of all connected video input devices. “Camera DK2” is the name of the DK2’s tracking camera, and it will show up twice (once as a “quirky” device handled by my custom driver and once as a regular V4L2 device). LEDFinder will only see the first instance.

Very interesting work. Is there a method to force the sensed XYZ position and yaw/pitch/roll values into the Oculus SDK (or are you planning to use this work for your projects)?

If yes, I wonder whether there is any interest in implementing a positional solution for the DK1 and/or other devices. Either by strapping LEDs to it, or by using visible markers such as 2D barcodes.

Libdmtx will detect data-matrix codes in a scene AND give you the corner co-ordinates of each.

In addition do you think that it is possible to combine the results from multiple cameras to give positional coverage over a larger area. Once set up, the relative positions could be auto-learned by walking a DK2 through the area….

Simon.

The official run-time and SDK are separate processes, meaning it would be possible to replace the run-time with a different one while retaining the same communications protocol. This could enable things like multi-camera tracking while staying compatible with SDK-based applications.

I started working on a positional tracker for DK1 in August 2013, but I didn’t have enough time to get it anywhere. I 3D-printed a clamp with four LEDs that snaps onto the DK1’s front plate, to be tracked by a PS Eye or equivalent camera. I got pretty good initial tracking results out of that; I’m hoping that this work will allow me to get back to the DK1 tracker.

Another angle is that I want to use 3D motion capture systems I already have (NaturalPoint OptiTrack) to track one or more DK2s over a large area. For this I don’t need the optical tracking component, as that would be taken care of by the motion capture system, but I do need good sensor fusion.

Thanks for the fabulous post, it has great value (at least for me). Are you planning to post something on optical / sensor fusion?

I got side-tracked after finishing the optical component, and I’m no expert on sensor fusion unfortunately. That part is still on my to-do list, as it’s applicable to a lot of the work I’m doing — I need sensor fusion for PS Move tracking, and especially to fuse the inertial readings from a Wiimote+ with the optical tracking data from our “professional” tracking systems for more stable results.

There will be a follow-up post at some point in the future. 🙂

Thank you for great post.

I have a question about using EPnP to solve the initial pose estimation.

In EPnP, it assumes that the world coordinates of the reference points are known.

But after video frame identification, you determine the blobs with unique pattern.

How do you know the coordinates of these blobs?

(I assume that these blobs are treated as reference points.)

Once individual LEDs are identified via their blinking patterns, the 3D positions of those LEDs in headset-fixed coordinates come from the firmware (see previous posts in this series). Those are the world coordinate reference points needed by EPnP. As the points are in headset coordinates, the result is the camera pose in headset coordinates as well. But assuming that the camera doesn’t move, one can simply invert the transformation returned by EPnP and get the headset pose in camera space.
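Inverting the rigid transformation mentioned above is cheap: for x' = R x + t, the inverse is (R^T, -R^T t). A quick sketch:

```python
import numpy as np

def invert_rigid(R, t):
    """Invert a rigid transform x' = R x + t.
    Since R is a rotation, its inverse is its transpose,
    so the inverse transform is x = R^T x' - R^T t."""
    Rinv = R.T
    return Rinv, -Rinv @ t
```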

Thank you for response.

Another question:

After EPnP, an initial pose is obtained.

Then in a new frame, for example, only 10 blobs are observed and identified by blinking patterns.

I don’t understand how LM method is used to estimate pose in the way

” take some approximation to the desired pose, and calculate another slightly different pose that hopefully yields a smaller reprojection error”

Can you explain this in more detail?

Regards.

It’s an iterative method. Say you have a pose (P, O) (position, orientation) from camera frame n. Now you receive frame n+1, and use pose (P, O) and the camera’s known intrinsic parameters to project the known 3D positions of the LEDs in headset space to camera image space. You then compare those positions to the recognized LED blobs in the camera frame n+1 and calculate some error value e, usually sum of squared distances. Then you slightly tweak (P, O) and re-calculate the error value, which is now (hopefully) smaller. You keep doing that until the error drops below some threshold, stops dropping, or you run out of time. The result is pose (P’, O’) for time step n+1.

A good method to tweak a pose (P, O) is Levenberg-Marquardt optimization. It requires the error value at each step, and also the partial derivatives of error values with respect to the pose parameters. Fortunately, those derivatives are relatively easy to calculate analytically.

Pingback: 【你問我答專欄】請教Oculus rift的constellation和HTC Vive的lighthouse技術相比，何者較佳? Part2 | VR幼幼班


Hi, thanks for sharing your work. Are you planning on testing what happens when the tracker is moved/rotated? Under Windows with Oculus runtime and desk demo scene, the HMD camera orientation changes every second or so when a rotation/translation is applied to the IR cam…

That’s the dynamic self-calibration algorithm, which derives the camera’s orientation from the orientation reported by the headset’s inertial sensors. It uses a very long-baseline low-pass filter to avoid any kind of jitter in the camera orientation, which causes the slow reaction to moving the camera.

If that’s what you mean.

Yes indeed!

In our context, the HMD and IR tracker are moving together in the cabin of a real vehicle (car/plane/boat).

We’re trying to dodge the self-calibration behavior without rewriting the Oculus driver: any ideas?

That setup is fundamentally incompatible with how Oculus’ tracking driver does dead reckoning tracking with drift correction. The auto-calibration routine ensures that instantaneous motions detected by the Rift’s IMU line up exactly with delayed motions detected by the camera. This assumes that the camera stays still for long periods and only moves rarely, and discretely, i.e., it only moves when it’s picked up by the user and placed somewhere else. The auto-calibration algorithm is optimized to handle that case. If the camera moves continuously, this all breaks down.

Your only choice will be to disable dead reckoning, and use only optical tracking results, which won’t care whether the camera and headset are in a moving frame or not. You still need initial calibration to get the camera’s alignment with respect to gravity (so that the virtual horizon is level), but afterwards it will work. The downside is that purely-optical tracking has higher latency (at least 17ms more), and significantly more jitter especially in orientation.

I don’t think you can set Oculus’ tracking driver to optical-only. You might have to grab the optical-only tracking code I put on github and see if that’s still working (a recent firmware upgrade in the Rift might have changed some things in the USB protocol).

Would it be possible to add IMUs to the camera, and subtract the values reported by that from the values reported by the headset, and use that result for dead reckoning?

It would most probably help, but someone would have to try. While IMUs in the camera could not be used to track the position of the camera over time, for the same reason the camera is needed in the first place, they could be used to track its orientation relatively reliably, even in a moving reference frame under acceleration. That should be enough to fix this specific problem. The VR system would still need to know the real-world position and orientation of the car to create a convincing simulation of its movement in the virtual world.

There was that car commercial someone linked on reddit a while ago, where they put a DK2 in a car, and used a large-area 3D measurement system (RF-based, I think) to track the car over a race track. If they were to measure the car’s orientation, they could feed those values into the tracking server, and disable auto-calibration, to solve this problem without adding IMUs to the camera.