In the previous part of this ongoing series of posts, I described how the Oculus Rift DK2’s tracking LEDs can be identified in the video stream from the tracking camera via their unique blinking patterns, which spell out 10-bit binary numbers. In this post, I will describe how that information can be used to estimate the 3D position and orientation of the headset relative to the camera, which is the first important step towards full positional head tracking.
3D pose estimation, or the problem of reconstructing the 3D position and orientation of a known object relative to a single 2D camera, also known as the perspective-n-point problem, is a well-researched topic in computer vision. In the special case of the Oculus Rift DK2, it is the foundation of positional head tracking. As I tried to explain in this video, an inertial measurement unit (IMU) by itself cannot track an object’s absolute position over time, because positional drift builds up rapidly and cannot be controlled without an external 3D reference frame. 3D pose estimation via an external camera provides exactly such a reference frame.
So why was it so important to be able to identify LEDs in the tracking camera’s video stream? At its heart, 3D pose estimation is a multi-dimensional non-linear optimization problem. Given a known model, i.e., a collection of 3D points such as the DK2’s tracking LEDs, a camera with known intrinsic parameters, and a set of 2D points in the camera’s image, such as the set of extracted LED blobs, one can try to reconstruct the unknown position (tx, ty, tz) and orientation (yaw, pitch, roll) of the model with respect to the camera (in reality, one would never use yaw, pitch, and roll angles to do this, but that’s a technical detail). In theory, the approach is simple. Given a candidate set of unknown parameters (tx, ty, tz, yaw, pitch, roll), one takes the set of 3D model points, transforms them by the rigid body transformation defined by the six parameters, projects them into image space using the camera’s intrinsic parameters, and then calculates the sum of their squared distances from the true observed image points (this is called reprojection error). This process defines an error function F(tx, ty, tz, yaw, pitch, roll), and the problem is reduced to finding the set of parameters that globally minimizes the value of the error function.
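As a minimal sketch of the error function F described above, the following assumes that the point associations are already known, that the camera is an ideal pinhole without lens distortion, and that the rotation is composed as Ry(yaw)·Rx(pitch)·Rz(roll). The function names, the axis order, and all numbers in the example are assumptions made for illustration, not values from the DK2.

```python
import math

def rotation_matrix(yaw, pitch, roll):
    """Rotation R = Ry(yaw) * Rx(pitch) * Rz(roll), angles in radians.
    This axis order is an assumption made for the sketch."""
    cy, sy = math.cos(yaw), math.sin(yaw)
    cp, sp = math.cos(pitch), math.sin(pitch)
    cr, sr = math.cos(roll), math.sin(roll)
    ry = [[cy, 0.0, sy], [0.0, 1.0, 0.0], [-sy, 0.0, cy]]
    rx = [[1.0, 0.0, 0.0], [0.0, cp, -sp], [0.0, sp, cp]]
    rz = [[cr, -sr, 0.0], [sr, cr, 0.0], [0.0, 0.0, 1.0]]

    def matmul(a, b):
        return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
                for i in range(3)]

    return matmul(ry, matmul(rx, rz))

def project(point, pose, intrinsics):
    """Rigidly transform one 3D model point, then project it with a pinhole camera."""
    tx, ty, tz, yaw, pitch, roll = pose
    fx, fy, cx, cy = intrinsics  # focal lengths and principal point, in pixels
    r = rotation_matrix(yaw, pitch, roll)
    x, y, z = (sum(r[i][k] * point[k] for k in range(3)) + t
               for i, t in enumerate((tx, ty, tz)))
    return (fx * x / z + cx, fy * y / z + cy)

def reprojection_error(pose, model_points, image_points, intrinsics):
    """Sum of squared pixel distances between predicted and observed image points."""
    total = 0.0
    for point, (u, v) in zip(model_points, image_points):
        pu, pv = project(point, pose, intrinsics)
        total += (pu - u) ** 2 + (pv - v) ** 2
    return total

# Hypothetical model, intrinsics, and pose, used only to exercise the function.
model = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.0), (0.0, 0.1, 0.0), (0.0, 0.0, 0.1)]
intrinsics = (700.0, 700.0, 376.0, 240.0)
true_pose = (0.05, -0.02, 0.6, 0.1, 0.2, -0.05)
observed = [project(p, true_pose, intrinsics) for p in model]
print(reprojection_error(true_pose, model, observed, intrinsics))  # zero at the true pose
```

An optimizer would call reprojection_error repeatedly with different candidate poses and keep the one with the smallest value.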
But here’s a problem: Given a predicted image point, which one of the observed image points should it be compared to? Without knowing anything else, one would have to test all possible associations of observed and predicted image points, and pick the association which yields the smallest error after optimization. Unfortunately, there are a lot of possible associations; in general, if there are N predicted image points and M<=N observed image points, then there are N!/(N-M)! possible associations. To pick an example, for N=40 (number of LEDs on DK2) and M=20, there are 335,367,096,786,357,081,410,764,800,000 potential associations to test, and that’s a large number even for a computer. There are many heuristic and/or iterative methods to establish associations automatically, but they tend to be rather slow and fragile. The best approach, obviously, is to somehow make it possible to identify a-priori which observed image point belongs to which 3D model point, and the DK2’s flashing 10-bit patterns do exactly that. Well played, Oculus.
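The count quoted above is easy to check with exact integer arithmetic. This is just the formula N!/(N-M)! written out; the function name is made up for the example.

```python
import math

def count_associations(n, m):
    """Ways to assign m observed points to distinct ones of n predicted points."""
    return math.factorial(n) // math.factorial(n - m)

print(f"{count_associations(40, 20):,}")
```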
Pose estimation algorithms fall into two very broad categories: iterative methods and ab-initio algebraic methods. The former are based on relatively simple steps that take some approximation to the desired pose, and calculate another slightly different pose that hopefully yields a smaller reprojection error, and then repeat that same process until the approximation is considered good enough, or does not improve any further. The latter algorithms tend to be more complex, and typically apply linear algebra methods to directly calculate pose candidates in a single step.
In general, iterative methods can be fast and highly accurate, but they require an initial (guessed) pose that is already close to the ideal pose, or they might take very long to find a good solution, or fail to find one at all. Ab-initio methods, on the other hand, tend to be slower and less accurate, but they can find a good pose estimate without requiring an initial guess (and it’s so embarrassing when the guess turns out wrong and tracking comes to a grinding halt).
For my experiments, I chose a combination of an ab-initio method and an iterative method. The ab-initio method is the one described in a 2008 paper by Lepetit, Moreno-Noguer, and Fua. The authors claim it is faster and more accurate than other ab-initio methods, and I took their word for it (an added benefit is that it is relatively easy to implement). As it turns out, Lepetit et al.’s method is fairly robust, but its accuracy could be better. I decided to improve it by adding a few (50) steps of an iterative pose estimation method, using the final result of Lepetit et al.’s algorithm as input for the iterative algorithm. Additionally, when there is already a good pose estimate (judged via its reprojection error) from the previous frame, I skip the ab-initio stage altogether and go right into iterative optimization.
For the iterative method, I jumped on the wayback machine and dusted off the 3D pose estimation algorithm I wrote for my Wiimote tracking experiments in 2007. It is a straight application of the non-linear optimization strategy I described above, using the Levenberg-Marquardt algorithm (LMA). LMA is nice because it adapts itself between the robustness of gradient descent and the speed of Gauss-Newton iteration (GNA). In this special application, where I have very good initial pose estimates (either from the ab-initio stage or from the previous video frame), straight-up Gauss-Newton would probably get better performance, but you go with what you have.
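To illustrate how Levenberg-Marquardt moves between gradient descent and Gauss-Newton, here is a small generic sketch. It is not the code from my Wiimote experiments: it uses a forward-difference Jacobian, Marquardt-style diagonal damping, and a factor-of-ten damping update, all of which are assumptions chosen for brevity. To keep it short, the example fits a 2D rigid pose (tx, ty, angle) instead of the full 6-parameter 3D pose.

```python
import math

def solve(a, b):
    """Solve the small dense system a*x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

def levenberg_marquardt(residuals, params, steps=50, lam=1e-3, h=1e-6):
    """Minimize the sum of squared residuals, starting from params."""
    params = list(params)
    r = residuals(params)
    err = sum(v * v for v in r)
    n = len(params)
    for _ in range(steps):
        # Forward-difference Jacobian, stored as one column per parameter.
        cols = []
        for j in range(n):
            shifted = params[:]
            shifted[j] += h
            cols.append([(s - v) / h for s, v in zip(residuals(shifted), r)])
        jtj = [[sum(x * y for x, y in zip(cols[i], cols[j])) for j in range(n)]
               for i in range(n)]
        jtr = [sum(x * y for x, y in zip(cols[i], r)) for i in range(n)]
        # Damping: lam near 0 gives Gauss-Newton, large lam gives short gradient steps.
        damped = [[jtj[i][j] + (lam * jtj[i][i] if i == j else 0.0)
                   for j in range(n)] for i in range(n)]
        delta = solve(damped, [-g for g in jtr])
        trial = [p + d for p, d in zip(params, delta)]
        trial_r = residuals(trial)
        trial_err = sum(v * v for v in trial_r)
        if trial_err < err:
            params, r, err = trial, trial_r, trial_err
            lam /= 10.0  # step helped: trust the Gauss-Newton model more
        else:
            lam *= 10.0  # step hurt: fall back toward gradient descent
    return params, err

# Hypothetical 2D stand-in problem: recover (tx, ty, angle) of a square.
model = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]

def transform(p, pose):
    tx, ty, a = pose
    return (math.cos(a) * p[0] - math.sin(a) * p[1] + tx,
            math.sin(a) * p[0] + math.cos(a) * p[1] + ty)

true_pose = (0.3, -0.2, 0.4)
observed = [transform(p, true_pose) for p in model]

def residuals(pose):
    out = []
    for p, (ox, oy) in zip(model, observed):
        x, y = transform(p, pose)
        out += [x - ox, y - oy]
    return out

pose, err = levenberg_marquardt(residuals, [0.0, 0.0, 0.0])
print(pose, err)
```

With a good initial guess, almost every step is accepted and the damping factor shrinks quickly, which is why plain Gauss-Newton would likely do as well in this application.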
Improved LED Identification
With a pose estimation algorithm in place, it is now possible to go back and improve the LED identification algorithm. Before, my code identified an LED by observing its flashing patterns over at least 10 frames, and continually updating its bit pattern as the video stream progressed. Due to the differential detection mechanism, rapid movements of the headset in the camera’s depth direction would sometimes lead to bit errors, which could often be corrected relying on the LED patterns’ minimum Hamming distance of 3. But the rare bit error would slip through, and a misidentified LED is very bad news for both ab-initio and iterative pose estimation algorithms.
The new algorithm has two cases. If there is no good pose estimate yet (at the beginning of tracking, or after tracking is temporarily lost), the old algorithm is applied, and it takes 10 frames or more to regain LED IDs and pose estimation. If, on the other hand, there is a good pose for the current frame, then the new algorithm projects all LEDs that should be visible to the camera, based on the current pose estimate and the LEDs’ emission directions (which are fortunately provided by the DK2’s firmware), into the camera’s image, where the next iteration of the LED identification loop will pick them up. As a result, if a new LED comes into view due to changes in the headset’s position or orientation, it will be picked up immediately instead of 10 or more frames later. And as long as the pose estimation loop maintains a good estimate, the original identification algorithm won’t be run at all. The results are solid, as can be seen in the video below:
As can be seen in the video, there are still tracking drop-outs, where the LED identification and pose estimation algorithms cannot keep up with fast headset movements. This is because neither algorithm has any idea how the headset moves between video frames, which arrive only every 16.666ms. If an LED blob moves by more than 10 pixels (arbitrary cut-off) between frames, the algorithm will no longer treat it as the same LED, and it will lose its accumulated ID bits and have to acquire 10 new ones.
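The frame-to-frame matching rule can be sketched as a nearest-neighbor search with a distance cut-off. My code uses a kd-tree for this; the brute-force loop below is a simplification, and the function name and sample coordinates are made up for the example.

```python
import math

def match_blobs(previous, current, max_distance=10.0):
    """For each current blob, return the index of the nearest previous blob
    within max_distance pixels, or None if there is none."""
    matches = []
    for cx, cy in current:
        best, best_d = None, max_distance
        for i, (px, py) in enumerate(previous):
            d = math.hypot(cx - px, cy - py)
            if d <= best_d:
                best, best_d = i, d
        matches.append(best)
    return matches

previous = [(100.0, 100.0), (200.0, 200.0)]
current = [(103.0, 104.0), (200.0, 215.0)]  # moved by 5 and 15 pixels
print(match_blobs(previous, current))
```

At one frame every 16.666ms, a 10-pixel cut-off means any blob moving faster than about 600 pixels per second loses its accumulated ID bits.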
Fortunately, the DK2’s IMU can be used to predict the motion of the headset between frames, which I expect (hope?) will make LED identification robust in the presence of fast motions.
The other main benefit of combining IMU measurements with optical pose estimation is noise reduction. Due to random noise, and exacerbated by the LEDs’ flashing patterns, raw position and orientation estimates are rather noisy, especially if only a few LEDs are visible, and even more especially if those few are (almost) coplanar. The pose estimation algorithm cannot do much to reduce this noise, but the IMU is very sensitive to motion, and if the pose estimation reports a relative motion that is not compatible with IMU readings, the noise can be filtered out.
This means the next big step is to combine the IMU-based tracking algorithm that I already have with the new pose estimation algorithm.
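Two of the identification ideas from this section can be sketched in a few lines: correcting a single bit error using the patterns’ minimum Hamming distance, and deciding which LEDs should face the camera under the current pose. The pattern table below is invented for the example (it is not the DK2’s real table), and the visibility test is a plain "emission direction points toward the camera" check, which is an assumption; a real implementation would probably also apply an angle threshold.

```python
def hamming(a, b):
    """Number of differing bits between two integer bit patterns."""
    return bin(a ^ b).count("1")

def identify_led(observed_bits, patterns, max_errors=1):
    """Index of the unique pattern within max_errors bits of observed_bits, else None.
    With a minimum distance of 3 between patterns, one bit error is always correctable."""
    matches = [i for i, p in enumerate(patterns)
               if hamming(observed_bits, p) <= max_errors]
    return matches[0] if len(matches) == 1 else None

def is_facing_camera(position, direction, rotation, translation):
    """True if an LED at `position`, emitting along `direction` (both in model space),
    points toward a camera sitting at the origin of camera space."""
    def rotate(v):
        return [sum(rotation[i][k] * v[k] for k in range(3)) for i in range(3)]
    p = [a + t for a, t in zip(rotate(position), translation)]
    d = rotate(direction)
    # The vector from the LED to the camera is -p.
    return sum(-pc * dc for pc, dc in zip(p, d)) > 0.0

# Hypothetical 10-bit patterns with pairwise Hamming distance of at least 3.
patterns = [0b0000000111, 0b0000111001, 0b1110000000, 0b0101010100]
print(identify_led(0b0000000101, patterns))  # one flipped bit, still LED 0

identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
print(is_facing_camera((0.0, 0.0, 0.0), (0.0, 0.0, -1.0), identity, (0.0, 0.0, 0.6)))
```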
Pose Estimation Performance and Quality
I still have to run thorough tests, but here are some early estimates of the algorithm’s performance and the quality of the results.
I measured the pose estimation loop’s wall-clock run-time on my 2.8 GHz Intel Core i7 CPU. This loop includes:
- Extract bright blobs from the incoming 8-bit greyscale video frame.
- Create an 8-bit RGB output video frame in which identified blobs are colored bright green (this is only for debugging/visualization purposes, and would be removed in a production tracking implementation).
- Match roughly disk-shaped blobs from the current frame with those extracted from the previous frame, to feed their size differences into the differential bit decoder, and accumulate their 10-bit IDs and LED associations.
- Write all identified blobs to a file (also only for debugging).
- If there are four or more identified blobs and no good current pose estimate, run the ab-initio pose estimation method.
- If there are four or more identified blobs, run the iterative pose estimation method on the current pose, which comes either from the previous frame or the ab-initio method.
- If there is a good current pose, project all theoretically visible LED positions into the image frame and label them with their respective 10-bit IDs and LED indices, to aid LED identification in the next frame.
- Sort all extracted or projected 2D LED positions into a kd-tree for fast matching in the next frame.
- Push all results (colored video frame, current pose estimate, current set of identified LEDs) to the main thread for real-time visualization.
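The two pose estimation steps in the list above boil down to a small piece of control flow. In this sketch the ab-initio stage, the iterative stage, and the "good pose" test are passed in as callables; their names and signatures are hypothetical, and returning None when fewer than four blobs are identified is an assumption about how tracking loss is signaled.

```python
def update_pose(previous_pose, blobs, ab_initio, refine, is_good, min_blobs=4):
    """Run the pose estimation stage for one video frame.

    ab_initio(blobs) -> pose, refine(pose, blobs) -> pose, and
    is_good(pose, blobs) -> bool are caller-supplied (hypothetical) callables.
    """
    if len(blobs) < min_blobs:
        return None  # assumption: not enough identified LEDs, no pose this frame
    pose = previous_pose
    if pose is None or not is_good(pose, blobs):
        pose = ab_initio(blobs)  # no usable estimate yet: start from scratch
    return refine(pose, blobs)   # always polish with the iterative method
```

When the previous frame already produced a good pose, only the cheap iterative stage runs.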
Without any optimization (yet), the loop currently takes around 1.6ms to execute. A large chunk of that is creating the colored output frame. The combined ab-initio/iterative pose estimation algorithm takes around 0.16ms.
To get a rough idea of pose estimation quality, I placed both the tracking camera and the headset onto the same table, 25″ away from each other, and collected a long stretch of data (2326 frames) without touching or moving anything. According to LibreOffice Calc, reprojection error RMS has an average of 0.164648 pixels and a standard deviation of 0.011238 pixels (this means that the lens distortion and intrinsic parameters I calculated are fantastic). The x position has an average of -0.069664m and a standard deviation of 0.000023m, y position has 0.032276m and 0.000071m, and z position (camera distance) has 0.632714m and 0.000114m (so the headset was only 24.91″ from the camera; sue me). Yaw angle has an average of -179.286° and a standard deviation of 0.020°, pitch has 2.584° and 0.155°, and roll has 5.720° and 0.107°. In other words, in this artificial setup, position is precise to about 0.1mm (with z being much worse than x and y), and orientation angles are precise to about 0.15° (with pitch being worst). The latter isn’t great, but since IMUs are really good at orientation tracking, I expect that proper sensor fusion will improve that. Optical tracking and inertial tracking go together like, uh, two things that go really well together.
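For anyone who wants to repeat this kind of analysis without a spreadsheet, the same summary statistics can be computed with Python’s standard library. This sketch assumes the sample standard deviation is the one wanted, and the sample values are made up; only the unit conversion uses a number from the measurements above.

```python
import statistics

METERS_PER_INCH = 0.0254

def summarize(samples):
    """Return (mean, sample standard deviation) of a list of measurements."""
    return statistics.mean(samples), statistics.stdev(samples)

def meters_to_inches(meters):
    return meters / METERS_PER_INCH

# Made-up z positions in meters, only to show the call.
print(summarize([0.6327, 0.6328, 0.6326, 0.6327]))
print(round(meters_to_inches(0.632714), 2))
```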