Details about the next version of Microsoft’s Kinect, to be bundled with the upcoming Xbox One, are slowly emerging. After an initial leak of preliminary specifications on February 20th, 2013, some official data are finally available. This article about the upcoming next Kinect-for-Windows mentions “Microsoft’s proprietary Time-of-Flight technology,” which is an entirely different method of sensing depth than the current Kinect’s structured light approach. That’s kind of a big deal.
Given that additional bit of information, the leaked depth camera specs make a lot more sense. According to the leak, the new Kinect (“Kinect2” from here on out) has a depth camera resolution of 512×424 pixels. This surprised me initially, given that Kinect1’s depth camera has a resolution of 640×480 pixels. But the Xbox 360 only used a depth image of 320×240 pixels for its skeletal tracking, mostly for performance reasons. So at first I guessed that the new Xbox would again only use a downsampled depth image, and that the leaked resolution was the downsampled one, leading to a “true” depth resolution of 1024×848 pixels. That sounds nice, but read on.
But here’s the problem: Kinect1’s depth camera is not a real camera; it’s a virtual camera, created by combining images from the real IR camera (which has 1280×1024 resolution) with light patterns projected by the IR emitter. And therein lies the rub. While the virtual depth camera’s nominal resolution is 640×480, the IR camera can only calculate a depth value for one of its (real) pixels if that pixel happens to see one of the myriad light dots projected by the pattern emitter. And because the light dots need some space between them, both so the IR camera can tell them apart and so the emitter can form a 2D pattern with a long repetition length, only a small fraction of the IR camera’s pixels will see light dots in any given setting. The depth values from those pixels are then resampled into the 640×480 output image, and depth values for all other pixels are created out of thin air, by interpolation between neighboring real depth values.
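To illustrate the effect, here is a toy 1D sketch (emphatically not Microsoft’s actual algorithm) of how sparse dot measurements along one scanline get turned into a dense depth line by linear interpolation; everything between the real samples is invented data:

```python
def densify_scanline(samples, width):
    """Given sparse (pixel, depth) measurements along one scanline,
    fill every pixel by linear interpolation between its nearest real
    measurements (values outside the sampled range are clamped).
    This mimics, in 1D, how a structured-light sensor turns sparse dot
    measurements into a dense depth image."""
    samples = sorted(samples)
    line = [0.0] * width
    for x in range(width):
        # Find the bracketing real measurements, clamping at the ends.
        left = max((s for s in samples if s[0] <= x), default=samples[0])
        right = min((s for s in samples if s[0] >= x), default=samples[-1])
        if left[0] == right[0]:
            line[x] = left[1]
        else:
            t = (x - left[0]) / (right[0] - left[0])
            line[x] = left[1] + t * (right[1] - left[1])
    return line

# Only three of ten pixels carry real depth measurements (in meters):
line = densify_scanline([(0, 1.0), (5, 2.0), (9, 1.0)], 10)
```

In this example, the other seven values, such as the 1.4 m at pixel 2, are pure interpolation; any small object sitting between the dots simply vanishes.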
The bottom line is that in Kinect1, the depth camera’s nominal resolution is a poor indicator of its effective resolution. Roughly estimating, only around 1 in every 20 pixels has a real depth measurement in typical situations. This is the reason Kinect1 has trouble detecting small objects, such as finger tips pointing directly at the camera. There’s a good chance a small object will fall entirely between light dots, and therefore not contribute anything to the final depth image. This also means that simply increasing the depth camera’s resolution, say to 1024×848, without making the projected IR pattern finer and denser as well, would not result in more data, only in more interpolation. That’s why I wasn’t excited until I found out about the change in technology.
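As a quick back-of-the-envelope check (the 1-in-20 figure is my own rough estimate from above, not an official number):

```python
# Rough effective-sample arithmetic for Kinect1.
ir_pixels = 1280 * 1024            # real IR camera pixels
dot_fraction = 1 / 20              # estimated fraction seeing a light dot
real_samples = ir_pixels * dot_fraction   # ~65,536 real measurements
output_pixels = 640 * 480          # nominal depth image: 307,200 pixels

# Only about one in five output pixels is backed by a real sample;
# the rest are interpolated.
ratio = real_samples / output_pixels
```

Under these assumptions, bumping the output image to 1024×848 would push that ratio below 1 in 13 without adding a single real measurement.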
In a time-of-flight depth camera, the depth camera is a real camera (with a single real lens), and every pixel contains a real depth measurement. This means that, while the nominal resolution of Kinect2’s depth camera is lower than Kinect1’s, its effective resolution is likely much higher, potentially by a factor of ten or so. Time-of-flight depth cameras have their own set of issues, so I’ll have to hold off on making an absolute statement until I can test a Kinect2, but I am expecting much more detailed depth images, and if early leaked depth images (see Figure 2) are genuine, they support that expectation.
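The underlying principle fits in a few lines. A sketch of the basic pulse-based time-of-flight relation (many real sensors instead measure the phase shift of continuously modulated light, but the distance relation is the same idea):

```python
# A light pulse travels to the surface and back, so
# depth = (speed of light * round-trip time) / 2.
C = 299_792_458.0  # speed of light in m/s

def tof_depth(round_trip_seconds):
    """Depth in meters for a measured round-trip time."""
    return C * round_trip_seconds / 2.0

# A surface 2 m away returns the pulse after about 13.3 nanoseconds:
t = 2 * 2.0 / C
depth = tof_depth(t)
```

The nanosecond-scale timing involved is exactly why per-pixel time-of-flight at 512×424 resolution is technically impressive.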
From a purely technical point of view, if Kinect2 really does use a time-of-flight depth camera, and if that camera’s native resolution really is 512×424 pixels, that’s a major achievement in itself. As of now, time-of-flight cameras have very low resolutions, usually 160×120 or 320×240 pixels. Even Intel/Creative’s upcoming depth camera is reported to use 320×240 pixels, or roughly a factor of three fewer than Kinect2.
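The pixel-count arithmetic behind that claim:

```python
# Total depth pixels per frame for each camera.
kinect2_pixels = 512 * 424   # 217,088 pixels
other_tof = 320 * 240        # 76,800 pixels, typical current ToF resolution

factor = kinect2_pixels / other_tof   # just under 3x
```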
Structured-light depth cameras have another subtle drawback. To measure the depth of a point on a surface, that point has to be visible to both the camera and the pattern emitter. This leads to distinct “halos” around foreground objects. More distant surfaces on the left side of the foreground object can’t be seen by the camera, whereas surfaces on the right side can’t be seen by the pattern emitter (or the other way around, depending on camera layout). The larger the depth distance between foreground and background objects, the wider the halo. A time-of-flight camera, on the other hand, can measure the depth of any surface it can see itself. In truth, there is still an emitter involved; the emitter needs to create a well-timed pulse of light whose return time can be measured. But since a time-of-flight camera’s depth resolution does not depend on the distance between camera and emitter (unlike triangulation-based structured light), the emitter can be very close to the camera, even shooting through the same lens, and the resulting halos are much smaller, or gone completely.
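The halo width follows from similar triangles. A small sketch, using a camera-to-emitter baseline of roughly 7.5 cm (my approximate figure for Kinect1, not an official spec) and fronto-parallel surfaces for simplicity:

```python
def halo_width(baseline_m, z_fg_m, z_bg_m):
    """Width, in meters on the background surface, of the shadow region
    behind a foreground edge that the emitter cannot illuminate.
    Derived from similar triangles: the shadow grows with the baseline
    and with the depth gap, and shrinks with foreground distance."""
    return baseline_m * (z_bg_m - z_fg_m) / z_fg_m

# Foreground object at 1 m, background at 1.5 m and 3 m:
w_near = halo_width(0.075, 1.0, 1.5)   # ~3.8 cm halo
w_far = halo_width(0.075, 1.0, 3.0)    # ~15 cm halo
```

Note that as the baseline goes to zero, as with a time-of-flight emitter shooting through the camera’s own lens, the halo width goes to zero with it.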
So is the higher depth resolution just an incremental improvement, or a major new feature? For some applications, like skeleton tracking or 3D video, it is indeed only incremental, albeit highly welcome. But there are very important applications for which Kinect1’s depth resolution was not quite good enough, most importantly finger and face tracking. Based on the known specs, I am expecting that Kinect2’s depth camera will be able to reliably resolve fingertips at medium distance, even when they point directly at the camera. This will enable new natural user interfaces for 3D interactions, such as grabbing, moving, rotating, and scaling virtual three-dimensional objects (where the Leap Motion would otherwise be king). Reliable face tracking could be used to create truly holographic 3D displays based entirely on commodity hardware, i.e., PC, Kinect2, 3D TV. My VR software could already use both of these features, if the current Kinect’s resolution were just a tad higher.
These significant improvements in the depth camera aside, the other changes are really quite minor. Kinect2 has a higher-resolution color camera, which can allegedly stream 1920×1080 pixel color images at 30 Hz, compared to Kinect1’s 640×480 pixels at 30 Hz, or 1280×1024 pixels at 15 Hz. Since it was already possible to combine Kinect1 with external high-resolution cameras, this is not a major change. And the new microphone array seems to be basically the same as the old one.
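For scale, the raw pixel-rate comparison between the two color cameras (ignoring compression and bit depth, which are unknown):

```python
# Raw color pixels delivered per second at the claimed 30 Hz modes.
kinect2_rate = 1920 * 1080 * 30   # ~62.2 million pixels/s
kinect1_rate = 640 * 480 * 30     # ~9.2 million pixels/s

factor = kinect2_rate / kinect1_rate   # 6.75x more pixels per second
```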
So on to the really important question: can someone like me actually use those new capabilities? Or, phrased differently, is Kinect2 as easy to use off-label as Kinect1? Or, phrased yet another way, is Kinect2 hackable? Looking back 2.5 years, it took only a few days between the original Kinect’s appearance in stores and its USB protocol having been reverse-engineered, because Microsoft “forgot” to put encryption or authentication into the protocol. Microsoft’s PR machine put a happy face on the whole incident back then, but I’m not sure they wouldn’t rather have kept control.
Additionally, and this is a sly move, Kinect2 will be sold bundled with Xbox One. The original Kinect is an add-on, and sold separately, initially for around $150, and now usually for around $100. I currently have six of them. Kinect for Windows, on the other hand, has a suggested retail price of $249, for basically exactly the same hardware. Go figure. Microsoft ran into a trap Nvidia figured out long ago: if you want to sell a “professional” product at high mark-ups, you can’t sell the exact same product in the games market, where there is fierce competition. Positioning Kinect2 as an integral part of every Xbox One, and not selling it separately, will not poison the market for a high-priced “for Windows” version later.
Will Kinect2 be available for stand-alone purchase? If every Xbox One comes with one, and can use only one, why would it be available separately? Will I have to buy six Xboxes, for (rumored price) $299 each, to get my fix? Will I have to wait for Kinect2 for Windows, for whatever that will cost? Time will tell.