The Kinect 2.0

Details about the next version of Microsoft’s Kinect, to be bundled with the upcoming Xbox One, are slowly emerging. After an initial leak of preliminary specifications on February 20th, 2013, some official data are finally to be had. This article about the next Kinect-for-Windows mentions “Microsoft’s proprietary Time-of-Flight technology,” which is an entirely different method of sensing depth than the current Kinect’s structured-light approach. That’s kind of a big deal.

Figure 1: The Xbox One. The box on top, the one with the lens, is probably the Kinect2. They should have gone with a red, glowing lens.

Given that additional bit of information, the leaked depth camera specs make a lot more sense. According to the leak, the new Kinect (“Kinect2” from here on out) has a depth camera resolution of 512×424 pixels. This surprised me initially, given that Kinect1’s depth camera has a resolution of 640×480 pixels. But, the Xbox 360 only used a depth image of 320×240 pixels for its skeletal tracking, mostly for performance reasons. So at first I guessed that the new Xbox would again only use a downsampled depth image, and that the leaked resolution was the downsampled one, leading to a “true” depth resolution of 1024×848 pixels. That sounds nice, but read on.

But here’s the problem: the Kinect1’s depth camera is not a real camera; it’s a virtual camera, created by combining images from the real IR camera (which has 1280×1024 resolution) with light patterns projected by the IR emitter. And therein lies the rub. While the virtual depth camera’s nominal resolution is 640×480, the IR camera can only calculate a depth value for one of its (real) pixels if that pixel happens to see one of the myriad of light dots projected by the pattern emitter. And because the light dots must have some space between them, to be told apart by the IR camera, and to create a 2D pattern with a long repetition length, only a small fraction of the IR camera’s pixels will see light dots in any given setting. The depth values from those pixels will then be resampled into the 640×480 output image, and depth values for all other pixels will be created out of thin air, by interpolation between neighboring real depth values.

The bottom line is that in Kinect1, the depth camera’s nominal resolution is a poor indicator of its effective resolution. Roughly estimating, only around 1 in every 20 pixels has a real depth measurement in typical situations. This is the reason Kinect1 has trouble detecting small objects, such as finger tips pointing directly at the camera. There’s a good chance a small object will fall entirely between light dots, and therefore not contribute anything to the final depth image. This also means that simply increasing the depth camera’s resolution, say to 1024×848, without making the projected IR pattern finer and denser as well, would not result in more data, only in more interpolation. That’s why I wasn’t excited until I found out about the change in technology.
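
To make the interpolation point concrete, here is a toy sketch (illustrative numbers only — the real Kinect pipeline is of course far more sophisticated): a sparse set of “dot” measurements is spread over the full output grid by interpolation, and a thin object that falls between dots simply vanishes.

```python
import numpy as np

# Toy model of a structured-light depth camera (illustrative numbers,
# not actual Kinect parameters): only a sparse grid of sensor pixels
# sees a projected dot and gets a real depth measurement; every other
# pixel is filled in by interpolation from measured neighbors.

def interpolate_sparse_depth(samples, shape):
    """Fill a depth image from sparse (row, col, depth) measurements
    by nearest-neighbor interpolation."""
    rows, cols, depths = (np.array(a) for a in zip(*samples))
    rr, cc = np.mgrid[0:shape[0], 0:shape[1]]
    # Squared distance from every output pixel to every measured sample:
    d2 = (rr[..., None] - rows) ** 2 + (cc[..., None] - cols) ** 2
    return depths[np.argmin(d2, axis=-1)]

# A flat wall at 2.0 m, hit by a dot every 4th pixel in each direction.
# A thin 1-pixel-wide "finger" in front of the wall at column 2 falls
# entirely between dots, so it contributes no sample at all:
samples = [(r, c, 2.0) for r in range(0, 16, 4) for c in range(0, 16, 4)]
depth = interpolate_sparse_depth(samples, (16, 16))
print(depth[8, 2])  # 2.0 -- the wall's depth; the finger is invisible
```

The finger never intersects a projected dot, so the final depth image contains no trace of it — exactly the failure mode described above.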

In a time-of-flight depth camera, the depth camera is a real camera (with a single real lens), with every pixel containing a real depth measurement. This means that, while the nominal resolution of Kinect2’s depth camera is lower than Kinect1’s, its effective resolution is likely much higher, potentially by a factor of ten or so. Time-of-flight depth cameras have their own set of issues, so I’ll have to hold off on making an absolute statement until I can test a Kinect2, but I am expecting much more detailed depth images, and if early leaked depth images (see Figure 2) are not doctored, then that’s supported by evidence.

Figure 2: Alleged depth image from Kinect 2.0, unverified. Watermark not part of depth image.

From a purely technical point of view, if Kinect2 really does use a time-of-flight depth camera, and if that camera’s native resolution really is 512×424, that’s a major achievement in itself. As of now, time-of-flight cameras have very low resolutions, usually 160×120 or 320×240. Even Intel/Creative’s upcoming depth camera is reported to use 320×240, or almost a factor of three fewer pixels than Kinect2.

Structured-light depth cameras have another subtle drawback. To measure the depth of a point on a surface, that point has to be visible to both the camera and pattern emitter. This leads to distinct “halos” around foreground objects. More distant surfaces on the left side of the foreground object can’t be seen by the camera, whereas surfaces on the right side can’t be seen by the pattern emitter (or the other way around, depending on camera layout). The larger the depth distance between foreground and background objects, the wider the halo. A time-of-flight camera, on the other hand, can measure the depth of any surfaces it can see itself. In truth, there is still an emitter involved; the emitter needs to create a well-timed pulse of light whose return time can be measured. But since depth resolution does not linearly depend on the distance between the camera and emitter, the emitter can be very close to the camera — it can even shoot through the same lens — and the resulting halos are much smaller, or gone completely.
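
The geometry behind that claim fits in a few lines (a numerical sketch, not the Kinect2’s actual processing): depth is half the round-trip time of a light pulse times the speed of light, with no dependence on a camera-to-emitter baseline.

```python
# Time-of-flight ranging in a nutshell: depth is half the round-trip
# time of a light pulse times the speed of light.
C = 299_792_458.0  # speed of light in m/s

def tof_depth(round_trip_seconds):
    return C * round_trip_seconds / 2.0

# A surface ~3 m away returns the pulse after about 20 nanoseconds:
print(tof_depth(20e-9))  # 2.99792458 m

# Conversely, resolving 1 cm of depth requires timing on the order of
# 2 * 0.01 / C ~ 67 picoseconds, which is why practical TOF cameras
# typically measure phase shifts of modulated light instead of timing
# individual pulses directly.
print(2 * 0.01 / C)  # ~ 6.7e-11 s
```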

So is the higher depth resolution just an incremental improvement, or a major new feature? For some applications, like skeleton tracking or 3D video, it is indeed only incremental, albeit highly welcome. But there are very important applications for which Kinect1’s depth resolution was barely not good enough, most importantly finger and face tracking. Based on the known specs, I am expecting that Kinect2’s depth camera will be able to resolve finger tips at medium distance reliably, even when pointing directly at the camera. This will enable new natural user interfaces for 3D interactions, such as grabbing, moving, rotating, and scaling virtual three-dimensional objects (where the Leap Motion would otherwise be king). Reliable face tracking could be used to create truly holographic 3D displays completely based on commodity hardware, i.e., PC, Kinect2, 3D TV. My VR software could already use both of these features, if the current Kinect’s resolution were just a tad higher.
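
Here is a rough sanity check of that expectation (the 70° horizontal field of view is my assumption for illustration, not a published spec): at the leaked 512-pixel horizontal resolution, a fingertip at 2 m spans only a handful of depth pixels — enough for a camera where every pixel is a real measurement, but hopeless for one that effectively samples roughly 1 in 20 pixels.

```python
import math

# Back-of-the-envelope angular resolution (the 70-degree horizontal
# field of view is an assumption for illustration, not a published
# spec): how many depth pixels does an object span at a given distance?
def pixels_on_object(object_width_m, distance_m, fov_deg, pixels_across):
    # Width of the camera's full view at that distance:
    view_width = 2.0 * distance_m * math.tan(math.radians(fov_deg) / 2.0)
    return object_width_m / view_width * pixels_across

# A 15 mm fingertip at 2 m, seen by a 512-pixel-wide depth camera:
kinect2 = pixels_on_object(0.015, 2.0, 70.0, 512)
print(round(kinect2, 1))  # roughly 2-3 pixels across
```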

These significant improvements in the depth camera aside, the other changes are really quite minor. Kinect2 has a higher-resolution color camera, which can allegedly stream 1920×1080 pixel color images at 30 Hz, compared to Kinect1’s 640×480 pixel images at 30 Hz, or 1280×1024 pixel images at 15 Hz. Because it was already possible to combine the Kinect with external cameras, that’s not that important. And the new microphone array seems to be basically the same as the old one.

So on to the really important question: can someone like me actually use those new capabilities? Or, phrased differently, is Kinect2 as easy to use off-label as Kinect1? Or, phrased yet another way, is Kinect2 hackable? Looking back 2.5 years, it took only a few days between the original Kinect’s appearance in stores and its USB protocol having been reverse-engineered, because Microsoft “forgot” to put encryption or authentication into the protocol. Microsoft’s PR machine put a happy face on the whole incident back then, but I’m not sure they wouldn’t rather have kept control.

Additionally, and this is a sly move, Kinect2 will be sold bundled with Xbox One. The original Kinect is an add-on, and sold separately, initially for around $150, and now usually for around $100. I currently have six of them. Kinect for Windows, on the other hand, has a suggested retail price of $249, for basically exactly the same hardware. Go figure. Microsoft is sidestepping a trap Nvidia figured out long ago: if you want to sell a “professional” product at high mark-ups, you can’t sell the exact same product in the games market, where there is fierce competition. Positioning Kinect2 as an integral part of every Xbox One, and not selling it separately, will not poison the market for a high-priced “for Windows” version later.

Will Kinect2 be available for stand-alone purchase? If every Xbox One comes with one, and can use only one, why would it be available separately? Will I have to buy six Xboxes, for (rumored price) $299 each, to get my fix? Will I have to wait for Kinect2 for Windows, for whatever that will cost? Time will tell.

27 thoughts on “The Kinect 2.0”

  1. One advantage of Intel’s camera is that it’s supposedly 60fps (so that means 33ms latency? hard to find any hard numbers) compared to the kinect 2.0’s 60ms latency.
    Ideally latency would be even lower of course.

    Microsoft already announced that they’ll sell the kinect 2.0 separately for windows too, but of course they didn’t actually mention any timetable .. so as far as we know it’ll be released for windows years later

    But hey, if you do need to buy those xbox ones for the kinects, maybe you can make a Beowulf cluster out of them!
    .. or just have lots of xbox game parties with your friends 😉

    • Hey Sander!

      Camera latency is primarily determined by exposure time, read-out time, any in-camera processing, and transmission to the host. The Kinect1, when using “raw” depth and color data from a PC, has approximately 20ms latency. Using a downsampled image like on the Xbox will likely reduce that latency to around 15ms.

      The oft-discussed Kinect tracking latency stems mostly from the skeletal tracking algorithm run on the Xbox, which is computationally involved. I’m guessing it’s 90ms-15ms=75ms, give or take. Without changing the algorithm, going to a 60 fps camera will probably reduce latency only a little, if at all. Say 10ms from the camera, that still leaves 85ms total.
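
      Spelling out that arithmetic (all numbers are the rough estimates from this thread, not measurements):

```python
# The latency estimates above, spelled out (rough guesses, not
# measurements):
total_tracking_latency_ms = 90   # oft-quoted end-to-end tracking latency
camera_latency_30fps_ms = 15     # downsampled Kinect1 image on the Xbox
algorithm_latency_ms = total_tracking_latency_ms - camera_latency_30fps_ms
print(algorithm_latency_ms)      # 75 ms spent in skeletal tracking

# A 60 fps camera only shrinks the camera's share of the total:
camera_latency_60fps_ms = 10
print(camera_latency_60fps_ms + algorithm_latency_ms)  # 85 ms total
```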

      The MSDN blog post I’m linking above vaguely states that “Kinect 2.0 for Windows will be available in 2014.” It doesn’t say anything about price, but I’m guessing it won’t be cheaper than the current Kinect for Windows at $249.

  2. 60FPS is a 16ms update rate, but that’s not the latency of the camera – the time from your actions taking place to the data being available is a different figure.

    Not many specifics about camera performance have been announced, but Wired’s Kinect video shows fingers at a decent distance as well as face detail ( http://www.youtube.com/watch?v=Hi5kMNfgDS4#t=37s )

    It’s been announced that it’ll come to Windows eventually so it’ll be available. The more interesting problem you have for hacking with it before then is the connector ( http://msnbcmedia4.msn.com/j/streams/2013/May/130521/6C7512066-xbox-one-back-ports.blocks_desktop_large.jpg )

    • Thank you for the picture of the connector, although it’s somewhat bad news. I had a hunch Microsoft wouldn’t repeat the “mistake” they made with the Kinect1, and keep it locked down this time. While there is a chance that the Xbox will ship with an adapter cable with a standard USB3 plug — like the Kinect1 — it’s a slim chance in my book.

        • For the Kinect 1.0, that’s correct. The power drain of the tilt motor was too much. Since the Kinect 2.0 has USB3, with higher bus power, and no tilt motor, this time they’ll have to come up with a different excuse. 🙂

  3. Pingback: Cracking Kinect: the Xbox One's new sensor could be a hardware hacker's dream – The Verge - ABC News

  4. Pingback: Cracking Kinect: the Xbox One’s new sensor could be a hardware hacker’s dream | Wireless Sensors

  5. Pingback: Cracking Kinect: the Xbox One’s new sensor could be a hardware hacker’s dream | Hear the newest Technology News

  6. What are your feelings about using multiple kinect 2.0’s and their interference? According to wikipedia, multiple time-of-flight cameras in the same room interfere with each other, which might pose problems that are harder to overcome than interference between multiple structured light patterns.

    • Yes, it’s more of a concern. I was wrong 2.5 years ago when I said you can’t use multiple Kinects in the same space, but I’m saying it again this time. With a structured light scanner, interference is localized to areas where the individual patterns (dots in the Kinect’s case) overlap in unfortunate ways, and that can clearly be seen in practice.

      With a time-of-flight scanner, having multiple IR pulses go off at unsynchronized times could potentially screw up depth measurements over the entire sensor. If supported by the hardware, time synchronization could potentially fix it. In other words, if one camera fires at time t, and then at time t+dt (where dt is the frame period, i.e., 1/30 s), you could have the second camera set up to fire at t+dt/2 and t+dt*3/2 and so forth. But that requires exact timing support in the firmware, which the Kinect2 probably won’t have, because why would it?
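
      For illustration, that interleaving scheme would look like this (a sketch assuming firmware-level trigger control, which the Kinect2 probably won’t expose):

```python
# Evenly staggered trigger times for several TOF cameras sharing one
# frame period dt, so no two exposure pulses coincide (assumes
# firmware-level trigger control, which the Kinect2 probably lacks):
def trigger_times(num_cameras, dt, num_frames):
    return [[f * dt + c * dt / num_cameras
             for f in range(num_frames)]
            for c in range(num_cameras)]

dt = 1.0 / 30.0  # frame period at 30 fps
cam0, cam1 = trigger_times(2, dt, 3)
print(cam0)  # fires at 0, dt, 2*dt
print(cam1)  # fires at dt/2, 3*dt/2, 5*dt/2
```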

        • If I understand it correctly, a single TOF depth picture is not made from a single exposure, but from a sequence of exposures at high frequencies. A single pulse would simply not be able to collect enough photons to trigger a reaction from the sensor.

          In practice, this means that the exposure time of a single shot is not just a few picoseconds, but something much longer, probably on the same order as normal cameras (which are using similar sensors, after all). So it’s a period of exposure of probably several milliseconds, in which there are probably thousands of “microexposures” where a very short light pulse is followed by a very short exposure period.

          As a result, if two cameras have overlapping exposure periods, for which there would be a high probability, things would go wrong.

          Someone correct me if I’m wrong.

          • I’m only familiar with the spinning laser lidars used in the grand challenges. Since they were spinning, I think they would only take a single sample, and they didn’t seem to need bright lasers. There were often several lidars on each car and they didn’t seem to interfere with each other. I doubt the Kinect will use a spinning lidar, but I guess it could use something like the Kinect 1 projector to aim a beam or set of beams.

          • The main two differences between LiDAR and TOF cameras are that LiDAR “cameras” only have a single pixel, which can have a very large light-sensitive sensor area, and that the laser in LiDAR is highly focused, whereas the LED in a TOF camera has to illuminate a large area (around 90 degrees field-of-view for the Intel camera, and probably the Kinect2). The combination of those two factors makes me believe that TOF cameras use multiple microexposures to make up for the much smaller number of photons that make it from the light source to each pixel.

  7. Pingback: Kinect 2.0 for Virtual Reality Could Be Key

  8. Might be naive, but what if, instead of using multiple Kinects, you put up two mirrors behind yourself, with a 120° angle between the mirrors and the Kinect?
    Would that trick the Kinect into thinking there are three people in the room?
    If so, you might be able to combine the three skeleton models into one, giving you full 360° body tracking.

    Greetings from Germany
    Max

  9. I thought you might like some more information about Kinect 2.
    There was a presentation at Build 2013 all about the sensor and its SDK (http://channel9.msdn.com/Events/Build/2013/3-702)
    It’s worth a watch, but if you don’t want to, the main highlights were:
    > No protection on the output. The guys claim that they’ve actually made the protocol nicer to deal with in this version, and that they want people to use it with other toolkits. From other sources, not the video, it’s been claimed the weird plug is to ensure that it gets a dedicated USB hub to itself, rather than ending up sharing one with an HDD or other USB2/3 devices that could be plugged in.
    > They confirm that you can run two in parallel without issues. They say there’s a small risk of overlap but since the pulses are so short you should be ok.

  10. Loved your Kinect demos but more importantly, loved your tech talk! You made things a lot easier to understand. Have you contacted Microsoft about using the new Kinect? I read that they will allow people to be indie developers for the Xbox One and will allow them to use the retail console as the development kit. If that’s the case, I have an idea that I might try to develop (similar to Avatar Kinect on the 360) but I was wondering if you heard anything.

    • No, I haven’t talked to Microsoft. I’m waiting until things settle a bit (required component? not?), and then I’ll see where the chips fall. I’ll definitely try to do something with the Kinect 2, as the tech is too promising not to, but I’m not sure how much Microsoft would be willing to cooperate.

  11. Dear Sir

    Have you managed to obtain a Kinect 2 for Windows, and have you played with the lens and depth correction for this device?
    I am convinced that the color camera has a built-in hardware un-distortion algorithm (the picture is damn nice), but it appears that the depth image is not corrected.

    What are your findings?

    • The depth camera definitely has undistortion parameters stored in the firmware, and there is something that looks distortion-related for the color camera as well. It’s true that the color camera exhibits a lot less distortion than the depth camera, but I think that’s simply due to better lenses, and not due to some special process.

Please leave a reply!