3D Video Capture with Three Kinects

I just moved all my Kinects back to my lab after my foray into experimental mixed-reality theater a week ago, and just rebuilt my 3D video capture space / tele-presence site consisting of an Oculus Rift head-mounted display and three Kinects. Now that I have a new extrinsic calibration procedure to align multiple Kinects to each other (more on that soon), and managed to finally get a really nice alignment, I figured it was time to record a short video showing what multi-camera 3D video looks like using current-generation technology (no, I don’t have any Kinects Mark II yet). See Figure 1 for a still from the video, and the whole thing after the jump.

Figure 1: A still frame from the video, showing the user’s real-time “holographic” avatar from the outside, providing a literal kind of out-of-body experience to the user.

I decided to embed the live 3D video into a virtual 3D model of an office, to show a possible setting for remote collaboration / tele-presence (more on that coming soon), and to contrast the “raw” nature of the 3D video with the much more polished look of the 3D model. One of the things we’ve noticed since we started working with 3D video to create “holographic” avatars many years ago was that, even with low-res and low-quality 3D video, the resulting avatars just feel real, in some sense even more real than higher-quality motion-captured avatars. I believe it’s related to the uncanny valley principle, in that fuzzy 3D video that moves in a very lifelike fashion is more believable to the brain than high-quality avatars that don’t quite move right. But that’s a topic for another post.

Now, one of the things that always echoes right back when I bring up Kinect and VR is latency, or rather, that the Kinect’s latency is too high to be useful for VR. Well, we need to be careful here, and distinguish between the latency of the skeletal reconstruction algorithm that’s used by the Xbox and that’s deservedly knocked for being too slow, and the latency of raw depth and color video received from the Kinect. In my applications I’m using the latter, and while I still haven’t managed to properly measure the end-to-end latency of the Kinect in “raw” mode, it appears to be much lower than that of skeletal reconstruction. Which makes sense, because skeletal reconstruction is a very involved process that runs on the general-purpose Xbox processors, whereas raw depth image creation runs on the Kinect itself, in dedicated custom silicon.

The bottom line is that, at least to me and everybody else who has tried my system, latency of the 3D video is either not noticeable, or not a problem. Even when waving my hands directly in front of my face, they feel completely like my hands, and whatever latency is there does not lead to a disconnect. Several other observations we have made, such as the thing about swiveling in a chair I point out in the video, make me believe that 3D video, fuzziness and artifacts and all, creates a strong sense of presence in one’s own body.

A little more information about the capture site: It’s run by a single Linux computer (Intel Core i7 @ 3.5 GHz, 8 GB RAM, Nvidia Geforce GTX 770), which receives raw depth and color image streams from three Kinects-for-Xbox, is connected to an external tracking server for head position and wand position and orientation, and drives an Oculus Rift and a secondary monoscopic view from the same viewpoint (exactly the view shown in the video) on the desktop display.

To provide some (limited) freedom of movement to the user, the Rift is connected to the computer via extra long cables: a 15′ HDMI cable, an 11′ USB cable, and a 12′ power cord (I cut the Rift’s original cord in two and spliced in a 6′ extension). The user wears the Rift’s control box on a belt, and the long cables to the main computer are tied together via a spiral cable tunnel. That setup works quite well, but, as evident at 4:55 in the video, one had better be careful not to yank the cable, or it might knock a Kinect out of alignment. Oops. :) Fortunately, with the new calibration procedure, fixing that only takes a few minutes.

Update: Caving in to overwhelming public demand — well, some guy asked for it on reddit — I uploaded an Oculus Rift-formatted version of the above video to my YouTube channel. It’s exactly the same video, but if you already own an Oculus Rift dev kit version 1, you can watch the video in full 3D by dragging the playback window to the display feeding your Rift, and — this is very important! — full-screening it in 1280×800 resolution (1280×800, not 1280×720 or 1920×1080). Since I move my head a lot, and your view will be hard-locked to mine no matter how you move your head, the video might make you dizzy, so be careful. There are more detailed instructions and warning labels in the video description on YouTube:

50 thoughts on “3D Video Capture with Three Kinects”

  1. Perhaps you could add something that tracks the cables and pops a warning to let you know before you do anything bad to them?

    • For the CAVE and other screen-based environments we have literal screen savers that detect when a tracked device (wand or head) comes too close to a screen, and pop up a grid showing the screen’s location in physical space, so you don’t accidentally reach right through it. I’ll need to implement something similar here.
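The proximity test behind such a screen saver can be sketched in a few lines. This is only an illustration, not the actual implementation: it assumes the screen is a flat rectangle described by one corner and two perpendicular edge vectors, and the names `distance_to_screen`, `should_show_grid` and `WARN_DISTANCE` are hypothetical.

```python
import math

WARN_DISTANCE = 0.3  # hypothetical warning threshold, in meters


def _dot(a, b):
    """Dot product of two 3D vectors given as sequences."""
    return sum(x * y for x, y in zip(a, b))


def distance_to_screen(point, origin, edge_u, edge_v):
    """Distance from a tracked point to a rectangular screen.

    origin is one corner of the screen; edge_u and edge_v are the two
    (assumed perpendicular) edge vectors starting at that corner.
    """
    rel = [p - o for p, o in zip(point, origin)]
    # Position of the point along each edge, clamped to the rectangle
    s = min(max(_dot(rel, edge_u) / _dot(edge_u, edge_u), 0.0), 1.0)
    t = min(max(_dot(rel, edge_v) / _dot(edge_v, edge_v), 0.0), 1.0)
    closest = [o + s * u + t * v for o, u, v in zip(origin, edge_u, edge_v)]
    return math.dist(point, closest)


def should_show_grid(point, origin, edge_u, edge_v, warn_distance=WARN_DISTANCE):
    """True if the tracked device is close enough that the grid should pop up."""
    return distance_to_screen(point, origin, edge_u, edge_v) < warn_distance
```

The same test could in principle be run against a tracked cable or a Kinect tripod instead of a screen, which is what the suggestion above amounts to.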

  2. Would it be possible to mount a downward facing kinect (or camera equivalent) on the rift itself, having it capture your body in that manner? I imagine segmenting the image is still challenging, perhaps it could be done based on boundaries drawn with an IR camera that maps your body’s natural heat signal. I mean, there must be a solution that doesn’t involve setting up green screens.

    • In principle yes. The Kinect is rather too heavy for that, but others have done this with smaller depth cameras like the Intel/Creative time-of-flight camera, or the Leap Motion.

      The nice thing about depth cameras like the Kinect is that you don’t need green screens. In case you’re wondering, my Kinects are standing in my unmodified lab space, with desks and cubicles and all kinds of equipment all around. Background removal is done by capturing the per-pixel depth of the room when no one is in the designated capture area, and then removing all pixels behind that “null facade” prior to rendering.
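The background removal described above can be sketched roughly as follows. This is a minimal illustration, not the code used in the video: it assumes depth frames are 2D lists of distances in millimeters with 0 marking pixels without a reading, it builds the background from the nearest depth seen per pixel over several empty-room frames, and it adds a safety margin. The names `capture_background`, `remove_background`, `INVALID` and `margin` are hypothetical.

```python
INVALID = 0  # assumed marker for pixels without a depth reading


def capture_background(empty_frames):
    """Build the per-pixel background depth from frames of the empty room.

    For each pixel, keep the nearest valid depth seen in any frame.
    """
    height, width = len(empty_frames[0]), len(empty_frames[0][0])
    background = [[INVALID] * width for _ in range(height)]
    for frame in empty_frames:
        for y in range(height):
            for x in range(width):
                d = frame[y][x]
                if d != INVALID and (background[y][x] == INVALID or d < background[y][x]):
                    background[y][x] = d
    return background


def remove_background(frame, background, margin=50):
    """Keep only pixels that are clearly in front of the stored background.

    margin (in depth units) absorbs sensor noise near the background surface.
    """
    result = []
    for row, bg_row in zip(frame, background):
        out_row = []
        for d, bg in zip(row, bg_row):
            in_front = d != INVALID and (bg == INVALID or d < bg - margin)
            out_row.append(d if in_front else INVALID)
        result.append(out_row)
    return result
```

A real-time version would vectorize these loops or run them on the GPU; the nested loops are only meant to make the per-pixel comparison explicit.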

      • Would a PS4 Camera be too heavy (once ‘liberated’ from its enclosure)?

        In theory you already know the 3D space in front of the user from the Kinects; however, processing the stereo image could fill in occluded objects.

        • I don’t know much about the PS4 camera. Does it convert its stereo views into 3D views, i.e., a depth map, internally via custom silicon? If not, that process is quite expensive when done on the main CPU; that’s how the 3D cameras we used prior to Kinect worked. Each camera pair required a dedicated fairly high-end PC to create depth maps. Granted, that was four years ago.

  3. It would be interesting to see how this performs with the higher quality, current gen Kinects instead of the IR-based last-gen models. One sensor reading the others’ IR dots could explain some of the noise.

    • Interference between structured light scanners is always a problem, but in this case the video doesn’t look much worse than what would get captured from a single Kinect at the same distance. It’s just that in order to scan my entire body, I have to move the Kinects almost 2m away, and at that distance lateral and depth resolution aren’t great. The Kinect v2 should significantly improve the video quality seen here, see my related post on why I think that.

  4. Hi!
    I very much enjoyed your post, thank you.

    I noticed in the end where you are saying that it is running on a linux machine and I would like to ask, are you using the libfreenect software or is there any other way to use kinect under linux?

    Thanks!

    • No, this is not based on libfreenect. I’m using my own driver package to talk to the Kinect cameras at the USB level, see my Kinect Hacking page for details. I developed my software at the same time as the libfreenect people did theirs, but of course I like my own better. ;)

  5. Pingback: Oculus Rift y Kinect juntos para usar todo nuestro cuerpo en entornos 3D

  6. That is a very cool setup you have going there! I kinda like the chunky polygons, especially in the context of the “mathematically perfect” environment. Have you thought of trying to clean it up using “Shake ‘n’ Sense”?

    • Yup. Shake’n’Sense is one of those things I’ve always wanted to use, but never managed to. I need a couple of Igors who are good at electrical and mechanical engineering; I’m not.

      • Well, by looking at the paper it seems that it doesn’t need to be very precise to work, at the end they even hot-glue one motor on top of a Kinect which in turn uses velcro to get fixed to a table:

        http://research.microsoft.com/pubs/171706/shake'n'Sense.pdf

        The velcro keeps the Kinect in place while allowing for enough “shaking” of the Kinect. And the motor only needs to have some off-axis weight and run above 20Hz to work.

        So no need for fancy electronics either, just a battery and maybe a potentiometer to regulate the speed and you should be good to go :-)

        • You overestimate my hot-gluing and potentiometer-soldering skills, but yes, Shake’n’Sense is an obvious next step. I’ve procrastinated on it so far, but I’ll have to try it.

  7. Pingback: 3D Video Capture with Three Kinects | Doc-Ok.or...

  8. Pingback: Skapa en virtuell kropp med Oculus Rift och Kinect - Hälsa på dig själv | Feber / Pryl

  9. I love this work! I’m currently brainstorming an idea I want to build soon, where I would most probably use a Kinect for the graphics of a 3D flight simulator. Like what these guys have done, but with 3D graphics. https://www.kickstarter.com/projects/vipersim/the-viper-full-motion-flight-simulator
    My dream is a simulated 3d cockpit where the joystick is mapped in as well as the pilot suit and of course the terrain outside the plane – perhaps using some open source Kinect software. Although a 3D flight sim with 3d vision would be enough of an accomplishment if I can actually get into this project!

  10. I just thought of something: Starting from the assumption that the seams between the shapes captured by each kinect should be minimal for any real object the rig is capturing, wouldn’t it be possible to have the system monitor the size of the seams, and when they seem to get too big, make it start trying to adjust the position and rotation of each feed to reduce the size of the seams again?

    There must be an algorithm ready out there that will find the best position and rotation to match two sets of 3d lines ( in this case, the edges of the shells created by each kinect feed)

    • Hm, actually since it’s 3 kinects, there would be 3 sets of lines… I guess it could still work if for each feed you treat the remaining sets as a single set, removing any duplicates, where lines get too close to other lines from the other set.

    • It’s a bit trickier than that. The seams between the Kinects’ facades are not only due to misalignment. They stem from the fact that the Kinect cannot sense surfaces that face its camera at a steep angle. For example, when I point out the seam running along my arm, the band of missing geometry between the two facades was only seen at a glancing angle by the two Kinects involved. Snapping the seams together via alignment would remove a long strip of my arm, in other words. Another factor is interference between multiple Kinects, which is a particular problem when a surface is seen by two Kinects at approximately the same angle.

      • But wouldn’t an algorithm that iteratively tries to minimize the seam thickness between each and all feeds converge to a solution even if it involves keeping some gaps bigger than zero?

        And regarding the interference; perhaps instead of just using the seam size measurements of each frame, it could use a moving average over several frames to discard outliers and smooth out smaller impressions?

        • I made a diagram, but I don’t think I can post it in a comment.

          Wait, I can: Alignment between three Kinects with missing seams

          The object on the left is a perfect circle with missing seams, the seam-aligned object on the right is smaller, and not quite a circle.

          • But what would happen if it was a more complex scene, with multiple objects?

          • Yes, that’s a good question. It might do the right thing, but if the initial calibration is good (and with the new method, it’s really good), then the right thing would be doing (almost) nothing. The seams need to be fixed on the capture side, by using more cameras, or using different scanning technology such as time-of-flight in Kinect2, or by merging facades in software (which is really hard). Once the seams are there, they have to stay because they are part of the real object, so to speak.

          • Could this be improved by adding another Kinect or by adjusting for lens flare from the cameras?

          • Adding more Kinects would reduce the seams, but would cause more interference. It’s a game of diminishing returns. There is a nifty method to reduce interference between multiple Kinects (Shake ‘n’ Sense), but I haven’t gotten around to trying it yet. I’ll have to at some point.

          • How long do you think it would’ve taken for such an algorithm to fix things after you yanked the cable in the video, if it was running at the time?

          • Now that is a really good question; I hadn’t thought about your suggestion in that context. This was the first time I knocked a camera; normally they’re locked down tight and out of reach. Threading the HDMI cable through the tripod’s legs was a bonehead move. There are two parts to calibration: the Kinects to each other, and the set of Kinects to the head and hand tracking system so that the view and the interaction cursor line up with the 3D video. The first part could be done iteratively like you describe, but I can’t estimate how many steps it would take. You have to be careful to do the iteration with a very low acceptance weight, or else temporary dropouts in the 3D video from shadowing etc. might negatively affect calibration. I’m purely guessing, but a severe knock like in the video could take minutes to correct itself by on-line calibration. Still much better than doing nothing.

            The second part is harder, because there is no obvious way to detect how the tracking and 3D video data should align short of mounting a very obvious marker on the hand-held device, like the glowy ball on the PS Move, only bigger. Detecting a Rift faceplate in the 3D video is doable (working on that for positional tracking), but other input devices are trickier — I can pass the wand from hand to hand, or put it on a table, or in a pocket, etc.

          • About the interference issue: Would putting linear polarizing filters over the emitters and cameras of each kinect, with 60 degrees of difference between the set of filters on each of the 3 kinects, help reduce the noise significantly? Or would it darken things too much to be better overall?

          • The Kinect 1 uses a laser diode to generate the speckle pattern, so it’s already linearly polarized. However, the polarization gets lost once the light hits non-metallic surfaces (reason why 3D theaters need silver screens). The ideal thing would be if the Kinect came in multiple IR “colors,” but I guess that ship has sailed now.

  11. Btw, could you upload the 3d video on Youtube with the 3d flag set, please? I have to make the video too small to be able to get the two sides to match when trying to make my eyes parallel, crosseyed is much easier for me.

    • The Rift format isn’t good for “normal” stereo viewing at all; due to the distortion and the internal geometry of the Rift, it only works well if you watch it through a Rift, which is why I disabled the 3D flag. I had considered uploading yet another version in YouTube 3D format, but I didn’t want to diversify the video too much (I already have to keep track of comments on two versions). Here’s a deal: I’ll upload the next video, which will feature remote collaboration, in mono and YouTube stereo.

      • When there is enough camera movement, I can get my brain to invert the depth on videos in the Rift format while watching cross-eyed (the effect doesn’t stick for long when there aren’t enough motion cues to help, but it’s still better than just watching in 2D). It doesn’t look as good as if it was made for cross-eyed screen watching, but it does work somewhat. Having the stereo functionality enabled would allow me to swap left and right; the distortion towards the edges would still be a problem, but it is still more fun than watching in 2D.

    • Thanks for the link; I know those guys, but hadn’t seen this paper yet. It’s very impressive, but the video shows their (fast) system running at 4.4 fps for an input point cloud of 175k points. That’s about the number of points in a background-removed Kinect frame, and 4.4 fps means that their processing adds 227ms latency. That’s a bit too much for my application. The bit about Kinect data later only shows a small cut-out of a full frame, and judging by the reduction in speed when they turn on denoising, it seems to run at around 10 fps, or 100ms latency for a partial frame.
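The latency figures in the reply above follow directly from the frame rates, if one assumes that each frame is processed in one go so that the processing delay equals one frame time. A small sketch of that arithmetic (the function name `frame_time_ms` is hypothetical):

```python
def frame_time_ms(fps):
    """Time spent per frame, in milliseconds, at a given processing rate."""
    return 1000.0 / fps


full_cloud_latency = frame_time_ms(4.4)      # 175k-point cloud at 4.4 fps
partial_frame_latency = frame_time_ms(10.0)  # partial Kinect frame at about 10 fps

print(round(full_cloud_latency), round(partial_frame_latency))
```

Under that assumption, 4.4 fps corresponds to roughly 227 ms and 10 fps to 100 ms, matching the numbers quoted in the reply.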

  12. Pingback: Here we go again with Apple’s holography patent | Doc-Ok.org

  13. I’m interested in knowing how you think your current system (Intel Core i7 @ 3.5 GHz, 8 GB RAM, Nvidia Geforce GTX 770) will hold up with the Kinect for Windows 2.0?

    I understand it’s speculative at the moment; however, will it be able to handle multiple Kinects? I’m assuming that 3-4 Kinects would need a much higher spec system for a live feed, but could it still handle merging and rendering a 3D image post recording?

    With USB 3.0 requirements it will require fairly powerful tech to do what you are doing. With your current system, how much data is being transferred from 3 Kinects? Is there the possibility of streaming it online?

    • The computer can handle simultaneous 3D video from six Kinects without breaking a sweat, at least that’s the largest number I’ve tried. That was from a three-site tele-collaborative system where each site had two Kinects attached, in other words, each site received 3D video from two local Kinects and from four more over the network. There was a fourth computer without local Kinects that received six streams over the network, and that one only had a Core i7 @ 2.6 GHz, 2 GB RAM, and a Geforce 275 (it was an older computer), and it still kept up; although, it was only rendering a mono view for a large-screen projector.

      The Kinect 2 isn’t really that much more demanding for the computer. It sends depth images at 512×424 pixels (Kinect 1: 640×480 pixels), and handling the depth image is the difficult part. The color stream resolution is much higher at 1920×1080, but texture mapping is relatively cheap. Network streaming will become more difficult because the color stream has to be compressed on the sender and decompressed on the receiver; I might have to downsample the color stream or come up with a multi-resolution approach for remote viewing. Compressing two and decompressing four 640×480 color streams simultaneously was part of the system I described above, so it’s not a problem.

      The raw amount of data streamed from one Kinect is 22 MB/s; after background removal and depth and color compression, typical bandwidth for network streaming is around 800 KB/s per Kinect. That is using a Theora video codec for color, and a custom lossless codec for depth.
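As a rough cross-check of the raw figure, the data rate can be estimated from the stream formats. The numbers below are assumptions, not taken from the reply: 640×480 frames at 30 frames per second, depth packed at 11 bits per pixel, and color sent as an 8-bit Bayer pattern. MB and KB are taken as powers of ten.

```python
WIDTH, HEIGHT, FPS = 640, 480, 30  # assumed Kinect 1 frame size and rate
DEPTH_BITS = 11  # assumption: packed 11-bit depth values
COLOR_BITS = 8   # assumption: 8-bit Bayer-pattern color


def stream_rate_mb_per_s(bits_per_pixel):
    """Raw data rate of one image stream in MB/s (1 MB = 10**6 bytes)."""
    return WIDTH * HEIGHT * FPS * bits_per_pixel / 8 / 1e6


depth_rate = stream_rate_mb_per_s(DEPTH_BITS)
color_rate = stream_rate_mb_per_s(COLOR_BITS)
raw_rate = depth_rate + color_rate

# Ratio between the quoted raw rate (22 MB/s) and network rate (800 KB/s)
compression_ratio = 22e6 / 800e3

print(f"depth {depth_rate:.3f} + color {color_rate:.3f} = {raw_rate:.3f} MB/s")
print(f"overall reduction: about {compression_ratio:.1f}x")
```

Under these assumptions the estimate comes out just under 22 MB/s per Kinect, and the quoted network bandwidth corresponds to an overall reduction of about 27.5 times from background removal and compression combined.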

  14. Pingback: On the road for VR: Silicon Valley Virtual Reality Conference & Expo | Doc-Ok.org

  15. Do you think it would be possible to also use Kinect (better 8-2 Kinect devices in a room) to take a 360° 3D picture of the object being captured, transfer the data, and convert it into a 3D printing file (STL file or similar) that can be sent to a 3D printer station?