There has been a lot of discussion about VR movies in the blogosphere and forosphere (just to pick two random examples), and even on Wired, recently, with the tenor being that VR movies will be the killer application for VR. There are even downloadable prototypes and start-up companies.
But will VR movies actually ever work?
This is a tricky question, and we have to be precise. So let’s first define some terms.
When talking about “VR movies,” people are generally referring to live-action movies, i.e., the kind that is captured with physical cameras and shows real people (well, actors, anyway) and environments. But for the sake of this discussion, live-action and pre-rendered computer-generated movies are identical.
We’ll also have to define what we mean by “work.” There are several things that people might expect from “VR movies,” but not everybody might expect the same things. The first big component, probably expected by all, is panoramic view, meaning that a VR movie does not only show a small section of the viewer’s field of view, but the entire sphere surrounding the viewer — primarily so that viewers wearing a head-mounted display can freely look around. Most people refer to this as “360° movies,” but since we’re all thinking 3D now instead of 2D, let’s use the proper 3D term and call them “4π sr movies” (sr: steradian), or “full solid angle movies” if that’s easier.
The second component, at least as important, is “3D,” which is of course a very fuzzy term itself. What “normal” people mean by 3D is that there is some depth to the movie, in other words, that different objects in the movie appear at different distances from the viewer, just like in reality. And here is where expectations will vary widely. Today’s “3D” movies (let’s call them “stereo movies” to be precise) treat depth as an independent dimension from width and height, due to the realities of stereo filming and projection. To present filmed objects at true depth and with undistorted proportions, every single viewer would have to have the same interpupillary distance, all movie screens would have to be the exact same size, and all viewers would have to sit in the same position relative to the screen. This previous post and video talk in great detail about what happens when that’s not the case (they are about head-mounted displays, but the principle and effects are the same). As a result, most viewers today would probably not complain about the depth in a VR movie being off and objects being distorted, but (and it’s a big but) as VR becomes mainstream, and more people experience proper VR, where objects are at 1:1 scale and undistorted, expectations will rise. Let me posit that in the long term, audiences will not accept VR movies with distorted depth.
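To make the dependence on interpupillary distance, screen size, and seating position concrete, here is a minimal sketch using the standard similar-triangles model of a stereo screen. The function name and all numbers are made up for illustration; the only thing it assumes is a viewer looking straight at a flat screen, with the on-screen parallax measured as the right-eye image position minus the left-eye image position.

```python
def perceived_distance(parallax_m, eye_separation_m, viewing_distance_m):
    """Distance (in meters) at which a viewer perceives a point that is shown
    with the given on-screen parallax.

    Similar triangles give Z = e * D / (e - p), where e is the viewer's eye
    separation, D the distance to the screen, and p the parallax.
    """
    if parallax_m >= eye_separation_m:
        # The lines of sight are parallel or diverging: the point appears
        # at or "beyond" infinity.
        return float("inf")
    return eye_separation_m * viewing_distance_m / (eye_separation_m - parallax_m)


# Hypothetical reference setup: 30 mm parallax, 60 mm eye separation,
# viewer sitting 2 m from the screen.
print(perceived_distance(0.03, 0.06, 2.0))
# Same footage, viewer with a 70 mm eye separation.
print(perceived_distance(0.03, 0.07, 2.0))
# Same footage on a screen twice as large (the parallax doubles).
print(perceived_distance(0.06, 0.06, 2.0))
# Same footage, viewer sitting twice as far away.
print(perceived_distance(0.03, 0.06, 4.0))
```

In this toy setup the same object lands at about 4 m, about 3.5 m, at infinity, or at about 8 m depending purely on who is watching, on what screen, and from where. That is the sense in which baked-in stereo depth is only correct for one set of viewing parameters.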
Using these two cornerstones, what is the gold standard of VR movie? Simple: interactive real-time computer-generated experiences, i.e., VR video games. VR video games are full solid angle by nature, and if users properly calibrated their displays, they also show proper distortion-free 3D, at least as long as developers don’t screw up. (I won’t name names here, but I’ve seen some very bad stuff among the current Oculus Rift demos. Fortunately, the market will soon weed those out.) Since video games rely too much on interactivity to scratch the same itch as watching a movie, the closest thing to perfect VR movies right now are Half-Life-style cutscenes, where the viewer can freely move and look around while scripted events are unfolding, and the developers took great care to make sure viewers don’t miss important things by wandering off (watch the excellent Half-Life 2 embedded commentaries to learn what exactly goes into that).
So what’s the difference between VR video games as movie-like experiences and stereo movies? The former are free-viewpoint video, i.e., the viewer can change viewing position and direction dynamically after the movie has been produced, whereas the latter have baked-in stereo, i.e., they show a 3D scene from a fixed viewpoint and using fixed viewing parameters, which cannot be changed afterwards and only look correct if viewed using the exact same parameters. Machinima is a good illustration: if machinima movies were distributed as script files that are viewed live inside the game engine that produced them, they would be VR movies; if they are rendered into (stereo or mono) movie files and viewed on YouTube, they are not.
To reiterate: stereo movies can only be watched properly if the projection parameters used during production precisely match those used during viewing (the old toe-in vs. skewed frustum debate is just one little aspect of that larger issue). That’s why I don’t call them 3D, and that’s why they don’t work in VR (with “work” defined as above).
Let’s assume there exists some camera rig that can capture completely seamless full-solid-angle video, to cover the panoramic aspect. How could this rig be used to capture stereo video? In regular stereo filming, there are two cameras side-by-side, approximating the average viewing parameters used during playback, i.e., approximating the position of the viewers’ eyeballs. But if the viewer is free to look around during playback, how can the fixed camera positions used during recording match that? Spoiler alert: they can’t. Now, there are approximations to create stereo movies that provide some sense of depth for 360° panoramic projection — but notice I’m using the 2D term “360°” here, on purpose. That’s because those methods account for viewers looking around horizontally, i.e., for yaw, but they don’t account for viewers looking up or down, i.e., for pitch, or at least they don’t right now. But even if the approximations did account for pitch (which would be a straightforward extension), there are two things they can not account for: head tilt, i.e., roll, and translation.
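A back-of-the-envelope sketch of why a fixed camera pair cannot match a viewer who looks around: assume two idealized full-sphere cameras mounted side by side, and a viewer who turns their head by some yaw angle. Only the part of the camera offset that is perpendicular to the viewing direction produces usable horizontal disparity. The function name and the 63 mm baseline are assumptions for illustration, not a description of any real rig.

```python
import math


def effective_baseline(camera_baseline_m, yaw_deg):
    """Stereo baseline that a fixed side-by-side camera pair provides for a
    viewing direction rotated by yaw_deg away from the cameras' forward axis.

    Only the component of the camera offset perpendicular to the viewing
    direction contributes, hence the cosine.
    """
    return camera_baseline_m * math.cos(math.radians(yaw_deg))


# Assumed 63 mm camera separation (roughly a typical eye separation).
for yaw in (0, 45, 90, 180):
    print(yaw, round(effective_baseline(0.063, yaw), 4))
```

Looking straight ahead, the viewer gets the full baseline. Looking 90° to the side, the two cameras are lined up behind each other and the baseline collapses to zero, and looking backwards it flips sign, meaning left and right views are swapped. The panoramic stereo approximations mentioned above exist precisely to work around this for yaw.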
Let’s talk roll first. When viewing a stereo picture, the displacement vector between the left and right views must be parallel to the line connecting the viewer’s pupils. Do a simple experiment: look at a stereo picture, or go see a stereo movie, and tilt your head 90° to the left or right. Boom, stereo gone, instant eye pain. With baked-in stereo, there is no way to account for different roll angles short of rotating the entire video with the viewer, which would be a recipe for instant nausea (think about it).
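The roll problem can be put into numbers with a little trigonometry. The sketch below takes a baked-in, purely horizontal screen disparity and splits it into the part that lies along the viewer's eye-to-eye axis (which the visual system can use) and the part perpendicular to it (vertical disparity, which it cannot). The function name is made up, and the sign of the perpendicular part depends on the chosen convention, so only magnitudes matter here.

```python
import math


def disparity_in_eye_frame(disparity, roll_deg):
    """Split a horizontal on-screen disparity into components along and
    perpendicular to the viewer's interocular axis, for a head rolled by
    roll_deg relative to the screen's horizontal."""
    roll = math.radians(roll_deg)
    along = disparity * math.cos(roll)          # usable stereo disparity
    perpendicular = disparity * math.sin(roll)  # unwanted vertical disparity
    return along, perpendicular


for roll in (0, 20, 90):
    along, perpendicular = disparity_in_eye_frame(1.0, roll)
    print(roll, round(along, 3), round(perpendicular, 3))
```

At 90° of roll the entire disparity has turned into vertical disparity, which is the "stereo gone" case from the experiment. But even a modest 20° tilt already converts about a third of the disparity into a vertical offset that the eyes have to fight against.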
Translation, such as imposed by positional head tracking, has the same fundamental problem. Positional head tracking is important because it introduces motion parallax, an important depth cue. But how can a movie with baked-in stereo reproduce motion parallax? If there is a cube directly in front of you, you see the front face. If you move to the left, you see the side face as well. But if the capturing camera didn’t see the side face, how can the movie show it to you during playback?
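The cube example is easy to check with a few lines of code. This is a minimal sketch, assuming a unit cube centered at the origin with its front face pointing towards +z; a face is visible when the viewer is on the outward side of it, which is a simple dot-product test. All names and positions are illustrative.

```python
def visible_faces(viewer):
    """Return the names of the faces of an axis-aligned unit cube, centered
    at the origin, that can be seen from the given (x, y, z) position."""
    normals = {
        "front": (0, 0, 1),
        "back": (0, 0, -1),
        "left": (-1, 0, 0),
        "right": (1, 0, 0),
        "top": (0, 1, 0),
        "bottom": (0, -1, 0),
    }
    visible = []
    for name, normal in normals.items():
        # The face center sits half a unit from the origin along the normal.
        center = tuple(0.5 * n for n in normal)
        to_viewer = tuple(v - c for v, c in zip(viewer, center))
        # The face is visible if the viewer is in front of its plane.
        if sum(n * t for n, t in zip(normal, to_viewer)) > 0:
            visible.append(name)
    return visible


print(visible_faces((0.0, 0.0, 3.0)))   # camera position during capture
print(visible_faces((-1.0, 0.0, 3.0)))  # viewer has moved one unit to the left
```

From the capture position only the front face is visible; one step to the left, the left side face comes into view as well. A movie recorded from the first position simply contains no pixels for that side face, no matter how cleverly it is played back.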
The bottom line is that showing a movie with baked-in stereo in a VR setting, e.g., on a head-mounted display, implies disabling positional head tracking, and disabling roll tracking. And even if viewers are conditioned not to move or tilt their heads, even yaw and pitch are only approximations that introduce some noticeable distortions. As the Oculus Rift dev kit v1 has shown, disabling positional tracking is bad enough for most people; disabling roll tracking is much, much worse.
Does this mean live-action VR movies are impossible? No. It only means that baked-in stereo is not an applicable technology, so don’t buy stock in companies that produce “360° 3D movie capture cameras.” They might get away with it now, but they won’t be able to for long. Just like black&white or silent movies don’t really have blockbuster appeal these days.
What are the alternatives? Machinima is one, clearly. Why shouldn’t there be a machinima Citizen Kane? If zombie Orson Welles were working today, he probably would be able to use the constraints of the medium to glorious effect. And Lucasfilm are on the job.
But to get to real live action, we need true 3D cameras instead of stereo cameras. The distinguishing feature of a true 3D camera is that it doesn’t capture baked-in stereo views of whatever scene it’s recording, but a true 3D model of the scene, which can then be played back from arbitrary viewpoints later on. The Kinect is one example of a (low-res) 3D camera. LiDAR is an alternative, and so are algorithms creating 3D models from many 2D photographs. The problem with the latter two, at least right now, is that they don’t work in real time. But for movie production, that’s not really a problem. And before you get all nitpicky: yes, it’s possible to convert a stereo camera into a 3D camera by running a stereo reconstruction algorithm and creating 3D models from the stereo footage. The distinction is in the product (baked-in stereo vs live-rendered 3D models), not in the technology to create the latter.
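To illustrate what "capturing a 3D model instead of a view" means for a depth camera, here is a minimal sketch that turns a depth image into 3D points using the usual pinhole camera model. The function name, the tiny 2×2 "image," and the intrinsic parameters (fx, fy, cx, cy) are all hypothetical; this is not the Kinect's actual API or calibration.

```python
def depth_to_points(depth, fx, fy, cx, cy):
    """Back-project a depth image into 3D points in the camera's frame.

    depth is a list of rows of distances in meters (0 means "no data");
    fx, fy are focal lengths in pixels and (cx, cy) is the principal point.
    """
    points = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            if z <= 0:
                continue  # the sensor saw nothing here: a hole in the model
            x = (u - cx) * z / fx
            y = (v - cy) * z / fy
            points.append((x, y, z))
    return points


# Toy 2x2 depth image with two missing pixels and made-up intrinsics.
toy_depth = [
    [0.0, 2.0],
    [1.0, 0.0],
]
print(depth_to_points(toy_depth, fx=1.0, fy=1.0, cx=0.5, cy=0.5))
```

Once the scene exists as points (or a mesh built from them), it can be rendered from any viewpoint, with proper stereo for any head position and orientation. The skipped pixels are where the trouble starts: whatever the camera could not see stays missing in the model.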
Any way it’s done, the bottom line is that the technology for proper live action VR movies is not there yet. 3D cameras are low-res and have occlusion problems, or 3D capture spaces are so involved gear-wise that they are not practical for larger or longer scenes. In the short run, if you’re an aspiring VR movie producer, better read up on how to use game engines to create machinima.
Let me close with a related thought: I’m expecting that proper VR movies will use an entirely different language, so to speak, than “traditional” movies. Watching a VR movie would be more like being on stage during a live play, instead of viewing a scene unfold through a viewpoint prescribed by the director. It’s going to be very exciting to see how this develops.
Great post. I also think that movies are intrinsically hard to pair with VR since movies are all about attention distribution, something which is very hard to do in an environment where restrictions feel very unnatural. I do, however, think that AR and VR ‘theater’ has the possibility of being successful, since the focus there lies more on the people on stage and less on the environment.
I’m currently working on a VR play, as in I’m doing the tech side of it. We’ll see how it plays out.
I’ve been speculating about this a lot. People take videos with consumer cameras and phones to capture moments. You watch those videos to relive those moments. I’ve always wanted to take that “moment-capturing” to the next level. If you wanted to capture that moment to the fullest extent, you would want 3D, 360 degree video with binaural sound, so you could look around this moment as it’s happening, similar to this musical performance (http://www.hello-again.com/beck360/main/beck360.html), except 3D. I thought this would be possible if you had 3D wide angle cameras pointing in 6 directions, stitched together in post. Reading this blog post made me realize how much this won’t work, but makes me realize how awesome the consumer cameras of the future will be.
One way I could see it sort of working that wouldn’t be too big a paradigm shift for old-school movie makers would be to have a “virtual theater,” a ring array instead of a stereo pair (to account for roll), and some algorithm to interpolate between viewing positions (and cameras). This would still, mostly, let directors retain control over camera positions and orientation, zoom, etc., while providing an immersive yet theater-like experience for the viewers.
For a fuller 360×180 3D experience, I guess it could be possible to have a camera ball with lots of fisheye lenses with overlapping fields of view (each lens’ field of view should be fully covered by the combination of its neighbors’, and even better if also by its neighbors’ neighbors’), and the depth information would be extracted from the parallax between the different lenses, allowing the viewer to move their head freely within a certain volume. Hiding the person/rig holding the camera would be an added obstacle for this (unless the cameraball is built around something like an x-copter drone, or hung on fishing lines or something of the sort). If the director insists on directing the viewer’s attention in a specific direction, a variety of filters could be applied, mainly stuff similar to vignetting; though I guess it would be possible to lock the view angle for more extreme cases (obviously something that shouldn’t be abused). For zooming, besides depth compression, it would be possible to change the relative scale of the world by making the cameraball expand and contract (either just put the cameras on telescopic poles, or something more complex, like a Hoberman sphere ( https://www.youtube.com/watch?v=xRL0tMaNgjQ )); well, for the less extreme cases, I guess it could be possible to just interpolate the viewer size without changing the physical size of the camera.
There is a third possibility, but this might be a bit too high-tech for the moment: big lightfield cameras (bigger than most of the ones we’ve got, with a lightfield capturing surface area bigger than a human head). This would have similar benefits as the ring array but require less extrapolation (and I guess you could make a cube of it and get a 360×180 capture as well). I think Microsoft was working on a screen that could both emit and capture lightfields at high resolution without being thicker than a pane of glass (I don’t remember the details, but it would somehow redirect light inside of it almost 90 degrees and send it down to one of the screen’s edges, while preserving the lightfield information).
And I guess if extrapolating parallax via software can produce good enough results with just 3 points of view as input, it might work to have 4 almost fully spherical cameras arranged on the tips of a tetrahedron (with their blind spots aimed at the center of the tetrahedron); I guess this would be like a minimalist version of the cameraball described above.
I’ve been thinking this over a lot recently. Specifically I’ve been thinking of replication of real-world spaces in VR. The 360 degree video or photo has a fixed origin around which the user can rotate perspective. When the video moves from the origin, the user is in a tunnel, like a sphere stretched out. At best this is like rewinding or forwarding video. You are on a train tour of the scene, but are locked onto the rails. A solution would be to capture a room on a grid of interconnected 360 photos, and have a software program draw ALL the intervening frames similar to image stitching or tweening. This would allow X & Y freedom of movement at eye level. For the Vertical shift, a second grid around hip height would be used, which would connect to the eye-level grid at the same ratio as each “snap point” on the upper grid.
For the photography, a 360 degree camera on an adjustable-height tripod that snaps between the two height positions (roughly 5′ 9″ and 2′ 9″) would be used. To get perfect replication, a grid would be used to capture the room, with a laser projector displaying the photo positions as points on the floor. At each point, snap a lower and a higher 360 photo.
This would give you a series of photos (similar to Google Streetview) that have all connecting positions extrapolated by software. Pair this with a 3D scan of the room ( Kinect, Project Tango) and you would have a highly accurate roam-able recreation of space with strong detail.
Email me if you want, I have diagrams and many more ideas on this.
Ryan, your post is precisely what this article is trying to call attention to. Having even a spherical representation allows you to rotate your head but there is no 3D. One needs to understand that depth perception comes from the different positions of your eyes in space seeing things at different angles. The effect you get with an Oculus Rift and a 360 degree picture is that of looking at the inside of a painted sphere, not an experience of 3D. A Kinect scan has a finite depth range and won’t work for outdoor spaces. Lasers can give you estimated depths but you still can’t capture all the information for all the different positions and angles from which your eyes can be viewing.
I’m glad to see this article as it seems so obvious to me. I just don’t understand why Oculus is necessary for what they are calling VR movies. Clearly you can’t move your head at all in a VR movie and if you can’t, then just use some simpler device which allows simple 3D viewing. To get full VR movies, you need incredible processing power to allow the viewer to move in the 3D environment because each frame has to be dynamically generated. This cannot be achieved by home computing currently, and even existing movies like Avatar take massively parallel computing along with post-processing and editing effects to produce. There may be a middle ground, with complexity somewhere between home computing and massively parallel, time-consuming movie-quality rendering, where cloud computing lets users take part in somewhat movie-like experiences, with something like Onlive processing and delivering the video to you. Then you have other concerns like latency. If you think the latency on an Oculus when moving your head fast is a problem now, try adding the delay caused by a round trip over the cloud. However, I think predictive algorithms and some hybrid local processing, along with technology like what Artemis can provide, could potentially reduce this problem. I experimented with positional head tracking with an Oculus using just a web cam about a year ago and I found it “acceptable”. If a web cam can be acceptable with its latency, acceptable latency over the cloud is achievable. In my personal opinion, Facebook should now purchase a company like Onlive and then we’ve got an awesome VR future; with one more purchase such as Square/Enix it would be mind blowing!
I only just discovered this blog and really like what I’m reading and what’s being discussed. I agree on the physical limitations of real-life recording for VR, and that machinima would be wise choice for now.
However, I do see an opportunity for technical solutions in the grey area between these two. Meaning either 1) some lightfield-like capturing/playback of ‘local’ elements like faces, distant people that are streamed into a layer of a real-time 3D CG scene or 2) Non-realtime CG synthesis of a “good enough” lightfield that can be rendered from in realtime when using an HMD. For this option, I would already like to see how well a static scene can be made.
Technicalities aside, getting the “Visual Story” right will be a challenge on its own. (see http://www.amazon.com/The-Visual-Story-Creating-Structure/dp/0240807790)
ps. OK, it’s been awhile since we talked — but I left Academia (and VR) 1.5 years ago and now reading up VR stuff again for a hobby. Good to see you still going strong, just like with your VR/LIDAR/Kinect tracking from some years ago.
Hi Gerwin, how are you doing? Sad to hear you left VR. Yeah, I’m still doing it; somebody has to carry the torch so that all the stuff that was discovered/invented long ago doesn’t get forgotten by the new developers rushing in now. Apparently, there are a few users of Vrui and LiDAR Viewer at TU Delft now. I’m getting bug reports from there.
I don’t know how many ways to say this but capturing the environment with two points in space and thinking you can now use head movement in VR just doesn’t work. You have to completely recreate the physical world in 3D space in order to view it in the unlimited angles once the Oculus is strapped on. Are people still just not grasping this concept?
@Harley, I agree that just capturing from two points is not enough, even if you capture the “full solid angle” for both. I guess I wasn’t completely clear, because I was referring to a much broader approach to light field capturing.
Theoretically, one should capture the “complete” (high-dimensional) lightfield, meaning a continuous function of all incoming light from any angle at any position. Like with all practical forms of lightfield capturing, you give up some degrees of freedom, limit the capture domain, and (often) discretise data. Examples are taking a normal photo or in CG rendering a spherical full solid angle at a single point.
The interesting challenge in lightfield capturing specifically for a (seated) HMD experience as the target, is to match the translational and rotational range, and to determine its lower limit sample resolution. This lower limit is determined by how well we need to restore/render the original lightfield by interpolation and reconstruction (at run-time) for HMD output. In practice: where, how many and at what resolution full solid angles should we capture.
Even in CG, increasing the dimensions and sampling rate quickly leads to a seemingly insurmountable amount of data. Most likely, big parts of this enormous amount of data have high redundancy and could be subject to compression and easy interpolation. For example, distant scene objects hardly change appearance throughout the spatial range of a (seated) HMD user.
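The point about distant objects can be quantified with a rough sketch: for a sideways head movement perpendicular to the line of sight, the apparent direction to an object changes by the arctangent of offset over distance. The function name and the 10 cm head movement are assumptions chosen for illustration.

```python
import math


def parallax_shift_deg(head_offset_m, object_distance_m):
    """Change in apparent direction (in degrees) of an object when the head
    moves sideways by head_offset_m, perpendicular to the line of sight."""
    return math.degrees(math.atan2(head_offset_m, object_distance_m))


# Assumed 10 cm of sideways head movement for a seated viewer.
for distance in (1.0, 10.0, 100.0):
    print(distance, round(parallax_shift_deg(0.1, distance), 3))
```

An object at arm's-length scale shifts by several degrees, while one 100 m away shifts by a few hundredths of a degree, which is likely below what an HMD can display. That suggests far-away parts of the lightfield could be sampled much more sparsely in position than nearby ones.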
I’ve seen approaches for autostereoscopic displays, but I’d be interested to read about or even see experiments in people attempting to do this type of stuff for HMDs.
Oh, oh. “Jaunt provides an end-to-end solution for producing cinema-quality [VR] experiences that are simply unforgettable,” said Arthur van Hoff, CTO of Jaunt.
Well, let’s see how that pans out.
The article made for an interesting read, no doubt, but there are quite a few things that aren’t clear (to me).
>>”…Let’s talk roll first. When viewing a stereo picture, the displacement vector between the left and right views must be parallel to the line connecting the viewer’s pupils. Do a simple experiment: look at a stereo picture, or go see a stereo movie, and tilt your head 90° to the left or right. Boom, stereo gone, instant eye pain. ..”
This boom, stereo gone is true, but mainly because of the viewing medium (usually polarized glasses cancelling the stereo effect), not with the dedicated per-eye view that a stereo HMD offers.
>>”…No. It only means that baked-in stereo is not an applicable technology, so don’t buy stock in companies that produce “360° 3D movie capture cameras.” They might get away with it now, but they won’t be able to for long…”
I’m going to disagree with this. Someone’s already mentioned Jaunt.
I’ll go as far as saying that, in a “narrative VR movie,” there are solutions that actually leverage the gradual stereo fall-off that 360 S3D “cameras” or 360 full-sphere lens-based cameras could offer, a happy trade-off that satisfies the audience as well as the director (the storyteller who’s fighting for an audience’s eyeballs in a VR movie).
Here’s a demo on the oculus share site of such a one-shot full sphere 3D-360 camera: https://share.oculusvr.com/app/tech-demo-narrative-cinematic-vr-via-stereoscopic-one-shot-360-capture
Some thoughts on the language of VR filmmaking : http://realvision.ae/blog/2014/08/the-language-of-visual-storytelling-in-360-virtual-reality/
Almost all 3D cinema uses circular polarization, which works no matter how you turn your head. Either way, that’s not the problem. The problem is that the stereo images are generated using horizontal offsets, because our eyes are horizontally offset when we hold our heads level. If you now tilt your head 90 degrees, your eyes are no longer horizontally offset, but vertically offset, but the stereo image on the screen is still horizontally offset. Therefore, the illusion of 3D breaks down completely, and you get a headache while your eyes are trying to make sense of what they’re seeing. That stays true with HMDs. In real-time CG viewing, the stereo separation in an HMD is always adjusted to match the separation vector between your eyes, but with pre-rendered video with baked-in stereo that’s not possible.
I recently installed a large stereo screen at a partnering institution, and the principal there told me that she always gets headaches from 3D movies (and our screen). I watched her watching it, and noticed she leans her head to one side by about 20 degrees or so, like many people do. That’s the problem right there, because the separation vector between her eyes is no longer parallel to that of the stereo imagery, and her eyes try to compensate by moving vertically, independently. That never happens in the real world, and causes eye strain and headaches.
We’ll see how this pans out. Efforts like Jaunt etc. get people really excited at first glimpse, because nobody has ever seen panoramic stereoscopic movies before. But once people spend more time with those, we’ll know for sure whether they really work or not.
For some reason, I didn’t get a notification about your reply despite subscribing…
I seem to have skimmed the part where you explicitly mentioned “turn your head 90 degrees” before the boom-stereo-gone.
You’re right on that, stereo will break down at that angle…
but with circular / anaglyph / shutter, I’ve noticed that up to 45 degrees, any tilt left or right shows no detrimental effect on stereo. (I suppose mileage varies depending on the person.)
Circular polarization does also break down at those angles; linear polarization breaks down at much smaller tilt.
on an unrelated note – absolutely loved the video on the 3 kinect 3D model presence demo!
Bookmarked this blog / repository on VR.
The brain/eyes are quite flexible with respect to misaligned stereo, but even if it apparently works, there are side effects. From speaking to many people, I think one of the main reasons why viewers report headaches from 3D movies is that they tilt their heads while viewing, even if not by much. They still get some stereo effect, but they are giving their eye muscles a really hard time adjusting to vertical disparity, which never appears in the real world.
We are seeing a lot of this in the CAVE. By design it’s a single-user environment because of its head-tracked stereo, but in practice it’s used by small groups. We have to explicitly tell the tracked user to keep their head level, or the other viewers will get headaches quickly. The CAVE uses active (shuttered time-multiplexed) stereo, so loss of stereo separation has nothing to do with it.