The old AR Sandbox support forum went down a good while ago due to hardware problems, and there is currently no avenue of getting it back up. That forum was quite active, and it significantly reduced my support load, not only by letting me answer common questions once instead of dozens of times, but also through community members directly helping each other.

So I decided to create a new AR Sandbox support forum on this here web site, as a hopefully temporary replacement. I was not able to move over any of the old forum content due to not having access to the original database files, which is a major pity because there was a ton of helpful stuff on there. I am hoping that the new forum will accumulate its own set of helpful stuff quickly, and if/when I migrate the forum to a permanent location, I *will* be able to move all content because I have full access to this web site’s code and database. So here’s hoping.

This is the first forum on this web site, so I hope that things will work right from the start; if not, we’ll figure out how to fix it. Please be patient.

And as a quick reminder: These are the **only** official AR Sandbox installation instructions. Accept no substitutes.

The server (by which I mean the physical mid-size tower PC, see Figure 1) that used to run this blog, and was stashed in a server room in my old building on UC Davis campus, went down in June 2021 due to a brief power outage, and I never got around to turning it back on due to the COVID-related campus lock-down.

I finally remembered to ask the CS department’s IT support staff to pull it out of that server room a few days ago, and have been migrating this site to a new, actually virtual, server since then. And here we are! There’s still a lot of maintenance to do, such as upgrading all the hideously outdated platform packages, but at least the old content is back for the time being.

In other news, I myself moved from the Department of Earth & Planetary Sciences to the UC Davis DataLab around the same time this site went down, and recently finished setting up VRoom!, DataLab’s new multi-user VR space. There will be a detailed post about that soon. There has been a lot of movement on Vrui’s collaboration infrastructure as well, and there were some exciting adventures in Lighthouse tracking.

“Kelly subtracted 2.3 from 20 and got 17.7. Explain why this answer is reasonable.”

The obvious answer is “because it is correct.” But that would get the student zero points. The expected (I assume) answer is about number sense / estimation, e.g., “If I subtract 2 from 20 I get 18, but I have to subtract a little bit more, and 17.7 is a little bit less than 18, so 17.7 is a reasonable answer.” Now my issue with this problem is that the actual arithmetic is so simple that it is arguably easier to just do it than to go the estimation route. **The problem sets the students up for failure, and undercuts the point of the unit: that estimation is a valuable tool.** A better problem would have used numbers with more digits to hint that the students were supposed to estimate the result instead of calculating it, and to show that estimation saves time and effort.

“At a local swim meet, the second-place swimmer of the 100-m freestyle had a time of 9.33 sec. …”

This one made me laugh out loud, and I’m not even a sports fan who follows swimming. But even I know that swimming is a lot slower than running, and upon checking, I found that the world record for the 100m freestyle is 46.91 seconds. Who was competing in this “local swim meet?” Aquaman? My issue here is that the problem creator failed to understand the reason for using this type of word problem: reinforcing the important notion that math is important in the real world. But by choosing these laughable numbers, the creator not only undercut that notion, but created exactly the opposite impression in the students: that **math has no relationship to the real world**.

And from today’s section of the textbook, this table:

Location | Rainfall amount in a typical year (in inches) |
---|---|
Macon, GA | 45 |
Boise, ID | 12.19 |
Caribou, ME | 37.44 |
Springfield, MO | 44.97 |

Followed by this question: “What is the typical yearly rainfall for all four cities?” The book expects 139.6 inches as the answer, but **that answer makes no sense**. Rainfall amounts measured in inches cannot be added up between multiple locations, because they are **ratios**, specifically volume of rain per area. How is that supposed to work? Stacking the four cities on top of each other? As in the previous example, this problem undercuts the goal of showing that math has a relationship to the real world. These students, being in fifth grade, wouldn’t necessarily realize the issue with this problem, but it really makes me wonder whether the person creating this example has advanced beyond fifth grade. Or, even worse, whether that person is actively trying to create the impression that math is just some numbers game that happens in a vacuum. If so, good job.

My daughter was actually stumped by this last one, having no idea what the book meant by “typical yearly rainfall for all four cities,” and I had to explain to her that the question makes no sense, and reassure her that math is important, even if the math textbook goes out of its way to teach the students that math is frustrating, incomprehensible, and has no point. Again, good job, textbook writers.

In violation of Betteridge’s Law, I will answer the question posed in this post’s headline with a resounding “**YES!**”

Two hours and 153 lines of code later, here are a couple of images which are hopefully true to scale. I used 160km as the Death Star’s diameter, based on its Wookieepedia entry (Wikipedia states 120km, but I’m siding with the bigger nerds here), and I assumed the meridian trench’s width and depth to be 50m, based on the size of an X-Wing fighter and shot compositions from the movie.

**Side note: **I don’t know how common this misconception is, but the trench featured in the trench run scenes is *not* the equatorial trench prominently visible in Figure 1. That one holds massive hangars (as seen in the scene where the Millennium Falcon is tractor-beamed into the Death Star) and is vastly larger than the actual trench, which is a meridian (north‒south running) trench on the Death Star’s northern hemisphere, as clearly visible on-screen during the pre-attack briefing (but then, who ever pays attention in briefings?).

The images in Figures 2-6 are 3840×2160 pixels. Right-click and select “View Image” to see them at full size.

As can be seen from Figures 2-6, the difference between the flat miniature used in the movie, and the spherical model I used, is relatively minor, but noticeable — ignoring the glaring lack of greebles in my model, obviously. I noticed the lack of curvature for the first time while re-watching A New Hope when the prequels came out, but can’t say I ever cared. Still, this was a good opportunity for some recreational coding.
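
For a sense of why the curvature difference is so minor: the deviation of the sphere’s surface from a flat miniature over a stretch of trench is just the sagitta of the corresponding arc. A quick sketch (the 10 km stretch length is my guess for illustration, not a number from the movie):

```python
import math

def sagitta(radius, chord):
    """Height of a circular arc above the straight chord that subtends it."""
    return radius - math.sqrt(radius**2 - (chord / 2.0)**2)

R = 80_000.0    # Death Star radius in meters (160 km diameter)
run = 10_000.0  # assumed length of the visible trench-run stretch, in meters
print(f"surface drop over {run / 1000:.0f} km: {sagitta(R, run):.1f} m")
```

That works out to roughly 156 m of drop over 10 km, or about 1.5% of the stretch’s length, which is why the flat miniature mostly gets away with it.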

Here are some quotes from one article I found: “I was initially hoping to use UDP because latency is important…” “I haven’t been able to fully test using TCP yet, but I’m hopeful that the trade-off in latency won’t be too bad.”

Here are quotes from another article: “UDP has it’s [sic] uses. It’s relatively fast (compared with TCP/IP).” “TCP/IP would be a poor substitute [for UDP], with it’s [sic] latency and error-checking and resend-on-fail…” “[UDP] can be broadcast across an entire network easily.” “Repeat that for multiple players sharing a game, and you’ve got a pretty slow, unresponsive game. Compared to TCP/IP then UDP is fast.” “For UDP’s strengths as a high-volume, high-speed transport layer…” “Sending data via TCP/IP has an ‘overhead’ but at least you know your data has reached its destination.” “… if the response time [over TCP] was as much as a few hundred milliseconds, the end result would be no different!”

First things first: Yes, UDP can send broadcast or multicast IP packets. But that’s not relevant for 99.9% of applications: IP broadcast only works on a single local network segment, and IP multicast does not work on the public Internet — there is currently no mechanism to assign multicast addresses dynamically, and therefore multicast packets that do not use well-known reserved addresses are ignored by Internet routers. So no points there.

In summary, according to these articles (which reflect common wisdom; I do not intend to pick on these specific authors or articles), TCP is slow. Specifically — allegedly — it has high latency (a few hundred milliseconds over UDP, according to the second article), and low bandwidth compared to UDP.

Now let’s put that common wisdom to the test. Fortunately, my collaboration framework has some functionality built in that allows a direct comparison. For example, the base protocol can send echo requests (akin to ICMP ping) at regular intervals, to keep a running estimate of transmission delay between the server and all clients, and to synchronize the server’s and client’s real-time clocks. These ping packets are typically sent over UDP, but since not all clients can always use UDP, the protocol can fall back to using TCP. The echo protocol is simple: the client sends an echo request over TCP or UDP, the server receives the request, and immediately sends an echo reply to the client, over the same channel on which it received the request. This allows us to compare the latency of sending data over TCP vs. UDP.
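
The echo exchange is simple enough to sketch in a few lines. Here is a minimal, self-contained UDP version running client and server on localhost (all names are mine, not the collaboration framework’s, and the real protocol does considerably more, like clock synchronization):

```python
import socket
import threading
import time

def echo_server(sock):
    """Server side: immediately return each request on the channel it arrived on."""
    while True:
        data, addr = sock.recvfrom(1024)
        sock.sendto(data, addr)  # echo reply
        if data == b"quit":
            break

# UDP echo server on localhost; the OS picks a free port.
server = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
server.bind(("127.0.0.1", 0))
threading.Thread(target=echo_server, args=(server,), daemon=True).start()

# Client side: send echo requests and time the round-trips.
client = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
rtts = []
for _ in range(50):
    t0 = time.perf_counter()
    client.sendto(b"ping", server.getsockname())
    client.recvfrom(1024)
    rtts.append((time.perf_counter() - t0) * 1000.0)  # in ms
client.sendto(b"quit", server.getsockname())

print(f"mean RTT: {sum(rtts) / len(rtts):.4f} ms")
```

Swapping `SOCK_DGRAM` for `SOCK_STREAM` (plus the usual connect/accept handshake) gives the TCP variant of the same measurement.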

I ran the first experiment between my home PC and my server at UC Davis. Here are the results from 200 echo packet round-trips (I also list timings using ICMP, i.e., the “real” ping protocol, as a baseline):

Ping method | Mean round-trip time [ms] | Std. deviation [ms] |
---|---|---|
TCP | 48.043 | 4.171 |
UDP | 48.352 | 4.280 |
ICMP | 47.273 | 4.721 |

Oookay, that’s not exactly what common wisdom would predict. TCP and UDP have the same latency (the minor numerical difference is safely within the margin of error), and are less than 2% slower than bare-metal ICMP. Let’s try that again, but between a client and server running on the same computer:

Ping method | Mean round-trip time [ms] | Std. deviation [ms] |
---|---|---|
TCP | 0.3100 | 0.0524 |
UDP | 0.2896 | 0.0280 |
ICMP | 0.0710 | 0.0120 |

In this test, UDP is indeed faster than TCP, by a whopping 0.02 ms (again, within the margin of error). Notably, ICMP is now faster than TCP and UDP by a factor of more than four, which is explained by ICMP running entirely in kernel space, while my collaboration infrastructure sends and receives packets from user space.

So what gives? Why does everybody know that TCP sucks for low latency? The issue is failure recovery. TCP is — clearly — just as “fast” (in terms of latency) as UDP *as long as no IP packets get lost*. So what happens if IP packets *do* get lost? In UDP’s case, nothing happens. The receiver doesn’t receive the packet, and the loss does not affect latency. In TCP’s case, the error recovery algorithm will notice that a packet was lost, duplicated, or sent out-of-order, and the receiver will request re-transmission of the bad packet (actually, TCP uses positive acknowledgment, but whatever). And because re-sending a bad packet takes at least a full round-trip between receiver and sender, it does indeed add to worst-case latency.

So that’s bad. Under failure conditions, TCP can have higher latency. But what’s the alternative? The sender (usually) does not send packets for funsies, but to communicate. And if packets get dropped or otherwise mangled, communication does not happen. Meaning, if some UDP sender needs to make sure that some bit of data actually arrives at the receiver, it has to implement some mechanism to deal with packet loss. And that will increase worst-case latency, just as it does for TCP. So the bottom line is: UDP has lower worst-case latency than TCP *if and only if* any individual piece of sent data does not matter. In other words: UDP has lower worst-case latency than TCP only when sending *idempotent* data, meaning data where it doesn’t matter if not everything arrives, or some data arrives multiple times, or packets arrive out-of-order, as long as a certain fraction of data arrives. Typical examples for this type of data are simple state updates in online games (an example discussed in the second article I linked), or audio packets for real-time voice chat. In most other cases using UDP does not actually help, or even hurts (the main point of the second article I linked). Even in online games, it is generally only okay if one of a player’s many mouse movement packets is lost, because the next one will update to the correct state (idempotent!), but if a button click packet gets lost and the player’s gun doesn’t shoot, there’ll be hell to pay.

So the common wisdom should actually be: If you want to send a stream of packets at low latency, and each subsequent packet will contain the full end state of your system, i.e., updates are idempotent as opposed to incremental, then use UDP. In all other cases, use TCP. And, generally speaking, don’t attempt a custom implementation of TCP’s failure correction in your UDP code, because it’s highly likely that the TCP developers did it better (and TCP runs in kernel space, to boot).
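
What “idempotent” means in practice can be sketched in a few lines (hypothetical names, simulated packet arrival): a receiver of sequence-numbered full-state updates can simply ignore anything stale, so lost, duplicated, or reordered datagrams need no recovery at all:

```python
def apply_updates(datagrams):
    """Receiver for idempotent full-state updates: each datagram carries a
    sequence number and the complete state, so stale arrivals are ignored."""
    latest_seq, state = -1, None
    for seq, payload in datagrams:
        if seq > latest_seq:  # lost, duplicated, reordered packets don't matter
            latest_seq, state = seq, payload
    return state

# Packets 2 and 5 were lost, and packet 3 arrived after packet 4:
arrived = [(0, "x=0"), (1, "x=1"), (4, "x=4"), (3, "x=3"), (6, "x=6")]
print(apply_updates(arrived))  # -> x=6, the newest state that got through
```

A button-click event can not be handled this way, which is exactly why it would need TCP-style recovery on top of UDP.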

That’s for latency. What about sending high-volume data, in other words, what about bandwidth? Time for another experiment, again between my home PC and my UC Davis server. First, I sent a medium amount of data (100MB) over TCP, using a simple sender and receiver. This took 158 seconds, for an average bandwidth of 0.634MB/s. (Yes, I know. I am appropriately embarrassed by my Internet speed.) Next, I sent the same data over UDP, simply blasting a sequence of 74,899 datagrams of 1400 data bytes each over my home PC’s outgoing network interface. That took about 2.3s, for an average bandwidth of 43.28MB/s. Success! But oh wait. How many of those datagrams actually arrived at my server? It turns out that 95.88% of the datagrams I sent were lost en route. Oh well, I guess those data weren’t important anyway.

Seriously, though, the problem is that UDP, by design, does not do any congestion control. If the rate of sent datagrams at any time overwhelms any of the network links between sender and receiver, datagrams will be discarded silently. So we need to implement some form of traffic shaping ourselves. That’s not easy (there’s a reason TCP is the complex protocol that it is), and as a first approach, I simply calculated the average number of packets that were sent by my simple TCP sender per second, and set up a timer on the sender side to spread datagrams out to the same average rate. This ended up taking 160s (duh!), for a bandwidth of 0.623MB/s. At this rate, only 0.015% of datagrams were lost en route. Clearly not better than TCP, but then that’s expected if set up this way.
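
The naive rate-spreading I used can be sketched like this (hypothetical names, with a dummy send function standing in for the real socket so the timing is visible):

```python
import time

def paced_send(send, datagrams, bytes_per_sec):
    """Spread datagrams out to a fixed average byte rate, instead of blasting
    them out as fast as the local network interface allows."""
    start = time.perf_counter()
    sent = 0
    for d in datagrams:
        due = start + sent / bytes_per_sec  # when this datagram's slot opens
        delay = due - time.perf_counter()
        if delay > 0:
            time.sleep(delay)
        send(d)
        sent += len(d)

# Dummy send function; 100 datagrams of 1400 bytes at a target of 1 MB/s:
stamps = []
paced_send(lambda d: stamps.append(time.perf_counter()),
           [b"\0" * 1400] * 100, 1_000_000)
print(f"took {stamps[-1] - stamps[0]:.3f} s")
```

Scheduling against absolute slot times (rather than sleeping a fixed interval per datagram) keeps timer overshoot from accumulating, but it is still only average-rate shaping, nothing like TCP’s adaptive congestion control.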

Next, I tried pushing the effective bandwidth up, by sending datagrams at increasingly higher rates. At 0.812MB/s on the sender side, 20.93% of datagrams were lost, for an effective bandwidth on the receiver side of 0.642MB/s, or 1.3% more than TCP’s. In a real bulk data protocol, the sender would have had to re-send those missing packets, so this is an *upper limit* on the bandwidth that could have been achieved. Trying even faster, with a sender-side bandwidth of 1.181MB/s, 44.29% of datagrams were lost, for a receiver-side bandwidth of 0.658MB/s, or 3.8% above TCP. And again, this is a *loose upper limit.* Any mechanism to re-send those dropped packets would have lowered effective end-to-end bandwidth.

From these numbers we can extrapolate that in the best case, where no datagrams are lost whatsoever, UDP can *maybe* transmit bulk data a few percent faster than TCP. (I said maybe, because in a situation with no packet loss, TCP wouldn’t have to spend time re-transmitting data, either). In any real situation, where IP packets are invariably lost, the necessary re-transmission overhead would have brought UDP back to about the same level as TCP. The price we pay for this tiny potential improvement (if it’s even there) is that we have to implement TCP’s failure correction and traffic shaping algorithms ourselves, in user space. Again, that’s generally not a good idea. The bottom line is the same as for latency: if you want to send data that doesn’t all have to arrive at the receiver, like real-time audio chat data, UDP is a good choice. In all other cases, use TCP.

Finally, let’s compare TCP and UDP bandwidth in the local case, where sender and receiver are on the same computer. Here we have a somewhat counter-intuitive result: UDP transmitted 100MB at a bandwidth of 450MB/s, with 0% packet loss as expected, while TCP transmitted at 890MB/s, almost twice as fast. Huh? The answer here is that TCP on a single host is directly mapped to UNIX’s pipe mechanism, meaning there is no traffic shaping or failure recovery, and that my test program was able to pass data to TCP in larger chunks, because it didn’t have to send individual datagrams (concretely, I sent 4096 bytes per system call for TCP, vs. 1400 bytes per system call for UDP). Fewer system calls, less time, higher bandwidth.

In summary: Is TCP really that slow? Answer: no, not at all. Under *very specific circumstances*, data transmitted over UDP can have lower worst-case latency, or potentially higher bandwidth, than the same data transmitted over TCP. If your data falls within those circumstances, i.e., if re-sending lost, mangled, or mis-ordered packets would not be helpful, like in idempotent state updates or in data streams that have built-in forward error correction or loss masking like real-time audio chat data, use UDP. In the general case, or if you don’t know for sure, use TCP.

“Would you happen to know the effective or perceived resolution of the [Valve Index headset] when viewing a 50″ virtual screen from say.. 5 feet away? Do you think its equivalent to a 50″ 1080p tv from 5 ft away yet? I was also wondering why when I look at close up objects on the index that I can see basically no screen door effect, but when looking into the distance at the sky then suddenly the sde becomes very noticeable.”

Okay, so that’s actually two questions. Let’s start with the first one, and do the math.

The first thing we have to figure out is the resolution of a 50″ 1080p TV from 5 feet away. That’s pretty straightforward: a 1080p TV has 1920 pixels horizontally and 1080 pixels vertically. Meaning, it has √(1920^{2} + 1080^{2}) = 2202.9 pixels along the diagonal, and – assuming the pixels are square – a pixel size of 50″/2202.9 = 0.0227″. Next we have to figure out the angle subtended by one of those pixels, when seen from 5 feet away. That’s α = tan^{-1}(0.0227″/(5⋅12″)) = 0.0217°. Inverting that number yields the TV’s resolution as 46.14 pixels/°.

Figuring out a VR headset’s resolution is more complex, and I still haven’t measured a Valve Index, but I estimate its resolution in the forward direction somewhere around 15 pixels/°. That means the resolution of the hypothetical 50″ TV, viewed from 5 feet away, is approximately three times as high as the resolution of a Valve Index. The interested reader can simulate the perceived resolution of a VR headset of known resolution by following the steps in this article.
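
The arithmetic from the two previous paragraphs fits into a few lines, in case anyone wants to plug in their own TV size and viewing distance (the 15 pixels/° figure for the Valve Index is my estimate from above, not a measurement):

```python
import math

def flat_display_ppd(diag_in, h_pix, v_pix, dist_in):
    """Angular resolution (pixels per degree) of a flat display viewed
    head-on, assuming square pixels. All lengths in inches."""
    pixel_size = diag_in / math.hypot(h_pix, v_pix)  # diagonal pixel count
    pixel_angle = math.degrees(math.atan(pixel_size / dist_in))
    return 1.0 / pixel_angle

tv = flat_display_ppd(50.0, 1920, 1080, 5 * 12)  # 50" 1080p TV from 5 feet
print(f"TV: {tv:.2f} pixels/deg; vs. ~15 pixels/deg headset: {tv / 15.0:.1f}x")
```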

The second question is about screen-door effect (SDE). As shown in Figure 1, SDE is a high-frequency grid superimposed over a low-frequency (low-resolution) pixel grid, which makes it so noticeable and annoying. But why does it become less noticeable or even disappear when viewing virtual objects that are close to the viewer? That’s vergence-accommodation conflict rearing its typically ugly, but in this case beneficial, head. When viewing a close-by virtual object, the viewer’s eyes accommodate to focus on a close distance, but the virtual image shown by the VR headset is still at its fixed distance, somewhere around 1.5‒2m away depending on headset model. Meaning, the image will be somewhat blurred, and SDE, being a high-frequency signal, will be affected much more than the lower-frequency actual image signal.

Why does my headset show dark grey when it’s supposed to show black? Shouldn’t LED displays be able to show perfect blacks?

I addressed this in detail a long time ago, but the question keeps popping up, and it is often answered like the following: “LED display pixels have a memory effect when they are turned off completely, which causes ‘black smear.’ This can be avoided by never turning them off completely.”

Unfortunately, that answer is mostly wrong. LED display pixels *do* have a memory effect (for reasons too deep to get into right now), but it is not due to being turned off completely. The obvious counter argument is that, in the low-persistence displays used in all LED-based headsets, all display pixels are completely turned off for around 90% of the time anyway, no matter how brightly they are turned on during their short duty cycle. That’s what “low persistence” means. So having them completely turned off during their 1ms or so duty cycles as well won’t suddenly cause a memory effect.

The real answer is mathematics. In a slightly simplified model, the memory effect of LED displays has the following structure: if some pixel is set to brightness b_{1} in one frame, and set to brightness b_{2} in the next frame, it will only “move” by a certain fraction of the difference, i.e., its resulting effective brightness in the next frame will *not* be b_{2} = b_{1} + (b_{2} – b_{1}), but b_{2}‘ = b_{1} + (b_{2} – b_{1})⋅s, where s, the “smear factor,” is a number between zero and one (it’s usually around 0.9 or so).

For example, if b_{1} was 0.1 (let’s measure brightness from 0 = completely off to 1 = fully lit), b_{2} is 0.7, and s = 0.8, then the pixel’s effective brightness in frame 2 is b_{2}‘ = 0.1 + (0.7 – 0.1)⋅0.8 = 0.58, so too dark by 17%. This manifests as a darkening of bright objects that move into previously dark areas (“black smear”). The opposite holds, too: if the pixel’s original brightness was b_{1} = 0.7, and its new intended brightness is b_{2} = 0.1, its effective new brightness is b_{2}‘ = 0.7 + (0.1 – 0.7)⋅0.8 = 0.22, so too bright by 120%(!). This manifests as bright trails following bright objects moving over dark backgrounds (“white smear”).

The solution to black and white smear is to “overdrive” pixels from one frame to the next. If a pixel’s old brightness is b_{1}, and its intended new brightness is b_{2}, instead of setting the pixel to b_{2}, it is set to an “overdrive brightness” b_{o} calculated by solving the smear formula for value b_{2}, where b_{2}‘ is now the intended brightness: b_{o} = (b_{2} – b_{1})/s + b_{1}.

Let’s work through the two examples I used above: First, from dark to bright: b_{1} = 0.1, b_{2} = 0.7, and s = 0.8. That yields b_{o} = (0.7 – 0.1)/0.8 + 0.1 = 0.85. Plugging b_{o} = 0.85 into the smear formula as b_{2} yields b_{2}‘ = 0.1 + (0.85 – 0.1)⋅0.8 = 0.7, as intended. Second, going from bright to dark: b_{1} = 0.7, b_{2} = 0.1, and s = 0.8 yields b_{o} = (0.1 – 0.7)/0.8 + 0.7 = **-0.05**. **Oops.** In order to force a pixel that had brightness 0.7 on one frame to brightness 0.1 on the next frame, we would need to set the pixel’s brightness to a negative value. But that can’t be done, because pixel brightness values are limited to the interval [0, 1]. Ay, there’s the rub.
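
Both worked examples, the overdrive fix, and the spot where it breaks down can be checked in a few lines (same numbers as above):

```python
def smeared(b1, b2, s):
    """Effective brightness when a pixel is driven from b1 towards b2,
    with smear factor s."""
    return b1 + (b2 - b1) * s

def overdrive(b1, b2, s):
    """Drive value that makes the pixel actually land on b2."""
    return (b2 - b1) / s + b1

s = 0.8
print(smeared(0.1, 0.7, s))                     # ~0.58: black smear, too dark
print(overdrive(0.1, 0.7, s))                   # ~0.85: in range, works
print(smeared(0.1, overdrive(0.1, 0.7, s), s))  # ~0.7: overdrive hits the target
print(overdrive(0.7, 0.1, s))                   # ~-0.05: out of range -- oops
```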

This is a fundamental issue, but there’s a workaround. If the range of *intended* pixel brightness values is limited from the full range of [0, 1] to the range [b_{min}, b_{max}], such that going from b_{min} to b_{max} will yield an overdrive brightness b_{o} = 1, and going from b_{max} to b_{min} will yield an overdrive brightness b_{o} = 0, then black and white smear can be fully corrected. The price for this workaround is paid on both ends of the range: the high brightness values (b_{max}, 1] can’t be used, meaning the display is a tad darker than physically possible (a negligible issue with bright LEDs), and the low brightness values [0, b_{min}) can’t be used, which is a bigger problem because it significantly reduces contrast ratio, which is a big selling point of LED displays in the first place, and means that surfaces intended to be completely black, such as night skies, will show up as dark grey.

Let’s close by working out b_{min} and b_{max}, which only depend on the smear factor s and can be derived from the two directions of the overdrive formula: 1 = (b_{max} – b_{min})/s + b_{min} and 0 = (b_{min} – b_{max})/s + b_{max}. Solving yields b_{min} = (1 – s)/(2 – s) and b_{max} = 1/(2 – s). Checking these results by calculating the overdrive values to go from b_{min} to b_{max}, which should be 1, and from b_{max} to b_{min}, which should be 0, is left as an exercise to the reader.

In a realistic example, using a smear factor of 0.9, the usable brightness range works out to [0.09, 0.91], meaning the darkest the display can be is 9% grey.
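
For those not in the mood for the reader’s exercise, the check is quick to do numerically, using the realistic smear factor of 0.9:

```python
def usable_range(s):
    """Intended-brightness range [b_min, b_max] that keeps all overdrive
    values within the physically possible range [0, 1]."""
    return (1.0 - s) / (2.0 - s), 1.0 / (2.0 - s)

def overdrive(b1, b2, s):
    """Drive value that makes the pixel actually land on b2."""
    return (b2 - b1) / s + b1

s = 0.9
b_min, b_max = usable_range(s)
print(f"usable range: [{b_min:.4f}, {b_max:.4f}]")  # ~[0.0909, 0.9091]
print(overdrive(b_min, b_max, s))                   # full swing up: ~1
print(overdrive(b_max, b_min, s))                   # full swing down: ~0
```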

Then, if FoV is measured either as a single angle or a pair of angles, how does one compare different FoVs fairly? If one headset has 100° FoV, and another has 110°, does the latter show 10% more of a virtual 3D environment? What if one has 100°⨉100° and another has 110°⨉110°, does the latter show 21% more?

To find a reasonable answer, let’s go back to the basics: what does FoV actually measure? The general idea is that FoV measures how much of a virtual 3D environment a user can see at any given instant, meaning, without moving their head. A larger FoV value should mean that a user can see more, and, ideally, an FoV value that is twice as large should mean that a user can see twice as much.

Now, what does it mean that “something can be seen?” We can see something if light from that something reaches our eye, enters the eye through the cornea, pupil, and lens, and finally hits the retina. In principle, light travels towards our eyes from all possible directions, but only some of those directions end up on the retina due to various obstructions (we can’t see behind our heads, for example). So a reasonable measure of field of view (for one eye) would be the total number of different 3D directions from which light reaches that eye’s retina. The problem is that there is an infinite number of different directions from which light can arrive, so simple counting does not work.

Another way of thinking about the problem is to place an imaginary sphere of some arbitrary radius around the viewer’s eye, such that the sphere’s center coincides with that eye’s pupil. Then there is a one-to-one correspondence between 3D directions and points on that imaginary sphere: each light ray enters the sphere in exactly one point. As a result, instead of counting 3D directions, one can measure FoV as the total *area* of the set of all points on the sphere that correspond to 3D directions which can be seen by the eye.

As it so happens, if the imaginary sphere’s arbitrary radius is set to one, this is precisely the definition of *solid angle*. If nothing can be seen, i.e., the set of all “visible” points on the sphere is empty, the area of that set is zero. If *everything* can be seen, the set of visible points is the full surface of the sphere, which has an area of 4π. If only half of everything can be seen, for example because the viewer is standing on an infinite plane, that viewer’s field of view is 2π, and so forth. As an aside, the surface area of a sphere of radius one, without a unit of measurement, is also unit-less, but in order to distinguish solid angle values from other unit-less numbers, they are assigned the unit *steradian*, or *sr* for short, same as how regular (2D) angles, also fundamentally unit-less, are given in *radian* (rad) or *degree* (°).

In summary, solid angle is a solid way to measure FoV: it can measure fields of view of arbitrary shapes and sizes in a single number, and there is a direct linear relationship between that number and the amount of “stuff” that can be seen.
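
To connect this back to the question at the start: if we assume, purely for illustration, that a quoted angle pair describes a flat rectangular pinhole-style frustum (which real headset FoVs are not), the solid angle of such a frustum has a closed form, Ω = 4·arcsin(sin(θh/2)·sin(θv/2)):

```python
import math

def rect_frustum_solid_angle(h_deg, v_deg):
    """Solid angle (sr) of a rectangular pinhole-style viewing frustum with
    the given horizontal and vertical full angles, in degrees."""
    h, v = math.radians(h_deg), math.radians(v_deg)
    return 4.0 * math.asin(math.sin(h / 2.0) * math.sin(v / 2.0))

a = rect_frustum_solid_angle(100, 100)
b = rect_frustum_solid_angle(110, 110)
print(f"100x100: {a:.3f} sr, 110x110: {b:.3f} sr, ratio: {b / a:.3f}")
# Sanity check: 180x180 degrees is a full hemisphere, 2*pi sr.
assert abs(rect_frustum_solid_angle(180, 180) - 2.0 * math.pi) < 1e-9
```

Under that assumption, the 110°⨉110° frustum shows about 17% more than the 100°⨉100° one, not the naive 21% one would get by multiplying the angles.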

So far we have talked about *field of vision*, i.e., how much of a 3D environment a “naked” eye can see. As seen below, that number is important in itself, but the real question is how to measure the *field of view* of VR headsets. The general idea is the same: calculate how much of a virtual 3D environment can be seen by a user. However, unlike in a real 3D environment, light from a virtual environment does not arrive at the viewer’s eye from all possible directions. Instead, it only arrives from directions that, when traced backwards from the eye, go through one of the headset’s lenses and end up on the display screen behind that lens. FoV is still calculated the same way, but now backwards: FoV is the area of all points on a unit sphere around the user’s eye that correspond to directions that end up on a screen (assuming that that point of the screen is used to show image data by the VR pipeline, but that’s another question).

Fortunately, this area can be measured in a rather straightforward manner. Any camera works by projecting a 3D environment onto an imaging surface (a photoplate or a photosensor), specifically by assigning, to each point on the imaging surface, a 3D direction of light entering through the camera’s focal point. In a calibrated camera, this mapping from image points to 3D directions is precisely known (how it is computed is a topic for another post).

The approach, then, is to place such a calibrated camera, ideally one with a very wide-angle lens, in the same place where a user’s eye would be while the user is wearing a VR headset, and to take a picture of the headset’s screen through one of its lenses. One then looks at each of the picture’s pixels, determines whether a pixel shows some part of the headset’s screen, and sums up the individual solid angles of all pixels that do (that last part is a bit complicated in detail and left as an exercise for the reader). The bottom line, though, is that pictures just like the ones I’ve been taking for a long time are all that’s needed.
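
One way to tackle that exercise, assuming the camera calibration yields a unit direction vector for each pixel corner (names and setup are mine): split each pixel’s quad of corner directions into two spherical triangles and use Van Oosterom and Strackee’s triangle solid-angle formula:

```python
import math

def triangle_solid_angle(r1, r2, r3):
    """Solid angle of the spherical triangle spanned by three unit direction
    vectors (Van Oosterom & Strackee's formula)."""
    det = (r1[0] * (r2[1] * r3[2] - r2[2] * r3[1])
           - r1[1] * (r2[0] * r3[2] - r2[2] * r3[0])
           + r1[2] * (r2[0] * r3[1] - r2[1] * r3[0]))
    dot = lambda a, b: a[0] * b[0] + a[1] * b[1] + a[2] * b[2]
    denom = 1.0 + dot(r1, r2) + dot(r2, r3) + dot(r1, r3)
    return abs(2.0 * math.atan2(det, denom))

def pixel_solid_angle(c00, c10, c11, c01):
    """Solid angle of one pixel: split its quad of corner directions into
    two spherical triangles and add them up."""
    return (triangle_solid_angle(c00, c10, c11)
            + triangle_solid_angle(c00, c11, c01))

# Sanity check: one octant of the sphere is 4*pi / 8 = pi/2 steradian.
octant = triangle_solid_angle((1, 0, 0), (0, 1, 0), (0, 0, 1))
print(octant, math.pi / 2.0)
```

Summing `pixel_solid_angle` over all screen-showing pixels then gives the headset’s FoV in steradian.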

Calculating a single solid angle for a given headset is nice for quantitative comparisons, but a picture is often worth a thousand words. How, then, can field of view be visualized in a fair manner? After all, field of view is defined as the area of a part of a sphere’s surface, and as everybody who has ever looked at a world map knows, the surface of a sphere can not be shown on a flat image without introducing distortions. Fortunately, there is a class of map projections that preserve area, meaning, that the area of some region in the projected map is proportional to, or ideally the same as, the area of that same region on the sphere itself. Given that solid angle, or sphere area, is a fair measure of FoV, using such an area-preserving projection should result in a fair visualization: If one headset’s FoV is twice as large as that of another, its FoV will appear exactly twice as large in the picture.
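
One such projection is the Lambert azimuthal equal-area projection. A minimal sketch (the convention that the view direction is (0, 0, −1) is my choice here, not something fixed by the projection):

```python
import math

def lambert_azimuthal(x, y, z):
    """Lambert azimuthal equal-area projection of a unit direction vector.
    The area of any region in the resulting 2D map equals that region's
    solid angle on the unit sphere."""
    # The direction exactly opposite the view center (z = +1) is singular:
    k = math.sqrt(2.0 / max(1.0 - z, 1e-12))
    return k * x, k * y

print(lambert_azimuthal(0.0, 0.0, -1.0))  # the view center maps to the origin
# A direction 90 degrees off-center lands at radius sqrt(2), so the visible
# hemisphere fills a disc of area 2*pi -- exactly its solid angle:
print(math.hypot(*lambert_azimuthal(1.0, 0.0, 0.0)))
```

The full sphere maps to a disc of radius 2 and area 4π, again matching its solid angle, which is what makes the resulting FoV diagrams fair to compare.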

For this article, I measured the fields of view of three models of VR headset I happened to have at hand: HTC Vive Pro, Oculus Rift CV1, and PlayStation VR. For context, I also measured the “average” human naked-eye field of vision in the same way, based on an established and oft-cited chart (see Figure 1).

I traced the outer limit of vision in the diagram (which includes eye movement), re-projected that outline into an area-preserving map projection (see Figure 2), and calculated its solid angle, the combined solid angle of both eyes (assuming the two fields of vision are symmetric), and the solid angle of the intersection of both eyes’ fields of view, i.e., the binocular overlap. The values are as follows: one eye: 5.2482 sr (or 1.6705π sr); both eyes: 6.5852 sr (or 2.0961π sr); overlap: 3.9112 sr (or 1.2450π sr). I quoted solid angles both as straight steradian and as multiples of π steradian, in case the latter are easier to visualize: 2π sr is a hemisphere, and 4π sr is a full sphere. Interestingly, the combined field of vision of both eyes is slightly more than a hemisphere. While field of vision differs from person to person, these values are average measurements that can be used to put the FoV values of VR headsets into context.

Next, I processed the aforementioned through-the-lens pictures of my three VR headsets in the same way, by tracing the outline of the visible portion of the screen, re-projecting the outline using the same area-preserving map projection, calculating the single-eye, total, and overlap solid angles (see Table 1), and creating diagrams superimposing the fields of view over the average human FoV for context (see Figures 3-5). For each headset, I only used the FoV-maximizing eye relief value, as quoted in the figures. Given that FoV depends strongly on eye relief, one should ideally take a sequence of pictures and tabulate the function FoV(eye relief).

Headset | Single-eye FoV | Combined FoV | Overlap FoV | Overlap % |
---|---|---|---|---|
Human Eye | 5.2482 sr (1.6705π sr) | 6.5852 sr (2.0961π sr) | 3.9112 sr (1.2450π sr) | 59.39% |
Vive Pro | 2.9300 sr (0.9327π sr, 55.83%) | 3.2076 sr (1.0210π sr, 48.71%) | 2.6524 sr (0.8443π sr, 67.82%) | 82.69% |
Rift CV1 | 2.2286 sr (0.7094π sr, 42.46%) | 2.4982 sr (0.7952π sr, 37.94%) | 1.9588 sr (0.6235π sr, 50.08%) | 78.41% |
PSVR | 2.6042 sr (0.8289π sr, 49.62%) | 2.7275 sr (0.8682π sr, 41.42%) | 2.4808 sr (0.7897π sr, 63.43%) | 90.95% |

Table 1: Fields of view of average human (including eye movement) and three VR headsets. Each FoV measurement is given in steradian and as a multiple of π steradian. FoV measurements for headsets are additionally given in percent of the corresponding measurement for average human FoV. The last column lists the binocular overlap as a percentage of the combined FoV.
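The derived columns in Table 1 follow from the steradian values by simple arithmetic; as a quick sketch, the snippet below restates the table’s raw numbers and recomputes the percentages:

```python
# Solid angles from Table 1, in steradian: (single-eye, combined, overlap)
fov = {
    "Human Eye": (5.2482, 6.5852, 3.9112),
    "Vive Pro":  (2.9300, 3.2076, 2.6524),
    "Rift CV1":  (2.2286, 2.4982, 1.9588),
    "PSVR":      (2.6042, 2.7275, 2.4808),
}

for name, (single, combined, overlap) in fov.items():
    pct_of_human = 100.0 * single / fov["Human Eye"][0]   # vs. human single-eye FoV
    overlap_pct = 100.0 * overlap / combined              # binocular overlap share
    print(f"{name}: {pct_of_human:.2f}% of human single-eye FoV, "
          f"{overlap_pct:.2f}% overlap")
```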

This one is long overdue. Back in 2015, on September 30th to be precise, I uploaded a video showing preliminary results from a surprisingly robust optical 3D tracking algorithm I had cooked up specifically to track PS Move controllers using a standard webcam (see Figure 1).

**Figure 1:** A video showing my PS Move tracking algorithm, and my surprised face.

During discussion of that video, I promised to write up the algorithm I used, and to release source code. But as sometimes happens, I did neither. I was just reminded of that by an email I received from one of the PS Move API developers. So, almost two years late, here is a description of the algorithm. Given that PSVR is now being sold in stores, that PS Move controllers are more widespread than ever, and that the algorithm is interesting in its own right, it might still be useful.

The PS Move controller has two devices that allow it to be tracked in three dimensions: an inertial measurement unit (IMU), and a rubbery sphere that can be illuminated — in all colors of the rainbow! — by an RGB LED inside the sphere. As I have discussed ad nauseam before, an IMU *by itself* is not sufficient to track the 3D position of an object over time. It needs to be backed up by an external absolute reference system to eliminate drift. That leaves the glowy ball, and the question of how to determine the 3D position of a sphere using a standard camera. In principle, this is possible. In practice, as always, there are multiple ways of going about it.

To an idealized pinhole camera, a sphere of uniform brightness (fortunately, Sony’s engineers did a great job in picking the diffuse ball material) looks like an ellipse of uniform color (see Figure 2). Based on the position of that ellipse in the camera’s image plane, one can calculate a ray in 3D camera space (where the projection center is the origin, and the viewing direction runs along the negative Z axis), and based on the ellipse’s apparent size, one can calculate a distance along that 3D ray. Together, those two define a unique point in 3D space. The algorithm breaks down as follows:

1. Identify the set of pixels belonging to the sphere’s projection.
2. Calculate the parameters of an ellipse fitting the outer boundary of the set of identified pixels.
3. Calculate a 3D position such that the projection of a sphere of known radius at that position matches the observed ellipse.

Step 1 is a basic image processing problem, namely blob extraction. Step 2 is more tricky, as fitting an ellipse to a set of pixels is a non-linear optimization problem, which is difficult to implement efficiently. Step 3 is also difficult, but for fundamental mathematical reasons.

No matter the representation, uniquely identifying an ellipse requires five parameters. One representation could be (center x and y; major axis; minor axis; rotation angle). Another one could be (focal point 1 x and y; focal point 2 x and y; total chord length). But identifying a sphere in 3D space requires only four parameters (center point x, y, and z; radius). Therefore, there must be an infinity of 2D ellipses that do not correspond to any possible projection of a 3D sphere onto a pinhole camera’s image plane. However, the ellipse fitting algorithm in step 2 does not know that. Under real-world conditions, one would expect to receive parameters that — at best — merely resemble a projected sphere. On top of that, calculating the sphere parameters that correspond to a projected ellipse, even assuming there is such a sphere in the first place, is another tricky non-linear optimization problem (for example, the projection of the sphere’s center is *not* the center of the projected ellipse). In practice, this means most 2D approaches to the problem involve some guesswork or heuristics, and are not particularly robust against noise or partial occlusion.

Given the 2D approach’s issues, would it be possible to solve the problem directly in three dimensions? The answer is yes, and the surprising part is that the 3D approach is *much* simpler.

Imagine, if you will, a sphere in 3D space, and a 3D point that is somewhere outside the sphere. Those two together uniquely identify a cone whose apex is at the imagined point, and whose mantle exactly touches the imagined sphere (see Figure 3). Now, without loss of generality, assume that the imagined point is at the origin of some 3D coordinate system, for example the origin of a 3D camera-centered system, i.e., the camera’s focal point. Then, assuming that the sphere’s radius is known a priori, its remaining three parameters have a 1:1 correspondence with the three parameters defining the cone (axis direction, which only has two free parameters due to its unit length, and opening angle). In other words, if one has the parameters of a cone touching a sphere, and the sphere’s radius, one has the 3D position of that sphere relative to the cone’s apex.

Even better, calculating the sphere’s parameters from the cone’s is simple trigonometry: the sphere’s center must be somewhere along the cone’s axis, and due to the cone’s enveloping nature, any ray along the cone’s mantle, a radius from the center of the sphere to the point where that ray touches the sphere, and the cone’s axis form a right triangle. Thus, distance along the ray is d = r / sin(α), where r is the sphere’s radius, and α is the cone’s opening angle.
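In code, that trigonometric relationship is a one-liner; the ball radius below is an approximate value for illustration, not a measured one:

```python
import math

def sphere_distance(radius, opening_angle):
    """Distance from the cone's apex to the sphere's center, from the right
    triangle formed by the cone's axis, a mantle ray, and a sphere radius:
    d = r / sin(alpha)."""
    return radius / math.sin(opening_angle)

# A PS Move ball has a radius of roughly 22.5 mm (approximate value);
# a tangent cone with a 2° opening angle puts its center about 645 mm
# from the camera's focal point.
print(sphere_distance(22.5, math.radians(2.0)))
```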

How does this help, when all one has is a set of pixels in a camera’s image that belong to the projection of a sphere? First, and this is yet another basic image processing problem, one reduces the set of pixels to its boundary, i.e., only those pixels that have at least one neighbor that is not itself part of the set. That set of pixels, in turn, defines a set of 3D lines through the camera’s focal point, whose direction vectors can be calculated from the camera’s intrinsic projection parameters, which are assumed to be known a priori. The important observation, now, is that the 3D lines corresponding to those boundary pixels in the camera’s image must be part of the imaginary cone’s mantle, because they all go through the cone’s apex (the camera’s focal point), and all touch the sphere.

Therefore, given a set of 2D pixels in a camera image (or their corresponding lines in 3D camera space), we can calculate the parameters of the unique cone fitting all these lines by solving a large set of equations. Assuming that each line already goes through the cone’s apex, which is a given based on how a camera works, when does a line lie inside a cone’s mantle? The geometric definition of a cone is the 2D surface swept out by a line going through a fixed point (the cone’s apex, check!) rotating around another fixed line going through the same fixed point (the cone’s axis). In other words, all lines in the cone’s mantle form the same angle with the cone’s axis (the cone’s opening angle).

Using vector algebra, the angle between two direction vectors **p** and **a** is calculated as cos(α) = **p**·**a** / (|**p**|·|**a**|), where **p**·**a** is a vector dot product, and |**p**| and |**a**| are the Euclidean lengths of vectors **p** and **a**, respectively. Or, if **p** and **a** are 3D Cartesian vectors **a** = (ax, ay, az)^{T} and **p** = (px, py, pz)^{T}, then **p**·**a** = px·ax+py·ay+pz·az, |**p**| = √(px·px+py·py+pz·pz), and |**a**| = √(ax·ax+ay·ay+az·az).

In other words, each boundary pixel’s line direction **p**_{i} defines an equation cos(α) = **p**_{i}·**a** / (|**p**_{i}|·|**a**|), where **a** and α are the unknown cone parameters. Together with an additional equation restricting **a** to unit length, i.e., |**a**| = 1, solving this system yields the cone parameters. Unfortunately, this is still a *non-linear* system (the unknowns ax, ay, and az appear squared and underneath a square root), and a large one at that (with one non-linear equation per boundary pixel), which means it is still difficult to solve.

Fortunately, there is a way to restate the problem that turns the system into a linear one, and best of all — it is not an approximation! The trick, as so often, is to express the system of equations using different unknowns. Let us look at **a** and α again. Due to the camera being a real camera, we know that az, the third Cartesian component of **a**, is always smaller than zero (I mentioned above that in the canonical camera system, the camera looks along the negative Z axis). That means we can divide vector **a** by the negative of its third component, yielding **a**‘ = **a**/-az = (ax/-az, ay/-az, -1)^{T} = (ax’, ay’, -1)^{T}. Turning this around yields **a** = **a**‘·(-az), and we can now rewrite the original equation as cos(α) = **p**·(**a**‘·(-az)) / (|**p**|·|**a**‘·(-az)|).

But didn’t this just complicate matters? Turns out, things are not as they seem. Dot product and Euclidean length are linear operators, meaning we can pull the “·(-az)” out of both, remembering that |**a**‘·(-az)| = |**a**‘|·(-az) because az is negative, and then cancel the one in the numerator with the one in the denominator, resulting in cos(α) = **p**·**a**‘ / (|**p**|·|**a**‘|). And lo, given that **a**‘ only has *two* unknowns, ax’ and ay’, we just eliminated a variable — leaving us three, exactly the number we expected based on the problem statement!

Now we multiply both sides by (|**p**|·|**a**‘|), yielding cos(α)·(|**p**|·|**a**‘|) = **p**·**a**‘, re-arrange the terms on the left side to form (cos(α)·|**a**‘|)·|**p**| = **p**·**a**‘, and finally introduce a new unknown az’ = cos(α)·|**a**‘|. This reduces the original equation to az’·|**p**| = **p**·**a**‘, or, spelled out, az’·√(px·px+py·py+pz·pz) = px·ax’+py·ay’-pz, which is a *linear* equation in the three unknowns ax’, ay’, and az’. Well done!

Given a set of n≥3 line directions **p**_{i} calculated from boundary pixels, these equations form an over-determined linear system M·(ax’, ay’, az’)^{T} = **b**, where M is an n×3 matrix whose rows have the form (px_{i}, py_{i}, -√(px_{i}·px_{i}+py_{i}·py_{i}+pz_{i}·pz_{i})), and **b** = (pz_{1}, …, pz_{n}) is an n-vector. This system can be solved easily, and very quickly, using the linear least squares method M^{T}·M·(ax’, ay’, az’)^{T} = M^{T}·**b**, where M^{T}·M is a 3×3 matrix, and M^{T}·**b** is a 3-vector.

The final step is to extract the cone’s axis and angle, and the resulting sphere position, from the system’s solution vector (ax’, ay’, az’)^{T}: the cone’s (non-normalized) axis direction is **a** = (ax’, ay’, -1)^{T}, the axis length is |**a**| = √(ax’·ax’+ay’·ay’+1), therefore the normalized axis is **a**/|**a**|, and the opening angle is α = cos^{-1}(az’/|**a**|). And at last, the sphere’s position is **p** = (**a**/|**a**|)·r/sin(α) = **a**·r/(|**a**|·sin(cos^{-1}(az’/|**a**|))), where r is the sphere’s radius.
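Putting the last few paragraphs together, here is a sketch of the whole solver, tested on synthetic tangent rays. The function name `fit_sphere_position` and the NumPy formulation are mine, not taken from the original implementation:

```python
import numpy as np

def fit_sphere_position(rays, radius):
    """Estimate a sphere's 3D position from tangent-ray directions through the
    camera's focal point, via the linearized cone fit
    az'*|p| = px*ax' + py*ay' - pz."""
    p = np.asarray(rays, dtype=float)
    lengths = np.linalg.norm(p, axis=1)
    M = np.column_stack((p[:, 0], p[:, 1], -lengths))  # rows (px, py, -|p|)
    b = p[:, 2]                                        # right-hand side pz
    # Linear least squares via the normal equations M^T*M*x = M^T*b:
    ax, ay, az = np.linalg.solve(M.T @ M, M.T @ b)
    axis = np.array((ax, ay, -1.0))    # non-normalized cone axis
    alen = np.linalg.norm(axis)
    alpha = np.arccos(az / alen)       # cone opening angle
    return axis * (radius / (alen * np.sin(alpha)))

# Synthetic test: rays tangent to a sphere of radius 22.5 (mm) at (100, -50, -600).
center = np.array((100.0, -50.0, -600.0))
r = 22.5
d = np.linalg.norm(center)
u = center / d                   # unit direction towards the sphere's center
alpha = np.arcsin(r / d)         # true opening angle of the tangent cone
e1 = np.cross(u, (0.0, 0.0, 1.0)); e1 /= np.linalg.norm(e1)
e2 = np.cross(u, e1)             # e1, e2 span the plane perpendicular to u
rays = [np.cos(alpha) * u + np.sin(alpha) * (np.cos(f) * e1 + np.sin(f) * e2)
        for f in np.linspace(0.0, 2.0 * np.pi, 64, endpoint=False)]
print(fit_sphere_position(rays, r))  # close to (100, -50, -600)
```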

The 3D algorithm described above is really quite elegant, but there are some details left to mop up. Most importantly, and that should go without saying, getting good 3D position estimates from this algorithm — or from any other algorithm solving the same problem, for that matter — requires that the calculation of 3D line directions from image pixels is as accurate as possible. This, in turn, requires precise knowledge of the camera’s intrinsic projection parameters and its lens distortion correction coefficients. It is therefore necessary to run a good-quality camera calibration routine prior to using it in applications that require accurate tracking.

Second, the algorithm’s description calls for 3D line directions that touch the 3D sphere, but as written above, using boundary “inside” pixels does not yield that. An “inside” pixel with “outside” neighbors does not define a line tangential to the sphere, but a line that intersects the sphere. This results in the cone’s opening angle being under-estimated, and the resulting distance between the sphere and the camera’s focal point being over-estimated. A simple trick can remove this systematic bias: instead of using the 3D line direction defined by a boundary inside pixel, one uses the direction halfway between that inside pixel and its outside neighbor.

As apparent in the video in Figure 1, the tracking algorithm is surprisingly robust against partial occlusion of the tracked sphere. How is this possible? The described algorithm assumes that the boundary of the set of pixels identified as belonging to the sphere all lie on the mantle of a cone enveloping the sphere, but if the sphere is partially occluded, some boundary pixels will lie inside of the cone, leading to a wrong estimate of both axis direction and opening angle.

I addressed this issue by iterating the cone extraction algorithm, based on the observation that interior boundary pixels cause the cone’s opening angle to be under-estimated. Meaning, true boundary pixels will generally end up outside the estimated cone, while false boundary pixels will generally end up inside. Therefore I initially extract the full boundary, B_{0}, of the set of pixels identified as belonging to the sphere. Then I iterate the following, starting with i = 0:

1. Fit a cone to the set of pixels B_{i}.
2. Remove all pixels that are inside the cone by more than some distance ε from set B_{0} to form set B_{i+1}.
3. If there were points inside the cone, and i is not larger than some maximum, repeat from step 1 with pixel set B_{i+1}.
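A sketch of the filtering step, with one simplifying assumption: the text specifies ε as a distance, while the code below uses an angular tolerance instead (the two are related by the ray’s length at the sphere):

```python
import math

def filter_boundary(rays, axis, alpha, eps):
    """One filtering pass of the iteration: keep only rays whose angle to the
    (unit-length) cone axis is within eps of the opening angle alpha or larger.
    Rays well inside the cone are assumed to stem from an occluder's edge.
    Note: eps is an angular tolerance here, a simplification of the distance
    threshold described in the text."""
    kept = []
    for p in rays:
        dot = sum(pc * ac for pc, ac in zip(p, axis))
        plen = math.sqrt(sum(pc * pc for pc in p))
        angle = math.acos(max(-1.0, min(1.0, dot / plen)))
        if angle >= alpha - eps:
            kept.append(p)
    return kept

# Cone looking down -Z with a 5° opening angle; two true mantle rays and one
# "occluded" ray only 2° from the axis, filtered with a 1° tolerance.
alpha = math.radians(5.0)
mantle = [(math.sin(alpha), 0.0, -math.cos(alpha)),
          (0.0, math.sin(alpha), -math.cos(alpha))]
inside = [(math.sin(math.radians(2.0)), 0.0, -math.cos(math.radians(2.0)))]
kept = filter_boundary(mantle + inside, (0.0, 0.0, -1.0), alpha, math.radians(1.0))
print(len(kept))  # 2 -- only the mantle rays survive
```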

In my experiments, this iteration converged rapidly towards the true set of tangential boundary pixels, needing no more than five iterations in all situations I tested.

I said there would be no error analysis, but I lied. I found a graph (see Figure 4) with data from an experiment I ran when I started discussing tracking error with Alan Yates in the comments of the YouTube video shown in Figure 1. I had forgotten about that.

I made the graph in Figure 4 in LibreOffice Calc, and it ate my axis and series labels in revenge. Here’s a legend/rough analysis:

- The X axis is the real Z position of the glowing ball relative to the camera, in millimeters. I took measurements between 0.3m and 2.4m (one foot and eight feet actually).
- The orange and blue curves are the real and reconstructed Z position, respectively, also in millimeters, using the scale on the left. Note the excellent alignment and linearity, minus the strange outliers around 2.2m.
- The yellow and green curves are tracking error standard deviations in X and Y, respectively, in millimeters, using the scale on the right. (X, Y) tracking error grows linearly with distance, but is so small that it does not show up.
- The maroon curve is tracking error standard deviation in Z, in millimeters, using the scale on the right. Z tracking error grows quadratically with distance, as predicted by the mathematical formulation, and starts becoming very noisy itself at larger distances because the error is dominated by the ball’s projection quantization on the camera’s pixel grid.

After placing the PS Move controller in any of the test positions, I collected 5 seconds of video frames, i.e., 150 position measurements (the camera ran at 30Hz), averaged them for the position result (blue curve), and calculated their standard deviations for the error results (yellow, green, and maroon curves).

In the experiment that led to Figure 4, I placed the ball directly in front of the camera, and then moved it to different Z positions. I performed a second experiment where I placed the ball at a fixed distance and moved it laterally (in X), but the results were the same as the first experiment: X error standard deviation constant and on the order of fractions of a millimeter, and Z error standard deviation constant but randomly affected by X placement, probably due to different alignments with the camera’s pixel grid, and in general much larger than X error standard deviation.

While I was writing up my article on the PlayStation VR headset’s optical properties yesterday, and specifically when I made the example images for the sub-section about sub-pixel layout comparing RGB Stripe and PenTile RGBG displays, it occurred to me that I could use those images to create a rough and simple simulator to visually evaluate the differences between VR headsets that have different resolutions and sub-pixel layouts.

The basic idea is straightforward: Take a test image that has some pixel count, for example WxH=640×360 as the initial low-resolution full-RGB picture seen in Figure 1. If you then blow up that image to fill a monitor that has the same aspect ratio (16:9) and some diagonal size D, the resolution of that image in terms of pixels per degree depends both on D and the viewer’s distance from the monitor Y: the larger the ratio Y/D, the higher is the image’s resolution as seen by the viewer. In detail, the formula for distance Y to achieve a desired resolution R in pixels/° is:

`Y = (D / sqrt(W*W + H*H))/(2*tan(1/(2*R)))`

where W and H are in pixel units, R is in pixels/° (meaning the tangent takes its argument in degrees), and D is in some arbitrary length unit (inch, meter, parsec, …). Y will end up being in the same unit as D.
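The formula translates directly into code, with an explicit degree-to-radian conversion. The 40-inch monitor and the Rift CV1 resolution below are just example inputs:

```python
import math

def viewing_distance(diag, width, height, ppd):
    """Distance from a 16:9 monitor of diagonal size `diag` at which an image
    of width x height pixels appears at `ppd` pixels per degree. The result is
    in the same length unit as `diag`."""
    pixel_size = diag / math.sqrt(width * width + height * height)
    # One pixel must subtend an angle of 1/ppd degrees at the viewer's eye:
    return pixel_size / (2.0 * math.tan(math.radians(1.0 / (2.0 * ppd))))

# Example: simulating a Rift CV1 (13.85 pixels/°) with the 640x360 test image
# on a 40-inch monitor requires sitting about 43 inches from the screen.
print(viewing_distance(40.0, 640, 360, 13.85))
```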

If a viewer then positions one of their eyes at a distance of Y from the center of the monitor and closes the other one, the resolution of the image on the monitor will be R. In other words, if R is the known resolution of some VR headset, the image on the monitor will appear at the same resolution as that VR headset.

There is one caveat: when looking at a flat monitor, the resolution of the displayed image *increases* away from the monitor’s center, while in a VR headset, the resolution generally *decreases* away from the center direction (see this article for reference). Meaning, for a correct evaluation, the viewer has to focus on the area in the center of the monitor. Unfortunately there is no easy way to simulate resolution drop-off using a flat monitor, at least not while also simulating sub-pixel layout.

Here’s what you need to do:

**Step 1.** Find out whether the VR headset you want to simulate has RGB sub-pixel layout (e.g., PlayStation VR and Valve Index), or PenTile RGBG layout (e.g., Oculus Rift CV1, HTC Vive, and HTC Vive Pro).

**Step 2.** Find the headset’s green-channel center resolution in pixels/°. That’s unfortunately a bit of a tall order, as those numbers are generally not publicly advertised. I have myself measured the following values: HTC Vive: 11.43 pixels/°, Oculus Rift CV1: 13.85 pixels/°, HTC Vive Pro: 15.70 pixels/°, PSVR: 10.50 pixels/°. I have not yet measured Valve Index, but I estimate its resolution to be close to 15 pixels/°.

**Step 3.** Measure the diagonal size of the monitor you want to use, in whatever measurement unit is convenient. Important: the monitor must have a 16:9 aspect ratio, like basically any HD TV.

**Step 4.** Enter the VR headset’s resolution and your monitor’s diagonal size into the following calculator, and press the “Viewing distance in same unit” button.

*(The interactive calculator — which takes the VR headset’s green-channel resolution in pixels/° and your monitor’s diagonal size, and returns the viewing distance in the same unit — is embedded in the original web page. The formula above computes the same value.)*

**Step 5.** Download the appropriate test image. If the headset has RGB layout, use the image from Figure 2. If the headset has PenTile RGBG layout, use the image from Figure 3. Open the image in a viewer, and show it in full-screen mode.

**Step 6.** Position yourself the calculated distance away from the center of your monitor, and close one eye.

**Step 7.** Enjoy!