
Conduit

Efficient Video Compression for Live VR Streaming

Aakash Patel & Gregory Rose


[Proposal]   [Checkpoint]
[Final Presentation PDF]

Problem

The experience of live events — concerts, sporting events, parades, and even being in Times Square on New Year's Eve — is powerful. Yet these same events, watched live at home on a TV, feel far more distant and less immersive. We believe that virtual reality (VR) can provide a far more immersive live experience than TV by adding presence, the feeling that “you’re really there”. However, one of VR’s key advantages, the freedom to look anywhere in 360°, requires a fully panoramic video. Panoramic videos are large, often 4K resolution (4096 pixels wide, 2048 pixels tall, depending on the standard). In addition, VR headsets typically display in stereo 3D, meaning you need a panoramic video for each eye, so you’re effectively streaming a 4096x4096 video.

This requires a great deal of bandwidth: at least 25 Mbps according to Netflix, and that is for 2D 4K at half the size. However, according to Akamai, the average American internet connection is about 11 Mbps. 4G can come close to the 4K streaming point at 20 Mbps down on a good connection, but average speeds are much lower (typically around 11 Mbps). You would also want headroom beyond the bare minimum to provide stability and account for overhead. This effectively rules out live streaming to mobile VR on all but some of the fastest internet connections in America.

Solution

Conduit is a project to reduce these bandwidth requirements for streaming live 4K panoramic video to a head-mounted display (HMD) like the Oculus Rift. We do this with view optimization: because the HMD reports the direction you are looking, we can compress the video stream based on what is actually in your view.

Before streaming a frame over, we crop it to just the part you can see, instantly reducing the total size by over 50%. Next, we use foveated rendering, a technique that takes advantage of the fact that the human eye does not have uniform resolution: our eyes have significantly higher resolution in the center region, called the fovea. Therefore, we downsample the outer regions, which comprise most of the image, saving even more space. The risk is that you could move your head fast enough to leave the “view” region of the last streamed frame, so we introduce several techniques to mitigate this: adaptively changing the compression parameters, adding buffer room, and taking advantage of the GPU to reduce latency.

Scope

Fully investigating view optimization would require many weeks of full-time work. To scope this down to a class project, we focused on getting the view optimization working so we could see how it looks, and on optimizing the graphics as much as we could. The client and server are integrated into one program, so we simply do decode_frame(optimize_frame(frame)). The Oculus DK2 does not have eye tracking, so we assume your eyes always focus straight ahead relative to your head.

Technologies

We used OpenCV to read frames from the H.264-encoded MP4 video file. The library handles video reading and decoding using FFmpeg.
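
For concreteness, here is a minimal sketch of that reading loop (illustrative rather than our exact code; cv::VideoCapture decodes via FFmpeg under the hood):

// Minimal sketch of reading frames with OpenCV (illustrative).
#include <opencv2/opencv.hpp>

int main() {
  cv::VideoCapture video("fountain-3D-4k.mp4"); // H.264 MP4, decoded via FFmpeg
  if (!video.isOpened()) return 1;

  cv::Mat frame; // 8-bit BGR pixels
  while (video.read(frame)) {
    // hand the raw pixels off to the view optimizer / renderer here
  }
  return 0;
}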

For graphics, we used OpenGL, GLEW, and GLUT. OpenGL is used for texture mapping and for rendering the cylinders to the screen; GLUT renders the cylinders, onto which we map textures. The textures are the frames from the video that we want to show to the user. SDL2 is our window manager, which we use to display the images generated by the graphics libraries.

We also used an Oculus Rift DK2 and its SDK. We sent our view-optimized images to the Oculus display where the video could be viewed.

Optimization

Initial Version

Our first approach used a render loop like the following:

OptimizedFrame optimize_frame(video_frame, viewer_data) {
  // crop_frame crops the frame to your FOV,
  // handling edge cases like if the viewer is looking
  // at the part of the panorama where the ends are stitched together.
  cropped_frame = crop_frame(video_frame, viewer_data.angle, 90);

  // extract the center 20 degrees
  left, center, right = horizontal_split(cropped_frame, 20);
  top, middle, bottom = vertical_split(center, 20);

  // shrink by a pre-defined constant factor of 5
  left = shrink(left);
  right = shrink(right);
  top = shrink(top);
  bottom = shrink(bottom);

  return OptimizedFrame(left, right, top, middle, bottom);
}

Frame decode_frame(optimized) {
  left = expand(optimized.left);
  right = expand(optimized.right);
  top = expand(optimized.top);
  bottom = expand(optimized.bottom);
  middle = optimized.middle;

  center = vertical_concat(top, middle, bottom);
  return horizontal_concat(left, center, right);
}

OpenGLTexture texture; // reference a texture on the GPU

while (true) {
  video_frame = get_video_frame();

  // read the head orientation before optimizing, so the crop/focus
  // regions match where the viewer is currently looking
  viewer_data = get_viewer_data_from_hmd();

  // We actually do each of these steps once for each eye,
  // having split video_frame into a "left eye" and "right eye"
  optimized_frame = optimize_frame(video_frame, viewer_data);
  decoded_frame = decode_frame(optimized_frame);

  copy_frame_to_texture(decoded_frame, texture);

  render_frame(texture, viewer_data);
}

Optimization 1: Async Video Loading

Loading a video frame on every iteration of the main loop was taking a long time. Our first realization was that the work of loading a video frame is completely independent from the rest of the code! So while we’re optimizing, decoding, and rendering, we can load more video frames in parallel. We created a second thread to “buffer” the video, loading frames continuously in the background and putting them onto a queue from which the main thread could take them.

We limited the queue to 10 decoded frames, since decoded frames can be quite large (4096x4096 pixels * 24 bits/pixel for RGB 8-bit is 50.33 MB).

We also found that video decoding ran much faster than the main loop, so the queue would quickly fill up and stay full. Decoding multiple frames in parallel may be worth revisiting once everything else is optimized.

As a secondary optimization, we found that our dequeue function blocked until a frame was available. This meant the main loop could not run faster than video frames were decoded, and even if frames were decoded slightly faster, a temporary hang in the video reader would cause a hiccup in the main loop. In VR it is particularly important to keep the main loop fast, since that is where the frame is updated to reflect head tracking, and high latency or low frame rates in head tracking cause motion sickness. We changed dequeue to return NULL immediately if no frame was available, and in that case we simply skip updating the frame in the main loop.
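
To make the buffering scheme concrete, here is a sketch of a bounded frame queue with a non-blocking dequeue, in the spirit of what we describe above (the class and method names are illustrative, not our exact code):

// Sketch of a bounded frame queue with a non-blocking dequeue (illustrative).
#include <condition_variable>
#include <mutex>
#include <queue>
#include <opencv2/core.hpp>

class FrameQueue {
 public:
  explicit FrameQueue(size_t capacity) : capacity_(capacity) {}

  // Producer (video thread): blocks while the queue is full.
  void enqueue(cv::Mat frame) {
    std::unique_lock<std::mutex> lock(mutex_);
    not_full_.wait(lock, [&] { return frames_.size() < capacity_; });
    frames_.push(std::move(frame));
  }

  // Consumer (main loop): returns false immediately if nothing is ready,
  // so head tracking and rendering never stall on the decoder.
  bool try_dequeue(cv::Mat* out) {
    std::lock_guard<std::mutex> lock(mutex_);
    if (frames_.empty()) return false;
    *out = std::move(frames_.front());
    frames_.pop();
    not_full_.notify_one();
    return true;
  }

 private:
  size_t capacity_;
  std::queue<cv::Mat> frames_;
  std::mutex mutex_;
  std::condition_variable not_full_;
};

When try_dequeue returns false, the main loop simply re-renders the previous frame with fresh head-tracking data.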

Optimization 2: Simplifying the View-optimizer

We implemented view optimization and measured results from that.

First Attempt: We first cropped the panoramic video frame to 180° in order to exclude regions too far for the user to turn their head and see. Then, given a yaw and a pitch, we compressed the image. We divided the image horizontally into left, middle (the inner 20°), and right sections, and scaled down the left and right sides by a factor of 5. The middle section was split into top, center, and bottom; the top and bottom were similarly scaled down by a factor of 5, and the center was left at full resolution.

This way, we had a section in the middle of the image that the user could see as being completely in focus, and the rest of the image was blurry, in order to have a smaller size in memory — this optimized image would, in real world circumstances, be sent over a network. We measured the time that each of the steps in the view optimization took.

Problem: On analysis, we saw that the slowest steps, by far, were the resizing steps of the optimizer. Each resizing section took over 100 times longer than any other step.

Solution: Our initial attempt tried to save space by splitting the image into exclusive regions, and operating on them individually. To restore, we would then scale each piece back up and concatenate them, first reassembling the vertical column, then reassembling the horizontal pieces. This was far too complicated, and error-prone, resulting in numerous bugs and edge cases. In addition, the multiple concatenations each copied every single pixel, meaning many pixels were copied multiple times, which is inefficient.

We fixed this by using a simpler approach, whereby we simply extract the center square at full resolution, and then take the entire image and scale it down. Then, to reconstruct it, we re-expand the cropped image to the original size, and paint the focused sub-image back in the center. While this does mean we now have redundant information for the center piece, the simpler approach overall reduced view-optimization time from ~26ms to ~18ms.
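
A sketch of the simpler scheme with OpenCV (illustrative; for brevity the focus window here sits at the center of the cropped view, whereas our real code offsets it by the viewer's yaw and pitch):

// Sketch of the simplified optimizer/reconstructor with OpenCV (illustrative).
#include <opencv2/imgproc.hpp>

struct OptimizedFrame {
  cv::Mat downsampled;  // whole cropped view, shrunk by the blur factor
  cv::Mat focus;        // center region kept at full resolution
};

OptimizedFrame optimize(const cv::Mat& cropped, int blur_factor, int focus_px) {
  OptimizedFrame out;
  cv::resize(cropped, out.downsampled,
             cv::Size(cropped.cols / blur_factor, cropped.rows / blur_factor));
  cv::Rect center((cropped.cols - focus_px) / 2, (cropped.rows - focus_px) / 2,
                  focus_px, focus_px);
  out.focus = cropped(center).clone();  // full-resolution focus window
  return out;
}

cv::Mat reconstruct(const OptimizedFrame& in, int blur_factor, int focus_px) {
  cv::Mat frame;
  cv::resize(in.downsampled, frame,
             cv::Size(in.downsampled.cols * blur_factor,
                      in.downsampled.rows * blur_factor));
  cv::Rect center((frame.cols - focus_px) / 2, (frame.rows - focus_px) / 2,
                  focus_px, focus_px);
  in.focus.copyTo(frame(center));  // paint the sharp window back in
  return frame;
}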

Optimization 3: Pixel Buffer Objects

We next addressed the issue of copy_frame_to_texture, which copies our decoded frame to the OpenGL texture stored on the GPU. At first we thought we were bandwidth bound, but we profiled with the NVIDIA X Server info tool and found that GPU bandwidth was nowhere near fully used.

We found the problem to be using glTexImage2D to copy the texture to the GPU every single time we got a new video frame. It turns out that glTexImage2D is blocking, i.e. it waits for the GPU to be available, and in particular done using the texture, before copying. In other words, it wastes a lot of time waiting around.

We found a better solution in pixel-buffer objects (PBOs). These let you copy your image data into a “pixel buffer” and then tell the GPU to copy it into the texture asynchronously. This is similar to asynchronous message passing from class. Using pixel-buffer objects reduced the time to copy textures from ~20ms to ~2ms, and since we do it once for each eye, it’s about a ~40ms win on total frame time.
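
The upload path looks roughly like this (a sketch of the standard GL_PIXEL_UNPACK_BUFFER pattern rather than our exact code; double-buffering of PBOs is omitted for brevity):

// Sketch of an asynchronous texture upload via a pixel-buffer object.
#include <GL/glew.h>
#include <cstring>

void upload_frame(GLuint pbo, GLuint texture,
                  const unsigned char* pixels, int width, int height) {
  const size_t size = (size_t)width * height * 3;  // tightly packed RGB

  glBindBuffer(GL_PIXEL_UNPACK_BUFFER, pbo);
  glBufferData(GL_PIXEL_UNPACK_BUFFER, size, nullptr, GL_STREAM_DRAW);  // orphan old data

  // Copy into driver-owned memory; the GPU pulls from it later.
  void* dst = glMapBuffer(GL_PIXEL_UNPACK_BUFFER, GL_WRITE_ONLY);
  if (dst) {
    std::memcpy(dst, pixels, size);
    glUnmapBuffer(GL_PIXEL_UNPACK_BUFFER);
  }

  // With a PBO bound, the data argument is an offset into the buffer,
  // and the call returns without waiting for the transfer to finish.
  glPixelStorei(GL_UNPACK_ALIGNMENT, 1);
  glBindTexture(GL_TEXTURE_2D, texture);
  glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, width, height,
                  GL_RGB, GL_UNSIGNED_BYTE, (const void*)0);

  glBindBuffer(GL_PIXEL_UNPACK_BUFFER, 0);
}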

Optimization 4: Pipelining Frame View-Optimization

We found the next bottleneck to be frame optimization, taking over 30ms. Taking a hint from optimization 1, we made this asynchronous as well. This didn't affect the total time a frame spent in the pipeline, but reduced motion-to-photon latency, making the experience more comfortable.

A slight issue is that view-optimization requires sensor data, but the Oculus SDK sensor-reading functions are not thread-safe and must be executed on the main thread. Therefore, we read the sensor data on the main thread and periodically copy it over, along with a timestamp of when it was generated. We use a simple mutex for synchronization since this is a very small, constant-sized piece of data that isn't accessed often. If we had more time, we would also optimize where in the main loop the viewer data is updated, and possibly do it multiple times per loop.

// Shared between T1 and T2
Queue frameQueue;              // optimized + decoded frames, internally synchronized
ViewerData shared_viewer_data; // latest HMD sample, guarded by viewer_data_mutex
Mutex viewer_data_mutex;

// T1 (main / render thread)
Frame frame = frameQueue.dequeue(); // block once for the very first frame

while (true) {
  if (frameQueue.hasNewFrame())
    frame = frameQueue.dequeue();

  // sensor reads must happen on the main thread
  viewer_data = get_viewer_data_from_hmd();
  viewer_data_mutex.lock();
  shared_viewer_data = viewer_data;
  viewer_data_mutex.unlock();

  render_frame(frame, viewer_data);
}

// T2 (view-optimizer thread)
while (videoFrameAvailable()) {
  viewer_data_mutex.lock();
  local_viewer_data = shared_viewer_data;
  viewer_data_mutex.unlock();

  frame = readFrame();
  frame = decode_frame(optimize_frame(frame, local_viewer_data));
  frameQueue.enqueue(frame);
}

Optimization 5: Pose Prediction

Our optimizer takes the HMD’s orientation and uses it to produce an OptimizedFrame. That frame then has to go through the Optimizer->Renderer queue before it arrives at the main rendering call. By the time this happens, about 70ms have passed (this is the motion-to-update time; see results), and the orientation the optimized frame was based on is out of date.

So, we automatically measure motion-to-update (M2U) time with a rolling average and feed this measurement to the Oculus SDK, which predicts where your head will be when you actually see the frame. Note that Oculus clamps predictions to 100ms in the future, so spikes give us trouble, but fortunately we average about 70ms M2U. Prediction quality is harder to measure (we could compare the orientation you arrived at with the orientation the view-optimization was built for), but it made a noticeable difference. Keep in mind that a higher M2U is much more acceptable than a high motion-to-photon latency, since the focus window is padded to be larger than it needs to be.
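
A sketch of the rolling-average measurement that drives the prediction (RollingAverage is illustrative; predict_viewer_data_from_hmd is the pseudocode name used in the final algorithm below and wraps the SDK's prediction):

// Sketch of the rolling-average M2U estimate used for pose prediction.
#include <deque>

class RollingAverage {
 public:
  explicit RollingAverage(size_t window) : window_(window) {}

  void add_sample(double seconds) {
    samples_.push_back(seconds);
    sum_ += seconds;
    if (samples_.size() > window_) {
      sum_ -= samples_.front();
      samples_.pop_front();
    }
  }

  double average() const {
    return samples_.empty() ? 0.0 : sum_ / samples_.size();
  }

 private:
  size_t window_;
  std::deque<double> samples_;
  double sum_ = 0.0;
};

// In the main loop (pseudocode names from the final algorithm below):
//   m2u.add_sample(time() - frame.time);            // age of the frame's pose data
//   viewer_data = predict_viewer_data_from_hmd(time() + m2u.average());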

Other miscellaneous optimizations

We performed several other optimizations, such as removing unnecessary API calls and using display lists, each individually too small (but collectively too numerous) to list entirely here.

Results

Final Algorithm Pseudocode

After all the optimizations, here is pseudocode for the final end-to-end algorithm.

// parameters of our algorithm, chosen based on qualitative testing

// angles in degrees
const int CROP_ANGLE = 180;
const int FOCUS_ANGLE = 30;

const int BLUR_FACTOR = 3;

OptimizedFrame optimize_frame(video_frame, orientation) {
  // note: video_frame is a fully-decoded video frame, i.e. just a simple
  // 2D array of RGB values

  // crop() crops the frame to your FOV, i.e. just the CROP_ANGLE degrees
  // you're looking at, handling edge cases like the viewer looking at the
  // part of the panorama where the ends are stitched together.
  // h_angle = horizontal angle = how far left/right you're looking
  cropped_frame = crop(video_frame, orientation.h_angle, CROP_ANGLE);

  // downsample the cropped frame, reducing its size by 1/BLUR_FACTOR in each dimension
  cropped_frame_blurred = scale(cropped_frame, 1/BLUR_FACTOR);

  // extract the part of the frame around the point you're looking at
  // of size FOCUS_ANGLE x FOCUS_ANGLE degrees
  focus_region = extract_region(
    video_frame,
    orientation.h_angle,
    orientation.v_angle,
    FOCUS_ANGLE
  );

  return OptimizedFrame(cropped_frame_blurred, focus_region, orientation);
}

Frame decode_frame(OptimizedFrame optimized) {
  cropped_frame = scale(optimized.cropped_frame_blurred, BLUR_FACTOR);
  orientation = optimized.orientation;

  // uncrop the frame, essentially copying the cropped portion into
  // its original position on a new 360 degree frame (initially all black).
  frame = uncrop(cropped_frame, optimized.orientation.h_angle);

  // copy the focus region directly back onto the new frame in-place
  copy(frame, optimized.focus_region, orientation.h_angle, orientation.v_angle);

  return frame;
}

Video source_video;

Queue source_video_queue; // decoded video frames (queues use an internal mutex)
Queue optimized_queue;    // view-optimized, reconstructed frames ready to render

ViewerData viewer_data;
Mutex viewer_data_mutex;
Profiler profiler;

void video_decoder() {
  while (source_video.has_next_frame()) {
    frame = source_video.get_next_frame();
    source_video_queue.enqueue(frame);
  }
}

void optimizer() {
  while (true) {
    // blocks until a frame is available
    frame = source_video_queue.dequeue();

    viewer_data_mutex.lock();
    local_viewer_data = viewer_data; // get a local copy
    viewer_data_mutex.unlock();

    // simulate server-client process

    // server would do this
    optimized_frame = optimize_frame(frame, local_viewer_data.orientation);

    // client would do this
    decoded_frame = decode_frame(optimized_frame);

    // record the time the data used by a frame was received from the HMD
    decoded_frame.time = local_viewer_data.time;
    optimized_queue.enqueue(decoded_frame);
  }
}


int main() {
  // initialization of OpenGL, video, etc. omitted

  spawn_thread(video_decoder);
  spawn_thread(optimizer);

  OpenGLTexture texture; // reference a texture on the GPU

  while (true) {
    if (optimized_queue.size() > 0 || !texture.loaded) {
      // the below blocks until a frame is available, but except for the first
      // time, the if condition ensures there will be no wait.
      frame = optimized_queue.dequeue();
      profiler.add_sample(time() - frame.time); // how old the frame's data is

      // copies frame into a pixel-buffer object which is sent
      // non-blocking to the GPU
      copy_frame_to_texture(frame, texture);
    }

    viewer_data_mutex.lock();

    // estimate how long before the frame based on this data is displayed
    motion_to_update_time = profiler.average();

    // predicts the viewer data (orientation of HMD, etc.) at time given
    // using the Oculus SDK's prediction
    viewer_data = predict_viewer_data_from_hmd(time() + motion_to_update_time);
    viewer_data_mutex.unlock();

    // note: render actually draws the scene twice -- once for each eye --
    // the top half of the video frame (now in texture) is the left eye, and
    // the bottom half is the right eye
    render(texture, viewer_data);
  }
}

Experimental Setup

We ran our demo on a Samsung Chronos 7 laptop with an NVIDIA GeForce GT 630M, running Ubuntu 14.10. We ran it for 2 minutes while wearing the headset and looking around. We used this video from Photocreations. The program outputs averages of all the statistics we report here. Our code is available at avp/conduit. To run our demo, we used:

./conduit oculus2 fountain-3D-4k.mp4

We used an Oculus Rift DK2 to run Conduit and took qualitative measurements as well as key time metrics. We measured motion-to-update (M2U) latency, which is the time from when you move your head to when you see the view-optimized image for your new angle.

Qualitative Results

We found that blur was not particularly noticeable at blur factor 3 (the factor by which we rescale each dimension of the image). You could notice it by toggling blur on and off, but it didn't really bother you, and not having everything in focus is arguably more realistic. It became more noticeable at factors of 4 and 5, though this is only our qualitative experience; other people may differ. To test tracking, we turned the blur factor up to 40 to make it really obvious where the focus window is, and tracking was quite responsive. Oculus's prediction helped considerably with responsiveness, and since the focus and crop regions are larger than necessary and head-tracking updates are decoupled, we found motion-to-update could safely exceed the 20ms recommended for motion-to-photon latency. Making the focus window larger helped a lot, both when moving and when stationary, and since the focus window is only a few percent of the entire video, this didn't even cost much.

Quantitative Results

The bandwidth requirement for streaming 3D 4K video was a minimum of 18 Mbps for our specific H.264-encoded video. Our view optimization reduces the raw pixel count to about 10% of the original: cropping removes 50% of the frame, we shrink the remainder by a factor of 3 in each dimension, and we keep only a 50° by 50° full-resolution window in the middle (about 3% of the overall frame). Assuming the compressed bitrate scales linearly with pixel count, this cuts the bandwidth down to roughly 1.8 Mbps, making live streaming much more accessible.
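
For reference, the back-of-the-envelope arithmetic behind those figures (the pixel fractions are the ones quoted above; the linear scaling of bitrate with pixel count is an assumption):

// Back-of-the-envelope pixel/bandwidth arithmetic (assumes bitrate scales
// linearly with raw pixel count, as stated above).
#include <cstdio>

int main() {
  const double crop_kept   = 0.5;   // cropping to 180 degrees keeps half the frame
  const double blur_factor = 3.0;   // downsample by 3x in each dimension
  const double focus_share = 0.03;  // full-resolution focus window, ~3% of the frame

  double pixel_fraction = crop_kept / (blur_factor * blur_factor) + focus_share;
  printf("pixel fraction kept: %.1f%%\n", pixel_fraction * 100.0); // ~8.6%, roughly 10%

  // rounding the fraction up to 10% of the 18 Mbps source gives the ~1.8 Mbps quoted above
  printf("estimated bandwidth: %.2f Mbps\n", 18.0 * pixel_fraction); // ~1.5 Mbps
  return 0;
}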

We measured a motion-to-update latency of 75ms, comprising about 50ms for view-optimization and 25ms spent waiting in the buffer plus miscellaneous overhead. Our optimizations also brought the time per frame down from 150ms/frame (6 FPS) to about 19ms/frame (~50 FPS), of which 12ms is spent on display and 7ms on loading the texture. Video reading and optimization add almost no time to the frame, because we hide their latency by pipelining those operations.

Bottlenecks and further areas of improvement

The main bottleneck is the optimization pipeline: blurring and resizing the images takes the overwhelming majority of the time spent in the view optimizer. To improve this, we could use the GPU for image resizing. We wanted to do so via OpenCV's CUDA mode, but unfortunately couldn't get it to build in time. We also wanted to use NVIDIA's CUDA video decoder for GPU-accelerated decoding. Doing both of these steps on the GPU would also let us keep intermediate data on the GPU. To improve how the orientation of the user’s head is updated, we added prediction of the future head position; this greatly reduced lag when the user’s head moved quickly, because the optimization pipeline optimizes for the predicted future orientation.
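
For reference, a sketch of what the GPU-side resize might look like with OpenCV's CUDA module (untested, since we never got the CUDA build working; treat the includes and calls as our understanding of that API rather than verified code):

// Sketch of GPU-side downsampling with OpenCV's CUDA module (untested).
#include <opencv2/core/cuda.hpp>
#include <opencv2/cudawarping.hpp>

cv::Mat downsample_on_gpu(const cv::Mat& cropped, int blur_factor) {
  cv::cuda::GpuMat d_src, d_dst;
  d_src.upload(cropped);  // host -> device copy

  cv::cuda::resize(d_src, d_dst,
                   cv::Size(cropped.cols / blur_factor, cropped.rows / blur_factor));

  cv::Mat result;
  d_dst.download(result);  // device -> host; ideally we'd keep the data on the GPU
  return result;
}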

Latency also arises from texture mapping the frames onto the cylinder using OpenGL. This accounts for 7ms of the ~19ms frame time, and the project could benefit from making it faster. We could also reduce the amount of data transferred into the texture, since most of it is black, although performing all view-optimization compression and decompression on the GPU would remove the need for this.

Although we've tried to incorporate as much modern graphics programming as we had time to learn and implement, we're still using OpenGL immediate mode to render the cylinder. We also added display lists, but they didn't help much. Moving to an even more modern shader/buffer model would probably improve the 12ms display time.

Choice of target machine

Our choice of target machine was limited: with a better computer than, say, a two-year-old laptop, we could have had faster render times. But we were limited to the computers we owned, since we needed physical access to plug in the HMD, and we both only had laptops. We also discovered that our test video, at 4098x4098, just barely exceeded the maximum supported GPU-accelerated video decoding size of 4032x4032. We tried to rectify this by editing the video, but doing so crashed every computer we had access to. Apparently the video was also bigger than the H.264 codec technically allows.

Next steps

This is a list of some of the things we'd do to continue the project if we had more time:

References

Context

Conduit was created as a final project in 15-418: Parallel Computer Architecture and Programming, taught by Kayvon Fatahalian at Carnegie Mellon University.

We thank Professor Kayvon and the TAs for their advice along the way!

Thanks to John Carmack for proposing the idea for this project: view-optimized video streaming for VR!