Home Exam 3: Distributed Video Encoding using Dolphin PCI Express Networks

In this assignment, you will use distributed computing to accelerate video encoding.

You are supposed to:

 

Codec63

Codec63 (c63) is a modified variant of Motion JPEG that supports inter-frame prediction. It is not compliant with any standard, so the precode contains an example encoder and a decoder (which converts an encoded file back to YUV). C63's inter-frame prediction encodes, for every macroblock independently, whether that macroblock uses a motion vector or not. If a motion vector is used, it refers to the previous frame.

If no motion vector is used, a macroblock is encoded according to the JPEG standard [1] and stored in the output file. If a motion vector is used, the motion vector is stored first, immediately followed by the residual, which is encoded in the same way. An overview of the steps involved in JPEG encoding can be found on Wikipedia [2].
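As an illustration of what "residual" means here, the following is a minimal sketch and not the precode's implementation. It assumes 8x8 blocks (borrowed from JPEG's DCT block size), 8-bit samples, and a single plane with a row stride; the function name block_residual and its parameters are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define BLOCK 8  /* assumed block size, as in JPEG's 8x8 DCT blocks */

/* Hypothetical helper: residual = current block - predicted block.
 * (bx, by) is the top-left corner of the block in the current frame,
 * (mv_x, mv_y) is the motion vector into the previous (reference) frame.
 * The caller must ensure the displaced block lies inside the frame. */
void block_residual(const uint8_t *cur, const uint8_t *ref, int stride,
                    int bx, int by, int mv_x, int mv_y,
                    int16_t residual[BLOCK * BLOCK])
{
    assert(bx + mv_x >= 0 && by + mv_y >= 0);

    for (int y = 0; y < BLOCK; ++y) {
        for (int x = 0; x < BLOCK; ++x) {
            int c = cur[(by + y) * stride + (bx + x)];
            int p = ref[(by + mv_y + y) * stride + (bx + mv_x + x)];
            /* The difference lies in [-255, 255], so it needs a signed type. */
            residual[y * BLOCK + x] = (int16_t)(c - p);
        }
    }
}
```

The residual, rather than the raw pixel values, is then what goes through the JPEG-style transform and entropy coding steps when a motion vector is used.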

It is your task to optimize the c63 encoder using two machines.

The c63 encoder is very basic and shows behavior that you wouldn't allow a standard encoder to have. This concerns, in particular, the Huffman tables and the unrestricted use of motion vectors in non-I-frames. You should not modify the Huffman tables. You can decide to use motion vectors conditionally, but you must search for motion vectors, and your code must be able to use the whole motion vector search range (hard-coded to 16 in the precode).
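To make the search-range requirement concrete, here is a minimal sketch of an exhaustive block search that can reach every offset within the range of 16. It is an illustration under assumptions, not the precode's code: 8x8 blocks, a sum-of-absolute-differences (SAD) cost, a tightly packed plane (stride equal to width), and inclusive window bounds are all assumed, and the names mv_result, sad_8x8 and full_search are hypothetical.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>
#include <stdlib.h>

#define BLOCK 8          /* assumed block size */
#define SEARCH_RANGE 16  /* search range stated in the assignment text */

typedef struct { int x, y, sad; } mv_result;  /* hypothetical result type */

/* Sum of absolute differences between two 8x8 blocks in planes of equal stride. */
static int sad_8x8(const uint8_t *a, const uint8_t *b, int stride)
{
    int sum = 0;
    for (int y = 0; y < BLOCK; ++y)
        for (int x = 0; x < BLOCK; ++x)
            sum += abs(a[y * stride + x] - b[y * stride + x]);
    return sum;
}

/* Try every candidate position within +/- SEARCH_RANGE of (bx, by) and
 * return the offset with the lowest SAD (first one found on ties). */
mv_result full_search(const uint8_t *cur, const uint8_t *ref,
                      int width, int height, int bx, int by)
{
    assert(bx >= 0 && by >= 0 && bx + BLOCK <= width && by + BLOCK <= height);

    /* Clamp the window so every candidate block stays inside the frame. */
    int left   = bx - SEARCH_RANGE < 0 ? 0 : bx - SEARCH_RANGE;
    int top    = by - SEARCH_RANGE < 0 ? 0 : by - SEARCH_RANGE;
    int right  = bx + SEARCH_RANGE > width - BLOCK ? width - BLOCK : bx + SEARCH_RANGE;
    int bottom = by + SEARCH_RANGE > height - BLOCK ? height - BLOCK : by + SEARCH_RANGE;

    mv_result best = { 0, 0, INT_MAX };
    const uint8_t *block = cur + by * width + bx;

    for (int y = top; y <= bottom; ++y) {
        for (int x = left; x <= right; ++x) {
            int sad = sad_8x8(block, ref + y * width + x, width);
            if (sad < best.sad) {
                best.x = x - bx;
                best.y = y - by;
                best.sad = sad;
            }
        }
    }
    return best;
}
```

Every block's search is independent of every other block's, which is what makes this step a natural candidate for parallelization.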

The video scenario is live streaming. You should not have an encoder pipeline of more than three frames. Also, you should not use parallelization techniques that severely degrade the video quality.
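One possible way to enforce the three-frame limit is to track how many frames are in flight and refuse new ones when the pipeline is full. The sketch below is single-process bookkeeping only, under the assumption that frames complete in submission order; the pipeline type and function names are hypothetical, and in a distributed setup the counters would have to be kept consistent across both machines.

```c
#include <assert.h>
#include <stdbool.h>

#define PIPELINE_DEPTH 3  /* at most three frames in flight */

/* Hypothetical bookkeeping: 'submitted' counts frames handed to the
 * pipeline, 'completed' counts frames whose output has been written. */
typedef struct {
    unsigned submitted;
    unsigned completed;
} pipeline;

static unsigned frames_in_flight(const pipeline *p)
{
    return p->submitted - p->completed;
}

/* Returns the buffer slot (0 .. PIPELINE_DEPTH - 1) for the next frame,
 * or -1 if three frames are already in flight. */
int pipeline_submit(pipeline *p)
{
    assert(p);
    if (frames_in_flight(p) >= PIPELINE_DEPTH)
        return -1;
    return (int)(p->submitted++ % PIPELINE_DEPTH);
}

/* Marks the oldest in-flight frame as done; returns false if none is pending. */
bool pipeline_complete(pipeline *p)
{
    assert(p);
    if (frames_in_flight(p) == 0)
        return false;
    p->completed++;
    return true;
}
```

Because frames complete in order, the slot returned for a new frame is always the one freed by the oldest completed frame, so three frame buffers are enough.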

You should not replace the algorithms that you find in c63. Alternative motion vector search algorithms and DCT encoding algorithms provide considerable speedup potential. Still, they distract from the primary goal of this home exam, which is to identify and implement parallelization options. You should also not focus on improving your GPU implementation from Home Exam 2.

Two test sequences in YUV format are available in the /opt/Media directory on the lab machines:

These should be used as input to the provided c63 encoder and can be used to test your implementations.


Precode

The precode consists of the reference c63 code, including:

The precode is written in C. You should not touch the decoder or c63pred.

The precode can be downloaded from the Git repository below; its Dolphin branch can be used as a starting point:

git clone https://bitbucket.org/mpg_code/in5050-codec63.git

You must log in to the lab machines connected with PCI Express for this assignment. Information about accessing the computers can be found in the Dolphin FAQ.

You may adapt, modify, or entirely rewrite the provided encoder to take full advantage of the target architecture. You are, however, not allowed to replace the algorithms for Motion Estimation, Motion Compensation, or DCT/iDCT. You are not allowed to paste any other pre-written code into your implementation. You are also not allowed to post any code from the home exam on the Internet.

Some usage examples:

To decode a sequence:

$ ./c63dec /tmp/test.c63 /tmp/test.yuv

To decode the prediction buffer in a sequence:

$ ./c63pred -w 352 -h 288 -o /tmp/test.c63 foreman_pred.yuv

To play back a raw YUV file:

$ mplayer /tmp/test.yuv -demuxer rawvideo -rawvideo w=352:h=288
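When moving raw frames between machines it helps to know exactly how many bytes one frame occupies. The sketch below assumes planar YUV 4:2:0 (a full-resolution Y plane followed by U and V planes subsampled by two in each direction); that layout is an assumption about the test sequences, and the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Bytes in one planar YUV 4:2:0 frame (assumed layout): one full-size
 * luma plane plus two chroma planes at half width and half height. */
size_t yuv420_frame_size(size_t width, size_t height)
{
    assert(width % 2 == 0 && height % 2 == 0);
    size_t luma = width * height;
    size_t chroma = (width / 2) * (height / 2);
    return luma + 2 * chroma;
}

/* Number of whole frames contained in a raw file of file_size bytes. */
size_t yuv420_frame_count(size_t file_size, size_t width, size_t height)
{
    return file_size / yuv420_frame_size(width, height);
}
```

For the 352x288 resolution used in the examples above, this gives 152064 bytes per frame under the stated assumption.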

Evaluation

Write a short report where you discuss your results. The exam will be graded on how well you can take advantage of the distributed architecture to solve the task at hand.

In the evaluation, we will consider (in order):

  1. A program that works (on a Tegra and an x86 machine). (**)
  2. PCI Express is used to transport data between computers. (*)
  3. Effective use of PCI Express:
    • Efficient use of SISCI for transporting frame data.
    • Efficient synchronization between the two machines with SISCI, using either remote interrupts or PIO.
    • Moving data efficiently from the I/O machine to the GPU in the processing machine.
  4. Use of the potential of a distributed 3-frame pipeline. 
  5. Good documentation:
    • Readable, well-commented code.
    • Optimization steps and performance results.
    • Comparison of, and reflection on, alternative approaches.
    • Complete and well-presented document.
  6. Output video with similar or better PSNR and file size compared to the reference encoder.
  7. Bonus points for other non-obvious optimizations.

(*) Automatic fail if this is not fulfilled.
(**) We do not debug code before testing; correctness and effectiveness are not evaluated if this is not met.

 

Report

You must write up your results as a technical report of no more than four pages in ACM format. The report should serve as a guide to the code modifications you have made and the resulting performance changes.


Machine Setup

The PCI Express cluster is located at IFI. Machine names and instructions for accessing them can be found in the Dolphin FAQ.

Contact in5050@ifi.uio.no or use the Mattermost channel if you have problems logging in.

For issues with PCIe hardware and SISCI APIs, Dolphin has a dedicated support e-mail: simula-support@dolphinics.no 

 

Formal Information

The deadline for handing in your assignment:

Deliver your code (as ZIP or TAR.GZ) and report (as PDF) at https://devilry.ifi.uio.no/.

Submit the design review and poster (as PDF) to in5050@ifi.uio.no.

The groups should also prepare a poster (to show on-screen during the lecture) and a 5-minute talk for the class on May 24th.

For questions and course-related chats, we have created a Mattermost channel.

There will be a prize for the best poster/presentation (awarded by an independent panel and independent of the grade).

Please check the Dolphin FAQ page and the FAQ for updates.

For questions, please contact:

in5050@ifi.uio.no

 

[1] http://www.w3.org/Graphics/JPEG/itu-t81.pdf

[2] http://en.wikipedia.org/wiki/JPEG#JPEG_codec_example