IN5050 - Programming heterogeneous multi-core architectures

Home Exam 2: Video Encoding on Tegra Xavier using the CUDA framework

In this assignment, you will use the computing power available on a graphics processor to accelerate video encoding.

You are supposed to:

Profile and analyze the encoder, and write a short Design Review (at most one A4 page) that the group will present.
Optimize the c63 encoder using CUDA and the GPUs on the Nvidia Jetson AGX Xavier boards.
Write a short report in which you describe the optimizations you have implemented and discuss your results. You should not describe other conceivable or planned optimizations that you did not test.
Create a poster (to show on screen) and participate in the poster session on April 12th.

Codec63

Codec63 is a modified variant of Motion JPEG that supports inter-frame prediction. It is not compliant with any standard, so the precode contains an example of an encoder and a decoder (which converts an encoded file back to YUV). C63's inter-frame prediction encodes, independently for every macroblock, whether that macroblock uses a motion vector or not. If a motion vector is used, it refers to the previous frame.

If no motion vector is used, a macroblock is encoded according to the JPEG standard [1] and stored in the output file. If a motion vector is used, the motion vector is stored first, immediately followed by the residual, which is encoded in the same way. An illustrative overview of the steps involved in JPEG encoding can be found on Wikipedia [2].
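To make the term "residual" concrete, here is a minimal C sketch of how a residual block could be formed and later added back to the prediction. It assumes 8x8 blocks of 8-bit samples, and the names `block_residual` and `block_reconstruct` are hypothetical; they are not taken from the precode. The sketch deliberately leaves out the JPEG-style encoding of the residual, so the round trip shown here is lossless, which the real codec is not.

```c
#include <stdint.h>

#define BLOCK 8 /* assumed block size */

/* Residual = current block minus the block predicted from the previous frame.
 * The result needs a signed type because it ranges from -255 to 255. */
void block_residual(const uint8_t *cur, const uint8_t *pred, int stride,
                    int16_t *residual)
{
    for (int y = 0; y < BLOCK; ++y)
        for (int x = 0; x < BLOCK; ++x)
            residual[y * BLOCK + x] =
                (int16_t)(cur[y * stride + x] - pred[y * stride + x]);
}

/* Reconstruction = prediction plus residual, clamped to the 8-bit range. */
void block_reconstruct(const uint8_t *pred, const int16_t *residual,
                       int stride, uint8_t *out)
{
    for (int y = 0; y < BLOCK; ++y)
        for (int x = 0; x < BLOCK; ++x) {
            int v = pred[y * stride + x] + residual[y * BLOCK + x];
            out[y * stride + x] = (uint8_t)(v < 0 ? 0 : (v > 255 ? 255 : v));
        }
}
```

In this sketch `stride` is the distance in samples between two rows of the frame, so the same functions work on a block embedded in a larger frame.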

It is your task to optimize the c63 encoder using the CUDA framework.

The c63 encoder is very basic and shows behavior that you would not accept in a standard encoder. This concerns, in particular, the Huffman tables and the unconditional use of motion vectors in non-I-frames. You should not modify these Huffman tables. You can decide to use conditional motion vectors, but you must search for motion vectors, and you must write code that potentially uses the whole motion vector search range (hard-coded to 16 in the precode).
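As an illustration of what "using the whole search range" means, the following C sketch performs an exhaustive search over every candidate position within 16 pixels of a block. It is a sketch under stated assumptions, not the precode's implementation: the 8x8 block size, the sum of absolute differences (SAD) as matching cost, the clamping of the window to the frame, and the names `sad_block` and `full_search` are all assumptions made here.

```c
#include <limits.h>
#include <stdint.h>
#include <stdlib.h>

#define BLOCK 8         /* assumed block size, not stated in this text */
#define SEARCH_RANGE 16 /* search range hard-coded in the precode */

/* Sum of absolute differences between two BLOCK x BLOCK pixel blocks. */
static int sad_block(const uint8_t *a, const uint8_t *b, int stride)
{
    int sum = 0;
    for (int y = 0; y < BLOCK; ++y)
        for (int x = 0; x < BLOCK; ++x)
            sum += abs(a[y * stride + x] - b[y * stride + x]);
    return sum;
}

static int clamp_int(int v, int lo, int hi)
{
    return v < lo ? lo : (v > hi ? hi : v);
}

/* Exhaustive search for the block whose top-left pixel is (bx, by) in cur.
 * Writes the best motion vector to *mv_x / *mv_y and returns its SAD. */
int full_search(const uint8_t *cur, const uint8_t *ref, int width, int height,
                int bx, int by, int *mv_x, int *mv_y)
{
    /* Clamp the search window so every candidate block lies inside ref. */
    int left = clamp_int(bx - SEARCH_RANGE, 0, width - BLOCK);
    int right = clamp_int(bx + SEARCH_RANGE, 0, width - BLOCK);
    int top = clamp_int(by - SEARCH_RANGE, 0, height - BLOCK);
    int bottom = clamp_int(by + SEARCH_RANGE, 0, height - BLOCK);

    int best = INT_MAX;
    *mv_x = 0;
    *mv_y = 0;

    for (int y = top; y <= bottom; ++y)
        for (int x = left; x <= right; ++x) {
            int sad = sad_block(cur + by * width + bx, ref + y * width + x, width);
            if (sad < best) {
                best = sad;
                *mv_x = x - bx;
                *mv_y = y - by;
            }
        }
    return best;
}
```

Every candidate position is evaluated independently of the others, and so is every block, which is the kind of structure that lends itself to parallel execution.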

The video scenario is live streaming. You should not have an encoder pipeline of more than three frames. Also, you should not use parallelization techniques that severely degrade the video quality.

You should not replace the algorithms that you find in c63. Alternative motion vector search algorithms and DCT encoding algorithms provide considerable speedup potential. Still, they distract from this home exam's primary goal: identifying and implementing parallelization options.

Two test sequences in YUV format are available in the /mnt/sdcard directory on the lab machines:

foreman (352x288) CIF
tractor (1920x1080) 1080p

These should be used as input to the provided c63 encoder and can be used to test your implementations.
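When sizing host and device buffers, it helps to know how many bytes one frame occupies. The C sketch below assumes the sequences are stored as planar YUV 4:2:0, that is, one full-resolution luma plane followed by two chroma planes subsampled by two in both directions. That layout is an assumption and is not stated in this text; `yuv420_frame_bytes` is a hypothetical helper.

```c
#include <stddef.h>

/* Bytes in one planar YUV 4:2:0 frame (assumed layout): a full-size Y plane
 * plus U and V planes at half width and half height, rounded up. */
size_t yuv420_frame_bytes(size_t width, size_t height)
{
    size_t luma = width * height;
    size_t chroma = ((width + 1) / 2) * ((height + 1) / 2);
    return luma + 2 * chroma;
}
```

Under that assumption, one foreman frame takes 152064 bytes and one tractor frame takes 3110400 bytes.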

Precode

The precode consists of the reference c63 code, including:

an encoder
a decoder
the command c63pred (which extracts the prediction buffer for debugging purposes)

The precode is written in C. You are not required to touch the decoder or c63pred.

The precode can be downloaded from a Git repository here (use the CUDA-branch):

git clone https://github.com/griwodz/in5050-codec63.git

You must log in to the Jetson AGX Xavier devkit assigned to your group for this assignment. You should have received an email from the course administrators about which kits to use. Information about how to access the machines can be found in the GPU FAQ.

You can adapt, modify, or rewrite the provided encoder to take full advantage of the target architecture. You are, however, not allowed to replace the algorithms for Motion Estimation, Motion Compensation, or DCT/iDCT. You are not allowed to paste any other pre-written code into your implementation. You are also not allowed to post any code from the home exam online.

Start by profiling the encoder to see which parts of the encoder are the bottlenecks. Remember, more profiling might be needed to find new bottlenecks after optimizing one code section.
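Besides a dedicated profiler, a simple accumulating timer around each encoder stage can show where the time goes. The C sketch below is one possible way to do this; `section_timer`, `timer_begin`, and `timer_end` are hypothetical names, and it uses the C11 function `timespec_get`, which reads the wall clock. Where available, a monotonic clock is the more robust choice for measurements.

```c
#include <time.h>

/* Accumulates the wall-clock time spent in one code section. */
typedef struct {
    double total_s;        /* accumulated time in seconds */
    long calls;            /* number of measured intervals */
    struct timespec start; /* start of the interval in progress */
} section_timer;

/* Mark the start of a measured interval. */
void timer_begin(section_timer *t)
{
    timespec_get(&t->start, TIME_UTC);
}

/* Close the interval opened by timer_begin and add it to the total. */
void timer_end(section_timer *t)
{
    struct timespec now;
    timespec_get(&now, TIME_UTC);
    t->total_s += (double)(now.tv_sec - t->start.tv_sec)
                + (double)(now.tv_nsec - t->start.tv_nsec) * 1e-9;
    t->calls += 1;
}
```

A timer per stage, initialized with `section_timer t = {0};` and wrapped around the stage's call site, gives a per-stage total at the end of a run. Keep in mind that CUDA kernel launches return before the kernel has finished, so a host-side timer only measures GPU work if the host synchronizes with the device before `timer_end` is called.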

Some usage examples:

To encode the foreman test sequence:

$ ./c63enc -w 352 -h 288 -o /tmp/test.c63 foreman.yuv

To decode a sequence:

$ ./c63dec /tmp/test.c63 /tmp/test.yuv

To dump the prediction buffer (used to test motion estimation):

$ ./c63pred /tmp/test.c63 /tmp/test.yuv

To play back a raw YUV file:

$ mplayer /tmp/test.yuv -demuxer rawvideo -rawvideo w=352:h=288

Evaluation

Write a short report where you discuss your results. The exam will be graded on how well you can use the GPU architecture to solve the task.

In the evaluation, we will consider (in order) the following:

Motion Estimation & DCT/iDCT algorithmic functions in the source code have been offloaded to the GPU.
- Document the bottleneck and the effect of your optimization.
A program that works (on the Jetson AGX Xavier provided)
- Runs to completion. (*)
- Encodes foreman and tractor correctly.
- Output video has quality and file size similar to the reference encoder's output.
- Readable, well-commented code
Effect of the GPU offload
- Understanding the SoC architecture and the Volta GPU.
  - Minimizing overhead with moving data between the CPU and GPU.
  - Investigate if there are any advantages to using mixed precision (FP16) on the GPU
  - The correctness of memory use on the GPU (memory types, bank conflicts) and GPU code optimization regarding branching.
- Bonus points can be given for non-obvious optimizations, such as offloading parts of VLC.
The quality of the report that accompanies the code
- Clear and structured description of the performance changes caused by your modifications to the precode
- References to the relevant parts of the accompanying code (to aid the reviewer of the submitted assignment)
- Graphical presentation of the optimization steps and performance results (plots of performance changes)
- Comparison of / reflection about the alternative approaches tried out by your group.

(*) We do not debug code before testing; correctness and effectiveness are not evaluated if this is not fulfilled.
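One way to check that the output quality stays similar to the reference is to decode both encodings and compare each decoded sequence with the original input, plane by plane. The C sketch below computes the mean squared error (MSE) between two planes of 8-bit samples; `plane_mse` is a hypothetical helper, and the choice of metric is an assumption, since this text does not say how quality is measured. For 8-bit samples, PSNR in dB follows from the MSE as 10 * log10(255 * 255 / MSE).

```c
#include <stddef.h>
#include <stdint.h>

/* Mean squared error between two planes of n 8-bit samples (n must be > 0). */
double plane_mse(const uint8_t *a, const uint8_t *b, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; ++i) {
        double d = (double)a[i] - (double)b[i];
        sum += d * d;
    }
    return sum / (double)n;
}
```

An MSE of zero means the two planes are identical. If the optimized encoder yields a clearly higher MSE against the original than the reference encoder does, quality has degraded.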

Report

You must write the results as a technical report of no more than four pages in ACM format. The report should serve as a guide to the code modifications you have made and the resulting performance changes.

Machine Setup

The Jetson AGX Xavier devkits are at IFI. Machine names and access to them can be found in the GPU FAQ. If you have reported your group to the course administration, you should have been assigned to a devkit and provided with a username and a password.

Contact in5050@ifi.uio.no if you have problems logging in.

Formal Information

The deadlines for handing in your assignment are:

Design: Thursday, March 21st at 12:00
Code: Friday, April 12th at 23:59
Report: Tuesday, April 16th at 14:00

Deliver your code (as ZIP, TGZ, etc.) and report (as PDF) to https://devilry.ifi.uio.no/.

Submit the design review and poster (as PDF) to in5050@ifi.uio.no.

The groups should also prepare a poster (to show on-screen during the lecture) and a 5-minute talk for the class on April 12th. There will be a prize for the best poster/presentation (awarded by an independent panel and independent of the grade).

For questions and course-related chatter, we have created a Mattermost space: https://mattermost.uio.no/ifi-undervisning/channels/in5050

Please check the GPU FAQ page for updates.

For questions, please contact:

in5050@ifi.uio.no

[1] http://www.w3.org/Graphics/JPEG/itu-t81.pdf

[2] http://en.wikipedia.org/wiki/JPEG#JPEG_codec_example