Performance tests of Intel Media SDK (4.0.024-HSW) – Decode

This is a test report for the latest version of kdvcodec_msdkdec, which is based on the latest version of the Intel Media SDK (4.0.024-HSW).

There is one known issue: the NV12 to YV12 (or YUV420P) conversion is very inefficient.

Here are the details.

I. Intel MSDK version

Beta version: 4.0.024-HSW

II. Hardware

Intel Core i7-3770 (Ivy Bridge). The details follow (you can view them with the shell command: cat /proc/cpuinfo):

vendor_id : GenuineIntel
cpu family : 6
model : 58
model name : Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz
stepping : 9
microcode : 0x12
cpu MHz : 1600.000
cache size : 8192 KB
physical id : 0
siblings : 8
core id : 3
cpu cores : 4
apicid : 7
initial apicid : 7
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm ida arat epb xsaveopt pln pts dts tpr_shadow vnmi flexpriority ept vpid fsgsbase smep erms
bogomips : 6799.77
clflush size : 64
cache_alignment : 64
address sizes : 36 bits physical, 48 bits virtual

III. OS

Ubuntu 12.04 LTS, kernel version 3.2.0-23

jacky@ubuntu-msdk:~$ uname -a

Linux ubuntu-msdk 3.2.0-23-generic #36-Ubuntu SMP Tue Apr 10 20:39:51 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux

IV. Test video sequences

1080p.264: total 300 frames
1080P-10M-600.264: total 600 frames

V. Test program

1. Sample program provided by Intel
Major steps and features:
a. Read the .264 file
b. Decode in asynchronous mode
c. Save to a YUV file

This test program’s performance is limited by steps that are unnecessary for our purposes, such as reading the H.264 ES stream from file and saving the decoded YUV buffer to file. The Intel Media SDK’s output format is NV12.

So I made a few modifications to run a test that is closer to our product’s run-time scenario:
1. Skip the file-read step: read the whole file into memory and parse it into an array of frames (there are bugs here)
2. Implement two decode modes (asynchronous mode and blocking mode)
3. Process the decoded buffer in one of the following ways:
A. Copy the decoded NV12 to a buffer
B. Copy the decoded NV12 to a buffer, and memset the NV12’s UV data to 0
C. Copy and convert the NV12 to YUV420P (color conversion not optimized)
D. Copy and convert the NV12 to YUV420P, and memset the UV data to 0 (color conversion not optimized)
E. A + save to file
F. B + save to file
G. C + save to file
H. D + save to file
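
As an illustration of modification 1 in the list above, the sketch below splits an in-memory H.264 elementary stream into NAL units by scanning for Annex B start codes (00 00 01, optionally preceded by an extra 00). It is not the test program's code: the function names are hypothetical, Python is used only for brevity, and grouping NAL units into complete frames (the part the report says still has bugs) is deliberately left out.

```python
def split_annexb(stream):
    """Split an H.264 Annex B byte stream into NAL units (start codes removed)."""
    units = []
    start = None  # index of the first payload byte of the current NAL unit
    pos = 0
    while True:
        hit = stream.find(b"\x00\x00\x01", pos)
        if hit < 0:
            break
        if start is not None:
            end = hit
            # Zero bytes right before a start code belong to a 4-byte start
            # code or to padding; a NAL unit itself never ends in 0x00.
            while end > start and stream[end - 1] == 0:
                end -= 1
            if end > start:
                units.append(stream[start:end])
        start = hit + 3
        pos = start
    if start is not None:
        tail = stream[start:].rstrip(b"\x00")
        if tail:
            units.append(tail)
    return units


def nal_type(unit):
    """Return nal_unit_type, the low 5 bits of the first NAL header byte."""
    return unit[0] & 0x1F


# Tiny synthetic stream: SPS (type 7), PPS (type 8), IDR slice (type 5).
demo = b"\x00\x00\x00\x01\x67\xaa\x00\x00\x01\x68\xbb\x00\x00\x00\x01\x65\xcc\xdd"
print([nal_type(u) for u in split_annexb(demo)])
```

Doing this once, before the timed loop starts, keeps file I/O and start-code scanning out of the decode measurements.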

VI. Test results

1MB bitstream per input

Test sequence 1: 1080P, 300 frames

	A	B	C	D	E	F	G	H
Time (µs)	969758	997308	3330680	3342184	5682973	4660920	5672404	5336507
FPS	309.36	300.81	90.07	89.76	52.79	64.36	52.89	56.22
CPU	Low	Low	High	High	Low	Low	slight	slight

Test sequence 2: 1080P-10M-600.264: total 600 frames

	A	B	C	D	E	F	G	H
Time (µs)	2233026	2287319	7224942	7281495	13804849	13727656	14136208	14121156
FPS	268.69	262.32	83.05	82.4	43.46	43.71	43.71	42.49
CPU	Low	Low	High	High	Low	Low	slight	slight

Single frame/slice per input

Test sequence 1: 1080P, 300 frames

	A	B	C	D	E	F	G	H
Time (µs)	1610299	1637120	4406070	4549480	6122828	6198919	6438414	6705004
FPS	186.3	183.25	68.09	65.94	49	48.4	46.6	44.74
CPU	Low	Low	High	High	Low	Low	slight	slight

Test sequence 2: 1080P-10M-600.264: total 600 frames

	A	B	C	D	E	F	G	H
Time (µs)	2511838	2597502	7577040	7679905	14260726	14330324	14857964	14912612
FPS	238.87	230.99	79.19	78.13	42.07	41.87	40.38	40.23
CPU	Low	Low	High	High	Low	Low	slight	slight
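
The two numeric rows in each table appear to be related by FPS = frame count / elapsed seconds, assuming the large integers are elapsed times in microseconds (an inference from the numbers, not something the test program's output states). A small check of that arithmetic, with a hypothetical helper name:

```python
def frames_per_second(frame_count, elapsed_us):
    """Convert a frame count and an elapsed time in microseconds to FPS."""
    return frame_count / (elapsed_us / 1_000_000)


# Scenario A, 1MB input, test sequence 1 (300 frames); the table reports 309.36.
print(round(frames_per_second(300, 969758), 2))

# Scenario C, same sequence; the table reports 90.07.
print(round(frames_per_second(300, 3330680), 2))

# Scenario A, 1MB input, test sequence 2 (600 frames); the table reports 268.69.
print(round(frames_per_second(600, 2233026), 2))
```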

*These results may vary because of:
a. Test video sequences that were not encoded with the same settings as my test sequences;
b. Chance factors, such as OS load or interference from other processes, especially in the save-to-file scenarios E/F/G/H; the italicized results may also be evidence of this.

*About CPU usage:
“Not obvious” means that total OS-level CPU usage is below 3%, and it is not clear whether the test program is what is consuming it.

Some suspicious phenomena:

a. CPU usage

b. GPU performance

Interim conclusion for the Intel MSDK decode research

The most costly step in the current test program is the color space conversion (from NV12 to YUV420P). I’ll try to improve it in the coming days (BTW: the Intel VPP module does not support this conversion).
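
To show what the conversion has to do, here is a minimal sketch of NV12 to YUV420P (I420). NV12 stores a full Y plane followed by a single plane of interleaved U and V bytes; YUV420P stores Y, then all of U, then all of V. The sketch assumes a tightly packed buffer (row pitch equal to width) and even dimensions, which real decoder surfaces may not satisfy; the function name is hypothetical and this is not the test program's code.

```python
def nv12_to_i420(nv12, width, height):
    """Convert a tightly packed NV12 frame to planar YUV420P (I420)."""
    if width % 2 or height % 2:
        raise ValueError("width and height must be even for 4:2:0 data")
    y_size = width * height
    chroma_size = (width // 2) * (height // 2)  # size of one chroma plane
    if len(nv12) != y_size + 2 * chroma_size:
        raise ValueError("buffer size does not match a packed NV12 frame")

    y_plane = nv12[:y_size]   # the Y plane is identical in both layouts
    uv_plane = nv12[y_size:]  # interleaved U0 V0 U1 V1 ...
    u_plane = uv_plane[0::2]  # every even byte is a U sample
    v_plane = uv_plane[1::2]  # every odd byte is a V sample
    return y_plane + u_plane + v_plane


# 4x2 frame: 8 luma bytes, then two interleaved (U, V) pairs.
frame = bytes(range(8)) + bytes([100, 200, 101, 201])
print(list(nv12_to_i420(frame, 4, 2)))
```

The Y plane can be copied in bulk, but every chroma byte has to be moved to a new position, so a naive byte-by-byte loop touches half a million chroma bytes per 1080p frame. For YV12 the same split applies with the V plane written before the U plane.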

However, compared with software decoding, MSDK still looks like a promising technique for us, even if throughput drops to 150 fps for 1080P decoding.
