Lab 8 – Description | Edwin's HomePage @ the UvA

Objectives

The objective of this experiment is to investigate the performance gain achieved by using SIMD instructions.

Getting started

This lab will run only on Linux-based systems. For each experiment the non-SIMD version is provided and your task is to implement the SIMD version. The non-SIMD source code is available here:

Source - Lab 8

A more in-depth article can be found here:

http://sites.utexas.edu/jdm4372/2016/11/05/intel-discloses-vectorsimd-instructions-for-future-processors/

You will use so-called intrinsics: functions that generally map one-to-one onto an assembly instruction. An overview of all intrinsics is available on the following page:

https://software.intel.com/sites/landingpage/IntrinsicsGuide/

Each part contains a Makefile. To compile and test your programs, run make run. Each program is run multiple times, and every run prints the number of cycles used by the non-SIMD version and by the SIMD version of the program. If the results of both versions (SIMD and non-SIMD) are equal, it prints CORRECT.

To load int values, use the following function call, replacing <array> and <multiplication factor> with your own values:

 _mm_load_si128((__m128i const*)(<array> + <multiplication factor> * i));
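The load call above can be sketched in context as follows. This is a minimal illustration, not part of the provided lab code: the helper name add_one_epi32 is hypothetical, the elements are assumed to be 32-bit ints (so four fit in one register and the multiplication factor is 4), both arrays are assumed to be 16-byte aligned, and n is assumed to be a multiple of 4.

```c
#include <emmintrin.h>  /* SSE2: __m128i and the integer intrinsics */
#include <stdint.h>

/* Hypothetical helper: out[i] = in[i] + 1 for all n elements.
   Assumes in and out are 16-byte aligned and n is a multiple of 4. */
void add_one_epi32(const int32_t *in, int32_t *out, int n)
{
    const __m128i ones = _mm_set1_epi32(1);  /* {1, 1, 1, 1} */
    for (int i = 0; i < n / 4; i++) {
        /* <array> = in, <multiplication factor> = 4 ints per register */
        __m128i v = _mm_load_si128((__m128i const *)(in + 4 * i));
        v = _mm_add_epi32(v, ones);
        _mm_store_si128((__m128i *)(out + 4 * i), v);
    }
}
```

_mm_load_si128 and _mm_store_si128 require 16-byte aligned addresses; for data whose alignment is not guaranteed, _mm_loadu_si128 and _mm_storeu_si128 are the unaligned counterparts.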

SSE
SSE (Streaming SIMD Extensions) is a set of instructions added to the x86 instruction set that makes SIMD possible. The instructions work on 128-bit registers. One 128-bit register can contain 2 double-precision floats, 4 single-precision floats, 16 8-bit ints, 8 16-bit ints, 4 32-bit ints, or 2 64-bit ints.
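The element counts listed above follow directly from dividing the 16 bytes of a register by the element size. A small sketch, using a hypothetical helper named sse_lanes, makes the arithmetic explicit:

```c
#include <emmintrin.h>  /* defines __m128, __m128d and __m128i */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: how many elements of the given size (in bytes)
   fit in one 128-bit SSE register. */
size_t sse_lanes(size_t element_size)
{
    return sizeof(__m128i) / element_size;  /* 16 bytes per register */
}
```

For example, sse_lanes(sizeof(float)) is 4 and sse_lanes(sizeof(int8_t)) is 16. In C the three register types are __m128 (floats), __m128d (doubles) and __m128i (all integer widths).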

Intrinsics syntax
The intrinsics have a specific syntax. For example:

 _mm_add_ps

The _mm prefix indicates that the intrinsic works on 128-bit registers; it is the prefix of all SSE functions. _add is the operation, and _ps stands for packed single.

Packed single means the operation works on multiple single-precision floats at once. _pd stands for packed double. _epi{8,16,32,64} means the operation works on packed ints, where the number is the width of each int in bits.
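To illustrate the naming scheme, the sketch below performs the same add operation with three different suffixes. The wrapper names (add4_ps, add2_pd, add8_epi16) are hypothetical, and unaligned loads and stores are used so that the example makes no assumption about how the arrays were allocated.

```c
#include <emmintrin.h>  /* SSE2; also provides the SSE float intrinsics */
#include <stdint.h>

/* _ps: packed single, 4 floats per operation */
void add4_ps(const float *a, const float *b, float *out)
{
    _mm_storeu_ps(out, _mm_add_ps(_mm_loadu_ps(a), _mm_loadu_ps(b)));
}

/* _pd: packed double, 2 doubles per operation */
void add2_pd(const double *a, const double *b, double *out)
{
    _mm_storeu_pd(out, _mm_add_pd(_mm_loadu_pd(a), _mm_loadu_pd(b)));
}

/* _epi16: packed 16-bit ints, 8 per operation */
void add8_epi16(const int16_t *a, const int16_t *b, int16_t *out)
{
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)out, _mm_add_epi16(va, vb));
}
```

Note that the register type changes with the suffix: __m128 for _ps, __m128d for _pd, and __m128i for every _epi variant.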

Questions

Part one

1. Explain for both experiments (1 and 2) the performance difference between the first and succeeding runs.

exp 1. Use the provided code to load the data from memory and to store it back. Your task is to add 5 to every number in the array a.

Implement the function void sseFunction(float* a, float* b, int size) inside add.c.

a is the memory where the input resides. b is the memory where the result must be stored.

exp 2. Calculate the following polynomial x^2+3x+2.

Implement the function void sseFunction(float* a, float* b, int size) inside poly.c.

Part two
2. In this experiment the Pythagorean theorem is used to calculate the length of the hypotenuse.

Implement sseFunction(float* a,float* b,float* c, int size) inside functions.c

Part three
3. In this assignment you will normalize 4D vectors.

Implement sseFunction(float* x,float* y,float* z,float* w,int size) inside functions.c

Part four
4. Round numbers to the nearest int.

Implement void sseFunction(float* a, float* b,int size) inside functions.c

Part five
5. This experiment sums two audio signals.

Implement void sseFunction(int8_t* a, int8_t* b, int8_t* c, int size) inside functions.c
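One point worth checking before vectorizing a sum of 8-bit samples is what should happen on overflow. The sketch below contrasts the two SSE2 options; it is an illustration under assumptions, not the expected solution. The function names are hypothetical, size is assumed to be a multiple of 16, and which behaviour the provided non-SIMD version uses is not stated here, so compare against that code.

```c
#include <emmintrin.h>  /* SSE2 integer intrinsics */
#include <stdint.h>

/* Hypothetical helper: c[i] = a[i] + b[i] with wrap-around on overflow.
   Assumes size is a multiple of 16. */
void mix_wrapping(const int8_t *a, const int8_t *b, int8_t *c, int size)
{
    for (int i = 0; i < size; i += 16) {
        __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
        __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
        _mm_storeu_si128((__m128i *)(c + i), _mm_add_epi8(va, vb));
    }
}

/* Hypothetical helper: same sum, but clamped to [-128, 127]. */
void mix_saturating(const int8_t *a, const int8_t *b, int8_t *c, int size)
{
    for (int i = 0; i < size; i += 16) {
        __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
        __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
        _mm_storeu_si128((__m128i *)(c + i), _mm_adds_epi8(va, vb));
    }
}
```

With two samples of 100, the wrapping version yields -56 while the saturating version yields 127.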

Part six
6. This experiment counts the frequency of one letter inside a text.

Implement int sseFunction(char* text, char find, int size) inside functions.c
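A common pattern for this kind of search is to compare 16 bytes at once, turn the comparison result into a bit mask, and count the set bits. The sketch below shows that pattern under assumptions: count_char is a hypothetical name, the loads are unaligned, and __builtin_popcount is a GCC/Clang builtin rather than an SSE intrinsic. It is one possible approach, not necessarily the one the lab intends.

```c
#include <emmintrin.h>  /* SSE2 integer intrinsics */

/* Hypothetical helper: number of bytes in text[0..size) equal to find. */
int count_char(const char *text, char find, int size)
{
    const __m128i needle = _mm_set1_epi8(find);  /* find in all 16 lanes */
    int count = 0;
    int i = 0;

    /* Full 16-byte chunks */
    for (; i + 16 <= size; i += 16) {
        __m128i chunk = _mm_loadu_si128((const __m128i *)(text + i));
        __m128i eq = _mm_cmpeq_epi8(chunk, needle);  /* 0xFF where equal */
        /* One mask bit per byte lane; count the lanes that matched */
        count += __builtin_popcount((unsigned)_mm_movemask_epi8(eq));
    }

    /* Scalar tail for the remaining size % 16 bytes */
    for (; i < size; i++) {
        count += (text[i] == find);
    }
    return count;
}
```

The scalar tail loop handles sizes that are not a multiple of 16, so the function never reads past the end of the buffer.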