#### Objectives

The objective of this experiment is to investigate the performance gain achieved by using SIMD instructions.

##### Getting started

This lab will run only on Linux-based systems. For each experiment the non-SIMD version is provided and your task is to implement the SIMD version. The non-SIMD source code is available here:

Source - Lab 8

A more in-depth article can be found here:

http://sites.utexas.edu/jdm4372/2016/11/05/intel-discloses-vectorsimd-instructions-for-future-processors/

You will use so-called intrinsics: functions that map one-to-one onto assembly instructions. An overview of all the intrinsics is available on the following page:

https://software.intel.com/sites/landingpage/IntrinsicsGuide/

Each part contains a Makefile. To compile and test your programs, run `make run`. Each program is run multiple times, and the number of cycles used by the non-SIMD version and by the SIMD version is printed. If the results of both versions (SIMD and non-SIMD) are equal, it prints CORRECT.

To load int values, use the following function call, replacing <array> and <multiplication factor> with your own values:

_mm_load_si128((__m128i const*)(<array> + <multiplication factor> * i));
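As a rough illustration of how this call is typically used inside a loop, the sketch below adds a constant to an array of 32-bit ints. It assumes that the multiplication factor is the number of elements per 128-bit register (4 for 32-bit ints), that both arrays are 16-byte aligned (which `_mm_load_si128` requires), and that the element count is a multiple of 4. The function name `add_constant_epi32` is hypothetical and not part of the lab code.

```c
#include <emmintrin.h>  /* SSE2: __m128i, _mm_load_si128, _mm_add_epi32, _mm_store_si128 */
#include <stdint.h>

/* Hypothetical helper: out[k] = in[k] + value for k in [0, count).
   Assumes in/out are 16-byte aligned and count is a multiple of 4. */
void add_constant_epi32(const int32_t *in, int32_t *out, int count, int32_t value)
{
    const __m128i v = _mm_set1_epi32(value);   /* value copied into all 4 lanes */
    for (int i = 0; i < count / 4; i++) {
        /* 4 ints per register, so the multiplication factor is 4 here */
        __m128i x = _mm_load_si128((__m128i const *)(in + 4 * i));
        x = _mm_add_epi32(x, v);
        _mm_store_si128((__m128i *)(out + 4 * i), x);
    }
}
```

If the data cannot be guaranteed to be 16-byte aligned, `_mm_loadu_si128` and `_mm_storeu_si128` can be used instead.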

**SSE**

SSE (Streaming SIMD Extensions) is a set of instructions added to the x86 instruction set that makes SIMD possible. The instructions work on 128-bit registers. One 128-bit register can hold 2 double-precision floats, 4 single-precision floats, 16 8-bit ints, 8 16-bit ints, 4 32-bit ints, or 2 64-bit ints.

**Intrinsics syntax**

The intrinsics have a specific syntax. For example:

_mm_add_ps

The `_mm` prefix is shared by all SSE functions and states that the intrinsic works on 128-bit registers. `_add` is the operation, and `_ps` stands for packed single, which means the operation is applied to multiple single-precision floats at once. `_pd` stands for packed double. `_epi{8,16,32,64}` means the intrinsic works on packed ints, where the number is the width of each int in bits.
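To make the naming scheme concrete, the following sketch applies the same add operation with three different suffixes. The wrapper names (`add4_floats`, `add2_doubles`, `add4_int32`) are hypothetical; unaligned loads and stores are used so that the arrays need no special alignment.

```c
#include <emmintrin.h>  /* SSE2 (also pulls in the SSE float intrinsics) */
#include <stdint.h>

/* _ps: packed single, 4 floats per register */
void add4_floats(const float *a, const float *b, float *out)
{
    _mm_storeu_ps(out, _mm_add_ps(_mm_loadu_ps(a), _mm_loadu_ps(b)));
}

/* _pd: packed double, 2 doubles per register */
void add2_doubles(const double *a, const double *b, double *out)
{
    _mm_storeu_pd(out, _mm_add_pd(_mm_loadu_pd(a), _mm_loadu_pd(b)));
}

/* _epi32: packed 32-bit ints, 4 ints per register */
void add4_int32(const int32_t *a, const int32_t *b, int32_t *out)
{
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)out, _mm_add_epi32(va, vb));
}
```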

##### Questions

**Part one**

1. For both experiments (1 and 2), explain the performance difference between the first run and the succeeding runs.

**exp 1.** Use the provided code to load the data from memory and store it back. Your task is to add 5 to every number in the `a` array.

```
Implement the function void sseFunction(float* a, float* b, int size)
inside add.c.
```

a is the memory where the input resides. b is the memory where the result must be stored.

**exp 2.** Calculate the following polynomial: x^2+3x+2.

`Implement the function void sseFunction(float* a, float* b, int size) inside poly.c.`
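One possible way to evaluate the polynomial with SSE is Horner's form, x^2 + 3x + 2 = (x + 3) * x + 2, which needs one multiplication and two additions per register. The sketch below is not the lab's reference solution: the name `poly_sse_sketch` is hypothetical, and it assumes `size` is a multiple of 4. Because the operations are ordered differently, the last bit of a result may differ from a version that computes x*x + 3*x + 2 term by term.

```c
#include <xmmintrin.h>  /* SSE: __m128, _mm_loadu_ps, _mm_add_ps, _mm_mul_ps */

/* Hypothetical helper: b[k] = a[k]^2 + 3*a[k] + 2.
   Assumes size is a multiple of 4. */
void poly_sse_sketch(const float *a, float *b, int size)
{
    const __m128 three = _mm_set1_ps(3.0f);
    const __m128 two   = _mm_set1_ps(2.0f);
    for (int i = 0; i < size; i += 4) {
        __m128 x = _mm_loadu_ps(a + i);
        /* Horner form: (x + 3) * x + 2 */
        __m128 r = _mm_add_ps(_mm_mul_ps(_mm_add_ps(x, three), x), two);
        _mm_storeu_ps(b + i, r);
    }
}
```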

**Part two**

2. In this experiment the Pythagorean theorem is used to calculate the length of the hypotenuse.

`Implement sseFunction(float* a,float* b,float* c, int size) inside functions.c`
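A minimal sketch of the computation, assuming `a` and `b` hold the two legs and `c` receives the hypotenuse sqrt(a^2 + b^2), and that `size` is a multiple of 4. The name `hypot_sse_sketch` is hypothetical.

```c
#include <xmmintrin.h>  /* SSE: _mm_mul_ps, _mm_add_ps, _mm_sqrt_ps */

/* Hypothetical helper: c[k] = sqrt(a[k]^2 + b[k]^2).
   Assumes size is a multiple of 4. */
void hypot_sse_sketch(const float *a, const float *b, float *c, int size)
{
    for (int i = 0; i < size; i += 4) {
        __m128 va  = _mm_loadu_ps(a + i);
        __m128 vb  = _mm_loadu_ps(b + i);
        __m128 sum = _mm_add_ps(_mm_mul_ps(va, va), _mm_mul_ps(vb, vb));
        _mm_storeu_ps(c + i, _mm_sqrt_ps(sum));  /* 4 square roots at once */
    }
}
```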

**Part three**

3. In this assignment you will normalize 4D vectors.

`Implement sseFunction(float* x,float* y,float* z,float* w,int size) inside functions.c`
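Since the components arrive in four separate arrays, four vectors can be normalized per iteration without any shuffling. The sketch below assumes that the result is written back in place (the signature has no output arrays), that `size` is a multiple of 4, and that no vector has length zero. The name `normalize4_sse_sketch` is hypothetical.

```c
#include <xmmintrin.h>  /* SSE: _mm_mul_ps, _mm_add_ps, _mm_sqrt_ps, _mm_div_ps */

/* Hypothetical helper: divides each (x,y,z,w) by its Euclidean length, in place.
   Assumes size is a multiple of 4 and no zero-length vectors. */
void normalize4_sse_sketch(float *x, float *y, float *z, float *w, int size)
{
    for (int i = 0; i < size; i += 4) {
        __m128 vx = _mm_loadu_ps(x + i);
        __m128 vy = _mm_loadu_ps(y + i);
        __m128 vz = _mm_loadu_ps(z + i);
        __m128 vw = _mm_loadu_ps(w + i);
        /* squared length of 4 vectors: x^2 + y^2 + z^2 + w^2 */
        __m128 sq = _mm_add_ps(
            _mm_add_ps(_mm_mul_ps(vx, vx), _mm_mul_ps(vy, vy)),
            _mm_add_ps(_mm_mul_ps(vz, vz), _mm_mul_ps(vw, vw)));
        __m128 len = _mm_sqrt_ps(sq);
        _mm_storeu_ps(x + i, _mm_div_ps(vx, len));
        _mm_storeu_ps(y + i, _mm_div_ps(vy, len));
        _mm_storeu_ps(z + i, _mm_div_ps(vz, len));
        _mm_storeu_ps(w + i, _mm_div_ps(vw, len));
    }
}
```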

**Part four**

4. Round numbers to the nearest int.

`Implement void sseFunction(float* a, float* b,int size) inside functions.c`
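One way to round with SSE2 only is to convert to 32-bit ints and back. `_mm_cvtps_epi32` rounds according to the current MXCSR rounding mode, which by default is round-to-nearest with ties going to the even value (2.5 becomes 2). Whether that matches the rounding rule of the provided non-SIMD version is an assumption to check. The sketch also assumes the inputs fit in a 32-bit int and that `size` is a multiple of 4; `round_sse_sketch` is a hypothetical name.

```c
#include <emmintrin.h>  /* SSE2: _mm_cvtps_epi32, _mm_cvtepi32_ps */

/* Hypothetical helper: b[k] = a[k] rounded to the nearest integer value.
   Assumes size is a multiple of 4 and inputs are within int32 range. */
void round_sse_sketch(const float *a, float *b, int size)
{
    for (int i = 0; i < size; i += 4) {
        __m128  x = _mm_loadu_ps(a + i);
        __m128i r = _mm_cvtps_epi32(x);        /* float -> int, default mode: nearest, ties to even */
        _mm_storeu_ps(b + i, _mm_cvtepi32_ps(r)); /* int -> float */
    }
}
```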

**Part five**

5. This experiment sums two audio signals.

`Implement void sseFunction(int8_t* a, int8_t* b, int8_t* c, int size) inside functions.c`
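With 8-bit samples, 16 additions fit in one register. The sketch below uses the saturating add `_mm_adds_epi8`, which clamps results to the range [-128, 127]; this is an assumption, since the provided non-SIMD version may wrap around instead, in which case `_mm_add_epi8` would be the matching intrinsic. It also assumes `size` is a multiple of 16. The name `mix_sse_sketch` is hypothetical.

```c
#include <emmintrin.h>  /* SSE2: _mm_adds_epi8, _mm_loadu_si128, _mm_storeu_si128 */
#include <stdint.h>

/* Hypothetical helper: c[k] = clamp(a[k] + b[k], -128, 127).
   Assumes size is a multiple of 16. */
void mix_sse_sketch(const int8_t *a, const int8_t *b, int8_t *c, int size)
{
    for (int i = 0; i < size; i += 16) {
        __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
        __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
        /* signed saturating add on 16 lanes of 8 bits */
        _mm_storeu_si128((__m128i *)(c + i), _mm_adds_epi8(va, vb));
    }
}
```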

**Part six**

6. This experiment counts the frequency of one letter inside a text.

`Implement int sseFunction(char* text, char find, int size) inside functions.c`
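A common pattern for this kind of search is to compare 16 characters at once, turn the comparison result into a 16-bit mask, and count the set bits. The sketch below follows that pattern; `count_char_sse_sketch` is a hypothetical name, and `__builtin_popcount` is a GCC/Clang builtin, so it assumes one of those compilers. Leftover characters that do not fill a whole register are handled by a scalar loop.

```c
#include <emmintrin.h>  /* SSE2: _mm_cmpeq_epi8, _mm_movemask_epi8 */

/* Hypothetical helper: number of occurrences of `find` in text[0..size). */
int count_char_sse_sketch(const char *text, char find, int size)
{
    const __m128i needle = _mm_set1_epi8(find);  /* `find` in all 16 lanes */
    int count = 0;
    int i = 0;
    for (; i + 16 <= size; i += 16) {
        __m128i chunk = _mm_loadu_si128((const __m128i *)(text + i));
        __m128i eq    = _mm_cmpeq_epi8(chunk, needle); /* 0xFF where equal */
        int mask      = _mm_movemask_epi8(eq);         /* one bit per lane */
        count += __builtin_popcount((unsigned)mask);
    }
    for (; i < size; i++)                              /* scalar tail */
        count += (text[i] == find);
    return count;
}
```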