LabExercise Numpy Exercises
===========================

In machine learning we deal with massive amounts of data. Data is most
often organised in tables. When all data elements in a table are of the
same datatype (like an integer or a floating point number) the table can
be represented with a homogeneous array. Languages that are optimally
suited for programming with data are therefore equipped with array data
types that are an integral part of the language.

Although arrays look a lot like Python lists, they are not, as the
following code shows.

.. ipython:: python

   import numpy as np
   a = np.array([1,2,3])
   print(type(a))
   print(a)
   b = [1,2,3]
   print(b)
   print(a+a)  # elementwise addition: [2 4 6]
   print(b+b)  # list concatenation: [1, 2, 3, 1, 2, 3]

The nice thing about Numpy arrays is that they allow you to manipulate
the data in arrays without writing explicit loops. For instance, look at
the addition of all elements in an array:

.. ipython:: python

   import numpy as np
   a = np.random.rand(65536)

   def loopsum(a):
       total = 0  # avoid shadowing the built-in sum
       for i in range(len(a)):
           total += a[i]
       return total

   %timeit loopsum(a)
   %timeit np.sum(a)

So the explicit loop sum function in Python takes about 10 ms versus
about 30 us for the Numpy version. That makes the explicit loop roughly
300 times slower (exact timings depend on your machine). So be aware in
this course to use the built-in Numpy tools to manipulate and calculate
with arrays.

There are many Python/Numpy tutorials available, like this one:
http://cs231n.github.io/python-numpy-tutorial/.

Array Calculations and Array Indexing
-------------------------------------

**In all exercises below you are not allowed to use a loop in python.**

#. Given two arrays A and B of the same size, calculate their
   elementwise sum and their elementwise product.
#. Calculate the mean of all elements in an array A without using the
   np.mean or np.average functions.
#. Calculate the standard deviation of all elements in an array A
   without using np.var or np.std.
#. Given an array A with shape (128,), calculate the sum of the first,
   third, fifth, etc. elements (A[0]+A[2]+...).
#. Given an array A with shape (1024,), make an array B containing only
   the first 512 elements of A, i.e. the shape of B should be (512,).
#. Given an array A with shape (1024,), make an array B containing only
   the elements A[22], A[23], ..., A[42].
#. Given an array A with shape (1024,) and an integer array I, make an
   array B whose elements are A[I[0]], A[I[1]], ..., A[I[-1]].
#. Given an array A with shape (N,), make an array with all elements of
   A in reverse order.
#. Given an array A = np.random.rand(128), make an array B containing
   all elements in A that are less than or equal to 0.5.

Array indexing is probably one of the most difficult subjects in
programming efficiently with Numpy. Let A be a Numpy ndarray
(n-dimensional array); then A[obj] is an indexing operation on array A.
Which type of indexing is used depends on the value and type of obj.
There are really three types of indexing...

Views on Arrays
---------------

Consider the following code snippet:

.. ipython:: python

   import numpy as np
   A = np.random.randint(0, 5, size=(3,5))
   print(A)
   B = A[::2,::2]
   print(B)
   B[:,:] = 99
   print(B)
   print(A)

Yes indeed, B shares the same data (the element values) with A. B is not
an entirely new data item; it is a *view* on (a subset of) the array A.
A programmer should be aware of this, or it will lead to errors that are
hard to spot because they are hidden in the semantics of the array
indexing operators.

Two dimensional data arrays
---------------------------

In machine learning a classical task is classification. Given $n$
features measured for an object (say we measure the mass, the
circumference and the surface area of either bananas or apples), we
would like to classify a piece of fruit as either banana or apple based
only on the three numerical values. In machine learning such a problem
is tackled by collecting a lot of examples of these types of fruit.
Say we have $m$ examples.
For each example with index $i$ we know whether it is an apple or a
banana; this is encoded in the *target vector* $Y$ such that $Y[i]=0$
for bananas and $Y[i]=1$ for apples. The mass, circumference and area of
example $i$ form the $i$-th row of the *data array* $X$. So $X[i,0]$ is
the mass, $X[i,1]$ is the circumference and $X[i,2]$ is the area. The
$i$-th row is called the *feature vector* for the $i$-th example.

The task for a machine learning algorithm is then to come up with a rule
that takes a feature vector as input and returns the corresponding
class. This rule should be learned from the data matrix $X$ and the
target vector $Y$.

#. Select the $i$-th feature vector from the data matrix $X$, i.e.
   select the $i$-th row from $X$.
#. Select the $j$-th column from the data matrix $X$.
#. Given a data matrix $X$ with shape $(m,n)$, calculate the vector $M$
   of shape $(n,)$ where $M[i]$ is the mean of the $i$-th column of $X$,
   i.e. the mean of the $i$-th feature. For instance, in our example of
   the apples and bananas, $M[1]$ is the mean value of the
   circumferences of all pieces of fruit in the data matrix.
#. Now subtract the mean vector you just calculated from all feature
   vectors (the row vectors) in the data set, leading to the data matrix
   $X_0$. Yes, this can be done without a loop! Hint: look at array
   broadcasting.
#. Before calculating the mean of the features we would like to select
   the apples from the data set. Note that apples and bananas are
   randomly distributed over the rows of $X$. So calculate $M_a$ as the
   vector such that $M_a[i]$ is the mean of the $i$-th feature of only
   the apples in the data set.
#. Select the feature vector $F$ for the piece of fruit with the largest
   mass of all in the data set. Hint: look at the function np.argmax for
   this.
#. In a lot of algorithms we start with a data matrix $X$ and then we
   would like to make a matrix $X'$ that is matrix $X$ but with a column
   prepended containing only the values 1. You can do that in a
   one-liner!

Tricks with Arrays
------------------

#. Given an array (vector) A of shape (N,), make it into an array B of
   shape (N,1).
#. Given an array (vector) A of shape (N,), make it into an array B of
   shape (1,N).
#. Given an array A of shape (M,N), what is A[i]? What are the valid
   values for i?
#. Let A35 be an array of shape (3,5) and let v5 be an array of shape
   (5,). Subtract v5 from each row in A35.
#. Let A35 be an array of shape (3,5) and let v3 be an array of shape
   (3,). Subtract v3 from each column in A35.

Linear Algebra
--------------

This is easier in Python 3 than in Python 2. Python 3 (from version 3.5)
introduced the @ operator for matrix multiplication. Let A be an array
of shape (m,n) and let B be an array of shape (n,k); then in Python 3 we
can write A @ B for the matrix multiplication of A and B. In Python 2 we
have to write A.dot(B) for matrix multiplication.

Note that there is a conceptual difference between a 1 dimensional array
V of shape (N,) and a vector V as we know it from linear algebra. In
linear algebra a vector with N elements has dimensions $N\times 1$. A
'vector' V as a Numpy array has shape (N,).

#. Calculate the inner product of two vectors v and w, both of shape
   (N,).
#. Calculate the product of a matrix A of shape (M,N) with a vector v of
   shape (N,).
#. Let v be an array of shape (N,). What is the shape of v.T (or
   np.transpose(v))?
#. Let A be an array of shape (3,5) and let v3 be an array of shape
   (3,). We define v31 = v3.reshape(3,1). What are v3 @ A, v31 @ A and
   v31.T @ A?

If you really want to eat your heart out with Numpy and matrix (tensor)
multiplications, look at the documentation of numpy.tensordot.

Concluding Remarks
------------------

If you want to program a loop over the elements of an array and you
think that is quite a logical thing to do, then chances are high that it
is indeed so logical that someone else has done it before and that it is
part of the Numpy library.
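
As a small illustration of that remark, consider a running total. The
sketch below uses made-up values, and the helper name loop_cumsum is
chosen here for illustration only; it compares a hand-written loop with
the library function np.cumsum, which computes the same cumulative sum.

.. ipython:: python

   import numpy as np

   a = np.array([1.0, 2.0, 3.0, 4.0])

   def loop_cumsum(a):
       # Running total written as an explicit Python loop.
       out = np.empty_like(a)
       total = 0.0
       for i in range(len(a)):
           total += a[i]
           out[i] = total
       return out

   print(loop_cumsum(a))
   print(np.cumsum(a))  # the same result from a single library call

Both calls print the same array; the library call is shorter and avoids
the Python-level loop.
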
In general you should really read the manuals and tutorials on the web
to learn how to use modern array manipulation languages (libraries) and
unleash their power. In some cases ease of programming and speed come at
the cost of (a lot) more memory. But for proof-of-concept
implementations nothing beats these languages/libraries.

General principle: **if you can write it down in a few lines of math, you should be able to program it in a few lines of Python/Numpy code...**
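
To make the principle concrete, here is a minimal sketch that is not
part of the exercises. It assumes two made-up feature vectors x and y
and computes their Euclidean distance
$d(x,y) = \sqrt{\sum_i (x_i - y_i)^2}$. The one line of math becomes one
line of Numpy code.

.. ipython:: python

   import numpy as np

   # Two made-up feature vectors (illustrative values only).
   x = np.array([1.0, 2.0, 3.0])
   y = np.array([4.0, 6.0, 3.0])

   # d(x, y) = sqrt(sum_i (x_i - y_i)^2), written without a loop.
   d = np.sqrt(np.sum((x - y) ** 2))
   print(d)

The subtraction, the squaring and the sum each act on whole arrays, so
the code follows the formula term by term.
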