{ "metadata": { "name": "", "signature": "sha256:43cfdb03de61c090247b8dcd97c85e4965f815d94f2778671a3d96719ee0934b" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "Chapter 3, example 3\n", "====================\n", "\n", "In this example, we will download and analyze data on a large number of cities around the world and their populations. This data set was created by MaxMind and is available for free at http://www.maxmind.com.\n", "\n", "We first download the Zip file and extract it into a folder. The Zip file is about 40 MB, so downloading it may take a while." ] }, { "cell_type": "code", "collapsed": true, "input": [ "import urllib2, zipfile" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 1 }, { "cell_type": "code", "collapsed": false, "input": [ "url = 'http://ipython.rossant.net/'" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 2 }, { "cell_type": "code", "collapsed": false, "input": [ "filename = 'cities.zip'" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 3 }, { "cell_type": "code", "collapsed": false, "input": [ "downloaded = urllib2.urlopen(url + filename)" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 4 }, { "cell_type": "code", "collapsed": false, "input": [ "folder = 'data'" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 5 }, { "cell_type": "code", "collapsed": false, "input": [ "mkdir data" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 6 }, { "cell_type": "code", "collapsed": false, "input": [ "with open(filename, 'wb') as f:\n", "    f.write(downloaded.read())" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 7 }, { "cell_type": "code", "collapsed": false, "input": [ "with zipfile.ZipFile(filename) as zf:\n", "    zf.extractall(folder)" ], "language": "python", "metadata": {}, "outputs": [], 
"prompt_number": 8 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, we're going to use Pandas to load the CSV file that has been extracted. The `read_csv` function of Pandas can open any CSV file." ] }, { "cell_type": "code", "collapsed": false, "input": [ "import pandas as pd" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 9 }, { "cell_type": "code", "collapsed": false, "input": [ "filename = 'data/worldcitiespop.txt'" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 10 }, { "cell_type": "code", "collapsed": false, "input": [ "data = pd.read_csv(filename)" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stderr", "text": [ "/Users/admin/Library/Enthought/Canopy_64bit/User/lib/python2.7/site-packages/pandas/io/parsers.py:1159: DtypeWarning: Columns (3) have mixed types. Specify dtype option on import or set low_memory=False.\n", " data = self._reader.read(nrows)\n" ] } ], "prompt_number": 11 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, let's explore the newly created data object." ] }, { "cell_type": "code", "collapsed": false, "input": [ "type(data)" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 12, "text": [ "pandas.core.frame.DataFrame" ] } ], "prompt_number": 12 }, { "cell_type": "markdown", "metadata": {}, "source": [ "The data object is a DataFrame, a Pandas type consisting of a two-dimensional labeled data structure with columns of potentially different types (like an Excel spreadsheet). As with a NumPy array, the shape attribute returns the shape of the table. Unlike a NumPy array, however, the DataFrame object has a richer structure; in particular, the keys method returns the names of the different columns." 
] }, { "cell_type": "code", "collapsed": false, "input": [ "data.shape, data.keys()" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 13, "text": [ "((3173958, 7),\n", " Index([u'Country', u'City', u'AccentCity', u'Region', u'Population', u'Latitude', u'Longitude'], dtype='object'))" ] } ], "prompt_number": 13 }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can see that data has more than 3 million rows and seven columns, including the country, city, population, and GPS coordinates of each city. The head and tail methods let us take a quick look at the beginning and the end of the table, respectively." ] }, { "cell_type": "code", "collapsed": false, "input": [ "data.tail()" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "
\n", "<table>\n", "<thead>\n", "<tr><th></th><th>Country</th><th>City</th><th>AccentCity</th><th>Region</th><th>Population</th><th>Latitude</th><th>Longitude</th></tr>\n", "</thead>\n", "<tbody>\n",
"<tr><th>3173953</th><td>zw</td><td>zimre park</td><td>Zimre Park</td><td>4</td><td>NaN</td><td>-17.866111</td><td>31.213611</td></tr>\n",
"<tr><th>3173954</th><td>zw</td><td>ziyakamanas</td><td>Ziyakamanas</td><td>0</td><td>NaN</td><td>-18.216667</td><td>27.950000</td></tr>\n",
"<tr><th>3173955</th><td>zw</td><td>zizalisari</td><td>Zizalisari</td><td>4</td><td>NaN</td><td>-17.758889</td><td>31.010556</td></tr>\n",
"<tr><th>3173956</th><td>zw</td><td>zuzumba</td><td>Zuzumba</td><td>6</td><td>NaN</td><td>-20.033333</td><td>27.933333</td></tr>\n",
"<tr><th>3173957</th><td>zw</td><td>zvishavane</td><td>Zvishavane</td><td>7</td><td>79876</td><td>-20.333333</td><td>30.033333</td></tr>\n",
"</tbody>\n", "</table>\n", "
" ], "metadata": {}, "output_type": "pyout", "prompt_number": 14, "text": [ " Country City AccentCity Region Population Latitude \\\n", "3173953 zw zimre park Zimre Park 4 NaN -17.866111 \n", "3173954 zw ziyakamanas Ziyakamanas 0 NaN -18.216667 \n", "3173955 zw zizalisari Zizalisari 4 NaN -17.758889 \n", "3173956 zw zuzumba Zuzumba 6 NaN -20.033333 \n", "3173957 zw zvishavane Zvishavane 7 79876 -20.333333 \n", "\n", " Longitude \n", "3173953 31.213611 \n", "3173954 27.950000 \n", "3173955 31.010556 \n", "3173956 27.933333 \n", "3173957 30.033333 " ] } ], "prompt_number": 14 }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can see that these cities have NaN values as populations. The reason is that the population is not available for all cities in the data set, and Pandas handles those missing values transparently.\n", "\n", "We'll see in the next sections what we can actually do with these data." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Each column of the DataFrame object can be accessed with its name. In IPython, tab completion proposes notably the different columns as attributes of the object. Here we get the series with the names of all cities (AccentCity is the full name of the city, with uppercase characters and accents)." 
] }, { "cell_type": "code", "collapsed": false, "input": [ "data.AccentCity" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 15, "text": [ "0 Aix\ufffds\n", "1 Aixirivali\n", "2 Aixirivall\n", "3 Aixirvall\n", "4 Aixovall\n", "5 Andorra\n", "6 Andorra la Vella\n", "7 Andorra-Vieille\n", "8 Andorre\n", "9 Andorre-la-Vieille\n", "10 Andorre-Vieille\n", "11 Ansalonga\n", "12 Any\ufffds\n", "13 Arans\n", "14 Arinsal\n", "...\n", "3173943 Zandi\n", "3173944 Zanyika\n", "3173945 Zemalapala\n", "3173946 Zemandana\n", "3173947 Zemanda\n", "3173948 Zibalonkwe\n", "3173949 Zibunkululu\n", "3173950 Ziga\n", "3173951 Zikamanas Village\n", "3173952 Zimbabwe\n", "3173953 Zimre Park\n", "3173954 Ziyakamanas\n", "3173955 Zizalisari\n", "3173956 Zuzumba\n", "3173957 Zvishavane\n", "Name: AccentCity, Length: 3173958, dtype: object" ] } ], "prompt_number": 15 }, { "cell_type": "markdown", "metadata": {}, "source": [ "This column is an instance of the Series class. We can access specific rows using indexing. In the following example, we get the name of the city at index 30000 (the 30001st city, since indexing is 0-based):" ] }, { "cell_type": "code", "collapsed": false, "input": [ "data.AccentCity[30000]" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 16, "text": [ "'Howasiyan'" ] } ], "prompt_number": 16 }, { "cell_type": "markdown", "metadata": {}, "source": [ "So we can access an element if we know its index. But how can we obtain a city from its name? For example, we'd like to obtain the population and GPS coordinates of New York. One possibility would be to loop through all cities and check their names, but this would be extremely slow because Python loops over millions of elements are not optimized at all. Pandas and NumPy offer a much more elegant and efficient approach called boolean indexing. It involves two steps that typically occur on the same line of code. 
First, we create an array of boolean values indicating, for each element, whether it satisfies a condition (here, whether the city name is New York). Then, we pass this array of booleans as an index to our original array: the result is a subset of the full array containing only the elements corresponding to True. For example:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "data[data.AccentCity=='New York']," ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 17, "text": [ "( Country City AccentCity Region Population Latitude Longitude\n", " 998166 gb new york New York H7 NaN 53.083333 -0.150000\n", " 1087431 hn new york New York 16 NaN 14.800000 -88.366667\n", " 1525856 jm new york New York 9 NaN 18.250000 -77.183333\n", " 1525857 jm new york New York 10 NaN 18.116667 -77.133333\n", " 1893972 mx new york New York 5 NaN 16.266667 -93.233333\n", " 2929399 us new york New York FL NaN 30.838333 -87.200833\n", " 2946036 us new york New York IA NaN 40.851667 -93.259722\n", " 2951120 us new york New York KY NaN 36.988889 -88.952500\n", " 2977571 us new york New York MO NaN 39.685278 -93.926667\n", " 2986561 us new york New York NM NaN 35.058611 -107.526667\n", " 2990572 us new york New York NY 8107916 40.714167 -74.006389\n", " 3029084 us new york New York TX NaN 32.167778 -95.668889,)" ] } ], "prompt_number": 17 }, { "cell_type": "markdown", "metadata": {}, "source": [ "The same syntax works in NumPy and Pandas. Here, we find a dozen cities named New York, but only one happens to be in the state of New York. 
To access a single element with Pandas, we can use the .ix attribute (for index):" ] }, { "cell_type": "code", "collapsed": false, "input": [ "ny = 2990572\n", "data.ix[ny]" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 18, "text": [ "Country us\n", "City new york\n", "AccentCity New York\n", "Region NY\n", "Population 8107916\n", "Latitude 40.71417\n", "Longitude -74.00639\n", "Name: 2990572, dtype: object" ] } ], "prompt_number": 18 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, let's turn the Population column, a Series object, into a pure NumPy array. We go from the Pandas world to NumPy (keeping in mind that Pandas is built on top of NumPy). We'll mostly work with the population count of all cities." ] }, { "cell_type": "code", "collapsed": false, "input": [ "import numpy as np \n", "population = np.array(data.Population)" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 21 }, { "cell_type": "code", "collapsed": false, "input": [ "population.shape" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 22, "text": [ "(3173958,)" ] } ], "prompt_number": 22 }, { "cell_type": "markdown", "metadata": {}, "source": [ "The population array is a one-dimensional vector with the populations of all cities (or NaN if the population is not available). The population of New York can be accessed in NumPy with basic indexing:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "population[ny]" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 23, "text": [ "8107916.0" ] } ], "prompt_number": 23 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's find out how many cities have an actual population count. To do this, we'll select all elements in the population array whose value is not NaN. 
We can use the NumPy function isnan:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "np.isnan(population)" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 25, "text": [ "array([ True, True, True, ..., True, True, False], dtype=bool)" ] } ], "prompt_number": 25 }, { "cell_type": "code", "collapsed": false, "input": [ "x = population[~np.isnan(population)]\n", "len(x), len(x) / float(len(population))" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 26, "text": [ "(47980, 0.015116772181610469)" ] } ], "prompt_number": 26 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Only about 1.5% of the cities in this data set have a population count." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's now explore some statistics on the cities' populations." ] }, { "cell_type": "code", "collapsed": false, "input": [ "x.mean()" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 27, "text": [ "47719.57063359733" ] } ], "prompt_number": 27 }, { "cell_type": "code", "collapsed": false, "input": [ "x.sum() / 1e9" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 28, "text": [ "2.2895849990000001" ] } ], "prompt_number": 28 }, { "cell_type": "code", "collapsed": false, "input": [ "len(x)/float(len(population))" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 29, "text": [ "0.015116772181610469" ] } ], "prompt_number": 29 }, { "cell_type": "markdown", "metadata": {}, "source": [ "The total population of those cities is about 2.3 billion people, about a third of the current world population. Hence, according to this data set, roughly 30% of the world's population lives in about 1.5% of the cities in the world!" 
] }, { "cell_type": "code", "collapsed": false, "input": [ "data.Population.describe()" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 30, "text": [ "count 47980.000000\n", "mean 47719.570634\n", "std 302888.715626\n", "min 7.000000\n", "25% 3732.000000\n", "50% 10779.000000\n", "75% 27990.500000\n", "max 31480498.000000\n", "Name: Population, dtype: float64" ] } ], "prompt_number": 30 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, let's locate some geographical coordinates." ] }, { "cell_type": "code", "collapsed": false, "input": [ "locations = data[['Latitude','Longitude']].as_matrix()" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 31 }, { "cell_type": "code", "collapsed": false, "input": [ "def locate(x, y):\n", " d = locations - np.array([x, y])\n", " distances = d[:,0] ** 2 + d[:,1] ** 2\n", " closest = distances.argmin()\n", " return data.AccentCity[closest]" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 35 }, { "cell_type": "code", "collapsed": false, "input": [ "print(locate(48.861, 2.3358))" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "Paris\n" ] } ], "prompt_number": 36 }, { "cell_type": "code", "collapsed": false, "input": [], "language": "python", "metadata": {}, "outputs": [] } ], "metadata": {} } ] }