{
 "metadata": {
  "name": "",
  "signature": "sha256:43cfdb03de61c090247b8dcd97c85e4965f815d94f2778671a3d96719ee0934b"
 },
 "nbformat": 3,
 "nbformat_minor": 0,
 "worksheets": [
  {
   "cells": [
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Chapter 3, example 3\n",
      "====================\n",
      "\n",
      "In this example, we will download and analyze data about a large number of cities around the world and their populations. This data has been created by MaxMind and is available for free at http://www.maxmind.com.\n",
      "\n",
      "We first download the Zip file and uncompress it in a folder. The Zip file is about 40 MB, so downloading it may take a while."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": true,
     "input": [
      "import urllib2, zipfile"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 1
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "url = 'http://ipython.rossant.net/'"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 2
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "filename = 'cities.zip'"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 3
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "downloaded = urllib2.urlopen(url + filename)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 4
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "folder = 'data'"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 5
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "mkdir data"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 6
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "with open(filename, 'wb') as f:\n",
      "    f.write(downloaded.read())"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 7
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "with zipfile.ZipFile(filename) as zf:\n",
      "    zf.extractall(folder)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 8
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Now, we're going to use Pandas to load the CSV file that has just been extracted. The `read_csv` function of Pandas reads a CSV file into a table."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "import pandas as pd"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 9
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "filename = 'data/worldcitiespop.txt'"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 10
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "data = pd.read_csv(filename)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stderr",
       "text": [
        "/Users/admin/Library/Enthought/Canopy_64bit/User/lib/python2.7/site-packages/pandas/io/parsers.py:1159: DtypeWarning: Columns (3) have mixed types. Specify dtype option on import or set low_memory=False.\n",
        "  data = self._reader.read(nrows)\n"
       ]
      }
     ],
     "prompt_number": 11
    },
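    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "The warning above says that column 3 (`Region`, counting from 0) contains mixed types. Here is a minimal sketch of one possible way to avoid this, assuming we want region codes to be kept as strings: we pass the `dtype` option that the warning itself suggests. The tiny inline CSV and the names `csv_text` and `sample` are hypothetical and only stand in for the real file; the `data` object is left untouched."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "from io import StringIO\n",
      "import pandas as pd\n",
      "\n",
      "# Hypothetical three-row CSV standing in for the real file: the Region\n",
      "# column mixes numeric-looking codes ('04') with letters ('NY').\n",
      "csv_text = u'Country,City,Region\\nzw,zuzumba,04\\nus,new york,NY\\ngb,new york,H7\\n'\n",
      "\n",
      "# Reading Region as str keeps every code as text, including '04'.\n",
      "sample = pd.read_csv(StringIO(csv_text), dtype={'Region': str})\n",
      "sample.Region.tolist()"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },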
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Now, let's explore the `data` object that `read_csv` has just created."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "type(data)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 12,
       "text": [
        "pandas.core.frame.DataFrame"
       ]
      }
     ],
     "prompt_number": 12
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "The data object is a DataFrame, a Pandas type consisting of a two-dimensional labeled data structure with columns of potentially different types (like an Excel spreadsheet). Like a NumPy array, it has a `shape` attribute that returns the shape of the table. Unlike a NumPy array, however, the DataFrame object has a richer structure; in particular, the `keys` method returns the names of the columns."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "data.shape, data.keys()"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 13,
       "text": [
        "((3173958, 7),\n",
        " Index([u'Country', u'City', u'AccentCity', u'Region', u'Population', u'Latitude', u'Longitude'], dtype='object'))"
       ]
      }
     ],
     "prompt_number": 13
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "We can see that data has more than 3 million rows and seven columns, including the country, city, population and GPS coordinates of each city. The head and tail methods let us take a quick look at the beginning and the end of the table, respectively."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "data.tail()"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "html": [
        "<div style=\"max-height:1000px;max-width:1500px;overflow:auto;\">\n",
        "<table border=\"1\" class=\"dataframe\">\n",
        "  <thead>\n",
        "    <tr style=\"text-align: right;\">\n",
        "      <th></th>\n",
        "      <th>Country</th>\n",
        "      <th>City</th>\n",
        "      <th>AccentCity</th>\n",
        "      <th>Region</th>\n",
        "      <th>Population</th>\n",
        "      <th>Latitude</th>\n",
        "      <th>Longitude</th>\n",
        "    </tr>\n",
        "  </thead>\n",
        "  <tbody>\n",
        "    <tr>\n",
        "      <th>3173953</th>\n",
        "      <td> zw</td>\n",
        "      <td>  zimre park</td>\n",
        "      <td>  Zimre Park</td>\n",
        "      <td> 4</td>\n",
        "      <td>   NaN</td>\n",
        "      <td>-17.866111</td>\n",
        "      <td> 31.213611</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>3173954</th>\n",
        "      <td> zw</td>\n",
        "      <td> ziyakamanas</td>\n",
        "      <td> Ziyakamanas</td>\n",
        "      <td> 0</td>\n",
        "      <td>   NaN</td>\n",
        "      <td>-18.216667</td>\n",
        "      <td> 27.950000</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>3173955</th>\n",
        "      <td> zw</td>\n",
        "      <td>  zizalisari</td>\n",
        "      <td>  Zizalisari</td>\n",
        "      <td> 4</td>\n",
        "      <td>   NaN</td>\n",
        "      <td>-17.758889</td>\n",
        "      <td> 31.010556</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>3173956</th>\n",
        "      <td> zw</td>\n",
        "      <td>     zuzumba</td>\n",
        "      <td>     Zuzumba</td>\n",
        "      <td> 6</td>\n",
        "      <td>   NaN</td>\n",
        "      <td>-20.033333</td>\n",
        "      <td> 27.933333</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>3173957</th>\n",
        "      <td> zw</td>\n",
        "      <td>  zvishavane</td>\n",
        "      <td>  Zvishavane</td>\n",
        "      <td> 7</td>\n",
        "      <td> 79876</td>\n",
        "      <td>-20.333333</td>\n",
        "      <td> 30.033333</td>\n",
        "    </tr>\n",
        "  </tbody>\n",
        "</table>\n",
        "</div>"
       ],
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 14,
       "text": [
        "        Country         City   AccentCity Region  Population   Latitude  \\\n",
        "3173953      zw   zimre park   Zimre Park      4         NaN -17.866111   \n",
        "3173954      zw  ziyakamanas  Ziyakamanas      0         NaN -18.216667   \n",
        "3173955      zw   zizalisari   Zizalisari      4         NaN -17.758889   \n",
        "3173956      zw      zuzumba      Zuzumba      6         NaN -20.033333   \n",
        "3173957      zw   zvishavane   Zvishavane      7       79876 -20.333333   \n",
        "\n",
        "         Longitude  \n",
        "3173953  31.213611  \n",
        "3173954  27.950000  \n",
        "3173955  31.010556  \n",
        "3173956  27.933333  \n",
        "3173957  30.033333  "
       ]
      }
     ],
     "prompt_number": 14
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "We can see that most of these cities have NaN as their population. The reason is that the population is not available for all cities in the data set, and Pandas handles those missing values transparently.\n",
      "\n",
      "We'll see in the next sections what we can actually do with this data."
     ]
    },
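    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "As a small illustration of this transparent handling, here is a sketch on a hypothetical three-element Series (the values are made up and the name `s` is illustrative only): the `count` and `mean` methods simply skip the missing entry."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "import numpy as np\n",
      "import pandas as pd\n",
      "\n",
      "# Hypothetical populations; the middle one is missing.\n",
      "s = pd.Series([100.0, np.nan, 300.0])\n",
      "\n",
      "# count() ignores NaN, and mean() averages only the two known values.\n",
      "s.count(), s.mean()"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },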
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Each column of the DataFrame object can be accessed by its name. In IPython, tab completion also suggests the column names as attributes of the object. Here we get the series with the names of all cities (AccentCity is the full name of the city, with uppercase characters and accents)."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "data.AccentCity"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 15,
       "text": [
        "0                  Aix\u00e0s\n",
        "1             Aixirivali\n",
        "2             Aixirivall\n",
        "3              Aixirvall\n",
        "4               Aixovall\n",
        "5                Andorra\n",
        "6       Andorra la Vella\n",
        "7        Andorra-Vieille\n",
        "8                Andorre\n",
        "9     Andorre-la-Vieille\n",
        "10       Andorre-Vieille\n",
        "11             Ansalonga\n",
        "12                 Any\u00f3s\n",
        "13                 Arans\n",
        "14               Arinsal\n",
        "...\n",
        "3173943                Zandi\n",
        "3173944              Zanyika\n",
        "3173945           Zemalapala\n",
        "3173946            Zemandana\n",
        "3173947              Zemanda\n",
        "3173948           Zibalonkwe\n",
        "3173949          Zibunkululu\n",
        "3173950                 Ziga\n",
        "3173951    Zikamanas Village\n",
        "3173952             Zimbabwe\n",
        "3173953           Zimre Park\n",
        "3173954          Ziyakamanas\n",
        "3173955           Zizalisari\n",
        "3173956              Zuzumba\n",
        "3173957           Zvishavane\n",
        "Name: AccentCity, Length: 3173958, dtype: object"
       ]
      }
     ],
     "prompt_number": 15
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "This column is an instance of the Series class. We can access individual rows by indexing. In the following example, we get the name of the city at index 30000 (indexing is 0-based, so this is the 30001st row):"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "data.AccentCity[30000]"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 16,
       "text": [
        "'Howasiyan'"
       ]
      }
     ],
     "prompt_number": 16
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "So we can access an element knowing its index. But how can we obtain a city from its name? For example, we'd like to obtain the population and GPS coordinates of New York. One possibility would be to loop through all cities and check their names, but it would be extremely slow because Python loops over millions of elements are not optimized at all. Pandas and NumPy offer a much more elegant and efficient way called boolean indexing. There are two steps that typically occur on the same line of code. First, we create an array of boolean values indicating, for each element, whether it satisfies a condition or not (here, whether the city name is New York). Then, we pass this array of booleans as an index to our original array: the result is a subpart of the full array with only the elements corresponding to True. For example:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "data[data.AccentCity == 'New York']"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 17,
       "text": [
        "        Country      City AccentCity Region  Population   Latitude   Longitude\n",
        "998166       gb  new york   New York     H7         NaN  53.083333   -0.150000\n",
        "1087431      hn  new york   New York     16         NaN  14.800000  -88.366667\n",
        "1525856      jm  new york   New York      9         NaN  18.250000  -77.183333\n",
        "1525857      jm  new york   New York     10         NaN  18.116667  -77.133333\n",
        "1893972      mx  new york   New York      5         NaN  16.266667  -93.233333\n",
        "2929399      us  new york   New York     FL         NaN  30.838333  -87.200833\n",
        "2946036      us  new york   New York     IA         NaN  40.851667  -93.259722\n",
        "2951120      us  new york   New York     KY         NaN  36.988889  -88.952500\n",
        "2977571      us  new york   New York     MO         NaN  39.685278  -93.926667\n",
        "2986561      us  new york   New York     NM         NaN  35.058611 -107.526667\n",
        "2990572      us  new york   New York     NY     8107916  40.714167  -74.006389\n",
        "3029084      us  new york   New York     TX         NaN  32.167778  -95.668889"
       ]
      }
     ],
     "prompt_number": 17
    },
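    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "The two steps of boolean indexing used above can also be written separately. Here is a minimal sketch on a hypothetical four-element NumPy array (the names `names` and `mask` are illustrative only and are not part of the city data):"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "import numpy as np\n",
      "\n",
      "# Hypothetical city names.\n",
      "names = np.array(['Paris', 'New York', 'Lima', 'New York'])\n",
      "\n",
      "# Step 1: a boolean array, True where the condition holds.\n",
      "mask = names == 'New York'\n",
      "\n",
      "# Step 2: use it as an index to keep only the matching elements.\n",
      "mask, names[mask]"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },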
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "The same syntax works in NumPy and Pandas. In our data set, we find a dozen cities named New York, but only one of them is in the state of New York. To access a single row by its index label with Pandas, we can use the .loc attribute (older Pandas versions also offered .ix, which has since been removed):"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "ny = 2990572\n",
      "data.loc[ny]"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 18,
       "text": [
        "Country             us\n",
        "City          new york\n",
        "AccentCity    New York\n",
        "Region              NY\n",
        "Population     8107916\n",
        "Latitude      40.71417\n",
        "Longitude    -74.00639\n",
        "Name: 2990572, dtype: object"
       ]
      }
     ],
     "prompt_number": 18
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Now, let's turn the Population column (a Series object) into a pure NumPy array. We go from the Pandas world to NumPy (keeping in mind that Pandas is built on top of NumPy). We'll mostly work with the population count of all cities."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "import numpy as np\n",
      "population = np.array(data.Population)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 21
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "population.shape"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 22,
       "text": [
        "(3173958,)"
       ]
      }
     ],
     "prompt_number": 22
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "The population array is a one-dimensional vector with the populations of all cities (or NaN if the population is not available). The population of New York can be accessed in NumPy with basic indexing:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "population[ny]"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 23,
       "text": [
        "8107916.0"
       ]
      }
     ],
     "prompt_number": 23
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Let's find out how many cities have an actual population count. To do this, we'll select all elements in the population array that have a value different from NaN. We can use the NumPy function isnan:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "np.isnan(population)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 25,
       "text": [
        "array([ True,  True,  True, ...,  True,  True, False], dtype=bool)"
       ]
      }
     ],
     "prompt_number": 25
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# Keep only the entries that are not NaN (~ negates the boolean mask).\n",
      "x = population[~np.isnan(population)]\n",
      "len(x), len(x) / float(len(population))"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 26,
       "text": [
        "(47980, 0.015116772181610469)"
       ]
      }
     ],
     "prompt_number": 26
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Only about 1.5% of the cities in this data set have a population count."
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Let's now explore some statistics on the populations of these cities."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "x.mean()"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 27,
       "text": [
        "47719.57063359733"
       ]
      }
     ],
     "prompt_number": 27
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "x.sum() / 1e9"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 28,
       "text": [
        "2.2895849990000001"
       ]
      }
     ],
     "prompt_number": 28
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "len(x)/float(len(population))"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 29,
       "text": [
        "0.015116772181610469"
       ]
      }
     ],
     "prompt_number": 29
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "The total population of those cities is about 2.3 billion people, about a third of the world population at the time of writing. Hence, according to this data set, roughly 30% of the population lives in about 1.5% of the cities in the world!"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "data.Population.describe()"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 30,
       "text": [
        "count       47980.000000\n",
        "mean        47719.570634\n",
        "std        302888.715626\n",
        "min             7.000000\n",
        "25%          3732.000000\n",
        "50%         10779.000000\n",
        "75%         27990.500000\n",
        "max      31480498.000000\n",
        "Name: Population, dtype: float64"
       ]
      }
     ],
     "prompt_number": 30
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Now, let's find the city that is closest to a given pair of geographical coordinates."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "locations = data[['Latitude','Longitude']].values"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 31
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
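      "def locate(x, y):\n",
      "    # x is a latitude and y a longitude, both in degrees.\n",
      "    d = locations - np.array([x, y])\n",
      "    # Squared Euclidean distance in degrees: only a rough proxy for geographic distance.\n",
      "    distances = d[:,0] ** 2 + d[:,1] ** 2\n",
      "    closest = distances.argmin()\n",
      "    return data.AccentCity[closest]"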
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 35
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "print(locate(48.861, 2.3358))"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Paris\n"
       ]
      }
     ],
     "prompt_number": 36
    },
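    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "The `locate` function relies on NumPy broadcasting: subtracting a pair of values from an array with two columns subtracts it from every row at once. Here is a sketch of the same idea on three hypothetical points (the coordinates are rounded, illustrative values, and `points` and `labels` are made-up names that are not part of the city data):"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "import numpy as np\n",
      "\n",
      "# Hypothetical (latitude, longitude) pairs, rounded for illustration.\n",
      "points = np.array([[48.9, 2.3], [51.5, -0.1], [40.7, -74.0]])\n",
      "labels = ['Paris', 'London', 'New York']\n",
      "\n",
      "# Broadcasting subtracts the target pair from every row at once.\n",
      "d = points - np.array([51.0, 0.0])\n",
      "distances = d[:, 0] ** 2 + d[:, 1] ** 2\n",
      "\n",
      "# argmin returns the position of the smallest squared distance.\n",
      "labels[distances.argmin()]"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },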
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [],
     "language": "python",
     "metadata": {},
     "outputs": []
    }
   ],
   "metadata": {}
  }
 ]
}