{
 "nbformat": 3, 
 "nbformat_minor": 0, 
 "worksheets": [
  {
   "cells": [
    {
     "source": [
      "> This is one of the 100 recipes of the [IPython Cookbook](http://ipython-books.github.io/), the definitive guide to high-performance scientific computing and data science in Python.\n"
     ], 
     "cell_type": "markdown", 
     "metadata": []
    }, 
    {
     "source": [
      "# 4.11. Manipulating large heterogeneous tables with HDF5 and PyTables"
     ], 
     "cell_type": "markdown", 
     "metadata": {}
    }, 
    {
     "cell_type": "code", 
     "language": "python", 
     "outputs": [], 
     "collapsed": false, 
     "input": [
      "import numpy as np\n", 
      "import tables as tb"
     ], 
     "metadata": {}
    }, 
    {
     "source": [
      "We create a new HDF5 file."
     ], 
     "cell_type": "markdown", 
     "metadata": {}
    }, 
    {
     "cell_type": "code", 
     "language": "python", 
     "outputs": [], 
     "collapsed": false, 
     "input": [
      "f = tb.open_file('myfile.h5', 'w')"
     ], 
     "metadata": {}
    }, 
    {
     "source": [
      "We will create a HDF5 table with two columns: the name of a city (a string with 64 characters at most), and its population (a 32 bit integer)."
     ], 
     "cell_type": "markdown", 
     "metadata": {}
    }, 
    {
     "cell_type": "code", 
     "language": "python", 
     "outputs": [], 
     "collapsed": false, 
     "input": [
      "dtype = np.dtype([('city', 'S64'), ('population', 'i4')])"
     ], 
     "metadata": {}
    }, 
    {
     "source": [
      "Now, we create the table in '/table1'."
     ], 
     "cell_type": "markdown", 
     "metadata": {}
    }, 
    {
     "cell_type": "code", 
     "language": "python", 
     "outputs": [], 
     "collapsed": false, 
     "input": [
      "table = f.create_table('/', 'table1', dtype)"
     ], 
     "metadata": {}
    }, 
    {
     "source": [
      "Let's add a few rows."
     ], 
     "cell_type": "markdown", 
     "metadata": {}
    }, 
    {
     "cell_type": "code", 
     "language": "python", 
     "outputs": [], 
     "collapsed": false, 
     "input": [
      "table.append([('Brussels', 1138854),\n", 
      "              ('London',   8308369),\n", 
      "              ('Paris',    2243833)])"
     ], 
     "metadata": {}
    }, 
    {
     "source": [
      "After adding rows, we need to flush the table to commit the changes on disk."
     ], 
     "cell_type": "markdown", 
     "metadata": {}
    }, 
    {
     "cell_type": "code", 
     "language": "python", 
     "outputs": [], 
     "collapsed": false, 
     "input": [
      "table.flush()"
     ], 
     "metadata": {}
    }, 
    {
     "source": [
      "Data can be obtained from the table with a lot of different ways in PyTables. The easiest but less efficient way is to load the entire table in memory, which returns a NumPy array."
     ], 
     "cell_type": "markdown", 
     "metadata": {}
    }, 
    {
     "cell_type": "code", 
     "language": "python", 
     "outputs": [], 
     "collapsed": false, 
     "input": [
      "table[:]"
     ], 
     "metadata": {}
    }, 
    {
     "source": [
      "It is also possible to load a particular column (and all rows)."
     ], 
     "cell_type": "markdown", 
     "metadata": {}
    }, 
    {
     "cell_type": "code", 
     "language": "python", 
     "outputs": [], 
     "collapsed": false, 
     "input": [
      "table.col('city')"
     ], 
     "metadata": {}
    }, 
    {
     "source": [
      "When dealing with a large number of rows, we can make a SQL-like query in the table to load all rows that satisfy particular conditions."
     ], 
     "cell_type": "markdown", 
     "metadata": {}
    }, 
    {
     "cell_type": "code", 
     "language": "python", 
     "outputs": [], 
     "collapsed": false, 
     "input": [
      "[row['city'] for row in table.where('population>2e6')]"
     ], 
     "metadata": {}
    }, 
    {
     "source": [
      "Finally, we can access particular rows knowing their indices."
     ], 
     "cell_type": "markdown", 
     "metadata": {}
    }, 
    {
     "cell_type": "code", 
     "language": "python", 
     "outputs": [], 
     "collapsed": false, 
     "input": [
      "table[1]"
     ], 
     "metadata": {}
    }, 
    {
     "source": [
      "Clean-up."
     ], 
     "cell_type": "markdown", 
     "metadata": {}
    }, 
    {
     "cell_type": "code", 
     "language": "python", 
     "outputs": [], 
     "collapsed": false, 
     "input": [
      "f.close()\n", 
      "import os\n", 
      "os.remove('myfile.h5')"
     ], 
     "metadata": {}
    }, 
    {
     "source": [
      "> You'll find all the explanations, figures, references, and much more in the book (to be released later this summer).\n\n> [IPython Cookbook](http://ipython-books.github.io/), by [Cyrille Rossant](http://cyrille.rossant.net), Packt Publishing, 2014 (500 pages)."
     ], 
     "cell_type": "markdown", 
     "metadata": {}
    }
   ], 
   "metadata": {}
  }
 ], 
 "metadata": {
  "name": "", 
  "signature": "sha256:3825624d6bf4f8e38eb125e50bfd8c7a90334b8b4a72b373a07c7ac2c38b8421"
 }
}