{ "cells": [ { "cell_type": "markdown", "metadata": { "toc": true }, "source": [ "

Table of Contents
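{ "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "A minimal sketch of the same streaming pattern for bz2, assuming a hypothetical `input.bz2`:\n", "\n", "```python\n", "import bz2\n", "\n", "# Same idea as with gzip: lines are decompressed on demand,\n", "# so only one line needs to be in memory at a time.\n", "with bz2.BZ2File('input.bz2') as fin:\n", "    for line in fin:\n", "        print('got line', line)\n", "```" ] },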

\n", "
" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Working with big files" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ " \n", "\n", "## Key idea 1: read the file streaming, and unpack on the fly\n", "\n", "### Why?\n", "> This scales: you do NOT want to have a big file in memory if you only need it bit by bit.
\n", "And why waste time and HD space with unpacking a file completely?\n", "\n", "1. You do not have to unzip a zip, gzipped or bzip2 file before you can read it.\n", "2. You can read it streaming.\n", " * Even on the command line:\n", " * `zcat` for gzipped files\n", " * or `gunzip -c |more` \n", " * See \n", "3. In Python:\n", "\n", "```\n", "import gzip\n", "\n", "with gzip.open('input.gz','r') as fin:\n", " for line in fin:\n", " print('got line', line)\n", "```\n", "4. For bz2 files there is `BZ2File` with similar interface." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Key idea 2: clean memory when you are done\n", "1. This especially holds when working with XML files and `lxml`\n", "2. Even when you read an XML file \"streamingly\" and remove the context,\n", " * `lxml` stores the internal tree structure\n", " * so your memory consumption starts to go up,\n", " * your machine starts to swap like hell\n", " * and basically stalls\n", " \n", " \n", " " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Key idea 3: divide and conquer\n", "1. If you are OK with RAM memory, but your input file(s) are still so big that processing takes ages\n", "2. you can **divide** the work over several machines or cores\n", "3. and afterwards **combine** the results.\n", "4. Sometimes you have to divide yourself, sometimes you get the input data already in several files.\n", " * E.g., you can downoad the complete wikipedia dump in 1 file or in 4 files. \n", " " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Three examples\n", "\n", "1. [Reading a big text file](http://nbviewer.jupyter.org/format/slides/url/staff.fnwi.uva.nl/m.j.marx/teaching/DataScience/NoteBooks/ReadingFilesFromTheWeb.ipynb#Reading-gzipped-file-line-by-line) \n", " * We have done this before several times.\n", "1. [Reading a big XML file](http://nbviewer.jupyter.org/format/slides/url/staff.fnwi.uva.nl/m.j.marx/teaching/DataScience/NoteBooks/ParseWikipediaDump.ipynb)\n", "1. [Reading a big spreadsheet](http://nbviewer.jupyter.org/format/slides/url/staff.fnwi.uva.nl/m.j.marx/teaching/DataScience/NoteBooks/ParseBigSpreadsheet.ipynb)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] } ], "metadata": { "anaconda-cloud": {}, "celltoolbar": "Slideshow", "kernelspec": { "display_name": "Python 2", "language": "python", "name": "python2" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.15" }, "toc": { "nav_menu": {}, "number_sections": true, "sideBar": true, "skip_h1_title": false, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": true, "toc_position": {}, "toc_section_display": true, "toc_window_display": false } }, "nbformat": 4, "nbformat_minor": 1 }