{ "cells": [ { "cell_type": "markdown", "metadata": { "toc": true }, "source": [ "

Table of Contents
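{ "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "A minimal sketch of the same streaming pattern for bz2, assuming a hypothetical `input.bz2`:\n", "\n", "```python\n", "import bz2\n", "\n", "# Same idea as with gzip: lines are decompressed on demand,\n", "# so only one line needs to be in memory at a time.\n", "with bz2.BZ2File('input.bz2') as fin:\n", "    for line in fin:\n", "        print('got line', line)\n", "```" ] },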

\n", "
" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Working with big files" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ " \n", "\n", "## Key idea 1: read the file streaming, and unpack on the fly\n", "\n", "### Why?\n", "> This scales: you do NOT want to have a big file in memory if you only need it bit by bit.
\n", "And why waste time and HD space with unpacking a file completely?\n", "\n", "1. You do not have to unzip a zip, gzipped or bzip2 file before you can read it.\n", "2. You can read it streaming.\n", " * Even on the command line:\n", " * `zcat` for gzipped files\n", " * or `gunzip -c |more` \n", " * See \n", "3. In Python:\n", "\n", "```\n", "import gzip\n", "\n", "with gzip.open('input.gz','r') as fin:\n", " for line in fin:\n", " print('got line', line)\n", "```\n", "4. For bz2 files there is `BZ2File` with similar interface." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Key idea 2: clean memory when you are done\n", "1. This especially holds when working with XML files and `lxml`\n", "2. Even when you read an XML file \"streamingly\" and remove the context,\n", " * `lxml` stores the internal tree structure\n", " * so your memory consumption starts to go up,\n", " * your machine starts to swap like hell\n", " * and basically stalls\n", " \n", " \n", " " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Key idea 3: divide and conquer\n", "1. If you are OK with RAM memory, but your input file(s) are still so big that processing takes ages\n", "2. you can **divide** the work over several machines or cores\n", "3. and afterwards **combine** the results.\n", "4. Sometimes you have to divide yourself, sometimes you get the input data already in several files.\n", " * E.g., you can downoad the complete wikipedia dump in 1 file or in 4 files. \n", " " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Three examples\n", "\n", "1. [Reading a big text file](http://nbviewer.jupyter.org/format/slides/url/staff.fnwi.uva.nl/m.j.marx/teaching/DataScience/NoteBooks/ReadingFilesFromTheWeb.ipynb#Reading-gzipped-file-line-by-line) \n", " * We have done this before several times.\n", "1. [Reading a big XML file](http://nbviewer.jupyter.org/format/slides/url/staff.fnwi.uva.nl/m.j.marx/teaching/DataScience/NoteBooks/ParseWikipediaDump.ipynb)\n", "1. [Reading a big spreadsheet](http://nbviewer.jupyter.org/format/slides/url/staff.fnwi.uva.nl/m.j.marx/teaching/DataScience/NoteBooks/ParseBigSpreadsheet.ipynb)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] } ], "metadata": { "anaconda-cloud": {}, "celltoolbar": "Slideshow", "kernelspec": { "display_name": "Python 2", "language": "python", "name": "python2" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.15" }, "toc": { "nav_menu": {}, "number_sections": true, "sideBar": true, "skip_h1_title": false, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": true, "toc_position": {}, "toc_section_display": true, "toc_window_display": false } }, "nbformat": 4, "nbformat_minor": 1 }