" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Working with big files" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ " \n", "\n", "## Key idea 1: read the file streaming, and unpack on the fly\n", "\n", "### Why?\n", "> This scales: you do NOT want to have a big file in memory if you only need it bit by bit.
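```python
import gzip

# Open in text mode ('rt') so each line is a str; plain 'r' yields bytes.
with gzip.open('input.gz', 'rt') as fin:
    for line in fin:  # decompressed lazily, one line at a time
        print('got line', line)
```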
4. For bz2 files there is `BZ2File` (in the `bz2` module) with a similar interface." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Key idea 2: clean memory when you are done\n", "1. This especially holds when working with XML files and `lxml`\n", "2. Even when you read an XML file \"streamingly\" and remove the context,\n", " * `lxml` still stores the internal tree structure\n", " * so your memory consumption starts to go up,\n", " * your machine starts to swap like hell\n", " * and basically stalls\n", " \n", " \n", " " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Key idea 3: divide and conquer\n", "1. If RAM is not the problem, but your input file(s) are still so big that processing takes ages,\n", "2. you can **divide** the work over several machines or cores\n", "3. and afterwards **combine** the results.\n", "4. Sometimes you have to divide the input yourself, sometimes you get the input data already in several files.\n", " * E.g., you can download the complete Wikipedia dump in 1 file or in 4 files. \n", " " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Three examples\n", "\n", "1. [Reading a big text file](http://nbviewer.jupyter.org/format/slides/url/staff.fnwi.uva.nl/m.j.marx/teaching/DataScience/NoteBooks/ReadingFilesFromTheWeb.ipynb#Reading-gzipped-file-line-by-line) \n", " * We have done this several times before.\n", "1. [Reading a big XML file](http://nbviewer.jupyter.org/format/slides/url/staff.fnwi.uva.nl/m.j.marx/teaching/DataScience/NoteBooks/ParseWikipediaDump.ipynb)\n", "1. [Reading a big spreadsheet](http://nbviewer.jupyter.org/format/slides/url/staff.fnwi.uva.nl/m.j.marx/teaching/DataScience/NoteBooks/ParseBigSpreadsheet.ipynb)
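
For the XML case, the memory-cleaning idea from Key idea 2 can be sketched as follows. This is a minimal sketch, not the code from the linked notebook: it uses the standard library's `xml.etree.ElementTree.iterparse` instead of `lxml`, and the helper name `iter_titles`, the tag names `page` and `title`, and the tiny in-memory sample are all made up for illustration. `lxml.etree.iterparse` has a similar interface and elements there also have `clear()`, but with `lxml` you would typically also delete already-processed siblings from the parent.

```python
import io
import xml.etree.ElementTree as ET


def iter_titles(source, tag='page'):
    # Hypothetical helper: yield the <title> text of each <tag> element
    # while throwing away elements that have already been handled.
    context = iter(ET.iterparse(source, events=('start', 'end')))
    _, root = next(context)  # first event is the start of the root element
    for event, elem in context:
        if event == 'end' and elem.tag == tag:
            yield elem.findtext('title')
            # Drop all finished children of the root so the tree stays small.
            root.clear()


# Made-up stand-in for a big dump; a real run would pass a file object,
# e.g. one returned by gzip.open or bz2.open.
sample = io.BytesIO(
    b'<dump><page><title>A</title></page><page><title>B</title></page></dump>'
)
titles = list(iter_titles(sample))
print(titles)
```

Clearing the root after each finished element assumes the elements of interest are direct children of the root, as in this sample; deeper nesting would need the clean-up applied to the relevant parent instead.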