{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Text classification with Naive Bayes: feature selection with mutual information\n",
    "\n",
    "## Notebook made by\n",
    "\n",
    "| **Name** | **Student id** | **Email** |\n",
    "|:- |:-|:-|\n",
    "|. | | |\n",
    "|  | |. |\n",
    "\n",
    "### Pledge (taken from [Coursera's Honor Code](https://www.coursera.org/about/terms/honorcode))\n",
    "\n",
    "Put a selfie here in which you hold a signed paper with the following text (for team work, put two selfies here). The link must point to a location on the web, not to a local file.\n",
    "\n",
    "> My answers to homework, quizzes and exams will be my own work (except for assignments that explicitly permit collaboration).\n",
    "\n",
    "> I will not make solutions to homework, quizzes or exams available to anyone else. This includes both solutions written by me, as well as any official solutions provided by the course staff.\n",
    "\n",
    "> I will not engage in any other activities that will dishonestly improve my results or dishonestly improve/hurt the results of others.\n",
    "\n",
    "<img src='link to your selfie'/>\n",
    "\n",
    "### Note\n",
    "* **Assignments without the selfies or with incomplete information will not be graded and will receive 0 points.**\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Text classification with Naive Bayes\n",
    "\n",
    "<h3>Abstract</h3>\n",
    "<p>We will do <strong>text classification</strong> on a collection of Dutch parliamentary questions.</p>\n",
    "<p>In this notebook we restrict ourselves to computing the <strong>mutual information</strong> scores, as described in <a href='http://nlp.stanford.edu/IR-book/html/htmledition/mutual-information-1.html'>http://nlp.stanford.edu/IR-book/html/htmledition/mutual-information-1.html</a>.</p>\n",
    "\n",
    "#### Data\n",
    "* 40K parliamentary questions (kamervragen): <http://maartenmarx.nl/teaching/zoekmachines/LectureNotes/MySQL/KVR.csv.gz>\n",
    "* 1K parliamentary questions (kamervragen): <http://maartenmarx.nl/teaching/zoekmachines/LectureNotes/MySQL/KVR1000.csv.gz>\n"
   ]
  },
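  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For reference, the mutual information score from the chapter linked above can be written out as follows. This is a sketch in that chapter's notation, not something defined in this notebook: the symbols $U$, $C$ and the counts $N_{11}$, $N_{10}$, $N_{01}$, $N_{00}$ are assumptions taken from there. $U = 1$ if a document contains the term and $C = 1$ if the document belongs to the class.\n",
    "\n",
    "$$I(U;C) = \\sum_{e_t \\in \\{1,0\\}} \\sum_{e_c \\in \\{1,0\\}} P(U=e_t, C=e_c) \\log_2 \\frac{P(U=e_t, C=e_c)}{P(U=e_t)\\,P(C=e_c)}$$\n",
    "\n",
    "With maximum likelihood estimates of the probabilities, where $N_{e_t e_c}$ is the number of documents with term indicator $e_t$ and class indicator $e_c$, $N_{1.} = N_{10} + N_{11}$, $N_{.1} = N_{01} + N_{11}$ (and likewise for $N_{0.}$ and $N_{.0}$), and $N$ is the total number of documents, this becomes:\n",
    "\n",
    "$$I(U;C) = \\frac{N_{11}}{N} \\log_2 \\frac{N N_{11}}{N_{1.} N_{.1}} + \\frac{N_{01}}{N} \\log_2 \\frac{N N_{01}}{N_{0.} N_{.1}} + \\frac{N_{10}}{N} \\log_2 \\frac{N N_{10}}{N_{1.} N_{.0}} + \\frac{N_{00}}{N} \\log_2 \\frac{N N_{00}}{N_{0.} N_{.0}}$$\n",
    "\n",
    "By the usual convention, a term whose count is zero contributes $0$ to the sum. In the setting of this notebook, a class would presumably be one ministry and $N_{11}$ the number of that ministry's documents that contain the word."
   ]
  },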
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import nltk\n",
    "from pattern.nl import lemma  # performs very poorly for Dutch\n",
    "import re\n",
    "from collections import  Counter\n",
    "import itertools\n",
    "import sklearn\n",
    "import pandas as pd\n",
    "import numpy as np\n",
    "from numpy import log2\n",
    "from nltk.corpus import stopwords\n",
    "dutchstop = set(stopwords.words('dutch'))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "(40516, 6)\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>jaar</th>\n",
       "      <th>partij</th>\n",
       "      <th>titel</th>\n",
       "      <th>vraag</th>\n",
       "      <th>antwoord</th>\n",
       "      <th>ministerie</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>KVR1000.xml</th>\n",
       "      <td>1994</td>\n",
       "      <td>PvdA</td>\n",
       "      <td>De vragen betreffen de betrouwbaarheid van de...</td>\n",
       "      <td>Hebt u kennisgenomen van het televisieprogram...</td>\n",
       "      <td>Ja. Het bedoelde geluidmeetpunt is eigendom v...</td>\n",
       "      <td>Verkeer en Waterstaat</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>KVR10000.xml</th>\n",
       "      <td>1999</td>\n",
       "      <td>PvdA</td>\n",
       "      <td>Vragen naar aanleiding van berichten (uitzend...</td>\n",
       "      <td>Kent u de berichten over de situatie in de Me...</td>\n",
       "      <td></td>\n",
       "      <td>Justitie</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>KVR10001.xml</th>\n",
       "      <td>1999</td>\n",
       "      <td>SP</td>\n",
       "      <td>Vragen naar aanleiding van de berichten \"Nede...</td>\n",
       "      <td>Kent u de berichten «Nederland steunt de Soeh...</td>\n",
       "      <td></td>\n",
       "      <td>Financien</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>KVR10002.xml</th>\n",
       "      <td>1999</td>\n",
       "      <td>PvdA</td>\n",
       "      <td>Vragen over de gebrekkige opvang van verpleeg...</td>\n",
       "      <td>Kent u het bericht over onderzoek van Nu91 me...</td>\n",
       "      <td>Ja. Het onderzoek van NU’91 wijst uit dat het...</td>\n",
       "      <td>Volksgezondheid, Welzijn en Sport</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>KVR10003.xml</th>\n",
       "      <td>1999</td>\n",
       "      <td>PvdA</td>\n",
       "      <td>Vragen over onbetrouwbaarheid van filemeldingen.</td>\n",
       "      <td>Hebt u kennisgenomen van de berichten over de...</td>\n",
       "      <td>Ja. Nee. Door de waarnemers van het Algemeen ...</td>\n",
       "      <td>Verkeer en Waterstaat</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                jaar partij  \\\n",
       "KVR1000.xml     1994   PvdA   \n",
       " KVR10000.xml   1999   PvdA   \n",
       " KVR10001.xml   1999     SP   \n",
       " KVR10002.xml   1999   PvdA   \n",
       " KVR10003.xml   1999   PvdA   \n",
       "\n",
       "                                                           titel  \\\n",
       "KVR1000.xml     De vragen betreffen de betrouwbaarheid van de...   \n",
       " KVR10000.xml   Vragen naar aanleiding van berichten (uitzend...   \n",
       " KVR10001.xml   Vragen naar aanleiding van de berichten \"Nede...   \n",
       " KVR10002.xml   Vragen over de gebrekkige opvang van verpleeg...   \n",
       " KVR10003.xml   Vragen over onbetrouwbaarheid van filemeldingen.   \n",
       "\n",
       "                                                           vraag  \\\n",
       "KVR1000.xml     Hebt u kennisgenomen van het televisieprogram...   \n",
       " KVR10000.xml   Kent u de berichten over de situatie in de Me...   \n",
       " KVR10001.xml   Kent u de berichten «Nederland steunt de Soeh...   \n",
       " KVR10002.xml   Kent u het bericht over onderzoek van Nu91 me...   \n",
       " KVR10003.xml   Hebt u kennisgenomen van de berichten over de...   \n",
       "\n",
       "                                                        antwoord  \\\n",
       "KVR1000.xml     Ja. Het bedoelde geluidmeetpunt is eigendom v...   \n",
       " KVR10000.xml                                                      \n",
       " KVR10001.xml                                                      \n",
       " KVR10002.xml   Ja. Het onderzoek van NU’91 wijst uit dat het...   \n",
       " KVR10003.xml   Ja. Nee. Door de waarnemers van het Algemeen ...   \n",
       "\n",
       "                                       ministerie  \n",
       "KVR1000.xml                 Verkeer en Waterstaat  \n",
       " KVR10000.xml                            Justitie  \n",
       " KVR10001.xml                           Financien  \n",
       " KVR10002.xml   Volksgezondheid, Welzijn en Sport  \n",
       " KVR10003.xml               Verkeer en Waterstaat  "
      ]
     },
     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Load the 40K parliamentary questions (tab-separated); column names are supplied via `names`\n",
    "kvrdf = pd.read_csv('http://maartenmarx.nl/teaching/zoekmachines/LectureNotes/MySQL/KVR.csv.gz',\n",
    "                    compression='gzip',\n",
    "                    sep='\\t',\n",
    "                    index_col=0,\n",
    "                    # encoding='utf-8',\n",
    "                    names=['jaar', 'partij', 'titel', 'vraag', 'antwoord', 'ministerie'])\n",
    "print(kvrdf.shape)\n",
    "kvrdf.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<h2>Exercise</h2>\n",
    "\n",
    "<p>We will use the values in the <em>ministerie</em> column as our classes.\n",
    "    These are the ministries to which the questions are addressed.\n",
    "<br/>\n",
    "    Note that these labels are <strong>not normalized</strong>; see, for example, the counts below.</p>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       " Justitie (JUS)                                                    3219\n",
       " Volksgezondheid, Welzijn en Sport (VWS)                           2630\n",
       " Buitenlandse Zaken (BUZA)                                         1796\n",
       " Verkeer en Waterstaat (VW)                                        1441\n",
       " Justitie                                                          1333\n",
       " Sociale Zaken en Werkgelegenheid (SZW)                            1231\n",
       " Onderwijs, Cultuur en Wetenschappen (OCW)                         1187\n",
       " Volkshuisvesting, Ruimtelijke Ordening en Milieubeheer (VROM)      984\n",
       " Financiën (FIN)                                                    960\n",
       " Volksgezondheid, Welzijn en Sport                                  951\n",
       " Economische Zaken (EZ)                                             946\n",
       " Buitenlandse Zaken                                                 753\n",
       " Binnenlandse Zaken en Koninkrijksrelaties (BZK)                    725\n",
       " Verkeer en Waterstaat                                              724\n",
       " Defensie (DEF)                                                     646\n",
       " Sociale Zaken en Werkgelegenheid                                   607\n",
       " Landbouw, Natuurbeheer en Visserij (LNV)                           586\n",
       " Volkshuisvesting, Ruimtelijke Ordening en Milieubeheer             554\n",
       " Onderwijs, Cultuur en Wetenschappen                                532\n",
       " Vreemdelingenzaken en Integratie (VI)                              466\n",
       " Landbouw, Natuurbeheer en Visserij                                 440\n",
       " Landbouw, Natuur en Voedselkwaliteit (LNV)                         422\n",
       " Financiën                                                          409\n",
       " Binnenlandse Zaken                                                 389\n",
       " Economische Zaken                                                  337\n",
       " Defensie                                                           305\n",
       " Ontwikkelingssamenwerking (OS)                                     198\n",
       " Onderwijs, Cultuur en Wetenschap (OCW)                             195\n",
       " Algemene Zaken (AZ)                                                169\n",
       " Binnenlandse Zaken en Koninkrijksrelaties (BZK) Justitie (JUS)     168\n",
       "Name: ministerie, dtype: int64"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "kvrdf.ministerie.value_counts().head(30)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Exercises\n",
    " \n",
    "* Normalize the values for \"ministerie\" and choose 10 ministries to work with. Put these in a new column called `NormalizedMinisterie`.\n",
    "* Also pay close attention to leading and trailing spaces.\n",
    "      \n",
    "      "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "24824\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "Justitie                                                  4640\n",
       "Volksgezondheid, Welzijn en Sport                         3597\n",
       "Buitenlandse Zaken                                        2697\n",
       "Verkeer en Waterstaat                                     2178\n",
       "Sociale Zaken en Werkgelegenheid                          1861\n",
       "Onderwijs, Cultuur en Wetenschappen                       1730\n",
       "Volkshuisvesting, Ruimtelijke Ordening en Milieubeheer    1560\n",
       "Financiën                                                 1403\n",
       "Economische Zaken                                         1309\n",
       "Binnenlandse Zaken en Koninkrijksrelaties                 1241\n",
       "Landbouw, Natuurbeheer en Visserij                        1031\n",
       "Defensie                                                   963\n",
       "Vreemdelingenzaken en Integratie                           614\n",
       "Name: NormalizedMinisterie, dtype: int64"
      ]
     },
     "execution_count": 25,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# I chose 13, because the last three seemed especially interesting.\n",
    "\n",
    "topNChosenMinisteries = 13\n",
    "\n",
    "# strip the parenthesized abbreviation, e.g. ' (JUS)', and everything after it\n",
    "kvrdf['NormalizedMinisterie'] = kvrdf.ministerie.str.replace(r' *\\(.*', '')\n",
    "kvrdf['NormalizedMinisterie'] = kvrdf.NormalizedMinisterie.str.strip()  # remove leading and trailing spaces\n",
    "print(kvrdf['NormalizedMinisterie'].value_counts().head(topNChosenMinisteries).sum())\n",
    "kvrdf['NormalizedMinisterie'].value_counts().head(topNChosenMinisteries)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "40516\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "(24824, 9)"
      ]
     },
     "execution_count": 26,
     "metadata": {},
     "output_type": "execute_result"
    }
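   ],
   "source": [
    "# restrict to the top 13\n",
    "\n",
    "print(len(kvrdf))\n",
    "# .copy() makes the selection an independent DataFrame, so adding columns later does not raise SettingWithCopyWarning\n",
    "kvrdf = kvrdf[kvrdf.NormalizedMinisterie.isin(kvrdf['NormalizedMinisterie'].value_counts().head(topNChosenMinisteries).index)].copy()\n",
    "kvrdf.shape"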
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Tokenize\n",
    "\n",
    "1. Add a column `vocabulair` containing a list of all unique words that occur in the text columns of a parliamentary question.\n",
    "    * lowercase the words\n",
    "    * discard numbers and punctuation tokens\n",
    "    * leave out words that are in the Dutch stopword list `dutchstop`\n",
    "    * optionally, you can also lemmatize\n",
    "2. **Hint** \n",
    "    * First concatenate all the text into one string (be careful!)\n",
    "    * write a function that tokenizes and performs all the steps described above\n",
    "    * apply that function with `.apply` to the column containing all the text\n",
    "3. This can take a while, so test on a small number of rows, or on the small file."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'basisscholen'"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "lemma('basisschool')"
   ]
  },
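  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Concatenate all text columns into one field (note the separating spaces)\n",
    "kvrdf['fulltext'] = kvrdf.titel.astype(str) + ' ' + kvrdf.vraag.astype(str) + ' ' + kvrdf.antwoord.astype(str)\n",
    "\n",
    "def tokenize(text, lemmatize=False):\n",
    "    \"\"\"Return the unique lowercased words of `text`, without numbers, short tokens and stopwords.\"\"\"\n",
    "    text = str(text)\n",
    "    if isinstance(text, bytes):  # Python 2: decode byte strings to unicode\n",
    "        text = text.decode('utf-8')\n",
    "    tokens = set(s.lower() for s in nltk.word_tokenize(text)\n",
    "                 if re.match(r'^\\w+$', s, re.UNICODE)  # word characters only (drops punctuation)\n",
    "                 and len(s) > 2                         # drop tokens of 1 or 2 characters\n",
    "                 and not re.match(r'^\\d+$', s)         # drop pure numbers\n",
    "                 and s.lower() not in dutchstop)        # drop Dutch stopwords\n",
    "    if lemmatize:\n",
    "        # different word forms can share a lemma, so deduplicate again\n",
    "        return list(set(lemma(w) for w in tokens))\n",
    "    return list(tokens)\n",
    "\n",
    "# test on a few rows\n",
    "print(kvrdf.fulltext.head(5).apply(tokenize))\n",
    "\n",
    "kvrdf.fulltext.head(5).apply(lambda w: tokenize(w, lemmatize=True))"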
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "%time kvrdf['vocabulair']= kvrdf.fulltext.apply(tokenize)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "CPU times: user 4min 40s, sys: 3.42 s, total: 4min 43s\n",
      "Wall time: 4min 44s\n"
     ]
    }
   ],
   "source": [
    "%time kvrdf['lemmatized_vocabulair']= kvrdf.fulltext.apply(lambda w:tokenize(w,lemmatize=True))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "CPU times: user 44.2 s, sys: 433 ms, total: 44.6 s\n",
      "Wall time: 44.9 s\n"
     ]
    }
   ],
   "source": [
    "# Fast alternative \n",
    "def filter_voc(L):\n",
    "    tokens=[s for s in L if len(s) > 2\n",
    "            and not re.match(r'^\\d+$',s ) and \n",
    "         not s in dutchstop\n",
    "           ]\n",
    "    return list(set(tokens))\n",
    "%time tokens= (kvrdf.fulltext.str.lower().str.findall(r'\\b\\w+\\b')).apply(filter_voc)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "CPU times: user 12.9 s, sys: 208 ms, total: 13.1 s\n",
      "Wall time: 13.5 s\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/Users/admin/anaconda/lib/python2.7/site-packages/ipykernel/__main__.py:6: SettingWithCopyWarning: \n",
      "A value is trying to be set on a copy of a slice from a DataFrame.\n",
      "Try using .loc[row_indexer,col_indexer] = value instead\n",
      "\n",
      "See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "KVR1000.xml      [betrouwbaarheid, inderdaad, meter, bereiken, ...\n",
       " KVR10000.xml    [twijfels, bent, kennis, situatie, moeten, wel...\n",
       " KVR10002.xml    [gestaan, erom, volkskrant, wijze, protocollen...\n",
       " KVR10003.xml    [gezet, betrouwbaarheid, desbetreffende, waarn...\n",
       " KVR10004.xml    [rol, volkskrant, termijn, gepresenteerd, spoo...\n",
       "Name: fulltext, dtype: object"
      ]
     },
     "execution_count": 27,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Even faster\n",
    "# lowercase, remove pure digits, remove two-character tokens, remove duplicates, remove dutchstop\n",
    "\n",
    "%time tokens= (kvrdf.fulltext.str.lower().str.replace(r'\\b\\d+\\b','').str.replace(r'\\b\\w\\w\\b','').str.findall(r'\\b\\w+\\b')).apply(lambda l:([w for w in set(l) if w not in dutchstop]))\n",
    "\n",
    "kvrdf['vocabulair']=tokens\n",
    "tokens.head()\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Pattern's lemmatizer is unreliable for Dutch"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "#kvrdf.lemmatized_vocabulair.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Count words for each ministry\n",
    "\n",
    "* Use `Counter` to count, for each ministry and **for each word, the number of documents in which that word occurs.**\n",
    "* **Hints**\n",
    "    * Loop over all ministries\n",
    "    * for each ministry, take the `vocabulair` column and concatenate all lists into one list\n",
    "    * then count with `Counter`\n",
    "    * Keep only words that occur in at least 10 documents\n",
    "    * In the end, produce a dict with ministries as keys and their counters as values\n",
    "    * Turn this into a dataframe with `pd.DataFrame.from_dict`\n",
    "        * the words are the rows and the ministries are the columns\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "CPU times: user 4.2 s, sys: 99.5 ms, total: 4.3 s\n",
      "Wall time: 4.3 s\n",
      "Volksgezondheid, Welzijn en Sport 7671\n",
      "Defensie 2903\n",
      "Sociale Zaken en Werkgelegenheid 4978\n",
      "Justitie 8126\n",
      "Binnenlandse Zaken en Koninkrijksrelaties 3677\n",
      "Landbouw, Natuurbeheer en Visserij 3184\n",
      "Financiën 4167\n",
      "Volkshuisvesting, Ruimtelijke Ordening en Milieubeheer 4802\n",
      "Buitenlandse Zaken 6030\n",
      "Onderwijs, Cultuur en Wetenschappen 4581\n",
      "Vreemdelingenzaken en Integratie 2101\n",
      "Verkeer en Waterstaat 5634\n",
      "Economische Zaken 4225\n"
     ]
    }
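   ],
   "source": [
    "def makeCounters(Vocabulair, drempelwaarde=10):\n",
    "    \"\"\"Per ministry, count in how many documents each word occurs; keep words found in at least `drempelwaarde` documents.\"\"\"\n",
    "    II = {}\n",
    "    for mi in set(kvrdf.NormalizedMinisterie.values):\n",
    "        # all vocabulary lists of the documents addressed to this ministry\n",
    "        A = kvrdf[kvrdf.NormalizedMinisterie == mi][Vocabulair].values\n",
    "        # each list holds unique words, so this counts documents, not occurrences\n",
    "        M = Counter(w for l in A for w in l)\n",
    "        Minimaal10 = {v: M[v] for v in M if M[v] >= drempelwaarde}\n",
    "        II[mi] = Minimaal10\n",
    "    return II\n",
    "\n",
    "%time II = makeCounters('vocabulair')\n",
    "\n",
    "for mi in II:\n",
    "    print('%s %d' % (mi, len(II[mi])))"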
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "CPU times: user 109 ms, sys: 3.05 ms, total: 113 ms\n",
      "Wall time: 113 ms\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/Users/admin/anaconda/lib/python2.7/site-packages/numpy/lib/function_base.py:3834: RuntimeWarning: Invalid value encountered in percentile\n",
      "  RuntimeWarning)\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Binnenlandse Zaken en Koninkrijksrelaties</th>\n",
       "      <th>Buitenlandse Zaken</th>\n",
       "      <th>Defensie</th>\n",
       "      <th>Economische Zaken</th>\n",
       "      <th>Financiën</th>\n",
       "      <th>Justitie</th>\n",
       "      <th>Landbouw, Natuurbeheer en Visserij</th>\n",
       "      <th>Onderwijs, Cultuur en Wetenschappen</th>\n",
       "      <th>Sociale Zaken en Werkgelegenheid</th>\n",
       "      <th>Verkeer en Waterstaat</th>\n",
       "      <th>Volksgezondheid, Welzijn en Sport</th>\n",
       "      <th>Volkshuisvesting, Ruimtelijke Ordening en Milieubeheer</th>\n",
       "      <th>Vreemdelingenzaken en Integratie</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>count</th>\n",
       "      <td>3677.000000</td>\n",
       "      <td>6030.000000</td>\n",
       "      <td>2903.000000</td>\n",
       "      <td>4225.000000</td>\n",
       "      <td>4167.000000</td>\n",
       "      <td>8126.000000</td>\n",
       "      <td>3184.000000</td>\n",
       "      <td>4581.000000</td>\n",
       "      <td>4978.000000</td>\n",
       "      <td>5634.000000</td>\n",
       "      <td>7671.000000</td>\n",
       "      <td>4802.000000</td>\n",
       "      <td>2101.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>mean</th>\n",
       "      <td>50.447919</td>\n",
       "      <td>70.073632</td>\n",
       "      <td>42.026180</td>\n",
       "      <td>53.996923</td>\n",
       "      <td>54.662827</td>\n",
       "      <td>90.829190</td>\n",
       "      <td>47.155779</td>\n",
       "      <td>59.431347</td>\n",
       "      <td>65.586983</td>\n",
       "      <td>66.653355</td>\n",
       "      <td>87.797810</td>\n",
       "      <td>59.704082</td>\n",
       "      <td>38.621133</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>std</th>\n",
       "      <td>74.609997</td>\n",
       "      <td>132.822490</td>\n",
       "      <td>57.124467</td>\n",
       "      <td>81.939572</td>\n",
       "      <td>82.836844</td>\n",
       "      <td>197.824251</td>\n",
       "      <td>63.995373</td>\n",
       "      <td>98.232193</td>\n",
       "      <td>109.091074</td>\n",
       "      <td>118.804579</td>\n",
       "      <td>181.457338</td>\n",
       "      <td>95.210070</td>\n",
       "      <td>45.112760</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>min</th>\n",
       "      <td>10.000000</td>\n",
       "      <td>10.000000</td>\n",
       "      <td>10.000000</td>\n",
       "      <td>10.000000</td>\n",
       "      <td>10.000000</td>\n",
       "      <td>10.000000</td>\n",
       "      <td>10.000000</td>\n",
       "      <td>10.000000</td>\n",
       "      <td>10.000000</td>\n",
       "      <td>10.000000</td>\n",
       "      <td>10.000000</td>\n",
       "      <td>10.000000</td>\n",
       "      <td>10.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>25%</th>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>50%</th>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>75%</th>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>max</th>\n",
       "      <td>1106.000000</td>\n",
       "      <td>2184.000000</td>\n",
       "      <td>766.000000</td>\n",
       "      <td>1057.000000</td>\n",
       "      <td>1089.000000</td>\n",
       "      <td>3808.000000</td>\n",
       "      <td>762.000000</td>\n",
       "      <td>1357.000000</td>\n",
       "      <td>1501.000000</td>\n",
       "      <td>1703.000000</td>\n",
       "      <td>2965.000000</td>\n",
       "      <td>1236.000000</td>\n",
       "      <td>518.000000</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "       Binnenlandse Zaken en Koninkrijksrelaties  Buitenlandse Zaken  \\\n",
       "count                                3677.000000         6030.000000   \n",
       "mean                                   50.447919           70.073632   \n",
       "std                                    74.609997          132.822490   \n",
       "min                                    10.000000           10.000000   \n",
       "25%                                          NaN                 NaN   \n",
       "50%                                          NaN                 NaN   \n",
       "75%                                          NaN                 NaN   \n",
       "max                                  1106.000000         2184.000000   \n",
       "\n",
       "          Defensie  Economische Zaken    Financiën     Justitie  \\\n",
       "count  2903.000000        4225.000000  4167.000000  8126.000000   \n",
       "mean     42.026180          53.996923    54.662827    90.829190   \n",
       "std      57.124467          81.939572    82.836844   197.824251   \n",
       "min      10.000000          10.000000    10.000000    10.000000   \n",
       "25%            NaN                NaN          NaN          NaN   \n",
       "50%            NaN                NaN          NaN          NaN   \n",
       "75%            NaN                NaN          NaN          NaN   \n",
       "max     766.000000        1057.000000  1089.000000  3808.000000   \n",
       "\n",
       "       Landbouw, Natuurbeheer en Visserij  \\\n",
       "count                         3184.000000   \n",
       "mean                            47.155779   \n",
       "std                             63.995373   \n",
       "min                             10.000000   \n",
       "25%                                   NaN   \n",
       "50%                                   NaN   \n",
       "75%                                   NaN   \n",
       "max                            762.000000   \n",
       "\n",
       "       Onderwijs, Cultuur en Wetenschappen  Sociale Zaken en Werkgelegenheid  \\\n",
       "count                          4581.000000                       4978.000000   \n",
       "mean                             59.431347                         65.586983   \n",
       "std                              98.232193                        109.091074   \n",
       "min                              10.000000                         10.000000   \n",
       "25%                                    NaN                               NaN   \n",
       "50%                                    NaN                               NaN   \n",
       "75%                                    NaN                               NaN   \n",
       "max                            1357.000000                       1501.000000   \n",
       "\n",
       "       Verkeer en Waterstaat  Volksgezondheid, Welzijn en Sport  \\\n",
       "count            5634.000000                        7671.000000   \n",
       "mean               66.653355                          87.797810   \n",
       "std               118.804579                         181.457338   \n",
       "min                10.000000                          10.000000   \n",
       "25%                      NaN                                NaN   \n",
       "50%                      NaN                                NaN   \n",
       "75%                      NaN                                NaN   \n",
       "max              1703.000000                        2965.000000   \n",
       "\n",
       "       Volkshuisvesting, Ruimtelijke Ordening en Milieubeheer  \\\n",
       "count                                        4802.000000        \n",
       "mean                                           59.704082        \n",
       "std                                            95.210070        \n",
       "min                                            10.000000        \n",
       "25%                                                  NaN        \n",
       "50%                                                  NaN        \n",
       "75%                                                  NaN        \n",
       "max                                          1236.000000        \n",
       "\n",
       "       Vreemdelingenzaken en Integratie  \n",
       "count                       2101.000000  \n",
       "mean                          38.621133  \n",
       "std                           45.112760  \n",
       "min                           10.000000  \n",
       "25%                                 NaN  \n",
       "50%                                 NaN  \n",
       "75%                                 NaN  \n",
       "max                          518.000000  "
      ]
     },
     "execution_count": 29,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Turn it into a dataframe \n",
    "%time IIdf=pd.DataFrame.from_dict(II)#.fillna(0)\n",
    "IIdf.describe()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Binnenlandse Zaken en Koninkrijksrelaties</th>\n",
       "      <th>Buitenlandse Zaken</th>\n",
       "      <th>Defensie</th>\n",
       "      <th>Economische Zaken</th>\n",
       "      <th>Financiën</th>\n",
       "      <th>Justitie</th>\n",
       "      <th>Landbouw, Natuurbeheer en Visserij</th>\n",
       "      <th>Onderwijs, Cultuur en Wetenschappen</th>\n",
       "      <th>Sociale Zaken en Werkgelegenheid</th>\n",
       "      <th>Verkeer en Waterstaat</th>\n",
       "      <th>Volksgezondheid, Welzijn en Sport</th>\n",
       "      <th>Volkshuisvesting, Ruimtelijke Ordening en Milieubeheer</th>\n",
       "      <th>Vreemdelingenzaken en Integratie</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>zwitserland</th>\n",
       "      <td>NaN</td>\n",
       "      <td>22.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>11.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>20.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>14.0</td>\n",
       "      <td>18.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>zwitserse</th>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>11.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>zwolle</th>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>60.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>14.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>32.0</td>\n",
       "      <td>33.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>zwolse</th>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>11.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>12.0</td>\n",
       "      <td>10.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>zzp</th>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>20.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "             Binnenlandse Zaken en Koninkrijksrelaties  Buitenlandse Zaken  \\\n",
       "zwitserland                                        NaN                22.0   \n",
       "zwitserse                                          NaN                 NaN   \n",
       "zwolle                                             NaN                 NaN   \n",
       "zwolse                                             NaN                 NaN   \n",
       "zzp                                                NaN                 NaN   \n",
       "\n",
       "             Defensie  Economische Zaken  Financiën  Justitie  \\\n",
       "zwitserland       NaN               11.0        NaN      20.0   \n",
       "zwitserse         NaN                NaN        NaN      11.0   \n",
       "zwolle            NaN                NaN        NaN      60.0   \n",
       "zwolse            NaN                NaN        NaN      11.0   \n",
       "zzp               NaN                NaN        NaN       NaN   \n",
       "\n",
       "             Landbouw, Natuurbeheer en Visserij  \\\n",
       "zwitserland                                 NaN   \n",
       "zwitserse                                   NaN   \n",
       "zwolle                                      NaN   \n",
       "zwolse                                      NaN   \n",
       "zzp                                         NaN   \n",
       "\n",
       "             Onderwijs, Cultuur en Wetenschappen  \\\n",
       "zwitserland                                  NaN   \n",
       "zwitserse                                    NaN   \n",
       "zwolle                                      14.0   \n",
       "zwolse                                       NaN   \n",
       "zzp                                          NaN   \n",
       "\n",
       "             Sociale Zaken en Werkgelegenheid  Verkeer en Waterstaat  \\\n",
       "zwitserland                               NaN                   14.0   \n",
       "zwitserse                                 NaN                    NaN   \n",
       "zwolle                                    NaN                   32.0   \n",
       "zwolse                                    NaN                   12.0   \n",
       "zzp                                      20.0                    NaN   \n",
       "\n",
       "             Volksgezondheid, Welzijn en Sport  \\\n",
       "zwitserland                               18.0   \n",
       "zwitserse                                  NaN   \n",
       "zwolle                                    33.0   \n",
       "zwolse                                    10.0   \n",
       "zzp                                        NaN   \n",
       "\n",
       "             Volkshuisvesting, Ruimtelijke Ordening en Milieubeheer  \\\n",
       "zwitserland                                                NaN        \n",
       "zwitserse                                                  NaN        \n",
       "zwolle                                                     NaN        \n",
       "zwolse                                                     NaN        \n",
       "zzp                                                        NaN        \n",
       "\n",
       "             Vreemdelingenzaken en Integratie  \n",
       "zwitserland                               NaN  \n",
       "zwitserse                                 NaN  \n",
       "zwolle                                    NaN  \n",
       "zwolse                                    NaN  \n",
       "zzp                                       NaN  "
      ]
     },
     "execution_count": 30,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "IIdf.tail()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# How many words do we actually have?\n",
    "\n",
    "Just over 4 million tokens and about 15K unique words, spread over almost 25,000 documents and 13 ministries."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "4089793.0\n",
      "(14873, 13)\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "Binnenlandse Zaken en Koninkrijksrelaties                 185497.0\n",
       "Buitenlandse Zaken                                        422544.0\n",
       "Defensie                                                  122002.0\n",
       "Economische Zaken                                         228137.0\n",
       "Financiën                                                 227780.0\n",
       "Justitie                                                  738078.0\n",
       "Landbouw, Natuurbeheer en Visserij                        150144.0\n",
       "Onderwijs, Cultuur en Wetenschappen                       272255.0\n",
       "Sociale Zaken en Werkgelegenheid                          326492.0\n",
       "Verkeer en Waterstaat                                     375525.0\n",
       "Volksgezondheid, Welzijn en Sport                         673497.0\n",
       "Volkshuisvesting, Ruimtelijke Ordening en Milieubeheer    286699.0\n",
       "Vreemdelingenzaken en Integratie                           81143.0\n",
       "dtype: float64"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "\n",
    "\n",
    "print IIdf.sum().sum()\n",
    "print IIdf.shape\n",
    "IIdf.sum()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Which words occur most often?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "vragen        20052.0\n",
       "bent          15988.0\n",
       "welke         15361.0\n",
       "aanleiding    13285.0\n",
       "zoals         12865.0\n",
       "bereid        11655.0\n",
       "mening        11650.0\n",
       "ten           11298.0\n",
       "wel           10897.0\n",
       "mogelijk      10559.0\n",
       "dtype: float64"
      ]
     },
     "execution_count": 31,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "IIdf.sum(axis=1).sort_values(ascending=False).head(10)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Mutual information\n",
    "\n",
    "Define the function $I(U,C)$ as in equation 13.16 of the IR book. It is of course much easier to implement equation 13.17, which expresses the same quantity in counts.\n",
    "* Define all the components $N, N_{11}, N_{1.}$, etc. inside the function.\n"
   ]
  },
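  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For reference, equation 13.17 of the IR book, written out here as a sketch from the book's definition. $N_{11}$ counts the cases where the term is present and the class is $C$, $N_{10}$ those where the term is present outside $C$, and so on; $N_{1.} = N_{10} + N_{11}$ and $N_{.1} = N_{01} + N_{11}$ are marginal totals, and $N$ is the grand total.\n",
    "\n",
    "$$\n",
    "\\begin{aligned}\n",
    "I(U;C) = {} & \\frac{N_{11}}{N}\\log_2\\frac{N N_{11}}{N_{1.}N_{.1}} + \\frac{N_{01}}{N}\\log_2\\frac{N N_{01}}{N_{0.}N_{.1}} \\\\\n",
    " & + \\frac{N_{10}}{N}\\log_2\\frac{N N_{10}}{N_{1.}N_{.0}} + \\frac{N_{00}}{N}\\log_2\\frac{N N_{00}}{N_{0.}N_{.0}}\n",
    "\\end{aligned}\n",
    "$$\n",
    "\n",
    "A cell with a zero count contributes nothing, following the usual convention $0 \\log_2 0 = 0$."
   ]
  },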
  {
   "cell_type": "code",
   "execution_count": 32,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(7.5381481637955215e-05, 0)"
      ]
     },
     "execution_count": 32,
     "metadata": {},
     "output_type": "execute_result"
    }
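   ],
   "source": [
    "# This could be done far more neatly with linear algebra on a small dataframe\n",
    "\n",
    "def h(cel,sumV,sumH,N):\n",
    "    '''One term of equation 13.17: cel is a cell count, sumV and sumH are its two marginal totals.'''\n",
    "    if cel == 0:\n",
    "        return 0.0  # convention: 0*log2(0) = 0, instead of NaN\n",
    "    return (cel/N)*log2((N*cel)/(sumV*sumH))\n",
    "\n",
    "\n",
    "def I(U,C,df):\n",
    "    '''Mutual information between term U and class C, from the term-by-class count table df.'''\n",
    "    try:\n",
    "        N= df.sum().sum()  # grand total, taken from df rather than the global IIdf\n",
    "        N11= df.loc[U][C]\n",
    "        N1p= df.loc[U].sum()\n",
    "        Np1= df[C].sum()\n",
    "        N10= N1p-N11\n",
    "        N01= Np1-N11\n",
    "        N0p= N-N1p\n",
    "        Np0= N-Np1\n",
    "        N00= N0p-N01\n",
    "        return    sum([ h(N11,N1p,Np1,N),\n",
    "                      h(N10,N1p,Np0,N),\n",
    "                      h(N01,N0p,Np1,N),\n",
    "                      h(N00,N0p,Np0,N)\n",
    "                       ]\n",
    "                     )\n",
    "    except KeyError:\n",
    "        # U is not in the vocabulary (or C is not a known class)\n",
    "        return 0\n",
    "\n",
    "# test: a known word, and a misspelled word that should score 0\n",
    "I('misdrijf','Justitie',IIdf), I('misdrij','Justitie',IIdf)"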
   ]
  },
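  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The comment in the cell above notes that this could be done more neatly with linear algebra. A minimal sketch of that idea follows, under stated assumptions: the name `mutual_information_for_class` and its argument `counts` are hypothetical, `counts` is a term-by-class count table shaped like `IIdf`, missing counts are treated as 0, and $0 \\log_2 0$ is taken to be 0. It scores all terms for one class at once instead of calling `I` once per word.\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "\n",
    "\n",
    "def mutual_information_for_class(counts, C):\n",
    "    '''Equation 13.17 for every term (row) of counts against class (column) C.'''\n",
    "    counts = counts.fillna(0)        # assumption: a missing count means 0 occurrences\n",
    "    N = counts.values.sum()          # grand total\n",
    "    N11 = counts[C]                  # term present, class C\n",
    "    N1p = counts.sum(axis=1)         # row totals per term\n",
    "    Np1 = N11.sum()                  # column total of class C\n",
    "    N10, N01 = N1p - N11, Np1 - N11\n",
    "    N0p, Np0 = N - N1p, N - Np1\n",
    "    N00 = N0p - N01\n",
    "\n",
    "    def h(cell, row_total, col_total):\n",
    "        # one term of the sum; cells with a zero count contribute 0\n",
    "        with np.errstate(divide='ignore', invalid='ignore'):\n",
    "            term = (cell / N) * np.log2(N * cell / (row_total * col_total))\n",
    "        return term.where(cell > 0, 0.0)\n",
    "\n",
    "    return (h(N11, N1p, Np1) + h(N10, N1p, Np0) +\n",
    "            h(N01, N0p, Np1) + h(N00, N0p, Np0))\n",
    "\n",
    "\n",
    "# toy table in which term and class are independent, so every score is 0\n",
    "toy = pd.DataFrame({'A': [10.0, 20.0], 'B': [20.0, 40.0]}, index=['x', 'y'])\n",
    "scores = mutual_information_for_class(toy, 'A')\n",
    "```\n",
    "\n",
    "On the notebook's data, `mutual_information_for_class(IIdf, 'Justitie').sort_values(ascending=False).head(20)` should give a ranking comparable to the per-word loop, but not necessarily identical: this sketch also scores terms that never occur in the class."
   ]
  },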
  {
   "cell_type": "code",
   "execution_count": 33,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "CPU times: user 12.1 s, sys: 191 ms, total: 12.3 s\n",
      "Wall time: 12.5 s\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Mutual Information</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>justitie</th>\n",
       "      <td>0.000401</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>politie</th>\n",
       "      <td>0.000245</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>openbaar</th>\n",
       "      <td>0.000213</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>verdachte</th>\n",
       "      <td>0.000201</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>procureurs</th>\n",
       "      <td>0.000183</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>strafbare</th>\n",
       "      <td>0.000182</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>vervolging</th>\n",
       "      <td>0.000179</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>wetboek</th>\n",
       "      <td>0.000163</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>strafrechtelijk</th>\n",
       "      <td>0.000160</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>strafvordering</th>\n",
       "      <td>0.000154</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>officier</th>\n",
       "      <td>0.000144</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>verdachten</th>\n",
       "      <td>0.000139</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>strafrecht</th>\n",
       "      <td>0.000134</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>feiten</th>\n",
       "      <td>0.000131</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>opsporing</th>\n",
       "      <td>0.000128</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>rechter</th>\n",
       "      <td>0.000128</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>justiti</th>\n",
       "      <td>0.000125</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>strafzaak</th>\n",
       "      <td>0.000120</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>strafrechtelijke</th>\n",
       "      <td>0.000119</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>strafbaar</th>\n",
       "      <td>0.000118</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                  Mutual Information\n",
       "justitie                    0.000401\n",
       "politie                     0.000245\n",
       "openbaar                    0.000213\n",
       "verdachte                   0.000201\n",
       "procureurs                  0.000183\n",
       "strafbare                   0.000182\n",
       "vervolging                  0.000179\n",
       "wetboek                     0.000163\n",
       "strafrechtelijk             0.000160\n",
       "strafvordering              0.000154\n",
       "officier                    0.000144\n",
       "verdachten                  0.000139\n",
       "strafrecht                  0.000134\n",
       "feiten                      0.000131\n",
       "opsporing                   0.000128\n",
       "rechter                     0.000128\n",
       "justiti                     0.000125\n",
       "strafzaak                   0.000120\n",
       "strafrechtelijke            0.000119\n",
       "strafbaar                   0.000118"
      ]
     },
     "execution_count": 33,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "def TopNMutualInformationWords(C,n):\n",
    "    '''Give the top N words with the highest mutual information score for class C'''\n",
    "    df= pd.DataFrame.from_dict({w:I(w,C,IIdf) for w in IIdf[C].dropna().index}, orient='index')\n",
    "    df.sort_values(0,ascending=False, inplace=True)\n",
    "    df.columns= ['Mutual Information']\n",
    "    return df.head(n)\n",
    " \n",
    "%time TopNMutualInformationWords('Justitie',20)\n",
    " "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Defensie \n",
      "             Mutual Information\n",
      "defensie               0.000516\n",
      "militairen             0.000293\n",
      "militaire              0.000220\n",
      "koninklijke            0.000212\n",
      "krijgsmacht            0.000211\n",
      "militair               0.000203\n",
      "commandant             0.000145\n",
      "luchtmacht             0.000117\n",
      "eenheden               0.000102\n",
      "marine                 0.000096 \n",
      "\n",
      "Sociale Zaken en Werkgelegenheid \n",
      "            Mutual Information\n",
      "sociale               0.000314\n",
      "werknemers            0.000304\n",
      "arbeid                0.000258\n",
      "werkgevers            0.000256\n",
      "szw                   0.000245\n",
      "wao                   0.000227\n",
      "uitkering             0.000216\n",
      "werkgever             0.000191\n",
      "werknemer             0.000156\n",
      "uwv                   0.000150 \n",
      "\n",
      "Justitie \n",
      "                 Mutual Information\n",
      "justitie                   0.000401\n",
      "politie                    0.000245\n",
      "openbaar                   0.000213\n",
      "verdachte                  0.000201\n",
      "procureurs                 0.000183\n",
      "strafbare                  0.000182\n",
      "vervolging                 0.000179\n",
      "wetboek                    0.000163\n",
      "strafrechtelijk            0.000160\n",
      "strafvordering             0.000154 \n",
      "\n",
      "Economische Zaken \n",
      "                  Mutual Information\n",
      "economische                 0.000145\n",
      "bedrijven                   0.000105\n",
      "elektriciteit               0.000100\n",
      "energiebedrijven            0.000092\n",
      "energie                     0.000088\n",
      "nma                         0.000086\n",
      "markt                       0.000071\n",
      "gas                         0.000068\n",
      "afnemers                    0.000068\n",
      "mededingingswet             0.000065 \n",
      "\n",
      "Binnenlandse Zaken en Koninkrijksrelaties \n",
      "                     Mutual Information\n",
      "politie                        0.000147\n",
      "korpsen                        0.000127\n",
      "koninkrijksrelaties            0.000106\n",
      "bzk                            0.000097\n",
      "burgemeester                   0.000091\n",
      "korps                          0.000084\n",
      "korpsbeheerder                 0.000083\n",
      "politiekorpsen                 0.000081\n",
      "binnenlandse                   0.000078\n",
      "agenten                        0.000075 \n",
      "\n",
      "Landbouw, Natuurbeheer en Visserij \n",
      "              Mutual Information\n",
      "dieren                  0.000235\n",
      "landbouw                0.000178\n",
      "lnv                     0.000163\n",
      "agrarisch               0.000155\n",
      "visserij                0.000149\n",
      "natuurbeheer            0.000148\n",
      "aid                     0.000116\n",
      "vlees                   0.000113\n",
      "vee                     0.000088\n",
      "boeren                  0.000084 \n",
      "\n",
      "Financiën \n",
      "                     Mutual Information\n",
      "belastingdienst                0.000241\n",
      "fiscale                        0.000176\n",
      "belastingplichtigen            0.000139\n",
      "inkomstenbelasting             0.000138\n",
      "bank                           0.000117\n",
      "belastingplichtige             0.000110\n",
      "fiscaal                        0.000102\n",
      "heffing                        0.000087\n",
      "belastingheffing               0.000084\n",
      "financi                        0.000082 \n",
      "\n",
      "Volkshuisvesting, Ruimtelijke Ordening en Milieubeheer \n",
      "                  Mutual Information\n",
      "vrom                        0.000295\n",
      "afval                       0.000167\n",
      "woningen                    0.000158\n",
      "milieubeheer                0.000149\n",
      "milieu                      0.000147\n",
      "ruimtelijke                 0.000139\n",
      "volkshuisvesting            0.000127\n",
      "ordening                    0.000122\n",
      "huurders                    0.000121\n",
      "afvalstoffen                0.000114 \n",
      "\n",
      "Buitenlandse Zaken \n",
      "                       Mutual Information\n",
      "regering                         0.000602\n",
      "autoriteiten                     0.000337\n",
      "mensenrechten                    0.000306\n",
      "bilateraal                       0.000273\n",
      "ambassadeur                      0.000245\n",
      "politieke                        0.000244\n",
      "dialoog                          0.000231\n",
      "buitenlandse                     0.000223\n",
      "president                        0.000205\n",
      "mensenrechtensituatie            0.000201 \n",
      "\n",
      "Onderwijs, Cultuur en Wetenschappen \n",
      "                Mutual Information\n",
      "onderwijs                 0.000641\n",
      "scholen                   0.000460\n",
      "leerlingen                0.000421\n",
      "school                    0.000396\n",
      "ocw                       0.000191\n",
      "studenten                 0.000165\n",
      "schooljaar                0.000160\n",
      "voortgezet                0.000152\n",
      "cultuur                   0.000146\n",
      "basisonderwijs            0.000140 \n",
      "\n",
      "Vreemdelingenzaken en Integratie \n",
      "                     Mutual Information\n",
      "ind                            0.000202\n",
      "asielzoekers                   0.000181\n",
      "verblijfsvergunning            0.000148\n",
      "vreemdeling                    0.000141\n",
      "uitzetting                     0.000138\n",
      "vreemdelingen                  0.000134\n",
      "terugkeer                      0.000104\n",
      "verblijf                       0.000090\n",
      "uitgeprocedeerde               0.000089\n",
      "asielzoeker                    0.000076 \n",
      "\n",
      "Verkeer en Waterstaat \n",
      "                    Mutual Information\n",
      "verkeer                       0.000343\n",
      "waterstaat                    0.000317\n",
      "vervoer                       0.000215\n",
      "rijkswaterstaat               0.000196\n",
      "reizigers                     0.000190\n",
      "spoor                         0.000169\n",
      "verkeersveiligheid            0.000135\n",
      "trein                         0.000133\n",
      "aanleg                        0.000132\n",
      "rijden                        0.000120 \n",
      "\n",
      "Volksgezondheid, Welzijn en Sport \n",
      "                 Mutual Information\n",
      "pati                       0.000555\n",
      "nten                       0.000511\n",
      "gezondheidszorg            0.000411\n",
      "zorg                       0.000335\n",
      "ziekenhuizen               0.000319\n",
      "vws                        0.000290\n",
      "medisch                    0.000262\n",
      "ziekenhuis                 0.000235\n",
      "awbz                       0.000217\n",
      "huisartsen                 0.000185 \n",
      "\n"
     ]
    }
   ],
   "source": [
    "for mi in list(set(kvrdf.NormalizedMinisterie.values)):\n",
    "    print mi,'\\n', TopNMutualInformationWords(mi,10),'\\n'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Try out: collocations of bigrams\n",
    "\n",
    "See <http://www.nltk.org/howto/collocations.html>\n",
    "\n",
    " "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 35,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "mi='Justitie'\n",
    "A= kvrdf[kvrdf.NormalizedMinisterie==mi].fulltext \n",
    "A_all= ' '.join(A.values).decode('utf-8')\n",
    "tokens= nltk.tokenize.wordpunct_tokenize(A_all)\n",
    "B = nltk.Text(tokens)\n",
    "\n",
    "\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 36,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "van het; Openbaar Ministerie; naar aanleiding; waarom niet; openbaar\n",
      "ministerie; ten aanzien; Aanhangsel Handelingen; mening dat;\n",
      "betrekking tot; van een; Tweede Kamer; van van; Vragen naar; kan\n",
      "worden; aanleiding van; met betrekking; strafbare feiten; het\n",
      "Openbaar; feit dat; dan wel\n"
     ]
    }
   ],
   "source": [
    "# does not work very well, because it does not filter out Dutch stop words\n",
    "B.collocations()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 37,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "from nltk.collocations import BigramCollocationFinder\n",
    "bigram_measures = nltk.collocations.BigramAssocMeasures()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 38,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "word_fd = nltk.FreqDist(tokens)\n",
    "bigram_fd = nltk.FreqDist(nltk.bigrams(tokens))\n",
    "finder = BigramCollocationFinder(word_fd, bigram_fd)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 39,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "finder.apply_freq_filter(100)\n",
    "finder.apply_word_filter(lambda w: w in dutchstop or len(w)<3 or re.search(r'^\\W+$',w))\n",
    "scored = finder.score_ngrams(bigram_measures.pmi)"
   ]
  },
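  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `pmi` measure used above is pointwise mutual information. As a sketch of the standard definition (the symbols $c(\\cdot)$ and $N$ are introduced here for illustration only), a bigram $(w_1, w_2)$ is scored as\n",
    "\n",
    "$$\\mathrm{PMI}(w_1, w_2) = \\log_2\\frac{P(w_1, w_2)}{P(w_1)\\,P(w_2)} = \\log_2\\frac{N\\,c(w_1, w_2)}{c(w_1)\\,c(w_2)}$$\n",
    "\n",
    "where $c(\\cdot)$ is a frequency in the text and $N$ is the total number of tokens. A score of about 14.3, for example, means the pair occurs roughly $2^{14.3} \\approx 20{,}000$ times more often than it would if the two words were independent. PMI favours rare pairs, which is why a frequency filter of 100 is applied before scoring."
   ]
  },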
  {
   "cell_type": "code",
   "execution_count": 40,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[((u'Sri', u'Lanka'), 14.334247343063975),\n",
       " ((u'Verenigd', u'Koninkrijk'), 13.770043402509613),\n",
       " ((u'Ter', u'Apel'), 13.120122537711127),\n",
       " ((u'persoonlijke', u'levenssfeer'), 13.014957486638092),\n",
       " ((u'Koninklijke', u'Marechaussee'), 12.6041932359635),\n",
       " ((u'Dienst', u'Justiti'), 12.55619998538745),\n",
       " ((u'Holland', u'Casino'), 12.27312889617636),\n",
       " ((u'centrale', u'autoriteit'), 12.21993631683266),\n",
       " ((u'NRC', u'Handelsblad'), 12.20586418610441),\n",
       " ((u'Verenigde', u'Staten'), 12.171037334864451),\n",
       " ((u'Nationale', u'Recherche'), 12.13859977696005),\n",
       " ((u'alleenstaande', u'minderjarige'), 12.029608731798323),\n",
       " ((u'voorlopige', u'hechtenis'), 11.992737081233336),\n",
       " ((u'seksueel', u'misbruik'), 11.992411369522582),\n",
       " ((u'Den', u'Haag'), 11.8970289893151),\n",
       " ((u'Den', u'Bosch'), 11.887777649381626),\n",
       " ((u'kort', u'geding'), 11.755647446233333),\n",
       " ((u'rechterlijke', u'macht'), 11.752291188474594),\n",
       " ((u'Indiener', u'vraagt'), 11.635066411357236),\n",
       " ((u'Algemeen', u'Overleg'), 11.607257418406245),\n",
       " ((u'inzage', u'gelegd'), 11.465888519708379),\n",
       " ((u'burgerlijke', u'stand'), 11.424562969771415),\n",
       " ((u'Buitenlandse', u'Zaken'), 11.341850018123964),\n",
       " ((u'penitentiaire', u'inrichtingen'), 11.302398477024632),\n",
       " ((u'Binnenlandse', u'Zaken'), 11.278904271582267),\n",
       " ((u'Algemeen', u'Dagblad'), 11.23272804915242),\n",
       " ((u'Burgerlijk', u'Wetboek'), 11.223330534825525),\n",
       " ((u'hoger', u'beroep'), 11.09942481724375),\n",
       " ((u'Europese', u'Unie'), 11.017523721181668),\n",
       " ((u'Economische', u'Zaken'), 10.880456382057698)]"
      ]
     },
     "execution_count": 40,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "\n",
    "scored[:30]\n",
    "#sorted(bigram for bigram, score in scored )"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "anaconda-cloud": {},
  "kernelspec": {
   "display_name": "Python [default]",
   "language": "python",
   "name": "python2"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.12"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}