{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Text classification with Naive Bayes : feature selection with mutual information\n", "\n", "## Notebook made by \n", "\n", "|** Name** | **Student id** | **email**|\n", "|:- |:-|:-|\n", "|. | | |\n", "| | |. |\n", "\n", "### Pledge (taken from [Coursera's Honor Code](https://www.coursera.org/about/terms/honorcode) )\n", "\n", "\n", "\n", "Put here a selfie with your photo where you hold a signed paper with the following text: (if this is team work, put two selfies here). The link must be to some place on the web, not to a local file. \n", "\n", "> My answers to homework, quizzes and exams will be my own work (except for assignments that explicitly permit collaboration).\n", "\n", ">I will not make solutions to homework, quizzes or exams available to anyone else. This includes both solutions written by me, as well as any official solutions provided by the course staff.\n", "\n", ">I will not engage in any other activities that will dishonestly improve my results or dishonestly improve/hurt the results of others.\n", "\n", "\n", "\n", "### Note\n", "* **Assignments without the selfies or completely filled in information will not be graded and receive 0 points.**\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Text classification with Naive Bayes \n", " \n", " \n", " \n", "

Abstract

\n", "

We will do **text classification** on a collection of Dutch parliamentary questions.\n", "

\n", "

In this notebook we restrict ourselves to computing the **mutual information** scores, as described in the IR book.

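As a rough sketch of what a mutual information score measures, the snippet below computes it from the four document counts of a 2x2 term/class contingency table. The function name `mutual_information`, its argument names, and the counts in the usage lines are assumptions made up for this illustration; they are not part of the notebook's own code.

```python
from math import log


def mutual_information(n11, n10, n01, n00):
    '''Mutual information (in bits) of a 2x2 term/class contingency table.

    n11: documents that contain the term and belong to the class
    n10: documents that contain the term but are outside the class
    n01: documents without the term that belong to the class
    n00: documents without the term and outside the class
    '''
    n = float(n11 + n10 + n01 + n00)
    # (cell count, total of its term row, total of its class column)
    cells = [
        (n11, n11 + n10, n11 + n01),
        (n10, n11 + n10, n10 + n00),
        (n01, n01 + n00, n11 + n01),
        (n00, n01 + n00, n10 + n00),
    ]
    total = 0.0
    for cell, row_sum, col_sum in cells:
        if cell > 0:  # an empty cell contributes nothing (0 * log 0 is taken as 0)
            total += (cell / n) * log(n * cell / (row_sum * col_sum), 2)
    return total


# hypothetical counts: a term that is independent of the class
print(mutual_information(10, 10, 10, 10))
# hypothetical counts: a term that occurs in exactly the documents of the class
print(mutual_information(50, 0, 0, 50))
```

A score of zero means the term says nothing about the class; the more the presence or absence of the term tells you about class membership, the higher the score.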
\n", " \n", " \n", "#### Data\n", "* 40K kamervragen: \n", "* 1K kamervragen \n", "\n", " " ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import nltk \n", "from pattern.nl import lemma # works very bad for Dutch\n", "import re\n", "from collections import Counter\n", "import itertools\n", "import sklearn\n", "import pandas as pd\n", "import numpy as np\n", "from numpy import log2\n", "from nltk.corpus import stopwords\n", "dutchstop = set(stopwords.words('dutch'))" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(40516, 6)\n" ] }, { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
| | jaar | partij | titel | vraag | antwoord | ministerie |
|:-|:-|:-|:-|:-|:-|:-|
| KVR1000.xml | 1994 | PvdA | De vragen betreffen de betrouwbaarheid van de... | Hebt u kennisgenomen van het televisieprogram... | Ja. Het bedoelde geluidmeetpunt is eigendom v... | Verkeer en Waterstaat |
| KVR10000.xml | 1999 | PvdA | Vragen naar aanleiding van berichten (uitzend... | Kent u de berichten over de situatie in de Me... | | Justitie |
| KVR10001.xml | 1999 | SP | Vragen naar aanleiding van de berichten \"Nede... | Kent u de berichten «Nederland steunt de Soeh... | | Financien |
| KVR10002.xml | 1999 | PvdA | Vragen over de gebrekkige opvang van verpleeg... | Kent u het bericht over onderzoek van Nu91 me... | Ja. Het onderzoek van NU’91 wijst uit dat het... | Volksgezondheid, Welzijn en Sport |
| KVR10003.xml | 1999 | PvdA | Vragen over onbetrouwbaarheid van filemeldingen. | Hebt u kennisgenomen van de berichten over de... | Ja. Nee. Door de waarnemers van het Algemeen ... | Verkeer en Waterstaat |
\n", "
" ], "text/plain": [ " jaar partij \\\n", "KVR1000.xml 1994 PvdA \n", " KVR10000.xml 1999 PvdA \n", " KVR10001.xml 1999 SP \n", " KVR10002.xml 1999 PvdA \n", " KVR10003.xml 1999 PvdA \n", "\n", " titel \\\n", "KVR1000.xml De vragen betreffen de betrouwbaarheid van de... \n", " KVR10000.xml Vragen naar aanleiding van berichten (uitzend... \n", " KVR10001.xml Vragen naar aanleiding van de berichten \"Nede... \n", " KVR10002.xml Vragen over de gebrekkige opvang van verpleeg... \n", " KVR10003.xml Vragen over onbetrouwbaarheid van filemeldingen. \n", "\n", " vraag \\\n", "KVR1000.xml Hebt u kennisgenomen van het televisieprogram... \n", " KVR10000.xml Kent u de berichten over de situatie in de Me... \n", " KVR10001.xml Kent u de berichten «Nederland steunt de Soeh... \n", " KVR10002.xml Kent u het bericht over onderzoek van Nu91 me... \n", " KVR10003.xml Hebt u kennisgenomen van de berichten over de... \n", "\n", " antwoord \\\n", "KVR1000.xml Ja. Het bedoelde geluidmeetpunt is eigendom v... \n", " KVR10000.xml \n", " KVR10001.xml \n", " KVR10002.xml Ja. Het onderzoek van NU’91 wijst uit dat het... \n", " KVR10003.xml Ja. Nee. Door de waarnemers van het Algemeen ... \n", "\n", " ministerie \n", "KVR1000.xml Verkeer en Waterstaat \n", " KVR10000.xml Justitie \n", " KVR10001.xml Financien \n", " KVR10002.xml Volksgezondheid, Welzijn en Sport \n", " KVR10003.xml Verkeer en Waterstaat " ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "\n", "kvrdf= pd.read_csv('http://maartenmarx.nl/teaching/zoekmachines/LectureNotes/MySQL/KVR.csv.gz', \n", " compression='gzip', \n", " sep='\\t', \n", " index_col=0, \n", " # encoding='utf-8',\n", " names=['jaar', 'partij','titel','vraag','antwoord','ministerie'])\n", "print kvrdf.shape\n", "kvrdf.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "

Exercise

\n", "\n", "

We will use the values in the _ministerie_ column as our classes. \n", " These are the ministries to which the questions are addressed.\n", "
\n", " Note that these labels are not normalized, see e.g. these counts below" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ " Justitie (JUS) 3219\n", " Volksgezondheid, Welzijn en Sport (VWS) 2630\n", " Buitenlandse Zaken (BUZA) 1796\n", " Verkeer en Waterstaat (VW) 1441\n", " Justitie 1333\n", " Sociale Zaken en Werkgelegenheid (SZW) 1231\n", " Onderwijs, Cultuur en Wetenschappen (OCW) 1187\n", " Volkshuisvesting, Ruimtelijke Ordening en Milieubeheer (VROM) 984\n", " Financiën (FIN) 960\n", " Volksgezondheid, Welzijn en Sport 951\n", " Economische Zaken (EZ) 946\n", " Buitenlandse Zaken 753\n", " Binnenlandse Zaken en Koninkrijksrelaties (BZK) 725\n", " Verkeer en Waterstaat 724\n", " Defensie (DEF) 646\n", " Sociale Zaken en Werkgelegenheid 607\n", " Landbouw, Natuurbeheer en Visserij (LNV) 586\n", " Volkshuisvesting, Ruimtelijke Ordening en Milieubeheer 554\n", " Onderwijs, Cultuur en Wetenschappen 532\n", " Vreemdelingenzaken en Integratie (VI) 466\n", " Landbouw, Natuurbeheer en Visserij 440\n", " Landbouw, Natuur en Voedselkwaliteit (LNV) 422\n", " Financiën 409\n", " Binnenlandse Zaken 389\n", " Economische Zaken 337\n", " Defensie 305\n", " Ontwikkelingssamenwerking (OS) 198\n", " Onderwijs, Cultuur en Wetenschap (OCW) 195\n", " Algemene Zaken (AZ) 169\n", " Binnenlandse Zaken en Koninkrijksrelaties (BZK) Justitie (JUS) 168\n", "Name: ministerie, dtype: int64" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "kvrdf.ministerie.value_counts().head(30)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Exercises\n", " \n", "* Normalize the values for \"ministerie\" and choose 10 ministeries to work with. 
Put these in a new column called `NormalizedMinisterie`.\n", "* Also watch out for spaces at the beginning and end.\n", " \n", " " ] }, { "cell_type": "code", "execution_count": 25, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "24824\n" ] }, { "data": { "text/plain": [ "Justitie 4640\n", "Volksgezondheid, Welzijn en Sport 3597\n", "Buitenlandse Zaken 2697\n", "Verkeer en Waterstaat 2178\n", "Sociale Zaken en Werkgelegenheid 1861\n", "Onderwijs, Cultuur en Wetenschappen 1730\n", "Volkshuisvesting, Ruimtelijke Ordening en Milieubeheer 1560\n", "Financiën 1403\n", "Economische Zaken 1309\n", "Binnenlandse Zaken en Koninkrijksrelaties 1241\n", "Landbouw, Natuurbeheer en Visserij 1031\n", "Defensie 963\n", "Vreemdelingenzaken en Integratie 614\n", "Name: NormalizedMinisterie, dtype: int64" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# I picked 13 because the last three looked like fun.\n", "\n", "topNChosenMinisteries=13\n", "\n", "# strip the parenthesised abbreviation, e.g. ' (JUS)', and everything after it\n", "kvrdf['NormalizedMinisterie']= kvrdf.ministerie.str.replace(r' *\\(.*','')\n", "kvrdf['NormalizedMinisterie']=kvrdf.NormalizedMinisterie.str.strip() # remove leading and trailing spaces\n", "print kvrdf['NormalizedMinisterie'].value_counts().head(topNChosenMinisteries).sum()\n", "kvrdf['NormalizedMinisterie'].value_counts().head(topNChosenMinisteries)" ] }, { "cell_type": "code", "execution_count": 26, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "40516\n" ] }, { "data": { "text/plain": [ "(24824, 9)" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# restrict to the top 13\n", "\n", "print len(kvrdf)\n", "kvrdf= kvrdf[kvrdf.NormalizedMinisterie.isin(kvrdf['NormalizedMinisterie'].value_counts().head(topNChosenMinisteries).index)]\n", "kvrdf.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Tokenize\n", 
"\n", "1. Add a column `vocabulair` containing a list of all unique words that occur in the text columns of a parliamentary question.\n", "    * lowercase the words\n", "    * discard numbers and punctuation tokens\n", "    * leave out words that are in the Dutch stopword list `dutchstop`\n", "    * optionally, lemmatize as well\n", "2. **Hint** \n", "    * First concatenate all the text columns (take care!)\n", "    * write a function that tokenizes and does everything described above\n", "    * apply that function with `.apply` to the column holding all the text\n", "3. This can take a while, so test on a small number of rows, or on the small file." ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "'basisscholen'" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "lemma('basisschool')" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "kvrdf['fulltext']= kvrdf.titel.astype(str)+' '+kvrdf.vraag.astype(str)+' '+kvrdf.antwoord.astype(str)\n", "\n", "\n", " \n", " \n", "\n", "def tokenize(text, lemmatize=False):\n", "    # keep lowercased word tokens longer than 2 characters that are neither pure digits nor stopwords\n", "    tokens= [s.lower() for s in set(nltk.word_tokenize(str(text).decode('utf-8')))\n", "             if re.match(r'^\\w+$',s) and \n", "             len(s) > 2\n", "             and not re.match(r'^\\d+$',s ) and \n", "             not s.lower() in dutchstop\n", "            ]\n", "    if not lemmatize:\n", "        return list(set(tokens))\n", "    # deduplicate after lemmatizing, because different word forms can share a lemma\n", "    return list(set(lemma(w) for w in tokens))\n", " \n", "# test\n", "\n", "print kvrdf.fulltext.head(5).apply(tokenize)\n", "\n", "kvrdf.fulltext.head(5).apply(lambda w:tokenize(w,lemmatize=True))" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "%time kvrdf['vocabulair']= kvrdf.fulltext.apply(tokenize)" ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": 
"stream", "text": [ "CPU times: user 4min 40s, sys: 3.42 s, total: 4min 43s\n", "Wall time: 4min 44s\n" ] } ], "source": [ "%time kvrdf['lemmatized_vocabulair']= kvrdf.fulltext.apply(lambda w:tokenize(w,lemmatize=True))" ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 44.2 s, sys: 433 ms, total: 44.6 s\n", "Wall time: 44.9 s\n" ] } ], "source": [ "# Fast alternative \n", "def filter_voc(L):\n", " tokens=[s for s in L if len(s) > 2\n", " and not re.match(r'^\\d+$',s ) and \n", " not s in dutchstop\n", " ]\n", " return list(set(tokens))\n", "%time tokens= (kvrdf.fulltext.str.lower().str.findall(r'\\b\\w+\\b')).apply(filter_voc)" ] }, { "cell_type": "code", "execution_count": 27, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 12.9 s, sys: 208 ms, total: 13.1 s\n", "Wall time: 13.5 s\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "/Users/admin/anaconda/lib/python2.7/site-packages/ipykernel/__main__.py:6: SettingWithCopyWarning: \n", "A value is trying to be set on a copy of a slice from a DataFrame.\n", "Try using .loc[row_indexer,col_indexer] = value instead\n", "\n", "See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy\n" ] }, { "data": { "text/plain": [ "KVR1000.xml [betrouwbaarheid, inderdaad, meter, bereiken, ...\n", " KVR10000.xml [twijfels, bent, kennis, situatie, moeten, wel...\n", " KVR10002.xml [gestaan, erom, volkskrant, wijze, protocollen...\n", " KVR10003.xml [gezet, betrouwbaarheid, desbetreffende, waarn...\n", " KVR10004.xml [rol, volkskrant, termijn, gepresenteerd, spoo...\n", "Name: fulltext, dtype: object" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Even faster \n", "# lower case, length restrictie, remove pure digits, 
remove duplicates, remove dutchstop \n", "\n", "%time tokens= (kvrdf.fulltext.str.lower().str.replace(r'\\b\\d+\\b','').str.replace(r'\\b\\w\\w\\b','').str.findall(r'\\b\\w+\\b')).apply(lambda l:([w for w in set(l) if not w in dutchstop]))\n", "\n", "kvrdf['vocabulair']=tokens\n", "tokens.head()\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Pattern's lemmatizer is very unreliable for Dutch" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "#kvrdf.lemmatized_vocabulair.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Count words for each ministry\n", "\n", "* Use `Counter` to count, for each ministry, **in how many documents each word occurs.**\n", "* **Hints**\n", "    * Loop over all ministries\n", "    * for each ministry, take the `vocabulair` column and concatenate all its lists into one list\n", "    * then count with `Counter`\n", "    * Keep only words that occur in at least 10 documents\n", "    * In the end, return a dict with ministries as keys and their counters as values\n", "    * Turn this into a dataframe with `pd.DataFrame.from_dict`\n", "    * the words are the rows, and the ministries the columns\n" ] }, { "cell_type": "code", "execution_count": 28, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 4.2 s, sys: 99.5 ms, total: 4.3 s\n", "Wall time: 4.3 s\n", "Volksgezondheid, Welzijn en Sport 7671\n", "Defensie 2903\n", "Sociale Zaken en Werkgelegenheid 4978\n", "Justitie 8126\n", "Binnenlandse Zaken en Koninkrijksrelaties 3677\n", "Landbouw, Natuurbeheer en Visserij 3184\n", "Financiën 4167\n", "Volkshuisvesting, Ruimtelijke Ordening en Milieubeheer 4802\n", "Buitenlandse Zaken 6030\n", "Onderwijs, Cultuur en Wetenschappen 4581\n", "Vreemdelingenzaken en Integratie 2101\n", "Verkeer en Waterstaat 5634\n", "Economische Zaken 4225\n" ] } ], "source": [ "def 
makeCounters(Vocabulair,drempelwaarde=10):\n", "    '''Per ministry, map each word to the number of documents it occurs in, keeping only words found in at least drempelwaarde documents.'''\n", "    II= {}\n", "    for mi in set(kvrdf.NormalizedMinisterie.values):\n", "        A= kvrdf[kvrdf.NormalizedMinisterie==mi][Vocabulair].values\n", "        # every document lists a word at most once, so this counts documents per word\n", "        M= Counter([w for l in A for w in l])\n", "        Minimaal10={v:M[v] for v in M if M[v]>=drempelwaarde}\n", "        II[mi]= Minimaal10\n", "    return II\n", "\n", "%time II= makeCounters('vocabulair')\n", "\n", "for mi in II:\n", "    print mi, len(II[mi])" ] }, { "cell_type": "code", "execution_count": 29, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 109 ms, sys: 3.05 ms, total: 113 ms\n", "Wall time: 113 ms\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "/Users/admin/anaconda/lib/python2.7/site-packages/numpy/lib/function_base.py:3834: RuntimeWarning: Invalid value encountered in percentile\n", " RuntimeWarning)\n" ] }, { "data": { "text/html": [ "

\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
| | Binnenlandse Zaken en Koninkrijksrelaties | Buitenlandse Zaken | Defensie | Economische Zaken | Financiën | Justitie | Landbouw, Natuurbeheer en Visserij | Onderwijs, Cultuur en Wetenschappen | Sociale Zaken en Werkgelegenheid | Verkeer en Waterstaat | Volksgezondheid, Welzijn en Sport | Volkshuisvesting, Ruimtelijke Ordening en Milieubeheer | Vreemdelingenzaken en Integratie |
|:-|:-|:-|:-|:-|:-|:-|:-|:-|:-|:-|:-|:-|:-|
| count | 3677.000000 | 6030.000000 | 2903.000000 | 4225.000000 | 4167.000000 | 8126.000000 | 3184.000000 | 4581.000000 | 4978.000000 | 5634.000000 | 7671.000000 | 4802.000000 | 2101.000000 |
| mean | 50.447919 | 70.073632 | 42.026180 | 53.996923 | 54.662827 | 90.829190 | 47.155779 | 59.431347 | 65.586983 | 66.653355 | 87.797810 | 59.704082 | 38.621133 |
| std | 74.609997 | 132.822490 | 57.124467 | 81.939572 | 82.836844 | 197.824251 | 63.995373 | 98.232193 | 109.091074 | 118.804579 | 181.457338 | 95.210070 | 45.112760 |
| min | 10.000000 | 10.000000 | 10.000000 | 10.000000 | 10.000000 | 10.000000 | 10.000000 | 10.000000 | 10.000000 | 10.000000 | 10.000000 | 10.000000 | 10.000000 |
| 25% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 50% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 75% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| max | 1106.000000 | 2184.000000 | 766.000000 | 1057.000000 | 1089.000000 | 3808.000000 | 762.000000 | 1357.000000 | 1501.000000 | 1703.000000 | 2965.000000 | 1236.000000 | 518.000000 |
\n", "
" ], "text/plain": [ " Binnenlandse Zaken en Koninkrijksrelaties Buitenlandse Zaken \\\n", "count 3677.000000 6030.000000 \n", "mean 50.447919 70.073632 \n", "std 74.609997 132.822490 \n", "min 10.000000 10.000000 \n", "25% NaN NaN \n", "50% NaN NaN \n", "75% NaN NaN \n", "max 1106.000000 2184.000000 \n", "\n", " Defensie Economische Zaken Financiën Justitie \\\n", "count 2903.000000 4225.000000 4167.000000 8126.000000 \n", "mean 42.026180 53.996923 54.662827 90.829190 \n", "std 57.124467 81.939572 82.836844 197.824251 \n", "min 10.000000 10.000000 10.000000 10.000000 \n", "25% NaN NaN NaN NaN \n", "50% NaN NaN NaN NaN \n", "75% NaN NaN NaN NaN \n", "max 766.000000 1057.000000 1089.000000 3808.000000 \n", "\n", " Landbouw, Natuurbeheer en Visserij \\\n", "count 3184.000000 \n", "mean 47.155779 \n", "std 63.995373 \n", "min 10.000000 \n", "25% NaN \n", "50% NaN \n", "75% NaN \n", "max 762.000000 \n", "\n", " Onderwijs, Cultuur en Wetenschappen Sociale Zaken en Werkgelegenheid \\\n", "count 4581.000000 4978.000000 \n", "mean 59.431347 65.586983 \n", "std 98.232193 109.091074 \n", "min 10.000000 10.000000 \n", "25% NaN NaN \n", "50% NaN NaN \n", "75% NaN NaN \n", "max 1357.000000 1501.000000 \n", "\n", " Verkeer en Waterstaat Volksgezondheid, Welzijn en Sport \\\n", "count 5634.000000 7671.000000 \n", "mean 66.653355 87.797810 \n", "std 118.804579 181.457338 \n", "min 10.000000 10.000000 \n", "25% NaN NaN \n", "50% NaN NaN \n", "75% NaN NaN \n", "max 1703.000000 2965.000000 \n", "\n", " Volkshuisvesting, Ruimtelijke Ordening en Milieubeheer \\\n", "count 4802.000000 \n", "mean 59.704082 \n", "std 95.210070 \n", "min 10.000000 \n", "25% NaN \n", "50% NaN \n", "75% NaN \n", "max 1236.000000 \n", "\n", " Vreemdelingenzaken en Integratie \n", "count 2101.000000 \n", "mean 38.621133 \n", "std 45.112760 \n", "min 10.000000 \n", "25% NaN \n", "50% NaN \n", "75% NaN \n", "max 518.000000 " ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], 
"source": [ "# Turn it into a dataframe \n", "%time IIdf=pd.DataFrame.from_dict(II)#.fillna(0)\n", "IIdf.describe()" ] }, { "cell_type": "code", "execution_count": 30, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
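| | Binnenlandse Zaken en Koninkrijksrelaties | Buitenlandse Zaken | Defensie | Economische Zaken | Financiën | Justitie | Landbouw, Natuurbeheer en Visserij | Onderwijs, Cultuur en Wetenschappen | Sociale Zaken en Werkgelegenheid | Verkeer en Waterstaat | Volksgezondheid, Welzijn en Sport | Volkshuisvesting, Ruimtelijke Ordening en Milieubeheer | Vreemdelingenzaken en Integratie |
|:-|:-|:-|:-|:-|:-|:-|:-|:-|:-|:-|:-|:-|:-|
| zwitserland | NaN | 22.0 | NaN | 11.0 | NaN | 20.0 | NaN | NaN | NaN | 14.0 | 18.0 | NaN | NaN |
| zwitserse | NaN | NaN | NaN | NaN | NaN | 11.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| zwolle | NaN | NaN | NaN | NaN | NaN | 60.0 | NaN | 14.0 | NaN | 32.0 | 33.0 | NaN | NaN |
| zwolse | NaN | NaN | NaN | NaN | NaN | 11.0 | NaN | NaN | NaN | 12.0 | 10.0 | NaN | NaN |
| zzp | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 20.0 | NaN | NaN | NaN | NaN |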
\n", "
" ], "text/plain": [ " Binnenlandse Zaken en Koninkrijksrelaties Buitenlandse Zaken \\\n", "zwitserland NaN 22.0 \n", "zwitserse NaN NaN \n", "zwolle NaN NaN \n", "zwolse NaN NaN \n", "zzp NaN NaN \n", "\n", " Defensie Economische Zaken Financiën Justitie \\\n", "zwitserland NaN 11.0 NaN 20.0 \n", "zwitserse NaN NaN NaN 11.0 \n", "zwolle NaN NaN NaN 60.0 \n", "zwolse NaN NaN NaN 11.0 \n", "zzp NaN NaN NaN NaN \n", "\n", " Landbouw, Natuurbeheer en Visserij \\\n", "zwitserland NaN \n", "zwitserse NaN \n", "zwolle NaN \n", "zwolse NaN \n", "zzp NaN \n", "\n", " Onderwijs, Cultuur en Wetenschappen \\\n", "zwitserland NaN \n", "zwitserse NaN \n", "zwolle 14.0 \n", "zwolse NaN \n", "zzp NaN \n", "\n", " Sociale Zaken en Werkgelegenheid Verkeer en Waterstaat \\\n", "zwitserland NaN 14.0 \n", "zwitserse NaN NaN \n", "zwolle NaN 32.0 \n", "zwolse NaN 12.0 \n", "zzp 20.0 NaN \n", "\n", " Volksgezondheid, Welzijn en Sport \\\n", "zwitserland 18.0 \n", "zwitserse NaN \n", "zwolle 33.0 \n", "zwolse 10.0 \n", "zzp NaN \n", "\n", " Volkshuisvesting, Ruimtelijke Ordening en Milieubeheer \\\n", "zwitserland NaN \n", "zwitserse NaN \n", "zwolle NaN \n", "zwolse NaN \n", "zzp NaN \n", "\n", " Vreemdelingenzaken en Integratie \n", "zwitserland NaN \n", "zwitserse NaN \n", "zwolle NaN \n", "zwolse NaN \n", "zzp NaN " ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "IIdf.tail()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# How many words do we actually have? 
\n", "\n", "Roughly 4 million tokens and about 15K unique words, spread over almost 25,000 documents across 13 ministries" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "4089793.0\n", "(14873, 13)\n" ] }, { "data": { "text/plain": [ "Binnenlandse Zaken en Koninkrijksrelaties 185497.0\n", "Buitenlandse Zaken 422544.0\n", "Defensie 122002.0\n", "Economische Zaken 228137.0\n", "Financiën 227780.0\n", "Justitie 738078.0\n", "Landbouw, Natuurbeheer en Visserij 150144.0\n", "Onderwijs, Cultuur en Wetenschappen 272255.0\n", "Sociale Zaken en Werkgelegenheid 326492.0\n", "Verkeer en Waterstaat 375525.0\n", "Volksgezondheid, Welzijn en Sport 673497.0\n", "Volkshuisvesting, Ruimtelijke Ordening en Milieubeheer 286699.0\n", "Vreemdelingenzaken en Integratie 81143.0\n", "dtype: float64" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "\n", "\n", "print IIdf.sum().sum()\n", "print IIdf.shape\n", "IIdf.sum()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Which words occur most often?" ] }, { "cell_type": "code", "execution_count": 31, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "vragen 20052.0\n", "bent 15988.0\n", "welke 15361.0\n", "aanleiding 13285.0\n", "zoals 12865.0\n", "bereid 11655.0\n", "mening 11650.0\n", "ten 11298.0\n", "wel 10897.0\n", "mogelijk 10559.0\n", "dtype: float64" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "IIdf.sum(axis=1).sort_values(ascending=False).head(10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Mutual information\n", "\n", "Define the function $I(U,C)$ as in equation 13.16 of the IR book. It is of course much easier to implement 13.17. 
\n", "* Define all the components $N, N_{11}, N_{1.}$, etc. inside the function\n" ] }, { "cell_type": "code", "execution_count": 32, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "(7.5381481637955215e-05, 0)" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# This could be done much more elegantly with linear algebra on a small dataframe\n", "\n", "def h(cel,sumV,sumH,N):\n", "    # one term of the sum: (cell/N) * log2(N*cell / (row total * column total))\n", "    return (cel/N)*log2((N*cel)/(sumV*sumH))\n", " \n", "\n", "\n", "def I(U,C,df):\n", "    # note: N, N1p and Np1 are sums of the document-frequency counts stored in df\n", "    try:\n", "        N= df.sum().sum()\n", "        N11= df.loc[U][C]\n", "        N1p= df.loc[U].sum()\n", "        Np1= df[C].sum()\n", "        N10= N1p-N11\n", "        N01= Np1-N11\n", "        N0p= N-N1p\n", "        Np0= N-Np1\n", "        N00= N0p-N01\n", "        return sum([ h(N11,N1p,Np1,N),\n", "                     h(N10,N1p,Np0,N),\n", "                     h(N01,N0p,Np1,N),\n", "                     h(N00,N0p,Np0,N)\n", "                   ]\n", "                  )\n", "    except KeyError:  # U is not in the vocabulary\n", "        return 0\n", "\n", "#test \n", "I('misdrijf','Justitie',IIdf), I('misdrij','Justitie',IIdf)" ] }, { "cell_type": "code", "execution_count": 33, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 12.1 s, sys: 191 ms, total: 12.3 s\n", "Wall time: 12.5 s\n" ] }, { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
| | Mutual Information |
|:-|:-|
| justitie | 0.000401 |
| politie | 0.000245 |
| openbaar | 0.000213 |
| verdachte | 0.000201 |
| procureurs | 0.000183 |
| strafbare | 0.000182 |
| vervolging | 0.000179 |
| wetboek | 0.000163 |
| strafrechtelijk | 0.000160 |
| strafvordering | 0.000154 |
| officier | 0.000144 |
| verdachten | 0.000139 |
| strafrecht | 0.000134 |
| feiten | 0.000131 |
| opsporing | 0.000128 |
| rechter | 0.000128 |
| justiti | 0.000125 |
| strafzaak | 0.000120 |
| strafrechtelijke | 0.000119 |
| strafbaar | 0.000118 |
\n", "
" ], "text/plain": [ " Mutual Information\n", "justitie 0.000401\n", "politie 0.000245\n", "openbaar 0.000213\n", "verdachte 0.000201\n", "procureurs 0.000183\n", "strafbare 0.000182\n", "vervolging 0.000179\n", "wetboek 0.000163\n", "strafrechtelijk 0.000160\n", "strafvordering 0.000154\n", "officier 0.000144\n", "verdachten 0.000139\n", "strafrecht 0.000134\n", "feiten 0.000131\n", "opsporing 0.000128\n", "rechter 0.000128\n", "justiti 0.000125\n", "strafzaak 0.000120\n", "strafrechtelijke 0.000119\n", "strafbaar 0.000118" ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "def TopNMutualInformationWords(C,n):\n", " '''Give the top N words with the highest mutual information score for class C'''\n", " df= pd.DataFrame.from_dict({w:I(w,C,IIdf) for w in IIdf[C].dropna().index}, orient='index')\n", " df.sort_values(0,ascending=False, inplace=True)\n", " df.columns= ['Mutual Information']\n", " return df.head(n)\n", " \n", "%time TopNMutualInformationWords('Justitie',20)\n", " " ] }, { "cell_type": "code", "execution_count": 34, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Defensie \n", " Mutual Information\n", "defensie 0.000516\n", "militairen 0.000293\n", "militaire 0.000220\n", "koninklijke 0.000212\n", "krijgsmacht 0.000211\n", "militair 0.000203\n", "commandant 0.000145\n", "luchtmacht 0.000117\n", "eenheden 0.000102\n", "marine 0.000096 \n", "\n", "Sociale Zaken en Werkgelegenheid \n", " Mutual Information\n", "sociale 0.000314\n", "werknemers 0.000304\n", "arbeid 0.000258\n", "werkgevers 0.000256\n", "szw 0.000245\n", "wao 0.000227\n", "uitkering 0.000216\n", "werkgever 0.000191\n", "werknemer 0.000156\n", "uwv 0.000150 \n", "\n", "Justitie \n", " Mutual Information\n", "justitie 0.000401\n", "politie 0.000245\n", "openbaar 0.000213\n", "verdachte 0.000201\n", "procureurs 0.000183\n", "strafbare 0.000182\n", "vervolging 0.000179\n", "wetboek 0.000163\n", 
"strafrechtelijk 0.000160\n", "strafvordering 0.000154 \n", "\n", "Economische Zaken \n", " Mutual Information\n", "economische 0.000145\n", "bedrijven 0.000105\n", "elektriciteit 0.000100\n", "energiebedrijven 0.000092\n", "energie 0.000088\n", "nma 0.000086\n", "markt 0.000071\n", "gas 0.000068\n", "afnemers 0.000068\n", "mededingingswet 0.000065 \n", "\n", "Binnenlandse Zaken en Koninkrijksrelaties \n", " Mutual Information\n", "politie 0.000147\n", "korpsen 0.000127\n", "koninkrijksrelaties 0.000106\n", "bzk 0.000097\n", "burgemeester 0.000091\n", "korps 0.000084\n", "korpsbeheerder 0.000083\n", "politiekorpsen 0.000081\n", "binnenlandse 0.000078\n", "agenten 0.000075 \n", "\n", "Landbouw, Natuurbeheer en Visserij \n", " Mutual Information\n", "dieren 0.000235\n", "landbouw 0.000178\n", "lnv 0.000163\n", "agrarisch 0.000155\n", "visserij 0.000149\n", "natuurbeheer 0.000148\n", "aid 0.000116\n", "vlees 0.000113\n", "vee 0.000088\n", "boeren 0.000084 \n", "\n", "Financiën \n", " Mutual Information\n", "belastingdienst 0.000241\n", "fiscale 0.000176\n", "belastingplichtigen 0.000139\n", "inkomstenbelasting 0.000138\n", "bank 0.000117\n", "belastingplichtige 0.000110\n", "fiscaal 0.000102\n", "heffing 0.000087\n", "belastingheffing 0.000084\n", "financi 0.000082 \n", "\n", "Volkshuisvesting, Ruimtelijke Ordening en Milieubeheer \n", " Mutual Information\n", "vrom 0.000295\n", "afval 0.000167\n", "woningen 0.000158\n", "milieubeheer 0.000149\n", "milieu 0.000147\n", "ruimtelijke 0.000139\n", "volkshuisvesting 0.000127\n", "ordening 0.000122\n", "huurders 0.000121\n", "afvalstoffen 0.000114 \n", "\n", "Buitenlandse Zaken \n", " Mutual Information\n", "regering 0.000602\n", "autoriteiten 0.000337\n", "mensenrechten 0.000306\n", "bilateraal 0.000273\n", "ambassadeur 0.000245\n", "politieke 0.000244\n", "dialoog 0.000231\n", "buitenlandse 0.000223\n", "president 0.000205\n", "mensenrechtensituatie 0.000201 \n", "\n", "Onderwijs, Cultuur en Wetenschappen \n", " Mutual 
Information\n", "onderwijs 0.000641\n", "scholen 0.000460\n", "leerlingen 0.000421\n", "school 0.000396\n", "ocw 0.000191\n", "studenten 0.000165\n", "schooljaar 0.000160\n", "voortgezet 0.000152\n", "cultuur 0.000146\n", "basisonderwijs 0.000140 \n", "\n", "Vreemdelingenzaken en Integratie \n", " Mutual Information\n", "ind 0.000202\n", "asielzoekers 0.000181\n", "verblijfsvergunning 0.000148\n", "vreemdeling 0.000141\n", "uitzetting 0.000138\n", "vreemdelingen 0.000134\n", "terugkeer 0.000104\n", "verblijf 0.000090\n", "uitgeprocedeerde 0.000089\n", "asielzoeker 0.000076 \n", "\n", "Verkeer en Waterstaat \n", " Mutual Information\n", "verkeer 0.000343\n", "waterstaat 0.000317\n", "vervoer 0.000215\n", "rijkswaterstaat 0.000196\n", "reizigers 0.000190\n", "spoor 0.000169\n", "verkeersveiligheid 0.000135\n", "trein 0.000133\n", "aanleg 0.000132\n", "rijden 0.000120 \n", "\n", "Volksgezondheid, Welzijn en Sport \n", " Mutual Information\n", "pati 0.000555\n", "nten 0.000511\n", "gezondheidszorg 0.000411\n", "zorg 0.000335\n", "ziekenhuizen 0.000319\n", "vws 0.000290\n", "medisch 0.000262\n", "ziekenhuis 0.000235\n", "awbz 0.000217\n", "huisartsen 0.000185 \n", "\n" ] } ], "source": [ "for mi in list(set(kvrdf.NormalizedMinisterie.values)):\n", " print mi,'\\n', TopNMutualInformationWords(mi,10),'\\n'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Try out: Colocations of bigrammen\n", "\n", "Zie \n", "\n", " " ] }, { "cell_type": "code", "execution_count": 35, "metadata": { "collapsed": false }, "outputs": [], "source": [ "mi='Justitie'\n", "A= kvrdf[kvrdf.NormalizedMinisterie==mi].fulltext \n", "A_all= ' '.join(A.values).decode('utf-8')\n", "tokens= nltk.tokenize.wordpunct_tokenize(A_all)\n", "B = nltk.Text(tokens)\n", "\n", "\n", "\n" ] }, { "cell_type": "code", "execution_count": 36, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "van het; Openbaar Ministerie; naar aanleiding; waarom niet; 
openbaar\n", "ministerie; ten aanzien; Aanhangsel Handelingen; mening dat;\n", "betrekking tot; van een; Tweede Kamer; van van; Vragen naar; kan\n", "worden; aanleiding van; met betrekking; strafbare feiten; het\n", "Openbaar; feit dat; dan wel\n" ] } ], "source": [ "# Does not work very well: it does not filter out Dutch stopwords\n", "B.collocations()" ] }, { "cell_type": "code", "execution_count": 37, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from nltk.collocations import BigramCollocationFinder\n", "bigram_measures = nltk.collocations.BigramAssocMeasures()" ] }, { "cell_type": "code", "execution_count": 38, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Build unigram and bigram frequency distributions over the tokens\n", "word_fd = nltk.FreqDist(tokens)\n", "bigram_fd = nltk.FreqDist(nltk.bigrams(tokens))\n", "finder = BigramCollocationFinder(word_fd, bigram_fd)" ] }, { "cell_type": "code", "execution_count": 39, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Keep only bigrams that occur at least 100 times\n", "finder.apply_freq_filter(100)\n", "# Drop Dutch stopwords, tokens shorter than 3 characters, and punctuation-only tokens\n", "finder.apply_word_filter(lambda w: w in dutchstop or len(w)<3 or re.search(r'^\\W+$',w))\n", "# Rank the remaining bigrams by pointwise mutual information (PMI)\n", "scored = finder.score_ngrams(bigram_measures.pmi)" ] }, { "cell_type": "code", "execution_count": 40, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "[((u'Sri', u'Lanka'), 14.334247343063975),\n", " ((u'Verenigd', u'Koninkrijk'), 13.770043402509613),\n", " ((u'Ter', u'Apel'), 13.120122537711127),\n", " ((u'persoonlijke', u'levenssfeer'), 13.014957486638092),\n", " ((u'Koninklijke', u'Marechaussee'), 12.6041932359635),\n", " ((u'Dienst', u'Justiti'), 12.55619998538745),\n", " ((u'Holland', u'Casino'), 12.27312889617636),\n", " ((u'centrale', u'autoriteit'), 12.21993631683266),\n", " ((u'NRC', u'Handelsblad'), 12.20586418610441),\n", " ((u'Verenigde', u'Staten'), 12.171037334864451),\n", " ((u'Nationale', u'Recherche'), 12.13859977696005),\n", " ((u'alleenstaande', u'minderjarige'), 12.029608731798323),\n", " ((u'voorlopige', u'hechtenis'), 11.992737081233336),\n", " 
((u'seksueel', u'misbruik'), 11.992411369522582),\n", " ((u'Den', u'Haag'), 11.8970289893151),\n", " ((u'Den', u'Bosch'), 11.887777649381626),\n", " ((u'kort', u'geding'), 11.755647446233333),\n", " ((u'rechterlijke', u'macht'), 11.752291188474594),\n", " ((u'Indiener', u'vraagt'), 11.635066411357236),\n", " ((u'Algemeen', u'Overleg'), 11.607257418406245),\n", " ((u'inzage', u'gelegd'), 11.465888519708379),\n", " ((u'burgerlijke', u'stand'), 11.424562969771415),\n", " ((u'Buitenlandse', u'Zaken'), 11.341850018123964),\n", " ((u'penitentiaire', u'inrichtingen'), 11.302398477024632),\n", " ((u'Binnenlandse', u'Zaken'), 11.278904271582267),\n", " ((u'Algemeen', u'Dagblad'), 11.23272804915242),\n", " ((u'Burgerlijk', u'Wetboek'), 11.223330534825525),\n", " ((u'hoger', u'beroep'), 11.09942481724375),\n", " ((u'Europese', u'Unie'), 11.017523721181668),\n", " ((u'Economische', u'Zaken'), 10.880456382057698)]" ] }, "execution_count": 40, "metadata": {}, "output_type": "execute_result" } ], "source": [ "\n", "scored[:30]\n", "#sorted(bigram for bigram, score in scored )" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] } ], "metadata": { "anaconda-cloud": {}, "kernelspec": { "display_name": "Python [default]", "language": "python", "name": "python2" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.12" } }, "nbformat": 4, "nbformat_minor": 0 }