{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Text classification with Naive Bayes : feature selection with mutual information\n", "\n", "## Notebook made by \n", "\n", "|** Name** | **Student id** | **email**|\n", "|:- |:-|:-|\n", "|. | | |\n", "| | |. |\n", "\n", "### Pledge (taken from [Coursera's Honor Code](https://www.coursera.org/about/terms/honorcode) )\n", "\n", "\n", "\n", "Put here a selfie with your photo where you hold a signed paper with the following text: (if this is team work, put two selfies here). The link must be to some place on the web, not to a local file. \n", "\n", "> My answers to homework, quizzes and exams will be my own work (except for assignments that explicitly permit collaboration).\n", "\n", ">I will not make solutions to homework, quizzes or exams available to anyone else. This includes both solutions written by me, as well as any official solutions provided by the course staff.\n", "\n", ">I will not engage in any other activities that will dishonestly improve my results or dishonestly improve/hurt the results of others.\n", "\n", "\n", "\n", "### Note\n", "* **Assignments without the selfies or completely filled in information will not be graded and receive 0 points.**\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Text classification with Naive Bayes \n", " \n", " \n", " \n", "
We will do **text classification** on a collection of Dutch parliamentary questions.\n", "
\n", "In this notebook we restrict ourselves to computing the **mutual information** scores, as described in

| | jaar | partij | titel | vraag | antwoord | ministerie |
|---|---|---|---|---|---|---|
| KVR1000.xml | 1994 | PvdA | De vragen betreffen de betrouwbaarheid van de... | Hebt u kennisgenomen van het televisieprogram... | Ja. Het bedoelde geluidmeetpunt is eigendom v... | Verkeer en Waterstaat |
| KVR10000.xml | 1999 | PvdA | Vragen naar aanleiding van berichten (uitzend... | Kent u de berichten over de situatie in de Me... | | Justitie |
| KVR10001.xml | 1999 | SP | Vragen naar aanleiding van de berichten \"Nede... | Kent u de berichten «Nederland steunt de Soeh... | | Financien |
| KVR10002.xml | 1999 | PvdA | Vragen over de gebrekkige opvang van verpleeg... | Kent u het bericht over onderzoek van Nu91 me... | Ja. Het onderzoek van NU’91 wijst uit dat het... | Volksgezondheid, Welzijn en Sport |
| KVR10003.xml | 1999 | PvdA | Vragen over onbetrouwbaarheid van filemeldingen. | Hebt u kennisgenomen van de berichten over de... | Ja. Nee. Door de waarnemers van het Algemeen ... | Verkeer en Waterstaat |

We will use the values in the _ministerie_ column as our classes. \n",
    " These are the ministries to which the questions are addressed.\n",
    "\n",
    " Note that these labels are not normalized; see, for example, the counts below."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
" Justitie (JUS) 3219\n",
" Volksgezondheid, Welzijn en Sport (VWS) 2630\n",
" Buitenlandse Zaken (BUZA) 1796\n",
" Verkeer en Waterstaat (VW) 1441\n",
" Justitie 1333\n",
" Sociale Zaken en Werkgelegenheid (SZW) 1231\n",
" Onderwijs, Cultuur en Wetenschappen (OCW) 1187\n",
" Volkshuisvesting, Ruimtelijke Ordening en Milieubeheer (VROM) 984\n",
" Financiën (FIN) 960\n",
" Volksgezondheid, Welzijn en Sport 951\n",
" Economische Zaken (EZ) 946\n",
" Buitenlandse Zaken 753\n",
" Binnenlandse Zaken en Koninkrijksrelaties (BZK) 725\n",
" Verkeer en Waterstaat 724\n",
" Defensie (DEF) 646\n",
" Sociale Zaken en Werkgelegenheid 607\n",
" Landbouw, Natuurbeheer en Visserij (LNV) 586\n",
" Volkshuisvesting, Ruimtelijke Ordening en Milieubeheer 554\n",
" Onderwijs, Cultuur en Wetenschappen 532\n",
" Vreemdelingenzaken en Integratie (VI) 466\n",
" Landbouw, Natuurbeheer en Visserij 440\n",
" Landbouw, Natuur en Voedselkwaliteit (LNV) 422\n",
" Financiën 409\n",
" Binnenlandse Zaken 389\n",
" Economische Zaken 337\n",
" Defensie 305\n",
" Ontwikkelingssamenwerking (OS) 198\n",
" Onderwijs, Cultuur en Wetenschap (OCW) 195\n",
" Algemene Zaken (AZ) 169\n",
" Binnenlandse Zaken en Koninkrijksrelaties (BZK) Justitie (JUS) 168\n",
"Name: ministerie, dtype: int64"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"kvrdf.ministerie.value_counts().head(30)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Exercises\n",
" \n",
"* Normalize the values for \"ministerie\" and choose 10 ministeries to work with. Put these in a new column called `NormalizedMinisterie`.\n",
"* Let ook goed op spaties aan het begin en eind.\n",
" \n",
" "
]
},
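The normalization asked for above can be sketched with the standard library alone. This is a minimal illustration, not the assignment's solution: the helper name `normalize_label` and the sample strings in `raw_labels` are made up for the example and stand in for the values of `kvrdf.ministerie`.

```python
import re
from collections import Counter


def normalize_label(label):
    """Drop everything from the first '(' onward, then strip outer whitespace."""
    return re.sub(r"\s*\(.*", "", label).strip()


# Hypothetical raw labels, mimicking the variants seen in the counts above.
raw_labels = [
    " Justitie (JUS)",
    "Justitie",
    " Defensie (DEF)",
    "Defensie ",
    "Financiën (FIN)",
]

normalized = [normalize_label(label) for label in raw_labels]
counts = Counter(normalized)

# Keep only the n most frequent normalized labels (n = 2 in this toy example).
top_labels = {label for label, _ in counts.most_common(2)}
kept = [label for label in normalized if label in top_labels]
```

Note that a label such as `Binnenlandse Zaken en Koninkrijksrelaties (BZK) Justitie (JUS)` is cut at the first parenthesis, so it is counted under the first ministry only.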
{
"cell_type": "code",
"execution_count": 25,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"24824\n"
]
},
{
"data": {
"text/plain": [
"Justitie 4640\n",
"Volksgezondheid, Welzijn en Sport 3597\n",
"Buitenlandse Zaken 2697\n",
"Verkeer en Waterstaat 2178\n",
"Sociale Zaken en Werkgelegenheid 1861\n",
"Onderwijs, Cultuur en Wetenschappen 1730\n",
"Volkshuisvesting, Ruimtelijke Ordening en Milieubeheer 1560\n",
"Financiën 1403\n",
"Economische Zaken 1309\n",
"Binnenlandse Zaken en Koninkrijksrelaties 1241\n",
"Landbouw, Natuurbeheer en Visserij 1031\n",
"Defensie 963\n",
"Vreemdelingenzaken en Integratie 614\n",
"Name: NormalizedMinisterie, dtype: int64"
]
},
"execution_count": 25,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Ik heb er 13 gekozen want die laatste drie leken me juist zo leuk.\n",
"\n",
"topNChosenMinisteries=13\n",
"\n",
"kvrdf['NormalizedMinisterie']= kvrdf.ministerie.str.replace(r' *\\(.*','')\n",
"kvrdf['NormalizedMinisterie']=kvrdf.NormalizedMinisterie.str.strip() # remove leading and ending spaces\n",
"print kvrdf['NormalizedMinisterie'].value_counts().head(topNChosenMinisteries).sum()\n",
"kvrdf['NormalizedMinisterie'].value_counts().head(topNChosenMinisteries)"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"40516\n"
]
},
{
"data": {
"text/plain": [
"(24824, 9)"
]
},
"execution_count": 26,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# restrict to the top 13\n",
"\n",
"print len(kvrdf)\n",
"kvrdf= kvrdf[kvrdf.NormalizedMinisterie.isin(kvrdf['NormalizedMinisterie'].value_counts().head(topNChosenMinisteries).index)]\n",
"kvrdf.shape"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Tokenize\n",
"\n",
"1. Voeg een kolom `vocabulair` toe met een lijst met alle unieke woorden die voorkomen in de tekst kolommen van een kamervraag.\n",
" * lowercase de woorden\n",
" * gooi getallen en punctuatie-tokens weg\n",
" * neem geen woorden die in de NL stopwoorden lijst `dutchstop` zitten op\n",
" * je kunt ook nog lemmatiseren\n",
"2. **Hint** \n",
" * Plak eerst alle tekst aan elkaar vast (let op!)\n",
" * maak een functie die tokenize doet en al die dingen hierboven beschreven\n",
" * pas die functie met `.apply` toe op de kolom met alle tekst\n",
"3. Dit kan wel even duren. Test dus op kleine aantallen, of op de kleine file."
]
},
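The tokenization steps listed above can be sketched without any external library. This is only an illustration under stated assumptions: `DUTCH_STOP` is a tiny made-up stand-in for the `dutchstop` list, the function name `vocabulary` is hypothetical, and a plain regular expression replaces a real tokenizer.

```python
import re

# Hypothetical stand-in for the `dutchstop` stopword list.
DUTCH_STOP = {"de", "het", "een", "van", "over", "dat"}


def vocabulary(text, stopwords=DUTCH_STOP, min_length=3):
    """Return the sorted unique lowercase word tokens of `text`."""
    tokens = re.findall(r"\w+", text.lower())
    return sorted(
        {
            tok
            for tok in tokens
            if len(tok) >= min_length  # drop very short tokens
            and not tok.isdigit()      # drop pure numbers
            and tok not in stopwords   # drop stopwords
        }
    )


sample = "Vragen over de betrouwbaarheid van 12 filemeldingen, vragen dat!"
sample_vocabulary = vocabulary(sample)
```

Building a set before sorting removes repeated words, so each word appears once per document, which is what the document counts later in the notebook rely on.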
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"'basisscholen'"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"lemma('basisschool')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"kvrdf['fulltext']= kvrdf.titel.astype(str)+' '+kvrdf.vraag.astype(str)+' '+kvrdf.antwoord.astype(str)\n",
"\n",
"\n",
" \n",
" \n",
"\n",
"def tokenize(text, lemmatize=False):\n",
" tokens= [s.lower() for s in set(nltk.word_tokenize(str(text).decode('utf-8')))\n",
" if re.match(r'^\\w+$',s) and \n",
" len(s) > 2\n",
" and not re.match(r'^\\d+$',s ) and \n",
" not s.lower() in dutchstop\n",
" ]\n",
" if not lemmatize:\n",
" return list(set(tokens))\n",
" if lemmatize:\n",
" return list(lemma(w) for w in set(tokens))\n",
" \n",
"# test\n",
"\n",
"print kvrdf.fulltext.head(5).apply(tokenize)\n",
"\n",
"kvrdf.fulltext.head(5).apply(lambda w:tokenize(w,lemmatize=True))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"%time kvrdf['vocabulair']= kvrdf.fulltext.apply(tokenize)"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"CPU times: user 4min 40s, sys: 3.42 s, total: 4min 43s\n",
"Wall time: 4min 44s\n"
]
}
],
"source": [
"%time kvrdf['lemmatized_vocabulair']= kvrdf.fulltext.apply(lambda w:tokenize(w,lemmatize=True))"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"CPU times: user 44.2 s, sys: 433 ms, total: 44.6 s\n",
"Wall time: 44.9 s\n"
]
}
],
"source": [
"# Fast alternative \n",
"def filter_voc(L):\n",
" tokens=[s for s in L if len(s) > 2\n",
" and not re.match(r'^\\d+$',s ) and \n",
" not s in dutchstop\n",
" ]\n",
" return list(set(tokens))\n",
"%time tokens= (kvrdf.fulltext.str.lower().str.findall(r'\\b\\w+\\b')).apply(filter_voc)"
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"CPU times: user 12.9 s, sys: 208 ms, total: 13.1 s\n",
"Wall time: 13.5 s\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"/Users/admin/anaconda/lib/python2.7/site-packages/ipykernel/__main__.py:6: SettingWithCopyWarning: \n",
"A value is trying to be set on a copy of a slice from a DataFrame.\n",
"Try using .loc[row_indexer,col_indexer] = value instead\n",
"\n",
"See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy\n"
]
},
{
"data": {
"text/plain": [
"KVR1000.xml [betrouwbaarheid, inderdaad, meter, bereiken, ...\n",
" KVR10000.xml [twijfels, bent, kennis, situatie, moeten, wel...\n",
" KVR10002.xml [gestaan, erom, volkskrant, wijze, protocollen...\n",
" KVR10003.xml [gezet, betrouwbaarheid, desbetreffende, waarn...\n",
" KVR10004.xml [rol, volkskrant, termijn, gepresenteerd, spoo...\n",
"Name: fulltext, dtype: object"
]
},
"execution_count": 27,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Even faster \n",
"# lower case, length restrictie, remove pure digits, remove doubles, remove dutchstop \n",
"\n",
"%time tokens= (kvrdf.fulltext.str.lower().str.replace(r'\\b\\d+\\b','').str.replace(r'\\b\\w\\w\\b','').str.findall(r'\\b\\w+\\b')).apply(lambda l:([w for w in set(l) if not w in dutchstop]))\n",
"\n",
"kvrdf['vocabulair']=tokens\n",
"tokens.head()\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Pattern lemmatize is very dodgy for NL"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"#kvrdf.lemmatized_vocabulair.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Tel woorden voor elk Minsterie\n",
"\n",
"* Gebruik `Counter` om voor elk Ministerie, **voor elk woord te tellen in hoeveel documenten dat woord voorkomt.**\n",
"* **Hints**\n",
" * Loop over alle ministeries\n",
" * voor elk ministerie pak de kolom vocabulair op, en concateneer alle lijsten tot 1 lijst\n",
" * tel dan met `Counter`\n",
" * Neem alleen woorden die in minimaal 10 documenten voorkomen\n",
" * Lever uiteindelijk een dict op met Ministeries als sleutels en hun counters als waarden\n",
" * Maak hier een dataframe van met `pd.DataFrame.from_dict`\n",
" * de woorden zijn de rijen, en de Ministeries de kolommen\n"
]
},
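The counting step described above might look like the following on toy data. It is a sketch under assumptions: the `documents` list and the function name `document_frequencies` are invented for the example, and plain tuples replace the `kvrdf` dataframe.

```python
from collections import Counter, defaultdict

# Hypothetical toy data: (ministry, vocabulary of one document) pairs.
documents = [
    ("Justitie", ["politie", "rechter"]),
    ("Justitie", ["politie", "verdachte"]),
    ("Defensie", ["leger", "politie"]),
]


def document_frequencies(docs, threshold=1):
    """Map each ministry to {word: number of its documents containing the word}."""
    per_class = defaultdict(Counter)
    for ministry, vocab in docs:
        # set() guards against counting a word twice for the same document
        per_class[ministry].update(set(vocab))
    return {
        ministry: {word: n for word, n in counter.items() if n >= threshold}
        for ministry, counter in per_class.items()
    }


freqs = document_frequencies(documents, threshold=2)
```

Because every vocabulary holds each word at most once, summing over documents gives document frequencies rather than raw term frequencies. Passing such a dict of dicts to `pd.DataFrame.from_dict` yields words as rows and ministries as columns, with `NaN` where a word did not reach the threshold for a ministry.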
{
"cell_type": "code",
"execution_count": 28,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"CPU times: user 4.2 s, sys: 99.5 ms, total: 4.3 s\n",
"Wall time: 4.3 s\n",
"Volksgezondheid, Welzijn en Sport 7671\n",
"Defensie 2903\n",
"Sociale Zaken en Werkgelegenheid 4978\n",
"Justitie 8126\n",
"Binnenlandse Zaken en Koninkrijksrelaties 3677\n",
"Landbouw, Natuurbeheer en Visserij 3184\n",
"Financiën 4167\n",
"Volkshuisvesting, Ruimtelijke Ordening en Milieubeheer 4802\n",
"Buitenlandse Zaken 6030\n",
"Onderwijs, Cultuur en Wetenschappen 4581\n",
"Vreemdelingenzaken en Integratie 2101\n",
"Verkeer en Waterstaat 5634\n",
"Economische Zaken 4225\n"
]
}
],
"source": [
"def makeCounters(Vocabulair,drempelwaarde=10):\n",
" II= {}\n",
" for mi in set(kvrdf.NormalizedMinisterie.values):\n",
" A= kvrdf[kvrdf.NormalizedMinisterie==mi][Vocabulair].values\n",
" M= Counter([w for l in A for w in l])\n",
" Minimaal10={v:M[v] for v in M if M[v]>=drempelwaarde}\n",
" II[mi]= Minimaal10\n",
" return II\n",
"\n",
"%time II= makeCounters('vocabulair')\n",
"\n",
"for mi in II:\n",
" print mi, len(II[mi])"
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"CPU times: user 109 ms, sys: 3.05 ms, total: 113 ms\n",
"Wall time: 113 ms\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"/Users/admin/anaconda/lib/python2.7/site-packages/numpy/lib/function_base.py:3834: RuntimeWarning: Invalid value encountered in percentile\n",
" RuntimeWarning)\n"
]
},
{
"data": {
"text/html": [
"
| | Binnenlandse Zaken en Koninkrijksrelaties | Buitenlandse Zaken | Defensie | Economische Zaken | Financiën | Justitie | Landbouw, Natuurbeheer en Visserij | Onderwijs, Cultuur en Wetenschappen | Sociale Zaken en Werkgelegenheid | Verkeer en Waterstaat | Volksgezondheid, Welzijn en Sport | Volkshuisvesting, Ruimtelijke Ordening en Milieubeheer | Vreemdelingenzaken en Integratie |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 3677.000000 | 6030.000000 | 2903.000000 | 4225.000000 | 4167.000000 | 8126.000000 | 3184.000000 | 4581.000000 | 4978.000000 | 5634.000000 | 7671.000000 | 4802.000000 | 2101.000000 |
| mean | 50.447919 | 70.073632 | 42.026180 | 53.996923 | 54.662827 | 90.829190 | 47.155779 | 59.431347 | 65.586983 | 66.653355 | 87.797810 | 59.704082 | 38.621133 |
| std | 74.609997 | 132.822490 | 57.124467 | 81.939572 | 82.836844 | 197.824251 | 63.995373 | 98.232193 | 109.091074 | 118.804579 | 181.457338 | 95.210070 | 45.112760 |
| min | 10.000000 | 10.000000 | 10.000000 | 10.000000 | 10.000000 | 10.000000 | 10.000000 | 10.000000 | 10.000000 | 10.000000 | 10.000000 | 10.000000 | 10.000000 |
| 25% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 50% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 75% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| max | 1106.000000 | 2184.000000 | 766.000000 | 1057.000000 | 1089.000000 | 3808.000000 | 762.000000 | 1357.000000 | 1501.000000 | 1703.000000 | 2965.000000 | 1236.000000 | 518.000000 |

| | Binnenlandse Zaken en Koninkrijksrelaties | Buitenlandse Zaken | Defensie | Economische Zaken | Financiën | Justitie | Landbouw, Natuurbeheer en Visserij | Onderwijs, Cultuur en Wetenschappen | Sociale Zaken en Werkgelegenheid | Verkeer en Waterstaat | Volksgezondheid, Welzijn en Sport | Volkshuisvesting, Ruimtelijke Ordening en Milieubeheer | Vreemdelingenzaken en Integratie |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| zwitserland | NaN | 22.0 | NaN | 11.0 | NaN | 20.0 | NaN | NaN | NaN | 14.0 | 18.0 | NaN | NaN |
| zwitserse | NaN | NaN | NaN | NaN | NaN | 11.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| zwolle | NaN | NaN | NaN | NaN | NaN | 60.0 | NaN | 14.0 | NaN | 32.0 | 33.0 | NaN | NaN |
| zwolse | NaN | NaN | NaN | NaN | NaN | 11.0 | NaN | NaN | NaN | 12.0 | 10.0 | NaN | NaN |
| zzp | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 20.0 | NaN | NaN | NaN | NaN |

| | Mutual Information |
|---|---|
| justitie | 0.000401 |
| politie | 0.000245 |
| openbaar | 0.000213 |
| verdachte | 0.000201 |
| procureurs | 0.000183 |
| strafbare | 0.000182 |
| vervolging | 0.000179 |
| wetboek | 0.000163 |
| strafrechtelijk | 0.000160 |
| strafvordering | 0.000154 |
| officier | 0.000144 |
| verdachten | 0.000139 |
| strafrecht | 0.000134 |
| feiten | 0.000131 |
| opsporing | 0.000128 |
| rechter | 0.000128 |
| justiti | 0.000125 |
| strafzaak | 0.000120 |
| strafrechtelijke | 0.000119 |
| strafbaar | 0.000118 |
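
For reference, a mutual information score between one word and one ministry can be computed from four document counts. The sketch below uses the standard definition over a 2x2 contingency table with base-2 logarithms. It is an assumption that this matches how the scores in the table above were produced, since that computation is not shown here; the function name and argument names are hypothetical.

```python
import math


def mutual_information(n11, n10, n01, n00):
    """Mutual information (in bits) between a term and a class.

    n11: documents in the class that contain the term
    n10: documents outside the class that contain the term
    n01: documents in the class that lack the term
    n00: documents outside the class that lack the term
    """
    n = n11 + n10 + n01 + n00
    # (joint count, term marginal, class marginal) for each of the four cells
    cells = [
        (n11, n11 + n10, n11 + n01),
        (n10, n11 + n10, n10 + n00),
        (n01, n01 + n00, n11 + n01),
        (n00, n01 + n00, n10 + n00),
    ]
    total = 0.0
    for joint, term_marginal, class_marginal in cells:
        if joint > 0:  # a cell with count 0 contributes nothing (0 * log 0 := 0)
            total += (joint / n) * math.log2(n * joint / (term_marginal * class_marginal))
    return total


independent = mutual_information(1, 1, 1, 1)  # term and class unrelated
perfect = mutual_information(5, 0, 0, 5)      # term occurs exactly in the class
```

The score is 0 when the word's presence says nothing about the class and grows as presence or absence of the word becomes more informative, which is why class-specific words end up at the top of a ranking like the one above.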