{ "metadata": { "name": "", "signature": "sha256:bb9895f2a74d665b537d6bfcc11ff5e35d16ac2a5904bfeb9e8548109e4c6108" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "markdown", "metadata": [], "source": [ "> This is one of the 100 recipes of the [IPython Cookbook](http://ipython-books.github.io/), the definitive guide to high-performance scientific computing and data science in Python.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 8.4. Learning from text: Naive Bayes for Natural Language Processing" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this recipe, we show how to handle text data with scikit-learn. Working with text requires careful preprocessing and feature extraction. It is also quite common to deal with highly sparse matrices.\n", "\n", "We will learn to recognize whether a comment posted during a public discussion is considered insulting to one of the participants. We will use a labeled dataset from [Impermium](https://impermium.com), released during a [Kaggle competition](https://www.kaggle.com/c/detecting-insults-in-social-commentary)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You need to download the *troll* dataset on the book's website. (https://ipython-books.github.io)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "1. Let's import our libraries." ] }, { "cell_type": "code", "collapsed": false, "input": [ "import numpy as np\n", "import pandas as pd\n", "import sklearn\n", "import sklearn.cross_validation as cv\n", "import sklearn.grid_search as gs\n", "import sklearn.feature_extraction.text as text\n", "import sklearn.naive_bayes as nb\n", "import matplotlib.pyplot as plt\n", "%matplotlib inline\n" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "2. Let's open the csv file with Pandas." ] }, { "cell_type": "code", "collapsed": false, "input": [ "df = pd.read_csv(\"data/troll.csv\")" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "3. Each row is a comment. There are three columns: whether the comment is insulting (1) or not (0), the data, and the unicode-encoded contents of the comment." ] }, { "cell_type": "code", "collapsed": false, "input": [ "df[['Insult', 'Comment']].tail()" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "4. Now, we are going to define the feature matrix $\\mathbf{X}$ and the labels $\\mathbf{y}$." ] }, { "cell_type": "code", "collapsed": false, "input": [ "y = df['Insult']" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Obtaining the feature matrix from the text is not trivial. Scikit-learn can only work with numerical matrices. How to convert text into a matrix of numbers? A classical solution is to first extract a **vocabulary**: a list of words used throughout the corpus. Then, we can count, for each sample, the frequency of each word. We end up with a **sparse matrix**: a huge matrix containiny mostly zeros. Here, we do this in two lines. We will give more explanations in *How it works...*." ] }, { "cell_type": "code", "collapsed": false, "input": [ "tf = text.TfidfVectorizer()\n", "X = tf.fit_transform(df['Comment'])\n", "print(X.shape)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "5. 
, { "cell_type": "markdown", "metadata": {}, "source": [ "5. There are 3947 comments and 16469 different words. Let's estimate the sparsity of this feature matrix." ] },
{ "cell_type": "code", "collapsed": false, "input": [ "print(\"Each sample has ~{0:.2f}% non-zero features.\".format(\n", "      100 * X.nnz / float(X.shape[0] * X.shape[1])))" ], "language": "python", "metadata": {}, "outputs": [] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "6. Now, we are going to train a classifier as usual. We first split the data into a training and a test set." ] },
{ "cell_type": "code", "collapsed": false, "input": [ "(X_train, X_test,\n", " y_train, y_test) = cv.train_test_split(X, y,\n", "                                        test_size=.2)" ], "language": "python", "metadata": {}, "outputs": [] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "7. We use a **Bernoulli Naive Bayes classifier** with a grid search on the smoothing parameter $\\alpha$." ] },
{ "cell_type": "code", "collapsed": false, "input": [ "bnb = gs.GridSearchCV(nb.BernoulliNB(),\n", "                      param_grid={'alpha': np.logspace(-2., 2., 50)})\n", "bnb.fit(X_train, y_train);" ], "language": "python", "metadata": {}, "outputs": [] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "8. What is the performance of this classifier on the test dataset?" ] },
{ "cell_type": "code", "collapsed": false, "input": [ "bnb.score(X_test, y_test)" ], "language": "python", "metadata": {}, "outputs": [] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "9. Let's take a look at the words corresponding to the largest coefficients (the words we find frequently in insulting comments)." ] },
{ "cell_type": "code", "collapsed": false, "input": [ "# We first get the words corresponding to each feature.\n", "names = np.asarray(tf.get_feature_names())\n", "# Next, we display the 50 words with the largest\n", "# coefficients.\n", "print(','.join(names[np.argsort(\n", "    bnb.best_estimator_.coef_[0, :])[::-1][:50]]))" ], "language": "python", "metadata": {}, "outputs": [] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "10. Finally, let's test our estimator on a few test sentences." ] },
{ "cell_type": "code", "collapsed": false, "input": [ "print(bnb.predict(tf.transform([\n", "    \"I totally agree with you.\",\n", "    \"You are so stupid.\",\n", "    \"I love you.\"\n", "])))" ], "language": "python", "metadata": {}, "outputs": [] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "> You'll find all the explanations, figures, references, and much more in the book (to be released later this summer).\n", "\n", "> [IPython Cookbook](http://ipython-books.github.io/), by [Cyrille Rossant](http://cyrille.rossant.net), Packt Publishing, 2014 (500 pages)." ] }
], "metadata": {} } ] }