{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Practical Data Science in Python\n", "\n", "#### Small additions by Maarten Marx, UvA 2016-02" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "This notebook accompanies my talk on \"Data Science with Python\" at the [University of Economics](https://www.vse.cz/english/) in Prague, December 2014. Questions & comments welcome [@RadimRehurek](https://twitter.com/radimrehurek).\n", "\n", "The goal of this talk is to demonstrate some high level, introductory concepts behind (text) machine learning. The concepts are demonstrated by concrete code examples in this notebook, which you can run yourself (after installing IPython, see below), on your own computer.\n", "\n", "The talk audience is expected to have some basic programming knowledge (though not necessarily Python) and some basic introductory data mining background. This is *not* an \"advanced talk\" for machine learning experts.\n", "\n", "The code examples build a working, executable prototype: an app to classify phone SMS messages in English (well, the \"SMS kind\" of English...) as either \"spam\" or \"ham\" (=not spam)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[![](http://radimrehurek.com/data_science_python/python.png)](http://xkcd.com/353/)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The language used throughout will be [Python](https://www.python.org/), a general purpose language helpful in all parts of the pipeline: I/O, data wrangling and preprocessing, model training and evaluation. While Python is by no means the only choice, it offers a unique combination of flexibility, ease of development and performance, thanks to its mature scientific computing ecosystem. 
Its vast, open-source ecosystem also avoids the lock-in (and associated bitrot) of any single specific framework or library.\n", "\n", "Python (and most of its libraries) is also platform-independent, so you can run this notebook on Windows, Linux or OS X without a change.\n", "\n", "You're looking at one of these Python tools right now: the IPython notebook, which is interactive Python rendered as HTML. We'll go over other practical tools, widely used in the data science industry, below." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n",
"| label | | message |\n",
"|---|---|---|\n",
"| ham | count | 4827 |\n",
"| | unique | 4518 |\n",
"| | top | Sorry, I'll call later |\n",
"| | freq | 30 |\n",
"| spam | count | 747 |\n",
"| | unique | 653 |\n",
"| | top | Please call our customer service representativ... |\n",
"| | freq | 4 |\n", "