[Supplementary PNAS appendix](http://www.pnas.org/content/suppl/2014/12/12/1410931111.DCSupplemental/pnas.1410931111.sapp.pdf) \n", " * Lots of details\n", " * Free format\n", " * Extra references\n", "1. [Website](http://language.media.mit.edu/), called _Global Language Network_\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# The paper's website \n", "\n", "## Contains 6 clear tabs\n", "1. Content related\n", " 1. Visualizations\n", " 1. Rankings \n", " 1. Data\n", "1. Paper related\n", " 1. Paper\n", " 1. About\n", " 1. Press " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Non-content-related information\n", "\n", "## Paper\n", "* Bibliographic information\n", "* Link to paper\n", "* Abstract\n", "\n", "## Press\n", "* A list of venues, dates and links\n", "* Video\n", "\n", "## About\n", "* Affiliations, persons involved, grant acknowledgments" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/html": [ "" ], "text/plain": [ "" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from IPython.display import HTML\n", "HTML('')\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Content-related sections\n", "\n", "## Further interactive exploration of data\n", "* [Network Visualizations](http://language.media.mit.edu/visualizations/books)\n", " * Explore the language **network** for all three datasets\n", "* [Rankings](http://language.media.mit.edu/rankings/books)\n", " * Explore characteristics of each language with an **interactive spreadsheet**\n", " \n", "## Access to (almost all) data\n", "* \n", "* For each data set:\n", " * metadata\n", " * raw data\n", " * final network (nodes and edges)\n", "* Core (and expensive) measures precomputed (betweenness and eigenvector
centrality)\n", "* Additional datasets used, in handy CSV format\n", "* Even the Cytoscape file with the interactive networks" ] }, { "cell_type": "code", "execution_count": 61, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "curl: (6) Could not resolve host: language.media.mit.edu\r\n" ] } ], "source": [ "!curl 'http://language.media.mit.edu/data'" ] }, { "cell_type": "code", "execution_count": 60, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "--2015-01-22 15:26:23-- http://language.media.mit.edu/data/public/twitter_userlang_iso639-3a.tsv.zip\n", "Resolving language.media.mit.edu... failed: nodename nor servname provided, or not known.\n", "wget: unable to resolve host address `language.media.mit.edu'\n" ] } ], "source": [ "!wget \"http://language.media.mit.edu/data/public/twitter_userlang_iso639-3a.tsv.zip\"" ] }, { "cell_type": "code", "execution_count": 57, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "-rw-r--r-- 1 admin staff 188M May 21 2013 twitter_userlang_iso639-3a.tsv\n", " 10000000 22685108 197378908 twitter_userlang_iso639-3a.tsv\n", "0000saraa\tnld,2\teng,2\n", "00012amal\tnld,1\n", "0001_xml\tnld,11\teng,88\n", "000annika000\tnld,1\n", "000debbie000\tnld,2\teng,1\n", "000hicham000\tnld,2\n", "000jesse000\tnld,3\n", "000marianne\tnld,1\n", "000remco000\tnld,36\teng,2\n", "000shirley000\tnld,2\n" ] } ], "source": [ "#!unzip twitter_userlang_iso639-3a.tsv.zip\n", "!
ls -lh twitter_userlang_iso639-3a.tsv\n", "!wc twitter_userlang_iso639-3a.tsv\n", " \n", "!grep 'nld,' twitter_userlang_iso639-3a.tsv |head" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## Twitter data\n", "\n", "* Parsed from 1G tweets.\n", "* Each line contains \n", " * userid\n", " * how many tweets per language that user has tweeted\n", "* 10M users per file, 4 files" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Wrap up _form of paper_\n", "\n", "1. Done very well\n", "1. Benefits at several layers:\n", " * Scientific\n", " * easy to continue the work\n", " * reproducible\n", " * stepwise zooming in on the work\n", " * Societal\n", " * easy to _play_ with the results\n", " * press releases\n", " * Education\n", " * easy to let students redo analysis\n", " * Grant providers\n", " * More concrete output than a bibitem\n", "1. All these benefits turn back to the authors, their lab, faculty, institute " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "# Possible improvements\n", "\n", "1. Presentation(s) of the results (slides, videos)\n", "1. Code\n", " * Notebooks for data processing\n", " * Notebooks for statistical and network analysis \n", "\n", "## Follow up (keep the paper/site alive)\n", "1. Inlinks to the paper (i.e. 
citations)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Content of the paper\n", "\n", "## Research question: How can we measure the global influence of a language?\n", "\n", "## Motivation\n", "* Only very poor measures used: \n", " * number of speakers\n", " * economic power of countries with language as state language\n", "* Poor because the influence of a language does not come from the number of people who speak it,\n", "* but rather from _who speaks it_\n", "* Think of **Latin**, the universal language for over 1000 years\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "# Research method\n", "\n", "1. Collect data about _\"interactions of languages\"_\n", " * translations (here _books_)\n", " * people who perform a task in several languages\n", " * Tweet \n", " * edit an article in several languages (Wikipedia)\n", "1. Lift data to languages themselves: create **networks of languages**\n", "1. **Sanity checks:** \n", " * Repeat analysis on different datasets\n", " * Analyse their correlation\n", "1. Define _influence of L_ as the _centrality of L_ in the language network" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## Evaluation\n", "1. Show correlation between \n", " * proposed measure of influence of L\n", " * number of famous people with L as mother tongue\n", "1. Show that proposed measure explains _number of famous people_ better than previous measures\n", "1. Number of famous people is measured in two ways." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "# Results\n", "\n", "1. Global influence of a language measured as network centrality is a sensible and robust measure.\n", "1. _English_ is the central hub in the network\n", " * _Intermediate hubs:_ German, French, Spanish\n", "1.
Widely spoken languages (Chinese, Hindi, Arabic) are peripheral in the network" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# To finish: hands on" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "# From data to network (example Twitter)\n", "\n", "1. Collect Tweets\n", "1. Identify language of Tweets\n", "1. Identify users who tweet in more than one language.\n", " * For _sanity_ do\n", " * remove tweeters who have too few tweets (here 6)\n", " * remove tweeters who tweet in too many languages (here more than 5)\n", " * only consider languages in which the user tweeted twice or more\n", "\n", "1. weight(L1,L2) = len([ t for t in tweeters if t tweets in L1 and L2])\n", "1. edge(L1,L2) iff weight(L1,L2) at least 6 AND correlation between L1 and L2 is statistically significant\n", "\n", " " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "" ] }
# Take home message

1. A paper can be much more than a paper.
 * **Form matters**
1. Tapping the collective intelligence hidden in "our kind of data" can lead to exciting science.
 * **Cross disciplinary research**
1. Evaluation without a gold standard
 * **Creativity pays** (when paired with solid statistics)
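
## Hands on: a sketch of weight(L1,L2)

As a starting point for the hands-on part, here is a minimal sketch of the "From data to network" steps for the Twitter file. It assumes the tab-separated line format shown in the `grep` output earlier (`userid<TAB>lang,count<TAB>lang,count...`). The constant and function names are hypothetical, and reading "too few tweets (here 6)" as "fewer than 6 tweets in total" is an assumption. The statistical significance test needed for `edge(L1,L2)` is not implemented here, and this is not the authors' code.

```python
from collections import Counter
from itertools import combinations

# Thresholds taken from the slide; how they are applied is an assumption.
MIN_TWEETS = 6           # assumed: drop users with fewer than 6 tweets in total
MAX_LANGS = 5            # drop users who tweet in more than 5 languages
MIN_TWEETS_PER_LANG = 2  # a language counts only if the user tweeted in it twice or more


def parse_line(line):
    """Parse 'userid<TAB>lang,count<TAB>...' into (userid, {lang: count})."""
    user, *fields = line.rstrip("\n").split("\t")
    counts = {}
    for field in fields:
        lang, n = field.rsplit(",", 1)
        counts[lang] = int(n)
    return user, counts


def language_weights(lines):
    """Count, for each language pair, the users who tweet in both languages."""
    weights = Counter()
    for line in lines:
        _, counts = parse_line(line)
        if sum(counts.values()) < MIN_TWEETS:
            continue
        if len(counts) > MAX_LANGS:
            continue
        langs = sorted(l for l, n in counts.items() if n >= MIN_TWEETS_PER_LANG)
        # Each unordered pair of the user's languages gets one vote.
        for pair in combinations(langs, 2):
            weights[pair] += 1
    return weights


# A few lines copied from the grep output shown earlier in the notebook.
sample = [
    "0000saraa\tnld,2\teng,2",
    "0001_xml\tnld,11\teng,88",
    "000debbie000\tnld,2\teng,1",
    "000remco000\tnld,36\teng,2",
]
weights = language_weights(sample)
print(weights)
```

To run this on the real data, pass an open file handle instead of `sample`, for example `language_weights(open("twitter_userlang_iso639-3a.tsv"))`. Pairs with weight below 6 would then be dropped before testing significance.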