{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# GenderAnalysisCSPublications\n", "\n", "* We want to redo the analysis in the Nature article \n", "* Data from DBLP\n", "* Gender assignment using American names, as we did with Wikipedia\n", "\n", "### Todo\n", "* Get rid of diacritics\n", "* Maybe more normalization (Irina, Irini, etc. -> Irine)\n", "* Check if authors occur with different spellings (e.g. Maarten Marx and M. Marx)\n", " * Check if that is solvable\n", " * Otherwise crawl the DBLP site\n", " * On the author pages everything seems normalized \n", " \n", "### Recall with only American names:\n", "U 0.686486\n", "M 0.252337\n", "F 0.061177\n", "\n", "* I set the threshold at one gender occurring 16 times more often than the other.\n", "* The Nature paper used 10 times; lowering ours would of course help recall.\n", "* We can probably improve recall a lot by setting a threshold on the minimum number of publications. \n", " * E.g. 50% have at most 2 publications\n", " * 11% have more than 10 publications. \n", "\n", "#### Diacritics should help at least 5% \n", "\n", "## Other things\n", "\n", "* Create the coauthorship network, look for homophily.\n", "* Track the positions of males/females in the (non-alphabetical) author lists\n", "* Possibly differentiate between high-quality and other outlets\n", "* Focus on one subfield (e.g., databases)\n", "* Collect affiliations (DBLP has them)" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# __future__ imports must precede all other statements in Python 2\n", "from __future__ import division\n", "\n", "import pandas as pd\n", "import requests\n", "from bs4 import BeautifulSoup\n", "import re\n", "import time\n", "import random" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "--2016-02-04 17:38:58-- http://dblp.uni-trier.de/xml/dblp.xml.gz\n", "Resolving dblp.uni-trier.de... 136.199.55.186\n", "Connecting to dblp.uni-trier.de|136.199.55.186|:80... 
connected.\n", "HTTP request sent, awaiting response... 200 OK\n", "Length: 328840008 (314M) [application/x-gzip]\n", "Saving to: `dblp.xml.gz'\n", "\n", "100%[======================================>] 328,840,008 6.96M/s in 40s \n", "\n", "2016-02-04 17:39:39 (7.79 MB/s) - `dblp.xml.gz' saved [328840008/328840008]\n", "\n" ] } ], "source": [ "!wget http://dblp.uni-trier.de/xml/dblp.xml.gz" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "gunzip: error writing to output: Broken pipe\r\n", "gunzip: dblp.xml.gz: uncompress failed\r\n" ] } ], "source": [ "!gunzip -c dblp.xml |head -1000 > testje" ] }, { "cell_type": "code", "execution_count": 34, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 1.68 s, sys: 1.15 s, total: 2.83 s\n", "Wall time: 1min 14s\n" ] } ], "source": [ "%time authors = !gunzip -c dblp.xml|grep '^'|sed 's/<[^>]*>//g'" ] }, { "cell_type": "code", "execution_count": 38, "metadata": { "collapsed": false }, "outputs": [], "source": [ "effe = !cat testje|grep --before-context=1 --after-context=1 '^'" ] }, { "cell_type": "code", "execution_count": 39, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "['
',\n", " 'Sanjeev Saxena',\n", " 'Parallel Integer Sorting and Simulation Amongst CRCW Models.',\n", " '--',\n", " '--',\n", " '
',\n", " 'Hans-Ulrich Simon',\n", " 'Pattern Matching in Trees and Nets.',\n", " '--',\n", " '--',\n", " '
',\n", " 'Nathan Goodman',\n", " 'Oded Shmueli',\n", " 'NP-complete Problems Simplified on Tree Schemas.',\n", " '--',\n", " '--',\n", " '
',\n", " 'Norbert Blum',\n", " 'On the Power of Chain Rules in Context Free Grammars.',\n", " '--',\n", " '--',\n", " '
',\n", " 'Arnold Schönhage',\n", " 'Schnelle Multiplikation von Polynomen über Körpern der Charakteristik 2.',\n", " '--',\n", " '--',\n", " '
',\n", " 'Juha Honkala',\n", " 'A characterization of rational D0L power series.',\n", " '--',\n", " '--',\n", " '
',\n", " 'Chua-Huang Huang',\n", " 'Christian Lengauer',\n", " 'The Derivation of Systolic Implementations of Programs.',\n", " '--',\n", " '--',\n", " '
',\n", " 'Alain Finkel',\n", " 'Annie Choquet',\n", " 'Fifo Nets Without Order Deadlock.',\n", " '--',\n", " '--',\n", " '
',\n", " 'Joachim Biskup',\n", " 'On the Complementation Rule for Multivalued Dependencies in Database Relations.',\n", " '--',\n", " '--',\n", " '
',\n", " 'Symeon Bozapalidis',\n", " 'Zoltán Fülöp 0001',\n", " 'George Rahonis',\n", " 'Equational weighted tree transformations.',\n", " '--',\n", " '--',\n", " '
',\n", " 'Victor Khomenko',\n", " 'Alex Kondratyev',\n", " 'Maciej Koutny',\n", " 'Walter Vogler',\n", " 'Merged processes: a new condensed representation of Petri net behaviour.',\n", " '--',\n", " '--',\n", " '
',\n", " 'Wim H. Hesselink',\n", " 'Verifying a simplification of mutual exclusion by Lycklama-Hadzilacos.',\n", " '--',\n", " '--',\n", " '
',\n", " 'Christian Ronse',\n", " 'A Three-Stage Construction for Multiconnection Networks.',\n", " '--',\n", " '--',\n", " '
',\n", " 'Carol Critchlow',\n", " 'Prakash Panangaden',\n", " 'The Expressive Power of Delay Operators in SCCS.',\n", " '--',\n", " '--',\n", " '
',\n", " 'Robin Milner',\n", " 'Calculi for Interaction.',\n", " '--',\n", " '--',\n", " '
',\n", " 'John Darlington',\n", " 'A Synthesis of Several Sorting Algorithms.',\n", " '--',\n", " '--',\n", " '
',\n", " 'Maria Calzarossa',\n", " 'M. Italiani',\n", " 'Giuseppe Serazzi',\n", " 'A Workload Model Representative of Static and Dynamic Characteristics.',\n", " '--',\n", " '--',\n", " '
',\n", " 'Vincent Vajnovszki',\n", " 'Gray visiting Motzkins.',\n", " '--',\n", " '--',\n", " '
',\n", " 'Walter Vogler',\n", " 'Christian Stahl',\n", " 'Richard Müller 0001',\n", " 'Trace- and failure-based semantics for responsiveness.',\n", " '--',\n", " '--',\n", " '
',\n", " 'Luc Devroye',\n", " 'Branching Processes in the Analysis of the Heights of Trees.',\n", " '--',\n", " '--',\n", " '
',\n", " 'T. C. Hu',\n", " 'K. C. Tan',\n", " 'Least Upper Bound on the Cost of Optimum Binary Search Trees.',\n", " '--',\n", " '--',\n", " '
',\n", " 'William R. Franta',\n", " 'The Mathematical Analysis of the Computer System Modeled as a Two Stage Cyclic Queue.',\n", " '--',\n", " '--',\n", " '
',\n", " 'Ekkart Kindler',\n", " 'Invariants, Composition, and Substitution.',\n", " '--',\n", " '--',\n", " '
',\n", " 'Demetres D. Kouvatsos',\n", " 'Maximum Entropy and the <i> G/G/1/N </i> Queue.',\n", " '--',\n", " '--',\n", " '
',\n", " 'Sergei Gorlatch',\n", " 'Christian Lengauer',\n", " 'Abstraction and Performance in the Design of Parallel Programs: An Overview of the SAT Approach.',\n", " '--',\n", " '--',\n", " '
',\n", " 'Roland Meyer',\n", " 'A theory of structural stationarity in the <i>pi</i> -Calculus.',\n", " '--',\n", " '--',\n", " '
',\n", " 'Stefan Reisch',\n", " 'Hex ist PSPACE-vollständig.',\n", " '--',\n", " '--',\n", " '
',\n", " 'Erzsébet Csuhaj-Varjú',\n", " 'Victor Mitrana',\n", " 'Evolutionary Systems: A Language Generating Device Inspired by Evolving Communities of Cells.',\n", " '--',\n", " '--',\n", " '
',\n", " 'Ryszard Janicki',\n", " 'Relational structures model of concurrency.',\n", " '--',\n", " '--',\n", " '
',\n", " 'Bruce Russell',\n", " 'On an Equivalence between Continuation and Stack Semantics.',\n", " '--',\n", " '--',\n", " '
',\n", " 'Andrzej Ehrenfeucht',\n", " 'Grzegorz Rozenberg',\n", " 'Nonterminals Versus Homomorphisms in Defining Languages for Some Classes of Rewriting Systems.',\n", " '--',\n", " '--',\n", " '
',\n", " 'Rainer Kemp',\n", " 'A Note on the Density of Inherently Ambiguous Context-free Languages.',\n", " '--',\n", " '--',\n", " '
',\n", " 'Yijie Han',\n", " 'Yoshihide Igarashi',\n", " 'Time Lower Bounds for Parallel Sorting on a Mesh-Conected Processor Array.',\n", " '--',\n", " '--',\n", " '
',\n", " 'Ian F. Akyildiz',\n", " 'Horst von Brand',\n", " 'Computational Algorithms for Networks of Queues with Rejection Blocking.',\n", " '--',\n", " '--',\n", " '
',\n", " 'X. J. Chen',\n", " 'Carlo Montangero',\n", " 'Compositional Refinements in Multiple Blackboard Systems.',\n", " '--',\n", " '--',\n", " '
',\n", " 'Rob J. van Glabbeek',\n", " 'Ursula Goltz',\n", " 'Ernst-Rüdiger Olderog',\n", " 'Special issue on \"Combining Compositionality and Concurrency\": part 1.',\n", " '--',\n", " '--',\n", " '
',\n", " 'Antonella Santone',\n", " 'Automatic verification of concurrent systems using a formula-based compositional approach.',\n", " '--',\n", " '--',\n", " '
',\n", " 'Eric C. R. Hehner',\n", " 'On Removing the Machine from the Language.',\n", " '--',\n", " '--',\n", " '
',\n", " 'Moon-Jung Chung',\n", " 'Michael Evangelist',\n", " 'Ivan Hal Sudborough',\n", " 'Complete Problems for Space Bounded Subclasses of NP.',\n", " '--',\n", " '--',\n", " '
',\n", " 'Peter E. Bulychev',\n", " 'Alexandre David',\n", " 'Kim G. Larsen',\n", " 'Guangyuan Li',\n", " 'Efficient controller synthesis for a fragment of MTL<sub>0,∞</sub>.',\n", " '--',\n", " '--',\n", " '
',\n", " 'Timothy A. Budd',\n", " 'Dana Angluin',\n", " 'Two Notions of Correctness and Their Relation to Testing.',\n", " '--',\n", " '--',\n", " '
',\n", " 'George Markowsky',\n", " 'Best Huffman Trees.',\n", " '--',\n", " '--',\n", " '
',\n", " 'Manfred P. Stadel',\n", " 'Behandlung verschiedener INTEGER-Darstellungen durch optimierende Compiler.',\n", " '--',\n", " '--',\n", " '
',\n", " 'David A. Watt',\n", " 'The Parsing Problem for Affix Grammars.',\n", " '--',\n", " '--',\n", " '
',\n", " 'Paul S. Amerins',\n", " 'Ricardo A. Baeza-Yates',\n", " 'Derick Wood',\n", " 'On Efficient Entreeings.',\n", " '--',\n", " '--',\n", " '
',\n", " 'Ernst W. Mayr',\n", " 'Persistence of Vector Replacement Systems is Decidable.',\n", " '--',\n", " '--',\n", " '
',\n", " 'Karel Culik II',\n", " 'Simant Dube',\n", " 'Implementing Daubechies Wavelet Transform with Weighted Finite Automata.',\n", " '--',\n", " '--',\n", " '
',\n", " 'Armin B. Cremers',\n", " 'Thomas N. Hibbard',\n", " 'Functional Behavior in Data Spaces.',\n", " '--',\n", " '--',\n", " '
',\n", " 'William P. R. Mitchell',\n", " 'Inductive Completion with Retracts.',\n", " '--',\n", " '--',\n", " '
',\n", " 'Patrick Cousot',\n", " 'Radhia Cousot',\n", " 'Sometime = Always + Recursion = Always on the Equivalence of the Intermittent and Invariant Assertions Methods for Proving Inevitability Properties of Programs.',\n", " '--',\n", " '--',\n", " '
',\n", " 'Isi Mitrani',\n", " 'J. H. Hine',\n", " 'Complete Parameterized Families of Job Scheduling Strategies.',\n", " '--',\n", " '--',\n", " '
',\n", " 'Joseph M. Morris',\n", " 'Malcolm Tyrrell',\n", " 'Modelling higher-order dual nondeterminacy.',\n", " '--',\n", " '--',\n", " '
',\n", " 'Jürgen Nehmer',\n", " 'Dispatcher Primitives for the Construction of Operating System Kernels.',\n", " '--',\n", " '--',\n", " '
',\n", " 'Walter J. Savitch',\n", " 'A Note on Multihead Automata and Context-Sensitive Languages',\n", " '--',\n", " '--',\n", " '
',\n", " 'Karl Meinke',\n", " 'A Recursive Second Order Initial Algebra Specification of Primitive Recursion.',\n", " '--',\n", " '--',\n", " '
',\n", " 'Joost Engelfriet',\n", " 'Linda Heyker',\n", " 'George Leih',\n", " 'Context-Free Graph Languages of Bounded Degree are Generated by Apex Graph Grammars.',\n", " '--',\n", " '--',\n", " '
',\n", " 'Shyamal K. Chowdhury',\n", " 'Pradip K. Srimani',\n", " 'Worst Case Performance of Weighted Buddy Systems.',\n", " '--',\n", " '--',\n", " '
',\n", " 'Ugo Montanari',\n", " 'Francesca Rossi',\n", " 'Contextual Nets.',\n", " '--',\n", " '--',\n", " '
',\n", " 'Peter E. Lauer',\n", " 'Piero R. Torrigiani',\n", " 'M. W. Shields',\n", " 'COSY - A System Specification Language Based on Paths and Processes.',\n", " '--',\n", " '--',\n", " '
',\n", " 'Rudolf Berghammer',\n", " 'Applying relation algebra and Rel View to solve problems on orders and lattices.',\n", " '--',\n", " '--',\n", " '
',\n", " 'David Pager',\n", " 'Eliminating Unit Productions from LR Parsers.',\n", " '--',\n", " '--',\n", " '
',\n", " 'Peter Lipps',\n", " 'Ulrich Möncke',\n", " 'Matthias Olk',\n", " 'Reinhard Wilhelm',\n", " 'Attribute (Re)evaluation in OPTRAN.',\n", " '--',\n", " '--',\n", " '
',\n", " 'Sanguthevar Rajasekaran',\n", " 'Sandeep Sen',\n", " 'On Parallel Integer Sorting.',\n", " '--',\n", " '--',\n", " '
',\n", " 'Richard Hull',\n", " 'Jianwen Su',\n", " 'Domain Independence and the Relational Calculus.',\n", " '--',\n", " '--',\n", " '
',\n", " 'John K. Lee',\n", " 'Alan Fekete',\n", " 'Multi-Granularity Locking for Nested Transactions: A Proof Using a Possibilities Mapping.',\n", " '--',\n", " '--',\n", " '
',\n", " 'Teuvo Laurinolli',\n", " 'Bounded Quantification and Relations Recognizable by Finite Automata.',\n", " '--',\n", " '--',\n", " '
',\n", " 'Gary T. Leavens',\n", " 'Don Pigozzi',\n", " 'A Complete Algebraic Characterization of Behavioral Subtyping.',\n", " '--',\n", " '--',\n", " '
',\n", " 'Y. Daniel Liang',\n", " 'Maw-Shang Chang',\n", " 'Minimum Feedback Vertex Sets in Cocomparability Graphs and Convex Bipartite Graphs.',\n", " '--',\n", " '--',\n", " '
',\n", " 'Gilles Bernot',\n", " 'Michel Bidoit',\n", " 'Teodor Knapik',\n", " 'Behavioural Approaches to Algebraic Specifications: A Comparative Study.',\n", " '--',\n", " '--',\n", " '
',\n", " 'Jan A. Bergstra',\n", " 'C. A. Middelburg',\n", " 'Instruction sequence processing operators.',\n", " '--',\n", " '--',\n", " '
',\n", " 'Wlodzimierz Drabent',\n", " 'What is Failure? An Approach to Constructive Negation.',\n", " '--',\n", " '--',\n", " '
',\n", " 'Victor Khomenko',\n", " 'Mark Schäfer',\n", " 'Walter Vogler',\n", " 'Ralf Wollowski',\n", " 'STG decomposition strategies in combination with unfolding.',\n", " '--',\n", " '--',\n", " '
',\n", " 'Raphael A. Finkel',\n", " 'Jon Louis Bentley',\n", " 'Quad Trees: A Data Structure for Retrieval on Composite Keys.',\n", " '--',\n", " '--',\n", " '
',\n", " 'Alfred Schmitt',\n", " 'On the Number of Relational Operators Necessary to Compute Certain Functions of Real Variables.',\n", " '--',\n", " '--',\n", " '
',\n", " 'Jacques Labetoulle',\n", " 'Guy Pujolle',\n", " 'A Study of Queueing Networks with Deterministic Service and Application to Computer Networks.',\n", " '--',\n", " '--',\n", " '
',\n", " 'Paul Pritchard',\n", " 'Explaining the Wheel Sieve.',\n", " '--',\n", " '--',\n", " '
',\n", " 'Robert T. Moenck',\n", " 'Another Polynomial Homomorphism.',\n", " '--',\n", " '--',\n", " '
',\n", " 'Philip Heidelberger',\n", " 'Variance Reduction Techniques for the Simulation of Markov Process.',\n", " '--',\n", " '--',\n", " '
',\n", " 'Siegfried Bublitz',\n", " 'Decomposition of Graphs and Monotone Formula Size of Homogeneous Functions.',\n", " '--',\n", " '--',\n", " '
',\n", " 'Benedek Nagy',\n", " 'Friedrich Otto',\n", " 'Deterministic pushdown-CD-systems of stateless deterministic R(1)-automata.',\n", " '--',\n", " '--',\n", " '
',\n", " 'Gheorghe Paun',\n", " 'On the Synchronization in Parallel Communicating Grammar Systems.',\n", " '--',\n", " '--',\n", " '
',\n", " 'P. F. Schuler',\n", " 'Weakly Context-Sensitive Languages as Model for Programming Languages.',\n", " '--',\n", " '--',\n", " '
',\n", " 'George W. Ernst',\n", " 'Rules of Inference for Procedure Calls.',\n", " '--',\n", " '--',\n", " '
',\n", " 'Günther E. Pfaff',\n", " 'The Construction of Operator Interfaces Based on Logical Input Devices.',\n", " '--',\n", " '--',\n", " '
',\n", " 'Joost Engelfriet',\n", " 'Heiko Vogler',\n", " 'High Level Tree Transducers and Iterated Pushdown Tree Transducers.',\n", " '--',\n", " '--',\n", " '
',\n", " 'Stefan Reisch',\n", " 'Gobang ist PSPACE-vollständig.',\n", " '--',\n", " '--',\n", " '
',\n", " 'Chen-Ming Fan',\n", " 'Cheng-Chih Huang',\n", " 'Huei-Jan Shyr',\n", " 'Kuo-Hsiang Chen',\n", " 'A note on autodense related languages.']" ] }, "execution_count": 39, "metadata": {}, "output_type": "execute_result" } ], "source": [ "effe" ] }, { "cell_type": "code", "execution_count": 33, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "(0, 0, [])" ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from collections import Counter\n", "\n", "author_counts= Counter(authors)\n", "len(authors), len(set(authors)), author_counts.most_common(10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Percentage of authors with at most n papers" ] }, { "cell_type": "code", "execution_count": 53, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "[(1, 0.001),\n", " (2, 0.505),\n", " (3, 0.662),\n", " (4, 0.74),\n", " (5, 0.788),\n", " (6, 0.821),\n", " (7, 0.846),\n", " (8, 0.865),\n", " (9, 0.88),\n", " (10, 0.892),\n", " (100, 0.995),\n", " (1000, 1.0)]" ] }, "execution_count": 53, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#Hapax= [c for c in author_counts if author_counts[c]<=2]\n", "[ (n,round(len([c for c in author_counts if author_counts[c]<=n])/len(author_counts),3)) for n in range(1,11)+[100,1000]]" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "(10816262,\n", " 183286,\n", " [('David', 105539),\n", " ('Michael', 98534),\n", " ('John', 69160),\n", " ('Peter', 64243),\n", " ('M.', 59838),\n", " ('Thomas', 59189),\n", " ('Robert', 59184),\n", " ('Daniel', 52697),\n", " ('J.', 50339),\n", " ('A.', 48892)])" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "firstnames= [n.split()[0] for n in authors]\n", "firstnames_counts=Counter(firstnames)\n", "len(firstnames), len(set(firstnames)), 
firstnames_counts.most_common(10)" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "(10816262,\n", " 183286,\n", " [('David', 105539),\n", " ('Michael', 98534),\n", " ('John', 69160),\n", " ('Peter', 64243),\n", " ('M.', 59838),\n", " ('Thomas', 59189),\n", " ('Robert', 59184),\n", " ('Daniel', 52697),\n", " ('J.', 50339),\n", " ('A.', 48892)])" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Get rid of HTML entities\n", "import HTMLParser\n", "pars = HTMLParser.HTMLParser()\n", "firstnames=[pars.unescape(name) for name in firstnames]\n", "firstnames_counts=Counter(firstnames)\n", "len(firstnames), len(set(firstnames)), firstnames_counts.most_common(10)" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['Sanjeev', 'Hans-Ulrich', 'Nathan', 'Oded', 'Norbert', 'Arnold', 'Juha', 'Chua-Huang', 'Christian', 'Alain', 'Annie', 'Joachim', 'Symeon', u'Zolt\\xe1n', 'George', 'Victor', 'Alex', 'Maciej', 'Walter', 'Wim', 'Christian', 'Carol', 'Prakash', 'Robin', 'John', 'Maria', 'M.', 'Giuseppe', 'Vincent', 'Walter', 'Christian', 'Richard', 'Luc', 'T.', 'K.', 'William', 'Ekkart', 'Demetres', 'Sergei', 'Christian', 'Roland', 'Stefan', u'Erzs\\xe9bet', 'Victor', 'Ryszard', 'Bruce', 'Andrzej', 'Grzegorz', 'Rainer', 'Yijie', 'Yoshihide', 'Ian', 'Horst', 'X.', 'Carlo', 'Rob', 'Ursula', u'Ernst-R\\xfcdiger', 'Antonella', 'Eric', 'Moon-Jung', 'Michael', 'Ivan', 'Peter', 'Alexandre', 'Kim', 'Guangyuan', 'Timothy', 'Dana', 'George', 'Manfred', 'David', 'Paul', 'Ricardo', 'Derick', 'Ernst', 'Karel', 'Simant', 'Armin', 'Thomas', 'William', 'Patrick', 'Radhia', 'Isi', 'J.', 'Joseph', 'Malcolm', u'J\\xfcrgen', 'Walter', 'Karl', 'Joost', 'Linda', 'George', 'Shyamal', 'Pradip', 'Ugo', 'Francesca', 'Peter', 'Piero', 'M.']\n" ] } ], "source": [ "print 
firstnames[:100]" ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
nameFMAllratiologratiogender
0Helene32797032797inf44.898629F
1Helena30892030892inf44.812299F
2Aileen30580030580inf44.797654F
3Leanne28127028127inf44.677021F
4Julianne27390027390inf44.638714F
\n", "
" ], "text/plain": [ " name F M All ratio logratio gender\n", "0 Helene 32797 0 32797 inf 44.898629 F\n", "1 Helena 30892 0 30892 inf 44.812299 F\n", "2 Aileen 30580 0 30580 inf 44.797654 F\n", "3 Leanne 28127 0 28127 inf 44.677021 F\n", "4 Julianne 27390 0 27390 inf 44.638714 F" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "!cp ../GenderWikipedia/NonAmbiguousNames.csv .\n", "\n", "Names= pd.read_csv('NonAmbiguousNames.csv')\n", "Names.head()" ] }, { "cell_type": "code", "execution_count": 33, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "F 1184\n", "M 791\n", "Name: gender, dtype: int64" ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "Names.gender.value_counts()" ] }, { "cell_type": "code", "execution_count": 25, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def DetermineGender(name):\n", " if name in list(Names.name):\n", " return list(Names[(Names.name==name)].gender)[0]\n", " else:\n", " return 'U'\n", " \n", " " ] }, { "cell_type": "code", "execution_count": 24, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
name
0Tzu-Herng
1Feiji
2Dage
3Erdun
4ChunJuan
\n", "
" ], "text/plain": [ " name\n", "0 Tzu-Herng\n", "1 Feiji\n", "2 Dage\n", "3 Erdun\n", "4 ChunJuan" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "Uniquenames= pd.DataFrame(list(set(firstnames)) )\n", "Uniquenames.columns=['name']\n", "Uniquenames.head()" ] }, { "cell_type": "code", "execution_count": 26, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 21.6 s, sys: 143 ms, total: 21.8 s\n", "Wall time: 22.2 s\n" ] }, { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
namegender
0Tzu-HerngU
1FeijiU
2DageU
3ErdunU
4ChunJuanU
\n", "
" ], "text/plain": [ " name gender\n", "0 Tzu-Herng U\n", "1 Feiji U\n", "2 Dage U\n", "3 Erdun U\n", "4 ChunJuan U" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%time Uniquenames['gender']= Uniquenames.name.apply(DetermineGender)\n", "Uniquenames.head()" ] }, { "cell_type": "code", "execution_count": 27, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "U 181411\n", "F 1099\n", "M 776\n", "Name: gender, dtype: int64" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "Uniquenames.gender.value_counts()" ] }, { "cell_type": "code", "execution_count": 28, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 55min 25s, sys: 12.3 s, total: 55min 38s\n", "Wall time: 55min 45s\n" ] } ], "source": [ "# This is a stupid way of doing this because we very often do the same work (e.g with John)\n", "Allnames= pd.DataFrame(firstnames)\n", "Allnames.columns=['name']\n", "%time Allnames['gender']= Allnames.name.apply(DetermineGender)" ] }, { "cell_type": "code", "execution_count": 34, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "U 0.686486\n", "M 0.252337\n", "F 0.061177\n", "Name: gender, dtype: float64" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "Allnames.gender.value_counts()/len(Allnames)" ] }, { "cell_type": "code", "execution_count": 38, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 7.07 s, sys: 1.13 s, total: 8.2 s\n", "Wall time: 8.5 s\n" ] } ], "source": [ "% time ef= pd.merge(Allnames,Uniquenames, on='name')" ] }, { "cell_type": "code", "execution_count": 39, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "Index([u'name', u'gender_x', u'gender_y'], dtype='object')" ] }, "execution_count": 39, 
"metadata": {}, "output_type": "execute_result" } ], "source": [ "ef.columns" ] }, { "cell_type": "code", "execution_count": 40, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "U 7425215\n", "M 2729342\n", "F 661705\n", "Name: gender_x, dtype: int64" ] }, "execution_count": 40, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ef.gender_x.value_counts()" ] }, { "cell_type": "code", "execution_count": 41, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "U 7425215\n", "M 2729342\n", "F 661705\n", "Name: gender_y, dtype: int64" ] }, "execution_count": 41, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ef.gender_y.value_counts()" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "# ACM\n", "\n", "* Super great database\n", "* They blocked me quickly, and I sent a letter." ] }, { "cell_type": "code", "execution_count": 58, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " % Total % Received % Xferd Average Speed Time Time Time Current\n", " Dload Upload Total Spent Left Speed\n", "100 20621 0 20621 0 0 29882 0 --:--:-- --:--:-- --:--:-- 29972\n" ] } ], "source": [ "!curl \"http://dl.acm.org/exportformats_search.cfm?query=%2A&filtered=persons%2Eauthors%2EprofileID%3D81340490949&within=owners%2Eowner%3DGUIDE&dte=&srt=publicationDate&expformat=csv\" > marx.csv" ] }, { "cell_type": "code", "execution_count": 62, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
typeauthoreditoradvisornotetitlepagesarticle_nonum_pageskeywords...monthyearissnbooktitleacronymeditionisbnconf_locpublisherpublisher_loc
id
2767799articleHosein Azarbonyad and Mostafa Dehghani and M...NaNNaNNaNTime-Aware Authorship Attribution for Short Te...727--730NaN4authorship attribution, short text analysis, t......NaN2015NaNProceedings of the 38th International ACM SIGI...SIGIR '15NaN978-1-4503-3621-5Santiago, ChileACMNew York, NY, USA
2634353articleAlex Olieman and Hosein Azarbonyad and Mosta...NaNNaNNaNEntity Linking by Focusing DBpedia Candidate E...13--24NaN12dbpedia spotlight, entity linking, erd challenge...NaN2014NaNProceedings of the First International Worksho...ERD '14NaN978-1-4503-3023-7Gold Coast, Queensland, AustraliaACMNew York, NY, USA
\n", "

2 rows × 26 columns

\n", "
" ], "text/plain": [ " type author editor \\\n", "id \n", "2767799 article Hosein Azarbonyad and Mostafa Dehghani and M... NaN \n", "2634353 article Alex Olieman and Hosein Azarbonyad and Mosta... NaN \n", "\n", " advisor note title \\\n", "id \n", "2767799 NaN NaN Time-Aware Authorship Attribution for Short Te... \n", "2634353 NaN NaN Entity Linking by Focusing DBpedia Candidate E... \n", "\n", " pages article_no num_pages \\\n", "id \n", "2767799 727--730 NaN 4 \n", "2634353 13--24 NaN 12 \n", "\n", " keywords ... \\\n", "id ... \n", "2767799 authorship attribution, short text analysis, t... ... \n", "2634353 dbpedia spotlight, entity linking, erd challenge ... \n", "\n", " month year issn booktitle \\\n", "id \n", "2767799 NaN 2015 NaN Proceedings of the 38th International ACM SIGI... \n", "2634353 NaN 2014 NaN Proceedings of the First International Worksho... \n", "\n", " acronym edition isbn \\\n", "id \n", "2767799 SIGIR '15 NaN 978-1-4503-3621-5 \n", "2634353 ERD '14 NaN 978-1-4503-3023-7 \n", "\n", " conf_loc publisher publisher_loc \n", "id \n", "2767799 Santiago, Chile ACM New York, NY, USA \n", "2634353 Gold Coast, Queensland, Australia ACM New York, NY, USA \n", "\n", "[2 rows x 26 columns]" ] }, "execution_count": 62, "metadata": {}, "output_type": "execute_result" } ], "source": [ "marx=pd.read_csv('marx.csv', index_col=\"id\")\n", "marx.head(2)" ] }, { "cell_type": "code", "execution_count": 65, "metadata": { "collapsed": false }, "outputs": [], "source": [ "marxurl='http://dl.acm.org/author_page.cfm?id=81340490949'\n", "\n", "m=requests.get(marxurl)\n", "\n", "soup=BeautifulSoup(m.text)" ] }, { "cell_type": "code", "execution_count": 96, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "{'81100034443',\n", " '81100036950',\n", " '81100096464',\n", " '81100208294',\n", " '81100232445',\n", " '81100514566',\n", " '81309511039',\n", " '81314494980',\n", " '81335489572',\n", " '81340490949',\n", " '81340492697',\n", " 
'81363597928',\n", " '81363606346',\n", " '81440623906',\n", " '81447602504',\n", " '81447603443',\n", " '81464671054',\n", " '81470654009',\n", " '81485655851',\n", " '81485658257',\n", " '81486656973',\n", " '81490687418',\n", " '81548006163',\n", " '81554276156',\n", " '99658636269',\n", " '99658637259'}" ] }, "execution_count": 96, "metadata": {}, "output_type": "execute_result" } ], "source": [ "def get_author_ids(author_id):\n", " '''Given an ACM author_id, download that author's page and extract the author_ids of all co-authors.\n", " Return them as a set, including the author itself.'''\n", " url='http://dl.acm.org/author_page.cfm?id='+str(author_id)\n", " m=requests.get(url)\n", " soup=BeautifulSoup(m.text)\n", " authors= [re.sub(r'[^0-9]','',a.attrs['href'].split('&')[0]) \n", " for a in soup.findAll('a') \n", " if 'href' in a.attrs and a.attrs['href'].startswith('author_page.cfm')\n", " ]\n", " # str() so the seed id matches the scraped ids, which are strings\n", " return set(authors).union({str(author_id)})\n", "\n", "# test \n", "get_author_ids(81340490949)" ] }, { "cell_type": "code", "execution_count": 130, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "26\n", "53\n", "58\n", "58\n", "58\n", "59\n", "59\n", "59\n", "71\n", "74\n", "80\n", "119\n", "119\n", "152\n", "173\n", "173\n", "177\n", "191\n", "198\n", "198\n", "211\n", "214\n", "214\n", "244\n", "253\n", "253\n", "260\n" ] } ], "source": [ "# try: go 2 steps out from marx or de rijke \n", "\n", "seed= 81340490949 # marx\n", "#seed =81335489572 # de rijke\n", "authors=get_author_ids(seed)\n", "for a in authors:\n", " print len(authors)\n", " time.sleep(random.randint(1,4))\n", " authors=authors.union(get_author_ids(a))\n", "print len(authors)\n", "#de_rijke_coauthors=authors\n", "marx_coauthors=authors" ] }, { "cell_type": "code", "execution_count": 132, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Download all CSV files given a set of author ids\n", "\n", "\n", "def download_csvs(set_of_ids):\n", " 
template=\"http://dl.acm.org/exportformats_search.cfm?query=%2A&filtered=persons%2Eauthors%2EprofileID%3D81340490949&within=owners%2Eowner%3DGUIDE&dte=&srt=publicationDate&expformat=csv\"\n", " marx= '81340490949'\n", " dataframes=[]\n", " cum=0\n", " for author in set_of_ids:\n", " time.sleep(random.randint(2,6))\n", " new_url= re.sub(marx,author,template)\n", " #print author, new_url\n", " new=pd.read_csv(new_url, index_col=\"id\")\n", " cum+=len(new)\n", " print author, len(new), cum\n", " dataframes.append(new)\n", " return pd.concat(dataframes) \n", "\n", "# test\n", "#rijke5 = download_csvs(list(de_rijke_coauthors)[100:103])\n", " " ] }, { "cell_type": "code", "execution_count": 133, "metadata": { "collapsed": false, "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "81100041550 54 54\n", "81100082955 72 126\n", "81472646931 4 130\n", "81100339991 20 150\n", "81384620228 5 155\n", "81384601729 11 166\n", "99658738174 1 167\n", "81452617021 81 248\n", "81502725238 11 259\n", "99658720570 1 260\n", "81477644242 1 261\n", "81100307839 151 412\n", "81488663482 6 418\n", "81339515235 78 496\n", "81548041665 7 503\n", "99658746090 1 504\n", "99658736687 1 505\n", "81363606346 16 521\n", "81100432369 32 553\n", "81436598567 7 560\n", "99658748233 1 561\n", "81100499185 16 577\n", "81384602631 7 584\n", "81384620254 4 588\n", "81335493029 30 618\n", "81549092656 4 622\n", "99658738843 1 623\n", "99658711904 1 624\n", "81314494980 36 660\n", "81464651429 2 662\n", "81502696743 13 675\n", "81452610079 129 804\n", "99658734480 1 805\n", "81316487451 111 916\n", "99658618675 3 919\n", "81490692105 11 930\n", "81435598553 5 935\n" ] }, { "ename": "HTTPError", "evalue": "HTTP Error 403: Forbidden", "output_type": "error", "traceback": [ "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[0;31mHTTPError\u001b[0m Traceback (most recent call last)", "\u001b[0;32m\u001b[0m in 
\u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0mmarx_diepte2\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mdownload_csvs\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mmarx_coauthors\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", "\u001b[0;32m\u001b[0m in \u001b[0;36mdownload_csvs\u001b[0;34m(set_of_ids)\u001b[0m\n\u001b[1;32m 11\u001b[0m \u001b[0mnew_url\u001b[0m\u001b[0;34m=\u001b[0m \u001b[0mre\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0msub\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mmarx\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0mauthor\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0mtemplate\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 12\u001b[0m \u001b[0;31m#print author, new_url\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 13\u001b[0;31m \u001b[0mnew\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mpd\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mread_csv\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mnew_url\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mindex_col\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;34m\"id\"\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 14\u001b[0m \u001b[0mcum\u001b[0m\u001b[0;34m+=\u001b[0m\u001b[0mlen\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mnew\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 15\u001b[0m \u001b[0;32mprint\u001b[0m \u001b[0mauthor\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mlen\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mnew\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mcum\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;32m/Users/admin/anaconda/lib/python2.7/site-packages/pandas/io/parsers.pyc\u001b[0m in \u001b[0;36mparser_f\u001b[0;34m(filepath_or_buffer, sep, dialect, compression, doublequote, escapechar, quotechar, quoting, skipinitialspace, lineterminator, header, index_col, names, prefix, skiprows, skipfooter, skip_footer, na_values, true_values, false_values, delimiter, converters, 
dtype, usecols, engine, delim_whitespace, as_recarray, na_filter, compact_ints, use_unsigned, low_memory, buffer_lines, warn_bad_lines, error_bad_lines, keep_default_na, thousands, comment, decimal, parse_dates, keep_date_col, dayfirst, date_parser, memory_map, float_precision, nrows, iterator, chunksize, verbose, encoding, squeeze, mangle_dupe_cols, tupleize_cols, infer_datetime_format, skip_blank_lines)\u001b[0m\n\u001b[1;32m 496\u001b[0m skip_blank_lines=skip_blank_lines)\n\u001b[1;32m 497\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 498\u001b[0;31m \u001b[0;32mreturn\u001b[0m \u001b[0m_read\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mfilepath_or_buffer\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mkwds\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 499\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 500\u001b[0m \u001b[0mparser_f\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m__name__\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mname\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;32m/Users/admin/anaconda/lib/python2.7/site-packages/pandas/io/parsers.pyc\u001b[0m in \u001b[0;36m_read\u001b[0;34m(filepath_or_buffer, kwds)\u001b[0m\n\u001b[1;32m 260\u001b[0m filepath_or_buffer, _, compression = get_filepath_or_buffer(filepath_or_buffer,\n\u001b[1;32m 261\u001b[0m \u001b[0mencoding\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 262\u001b[0;31m compression=kwds.get('compression', None))\n\u001b[0m\u001b[1;32m 263\u001b[0m \u001b[0mkwds\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m'compression'\u001b[0m\u001b[0;34m]\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0minferred_compression\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mcompression\u001b[0m \u001b[0;34m==\u001b[0m \u001b[0;34m'infer'\u001b[0m \u001b[0;32melse\u001b[0m \u001b[0mcompression\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 264\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n", 
"\u001b[0;32m/Users/admin/anaconda/lib/python2.7/site-packages/pandas/io/common.pyc\u001b[0m in \u001b[0;36mget_filepath_or_buffer\u001b[0;34m(filepath_or_buffer, encoding, compression)\u001b[0m\n\u001b[1;32m 256\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 257\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0m_is_url\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mfilepath_or_buffer\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 258\u001b[0;31m \u001b[0mreq\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0m_urlopen\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mstr\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mfilepath_or_buffer\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 259\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mcompression\u001b[0m \u001b[0;34m==\u001b[0m \u001b[0;34m'infer'\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 260\u001b[0m \u001b[0mcontent_encoding\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mreq\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mheaders\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mget\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m'Content-Encoding'\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mNone\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;32m/Users/admin/anaconda/lib/python2.7/urllib2.pyc\u001b[0m in \u001b[0;36murlopen\u001b[0;34m(url, data, timeout, cafile, capath, cadefault, context)\u001b[0m\n\u001b[1;32m 152\u001b[0m \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 153\u001b[0m \u001b[0mopener\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0m_opener\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 154\u001b[0;31m \u001b[0;32mreturn\u001b[0m \u001b[0mopener\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mopen\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0murl\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mdata\u001b[0m\u001b[0;34m,\u001b[0m 
\u001b[0mtimeout\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 155\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 156\u001b[0m \u001b[0;32mdef\u001b[0m \u001b[0minstall_opener\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mopener\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;32m/Users/admin/anaconda/lib/python2.7/urllib2.pyc\u001b[0m in \u001b[0;36mopen\u001b[0;34m(self, fullurl, data, timeout)\u001b[0m\n\u001b[1;32m 435\u001b[0m \u001b[0;32mfor\u001b[0m \u001b[0mprocessor\u001b[0m \u001b[0;32min\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mprocess_response\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mget\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mprotocol\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m[\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 436\u001b[0m \u001b[0mmeth\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mgetattr\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mprocessor\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mmeth_name\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 437\u001b[0;31m \u001b[0mresponse\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mmeth\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mreq\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mresponse\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 438\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 439\u001b[0m \u001b[0;32mreturn\u001b[0m \u001b[0mresponse\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;32m/Users/admin/anaconda/lib/python2.7/urllib2.pyc\u001b[0m in \u001b[0;36mhttp_response\u001b[0;34m(self, request, response)\u001b[0m\n\u001b[1;32m 548\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0;32mnot\u001b[0m \u001b[0;34m(\u001b[0m\u001b[0;36m200\u001b[0m \u001b[0;34m<=\u001b[0m \u001b[0mcode\u001b[0m \u001b[0;34m<\u001b[0m 
\u001b[0;36m300\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 549\u001b[0m response = self.parent.error(\n\u001b[0;32m--> 550\u001b[0;31m 'http', request, response, code, msg, hdrs)\n\u001b[0m\u001b[1;32m 551\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 552\u001b[0m \u001b[0;32mreturn\u001b[0m \u001b[0mresponse\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;32m/Users/admin/anaconda/lib/python2.7/urllib2.pyc\u001b[0m in \u001b[0;36merror\u001b[0;34m(self, proto, *args)\u001b[0m\n\u001b[1;32m 473\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mhttp_err\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 474\u001b[0m \u001b[0margs\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;34m(\u001b[0m\u001b[0mdict\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m'default'\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m'http_error_default'\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0;34m+\u001b[0m \u001b[0morig_args\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 475\u001b[0;31m \u001b[0;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_call_chain\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m*\u001b[0m\u001b[0margs\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 476\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 477\u001b[0m \u001b[0;31m# XXX probably also want an abstract factory that knows when it makes\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;32m/Users/admin/anaconda/lib/python2.7/urllib2.pyc\u001b[0m in \u001b[0;36m_call_chain\u001b[0;34m(self, chain, kind, meth_name, *args)\u001b[0m\n\u001b[1;32m 407\u001b[0m \u001b[0mfunc\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mgetattr\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mhandler\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mmeth_name\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 408\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 
409\u001b[0;31m \u001b[0mresult\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mfunc\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m*\u001b[0m\u001b[0margs\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 410\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mresult\u001b[0m \u001b[0;32mis\u001b[0m \u001b[0;32mnot\u001b[0m \u001b[0mNone\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 411\u001b[0m \u001b[0;32mreturn\u001b[0m \u001b[0mresult\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;32m/Users/admin/anaconda/lib/python2.7/urllib2.pyc\u001b[0m in \u001b[0;36mhttp_error_default\u001b[0;34m(self, req, fp, code, msg, hdrs)\u001b[0m\n\u001b[1;32m 556\u001b[0m \u001b[0;32mclass\u001b[0m \u001b[0mHTTPDefaultErrorHandler\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mBaseHandler\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 557\u001b[0m \u001b[0;32mdef\u001b[0m \u001b[0mhttp_error_default\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mreq\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mfp\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mcode\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mmsg\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mhdrs\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 558\u001b[0;31m \u001b[0;32mraise\u001b[0m \u001b[0mHTTPError\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mreq\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mget_full_url\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mcode\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mmsg\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mhdrs\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mfp\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 559\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 560\u001b[0m \u001b[0;32mclass\u001b[0m 
\u001b[0mHTTPRedirectHandler\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mBaseHandler\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;31mHTTPError\u001b[0m: HTTP Error 403: Forbidden" ] } ], "source": [ "marx_diepte2=download_csvs(marx_coauthors)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Letter sent to ACM\n", "\n", "* 2016-02-08" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Gender Disparities in Computing\n", "\n", "\n", "Dear ACM Digital Library team,\n", "\n", "Inspired by Vardi's CACM editorial on gender diversity in computing [1], I want to investigate gender disparities in the scientific field of computing along the lines of [2], whose authors performed a bibliometric analysis using data from Thomson Reuters Web of Science. \n", "\n", "I am an ACM member working in the field of databases and logic, and my impression is that computing shows less gender bias than the Nature article [2] describes for the sciences as a whole. That article is also restricted to an analysis of journal articles, which is less appropriate for computing, where much of the work appears in conference proceedings.\n", "\n", "I tried to download the CSV files for ACM authors, starting with myself as a seed, but was quickly banned. \n", "\n", "My question is therefore whether I can get some form of access to the DL database for this bibliometric research. \n", "\n", "I am interested in the following data:\n", "* the list of publications of each author (e.g., as in the CSV files on authors' pages)\n", "* the affiliations of authors\n", "* citations\n", "* if you have it: the gender of authors ;-)\n", "\n", "Of course, I am happy to sign a contract governing the use of this data.\n", "\n", "My intention is to publish the results in the Communications of the ACM.\n", "\n", "Looking forward to your reaction,\n", "\n", "With best regards,\n", "\n", "Maarten Marx (my page is )\n", "\n", " \n", "\n", "[1] Moshe Y. Vardi. What Can Be Done About Gender Diversity in Computing?: A Lot! Communications of the ACM, Vol. 58, No. 10, 2015, p. 5. \n", "\n", "[2] Vincent Larivière, Chaoqun Ni, Yves Gingras, Blaise Cronin & Cassidy R. Sugimoto. Bibliometrics: Global gender disparities in science. Nature, Vol. 504, No. 7479, 2013. \n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 2", "language": "python", "name": "python2" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.11" } }, "nbformat": 4, "nbformat_minor": 0 }