{
"metadata": {
"name": "All About TEDx"
},
"nbformat": 3,
"nbformat_minor": 0,
"worksheets": [
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n"
]
},
{
"cell_type": "heading",
"level": 1,
"metadata": {},
"source": [
"All About TEDx"
]
},
{
"cell_type": "heading",
"level": 4,
"metadata": {},
"source": [
"TEAM : Chan Kim, JT Huang"
]
},
{
"cell_type": "heading",
"level": 1,
"metadata": {},
"source": [
"Our Goal"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"TED is a nonprofit devoted to Ideas Worth Spreading. It started out (in 1984) as a conference bringing together people from three worlds: Technology, Entertainment, and Design. The TED Open Translation Project brings TED Talks beyond the English-speaking world by offering subtitles, interactive transcripts and the ability for any talk to be translated by volunteers worldwide. The project was launched with 300 translations, 40 languages and 200 volunteer translators; now, there are more than 32,000 completed translations from the thousands-strong community. The TEDx program is designed to give communities the opportunity to stimulate dialogue through TED-like experiences at the local level.\n",
"\n",
"Our project aims to encourage people to translate TEDx Talks as well, by showing how TEDx Talk videos are translated and spread across languages, places, and topics, and by comparing that spread with TED Talk videos.\n",
"\n",
"The questions we are trying to answer:\n",
"\n",
"* How are the TEDx videos distributed among different languages and places?\n",
"* How does the spread of TEDx videos compare with that of TED videos?"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
""
]
},
{
"cell_type": "heading",
"level": 1,
"metadata": {},
"source": [
"Outline"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"1. Our Goal\n",
"2. Dataset\n",
"3. Data collection and cleaning\n",
"4. Basic Statistics: by language\n",
"5. Basic Statistics: by country\n",
"6. Trends of TEDx over past 5 years by language\n",
"7. Comparison between TED and TEDx by language\n",
"8. What's next?\n",
"\n",
"\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
""
]
},
{
"cell_type": "heading",
"level": 1,
"metadata": {},
"source": [
"About the Dataset"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Since TEDx did not provide any API for retrieving video data, we wrote our own scraper to crawl various attributes of the TEDx videos.\n",
"And since all TEDx videos are on YouTube, we also use the YouTube API to retrieve more information about the videos.\n",
"\n",
"* **TEDx Website**\n",
"    * Language\n",
"    * Event\n",
"    * Country\n",
"    * Topic\n",
"\n",
"* **YouTube API**\n",
"    * Uploaded Timestamp\n",
"    * Title\n",
"    * Tags\n",
"    * Thumbnail\n",
"    * Duration\n",
"    * Like Count\n",
"    * Rating\n",
"    * Rating Count\n",
"    * View Count\n",
"    * Favorite Count\n",
"    * Comment Count\n"
]
},
{
"cell_type": "heading",
"level": 2,
"metadata": {},
"source": [
"TEDx Web Scraper"
]
},
{
"cell_type": "heading",
"level": 3,
"metadata": {},
"source": [
"Stage 1: Get Type Portal Links from the TEDx Home URL"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"First, we collect all the type portal links from the [TEDx home URL](http://tedxtalks.ted.com).\n",
"\n",
"Using [Beautiful Soup](http://www.crummy.com/software/BeautifulSoup/), we find every link on the home page whose path begins with one of the following prefixes:\n",
"\n",
"* **Language** Pages: '/browse/talks-by-language/', EX:\n",
"    * Korean: '/browse/talks-by-language/korean'\n",
"    * Chinese: '/browse/talks-by-language/chinese'\n",
"* **Event** Pages: '/browse/talks-by-event/', EX:\n",
"    * TEDxBerkeley: '/browse/talks-by-event/tedxberkeley'\n",
"    * TEDxStanford: '/browse/talks-by-event/tedxstanford'\n",
"* **Country** Pages: '/browse/talks-by-country/', EX:\n",
"    * South Korea: '/browse/talks-by-country/korea'\n",
"    * Taiwan: '/browse/talks-by-country/taiwan'\n",
"* **Topic** Pages: '/browse/talks-by-topic/', EX:\n",
"    * Technology: '/browse/talks-by-topic/technology'\n",
"    * Design: '/browse/talks-by-topic/design'"
]
},
{
"cell_type": "heading",
"level": 4,
"metadata": {},
"source": [
"Sample Code and Results"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"import requests\n",
"from bs4 import BeautifulSoup\n",
"\n",
"TEDX_HOME_URL = \"http://tedxtalks.ted.com\"\n",
"LANG_URL = \"/browse/talks-by-language/\"\n",
"\n",
"# Download and parse the TEDx home page\n",
"s = requests.get(TEDX_HOME_URL)\n",
"soup = BeautifulSoup(s.content)\n",
"\n",
"# Print and count every link that points to a language portal\n",
"total = 0\n",
"link_tags = soup.find_all('a', href=True)\n",
"for link_tag in link_tags:\n",
"    link = link_tag['href']\n",
"    if link.startswith(LANG_URL):\n",
"        # The language name is reached by stepping three nodes forward from the <a> tag\n",
"        lang = link_tag.next_element.next_element.next_element\n",
"        print(\"Language %s: %s\" % (lang, link))\n",
"        total += 1\n",
"\n",
"print(\"Total: %d\" % (total))"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"Language American Sign Language: /browse/talks-by-language/asl\n",
"Language Azerbaijani: /browse/talks-by-language/azerbaijani\n",
"Language Galician: /browse/talks-by-language/galician\n",
"Language Arabic: /browse/talks-by-language/arabic\n",
"Language Bulgarian: /browse/talks-by-language/bulgarian\n",
"Language Catalan: /browse/talks-by-language/catalan\n",
"Language Chinese: /browse/talks-by-language/chinese\n",
"Language Croatian: /browse/talks-by-language/croatian\n",
"Language Czech: /browse/talks-by-language/czech\n",
"Language Dutch: /browse/talks-by-language/dutch\n",
"Language English: /browse/talks-by-language/english\n",
"Language Estonian: /browse/talks-by-language/estonian\n",
"Language Finnish: /browse/talks-by-language/finnish\n",
"Language French: /browse/talks-by-language/french\n",
"Language German: /browse/talks-by-language/german\n",
"Language Greek: /browse/talks-by-language/greek\n",
"Language Hebrew: /browse/talks-by-language/hebrew\n",
"Language Hindi: /browse/talks-by-language/hindi\n",
"Language Hungarian: /browse/talks-by-language/hungarian\n",
"Language Icelandic: /browse/talks-by-language/icelandic\n",
"Language Indonesian: /browse/talks-by-language/indonesian\n",
"Language Italian: /browse/talks-by-language/italian\n",
"Language Japanese: /browse/talks-by-language/japanese\n",
"Language Korean: /browse/talks-by-language/korean\n",
"Language Lithuanian: /browse/talks-by-language/lithuanian\n",
"Language Malay: /browse/talks-by-language/malay\n",
"Language Polish: /browse/talks-by-language/polish\n",
"Language Portuguese: /browse/talks-by-language/portuguese\n",
"Language Rajasthani: /browse/talks-by-language/rajasthani\n",
"Language Romanian: /browse/talks-by-language/romanian\n",
"Language Russian: /browse/talks-by-language/russian\n",
"Language Slovak: /browse/talks-by-language/slovak\n",
"Language Slovene: /browse/talks-by-language/slovene\n",
"Language Spanish: /browse/talks-by-language/spanish\n",
"Language Swedish: /browse/talks-by-language/swedish\n",
"Language Tamil: /browse/talks-by-language/tamil\n",
"Language Thai: /browse/talks-by-language/thai\n",
"Language Turkish: /browse/talks-by-language/turkish\n",
"Language Ukrainian: /browse/talks-by-language/ukrainian\n",
"Language Urdu: /browse/talks-by-language/urdu\n",
"Total: 40\n"
]
}
],
"prompt_number": 68
},
{
"cell_type": "heading",
"level": 3,
"metadata": {},
"source": [
"Stage 2: Get Video Links in Type Portal Links"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Next, we visit each type portal link to collect its video links; the portal's type becomes an attribute of every video found there.\n",
"\n",
"We walk through page 1, page 2, and so on until that portal has no more pages.\n",
"For example, we visit 'http://tedxtalks.ted.com/browse/talks-by-language/icelandic?page=1', then 'http://tedxtalks.ted.com/browse/talks-by-language/icelandic?page=2', and stop at 'http://tedxtalks.ted.com/browse/talks-by-language/icelandic?page=3', by which point we have collected all 28 videos in Icelandic.\n",
"\n",
"On 'http://tedxtalks.ted.com/browse/talks-by-language/icelandic?page=1' we can find video links such as '/list/search%3Atag%3A%22icelandic%22/video/TEDxReykjavik-Eythor-Edvardsson' by using Beautiful Soup together with a regular expression."
]
},
{
"cell_type": "heading",
"level": 4,
"metadata": {},
"source": [
"Sample Code and Results"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"import re\n",
"\n",
"VIDEO_LINK_PREFIX = \"mvp_grid_panel_img_\"\n",
"MSG_CLASS = \"mvp_padded_message\"\n",
"EMPTY_PAGE_MSG = \"This page is empty.\"\n",
"\n",
"portal_url = \"http://tedxtalks.ted.com/browse/talks-by-language/icelandic\"\n",
"page = 1\n",
"while True:\n",
" # EX: http://tedxtalks.ted.com/browse/talks-by-language/icelandic?page=1\n",
" url = portal_url + \"?page=\" + str(page)\n",
" print(\"Reading URL: \" + url)\n",
" s = requests.get(url)\n",
" soup = BeautifulSoup(s.content)\n",
" \n",
" # if there is no Next page\n",
" #