After Tuesday’s talk on personal data mining, I gave another talk to non-experts on Thursday. This time the topic was “The Autonomous Search Engine”. The backbone of the story is the move from supervised to weakly supervised development of one of the core components of search engines: rankers. Weak supervision in this context means that we’re not using explicit, manually created labels provided by experts, but that we’re making inferences about our ranker technology from naturally occurring interactions with the search engine.
Weakly supervised ways of evaluating rankers and weakly supervised ways of learning to combine the outputs of multiple rankers are being studied a lot. And solutions are being used outside the lab. The next holy grail here is to create new rankers in a weakly supervised way, again based on the implicit signals that we get from user interactions. The audience, consisting of information professionals for whom expert judgments are the norm, was obviously critical but interested, with a number of great questions and a range of suggestions. Thank you!
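The post doesn’t name a specific technique, but a common example of weakly supervised ranker evaluation is interleaving: merge the output of two rankers into a single result list, show it to real users, and credit whichever ranker contributed the documents that get clicked. A minimal sketch of team-draft-style interleaving (the function and variable names are mine, purely illustrative):

```python
import random

def team_draft_interleave(a, b, rng=random):
    """Merge rankings a and b; remember which 'team' contributed each doc."""
    interleaved, teams, seen = [], [], set()
    count_a = count_b = 0
    while len(seen) < len(set(a) | set(b)):
        # The team that has contributed fewer docs picks next; ties are a coin flip.
        if count_a < count_b or (count_a == count_b and rng.random() < 0.5):
            pool, team = a, 'A'
        else:
            pool, team = b, 'B'
        doc = next((d for d in pool if d not in seen), None)
        if doc is None:
            # This ranker is exhausted; let the other one contribute.
            pool, team = (b, 'B') if team == 'A' else (a, 'A')
            doc = next((d for d in pool if d not in seen), None)
            if doc is None:
                break
        seen.add(doc)
        interleaved.append(doc)
        teams.append(team)
        if team == 'A':
            count_a += 1
        else:
            count_b += 1
    return interleaved, teams

def credit(teams, clicked_positions):
    """Turn clicks on the interleaved list into per-ranker credit."""
    wins_a = sum(1 for i in clicked_positions if teams[i] == 'A')
    wins_b = sum(1 for i in clicked_positions if teams[i] == 'B')
    return wins_a, wins_b
```

Aggregated over many queries and users, these credits give a preference between the two rankers without any expert judgments, which is exactly the kind of naturally occurring signal the talk was about.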
Now playing: The Durutti Column — Stuki
I gave a talk aimed at the general public on personal data mining last night, in Maastricht. The talk explains what type of information can be mined from the content of open sources (news, social media, etc.) using state-of-the-art search and text mining technology. The focus is on extracting personal information, whether this is location, music listening behavior, health, or personality traits. It’s not the first time I’ve given this talk, and follow-ups are planned for later this spring. The message at the end of the talk is very simple: the content of open sources can be incredibly revealing and therefore incredibly sensitive.
Because of the ongoing NSA revelations, I decided to add some material on what can be mined from metadata. The message is the same: like the content of open sources, metadata can be incredibly revealing and therefore sensitive too. There were good and interesting questions, both on the reach of search and text mining technology and on the balance between sharing digital trails, from which many people stand to benefit, on the one hand, and ownership of such trails on the other. Thanks to everyone who attended!
Now playing: Dakota Suite & Quentin Sirjacq — The Side of Her Inexhaustible Heart (Part III)
A few weeks ago I visited a local high school as part of a series of efforts to get more high school kids to maintain an interest in computer science and possibly study the subject in university. I gave a sneak preview of a new version of Streamwatchr, a demo that we’ve been working on with Manos Tsagkias and Wouter Weerkamp.
Streamwatchr offers a new way to discover and enjoy music through an innovative interface. We show, in real time, what music people around the world are listening to. Each time Streamwatchr identifies a tweet in which someone reports the song that he or she is listening to, it shows a tile with a photo of the artist and a play button (on mouse over) that does, indeed, play the song (from YouTube). Streamwatchr collects about 500,000 music tweets per day, which is about 6 tweets per second.
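Streamwatchr’s actual pipeline isn’t described in the post, but the first step it mentions, identifying tweets in which someone reports what they are listening to, can be sketched with a simple pattern matcher; the regex and function below are my own illustrative assumptions, not the real system. The snippet also checks the volume figure: 500,000 tweets per day works out to roughly 6 per second.

```python
import re

# Hypothetical matcher: spot "now playing"-style tweets and pull out an
# artist/track pair that a later entity-linking step could resolve.
NOW_PLAYING = re.compile(
    r'(?:#nowplaying|now playing)[:\s]+(?P<track>.+?)\s+(?:-|by)\s+(?P<artist>.+)',
    re.IGNORECASE,
)

def extract_music_tweet(text):
    """Return (artist, track) if the tweet looks like a music report, else None."""
    m = NOW_PLAYING.search(text)
    if not m:
        return None
    return m.group('artist').strip(), m.group('track').strip()

# Sanity check on the stated volume: 500,000 tweets/day over 86,400 seconds.
tweets_per_second = 500_000 / (24 * 60 * 60)  # ~5.8, i.e. about 6 per second
```

In the real system this lexical match would only be the entry point; as the post notes, entity linking and data integration do the work of mapping the extracted strings to actual artists and songs.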
I visited a class of 12 and 13 year olds at a local high school here in Amsterdam. As my laptop and the beamer refused to talk to each other, I walked around the classroom with my laptop to demo Streamwatchr, with Streamwatchr running in full screen. While walking around I talked about some of the technology behind it (entity linking, data integration, open data, etc.). Occasionally, I put the laptop down on a table so that the students could interact with Streamwatchr.
I was amazed to see how addictive the interface was … a screen full of tiles, 6 of which flip and change at random every second, kept every kid glued to the laptop. Later (private) demos of Streamwatchr to friends and colleagues led to similar scenes. As I observed the high school kids interact with Streamwatchr, some interesting questions came up. What is the appeal of random changes to visual elements at a pace that seems to be slightly higher than one can actively track? Is it that there is always something on the screen that you have not seen yet? But that you think you don’t want to miss? At which speed should those random changes occur to be optimally captivating? Should the changes really be random or should they provide maximal coverage of the screen (in the obvious spatial sense) to be optimally captivating? There’s a great set of experiments to be run there — an unexpected side product of an outreach activity.