ECIR 2014 proceedings online

With 10 days to go until ECIR 2014, the proceedings are now online. You can find information about them on the Springer web site or access the online version directly at this page.

HPC grant

My colleague Lars Buitinck and I received a grant from the HPC fund to support the development of a more scalable version of xTAS, our extensible text analysis service. It’s the pipeline that we (and others) use for our text mining work. This is great news as it allows us to port modules to the new architecture that Lars has been developing for the past six months.

The autonomous search engine

After Tuesday’s talk on personal data mining, I gave another talk to non-experts on Thursday. This time the topic was “The Autonomous Search Engine”. The backbone of the story is the move from supervised to weakly supervised development of one of the core components of search engines: rankers. Weak supervision in this context means that we’re not using explicit, manually created labels provided by experts; instead, we’re making inferences about our rankers from naturally occurring interactions with the search engine.

Weakly supervised ways of evaluating rankers and weakly supervised ways of learning to combine the outputs of multiple rankers are being studied a lot, and solutions are being used outside the lab. The next holy grail here is to create new rankers in a weakly supervised way, again based on the implicit signals that we get from user interactions. The audience, consisting of information professionals for whom expert judgments are the norm, was obviously critical but interested, with a number of great questions and a range of suggestions. Thank you!

Life mining talk

I gave a talk aimed at the general public on personal data mining last night, in Maastricht. The talk explains what types of information can be mined from the content of open sources (news, social media, etc.) using state-of-the-art search and text mining technology. The focus is on extracting personal information, whether this is location, music listening behavior, health, or personality traits. It’s not the first time I’ve given this talk, and follow-ups are planned for later this spring. The message at the end of the talk is very simple: the content of open sources can be incredibly revealing and therefore incredibly sensitive.

Because of the ongoing NSA revelations, I decided to add some material on what can be mined from metadata. The message is the same: like the content of open sources, the metadata can be incredibly revealing and therefore sensitive too. There were good and interesting questions, both on the reach of search and text mining technology and on the balance between sharing digital trails, from which many people stand to benefit, on the one hand and ownership of such trails on the other. Thanks to everyone who attended!

