Sina Samangooei's thoughts. Musings of an engineer trapped in an academic's body.



ICWSM 2012 - A Sina's eye view

Over the last week I was in Dublin attending ICWSM 2012. I’ll be writing about that, but first, context!

What I’ve been up to recently


My more recent academic endeavours have been centred around machine learning and semantic web technologies as applied to real time social media data. The specific goal is somewhere between “make it easier for folks to glean some knowledge from social media” and “use social media to automatically predict X s.t. X is a real valued time series”. I am actually finding it rather fun and challenging. My background is in computer vision and multimedia so this is the first time I’ve had to deal with text in any form beyond regex hackery. The semantic web aspect of this project is also fun and new for me. Though I know the general idea from my PhD, I stopped using ontologies and RDF databases relatively early on in the PhD and I look forward to seeing what a proper semantic knowledge base can do. Also there is the big data side. I have played with “big data” before, in fact we’ve been subverting hadoop to do fun multimedia retrieval things for a while. This project will let me go beyond that, looking at data which things like hadoop were originally made to deal with, namely, text! Also my machine learning skills have been a little “random walky” up till now, applying whatever machine learning techniques I thought were required for the multimedia/computer vision job at hand. This project gives me a chance to learn about/implement some high tech/interesting machine learning in a project more concentrated around the machine learning itself.

TL;DR my interests in social media are: text analysis, semantic knowledge store stuff, big data and machine learning applied in this big-data-text-source setting. So, the computer science side of social I suppose, or at least a part of it.

ICWSM had a BIG focus on the social science, non-technical side of social. This was a surprise. A pleasant surprise, but one that did leave me feeling a little out of my depth at the conference. Add to that the social media community being one I am not so familiar with, and ICWSMers were using words I was not comfortable with, using techniques I was not 100% comfortable with and generally inspiring me to read a LOAD of stuff. I’ve come out of ICWSM having met a lot of cool people, learnt some cool things, and feeling inspired to learn much more :-)

Day 1 – Workshops

The first day at the conference was a day of workshops, one of which was the Real-Time Analysis and Mining of Social Streams (RAMSS) workshop. This was ostensibly my reason for being at ICWSM in the first place. Daniel (from Sheffield) and I had worked on porting a bunch of text preprocessing components into an open source tool which can be used in either a command line or hadoop setting to preprocess text, with a specific aim towards preprocessing twitter data. The challenge is the relatively short text snippets with crazy grammar in things like twitter status updates. Our tool can tokenise, language detect and stem such horrible text. The code is on my github, the paper can be found on ePrints. The rest of RAMSS looked very interesting, including a keynote by the wonderful Jimmy Lin. However, I couldn’t go to the morning session of RAMSS as (in the place of Jon Hare who was off in Hong Kong presenting our ImageTerrier paper) I had to attend the SocMedNews workshop and take his place on the panel therein.
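
To give a flavour of why tweets need their own tokeniser, here is a minimal sketch in Python. It is not the code from our tool: the regular expression, the emoticon list and the names tokenise and normalise are all assumptions invented for illustration, and language detection and stemming are left out entirely.

```python
import re

# Hypothetical pattern, written for this post only. Order matters: the more
# specific alternatives come first so URLs, mentions and hashtags stay whole.
TOKEN_PATTERN = re.compile(
    r"""
    https?://\S+              # URLs
    | [@#]\w+                 # @mentions and #hashtags
    | [:;=][-']?[()DPp/\\|]   # a few western-style emoticons
    | \w+(?:'\w+)*            # words, optionally with internal apostrophes
    | [^\w\s]                 # any other single non-space symbol
    """,
    re.VERBOSE,
)


def tokenise(text):
    """Split a status update into tokens, keeping twitter entities intact."""
    return TOKEN_PATTERN.findall(text)


def normalise(token):
    """Lower-case plain words and squash runs of 3+ repeated characters to 2."""
    if token.startswith(("http://", "https://", "@", "#")):
        return token  # leave URLs, mentions and hashtags untouched
    return re.sub(r"(\w)\1{2,}", r"\1\1", token.lower())


tokens = tokenise("OMG sooooo sick :( http://t.co/abc #flu @bob")
cleaned = [normalise(token) for token in tokens]
```

A plain whitespace-and-punctuation splitter would break the URL, the hashtag and the emoticon into pieces, which is exactly the kind of thing that makes this text horrible to deal with.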


SocMedNews was all about the interaction of news and social media. Are tweeters journalists? Are journalists relevant in a world of real time news on social streams? What tools are there to help journalists use social media? Things like this. Knowing absolutely NOTHING about journalism, let alone its interaction with social media, I thought it best to attend the morning papers to root in reality the insightful discussion in which I was to take part in the afternoon. So yeah, we started with a keynote by Katrin Weller. She framed the rest of the discussion well, asking questions like: “Is twitter actually a social network?” Short answer: sort of not. Very few people (22%) actually follow back their followers, which means twitter is primarily for information consumption. Other little nuances included the issues of trust in social media and the propagation of lies; something like the XKCD wikipedia citation loop happens between journalists and news.

Next we saw a cool tool from Brigit Gray, socios, which boiled down to a query-led tool with a backend developed by IBM to help journalists keep on top of the social media activity around a specific topic. Another interesting takeaway from the workshop was Bahareh Heravi’s Towards Social Semantic Journalism, which showed the beginnings of some interesting work on combining existing ontologies to model networks and content from social media, again towards making it more easily explorable by journalists. Another interesting talk was given by Marco Toldeo, who wanted to automatically identify “hard-news” and “soft-news” from some latent features of how things were discussed on twitter. He noticed distinct RT patterns between these kinds of news: many RTs within a tight-knit group for hard news, and RTs spread amongst large disparate groups for soft news.

The SocMedNews panel basically echoed some of the issues highlighted through the rest of the talks: is news on social networks reliable, what is the role of journalists in a world of social media, and how can technology assist them? My contribution was that we need better ways to more honestly use the massive amounts of data being generated, beyond the simplistic “does it contain the terms I’ve defined that define my topic” etc. A primary takeaway from Jochen Spangenberg was that in a time of social media there are more citizen journalists, but some key roles of traditional journalists remain. For example, rechecking and confirmation are not things every member of twitter will necessarily care about, but confirmation and fact checking are integral to the role of journalists.

Overall it was interesting to hear about what is being done to help journalists use social media. Further, it was cool to see what KIND of news twitter is generating, whether social media denizens are in fact journalists and beyond this, what this entails for traditional news and journalists. I was, however, very much an observer in this field and I only hope someone found my slight technical input of “let’s analyse this stuff on larger scales, maybe with some multimedia in there for kicks!” useful :)

RAMSS part 2

SocMedNews ended in the afternoon and RAMSS continued until the end of the day so I went to RAMSS. Some cool things were presented, some of them deeply related to my area of interest in the sense that they actually thought about the practicality of the problem. Your topic model is nice and all, but DOES IT SCALE?

A very practical presentation was given by Lance Vick of Tawlk. Tawlk have developed and open sourced some very cool tools for extracting structured data from many different social networks, including, excitingly, reddit! They do this with a python tool, but also a browser based javascript tool, which means that the job of extracting data can be pushed to clients, which in turn means one can get around API limits. Very exciting for building realtime user facing apps! Next we had a really great look at Dominik Riemer’s work on Modeling Semantic queries in real time social media monitoring. This was pretty cool because it talked directly about implementing a system which was initiated with a semantic query and monitored a stream of RDF for results matching the query. Little was said about modelling inferences across these streams, but having a chat with Dominik afterwards seemed to suggest it is possible, so cool, exciting times!
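
For anyone wondering what “monitor a stream of RDF for results matching a query” might look like at its very simplest, here is a toy sketch in Python. It is my own illustration and not Dominik’s system: triples are plain tuples, a query is a single triple pattern with SPARQL-style ?variables, and the ex: names and function names are all made up. There is no inference and no joining across patterns.

```python
def is_variable(term):
    """Treat terms starting with '?' as query variables (SPARQL-style)."""
    return isinstance(term, str) and term.startswith("?")


def match_triple(pattern, triple):
    """Return variable bindings if the triple fits the pattern, else None."""
    bindings = {}
    for wanted, actual in zip(pattern, triple):
        if is_variable(wanted):
            # A repeated variable must bind to the same value every time.
            if bindings.setdefault(wanted, actual) != actual:
                return None
        elif wanted != actual:
            return None
    return bindings


def monitor(stream, pattern):
    """Lazily yield bindings for every triple in the stream that matches."""
    for triple in stream:
        bindings = match_triple(pattern, triple)
        if bindings is not None:
            yield bindings


# Hypothetical stream of (subject, predicate, object) triples.
stream = iter([
    ("ex:tweet1", "ex:mentions", "ex:Dublin"),
    ("ex:tweet2", "ex:mentions", "ex:London"),
    ("ex:tweet3", "ex:mentions", "ex:Dublin"),
])
hits = list(monitor(stream, ("?tweet", "ex:mentions", "ex:Dublin")))
```

Because monitor is a generator it never needs the whole stream in memory, which is the property you want when the stream never ends.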

A paper whose authors didn’t make it to RAMSS was about Unsupervised realtime company name disambiguation. Their technique was data driven (using a bunch of rules applied to a company’s website) and their results had very good recall and were comparable to state of the art techniques. This is definitely applicable to trendminer and I’ll be investigating this soon.

Overall RAMSS was very interesting and got me talking with other people interested in making all this fun social media stuff work in real time. Great stuff!

The rest of the conference

I’m summarising the rest of the conference into one section. Though there was a lot of interesting stuff, it has to be said that for me most of it was interesting but inapplicable (at least directly) to the technical problems I’m trying to solve in my current project. The keynote by Robin Dunbar exemplifies this issue well. His talk was engaging, full of really awesome insights on some fundamental limits of human social circles. Can I use this to extract trends from content in social media? Not easily. Will it colour some intuitions and some of my thinking? Maybe!

On to the interesting papers and posters. Let us start with something directly applicable to my problem area, presented by Milad Kharratzadeh. He had a look at blogs that talked about companies. He built a distance metric between companies based on their co-occurrence and from this was able to use GANC clustering to construct some relatively convincing groupings of companies. He then used this clustering, as well as the correlation of historical prices, to construct stock portfolios from companies which were not similar, the idea being that a portfolio of dissimilar companies reduces risk. His results showed that the blog derived metric, when mixed with the correlation of historical prices, provides a better reduction in risk than historical prices alone. Certainly promising! Also not twitter so woo :-)
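
To make the co-occurrence idea concrete, here is a rough sketch in Python. It is emphatically not Milad’s method: I am assuming a plain Jaccard distance over the posts that mention each company, I swap GANC clustering for a naive greedy pick of far-apart companies, and I ignore historical prices completely. The company mentions are toy data.

```python
from itertools import combinations


def cooccurrence_distance(posts):
    """Map each company pair to a Jaccard distance over the posts naming them.

    `posts` is an iterable of sets of company names, one set per blog post.
    0.0 means two companies are always mentioned together, 1.0 means never.
    """
    mentions = {}
    for post_id, companies in enumerate(posts):
        for company in companies:
            mentions.setdefault(company, set()).add(post_id)

    distance = {}
    for a, b in combinations(sorted(mentions), 2):
        shared = len(mentions[a] & mentions[b])
        either = len(mentions[a] | mentions[b])
        distance[(a, b)] = 1.0 - shared / either
    return distance


def pick_dissimilar(distance, size):
    """Greedily pick up to `size` companies that are far from each other."""
    def dist(a, b):
        return distance[(a, b)] if (a, b) in distance else distance[(b, a)]

    companies = sorted({name for pair in distance for name in pair})
    # Seed with the most distant pair, then keep adding whichever company
    # has the largest minimum distance to everything already chosen.
    chosen = list(max(distance, key=distance.get))
    while len(chosen) < min(size, len(companies)):
        rest = [c for c in companies if c not in chosen]
        chosen.append(max(rest, key=lambda c: min(dist(c, p) for p in chosen)))
    return chosen


# Toy data: which companies each (imaginary) blog post talks about.
posts = [
    {"Apple", "Google"}, {"Apple", "Google"}, {"Apple", "Microsoft"},
    {"Shell", "BP"}, {"Shell", "BP"}, {"BP"},
]
distance = cooccurrence_distance(posts)
portfolio = pick_dissimilar(distance, 3)
```

On this toy data the pick is Apple, BP and Microsoft: Google and Shell are left out because each sits too close to a company already chosen.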

Song Feng presented some nice work about the distribution of ratings on review sites being indicative of whether the reviews were by spammers or shills. Fair points, but I must question whether the reviews could have such a skew for different reasons, e.g. one time reviewers who had a very bad or very good experience, and who also happened to be from different social backgrounds, so that what counts as “good” differs radically. Interesting either way, and it didn’t use twitter, so rare!

Adam Sadilek showed some really interesting results about how to model the spread of disease based on social networks. First he learns what constitutes a tweet about someone being ill. Because getting loads of labelled data is hard he uses a bootstrapping process to train an SVM. Firstly Mechanical Turk was used to tag 5000 tweets as being about sickness or other. He trained two binary SVMs, one heavily penalised for false positives and another heavily penalised for false negatives. He then used these classifiers on a set of 1.6 million tweets that were “likely to be about being sick” but contained some noise. Using the two binary classifiers he extracted tweets that were very likely to be about sickness, and tweets very likely not to be. On top of this, from a set of 200 million tweets the “penalised for false negatives” classifier was used to select a load more tweets which were definitely NOT about sickness. This yielded 700 thousand sickness tweets and 3 million “other” from which they trained their final classifier. The features they train their SVMs on are unigrams, bigrams and trigrams. There are a few more details, especially how they go on to use this data to model actual disease spreading, but this paper is very interesting for two reasons. One: it gives very specific details of their process, easily reproducible, excellent read. Two: it is an interesting and data centric approach to learning classifiers for a specific subject matter.
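
The selection step of that bootstrapping process can be sketched in a few lines of Python. This is my reading of the idea, not the paper’s code. The two “classifiers” below are hypothetical keyword rules standing in for the trained SVMs, and the function names are invented. The ngrams helper shows the unigram, bigram and trigram features.

```python
def ngrams(tokens, max_n=3):
    """Return all unigrams, bigrams and trigrams as space-joined strings."""
    return [
        " ".join(tokens[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(tokens) - n + 1)
    ]


def bootstrap_labels(tweets, high_precision, high_recall):
    """Auto-label only the tweets the two classifiers are confident about.

    `high_precision(tweet)` rarely raises false alarms, so its positives are
    trusted. `high_recall(tweet)` rarely misses, so its negatives are trusted.
    Everything in between is left out of the new training set.
    """
    sick, other = [], []
    for tweet in tweets:
        if high_precision(tweet):
            sick.append(tweet)
        elif not high_recall(tweet):
            other.append(tweet)
    return sick, other


# Toy stand-ins for the two trained SVMs (purely hypothetical rules).
def strict(tweet):
    return "i have the flu" in tweet.lower()


def lenient(tweet):
    return any(word in tweet.lower() for word in ("flu", "sick", "fever"))


pool = [
    "I have the flu, staying in bed",
    "sick beat on this track",
    "lovely day in Dublin",
]
sick, other = bootstrap_labels(pool, strict, lenient)
features = ngrams(["i", "feel", "sick"])
```

The ambiguous “sick beat” tweet is dropped by both lists, which is the point: the final classifier only ever sees examples the cheap classifiers were sure about.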

There was a lot of stuff about sentiment analysis at ICWSM. The social folks really love this stuff, hell everyone does right? It gives you a handy number to encapsulate complex human outputs, who doesn’t want that! I won’t mention all the sentiment work, if you are interested go to the conference accepted papers and ctrl+f (cmd+f) for sentiment. A couple of noteworthy examples: Abdelhalim Rafrafi used a regularizer to deal with frequency bias when learning a sentiment classifier, and Yulan He constructed a dynamic topic model for sentiment and topic and used it with firefox plugin reviews. Going beyond dual polarity metrics of sentiment, Munmun De Choudhury did some really brilliant work on moods. They identify over 200 moods and produce a lexicon of mood indicative words. Their lexicon is available and we should be able to use it to produce a tool which can attribute POMS-like moods to tweets. Finally!
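
As a sketch of how such a tool might start out, here is a lexicon lookup in Python. The six-word lexicon is invented for this example and has nothing to do with the published one, and simple counting is only my assumption about how moods could be attributed to a tweet.

```python
from collections import Counter

# Hypothetical miniature lexicon (word -> mood), NOT the published lexicon.
TOY_MOOD_LEXICON = {
    "excited": "excited",
    "thrilled": "excited",
    "furious": "angry",
    "annoyed": "angry",
    "exhausted": "tired",
    "sleepy": "tired",
}


def mood_counts(text, lexicon=TOY_MOOD_LEXICON):
    """Count how many words in the text point at each mood."""
    words = (word.strip(".,!?;:\"'()") for word in text.lower().split())
    return Counter(lexicon[word] for word in words if word in lexicon)


def dominant_mood(text, lexicon=TOY_MOOD_LEXICON):
    """Return the most frequent mood, or None if no lexicon word appears."""
    counts = mood_counts(text, lexicon)
    return counts.most_common(1)[0][0] if counts else None


mood = dominant_mood("So exhausted and sleepy, but excited!")
```

Anything real would need proper tweet tokenisation and some handling of negation (“not excited”), neither of which this attempts.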

Also, on a bit of a tangent, Akshay Patil presented some really interesting work where they modelled the network dynamics behind groups of players in World of Warcraft. Specifically they looked at what happened when important individuals leave gaming groups, causing the groups to fall apart. This has absolutely nothing to do with content analysis, indeed it was a Dunbar style “yeah that is pretty cool, I can’t use it at all, but it is good food for thought”, of which there were many at ICWSM. But it deserves special note because, yes, playing in guilds is interesting and hard :-)

...Thanks for all tweets

That was ICWSM, or a part of it anyway. Check out the hashtag and my twitter timeline for more real time stuff, but the above is a reasonable summary. It was very different from conferences I have been to in the past. The interests were varied, and the focus of the folks there was at times so alien to my interests in social media that it was hard to follow what exactly was going on. But it was still interesting: the work presented and the offline chats have definitely taught me some things and given me some real practical directions worth following up.

I’d also like to thank the organisers. Though the workshop day was a little oversubscribed, the rest of the conference’s organisation was absolutely great. The single track nature was really brilliant, no hard decisions about whether to see this or that talk. Also the lightning talks gave the poster presenters a really excellent way to grab my attention and tell me about their work. Other things: the wifi was perfect all day, the food and drinks were lovely and the final banquet at the Guinness Storehouse was a beautiful way to see Dublin.

So yes, great city, great conference, great people and some very cool work. NOW. To put it all to good use in my work. LET THE RESEARCH COMMENCE!

Stolen, torn apart, slapped together and otherwise created by Sina Samangooei. Licensed under WTFPL