thoughts from mlss 2015 tübingen

I recently attended the Machine Learning Summer School at the MPI in Tübingen. This wasn't my first time at an event like this - I attended the Gaussian Process 'Summer' School in September 2014 - but the MLSS is a lot bigger/diverse. I'm fairly sure I didn't even speak to all the other participants (unfortunately).

The basic format is lectures all day (9am til roughly 5pm) and various academic or social activities in the evenings. I foolishly though it would be possible to get lots of work done in the free time, which was wrong on two counts: free time was limited, and the hostel had essentially no WiFi. By that I mean it was impressively bad: pings of the order 5s, packet loss above 50%. Free café WiFi is also harder to find in Tübingen than in NYC (somehow!), so by the end of the two weeks, MLSS participants could be found sitting near eduroam hotspots across the town.

Luckily there are better things to do at a summer school than struggle with laggy ssh tunnels. The lectures provided good exposure various topics within machine learning, although a 3-hour course is necessarily limited in depth. Most of the lectures were recorded and will probably be here eventually. Those from 2013 are still available here. There are also more from other MLSS venues here.

My favourites were Tamara Broderick's Bayesian Nonparametrics and Zoubin Ghahramani's Bayesian Inference (note my bias). Michael Hirsch's Computational Imaging and Michael Black's Learning Human Body Shape were also enjoyable, largely due to demonstrations. The former briefly covered MIT's visual microphone which prompted a similar level of disquiet as it did on infosec twitter, although more fascination. The unearthly sounds of the reconstructions do little to ease the creep level.

My favourite session overall was the practical from Frank Wood and Brooks Paige on Probabilistic Programming (bitbucket repo), possibly because I am a nascent Clojure fan, or maybe I just love sampling. I'm also quite enthusiastic about abstracting away implementation details and focusing on models, which Anglican facilitates. How much use I'll make of it in my own research has yet to be determined.

Something which cannot be replicated via video lectures or git repos (yet) is interaction with other participants. As I mentioned, there were a lot of us (about 100), and the poster sessions were probably the best opportunity to talk science. I'm not sure how participants were selected but I was impressed by the diversity of research represented. It turns out not everyone is throwing convnets at everything (but maybe they should be?). There was also a lot more theory than I was expecting, which is what happens when you assume your biased sample (of largely-applied colleagues) is representative of the whole. Lesson learned. I didn't take any notes at the poster sessions (nor did I read all of the posters), so I'll just mention a few that stand out in my memory (and have something concrete to link to).

Muhammad Bilal Zafar had a poster about Fairness Constraints: A Mechanism for Fair Classification. Quoting from the paper,

"Fairness prevents a classifier from outputting predictions correlated with certain sensitive attributes in the data."

I was really excited to see a poster about fairness, especially having just read "What Does it Mean for an Algorithm to be Fair?". The danger exists for people to believe that the recommendations from a machine learning algorithm are 'fair' (for some nebulous definition of fair, likely including 'not racist' and 'not sexist'), which could be used to avoid addressing systemic social injustices. It's important for machine learning researchers/users to stress that the output of learning algorithm is a function of its training data (madness, I know), and as long as our historical data contains biases, models trained on it will have them too. That is, unless we do something about it. I'm sure there are more subtle factors at play that I'm not aware of, but I'm glad that these issues are being considered by the research community.

Klaus Greff presented a poster about an experiment-management tool he created called Sacred. (The name is a reference to Monty Python's Every Sperm is Sacred). This obviously isn't research, but it seems extremely useful. It records things like config options, a snapshot of the source code(!), runtime trace(s), and saves them in a (mongoDB) database. I already have a semi-elaborate setup for running reproducible experiments (the details of which are too gory and shameful to provide), but this seems more pleasant and sane.
Tom Rainforth had a poster about Canonical Correlation Forests which I didn't actually get to look at (I was in the same poster session), but the gist I got from a chat in the pub is that they're better than random forests (my brain has a very aggressive compression algorithm, clearly). I'll need to read the paper. I have a picture of him explaining the poster to someone on the bus after the poster session, demonstrating that science never rests.
My friend Jean Maillard had a poster on Learning Adjective Meanings with a Tensor-Based Skip-Gram Model. This was by far the most similar to mine (my poster was also on distributional semantics), although this paper focuses moreso on language-modelling, by representing adjectives as matrices. I'm amused that Jean and I started off doing something entirely different (at the time, Part III in Mathematics at Cambridge, mostly flipping tables over quantum algorithms) and then converged (if only temporarily, for me) on something that is (at least by MLSS standards) somewhat obscure. Maybe there was something in the water at St. John's.

There were also lots of opportunities to talk non-science. On the last afternoon, a straw poll was conducted on the viability of human-level AI during our lifetime. The majority present (n ~ 10) felt it wasn't going to happen, which seems to go against popular opinion on the matter (at least judging by recent articles about the threat of such AIs). Maybe grad students are too pessimistic (optimistic?) a group, or we succumbed to our small sample size. The poll wasn't even conducted in secret.

Another outcome of the MLSS is that I reaffirmed my desire to write a Gaussian Processes for Biologists tutorial. Biologist here really means 'anyone lacking a strong mathematical background' (hopefully in the future it will be offensive for me to use biologist as a proxy for that). I'd originally planned to do this after the GPSS last year, partially out of GP evangelism (all the people doing simple linear regression could be doing GP regression!) and partially to deepen my own understanding (one learns much through teaching), but progress stalled due to lack of interest. Interest is briefly re-ignited, so maybe I'll actually do it this time[citation needed].