ICML 2016 not by the day

The International Conference on Machine Learning (ICML) was in NYC this year! Unfortunately(?) for me, I moved from NYC to Zürich two months ago. Fortunately for me, I was able to return to attend the conference. Instead of doing a day-by-day breakdown (as I did for NIPS and AAAI), this post will be arranged thematically. Let's see how I deal with the hard group assignment problem... Skip to the bit you care about.


Volunteering at ICML

I was a student volunteer for ICML, which consisted of working two ~five-hour shifts at the conference. Both of mine were at the Registration Desk. I had 07.30-12.30 on the first and last days, which was only possible because I was on European time for much of the trip. I woke up at 4am on the first day. Here are some observations:

Overall it was surprisingly fun. Apparently I rather enjoy that kind of work, so if this whole research thing doesn't work out I have a bright future as a vending machine.

Tutorial on Deep Reinforcement Learning

I was only able to attend one tutorial due to volunteering, and it was Deep RL. It was so popular there were two overflow rooms. Intense community interest in deep RL continues. Here's an abbreviated version:

The deep part comes into play when you use a deep neural network to approximate your value function, policy, environment etc.
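To make "approximate your value function with a neural network" a bit more tangible, here is a minimal sketch that is not taken from the tutorial. It assumes Q-learning as the RL algorithm and uses a tiny one-hidden-layer network in pure Python as a stand-in for a deep one; the class name `TinyQNetwork`, the layer sizes, the learning rate and the one-hot state encoding are all made up for illustration.

```python
import math
import random


class TinyQNetwork:
    """One hidden tanh layer mapping a state vector to one Q-value per action."""

    def __init__(self, n_inputs, n_hidden, n_actions, rng):
        self.w1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_inputs)]
                   for _ in range(n_hidden)]
        self.b1 = [0.0] * n_hidden
        self.w2 = [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden)]
                   for _ in range(n_actions)]
        self.b2 = [0.0] * n_actions

    def forward(self, state):
        hidden = [math.tanh(sum(w * s for w, s in zip(row, state)) + b)
                  for row, b in zip(self.w1, self.b1)]
        q = [sum(w * h for w, h in zip(row, hidden)) + b
             for row, b in zip(self.w2, self.b2)]
        return hidden, q

    def q_values(self, state):
        return self.forward(state)[1]

    def td_update(self, state, action, reward, next_state, done,
                  gamma=0.9, lr=0.05):
        """One semi-gradient Q-learning step; returns the TD error before the step."""
        # The bootstrapped target is treated as a constant (no gradient through it).
        target = reward
        if not done:
            target += gamma * max(self.q_values(next_state))
        hidden, q = self.forward(state)
        error = target - q[action]
        # Gradient descent on 0.5 * error**2 for the chosen action's output only.
        for j, h in enumerate(hidden):
            grad_pre = error * self.w2[action][j] * (1.0 - h * h)  # uses old w2
            self.w2[action][j] += lr * error * h
            self.b1[j] += lr * grad_pre
            for i, s in enumerate(state):
                self.w1[j][i] += lr * grad_pre * s
        self.b2[action] += lr * error
        return error


if __name__ == "__main__":
    net = TinyQNetwork(n_inputs=3, n_hidden=8, n_actions=2, rng=random.Random(0))
    # Hypothetical terminal transition: state 0, action 1, reward 1.
    for step in range(5):
        err = net.td_update([1.0, 0.0, 0.0], 1, 1.0, [0.0, 1.0, 0.0], done=True)
        print(step, round(err, 4))
```

A real deep RL system would add many things this leaves out (experience replay, target networks, exploration, an actual environment); the only point here is that the table of values has been replaced by a parametric function trained on TD errors.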

Interesting Papers/Talks

These are the papers I flagged in the conference app. Did I attend all of these talks? No. Did I attend all of the posters? Also no. In hopefully-meaningful categories:

Neural Networks

Reinforcement Learning / Bandits

Representation Learning

Other / ???

Geometry in Machine Learning Workshop

Is the title of this workshop an intentional Lord of the Rings reference? I sure hope so.

I spent the whole day at this workshop, since I was presenting a poster and also yay differential geometry.

So why care about geometry for machine learning? Firstly, by geometry we're talking about differential geometry, which is focused on differentiable manifolds (manifolds which locally look flat). Data usually lies on a manifold. We often assume this manifold is Euclidean space (nice and flat), but it often isn't. A simple example is data which lies on a circle, which you've encountered if you've ever dealt with angular measurements. Gregory S. Chirikjian gave a really nice illustrative example in his talk "Learning and Lie Groups": if you consider the range of motions available to a simple noisy robot, after a certain number of steps its possible location will be given by some probability distribution (this is called the 'banana distribution'). This distribution is not Gaussian in x and y (the coordinates of the Euclidean manifold a.k.a. the plane the robot was moving on), but if you recall that its motions were constrained to come from a Lie group (specifically the planar special Euclidean group, SE(2), consisting of translations and rotations in the plane), you can define a Gaussian distribution relative to coordinates in that group space (since Lie groups are manifolds), and this distribution describes its location. For more details, see the paper: The Banana Distribution is Gaussian: A Localization Study in Exponential Coordinates.
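The banana is easy to reproduce. The sketch below is not from the talk or the paper: it assumes a toy robot that drives forward in fixed-length steps while its heading does a Gaussian random walk, and all parameter values are invented. It also includes the standard SE(2) logarithm and exponential maps, so that each end pose can be expressed in exponential coordinates (v1, v2, theta). It assumes the accumulated heading stays well inside (-pi, pi), which holds for these noise levels.

```python
import math
import random


def simulate_endpoint(rng, steps=100, step_len=0.01, heading_noise=0.05):
    """Drive nominally straight along x while the heading performs a random walk."""
    x = y = theta = 0.0
    for _ in range(steps):
        theta += rng.gauss(0.0, heading_noise)
        x += step_len * math.cos(theta)
        y += step_len * math.sin(theta)
    return x, y, theta


def _v_entries(theta):
    # Entries of V(theta) = [[a, -b], [b, a]], which links translation and v.
    return math.sin(theta) / theta, (1.0 - math.cos(theta)) / theta


def se2_log(x, y, theta):
    """Exponential coordinates (v1, v2, theta) of the pose (x, y, theta)."""
    if abs(theta) < 1e-9:
        return x, y, theta
    a, b = _v_entries(theta)
    det = a * a + b * b
    # Apply the inverse of V(theta) to the translation.
    return (a * x + b * y) / det, (-b * x + a * y) / det, theta


def se2_exp(v1, v2, theta):
    """Inverse of se2_log: map exponential coordinates back to a pose."""
    if abs(theta) < 1e-9:
        return v1, v2, theta
    a, b = _v_entries(theta)
    return a * v1 - b * v2, b * v1 + a * v2, theta


def correlation(us, vs):
    """Pearson correlation of two equal-length sequences."""
    n = len(us)
    mu, mv = sum(us) / n, sum(vs) / n
    cov = sum((u - mu) * (v - mv) for u, v in zip(us, vs))
    su = math.sqrt(sum((u - mu) ** 2 for u in us))
    sv = math.sqrt(sum((v - mv) ** 2 for v in vs))
    return cov / (su * sv)


def sample_endpoints(n=2000, seed=0):
    rng = random.Random(seed)
    return [simulate_endpoint(rng) for _ in range(n)]


if __name__ == "__main__":
    poses = sample_endpoints()
    xs = [p[0] for p in poses]
    ys_squared = [p[1] ** 2 for p in poses]
    # A Gaussian in (x, y) would leave x and y^2 uncorrelated; the banana does not.
    print("corr(x, y^2) =", correlation(xs, ys_squared))
    print("first pose in exponential coordinates:", se2_log(*poses[0]))
```

The negative correlation between x and y squared is the bend of the banana: the further the robot drifts sideways, the less far forward it gets. Checking how Gaussian the samples look after `se2_log` is left as an exercise, since that is the claim of the paper rather than something this sketch verifies.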

Reasons to be careful when your data lies on a manifold seem to be:

This second point is actually highly relevant to the work I was presenting at the workshop, which will become entirely clear once I put the paper on the arXiv.

I think machine learning as a field already cares about manifolds a lot, particularly when it comes to finding low-dimensional subspaces within a dataset. This workshop was however primarily concerned with cases where the (sub-)manifold is already known.

And now, the content: (also, you can get the slides for these talks on the workshop page)

Nicolas Boumal spoke about Optimisation on Manifolds. Here is his PhD thesis on the topic. The take-homes were:

Laura Balzano spoke about Subspace Learning by Incremental Gradient Descent on the Grassmannian.

Gregory S. Chirikjian spoke about Learning and Lie Groups as I mentioned above:

Tom Fletcher spoke about Probabilistic Geodesic Models. The motivation is shape analysis (with a medical application in brains), particularly for dimensionality reduction and regression.

Katherine St. John spoke about Dimensionality Reduction on Treespaces, specifically evolutionary trees. Hey, biology! Phylogenetics! The core issue is: you see a set of organisms (their genomes, rather) and want to find the optimal evolutionary tree, out of a very very large set of trees. What to do? Metrics on trees usually look at things like rearrangements ("remember balancing red-black trees?"), distances which are NP-hard to compute. I apparently didn't take many notes during this talk, so have some likely-relevant references:

Mikhail Belkin spoke about Eigenvectors of Orthogonally Decomposable Functions: Theory and Applications. This was partially lost on me, but what I got was:

- we have a well-defined notion of eigenvectors and eigenvalues for matrices, but what of tensors (multilinear forms)? There's no spectral theorem here, the idea of rank is different, 'things are just sort of unpleasant'
- focusing on orthogonally-decomposable tensors makes things easier (sort of an analogue of eigen-decomposition)
- then the trick is to recover the 'basis' the tensor is orthogonally-decomposable on
- he said this was primarily about work with Rademacher and Voss, so this paper is likely the reference: Basis Learning as an Algorithmic Primitive, Belkin, Rademacher, Voss

Finally, Stephen Marsland spoke about Principal Autoparallel Analysis: Data Analysis in Weitzenbock Space. This talk got into discussion of connections (maps between elements of tangent spaces), and their curvature, and torsion. It had the same effect that looking at my copy of Spivak's 'A Comprehensive Introduction to Differential Geometry' has: excitement to (re)learn these things but the vague guilt of indulgence in intellectually stimulating but maybe not so directly applicable mathematics. But so cool. Also the sense of having come so close to getting fibre bundles. One of these days.

And that's where my notes end.

The poster session was really good in that I got to speak about my work a lot, but really bad in that it ended before I got to see anyone else's work, or talk much about my work at all. I had so many more things to say! Good thing I have a blog. I'm also working on a manuscript which is very almost ready to go on the arXiv, honestly.

Computational Frameworks for Personalisation Workshop

Mistakes were made. I spent the first quarter of this workshop working the registration desk, and the second quarter standing outside the workshop. The afternoon I spent at Machine Learning in Social Good Applications, which was not a mistake (although I arrived too late to get a t-shirt in my size), as I think I had already seen the work from David Blei's talk presented at the New York Academy of Sciences Machine Learning Symposium.

The name of the workshop got truncated to 'Computational Frameworks' on the sign outside, so I got to feel vaguely useful providing disambiguation services while trying to glimpse content.

The content I was most interested in (and managed to catch part of) was Joelle Pineau speaking about Contextual Bandits for Effective Discovery of Personalized Adaptive Treatment Strategies. The focus here is on adaptive protocols, such as adaptive clinical trials or adaptive treatment strategies. In each case, earlier outcomes influence subsequent decisions: it's, you know, adaptive. The computational framework they use is the multi-armed bandit: you have a set of K actions with probabilistic outcomes. You don't know the outcomes or the probabilities, but you have to select actions to maximise some expected utility. This poses the classic exploration-exploitation trade-off so integral to sequential decision making. Once you discover an 'ok' action, do you choose it repeatedly (exploiting it), or do you attempt to find yet better actions, risking stumbling upon inferior outcomes (exploration)? This also raises questions about whether it's possible to explore 'safely', which was the subject of Andreas Krause's keynote at AAAI this year.

Back to exploration-exploitation: In adaptive Bayesian trials, they use Thompson Sampling. This requires having a posterior over models, sampling one and selecting the action with highest expected utility relative to that model. So you act greedily given your belief (exploiting), but your belief is random (exploring). Another approach is to define an upper confidence bound (Auer 2002), where you estimate the confidence of the estimate of the expected utility of an action using how many times the action has been tried, and select arms maximising the estimate + the confidence bound. In this way, you select actions which are either very good, decent and uncertain, or very uncertain. The third example in her slides is BESA: Best Empirical Sampled Average (Baransi, Maillard, Mannor, 2014), which seems to involve subsampling the arm which has more data, then selecting the one with highest expected reward.
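For concreteness, here is a minimal sketch of the three strategies on a toy two-armed Bernoulli bandit. None of it comes from the slides. The assumptions are: rewards are 0 or 1, Thompson Sampling uses a Beta(1, 1) prior per arm, the confidence bonus is the UCB1 form sqrt(2 ln t / n), and BESA is the two-armed version with ties going to the less-played arm (my reading of the method, so treat that detail with suspicion). The arm probabilities and function names are invented.

```python
import math
import random


def thompson(rewards, t, rng):
    """Sample a success rate per arm from its Beta posterior, then act greedily."""
    draws = []
    for r in rewards:
        wins = sum(r)
        draws.append(rng.betavariate(1 + wins, 1 + len(r) - wins))
    return max(range(len(draws)), key=draws.__getitem__)


def ucb1(rewards, t, rng):
    """Play each arm once, then maximise empirical mean plus confidence bonus."""
    for arm, r in enumerate(rewards):
        if not r:
            return arm
    scores = [sum(r) / len(r) + math.sqrt(2.0 * math.log(t) / len(r))
              for r in rewards]
    return max(range(len(scores)), key=scores.__getitem__)


def besa(rewards, t, rng):
    """Two-armed BESA: subsample the better-explored arm down to the other's size."""
    a, b = rewards
    n = min(len(a), len(b))
    if n == 0:
        return 0 if not a else 1
    mean_a = sum(rng.sample(a, n)) / n  # sampling without replacement
    mean_b = sum(rng.sample(b, n)) / n
    if mean_a != mean_b:
        return 0 if mean_a > mean_b else 1
    return 0 if len(a) <= len(b) else 1  # tie: prefer the less-played arm


def run(policy, probs, horizon, seed=0):
    """Play `horizon` rounds and return how often each arm was pulled."""
    rng = random.Random(seed)
    rewards = [[] for _ in probs]
    for t in range(1, horizon + 1):
        arm = policy(rewards, t, rng)
        rewards[arm].append(1 if rng.random() < probs[arm] else 0)
    return [len(r) for r in rewards]


if __name__ == "__main__":
    for policy in (thompson, ucb1, besa):
        print(policy.__name__, run(policy, probs=[0.2, 0.8], horizon=2000))
```

All three share the same interface, so the exploration behaviour is the only thing that differs: Thompson explores through posterior randomness, UCB1 through an explicit bonus, and BESA through the noise introduced by subsampling.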

The specific application was cancer, specifically trying to minimise tumour volume in mice. They did a pure exploration phase, where mice with induced tumours had random treatments of combinations of two drugs (fluorouracil and imiquimod). They then considered the adaptive problem of selecting treatments given the current tumour size. This makes it a contextual bandit problem. They used Gaussian Processes to model the reward function over the space of continuous contexts (tumour sizes) and arms (discrete treatments). Then, given a specific context, you can select the arm maximising the expected reward, using these earlier-described methods. At this point there's a reference to Durand & Pineau 2015 for the GP extension of BESA but I somehow cannot find it. The idea seems to be to re-estimate the GP using a sub-sample of the data, then using that GP to estimate the maximum expected reward. Preliminary results using the adaptive approach look promising, and they're interested in doing sequential reinforcement learning (rather than bandits) in the future.
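A rough sketch of the contextual step, under loudly stated assumptions: one independent zero-mean GP per treatment over a one-dimensional context, an RBF kernel with fixed hyperparameters, and arm selection by posterior mean plus a multiple of the posterior standard deviation (the UCB flavour, not their GP version of BESA, which I could not find). The treatment names and all the numbers are invented and have nothing to do with the real mouse data.

```python
import math


def rbf(a, b, length_scale=1.0):
    """Squared-exponential kernel on scalar contexts."""
    return math.exp(-0.5 * ((a - b) / length_scale) ** 2)


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def gp_posterior(xs, ys, x_star, noise=0.1, length_scale=1.0):
    """Posterior mean and variance of a zero-mean GP at the context x_star."""
    n = len(xs)
    K = [[rbf(xs[i], xs[j], length_scale) + (noise ** 2 if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    k_star = [rbf(x, x_star, length_scale) for x in xs]
    alpha = solve(K, ys)       # K^{-1} y
    weights = solve(K, k_star)  # K^{-1} k_star
    mean = sum(k * a for k, a in zip(k_star, alpha))
    var = rbf(x_star, x_star, length_scale) - sum(
        k * w for k, w in zip(k_star, weights))
    return mean, max(var, 0.0)


def select_arm(history, context, beta=1.0):
    """history maps arm -> (contexts, rewards); pick argmax of mean + beta * std."""
    best_arm, best_score = None, -math.inf
    for arm, (xs, ys) in history.items():
        mean, var = gp_posterior(xs, ys, context)
        score = mean + beta * math.sqrt(var)
        if score > best_score:
            best_arm, best_score = arm, score
    return best_arm


def demo_history():
    """Made-up data: treatment_a works for small contexts, treatment_b for large."""
    contexts = [0.0, 1.0, 2.0, 3.0, 4.0]
    return {
        "treatment_a": (contexts, [1.0, 0.8, 0.5, 0.2, 0.0]),
        "treatment_b": (contexts, [0.0, 0.2, 0.5, 0.8, 1.0]),
    }


if __name__ == "__main__":
    history = demo_history()
    for context in (0.5, 2.0, 3.5):
        print(context, select_arm(history, context))
```

In the real problem the reward would be something like negative tumour volume, and the exploration phase with random treatments is what would populate `history` in the first place.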

Machine Learning In Social Good Applications

I approximately made it to the Disease section of this workshop, which is unfortunate because I would have liked to see Quantifying and Reducing Stereotypes in Word Embeddings, Bolukbasi et al. I'd consider this under the umbrella task of removing unwanted patterns from data, or perhaps more accurately, training a model such that it doesn't pick up on these patterns. See also 'racist algorithms' and this ProPublica piece on Machine Bias. Will there be a conference summary where I don't mention Fairness, Accountability and Transparency in Machine Learning? Probably not.

Anyway, I have an especially strong memory of Barbara Han's talk on Predicting Novel Tick Vectors of Zoonotic Diseases, possibly because it contained many horrifying images of ticks. This work is part of a project to use machine learning to predict zoonotic diseases, and also featured a (iirc) undergraduate researcher! The problem is basically: ticks act as disease vectors, but not all of them carry zoonoses. They mined entomological literature (and maybe other sources) to come up with feature sets for ticks, trained a supervised classifier (if I recall they used boosted regression trees), and predicted novel vectors. They also did some feature analysis to understand what differentiates these classes of tick. It turns out that a strong predictor is the number of hosts the tick feeds on. It seems like this could be confounded with the need to feed on a specific host (since that host has to be a reservoir of the zoonosis); I asked, and they hadn't done a breakdown looking at the specific species. Anyway, a straightforward machine learning task but an important problem in ecology and epidemiology.

A Rant about the Venue

Times Square is the worst. Times Square is why people hate NYC. Tunnels should be built under Times Square so we never have to look at it. I acknowledge its utility to tourists and I reserve through gritted teeth some respect for their bloody-minded dedication to milling at junctions, drifting absent-mindedly across sidewalks, and stopping suddenly. I just don't enjoy being the person trying to weave between them on my way to lunch, especially when it's summer in NYC and I'm an inappropriately-attired Irishwoman. (We don't do 'direct sunlight' very well.)

I thought of some reasons to locate a conference on Times Square:

The venue itself was interesting because the conference was distributed across multiple floors. This meant lots of using the futuristic elevator system. I was involved in more than one 'what algorithm does this elevator system use' conversation. And hey, here's the chapter of the Sutton Reinforcement Learning book about Elevator Dispatching. I wonder how many interesting methods have been developed to solve simple problems arising in the work environment of engineer/scientist types. I certainly used to think about the optimal road-crossing strategy when I lived in NYC (the problem is slightly interesting because east/west and north/south aren't symmetric due to differing block lengths and crossing times, so always going with the go sign isn't an optimal policy[citation required]).

The negative side-effect of this layout was (to me) a lack of general 'focal point' for the conference, especially since there were various other things going on in the hotel. (Excitingly, on the final day there was an Edward Tufte seminar on the same floor as us.)

TL;DR limit registrations to a number your venue can comfortably accommodate. Turning people away is sad (especially if they are, like me, students who only knew they were going once their workshop submission was accepted), but overcrowding is detrimental to good conferencing.

In Conclusion

Despite missing about half the conference between volunteering, working and being sick, I saw a lot of good work and had some great discussions with people. I'm a bit disappointed there was no proper closing ceremony with summary statistics like at NIPS (unless it was at the party on the Wednesday, which I spent coughing in my hotel room). The multi-track format makes it a little hard to get an overview of the broader field, and there was a strange lack of closure on the last day. I'd say I'm looking forward to next year, but I think* it's going to be in Sydney, so we'll see about that.

*I don't know why I think this and I can't find any evidence supporting it. I did however learn that ICML also stands for:

The more you know.