We return for another installment of Stephanie Summarises a Conference. My previous work in this area is NIPS 2015, AAAI 2016, and ICML 2016. I was pleasantly surprised at NIPS to be asked if I was going to write one of these again. Apparently someone somehow found my blog. Ignorance of this is one of the downsides (??) of not having creepy tracking analytics.

This time we get a table of contents so I can be guiltlessly verbose (I fear how long my PhD thesis is going to be):

Women in Machine Learning Workshop

"What are women and how can machine learning stop them?"

I didn't register for WiML in time last year, so this was my first time attending. I also managed to miss all the Sunday events by arriving in Barcelona at midnight that night. There was a workshop on Effective Communication where I could perhaps have learned how to write shorter blog posts.

My feelings about having 'woman-only/woman-centric' events are complex, poorly-understood and otherwise beyond the scope of this particular post, but the reality is that women are wildly underrepresented in computer science and machine learning is no exception (about 15% of the 6000-odd NIPS attendees were women, and I don't know what fraction of those were recruiters). I'm so used to being surrounded by men that I barely notice it (except for the occasional realisation that I'm the only woman in a room), so having a large conference hall full of women for this workshop was a bit surreal.

Interesting talks/posters:

(talk) Maithra Raghu, On the expressive power of deep neural networks. They study the expressive power (ability to accurately represent different functions) of neural networks and show that this depends on a quantity they call 'trajectory length'. There's also a companion paper, Exponential expressivity in deep neural networks through transient chaos.
(poster) Niranjani Prasad, Barbara Engelhardt, Li-Fang Cheng, Corey Chivers, Michael Draugelis and Kai Li. A Reinforcement Learning Approach to Weaning of Mechanical Ventilation in ICU: relevant to my ICU-interests, but this poster was unfortunately on the other side of the board to mine, so I only got to look at it briefly. They're using MIMIC-III, looking at pneumonia patients and the question of intubation. A challenge was engineering the reward function, which required consultation with clinicians.
(poster) Luisa M Zintgraf, Taco S Cohen, Tameem Adel and Max Welling. Visualizing Deep Neural Network Decisions. They propose a 'prediction difference analysis' method to visualise regions of an image which either support or oppose a particular prediction. This is based on assigning 'relevance' to parts of the input, based on the 'weight of evidence' a particular input gives to a certain class. This is a pre-existing idea, and a cursory glance at the paper doesn't highlight what's novel about their approach - possibly applying it to deep networks? Extending it to analysing the influence of multiple features at a time, possibly?

I accidentally presented my poster for most of the poster session and therefore missed out on going around to others. This is a compelling argument for having co-authors who can share the load. For the record, the work I was presenting was Learning Unitary Operators with Help from u(n), which I did with my advisor Gunnar Rätsch, and which will be appearing in AAAI-17. I also presented it at the Geometry in ML workshop at ICML, see my post here.

Roundtables

What I found especially valuable and unique about WiML were the roundtables - one for research advice, one for career guidance. In each one there were subtables for specific topics, with 'experts' to extract wisdom from.

I shamelessly hogged space at the healthcare research roundtable in the first session to listen to Jennifer Healey. She's a researcher at Intel Labs working on using sensor data for human health. That is, if you have continuous audio recording (as one can get from a phone microphone), you can identify a person coughing, measure qualities of it, its frequency, onset and so on. This information is incredibly valuable for making diagnoses and treatment decisions, and it's the kind of data that one could reasonably imagine everyone collecting in the future. One thing I really enjoyed about the discussion was that she was quite aware of the HORRIFYING PRIVACY IMPLICATIONS of this kind of data, and the need to avoid storing (and calculating on) this data on The Cloud. I'm really excited about this avenue of healthcare (as I say every time it comes up) and I'm really glad to hear a senior researcher from a big company talking about the importance of the privacy considerations. As was mentioned in the ML and Law symposium, all personal data you collect is a privacy vulnerability. But collecting this data could have such massive positive healthcare implications that 'solving' the privacy problem is really important. Especially if the data is going to end up getting collected anyway...

The second roundtable I went to (about careers/advice), I spoke to some people at Deepmind about working there (me and everyone else at NIPS, it feels like...), and some other people about how to decide between industry (that is, industrial research) and academia. Both experts at the industry/academia table were in industry, so I'm not sure I got an unbiased perspective on it. The context for all of this is that I'm a 'late-stage' PhD student (the idea of that is rather scary to me - there's still so much to learn!), so I'm looking for internships (got any spare internships? contact me) and thinking about post-PhD land. The most concrete difference I learned about was that in companies, you may need to send your paper to the legal team before submitting it to a conference, in case they want to patent something first. I'd imagine this also applies to preprints and code and so on. Otherwise, the level of intellectual freedom one enjoys seems to vary, but everyone I spoke to (from a biased sample) seemed largely unconstrained by their industrial ties.

I'd imagine there's a gulf of misery between brand-new startups that have yet to become overly concerned with Product, and established tech companies with the luxury of blue-skies research labs, where you don't get to do cool things and instead must live in a box desperately trying to demonstrate the commercial viability of your research. I'd also imagine that said box-dwellers don't attend roundtables (how do you fit a round table in a square box?).

The final notable thing that happened at WiML was me apparently winning a raffle, but being shamefully absent. I was upstairs charging my laptop and catching up with a friend from MLSS, blissfully ignorant of the prize I would never receive.

The Main Conference

Invited Talks

The main conference opened with a talk (the Posner Lecture) from Yann LeCun. LeCun is famous enough in machine learning that people were excitedly acquiring and then sharing selfies taken with him (a practice I find puzzling), so the things he said will likely echo around the community and I need not repeat them in detail here. In gist he was talking about unsupervised learning (although focusing on a subtle variant he called 'predictive learning'). He used a cake analogy which spawned parodies and further cake references throughout the conference/social media. The analogy is that reward signals (as in reinforcement learning) are the cherry, labels for supervised learning are the icing, and the rest of the cake is essentially unlabelled data which requires unsupervised learning. The growing importance of unsupervised learning is not new, I can say from my intimidating one year of previous NIPS conferences.

Marc Raibert from Boston Dynamics gave an entertaining talk about dynamic legged robots. This featured many YouTube videos I'd already seen, but was happy to gormlessly rewatch. One amusing thing is the fact that they can't use hydraulics in domestic robots, because they leak. That's a great example of a real-world problem. It might be common knowledge amongst roboticists, but 'you can't use hydraulics because nobody wants oil and stuff on their carpet' would not have occurred to me if I for some reason needed to design a robot. Now, maybe I would not need to design a robot directly, but it's not entirely unlikely that I could design an algorithm making assumptions about the kinds of movements, or the cost of those movements, that a robot could make. And this is why 'domain experts' will always be needed. Probably.

At the end of the talk, someone asked if Boston Dynamics uses machine learning. They do not. Maybe they should?

Saket Navlakha spoke about 'Engineering Principles from Stable and Developing Brains'. Part of this talk was based on this PLoS CB paper where they compare neural network development in the brain to that of engineered networks. In brains, connections are created rapidly and excessively, and then pruned back over time dependent on use (they demonstrate this in mouse models). This is to be contrasted with engineered networks, where adding and removing edges in this way would be seen as wasteful. They demonstrate however that the hyper-creation and then aggressive pruning results in improved network function. They're particularly interested in routing networks, so the applicability to artificial neural networks is not immediately apparent.

Susan Holmes gave the Breiman Lecture, which exists to bridge the gap between the statistics and machine learning communities. This was the single talk of the conference where I took notes, because the relevance of the topic to me and others in my lab overwhelmed the need to preserve precious limited laptop battery. The title of the talk was "Reproducible Research: the case of the Human Microbiome", and so was mostly a story about how to do reproducible research, in the context of microbiome analysis. One really cool thing she mentioned was a web application called shiny-phyloseq, which seems to be an interactive web interface to their phyloseq package. However, it also (I think) records what you do with the data as you explore, which you can then export as a markdown file to include with your paper. I try to emulate this by pipelining my analysis in bash scripts (or within python), but having something to passively record as you interactively explore data seems additionally very beneficial. The garden of forking paths is a risk during any data exploration. Also, the garden of forgetting exactly what preprocessing steps you did.

There was a touching memorial to Sir David MacKay during one of the sessions. It's easy, as an early-stage scientist, to get swept up in the negative aspects of academic culture (looking at you, Publish or Perish) and lose sight of the reasons for doing any of this. Hearing about scientists like MacKay, who both think and care deeply, is genuinely inspirational. The only book on my Christmas wishlist this year is "Information Theory, Inference, and Learning Algorithms".

Interesting Papers/Posters

Necessarily, a subset of the interesting work.

Misc

Learning Transferrable Representations for Unsupervised Domain Adaptation - Ozan Sener · Hyun Oh Song · Ashutosh Saxena · Silvio Savarese - jointly learn representation, cross-domain transformation as well as labels to do better domain adaptation.
Examples are not enough, learn to criticize! Criticism for Interpretability - Been Kim · Oluwasanmi Koyejo · Rajiv Khanna - this was a great poster and spotlight talk. The idea is this: to help make sense of massive datasets, we ideally identify some 'representative samples' ('prototypes') which we can manually assess and use to generalise about the rest of the data. The danger is that there will be non-stereotypical data points, which are nonetheless represented in the data and should be considered. They call these examples 'criticisms', and describe an approach to generate both prototypes and criticisms from large datasets.
Disease Trajectory Maps - Peter Schulam, Raman Arora - the objective here is to find latent representations of patient trajectories, and then characterise them (i.e. through clustering). They use a fairly complicated probabilistic model to do this, so the more interesting details are in the paper. They also associate the representations with clinical outcomes to prove that they're 'clinically meaningful', comparing with some other methods of representing time series.

Reinforcement Learning

Cooperative Inverse Reinforcement Learning - Dylan Hadfield-Menell · Stuart J Russell · Pieter Abbeel · Anca Dragan - in traditional inverse reinforcement learning (IRL), the agent tries to learn the expert's reward function. However, to have benevolent robots, we would like them to maximise rewards for humans, not themselves. Additionally, in IRL the agent observes assumed-optimal expert trajectories, which may nonetheless be sub-optimal for learning - one would rather generate teaching, or demonstration trajectories. They formulate a solution to these concerns as a two-player game with learning and acting (deployment) phases.
Showing versus doing: Teaching by demonstration - Mark K Ho · Michael Littman · James MacGlashan · Fiery Cushman · Joseph L Austerweil - this work focuses on the second issue raised in the previous one - how does a teaching trajectory differ from a doing trajectory? They formulate it as 'Pedagogical Inverse Reinforcement Learning'. What's really neat about this work is that they actually did experiments with humans to validate their model's predictions about how people would behave while trying to teach versus simply doing.
Safe and Efficient Off-Policy Reinforcement Learning - Remi Munos · Tom Stepleton · Anna Harutyunyan · Marc Bellemare - 'safety' in this work refers to the capacity of the algorithm to deal with arbitrary 'off-policyness' (that is, the policy to evaluate and the behaviour policy observed need not be close), and 'efficiency' refers to using data ... efficiently. The work seems to combine previous approaches which are either safe or efficient into an algorithm enjoying the benefits of both, with various theoretical results.
Safe Exploration in Finite Markov Decision Processes with Gaussian Processes - Matteo Turchetta · Felix Berkenkamp · Andreas Krause - 'safe' here roughly has its common meaning. They address the issue where an agent, looking to maximise long-term (discounted, perhaps) reward, is willing to tolerate temporary very negative rewards. This is unacceptable for safety-critical agents - they used the example of a Mars rover getting stuck in a crater - so they develop an algorithm (SafeMDP) to safely explore, avoiding unsafe states/actions using noisy observations from nearby states. They also ensure the agent can't get stuck in states without safe escape routes.

Recurrent Neural Networks

Sequential Neural Models with Stochastic Layers - Marco Fraccaro · Søren Kaae Sønderby · Ulrich Paquet · Ole Winther - they combine state-space models (uncertainty about states) with recurrent neural networks (sequential, long time dependencies), and describe a variational inference procedure for the model.
Phased LSTM: Accelerating Recurrent Network Training for Long or Event-based Sequences - Daniel Neil · Michael Pfeiffer · Shih-Chii Liu - they add a time gate to the LSTM unit, which has a parametrized oscillation frequency, controlling when individual parts of the memory cell can be updated. This allows for irregularly sampled sensor data to be integrated and they demonstrate improved performance on long memory tasks. They also have really nice figures.
Full-Capacity Unitary Recurrent Neural Networks - Scott Wisdom · Thomas Powers · John Hershey · Jonathan Le Roux · Les Atlas - this is pretty relevant for/similar to my recent work, so I'm going to read this paper in detail later. My initial thought upon seeing the poster is that they have some really unnecessary mathematics in there, which also appears in the manuscript - the entirety of section three in their paper is self-evident. I'm a bit concerned that reviewers might think well-known mathematical facts restated as 'theorems' may constitute novel results. Anyway cattiness aside, their model is interestingly different to my approach - they optimise on the Stiefel manifold of unitary matrices directly (I optimise in the Lie algebra), although if you define the Riemannian gradient using inner products on the tangent space, this probably becomes equivalent in some sense. It requires further analysis. Their results seem quite impressive, although they don't do a comprehensive comparison on the same experiments as Arjovsky & Shah, which are the ones I'm familiar with. I had a nice conversation with one of the authors at the poster, which is really what conferences are about.
RETAIN: An Interpretable Predictive Model for Healthcare using Reverse Time Attention Mechanism - Edward Choi · Mohammad Taha Bahadori · Joshua Kulas · Jimeng Sun · Andy Schuetz · Walter Stewart - their focus here is to have an interpretable model, so the evidence used to make a decision is easily identified. They achieve this using an attention mechanism where the recurrence is on the attention mechanism, not on the hidden state. I'm not sure why RNNs should be seen as intrinsically uninterpretable (you can get gradients of cost with respect to any input, for example), so I'm going to think about this more. Interpretability is crucial for any medical applications.
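
To make the Phased LSTM time gate above a little more concrete, here is a minimal sketch of a periodic gate whose 'openness' depends only on the timestamp, so irregularly sampled inputs need no special handling. The piecewise-linear shape and the parameter names (tau, shift, r_on, leak) are my assumptions for illustration, in the spirit of the paper rather than a copy of the authors' implementation.

```python
def time_gate_openness(t, tau, shift=0.0, r_on=0.05, leak=0.001):
    """Openness in [0, 1] of a periodic time gate at timestamp t.

    tau   -- oscillation period (hypothetical parameter name)
    shift -- phase offset of the oscillation
    r_on  -- fraction of each period during which the gate is open
    leak  -- small slope applied while the gate is closed
    """
    # Position within the current cycle, in [0, 1).
    phase = ((t - shift) % tau) / tau
    if phase < 0.5 * r_on:
        return 2.0 * phase / r_on        # gate opening: rises from 0 to 1
    if phase < r_on:
        return 2.0 - 2.0 * phase / r_on  # gate closing: falls from 1 to 0
    return leak * phase                  # gate closed: only a small leak


def gated_update(old_state, proposed_state, openness):
    """Blend the proposed state into the old one according to gate openness."""
    return openness * proposed_state + (1.0 - openness) * old_state
```

With a closed gate the memory cell keeps (almost all of) its old value, which is one way to see why long-memory tasks could get easier: most units simply aren't being overwritten at most time steps.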

Machine Learning and the Law Symposium

I was and remain confused by the choice of symposia. The options were: Deep Learning, Recurrent Neural Networks, and ML and the Law. RNNs aren't deep? What was the DL symposium covering? Deep but Not Recurrent Learning? Weight-Sharing Is OK but Not Over Time, Never Over Time? As evidenced by the title of this section, I didn't attend either of them, and I also didn't attend enough of the Counterfactual Reasoning workshop on Saturday to say what would have happened if I had gone to them, but there seems to be a naming/scope issue here. Whatever it was, the RNN Symposium was Hot Shit and had to switch rooms with us ML+Law people during the lunch break. As soon as the room change was announced, people started appearing at the fringes of the Law symposium and may have been inadvertently exposed to some meta-ethics. I'm not sure how this planning error occurred - it is natural to assume that most of the growth in NIPS attendance is coming from DEEP LEARNING, which should (??) include RNNs, so that symposium was likely to be popular. Maybe they thought enough people would go to the other DL symposium.

The real question is - did non-DL non-justice machine learners feel cheated of a symposium? Am I wrong to try to place the RNN symposium inside the DL one?

Having just published a paper (arguably) about RNNs, I should have gone to the RNN symposium, but I can't resist thinking about the broader social impact of machine learning. I've also found myself thinking about morality and justice (and therefore law) more than usual lately, so I had to attend this. Discussions of normative ethics at a machine learning conference? Yes.

I'd consider this symposium a law-oriented follow-on to the 'Algorithms Among Us: the Societal Impacts of Machine Learning' symposium at NIPS 2015 (see my summary here). Having a focus is good. The impacts of machine learning on society are widespread, so trying to cover them all forces a shallower treatment. High level talk is well and good, but getting stuff done requires being specific. This is actually a point that was raised during one of the panel discussions: how do we balance the need in computational science to formulate very specific, quantified definitions of things (like discrimination) with the requirement of margin of interpretation in law? I was surprised, as a non-lawyer, to hear that such ambiguity could be tolerated, much less desired. The example given for this was in discussions where compromise may only be attained through baking some ambiguity into an agreement, which would then (I suppose) later be argued over as necessary. This leads to another point which was made - law is not a monolith, laws are not absolute immutable statements - law is a process, an argumentative tradition (at least in the US), evolving and iterating and requiring justification at all times (get it - justice pun!). How to integrate algorithms into this process is not as simple as treating them as Truth Functions (shout out to my main man Wittgenstein) on Evidence ... or is it? I get ahead of myself.

Legal Perspectives

Ian Kerr, Learned justice: prediction machines and big picture privacy. The 'learned' in the title is partially a reference to the US judge Learned Hand (what a name). A quote from him, "If we are to keep our democracy, there must be one commandment: Thou shalt not ration justice". As an example of a 'learned AI' he mentioned 'The World's First Robot Lawyer', which helps people generate appeal letters. It's actually a pretty standard chat bot, but it's helped to overturn over 160,000 parking tickets in London and New York, which is a massive impact ('helping to protect vulnerable people from state coercion'). What could we do with more powerful algorithms? He then spoke about prediction, highlighting the links between prediction, preemption, and presumption. This brought us to the prediction theory of law, an idea coming from the legal scholar Oliver Wendell Holmes. This is the idea that 'the law' is simply about predicting what the courts will do, and nothing else. So the study of law is the study of prediction, not morality or anything else. He went on to talk about the 'reasonable expectation of privacy' which is required to understand the scope of the 4th Amendment of the US Constitution. The difficult part is not defining 'reasonable', but rather 'expectation'. What does this word mean? There are two interpretations: it could be normative, or predictive. The US courts have taken the latter stance, and one's 'expectation' of privacy therefore depends on what is possible with generally-available technology. This is terrifying - if I know my phone microphone is always on, and my phone is at risk of being hacked, do I lose the expectation of privacy whenever my phone is on me?
Mireille Hildebrandt, No Free Lunch. One particularly pertinent thing she spoke about was 'Data & Pattern Obesitas'. That is, there is a general desire to collect as much data as possible, to look for as many patterns as possible, simply because. This is dangerous for several reasons, the most obvious of which being that any personal data that is stored is a security risk (looking at you, Big Healthcare Databases). And so she highlighted the importance of salience of purpose, citing the security adage of 'select before you collect'. I think this idea likely goes against the inclinations of many researchers in machine learning/data science, who would rather grab everything, and do some sort of automated relevance detection later. This may be fine in certain domains, but when the data you're operating on is sensitive in some way, it can be fatal.
Deirdre Mulligan - Governance and Machine Learning: there was not so much machine learning in this talk, but she spoke about various ways technology and governance interact. Voting machines are one obvious place (and topical!). She spoke about how electronic voting systems failed to reproduce the traditional voting system. In pen-and-paper voting, the ballot is a physical artefact of the vote, but in these systems, apparently it was rendered on the fly and not saved. There was no storage of the ballot image, it simply incremented a counter somewhere in the backend. This is obviously a terrible system, but these machines were closed-source (!?!?!), so I guess nobody realised they were working like this until they reverse-engineered them? The mind boggles. Other examples are automobiles - you can hack them (like everything on the IoT), they were avoiding regulation (Volkswagen), product safety was compromised by software updates. The last case highlights the need for certification and verification of post-purchase software updates. If you want to run Windows XP on your computer that's your own business, but unsafe cars (from either software or hardware) are public safety risks.

Technical Perspectives

Aaron Roth: Quantitative tradeoffs between fairness and accuracy in machine learning - Rawls provides a definition of fairness, which is "fair equality of opportunity", which he formalised using a 'discrimination index' - the probability of victimisation (not being selected despite being the most qualified, I think) conditional on being present at a bad round (a round in which a sub-optimal applicant is selected). This was all formulated in a contextual bandit setting, and he described an algorithm called 'fairUCB' (from UCB - upper confidence bound, a standard bandit algorithm) and gave its regret bound.
Krishna P. Gummadi: Measures of fairness, and mechanisms to mitigate unfairness - the focus here was on discrimination, which is a specific kind of unfairness. So what is discrimination? A definition is "wrongfully imposing relative disadvantage based on membership in socially salient groups". One could ask what most of these terms mean exactly (and indeed, we must, if we want to computationally model anything), but he focused on the phrase "based on". Some attributes are sensitive, and some are not. Can you simply ignore them? The problem is that people in different sensitive attribute groups may have different non-sensitive feature distributions, which risks disparate mistreatment and disparate impact. One can test disparity of impact through, for example, proportionality tests, e.g. "an 80% rule" - if 50% of men are accepted, then at least 40% of women should be too. And a shout-out to Fairness, Accountability and Transparency in ML.
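
The "80% rule" from Gummadi's talk is simple enough to write down. This is a minimal sketch, assuming binary accept/reject decisions and exactly two groups; the function names and the default threshold argument are mine, not anything shown in the talk.

```python
def selection_rate(decisions):
    """Fraction of positive (accepted) decisions in a list of booleans."""
    return sum(decisions) / len(decisions)


def passes_eighty_percent_rule(rate_a, rate_b, threshold=0.8):
    """True if the lower selection rate is at least `threshold` of the higher."""
    low, high = sorted((rate_a, rate_b))
    if high == 0:
        return True  # nobody accepted in either group: no disparity to measure
    return low / high >= threshold
```

So with 50% of men accepted, 40% of women accepted sits exactly on the boundary and passes, while 30% would not. Note this only tests disparate impact on outcomes; it says nothing about disparate mistreatment (differing error rates between groups).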

There were more talks, but I was drifting into the semi-delirious pre-fever stages of the Conference Flu at this point.

Panel Discussions

The discussion spotlight was 'Regulation by Machine' from Benjamin Alarie. A question - how to use AI to make better laws? My notes are sparse but a recurring theme (also in MLHC) is that we should use machine learning to help and augment humans, not to replace them. So he was speaking about using ML to - for example - help to predict if it's 'worth' taking a case to court. Apparently many cases go to court which are 'overdetermined given the facts', and it's somewhat easy (citation needed) for an algorithm to identify which these are.

My notes on the actual panel are sketchy at best. It may have been the time or how sick I was, but it felt like people were saying a lot of interesting things without obvious argumentative structure or direction, so it's hard to summarise any salient points. Here are some decontextualised, paraphrased snippets:

Deirdre Mulligan: the judicial system is not always about applying the same law the same way. You must know the facts, the context... The law wants you to come in and argue about what it means. You can go to court to change the law (she asked how many people had been to court - a couple raised their hands - I've only been to court as a juror). Any algorithm for the law must be both performative and output-focused.
Neil Lawrence: how do judges come to opinions? Also, "I don't want to talk too much as I'm not on the panel."
??? (unknown panel member) - we're assuming the law will furnish us with specific definitions, but actually, policies breed on, thrive on, require a lack of specificity and precision - ambiguity is not an accident!
Ian Kerr: Paul the Octopus was highly accurate, but does that mean we should trust it?
Deirdre: shout-out to Nolo press, making the law easier to understand. Especially important in areas where the cost of fighting something isn't worth it...

And a final shoutout to Chief Justice John Roberts is a Robot - Ian Kerr and Carissima Mathen.

Machine Learning for Healthcare Workshop

With the caveat that these are workshop contributions, here are some interesting papers/posters (with accompanying arXiv papers, so I have a chance to remember anything about them):

Demographical Priors for Health Conditions Diagnosis Using Medicare Data - Fahad Alhasoun, May Alhazzani, Marta C. González - they look at insurance claims data from Brazil over a 15 month period - about 6.6 million visits. They represent ICD-10 codes by their distribution over ages (a 100-dimensional normalised vector) and do clustering on this representation.
Stratification of patient trajectories using covariate latent variable models - Kieran R. Campbell, Christopher Yau - they describe a kind of linear latent variable model taking patient covariates into account, and use it on a TCGA RNAseq dataset.
Learning Cost-Effective and Interpretable Regimes for Treatment Recommendation - Himabindu Lakkaraju, Cynthia Rudin - related (possibly extended version) paper here: Learning Cost-Effective Treatment Regimes using Markov Decision Processes. The 'interpretability' comes in here because their state space (of the MDP) consists of the effects on their patient population of decision lists - ordered lists of rules, each consisting of tuples of predicates (like, properties a patient must fulfill) and actions.
Modeling trajectories of mental health: challenges and opportunities - Lauren Erdman, Ekansh Sharma, Eva Unternahrer, Shantala Hari Dass, Kieran ODonnell, Sara Mostafavi, Rachel Edgar, Michael Kobor, Helene Gaudreau, Michael Meaney, Anna Goldenberg - they're interested in identifying subtypes of mental illness using time series, and predicting future phenotypic values. They use a Dirichlet Process-Gaussian Process and compare with latent class mixed models, finding that the LCMMs are actually as good as the DP-GP, although neither model is yet good enough for clinical use.
Transfer Learning Across Patient Variations with Hidden Parameter Markov Decision Processes - Taylor Killian, George Konidaris, Finale Doshi-Velez - they're concerned with patient heterogeneity, and cast this as a multitask learning problem, where different tasks are different patients. They share information between tasks using a GP-LVM, removing the requirement to visit every state to learn the dynamics (which is, of course, infeasible in medicine).
Predictive Clinical Decision Support System with RNN Encoding and Tensor Decoding - Yinchong Yang, Peter A. Fasching, Markus Wallwiener, Tanja N. Fehm, Sara Y. Brucker, Volker Tresp - they represent the patient's time series with a LSTM encoder and concatenate the static information into a representation. As a decoder, they use tensor factorisation. I'm not entirely clear on what is actually contained in this tensor, so the paper will need to be read more carefully.
Multi-task Learning for Predicting Health, Stress, and Happiness - Natasha Jaques, Sara Taylor, Ehimwenma Nosakhare, Akane Sano, Rosalind Picard - they have wearable sensors and smartphone logs from 30 days of monitoring. They looked at three multi-task approaches: multi-task multi-kernel learning, hierarchical Bayes with Dirichlet process priors, and neural networks (sharing hidden layers), as well as single-task versions of all of these.
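
The age-distribution representation of diagnosis codes from the first paper in this list is easy to sketch. This is a minimal illustration, assuming visits arrive as (code, age) pairs and that ages are binned into whole years from 0 to 99; the function names and the clipping of out-of-range ages are my assumptions, not details from the paper.

```python
from collections import Counter


def age_distribution(ages, n_bins=100):
    """Normalised histogram over integer ages 0..n_bins-1 for one diagnosis code."""
    # Clip ages into the valid bin range (an assumption made for this sketch).
    counts = Counter(min(max(int(a), 0), n_bins - 1) for a in ages)
    total = sum(counts.values())
    if total == 0:
        return [0.0] * n_bins
    return [counts.get(i, 0) / total for i in range(n_bins)]


def build_code_vectors(visits, n_bins=100):
    """Map each code to its age distribution, given (code, age) pairs."""
    ages_by_code = {}
    for code, age in visits:
        ages_by_code.setdefault(code, []).append(age)
    return {code: age_distribution(ages, n_bins) for code, ages in ages_by_code.items()}
```

Each code then becomes a 100-dimensional vector summing to one, which can be handed to whatever clustering method you like.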

Mandatory shout-out to my contribution to the workshop:

Neural Document Embeddings for Intensive Care Patient Mortality Prediction - Paulina Grnarova, Florian Schmidt, Stephanie L. Hyland, Carsten Eickhoff - we used document embeddings to predict patient mortality in MIMIC-III, purely using text notes. The embedding procedure uses two layers of CNNs - word vectors are combined into sentence vectors (with a CNN), and sentence vectors are combined into patient vectors (with a CNN), and we use target replication to improve predictive accuracy. This was fairly preliminary (there are many other factors to consider, as ever), but we beat previous work using topic modelling on the task, which is encouraging, and perhaps unsurprising given LDA's inability to deal with multi-word phrases.

This is only a snippet of the interesting work presented at the workshop. I unfortunately came down with Conference Flu about half way through NIPS, and was at my sickest during the MLHC workshop (ironically), so I didn't get to speak to as many poster presenters as I would have liked.

Miscellaneous Comments/Observations

Generative Adversarial Networks are super hot right now, and by saying this I am contributing to the hype.
Despite having around 6000 attendees, NIPS didn't feel overcrowded (contrast with ICML this year). I'm guessing this was a combination of having an appropriately-sized venue and good crowd-control from the venue staff (they were closing off the top floor when it got too full), or maybe everyone was just busy enjoying Barcelona.
Being a vegetarian in Spain sucks. Given my diet was largely eggs, potatoes and bread for the week, I feel sorry for the vegans in the NIPS community. I for one devolved into a patatas-bravas guzzling monster and don't want to even think about tapas for the foreseeable future.

Conclusion

I feel less obviously exuberant about NIPS than I did last year, which I attribute to a combination of having been (and continuing to be somewhat) ill, and being in the development stage of several new projects where I just want to be getting stuff done.

As I've mentioned before, I think about approaching research in an exploration-exploitation framework. At this NIPS I realised that even within the exploration mode, one can explore exploitatively. That is, you can distinguish between diversity-increasing exploration (seeing areas of the state space/field you've never been in before) and depth-increasing exploration (refining your knowledge of partially-explored states/topics). The latter is arguably a kind of exploitation, because it's exploration with the aim to increase knowledge of things you are intending to use later. You hope.

Bringing this strained analogy back to conferences, this makes the difference between going to talks on things you already sort of know and going to totally new topics. I tried a bit of the latter, because chances are I'm going to read papers relevant to me regardless, but I found spotlight talks suboptimal for learning new ideas without sufficient background knowledge. An alternative approach would be to be incredibly exploitative, pre-emptively read the relevant papers and then talk to the authors at the poster sessions. Perhaps next year I'll be organised enough to do that, because unless you go to the tutorials, 15-minute talks of questionable presentation quality on cutting edge research are not good ways to learn new topics.

What is a good way to learn a new topic (personally) is to write about it. I've been working on a pedagogical post about sparse Gaussian process classification, which will be up next, after a brief diversion into roller derby.