databases all the way down

I'm (arguably) a data scientist, so I need data to do science. A problem with data is that it's all over the place, a natural and unsurprising consequence of its many origins. Producing databases seems to be a hobby for bioinformaticians, and that includes databases of databases (Bolser et al., 2012), so I think it's fair to say some nontrivial effort is expended in trying to deal with this data problem. Large collaborations are pretty good at creating big datasets (e.g. ENCODE, TCGA), which can lessen the appearance of scattered data (or at least heighten the attractiveness of centralised data), but these efforts are less about data curation and more about data generation. One notable effort towards more efficient (genomic) data discovery/sharing is the Global Alliance for Genomics and Health, which is working on defining standards, data formats, APIs, things like that. It's not the only project thinking about APIs and formats (of course). Skimming my notes from the Biological Data Science meeting at CSHL last year, I see: Open Science Data Framework, the NIH's bioCADDIE, Ensembl's REST API, and so on[1]. For the sake of efficiency I hope the community can come to some consensus on how best to store/index data and metadata, although a quote from Richard Durbin comes to mind[2]:

"In science, always there are lots of people looking at the same thing in different ways. There are people trying out all sorts of crazy things. It's extremely successful to not have top-down control. It can look a little bit redundant when you have a person write yet another read mapper, but sometimes things will be influential. New ideas will come. Sometimes things can be relevant to individual projects. I think for sure things are done inefficiently. I accept that. It's a bit like evolution. Random mutation and testing is very powerful."
(from "The anatomy of successful computational biology software", Nature Biotech, 2013)

Which is to say I'm excessively justifying my decision to create yet another list of resources (but just a list, I have no intention of actually serving any data). In this case it's something I wanted to do for myself anyway, and more importantly I'm starting a new category of non-blog pages on my site (although technically the contact page was first). The idea is to separate out pages that I feel are less time-sensitive (like tutorials), or that I intend to keep updated (like my contact details), and to treat things on the blog as an unmodifiable record that will likely become outdated. So here is my dataset database. It covers 'things which are relevant to me', which you might find useful, if you're me. It might also be useful for people using convoluted methods to infer my research interests.


[1] It's reasonably likely these are not equivalent projects. I'm not really familiar with any of them, so YMMV.

[2] Duplication of effort is something I think about a lot in the context of bioinformatics/computational biology/computers. This quote makes me feel better about it.