Open data my a**

Reading the data sharing policies of most major scientific journals, one would think “open data” is almost a given, considering all of the standards and sharing platforms that are in place. Moreover, given the “reproducibility crisis” that the web and journal editorials are abuzz with, one would be inclined to think that this kind of sharing is constantly and fervently enforced: if you publish something, you must share the data on which your conclusions are based. End of story.

For the most part, in the field of metagenomics at least, one would be wrong. Over the past couple of years, whenever I have tried to download a dataset (be it full shotgun sequencing or just 16S data), I have almost always run into unnecessary and uncomfortable obstruction. One prominent exception is the Human Microbiome Project (HMP). Hats off to all involved for providing a great example of open data sharing for an entire community. All the shotgun and 16S data generated are a couple of clicks away. The subject of metadata... well, that is a horse of a different color.

Now to the “bad”. I am not going to list all the culprits that have shown how it should not be done; rather, I prefer to present my latest ordeal, of a Kafkaesque nature, in trying to request published data. The paper in question is “Population-based metagenomics analysis reveals markers for gut microbiome composition and diversity”. Sorry to those who cannot access this paper; you will have to take my word for it that it is an interesting paper built on a fascinating dataset (1135 gut metagenomes, shotgun sequencing as well as 16S data). I was especially interested in the raw reads, so when I scrolled to the end of the publication and found that “The raw sequence data for both MGS and 16S rRNA gene sequencing data sets, and age and gender information per sample are available from the European genome-phenome archive at accession number EGAS00001001704”, I was excited to see an accession number (sometimes you don’t even get this). My excitement was short-lived: when I followed the link, I realized the data was not available because it is not actually “public”. It is under some sort of “embargo”, with the rights to it controlled by “Lifelines-DEEP” according to the EBI’s website.

This was immediately starting to reek of something foul. I had not even thought about trying to access the metadata, and I was already required to jump through hoops just to get the raw reads. But fine: a couple of emails asking for data access, getting some sort of username and password, and then getting the data is not the end of the world (but just to be clear, this is not the way these things should work).

A couple of polite emails later (both to the authors and the journal editors), I was asked to fill in the “data request” form. Take a couple of minutes to scroll through that carefully… What does that read like to you? Because to me, this is not a trivial data request form, especially not for data that I should be able to download without even having to let the authors know about it. Some questions come to mind: what binds the recipients of this proposal to confidentiality (am I just letting the “competition” know exactly what I am up to?), and more importantly, what if I get the data and then do something else with it? What is this proposal, a binding contract? Who will enforce its terms?

So, I decided to dig a bit deeper into Lifelines-DEEP and found their data access policy. This contains a lot of standard stuff, but also some obvious “gems”. On a positive note, they do assure confidentiality. However, they also state that “After approval of the proposal, the applicant receives a financial agreement and a Data/Material Transfer Agreement (DMTA). The DMTA specifies the conditions for the use of the LifeLines data and/or biomaterials, Intellectual Property (IP) and warranties. The access fee for data and biomaterials supports the handling and service costs of LifeLines and includes a small contribution to the ongoing data and sample maintenance. After signing the offer and DMTA the researcher is granted access (but not any ownership rights) to use the data and/or samples to conduct the approved research project for a particular period of time.” Given that the EBI already provides data maintenance at no cost to the whole scientific community, it seems that the only “added value” that LifeLines provides is this extra level of control. Effectively, the fee serves to finance the bureaucracy that charges the fee.

At this point, it was obvious to me that getting access to the raw data would not be trivial and that, once I had the data, I could make only restricted use of it. How restricted? Well, according to the same document: “Prior to submission, researchers are asked to send their abstracts and manuscripts to the Lifelines Research Office for a general check on correct reference to LifeLines, to check whether the content of the manuscript fits the initial approved research proposal, and to identify possible privacy risks.”

In the meantime, I had been in contact with the editor and had received a final decision stating that they “established that the institutional conditions for access to these data are not unreasonable”. Additionally, a correction had been added to the manuscript: “A statement about the informed consent regulations for data access to the Lifelines population cohort was inadvertently omitted from the acknowledgments.”

Since I only wanted to “play around” with this data and try some things (including checking some specific aspects of the original analysis which I found troubling), I do not actually qualify for access to it. And here we are, months later, with no data and the very real prospect of writing a “request of reproducing results” proposal that will most likely be rejected. Maybe I should just let it go and forget about this dataset. But in the past month, I found another one, again with data produced by some “third party” which is not willing to share. This is exactly why we should fight this: it sets a very troubling precedent. I am sympathetic to the privacy issues involved, which are often not initially obvious; however, this case has nothing to do with such issues. There are thousands of human microbiome samples readily available, and there should be no exceptions to making new samples public. Subject metadata gets a bit more complicated, and maybe I will write about that at some later stage.

