My concerns about PLOS’s new open data policy

I have struggled with whether I should write this post. Generally, I am an open everything advocate: open access, open source, open science, and yes, open data. But I have some concerns about PLOS’s new open data policy.

As of March 1, 2014, PLOS will require authors of research papers to make “the underlying data…freely available for researchers to use, wherever this is legal and ethical”. It will no longer be sufficient for authors to say “data available upon request” and remain as the sole “gatekeepers”. Data must be included in the paper itself, in a supplemental, or stored in an online repository. In cases where data cannot be publicly released (for example, due to concerns about patient privacy), or if a third party owns the data, the authors must specify a committee to which requests for data should be submitted.

In principle, data sharing sounds great. But what are some of the practical implications of this policy? There are many, but I would like to focus for now on the potential repercussions for researchers in low- to middle-income countries and the diversity of PLOS authorship.

First, a little background: I work in Mexico, where the country spends less than half a percent of its GDP on research. (Compare that to nearly 3% of a much larger GDP spent on research in the U.S.) Multimillion dollar (or even 1 million dollar) R01s do not exist here. Labs are run on very little funding. In 3-5 years, a single lab may only have enough money to produce one really solid data set. This means that data acquired are like gold, and it is absolutely crucial that researchers here get as many publications out of one data set as possible. The situation is the same or worse in many other countries throughout the world where research funding is scarce.

Now, I do not mean to diminish the funding problems currently facing scientists in many European countries, Canada, or the U.S. – the funding environment has gotten increasingly difficult and many researchers are suffering. It is also important for these researchers to get the most out of their data. But I think it is fair to say that the funding situation is more dire in low- to middle-income countries, like Mexico.

What does all this have to do with PLOS? Their new open data policy means that researchers will have to make data available upon first publication. As PLOS writes, “data availability allows validation, replication, reanalysis, new analysis, reinterpretation, or inclusion into meta-analyses”. In short, they are admitting that one of the possibilities is that people could take the published data and do a new study. Could the public and science in general benefit from this? Yes. But what about the researchers who originally generated the data? Because of funding and consequently manpower constraints, labs in low- to middle-income countries may not be able to crank out additional papers from data as quickly as could larger labs in the U.S., for example. There is the possibility the original researchers could be ‘scooped’, losing out on publications from data that was meant to sustain their lab for potentially several years. (This could also be a problem for smaller labs within the U.S.)

Some have argued that the fear of getting scooped is unfounded. That could be true – I am not aware of any data on how often this actually happens. But the problem, I think, is not just whether the risk of being scooped is real, but whether researchers believe the risk exists. If researchers believe the risk of being scooped is higher if they have to release their data, they will simply stop submitting to PLOS.

Why should PLOS care if they get a few fewer submissions every year? Diversity. I think this policy has the potential to decrease the diversity of the authors submitting to PLOS. I can tell you that although PLOS has a generally good reputation in Mexico, many researchers here will now be thinking twice about submitting. In fact, I am particularly worried about a personal situation. I had just convinced my coauthors here that we should submit our next paper to PLOS ONE. (I considered this no small victory, since they often publish in closed access medical journals.) I wonder whether PLOS’s new policy will change their minds. I hope not. But since the data do not belong to me, it is possible we will have to submit elsewhere, depending on what they decide.

I say all this not to unduly criticize PLOS – I think they have the best intentions in mind with this new policy. But ideology is different from practice, and I think there are many details that still need to be worked out with respect to data sharing. In sum, I am torn. I want to support their efforts to increase openness in research, but I am concerned about the possible repercussions.

What do you think? Am I worrying unnecessarily?

43 thoughts on “My concerns about PLOS’s new open data policy”

  1. Why did you struggle about whether to write this post? It seems like a reasonable critique of the current policy. Perhaps it will be addressed or it won’t turn out to be a huge issue. I don’t know. This is what I don’t get about some of the “change how science is done” advocates. Of course PLOS has good intentions. Intentions are one thing, but it is perfectly reasonable to ask for details about how this is going to work. We are scientists after all. Critique is what we do. So why rest on good intentions and assurances that it will probably work out?

    1. I guess I struggled most because I didn’t want it to come off as if I am against sharing data, or even against PLOS’s new policy per se. I just wanted to call attention to possible unintended consequences. But as you say, it is absolutely reasonable to question how this policy will work in practice. I think it is especially important to critique when it is an idea and an organization we support and want to see succeed.

  2. This is an interesting perspective, and one that I will admit I have not considered. I think I need to mull it over more, but here are a few quick (and perhaps not well thought out) thoughts.

    If we all are participating in Science as a world wide community, then we all need to (as a community) agree to certain sets of standards of scholarly communication and sharing (ideas, reagents, tools and data). To my knowledge, we do not review papers differently depending on the country of origin of the study (that is we expect the same level of scientific rigour). I guess I consider the raw data as much a part of the science as the analysis scripts and the summary and interpretation of the data in the paper.

    At least personally, I am most concerned with the long term archiving of the data. Making sure others have access to it immediately with the publication of the associated paper is also important, but I think a bit of patience could be asked for. I know that DRYAD did (perhaps still does) allow for up to a year access freeze on the data. Even though the data is archived immediately with the publication of a paper, it is not publicly available for up to a year. This should give the scientists who generated the data at least some time to get going on their next study. Whether this is sufficient (or too much) time is worth discussing.

    On the flipside of your argument, I have two thoughts.

    Currently most scientists may need to email the authors to get the data, but virtually all journals would require the authors who generated the data to share it upon request. This is true independent of country of origin or funding status of the lab, right?

    Also, what if other scientists also in labs with little to no funding (whether in Mexico or elsewhere in the world) could really use that data to help them with a study they are working on? If it is from a study that has already been published, why should they not be allowed to use it?

    1. Of course people should be allowed to have access to data in previously published papers, if requested. The question is, how is that access facilitated and what kind of role do the creators of the data have?

      Let’s be honest and recognize the fact that there is a huge difference between having data freely available on request, and having data readily downloadable without having any direct interaction with those that created the data.

      For those who primarily work with their own data, the way our system is designed there is scant to no reward, and lost opportunity costs, to making data readily downloadable. This cost is disproportionately negative for smaller labs.

      I understand the argument that if data are paid for by federal funding, that they should be publicly available. However, those that make this argument most vociferously, or wish to accuse others of impeding science by not making all of these data available promptly, are being disingenuous by not recognizing the realities of how research happens.

      Clearly, there is a clamor for change, at least in a very vocal minority which has a loud online presence. Note that it’s a clamor for *change* from how things are done. Which means that things are not that way, and people aren’t used to it. And people have designed their labs and careers around the old (and still current) system. Few people will be won over by just being told that they’re doing it wrong. I know the system, and I understand the arguments, but when people dismiss the overhead of effort it takes to do this, and the fact that it often takes me several years to work up a quality dataset, I stop listening.

      1. Could you elaborate on how you think those (I gather including myself) making the argument for publicly available data are being disingenuous? I really am at a loss here.

        All I can say is that we generate ~95% of the data we use in our lab. I like to think I have a pretty reasonable idea of how research happens and how the data is generated, at least in my own lab and my time as a student and post-doc. Sharing it has never seemed like a substantial burden, and I have yet to be scooped on anything because of it.

        I am sure there are cases where it might be a burden, and as Erin suggests could really disadvantage smaller labs that had to put all of their resources into generating it. So how might this be dealt with? Can a 1 (or 2) year embargo help for such labs? Are there other approaches to deal with this issue?

        I have said this before, but in my experience it has been far more of a hassle trying to dig up old data than archiving it immediately.

      2. I never wrote that people making the argument for publicly available data are being disingenuous.

        I wrote that those making the argument because of federal funding are being disingenuous because they’re not recognizing realities of how research happens. That’s clearly a wedge for those who want all data to be open. It’s an astute rhetorical angle. But really, the people making that argument mostly don’t care where the $ comes from, they want all data to be free and the federal funding is secondary to the central argument.

    2. You make some good points, Ian, many of which I still need to mull over. The idea of freezing access to data for some time is interesting. I generally don’t support embargoes, but it certainly could help small labs. I’m not sure yet how I feel about this one…

      I particularly like your last point. It is possible that open data could help small labs in that it provides them access to information they have difficulty obtaining on their own. As a personal example, open electrophysiological or epidemiological data would certainly help me to improve the models I use in my work. I can also think of examples in which a lab could extend or support their smaller primary data set with open data. It could be that in the long run these benefits outweigh the costs to a small lab of sharing their own data, but I think it would depend on the type of research being done.

  3. I’m sorry our revised data policy (announced in December: http://www.plos.org/data-access-for-the-open-access-literature-ploss-data-policy/) is causing such confusion. We’ve tried to say that the type of data required is not changing, nor is the fact that we have always required authors to share data with other researchers when asked to do so. What is changing is that we are asking authors to make entirely transparent where the data can be found, and that place cannot be ‘on my private hard drive’. So, yes, there is a change, but I’m not sure that this change will alter the likelihood of someone else working with your data, assuming that you would previously have shared it on request, as required to do so by PLOS journal policy.

    1. Thank you for commenting. I think it is important to recognize that responding to personal requests is different than sharing data publicly. The biggest difference is the interaction between the original authors and the person requesting the data that is required by the former and not the latter. This personal interaction provides, among other things, the opportunity to discuss potential collaborations. This alone is why I think authors, even in low- to middle-income countries, are more open to sharing data via personal requests. So, while the new PLOS policy was meant to simply be an extension of the previous policy regarding sharing, it represents a significant change for many authors.

      1. I do take your point, which I think is an important one: there is a difference in making something openly available without intervention, as distinct from making it available if someone asks for it specifically. That difference seems to us very analogous to the difference between subscription-access articles that most people can get a copy of if they really need it, and Open Access articles that everyone can read, mine, store and pass on. Because we are already committed to the latter idea, it seems to us that the data should be similarly freely available – but I definitely take your point that because this removes the need to consult the original authors, it might make opportunities for collaboration fewer. There are, however, many people who make the opposite point: that being ‘open’ tends to encourage more people to contact them and interact with them – see for example my colleague Cameron Neylon’s post on an ‘open state of mind’ http://cameronneylon.net/blog/open-is-a-state-of-mind/

  4. Interesting argument. Two quick thoughts:

    1 – If it is a problem to require scientists in low- to middle-income countries to make their data publicly available then the same presumably applies to personal requests for data. So the logical conclusion of the above argument is that any scientist working in a low- to middle-income country can ethically and justifiably deny anyone access to their data on the grounds that because of their disadvantage, the value of the data is higher to them and not suitable for sharing.

    2 – Couldn’t it be just as easily argued that mandating public data archiving will provide an advantage for scientists in low- to middle-income countries by giving them access to other scientists’ data that they can then use in their own research? In other words, doesn’t everyone stand to gain as much (if not more) than they lose?

    1. Thanks for raising these points, Chris. As to point 1, I think responding to personal requests for data is very different than sharing data publicly via a supplemental text or online repository. The former necessarily involves an interaction between the original authors and the person requesting the data, whereas the latter does not. This personal interaction provides the opportunity to discuss how the data will be used and possibilities for collaboration. This is just one reason why I think authors, even in low- to middle-income countries, might be more open to sharing data via personal requests.

      I did not mean to suggest that authors in these countries should never share data. Ian Dworkin (see comments above) also suggests an interesting option that would allow authors to place an embargo on their data for some time period. This could benefit small labs (wherever they are located) by giving them the time to draft additional publications, but the data would eventually become open. I am not sure if I favor this idea, but it is worth discussing ways like this in which we can help small labs while still encouraging sharing.

      As to point 2, please see my response above to the same point made by Ian Dworkin. Briefly, I agree, there could be cases in which labs in low- to middle-income countries (or small U.S. labs, for example) could benefit from open data. Whether these benefits outweigh the costs remains to be seen, and will likely depend on the lab and the focus of their work.

      1. Thanks Erin, these are valid points. You say that the necessity of personal interaction in data sharing (as per status quo) “provides the opportunity to discuss how the data will be used and possibilities for collaboration”. So if we imagine a scenario where a researcher in Mexico is asked for their raw data by a researcher from a big lab in the US, and with no offer of collaboration in the use of that data (apart from citing it and attributing it to the original group if they plan to publish based on it), do you feel that the Mexican researcher is justified in declining the personal request on the grounds of their economic disadvantage? In other words do you feel that the economic disadvantages experienced by scientists in low- to middle-income countries should afford them greater discretion in releasing their raw data on request?

        If the answer is yes then you are essentially arguing for a two-tiered system and that the same standards of science don’t apply to everyone. If the answer is no then I don’t see the difference between that scenario and releasing the data in an archive, with the exception that by requiring personal interaction with researchers in disadvantaged countries to access data we might, as a side effect, be promoting collaborative science with those countries. But that seems like a strange argument to me, where limiting data access across all science becomes a tool for ensuring socioeconomic parity. I can’t help but think that there must be better ways to achieve parity than by restricting data access.

  5. I am biased because as soon as I could I ran away from a country which cannot afford to fund science properly. Nonetheless I don’t agree with you. I think you miss one very important aspect of data sharing. It works both ways. Yes, other researchers will be able to use your data and publish papers using it without your name on it (you will, however, get credit through citations). But most importantly you can do the same. You can take data acquired in one of the more financially fortunate regions, reanalyze it, and publish a paper. Data sharing is not a problem for developing countries. It is an opportunity. I suspect that in the near future we will see more labs specialising in analysing and combining existing data sets instead of acquiring new ones.

    Take for example human neuroscience. There are big privately or publicly funded, US based, projects that do prospective data sharing (Allen Brain Atlas, Human Brain Project, NKI Enhanced). They share rich datasets before they analyze them themselves. It’s a great opportunity for developing countries which just don’t have the resources to acquire similar data. Now researchers from those places can do cutting edge research and publish in glam journals.

    So look at the bright side. See how data sharing can help you instead of worrying how it can hurt you.

      1. I thought the time- (and money-) limiting step, according to Erin, is the data acquisition, not the write-up.

        I’m not merely saying that one should just not worry, but there are positive aspects of data sharing (reuse of data, discoveries using already acquired data) that outweigh the potential concerns.

      2. Terry,

        I apologize if I am missing something in your argument. If you have not written the paper yet, why would this be an issue (since the mandates are to archive only the data associated with studies accepted for publication)?

        If it is about the re-analysis of exactly that same data set, and you are concerned with time to write it up, I am not sure that you are likely to get scooped on any re-analysis of your data. At least in my experience, most data re-use is for somewhat different purposes than the original reasons they were generated (testing different hypotheses). Also, would you not (as the author/generator of the data) have a huge leg up on any potential “competitor” as by the time the data and paper are published you would probably already have been re-analyzing the data for its additional purposes?

        While I cannot speak exactly to your concerns (which I would appreciate if you could elaborate on), would the “second study” be using the exact same variables as the first study, or would it in fact be only a partial overlap of the dataset? Since you are only required to deposit the portion of the dataset that is necessary to replicate the findings of the first study, you could just archive the relevant subset.

        I feel like I am missing something essential in your argument, so if you have a chance to elaborate, that would be really helpful.

    1. Thanks, Dominique! I think this is an excellent article that everyone with an interest in public data archiving must read. It contains many reasonable suggestions that could make researchers more comfortable with open data policies, like flexible embargoes and explicit journal requirements regarding citation of primary data. The following quote particularly caught my attention:

      Meaningful solutions require frank acknowledgment of the potential differences between the interests of individual researchers and those of the broader scientific community.

      I couldn’t agree more, and this is what I was trying to get at with this post. I did not mean to say that we shouldn’t share data – we absolutely should. But only by acknowledging the potential conflicting interests of individual researchers versus the scientific community (or the public) will we be able to construct open data policies that encourage participation, even from those who may initially be reluctant.

      1. As an added note, it would be great if we could reform our evaluation/reward systems in academia to minimize this mismatch between the interests of individual researchers and the interests of the scientific community or the public. If sharing data got researchers more “points” than say publications in high IF journals, we would have no problem with participation.

  6. Lots of important and valid points here. However one argument in all this that is just not workable is that something being publicly-funded should automatically equate to it being publicly-owned: the implication would be that everyone had a right to ownership of anything publicly-funded: medical equipment in hospitals, government policy documents, trains, NASA rockets etc., into absurdity.

    1. You’re making a category error. If we could create (for example) replicas of medical equipment for whoever wanted it, for free, then people would have a right to them if they were developed using public money. Data is not limited in the same way that material things are limited.

  7. Totally agree with Chris and Chris – particularly in fields where generating data is very expensive or relies on equipment that is difficult to obtain, the benefit might well be stronger for low to middle income countries.

    And let’s be honest, many people insist so strongly on personal communication because this gives them the opportunity NOT to share the data upon request – it is more than common that people don’t react to emails, seem to have lost data, or provide it in a way that is effectively unusable. The fact that other people could do something useful with the data is probably only half of the story. I strongly suspect that at least some people are also afraid of others reanalyzing their study and finding different results. The same thing applies for code btw.

    Having the data publicly available will be a huge benefit for science as a whole. Potential adverse effects for single groups (e.g. the data providers) should be somehow balanced, e.g. by more strongly acknowledging data creation in the assessment of merit, but I really see no argument against making data available as freely as possible.

  8. Dear Erin,

    thanks for that interesting post. Like most things, Open Access has its dark side, but people rarely mention that. Just like you I am a true Open Access & Open Science advocate; nevertheless I am quite sure that Open Access & Open Science include hegemonic tendencies. In other words: I share your concerns.

  9. I am also about to submit a paper to PLOS ONE. However, this new policy concerns me. The problem is that my data are part of a huge project generating multiple papers and involving many people. The data are very expensive and unique, but very slow to gather and publish (it took 7 years to gather them). So I can hardly make all the datasets I use entirely open, as part of the data is going to be re-used over the next few years for different publications, even by people on the broader team who are not currently coauthors of my study. Besides that, I will also use part of the data for other papers. However, I can do my best and make about two thirds of the data available as Supplementary material without a problem now, as I am otherwise open to this idea (I just think the risk of being scooped could be real in some situations if I put really everything on the web; this I cannot do). The definition in PLOS ONE is very conflicting, as they specify a “minimal dataset” in such a way that you actually need to make all the data open, as otherwise it is not possible to replicate the study results in “their entirety”! There is no minimal dataset for the classic ecology studies I do, just the original full matrices, which you need in order to regenerate the same results! However, I noted that there are still papers in PLOS ONE published after March 2014 which say “all data are in the text and supplementaries”, and then the authors really include nothing with the study (just a doc table with mean values) and put a conflicting note in the text: “available on request”. So I am really still not sure how strictly and how well this really works, and what has changed! But I also agree that it is likely that fewer interesting papers will now be published in the journal, as people will be concerned…

    Idea: What about adding a condition that anyone who is going to use the original authors’ data for analysis should also contact the authors to make sure there is no conflict with their ongoing publications and/or offer them co-authorship? I think the authors should have priority to re-use their own data first! (But openness should also be ensured; otherwise all the time-consuming results end up only in charts and texts, and the original data will vanish when the scientists retire.)

  10. I am an ecologist terribly frustrated by the fact that millions of Euros are spent on projects where scientists are paid to create datasets which are of great global value and should be integrated with open-access on-line resources but instead (for decades) are given a gate-keeper who can arbitrarily demand co-authorship for use. This is holding back the field and means that people are constantly reinventing the wheel, all funded by Joe Bloggs tax-payer. We are here to deliver impact to the public and to the field; data hoarding to advance our own career interests can be very detrimental.

    On another note, I know of many individuals with Science and Nature co-authorships from sharing their data for meta-analysis, clearly demonstrating that the benefits can far outweigh the costs.
