Yesterday, I asked a very simple question on Twitter:


Let’s assume for the moment that we are ignoring moral and ethical considerations, and focusing only on the question of legality. Simple, right? Except that it’s not. Immediately, I received several responses, the first from Mike Taylor (@MikeTaylor):


Again, let’s ignore for the moment the question of whether it’s wise (we’ll get to that later). Playing devil’s advocate, I responded that the advisor’s claim is that any data collected in their lab are their property, and that by extension they control the publication of any work based on those data. Mike then tweeted:


Ok, so now we get to the heart of it. Are data subject to copyright and, if so, who owns the copyright? Mike’s argument, and that of others I have spoken to, is that data collections are sets of facts and since facts are not subject to copyright, then neither are data. But is this how copyright law is written? (Let me note that I am currently focusing on U.S. copyright law, but I would love to do a comparison of the laws in different countries in future.) This is where the law, at least to someone like myself not educated in this area, gets tricky. Here is an excerpt I pulled from Bitlaw:

Although databases may be protected as compilations under U.S. copyright law, the underlying data is not automatically granted protection. The Copyright Act specifically states that the copyright in a compilation extends only to the compilation itself, and not to the underlying materials or data. 17 U.S.C. § 103(b). As a result, compilation copyrights cannot be used to extend copyright protection to ideas or facts that are otherwise unprotectable (it is a basic premise of copyright law that there is no copyright protection for ideas and basic facts…Thus, a database of unprotectable works (such as basic facts) is protected only as a compilation. Since the underlying data is not protected, U.S. copyright law does not prevent the extraction of unprotected data from an otherwise protectable database.

So, as Mike and others argued, the data themselves are considered facts and thus not subject to copyright. This principle arises from the case of Feist v. Rural, which ruled that “information alone without a minimum of original creativity cannot be protected by copyright” (Wikipedia). Thus, you can only claim copyright on a data compilation, and only when you can show that your selection or arrangement of the data involves at least a minimum of original creativity. Unless I am reading this wrong, this appears to me to be in conflict with statements by many universities regarding the ownership of data. For example, take this one from Columbia University:

Although graduate students, postdoctoral fellows, or even some faculty in academia performing research may believe that they own the data collected, they are wrong. As employees of a university, they are working for hire for the university, which, in most cases, owns the rights to the data. In federally sponsored research, the university owns the data but allows the principal investigator on the grant to be the steward of the data. … With industry-funded or privately funded research, data can belong to the sponsor, although the right to publish the data may or may not be extended to the investigator.

I am far from an expert on copyright, but this reads to me as if universities and funding agencies are trying to claim ownership, not of an original database or collection, but of the underlying facts themselves, which copyright law specifically excludes from protection. Have I misinterpreted something here? How do these institutions get around this?

Getting back to the original question, it gets even more complicated. Even if we assume that the advisor owns the rights to the data, which is questionable, the copyright to the written work in the form of the dissertation is sometimes owned by the student. (I say sometimes because I am not sure in what percentage of cases this is true. I, for example, as the author of my dissertation hold copyright, which is stated in the online repository record.) When the student then wants to publish parts of their dissertation, which copyright takes precedence: (1) the copyright (assuming there is one) on the data collection on which the publication is based, or (2) the copyright on the written work itself? Perhaps this is obvious to others, but it’s not to me, and I’m guessing it’s not to at least a few other researchers out there. More to come on this soon, including additional discussion of whether it is wise for students to publish without their advisor’s consent. In the meantime, I’d really appreciate comments from anyone who can help me understand this mess of who owns research data and the rights to publish it.

Update 10/29/2012: I have removed the previous note saying this was a draft. I think this post stands as a good introduction to the questions that inspired what will be a series of posts on data ownership and copyright. Please see future posts for answers to some of the questions posed above. I still welcome input from anyone with either personal or professional experience in this area. I would also like to thank Ian Holmes (@ianholmes), @MnkyMnd, Casey Bergman (@caseybergman), and Dan Stowell (@mclduk) who, though not quoted here, also participated in the original discussion on Twitter.