Data Set Publication, Citation, Peer Review, and Sharing

Introduction

This annotated bibliography contains selected resources covering the topics of data publication, citation, peer review, and sharing. The issues covered in these resources include motivations and incentives for sharing and publishing data sets, how to create a data set citation, the comparison and use of unique persistent identifiers, and questions surrounding peer review of data sets.

Resources include journal articles, reports, and websites. Resources were located through searches on databases, such as Library Literature and Information Science Full Text, Library and Information Science Abstracts, and Google Scholar. Search terms included “data set publish*,” “data peer review,” “data citation,” and “data sharing.” Starting from the initial search results, the pearl growing method was employed to locate additional resources.

Altman, M., & King, G. (2007). A proposed standard for the scholarly citation of quantitative data. D-Lib Magazine, 13(3/4).

This article proposes a format for the citation of data sets. Observing that data sets are often only briefly described in the text, if at all, the authors argue that data sets should receive the same level of citation as articles. To accomplish this, they suggest each citation should have a minimum of six parts: author, title, date, a unique global identifier, a universal numeric fingerprint, and a bridge service. The authors argue that these items are required for unambiguous identification of a data set or a subset of a data set.
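As a concrete illustration of the proposed format, here is a minimal sketch in Python that assembles a six-part citation; every value below (author, identifier, fingerprint, bridge service) is invented for illustration and is not taken from the article:

    # Hypothetical six-part data citation in the spirit of Altman and King.
    # All values below are invented for illustration.
    citation_parts = {
        "author": "Doe, Jane",
        "title": "Example Household Survey, 1990-2000",
        "date": "2001",
        "identifier": "hdl:1902.1/00123",          # unique global identifier
        "unf": "UNF:3:dO1OSoPnp9RB6vTx7gYqUg==",   # universal numeric fingerprint
        "bridge": "http://resolver.example.org",   # bridge service
    }

    citation = '{author}, "{title}", {date}, {identifier}; {unf}; {bridge}'.format(**citation_parts)
    print(citation)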

The article is a well-written how-to guide on creating a data citation, giving examples throughout the text. Many of the authors’ suggestions are used in some of the data citation initiatives underway, such as the Oak Ridge National Laboratory Distributed Active Archive Center Data Product Citation Policy (http://daac.ornl.gov/citation_policy.html). However, I am unaware of any data citation formats that include the universal numeric fingerprint. This type of information seems superfluous within the citation. Perhaps it would be more appropriate for this information to be situated on the landing page where the unique global identifier resolves.

In my future career as a data curation librarian, I will be involved with developing policies on citation of the data sets archived by my campus's data repository. This article provides sound guidance on which elements are important for a citation and will be a starting place for discussions surrounding citation policy.

Brase, J. (2009). DataCite – A global registration agency for research data. Proceedings of the Fourth International Conference on Cooperation and Promotion of Information Resources in Science and Technology, 257-261. doi: 10.1109/COINFO.2009.66

This article introduces the need for, brief history of, and structure of the organization and consortium called DataCite. In response to the problem of citing and publishing data sets, DataCite was formed to enable “organisations to register research data sets and assign persistent identifiers to them, so that research data sets can be handled as independent, citable, unique scientific objects” (Brase, 2009).

DataCite facilitates the assignment of persistent identifiers to data sets using the Digital Object Identifier (DOI). The International DOI Foundation manages the DOI system, but DOIs are assigned by registration agencies, of which DataCite is one of ten as of this writing.
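As a rough sketch of what registering a data set DOI through a service like DataCite might involve, the snippet below assumes a hypothetical MDS-style endpoint and credentials; the actual DataCite API, URL, and payload format should be verified against current documentation:

    # Hypothetical sketch of registering a DOI for a data set landing page.
    # The endpoint, credentials, and payload format are assumptions for
    # illustration, not a verified DataCite API call.
    import requests

    doi = "10.12345/example-dataset-2013"          # hypothetical DOI
    landing_page = "https://repository.example.edu/datasets/42"

    response = requests.put(
        "https://mds.example.org/doi/" + doi,      # placeholder endpoint
        auth=("MEMBER.REPO", "password"),          # hypothetical credentials
        headers={"Content-Type": "text/plain;charset=UTF-8"},
        data="doi={}\nurl={}".format(doi, landing_page),
    )
    print(response.status_code)   # 201 would indicate a successful registration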

This article provides a useful overview of what DataCite is, how it is structured, and its goals. However, it does not contain guidance on how one should build a data set citation. It also gives a good overall discussion of the DOI, but does not explain why the DOI was chosen over other persistent identifiers, such as the Archival Resource Key (ARK).

DataCite also provides a metadata schema for describing generic data sets. As was the case at the Oak Ridge National Laboratory Distributed Active Archive Center (DAAC), sometimes the metadata originally provided by the data producer is limited and must be augmented by DAAC personnel before archiving the data set. The general guidance DataCite provides on creating descriptive metadata for data sets will be valuable to my future career as a data curation librarian, since I will be involved in developing metadata for data sets.

Chavan, V. S., & Ingwersen, P. (2009). Towards a data publishing framework for primary biodiversity data: Challenges and potentials for the biodiversity informatics community. BMC Bioinformatics, 10(Suppl 14), S2. doi: 10.1186/1471-2105-10-S14-S2

This article lays the groundwork for the Chavan and Penev (2011) article annotated below. Here, the authors establish a data publishing framework intended to set the stage for incentivizing individuals and institutions to publish their research data; two years later, Chavan and Penev would propose the “data paper,” a method by which scientists could publish their research data within this framework. The authors assert that “[o]pen access to primary biodiversity data is essential both to enable effective decision-making and to empower those concerned with the conservation of biodiversity and the natural world.”

The authors lay out several roadblocks to the publishing of data sets and propose a “Data Publishing Framework” to help alleviate them. The roadblocks are (1) possible misuse of data, albeit inadvertent; (2) lack of ownership agreements; (3) academic tenure and promotion considerations; (4) funding competition; (5) usability issues; and (6) lack of confidentiality. The framework consists of five areas to address: technical and infrastructure, policy and political, socio-cultural, economic, and legal. The article proceeds to elaborate on the components of the technical and infrastructure area: persistent identifiers, the data usage index, and data citation. The data usage index is proposed as a method to demonstrate the impact of data sets by providing statistics such as unique visits, loyal visits (more than one visit from one IP address), views of data sets, downloads of data sets, and volume distributions of data sets.
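To make the proposed data usage index concrete, here is a minimal sketch, assuming an invented access log format, of how unique visits, loyal visits, and downloads might be tallied:

    # Toy illustration of data usage index statistics.
    # The (ip, dataset, action) log format is invented for this sketch.
    from collections import Counter

    log = [
        ("10.0.0.1", "ds-001", "view"),
        ("10.0.0.1", "ds-001", "download"),
        ("10.0.0.2", "ds-001", "view"),
        ("10.0.0.2", "ds-002", "view"),
    ]

    visits_per_ip = Counter(ip for ip, _, _ in log)
    unique_visits = len(visits_per_ip)                              # distinct IPs
    loyal_visits = sum(1 for n in visits_per_ip.values() if n > 1)  # >1 visit per IP
    downloads = sum(1 for _, _, action in log if action == "download")

    print(unique_visits, loyal_visits, downloads)   # -> 2 1 1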

While the authors do a good job of explaining the need for the technical and infrastructural components, as well as what they are and how they work, they do not delve into the other four areas of the proposed framework. However, upon further searching, I discovered an article entitled “Towards Mainstreaming of Biodiversity Data Publishing: Recommendations of the GBIF Data Publishing Framework Task Group,” published in 2011, which dives deeper into the framework and provides specific recommendations addressing each of its five parts.

Chavan, V., & Penev, L. (2011). The data paper: A mechanism to incentivize data publishing in biodiversity science. BMC Bioinformatics, 12(Suppl 15), S2. doi: 10.1186/1471-2105-12-S15-S2

In this paper, Chavan and Penev build upon the work of Chavan and Ingwersen (2009) by proposing and describing a method by which scientists can publish their data sets: the “data paper.” They claim that the data paper, if executed as they describe, will assuage scientists’ apprehensions about publishing data sets.

The authors explain the structure of the data paper and provide an extensive list of metadata from which the data paper is built. They also describe a service called the Integrated Publishing Toolkit (IPT) from the Global Biodiversity Information Facility (GBIF), which lets scientists enter metadata in a standardized format and then automatically outputs the metadata as a data paper. Use of this tool should reduce the effort required to create a data paper and encourage scientists to create them.

The data paper consists of the following parts: 1. bibliographic information such as author, date, title, and subject; 2. a citation to the data paper; 3. taxonomic, spatial, and temporal coverage information; and 4. descriptions of the data and collection methods. These parts are automatically generated from the metadata entered into the IPT.
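To show how such automatic generation might work, here is a minimal sketch that assembles a data paper skeleton from a metadata record; the field names and output format are my own assumptions, not the actual IPT schema:

    # Sketch: generating a data paper skeleton from metadata, loosely
    # modeled on the four parts listed above. Field names are invented.
    metadata = {
        "title": "Butterflies of Example County, 1990-2010",
        "authors": "Doe, J.; Roe, R.",
        "date": "2011",
        "citation": "Doe, J., & Roe, R. (2011). Butterflies of Example County. doi:10.9999/example",
        "taxonomic_coverage": "Lepidoptera: Papilionidae",
        "spatial_coverage": "Example County, USA",
        "temporal_coverage": "1990-2010",
        "methods": "Weekly transect counts at 12 fixed sites.",
    }

    sections = [
        "Title: {title}",
        "Authors: {authors} ({date})",
        "Citation: {citation}",
        "Coverage: {taxonomic_coverage}; {spatial_coverage}; {temporal_coverage}",
        "Methods: {methods}",
    ]
    data_paper = "\n".join(section.format(**metadata) for section in sections)
    print(data_paper)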

The description of the data paper in this article closely resembles the data set guide documents I created at the Oak Ridge National Laboratory Distributed Active Archive Center. There, the guide documents contained the same information as Chavan and Penev’s proposed data papers, with the addition of some graphs and images to enhance the information. In my future career as a data curation librarian, I plan to implement a pilot study in which we explore seeding an institutional data repository with data sets from a specific research group on campus in order to develop methods and workflows. This paper will be a useful resource in preparing documentation for those data sets.

Costello, M. J. (2009). Motivating online publication of data. BioScience, 59(5), 418-427. doi: 10.1525/bio.2009.59.5.9

This article was included in this annotated bibliography because it gives the best explanation of why data sets should be published, as well as possible responses to scientists’ objections to publishing data. As a proponent of online publishing of data sets, I will be an advocate for this practice at my institution and will need to make the case that it benefits scientists. This article offers a thorough overview and talking points for those discussions.

Costello begins by making a bold claim: “Scientists advance knowledge gained from empirical and modeled data and observations. It follows that scientists who do not publish or release their data are compromising scientific development and, arguably, leaving their work unfinished.” [emphasis mine] The author does a good job of justifying this statement, offering several reasons why data publishing should occur.

First, publishing data sets online along with the journal articles written from them increases scientists’ visibility and reputation among their peers; this happens when data sets become citable scientific objects. Second, publishing data sets online discourages scientific misconduct, since the data are visible for all to see, use, and verify research results with. Third, it provides more data to analyze, which could help answer new questions that could not be answered before.

Costello also discusses several objections scientists have given over the years for why they cannot publish their data sets. One example is that scientists may claim the data are theirs, so they should not have to share. Costello makes the point that scientists are simply the custodians of data owned by their employer or, if the work is publicly funded, the public. Another concern scientists have is that their data will be used incorrectly. The response to this objection is that data should only be published with sufficient metadata and documentation to describe it thoroughly. It is then up to the user of the data to be competent enough to understand the metadata.

Costello ends with some suggestions for increasing publication of data: planning for data publication from the beginning of projects, standardizing and encouraging (or mandating) citation of data sets, including data citations in citation services’ metrics, and giving data sets equal weight with published articles in tenure and promotion considerations.

Davis, H. M., & Vickery, J. N. (2007). Datasets, a shift in the currency of scholarly communication: Implications for library collections and acquisitions. Serials Review, 33(1), 26-32. doi: 10.1016/j.serrev.2006.11.004

This article discusses how the currency of scholarly communication is shifting away from the traditional peer-reviewed journal article to the publishing of other parts of the scholarly process, specifically data sets. The authors suggest ways libraries can prepare for this changing landscape. This thought-provoking article poses pertinent questions libraries must ask themselves going forward. I will refer to this article in my future career as my library enters the world of data set collection and dissemination.

The authors identify five trends that indicate data is becoming more important. First, data is increasingly becoming a commodity. Second, legislative trends regarding copyright and fair use are becoming more important. Third, the growth and organization (or lack thereof) of data poses an interesting challenge, namely locating, accessing, and reusing data sets. Fourth, publishing companies are implementing policies regarding data sets associated with articles they publish. Last, public/private partnerships, such as the Human Genome Project, are being created that add value to and repackage data sets for commercial use.

The authors go on to discuss current business models for accessing data sets. Among the models they identify is the Institutional Membership Model, in which institutions pay a one-time fee for access to data sets. Another is the Serials Continuation Model, in which the institution acquires data sets on CD or other physical media on a periodic basis; a limitation of this model is that users may not be able to access the data sets online or away from the institution. A third is the One-Time Payment Model, similar to the way most academic libraries currently buy monographs, under which the institution purchases data sets when researchers identify them as helpful to their research.

The authors then discuss what they call transitional models, i.e., non-traditional models that may indicate a transition to a more well-defined model. The two transitional models identified are the Ad-Hoc Model and the lack of a model. Both of these, the authors claim, present opportunities for libraries to develop a sustainable business model.

Continuing beyond these models, the authors identify questions libraries must consider in key areas of their operations regarding data sets: budgeting, selection and evaluation, licensing, and negotiation. Libraries need to ask how they will appropriate money for the purchase of data sets. Regarding selection and evaluation, libraries need to determine which data sets they will acquire and how they will judge their quality. Additionally, should libraries license access to the data sets they collect, and if so, how would that work? Likewise, how will they manage interlibrary loan of data sets and the potential liability involved?

Finally, the authors suggest that libraries can exert significant influence on the data set market by being proactive and collecting data sets “at the source,” i.e., from researchers within their institutions. Given the trend toward data sets becoming more important, we are likely to see more of them published. In my future career as a data curation librarian, I will encourage data set publication at my institution, and this article provides a nice overview of the reasons why that is important.

Duerr, R. E., Downs, R. R., Tilmes, C., Barkstrom, B., Lenhardt, W. C., Glassy, J., . . . Slaughter, P. (2011). On the utility of identification schemes for digital earth science data: an assessment and recommendations. Earth Science Informatics, 4(3), 139-160.

This article discusses and compares nine unique identifier (UID) schemes, evaluating how well each can identify and locate a data set. The authors assess the nine UIDs against four use cases. First, they evaluate how well each functions as a unique identifier, meaning how well the expression captures and uniquely identifies the data set. Second, they evaluate how well each can uniquely locate the data set, meaning that someone who knows the UID of a data set would be able to find and download it. Third, they evaluate whether each UID is a citable identifier, meaning it can be used within a citation. Last, they evaluate each UID on its ability to identify a scientifically unique data set. This final use case is distinct from the first: in the first, the identifier simply distinguishes one data set from all others, whereas in the last, the identifier would facilitate determining whether two data sets are scientifically the same, even if they are in different formats.

The results of the evaluation show that for the first use case (unique identifier), the Universally Unique Identifier serves the need best. For the second use case (unique locator), the authors found that most of the nine identifiers work well, though the Archival Resource Key (see Willett & Kunze, 2013) embeds more metadata within the identifier, and the Handle System has the largest user base, which may be an indication of longevity. For the third use case (citable identifier), the authors found the Digital Object Identifier to be most suitable, as it is supported by DataCite (see Brase, 2009). Finally, the authors found that none of the unique identifiers evaluated were suitable for the fourth use case (scientifically unique identifier).
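To illustrate the difference between the first two use cases, here is a small sketch (the DOI is hypothetical): a UUID can be generated locally and is, for practical purposes, globally unique, but it carries no resolution mechanism, whereas a DOI is designed to resolve to a location through a service:

    # A UUID uniquely identifies but does not locate.
    import uuid

    dataset_uuid = uuid.uuid4()
    print("urn:uuid:{}".format(dataset_uuid))   # unique, but not resolvable

    # A DOI, by contrast, resolves to a location via the Handle System.
    doi = "10.12345/example-dataset"            # hypothetical DOI
    print("https://doi.org/" + doi)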

While this study evaluated the nine unique identifiers for their suitability for earth science data, the detailed and comprehensive evaluation could be applied to other types of data as well; much of it may transfer as-is, while some criteria may need to be re-evaluated. In my future career, I expect that I will be advising faculty researchers on the use of unique identifiers for their data sets and also guiding policy within the library on assigning unique identifiers to data sets ingested into the repository. This article provides a thorough, unbiased explanation of the strengths and weaknesses of each type of identifier.

Gray, J., Szalay, A. S., Thakar, A. R., Stoughton, C., & VandenBerg, J. (2002). Online scientific data curation, publication, and archiving. Proceedings of SPIE, 103-107. doi: 10.1117/12.461524

This article, though relatively old in the field of data set publication and peer review, contains some useful themes and terminology for this topic. The authors discuss publication of data sets and introduce the Sloan Digital Sky Survey and the Virtual Observatory as examples.

One important theme discussed in this article is that ephemeral data must be preserved, while stable data can be recreated using metadata; metadata, however, is itself ephemeral data. The authors introduce the term edition, referring to an updated publication of a data set that may add new data and correct errors in old data. Additionally, they introduce the concept of data pyramids: because each new edition reprocesses the data with new software while old editions are retained, the size of the complete data set containing all editions grows exponentially. Lastly, they introduce the term data inflation, the “tendency for derived data products to proliferate” (Gray et al., 2002).
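A toy back-of-the-envelope calculation shows how quickly a data pyramid can grow when every edition is retained; the sizes and growth factor below are invented for illustration:

    # Toy illustration of the data pyramid: cumulative archive size
    # when every edition is retained. All numbers are invented.
    size_tb = 1.0              # size of the first edition, in TB
    growth_per_edition = 1.5   # assumed growth factor per edition

    total_tb = 0.0
    for edition in range(1, 6):
        total_tb += size_tb
        print("Edition {}: {:.1f} TB (cumulative: {:.1f} TB)".format(
            edition, size_tb, total_tb))
        size_tb *= growth_per_edition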

This article does not provide much substance beyond the definitions described above, perhaps because it is a conference paper rather than a full scholarly article. The authors also make a claim I disagree with: “If you can afford to store some digital information for a year, you can afford to buy a digital cemetery plot that will store it forever.” I am aware of the decreasing cost of digital storage space, but maintaining the accessibility and meaningfulness of data sets permanently is an expensive endeavor that cannot be done automatically. People must periodically refresh, migrate, and interpret the data sets for them to remain accessible over the long term. This brings to mind a remark I once heard: “Storage is cheap, but infinite storage is infinitely expensive.” It gives me pause and puts the storage issue into perspective.

Green, T. (2009). We need publishing standards for datasets and data tables. OECD Publishing White Paper, OECD Publishing. doi: 10.1787/603233448430

This report, though not a peer-reviewed journal article, is an informative white paper on why we need publishing standards for data sets. Green makes the case by showing that a Google search for “datasets” turns up a mish-mash of websites, many of them unprofessional and riddled with broken links. As the head of OECD Publishing at the Organisation for Economic Co-operation and Development (OECD), he also gives the example of how people cite OECD data: sometimes the print versions, sometimes the website, and sometimes just “OECD.” He says people generally follow the “post-it-and-Google-will-find-it” approach, which he clearly demonstrates does not work. All this leads to the conclusion that, if we expect data to be findable and citable, we must have some standard for publishing them.

The OECD publishes data sets often. Thus, in response to this difficulty in finding research data sets, the OECD has proposed a bibliographic standard for them. Data sets published by the OECD are now identified with a DOI, and users can download citations in formats compatible with bibliographic management software, such as EndNote and RefWorks.

Possibly the most helpful and promising aspect of this proposal is the metadata schema proposed for describing data sets. Using this metadata, libraries can obtain MARC records to add to their catalogs to aid in finding these data sets.

As a data curation librarian, I will be cataloging data sets in my institution’s data repository (if we have one). This document will be a useful resource for describing data sets with a standardized metadata schema.

Heidorn, P. B. (2008). Shedding light on the dark data in the long tail of science. Library Trends, 57(2), 280-299. Retrieved February 13, 2013, from Project MUSE database.

P. Bryan Heidorn describes what he calls the long tail of dark data, a reference to a graph showing the size of data sets on the Y-axis and the number of data sets of a particular size on the X-axis. The graph shows that small data sets, what Heidorn calls “boutique data sets,” far outnumber large data sets, but they are no less valuable. He calls them “dark data” because they “exist[s] only in the bottom left-hand desk drawer of scientists on some media that is quickly aging and soon will be unreadable by commonly available devices.”

This article was selected for this annotated bibliography because it is those dark data sets in the long tail that rarely get published and, consequently, rarely get cited. There is a vast amount of useful data in these data sets that could be reused and synthesized for future research; unfortunately, the data are often unusable because of poor documentation. Another aspect of their value lies in the fact that they contain useful information on “failed” experiments, though no project is really a failure if it advances the field of knowledge, even if only by showing that a particular path leads nowhere. It has been shown that negative research results tend not to get published (Stern, J. M., & Simes, R. J. (1997). Publication bias: Evidence of delayed publication in a cohort study of clinical research projects. BMJ, 315(7109), 640-645). As with rewarding the publishing of data sets, this is another aspect of research culture that must change if negative research findings are to see the light of day.

As a data curation librarian at a research university library, I hope to shed some light on the dark data sets at my institution by encouraging researchers to publish them. I will also help them improve the data management of current research projects so their data can be published rather than put away in a desk drawer.

Kim, J. (2013). Data sharing and its implications for academic libraries. New Library World, 114(11/12).

Annotation forthcoming

Klump, J., Bertelmann, R., Brase, J., Diepenbroek, M., Grobe, H., Hock, H., . . . Wachter, J. (2006). Data publication in the open access initiative. Data Science Journal, 5, 79-83.

This article discusses data publication, citation, and reuse, but its central concern is a concept that appears repeatedly in the articles in this bibliography: motivation and incentive. The authors discuss how we can increase incentive, thereby increasing motivation to publish and share data sets. Like other articles, this one claims, appropriately, that sharing data sets is a time-consuming proposition, mainly because of the effort required to put data sets into a format and state that is useful to others. Therefore, the incentives must outweigh the additional effort required to publish data sets. One main incentive for publishing data sets is the recognition scientists will receive when their data sets are published: their reputations will grow as the data sets they publish are used and cited by others. Of course, this reputation and recognition will only come when the culture of academia changes to place greater value on published data sets in tenure and promotion considerations.

Another topic this article discusses, which Stodden (2009) explores further, is the issue of intellectual property rights for data sets. There needs to be a licensing system in place for data sets that protects the intellectual property rights of their creators. The authors of this article suggest using Creative Commons licenses, such as the CC Attribution-NonCommercial-ShareAlike 3.0 license. However, Stodden claims the ShareAlike component discourages scientific innovation and proposes a system she calls the Reproducible Research Standard, described in more detail in the annotation of that article below.

I appreciated this article’s emphasis on motivation and incentive, because I believe their absence is the main roadblock to increasing data publication and, thus, data sharing. I also appreciate the effort the authors put into defining a way to protect intellectual property rights. This article was published in 2006, and these efforts are still underway. In my career, I want to be an advocate for intellectual property rights and for the recognition of data sets in tenure and promotion, and thereby for the sharing of data.

Lawrence, B., Jones, C., Matthews, B., Pepler, S., & Callaghan, S. (2011). Citation and peer review of data: Moving towards formal data publication. International Journal of Digital Curation, 6(2), 4-37.

This article promotes citation and peer review of data sets as a way to encourage data set publishing. The authors discuss several key aspects of data publishing and offer suggestions for their implementation.

The authors advocate for “Publishing” data sets, with a capital “P.” This is in contrast to simply posting data on the internet with little to no regard for accuracy and metadata. Publishing (with the capital “P”) ensures the data are accurate and reliable, which must be accomplished through a data peer review system. That system is described as “a procedure which allows the community to make assertions about the trustworthiness and fitness for purpose of the data.” This method of peer reviewing data sets is used by some data journals, such as Earth System Science Data.

The authors attempt to answer the question, “What exactly is peer review of data?” They argue that there are similar requirements for data peer review as with article peer review. The main things that must be checked are the metadata (for completeness), the data’s internal consistency, the merits of algorithms and equations used, and judgments of the data’s importance and potential impact. The authors then provide a data review checklist with items relating to metadata, data quality, and general characteristics.
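As a sketch of how such a checklist might be represented in practice, the items below paraphrase the categories named above rather than reproduce the authors’ actual checklist:

    # Sketch of a data peer review checklist, organized around the
    # categories described above (items paraphrased, not verbatim).
    checklist = {
        "metadata": [
            "Required metadata fields are present and complete",
            "Collection methods are documented",
        ],
        "data quality": [
            "Data are internally consistent",
            "Algorithms and equations used have merit",
        ],
        "general": [
            "Data set has sufficient importance and potential impact",
        ],
    }

    def unmet_items(results):
        """results maps each checklist item to True/False; returns unmet items."""
        return [item for items in checklist.values() for item in items
                if not results.get(item, False)]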

The authors end with yet another recommendation for citation of data sets; the competing recommendations indicate that a standard is sorely needed. Their recommendation contains many of the same items suggested in other citation recommendations, but adds “extent,” “availability,” and “language.”

Even though the authors recommend another standard for data citation, this article is thorough and useful. As I work to advocate for data peer review, publishing, and citation standards, this article will provide a good foundation to develop needed policies.

Parsons, M. A., Duerr, R., & Minster, J. B. (2010). Data citation and peer review. Eos, Transactions American Geophysical Union, 91(34), 297.

In this article, Parsons, Duerr, and Minster claim there is a disconnect between scientific papers and the data that support the claims made in those papers. Since the data behind a scientific research paper are of such high importance – the authors suggest they are as important as the papers they support – the article asserts they should be cited just as any article referenced in the publication would be.

One reason the authors give for this disconnect is the lack of consistent, standardized data citation formats. Some data centers have suggestions on how to cite their data, but there is no consistency among data centers, and some do not require citation of data at all. Common practice is to cite the published papers describing the data, but the authors argue that since data are often dynamic and often extend beyond the scope of one article, citing the paper is not a suitable substitute for citing the data themselves. This article introduces the International Polar Year Data Policy citation format, which shares some parts with the citation format described in Altman and King (2007) but differs in others.

Similarly, the authors give another reason for the lack of data citation and publication: the state of data peer review. While some data centers have methods in place to check the data sets archived there, there are no consistent methods for peer review of data sets. Questions arise, such as whether peer review of data means checking the accuracy of all data points or verifying (“auditing”) the data collection and documentation practices. Peer review of data sets is currently being debated within the scientific community, and projects such as the “Peer Review for Publication & Accreditation of Research Data in the Earth Sciences” (PREPARDE) project seek to define what data peer review should consist of. These questions need to be answered to achieve a consistent methodology for peer reviewing data sets.

Part of my future career as a data curation librarian will be offering guidance to campus researchers as to how they can have their data sets peer reviewed, so following this project’s developments will inform those discussions. Concepts from the peer review debate will also be useful in implementing effective data management best practices.

Parsons, M. A., & Fox, P. A. (2013). Is Data Publication the Right Metaphor? Data Science Journal, 12, WDS32-WDS46.

This article was assigned reading in the Foundations of Data Curation course taught by Carole Palmer in Spring 2012. In it, Parsons and Fox question whether “publishing” data sets is the correct metaphor to use. They assess other metaphors and argue that no single metaphor encompasses enough to be the only one used. The other metaphors include ones drawn from industrial production; from cartography; and from artisanal, tailored services.

One drawback of the data publication metaphor, according to the authors, is that it does not look at the “big picture,” i.e., it does not emphasize data interoperability among systems. In other words, the publisher publishes a data set as an end product without any regard to how it might work together with data sets from other publishers. While this might be true to an extent, interoperability between data sets comes from understanding their differences and how to let each complement the other. Machine readability of data sets also helps to increase interoperability, but it is outside the scope of this bibliography.

While this article proposed some interesting philosophical questions, Parsons and Fox do not convince me that other metaphors are better than the data publication metaphor. I see the publication metaphor as an important part of the argument that researchers should make their data publicly available since they already live in a publication world. Nevertheless, during our class discussion on this article last year I proposed the metaphor of data economy. I argued that it was broad enough and that the interoperability between parts is a key feature of an economy.

Paskin, N. (2005). Digital object identifiers for scientific data. Data Science Journal, 4, 12-20. doi: 10.2481/dsj.4.12

The annotation of this article complements the annotation of the Archival Resource Key (ARK) website below (Willett & Kunze, 2013). Willett and Kunze describe why and how the ARK is used, how to obtain one, how to manage it, and its structure; this article does the same for the Digital Object Identifier (DOI). It answers several questions in turn: What is the DOI? What is its structure? How does it resolve? And how is it implemented?

The article begins by explaining that the DOI is a name, not a location. Though the DOI can be, and is, used to locate a resource, it is fundamentally an identifier for that resource. The International DOI Foundation provides a DOI resolution service so that a person who knows the DOI of a resource can find it easily.

The structure of the DOI is similar to that of the ARK. It begins with the “DOI prefix” which identifies the naming authority. After the prefix, there is the “DOI suffix” which is the actual identifier. The DOI accommodates any type of identifier, even an ISBN or ISSN.
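By analogy with the ARK anatomy diagram reproduced in the Willett and Kunze (2013) annotation below, the anatomy of a hypothetical DOI might be sketched as:

    10.12345/abc.678
    \______/ \_____/
    DOI prefix   DOI suffix
    (naming authority)   (assigned identifier)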

Resolution of the DOI happens on a resolution server based on the Handle System. When a requester sends a DOI to the Handle System, it determines which type of information is requested, then sends it back to the requester in whatever form is appropriate, such as metadata, video, or a data set.
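As a minimal sketch of resolution in practice, the snippet below follows a hypothetical DOI through the public doi.org resolver; the content negotiation request in the second step is a convention supported by some registration agencies and may not work for every DOI:

    # Sketch: resolving a hypothetical DOI through the public resolver.
    import requests

    doi = "10.12345/example-dataset"   # hypothetical DOI

    # Follow the resolver's redirects to the landing page.
    landing = requests.get("https://doi.org/" + doi)
    print(landing.url)

    # Some registration agencies return metadata via content negotiation.
    metadata = requests.get("https://doi.org/" + doi,
                            headers={"Accept": "application/vnd.citationstyles.csl+json"})
    print(metadata.status_code)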

The remainder of the article discusses the DOI Data Model, which consists of a data dictionary and a framework for applying it, and describes how the DOI is implemented. It goes on to describe two projects that have implemented the DOI system: the German National Library of Science and Technology’s project assigning DOIs to primary research data, and the Names for Life project, which assigns DOIs to taxonomic data.

One disadvantage of the DOI is that there is a cost for an organization to assign one. In today’s world of decreasing library and data center budgets, this cost is a strong incentive to use an ARK over a DOI; however, the DOI has the advantage of being a more widely used standard. As a data curation librarian, using DOIs for data sets will provide unique identifiers for data sets within my institution’s repository. I am certain that I will return to this resource in the future when explaining the benefits of the DOI at my workplace.

Penev, L., Erwin, T., Miller, J., Chavan, V., Moritz, T., & Griswold, C. (2009). Publication and dissemination of datasets in taxonomy: ZooKeys working example. ZooKeys, 11(0), 1-8. doi: 10.3897/zookeys.11.210

This article describes a novel approach to data publishing with semantic enhancements within the normal process of journal publishing. The approach is demonstrated through an example in the ZooKeys journal. The following enhancements were added in this example:

  1. All data are published as a dataset under a separate DOI within the paper
  2. The dataset is indexed in the Global Biodiversity Information Facility simultaneously with the publication
  3. The dataset is published as a KML (Keyhole Markup Language) file under a distinct DOI
  4. The interactive map contains data for all specimens in the article and links to collections of images for each species in Morphbank
  5. Data can be filtered to display or hide any family, genus, or species
  6. All new taxa are registered at ZooBank during the publication process
  7. All new taxa are provided to the Encyclopedia of Life through XML markup on the day of publication

These semantic enhancements to the journal article provide rich information and visualization of the data behind it, data which are often hidden from public view and difficult to obtain.

The approach laid out in this article is innovative and seems to address concerns about citation, use and reuse, and recognition of effort. Of course, this approach takes considerable time, since all the additional information beyond the journal article must be processed, but, hopefully, the extra effort will be outweighed by the additional recognition scientists receive for doing it.

Stodden, V. (2009). The legal framework for reproducible scientific research: Licensing and copyright. Computing in Science & Engineering, 11(1), 35-40.

In this article, Victoria Stodden proposes a method by which scientists could release the copyright on their material and attach licenses to all parts of scientific scholarship. To reproduce scientific research, more than just the data needs to be available: articles, code, the experiment description, tables, figures, graphs, and any other auxiliary materials are needed as well.

Stodden calls this method the Reproducible Research Standard and defines it as having three parts: attaching a Creative Commons BY license to any media components, attaching a modified Berkeley Software Distribution (BSD) license to any code resulting from the research, and attaching the Science Commons Database Protocol to any data related to the research.
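To make the three-part licensing concrete, a research compendium following the Reproducible Research Standard might be laid out as below; this directory structure is my own illustration, not Stodden’s:

    my-research-compendium/
        paper/      <- article, figures, tables: Creative Commons BY license
            LICENSE
        code/       <- analysis scripts: modified BSD license
            LICENSE
        data/       <- data sets: Science Commons Database Protocol
            LICENSE
        README      <- states which license applies to which component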

Authors must actively choose to place their material under these licenses or accept the default copyright. Stodden claims this configuration of licenses will encourage scientific innovation by removing roadblocks to reusing data for new research.

I am interested in advocating this type of licensing standard on my university campus someday, but while the standard seems logical and straightforward, I am not sure whether it has been accepted within the scientific community. I have contacted Dr. Stodden to ask how much success she has had in advancing the standard and for examples of where it has been employed.

Swan, A., & Brown, S. (2008). To share or not to share: Publication and quality assurance of research data outputs. Report commissioned by the Research Information Network (RIN).

This article was included in this annotated bibliography because it reports on a study in the United Kingdom of whether researchers share their research data and what issues and challenges they encounter along the way. It is similar to the Data Curation Profiles (DCP) project (http://datacurationprofiles.org/), which also studies researchers’ willingness to share research data, with whom they would share it, and when.

Swan and Brown present a clear picture of how researchers feel about sharing their data. The study results show that some researchers are motivated to share through altruism, hope for collaboration, or peer pressure. Unfortunately, however, the incentive to share data faces headwinds because of the lack of recognition or reward for publishing data sets. Regarding peer review of data sets, the study found that while the researchers who created the data sets are best suited to ensuring their quality, no clear standards for reviewing the quality of data sets exist.

In this annotation, I focus on the motivations and incentives for researchers to share their data sets. The results of this study show that researchers are generally not as concerned with publishing and sharing their data sets when the pressure to publish articles is so high. Researchers are engaged in a cycle of pressure to get funding, pressure to publish from that funding, and then pressure to get more funding. Swan and Brown found that researchers would be willing to share their data sets if three questions were answered. The first concerns the benefits of sharing data, the second, a workable data citation mechanism, and the third, clear career rewards for sharing data.

This article will serve my future interests when I speak to researchers about sharing their data: first, as evidence that researchers’ needs and desires are being heard, and second, by informing policy development for data sharing at my institution. I fully understand the desire for clear rewards for sharing data sets, and I am also fully aware of the effort required to prepare a data set for publishing. I hope to serve researchers at my institution by being an advocate for their needs, while also meeting the demands of their funders and their institution.

Willett, P. & Kunze, J. (January 5, 2013). ARK (Archival Resource Key) Identifiers. In UC Curation Center: Curation Wiki. Retrieved February 17, 2013, from https://confluence.ucop.edu/display/Curation/ARK.

This annotation is of the website describing the Archival Resource Key (ARK) from the California Digital Library’s wiki. I chose to review this entry because it provides a more detailed look at the ARK, including description of why and how it is used, how to obtain an ARK, how to manage it, and the structure of the ARK.

The article explains why one might want to use an ARK, including that it is free, self-sufficient, and portable. It also lists benefits and advantages of the ARK, such as its simplicity, versatility, and transparency.

The most useful part of the article is the section detailing the anatomy of an ARK. The article provides the following diagram to demonstrate the anatomy:

Figure 1 – Anatomy of an Archival Resource Key

   http://example.org/ark:/12025/654xz321/s3/f8.05v.tiff
    \________________/ \__/ \___/ \______/ \____________/
      (replaceable)     |     |      |       Qualifier
           |       ARK Label  |      |    (NMA-supported)
           |                  |      |
 Name Mapping Authority       |    Name (NAA-assigned)
          (NMA)               |
                   Name Assigning Authority Number (NAAN)

The first part of the ARK, the Name Mapping Authority, is the hostname through which the user reaches the digital resource. It is replaceable with a new domain if the resource moves to another location, but the ARK assigned to the resource remains the same and stays with the resource long after the organization hosting it is gone.

Another interesting feature of the ARK is that it can link not only to the resource but also to metadata about the resource. If one appends a single question mark (?) to the ARK expression, the metadata for that object is returned. The metadata is a brief record expressed in Electronic Resource Citation, which is a subset of Dublin Core.
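Using the example ARK from the diagram above, such a metadata request would presumably look like this:

    http://example.org/ark:/12025/654xz321?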

One of the important advantages of the ARK is that there is no cost for an organization to assign one. In today’s world of decreasing library and data center budgets, this is a strong incentive to use an ARK over a DOI. As a data curation librarian, using ARKs for data sets will provide free, unique identifiers for data sets within my institution’s repository. I am certain that I will return to this resource in the future when explaining the benefits of the ARK at my workplace.
