Appraisal and Selection of Digital Data

Introduction

This annotated bibliography contains selected resources covering the topics of appraisal and selection of digital data. Since not all data sets from scientific studies can be archived, there needs to be a set of criteria to guide the selection and appraisal of the resources. The terms “selection” and “appraisal” are defined here for this annotated bibliography. Selection of digital material refers to the set of general criteria outlining which types of materials will be accepted within a library’s collection. Appraisal of digital resources refers to the process by which it is determined if a specific resource meets the selection criteria and will be obtained.

Resources include journal articles, reports, and websites. Resources were located through searches on databases, such as Library Literature and Information Science Full Text, Library and Information Science Abstracts, and Google Scholar. Search terms included “appraisal selection data” and “digital material appraisal.” Starting from the initial search results, the pearl growing method was employed to locate additional resources.

Anderson, W. L. (2004). Some challenges and issues in managing, and preserving access to, long-lived collections of digital scientific and technical data. Data Science Journal, 3, 191-201.

In this article, Anderson asserts that scientific and technical data are important to maintain for three main reasons: (1) science is becoming more and more interdisciplinary, (2) the requirements of data management are changing, and (3) the role and nature of data preservation and access are changing. However, says Anderson, there are four main groups of issues surrounding preserving access to digital materials: science, management, policy, and technical issues.

Science-based issues start with the question, “What is scientific data?” Second, metadata must be collected with the data to make the data understandable to anyone not associated with the project. Third, what constitutes useful scientific data varies from country to country. Fourth, nomenclature and taxonomy change over time, so they need to be saved with the data. And last, there are barriers to preservation and access.

Management-based issues raise questions such as: Which tasks are associated with effective data management? Who will fund these efforts? How will we teach the skills necessary to manage data properly?

Policy-based issues deal with the differences in national perspectives on scientific data, ownership of data, national security, and ethics. They also deal with fair use of and open access to data. For example, what is fair use of scientific data? This type of question needs to be answered.

Last, the technical issues with preserving access to scientific data deal with things like the large variety of file formats and the fact that preserving scientific data and databases is different from preserving literature.

I selected this article for this annotated bibliography because it gives a nice overview of challenges we already face and will continue to face with preserving access to scientific data. These issues and challenges must be overcome, because preserving scientific data is important for the success of science.

Davis, H. M., & Vickery, J. N. (2007). Datasets, a Shift in the Currency of Scholarly Communication: Implications for Library Collections and Acquisitions. Serials Review, 33(1), 26-32. doi:http://dx.doi.org/10.1016/j.serrev.2006.11.004

This article discusses how the currency of scholarly communication is shifting away from the traditional peer-reviewed journal article to the publishing of other parts of the scholarly process, specifically data sets. The authors suggest ways libraries can prepare for this changing landscape. This thought-provoking article poses pertinent questions libraries must ask themselves going forward. This article was included to provide support that digital data sets are worth selecting and appraising.

The authors identify five trends that indicate data is becoming more important. First, data is increasingly becoming a commodity. Second, legislative trends regarding copyright and fair use are becoming more important. Third, the growth and organization (or lack thereof) of data poses an interesting challenge, namely locating, accessing, and reusing data sets. Fourth, publishing companies are implementing policies regarding data sets associated with articles they publish. Last, public/private partnerships, such as the Human Genome Project, are being created that add value to and repackage data sets for commercial use.

The authors go on to discuss current business models for accessing data sets. Among the models they identify is the Institutional Membership Model, in which institutions pay a one-time fee for access to data sets. Another model is the Serials Continuation Model in which the institution acquires the data sets on CD or other physical media on a periodic basis. Limitations of this model are that users may not be able to access these data sets online or away from the institution. A third model the authors discuss is the One-Time Payment Model, which is similar to the way most academic libraries currently buy monographs. Under this model, the institution would purchase data sets when identified by researchers as helpful to their research. The authors then go on to discuss what they call transitional models, i.e., non-traditional models that may indicate a transition to a more well-defined model. The two transitional models identified are the Ad-Hoc Model and the lack of a model. The authors claim that both of these present opportunities for the libraries to develop a sustainable business model.

Continuing beyond these models, the authors identify potential questions libraries must consider in key parts of their functioning regarding data sets: budgeting, selection and evaluation, licensing, and negotiation. Libraries need to ask how they will appropriate money towards purchase of data sets. Regarding selection and evaluation, libraries need to determine which data sets they will acquire and collect and how they will determine their level of quality. Additionally, should libraries license access to the data sets they collect? If so, how would that work? Likewise, how will they manage interlibrary loan of data sets and potential liability?

Finally, the authors suggest that libraries can exert significant influence on the data set market by being proactive and collecting data sets “at the source,” i.e., from researchers within their institutions. Because of the trend of data sets becoming more important, we are likely to see more data sets being published. In my future career as a data curation librarian, I will encourage data set publication at my institution. This article provides a nice overview of the reasons why it is important.

Digital Preservation Coalition. (n.d.). Decision tree for selection of digital materials for long-term retention. In Digital Preservation Coalition. Retrieved June 24, 2010, from http://www.dpconline.org/advice/preservationhandbook/decision-tree

This is not an article, but a resource for determining whether or not a digital resource should be acquired for the library’s collection. It uses a decision tree format, similar to a flow chart, to ask a series of questions one can answer. Depending on the answer, the person is directed to another box. It is a convenient tool that I appreciate since it uses a logical algorithm.

The questions cover whether the resource fits within the overall collection development policy, whether it is needed for research purposes, whether the library will have to preserve it, whether it is technically feasible to migrate the resource to new formats, and whether sufficient documentation has been provided.

Tools like the one in this resource provide a straightforward way for library personnel to determine whether or not to obtain a digital resource. The concepts included in this decision tree are applicable to data sets, as well as other digital materials. It will be a helpful resource for my future career as I will have to develop workflows for ingesting data sets into a repository and develop policies for selection and appraisal.
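To make the decision-tree idea concrete, here is a minimal sketch in Python. It is not the Digital Preservation Coalition's actual tree. The five questions are paraphrased from the topics listed above, while their order, the field names, and the outcome wording are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """Yes/no answers for one digital resource (field names are hypothetical)."""
    fits_collection_policy: bool
    needed_for_research: bool
    library_must_preserve: bool
    migration_feasible: bool
    documentation_sufficient: bool


def appraise(c: Candidate) -> str:
    """Walk the questions in a fixed (assumed) order and return a recommendation."""
    if not c.fits_collection_policy:
        return "reject: outside collection development policy"
    if not c.needed_for_research:
        return "reject: no identified research need"
    if not c.library_must_preserve:
        return "accept for access only: preservation is handled elsewhere"
    if not c.migration_feasible:
        return "refer: migration to new formats is not technically feasible"
    if not c.documentation_sufficient:
        return "defer: request more documentation"
    return "accept for long-term retention"
```

Each branch stops at the first "no," which mirrors how a flow chart sends the reader to a different box depending on the answer. A real policy would likely attach follow-up actions to each outcome rather than a single string.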

Eastwood, T. (2004). Appraising digital records for long-term preservation. Data Science Journal, 3, 202-208.

This article seeks to apply appraisal concepts from the archival field to the appraisal of digital objects. The author notes that one main difference between appraising objects for a traditional archive and appraising digital objects is that, with digital objects, we must preserve not only the resource itself but also the means by which we can view it. In other words, if we preserve a Microsoft Word document created today, we will also need to preserve the software required to read it (e.g., Microsoft Word 2010).

According to Eastwood, there are four main parts to appraising digital resources for preservation. The first is compiling information about the digital objects. The second is assessing the resources’ capacity to serve the needs of the community in question. The third is determining the feasibility of preserving the digital resources. The last is making the actual appraisal decision.

This article was chosen for this annotated bibliography because of the comparison with traditional archives techniques of appraising resources. I have heard some people claim that data curation has more in common with traditional archives, so it should learn some lessons from that world. This article was successful in laying out the reasoning why we could learn so much from archives.

Esanu, J., Davidson, J., Ross, S., & Anderson, W. (2004). Selection, appraisal, and retention of digital scientific data: Highlights of an ERPANET/CODATA workshop. Data Science Journal, 3, 227-232.

This article complements Anderson (2004) above. Whereas Anderson discusses the challenges and issues surrounding preservation of scientific data, this article describes a workshop held in Portugal to study selection, appraisal, and retention of scientific data in light of those issues and challenges.

One of the main goals was to determine commonalities and differences in the selection, appraisal, and retention of scientific data across different scientific disciplines. One commonality that arose was that maximum value of scientific data is achieved by reuse of that data. Value of research data increases with use. One difference is that the amount of data gathered and the methods by which it is gathered vary greatly across different disciplines.

One of the outcomes of this workshop was the consensus of all in attendance that data are the basis for scientific discovery. Data often have more than one life as scientific inquiry advances and reuses data. Also, there was agreement that curation and management of data maximize the initial investment in creating data.

Gutmann, M., Schurer, K., Donakowski, D., & Beedham, H. (2004). The selection, appraisal, and retention of social science data. Data Science Journal, 3, 209-221.

In this article, the authors claim that it is impossible to archive data from every study done, so there needs to be a procedure for selecting and appraising data sets based on certain criteria. This article introduces the selection and appraisal process at two data repositories: the Inter-University Consortium for Political and Social Research (ICPSR) in the United States, and the UK Data Archive in the United Kingdom.

The ICPSR’s selection and appraisal policies are included in this annotated bibliography below. This article was selected for this bibliography because it compares strategies across different national data centers. The data centers are similar in that they both have mandates to collect data from certain activities and certain groups. ICPSR has a mandate to collect data from its member institutions; the UK Data Archive has a mandate to collect data from the Economic and Social Research Council in the UK. One difference in the two archives lies in how they determine which data sets will be retained and preserved. The ICPSR has flexibility in deciding how much effort its processors want to expend on any given data set, based on its value and interest. UK Data Archive has an Acquisition Review Committee that decides how much effort is expended on any data set. The authors also compare issues such as confidentiality and metadata between the two archives.

Discussions and comparisons of two successful data archives are helpful for my future career, as I help my institution develop policies for selection and appraisal of scientific data for the data repository. This article will be a useful reference.

Harvey, R. (June, 2006). Appraisal and selection. In DCC Digital Curation Manual. Retrieved February 23, 2013, from http://www.dcc.ac.uk/resources/curation-reference-manual/.

I selected this article for this annotated bibliography because it contains valuable information on how to design selection and appraisal policies for digital data sets. Because it comes from the Digital Curation Centre (DCC) in the UK, I expected it to be heavy on practical advice and light on theory, and that is certainly the case. Coming from a background in engineering, a very applied, practical discipline, I appreciate the DCC’s approach to its materials, essentially “this is what you need to know and this is how you do it.” This article will be a frequently consulted resource as I work to implement policies at my institution, as it contains step-by-step procedures and questions to answer for developing policies.

Harvey begins with a discussion of the need for appraisal and selection policies. As Whyte and Wilson (2010) said, “[i]t is not possible for all digital data to be kept forever…” This is why we need guidance criteria on which data sets will be kept and which will not.

One way appraisal and selection of digital materials differs from that of physical materials, Harvey points out, is that additional factors come into play, such as intellectual property, copyright, and preserving additional representation information for a resource. Conversely, one way it is similar is that every situation is different and requires different criteria. In other words, every library has a different set of users with different needs, so its collection development criteria should meet the needs of its users. Likewise, every institution has different needs with respect to digital data sets, so its criteria need to reflect that.

ICPSR. (n.d.). ICPSR collection development policy. In ICPSR Data Management. Retrieved February 23, 2013, from http://www.icpsr.umich.edu/icpsrweb/content/datamanagement/policies/colldev.html.

This website was chosen for inclusion as an example of one prominent data repository’s collection development policy. It provides several useful concepts that could be implemented at my institution; I intend to refer to it in the future when developing policies there, assuming none exist. I find it important that the policy begins with an explanation of who ICPSR’s users are. If the users are not known, how can an organization determine how to meet their needs, or whether it is meeting them?

ICPSR’s collection development policy includes five criteria. It also explains that ICPSR seeks data within certain emphasis areas, such as diversity data, mixed-method data, and international data. Another important aspect to ICPSR’s collection development policy is that when data sets are not accepted for archiving, ICPSR personnel attempt to refer the data set’s creator to a suitable repository. I especially like that this is done instead of simply denying acceptance and moving on.

ICPSR distinguishes its selection policies from its appraisal policies on the website. As explained in the introduction, there is a difference: one deals with the overall criteria for which items will be included (selection), and the other deals with determining whether or not one resource meets the criteria (appraisal).

National Archives. (September, 2007). Strategic directions: Appraisal policy. In National Archives. Retrieved March 5, 2013, from http://www.archives.gov/records-mgmt/initiatives/appraisal.html.

This resource was selected for this annotated bibliography because it provides a thorough overview of the policies of selection and appraisal for a government agency. Government agencies must determine the value of the data they produce. It might be useful for institutions to model their selection and appraisal policies after the National Archives and Records Administration (NARA).

This article discusses three main groups of records that NARA collects: records documenting rights of citizens, records documenting the actions of federal officials, and records documenting the national experience. It then outlines six criteria for determining archival material. Examples of these criteria include (1) NARA will collect any records that “Provide evidence of our Government’s conduct of foreign relations and national defense” and (2) NARA will collect any records that “Provide evidence of the significant effects of Federal programs and actions on individuals, communities, and the natural and manmade environment.”

This resource will be a helpful reference when developing selection and appraisal policies for my institution. I especially like the fact that they provide a description of the classes of data they will accept. This will be a helpful addition to any selection and appraisal policy, and one that I will employ at my institution.

Palmer, C. L., Weber, N. M., & Cragin, M. H. (2011). The analytic potential of scientific data: Understanding re‐use value. Proceedings of the American Society for Information Science and Technology, 48(1), 1-10.

As data sets become more plentiful, selecting and appraising them based on a defined set of criteria is important. This article talks about one aspect of the selection and appraisal of data: the determination of whether or not it has re-use potential. Palmer, Weber, and Cragin adapt the concept of “epistemological potential” of documents from Hjorland. The authors claim that when determining which data sets to keep, analytic potential should be the guiding concept for collections supporting research.

Using the concept of epistemological potential, the authors identify three main aspects of the analytic potential of data sets: preservation readiness, potential user communities, and fit for purpose. They describe each of these aspects in more detail. Preservation readiness refers to representation, provenance, context, reference, and fixity information. Fit for purpose means that the data is fit (that is, ready and applicable) for the intended purpose. Last, potential user communities, which is directly associated with fit for purpose, refers to the intended group of people who will use this data.

I believe this article’s subject, re-use potential, is key to determining selection and appraisal policies. Even so, if there is no method by which to determine a data set’s re-use potential, it will be difficult to determine whether or not to save it. Concepts from this article will be helpful in my future career as I will have to develop policies and use those policies to archive data sets.

Phillips, L. L., & Williams, S. R. (2004). Collection development embraces the digital age. A review of the literature, 1997-2003. Library Resources & Technical Services, 48(4), 273-299.

This article is a literature review that addresses how collection development has changed with the advent of the internet. An institution must have a collection development policy for any material it obtains to ensure that it is obtaining resources that meet its users’ needs. In addition to guidance on collecting books, the institution should also have a policy on collecting data sets. This article was chosen as context and background for where we are currently with respect to digital data sets and resources.

The first major change in collection development in the digital age was that the collections themselves were becoming digital. This raises the question of how best to manage collections that mix physical and digital materials. Librarians welcome guides that explain how to manage digital collections.

The second major change with collection development concerns scholarly publishing. Issues like copyright and open access became more important. This has important implications for data sets within institutions, especially since the current climate is one of encouragement and even requirement to share data sets from federally funded research.

This article is a thorough overview of the history behind collection development, which includes appraisal and selection. It helped me to understand why we do some of the things we do now in the digital age. It will be a helpful reference in my future career.

Whyte, A., & Wilson, A. (2010). How to appraise and select research data for curation. In Digital Curation Centre. Retrieved February 23, 2013, from http://www.dcc.ac.uk/resources/how-guides/appraise-select-data

Since, as Whyte and Wilson say, “[i]t is not possible for all digital data to be kept forever…” appraisal and selection policies must be in place to determine which data sets will be kept and which will be disposed of. The authors give four reasons why appraisal and selection of data sets is important. First, the expanding volume of digital data may offset the decreasing cost of storage, so that overall storage costs do not fall. Second, backup and mirroring of data, which are essential to long-term preservation, make the cost of storage at least twice as much. Third, if we kept everything, the “noise to signal ratio” of searches would be too high. In other words, the relevancy of search results would decrease. Last, managing data is expensive, so having to manage everything is not economically feasible.

To avoid the four issues mentioned in the preceding paragraph, the authors say that we must appraise and select research data according to a policy similar to a library’s or archives’ collection development policy. As such, selection is guided by set principles and is not “ad-hoc.” To be complete, the appraisal and selection policy should cover seven areas:

  1. Does the data set serve the mission of the institution?
  2. Does the data set have any scientific or historical value?
  3. Is the data set the only or most complete source of this data?
  4. Is this data set one that the institution could redistribute to other institutions?
  5. Are the data irreplaceable? Climate observations fall into this category. Once the data have been collected, they cannot be collected again.
  6. Could the data set provide any future financial benefits? In other words, will it be valuable and worth selling?
  7. Is there adequate documentation provided with the data set so that someone else can understand it?
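The seven questions above lend themselves to a simple checklist. The sketch below is one way an institution might record the answers for a candidate data set. The criterion labels are shortened paraphrases of the list, and the function name is hypothetical. No weighting or pass/fail threshold is applied, because this summary of the guide does not specify one.

```python
# Shortened paraphrases of the seven areas listed above (labels are assumptions).
CRITERIA = [
    "serves the mission of the institution",
    "has scientific or historical value",
    "is the only or most complete source of the data",
    "could be redistributed to other institutions",
    "is irreplaceable",
    "could provide future financial benefit",
    "has adequate documentation",
]


def summarize(answers: dict[str, bool]) -> tuple[list[str], list[str]]:
    """Split the criteria into those a data set meets and those it does not.

    Raises ValueError if any criterion has not been answered, so that an
    appraisal cannot be recorded while still incomplete.
    """
    missing = [c for c in CRITERIA if c not in answers]
    if missing:
        raise ValueError(f"unanswered criteria: {missing}")
    met = [c for c in CRITERIA if answers[c]]
    unmet = [c for c in CRITERIA if not answers[c]]
    return met, unmet
```

Returning the met and unmet criteria, rather than a single score, leaves the final decision to the appraiser while keeping a record of the reasoning.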

In my future career as a data curation librarian, part of my job will be to be an advocate for sharing data sets. In order for this to happen smoothly and systematically, there must be an institutional policy in place that addresses which ones we will accept. This guidance document is helpful in highlighting important considerations. It will be a useful reference in the future.