John Zarrillo 2014-02-11 11:16:03
Uniting the Makers of Modern Genetics through Mass Digitization

John Zarrillo, Cold Spring Harbor Laboratory Library and Archives

Sixty years ago, in a cold laboratory in England, two undistinguished scientists assembled a model that would secure their places in history. James Watson was a twenty-five-year-old former birdwatcher who obtained his PhD from Indiana University before eventually making his way to the Cavendish Laboratory at the University of Cambridge. Francis Crick, a thirty-five-year-old physicist, was transitioning to the world of molecular biology while pursuing his doctorate. The two would piece together, with help from a few other key scientists, the double helical structure of DNA. Strikingly simple, the structure had monumental implications regarding heredity and the transfer of genes.

Watson’s papers are held at the Cold Spring Harbor Laboratory (CSHL) Library and Archives in New York, where he served as director for about thirty-five years. Crick’s professional papers were acquired by the Wellcome Library in London. Following the acquisition, the Wellcome Library in 2010 began an ambitious mass-digitization project that would later become known as Codebreakers: The Makers of Modern Genetics. This project incorporated a number of partner institutions that held collections related to the history of genetics, including King’s College London, Churchill College, University College London, and the University of Glasgow. CSHL Library and Archives, the only American partner, brought not only the Watson collection but also the papers of Sydney Brenner, who worked with Crick on the genetic code. Codebreakers also includes the papers of Rosalind Franklin, whose famous X-ray photograph of the “B-form” of DNA, dubbed “Photograph 51,” was the crucial piece of data that led Watson and Crick to their model.
Digitizing Our Collections

Preparations began in summer 2011, when CSHL Library and Archives Executive Director Mila Pollock and I (an archivist at CSHL) began conferencing with the Wellcome Library. In exchange for bringing the Watson and Brenner material to the project, Wellcome agreed to fund the digitization of both collections. The digital images would then be made available on both Wellcome Library’s Codebreakers homepage (http://wellcomelibrary.org/usingthe-library/subject-guides/genetics/makersof-modern-genetics/) and CSHL Library and Archives’ digital repository (http://libgallery.cshl.edu/). Wellcome Library supplied us with a list of mandatory fields in ISAD(G), which mapped to Dublin Core elements and were later utilized in our online repository. Wellcome also set technical requirements for the images, which were to be delivered in the JPEG2000 format.

The months leading up to the digitization of our collections were spent planning workflows and preparing the material for shooting. Using the existing collection hierarchies, we assigned each folder a reference code, which would later be used to name the image files. The most time-consuming task, however, was also the most mundane: removing all staples from the documents. On the bright side, this process allowed us to review each folder and flag material that was either confidential (such as personal medical or financial information or social security numbers) or needed to be digitized separately from the paper documents (such as photographic slides and negatives, which were later digitized on an Epson Expression 10000 XL flatbed scanner).

Digitization began in August 2011. A digitization laboratory was assembled in our library, consisting of a Canon EOS 5D Mark II digital camera mounted on a copy stand with dual flash lights (Speedotron 1005CC Deluxe Location Kit [120V]). We set up the lab near our archives storage facility so that transferring the material to the digitizers was a simple task.
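The article does not record the exact file-naming convention, so the following is only a minimal sketch of how a folder reference code could be turned into image file names. The function name `image_filename`, the sample code `JDW/2/2/1`, the underscore separator, and the four-digit page counter are all assumptions made for illustration, not the project's actual scheme.

```python
def image_filename(reference_code: str, page: int, ext: str = "jp2") -> str:
    """Build an image file name from a folder reference code and a page number.

    Hypothetical convention: slashes in the reference code become underscores,
    and the page number is zero-padded to four digits.
    """
    if page < 1:
        raise ValueError("page numbers start at 1")
    # Slashes are not safe in file names, so replace them.
    safe_code = reference_code.strip().replace("/", "_")
    return f"{safe_code}_{page:04d}.{ext}"


# Example with a made-up reference code for one folder.
names = [image_filename("JDW/2/2/1", n) for n in range(1, 4)]
print(names)
```

Deriving the file name from the reference code keeps every image traceable to its folder in the finding aid without consulting a separate lookup table.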
We transferred a new set of boxes to the digitizers each morning. The boxes contained a file inventory, which included the reference code used to name the digital files. The following morning, we’d return the digitized boxes to the archives before transferring a new set of files to be digitized. The transfer and digitization of each box was logged daily in a spreadsheet shared among project members via Google Drive.

With the funding provided by the Wellcome Library, we were able to secure the services of photographer Ardon Bar-Hama, who has worked on a number of large digitization projects for a variety of institutions, including the New York Philharmonic Archives, the New York Public Library, the Vatican, and the Albert Einstein Archives. Once the material had been digitized, it was transferred to Bar-Hama’s servers, and each image was cropped and converted into both TIFF files (our “master” copies) and compressed JPEG2000 files (for online delivery).

Quality Control and Metadata

Bar-Hama sent images in batches of about 50,000 on external hard drives to be checked by CSHL digital project archivist Stephanie Satalino. Satalino checked the images for quality and confidentiality issues. Quality issues included poor focus, problems with the flash, crookedness, missing pages, and the occasional finger that made it into the frame. Images with quality issues were flagged and later reshot.

In terms of confidentiality, we were required by the Wellcome Library to meet the standards outlined in the United Kingdom’s 1998 Data Protection Act, which regulates how living individuals’ personal information is shared. All information related to an individual’s professional performance, including employment references, is restricted for sixty years from the creation of the document.
Personal medical or financial records are suppressed for eighty-four years from the creation of the document if the subject is an adult (sixteen years or older), or ninety-three years if the individual was younger than sixteen. Academic grades are permanently suppressed.

The digital project archivist also was responsible for creating the metadata associated with each folder of images. We used the following standard fields: Country Code, Repository Code, Reference (unique identifier), Title, Date, Level (collection, series, file, etc.), Description, Creator, Access Conditions, Access Status, Reproduction Conditions, Copyright, and Material Type. We also used subject fields utilizing both Library of Congress Authorities and National Library of Medicine Medical Subject Headings (MeSH). Much of the information for these fields already had been assembled in the existing collection finding aids. We used the Description field to highlight significant material that was not made obvious via folder titles.

Our images are delivered to the Wellcome Library on external hard drives. We ship batches of 50,000 to 75,000 images every three months, along with a spreadsheet of all related metadata. The project went live on the Wellcome Library’s Codebreakers homepage and CSHL Library and Archives’ digital repository in March 2013.

Copyright Issues

Copyright is an issue that hangs over almost every digitization project. The Wellcome Library decided to take a risk-managed approach to copyright. Each participating institution provided a list of individuals who are featured prominently in its collections. Wellcome Library then assembled a list of “high-risk” individuals (mostly published authors and politicians) to be contacted directly and cleared for inclusion in the project.
The Wellcome Library then developed a detailed Copyright Clearance and Takedown webpage to allow users to flag material they believe is in violation of copyright (http://wellcomelibrary.org/about-this-site/copyright-clearance-and-takedown/). The risk-managed approach was necessary due to the sheer number of rights holders whose material appears in the digitized collections; it would take years to track down the author of each letter. Although we have suppressed some items at the rights holders’ requests (for personal, not copyright, reasons), we have not encountered any legal problems since the project went live. Copyright should be carefully considered when undertaking a digitization project; however, it should not unnecessarily inhibit the accessibility of archival material.

The Digital Future

Collaborative mass-digitization projects such as Codebreakers: The Makers of Modern Genetics are becoming increasingly common, especially due to new techniques in digitization and the relatively low cost of digital storage. But the single most important factor in the growth of these projects may very well be the ease of modern communication. The ability to communicate instantly across the globe to multiple parties is crucial, especially when our partner institutions are located in the United Kingdom and our digitizer is based in Israel. Some tools we utilized during the project included Google Drive, Basecamp, and Skype, along with plenty of emails and conference calls. When questions about technical standards, metadata, or copyright arose, we were able to address them quickly and effectively.

CSHL Library and Archives had conducted digitization projects in the past, but never on the scale of Codebreakers. During the course of this project, we realized the importance of careful planning, receiving clear instructions from the project leader, and maintaining open communication among all parties.
These lessons, as well as those learned from our digitizers, will inform our future digitization efforts. We have even purchased our own equipment to continue our digitization program, as well as provide digitization services for local libraries, historical societies, and museums. The public demand for online access to original documents is always growing, and we hope that Codebreakers: The Makers of Modern Genetics will provide a model for future collaborative digitization projects.
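For institutions adapting this workflow, the closure periods described under Quality Control and Metadata can be written down as a small rule function. The following is a minimal sketch only: the category labels, the function name `earliest_open_year`, and the use of whole calendar years are assumptions made for illustration, and the sketch is not a statement of how the project or the Data Protection Act itself computes dates.

```python
from typing import Optional


def earliest_open_year(category: str, created_year: int,
                       subject_under_16: bool = False) -> Optional[int]:
    """Return the first year a document could be released, or None if never.

    Periods follow the article: 60 years for professional-performance
    information, 84 years for medical or financial records of adults
    (16 or older), 93 years if the subject was under 16, and permanent
    suppression for academic grades. Category names are hypothetical.
    """
    if category == "academic_grades":
        return None  # permanently suppressed
    if category == "professional":
        return created_year + 60
    if category in ("medical", "financial"):
        return created_year + (93 if subject_under_16 else 84)
    raise ValueError(f"unknown category: {category}")


# Example: an employment reference written in 1953.
print(earliest_open_year("professional", 1953))
```

Keeping the periods in one function means a reviewer can flag a folder by category and creation year and let the rule decide whether the images are suppressed.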
Published by Society of American Archivists.