Kelli Bogan, Rebecca Petersen, Rachel Taketa, and Kristen Yarmey
2015-10-07
Web Archiving with Limited Resources

Few archivists need to be persuaded of the importance of capturing and preserving web content. What does take some convincing is the idea that web archiving is practicable and achievable, even for smaller institutions with limited resources. Below are highlights from our ARCHIVES 2015 panel session, in which we (along with Sylvie Rollason-Cass from Archive-It) addressed some frequently asked questions about small-scale web archiving.

Why did you start web archiving?

Taketa: I’m in an area of the University of California–San Francisco (UCSF) Library known as the Industry Documents Digital Library. We collect documents concerning the tobacco, pharmaceutical, chemical, and food industries—basically industries that have an impact (often negative) on public health. Our library currently contains more than 14 million documents pulled from the files of tobacco companies, which detail historical marketing and policy-making strategies. Our web archiving program started as part of a grant aimed at studying how the tobacco industry is fighting smoking restrictions in California.

Petersen: At Wake Forest University, the University Archives and Library technology team prioritized web archiving as more and more university content was being published only on the web and wouldn’t be captured by traditional paper-based archives.

Yarmey: Our program also emerged out of growing concerns about preserving born-digital university records. So many university publications and reports are now web-based, and thinking about each series individually was incredibly overwhelming. Capturing the university website as a whole seemed like a tangible, achievable step I could take.

Bogan: I began advocating for preserving our digital records and web presence when I arrived at Colby-Sawyer College (CSC) in 2008.
Over time, I raised enough awareness that in 2011, when a decision was made to eliminate a printed student newspaper, the administration approached me about finding a web archiving solution before the newspaper went digital.

What are your collecting priorities?

Taketa: Our web archiving program focuses on topical collections that complement our main collecting areas. Researcher needs and requests play a large part in what we decide to collect. For example, a group of researchers is looking into e-cigarettes and the effects of marketing on youth and young adults, so last year I began crawling e-cigarette brand websites to document how the industry markets online and how that marketing may change over time in response to regulation. I also prioritize time-sensitive web content, such as a website that is likely to disappear after a specific event is over. For instance, in 2012, California had a tobacco tax measure, Prop 29, which did not pass. Many of the sites for the industry’s front groups (the No on Prop 29 camp) are no longer available today, but I have them in our archive.

Bogan: Our main focus is capturing college web content, especially documents that are no longer produced in print (the alumni magazine, press releases, sports information, etc.). We’re also now capturing the college’s SharePoint intranet, using Preservica.

Yarmey: Like Bogan, my main priority at this point is capturing important university-related content. Most of what I collect is from our main university domain (scranton.edu), but I also try to capture external pages or sites that are significant or relevant to the university community; for example, a faculty member hosting a disciplinary conference on campus might create a conference website on wordpress.com. I’ve also dipped a toe into capturing university-related social media accounts.
Petersen: Our priority is capturing materials from our main University domain (wfu.edu) and related sites outside the University domain, but we also run one-time crawls for news articles and other publications that mention Wake Forest.

What tools do you use?

Petersen: We have been using Archive-It since 2008. It’s been a great service to help us understand our priorities and workflows.

Taketa: We began in 2009 by partnering with the California Digital Library (CDL) Web Archiving Service (WAS). In 2014, we began a project with Archive-It and recently migrated all of our WAS collections to that platform. Unless you are a programmer or know one who will work pro bono, definitely use a service such as Archive-It. They not only provide the interface and crawlers, but they have great customer service and support, which frees you up to do curation and QA.

Bogan: We signed on with Archive-It in 2011, when we began our web archiving program. Since 2013, we’ve partnered with Preservica to capture our SharePoint intranet and other digital content.

Yarmey: We also partnered with Archive-It from the beginning (2012), and their services have been crucial—I never could have started my own web archiving initiative from scratch! We also use DuraCloud to store backup copies of the WARC files we capture with Archive-It.

How do you handle description and access?

Bogan: We take a “less process, more product” approach to web archiving. We don’t have the time to create metadata, but all of our crawls are full-text searchable, so users can at least find the content.

Yarmey: I’m also taking a minimalistic approach to description. In a very rough effort to integrate our web archives with our CONTENTdm collections, I’ve been creating a very basic Dublin Core record for each seed in CONTENTdm. I’d love to somehow automate metadata creation to make captured content (especially images, videos, and PDF documents) more discoverable.
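For readers curious what these minimal per-seed Dublin Core records might look like in practice, here is a rough sketch using only Python’s standard library. This is not an actual CONTENTdm or Archive-It export; the field names and sample values are invented for illustration, loosely modeled on the handful of fields the panelists describe.

```python
# Hypothetical sketch: building a minimal Dublin Core record for one
# captured seed. Values below are illustrative, not real collection data.
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"


def make_dc_record(fields):
    """Build a simple <record> element with one dc:* child per field."""
    ET.register_namespace("dc", DC_NS)
    record = ET.Element("record")
    for name, value in fields.items():
        child = ET.SubElement(record, f"{{{DC_NS}}}{name}")
        child.text = value
    return record


# Example seed-level metadata (hypothetical values).
seed = {
    "title": "Example campaign website (captured 2012)",
    "creator": "Example advocacy coalition",
    "description": "Web capture of a site expected to disappear after the election.",
    "subject": "tobacco; ballot measures; California",
}

xml_string = ET.tostring(make_dc_record(seed), encoding="unicode")
print(xml_string)
```

A record this small can be written by hand in a spreadsheet just as easily; the point of scripting it would be batch-generating stub records for every seed in a crawl, which is one modest way to approach the metadata automation mentioned above.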
Taketa: When I add the captures as records into our library’s index, I rely solely on metadata for search and retrieval. In the interest of time and budget, I had to settle on a total of five fields for each seed to create a basic record: Title, Creator, Description, Subject (keywords), and Collector.

Petersen: Our Archive-It metadata includes Title, Description, and Group. We haven’t created a finding aid for the collection, but we link to our Archive-It site from our Digital Collections page.

Who works on your web archiving program, and how often?

Yarmey: I get occasional technical support from a fantastic coworker, but for the most part it’s just me, and web archiving is only a small part of my job. Unfortunately, it’s often less than three to four hours a month.

Bogan: It’s about the same for me. I get occasional input and mediation from our communications staff members, because they create the majority of our web content and work with the vendor for our athletics site.

Taketa: It’s just me, with some site nominations coming from researchers. I put in maybe five hours a month (on a good month).

Petersen: A few people in our library commit very limited time to web archiving; three to four people meet for about two hours a month to discuss seeds, review reports, and troubleshoot content.

How do you allocate your time?

Taketa: I spend most of my time and effort up front, setting up the initial crawl for a site and scoping it as well as I can. When an event starts bubbling up online, such as a new tobacco proposition, I will spend some time searching for seeds, adding them to Archive-It with my minimal metadata, setting off test crawls, reviewing the crawls to make sure I captured all of the functionality I need, and then setting a frequency. The rest of the time, the collections crawl as scheduled without much QA. I try to set aside half a day each month to QA crawls or to scope and run new seeds, but sometimes that gets pushed aside.
Yarmey: My process is similar to Rachel’s. I spent a lot of time on selection, setup, and scoping when I first started. But now I’m essentially in maintenance mode—I leave my scheduled crawls to run unattended. Sometimes months will go by without my doing any quality control or administration.

Bogan: Ditto. “Set it and forget it” is the way I have to work in this case!

What would you do if you had more resources?

Taketa: I would love to be able to capture more social media, especially for my e-cigarette marketing collection. While large brands such as Blu and Njoy can afford splashy websites, smaller homegrown vaping companies rely on social media for free marketing. Because I don’t have the time or budget to really capture this section of the market, I’m not capturing the entire picture.

Petersen: I’d love to have a better sense of control over what we are already capturing.

Bogan: If we had more time, we’d try to capture the social media we create. I’d also love to capture websites relating to our manuscript collections, particularly the papers of our founding family and our vaudeville collection.

Yarmey: I’d love to do more thematic collecting—capturing websites that relate to our special collections, working with faculty on research interests, or even starting a local history collection. I worry about long-term access to web content from our city government, regional newspaper, arts and culture organizations, nonprofits, small businesses, and local events.

What are you most proud of?

Taketa: I’m proud to have a web archiving program at all, small and messy as it is. Our collections focus on industries that harm public health, and these industries have a long history of using money and influence to get what they want, from favorable legislation to decreased regulation. So much of their countermeasure activity and product marketing now takes place online. I sometimes have to act quickly to capture it, because once it’s over, they tend to erase their trail.
Petersen: Even when we are frustrated or confused, we’re thrilled to be capturing anything at all. I use our past crawls in my reference work, and it’s exciting when I find something that exists only there.

Bogan: I agree; just having a program is something I am proud of. Even at a very small institution, we are doing our best to document our institution’s history as technology continues to evolve.

This information was originally shared during Session 103, “Big Web, Small Staff: Web Archiving with Limited Resources,” at ARCHIVES 2015.
Published by the Society of American Archivists.