What is the background on the data access and collection?
The Internet Archive (IA) is the primary data repository used for this research. Since 1996, IA has functioned as a library for Web pages, conducting ongoing crawls of the World Wide Web to archive content. The range of formats and the depth of each crawl into a given site have increased each year, with later years containing a robust network of multimedia information. Currently, harvests are performed in three- to four-month windows; approximately 8 billion unique pages are captured in a given harvest.
Historical data from before 2009 in the Internet Archive are stored primarily in the ARC (ARChive file) format, which was defined by IA in the late 1990s to store the actual content of Web pages, including hypertext markup language (HTML), image files, and graphics. Beginning in 2009, however, most new data entering the collections have been written in the WARC format. WARC (Web ARChive file format) is an ISO standard file format that expands on ARC and adds enhanced metadata records describing the Web content. With the advent of WARC, the metadata are stored in the WARC files alongside the Web content for easy retrieval by access and analysis tools. The use of WARC files streamlines processing of the Internet Archive and will aid in accelerating research and access.
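To illustrate how a WARC record keeps metadata alongside the archived content, the following is a minimal sketch in Python, not the tooling used by IA or by this project. It assumes a single uncompressed record held in memory; the sample record and the function name parse_warc_record are hypothetical. Real WARC files are usually gzip-compressed and contain many records, so a dedicated WARC library would normally be used instead.

```python
# Hypothetical, hand-built WARC record: a version line, named header fields,
# a blank line, and then the content block (the archived bytes).
SAMPLE_RECORD = (
    b"WARC/1.0\r\n"
    b"WARC-Type: response\r\n"
    b"WARC-Target-URI: http://example.org/\r\n"
    b"WARC-Date: 2010-01-15T12:00:00Z\r\n"
    b"Content-Length: 13\r\n"
    b"\r\n"
    b"Hello, world!"
    b"\r\n\r\n"
)


def parse_warc_record(raw: bytes):
    """Split one uncompressed WARC record into version, headers, and content."""
    # The header block ends at the first blank line.
    header_block, _, rest = raw.partition(b"\r\n\r\n")
    lines = header_block.decode("utf-8").split("\r\n")
    version = lines[0]

    headers = {}
    for line in lines[1:]:
        # Split on the first colon only, so URIs and timestamps stay intact.
        name, _, value = line.partition(":")
        headers[name.strip()] = value.strip()

    # Content-Length gives the size of the content block in bytes.
    length = int(headers["Content-Length"])
    content = rest[:length]
    return version, headers, content


version, headers, content = parse_warc_record(SAMPLE_RECORD)
print(version, headers["WARC-Type"], headers["WARC-Target-URI"])
print(content)
```

The header fields carry the descriptive metadata (record type, target URI, capture date), while the content block holds the archived bytes, which is the pairing described above.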
In addition to the WARC files for each domain, IA is in the process of transitioning to a dual file structure; the entire archive is being reprocessed to create a WAT (Web Archive Transformation) metadata file and a WARC archive file for each domain. The WAT file formats the metadata using JavaScript Object Notation (JSON), which allows the data to be organized in an easy-to-parse hierarchy. The WAT file is intended to facilitate efficient large-scale data extraction and is easily integrated with the Hadoop software framework. The metadata for each domain contain a number of key descriptive elements, including: general information about the domain, including the dates when the domain was archived by the Internet Archive; outlinks from the domain that were present each time the domain was archived; the size of the domain (measuring the amount of content present); and a uniform resource identifier allowing for rapid identification of the actual WARC file containing the archived content.
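As a rough illustration of why JSON-formatted metadata are easy to parse, the sketch below reads one WAT-style record and pulls out the capture date and the outlinks. It is a sketch under stated assumptions rather than the project's extractor: the sample record is invented, the key names (Envelope, WARC-Header-Metadata, Payload-Metadata, HTML-Metadata, Links) are modelled on the commonly described WAT layout and should be checked against real files, and extract_outlinks is a hypothetical helper. The sample also describes a single capture, not a whole domain.

```python
import json

# Invented WAT-style record; the key names are assumptions to verify
# against actual WAT files before relying on them.
SAMPLE_WAT_JSON = """
{
  "Envelope": {
    "WARC-Header-Metadata": {
      "WARC-Target-URI": "http://example.org/",
      "WARC-Date": "2010-01-15T12:00:00Z"
    },
    "Payload-Metadata": {
      "HTTP-Response-Metadata": {
        "HTML-Metadata": {
          "Links": [
            {"path": "A@/href", "url": "http://example.com/about"},
            {"path": "IMG@/src", "url": "/logo.png"},
            {"path": "A@/href", "url": "http://example.net/"}
          ]
        }
      }
    }
  }
}
"""


def extract_outlinks(wat_json: str) -> dict:
    """Return the target URI, capture date, and anchor outlinks of one record."""
    envelope = json.loads(wat_json).get("Envelope", {})
    header = envelope.get("WARC-Header-Metadata", {})
    links = (
        envelope.get("Payload-Metadata", {})
        .get("HTTP-Response-Metadata", {})
        .get("HTML-Metadata", {})
        .get("Links", [])
    )
    # Keep only anchor links (path starting with "A@"), skipping embedded
    # resources such as images.
    outlinks = [
        link["url"]
        for link in links
        if link.get("path", "").startswith("A@") and "url" in link
    ]
    return {
        "uri": header.get("WARC-Target-URI"),
        "date": header.get("WARC-Date"),
        "outlinks": outlinks,
    }


result = extract_outlinks(SAMPLE_WAT_JSON)
print(result["uri"], result["date"])
for url in result["outlinks"]:
    print(url)
```

Because each record is self-describing JSON, a function like this can be applied record by record, which is what makes the format convenient for distributed frameworks such as Hadoop.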
The full IA is hosted on servers in San Francisco, CA. For the purposes of this research, however, a set of IA metadata is being housed on Rutgers University’s Hadoop cluster in New Brunswick, NJ, to enable testing of the prototype Web extractor tool.
Building on the successful prototyping of HistoryTracker, this project produced a number of sample datasets to demonstrate the validity of the proposed research approach. These databases are hosted on a SQL server at Rutgers University and are publicly available through the ArchiveHub interface.
What is HistoryTracker?
The planned prototype tool will enable analysis of unprecedented amounts of Web-related data in the social sciences. Of particular importance, HistoryTracker will give researchers the opportunity to understand how the Internet is changing over time. Mastering research with large-scale data, both in terms of building usable sets of Web pages and visualizing the networks of topics contained on the Web historically, will be one of the major challenges for scholars in the years to come as the Web continues to grow as a key source of social information. Most efforts to understand the structure and impact of the Web have so far been limited to cross-sectional snapshots. However, with the participation of IA, this research project will design and verify a tool to extract data based on defined inputs, and thus create a virtual observatory of the changing constellations of social information.
What is the Internet Archive?
In the realm of social science, and across disciplines, archival Internet data represent a vast repository of untapped research potential. For public audiences, the Internet Archive repository has proved immensely popular; the public Wayback Machine interface to the Internet Archive serves 300,000 visitors a day, and more than 200 requests a second. Currently, the Internet Archive contains more than seven petabytes of data and offers a reliable historical record of Web sites dating from 1995 to the present. In terms of data availability, the Internet Archive is by far the largest digital source for historical research pertaining to the Web and its contents over time.
The Internet Archive is a 501(c)(3) non-profit that was founded to build an Internet library. Its purposes include offering permanent access for researchers, historians, scholars, people with disabilities, and the general public to historical collections that exist in digital format. Founded in 1996 and located in San Francisco, the Archive has been receiving data donations from Alexa Internet and others. In late 1999, the organization started to grow to include more well-rounded collections. The Internet Archive's collections now include texts, audio, moving images, and software as well as archived Web pages, and the Archive provides specialized services for adaptive reading and information access for the blind and other persons with disabilities.