This NSF-supported project is built around a community of social science researchers tackling next-generation questions of Internet research using the Internet Archive (IA), the largest single record of the history of the World Wide Web, dating from 1995 to the present. The team is working together to: (1) develop a prototype tool, ArchiveHub, into a powerful Internet research tool that supports their research projects, and (2) create sample databases that demonstrate how to conduct research using archival Web data, particularly the longitudinal studies this historical record makes possible.
In social science and across disciplines, archival Internet data represent a vast repository of untapped research potential. With public audiences, the Internet Archive has proved immensely popular; the public Wayback Machine interface (www.archive.org) serves 300,000 visitors a day and handles more than 200 requests a second. The Internet Archive currently contains more than seven petabytes of data and offers a reliable historical record of Web sites dating from 1995 to the present. In terms of data availability, it is by far the largest digital source for historical research on the Web and its contents over time.
Although IA contains billions of Web pages and has tremendous potential to facilitate research, that research languishes because IA's search, crawl, and extraction functions are severely limited. For example, to browse IA resources, one generally needs to know the target uniform resource locator (URL) in advance. A researcher cannot run a full-text search across the entire back history of the Web, and there is no facility for downloading sets of related Web pages other than doing so manually. These limitations are the main barriers to accessing large-scale data from the Internet Archive. A primary objective of this project is therefore to tear down those barriers while bringing together a new community of researchers.
The mission of this project is to bring together a community of social science researchers focused on tackling next-generation questions of Internet research using the Internet Archive. ArchiveHub serves as our venue for gathering collective input, sharing updates, and ultimately publishing the databases compiled over the course of this research.