2018 Publication Set 1
Apache Spark + Scala
Code:
https://github.com/KristinaPlazonic/nsfia
Documentation:
http://archivehub.rutgers.edu/nsfia-project-summary-1
http://data.scilsnet.rutgers.edu/weber/nsfia2018/00-README.txt
-
EARLYWEB - 1996 to 2000
This set contains all of the Internet Archive crawls collected from the Archive's start in 1996 through the year 2000.
Most files range in size from 3GB to 7GB
EARLYWEB_1996_2000_part01.tar
EARLYWEB_1996_2000_part02.tar
EARLYWEB_1996_2000_part03.tar
EARLYWEB_1996_2000_part04.tar
EARLYWEB_1996_2000_part05.tar
EARLYWEB_1996_2000_part06.tar
EARLYWEB_1996_2000_part07.tar
EARLYWEB_1996_2000_part08.tar
EARLYWEB_1996_2000_part09.tar
EARLYWEB_1996_2000_part10.tar
EARLYWEB_1996_2000_part11.tar
EARLYWEB_1996_2000_part12.tar
EARLYWEB_1996_2000_part13.tar
EARLYWEB_1996_2000_part14.tar
EARLYWEB_1996_2000_part15.tar
EARLYWEB_1996_2000_part16.tar
-
HOUSE.GOV - 2009
Files range in size from 4GB to 5GB
-
SENATE.GOV - 2009
Files range in size from 5GB to 7GB
2018 Publication Set 2
Apache Hadoop + Pig
Code:
http://data.scilsnet.rutgers.edu/weber/nsfia2018/
Documentation:
http://archivehub.rutgers.edu/nsfia-project-summary-1/
http://data.scilsnet.rutgers.edu/weber/nsfia2018/00-README.txt
-
MEDIA - 2008 to 2012
Most files range in size from 4GB to 6GB
MEDIA_2008_2012_part01.tar.gz
MEDIA_2008_2012_part02.tar.gz
MEDIA_2008_2012_part03.tar.gz
MEDIA_2008_2012_part04.tar.gz
MEDIA_2008_2012_part05.tar.gz
MEDIA_2008_2012_part06.tar.gz
MEDIA_2008_2012_part07.tar.gz
MEDIA_2008_2012_part08.tar.gz
MEDIA_2008_2012_part09.tar.gz
MEDIA_2008_2012_part10.tar.gz
MEDIA_2008_2012_part11.tar.gz
MEDIA_2008_2012_part12.tar.gz
MEDIA_2008_2012_part13.tar.gz
MEDIA_2008_2012_part14.tar.gz
MEDIA_2008_2012_part15.tar.gz
MEDIA_2008_2012_part16.tar.gz
-
Occupy Wall Street - 2008 to 2012
Most files range in size from 4GB to 6GB
-
Hurricane Sandy - 2003 to 2012
File size is 762MB