2018 Publication Set 1

Apache Spark + Scala

Code:
https://github.com/KristinaPlazonic/nsfia
Documentation:
http://archivehub.rutgers.edu/nsfia-project-summary-1

http://data.scilsnet.rutgers.edu/weber/nsfia2018/00-README.txt

  1. EARLYWEB - 1996 to 2000

    This set contains all Internet Archive crawls collected from the Archive's start in 1996 through the year 2000.
    Most files range in size from 3GB to 7GB

  2. HOUSE.GOV - 2009

    Files range in size from 4GB to 5GB

  3. SENATE.GOV - 2009

    Files range in size from 5GB to 7GB

2018 Publication Set 2

Apache Hadoop + Pig

Code:
http://data.scilsnet.rutgers.edu/weber/nsfia2018/
Documentation:
http://archivehub.rutgers.edu/nsfia-project-summary-1/

http://data.scilsnet.rutgers.edu/weber/nsfia2018/00-README.txt

  1. MEDIA - 2008 to 2012

    Most files range in size from 4GB to 6GB

  2. Occupy Wall Street - 2008 to 2012

    Most files range in size from 4GB to 6GB

  3. Hurricane Sandy - 2003 to 2012

    File size is 762MB