Revision as of 00:39, 21 March 2014
Archiving the web allows us to combat the impermanent nature of online content, making future access and use possible. The Folger has been collecting and archiving select websites using the Archive-It subscription service since 2011. The Folger Shakespeare Library web collections can be accessed here.
Web Archiving
Web archiving is the process of harvesting web content, organizing that content into a collection, and preserving the collection for access and use. Web content is harvested by “web crawlers,” which systematically access and gather content from designated URLs in a process referred to as crawling. A web crawler is an internet “bot,” or program, that browses the web for indexing purposes. Crawlers access a website in generally the same way that a web browser does and capture all content related to the site, including anything needed to render it correctly as if it were live on the web, such as CSS files, images, and scripts. The results of these crawls are captures of web content that can then be archived and organized into digital collections.

Many digital resources are involved in capturing and harvesting even a single seed. A seed is an individual URL within a web archive collection. Following a web crawl, the information pertaining to a seed is organized into a WARC preservation file. The WARC file format can contain all of the information and digital resources gathered from a seed during a crawl, and it can be extended with ancillary metadata elements. Websites archived in the WARC format can be replayed, viewed, and interacted with in a web browser using access tools such as the Internet Archive’s Wayback Machine. These access tools pull information from the WARC files and display the captured content so that users can interact with the archived website as they would on the live web. Web archiving combats the impermanence of the web and the ephemeral nature of these important pieces of our digital cultural heritage. Advanced manipulation of web archive data can also facilitate research: potential uses for web archive collections include data, textual, and link analysis, among others.
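The crawl-and-capture workflow described above can be sketched in a few lines of Python. This is an illustrative simplification, not the code used by Archive-It or the Internet Archive: the seed URL and sample page are invented, and the record layout is a reduced subset of the WARC/1.0 header block (omitting required fields such as WARC-Record-ID and WARC-Date). A real crawler would fetch the page over HTTP rather than from a string.

```python
from html.parser import HTMLParser

# Stand-in for a page fetched from a seed URL; a real crawler would
# download this over HTTP (e.g. with urllib.request).
SEED_URL = "http://example.org/"
PAGE = (b'<html><head><link rel="stylesheet" href="/style.css"></head>'
        b'<body><a href="/about.html">About</a><img src="/logo.png"></body></html>')

class LinkExtractor(HTMLParser):
    """Collects the URLs a crawler would visit next: hyperlinks plus
    embedded resources (CSS, images) needed to replay the page."""
    def __init__(self):
        super().__init__()
        self.found = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.found.append(value)

def make_warc_record(uri, payload):
    """Build one simplified WARC 'response' record: a header block,
    a blank line, the payload, then a trailing blank-line separator."""
    headers = ("WARC/1.0\r\n"
               "WARC-Type: response\r\n"
               f"WARC-Target-URI: {uri}\r\n"
               f"Content-Length: {len(payload)}\r\n\r\n").encode()
    return headers + payload + b"\r\n\r\n"

extractor = LinkExtractor()
extractor.feed(PAGE.decode())
record = make_warc_record(SEED_URL, PAGE)
print(extractor.found)           # URLs queued for the next crawl step
print(record.split(b"\r\n")[2])  # b'WARC-Target-URI: http://example.org/'
```

In practice, tools such as the `warcio` Python library handle the full WARC specification, and replay tools like the Wayback Machine index these records by target URI and capture date so an archived page can be rendered on request.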
Ultimately, web archiving is intended to preserve a realm of cultural history that is increasingly, and sometimes only, present online in digital form. Digital information is fragile: a site depends on a number of external factors to remain accessible to users, including content creators, host domains, web browsers, and markup languages; consequently, internet content can disappear frequently, for many reasons, and often without notice. For example, the popular web resource Mr. Shakespeare and the Internet was taken down in October 2013. Had it not been saved by the large-scale web collecting efforts of the Internet Archive, the information it contained would have been lost to users forever. Luckily, the website was archived in time and is accessible via the Internet Archive's Wayback Machine.
Folger Web Collections
So how is the Folger involved in web archiving? The Folger has been archiving select websites since the fall of 2011 using Archive-It, a subscription service that allows partner institutions to collect, manage, preserve, and provide access to their own curated web collections. The current Folger web collections can be accessed here.
Folger Shakespeare Library Websites and Social Media
The Folger Shakespeare Library Websites and Social Media collection is an institutional collection. It archives and preserves the Folger's web presence over time. The collection includes all Folger domains, blogs, and social media profiles. There are currently 35 seeds in this collection, and they are crawled for new content on a quarterly basis. The collection can be accessed here.
Shakespeare Festivals and Theatrical Companies
The Shakespeare Festivals and Theatrical Companies collection is a thematic collection. Its purpose is to archive the official websites of theatrical companies and drama festivals that focus on Shakespeare performance. The scope of this collection is primarily limited to the United States; however, a growing number of international resources are included as well. There are currently over 280 seeds in this collection, and they are crawled for new content on a semi-annual basis. The collection can be accessed here.
Upcoming
The Folger is developing new collections, including one titled Shakespeare in the Media, which will document Shakespeare-related articles and posts on news sites that are not primarily dedicated to Shakespeare.
Contact
Please feel free to contact Jaime McCurry at jmccurry@folger.edu if you have any questions or comments regarding the Folger Shakespeare Library web collections, or if you would like to report a problem you have encountered while interacting with these collections. If you would like to nominate a website for inclusion, you may complete this form. While all nominations are carefully reviewed, please note that we cannot guarantee the inclusion of a nominated website in the Folger web collections.