Message #390:
From: AzTeC SW Archaeology SIG
To:   "'Matthias Giessler'" 
Subject: Historical Internet Archives
Date: Sun, 01 Dec 1996 20:14:59 -0700 
Encoding: MIME-Version: 1.0


[ The information below is of great significance to archivists,
anthropologists and historians.  Items A - D were clipped from the
Internet Archive and related web sites and are reproduced here for
those SASIG readers with 'e-mail only' access to the Internet -- SASIG
Ed.]

From: 	Matthias Giessler 
Subject: Internet Archaeology
Remember that in SF I told you about an effort to record the current
state of the Internet for future internet archaeology? Well, the site
for this undertaking is at http://www.archive.org/.  Some related
sites and discussions are at:

http://www.archive.org:80/hypermail/archivists-archive/
http://community.bellcore.com/lesk/auspres/aus.html
http://www.si.umich.edu/e-recs/Research/


A.
The Internet Archive's purpose is "Building a
Digital Library for the Future."  They are gathering, storing, and
providing access to public materials on the Internet such as the World
Wide Web, Netnews, and downloadable software. The Archive will provide
historians, researchers, scholars, and others access to this vast
collection of data (reaching ten terabytes), and ensure the longevity of
the information.


B.
http://www.archive.org/webarchive96.html provides these statistics:
There are 30 million web pages on 225,000 sites (Alta Vista).
The mean lifetime of a web object is only 44 days (Chankhunthod et al.,
USC / UCBoulder). The size of all of the HTML on the public web is about
200 GB (Alta Vista). No one knows the total size including all images,
audio, and video.  To compare the WWW content to other data
collections:
Collection                   Size      Contents
Local Radio Station          1 TB      15,000 hrs of music
Public Branch Library        3 TB      300,000 books
Typical Video Rental Store   8 TB      5,000 videos
Library of Congress          20 TB     20 million books (ASCII)
The Web Archive              1-10 TB
Key:  1 MB = 1 Megabyte = 1,000,000 bytes
      1 GB = 1 Gigabyte = 1,000 Megabytes or 1,000,000,000 bytes
      1 TB = 1 Terabyte = 1,000 Gigabytes or 1,000,000,000,000 bytes
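
As a rough cross-check of the figures above, the short Python sketch
below derives a few per-item averages from them (bytes per web page,
pages per site, bytes per book, and so on). It uses the decimal units
from the key. The variable and function names are illustrative only and
do not come from the Internet Archive; the averages are simple
divisions of the quoted numbers, not measurements.

```python
# Decimal units, as defined in the key above.
MB = 1_000_000
GB = 1_000 * MB
TB = 1_000 * GB

# Web figures quoted above (attributed to Alta Vista).
web_pages = 30_000_000
web_sites = 225_000
html_bytes = 200 * GB

# Simple averages implied by those figures.
avg_page_bytes = html_bytes / web_pages
pages_per_site = web_pages / web_sites

# Comparison rows: name -> (size in bytes, item count, item label).
# "The Web Archive" row is omitted because it lists no item count.
collections = {
    "Local Radio Station": (1 * TB, 15_000, "hour of music"),
    "Public Branch Library": (3 * TB, 300_000, "book"),
    "Typical Video Rental Store": (8 * TB, 5_000, "video"),
    "Library of Congress": (20 * TB, 20_000_000, "book (ASCII)"),
}


def bytes_per_item(name):
    """Return the average size in bytes of one item in a collection."""
    size, count, _label = collections[name]
    return size / count


if __name__ == "__main__":
    print(f"Average HTML page: {avg_page_bytes / 1000:.1f} KB")
    print(f"Average pages per site: {pages_per_site:.0f}")
    for name, (_size, _count, label) in collections.items():
        print(f"{name}: {bytes_per_item(name) / MB:.1f} MB per {label}")
```

On these numbers the average HTML page is under 7 KB, and the Library
of Congress estimate works out to 1 MB of ASCII text per book.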

Progress Being Made at Web Archive

Public information on the Internet is constantly growing and changing,
and it's hard to know exactly how much information is there at any
given time.  We have collected 500 gigabytes so far. We'll keep you
updated on our progress.

How is the Internet Archive doing it?  We are in the process of
collecting, organizing, and storing the data with crawling technology
and robots.

What are the challenges as we move forward in providing digital content
for an Internet University, managing and administering terabyte
technology?


C.
http://www.archive.org:80/hypermail/archivists-archive/0001.html
The mission of the Internet Archive is gathering, storing, and providing
access to public materials on the Internet such as the World Wide Web,
Netnews, and downloadable software. The collection, reaching ten
terabytes, will provide historians, researchers, scholars, and others
access to this vast collection of data, and ensure the longevity of the
information.

Why bother?
Being able to look back on this unprecedented collection of human
thought with the power of current computers is too great an opportunity
to pass up. One of the advantages of having an open public system like
the Internet is we can add value to the collected expression which is
the current web. And, besides, it's fun!

What will happen to the bits?
All bits donated to the Archive will be saved forever by moving the bits
from medium to medium every 10 years or so. These bits are "held" by a
non-profit trust so that exclusive access to these bits cannot be
bought. We think this is important. We hope to grow into a real digital
library.

Is there a commercial angle?
Yes, a copy of the bits goes into the non-profit trust, but the
technology we are developing is in a for-profit corporation.  If we can
figure out a way to make money or spin off technology, then that helps
fuel the data collection process.

What about the privacy and copyright issues?
Most of the intellectual property issues really come up when you offer
access to the data again. We don't know what level of access is "right"
for this kind of data. We are looking for others to help in figuring
this out from a legal and social perspective by soliciting comments and
issues. The archivists@archive.org list is meant for this kind of
discussion.

Who is funding it?
We are getting data donations from many places, equipment donations from
vendors, and the salaries are being funded by Brewster Kahle, based on
the sale of the Internet company WAIS Inc. We are always looking for
financial or other help.

How can others help?
Data donations: If you have historical archives that you think would be
interesting to future historians, we will keep the data alive. If it is
donated with restrictions we will honor them (if we accept the
donation).
Equipment: We need the best mass storage devices, databases, and tape
systems to do this gargantuan task. Please help. What we offer back is
our experiences, publicity, and possible spinoff products.
Technical help: We need people that are interested in helping with
crawlers and the like. For instance, we have not started crawling gopher
or DNS. We look for donated time, and we are also hiring.
Social and legal help: What are the existing legal issues around this
type of archive? What is the right thing?

What is the status?
We have 227GB of data, mostly HTML and FTP sites. We are working with
existing crawler groups and starting to explore weaving pieces together
to build our own crawlers.  There are 8 of us working on this project in
the Presidio, a park in San Francisco. Please visit us at
http://www.archive.org
To subscribe, write to archivists-request@archive.org
-brewster and the Archive team

Brewster Kahle
President, Internet Archive
Presidio, Bldg 1014, Box 29141
San Francisco, CA 94129
brewster@archive.org
415-561-6793, -6795 fax


D.
http://community.bellcore.com/lesk/auspres/aus.html
Preserving Digital Objects: Recurrent Needs and Challenges 
Michael Lesk, Bellcore -- Abstract
We do not know today what Mozart sounded like on the keyboard, nor how
David Garrick performed as an actor, nor what Daniel Webster's oratory
sounded like. What will future generations know of our history? We
thought that when printing was discovered, and libraries were created,
we would no longer have disasters such as the loss of all but 7 plays
from the 80 or more that Aeschylus wrote. Then acid-process wood-pulp
paper, used in most books since about 1850, again threatened cultural
memory loss. But digital technology seemed to come to the rescue,
allowing indefinite storage without loss. Now we find that digital
information, too, has its dark side, and although it can be kept without
loss it cannot be kept without cost. Keeping digital objects means
copying, standards, and legal challenges. This is a process, not a
single step. Libraries have to think of digital collection maintenance
as an ongoing task. It is one that gets steadily easier per bit; last
generation's difficult copying problem is now easy. However, the rise of
more complex formats and much bulkier information means that the total
amount of work continues to increase. Our hope is that cooperation
between libraries can reduce the work that each one has to do.

The archive is recording a running copy of the World Wide Web to be
offered as a service to the Internet community. Internet Archive,
Presidio, P.O. Box 29141, San Francisco, CA 94129; info@archive.org;
Ph: 415.561.6900; Fax: 415.561.6795; http://www.archive.org/