[EAS]Web Size & Legacy

Sat Jul 29 19:51:11 EDT 2000

Subject:   Web Size & Legacy

Dear Colleagues -

The two items below from EDUPAGE remind us of a too seldom
considered problem: the archiving of digital information, now that
so much intellectual value is generated in purely electronic form.

The Library of Congress is already so overloaded with the volume of
print items that quite some time ago they had to stop shelving new
arrivals by subject matter. They now just give each item an
accession number that links to a computer catalog entry. The books
etc. are just 'shovelled' onto shelves. So no more proximity
browsing there the way you can still do in our Yale libraries.
Enjoy that while you can.

And the fact that Web resources are much vaster than what search
engines index should not be news to readers of these mailings,
having recently been commented on in

http://www.yale.edu/engineering/eng-info/msg00724.html

   --PJK

------------------------------------------------------------------
(from Edupage, 28 July 2000)

SAVING THE NATION'S DIGITAL LEGACY
The National Academy of Sciences released a report yesterday
severely criticizing how the Library of Congress has conducted the
archiving of electronic material.  The report says the library has
neither the digital storage capacity nor the technical expertise
necessary to preserve the immense amount of copyrighted material
originating on the Internet and other electronic sources.  Furthermore,
the library is so dependent on bureaucratic measures that it cannot
react quickly enough to preserve Web sites and other such media
that may not exist for long.  For example, there is no record of
Web sites that went offline before 1996.  The library has
acknowledged its shortcomings, but its head librarian cautions that
the library is likely to be short on funding for its electronic
archives. Still, the head librarian believes that the library will
be able to maintain these archives through partnerships with other
libraries and institutions as well as advances in the library's own
archiving systems.  The Library of Congress, like all major
research libraries, must also determine what part of the wealth of
electronic content is worthy of saving. (New York Times, 27 July
2000)

STUDY FINDS WEB BIGGER THAN WE THINK
The Web is expanding so rapidly that today's search engines cover
only a fraction of the existing pages, but some companies are
developing new search software that will tap the volumes of
information that are now part of the so-called "invisible Web."
BrightPlanet, a company that offers sophisticated search software
called LexiBot, on Wednesday released a study estimating that the
Web is 500 times larger than the segment covered by standard
search engines such as Yahoo! and AltaVista.  The Web now holds
about 550 billion documents, BrightPlanet says, while search engines
index a combined total of only 1 billion pages.  One reason
that search engines have not kept up with the number of pages on
the Web is that data is increasingly stored in large databases
maintained by government agencies, universities, and companies.
The dynamic information housed in databases is difficult for
traditional search engines to access, because the search software
is designed to locate static pages.  However, BrightPlanet
created its LexiBot to find information in databases, as well as
data that is covered by traditional search engines.  LexiBot
targets advanced users in the academic and scientific
communities. (CNet, 26 July 2000)