DIGITAL LIBRARIES
ERCIM News No.27 - October 1996



Caching and Mirroring Techniques in WWW and Digital Library Architectures


by László Kovács

The World Wide Web is a networked hypermedia architecture connecting millions of documents via hypertext links. The documents are stored on servers, and client software running on practically any networked computer is used to retrieve them over the Internet. In Internet jargon, mirroring means creating a remote copy of some data or of complete hypermedia documents. The technique is used for information that is very popular or that is served over low-speed connections, and it can help reduce traffic on the Internet backbone. Various mirroring techniques already work well for other Internet services, such as FTP or USENET News, and they are of enormous significance for the World Wide Web, which generates the most traffic of all Internet services.

Although a few public domain scripts for WWW mirroring exist, the topic is still immature relative to the evolving needs of the Internet community. At SZTAKI, different mirroring and caching techniques are now being developed in the context of WWW services as well as Digital Library architectures.

A mature algorithm for mirroring and a standardized portable hypermedia (PHM) format can ease the distribution of hypermedia documents through the World Wide Web. Recently, a two-phase mirroring algorithm has been developed. The algorithm can create a remote copy of a complex HTML document stored on another WWW server. It delivers the mirrored document in a PHM format defined in a paper presented at the 7th Joint European Networking Conference. Hypermedia documents in PHM format can be transferred without any need for further semantic transformation. A software environment for mirroring hypermedia documents has been built on this algorithm. It provides various high-level, intelligent automatic mirroring services via the usual WWW interfaces (a set of forms). Used properly, this environment can decrease the network load during peak periods and increase the accessibility of the selected hypermedia documents.
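The article does not spell out what the two phases of the algorithm are, nor the details of the PHM format. The following Python sketch is therefore only an illustration of how a two-phase mirror could work, under the assumption that phase one collects the pages of a document and phase two relocates them and rewrites their links. The function names (collect, relocate), the fetch callback and the host names used with them are hypothetical and are not taken from the SZTAKI software.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag


class LinkCollector(HTMLParser):
    """Collects the values of href and src attributes in one HTML page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.links.append(value)


def collect(root_url, fetch):
    """Assumed phase 1: follow links below the root's directory.

    fetch(url) is a caller-supplied function returning the page text,
    or None if the page cannot be retrieved. Returns {url: html}.
    Every retrieved file is treated as HTML, which is a simplification.
    """
    prefix = root_url.rsplit("/", 1)[0] + "/"
    pages, seen, queue = {}, {root_url}, [root_url]
    while queue:
        url = queue.pop()
        html = fetch(url)
        if html is None:
            continue
        pages[url] = html
        parser = LinkCollector()
        parser.feed(html)
        for link in parser.links:
            target = urldefrag(urljoin(url, link)).url
            if target.startswith(prefix) and target not in seen:
                seen.add(target)
                queue.append(target)
    return pages


def relocate(pages, root_url, mirror_url):
    """Assumed phase 2: re-key the pages under mirror_url.

    Relative links keep working because the directory layout is kept.
    Absolute links to collected pages are rewritten to the mirror.
    Only double-quoted attribute values are handled in this sketch.
    """
    src = root_url.rsplit("/", 1)[0] + "/"
    dst = mirror_url.rstrip("/") + "/"
    mirrored = {}
    for url, html in pages.items():
        parser = LinkCollector()
        parser.feed(html)
        for link in set(parser.links):
            if urldefrag(link).url in pages:  # absolute link to a mirrored page
                new_link = dst + link[len(src):]
                html = html.replace('"' + link + '"', '"' + new_link + '"')
        mirrored[dst + url[len(src):]] = html
    return mirrored
```

Separating collection from relocation means the set of mirrored pages is known before any link is rewritten, so links to pages outside the document can be left pointing at their original servers.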

The mirroring technique developed at SZTAKI could be a first step toward introducing a separate protocol and/or a protocol extension for mirroring purposes. Applying new mirroring techniques also affects the searching methods to be used.

Uniform searching techniques

In heterogeneous distributed systems, searches are carried out through heterogeneous search engines with different schemas, different transaction models, and different search protocols. Hiding the details of such complex searches is an open issue. Transparency in distributed systems (such as location and access transparency) is considered key to the usability of the system. Users would like to be completely unaware of the internal details of system mechanisms, although this requirement sometimes conflicts with the performance of the system. SZTAKI plans to develop a new uniform search engine that makes homogeneous search possible even in the presence of mirrored WWW documents.
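One common way to hide engine differences from the user is to wrap each engine in an adapter that returns results in a single shape, and to fold mirrored copies back onto their original address before merging. The Python sketch below illustrates that idea only; it is not the planned SZTAKI engine. The two engine result formats, the scoring rules and all names (Hit, percent_adapter, rank_adapter, uniform_search) are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    url: str
    title: str
    score: float  # normalised by the adapter to the range 0..1


def percent_adapter(rows):
    """Wraps a hypothetical engine returning (url, title, percent) tuples."""
    def search(query):
        q = query.lower()
        return [Hit(u, t, p / 100) for u, t, p in rows if q in t.lower()]
    return search


def rank_adapter(rows):
    """Wraps a hypothetical engine returning ordered dicts without scores."""
    def search(query):
        q = query.lower()
        matches = [r for r in rows if q in r["name"].lower()]
        # No score is available, so rank n is mapped to 1/n (an assumption).
        return [Hit(r["location"], r["name"], 1 / n)
                for n, r in enumerate(matches, 1)]
    return search


def canonical(url, mirrors):
    """Maps a mirror URL back to its origin URL; other URLs are unchanged."""
    for mirror_prefix, origin_prefix in mirrors.items():
        if url.startswith(mirror_prefix):
            return origin_prefix + url[len(mirror_prefix):]
    return url


def uniform_search(query, adapters, mirrors):
    """Queries every adapter and merges hits, one entry per document."""
    best = {}
    for adapter in adapters:
        for hit in adapter(query):
            key = canonical(hit.url, mirrors)
            if key not in best or hit.score > best[key].score:
                best[key] = Hit(key, hit.title, hit.score)
    return sorted(best.values(), key=lambda h: (-h.score, h.url))
```

The caller sees one list of hits and never learns which engine, or which mirror, produced an entry. Scores from different engines are not truly comparable, so the merge order here is a simplification.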

Caching and replication of services

The performance and reliability of distributed systems can be improved by using caching and replication techniques. Different digital library architectures require sophisticated caching and/or replication architectures, eg hierarchical distributed caching. Safe propagation of content modifications in complex caching and/or mirroring architectures remains an open question. Copyright and intellectual property issues can affect the techniques used for improving the performance of the system, or even prevent the use of some of these techniques completely. A tradeoff between improving performance and taking intellectual property issues into consideration implies a redefinition of the very idea of copyright. As a recent activity in this field, new replicated Dienst index services have been installed in three different European regions (INRIA, SZTAKI and FORTH).

This new system installation has improved the European facilities by providing faster and more reliable access to the distributed computer science technical report digital library (NCSTRL).
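The hierarchical distributed caching mentioned above can be pictured as a tree of caches in which a miss is passed to the parent cache and only the top level contacts the original server. The Python sketch below is a minimal illustration under that assumption. The CacheNode class and its parameters are hypothetical, it does not model Dienst or NCSTRL, and it deliberately leaves out expiry and the propagation of modifications, which the article names as an open question.

```python
class CacheNode:
    """One level in a cache hierarchy.

    A node has either a parent cache or, at the top of the tree, an
    origin function that retrieves the document from its home server.
    """

    def __init__(self, name, parent=None, origin=None):
        if (parent is None) == (origin is None):
            raise ValueError("give exactly one of parent or origin")
        self.name = name
        self.parent = parent
        self.origin = origin
        self.store = {}
        self.hits = 0
        self.misses = 0

    def get(self, key):
        """Returns the document, filling every cache on the path."""
        if key in self.store:
            self.hits += 1
            return self.store[key]
        self.misses += 1
        if self.parent is not None:
            value = self.parent.get(key)
        else:
            value = self.origin(key)
        self.store[key] = value
        return value
```

With one regional cache above two site caches, a report requested at both sites is fetched from its home server only once; the second site is served by the regional cache.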

Please contact:
László Kovács - SZTAKI
Tel: +36 1 269 8286
E-mail: laszlo.kovacs@sztaki.hu
