Caching HTTP Proxy Server
 
Omer Horvitz, Ehud Soen, Avraham Kenigsberg
Computer Communications Lab
Computer Science Department, Technion
Haifa, Israel
 
 


Table of Contents

Abstract

Within the scope of the Computer Communications Lab, we designed and implemented a caching proxy server that handles HTTP traffic. The proxy features:

The application is implemented in Java, and as such it benefits from cross-platform portability.

 

Introduction

An HTTP proxy server is a server that handles HTTP requests from clients. If the clients are of a common organization or domain, or exhibit a similarity in browsing behavior, the proxy can effectively cache requested documents. Caching, which migrates documents across the network closer to the users, reduces network traffic, reduces the load on popular Web servers and reduces the time that end users wait for documents to load.

A proxy server accepts requests from clients. When possible and desired, it generates replies based upon documents stored in its local cache; otherwise, it forewords the requests, transfers the replies to the clients and caches them when possible.
The proxy thus acts both as a server and as a client. It is a server when accepting HTTP requests from its clients, but a client to the remote servers it connects to when it is unable to fulfill requests by the means of its local cache.
 
 

The Proxy Mechanism

When the proxy is brought up, a thread is set to wait for incoming HTTP requests from clients on a specific port (8080 by default). Upon receipt, every request is administered to a newly created thread for handling.

The handling thread parses the request.

If the request is to retrieve a document (indicated by the request method 'GET'), the document exists in the local cache, the request allows a reply from a cache (the pragma 'no cache' is not specified in the request), an 'if-modified-since' field is not specified in the request and the origin host does not appear on the list of servers that are accessible without proxy - the document is retrieved from the cache and is sent to the client (a cache hit).

In all other cases, the request is forwarded to the origin host or to an alternative proxy, if one has been defined. Note that among these cases, only the ones in which the requested document does not exist in the cache and the other conditions tested to be false - are regarded as cache misses.

The reply from the remote server is transferred to the client. In addition, successful replies to retrieval requests (indicated by return code 200) from hosts that do not appear on the list of servers that are accessible without proxy are saved to disk as local files and are cached upon completion of their transmission. Note that whenever a new version of a cached document is received (for example, in reply to a request that contained an 'if-modified-since' field ), it replaces the old version in the cache.

Statistics accumulated during the transaction with the remote server (connection time, bandwidth) are used to update the servers database.
 
 

The Caching Mechanism

The cache handles DocumentInfo objects that correspond to files in the local cache directory and contain information about their creation date, size, and number of times they have been referred to. A DocumentInfo object is created and inserted into the cache after the successful receipt and saving of the desired document.

Newly received documents are always inserted into the cache (they replace older versions if such exist). Since disk space is bounded, when there is no room to insert a new document, cache cleanup must be performed. In our application, the upper bound to the size of the cache directory is determined by the High Water Mark parameter. The current size of the cache directory is indicated by the Current Water Mark parameter. Upon cleanup, documents are removed until the size of the cache directory is below the Low Water Mark parameter. The High and Low water marks are customizable.

The algorithm that chooses documents for removal is called the removal algorithm. The algorithm that we chose to implement is optimized to reduce the time that end users wait for documents to load. The algorithm is based on the Hybrid algorithm (HYB) suggested by Wooster and Abrams (refer to their article 'Proxy Caching That Estimates Page Load Delays').

When invoked, the removal algorithm assigns a mark to each cached document, quick-sorts the documents according to their marks and removes documents - starting from the one with the lowest mark and upwards - until the contents are below the Low Water Mark.

A document is marked on the basis of the following factors:

  1. An estimation of the connection time to the origin server (Clatserver(i)).
  2. An estimation of the bandwidth to the origin server(Cbwserver(i)).
  3. The size of the document (si).
  4. The frequency of references to the document (Frefi).
The first two factors are estimations based on connection times and bandwidths measured during past interactions with the origin server, and are stored and handled by the ServersDataBase object. Whenever a new document is received from a server, the connection establishment latency and bandwidth for that document (transmission time / size) are measured (let these samples be denoted by sclat and scbw respectively) and the estimates are updated as follows: where Alpha is a smoothing factor, set to 1/8 as it is in the TCP smoothed estimation of RTT (Alpha is customizable).

Note that statistics are on a per-server basis rather on per-document basis. This eliminates the problem of lower bandwidth measurements for small documents than measurements for large documents on the same server due to slow start: a server is expected to contain a range of documents of different sizes, so the bandwidths measured for all servers are expected to be equally influenced by the above phenomenon.

The last two factors are kept track of by DocumentInfo objects. When a document is saved locally, its size and the current date are noted. Whenever a cache hit event occurs, the reference counter of the hit document is incremented. The frequency of references to a document is the quotient of  the number of references and the time since it was introduces to the cache.
 

The four factors are combined to produce a single value (document mark) according to the formulae:

where WClat,  WCbw, WFref and Wsi are customizable weights of the factors involved.

The removal algorithm prefers to remove documents with low marks. Therefore, a document is not likely to be removed if the expression above is large, which would occur if the document resides on a server that takes a long time to connect to (i.e. large Clatserver(i)) and is connected via a low bandwidth link (i.e. small Cbwserver(i)), if the document has been referenced frequently (i.e. large Frefi), and if the document size is small (i.e. small si).
 
 

Concurrency Control
 
Several data structures are common to all running threads, and must be protected to ensure their consistency.

The cache, which handles DocumentInfo objects that correspond to files in the local cache directory, features a two-level locking scheme that enables concurrent reading, writing and checking the existence of documents in the cache, while retaining the consistency of the structure:

The scheme has the following properties: Several other structures are common to all threads, namely the servers statistics database, the list of servers that are accessible without proxy, the configuration object that stores all the parameters of the application and the components of the graphical interface. Synchronization of threads accessing these objects is achieved simply by requiring the acquisition of a lock on the object before using its services, and releasing the lock when done. Threads hold the discussed objects for neglectable intervals of time as they only offer simple operations and they all reside in main memory; thus, the scheme will not affect parallelism of the application to any significant degree.
 
 

Initialization and Shutdown Routines

The initialization and shutdown routines are responsible for the recovery and saving of the proxy environment.

When the server is brought down intentionally, the environment is saved to the disk in the following files:

  1. "proxy.ini" - configuration parameters.
  2. "cache.bak" - cache contents.
  3. "server.bak"  - statistics concerning remote servers.
  4. "serverWOProxy.bak" -  list of servers that are accessible without proxy.
If the environment has been successfully saved, a note is made in "proxy.ini".

Upon restart, the initialization routine attempts to read "proxy.ini". If successful, the configuration parameters are recovered; otherwise, the configuration parameters are set to default values. The routine now checks if the proxy was brought down properly during its last use. If so, the backup files are used to restore the environment. Otherwise (a crash occurred during the last use, some or all backup files are corrupted, etc.) the cache is set to contain only documents that appear both on "cache.bak" and in the cache directory, the servers statistics database is set to contain only servers whose documents are in cache, and the list of servers that are accessible without proxy is recovered or set empty.
 


Download Area

Download the source code

  1. Download the source code
  2. Compile sources to byte codes by typing 'javac Server.java' (add the Java directory to your path if neccessary).
  3. Run the application by typing 'java Server'.
Download the byte code
  1. Download the byte code
  2. Run the application by typing 'java Server'.
Download the preliminary report