Within the scope of the Computer Communications Lab, we designed and implemented a caching proxy server that handles HTTP traffic. The proxy features:
An HTTP proxy server is a server that handles HTTP requests from clients. If the clients are of a common organization or domain, or exhibit a similarity in browsing behavior, the proxy can effectively cache requested documents. Caching, which migrates documents across the network closer to the users, reduces network traffic, reduces the load on popular Web servers and reduces the time that end users wait for documents to load.
A proxy server accepts requests from clients. When possible and desired,
it generates replies based upon documents stored in its local cache; otherwise,
it forewords the requests, transfers the replies to the clients and caches
them when possible.
The proxy thus acts both as a server and as a client. It is a server when accepting HTTP requests from its clients, but a client to the remote servers it connects to when it is unable to fulfill requests by the means of its local cache.
The Proxy Mechanism
When the proxy is brought up, a thread is set to wait for incoming HTTP requests from clients on a specific port (8080 by default). Upon receipt, every request is administered to a newly created thread for handling.
The handling thread parses the request.
If the request is to retrieve a document (indicated by the request method 'GET'), the document exists in the local cache, the request allows a reply from a cache (the pragma 'no cache' is not specified in the request), an 'if-modified-since' field is not specified in the request and the origin host does not appear on the list of servers that are accessible without proxy - the document is retrieved from the cache and is sent to the client (a cache hit).
In all other cases, the request is forwarded to the origin host or to an alternative proxy, if one has been defined. Note that among these cases, only the ones in which the requested document does not exist in the cache and the other conditions tested to be false - are regarded as cache misses.
The reply from the remote server is transferred to the client. In addition, successful replies to retrieval requests (indicated by return code 200) from hosts that do not appear on the list of servers that are accessible without proxy are saved to disk as local files and are cached upon completion of their transmission. Note that whenever a new version of a cached document is received (for example, in reply to a request that contained an 'if-modified-since' field ), it replaces the old version in the cache.
Statistics accumulated during the transaction with the remote server
(connection time, bandwidth) are used to update the servers database.
The Caching Mechanism
The cache handles DocumentInfo objects that correspond to files in the local cache directory and contain information about their creation date, size, and number of times they have been referred to. A DocumentInfo object is created and inserted into the cache after the successful receipt and saving of the desired document.
Newly received documents are always inserted into the cache (they replace older versions if such exist). Since disk space is bounded, when there is no room to insert a new document, cache cleanup must be performed. In our application, the upper bound to the size of the cache directory is determined by the High Water Mark parameter. The current size of the cache directory is indicated by the Current Water Mark parameter. Upon cleanup, documents are removed until the size of the cache directory is below the Low Water Mark parameter. The High and Low water marks are customizable.
The algorithm that chooses documents for removal is called the removal algorithm. The algorithm that we chose to implement is optimized to reduce the time that end users wait for documents to load. The algorithm is based on the Hybrid algorithm (HYB) suggested by Wooster and Abrams (refer to their article 'Proxy Caching That Estimates Page Load Delays').
When invoked, the removal algorithm assigns a mark to each cached document, quick-sorts the documents according to their marks and removes documents - starting from the one with the lowest mark and upwards - until the contents are below the Low Water Mark.
A document is marked on the basis of the following factors:
Note that statistics are on a per-server basis rather on per-document basis. This eliminates the problem of lower bandwidth measurements for small documents than measurements for large documents on the same server due to slow start: a server is expected to contain a range of documents of different sizes, so the bandwidths measured for all servers are expected to be equally influenced by the above phenomenon.
The last two factors are kept track of by DocumentInfo objects. When
a document is saved locally, its size and the current date are noted. Whenever
a cache hit event occurs, the reference counter of the hit document is
incremented. The frequency of references to a document is the quotient
of the number of references and the time since it was introduces
to the cache.
The four factors are combined to produce a single value (document mark) according to the formulae:
The removal algorithm prefers to remove documents with low marks. Therefore,
a document is not likely to be removed if the expression above is large,
which would occur if the document resides on a server that takes a long
time to connect to (i.e. large Clatserver(i)) and
is connected via a low bandwidth link (i.e. small Cbwserver(i)),
if the document has been referenced frequently (i.e. large Frefi),
and if the document size is small (i.e. small si).
Several data structures are common to all running threads, and must be protected to ensure their consistency.
The cache, which handles DocumentInfo objects that correspond to files in the local cache directory, features a two-level locking scheme that enables concurrent reading, writing and checking the existence of documents in the cache, while retaining the consistency of the structure:
Initialization and Shutdown Routines
The initialization and shutdown routines are responsible for the recovery and saving of the proxy environment.
When the server is brought down intentionally, the environment is saved to the disk in the following files:
Upon restart, the initialization routine attempts to read "proxy.ini".
If successful, the configuration parameters are recovered; otherwise, the
configuration parameters are set to default values. The routine now checks
if the proxy was brought down properly during its last use. If so, the
backup files are used to restore the environment. Otherwise (a crash occurred
during the last use, some or all backup files are corrupted, etc.) the
cache is set to contain only documents that appear both on "cache.bak"
and in the cache directory, the servers statistics database is set to contain
only servers whose documents are in cache, and the list of servers that
are accessible without proxy is recovered or set empty.
Download the source code