Asking the manager first adds network overhead (two requests per URL), so
why not request the URL directly from one of the servers?
Statistically, the chance that the requested URL is found on a given
server is low. Suppose we have 10 servers, each holding 20% of the files
(some files are mirrored). The chance that the requested URL is found on
a given server is then 1 in 5. If that server holds the requested URL, we
perform only one request. If not, we spend three requests (the request must
be rerouted to a manager, which reroutes it again to the appropriate server).
The average number of requests per URL is therefore:
1/5 * 1 + 4/5 * 3 = 2.6
This is more than the two requests needed when the manager is asked first, so it is better to ask the manager first and then request the URL from the correct server.
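The comparison above can be checked with a few lines of Python. This is only a sketch of the arithmetic. The function name and the cost parameters are illustrative, and the costs (1 request on a hit, 3 on a miss, 2 when the manager is asked first) are the ones assumed in the text.

```python
from fractions import Fraction

def expected_requests_direct(hit_probability, hit_cost=1, miss_cost=3):
    """Average requests per URL when the client guesses a server first."""
    return hit_probability * hit_cost + (1 - hit_probability) * miss_cost

# One request to the manager plus one to the server it names.
MANAGER_FIRST_COST = 2

direct = expected_requests_direct(Fraction(1, 5))
print(float(direct))                 # 2.6
print(direct > MANAGER_FIRST_COST)   # True
```

Under these costs, guessing a server directly only pays off when the chance of a hit is above 1/2, since p * 1 + (1 - p) * 3 < 2 requires p > 1/2.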
The manager communicates with the servers through a predefined collection of APIs called RSMC (Remote Server Management Commands). RSMC requests are sent to the servers over RSMP (Remote Server Management Protocol). The RSMC API contains methods that allow the manager to query information from the servers and to transfer files between the servers.
The whole system can operate as a distributed proxy server or as a distributed web server. In either case, the manager must know exactly which files exist on each server. With distributed managers, this information can be common to all managers (each manager holds all the information) or split between them. Splitting the information can be useful in large installations where there are many files on each server. For example, the file information can be split between two managers: the first holds information about files in the A-L interval and the second holds information about files in the M-Z interval. A more efficient way is to distribute the file information so that the load is equalized between the managers.
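The two ways of splitting file information between managers can be sketched as follows. The function names are hypothetical. Hashing the file name is only one possible way to even out the load and is an assumption here, not something the design prescribes.

```python
import hashlib

def manager_by_interval(file_name):
    """Alphabetic split from the text: A-L -> manager 0, M-Z -> manager 1.

    Names that start with a digit or other character sorting before "A"
    fall to manager 0 in this sketch.
    """
    first = file_name[0].upper()
    return 0 if first <= "L" else 1

def manager_by_hash(file_name, manager_count):
    """Spread file names over managers by hashing (assumed technique).

    A stable hash is used so that every manager computes the same answer.
    """
    digest = hashlib.sha256(file_name.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % manager_count

print(manager_by_interval("index.html"))  # 0
print(manager_by_interval("news.html"))   # 1
```

The alphabetic split is easy to reason about but can be uneven if many file names share a first letter. A hash-based split does not depend on how the names are distributed.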
The manager decides which files are stored on each server. If the system operates as a distributed web server, the manager must keep at least one copy of each file (more copies are better, to prevent loss of data due to a server failure), so moving files between the servers must be handled with care. In a web proxy system, the manager uses the servers as cache space and keeps only the most needed files.
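The "handle with care" rule for the web server case can be sketched as a check the manager runs before removing a copy. The function name, the dictionary layout, and the min_copies parameter are all assumptions made for illustration.

```python
def can_remove_copy(file_locations, file_name, server, min_copies=1):
    """Return True if dropping file_name from server leaves enough copies.

    file_locations maps a file name to the set of servers holding it.
    """
    holders = file_locations.get(file_name, set())
    if server not in holders:
        return False  # nothing to remove on that server
    return len(holders) - 1 >= min_copies

locations = {
    "index.html": {"server1", "server2"},
    "logo.gif": {"server3"},
}
print(can_remove_copy(locations, "index.html", "server1"))  # True
print(can_remove_copy(locations, "logo.gif", "server3"))    # False
```

When moving a file, the same idea applies: copy it to the destination first and remove it from the source only after the copy is confirmed.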
The most needed files are identified using a caching algorithm. This can be simple LRU or a more complex algorithm. We recommend an algorithm that considers the following aspects:
recently used - A file that was used recently has a high chance of being used again soon.
most frequently used - A file that is requested often has a high chance of being used again.
distance from server - Files from nearby origin servers load quickly, so they can be loaded directly from the origin server. It is better to cache files from distant servers.
file size - Saving smaller files allows more files to be saved.
file types - The most requested files are HTML files. HTML files
are also small relative to image files, so it is better to save HTML
files (or any other common file type) than rare file types. Also,
when an HTML file is in the cache, it loads quickly and allows the user
to read the text while the other components are being loaded.
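One way to combine the aspects listed above is a single weighted score per file, with the lowest-scoring files evicted first. The sketch below is only an illustration. The field names, the weights, and the set of "common" extensions are hypothetical and would need tuning against real traffic.

```python
from dataclasses import dataclass

@dataclass
class CachedFile:
    name: str
    seconds_since_last_use: float  # recency
    request_count: int             # frequency
    origin_latency_ms: float       # distance from the origin server
    size_bytes: int

# Assumed set of common file types; HTML is the example given in the text.
COMMON_EXTENSIONS = (".html", ".htm")

def cache_score(f):
    """Higher score means the file is more worth keeping in the cache."""
    recency = 1.0 / (1.0 + f.seconds_since_last_use)
    frequency = float(f.request_count)
    distance = f.origin_latency_ms / 1000.0    # far origins score higher
    size_penalty = f.size_bytes / 1_000_000    # large files score lower
    type_bonus = 1.0 if f.name.lower().endswith(COMMON_EXTENSIONS) else 0.0
    # All weights below are illustrative assumptions.
    return (2.0 * recency + 1.0 * frequency + 1.0 * distance
            + type_bonus - 1.0 * size_penalty)

def eviction_order(files):
    """Files sorted from least to most worth keeping."""
    return sorted(files, key=cache_score)

page = CachedFile("index.html", 10, 5, 300, 20_000)
image = CachedFile("photo.jpg", 3600, 1, 20, 2_000_000)
print([f.name for f in eviction_order([page, image])])
```

With these sample values the large, rarely used image from a nearby server is listed first for eviction, and the small, popular HTML page from a distant server is kept.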
The manager manages all file transfers between the servers and should
always have an up-to-date picture of which files are on each server.
The manager should hold a file table that lists the files on each
server, and should update this table after each request is acknowledged.
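A minimal sketch of such a file table is shown below. The class name, the method names, and the "stored" and "deleted" action labels are assumptions for illustration. They are not part of RSMC as described here. The table changes only when a server's acknowledgment arrives, so it reflects what the servers have confirmed rather than what was merely requested.

```python
from collections import defaultdict

class FileTable:
    """Tracks which servers hold each file, as confirmed by acknowledgments."""

    def __init__(self):
        self._servers_by_file = defaultdict(set)

    def on_acknowledged(self, action, file_name, server):
        """Apply a change after the server has acknowledged the request."""
        if action == "stored":
            self._servers_by_file[file_name].add(server)
        elif action == "deleted":
            self._servers_by_file[file_name].discard(server)
            if not self._servers_by_file[file_name]:
                del self._servers_by_file[file_name]
        else:
            raise ValueError(f"unknown action: {action}")

    def servers_for(self, file_name):
        """Return a copy of the set of servers holding file_name."""
        return set(self._servers_by_file.get(file_name, ()))

table = FileTable()
table.on_acknowledged("stored", "index.html", "server1")
table.on_acknowledged("stored", "index.html", "server2")
table.on_acknowledged("deleted", "index.html", "server1")
print(table.servers_for("index.html"))  # {'server2'}
```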