$Id: Release-Notes-1.4.txt,v 1.3.4.5 1995/12/20 02:42:45 duane Exp $ TABLE OF CONTENTS C version of ftpget HUP signal rotates log files Querying neighbors based on URL domains Dealing with file descriptor shortages Multiple swap directories Separate IP ACLs for proxy and httpd-accel Command line option (-D) to disable DNS tests Periodic garbage collection disabled by default. Firewalls Description of server selection algorithm Description of neighbor resolution algorithm The CGI cache manager interface Virtual host support Fast reload mode Dealing with cache stalls under SunOS 4.1.x New UDP-based cache announcement service Other additions to cache configuration file * ascii_port and udp_port * proxy-only * connect_timeout * neighbor_timeout Cached-1.5 pre-announcement Thanks and acknowledgements ======================================================================== C version of ftpget ------------------- The 'ftpget' program has been rewritten in C. The Perl version caused core-dumps for a number of users. The C version is more reliable and efficient. HUP signal rotates log files ---------------------------- Sending the cached process a HUP signal "rotates" the log files. Each active log file is closed and then renamed with a digit extension. The configuration parameter rotate_number specifies how many rotated log files to keep. If set to zero, the log files are simply closed and re-opened, allowing them to be renamed manually. Querying neighbors based on URL domains --------------------------------------- By default the cache queries all neighbors (and parents) for requested objects which it does not already have. It is now possible to selectively query neighbors based on the domain name of the requested URL. This is done with the 'cache_host_domain' directive in the configuration file. The usage is: cache_host_domain neighbor-name domain [domain...] cache_host_domain neighbor-name !domain [domain...] The 'neighbor-name' must match the host name of a neighbor previously specified with a 'cache_host' line. The same 'neighbor-name' may appear on multiple 'cache_host_domain' lines. It is possible to specify that a neighbor should be queried only when the domain matches, or when it does not match (by prepending a ! to the domain). The order of listed domains is important. The action of the first domain to match is taken. ("action" means "do query" for 'domain' and "do not query" for '!domain'). If the URL does not match any listed domains, the default action is the opposite of the last one. For example, assume your cache was running inside an organization named "this.org." Requests for URLs inside the domain should be forwarded to other caches inside the this.org, but not up to the parent cache running outside of this.org. Similarly, for URLs outside of this.org there is no need to query the inside neighbors. The configuration would look like: cache_host neighbor N1.this.org 3128 3130 cache_host neighbor N2.this.org 3128 3130 cache_host parent P1.foo.net 3128 3130 cache_host_domain N1.this.org this.org cache_host_domain N2.this.org this.org cache_host_domain P1.foo.net !this.org Dealing with file descriptor shortages ------------------------------------- An extremely busy cache may occasionally run out of file descriptors. This is especially a problem on SunOS where each process is limited to 256 FDs. Additionally some Unix kernels seem to be slow in recycling socket FDs even though they received a proper close(). For that reason the cache process and the kernel have different values for the number of available FDs. Running out of FDs causes real problems for active objects (e.g. swapping the object to disk will fail). For this reason we now start denying new client connections when the cache calculates that only 64 FDs are available. For each socket(), accept(), and open() failures then the number of reserved FDs is increased by 1, to a limit of MAXFD-64. Multiple swap directories ------------------------- The cache now supports having multiple 'cache_dir' directories. The cache objects will be saved in each directory in round-robin fashion. For example, if the configuration file contained cache_dir /bigdisk1/cache cache_dir /bigdisk2/cache Then the first four objects would be saved as /bigdisk1/cache/01/1 /bigdisk2/cache/01/2 /bigdisk1/cache/02/3 /bigdisk2/cache/02/4 The parameter cache_swap specifies the total, maximum disk space to use for swap files, not the amount to use for each swap directory. Separate IP ACLs for proxy and httpd-accel ------------------------------------------ When 'httpd_accel_with_proxy on' is in the config file, the cached process can handle both proxy and httpd-accel requests at the same time. However, there was only a single IP address access list to control client access to both types of requests. It is now possible to have separate IP ACLs for proxy and httpd-accel requests. The configuration file supports six directives for specifying IP ACLs: proxy_allow proxy_deny accel_allow accel_deny manager_allow manager_deny If no address are given for the accel ACL then it defaults to be the same as the proxy ACL. Command line option (-D) to disable DNS tests --------------------------------------------- Upon startup the cache queries a handful of well known Internet host names to ensure that the host name resolver (i.e. DNS) is working properly. This may be undesirable for sites behind firewalls or those with very slow links to the greater Internet. It is now possible to disable these DNS tests with a -D option on the cached command line. Periodic garbage collection disabled by default ----------------------------------------------- The parameter 'clean_rate' determines how often the cache should perform a full garbage collection. This involves removing all expired objects and reclaiming disk space. For very large caches it will take a significant amount of time to garbage collect on every object. During this time processing of cache requests from users is suspended. For this reason, full garbage collection is now disabled by default. The preferred method of garbage collection is to check only a small number of objects approximately every second. If you have used previous versions of the cache, Please check your cache configuration file and make sure the clean_rate parameter is really what you want. Firewalls --------- A number of changes have been made so that cached works better behind a firewall. If you want to run cached behind (i.e. "inside", but not *on*) a firewall then you should use these configuration parameters: cache_host: This should point to your proxy server running on the firewall host. inside_firewall: A list of domains which are inside your firewall. If a URL-host does not match one of these domains, the request will be forwarded to a parent cache. local_domain: A list of domains which the cache can connect directly to. If a URL-host matches one of these domains, we never query the parents or neighbors. single_parent_bypass: If you have only one parent cache (and no neighbor caches), then it may make sense for you to set this to 'yes'. When behind a firewall, and there is only one parent, it makes little sense to query the parent, wait for the reply, and then fetch the object from the parent regardless whether the reply is HIT or MISS. Description of server selection algorithm ----------------------------------------- When cached recieves a request for an object not already in its cache, it must decide where the object should be fetched from. In general, an object might be fetched from a cache neighbor, cache parent, or the source site (URL-host). Below is a flow chart of sorts, which describes how cached will choose the site to fetch the object from. Is the URL-host beyond our firewall? | | No Yes-> Query the hierarchy, | No DNS lookup for URL-host. | Is the object uncachable, or is URL-host one of the local domains? | | No Yes-> Fetch the object directly | from the URL-host. | Is there a local_ip list? | | No Yes-> Do a DNS lookup for URL-host | so we can check it in the list. | Would we Query any parents/neighbors for this URL? (i.e., check the cache_host_domain restrictions). | | Yes No-> Fetch directly from URL-host | Would we ping only a single parent, and not also ping the source host, and is single_parent_bypass on? | | No Yes-> Fetch directly from the | single parent cache. | +-----------------> Query the hierarchy, do a DNS lookup for URL-host. Description of neighbor resolution algorithm -------------------------------------------- * Query all neighbors and parents, subject to cache_host_domain configuration. * Query the source site if 'source_ping on' and the URL-host is not beyond a firewall. * If the object is uncachable (e.g. cgi-bin scripts), do not query the neighbors and parents. Instead just request it from the first allowable parent. The cache will begin fetching the object as soon as one of the following occurs: * a HIT message is received. * all query replies (i.e. MISSes) arrive. * the 'neighbor_timeout' timeout occurs (default two seconds). At this point, the object will be retrieved from one of the following, in order: * The first neighbor or parent to reply with a HIT message. NOTE: the source host always returns a HIT if queried. * The first parent to reply with a MISS message. * The source server if it is not beyond a firewall. * The single-parent for the URL-host, if appropriate. * If none of the above apply, an error message is generated. The CGI cache manager interface ------------------------------- Versions prior to v1.4 included a tcl/tk cache manager program. That program has been dropped from the distribution due to its difficulty to install and maintain. We now provide a CGI program which can be used to monitor the cache process. This program is installed as $prefix/cgi-bin/cachemgr.cgi. To enable it from your httpd server, you should either copy the executable to your httpd's cgi-bin directory, make a symbolic link, or configure your http to enable running CGI scripts from $prefix/cgi-bin. Don't forget to ensure that the httpd process has permission to execute the file. Note that security is essentially defeated with the cachemgr.cgi program. The configuration parameters manager_allow and manager_deny specify which IP addresses may make cache manager reqeusts. With the CGI program, the requests will be coming from the httpd server host. So anyone with access to the CGI program will also have access to the cache manager functions Most of these functions are harmless and provide only very general information. Very paranoid administrators may wish to totally disable cache manager access to their cache. In order to shut down the cache from the CGI program a password must be provided. The password is taken from the /etc/passwd file for the user 'cache'. If there is no user 'cache' in the system password file, then the cache cannot be shut down from the cachemgr.cgi program. Note that the "shutdown" command causes the current cache process to exit. If the RunCache script is running, another cache process will be automatically restarted. Virtual host support -------------------- The cache supports a simple form a virtual host operation for the httpd-accelerator mode. When the -V command line option is given, incoming requests are rewritten with the IP address of the local socket. For example, the request /index.html might be rewritten as http://172.16.0.1/index.html If -V is given, then the 'http_accel'configuration parameters are ignored. Fast restart mode ---------------- The cache now supports a fast restart mode. The ability to perform a fast reload depends on the swap log file being "clean" when the cache last exited. A clean log file gets written when: - the cached process is killed with a SIGTERM or SIGINT signal. - the process exits due to a "Segmentation Violation" or "Bus Error" and the -C command line option was not given. A fast restart means that the cache does not call stat(2) for each swap file on disk. It assumes that the swap files have not been removed or modified since the swap log was last written. The first time you start the v1.4 cache, it will not be possible to do a fast restart (but the disk files will be preserved). This is because the swap log file now includes the object size as a fifth field for each object. Dealing with cache stalls under SunOS 4.1.x ------------------------------------------- SunOS has a 5 connection listen limit which is insufficient under high workload. You can patch your kernel to increase this to a reasonable number. Instructions are outlined in http://excalibur.usc.edu/cachedoc/sunos.txt New UDP-based cache announcement service ---------------------------------------- In version 1.3, the cache would send an email message to cache_tracker@cs.colorado.edu to announce itself. The email code has been removed for a couple of reasons. To take its place, we provide an external, UDP-based announcement program ('send-announcement'). The announcement is disabled by default. To enable it, change the value of 'cache_announce' in your configuration file. This service is provided to help Harvest cache administrators locate one another in order to join or create cache hierarchies. For more information see the Web page at http://www.nlanr.net/Cache/Tracker/ Other additions to cache configuration file ------------------------------------------- * ascii_port * udp_port The "Ascii" and "UDP" ports can now be specified in the configuration file with 'ascii_port' and 'udp_port'. Values given with -a and -u command line options will override those in the configuration file. * proxy-only This has been added as an option to the 'cache_host' instruction. Previously, we used 'neighbor_cache_obj' to enable or disable the local caching of objects fetched from all neighbors. With 'proxy-only' it is possible to disable local caching of objects fetched for specific neighbors or parents. * connect_timeout Some systems (notably Linux) can not be relied upon to properly time out connect(2) requests. This parameter can be used to force pending connections to close before the operating system timeout occurs. * neighbor_timeout Previously this timeout was hardcoded as two seconds. It is how long to wait for ICP replies to arrive before we start fetching an object from the source. Cached-1.5 Pre-Announcement --------------------------- The next version of the cache will support 1) GET If-Modified-Since requests hierarchically, and intelligently use neighbors and parents. 2) A multiplexing mode for getting around file-descriptor limits 3) Better logging 4) Dynamic neighbor and parent installation 5) Decreased meta-data size 6) Storage manager extended to handle big objects 7) Redirection and stop lists 8) Double peak-load performance Thanks and Acknowledgements --------------------------- The Harvest team would especially like to thank the following individuals for providing extensive feedback and testing: Johannes Demel Michael Fischer Martin Hamilton Joe Meadows Patrick Michael Kane Donald Neal Henrik Nordstrom Anthony Rumble Mark Thomas