Copyright ©1996, Que Corporation. All rights reserved. No part of this book may be used or reproduced in any form or by any means, or stored in a database or retrieval system without prior written permission of the publisher except in the case of brief quotations embodied in critical articles and reviews. Making copies of any part of this book for any purpose other than your own personal use is a violation of United States copyright laws. For information, address Que Corporation, 201 West 103rd Street, Indianapolis, IN 46290 or at support@mcp.com.

Notice: This material is excerpted from Running A Perfect Web Site with Windows, ISBN: 0-7897-0763-2. The electronic version of this material has not been through the final proof reading stage that the book goes through before being published in printed form. Some errors may exist here that are corrected before the book is published. This material is provided "as is" without any warranty of any kind.

Chapter 18 - Usage Statistics and Maintaining HTML

Setting up your server is just the start of your work. Administering, maintaining, and monitoring usage are important tasks that will keep you busy after your server is operational. The rapid growth of Internet usage and the growing interest in analyzing that usage mean that there is a lot of work to do. Among your tasks will be collecting usage statistics and presenting them.

Your customers, those people who put up the content of the WWW site and expect it to provide some return on investment, need to know if other people are coming to the site and what they do when they are there. Fortunately, with some effort, you can provide that information. A tremendous amount of information about the client systems and activities of the WWW site is captured and available for analysis.

This information includes how often your server is being accessed, which files are being accessed most often, which clients are accessing your server, and how often they visit. You can convert any or all of this information into graphical summaries quite easily using programs designed to collate server usage statistics.

When the amount of information on your server becomes large, checking that all intended files have been properly linked becomes more and more difficult. As the number of related documents grows, the only practical way to do this is to use automated programs that check your documents for you. Some of these programs are described below.

In this chapter, you learn:

Understanding Usage Logs

When your Web server is running, every document or file request is logged as a separate entry in the server's log files. The names and directory for these files can be defined by the administrator during server setup. For example, the access log file can be named "access.log" and placed in a logs directory under the computer's root directory. Errors are logged separately in "logs/error.log." The access and error logs are very similar but are discussed separately for clarity. Do not locate the log files in a subdirectory of the Web documents directory. Doing so makes this information too easily accessible to the world, and you most likely will not want to share it.

The Log Formats

All major Web servers produce logs in one of two common formats, either CERN or EMWAC. This means you must use utilities written for the correct log format to analyze logs on your server. The formats include a lot of useful information about every document request except how long the transfer took.

The Access Log

The access log file records every single connection made to your server by other computers. These accesses are written in real-time and include a lot of information on the connection. Most server programs either have a default directory for storing log files or allow you to configure the server program to set where the log files should be kept. On NT computers, the default is in the system's "LogFiles" subdirectory usually on the C drive. However, most servers allow you to not only specify another directory but even a different name for the files. If you are trying to keep this information inaccessible to others or even if it will be easier for you to have a different name, you should change the defaults.

Information in the access log can include, depending on the server:

If the information is not captured in the log file, it is still accessible from the environment variables that accompany each request from the client browser. If this is the case, you will have to write some code to read those variables. Listing 18.1 is a short PERL script that returns all the values passed by a browser.

Listing 18.1 PERL Script Returning All Values Passed by a Browser
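#!/usr/bin/perl
# CGI script: list every environment variable available to the script.
print "Content-type: text/html\n\n";
print "<PRE>\n";
# Header row, padded to the same widths as the data rows.
printf ("%-24.24s %-80.80s\n", "Variable", "Value");
foreach (sort keys %ENV) {
     # Names are cut to 24 characters and values to 80.
     printf ("%-24.24s %-80.80s\n", $_, $ENV{$_});
}
print "</PRE>\n";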

The following is an excerpt from an access log generated by NCSA HTTPD for Windows; this log is in the CERN format:

s115.infonet.net - - [20/Oct/1994:20:53:17 -0500] "GET / HTTP/1.0" 200 418
s115.infonet.net - - [20/Oct/1994:20:53:37 -0500] "GET /httpddoc/overview.htm HTTP/1.0" 200 3572
s115.infonet.net - - [20/Oct/1994:20:54:00 -0500] "GET /httpddoc/setup/admin/Overview.htm HTTP/1.0" 200 1165
s115.infonet.net - - [20/Oct/1994:20:54:17 -0500] "GET /httpddoc/setup/Configure.html HTTP/1.0" 200 2500
s115.infonet.net - - [20/Oct/1994:20:54:27 -0500] "GET /httpddoc/setup/httpd/Overview.html HTTP/1.0" 200 1121

The first item in each log entry is the address of the system that requested the document, followed by the date and time, the HTTP method (GET in this example), the virtual path to the file requested, the HTTP protocol level (1.0 in this example), status information (200 means OK), and the number of bytes transferred.
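
The field layout just described can be picked apart mechanically. The following is a minimal Python sketch, not part of any server distribution; the pattern and the names ENTRY and parse_entry are illustrative assumptions, and the sketch assumes each entry occupies a single line in the actual log file.

```python
import re

# Assumed pattern for one CERN-format entry: host, two placeholder fields,
# bracketed timestamp, quoted request, status code, and byte count.
ENTRY = re.compile(
    r'^(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<when>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)$'
)

def parse_entry(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = ENTRY.match(line.strip())
    if match is None:
        return None
    fields = match.groupdict()
    fields["status"] = int(fields["status"])
    # Treat a missing byte count ("-") as zero bytes.
    fields["size"] = 0 if fields["size"] == "-" else int(fields["size"])
    return fields

sample = ('s115.infonet.net - - [20/Oct/1994:20:53:37 -0500] '
          '"GET /httpddoc/overview.htm HTTP/1.0" 200 3572')
entry = parse_entry(sample)
print(entry["host"], entry["path"], entry["status"], entry["size"])
```

Returning None for lines that do not match lets a calling loop skip damaged entries instead of stopping.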

The following is an excerpt from an access log generated by WebQuestNT; this log is in the EMWAC format:

Tue Mar 05 11:29:45 1996 204.96.64.103 205.23.164.13 GET / HTTP/1.0
Tue Mar 05 11:29:54 1996 204.96.64.103 205.23.164.13 GET /ssiecho.sht HTTP/1.0
Tue Mar 05 11:30:04 1996 204.96.64.103 205.23.164.13 GET /flowctrl.sht HTTP/1.0
Tue Mar 05 11:30:20 1996 204.96.64.103 205.23.164.13 GET /ssiplus.htm HTTP/1.0
Tue Mar 05 11:30:53 1996 204.96.64.103 205.23.164.13 GET /odbctest.sht HTTP/1.0
Tue Mar 05 11:30:56 1996 204.96.64.103 205.23.164.13 GET /mailform.htm HTTP/1.0
Tue Mar 05 11:31:04 1996 204.96.64.103 205.23.164.13 GET /odbcupdt.htm HTTP/1.0

The first item in each log entry is the date and time of the request, followed by the IP address of the web space/virtual server, then the IP of the client, the request method, the element requested and the protocol.
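
Because the EMWAC fields are separated by spaces, an entry can be split without a pattern. The sketch below is one possible approach in Python; the function name parse_emwac and the dictionary keys are assumptions made for illustration, and the date parsing assumes English day and month abbreviations.

```python
from datetime import datetime

def parse_emwac(line):
    """Split one EMWAC-format entry into named fields, or return None."""
    parts = line.split()
    # Five tokens of date and time, then server, client, method, path, protocol.
    if len(parts) != 10:
        return None
    when = datetime.strptime(" ".join(parts[:5]), "%a %b %d %H:%M:%S %Y")
    server, client, method, path, protocol = parts[5:]
    return {
        "when": when,
        "server": server,
        "client": client,
        "method": method,
        "path": path,
        "protocol": protocol,
    }

record = parse_emwac(
    "Tue Mar 05 11:29:54 1996 204.96.64.103 205.23.164.13 GET /ssiecho.sht HTTP/1.0"
)
print(record["when"].isoformat(), record["client"], record["path"])
```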

The following is an excerpt from an access log generated by Microsoft IIS:

205.242.205.145, -, 3/5/96, 16:15:46, W3SVC, REDBACK, 204.96.64.10, 2203, 198, 3193, 200, 0, GET, /Default.htm, -,
205.242.205.145, -, 3/5/96, 16:15:51, W3SVC, REDBACK, 204.96.64.10, 4997, 211, 440, 200, 0, GET, /pix/cupid.gif, -,
205.242.205.145, -, 3/5/96, 16:15:57, W3SVC, REDBACK, 204.96.64.10, 5558, 213, 3066, 200, 0, GET, /pix/jm_logo.gif, -,
205.242.205.145, -, 3/5/96, 16:16:22, W3SVC, REDBACK, 204.96.64.10, 36603, 212, 21089, 200, 0, GET, /pix/butbar.gif, -,
205.242.205.145, -, 3/5/96, 16:16:32, W3SVC, REDBACK, 204.96.64.10, 41220, 220, 300, 200, 0, GET, /butbar.map, 456,61,

The first item in each log entry is the client's IP address, followed by the username, or a hyphen if the username is unavailable. Next are the date and time of the request, followed by the service that responded to the request: WWW (W3SVC), FTP (MSFTPSVC), or Gopher (GopherSVC). Next are the computer name of the server and its IP address, followed by the processing time, bytes received, and bytes sent. Next are the service status code, the Windows NT status code, the type of operation, and the target of the operation. The MS IIS logs can be converted into the CERN or EMWAC format using utilities that come with MS IIS.

The address of the requesting client is usually in a name format, such as s115.infonet.net in this example, but can also be numerical if the server is unable to look up the name corresponding to the client's numerical IP address.

From these files and arrays, it is possible to put together a wide variety of statistics on your server usage, including

Because every document access is recorded, log files can grow very quickly. This is compounded by the fact that in-line GIF files are processed as separate requests, so, for example, a request for a document with three in-line GIFs actually shows up as four separate requests: one for the document and three for the GIFs. On even a moderately busy server, the access logs can grow to many megabytes each month.

With most servers, you can specify that new log files be generated every week or even daily. With the Microsoft IIS, the default is to create log files daily. In almost every circumstance, it is recommended that you set the server to generate daily logs. You can always combine them if needed. If you want to save historical log data, it is a good idea to periodically compress the log files and move them to an archive. You might want to do this automatically at the beginning of each month. The log files are plain text files viewable at any time by your favorite editor. Therefore you can keep track of usage on a real-time basis if you desire.
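
The archiving step can be automated with a short script. The following Python sketch shows one way to compress a finished log file into an archive directory; the function name archive_log and the file and directory names are hypothetical, and the sketch assumes the server has already closed the file being archived.

```python
import gzip
import shutil
import tempfile
from pathlib import Path

def archive_log(log_path, archive_dir):
    """Compress log_path into archive_dir, then delete the original file."""
    log_path = Path(log_path)
    archive_dir = Path(archive_dir)
    archive_dir.mkdir(parents=True, exist_ok=True)
    target = archive_dir / (log_path.name + ".gz")
    with open(log_path, "rb") as source, gzip.open(target, "wb") as dest:
        shutil.copyfileobj(source, dest)
    # Remove the original only after the compressed copy has been written.
    log_path.unlink()
    return target

# Demonstration on a throwaway file in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    log = Path(tmp) / "access.log"
    log.write_text("sample entry\n")
    archived = archive_log(log, Path(tmp) / "archive")
    print(archived.name)
```

A script like this could be scheduled to run at the beginning of each month against the previous month's files.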

The Access Log of Microsoft's Internet Information Server

Log entries recorded by Microsoft's Internet Information Server (IIS) service have the following items in this order:

  1. Client IP Address
  2. Client User name (if known)
  3. Date
  4. Time
  5. Service
  6. Computer Name
  7. IP address of server
  8. Processing time (ms)
  9. How many bytes received
  10. How many bytes sent
  11. Service status code
  12. Windows NT status code
  13. Type of operation
  14. Target of operation

If no information is available, the server inserts a dash (-) into the log file. For example, the Microsoft IIS generates access listings such as:

154.73.22.8, -, 12/28/96, 13:45:07, W3SVC, INETSRVR2, 157.55.84.1, 220, 250, 2593, 200, 0, GET, /Intro/tour/netshow.htm, -,

This means that an unidentified user from a computer with an IP address of 154.73.22.8 connected on December 28, 1996 seven seconds after 1:45 P.M. The user requested to GET and was served the file netshow.htm from the Intro/tour/ subdirectory. The request was all of 250 bytes and took 220 milliseconds to execute (without error) and resulted in a data return of 2593 bytes.
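
The same reading of an entry can be done in code. The following Python sketch splits an IIS entry on commas and labels the fourteen items in the order listed above; the key names and the function parse_iis are illustrative assumptions, and the text that follows the fourteenth item is kept under the hypothetical key "extra" because it can itself contain commas.

```python
# Assumed key names for the fourteen items, in the order listed above.
FIELDS = [
    "client_ip", "user", "date", "time", "service", "computer",
    "server_ip", "elapsed_ms", "bytes_received", "bytes_sent",
    "service_status", "nt_status", "operation", "target",
]
NUMERIC = ("elapsed_ms", "bytes_received", "bytes_sent",
           "service_status", "nt_status")

def parse_iis(line):
    """Return a dict for one IIS log entry, or None if fields are missing."""
    # Split at most 14 times so anything after the target stays in one piece.
    parts = line.strip().split(",", len(FIELDS))
    if len(parts) < len(FIELDS):
        return None
    record = {name: value.strip() for name, value in zip(FIELDS, parts)}
    for name in NUMERIC:
        record[name] = int(record[name])
    rest = parts[len(FIELDS)] if len(parts) > len(FIELDS) else ""
    record["extra"] = rest.strip().rstrip(",").strip()
    return record

sample = ("154.73.22.8, -, 12/28/96, 13:45:07, W3SVC, INETSRVR2, 157.55.84.1, "
          "220, 250, 2593, 200, 0, GET, /Intro/tour/netshow.htm, -,")
record = parse_iis(sample)
print(record["client_ip"], record["target"], record["elapsed_ms"], record["bytes_sent"])
```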

Literally every file requested is recorded in this fashion. GIFs and other files referenced from an HTML file are recorded as separate accesses. This is why an access log file can get big pretty fast!

The Access Log of the WebQuest Server

The WebQuest server records accesses in the following format:
Tue Dec 19 06:40:12 1995 204.96.64.11 137.175.2.66 GET /ict05.gif HTTP/1.0

This indicates the date and time, the IP addresses of the server and the client computers, the operation performed (in this case a GET), the file requested and the protocol.

The Access Log of Other Servers

Other servers generate slightly different listings. The Microsoft IIS provides a log conversion utility to convert a log from the default format to NCSA Common Log File or EMWAC format as well as perform reverse-DNS-lookup replacing all IP addresses with domain names.

The EMWAC's log file provides less information. A typical listing for the EMWAC access log looks like this:

Sun Mar 17 00:28:44 1996 inetsrvr.business.com 236.29.10.128 GET /images/sept.gif HTTP/1.0

This listing only includes the date and time, the name of the server, the IP address of the client and the operation and file requested. This example shows a .GIF file, even though the file was an image that was part of an .HTML file.

The following is an example from an access log generated by NCSA HTTPD for Windows:

guest.info.net - - [20/Oct/1995:20:53:17 -0500] "GET /index.html HTTP/1.0" 200 418

The first item in the log entry is the address of the system that requested the document, followed by the date and time, the HTTP method (GET in this example), the virtual path to the file requested, the HTTP protocol level (1.0 in this example), status information (200 means OK), and the number of bytes transferred (418).

The address of the requesting client is usually written as the numerical IP address. It is more useful if the address is converted to a name format, such as guest.info.net in the example above. However, log analysis software, like the products described below, can take the numerical IP address and look up the name corresponding to the client's computer. Therefore, how the access log records the client is not a critical consideration. Still, the more information the access log records directly, the better.



If you want to see document requests as they happen rather than after the fact, the log files are accessible any time through your favorite text editor, even Notepad.

The Error Log with HTTPD

The format of the error log is very similar to that of the access log. Instead of reporting the number of bytes transferred, however, the error log reports the reason for the error. The following is an excerpt from an error log generated by NCSA HTTPD for Windows; this log is in the CERN format:

[20/Oct/1994:21:02:20 -0500] httpd: access to c:/httpd/htdocs/httpddoc/setup/admin/AccessingFiles.html failed for s115.infonet.net, reason: client denied by server configuration
[20/Oct/1994:21:07:53 -0500] httpd: access to c:/httpd/htdocs/docs failed for s115.infonet.net, reason: file does not exist
[20/Oct/1994:21:08:13 -0500] httpd: access to c:/httpd/htdocs/ failed for s115.infonet.net, reason: client denied by server configuration

The format of the file is pretty self-evident. The first part of the line indicates the date and time of the error. The second part of the log entry indicates what the client was trying to access when the error occurred. The third part of the log entry explains why the error occurred.
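
Those three parts can be separated with a single pattern. The following Python sketch is an assumption based only on the excerpt shown here, not a definitive description of every message the server can write; the names ERROR_ENTRY and parse_error are hypothetical. Whitespace is collapsed first so that an entry wrapped across several lines is still recognized.

```python
import re

# Assumed shape of an "access ... failed" entry, based on the excerpt above.
ERROR_ENTRY = re.compile(
    r'^\[(?P<when>[^\]]+)\] httpd: access to (?P<target>\S+) '
    r'failed for (?P<client>\S+), reason: (?P<reason>.+)$'
)

def parse_error(entry):
    """Return the parts of one error-log entry, or None if it does not match."""
    # Collapse line breaks and repeated spaces into single spaces.
    text = " ".join(entry.split())
    match = ERROR_ENTRY.match(text)
    return match.groupdict() if match else None

wrapped = """[20/Oct/1994:21:07:53 -0500]
httpd: access to c:/httpd/htdocs/docs failed for s115.infonet.net,
reason: file does not exist"""
error = parse_error(wrapped)
print(error["target"], "|", error["client"], "|", error["reason"])
```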

Error logs are valuable for showing attempted access to controlled documents by unauthorized users and reporting server problems. If error logs are monitored frequently, they may be your first clue that a hyperlink is "broken" because a document is missing or has moved. If you see several failed connection attempts to the same document, and the document does not exist, you could find the broken hyperlink (missing link?) by looking in the access log during the same time frame to see where the client was linking from.

Hopefully, your error log doesn't grow nearly as quickly as your access log, so archiving it is not as important for conserving space. However, if there are secure documents on your server, it may not be a bad idea to keep the error log in case it's needed to track down security problems discovered later.

The Error Logging with WebQuest

Error logging within WebQuest has been disabled. The logging was part of the ODBC enhanced logging that is an option from WebMeister. Very few people use the ODBC based logging given the added workload for the server. By default Error and Access logging are disabled with WebQuest. Contact Questar if you need to have these features enabled.

Sifting Usage Data

The access file is a great record of your server's activity, but it's pretty tough to get anything meaningful out of the raw data. You need to sift and sort the log files and turn them into valuable demographics that illustrate the usage of the Web site. This will justify the investment in the successful pages and assist in understanding how to improve the less successful pages. This information can be used to support the quality of the server or justify the need to upgrade.

There is a wealth of tools and products available for sifting and analyzing the access log file. They range from simple operating system commands to sophisticated relational databases.

Quick and Dirty Analysis in DOS

Although there are a number of programs available to analyze access logs, you can often find the answers you need quickly with a few simple searches, without having to write a line of code. For starters, look at the basic search tools available under DOS.

Searching in DOS

The DOS FIND command is the easiest way to perform a search of a text file. To search for all instances of nasa.gov in the access log, enter

FIND "nasa.gov" ACCESS.LOG



With FIND, all search strings must be enclosed in quotes, regardless of whether they contain special characters.

Although the DOS FIND command does not have as many options as grep, it has enough for simple log-file searching, including

Because the log files are just ASCII text, you can also open your logs in a word processor and use the search features that are part of that particular program. You can also write macros to search for particular strings of text, such as certain error codes, to help you scan through your logs faster.

Useful Search Patterns

Now it's time to put FIND to work looking for useful data in the access log. Without writing a line of programming code, you can see

Sifting by Address

Suppose you get a couple of calls one day from users wanting to know why they can't get to the weather map anymore. You ask for their addresses and discover that they never should have had access in the first place. What do you do now? To verify their claims and assess the damage, you can start by simply searching for their addresses in the log file. Suppose the unauthorized users are from iam.illegal.com and ur.illegal.com. To see what they've looked at besides the weather map, you can simply search for illegal.com. With trepidation, you enter

FIND "illegal.com" access.log

The result is a fascinating chronicle of unauthorized activity. If there are too many lines to count, use FIND /C to do the dirty work for you, and e-mail the results to your boss on a good day.

This scenario is not all that unlikely, by the way. Basic Web server security itself is good but only as good as the rules that are made for it. More often than not, problems arise when people make assumptions or generalizations that turn out to be false. You may think, for example, that all addresses in a certain subnet (beginning with 127.34.26, for example) are located on your network, only to find out later that the first 20 addresses belonged to another company. The trick here is just to be aware of what you're doing when you're doing it. Taking the "easy way out" can sometimes open up more of a hole in your security than you're really intending.

If you're running a restricted-access Web server, you might want to check now and then to make sure that no one has gotten in from the outside. You can do this easily by looking for all accesses not from your site:

FIND /V "widgets.com" ACCESS.LOG

In this case, anything returned by the search indicates a possible security breach.

Sifting by File or Directory

Perhaps you've recently added a new feature to your Web site and want to see how much attention it's getting. Just search your logs for the directory or file name and you're in business. To see how many times your What's New page has been read in the current logging period, you simply enter

FIND "whatsnew.htm" ACCESS.LOG

Or if you've added a whole new directory of stuff (called "/stuff"), try

FIND "/stuff" ACCESS.LOG



The correct URL to get an automatic directory index is the directory name followed by a slash (/). Some servers, like NCSA's HTTPD for Windows, return an error if the trailing slash is omitted. Most others, however, generate a Redirect URL (status code 302) and then receive a second request containing the proper URL, causing the document request to show up twice and thus distorting true usage figures.



The ease with which simple searches can find all accesses to a given directory is a strong argument for maintaining a close relationship between the hyperlink structure of documents and the physical directory structure.

Computing Total Accesses

One measure of your Web server's utilization or exposure is the number of total document requests. This is not necessarily a measure of effectiveness because many people who visit your site may spend but a few seconds there and travel on. This is especially true now because of the Web's notoriety. In fact, the proportion of seriously interested patrons among all Web visitors may be even lower than the percentage of sales resulting from direct-mail campaigns. Fortunately, Web space is a lot cheaper. Nonetheless, the number of documents requested, or "hits," is of major interest.

If nothing else, measuring your server's growth in utilization can give you a good indication of when you'll have to buy more powerful hardware. Without running a more advanced usage statistics program, you can get a good feel for your server's growth simply by counting the number of total document accesses. In general, you want to exclude GIF files, however, because in-line GIFs show up as separate document requests, hence distorting the true number of HTML pages accessed. Of course, if providing images is a major part of your site, you may not want to exclude them from the count. But for example, to find out how many HTML pages have been accessed on your server, less the GIF files, you enter

FIND /C /V ".gif" ACCESS.LOG

To see how many accesses occur during some specified time period, simply run this command at regular intervals, for example every six hours, and compute the difference between each run. For more regular time periods, however, such as days and hours, you can use the next technique.

Computing Accesses during a Given Period

The access log turns out to be in a very convenient format for finding out how many document requests have been processed in most common time periods. For example, if you wanted to find out how many documents were transferred between 3:00 and 4:00 p.m. on October 25, 1994, use

FIND /C "25/Oct/1994:15" ACCESS.LOG

Using this technique, you can look at total accesses in a given hour, day, month, or year. By piping the output of one FIND command into another, you can obtain even more detailed information. For example, to find all accesses from red.widgets.com in the month of October, use

FIND "red.widgets.com" ACCESS.LOG | FIND "/Oct/"

The first FIND command finds all occurrences of red.widgets.com, while the second FIND looks only in that data for occurrences of /Oct/. (Of course, if you haven't cleaned up your log files for a while, you end up with data from this and all previous Octobers since you last purged or archived your file.)
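
When you want the totals for every hour at once rather than one hour per FIND command, a few lines of script can do the counting. The following is a minimal Python sketch under the assumption that the log uses the bracketed date format shown in this section; the names HOUR and hits_per_hour are hypothetical.

```python
import re
from collections import Counter

# Captures the day and hour, such as "25/Oct/1994:15", from the timestamp.
HOUR = re.compile(r'\[(\d{2}/[A-Za-z]{3}/\d{4}:\d{2}):')

def hits_per_hour(lines, containing=""):
    """Count entries per day and hour, optionally only lines containing a string."""
    counts = Counter()
    for line in lines:
        if containing and containing not in line:
            continue
        match = HOUR.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts

sample = [
    'red.widgets.com - - [25/Oct/1994:15:02:10 -0500] "GET / HTTP/1.0" 200 418',
    'blue.widgets.com - - [25/Oct/1994:15:40:55 -0500] "GET /stuff/ HTTP/1.0" 200 900',
    'red.widgets.com - - [25/Oct/1994:16:01:03 -0500] "GET /whatsnew.htm HTTP/1.0" 200 1200',
]
for hour, total in sorted(hits_per_hour(sample).items()):
    print(hour, total)
```

Passing containing="red.widgets.com" restricts the count to one client, much like piping one FIND command into another.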

Usage Utilities

Now for the really neat stuff. The techniques described above give you a lot of answers about your site and its usage, but they require separate actions and still give you raw output. There are numerous products, some free and some commercial, that take all the grunt work out of collating and totaling usage statistics. They range from freeware that still requires some programming effort on your part to commercial packages that provide easy-to-use graphical user interfaces to set up and customize. They all take the raw data in your log files and create reports and graphs customized to your specifications.

Among the freeware offerings, one of the nicest is wwwstat, available from http://www.ics.uci.edu/WebSoft/wwwstat/. wwwstat is nice because it produces thorough and nicely formatted output and can be used with gwstat, which turns the output of wwwstat into attractive usage graphs (in GIF format, of course). gwstat is available from ftp://dis.cs.umass.edu/pub/gwstat.tar.gz, and both wwwstat and gwstat are available on the WebmasterCD.

wwwstat

Wwwstat is a PERL script that reads the standard access-log file format and produces usage summaries in several categories. Wwwstat produces summary information for each calendar month and can be run for past months as well as the current month. Summary categories include

Figure 18.1 shows an example of Daily Transmission Statistics generated by wwwstat.

Fig. 18.1 - Wwwstat generated these Daily Transmission Statistics.

Figure 18.2 shows wwwstat's summary of statistics by client domain, which brings home the truly global nature of the Internet. Part of the wwwstat distribution is a file containing all the country codes in use on the Internet.

Fig. 18.2 - Wwwstat's output of country codes and names.

Because wwwstat is a PERL program, you can port it to other platforms, although no one has, as yet, done that publicly.

statbot

Another popular WWW log analyzer is Statbot. It works by "snooping" on the log files generated by most WWW servers and creating a database that contains information about the server. This database is then used to create a statistics page and GIF charts that can be "linked to" by other WWW resources.

Because Statbot "snoops" on the server log files, it does not require the use of the server's cgi-bin capability. It simply runs from the user's own directory, automatically updating statistics. Statbot uses a text-based configuration file for setup, so it is very easy to install and operate, even for people with no programming experience.

You can find Statbot at http://www.xmission.com/~dtubbs/club/cs.html.

AccessWatch

A third freeware product is AccessWatch, a PERL script from Bucknell University. It converts the analyzed data into an HTML file. Figure 18.3 is an example of AccessWatch output. It was generated for a subdirectory of HTML files about creating an online newspaper, called CReAte.

Fig. 18.3 - Example of AccessWatch.

It then adds detailed data in HTML tabular form. The full page can be viewed at

AccessWatch is available from http://www.eg.bucknell.edu/~dmaher/accesswatch/getAccessWatch.html.

You can find a long list of other analysis tools in the Yahoo directory http://www.yahoo.com/Computers_and_Internet/Internet/World_Wide_Web/HTTP/Servers/Log_Analysis_Tools/.

Commercial Products

Commercial products will be proliferating soon. Two early offerings are WebTrends and net.Analysis. WebTrends is a mid-range product that functions more in a batch-processing mode, and net.Analysis is a high-end product complete with an Informix database and real-time capability. Both offer great flexibility in customizing reports.

Reports generated by WebTrends include statistical information as well as colorful graphs that show trends, usage, market share and much more. Reports are generated as HTML files that can be viewed by a browser on your local system or remotely from anywhere on the Internet if you want. WebTrends claims it can read the log files of all available servers. You are able to download an evaluation copy from http://www.webtrends.com/ and try it out with your server. It is highly recommended that you try out any software for an evaluation period before you purchase it.

The following figures are some examples of WebTrends output available from its Web site. These are representative of the kinds of output possible from all of the packages.

Fig. 18.4 - This graph illustrates what Internet domains connected and the number of user sessions over a sample day.

Fig. 18.5 - This table includes additional information such as total and average hits per day.

Fig. 18.6 - This graph illustrates the hits to the pages over a set period of days.

Fig. 18.7 - This table includes additional information such as total number of hits and user sessions.

Fig. 18.8 - This graph illustrates the activity as percentage of total visits.

Fig. 18.9 - This table includes additional information such as the number of user sessions per state.

Fig. 18.10 - This graph illustrates the activity over a twenty four hour period as percentage of total visits.

Fig. 18.11 - This table includes additional information that contrasts the weekdays and weekends as well as indicates the busiest and slowest times.

net.Analysis

net.Analysis is a product designed for complex real-time log analysis. It places the log into an Informix database and runs a host of customizable queries to present as complete an analysis as is possible. Figures 18.12 and 18.13 are two examples of the results generated by net.Analysis.

Fig. 18.12

Fig. 18.13

net.Analysis is available from http://www.netgen.com/.

These examples are not meant as an endorsement of any particular product. New products and updates appear almost daily. It behooves you to check what is currently available, download evaluation copies, and decide for yourself what you want.

A list of other programs to analyze log files is available from http://union.ncsa.uiuc.edu/HyperNews/get/www/log-analyzers.html.

There is also a list in the Yahoo directory at http://www.yahoo.com/Computers_and_Internet/Internet/World_Wide_Web/HTTP/Servers/Log_Analysis_Tools/index.html.

Checking HTML

As your server grows, it becomes more and more difficult to find broken hyperlinks, both to documents on your own server as well as documents on other servers. This is especially true if many people are responsible for creating and editing documents on your server. Fortunately, there are also tools to help you analyze the structure of your HTML database and find problems. Some of these tools are freely available on the Internet.

HTML Analyzer

HTML Analyzer is a C program that both finds broken links and attempts to ensure that the HTML database is well-organized and makes sense to users. It is available in various forms from

The file name will be something like "html_analyzer-0.30.tar.gz." The documentation for HTML Analyzer is contained in the program's distribution.

The basic philosophy of HTML Analyzer is that the text of any given hyperlink should always point to the same place and that no other text should point to that same place. This is necessary in order for users to get a clear picture of the organization of the HTML database. HTML Analyzer performs three checks on a database of HTML files: validity, completeness, and consistency.

Checking for Validity

The first check performed by HTML Analyzer is for link validity. This ensures that all hyperlinks point to valid locations (that is, no server errors are returned). Empty hyperlinks (such as, HREF=""), local links (such as, HREF="#intro"), and links to interactive services (Telnet and rlogin) are not checked. Even without running the other two checks, validity checking helps to ensure that users of your site won't be frustrated by broken links.
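
The rule for which links are skipped is easy to express in code. The following Python sketch is an illustration of that rule only, not HTML Analyzer's own C source; the names SKIPPED_SCHEMES and should_check and the example addresses are hypothetical.

```python
from urllib.parse import urlsplit

# Interactive services that the validity check passes over.
SKIPPED_SCHEMES = {"telnet", "rlogin"}

def should_check(href):
    """Return True if a hyperlink target is one that would be verified."""
    href = href.strip()
    # Empty links and links within the same document are not checked.
    if not href or href.startswith("#"):
        return False
    return urlsplit(href).scheme.lower() not in SKIPPED_SCHEMES

links = ["", "#intro", "telnet://host.example/", "rlogin://host.example/",
         "docs/setup.html", "http://www.example.com/"]
print([href for href in links if should_check(href)])
```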

Checking for Completeness

The completeness check ensures that each anchor's contents always occur as a hyperlink. If a hyperlink contained the text Beginner's Guide, for example, and the same text occurred as regular text (not a hyperlink) elsewhere, this is reported. The intent of the completeness check is to improve user-convenience by expecting a hyperlink everywhere there can be, and also to prevent user confusion because the same text sometimes occurs in a hyperlink but not in others.
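
To make the idea concrete, the following Python sketch looks for anchor text that also appears as ordinary text in the same page. It is a simplified illustration of the completeness idea, not HTML Analyzer's actual implementation; the names AnchorScanner and unlinked_occurrences and the sample page are assumptions.

```python
from html.parser import HTMLParser

class AnchorScanner(HTMLParser):
    """Collect text inside hyperlinks separately from all other text."""
    def __init__(self):
        super().__init__()
        self.in_anchor = False
        self.anchor_parts = []
        self.anchor_texts = set()
        self.plain_parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "a" and dict(attrs).get("href"):
            self.in_anchor = True
            self.anchor_parts = []

    def handle_endtag(self, tag):
        if tag == "a" and self.in_anchor:
            text = " ".join("".join(self.anchor_parts).split())
            if text:
                self.anchor_texts.add(text)
            self.in_anchor = False

    def handle_data(self, data):
        target = self.anchor_parts if self.in_anchor else self.plain_parts
        target.append(data)

def unlinked_occurrences(html):
    """Return anchor texts that also occur as regular, unlinked text."""
    scanner = AnchorScanner()
    scanner.feed(html)
    scanner.close()
    plain = " ".join("".join(scanner.plain_parts).split())
    return sorted(text for text in scanner.anchor_texts if text in plain)

page = ('<p>Read the <a href="guide.html">Beginner\'s Guide</a> first.</p>'
        '<p>The Beginner\'s Guide is short.</p>')
print(unlinked_occurrences(page))
```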

Checking for Consistency

The final check ensures that every occurrence of a hyperlink anchor points to the same address and that every occurrence of that address is pointed to by the same hyperlink anchor. In other words, HTML Analyzer checks to see that there is a one-to-one correspondence between hyperlink anchors and their respective addresses.
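
The one-to-one rule can be sketched as two lookups: from anchor text to addresses, and from address to anchor texts. The following Python example is a simplified illustration rather than HTML Analyzer's own code; the names LinkCollector and inconsistencies and the sample page are hypothetical.

```python
from collections import defaultdict
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Record every (anchor text, address) pair found in a page."""
    def __init__(self):
        super().__init__()
        self.href = None
        self.parts = []
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.href, self.parts = href, []

    def handle_data(self, data):
        if self.href is not None:
            self.parts.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self.href is not None:
            text = " ".join("".join(self.parts).split())
            self.links.append((text, self.href))
            self.href = None

def inconsistencies(html):
    """Return texts with several addresses and addresses with several texts."""
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    by_text, by_href = defaultdict(set), defaultdict(set)
    for text, href in collector.links:
        by_text[text].add(href)
        by_href[href].add(text)
    texts = {t: sorted(h) for t, h in by_text.items() if len(h) > 1}
    hrefs = {h: sorted(t) for h, t in by_href.items() if len(t) > 1}
    return texts, hrefs

page = ('<a href="More_info.html">Free Text Frame</a> '
        '<a href="More_info.html">More Info Frame</a> '
        '<a href="Even_more_info.html">Free Text Frame</a>')
texts, hrefs = inconsistencies(page)
print(texts)
print(hrefs)
```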

Listing 18.2 is an example of the results of HTML_Analyzer. In this example, the file /u/CIMS/Demo_Description.html does not exist on the server named nsidc1.colorado.edu, an httpd server listening on port 1729. The first series of tests discovered this and notified the user. The later tests also report an incomplete link and an inconsistent link.

Listing 18.2 Sample Results of HTML_Analyzer
+++++++++++++++++++++++++++++++++++++++++++++++
VERIFYING LINKS...
WWW Alert:  HTTP server at nsidc1.colorado.edu:1729 replies:
HTTP/1.0 500 Unable to access document.
WWW Alert:  Unable to access document.
WARNING:  Failed in checking:
 http://nsidc1.colorado.edu:1729/u/CIMS/Demo_Description.html
   With content of:  Description of this demo
   In local file: ./temp/example.html

VERIFYING COMPLETENESS...
WARNING: These filenames contain the content:
   Description of this demo
 Without a link to:
  http://nsidc1.colorado.edu:1729/u/CIMS/Demo_Description.html
example.html

VERIFYING CONSISTENCY OF LINKS...
WARNING: Link used inconsistently.
  HREF: http://nsidc1.colorado.edu:1729/u/CIMS/More_info.html
  occurs 1 time with content:
Free Text Frame
  as in file: ./temp/example.html, but also
  occurs 1 time with content:
More Info Frame
  as in file: ./temp/example.html 

VERIFYING CONSISTENCY OF CONTENTS...
WARNING: Content used inconsistently.
  CONTENT:
Free Text Frame
  occurs 1 time with href: http://nsidc1.colorado.edu:1729/u/CIMS/Even_more_info.html
  as in file: ./temp/example.html, but also
  occurs 1 time with href: http://nsidc1.colorado.edu:1729/u/CIMS/More_info.html
  as in file: ./temp/example.html

 ++++++++++++++++++++++++++++++++++++++++++++++++++++

MOMspider

MOMspider is a PERL program originally written as a class project in distributed information systems at the University of California. MOMspider stands for Multi-Owner Maintenance Spider and is similar to other spiders and robots that traverse the World Wide Web looking for information. MOMspider is available from http://www.ics.uci.edu/WebSoft/MOMspider/ and requires libwww-perl, a library of PERL code for the World Wide Web available from the same site.

Because MOMspider is designed to follow hyperlinks anywhere on the Web, it has many features for controlling the depth of searches and is respectful of other sites' wishes not to be visited by automated robots like MOMspider. MOMspider also has an interesting feature that can build a diagram of the structure of the documents it finds. In addition, MOMspider can avoid sites that are known to cause problems for Web-roaming robots. Examples of these kinds of sites are those that use scripts to generate all output rather than static HTML documents.
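Sites commonly state such wishes in a robots.txt file at the top of the server. Assuming that is the mechanism meant here, the Python sketch below shows how a well-behaved robot might consult those rules before requesting a page, using the standard urllib.robotparser module. The robots.txt contents, the robot name, and the URLs are all hypothetical, and this is not MOMspider's own code.

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt, as a site might publish it at /robots.txt.
ROBOTS_TXT = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /private/
"""


def allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


for url in ("http://www.example.com/index.html",
            "http://www.example.com/cgi-bin/search"):
    print(url, allowed(ROBOTS_TXT, "ExampleSpider", url))
```

Excluding a script directory in this way also keeps robots away from the script-generated pages mentioned above.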

WebQuest's Webmeister

The WebQuest Server comes with a utility called Hyperlink Mode that shows you the hierarchical structure of your Web space based on the hyperlinks in your HTML documents. It starts with the default document of your Web space and then displays all the links. To see all hyperlinks in a particular document, double-click on the document's name.

Each hyperlink is then validated with a network call (see fig. 18.14). Valid links on the local server appear white; invalid links on the local server appear white with a red X superimposed. Valid links to remote servers appear yellow with a green check; invalid links to remote servers appear yellow with a red X.

Fig. 18.14 - Here you can learn the status of each link in a document.

Finding What's New

When your Web site is maintained by many people independently, as an internal server in a large organization might be, it becomes impractical, if not impossible, to require that HTML authors tell you every time they create or modify a page on your server. However, it is highly desirable that server administrators be able to quickly and easily find out what new items have been added each day in order to spot potential problems before they spread too far.

In addition to administrative concerns, information about new or modified documents on the server is helpful for users, who can look on the What's New page and see that the server is continually being updated with valuable information.

By including a find command in a shell or PERL script, you can easily generate a What's New page, as in the following PERL example (see listing 18.3).

Listing 18.3 PERL Script That Generates a What's New Page
#!/usr/bin/perl

# whatsnew.pl--David M. Chandler--January 13, 1995
# This program finds all files underneath the search directory which were
# created or modified one day ago. The output is an HTML What's New
# page with hyperlinks to the new pages.

# Invoke the script and redirect the output to your What's New page
# whatsnew.pl >whatsnew.html

#Put your server's document root here
$SEARCHDIR="/httpd/htdocs";

#Create header for What's New document
print "<TITLE>What's New</TITLE>\n";
print "<H1>What's New!</H1>\n";
print "The following documents were created or modified yesterday:<P>\n";
print "<DL>\n";

#Find all HTML files modified one day ago (-mtime 1 matches files
#between 24 and 48 hours old; use -mtime -1 for the last 24 hours)
foreach $file (`find $SEARCHDIR -type f -mtime 1 -name '*.html'`)
{
  #Remove the trailing newline that find leaves on each filename
  chomp($file);

  #Construct the URL from the filename by removing the directory path
  next unless $file =~ m%^$SEARCHDIR/(.*)%;
  $url = $1;

  #Find the document title; fall back to the URL if there is none
  $anchor = $url;
  $title = `grep -i '<TITLE>' '$file'`;
  if ($title =~ m%<TITLE>(.*)</TITLE>%i) {
    $anchor = $1; }

  #Create the What's New listing
  print "<DD><A HREF=\"$url\">$anchor</A>\n";
}
print "</DL>\n";

Windows for Workgroups users can accomplish this task easily in File Manager by using the Date Sort tool, which lists all files in chronological order. Likewise, many Windows-based shells, such as Norton Desktop or PC Tools for Windows, have similar features in their file management utilities. DOS users aren't fortunate enough to have the -mtime option available to list only those files modified recently; however, it is possible to see a directory listing sorted by date so that a quick scan reveals any new or modified files. To list a directory with the most recently created or modified files last, use

DIR /OD directory_name

To list a directory with the most recently created or modified files listed first, use

DIR /O-D directory_name 
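Where the -mtime option is not available, a short script can do the same filtering on any platform. The following Python sketch walks a directory tree and returns the HTML files modified within a given number of days. The function name, its parameters, and the one-day default are assumptions made for this example; it is not part of listing 18.3.

```python
import os
import time


def recently_modified(root, days=1.0, suffix=".html", now=None):
    """Return paths under root (relative, with forward slashes) whose
    modification time falls within the last `days` days, newest first."""
    now = time.time() if now is None else now
    cutoff = now - days * 86400
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.lower().endswith(suffix):
                continue
            path = os.path.join(dirpath, name)
            mtime = os.path.getmtime(path)
            if mtime >= cutoff:
                rel = os.path.relpath(path, root).replace(os.sep, "/")
                found.append((mtime, rel))
    return [rel for _mtime, rel in sorted(found, reverse=True)]
```

The returned relative paths can be used directly as HREF values on a What's New page, in the same way the PERL script uses $url.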

Conclusion

This chapter will set you well on the way to managing the usage of your Web site. You will be able to furnish the content managers with detailed and organized data on the accesses to the site and its pages. You will also be able to check the HTML pages that get placed on the server to see if they are linked properly. This will help to make your site more professional and productive.
