

27

WAIS: The Database of Databases

On the Internet, people have a remarkable desire to share knowledge. Why altruism should be a feature of cyberspace is anyone's guess, but the pioneer spirit may have something to do with it. Just as the Wild West campfire always had room for a stranger (in contrast to today's urban scene), the database always has room for another terminal. One of the great tools for finding useful stuff in many databases is WAIS.

The Wide Area Information Server (WAIS, pronounced ways) attempts to harness the vast data resources of the Internet by making it easy to search for and retrieve information from remote databases, called sources in WAIS terminology.

Sources are collections of files that consist mostly of textual material. For example, if chemistry is your forte, you can find several journals on the subject through WAIS. WAIS servers not only help you find the right source, they also handle your access to it.

Like Gopher, WAIS systems use the client-server model to make navigating around data resources easy. Unlike Gopher, WAIS does the searching for you. Currently, more than 520 sources are available through WAIS servers. A WAIS client (run either on your own computer or on a remote system through Telnet) talks to a WAIS server and asks it to perform a search for data containing a specific word or words.

Most WAIS servers are free, which means that the data is occasionally eccentric and erratic. The data can also have great gaps in coverage on some subjects and more coverage than you can believe on others. For example, you can find tons of material in WAIS about chemistry and computer science, but sources on, say, art history or the theory of juggling, are nonexistent at the moment. New WAIS servers and sources are created from time to time, so a library of Van Gogh's writings may yet be established.

WAIS is simple to use, although its text-based interface is a little user hostile. The X Window client is much easier to use, but requires that you run X Window (of course). WAIS clients are available for Macintoshes, PCs, and even supercomputers.

What Is WAIS?

WAIS was one of the first programs to be based on the Z39.50 standard. American National Standard Z39.50, Information Retrieval Service Definition and Protocol Specification for Library Applications, as revised by the National Information Standards Organization (NISO), attempts to provide interconnection of computer systems despite differences in hardware and software.

WAIS was among the first database systems to use this standard (which may well become a universal data-search format). Unfortunately, WAIS was based on an old version of the Z39.50 standard, and the newer standard is somewhat incompatible with the older one. There have been discussions, however, about making WAIS clients and servers that can use both protocols.


NOTE Z39.50 is similar in some respects to Structured Query Language (SQL), but it is simplified. Although this makes Z39.50 less powerful, it consequently makes it more general, so Z39.50 is likely to gain wide acceptance.

Z39.50 is an important step in making information sources on the Internet more accessible. Today, most Internet databases are accessed in ways completely different from each other. They use different standards for storing data and different tools to access that data. Although Z39.50 may change that, it is not yet clear when or how.

For example, one library catalog system may use find as its search command for a subject heading; another may use subject. Still another may use topic. If they all conformed to a standard, life would be much simpler. Z39.50-compliant systems all use the same format to construct queries. You don't have to know anything special to search a WAIS database. You just use whatever word you think may be used in relevant documents because WAIS indexes all the text in a source.

Document Rankings

After you run a search that identifies any documents, you receive a list of hits, or ranked document titles. The WAIS server ranks the hits from the most-relevant to the least-relevant document. Each document is scored, with the best-fitting document awarded 1,000 points. All other scores are relative to the top score.

WAIS ranks documents by the number of search words that occur in the document and the number of times those words appear.
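A minimal Python sketch of this kind of ranking follows. The exact weighting WAIS applies is not given here, so the formula (occurrences plus number of distinct query words found) is an assumption, as are the function name and the sample documents; only the idea of scaling the best match to 1,000 comes from the description above.

```python
def rank(documents, query_words):
    """Score documents by query-word hits; the best match is scaled to 1000."""
    query_words = [q.lower() for q in query_words]
    raw = {}
    for title, text in documents.items():
        words = text.lower().split()
        # How many times the query words appear in total (assumed weighting).
        occurrences = sum(words.count(q) for q in query_words)
        # How many different query words appear at least once.
        distinct = sum(1 for q in query_words if q in words)
        raw[title] = occurrences + distinct
    best = max(raw.values(), default=0)
    if best == 0:
        return []  # nothing matched
    scored = [(round(1000 * s / best), t) for t, s in raw.items() if s > 0]
    # Highest score first; ties are listed alphabetically by title.
    return sorted(scored, key=lambda pair: (-pair[0], pair[1]))

# Hypothetical sample documents, for illustration only.
docs = {
    "z3950-spec": "z39.50 protocol z39.50 wais",
    "waisprot": "wais protocol",
    "weather": "rain tomorrow",
}
print(rank(docs, ["wais", "Z39.50"]))
```

Documents that contain none of the search words are left out of the list entirely, and every other score is expressed relative to the top one.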

WAIS servers also take into consideration the length of the document. WAIS servers are smart enough to exclude common words, called stop words, to make the search manageable. Words such as a, about, above, across, after, the, and so on should be excluded from your search because the frequency of their appearance in most documents makes them irrelevant in most searches. For example, if you search for Who is Richard Simmons?, who and is are excluded from the search because they are stop words.
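As a rough illustration of the stop-word step, the following Python sketch drops stop words from a query before it is searched. The stop list is a small made-up sample, not the list any real WAIS server uses, and the function name is hypothetical.

```python
import re

# Illustrative stop list only; each server's administrator chooses the real one.
STOP_WORDS = {"a", "about", "above", "across", "after", "the", "who", "is"}

def strip_stop_words(query):
    """Return the query's words, lowercased, with stop words removed."""
    words = re.findall(r"[A-Za-z][A-Za-z0-9]*", query)
    return [w.lower() for w in words if w.lower() not in STOP_WORDS]

print(strip_stop_words("Who is Richard Simmons?"))  # ['richard', 'simmons']
```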


NOTE Stop words are controlled by the administrator of each WAIS server. In addition to generally common words, many words common to a database may become stop words. For example, the word WAIS may be a stop word in the database of a WAIS newsgroup; the word Internet may be a stop word in a database of Internet protocols.

In a WAIS server, a word is a series of alphanumeric characters, possibly with some embedded punctuation. A word must start with an alphabetic character: you can't search for numbers. A word can have embedded periods, ampersands, or apostrophes, but only the first kind of punctuation that you use is treated as punctuation. Any other punctuation is interpreted as a space and ends the word. I.M.Pei is a valid word and so is AT&T, but A.T.&T. is two words: A.T. and T.

Hyphens are not accepted as embedded punctuation because they're used so freely that they inflate the database dictionary.
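The word rules above can be sketched in Python as follows. This is an illustration of the rules as described, not the server's actual code; the function name is hypothetical, and the sketch assumes that punctuation counts as embedded only when another letter or digit follows it, so trailing punctuation is dropped.

```python
EMBEDDED = {".", "&", "'"}  # hyphens are deliberately not included

def wais_words(text):
    """Split text into words following the rules described above (a sketch)."""
    words = []
    i, n = 0, len(text)
    while i < n:
        if text[i].isdigit():
            # A word must start with a letter: skip the whole numeric run.
            while i < n and text[i].isalnum():
                i += 1
            continue
        if not text[i].isalpha():
            i += 1  # spaces and stray punctuation separate words
            continue
        start, kind = i, None
        i += 1
        while i < n:
            ch = text[i]
            if ch.isalnum():
                i += 1
            elif (ch in EMBEDDED and kind in (None, ch)
                  and i + 1 < n and text[i + 1].isalnum()):
                kind = ch  # only the first kind of punctuation is kept
                i += 1
            else:
                break  # any other punctuation ends the word
        words.append(text[start:i])
    return words

print(wais_words("I.M.Pei"))   # ['I.M.Pei']
print(wais_words("AT&T"))      # ['AT&T']
print(wais_words("A.T.&T."))   # ['A.T', 'T']
```

Under the trailing-punctuation assumption, A.T.&T. comes out as A.T and T: the ampersand is a second kind of punctuation, so it ends the first word.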

Two classes of words are ignored in queries. First are stop words chosen by the database administrator for their complete lack of value in searching. There are 368 stop words for the public CM WAIS server. Some common stop words are a, about, aren't, further, he, will, and won't—you get the idea.

The second class is buzz words: words that are far too common in a particular database to be helpful in searches. These are weeded out by the database software as the database is built. There are currently 777 buzz words for the public CM WAIS server, each of which occurs at least 8,000 times in the database. They include words such as able, access, account, act, action, add, added, addition, additional, address, addresses, administration, all the way through to winkel (I have absolutely no idea why that one's in there).
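One way to picture how overly common words are weeded out while a database is built is a simple frequency count. The Python sketch below is an assumption about the general approach, not the server's actual code; the function name is hypothetical, and the 8,000 default simply mirrors the figure quoted above.

```python
from collections import Counter

def find_buzz_words(documents, threshold=8000):
    """Return the words that occur at least `threshold` times across all documents."""
    counts = Counter()
    for text in documents:
        counts.update(text.lower().split())
    return {word for word, n in counts.items() if n >= threshold}

# Tiny demonstration with a low threshold so the effect is visible.
sample = ["access the file", "access access denied"]
print(find_buzz_words(sample, threshold=3))  # {'access'}
```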

Limitations

You cannot use Boolean logic in most WAIS searches. That is, you can't do anything other than find a single word or several words. A search for cow and farm searches for documents that contain cow and/or and and/or farm, so the and should be left out of the search. Notice that the search is "and/or," not just "and": a search for cow farm gives you all documents that contain cow, farm, or both.
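To make the and/or behavior concrete, here is a minimal Python sketch in which a document matches if it contains any of the query words. The function name and the sample documents are hypothetical.

```python
def matches_any(document, query):
    """True if the document contains at least one of the query words."""
    doc_words = set(document.lower().split())
    return any(word in doc_words for word in query.lower().split())

docs = ["the cow jumped", "a farm in Ohio", "cow on a farm", "cows grazing"]
hits = [d for d in docs if matches_any(d, "cow farm")]
print(hits)  # ['the cow jumped', 'a farm in Ohio', 'cow on a farm']
```

The last sample document is not returned because the match is on whole words: cows is not the same word as cow.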

This limitation won't always be the way of things; already there's a new version of WAIS called FREEWAIS (get it? freeways?) that does support Boolean searches.

Also, no wildcard searching is available in WAIS. This means that you can't specify that you would accept cows as well as cow.

Unlike many regular database searches, WAIS searches can't be expanded to include articles that may talk about similar topics or to retrieve all articles that have those words (for example, cars or automobiles or trucks or motorcycles). Neither can you exclude words in a search (for example, cars but not trucks).

You can, however, increase the number of relevant documents by using more specific terms in a search. A search for car automobile crash statistics may retrieve more pertinent documents on the subject you want.

What Is Available?

The sources available through WAIS are as varied as the groups that communicate over the Internet: Renaissance music, beer brewing, Aesop's fables, software reviews, recipes, ZIP code information, a thesaurus, environmental reports, and many other databases are available.

The WAIS system for Thinking Machines alone gives access to more than 60,000 documents, including weather maps and forecasts, the CIA World Factbook, a collection of molecular biology abstracts, Usenet's Info Mac digests, and the Connection Machine's FORTRAN manual (a must for pipe-stress freaks and crystallography addicts). The Massachusetts Institute of Technology makes a compendium of classical and modern poetry available through WAIS.

Where To Get WAIS

WAIS was developed by Thinking Machines Corporation, Apple Computer, and Dow Jones; access to the system is available free from Thinking Machines by connecting to telnet://quake.think.com/WAIS.

As an alternative, WAIS client software (both executable and source) is available through anonymous FTP at Thinking Machines (use the same Internet address) in the pub/wais/ directory. WAIS clients are available for a number of operating systems (X Window, DOS, Macintosh, and others), but they do require that your computer have some kind of TCP/IP connection to the Internet.

Searching WAIS

You can access WAIS in three ways. First, you can Telnet to quake.think.com and log in as wais. Second, you can run a local WAIS client; your system administrator may have set up your system so that typing wais automatically connects you to whatever WAIS service is available. Third, you can get to WAIS through Gopher, where an entry on Gopher menus such as Other Gopher and Information Servers will lead you eventually to WAIS.

The first screen you see on WAIS is a list of the WAIS servers and sources available. At the time of this writing, 529 WAIS sources are available through the WAIS client at Thinking Machines, starting with aarnet-resource-guide and ending with zipcodes.

The following example screen gives you a reference number for each source, the location of the WAIS server in brackets, the name of the source, and the cost of searching that source. At this time, all WAIS servers available through Thinking Machines are free.

    #    Server                  Source                        Cost
    001: [           archie.au]  aarnet-resource-guide         Free
    002: [     munin.ub2.lu.se]  academic_email_conf           Free
    003: [wraith.cs.uow.edu.au]  acronyms                      Free
    004: [    archive.orst.edu]  aeronautics                   Free
    005: [ bloat.media.mit.edu]  Aesop-Fables                  Free
    006: [ ftp.cs.colorado.edu]  aftp-cs-colorado-edu          Free
    007: [nostromo.oes.orst.ed]  agricultural-market-news      Free
    008: [    archive.orst.edu]  alt.drugs                     Free
    009: [    wais.oit.unc.edu]  alt.gopher                    Free
    010: [sun-wais.oit.unc.edu]  alt.sys.sun                   Free
    011: [    wais.oit.unc.edu]  alt.wais                      Free
    012: [alfred.ccs.carleton.]  amiga-slip                    Free
    013: [     munin.ub2.lu.se]  amiga_fish_contents           Free
    014: [   coombs.anu.edu.au]  ANU-Aboriginal-Studies        $0.00/minute
    015: [   coombs.anu.edu.au]  ANU-Asian-Computing           $0.00/minute
    016: [   coombs.anu.edu.au]  ANU-Asian-Religions           $0.00/minute
    017: [   coombs.anu.edu.au]  ANU-CAUT-Projects             $0.00/minute
    018: [   coombs.anu.edu.au]  ANU-French-Databanks          $0.00/minute

    Keywords:

    <space> selects, w for keywords, arrows move, <return> searches, q quits, or ?

You are now ready to conduct a search. As is true with Gopher, the problem with using WAIS is deciding which of the 529 sources to search. An added problem is that the names of the sources don't necessarily describe what they contain. Fortunately, a directory of servers is available that contains short abstracts of the contents of each source and other information about where it comes from. Until you know exactly which source you want to search, you should start with the directory of servers.

How do you get there? The preceding screen looks like an alphabetical list of WAIS sources, so using the down-arrow key can do the trick but may take a while. Issuing the ? (help) command displays the online help that comes with this client:

    SWAIS Source Selection Help Page: 1

    j, down arrow, ^N     Move Down one source
    k, up arrow, ^P       Move Up one source
    J, ^V, ^D             Move Down one screen
    K, <esc> v, ^U        Move Up one screen
    ###                   Position to source number ##
    /sss                  Search for source sss
    <space>, <period>     Select current source
    =                     Deselect all sources
    v, <comma>            View current source info
    <ret>                 Perform search
    s                     Select new sources (refresh sources list)
    w                     Select new keywords
    X, -                  Remove current source permanently
    o                     Set and show swais options
    h, ?                  Show this help display
    H                     Display program history
    q                     Leave this program

    Press any key to continue

This help screen tells you how to move through the screens of the source directory. WAIS uses UNIX editor commands for moving around (the j and J, for example, are UNIX editor commands for moving down by line or by screen). Try your Page Down and arrow keys; they may work if you're using VT-100 terminal emulation. The /sss command is important because it quickly moves the pointer to a source by name; typing a source number moves the pointer to that line instead. Also note that the space or period selects a source; the equal sign deselects all sources.


NOTE Unless your terminal emulator does a good VT-100 emulation, don't bother with swais; you'll go crazy trying to figure out what's going on.


TIP Here's a feature not covered in the swais help screen: use the spacebar or period on a selected source to deselect it.

It's too bad that the directory of servers isn't the first item on the list of sources. You know the name, so use a forward slash with the name of the server to get there. Type /dir to get close; after the screen is refreshed with names of new sources, use the down arrow key or type j once to highlight the directory of servers.

    SWAIS Source Selection Sources: 429

    #    Server                  Source                        Cost
    145: [     ds.internic.net]  ddbs-info                     Free
    146: [        irit.irit.fr]  directory-irit-fr             Free
    147: [     quake.think.com]  directory-of-servers          Free
    148: [      zenon.inria.fr]  directory-zenon-inria-fr      Free
    149: [      zenon.inria.fr]  disco-mm-zenon-inria-fr       Free
    150: [        wais.cic.net]  disi-catalog                  Free
    151: [     munin.ub2.lu.se]  dit-library                   Free
    152: [ ridgisd.er.usgs.gov]  DOE_Climate_Data              Free
    153: [        wais.cic.net]  domain-contacts               Free
    154: [        wais.cic.net]  domain-organizations          Free
    155: [ ftp.cs.colorado.edu]  dynamic-archie                Free
    156: [  wais.wu-wien.ac.at]  earlym-l                      Free
    157: [           bio.vu.nl]  EC-enzyme                     Free
    158: [        kumr.lns.com]  edis                          Free
    159: [    ivory.educom.edu]  educom                        Free
    160: [        wais.eff.org]  eff-documents                 Free
    161: [        wais.eff.org]  eff-talk                      Free
    162: [     quake.think.com]  EIA-Petroleum-Supply-Monthly  Free

Remember that you are not searching a huge database containing source materials but a database of descriptions of source databases. The terms you choose should reflect what the author or owner of the database would probably use to describe it.

The following example search uses the words wais and Z39.50 to find information on the NISO standard and how WAIS uses it. WAIS retrieves the source descriptions that contain those words and returns them in ranked order, with the sources it thinks most likely to contain your information listed first. The first item, scored 1,000, is the one WAIS thinks is most likely to contain what you're looking for.

    SWAIS Search Results Items: 40

    #    Score  Source             Title                        Lines
    001: [1000] (directory-of-se)  cool-cfl                        76
    002: [ 953] (directory-of-se)  dynamic-archie                  59
    003: [ 858] (directory-of-se)  wais-docs                       24
    004: [ 834] (directory-of-se)  wais-talk-archives              18
    005: [ 810] (directory-of-se)  alt.wais                        18
    006: [ 810] (directory-of-se)  wais-discussion-archives        18
    007: [ 691] (directory-of-se)  cool-net                        50
    008: [ 572] (directory-of-se)  aftp-cs-colorado-edu           144
    009: [ 476] (directory-of-se)  bionic-directory-of-servers     31
    010: [ 452] (directory-of-se)  cicnet-wais-servers             55
    011: [ 381] (directory-of-se)  cool-lex                        59
    012: [ 333] (directory-of-se)  IUBio-INFO                      71
    013: [ 333] (directory-of-se)  directory-of-servers            32
    014: [ 333] (directory-of-se)  sample-pictures                 23
    015: [ 333] (directory-of-se)  utsun.s.u-tokyo.ac.jp           32
    016: [ 309] (directory-of-se)  journalism.periodicals          58
    017: [ 309] (directory-of-se)  x.500.working-group             38
    018: [ 286] (directory-of-se)  ANU-Theses-Abstracts            89

This search resulted in some irrelevant sources. For example, cool-cfl is a database of files from a group concerned with conservation in libraries, archives, and museums. This might be a bug in WAIS—not improbable; Internet software is being developed and improved continuously.

The second source, dynamic-archie, discusses a Dynamic WAIS prototype at the University of Colorado that performs Archie searches with WAIS. This could be useful...and so could the next four sources. The rest don't seem to be relevant.

The information that describes the sources in WAIS is determined by the owners of the source. Some sources, such as ERIC databases, give detailed information that makes the directory of servers a valuable tool in finding out which sources are relevant. Other sources have minimal descriptions that aren't very useful or won't be found through the directory of servers. Such source descriptions are probably of use only to people who already know the sources are available through WAIS.

From here, press the letter s to return to the list of sources, use the /wais command to jump to the sources with wais in the name, and select the three of them with the spacebar.

    SWAIS Source Selection Sources: 429

    #      Server                  Source                        Cost
    415: * [     quake.think.com]  wais-discussion-archives      Free
    416: * [     quake.think.com]  wais-docs                     Free
    417: * [     quake.think.com]  wais-talk-archives            Free
    418:   [hermes.ecn.purdue.ed]  water-quality                 Free
    419:   [     quake.think.com]  weather                       Free
    420:   [     sunsite.unc.edu]  White-House-Papers            Free
    421:   [    wais.nic.ddn.mil]  whois                         Free
    422:   [     sunsite.unc.edu]  winsock                       Free
    423:   [ cmns-moon.think.com]  world-factbook                Free
    424:   [     quake.think.com]  world91a                      Free
    425:   [        wais.cic.net]  wuarchive                     Free
    426:   [        wais.cic.net]  x.500.working-group           Free
    427:   [wais.unidata.ucar.ed]  xgks                          Free
    428:   [      cs.widener.edu]  zen-internet                  Free
    429:   [     quake.think.com]  zipcodes                      Free

You could also select the alt.wais group (the one ranked fifth in your initial search), but these three will work. Using Z39.50 as a search criterion simplifies the search; the word wais is probably scattered throughout most of the documents, lessening its relevance to the search. To enter the search text, select the sources you want to search; you are then prompted for keywords. After typing the keywords, press Enter; WAIS searches each selected source and ranks the results according to their relevance.

    SWAIS Search Results Items: 39

    #    Score  Source             Title                                    Lines
    001: [1000] (      wais-docs)  z3950-spec                                2674
    002: [1000] (wais-talk-archi)  Edward Vie Re: [wald@mhuxd.att.com: more   383
    003: [1000] (wais-discussion)  Clifford L Re: The Z39.50 Protocol: Ques   325
    004: [ 939] (wais-discussion)  Brewster K Re: online version of the z39  2659
    005: [ 893] (wais-discussion)  akel@seq1. Re: Net resource list model(s   347
    006: [ 823] (      wais-docs)  waisprot                                  1004
    007: [ 800] (wais-discussion)  Michael Sc Re: Dynamic WAIS prototype an    27
    008: [ 338] (wais-discussion)  harvard!ap Re: Z39.50 Product Announceme    51
    009: [ 333] (      wais-docs)  protspec                                   915
    010: [ 331] (wais-discussion)  Unknown Subject                              6
    011: [ 331] (wais-discussion)  uriel wile Re: poetry server is up [most    31
    012: [ 313] (wais-talk-archi)  brewster@q Re: Re: Information about z39    69
    013: [ 313] (wais-talk-archi)  ses@cmns.t Re: Z39.50 1992                 171
    014: [ 313] (wais-talk-archi)  ses@cmns.t Re: Z39.50 1992                  90
    015: [ 308] (wais-discussion)  Brewster K Re:Hooking up WAIS with othe     66
    016: [ 292] (wais-discussion)  Brewster K Re: [morris@Think.COM: it's s    25
    017: [ 286] (wais-talk-archi)  mitra@pand Re:Z39.50 1992                   71
    018: [ 284] (wais-discussion)  Brewster K Re: WAIS-discussion digest #6    18

The results look promising. The first document, z3950-spec, is scored 1,000; in fact, the first three documents all score 1,000 and seem to be relevant. For each result, the screen gives the name of the information source, along with the title of the document. In most cases here, the title appears to come from e-mail message subject headings. Finally, the screen gives the number of lines contained in the document.

From here, you can read each result and have pertinent results e-mailed to you or even to another person. At the search result screen, type the letter m to receive a prompt asking for an e-mail address. If none of the documents are relevant, you can go back to the sources and redefine the search strategies or add additional appropriate sources to search. The sample documents contain the desired information, so this search has worked.

Because WAIS uses natural language query in its search mode and searches the full-text index of the source, changing any of the search words produces different results. Using a natural language search such as how does wais use Z39.50 protocol produces the following results:

    SWAIS Search Results Items: 39

    #    Score  Source             Title                                    Lines
    001: [1000] (      wais-docs)  z3950-spec                                2674
    002: [1000] (wais-talk-archi)  Edward Vie Re: [wald@mhuxd.att.com: more   383
    003: [1000] (wais-discussion)  Michael Sc Re: Dynamic WAIS prototype an    27
    004: [ 998] (wais-discussion)  Brewster K Re: online version of the z39  2659
    005: [ 777] (wais-talk-archi)  news-mail- Re: WAIS-discussion digest #4   554
    006: [ 675] (wais-talk-archi)  news-mail- Re: WAIS-discussion digest #3   535
    007: [ 640] (wais-talk-archi)  news-mail- Re: WAIS-discussion digest #3   636
    008: [ 629] (wais-talk-archi)  brewster@t Re: WAIS-discussion digest #5   749
    009: [ 608] (wais-talk-archi)  news-mail- Re: WAIS-discussion digest #4   601
    010: [ 607] (wais-talk-archi)  fad@think. Re: WAIS Corporate Paper "      424
    011: [ 607] (wais-talk-archi)  composer@b Re: WAIS, A Sketch of an Over   449
    012: [ 589] (wais-talk-archi)  news-mail- Re: WAIS-discussion digest #4   621
    013: [ 549] (wais-talk-archi)  news-mail- Re: WAIS-discussion digest #3   575
    014: [ 524] (wais-talk-archi)  brewster@t Re: WAIS-discussion digest #4   682
    015: [ 515] (wais-talk-archi)  news-mail- Re: WAIS-discussion digest #3   521
    016: [ 510] (wais-talk-archi)  news-mail- Re: WAIS-discussion digest #4   480
    017: [ 507] (wais-discussion)  akel@seq1. Re: Net resource list model(s   347
    018: [ 495] (wais-discussion)  Unknown Subject                              6

Although many of the results are duplicates of the search using just the text Z39.50, some new documents are listed. An extensive search for all relevant documents may mean using different search strategies and a variety of WAIS source servers.

WAIS Indexing

In addition to its search features, WAIS also functions as a data-indexing tool. WAIS can take large amounts of information, index it, and make the resultant Z39.50-compliant database searchable. You can build an indexed database for your own use as a stand-alone database or, if you have a TCP/IP connection, you can make your WAIS database public by registering it with wais.com and listing it in the directory of servers.
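The general idea of a full-text index can be sketched in a few lines of Python. This is a generic inverted index for illustration only; it is not the file format or algorithm the WAIS indexer actually uses, and the function names and sample documents are hypothetical.

```python
from collections import defaultdict

def build_index(documents):
    """Map every word to the set of document titles that contain it."""
    index = defaultdict(set)
    for title, text in documents.items():
        for word in text.lower().split():
            index[word].add(title)
    return index

def lookup(index, query):
    """Return the titles containing any of the query words, sorted by name."""
    titles = set()
    for word in query.lower().split():
        titles |= index.get(word, set())
    return sorted(titles)

# Hypothetical sample documents.
docs = {"aesop": "the fox and the crow", "recipes": "roast crow pie"}
index = build_index(docs)
print(lookup(index, "crow"))  # ['aesop', 'recipes']
```

Because every word of every document goes into the index, a query can be answered by looking up a few words rather than rereading the documents themselves.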

To obtain the WAIS software, go to ftp://think.com/wais. This is the main distribution site for WAIS software and WAIS documentation. Both the WAIS server code and client code are available from think.com. You can find more WAIS software at ftp://ftp.cnidr.org/pub/NIDR.tools/wais, ftp://ftp.wais.com/pub, and ftp://sunsite.unc.edu/pub/wais.

Getting WAIS up and running is no trivial matter. Because it's very complicated, we'll leave that as an exercise for more daring users with time on their hands and a good supply of Valium.

The Ways of WAIS

The use of WAIS is growing slowly on the Internet. WAIS provides a convenient and efficient way to index and search large amounts of information using standards that are starting to be generally accepted on the Internet.

However, WAIS faces some tough competition. The WWW, Harvest, and Hyper-G tools offer facilities that may replace WAIS. Even so, it is likely that WAIS will survive and be used in niche database applications.
