May 2001
The Internet provides access to a wealth of information on countless topics contributed by people throughout the world. On the Internet, a user has access to a wide variety of services: electronic mail, file transfer, vast information resources, interest group membership, interactive collaboration, multimedia displays, and more. The Internet consists primarily of a variety of access protocols. These include e-mail, FTP, HTTP, Telnet, and Usenet news. Many of these protocols feature programs that allow users to search for and retrieve material made available by the protocol.
For background information on Internet access protocols, see A Basic Guide to the Internet.
The Internet is not a library in which all its available items are identified and can be retrieved by a single catalog. In fact, no one knows how many individual files reside on the Internet. The number certainly runs into the many millions and is growing at a rapid pace.
The Internet is a self-publishing medium. This means that anyone with a small amount of technical skill and access to a host computer can publish on the Internet. It is important to remember this when you locate sites in the course of your research. Internet sites change over time according to the commitment and inclination of the creator. Some sites demonstrate an expert's knowledge, while others are amateur efforts. Some may be updated daily, while others may be outdated. As with any information resource, it is important to evaluate what you find on the Internet. For more information, see Evaluating Internet Resources.
Also be aware that the addresses of Internet sites frequently change. Web sites can disappear altogether. Do not expect stability on the Internet.
One of the most efficient ways of conducting research on the Internet is to use the World Wide Web. Since the Web includes most Internet protocols, it offers access to a great deal of what is available on the Internet.
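There are a number of basic ways to access information on the Internet:
- Join an e-mail discussion group or Usenet newsgroup
- Go directly to a site if you have its address
- Browse
- Explore a subject directory
- Conduct a search using a Web search engine
- Explore the information stored in databases on the deep Web
Each of these options is described below.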
Join any of the thousands of e-mail discussion groups or Usenet newsgroups. These groups cover a wealth of topics. You can ask questions of the experts and read the answers to questions that others ask. Belonging to these groups is somewhat like receiving a daily newspaper on topics that interest you. These groups provide a good way of keeping up with what is being discussed on the Internet about your subject area. In addition, they can help you find out how to locate information--both online and offline--that you want.
E-mail discussion groups can be associated with academic institutions. Many topics are scholarly in nature, and it is not unusual for experts in the field to be among the participants. In contrast, Usenet newsgroups cover a far wider variety of topics and participants have a range of expertise. Be careful to evaluate the knowledge and opinions offered in any discussion forum. Note also that a small number of e-mail groups are cross-posted as Usenet newsgroups. For example, the early music e-mail group EARLYM-L also exists as the newsgroup rec.music.early.
E-mail discussion groups are managed by software programs. There are three in common use: Listserv, Majordomo, and Listproc. The commands for using these programs are similar.
A list of Usenet newsgroups can be accessed from within a newsreader program. Web browser suites such as Netscape Communicator include a newsreader. This offers the convenience of Usenet access in a graphical environment as a part of the Web experience.
A good Web-based directory to assist in locating e-mail discussion groups and Usenet newsgroups is Liszt, located at http://www.liszt.com/.
If you know the Internet address of a site you wish to visit, you can use a Web browser to access that site. All you need to do is type the URL in the appropriate location window. URL stands for Uniform Resource Locator. The URL specifies the Internet address of the electronic document. Every file on the Internet, no matter what its access protocol, has a unique URL. Web browsers use the URL to retrieve the file from the host computer and the directory in which it resides. This file is then displayed on the user's computer monitor.
This is the format of the URL: protocol://host/path/filename
For example:
http://www.house.gov/agriculture/schedule.htm - a hypertext file on the Web
ftp://ftp.uu.net/graphics/picasso - a file at an FTP site
telnet://opac.albany.edu - a Telnet connection
Any of these addresses can be typed into the location window of a Web browser.
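To see how the parts of a URL fit the format above, the following minimal sketch splits each example address into its protocol, host, and path. It uses Python's standard urllib.parse module; the function name describe_url is hypothetical and chosen only for illustration.

```python
from urllib.parse import urlsplit


def describe_url(url):
    """Split a URL into its protocol, host, and path/filename parts."""
    parts = urlsplit(url)
    return {
        "protocol": parts.scheme,   # e.g. http, ftp, telnet
        "host": parts.hostname,     # the host computer
        "path": parts.path,         # directory and filename, may be empty
    }


# The three example addresses from the text above.
for example in (
    "http://www.house.gov/agriculture/schedule.htm",
    "ftp://ftp.uu.net/graphics/picasso",
    "telnet://opac.albany.edu",
):
    print(describe_url(example))
```

Note that the Telnet address has no path at all, since it names a connection to a host rather than a file.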
Browsing home pages on the Web is a haphazard but interesting way of finding desired material on the Internet. Because the creator of a home page programs each link, you never know where these links might lead. High quality starting pages will contain high quality links. The University Libraries Web site contains quality links leading into the World Wide Web, and is a good place to start your exploration. This site is located at http://library.albany.edu/.
An increasing number of universities, libraries, companies, organizations, and even volunteers are creating subject directories to catalog portions of the Internet. These directories are organized by subject and consist of links to Internet resources relating to these subjects. The major subject directories available on the Web tend to have overlapping but different databases. Most directories provide a search capability that allows you to query the database on your topic of interest.
There are two basic types of directories: academic and professional directories often created and maintained by subject experts to support the needs of researchers, and directories contained on commercial portals that cater to the general public and are competing for traffic. Be sure you use the directory that appropriately meets your needs.
Subject directories differ significantly in selectivity. For example, the famous Yahoo! site does not carefully evaluate user-submitted content when adding Web pages to its database. It is therefore NOT a reliable research source and should not be used for this purpose. In contrast, the Argus Clearinghouse selects only a small number of the subject guides submitted for inclusion, and rates them according to a standard. Consider the policies of any directory that you visit. One challenge to this is the fact that not all directory services are willing to disclose either their policies or the names and qualifications of site reviewers. A number of subject directories consist of links accompanied by annotations that describe or evaluate site content. A well-written annotation from a known reviewer is more useful than an annotation written by the site creator as is usually the case with Yahoo!
It is useful to understand that certain directories are the result of many years of intellectual effort. For this reason, it is important to consult subject directories when doing research on the Web.
The University Libraries Web site includes a list of Internet Subject Directories.
Recommended starting points:
An Internet search engine allows the user to enter keywords relating to a topic and retrieve information about Internet sites containing those keywords. Search engines are available for many of the Internet protocols. For example, Archie searches for files stored at anonymous FTP sites.
Search engines located on the World Wide Web have become quite popular as the Web itself has become the Internet's environment of choice. Web search engines have the advantage of offering access to a vast range of information resources located on the Internet. Many search engines compile a database spanning multiple Internet protocols, including HTTP, FTP, and Usenet. They may also search multimedia or other file types on the deep Web, often accessible as separate searches. Web search engines tend to be developed by private companies, though most of them are available free of charge.
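A Web search engine service consists of three components:
- Spider: a program that traverses the Web from link to link, identifying and reading pages
- Index: a database containing a copy of each Web page gathered by the spider
- Search mechanism: software that lets users query the index and returns the results, usually in ranked order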
Keep in mind that spiders are indiscriminate. Be aware that some of the resources they collect may be outdated, inaccurate, or incomplete. Others, of course, may come from responsible sources and provide you with valuable information. Be sure to evaluate all your search results carefully.
With most search engines, you fill out a form with your search terms and then ask that the search proceed. The engine searches its index and generates a page with links to those resources containing some or all of your terms. These resources are usually ranked according to how your terms appear in them. For example, a document will appear higher in your list of results if your search terms appear many times, near the beginning of the document, close together in the document, or in the document title. Engines that rank in this way may be thought of as first generation search engines.
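The kind of term-based ranking described above can be sketched in a few lines. This is an illustration only, not the method of any particular search engine: the weights (one point per occurrence, a bonus of 5 for a title match, a small bonus for an early first occurrence) are assumptions made for the example, and the closeness of terms to one another is not modeled.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def score(document, terms):
    """Score one document; all weights here are illustrative assumptions."""
    title_words = tokenize(document["title"])
    body_words = tokenize(document["body"])
    total = 0.0
    for term in terms:
        term = term.lower()
        total += body_words.count(term)        # more occurrences rank higher
        if term in title_words:
            total += 5                         # assumed bonus for a title match
        if term in body_words:
            first = body_words.index(term)
            total += 1.0 / (1 + first)         # earlier occurrence ranks higher
    return total


def rank(documents, terms):
    """Return titles of matching documents, best score first."""
    scored = [(score(d, terms), d["title"]) for d in documents]
    scored.sort(key=lambda pair: -pair[0])
    return [title for s, title in scored if s > 0]


docs = [
    {"title": "Weather report",
     "body": "Rain is expected. The budget was not discussed."},
    {"title": "Budget talks resume",
     "body": "The budget talks resumed today as the budget deadline nears."},
    {"title": "Sports roundup",
     "body": "The home team won again."},
]

print(rank(docs, ["budget"]))  # ['Budget talks resume', 'Weather report']
```

The second document ranks first because the term appears twice, appears early, and appears in the title. The third document does not contain the term and is left out of the results.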
A new development in search engine technology is the ordering of search results by concept, keyword, site, links or popularity. Engines that support these features may be thought of as second generation search engines. These engines offer improvements in the ranking of results. One reason for this is the insertion of the human element in determining what is relevant. For example, Google ranks a Web page according to the number of highly ranked Web pages that link to it. Those pages in turn become highly ranked if still other highly ranked pages link to them. This scheme represents an intriguing melding of technology and human judgment.
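The idea that a page is ranked highly when other highly ranked pages link to it is circular, so it is usually computed by repetition: start with equal ranks and pass rank along the links until the numbers settle. The sketch below is a simplified illustration of that idea and should not be read as Google's actual method. The function name link_rank, the damping value of 0.85, the number of iterations, and the five-page link graph are all assumptions made for the example.

```python
def link_rank(links, iterations=50, damping=0.85):
    """Estimate page ranks from a dict mapping each page to the pages it links to."""
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}     # start with equal ranks
    for _ in range(iterations):
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its rank on, split evenly among its links.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no links spreads its rank over all pages.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical link graph: pages a, b, and d all link to c.
links = {"a": ["c"], "b": ["c"], "c": ["d"], "d": ["c"], "e": ["d"]}
ranks = link_rank(links)
print(max(ranks, key=ranks.get))  # c
```

In this graph, page d also ends up well ahead of pages a, b, and e. Part of the reason is that one of the pages linking to d is c, which is itself highly ranked.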
All search engines have rules for formulating queries. It is imperative that you read the help files at the site before proceeding. Online tutorials can also help you learn the rules. A short list of recommended tutorials appears at the end of this file.
Recommended starting points:
For a more extensive list of recommended Web search engines, see Internet Search Engines.
The concept of the "deep" or "invisible" Web has emerged in recent months. This refers to content that is stored in databases accessible on the Web but not available via search engines. In other words, this content is "invisible" to search engines. This is because spiders cannot or will not enter into databases and extract content from them as they can from static Web pages. In the past, these databases were fewer in number and referred to as specialty databases, subject specific databases, and so on.
The only way to access information on the invisible Web is to search the databases themselves. Topical coverage runs the gamut from scholarly resources to commercial entities. Very current, dynamically changing information is likely to be stored in databases, including news, job listings, available airline flights, etc. As the number of Web-accessible databases grows, it will become essential that they be used to conduct successful information finding on the Web.
Other content not gathered by spiders includes non-textual files such as multimedia files, graphical files, and documents in non-standard formats such as Portable Document Format (PDF).
Keep in mind that many search engine sites and commercial portals feature searchable databases as part of their package of services. This phenomenon falls under the heading of converging content. For example, you can visit AltaVista and look up news, maps, jobs, auctions, items for purchase, etc., all things outside the purview of a spider-gathered index. As another example, Google integrates searches of PDF files into its general search service.
Here are a few examples of sites that collect content from the deep Web:
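There are three steps to a computer database search:
- Step 1: Identify your concepts
- Step 2: List keywords for each concept
- Step 3: Specify the logical relationships among your keywords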
When conducting any database search, you need to break down your topic into its component concepts. For example, if you want to find information on the budget negotiations between President Clinton and the Republicans, these are your concepts: CLINTON, REPUBLICANS, BUDGET.
Once you have identified your concepts, you need to list keywords which describe each concept. Some concepts may have only one keyword, while others may have many.
For example:
CLINTON
REPUBLICANS
HOUSE SPEAKER
BUDGET
BUDGET NEGOTIATIONS
BUDGET BATTLE
BUDGET IMPASSE
BUDGET DEAL
Depending on the focus of your search, there may be other keywords you would wish to use.
Once you know the keywords you want to search, you need to establish the logical relationships among them. The formal name for this is Boolean logic. Boolean logic allows you to specify the relationships among search terms by using any of three logical operators: AND, OR, NOT.
Search Statement                 Result of search
World War I AND World War II     Files containing both these terms
World War I OR World War II      Files containing at least one of these terms
World War I NOT World War II     Files containing the term World War I but not also the term World War II
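The three operators correspond directly to operations on sets of files. The following sketch assumes a tiny collection of four made-up files and a hypothetical helper, files_containing, that returns the names of the files in which a phrase occurs as whole words.

```python
import re

# A made-up collection of files, for illustration only.
files = {
    "file1": "World War I began in 1914.",
    "file2": "World War II ended in 1945.",
    "file3": "World War I and World War II reshaped Europe.",
    "file4": "The Cold War followed.",
}


def files_containing(phrase, files):
    """Return the set of file names whose text contains the whole phrase."""
    # Word boundaries keep "World War I" from matching inside "World War II".
    pattern = re.compile(r"\b" + re.escape(phrase) + r"\b", re.IGNORECASE)
    return {name for name, text in files.items() if pattern.search(text)}


ww1 = files_containing("World War I", files)
ww2 = files_containing("World War II", files)

print(sorted(ww1 & ww2))  # AND: ['file3']
print(sorted(ww1 | ww2))  # OR:  ['file1', 'file2', 'file3']
print(sorted(ww1 - ww2))  # NOT: ['file1']
```

AND narrows a search, OR broadens it, and NOT removes files that would otherwise have matched.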
Some search engines offer Boolean searching without mentioning the logical operators by name. For example, you might be asked to list your search terms and choose the option "All of these terms." This denotes AND logic. Choosing "Any of these terms" denotes OR logic. Most search engines use a type of implied Boolean logic, in which symbols or spaces are used to denote logical relationships. For example, +bears +hibernation denotes AND logic.
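Implied Boolean logic can be pictured as a small translation step that the engine performs on your query. The sketch below treats a term marked with a plus sign as required (AND). Treating unmarked terms as optional (OR) is an assumption made for this example, since engines differ on what a plain space means. The function names are hypothetical.

```python
def parse_query(query):
    """Separate a query into required (+term) and optional terms."""
    required, optional = [], []
    for token in query.lower().split():
        if token.startswith("+") and len(token) > 1:
            required.append(token[1:])
        else:
            optional.append(token)
    return required, optional


def matches(text, query):
    """Return True if the text satisfies the implied Boolean query."""
    words = set(text.lower().split())
    required, optional = parse_query(query)
    # Every +term must be present (AND logic).
    if not all(term in words for term in required):
        return False
    # Assumed behavior: with no +terms, any one plain term is enough (OR logic).
    return bool(required) or any(term in words for term in optional)


print(matches("bears enter hibernation in winter", "+bears +hibernation"))  # True
print(matches("bears eat salmon", "+bears +hibernation"))                   # False
print(matches("bears eat salmon", "bears hibernation"))                     # True
```

Always check the help files of the engine you are using, since the meaning of each symbol varies from site to site.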
Certain search engines allow you to use a proximity operator. This is a type of AND logic which specifies the distance between words in a source file. For example, AltaVista and Lycos let you use the NEAR operator. Consider this search: Clinton NEAR budget. In AltaVista, the two terms must be within 10 words of each other in the source file. Lycos allows user-specified distances. Use of this option can improve the relevance of your search results.
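A proximity test can be sketched as a comparison of word positions. The function below is an illustration, not the implementation used by AltaVista or Lycos. It assumes that "within 10 words" means the positions of the two terms differ by at most 10, and it exposes the distance as a parameter in the spirit of user-specified distances.

```python
import re


def near(text, term_a, term_b, max_distance=10):
    """Return True if the two terms occur within max_distance words of each other."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    positions_a = [i for i, word in enumerate(words) if word == term_a.lower()]
    positions_b = [i for i, word in enumerate(words) if word == term_b.lower()]
    # Compare every occurrence of one term with every occurrence of the other.
    return any(
        abs(i - j) <= max_distance
        for i in positions_a
        for j in positions_b
    )


close = "Clinton signed the budget"
far = "Clinton " + "word " * 15 + "budget"   # 15 filler words between the terms

print(near(close, "Clinton", "budget"))                  # True
print(near(far, "Clinton", "budget"))                    # False
print(near(far, "Clinton", "budget", max_distance=20))   # True
```

A file that mentions both terms far apart would satisfy a plain AND search but fail the NEAR search, which is why proximity tends to return more relevant results.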
Most Web search engines cannot handle a single search statement that includes all the terms listed in Step 2 above. You may need to repeat your search a few times using terms in different combinations until you get results that are satisfactory. For example, you may start with CLINTON, REPUBLICANS, BUDGET NEGOTIATIONS and connect these terms with AND logic. Take a look at your results. If you are not finding what you want, repeat the search with alternative keywords for the budget concept. Your initial results may give you ideas about which new terms to try.
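Trying keywords in different combinations is easier if you list the combinations in advance. The sketch below takes one keyword from each concept and joins them with AND. Because the grouping of keywords under concepts is an assumption made for this example, adjust the lists to suit your own search. The function name search_statements is hypothetical.

```python
from itertools import product

# Assumed grouping of the example keywords under the three concepts.
concepts = {
    "CLINTON": ["CLINTON"],
    "REPUBLICANS": ["REPUBLICANS", "HOUSE SPEAKER"],
    "BUDGET": [
        "BUDGET NEGOTIATIONS",
        "BUDGET BATTLE",
        "BUDGET IMPASSE",
        "BUDGET DEAL",
    ],
}


def search_statements(concepts):
    """Build one AND statement for every combination of one keyword per concept."""
    return [" AND ".join(combo) for combo in product(*concepts.values())]


statements = search_statements(concepts)
print(len(statements))   # 8
print(statements[0])     # CLINTON AND REPUBLICANS AND BUDGET NEGOTIATIONS
```

Run the first statement, look at your results, and move down the list only if you are not finding what you want.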
For more information on formulating searches, see Boolean Searching on the Internet.
Laura Cohen
lcohen@albany.edu