ACC web search | Alkaline search engine

ACC Web pages search

In the spring of 2000, ACC set up a search engine for the ACC site. The ACC website has a mixture of material of interest to the general public and material mainly useful to ACC staff. When archival information, such as prior years' catalog entries, turns up in the search just as easily as current information, the search facility is less useful than it might be. It is useful for ACC web authors to understand how the search works so that they can make appropriate information come up early in the search lists and other information come up late or not at all.

| Hints to make a page come up early | Hints to "hide" a page | Other ways to "hide" pages | How the search works

I have read the background information from Alkaline and looked through the results of the search process fairly extensively. I have talked with Glenda Keyworth about the properties of the search and the search configuration file. At her suggestion, I have tried some experiments with the ACC search.


Some hints to make a page come up early in the search list:

  1. Make the title of the page include as many important keywords as reasonable.
  2. Make the first few lines of text on the page include as many important keywords as reasonable.
  3. Put the page fairly high in your directory structure. For example, put it in your root directory or only one or maybe two directories down.
  4. Make links so that the page is fairly high in your linking structure from your homepage. That is, make a link to it from your main page or from a page that is linked directly from your homepage.

Back to the top of this document.


Some hints to make a page come up late in the search list or not at all:

  1. Do not put links to this page from your home page, or from any other page that might be linked to from elsewhere. However, even if you make no links to a given page, you can't guarantee that no one else will. An important footnote here is that we use accweb as our "internal" server, and information there is available only from ACC campuses or with ACC dial-in. However, if a link is made from a page on accweb to yours, your page will be found in the search. So, for instance, all TF minutes have to be available from the "minutes" page on accweb, so they will be indexed in the searches.
  2. Put the page low in your directory structure. It is likely that putting it at least four subdirectories deep will keep our search engine from indexing it. For example: http://www.austincc.edu/business/second/third/fourth/file.html
  3. Don't use interesting "keywords" in the page title or near the top of the page. This certainly isn't foolproof, because if the keyword is used anywhere in the page, and that page is found during the indexing, it will be indexed.



Other ways to "hide" pages:

  1. Don't put it on the web at all.
  2. "Password protect" the page. I believe this requires some scripting access, which is not available on www2.
  3. Make sure that you don't link to the page from your home page (or from other pages that might be linked to from elsewhere). This is a bit tricky, since you can't control the links others make. However, you can post a page for your committee, tell them all about it, and tell them not to make links to it. This works pretty well and, if no links to it are made, it will definitely protect the page from searches outside ACC (Yahoo, etc.), since those can only follow links and don't have access to the entire file structure within the directories.
  4. Put it on a web server which will not have internal searches run on it. (I suggested as one possibility that ACC omit one server from the search and give accounts on that server to people who ask for them. This would mean quite a bit of extra work and it has not been implemented.) You can also get free or very low cost accounts with outside ISPs and put some web pages up on non-ACC servers. Ask whether the ISP runs a search engine over its entire site and, if so, how to hide pages from it.
  5. Ask that whoever is running the searches on your web server designate some areas that will be omitted from the search and then put your page there. (I suggested this as one possibility, but it requires quite a bit of extra work and it has not been implemented.)
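
One common convention for marking areas of a server as off limits to indexing is a robots.txt file. Whether the ACC search honors robots.txt is an assumption here, not something I have confirmed, and the /business/private/ directory below is made up for illustration. This sketch only shows how such a rule would be read, using Python's standard library:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; the excluded directory is invented.
ROBOTS_TXT = """\
User-agent: *
Disallow: /business/private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

hidden = "http://www.austincc.edu/business/private/minutes.html"
public = "http://www.austincc.edu/business/index.html"

# A crawler that respects robots.txt would skip the first URL
# and may fetch the second.
hidden_allowed = parser.can_fetch("*", hidden)
public_allowed = parser.can_fetch("*", public)
print(hidden_allowed, public_allowed)
```

A rule like this only helps with crawlers that choose to obey it, so it is a complement to the other hints, not a replacement for them.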



How the search works:

When a web author decides to make this (or any typical) search available, they first decide on a search configuration file and then run the search program with it to create the search database. Usually a "default" search configuration file is provided. When you search on a specific word or set of words, the program is merely looking through the search database that has been created. It is not going back to the original pages. Thus, if a page has been deleted (or moved) since the search database was last updated, it will still be listed, but when you click on the URL as it came up in the search, you won't find a page. Moreover, if a page has been added since the search database was last updated, it will not come up in the search.
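
The effect of searching a stored database rather than the live pages can be sketched with a toy model. This is only an illustration, not how Alkaline is implemented; the page names, page text, and function names are all made up.

```python
# Toy model: the search consults a snapshot (the "search database"),
# never the live pages. All page names and text here are invented.

def build_index(pages):
    """Map each lowercase word to the set of URLs whose text contains it."""
    index = {}
    for url, text in pages.items():
        for word in text.lower().split():
            index.setdefault(word, set()).add(url)
    return index

def search(index, word):
    """Look the word up in the snapshot only."""
    return sorted(index.get(word.lower(), set()))

live_pages = {
    "/catalog/1999.html": "catalog entries for 1999",
    "/catalog/2000.html": "catalog entries for 2000",
}
index = build_index(live_pages)  # snapshot taken at this moment

# The site changes after the snapshot was built.
del live_pages["/catalog/1999.html"]                           # page deleted
live_pages["/catalog/2001.html"] = "catalog entries for 2001"  # page added

results = search(index, "catalog")
dead_links = [url for url in results if url not in live_pages]
not_yet_listed = [url for url in live_pages if url not in results]
print(results, dead_links, not_yet_listed)
```

The deleted page still shows up in the results as a dead link, and the new page is missing from them until the index is rebuilt.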

In response to various questions of mine, Glenda said that the search was set to index all the various ACC web servers (www2, www3, www, accweb, lrs, opc, etc.) and to follow links from their root directories. She tells me that our search is set to follow links rather than just find all files that are posted. (In general, this is true of search engines.) In confirmation of that, I notice that a number of pages I know about which aren't linked do not appear to be found by the search. However, I have seen some pages that I believe have no links to them that have appeared in the search database, so I would not completely rely on this method. Whether they appear because someone else made a link to the page or because the search engine is really searching the entire directory structure of our web servers, I can't be sure.
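
To make "following links" concrete, here is a minimal sketch of a link-following walk over a made-up site. The page names and the link table are hypothetical, and a real indexer fetches and parses pages instead of reading a dictionary.

```python
from collections import deque

def crawl(links, start):
    """Breadth-first walk over links; returns every page reachable from start."""
    seen = {start}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return seen

# Hypothetical site: each page maps to the pages it links to.
site_links = {
    "/index.html": ["/courses.html", "/staff.html"],
    "/courses.html": ["/syllabus.html"],
    "/staff.html": [],
    "/committee-draft.html": [],  # posted, but nothing links to it
}

found_before = crawl(site_links, "/index.html")

# Someone else adds a link to the unlinked page.
site_links["/staff.html"].append("/committee-draft.html")
found_after = crawl(site_links, "/index.html")
print(sorted(found_before), sorted(found_after))
```

The unlinked page is invisible to the walk until a single link to it appears anywhere reachable, which is why relying on "no links" alone is fragile.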

Glenda also said that the "site depth" was set at 2 and that the search was set in a mode that means it should continually scan the web servers and add pages to its database as they are posted. She ran it in March, I think, and said that she had to cut it off from creating the search database after about 10 hours. So it is possible that some pages were missed when she stopped it.

I posted some pages appropriately to test this (in my root directory, linked to from my main page) and they have not appeared in the search database after three weeks. So I am skeptical that it is continually adding to the database. That means I can't see the results of my experiment until the search database is created again. Glenda doesn't anticipate doing that very soon.

About the "site depth" of the search at 2 -- it appears to me that this should mean pages more than 2 subdirectories deep would not be indexed. Whether that is two subdirectories from the ACC root directory or from your personal web directory isn't clear. I have seen one page in our search database with a URL longer than that. However, I haven't seen any with URLs this long: http://www.austincc.edu/business/second/third/fourth/file.html. I asked her whether my interpretation was correct -- that such pages wouldn't be indexed. She agreed that the technical description of the search engine says that they wouldn't, but cautioned that you can't ever really tell what will happen except by experimentation, because the people who write the descriptions aren't the same people who write the code for the program. She did say that she has no intention of ever making that "site depth" parameter any higher than 2.
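
Under my interpretation (which, as noted, is not confirmed), the check amounts to counting directories in the URL path. The sketch below counts them both ways: from the server root and from a personal directory. The function names and the `base` parameter are my own, not part of the search engine.

```python
from urllib.parse import urlparse

def directory_depth(url, base="/"):
    """Count the directories between base and the file named in url."""
    path = urlparse(url).path
    if not path.startswith(base):
        raise ValueError("url is not under base")
    relative = path[len(base):]
    parts = [p for p in relative.split("/") if p]
    if parts and not relative.endswith("/"):
        parts = parts[:-1]  # the last part is the file itself
    return len(parts)

def within_site_depth(url, base="/", site_depth=2):
    """True if the page is no deeper than site_depth (assumed interpretation)."""
    return directory_depth(url, base) <= site_depth

url = "http://www.austincc.edu/business/second/third/fourth/file.html"
from_root = directory_depth(url)                  # counted from the server root
from_home = directory_depth(url, "/business/")    # counted from a personal directory
print(from_root, from_home, within_site_depth(url), within_site_depth(url, "/business/"))
```

For the example URL the count is 4 from the server root and 3 from the personal directory, so it exceeds a site depth of 2 under either reading.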



Last updated May 21, 2000. Comments or questions?