Why use HtmlSearch

HtmlSearch allows you to search through a web site that is either local (e.g. on your hard disk, a CD-ROM or a LAN) or remote (e.g. on the World Wide Web).
Many web sites have a search facility but it generally involves the execution of a CGI script that is located on the site's server.
If a site is located on a server that does not support CGI scripts, or on local media, it cannot offer such a search facility. HtmlSearch allows you to search without the assistance of CGI scripts. It is therefore useful for:
  • Sites hosted on a server that does not support CGI scripts
  • Sites that do not offer a search facility
  • Sites local to a hard disk or CD-ROM
  • Verifying that all the links on a site are reachable

    Main features

  • works on any platform, operating system, or Java-enabled browser
  • search a site without the use of server side indexing or CGI
  • start the search from any page or local file
  • AND and OR boolean operators
  • search for individual words or phrases
  • allow the search to explore hosts other than the starting host
  • ability to search HTML files or any type of file (text file, word processing document, etc...)
  • ability to search through HTML tags
  • refine searches by searching through the previous search results
  • ability to eliminate hosts, file types, paths, large files, directory hierarchies from search
  • shows broken links, inaccessible pages, unavailable hosts
  • ability to save search results as a text list or an HTML index
  • ability to index an entire site - use custom dictionary for word selection
  • configure which functionality and screens you want your users to be able to see and access
  • See also the Release History for new features.


    HtmlSearch Usage

    HtmlSearch can be used either as part of a web site, or on your PC to search any web site or directory structure:
    • As part of a page in a web site: you, the site's designer, configure HtmlSearch to work specifically on this web site. Depending on the choices you make when incorporating HtmlSearch, users may or may not be able to search outside of that site, have access to all the panels described in this help, etc.
    • As part of a web page not included in a web site: this would be the case for example if you downloaded the HtmlSearch program so you can use it on any web site or files.
    There is a help file that explains how to set up HtmlSearch.


    How it works

    HtmlSearch functions very much like your browser does when you click on a link, except that it does so automatically. In other words, it reads a page, looks at the links it contains, follows these links, and so on.
    As such, it has the same access capabilities and limitations as your browser, in addition to restrictions due to the particular nature of Java applets.
    In addition it keeps the pages it examines in its own cache, so that from one search to the next it doesn't re-read pages that have already been read. (This applies only as long as you do not exit the current search session.)
    The links followed are those found in hyperlinks (e.g. "A HREF", image maps, etc.). HtmlSearch does not follow links to or generated by CGI scripts, Java, or JavaScript calls.
    If a link points to a directory or to an unreachable URL, HtmlSearch will try to access the files 'index.html' and 'index.htm' in that directory.
    Searches can be slow if the pages searched are located on slow or busy servers, or if your modem is slow. The speed of the search also depends on your PC, and the parameters you specified for the applet.
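
    The link-following behaviour described above can be illustrated with a minimal sketch. This is not HtmlSearch's actual code: the class and method names are hypothetical, the pages come from an in-memory map instead of the network, links are assumed to be already absolute, and the match is assumed to be a case-insensitive substring test.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: none of these names come from HtmlSearch itself.
class LinkCrawlSketch {

    // Matches href="..." or href='...' inside A and AREA (image map) tags.
    private static final Pattern HREF = Pattern.compile(
            "<(?:a|area)\\b[^>]*?\\bhref\\s*=\\s*[\"']([^\"']*)[\"']",
            Pattern.CASE_INSENSITIVE);

    // Returns the link targets found in one page, in document order.
    static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            links.add(m.group(1));
        }
        return links;
    }

    // Reads a page, records it if it contains the term, then follows its links.
    // 'pages' stands in for the network or disk (assumption for this sketch).
    static List<String> search(Map<String, String> pages, String start, String term) {
        List<String> hits = new ArrayList<>();
        Set<String> seen = new HashSet<>();
        Deque<String> queue = new ArrayDeque<>();
        queue.add(start);
        seen.add(start);
        String needle = term.toLowerCase();
        while (!queue.isEmpty()) {
            String url = queue.remove();
            String html = pages.get(url);
            if (html == null) {
                continue; // broken link or inaccessible page
            }
            if (html.toLowerCase().contains(needle)) {
                hits.add(url);
            }
            for (String link : extractLinks(html)) {
                if (seen.add(link)) { // true only the first time a link is seen
                    queue.add(link);
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        Map<String, String> pages = new HashMap<>();
        pages.put("index.html", "<a href=\"a.html\">A</a> welcome");
        pages.put("a.html", "search applet <a href=\"index.html\">home</a>");
        System.out.println(search(pages, "index.html", "applet"));
    }
}
```

    The set of already-seen links is what keeps the walk from looping when pages link back to each other.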

    Which Files Are Searched? HtmlSearch follows all links in pages. You can elect to search or eliminate files based on their types. By default the following file types are excluded from the search: bmp,jpe,jpg,jpeg,jfif,gif,tif,tiff,qt,mpeg,mpg,mpe,mpa,m1v,avi,mov,mid,midi,wav,au,snd,dll,exe,map,class,js,pdf,zip,gzi,tar,tgz,z. You can use the nosearchSuffix parameter to include or exclude your own file types. HtmlSearch also allows you to eliminate files based on contentType, size, etc.
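
    As an illustration of suffix-based filtering, here is a minimal sketch using the default exclusion list quoted above. The class and method names are hypothetical, and the case-insensitive comparison and the stripping of query strings and fragments are assumptions; HtmlSearch's own matching rules may differ.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch: not HtmlSearch's actual implementation.
class SuffixFilterSketch {

    // Default exclusion list, copied from the text above.
    static final Set<String> EXCLUDED = new HashSet<>(Arrays.asList(
            ("bmp,jpe,jpg,jpeg,jfif,gif,tif,tiff,qt,mpeg,mpg,mpe,mpa,m1v,avi,mov,"
           + "mid,midi,wav,au,snd,dll,exe,map,class,js,pdf,zip,gzi,tar,tgz,z").split(",")));

    // Returns the lower-cased suffix of a file path, or "" if there is none.
    static String suffixOf(String url) {
        int cut = url.length();
        int query = url.indexOf('?');
        if (query >= 0) {
            cut = Math.min(cut, query);
        }
        int fragment = url.indexOf('#');
        if (fragment >= 0) {
            cut = Math.min(cut, fragment);
        }
        String path = url.substring(0, cut);
        int slash = path.lastIndexOf('/');
        int dot = path.lastIndexOf('.');
        if (dot <= slash || dot == path.length() - 1) {
            return ""; // no dot in the last path segment, or a trailing dot
        }
        return path.substring(dot + 1).toLowerCase(Locale.ROOT);
    }

    // True if the file would be searched under the default exclusion list.
    static boolean isSearched(String url) {
        return !EXCLUDED.contains(suffixOf(url));
    }

    public static void main(String[] args) {
        System.out.println(isSearched("docs/readme.txt"));
        System.out.println(isSearched("pics/logo.GIF"));
    }
}
```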

    Which Directories Are Searched? You can eliminate directories from the search based on their names, or on how far up or down in the directory hierarchy they are from the starting point. You can even elect to scan directories for files (and links to other files) without searching in those files. You can also eliminate some hosts or domains from the search.

    Starting and Stopping the Search: as stated above, HtmlSearch keeps all the visited pages in its cache. This means that if you stop a search then restart it, even from a different URL or searching for a different string, HtmlSearch will first use its cache before going back to the network to read a page. This minimizes the search time and network load, and allows for very fast searches the "second time", e.g. when searching for different strings on the same set of pages. The cache is window-specific, i.e. there is one such cache per window where HtmlSearch is loaded, and even if several HtmlSearch windows are open, they do not share their caches. All the searches also use the browser's caching mechanism, which may further reduce access time and network load.
    The caching mechanism may have a negative impact if you do many searches in the same window, since all the pages visited are kept in the cache: if you search through thousands of pages, the memory requirements may exceed the browser's capacity (this depends on the platform, the browser and the browser settings). It may therefore be wise to 'kill' the search every now and then and restart it in a new window.
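
    The trade-off described above can be seen in a minimal sketch of a per-session page cache. All names here are hypothetical and the loader is a stand-in for the real network read; the clear() method only mimics the effect of restarting in a new window and is not a documented HtmlSearch feature.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch: not HtmlSearch's actual implementation.
class PageCacheSketch {

    private final Map<String, String> cache = new HashMap<>();
    private final Function<String, String> loader; // stand-in for a network read
    private int loads = 0;

    PageCacheSketch(Function<String, String> loader) {
        this.loader = loader;
    }

    // Returns the cached page if present; otherwise loads it once and keeps it.
    String get(String url) {
        String page = cache.get(url);
        if (page == null) {
            page = loader.apply(url);
            loads++;
            if (page != null) {
                cache.put(url, page);
            }
        }
        return page;
    }

    // Number of times the loader was actually called.
    int loads() {
        return loads;
    }

    // Number of pages held in memory; grows with every new page visited.
    int size() {
        return cache.size();
    }

    // Frees the memory, at the cost of re-reading pages on the next search.
    void clear() {
        cache.clear();
    }

    public static void main(String[] args) {
        PageCacheSketch cache = new PageCacheSketch(url -> "<html>" + url + "</html>");
        cache.get("a.html");
        cache.get("a.html"); // served from the cache, no second load
        System.out.println(cache.loads());
    }
}
```

    Because nothing is ever evicted, the map grows with every page visited, which is the memory concern noted above.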



    HtmlSearch is a Java program, and you need a Java-enabled browser (e.g. Microsoft Internet Explorer version 3.0 and above, Netscape version 3.0 and above). It is built on the Java 1.0 JDK and is therefore compatible with browsers using JDK 1.0 and 1.1.

    Depending on your system configuration, where you got HtmlSearch from, and your browser security settings, you may be restricted as to which sites on the WWW you can search with HtmlSearch.


    Other Search Methods

    HtmlSearch needs to read each page to search it, which is very inefficient compared to CGI-based searches, not to mention the associated network load. In other words, if the site you are looking at provides a search function, you may be better off using it than HtmlSearch, especially since HtmlSearch may not be able to access all the pages of the site. On the other hand, HtmlSearch provides indexing while the site's CGI-based search may not; the site's provided search may also restrict the searches in ways that may not fit you (e.g. only some of the pages are covered by the search), or not give you the flexibility that HtmlSearch offers.
    For local searches (hard disk, CD-ROM), HtmlSearch is probably slower than operating-system tools (grep on Unix, Tools->Search on Windows), but these tools do not follow links: they operate on files only.
    There are also some other Java- or JavaScript-based search engines available from other sources, but obviously HtmlSearch is better :-).

    © Copyright Olivier Zyngier, 1998-2009