havIndex (TM)
June 2004
version 1.4
Copyright (c) 1999-2004 by hav.Software and Horace A. Vallas, Jr.
- All Rights Reserved

TOC
=========================================
   Introduction
   Installing havIndex
   NEW Features
   Other Features
   Output File Format
   User Defined Save API
   Stop Words
   Stemming
   Preparing pages for indexing by havIndex
   Indexing pages
   Problems & Known Bugs
   Contact

Introduction:
=============
havIndex is a Java application that indexes all body text and meta
keywords of a set of pages into a file that can be imported into an
online database for keyword searches of pages on a site.

havIndex reads from URLs - so it sees what visitor browsers will see.
It will not see any inline server-side code that may be included in a
document source, for example. Also, HTML comments are ignored - so
inline scripts (like JavaScript segments) that are appropriately
embedded in HTML comments will also not be seen.

Several flat-file save actions are implemented by default, but havIndex
also provides an API by which users can write and use their own save
implementation - for example, to save directly into their particular
database.

Requires a 1.1 or later Java Virtual Machine.

Installing havIndex:
===================
havIndex is distributed as a standard .zip file containing all of the
class and supporting files required by havIndex.

On Windows systems, we suggest installing havIndex into the
C:\Program Files\havSoft\havIndex directory because we have included an
example Windows 95/NT shortcut that you can simply drag onto your
desktop. This is only a convenience; you can install havIndex anywhere
you like.

NEW Features:
=============
1. havIndex 1.4 adds a save option which saves the URL, title,
   description and stripped text of the URL. Text is saved as a block,
   with words in the order in which they appear in the URL. If a word
   appears multiple times in the URL, it appears multiple times in the
   saved text as well.

2. havIndex 1.3 is simply a rebuild of 1.1b4c1.

3. havIndex 1.1b4c1 is simply a rebuild of b4c.

4. havIndex 1.1b4c now allows you to name your proprietary save class
   anything - havIndexUserSaver remains the default.

5. havIndex 1.1b4b corrected a bug that caused some pages to be
   incorrectly excluded by robots.txt.

6. havIndex 1.1b4a was modified to use the 1.1 event model.

7. As of 1.1b4, havIndex provides an API by which users can implement
   and attach their own SAVE implementation.
   (see USER DEFINED SAVE API later in this document - following the
   "Output File Format" section below)

Other Features:
===============
1. Runs on your local computer - requires Java 1.1 or later VM.

2. Can index any URL accessible across the internet. Currently,
   havIndex can handle either HTML or plain TEXT documents.
   Note: files with a mime type other than text/plain will be parsed as
   if they are HTML - meaning that havIndex will try to locate and
   catalog html or html-like tags.

3. By default will honor Robot Exclusion using either the robots.txt
   file or the robots meta tag. For more info see
   http://info.webcrawler.com/mak/projects/robots/meta-user.html

4. Sees what your browser will see - because indexing is done over an
   internet connection, there is no need to worry about server-side
   code being parsed. Also, any client-side code, like JavaScript, that
   has been "hidden" using standard html comment markup will be ignored
   during document parsing.

5. Allows you to enter URLs to be indexed by hand OR import a list of
   URLs from a flat ASCII file.
   Supports specification of a URL PREFIX that will be applied to all
   selected URLs - thus allowing shorthand specification of URLs with
   the same prefix.
   Optional SAVE URL function allows you to save the URL list to a flat
   ASCII file with or without the prefix attached.

6. Optional STOPWORD use. Stopwords are words that will be ignored when
   constructing the index for a URL - words like "and", "or", "at" etc.
   The sample StopWords.txt file provided contains approximately 900
   words. You can use this list, add to it, use your own list or not
   use any stopwords at all.
   BY DEFAULT stopwords are NOT USED - so you must load a stopword list
   (file menu item) if you want stopwords applied. Once loaded, you can
   disable/enable the use of the loaded stopwords.

7. Indexed data is saved to a tab-delimited flat ASCII file that should
   be easily editable for import into various databases and
   spreadsheets. You can choose one of two basic formats:
   - all address records, then all words
   - or -
   - an address record followed by all word records for that URL,
     followed by the next address record and its words, etc.
   MORE ON SAVE FILE FORMATS BELOW...

8. Allows you to (optionally) save word frequency information along
   with word index data.

9. Optionally will save all URLs and Email addresses retrieved from
   parsed pages.

10. Supports optional use of pseudo-html tags in documents to allow you
    to manually set indexing identifiers on docs that you own.
    When havIndex tags are not found, an md5 digest of the document's
    URL is used as the index value for words parsed from the document.
    Should a digest contain either tab or eol, the digest itself is
    successively re-digested until an acceptable index value is
    obtained.

11. You can modify the characters used for word boundary determination.
    NOTE: tab and eol are always used - but you can choose other
    characters as well.

12. You can control the maximum number of bytes to be read when
    retrieving a document - thus avoiding overly large files (robot
    chokers and spider killers).

13. havIndex will process the text body of a document, ignoring all tag
    contents except meta description and keywords.

Output File Format:
===================
The output file created by havIndex contains two types of records:
address-records and word-records.

There is one address-record for each URL indexed in a single havIndex
run. There is one word-record for EACH unique word indexed from each
URL in a havIndex run.

1. page address records

   Address records begin with the constant field value of "/addr".
   Following the /addr tag are several tab-delimited fields:

   key              - the unique ID used to group the URL's index values
   page URL         - the URL that was indexed (ex. http://www.hav.com/)
   page Title       - the value of the URL's <TITLE>...</TITLE> (if any)
   page Description - the value of the meta description tag (if any)

2. word records

   Word-records contain only a key, a tab and a word.
   The key is the same value as the "key" field of the address-record
   of the URL from which the word was harvested.
   The word is simply the word being indexed.

   Address-records and word-records can be saved to a file in one of
   two groupings. By default, the address record for a URL immediately
   precedes all the word records for that URL. Alternatively, you can
   choose to have all address records grouped together at the beginning
   of the save file, followed by all the word-records for all URLs.

3. Stripped Text save format

   This format saves one line per URL parsed. Each line contains the
   URL, the title (from the <TITLE> tag set), the description (from the
   description meta tag) and the text of the URL (stripped of all html
   markup and, if stopwords are used, with all stopwords removed).
   Fields on a line are separated by tabs.

USER DEFINED SAVE API:
======================
Users can implement their own save functionality by using the USER
DEFINED SAVE API added in version 1.1b4. To do so, you must ...

1. Create a class named havIndexUserSaver
   (Note: as of 1.1b4c your saver class is not required to be named
   havIndexUserSaver - so you can replace "havIndexUserSaver" with your
   own saver class name in all references below)

   - implement at least a constructor that takes certain required
     arguments (see User Saver Constructor below)
   - implement a method named "save" with no arguments. This save()
     method is the root for your save processing.

   *** see the havIndexUserSaver.java included in the distribution for
       a working example

2. Compile your havIndexUserSaver.java file to produce the
   havIndexUserSaver.class file.

   i.e. javac havIndexUserSaver.java

3. Run havIndex AND your implementation of havIndexUserSaver together.
   It is probably easiest to place the havIndexUserSaver.class file in
   the same directory as the havIndex.class file.

   i.e. java havIndex havIndexUserSaver
        OR
        java havIndexE havIndexUserSaver

   NOTE: the Windows shortcut provided has been modified to assume the
   havIndexUserSaver.class file IS in the same directory as the
   havIndex.class file. This should cause no problem if you have not
   made your own saver class - but the shortcut should be modified if
   you DO make a saver and place it elsewhere.

The havIndex File Menu now includes a new save action - "USER DEFINED
Save" - which will call the "save()" method of the havIndexUserSaver
class. If there is no havIndexUserSaver implementation, this new option
will remain inactive.

User Saver Constructor:
=======================
Your user defined saver class must have a constructor of the following
form:

   public havIndexUserSaver( int numURLs,
                             String URLs,
                             String[] idList,
                             String[] titleList,
                             String[] descriptionList,
                             String[] strippedText,
                             boolean[] parseStatusList,
                             Hashtable[] wordLists,
                             Hashtable[] refLists,
                             Hashtable[] emailLists )

... where ...

   numURLs         : int
                     number of URLs that were processed in the last
                     parsing run. Equal to the size of the array
                     arguments below.

   URLs            : String
                     a copy of the URL input list window's value.

   idList          : String[]
                     the ID (index) used for the i-th URL in the URLs
                     list

   titleList       : String[]
                     the value of the <TITLE> </TITLE> pair (if any)
                     that was retrieved from the i-th URL.

   descriptionList : String[]
                     the value of the meta description tag (if any)
                     that was retrieved from the i-th URL.

   strippedText    : String[]
                     the raw text from the i-th URL with all html
                     removed and, if stopwords are used, all stopwords
                     removed.

   parseStatusList : boolean[]
                     the parse status of the i-th URL.
                     true means the URL was parsed Ok
                     false means the URL was not parsed

   wordLists       : Hashtable[]
                     the Hashtable containing the words parsed from the
                     i-th URL.
                     - Keyed by a word parsed from the URL.
                     - Elements are instances of Integer whose value is
                       the count of the word in the i-th URL

   refLists        : Hashtable[]
                     the Hashtable containing the link refs parsed from
                     the i-th URL.
                     - Keyed by link ref (like http://www.hav.com/)
                     - Element values are the id of the URL from which
                       the link was retrieved.

   emailLists      : Hashtable[]
                     the Hashtable containing the mailto refs parsed
                     from the i-th URL.
                     - Keyed by mailto ref (an email address)
                     - Element values are the id of the URL from which
                       the link was retrieved.

STOPWORDS:
==========
havIndex includes the optional use of a stopword list. By default,
stopwords are NOT used - so you must select a stopword file to be used
if you want to use stopwords.

As distributed, havIndex includes an example stopword file named
StopWords.txt. You can use it as is, edit it, or make your own stopword
file as required. A stopword file should be formatted as ...

1. one word per line

2. comments may be included - each on a line by itself and beginning
   with a character in the Split-On list (see the havIndex
   options/other menu item).
   As distributed, havIndex includes the # character as a split
   character - so the StopWords.txt file has comment lines marked with
   the # character.

3. Because havIndex uses a "Split-On" list to determine word boundaries
   when parsing a page, no parsed word will contain any character in
   the Split-On list. Therefore, any stopword which contains a
   character in the Split-On list will never be matched in a parsed
   page.
   This is not a problem - and means that you can use one stopword file
   for different Split-On lists that you might use from time to time.
   For example, you might choose to allow apostrophes in words - so you
   could include stopwords like "we'll", "we" and "ll". If apostrophes
   are on the split list, then "we" and "ll" will possibly be matched -
   but if apostrophes are NOT on the split list, then "we'll" may be
   matched but "ll" (probably) won't.

4. When you import a stopword file, the "Use Stopwords" option is
   automatically enabled. Therefore, if, after importing a stopword
   list, you wish to make a run without using stopwords, you must
   deselect the "Use Stopwords" program option (options menu).

Stemming:
=========
havIndex does not currently implement any stemming - at least not
really. However, appropriate selection of the characters in the
Split-On list can, in fact, cause a sort of lightweight stemming
effect.

For example, if you include the apostrophe in the Split-On list, then
words like "we'll" will be seen as "we" and "ll" - which doesn't do
much to include "will" but sort of stems "we" to a root form.

Still - often the second part of a contraction is (probably) either a
(verb) stopword anyway (i.e. 'll=will, 've=have, 're=are ...) or simply
a possessive (Horace's, Bill's, ...) so maybe it's OK to have these
"stemmed-off." (I know, I know... :-)))

Maybe I'll add some real stemming in a future revision.

Preparing pages for indexing by havIndex:
========================================
Pages indexed by havIndex MAY contain a special pseudo-HTML tag which
havIndex will use to determine the "key" to be used for a file as it is
indexed. This tag looks like...

   ...

where xxxx is the key to be used for the URL containing the tag.

All pages to be indexed on a site should have UNIQUE keys assigned via
their havIndex tags. If no tag is found, then an id is formed
automatically from the page's URL.

If the tag is used, it should be placed IMMEDIATELY AFTER THE OPENING
TAG in the document.

   (i.e.)
   .
   .
   .

Indexing pages:
===============
1. Run havIndex on your local PC

   From the dos prompt in a dos shell - run ...

      >java havIndex
      (...or havIndexE in the case of a demo distribution)

   This will create a file on your PC which contains the data to be
   imported into the index database. You can control the name of this
   file in the havIndex program.

   WINDOWS NOTE: We have included an example shortcut that can be
   dragged from the Windows Explorer to your desktop. You will need to
   edit the shortcut to point to where you have the JDK and the
   havIndex program files installed...

      TARGET:   f:\jdk1.1.4\bin\java.exe havIndex
      START IN: f:\Program Files\havSoft\havIndex

   ... where f:\jdk1.1.4 is where you have the JDK installed and
   f:\Program Files\havSoft\havIndex is where you have the
   havIndex .class files installed.

Problems & Known Bugs:
======================
1.1b4: There is a sporadic bug (I think related to the java dialog
       class) that will sometimes cause a traceback to be printed after
       clicking the "Ok" button following a completed parsing run.
       The traceback is to the effect....

          Exception occurred during event dispatching:
          java.lang.NullPointerException: Invalid peer
             at sun.awt.windows.WWindowPeer$FocusOnActivate ...

       As far as I can tell there is no effect on the functioning of
       havIndex and, for now, this error can be ignored.

1.1b4: If too many URLs are parsed in a single parsing run, the program
       may crash. The number of URLs that can be parsed depends on the
       total number of words retrieved from all URLs. I suggest that no
       more than 10-15 URLs be parsed in a single run.

Please let me know if you experience any problems - or if you have some
thoughts on how things might be improved.

Contact:
========
Please feel free to contact us with any questions or suggestions by
email to havIndex@hav.com or by phone at 281-341-5035 between the hours
of 8am and 6pm Central time.

CHANGE HISTORY:
================================================================

Copyright (c) 1999 by hav.Software and Horace A. Vallas, Jr.
All Rights Reserved
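
Example saver sketch:
=====================
As a rough illustration of the USER DEFINED SAVE API described earlier
in this document, the sketch below shows what a minimal saver class
might look like. Only the constructor argument order and the
no-argument save() method come from this document. The class name
SketchUserSaver, the buildRecords() helper and the tab-delimited layout
it writes are assumptions made for the example, and it prints to
standard output where a real saver would more likely write to a file or
database. The havIndexUserSaver.java file included in the distribution
remains the working reference.

```java
import java.util.Enumeration;
import java.util.Hashtable;

// Hypothetical saver sketch. The class name and the record layout are
// assumptions made for illustration; only the constructor argument order
// and the no-argument save() method are taken from this document.
class SketchUserSaver {
    private final int numURLs;
    private final String[] idList;
    private final String[] titleList;
    private final boolean[] parseStatusList;
    private final Hashtable[] wordLists;

    public SketchUserSaver(int numURLs, String URLs,
                           String[] idList, String[] titleList,
                           String[] descriptionList, String[] strippedText,
                           boolean[] parseStatusList, Hashtable[] wordLists,
                           Hashtable[] refLists, Hashtable[] emailLists) {
        // This sketch keeps only the arguments it actually uses.
        this.numURLs = numURLs;
        this.idList = idList;
        this.titleList = titleList;
        this.parseStatusList = parseStatusList;
        this.wordLists = wordLists;
    }

    // Hypothetical helper: one "id<TAB>title" line per parsed URL, followed
    // by one "id<TAB>word<TAB>count" line for each word of that URL.
    String buildRecords() {
        StringBuffer out = new StringBuffer();
        for (int i = 0; i < numURLs; i++) {
            if (!parseStatusList[i]) {
                continue; // skip URLs that were not parsed
            }
            String title = (titleList[i] == null) ? "" : titleList[i];
            out.append(idList[i]).append('\t').append(title).append('\n');
            Enumeration words = wordLists[i].keys();
            while (words.hasMoreElements()) {
                Object word = words.nextElement();
                out.append(idList[i]).append('\t').append(word)
                   .append('\t').append(wordLists[i].get(word)).append('\n');
            }
        }
        return out.toString();
    }

    // Root of the save processing; here it only prints the records.
    public void save() {
        System.out.print(buildRecords());
    }

    // Stand-alone demonstration with made-up data for a single URL.
    public static void main(String[] args) {
        Hashtable words = new Hashtable();
        words.put("index", Integer.valueOf(2));
        Hashtable[] empty = { new Hashtable() };
        new SketchUserSaver(1, "http://www.hav.com/",
                new String[] { "k1" }, new String[] { "Home" },
                new String[] { "" }, new String[] { "index index" },
                new boolean[] { true }, new Hashtable[] { words },
                empty, empty).save();
    }
}
```

Keeping the record building in a helper separate from save() is just a
convenience for trying the class outside havIndex. Note that a
Hashtable does not return its keys in any particular order, so the word
lines for a URL may come out in any order.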