WebSnarfer - WIP

A personal spider for categorized web crawling.

We're in the process of developing a spider that can be used by anyone to crawl around the web looking for and storing links to pages that meet certain user-specified match criteria.

In all of our current tests, WebSnarfer runs on a 486-DX with a 28.8 connection to the net and uses a server-based database (postgres95) for link storage and retrieval. The ODBC interface is just getting started and, once available, should make database access much faster. BTW, you can see whether we're running a test right now by looking at the scrolling strip charts located at the bottom of this page.

The initial WebSnarfer work looks promising enough for us to put up this initial WIP page. Please visit again in the near future for more details and progress reports.

havIndex

You might also be interested in a little Java page indexer that is in beta test. It indexes all words on selected URLs, runs as a local Java application, and requires a 1.1 VM (like the newest JDK). Because havIndex doesn't "crawl", it's ideal for indexing your own site: you get fine control over which pages will be reported to users, which lets you control (to some degree) how folks enter specific areas of your site.
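As a rough illustration of what "indexing all words on selected URLs" can look like, here is a minimal sketch of a word-to-URL index. It is not havIndex's actual code: the class name WordIndexSketch, its methods, and the simple ASCII word splitting are all assumptions made for the example.

```java
import java.util.Collections;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical sketch: maps each word to the set of URLs it appears on.
class WordIndexSketch {
    private final Map<String, Set<String>> urlsByWord = new TreeMap<>();

    // Adds every word of an already tag-free page text to the index.
    // Assumption: words are runs of ASCII letters and digits.
    void addPage(String url, String plainText) {
        for (String token : plainText.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (token.isEmpty()) continue;
            urlsByWord.computeIfAbsent(token, k -> new TreeSet<>()).add(url);
        }
    }

    // Returns the URLs containing the word (case-insensitive), or an empty set.
    Set<String> lookup(String word) {
        Set<String> urls = urlsByWord.get(word.toLowerCase(Locale.ROOT));
        return urls == null ? Collections.<String>emptySet() : urls;
    }
}
```

Because only the pages you pass to addPage are indexed, nothing is discovered by following links, which is the kind of control described above.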

You can download a 30-day trial of the latest version (Sun, Jun 13, 09:22 CDT) in Zip format. The README file has a little more detail. The search fields around the site are an experiment after indexing a couple of the site's pages.

NEW Stripped Text Save option - As of ver 1.4, havIndex adds a save option that allows you to save an indexed document's raw text (with all HTML and any active stopwords removed) as a single block, preserving the order and repetition of words in the original document.
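To make the idea concrete, here is a small sketch of that kind of stripping: tags are removed, stopwords are dropped, and the remaining words stay in their original order with repeats intact. The class and method names are hypothetical, and the regex-based tag removal is a simplification, not how havIndex necessarily does it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch of a "stripped text" save step.
class StrippedTextSketch {
    // Removes tags, then drops stopwords, keeping word order and repetition.
    // Assumption: stopwords are supplied in lower case.
    static String strip(String html, Set<String> stopwords) {
        // Crude tag removal: anything between '<' and '>' becomes a space.
        String text = html.replaceAll("<[^>]*>", " ");
        List<String> kept = new ArrayList<>();
        for (String word : text.trim().split("\\s+")) {
            if (word.isEmpty()) continue;
            if (!stopwords.contains(word.toLowerCase(Locale.ROOT))) {
                kept.add(word);
            }
        }
        // One single block of text, words separated by single spaces.
        return String.join(" ", kept);
    }
}
```

Note that punctuation stays attached to its word in this sketch, so "sat." and "sat" would count as different words.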

NEW User-defined Save API - As of ver 1.1b4, you can now define, implement and attach your own save processing.

We've just added a little link review page where you can view the links that have been snarfed.

Features

Runs on your local computer:

WebSnarfer runs on your local computer utilizing any standard link to the internet. As we mentioned above, we're running all tests on a local 486-DX with only a 28.8 dialup connection to the net.

Selective Indexing:

When you start WebSnarfer, you provide a list of topics or keywords that you want matched, as well as other control items like whether or not to harvest new links. You can either supply a starting URL or just let WebSnarfer pick URLs from its existing database.
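A start-up configuration like this could be modeled roughly as follows. This is only a sketch under our own assumptions: the class name, the harvestNewLinks flag name, and the "any keyword, case-insensitive substring" matching rule are illustrative, not WebSnarfer's actual matching logic.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical sketch of user-specified match criteria.
class MatchCriteriaSketch {
    final List<String> keywords;     // topics or keywords to match
    final boolean harvestNewLinks;   // whether links on visited pages get queued

    MatchCriteriaSketch(List<String> keywords, boolean harvestNewLinks) {
        this.keywords = new ArrayList<>(keywords);
        this.harvestNewLinks = harvestNewLinks;
    }

    // Counts how many of the keywords occur somewhere in the page text.
    int countMatches(String pageText) {
        String haystack = pageText.toLowerCase(Locale.ROOT);
        int hits = 0;
        for (String keyword : keywords) {
            if (haystack.contains(keyword.toLowerCase(Locale.ROOT))) hits++;
        }
        return hits;
    }

    // Assumption: a page qualifies when at least one keyword is found.
    boolean matches(String pageText) {
        return countMatches(pageText) > 0;
    }
}
```

A stricter rule (say, requiring every keyword) would only change the comparison in matches.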

Descriptive Info:

WebSnarfer summaries include certain information about each document to help you decide which documents you would like to visit and whether you want to turn off images before visiting. This information includes things like...

  • Document Title
  • Document Size (in bytes)
  • Document Text Size (bytes of non-tag text)
  • Number of unique Tags
  • Number of unique Links
  • Number of unique Images
  • The Date that the URL was last indexed
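
The fields listed above could be gathered along these lines. This is a rough sketch, not WebSnarfer's code: the class name is hypothetical, the regular expressions are a deliberate simplification of real HTML parsing, sizes are measured on a UTF-8 encoding of the string, and only double-quoted href/src attributes are recognized.

```java
import java.nio.charset.StandardCharsets;
import java.util.Date;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of a per-document summary record.
class DocSummarySketch {
    String title = "";
    int sizeBytes;                   // whole document
    int textBytes;                   // non-tag text only
    int uniqueTags;
    int uniqueLinks;
    int uniqueImages;
    Date lastIndexed = new Date();   // set when the summary is built

    private static final Pattern TITLE = Pattern.compile(
            "<title[^>]*>(.*?)</title>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
    private static final Pattern TAG = Pattern.compile("<\\s*([a-zA-Z][a-zA-Z0-9]*)");
    private static final Pattern HREF = Pattern.compile(
            "<a\\s[^>]*href\\s*=\\s*\"([^\"]*)\"", Pattern.CASE_INSENSITIVE);
    private static final Pattern IMG = Pattern.compile(
            "<img\\s[^>]*src\\s*=\\s*\"([^\"]*)\"", Pattern.CASE_INSENSITIVE);

    static DocSummarySketch of(String html) {
        DocSummarySketch s = new DocSummarySketch();
        Matcher m = TITLE.matcher(html);
        if (m.find()) s.title = m.group(1).trim();
        s.sizeBytes = html.getBytes(StandardCharsets.UTF_8).length;
        s.textBytes = html.replaceAll("<[^>]*>", "")
                .getBytes(StandardCharsets.UTF_8).length;
        s.uniqueTags = countDistinct(TAG, html, true);     // tag names ignore case
        s.uniqueLinks = countDistinct(HREF, html, false);  // URLs keep their case
        s.uniqueImages = countDistinct(IMG, html, false);
        return s;
    }

    // Counts the distinct values captured by group 1 of the pattern.
    private static int countDistinct(Pattern p, String html, boolean ignoreCase) {
        Set<String> seen = new HashSet<>();
        Matcher m = p.matcher(html);
        while (m.find()) {
            String value = m.group(1);
            seen.add(ignoreCase ? value.toLowerCase(Locale.ROOT) : value);
        }
        return seen.size();
    }
}
```

In this sketch the title text counts toward the non-tag text size, and closing tags are not counted separately from their opening tags.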

We're working on snarfing a good description text block from each page, but this isn't yet included ... soon - maybe.

JDBC Database Support:

WebSnarfer now includes a JDBC database interface for storage of new and indexed links. We're using DB2 for NT from IBM and, so far at least, it's looking pretty dog-gone great!!

The old version used a postgres95 database running as part of the primary webserver. It was a bit of a hog in terms of webserver processing, but the old version is still available if you happen to be running NeoWebScript/Apache.

Honors Robot Exclusion:

WebSnarfer honors robot exclusion using both standard robots.txt AND the newly proposed robots meta tag ...

<meta name="robots" content="none|[no]index, [no]follow">
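
For the meta tag side, a crawler has to turn the content attribute into two yes/no decisions: may this page be indexed, and may its links be followed. The sketch below shows one way to do that, assuming the directive spellings none, index/noindex and follow/nofollow, and assuming that anything not forbidden is allowed. The class name is hypothetical and this is not WebSnarfer's own parser.

```java
import java.util.Locale;

// Hypothetical sketch: interprets the content value of a robots meta tag.
class RobotsMetaSketch {
    boolean index = true;    // assumption: indexing allowed unless forbidden
    boolean follow = true;   // assumption: link following allowed unless forbidden

    static RobotsMetaSketch parse(String content) {
        RobotsMetaSketch r = new RobotsMetaSketch();
        if (content == null) return r;   // no tag on the page: keep defaults
        for (String raw : content.split(",")) {
            String directive = raw.trim().toLowerCase(Locale.ROOT);
            switch (directive) {
                case "none":     r.index = false; r.follow = false; break;
                case "noindex":  r.index = false; break;
                case "index":    r.index = true;  break;
                case "nofollow": r.follow = false; break;
                case "follow":   r.follow = true;  break;
                default:         break;  // unknown directives are ignored
            }
        }
        return r;
    }
}
```

A page carrying content="noindex, follow" would then be skipped for indexing while its links could still be harvested.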

Please send your comments to Snarfer@hav.com, or give us a call at (281) 341-5035.

Thanks again for your time and help with this WIP.


Questions?
Feel free to drop by and chat if you have any questions - one of us is usually around during normal CST/CDT business hours.


Copyright © 1995-2010 by hav.Software. All Rights Reserved.


http://www.hav.com/ havBpNet:J, havFmNet:J, havBpNet++, havFmNet++, havBpETT, havCNet, WebSnarfer, havIndex and havChat are all trademarks of hav.Software.

Java and all Java-based marks are trademarks or registered trademarks of Sun Microsystems, Inc. in the U.S. and other countries.

There may be other trademarks or tradenames listed in this document to refer to the entities claiming the marks and names or products. hav.Software disclaims any proprietary interest in any trademark, tradename or products other than its own.


webmaster@hav.com
Modified - 06/13/04