A personal spider for categorized web crawling.
We're developing a spider that anyone can use to crawl the web, finding and storing links to pages that meet user-specified match criteria. In all of our current tests, the Snarfer runs on a 486-DX with a 28.8 connection to the net and uses a server-based database (postgres95) for link storage and retrieval. The ODBC interface is just getting started and, once available, should make database access much faster. BTW, you can see if we're running a test now by looking at the scrolling strip charts located at the bottom of this page. The initial WebSnarfer work looks promising enough for us to put up this initial WIP page. Please visit again in the near future for more details and progress reports.
We've just added a little link review page where you can view the links that have been snarfed.
Features
WebSnarfer runs on your local computer utilizing any standard link to the internet. As we mentioned above, we're running all tests on a local 486-DX with only a 28.8 dialup connection to the net.
When you start WebSnarfer, you provide a list of topics or keywords that you want matched, along with other control items such as whether or not to harvest new links. You can either supply a starting URL or just let WebSnarfer pick URLs from its existing database.
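To make the startup options concrete, here is a minimal Java sketch of what such a request could look like. It is not WebSnarfer's actual code: the class name `CrawlRequest`, its fields, and the case-insensitive substring match are all assumptions made for illustration, since this page does not say how matching is really done.

```java
import java.util.List;
import java.util.Locale;

// Hypothetical holder for the options described above; not WebSnarfer's real API.
final class CrawlRequest {
    final List<String> topics;      // topics or keywords to match
    final boolean harvestNewLinks;  // whether links found on visited pages are kept
    final String startUrl;          // null means "pick a URL from the existing database"

    CrawlRequest(List<String> topics, boolean harvestNewLinks, String startUrl) {
        this.topics = topics;
        this.harvestNewLinks = harvestNewLinks;
        this.startUrl = startUrl;
    }

    // True when no starting URL was supplied.
    boolean startsFromDatabase() {
        return startUrl == null;
    }

    // Assumed match rule: a page matches if its text contains any topic,
    // ignoring upper/lower case.
    boolean matches(String pageText) {
        String haystack = pageText.toLowerCase(Locale.ROOT);
        for (String topic : topics) {
            if (haystack.contains(topic.toLowerCase(Locale.ROOT))) {
                return true;
            }
        }
        return false;
    }
}
```

A request built with the topics "bonsai" and "juniper" and no starting URL would, under these assumptions, begin from the stored links and match any page whose text mentions either word.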
WebSnarfer summaries include certain information about each document to help you decide which documents you would like to visit and whether you want to turn off images before visiting. This information includes things like...
We're working on snarfing a good description text block from each page, but this isn't yet included ... soon - maybe.
WebSnarfer now includes a JDBC database interface for storage of new and indexed links. We're using DB2 for NT from IBM and, so far at least, it's looking pretty dog-gone great!!
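For readers who haven't used JDBC, storing a link looks roughly like the sketch below. The table name `snarfed_links`, its columns, and the `LinkStore` class are hypothetical; this page doesn't describe WebSnarfer's real schema. Only the standard `java.sql` calls are used, so the same code should work against any database with a JDBC driver.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

// Hypothetical helper; the table and column names are assumptions for illustration.
final class LinkStore {
    static final String INSERT_SQL =
        "INSERT INTO snarfed_links (url, topic) VALUES (?, ?)";

    // Stores one matched link and returns the number of rows inserted.
    static int saveLink(Connection conn, String url, String topic) throws SQLException {
        // A prepared statement keeps the URL text out of the SQL string itself.
        try (PreparedStatement stmt = conn.prepareStatement(INSERT_SQL)) {
            stmt.setString(1, url);
            stmt.setString(2, topic);
            return stmt.executeUpdate();
        }
    }
}
```

The `Connection` would normally come from `DriverManager.getConnection` with whatever JDBC URL your database's driver expects.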
The old version used a postgres95 database running as part of the primary webserver. It was a bit of a hog in terms of webserver processing, but the old version is still available if you happen to be running NeoWebScript/Apache.
WebSnarfer honors robot exclusion using both standard robots.txt AND the newly proposed robots meta tag ...
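As an illustration of the two checks, here is a simplified Java sketch. It is not WebSnarfer's implementation: the `RobotRules` class and its method names are hypothetical, the agent name is whatever the caller passes in, and the parsing takes shortcuts. It merges the rules for `*` with those for the named agent, handles only `Disallow` lines, and expects `name` to come before `content` in the meta tag.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper showing both exclusion checks in simplified form.
final class RobotRules {
    // Collects Disallow prefixes from records addressed to "*" or to the given agent.
    static List<String> disallowedPrefixes(String robotsTxt, String agent) {
        List<String> prefixes = new ArrayList<>();
        String agentLower = agent.toLowerCase(Locale.ROOT);
        boolean applies = false;
        boolean inAgentLines = false; // consecutive User-agent lines share one record
        for (String raw : robotsTxt.split("\\r?\\n")) {
            String line = raw.replaceFirst("#.*", "").trim(); // drop comments
            int colon = line.indexOf(':');
            if (colon < 0) continue;
            String field = line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
            String value = line.substring(colon + 1).trim();
            if (field.equals("user-agent")) {
                boolean hit = value.equals("*")
                    || (!value.isEmpty() && agentLower.contains(value.toLowerCase(Locale.ROOT)));
                applies = inAgentLines ? (applies || hit) : hit;
                inAgentLines = true;
            } else {
                inAgentLines = false;
                if (field.equals("disallow") && applies && !value.isEmpty()) {
                    prefixes.add(value);
                }
            }
        }
        return prefixes;
    }

    // A path is allowed when it starts with none of the disallowed prefixes.
    static boolean isAllowed(String path, List<String> prefixes) {
        for (String prefix : prefixes) {
            if (path.startsWith(prefix)) return false;
        }
        return true;
    }

    // Simplification: assumes the name attribute precedes the content attribute.
    private static final Pattern META_ROBOTS = Pattern.compile(
        "<meta\\s+name\\s*=\\s*[\"']robots[\"']\\s+content\\s*=\\s*[\"']([^\"']*)[\"']",
        Pattern.CASE_INSENSITIVE);

    // True if the robots meta tag lists the lowercase directive ("noindex" or
    // "nofollow") or the catch-all "none".
    static boolean metaForbids(String html, String directive) {
        Matcher m = META_ROBOTS.matcher(html);
        if (!m.find()) return false;
        for (String token : m.group(1).toLowerCase(Locale.ROOT).split(",")) {
            String t = token.trim();
            if (t.equals(directive) || t.equals("none")) return true;
        }
        return false;
    }
}
```

A spider would check `isAllowed` before fetching a page, then check `metaForbids(html, "noindex")` before storing it and `metaForbids(html, "nofollow")` before harvesting its links.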
Please send your comments to Snarfer@hav.com, or give us a call at (281) 341-5035.
Thanks again for your time and help with this WIP.
Copyright © 1995-2008 by hav.Software. All Rights Reserved.
Java and all Java-based marks are trademarks or registered trademarks of Sun Microsystems, Inc. in the U.S. and other countries. There may be other trademarks or tradenames listed in this document to refer to the entities claiming the marks and names or products. hav.Software disclaims any proprietary interest in any trademark, tradename or products other than its own. Modified - 06/13/04