Re: [Nolug] Terabyte servers on the cheap...

From: Michael <mogmios_at_mlug.missouri.edu>
Date: Tue, 25 Jun 2002 00:28:59 -0500 (CDT)
Message-ID: <Pine.LNX.4.44.0206250014380.8830-100000@mlug.missouri.edu>

> When you say "spider" how do you program something like that exactly?
> im interested in your project btw

The basic idea is something like how Google and similar sites collect
data. The Linux utility 'wget' is a simple example and pretty handy for
small projects. Essentially you start with a list of URLs, follow each
URL, and download that file. If that file is something such as a web
page you extract the URLs from it, add them back to your URL list, and
the process continues. You can do all sorts of things to arrange your
results so they're more useful, and of course that is part of how one
search engine can be better than another. I also tend to call programs
that search Usenet, IRC, etc. spiders even though technically they
usually aren't considered such unless they are following the web of
URLs. Unlike most spiders I'm not very interested in the URLs other
than for retrieving more files. I don't index my files against the
URLs but instead sort them by MIME type, checksums, keywords, etc. The
general idea for me is to make sure all the files are unique and not
corrupted.
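
To make that concrete, here is a bare-bones sketch of that loop in
modern Python (standard library only). The names (crawl, LinkExtractor,
the example.com seed, the max_pages cap) are made up for the
illustration, and a real spider would also want robots.txt handling,
politeness delays, and a smarter HTML parser.

#!/usr/bin/env python3
# Minimal spider sketch: breadth-first crawl from a seed list,
# de-duplicating downloaded files by checksum rather than by URL.
import hashlib
import urllib.request
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collect href targets from anchor tags in an HTML page."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed_urls, max_pages=100):
    queue = deque(seed_urls)      # URLs still to fetch
    seen_urls = set(seed_urls)    # don't fetch the same URL twice
    seen_checksums = set()        # don't keep duplicate file contents
    while queue and max_pages > 0:
        url = queue.popleft()
        max_pages -= 1
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                mime = resp.headers.get_content_type()
                charset = resp.headers.get_content_charset() or "utf-8"
                data = resp.read()
        except Exception as err:
            print("failed:", url, err)
            continue
        # Checksum the body; skip files we already have a copy of.
        digest = hashlib.sha1(data).hexdigest()
        if digest in seen_checksums:
            continue
        seen_checksums.add(digest)
        print(mime, digest, url)
        # If it is a web page, pull out its URLs and queue them.
        if mime == "text/html":
            parser = LinkExtractor()
            parser.feed(data.decode(charset, "replace"))
            for link in parser.links:
                absolute = urljoin(url, link)
                if absolute.startswith("http") and absolute not in seen_urls:
                    seen_urls.add(absolute)
                    queue.append(absolute)

if __name__ == "__main__":
    crawl(["http://example.com/"])

The checksum set is what keeps the stored files unique no matter how
many different URLs point at the same content.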

It can be interesting if you're into data. Being a major geek, I am
very much so. ;)

___________________
Nolug mailing list
nolug@nolug.org