Spiders, how it's done - PART ONE
By Chris Ridings
Professional search engine optimizers aren't scared of tarantulas, because they know that spiders only eat web pages. In this article we're going to revive ChrisDex, my completely imaginary and extremely egotistical search engine, to explain what spiders are and what they do. For the biologists who've arrived at this page - we're not talking about the eight-legged hairy variety that runs around the bath as if it were some sort of amusement park, so try a different search engine. We're talking about a computer program, sometimes also known as a crawler or a robot, that fetches web pages.
If you've read my previous articles you'll know that by now ChrisDex has a database of three web pages and a reverse (more usually called an inverted) index. World domination is not far off, but we're left with the nagging suspicion that a three-page index might not compete with the billions of pages search engines index today. ChrisDex needs to get bigger.
Previously, what I did to get my index to three pages was to surf the web with my web browser and save three pages to my hard drive. I then processed them to put them in my index. I could knuckle down and save a whole lot more using the exact same method if I felt like it, but to be honest most web pages are just too boring for me to want to even look at for the time it takes to save them.
Clearly I need a better solution. And I have one. What I'll do is set up a network of computers and hire ten students to do the surfing for me, and they can save the pages onto a central network server. I somehow doubt that they'll be very good at organizing themselves, though; we'll probably end up with a lot of duplicate copies of studenty-type web sites. So I'd better hire someone to tell them which pages to go to and save. I want him to be a lawyer, because he'll also have to go through the saved pages and pick out URLs which would be good to get. Lawyers are used to burying information in long documents, so I figure they should be good at finding it too.
I have it on good authority that all of ChrisDex's lesser competitors (Google, AlltheWeb, Inktomi and so on) don't appreciate the human touch that we feel adds greatly to the worth of our engine. Instead of having students in their system, they have a computer program running on each computer which just saves the pages on the list it's been given. Apparently it's efficient, gets more pages and doesn't require a cabinet of hangover cures to initialize. This works because, like ChrisDex's students, the programs don't have to understand the pages in order to visit and save what they've been told to save. So to picture their systems, there are a number of computers, and on each will be running one or more copies of a program that does a very simple task - fetch, save...get the next URL on the list...fetch, save...get the next URL on the list...
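For anyone who'd like to see what one of those tireless fetch...save programs might look like, here is a minimal sketch in Python. It is my own illustration, not how any real engine does it: the names `http_fetch` and `fetch_and_save` are made up, and naming each saved file after a hash of its URL is simply an assumption that keeps the example short.

```python
import hashlib
from pathlib import Path
from urllib.request import urlopen


def http_fetch(url):
    """Fetch one page over HTTP and return its raw bytes."""
    with urlopen(url, timeout=10) as response:
        return response.read()


def fetch_and_save(urls, out_dir, fetch=http_fetch):
    """Work through a list of URLs: fetch, save, get the next one.

    The worker never looks inside a page. It only stores the bytes.
    Returns the paths of the files that were saved.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    saved = []
    for url in urls:
        try:
            body = fetch(url)
        except (OSError, ValueError):
            # Unreachable or malformed URL: skip it and carry on.
            continue
        # Hypothetical naming scheme: one file per URL, named by its hash.
        name = hashlib.sha256(url.encode("utf-8")).hexdigest() + ".html"
        path = out_dir / name
        path.write_bytes(body)
        saved.append(path)
    return saved
```

Calling `fetch_and_save(["http://example.com/"], "pages")` would save that one page into a `pages` directory. The `fetch` parameter is there only so the loop can be tried out with a stand-in function instead of the live web.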
Our lawyer in charge is replaced in their systems with a program called URL control. This program liaises with the programs that fetch and save web pages and tells them what to fetch. Again, it's allegedly more efficient than ChrisDex's method, and its inability to send letters saves the companies a fortune. URL control reads the saved files to get the URLs from them, sorts them into a fair order to retrieve them in, checks that it is allowed to get them by checking each site's robots.txt file, and then provides each of the fetch...save programs with a new list every time they finish the one they've got. In short, URL control is the brains behind the operation. The thing to note, though, is that it need not understand the pages; it merely needs to be able to recognise what's a valid URL within them.
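Here is a rough sketch of the lawyer's replacement, again in Python and again purely illustrative. The names `LinkExtractor`, `extract_urls`, `next_batch` and the user agent `ChrisDexBot` are my own inventions. It makes two simplifying assumptions: a single robots.txt covers every URL (a real URL control would keep one per site), and URLs are handed out in the order they were found rather than in any cleverly "fair" order.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse
from urllib.robotparser import RobotFileParser


class LinkExtractor(HTMLParser):
    """Collect link targets without trying to understand the page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links and drop any #fragment.
                absolute, _fragment = urldefrag(urljoin(self.base_url, value))
                # Only web addresses count as valid URLs here.
                if urlparse(absolute).scheme in ("http", "https"):
                    self.links.append(absolute)


def extract_urls(base_url, html):
    """Return the absolute http(s) URLs linked from one saved page."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links


def next_batch(candidates, seen, robots, agent="ChrisDexBot", size=10):
    """Build the next list for a fetch...save program.

    Skips URLs already handed out and anything robots.txt disallows.
    """
    batch = []
    for url in candidates:
        if len(batch) >= size:
            break
        if url in seen:
            continue
        seen.add(url)
        if robots.can_fetch(agent, url):
            batch.append(url)
    return batch
```

In use, the rules would be loaded into a `RobotFileParser` (with `set_url` and `read` against a live site, or `parse` on lines already in hand), and the `seen` set would be kept between calls so that no fetcher is ever sent to the same page twice.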
That entire system is the crawler or spider. It can run on one machine or across several, but it always has the same components. It's easier to imagine in the human form: the office with people at computers fetching and saving web pages and someone in charge telling them what to do. No member of that office needs to understand what they are fetching or saving. They just do it. The real world software version does exactly the same thing, but faster.
In Part Two I'll discuss the URL control section in more depth.