Building A Search Engine From Scratch

When I first had the notion of building a search engine, it seemed like an impossible dream: a database as big as the internet itself, searched in a fraction of a second. It seemed there was no technology that would allow that to happen.

But somehow the big boys are doing it. What is more, they were doing it on PII technology. So there must be a solution that is affordable and still good enough to break into the market.

I have served millions of CGI programs an hour from servers, but never any as intensive as searches that read huge amounts of data or query billions of records. So how is this done?

In reality, I soon realized, it is not done. Then the prospect of building a search engine began to make sense. By precompiling the results for searches, the search engine is not actually searching but simply delivering a page.

This took me from my earlier belief that I would need a 128-processor machine to the realistic understanding that even a single-processor machine could deliver millions of results an hour.

I happened to find an article written by the Google boys while they were developing their search engine at Stanford. Check it out: http://www-db.stanford.edu/~backrub/google.html

Actually the article made things more confusing, but you might get some good ideas from it. The concept of ranking pages may be helpful but the architecture they discuss was over my head.

I have been building pages for many years, and sometimes have millions of pages in one website. So the concept of millions of records does not scare me. But I have never seen a search that could read more than a few thousand pages without taking all day.

But if the pages are indexed into categories and smaller lists of links, then a user can quickly browse to the subject or page they want.

The same will work for a search engine. If the user wants to find pages for "Fishing", then the search is not a search at all, but rather a simple redirect to a premade page with the results for that keyword.

Of course, the real search program could have spent 2 days searching all the records and making a few pages. But the end user only sees a fraction of a second and is amazed by the speed, never realizing the search was done a month ago and is being served to thousands of people a day.

And why would a search engine waste valuable processing power doing the same search over and over, when it can be done once and served out as a static page, which requires very little system effort?
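To make the idea concrete, here is a minimal sketch of compiling once and serving many times. Everything in it is made up for illustration: the sample pages, the function names, and the crude whole-word match that stands in for a real relevance test.

```python
# Hypothetical sample data: page URL -> page text. Not real crawl output.
PAGES = {
    "http://example.com/bass": "Bass fishing tips for lakes and rivers",
    "http://example.com/poles": "Fishing poles and reels reviewed",
    "http://example.com/golf": "Golf clubs and courses",
}

def compile_results(keyword, pages):
    """The slow step: scan every page once and list the URLs that mention the keyword."""
    word = keyword.lower()
    return sorted(url for url, text in pages.items() if word in text.lower().split())

def build_static_pages(keywords, pages):
    """Run the slow step ahead of time and keep one finished page per keyword."""
    return {kw.lower(): "\n".join(compile_results(kw, pages)) for kw in keywords}

def serve(keyword, static_pages):
    """The fast step: no searching at all, just hand back the premade page."""
    return static_pages.get(keyword.lower(), "No results compiled yet.")

# Compiled once, long before any user shows up.
STATIC_PAGES = build_static_pages(["fishing", "golf"], PAGES)

print(serve("Fishing", STATIC_PAGES))
```

In a real setup the premade pages would more likely be files on disk served by the web server, but the shape is the same: the expensive scan happens in build_static_pages, and serve never touches the page data.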

To make the system even leaner and faster, only the sites that were returned in the initial system search will be considered for the next compiling of results. So rather than spending 2 days making the results page for fishing the second time, it can be done in a few seconds.
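A rough sketch of that second, faster compile might look like the following. The names (mentions, recompile, last_results) and the sample pages are my own placeholders, and the whole-word match is only a stand-in for whatever relevance test you really use.

```python
def mentions(keyword, text):
    """Crude stand-in for a real relevance test: whole-word match."""
    return keyword.lower() in text.lower().split()

def recompile(keyword, previous_urls, pages):
    """Re-check only the pages that made the last results page."""
    return [url for url in previous_urls
            if url in pages and mentions(keyword, pages[url])]

# Hypothetical current crawl data: URL -> page text.
pages = {
    "http://example.com/lake": "fishing on the lake",
    "http://example.com/moved-on": "now a page about golf",
    "http://example.com/recipes": "cooking recipes",
}

# Results from the first, slow compile. One of these pages has since vanished.
last_results = [
    "http://example.com/lake",
    "http://example.com/moved-on",
    "http://example.com/gone",
]

print(recompile("fishing", last_results, pages))
```

The work is now proportional to the size of the old results list instead of the whole crawl. The trade-off is that a page which never made the first list will not show up until it is added some other way.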

You will also want to apply cross limiting to searches. Since you have already crawled millions of pages to find those relevant to fishing, if your compiler is building a page with results for "Fishing Poles", it only needs to consider the pages that matched for fishing.

Applying this principle, your compiler will get progressively faster and smarter as it continues to build results for searches, since it only needs to access pages that already qualify for the results.
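Cross limiting could be sketched like this, assuming the results for "fishing" have already been compiled. The function name narrow and the sample data are hypothetical, and matching is again a simple whole-word check.

```python
def narrow(parent_urls, extra_word, pages):
    """Build results for a longer phrase by scanning only the parent keyword's results."""
    word = extra_word.lower()
    return [url for url in parent_urls if word in pages[url].lower().split()]

# Hypothetical crawl data: URL -> page text.
pages = {
    "http://example.com/bass": "bass fishing tips",
    "http://example.com/poles": "fishing poles and reels",
    "http://example.com/boats": "fishing boats for sale",
    "http://example.com/tents": "tent poles and stakes",
}

# Already compiled earlier, the slow way.
fishing_results = [
    "http://example.com/bass",
    "http://example.com/poles",
    "http://example.com/boats",
]

# Only three pages are read here, no matter how big the crawl is.
fishing_poles_results = narrow(fishing_results, "poles", pages)
print(fishing_poles_results)
```

Notice that the tent page mentions poles but is never even opened, because it did not qualify for fishing in the first place.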

I know that Google started with some basic PCs and built its way up. Now I can understand how starting with a low number of pages makes growth possible, since you do not have to keep searching the entire internet to build the result pages. Only new relevant pages need to be considered for addition to keyword searches.
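Here is one way the growth step might look, as a sketch only. It assumes the compiled results are kept as a keyword-to-URL-list mapping and that each crawl hands over just the newly found pages; add_new_pages and the sample data are invented for the example.

```python
def add_new_pages(results, new_pages):
    """Check only newly crawled pages against the keywords already compiled.

    results:   keyword -> list of URLs (updated in place and returned)
    new_pages: URL -> page text for pages found since the last compile
    """
    for url, text in new_pages.items():
        words = set(text.lower().split())
        for keyword, urls in results.items():
            if keyword in words and url not in urls:
                urls.append(url)
    return results

# Hypothetical compiled results from earlier runs.
results = {
    "fishing": ["http://example.com/bass"],
    "golf": ["http://example.com/clubs"],
}

# Hypothetical output of the latest crawl.
new_pages = {
    "http://example.com/fly": "fly fishing gear",
    "http://example.com/cooking": "cooking recipes",
}

add_new_pages(results, new_pages)
print(results["fishing"])
```

The cost of each update depends on how many new pages came in, not on how many pages are already in the database, which is what lets the database keep growing.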

You would never be able to build a database of 8 billion pages and compile the results for keywords while searching every page for every word. The system may be able to handle the data, but it may take weeks to compile a single set of results for one keyword.

The principle of slow growth is the only way that a database can grow that large.

Think of this as Burt Rutan building his X Prize rocket, which first enjoys a 1 hour climb to a high elevation, rather than the NASA model of building a massive fuel supply and getting into space in just over a minute. Of course, if you have the NASA budget, what the hell, go for the muscle and search the internet with a billion terabytes of RAM and the processing power of one million CPUs.

But if you are poor like me, you will need to take the slow and easy approach. And just maybe that is the reason that, after $100 million, Microsoft is still buying Yahoo results. They should think smaller and expand, rather than trying to muscle the results. But then again, the reason for failure could be even more obvious: they might be running Windows servers.

But seriously, this project is doable. With the right planning and decent coding, anyone could build a great search engine. And once the structure is established, it could be applied to smaller systems, such as corporations with millions of records that must be searched. Of course, they have Oracle for that, paying millions of dollars for the ability to do what any decent programmer should be able to do with relative ease.

You must admit, the possibility of networking half a dozen $500.00 PCs to crawl and index a corporate database and output results faster and more efficiently than the multi-million dollar Oracle system is slightly appealing.

One advantage of internet crawling is that you do not need to store every page. You can discard pages with low content or undesirable material (e.g. porn). Not that porn is a bad market, just that the amount of it will overwhelm your bot. Filtering the adult content from your search will cut your database down tremendously.
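A crawl-time filter could be as simple as the sketch below. The word-count threshold and the blocked word list are assumptions picked for the example, not tested values, and worth_keeping is a name I made up.

```python
# Assumed settings, chosen only for illustration.
MIN_WORDS = 5                     # anything shorter counts as "low content"
BLOCKED_WORDS = {"porn", "xxx"}   # a real list would be much longer

def worth_keeping(text):
    """Decide at crawl time whether a page is worth storing at all."""
    words = text.lower().split()
    if len(words) < MIN_WORDS:
        return False
    return not any(word in BLOCKED_WORDS for word in words)

# Hypothetical crawled pages: URL -> page text.
crawled = {
    "http://example.com/bass": "Bass fishing tips for lakes and rivers",
    "http://example.com/stub": "Click here",
    "http://example.com/adult": "free xxx pictures and videos online",
}

kept = {url: text for url, text in crawled.items() if worth_keeping(text)}
print(sorted(kept))
```

Because the filter runs before anything is stored, every later compile has fewer pages to read.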

I would recommend that you specialize in one market, keeping your search lean and efficient. Having a broad-based search is great, though, and you could get one with a meta type search: query all of your search sites and then sort the results in real time for one overall search.
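As a sketch of the meta search idea, suppose each specialized site keeps its compiled results as keyword -> list of (URL, score) pairs. That layout, the scores, and the name meta_search are all assumptions for this example; a real setup would query each site over the network.

```python
def meta_search(keyword, site_indexes):
    """Ask every specialized index for the keyword, then merge and sort by score."""
    merged = {}
    for index in site_indexes.values():
        for url, score in index.get(keyword.lower(), []):
            # If two sites list the same URL, keep its best score.
            merged[url] = max(score, merged.get(url, 0.0))
    # Highest score first; ties broken by URL so the order is stable.
    return sorted(merged, key=lambda url: (-merged[url], url))

# Hypothetical specialized indexes with made-up scores.
site_indexes = {
    "fishing": {"poles": [("http://example.com/fishing-poles", 0.9)]},
    "camping": {"poles": [("http://example.com/tent-poles", 0.7),
                          ("http://example.com/fishing-poles", 0.4)]},
}

print(meta_search("poles", site_indexes))
```

Each specialized index stays small and fast on its own, and the only real-time work is the merge and sort of a handful of short lists.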

If I search Google for "fishing" I get millions of results, from yellow page ads and fishing website builders to blatant spam, but very few that are dedicated to fishing, since most fishing related websites cannot compete on SEO with advertising companies. If someone is searching for "fishing website", are they looking for a website about fishing or to have a website designed for a fishing product?

By limiting the search engine to fishing related sites, the user will only get results relevant to fishing. If they want a website built, they can search the web services database rather than the fishing database.

OK, that defeats the purpose of a primary search engine, but the concept of better searches is more appealing to the user. By first defining what topic they are looking for and then the keywords, a search engine could operate at maximum efficiency.
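In code, topic first and keywords second comes down to a two-level lookup. This is only a sketch: the two databases, their contents, and the name topic_search are hypothetical.

```python
# Hypothetical topic databases: topic -> keyword -> precompiled result URLs.
DATABASES = {
    "fishing": {"website": ["http://example.com/fishing-reports"]},
    "web services": {"website": ["http://example.com/site-builders"]},
}

def topic_search(topic, keyword, databases):
    """Pick the topic database first, then look up the keyword inside it."""
    return databases.get(topic.lower(), {}).get(keyword.lower(), [])

# The same keyword gives different answers depending on the topic chosen.
print(topic_search("Fishing", "website", DATABASES))
print(topic_search("Web Services", "website", DATABASES))
```

The same word "website" lands on fishing sites in one database and on site builders in the other, which is the ambiguity a single-box search has to guess at.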

Of course, our search won't use that principle, because people are generally lazy and just want to type in one word and get exactly what they are thinking. So if your engine is going to have broad appeal, it must be built with complete simplicity, or exclusively for the average idiot.

Whatever you decide, you should know it does not take a NASA to get a man in space, and it does not take a Google to search the internet. Since Google started in the owners' garage, we have to believe it can be done again by the right people.

And when it is done, the MSNs, AOLs and Yahoos will all be lined up to buy results if it lets them bypass Google and get back on top. I am sure they would be willing to pay a few hundred bucks for the coding if you wanted to sell it.

