More About SquiggleBot

Why the name SquiggleBot?

Some people know this as a "tilde" (~). But more scholarly people like ourselves know the more technical name of the crooked little line: the "squiggle". Since only high-end programmers would read "~bot" as "SquiggleBot", we had to use the actual word "squiggle" instead of the ~.

Here is the basic rundown on how the idea of the Squiggle project was born. Of course, the idea of being a search engine and making millions of dollars is always in the back of my mind. But foremost, I see many pseudo searches popping up, serving content from my pages and getting millions of page views daily without ever building a page of their own. In most cases they don't link back to me; they just use my text, domain name and hard work to grab my traffic. If I could build a crawler and index some pages, I could build a website with millions of pages overnight rather than typing out a few each day. That in itself was appealing, but the site needed some value to be useful and frequently visited. So if I made the index searchable, maybe people could look for the content they want to see. And what the hell, I have millions of page views on other sites and can get people to my baby search engine. And maybe, just maybe, I could become the next Google.

OK, that is where the dreaming part ended and the work began. I had an old, unused Pentium II 300 MHz computer which would be perfect for a simple crawler. So after about 3 hours of programming I had my first crawler. And, wow, was it impressive. With just one crawler running I was indexing over 10,000 pages an hour. I could hardly wait to see what it would do on a multiprocessor 3.0 GHz machine. At this rate I would be the next Google in a matter of days! So I let the sucker run for an entire day. Nearly a quarter of a million pages were indexed, and all I needed was a search. A couple more hours of coding and I started searching the pages. My first reaction was "This is a bunch of crap!": 80% porn, 19% blatant spam and 1% usable content.
Out of a quarter of a million pages, I would be hard pressed to find one page full of quality results. So is this idea sunk? Never! Never give up when you have a good idea.

So the first step is to filter out the "CRAP". By carefully analyzing the differences between good pages and spam, I was able to make a filter that rejects undesirable websites quickly and avoids crawling useless pages that would need to be removed later. Well, that is great, but it takes my bot from 10,000 pages an hour to just over 100. From 250,000 pages a day to 2,500. But that is just fine, because now I have 2,500 quality, or at least better quality, sites.

Now the wheels begin turning in my head and running over the lightbulbs. If I can limit my index to 1% of the Google index, I can have a leaner and faster index at a fraction of the overhead: less costly to crawl the existing pages, and faster to compile searches with less data. With Google growing to massive success and indexing 4 billion pages (now at 8 billion), my index could deliver good results with only 40 million pages. Of course, we will still need to crawl those 8 billion pages at least once, but then the crawler can concentrate on the 40 million pages it likes.

OK, we will never have the results that a Google or Yahoo can provide, and we will have filtered out 99% of the porn, making our search undesirable to 80% of searchers. We also will not be able to index PDFs, MPEGs and Word files, at least not in the beginning. But if you cannot find a good result from any of the big boys, SquiggleBot might just be the solution for you. Clean, family friendly and limited spam. To me, that sounds like the DreamEngine. See a sample output page here. Of course, once people start using it, the spammers will try to beat it. But since most of their URLs will already have been filtered out, the threat should be minimal.
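For anyone who wants to check the crawl numbers above, here is a minimal Python sketch. The rates and index sizes come straight from the story; the function and constant names are made up for illustration, and it assumes a single crawler running around the clock at a constant rate, which is a simplification.

```python
HOURS_PER_DAY = 24

# Figures quoted in the story above.
UNFILTERED_RATE = 10_000        # pages per hour, one crawler, no filter
FILTERED_RATE = 100             # pages per hour once the filter is on
GOOGLE_INDEX = 4_000_000_000    # Google's index size quoted in the text
KEEP_PERCENT = 1                # target: 1% of the Google index


def pages_per_day(pages_per_hour: int) -> int:
    """Pages one crawler handles in a day, assuming it never stops."""
    return pages_per_hour * HOURS_PER_DAY


def days_to_collect(target_pages: int, pages_per_hour: int) -> float:
    """Days a single crawler needs to collect target_pages at a constant rate."""
    return target_pages / pages_per_day(pages_per_hour)


# Integer arithmetic avoids floating-point rounding on the 1% figure.
target_index = GOOGLE_INDEX * KEEP_PERCENT // 100

if __name__ == "__main__":
    print("unfiltered pages/day:", pages_per_day(UNFILTERED_RATE))
    print("filtered pages/day:  ", pages_per_day(FILTERED_RATE))
    print("target index size:   ", target_index)
    print("days for one filtered crawler:",
          round(days_to_collect(target_index, FILTERED_RATE)))
```

The daily figures come out at 240,000 and 2,400, which the story rounds to "a quarter of a million" and "2,500". Under these assumptions, a single filtered crawler would need over 45 years to collect 40 million pages, so one old machine was never going to be enough.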
So day by day, the research goes on, the crawling continues, more computers are added to the network of cheap servers and the database is growing at a snail's pace. I have added SiteCaptures to the index as a feature to make our search unique. By limiting the number of sites and using one image of each site's main page, this is very possible. I will use a dedicated image server to deliver the thumbnails so it does not hurt system performance on searches. I have thrown out the old Pentium II and replaced it with half a dozen Pentium 4 3.0 GHz machines. It's making the job much easier to accomplish in this lifetime.

I estimate that I will need 10 to 15 servers to deliver database queries to a main server. A single high-end image server with dual processors and SCSI drives should be able to deliver 10 million images an hour, or about 1 million searches with 10 results each. And yes, Google does over 10 million searches an hour. If I get that big, I will add 10 more servers. Hell, if I get that big, I will just retire.

So that is the plan: jam all the servers into a 42U server cabinet with a 100 Mbps connection (that is all I can afford without commercial financing) and then get people to use it. Since bandwidth is not cheap, the current crawl is limited to 10 Mbps in total, divided between a cable modem, a DSL line and some of our datacenter bandwidth, in an effort not to bankrupt us in the building process. Hopefully, after the search is implemented, the money will be available to pay for 100 Mbps. Otherwise I will need to limit access.

If it works, you will be reading about me in the Wall Street Journal. If not, you might see a picture in The National Enquirer of some maniac pushing a server cabinet off the roof of an office building and screaming insanely.

And that is my story! Browse the additional pages for more details about the project.
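The serving estimates in the story can be sanity-checked the same way. This is only a rough sketch: the 10 million images an hour, 10 results per search and 100 Mbps figures are from the text, while the function names are hypothetical and the bandwidth calculation assumes the whole link carries nothing but thumbnails (no page HTML, no protocol overhead), with 1 Mbps taken as 1,000,000 bits per second.

```python
SECONDS_PER_HOUR = 3600

# Figures quoted in the story.
IMAGES_PER_HOUR = 10_000_000    # what the image server should deliver
RESULTS_PER_SEARCH = 10         # one thumbnail per result
LINK_MBPS = 100                 # planned cabinet bandwidth


def searches_per_hour(images_per_hour: int, results_per_search: int) -> int:
    """Searches served per hour if every result shows one thumbnail."""
    return images_per_hour // results_per_search


def max_bytes_per_image(link_mbps: float, images_per_hour: int) -> float:
    """Largest average thumbnail size the link could carry at that rate.

    Assumes the entire link is used for images, which is optimistic.
    """
    bytes_per_hour = link_mbps * 1_000_000 / 8 * SECONDS_PER_HOUR
    return bytes_per_hour / images_per_hour


if __name__ == "__main__":
    print("searches/hour:", searches_per_hour(IMAGES_PER_HOUR, RESULTS_PER_SEARCH))
    print("max bytes per thumbnail:", max_bytes_per_image(LINK_MBPS, IMAGES_PER_HOUR))
```

On those assumptions, 10 million images an hour over 100 Mbps leaves room for thumbnails averaging about 4,500 bytes each, so the images would have to stay small for the plan to fit the link.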