SquiggleBot Project Progress

SquiggleBot is a home-based project, unlike the mega-financed MSN and Google systems. We are on a learning curve, our progress is much slower, and we will likely take as many steps back as forward. We will list our steps here in hopes that some of our learning will benefit others looking to build the next big search. You are welcome to laugh at us, discuss how stupid some of the ideas may be, and even joke about how little we know. But we are doing it, and we want to share our ideas with other people who may know even less or have some interest in the project. Foremost, we realize how much we need to learn, but not everyone starts at the top. The game is to get there in the end.

Recently we realized our crawler was not harvesting enough URLs to keep crawling new websites. Our acceptance parameters for websites are so strict that most websites will not qualify to be listed or to have their links included for crawling. It seems like getting pages to crawl should be the last problem for a bot. But with link farms and the heavy layer of spam online, it is easy to suck up spam sites and lose the bot in a black hole of useless pages. Since we want to be more discriminating in our results, our bot has been tweaked to skip the bulk of spam sites and avoid adult content. We also do not squeeze links from the entire website, as most harvesting-type bots do when they simply mine your site for usable data. As time progresses, millions and millions of disqualified domains are added to our list, making the chore of finding new unique URLs even more difficult. If we wanted just spam, we could have a "crawler gone wild". But we want a good index and good sites with quality content. These are the sites that have become increasingly hard to find with conventional searches.

After finding other sites trying to accomplish our very same goal, one suggested http://rdf.dmoz.org, the DMOZ data dump.
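As a rough illustration of seeding a crawler from a directory dump while skipping disqualified domains, here is a minimal Python sketch. It is not the SquiggleBot code. It assumes the dump lists each reviewed link as an `<ExternalPage about="...">` element, and the names `seed_urls`, `domain_of` and the sample domains are hypothetical.

```python
import re
from urllib.parse import urlparse

# Assumption: each reviewed link appears as <ExternalPage about="URL"> in the dump.
EXTERNAL_PAGE = re.compile(r'<ExternalPage\s+about="([^"]+)"')


def domain_of(url):
    """Return the lowercased host of a URL without a leading 'www.'."""
    host = (urlparse(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def seed_urls(rdf_lines, disqualified):
    """Yield the first URL seen for each domain, skipping disqualified domains."""
    seen = set()
    for line in rdf_lines:
        for url in EXTERNAL_PAGE.findall(line):
            domain = domain_of(url)
            if not domain or domain in disqualified or domain in seen:
                continue
            seen.add(domain)
            yield url


# Hypothetical sample lines standing in for the real dump.
sample = [
    '<ExternalPage about="http://www.example.com/">',
    '<ExternalPage about="http://spam.example.net/offers">',
    '<ExternalPage about="http://example.com/about.html">',
    '<ExternalPage about="http://example.org/">',
]
seeds = list(seed_urls(sample, disqualified={"spam.example.net"}))
print(seeds)  # ['http://www.example.com/', 'http://example.org/']
```

Because `seed_urls` takes any iterable of lines, a multi-gigabyte file can be passed as an open file object and streamed one line at a time instead of being loaded into memory.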
Since most of the URLs have been reviewed, it was a goldmine of information. You can download a 2 GB data file of reviewed URLs. The URLs are already classified and separated into categories. Of course, extracting the useful data and integrating it into your own system could be a chore. We have successfully extracted the URLs and have set them up as a base for our crawler. This should give us a great base to build on. In some cases the domains may have expired, or the owners have failed to pay for hosting services, and the URLs are redirected to commercial, dynamically generated spam pages. So the spam traps are still a high risk even with the manually reviewed sites. But the database is still the largest we have found and a great start for any serious search engine.

8/05/05 - Currently have 5 servers running data, crawling and taking site captures for accepted websites. Our plan is to add 10-15 more once we have a usable search. But money does not grow on trees, and before we invest in a rack full of servers it would be nice to see finished software. Or at least good software. We are also trying not to recrawl sites until we are fully functional. So don't panic if you do not see our bot again for a couple of months. It will be back and will descend further into your website.

8/18/05 - So much for waiting. Our impatience is getting the best of us. With several quad Xeon servers ordered and on the way, we have started assembling that rack full of servers. Our low-end P4 machines are still running data, but we needed much more power to compile data. A recent sample run took over 72 hours to process only one million domains. No doubt the programming could be tweaked, but more power will make the job much faster as well as allow more frequent updates. We have also encountered many delays in screen shots that can be easily overcome by more powerful machines on a more reliable network.
It's not that we cannot achieve our goal of using inexpensive machines; it's just that we can do it faster with more power. Our initial push is to crawl millions of domains and process the data on the sites. We could do this over the next year, or boost horsepower and get to the finish faster. We are still nowhere near the power of MSN or Google, but we are taking the next step to launching the search service.

8/22/05 - In 1 month we have already revised SquiggleBot for the third time. Of course the crawler gets daily updates, but we up the revision number when it's a major rewrite. To keep the smaller P4 machines as our primary crawlers and still crawl enough websites to have a good index, we have set up a new network to process data rather than doing it as we crawl the page. A small P4 can quickly download a page, but asking it to extract URLs, look for undesirable content and compile the text into a database slows down the process too much. SquiggleBot_3.0 will only download the main page of a website and write it to a cache file. We will then use quad-processor machines to run the data in a backend process. This should up the productivity of a single crawler from about 50,000-100,000 pages a day to 300,000 or 400,000 pages in a single day. It will also allow us to run more crawlers on one machine, since the system resources are not being stressed. Using the "Main Frame" (a rack full of quad Xeon machines) to process the incoming pages and compile searches will give us the ability to index pages more frequently and provide results closer to real time than from pages crawled months ago.

8/28/05 - Getting frustrated. Beginning to realize the immensity of the task at hand as it gets out of control. I have installed (2) 22U server racks in my living room. Now with 12 servers set up in the search network as I build the architecture of the programming and database. Five of the servers are quad-processor machines, one is a dual-processor machine and five are single-processor machines.
So that is only 27 processors, still far short of the 64-processor Oracle systems that still would not do the work. I read this in an HP press release: "Running on a single HP Integrity Superdome server with 64 Intel® Itanium® 2 1.6 GHz processors with 9MB cache using the HP-UX 11i operating environment, Oracle Database 10g Release 2 achieved world record performance of 71,847.8 QphH@3000GB" At that rate it would cost over a million dollars just to search the domains I have indexed in the last month. It's easy to see how a company can spend $100 million. But it still makes more sense to break down the database into presearches and be able to deliver 10 to 20 times the query rate on a moderately priced server. The reality of how much data I am compiling is taking hold. As the structure changes and I relocate data to alternate machines, the process takes hours at a time. That was the primary reason for upgrading to quad-processor machines that can kick the pants off the single P4 machines. Still, using the P4s to run low-level tasks like crawling and site captures is keeping costs down. I figure I can get this moving for less than $100,000.00, which may seem like a hefty chunk of change, but it's peanuts in the search engine game.

9/14/05 - Still in the learning process. Since we have no experience in large databases, it's a game of small steps and slow progress. To help develop a fast search we built http://beeshops.com as a shopping search engine. With 3 million records, it is a good exercise in overcoming the search problem. It's not perfect, but with no SQL and no proprietary databases other than what we developed, it's not too bad. And if we can search 3 million records as quickly as it shows, then completing the SquiggleBot project should be in the near future. We have temporarily stopped crawling websites as we develop the search end of this project. The crawler has been maximized to crawl 1+ million pages a day from a single server.
At that speed, we should be able to update content at a reasonable rate. The compiling process can run on separate servers and should not affect the search ability of the system. The only slow part is gathering site captures. That process continues daily, as does the programming and development of the database.
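The split described in these entries, where the crawler only writes pages to a cache file and separate machines compile them later, can be sketched roughly as follows. This is a minimal Python illustration under stated assumptions, not the actual SquiggleBot code: the function names, the one-file-per-domain cache layout and the banned-word check are all hypothetical, and the page HTML is passed in directly instead of being fetched over the network.

```python
import re
import tempfile
from pathlib import Path

# Naive link pattern; good enough for a sketch, not a full HTML parser.
HREF = re.compile(r'href="(https?://[^"]+)"', re.IGNORECASE)


def cache_page(cache_dir, domain, html):
    """Crawler stage: store the raw main page and do no other work."""
    path = Path(cache_dir) / f"{domain}.html"
    path.write_text(html, encoding="utf-8")
    return path


def compile_cache(cache_dir, banned_words):
    """Backend stage: read cached pages, skip unwanted ones, extract links."""
    results = {}
    for path in sorted(Path(cache_dir).glob("*.html")):
        html = path.read_text(encoding="utf-8")
        lowered = html.lower()
        if any(word in lowered for word in banned_words):
            continue  # undesirable content: leave the page out of the index
        results[path.stem] = HREF.findall(html)
    return results


with tempfile.TemporaryDirectory() as cache:
    # Hypothetical pages standing in for downloaded main pages.
    cache_page(cache, "example.com", '<a href="http://example.org/">friend</a>')
    cache_page(cache, "spam.example.net", "<p>casino casino casino</p>")
    compiled = compile_cache(cache, banned_words={"casino"})

print(compiled)  # {'example.com': ['http://example.org/']}
```

Keeping the crawler stage this thin is the point of the design: the download box never parses anything, so the expensive work can be moved to whichever machines have spare processors.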
3/01/06 - Not enough time to dedicate. I wish I could spend more time on SquiggleBot, but I need to pay my rent, and development costs time and money and pays nothing. I have developed some great programming in the process, though. A site using the technology is http://beeshops.com. With a few million records the search is fast and the results are pretty good. Of course the results are only as good as the data, and that is not great. I am using that as the first platform for tweaking the compiling process and search. This allows me to spend time on a site with a potential income rather than on my dreams of kicking Google in the ass. I have also used the thumbnail-generating software on smaller searches like BuzzTrader Car Clubs. It has spruced up the look and makes the search more user friendly. However, the logistics of creating a thumbnail of every internet domain are starting to hit home. It does not seem feasible, but when you consider what Google is raking in, anything is feasible. Maybe just not from my home office.

Expected launch window on or before January 2006. Of course our fingers are crossed and we are wearing four-leaf clovers, but that should be enough time to complete the software and index a few million domains. Then the marketing begins as we try to get people using the search. Keep in mind we already have millions of daily page views, and getting traffic should not be a big hurdle.