Considerations for Building a Quality Index

We are learning quickly that no search engine will be worth a crap unless it is monitored by human (or at least canine, feline or bovine) moderators.

Since there is a workaround for every filter, the only way quality sites can be indexed is if they are reviewed by real people.

Here is the dilemma. Google says no doorway or spam pages. Of course, they neglect to tell you what they mean or what specifically not to do, so they could refuse your site for not complying without having any comprehensive way to make that judgement.

So we get this scenario. The webmaster spends a year building an awesome website, with great free services and no spam or advertising. By rights, the site should rank #1 on Google. However, you will never find the site in Google unless some other site, ranked for some other reason, provides information about it. The surfer then has to visit that other site and click a link to reach the site they should have been sent to in the first place.

So what can the webmaster do? Build pages that will appear in the top Google results.

It is well known in SEO circles that a good website normally will not perform well in the search engines. Optimizing the site to rank often takes away from its functionality and quality.

So by design, Google forces webmasters to build pages that get traffic rather than a good website.

With over 8 billion pages, Google would be hard pressed to review every page and determine what is real and what is spam. And with bots extracting URLs from existing pages, spam would fill the index again much faster than it could be deleted.

Google also states that you do not want to link to bad neighborhoods. Doing so could get your website banned from the index. So linking to outside sites is now a risk.

How do you know if a site is a bad website? You can make some assumptions, but how can you guess what googlebot was thinking? A very good website such as a classifieds site could get squashed because someone posts a link to a known spam website.

This would be unfair to the site, but ultimately unfair to the surfer. Banning sites based on page link parameters can often eliminate great websites and replace them with spam pages built by savvy webmasters.

We have considered this issue on our free website hosting system. If we let people build pages at our domain and they link to spam, our entire domain will pay the price as well as other users on the system.
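One way a hosting system could watch for this is to scan each user page for outbound links before publishing it. The following is only a rough sketch and not our actual system: the `blocklist` of known spam domains is a hypothetical input that a human moderator would have to maintain, and the function names are made up for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def flagged_links(html, blocklist):
    """Return outbound links whose host is on the (hypothetical) blocklist."""
    collector = LinkCollector()
    collector.feed(html)
    flagged = []
    for href in collector.hrefs:
        # Relative links have no hostname and are never flagged.
        host = urlparse(href).hostname or ""
        on_list = host in blocklist
        # Also catch subdomains of a blocked domain.
        is_subdomain = any(host.endswith("." + bad) for bad in blocklist)
        if on_list or is_subdomain:
            flagged.append(href)
    return flagged
```

A page with flagged links could then be held for a human to look at instead of being published or rejected automatically.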

Meanwhile, there are hundreds of scammers buying top level domain names in their country or under any country code, such as com.jp. Then they set up thousands of subdomains such as yourname.com.jp. They capitalize on your name, they operate out of the country, shielded from US laws, and they kill the search results.

What makes matters worse is that they often display Google ads.

We believe that there is no way around human intervention in search filtering.


The Google Cloning Syndrome

With http://yahoo.com trying to get back in the search market after piggybacking on Google for years, the latest update is the most blatant duplication of Google I have ever seen. Other than the Yahoo logo, the page is an exact duplicate. MSN is not much better, but at least they have some different color shading.

Below are 3 screencaps of the big three. Do you see three unique websites, or three companies that are trying to be the same?

We have seen many other Google clones as well, but we almost expect that from the smaller companies. Now, though, there is no diversity in the searches. Even the results are similar and bland.

We would like our search engine to be unique and offer some aspect that the others can't. One feature is the thumbnail indexing we have developed. This will allow anyone to see a screen cap of the website before wasting time surfing useless spam. Check out the sample output page.

One thing we can promise is that when our index is unleashed, it won't look like the clones above.


After much thought on ranking pages, we have decided to throw that out the window. A page or website must be ranked solely on its own content.

If you encourage outside links, internal links, page text size and so on, you begin to mold the internet into efficient websites instead of quality content.

If relevancy based on outside links is eliminated, it will eliminate millions of spam sites and thousands of scammers selling page link services, and clean up the web considerably.
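To make the idea concrete, here is a minimal sketch of ranking pages on nothing but their own text. It is not our actual scoring code: the function names and the simple word-frequency score are assumptions chosen for illustration. The point is only that no inbound links, site age, or past ranking appear anywhere in the calculation.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def content_score(page_text, query):
    """Score a page for a query using only the page's own words.

    The score is the fraction of the page's words that match a query term.
    """
    words = tokenize(page_text)
    if not words:
        return 0.0
    counts = Counter(words)
    terms = set(tokenize(query))
    hits = sum(counts[t] for t in terms)
    return hits / len(words)


def rank(pages, query):
    """Return the URLs of matching pages, best content score first."""
    scored = [(content_score(text, query), url) for url, text in pages.items()]
    # Sort by descending score, then by URL so ties are stable.
    scored.sort(key=lambda s: (-s[0], s[1]))
    return [url for score, url in scored if score > 0]
```

A plain frequency score like this is easy to stuff with keywords, which is one more reason the filter still needs people behind it.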

Has Google ever considered what they have unleashed on the net? Do they realize they have indirectly created the massive SEO market and the micro scammers? Didn't they realize the spammers would crack their code and fight them all the way to oblivion?

Imagine if you could write a great article on, let's say, "Cold Fusion Energy", and people could actually read it. Currently, without page rank, your article that could change the world would sit unread.

However, we would like to change that.

The biggest obstacle is that you would actually have to index more pages, more often, and compile results faster. This is a huge hurdle to get over, unless you can cut the internet down to size: filter the unwanted content, trim the fat from the spam, and compress the quality content.

We have several mini-bots running daily. Each is crawling about 100,000 domains each day. However, only about 10% are listed for inclusion. That means we only have to search 1/10th of the data to get good results. Sites that are included can be crawled more often and searches can be closer to real time than with current systems.
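The arithmetic behind this can be sketched as follows. The crawl and inclusion figures come from the numbers above. The `passes_filter` callback and the 1,000,000-domain index size are hypothetical placeholders, since the real filter rules and index size are not given here.

```python
def select_for_inclusion(crawled_domains, passes_filter):
    """Keep only the domains that pass the quality filter."""
    return [d for d in crawled_domains if passes_filter(d)]


def recrawl_interval_days(indexed_domains, crawl_capacity_per_day):
    """Days needed to revisit every indexed domain once."""
    return indexed_domains / crawl_capacity_per_day


crawled_per_day = 100_000   # per bot, from the text
inclusion_percent = 10      # about 10%, from the text
included_per_day = crawled_per_day * inclusion_percent // 100

# Hypothetical index sizes, to show the effect of filtering on freshness.
unfiltered_index = 1_000_000
filtered_index = unfiltered_index * inclusion_percent // 100

days_unfiltered = recrawl_interval_days(unfiltered_index, crawled_per_day)
days_filtered = recrawl_interval_days(filtered_index, crawled_per_day)
```

Under these assumptions one bot lists about 10,000 domains a day, and the same bot can revisit the filtered index ten times as often as the unfiltered one.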

This type of selective prejudice will beat the spammers. Since only a fraction of their websites will pass the filter, and those will likely be excluded later by human intervention, they have little to gain from their work. But the scholar has a lot to gain from writing a great article, and the users will get better results.

Will good sites be bypassed? Of course. But with human intervention, they can always be reconsidered. The same goes for spam sites: they could be included but will likely be removed later.

One philosophy we do not agree with is seeing the same sites at the top of a search forever. It seems that if the content is changing, the site should be placed based on the page itself rather than on past performance, age, or some over-inflated page rank.

The nature of the net is that it changes every minute. A good search engine should reflect that, and not rely on old data or old information compiled from thousands of websites. If a page has an article about "Walking Your Dog" and the webmaster changes that to "Asian Cooking", people will be disturbed when they search for dog stuff and see a gourmet platter of meat. And rightfully so.
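A simple way to keep an index honest about changed pages is to fingerprint each page on every crawl and throw away the old terms when the fingerprint changes. The sketch below is a toy illustration under that assumption, not a description of how squigglebot is built, and the `FreshIndex` class and its methods are hypothetical names.

```python
import hashlib
import re


def fingerprint(text):
    """Hash of the page text, used to notice when the content changes."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


class FreshIndex:
    """Toy index that keeps only the terms of each page's latest crawl."""

    def __init__(self):
        self.fingerprints = {}   # url -> hash of the last crawled text
        self.terms = {}          # url -> set of words currently on the page

    def update(self, url, text):
        """Re-index the page if its text changed; return True if it did."""
        digest = fingerprint(text)
        if self.fingerprints.get(url) == digest:
            return False
        self.fingerprints[url] = digest
        # Replace, rather than merge, so stale terms stop matching.
        self.terms[url] = set(re.findall(r"[a-z0-9]+", text.lower()))
        return True

    def search(self, word):
        """Return URLs whose current text contains the word."""
        word = word.lower()
        return sorted(u for u, t in self.terms.items() if word in t)
```

With this approach, once the "Walking Your Dog" page is recrawled as "Asian Cooking", a search for "dog" no longer returns it.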

Of course no search can be real time. However, under current systems that page could stay leveraged at #1 status for months due to outside links and previous search data.

Our main concern is keeping databanks lean and current. Since the net is instant, people tend to expect realtime results. Many people have no clue that it could be 6 months to a year before their website is included in an index.

We would like to change that. Of course, the same people may be very mad if they learn their site did not qualify to be included. But it would be wonderful if a website needed to be good to be part of the index. And such will be the case with squigglebot.

