Friday, October 8, 2010

Making Your Own Search Engine

Making your own search engine is always a difficult task. I've been working in this area for about 6-7 months, since I first heard about Google's secret PageRank algorithm. I'm not going to make a search engine like Google; I'm working on a search engine that will crawl MP3 and video song files, like the beemp3 search engine.
beemp3 is one of the world's most popular MP3 search engines; it doesn't store any MP3 files in its database but relies entirely on crawling.
Before search engines, different programs such as Gopher, Veronica, and Jughead were used to index web pages. The first search engine, Archie, was created in 1990 by Alan Emtage, a student at McGill University in Montreal. Many search engines came after that.
Search engine functionality can be divided into three parts:
1- Web Crawler: This is a program that sends an HTTP request to a URL and receives the page content. It then identifies new URLs on that page, pushes them into a queue, and retrieves the next URL from the queue to discover more URLs. This process repeats until a stopping condition is satisfied; the condition can be, for example, to only follow URLs on a given host, or to allow outside links as well.
2- Search engine indexer: I've seen many places where people say the indexer is only used to build the queue, but I think it is not limited to the queue; it also builds an index over the database so that query operations run fast.
3- Query engine: This is again a very critical part. Google uses different mechanisms, such as Boolean and logical expressions, to retrieve good results from the database. But much of the time we still get irrelevant links, and they are working in this field to improve the query output.
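The queue-driven crawl loop from part 1 can be sketched as below. This is a minimal illustration only: the `fetch_links` callback and the simulated `web` dictionary are made-up placeholders standing in for real HTTP fetching and HTML parsing.

```python
from collections import deque

def crawl(seed_url, fetch_links, same_host_only=True, max_pages=100):
    """Breadth-first crawl starting from seed_url.

    fetch_links(url) must return the list of URLs linked from that page;
    in a real crawler it would send an HTTP request and parse the HTML.
    """
    host = seed_url.split('/')[2]           # e.g. 'example.com'
    queue = deque([seed_url])               # URLs waiting to be crawled
    visited = set()                         # URLs already crawled
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        if url in visited:
            continue
        visited.add(url)
        for link in fetch_links(url):       # discover new URLs on the page
            if same_host_only and link.split('/')[2] != host:
                continue                    # stopping condition: stay on one host
            if link not in visited:
                queue.append(link)
    return visited

# Simulated web: each page maps to the links it contains (invented data).
web = {
    "http://example.com/a": ["http://example.com/b", "http://other.com/x"],
    "http://example.com/b": ["http://example.com/a"],
}
pages = crawl("http://example.com/a", lambda u: web.get(u, []))
```

Note how the outside-host link is dropped by the stopping condition, so only the two `example.com` pages end up visited.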
The new era will be search engines based on natural-language queries, like ASK.COM.
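To make the indexer (part 2) and query engine (part 3) above concrete, here is a minimal sketch of an inverted index answering a simple Boolean AND query. The document texts are invented examples, and real engines add ranking, stemming, and much more on top of this.

```python
from collections import defaultdict

def build_index(docs):
    """Map each word to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

def boolean_and(index, query):
    """Return ids of documents containing every word in the query."""
    words = query.lower().split()
    if not words:
        return set()
    result = index.get(words[0], set()).copy()
    for word in words[1:]:
        result &= index.get(word, set())    # Boolean AND: intersect postings
    return result

# Toy document collection (made-up data).
docs = {
    1: "free mp3 songs download",
    2: "video songs online",
    3: "free video download",
}
index = build_index(docs)
hits = boolean_and(index, "free download")
```

The index makes the query fast: instead of scanning every document, the engine only intersects the two small posting sets for "free" and "download".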

It is a search engine that works on natural language.
Before 1998, web crawlers mostly used BFS and DFS algorithms to traverse URLs, but
in 1998 Larry Page and Sergey Brin first proposed a new technique: PageRank.
In PageRank, the crawler assigns a special rank to each page on the basis of the in-links and out-links of that page. They proposed this method in their research paper "The Anatomy of a Large-Scale Hypertextual Web Search Engine", available at http://infolab.stanford.edu/~backrub/google.html, but it is not complete; after that they kept their secrets, and now many companies are trying to make a PageRank-based search engine, but none has matched it yet.
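The idea of ranking pages by their in-links and out-links can be illustrated with the classic power-iteration form of PageRank. This is a textbook sketch based on the published paper, not Google's actual implementation, and the tiny link graph at the bottom is invented for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links maps each page to the list of pages it links to (out-links)."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}       # start with uniform ranks
    for _ in range(iterations):
        new_rank = {p: (1 - damping) / n for p in pages}
        for page, outgoing in links.items():
            if outgoing:
                # a page shares its rank equally among its out-links
                share = damping * rank[page] / len(outgoing)
                for target in outgoing:
                    new_rank[target] += share
            else:                            # dangling page: spread rank evenly
                for p in pages:
                    new_rank[p] += damping * rank[page] / n
        rank = new_rank
    return rank

# Toy graph: C is linked by both A and B, so it gains more rank than B.
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(graph)
```

The ranks always sum to 1, and a page with more in-links (here C) ends up ranked above a page with fewer (here B), which is the core intuition behind the method.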
At that time the Google crawler crawled 100-200 pages in just one second; you can imagine the speed of crawling.
Today they are using a distributed environment for web crawling.
After the completion of my module I'll post the remaining part.
If you have any doubts, feel free to ask..
