For many, Google is the internet. It’s the starting point for finding new sites, and is arguably the most important invention since the internet itself. Without search engines, new web content would be inaccessible to the masses.
Every search engine performs three basic stages:
Crawling: where content is discovered;
Indexing: where it is analyzed and stored in huge databases; and
Retrieval: where a user query fetches a list of relevant pages
Crawling is where it all begins: the acquisition of data about a website.
This involves scanning sites and collecting details about each page: titles, images, keywords, other linked pages, etc. Different crawlers may also look for different details, like page layouts, where advertisements are placed, whether links are crammed in, etc.
But how is a website crawled? An automated bot (called a “spider”) visits page after page as quickly as possible, using page links to find where to go next. Even in the earliest days, Google’s spiders could read several hundred pages per second. Nowadays, it’s in the thousands.
Modern crawlers typically also cache a copy of the whole page as they go, so it can be analyzed again later.
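To make that link-following concrete, here is a minimal sketch of a spider in Python using only the standard library. The seed URL, page limit, and LinkExtractor helper are illustrative choices for this sketch, not how Google’s crawler is actually built:

```python
# A minimal sketch of a spider: fetch a page, extract its links, and follow
# them breadth-first. Real crawlers add politeness (robots.txt, rate limits),
# large-scale deduplication, and distributed work queues.
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag
from urllib.request import urlopen

class LinkExtractor(HTMLParser):
    """Collects the href of every <a> tag on a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed_url, max_pages=10):
    """Visit pages breadth-first, starting from seed_url."""
    seen = {seed_url}
    queue = deque([seed_url])
    pages = {}                      # url -> raw HTML (a tiny "cache")
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", "replace")
        except OSError:
            continue                # skip unreachable pages
        pages[url] = html
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and strip #fragments before queueing
            absolute, _ = urldefrag(urljoin(url, href))
            if absolute.startswith("http") and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return pages

# pages = crawl("https://example.com")
```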
Indexing is when the data from a crawl is processed and placed in a database.
Imagine making a list of all the books you own: their publishers, their authors, their genres, their page counts, etc. Crawling is when you comb through each book, while indexing is when you log each one in your list.
So after Google has crawled your site, it places all of that data into huge databases (computer files full of information). To continue the analogy: browsing every book in a bookstore is crawling, but writing down each title and author in a catalog is indexing.
All of Google’s index data is stored in huge data centers, holding thousands of petabytes (quadrillion bytes) worth of drives.
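As a rough illustration of what “placing the data in a database” can look like, here is a toy inverted index in Python. The tokenize and build_index helpers are hypothetical names for this sketch; real indexes also store word positions, frequencies, and page metadata, sharded across many machines:

```python
# A minimal sketch of indexing: turn crawled pages into an inverted index
# that maps each word to the set of pages containing it.
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(pages):
    """pages: dict of url -> page text (e.g. from the crawler above)."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in tokenize(text):
            index[word].add(url)
    return index

pages = {
    "https://example.com/a": "Search engines crawl and index the web",
    "https://example.com/b": "Spiders crawl pages by following links",
}
index = build_index(pages)
print(index["crawl"])   # {'https://example.com/a', 'https://example.com/b'}
```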
Retrieval is when the search engine processes your search query and returns the most relevant pages that match it.
Ranking algorithms check your search query against billions of indexed pages to determine each one’s relevance. A better algorithm translates to a better search experience.
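To show the mechanics of matching a query against an index, here is a toy retrieval function in Python that reuses the tokenize and build_index sketch above. It ranks pages by simple word overlap, which is purely illustrative and nothing like a real ranking algorithm:

```python
# A toy sketch of retrieval: score each indexed page by how many of the
# query's words it contains, then return pages ranked by that score.
# Real ranking algorithms weigh hundreds of closely guarded signals.
def retrieve(query, index):
    """index: word -> set of urls, e.g. built by build_index above."""
    scores = {}
    for word in tokenize(query):
        for url in index.get(word, ()):
            scores[url] = scores.get(url, 0) + 1
    # Highest-scoring pages first
    return sorted(scores, key=scores.get, reverse=True)

print(retrieve("crawl the web", index))
```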
Companies like Google guard their ranking algorithms as closely held industry secrets, because anyone who knew exactly how an algorithm worked could game it or exploit it for monetary gain.
To counter such manipulation, Google regularly updates its algorithm; Google Panda and Google Penguin are two of its most important updates.