What is a web crawler?
Web crawlers, or web exploration robots
Known as a search engine bot, a web crawler, or simply a bot, this program downloads and indexes content from across the Internet. Its goal is to discover what each page on the web is about, for as many pages as possible, so that relevant information can be retrieved when it is needed. These programs are called web crawlers because "crawling" is the term for automatically accessing a site and obtaining data from it via software.
Most active crawlers on the web are operated by search engines. Search engines apply a search algorithm to the data their bots collect in order to select relevant links in response to user queries. This is how the list of web pages is generated when a user runs a search on Google, Bing, or another search engine.
A crawler works like someone going through all the books in a totally disorganized library and building an index (a card catalogue) so that anyone visiting the library can quickly and easily find the information they need. To classify and sort the books by subject, the crawler acts like a librarian: it reads the title, the abstract, and a portion of the internal text of each book to understand its subject matter and relevance.
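The card catalogue in this analogy can be pictured as an inverted index: a table that maps each word to the pages containing it. The following is only a minimal sketch under stated assumptions, not how any real search engine works. The function names build_index and search, the PAGES data, and the example.com URLs are all hypothetical and exist purely for illustration.

```python
import re
from collections import defaultdict

def build_index(pages):
    """Map each word to the set of URLs whose title or text contains it."""
    index = defaultdict(set)
    for url, (title, text) in pages.items():
        # Lowercase everything and split into simple alphanumeric words.
        for word in re.findall(r"[a-z0-9]+", f"{title} {text}".lower()):
            index[word].add(url)
    return index

def search(index, query):
    """Return the URLs that contain every word of the query, sorted."""
    words = re.findall(r"[a-z0-9]+", query.lower())
    if not words:
        return []
    # Keep only the pages that appear under all of the query words.
    results = set.intersection(*(index.get(w, set()) for w in words))
    return sorted(results)

# Hypothetical pages: URL -> (title, portion of the page text).
PAGES = {
    "https://example.com/cats": ("Cat care", "How to feed and groom a cat."),
    "https://example.com/dogs": ("Dog care", "How to feed and walk a dog."),
}

index = build_index(PAGES)
print(search(index, "feed cat"))  # pages mentioning both "feed" and "cat"
print(search(index, "care"))      # pages mentioning "care"
```

A real search engine would also rank the matching pages by relevance; this sketch only shows the lookup step that the index makes fast.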
Unlike a library, however, the Internet is not made up of physical piles of books. It is therefore difficult to know whether all the necessary information has been indexed correctly, or whether large amounts of data are being missed.
To find as much of the relevant information on the web as possible, a crawler starts by visiting a set of known, reputable web pages. It then follows the hyperlinks on those pages to other web pages, follows the hyperlinks on those pages to a further series of pages, and so on.
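The link-following process described above can be sketched as a breadth-first traversal. This is an illustrative sketch under stated assumptions, not the method of any particular search engine. The names crawl, LinkExtractor, and TOY_WEB are hypothetical, and the fetch argument stands in for whatever function actually downloads a page. Here it reads from a small in-memory dictionary so the example runs without network access.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed_urls, fetch, max_pages=100):
    """Breadth-first crawl from seed_urls; returns URLs in visit order.

    fetch is a caller-supplied function (an assumption of this sketch)
    that maps a URL to its HTML text, or None if the page is unavailable.
    """
    queue = deque(seed_urls)   # pages waiting to be visited
    seen = set(seed_urls)      # pages already queued, to avoid repeats
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        visited.append(url)
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            link = urljoin(url, href)  # resolve relative links
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited

# Hypothetical four-page web used in place of real downloads.
TOY_WEB = {
    "https://example.com/": '<a href="/a">A</a> <a href="/b">B</a>',
    "https://example.com/a": '<a href="/c">C</a>',
    "https://example.com/b": '<a href="/a">A again</a>',
    "https://example.com/c": "No links here.",
}

order = crawl(["https://example.com/"], TOY_WEB.get)
print(order)
```

The seen set is what keeps the crawler from looping forever when pages link back to each other, and max_pages is a simple stand-in for the limits a real crawler would need. A production crawler would also have to respect each site's crawling rules and rate limits, which this sketch leaves out.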
No one really knows how many crawlers search engines currently run, or how successful they actually are. Some experts estimate that only 40% to 70% of the entire Internet is currently indexed for search. Even this limited share represents billions of web pages.
Updated on: 15/05/2023