We are continuing our article series about search engines, and this time we are moving on to the next part of how they work: indexing.
About Indexing
How search engines interpret and store web pages
Once you’ve made sure your website can be crawled, the next step is to ensure it can be indexed. Just because a search engine can discover and crawl your site doesn’t guarantee that it will be stored in the index. We’ve already discussed how search engines find your web pages; check the previous article if you need a refresher. The index is where those discovered pages are stored: after a crawler finds a page, the search engine renders it much as a browser would, analyzes its contents, and saves that information in the index.
Keep reading to learn more about indexing and how to get your site included in this essential database.
Is it possible to see how Googlebot sees my pages?
Yes. The cached version of your page shows a snapshot of the last time Googlebot crawled it. Google crawls and caches web pages at different frequencies: well-known sites that publish constantly are crawled more often than less prominent ones. To see the cached version of a page, click the drop-down arrow next to the URL in the SERP and choose “Cached”, as shown in the picture below.
Can pages be removed from the index?
Yes, they can. These are the most common reasons a URL is removed:
- The URL returns a “not found” error (4XX) or a server error (5XX). This can happen accidentally (the page was moved and a 301 redirect was not set up) or deliberately (the page was deleted and now returns a 404 so that it drops out of the index); see the example responses after this list.
- The URL was manually penalized for violating the search engine’s Webmaster Guidelines and was removed from the index.
- A noindex meta tag was added to the page. Site owners use this tag to tell search engines to omit the page from their index.
- The page was put behind a password; requiring a login to access the page also blocks the URL from being crawled.
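As a rough illustration of the first point in the list above, these are the kinds of HTTP responses involved (the domain and paths are hypothetical):

    # Accidental removal: the page was moved, no 301 redirect was set up, and the old URL now returns
    HTTP/1.1 404 Not Found

    # The fix for a moved page: redirect the old URL to its new location
    HTTP/1.1 301 Moved Permanently
    Location: https://example.com/new-page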
Telling search engines how to index your site
Robots meta directives
Meta directives (also called “meta tags”) are instructions you give to search engines about how your web page should be crawled and indexed.
You can give crawlers commands such as “do not index this page in search results” or “do not pass link equity to any on-page links.” These instructions are delivered either as Robots Meta Tags in the <head> of your HTML pages (the most common approach) or as the X-Robots-Tag in the HTTP response header.
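As a quick sketch, the same instruction can be delivered in either form (the directive values here are only an example):

    <!-- Robots Meta Tag, placed in the <head> of the HTML page -->
    <meta name="robots" content="noindex, nofollow">

    # Equivalent X-Robots-Tag sent in the HTTP response header
    X-Robots-Tag: noindex, nofollow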
The “robots” meta tag
The “robots” meta tag is placed in the <head> of your page’s HTML. Its main feature is that it can address all search engines or only specific ones. The most common meta directives, and the situations in which they are applied, are described below.
Using “noindex” makes clear to crawlers that you want the page excluded from search results. By default, search engines assume they may index every page, so the “index” value is unnecessary.
- When the tag might be used: mark a page as “noindex” if you’re trying to trim thin pages from Google’s index of your site (e.g., user-generated profile pages) while still keeping them accessible to visitors, as in the sketch below.
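A minimal sketch, assuming a user-generated profile page that visitors should still be able to reach:

    <head>
      <!-- Keep this thin profile page out of search results, but leave it accessible to users -->
      <meta name="robots" content="noindex">
    </head>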
To tell search engines whether the links on a page should be followed or not, use “follow” or “nofollow”.
“Follow” tells bots to follow the links on your page and pass link equity through to those URLs. “Nofollow” tells search engines not to follow the links on the page or pass any link equity through them. By default, all pages are assumed to have the “follow” attribute.
- When the tag might be used: “nofollow” is often combined with “noindex” when you want to prevent a page from being indexed and also stop the crawler from following the links on it, as shown below.
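A sketch of the combined directive:

    <!-- Exclude the page from the index and do not follow (or pass equity through) its links -->
    <meta name="robots" content="noindex, nofollow">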
To stop search engines from storing a cached copy of the page, use “noarchive”.
By default, search engines keep visible copies of every page they have indexed, accessible to searchers through the cached link in the search results.
- When it might be used: if you run an e-commerce site and change prices regularly, the noarchive tag prevents searchers from seeing outdated pricing in a cached copy, as in the example below.
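For instance, on a hypothetical product page whose price changes often:

    <!-- Let the page be indexed, but do not keep a cached copy that could show an outdated price -->
    <meta name="robots" content="noarchive">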
X-Robots-Tag
The X-Robots-Tag is used in the HTTP header of the URL rather than in the HTML itself. It is more flexible and functional than meta tags: it is well suited to controlling indexing at scale, because you can cover non-HTML files, use regular expressions, and apply sitewide noindex directives.
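As a sketch, here is the header a server might send for a non-HTML file, followed by an Apache configuration (assuming the mod_headers module is enabled) that applies the directive to every PDF on the site:

    HTTP/1.1 200 OK
    Content-Type: application/pdf
    X-Robots-Tag: noindex, nofollow

    # Apache: add the header to all PDF files sitewide
    <FilesMatch "\.pdf$">
      Header set X-Robots-Tag "noindex, nofollow"
    </FilesMatch>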
Understanding these different ways of influencing crawling and indexing will help you avoid the typical mistakes that keep web pages from reaching the top of the results.
Our next article will cover how search engines rank pages.