A crawler, or bot, is a program that visits websites and reads their pages and other information in order to create entries for a search engine index. Every search engine runs a crawler whose algorithm is designed to index keywords along with relevant links. When search results are displayed, a number of factors affect which sites are shown at or near the top of the results.
Search engine optimization (SEO) is the art of attempting to "please" the search engine algorithms so that your site comes up as high as possible – at or near the top, hopefully – in the search results.
Factors that search engines consider include (but are not limited to) the following:
- Relevance to the search term
- How frequently content on the site is updated
- Server uptime, availability and response factors
- Page load time
- Duplication of content (results in lower ranking)
- Popularity of content, as determined by the back-links
- Correct HTTP response headers
- Uniqueness of title and meta tags on all the pages of the site
- Webmaster settings on the search engine
Duplicate content is one of the most important aspects to consider when serving content over the Instart service or, indeed, over any content delivery service. Search engines typically penalize sites for what they see as duplicate content, even though it is just the original content being mirrored for better performance.
The following measures can help avoid being penalized by search engines for apparent duplicate content.
Use rel="canonical" headers
If the site uses multiple static hostnames (such as cdn.domain.com, cdn1.domain.com, etc.), it is important to indicate the original source of each asset, that is, the URL under which the content should be indexed and stored. This is done with a rel="canonical" HTTP header. Preferably, every file should set its canonical URL to the "preferred" domain and path that the search engine should use. For example, with an Apache web server (with mod_headers enabled), you can set up the .htaccess file as shown below:
<Files "pdf-download.html">
Header add Link '<http://www.example.com/pdf-download.html>; rel="canonical"'
</Files>
<Files "product-page.html">
Header add Link '<http://www.example.com/product-page.html>; rel="canonical"'
</Files>
These headers indicate the preferred URL for each of the two pages, so that search results are more likely to show users the domain specified in the canonical entry, www.example.com, rather than cdn.example.com.
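To check that a response actually carries the expected canonical entry, the Link header value can be parsed and compared against the preferred URL. The following is a minimal Python sketch, not part of the Instart service: the function name canonical_from_link_header is hypothetical, and the parsing is deliberately simplified (it assumes parameter values contain no commas or semicolons).

```python
import re
from typing import Optional

# Matches one link-value: <target> followed by zero or more ";param" parts.
# Simplification: parameter values are assumed to contain no "," or ";".
_LINK_VALUE = re.compile(r'<([^>]*)>\s*((?:;\s*[^;,]+)*)')


def canonical_from_link_header(value: str) -> Optional[str]:
    """Return the target of the first rel="canonical" link, or None."""
    for match in _LINK_VALUE.finditer(value):
        target, params = match.group(1), match.group(2)
        for param in params.split(";"):
            name, _, raw = param.strip().partition("=")
            if name.strip().lower() != "rel":
                continue
            # rel may hold several space-separated relation types.
            relations = raw.strip().strip('"').lower().split()
            if "canonical" in relations:
                return target
    return None


# Example: the header value produced by the .htaccess entry above.
header = '<http://www.example.com/pdf-download.html>; rel="canonical"'
preferred = "http://www.example.com/pdf-download.html"
print(canonical_from_link_header(header) == preferred)
```

The header value would typically be read from a response fetched through the CDN hostname; how it is fetched is left out here to keep the sketch free of network access.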
Set up your robots.txt file
Update the robots.txt file on the origin server to block all crawling, while the robots.txt file on the Instart server allows crawlers to read the content. This asks crawlers not to index the files on the origin server, which further reduces content duplication.
The following examples can be used in robots.txt to tell crawlers which content they may access and which content is disallowed:
Disallow indexing of everything:
Disallow a path:
Disallow a specific page or asset:
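As an illustration of the three cases above, the sketch below holds one sample rule set per case and checks each with Python's standard urllib.robotparser module. The paths /private/ and /downloads/report.pdf, the hostnames, and the helper name is_allowed are hypothetical placeholders, not values taken from an actual site.

```python
from urllib.robotparser import RobotFileParser

# Sample rule sets; the paths are hypothetical placeholders.
DISALLOW_EVERYTHING = """\
User-agent: *
Disallow: /
"""

DISALLOW_PATH = """\
User-agent: *
Disallow: /private/
"""

DISALLOW_PAGE = """\
User-agent: *
Disallow: /downloads/report.pdf
"""


def is_allowed(rules: str, url: str, user_agent: str = "*") -> bool:
    """Return True if the given robots.txt rules let user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser.can_fetch(user_agent, url)


# The origin server blocks everything; other rule sets block only a subset.
print(is_allowed(DISALLOW_EVERYTHING, "http://origin.example.com/index.html"))
print(is_allowed(DISALLOW_PATH, "http://www.example.com/index.html"))
print(is_allowed(DISALLOW_PAGE, "http://www.example.com/downloads/report.pdf"))
```

Disallow rules match by path prefix, so a rule for /private/ covers every URL beneath that directory. Note that robots.txt is a request to well-behaved crawlers rather than an access control mechanism.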
For more information about SEO, see:
- Google's Search Engine Optimization Starter Guide
- Moz's Beginner's Guide to SEO
- Bing's Webmaster Guidelines
If you have any further questions about how SEO works in the presence of the Instart service, please contact us.