The robots.txt file (http://www.robotstxt.org/) is publicly available and, when used properly, is a very good way to control what search engines crawl and what they don’t.
“Who cares. I’d rather watch the grass grow…”
Well, if you are using Volusion, then you may. Volusion has .asp pages that are sometimes tied to query-string parameters (the “?” and “&” parts of a URL) based on session and query data, which in turn can generate a ton of URLs that all share the same TITLE and META data. You end up with quasi-duplicate content, and the best policy, regardless of how you read Google’s duplicate content policies, is to minimize as much of it as possible.
“Duplicate content on a site is not grounds for action on that site unless it appears that the intent of the duplicate content is to be deceptive and manipulate search engine results.” - Google
Why leave it to guesswork when you can finally control something yourself by writing some exclusion rules, thereby giving Google more relevant content?
What now?
Don’t freak out. Simply edit your robots.txt file in your SEO area (/admin/SEOFriendly.asp). Your goal here is to DISALLOW all* search engines from crawling these pages/patterns.
You can also add your google_sitemap.asp to your robots.txt file and tell Google to come and get it (or submit your google_sitemap.asp via the webmaster tools).
Here’s the robots.txt I use.
Sitemap: http://www.YOURSITE.com/google_sitemap.asp
User-agent: *
Disallow: /cgi-bin/
Disallow: /AccountSettings.asp
Disallow: /Affiliate_info.asp
Disallow: /Affiliate_signup.asp
Disallow: /Affiliate_thankyou.asp
Disallow: /catalog_subscribe.asp
Disallow: /donate.asp
Disallow: /EmailaFriend.asp
Disallow: /Email_Me_When_Back_In_Stock.asp
Disallow: /FileUpload/TextObject.aspx
Disallow: /GiftOptions.asp
Disallow: /help.asp
Disallow: /Help_EmailBetterPrice.asp
Disallow: /Help_FreeShipping.asp
Disallow: /kb_results.asp
Disallow: /login_sendpass.asp
Disallow: /Login.asp
Disallow: /mailinglist_subscribe.asp
Disallow: /mailinglist_unsubscribe.asp
Disallow: /myaccount.asp
Disallow: /MyAccount.asp
Disallow: /OrderFinished.asp
Disallow: /one-page-checkout.asp
Disallow: /orders.asp
Disallow: /ProductDetails.asp
Disallow: /PhotoDetails.asp
Disallow: /PlaceOrder.asp
Disallow: /Returns.asp
Disallow: /Register.asp
Disallow: /Receipt.asp
Disallow: /SearchResults.asp
Disallow: /ShoppingCart.asp
Disallow: /shoppingcart.asp
Disallow: /Terms.asp
Disallow: /Terms_privacy.asp
Disallow: /Ticket_List.asp
Disallow: /Ticket_New.asp
Disallow: /TrackPackage.asp
Disallow: /WishList.asp
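If you want to sanity-check rules like these before saving them, here is a minimal sketch using Python's standard urllib.robotparser. It only covers a shortened copy of the list above, and the test paths (the ProductCode value, the search term, and the friendly .htm URL) are made up for illustration. Note that this parser does plain case-sensitive prefix matching, which is one reason the list carries both /ShoppingCart.asp and /shoppingcart.asp.

```python
from urllib.robotparser import RobotFileParser

# Shortened copy of the rules above; YOURSITE is the post's placeholder.
ROBOTS_TXT = """\
Sitemap: http://www.YOURSITE.com/google_sitemap.asp
User-agent: *
Disallow: /cgi-bin/
Disallow: /Login.asp
Disallow: /ProductDetails.asp
Disallow: /SearchResults.asp
Disallow: /ShoppingCart.asp
Disallow: /shoppingcart.asp
"""


def build_parser(robots_txt):
    """Parse robots.txt text that is already in memory (no network access)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def is_blocked(parser, path, agent="Googlebot"):
    """Return True if the given crawler may not fetch this path."""
    return not parser.can_fetch(agent, "http://www.YOURSITE.com" + path)


if __name__ == "__main__":
    parser = build_parser(ROBOTS_TXT)
    # Hypothetical paths, chosen only to exercise the rules.
    for path in (
        "/ProductDetails.asp?ProductCode=ABC",
        "/SearchResults.asp?Search=toner",
        "/Some-Product-p/abc-123.htm",
        "/login.asp",  # lowercase variant has no matching rule
    ):
        print(path, "blocked" if is_blocked(parser, path) else "allowed")
```

A Disallow line is a prefix match, so the query-string versions of ProductDetails.asp and SearchResults.asp are covered by a single rule each, while the SEO-friendly .htm URLs stay crawlable.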
Note: Volusion does not have separate robots.txt files for its SSL and regular layers. There is only one, so you are not able to write a special one for https. It’s not terribly common for https pages to end up in the index, but searching for site:www.yoursite.com always brings up some interesting things.
Don-
Thanks for the Volusion robots.txt sample file. I’ve adapted mine to include yours, and for now, am leaving other weird stuff that has shown up in my Webmaster Tools results. Checking the site:uncommonscents.com command and other results in GWT, I’ve had a big problem with duplicate indexing in Google of my homepage resulting in decreased home page PageRank (down from PR5 to PR3): http://www.uncommonscents.com
https://www.uncommonscents.com
http://uncommonscents.com/default.asp
http://uncommonscents.com
etc., etc.
My domain is simply uncommonscents.com (without the www.). Do you know a safe way to employ the canonical tag in Volusion to send all iterations of my home page URL to uncommonscents.com? I’ve added it to my META tags and it seems to be helping, but I’m having to disallow indexing of some .asp pages that I would otherwise want indexed (cindex and pindex) to avoid Duplicate Title and Description tags… Ideally the canonical tag would only exist on all iterations of my home page, but Volusion seems to not allow that. I’ve seen the discussion in the forums about the “IF_HOMEPAGE” and “IF_NOT_HOMEPAGE” tags, but I think they can only be used in the .css template which probably doesn’t help.
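To make the goal concrete, here is a rough Python sketch of which URL variants should collapse to a single canonical homepage address. It is not Volusion code: the function name, the choice of http as the scheme, and the set of homepage paths are assumptions for illustration. The link rel="canonical" element itself is standard HTML.

```python
from urllib.parse import urlsplit, urlunsplit

# Assumptions for this sketch: the preferred host has no "www." prefix,
# the preferred scheme is http, and these paths all serve the homepage.
CANONICAL_HOST = "uncommonscents.com"
HOMEPAGE_PATHS = {"", "/", "/default.asp"}


def canonical_homepage(url):
    """Return the canonical homepage URL if url is a homepage variant, else None."""
    parts = urlsplit(url)
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[len("www."):]
    is_homepage = parts.path.lower() in HOMEPAGE_PATHS and not parts.query
    if host == CANONICAL_HOST and is_homepage:
        return urlunsplit(("http", CANONICAL_HOST, "/", "", ""))
    return None


def canonical_tag(url):
    """Build the canonical link tag for homepage variants, or an empty string."""
    target = canonical_homepage(url)
    return '<link rel="canonical" href="%s" />' % target if target else ""


if __name__ == "__main__":
    for variant in (
        "http://www.uncommonscents.com",
        "https://www.uncommonscents.com",
        "http://uncommonscents.com/default.asp",
        "http://uncommonscents.com",
    ):
        print(variant, "->", canonical_homepage(variant))
```

Every variant listed above maps to the same address, and any other page gets no tag at all, which is the homepage-only behavior being asked for.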
Real quick, why did you opt for domain marketing without www?
I’m not quite following your thread, but I just got done using if_homepage and it works like this. If you place an if_homepage div in your html template, everything within that div will only appear on your home page. But if you’re looking to style if_homepage, it doesn’t exist in the final pages. Volusion must only use it when the page gets parsed. If you want to style it, you’ll need to wrap your homepage content in another div.
The if_not_homepage div works the same way. Why Volusion chose to do it this way, god only knows, but it’s better than using javascript to do it. Now if I could just figure out a decent menu system that doesn’t rely on javascript…
How do I use the “IF_HOMEPAGE” and “IF_NOT_HOMEPAGE” tags to avoid duplicate title tags on the cindex and pindex pages?
Sorry for the late reply. Submit an email through my form. Thanks.
Great posting about Volusion quirks. I have a problem with my Volusion cart and I would like to know if you experience the same thing. I am having major problems getting Google to index any of the content articles within Volusion, like http://www.zipinstallation.com/Articles.asp?ID=210. I have checked the robots file and I have zero articles excluded, yet Google Webmaster Tools says that I have 482 pages not being indexed because of robots.txt exclusion. I have verified many times that the Volusion robots file doesn’t exclude the articles. Are your articles getting indexed? If yes, what category are your articles in within the Volusion category selection? I appreciate any help.
I usually build an index file for the articles and stick it in the footer so the bots can grab it and run through it easier. Why do you have these lines in your robots.txt?
Disallow: /pindex.asp*
Disallow: /cindex.asp*
We ran into a problem where PDF files are being scraped off my site into PDF search engines, which defeats the purpose of using those files to bring people to our site. So we ran across these rules
Disallow: /*.pdf$
Disallow: /*.cgi$
Disallow: /*.asp$
and wondered what you thought of shortcutting the robots.txt file in this manner. Does it work?
Try adding a User-agent declaration.
User-agent: Googlebot
Disallow: /*.pdf$
I could really help you if you tell me how your .pdfs are organized. Are they in different directories or all in one directory?
We moved the PDF files to a different directory, and so far the wildcard PDF disallow is working for all engines. We’re so happy with the shortcut that, based on Volusion stats, this is what we are doing:
User-Agent: *
Disallow: /cgi-bin/
Disallow: /*.pdf$
Disallow: /*.asp$
Disallow: /*.aspx$
Disallow: /*.cgi$
Disallow: /*.css$
Disallow: /*.js$
Disallow: /admin/
Disallow: /fileupload/
Disallow: /net/
Allow: /
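To see what wildcard rules like these actually cover, here is a small Python sketch of "*" and "$" matching with longest-match precedence (the most specific matching rule wins, and Allow wins a tie), which is how Google documents its handling and how RFC 9309 describes it. The matcher is a hand-written illustration, not any crawler's real code, the PDF path is made up, and other engines may treat wildcards differently.

```python
import re

# The rule set from the comment above, in file order.
RULES = [
    ("disallow", "/cgi-bin/"),
    ("disallow", "/*.pdf$"),
    ("disallow", "/*.asp$"),
    ("disallow", "/*.aspx$"),
    ("disallow", "/*.cgi$"),
    ("disallow", "/*.css$"),
    ("disallow", "/*.js$"),
    ("disallow", "/admin/"),
    ("disallow", "/fileupload/"),
    ("disallow", "/net/"),
    ("allow", "/"),
]


def pattern_to_regex(pattern):
    """Translate a robots.txt path pattern: '*' is any run of characters,
    a trailing '$' anchors the end, everything else is a literal prefix."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + (r"\Z" if anchored else ""))


def is_allowed(path, rules=RULES):
    """Apply the longest matching rule; Allow wins ties; no match means allowed."""
    best_len, allowed = -1, True
    for kind, pattern in rules:
        if pattern_to_regex(pattern).match(path):
            is_allow = kind == "allow"
            if len(pattern) > best_len or (len(pattern) == best_len and is_allow):
                best_len, allowed = len(pattern), is_allow
    return allowed


if __name__ == "__main__":
    for path in (
        "/docs/manual.pdf",  # hypothetical PDF location
        "/ProductDetails.asp",
        "/ProductDetails.asp?ProductCode=ABC",
        "/Toner-SMT-Series-8-Port-Multi-Taps-p/ton-smt108-32.htm",
    ):
        print(path, "allowed" if is_allowed(path) else "blocked")
```

Under this reading, the "$" anchor means /*.asp$ only catches URLs that end in .asp. The same page with a query string on the end slips through, so the shortcut is not a drop-in replacement for the per-page Disallow lines in the post.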
Hello,
I’m using the example you have posted above and I’m having some pages being blocked by the robots.txt file that I don’t want blocked. For example:
http://www.shopvsc.com/Toner-SMT-Series-8-Port-Multi-Taps-p/ton-smt108-32.htm
http://www.shopvsc.com/Toner-TGT-Seriers-8-Port-Taps-p/ton-tgt8-14.htm
I saw these and a few more items that are in that category (more Toner products) that are getting blocked in Google Webmaster Tools.
Do you have any ideas?
Thanks!
Ouch, sorry for the late reply. I just came off of a large project. I can’t bring up http://www.shopvsc.com/robots.txt. Is there anything I should know?
KMS,
Hmmm I was just able to do so. But other than changing the site map URL at the top of the robots.txt file it is exactly the same as the one posted above.
I just looked through Webmaster Tools and it isn’t showing up as being restricted by the robots.txt file. It had been there up until sometime last week I think.
I guess it’s okay now though, if anything changes I’ll reply again.
Thanks,
KMS,
I’m back again. I just received an email from Google Base this afternoon about it not being able to crawl certain pages of our site. I’ve gotta fix this by February 28th or they will remove.
I generated a new Google Base file just to take a look at it, and it is using the ProductDetails.asp?ProductCode= URL instead of the good Google-friendly URLs.
I am generating the Base file using the Volusion API.
Is there another way you suggest to generate the Base file so that it uses the SEO-friendly URLs? Base draws a good amount of traffic for us and we don’t want to be dinged.
Or do I just remove the Disallow: /ProductDetails.asp and leave it be?
Thanks Again,
Ah! You are the 5th person to contact me about this issue. Currently, there is no checkbox or Where clause for the URL in the Volusion API. Open a ticket with Volusion. I have told everyone to do this. I will be posting on the forum soon as well.
Two roads:
1. Remove the disallow and let them crawl it. There is a good chance that it will not show up in the index (regular search results) unless someone blogs your dynamic URL. Also, the duplicate content might not be that bad and might be considered secondary content.
2. Don’t remove it and wait for the issue to resolve within the Volusion workplace.
I am currently working with GoDataFeed on a custom solution. I can’t say when it will be done.