Stop Google Indexing a Page


(Brandon Scott Corlett) #1

In Foundation use: SEO Helper + Robots (don’t follow links) + (Don’t index page)


(Christopher Watson) #2

Do the SEO Helper stacks work in non-Foundation themes?


(Brandon Scott Corlett) #3

At the moment it is Foundation only, however @joeworkman did mention somewhere that one day he would create an “any theme” version.


(Brad Halstead) #4

Page Safe - a users explanation

About protecting content when @joeworkman Page Safe is used.

The gist of the stack is to prevent a page (or pages) from loading without the proper password being entered. It in no way promises to protect your resources from Google indexing; it only limits access to the HTML/PHP page!

When the page is viewed in the browser of choice (maybe even more so in Google Chrome), I assume that Google will track these resources, due to page content loading and browser cache/history once the correct password is entered… correct me if I am wrong, please!

There is more than one way to protect your resources… @BrandonCorlett mentions the robots.txt file, but you may not want your entire site blocked from indexing as shown in his example. A good source on what the robots.txt file is and how to use it is here.

However, just because it is part of the HTML 4.01 spec does not mean that all robots abide by it! Just be aware of this if you think you have your resources protected.

So, for a start, block robots from seeing your resources folder and its sub-directories. (Without publishing your page, just export it locally and you will see the resources folder and its sub-directory names to add to the robots.txt file.) Set up your robots.txt file and then publish once you have it configured correctly. You may even want to use robots.txt to protect the Page Safe page(s) so they are not indexed. The same issues apply, though.
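As a minimal sketch (the folder names below are placeholders; substitute the ones you see in your own local export), a robots.txt at the web root might look like:

```txt
# robots.txt — lives at the web root, e.g. yourwebsite.com/robots.txt
# Folder names below are examples only; use those from your exported site
User-agent: *
Disallow: /resources/
Disallow: /members/
```

Remember that this only asks well-behaved crawlers to stay out; it does not actually restrict access to the files.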

Remember also, that if content is already indexed, it will be there for a while and there is nothing you can do about it other than change the resource name and republish after you have it configured for protection.

As @ben & @dan mention in Podcast 22, you can password protect certain file types (PDF, Pages, Numbers, etc.) and show the passwords for the files to the users on your Page Safe protected page. Page Safe in combination with robots.txt and file password protection is your best bet, but there is no guarantee.

RapidBot 2 by ForeGround is an excellent way to add/remove/create your robots.txt file, I have found no other that even compares.

I’m not totally up on my htaccess file (mostly because my host won’t allow me to use it) so there may be ways to protect content through there as well. A good resource is available here.
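For anyone whose host does allow .htaccess, a minimal sketch of password-protecting a directory looks like the following (the paths and realm name here are assumptions; the .htpasswd file has to be created separately with the htpasswd tool):

```apache
# .htaccess placed inside the directory you want to protect
AuthType Basic
AuthName "Members only"
# Full server path to the password file (keep it outside the web root if possible)
AuthUserFile /home/youraccount/.htpasswd
Require valid-user
```

The browser will then prompt for a username and password before serving anything in that directory, which also keeps crawlers out regardless of whether they honour robots.txt.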

That is about the most you can do to protect resources.

With that being said, in Mary’s case, if a user downloads the PDF file, removes the password and then uploads it to the web there is nothing she can do about it, absolutely nothing. So maybe the best solution would be to make the PDF file in this case an html file protected by Page Safe instead of a resource…

Just thinking out loud, and hopefully this gives users some ideas on what works and how things work.

Brad


(Brandon Scott Corlett) #5

@Turtle, I intentionally put this out there as a general tip for users rather than a specific answer to the podcast question. Mostly because finding the best possible solution for @bpequine involves a further conversation about her needs in regards to her specific question, current skill set, current stacks/plugins owned, and current budget.

However, since you brought it up… :slightly_smiling:

  1. Page Safe comes with 3 stacks:

    • Page Safe protects the entire page.
    • Stack Safe controls whether or not a stack is displayed, based on whether a Page Safe stack has been unlocked.
    • Logout allows any button to be used as a logout button for Page Safe.

    With both of the “Safe” stacks the content is only ever pulled from the server if/when a password has successfully been entered. This would indicate that crawlers would be unable to access this information regardless of whether or not they adhere to the “no follow” standards.

  2. Browsers do cache information, however I think it highly unlikely that this information makes its way onto the web, since this is intended for local use on a per-device basis. We would have to look up the terms and conditions of the browser to know for certain. In any case, you can prevent browser caching.

  3. You are right that you may not want your entire site blocked from indexing. SEO Helper can be used on a page-by-page basis or in a partial to be uniform across pages. So the choice is yours.

  4. Not all robots may abide by the HTML 4.01+ standards, but I’m certain that Google and other major search engines do. From the sound of it, her primary concern isn’t hackers looking to steal information, but rather Google simply doing what it does best: indexing.

  5. Blocking robots from seeing the resources folder should definitely be an option from within RW. I’m actually surprised this hasn’t come up before. I know that @joeworkman’s Total CMS protects the cms-content folder from robots. So if you store the file using Total CMS you should be good to go without any technical setup. @nikf, is this something that would be easy enough to add in a small update?

  6. Good point about the search engine having already indexed it. I know that Google does have resources available for you to petition to have content removed, but they don’t make it all that easy.

  7. The password protection of the file is a clever solution.

  8. RapidBot 2 is a great product and was my first intro to robots.txt files.

  9. I think the most you could do to protect your resources is to store them in a secure location off the server, like on Amazon S3. Then you could use a file delivery system such as Rapid Cart Pro or Cartloom to deliver a download link to users that have access to a page protected by Page Safe. This would send users an email with a unique link to download the file, preventing anyone from knowing the true location of the file. It would also alert you any time someone downloads that file.

  10. Yup, you can’t stop someone from republishing the list if they have it, but that isn’t the scenario she is experiencing. She is concerned about search engines, i.e. Google.

  11. I agree that a PDF may not actually be the best solution, though I could see some reasons why it would be a better medium. I think that in most cases using something like @joeworkman’s [Power Grid Stacks + Page Safe/Stack Safe + an optional use of Easy CMS/Total CMS for online editing of the table] would be a better solution.

  12. Great thoughts, and I’m sure that some or all of this will be useful to someone. :relaxed: Hopefully @bpequine!
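On the browser-caching point above: as a sketch, caching can be discouraged with meta tags in the page head, though a Cache-Control header sent by the server is the more reliable route:

```html
<!-- Place in the page <head>; a server-sent Cache-Control header is preferable -->
<meta http-equiv="Cache-Control" content="no-store, no-cache, must-revalidate">
<meta http-equiv="Pragma" content="no-cache">
<meta http-equiv="Expires" content="0">
```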

Cheers!!

Brandon


(Christopher Watson) #6

So many words…!


(Brad Halstead) #7

@bitbumpy, yes, but was it helpful?

@BrandonCorlett, awesome collab on brainstorming on this one, thanks for your feedback and suggestions :slight_smile: :+1:


(Mary Delton) #8

Wow! Thank you all for your insights into my problem. To answer @BrandonCorlett, my current skill set does not include coding. I am a volunteer who took on this website project to learn, and I am very thankful to have found RapidWeaver and this forum. I originally did the website with iWeb, and I have picked up a bit of HTML in the process of figuring things out, but that’s all.

As to current stacks/plugins, I have quite a few, and many I haven’t even used. I tend to buy any that I think I could use, because I like to experiment with the different ways to set up the other websites that I do as a volunteer. Budget is my own, so not a problem, but I do like to understand how to use the things I buy. I should also mention that I did a fair amount of “over the phone” training with Ryan Smith when I started with RW, and that was very useful.

So back to the original question/problem.

Maybe I didn’t make it clear in my question, but I was originally using Lockdown, and the info that was on the Lockdown page is what showed up on the web. Now I’m using JW’s Page Safe and want to make sure that the info on the Page Safe page doesn’t end up on the web. I realize that the info I am concerned about in both cases is a PDF file in my resources folder, so @BrandonCorlett’s suggestion of blocking robots from seeing the resources folder sounds to me like a great solution, but it is not one that is available yet. As to doing something with robots.txt files, that is beyond me right now. #9 in @BrandonCorlett’s list seems like a lot to go through, but it would probably work; having the resources folder blocked from robots sounds easier for me, but probably harder for @nikf.

@Turtle thanks for the explanation of the PageSafe stack. I think I’m using it correctly and perhaps the info I have on that page isn’t yet on the web. Your suggestion of “So maybe the best solution would be to make the PDF file in this case an html file protected by Page Safe instead of a resource…” sounds like a solution but I don’t see how to do that.
Isn’t that what Lockdown does for you? If so, then since I was originally using Lockdown and the supposedly password protected page was on the web, then Lockdown isn’t the answer. When I used Lockdown the desired protected file was in the RW resources folder.

I think my problem is of interest to others so hopefully this discussion will help more than just me.

Thx again.
Mary


(Brad Halstead) #9

@bpequine

Personally, I typically only buy what is shown in RapidWeaver Community so I have not used Lockdown by LogHound or any of their products to be honest.

RapidBot 2 will assist you with creation of the robots.txt file. It is a plugin, and you can configure what is indexed and what is blocked from crawlers (programs, like Google’s, that go through your site to index the content of your pages). No need to see if @nikf will/can do it programmatically inside RapidWeaver :wink:

@BrandonCorlett suggested Joe Workman’s Power Grid Stacks. There are 3, I believe; the one for .csv files (comma-separated values) might work for you instead of a PDF.

As @dan and @ben suggested in their podcast, you can password protect the PDF document from within Apple’s Preview app or from whatever app you are using to create the PDF file.

Did you watch the (I think 3) videos on Page Safe on Joe Workman’s site (1 on the stack page and the rest in the documents portal, I do believe)? They are very informative.

Page Safe is a display page: when the correct password is entered, THEN the page content is downloaded to the browser. So I have to ask, are you referencing this same PDF file elsewhere on your website and maybe forgot about it (that could be how the PDF file link got onto the web)?

The Resources are, to my knowledge, not protected in any way, hence the suggestion to use RapidBot 2 to create and maintain a robots.txt file blocking areas of your website from being crawled. There is a link about robots.txt above in my original post that can give you some examples and understanding of it, but for ease of creation you may want to check out RapidBot 2.

Hope I got everything there and you find something useful between myself and @BrandonCorlett.

Edit: NOTE to all: if using Warehoused Resources, you will want to block them with robots.txt as well.

Brad


(Christopher Watson) #10

Looks as though the SEO stack does work in non-Foundation themes… Although, is it supposed to create a robots.txt file or just add the meta tag? Does it even work with just a meta tag?

(note: have not tried the robots setting on a foundation site yet…)

Cheers of your ears…


(Nik Fletcher) #11

Hi folks

Firstly: Google will index files that it finds a link to. So, if you’re linking to resources in your pages, it means that Google will by default index them.

There’s also the possibility that your web server is configured to show “folder listings” - this means that, if someone attempts to visit yourwebsite.com/resources some hosts will show the contents of that folder. You can normally ask your host to configure this to NOT happen (if it’s enabled).
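Where the host lets you use .htaccess files, folder listings can usually be switched off yourself with a single directive (a sketch; some hosts expose the same switch in their control panel instead):

```apache
# .htaccess in the web root: stop the server from listing directory contents
Options -Indexes
```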

As robots.txt files are something hosts can create by default (to exclude certain directories, etc.), we’re not likely to immediately support robots.txt creation in RapidWeaver, but I’d certainly never rule it out :slightly_smiling:

What you could do, in the meantime, is add a custom attribute to links to the resource via the Link Inspector. This tells Google, when it finds the link, not to consider the target for indexing.
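Nik doesn’t name the attribute here; assuming he means the standard rel="nofollow" value, a link would look like this (the file path is purely illustrative):

```html
<!-- rel="nofollow" asks search engines not to follow this link to its target -->
<a href="resources/members-list.pdf" rel="nofollow">Members list (PDF)</a>
```

Note that nofollow is a hint, not a guarantee; combining it with a robots.txt rule for the resources folder covers more ground.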

—Nik


(Brandon Scott Corlett) #12

@nikf shoots… AND HE SCORES!!!

:smile: Thanks @nikf, that is a great solution!


(Michael M.) #13

You are mixing up different things: protecting a page from indexing has nothing to do with protecting a page or a directory from being opened. Protecting from indexing can be done with a ROBOTS instruction in the head of a page, or in a robots.txt you save to the root directory of your server. It’s really simple and you do not need any plugin or special tool for this.

In the head you write:

<meta name="robots" content="noindex, nofollow">

Or you create a plain text file with that content:

User-Agent: * 
Disallow: /the_directoryname_you_do_not_want_to_be_indexed/

A rule in the .htaccess is not for protecting from indexing; it is for instructing the server. Here you can create rules that require a password when opening a page, or something similar.

Nik wrote:

There’s also the possibility that your web server is configured to show “folder listings” - this means that, if someone attempts to visit yourwebsite.com/resources some hosts will show the contents of that folder. You can normally ask your host to configure this to NOT happen (if it’s enabled).

You can deny crawlers from indexing the resources folder by writing an instruction into the robots.txt:

Disallow: /resources/

A problem is that some crawlers will not follow the rules. But such crawlers will follow even “nofollow” links.

Additionally, you should set your server to “deny directory browsing”.


(Brandon Scott Corlett) #14

Great post @apfelpuree! :+1:


(Joe Workman) #15

Yes. I should have an all new SEO Helper stack that is supported in all themes later this month. :slightly_smiling: