I received an AdGrant from Google, but my site (www.wikibeaks.org) keeps getting disallowed. One of the complaints from the Google console was that the crawler couldn’t find a robots.txt file, so it can’t index the site - apparently that’s the default now, so the crawler doesn’t expose private links. I thought that checking the “(robots) index this page” box would take care of this, but it seems it doesn’t. So I created a simple robots.txt file. I’m not sure what I need to put in it … I want the crawler to only see the published pages, and it seems like too much hassle to explicitly allow each of them. I need to get this working so I can get re-indexed and re-approved.
What directories/files should I exclude? Here’s what I have now:
You should allow everything, and then specify the pages you don’t want crawled/indexed. It looks like you’re blocking a lot of important content, and that’s likely what’s causing the issues.
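For example, something along these lines - the directory names here are just placeholders, so substitute whatever your publishing tool actually generates:

    User-agent: *
    # Anything not listed below is crawlable by default
    Disallow: /drafts/
    Disallow: /admin/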
If you REALLY don’t want something seen on the internet, you shouldn’t publish it.
I didn’t realize how much - page drafts, for example - would be indexed and searchable. I only want to block the minimum. If I don’t need to block the RW general files, I’ll just block the readme, which is just a note reminding me not to delete the Google authentication file.
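If it really comes down to just the readme, the whole file can be this small (assuming the readme sits at the site root and is named readme.txt - adjust to the real filename):

    User-agent: *
    Disallow: /readme.txt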
** Just got off the phone with Google AdWords. They had me modify my robots.txt file as below, and it’s now submitted for review. Apparently it needs the specific Google permissions … an asterisk isn’t good enough.
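Roughly this shape (the paths are placeholders; the important part is the explicit AdsBot-Google group, since Google’s documentation says AdsBot-Google ignores the wildcard User-agent: * group unless it’s addressed by name):

    # Google's ads crawler has to be named explicitly
    User-agent: AdsBot-Google
    Allow: /

    # Everyone else: crawl the whole site except the readme
    User-agent: *
    Disallow: /readme.txt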