What do I need to specify in my robots.txt file?


(Karen) #1

I received an Ad Grant from Google, but my site (www.wikibeaks.org) keeps getting disallowed. One of the complaints from the Google console was that the crawler couldn’t find a robots.txt file, so it can’t index the site (that’s the default now, so the crawler doesn’t expose private links). I thought that checking the “(robots) index this page” box would take care of this, but it seems it doesn’t. So I created a simple robots.txt file, but I’m not sure what I need to put in it. I want the crawler to see only the published pages, and it seems like too much hassle to allow them all explicitly. I need to get this working so I can get re-indexed and re-approved.

What directories/files should I exclude? Here’s what I have now:

robots.txt file created for http://www.wikibeaks.com/

May 8, 2018

Exclude Files From All Robots:

User-agent: *
Disallow: /resources
Disallow: /rw_common
Disallow: /blog_files
Disallow: /markdown
Disallow: /Wikibeaks_readme.txt

End robots.txt file
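If it helps to sanity-check rules like these before uploading, Python’s standard-library urllib.robotparser can evaluate a robots.txt locally. A minimal sketch (the sample page paths are made up for illustration; the rules are the draft above):

```python
from urllib.robotparser import RobotFileParser

# The draft rules from the post above.
rules = """\
User-agent: *
Disallow: /resources
Disallow: /rw_common
Disallow: /blog_files
Disallow: /markdown
Disallow: /Wikibeaks_readme.txt
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

# Pages outside the listed paths stay crawlable.
print(rp.can_fetch("Googlebot", "/about.html"))            # True
# Anything under a Disallowed prefix is blocked.
print(rp.can_fetch("Googlebot", "/resources/logo.png"))    # False
print(rp.can_fetch("Googlebot", "/Wikibeaks_readme.txt"))  # False
```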



(NeilUK) #2

You should allow everything, and then specify the pages you don’t want crawled/indexed. It looks like you’re blocking a lot of important stuff, which could cause issues.

If you REALLY don’t want something seen on the internet, you shouldn’t publish it.
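As a sketch of that “allow by default, block the exceptions” approach (the paths here are just placeholders, not the actual site structure):

```
User-agent: *
Disallow: /drafts/
Disallow: /internal-notes.txt
```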


(Bill Fleming) #3

Why not go for the whole shebang and enter this in your robots.txt :smile:

User-agent: *
Disallow: /


(Karen) #4

I didn’t know how much stuff, such as page drafts, would be indexed and searchable. I only want to block the minimum. If I don’t need to block the RW general files, I’ll just block the readme, which is just a note reminding me not to delete the Google authentication file.

** Just got off the phone with Google AdWords. They had me modify my robots.txt file as below, and it’s now submitted for review. Apparently it needs the specific Google permissions … an asterisk isn’t good enough.

robots.txt file created for http://www.wikibeaks.com/

May 8, 2018

Exclude Files From All Robots:

User-agent: *
Disallow:

User-agent: Googlebot
Disallow:

User-agent: Googlebot-image
Disallow:

End robots.txt file
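For what it’s worth, the same standard-library check (Python’s urllib.robotparser, just as an illustration) confirms that an empty Disallow line blocks nothing, so this file leaves everything crawlable for every agent:

```python
from urllib.robotparser import RobotFileParser

# The revised rules from the post above.
rules = """\
User-agent: *
Disallow:

User-agent: Googlebot
Disallow:

User-agent: Googlebot-image
Disallow:
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

# An empty "Disallow:" means nothing is blocked for that agent.
for agent in ("Googlebot", "Googlebot-image", "SomeOtherBot"):
    print(agent, rp.can_fetch(agent, "/any/page.html"))  # all True
```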


(system) #5

This topic was automatically closed 30 days after the last reply. New replies are no longer allowed.