How to prevent search engines indexing some PDFs but not others?

Some time back I added an X-Robots-Tag header in .htaccess to prevent PDFs on a client website being indexed. This was because they belonged in a Members area and were not for public viewing.

However, they have recently added more publicly available PDFs and are keen for these to be indexed. Anyone know how to make some indexable and others not?

Hi @manofdogz

To block the indexing of PDFs in a specific directory (let’s say the pdfnoindex directory), you can add the following to your .htaccess file:

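# Match on the request path; FilesMatch only sees the file name (needs Apache 2.4+)
<If "%{REQUEST_URI} =~ m#/pdfnoindex/.*\.pdf$#">
Header set X-Robots-Tag "noindex"
</If>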

This will prevent all PDFs in the pdfnoindex directory from being indexed. For other PDFs that you want to be indexed, either place them in a different directory or explicitly allow indexing with this:

<If "%{REQUEST_URI} =~ m#/path/to/public-area/.*\.pdf$#">
Header set X-Robots-Tag "index"
</If>

This gives you control over which PDFs are indexed and which are not, so the client can manage their PDFs as needed while certain directories stay out of the index.
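Because `FilesMatch` compares its pattern against the file name rather than the directory path, a rule kept in one root .htaccess generally has to test the request URL instead. A minimal sketch of one way to do that, assuming mod_setenvif and mod_headers are enabled; `PDF_NOINDEX` is a made-up variable name and `/pdfnoindex/` is the example directory from above:

```apache
# Flag any request whose URL path contains /pdfnoindex/ and ends in .pdf
# (PDF_NOINDEX is a hypothetical variable name)
SetEnvIf Request_URI "/pdfnoindex/.*\.pdf$" PDF_NOINDEX

# Send the header only for the flagged requests
Header set X-Robots-Tag "noindex" env=PDF_NOINDEX
```

PDFs anywhere else get no X-Robots-Tag header at all from this rule, which leaves them indexable by default.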

Important note: not all robots respect that. So… :four_leaf_clover::magnet::crossed_fingers:

This!

Please keep in mind that robots.txt instructions are recommendations to the spiders. THEY ARE NOT SECURITY MEASURES. If the search engines find your files and like the data, they will index it.

This is just a guess on my part, but I believe all data the spiders find is stored. And then they choose what is made public with links. If you want privacy, don’t put it on the internet.

Even with :four_leaf_clover: you sure ??? :crazy_face:

Thanks for that. So just a new .htaccess in each directory, with just the explicit instruction for that directory, yes?

Yes, I agree. This is not a big privacy concern, however - just keeping certain documents for certain people.

You can use a single .htaccess file at the root of the site with both instructions inside. Make sure to double-check the directory names you choose. In your last post, you mentioned that some PDFs should only be accessible to certain people. These instructions don’t address that; I think you already know this, but I prefer to state it clearly to avoid any misunderstanding. I should also remind you that Google will respect these instructions, but not all bots will, especially the less reputable ones. Good luck :+1:
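The per-directory variant asked about earlier would also work, because a .htaccess file applies only to its own directory and the directories below it. A minimal sketch, assuming mod_headers is enabled and the file is saved inside the example pdfnoindex directory:

```apache
# Saved as pdfnoindex/.htaccess: applies to this directory and its subdirectories
<FilesMatch "\.pdf$">
    Header set X-Robots-Tag "noindex"
</FilesMatch>
```

Inside that file the pattern only needs to match the file name, since the file's location already limits the scope.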

If the non-public PDFs are behind a members-only area (users have to sign in to see the content), Google shouldn’t be able to pick those up anyway, so there is no need for any robots.txt or .htaccess rule.

If you use a content-blocking gate that prevents anyone from accessing the content until a form is filled out, then yeah, Google can’t see the content either. Google’s bots are never going to fill out a lead generation form, so they will never see the content on the other side.

If your client is placing those public PDFs outside the members-only area (on the public site everyone can access), again there is no need to add any robots.txt or .htaccess rule; Google should be able to pick them up.

How is the members-only area built? Are the pages behind a password-protected sign-in, or is it just a hidden area of the site that users have to know the URL for? If the latter, then yeah, some kind of rule would be required.

The members area is on a password-protected page. I recall that a long time back the whole directory could be browsed, so I added Options -Indexes to the .htaccess file. Beyond that, the site has 2 subdomain WordPress blogs, and I think Yoast has added the X-Robots file, but tbh I’m not sure (I don’t have anything to do with the WordPress bit). They have just pointed out that Analytics is saying the ‘public’ PDFs are not indexable, and I think it’s the X-Robots-Tag:

<FilesMatch "\.(doc|pdf)$">
Header set X-Robots-Tag "noindex, noarchive, nosnippet"
</FilesMatch>

I’m lost…

So am I :sweat_smile: I’ve removed the X-Robots-Tag and we’ll see what happens. Hopefully, the ‘public’ PDFs will be indexed and the members’ ones won’t be found anywhere in the SERPs … but we’ll see.
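Removing the header entirely also lifts the noindex from the members’ documents. A less drastic sketch keeps the blanket rule and drops the header only for one public directory. Here `/public-docs/` is a made-up directory name, `<If>` needs Apache 2.4 or later, and the result is worth confirming by inspecting the response headers of a PDF from each area:

```apache
# Blanket rule: keep .doc and .pdf files out of the index by default
<FilesMatch "\.(doc|pdf)$">
    Header set X-Robots-Tag "noindex, noarchive, nosnippet"
</FilesMatch>

# Exception: remove the header again for PDFs under the public directory
# (/public-docs/ is a hypothetical path)
<If "%{REQUEST_URI} =~ m#^/public-docs/.*\.pdf$#">
    Header unset X-Robots-Tag
</If>
```

The exception relies on Apache merging `<If>` sections after `<FilesMatch>` sections, so the unset is applied last.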

As long as those non-public PDFs are behind some kind of gated member log-in, Google’s bots shouldn’t be able to get to them, but it does depend on how the member area was built (such as access permissions, content behind a password, etc.). All stuff we don’t know here.
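If the sign-in is handled by the page alone, the PDF files themselves may still be reachable by anyone who has the direct URL. One way to gate the files at server level is HTTP Basic authentication. This is only a sketch: the realm name and the password file path are assumptions, and the password file would have to be created separately with the `htpasswd` utility:

```apache
# .htaccess inside the directory that holds the members' PDFs
AuthType Basic
AuthName "Members only"
# Assumed path; the password file should live outside the web root
AuthUserFile /path/to/.htpasswd
# Any user listed in the password file may access this directory
Require valid-user
```

A crawler that cannot authenticate receives a 401 response instead of the document, so there is nothing for it to index.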

If you find the non-public PDFs start getting indexed, you can always noindex them with an X-Robots-Tag header and submit a URL removal request via Google Search Console. :slightly_smiling_face:
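For a PDF, the noindex signal has to travel as an HTTP response header, because the file has no HTML head to hold a meta tag. A minimal sketch for a single stray document, assuming mod_headers is enabled; `members-handbook.pdf` is a placeholder name:

```apache
# members-handbook.pdf is a placeholder file name
<Files "members-handbook.pdf">
    Header set X-Robots-Tag "noindex"
</Files>
```

The crawler has to be able to fetch the URL to see this header, so the file should not also be blocked in robots.txt while the removal is pending.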
