Some time back I added an X-Robots-Tag header in .htaccess to prevent PDFs on a client's website from being indexed. This was because they belonged in a members area and were not for public viewing.
However, they have recently added more publicly available PDFs and are keen for these to be indexed. Anyone know how to make some indexable and others not?
To block the indexing of PDFs in a specific directory (let's say the pdfnoindex directory), you can use the X-Robots-Tag header. Note that FilesMatch only matches against the filename, not the directory path, so the cleanest approach is to place a .htaccess file inside the pdfnoindex directory itself containing:

<FilesMatch "\.pdf$">
Header set X-Robots-Tag "noindex"
</FilesMatch>
This will prevent all PDFs in the pdfnoindex directory from being indexed. For other PDFs that you want to be indexed, either place them in a different directory (indexing is the default, so no rule is needed) or explicitly allow indexing with a .htaccess in the public directory:

<FilesMatch "\.pdf$">
Header set X-Robots-Tag "index"
</FilesMatch>
This gives you control over which PDFs are indexed and which are not: the client can manage their PDFs as needed while ensuring certain directories remain unindexed.
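If you would rather keep everything in a single .htaccess at the site root, Apache 2.4's <If> directive can match on the full request path instead. A minimal sketch, assuming Apache 2.4 with mod_headers enabled and the hypothetical pdfnoindex directory name from above:

<If "%{REQUEST_URI} =~ m#^/pdfnoindex/.*\.pdf$#i">
# Ask crawlers not to index PDFs served from /pdfnoindex/
Header set X-Robots-Tag "noindex"
</If>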
Important note: not all robots respect that. So…
Please keep in mind: robots.txt instructions are recommendations to the spiders. THEY ARE NOT SECURITY MEASURES. If the search engines find your files and like the data, they will index it.
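For illustration, a robots.txt rule that asks crawlers to stay out of a directory would look like this (reusing the hypothetical pdfnoindex directory from above); it only discourages crawling and does nothing to stop anyone who has the URL:

User-agent: *
Disallow: /pdfnoindex/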
This is just a guess on my part, but I believe all data the spiders find is stored. And then they choose what is made public with links. If you want privacy, don't put it on the internet.
You can use the .htaccess approach with the two instructions inside; just make sure to double-check the directory names you choose. In your last post, you mentioned that some PDFs should only be accessible to certain people. These instructions don't address that. I think you already know this, but I prefer to state it clearly to avoid any misunderstanding. I should also remind you that Google will respect these instructions, but not all bots will, especially the less official ones. Good luck!
If the non-public PDFs are behind a members-only area (users have to sign in to see the content), Google shouldn't be able to pick those up anyway, so there's no need for any robots.txt or .htaccess rule.
If you use a content-blocking gate that prevents anyone from accessing the content until a form is filled out, then yeah, Google can't see the content either. Google's bots are never going to fill out a lead generation form, so they will never see the content on the other side.
If your client is placing those public PDFs outside of the members-only area (on the public site everyone can access), then again there's no need for any robots.txt or .htaccess rule; Google should be able to pick them up.
How is the members-only area built? Are the pages behind a password-protected sign-in, or is it just a hidden area of the site that users have to know the URL for? If the latter, then yeah, some kind of rule would be required.
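For reference, "password protected" in Apache terms usually means something like HTTP Basic Auth in the members directory's own .htaccess. A minimal sketch (the AuthUserFile path is just an example):

# Require a valid login for everything in this directory
AuthType Basic
AuthName "Members Area"
# Password file created separately with the htpasswd tool (example path)
AuthUserFile /home/example/.htpasswd
Require valid-user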
Members area is on a password-protected page. I recall that a long time back the whole directory could be accessed, so I added Options -Indexes to the .htaccess file. Beyond that, the site has two subdomain WordPress blogs, and I think Yoast has added the X-Robots-Tag header, but tbh I'm not sure (I don't have anything to do with the WordPress bit). They have just pointed out that Analytics is saying the "public" PDFs are not indexable, and I think it's this X-Robots-Tag rule:
<FilesMatch "\.(doc|pdf)$">
Header set X-Robots-Tag "noindex, noarchive, nosnippet"
</FilesMatch>
So anyway, I've removed the X-Robots-Tag and we'll see what happens. Hopefully the "public" PDFs will be indexed and the members' PDFs won't be found anywhere in SERPs… but we'll see.
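If the member files do need the header back later, a scoped version could go in a .htaccess inside the members directory itself (assuming the member documents live in their own directory), so the public PDFs are unaffected:

# .htaccess placed only in the members directory
<FilesMatch "\.(doc|pdf)$">
Header set X-Robots-Tag "noindex, noarchive, nosnippet"
</FilesMatch>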
As long as those non-public PDFs are behind some kind of gated member log-in, Google's bots shouldn't be able to get to them, but it does depend on how the member area was built (such as access permissions, content behind a password, etc.). All stuff we don't know here.
If you find the non-public PDFs start getting indexed, you can always add the relevant noindex directives and submit a URL removal request via Google's Search Console.
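One caveat worth noting: a PDF can't carry a robots meta tag, so for PDFs the equivalent is the X-Robots-Tag HTTP header again. A sketch targeting a single file (the filename is just an example):

<Files "members-report.pdf">
Header set X-Robots-Tag "noindex"
</Files>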