Some time back I added an X-Robots-Tag header in .htaccess to prevent PDFs on a client's website from being indexed. This was because they belonged in a members area and were not for public viewing.
However, they have recently added more publicly available PDFs and are keen for these to be indexed. Anyone know how to make some indexable and others not?
To block the indexing of PDFs in a specific directory (let's say the pdfnoindex directory), you can use the X-Robots-Tag header. Note that FilesMatch only matches against the filename, not the directory path, so the cleanest approach is to place a .htaccess file inside the pdfnoindex directory itself containing:

<FilesMatch "\.pdf$">
Header set X-Robots-Tag "noindex"
</FilesMatch>
This will prevent all PDFs in the pdfnoindex directory from being indexed. For other PDFs that you want to be indexed, either place them in a different directory (indexing is the default, so no rule is needed) or explicitly allow indexing with a .htaccess in the public directory:

<FilesMatch "\.pdf$">
Header set X-Robots-Tag "index"
</FilesMatch>
This gives you control over which PDFs are indexed and which are not: the client can manage their PDFs as needed while ensuring certain directories remain unindexed.
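If you would rather keep everything in a single .htaccess at the site root, Apache 2.4's <If> directive can match on the full request path instead. A minimal sketch, assuming Apache 2.4 with mod_headers enabled and the hypothetical pdfnoindex directory name from above:

<If "%{REQUEST_URI} =~ m#^/pdfnoindex/.*\.pdf$#i">
# Ask crawlers not to index PDFs served from /pdfnoindex/
Header set X-Robots-Tag "noindex"
</If>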
Important note: not all robots respect that. So…
Please keep in mind: robots.txt instructions are recommendations to the spiders. THEY ARE NOT SECURITY MEASURES. If the search engines find your files and like the data, they will index it.
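For illustration, a robots.txt rule that asks crawlers to stay out of a directory would look like this (reusing the hypothetical pdfnoindex directory from above); it only discourages crawling and does nothing to stop anyone who has the URL:

User-agent: *
Disallow: /pdfnoindex/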
This is just a guess on my part, but I believe all data the spiders find is stored. And then they choose what is made public with links. If you want privacy, don't put it on the internet.
You can use the .htaccess approach with the two instructions inside; just make sure to double-check the directory names you choose. In your last post, you mentioned that some PDFs should only be accessible to certain people. These instructions don't address that. I think you already know this, but I prefer to state it clearly to avoid any misunderstanding. I should also remind you that Google will respect these instructions, but not all bots will, especially the less official ones. Good luck!
If the non-public PDFs are behind a members-only area (users have to sign in to see the content), Google shouldn't be able to pick those up anyway, so there's no need for any robots.txt or .htaccess rule.
If you use a content-blocking gate that prevents anyone from accessing the content until a form is filled out, then yeah, Google can't see the content either. Google's bots are never going to fill out a lead generation form, so they will never see the content on the other side.
If your client is placing those public PDFs outside of the members-only area (on the public site everyone can access), then again there's no need for any robots.txt or .htaccess rule; Google should be able to pick them up.
How is the members-only area built? Are the pages behind a password-protected sign-in, or is it just a hidden area of the site that users have to know the URL for? If the latter, then yeah, some kind of rule would be required.
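For reference, "password protected" in Apache terms usually means something like HTTP Basic Auth in the members directory's own .htaccess. A minimal sketch (the AuthUserFile path is just an example):

# Require a valid login for everything in this directory
AuthType Basic
AuthName "Members Area"
# Password file created separately with the htpasswd tool (example path)
AuthUserFile /home/example/.htpasswd
Require valid-user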
Members area is on a password-protected page. I recall that a long time back the whole directory could be accessed, so I added Options -Indexes to the .htaccess file. Beyond that, the site has two subdomain WordPress blogs, and I think Yoast has added the X-Robots-Tag header, but tbh I'm not sure (I don't have anything to do with the WordPress bit). They have just pointed out that Analytics is saying the "public" PDFs are not indexable, and I think it's this X-Robots-Tag rule:
<FilesMatch "\.(doc|pdf)$">
Header set X-Robots-Tag "noindex, noarchive, nosnippet"
</FilesMatch>
So anyway, I've removed the X-Robots-Tag and we'll see what happens. Hopefully the "public" PDFs will be indexed and the members' PDFs won't be found anywhere in SERPs… but we'll see.
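If the member files do need the header back later, a scoped version could go in a .htaccess inside the members directory itself (assuming the member documents live in their own directory), so the public PDFs are unaffected:

# .htaccess placed only in the members directory
<FilesMatch "\.(doc|pdf)$">
Header set X-Robots-Tag "noindex, noarchive, nosnippet"
</FilesMatch>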
As long as those non-public PDFs are behind some kind of gated member log-in, Google's bots shouldn't be able to get to them, but it does depend on how the member area was built (such as access permissions, content behind a password, etc.). All stuff we don't know here.
If you find the non-public PDFs start getting indexed, you can always add the relevant noindex directives and submit a URL removal request via Google's Search Console.
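One caveat worth noting: a PDF can't carry a robots meta tag, so for PDFs the equivalent is the X-Robots-Tag HTTP header again. A sketch targeting a single file (the filename is just an example):

<Files "members-report.pdf">
Header set X-Robots-Tag "noindex"
</Files>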