While preparing for the release of Gigahatch Managed Kubernetes, we encountered the issue of our staging environment potentially being indexed by search engines. To fix this, we would like to serve a robots.txt file with the following content from the root of our website:

```
User-agent: *
Disallow: /
```
To do this, we can either host the file in the application itself, or we can host it somewhere in the environment outside the application. Let’s look at both approaches.
One approach to adding the robots.txt file is embedding it within the web application container. This would require building different images for staging and production environments or managing the robots.txt file within the CD pipeline. Alternatively, we could configure the web container to serve a different robots.txt file based on the environment. However, this adds unwanted complexity to an otherwise independent application image and requires a code change every time we want to change something in this setup.
Ideally, we would be able to configure our Kubernetes ingress to serve this file directly. Then we wouldn't have to touch the image, and we could choose whether to serve the file based on the environment. We could also trivially reuse the same solution for multiple applications.
We could run a minimal pod that just serves a robots.txt file and add a new path entry to the ingress, but that seems like a complicated and inflexible setup for a single static file. There doesn't seem to be an ingress-native way to do this, so we looked for ingress-nginx-specific solutions, since that is the ingress controller we use. Fortunately, we can configure NGINX to handle this with almost zero overhead using configuration snippets.
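For contrast, the dedicated-pod approach we decided against would need at least a ConfigMap, a Deployment, and a Service, roughly like this (all names here are illustrative, not part of our actual setup):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: robots-txt
data:
  robots.txt: |
    User-agent: *
    Disallow: /
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: robots-server
spec:
  replicas: 1
  selector:
    matchLabels: { app: robots-server }
  template:
    metadata:
      labels: { app: robots-server }
    spec:
      containers:
        - name: nginx
          image: nginx:alpine
          volumeMounts:
            # Mount the ConfigMap so nginx serves /robots.txt from its web root
            - name: robots
              mountPath: /usr/share/nginx/html
      volumes:
        - name: robots
          configMap: { name: robots-txt }
---
apiVersion: v1
kind: Service
metadata:
  name: robots-server
spec:
  selector: { app: robots-server }
  ports:
    - port: 80
```

And on top of all that, every ingress that should use it would still need an extra path entry pointing at the service.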
Configuration snippets are disabled by default in ingress-nginx due to security concerns, so make sure you understand the implications before enabling them. In our case this is no problem, because we are the only ones using our Kubernetes cluster.
We use Flux for our GitOps workflow, combined with Kustomize to manage environment-specific deployments. This setup makes it easy to apply the configuration selectively to specific environments using Kustomize patches. If you use different GitOps tooling, the way you apply this per environment will differ.
Here is our base ingress resource, shared across all environments. This is the YAML file the patches will be applied to.

ingress-frontend.yaml
```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: frontend
  labels:
    name: frontend
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod
spec:
  ingressClassName: nginx
  rules:
    - host: cloud.gigahatch.ch
      http:
        paths:
          - pathType: Prefix
            path: '/'
            backend:
              service:
                name: frontend
                port:
                  name: http
          # ... other paths omitted
  tls:
    - hosts:
        - cloud.gigahatch.ch
      secretName: cloud-gigahatch-ch-crt
```
To instruct NGINX to serve our robots.txt, we simply add the `nginx.ingress.kubernetes.io/server-snippet` annotation to our ingress (here using a Kustomize merge patch). You could also add this annotation directly to your resource definition if you don't want it to change between environments.
patch-ingress.yaml
```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: frontend
  annotations:
    nginx.ingress.kubernetes.io/server-snippet: |
      location /robots.txt {
        # Ensure the file is served as text/plain rather than the controller default
        default_type text/plain;
        return 200 "User-agent: *\nDisallow: /\n";
      }
    nginx.ingress.kubernetes.io/configuration-snippet: |
      add_header X-Robots-Tag none;
```
We also added the `nginx.ingress.kubernetes.io/configuration-snippet` annotation to set the `X-Robots-Tag` header to `none` on all locations. This is the recommended way to instruct search engines not to index a page: robots.txt alone does not prevent indexing if another website links to your page, whereas setting the `X-Robots-Tag` header to `none` fully blocks indexing.
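Once the patch has reconciled, you can verify the result with `curl -i https://your-staging-host/robots.txt` (hostname assumed) and check both the body and the `X-Robots-Tag` header. Note that the `\n` escapes inside the double-quoted `return` string are expanded by nginx into real newlines, so clients receive a proper two-line file, as this quick shell check illustrates:

```shell
# printf expands the same escape sequences nginx uses in the return directive
printf 'User-agent: *\nDisallow: /\n'
```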
If you are using Flux like us, make sure you don't forget to add the patch-ingress.yaml file to your kustomization.yaml file, otherwise nothing will happen:
dev/kustomization.yaml
```yaml
resources:
  - ../base/ingress-frontend.yaml
patches:
  - path: patch-ingress.yaml
    target:
      kind: Ingress
      name: frontend
```
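For orientation, the repository layout assumed in these snippets looks roughly like this (paths are illustrative; `../base/` in the overlay resolves relative to the overlay directory):

```
base/
  ingress-frontend.yaml
dev/
  kustomization.yaml
  patch-ingress.yaml
```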
For security reasons, configuration snippets are disabled by default in ingress-nginx. Since we control and trust every ingress resource created in this cluster, we can safely enable this feature.
To allow snippets, we modified our ingress-nginx Helm chart deployment. We use the `HelmRelease` CRD from Flux to provision the ingress controller, so we only needed to set the `allowSnippetAnnotations: true` flag in the chart values.
ingress-nginx.yaml
```yaml
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: ingress-nginx
  namespace: ingress-nginx
spec:
  chart:
    spec:
      chart: ingress-nginx
      reconcileStrategy: ChartVersion
      sourceRef:
        kind: HelmRepository
        name: ingress-nginx
  interval: 1h0m0s
  timeout: 10m
  targetNamespace: ingress-nginx
  install:
    crds: Create
  upgrade:
    crds: CreateReplace
  values:
    controller:
      # --> Set this flag to true
      allowSnippetAnnotations: true
      config:
        # ...
      service:
        type: LoadBalancer
        annotations:
          # Depending on your cloud provider, you might need to adjust these annotations.
          # In this example, we use Gigahatch Managed Kubernetes
          load-balancer.gigahatch.cloud/location: 'EUROPE_CENTRAL_1'
          load-balancer.gigahatch.cloud/use-private-ip: 'true'
          load-balancer.gigahatch.cloud/uses-proxyprotocol: 'true'
```
If you are using the Helm CLI rather than Flux, the following command accomplishes a similar setup:

```shell
helm upgrade --install ingress-nginx ingress-nginx/ingress-nginx \
  --namespace ingress-nginx \
  --set controller.allowSnippetAnnotations=true
```
By using the `nginx.ingress.kubernetes.io/server-snippet` annotation provided by the ingress-nginx controller, we configured NGINX to serve the robots.txt file without running a separate pod. This approach keeps the web application image simple and lets us manage the ingress configuration with Flux GitOps.

This solution demonstrates how we can leverage Kubernetes' flexibility to solve problems in a clean and efficient way. I hope you found this article helpful. If you have any questions or suggestions, please leave a comment below.