Production k8s Deployment
atlantis
To avoid a bootstrap circular dependency, atlantis in the shared cluster is installed in infrastructure-live.
Atlantis GUI: https://atlantis-ui.shared.company.io
blackbox-exporter
The blackbox-exporter is used to:
- verify network connectivity between VPCs, both legacy and new world;
- verify presence of atlantis in the shared cluster;
- verify presence of vault in all the new world clusters.
Deploy the blackbox exporter using a Helm chart.
- repo name: `prometheus-community`, but, since `-` is not accepted, make it `prometheus_community`
- repo URL: https://prometheus-community.github.io/helm-charts
- chart name: `prometheus-blackbox-exporter`
- version: 8.4.0
The latter two are derived from the contents of https://prometheus-community.github.io/helm-charts/index.yaml
> tk tool charts add-repo prometheus-community https://prometheus-community.github.io/helm-charts
{"level":"info","time":"2023-10-31T12:05:42-07:00","message":"Skipping prometheus-community: invalid name. cannot contain any special characters."}
Error: 1 Repo(s) were skipped. Please check above logs for details
> tk tool charts add-repo prometheus_community https://prometheus-community.github.io/helm-charts
{"level":"info","time":"2023-10-31T12:06:11-07:00","message":"OK: prometheus_community"}
Then:
tk tool charts add prometheus_community/prometheus-blackbox-exporter@8.4.0
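The probe modules themselves are configured through the chart values. A minimal sketch, assuming the chart's standard `config.modules` layout (the module name, timeout, and TLS setting below are placeholders, not this repo's actual configuration):

```jsonnet
// Hedged sketch of prometheus-blackbox-exporter chart values; adjust to the real configuration.
{
  config: {
    modules: {
      http_2xx: {
        prober: 'http',
        timeout: '5s',
        http: {
          preferred_ip_protocol: 'ip4',
          tls_config: {
            insecure_skip_verify: true,  // assumption: EKS/vault endpoints may use private CAs
          },
        },
      },
    },
  },
}
```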
TODO: blackbox-exporter endpoints are no longer scraped by Datadog. Start scraping them with Splunk, e.g. using the Smart Agent.
Troubleshooting k8s Endpoints
Get a shell in the pod and then:
wget -qO - http://localhost:9115/metrics|grep prom
Check connectivity to shared, example of failure to connect:
~ $ wget -qO - http://localhost:9115/probe?target=https://95CD862D2577203EE8607925A3FF3411.gr7.us-east-1.eks.amazonaws.com/livez?verbose
# HELP probe_dns_lookup_time_seconds Returns the time taken for probe dns lookup in seconds
# TYPE probe_dns_lookup_time_seconds gauge
probe_dns_lookup_time_seconds 0.003848593
# HELP probe_duration_seconds Returns how long the probe took to complete in seconds
# TYPE probe_duration_seconds gauge
probe_duration_seconds 5.001584557
# HELP probe_failed_due_to_regex Indicates if probe failed due to regex
# TYPE probe_failed_due_to_regex gauge
probe_failed_due_to_regex 0
# HELP probe_http_content_length Length of http content response
# TYPE probe_http_content_length gauge
probe_http_content_length 0
# HELP probe_http_duration_seconds Duration of http request by phase, summed over all redirects
# TYPE probe_http_duration_seconds gauge
probe_http_duration_seconds{phase="connect"} 0
probe_http_duration_seconds{phase="processing"} 0
probe_http_duration_seconds{phase="resolve"} 0.003848593
probe_http_duration_seconds{phase="tls"} 0
probe_http_duration_seconds{phase="transfer"} 0
# HELP probe_http_redirects The number of redirects
# TYPE probe_http_redirects gauge
probe_http_redirects 0
# HELP probe_http_ssl Indicates if SSL was used for the final redirect
# TYPE probe_http_ssl gauge
probe_http_ssl 0
# HELP probe_http_status_code Response HTTP status code
# TYPE probe_http_status_code gauge
probe_http_status_code 0
# HELP probe_http_uncompressed_body_length Length of uncompressed response body
# TYPE probe_http_uncompressed_body_length gauge
probe_http_uncompressed_body_length 0
# HELP probe_http_version Returns the version of HTTP of the probe response
# TYPE probe_http_version gauge
probe_http_version 0
# HELP probe_ip_addr_hash Specifies the hash of IP address. It's useful to detect if the IP address changes.
# TYPE probe_ip_addr_hash gauge
probe_ip_addr_hash 1.137491e+08
# HELP probe_ip_protocol Specifies whether probe ip protocol is IP4 or IP6
# TYPE probe_ip_protocol gauge
probe_ip_protocol 4
# HELP probe_success Displays whether or not the probe was a success
# TYPE probe_success gauge
probe_success 0
Check connectivity to dev, example of success:
~ $ wget -qO - http://localhost:9115/probe?target=https://E7281548B3E48DB5B68C580B64F46EC1.sk1.us-east-1.eks.amazonaws.com/livez?verbose
# HELP probe_dns_lookup_time_seconds Returns the time taken for probe dns lookup in seconds
# TYPE probe_dns_lookup_time_seconds gauge
probe_dns_lookup_time_seconds 0.001192028
# HELP probe_duration_seconds Returns how long the probe took to complete in seconds
# TYPE probe_duration_seconds gauge
probe_duration_seconds 0.01042065
# HELP probe_failed_due_to_regex Indicates if probe failed due to regex
# TYPE probe_failed_due_to_regex gauge
probe_failed_due_to_regex 0
# HELP probe_http_content_length Length of http content response
# TYPE probe_http_content_length gauge
probe_http_content_length 1539
# HELP probe_http_duration_seconds Duration of http request by phase, summed over all redirects
# TYPE probe_http_duration_seconds gauge
probe_http_duration_seconds{phase="connect"} 0.000859662
probe_http_duration_seconds{phase="processing"} 0.003866346
probe_http_duration_seconds{phase="resolve"} 0.001192028
probe_http_duration_seconds{phase="tls"} 0.003989397
probe_http_duration_seconds{phase="transfer"} 0.000130071
# HELP probe_http_redirects The number of redirects
# TYPE probe_http_redirects gauge
probe_http_redirects 0
# HELP probe_http_ssl Indicates if SSL was used for the final redirect
# TYPE probe_http_ssl gauge
probe_http_ssl 1
# HELP probe_http_status_code Response HTTP status code
# TYPE probe_http_status_code gauge
probe_http_status_code 200
# HELP probe_http_uncompressed_body_length Length of uncompressed response body
# TYPE probe_http_uncompressed_body_length gauge
probe_http_uncompressed_body_length 1539
# HELP probe_http_version Returns the version of HTTP of the probe response
# TYPE probe_http_version gauge
probe_http_version 2
# HELP probe_ip_addr_hash Specifies the hash of IP address. It's useful to detect if the IP address changes.
# TYPE probe_ip_addr_hash gauge
probe_ip_addr_hash 3.97724711e+09
# HELP probe_ip_protocol Specifies whether probe ip protocol is IP4 or IP6
# TYPE probe_ip_protocol gauge
probe_ip_protocol 4
# HELP probe_ssl_earliest_cert_expiry Returns last SSL chain expiry in unixtime
# TYPE probe_ssl_earliest_cert_expiry gauge
probe_ssl_earliest_cert_expiry 1.729957185e+09
# HELP probe_ssl_last_chain_expiry_timestamp_seconds Returns last SSL chain expiry in timestamp
# TYPE probe_ssl_last_chain_expiry_timestamp_seconds gauge
probe_ssl_last_chain_expiry_timestamp_seconds -6.21355968e+10
# HELP probe_ssl_last_chain_info Contains SSL leaf certificate information
# TYPE probe_ssl_last_chain_info gauge
probe_ssl_last_chain_info{fingerprint_sha256="be495eb2078fe6d04882545c3c79f9d713cc983f3cb75f6784f597ec581e0f2a",issuer="CN=kubernetes",subject="CN=kube-apiserver",subjectalternative="e7281548b3e48db5b68c580b64f46ec1.sk1.us-east-1.eks.amazonaws.com,ip-172-16-107-176.ec2.internal,kubernetes,kubernetes.default,kubernetes.default.svc,kubernetes.default.svc.cluster.local"} 1
# HELP probe_success Displays whether or not the probe was a success
# TYPE probe_success gauge
probe_success 1
# HELP probe_tls_version_info Returns the TLS version used or NaN when unknown
# TYPE probe_tls_version_info gauge
probe_tls_version_info{version="TLS 1.3"} 1
Vault Endpoints
The Vault health endpoint uses HTTP status codes to represent the vault state. To ensure that the unsealed-and-standby state is also represented by HTTP status code 200, as required by the blackbox exporter, we use the following customization:
wget --no-check-certificate -cqSO - https://vault.dev.company.io/v1/sys/health?standbycode=200
As you can see, both standby being false and standby being true result in HTTP status 200.
Having established the vault endpoint URL, let’s verify that the vault is healthy in the dev cluster using the blackbox exporter:
wget -qO - http://localhost:9115/probe?target=https://vault.dev.company.io/v1/sys/health?standbycode=200
Produces:
# HELP probe_dns_lookup_time_seconds Returns the time taken for probe dns lookup in seconds
# TYPE probe_dns_lookup_time_seconds gauge
probe_dns_lookup_time_seconds 0.003709816
# HELP probe_duration_seconds Returns how long the probe took to complete in seconds
# TYPE probe_duration_seconds gauge
probe_duration_seconds 0.010071418
# HELP probe_failed_due_to_regex Indicates if probe failed due to regex
# TYPE probe_failed_due_to_regex gauge
probe_failed_due_to_regex 0
# HELP probe_http_content_length Length of http content response
# TYPE probe_http_content_length gauge
probe_http_content_length 289
# HELP probe_http_duration_seconds Duration of http request by phase, summed over all redirects
# TYPE probe_http_duration_seconds gauge
probe_http_duration_seconds{phase="connect"} 0.000878359
probe_http_duration_seconds{phase="processing"} 0.001725035
probe_http_duration_seconds{phase="resolve"} 0.003709816
probe_http_duration_seconds{phase="tls"} 0.003300463
probe_http_duration_seconds{phase="transfer"} 9.0415e-05
...
cloudzero
Cloudzero is a tool to consolidate cost information, enrich it with context, and distribute it to the organization. One of its many capabilities is allocating a single cost across otherwise non-billed resources. A prime example is EKS compute costs: we run many EC2 instances for the cluster, but only part of those costs should be allocated to any given workload.
To accomplish this workload allocation we use the CloudZero k8s integration here. It provides the telemetry CloudZero needs to split the EC2 costs correctly across the underlying workloads.
depesz
external-dns
flagger-system
flux-system
FluxCD is the CD tool we use to deploy the state from this repo to the actual k8s cluster. This specific application generates a Flux Kustomization for each Environment.
Some options are exposed as labels on the Environment, namely:
| Label Key | Flux Option | Default |
|---|---|---|
| fluxcd.io/interval | interval | 15m |
| fluxcd.io/prune | prune | true |
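For example, assuming a Tanka inline Environment (the environment name and label values below are placeholders), overriding both options might look like:

```jsonnet
// Hedged sketch of overriding the Flux options via Environment labels; names are placeholders.
{
  apiVersion: 'tanka.dev/v1alpha1',
  kind: 'Environment',
  metadata: {
    name: 'environments/example-app',
    labels: {
      'fluxcd.io/interval': '30m',  // override the 15m default
      'fluxcd.io/prune': 'false',   // disable pruning for this Environment
    },
  },
  data: {},  // resources go here
}
```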
Access control for Flux Kustomizations
A Role and RoleBinding are automatically generated for each team in the teamsWithKustomizationRights array in environments/flux-system/main.jsonnet, provided the team is listed either as owner_group or in the edit list inside service_catalog.json.
This allows the given team not only to control the resources within their application namespace, but also to edit the Kustomization, which mostly means the ability to suspend and resume it.
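As a purely hypothetical illustration (the team name is a placeholder and the exact shape of the array in this repo may differ), granting a team these rights might look like:

```jsonnet
// Hypothetical sketch; the team must also be owner_group or in the edit list in service_catalog.json.
{
  teamsWithKustomizationRights:: [
    'team-example',  // placeholder team name
  ],
}
```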
goldilocks
See https://github.com/FairwindsOps/goldilocks/tree/master/hack/manifests
istio-gateway istio-system istio-waypoint
istio “extends Kubernetes to establish a programmable, application-aware network.”
Install istio using helm charts:
Waypoint: Istio Ambient Waypoint Proxy Made Simple
karpenter
an open-source cluster autoscaler that automatically provisions new nodes in response to unschedulable pods.
We use a (default) Provisioner. After the upgrade, switch to NodePool.
Provisioner support for startupTaints was introduced here.
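As an illustration, a hedged sketch of a Provisioner carrying a startupTaint (karpenter.sh/v1alpha5); the requirements, limits, and providerRef below are placeholders, not the actual default provisioner in this repo:

```jsonnet
// Hedged Provisioner sketch; nodes start tainted so only tolerating pods (see nodetaint below) run first.
{
  apiVersion: 'karpenter.sh/v1alpha5',
  kind: 'Provisioner',
  metadata: { name: 'default' },
  spec: {
    startupTaints: [
      { key: 'nodetaint/notready', effect: 'NoSchedule' },  // removed later by the nodetaint controller
    ],
    requirements: [
      { key: 'karpenter.sh/capacity-type', operator: 'In', values: ['on-demand'] },  // placeholder
    ],
    limits: { resources: { cpu: '1000' } },  // placeholder
    providerRef: { name: 'default' },        // placeholder AWSNodeTemplate
  },
}
```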
keda
KEDA is a Kubernetes-based Event Driven Autoscaler. With KEDA, you can drive the scaling of any container in Kubernetes based on the number of events needing to be processed.
Autoscaling by:
- CPU
- SQS: https://keda.sh/docs/2.16/scalers/aws-sqs
- Kafka: https://keda.sh/docs/2.16/scalers/apache-kafka
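For illustration, a minimal ScaledObject sketch for the SQS scaler (keda.sh/v1alpha1); the target deployment, queue URL, region, and thresholds are placeholders, not an actual workload in this repo:

```jsonnet
// Hedged ScaledObject sketch for SQS-driven autoscaling; all values are placeholders.
{
  apiVersion: 'keda.sh/v1alpha1',
  kind: 'ScaledObject',
  metadata: { name: 'example-worker' },
  spec: {
    scaleTargetRef: { name: 'example-worker' },  // deployment to scale
    minReplicaCount: 1,
    maxReplicaCount: 10,
    triggers: [
      {
        type: 'aws-sqs-queue',
        metadata: {
          queueURL: 'https://sqs.us-east-1.amazonaws.com/123456789012/example-queue',
          queueLength: '5',       // target messages per replica
          awsRegion: 'us-east-1',
        },
      },
    ],
  },
}
```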
langsmith
Followed https://docs.smith.langchain.com/self_hosting
Installed via helm with a chart from:
- https://github.com/langchain-ai/helm/blob/main/README.md
- https://langchain-ai.github.io/helm/charts/langsmith/
llm-proxy
The llm-proxy service provides a centralized proxy for accessing LLMs through a consistent API interface.
A web interface is available via the open-webui deployment, which connects to the proxy service.
Endpoints:
- Main API: llm-proxy.[account]
- Web UI: open-webui.[account]
Available Models: Azure
- gpt-o3-mini
- gpt-o1
- gpt-o1-mini
- gpt-4o-mini
- text-embedding-3-small
- text-embedding-3-large
Adding new models: When adding new Azure models, please remember to set base_model for accurate cost calculations. Documentation can be found here.
k8s metrics server
Installed using helm chart
milvus
n8n
https://github.com/8gears/n8n-helm-chart
nodetaint
Problem this solves: workload pods should start only after system-critical pods (e.g., the CSI secret provider and otel) are already running.
How this works:
- karpenter `startupTaints` are used to have the nodes started with the taint `nodetaint/notready`.
- infrastructure pods are configured with `nodetaint/notready` tolerations, which allows them to start on the newly started nodes. Other pods without such tolerations are NOT scheduled on the new nodes.
- this nodetaint controller removes a pre-configured taint `nodetaint/notready` from a node after the daemonsets annotated with `nodetaint/system-critical` are running on the node. This ensures that the system-critical daemonsets are running on a node before it can run any other pods.
We mark the following daemonsets as system critical:
- `vault-csi-provider`
- `istio-cni-node`
- `istio-ztunnel`
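For illustration, a hedged sketch of the two pieces involved; the taint effect and the annotation value are assumptions and must match the actual karpenter startupTaint and nodetaint controller configuration:

```jsonnet
{
  // Toleration given to infrastructure pods so they can start on freshly tainted nodes.
  notReadyToleration:: {
    key: 'nodetaint/notready',
    operator: 'Exists',
    effect: 'NoSchedule',  // assumption: must match the effect on the karpenter startupTaint
  },
  // Annotation that marks a daemonset as system critical for the nodetaint controller.
  systemCriticalAnnotations:: {
    'nodetaint/system-critical': 'true',  // assumption: the exact value expected may differ
  },
}
```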
TODO:
- also mark `opentelemetry-operator-collector-daemonset` as system critical
platform-insights
Palantir policy-bot
secrets-store-csi
This namespace is defined elsewhere; we only add DatadogMonitors.
slack-group-updater
ssl-exporter
SSL metric exporter, docs.
strongdm
trino-a trino-mcp trino-gateway
We deploy trino using a Helm chart.
Configuration
How to verify new Vault secret injection
Step 1. Add new secrets to the secretsPerEnv in the secrets.jsonnet file
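For illustration only, a hypothetical shape of such an addition (the secret name, Vault path, and the actual structure of secretsPerEnv are placeholders; follow the existing entries in secrets.jsonnet):

```jsonnet
// Hypothetical sketch; mirror the existing entries in secrets.jsonnet rather than this shape.
{
  secretsPerEnv:: {
    dev: {
      'trino-new-secret': 'secret/data/dev/trino/new-secret',  // placeholder name and Vault path
    },
  },
}
```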
Step 2. Create init-dev-config.sh and extend ConfigMaps in the secrets.jsonnet file
configMaps: [
configMap.new('trino-init-config-script', if account == 'dev' then {
'init-dev-config.sh': (importstr 'init-dev-config.sh'),
} else {
'init-config.sh': (importstr 'init-config.sh'),
}),
],
upwind-operator
vault vault-operator
We deploy bank-vaults vault-operator, docs, using a helm chart. vault-operator in turn deploys vault.
The vault environment deploys vault-csi-provider using the official HashiCorp Helm chart for vault.
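For orientation, a hedged sketch of the bank-vaults Vault custom resource that vault-operator reconciles (vault.banzaicloud.com/v1alpha1); the size, image, and config below are placeholders, not this repo's actual configuration:

```jsonnet
// Hedged sketch of a bank-vaults Vault CR; all values are placeholders.
{
  apiVersion: 'vault.banzaicloud.com/v1alpha1',
  kind: 'Vault',
  metadata: { name: 'vault' },
  spec: {
    size: 3,                          // placeholder HA cluster size
    image: 'hashicorp/vault:1.14.8',  // placeholder vault version
    serviceAccount: 'vault',
    config: {                         // raw vault config rendered by the operator
      storage: { raft: { path: '/vault/data' } },
      listener: { tcp: { address: '0.0.0.0:8200' } },
      ui: true,
    },
  },
}
```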
velero
Velero offers tools to back up and restore your Kubernetes cluster resources and persistent volumes. How Velero Works.
Velero is scheduled to perform a daily backup of selected namespaces (specifically vault) and the resources within them.
The backup destination is an S3 bucket named `velero-<cluster-name>-<cluster-region>-<cluster-accountId>`.
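For illustration, a hedged sketch of a Velero Schedule (velero.io/v1) matching the description above; the cron expression, namespace list, and TTL are placeholders:

```jsonnet
// Hedged Schedule sketch for the daily backup; adjust namespaces, cron, and TTL to the real setup.
{
  apiVersion: 'velero.io/v1',
  kind: 'Schedule',
  metadata: { name: 'daily-backup', namespace: 'velero' },
  spec: {
    schedule: '0 3 * * *',  // placeholder daily cron
    template: {
      includedNamespaces: ['vault'],  // plus any other namespaces to back up
      ttl: '720h',                    // placeholder retention (30 days)
    },
  },
}
```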