From Manual Processes to Resilient Infrastructure
Building TLS certificate management at the scale of thousands of services
As infrastructure grows, TLS certificate management stops being a purely administrative task. It begins to directly affect service reliability, SLA compliance, and the security of business processes.
Manual workflows, centralized request queues, and a lack of transparent control create systemic risks when an organization operates thousands of services. Some certificates fall out of sight, the number of incidents increases, and engineering teams spend time on approvals instead of building and improving products.
Sergei Naumov, Senior Backend Developer at Avito working in the field of information security, focuses on challenges related to secrets management and PKI processes. His experience includes designing architectures for TLS certificate automation, integrating such systems into distributed infrastructure, and building processes where certificate management becomes part of the service lifecycle.
As part of one infrastructure project, the goal was to automate the management of approximately 2,000 TLS certificates for key platforms, from PaaS to data processing services. These systems are used by nearly all development teams, which means that mistakes in certificate management can affect the stability of a significant part of the infrastructure.
In this interview, Sergei explains why a centralized certificate management model eventually stops scaling, how to build a service-oriented approach to certificate management, and which engineering decisions help reduce operational risks without adding more manual control.
When does TLS certificate management start affecting the resilience of the entire infrastructure?
When the number of certificates becomes large enough, the task is no longer limited to tracking expiration dates. Manual tools such as spreadsheets, reminders, and request-based workflows stop being reliable.
In practice, this becomes visible quite quickly. Some certificates get lost in the process, some renewals are delayed, and some environment-specific details are missed. As a result, incidents begin to occur, and those incidents directly affect how services operate.
In a distributed infrastructure, an expired certificate can break integrations, degrade services, or disrupt internal platforms. When there are many dependencies, such problems can start spreading through the system as a chain reaction.
That is why certificate management becomes part of reliability architecture. Timely renewal is important, but it is not enough on its own. What also matters is how the process is designed: who is responsible for certificates, how they are issued, how errors are tracked, and how the system behaves during failures.
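The chain reaction described above can be made concrete with a small sketch. It walks a dependency map breadth-first to list every service that is affected, directly or indirectly, when one service's certificate expires. The service names and the shape of the map are invented for illustration; the interview does not describe how dependencies are actually modeled.

```python
from collections import deque

# Hypothetical dependency map: each key is a service, each value lists the
# services that call it over TLS. All names are invented for illustration.
DEPENDENTS = {
    "internal-gateway": ["paas-api", "data-pipeline"],
    "paas-api": ["checkout", "search"],
    "data-pipeline": ["analytics"],
    "checkout": [],
    "search": [],
    "analytics": [],
}


def blast_radius(failed_service, dependents):
    """Return every service reachable from the failed one, breadth-first."""
    affected = []
    seen = {failed_service}
    queue = deque([failed_service])
    while queue:
        current = queue.popleft()
        for downstream in dependents.get(current, []):
            if downstream not in seen:
                seen.add(downstream)
                affected.append(downstream)
                queue.append(downstream)
    return affected


if __name__ == "__main__":
    # An expired certificate on the gateway reaches every other service here.
    print(blast_radius("internal-gateway", DEPENDENTS))
```

Even in this toy map, a single expired certificate near the root touches five other services, which is why ownership and failure behavior matter as much as the renewal date itself.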
Why does a centralized certificate management model stop scaling over time?
At the early stages, a centralized model may look reasonable: there is a dedicated team that issues certificates, and other teams submit requests.
But as infrastructure grows, the number of operations increases sharply. Issuance, reissuance, parameter changes, and renewals all turn into a constant stream of repetitive tasks. The central team becomes a bottleneck, while developers become dependent on request processing speed.
In practice, this leads to several consequences. First, the workload on infrastructure engineers increases. Second, product teams experience delays. Third, teams are tempted to build workarounds, relying on temporary schemes just to speed things up.
Under these conditions, the model itself begins to slow down infrastructure development.
Which architectural principles become key at this scale?
The key principle is to move certificate management closer to services.
In practice, this means that issuing and renewing certificates should no longer be treated as a separate administrative procedure. Instead, these processes should be embedded into the service infrastructure through tools that teams can use within their own workflows.
Such solutions require designing the architecture of the certificate authority, configuring the CA system, and thinking through certificate issuance and rotation scenarios.
For example, an open-source CA can be adapted to the internal infrastructure: issuance logic can be configured, integrations with services can be built, and security rules can be defined according to the organization's requirements.
Another important part of the work is developing an automation tool that allows services to obtain and renew certificates on their own. This tool must account for infrastructure-specific details, integrate with internal systems, and operate without manual involvement.
Observability is also critical: metrics, alerts, error tracking, and scenarios for handling large-scale failures.
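As a rough illustration of the observability point, the sketch below aggregates renewal attempts into a failure ratio and a coarse alert level, separating isolated failures from a large-scale one. The record type, the 20% threshold, and the alert names are assumptions made for this example, not details of the system discussed in the interview.

```python
from dataclasses import dataclass


@dataclass
class RenewalAttempt:
    """One renewal attempt reported by the automation tool (hypothetical shape)."""
    service: str
    succeeded: bool


def summarize(attempts, mass_failure_ratio=0.2):
    """Aggregate renewal attempts into a metric and a coarse alert level."""
    total = len(attempts)
    failures = [a for a in attempts if not a.succeeded]
    ratio = len(failures) / total if total else 0.0

    if failures and ratio >= mass_failure_ratio:
        alert = "page"    # many failures at once: possibly a CA or network problem
    elif failures:
        alert = "ticket"  # isolated failures: follow up with the owning team
    else:
        alert = "ok"

    return {
        "total": total,
        "failure_ratio": ratio,
        "failed_services": sorted({a.service for a in failures}),
        "alert": alert,
    }


if __name__ == "__main__":
    sample = [RenewalAttempt(f"svc-{i}", succeeded=(i != 3)) for i in range(10)]
    print(summarize(sample))
```

The design choice worth noting is the split between per-service errors and an aggregate signal: a single failed renewal is a routine follow-up, while a sudden jump in the failure ratio points at shared infrastructure.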
These integrations affect different environments. What turns out to be the most difficult part?
The main difficulty is integration into real infrastructure.
The system has to work across different environments: virtual machines, Kubernetes, and various deployment scenarios. This requires taking into account network restrictions, firewalls, CI/CD specifics, and the way services behave when certificates are renewed.
Kubernetes is a separate and important area. Here, the architecture must allow certificates to be issued and renewed automatically inside clusters, without user involvement. This requires integration with the cluster infrastructure and an understanding of orchestration-specific behavior.
It is also important to design for failure scenarios: what happens during a large-scale outage, how the system reacts, and how quickly normal operation can be restored.
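One common technique for the recovery side of such failure scenarios is retrying with exponential backoff and random jitter, so that thousands of services do not all hit the certificate authority at the same moment once it comes back. The interview does not say which retry strategy was used; the function below is a generic sketch, and its name and default values are assumptions.

```python
import random


def backoff_delay(attempt, base=2.0, cap=300.0, rng=random):
    """Seconds to wait before retry number `attempt` (0-based), with full jitter.

    The upper bound doubles with every attempt and is capped at `cap`;
    the actual delay is drawn uniformly between 0 and that bound.
    """
    ceiling = min(cap, base * (2 ** attempt))
    return rng.uniform(0.0, ceiling)


if __name__ == "__main__":
    # Print a sample retry schedule; values differ on every run because of jitter.
    for attempt in range(6):
        print(f"attempt {attempt}: wait {backoff_delay(attempt):.1f}s")
```

The cap keeps the worst-case wait bounded, and the jitter spreads the retries of many clients across the whole interval instead of synchronizing them.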
Open-source CAs are often used in such projects. Why is this approach in demand?
Open source makes it possible to adapt the system to a specific infrastructure.
In these projects, a CA system is rarely used exactly as it comes out of the box. It is configured, integrated with internal services, and extended with custom workflows.
For example, teams may need to customize the certificate issuance process, configure communication with other infrastructure components, or integrate the CA with automation tools.
At the same time, such a system requires proper support: testing, monitoring, and ongoing improvements. In practice, it becomes an internal infrastructure product.
How did moving approximately 2,000 TLS certificates to an automated model affect engineering teams?
The main change was the transition from manual operations to automated processes.
Previously, certificate management could be centralized: teams submitted requests, waited for them to be processed, and then received the result. In the new model, certificate management is embedded into services. Teams work with certificates directly through dedicated tools.
This reduces the number of manual actions, lowers the workload on the central team, and speeds up basic operations.
In addition, the system starts automatically tracking certificate expiration dates and initiating renewals. This helps reduce the number of incidents caused by expired certificates.
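The tracking logic described here can be sketched in a few lines. The example below flags a certificate for renewal once its remaining lifetime drops below a fraction of its total lifetime. The one-third threshold, the function names, and the inventory layout are assumptions for illustration; the interview does not specify the actual renewal policy.

```python
from datetime import datetime, timedelta, timezone


def needs_renewal(not_before, not_after, now, renew_fraction=1 / 3):
    """True once the remaining lifetime is at most renew_fraction of the total."""
    lifetime = not_after - not_before
    remaining = not_after - now
    return remaining <= lifetime * renew_fraction


def due_for_renewal(inventory, now):
    """Return the names of services whose certificates should be renewed."""
    return sorted(
        name
        for name, (not_before, not_after) in inventory.items()
        if needs_renewal(not_before, not_after, now)
    )


if __name__ == "__main__":
    issued = datetime(2025, 1, 1, tzinfo=timezone.utc)
    # Hypothetical inventory: service name -> (not_before, not_after).
    inventory = {
        "paas-api": (issued, issued + timedelta(days=90)),
        "data-pipeline": (issued, issued + timedelta(days=365)),
    }
    now = issued + timedelta(days=70)
    print(due_for_renewal(inventory, now))
```

Using a fraction of the lifetime rather than a fixed number of days means short-lived and long-lived certificates are handled by the same rule, and an already expired certificate is always flagged.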
From an efficiency perspective, where is the impact most visible?
The strongest impact is in reducing operational workload and risk.
Automation removes a significant share of repetitive actions: issuing certificates, renewing them, and tracking expiration dates. This reduces exposure to human error.
It also reduces the number of incidents related to expired certificates. In infrastructures with many services, this is critical because such incidents can affect multiple systems at once.
Another important effect is that teams move faster. When certificate management is embedded into development workflows, basic operations are completed more quickly and do not require additional approvals.
What would you recommend to companies facing a growing number of certificates?
First, it is important to define the operating model.
Companies need to determine who is responsible for certificates, which processes remain centralized, which processes are delegated to teams, and which operations should be automated first.
The next step is architecture: the certificate authority, automation tools, infrastructure integration, and observability.
Practice shows that automation alone does not solve the problem unless the operating model changes as well. Automation starts working effectively only when it is embedded into real processes and supported by a clear system of responsibility.