5 Tips On How to Deal With Common Problems When Running Large Production Environments

Working as a platform operator with cloud-native technologies, L2 technical support and participating in CF installations give a unique opportunity to observe how different companies implement new technologies in their works. Among various bad experiences, imperfect ideas, and the most reprehensible habits related to running and maintaining cloud infrastructures those listed below can generate the most complicated problems.

Bad practices often occur when it comes to productive CF infrastructures. However, these guidelines should help everyone who runs or uses any of the production-ready workloads.

Neglected capacity planning

Let’s start with this: you have to be aware that you will run out of resources eventually. Then you should plan how to scale up. If you run on-premises software, you should consider hardware and virtualization layer’s requirements. Proper sizing of the availability zones will always save you many problems.

On top of IaaS there is always a PaaS or some container orchestrator. The key to success here is to optimize all the limits, quotas and other configurations (like application scaling rules, etc.) so the microservices never consume available resources, even under high load.

It’s obvious that both hardware and virtualized capacity planning requires a buffer. You need to be prepared for issues, maintenance and infrastructure changes. There is no best configuration. It always depends on many factors but nevertheless, it is always worth taking into consideration.

Capacity and resources have to be monitored. A good monitoring tool with decent alerting rules will help you predict possible problems and react quickly if anything bad happens to your infrastructure.

Poor or no CI/CD

If you want to maintain any piece of software, don’t forget how valuable is automation. Many times people quit on CI/CD implementation because of the deadline or tasks formally more important. In most cases, it doesn't end up well.

It's hard to build, test and deploy software without automation. The manual process is highly exposed to the risk of human error. Apart from that, it is almost impossible to keep track of deployed software (version, updates, hotfixes, security patches, etc.) in large production environments. Sometimes you have to maintain CF platforms hosting 1K+ applications. Consider how problematic would be the migration process if there is a business decision to switch to a different solution.

For operators maintaining the infrastructures, platforms, and services used by developers it’s critical to keep everything up to date, take care of security patches and configuration changes. It is impossible to handle this manually with minimal or zero downtime of the services. That is why automated pipelines are so important, and you should never give up on implementing them in the first place.

Poor or no backup/restore procedures

Backup/restore is another important process that people often put in the background. You may think that your applications are safe if your IaaS offers you a highly available environment or containers you run have an auto-healing function. This is not true. Any disaster can happen, and in order to recover quickly, you have to create a well-defined backup and restore procedures that work. That’s not all, as the procedures have to be tested periodically. You need to be sure that backup/restore works fine since the process may depend on some external services that might have changed or just brake.

No periodic updates

Every software has to be updated regularly in order to keep it secure. It is also much safer to perform minor updates with a little chance of failure or downtime than doing ‘big jumps’. Major updates introduce higher risk, and it is hard to catch up with versions especially if there is no automation implemented.

You may see cloud infrastructures that were just installed and never upgraded and that generates a lot of issues for platform operators (users can’t see any difference). It is not a problem until everything works correctly. But after some time people may start escalating issues related to the versioning of the services. Unfortunately, it is too late to upgrade smoothly. It becomes a big spider’s web of dependencies. It may take weeks to plan the upgrade process and months to execute it.

Flawed architecture

Defective architecture generates serious problems. Many times developers are not aware of the issue until it shows up in production. After that, it’s really hard to admit the architecture needs to be changed and people often try to get rid of the effect instead of fixing the cause of the problem.

Let’s take a real-life example often faced. You may be receiving Prometheus alerts saying that ELK stack is overloaded. After investigating the issue, it may turn out that microservices are so verbose that they generate thousands of log messages per second. What if you raise the possible architecture problem, but nobody cares? As a result, you’ll have to scale ELK. In those cases, it may waste hundreds of CPUs and terabytes of memory and storage. That makes somebody spend money just to store 90% of useless data and maybe 10% of valuable information. This is really a simple way to put yourself in a situation without a way out.

Conclusion

Following these guidelines will definitely not be easy. Sometimes people responsible for taking decisions are just not aware of the consequences of taking some actions. Role of every technically skilled person in the project is to spread the knowledge and make people aware of what may happen if they ignore those basic rules that matter. You can’t step back if you encounter such practices in the future. Be an example for others and drive the change - it’s always worth trying.

Share the story

Related