Steps to Successful Application Troubleshooting in a Distributed Cloud Environment

At Grape Up, when we execute a digital transformation, we need to take care of a lot of things. First of all, we need to pick a proper IaaS that meets our needs, such as AWS or GCP. Then, we need to choose a suitable platform that will run on top of this infrastructure. In our case it is either Cloud Foundry or Kubernetes. Next, we need to automate this whole setup and provide an easy way to reconfigure it in the future. Once we have the cloud infrastructure ready, we should plan how and what kind of applications we want to migrate to the new environment. This step requires analyzing the current state of the application portfolio and answering the following:

  • What is the technology stack?
  • Which apps are critical for the business?
  • What kind of effort is required for replatforming a particular app?

Any components that are particularly troublesome or carry serious technical debt should be considered for modernization. This process is called “breaking the monolith”: we iteratively decompose the app into smaller parts, each of which can become a separate microservice. As a result, we end up with dozens of new or updated microservices running in the cloud.

So let’s assume that all the heavy lifting has been done. We have our new production-ready cloud platform up and running, we replatformed and/or modernized the apps and we have everything automated with the CI/CD pipelines. From now on, everything works as expected, can be easily scaled and the system is both highly available and resilient.

Unfortunately, quite often and soon enough we receive a report that some requests behave unusually in some scenarios. Of course, these kinds of problems are not unusual no matter what kind of infrastructure, frameworks or languages we use. Handling them is part of the standard maintenance and monitoring process that every computer system needs once it has been released to production.

Although cloud environments and cloud-native apps improve a lot of things, troubleshooting applications in the new infrastructure can be more complex than it was in the ‘old world’.

Therefore, I would like to show you a few techniques that will help you with troubleshooting microservices problems in a distributed cloud environment. To exemplify everything, I will use Cloud Foundry as our cloud-native platform and Java/Spring microservices deployed on it. Some tips might be more general and can be applied in different scenarios.

Check if your app is running properly

There are two basic commands in CF CLI (github.com/cloudfoundry/cli) to check if your app is running:

  • `cf apps` – this will list all applications deployed to the current space with their state and the number of instances currently running. Find your app and check if its state says “started”.
  • `cf app <app_name>` – this command is similar to the one above, but it shows more detailed information about a particular app. If the app is running, you can also check its current CPU usage, memory usage and disk utilization.

This step should be first since it’s the fastest way to check if the application is running on the Cloud Foundry Platform.

Check logs & events

If your app is running, you can check its lifecycle events with:

`cf events <app_name>`

This will help you diagnose what has been happening with the app. Cloud Foundry could have been reporting some errors before the app finally started, which might be a sign of a potential issue. Another example is when the events show that the app is being restarted repeatedly. This could indicate a shortage of memory, which in turn causes the Cloud Foundry platform to destroy the app container.

Events give you just a broad view of what has happened with the app, but if you want more details you need to check your logs. Cloud Foundry helps a lot with handling your logs. There are three ways to check them:

  • `cf logs <app_name> --recent` - dumps only recent logs. It will output them to your console so you can use Linux commands to filter them.
  • `cf logs <app_name>` - returns a real-time stream of the application logs.
  • Configure a syslog drain, which will stream logs to your external log management tool (e.g. Papertrail) - https://docs.cloudfoundry.org/devguide/services/log-management.html.

This method is only as good as the maturity and consistency of your logs, but the Cloud Foundry platform helps here too by adding some standardization. Each log line contains the following information:

  • Timestamp
  • Log type – the CF component that originated the log line
  • Channel – either OUT (logs emitted on stdout) or ERR (logs emitted on stderr)
  • Message
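Because every line carries these four fields, filtering can be automated. The following is a minimal sketch, assuming a line layout of `<timestamp> [<log type>] <OUT|ERR> <message>`; the `CfLogLine` class is a hypothetical helper and not part of the CF CLI, so adjust the pattern to what your platform version actually prints.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper (not part of the CF CLI): splits one log line into its four fields. */
final class CfLogLine {
    // Assumed layout: "<timestamp> [<log type>] <OUT|ERR> <message>"
    private static final Pattern LINE =
            Pattern.compile("^\\s*(\\S+)\\s+\\[([^\\]]+)\\]\\s+(OUT|ERR)\\s?(.*)$");

    final String timestamp;
    final String logType;
    final String channel;
    final String message;

    private CfLogLine(String timestamp, String logType, String channel, String message) {
        this.timestamp = timestamp;
        this.logType = logType;
        this.channel = channel;
        this.message = message;
    }

    /** Returns the parsed line, or an empty Optional if the line does not match the assumed layout. */
    static Optional<CfLogLine> parse(String line) {
        Matcher m = LINE.matcher(line);
        if (!m.matches()) {
            return Optional.empty();
        }
        return Optional.of(new CfLogLine(m.group(1), m.group(2), m.group(3), m.group(4)));
    }

    /** True when the line was emitted on stderr. */
    boolean isError() {
        return "ERR".equals(channel);
    }
}
```

Piping the output of `cf logs <app_name> --recent` through such a parser lets you keep, for example, only the ERR lines of a single log type.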

Check your configuration

If you have investigated your logs and found out that the connection to some external service is failing, you must check the configuration that your app uses in its cloud environment. There are a few places you should look into:

  • Examine your environment variables with the `cf env <app_name>` command. This will list all environment variables (container variables) and the details of each bound service.
  • `cf ssh <app_name> -i 0` enables you to SSH into the container hosting your app. With the `-i` parameter you can point to a particular instance. You can then inspect the files you are interested in to see if the configuration is set up properly.
  • If you use any configuration server (like Spring Cloud Config), check if the connection to this server works. Make sure that the spring profiles are set up correctly and double-check the content of your configuration files.
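Some of these checks can also run inside the app at startup, so that a missing setting shows up in the logs right away. The sketch below is an illustration only: `ConfigCheck` is a hypothetical class, and the list of required variable names is something you would define for your own app.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical startup check; the variable names passed in are up to the application. */
final class ConfigCheck {
    /** Returns the required variable names that are absent or blank in the given environment. */
    static List<String> missing(Map<String, String> env, String... required) {
        List<String> missing = new ArrayList<>();
        for (String name : required) {
            String value = env.get(name);
            // Only names are reported, never values, because they may hold credentials.
            if (value == null || value.trim().isEmpty()) {
                missing.add(name);
            }
        }
        return missing;
    }
}
```

A call such as `ConfigCheck.missing(System.getenv(), "VCAP_SERVICES", "SPRING_PROFILES_ACTIVE")` could be logged once at startup; an empty result means all expected variables are present.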

Diagnose network traffic

There are cases in which your application runs properly, the entire configuration is correct, you don’t see anything extraordinary in your events and your logs don’t really show anything. Possible reasons are:

  • You don’t log enough information in your app
  • There is a network-related issue
  • Request processing is blocked at some point in your web server

With the first one, you can’t really do much if your app is already in production. You can only prevent such situations in the future by putting more effort into implementing proper logging. To check whether the second one applies to you:

  • SSH into the container or VM hosting your app and use the Linux `tcpdump` command. Tcpdump is a network packet analyzer which can help you check if the traffic on an expected port is flowing.
  • Using `netstat -l | grep <your_port>` you can check if there is a process that listens on your expected port. If there is, you can verify that it is the right one (e.g. the Tomcat server).
  • If your server listens on the proper port but you still don’t see the expected traffic with tcpdump, check firewalls, security groups and ACLs. You can use Linux netcat (the `nc` command) to verify if TCP connections can be established between the container hosting your app and the target server.
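If `nc` is not available in the container, the same connectivity test can be approximated from Java. This is a minimal sketch using only the JDK; `PortCheck` is a hypothetical class name, and the host, port and timeout are values you supply.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

/** Hypothetical helper: checks whether a TCP connection to host:port can be established. */
final class PortCheck {
    static boolean canConnect(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            // Fails with an IOException on refusal, timeout or an unresolvable host.
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            return false;
        }
    }
}
```

A `false` result only tells you that the TCP handshake did not complete within the timeout. It does not say whether a firewall, a security group or the target server itself is responsible, so it narrows the search rather than ending it.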

Print your thread stack

Your app is running and listening on the proper port, the TCP traffic is flowing correctly and your logging is well designed. But still there are no new logs for a particular request and you cannot tell where exactly your app stopped processing it.

In this scenario it might be useful to print the current thread stacks with a Java tool called jstack. It’s a very simple and handy tool recommended for diagnosing what is currently happening on your Java server.

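Once you have executed `jstack <pid>`, where `<pid>` is the process ID of the target JVM, you will see the stack traces of all Java threads that run within that JVM. On older JDKs, the `-F` option forces a dump when the process does not respond. This way you can check if some threads are blocked and at what execution point they’ve stopped.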
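When running jstack inside the container is not practical, a comparable dump can be produced from within the JVM. The sketch below is not jstack itself. It uses the standard `java.lang.management` API, and the `ThreadDump` class name is hypothetical.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

/** Hypothetical helper: renders name, state and stack trace of every live thread. */
final class ThreadDump {
    static String dump() {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        StringBuilder out = new StringBuilder();
        // false, false: skip lock ownership details to keep the dump cheap and portable.
        for (ThreadInfo info : threads.dumpAllThreads(false, false)) {
            out.append('"').append(info.getThreadName()).append("\" ")
               .append(info.getThreadState());
            if (info.getLockName() != null) {
                // Present when the thread is blocked or waiting on a monitor or lock.
                out.append(" on ").append(info.getLockName());
            }
            out.append(System.lineSeparator());
            for (StackTraceElement frame : info.getStackTrace()) {
                out.append("    at ").append(frame).append(System.lineSeparator());
            }
        }
        return out.toString();
    }
}
```

Threads reported as `BLOCKED`, or stuck in `WAITING` at the same frame across several dumps, are the first candidates to inspect.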

Implement /health endpoints in your apps

A good practice in microservice architecture is to implement a `/health` endpoint in each application. The responsibility of this endpoint is to return the application health information in a short and concise way. For example, you can return a list of the external services the app depends on, with a status for each one: UP or DOWN. If the status is DOWN, you can say what caused the error, for example ‘timeout when connecting to MySQL’.

From a security perspective, we can return only the global UP/DOWN status to unauthenticated users, which is enough to quickly determine if something is wrong. The list of all services with error details should be accessible only to authenticated users with the proper roles.
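The split between a public global status and authenticated details can be sketched without any framework. `HealthReport` below is a hypothetical, JDK-only class, not the Spring Boot Actuator API. It assumes the caller has already decided whether the user is authenticated.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Objects;

/** Hypothetical, framework-free model of a /health response. */
final class HealthReport {
    // Service name -> error description; a null value means the service is UP.
    private final Map<String, String> errors = new LinkedHashMap<>();

    HealthReport up(String service) {
        errors.put(service, null);
        return this;
    }

    HealthReport down(String service, String error) {
        errors.put(service, Objects.requireNonNull(error, "error"));
        return this;
    }

    /** DOWN as soon as any single service is DOWN, otherwise UP. */
    String globalStatus() {
        for (String error : errors.values()) {
            if (error != null) {
                return "DOWN";
            }
        }
        return "UP";
    }

    /** Unauthenticated callers see only the global status; authenticated callers also see details. */
    Map<String, String> view(boolean authenticated) {
        Map<String, String> view = new LinkedHashMap<>();
        view.put("status", globalStatus());
        if (authenticated) {
            errors.forEach((service, error) ->
                    view.put(service, error == null ? "UP" : "DOWN: " + error));
        }
        return view;
    }
}
```

For instance, `new HealthReport().up("config-server").down("mysql", "timeout when connecting to MySQL").view(false)` contains only the `status` entry, while `view(true)` also lists both services.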

In Spring Boot apps, adding the `spring-boot-starter-actuator` dependency gives you a ready-made `/health` endpoint (exposed as `/actuator/health` since Spring Boot 2). The default behavior is also easy to extend: all you need to do is write your own health indicator classes that implement the `HealthIndicator` interface.

Use distributed HTTP tracing systems

If your system is composed of dozens of microservices and the interactions between them are getting more complex, then you might come across difficulties without any distributed tracing system. Fortunately, there are open source tools that solve such problems.

You can choose from HTrace, Zipkin or the Spring Cloud Sleuth library, which abstracts many concepts of distributed tracing. All these tools are based on the same concept of adding trace information to HTTP headers.
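To make the header-based concept concrete, here is a minimal sketch of what such a library does on each outgoing call. It uses the B3 header names known from Zipkin. The `TraceHeaders` class is hypothetical, and real tracers handle more than this (sampling flags, case-insensitive header lookup, 128-bit trace IDs).

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ThreadLocalRandom;

/** Hypothetical sketch of trace propagation using B3-style HTTP headers. */
final class TraceHeaders {
    static final String TRACE_ID = "X-B3-TraceId";
    static final String SPAN_ID = "X-B3-SpanId";
    static final String PARENT_SPAN_ID = "X-B3-ParentSpanId";

    /** Builds the headers for an outgoing call from the headers of the incoming request. */
    static Map<String, String> forOutgoingCall(Map<String, String> incoming) {
        String traceId = incoming.get(TRACE_ID);
        String callerSpanId = incoming.get(SPAN_ID);

        Map<String, String> outgoing = new HashMap<>();
        // The trace ID stays the same across all services; start a new trace if none arrived.
        outgoing.put(TRACE_ID, traceId != null ? traceId : newId());
        // Every hop gets its own span ID.
        outgoing.put(SPAN_ID, newId());
        if (traceId != null && callerSpanId != null) {
            // Link the new span to the span that called us.
            outgoing.put(PARENT_SPAN_ID, callerSpanId);
        }
        return outgoing;
    }

    /** 64-bit random ID rendered as 16 lower-case hex characters. */
    static String newId() {
        return String.format("%016x", ThreadLocalRandom.current().nextLong());
    }
}
```

Because the trace ID is carried unchanged through every service, log lines and timing data from different microservices can later be grouped into one request flow.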

A big advantage of using Spring Cloud Sleuth is that it is almost invisible to most users. Your interactions with external systems should be instrumented automatically by the framework. Trace information can be logged to standard output or sent to a remote collector service, where you can visualize your requests better.

Think about integrating APM tools

APM stands for Application Performance Management. These tools are often external services that help you monitor the health of your whole system and diagnose potential problems. In most cases, you will need to integrate them with your applications. For example, you might need to run an agent in parallel with your app, which will report your app diagnostics to an external APM server in the background.

Additionally, you will have rich dashboards for visualizing your system’s state and its health. You have many ways to adjust and customize those dashboards according to your needs.

APM examples: New Relic, Dynatrace, AppDynamics.
These tools are must-haves for a highly available distributed environment.

Remote debugging in Cloud Foundry

Every developer is familiar with the concept of debugging, but most of the time we are talking about local development debugging, where you run the code on your machine. Sometimes, you receive a report that something doesn’t behave the way it should on one of your testing environments. Of course, you could deploy a particular version on your local environment, but it is hard to simulate all aspects of that environment. In this case, it might be best to debug the application in the place where it is actually running. To perform a remote debug procedure on Cloud Foundry, see the following:

Pivotal - How to Remotely Debug Java Applications on Cloud Foundry

Please note that you must have the same source code version opened in your IDE. This method is very useful for development or testing environments. However, it shouldn’t be used on production environments.

Summary

To sum up, I hope that all the above will help you with troubleshooting problems with microservices in a distributed cloud environment and that everything will indeed work as expected, will be easily scaled and the system will be both highly available and resilient.
