The industry has seen several great tools in the APM (Application Performance Monitoring) space, such as AppDynamics, New Relic, CA Introscope, and Dynatrace. These tools are great for monitoring an application’s performance characteristics. They can tell you whether your application is performing properly. If it is not, they can tell you the symptoms: CPU spiked by x%, memory degraded by y%, response time increased by z seconds. To a degree, they can also tell you what caused the degradation. However, in several cases you have to capture dumps from the application and analyze them manually to identify the root cause of the problem, i.e., which line of code caused the CPU to spike, which objects are triggering the memory leak, and so on.
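To make "capturing a dump" concrete, here is a minimal sketch of taking a thread dump from inside a running JVM with the standard `java.lang.management` API. The class name `ThreadDumpSketch`, the method name `captureThreadDump`, and the output layout are assumptions made for illustration; the text it produces is not the format that tools such as jstack emit, and it only covers one of the dump types discussed here.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

// Hypothetical helper for illustration only; not part of any tool named in this article.
public class ThreadDumpSketch {

    // Builds a plain-text snapshot of every live thread in the current JVM.
    public static String captureThreadDump() {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        boolean cpuSupported = threads.isThreadCpuTimeSupported();
        StringBuilder dump = new StringBuilder();

        // false, false: skip locked-monitor and locked-synchronizer details.
        for (ThreadInfo info : threads.dumpAllThreads(false, false)) {
            // getThreadCpuTime returns -1 if the thread has died or
            // CPU time measurement is disabled.
            long cpuNanos = cpuSupported
                    ? threads.getThreadCpuTime(info.getThreadId())
                    : -1L;

            dump.append('"').append(info.getThreadName()).append('"')
                .append(" state=").append(info.getThreadState())
                .append(" cpuMillis=")
                .append(cpuNanos < 0 ? "n/a" : String.valueOf(cpuNanos / 1_000_000L))
                .append('\n');

            // Each frame names the class, method, and (when available) line number.
            for (StackTraceElement frame : info.getStackTrace()) {
                dump.append("    at ").append(frame).append('\n');
            }
        }
        return dump.toString();
    }

    public static void main(String[] args) {
        System.out.println(captureThreadDump());
    }
}
```

A thread that shows a high `cpuMillis` value and the same top stack frames across several snapshots is a reasonable first suspect for a CPU spike, which is the kind of manual reading the paragraph above describes.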
Modern applications generate diverse types of dumps:
1. Garbage Collection logs
2. Thread dumps
3. Heap dumps
4. Core dumps
5. hs_err_pid files (JVM fatal error logs)
6. TCP/IP dumps
These dumps are typically in proprietary, undocumented formats, often binary, and tend to run from several MB to several GB, which makes a hard problem even harder. Thus, if you don’t have resident experts in your organization to diagnose these dumps, you will have to reach out to vendors to troubleshoot the problem. Your cloud vendor might say it’s not their problem; the JVM vendor might say it’s not their problem; the app server vendor might say it’s not their problem; the DB vendor might say it’s not their problem. While this blame game continues, your production environment remains unstable. You end up doing restarts (the universal stop-gap solution to most problems).
Even if you have identified the right vendor, going through their support team and reaching the right expert to troubleshoot might take considerable time (days or weeks). APM tools don’t process these dumps. This is where RCA (Root Cause Analysis) tools such as GCeasy, fastThread, HeapHero, Eclipse MAT, and Wireshark come into the picture. These tools analyze the cryptic dumps, apply known problem patterns, and isolate the root cause of the problem. Some of these tools even have the intelligence to forecast problems that are likely to occur in the application in the future.