In this post, we would like to share our real-world experience in optimizing a Java application which was controlling the robots in a warehouse. This application would give instructions to the robots in the warehouse on what actions to perform. Based on those instructions, robots carry out their job in the warehouse. Occasionally, this application was slowing down and not giving instructions to the robots. If robots don’t receive instructions from the application, they would start to make autonomous decisions causing degenerated behavior in them, which in turn was affecting the delivery and shipments in the warehouse.
Long Garbage Collection Pause
The best way to start troubleshooting the Java application’s performance is to study its Garbage Collection performance. This is even more true when the application suffers from a slowdown. We took this application’s Garbage Collection log file and uploaded it to the GCeasy tool. (Note: Garbage Collection log contains vital statistics that are not reported by most monitoring tools, and they add almost no overhead to your application. Thus, it’s a best practice to enable Garbage Collection log on all your production instances). The tool analyzed the Garbage Collection log and instantly generated insightful GC log analysis report
The tool reported various interesting metrics and graphs. The Garbage Collection Pause time graph in the report is of most interest to our discussion. Below is that graph:
Fig 1: Garbage Collection Pause Duration Graph generated by GCeasy
Whenever the Garbage Collection event runs, it pauses the entire application. During that pause time period, none of the customer transactions will be processed. All the transactions which are in flight will be halted. From the above graph you can notice that at 11:35am, during the peak traffic time, a Garbage Collection event was pausing the entire application for 329 seconds (i.e. 5 minutes and 29 seconds). It means during that entire 5+ minutes window all the robots wouldn’t have gotten any instructions from this application. They would have taken the decisions autonomously, causing disruption to the business.
What is causing the long GC pause?
There were two primary reasons causing such long Garbage Collection Pause:
1. Large Heap size. Application is configured to run with a 126GB heap size. Typically, when heap size is large, garbage collection time will also be longer. Because a lot of objects would have been accumulated and it would take a long time to be evicted from the memory.
2. CMS (Concurrent Mark & Sweep) algorithm. CMS GC algorithm runs well and is an apt fit for several applications, however it’s major drawback is its occasional long GC pause. CMS GC algorithm doesn’t always cause long GC Pauses, most of its GC pauses are in acceptable range, however occasionally it causes terribly long GC Pause due to heap fragmentation, which can last for several seconds (sometimes even minutes) like in this situation.
Potential Solutions to reduce Long GC Pause
Here is a blog post which highlights potential solutions to reduce long Garbage Collection pause. We were contemplating a couple of solutions to address this long GC pause.
1. Reducing Heap Size
Reducing the application’s heap size is a potential solution. However, the object creation rate of this monolith application was very high and reducing the heap size has potential to affect the application’s responsiveness. Heap size can be reduced only if the application’s memory consumption can be reduced, which warrants the refactoring of the application code. Already this monolith application’s re-architecture was underway. This monolith application was getting broken down and re-written as microservices with much lesser heap size. However, this application re-architecture was slated to go live in 6 – 9 months later. Thus, the customer was hesitant to reduce the heap size until then.
2. Switching from CMS to G1 GC algorithm
The other solution was to migrate away from the CMS GC algorithm. Irrespective of this GC performance, this CMS GC algorithm has been removed from Java 14. If we want to move away from the CMS GC algorithm, what are the alternatives we have? Below are the alternate GC algorithms that are available in OpenJDK:
1.Serial GC
2.Parallel GC
3.G1 GC
4.ZGC
5.Shenandoah GC
Serial GC algorithm is useful only for single threaded, desktop type of application. Since this application has multiple concurrent threads with a very heavy object creation rate, we eliminated the Serial GC algorithm. Since this application was running on Java 8, we ruled out ZGC and Shenandoah GC because they are all stable only from Java 17+. Thus, we were left with the choice of either using Parallel GC or G1 GC.
We simulated production traffic volume in the performance lab and experimented with Parallel GC and G1 GC algorithm settings based on the best GC tuning best practices. We found out that Parallel GC pause time was not as bad as CMS GC, but it was not better than G1 GC. Thus, we ended up switching from CMS GC algorithm to G1 GC algorithm. Here is the GC log analysis report of this robotics application in the performance lab when using G1 GC algorithm. Below is the GC Pause duration graph when using G1 GC algorithm:
Fig 2: G1 GC pause time graph generated by GCeasy
From the graph you can notice that the maximum GC pause time was 2.17 seconds. This is a phenomenal improvement from 5 minutes and 29 seconds. Also, the average GC pause time was only 198ms, way better than the CMS GC algorithm for this application.
Conclusion
After switching to the G1 GC algorithm, the application’s random slowdowns completely stopped. Thus, without major architectural changes, without code refactoring, without any JDK/infrastructural upgrades, just by tweaking the GC arguments in the JVM, we were able to bring this significant optimization to this robotics application’s performance.