Performance tuning for webMethods Integration Server

Having spent a lot of time with performance tuning on webMethods Integration Server, I wanted to share some thoughts here. They will equip you with a starting point and the flexibility to go wherever your project takes you.

The webMethods Integration Server is a feature-rich and flexible platform. This versatility means that literally every system out there is unique in many ways, which can make diagnosing performance issues seem challenging. But once you know a few fundamental aspects, it is really not that hard.

In this article I share the things I have found to be fairly common when it comes to performance. I learned them over the past 20+ years, having been involved in some of the most demanding installations of Integration Server.

The key takeaways are:

  • The biggest lever to improve overall performance is your application logic.
  • Finding the bottleneck is the hard part, not fixing it. Often the bottleneck is one of the connected systems.
  • Infrastructure is important, so you need to know a lot about hardware, the OS, and the JVM.
  • Once you understand the basics, you can master pretty much every situation.

I should also point out that nothing in this text is truly specific to Integration Server. It just so happens that I have learned these things in that particular context. But you can apply them to a lot of other situations as well.

Key characteristics of performance

Before looking into specifics, we need to be clear about the basics. There are two main properties of a system when it comes to performance: throughput and latency.

Throughput

Throughput is how many transactions are completed in a given timeframe. When you are looking at how much speed you need, the timeframe should ideally be as short as possible; that will make your sizing much more tangible. Example: Saying “I need to process 5 million transactions per day” leaves a lot of room for interpretation, because it implicitly assumes an even distribution over the day, which is hardly ever the case.

Without further details it would be acceptable to design a system that can process 58 transactions per second. The day has 86,400 seconds (24 hours × 60 minutes × 60 seconds), and dividing 5 million by 86,400 gives a bit less than 58 transactions per second. But what if your transactions are all crammed into a window of 12 hours? You can of course add a mechanism to queue things up and work on them as quickly as possible.
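
To make the arithmetic above a bit more tangible, here is a minimal Java sketch (class and method names are purely illustrative) showing how the required rate jumps once the same daily volume is squeezed into a shorter processing window:

```java
public class ThroughputSizing {

    /** Transactions per second needed to handle dailyVolume within activeHours. */
    static double requiredTps(long dailyVolume, double activeHours) {
        double activeSeconds = activeHours * 60 * 60;
        return dailyVolume / activeSeconds;
    }

    public static void main(String[] args) {
        long dailyVolume = 5_000_000L;

        // Evenly spread over 24 hours: just under 58 transactions per second.
        System.out.printf("24h window: %.1f TPS%n", requiredTps(dailyVolume, 24));

        // Crammed into a 12-hour window: the required rate doubles to roughly 116.
        System.out.printf("12h window: %.1f TPS%n", requiredTps(dailyVolume, 12));
    }
}
```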

However, that comes with a number of issues. The biggest one is obviously that your customers must be ok with it. If we are talking about sending out monthly invoices for mobile phones, that is perhaps acceptable (leaving out aspects like the opportunity cost of lost interest). But what if we are talking about online shopping? The customer usually wants to have a confirmation email in less than a minute. So we are basically limited to asynchronous and not time-critical workloads here.

The other problem is that “queuing things up” is far more complex than many people realize. Not only do I have to add code to handle the queues under normal circumstances; I also need to think about what happens in case of an outage. Do I need to mirror all unprocessed transactions to multiple backup data centers (perhaps on another continent)? Things need to be persisted in a secure way, be taken into account for disaster recovery procedures, etc. And we haven’t even talked about in-sequence processing and other interesting challenges. (Big secret: You will need some buffering anyway, to cater for maintenance windows and for various other reasons.)

Latency

The second aspect of performance is latency. It describes how quickly a single transaction is completed. Think of it as basically responsiveness. So you need to make sure that your system is powerful enough to complete a request within a given timeframe. It is important to understand that you always need to look at this time together with the number of concurrent transactions.

Let’s assume that you have an SLA to complete a transaction in 1 second, and that a single such transaction uses 1% of the underlying system’s capacity. So in a very simple world we could meet our SLA for up to 100 concurrent transactions. If you run a retail website and traffic (i.e. the number of concurrent transactions) doubles, you need to double the resources just to keep your latency. (You have doubled your throughput, though.)
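
A handy rule of thumb for this relationship is Little’s Law: the number of in-flight transactions is roughly the arrival rate multiplied by the latency. The small Java sketch below (names are purely illustrative) restates the example above in those terms:

```java
public class LittlesLaw {

    /** Average number of in-flight transactions = arrival rate * average latency. */
    static double inFlight(double arrivalsPerSecond, double latencySeconds) {
        return arrivalsPerSecond * latencySeconds;
    }

    public static void main(String[] args) {
        // 100 requests per second, each completed in 1 second -> ~100 transactions in flight.
        System.out.println("Base load: " + inFlight(100, 1.0) + " concurrent transactions");

        // Traffic doubles while latency must stay at 1 second -> ~200 in flight,
        // i.e. roughly twice the capacity is needed to keep the SLA.
        System.out.println("Doubled load: " + inFlight(200, 1.0) + " concurrent transactions");
    }
}
```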

Of course reality is much more complex. It is immensely difficult to design a system that scales in a more or less linear fashion. In fact, that is one of the reasons why mainframes are still so popular: they addressed these issues a long time ago, but it comes with a hefty price tag.

When looking at challenges like this, you must be able to understand all the layers involved. That particularly includes hardware, and networking is a good example. At 400 Gbps the latency of Ethernet is a much bigger issue than with your standard PC connection. So perhaps we need InfiniBand instead?

All this is barely scratching the surface. So please don’t think that you are done once the things mentioned here have been dealt with. They are purely meant to make you aware of the depth and breadth of the topic. There is also a very strong link to the business requirements. Some things are relatively generic, but overall you always need a tailored approach.

Where is my bottleneck?

It may surprise you to learn that the biggest challenge with performance tuning is not removing the limiting factor. It is to find the bottleneck in the first place. An analogy I like to use in this context is the diagnosis of a medical condition. If someone is always tired, the root cause can range from breathing problems during sleep, through lack of vitamin D (insert 25 other things here), up to cancer. The symptom is tiredness, but the root cause is something else.

It is exactly the same with performance. Slow processing is the symptom, but the root cause can be anything from the source system not being able to deliver input fast enough (actually a rather common issue) up to a hardware failure in your network cabling that causes packet loss and time-consuming re-transmission of data.

I am sort of coming back to the point I made before about the need to understand all layers of the system. Ideally you have someone on your team who has that kind of knowledge. If not, and that will be the norm, people should be good at thinking in layers. Like “the database is slow, so I need to talk with the DBA about the execution plan first, and in the worst case go down the rabbit hole and ask the SAN guys about IOPS”.

With the approach of identifying the true bottleneck comes one interesting finding: the single biggest factor for performance is almost always your code. At the same time a lot of people seem to be fixated on JVM tuning (esp. garbage collection) and ask for the “best universal setting”. Unfortunately, the latter does not exist. So instead of looking for a silver bullet, you need to understand where in your application time is being spent. With clever reorganization you can speed things up, sometimes by orders of magnitude.
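
Before reaching for a full-blown profiler, you can often get surprisingly far with crude timing around suspected hot spots. A minimal sketch, where processOrders is just a hypothetical stand-in for the code section under suspicion:

```java
public class TimingExample {

    public static void main(String[] args) {
        long start = System.nanoTime();

        processOrders();   // hypothetical stand-in for the suspected hot spot

        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        System.out.println("processOrders took " + elapsedMs + " ms");
    }

    private static void processOrders() {
        // placeholder for the real application logic
    }
}
```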

Next comes the application and Integration Server configuration. Some aspects here can have a substantial impact. The usual suspects are the level of concurrency for triggers and database connections. Also, the number of threads for Integration Server needs to be looked at: set it too high and performance goes down dramatically.

JVM tuning should be your last step under normal circumstances. It is a very individual and time-consuming exercise that typically gives you a performance gain of only around 10-30%. That is not negligible and should be done, but only after you have realized the 1,000-10,000+% gains from improving your application logic and configuration.

Less is more

If a system cannot deliver the desired throughput, many people try to address this by increasing parallelism. This usually works up to a point, beyond which performance goes down again. The reason is competition for a scarce resource, often showing itself through locking. There always comes a point when a resource can only be accessed by a single “party” (usually a thread). So as long as one thread is using it, all others have to wait. Classic examples are rows in a relational database or in-memory data structures with synchronized access. (Since both cases are about locking to achieve consistency, you can sometimes relax the consistency requirements, along the lines of the trade-offs described by the CAP theorem, but only up to a point.)

So the best approach, in theory, is to design a system that does not need this context switching and locking, because the switching itself consumes additional resources, up to the point where the system is literally doing no real work anymore but just shuffling threads around (sometimes called over-threading). So you should aim for a design that is so fast that it meets its performance requirements with sequential processing. This is not easy, but if you can pull it off, the results will be amazing.
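
If you want to see the effect of contention for yourself, a small and deliberately artificial experiment like the following sketch usually shows that adding more threads around a single lock does not add throughput, it mostly adds waiting:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ContentionDemo {

    private static final Object LOCK = new Object();
    private static long counter = 0;

    public static void main(String[] args) throws InterruptedException {
        final int totalWork = 8_000_000;   // same amount of work for every run

        for (int threads : new int[] {1, 4, 32}) {
            counter = 0;
            final int perThread = totalWork / threads;
            ExecutorService pool = Executors.newFixedThreadPool(threads);
            long start = System.nanoTime();

            for (int t = 0; t < threads; t++) {
                pool.submit(() -> {
                    for (int i = 0; i < perThread; i++) {
                        synchronized (LOCK) {   // the single resource everybody queues for
                            counter++;
                        }
                    }
                });
            }
            pool.shutdown();
            pool.awaitTermination(5, TimeUnit.MINUTES);

            long ms = (System.nanoTime() - start) / 1_000_000;
            System.out.println(threads + " threads: " + ms + " ms");
        }
    }
}
```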

The most extreme instance of this kind of optimization is working at the level of the CPU caches, i.e. avoiding cache misses and eviction. If the CPU can get the required data from one of its internal caches (ideally L1), that will be orders of magnitude faster than going out to RAM, let alone disk (even the fastest SSD). So if you really need top performance (e.g. for high-frequency trading), this is the level you need to work at. For many other scenarios the added work does not pay off, though. So we can look at much simpler things. Here are a few examples:

  • For static or semi-static values you should probably use some kind of caching. Instead of retrieving the same value from the database millions of times per day, just do so once and keep it in memory locally (see the small sketch after this list).
  • If you have something like a web site with daily special offers, you can pre-compute the static parts of it and serve them via CDN. It will save you loads of resources and make your customers happier at the same time.
  • The cheapest operation is the one not done at all. If you have functionality where the success as well as the failure of the operation are logged, do you still need to log that the operation is finished? Or can that information be inferred from success or failure?
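
As an illustration of the first point, here is a minimal sketch of such a local cache with a time-to-live, using only the JDK; loadFromDatabase is a hypothetical placeholder for whatever expensive call you want to avoid repeating:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class SimpleTtlCache {

    // Java 16+ record holding the cached value and when it was loaded.
    private record Entry(String value, long loadedAtMillis) {}

    private final Map<String, Entry> cache = new ConcurrentHashMap<>();
    private final long ttlMillis;

    public SimpleTtlCache(long ttlMillis) {
        this.ttlMillis = ttlMillis;
    }

    public String get(String key) {
        Entry entry = cache.get(key);
        if (entry == null || System.currentTimeMillis() - entry.loadedAtMillis() > ttlMillis) {
            // Cache miss or stale entry: go to the expensive source once, then remember it.
            String fresh = loadFromDatabase(key);
            cache.put(key, new Entry(fresh, System.currentTimeMillis()));
            return fresh;
        }
        return entry.value();
    }

    private String loadFromDatabase(String key) {
        // hypothetical placeholder for the real (slow) lookup
        return "value-for-" + key;
    }
}
```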

The common theme is to understand how long each piece of your code takes to execute. If it takes long enough to warrant more of your time, look at how to make it faster.

Impact of infrastructure

It is hard to overstate the importance of infrastructure for performance. If you do not have expertise here, you are in for a bumpy ride. As an example, I have come across many people who thought that a SAN is by definition faster than local storage. That is simply wrong, because the difference at this level is only the connection (Fibre Channel/iSCSI vs. NVMe/SAS/SATA). What also matters is the underlying storage in terms of bandwidth and IOPS, as well as how much of its capacity is being used by other clients.

A more subtle aspect is how infrastructure determines the design on the software side. If you have a mainframe with practically unlimited I/O capacity, you can take a different approach compared to a highly distributed system with relatively high network latency. While that is still rather obvious, how about the difference between the development and production environments? And how does the behaviour change if you process more than 20 parallel threads, which is the most you can do on your laptop? What if the data volume increases by a factor of 10 million? Does your daily backup still complete in less than 24 hours? (Seriously, this can be an issue.)

The good thing is that today we have such powerful hardware, so we can often be a bit more relaxed than during the mid-2000s, when there were no SSDs. Having 300-600 IOPS from a mechanical disk in those days, compared to the hundreds of thousands we get from even cheap consumer SSDs today, has really changed a lot of things. Yes, some of that has been offset by the dramatic increase in data volume, but it is still much easier. Example: Back in 2002 it was usually quite difficult, on typical hardware, to run a single relational database instance with 100 GB of data in it. So we had to come up with things like sharding, storing data in regular files and linking to them from the DB, and various other ideas.

While the thresholds have moved a lot over the last 20 years, the core challenge remains the same: You need to understand the limits and how to work around them, if necessary. Depending on the expected growth of workload this thought process can be rather remote or already pretty specific right from the start. Even if you are absolutely sure that you don’t need scalability above a certain level, be very careful not to architect-in hard limits early on. The world is full of examples where that calculation went badly wrong.

Real-world stories

To illustrate how sometimes things don’t work out as planned, I want to share a few real-world stories I have come across during my time in IT. Don’t be tempted to think “those folks were so stupid, this could never have happened to me”. In all the examples I share, I know that the people involved were highly qualified. The combination of how many decisions you have to make in a very short time, how accurate the information given to you is, and how well predictions come true, is always going to “win”.

Production is too fast

This is not a typo. A production system can actually be too fast, at least if you have made assumptions about the processing speed that affect your implementation. In this case we are talking about a system that created a lot of PDF files as part of an output management solution, like hundreds of thousands of files per day. And of course those files needed unique file names. The developer (a very senior person) made the decision to use a timestamp for this, with a resolution of seconds. You probably already know where this is going.

On the relatively slow development laptop this was no problem, because the process always took several seconds. But in production there was a really beefy server, so multiple files per second were the norm. Nobody noticed until customer complaints started to pour in. The fix was easy, but everybody felt rather embarrassed. The post-mortem showed that somehow nobody (incl. business and IT staff from the customer) had realized the potential issue.
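
One common way out of this trap, sketched below with purely illustrative names, is to stop relying on wall-clock resolution alone and combine the timestamp with something that is guaranteed to be unique per file, such as a counter or a UUID:

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.concurrent.atomic.AtomicLong;

public class FileNameGenerator {

    private static final DateTimeFormatter TS =
            DateTimeFormatter.ofPattern("yyyyMMdd-HHmmss");

    // In-memory sequence; a real system might persist it or use a UUID
    // so that uniqueness survives restarts.
    private static final AtomicLong SEQUENCE = new AtomicLong();

    /** Timestamp keeps the names readable; the sequence number keeps them unique. */
    public static String nextFileName() {
        return "invoice-" + LocalDateTime.now().format(TS)
                + "-" + SEQUENCE.incrementAndGet() + ".pdf";
    }

    public static void main(String[] args) {
        // Even when called many times within the same second, the names differ.
        for (int i = 0; i < 3; i++) {
            System.out.println(nextFileName());
        }
    }
}
```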

How big is your machine?

I was once on a performance proof of concept (POC) where the potential customer expressed concern about the speed we had demonstrated. Having done fairly similar exercises a couple of times before, I was surprised, because the numbers were actually pretty good. So I asked why they were unhappy. They told us: “We had expected more, since we have given you a machine with 4 CPUs.” That made me curious and I asked for a bit more detail, knowing that x86 machines with 4 CPUs are a relatively rare species.

Funnily enough I knew the machine they named quite well, because I had configured a few of them for an earlier project. It was a 1U server with 2 CPU sockets and was mostly ordered with only one of them populated. So I asked how they had come to the conclusion that it was a machine with 4 CPUs. I was stunned by the response: “Well, I opened the Task Manager in Windows and on the performance tab there were 4 bars. Therefore I concluded the machine has 4 CPUs.” In reality it was a single CPU with 2 cores and Hyper-Threading turned on.

Once that misunderstanding had been cleared up, we were out of the woods. This really taught me to be careful with accepting “truths” that I don’t fully understand.

Sequential or parallel?

This is one of my favorite stories, so I saved it as the closing highlight. It is about a project I was on many years ago. The task was to replace a complex system that transferred large lists (millions of line items) as flat files between globally distributed systems. The size of those lists and the parallel processing were among the big challenges of the project.

We were already a couple of weeks into the project when, over lunch, I asked the customer for the reasons they were switching the platform. After all, the main addition we provided was a sophisticated mechanism to handle different job priorities. That was far from trivial, but on the other hand it didn’t seem to warrant a project of this size.

It turned out that the existing solution was designed in such a way that at one crucial point the execution was sequential. A bit like when the passport control at an airport has only one counter open. Up to that point everything is fine. People can come to the control point from multiple planes at the same time. But then they are forced to queue. This is exactly what happened with the existing system. In addition, a strict FIFO (first in, first out) handling had been implemented. So regardless of priority, files were processed in the order of their arrival.

For our border control example the equivalent would be that the first 200 people in the queue have arrived at their final destination, while positions 201-300 need to hurry for a connecting flight. And just like missing your flight is bad, so was the delayed processing of critical lists. In some cases the delay was so long that the receiving parties had to wait until the next business day to get the data they needed. I didn’t ask any further questions. My guess was that somehow the different priorities had been missed as a requirement for the design of the old system.

So what is this story’s relationship to performance? Basically that performance is always relative to the point where it is measured. Even if the message (or file) processing system is running on the fastest available hardware and is optimized like hell, it can still miss the requirements. If I am interested in end-to-end performance, it doesn’t help me much that my file is processed in 5 seconds when I have to wait 4 hours or more for that processing to start in the first place.

In closing

You probably guessed from the length of this article that I find performance a fascinating topic. As I hopefully brought across, it covers a huge variety of aspects, and bringing all of those together is sometimes challenging. Interestingly, I have not met many people who seemed to be interested in performance and performance tuning. So if you are, this is your chance to shine, since there is not much competition out there.

Although this is one of my longer articles, there is a lot more to cover. Here are some additional aspects:

  • For performance testing you need to be able to inject the appropriate load in a controlled way. This is rather complicated and not only on the technical level. You also need to be careful with compliance and other legal issues, when it comes to customer data for testing. If you need help here, let me know and I can connect you with some folks.
  • I had briefly touched upon maintenance windows. That is a topic of its own and should be high on the list of requirements when it comes to application design.
  • Containers and orchestration platforms are on the rise, and the same is true for Cloud-native architectures. Whether hosted on premises or in the public Cloud, a lot of additional factors come into play.
  • Microservices as a hyper-distributed architectural style have been fashionable for a while. Unfortunately, they have often been used to address performance issues with a previously monolithic application. That hardly ever works. A monolith is not a bad approach per se, and it is also much simpler. If you have issues there, chances are they will be bigger with Microservices.
  • Understanding the business will always be the single most important thing. Don’t be distracted from this by technical stuff.

That’s it for today. Although it took a while to write this, I thoroughly enjoyed it. Hopefully you enjoyed reading it just as much.

If you want me to write about other aspects of this topic, please leave a comment or send an email to info@jahntech.com. The same applies if you want to talk about how we at JahnTech can help you with your project.

© 2024 by Christoph Jahn. No unauthorized use or distribution permitted.
