Performance Tuning Methodology

I’m taking a brief excursion from my usual identity and API-centric posts to answer a question about performance tuning that someone asked me earlier this year. In a previous incarnation of my career, I focused on performance tuning and diagnostics, particularly for Java systems. However, the same principles apply to just about any running system. This post explores how to approach load testing and performance tuning in general.

Approach

Performance testing (and tuning) is generally best done with an approach similar to black box testing: testing a system without any specific knowledge of its internals, or at least approaching the problem as though you had none. In all likelihood, you will have a fairly detailed idea of what your system’s architecture and internals look like. This may not be true of SaaS, PaaS, or third-party vendor applications where you don’t have source code access.

This approach works with everything (including vendor and SaaS apps). Note that most SaaS providers (and third-party hosting providers) don’t want you load testing their systems unless specific provisions have been made from a legal, operational support, and system capacity standpoint.

So, one is not examining source code at this point—that comes later. We are interested in the overall behavior of the system (including the code running in it). I was at a client site a few years ago, working directly with an application development team and doing this type of work; the lead developer had trouble understanding why my performance tuning methodology didn’t start with reviewing code.

This approach to performance tuning entails:

  1. Understand/document what needs to be tested.
  2. Define desired targets in terms of throughput or other system metrics.
  3. Create load test scripts.
  4. Produce load against system.
  5. Observe behavior of end-to-end system (e.g., response time).
  6. Record CPU utilization, memory utilization, network I/O, disk I/O, throughput during these tests.
  7. Identify where bottlenecks are.
  8. Fix the most prominent bottleneck (even if you see more than one bottleneck, only fix one at a time).
  9. Repeat steps 4–8 until the desired SLA (Service Level Agreement) has been met.

Step #8 is the hard part.
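
As a rough illustration, steps 4 through 9 can be sketched as a loop in Python. Everything here is hypothetical: `LoadTestResult`, `run_load_test`, and `fix` are placeholder names, and treating the most heavily utilized resource as the most prominent bottleneck is a simplifying assumption rather than a rule.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class LoadTestResult:
    """Hypothetical summary of one load test run."""
    throughput_tps: float
    # Utilization per resource, from 0.0 to 1.0 (e.g., "cpu", "disk_io").
    utilization: Dict[str, float]


def most_prominent_bottleneck(result: LoadTestResult) -> str:
    # Assumption: the busiest resource is the one to look at first.
    return max(result.utilization, key=result.utilization.get)


def tune(run_load_test: Callable[[], LoadTestResult],
         fix: Callable[[str], None],
         target_tps: float,
         max_iterations: int = 20) -> List[str]:
    """Repeat steps 4-8 until the target is met; return the bottlenecks fixed."""
    fixed: List[str] = []
    for _ in range(max_iterations):
        result = run_load_test()                 # steps 4-6
        if result.throughput_tps >= target_tps:  # step 9: target met
            break
        bottleneck = most_prominent_bottleneck(result)  # step 7
        fix(bottleneck)                          # step 8: one fix per iteration
        fixed.append(bottleneck)
    return fixed
```

The loop deliberately applies a single fix per iteration and then re-runs the test, so the effect of each change can be observed on its own.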

Given my background in middleware, I tend to start towards the bottom of the technology stack and start working my way up until I find the problem. Eventually, I get to the application code. Sometimes, that proves more efficient; sometimes, starting at the application code and working your way down is more efficient.

One of my peers approaches this activity in exactly the opposite fashion; he starts by looking at the code and then works his way down into the system until he finds the problem. This peer’s skill set includes being very proficient at using Java debuggers to find problems efficiently. Use the approach that comes most naturally to you. Not everyone has the mindset or patience for this activity; your organization should employ someone who does.

When a bottleneck or error is encountered, start at the place where the issue is observed. Then, move deeper into the system as needed to identify the root cause. Maintaining discipline in this activity tends to pay off in the long run.

I described this methodology as similar to black box testing, but there is an important difference. The testing team is aware of the system internals (to some level), often has access to the source code, but approaches testing as though they don’t have these things.

When problems start surfacing, the team uses its knowledge of the system (system internals, application code, etc.) to troubleshoot the problem. The team members applying this knowledge are likely to be different from the team members running the load tests.

What is Being Load Tested?

It’s necessary to understand what is being load tested. In fact, the more familiar you are with the system you are trying to tune, the more successful you will be (and most likely, the easier it will be) in your tuning efforts.

Create a document that captures all of this information before you begin creating load test scripts or running load tests.

Human Actors

The following people (or roles that one or more people may fill) are needed for the load testing phase of a project. One person may fill several of these roles, and some roles may not be relevant to every situation. Sometimes more than one person is needed to bring all the necessary knowledge to the table for a single role. Additional resources could be needed:

Prep Work

Before load testing begins, some information must be gathered. This includes:

That last one is the real trick. If this is a brand new system, then you are essentially guessing. If this is an existing system with a point upgrade or minor update, then there is real-world data that can be used to figure out the relative usage of the various transactions and build a load test.
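
For an existing system, turning real-world usage into a load test mix might look something like the Python sketch below. The transaction names and counts are made up for illustration, and weighted random selection is only one plausible way to reproduce the relative usage.

```python
import random
from typing import Dict, List, Optional


def transaction_mix(counts: Dict[str, int]) -> Dict[str, float]:
    """Convert observed transaction counts into relative weights that sum to 1."""
    total = sum(counts.values())
    if total <= 0:
        raise ValueError("need at least one observed transaction")
    return {name: n / total for name, n in counts.items()}


def pick_transactions(mix: Dict[str, float], n: int,
                      seed: Optional[int] = None) -> List[str]:
    """Draw n transactions at random, weighted by their share of real usage."""
    rng = random.Random(seed)
    names = list(mix)
    weights = [mix[name] for name in names]
    return rng.choices(names, weights=weights, k=n)


# Hypothetical counts, e.g., taken from a day of production access logs.
observed = {"login": 200, "search": 700, "checkout": 100}
mix = transaction_mix(observed)
script = pick_transactions(mix, n=1000, seed=1)
```

For a brand new system there are no counts to start from, so the weights themselves would be the guess.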

The Process

The basic idea is to generate load, identify bottlenecks, eliminate bottlenecks, and repeat until the desired SLA is obtained with the desired load. Getting through even one iteration of this cycle can take anywhere from hours to days (if not longer for new or complex systems).
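
Deciding whether the SLA has been obtained requires a concrete check. A minimal Python sketch follows, assuming the SLA is expressed as a minimum throughput plus a maximum 95th-percentile response time; that shape, the nearest-rank percentile method, and the function names are my assumptions, and your SLA may be defined differently.

```python
import math
from typing import Sequence


def percentile(values: Sequence[float], p: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least p% of
    all samples at or below it."""
    if not values:
        raise ValueError("no samples")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[rank - 1]


def sla_met(latencies_ms: Sequence[float], duration_s: float,
            target_tps: float, max_p95_ms: float) -> bool:
    """True if the run reached the target throughput and the p95 limit."""
    throughput = len(latencies_ms) / duration_s
    return throughput >= target_tps and percentile(latencies_ms, 95) <= max_p95_ms
```

A percentile is used instead of an average here because a small share of very slow responses can hide behind a healthy-looking mean.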

The steps described in the “Prep Work” section above must be completed before you begin. If you don’t have this information, success will be elusive. Or you will simply end up tracking it down a little way down the road.

You need to produce load. Choose your load testing tools. Ideally, your organization would already have tooling and hardware in place. Use these tools to record a load test that mimics expected production usage.

Run load tests to produce a sufficient amount of traffic to meet the defined SLA. You will approach this target incrementally.
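
One simple way to approach the target incrementally is a stepped ramp. The Python sketch below assumes evenly spaced load levels, which is just one option; `ramp_steps` is a hypothetical helper, not part of any load testing tool.

```python
from typing import List


def ramp_steps(start_tps: float, target_tps: float, steps: int) -> List[float]:
    """Evenly spaced load levels from start_tps up to target_tps, inclusive."""
    if steps < 2:
        raise ValueError("need at least two steps")
    if start_tps <= 0 or target_tps < start_tps:
        raise ValueError("expected 0 < start_tps <= target_tps")
    increment = (target_tps - start_tps) / (steps - 1)
    return [start_tps + i * increment for i in range(steps)]


# Example: ramp from 2 TPS to 50 TPS in four runs.
levels = ramp_steps(2.0, 50.0, 4)  # [2.0, 18.0, 34.0, 50.0]
```

Each level would be run as its own test, moving to the next level only once the current one completes without new issues.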

Most of the time when I have done this, the first several iterations of running a small amount of load encounter various issues, ranging from identity and access issues for test accounts to network connectivity.

The test engineer must get the test to a point where it can run at 1–2 TPS without any errors occurring. For a brand new system, this can be challenging. Keep at it and you will get there. But be realistic; I’ve seen this take anywhere from a couple of days to as long as two to six weeks, depending on the complexity of the system.
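
A real load testing tool handles pacing for you, but the idea of a low, steady request rate with an error count can be sketched in a few lines of Python. The `send` callable is a placeholder for whatever issues one request against your system, and `run_paced` is a hypothetical name.

```python
import time
from typing import Callable, Tuple


def run_paced(send: Callable[[], None], tps: float, total_requests: int,
              sleep: Callable[[float], None] = time.sleep,
              clock: Callable[[], float] = time.monotonic) -> Tuple[int, int]:
    """Issue requests at a fixed rate; return (successes, errors)."""
    interval = 1.0 / tps
    errors = 0
    start = clock()
    for i in range(total_requests):
        # Schedule against the start time so small delays do not accumulate.
        wait = start + i * interval - clock()
        if wait > 0:
            sleep(wait)
        try:
            send()
        except Exception:
            errors += 1
    return total_requests - errors, errors
```

For example, `run_paced(send_one_request, tps=2.0, total_requests=600)` would run for roughly five minutes; the goal at this stage is an error count of zero.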

Once errors related to configuration and application bugs have been eliminated, you can begin increasing load. At some point, you will hit your first load-related issue.

Unfortunately, I cannot tell you how to resolve those issues. They could be at any layer/tier of the application: system configuration, middleware configuration, system resources, application code, or other pieces. As a first step, narrow down where the problem is:

Having a good understanding of the software the application is built on top of is a necessary precondition to doing this successfully. It doesn’t have to be the load test engineer who has these skill sets. Other Subject Matter Experts (SMEs) can be brought in as necessary to monitor and troubleshoot problems.

Once the current issue is resolved, run the load test again. Keep repeating this cycle until the system can run a sustained load test for at least several hours (four to six) at the desired load level.
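
Whether a run counts as sustained can be checked from throughput samples collected during the test. The Python sketch below assumes one throughput sample per fixed interval (for example, one per minute); the function name and sampling scheme are illustrative only.

```python
from typing import Sequence


def longest_sustained_hours(samples_tps: Sequence[float], interval_s: float,
                            target_tps: float) -> float:
    """Length, in hours, of the longest unbroken stretch at or above target_tps."""
    longest = 0
    current = 0
    for tps in samples_tps:
        # Any sample below the target resets the current stretch.
        current = current + 1 if tps >= target_tps else 0
        longest = max(longest, current)
    return longest * interval_s / 3600.0


# Example: 240 one-minute samples, all at the target, is four hours sustained.
steady = [100.0] * 240
hours = longest_sustained_hours(steady, interval_s=60.0, target_tps=100.0)  # 4.0
```

A single dip below the target in the middle of the run splits the stretch, which is the behavior you want when the goal is several uninterrupted hours.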

As issues are encountered, it can be difficult to separate the cause and effect. Which observation is the root cause and which is a side effect? This comes with experience.

Keep a detailed log of the configuration changes that are made while troubleshooting during this process. It seems obvious, but sometimes people forget to be diligent about this and then have no idea which change fixed the problem. Only change one thing at a time! Even if it seems like a small detail, only apply one change at a time.
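
The log can be as simple as a spreadsheet or text file. If you prefer something that enforces the one-change rule, a minimal Python sketch might look like this; the class names, fields, and example settings are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class Change:
    """One configuration change, tied to the test iteration that followed it."""
    iteration: int
    setting: str
    old_value: str
    new_value: str
    note: str = ""


@dataclass
class ChangeLog:
    entries: List[Change] = field(default_factory=list)

    def record(self, change: Change) -> None:
        # Enforce the rule: only one change per load test iteration.
        if any(e.iteration == change.iteration for e in self.entries):
            raise ValueError(f"iteration {change.iteration} already has a change")
        self.entries.append(change)


log = ChangeLog()
log.record(Change(1, "max_heap", "2g", "4g", note="GC pauses under load"))
```

Recording the old value alongside the new one makes it easy to roll a change back if the next run shows no improvement.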

Additional Thoughts

Summary

Performance tuning needs to be done before new systems, or changes to existing systems, go live. Performance tuning is a process; it’s a journey, not a destination. You could always spend more time making the system a little more efficient, but is it needed? This is why the required SLA (or target performance level) must be defined ahead of time. It tells you not only when the system is good, but when it is good enough.

Robert C. Broeckelmann Jr.

Principal Consultant

My focus within Information Technology is API Management, Integration, and Identity, especially where these three intersect. Most recently, I have been working with Apigee Edge and WebSphere DataPower.
