Monday 31 August 2015

Why some people think you don’t need to do performance engineering

For many IT professionals who have not experienced painful performance problems (the lucky ones), there exist several myths that allow the issue of performance to be pushed to the back of their minds and left unaddressed during a project. This article outlines some of the most common myths.

Add More Hardware

Performance is seen as simply a problem of not having enough hardware available to do the job; therefore, if we hit a performance problem, it is simply a question of purchasing more hardware. This is only a valid approach to fixing some throughput problems, i.e. where the response times are acceptable but the system can't support the required number of users. Additional hardware will only help if your application has been designed to be scalable and to operate on larger hardware. All too often projects hit performance problems, purchase more hardware and then discover the application has not been designed to utilize that additional hardware. Even well-designed applications rarely manage to double performance by doubling the number of CPUs. Figure 1 shows, for a "typical" application, the sort of performance improvement that can be achieved by adding CPUs. The reason for the less-than-expected improvement is that the overhead of managing processes increases as CPUs are introduced, and processes need to spend more time communicating with each other and with all the other processes.
Figure 1 Performance vs Number of CPUs
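One common way to reason about this diminishing return is Amdahl's law. The sketch below is only an illustration under an assumed serial fraction, not the data behind Figure 1:

```python
# Illustrative sketch of Amdahl's law: speedup is limited by the serial
# (non-parallelisable) portion of the work. The 10% serial fraction is an
# assumption for illustration, not a measured value.

def amdahl_speedup(cpus: int, serial_fraction: float) -> float:
    """Theoretical speedup on `cpus` processors when `serial_fraction`
    of the work cannot be parallelised."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / cpus)

if __name__ == "__main__":
    serial_fraction = 0.10  # assumed: 10% of the work is inherently serial
    for cpus in (1, 2, 4, 8, 16, 32):
        print(f"{cpus:2d} CPUs -> speedup {amdahl_speedup(cpus, serial_fraction):.2f}x")
    # Doubling from 8 to 16 CPUs improves the speedup from ~4.7x to ~6.4x,
    # far short of doubling performance.
```

Even this optimistic model ignores the growing coordination overhead mentioned above, so real systems usually do worse.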
Not all performance problems are hardware related; some are due to a poor design that does not order or schedule tasks in an efficient or appropriate manner. For example, an on-line bank failed to generate pages within the required response times because all the calls made to the back-end system were made in sequence rather than in parallel, causing the cumulative delay to exceed the response time requirement. In this example no amount of additional hardware would solve the response time problem.
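A minimal sketch of the difference, using hypothetical back-end calls with made-up latencies (the bank's actual services and timings are not known): the parallel page build is bounded by the slowest call rather than the sum of all calls.

```python
# Sketch: cumulative delay of sequential back-end calls vs. issuing them
# in parallel. The three back-end calls and their latencies are hypothetical.
import time
from concurrent.futures import ThreadPoolExecutor

def backend_call(name: str, latency_s: float) -> str:
    time.sleep(latency_s)          # stand-in for a network/back-end delay
    return name

CALLS = [("accounts", 0.8), ("statements", 0.7), ("offers", 0.6)]

def sequential() -> float:
    start = time.perf_counter()
    for name, latency in CALLS:
        backend_call(name, latency)
    return time.perf_counter() - start      # ~0.8 + 0.7 + 0.6 = 2.1 s

def parallel() -> float:
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=len(CALLS)) as pool:
        list(pool.map(lambda call: backend_call(*call), CALLS))
    return time.perf_counter() - start      # ~max(0.8, 0.7, 0.6) = 0.8 s

if __name__ == "__main__":
    print(f"sequential page build: {sequential():.2f} s")
    print(f"parallel page build:   {parallel():.2f} s")
```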

 

Figure 2 Bob soon discovered having the faster hardware doesn’t guarantee success
Even if purchasing additional hardware can solve the problem, two issues still exist: (a) the cost of the additional hardware could be prohibitively expensive; (b) your solution may already use the top-of-the-range hardware, leaving no room to grow.

We can fix it later

Often performance engineering or performance tuning is ignored early in the design cycle in deference to the fix-it-later approach. The fix-it-later approach is to wait until the implementation is nearing completion, or even in operation, before the performance of the system is considered, and then to rectify it with a quick fix. The argument for the fix-it-later approach is that only a small portion of the code is performance critical and can therefore be optimised later.
The quick-fix approach has two problems. The first is that the fix will possibly invalidate the design work, leading to re-documentation and additional effort. The second, more fundamental problem is that only so much can be achieved with a code fix before a major design change is required. This is best illustrated by an analogy with energy saving in a house. An old house can be insulated against the cold by means of draught excluders and loft insulation. Further, more dramatic steps can be taken with changes to the building, such as cavity wall insulation and double glazing. Unfortunately, no matter what improvements you make to an old house, it will never be as efficient as a new house that has been designed with energy saving in mind. A new house will have extra-thick cavities and be positioned to get the best warmth from the sun.

You can’t design for performance

As traditional software and system development methodologies concentrate on achieving functional goals, many designers and project managers have never attempted, or even thought about, designing for performance. In addition, the complexity and intangible nature of software systems lead people to believe that early performance prediction is useless. In fact there are many negative responses to the initial suggestion of using performance assurance in the process, some of which are listed below:
  • Wishful thinking – performance problems won't happen to me.
  • Pessimistic thinking – Performance Assurance will never work, so why bother?
  • Lack of scientific thinking – it's more expeditious, less painful and often just more fun to guess what the potential problems might be than to know what they are beyond reasonable doubt.
  • It's not important – performance assurance takes valuable time away from software development and testing.
  • There is no time – development and test schedules are fixed, tight and shrinking.
  • Performance Assurance is not a requirement – there is no statement in the development contract that a performance assurance process needs to be followed.
Performance Assurance does take time and requires the appropriate skills and management to be successful, but it will benefit the development process. A quotation from John F. Kennedy comes to mind: "There are risks and costs to a program of action. But they are far less than the long-range risks and costs of comfortable inaction."

It is too expensive

A successful Performance Assurance program should begin early and continue throughout the project life cycle. For systems development this is especially true; Figure 3 illustrates why. The cost of making a change to a system increases dramatically with time. On the other hand, the uncertainty of how a system will function and perform decreases with time; in the early stages of a project, uncertainty can be quite high. Without Performance Assurance, much of the early phase of system design is "guesswork," increasing the chances of needing to make costly changes later in the development life cycle. Performance Assurance reduces uncertainty, allowing necessary changes to be implemented early or eliminating the need for a change altogether.
                    Determining the need for changes early pays handsomely. Making a change during the design phase is vastly simpler than making a change during development; re-designing a system on paper is always much easier than re-designing one in the data center. If hardware and software have already been purchased, there is less latitude in what can be done to correct a problem. Furthermore, as in Figure 3, the later a problem is found, the more costly it is to fix.
Figure 3 The Benefits of Performance Assurance
                   Late stage changes and fixes are costly for many reasons. The further into development, the more there is to repair if a flaw appears. Database tables, stored procedures and triggers, code routines, GUI windows and much more could all be impacted by a single change. Worse, if the system fails or needs modification during the production phase, the cost of downtime—not to mention lost business, customers, or reputation—must be factored into the cost of a fix.
                 In short, Performance Assurance applied early can save a great deal of time and money in the long run and boost the overall quality of an information system.  But the early stage of a project is not the only time Performance Assurance delivers value.
               Applying Performance Assurance throughout the project life cycle, from planning through production, is the key to a successful distributed system.  While risk and uncertainty are especially high in the early stages of a project, they never disappear—every addition or modification to a system introduces new risk.
During development, the addition of new features affects existing code and data structures. The sooner bugs and errors are detected, the less costly they are to fix, so new features should be tested as they are added. Likewise, during production, the addition or removal of users, data, hardware, software and networks, and the imposition of new requirements, all contribute to the need for continued risk management.

How to write Performance Requirements, with an Example

The only way in which systems will meet their performance targets is for those targets to be specified clearly and unambiguously. It is a simple fact that if performance is not a stated criterion of the system requirements then the system designers will generally not consider performance issues, while loose or incorrectly defined performance specifications can lead to disputes between clients and suppliers. In many cases performance requirements are never rigid, as a system that does not fully meet its defined performance requirements may still be released because of other considerations such as time to market.

In order to assess the performance of a system the following must be clearly specified:
• Response Time
• Workload
• Scalability
• Platform

Response Time

In some cases the system response times are clearly identified as part of a business case; for example, a criminal's fingerprint needs to be identified while the criminal is still in custody (less than an hour). In some cases the response time will be dictated by legal requirements, although this is rare.
For general applications, asking users what response time is acceptable is like asking people how much salary they would like! The whole exercise is simply a process of negotiation.
The general advice on response time from Jakob Nielsen's book on usability is:
0.1 second is about the limit for having the user feel that the system is reacting instantaneously, meaning that no special feedback is necessary except to display the result.
1.0 second is about the limit for the user’s flow of thought to stay uninterrupted, even though the user will notice the delay. Normally, no special feedback is necessary during delays of more than 0.1 but less than 1.0 second, but the user does lose the feeling of operating directly on the data.
10 seconds is about the limit for keeping the user’s attention focused on the dialogue. For longer delays, users will want to perform other tasks while waiting for the computer to finish, so they should be given feedback indicating when the computer expects to be done. Feedback during the delay is especially important if the response time is likely to be highly variable, since users will then not know what to expect.
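As a rough sketch of how these limits might drive feedback decisions in a user interface (the thresholds come from the guidance above; the feedback choices are illustrative assumptions):

```python
# Sketch: picking user feedback based on Nielsen's response-time limits.
# The threshold values come from the guidance above; the feedback strings
# are illustrative only.
def feedback_for_delay(expected_delay_s: float) -> str:
    if expected_delay_s <= 0.1:
        return "no feedback needed - feels instantaneous"
    if expected_delay_s <= 1.0:
        return "no special feedback, but the user notices the delay"
    if expected_delay_s <= 10.0:
        return "show a busy indicator to keep attention on the dialogue"
    return "show progress and expected completion time; allow other work"

for delay in (0.05, 0.5, 3.0, 25.0):
    print(f"{delay:5.2f} s -> {feedback_for_delay(delay)}")
```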

For systems that have to support significant numbers of users, the cost of response time delays can actually be measured in monetary terms, and can therefore form part of trade-off studies between different architectures providing different levels of performance.
REMEMBER USERS HATE VARIATION IN RESPONSE TIMES
Whatever is chosen must be measurable in the real system. Care must be taken to ensure that the performance measurement is unambiguous, concise and completely defined. A response time specification should include the following information:
Measurement Points: The points at which you want the response time measured need to be included. For example, are you interested in the response time at the data centre or from a branch network?
What is included: Define what is included in the measurement. For example does the measure for a web page include the browser render time or just the delivery time to the browser?

What is excluded: Ensure calls to 3rd parties beyond the control of the system developer are defined. This is particularly important when defining contractual response time requirements, as suppliers can only be responsible for areas they can influence.
Statistic Type: The statistic type needs to be defined; for example, 95% of all response times should be less than 8.5 seconds (see the sketch after this list).
Measurement Period: Define the period over which to measure the response time. This is particularly important where workload varies over the day.
Platform: Should it not be possible to test on the real production hardware, any allowance for testing on test hardware needs to be stated.
Error Rate: Define the acceptable error rate allowed during the measurement of the response times. Some systems may produce errors under high workloads, so the acceptable error rate needs to be defined.
Finally, once you have defined the response times, have them reviewed by somebody else to check that the definition is indeed clear and unambiguous. Remember also to state the workload at which the response times are to be met.
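As an illustration of the Statistic Type, Measurement Period and Error Rate points above, here is a minimal sketch of how such a requirement might be checked from raw measurements. The sample structure, the 8.5-second target and the 1% error budget are assumptions for illustration, not values from any real contract:

```python
# Sketch: checking "95% of response times under a target" for one
# measurement period, excluding failed requests (which are counted
# against a separate error-rate budget). Sample data is illustrative.
from dataclasses import dataclass

@dataclass
class Sample:
    response_time_s: float
    is_error: bool = False

def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile; good enough for a requirement check."""
    ordered = sorted(values)
    rank = max(1, round(pct / 100.0 * len(ordered)))
    return ordered[rank - 1]

def check_requirement(samples: list[Sample], target_s: float = 8.5,
                      pct: float = 95.0, max_error_rate: float = 0.01) -> bool:
    error_rate = sum(s.is_error for s in samples) / len(samples)
    ok_times = [s.response_time_s for s in samples if not s.is_error]
    p = percentile(ok_times, pct)
    print(f"p{pct:.0f} = {p:.2f} s, error rate = {error_rate:.2%}")
    return p <= target_s and error_rate <= max_error_rate

samples = [Sample(2.1), Sample(7.9), Sample(9.4), Sample(3.2),
           Sample(1.8), Sample(12.0, is_error=True)] * 50
print("requirement met:", check_requirement(samples))
```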

Workload

Again, the business case or existing process should be the starting point for the workload definition. However, it is not enough to state that "the system should be capable of supporting 80,000 customers" or "the system should be able to support 4 pages/sec". These statements are often good metrics at a high management level but do not define the work that the system must support. This is particularly important as the mix of transactions affects performance. For example, a database system may easily handle 10,000 read transactions per hour but only 3,000 update transactions per hour.
The most likely transactions to specify are the user initiated transactions but care must be taken to consider all the users of the system and the batch processes.
For example, a system may have external customers, internal staff providing data entry, and batch processes such as backups. If the backup is not completed overnight then it may seriously disrupt the performance experienced by users the next day.
The workload is often described as the scenarios that the users are likely to execute. The table below is an example that shows, for each user scenario, how many requests per day it generates, which pages a user executes and the think time between pages. You will notice that a percentage is included before the Exit page; this specifies that only that percentage of users will view the page.
Scenario     | Daily Total | Pages                                      | Think Time
View Balance | 2000        | Login, Portal, 50% Exit                    | 15 secs
Bill Pay     | 45          | Login, Banking, Payment, Confirm, 75% Exit | 20 secs
...          | ...         | ...                                        | ...
When completing a workload specification, a check must be made to ensure that all relevant functions have been covered. This includes not just the obvious user workloads but special cases such as management requests, backups and error scenarios/handling. Once all loads have been considered, infrequent or inappropriate workloads can be eliminated. Inappropriate workloads might include error scenarios where it is understood that errors will be very infrequent, although it should be appreciated that an appreciable workload can result from the need to provide adequate error and/or security logging.
The workload should be specified up to the date until which you wish the current hardware to cope without upgrade.
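A hedged sketch of how a scenario table like the one above might be expanded into the page-level work the system must actually serve. The scenario data mirrors the illustrative table; treating "50% Exit" as a page requested by only half of that scenario's users is an assumption about how the percentages are meant to be read:

```python
# Sketch: expanding a scenario-based workload definition into per-page
# daily request counts. Scenario data mirrors the illustrative table above.
from collections import Counter

SCENARIOS = [
    # (scenario name, daily total, [(page, probability of being requested), ...])
    ("View Balance", 2000, [("Login", 1.0), ("Portal", 1.0), ("Exit", 0.5)]),
    ("Bill Pay", 45, [("Login", 1.0), ("Banking", 1.0), ("Payment", 1.0),
                      ("Confirm", 1.0), ("Exit", 0.75)]),
]

def daily_page_counts(scenarios) -> Counter:
    counts: Counter = Counter()
    for _name, daily_total, pages in scenarios:
        for page, probability in pages:
            counts[page] += daily_total * probability
    return counts

counts = daily_page_counts(SCENARIOS)
for page, count in counts.most_common():
    print(f"{page:10s} {count:8.0f} requests/day")
print(f"total      {sum(counts.values()):8.0f} requests/day")
```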

Workload Profile

The performance of the system also depends on how the load is delivered to it. For example, it is easier to achieve fast response times for a system that receives a regular arrival of work throughout the day than for one that receives bursts of traffic. It is therefore important that the workload profile is defined.
The arrival rate into the system is rarely going to be constant throughout the day or the week. The figure below shows the arrival rate for requests to a website.
As can be seen in the figure, the workload peaks around lunchtime and late evening but activity is very quiet during the night. The above example shows how workload varies over a 24-hour period, but it may be more important to show how workload varies over a month (for example, batch processing) or over an hour (for example, end-of-day share trading).
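To make the point concrete, a small sketch of why the profile matters: the same daily total implies a very different design target if traffic is bursty rather than uniform. The daily total and the assumed share of traffic arriving in the busiest hour are illustrative figures only.

```python
# Sketch: uniform vs. bursty arrivals for the same daily total.
# The 20% busy-hour share is an assumed figure for illustration.
DAILY_REQUESTS = 100_000
BUSY_HOUR_SHARE = 0.20   # assumed: 20% of the day's traffic in the peak hour

uniform_rate = DAILY_REQUESTS / (24 * 3600)             # requests/second
peak_rate = DAILY_REQUESTS * BUSY_HOUR_SHARE / 3600     # requests/second

print(f"uniform arrivals: {uniform_rate:.2f} req/s")
print(f"peak-hour rate:   {peak_rate:.2f} req/s "
      f"({peak_rate / uniform_rate:.1f}x the daily average)")
```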

Defining a Peak Workload

Many organizations' systems may suffer a rare or unexpected increase in workload, often referred to as a "peak" workload. The decision facing system developers is whether to design the system to cope with that peak. The answer will depend on the consequences of failing, and certain organizations do design for a peak. The NY stock exchange in 1997 could process 3 billion transactions in a day while the daily average was less than 1.2 billion (Times, August 14 1997). This is because of the need to execute trades the same day and to maintain an accurate record of the state of the market.
When defining the workload for a new venture, where little or no existing workload data exists and the system is to be developed by a supplier, the temptation is to specify a high peak workload. This is a valid but often expensive approach, and it may still not protect you from high demand, which can exceed whatever you specify. An alternative is to specify the workload as your business model and analysis predict it will be. Next, produce a detailed specification of the scalability of the system to ensure that the supplier develops a system that can be scaled quickly. Specify requirements for detecting and processing overload, to ensure flood control mechanisms are in place that stop the system crashing under intensive loads. Test that the system is scalable by renting in additional equipment or using vendors' test environments. Finally, ensure the operational staff have a procedure for increasing the capacity of the site and that contracts are in place for additional hardware, bandwidth, etc.
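For flavour, a minimal sketch of the kind of flood-control mechanism such a requirement might describe: an admission limit that sheds excess requests quickly rather than letting them queue until the system collapses. The concurrency limit and the rejection behaviour are assumptions, not a prescription.

```python
# Sketch: simple admission control ("flood control"). Requests beyond a
# configured concurrency limit are rejected immediately with a "busy"
# response instead of queueing indefinitely. The limit of 200 concurrent
# requests is an assumed figure.
import threading

class AdmissionController:
    def __init__(self, max_concurrent: int = 200):
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def try_admit(self) -> bool:
        """True if the request may proceed, False if it should be shed
        (e.g. answered with HTTP 503 / a 'please try again later' page)."""
        return self._slots.acquire(blocking=False)

    def release(self) -> None:
        self._slots.release()

controller = AdmissionController(max_concurrent=200)

def handle_request(work) -> str:
    if not controller.try_admit():
        return "503 Service Unavailable - system busy, please retry"
    try:
        return work()
    finally:
        controller.release()

print(handle_request(lambda: "200 OK - page rendered"))
```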

Scalability

In one respect scalability is simply specified as the increase in workload that the system should be able to process. The scalability required is often driven by the lifespan and maturity of the system. For example, a new (and hence immature) system could enjoy an unexpected growth in popularity and suffer a significant increase in workload as it attracts new users. More mature systems, which represent improvements on older systems, are likely to have more accurately defined workloads and are thus less likely to suffer in this respect.
Remember to specify that the response time requirements should still be met as the workload scales.
A problem with a scalability specification is that it may not be economically viable to test it, as doing so often requires additional hardware. An alternative is either to rent in the additional hardware for the duration of the tests or to use an extrapolation technique such as a simulation model.

Platform Considerations

A platform is defined as the underlying hardware and software (operating system and software utilities) which will house the system. It is not always the case that the designer will be given a "green field" choice of platform on which to house the system. In some cases the customer may dictate this choice, or there may be internal reasons (product strategy, perhaps) that constrain the designer's freedom. It may also be the case that the system will require various generic products to be used, in which case the performance of these must also be specified.
Consideration must be given as to whether the hardware will be used exclusively for the system or whether it must be shared with some other process. A new system might have to compete with an old system for the processor and its resources. The processor and resource requirements of the old system must be specified so that the spare capacity and the achievable performance can be assessed.
When the customer specifies the platform for the system it is not only important that the project engineer knows what has been specified but that they also understand its capability in terms of performance.
When part of the platform consists of external resources, such as connections to external databases or banking systems, the response times of these external resources must also be specified.

Contractual Considerations

For any project where the system is developed by a supplier, considerable thought needs to be given to the wording of performance goals. The problem is that the customer will lose potential leverage if performance requirements are specified after the ink has dried, while the supplier will be wary of signing up to performance requirements before they have undertaken enough analysis to assess whether those requirements can reasonably be met. Problems are particularly acute for new ventures where no existing workload data is available to estimate demand accurately, and therefore assumptions and projections can be challenged. This should not be used as an excuse for either party to ignore the specification of performance requirements.
Make every effort to ensure that performance requirements are discussed and negotiated during the contractual negotiations. This allows trade-offs to be made by both customer and supplier. A customer who is unsure of the system workload may specify high or even excessive volumes. The supplier should analyze these and explain how they drive the cost of the system, allowing the customer to decide if they want to pay for the comfort of a high workload margin.
Where time does not exist to fully explore the performance requirements, a task should be created to define them in the analysis phase of the project. If this is the case, a clause should be added to the contract stating that this work will be undertaken and when.

An Example

This example is from an on-line bank and the supplier of its front-end software. The approach taken was that the software supplier would be responsible for a level of software performance within a test environment that closely resembled the production environment.

Response Time

The purpose of this section of the document is to outline the Software Performance Goals for Product X. These are the goals that Supplier Y and Customer Z minimally require to see in the Performance Test environment before putting an application into Production. These are not Data Centre SLA measures.
Software:
Supplier Y warrants that in supporting 300,000 customers it shall ensure that performance shall not fall below the following level:
95% of ALL visible pages for “normal” customers respond in 8 seconds or less, including infrastructure, excluding backends.

Measurement Points:
The response times will be measured using HP LoadRunner (or similar tool) located behind the firewall and in front of the web servers. The timer will measure the time from the request for a page to when the last bit required to render the page is returned. Backend response times will be measured using the application server log files.

Definitions

Backends are third-party products and information providers, such as Reuters share quotes, not supplied by Supplier Y.
For the purpose of measuring the response times, the performance tests should not exceed 60% CPU utilization during the busy hour.
Visible Page shall mean a web page visible (non-blank) as seen by a customer. All redirect page times will be included in the response time of the page to which they redirect.
The test workload will be based on a normal business day, as defined in the workload section below, executed in the release acceptance test environment.
If the Supplier Y Software fails the response time criteria as set out above, then Supplier Y shall be liable to pay to Customer Z the following amounts (depending on the level of failure):
Software Response Times: compensation for failure to meet 95% of all pages in less than 8 seconds

Score | Response Time (seconds) | Cumulative Total Compensation
0     | 8 or under              | £0
1     | 8.01-9                  | £8,000
2     | 9.01-10                 | £15,000
3     | 10.01-11                | £45,000
4     | 11.01-14                | £90,000
5     | 14.01 and over          | £180,000
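Applied to a test result, the table works as a simple lookup. A minimal sketch (the band boundaries and amounts are taken from the table above; the measured 95th-percentile value is just an example):

```python
# Sketch: mapping a measured 95th-percentile page time onto the
# compensation table above. Band boundaries and amounts come from the
# table; how the measurement itself is taken is defined earlier.
BANDS = [  # (upper bound of p95 in seconds, score, cumulative compensation in GBP)
    (8.00, 0, 0),
    (9.00, 1, 8_000),
    (10.00, 2, 15_000),
    (11.00, 3, 45_000),
    (14.00, 4, 90_000),
    (float("inf"), 5, 180_000),
]

def compensation_for(p95_seconds: float) -> tuple[int, int]:
    for upper, score, amount in BANDS:
        if p95_seconds <= upper:
            return score, amount
    raise ValueError("unreachable")

score, amount = compensation_for(9.4)   # example measured p95
print(f"p95 = 9.4 s -> score {score}, compensation GBP {amount:,}")
```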

Workload

The software must support 80,000 customers, who on a busy day will generate 4500 customer interactions as outlined in the table below:
Ref No | Description | Pages | Daily Total
1 | Portal | Login, Portal, 50% Exit | 2500
2 | Transaction History (Statement) | Login, Portal, 50% Balances, Statement, 70% D Stat, 50% Exit | 500
3 | Bank Viewer | Login, Portal, 50% Balances, Statement, 70% D Stat, Balances, Charges, Balances, SO List, DD List, Int Trf, Bal, Portal, 50% Exit | 700
4 | News Reader | Login, Portal, 50% Bal, Stat, 70% D Stat, Intra Day, News, Portal, News, Portal, BV Add, 50% Exit | 250
5 | Portfolio Edit | Login, Portal, Portfolio View, Portal Pref, Portfolio View, Add Shares, View Share, Share Quantity, Portfolio View, 50% Exit | 100
6 | OO Payment | Login, Portal, 50% Balances, Statement, 70% D Stat, OO List, New OO, New Payee, Confirm Payee, OO Details, Confirm OO, OO List, Payee List, Click Payee, Delete Payee, Confirm Payee, 50% Exit | 245
7 | Assign MM Category | Login, Portal, 50% Balances, Statement, 70% D Stats, GoMM, CreateCategory, ConfirmCategory, Balances, Statement, Click Item, Statement, GoMM, Click Report, Report, 50% Exit | 105
8 | Detail Bank Browser | Login, Portal, 50% Balances, Statement, 70% D Stats, In Progress, Portal, Alerts, View Alert, Portal, 50% Exit | 100
Total | | | 4500
The percentage in front of certain pages represents the probability of that page being requested. The think time between all pages is 20 seconds, with the exception of the think time between the Login and Portal pages, which is 15 seconds. The profile for work arriving is shown in the figure below:

Scalability

Supplier Y warrants that the banking Software shall be capable of supporting at least 300,000 customers when implemented into a suitable production environment.

How do you know if your Load Test has a bottleneck


Throughput Graph
Response Time Graph
The graphs above were actually generated using a spreadsheet model of the performance of a closed-loop system. This is like LoadRunner and other testing tools, where there are a fixed number of users who use the system, then wait (think) and return to the system. In reality the performance graphs may look different from the expected norm. An example from a LoadRunner test is shown below: the first graph shows how the number of VUsers was increased during the test, and the second graph shows the increase in response times. In this case the jump in response time is dramatic. However, in some cases the increase in response time will be less dramatic, as the system will start to produce errors at high loads, which distorts the response time figures.
Example LoadRunner VUser Graph
Example LoadRunner Graph Showing Increasing Response Times
Having discovered that there is a bottleneck in the system, you then have to start looking for it.
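For reference, here is a minimal sketch of the kind of closed-loop (fixed users plus think time) model the spreadsheet would have implemented, using the standard interactive response-time law R = N/X − Z with throughput capped by the bottleneck resource. The service demands and think time are illustrative values, not the numbers behind the graphs above.

```python
# Sketch: asymptotic bounds for a closed-loop model with N virtual users,
# think time Z, and per-visit service demands D_i at each resource
# (LoadRunner-style: each user submits a request, waits for the response,
# thinks for Z seconds, then repeats). Demands and Z are illustrative.
def closed_loop_bounds(n_users: int, think_time_s: float, demands_s: list[float]):
    d_total = sum(demands_s)        # minimum response time (no queueing)
    d_max = max(demands_s)          # bottleneck service demand
    # Throughput is capped by the bottleneck and by the user population:
    throughput = min(1.0 / d_max, n_users / (d_total + think_time_s))
    # Interactive response-time law: R = N / X - Z
    response_time = n_users / throughput - think_time_s
    return throughput, response_time

DEMANDS = [0.05, 0.12, 0.03]   # seconds at web, app and DB tiers (assumed)
THINK = 10.0                   # seconds of think time (assumed)

for n in (1, 10, 50, 100, 200, 400):
    x, r = closed_loop_bounds(n, THINK, DEMANDS)
    print(f"{n:4d} users -> throughput {x:6.2f} req/s, response time {r:6.2f} s")
```

The printout shows the shape described above: throughput climbs with the number of users until the bottleneck saturates, after which adding users only drives the response time up.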