Optimization Overview

HTTP Traffic Profile

To optimize a servlet container it is important to understand how requests are delivered to the container and what resources are used to handle them.

Browser Connection Handling

Each user connects to the webapp container with a browser or other HTTP client application, and how that client connects to the server greatly affects the optimization process. Historically, browsers sent only a single HTTP request over each TCP connection, which meant that every request incurred the latency and resource costs of establishing a connection to the server. To quickly render a page with many images, each requiring its own request, browsers could open up to 8 connections to the server so that multiple requests could be outstanding at once. In some specific circumstances with HTTP/1.0, browsers could send multiple requests over a single connection.

Modern browsers now mostly use HTTP/1.1 persistent connections, which allow multiple requests per connection in almost all circumstances. Thus browsers typically open only one or two connections to each server and send many requests over those connections. Browsers are increasingly using request pipelining, so that multiple requests may be outstanding on a single connection, decreasing request latency and reducing the need for multiple connections.

This situation results in a near-linear relationship between the number of server connections and the number of simultaneous users of the server:

SimultaneousUsers * ConnectionsPerClient == SimultaneousConnections
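
For example (the figures are purely illustrative), if a site is expected to serve 5,000 simultaneous users and each browser holds two persistent connections, the server must be prepared to handle roughly 10,000 simultaneous connections.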

Server Connection Handling

For Jetty and almost all Java HTTP servers, each connection accepted by the server is allocated a thread that listens for requests and handles them. While non-blocking solutions are available to avoid this allocation of a thread per connection, the blocking nature of the servlet API prevents these from being used efficiently with a servlet container.

SimultaneousConnections <= Threads
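
A minimal sketch of this thread-per-connection model in plain Java (the port, pool size and handler are illustrative) makes the bound concrete: the fixed pool size caps how many connections can be serviced at once.

    import java.io.IOException;
    import java.net.ServerSocket;
    import java.net.Socket;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    public class ThreadPerConnectionServer
    {
        public static void main(String[] args) throws IOException
        {
            // The pool size is the hard bound on SimultaneousConnections:
            // once all 200 threads are busy, newly accepted connections
            // queue until a thread becomes free.
            ExecutorService pool = Executors.newFixedThreadPool(200);
            try (ServerSocket listener = new ServerSocket(8080))
            {
                while (true)
                {
                    Socket connection = listener.accept();
                    pool.execute(() -> handle(connection));
                }
            }
        }

        private static void handle(Socket connection)
        {
            // A real handler would loop here, reading HTTP requests and
            // writing responses until the connection is closed.
            try
            {
                connection.close();
            }
            catch (IOException ignored)
            {
            }
        }
    }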

Persistent Connections

Persistent connections are supported by the HTTP/1.1 protocol and, to a lesser extent, by the HTTP/1.0 protocol. The duration of these connections and how they interact with a webapp can greatly affect the optimization of the server and the webapp.

A typical webapp comprises a dynamically generated page plus many static components such as style sheets and images. Thus, to display a page, a cluster of requests is sent: one for the main page and one for each resource it uses. It is highly desirable for persistent connections to be held at least long enough for all the requests of a single page view to complete.

After a page is served, there is typically a delay while the user reads or interacts with it, after which another request cluster is sent to obtain the next page of the webapp. The delay between request clusters can be anything from seconds to minutes. It is desirable to hold persistent connections for longer than this delay, both to improve the responsiveness of the webapp and to avoid the cost of new connections. The price, however, may be many idle connections on the server consuming resources while contributing no throughput.
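
On the server side this trade-off usually reduces to a single idle-timeout setting. A sketch, assuming the Jetty 9+ embedded API (the 30 second value is illustrative; other containers expose an equivalent knob):

    import org.eclipse.jetty.server.Server;
    import org.eclipse.jetty.server.ServerConnector;

    public class IdleTimeoutConfig
    {
        public static void main(String[] args) throws Exception
        {
            Server server = new Server();
            ServerConnector connector = new ServerConnector(server);
            connector.setPort(8080);
            // Hold idle persistent connections for 30 seconds: long enough
            // to span a typical think time between page views, short enough
            // to reclaim resources from abandoned connections.
            connector.setIdleTimeout(30_000);
            server.addConnector(connector);
            server.start();
            server.join();
        }
    }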

The duration that persistent connections are held is under the control of both the client and the server, either of which can close a connection at any time. The browser's cache settings may also greatly affect the use of persistent connections, as many requests for resources on a page may not be issued at all, or may be answered with a simple 304 Not Modified response.
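
A servlet can take advantage of this by reporting a modification time; the standard HttpServlet base class then answers conditional GET requests with 304 Not Modified on its behalf. A minimal sketch using the javax.servlet API (the servlet and its timestamp are illustrative):

    import java.io.IOException;
    import javax.servlet.http.HttpServlet;
    import javax.servlet.http.HttpServletRequest;
    import javax.servlet.http.HttpServletResponse;

    public class LogoServlet extends HttpServlet
    {
        // Fixed timestamp for an immutable resource (illustrative).
        private static final long LAST_MODIFIED = 1_000_000_000_000L;

        @Override
        protected long getLastModified(HttpServletRequest request)
        {
            // HttpServlet.service() compares this value against the
            // request's If-Modified-Since header and replies with
            // 304 Not Modified, and no body, when the client's cached
            // copy is still current.
            return LAST_MODIFIED;
        }

        @Override
        protected void doGet(HttpServletRequest request, HttpServletResponse response) throws IOException
        {
            response.setContentType("image/png");
            // ... write the resource body
        }
    }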

Optimization Objectives

There are several key objectives when optimizing a webapp container. Unfortunately, not all of them are compatible, and you are often faced with a trade-off between two or more of them.

Maximize Throughput

Throughput is the primary measure used to rate the performance of a web container, and it is mostly measured in requests per second. Your optimization efforts will mainly be aimed at maximizing the request rate, or at least ensuring that a minimal rate is achievable. However, remember that request rate is an imperfect measure: not all requests are the same, and it is easy to measure a request rate for a load that is unlike any real load. Specifically:

  • Containers are more efficient at handling high request rates from a few long-held persistent connections. Unfortunately this is often not a real traffic profile; requests more often arrive from many connections that are mostly idle and/or short-lived. It is therefore key to also consider the connection rate, or at least the number of simultaneous connections, when interpreting a request rate figure (see the sketch after this list).
  • Requests with content, or with large responses, take more time to package and process and may be exposed to more network inefficiencies. Thus the request rates of realistically sized requests must be considered, and in some circumstances it is useful to consider the data rate.
  • There are several different ways that a webapp may serve a request, and features may be applied that will affect throughput, e.g. static versus dynamic content, fixed versus variable length responses, or security. The complexity of the requests must be considered when measuring throughput.
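
To illustrate the first point, the sketch below (a hypothetical probe, assuming Java 11+ and a server at http://localhost:8080/) measures a request rate in the most flattering way possible: serial requests over a single long-held persistent connection. Real traffic spread over many mostly idle connections will yield a noticeably lower figure.

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    public class RequestRateProbe
    {
        public static void main(String[] args) throws Exception
        {
            // Illustrative target; pass a real URL as the first argument.
            String url = args.length > 0 ? args[0] : "http://localhost:8080/";
            HttpClient client = HttpClient.newHttpClient(); // reuses one persistent connection
            HttpRequest request = HttpRequest.newBuilder(URI.create(url)).build();

            for (int i = 0; i < 100; i++) // warm up the JIT, caches and the connection
                client.send(request, HttpResponse.BodyHandlers.discarding());

            int measured = 1000;
            long start = System.nanoTime();
            for (int i = 0; i < measured; i++)
                client.send(request, HttpResponse.BodyHandlers.discarding());
            double seconds = (System.nanoTime() - start) / 1e9;
            System.out.printf("%.1f requests/second over one persistent connection%n", measured / seconds);
        }
    }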

Minimize Latency

Latency is delay in the processing of requests, and it is desirable to reduce latency so that web applications appear responsive to users. There are two key sources of latency to consider (both can be observed directly; see the sketch after this list):

  • The latency between when a request is initiated and when the handling of that request starts. This latency is affected by the time taken to establish a connection and by the scheduling of threads within the server.
  • The latency between requests in a request cluster. This latency can be large if the response to a previous request must complete before the next request can be issued. Browsers reduce this latency by using multiple connections or by pipelining requests over a single connection.
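
A few lines of plain Java suffice to separate the two. The sketch below (host, port and path are illustrative) times connection establishment independently of the wait for the first byte of the response:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.io.OutputStream;
    import java.net.InetSocketAddress;
    import java.net.Socket;
    import java.nio.charset.StandardCharsets;

    public class LatencyProbe
    {
        public static void main(String[] args) throws Exception
        {
            String host = "localhost"; // illustrative target
            int port = 8080;

            // Source 1: time to establish the TCP connection.
            long t0 = System.nanoTime();
            try (Socket socket = new Socket())
            {
                socket.connect(new InetSocketAddress(host, port));
                long connectMs = (System.nanoTime() - t0) / 1_000_000;

                // Source 2: time from issuing a request to its first
                // response byte, i.e. scheduling plus handling.
                OutputStream out = socket.getOutputStream();
                BufferedReader in = new BufferedReader(
                    new InputStreamReader(socket.getInputStream(), StandardCharsets.US_ASCII));
                long t1 = System.nanoTime();
                out.write(("GET / HTTP/1.1\r\nHost: " + host + "\r\nConnection: close\r\n\r\n")
                    .getBytes(StandardCharsets.US_ASCII));
                out.flush();
                String statusLine = in.readLine();
                long firstByteMs = (System.nanoTime() - t1) / 1_000_000;

                System.out.println("connect: " + connectMs + "ms, first byte: "
                    + firstByteMs + "ms (" + statusLine + ")");
            }
        }
    }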

While latency is not directly related to throughput, there is often a trade-off to be made between reducing latency and increasing throughput. Server resources that are allocated to idle connections may be better deployed handling actual requests.

Minimize Resources

The processing of each request consumes server resources in the form of memory, CPU and time. Memory is used for buffers, program stack space and application objects. Keeping memory usage within a server's available physical memory is important for maximum throughput. Conversely, using a server's virtual memory may allow more simultaneous users and can also decrease latency.
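
Buffers are typically the largest per-connection memory cost, and most containers let you size them. A sketch, assuming the Jetty 9+ embedded API (the sizes shown are illustrative, not recommendations):

    import org.eclipse.jetty.server.HttpConfiguration;
    import org.eclipse.jetty.server.HttpConnectionFactory;
    import org.eclipse.jetty.server.Server;
    import org.eclipse.jetty.server.ServerConnector;

    public class BufferConfig
    {
        public static void main(String[] args) throws Exception
        {
            Server server = new Server();
            HttpConfiguration config = new HttpConfiguration();
            // Buffer sizes multiply across active connections, so they
            // must be chosen to keep total usage within physical memory.
            config.setOutputBufferSize(32 * 1024);
            config.setRequestHeaderSize(8 * 1024);
            ServerConnector connector =
                new ServerConnector(server, new HttpConnectionFactory(config));
            connector.setPort(8080);
            server.addConnector(connector);
            server.start();
            server.join();
        }
    }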

Servers will have one or more CPUs available to process requests. It is important that these processors are scheduled in such a way that they spend more time handling requests and less time organizing and switching between tasks.

Servers often allocate resources for periods of time, so it is important to tune timeouts to give those resources a high probability of being used productively.
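
Thread idle timeouts are one example: a thread kept alive on the chance of reuse occupies stack space the whole time. A sketch, again assuming the Jetty 9+ QueuedThreadPool (the values are illustrative):

    import org.eclipse.jetty.server.Server;
    import org.eclipse.jetty.util.thread.QueuedThreadPool;

    public class ThreadTimeoutConfig
    {
        public static void main(String[] args) throws Exception
        {
            QueuedThreadPool threadPool = new QueuedThreadPool(500, 50);
            // Threads idle for longer than this (above the minimum of 50)
            // are stopped so their stacks can be reclaimed; too short a
            // timeout causes churn, too long hoards memory.
            threadPool.setIdleTimeout(60_000);
            Server server = new Server(threadPool);
            server.start();
            server.join();
        }
    }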

Graceful Degradation

Much of optimization focuses on providing maximum throughput under average or high offered load. However, for systems that aim to offer high availability and a high quality of service, it is also important to optimize behaviour under extreme offered load: either to continue providing reasonable service to some of the offered load, or to gracefully degrade service to all of it.
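
One concrete mechanism for this is a low-resource monitor that detects overload and sheds idle connections so that active users continue to be served. A sketch, assuming Jetty 9+'s LowResourceMonitor (the thresholds are illustrative; other containers offer comparable facilities):

    import org.eclipse.jetty.server.LowResourceMonitor;
    import org.eclipse.jetty.server.Server;

    public class DegradeGracefully
    {
        public static void main(String[] args) throws Exception
        {
            Server server = new Server(8080);
            LowResourceMonitor monitor = new LowResourceMonitor(server);
            monitor.setPeriod(1_000);                // check once per second
            monitor.setMonitorThreads(true);         // low when the thread pool is exhausted
            monitor.setMaxMemory(512 * 1024 * 1024); // low when heap use passes 512MB
            // While resources are low, idle persistent connections are
            // closed much sooner, shedding idle load to preserve service
            // for connections with active requests.
            monitor.setLowResourcesIdleTimeout(1_000);
            server.addBean(monitor);
            server.start();
            server.join();
        }
    }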