Tuning CMP Performance

For any complex persistence system there are three main factors that impact the performance of the system:

  1. The number of operations performed by the persistent store.
  2. The complexity of each operation.
  3. The amount of data transferred from the store.

When tuning the performance of a persistence system, the primary strategies for optimizing these factors are:

  • Reducing the number of operations required. For example, with an SQL database, it is much more efficient to execute one query that returns 100 rows than it is to execute 100 queries that each return one row. Often the cost of executing a query that returns 100 rows is not significantly more than the cost of a similar query that returns a single row; the determining factor is the number of physical I/O operations required.
  • Reducing the complexity of operations. In an OLTP system, the large majority of operations act directly on objects identified by their primary key or by a simple secondary index; in an EJB application this would be operations using EJB references or cmr-fields. These are generally fairly simple persistence operations that are executed many times with different values; this allows the database to tune these operations well. There are still a number of relatively complex operations needed to support finder or ejbSelect methods.
  • Reducing the total amount of data transferred. Paradoxically, this may involve transferring more information in one operation to avoid other operations in the future. The key to this strategy is to ensure that each operation does the right amount of work: too little means that more operations need to be performed; too much means that transferred data is discarded before it is used.

JBossCMP offers a variety of mechanisms that allow tuning of its strategies.

Caching Mechanisms used by JBossCMP

With CMP 2.0, the specification added a mechanism that allows the CMP implementation to separate the point in time at which it loads data from the store from the time when an application needs that data. Rather than maintaining information in fields, an application always accesses data using a get or set method, the implementation of which is provided by the Container.

This allows the Container to optimize persistence operations by potentially loading data in advance of when the application needs it, or by delaying the load until the actual get method is called. It also provides the Container with information on exactly which data has been modified, allowing it to potentially delay storing that data and to tune those operations so that only modified data gets written.

For these mechanisms to be effective, JBossCMP must be able to cache data from the time it is loaded from the store until it is ultimately flushed back. It does this using two separate caches:

Entity Cache

JBossCMP uses the main JBoss Entity Cache to store values associated with an Entity that is in the main cache. Such an entity is either enrolled in the current transaction, or has been configured in such a way that JBoss will cache it between transactions (commit option A or D). For each CMP Entity in the cache, JBossCMP stores the state of every cmp- or cmr-field: whether the data is loaded and, if so, its value, and whether the data has been modified.

JBossCMP ReadAhead Cache

In some cases JBossCMP will optimistically pre-load data for entities that are not (yet) associated with the current transaction. It cannot use the Entity Cache to store this data, as doing so would require a lock on those entities. Instead it uses a CMP-specific ReadAhead cache as a temporary store for this information. Once the data is used in the course of a transaction, it becomes associated with the Entity instance in the main cache.

Load Groups

Load groups are a mechanism for grouping fields together so that fields that are commonly accessed together by the application can be loaded together in a single persistence operation. This allows us to reduce the number of operations required, and to efficiently transfer data from the store by fetching all the data the application needs for an instance, but no more than is really necessary.

A load-group is a grouping of fields defined in jbosscmp-jdbc.xml. For example, the definition:

    <entity>
      <ejb-name>Order</ejb-name>
      ...
      <load-groups>
        <load-group>
          <description>Fields used by billing</description>
          <load-group-name>Billing</load-group-name>
          <field-name>orderDate</field-name>
          <field-name>billTo</field-name>
          <field-name>shipTo</field-name>
          <field-name>totalAmount</field-name>
        </load-group>
        <load-group>
          <description>Fields used by shipping</description>
          <load-group-name>Shipping</load-group-name>
          <field-name>shipTo</field-name>
          <field-name>lineItemCount</field-name>
        </load-group>
      </load-groups>
    </entity>

defines two load groups: one for billing with fields that are used to bill for an order, and one containing fields that are used to ship the order. A field may be in multiple load-groups as needed.

Load groups are referenced by the different loading strategies described below to determine which fields should be added to a query.

Eager Loading

Eager loading is the mechanism the Container uses to load data in advance of when it actually needs to return it to an application. This allows data to be piggy-backed onto other persistence operations, taking advantage of the fact that the cost of loading several fields or records is often comparable to the cost of loading one.

JBossCMP can perform eager loading in response to any of three triggers:

  • When executing a query for a finder or ejbSelect method.
  • When executing a query to load an entity.
  • When executing a query to load related entities through a cmr-field.

Eager Loading caused by Queries

The query for a finder or for an ejbSelect method that returns Entities returns a Set or Collection of references to those Entities. All that is required to construct such a reference is the primary key, so these queries will normally just select the primary key fields. For example, the simple finder

  public interface OrderLocalHome extends EJBLocalHome {
    public Collection findAll() throws FinderException;
  }

with the EJB-QL query

  SELECT OBJECT(o) FROM Order AS o

would generate SQL like

  SELECT t0_o.ORDER_ID FROM ORDER_DATA t0_o

where orderId/ORDER_ID is the primary key for the Order EJB.

However, if the application then iterates over every EJB in the collection, with code like:

  Collection orders = orderHome.findAll();
  for (Iterator i = orders.iterator(); i.hasNext();) {
    OrderLocal order = (OrderLocal) i.next();
    OrderData data = order.getData();
    ...
  }

then every order in the collection must be loaded independently. This requires 1+N persistence operations (one for the finder and one for each of the N orders), which works against the first strategy above: reducing the number of operations.

To alleviate this performance issue, JBossCMP supports the concept of on-find read-ahead, which allows additional columns to be added to the generated select statement. The read-ahead definition uses a load group to determine which fields to use. For example, the definition:

    <entity>
      <ejb-name>Order</ejb-name>
      <query>
        <query-method>
          <method-name>findAll</method-name>
          <method-params/>
        </query-method>
        <read-ahead>
          <!-- strategy must be on-find -->
          <strategy>on-find</strategy>
          <eager-load-group>Billing</eager-load-group>
        </read-ahead>
      </query>
    </entity>

would cause the fields from the Billing load group to be added to the generated select statement:

  SELECT t0_o.ORDER_ID,
         t0_o.ORDER_DATE, t0_o.FK_BILL_TO, t0_o.FK_SHIP_TO, t0_o.TOTAL
  FROM ORDER_DATA t0_o

The additional column values are stored in the JBossCMP ReadAhead cache. When the application iterates over the result of the finder, the values for the cmp- and cmr-fields in the Billing load group of each Order instance can be loaded from the ReadAhead cache and do not need to be loaded from the database. This means that we now only need to perform 1 persistence operation and have eliminated N others.

This is a very effective optimization but care should be taken to avoid loading data that is not needed. This is especially true if the table contains potentially large values, such as data in LOB columns.
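The relationship between a load group and the finder's column list can be sketched in a few lines of Java. This is only an illustration of the idea, not JBossCMP's actual SQL generator: the class OnFindSelect and its methods are hypothetical, and the field-to-column mapping is assumed from the SQL shown above. It makes the trade-off visible: every field placed in the eager load group adds one column to every row the finder returns.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative sketch only; the real JBossCMP SQL generator is not shown in this document. */
final class OnFindSelect {

    /** Builds "SELECT pk, group columns FROM table alias" for a finder with on-find read-ahead. */
    static String select(String table, String alias, String pkColumn,
                         List<String> groupFields, Map<String, String> columnOf) {
        StringBuilder sql = new StringBuilder("SELECT ")
                .append(alias).append('.').append(pkColumn);
        for (String field : groupFields) {
            String column = columnOf.get(field);
            if (column == null) {
                throw new IllegalArgumentException("No column mapping for field " + field);
            }
            // Each field in the eager load group adds one column to the finder query.
            sql.append(", ").append(alias).append('.').append(column);
        }
        return sql.append(" FROM ").append(table).append(' ').append(alias).toString();
    }

    /** The Billing group from the example; column names are assumed from the SQL above. */
    static String billingExample() {
        Map<String, String> columnOf = new LinkedHashMap<>();
        columnOf.put("orderDate", "ORDER_DATE");
        columnOf.put("billTo", "FK_BILL_TO");
        columnOf.put("shipTo", "FK_SHIP_TO");
        columnOf.put("totalAmount", "TOTAL");
        List<String> billing = Arrays.asList("orderDate", "billTo", "shipTo", "totalAmount");
        return select("ORDER_DATA", "t0_o", "ORDER_ID", billing, columnOf);
    }

    public static void main(String[] args) {
        System.out.println(billingExample());
    }
}
```

Adding a LOB-backed field to the group in this sketch would add that column to the query for every row, which is why such fields are usually better left to lazy loading.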

Eager Loading caused by Instance Loading

When the application tries to read a field, JBossCMP will first check the ReadAhead cache to see if the field being accessed has already been loaded as the result of a query. If so, the value is returned from the cache and no load operation is performed. However, if the value is not in the ReadAhead cache, a SELECT statement will be executed to load the required field from the database. JBossCMP provides strategies for setting which additional fields get loaded by the select, and for causing additional rows to be pre-loaded into the cache.

By default, JBossCMP will load all the entity's fields. This can be overridden by specifying an eager load group in jbosscmp-jdbc.xml. For example, we could define the loading strategy for the Order EJB to load just the fields required by shipping:

    <entity>
      <ejb-name>Order</ejb-name>
      ...
      <eager-load-group>Shipping</eager-load-group>
    </entity>

This would cause the SELECT statement to become:

  SELECT t0_o.FK_SHIP_TO, t0_o.ITEM_COUNT
  FROM ORDER_DATA t0_o
  WHERE t0_o.ORDER_ID = ?

JBossCMP will also look in the ReadAhead cache for the most recent finder result that included this Entity. If such a result is present, JBossCMP will try to read ahead other rows from the same finder, as determined by the read-ahead page-size element for the Entity.

For example, if the entity definition in jbosscmp-jdbc.xml was:

    <entity>
      <ejb-name>Order</ejb-name>
      <read-ahead>
        <strategy>on-load</strategy>
        <page-size>5</page-size>
      </read-ahead>
      <eager-load-group>Shipping</eager-load-group>
    </entity>

then the SELECT statement would be modified to include additional rows from the finder:

  SELECT t0_o.FK_SHIP_TO, t0_o.ITEM_COUNT, t0_o.ORDER_ID
  FROM ORDER_DATA t0_o
  WHERE t0_o.ORDER_ID = ?
  OR t0_o.ORDER_ID = ?
  OR t0_o.ORDER_ID = ?
  OR t0_o.ORDER_ID = ?
  OR t0_o.ORDER_ID = ?

The rows fetched will be the required row, plus the next 4 rows returned by the query that was run for the finder.
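The rule for choosing which rows go into the paged SELECT can be sketched as follows. This is a minimal illustration of the behaviour described above, not JBossCMP's internal code: the class OnLoadPage, its methods, and the sample key values are all hypothetical, and the sketch assumes a page size of at least 1.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Sketch of the paging rule described above; not JBossCMP's internal code. */
final class OnLoadPage {

    /** The requested key plus the keys that follow it in the finder result, up to pageSize in total. */
    static List<Integer> pageKeys(List<Integer> finderResult, Integer requested, int pageSize) {
        int start = finderResult.indexOf(requested);
        if (start < 0) {
            // No remembered finder result contains this key: load just the one row.
            return Collections.singletonList(requested);
        }
        int end = Math.min(start + pageSize, finderResult.size());
        return finderResult.subList(start, end);
    }

    /** One "column = ?" term per key, joined with OR, as in the SQL above. */
    static String whereClause(String column, int keyCount) {
        StringBuilder sql = new StringBuilder("WHERE ");
        for (int i = 0; i < keyCount; i++) {
            if (i > 0) {
                sql.append(" OR ");
            }
            sql.append(column).append(" = ?");
        }
        return sql.toString();
    }

    public static void main(String[] args) {
        // Hypothetical primary keys, in the order the finder returned them.
        List<Integer> finderResult = Arrays.asList(101, 102, 103, 104, 105, 106, 107);
        List<Integer> page = pageKeys(finderResult, 102, 5);
        System.out.println(page); // [102, 103, 104, 105, 106]
        System.out.println(whereClause("t0_o.ORDER_ID", page.size()));
    }
}
```

Near the end of the finder result the page is simply shorter, so the last SELECT may carry fewer than page-size placeholders.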

This strategy can be used where a finder returns too many results to be efficiently held in cache. Instead of trying to hold all the results in memory, only the number defined by the page size are held. This prevents the cache from being flooded whilst still reducing the number of operations required to 1+(N/page-size).
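As a rough check on that figure, the toy cost model below counts operations while iterating over N finder results. It is a sketch under stated assumptions: the class PagedLoadCost is hypothetical, and it assumes each loaded page stays cached until the iteration has passed it. When N is not a multiple of the page size, the division rounds up.

```java
/** Toy cost model for on-load paging; assumes loaded pages stay cached while iterating. */
final class PagedLoadCost {

    /** Counts the finder plus one load for each page needed while iterating over n results. */
    static int operations(int n, int pageSize) {
        if (pageSize < 1) {
            throw new IllegalArgumentException("pageSize must be at least 1");
        }
        int ops = 1;         // the finder query itself
        int cachedUpTo = 0;  // results [0, cachedUpTo) are already in the cache
        for (int i = 0; i < n; i++) {
            if (i >= cachedUpTo) {
                ops++;                                   // cache miss: one SELECT ...
                cachedUpTo = Math.min(i + pageSize, n);  // ... that also fetches the rest of the page
            }
        }
        return ops;
    }

    public static void main(String[] args) {
        System.out.println(operations(100, 1)); // 101: no paging, the 1+N case
        System.out.println(operations(100, 5)); // 21: 1 + 100/5
        System.out.println(operations(7, 5));   // 3: 1 + 2 pages, the last one partial
    }
}
```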

Eager Loading caused by Relationships

JBossCMP will also read ahead when the use of a get accessor for a cmr-field causes a query to be executed to load a related entity. A read-ahead element can be added to the ejb-relationship-role to specify which fields should be eager loaded and the number of rows to pre-fetch. For example, the definition:

    <ejb-relation>
      <ejb-relation-name>Order-has-LineItems</ejb-relation-name>
      <ejb-relationship-role>
        <ejb-relationship-role-name>Order</ejb-relationship-role-name>
      </ejb-relationship-role>
      <ejb-relationship-role>
        <ejb-relationship-role-name>LineItems</ejb-relationship-role-name>
        <read-ahead>
          <!-- strategy must be on-find -->
          <strategy>on-find</strategy>
          <page-size>5</page-size>
          <eager-load-group>Shipping</eager-load-group>
        </read-ahead>
      </ejb-relationship-role>
    </ejb-relation>

would cause the Shipping fields of the Order to be pre-loaded when the getOrder() accessor was called on a LineItem instance. The SQL generated would be:

  SELECT ORDER_DATA.FK_SHIP_TO, ORDER_DATA.ITEM_COUNT
  FROM LINEITEM, ORDER_DATA
  WHERE (LINEITEM.FK_ORDER_ID = ORDER_DATA.ORDER_ID)
  AND (LINEITEM.ITEM_ID = ?)

If the LineItem instance was returned from a finder, then the page-size pre-load would add the next 4 rows from the finder into the query. The SQL would then be:

  SELECT ORDER_DATA.FK_SHIP_TO, ORDER_DATA.ITEM_COUNT, LINEITEM.ITEM_ID
  FROM LINEITEM, ORDER_DATA
  WHERE (LINEITEM.FK_ORDER_ID = ORDER_DATA.ORDER_ID)
  AND (LINEITEM.ITEM_ID = ?
    OR LINEITEM.ITEM_ID = ?
    OR LINEITEM.ITEM_ID = ?
    OR LINEITEM.ITEM_ID = ?
    OR LINEITEM.ITEM_ID = ?
  )

Lazy Loading

Lazy loading is activated when an application accesses a field whose value has not already been loaded through an eager load strategy. To reduce the potential number of operations that need to be performed, JBossCMP can be configured to load additional fields as well as the one being accessed. If specific lazy load groups are defined in jbosscmp-jdbc.xml, JBossCMP will merge the fields from every group that contains the field being accessed and then issue a select for all of those fields that have not already been loaded.

For example, if the following lazy load groups are defined for the Order EJB:

    <entity>
      <ejb-name>Order</ejb-name>
      ...
      <load-groups>
        <load-group>
          <description>Fields used by billing</description>
          <load-group-name>Billing</load-group-name>
          <field-name>orderDate</field-name>
          <field-name>billTo</field-name>
          <field-name>shipTo</field-name>
          <field-name>totalAmount</field-name>
        </load-group>
        <load-group>
          <description>Fields used by shipping</description>
          <load-group-name>Shipping</load-group-name>
          <field-name>shipTo</field-name>
          <field-name>lineItemCount</field-name>
        </load-group>
      </load-groups>
      <lazy-load-groups>
        <load-group-name>Billing</load-group-name>
        <load-group-name>Shipping</load-group-name>
      </lazy-load-groups>
    </entity>

and no fields have been loaded, then calling the get accessor for orderDate would result in the SQL:

  SELECT ORDER_DATE, FK_BILL_TO, FK_SHIP_TO, TOTAL
  FROM ORDER_DATA
  WHERE ORDER_ID=?

being executed. If the get accessor for lineItemCount is then called, the shipTo field has already been loaded, so the SQL executed would be:

  SELECT ITEM_COUNT
  FROM ORDER_DATA
  WHERE ORDER_ID=?
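The merge rule behind these two statements can be written down as a short Java sketch. It is a model of the behaviour described in this section rather than JBossCMP's implementation; the class LazyLoadPlan and its methods are hypothetical names introduced for illustration.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

/** Model of the lazy-load merge rule described above; not JBossCMP's implementation. */
final class LazyLoadPlan {

    /** Fields to select: the accessed field plus every lazy group containing it, minus loaded fields. */
    static Set<String> fieldsToLoad(Map<String, Set<String>> lazyGroups,
                                    String accessed, Set<String> alreadyLoaded) {
        Set<String> fields = new LinkedHashSet<>();
        fields.add(accessed);
        for (Set<String> group : lazyGroups.values()) {
            if (group.contains(accessed)) {
                fields.addAll(group); // merge every group the accessed field belongs to
            }
        }
        fields.removeAll(alreadyLoaded); // never re-select what is already loaded
        return fields;
    }

    /** The Billing and Shipping lazy load groups from the Order example. */
    static Map<String, Set<String>> orderLazyGroups() {
        Map<String, Set<String>> groups = new LinkedHashMap<>();
        groups.put("Billing", new LinkedHashSet<>(
                Arrays.asList("orderDate", "billTo", "shipTo", "totalAmount")));
        groups.put("Shipping", new LinkedHashSet<>(
                Arrays.asList("shipTo", "lineItemCount")));
        return groups;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> groups = orderLazyGroups();
        Set<String> first = fieldsToLoad(groups, "orderDate", Collections.<String>emptySet());
        System.out.println(first);                                        // [orderDate, billTo, shipTo, totalAmount]
        System.out.println(fieldsToLoad(groups, "lineItemCount", first)); // [lineItemCount]
    }
}
```

Under the same rule, accessing shipTo first would merge both groups, since shipTo is a member of Billing and of Shipping, and all five fields would be fetched in one select.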

Impact of Transactions

Transactions have a major impact on the effectiveness of any loading strategy. JBossCMP can only cache information in memory if it can be sure that it will remain consistent with any information stored in the database. If a transaction is in progress, JBossCMP knows that the current thread will be isolated from any other changes in the database and so is able to cache values until the transaction commits. However, if there is no transaction in progress, JBossCMP assumes that the data may be modified at any time and so does not retain any information in cache. This leads to behaviour that is contrary to expectations:

Attempting to improve performance by avoiding transaction overhead will actually result in a substantial decrease in performance due to additional persistence operations.

For example, consider the simple use case where we query for orders and then iterate over them:

    double total = 0.0;
    Collection orders = orderHome.findOrdersShippedToState("CA");
    for (Iterator i = orders.iterator(); i.hasNext();) {
      OrderLocal order = (OrderLocal) i.next();
      total += order.getTotalAmount();
    }

Assume that eager load groups have been configured for the finder and load instance cases to pre-load the totalAmount field.

If this is run without a transaction, then the SELECT issued to run the finder will include the pre-loaded totalAmount field:

  SELECT t0_o.ORDER_ID, t0_o.TOTAL
  FROM ORDER_DATA t0_o, ADDRESS t1_o_shipTo
  WHERE t0_o.FK_SHIP_TO = t1_o_shipTo.ADDRESS_ID
  AND t1_o_shipTo.STATE = ?

This will populate the ReadAhead cache with the result of the finder including the values for totalAmount. However, because the finder ran without a transaction, this information is immediately aged out of the cache. As a result, when the first entity in the list is loaded, none of its fields remain in the cache and hence a select must be issued to fetch its values. This repeats for every Order in the list resulting in 1+N persistence operations.

However, if this is run within a transaction, then JBossCMP will not discard the pre-loaded results when the finder completes. As the list is then iterated, the cmp-field values for every entity can be found in the cache and no SQL needs to be executed. As a result, only 1 persistence operation is needed.
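The difference between the two cases can be reproduced with a small toy model. This is not JBossCMP code: the class TransactionCacheDemo is hypothetical, the order values are made up, and the model simply assumes that cached data is discarded at the end of every call when no transaction is active, which is the behaviour described above.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model of the cache lifetime described above; not JBossCMP code. */
final class TransactionCacheDemo {
    private final Map<Integer, Double> readAhead = new HashMap<>();
    private final boolean inTransaction;
    private final int rows;
    private int operations = 0;

    TransactionCacheDemo(int rows, boolean inTransaction) {
        this.rows = rows;
        this.inTransaction = inTransaction;
    }

    /** Finder with on-find read-ahead of totalAmount: always exactly one operation. */
    int[] findOrders() {
        operations++;
        int[] keys = new int[rows];
        for (int i = 0; i < rows; i++) {
            keys[i] = i + 1;
            readAhead.put(keys[i], 10.0 * keys[i]); // pre-loaded totalAmount (made-up value)
        }
        endOfCall();
        return keys;
    }

    /** Field access: served from the cache if possible, otherwise one more operation. */
    double getTotalAmount(int orderId) {
        Double cached = readAhead.get(orderId);
        double value;
        if (cached != null) {
            value = cached;
        } else {
            operations++;           // cache miss: a SELECT is needed
            value = 10.0 * orderId;
        }
        endOfCall();
        return value;
    }

    /** Without a transaction, nothing may be retained between calls. */
    private void endOfCall() {
        if (!inTransaction) {
            readAhead.clear();
        }
    }

    /** Runs the finder-then-iterate use case and returns the number of persistence operations. */
    static int countOperations(int rows, boolean inTransaction) {
        TransactionCacheDemo demo = new TransactionCacheDemo(rows, inTransaction);
        for (int id : demo.findOrders()) {
            demo.getTotalAmount(id);
        }
        return demo.operations;
    }

    public static void main(String[] args) {
        System.out.println(countOperations(50, false)); // 51: 1+N without a transaction
        System.out.println(countOperations(50, true));  // 1: the finder only
    }
}
```

In a real deployment the equivalent of the inTransaction flag would normally come from the transaction attribute of the session bean method that runs the finder and the loop, so that both happen inside one transaction.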