by Andrew C. Oliver
This article was edited and republished from the original source.
At JBoss, I was asked to help write some training materials for “performance tuning JBoss/Java on RHEL/Linux”. It wasn’t an easy task: I knew the audience would be composed primarily of administrators who might not be interested in the whole system, and most people mean both performance and scalability when they say “performance”. What I would do to make a single client connect and perform its operations as quickly as possible on a single server is inherently very different from what I’d do for 10,000 users connecting to a cluster. Likewise, the tuning I do for an application with no users and all messaging is very different from what I do for a standard web application-type system.
The methods I use, from designing a high-performance system and determining which subsystems need tuning to choosing which options to set, are fairly universal. There are some common places that deserve attention, and some very nasty common snags that can be avoided.
Let me preface this article with some assumptions. I assume you are fairly close to the beginning of your journey and that your system looks more like the second scenario (high-scale web application) than any other. I also assume your system is more of a transaction-processing system than an OLAP or messaging system, although some of this is fairly universal. Finally, while the command-line settings of the various JDKs all derive more or less from the Sun JDK’s settings, they do differ (sometimes merely by a letter). When I list them, I use the Sun settings because the Sun JDK is the most widely deployed and all other JDKs are at least influenced by it.
A good way to determine when a journey is over is to pick a destination before you begin. In other words, the only way to meet your performance goals is to have stated specific performance goals in the first place, whether these are “under 3 second page loads on the LAN” or specifics for particular subsystems. Ideally you have fairly granular performance goals for pieces of your software (for example, “query/data retrieval performance is key”), and these goals can be tested on your target system (for example, “RHEL 4.4, Itanium II/3GHz with 8GB memory, SATA2 10000rpm: getFooBarDataFromDatabase() returns in 3ms”).
However, having JUnit tests that measure method execution performance isn’t enough. Java/Linux systems have non-deterministic performance. For one, Linux is (generally) not a real-time operating system, and Java runtimes are generally not real-time either. Additionally, concurrency causes contention for resources: threads compete for processor time, synchronized locks, database locks, and disk access. To really set goals, we have to set concurrency goals, such as the maximum and average number of logged-in users, along with performance expectations for the critical components under that load.
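To make the idea of a concurrency goal concrete, here is a minimal sketch of measuring latency while several simulated users run at once. The class name `ConcurrentLatencyProbe` and its methods are hypothetical illustrations, not part of any tool mentioned in this article, and a real load test tool does far more (ramp-up, think time, network effects).

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Hypothetical helper: times an operation while several threads call it. */
final class ConcurrentLatencyProbe {
    private ConcurrentLatencyProbe() {}

    /** Runs the operation callsPerUser times on each of 'users' threads; returns sorted latencies in nanoseconds. */
    static List<Long> measure(Runnable operation, int users, int callsPerUser) {
        ExecutorService pool = Executors.newFixedThreadPool(users);
        try {
            List<Callable<List<Long>>> tasks = new ArrayList<>();
            for (int u = 0; u < users; u++) {
                tasks.add(() -> {
                    List<Long> latencies = new ArrayList<>(callsPerUser);
                    for (int i = 0; i < callsPerUser; i++) {
                        long start = System.nanoTime();
                        operation.run();
                        latencies.add(System.nanoTime() - start);
                    }
                    return latencies;
                });
            }
            List<Long> all = new ArrayList<>();
            // invokeAll blocks until every simulated user has finished.
            for (Future<List<Long>> result : pool.invokeAll(tasks)) {
                all.addAll(result.get());
            }
            Collections.sort(all);
            return all;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("load run interrupted", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("simulated user failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    /** Nearest-rank percentile (p in 0..100] over an ascending, non-empty list. */
    static long percentile(List<Long> sorted, double p) {
        if (sorted.isEmpty()) {
            throw new IllegalArgumentException("no samples");
        }
        int index = (int) Math.ceil(p / 100.0 * sorted.size()) - 1;
        return sorted.get(Math.max(0, Math.min(index, sorted.size() - 1)));
    }
}
```

A goal such as “95% of calls return within 3ms with 300 concurrent users” could then be checked by comparing `percentile(samples, 95.0)` against the budget, ideally on the target hardware.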
Load test tools
Having goals is a fine thing, but we must have ways to measure adherence to them. We should have ways to measure the performance of different aspects of our system. It’s easy to get an idea of raw page delivery with ab, the Apache HTTPD Server benchmarking tool, but that doesn’t tell us enough about whether 300 users can log in at once when 1000 users are logged in (one of our goals). For that you need more sophisticated tools such as the proprietary tools from Mercury, the more affordable Web Performance Suite, the open source (but Windows only) OpenSTA, or the ever-popular (albeit primitive) and multiplatform (Java) Grinder. You can find other open source alternatives at http://opensourcetesting.org.
In general, you want tools that can both record and execute load scripts, preferably with parameters (as cutting and pasting scripts for 10000 unique login names would be tedious at best) and ideally text-based (so that you have the option to cut and paste, or generate with a Perl script, 10000 unique logins). You also want unit test tools. There are others, but JUnit is the basic tool against which all others are measured. Your unit tests can include temporal expectations, but you may need to code some tolerance for when you run these on a system other than the target system (my laptop might not perform the same as a high-end server).
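One way to code that tolerance is to scale the time budget by a factor that depends on where the test runs. The sketch below is only an illustration: the class `TimingBudget` and the system property name `perf.tolerance` are assumptions made for this example, not JUnit features.

```java
import java.util.concurrent.TimeUnit;

/** Hypothetical helper: checks a timing budget scaled by a tolerance factor. */
final class TimingBudget {
    private TimingBudget() {}

    /** Runs the task once and returns the elapsed wall-clock time in nanoseconds. */
    static long timeNanos(Runnable task) {
        long start = System.nanoTime();
        task.run();
        return System.nanoTime() - start;
    }

    /** True if elapsedNanos fits within budgetMillis multiplied by toleranceFactor. */
    static boolean withinBudget(long elapsedNanos, long budgetMillis, double toleranceFactor) {
        double allowedNanos = TimeUnit.MILLISECONDS.toNanos(budgetMillis) * toleranceFactor;
        return elapsedNanos <= allowedNanos;
    }

    /** Reads the factor from the assumed "perf.tolerance" property; 1.0 means the target system. */
    static double toleranceFactor() {
        return Double.parseDouble(System.getProperty("perf.tolerance", "1.0"));
    }
}
```

Inside a unit test, the assertion would then read something like `withinBudget(timeNanos(task), 3, toleranceFactor())`, with a laptop run started with a larger factor (for example `-Dperf.tolerance=4`). A single timed run is noisy, so in practice you would likely warm up the JVM and repeat the measurement.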
Using load test tools and performance tools is a fine thing, but if you can’t figure out exactly where the problem is, then they aren’t much use. There is the venerable System.out.println method, but more sophisticated users use profiling tools and network analyzers. I’m sure other network analyzers exist, but the only ones I’m very familiar with are the (command-line and opaque) tcpdump and the (very friendly but GUI) ubiquitous Ethereal. I suggest having a UNIX or GNU/Linux-based system nearby even if you develop on Windows, as there is no easy/reliable way to spy on the loopback adapter on Windows.
On GNU/Linux this is easy with Ethereal (see Figure 1). Ethereal helps you see exactly what, and how much, is being passed over your network.
To analyze your code a little more closely, there is always the JVM -Xaprof option, but for a more sophisticated view you can use the (closed source and proprietary) JProbe, the very sophisticated (but expensive, difficult to use, and closed source) Wily Introscope, or any one of a number of Java profilers, including the as yet unmentioned JBoss Profiler.
You need to make sensible choices in the way you write your code, using profiling and testing tools to make sure your code performs to your expectations and to find potential bottlenecks. But what about your physical system? The first aspect to address is your network topology. A common topology is to physically separate each aspect of the system into separate processes on separate physical machines (see Figure 2). Thus, every request is processed by a load balancer, a web server, a servlet container, an appserver/EJB tier, and a database server. Often, this is coupled with some network rules, such as a demilitarized zone (DMZ), and rationalized as a security decision.
The problem with this topology is that it focuses on preventing less common types of attacks (deliberate crackers) while exacerbating the most common type (denial of service) and doing nothing about some of the simplest (SQL injection). Generally speaking, on a web-based system, the first line of defense is your load balancer and its inability to pass anything other than HTTP. Any cracker who can get past it can likely compromise your HTTP server. Moreover, the more layers of network IO you add, the greater the performance requirements on each part of the system to process any given request. The load test tools mentioned above have been used to deny service to many sites that practice this form of “security” simply by exercising the login process, which is often the most laborious process and the most susceptible to denial of service attacks.
An ideal topology would be one in which there are redundant load balancers (for high availability) and one layer of identical nodes, each of which includes the web, servlet, EJB, and database tiers all in-process using operating system threads (this breaks down at some level and we need more advanced scheduling, but that is a much longer article). Additional scalability would be achieved by adding one more of the identical systems. This isn’t completely practical with today’s technology, and with many datasets it is simply impossible. Therefore our closest ideal is a set of redundant load balancers, multiple web/appserver boxes (running in the same process with operating system threads), and some sort of HA database solution (from Oracle’s RAC to MySQL’s clustering). Figure 3 shows the architecture of this setup.
For clustered systems we want to separate our network traffic onto separate backbones. Ideally there should be a separate network interface card in each system for incoming client traffic, cluster replication data, and backend (database) communication. The communication for each should be bound strictly to its own card (trivial to do with most open source application servers and Linux) and should pass over a separate backbone. Gigabit Ethernet is a common solution and is a good default for new systems. Older systems may benefit from an upgrade. The rationale for this setup is not only performance, but also ease of problem determination and security: it is easier to firewall dedicated network backbones.
If you follow this topology advice (in a nutshell, less IO means more performance and scalability), you’re going to have a question about caching, in particular about edge caching (see Figure 4).
There is no clear-cut advice here. Theoretically, given equal network performance, edge caching services (such as Akamai Technologies) should slow down your overall client-side throughput, although in practice this is not always the case. Why should they slow it down? Because, as of HTTP 1.1, it is no longer necessary for each page, image, or object to require a separate network connection request/response/close cycle. A client can request several pages via one connection. It is also common for load balancers to use persistent connections to the web server tier. By using edge caching you embed images from a different domain in your HTML output. This means that the client must request those images over a separate connection, thus driving up the real system cost of delivering the page.
However, all things are often not equal, and thus you may find that edge caching does indeed perform better. The only way to know for sure is to load test from an offsite location (preferably network-geographically near your largest concentration of users). If you go with an edge caching solution, ensure that the gain is not merely a matter of increased CPU speed and that you’re not merely sidestepping a slow load balancer.
Linux has a cornucopia of network tuning options, from the hardware level on up the stack. For many medium-grade systems the default options are fine. However, higher-end systems may benefit from changing the network buffer size and quite possibly from raising the maximum transmission unit (MTU) on your database and cluster backbones. The MTU is the maximum size a packet can be without being split. The MTU on the internet is more-or-less fixed (due to defective specs for discovery and firewall rules) at about 1500 bytes, which was fine for 10BASE-T but is dated on gigabit Ethernet. It is suggested that an MTU of about 9000 is probably a good tradeoff between safe and optimal. You need to be careful when setting this and ensure that your routers and other networking equipment are configured to handle larger transmission units, or you may take yourself off the network. The MTU can be set permanently, but the method for doing so differs among Linux distributions.
The following sysctl settings raise the maximum socket buffer sizes (the TCP max buffer size):
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
You can also change a network interface’s MTU at the command line:
ifconfig eth0 mtu 9000
Linux was not always a very good operating system for running Java software. Early Java implementations on Linux used “green threads”, which cannot scale to multiple processors. Later implementations used lightweight processes (which is what Linux provided in the way of “thread support”), which performed poorly compared to thread-based systems. Today’s Linux distributions (2.6 kernel and later) support NPTL threads, which are lighter weight and can scale across processors. Red Hat and other distributions previously had a backport of NPTL threads to 2.4 kernels, but these backports often caused instability. Ensure that you are running a 2.6 kernel (uname -a is usually sufficient to check), preferably with a JDK 5 distribution. Many large-scale Lintel deployments are on the Sun or BEA JRockit JDKs or close derivatives. You need at least a 1.4.x distribution (we’ll discuss why 5 makes a big difference later).
Memory In Java
The Java language and runtime are heavily premised on the concept of garbage collection. Unlike in typical C and C++ applications, in Java you do not manage memory by hand; the system reclaims it for you. There are multiple types of memory in Java, two of which you can tune/control. The first is the heap, or the Java heap. This is where your objects (everything that is not a primitive stack variable) are allocated. The heap is divided into segments, which we’ll discuss momentarily. The second type is the stack, which is where the aforementioned primitives and the call stack are allocated. The heap is garbage collected; the stack is allocated on a per-thread basis.
You can set the maximum heap size on the command line when starting Java (or often in the shell script that starts your application server) by passing -Xmx1g (1 GB, for example). If your software requires more than this (with space for garbage collection) then you’ll experience an OutOfMemoryError. There is no good way to determine the exact amount of memory a Java program requires, so testing is essential. It is suggested that you also set the minimum heap size to the same value as your maximum heap size on larger production systems. Otherwise, if heap utilization grows at peak load, the heap must be resized just when the system is busiest, and a performance spike may occur. Some of the nastiest intermittent stability bugs have been in the interaction between heap resizing and garbage collection.
The heap is divided into generations, which are then cleaned differently (see Figure 5). There are different garbage collection algorithms for different situations. For most large systems either parallel or concurrent garbage collection is optimal. In general, parallel is suggested, as concurrent is more susceptible to issues of memory fragmentation and thread contention. The first generation is often called the new generation or eden. Objects start here. When eden fills, it is cleaned in a minor collection in which all live objects are moved to the survivor spaces, and objects that have survived a few iterations are moved to the tenured generation. After the heap is (by default) 68% full, a major collection takes place and the tenured generation is cleaned. If insufficient memory is freed, a full collection occurs. Additionally there is a permanent generation, which is allocated in addition to the heap specified via the -Xmx option. This is where class definitions go, and its size is specified using the -XX:PermSize=64m option (with -XX:MaxPermSize setting its upper bound). If you get an OutOfMemoryError on redeployment, it is most likely here.
JDK 1.4.2 introduced parallel and concurrent collection of beta quality, which was disabled by default. Unfortunately, it frequently core dumped or experienced other stability issues. JRockit offered more stable support for parallel and concurrent collection and often performed better. As of JDK 5, JRockit and the Sun JDK are close contenders in the area of performance; in my experience, which performs better is very dependent upon the underlying application. Some anecdotal evidence suggests that systems with low contention and large object sizes may perform better with the Sun JDK, and systems with high contention and many small objects may perform better with JRockit. The garbage collection methods in each differ somewhat, but understanding the Sun model makes it easier to understand other GC models. In JDK 5, much less heap tuning is necessary and you’re probably okay just using the -Xmx and -Xms options to size the heap. However, sometimes the JDK guesses wrong with its autotuning and you must fix things. It can also be helpful to fix the size of the new and permanent generations or to explicitly state a preference for parallel garbage collection. A smart thing to do is to test with GC logging written to a file and to try different options during your load test. A complete set of options for Sun’s JDK can be found here. Purchase machines with parallel garbage collection in mind (i.e., at least 4 total cores, and the larger the heap the more cores you need).
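Alongside GC logging to a file (on Sun JDKs of this era typically enabled with flags such as -verbose:gc and -Xloggc:<file>), you can also sample collector activity from inside the JVM through the standard java.lang.management API that arrived with JDK 5. The sketch below is only an illustration; the class name `GcSnapshot` is hypothetical, and the collector names it prints depend on the JVM and the GC options in use.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

/** Hypothetical helper: summarizes garbage collector activity for the running JVM. */
final class GcSnapshot {
    private GcSnapshot() {}

    /** One line per collector: its name, collection count, and accumulated time. */
    static String describe() {
        StringBuilder report = new StringBuilder();
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            report.append(gc.getName())
                  .append(": collections=").append(gc.getCollectionCount())
                  .append(", timeMs=").append(gc.getCollectionTime())
                  .append('\n');
        }
        return report.toString();
    }

    /** Sum of collection time across collectors; values of -1 (undefined) are skipped. */
    static long totalCollectionTimeMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long time = gc.getCollectionTime();
            if (time > 0) {
                total += time;
            }
        }
        return total;
    }

    /** Maximum heap the JVM will attempt to use, in bytes (reflects -Xmx when set). */
    static long heapMaxBytes() {
        return Runtime.getRuntime().maxMemory();
    }
}
```

Taking one snapshot before and one after a load test run gives a rough picture of how much time each set of GC options spent collecting, which can then be compared against the GC log.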
Next, on 32-bit Lintel machines the maximum heap size that you can reliably set is close to 1GB with default settings. This is because the max RSS is 2GB, and the perm space, JDK overhead, and thread stack size * number of threads are all in addition to your -Xmx setting. You can up this RSS to 4GB by using large memory pages (which are available with the 2.6 kernel). You can find instructions on this here. On 64-bit systems, using the -d64 flag, you have theoretically many exabytes of address space (giga, tera, peta, exa). For new hardware purchases, ensure that the machine at least supports EM64T (Intel’s name for AMD’s x64) or x64 (AMD).
Finally, be mindful of the thread stack size of the JDK on your platform. The default on 32-bit Lintel is 512k per thread. The default on a 64-bit system is 1MB per thread. Those may sound like relatively small numbers, but you can easily have 1000 threads on a busy application server. At 512k each, that is roughly 500MB. If this is a 32-bit JVM and you’ve sized the heap relatively large, you may get an “unable to create new native thread” error or an out of memory error that is not connected to actual heap memory. Given that not too many years ago the stack size was around 16k, you can probably afford to lower this. 256k is usually the safe number I give to JBoss customers, but often when tuning onsite I use 128k (which I validate). Be aware that the -Xss option can only make the stack size bigger. You will have to use the -XX:ThreadStackSize option to make it smaller. It is of course essential to load test this change. You may find that your code uses more stack than you thought!
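The arithmetic behind these limits is simple enough to sketch. The class below is a hypothetical back-of-the-envelope calculator, not a JVM API: it adds heap, permanent generation, and thread stacks and compares the total with a process limit. It deliberately ignores JDK overhead and native libraries, so treat its answer as optimistic.

```java
/** Hypothetical calculator for the memory a 32-bit JVM needs beyond -Xmx. */
final class NativeMemoryBudget {
    private static final long KB = 1024L;
    private static final long MB = 1024L * KB;

    private NativeMemoryBudget() {}

    /** Memory reserved for thread stacks: threads * stack size. */
    static long threadStackBytes(int threads, long stackKb) {
        return threads * stackKb * KB;
    }

    /** Heap + permanent generation + thread stacks (JDK overhead NOT included). */
    static long totalBytes(long heapMb, long permMb, int threads, long stackKb) {
        return heapMb * MB + permMb * MB + threadStackBytes(threads, stackKb);
    }

    /** True if the estimated total fits within the given process limit. */
    static boolean fits(long heapMb, long permMb, int threads, long stackKb, long limitMb) {
        return totalBytes(heapMb, permMb, threads, stackKb) <= limitMb * MB;
    }
}
```

With the figures from the text, `threadStackBytes(1000, 512)` is 524,288,000 bytes (500MB). A 1GB heap with a 64MB permanent generation and those 1000 threads still fits under a 2GB limit (`fits(1024, 64, 1000, 512, 2048)` is true), while a 1.5GB heap does not; dropping the stack size to 128k makes the 1.5GB heap fit again, at least before JDK overhead is counted.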
The most obvious database issue to attend to is your connection pool size. Setting this is application-server dependent. You really want a connection pool for any serious system, but be warned that there are several buggy ones out there in open source (as well as closed source).
Something else you may want to attend to is your database selection, its locking strategy, and its isolation level (usually configurable in the appserver). All of this is affected by how well your developers truly understand transactions, and by whether they have used proper transaction demarcation techniques and flush strategies. Don’t take it for granted that they understand these details.
Oracle Database and PostgreSQL offer Multi-Version Concurrency Control (MVCC). When used properly, it can greatly increase concurrency through optimistic locking. This is done in Hibernate and EJB3 with versioning. In fact, to get a pessimistic lock when MVCC is used, you typically have to issue the equivalent of a SQL select for update statement. MySQL 4.x (InnoDB) offers MVCC for reads only (writes still acquire a lock). DB2 and Informix, among others, offer pessimistic row locks or page locks (depending on the platform). This can very seriously affect how your code is written, and you will likely have to do some really tricky things to deal with concurrency. Theoretically MVCC is less efficient, as it requires more copying, but in practice it is more efficient under concurrency. When selecting a database, consider how frequent contention will be and consider the database’s locking strategy.
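To illustrate what optimistic locking with versioning means, here is a minimal in-memory sketch. It is not how Hibernate or EJB3 implement it; the class `VersionedAccount` is hypothetical and merely mimics the usual pattern, in which an update succeeds only if the version read earlier is still the current one (roughly an UPDATE with a “where version = ?” clause).

```java
/** Hypothetical in-memory stand-in for a database row with a version column. */
final class VersionedAccount {
    private long version;
    private long balance;

    VersionedAccount(long initialBalance) {
        this.version = 0;
        this.balance = initialBalance;
    }

    synchronized long version() {
        return version;
    }

    synchronized long balance() {
        return balance;
    }

    /**
     * Applies the update only if nobody else changed the row since
     * expectedVersion was read. Returns false on a stale version, in which
     * case the caller would re-read and retry (or report a conflict).
     */
    synchronized boolean update(long expectedVersion, long newBalance) {
        if (version != expectedVersion) {
            return false; // someone else committed first
        }
        balance = newBalance;
        version++;
        return true;
    }
}
```

No lock is held between reading and writing, which is what allows more concurrency when contention is rare; the cost is that a writer holding a stale version must retry. A pessimistic strategy would instead take the lock at read time, in the manner of select for update.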
This is the tip of a much larger iceberg. I’d have loved to talk more about threading, contention, and clustering strategies. I’d have loved to talk about the most common concurrency and performance scalability horror stories and more about how to diagnose these things, but “the man” said that this article could only be so long. So in conclusion: design for performance and concurrency, write tests, load test, select a sensible topology, test and tune your JVM’s GC, run on a 2.6 kernel, and buy more memory and a 64-bit machine with at least 4 cores (fine if that’s in 2 processors or 4).
- How Akamai Works
- TCP Tuning and Network Troubleshooting
- Gentoo Wiki: TIPS Jumbo Frames
- JBoss Wiki, Linux threading model
- Tuning Garbage Collection with the 5.0 Java Virtual Machine
Copyright (C) 2006 by Red Hat Inc. This article is licensed under a Creative Commons Attribution-ShareAlike 2.5 License (CC BY-SA): http://creativecommons.org/licenses/by-sa/2.5/.