Workload Management Problem Determination
Craig Scott
ibm.com/redbooks Redpaper
International Technical Support Organization
July 2007
REDP-4308-00
Note: Before using this information and the product it supports, read the information in
“Notices” on page vii.
This edition applies to WebSphere Application Server V6.1 for distributed and i5/OS platforms.
Notices
Trademarks
Preface
The team that wrote this paper
Become a published author
Comments welcome
Chapter 5. Collecting diagnostic data
5.1 Collecting JVM logs
5.2 Enabling the WLM trace
5.2.1 Enabling and gathering trace from a thin application client
5.2.2 Enabling the trace from a J2EE application using launchClient
5.2.3 Enabling the trace from the administrative console
5.3 Collecting the plug-in log
5.3.1 Setting the log level
Related publications
IBM Redbooks
Online resources
How to get Redbooks
Help from IBM
Index
This information was developed for products and services offered in the U.S.A.
IBM may not offer the products, services, or features discussed in this document in other countries. Consult
your local IBM representative for information on the products and services currently available in your area.
Any reference to an IBM product, program, or service is not intended to state or imply that only that IBM
product, program, or service may be used. Any functionally equivalent product, program, or service that
does not infringe any IBM intellectual property right may be used instead. However, it is the user's
responsibility to evaluate and verify the operation of any non-IBM product, program, or service.
IBM may have patents or pending patent applications covering subject matter described in this document.
The furnishing of this document does not give you any license to these patents. You can send license
inquiries, in writing, to:
IBM Director of Licensing, IBM Corporation, North Castle Drive, Armonk, NY 10504-1785 U.S.A.
The following paragraph does not apply to the United Kingdom or any other country where such
provisions are inconsistent with local law: INTERNATIONAL BUSINESS MACHINES CORPORATION
PROVIDES THIS PUBLICATION "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESS OR
IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF NON-INFRINGEMENT,
MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. Some states do not allow disclaimer
of express or implied warranties in certain transactions, therefore, this statement may not apply to you.
This information could include technical inaccuracies or typographical errors. Changes are periodically made
to the information herein; these changes will be incorporated in new editions of the publication. IBM may
make improvements and/or changes in the product(s) and/or the program(s) described in this publication at
any time without notice.
Any references in this information to non-IBM Web sites are provided for convenience only and do not in any
manner serve as an endorsement of those Web sites. The materials at those Web sites are not part of the
materials for this IBM product and use of those Web sites is at your own risk.
IBM may use or distribute any of the information you supply in any way it believes appropriate without
incurring any obligation to you.
Information concerning non-IBM products was obtained from the suppliers of those products, their published
announcements or other publicly available sources. IBM has not tested those products and cannot confirm
the accuracy of performance, compatibility or any other claims related to non-IBM products. Questions on
the capabilities of non-IBM products should be addressed to the suppliers of those products.
This information contains examples of data and reports used in daily business operations. To illustrate them
as completely as possible, the examples include the names of individuals, companies, brands, and products.
All of these names are fictitious and any similarity to the names and addresses used by an actual business
enterprise is entirely coincidental.
COPYRIGHT LICENSE:
This information contains sample application programs in source language, which illustrate programming
techniques on various operating platforms. You may copy, modify, and distribute these sample programs in
any form without payment to IBM, for the purposes of developing, using, marketing or distributing application
programs conforming to the application programming interface for the operating platform for which the
sample programs are written. These examples have not been thoroughly tested under all conditions. IBM,
therefore, cannot guarantee or imply reliability, serviceability, or function of these programs.
Enterprise JavaBeans, EJB, Java, JavaBeans, JRE, JSP, JVM, J2EE, and all Java-based trademarks are
trademarks of Sun Microsystems, Inc. in the United States, other countries, or both.
Windows, and the Windows logo are trademarks of Microsoft Corporation in the United States, other
countries, or both.
Linux is a trademark of Linus Torvalds in the United States, other countries, or both.
Other company, product, or service names may be trademarks or service marks of others.
This IBM® Redpaper helps you to debug common problems that are related to
workload management in WebSphere® Application Server network deployment
on distributed and i5/OS® platforms. It discusses the following areas:
High availability manager
EJB™ workload management
Web server plug-in load balancing
Carla Sadtler
International Technical Support Organization, Raleigh Center
Kevin Grigorenko
IBM WebSphere Serviceability team
Andrew Lam
IBM WebSphere Serviceability team
Mahesh Rathi
IBM WebSphere Serviceability team
Your efforts will help increase product acceptance and customer satisfaction. As
a bonus, you will develop a network of contacts in IBM development labs, and
increase your productivity and marketability.
Find out more about the residency program, browse the residency index, and
apply online at:
ibm.com/redbooks/residencies.html
Comments welcome
Your comments are important to us!
[Figure: servlet requests routed from the HTTP server plug-in to the Web
containers of two application servers]
This routing is based on weights that are associated with the cluster members. If
all cluster members have identical weights, the plug-in sends an equal number of
requests to all members of the cluster, assuming no strong affinity
configurations. If the weights are scaled in the range from zero to 20, the plug-in
routes requests to those cluster members with the higher weight value more
often. No requests are sent to cluster members with a weight of zero. Weights
can be changed dynamically during runtime by the administrator.
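The effect of the weights on routing can be estimated with a short calculation.
The following is a minimal sketch, assuming that each member's long-run share of
non-affinity requests is its weight divided by the sum of all weights. The
member names and weights are hypothetical, and the function is an illustration
rather than the plug-in's own algorithm.

```python
def expected_shares(weights):
    """Return each member's expected fraction of non-affinity requests."""
    for name, weight in weights.items():
        # The text above describes weights in the range zero to 20.
        if not 0 <= weight <= 20:
            raise ValueError(f"weight for {name} must be between 0 and 20")
    total = sum(weights.values())
    if total == 0:
        raise ValueError("at least one member needs a non-zero weight")
    return {name: weight / total for name, weight in weights.items()}


# Hypothetical cluster: member names and weights are illustrative only.
shares = expected_shares({"server1": 2, "server2": 2, "server3": 0})
for name, share in shares.items():
    print(f"{name}: {share:.0%}")
```

In this sketch the two members with equal weights each receive half of the
requests, and the member with a weight of zero receives none.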
Multiple application servers with the EJB containers can be clustered, enabling
the distribution of EJB requests between the EJB containers as shown in
Figure 1-2.
[Figure 1-2: EJB requests from a Web container and a Java client distributed
across the EJB containers of clustered application servers]
In this configuration, EJB client requests are routed to available EJB containers
in a round-robin fashion based on assigned server weights. The EJB clients can
be servlets operating within a Web container, stand-alone Java™ programs
using RMI/IIOP, or other EJBs.
You can also choose to have requests sent to the node on which the client
resides as the preferred routing. In this case, only cluster members on that node
are chosen (using the round-robin weight method). Cluster members on remote
nodes are chosen only if a local server is not available.
The plug-in also provides failover capability when one or more of the servers
to which it directs requests is shut down or crashes. The plug-in detects
that an application server is no longer responding to requests and stops
sending requests to that server. It then periodically checks back with the
server to see whether it has become available again.
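The mark-down and periodic re-check behavior can be pictured with a small model.
This is a minimal sketch, assuming a fixed number of seconds between checks
(called retry_interval here); the class and method names are hypothetical and do
not come from the plug-in.

```python
class MemberState:
    """Tracks whether a cluster member is currently eligible for requests."""

    def __init__(self, name):
        self.name = name
        self.marked_down_at = None  # None means the member is considered up

    def mark_down(self, now):
        """Record the time at which the member stopped responding."""
        self.marked_down_at = now

    def mark_up(self):
        """Clear the marked-down state after a successful request."""
        self.marked_down_at = None

    def is_eligible(self, now, retry_interval):
        """A marked-down member is retried once retry_interval seconds pass."""
        if self.marked_down_at is None:
            return True
        return now - self.marked_down_at >= retry_interval


member = MemberState("server1")
member.mark_down(now=100.0)
print(member.is_eligible(now=130.0, retry_interval=60))  # inside the interval
print(member.is_eligible(now=160.0, retry_interval=60))  # interval has elapsed
```

The model shows why a restarted server can remain unused for a while: until the
interval has elapsed, no request is sent to it.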
Ensure all cluster members are responding as expected. You can do this by
using your Web browser to connect to each application server directly rather
than through the HTTP server and plug-in.
First determine the HTTP transport port number (or HTTPS transport port
number if using SSL), and then access the application URL using the
application server host name and port number from your Web browser.
You can determine the HTTP and HTTPS transport port number for each
application server from a text file that is in each profile’s log directory named
AboutThisProfile.txt. This file will list the HTTP transport port number as
shown in Example 2-1:
For example: Assume you normally access your application via an external
URL such as www.ibm.com. This URL points to a Web server with a
WebSphere plug-in that load balances requests across the two servers
named appserver1 and appserver2. You would normally connect to your
application using a URL like the following:
http://www.ibm.com/wlm/BeenThere
After determining the HTTP transport port number of the application server
running on appserver2 to be 9081, you would access the application from your
browser via:
http://appserver2:9081/wlm/BeenThere
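If you have many cluster members, the direct check can be scripted instead of
being done in a browser. The following is a minimal sketch using only the Python
standard library. The host appserver2 and port 9081 come from the example above;
the port used for appserver1 is a placeholder that you would replace with the
value from AboutThisProfile.txt, and the function names are hypothetical.

```python
import urllib.error
import urllib.request


def member_url(host, port, path):
    """Build the direct URL for one cluster member, bypassing the Web server."""
    return f"http://{host}:{port}{path}"


def check_member(url, timeout=5.0, opener=urllib.request.urlopen):
    """Return (ok, detail) for a single direct request to a cluster member."""
    try:
        with opener(url, timeout=timeout) as response:
            return response.status == 200, f"HTTP {response.status}"
    except (urllib.error.URLError, OSError) as exc:
        # Covers refused connections, timeouts, and HTTP error responses.
        return False, str(exc)


# appserver2:9081 is taken from the example; the appserver1 port is a
# placeholder, not a value from this paper.
MEMBERS = [("appserver1", 9080), ("appserver2", 9081)]


def check_all(path="/wlm/BeenThere"):
    """Request the application directly from every member and print the result."""
    for host, port in MEMBERS:
        url = member_url(host, port, path)
        ok, detail = check_member(url)
        print(f"{url}: {'OK' if ok else 'FAILED'} ({detail})")
```

Calling check_all() from a machine that can reach the application servers prints
one line per member. A member that fails here is failing independently of the
Web server and plug-in.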
Check your plug-in fix pack level
The plug-in fix pack level must be equal to or higher than that of every
application server to which it routes requests. If you have applied fix packs
to your application servers, ensure you have also applied the equivalent fix
packs to your plug-in. You can check the fix pack levels of WebSphere
Application Server and the WebSphere plug-in using the versionInfo.bat or
versionInfo.sh script that is provided in the bin directory of each installation.
See the following URL for more information about supported combinations of
Web server plug-in and WebSphere Application Server:
– Web server plug-in policy for WebSphere Application Server
http://www-1.ibm.com/support/docview.wss?uid=swg21160581
Tip: Your deployment manager must also be at the same or a higher fix pack
level than the application servers it manages. It is simplest to keep all
components at the same fix pack level.
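Once you have the levels reported by versionInfo, the comparison itself is easy
to get wrong if done as a string comparison (for example, 6.1.0.11 is higher
than 6.1.0.9). The following is a minimal sketch that compares dotted levels
numerically. The version strings are hypothetical and the function names are
illustrative only.

```python
def parse_level(version):
    """Turn a dotted level such as '6.1.0.9' into a comparable tuple."""
    return tuple(int(part) for part in version.split("."))


def plugin_level_ok(plugin_version, app_server_versions):
    """True if the plug-in is at or above every application server level."""
    plugin = parse_level(plugin_version)
    return all(plugin >= parse_level(v) for v in app_server_versions)


# Hypothetical levels, as might be read from versionInfo output.
print(plugin_level_ok("6.1.0.9", ["6.1.0.7", "6.1.0.9"]))
print(plugin_level_ok("6.1.0.7", ["6.1.0.7", "6.1.0.9"]))
```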
You should also review PMI data from each application server to monitor the
number of requests being served and also server host metrics to ensure servers
are not being overloaded.
You may also need to review logs from all the application servers in the cluster to
determine if a particular cluster member is experiencing some problem that
prevents it from servicing requests.
Depending on what you find, you may also need to look at the following:
PMI data
Server performance metrics
If you find a server is marked down, restart the server and then go to “Validate
the solution” on page 20.
If you find an application server is being marked down, but you determine the
server is not actually down, go to “Server being marked down unexpectedly”
on page 20.
If this is not the cause of the problem, then you will need to determine the load
balance distribution to investigate further by enabling plug-in detail logging
and by reviewing PMI data.
Using the Trace log level, you will get sufficient information to determine the
distribution of requests based on the applications running in your
environment. However, the amount of data generated will be large as it will
log the sequence of events that occurs as each request that comes through
the plug-in is processed. You should only run with this logging level when
necessary.
In the Web server plug-in configuration file, you can see that the context root
from each installed application is mapped to an application server or cluster
that can handle that request.
To process the results from the http-plugin.log file, you will need to look
through the file and extract the data that shows request distribution.
Example 2-4 shows extracts from a plug-in Trace of a request to the snoop
servlet.
You can see the request come in to the plug-in. The plug-in reports that it is
using a round robin algorithm and that it has chosen IBM99TVXRDNode2_server2
to process the request. Near the end of the request output, the plug-in prints
the overall statistics for the chosen server.
A single request to the snoop servlet at Trace log level produces 187 lines
of logging data in the http-plugin.log. In a production environment with high
transaction rates, the amount of data to parse grows very quickly, and it is
not practical to process it by hand. For this reason you should consider using
a scripting language such as Python or Perl to help you process this data and
generate summary reports.
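As an illustration of such a script, the following is a minimal Python sketch
that extracts the per-server STATS records in the format shown in Example 2-7
and prints each server's share of the total requests. It assumes the counters
are cumulative, so the last record for each server is treated as current; the
function names are hypothetical.

```python
import re

# Matches the STATS records shown in Example 2-7; \s+ tolerates line wrapping.
STATS_RE = re.compile(
    r"STATS:\s+ws_server:\s+serverSetFailoverStatus:\s+Server\s+(?P<server>\S+)\s+:"
    r"\s+pendingRequests\s+(?P<pending>\d+)"
    r"\s+failedRequests\s+(?P<failed>\d+)"
    r"\s+affinityRequests\s+(?P<affinity>\d+)"
    r"\s+totalRequests\s+(?P<total>\d+)"
)

# Log records copied from Example 2-7.
SAMPLE = """\
[Thu Apr 12 16:26:15 2007] 00001288 00001f90 - STATS: ws_server:
serverSetFailoverStatus: Server IBM99TVXRDNode1_server1 : pendingRequests 0
failedRequests 0 affinityRequests 405 totalRequests 448.
[Thu Apr 12 16:26:16 2007] 00001288 00001300 - STATS: ws_server:
serverSetFailoverStatus: Server IBM99TVXRDNode2_server2 : pendingRequests 0
failedRequests 0 affinityRequests 0 totalRequests 45.
"""


def latest_stats(log_text):
    """Return the last reported counters for each server in the log text."""
    latest = {}
    for match in STATS_RE.finditer(log_text):
        counters = {k: int(v) for k, v in match.groupdict().items() if k != "server"}
        latest[match.group("server")] = counters
    return latest


def summarize(log_text):
    """Print total, affinity, and share of all requests for each server."""
    stats = latest_stats(log_text)
    grand_total = sum(c["total"] for c in stats.values())
    for server, c in sorted(stats.items()):
        share = c["total"] / grand_total if grand_total else 0.0
        print(f"{server}: total={c['total']} affinity={c['affinity']} share={share:.1%}")


summarize(SAMPLE)
```

To run the same summary against a real log, read the file contents into a string
and pass it to summarize in place of SAMPLE.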
If you confirm that you are experiencing uneven load balancing, then you will
need to look further to determine why.
Check the plug-in configuration to ensure the server weights are set as
appropriate. For more information, go to “Server weights not equal” on
page 19.
If you are seeing a server not being selected because the maximum number of
requests has been reached, go to “Max connections for the server has been
reached” on page 18.
Plug-in logging levels Stats and higher will also report on requests that have
been routed to maintain session affinity. If you are seeing uneven load
balancing due to session affinity, go to “Session affinity is skewing load
distribution” on page 19.
Review the application logs, PMI statistics and server metrics to determine why a
server is not processing requests, or processing fewer requests than you expect.
Note: The resolution of these issues is outside the scope of this paper.
If you find you are experiencing excessive Java heap usage, you will need to
pursue that as the root cause. Java heap problems are outside the scope of
this paper. For a list of references that can help you pursue this type of
problem, see 2.7.1, “Java heap problems” on page 38.
If you find uneven processing of requests but there are no messages in the
plug-in log to indicate why, you will need to increase the plug-in logging level
to further debug this issue. Go to “Analyze the plug-in detail logging” on
page 11.
Review the JVM heap usage in the TPV to get an indication of the health of
the application server. Navigate to Monitoring and tuning →
Performance viewer → Current activity and then click each server
name in the cluster in turn. Expand the Performance modules tree in the left
hand side of the TPV and click JVM Runtime. Then click Show modules.
Figure 2-4 is an example of a healthy JVM heap usage.
If you find you are experiencing excessive Java heap usage, you will need to
pursue that as the root cause. Java heap problems are outside the scope of
this paper. For a list of references that can help you pursue this type of
problem, see 2.7.1, “Java heap problems” on page 38.
Server metrics
Server metrics such as CPU, I/O and page space utilization can also show
uneven or unexpected load balancing. You should use the tool most appropriate
to your operating system to check these statistics. For example, in Windows®
you could use Performance Monitor. In a Unix or Linux® environment, you could
use vmstat or top.
If you do this, then by design you will not see an even distribution of requests
among your application servers. A more detailed discussion of this can be found at:
Understanding IBM HTTP Server plug-in Load Balancing in a clustered
environment
http://www.ibm.com/support/docview.wss?rs=180&uid=swg21219567
Example 2-7 http-plugin.log showing skewed load balancing due to session affinity
[Thu Apr 12 16:26:15 2007] 00001288 00001f90 - STATS: ws_server:
serverSetFailoverStatus: Server IBM99TVXRDNode1_server1 : pendingRequests 0
failedRequests 0 affinityRequests 405 totalRequests 448.
[Thu Apr 12 16:26:16 2007] 00001288 00001300 - STATS: ws_server:
serverSetFailoverStatus: Server IBM99TVXRDNode2_server2 : pendingRequests 0
failedRequests 0 affinityRequests 0 totalRequests 45.
This situation is often seen in test environments and this imbalance will probably
sort itself out as the number of distinct concurrent users with unique sessions
increases. As the number of unique sessions increases, the plug-in will be able
to distribute these more evenly across the cluster members.
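A quick calculation on the counters in Example 2-7 shows how much of the skew is
due to affinity. The following is a minimal sketch, assuming that totalRequests
minus affinityRequests gives the requests that were free to be load balanced;
the function name is hypothetical.

```python
def affinity_fraction(affinity_requests, total_requests):
    """Fraction of a server's requests that were routed by session affinity."""
    if total_requests == 0:
        return 0.0
    return affinity_requests / total_requests


# Counters copied from Example 2-7: (affinityRequests, totalRequests).
servers = {
    "IBM99TVXRDNode1_server1": (405, 448),
    "IBM99TVXRDNode2_server2": (0, 45),
}
for name, (affinity, total) in servers.items():
    non_affinity = total - affinity
    print(f"{name}: {affinity_fraction(affinity, total):.0%} affinity, "
          f"{non_affinity} non-affinity requests")
```

With these numbers, the non-affinity requests are 43 and 45, which is close to
even. Under the stated assumption, the imbalance in the totals comes almost
entirely from requests that had to follow an existing session.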
Example 2-2 on page 11 shows the error you will see in the plug-in log.
You can also see this condition intermittently if the plug-in fix pack level is lower
than the application server fix pack level. Apply the appropriate fix pack to your
plug-ins.
If this does not resolve the issue, go to “The next step” on page 38.
If you are still experiencing an issue with the plug-in not marking the server
down, go to “The next step” on page 38.
Depending on the plug-in logging level you have set, you may need to set it
higher and recreate the problem to collect the data.
If you see the plug-in still not being able to connect to the restored server, you
will need to review the JVM logs.
If you determine that the application server started, but the plug-in is taking
too long to recognize this, go to “Tune the retry interval” on page 27.
Resolution of this problem is outside the scope of this paper, but you can tune
the plug-in timing of when to check if a server has become available.
How often the plug-in checks to see if a server has become available is
controlled by the Retry interval parameter set in the administrative
console as shown in Figure 2-5. Navigate to Servers → Web servers →
server_name → Plug-in properties → Request routing.
You can also see the selected value in the cluster definition in the
plugin-cfg.xml file.
If you still see problems with the plug-in not recognizing a restored server,
go to “The next step” on page 38.
Depending on the plug-in logging level you have set, you may need to set it
higher and recreate the problem to collect the data.
The trace will also show you how the partition table is constructed as shown
in Example 2-12. The partition table is populated when the first request is
processed by the plug-in. It communicates with the application servers and
builds the partition table based on the session failover configuration. The
plug-in gets the partition table from the application servers as it needs to be
able to maintain a consistent partition table to handle session failover among
multiple HTTP servers or following an HTTP server restart.
Being able to see how the partition table is set up will allow you to cross
reference the partition ID to the clone ID of each application server. If you are
not seeing a partition table, you should be seeing session affinity being
maintained by the clone ID.
If you are not seeing the session ID being passed back from the client, for
example, in the JSESSIONID cookie, go to “Session ID missing” on page 30.
If you are seeing that the session ID cannot be matched to an existing clone
ID or partition ID, go to “Invalid clone ID” on page 31.
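When you compare the session ID in the trace against the known clone IDs, it can
help to split the cookie value programmatically. The following is a minimal
sketch that assumes the JSESSIONID value consists of a four-character cache ID
followed by the session ID, and then one or more clone IDs separated by colons.
That layout is an assumption to verify against your own trace; the cookie value
and clone IDs are made up, and the function names are hypothetical.

```python
def split_session_cookie(value):
    """Split a JSESSIONID value into (cache_id, session_id, clone_ids)."""
    # Assumed layout: <4-char cache ID><session ID>:<clone ID>[:<clone ID>...]
    head, *clone_ids = value.split(":")
    return head[:4], head[4:], clone_ids


def unknown_clone_ids(cookie_value, known_clone_ids):
    """Return the clone IDs in the cookie that match no configured server."""
    _, _, clone_ids = split_session_cookie(cookie_value)
    return [c for c in clone_ids if c not in known_clone_ids]


# Made-up cookie value and clone IDs, for illustration only.
cache_id, session_id, clone_ids = split_session_cookie("0000abcDEF123:cloneA")
print(cache_id, session_id, clone_ids)
print(unknown_clone_ids("0000abcDEF123:cloneA:cloneX", {"cloneA"}))
```

An empty clone ID list corresponds to the missing session affinity data case,
and a non-empty result from unknown_clone_ids corresponds to the invalid clone
ID case.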
If you are not seeing the cause of the problem in the plug-in trace, then you will
need to review the JVM logs.
The allowed session tracking mechanisms are configured for the application
servers in the administrative console. Navigate to Servers → Application
servers → server_name → Session management. Figure 2-6 shows cookies set
as the session tracking mechanism.
If you are still experiencing problems with session affinity, go to “The next step”
on page 38.
There are two recommended mechanisms for handling the sharing of session
data:
The first and recommended mechanism is memory-to-memory replication.
You create a session replication domain in your cluster and allow sessions to
be replicated between the application servers.
The second method is to maintain session data in a database that is accessible
to all servers in a cluster.
Plug-in trace
Refer to “Analyze the plug-in detail logging” on page 11 for details on plug-in
tracing.
Finally, the plug-in goes back to the session cookie and checks for another
available server to handle the request and resume the user session based on
the partition ID as shown in Example 2-15. Note also the plug-in retrieving an
updated partition table from the application server.
If you are not seeing a partition table, session replication may not be correctly
configured. Go to “Session replication incorrectly configured” on page 36.
If you are not seeing the cause of the problem in the plug-in trace, then you will
need to review the JVM logs.
Figure 2-9 shows the name of the session replication domain and the replication
mode. From the Distributed environment settings page in the admin console,
click Memory-to-memory replication to review this setting.
If you are still experiencing an issue, go to “The next step” on page 38.
Review the problem classifications to see if there are any other components that
might be causing the problem.
EJB WLM uses a weighted round robin algorithm to decide which server to send
a request to. When you create a cluster, all cluster members are assigned a
weight of 2 so that requests should be evenly distributed among the cluster
members. You have the option of modifying the weight so that powerful servers
are sent more requests for processing than less powerful servers.
Problems with the EJB WLM can cause requests to fail or requests to be
incorrectly routed. This can lead to performance degradation.
Ensure that all the machines in your configuration have TCP/IP connectivity
to each other by running the ping or telnet command:
– From each physical server to the deployment manager
– From the deployment manager to each physical server
– From the client to each physical server
– Between all physical servers
If possible, try accessing the enterprise bean directly on the problem server to
see if there is a problem with TCP/IP connectivity, application server health,
or other problem not related to workload management. You will need a
custom EJB client to achieve this.
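The TCP/IP connectivity check in the first item above can be scripted when there
are many host and port combinations to test. The following is a minimal sketch
using the Python standard library. It tests only whether a TCP connection can be
opened, much as telnet would, and the function names are hypothetical.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


def check_targets(targets):
    """Print reachability for each (host, port) pair and return the failures."""
    failures = []
    for host, port in targets:
        ok = can_connect(host, port)
        print(f"{host}:{port} {'reachable' if ok else 'NOT reachable'}")
        if not ok:
            failures.append((host, port))
    return failures
```

You would run check_targets from each machine in turn, passing a list of
(host, port) pairs for the deployment manager and each physical server, using
the host names and ports from your own configuration.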
Depending on the problem you are investigating, you should consider simplifying
the problem determination process by reducing the number of servers to monitor
while debugging the problem. You would do this by shutting down all but the
minimum number of servers required to recreate the issue.
You can collect it all at once, or start by collecting the JVM logs and then
determining whether one or more of the other logs and traces are needed.
If you cannot find the source of the problem from the logs, you may need to track
the EJB request from the client to the server using a network analyzer to see if
the problem is communications related.
Note that the WSVR0605W message appearing in this example is usually not
associated with an EJB WLM problem but with a hung thread issue in the
application server. However, the reference to the FFDC0009I and FFDC0010I
There are other possible causes for this type of problem and the FFDC logs are a
useful source of debugging information. If you are unable to determine the root
cause of the problem from the FFDC incident log, review the activity log for errors
reported around the time of the problem.
The activity log can give you detailed information about EJB WLM processing.
Look for errors around the time of the problem similar to those shown in
Example 3-2. In this example, all cluster members are down and therefore the
EJB request cannot be processed.
If you cannot find any indication of the problem in the activity log, take a network
trace to track the EJB request and reply through the network.
Recreate the problem and look for the packets that make up the EJB request.
Follow the path of the request from the client to the server and follow the
response back. This should show you where the request goes missing if the
problem is related to network issues. Also look for the following:
Network saturation
Dropped or incorrectly routed packets
If you still cannot resolve the problem, refer to “The next step” on page 61.
You will see the following error in the node agent logs:
The problem goes away when the routing table is built and normal processing
continues.
org.omg.CORBA.TRANSIENT: SIGNAL_RETRY
This is a transient exception caused when the workload management routing
service tries to route a request to a target server. This is the exception thrown to
the client when there is no reply to the request.
Another EJB target may be available to service the request, but the WLM could
not fail the request over transparently because the completion status was not
determined to be "no". In this case, the client application needs to determine
whether to resend the request. For example, if the EJB request updates a
database table, the WLM cannot determine whether the update occurred because it
did not get a response. This decision is therefore left to the application
code.
Either way, the application will need to determine how to recover the transaction.
org.omg.CORBA.NO_IMPLEMENT
This exception is thrown when none of the servers participating in the workload
management cluster are available. The routing service cannot locate a suitable
target for the request.
For example, if the cluster is stopped or if the application does not have a path to
any of the cluster members, then this exception will be thrown back to the client.
There are many kinds of NO_IMPLEMENT exception. You can distinguish them by
the message or minor code associated with the exception:
If possible, attempt to connect to the EJB directly on each cluster member and
review the application server logs for errors that will indicate the cause of the
problem.
NoAvailableTargetException
This exception is internal to the WLM code. You may see it printed in traces
when the WLM trace specification is enabled. It is often expected, especially
in failover and startup scenarios. If a real problem exists, it manifests
itself as one of the NO_IMPLEMENT exceptions above.
If an application server is still failing to serve requests and you are not seeing any
errors, try to restart the server.
If this does not resolve the issue, go to “The next step” on page 61.
Depending on the problem you are investigating, you should consider simplifying
the problem determination process by reducing the number of servers to monitor
while debugging the problem. You would do this by shutting down all but the
minimum number of servers required to recreate the issue.
Recreate the issue and then review the statistics on each server that is acting as
a client:
1. Navigate to Monitoring and tuning → Performance viewer → Current
activity and click each server.
Figure 3-4 shows the EJB client request counts being made from a single servlet
client to a cluster of two EJB servers. You can see that 386 requests were made.
Next, look at the EJB server statistics for the servers running the EJBs.
Figure 3-5 shows the number of requests processed by one of the two servers.
You can see that it processed 152 requests. This is approximately 40% of the
client requests and represents reasonably even workload balancing.
If you are seeing uneven load distribution across the cluster members, it may be
caused by:
If you are still seeing unexpected distribution, first consider that the WLM uses a
weighted proportional scheme to distribute EJB requests and uses feedback
mechanisms that can change the routing behavior on the fly. It reacts to various
scenarios and clustered server load when making routing decisions, so it is
possible that WLM can function perfectly and the requests will not be balanced
exactly as you expect.
If you still believe you are experiencing an uneven load balancing problem,
review the JVM logs for any errors that might be causing a server to have
problems processing requests.
When you modify server weights, you change the distribution of EJB requests
among the servers. For example, if you have two servers and server A is more
powerful than server B, you might choose to assign server weights of 7 and 2
respectively. This means that server A will be sent 7 requests to process for
every 2 that are sent to server B.
EJB WLM includes “fairness” balancing so that server weights of 7 and 2 will
result in the 7:2 distribution ratio with patterns like:
AAAA-B-AAA-B
A-B-A-B-AAAAA
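To see how a weighted scheme can interleave requests instead of sending seven
in a row to one server, consider the following minimal sketch. It uses a
generic "smooth" weighted round robin technique, not the WLM implementation, so
the exact order differs from the patterns above while the 7:2 ratio is the same.
The function name is hypothetical.

```python
def weighted_sequence(weights, count):
    """Return `count` picks using a smooth weighted round robin."""
    total = sum(weights.values())
    current = {name: 0 for name in weights}
    picks = []
    for _ in range(count):
        # Every server earns credit in proportion to its weight.
        for name, weight in weights.items():
            current[name] += weight
        # The server with the most credit is chosen and pays back the total.
        chosen = max(current, key=current.get)
        current[chosen] -= total
        picks.append(chosen)
    return picks


picks = weighted_sequence({"A": 7, "B": 2}, 9)
print("".join(picks), picks.count("A"), picks.count("B"))
```

Over any nine consecutive requests in this sketch, server A is chosen seven
times and server B twice, with the B requests spread through the cycle.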
If there are a large number of requests involved in transactional affinity, this can
cause uneven routing of requests. The WLM PMI data will show you how many
requests were routed to a particular server due to transactional affinity.
Figure 3-7 shows the WLM PMI data for a server. Look for the
StrongAffinityIIOPRequestCount field. In this example there is no
transaction affinity in use.
Note: This situation is likely to only occur in a test environment and will
probably resolve itself as the number of distinct users and therefore unique
transactions increases.
There are many reasons why this might occur. For more information, see 4.2.6,
“Network issues causing split views” on page 73.
JVM logs
Look for the following:
CORBA messages indicating transaction retries or rollbacks such as:
org.omg.CORBA.TRANSIENT: SIGNAL_RETRY
org.omg.CORBA.TRANSACTION_ROLLEDBACK
If the server that the EJB client is attempting to execute the transaction against is
down and the EJB request cannot be completed, you will see CORBA errors as
described in “CORBA errors” on page 48. If you have other available cluster
members that should be able to service the request then you will need to take an
EJB WLM trace (see “Enabling the WLM trace” on page 88).
Since the transaction might have completed, failing over this request to another
server could result in this request being serviced twice. Therefore the WLM puts
the cluster member into Quiesce mode. While in Quiesce mode, the server will
reject all incoming requests which it determines are new work, but still allow
in-flight requests to complete. This is primarily designed to allow transaction
work to finish, as described above, and to prevent unnecessary
TRANSACTION_ROLLEDBACK exceptions.
It might be that no servers are available to process the request. Refer to “CORBA
errors” on page 48.
JVM logs
Look for the following:
Ensure the server has finished starting. Look for the message that indicates a
server is ready to process requests:
[3/05/07 17:01:21:959 EST] 0000000a WsServerImpl A WSVR0001I:
Server ejbserver1 open for e-business
If this message has not appeared, either wait for the server to finish startup or
try restarting it.
Ensure the EJB has started correctly:
If the EJB fails to start, you will see exceptions that describe what has gone
wrong with the application on startup. Example 3-3 shows an EJB failing to
start due to a ClassNotFoundException.
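The check for the WSVR0001I message described above can also be scripted. The
following Python sketch assumes that each log entry occupies a single line in
the log file; the function name is hypothetical and the pattern is built only
from the message text shown above.

```python
import re

# Pattern built from the WSVR0001I message text shown above.
READY = re.compile(r"WSVR0001I: Server (\S+) open for e-business")


def servers_ready(lines):
    """Return the server names reported as open for e-business, in order."""
    names = []
    for text in lines:
        match = READY.search(text)
        if match:
            names.append(match.group(1))
    return names


# The sample entry is the one shown above, joined onto a single line.
sample = [
    "[3/05/07 17:01:21:959 EST] 0000000a WsServerImpl A WSVR0001I: "
    "Server ejbserver1 open for e-business",
]
print(servers_ready(sample))
```

An empty result for a server that should be running suggests that startup has
not completed, or that the log being read belongs to a different server.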
You can confirm that this is the problem by ensuring that the server is up and
running and then wait for the unusable interval to elapse before checking to
determine whether load balancing occurs as described in “Analyze the PMI data”
on page 51.
You can adjust the unusable interval by setting the custom property
com.ibm.websphere.wlm.unusable.interval to a value more suitable to your
environment. The parameter is set in seconds.
This parameter is passed to the JVM as a “-D” parameter. If you are using the
standalone WebSphere or Java application client as your EJB client, modify the
command line to include this parameter:
java -Dcom.ibm.websphere.wlm.unusable.interval=150 MyApplication
If your client is an application server, you set this parameter on the JVM
properties page. Navigate to Servers → Application Servers →
server_name → Java and process management → Process definition → Java
Virtual Machine. Scroll down to Generic JVM arguments and enter the
parameter in the text box. Save your changes. You will need to restart the server
for this change to take effect.
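When reviewing a client command line or a Generic JVM arguments string, it can
help to confirm which interval is actually being passed. The following Python
sketch is one way to do that; the function name is hypothetical, and it assumes
the arguments are separated by whitespace as on a normal command line.

```python
import shlex

PROPERTY = "com.ibm.websphere.wlm.unusable.interval"


def unusable_interval(jvm_arguments):
    """Return the interval in seconds from a JVM argument string, or None."""
    prefix = "-D" + PROPERTY + "="
    for token in shlex.split(jvm_arguments):
        if token.startswith(prefix):
            # int() raises ValueError if the value is not a whole number.
            return int(token[len(prefix):])
    return None


arguments = "-Dcom.ibm.websphere.wlm.unusable.interval=150 MyApplication"
print(unusable_interval(arguments))  # 150
```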
Review the problem classifications to see if there are any other components that
might be causing the problem.
You can disable the HA Manager on a given application server process if you do
not use any of the following services:
Memory-to-memory replication
Singleton failover
Workload management routing
On-demand configuration routing
Do not disable the HA Manager on any administrative process (node agent or the
deployment manager) unless the HA Manager is disabled on all application
server processes in that core group.
If you disable the HA Manager on one member of a cluster, you must disable it
on all of the other members of that cluster.
The following WebSphere Information Center article describes how to disable the
HA Manager.
Disabling or enabling a high availability manager
http://publib.boulder.ibm.com/infocenter/wasinfo/v6r1/index.jsp?topic=/com.ibm.websphere.nd.doc/info/ae/ae/trun_ha_ham_enable.html
2. Ensure that all the machines in your configuration have TCP/IP connectivity
to each other by running the ping or telnet command:
– From each physical server to the deployment manager
– From the deployment manager to each physical server
– From the client to each physical server
– Between all physical servers
3. Ensure your core group size is not excessive.
The default core group that is created when you create a cell is made of all
member servers in that cell including node agents and the deployment
manager.
IBM recommends that you limit your core group size to 50 members for
performance reasons. If you have large numbers of cell members, you will
need to split up your core groups. For more information about this, see
“Excessive core group sizes” on page 75.
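The connectivity checks in step 2 can be repeated quickly with a small script
when many hosts are involved. The following Python sketch tests whether a TCP
connection can be opened to a given host and port, which is comparable to the
telnet check rather than to ping. The function names are hypothetical, and the
hosts and ports to test depend entirely on your own configuration.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


def check_targets(targets, timeout=3.0):
    """Map each (host, port) pair to True or False depending on reachability."""
    return {target: can_connect(target[0], target[1], timeout) for target in targets}
```

Run the script from each machine in turn, passing check_targets a list of
(host, port) pairs for the other machines, so that every direction listed in
step 2 is covered.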
You may need to review logs from more than just the application server that is
reporting the problems. The High Availability service is cluster-aware so
problems can be spread across the cluster.
Depending on the problem you are investigating, you should consider simplifying
the problem determination process by reducing the number of servers to monitor
while debugging the problem. You would do this by shutting down all but the
minimum number of servers required to recreate the issue.
Example 4-1 also shows the HMGR0207I message that shows a server was a core
group coordinator but has lost the leadership of the core group. Finally, it shows
the messages that are produced when a new core group view is installed.
If all is running as expected and all core group servers are started, then the
number of active members (AV) will equal the number of defined members (DF).
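The AV and DF comparison can be automated when there are many view messages to
review. The following Python sketch assumes the counts appear in the message
as AV=n and DF=n, in that order; the sample line is illustrative only, so
compare it with the messages in your own logs before relying on the pattern.
The function name is hypothetical.

```python
import re

# Assumes the counts appear as "AV=<n>" followed later by "DF=<n>".
VIEW = re.compile(r"AV=(\d+).*?DF=(\d+)")


def missing_members(line):
    """Return DF minus AV for a view message, or None if the counts are absent."""
    match = VIEW.search(line)
    if match is None:
        return None
    active, defined = int(match.group(1)), int(match.group(2))
    return defined - active


# Illustrative line only; the real message contains additional fields.
sample = "New view installed, view size is 2 (AV=2, CN=2, CD=2, DF=3)"
print(missing_members(sample))  # 1
```

A result of zero corresponds to the healthy case described above; a positive
result is the number of defined members that are not in the view.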
A new core group view is installed every time a server that is part of the core
group is stopped or started. Installing a new view might result in significant,
temporary spikes in the amount of CPU consumed and the network bandwidth
used.
Figure 4-3 shows the servers that the HA Manager expects to be in the core
group. That is, the list of servers that the core group coordinator is aware of and
connected to. Each of these servers will be a member of one or more core
groups. Each core group has its own core group view, that is, a view of the
servers participating in the core group.
You can also review the core groups that are running at any given point in time
from the administrative console.
Note that core groups or servers will only appear in the Runtime tab when the
associated server or servers are running.
You will typically see a lot of core groups in the Runtime tab. For example, if you
have enabled HTTP session replication and have installed six Web applications,
that is, with six separate context roots, then you will see six separate core
groups. You will also see one core group for each server to manage the session
replication cache. These will be in addition to any other core groups to manage
the other singleton services.
If you click a core group name, you will see the servers that are part of that
particular core group as shown in Figure 4-5. This is known as the core group
view.
This problem is due to a Data Replication Service (DRS) issue, where the DRS
processes consume all the threads from the default thread pool. As a
result, the transport threads used by the HA Manager (the DCS connections) are
closed unexpectedly. This leads to instability in the core group views, which
can cause unexpected failures under error conditions and excessive consumption
of system resources.
Using the loopback adapter can lead to this problem. By default, the DCS
endpoint addresses use a Host field of “*” (asterisk) to indicate any host can
respond on that particular port as shown in Figure 4-7.
When DCS resolves the host name of a machine that it is currently running on, in
addition to the expected IP address of that host, it also gets back the loopback
adapter address. This causes connectivity issues within the topology.
For large topologies, create multiple smaller core groups and link these core
groups with a core group bridge. Core group bridges allow servers in separate
core groups to share WLM information.
If you are still experiencing issues with singleton services not starting on failover,
go to “The next step” on page 84.
The reasons why you are unexpectedly seeing multiple instances of singleton
services are likely to be the same as those causing singleton services not to
start. Go to “Singleton services not starting on failover” on page 66 to
resolve this issue.
The resolution of other startup failures is outside the scope of this document.
In order for the HA Manager to assign ownership of the transaction log to the
Transaction Manager, the application server must establish itself as a member of
a core group view. This requires the HA Manager to establish network
connectivity between all running core group members.
If you are still experiencing issues with singleton services not starting on failover,
go to “The next step” on page 84.
If you don’t see any of these symptoms, continue with the problem analysis by
looking at the verbose GC data.
Verbose GC data
If your JVM heap size usage is getting close to its maximum, you are likely to see
an excessive number of garbage collections being performed. Frequent garbage
collections in the JVM will increase the CPU being used on the server and
reduce the number of requests that can be processed.
You can also look at verbose GC data using the IBM Pattern Modeling and
Analysis Tool for Java Garbage Collector (PMAT), which is run from the IBM
Support Assistant (ISA). You can use this tool to show you graphs of the
GC activity.
An OutOfMemory error
This will appear in the verbose GC output and also in the application server
logs. Look for the message shown in Example 4-6.
If you are seeing this occurring only in the cluster member running the core
group coordinator, go to “High JVM heap usage” on page 80.
Server metrics
Server metrics such as CPU, I/O and page space utilization can also show
uneven or unexpected load balancing. You should use the tool most appropriate
to your operating system to check these statistics. For example, in Windows you
could use Task Manager or Performance Monitor while in a Unix or Linux
environment, you could use vmstat or top.
However, ensure you do not allocate too much memory to the JVM heap, because you
run the risk of either limiting native memory for the Java process or causing the
system to page memory. Paging severely degrades performance.
You could also consider disabling the HA Manager processes if you do not utilize
the services it provides. For more information, see “Determine if you need to use
the HA Manager” on page 64.
This means that there is a resource constraint on the server where the message
appears. Examples of this include JVM heap memory exhaustion, CPU utilization
or system memory being paged.
2. Specify servers that are not often stopped and restarted and that run on hosts
with spare CPU and memory resources.
You could also consider disabling the HA Manager processes if you do not utilize
the services it provides.
If you are still experiencing issues with too much load on the server hosting the
HA Manager, go to “The next step” on page 84.
Review the problem classifications to see if there are any other components that
might be causing the problem.
For more information about tracing, see the following WebSphere Information
Center articles:
Enabling trace on client and standalone applications
http://publib.boulder.ibm.com/infocenter/wasinfo/v6r1/topic/com.ibm.websphere.base.doc/info/aes/ae/ttrb_entrstandal.html
Object request broker troubleshooting tips
http://publib.boulder.ibm.com/infocenter/wasinfo/v6r1/topic/com.ibm.websphere.nd.doc/info/ae/ae/rtrb_orbcomp2.html
For example:
app_server_root/bin/launchClient.sh ear_file \
-JVMOptions="-Dcom.ibm.CORBA.Debug=true -Dcom.ibm.CORBA.CommTrace=true" \
-CCtrace=ORBRas=all=enabled -CCtracefile=orbtrace.txt \
-CCtraceMode=basic
The ORB trace output is captured in a unique trace file named orbtrace.txt in the
current directory.
By default, the trace is logged to a file called trace.log in the same location as
SystemOut.log and SystemErr.log, profile_root/logs/server_name.
You can determine the name and location of your plug-in log by looking in the
administrative console for your Web server definition as shown in Figure 5-1.
Navigate to Web servers → web_server_name → Plug-in properties and check
the Log file name field.
There are several levels of logging at the plug-in that are relevant to monitoring
load balancing. The logging levels are cumulative so Stats includes and builds
on the information you get from Warn, and so on. The levels are:
Warn - this log level will report warnings issued by the plug-in.
Stats - provides basic statistics on how many requests are sent to each
cluster member. It also reports on requests sent to a server to maintain
The logging level you should choose will depend on the complexity of your load
balancing environment. If you have only one application being load balanced,
then Stats will be sufficient to determine the load distribution. However if you
have multiple applications, some with session affinity and some without, then you
will need to move to a higher log level. You will probably be trying to resolve a
production problem. Given this, it may be best to collect Trace data so that you
do not have to keep running tests and traces at subsequently higher log levels to
collect more information in your production environment.
Tip: If you have not enabled automatic generation and propagation of the
plug-in, you will need to do this manually for the change to log level to take
effect.
You do not need to restart your Web server for this change to take effect. By
default, the plug-in reloads its configuration every 60 seconds. You need only
wait for the reload interval to pass.
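To confirm which log level and reload interval the plug-in will pick up, you
can inspect the plug-in configuration file directly. The following Python
sketch assumes the file has a Config root element with a RefreshInterval
attribute and a Log child element with LogLevel and Name attributes; verify
this against your own plugin-cfg.xml. The function name and the sample path
are hypothetical.

```python
import xml.etree.ElementTree as ET


def plugin_log_settings(xml_text):
    """Return the reload interval, log level and log file from the XML text."""
    root = ET.fromstring(xml_text)
    log = root.find("Log")
    return {
        "refresh_interval": root.get("RefreshInterval"),
        "log_level": None if log is None else log.get("LogLevel"),
        "log_file": None if log is None else log.get("Name"),
    }


# Illustrative fragment only; a real file contains many more elements.
sample = """<Config RefreshInterval="60">
  <Log LogLevel="Stats" Name="/path/to/http_plugin.log"/>
</Config>"""
print(plugin_log_settings(sample))
```

Reading the copy of the file that the Web server actually uses, rather than
the copy held by the deployment manager, shows whether the change has been
propagated.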
The publications listed in this section are considered particularly suitable for a
more detailed discussion of the topics covered in this paper.
IBM Redbooks
For information about ordering these publications, see “How to get Redbooks” on
page 95. Note that some of the documents referenced here may be available in
softcopy only.
WebSphere Application Server Network Deployment V6: High Availability
Solutions, SG24-6688
WebSphere Application Server V6 Scalability and Performance Handbook,
SG24-6392
Approach to Problem Determination in WebSphere Application Server V6,
REDP-4073
Online resources
These Web sites are also relevant as further information sources:
WebSphere Application Server V6.1 Information Center
http://publib.boulder.ibm.com/infocenter/wasinfo/v6r1/index.jsp
Index of MustGather documentation
http://www-1.ibm.com/support/docview.wss?uid=swg21145599
Web server plug-in policy for WebSphere Application Server
http://www-1.ibm.com/support/docview.wss?uid=swg21160581
Understanding IBM HTTP Server plug-in Load Balancing in a clustered
environment
http://www.ibm.com/support/docview.wss?rs=180&uid=swg21219567
Understanding HTTP plug-in failover in a clustered environment
http://www.ibm.com/support/docview.wss?rs=180&uid=swg21219808
Help from IBM
IBM Support and downloads
ibm.com/support
Back cover
Diagnose Web server plug-in problems
Diagnose high availability manager problems
Diagnose EJB WLM problems
This IBM Redpaper helps you to debug common problems that are related to
workload management in WebSphere Application Server network deployment on
distributed and on i5/OS platforms. It discusses the following areas:
HA Manager
EJB workload management
Web server plug-in load balancing
INTERNATIONAL TECHNICAL SUPPORT ORGANIZATION
BUILDING TECHNICAL INFORMATION BASED ON PRACTICAL EXPERIENCE
REDP-4308-00