Discussion:
[rabbitmq-discuss] Long pauses when closing many channels simultaneously
josh
2013-09-26 10:50:31 UTC
Permalink
I have 1 RabbitMQ Connection, 10K publishing Channels, 10K consuming
Channels and 100 Threads. All channels are in rapid publish/consume "loops"
using exclusive queues and external synchronization; no possibility of
publish, consume or close happening simultaneously on a channel. Closing
channels simultaneously on 100 threads incurs a delay of 2.5 seconds for
each of the first 100 channels. All subsequent closures are processed in
less than 0.5 seconds. If I bump it up to 20K+20K channels the initial 100
each take 5 seconds to close. With 30K+30K each takes 10 seconds, where
we're pretty much at the channel-max for a connection. Similarly if I
reduce the number of threads to 50 then it's the first 50 channels that
close slowly.

Let me restate that... *With 30K+30K channels the first 100 each take 10
seconds to close using 100 simultaneous threads. The remaining 59,900 each
take less than 0.5 seconds.* My feeling is that there's some funky
connection-wide synchronization/continuation going on here. Hit the
connection up with 100 channel-close requests on 100 threads simultaneously
and it baulks. Whatever causes that initial spasm doesn't seem to affect
subsequent close operations and everything swims along nicely.

I've tried ramping up the number of connections to relieve the pressure.
This certainly works with predictable results. With 30K+30K channels
spread evenly over 2 connections the initial 100 channel-close delays are
halved from 10 seconds to 5 seconds. Use 10 connections and the delay is
imperceptible when compared to the subsequent 59,900 channel closures. Jump
to 50K+50K channels (we can do this with 10 connections but not 1
connection due to channel-max) and the delays start to creep back in again.

My concerns with this approach are that 1) multiple connections are
discouraged in the documentation due to i/o resource overhead and that 2)
it's not clear for my application how to sensibly predict the optimum
number of channels per connection. If there is a soft limit to the number
of channels per connection why is it not documented or made available in
the api? If I have to manage a number of connections and allocate channels
across those connections in a robust manner I feel like I'm doing the work
of a client library!

I've tried my hand at modifying the client library by not waiting for
channel-close acknowledgements from the RabbitMQ server. This worked like a
charm. Channels were closed instantly with no delay in the client and
confirmed as closed on the server. Eight hours later though and I was out
of heap space as the channel resources internal to the client library were
not being released. I haven't managed to isolate the source of the delay
either... is it in the client library or the server itself? To progress
further I'd need to trace the wire protocol. I think I'm off track with
this approach!

*Questions:*

Before making application changes I'd like to know if this is a known issue
with the Java client? Are there better workarounds than multiple
connections and application-level channel management? In practice my actual
application uses around 20K channels per process, which I don't feel is
excessive, and message throughput is actually pretty light as I'm
leveraging RabbitMQ more for its routing capabilities. If you think the
number of channels is a problem in itself then please say so! I could
refactor to use fewer channels but then I'd be sharing channels and would
either have to synchronize their usage or ignore documentation guidelines.
The error handling paradigm makes this cumbersome though; any channel error
results in its termination, so it's difficult to isolate errors, prevent
them from permeating across unrelated publishers/consumers and recover in a
robust manner.

*Detail:*

RabbitMQ server 3.1.5, Java amqp-client 3.1.5. Also tried 3.1.4, 3.1.3 in
the same test harness and seen what I assume to be exactly the same
behaviour in 3.0 and prior in production application.

My test harness is standalone but a bit obtuse to post here. I create 10K
test clients each of which creates an exclusive queue and runs a continuous
publish-consume-publish-consume loop. The publisher and consumer each have
their own channels. I'm using channel.basicConsume with DefaultConsumers
and a default consumer executor service. Also used my own Java executor
service, which appears to be the same as the default implementation, with
various numbers of threads to no discernible effect.

Messages consumed in handleDelivery spawn a new publish task via a
java.concurrent fixed thread pool, so the consumer thread is not tied up or
used in any channel operations whatsoever. The channel-close is
synchronized so that it cannot happen at the same time as a publish.
However there's nothing stopping a consume happening while close is being
processed - the consumer is out of my control.
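To make the synchronization concrete, here's a stripped-down sketch of the guard I'm using (hypothetical names; RawChannel stands in for com.rabbitmq.client.Channel so the sketch is self-contained):

```java
import java.util.concurrent.atomic.AtomicBoolean;

// RawChannel stands in for com.rabbitmq.client.Channel; only the two
// calls being guarded are modelled.
interface RawChannel {
    void basicPublish(byte[] body) throws Exception;
    void close() throws Exception;
}

// Per-channel guard: publish and close can never run at the same time
// on the same channel, matching the external synchronization described
// above. Consumes are driven by the client library's own threads and
// are not covered, exactly as in my harness.
final class GuardedChannel {
    private final RawChannel delegate;
    private final AtomicBoolean open = new AtomicBoolean(true);

    GuardedChannel(RawChannel delegate) { this.delegate = delegate; }

    synchronized void publish(byte[] body) throws Exception {
        if (open.get()) delegate.basicPublish(body);  // dropped if closed
    }

    synchronized void close() throws Exception {
        if (open.compareAndSet(true, false)) delegate.close();
    }
}
```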

Thanks for any pointers!
Michael Klishin
2013-09-26 13:10:17 UTC
Permalink
Let me restate that... With 30K+30K channels the first 100 each take 10 seconds to close using 100 simultaneous threads. The remaining 59,900 each take less than 0.5 seconds. My feeling is that there's some funky connection-wide synchronization/continuation going on here. Hit the connection up with 100 channel-close requests on 100 threads simultaneously and it baulks. Whatever causes that initial spasm doesn't seem to affect subsequent close operations and everything swims along nicely.
This is correct. Closing either a channel or connection involves waiting for a reply from RabbitMQ.
It would be interesting to see thread dumps and as much information about lock contention as you can provide. My guess is that it is _channelMap, but I'm not a very reliable prediction machine.
I've tried ramping up the number of connections to relieve the pressure. This certainly works with predictable results. With 30K+30K channels spread evenly over 2 connections the initial 100 channel-close delays are halved from 10 seconds to 5 seconds. Use 10 connections and the delay is imperceptible when compared to the subsequent 59,900 channel closures. Jump to 50K+50K channels (we can do this with 10 connections but not 1 connection due to channel-max) and the delays start to creep back in again.
Again, hard to tell what the contention point is without runtime data.
My concerns with this approach are that 1) multiple connections are discouraged in the documentation due to i/o resource overhead and that 2) it's not clear for my application how to sensibly predict the optimum number of channels per connection. If there is a soft limit to the number of channels per connection why is it not documented or made available in the api?
See ConnectionFactory.DEFAULT_CHANNEL_MAX and ConnectionFactory#setRequestedChannelMax.

Note that some clients have a different default (like 65536 channels).
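For reference, a minimal sketch of requesting a channel-max explicitly (assumes a broker reachable on localhost; host and the 20000 figure are illustrative, not recommendations):

```java
import com.rabbitmq.client.Connection;
import com.rabbitmq.client.ConnectionFactory;

public class ChannelMaxExample {
    public static void main(String[] args) throws Exception {
        ConnectionFactory factory = new ConnectionFactory();
        factory.setHost("localhost");
        // Cap the number of channels this connection will allow; the
        // broker negotiates the effective limit down from this request.
        factory.setRequestedChannelMax(20000);
        Connection conn = factory.newConnection();
        System.out.println("negotiated channel-max: " + conn.getChannelMax());
        conn.close();
    }
}
```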
I've tried my hand at modifying the client library by not waiting for channel-close acknowledgements from the RabbitMQ server. This worked like a charm. Channels were closed instantly with no delay in the client and confirmed as closed on the server. Eight hours later though and I was out of heap space as the channel resources internal to the client library were not being released. I haven't managed to isolate the source of the delay either... is it in the client library or the server itself?
You need to make sure that ChannelManager#disconnectChannel is used. VisualVM should
pretty quickly show what objects use most heap space.
Before making application changes I'd like to know if this is a known issue with the Java client?
I've seen this before with 2 other clients. In one case the problem was different and mostly solved
(I have not tried 60K channels but for 6-8K it worked reasonably well). Another client is built on the
Java one. So, it's a known problem that few people run into.
Are there better workarounds than multiple connections and application-level channel management? In practice my actual application uses around 20K channels per process, which I don't feel is excessive, and message throughput is actually pretty light as I'm leveraging RabbitMQ more for its routing capabilities. If you think the number of channels is a problem in itself then please say so! I could refactor to use fewer channels but then I'd be sharing channels and would either have to synchronize their usage or ignore documentation guidelines.
This is something that should be improved in the Java client, but in the meantime you may need
to use a pool of connections that will open channels using round robin or similar.
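As a sketch (hypothetical class; in a real application T would be com.rabbitmq.client.Connection and you would call createChannel() on whatever next() returns):

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;

// Round-robin pool: channel creation (and therefore close traffic)
// is spread evenly across a fixed set of connections.
final class RoundRobinPool<T> {
    private final List<T> members;
    private final AtomicLong counter = new AtomicLong();

    RoundRobinPool(List<T> members) {
        if (members.isEmpty()) throw new IllegalArgumentException("empty pool");
        this.members = members;
    }

    // Strict rotation; floorMod keeps the index valid even if the
    // counter ever wraps past Long.MAX_VALUE.
    T next() {
        long n = counter.getAndIncrement();
        return members.get((int) Math.floorMod(n, (long) members.size()));
    }
}
```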
The error handling paradigm makes this cumbersome though; any channel error results in its termination, so it's difficult to isolate errors, prevent them from permeating across unrelated publishers/consumers and recover in a robust manner.
This is in part why having one channel per thread is a very good idea.

To summarize: yes, this is a known but rare problem. If you can provide profiling and thread dump
information that will help isolating the contention point, I think the issue can be resolved or largely
mitigated in a future version.

MK



Martin Rogan
2013-09-26 15:26:36 UTC
Permalink
Post by josh
Let me restate that... With 30K+30K channels the first 100 each take 10
seconds to close using 100 simultaneous threads. The remaining 59,900 each
take less than 0.5 seconds. My feeling is that there's some funky
connection-wide synchronization/continuation going on here. Hit the
connection up with 100 channel-close requests on 100 threads simultaneously
and it baulks. Whatever causes that initial spasm doesn't seem to affect
subsequent close operations and everything swims along nicely.
This is correct. Closing either a channel or connection involves waiting
for a reply from RabbitMQ.
It would be interesting to see thread dumps and as much information about
lock contention as you can provide. My guess is that it is _channelMap, but
I'm not a very reliable prediction machine.
In tag 3.1.5 I can point to the close(...) method in ChannelN.java at line
569:

// Now that we're in quiescing state, channel.close was sent and
// we wait for the reply. We ignore the result.
// (It's NOT always close-ok.)
notify = true;
k.getReply(-1);

Here k.getReply(-1) does the waiting. In my dodgy mod I skipped these two
lines and also the finally block (notify==false):

} finally {
    if (abort || notify) {
        // Now we know everything's been cleaned up and there should
        // be no more surprises arriving on the wire. Release the
        // channel number, and dissociate this ChannelN instance from
        // our connection so that any further frames inbound on this
        // channel can be caught as the errors they are.
        releaseChannel();
        notifyListeners();
    }
}

Hence the channel resource leak and subsequent OOM. Although the delay
disappeared and the channels were closed on the server it doesn't reveal
where the delay was incurred. The client may have just been waiting for
replies to come in after other data on the connection, with no lock
contention, but on the other hand how do the subsequent closures get
processed so much quicker?
Post by josh
I've tried ramping up the number of connections to relieve the pressure.
This certainly works with predictable results. With 30K+30K channels
spread evenly over 2 connections the initial 100 channel-close delays are
halved from 10 seconds to 5 seconds. Use 10 connections and the delay is
imperceptible when compared to the subsequent 59,900 channel closures. Jump
to 50K+50K channels (we can do this with 10 connections but not 1
connection due to channel-max) and the delays start to creep back in again.
Again, hard to tell what the contention point is without runtime data.
My concerns with this approach are that 1) multiple connections are
discouraged in the documentation due to i/o resource overhead and that 2)
it's not clear for my application how to sensibly predict the optimum
number of channels per connection. If there is a soft limit to the number
of channels per connection why is it not documented or made available in
the api?
See ConnectionFactory.DEFAULT_CHANNEL_MAX and
ConnectionFactory#setRequestedChannelMax.
Note that some clients have a different default (like 65536 channels).
In my 3.1.5 client ConnectionFactory.DEFAULT_CHANNEL_MAX==0 and
connection.getChannelMax()==65,536.
Post by josh
I've tried my hand at modifying the client library by not waiting for
channel-close acknowledgements from the RabbitMQ server. This worked like a
charm. Channels were closed instantly with no delay in the client and
confirmed as closed on the server. Eight hours later though and I was out
of heap space as the channel resources internal to the client library were
not being released. I haven't managed to isolate the source of the delay
either... is it in the client library or the server itself?
You need to make sure that ChannelManager#disconnectChannel is used. VisualVM should
pretty quickly show what objects use most heap space.
As revealed by YourKit, mountains of Channels were not cleaned up by my
dodgy mod skipping disconnectChannel(). But I figured it was unsafe to
invoke, as we don't know that "everything's been cleaned up and there
should be no more surprises arriving on the wire."
Post by josh
Before making application changes I'd like to know if this is a known
issue with the Java client?
I've seen this before with 2 other clients. In one case the problem was
different and mostly solved
(I have not tried 60K channels but for 6-8K it worked reasonably well).
Another client is built on the
Java one. So, it's a known problem that few people run into.
A few seconds here and there is not so problematic really. RabbitMQ is so
critical to our application though that we need to ensure we're not falling
off any edges.
Post by josh
Are there better workarounds than multiple connections and
application-level channel management? In practice my actual application
uses around 20K channels per process, which I don't feel is excessive, and
message throughput is actually pretty light as I'm leveraging RabbitMQ more
for its routing capabilities. If you think the number of channels is a
problem in itself then please say so! I could refactor to use fewer channels
but then I'd be sharing channels and would either have to synchronize their
usage or ignore documentation guidelines.
This is something that should be improved in the Java client, but in the
meantime you may need
to use a pool of connections that will open channels using round robin or similar.
Done.
Post by josh
The error handling paradigm makes this cumbersome though; any channel
error results in its termination so it's difficult to isolate errors,
prevent them from permeating across unrelated publishers/consumers and
recover in a robust manner.
This is in part why having one channel per thread is a very good idea.
To summarize: yes, this is a known but rare problem. If you can provide
profiling and thread dump
information that will help isolating the contention point, I think the
issue can be resolved or largely
mitigated in a future version.
Thanks. Will do. Would you prefer a plain old Java app that you can profile
yourself?

mrogan
2013-09-27 11:43:43 UTC
Permalink
Post by Michael Klishin
To summarize: yes, this is a known but rare problem. If you can provide
profiling and thread dump
information that will help isolating the contention point, I think the
issue can be resolved or largely
mitigated in a future version.
MK
Uploaded test harness to recreate the delays,
SimultaneousChannelCloseTest.java
<http://rabbitmq.1065348.n5.nabble.com/file/n29973/SimultaneousChannelCloseTest.java>
. Requires amqp-client-3.1.5.jar. No arguments, just change the RabbitMQ
host, port, vhost, username and password in main(). Example output:



Here 99% of the channels closed in less than 1 second. The remaining 100
channels (equal to the number of threads used) closed in a significantly
longer time. Increasing the number of channels increases the delay but only
for the same number of channels as there are THREADS. Increasing the number
of CONNECTIONS alleviates these delays. You can fiddle with these parameters
in the source.

Having played with this for a few hours now my take is that the initial
channel-closes are simply delayed waiting on other stuff to happen on the
connection. There's lots of publishing and consuming going on. Subsequent
channel-closes are not delayed because they're all queued right behind the
initial closes so there's nothing else really going on. Of course the test
harness can be modified to create connection activity while the closes are
being invoked and the delays will be seen in all channel-closes instead of
just the first few. The channel-close is a synchronous roundtrip operation
so we just have to wait. Maybe what I'm looking for is just a way to close
the channels asynchronously so that I don't have to wait too?

So my thinking is that I have 3 ways to alleviate these delays. 1) Use more
connections. 2) Fire off channel-close asynchronously, ensuring that I don't
try to use the channel subsequently. 3) Modify the Java amqp-client to offer
an asynchronous close that fires off the close and returns, while discarding
anything else received subsequently on that channel until the reply is
received, at which point it cleans up properly with disconnectChannel().
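Option 2) might look something like this (a hypothetical helper; AutoCloseable stands in for Channel so the sketch compiles on its own, and the pool size is arbitrary):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Fire-and-forget closes: the close-ok round trip still happens, but on
// a background thread, so application threads never wait for it.
final class AsyncCloser {
    private final ExecutorService pool = Executors.newFixedThreadPool(4);

    // AutoCloseable stands in for com.rabbitmq.client.Channel here.
    void closeLater(AutoCloseable channel) {
        pool.submit(() -> {
            try {
                channel.close();
            } catch (Exception e) {
                // channel may already be dead; nothing sensible to do
            }
        });
    }

    void shutdownAndWait() throws InterruptedException {
        pool.shutdown();
        pool.awaitTermination(30, TimeUnit.SECONDS);
    }
}
```

The application must of course never touch a channel after handing it to closeLater().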

What are your thoughts on an asynchronous close in the amqp-client? Is it
warranted? When should listeners be notified via shutdownCompleted()?
Immediately on firing the close, or finally when the server reply is
received? Is this just complicating the API, and would it lead to equivalent
asynchronous operations everywhere else, such as when closing connections?

Thanks, Martin.



--
View this message in context: http://rabbitmq.1065348.n5.nabble.com/Long-pauses-when-closing-many-channels-simultaneously-tp29937p29973.html
Sent from the RabbitMQ mailing list archive at Nabble.com.
Emile Joubert
2013-10-11 11:51:49 UTC
Permalink
Hi,
Post by mrogan
Here 99% of the channels closed in less than 1 second. The remaining 100
channels (equal to the number of threads used) closed in a significantly
longer time.
Thanks for the test code. I tried this on the latest version of the
broker and was not able to reproduce the problem. All closures took less
than 500ms. Can you confirm that the problem exists when using a
nightly build of the broker and the Java client?



-Emile
mrogan
2013-10-14 11:06:51 UTC
Permalink
Post by Emile Joubert
Thanks for the test code. I tried this on the latest version of the
broker and was not able to reproduce the problem. All closures took less
than 500ms. Can you confirm that the problem exists when using a
nightly build of the broker and the Java client?
-Emile
That's interesting. I really wasn't expecting that! I was using the official
3.1.5 release beforehand. I tried again with last night's 3.1.5.41014 build
of both the broker and Java client. I got the same results:



I installed the 3.1.5.41014 broker locally using the mac-standalone
distribution. I'm running OSX 10.6.8 Snow Leopard with Java 1.6.0_51. With
the previous 3.1.5 tests I ran against a remote cluster over a Gb LAN where
it was hosted on Scientific Linux 6.4 (a RHEL clone). None of this seemed to
affect the results other than the expected extra time for the network hop.

I wonder if you think it's worth me retrying the tests on Windows and/or OSX
Mountain Lion with Java 1.7?

Thanks,

Martin



Michael Klishin
2013-10-14 11:21:23 UTC
Permalink
Post by mrogan
I wonder if you think it's worth me retrying the tests on Windows and/or OSX
Mountain Lion with Java 1.7?
Given that JDK 6 is no longer supported, you should switch to JDK 7.

MK
mrogan
2013-10-14 12:01:01 UTC
Permalink
Post by Michael Klishin
Given that JDK 6 is no longer supported, you should switch to JDK 7.
MK
You're absolutely right, I should switch just as soon as my IT department
gets around to upgrading my Mac! Java 7 is not supported on Snow Leopard and
apparently it takes 2 years to upgrade it. Martin :(



Michael Klishin
2013-10-14 12:07:19 UTC
Permalink
Post by mrogan
Java 7 is not supported on Snow Leopard and
apparently it takes 2 years to upgrade it
Are you aware of Oracle's own JDK 7 installers available from
http://www.oracle.com/technetwork/java/javase/downloads/index.html?

This may or may not be an option according to your IT dept policies
but these JDKs should work on Snow Leopard.

Apple no longer maintains their own JDK.

MK
mrogan
2013-10-15 14:22:31 UTC
Permalink
Are you aware of Oracle's own JDK 7 installers?
MK
Larry says no... "Java from Oracle requires Mac OS X 10.7.3 or later."

I'm going to try this on my kids' PC just to rule out OS/Java issues.

Thanks, Martin.



Emile Joubert
2013-10-14 11:33:37 UTC
Permalink
Hi,
Post by mrogan
That's interesting. I really wasn't expecting that! I was using the official
3.1.5 release beforehand. I tried again with last night's 3.1.5.41014 build
Can you please confirm whether you believe there still is a problem when
using the latest version? There have been a number of improvements since
the last release which could explain difference in behaviour when using
a nightly build.
Post by mrogan
I wonder if you think it's worth me retrying the tests on Windows and/or OSX
Mountain Lion with Java 1.7?
Using a different OS is unlikely to make a difference. What question are
you trying to answer by doing this?




-Emile
Martin Rogan
2013-10-15 14:58:00 UTC
Permalink
On 14 October 2013 12:33, Emile Joubert <emile at rabbitmq.com> wrote:

Can you please confirm whether you believe there still is a problem when
Post by Emile Joubert
using the latest version?
Yes, I still think there's a problem with the latest version. My results
with 3.1.5.41014 show that exactly 100 channels took longer than 500ms to
close.

Using a different OS is unlikely to make a difference. What question are
Post by Emile Joubert
you trying to answer by doing this?
As you don't see the same issues as me when running the same code I'm
trying to isolate any external differences. While I would usually dismiss
OS differences out of hand they have tripped me up in the past. For
example, fiddling with so_linger on a Linux distribution has implications
for all the sockets sharing the same network interface - closing a socket
can block other sockets indefinitely - while sockets are more isolated on
Windows.
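A minimal illustration of the SO_LINGER point (values are arbitrary; with linger enabled, close() may block until unsent data is flushed or the timeout expires, so a slow peer can stall the closer):

```java
import java.net.Socket;

public class LingerExample {
    public static void main(String[] args) throws Exception {
        Socket s = new Socket();
        s.setSoLinger(true, 5);               // close() may block up to 5 s
        System.out.println(s.getSoLinger());  // reports 5 while enabled
        s.close();
    }
}
```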
Emile Joubert
2013-10-17 09:11:52 UTC
Permalink
Hi,
Post by Martin Rogan
On 14 October 2013 12:33, Emile Joubert <emile at rabbitmq.com
Can you please confirm whether you believe there still is a problem when
using the latest version?
Yes, I still think there's a problem with the latest version. See my
results with 3.1.5.41014 show that exactly 100 channels took longer than
500ms to close.
Could you please try reposting these results? This information did not
accompany your last message.



-Emile
