Discussion:
[rabbitmq-discuss] Often failed to produce message due to TIMEOUT WAITING FOR ACK
Wong Kam Hoong
2012-09-25 02:46:18 UTC
Hi RabbitMQ Team,

*My RabbitMQ*
Version : 2.8.6
Cluster Environment : Yes, All Nodes in Disk Mode
waitForConfirmsOrDie: Yes, 10000
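
For reference, the publish path in my code looks roughly like this (a trimmed-down sketch, not my exact code; the broker address, message body and routing are just placeholders):

import com.rabbitmq.client.Channel;
import com.rabbitmq.client.Connection;
import com.rabbitmq.client.ConnectionFactory;
import com.rabbitmq.client.MessageProperties;

public class ConfirmingProducer {
    public static void main(String[] args) throws Exception {
        ConnectionFactory factory = new ConnectionFactory();
        factory.setHost("192.168.0.100");              // placeholder broker address
        Connection connection = factory.newConnection();
        Channel channel = connection.createChannel();

        channel.confirmSelect();                       // put the channel into publisher-confirm mode

        byte[] body = "{\"example\":\"payload\"}".getBytes("UTF-8"); // placeholder body
        channel.basicPublish("", "QUEUE_SMS_IN_4",     // default exchange, queue as routing key (simplified)
                MessageProperties.PERSISTENT_TEXT_PLAIN, body);

        // Block until the broker confirms the publish; if no confirm arrives
        // within 10 seconds the client closes the channel, which surfaces as
        // the 406 "TIMEOUT WAITING FOR ACK" ShutdownSignalException shown in
        // the log below.
        channel.waitForConfirmsOrDie(10000);

        channel.close();
        connection.close();
    }
}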

Recently I found that my program often fails to produce messages to the
RabbitMQ server due to the following exception:

*EXCEPTION LOG*
2012-09-24 11:53:19,665 WARN [INMSG-6] RabbitMQQueueSession - Failed to
produce message
{"creationDate":1348458796689,"lastModified":1348458796689,"queueMessage":{"id":44519977,"creationDate":1348458796688,"shortCode":28933,"serviceId":602,"lastModified":1348458796688,"operatorCode":13,"mcc":502,"msgId":"6e04.7dc9180b350b.00000195","class":"nplay.json.data.SnsInMessageValue","scheduledTime":1348458796688,"dataValue":{"message":"@ohhAzirahAzid:
start matrik aku suka duduk kat
library","creationDate":1348458796688,"shortCode":28933,"lastModified":1348458796688,"mcc":502,"operatorCode":13,"msisdn":60134406223,"class":"nplay.json.data.SmsOutMessageValue","scheduledTime":1348458796688}}}
: priority 4 to QUEUE_SMS_IN_4
com.rabbitmq.client.ShutdownSignalException: clean channel shutdown;
reason: #method<channel.close>(reply-code=406, reply-text=TIMEOUT WAITING
FOR ACK, class-id=0, method-id=0)
at
com.rabbitmq.client.impl.ChannelN.waitForConfirms(ChannelN.java:182)
at
com.rabbitmq.client.impl.ChannelN.waitForConfirmsOrDie(ChannelN.java:217)
at
nplay.json.rabbitmq.client.RabbitMQQueueSession.produceTextMessage(RabbitMQQueueSession.java:124)
at
nplay.common.mq.AQueueClient.produceTextMessage(AQueueClient.java:111)
at
nplay.json.rabbitmq.client.QueueJsonRabbitMQClient.produceTextMessage(QueueJsonRabbitMQClient.java:289)
at nplay.common.mq.AQueueClient.send(AQueueClient.java:185)
at nplay.common.mq.AQueueClient.send(AQueueClient.java:275)
at
nplay.endpoint.json.smpp.SnsInHandler.sendQueueMessage(SnsInHandler.java:21)
at
nplay.endpoint.json.smpp.SnsInHandler.sendQueueMessage(SnsInHandler.java:11)
at
nplay.endpoint.json.smpp.AIncomingMessageHandler.update(AIncomingMessageHandler.java:60)
at nplay.endpoint.json.AJsonHandler.update(AJsonHandler.java:25)
at java.util.Observable.notifyObservers(Observable.java:142)
at
nplay.endpoint.common.ASpringEventSource.process(ASpringEventSource.java:202)
at
nplay.endpoint.common.ASpringEventSource.run(ASpringEventSource.java:157)
at java.lang.Thread.run(Thread.java:619)
at
nplay.common.util.ThreadManager$ManagableThread.run(ThreadManager.java:52)
Caused by: com.rabbitmq.client.ShutdownSignalException: clean channel
shutdown; reason: #method<channel.close>(reply-code=406, reply-text=TIMEOUT
WAITING FOR ACK, class-id=0, method-id=0)
at com.rabbitmq.client.impl.ChannelN.close(ChannelN.java:521)
at com.rabbitmq.client.impl.ChannelN.close(ChannelN.java:480)
at
com.rabbitmq.client.impl.ChannelN.waitForConfirmsOrDie(ChannelN.java:222)
... 14 more


Based on the log, the problem seems to be related to "*NACKs Received*", and
the *ChannelN* code throws the TimeoutException with reply-code *406*
(PRECONDITION_FAILED).

I checked that all the RabbitMQ servers are still working fine, and I believe
the timeout I set for *waitForConfirmsOrDie* (10000 ms) should be sufficient.


Can you advise me on this problem and how to solve it?


Thanks & Regards,
Wong
Matthias Radestock
2012-09-25 06:32:04 UTC
Wong,
Post by Wong Kam Hoong
com.rabbitmq.client.ShutdownSignalException: clean channel shutdown;
reason: #method<channel.close>(reply-code=406, reply-text=TIMEOUT
WAITING FOR ACK, class-id=0, method-id=0)
at
com.rabbitmq.client.impl.ChannelN.waitForConfirms(ChannelN.java:182)
Based on the log, the problem seems to be related to "*NACKs Received*", and
the *ChannelN* code throws the TimeoutException with reply-code *406*
(PRECONDITION_FAILED).
I don't think it has anything to do with nacks - unless you are actually
seeing some exception mentioning nacks.

Looks like a straightforward timeout.
Post by Wong Kam Hoong
I checked that all the RabbitMQ servers are still working fine, and I believe
the timeout I set for *waitForConfirmsOrDie* (10000 ms) should be sufficient.
10s should indeed generally be long enough. But one can certainly
envision scenarios where it won't be. What's the general health of your
rabbit when this happens? Plenty of free memory and disk? No strange
errors in the logs? Are you using mirrored/HA queues at all? Could there
possibly be a spike, i.e. lots of messages getting published in a short
space of time?


Regards,

Matthias.
Wong Kam Hoong
2012-09-26 06:18:16 UTC
Hi Matthias,

What's the general health of your rabbit when this happens?
The RabbitMQ nodes seemed fine when the incident happened:

*Master Node Usage*
Memory : 300M
CPU % : Average 30%

*Slave Node Usage*
Memory : 130M
CPU % : Average 30%

Plenty of free memory and disk?
Yes, there is plenty of free memory and disk space on both servers.

Are you using mirrored/HA queues at all?
Yes, I'm using HA in all queues.
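All the queues are declared as mirrored via the x-ha-policy queue argument, roughly like this (a simplified sketch, not my exact code; the class and method names are just for illustration):

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import com.rabbitmq.client.Channel;

public class QueueSetup {
    // Declare a durable queue mirrored across all cluster nodes
    // (RabbitMQ 2.8.x configures mirroring via the x-ha-policy queue argument).
    static void declareMirroredQueue(Channel channel, String queueName) throws IOException {
        Map<String, Object> args = new HashMap<String, Object>();
        args.put("x-ha-policy", "all");
        channel.queueDeclare(queueName, true, false, false, args);
    }
}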

Could there possibly be a spike, i.e. lots of messages getting published in
a short space of time?
No, I checked and there were only a few messages during that time, e.g.
2012-09-26 03:28:32 => 4 messages

No strange errors in the logs?
After the incident happened, RabbitMQ did log a warning, e.g.

*RabbitMQ Log*
=WARNING REPORT==== 26-Sep-2012::03:29:25 ===
closing AMQP connection <0.8661.18> (192.168.0.100:46151 ->
192.168.0.100:5672):
connection_closed_abruptly

=INFO REPORT==== 26-Sep-2012::03:29:37 ===
accepting AMQP connection <0.8897.18> (192.168.0.100:43836 ->
192.168.0.100:5672)


Regards,
Wong

Matthias Radestock
2012-09-26 07:17:28 UTC
Wong,
*Master Node Usage*
Memory : 300M
CPU % : Average 30%
*Slave Node Usage*
Memory : 130M
CPU % : Average 30%
there were only a few messages during that time, e.g.
2012-09-26 03:28:32 => 4 messages
If the CPU is averaging 30% when the load is so low then the machines
must be doing something else too. I notice from...
=INFO REPORT==== 26-Sep-2012::03:29:37 ===
accepting AMQP connection <0.8897.18> (192.168.0.100:43836 -> 192.168.0.100:5672)
...that your clients are on the same machine as one of the nodes. Could
it be that the machine was busy with other tasks, such as your application?

Also, 3:29 in the morning is just the sort of time that I'd expect
systems to run various housekeeping tasks which could hammer the CPU or
disk. Those can result in significant delays, particularly when your
messages are persistent.

I recommend increasing the timeout substantially, e.g. to 1 minute. That
way if the problem really is just due to load it will go away.
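In the Java client that is just a larger value passed to waitForConfirmsOrDie, e.g. (a minimal sketch; the wrapper class and method name are made up):

import com.rabbitmq.client.Channel;

public final class ConfirmTimeouts {
    // Wait up to 60 seconds (instead of 10) for all outstanding publisher
    // confirms on this channel; the call still fails straight away if the
    // broker nacks a message. (Class/method name invented for this example.)
    public static void waitUpToOneMinute(Channel channel) throws Exception {
        channel.waitForConfirmsOrDie(60000);
    }
}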
Are you using mirrored/HA queues at all?
Yes, I'm using HA in all queues.
We are aware of two bugs in 2.8.x that can cause confirms to go missing.
One of them involves setting x-message-ttl=0 on queues together with use of
the 'immediate' flag on publish. So if you are doing that then, well, don't
:). The second issue involves (re)starting of nodes. So check whether any
of the nodes in your cluster (re)started around the time of the problem.
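For the first of those bugs, the pattern to look for in your code is roughly this (a hypothetical fragment purely for illustration; the queue name and class name are made up, and it uses the publish overload of the client of that era that still carries the 'immediate' flag):

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import com.rabbitmq.client.Channel;

public class PatternToAvoid {
    // The combination known to lose confirms on 2.8.x: a queue declared with
    // x-message-ttl=0 plus publishing with the 'immediate' flag set.
    static void publishWithImmediate(Channel channel, byte[] body) throws IOException {
        Map<String, Object> args = new HashMap<String, Object>();
        args.put("x-message-ttl", 0);                 // messages expire immediately
        channel.queueDeclare("example-queue", true, false, false, args); // hypothetical queue name

        // basicPublish(exchange, routingKey, mandatory, immediate, props, body)
        channel.basicPublish("", "example-queue", false, true, null, body);
    }
}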

Regards,

Matthias.
Wong Kam Hoong
2012-10-03 04:24:58 UTC
Hi Matthias,

Thanks for your advice. After I set the timeout to 1 minute the problem was
gone, but 1 minute seems too long.

Regards,
Wong


Matthias Radestock
2012-10-03 21:05:21 UTC
Wong,
Post by Wong Kam Hoong
Thanks for your advice. After I set the timeout to 1 minute the problem was
gone, but 1 minute seems too long.
Well yes, but the point of setting the timeout to 1 minute was to check
that there isn't a more fundamental problem, like the confirms not being
received at all.

Is a 1 minute timeout actually causing a problem for you, bearing in
mind that, as seems to be the case, most of the time the confirm arrives
a lot quicker?

If it really does bother you then I suggest you start recording some data on
the basic resource usage of your system - CPU, memory, disk, etc. - and see
whether there is any correlation with confirm times.
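
On the confirm-time side of that, something as simple as timing each publish-and-wait and logging the slow cases would do, e.g. (a rough sketch; the class name and the one-second "slow" threshold are arbitrary, and it assumes the channel is already in confirm mode):

import com.rabbitmq.client.Channel;

public final class ConfirmTiming {
    // Publish one message, measure how long it takes until the broker
    // confirms it, and log anything unusually slow so it can be correlated
    // with the CPU/memory/disk data collected at the same time.
    public static void publishAndTime(Channel channel, String queue, byte[] body) throws Exception {
        long start = System.currentTimeMillis();
        channel.basicPublish("", queue, null, body);
        channel.waitForConfirmsOrDie(60000);
        long elapsedMs = System.currentTimeMillis() - start;
        if (elapsedMs > 1000) {                      // arbitrary "slow" threshold
            System.err.println("Publisher confirm took " + elapsedMs + " ms for queue " + queue);
        }
    }
}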


Regards,

Matthias.
