[rabbitmq-discuss] abnormally large memory use in "binaries"
Brian Hammond
2013-10-08 19:23:06 UTC
I've been seeing a problem on and off where RabbitMQ uses a ton of memory despite having very few messages, and today it suddenly got much worse.

This morning I upgraded to RabbitMQ 3.1.5.41008 (Erlang R14B04), using the nightly build from 08-Oct-2013.

About 6 hours later I have 19GB of memory in use with 800 messages.

According to the management console, the messages are using up 2mb and 19GB is in "binaries".

Is this a known issue that is being actively investigated?

I'm happy to provide additional details if it will help work towards a resolution.
Emile Joubert
2013-10-11 09:07:16 UTC
Hi Brian,
Post by Brian Hammond
According to the management console, the messages are using up 2mb and
19GB is in "binaries".
That would make sense if the average message size was 24Mb. Are you sure
that is not the case?
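
(For the arithmetic: 19GB spread over 800 messages comes to roughly 24MB per message.)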



-Emile
Brian Hammond
2013-10-11 17:23:38 UTC
It happened again today with just 500 messages in a queue where the average message size is 280B, which would be about 135k.

Doesn't rabbit track message memory usage separately in the "Message store index"?

The disparity from way under 1mb to 18gb of used space makes it seem unlikely message storage is at fault unless things are just not being cleaned up (such as a memory leak).

I take it from your response that this is not currently on the radar as a systemic problem.
Emile Joubert
2013-10-14 10:03:51 UTC
Hi Brian,
Post by Brian Hammond
It happened again today with just 500 messages in a queue where the
average message size is 280B, which would be about 135k.
How did you determine the average message size? Is it possible that the
queue could contain a small number of extremely large messages that skew
the average?
Post by Brian Hammond
Doesn't rabbit track message memory usage separately in the "Message store index"?
The disparity from way under 1mb to 18gb of used space makes it seem
unlikely message storage is at fault unless things are just not being
cleaned up (such as a memory leak).
Does the memory use increase gradually or in large jumps?
Post by Brian Hammond
I take it from your response that this is not currently on the radar as a systemic problem.
I'm not aware of any bugs that could explain the problem you describe.
Could you supply the output of "rabbitmqctl report" while you observe
the reported condition?




-Emile
Brian Hammond
2013-10-15 15:34:21 UTC
Post by Emile Joubert
How did you determine the average message size? Is it possible that the
queue could contain a small number of extremely large messages that skew
the average?
The messages on that queue are generated from a very small Java model object.

I looked at a sample from the routing key that queue listens to, and the average is from 1000 messages.
Post by Emile Joubert
Does the memory use increase gradually or in large jumps?
It seems to be pretty gradual up until it hit the ceiling.
Post by Emile Joubert
I'm not aware of any bugs that could explain the problem you describe.
Could you supply the output of "rabbitmqctl report" while you observe
the reported condition?
Sure thing. I grabbed it at 10-minute intervals, but can do finer granularity if needed.

I'll try to attach it to this post.

If that doesn't work, I'll upload it somewhere and post a URL.

I started it on 2013/10/14 at 10:53 am; around 2:53 pm it hit 18GB and stayed there until one of the guys restarted it around 5:24 pm.

Just for fun, here is the size of "binary" (in GB) as a summary of the usage:

?{"10:53":0.13, "11:03":1.03, "11:13":2.25, "11:23":2.98, "11:33":3.27, "11:43":3.66, "11:53":4.09, "12:03":4.55, "12:13":5.21, "12:23":6.87, "12:33":9.28, "12:43":9.49, "12:53":10.58, "13:03":10.16, "13:13":11.28, "13:23":12.13, "13:33":12.72, "13:43":14.01, "13:53":13.03, "14:03":14.16, "14:13":15.37, "14:23":15.21, "14:33":16.27, "14:43":17.33, "14:53":18.37, "15:03":18.79, "15:13":18.78, "15:23":18.89, "15:33":18.72, "15:43":18.83, "15:53":18.83, "16:03":18.79, "16:14":18.79, "16:24":18.79, "16:34":18.78, "16:44":18.84, "16:54":18.84, "17:04":18.84, "17:14":18.84 }
-------------- next part --------------
A non-text attachment was scrubbed...
Name: send-rabbit-reports.tgz
Type: application/octet-stream
Size: 459311 bytes
Desc: not available
URL: <http://lists.rabbitmq.com/pipermail/rabbitmq-discuss/attachments/20131015/b0a21ef2/attachment.obj>
Emile Joubert
2013-10-16 11:11:15 UTC
Hi Brian,
Post by Brian Hammond
Sure thing. I grabbed it at 10-minute intervals, but can do finer granularity if needed.
Thanks for the detailed diagnostics. I don't see any reason for the
memory use in the reports. Could you try to run this command on the
broker while the condition is being observed and send back the reply?


rabbitmqctl eval 'lists:sublist(lists:reverse(lists:sort([{Sz, P,
process_info(P)} || {{binary, [{_, Sz, _}]}, P} <-
[{erlang:process_info(P, binary), P} || P <- erlang:processes()]])), 20).'







-Emile
Emile Joubert
2013-10-17 09:00:27 UTC
Post by Emile Joubert
rabbitmqctl eval 'lists:sublist(lists:reverse(lists:sort([{Sz, P,
process_info(P)} || {{binary, [{_, Sz, _}]}, P} <-
[{erlang:process_info(P, binary), P} || P <- erlang:processes()]])), 20).'
Actually, that command won't provide useful results. Could you please
try this instead?

rabbitmqctl eval 'lists:sublist(lists:reverse(lists:sort([{length(Bin),
lists:usort(Bin), P, process_info(P)} || {{binary, Bin = [_|_]}, P}
<-[{erlang:process_info(P, binary), P} || P <- erlang:processes()]])), 20).'





-Emile
Brian Hammond
2013-10-18 18:20:50 UTC
I ran the first command every 10 minutes for 24 hours, but the second command blew up rabbit most of the time:

Error: unable to connect to node rabbit@project2: nodedown

DIAGNOSTICS
===========

nodes in question: [rabbit@project2]

hosts, their running nodes and ports:
- project2: [{rabbitmqctl2916,46327}]

current node details:
- node name: rabbitmqctl2916@project2
- home dir: /var/lib/rabbitmq
- cookie hash: g9gANsnT4AWb9xKsznR+nA==

I stopped the collection script after 11 iterations, but only have 5 files with anything in them.

They range in size from 109KB to over 1GB.

Compressed they are about 280MB (https://www.dropbox.com/s/jvaaohg59cl6gn3/rabbit-memory-details.tgz); not sure if that is useful, since it covers such a small and irregular time interval.

Additionally, a colleague pointed out an interesting failure case where we were out of memory with 4 messages in queues.
Brian Hammond
2013-10-18 19:33:04 UTC
This is a topper: https://www.dropbox.com/s/ccq0zidcck2ca8j/rabbitMQ-dead-with-0-messages.png

Using 18gb with 0 messages...
Emile Joubert
2013-10-21 17:01:37 UTC
Hi Brian,
Post by Brian Hammond
https://www.dropbox.com/s/ccq0zidcck2ca8j/rabbitMQ-dead-with-0-messages.png
Using 18gb with 0 messages...
Thank you very much for this information. This has allowed me to form a
hypothesis about where the memory might be going. In order to confirm,
can you supply the output of netstat on the consumer connections? I'm
particularly interested in the sizes of the send and receive queues
(the Send-Q and Recv-Q columns).

Also, you originally claimed that messages are about 280B. From the
information provided the sizes appear to be much larger than that, some
of them multiple megabytes in size. Can you think of an explanation for
the discrepancy?




-Emile
Brian Hammond
2013-10-21 18:50:03 UTC
Do you have a specific set of arguments in mind for netstat? If not I'll go look at the man page.

The one queue that had 4 messages in it was the only one with messages of that size.

Other queues have variously sized messages, some of which are quite large.

BTW, I'm working on a test program trying to reproduce the issue from the outside. Slim odds, granted, but I noticed something weird: when I ctrl-c'd my test program at one point, all the queues, connections and channels went away, but the admin console still reported about 10GB of memory in use 15 minutes later.

I'll provide more information as my tests progress.
Emile Joubert
2013-10-22 11:31:40 UTC
Hi Brian,
Post by Brian Hammond
Do you have a specific set of arguments in mind for netstat? If not I'll
go look at the man page.
netstat shows this information by default.
Post by Brian Hammond
I'll provide more information as my tests progress.
From the reports it appears that all consumers are either using
unlimited prefetch or a value of 500. It is likely that the broker
memory use will be reduced if consumers use a much lower prefetch count,
say 1. Would it be possible for you to try that? The reason why it might
help in your particular case is most of the memory is being consumed by
the Erlang processes that write to the network.

I realise that it might be difficult for you to make this change to all
consumers. Let me know if this is too onerous so we can discuss
alternatives.


-Emile
Brian Hammond
2013-10-22 16:20:16 UTC
Running this now: netstat -t -n | grep :5672

As far as the prefetch goes, we are using a value of 500. When using a value like 1, we saw pretty bad performance.

My understanding of prefetch is that the consumer asks for that number of messages and then has to send an acknowledgement before more messages are sent to it. If we have memory issues when there are almost no messages, then our prefetch value seems an unlikely cause.

If we set prefetch really low, I could see us having a lot of messages in the queue. Setting it high or unlimited should actually be better for the server.
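
To make that concrete, here is roughly the shape of our consumer setup (a minimal sketch using the stock RabbitMQ Java client; the host, queue name and handler are placeholders rather than our real code):

    import com.rabbitmq.client.*;

    public class PrefetchConsumerSketch {
        public static void main(String[] args) throws Exception {
            ConnectionFactory factory = new ConnectionFactory();
            factory.setHost("localhost");            // placeholder broker host
            Connection conn = factory.newConnection();
            Channel channel = conn.createChannel();

            // basic.qos: the broker keeps at most 500 unacknowledged messages
            // in flight on this channel before it stops sending more.
            channel.basicQos(500);

            boolean autoAck = false;                 // manual acknowledgements
            channel.basicConsume("example.queue", autoAck, new DefaultConsumer(channel) {
                @Override
                public void handleDelivery(String consumerTag, Envelope envelope,
                                           AMQP.BasicProperties properties, byte[] body)
                        throws java.io.IOException {
                    // ... process the message ...
                    // acking frees the broker to deliver another message
                    getChannel().basicAck(envelope.getDeliveryTag(), false);
                }
            });
        }
    }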

Nonetheless, I'll try it after this current batch with netstat runs to explosion.

Still no luck reproducing the issue with a simple test program...
Michael Klishin
2013-10-22 16:27:29 UTC
Post by Brian Hammond
My understanding of prefetch is that the consumer asks for that number of messages and then has to send an acknowledgement before more messages are sent to it. If we have memory issues when there are almost no messages, then our prefetch value seems an unlikely cause.
This is correct but note that basic.qos is set for a channel, not a consumer.
Post by Brian Hammond
If we set prefetch really low, I could see us having a lot of messages in the queue. Setting it high or unlimited should actually be better for the server.
RabbitMQ does not remove messages until they are acknowledged. Not using manual acknowledgements
may ease the pressure on the server, of course, if it's an option for your system.
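
For illustration, automatic acknowledgements are just a flag on basic.consume with the Java client (a sketch that reuses the channel and imports from the consumer example earlier in the thread; the queue name is a placeholder):

    // autoAck = true: the broker treats a message as acknowledged as soon as
    // it is delivered, so unacknowledged messages never pile up for this
    // consumer (at the cost of losing messages if the consumer dies
    // mid-processing).
    boolean autoAck = true;
    channel.basicConsume("example.queue", autoAck, new DefaultConsumer(channel) {
        @Override
        public void handleDelivery(String consumerTag, Envelope envelope,
                                   AMQP.BasicProperties properties, byte[] body) {
            // ... process the message; no basicAck call needed ...
        }
    });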

MK
Brian Hammond
2013-10-28 19:32:17 UTC
Sorry for the delay.

Here is the report with the netstat output included.



-------------- next part --------------
A non-text attachment was scrubbed...
Name: rabbit-report-with-netstat.tgz
Type: application/octet-stream
Size: 471807 bytes
Desc: not available
URL: <http://lists.rabbitmq.com/pipermail/rabbitmq-discuss/attachments/20131028/83816f5e/attachment.obj>
Emile Joubert
2013-10-29 15:53:08 UTC
Hi Brian,
Post by Brian Hammond
Here is the report with the netstat output included.
Thanks for this information. Can you confirm whether limiting the
prefetch count to a small number (e.g. 1) helped? I see the prefetch
counts are still 500 or unlimited. This change may impact performance,
as you rightly point out, but it would be extremely useful to confirm
whether it constrains memory use.

I realise that it might be difficult for you to make this change to all
consumers. Let me know if this is too onerous so we can discuss
alternatives.




-Emile
