Pushpin worker crashes often with assertion error

I am running Pushpin in Kubernetes. The Pushpin pod crashes often with the error below:

[INFO] 2024-04-26 13:12:30.487 [handler] control: POST /publish/ code=200 10 items=1
[INFO] 2024-04-26 13:12:30.487 [handler] publish channel=2d7adb71-de95-4824-9b38-516732b8805b receivers=1
thread 'server-worker-0' panicked at 'assertion failed: `(left == right)`
  left: `601`,
 right: `16985`', src/websocket.rs:1187:13
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
[INFO] 2024-04-26 13:12:30.545 [handler] control: POST /publish/ code=200 10 items=1
[INFO] 2024-04-26 13:12:30.545 [handler] publish channel=52a09fd1-f880-483e-9805-b477a6c824d1 receivers=0
[ERR] 2024-04-26 13:12:30.618 condure: Exited unexpectedly
[INFO] 2024-04-26 13:12:30.618 [zurl] stopping...
[INFO] 2024-04-26 13:12:30.618 [handler] stopping...
[INFO] 2024-04-26 13:12:30.619 [zurl] stopped
[INFO] 2024-04-26 13:12:30.618 [proxy] stopping...
[INFO] 2024-04-26 13:12:30.621 [handler] stopped
[INFO] 2024-04-26 13:12:31.126 [proxy] stopped
[INFO] 2024-04-26 13:12:31.128 stopped

I analysed the memory and CPU consumption, but they look normal, and there are no limits configured for the Pushpin pod. From the assertion error, I am not able to figure out what exactly it is asserting. Could someone help?

Have you tried setting the environment variable listed in the error message? The backtrace could be quite useful.

Yes. However, I am not sure how to interpret it:

thread 'server-worker-1' panicked at 'assertion failed: `(left == right)`
  left: `1065`,
 right: `17449`', src/websocket.rs:1187:13
stack backtrace:
   0:     0x5b88cc02b38f - <unknown>
   1:     0x5b88cc048d3e - <unknown>
   2:     0x5b88cc00fcd5 - <unknown>
   3:     0x5b88cc02b145 - <unknown>
   4:     0x5b88cc018d3f - <unknown>
   5:     0x5b88cc0189f8 - <unknown>
   6:     0x5b88cc01934b - <unknown>
   7:     0x5b88cc02b6e7 - <unknown>
   8:     0x5b88cc02b4dc - <unknown>
   9:     0x5b88cc018ef2 - <unknown>
  10:     0x5b88cbe36a33 - <unknown>
  11:     0x5b88cc04824b - <unknown>
  12:     0x5b88cbe303eb - <unknown>

It appears that the line number has changed a bit, but the relevant code is here. This code validates that the size of a data item matches the expected size; based on the comments, a failure to get the proper size is considered a serious error.

Unfortunately we’ll need someone familiar with the code to help figure this out… maybe @jkarneges can find some time to help.

The log output shows a published message received right before the crash. Is that common or does it happen without such log lines too?

@jkarneges the logs are always the same whenever this restart happens.

Did you increase the client_buffer_size above the 8192 default? Do your messages usually exceed this buffer size? Does it seem possible that you send enough data in a short time to the same client such that its buffer fills and possibly its TCP send buffer in the OS also fills? Maybe this happens during some bursty moment. Just trying to think about which code paths are being taken.
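(For reference, that setting lives in pushpin.conf. A sketch below; the section placement is an assumption and may differ by version:)

```ini
# pushpin.conf (sketch; section name is an assumption for your version)
[proxy]
# Per-client buffer size in bytes; the default is 8192.
client_buffer_size=8192
```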

client_buffer_size=1048576 is the configured buffer size.

The maximum message size is above the configured value. I will check how often it is exceeded.

Ah, it looks like this may have already been fixed. I suggest upgrading.

Thanks, upgrading worked. I have not observed any crashes so far.