Frequent client drop outs on dev 3.4.3

I tried the dev version of ASL3 for a few weeks. I noticed that, quite frequently, many of the clients connected to our hub node would simultaneously drop out of the connected list on allmon3 and then reconnect (those with 'permanent' connections, anyway). But not all of them; it seemed they would switch around in batches.

This would happen at seemingly random times, sometimes several times in succession a few minutes apart, sometimes 3 hours apart.

I was not able to correlate this with anything in particular. I did not see any suspicious messages in the Asterisk CLI or in the Asterisk log file.

Our hub node typically hosts 20-29 clients, and runs on a pi3.

I had to revert to version 3.3.0 since this was too disruptive. So far, with 15 hours of uptime on 3.3.0, I have not seen it happen again.

Probably not the most helpful report, since I won't be able to test 3.4.3+ on this particular hub at this time. I just want to put this out there if anyone else has seen this on similar client count and hardware.

Clearly, we want ASL3 nodes to be rock solid. Any information you can collect and report back to us will be helpful.

Things to check for would be crash reports (e.g. does systemctl status asterisk show a different PID? and/or do you have a /var/lib/asterisk/core file?).
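For anyone who wants to automate those two checks, here is a minimal Python sketch. It assumes the service unit is named `asterisk` and that core files, if any, land in /var/lib/asterisk with a name starting with "core"; the function names are made up for illustration. It reads the main PID from `systemctl show --property=MainPID` so you can compare it against a value you noted earlier.

```python
import glob
import os
import subprocess


def parse_main_pid(show_output):
    """Extract the integer PID from 'MainPID=<n>' text; None if absent."""
    for line in show_output.splitlines():
        key, sep, value = line.partition("=")
        if sep and key.strip() == "MainPID" and value.strip().isdigit():
            return int(value.strip())
    return None


def current_asterisk_pid(unit="asterisk"):
    """Ask systemd for the unit's main PID (0 means it is not running)."""
    result = subprocess.run(
        ["systemctl", "show", "--property=MainPID", unit],
        capture_output=True, text=True, check=False,
    )
    return parse_main_pid(result.stdout)


def find_core_files(directory="/var/lib/asterisk"):
    """List files whose names start with 'core' in the given directory."""
    return sorted(glob.glob(os.path.join(directory, "core*")))
```

If `current_asterisk_pid()` returns a different number than it did before a drop-out, Asterisk was restarted in between; a non-empty `find_core_files()` result suggests a crash left a core file behind.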

Things that would be good to know include whether you are using an arm64 (e.g. Raspberry Pi) or amd64 (Intel, AMD) system. SimpleUSB, USBRadio, "hub", ...? Do you have any insight into what type(s) of connections are possibly triggering the failures (WT connections, DVSM/IAX, EchoLink, ...)? What extras do you have running on this system?

Also, if you have logs, configurations, or crash reports to share then it would be best to create a GitHub issue (app_rpt) and "attach" the files.

There were no core files. I did reboot last night after the downgrade, so my logs are gone :(. This node is a 'hub' node.

One thing that seemed odd was that one node never seemed to disconnect or reset its connection. I know for a fact that node is running ASL3 3.3.0.

One other node on the local LAN, which acts as a client for our USB dongle, also never disconnected or reset its connection. That node is running on a separate pi3, and is running ASL3 3.4.3.

I would suspect that most of the nodes that connect to this hub node are still on hamvoip.

I also noticed that ASL3 3.4.3 on this hub node, per the output of 'top', used about 10-25 percent more cpu when in keyup than ASL3 3.3.0 on the pi3.

I have used up all of my capital trying out 3.4.3, I will probably have to give this some time yet on 3.3.0 on this particular node.

Thank you for your reply, I will refer back in the future to provide more useful info. I am going to let 3.3.0 go for a few days and see if anything similar happens.

I made a test setup with 26 nodes connected.

node 452381 is the hub running latest dev of ASL3, on a pi3.
node 452384 is connected to the hub and has a simpleusb hotspot running dev of ASL3, on a pi3.
node 2495 is also connected to provide some traffic, and that node is running ASL3 3.3.0, on a pi3.

The 24 other nodes connected to the hub run hamvoip on pi3's and are set up as private nodes in the range 1975-1998.

I saw a bunch of the nodes disconnect/reconnect in sequence. It looked like what I had observed previously.

I have attached the printable output from the hub node 452381 from putty, from ASL3 asterisk console set to debug=5.

The last part of the putty output is a dump of the /var/log/asterisk file.

The disconnect/reconnect sequences started at about 12:36PM or so of the asterisk console output. There was TX activity being generated by connected node 2495 at that time as well.

452381-1.txt (2.2 MB)

I also looked at allmon3 on the connected node 2495 mentioned above. That node's allmon showed my hub node 452381 as 'LAST RECV' at time 12:36:30, even though neither my hub node nor any of the test nodes connected to it was ever purposely put into TX.

I have started another console dump and tcpdump on my hub node 452381 and on one of the test nodes, and am waiting to see if this will happen again.

Thank you

A few hours later I noticed another 'hangup storm', as I will call it. In the attached asterisk console log this was mostly at/around time 14:58:16.

Google Drive link of zipped console output:

On one of the nodes that was part of the 'hangup storm', node 1998 with IP address 192.168.1.10, this was what that node's asterisk console showed at the time of 'hangup':

[May 3 14:58:16] ERROR[464] app_rpt.c: No link messages from [452381] in 46 seconds -- One way audio??? Forcing reconnect.
[May 3 14:58:16] WARNING[464] app_rpt.c: Node 1998: Reconnect Attempt to 452381 in process
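To count how often this shows up in a saved console log, a small Python sketch like the following might help. The pattern is modeled only on the two lines quoted above, so other Asterisk or hamvoip versions may word the message differently; the function name is made up for illustration.

```python
import re

# Pattern modeled on the ERROR line quoted above (an assumption about format).
NO_LINK_RE = re.compile(
    r"^\[(?P<stamp>[^\]]+)\]\s+ERROR\[\d+\]\s+app_rpt\.c:\s+"
    r"No link messages from \[(?P<node>\d+)\] in (?P<seconds>\d+) seconds"
)


def find_link_timeouts(lines):
    """Return (timestamp, remote node, silent seconds) for each timeout line."""
    events = []
    for line in lines:
        match = NO_LINK_RE.match(line.strip())
        if match:
            events.append(
                (match["stamp"], match["node"], int(match["seconds"]))
            )
    return events


sample = [
    "[May 3 14:58:16] ERROR[464] app_rpt.c: No link messages from [452381] "
    "in 46 seconds -- One way audio??? Forcing reconnect.",
    "[May 3 14:58:16] WARNING[464] app_rpt.c: Node 1998: Reconnect Attempt "
    "to 452381 in process",
]
print(find_link_timeouts(sample))
# prints [('May 3 14:58:16', '452381', 46)]
```

Running this over the console logs of several client nodes would show whether they all report the timeout against the hub at the same second.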

Are 'link messages' missing from this tcpdump for all of these hamvoip nodes?

I was able to tcpdump the IAX traffic from hub node 452381 with IP address 192.168.1.3.

Google Drive link of zipped tcpdump output from node 452381, with IP address 192.168.1.3:

Here is a tcpdump of IAX traffic from node 1998, with IP address 192.168.1.10 (all of the hamvoip nodes ended up hanging up).

The 'hangup' message from console output of ASL3 node 452381 has tcpdump time 1024.384141 from the tcpdump of node 452381.

The 'hangup' message from the console of hamvoip node 1998 has tcpdump time 1608.290664 from the tcpdump of node 1998.

There was also another node that somehow managed to TX across the links without being keyed up. It was node 1979, with IP address 192.168.1.29. From the tcpdump of node 452381 it looks to me like the 'keyup' was maybe 60 ms of audio, and the audio data looked like it was all hexadecimal 0xf (tcpdump time 1034.810074).

For node 452381, I have matched the asterisk console timestamps to the tcpdump timestamps as follows:

Console time: 14:58:06.931
tcpdump: 1014.841640

Console time: 14:58:16.162
tcpdump: 1024.162128
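For anyone lining up the two logs, the two matched pairs above can be turned into an offset with a short Python sketch. The function names are made up for illustration, and averaging the two offsets is just one simple choice. The two pairs disagree by roughly 90 ms, so the result is only good to about a tenth of a second.

```python
# The two matched (console time, tcpdump time) pairs from node 452381.
PAIRS = [("14:58:06.931", 1014.841640), ("14:58:16.162", 1024.162128)]


def clock_to_seconds(hms):
    """Convert 'HH:MM:SS.mmm' into seconds since midnight."""
    hours, minutes, seconds = hms.split(":")
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)


def offsets(pairs):
    """Offset (console seconds minus tcpdump seconds) for each pair."""
    return [clock_to_seconds(clock) - capture for clock, capture in pairs]


def capture_to_console(capture_time, pairs=PAIRS):
    """Map a tcpdump relative time to an approximate console clock time."""
    offs = offsets(pairs)
    total_ms = round((capture_time + sum(offs) / len(offs)) * 1000)
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    seconds, millis = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}.{millis:03d}"


# Spread between the two offsets, then the 'hangup' packet from the capture.
print(max(offsets(PAIRS)) - min(offsets(PAIRS)))
print(capture_to_console(1024.384141))
```

The same function can be pointed at other capture times, such as 1034.810074 for the stray keyup from node 1979, to find the matching part of the console log.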

Something to add: I saw a 'hangup storm' 4 times in 6 hours running ASL3 3.4.5.

I downgraded the hub node 452381 to ASL3 3.3.0, and after 14 hours I have not seen it happen yet.

I was able to test these recent versions of ASL3 and did not encounter 'hangup storm':
3.3.0
3.4.0

These versions I was able to test and did have 'hangup storm':
3.4.1
3.4.3
3.4.5

I have yet to test:
3.4.2
3.4.4

Something added with version 3.4.1 seems to be the cause?

Matt