Announcing the Stress-test Net (testing an ASL3 server for maximum number of concurrent connections)

Greetings, all:

As I have migrated most of my multi-connection-holding nodes to ASL3, it occurs to me that it has never been satisfactorily proven how many connections a single node, without any load balancing, can handle before it completely falls over when hardware isn’t the bottleneck.

I have a slightly over-the-top system for testing ASL3. It is a dedicated machine with a sixth-generation Xeon E3 processor and 64 GB of RAM on a corporate fiber connection. This is an entry-level server/workstation-class processor, but it’s more than enough for anything ASL could possibly throw at it, I’m sure.

So, to sate my personal curiosity, as well as for the general education of anyone who is interested in deploying large ASL3 nodes at scale, I’m running a stress-test net this Saturday, September 14, at 3:00 PM EDT, 12:00 PM PDT, 19:00 Z. The aim is to have as many nodes as possible connect to this system, node 508429, and monitor performance and resources throughout to see what happens when high connection counts are reached on a single server.

Importantly for this particular test, this node is not part of any larger network. It is just a solitary Asterisk server not inter-connected to any other hubs. The goal is to get as many individual nodes connected as possible.

So, if you’ve got multiple nodes to spare, bring them all and connect them directly; i.e., don’t chain multiple nodes together and then connect one of them to 508429.

Using HamVoIP with internal load balancing on a Raspberry Pi 4, with no USB sound fobs connected and no web stuff running in the background, it has been shown that 141 connections can be reached before the Pi chokes and just can’t handle any more. However, audio is very jittery by the time it gets to that point, at least in my testing.

This is neither HamVoIP nor 32-bit ARM. It’s a completely different environment, for better or worse. Hopefully better. The custom additions to app_rpt for load balancing don’t exist here. Do they need to? That’s what I want to find out.

I’ve seen very high adjacent-node connection counts, exceeding 300, on large nets such as the Absolute Tech Net, but those involve multiple hubs each holding a large number of connections, which is not what I’m aiming for here.

Yes, I actually want to get enough connections on this system that it struggles, or maybe even crashes, to see where the threshold is, which is normally a thing you don’t want.

On top of testing the performance of an ASL3 node, it may end up being the place where you meet interesting people you’ve never heard before. Who knows?

Again, that’s Saturday, September 14, at 3:00 PM EDT, 12:00 PM PDT, 19:00 Z, node 508429, to stress-test an ASL3 server running on decent hardware on plenty of bandwidth.

I am not on Facebook, so feel free to post this to any Allstar-related groups you are on to get the word out.

Thanks and 73

Patrick, N2DYI

I did some stress tests over 5 years ago.
What I found was that there are big variables, which makes it hard to give any general rule for others to follow.

The node under test can be weighed down by nodes and node networks that carry a lot of IAX2 connection and error traffic, especially if Allison is allowed to speak.

Announcements will stack up until Allison catches up, but if she doesn’t, they eat into your memory and bandwidth. All of it depends on just who and how many are connected, and on the IAX traffic.

So, for anyone wanting to allow the most possible connections, for instance during an EmComm event, turning off Allison and stopping foreign announcements will go a long way.
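
If you would rather bake that into the config than do it over the air, the relevant settings live in the node stanza of rpt.conf. A minimal sketch only: telemdefault is mentioned elsewhere in this thread, but telemdynamic is from memory, so verify both against the ASL3 docs before relying on them.

[1999]               ; your node stanza (1999 is just a placeholder node number)
telemdefault = 0     ; telemetry (Allison) off by default
telemdynamic = 0     ; don't allow telemetry to be re-enabled over the air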

There are different points of bottlenecking: internet bandwidth, CPU, and memory.

Your exact criteria will depend on what is available and on how much your system uses for external scripts and background tasks. Anything that runs on a schedule needs to be taken into account, or you risk a surprise crash when you are near the limits.
And then there is the actual traffic.

For those doing EmComm,
I suggest using control states and a macro to quickly change how your node uses its resources, so you can maximize them when needed.
Don’t forget a second macro to change things back to normal.
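
As a rough illustration only: the DTMF digit assignments below are hypothetical, and the cop subcommand numbers and macro syntax should be double-checked against the rpt.conf documentation before use. The idea is one macro that silences local telemetry (Allison) and foreign announcements, and a second that restores them.

[functions]
81 = cop,33    ; local telemetry (Allison) enable
82 = cop,34    ; local telemetry (Allison) disable
83 = cop,36    ; foreign link local output (announcements) enable
84 = cop,37    ; foreign link local output (announcements) disable

[macro]
1 = *82*84#    ; "EmComm mode": Allison off, foreign announcements off
2 = *81*83#    ; back to normal

Control states are the other half of the suggestion; their keywords and syntax are in the controlstates section of the rpt.conf documentation.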

So, that’s the skinny on what I learned. I wish you success in your stress test.
Anxious to hear the results.

Did you do any tuning to ASL3 to allow more connections, etc?

I have done almost nothing to the node thus far. It’s basically stock. duplex=0, telemdefault=0, and that’s about all I’ve changed. What can/should I do to tune ASL3 for maximum performance under high load, i.e. more concurrent connections? I’m happy to play.

If possible, can you capture and share some stats from your system from before the stress test and periodically during the test?

For starters, I’m thinking:

ps -F -p `pgrep asterisk`            # memory/CPU footprint of the Asterisk process
asterisk -x 'core show threads'      # thread count and what each thread is running
asterisk -x 'core show channels'     # active channels, with a summary count at the end

I wrote a small script to do those three commands in succession every five minutes and log to a file. I’ll post the output here tomorrow.
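
It’s nothing fancy; something along these lines (a sketch, with the log path and separator chosen arbitrarily):

#!/bin/bash
# Log process, thread, and channel stats for Asterisk every five minutes.
LOG=/var/log/asterisk/stress-stats.log
while true; do
    {
        date
        ps -F -p "$(pgrep -d, asterisk)"
        asterisk -rx 'core show threads'
        asterisk -rx 'core show channels'
        echo '----------------------------------------'
    } >> "$LOG"
    sleep 300
done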

Here is the post mortem on the stress test. We learned a lot, and hopefully got some good data to play with.

My node did well up until about 80 local connections were hanging off of it. Any more than that, and it started to jitter. Once the direct connection count reached 90, it was unusable.
When direct connection counts were high, dropping off some nodes that were bridging other nodes helped to an extent, but it seemed happiest when the number of direct connections was at or below 80.

This forum does not allow attachments of text files, so I have uploaded the large (520 KB) text file to a web server here: http://xlx.borris.me/files/stress-test.txt

Thanks to everyone who participated, and I hope this leads to something useful.


Thank you for organizing today’s event!

Results don’t really say anything. What were the server resources when things went bad, and where was the chokepoint (i.e. RAM, CPU, Bandwidth), etc.?

When everything choked up, the CPU load averages almost never went above about 1.1 to 1.2 as reported by top and uptime, and system memory usage topped out at right about 1 GB out of 64 GB on this 4-physical-core, 8-logical-core Xeon E3.

Were there any interesting messages in the asterisk logs?

Oh my. messages.log, starting September 10 and running through the end of September 14, is 420 MB in size. There are probably a ton of repeats. What’s the best way to filter that?

Well … that’s a hard question to answer.

Does the file compress well? If so, then it’s not a big deal.

If not, then trying to find nuggets of interest becomes more of a challenge. You can remove messages that have little value (and probably should not have been logged in the first place). Beyond that, if you know a time frame when there was trouble, you can extract the messages from before/during/after it.
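
For example, something like this (just a sketch; the sed pattern assumes the default Asterisk log format of "[timestamp] LEVEL[thread-id] file: message" and may need tweaking, and the times shown are only placeholders):

# Collapse repeats: strip the timestamp and thread id, then count distinct messages
sed -E 's/^\[[^]]*\] [A-Z]+\[[0-9]+\](\[[^]]*\])? //' messages.log | sort | uniq -c | sort -rn | head -50

# Extract just the window when there was trouble
awk '/^\[2024-09-14 14:0/,/^\[2024-09-14 15:0/' messages.log > window.txt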

It compressed to a 5.something MB zip and a 1.something MB 7z.
I don’t have time to really look at this at the moment, and my only PC here is offline due to a hardware failure, so I compressed and uploaded that file from a terminal on my phone.
If anyone wants to look at it, the file lives here:
http://xlx.borris.me/files/messages.zip
Lots of activity spikes around 2024/09/14 14:05 through about 2024/09/14 14:50 or so.

Sometimes compression just rocks (420M → 4M) !!!

Thank you!