I, unfortunately, can only be of minimal help here, other than to say that I was recently part of the problem myself.
A very large net, which normally exceeds 250 connected nodes a week, noticed a huge amount of packet loss when connected to my network of nodes, which usually hosts between 60 and 85 connections at any given time. All the hub nodes on my system are ASL3, except two, which are HamVoIP. The majority of active connections are on the two HamVoIP nodes, which run on Raspberry Pi 4s hosted in data centers. The rest are on Linode VPSes with much less traffic. The whole system is held together by ASL3 running on dedicated hardware in an office building with very good routes to just about everywhere in the United States.
Anyhow, audio was extremely bad across this large net until they dropped my network, which immediately cleared everything up. So, after the net, we ran some experiments. At the connection count where they had been seeing bad performance, if I dropped my ASL3 nodes off the network and left only the two HamVoIP nodes connected, everything was good, even though the ASL3 nodes weren't handling much traffic. As soon as even one of the ASL3 nodes showed up, things started to degrade again, even though each was only hosting a few nodes.
What I did as a workaround was to provide the network owner with a private node that connects back to my system through USRP, so no extra telemetry or other IAX chatter is sent back and forth. That node is the only one that can connect to their network. Now all my nodes, including the ASL3 ones, can stay connected with no loss in performance.
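For anyone wondering why that helps: USRP is just raw PCM over UDP with a tiny header, so none of the IAX2 control traffic crosses the bridge. Here's a rough sketch of a single USRP voice frame, going from my reading of chan_usrp; the byte-order details and the port number are assumptions on my part, so double-check against your install:

```python
import socket
import struct

def usrp_frame(seq: int, keyup: bool, pcm: list[int]) -> bytes:
    """One USRP voice frame: 32-byte header plus 160 samples of 16-bit
    PCM (20 ms at 8 kHz). Header fields follow chan_usrp's bufhdr
    struct; the ints appear to go out in network byte order."""
    hdr = struct.pack(
        ">4s7i",
        b"USRP",            # magic
        seq,                # sequence counter
        0,                  # memory (unused)
        1 if keyup else 0,  # PTT state -- the only "signaling" there is
        0, 0, 0, 0,         # talkgroup, type, mpxid, reserved
    )
    return hdr + struct.pack("<160h", *pcm)  # sample byte order assumed

# Send 20 ms of silence to a local bridge (port is just an example).
sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.sendto(usrp_frame(seq=1, keyup=True, pcm=[0] * 160),
            ("127.0.0.1", 34001))
```

Point being, the only state that crosses the link is audio and PTT, which is exactly why the telemetry chatter stops at the bridge.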
Back in September, I did a test and noticed that my single ASL3 node running on a Dell Precision 620 could only reliably handle about 85 direct connections before audio started to degrade, reaching the point of uselessness at around 90 adjacent connections. Things got even worse when nodes hosting other nodes, which were in turn hosting even more nodes, showed up. The farther down the chain it went, the worse things got.
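My loose mental model for why chaining compounds the problem, and it's only a guess, not something pulled from the app_rpt source: every hub in the path has to handle the stream once per adjacent link, so the per-frame work multiplies with depth. A toy illustration:

```python
# Toy model (my assumption, not a measurement): each hub in a chain
# handles a talker's audio frame once per adjacent link, so deeper
# chains multiply the total work done on every single frame.

def frame_handlings(links_per_hub: int, chain_depth: int) -> int:
    return chain_depth * links_per_hub

for depth in (1, 2, 3):
    print(f"chain depth {depth}: ~{frame_handlings(85, depth)} handlings per frame")
```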
Bandwidth and network latency were certainly not a concern at the time. The machine was only pushing about 33 Mbps when saturation was reached, and it's capable of greater-than-gigabit throughput in both directions on the public internet.
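Just to put those numbers side by side (the division is from the figures above; the per-stream estimate is a ballpark I didn't measure):

```python
saturation_mbps = 33   # observed throughput when audio fell apart
connections = 85       # direct connections at that point

per_link_kbps = saturation_mbps * 1000 / connections
print(f"~{per_link_kbps:.0f} kbps per link at saturation")  # ~388 kbps

# A single ULAW stream over IAX2 runs on the order of 100 kbps with
# overhead (ballpark), and the box can move well over 1 Gbps, so the
# pipe was nowhere near full. Whatever the ceiling is, it isn't bandwidth.
```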
I suspect, though I don't know for sure, that it has something to do with the constant telemetry being sent between nodes. HamVoIP did some closed-source voodoo that suppresses much of this telemetry, and that seems to keep things a little more stable, but apart from that, I don't have any logs or other fun stuff to show what's going on.