I, unfortunately, can only be of minimal help here, other than to say that I was recently part of the problem myself.
A very large net, which normally exceeds 250 connected nodes a week, noticed a huge amount of packet loss when connected to my network of nodes, which usually hosts between 60 and 85 connections at any given time. All the hub nodes on my system are ASL3, except two, which are HamVoIP. The majority of active connections are on the two HamVoIP nodes, which run on Raspberry Pi 4s hosted in data centers. The rest are on Linode VPSes with much less traffic. The whole system is held together by ASL3 running on dedicated hardware in an office building with very good routes to just about everywhere in the United States.
Anyhow, audio was extremely bad across this large net until they dropped my network, which immediately cleared everything up. So, after the net, we ran some experiments. At the connection count where they had been seeing bad performance, if I dropped my ASL3 nodes off the network and left only the two HamVoIP nodes connected, everything was good, even though the ASL3 nodes weren't handling much traffic. As soon as even one of the ASL3 nodes showed up, things started to degrade again, even though each was only hosting a few nodes.
What I did as a workaround was to provide the network owner with a private node that connects back to my system through USRP, so no extra telemetry or other IAX chatter is sent back and forth. That node is the only one that can connect to their network. Now all my nodes, including the ASL3 ones, can stay connected with no loss in performance.
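For anyone wondering why that helps: USRP is just raw PCM over UDP with a tiny header, so none of the IAX2 control traffic crosses the bridge. Here's a rough sketch of a single USRP voice frame, going from my reading of chan_usrp; the byte-order details and the port number are assumptions on my part, so double-check against your install:

```python
import socket
import struct

def usrp_frame(seq: int, keyup: bool, pcm: list[int]) -> bytes:
    """One USRP voice frame: 32-byte header plus 160 samples of 16-bit
    PCM (20 ms at 8 kHz). Header fields follow chan_usrp's bufhdr
    struct; the ints appear to go out in network byte order."""
    hdr = struct.pack(
        ">4s7i",
        b"USRP",            # magic
        seq,                # sequence counter
        0,                  # memory (unused)
        1 if keyup else 0,  # PTT state -- the only "signaling" there is
        0, 0, 0, 0,         # talkgroup, type, mpxid, reserved
    )
    return hdr + struct.pack("<160h", *pcm)  # sample byte order assumed

# Send 20 ms of silence to a local bridge (port is just an example).
sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.sendto(usrp_frame(seq=1, keyup=True, pcm=[0] * 160),
            ("127.0.0.1", 34001))
```

Point being, the only state that crosses the link is audio and PTT, which is exactly why the telemetry chatter stops at the bridge.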
Back in September, I did a test and noticed that my single ASL3 node running on a Dell Precision 620 could only reliably handle about 85 direct connections before audio started to degrade, reaching the point of uselessness at around 90 adjacent connections. Things got even worse when nodes hosting other nodes, which were in turn hosting even more nodes, showed up. The farther down the chain it went, the worse things got.
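My loose mental model for why chaining compounds the problem, and it's only a guess, not something pulled from the app_rpt source: every hub in the path has to handle the stream once per adjacent link, so the per-frame work multiplies with depth. A toy illustration:

```python
# Toy model (my assumption, not a measurement): each hub in a chain
# handles a talker's audio frame once per adjacent link, so deeper
# chains multiply the total work done on every single frame.

def frame_handlings(links_per_hub: int, chain_depth: int) -> int:
    return chain_depth * links_per_hub

for depth in (1, 2, 3):
    print(f"chain depth {depth}: ~{frame_handlings(85, depth)} handlings per frame")
```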
Bandwidth and network latency were certainly not a concern at the time. The machine was only pushing about 33 Mbps when saturation was reached, and it's capable of greater-than-gigabit throughput in both directions on the public internet.
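Just to put those numbers side by side (the division is from the figures above; the per-stream estimate is a ballpark I didn't measure):

```python
saturation_mbps = 33   # observed throughput when audio fell apart
connections = 85       # direct connections at that point

per_link_kbps = saturation_mbps * 1000 / connections
print(f"~{per_link_kbps:.0f} kbps per link at saturation")  # ~388 kbps

# A single ULAW stream over IAX2 runs on the order of 100 kbps with
# overhead (ballpark), and the box can move well over 1 Gbps, so the
# pipe was nowhere near full. Whatever the ceiling is, it isn't bandwidth.
```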
I suspect, though I don't know for sure, that it has something to do with the constant telemetry being sent between nodes. HamVoIP did some closed-source voodoo that suppresses much of this telemetry, and that seems to keep things a little more stable, but apart from that, I don't have any logs or other fun stuff to show what's going on.