Stale DNS entry after upgrading to NNX 6-digit nodes

Hi, I upgraded 63809 to NNX, and both asl-node-lookup and direct DNS queries against the AllStarLink DNS servers at AWS (bypassing any intermediate resolvers) show a stale record for the 5-digit node that should be gone now:

dig 63809.nodes.allstarlink.org @ns-436.awsdns-54.com

The new 6-digit nodes (638090, 638091, etc.) resolve just fine, which is good.
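For comparison, the same query against that authoritative server for one of the new nodes answers as expected:

dig 638090.nodes.allstarlink.org @ns-436.awsdns-54.com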

I noticed this when my startup_macro kept trying to connect to the 5-digit node 63809, which is no longer valid. It also broke all DTMF linking commands.

Can this stale entry be removed?

Short answer: yes but no… The long answer is a fiendishly complicated exploration of many years of AllStarLink. Fixing some of the DNS grooming is “on the list,” but it shouldn’t be breaking anything. Are you experiencing a problem?

Yes. I converted a 5-digit node to 6-digit NNX nodes, and now no one (including myself) whose node_lookup_method is dns or both (both being the default) can connect to my nodes via DTMF, macros, etc.

So unless this entry is purged as part of the NNX conversion, it blocks the ability to connect to my node unless a manual change is made on every originating node to switch to file lookups.
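For anyone else hitting this, that manual workaround is a one-line change (a sketch; check the ASL manual for exactly where this lives in rpt.conf - the documented values are file, dns, and both, with both being the default):

; rpt.conf (sketch)
node_lookup_method = file   ; instead of dns, or the default both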

Sounds like the NNX conversion is missing a step to purge the old 5-digit node number?

Not sure if this happens all the time or mine was a one-off. Either way, it does block me right now, and there is nothing I can do on my side to fix/unregister the old, invalid 5-digit node.

If some of this is OSS (I haven’t investigated), perhaps I could also help contribute to the admin dashboard, etc. I also have a lot of familiarity with AWS Route 53, which this appears to be using on the backend.

Thanks!

I don’t think your issue is DNS. I use DNS lookups exclusively, and I just tried to connect to your systems: the lookup/connection works fine. However, both 638090 and 638091 are failing to connect from a network perspective, not a DNS-lookup perspective. Both 638090 and 638091 are trying to be reached at 76.242.53.247 port 4570. Some questions:

  1. Are both 638090 and 638091 on the same server (as defined in the portal) or on different servers?
  2. If they are on different servers, they cannot both use the same UDP port 4570 (see the sketch after this list)
  3. If they are on different servers, make sure you are aware of https://allstarlink.github.io/adv-topics/multinodesnetwork/
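For example (a sketch with illustrative ports), each server behind the same public IP needs its own IAX2 bind port in iax.conf, plus a matching port-forward on the router:

; server A: /etc/asterisk/iax.conf
[general]
bindport = 4569

; server B: /etc/asterisk/iax.conf
[general]
bindport = 4570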

I have them set as private right now, and the IPs you see are valid, but I am no longer port forwarding. Apologies, I’m not trying to debug external public Internet access; it’s restricted behind a VPN.

This is a valid use case: rather than using node numbers below 2000, with the potential for node conflicts, I’m using the newly recommended approach of NNX numbers and treating the nodes as private.

So in my use case, these nodes have direct nodes stanza entries pointing at private IPs, which allows them to connect. They could easily be made available to the Internet, or gated with IP whitelists, etc., but that is not really germane to the underlying issue I’ll describe below.
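For illustration, those entries look something like this (a sketch; the private IPs are made up):

[nodes]
638090 = radio@10.0.0.10:4569/638090,NONE
638091 = radio@10.0.0.11:4569/638091,NONE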

I believe the root of the issue is how Asterisk handles 5- and 6-digit node numbers during DTMF entry or a startup_macro, for example. Since it doesn’t know the length in advance, it uses either DNS or file lookups to progressively determine valid nodes.

As a user dials *3638090 via DTMF, my understanding is that Asterisk checks digit by digit and short-circuits as soon as it finds a match. In my case, with *3638090 it matches DNS for 63809 (which doesn’t exist and is no longer valid now that I’ve moved to NNX) and short-circuits there.

This also happens with startup_macro = *3638090, for example: it incorrectly tries to connect to 63809 instead.
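For reference, that macro sits in the node stanza of rpt.conf (a sketch; the stanza number is one of mine):

[638091]
startup_macro = *3638090    ; link to 638090 at startup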

Asterisk is doing the right thing, but instead of getting a valid entry for 63809.nodes.allstarlink.org it should be getting an NXDOMAIN and continuing on to either 1) a valid 6-digit node DNS entry, or 2) stopping at 6 digits and accepting that as the entry (since that is the maximum length configured in ASL today).
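The difference shows up right in the dig header: with the stale record the status is NOERROR, and once the record is purged it should be NXDOMAIN:

dig 63809.nodes.allstarlink.org @ns-436.awsdns-54.com | grep status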

I can demonstrate that this is the issue in a few ways:

  • If I change to the file lookup method, I can connect to my private nodes. That’s because 63809 isn’t in that file; only 638090 and 638091 are.
  • If I block 63809.nodes.allstarlink.org on my firewall with an NXDOMAIN response, essentially simulating the AWS Route 53 server returning NXDOMAIN itself, everything also works and evaluation continues on to my 6-digit node IDs (sketch after this list)
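The firewall trick in that second bullet is a one-liner on my Mikrotik (a sketch; the NXDOMAIN static-record type is, as far as I know, RouterOS 7 only):

/ip dns static add name=63809.nodes.allstarlink.org type=NXDOMAIN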

I hope this helps! I believe removing the pre-NNX node from DNS is the proper solution. The growing reliance on DNS instead of file lookups (if I read that correctly somewhere) and the wider use of NNX over the next few years will only make this more of an issue.

TIA, and let me know if I was clear; that ended up a lot longer than I thought :smile:

Hrm, this is interesting behavior, because that’s not how it’s supposed to work. I wonder if your DNS is very slow or something. What should happen is that after a *3 (which matches the DTMF command), the process issues a lookup on each digit until it receives the last digit, which it resolves one more time before pausing. DNS entries haven’t been cleaned up in a long time (because there’s a massive ambiguity hole as to when a node is no longer valid - that’s the long, long story). I can go in and hand-delete the record, but if that fixes your issue, the question remains why it’s affecting you and not tons of other people.

It’s fixed now after deleting that record. Very curious what that ambiguity hole is, but that’s for a different time!

Hrm indeed. I’m aware of the known 3s timeout issue (Differences & Issues - AllStarLink Manual), but this doesn’t seem to be related.

Slow DNS can be ruled out by staying within the 60s TTL window. I’ve reproduced the issue while within that window, when the local resolver (and the local DNS cache on my Mikrotik router) had the answers cached, so there were no delays.
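A cached answer is easy to verify from dig’s trailer; within the TTL window the reported query time is essentially zero:

dig 638090.nodes.allstarlink.org | grep "Query time"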

Since I can also fake the record on my router (with an A record instead of an NXDOMAIN), I can easily reproduce this locally. Are there any logs or tests I can run to help tease out what is happening?

It’s also interesting that not many people are reporting this. I assumed that was because NNX is not that common, and even in those situations people only do outbound linking to public, popular nodes.

It’s not a question of the TTL. Internally, app_rpt runs a 3s timer for each DTMF digit lookup. If for some reason your DNS got stuck at the 5th digit, the lookup would fail and it would then “connect”.

NNX is very common. That’s why I’m confused. I’ll have to add this to a list of things to test.

By the way, if you’re interested in contributing to the project, contact @WA3WCO. He’s taking point on vetting volunteers.

Got it. Yeah, the 3s thing was only similar in area, not in impact or in what is actually happening here.

Let me know if I can help at all; I may have time this weekend to dig in a bit as well.

Thanks much, will do re: contributing!