Recent AllStar Link DNS issues

There’s been quite a bit of conversation and derision regarding DNS and specifically the hosting donated to ASL, Inc. I have remained silent for some time regarding this, but cannot abide the things being said about donors to ASL.

First a bit of technical background:

Primary DNS is back-ended into the DB cluster, which has a node in each of three locations. ASL can function for registration/DNS/phone portal if any one datacenter goes offline. Two nodes is the minimum config to keep ASL on the net, or N+1 redundancy in the parlance of telecom. Any DNS request that comes into the primary servers is looked up in the DB; no working DB, no DNS. However, ASL had multiple redundant secondary DNS servers that store a local cache from the primaries. If the primaries are offline, your resolver will look up from a secondary.
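To make the failure modes concrete, here is a minimal sketch of that logic. It assumes the DB cluster needs a strict majority of its three nodes to stay writable (my reading of "two nodes is the minimum config", not a statement about the actual DB software), and it simplifies resolver behaviour to "primary if it can answer, else a secondary with a cached zone". The function names are made up for illustration.

```python
def has_quorum(nodes_up: int, cluster_size: int = 3) -> bool:
    """Assumed rule: the DB works only with a strict majority of nodes up."""
    return nodes_up > cluster_size // 2


def who_can_answer(db_nodes_up: int, secondary_has_cached_zone: bool) -> str:
    """Simplified model of which tier can still serve the zone."""
    if has_quorum(db_nodes_up):
        return "primary"          # DB-backed primaries can look records up
    if secondary_has_cached_zone:
        return "secondary"        # serves the last zone copy it transferred
    return "no answer"            # no working DB and no secondaries: no DNS


if __name__ == "__main__":
    for up in (3, 2, 1):
        print(up, "DB nodes up ->", who_can_answer(up, secondary_has_cached_zone=True))
```

With three or two nodes up the primaries answer; with one node up, only the secondaries keep the zone resolvable.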

ASL had secondaries provided by the FNT datacenter provider, which went away. I was asked by the admin team to do secondary DNS as I have NS1-4.KEEKLES.ORG running for my own use.

Why did FNT go away?

The board, through its own ineptitude regarding technical issues and treatment of the core admin/dev teams, pissed off one of the major donors to ASL, Inc. This resulted in notice being given (for the second time) to be off our FNT hypervisor by 00:00 UTC on Nov 1. I want to thank the provider of this datacenter and hypervisor. They really helped quite a bit, all while putting up with lots of bullshit.

To migrate away from this, Stacy, with consent of the admin team, spun up a hypervisor and multiple VMs in Seattle. Stacy and Nate spent damn near a week figuring out the database tuning to make this work on a higher latency link (FNT/ORD/TPA had under 40 ms between them). They replicated the major servers from FNT, and one of the primary DNS servers. This all happened without so much as a service outage.

Outages brewing…

Recently some changes have started to be made without permission or even consultation of the admin team. This has resulted in multiple outages, some due to not understanding DNS basics (SOA, DS, NS delegation, etc.) by board members. I’ve stayed out of this, other than when asked to help troubleshoot, and the admin team has fixed most of it.

On about 28 September, 2019 (this is the day it broke; the change itself was made before this), a change was made, without informing the admin team, to the records at the registrar, removing NS1-4.KEEKLES.ORG and leaving only Smithers-FNT and Karl-TPA as DNS servers for the allstarlink.org zone. This left the system in a precarious position, as the primary DNS servers need a working DB or nothing resolves. I should mention that at this time the ORD datacenter was in the middle of a 50+ day outage, so the DB servers were running in N+0. Wouldn’t you know it, they had an issue with FNT rebooting and the DB went down to 1 active node.
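A check along these lines would have flagged that change before it bit. This is only a sketch: the NS set is passed in as plain strings (fetch it however you like, for example from `dig NS allstarlink.org`), and the two primary hostnames below are placeholders I made up, since only the short names Smithers-FNT and Karl-TPA are given above.

```python
def _norm(names):
    """Lower-case hostnames and drop any trailing dot for comparison."""
    return {n.lower().rstrip(".") for n in names}


def delegation_risks(delegated, db_backed):
    """Return warnings for a delegated NS set, given the DB-dependent servers."""
    delegated, db_backed = _norm(delegated), _norm(db_backed)
    independent = delegated - db_backed
    warnings = []
    if not independent:
        warnings.append("every delegated NS depends on the DB: no DB, no DNS")
    if len(delegated) < 2:
        warnings.append("fewer than two name servers delegated")
    return warnings


# Hypothetical FQDNs for the two DB-backed primaries.
PRIMARIES = ["smithers-fnt.example.", "karl-tpa.example."]
SECONDARIES = [f"ns{i}.keekles.org." for i in range(1, 5)]

if __name__ == "__main__":
    print(delegation_risks(PRIMARIES + SECONDARIES, PRIMARIES))  # before the change
    print(delegation_risks(PRIMARIES, PRIMARIES))                # after the change
```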

“The death of God left the angels in a strange position.”
– Multics source code comments in an error routine

With no secondary servers, I was informed of it by my node sending me an email:
rsync: getaddrinfo: rsync.allstarlink.org 873: Name or service not known

I tried to alert the admin team, and got no response. ASL was off the web completely due to an unauthorized and unnoticed change. I got into the TPA server and set the DB to god mode, and got TPA online, then emailed the admin team. The admin team fixed the NS records at .org and then everything was working.

This was incident #1 of the board causing an outage.

How a small problem was made worse by unannounced changes

Most recently, with FNT going away and CAUSTIC-SEA coming online as the co-primary DNS server, the secondary DNS servers needed to be updated. Much like node auth in ASL Land, this is IP address based. Each primary server needs to be configured in the secondary servers with the IPs it will be sending AXFR from. It so happens that NS4.KEEKLES.ORG is an IPv6 enabled server [2607:f3f0:2:1001:225:90ff:fee4:63a5], and we had never configured allstarlink.org to permit Karl-TPA to update it via IPv4 or IPv6, so it was updating from Smithers-FNT. With the SEA servers coming online, the change was pushed to NS4.keekles.org removing FNT and replacing it with the IPv4 address of Caustic-SEA, taking for granted that the TPA primary was working.

It happens the Caustic-SEA server had IPv6 enabled, but not really used. This meant that the AXFRs were sent to NS4.KEEKLES.ORG from the IPv6 address, and this was not configured as an acceptable source for updates. NS4 did what it was supposed to do: serve up the last known cache of the zone, and log it. As I was not on the admin team, I didn’t check any of the DNS until asked by them to look into “DNS problems”. It was quite obvious the record was out of date, and some looking at the logs found the issues with TPA and the IPv6 address on SEA.
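The mismatch is easy to reproduce on paper. The sketch below models the secondary's IP-based allow list with Python's `ipaddress` module. The addresses are documentation-range placeholders, not the real Caustic-SEA addresses, and `transfer_allowed` is a made-up helper, not anything from the actual name server config.

```python
import ipaddress


def transfer_allowed(source: str, allowed: list[str]) -> bool:
    """True if the source address falls inside any allowed address or prefix."""
    addr = ipaddress.ip_address(source)
    # A v6 address is never "in" a v4 network (and vice versa), so a
    # v4-only allow list silently rejects the same host arriving over v6.
    return any(addr in ipaddress.ip_network(net) for net in allowed)


# Placeholder addresses standing in for the new primary.
SEA_V4 = "192.0.2.10"
SEA_V6 = "2001:db8::10"

if __name__ == "__main__":
    allow = [SEA_V4]                      # only the IPv4 address was configured
    print(transfer_allowed(SEA_V4, allow))            # accepted
    print(transfer_allowed(SEA_V6, allow))            # rejected: stale zone
    print(transfer_allowed(SEA_V6, allow + [SEA_V6])) # accepted after the fix
```

The takeaway is that a dual-stacked primary needs both of its source addresses listed on every secondary, even if nobody thinks of the box as "using" IPv6.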

This issue was resolved at 12:30pm on 18 November 2019, about 30 min after an admin team member brought it to my attention. At the same time, Stacy found he had been locked out of DNS when he went to just drop the NS4 server rather than bug me about it. There was no notice of this to the admin team; it was just done.

I’ll mention I set up the ASL account for domain names as a separate org-id and maintained it until asked by the board for the login info. As it was via gandi, I told them to set up an account and give me their handles so each person has a unique login. They wanted to use a role account, thus making tracking of changes impossible. :facepalm: I set it up and then removed myself from the account. The admin team had full access to it as they needed it, with the board having owner access.

Now the board has decided to migrate all other domains to a third party, breaking DNSSEC in the process. No notice was given to the admin team; it just broke and people were left to figure it out. Nodes.allstarlink.org has been broken many times in the past by such maladroit actions from people outside the admin team. The admin team was instituted as a meritocracy: anyone who wanted to be involved could, with consent of the others.

https://wiki.allstarlink.org/wiki/Admin_Committee

There are quite a few negative statements floating about regarding the work/design of the admin team. This is fostered by the stated desire of the board to commercialize AllStarLink. Pissing off major donors is not how you ensure the advancement of a Free Software project. I know the admin team didn’t appreciate the extra work required to migrate from one working environment to another.

When there is a willful lack of communications it’s not good for operations and development in an open source project.

If something breaks, check DNS first and email the admin team admin@allstarlink.org. It’s likely they will be just as surprised it’s broken as you!
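For a quick "is it DNS?" check from a node, something like this sketch will do. It uses only `socket.getaddrinfo` from the standard library; the `check_name` helper and its injectable `resolver` argument are my own illustration (the argument just makes it testable offline), and the host and port are the rsync ones from the error above.

```python
import socket


def check_name(host, port=873, resolver=socket.getaddrinfo):
    """Return (True, addresses) if the name resolves, else (False, error text)."""
    try:
        results = resolver(host, port)
    except socket.gaierror as exc:
        # Same failure class as "Name or service not known" from rsync.
        return False, str(exc)
    # Each result is (family, type, proto, canonname, sockaddr).
    return True, sorted({r[4][0] for r in results})


if __name__ == "__main__":
    ok, detail = check_name("rsync.allstarlink.org")
    print("resolves" if ok else "DNS FAILURE", detail)
```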

For those of you unaware, Bryan tendered his resignation from the ASL Admin team. At that time the board determined to set a goal for ASL to move from donated server environments (including Bryan’s) to commercially hosted environments for the primary services of ASL. This included retiring Flint at the request of the donor, but that donor also understood that moving some services to AWS or other commercial locations was a better solution.

In an effort to move off of Bryan’s servers (keekles.org) and also to work on formalizing our commercial strategy, the Admin team has been making some sizable changes. The changes have been thoroughly tested and vetted by a competent team, and have proved to resolve several errors.

The Admin team has added many new competent volunteers in the wake of Bryan’s departure with a diverse background of skills - including DNS management.

Bryan asked to leave ASL, but seems to not actually wish to do so. It was his decision and his decision alone. It seemed the right decision for Bryan given the stress that ASL was causing, and his request was honored.

The admin team is making, and will continue to make, changes to align with the Board’s request to put commercial services behind the ASL registration infrastructure. This includes a mandate to utilize donated services for redundancy.

As a member of the admin team, the liaison to the board, and one of the people who donates a datacenter, equipment and bandwidth to the cause, I want all of you to be assured, regardless of what Bryan says, that we have a capable admin team. It is important to further remember that they are all volunteers and this is not a paying day job.

We look forward to further making changes to ASL that are in-line with the board’s vision (probably not Bryan’s) that the team feels are beneficial to you (the actual users) in the long term support and stabilization of the ASL infrastructure.

It is not unlikely there will be some hiccups along the way (we still have a 10 year old server in the mix), but we are looking forward to a brighter future.

As always, if you have concerns, we encourage you to email helpdesk at AllStarLink.org or reach out to a board member as necessary. You may email me directly, I’m good on qrz.

On behalf of the Admin team - we are excited about the future of ASL and the changes being made to solidify a viable infrastructure for the long term.

Mike
KB8ZGL
-ASL Admin team Liaison to the board

There have been intermittent DNS problems for many months, well before what is mentioned above. One of our board members found a way to demonstrate the problem and pushed for a solution. Our remaining team of technical people identified and fixed the problem in short order.

At the heart of disagreements among the team is the board’s desire to “Use commercial services where available and affordable”. This is for the long term survivability and uptime of AllStar systems. For example, if there is a DNS failure on AWS, many people (not just AllStar users) will be impacted and it will get fixed quickly by Amazon employees, whereas our servers have to be fixed by AllStar volunteers with limited availability. Also, as mentioned above, people have removed their contributions for various reasons and this has caused huge churn and some outages. The board’s desire is to prevent any one person’s contributions or expertise from adversely impacting services.

It’s unfortunate that these disagreements have been made public. But it’s been an open secret for some time now. I suppose this presents an opportunity to discuss opposing points of view. That’s a good thing if the discussion remains civil. If this topic is not of interest the follow icon allows for adjustment of preferences for this topic.

To the team responsible for DNS: please contact me for a solution.
I may donate some resources on my server for load balancing and redundancy as an immediate fix. I have some simple advice on how to futureproof this issue.
Contact me by email only.

We’re planning to move DNS to AWS. But I will contact you via email to get your input. And thank you for that.