There’s been quite a bit of conversation and derision regarding DNS, and specifically the hosting donated to ASL, Inc. I have remained silent about this for some time, but I cannot abide the things being said about donors to ASL.
First a bit of technical background:
Primary DNS is back-ended into the DB cluster, which has a node in each of three locations. ASL can function for registration/DNS/phone portal if any one datacenter goes offline. Two nodes is the minimum config to keep ASL on the net, or N+1 redundancy in the parlance of telecom. Any DNS request that comes into the primary servers is looked up in the DB; no working DB, no DNS. However, ASL had multiple redundant secondary DNS servers that store a local cache from the primaries. If the primaries are offline, your resolver will look up from a secondary.
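To make those failure modes concrete, here is a minimal Python sketch. The datacenter names and the two-node minimum come from this post; the function names and the `secondaries_listed` flag are my own illustration and not the real ASL configuration.

```python
# Hypothetical model of the setup described above; not the real ASL config.
DB_NODES = ("FNT", "ORD", "TPA")   # the three datacenters named in this post
MIN_DB_NODES = 2                   # "two nodes is the minimum config"


def db_available(nodes_up):
    """True when enough DB nodes are up for the cluster to serve queries."""
    return len(set(nodes_up) & set(DB_NODES)) >= MIN_DB_NODES


def who_answers(nodes_up, secondaries_listed):
    """Return which tier can answer a DNS query, or None if nothing can."""
    if db_available(nodes_up):
        return "primary"       # primaries look the record up in the DB
    if secondaries_listed:
        return "secondary"     # secondaries serve their last stored copy
    return None                # no DB and no secondaries: nothing resolves


print(who_answers({"FNT", "TPA"}, secondaries_listed=False))  # primary
print(who_answers({"TPA"}, secondaries_listed=True))          # secondary
print(who_answers({"TPA"}, secondaries_listed=False))         # None
```

The last case is the one that matters: one DB node left and no secondaries listed means the zone goes dark.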
ASL had secondaries provided by the FNT datacenter provider, which went away. I was asked by the admin team to do secondary DNS as I have NS1-4.KEEKLES.ORG running for my own use.
Why did FNT go away?
The board, through its own ineptitude regarding technical issues and its treatment of the core admin/dev teams, pissed off one of the major donors to ASL, Inc. This resulted in notice being given (for the second time) to be off our FNT hypervisor by 00:00 UTC on Nov 1. I want to thank the provider of this datacenter and hypervisor. They really helped quite a bit, all while putting up with lots of bullshit.
To migrate away from this, Stacy, with the consent of the admin team, spun up a hypervisor and multiple VMs in Seattle. Stacy and Nate spent damn near a week figuring out the database tuning to make this work on a higher-latency link (FNT/ORD/TPA had under 40 ms between them). They replicated the major servers from FNT, and one of the primary DNS servers. This all happened without so much as a service outage.
Outages brewing…
Recently, changes have started being made without the permission of, or even consultation with, the admin team. This has resulted in multiple outages, some due to board members not understanding DNS basics (SOA, DS, NS delegation, etc.). I’ve stayed out of this, other than when asked to help troubleshoot, and the admin team has fixed most of it.
On about 28 September 2019 (the day it broke; the change itself was made before this), a change was made to the records at the registrar, without informing the admin team, removing NS1-4.KEEKLES.ORG and leaving only Smithers-FNT and Karl-TPA as DNS servers for the allstarlink.org zone. This left the system in a precarious position, as the primary DNS servers need a working DB or nothing resolves. I should mention that at this time the ORD datacenter was in the middle of a 50+ day outage, so the DB servers were running in N+0. Wouldn’t you know it, they had an issue with FNT rebooting and the DB went down to 1 active node.
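A change like this is easy to catch if someone compares the delegated name server list against what is supposed to be there. Here is a minimal sketch of that comparison in Python. The four secondary names are the ones from this post; the two primary host names are hypothetical placeholders, and how you obtain the delegated list is left out on purpose.

```python
# Secondary name servers named in this post.
EXPECTED_SECONDARIES = {f"ns{i}.keekles.org" for i in range(1, 5)}


def normalize(name):
    """Lower-case a host name and drop any trailing dot."""
    return name.strip().lower().rstrip(".")


def missing_secondaries(delegated_ns):
    """Return the expected secondaries absent from the delegated NS set."""
    present = {normalize(n) for n in delegated_ns}
    return sorted(EXPECTED_SECONDARIES - present)


# Hypothetical placeholder names stand in for the two primaries.
after_change = ["primary-fnt.example.", "primary-tpa.example."]
print(missing_secondaries(after_change))
# ['ns1.keekles.org', 'ns2.keekles.org', 'ns3.keekles.org', 'ns4.keekles.org']
```

An empty list means every expected secondary is still delegated; anything else deserves an email to the admin team before it becomes an outage.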
“The death of God left the angels in a strange position.”
– Multics source code comments in an error routine
With no secondary servers, nothing resolved. I found out when my node sent me an email:
rsync: getaddrinfo: rsync.allstarlink.org 873: Name or service not known
I tried to alert the admin team and got no response. ASL was off the web completely due to an unauthorized and unnoticed change. I got into the TPA server, set the DB to god mode, and got TPA online, then emailed the admin team. The admin team fixed the NS records at .org and then everything was working.
This was incident #1 of the board causing an outage.
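The error my node emailed me is a plain name-resolution failure, the same thing `getaddrinfo` reports to any program. Here is a minimal Python sketch of that check. The `resolves` helper and its `lookup` parameter are my own illustration; the stub resolver is there only so the example runs without touching the network.

```python
import socket


def resolves(host, port=873, lookup=socket.getaddrinfo):
    """Return True if host resolves, False on a name-resolution failure.

    port defaults to 873 (rsync), matching the error above. The lookup
    parameter lets the function be exercised without a live network.
    """
    try:
        lookup(host, port)
        return True
    except socket.gaierror:
        return False


def always_fails(host, port):
    # Stand-in for a resolver that cannot reach any working name server.
    raise socket.gaierror(socket.EAI_NONAME, "Name or service not known")


print(resolves("rsync.allstarlink.org", lookup=always_fails))  # False
```

Called without the `lookup` argument, `resolves("rsync.allstarlink.org")` uses the system resolver, which is what my node was doing when it failed.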
How a small problem was made worse by unannounced changes
Most recently, with FNT going away and CAUSTIC-SEA coming online as the co-primary DNS server, the secondary DNS servers needed to be updated. Much like node auth in ASL land, this is IP address based. Each primary server needs to be configured in the secondary servers with the IPs it will be sending AXFR from. It so happens that NS4.KEEKLES.ORG is an IPv6-enabled server [2607:f3f0:2:1001:225:90ff:fee4:63a5], and we had never configured allstarlink.org to permit Karl-TPA to update it via IPv4 or IPv6, so it was updating from Smithers-FNT. With the SEA servers coming online, the change was pushed to NS4.KEEKLES.ORG, removing FNT and replacing it with the IPv4 address of Caustic-SEA, taking for granted that the TPA primary was working.
It happens the Caustic-SEA server had IPv6 enabled, but not really used. This meant that the AXFRs were sent to NS4.KEEKLES.ORG from the IPv6 address, and this was not configured as an acceptable source for updates. NS4 did what it was supposed to do: serve up the last known cache of the zone, and log it. As I was not on the admin team, I didn’t check any of the DNS until asked by them to look into “DNS problems”. It was quite obvious the record was out of date, and some looking at the logs found the issues with TPA and the IPv6 address on SEA.
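The address-family mismatch is easy to see in a few lines of Python. This is only a sketch of an IP-based allow list, not the actual name server configuration: the addresses are from the documentation ranges (192.0.2.0/24 and 2001:db8::/32), not the real servers, and `transfer_allowed` is a hypothetical helper. Adding the IPv6 address at the end is one possible fix, shown for illustration.

```python
import ipaddress

# Documentation-range addresses standing in for Caustic-SEA; NOT the real IPs.
CAUSTIC_SEA_V4 = "192.0.2.10"
CAUSTIC_SEA_V6 = "2001:db8::10"


def transfer_allowed(source, allowed_sources):
    """True if the update's source address appears in the allow list."""
    src = ipaddress.ip_address(source)
    return any(src == ipaddress.ip_address(a) for a in allowed_sources)


allow_list = [CAUSTIC_SEA_V4]                        # only IPv4 was configured
print(transfer_allowed(CAUSTIC_SEA_V4, allow_list))  # True
print(transfer_allowed(CAUSTIC_SEA_V6, allow_list))  # False: same box, wrong family

allow_list.append(CAUSTIC_SEA_V6)                    # one possible fix: list both
print(transfer_allowed(CAUSTIC_SEA_V6, allow_list))  # True
```

The point is that a dual-stack primary is two sources as far as an IP-based allow list is concerned, and the secondary has no way of knowing they are the same machine.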
This issue was resolved at 12:30pm on 18 November 2019, about 30 min after an admin team member brought it to my attention. At the same time, Stacy, who was going to just drop the NS4 server rather than bug me about it, found he had been locked out of DNS. The admin team was given no notice of this; it was just done.
I’ll mention I set up the ASL account for domain names as a separate org-id and maintained it until asked by the board for the login info. As it was via Gandi, I told them to set up an account and give me their handles so each person would have a unique login. They wanted to use a role account, thus making tracking of changes impossible. :facepalm: I set it up and then removed myself from the account. The admin team had full access to it as they needed it, with the board having owner access.
Now the board has decided to migrate all other domains to a third party, breaking DNSSEC in the process. No notice was given to the admin team; it just broke, and people were left to figure it out. Nodes.allstarlink.org has been broken many times in the past by such maladroit actions from people outside the admin team. The admin team was instituted as a meritocracy: anyone who wanted to be involved could be, with the consent of the others.
https://wiki.allstarlink.org/wiki/Admin_Committee
There are quite a few negative statements floating about regarding the work/design of the admin team. This is fostered by the stated desire of the board to commercialize AllStarLink. Pissing off major donors is not how you ensure the advancement of a Free Software project. I know the admin team didn’t appreciate the extra work required to migrate from one working environment to another.
A willful lack of communication is not good for operations and development in an open source project.
If something breaks, check DNS first and email the admin team at admin@allstarlink.org. It’s likely they will be just as surprised it’s broken as you!