A Minimal ASL Node Without Asterisk Dependency (R&D)

Another note on the 16K audio track.

I’ve got 16K SLIN audio working well. As was noted in other places on this thread, the audio quality is noticeably better. It is useful in cases where network and audio bandwidth are less constrained. If you don’t care about higher audio sampling rates you should disregard the rest of this message.

I recognize that the G.722 CODEC supports 16K audio. At the moment I’m working with 16K SLIN because it’s easier to implement. However, I want to do this in a way that is completely compatible with the existing implementations (i.e. doesn’t create CODEC negotiation problems) and is also documented in the RFC. I’m trying to get RFC5456 clarified so I’m sharing some notes here in case anyone cares and/or has some expertise that is relevant.

Problem

  • I’m using higher-quality audio sampling rates in my IAX2 implementation, but I want to stay fully-compatible with existing Asterisk implementations.

  • RFC5456 defines a media format for 16-bit linear audio (0x00000040), but it isn’t completely clear what the sampling rate for this media format should be.

  • Most of the other IAX2 media formats have an implicit sampling rate.

  • I think (not 100% sure) that the Asterisk implementation of the 16-bit linear media format (0x00000040) assumes an 8K sampling rate. Anyone who knows better should correct me.

  • I can see that the Wireshark dissector for IAX2 labels the undocumented media format 0x00008000 as “16K 16-bit linear audio,” so apparently someone has already gone down this path and claimed an unused media format codepoint for this purpose. (NOTE: I use this media format in my 16K audio implementation.)

  • The RFC defines an information element called SAMPLINGRATE which is obviously relevant, but it is never explained how this element should be used.

  • The expert at IANA who is responsible for editing the RFCs and related numeric assignments is hesitant to allocate a new media format for 16K linear audio if the existing SAMPLINGRATE element could be clarified and used for its intended purpose.

  • It’s possible that I’d want linear audio at other higher sampling rates (for example 22.05, 44.1, 48k), so the point about leveraging SAMPLINGRATE is well taken. This is especially true given that the FORMAT element is limited to 32-bits.

  • Informative point: SIP addresses this problem by allowing sampling rates to be specified in the SDP body of the SIP INVITE message (example: a=rtpmap:96 G729/8000).

Proposal

  • The existing SAMPLINGRATE information element will be added as an optional element on the NEW and ACCEPT message (RFC5456 tables 6.2.2 and 6.2.3 respectively).

  • If specified, the SAMPLINGRATE element on the NEW message allows the caller to define the maximum sampling rate specified for any capable media formats for which the sampling rate isn’t implicit. (For example: the SAMPLINGRATE element isn’t relevant for G.711 uLaw because the 8K assumption is implicit in the ITU specification, but it might be relevant for G.729 where different sample rates are allowed).

  • If specified, the SAMPLINGRATE element in the ACCEPT messages allows the callee to define the sampling rate that will be used for the call, but only in cases where the media format has flexibility in this regard. The accepted SAMPLINGRATE must not exceed the maximum sampling rate provided by the caller.

  • The default sampling rate for any media format that has flexibility in this regard will be 8K. The most important example is the 16-bit linear media format (0x00000040), which will default to 8K if no SAMPLINGRATE element is used.
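To make the proposed rule concrete, here is a minimal sketch of how a callee might choose the rate it returns in ACCEPT. This is only an illustration of the proposal above, not anything defined by RFC5456 or implemented in Asterisk; the function name, its arguments, and the choice to raise an error when no rate fits are all assumptions.

```python
DEFAULT_RATE = 8000  # proposed default (Hz) when no SAMPLINGRATE element is sent


def accept_sampling_rate(caller_max, callee_supported, rate_is_flexible):
    """Return the rate (Hz) the callee would put in ACCEPT, or None to omit it.

    caller_max:        SAMPLINGRATE from the NEW message, or None if absent.
    callee_supported:  rates (Hz) the callee can handle for the chosen format.
    rate_is_flexible:  False when the format has an implicit rate (e.g. G.711).
    """
    if not rate_is_flexible:
        return None  # the element is not relevant for this media format
    # No element on NEW means the caller only promised the default rate.
    ceiling = caller_max if caller_max is not None else DEFAULT_RATE
    usable = [rate for rate in callee_supported if rate <= ceiling]
    if not usable:
        raise ValueError("no sampling rate in common with the caller")
    # The accepted rate must not exceed the caller's maximum.
    return max(usable)


print(accept_sampling_rate(16000, [8000, 16000, 48000], True))
```

Picking the highest common rate is just one policy; a callee on a constrained link could equally pick a lower one, as long as it stays at or below the caller's maximum.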

Following up on the 16K audio track. I raised the question about the best way to represent the 16K linear media format in the IAX2 protocol on the Asterisk development forum. Someone pointed me to an older Asterisk header file that showed that media format 0x00008000 had been allocated to 16K 16-bit linear, little endian format. The same file also explicitly stated that media format 0x00000040 is for 8K linear. So that resolves my question and eliminates the need to use the SAMPLINGRATE information element. That also shows that the IAX2 RFC is out of date. I’ve raised a request with the IANA people to have the RFC revised with this additional information.
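With the rate tied to the media format, format selection can be sketched as a simple table lookup. The 0x00000040 and 0x00008000 values are the ones discussed above and 0x00000004 is the RFC5456 value for G.711 uLaw; the 20 ms frame duration, the preference order, and the function names are assumptions for illustration only.

```python
# format bit -> (description, sampling rate in Hz, bytes per sample)
FORMATS = {
    0x00000004: ("G.711 uLaw", 8000, 1),
    0x00000040: ("16-bit linear, 8K", 8000, 2),
    0x00008000: ("16-bit linear, 16K", 16000, 2),  # from the Asterisk header, not the RFC
}

FRAME_MS = 20  # assumed voice frame duration


def frame_payload_bytes(fmt):
    """Audio payload size of one frame for the given format bit."""
    _, rate, bytes_per_sample = FORMATS[fmt]
    return rate * FRAME_MS // 1000 * bytes_per_sample


def pick_format(capability_mask, preference=(0x00008000, 0x00000040, 0x00000004)):
    """Return the first preferred format present in the peer's capability mask."""
    for fmt in preference:
        if capability_mask & fmt:
            return fmt
    return None


chosen = pick_format(0x00008000 | 0x00000040 | 0x00000004)
print(FORMATS[chosen][0], frame_payload_bytes(chosen), "bytes per frame")
```

A peer that only advertises 0x00000040 and 0x00000004 falls back to 8K linear, which is what keeps this compatible with existing nodes.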

All of this works. I gave an installation package to David NR9V and he and I had a cross-country QSO today using 16K “high definition” audio and AllScan interfaces on both sides. It sounded great. I don’t know for sure, but I’m assuming that was the first ever 16K QSO on AllStar.

Bruce has done an amazing job in all regards with this, and the 16Ksps audio sounds beautiful. Our QSO on Jan. 11 was about an hour long and the audio was perfect the whole time.

Looking forward to seeing both &-ASL and ASL3 continue to progress with these kinds of audio quality and usability improvements. The & web interface and configuration is very modern and clean, and all the little details have been done the right way. For example, the URI Tx and Rx gain settings are in dB units corresponding to the hardware IC (i.e. it reads the supported range of dB mixer setting values from the OS audio driver), and it has a 100% web-based configuration and management interface - no SSH, text editors, or Asterisk dial plan logic required. During our QSO one thing we talked about was level meters on the UI for both incoming and outgoing IAX and USB audio, and about an hour later Bruce had those in place and looking super nice.

BTW I did a quick parrot test demo here - https://youtu.be/vRurT4k_Jy4
This was from Jan. 10 and UI shown there is already out-of-date but the audio and parrot implementation are flawless.

Thanks David for the kind words on the audio quality. Knowing how much effort you put into the audio quality of your products, that means a lot.

So far the most interesting part of this project has been experimenting with ways to improve the audio quality of ASL. I’ve spent a lot of time playing with different jitter buffer and packet loss concealment algorithms. What you’re testing right now is what I think is the best combination. But it can be better.

Lately I’ve been researching a class of speech processing algorithms known as “Time Scale Modifications.” It seems like most of the work on these methods was done in the 1990s so this may be old news to everyone.

Basically, the TSM algorithms let you speed up or slow down speech without affecting pitch. If you try to speed up speech the naive way by just resampling it at a higher rate you end up with the chipmunk effect which sounds funny, but would be frowned upon by serious amateur radio operators. The TSM that sounds the best to me is called “Waveform Similarity-Based Overlap Add” (WSOLA). The original 1993 paper is behind an IEEE paywall, but here’s a PPT overview: https://hpac.cs.umu.se/teaching/sem-mus-16/presentations/Schmakeit.pdf

Some of these techniques are used in Asterisk, but from what I can tell it all happens on the SIP side of the system so us IAX-based hams don’t have the benefit.

There’s at least one place where I think this is practically relevant to AllStarLink audio quality:

Jitter Buffer Improvements

From what I’ve seen on high-quality wired networks (Verizon FIOS) and low-quality mobile hotspots (Total Wireless 4G/LTE + $20 Netgear LM1200), ASL voice packet loss is rare but packet delays are common. My wireless link will often experience a “drop out” of 5-10 packet periods (about 100-200ms) followed by a flood of those missing packets. As we all know, the jitter buffer is the solution to this. There’s already a jitter buffer in Asterisk/app_rpt, so nothing new here.

As long as your jitter buffer is long enough, you never hear these gaps/floods because the delayed playout is completely smooth. Making this delay longer gives better protection from gaps at the expense of annoying latency.

The adaptive jitter buffer methods (see description of Ramjee Algorithm I earlier in this thread) do a good job of automatically making the jitter buffer longer for jittery mobile connections, but this still leads to increased latency. (Automatic is nice because there are no configuration parameters to argue about. :slight_smile:)
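For readers who missed the earlier description, the adaptive delay estimate can be sketched roughly as below. This follows my recollection of Algorithm 1 from the Ramjee et al. paper (exponentially smoothed delay plus four times the smoothed variation); the class name is made up, the smoothing constant is the value usually quoted from that paper, and the version described earlier in this thread or used in Ampersand may differ.

```python
class AdaptivePlayoutDelay:
    """Running estimate of the playout delay needed to absorb network jitter."""

    def __init__(self, alpha=0.998002):
        self.alpha = alpha   # smoothing constant (assumed value)
        self.d_hat = None    # smoothed one-way delay estimate (ms)
        self.v_hat = 0.0     # smoothed delay variation estimate (ms)

    def update(self, delay_ms):
        """Feed one packet's measured delay (arrival time minus timestamp)."""
        if self.d_hat is None:
            self.d_hat = float(delay_ms)
        else:
            a = self.alpha
            self.d_hat = a * self.d_hat + (1.0 - a) * delay_ms
            self.v_hat = a * self.v_hat + (1.0 - a) * abs(self.d_hat - delay_ms)
        return self.playout_delay()

    def playout_delay(self):
        """Target buffer delay: mean delay plus a margin of four deviations."""
        return self.d_hat + 4.0 * self.v_hat


est = AdaptivePlayoutDelay(alpha=0.9)  # faster smoothing just for the demo
for d in [50, 50, 50, 250, 50, 50, 250, 50]:
    print(round(est.update(d), 1))
```

Only the variation in the measured delay matters here, so an unknown clock offset between the two ends does not break the estimate. In the paper the new target is applied at the start of a talkspurt, which is exactly the limitation discussed further down.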

When a network gap grows long enough to drain the jitter buffer completely the system switches to packet-loss concealment (PLC) to plug the gap with synthetic frames. Then the audio quality starts to degrade. For this reason, the PLC algorithm tries to switch back to real audio as soon as it is available. When the flood of late packets finally arrives, it seems like the best quality is obtained when the playout starts again where it left off (with suitable smoothing in and out of the synthetic period), but that has the effect of increasing the audio latency even more. The other option is to discard the late packets and maintain a fixed latency. You’ll need to try both to see which sounds better to you.

If the jitter buffer is sophisticated enough, the added latency created by playing the late packets can be quickly recovered back to the target value during the next period of silence. But for longer transmissions this isn’t an option.

This is where WSOLA comes into the picture. WSOLA can be used to speed up the playout a bit after a PLC event to get back down to the original target latency mid-stream. Once recovered, the jitter buffer is positioned to handle the next drop-out as smoothly as possible while maintaining a lower average latency.
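To give a feel for what the speed-up step involves, here is a bare-bones WSOLA sketch in plain Python. It is not the Ampersand implementation: the frame length, search tolerance, Hann window, and the restriction to speed-up only are simplifying assumptions, and a real-time version would work incrementally on the jitter buffer rather than on a whole list.

```python
import math


def wsola_speedup(x, speed, frame=160, tol=40):
    """Shorten x by roughly `speed` (>= 1.0) without changing the pitch.

    x is a list of float samples; frame (even) and tol are in samples.
    """
    if speed < 1.0:
        raise ValueError("this sketch only handles speed-up")
    hop_out = frame // 2                   # output hop: 50% overlap
    hop_in = int(round(hop_out * speed))   # input hop: advances faster
    n_frames = (len(x) - frame - tol) // hop_in + 1
    if n_frames < 1:
        return []
    # Periodic Hann window: two copies offset by frame/2 sum to exactly 1.
    win = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame) for n in range(frame)]
    y = [0.0] * ((n_frames - 1) * hop_out + frame)
    prev = 0                               # input position of the last frame used
    for m in range(n_frames):
        best = m * hop_in                  # nominal input position
        if m > 0:
            ref = prev + hop_out           # natural continuation of the last frame
            best_score = -math.inf
            # Search near the nominal position for the segment that looks
            # most like the natural continuation (maximum cross-correlation).
            for cand in range(max(0, m * hop_in - tol), m * hop_in + tol + 1):
                score = sum(x[ref + i] * x[cand + i] for i in range(frame))
                if score > best_score:
                    best, best_score = cand, score
        for i in range(frame):             # overlap-add the chosen segment
            y[m * hop_out + i] += win[i] * x[best + i]
        prev = best
    return y


tone = [math.sin(2 * math.pi * 210 * n / 8000) for n in range(8000)]
faster = wsola_speedup(tone, 1.25)
print(len(tone), "->", len(faster), "samples")
```

A jitter buffer would only run something like this while the measured latency is above target, with a modest speed factor, and return to normal playout once the excess has been absorbed.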

To me a good analogy is AGC. AGC doesn’t need to wait until the end of a transmission to change the gain to the target value.

Going back to the primary goals of Amateur Radio, first among them are education, community service, preparedness, and then the other more usual goals such as ragchewing, pizza & beer. This ties into technology, science, engineering, math and computer science as well. Thus the reason organizations like ARDC and many others prioritize support for organizations who facilitate these larger community goals. Trends in STEM, maker/DiY communities, ham radio, electronics, software, etc., seem to imply that what people who are new to these areas want to see is simplicity, elegance (not a word you hear often in ham radio) and user-friendliness.

Younger people want to be able to just look at something and have it make intuitive sense what it is, what it does, how to use it and how to change it or adapt it. If it doesn't interest them in the first few minutes they have no shortage of other potentially more interesting things to devote their attention to.

As pointed out by luminaries such as Alan Kay, who coined the term Object-Oriented Programming (OOP) (see the Alan Kay article on Wikipedia), the software industry has tended to do the opposite of the above.

Does a Radio-over-IP app need 2 Million Lines of Code? Is that conducive to the longer-term goals of amateur radio? If that size of a codebase was doing something very well and efficiently, and doing so in a way that was clear and intuitive, then sure, but if the same thing could be done with a more modern approach in 20K LOC, then clearly the 1st approach was not efficient or optimally conducive to the goals of the hobby.

If I were to try and figure out where exactly in App-rpt and Asterisk jitter buffering was done, which algorithms were used, and what tradeoffs are made with latency, and if I wanted to be able to view that information in real-time as I use a node, it would probably take days of digging through code and dozens of forum posts to figure all that out and enable/implement the appropriate log messages. Probably everyone just assumes it's good and fine, and maybe it is, but from my experience with a wide range of VOIP systems I would bet ($20?) that some very significant latency optimizations could be made. It is thus very refreshing to see &-ASL making such great progress and to see discussion of these kinds of details, and know that if I wanted to know more I could find these functions in probably 5 minutes of work in the & github repo. Thanks again Bruce!

There is now increasing awareness in the software industry of how to better achieve the goals of what might be called "modern" software. As new approaches emphasize simplicity, observability, precision, reliability, and open standards and best practices, they tend to be more successful and become more popular. This is hardly new in just about any other industry. The software industry, and those employed by large software companies, however tend to drink their own Kool-Aid and create overly complicated software and processes that make it difficult to scale to the level of individuals, hobbyists - and hams.

The beauty of FOSS ham radio software is there's no rush, no deadlines, and no tradeoffs required. We can make it as simple, clear and intuitive as we'd like. People often have the idea that software is supposed to be complex, poorly documented, and opaque, and seem to have lost sight of the fact that software, particularly in embedded communication systems, is nothing more than an abstraction of hardware, which is comprised of simple electronic components that follow simple physics concepts. Repeater systems, telecom systems, and audio systems existed long before the internet and PCs. In the olde days these things used switches, potentiometers, level meters, Op Amps and copper wire to do everything, and this underlying simple meta-structure has not fundamentally changed.

The thing I like about designing hardware is that everything is clear and precise. A one-page schematic can clearly document an entire product, with very little abstraction. Software can be just as simple, when people are able to step outside of legacy ways of thinking. Most software developers are like most politicians in a way, they seem to always think that more of whatever service they provide is always the solution to everything. The 2nd-order effects of complexity, unsustainability and inefficiency seem to be enthusiastically overlooked.

AllStar is here to stay, ASL3 works well, much of its complexity makes sense in context, though some doesn't, but longer-term there's room for other approaches that leverage its strengths and address its weaknesses in a truly modern way, and bring it closer to a highly optimized, stable, modular, extensible ROIP system that "just works", where things are properly optimized with clear and easily accessible instrumentation at every interface. The required pieces are already all there in Linux, Python, SciPy, GNURadio, IAX, other latency-optimized VOIP codecs, modern audio drivers, etc. Just a matter of integrating these pieces in the right way. Having used Asterisk and ASL for a few years now I don't see Asterisk as a very useful part of that, at least not as a platform. Sure maybe tie into it as a module if needed eg. for autopatch, SIP or PSTN purposes, but otherwise it seems, well kind of like a 1990's PBX system. Anyhow, just some food for thought. Signed, Looking forward to further innovations in ham radio...

Thanks David. That Alan Kay video is interesting. He seems to be quite proud of himself in the video, and quite critical of at least one wildly successful technology (the web browser). But I guess if you invent Smalltalk you are allowed to be a little bit arrogant. :slight_smile:

Regarding the line count on Asterisk: to be fair, we need to remember that a lot of that code does things that aren’t directly related to what we would think of as “radio linking.” I’m guessing there are things like voicemail lurking in those 2 million lines somewhere. SIP also. But I do agree with you that a greatly simplified take on amateur linking software should help to spur some more innovation in this important space. My rule of thumb is <10K lines - anything more and it gets to be hard to engage with. I’m going to try hard to stay under that number, particularly because I’m trying to run this on a microcontroller.

I did a lot of experimentation this week on kerchunk filtering. My notes are here in case anyone is interested in this stuff: https://mackinnon.info/ampersand/#kerchunk-filtering

I’ve also been doing some research on audio level setting, sharing for anyone who is interested in the mechanics: https://mackinnon.info/ampersand/asl-audio-levels.html. Comments welcomed!

This week I’ve been focused on some IAX2 protocol extensions. There are two things that may be of interest:

I’ve got a node that runs in a repeater site using a cellular 4G/LTE hotspot. The carrier implements CGNAT and that comes with all of the usual problems that have been discussed at length on these forums. I’ve come up with a way to allow the node to receive calls from the outside world without the use of VPNs, UDP proxies, or extra audio hops through hub nodes. It’s a pretty simple protocol extension that is borrowed from the EchoLink system. It seems to work fine - I can now place calls to my repeater node. This requires a simple broker on the network to facilitate handshakes. Perhaps if the idea turns out to have merit a broker could be put into the official ASL infrastructure? Documentation is here: https://mackinnon.info/ampersand/#firewallcgnat-traversal-ipv4-only

NOTE: From what I can understand, this problem doesn’t completely go away with IPv6. The NAT issue is fixed obviously, but the mobile carriers are not likely to allow inbound ports on the firewall. Someone should correct me if I’m wrong.

Secondly, I’ve been thinking about ways to propagate the callsign of the “active talker” through the network, similar to what is done on the DMR/D-Star/YSF networks (I think). If this idea turns out to have merit it could be applied more widely. Documentation is here: https://mackinnon.info/ampersand/#identification-of-active-talker

FWIW:

I did some testing a while back, and found that both AT&T and T-Mobile, using my iPhone as a hotspot, allowed direct incoming connections using IPV6. Verizon, I think, didn’t, but it’s been a while since I did that test.

Bruce,

I am having a wonderful time following your progress on this. I am very excited to see someone putting this much effort into reverse engineering and re-implementing these systems. This is how we move forward in technology and amateur radio - by breaking the mold and sometimes re-inventing the wheel.

A note I have about something you said...

According to app_rpt documentation, this feature is actually supposed to exist already, in the form of the K key in IAX text.

Sadly, it appears to never have actually been implemented. I brought this up as a feature request in another topic recently, and a Github issue was opened to hopefully look into this more at some point.

Nonetheless, excited to see some talk and development in this area! This is something that app_rpt has been desperately needing for a very long time.

Thanks Patrick, that’s very good to hear. There are a lot of good reasons to move to IPv6 and this would be the best one if it turned out that the mobile carriers would allow UDP ingress like that. I will do some more testing on IPv6 to see whether the entire issue goes away in that world.

Hi Mason, thanks much for reading my stuff. I fully recognize that I am reinventing wheels here, but I am also trying to make the wheel a bit rounder in a few places.

Allan WA3WCO was just pointing me to that exact same “K” enhancement request on a different thread. This is good to know about. I will look at this more closely to see how it compares to my current scheme. I’ve updated my docs with a reference.

One thing that would be required using the “K” scheme is a good way to convert the keyed node number to a more user-friendly call-sign/name. I’m sure there are ASL HTTP APIs that could do that, but I also know there are concerns about a bunch of stations hitting those central APIs. My thinking about letting the TALKERID flow through in parallel with the audio is that the call/name metadata is dwarfed by the audio that it describes.

But if there was a solution that already had much of the support baked into the existing network, that solution should win.

I would hardly call it reinventing the wheel. When Mercedes came out with the G63, no one at Winnebago thought anything of it :rofl:

A few notes for the week:

  1. I spent some time looking at the audio resampling filters used in app_rpt. Notes are here.
  2. After some talk on this forum about scalability of large conference servers I did some measurements to think about the critical path in that scenario. Notes are here.

Thanks for the continued great work taking a look at these fundamental details (that heretofore had seemingly been overlooked due likely to their being “small” details within the huge Asterisk&app-rpt codebase). A few notes on your notes:

  • A first important point is that even if a repeater system or radio has only a narrow audio bandwidth such as 3–3.5 KHz, it is much better to sample at at least 4x that frequency. Almost all pro audio equipment nowadays runs at 96KHz sample rates, which gives theoretical frequency response to 48KHz – which though twice as far up as anyone can actually hear is done so that sharp low-pass filters and the corresponding signal degradation and phase nonlinearities that those typically introduce do not impact audible frequencies or transient detail. Thus 16K codecs will definitely sound better than 8K even on standard 5KHz deviation FM systems. This kind of difference is not always immediately apparent but on some signals and with extended listening it definitely can make a noticeable improvement. Also, a 16K sample rate does not always equate to twice the actual network bandwidth if a codec is used that does some level of data compression. For example a jpg or png file at half the number of pixels may be only 40% smaller, which might then equate to only a 30% difference in network bandwidth with packet overhead.
  • It would be nice to quantify how much difference the current filtering in ASL3 actually makes. This could be done by running an audio test file from one node to another (ie. 48K in → 8K IAX → 48K out) with a series of tones spaced ~1/12th octave from 20-20K Hz and then look at the distortion resulting for each tone. Ideally there should be no more than 0.1% total distortion, which corresponds to a 60dB audio SNR. It would be fairly easy to automate this sort of test in C++ / python.
  • Re. scalability, yes there should definitely be no need to individually resample/filter every outgoing audio stream. With that properly optimized I would imagine & should have no trouble at all with 1K connections on a decent cloud VM.
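The tone series suggested above is easy to generate. A small sketch follows; the function names are made up, and the percent-to-dB helper uses the voltage (20·log10) convention, which is what makes 0.1% come out at -60 dB.

```python
import math


def test_tones(f_lo=20.0, f_hi=20000.0, steps_per_octave=12):
    """Frequencies from f_lo up to f_hi, spaced 1/steps_per_octave octave apart."""
    tones = []
    k = 0
    while True:
        f = f_lo * 2.0 ** (k / steps_per_octave)
        if f > f_hi:
            break
        tones.append(f)
        k += 1
    return tones


def distortion_percent_to_db(percent):
    """Convert a distortion figure given as a voltage ratio in percent to dB."""
    return 20.0 * math.log10(percent / 100.0)


tones = test_tones()
print(len(tones), "tones from", tones[0], "Hz to", round(tones[-1], 1), "Hz")
print(round(distortion_percent_to_db(0.1), 1), "dB")
```

For an 8K IAX hop only the tones below 4 kHz can survive at all, so the higher tones mostly test how well the downsampling filter rejects them.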

OK, thanks David. A lot of this audio stuff is over my head but this is a very good learning experience. I’ve done a first pass of your proposed “round-trip” 48→8→48k processing test. The chart below shows the total distortion advantage (in dB) of the Ampersand filter over the filter that’s in app_rpt. Interestingly at lower frequency it’s actually negative, meaning the app_rpt filter has lower distortion by a few dB. At higher frequency the Ampersand filter starts to do better and gives around +10dB total distortion advantage.

I’m going to need to review my methodology carefully. One thing that’s important to note is that the test setup I’m using appears to have an inherent “noise floor” of about -45dB. I’m assuming that’s coming from numerical imprecision and shortcomings in the FFT methodology I’m using. In other words: if I just transmit a tone sampled at 48k and then analyze the total distortion (power at all non-fundamental frequencies / power at the fundamental frequency) without any re-sampling I end up with -45dB of distortion, and that’s flat across the band. So getting to -60dB is a pipe-dream at the moment - I’ll need to keep working on that. :slight_smile:

I am using tones that are even multiples of the resolution of the FFT (48,000 / 1024) Hz to avoid the “bin spreading” problem. But there must be something else going on.
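As a cross-check on the measurement itself, here is a small reference version of that calculation in plain Python. It uses a direct DFT rather than an FFT library, and the function names and the choice of bin 21 are arbitrary; it is a sketch of the method as described, not the actual test code.

```python
import cmath
import math

FS = 48000   # sample rate (Hz)
N = 1024     # analysis block length, so bins are FS / N Hz apart


def bin_aligned_tone(k, n=N, harmonics=()):
    """Unit sine with exactly k cycles in n samples (frequency k * FS / n).

    harmonics is an optional list of (multiple, amplitude) pairs to add.
    """
    out = []
    for i in range(n):
        s = math.sin(2 * math.pi * k * i / n)
        for mult, amp in harmonics:
            s += amp * math.sin(2 * math.pi * k * mult * i / n)
        out.append(s)
    return out


def total_distortion_db(x, k):
    """Power in all non-fundamental bins over power in bin k, in dB."""
    n = len(x)
    twiddle = [cmath.exp(-2j * math.pi * j / n) for j in range(n)]
    power = {}
    for b in range(1, n // 2):   # skip DC, positive frequencies only
        acc = sum(x[i] * twiddle[(b * i) % n] for i in range(n))
        power[b] = abs(acc) ** 2
    rest = sum(p for b, p in power.items() if b != k)
    if rest == 0.0:
        return -math.inf
    return 10.0 * math.log10(rest / power[k])


k = 21  # 21 * 48000 / 1024 = 984.375 Hz
print(total_distortion_db(bin_aligned_tone(k), k))
print(total_distortion_db(bin_aligned_tone(k, harmonics=[(2, 0.001)]), k))
```

With double-precision samples and a bin-aligned tone, the first figure should be far below -45 dB, and a second harmonic at 0.1% amplitude should read very close to -60 dB. A floor near -45 dB may therefore point at something in the test chain, such as 16-bit quantization or a windowing step, rather than at the FFT itself.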

That was quick, and looks like you’re well on your way down the rabbit hole of Audio DSP. Some additional thoughts:

  • Doing filtering frame-by-frame on 20mS blocks will have limitations in resolution / accuracy eg. quantization errors, boundary/windowing errors, etc. There are ways around that, as you’re probably well aware, but ideally the math will be provably correct ie. the transfer functions of the downsampling and upsampling filters should cancel out such that various test signals within the passband should pass through nearly identical to the original. In pro audio this could mean accuracy to 24 bits, but in telecom as low as 60dB might be OK which is the equivalent of only 10 bits. BTW when I mention distortion and bits I’m referencing Voltage rather than power thus why 0.1% distortion corresponds to ~-60dBFS of noise.
  • FFTs on small sample sets (eg. 20mS frames) are of limited use in audio, as FFTs work in linear rather than logarithmic frequency space and thus have very poor resolution in lower octaves, which might explain your graph above. Wavelet or other log-based transforms should be much better for audio frequency analysis, though ideally time domain methods can be used to evaluate distortion.
  • Interpolation and decimation should probably be done with some sort of smoothing such as a sinc function. This could significantly reduce distortion prior to additional filtering steps. The ASL3 channel drivers just do a straight decimation with no smoothing, and in the upsampling direction SimpleUSB calls lpass() which according to the function comments does a 31-tap FIR 2.9KHz LPF, but sinc upsampling would be more accurate and probably not significantly different in CPU efficiency. Virtual analog filters could also be interesting, probably they would be significantly more CPU-intensive but would not have issues with windowing or frame boundaries.
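The difference between straight decimation and filtering first is easy to demonstrate. In the sketch below the 127-tap length, 3.4 kHz cutoff, and Hamming window are arbitrary choices for illustration; they are not the app_rpt or Ampersand filter parameters.

```python
import math


def lowpass_taps(cutoff_hz, fs_hz, n_taps=127):
    """Hamming-windowed sinc low-pass FIR, normalised to unity gain at DC."""
    fc = cutoff_hz / fs_hz   # cutoff in cycles per sample
    mid = (n_taps - 1) / 2
    taps = []
    for n in range(n_taps):
        t = n - mid
        ideal = 2 * fc if t == 0 else math.sin(2 * math.pi * fc * t) / (math.pi * t)
        window = 0.54 - 0.46 * math.cos(2 * math.pi * n / (n_taps - 1))
        taps.append(ideal * window)
    total = sum(taps)
    return [h / total for h in taps]


def decimate(x, factor, taps=None):
    """Keep every factor-th sample; if taps are given, low-pass filter first."""
    if taps is None:
        return x[::factor]   # straight decimation, no anti-alias filter
    out = []
    # Only the samples that survive decimation are computed.
    for start in range(0, len(x) - len(taps) + 1, factor):
        out.append(sum(h * x[start + j] for j, h in enumerate(taps)))
    return out


def rms(samples):
    return math.sqrt(sum(s * s for s in samples) / len(samples))


# A 7 kHz tone cannot be represented at 8K; unfiltered it aliases to 1 kHz.
tone = [math.sin(2 * math.pi * 7000 * n / 48000) for n in range(4800)]
print("straight:", round(rms(decimate(tone, 6)), 4))
print("filtered:", round(rms(decimate(tone, 6, lowpass_taps(3400, 48000))), 4))
```

The unfiltered path passes the out-of-band tone at full level as an alias, while the filtered path suppresses it. Because the filter is only evaluated at the output sample positions, the cost per output sample is one pass over the taps, which is why a longer sinc-style filter need not be much more expensive than a short one run at the full input rate.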

I’m not a DSP expert and my terminology and understanding of what all is being done in ASL is just from an “amateur” perspective. There are probably already clear standards and best practices on how to best do all this in telecom systems, which I think you have already done some fair amount of looking into. But signal processing was one of my areas of emphasis when getting my BSEE many years ago and something that I will be getting more involved in hopefully relatively soon.

There is some significant complexity in how all this can best be done but this is just pure math that is easy to do simulations on and thereby prove numerically that the desired effects are achieved on various test signals with no artifacts. Once that’s done, it just works from then on and can be easily documented with simulation/test code that allows anyone to understand how it works and confirm it works as intended.

Thanks for all that background David. I spent a little time reading up on the wavelet stuff and it’s very cool. I’ll try that out later.

But back to my simple-minded FFT scheme, I found a few problems in my test setup. Now my noise floor is about -91dB … much better! After cleaning things up I get the result below, which shows very similar performance up to around 1.2kHz and then significantly better distortion numbers above that. The rest of the details are in my writeup since I’m guessing people are bored by this tangent.

Nice! That graph looks like I would have expected, and confirms that ASL3 does have aliasing issues at 2KHz and above. It may not be audible most of the time but it’s clear that at 3KHz & has ~30dB lower aliasing artifacts. And your test methodology and steps all look good for finding and quantifying those artifacts. BTW I do highly suggest that as you do various tests/simulations like this you put that code up in the & github repo somewhere so it’s available for anyone to look at and run for themselves. In that sense not only is the software open source but the work that goes into validating that it does everything in the best way is also open source and reproducible.

A few notes for this week:

  1. I did some capacity testing. Notes are captured here. Bottom line: I was able to get to 500 nodes.
  2. I’ve started playing with the Voter support. It’s a very interesting capability. I’m able to work from the documentation but I probably need someone patient/knowledgeable about the Voter protocol to help fill a few gaps in my understanding. Also, does someone have a Voter radio interface (RTCM?) that they could lend me for a few weeks so I can test my implementation? I’d be happy to pay for the shipping. (Please reach out directly on that.) Thanks!

This week I was on a very interesting tangent looking at the VOTER support. Many thanks to Mason N5LSN for lending me his client hardware and Tom NN6H for pointing his spare RTCM at my development server. With the help of these guys I was able to get the protocol driver worked out and today Tom was able to send VOTER audio into the system and hear it out the other side.

I’ve also been working on getting my code to run on a microcontroller and I was able to create a simple VOTER client using an ARM Cortex-M0+ (Pico W) that can send and receive VOTER audio. I’ve not done anything with the GPS clocks yet, this is just a protocol demo so far.