Connections to my node drop - how to properly diagnose/adjust?

Hi,
I’m pretty good with networking: port selection and forwarding.
I periodically have nodes drop (lose connection).
But I’m unsure of how to troubleshoot this.
The connections “work fine” when they connect.
But I need to determine why they are dropping.
I have great connectivity and am not losing connection on my end.
How are these connections maintained, and what exactly determines/decides when they drop?
I am asking for both serverside (me) and on the client’s end.
Where are the timers set on both ends?
I have been searching for this and not finding any definite answers.
I am unsure if this is being done by app_rpt or IAX.
Only clue I have seen are some undocumented settings in iax.conf

dropcount = 2
autokill = yes
delayreject = yes

But have not found any information on these.
I do not see anything obvious in rpt.conf
I am already familiar with setting up “permanent connections”.
In this case this is not what I am trying to do.
I would like non-permanent connections to stay up longer or more reliably.
I’d like to be able to diagnose WHY they are dropping and make adjustments.
I am not losing connectivity to them when they drop; pings are fine and port reachability tests are good
the whole time when they are dropping.

Please help and thanks.

-Steve N8LBV

Steve,
Before doing any network troubleshooting, I would want to rule out the obvious things.

Do you have a link inactivity timer set ?
Does the node in connection have one set ?

Is it always the same node that disconnects?
Can you test it with some other node(s)?
(it could be that it is only a network issue between the 2 and not you at all)

Check registration data for your node and for the node that drops,
right at the time it happens. Or look in /usr/lib/asterisk/rpt_extnodes to see if both nodes are in the node list. The server you are connecting to may also have an issue with its nodes list.
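
To make that node-list check repeatable, something like the following could be run right when a drop happens. This is only a sketch: it assumes each entry in the file looks like `NODE=radio@IP:PORT/NODE,IP` (verify against your own file), and the helper name `lookup_nodes` and the sample entries are made up for illustration.

```python
import re

# Assumed entry format (check your own file before relying on this):
#   63001=radio@192.0.2.15:4569/63001,192.0.2.15
ENTRY = re.compile(r"^(\d+)=\S+@([^:/\s]+):(\d+)/")

def lookup_nodes(text, wanted):
    """Return {node: (host, port)} for each wanted node number found in text."""
    found = {}
    for line in text.splitlines():
        m = ENTRY.match(line.strip())
        if m and m.group(1) in wanted:
            found[m.group(1)] = (m.group(2), int(m.group(3)))
    return found

# Illustrative sample using documentation-only addresses.
SAMPLE = """\
[extnodes]
63001=radio@192.0.2.15:4569/63001,192.0.2.15
64000=radio@203.0.113.30:4569/64000,203.0.113.30
"""

result = lookup_nodes(SAMPLE, {"63001", "64000", "65000"})
for node in ("63001", "64000", "65000"):
    print(node, result.get(node, "NOT IN LIST"))
```

To use it on a real system, pass the contents of the node list file instead of `SAMPLE`, for example `lookup_nodes(open(path).read(), {"63001", "64000"})`.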

Are you using ASL3, and is it up to date? (apt update / apt upgrade / reboot)

Is there any reason the other node may be manually disconnecting you ?
With your connection, are you also bringing in a high-traffic connection that is not part of your network?

Make contact with the other node system op and see what details that brings.

Additional details of the connection drop might be found in the logs in /var/log/asterisk,
as well as by watching asterisk in the foreground, which can be hard to do if the drops are random.
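
Since random drops are hard to catch by eye, one option is to poll the channel list and timestamp every change, then compare those times against the logs. The sketch below assumes `asterisk -rx "iax2 show channels"` can be run by the current user and that channel lines start with `IAX2/`; the function names and the 5 second interval are arbitrary choices for illustration.

```python
import subprocess
import time
from datetime import datetime

def parse_channels(output):
    """Map channel name -> peer address from 'iax2 show channels' text."""
    channels = {}
    for line in output.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[0].startswith("IAX2/"):
            channels[fields[0]] = fields[1]
    return channels

def diff_channels(before, after):
    """Return (dropped, added) channel names between two snapshots."""
    dropped = sorted(set(before) - set(after))
    added = sorted(set(after) - set(before))
    return dropped, added

def watch(interval=5):
    """Poll Asterisk forever and print a timestamp for every channel change."""
    previous = {}
    while True:
        out = subprocess.run(
            ["asterisk", "-rx", "iax2 show channels"],
            capture_output=True, text=True,
        ).stdout
        current = parse_channels(out)
        dropped, added = diff_channels(previous, current)
        now = datetime.now().isoformat(timespec="seconds")
        for name in dropped:
            print(now, "DROPPED", name, previous[name])
        for name in added:
            print(now, "ADDED", name, current[name])
        previous = current
        time.sleep(interval)
```

Calling `watch()` in a terminal (or redirecting its output to a file) gives a time for each drop that can be lined up with the Asterisk logs and with the other node's sysop. Note that the CLI may truncate long channel names, so the names are only as unique as what the CLI prints.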

Hope that yields something for ya.
I am not likely thinking of everything, but it’s a good start.

Run tcpdump on one of the nodes and filter for IAX traffic; see if you find any clues.

Thanks!!
I will keep at it…
Repeating myself, and still looking for any connection timeout settings and trying to understand how and where this may be done.
In the future I would like to be able to relax any such timers to deal with temporary network connectivity issues, even though we don’t appear to be having these now.
I still lack an understanding of where and how a timed-out connection and drop is determined by the system… and have no clue how long it takes for a timeout to occur or where to look.
I think the IAX heartbeats are every few seconds (where can I find this info?) and I don’t know how many have to be missed before it drops a node.
Just knowing how this works would help me greatly.
SteveG

Good suggestion, I can easily capture all of that.
I’m not sure if Wireshark has IAX dissectors built in or if one can be easily added.
That would also be handy.

Lot in there so let me just stab at one.
Your node registers. It is then included in a list of registered nodes which is distributed to every registered node.

Your information is in that… node#, IP, port
When you connect to some other node, your system uses the info it has on hand to make the connection directly to that intended node.

If it should drop from registration, then your node does not know how to connect.
Same if your ip or port changes before the updated list is distributed again, which happens periodically, but I don’t have the latest on those timers. 5-10 min ?

I also think I heard it mentioned that if a node disappears, it may remain in the list for up to 20 min, if I remember correctly. It is not instant, is all I am pointing out.

But on the networking side, it may be dns issues within your network/isp.
You might also try a test of using google dns or some other trusted dns ?

Thanks,
I do not have a DNS issue and I do not have a connectivity issue.
Our nodes “know” their proper IP addresses and ports when these connections “drop”.
I continue to have about zero knowledge on how this is working (what is the criteria for it to drop a connection) nor have I been able to find any
valid information on how it works.
How is anyone supposed to troubleshoot this beyond the absolute basics?
That is what I am trying to learn.
Connectivity and DNS resolution are working.
And once a node is “connected” it should not need to check DNS over & over again to maintain this connection.
Nobody’s IP address or connectivity quality is changing when these connections drop.
Sometimes it stays connected for three days other times it will drop after thirty minutes.
It’s totally random.
All the while connectivity is just fine for all.
I need to get a handle on this.
Not having much success in finding good documentation on how it actually works.
And I need that to get to the next step of troubleshooting and monitoring.
I’m not seeing any information on the asterisk console (verbosity set to max) when these drop.
Not seeing anything telltale on the logs,
And also not seeing anything related to a drop with IAX2 debug all the way up.
I’ll keep trying- but still in the dark on this.

Perhaps one only needs to look at network iax2 connections in asterisk ?

We did not invent the protocol.
For all intents and purposes, it is an IAX2 VoIP phone call.
Not sure if that thinking aids you any.
But I am curious what you find.

I’ll keep looking-
It’s also not happening often enough to be easy to track.
I’m going to bring up another node on a less solid Internet connection (cellphone hotspot) and will watch both sides carefully side by side.
But man, really I’d like to be able to make adjustments to the timeout criteria and actually know how it is working and what it is doing…
Magic black boxes are great, when they work, until they don’t.

Steve, have you tried to make the connection in the permanent method ?

Also, when the other end reboots, you will lose the connection.

It’s not like the older asterisk versions when we were able to CLI>reload
and nobody did it that way anyhow. Everything stayed connected including sip/iax calls.

Now we have CLI> ‘dialplan reload’ and I don’t think it reloads all the config files.
At least I have had issues with it.

But if you are in contact with the other side, perhaps ask them to write down times when they reboot.

One issue I have is on VPS servers when they restart their networking etc.

The permanent connection stuff generally seems to work better-
I will come back to that and need to troubleshoot in a similar fashion if they do not, or do not stay up for weeks.
I am focusing right now on the manually linked non-permanent ones.
Those should at least stay up for “days” and they are not.
Rebooting on the other end is not the issue and neither is good connectivity.
Those basic types of items have been accounted for already and I am already on top of that stuff.
basic UDP path connectivity is not a factor.
Unless (again I need to know how this works)… If a path goes down or if “something” is checked every two seconds every 20 seconds every ? to determine a link should be dropped I NEED to know about this and how it works…
If the network connectivity goes out for only 15 seconds and this causes a link to drop, I need to know about how that works!!
Trying to troubleshoot this totally in the dark without knowing how it actually works is about impossible.
I think this type of basic info should be more easily available.
And If I can figure it out or find it I will make it more easily available.
I think it’s unreasonable to not be able to have it more readily at hand.
As a further troubleshooting measure I may set up nodes as private or hardcoded and totally bypass the whole “register with allstarlink” layer of this, to eliminate that as a factor.
Right now I have no idea if links are dropped if they fail to register with allstarlink periodically.
In my opinion this should not be a factor, but then I don’t know exactly how that works either…
If a node fails to register with allstarlink, in my opinion it should keep working as it was, same IP and port, until it gets an opportunity to try again later.
Links should not drop instantly if this item fails temporarily.
I will take that possibility out of the equation and connect a test node without allstarlink registration, to eliminate any DNS or connectivity to allstarlink possibilities.
We also run our own domain name servers, they are very fast responsive and available.
But that means about nothing if allstarlink becomes unreachable or has a nameserver hiccup or delay.
Connecting a couple of nodes “directly” will eliminate any possible DNS related or register with allstarlink issue if they are actually a factor.

Current documentation is one of the biggest needs for the project right now and to improve it we REALLY REALLY NEED people to step up and help with the documentation. There are too few of us keeping everything running and getting fixes out the door for ASL3.

Regarding your issue, I think you’re conflating a couple of things. If you have a connected node that is idle and timing out from network hiccups, that’s purely about Asterisk’s IAX2 protocol, not ASL3, and has nothing to do with DNS resolution. Your registration with the Registration Servers does not time out within 15 seconds. The DNS lookups are used for initial connections. Once connected, it’s IP addresses. See the output of iax2 show netstats for connection information.

The options for iax.conf are fully described in configs/samples/iax.conf.sample (master branch) in the asterisk/asterisk repository on GitHub.

However, does using a persistent link not ride out any hiccups for you?

N8EI
I’ll be happy to help out with documentation.
This is where I’m starting at.
I don’t understand how this works and am trying to figure it out.
Permanent connections work better but I’d like dynamic ones to hang on longer and I need to understand how it actually works.
What exactly constitutes a node being “connected”, and what timeouts are used to disconnect?
A few simple questions.
They are not IAX “registering” to my node, so none of those settings apply.
How does a node actually make a “connection” to mine?
How is that “connection” maintained, checked (at what interval) and where are these settings?
The document linked talks mostly about registrations and trunks.
I still don’t understand what a node connection actually is and how it’s qualified kept and dropped.
They are not IAX registrations and it’s not part of a trunk.
Could be peer, could be a friend?
Not understanding how it works.
I can certainly help out with docs when I get a better handle on it.
What exactly is “connected” ?

ASL3*CLI> iax2 show peers
Name/Username    Host                                           Mask                                      Port           Status      Description
iaxclient        (null)                                   (D)  (null)                                    (null)         Unmonitored
1 iax2 peers [0 online, 0 offline, 1 unmonitored]

------
ASL3*CLI> iax2 show registry
Host                                           dnsmgr  Username    Perceived                                      Refresh  State
0 IAX2 registrations.

-----
 iax2 show users
Username         Secret                Authen           Def.Context      A/C    Codec Pref
allstar-public   allstar               000000000000002  allstar-public   No     Host
iaxclient        Your_Secret_Password  000000000000002  iax-client       No     Host
radio            -no secret-           000000000000002  radio-secure     No     Host
allstar-sys      Key: allstar          000000000000004  allstar-sys      No     Host
iaxrpt           Your_Secret_Pasword_  000000000000002  iaxrpt           No     Host

-----
 iax2 show channels
Channel               Peer                                      Username    ID (Lo/Rem)  Seq (Tx/Rx)  Lag      Jitter  JitBuf  Format  FirstMsg    LastMsg
IAX2/174.x.x.x:  174.x.x.x                            radio       02613/08905  00210/00027  00137ms  0129ms  0176ms  ulaw    Tx:NEW      Rx:ACK
IAX2/97.x.x.x:4  97.x.x.x                             radio       03866/10797  00054/00204  00064ms  0012ms  0052ms  ulaw    Tx:NEW      Rx:ACK
IAX2/207.x.x.x  207.x.x.x                           radio       09928/01014  00084/00036  00067ms  0006ms  0065ms  ulaw    Rx:NEW      Rx:ACK
IAX2/68.x.x.x:45  68.x.x.x                              radio       12554/00142  00240/00032  00063ms  0021ms  0061ms  ulaw    Rx:NEW      Rx:ACK
IAX2/76.x.x.x:  76.x.x.x                            radio       16237/02982  00254/00110  00089ms  0000ms  0040ms  Unknow  Rx:NEW      Rx:ACK
5 active IAX channels
---
iax2 show netstats
                                -------- LOCAL ---------------------  -------- REMOTE --------------------
Channel                    RTT  Jit  Del  Lost   %  Drop  OOO  Kpkts  Jit  Del  Lost   %  Drop  OOO  Kpkts FirstMsg    LastMsg
IAX2/174.x.x.x:4569-   50  129  176    40   6     0   15     76   97  137  1492   7     3   21     74 Tx:NEW      Rx:ACK
IAX2/97.x.x.x:4569-3   31   12   52    20   3     0    0     60    8   64  1257   7     3    7     69 Tx:NEW      Rx:ACK
IAX2/207.x.x.x:3534   34    6   65    22   2     0    2      8   15   67    82   8     2    1      3 Rx:NEW      Rx:ACK
IAX2/68.x.x.x:4569-12   33   21   61   122   5     0    0     80   23   63  1418   6    10   13     69 Rx:NEW      Rx:ACK
IAX2/76.x.x.x:4569-   56    0   40     0   0     0    0     78   29   89  1531   7     6   18     74 Rx:NEW      Rx:ACK
5 active IAX channels
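
For tracking this over time rather than eyeballing it, the `iax2 show netstats` output above could be turned into numbers and logged periodically. The sketch below assumes the column layout shown in this post (channel, RTT, seven LOCAL columns, seven REMOTE columns, then the two message columns); the function name and dictionary keys are made up for illustration.

```python
# Column names taken from the header line of the netstats output in this post.
COLS = ("jit", "del", "lost", "pct", "drop", "ooo", "kpkts")

def parse_netstats(output):
    """Parse 'iax2 show netstats' text into a list of per-channel dicts."""
    rows = []
    for line in output.splitlines():
        f = line.split()
        if len(f) < 16 or not f[0].startswith("IAX2/"):
            continue  # header, separator, or summary line
        try:
            nums = [int(x) for x in f[1:16]]
        except ValueError:
            continue  # layout differs from what this sketch assumes
        rows.append({
            "channel": f[0],
            "rtt": nums[0],
            "local": dict(zip(COLS, nums[1:8])),
            "remote": dict(zip(COLS, nums[8:15])),
        })
    return rows

# Two lines copied from the output in this post.
SAMPLE = """\
IAX2/174.x.x.x:4569-   50  129  176    40   6     0   15     76   97  137  1492   7     3   21     74 Tx:NEW      Rx:ACK
IAX2/76.x.x.x:4569-   56    0   40     0   0     0    0     78   29   89  1531   7     6   18     74 Rx:NEW      Rx:ACK
5 active IAX channels
"""

rows = parse_netstats(SAMPLE)
for r in rows:
    print(r["channel"], "rtt", r["rtt"],
          "local loss %", r["local"]["pct"],
          "remote loss %", r["remote"]["pct"])
```

Logging these values every minute or so would show whether loss or jitter climbs on a particular channel shortly before it drops.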

I’m sure I’m not going to do the Asterisk bits complete justice, but in general each connection is a call over IAX2. The call is controlled by the radio-secure stanza in extensions.conf. The call is accepted (or not) based on the registration status of the node at the time of the connection. So let’s take two connections 192.0.2.15:4569 as “node 63001” and 203.0.113.30:4569 as “node 64000”. In general:

Before a connection:

  • Node 63001 registers itself as 192.0.2.15:4569
  • Node 64000 registers itself as 203.0.113.30:4569
    (Note this is going to ignore alternative port registrations for now)

Node 63001 tries to link to 64000:

  • Internally the ilink 3 NODE command begins regardless of how that is called - cmd, DTMF, etc.
  • Node 63001 looks up the remote IP and port for node 64000 in DNS or the node text database (depending on configuration). In the example above, node 63001 gets back that node 64000 is reachable on 203.0.113.30:4569
  • Node 63001 makes a call to Node 64000 at 203.0.113.30:4569 from 192.0.2.15:4569.
  • Node 64000 receives the IAX incoming call connection. It does a lookup to ensure that Node 63001 is expected to be 192.0.2.15 and, upon match, permits the call.
  • Node 64000 uses the lookup to set the port it’s going to communicate back to 63001 on (in case NAT did something funky) and sets the UDP port based on the registration information
  • Once established, there’s an IAX2 heartbeat between the two nodes at the protocol level functioning as a keepalive.
  • Talking happens (or not)

Eventually, the call is closed and the IAX2 connection is torn down when an unlink command (ilink 1 NODE) is issued for the node.

When two nodes are linked, you should be able to observe the heartbeat traffic between the two endpoints with tcpdump(1) filtering on port 4569 (or the defined alternative port). For example:

14:34:16.854195 IP 172.17.16.60.4569 > 172.17.16.54.4569: UDP, length 12
14:34:16.854246 IP 172.17.16.54.4569 > 172.17.16.60.4569: UDP, length 46
14:34:16.861532 IP 172.17.16.60.4569 > 172.17.16.54.4569: UDP, length 12
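
To see what those packets actually are, the first bytes of each UDP payload can be decoded by hand. The sketch below follows the frame header layout in RFC 5456 (the IAX2 specification): a full frame has a 12-byte header, and a mini frame, normally audio, has a 4-byte header. The function name and the returned dictionary are illustrative, and the subclass table lists only some of the IAX control messages.

```python
import struct

FRAME_TYPES = {1: "DTMF", 2: "VOICE", 3: "VIDEO", 4: "CONTROL", 5: "NULL", 6: "IAX"}
# Partial list of IAX control subclasses from RFC 5456.
IAX_SUBCLASSES = {1: "NEW", 2: "PING", 3: "PONG", 4: "ACK", 5: "HANGUP",
                  6: "REJECT", 7: "ACCEPT", 10: "INVAL", 11: "LAGRQ",
                  12: "LAGRP", 18: "VNAK"}

def decode_iax2(payload):
    """Decode the header of one IAX2 UDP payload; return None if too short."""
    if len(payload) < 4:
        return None
    (first,) = struct.unpack("!H", payload[:2])
    if not first & 0x8000:
        # F bit clear: mini frame (a source call number of 0 means a meta frame).
        (ts16,) = struct.unpack("!H", payload[2:4])
        return {"kind": "mini", "src_call": first, "timestamp": ts16}
    if len(payload) < 12:
        return None
    src, dst, ts, oseq, iseq, ftype, sub = struct.unpack("!HHIBBBB", payload[:12])
    info = {
        "kind": "full",
        "src_call": src & 0x7FFF,
        "dst_call": dst & 0x7FFF,
        "retransmit": bool(dst & 0x8000),  # R bit: this frame is a resend
        "timestamp": ts,
        "oseq": oseq,
        "iseq": iseq,
        "type": FRAME_TYPES.get(ftype, str(ftype)),
        "subclass": sub,
    }
    if ftype == 6:
        info["subclass"] = IAX_SUBCLASSES.get(sub, str(sub))
    return info

# Hand-built example frame: an IAX HANGUP flagged as a retransmission.
example = struct.pack("!HHIBBBB", 0x8000 | 2613, 0x8000 | 8905, 123456, 7, 9, 6, 5)
print(decode_iax2(example))
```

A 12-byte payload like the ones in the capture above is the size of a bare full-frame header with nothing after it. When chasing a drop, the things worth looking for in the seconds before it are frames with the retransmit flag set and which side sends the HANGUP. Getting the payloads out of a capture file (for example one written with `tcpdump -w`) is not shown here.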

A 15s network outage being “okay” isn’t really contemplated in the IAX2 protocol since it’s a near-real-time protocol. The tunables you might be able to set for [radio-secure] in iax.conf are:

;qualify=yes                ; Make sure this peer is alive.
;qualifysmoothing =         ; Use an average of the last two PONG results to
                            ; reduce falsely detected LAGGED hosts.  The default
                            ; is 'no.'
;qualifyfreqok =            ; How frequently to ping the peer when everything
                            ; seems to be OK, in milliseconds.
;qualifyfreqnotok =         ; How frequently to ping the peer when it's either
                            ; LAGGED or UNAVAILABLE, in milliseconds.

I’ve not personally seen these used before with app_rpt, but can’t say it doesn’t work and/or help.

Thank you for your time and help!
I will study this and see what I come up with…
Meanwhile I was trying to figure it out.
You mention it checks registration status: (not with my node though)-
registration is not being used between nodes themselves at all.
And I figure registration with allstar is used as a precursor to being “allowed” to connect out to the
public (DNS Based) nodelist.
Also a bit confusing (on my end).
iax2 show channels refers to these as peers.
You referred to them as “calls”
While iax2 show peers shows blank:
iax2 show peers
Name/Username    Host                                           Mask                                      Port           Status      Description
iaxclient        (null)                                   (D)  (null)                                    (null)         Unmonitored
1 iax2 peers [0 online, 0 offline, 1 unmonitored]
ASL3*CLI>

Channel           Peer       Username  ID (Lo/Rem)  Seq (Tx/Rx)  Lag      Jitter  JitBuf  Format  FirstMsg  LastMsg
IAX2/68.x.x.x     68.x.x.x   radio     00277/03742  00178/00182  00048ms  0018ms  0058ms  ulaw    Rx:NEW    Rx:ACK
IAX2/174.x.x.x:   174.x.x.x  radio     02613/08905  00201/00016  00047ms  0126ms  0169ms  ulaw    Tx:NEW    Rx:ACK
IAX2/97.69.x.x.x  97.x.x.x   radio     03866/10797  00045/00181  00045ms  0012ms  0052ms  ulaw    Tx:NEW    Rx:ACK
IAX2/207.x.x.x    207.x.x.x  radio     09928/01014  00077/00007  00067ms  0006ms  0065ms  ulaw    Rx:NEW    Rx:ACK
IAX2/76.x.x.x:    76.x.x.x   radio     16237/02982  00245/00101  00064ms  0000ms  0040ms  Unknow  Rx:NEW    Rx:ACK
5 active IAX channels

The “call” is the correct terminology per Asterisk:

nickel*CLI> rpt cmd 2116 ilink 3 48496
    -- Call accepted by 172.17.16.54:4569 (format ulaw)
    -- Format for call is (ulaw)
    -- Hungup 'DAHDI/pseudo-1234653878'
    -- <DAHDI/pseudo-1173661613> Playing 'rpt/node.gsm' (language 'en')
    -- <DAHDI/pseudo-1173661613> Playing 'digits/4.ulaw' (language 'en')
    -- <DAHDI/pseudo-1173661613> Playing 'digits/8.ulaw' (language 'en')
    -- <DAHDI/pseudo-1173661613> Playing 'digits/4.ulaw' (language 'en')
    -- <DAHDI/pseudo-1173661613> Playing 'digits/9.ulaw' (language 'en')
    -- <DAHDI/pseudo-1173661613> Playing 'digits/6.ulaw' (language 'en')
    -- <DAHDI/pseudo-1173661613> Playing 'rpt/connected-to.gsm' (language 'en')
    -- <DAHDI/pseudo-1173661613> Playing 'rpt/node.gsm' (language 'en')
    -- <DAHDI/pseudo-1173661613> Playing 'digits/2.ulaw' (language 'en')
    -- <DAHDI/pseudo-1173661613> Playing 'digits/1.ulaw' (language 'en')
    -- <DAHDI/pseudo-1173661613> Playing 'digits/1.ulaw' (language 'en')
    -- <DAHDI/pseudo-1173661613> Playing 'digits/6.ulaw' (language 'en')
    -- Hungup 'DAHDI/pseudo-1173661613'

Not going to get too “hung up” on this one.
But guessing that you are showing three (associated) calls here.
call #1- gets hung up near the beginning dahdi- guessing this bridges the “beep-beep” tone-gen connected sound.
Call #2 bridges dahdi for the digits+ connected-to audio then hangs up.
Call #3 (the link path) stays connected passing IAX2 sync and heartbeat (PING-PONG) while link audio is not present/being passed and uses the same call to pass audio when such is present.
All crude assumptions of course.

You are describing the idea pretty well. Think of it like a conference call. When ASL3 wants to make announcements, it connects to the conference call, says what it wants to say, then hangs up. The nodes (peers in IAX) are still in the call the whole time. DAHDI might connect and hang up numerous times in the logs as it makes announcements.

I have 2 ASL3 nodes running and I can connect them for days with no problem. DVSwitch will connect and stay connected for days also. I can unplug ethernet for a minute on a node, plug it back in, and connections are still there. tcpdump is where I would look next.

Think of it like a conference call.

Normally in asterisk something like meetme is used for a conference call or a 3-way call is setup in one of the endpoints (phones itself).
So thinking of it like a conference call just confuses the matter a bit.
Where is the conference call being bridged?
This is not a function of IAX2.
Is app_rpt itself the conference bridge?

TCPDump is not going to be very useful until I know what I am doing with the protocol.
“Looking there” will do about nothing unless I really know what I am doing.
Here’s what that looks like, and perfectly fine of course.
People often like to say “use tcpdump” but then what exactly?
Look for what?
I can already see IAX2 activity at the console with IAX2 debug set on.
What exactly should I be doing with tcpdump as I know I already have good connectivity.

Thanks.