Asterisk Core Dumps

Any chance anything is being done to resolve the core dumps that happen on some of the larger hubs?


Which of the larger hubs are you associated with, if any? I run a relatively small hub (2530) and have taken to rebooting or restarting Asterisk every 2 weeks with cron.

Mostly Winsystem hub 2560. The core dumps seem to occur whenever two larger hubs get connected together, e.g. Winsystem and WAN. The hub itself never has any problems if only single nodes or small groups of nodes connect. Currently there are 75 nodes connected to 2560, 103 total including other nodes connected, and no problems. But if a node connects that’s connected to another large hub, a core dump will likely occur immediately. It also doesn’t matter if the person connecting is in monitor (rx only) or full duplex (transceive). Generally the most reliable way to find the offending node is to watch the bubble chart and see which node trying to connect has a lot of nodes connected to it.

We’ve seen that behavior before and Jim fixed it we thought. Are you one of the operators on the Winsystem AllStar? If so, what version of AllStar do you guys have on 2560?

Jim’s changes seemed to help at the time, but the problem never went away completely. It seems to have gotten worse over time, presumably because larger connections with more nodes are now more likely on the network.

Yes, I am. The version is compiled from what was available on GitHub roughly 6 months ago: Asterisk GIT Version 99bf31e

Steve


On Fri, Jan 17, 2020 at 5:49 AM Tim Sawyer via AllStarLink Discussion Groups noreply@community.allstarlink.org wrote:



Steve_Passmore:
The core dumps seem to occur whenever two larger hubs get connected together.

We’ve seen that behavior before and Jim fixed it we thought. Are you one of the operators on the Winsystem AllStar? If so, what version of AllStar do you guys have on 2560?



If you still have cores from the last incident, running echo bt | sudo gdb /usr/sbin/asterisk /core will create output that may help to identify the cause.

I get the following, but don’t know enough about debugging to be able to identify a direction to look for a problem from the results.

Steve

Core was generated by `/usr/sbin/asterisk -g -f -C /etc/asterisk/asterisk.conf'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00007fe1d0b0de16 in rpt_exec (chan=0x7fe1ac017830, data=<optimized out>)
    at app_rpt.c:23122
23122                   if ((!strncasecmp(l->chan->name,"echolink",8)) ||
[Current thread is 1 (Thread 0x7fe1c014e700 (LWP 417))]
(gdb) #0  0x00007fe1d0b0de16 in rpt_exec (chan=0x7fe1ac017830, data=<optimized out>)
    at app_rpt.c:23122
#1  0x0000000000489173 in pbx_exec (data=0x7fe1c0149cd0, app=0x2292990,
    c=0x7fe1ac017830) at pbx.c:537
#2  pbx_extension_helper (c=c@entry=0x7fe1ac017830,
    context=context@entry=0x7fe1ac017a80 "radio-secure",
    exten=exten@entry=0x7fe1ac017ad0 "2560", priority=1,
    label=label@entry=0x0, callerid=<optimized out>,
    action=action@entry=E_SPAWN, con=0x0) at pbx.c:1862
#3  0x00000000004908fc in ast_spawn_extension (callerid=<optimized out>,
    priority=<optimized out>, exten=0x7fe1ac017ad0 "2560",
    context=0x7fe1ac017a80 "radio-secure", c=0x7fe1ac017830) at pbx.c:2317
#4  __ast_pbx_run (c=c@entry=0x7fe1ac017830) at pbx.c:2406
#5  0x0000000000491499 in pbx_thread (data=data@entry=0x7fe1ac017830)
    at pbx.c:2621
#6  0x00000000004c0169 in dummy_start (data=<optimized out>) at utils.c:925
#7  0x00007fe1d32254a4 in start_thread (arg=0x7fe1c014e700)
    at pthread_create.c:456
#8  0x00007fe1d2822d0f in clone ()
    at ../sysdeps/unix/sysv/linux/x86_64/clone.S:97
(gdb) quit

Line 23122 of app_rpt.c, shown in frame #0 of the backtrace, is the code in question. It’s comparing the first 8 characters of the channel name to the literal “echolink”. Segmentation faults occur when a program attempts to access parts of memory it shouldn’t. In this case, the only memory it’s trying to access is the first 8 characters of l->chan->name. There may be a circumstance where a channel name has fewer than 8 characters and does not have the necessary null terminator. The chances of this occurring are theoretically greater with each additional node, which would explain why you see it when connecting many nodes. I need to read the code a bit more to see under what conditions this could occur.
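To illustrate the kind of guard that could be tried around that comparison, here is a minimal sketch. Besides reading the name itself, the statement also has to follow l and l->chan to reach it, so a NULL pointer at either step would fault on the same line. Whether that is what actually happens on 2560 is not established in this thread. The struct and function names below are hypothetical stand-ins, not the real Asterisk or app_rpt definitions.

```c
#include <stddef.h>
#include <strings.h>   /* strncasecmp (POSIX) */

/* Hypothetical stand-ins for the real structures; only the fields
 * this sketch needs are modeled. */
struct sketch_channel {
    const char *name;
};

struct sketch_link {
    struct sketch_channel *chan;
};

/* Return 1 if the link's channel name starts with prefix
 * (case-insensitive, first n characters), 0 otherwise.
 * Any NULL pointer along the way counts as "no match". */
static int link_name_has_prefix(const struct sketch_link *l,
                                const char *prefix, size_t n)
{
    if (l == NULL || l->chan == NULL || l->chan->name == NULL || prefix == NULL)
        return 0;
    return strncasecmp(l->chan->name, prefix, n) == 0;
}
```

With a helper like this, the test at line 23122 would become something like link_name_has_prefix(l, "echolink", 8). Note that a NULL check only catches pointers that are actually NULL; it cannot detect a pointer to a channel that has already been freed, so it would narrow the problem down rather than prove the cause.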

Thanks for the explanation. I’ve reached out to a couple people to try to get this nailed down but haven’t gotten any response yet. If you come up with anything please let us know.

Steve

Any ideas on this? We’re getting core dumps nearly daily, sometimes multiple. I’m happy to make changes and recompile. The section of code appears to deal with echolink and tlb, which aren’t installed on this hub. I know it may be a kludge, but could that section of the code be disabled so it doesn’t cause the core dump?

Thanks

How would you feel about running some debugging code to try to log the root cause?

Absolutely. I’ll be happy to do whatever it takes. I would really like to get to the bottom of this.
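As a rough idea of what such debugging code might look like, the sketch below builds a one-line description of a link's state that could be logged just before the comparison that crashes. It is only an illustration: the struct and function names are hypothetical, and a real patch would use the actual app_rpt structures and Asterisk's own logging rather than a plain buffer.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-ins for the real structures. */
struct dbg_channel {
    const char *name;
};

struct dbg_link {
    struct dbg_channel *chan;
};

/* Write a one-line description of the link's state into buf.
 * Returns whatever snprintf returns (the length the full message needs). */
static int describe_link(const struct dbg_link *l, char *buf, size_t len)
{
    if (l == NULL)
        return snprintf(buf, len, "link pointer is NULL");
    if (l->chan == NULL)
        return snprintf(buf, len, "link %p has NULL chan", (const void *)l);
    if (l->chan->name == NULL)
        return snprintf(buf, len, "link %p chan %p has NULL name",
                        (const void *)l, (const void *)l->chan);
    return snprintf(buf, len, "link %p chan %p name \"%s\"",
                    (const void *)l, (const void *)l->chan, l->chan->name);
}
```

Logging this for each link right before the check would show, in the last line written before a crash, which link was being examined and whether any of its pointers were NULL at that moment.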

Once you guys get a fix, either open a pull request against the ASL GitHub repository, on the dev branch or a new feature branch, or submit it to helpdesk@allstarlink.org.

I’ll make sure it is in the next update to the chan_echolink driver.

Thanks