LIGO Support Ticket 1750

Ticket Information
  Number:      support 1750
  User:        anderson@ligo.caltech.edu
  Email:       espinoza_e__AT__ligo.caltech.edu
  Status:      resolved
  Assigned To: wright
Date: Sun, 26 Nov 2006 15:37:22 -0800
From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
To: condor-support__AT__cs.wisc.edu
CC: Erik Espinoza <espinoza_e__AT__ligo.caltech.edu>
Subject: LIGO condor_starter core dumps

We recently enabled core dump capabilities for the 64-bit setuid Condor
processes on the LIGO CIT Condor pool (Opteron 275 running FC4):

$ uname -a
Linux node319 2.6.17.13-CIT #1 SMP Tue Sep 19 18:39:07 PDT 2006 x86_64 x86_64 x86_64 GNU/Linux

$ condor_version
$CondorVersion: 6.8.2 Oct 12 2006 $
$CondorPlatform: X86_64-LINUX_RHEL3 $

We are now observing approximately 10,000 condor_starter core dumps per day!

It is possible that this is related to [condor-support #1694] where the
condor_startd less frequently aborts with exit status 6.  Note, we have
not yet had a condor_startd SIGABRT since enabling 64-bit core dump
capability in the kernel 2 days ago.

-- 
Stuart Anderson  anderson__AT__ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson

===========================================================================
Date of creation: Sun Nov 26 17:37:44 2006 (1164584267)
Date: Sun, 26 Nov 2006 15:42:03 -0800
From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
To: condor-support__AT__cs.wisc.edu
Subject: Re: [condor-support #1750] LIGO condor_starter core dumps

http://www.ligo.caltech.edu/~anderson/condor.1750/readme

Here is an example of a condor_starter core dump:

# ls -l core.32281
-rw-------  1 condor condor  655360 Nov 26 13:48 core.32281


# file /ldcg/condor/sbin/condor_starter core.32281
/ldcg/condor/sbin/condor_starter: ELF 64-bit LSB executable, AMD x86-64, version 1 (SYSV), for GNU/Linux 2.4.0, dynamically linked (uses shared libs), for GNU/Linux 2.4.0, stripped
core.32281:                       ELF 64-bit LSB core file AMD x86-64, version 1 (SYSV), SVR4-style, from 'condor_starter.'


# gdb /ldcg/condor/sbin/condor_starter core.32281
GNU gdb Red Hat Linux (6.3.0.0-1.84rh)
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu"...(no debugging symbols found)
Using host libthread_db library "/lib64/libthread_db.so.1".


warning: core file may not match specified executable file.
Core was generated by `condor_starter ldas-grid.ligo.caltech.edu <10.14.2.69:46119> -a vm3'.
Program terminated with signal 6, Aborted.
#0  0x0000003bca72f280 in ?? ()
(gdb) where
#0  0x0000003bca72f280 in ?? ()
#1  0x0000003bca730750 in ?? ()
#2  0x0000003bca809120 in ?? ()
#3  0x0000003bca812d56 in ?? ()
#4  0x0000003bca930680 in ?? ()
#5  0x00000000006d1b37 in AUTHORITY_KEYID_free ()
Previous frame inner to this frame (corrupt stack?)



11/26 11:20:47 ******************************************************
11/26 11:20:47 ** condor_starter (CONDOR_STARTER) STARTING UP
11/26 11:20:47 ** /ldcg/stow_pkgs/condor-6.8.2/condor/sbin/condor_starter
11/26 11:20:47 ** $CondorVersion: 6.8.2 Oct 12 2006 $
11/26 11:20:47 ** $CondorPlatform: X86_64-LINUX_RHEL3 $
11/26 11:20:47 ** PID = 27552
11/26 11:20:47 ** Log last touched 11/26 11:19:37
11/26 11:20:47 ******************************************************
11/26 11:20:47 Using config source: /usr1/condor/condor_config
11/26 11:20:47 Using local config sources: 
11/26 11:20:47    /usr1/condor/condor_config.local
11/26 11:20:47 DaemonCore: Command Socket at <10.14.2.69:48489>
11/26 11:20:47 Done setting resource limits
11/26 11:20:47 Communicating with shadow <10.14.0.12:44142>
11/26 11:20:47 Submitting machine is "ldas-grid.ligo.caltech.edu"
11/26 11:20:47 Starting a VANILLA universe job with ID: 9359001.0
11/26 11:20:47 IWD: /archive/home/igor/S5/coherent/offline/SIMULATIONS/HSG1_S5_run24a
11/26 11:20:47 Input file: /archive/home/igor/INPUT_S5_CHT_run24a/14675_HSG1_S5_run24a.in
11/26 11:20:47 Output file: /usr1/igor/waveburst/OUTPUT_S4/14675_HSG1_S5_run24a.out
11/26 11:20:47 Error file: /usr1/igor/waveburst/ERROR_S4/14675_HSG1_S5_run24a.err
11/26 11:20:47 Renice expr "0" evaluated to 0
11/26 11:20:47 About to exec /archive/home/igor/S5/coherent/offline/SIMULATIONS/HSG1_S5_run24a/net.sh 
11/26 11:20:47 Create_Process succeeded, pid=27553
11/26 13:47:50 Process exited, pid=27553, status=0
11/26 13:47:50 Got SIGQUIT.  Performing fast shutdown.
11/26 13:47:50 ShutdownFast all jobs.
11/26 13:47:50 **** condor_starter (condor_STARTER) EXITING WITH STATUS 0
11/26 13:48:40 ********** STARTER starting up ***********
11/26 13:48:40 ** $CondorVersion: 6.8.2 Oct 12 2006 $
11/26 13:48:40 ** $CondorPlatform: X86_64-LINUX_RHEL3 $
11/26 13:48:40 ******************************************
11/26 13:48:40 Submitting machine is "ldas-grid.ligo.caltech.edu"
11/26 13:48:40 EventHandler {
11/26 13:48:40  func = 0x4bfda4
11/26 13:48:40  mask = SIGALRM SIGHUP SIGINT SIGUSR1 SIGUSR2 SIGCHLD SIGTSTP 
11/26 13:48:40 }
11/26 13:48:40 condor_read(): recv() returned -1, errno = 104, assuming failure reading 5 bytes from unknown source.
11/26 13:48:41 ********** STARTER starting up ***********
11/26 13:48:41 ** $CondorVersion: 6.8.2 Oct 12 2006 $
11/26 13:48:41 ** $CondorPlatform: X86_64-LINUX_RHEL3 $
11/26 13:48:41 ******************************************
11/26 13:48:41 Submitting machine is "ldas-grid.ligo.caltech.edu"
11/26 13:48:41 EventHandler {
11/26 13:48:41  func = 0x4bfda4
11/26 13:48:41  mask = SIGALRM SIGHUP SIGINT SIGUSR1 SIGUSR2 SIGCHLD SIGTSTP 
11/26 13:48:41 }
11/26 13:48:41 condor_write(): Socket closed when trying to write 18 bytes to <10.14.0.12:58821>, fd is 17
11/26 13:48:41 Buf::write(): condor_write() failed
11/26 13:48:42 ********** STARTER starting up ***********
11/26 13:48:42 ** $CondorVersion: 6.8.2 Oct 12 2006 $
11/26 13:48:42 ** $CondorPlatform: X86_64-LINUX_RHEL3 $
11/26 13:48:42 ******************************************
11/26 13:48:42 Submitting machine is "ldas-grid.ligo.caltech.edu"
11/26 13:48:42 EventHandler {
11/26 13:48:42  func = 0x4bfda4
11/26 13:48:42  mask = SIGALRM SIGHUP SIGINT SIGUSR1 SIGUSR2 SIGCHLD SIGTSTP 
11/26 13:48:42 }
11/26 13:48:42 condor_write(): Socket closed when trying to write 18 bytes to <10.14.0.12:58629>, fd is 17
11/26 13:48:42 Buf::write(): condor_write() failed
11/26 13:48:47 ********** STARTER starting up ***********
11/26 13:48:47 ** $CondorVersion: 6.8.2 Oct 12 2006 $
11/26 13:48:47 ** $CondorPlatform: X86_64-LINUX_RHEL3 $
11/26 13:48:47 ******************************************
11/26 13:48:47 Submitting machine is "ldas-grid.ligo.caltech.edu"
11/26 13:48:47 EventHandler {
11/26 13:48:47  func = 0x4bfda4
11/26 13:48:47  mask = SIGALRM SIGHUP SIGINT SIGUSR1 SIGUSR2 SIGCHLD SIGTSTP 
11/26 13:48:47 }
11/26 13:48:47 condor_read(): recv() returned -1, errno = 104, assuming failure reading 5 bytes from unknown source.
11/26 13:48:47 ********** STARTER starting up ***********
11/26 13:48:47 ** $CondorVersion: 6.8.2 Oct 12 2006 $
11/26 13:48:47 ** $CondorPlatform: X86_64-LINUX_RHEL3 $
11/26 13:48:47 ******************************************
11/26 13:48:47 Submitting machine is "ldas-grid.ligo.caltech.edu"
11/26 13:48:47 EventHandler {
11/26 13:48:47  func = 0x4bfda4
11/26 13:48:47  mask = SIGALRM SIGHUP SIGINT SIGUSR1 SIGUSR2 SIGCHLD SIGTSTP 
11/26 13:48:47 }
11/26 13:48:47 condor_write(): Socket closed when trying to write 18 bytes to <10.14.0.12:54251>, fd is 17
11/26 13:48:47 Buf::write(): condor_write() failed
11/26 13:49:49 ******************************************************
11/26 13:49:49 ** condor_starter (CONDOR_STARTER) STARTING UP
11/26 13:49:49 ** /ldcg/stow_pkgs/condor-6.8.2/condor/sbin/condor_starter
11/26 13:49:49 ** $CondorVersion: 6.8.2 Oct 12 2006 $
11/26 13:49:49 ** $CondorPlatform: X86_64-LINUX_RHEL3 $
11/26 13:49:49 ** PID = 32318
11/26 13:49:49 ** Log last touched 11/26 13:48:47
11/26 13:49:49 ******************************************************
11/26 13:49:49 Using config source: /usr1/condor/condor_config
11/26 13:49:49 Using local config sources: 
11/26 13:49:49    /usr1/condor/condor_config.local
11/26 13:49:49 DaemonCore: Command Socket at <10.14.2.69:39330>
11/26 13:49:49 Done setting resource limits
11/26 13:49:49 Communicating with shadow <10.14.0.12:36373>
11/26 13:49:49 Submitting machine is "ldas-grid.ligo.caltech.edu"
11/26 13:49:49 Starting a VANILLA universe job with ID: 9359848.0
11/26 13:49:49 IWD: /archive/home/igor/S5/coherent/offline/SIMULATIONS/HSG1_S5_run24a
11/26 13:49:49 Input file: /archive/home/igor/INPUT_S5_CHT_run24a/15372_HSG1_S5_run24a.in
11/26 13:49:49 Output file: /usr1/igor/waveburst/OUTPUT_S4/15372_HSG1_S5_run24a.out
11/26 13:49:49 Error file: /usr1/igor/waveburst/ERROR_S4/15372_HSG1_S5_run24a.err
11/26 13:49:49 Renice expr "0" evaluated to 0
11/26 13:49:49 About to exec /archive/home/igor/S5/coherent/offline/SIMULATIONS/HSG1_S5_run24a/net.sh 
11/26 13:49:49 Create_Process succeeded, pid=32319
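
A minimal sketch, assuming StarterLog lines shaped like the excerpt above,
of how the two start-up banner styles and the socket failures could be
tallied. The function name `summarize` and the counter keys are
hypothetical. That the short banner belongs to the failing starter is an
inference from this excerpt, not something the log states. (errno 104 is
ECONNRESET on Linux.)

```python
import re
from collections import Counter

FULL_BANNER = "** condor_starter (CONDOR_STARTER) STARTING UP"
SHORT_BANNER = "********** STARTER starting up ***********"
# Only lines where condor_read()/condor_write() is the reporting function,
# so the follow-up "Buf::write(): condor_write() failed" is not double counted.
IO_FAILURE = re.compile(r"^\d\d/\d\d \d\d:\d\d:\d\d condor_(?:read|write)\(\):")


def summarize(lines):
    """Count banner styles and socket I/O failures in StarterLog lines."""
    counts = Counter()
    for line in lines:
        if FULL_BANNER in line:
            counts["full_banner"] += 1
        elif SHORT_BANNER in line:
            counts["short_banner"] += 1
        elif IO_FAILURE.match(line):
            counts["io_failure"] += 1
    return counts


# Lines copied from the excerpt above.
SAMPLE = [
    "11/26 13:48:40 ********** STARTER starting up ***********",
    "11/26 13:48:40 condor_read(): recv() returned -1, errno = 104, "
    "assuming failure reading 5 bytes from unknown source.",
    "11/26 13:48:41 ********** STARTER starting up ***********",
    "11/26 13:48:41 condor_write(): Socket closed when trying to write "
    "18 bytes to <10.14.0.12:58821>, fd is 17",
    "11/26 13:48:41 Buf::write(): condor_write() failed",
    "11/26 13:49:49 ** condor_starter (CONDOR_STARTER) STARTING UP",
]

if __name__ == "__main__":
    print(dict(summarize(SAMPLE)))
```

In this sample every short-banner start-up is followed by a socket failure,
which is the pattern worth checking across the full log.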


The core dump image for this may be found at,
http://www.ligo.caltech.edu/~anderson/condor.1750/core.32281


-- 
Stuart Anderson  anderson__AT__ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson

===========================================================================
Date mail was appended: Sun Nov 26 17:42:48 2006 (1164584569)
Subject: Actions

Assigned to gthain by adesmet
===========================================================================
Date of actions: Mon Nov 27 10:18:08 2006 (1164644289)
Subject: Actions

Ticket resolved by gthain
===========================================================================
Date of actions: Mon Nov 27 14:17:07 2006 (1164658627)
Subject: Actions

Assigned to wright by gthain
===========================================================================
Date of actions: Mon Nov 27 14:17:18 2006 (1164658638)
Subject: Actions

Ticket reopened by gthain
===========================================================================
Date of actions: Mon Nov 27 14:18:40 2006 (1164658720)
Subject: Actions

Ticket was reopened by gthain
===========================================================================
Date of actions: Mon Nov 27 14:18:40 2006 (1164658720)
Date: Mon, 27 Nov 2006 20:34:39 -0800
From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
To: condor-support__AT__cs.wisc.edu
CC: Erik Espinoza <espinoza_e__AT__ligo.caltech.edu>
Subject: Re: [condor-support #1750] LIGO condor_starter core dumps

I just realized I was running gdb against the wrong binary, it is
"condor_starter.std" that is aborting not "condor_starter"--thank
you Linux for truncating the string in the output of /bin/file.
At any rate, here is a more useful example stack trace:

[root@node7 execute]# gdb /ldcg/condor/sbin/condor_starter.std core.21143 
GNU gdb Red Hat Linux (6.3.0.0-1.84rh)
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu"...(no debugging symbols found)
Using host libthread_db library "/lib64/libthread_db.so.1".


warning: core file may not match specified executable file.
Core was generated by `condor_starter ldas-grid.ligo.caltech.edu <10.14.1.7:60909> -a vm3'.
Program terminated with signal 6, Aborted.
Reading symbols from /lib64/libcrypt.so.1...(no debugging symbols found)...done.
Loaded symbols for /lib64/libcrypt.so.1
Reading symbols from /lib64/libdl.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/libdl.so.2
Reading symbols from /lib64/libresolv.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/libresolv.so.2
Reading symbols from /usr/lib64/libstdc++.so.5...(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/libstdc++.so.5
Reading symbols from /lib64/libm.so.6...
(no debugging symbols found)...done.
Loaded symbols for /lib64/libm.so.6
Reading symbols from /lib64/libgcc_s.so.1...(no debugging symbols found)...done.
Loaded symbols for /lib64/libgcc_s.so.1
Reading symbols from /lib64/libc.so.6...(no debugging symbols found)...done.
Loaded symbols for /lib64/libc.so.6
Reading symbols from /lib64/ld-linux-x86-64.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/ld-linux-x86-64.so.2
Reading symbols from /lib64/libnss_files.so.2...
(no debugging symbols found)...done.
Loaded symbols for /lib64/libnss_files.so.2
#0  0x0000003551f2f280 in raise () from /lib64/libc.so.6
(gdb) where
#0  0x0000003551f2f280 in raise () from /lib64/libc.so.6
#1  0x0000003551f30750 in abort () from /lib64/libc.so.6
#2  0x0000003551f282e6 in __assert_fail () from /lib64/libc.so.6
#3  0x00000000004a43af in REMOTE_CONDOR_register_fs_domain ()
#4  0x0000000000488b7a in init_environment_info ()
#5  0x00000000004878f2 in init ()
#6  0x00000000004bfcbc in StateMachine::execute ()
#7  0x00000000004878bd in main ()
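
A minimal sketch of why the name was cut short, assuming the string shown
by /bin/file comes from the kernel's per-task command name. On Linux that
field is TASK_COMM_LEN (16) bytes including the trailing NUL, so only 15
characters survive. The helper name `kernel_comm` is hypothetical.

```python
TASK_COMM_LEN = 16  # Linux kernel constant: 15 visible bytes plus a NUL


def kernel_comm(executable_path):
    """Return the basename truncated the way the kernel stores a task name."""
    name = executable_path.rsplit("/", 1)[-1]
    return name[:TASK_COMM_LEN - 1]


if __name__ == "__main__":
    for path in ("/ldcg/condor/sbin/condor_starter",
                 "/ldcg/condor/sbin/condor_starter.std"):
        print(path, "->", repr(kernel_comm(path)))
```

The second path yields 'condor_starter.' with the trailing dot, matching
the name reported in the first core file, while the plain condor_starter
name fits without truncation.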


-- 
Stuart Anderson  anderson__AT__ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson

===========================================================================
Date mail was appended: Mon Nov 27 22:35:08 2006 (1164688509)
From: Derek Wright <wright__AT__cs.wisc.edu>
Subject: Re: [condor-support #1750] LIGO condor_starter core dumps
Date: Mon, 27 Nov 2006 20:43:53 -0800
To: condor-support__AT__cs.wisc.edu


> I just realized I was running gdb against the wrong binary, it is
> "condor_starter.std" that is aborting not "condor_starter"--thank
> you Linux for truncating the string in the output of /bin/file.

hehe, yet another "Thank you Linux" moment. ;)

> At any rate is a more useful example stack trace:

indeed, that's quite helpful.

> #0  0x0000003551f2f280 in raise () from /lib64/libc.so.6
> #1  0x0000003551f30750 in abort () from /lib64/libc.so.6
> #2  0x0000003551f282e6 in __assert_fail () from /lib64/libc.so.6
> #3  0x00000000004a43af in REMOTE_CONDOR_register_fs_domain ()
> #4  0x0000000000488b7a in init_environment_info ()
> #5  0x00000000004878f2 in init ()
> #6  0x00000000004bfcbc in StateMachine::execute ()
> #7  0x00000000004878bd in main ()

this doesn't look anything like the other bug we're seeing in the  
startd.  however, this might be enough info for me to solve this.   
i'll let you know once i've had a chance to slaughter the other bug  
and look more closely at this one.

thanks for the update,
-derek



===========================================================================
Date mail was appended: Mon Nov 27 22:44:00 2006 (1164689041)
From: Derek Wright <wright__AT__cs.wisc.edu>
Subject: Re: [condor-support #1750] LIGO condor_starter core dumps
Date: Wed, 29 Nov 2006 15:24:04 -0800
To: condor-support__AT__cs.wisc.edu


On Nov 27, 2006, at 8:35 PM, Stuart Anderson wrote:

> #0  0x0000003551f2f280 in raise () from /lib64/libc.so.6
> (gdb) where
> #0  0x0000003551f2f280 in raise () from /lib64/libc.so.6
> #1  0x0000003551f30750 in abort () from /lib64/libc.so.6
> #2  0x0000003551f282e6 in __assert_fail () from /lib64/libc.so.6
> #3  0x00000000004a43af in REMOTE_CONDOR_register_fs_domain ()
> #4  0x0000000000488b7a in init_environment_info ()
> #5  0x00000000004878f2 in init ()
> #6  0x00000000004bfcbc in StateMachine::execute ()
> #7  0x00000000004878bd in main ()

alas, it's lame condor is blowing up with a real assert() here, and i  
know why that's happening.  however, it's going to take a lot more  
work to fix than i was hoping for. :(

meanwhile, here's an unstripped condor_starter.std binary you can  
use, which will hopefully provide much more useful info in the  
backtrace:

ftp://ftp.cs.wisc.edu/condor/temporary/forligo/x86_64_rhel3/condor_starter.std-6.8.3.pre-2006.11.29-unstripped.gz

if you don't mind, please install this and send me the gdb output  
from a core file using the full debugging symbols.

in parallel, i'll keep working on fixing the condor_starter to not  
blow up with a core dump and SIGABRT in this case (instead, it'll  
just print the error to the logs, which is what's expected, but not  
happening).
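
A minimal sketch of the difference being described, using two throwaway
child processes rather than Condor itself. One child calls abort(), which
ends it with SIGABRT (signal 6, as in the gdb output). The other prints an
error and exits normally. The exit status 1 and the message text are
assumptions for illustration, not what EXCEPT() actually uses.

```python
import signal
import subprocess
import sys

# Child that aborts. The core limit is set to 0 first so the demo leaves
# no core file behind.
ABORTING = (
    "import os, resource; "
    "resource.setrlimit(resource.RLIMIT_CORE, (0, 0)); "
    "os.abort()"
)
# Child that reports the problem and exits with an ordinary status.
LOGGING = (
    "import sys; "
    "sys.stderr.write('ERROR: shadow went away\\n'); "
    "sys.exit(1)"
)


def run_child(code):
    """Run a Python one-liner and return (returncode, stderr)."""
    proc = subprocess.run([sys.executable, "-c", code],
                          capture_output=True, text=True)
    return proc.returncode, proc.stderr


if __name__ == "__main__":
    rc, _ = run_child(ABORTING)
    print("abort child killed by SIGABRT:", rc == -signal.SIGABRT)
    print("logging child:", run_child(LOGGING))
```

A negative return code from subprocess means the child died from that
signal, which is the case that produces a core dump when core limits
allow it.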

thanks,
-derek






===========================================================================
Date mail was appended: Wed Nov 29 17:24:13 2006 (1164842654)
Subject: Actions

Status changed from open to pending by wright
===========================================================================
Date of actions: Wed Nov 29 17:24:13 2006 (1164842656)
From: Derek Wright <wright__AT__cs.wisc.edu>
Subject: Re: [condor-support #1750] LIGO condor_starter core dumps
Date: Thu, 30 Nov 2006 13:33:57 -0800
To: condor-support__AT__cs.wisc.edu


any luck with the unstripped starter and a new core file?

you could also use the new binaries i just sent in #1694:

ftp://ftp.cs.wisc.edu/condor/temporary/forligo/x86_64_rhel3/condor-6.8.3-linux-x86_64-rhel3-unstripped.tar.gz

since those are unstripped as well.  i didn't mess with changing the  
assert() to an EXCEPT() just yet, since i didn't want to be  
introducing too many variables at once.

let me know as soon as you have a coredump with the full debugging  
symbols.

i'm *guessing* this is actually a configuration error, not a bug, but  
the core file should hopefully say for certain.

thanks,
-derek



===========================================================================
Date mail was appended: Thu Nov 30 15:34:06 2006 (1164922447)
Date: Thu, 30 Nov 2006 13:41:28 -0800
From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
To: condor-support response tracking system <condor-support__AT__cs.wisc.edu>
CC: espinoza_e__AT__ligo.caltech.edu
Subject: Re: [condor-support #1750] LIGO condor_starter core dumps

On Thu, Nov 30, 2006 at 03:34:06PM -0600, condor-support response tracking system wrote:
> 
> 
> any luck with the unstripped starter and a new core file?
> 
> you could also use the new binaries i just sent in #1694:
> 
> ftp://ftp.cs.wisc.edu/condor/temporary/forligo/x86_64_rhel3/
> condor-6.8.3-linux-x86_64-rhel3-unstripped.tar.gz
> 
> since those are unstripped as well.  i didn't mess with changing the  
> assert() to an EXCEPT() just yet, since i didn't want to be  
> introducing too many variables at once.
> 
> let me know as soon as you have a coredump with the full debugging  
> symbols.
> 
> i'm *guessing* this is actually a configuration error, not a bug, but  
> the core file should helpfully say for certain.
> 

I have just downloaded last night's full unstripped build and will test
out both the startd and starter.

-- 
Stuart Anderson  anderson__AT__ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson

===========================================================================
Date mail was appended: Thu Nov 30 15:41:50 2006 (1164922910)
From: Derek Wright <wright__AT__cs.wisc.edu>
Subject: Re: [condor-support #1750] LIGO condor_starter core dumps
Date: Thu, 30 Nov 2006 13:43:14 -0800
To: condor-support__AT__cs.wisc.edu


On Nov 30, 2006, at 1:41 PM, Stuart Anderson wrote:

> I have just downloaded last night's full unstripped build and will test
> out both the startd and starter.

lovely, thanks!

-derek



===========================================================================
Date mail was appended: Thu Nov 30 15:43:23 2006 (1164923003)
Subject: Actions

Status changed from open to pending by wright
===========================================================================
Date of actions: Thu Nov 30 15:43:23 2006 (1164923005)
Date: Thu, 30 Nov 2006 14:38:32 -0800
From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
To: condor-support response tracking system <condor-support__AT__cs.wisc.edu>
CC: espinoza_e__AT__ligo.caltech.edu
Subject: Re: [condor-support #1750] LIGO condor_starter core dumps

On Thu, Nov 30, 2006 at 03:34:06PM -0600, condor-support response tracking system wrote:
> 
> 
> any luck with the unstripped starter and a new core file?
> 
> you could also use the new binaries i just sent in #1694:
> 
> ftp://ftp.cs.wisc.edu/condor/temporary/forligo/x86_64_rhel3/
> condor-6.8.3-linux-x86_64-rhel3-unstripped.tar.gz
> 
> since those are unstripped as well.  i didn't mess with changing the  
> assert() to an EXCEPT() just yet, since i didn't want to be  
> introducing too many variables at once.
> 
> let me know as soon as you have a coredump with the full debugging  
> symbols.
> 
> i'm *guessing* this is actually a configuration error, not a bug, but  
> the core file should helpfully say for certain.
> 

Here is an example stack trace and a full debug symbol core dump image
may be found at,
http://www.ligo.caltech.edu/~anderson/condor.1750/core.8572
These are both from binaries at the URL above.

#0  0x0000003843a2f280 in raise () from /lib64/libc.so.6
(gdb) where
#0  0x0000003843a2f280 in raise () from /lib64/libc.so.6
#1  0x0000003843a30750 in abort () from /lib64/libc.so.6
#2  0x0000003843a282e6 in __assert_fail () from /lib64/libc.so.6
#3  0x00000000004a43e2 in REMOTE_CONDOR_register_fs_domain (domain=0x8ed8f0 "ligo") at senders.C:3301
#4  0x0000000000488bfa in init_environment_info () at starter.C:1201
#5  0x0000000000487972 in init () at starter.C:170
#6  0x00000000004bfcdc in StateMachine::execute (this=0x7fffc1581cf0) at state_machine_driver.C:142
#7  0x000000000048793d in main (argc=5, argv=0x7fffc1581e98) at starter.C:156

The full core

-- 
Stuart Anderson  anderson__AT__ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson

===========================================================================
Date mail was appended: Thu Nov 30 16:38:54 2006 (1164926335)
From: Derek Wright <wright__AT__cs.wisc.edu>
Subject: Re: [condor-support #1750] LIGO condor_starter core dumps
Date: Thu, 30 Nov 2006 15:58:30 -0800
To: condor-support__AT__cs.wisc.edu


On Nov 30, 2006, at 2:38 PM, Stuart Anderson wrote:

> Here is an example stack trace

ok, what's going on in the ShadowLog of your submit machine(s) when  
this happens?  the starter is dying while trying to send the message  
back to the shadow to tell it about itself when it's starting up.   
there's no good reason that should be failing, unless a) there's some  
kind of horrendous network failure going on or b) the shadow is dying  
right after it starts up, or c) some other crazy thing is happening.

now that i've seen the starter error, the next clue is definitely  
going to come from the ShadowLog.  the SchedLog might also have some  
clues, if the shadows are dying right away for some reason, etc.

thanks,
-derek



===========================================================================
Date mail was appended: Thu Nov 30 17:58:40 2006 (1164931120)
Subject: Actions

Status changed from open to pending by wright
===========================================================================
Date of actions: Thu Nov 30 17:58:40 2006 (1164931121)
Date: Thu, 30 Nov 2006 16:30:31 -0800
From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
To: condor-support response tracking system <condor-support__AT__cs.wisc.edu>
CC: espinoza_e__AT__ligo.caltech.edu
Subject: Re: [condor-support #1750] LIGO condor_starter core dumps

On Thu, Nov 30, 2006 at 05:58:40PM -0600, condor-support response tracking system wrote:
> 
> 
> On Nov 30, 2006, at 2:38 PM, Stuart Anderson wrote:
> 
> > Here is an example stack trace
> 
> ok, what's going on in the ShadowLog of your submit machine(s) when  
> this happens?  the starter is dying while trying to send the message  
> back to the shadow to tell it about itself when it's starting up.   
> there's no good reason that should be failing, unless a) there's some  
> kind of horrendous network failure going on or b) the shadow is dying  
> right after it starts up, or c) some other crazy thing is happening.
> 
> now that i've seen the starter error, the next clue is definitely  
> going to come from the ShadowLog.  the SchedLog might also have some  
> clues, if the shadows are dying right away for some reason, etc.
> 

Just to be clear, you are now forking this support ticket into 2 parts:

1) Prevent condor_starter.std from core dumping with SIGABRT.

2) Find out why condor_starter.std is generating a SIGABRT in the
   first place.

???


I am not sure what to look for in the log files, so I have put a copy of
the last Sched and Shadow logs for the last ~36 hours at

http://www.ligo.caltech.edu/~anderson/condor.1750/ShadowLog.gz
http://www.ligo.caltech.edu/~anderson/condor.1750/SchedLog.gz

I doubt it is option a) since the network appears to be working well.
It also appears that this is only happening for condor_starter.std,
not condor_starter.

Thanks.

-- 
Stuart Anderson  anderson__AT__ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson

===========================================================================
Date mail was appended: Thu Nov 30 18:30:53 2006 (1164933053)
From: Derek Wright <wright__AT__cs.wisc.edu>
Subject: Re: [condor-support #1750] LIGO condor_starter core dumps
Date: Thu, 30 Nov 2006 16:53:32 -0800
To: condor-support__AT__cs.wisc.edu

On Nov 30, 2006, at 4:30 PM, Stuart Anderson wrote:

> Just to be clear, you are now forking this support ticket into 2 parts:
>
> 1) Prevent condor_starter.std from core dumping with SIGABRT.
>
> 2) Find out why condor_starter.std is generating a SIGABRT in the
>    first place.
>
> ???

indeed.  from our perspective, it should just EXCEPT() in these  
cases, not SIGABRT.

however, from your perspective, you'd like to know why it's EXCEPT()ing
or SIGABRTing, regardless. ;)

> I am not sure what to look for in the log files, so I have put a copy of
> the last Sched and Shadow logs for the last ~36 hours at
>
> http://www.ligo.caltech.edu/~anderson/condor.1750/ShadowLog.gz
> http://www.ligo.caltech.edu/~anderson/condor.1750/SchedLog.gz

gunzipping now.  i'll let you know what i find.  what would be  
helpful would be some of the StarterLog files, too, just so i can see  
some timestamps of when the starters are dying, which i'd use to  
match up with what was happening at the Shadow at that time...
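
A minimal sketch of that kind of timestamp matching, assuming both logs
start each line with an 'MM/DD HH:MM:SS' stamp as in the excerpts in this
ticket. The logs omit the year, so 2006 is assumed from the ticket dates.
The function names and the 60-second window are hypothetical, and the
ShadowLog sample lines are invented for illustration.

```python
from datetime import datetime, timedelta

ASSUMED_YEAR = 2006  # the log stamps carry no year


def parse_stamp(line, year=ASSUMED_YEAR):
    """Parse the leading 'MM/DD HH:MM:SS' stamp of a Condor log line."""
    return datetime.strptime(f"{year}/{line[:14]}", "%Y/%m/%d %H:%M:%S")


def shadow_lines_near(starter_line, shadow_lines, window_seconds=60):
    """Return ShadowLog lines stamped within the window of a StarterLog line."""
    when = parse_stamp(starter_line)
    window = timedelta(seconds=window_seconds)
    return [line for line in shadow_lines
            if abs(parse_stamp(line) - when) <= window]


# StarterLog line taken from earlier in this ticket.
STARTER_LINE = "11/26 13:48:40 condor_read(): recv() returned -1, errno = 104"
# Invented ShadowLog lines, NOT taken from the real ShadowLog.
SHADOW_LINES = [
    "11/26 13:48:39 (0.0) (1): hypothetical shadow message A",
    "11/26 13:55:00 (0.0) (2): hypothetical shadow message B",
]

if __name__ == "__main__":
    for line in shadow_lines_near(STARTER_LINE, SHADOW_LINES):
        print(line)
```

The window needs to be wide enough to absorb any clock skew between the
submit machine and the execute node.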

> I doubt it is option a) since the network appears to be working well.
> It also appears that this is only happening for condor_starter.std,
> not condor_starter.

ok, duly noted.  thanks.

-derek



===========================================================================
Date mail was appended: Thu Nov 30 18:53:42 2006 (1164934423)
From: Derek Wright <wright__AT__cs.wisc.edu>
Subject: Re: [condor-support #1750] LIGO condor_starter core dumps
Date: Thu, 30 Nov 2006 17:01:47 -0800
To: condor-support__AT__cs.wisc.edu


On Nov 30, 2006, at 4:30 PM, Stuart Anderson wrote:

> 2) Find out why condor_starter.std is generating a SIGABRT in the
>    first place.

forget about the StarterLog files, i don't need them.

your ShadowLog is *full* of errors like this:

11/30 16:18:08 (9142339.0) (2449):ERROR "Can't chdir() to "/archive/home/cokelaer/S4/FullRunFinal3test"! [No such file or directory(2)]" at line 822 in file shadow.C

here's what's happening:

1) the shadow starts up
2) shadow connects to the startd to request a starter
3) tries to initialize itself and blows up
4) meanwhile, starter starts up, and tries to phone home to the shadow
5) shadow is now dead, socket is closed, and starter freaks out
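
A minimal sketch of steps 4) and 5), using a local socket pair in place of
the real starter/shadow connection. The peer is closed before the "starter"
side talks to it, and the failure is reported back to the caller instead of
being treated as fatal. The function name, the 18-byte payload and the
5-byte read merely echo the sizes in the StarterLog; none of this is
Condor's actual protocol.

```python
import socket


def register_with_shadow(sock, payload):
    """Send a start-up message; report failure instead of aborting."""
    try:
        sock.sendall(payload)
        reply = sock.recv(5)
    except OSError as exc:
        # Covers EPIPE / ECONNRESET style errors from a vanished peer.
        return False, f"shadow connection failed: {exc}"
    if not reply:
        # An empty read means the peer closed the connection.
        return False, "shadow closed the connection"
    return True, "registered"


def demo():
    starter_end, shadow_end = socket.socketpair()
    shadow_end.close()  # step 5: the shadow is already gone
    try:
        return register_with_shadow(starter_end, b"x" * 18)
    finally:
        starter_end.close()


if __name__ == "__main__":
    print(demo())
```

The caller gets a (False, message) pair it can log before shutting down
cleanly, which is the behaviour being asked for in place of the assert().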

a few action items:

a) it's dumb for the starter to assert() in this case (which we're  
already talking about)

b) it's kind of dumb for the shadow to request the starter before it  
does its own initialization.  i'll consider this with some other  
developers and figure out what/if we should do about it.

c) you should remove all the errant jobs that have bogus initial  
directories.
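
A minimal sketch for item c), assuming the ShadowLog error has the shape
quoted above once the mail line wrapping is undone (the unwrapped form is
an assumption). It groups job IDs by the directory the shadow could not
chdir() into. The function name is hypothetical.

```python
import re
from collections import defaultdict

# Matches: MM/DD HH:MM:SS (cluster.proc) (pid):ERROR "Can't chdir() to "<dir>"!
CHDIR_ERROR = re.compile(
    r'^(?P<stamp>\d\d/\d\d \d\d:\d\d:\d\d) '
    r'\((?P<job>\d+\.\d+)\) \(\d+\):ERROR '
    r'"Can\'t chdir\(\) to "(?P<iwd>[^"]+)"!'
)


def jobs_with_missing_iwd(lines):
    """Map each unreachable initial directory to the job IDs that used it."""
    by_dir = defaultdict(set)
    for line in lines:
        match = CHDIR_ERROR.match(line)
        if match:
            by_dir[match.group("iwd")].add(match.group("job"))
    return {iwd: sorted(jobs) for iwd, jobs in by_dir.items()}


# The error quoted in this message, with the wrapped lines rejoined.
SAMPLE = (
    '11/30 16:18:08 (9142339.0) (2449):ERROR "Can\'t chdir() to '
    '"/archive/home/cokelaer/S4/FullRunFinal3test"! '
    '[No such file or directory(2)]" at line 822 in file shadow.C'
)

if __name__ == "__main__":
    for iwd, jobs in jobs_with_missing_iwd([SAMPLE]).items():
        print(iwd, jobs)
```

The resulting job IDs are in cluster.proc form, so they could be reviewed
and then passed to condor_rm.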


cheers,
-derek



===========================================================================
Date mail was appended: Thu Nov 30 19:01:57 2006 (1164934917)
Subject: Actions

Status changed from open to pending by wright
===========================================================================
Date of actions: Thu Nov 30 19:01:57 2006 (1164934918)
Date: Thu, 30 Nov 2006 20:52:23 -0800
From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
To: condor-support response tracking system <condor-support__AT__cs.wisc.edu>
CC: espinoza_e__AT__ligo.caltech.edu
Subject: Re: [condor-support #1750] LIGO condor_starter core dumps

On Thu, Nov 30, 2006 at 07:01:57PM -0600, condor-support response tracking system wrote:
> 
> here's what's happening:
> 
> 1) the shadow starts up
> 2) shadow connects to the startd to request a starter
> 3) tries to initialize itself and blows up
> 4) meanwhile, starter starts up, and tries to phone home to the shadow
> 5) shadow is now dead, socket is closed, and starter freaks out
> 
> a few action items:
> 
> a) it's dumb for the starter to assert() in this case (which we're  
> already talking about)

Agreed.

> 
> b) it's kind of dumb for the shadow to request the starter before it  
> does its own initialization.  i'll consider this with some other  
> developers and figure out what/if we should do about it.

That sounds like a worthwhile optimization to me.

> 
> c) you should remove all the errant jobs that have bogus initial  
> directories.

There are not any running now, but that will not stop someone from
submitting a DAG with a one-character typo in the future that
submits 1000s of jobs and generates 1000s of core dump images before
they notice :)


Action item d)

Now that ticket 1694 is under control we will probably disable all
condor core dumps on the execute machines next week until this gets
resolved--sometimes ignorance really is bliss.


Question:

I am still curious why we only recently started seeing this. Is step
3) and/or 5) in your hypothesis somehow 64-bit specific? I am pretty
sure we did not have this problem with 6.8.0 or the 6.7.x series, and
we had the same set of users on the same hardware. We were running with
ulimit -c 0
during that time, but the 32-bit condor library was able to override this
and generate lots of other coredumps for the other support tickets we
have opened on those older versions of condor.  It is certainly not
necessary to understand this, but I wonder if it fits into your current
working theory or not?
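
A minimal sketch of the mechanism behind `ulimit -c 0`, in case it helps
frame the question. The shell command sets the soft RLIMIT_CORE to 0, but
any process may raise its own soft limit again up to the hard limit.
Whether the older Condor binaries did exactly this is an assumption, not
something established in this ticket. The helper names are hypothetical.

```python
import resource


def core_limits():
    """Return the (soft, hard) core file size limits of this process."""
    return resource.getrlimit(resource.RLIMIT_CORE)


def set_soft_core_limit(soft):
    """Change only the soft limit, leaving the hard limit untouched."""
    _, hard = core_limits()
    resource.setrlimit(resource.RLIMIT_CORE, (soft, hard))


if __name__ == "__main__":
    _, hard = core_limits()
    set_soft_core_limit(0)      # what `ulimit -c 0` does to the soft limit
    print("cores disabled:", core_limits())
    set_soft_core_limit(hard)   # an unprivileged process may undo it
    print("cores re-enabled up to the hard limit:", core_limits())
```

If the hard limit is also 0, the second call cannot re-enable core dumps,
so lowering the hard limit is the stricter way to suppress them.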

Thanks.

-- 
Stuart Anderson  anderson__AT__ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson

===========================================================================
Date mail was appended: Thu Nov 30 22:52:43 2006 (1164948763)
Date: Fri, 6 Apr 2007 13:09:54 -0700
From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
To: condor-support response tracking system <condor-support__AT__cs.wisc.edu>
CC: espinoza_e__AT__ligo.caltech.edu
Subject: Re: [condor-support #1750] LIGO condor_starter core dumps

I have lost track of this. Have both actions a) and b) been taken? If so,
in which release(s)?

Thanks.

On Thu, Nov 30, 2006 at 07:01:57PM -0600, condor-support response tracking system wrote:
> 
> 
> On Nov 30, 2006, at 4:30 PM, Stuart Anderson wrote:
> 
> > 2) Find out why condor_starter.std is generating a SIGABRT in the
> >    first place.
> 
> forget it about the StarterLog, i don't need them.
> 
> your ShadowLog is *full* of errors like this:
> 
> 11/30 16:18:08 (9142339.0) (2449):ERROR "Can't chdir() to "/archive/ 
> home/cokelaer/S4/FullRunFinal3test"! [No such file or directory(2)]"  
> at line 822 in file shadow.C
> 
> here's what's happening:
> 
> 1) the shadow starts up
> 2) shadow connects to the startd to request a starter
> 3) tries to initialize itself and blows up
> 4) meanwhile, starter starts up, and tries to phone home to the shadow
> 5) shadow is now dead, socket is closed, and starter freaks out
> 
> a few action items:
> 
> a) it's dumb for the starter to assert() in this case (which we're  
> already talking about)
> 
> b) it's kind of dumb for the shadow to request the starter before it  
> does its own initialization.  i'll consider this with some other  
> developers and figure out what/if we should do about it.
> 
> c) you should remove all the errant jobs that have bogus initial  
> directories.
> 
> 
> cheers,
> -derek
> 
> 
> 
> 
> ========================================
> MESSAGE INFORMATION
> ========================================
> * From: Derek Wright <wright__AT__cs.wisc.edu>
> * Ticket Email List: anderson__AT__ligo.caltech.edu, espinoza_e__AT__ligo.caltech.edu
> 
> -- 
> ======================================================================
> This mail was sent from the RUST Mail System
> Please direct all replies to condor-support__AT__cs.wisc.edu
> Please include the current subject line in your reply.
> ======================================================================
> 
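
The sequence in steps 1) to 5) above, and the behavior that action item a)
asks for, can be illustrated with a minimal sketch. This is not Condor code:
phone_home and dead_peer_port are hypothetical names, and a plain TCP
connection on localhost stands in for the starter/shadow link. It only shows
the idea of treating a vanished peer as an ordinary error to report, rather
than something to assert() about.

```python
import socket


def dead_peer_port():
    """Simulate a peer that started up and then died.

    Bind and listen on an ephemeral localhost port, remember the port
    number, then close the socket so nothing is listening any more.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("127.0.0.1", 0))
        s.listen(1)
        return s.getsockname()[1]


def phone_home(host, port, timeout=2.0):
    """Try to contact the peer; return True on success, False if it is gone.

    A missing peer is reported and signalled through the return value
    instead of aborting the process.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError as exc:
        print(f"peer at {host}:{port} is unreachable ({exc}); giving up cleanly")
        return False


if __name__ == "__main__":
    port = dead_peer_port()
    # The "shadow" is already gone, so this takes the error path.
    ok = phone_home("127.0.0.1", port)
    raise SystemExit(0 if ok else 1)
```

The caller decides what to do with a False result (log it, exit with a
non-zero status), which leaves no core file behind for an expected failure.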

-- 
Stuart Anderson  anderson__AT__ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson

===========================================================================
Date mail was appended: Fri Apr  6 15:10:11 2007 (1175890211)
Date: Wed, 24 Oct 2007 11:42:42 -0700
From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
To: condor-support response tracking system <condor-support__AT__cs.wisc.edu>
Subject: Re: [condor-support #1750] LIGO condor_starter core dumps

Have items a) and b) been taken care of so this ticket can be closed?

Thanks.
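
For reference, item c) in the action list quoted earlier in this thread
(removing jobs with bogus initial directories) is the one step on the
submit side. A minimal sketch of the check, assuming the (job id, initial
directory) pairs have already been pulled from the queue by some means;
find_jobs_with_missing_iwd and the sample values are hypothetical, and the
check has to run on the submit machine, where the shadow does its chdir():

```python
import os


def find_jobs_with_missing_iwd(jobs):
    """Return the (job_id, iwd) pairs whose initial directory does not exist.

    `jobs` is an iterable of (job_id, iwd) string pairs.
    """
    return [(job_id, iwd) for job_id, iwd in jobs if not os.path.isdir(iwd)]


if __name__ == "__main__":
    # Hypothetical sample data; real pairs would come from the job queue.
    sample = [
        ("1.0", os.getcwd()),
        ("2.0", "/hypothetical/path/that/does/not/exist"),
    ]
    for job_id, iwd in find_jobs_with_missing_iwd(sample):
        print(f"{job_id}: initial directory missing: {iwd}")
```

The jobs it reports are the candidates for removal.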

-- 
Stuart Anderson  anderson__AT__ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson

===========================================================================
Date mail was appended: Wed Oct 24 13:42:58 2007 (1193251378)
Date: Wed, 11 Jun 2008 11:12:25 -0700
From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
To: condor-support response tracking system <condor-support__AT__cs.wisc.edu>
Subject: Re: [condor-support #1750] LIGO condor_starter core dumps

Derek,
	What is the status of this?

Thanks.

-- 
Stuart Anderson  anderson__AT__ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson

===========================================================================
Date mail was appended: Wed Jun 11 13:12:38 2008 (1213207958)
Subject: Actions

Ticket resolved by tannenba
===========================================================================
Date of actions: Fri Jul 18 13:29:29 2008 (1216405769)