LIGO Support Ticket 19727
Ticket Information
Number: admin 19727
User: anderson@ligo.caltech.edu
Email: jabadie__AT__ligo.caltech.edu
Status: new
Assigned To: zmiller
From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
To: condor-admin response tracking system <condor-admin__AT__cs.wisc.edu>
Subject: LIGO: condor_submit and condor_shadow segfault
Date: Thu, 17 Sep 2009 15:14:24 -0700
CC: Josh Abadie <jabadie__AT__ligo.caltech.edu>
X-Seen-BY: mailfromd 4.1 gypsum.cs.wisc.edu
Running Condor version,
# condor_version
$CondorVersion: 7.2.4 Jun 15 2009 BuildID: 159529 $
$CondorPlatform: X86_64-LINUX_RHEL5 $
it was observed that condor_submit and condor_shadow segfaulted when
the root filesystem was full on a submit machine. This is not a
critical issue, since clearly we needed to fix the underlying
filesystem problem. However, it is probably worth taking a look at
why a full filesystem results in a segfault; e.g., perhaps there is a
call to write() whose error return is being ignored, which might
cause more subtle problems under other circumstances.
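To illustrate the kind of check meant here, the following is a minimal
sketch in C of a write wrapper that never ignores the return value of
write(). It is not taken from the Condor sources, and it is not known
whether an unchecked write() is the actual cause of these crashes; the
helper name write_all is hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

/* Hypothetical helper, not part of Condor.
 * Write all len bytes of buf to fd, retrying on short writes and EINTR.
 * Returns 0 on success, or -1 with errno set by write() (for example
 * ENOSPC on a full filesystem), so the caller can report the failure
 * instead of continuing with partially written data. */
static int write_all(int fd, const void *buf, size_t len)
{
    const char *p = buf;

    assert(buf != NULL || len == 0);

    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;       /* interrupted before writing: retry */
            return -1;          /* real error: errno says why */
        }
        p += n;                 /* short write: advance and continue */
        len -= (size_t)n;
    }
    return 0;
}
```

A caller would test the result, e.g. if (write_all(fd, data, n) != 0),
and log strerror(errno) before giving up. On Linux, opening /dev/full
for writing is a convenient way to exercise the ENOSPC path without
actually filling a disk.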
I don't have any of the condor_submit core files, but here is the
evidence of those segfaults from the system log:
[root@ldas-pcdev1 log]# grep segfault /var/log/messages | tail
Sep 17 15:02:26 ldas-pcdev1 kernel: condor_submit[13319]: segfault at 0000000000000000 rip 00000000005836c2 rsp 00007fff0ef196c0 error 4
Sep 17 15:02:28 ldas-pcdev1 kernel: condor_submit[13334]: segfault at 0000000000000000 rip 00000000005836c2 rsp 00007fffead6e370 error 4
Sep 17 15:02:32 ldas-pcdev1 kernel: condor_submit[13357]: segfault at 0000000000000000 rip 00000000005836c2 rsp 00007fffec678c00 error 4
Sep 17 15:02:34 ldas-pcdev1 kernel: condor_submit[13369]: segfault at 0000000000000000 rip 00000000005836c2 rsp 00007fff78ba70b0 error 4
Sep 17 15:02:38 ldas-pcdev1 kernel: condor_submit[13381]: segfault at 0000000000000000 rip 00000000005836c2 rsp 00007fffd202ce90 error 4
Sep 17 15:02:39 ldas-pcdev1 kernel: condor_submit[13390]: segfault at 0000000000000000 rip 00000000005836c2 rsp 00007fffce170f30 error 4
Sep 17 15:02:42 ldas-pcdev1 kernel: condor_submit[13395]: segfault at 0000000000000000 rip 00000000005836c2 rsp 00007fff78c742f0 error 4
Sep 17 15:02:43 ldas-pcdev1 kernel: condor_submit[13399]: segfault at 0000000000000000 rip 00000000005836c2 rsp 00007fffcb2c8540 error 4
Sep 17 15:02:56 ldas-pcdev1 kernel: condor_submit[13405]: segfault at 0000000000000000 rip 00000000005836c2 rsp 00007fff6d0ae4d0 error 4
Sep 17 15:02:59 ldas-pcdev1 kernel: condor_submit[13412]: segfault at 0000000000000000 rip 00000000005836c2 rsp 00007fff5ad84650 error 4
I do have a few hundred shadow core files if you want any, but here is
a quick look:
[root@ldas-pcdev1 log]# ls -l core.991
-rw------- 1 root root 1179648 Sep 17 15:02 core.991
[root@ldas-pcdev1 log]# file core.991
core.991: ELF 64-bit LSB core file AMD x86-64, version 1 (SYSV), SVR4-style, from 'condor_shadow'
[root@ldas-pcdev1 log]# gdb /usr/sbin/condor_shadow core.991
GNU gdb Fedora (6.8-27.el5)
Copyright (C) 2008 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu"...
(no debugging symbols found)
Reading symbols from /lib64/libdl.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/libdl.so.2
Reading symbols from /lib64/libcrypt.so.1...(no debugging symbols found)...done.
Loaded symbols for /lib64/libcrypt.so.1
Reading symbols from /lib64/libresolv.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/libresolv.so.2
Reading symbols from /usr/lib64/libstdc++.so.6...(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/libstdc++.so.6
Reading symbols from /lib64/libm.so.6...(no debugging symbols found)...done.
Loaded symbols for /lib64/libm.so.6
Reading symbols from /lib64/libgcc_s.so.1...(no debugging symbols found)...done.
Loaded symbols for /lib64/libgcc_s.so.1
Reading symbols from /lib64/libc.so.6...(no debugging symbols found)...done.
Loaded symbols for /lib64/libc.so.6
Reading symbols from /lib64/ld-linux-x86-64.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/ld-linux-x86-64.so.2
Reading symbols from /lib64/libnss_files.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/libnss_files.so.2
Reading symbols from /lib64/libnss_dns.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/libnss_dns.so.2
Core was generated by `condor_shadow -f 20648509.0 --schedd=<10.14.0.18:50189> --xfer-queue=limit=uplo'.
Program terminated with signal 11, Segmentation fault.
[New process 991]
#0 0x00000039b4e30215 in raise () from /lib64/libc.so.6
(gdb) where
#0 0x00000039b4e30215 in raise () from /lib64/libc.so.6
#1 0x000000000050f00e in linux_sig_coredump ()
#2 <signal handler called>
#3 0x00000000005ea6aa in Condor_Auth_Kerberos::init_server_info ()
#4 0x00000000005ebf3d in Condor_Auth_Kerberos::authenticate ()
#5 0x00000000005d417c in Authentication::authenticate_inner ()
#6 0x00000000005d4819 in Authentication::authenticate ()
#7 0x00000000005c6259 in ReliSock::perform_authenticate ()
#8 0x00000000005c634b in ReliSock::authenticate ()
#9 0x00000000005ddb8e in SecMan::authenticate_sock ()
#10 0x0000000000595d54 in ConnectQ ()
#11 0x0000000000596325 in QmgrJobUpdater::updateJob ()
#12 0x00000000005963d2 in QmgrJobUpdater::periodicUpdateQ ()
#13 0x0000000000514935 in TimerManager::Timeout ()
#14 0x0000000000506f69 in DaemonCore::Driver ()
#15 0x00000000005111f5 in main ()
Note, we are not using Kerberos authentication with Condor.
Thanks.
--
Stuart Anderson anderson__AT__ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson
===========================================================================
Date of creation: Thu Sep 17 17:14:37 2009 (1253225680)
Subject: Actions
Assigned to zmiller by griswold
===========================================================================
Date of actions: Fri Sep 18 12:25:43 2009 (1253294743)
Subject: Comments added
Created gittrac ticket #829 to track this issue as well.
Comments added by psilord
===========================================================================
Date comments were added: Fri Oct 9 11:42:36 2009 (1255106556)