LIGO Support Ticket 19595

Ticket Information
  Number:      admin 19595
  User:        anderson@ligo.caltech.edu
  Email:       jabadie__AT__ligo.caltech.edu
  Status:      open
  Assigned To: gthain
From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
To: condor-admin__AT__cs.wisc.edu
Subject: LIGO: Shadow failures with Illegal instruction
Date: Sat, 15 Aug 2009 11:52:05 -0700
CC: Josh Abadie <jabadie__AT__ligo.caltech.edu>
X-Seen-BY: mailfromd 4.1 silica.cs.wisc.edu

One of the submit machines for the LIGO Caltech condor pool is  
generating a large number of Shadow failures.

# condor_version
$CondorVersion: 7.2.4 Jun 15 2009 BuildID: 159529 $
$CondorPlatform: X86_64-LINUX_RHEL5 $

# uname -a
Linux ldas-grid 2.6.18-128.1.10.el5 #1 SMP Thu May 7 10:35:59 EDT 2009 x86_64 x86_64 x86_64 GNU/Linux

# grep "Illegal instruction" SchedLog | wc -l
21934


# grep "Illegal instruction" SchedLog | tail
8/15 04:37:51 (pid:25772) Shadow pid 5142 died with signal 4 (Illegal instruction)
8/15 05:36:00 (pid:25772) Shadow pid 21754 died with signal 4 (Illegal instruction)
8/15 05:38:07 (pid:25772) Shadow pid 21929 died with signal 4 (Illegal instruction)
8/15 06:37:08 (pid:25772) Shadow pid 30545 died with signal 4 (Illegal instruction)
8/15 06:38:45 (pid:25772) Shadow pid 30821 died with signal 4 (Illegal instruction)
8/15 07:37:26 (pid:25772) Shadow pid 4411 died with signal 4 (Illegal instruction)
8/15 07:39:04 (pid:25772) Shadow pid 4529 died with signal 4 (Illegal instruction)
8/15 08:39:11 (pid:25772) Shadow pid 9944 died with signal 4 (Illegal instruction)
8/15 08:39:11 (pid:25772) Shadow pid 9945 died with signal 4 (Illegal instruction)
8/15 09:39:19 (pid:25772) Shadow pid 16174 died with signal 4 (Illegal instruction)
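
The grep counts above give a total but not a distribution over time. As a rough sketch (not part of the original report), the same SchedLog lines could be bucketed by day, hour, and signal in Python. The function name `tally_shadow_deaths` and the regular expression are assumptions for illustration, based only on the line format shown in this ticket:

```python
import re
from collections import Counter

# Matches the start of SchedLog lines such as:
#   8/15 04:37:51 (pid:25772) Shadow pid 5142 died with signal 4 (Illegal instruction)
# The pattern is inferred from the excerpt in this ticket, not from Condor documentation.
SHADOW_DEATH = re.compile(
    r"^(\d+)/(\d+) (\d+):\d+:\d+ \(pid:\d+\) Shadow pid (\d+) died with signal (\d+)"
)

def tally_shadow_deaths(lines):
    """Count shadow deaths keyed by (month/day, hour, signal number)."""
    counts = Counter()
    for line in lines:
        m = SHADOW_DEATH.match(line)
        if m:
            month, day, hour, _pid, sig = m.groups()
            counts[(f"{month}/{day}", int(hour), int(sig))] += 1
    return counts
```

Calling `tally_shadow_deaths(open("SchedLog"))` and printing the sorted items would show whether the failures cluster at particular hours.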

Digging a bit deeper for the last instance:

# grep 16174 ShadowLog | tail
8/14 10:34:33 (52969141.0) (16174):Shadow: Job 52969141.0 exited, termsig = 9, coredump = 0, retcode = 0
8/14 10:34:33 (52969141.0) (16174):Shadow: Job was kicked off without a checkpoint
8/14 10:34:33 (52969141.0) (16174):Shadow: DoCleanup: unlinking TmpCkpt '/usr1/condor/spool/cluster52969141.proc0.subproc0.tmp'
8/14 10:34:33 (52969141.0) (16174):Trying to unlink /usr1/condor/spool/cluster52969141.proc0.subproc0.tmp
8/14 10:34:33 (52969141.0) (16174):user_time = 0 ticks
8/14 10:34:33 (52969141.0) (16174):sys_time = 0 ticks
8/14 10:37:04 (52969141.0) (16174):********** Shadow Exiting(107) **********
8/14 19:19:28 (52972960.0) (16166):Reaped child status - pid 16174 exited with status 0
8/15 08:39:19 ** PID = 16174
8/15 08:39:19 (52971327.0) (16174): Request to run on slot5__AT__node340.ldas-cit.ligo.caltech.edu <10.14.2.90:50426> was ACCEPTED


From the starter log on the execute machine:

8/15 08:39:19 ******************************************************
8/15 08:39:19 ** condor_starter (CONDOR_STARTER) STARTING UP
8/15 08:39:19 ** /usr/sbin/condor_starter
8/15 08:39:19 ** SubsystemInfo: name=STARTER type=STARTER(8) class=DAEMON(1)
8/15 08:39:19 ** Configuration: subsystem:STARTER local:<NONE> class:DAEMON
8/15 08:39:19 ** $CondorVersion: 7.2.4 Jun 15 2009 BuildID: 159529 $
8/15 08:39:19 ** $CondorPlatform: X86_64-LINUX_RHEL5 $
8/15 08:39:19 ** PID = 6253
8/15 08:39:19 ** Log last touched 8/15 08:39:19
8/15 08:39:19 ******************************************************
8/15 08:39:19 Using config source: /usr1/condor/condor_config
8/15 08:39:19 Using local config sources:
8/15 08:39:19    /usr1/condor/condor_config.local
8/15 08:39:19 DaemonCore: Command Socket at <10.14.2.90:58968>
8/15 08:39:19 Done setting resource limits
8/15 08:39:19 Communicating with shadow <10.14.0.12:46794>
8/15 08:39:19 Submitting machine is "ldas-grid.ligo.caltech.edu"
8/15 08:39:19 setting the orig job name in starter
8/15 08:39:19 setting the orig job iwd in starter
8/15 08:39:19 Job 52971327.0 set to execute immediately
8/15 08:39:19 Starting a VANILLA universe job with ID: 52971327.0
8/15 08:39:19 IWD: /mnt/qfs1/lppekows/daily_ihope_runs/week5/20090809
8/15 08:39:19 Output file: /mnt/qfs1/lppekows/daily_ihope_runs/week5/20090809/logs/splitbank---52971327-0.out
8/15 08:39:19 Error file: /mnt/qfs1/lppekows/daily_ihope_runs/week5/20090809/logs/splitbank---52971327-0.err
8/15 08:44:19 condor_read(): timeout reading 5 bytes from <10.14.0.12:35207>.
8/15 08:44:19 IO: Failed to read packet header
8/15 08:44:19 ERROR "Assertion ERROR on (result)" at line 384 in file NTsenders.cpp
8/15 08:44:19 ShutdownFast all jobs.
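
The starter log above goes quiet between the job start at 08:39:19 and the condor_read() timeout at 08:44:19. A minimal Python sketch for measuring such gaps between two log timestamps follows. The helper name `log_gap_seconds` is hypothetical, and because these logs omit the year, the default of 2009 is an assumption taken from the ticket date:

```python
from datetime import datetime

def log_gap_seconds(first, second, year=2009):
    """Return seconds between two 'M/D HH:MM:SS' log timestamps.

    Assumes both timestamps fall in the same year (the logs do not record it).
    """
    fmt = "%Y/%m/%d %H:%M:%S"
    t1 = datetime.strptime(f"{year}/{first}", fmt)
    t2 = datetime.strptime(f"{year}/{second}", fmt)
    return (t2 - t1).total_seconds()
```

For `log_gap_seconds("8/15 08:39:19", "8/15 08:44:19")` this gives 300 seconds, so the starter appears to have waited five minutes on the connection to 10.14.0.12 before giving up.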


Note that this is seen on only 1 of the 3 active submit machines.

Thanks.

--
Stuart Anderson  anderson__AT__ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson




===========================================================================
Date of creation: Sat Aug 15 13:52:25 2009 (1250362348)
Subject: Actions

Assigned to gthain by gthain
===========================================================================
Date of actions: Mon Aug 17 10:47:50 2009 (1250524071)
Date: Mon, 17 Aug 2009 10:49:28 -0500
From: Greg Thain <gthain__AT__cs.wisc.edu>
To: condor-admin__AT__cs.wisc.edu
Subject: Re: [condor-admin #19595] LIGO: Shadow failures with Illegal
 instruction


Stuart:

It is interesting that this seems to be happening for both Standard 
universe and vanilla universe jobs -- would it be possible to send me 
the full ShadowLog and a corresponding StarterLog?

-Greg

===========================================================================
Date mail was appended: Mon Aug 17 10:49:33 2009 (1250524174)
CC: jabadie__AT__ligo.caltech.edu
From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
To: condor-admin__AT__cs.wisc.edu
Subject: Re: [condor-admin #19595] LIGO: Shadow failures with Illegal
 instruction
Date: Mon, 17 Aug 2009 17:22:03 -0700
X-Seen-BY: mailfromd 4.1 silica.cs.wisc.edu

I have put the Sched and Shadow logs at
http://www.ligo.caltech.edu/~anderson/condor.19595
along with a tar file of all the Start* logs from one randomly chosen
execute machine (node340).

Thanks.

--
Stuart Anderson  anderson__AT__ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson




===========================================================================
Date mail was appended: Mon Aug 17 19:22:22 2009 (1250554943)