LIGO Support Ticket 14572

Ticket Information
  Number:      admin 14572
  User:        anderson@ligo.caltech.edu
  Email:       
  Status:      new
  Assigned To: tannenba
Date: Mon, 4 Dec 2006 15:47:40 -0800
From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
To: condor-admin__AT__cs.wisc.edu
Subject: LIGO schedd deadlock on startup acquiring lock on user log file

On the LIGO Caltech Condor pool running,

# condor_version
$CondorVersion: 6.8.2 Oct 12 2006 $
$CondorPlatform: X86_64-LINUX_RHEL3 $
# cat /etc/redhat-release 
Fedora Core release 4 (Stentz)
# uname -a
Linux node329 2.6.18.3-CIT #1 SMP Sun Nov 26 13:16:15 PST 2006 x86_64 x86_64 x86_64 GNU/Linux

it was observed that the condor_schedd process would reproducibly deadlock
on startup before scheduling any jobs on the cluster. In particular,
condor_schedd was running with an euid of a particular user (not the
normal condor user) and strace showed that it was trying to obtain a file
lock on a file owned by that same user--presumably a user condor log file.

It was also observed that this user had several jobs stuck in the X state
(for several days) and presumably the schedd was getting stuck trying to
"investigate" these jobs.

The workaround was to rename these log files, e.g.,
$ mv /archive/home/xsiemens/small-grid/hybrid/log.130 /archive/home/xsiemens/small-grid/hybrid/log.130.tmp
to prevent the deadlock, after which condor_schedd was able to start up
all the way and begin running jobs again.  This also resulted in the X state
jobs being flushed from the queue.

Obviously Condor cannot fix all the various Linux/NFS file locking bugs,
but it may be possible to make condor_schedd startup more robust to
leaked locks on user files.
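
As an illustration of what "more robust" could mean here, the following is a
minimal Python sketch of a lock attempt that gives up after a timeout instead
of blocking forever. It is not Condor's code and does not reflect how
condor_schedd actually locks user logs; the function name
try_lock_with_timeout and its parameters timeout_s and poll_s are hypothetical.
It assumes a POSIX system where fcntl-style advisory locks are in use.

```python
import errno
import fcntl
import os
import time


def try_lock_with_timeout(path, timeout_s=5.0, poll_s=0.1):
    """Try to take an exclusive advisory lock on `path`.

    Returns the open file descriptor on success, or None if the lock
    could not be obtained within `timeout_s` seconds (for example,
    because another holder never released it).
    """
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            # LOCK_NB makes the call fail immediately instead of blocking.
            fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
            return fd
        except OSError as exc:
            # EACCES/EAGAIN mean "held by someone else"; anything else is
            # a real error and is re-raised.
            if exc.errno not in (errno.EACCES, errno.EAGAIN):
                os.close(fd)
                raise
        if time.monotonic() >= deadline:
            os.close(fd)
            return None
        time.sleep(poll_s)
```

A caller that gets None back could log a warning and skip or defer that one
user log rather than stalling the whole startup. Whether a non-blocking
request behaves well against a lock leaked on an NFS server depends on the
client and server lock implementations, so this is only a sketch of the idea.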

Thanks.

-- 
Stuart Anderson  anderson__AT__ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson

===========================================================================
Date of creation: Mon Dec  4 17:48:01 2006 (1165276085)
Subject: Actions

Assigned to tannenba by roy
===========================================================================
Date of actions: Thu Dec  7 10:44:21 2006 (1165509861)