Released mid-October 1999, this is the first public release of Condor NT.
In general, this preview release on NT works the same as the full-blown release of Condor for Unix.
However, the following items are still being worked on and are not supported in this preview:
universe = vanilla
Except for the functionality listed above, practically everything else works the same way in Condor NT Preview as it does in the full-blown release. This Preview release is based on the Condor 6.1.8 source tree, and thus the feature set is the same as 6.1.8. For instance, all of the following work in Condor NT:
Condor remote system calls and the ability to access network shares are not yet supported on NT, but they will be in the near future. For now, Condor NT users must utilize the Condor File Transfer mechanism.
When Condor finds a machine willing to execute your job, it will create a temporary subdirectory for your job on the execute machine. The Condor File Transfer mechanism will then send via TCP the job executable(s) and input files from the submitting machine into this temporary directory on the execute machine. After the input files have been transferred, the execute machine will start running the job with the temporary directory as the job's current working directory. When the job completes or is kicked off, Condor File Transfer will automatically send back to the submit machine any output files created by the job. After the files have been sent back successfully, the temporary working directory on the execute machine is deleted.
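The lifecycle above can be imitated on a single machine to make the order of the steps concrete. The following is only a sketch under stated assumptions: the function name `run_with_file_transfer` is hypothetical, local file copies stand in for the TCP transfer, a scratch directory stands in for the execute machine, and only newly created files are sent back. It is not how Condor itself is implemented.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path


def run_with_file_transfer(submit_dir, input_files, command):
    """Stage inputs into a scratch directory, run the job there, copy outputs back.

    Hypothetical helper: local copies stand in for Condor's TCP transfer.
    Returns (exit_code, names_of_files_sent_back).
    """
    submit_dir = Path(submit_dir)
    # 1. Create the temporary working directory for the job.
    scratch = Path(tempfile.mkdtemp(prefix="dir_"))
    try:
        # 2. Send the input files into the temporary directory.
        for name in input_files:
            shutil.copy2(submit_dir / name, scratch / name)
        staged = set(input_files)
        # 3. Run the job with the temporary directory as its working directory.
        result = subprocess.run(command, cwd=scratch)
        # 4. Send back any files the job created (simplification: new files only).
        returned = []
        for path in sorted(scratch.iterdir()):
            if path.is_file() and path.name not in staged:
                shutil.copy2(path, submit_dir / path.name)
                returned.append(path.name)
        return result.returncode, returned
    finally:
        # 5. Delete the temporary working directory.
        shutil.rmtree(scratch, ignore_errors=True)
```

The `finally` clause mirrors the last step in the text: the scratch directory is removed whether or not the job ran successfully.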
Condor's File Transfer mechanism has several features to ensure data integrity in a non-dedicated environment. For instance, transfers of multiple files are performed atomically.
To control the File Transfer mechanism, use the following new commands in the submit-description file:
It is highly recommended that you specify a Requirements expression in your submit-description file that checks the size of the Disk attribute when using File Transfer! Doing so can ensure that Condor picks a machine with enough local disk space for your job. Here is a sample submit-description file:
# Condor submit file for program "foo.exe".
#
# foo reads from files "my-input-data" and "my-other-input-data".
# foo then writes out results into several files.
# The total disk space foo uses for all input and output files
# is never more than 10 megabytes.
#
executable = foo.exe
# Now set Requirements saying that the machine which runs our job
# must have more than 10megs of free disk space. Note that "Disk"
# is expressed in kilobytes; 10meg is 10000 kbytes.
requirements = Disk > 10000
#
queue
If you do not specify a requirement on Disk (a bad idea!), condor_submit will append to the job ad Requirements that Disk >= DiskUsage. The DiskUsage attribute is in the job ad and represents the maximum amount of total disk space required by the job in kilobytes. Condor will automatically update DiskUsage approximately every 20 minutes while your job runs with the amount of space being used by the job on the execute machine.
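As a rough aid for choosing the Disk threshold, the sketch below totals the size of a job's input files in kilobytes and formats a matching requirements line. The helper names (`estimate_disk_kb`, `disk_requirement_line`, `default_disk_clause_holds`) and the `output_allowance_kb` parameter are hypothetical and exist only for illustration. The sketch counts 1024 bytes per kilobyte, whereas the sample above rounds 10 megabytes to 10000 kilobytes.

```python
import math
import os


def estimate_disk_kb(paths, output_allowance_kb=0):
    """Total size of the given files in kilobytes (rounded up), plus an
    allowance for output the job is expected to write."""
    total_bytes = sum(os.path.getsize(p) for p in paths)
    return math.ceil(total_bytes / 1024) + output_allowance_kb


def disk_requirement_line(required_kb):
    """Format a submit-description line; Disk is expressed in kilobytes."""
    return f"requirements = Disk > {required_kb}"


def default_disk_clause_holds(machine_disk_kb, disk_usage_kb):
    """The fallback clause described in the text: Disk >= DiskUsage."""
    return machine_disk_kb >= disk_usage_kb
```

Because DiskUsage is only refreshed periodically while the job runs, an explicit threshold that covers the job's peak usage is the safer choice.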
Itemized below are some current limitations of the File Transfer mechanism. We anticipate improvement in these areas in upcoming releases.
This section provides some details on how Condor NT starts and stops jobs. This discussion is geared for the Condor administrator or advanced user who is already familiar with the material in Chapter 2 (the Administrators' Manual) and wishes to know detailed information on what Condor NT does when starting/stopping jobs.
When Condor NT is about to start a job, the condor_startd on the execute machine spawns a condor_starter process. The condor_starter then creates:
Next, the condor_starter (henceforth called the starter) contacts the condor_shadow (henceforth called the shadow) process which is running on the submitting machine and pulls over the job's executable and input files. These files are placed into the temporary working subdirectory for the job. After all files have been received, the starter spawns the user's executable as user "condor-run-dir_XXX" with its current working directory set to the temporary working subdirectory (i.e. $(EXECUTE)/dir_XXX).
While the job is running, the starter is closely monitoring the CPU usage and image size of all processes started by the job. Every 20 minutes it sends this information, along with the total size of all files contained in the job's working subdirectory, to the shadow. The shadow then inserts this information into the job's ClassAd so policy/scheduling expressions can make use of this dynamic information.
If the job exits of its own accord (i.e. the job completes), the starter first terminates any processes started by the job which could still be lying around if the job did not clean up after itself. The starter then examines the job's temporary working subdirectory for any files which have been created or modified and sends these files back to the shadow running on the submit machine. The shadow places these files into the initialdir specified in the submit-description file; if no initialdir was specified, the files go into the directory where the user ran condor_submit. Once all the output files are safely transferred back, the job is removed from the queue. If, however, the condor_startd forcibly kills the job before all output files could be transferred, the job is not removed from the queue but instead switches back to Idle.
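One plausible way to find files that were created or modified is to record each file's size and modification time before the job starts and compare afterwards. The sketch below takes that approach. The names `snapshot` and `changed_files` are hypothetical, and the text does not say which method the starter actually uses.

```python
import os


def snapshot(directory):
    """Map each regular file name in `directory` to (size, mtime in ns)."""
    state = {}
    with os.scandir(directory) as entries:
        for entry in entries:
            if entry.is_file():
                st = entry.stat()
                state[entry.name] = (st.st_size, st.st_mtime_ns)
    return state


def changed_files(before, after):
    """Names that are new in `after` or whose size/mtime differ from `before`."""
    return sorted(name for name, sig in after.items() if before.get(name) != sig)
```

A typical use is to call `snapshot` once after the input files are staged and once after the job exits, then pass both results to `changed_files` to obtain the list of files to send back.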
If the condor_startd decides to vacate a job prematurely (perhaps because the startd policy says to kick off jobs whenever activity on the keyboard is detected), the starter sends a WM_CLOSE message to the job. If the job spawned multiple child processes, the WM_CLOSE message is only sent to the parent process (i.e. the one started by the starter). The WM_CLOSE message is the preferred way to terminate a process on Windows NT, since this method allows the job to clean up and free any resources it may have allocated. Then when the job exits, the starter cleans up any processes left behind. At this point, if transfer_files was set to ONEXIT (the default) in this job's submit file, the job simply switches from state Running to state Idle and no files are transferred back. But if transfer_files is set to ALWAYS, then any files in the job's temporary working directory which were created or modified are first sent back to the shadow. This time, however, the shadow places these so-called intermediate files into a subdirectory created in the $(SPOOL) directory on the submitting machine ($(SPOOL) is specified in Condor's configuration file). Then the job is switched back to the Idle state until Condor finds a different machine for it to run on. When the job is started again, Condor will place into the job's temporary working directory the executable and input files as before, plus any files stored in the submit machine's $(SPOOL) directory for that job.
NOTE: A Windows console process can intercept a WM_CLOSE message via the Win32 SetConsoleCtrlHandler() function if it needs to do special cleanup work at vacate time; a WM_CLOSE message generates a CTRL_CLOSE_EVENT. See SetConsoleCtrlHandler() in the Win32 documentation for more info.
NOTE: The default handler in Windows NT for a WM_CLOSE message is for the process to exit. Of course, the job could be coded to ignore it and not exit, but eventually the condor_startd will get impatient and hard-kill the job (if that is the policy desired by the administrator).
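The effect of transfer_files at vacate time, as described above, can be summarized as a small decision function. This is only a sketch: the function name `on_vacate` is hypothetical, and it models just the two values named in the text, ONEXIT and ALWAYS.

```python
def on_vacate(transfer_files, changed):
    """Return (next_job_state, files_to_store_in_spool) for a vacated job.

    `changed` lists the files created or modified in the job's temporary
    working directory. Only the ONEXIT and ALWAYS settings are modeled.
    """
    mode = transfer_files.upper()
    if mode == "ONEXIT":
        # Default: nothing is transferred back; the job simply returns to Idle.
        return "Idle", []
    if mode == "ALWAYS":
        # Intermediate files go to $(SPOOL) on the submit machine, then Idle.
        return "Idle", sorted(changed)
    raise ValueError(f"setting not modeled in this sketch: {transfer_files!r}")
```

In both cases the job ends up Idle; the only difference is whether intermediate files are kept for the next run.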
Finally, after the job has left and any files have been transferred back, the condor_starter will delete the temporary working directory, the temporary run account, the WindowStation and the desktop before exiting itself. If the starter should terminate abnormally for some reason, the condor_startd will take it upon itself to clean up the directory, the account, etc. If for some reason the condor_startd should disappear as well (e.g. if the entire machine was power-cycled hard), the condor_startd will clean up the temporary directory(s) and/or account(s) left behind when Condor is restarted at reboot time.
On the execute machine, the user job is run using the access token of an account dynamically created by Condor which has bare-bones access rights and privileges. For instance, if your machines are configured so that only Administrators have write access to C:\WINNT, then certainly no Condor job run on that machine would be able to write anything there. The only files the job should be able to access on the execute machine are files accessible by group Everyone and files in the job's temporary working directory.
On the submit machine, Condor permits the File Transfer mechanism to only read files which the submitting user has access to read, and only write files to which the submitting user has access to write. For example, say only Administrators can write to C:\WINNT on the submit machine, and a user gives the following to condor_submit:
executable = mytrojan.exe
initialdir = c:\winnt
output = explorer.exe
queue
Unless that user is in group Administrators, Condor will not permit
explorer.exe to be overwritten.
If for some reason the submitting user's account disappears between the time condor_submit was run and when the job runs, Condor is not able to check and see if the now-defunct submitting user has read/write access to a given file. In this case, Condor will ensure that group "Everyone" has read or write access to any file the job subsequently tries to read or write. This is in consideration for some network setups, where the user account only exists for as long as the user is logged in.
Condor also provides protection to the job queue. It would be bad if the integrity of the job queue were compromised, because a malicious user could remove other users' jobs or even change what executable a user's job will run. To guard against this, in Condor's default configuration all connections to the condor_schedd (the process which manages the job queue on a given machine) are authenticated using Windows NT's SSPI security layer. The user is then authenticated using the same challenge-response protocol that NT uses to authenticate users to Windows NT file servers. Once authenticated, the only users allowed to edit a job entry in the queue are:
To protect the actual job queue files themselves, the Condor NT installation program will automatically set permissions on the entire Condor release directory so that only Administrators have write access.
Finally, Condor NT Preview has all the IP/Host-based security mechanisms present in the full-blown version of Condor. See section 3.8 for complete information on how to allow/deny access to Condor based upon machine hostname or IP address.
Unix machines and Windows NT machines running Condor can happily co-exist in the same Condor pool without any problems. For now, the only restriction is that jobs submitted on Windows NT must run on Windows NT, and jobs submitted on Unix must run on Unix. You will get this behavior by default, since condor_submit will automatically set a Requirements expression in the job ClassAd stating that the execute machine must have the same architecture and operating system as the submit machine.
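The default platform restriction amounts to comparing two attributes of the submit and execute machines. The sketch below illustrates the idea with plain dictionaries standing in for ClassAds. The function name `default_platform_requirement` is hypothetical, the attribute values in the usage note are illustrative, and the exact expression condor_submit generates is not reproduced here.

```python
def default_platform_requirement(submit_ad):
    """Build a predicate that accepts only machines whose architecture and
    operating system match the submit machine's (dicts stand in for ClassAds)."""
    arch = submit_ad["Arch"]
    opsys = submit_ad["OpSys"]

    def matches(machine_ad):
        # A machine missing either attribute never matches.
        return machine_ad.get("Arch") == arch and machine_ad.get("OpSys") == opsys

    return matches
```

For example, a predicate built from `{"Arch": "INTEL", "OpSys": "WINNT40"}` accepts another machine advertising the same two values and rejects one that advertises a Unix operating system.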
There is absolutely no need to run more than one Condor central manager, even if you have both Unix and NT machines. The Condor central manager itself can run on either Unix or NT; there is no advantage to choosing one over the other. Here at the University of Wisconsin-Madison, for instance, we have hundreds of Unix (Solaris, Linux, Irix, etc.) and Windows NT machines in our Computer Science Department Condor pool. Our central manager is running on Windows NT. All is happy.